dc.contributor.author: Yadav, Ankit
dc.contributor.author: Beniwal, Himanshu
dc.contributor.author: Singh, Mayank
dc.contributor.other: Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
dc.coverage.spatial: United States of America
dc.date.accessioned: 2024-11-20T13:29:59Z
dc.date.available: 2024-11-20T13:29:59Z
dc.date.issued: 2024-11-12
dc.identifier.citation: Yadav, Ankit; Beniwal, Himanshu and Singh, Mayank, "PythonSaga: redefining the benchmark to evaluate code generating LLMs", in the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, US, Nov. 12-16, 2024.
dc.identifier.uri: https://aclanthology.org/2024.findings-emnlp.996
dc.identifier.uri: https://repository.iitgn.ac.in/handle/123456789/10790
dc.description.abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimations. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and data set are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga).
dc.description.statementofresponsibility: by Ankit Yadav, Himanshu Beniwal and Mayank Singh
dc.language.iso: en_US
dc.publisher: Association for Computational Linguistics
dc.title: PythonSaga: redefining the benchmark to evaluate code generating LLMs
dc.type: Conference Paper