dc.contributor.author: Yadav, Ankit
dc.contributor.author: Beniwal, Himanshu
dc.contributor.author: Singh, Mayank
dc.contributor.other: Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)
dc.coverage.spatial: United States of America
dc.date.accessioned: 2024-11-20T13:29:59Z
dc.date.available: 2024-11-20T13:29:59Z
dc.date.issued: 2024-11-12
dc.identifier.citation: Yadav, Ankit; Beniwal, Himanshu and Singh, Mayank, "PythonSaga: redefining the benchmark to evaluate code generating LLMs", in the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, US, Nov. 12-16, 2024.
dc.identifier.uri: https://aclanthology.org/2024.findings-emnlp.996
dc.identifier.uri: https://repository.iitgn.ac.in/handle/123456789/10790
dc.description.abstract: Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of *HumanEval* and *MBPP*, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimations. To address these limitations, we propose a novel benchmark, *PythonSaga*, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and data set are openly available to the NLP community at this [URL](https://github.com/PythonSaga/PythonSaga).
dc.description.statementofresponsibility: by Ankit Yadav, Himanshu Beniwal and Mayank Singh
dc.language.iso: en_US
dc.publisher: Association for Computational Linguistics
dc.title: PythonSaga: redefining the benchmark to evaluate code generating LLMs
dc.type: Conference Paper