Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation


dc.contributor.author Yadav, Ankit
dc.contributor.author Singh, Mayank
dc.coverage.spatial United States of America
dc.date.accessioned 2024-01-17T15:23:10Z
dc.date.available 2024-01-17T15:23:10Z
dc.date.issued 2024-01
dc.identifier.citation Yadav, Ankit and Singh, Mayank, "Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation", arXiv, Cornell University Library, arXiv:2401.03855, DOI: 10.48550/arXiv.2401.03855, Jan. 2024.
dc.identifier.issn 2331-8422
dc.identifier.uri https://doi.org/10.48550/arXiv.2401.03855
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/9677
dc.description.abstract Motivated by the increasing popularity of code generation from human descriptions using large language models (LLMs), several benchmarks have been proposed to assess the capabilities of existing and emerging models. This study presents a large-scale human evaluation of HumanEval and MBPP, two widely used benchmarks for Python code generation, focusing on their diversity and difficulty. Our findings reveal a significant bias towards a limited number of programming concepts, with negligible or no representation of most concepts. Additionally, we identify a concerningly high proportion of easy programming questions, potentially leading to an overestimation of model performance on code generation tasks.
dc.description.statementofresponsibility by Ankit Yadav and Mayank Singh
dc.language.iso en_US
dc.publisher Cornell University Library
dc.title Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation
dc.type Article
dc.relation.journal arXiv

