Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation


dc.contributor.author Yadav, Ankit
dc.contributor.author Singh, Mayank
dc.coverage.spatial United States of America
dc.date.accessioned 2024-01-17T15:23:10Z
dc.date.available 2024-01-17T15:23:10Z
dc.date.issued 2024-01
dc.identifier.citation Yadav, Ankit and Singh, Mayank, "Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation", arXiv, Cornell University Library, arXiv:2401.03855, DOI: 10.48550/arXiv.2401.03855, Jan. 2024.
dc.identifier.issn 2331-8422
dc.identifier.uri https://doi.org/10.48550/arXiv.2401.03855
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/9677
dc.description.abstract Motivated by the increasing popularity of code generation from human descriptions using large language models (LLMs), several benchmarks have been proposed to assess the capabilities of existing and emerging models. This study presents a large-scale human evaluation of HumanEval and MBPP, two widely used benchmarks for Python code generation, focusing on their diversity and difficulty. Our findings reveal a significant bias towards a limited number of programming concepts, with negligible or no representation of most concepts. Additionally, we identify a concerningly high proportion of easy programming questions, potentially leading to an overestimation of model performance on code generation tasks.
dc.description.statementofresponsibility by Ankit Yadav and Mayank Singh
dc.language.iso en_US
dc.publisher Cornell University Library
dc.title Boldly going where no benchmark has gone before: exposing bias and shortcomings in code generation evaluation
dc.type Article
dc.relation.journal arXiv

