Evaluating pre-trained large language models using simulated human inputs for Python code generation


dc.contributor.author Potta, Mukul Paras
dc.contributor.author Mondal, Shouvick
dc.contributor.author Meena, Yogesh Kumar
dc.coverage.spatial United States of America
dc.date.accessioned 2025-07-11T08:30:50Z
dc.date.available 2025-07-11T08:30:50Z
dc.date.issued 2025-06
dc.identifier.citation Potta, Mukul Paras; Mondal, Shouvick and Meena, Yogesh Kumar, "Evaluating pre-trained large language models using simulated human inputs for Python code generation", SSRN, Elsevier, DOI: 10.2139/ssrn.5293364, Jun. 2025.
dc.identifier.uri https://dx.doi.org/10.2139/ssrn.5293364
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/11624
dc.description.abstract Large language models (LLMs) have transformed human-AI interactions, yet research on making their use more accessible is still limited. While some studies address the inclusivity of generated language, less focus has been placed on interaction mechanisms that enhance accessibility, such as speech-to-text and optical character recognition, as well as the errors they may incur. This study investigates these input errors and additionally models interactions from non-native English speakers using data augmentation tools like nlpaug and textattack. We assess the performance of the gpt-4o-mini and o3-mini models by augmenting datasets like HumanEval+ and generating Python code, leveraging the models' strengths. Our results show a statistically significant degradation in performance across various metrics, including pass@k, Pylint score, and code similarity. Text prompts that were summarized or contained speech-to-text artifacts led to performance reductions of up to 40 percentage points in the pass@k metric on the HumanEval+ dataset, with consistent trends across other metrics. These findings highlight the need for developers to evaluate LLM robustness against interaction artifacts before integration with input mechanisms and encourage natural language processing researchers to develop datasets that reflect such artifacts for improved model training.
dc.description.statementofresponsibility by Mukul Paras Potta, Shouvick Mondal and Yogesh Kumar Meena
dc.language.iso en_US
dc.publisher Elsevier
dc.subject Large Language Models
dc.subject Human-computer interaction
dc.subject Code generation
dc.subject Prompt engineering
dc.subject Perturbations
dc.subject Accessibility
dc.title Evaluating pre-trained large language models using simulated human inputs for Python code generation
dc.type Article
dc.relation.journal SSRN
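
As a minimal illustration of the kind of input perturbation described in the abstract, the sketch below uses nlpaug (one of the augmentation tools named above) to inject simulated OCR and keyboard noise into a coding prompt. The prompt text and the specific choice of augmenters are illustrative assumptions for this record, not the authors' exact pipeline; OcrAug and KeyboardAug are standard nlpaug character-level augmenters.

    # Sketch: simulate noisy human inputs for an LLM coding prompt with nlpaug.
    # Assumption: nlpaug is installed (pip install nlpaug); the prompt is illustrative.
    import nlpaug.augmenter.char as nac

    prompt = ("Write a Python function that returns the longest common "
              "prefix of a list of strings.")

    # OcrAug injects optical-character-recognition artifacts (e.g. '0' for 'o').
    # KeyboardAug injects typing slips such as adjacent-key substitutions.
    for aug in (nac.OcrAug(), nac.KeyboardAug()):
        noisy = aug.augment(prompt)
        # Recent nlpaug versions return a list; older ones return a string.
        noisy = noisy[0] if isinstance(noisy, list) else noisy
        print(f"{aug.__class__.__name__}: {noisy}")

The perturbed prompts produced this way could then be sent to a code-generation model and scored with metrics such as pass@k or Pylint, mirroring the evaluation setup the abstract outlines.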

