Evaluating pre-trained large language models using simulated human inputs for Python code generation


dc.contributor.author Potta, Mukul Paras
dc.contributor.author Mondal, Shouvick
dc.contributor.author Meena, Yogesh Kumar
dc.coverage.spatial United States of America
dc.date.accessioned 2025-07-11T08:30:50Z
dc.date.available 2025-07-11T08:30:50Z
dc.date.issued 2025-06
dc.identifier.citation Potta, Mukul Paras; Mondal, Shouvick and Meena, Yogesh Kumar, "Evaluating pre-trained large language models using simulated human inputs for Python code generation", SSRN, Elsevier, DOI: 10.2139/ssrn.5293364, Jun. 2025.
dc.identifier.uri https://dx.doi.org/10.2139/ssrn.5293364
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/11624
dc.description.abstract Large language models (LLMs) have transformed human-AI interactions, yet research on making their use more accessible is still limited. While some studies address the inclusivity of generated language, less focus has been placed on interaction mechanisms that enhance accessibility, such as speech-to-text and optical character recognition, as well as the errors they may incur. This study investigates these input errors and additionally models interactions from non-native English speakers using data augmentation tools like nlpaug and textattack. We assess the performance of the gpt-4o-mini and o3-mini models by augmenting datasets like HumanEval+ and generating Python code, leveraging the models' strengths. Our results show a statistically significant degradation in performance across various metrics, including pass@k, Pylint score, and code similarity. Text prompts that were summarized or contained speech-to-text artifacts led to performance reductions of up to 40 percentage points in the pass@k metric on the HumanEval+ dataset, with consistent trends across other metrics. These findings highlight the need for developers to evaluate LLM robustness against interaction artifacts before integration with input mechanisms and encourage natural language processing researchers to develop datasets that reflect such artifacts for improved model training.
dc.description.statementofresponsibility by Mukul Paras Potta, Shouvick Mondal and Yogesh Kumar Meena
dc.language.iso en_US
dc.publisher Elsevier
dc.subject Large Language Models
dc.subject Human-computer interaction
dc.subject Code generation
dc.subject Prompt engineering
dc.subject Perturbations
dc.subject Accessibility
dc.title Evaluating pre-trained large language models using simulated human inputs for Python code generation
dc.type Article
dc.relation.journal SSRN
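
As a minimal illustration of the kind of input perturbation described in the abstract, the sketch below uses nlpaug (one of the augmentation tools named above) to inject simulated OCR and keyboard noise into a coding prompt. The prompt text and the specific choice of augmenters are illustrative assumptions for this record, not the authors' exact pipeline; OcrAug and KeyboardAug are standard nlpaug character-level augmenters.

    # Sketch: simulate noisy human inputs for an LLM coding prompt with nlpaug.
    # Assumption: nlpaug is installed (pip install nlpaug); the prompt is illustrative.
    import nlpaug.augmenter.char as nac

    prompt = ("Write a Python function that returns the longest common "
              "prefix of a list of strings.")

    # OcrAug injects optical-character-recognition artifacts (e.g. '0' for 'o').
    # KeyboardAug injects typing slips such as adjacent-key substitutions.
    for aug in (nac.OcrAug(), nac.KeyboardAug()):
        noisy = aug.augment(prompt)
        # Recent nlpaug versions return a list; older ones return a string.
        noisy = noisy[0] if isinstance(noisy, list) else noisy
        print(f"{aug.__class__.__name__}: {noisy}")

The perturbed prompts produced this way could then be sent to a code-generation model and scored with metrics such as pass@k or Pylint, mirroring the evaluation setup the abstract outlines.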

