Abstract:
Large language models (LLMs) have transformed human-AI interactions, yet research on making their use more accessible is still limited. While some studies address the inclusivity of generated language, less focus has been placed on interaction mechanisms that enhance accessibility, such as speech-to-text and optical character recognition, and on the errors these mechanisms may introduce. This study investigates these input errors and additionally models interactions from non-native English speakers using data augmentation tools such as nlpaug and textattack. We assess the gpt-4o-mini and o3-mini models by applying these augmentations to datasets such as HumanEval+ and having the models generate Python code, a task that plays to their strengths. Our results show a statistically significant degradation in performance across several metrics, including pass@k, Pylint score, and code similarity. Prompts that were summarized or contained speech-to-text artifacts led to performance reductions of up to 40 percentage points in the pass@k metric on the HumanEval+ dataset, with consistent trends across the other metrics. These findings highlight the need for developers to evaluate LLM robustness against interaction artifacts before integrating such input mechanisms, and they encourage natural language processing researchers to develop datasets that reflect these artifacts for improved model training.
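
To make the augmentation step concrete, the sketch below shows one way a HumanEval+-style prompt could be perturbed with nlpaug character-level augmenters to mimic OCR and noisy-typing artifacts. This is an illustrative assumption, not the paper's exact pipeline; the example prompt, the choice of OcrAug and KeyboardAug, and the decision to perturb the full prompt text are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' exact pipeline): perturb a
# HumanEval+-style prompt with nlpaug to simulate OCR and typing artifacts.
import nlpaug.augmenter.char as nac

prompt = (
    "def has_close_elements(numbers, threshold):\n"
    '    """Check if any two numbers in the list are closer than the threshold."""\n'
)

# OcrAug substitutes visually confusable characters (e.g. 'o' -> '0'),
# approximating optical-character-recognition input errors.
ocr_aug = nac.OcrAug()

# KeyboardAug swaps in adjacent-key characters, a rough stand-in for
# noisy typed input.
kb_aug = nac.KeyboardAug()

# Recent nlpaug versions return a list of augmented strings.
noisy_prompt = ocr_aug.augment(prompt)[0]
noisy_prompt = kb_aug.augment(noisy_prompt)[0]

print(noisy_prompt)  # corrupted prompt, ready to be sent to the model under test
```

The perturbed prompt can then be submitted to the model and the generated code scored with the same metrics named above (pass@k, Pylint score, code similarity) to quantify the robustness gap relative to the clean prompt.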