Abstract:
Large language models (LLMs) have transformed human-AI interactions, yet research on making their use more accessible is still limited. While some studies address the inclusivity of generated language, less focus has been placed on interaction mechanisms that enhance accessibility, such as speech-to-text and optical character recognition, and on the errors these mechanisms may introduce. This study investigates these input errors and additionally models interactions from non-native English speakers using data augmentation tools such as nlpaug and textattack. We assess the gpt-4o-mini and o3-mini models by applying these augmentations to datasets such as HumanEval+ and having the models generate Python code, a task that plays to their strengths. Our results show a statistically significant degradation in performance across several metrics, including pass@k, Pylint score, and code similarity. Prompts that were summarized or contained speech-to-text artifacts led to performance reductions of up to 40 percentage points in the pass@k metric on the HumanEval+ dataset, with consistent trends across the other metrics. These findings highlight the need for developers to evaluate LLM robustness against interaction artifacts before integrating such input mechanisms, and they encourage natural language processing researchers to develop datasets that reflect these artifacts for improved model training.
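
To make the augmentation step concrete, the sketch below shows one way a HumanEval+-style prompt could be perturbed with nlpaug character-level augmenters to mimic OCR and noisy-typing artifacts. This is an illustrative assumption, not the paper's exact pipeline; the example prompt, the choice of OcrAug and KeyboardAug, and the decision to perturb the full prompt text are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' exact pipeline): perturb a
# HumanEval+-style prompt with nlpaug to simulate OCR and typing artifacts.
import nlpaug.augmenter.char as nac

prompt = (
    "def has_close_elements(numbers, threshold):\n"
    '    """Check if any two numbers in the list are closer than the threshold."""\n'
)

# OcrAug substitutes visually confusable characters (e.g. 'o' -> '0'),
# approximating optical-character-recognition input errors.
ocr_aug = nac.OcrAug()

# KeyboardAug swaps in adjacent-key characters, a rough stand-in for
# noisy typed input.
kb_aug = nac.KeyboardAug()

# Recent nlpaug versions return a list of augmented strings.
noisy_prompt = ocr_aug.augment(prompt)[0]
noisy_prompt = kb_aug.augment(noisy_prompt)[0]

print(noisy_prompt)  # corrupted prompt, ready to be sent to the model under test
```

The perturbed prompt can then be submitted to the model and the generated code scored with the same metrics named above (pass@k, Pylint score, code similarity) to quantify the robustness gap relative to the clean prompt.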