Breaking mBad! supervised fine-tuning for cross-lingual detoxification

dc.contributor.author Beniwal, Himanshu
dc.contributor.author Kim, Youngwoo
dc.contributor.author Sap, Maarten
dc.contributor.author Dan, Soham
dc.contributor.author Hartvigsen, Thomas
dc.coverage.spatial United States of America
dc.date.accessioned 2025-05-29T07:58:02Z
dc.date.available 2025-05-29T07:58:02Z
dc.date.issued 2025-05
dc.identifier.citation Beniwal, Himanshu; Kim, Youngwoo; Sap, Maarten; Dan, Soham and Hartvigsen, Thomas, "Breaking mBad! supervised fine-tuning for cross-lingual detoxification", arXiv, Cornell University Library, arXiv:2505.16722, DOI: 10.48550/arXiv.2505.16722, May 2025.
dc.identifier.uri https://doi.org/10.48550/arXiv.2505.16722
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/11468
dc.description.abstract As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a paradigm that mitigates toxicity by enabling detoxification capabilities to transfer between high- and low-resource languages across different script families. We analyze the effectiveness of cross-lingual detoxification across 504 experimental settings, evaluating toxicity reduction in cross-distribution scenarios with limited data, and we investigate how mitigation affects model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
dc.description.statementofresponsibility by Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan and Thomas Hartvigsen
dc.language.iso en_US
dc.publisher Cornell University Library
dc.title Breaking mBad! supervised fine-tuning for cross-lingual detoxification
dc.type Article
dc.relation.journal arXiv


