SciDQA: a deep reading comprehension dataset over scientific papers

dc.contributor.author	Singh, Shruti
dc.contributor.author	Sarkar, Nandan
dc.contributor.author	Cohan, Arman
dc.coverage.spatial	United States of America
dc.date.accessioned	2025-02-28T05:26:26Z
dc.date.available	2025-02-28T05:26:26Z
dc.date.issued	2024-11-12
dc.identifier.citation	Singh, Shruti; Sarkar, Nandan and Cohan, Arman, "SciDQA: a deep reading comprehension dataset over scientific papers", in the Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, US, Nov. 12-16, 2024.
dc.identifier.uri	https://doi.org/10.18653/v1/2024.emnlp-main.1163
dc.identifier.uri	https://repository.iitgn.ac.in/handle/123456789/11068
dc.description.abstract	Scientific literature is typically dense, requiring significant background knowledge and deep comprehension for effective engagement. We introduce SciDQA, a new dataset for reading comprehension that challenges language models to deeply understand scientific articles, consisting of 2,937 QA pairs. Unlike other scientific QA datasets, SciDQA sources questions from peer reviews by domain experts and answers by paper authors, ensuring a thorough examination of the literature. We enhance the dataset’s quality through a process that carefully decontextualizes the content, tracks the source document across different versions, and incorporates a bibliography for multi-document question-answering. Questions in SciDQA necessitate reasoning across figures, tables, equations, appendices, and supplementary materials, and require multi-document reasoning. We evaluate several open-source and proprietary LLMs across various configurations to explore their capabilities in generating relevant and factual responses, as opposed to simple review memorization. Our comprehensive evaluation, based on metrics for surface-level and semantic similarity, highlights notable performance discrepancies. SciDQA represents a rigorously curated, naturally derived scientific QA dataset, designed to facilitate research on complex reasoning within the domain of question answering for scientific texts.
dc.description.statementofresponsibility	by Shruti Singh, Nandan Sarkar and Arman Cohan
dc.language.iso	en_US
dc.publisher	Association for Computational Linguistics
dc.title	SciDQA: a deep reading comprehension dataset over scientific papers
dc.type	Conference Paper
dc.relation.journal	Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

This item appears in the following Collection(s)

Conference Papers [286]
