RESEARCH PAPERS
Benchmarking the Benchmarks: Critical Findings on Climate-Related NLP Dataset Quality
A reproducibility study of 29 climate-related NLP datasets finds that most tasks rely on surface-level keyword patterns rather than deep understanding, and that 96% of the datasets contain annotation issues that compromise evaluation reliability.