
A groundbreaking reproducibility study of 29 climate-related NLP datasets reveals that most tasks can be solved with surface-level keyword patterns rather than deep language understanding, and that 96% of the datasets contain annotation issues that compromise evaluation reliability.

Benchmarking the Benchmarks: Critical Findings on Climate-Related NLP Dataset Quality

As climate change becomes increasingly urgent, the NLP community has produced more than 60 works analyzing climate-related text, from detecting climate discourse to identifying greenwashing in corporate communications. A new reproducibility study published at ACL 2025 reveals a troubling reality: 96% of climate-related NLP datasets contain annotation issues, and simple keyword matching performs nearly as well as state-of-the-art AI models.

The paper, “Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks” by Tom Calamai (Amundi, Télécom Paris, Inria), Oana Balalau (Inria, Fonica), and Fabian M. Suchanek (Télécom Paris), presents a comprehensive analysis of 8 tasks across 29 datasets, testing 6 different models. Their findings challenge assumptions about progress in climate-related NLP and raise fundamental questions about how we evaluate AI tools for climate action.

The Core Problem: Simplicity Masks Real Capabilities

In simple terms: Most climate-related NLP tasks can be solved using simple word-frequency patterns (TF-IDF), which means the datasets are too easy. When even basic keyword matching performs nearly as well as advanced AI models, we can’t tell which models are actually better at understanding language.

The study’s most striking finding is that TF-IDF with logistic regression performs nearly as well as fine-tuned transformer models and zero-shot large language models (LLMs) across most datasets. The average F1-scores tell the story:

  • TF-IDF baseline: 69.18%
  • Fine-tuned DistilRoBERTa: 74.73%
  • Fine-tuned Longformer: 74.04%
  • GPT-4o-mini (zero-shot): 69.78%
  • Llama 3.1 70B (zero-shot): 67.39%

The performance gap between simple keyword matching and state-of-the-art models is alarmingly small—often less than 5 percentage points. This suggests that most climate-related NLP datasets consist of examples that can be predicted using word-frequency patterns alone, rather than requiring deeper semantic or contextual understanding.
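To make the baseline concrete, here is a minimal sketch of a TF-IDF + logistic regression classifier of the kind the study compares against, built with scikit-learn. It is our illustration, not the authors’ code, and the texts and labels are toy placeholders rather than examples from any of the 29 datasets.

```python
# A minimal TF-IDF + logistic regression baseline (our sketch, not the authors' code).
# The texts and labels below are toy placeholders for a climate topic-detection task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

train_texts = [
    "Rising sea levels threaten coastal infrastructure",
    "We pledge to cut our carbon emissions by 2030",
    "Quarterly profits grew four percent year over year",
    "The new store opens downtown next month",
]
train_labels = ["climate", "climate", "not_climate", "not_climate"]

test_texts = [
    "Extreme heat waves are becoming more frequent",
    "Our loyalty program now includes free shipping",
]
test_labels = ["climate", "not_climate"]

# Word-frequency features only: no context, no semantics.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(train_texts, train_labels)
print("macro F1:", f1_score(test_labels, baseline.predict(test_texts), average="macro"))
```

A pipeline like this sees only word frequencies, with no context or world knowledge; when it lands within a few points of a fine-tuned transformer, the dataset is unlikely to be probing much beyond vocabulary.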

The Annotation Crisis: 96% of Datasets Have Issues

In simple terms: When researchers manually checked errors from AI models, they found that 16.6% were actually mistakes in the dataset labels themselves, and another 38.8% were ambiguous cases where multiple labels could reasonably apply. This means the datasets are often measuring annotation quality rather than model capability.

Perhaps the study’s most concerning finding is the prevalence of annotation issues. The researchers manually analyzed over 500 errors from GPT-4o-mini predictions and found:

  • 96% of datasets contain annotation issues
  • 16.6% of sampled errors were clear annotation mistakes (the model was actually correct)
  • 38.8% were ambiguous examples where multiple labels could reasonably apply or it was unclear which label to assign
  • 44.6% were genuine model errors

This means that more than half (55.4%) of the “errors” attributed to models may actually reflect problems with the datasets themselves. The study identifies three main categories of annotation problems:

1. Ambiguous Annotation Guidelines

Many datasets suffer from unclear or under-specified annotation guidelines that lead to inconsistent labeling. For example:

  • ClimateBUG: Guidelines say “Sustainability not related to the environment (i.e. sustainable profits)” should be classified as “Not Climate,” but many sentences don’t make clear whether the sustainability in question is environmental or economic
  • ClimateFEVER: It’s unclear whether “SUPPORTING” means evidence is sufficient to entail the claim, or merely aligned with it
  • ClimateStance: Guidelines don’t specify how to annotate tweets that oppose climate policies while acknowledging climate urgency

2. Implicit and Context-Dependent Examples

Many errors stem from statements that are implicitly related to climate change but lack explicit keywords. The study found that models struggle with:

  • Indirect references (e.g., energy-related statements that imply climate relevance)
  • Out-of-context statements (12.5% of errors were truncated or missing context)
  • Ambiguous cases where the link to climate is implied but not stated

3. Weak Labels and Non-Exhaustive Annotations

Several datasets use automatically generated labels rather than human annotation, leading to:

  • Evolving label definitions over time
  • Non-exhaustive annotations (e.g., LobbyMap identifies some policy stances but not all)
  • Inconsistent labeling that doesn’t follow strict guidelines

Key Findings: What the Data Reveals

Finding 1: Tasks Are Too Simple

In simple terms: If a simple keyword-matching algorithm can solve most of your test cases, your test isn’t measuring language understanding—it’s measuring whether models can recognize specific words. This compresses all model performances into a narrow range, hiding real differences in capability.

The study systematically demonstrates that tasks 1-5 (climate topic detection, thematic analysis, risk classification, green claim detection, and green claim characteristics) are “highly based on vocabulary.” When excluding weakly labeled datasets, fine-tuned models perform well—but so does TF-IDF, showing these tasks can be solved using word-frequency patterns alone.

Some datasets remain challenging (TCFD recommendations, ClimateEng, climate sentiment, and specificity), but even these could likely be improved with clearer annotation guidelines.

Finding 2: Fine-Tuned Models May Overfit on Noise

In simple terms: Fine-tuned models might perform better not because they understand language better, but because they’ve learned to match the specific patterns—including mistakes and biases—in the training data. This makes them look good on the test set but doesn’t mean they’ll work well in real applications.

The manual error analysis reveals a critical insight: fine-tuned models may outperform zero-shot LLMs not because they generalize better, but because they overfit on annotation noise, bias, or unclear guidelines.

The study notes that fine-tuned models can pick up patterns that were “informally discussed between the annotators, but that are not codified in the guidelines.” This means their performance gains may reflect learning dataset-specific quirks rather than genuine improvements in language understanding.

Finding 3: Stance Detection Is Inherently Difficult

In simple terms: Determining whether evidence supports or refutes a claim about climate change is genuinely hard—even for humans. The study found that human annotators only agreed with each other 33-64% of the time on some stance detection tasks, which explains why models struggle too.

Task 6 (green stance detection) shows that models have predictive power, but performance remains relatively low (F1-scores under 75%). However, the study notes this is expected: the task is inherently difficult, as shown by:

  • Low human performance on ClimateFEVER (average macro F1-score of 60% among annotators, ranging from 54% to 69%)
  • Moderate inter-annotator agreement (Krippendorff’s α = 0.54-0.64) reported in previous studies
  • The fact that GPT-4o-mini’s performance (60.0% F1-score) falls within the range of human annotator performance

This suggests that stance detection is a genuinely challenging task that requires nuanced reasoning, not just a problem with the models.
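For readers who want to run this kind of agreement analysis on their own annotations, here is a rough sketch combining scikit-learn’s macro F1 with the third-party krippendorff package (installable via pip install krippendorff), which we assume here; the stance labels are toy placeholders, and this is our illustration rather than code from the paper’s repository.

```python
# Per-annotator macro F1 against a reference labelling, plus Krippendorff's alpha
# across annotators (our sketch; assumes the third-party `krippendorff` package).
import numpy as np
import krippendorff
from sklearn.metrics import f1_score

# Toy stance labels: 0 = refutes, 1 = supports, 2 = not enough information.
reference   = [1, 0, 2, 1, 0, 2, 1, 1]
annotator_a = [1, 0, 2, 2, 0, 2, 1, 0]
annotator_b = [1, 0, 1, 1, 0, 2, 2, 1]

for name, votes in [("A", annotator_a), ("B", annotator_b)]:
    print(f"annotator {name} vs reference: macro F1 = "
          f"{f1_score(reference, votes, average='macro'):.2f}")

# Rows = annotators, columns = items; np.nan would mark items an annotator skipped.
reliability_data = np.array([annotator_a, annotator_b], dtype=float)
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=reliability_data,
                         level_of_measurement="nominal"))
```

Low alpha and a wide spread in per-annotator F1 are exactly the signals the paper uses to argue that stance detection is hard for humans, not just for models.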

Finding 4: Deceptive Technique Classification Is Highly Subjective

In simple terms: Identifying logical fallacies and deceptive techniques in climate arguments is extremely subjective. Different experts might classify the same argument differently, which makes it hard to create reliable training data or evaluate model performance.

Task 8 (classification of deceptive techniques) reveals the challenges of subjective annotation. On LogicClimate, which classifies fallacies in climate arguments:

  • Fine-tuned models performed poorly (F1-scores of 9-15%), underperforming even random classifiers
  • Zero-shot approaches reached only 27% F1-score
  • The task requires expertise and is highly subjective, as noted by Helwe et al. (2024)

The study explains that fallacies have “different granularity and can easily overlap,” making consistent annotation difficult. The “intentional fallacy” label, for example, could theoretically fit most cases since fallacies are by definition flawed reasoning.
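To ground the “random classifier” comparison, here is a small sketch of the chance-level baselines scikit-learn provides through DummyClassifier; the fallacy names are placeholders, not the LogicClimate taxonomy, and this is our illustration rather than the paper’s evaluation code.

```python
# Chance-level baselines for a multi-class task (our sketch with placeholder labels).
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

y_train = ["ad hominem", "cherry picking", "false cause", "cherry picking"] * 10
y_test  = ["ad hominem", "false cause", "cherry picking", "ad hominem"] * 10
X_train = [[0]] * len(y_train)  # features are ignored by the dummy strategies
X_test  = [[0]] * len(y_test)

for strategy in ("uniform", "stratified", "most_frequent"):
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X_train, y_train)
    score = f1_score(y_test, dummy.predict(X_test), average="macro")
    print(f"{strategy}: macro F1 = {score:.3f}")
```

Scoring below baselines like these on a multi-class task, as the fine-tuned models do on LogicClimate, suggests the models learned little that transfers to the test set.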

The Broader Context

In simple terms: This study is part of a larger movement in machine learning to improve reproducibility and dataset quality. Climate-related NLP faces the same challenges as other NLP domains—addressing them is essential for meaningful progress.

This research connects to broader concerns about reproducibility in machine learning. Previous work has identified 41 design choices that can affect reproducibility, including hyperparameter tuning, dataset preprocessing, annotation quality, insufficient baselines, and lack of confidence intervals.

The climate-related NLP community is not alone in facing these challenges. However, addressing them is particularly urgent given the need for reliable tools to combat climate misinformation, detect greenwashing, and analyze public discourse around climate action.

What This Means: Implications and Recommendations

In simple terms: These findings don’t mean climate-related NLP is impossible. They mean we need better datasets that require real language understanding, clearer annotation guidelines, and more rigorous evaluation practices.

For Researchers and Practitioners

If simple baselines perform nearly as well as advanced models, and annotation issues affect more than half of apparent errors, current benchmarks may be compressing performance into a narrow band that masks real differences in model capability. This has immediate practical consequences:

For climate communication and fact-checking:

  • AI tools may appear to work well in testing but fail in real-world applications
  • Performance comparisons between different approaches may be misleading
  • It’s unclear whether improvements reflect genuine progress or overfitting to dataset-specific patterns

For the research community:

  • It becomes difficult to identify which models are genuinely better
  • It’s hard to track meaningful progress in the field
  • It’s challenging to make informed decisions about which approaches to use in practice

Five Key Recommendations

The study provides actionable guidance for building more reliable climate NLP systems:

1. Design More Challenging Datasets

  • Go beyond keywords—require inference and contextual reasoning
  • Include ambiguous or implicit statements that test genuine understanding
  • Ensure simple baselines struggle, indicating real language comprehension is needed

2. Prioritize Quality Over Quantity

  • Focus on clean, high-quality test sets rather than large weakly labeled datasets
  • Small annotation errors significantly impact evaluation when models perform above 90%
  • Poor LLM performance on weakly labeled datasets may reflect structural flaws, not model limitations

3. Create Precise Annotation Guidelines

  • Clearly define all terms and concepts
  • Address edge cases and ambiguities explicitly
  • Provide concrete examples and avoid vague phrasing
  • Avoid automatically labeled datasets that lack well-defined annotation procedures

4. Always Include Simple Baselines

  • Use TF-IDF or similar simple methods to assess task difficulty
  • Without such baselines, it’s impossible to gauge whether a task requires genuine language understanding
  • If simple methods work well, the dataset is likely too easy

5. Quantify and Report Annotation Quality

  • Estimate annotation error rates through manual review or inter-annotator agreement analysis
  • Define performance margins below which differences may fall within dataset noise
  • Compute meaningful confidence intervals that account for dataset sampling and variability
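As one way to act on the last recommendation, here is a rough sketch of a bootstrap confidence interval over a fixed test set (our illustration, not the paper’s code); y_true and y_pred are toy placeholders for a model’s predictions on a labelled test set.

```python
# Bootstrap a 95% confidence interval for macro F1 by resampling the test set.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Toy gold labels and predictions standing in for a real evaluation.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 0] * 20)
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0] * 20)

scores = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))

low, high = np.percentile(scores, [2.5, 97.5])
point = f1_score(y_true, y_pred, average="macro")
print(f"macro F1 = {point:.3f} (95% bootstrap CI: {low:.3f} to {high:.3f})")
```

If two systems’ intervals overlap, or their gap is smaller than the estimated annotation error rate, the comparison should be reported as inconclusive rather than as a ranking.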

Access the Research

The full paper “Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks” by Tom Calamai, Oana Balalau, and Fabian M. Suchanek is available at ACL Anthology. The research was presented at Findings of the Association for Computational Linguistics: ACL 2025 (July 27 - August 1, 2025).

Code and Datasets

The authors have made all cleaned-up datasets, baselines, and a Python library to run comparisons publicly available:

  • GitHub Repository (acl_climateNLPtoolbox): Complete codebase including dataset preprocessing scripts, model evaluation code, and tools for running all experiments. The repository contains:
    • Preprocessed and cleaned datasets for all 29 datasets
    • Baseline implementations (TF-IDF, fine-tuned transformers)
    • LLM evaluation scripts (GPT-4o-mini, Llama 3.1)
    • Analysis tools and visualization scripts
    • Documentation for reproducing all results from the paper

Conclusion: Building Better Foundations for Climate AI

The study’s title—“Benchmarking the Benchmarks”—reflects its meta-analytical approach. Rather than proposing new models or techniques, it evaluates the evaluation methods themselves, revealing fundamental issues that affect the entire field.

The climate crisis demands effective tools for analyzing discourse, detecting misinformation, and understanding public sentiment. But these tools must be built on solid foundations. This research shows that improving dataset quality and evaluation rigor is not just an academic concern—it’s essential for creating AI systems that can genuinely contribute to climate action.

By identifying these problems clearly and providing actionable recommendations, the authors have created a foundation for building better datasets, more reliable evaluations, and ultimately more trustworthy climate NLP tools.


Note: This analysis reflects our interpretation of the research findings. The paper presents a comprehensive reproducibility study with detailed error analysis across 29 datasets. For complete technical details, statistical analyses, and dataset-specific findings, readers should consult the original paper and the accompanying code repository.