COMPETITIONS

The first ClimateCheck shared task at SDP 2025 challenged participants to build systems for scientific fact-checking of climate claims, with tasks focusing on abstract retrieval and claim verification against a corpus of 394,269 climate science publications.

ClimateCheck@SDP 2025: Advancing Scientific Fact-Checking of Climate Claims


The first ClimateCheck competition challenged researchers to build systems that could verify climate claims from social media against a corpus of 394,000 scientific publications. Held at SDP 2025 in Vienna, the shared task attracted 10 teams who competed on two subtasks: retrieving relevant abstracts and verifying whether they support or refute the claims.

The Competition

Subtask I: Abstract Retrieval — Find the most relevant scientific abstracts for each claim. The winning system used a hybrid approach combining BM25 sparse retrieval, fine-tuned dense embeddings (e5-large-v2-climatecheck), and neural reranking. Top performers achieved 0.574 Recall@10.
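As a rough sketch of what such a hybrid pipeline can look like (not the winning team's code), the snippet below fuses BM25 scores with dense-embedding similarity and reranks the pooled candidates with a cross-encoder. The model names and the fusion weight are assumptions: the public intfloat/e5-large-v2 and a generic MS MARCO cross-encoder stand in for the fine-tuned climatecheck models.

```python
# Hypothetical hybrid retrieval sketch: BM25 + dense embeddings + cross-encoder reranking.
# Model names and the fusion weight are illustrative assumptions, not the winning system's settings.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

abstracts = [
    "Global mean surface temperature has risen by about 1.1 C since pre-industrial times.",
    "Arctic sea ice extent shows a strong declining trend in satellite observations.",
    "Ocean acidification is driven by the uptake of anthropogenic CO2.",
]

# Sparse retrieval: BM25 over whitespace-tokenized abstracts.
bm25 = BM25Okapi([a.lower().split() for a in abstracts])

# Dense retrieval: e5 models expect "query: ..." / "passage: ..." prefixes.
encoder = SentenceTransformer("intfloat/e5-large-v2")  # stand-in for the fine-tuned climatecheck model
doc_emb = encoder.encode([f"passage: {a}" for a in abstracts], normalize_embeddings=True)

# Reranker: a generic MS MARCO cross-encoder, used here as a placeholder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(claim: str, k: int = 10, alpha: float = 0.5) -> list[int]:
    """Return indices of the top-k abstracts for a claim using fused sparse + dense scores."""
    sparse = np.array(bm25.get_scores(claim.lower().split()))
    dense = doc_emb @ encoder.encode(f"query: {claim}", normalize_embeddings=True)

    def norm(x):
        # Min-max normalize each score list so neither retriever dominates the mix.
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    fused = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    candidates = np.argsort(-fused)[: max(k * 5, k)]  # widen the pool before reranking
    rerank_scores = reranker.predict([(claim, abstracts[i]) for i in candidates])
    return [int(candidates[j]) for j in np.argsort(-rerank_scores)[:k]]

print(retrieve("Arctic ice is disappearing", k=2))
```

In practice the dense index over all 394,269 abstracts would be precomputed and stored (e.g., with FAISS) rather than encoded in memory as in this toy example.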

Subtask II: Claim Verification — Predict whether abstracts support, refute, or provide insufficient information about claims. The best system scored 0.716 F1 (precision and recall both 0.717), and its final competition score, which also factors in retrieval quality, reached 1.291.
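Framed as a sequence-pair classification problem, verification can be prototyped in a handful of lines. The sketch below assumes a BERT-style model fine-tuned on supports / refutes / insufficient-information labels; the checkpoint name is purely hypothetical, not one of the submitted systems.

```python
# Minimal sketch of claim verification as 3-way sequence-pair classification.
# "my-org/climatecheck-verifier" is a hypothetical checkpoint name, not a released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["supports", "refutes", "insufficient information"]

tokenizer = AutoTokenizer.from_pretrained("my-org/climatecheck-verifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "my-org/climatecheck-verifier", num_labels=len(LABELS)
)
model.eval()

def verify(claim: str, abstract: str) -> str:
    """Predict whether the abstract supports, refutes, or says too little about the claim."""
    inputs = tokenizer(claim, abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(verify(
    "CO2 has no effect on global temperatures",
    "We find that rising atmospheric CO2 is the dominant driver of observed warming.",
))
```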

Key Findings

The competition revealed that hybrid retrieval (combining sparse and dense methods) significantly outperforms either approach alone. Fine-tuning on domain-specific data matters—general-purpose models struggled with climate jargon and technical terminology.

Most notably, one team demonstrated that fine-tuned BERT models could match LLM accuracy while running 11-380 times faster, consuming far less energy, and offering better explainability. For climate fact-checking where the tools themselves shouldn’t worsen the problem, this energy efficiency matters. And for applications requiring transparency, smaller models’ interpretability is crucial. This challenges the assumption that bigger models are always better.

Claim Verification Performance (F1 Scores)

[Figure: verification F1 scores for Qwen3 14B (Reasoning), fine-tuned BERT, Phi 4 14B, and Llama 3.3 70B, comparing fine-tuned BERT-based models with large language models.]

Processing Speed Comparison

Fine-tuned BERT processes claims 11-380x faster than large language models, making it ideal for real-time fact-checking.

System                   Time per claim   Relative to fine-tuned BERT
Fine-tuned BERT          0.032 s          baseline
Phi 4 14B                0.737 s          23× slower
Llama 3.3 70B            2.347 s          73× slower
Qwen3 14B (Reasoning)    12.229 s         382× slower
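The figures in the table come from the shared-task papers. For readers who want to run a comparable measurement on their own hardware, a minimal timing harness might look like the following sketch, where verify is any claim-verification callable (for example, the BERT-style sketch above or an LLM prompt).

```python
# Minimal per-claim latency harness (illustrative; not the benchmark used in the papers).
import time
import statistics

def time_per_claim(verify, pairs, warmup=2):
    """Median wall-clock seconds per (claim, abstract) pair for a verification callable."""
    for claim, abstract in pairs[:warmup]:  # warm up caches and lazy model loading
        verify(claim, abstract)
    timings = []
    for claim, abstract in pairs:
        start = time.perf_counter()
        verify(claim, abstract)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Example usage (dev_pairs and llm_verify are placeholders):
#   bert_s = time_per_claim(verify, dev_pairs)
#   llm_s = time_per_claim(llm_verify, dev_pairs)
#   print(f"{llm_s / bert_s:.0f}x slower than fine-tuned BERT")
```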

Abstract Retrieval Performance (Recall@10)

[Figure: Recall@10 for hybrid retrieval, dense-only, and sparse-only (BM25) approaches; hybrid retrieval performed best.]
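Recall@10 here is the share of a claim's gold abstracts that appear among its ten retrieved results, averaged over claims. The helper below is a small illustrative implementation, not the official ClimateCheck scorer.

```python
# Recall@k: fraction of gold abstracts retrieved in the top k, averaged over claims.
# Illustrative helper, not the official ClimateCheck evaluation script.
def recall_at_k(retrieved: dict[str, list[str]], gold: dict[str, set[str]], k: int = 10) -> float:
    scores = []
    for claim_id, gold_ids in gold.items():
        top_k = set(retrieved.get(claim_id, [])[:k])
        scores.append(len(top_k & gold_ids) / len(gold_ids))
    return sum(scores) / len(scores)

print(recall_at_k(
    {"c1": ["a3", "a7", "a1"], "c2": ["a2", "a9"]},
    {"c1": {"a1", "a5"}, "c2": {"a9"}},
    k=10,
))  # (0.5 + 1.0) / 2 = 0.75
```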

The competition is hosted on Codabench with full leaderboards and reproducible submissions.

Papers & Next Steps

The competition produced multiple papers: the dataset paper by Raia Abu Ahmad et al., the BERT vs LLM comparison by Max Upravitelev et al., and the shared task overview. Organized by Raia Abu Ahmad, Aida Usmanova, and Georg Rehm, the competition has spawned ClimateCheck 2026, which will add disinformation narrative classification using the CARDS taxonomy and feature substantially more training data.


Related Resources: