ClimateCheck@SDP 2025: Advancing Scientific Fact-Checking of Climate Claims
The first ClimateCheck shared task challenged researchers to build systems that verify climate claims from social media against a corpus of 394,269 scientific publications. Held at SDP 2025 in Vienna, it attracted 10 teams competing on two subtasks: retrieving relevant abstracts and verifying whether those abstracts support or refute the claims.
The Competition
Subtask I: Abstract Retrieval — Find the most relevant scientific abstracts for each claim. The winning system used a hybrid approach combining BM25 sparse retrieval, fine-tuned dense embeddings (e5-large-v2-climatecheck), and neural reranking. Top performers achieved 0.574 Recall@10.
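A minimal sketch of such a hybrid pipeline, assuming the `rank_bm25` and `sentence-transformers` libraries; the model names are illustrative placeholders, since the winning team used its own fine-tuned e5-large-v2 variant:

```python
# Minimal hybrid retrieval sketch: BM25 + dense bi-encoder fused scores, then cross-encoder reranking.
# Model names are illustrative placeholders, not the winning team's actual checkpoints.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

abstracts = [
    "Global mean surface temperature has risen by roughly 1.1 C since pre-industrial times.",
    "Ocean acidification is driven by the uptake of atmospheric CO2.",
    "Arctic sea-ice extent shows a strong declining trend in satellite records.",
]
claim = "The planet has warmed by about one degree since industrialisation."

# Sparse scores: BM25 over whitespace-tokenised abstracts.
bm25 = BM25Okapi([a.lower().split() for a in abstracts])
sparse = np.array(bm25.get_scores(claim.lower().split()))

# Dense scores: cosine similarity with an e5-style bi-encoder ("query:"/"passage:" prefixes).
encoder = SentenceTransformer("intfloat/e5-large-v2")
q = encoder.encode("query: " + claim, convert_to_tensor=True)
d = encoder.encode(["passage: " + a for a in abstracts], convert_to_tensor=True)
dense = util.cos_sim(q, d)[0].cpu().numpy()

# Fuse: min-max normalise each score list, then average (one of several common fusion schemes).
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

fused = 0.5 * minmax(sparse) + 0.5 * minmax(dense)
candidates = np.argsort(-fused)[:100]

# Rerank the fused candidates with a cross-encoder and keep the top 10.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(claim, abstracts[i]) for i in candidates])
top10 = [int(candidates[j]) for j in np.argsort(-scores)[:10]]
print(top10)
```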
Subtask II: Claim Verification — Predict whether the retrieved abstracts support, refute, or provide insufficient information about each claim. The best system reached 0.716 F1 (precision and recall both 0.717); its final score, which scales verification performance by retrieval quality, was 1.291.
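Verification itself is a three-way sequence-pair classification problem. A minimal sketch using Hugging Face transformers; the checkpoint name below is a placeholder for any model fine-tuned on (claim, abstract) pairs with this label set:

```python
# Three-class claim verification sketch with Hugging Face transformers.
# "your-org/climatecheck-verifier" is a placeholder checkpoint name, not a real model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["supports", "refutes", "not enough information"]
name = "your-org/climatecheck-verifier"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=len(LABELS))

claim = "The planet has warmed by about one degree since industrialisation."
abstract = "Global mean surface temperature has risen by roughly 1.1 C since pre-industrial times."

# Encode the claim-abstract pair and pick the highest-scoring label.
inputs = tokenizer(claim, abstract, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(LABELS[int(logits.argmax(dim=-1))])
```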
Key Findings
The competition revealed that hybrid retrieval (combining sparse and dense methods) significantly outperforms either approach alone. Fine-tuning on domain-specific data matters—general-purpose models struggled with climate jargon and technical terminology.
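One common way to adapt a general-purpose bi-encoder to this domain is contrastive fine-tuning on (claim, matching abstract) pairs. The sketch below uses the sentence-transformers training API; the training pairs and hyperparameters are purely illustrative:

```python
# Contrastive fine-tuning sketch for a dense retriever on (claim, abstract) pairs.
# The two training pairs are illustrative stand-ins for the ClimateCheck training split.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("intfloat/e5-large-v2")

train_examples = [
    InputExample(texts=["query: The planet has warmed by about one degree since industrialisation.",
                        "passage: Global mean surface temperature has risen by roughly 1.1 C since pre-industrial times."]),
    InputExample(texts=["query: The oceans are becoming more acidic.",
                        "passage: Ocean acidification is driven by the uptake of atmospheric CO2."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative for each claim.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("e5-large-v2-climatecheck")
```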
Most notably, one team demonstrated that fine-tuned BERT models could match LLM accuracy while running 11-380 times faster, consuming far less energy, and offering better explainability. For climate fact-checking where the tools themselves shouldn’t worsen the problem, this energy efficiency matters. And for applications requiring transparency, smaller models’ interpretability is crucial. This challenges the assumption that bigger models are always better.
Claim Verification Performance (F1 Scores)
*Chart: claim verification F1 scores for Qwen3 14B (Reasoning), fine-tuned BERT, Phi 4 14B, and Llama 3.3 70B.*
Processing Speed Comparison
Fine-tuned BERT processes each claim in a fraction of the time the large language models need, making it well suited to real-time fact-checking. The "vs BERT" column is simply the ratio of per-claim times, as the short calculation after the table shows.
| System | Time per Claim | vs BERT |
|---|---|---|
| Fine-tuned BERT | 0.032s | — |
| Phi 4 14B | 0.737s | 23× slower |
| Llama 3.3 70B | 2.347s | 73× slower |
| Qwen3 14B (Reasoning) | 12.229s | 382× slower |
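The speedup ratios can be reproduced directly from the per-claim times in the table:

```python
# Speedups relative to fine-tuned BERT, computed from the per-claim times above.
bert = 0.032  # seconds per claim
for name, t in [("Phi 4 14B", 0.737), ("Llama 3.3 70B", 2.347), ("Qwen3 14B (Reasoning)", 12.229)]:
    print(f"{name}: {t / bert:.0f}x slower")  # 23x, 73x, 382x
```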
Abstract Retrieval Performance (Recall@10)
*Chart: Recall@10 for hybrid retrieval, dense-only, and sparse-only (BM25) retrieval.*
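Recall@10 is the standard retrieval metric here: the fraction of a claim's gold abstracts that appear in its top-10 results, averaged over claims. A minimal sketch with illustrative toy data:

```python
# Recall@k over a set of claims: fraction of gold abstracts found in each top-k list, averaged.
def recall_at_k(retrieved: dict, gold: dict, k: int = 10) -> float:
    per_claim = []
    for claim_id, gold_ids in gold.items():
        topk = set(retrieved.get(claim_id, [])[:k])
        per_claim.append(len(topk & set(gold_ids)) / len(gold_ids))
    return sum(per_claim) / len(per_claim)

# Toy example: two claims with one and two gold abstracts respectively.
retrieved = {"c1": ["a3", "a7", "a9"], "c2": ["a2", "a5", "a8"]}
gold = {"c1": ["a7"], "c2": ["a1", "a2"]}
print(recall_at_k(retrieved, gold, k=10))  # (1/1 + 1/2) / 2 = 0.75
```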
The competition is hosted on Codabench with full leaderboards and reproducible submissions.
Papers & Next Steps
The competition produced multiple papers: the dataset paper by Raia Abu Ahmad et al., the BERT vs LLM comparison by Max Upravitelev et al., and the shared task overview. Organized by Raia Abu Ahmad, Aida Usmanova, and Georg Rehm, the competition has spawned ClimateCheck 2026, which will add disinformation narrative classification using the CARDS taxonomy and feature substantially more training data.