COMPETITIONS

The first ClimateCheck shared task at SDP 2025 challenged participants to build systems for scientific fact-checking of climate claims, with tasks focusing on abstract retrieval and claim verification against a corpus of 394,269 climate science publications.

ClimateCheck@SDP 2025: Advancing Scientific Fact-Checking of Climate Claims


The first ClimateCheck competition challenged researchers to build systems that could verify climate claims from social media against a corpus of 394,000 scientific publications. Held at SDP 2025 in Vienna, the shared task attracted 10 teams who competed on two subtasks: retrieving relevant abstracts and verifying whether they support or refute the claims.

The Competition

Subtask I: Abstract Retrieval — Find the most relevant scientific abstracts for each claim. The winning system used a hybrid approach combining BM25 sparse retrieval, fine-tuned dense embeddings (e5-large-v2-climatecheck), and neural reranking. Top performers achieved 0.574 Recall@10.
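As a rough sketch of what such a hybrid pipeline can look like (not the winning team's code), the snippet below fuses BM25 scores with dense-embedding similarity and reranks the pooled candidates with a cross-encoder. The model names and the fusion weight are assumptions: the public intfloat/e5-large-v2 and a generic MS MARCO cross-encoder stand in for the fine-tuned climatecheck models.

```python
# Hypothetical hybrid retrieval sketch: BM25 + dense embeddings + cross-encoder reranking.
# Model names and the fusion weight are illustrative assumptions, not the winning system's settings.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

abstracts = [
    "Global mean surface temperature has risen by about 1.1 C since pre-industrial times.",
    "Arctic sea ice extent shows a strong declining trend in satellite observations.",
    "Ocean acidification is driven by the uptake of anthropogenic CO2.",
]

# Sparse retrieval: BM25 over whitespace-tokenized abstracts.
bm25 = BM25Okapi([a.lower().split() for a in abstracts])

# Dense retrieval: e5 models expect "query: ..." / "passage: ..." prefixes.
encoder = SentenceTransformer("intfloat/e5-large-v2")  # stand-in for the fine-tuned climatecheck model
doc_emb = encoder.encode([f"passage: {a}" for a in abstracts], normalize_embeddings=True)

# Reranker: a generic MS MARCO cross-encoder, used here as a placeholder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(claim: str, k: int = 10, alpha: float = 0.5) -> list[int]:
    """Return indices of the top-k abstracts for a claim using fused sparse + dense scores."""
    sparse = np.array(bm25.get_scores(claim.lower().split()))
    dense = doc_emb @ encoder.encode(f"query: {claim}", normalize_embeddings=True)

    def norm(x):
        # Min-max normalize each score list so neither retriever dominates the mix.
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    fused = alpha * norm(sparse) + (1 - alpha) * norm(dense)
    candidates = np.argsort(-fused)[: max(k * 5, k)]  # widen the pool before reranking
    rerank_scores = reranker.predict([(claim, abstracts[i]) for i in candidates])
    return [int(candidates[j]) for j in np.argsort(-rerank_scores)[:k]]

print(retrieve("Arctic ice is disappearing", k=2))
```

In practice the dense index over all 394,269 abstracts would be precomputed and stored (e.g., with FAISS) rather than encoded in memory as in this toy example.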

Subtask II: Claim Verification — Predict whether abstracts support, refute, or provide insufficient information about claims. The best system scored 0.716 F1 (precision and recall both 0.717), and its final competition score, which also factors in retrieval quality, reached 1.291.
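Framed as a sequence-pair classification problem, verification can be prototyped in a handful of lines. The sketch below assumes a BERT-style model fine-tuned on supports / refutes / insufficient-information labels; the checkpoint name is purely hypothetical, not one of the submitted systems.

```python
# Minimal sketch of claim verification as 3-way sequence-pair classification.
# "my-org/climatecheck-verifier" is a hypothetical checkpoint name, not a released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["supports", "refutes", "insufficient information"]

tokenizer = AutoTokenizer.from_pretrained("my-org/climatecheck-verifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "my-org/climatecheck-verifier", num_labels=len(LABELS)
)
model.eval()

def verify(claim: str, abstract: str) -> str:
    """Predict whether the abstract supports, refutes, or says too little about the claim."""
    inputs = tokenizer(claim, abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(verify(
    "CO2 has no effect on global temperatures",
    "We find that rising atmospheric CO2 is the dominant driver of observed warming.",
))
```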

Key Findings

The competition revealed that hybrid retrieval (combining sparse and dense methods) significantly outperforms either approach alone. Fine-tuning on domain-specific data matters—general-purpose models struggled with climate jargon and technical terminology.

Most notably, one team demonstrated that fine-tuned BERT models could match LLM accuracy while running 11-380 times faster, consuming far less energy, and offering better explainability. For climate fact-checking where the tools themselves shouldn’t worsen the problem, this energy efficiency matters. And for applications requiring transparency, smaller models’ interpretability is crucial. This challenges the assumption that bigger models are always better.

Claim Verification Performance (F1 Scores)

[Figure: verification F1 scores for Qwen3 14B (Reasoning), fine-tuned BERT, Phi 4 14B, and Llama 3.3 70B, comparing fine-tuned BERT-based models with large language models.]

Processing Speed Comparison

Fine-tuned BERT processes claims 11-380x faster than large language models, making it ideal for real-time fact-checking.

System                   Time per claim   Relative to fine-tuned BERT
Fine-tuned BERT          0.032 s          baseline
Phi 4 14B                0.737 s          23× slower
Llama 3.3 70B            2.347 s          73× slower
Qwen3 14B (Reasoning)    12.229 s         382× slower
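The figures in the table come from the shared-task papers. For readers who want to run a comparable measurement on their own hardware, a minimal timing harness might look like the following sketch, where verify is any claim-verification callable (for example, the BERT-style sketch above or an LLM prompt).

```python
# Minimal per-claim latency harness (illustrative; not the benchmark used in the papers).
import time
import statistics

def time_per_claim(verify, pairs, warmup=2):
    """Median wall-clock seconds per (claim, abstract) pair for a verification callable."""
    for claim, abstract in pairs[:warmup]:  # warm up caches and lazy model loading
        verify(claim, abstract)
    timings = []
    for claim, abstract in pairs:
        start = time.perf_counter()
        verify(claim, abstract)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Example usage (dev_pairs and llm_verify are placeholders):
#   bert_s = time_per_claim(verify, dev_pairs)
#   llm_s = time_per_claim(llm_verify, dev_pairs)
#   print(f"{llm_s / bert_s:.0f}x slower than fine-tuned BERT")
```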

Abstract Retrieval Performance (Recall@10)

[Figure: Recall@10 for hybrid retrieval, dense-only, and sparse-only (BM25) approaches; hybrid retrieval performed best.]
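Recall@10 here is the share of a claim's gold abstracts that appear among its ten retrieved results, averaged over claims. The helper below is a small illustrative implementation, not the official ClimateCheck scorer.

```python
# Recall@k: fraction of gold abstracts retrieved in the top k, averaged over claims.
# Illustrative helper, not the official ClimateCheck evaluation script.
def recall_at_k(retrieved: dict[str, list[str]], gold: dict[str, set[str]], k: int = 10) -> float:
    scores = []
    for claim_id, gold_ids in gold.items():
        top_k = set(retrieved.get(claim_id, [])[:k])
        scores.append(len(top_k & gold_ids) / len(gold_ids))
    return sum(scores) / len(scores)

print(recall_at_k(
    {"c1": ["a3", "a7", "a1"], "c2": ["a2", "a9"]},
    {"c1": {"a1", "a5"}, "c2": {"a9"}},
    k=10,
))  # (0.5 + 1.0) / 2 = 0.75
```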

The competition is hosted on Codabench with full leaderboards and reproducible submissions.

Papers & Next Steps

The competition produced multiple papers: the dataset paper by Raia Abu Ahmad et al., the BERT vs LLM comparison by Max Upravitelev et al., and the shared task overview. Organized by Raia Abu Ahmad, Aida Usmanova, and Georg Rehm, the competition has spawned ClimateCheck 2026, which will add disinformation narrative classification using the CARDS taxonomy and feature substantially more training data.


Related Resources: