About Histoboard

Why Histoboard?

The pathology foundation model landscape is growing fast, but comparing models across benchmarks remains fragmented and time-consuming. Histoboard aggregates results from published benchmarks into a single, accessible interface — giving the community a clear comparative view of existing models.

For the full story behind Histoboard, read our blog post.

What We Aggregate

Histoboard currently aggregates results from 12 published benchmarks covering:

433 evaluation tasks spanning classification, survival prediction, biomarker detection, and more
48 foundation models from academic and industry labs worldwide
20 organs including Bladder, Brain, Breast, Cervix, Colorectal, and others
Robustness evaluation across domain shifts, scanners, and staining variations

All data comes directly from official benchmark publications and repositories. See each benchmark card for exact data sources and evaluation protocols.

How Rankings Work

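The main leaderboard ranks models per benchmark using either the benchmark’s own aggregate score or a metric we compute ourselves. When a benchmark provides an official aggregate (e.g. a robustness index or rank sum), we follow it directly. Otherwise, we compute an overall ranking metric for that benchmark ourselves, usually by averaging per-task ranks (for EVA, by averaging the per-task metric). The table below lists the metric used for each benchmark, whether it is official or computed, and whether higher or lower values are better.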

Benchmark	Ranking metric	Source	Direction
BC Survival	Average rank across 2 survival tasks and 2 populations	Official	↑ higher
EVA	Average metric across 13 tasks	Computed	↑ higher
HEST	Average Pearson's R across 9 different organs	Official	↑ higher
HKUST PathBench	Average task rank across 229 tasks	Computed	↓ lower
Patho-Bench	Average task rank across 53 tasks	Computed	↓ lower
PathoROB	Robustness index across 3 scenarios	Official	↑ higher
PFM-DenseBench	Average rank across 18 segmentation datasets × 5 methods	Official	↓ lower
Plismbench	Aggregate robustness score	Official	↑ higher
Sinai SSL	Average AUROC across 22 tasks	Official	↑ higher
STAMP	Average task rank across 31 tasks	Computed	↓ lower
Stanford PathBench	Average AUROC across 41 tasks	Official	↑ higher
THUNDER	Rank sum across 6 tasks	Official	↑ higher
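As an illustration of the "average task rank" entries in the table, here is a minimal sketch of how per-task ranks could be averaged into one number per model. It is not Histoboard's actual code: the function names, the tie handling (tied models share the average of their positions), and the choice to skip tasks a model was not evaluated on are all assumptions made for this example, as are the model and task names.

```python
from collections import defaultdict


def rank_models(scores, higher_is_better=True):
    """Rank models on a single task. Rank 1 is best; ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared_rank
        i = j + 1
    return ranks


def average_task_rank(task_scores, higher_is_better=None):
    """Average each model's rank over the tasks it appears in (lower is better).

    Assumption: a model missing from a task is simply skipped for that task.
    """
    higher_is_better = higher_is_better or {}
    totals = defaultdict(float)
    counts = defaultdict(int)
    for task, scores in task_scores.items():
        task_ranks = rank_models(scores, higher_is_better.get(task, True))
        for model, rank in task_ranks.items():
            totals[model] += rank
            counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


# Hypothetical scores for three placeholder models on two placeholder tasks.
task_scores = {
    "task_a": {"model_x": 0.90, "model_y": 0.85, "model_z": 0.80},
    "task_b": {"model_x": 0.70, "model_y": 0.75, "model_z": 0.60},
}
average_ranks = average_task_rank(task_scores)
```

Because ranks rather than raw scores are averaged, tasks reported in different metrics (AUROC, C-index, Pearson's R) contribute on the same scale, which is why the resulting number is read as "lower is better".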

For detailed per-task results, visit the individual benchmark pages. The arena uses a metric-agnostic ranking system that lets you compare models on specific organs and tasks.
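One way a metric-agnostic comparison restricted to an organ could work is a head-to-head win count: for each task both models were evaluated on, record which model did better according to that task's own direction, ignoring the metric's scale. The sketch below is only an assumption about how such a comparison might be built, not a description of the arena's implementation; `TaskResult`, `head_to_head`, and all task, organ, and model names are hypothetical.

```python
from typing import NamedTuple, Optional


class TaskResult(NamedTuple):
    task: str
    organ: str
    higher_is_better: bool
    scores: dict  # model name -> metric value


def head_to_head(results, model_a, model_b, organ: Optional[str] = None):
    """Count tasks won by each model, optionally restricted to one organ.

    Returns a (wins_a, wins_b, ties) tuple. Tasks where either model has no
    score are skipped (an assumption made for this sketch).
    """
    wins_a = wins_b = ties = 0
    for result in results:
        if organ is not None and result.organ != organ:
            continue
        if model_a not in result.scores or model_b not in result.scores:
            continue
        a, b = result.scores[model_a], result.scores[model_b]
        if a == b:
            ties += 1
        elif (a > b) == result.higher_is_better:
            # a is larger and larger is better, or a is smaller and smaller is better.
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties


# Hypothetical results mixing a higher-is-better and a lower-is-better metric.
results = [
    TaskResult("subtyping", "Breast", True, {"model_x": 0.90, "model_y": 0.80}),
    TaskResult("survival", "Breast", True, {"model_x": 0.60, "model_y": 0.65}),
    TaskResult("segmentation_error", "Brain", False, {"model_x": 0.10, "model_y": 0.20}),
]
overall = head_to_head(results, "model_x", "model_y")
breast_only = head_to_head(results, "model_x", "model_y", organ="Breast")
```

Only the ordering within each task matters here, so metrics with different ranges or directions can be combined without rescaling.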

Contributing

Histoboard is open source and welcomes contributions, whether you have spotted a missing model or incorrect data, or want to add a new benchmark.

Key References

Mahmood, F. (2025). A benchmarking crisis in biomedical machine learning. Nature Medicine, 31, 1060.

Zhang, A., Jaume, G., Vaidya, A., Ding, T., & Mahmood, F. (2025). Accelerating Data Processing and Benchmarking of AI Models for Pathology. arXiv:2502.06750.
