About Histoboard
Why Histoboard?
The pathology foundation model landscape is growing fast, but comparing models across benchmarks remains fragmented and time-consuming. Histoboard aggregates results from published benchmarks into a single, accessible interface, giving the community a clear comparative view of existing models.
For the full story behind Histoboard, read our blog post.
What We Aggregate
Histoboard currently aggregates results from 12 published benchmarks covering:
- 433 evaluation tasks spanning classification, survival prediction, biomarker detection, and more
- 48 foundation models from academic and industry labs worldwide
- 20 organs including Bladder, Brain, Breast, Cervix, Colorectal, and others
- Robustness evaluation across domain shifts, scanners, and staining variations
All data comes directly from official benchmark publications and repositories. See each benchmark card for exact data sources and evaluation protocols.
How Rankings Work
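The main leaderboard ranks models per benchmark using either the benchmark’s own aggregate score or a metric we compute ourselves. When a benchmark provides an official aggregate (e.g. a robustness index or rank sum), we follow it directly. Otherwise, we compute an overall ranking metric for that benchmark, in most cases by averaging per-task ranks (for EVA, by averaging the task metric). In the table below, Source indicates whether the metric is official or computed, and Direction indicates whether a higher or a lower value is better.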
| Benchmark | Ranking metric | Source | Direction |
|---|---|---|---|
| BC Survival | Average rank across 2 survival tasks and 2 populations | Official | ↑ higher |
| EVA | Average metric across 13 tasks | Computed | ↑ higher |
| HEST | Average Pearson's r across 9 organs | Official | ↑ higher |
| HKUST PathBench | Average task rank across 229 tasks | Computed | ↓ lower |
| Patho-Bench | Average task rank across 53 tasks | Computed | ↓ lower |
| PathoROB | Robustness index across 3 scenarios | Official | ↑ higher |
| PFM-DenseBench | Average rank across 18 segmentation datasets × 5 methods | Official | ↓ lower |
| Plismbench | Aggregate robustness score | Official | ↑ higher |
| Sinai SSL | Average AUROC across 22 tasks | Official | ↑ higher |
| STAMP | Average task rank across 31 tasks | Computed | ↓ lower |
| Stanford PathBench | Average AUROC across 41 tasks | Official | ↑ higher |
| THUNDER | Rank sum across 6 tasks | Official | ↑ higher |
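For the benchmarks marked Computed with an average task rank, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Histoboard's actual implementation: the function names, model names, task names, and scores are all made up, higher scores are assumed to be better on every task, tied models share the average of their positions, and a model missing from a task is averaged only over the tasks it was evaluated on.

```python
from statistics import mean


def task_ranks(scores, higher_is_better=True):
    """Rank models on one task; rank 1 is best, ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


def average_rank(results, higher_is_better=True):
    """Average each model's per-task rank over the tasks it appears in."""
    per_model = {}
    for scores in results.values():
        for model, rank in task_ranks(scores, higher_is_better).items():
            per_model.setdefault(model, []).append(rank)
    return {model: mean(ranks) for model, ranks in per_model.items()}


# Made-up scores for three hypothetical models on three hypothetical tasks.
results = {
    "task_a": {"model_x": 0.91, "model_y": 0.88, "model_z": 0.85},
    "task_b": {"model_x": 0.70, "model_y": 0.75, "model_z": 0.75},
    "task_c": {"model_x": 0.80, "model_y": 0.79, "model_z": 0.60},
}

overall = average_rank(results)
leaderboard = sorted(overall, key=overall.get)  # lower average rank is better
```

Averaging ranks rather than raw scores keeps tasks with different metrics and scales comparable, which is why the resulting number reads as "lower is better" in the Direction column.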
For detailed per-task results, visit the individual benchmark pages. Inside the arena, we implement a metric-agnostic ranking system that enables you to compare models based on specific organs and tasks.
Contributing
Key References
Mahmood, F. (2025). A benchmarking crisis in biomedical machine learning. Nature Medicine, 31, 1060.
Zhang, A., Jaume, G., Vaidya, A., Ding, T., & Mahmood, F. (2025). Accelerating Data Processing and Benchmarking of AI Models for Pathology. arXiv:2502.06750.