About Histoboard
Why Histoboard?
Foundation models have transformed computational pathology. Trained on large-scale histopathology datasets through self-supervised learning, these models learn general-purpose visual representations from digitized tissue slides. Once trained, they can be adapted to a wide range of downstream tasks—cancer detection and grading, biomarker prediction, survival analysis, tissue segmentation—often matching or exceeding task-specific approaches. Beyond purely visual encoders, multi-modal models that integrate additional data sources—pathology reports, genomic profiles, clinical metadata—are expanding the scope of what foundation models can achieve in this domain.
As the number of pathology foundation models grows, so does the need for rigorous evaluation. Yet the field currently faces a benchmarking crisis (Mahmood, 2025): the lack of standardized evaluation protocols makes it difficult to reliably assess model strengths, robustness, limitations, and readiness for clinical deployment.
Ideally, the community would converge on a single, comprehensive, publicly available benchmark against which all models could be evaluated. We are not there yet. Many benchmarks rely on proprietary or restricted-access data, and individual labs often develop their own internal evaluation suites, further fragmenting the landscape. Recent community-driven initiatives such as Patho-Bench propose clinically relevant tasks built on public datasets and represent important steps toward greater transparency, reproducibility, and continued progress (Zhang, 2025).
Until a unified benchmark emerges, Histoboard aims to bridge the gap by aggregating results from published benchmarks into a single, accessible interface. Our goal is to provide the community with a clear comparative view of existing models and to advocate for more publicly available evaluation datasets.
What We Aggregate
Histoboard currently aggregates results from 10 published benchmarks covering:
- 411 evaluation tasks spanning classification, survival prediction, biomarker detection, and more
- 46 foundation models from academic and industry labs worldwide
- 20 organs including Bladder, Brain, Breast, Cervix, Colorectal, and others
- Robustness evaluation across domain shifts, scanners, and staining variations
All data comes directly from official benchmark publications and repositories. See each benchmark card for exact data sources and evaluation protocols.
How Rankings Work
The main leaderboard ranks models per benchmark using either the benchmark's own aggregate score or a metric we compute ourselves. When a benchmark provides an official aggregate (e.g. a robustness index or rank sum), we use it directly. Otherwise, we compute an overall ranking metric for that benchmark by averaging either the per-task metric or the per-task ranks, as listed in the table below; a minimal sketch of the rank-averaging computation follows the table.
| Benchmark | Ranking metric | Source | Direction |
|---|---|---|---|
| EVA | Average metric across 13 tasks | Computed | ↑ higher |
| PathBench | Average task rank across 229 tasks | Computed | ↓ lower |
| Stanford | Average AUROC across 41 tasks | Official | ↑ higher |
| HEST | Average Pearson's R across 9 different organs | Official | ↑ higher |
| Patho-Bench | Average task rank across 53 tasks | Computed | ↓ lower |
| Sinai | Average AUROC across 22 tasks | Official | ↑ higher |
| STAMP | Average task rank across 31 tasks | Computed | ↓ lower |
| THUNDER | Rank sum across 6 tasks | Official | ↓ lower |
| PathoROB | Robustness index across 3 scenarios | Official | ↑ higher |
| Plismbench | Aggregate robustness score | Official | ↑ higher |
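To make the computed path concrete, here is a minimal sketch in Python of the rank-averaging step. It is illustrative only, not Histoboard's actual code; the function name, data layout, and toy scores are assumptions.

```python
# Minimal sketch (not Histoboard's actual code) of the rank-averaging path:
# each model is ranked per task, and its ranks are averaged across the
# benchmark's tasks (lower average rank = better).
from collections import defaultdict


def average_task_rank(per_task_scores: dict[str, dict[str, float]],
                      higher_is_better: bool = True) -> dict[str, float]:
    """per_task_scores maps task -> {model: score}; returns model -> mean rank."""
    rank_sums: dict[str, float] = defaultdict(float)
    task_counts: dict[str, int] = defaultdict(int)
    for scores in per_task_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] += rank
            task_counts[model] += 1
    return {model: rank_sums[model] / task_counts[model] for model in rank_sums}


# Toy example with two tasks and made-up scores for three hypothetical models.
scores = {
    "task_a": {"model_x": 0.91, "model_y": 0.88, "model_z": 0.85},
    "task_b": {"model_x": 0.72, "model_y": 0.80, "model_z": 0.70},
}
print(average_task_rank(scores))  # {'model_x': 1.5, 'model_y': 1.5, 'model_z': 3.0}
```

Benchmarks whose published score already aggregates across tasks (the "Official" rows above) are reported as-is; tie handling and missing per-task results would need extra care that this sketch omits.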
For detailed per-task results, visit the individual benchmark pages. Inside the arena, we also provide a metric-agnostic ranking system that lets you compare models on specific organs and tasks.
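The snippet below, again a hypothetical sketch rather than the arena's real implementation, illustrates why rank-based aggregation is metric-agnostic: after filtering to the organs or tasks of interest, each task contributes a rank rather than a raw value, so tasks scored with different metrics (e.g. AUROC or Pearson's r) can be combined. The data layout, task names, and models are invented for illustration.

```python
# Hypothetical sketch of organ-filtered, metric-agnostic ranking: each task
# contributes a rank rather than a raw metric value, so differently scored
# tasks can be combined once filtered to the organ of interest.
from statistics import mean


def rank_models_for_organ(task_results: list[dict], organ: str) -> dict[str, float]:
    """task_results: list of {"organ": str, "task": str, "scores": {model: metric}}."""
    per_model_ranks: dict[str, list[int]] = {}
    for record in task_results:
        if record["organ"] != organ:
            continue
        # Rank models on this task (assumes a higher metric is better, for simplicity).
        ordered = sorted(record["scores"], key=record["scores"].get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            per_model_ranks.setdefault(model, []).append(rank)
    return {model: mean(ranks) for model, ranks in per_model_ranks.items()}


# Toy data: made-up tasks, models, and metric values.
results = [
    {"organ": "Breast", "task": "her2_status", "scores": {"model_x": 0.81, "model_y": 0.78}},
    {"organ": "Breast", "task": "survival", "scores": {"model_x": 0.66, "model_y": 0.70}},
    {"organ": "Colorectal", "task": "msi", "scores": {"model_x": 0.90, "model_y": 0.93}},
]
print(rank_models_for_organ(results, "Breast"))  # {'model_x': 1.5, 'model_y': 1.5}
```

Metrics where lower is better, ties, and models missing from some tasks would need the same extra handling noted for the previous sketch.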
Contributing
If you are aware of public benchmarks that should be included, or have suggestions for improving pathology model evaluation, we welcome your contributions.
Key References
Mahmood, F. (2025). A benchmarking crisis in biomedical machine learning. Nature Medicine, 31, 1060.
Zhang, A., Jaume, G., Vaidya, A., Ding, T., & Mahmood, F. (2025). Accelerating Data Processing and Benchmarking of AI Models for Pathology. arXiv:2502.06750.