About Histoboard

Why Histoboard?

The pathology foundation model landscape is growing fast, but comparing models across benchmarks remains fragmented and time-consuming. Histoboard aggregates results from published benchmarks into a single, accessible interface — giving the community a clear comparative view of existing models.

For the full story behind Histoboard, read our blog post.

What We Aggregate

Histoboard currently aggregates results from 12 published benchmarks covering:

  • 433 evaluation tasks spanning classification, survival prediction, biomarker detection, and more
  • 48 foundation models from academic and industry labs worldwide
  • 20 organs including Bladder, Brain, Breast, Cervix, Colorectal, and others
  • Robustness evaluation across domain shifts, scanners, and staining variations

All data comes directly from official benchmark publications and repositories. See each benchmark card for exact data sources and evaluation protocols.

How Rankings Work

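The main leaderboard ranks models per benchmark using either the benchmark's own aggregate score or a metric we compute ourselves. When a benchmark provides an official aggregate (e.g. a robustness index or rank sum), we follow it directly. Otherwise, we compute an overall ranking metric for that benchmark ourselves, in most cases by averaging per-task ranks. In the table below, Source indicates whether the metric is official or computed, and Direction indicates whether higher or lower values are better.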

Benchmark | Ranking metric | Source | Direction
BC Survival | Average rank across 2 survival tasks and 2 populations | Official | ↑ higher
EVA | Average metric across 13 tasks | Computed | ↑ higher
HEST | Average Pearson's R across 9 different organs | Official | ↑ higher
HKUST PathBench | Average task rank across 229 tasks | Computed | ↓ lower
Patho-Bench | Average task rank across 53 tasks | Computed | ↓ lower
PathoROB | Robustness index across 3 scenarios | Official | ↑ higher
PFM-DenseBench | Average rank across 18 segmentation datasets × 5 methods | Official | ↓ lower
Plismbench | Aggregate robustness score | Official | ↑ higher
Sinai SSL | Average AUROC across 22 tasks | Official | ↑ higher
STAMP | Average task rank across 31 tasks | Computed | ↓ lower
Stanford PathBench | Average AUROC across 41 tasks | Official | ↑ higher
THUNDER | Rank sum across 6 tasks | Official | ↑ higher
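
To make the "average task rank" rows concrete, here is a minimal Python sketch of how such a metric could be computed. It is an illustration under stated assumptions, not Histoboard's actual implementation: the function names, the tie handling (tied models share the average of their positions), and the choice to skip models that lack a result for a task are all hypothetical, and the scores are made-up example values.

```python
from statistics import mean


def rank_scores(scores, higher_is_better=True):
    """Rank models on one task; rank 1 is best.

    Assumption: tied scores share the average of the positions they occupy.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Find the end of the run of models tied with position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared_rank
        i = j + 1
    return ranks


def average_task_rank(task_results, higher_is_better=None):
    """Average each model's per-task rank; a lower average is better.

    task_results maps task name -> {model name: score}.
    higher_is_better maps task name -> bool (defaults to True per task).
    Assumption: a model is only averaged over tasks it has a score for.
    """
    higher_is_better = higher_is_better or {}
    per_model = {}
    for task, scores in task_results.items():
        task_ranks = rank_scores(scores, higher_is_better.get(task, True))
        for model, rank in task_ranks.items():
            per_model.setdefault(model, []).append(rank)
    return {model: mean(ranks) for model, ranks in per_model.items()}


# Hypothetical scores for three models on three tasks.
results = {
    "task_a": {"model_x": 0.90, "model_y": 0.85, "model_z": 0.80},
    "task_b": {"model_x": 0.70, "model_y": 0.75, "model_z": 0.60},
    "task_c_error": {"model_x": 0.20, "model_y": 0.10, "model_z": 0.30},
}
avg_rank = average_task_rank(results, higher_is_better={"task_c_error": False})
leaderboard = sorted(avg_rank, key=avg_rank.get)  # best (lowest average rank) first
print(leaderboard)
```

Because ranks are computed within each task before averaging, tasks that report different metrics (for example AUROC on one task and an error rate on another) can be combined without rescaling; only the per-task direction needs to be known.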

For detailed per-task results, visit the individual benchmark pages. Inside the arena, we implement a metric-agnostic ranking system that enables you to compare models based on specific organs and tasks.

Contributing

Histoboard is open source and welcomes contributions, whether you have noticed missing models or incorrect data or want to add a new benchmark.

Key References

Mahmood, F. (2025). A benchmarking crisis in biomedical machine learning. Nature Medicine, 31, 1060.

Zhang, A., Jaume, G., Vaidya, A., Ding, T., & Mahmood, F. (2025). Accelerating Data Processing and Benchmarking of AI Models for Pathology. arXiv:2502.06750.