About Histoboard

Why Histoboard?

Foundation models have transformed computational pathology. Trained on large-scale histopathology datasets through self-supervised learning, these models learn general-purpose visual representations from digitized tissue slides. Once trained, they can be adapted to a wide range of downstream tasks—cancer detection and grading, biomarker prediction, survival analysis, tissue segmentation—often matching or exceeding task-specific approaches. Beyond purely visual encoders, multi-modal models that integrate additional data sources—pathology reports, genomic profiles, clinical metadata—are expanding the scope of what foundation models can achieve in this domain.

As the number of pathology foundation models grows, so does the need for rigorous evaluation. Yet the field currently faces a benchmarking crisis (Mahmood, 2025): the lack of standardized evaluation protocols makes it difficult to reliably assess model strengths, robustness, limitations, and readiness for clinical deployment.

Ideally, the community would converge on a single, comprehensive, publicly available benchmark against which all models could be evaluated. We are not there yet. Many benchmarks rely on proprietary or restricted-access data, and individual labs often develop their own internal evaluation suites, further fragmenting the landscape. Recent community-driven initiatives such as Patho-Bench propose clinically relevant tasks built on public datasets and represent important steps toward greater transparency, reproducibility, and continued progress (Zhang, 2025).

Until a unified benchmark emerges, Histoboard aims to bridge the gap by aggregating results from published benchmarks into a single, accessible interface. Our goal is to provide the community with a clear comparative view of existing models and to advocate for more publicly available evaluation datasets.

If you are aware of public benchmarks that should be included, or have suggestions for improving pathology model evaluation, we welcome your contributions.

What We Aggregate

Histoboard currently aggregates results from 10 published benchmarks covering:

  • 411 evaluation tasks spanning classification, survival prediction, biomarker detection, and more
  • 46 foundation models from academic and industry labs worldwide
  • 20 organs, including bladder, brain, breast, cervix, colorectal tissue, and others
  • Robustness evaluation across domain shifts, scanners, and staining variations

All data comes directly from official benchmark publications and repositories. See each benchmark card for exact data sources and evaluation protocols.

How Rankings Work

The main leaderboard ranks models per benchmark using either the benchmark’s own aggregate score or a metric we compute ourselves. When a benchmark provides an official aggregate (e.g. a robustness index or rank sum), we follow it directly. Otherwise, we average per-task ranks to compute an overall ranking metric for that benchmark.
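The rank-averaging step can be sketched as follows. This is a minimal illustration, not Histoboard's actual implementation: the model names and scores are made up, and ties are resolved naively by insertion order rather than by a proper tie-handling scheme.

```python
# Illustrative "average task rank" computation, used when a benchmark
# provides no official aggregate. All model names and scores are invented.
per_task_scores = {
    "task_a": {"model_1": 0.91, "model_2": 0.88, "model_3": 0.85},
    "task_b": {"model_1": 0.70, "model_2": 0.76, "model_3": 0.74},
    "task_c": {"model_1": 0.83, "model_2": 0.83, "model_3": 0.80},
}

def average_task_rank(per_task_scores):
    """Rank models within each task (1 = best), then average each model's ranks."""
    rank_sums, task_counts = {}, {}
    for scores in per_task_scores.values():
        # Best score gets rank 1; ties fall back to insertion order here.
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] = rank_sums.get(model, 0) + rank
            task_counts[model] = task_counts.get(model, 0) + 1
    return {m: rank_sums[m] / task_counts[m] for m in rank_sums}

print(average_task_rank(per_task_scores))
```

Lower is better for this metric: a model that places first on every task gets an average rank of 1.0.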

Benchmark     Ranking metric                           Source     Direction
EVA           Average metric across 13 tasks           Computed   ↑ higher
PathBench     Average task rank across 229 tasks       Computed   ↓ lower
Stanford      Average AUROC across 41 tasks            Official   ↑ higher
HEST          Average Pearson's r across 9 organs      Official   ↑ higher
Patho-Bench   Average task rank across 53 tasks        Computed   ↓ lower
Sinai         Average AUROC across 22 tasks            Official   ↑ higher
STAMP         Average task rank across 31 tasks        Computed   ↓ lower
THUNDER       Rank sum across 6 tasks                  Official   ↓ lower
PathoROB      Robustness index across 3 scenarios      Official   ↑ higher
Plismbench    Aggregate robustness score               Official   ↑ higher

For detailed per-task results, visit the individual benchmark pages. Inside the arena, a metric-agnostic ranking system lets you compare models on specific organs and tasks.

Contributing

Histoboard is open source and welcomes contributions. If you notice missing models or incorrect data, or want to add a new benchmark, please reach out or open an issue.

Key References

Mahmood, F. (2025). A benchmarking crisis in biomedical machine learning. Nature Medicine, 31, 1060.

Zhang, A., Jaume, G., Vaidya, A., Ding, T., & Mahmood, F. (2025). Accelerating Data Processing and Benchmarking of AI Models for Pathology. arXiv:2502.06750.