Evaluation logic
How we compare tools
We do not judge AI tools only by how impressive they sound: we compare them on how accurate they are, how fast they run, and whether they can realistically fit into archive workflows.
A plain-language guide to the main metrics
Precision is the share of the tool's hits that are correct, recall is the share of the real items that the tool actually finds, and F1 is a single score that balances the two. Simple example: if a tool highlights ten names and eight are real names, precision is eight out of ten. If there were twelve real names in the text and the tool found eight of them, recall is eight out of twelve.
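The same arithmetic, as a minimal sketch in code (the numbers are the ones from the example above, not benchmark results):

```python
# Worked version of the example: the tool highlights 10 names,
# 8 of them are real, and the text contains 12 real names in total.

def precision_recall_f1(true_positives: int, predicted: int, actual: int):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = true_positives / predicted   # 8 / 10 = 0.80
    recall = true_positives / actual         # 8 / 12 ≈ 0.67
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(true_positives=8, predicted=10, actual=12)
print(f"precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")
# precision=0.80  recall=0.67  F1=0.73
```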
Primary criteria
- Accuracy. Precision, recall, and F1 are the main headline measures, reported overall and, where relevant, per entity or label class.
- Speed. Total runtime and seconds per processed sentence help separate tools suitable for bulk indexing from tools better suited to smaller, targeted tasks (a scoring and timing sketch follows this list).
- Integration fit. We favor solutions that can be inserted into a multilingual archival workflow without forcing every partner into the same infrastructure or model stack.
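To make the first two criteria concrete, the sketch below shows one way per-class scores and per-sentence timing could be collected. It assumes BIO-tagged gold and predicted sequences and the seqeval package; the tag_sentence() call is a hypothetical stand-in for whatever pipeline is under test, not a specific tool's API.

```python
import time

from seqeval.metrics import classification_report

def evaluate(model, sentences, gold_tags):
    """Score one tool on one dataset: per-class F1 plus seconds per sentence."""
    start = time.perf_counter()
    predictions = [model.tag_sentence(s) for s in sentences]  # hypothetical pipeline call
    elapsed = time.perf_counter() - start

    # Entity-level precision / recall / F1, reported overall and per label
    # class (PER, ORG, LOC, ...), matching the accuracy criterion above.
    print(classification_report(gold_tags, predictions))
    print(f"{elapsed:.1f} s total, {elapsed / len(sentences):.3f} s per sentence")
```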
Operational constraints
The current benchmarking environment is a system centered on an NVIDIA GB10 chip, which in practice limits routine evaluation to models that fit within approximately 128 GB of RAM. That matters because some large language models may be interesting experimentally yet remain impractical for widespread operational deployment.
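As a rough illustration of that constraint, a back-of-the-envelope memory check is sketched below. The assumptions (2 bytes per parameter for fp16/bf16 weights, a 20% allowance for activations and caches) are illustrative defaults, not measured figures.

```python
def fits_in_memory(n_params_billion: float, ram_gb: float = 128.0) -> bool:
    """Rough check: do the model weights plus headroom fit in the available RAM?"""
    bytes_per_param = 2    # assumes fp16 / bf16 weights
    overhead = 1.2         # assumed allowance for activations, caches, runtime
    needed_gb = n_params_billion * bytes_per_param * overhead  # billions of params x bytes/param ≈ GB
    return needed_gb <= ram_gb

print(fits_in_memory(70))   # ~168 GB needed -> False on a 128 GB machine
print(fits_in_memory(8))    # ~19 GB needed  -> True
```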
The comparison therefore treats transformer pipelines as the default high-throughput baseline and LLM-based methods as complementary: useful for fallback cases, targeted enrichment, or rapid experimentation where dedicated task models are weak or unavailable.
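A minimal sketch of that complementary setup, assuming the Hugging Face transformers pipeline API; the model name, the confidence threshold, and the LLM fallback helper are illustrative assumptions rather than project choices:

```python
from transformers import pipeline

# Transformer pipeline as the high-throughput default (model name is an example).
ner = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl", aggregation_strategy="simple")

def llm_extract_entities(sentence: str):
    # Hypothetical placeholder for an LLM-based fallback extractor.
    raise NotImplementedError("plug in an LLM-backed extractor here")

def extract_entities(sentence: str, score_threshold: float = 0.6):
    """Use the transformer pipeline; defer to the LLM only when it is unsure."""
    entities = ner(sentence)
    if entities and min(e["score"] for e in entities) >= score_threshold:
        return entities
    return llm_extract_entities(sentence)  # fallback / targeted enrichment path
```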
Testing framework
- Benchmarking is cyclical rather than one-off.
- New models and datasets are added as the field changes and project needs become clearer.
- This website is intended to be updated continuously.
- Formal report versions can still be refreshed at larger project milestones.
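One way such a cyclical setup could be organized, sketched under the assumption of a simple model-by-dataset registry; all names and the run_benchmark() helper are placeholders, not the project's actual harness:

```python
from itertools import product

# The registry grows between cycles: new entries are appended and the matrix is re-run.
MODELS = ["transformer-ner-a", "transformer-ner-b", "llm-baseline"]    # placeholder names
DATASETS = ["modern-news", "historical-archive", "legal-domain"]       # placeholder names

def run_benchmark(model: str, dataset: str) -> None:
    """Placeholder for the actual evaluation harness that stores scores."""
    print(f"benchmarking {model} on {dataset}")

def run_cycle() -> None:
    for model, dataset in product(MODELS, DATASETS):
        run_benchmark(model, dataset)
```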
Current evidence base
The strongest reusable material currently available in the workspace covers three technology tracks:
- NER. Evaluation across modern, historical, and legal-domain datasets with mapped PER, ORG, and LOC labels, including dedicated transformer models and a secondary LLM comparison (a label-mapping sketch follows this list).
- PII detection and anonymization. Operational comparison between Presidio and MAPA on legal-domain multilingual data.
- Tone and sentiment analysis. A structured preliminary benchmark note is published, with scored results still pending.
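For the NER track, the label mapping mentioned above can be pictured as a small normalization step applied before scoring; the source labels listed here are examples only, not the project's actual mapping tables:

```python
# Normalize heterogeneous dataset tag sets onto a shared PER / ORG / LOC scheme
# so scores stay comparable across corpora. Source labels are illustrative.
LABEL_MAP = {
    "PERSON": "PER", "PER": "PER",
    "ORGANIZATION": "ORG", "ORG": "ORG",
    "LOCATION": "LOC", "GPE": "LOC", "LOC": "LOC",
}

def normalize(tag: str) -> str:
    """Map a dataset-specific BIO tag onto the shared scheme, keeping the prefix."""
    if tag == "O":
        return tag
    prefix, _, label = tag.partition("-")            # "B-PERSON" -> ("B", "-", "PERSON")
    return f"{prefix}-{LABEL_MAP.get(label, label)}"

print(normalize("B-PERSON"))   # B-PER
print(normalize("I-GPE"))      # I-LOC
```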
Additional tracks can be added to the same site structure as benchmarking results stabilize.