💻 Methodology

Evaluation methods

This page explains how ArchXAI runs tool comparisons: what we measure, what hardware constraints shape the tests, and how results move from experiments into public benchmark notes.

What we report:

⭐ Quality

Each task uses metrics suited to its output type: precision, recall, F1, ranking metrics, correlation metrics, or task-specific retrieval scores.
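
As a minimal sketch of the classification metrics named above, the following Python computes precision, recall, and F1 for binary labels. The function name and the sample labels are illustrative assumptions, not ArchXAI's actual evaluation code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive).

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels, used only to show the call shape.
precision, recall, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

Ranking, correlation, and retrieval metrics follow the same pattern: one small, explicit function per metric, chosen to match the task's output type.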

⚡ Performance

Where relevant, we add runtime, latency, memory, or vector-size measurements so that a high-scoring model is not mistaken for an automatically deployable model.
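
One way to collect such measurements is sketched below, using only the Python standard library. The `measure` helper and the `fake_model` stand-in are assumptions for illustration. Note that `tracemalloc` tracks Python heap allocations only, so GPU memory would need a framework-specific counter that is not shown here.

```python
import statistics
import time
import tracemalloc


def measure(fn, *args, repeats=5):
    """Return median wall-clock latency and peak Python heap use for fn(*args).

    Hypothetical helper: timing and memory tracing run in separate passes
    so that tracing overhead does not inflate the latency numbers.
    """
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        latencies.append(time.perf_counter() - start)

    tracemalloc.start()
    try:
        fn(*args)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {
        "median_latency_s": statistics.median(latencies),
        "peak_python_bytes": peak_bytes,
    }


def fake_model(n):
    # Stand-in workload; a real benchmark would call the model under test.
    return [i * i for i in range(n)]


stats = measure(fake_model, 10_000)
```

Reporting these numbers next to the quality score keeps the two kinds of evidence visible side by side.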

🌐 Coverage

We report results by language whenever possible, because a model that performs well in one project language may be much weaker in another.
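
The sketch below illustrates why a per-language breakdown is more informative than a single average. The language codes and scores are made-up placeholders, and `scores_by_language` is a hypothetical helper rather than part of any ArchXAI tooling.

```python
from collections import defaultdict


def scores_by_language(records):
    """Average a metric per language from (language, score) pairs."""
    grouped = defaultdict(list)
    for language, score in records:
        grouped[language].append(score)
    return {lang: sum(vals) / len(vals) for lang, vals in grouped.items()}


# Placeholder data: not real benchmark results.
records = [("en", 0.9), ("en", 0.8), ("de", 0.5)]

per_language = scores_by_language(records)
overall = sum(score for _, score in records) / len(records)

# The overall mean sits between the two languages and hides the gap;
# the per-language table makes the weaker language explicit.
for lang, mean_score in sorted(per_language.items()):
    print(f"{lang}: {mean_score:.2f}")
print(f"overall: {overall:.2f}")
```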

⚠️ Caveats

We call out weak datasets, synthetic data, suspiciously strong results, or incomplete model coverage instead of hiding those limits in a single average.

Hardware and runtime limitations

The current local benchmarking environment is centered on an NVIDIA GB10 system with CUDA support and approximately 128 GB of memory shared between the CPU and GPU. This is large enough to test many modern transformer and embedding models locally, including several multi-billion-parameter models.
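
A rough back-of-the-envelope check for whether a model's weights fit in that memory is sketched below. It counts weights only and ignores activations, caches, and framework overhead, so it is a lower bound. The helper name and the example parameter count are assumptions for illustration.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Estimate memory needed for model weights alone, in decimal gigabytes.

    Hypothetical helper: real usage is higher once activations, caches,
    and runtime overhead are included.
    """
    return num_params * bytes_per_param / 1e9


# Approximate capacity taken from the text above.
AVAILABLE_GB = 128

# Example: a 7-billion-parameter model stored in 16-bit precision (2 bytes each).
estimate = weight_memory_gb(7_000_000_000, 2)
fits = estimate < AVAILABLE_GB
print(f"weights: {estimate:.1f} GB, fits in {AVAILABLE_GB} GB: {fits}")
```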

Those resources are still finite. Some models can be tested experimentally but remain expensive for routine deployment because of memory use, vector size, latency, or operational complexity. For that reason, the site treats quality scores and operational costs as separate but connected evidence.

Illustration of the GB10-based evaluation hardware used for local benchmarking

Testing framework

  • Benchmarking is cyclical rather than one-off.
  • New models and datasets are added as the field changes and project needs become clearer.
  • Public pages are updated continuously, while formal report versions are refreshed at larger project milestones.
  • Local experiments are preferred when model weights are available, while API-only tools are listed separately unless credentials and usage terms allow a fair comparison.