Similarity and Semantic Search

This topic is moving from exploration toward deployment questions. The open issue is no longer whether embeddings work at all, but which multilingual model is the safest default for archive search and where benchmark dataset coverage is still uneven.

Current View

The evidence already supports a practical recommendation. The main remaining caution is that Latvian benchmark coverage relies partly on synthetic data.

Default recommendation

Use Octen-Embedding-4B as the broadest current default for multilingual semantic search across the four target languages.

Language-specific signal

Octen-Embedding-8B leads on Finnish, while Octen-Embedding-8B-INT8 leads on Latvian once the synthetic similarity and paraphrase data are included.

Main caveat

The Latvian ranking no longer rests on FLORES alone, but the added MultiSimLex and TAPACO data are machine-translated, so the ranking should be validated against native Latvian archive material.

Operational signal

Octen-Embedding-4B is also the fastest measured model and has the smallest vectors, making it the strongest practical default for indexing experiments.

Evidence Notes

The published note summarizes the current multilingual embedding benchmark and its language-level recommendations.

🔎 Embedding 2️⃣ Secondary

Embedding Search Meets Archive RAG

Follow-up RAG-style tests show that dense embeddings are useful for semantic paraphrase search, but archival retrieval needs lexical, structured, and hybrid search as well.
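One common way to combine dense and lexical results is reciprocal rank fusion (RRF). The sketch below is illustrative only and is not claimed to be the method used in the follow-up tests. The document IDs, the two input rankings, and the constant k=60 (a customary default for RRF) are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in;
    the scores are summed across lists. k=60 is a customary default,
    not a value taken from the source.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query from two retrievers.
dense_ranking = ["doc_b", "doc_a", "doc_c"]    # embedding (semantic) search
lexical_ranking = ["doc_a", "doc_d", "doc_b"]  # keyword (lexical) search

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one of them stay in the list rather than being dropped.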

🔎 Embedding 1️⃣ Preliminary

Similarity and Semantic Search

Similarity and semantic search use embedding models to turn words, sentences, or passages into vectors so that related texts land close together in vector space. In archives, this mat...
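The idea that related texts end up close together can be illustrated with cosine similarity, a widely used closeness measure for embedding vectors. This is a minimal sketch under stated assumptions: the three-dimensional vectors and document names are invented toy values, not output from any Octen model, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Sort documents from most to least similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real model would produce these vectors.
docs = {
    "letter_about_harvest": [0.9, 0.1, 0.0],
    "parish_register": [0.1, 0.9, 0.2],
    "crop_report": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded query about harvests

ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

With these toy values the two harvest-related documents rank above the parish register, which is the behaviour semantic search relies on.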