Similarity and Semantic Search
This topic is moving from exploration toward deployment questions. The open issue is no longer whether embeddings work at all, but which multilingual model is the safest default for archive search and where dataset coverage is still uneven.
Current View
This topic already supports a practical recommendation, with the main caution now focused on synthetic Latvian benchmark coverage.
Default recommendation
Use Octen-Embedding-4B as the broadest current default for multilingual semantic search across the four target languages.
Language-specific signal
Octen-Embedding-8B leads on Finnish, while Octen-Embedding-8B-INT8 leads on Latvian once the synthetic similarity and paraphrase data are added to the benchmark.
Main caveat
The Latvian ranking is no longer FLORES-only, but the added MultiSimLex and TAPACO data are machine-translated and should be validated against native Latvian archive material.
Operational signal
Octen-Embedding-4B is also the fastest measured model and has the smallest vectors, making it the strongest practical default for indexing experiments.
Evidence Notes
The published note summarizes the current multilingual embedding benchmark and its language-level recommendations.
Embedding Search Meets Archive RAG
Follow-up RAG-style tests show that dense embeddings are useful for semantic paraphrase search, but archival retrieval needs lexical, structured, and hybrid search as well.
Similarity and Semantic Search
Similarity and semantic search use embedding models to turn words, sentences, or passages into vectors so that related texts land close together in vector space. In archives, this mat...
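The idea that related texts land close together can be illustrated with a minimal Python sketch. It ranks documents by cosine similarity to a query vector. The three-dimensional vectors and document names below are invented for illustration only. They are not output from any Octen model, and a real embedding would have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for real embeddings.
docs = {
    "parish_register": [0.9, 0.1, 0.0],
    "church_records": [0.8, 0.2, 0.1],
    "harbour_tax_ledger": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy setup the two church-related documents score close to 1 and the tax ledger scores 0, which is the behavior semantic search relies on. In practice the vectors would come from an embedding model, and the ranking would usually be served from a vector index rather than a full scan.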