Large language models (LLMs) can also be asked to find named entities in text. Instead of running a dedicated NER model, the system prompts the model with an instruction such as: "list the people, organisations, and places in this passage".
This is useful when no good dedicated model exists, or when the project wants to test a new entity definition quickly. It is less useful when millions of sentences need to be processed, because LLM extraction is much slower.
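As an illustrative sketch only (the exact prompt used in this benchmark is not documented here), an instruction of that kind might be built like this; the function name and output format are assumptions:

```python
def build_ner_prompt(passage: str) -> str:
    """Build a simple instruction-style NER prompt.

    Illustrative only: the line-per-entity output format requested here
    is a hypothetical choice, not the benchmark's actual prompt.
    """
    return (
        "List the people, organisations, and places in this passage.\n"
        "Return one entity per line in the form TYPE: text, where TYPE is "
        "PERSON, ORGANISATION, or PLACE.\n\n"
        f"Passage: {passage}"
    )

prompt = build_ner_prompt("Arvo Pärt was born in Paide, Estonia.")
```

Asking for a constrained, line-oriented output format like this makes the model's answer much easier to parse back into evaluation labels than free-running prose.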
What we tested
This secondary note uses the LLM results from the current NER benchmark. The models were run locally through Ollama with prompt-based extraction, and the outputs were mapped to the same PER / ORG / LOC evaluation scheme used for dedicated NER models.
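The mapping step can be sketched as follows. This is a minimal sketch under assumptions: the label aliases and line format below are hypothetical, not the benchmark's actual mapping tables.

```python
# Hypothetical aliases from free-text LLM labels to the PER/ORG/LOC scheme.
LABEL_MAP = {
    "person": "PER", "people": "PER",
    "organisation": "ORG", "organization": "ORG",
    "place": "LOC", "location": "LOC",
}

def parse_llm_entities(raw_output: str) -> list[tuple[str, str]]:
    """Parse lines like 'PERSON: Arvo Pärt' into (tag, text) pairs,
    silently dropping lines whose label is not recognised."""
    entities = []
    for line in raw_output.splitlines():
        if ":" not in line:
            continue
        label, _, text = line.partition(":")
        tag = LABEL_MAP.get(label.strip().lower())
        if tag and text.strip():
            entities.append((tag, text.strip()))
    return entities
```

Dropping unrecognised labels rather than guessing keeps the evaluation conservative: a hallucinated entity type counts against precision only if it survives the mapping.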
The headline score is F1, shown below as a percentage. For each language, the table highlights the strongest result achieved by any model in the local LLM set.
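F1 can be computed in the usual way from precision and recall over predicted versus gold entities. A minimal sketch, assuming exact matching on (type, text) pairs, which may differ from the benchmark's actual span-level scoring:

```python
from collections import Counter

def micro_f1(predicted: list[tuple[str, str]],
             gold: list[tuple[str, str]]) -> float:
    """Micro-averaged F1 over (type, text) entity pairs, counting duplicates."""
    if not predicted or not gold:
        return 0.0
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # True positives: overlap between predicted and gold multisets.
    tp = sum(min(pred_counts[e], gold_counts[e]) for e in pred_counts)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two gold entities plus one spurious entity gives precision 0.5 and recall 0.5, hence F1 = 0.5.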
Headline results
| Language | Best local LLM | Dataset | F1 (%) |
|---|---|---|---|
| 🇪🇪 Estonian | gpt-oss_120b | et_modern | 64.6 |
| 🇫🇮 Finnish | gpt-oss_120b | fi_multileg | 50.1 |
| 🇱🇻 Latvian | gpt-oss_120b | lv_diverse | 76.9 |
| 🪆 Russian | gpt-oss_120b | ru_modern | 88.8 |
Evaluation setup
Evaluation models
Evaluation datasets
- 🌐 Multilingual: et_multileg.conll (100 804), fi_multileg.conll (96 488), lv_multileg.conll (110 860)
- 🇪🇪 Estonian: et_modern.conll (165 947), et_old.conll (54 069)
- 🇫🇮 Finnish: fi_old.conll (51 839)
- 🇱🇻 Latvian: lv_modern.conll (21 951), lv_diverse.conll (199 155)
- 🪆 Russian: ru_modern.conll (47 187), ru_oldish.conll (18 838)
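The .conll files above are assumed here to follow the common token-per-line format, with one tab-separated token and tag per line and a blank line between sentences; a minimal reader under that assumption (the real benchmark files may use more columns):

```python
def read_conll(lines):
    """Parse token-per-line CoNLL data (token<TAB>tag, blank line between
    sentences) into a list of sentences of (token, tag) pairs.

    Assumes a two-column format; this is a sketch, not the benchmark's loader.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.rpartition("\t")
        current.append((token, tag))
    if current:  # flush a final sentence with no trailing blank line
        sentences.append(current)
    return sentences
```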
Early interpretation
LLM extraction is not the main indexing recommendation at this stage. On most datasets, the best dedicated NER model is more accurate and much faster.
The value of LLMs is different: they are flexible. They can help with targeted enrichment, low-resource cases, rapid testing of new entity types, or cases where a dedicated model performs poorly and a slower fallback is acceptable.
What to update next
The next useful update is to document the prompt format and add more dataset-level detail, especially for cases where LLM extraction still misses entities that a dedicated model can capture reliably. The benchmark should also keep testing smaller and faster local models, because the operational question is not only whether LLMs can work, but whether they can work at a realistic cost.