Large language models (LLMs) can also be asked to find named entities in text. Instead of running a dedicated NER model, the system prompts the model with an instruction such as: "list the people, organisations, and places in this passage".
This is useful when no good dedicated model exists, or when the project wants to test a new entity definition quickly. It is less useful when millions of sentences need to be processed, because LLM extraction is much slower.
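As an illustrative sketch only (the exact prompt used in this benchmark is not documented here), an instruction of that kind might be built like this; the function name and output format are assumptions:

```python
def build_ner_prompt(passage: str) -> str:
    """Build a simple instruction-style NER prompt.

    Illustrative only: the line-per-entity output format requested here
    is a hypothetical choice, not the benchmark's actual prompt.
    """
    return (
        "List the people, organisations, and places in this passage.\n"
        "Return one entity per line in the form TYPE: text, where TYPE is "
        "PERSON, ORGANISATION, or PLACE.\n\n"
        f"Passage: {passage}"
    )

prompt = build_ner_prompt("Arvo Pärt was born in Paide, Estonia.")
```

Asking for a constrained, line-oriented output format like this makes the model's answer much easier to parse back into evaluation labels than free-running prose.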
What we tested
This secondary note uses the LLM results from the current NER benchmark. The models were run locally through Ollama with prompt-based extraction, and the outputs were mapped to the same PER / ORG / LOC evaluation scheme used for dedicated NER models.
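The mapping step can be sketched as follows. This is a minimal sketch under assumptions: the label aliases and line format below are hypothetical, not the benchmark's actual mapping tables.

```python
# Hypothetical aliases from free-text LLM labels to the PER/ORG/LOC scheme.
LABEL_MAP = {
    "person": "PER", "people": "PER",
    "organisation": "ORG", "organization": "ORG",
    "place": "LOC", "location": "LOC",
}

def parse_llm_entities(raw_output: str) -> list[tuple[str, str]]:
    """Parse lines like 'PERSON: Arvo Pärt' into (tag, text) pairs,
    silently dropping lines whose label is not recognised."""
    entities = []
    for line in raw_output.splitlines():
        if ":" not in line:
            continue
        label, _, text = line.partition(":")
        tag = LABEL_MAP.get(label.strip().lower())
        if tag and text.strip():
            entities.append((tag, text.strip()))
    return entities
```

Dropping unrecognised labels rather than guessing keeps the evaluation conservative: a hallucinated entity type counts against precision only if it survives the mapping.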
The headline score is F1, shown below as a percentage. For each language, the table highlights the strongest result achieved by any model in the local LLM set.
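F1 can be computed in the usual way from precision and recall over predicted versus gold entities. A minimal sketch, assuming exact matching on (type, text) pairs, which may differ from the benchmark's actual span-level scoring:

```python
from collections import Counter

def micro_f1(predicted: list[tuple[str, str]],
             gold: list[tuple[str, str]]) -> float:
    """Micro-averaged F1 over (type, text) entity pairs, counting duplicates."""
    if not predicted or not gold:
        return 0.0
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # True positives: overlap between predicted and gold multisets.
    tp = sum(min(pred_counts[e], gold_counts[e]) for e in pred_counts)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two gold entities plus one spurious entity gives precision 0.5 and recall 0.5, hence F1 = 0.5.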
Headline results
| Language | Best local LLM | Dataset | F1 (%) |
|---|---|---|---|
| 🇪🇪 Estonian | gpt-oss_120b | et_modern | 64.6 |
| 🇫🇮 Finnish | gpt-oss_120b | fi_multileg | 50.1 |
| 🇱🇻 Latvian | gpt-oss_120b | lv_diverse | 76.9 |
| 🪆 Russian | gpt-oss_120b | ru_modern | 88.8 |
Evaluation setup
Evaluation models
Evaluation datasets
- 🌐 Multilingual: et_multileg.conll (100 804), fi_multileg.conll (96 488), lv_multileg.conll (110 860)
- 🇪🇪 Estonian: et_modern.conll (165 947), et_old.conll (54 069)
- 🇫🇮 Finnish: fi_old.conll (51 839)
- 🇱🇻 Latvian: lv_modern.conll (21 951), lv_diverse.conll (199 155)
- 🪆 Russian: ru_modern.conll (47 187), ru_oldish.conll (18 838)
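The .conll files above are assumed here to follow the common token-per-line format, with one tab-separated token and tag per line and a blank line between sentences; a minimal reader under that assumption (the real benchmark files may use more columns):

```python
def read_conll(lines):
    """Parse token-per-line CoNLL data (token<TAB>tag, blank line between
    sentences) into a list of sentences of (token, tag) pairs.

    Assumes a two-column format; this is a sketch, not the benchmark's loader.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.rpartition("\t")
        current.append((token, tag))
    if current:  # flush a final sentence with no trailing blank line
        sentences.append(current)
    return sentences
```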
Early interpretation
LLM extraction is not the main indexing recommendation at this stage. On most datasets, the best dedicated NER model is more accurate and much faster.
The value of LLMs is different: they are flexible. They can help with targeted enrichment, low-resource cases, rapid testing of new entity types, or cases where a dedicated model performs poorly and a slower fallback is acceptable.
What to update next
The next useful update is to document the prompt format and add more dataset-level detail, especially for cases where LLM extraction still misses entities that a dedicated model can capture reliably. The benchmark should also keep testing smaller and faster local models, because the operational question is not only whether LLMs can work, but whether they can work at a realistic cost.