I've written up my analysis of the Elasticsearch language detection plugin
that Erik recently enabled:

https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation

The short version is that it really likes Romanian (and Italian, and has a
bit of a thing for French), so it assigns those languages far too often.
Precision on English is great, but recall is poor (probably because of all
the typos and other crap sent to enwiki that still technically counts as
"English"). Chinese and Arabic are good.

I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero results
rate (i.e., simulate sending queries to the right place and see how much of
a difference it makes).
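One way (b) could be simulated is sketched below in Python. This is only a
sketch under assumptions: `detect` and `search` are hypothetical stand-ins for
a language detector and a per-wiki hit-count lookup, and the toy index is
invented. Nothing here reflects how the real search cluster is queried.

```python
def zero_results_rate(hit_counts):
    """Fraction of queries that returned no hits."""
    hit_counts = list(hit_counts)
    if not hit_counts:
        return 0.0
    return sum(1 for n in hit_counts if n == 0) / len(hit_counts)

def simulate_rerouting(queries, detect, search, home="en"):
    """Return (baseline_rate, rerouted_rate) for a list of query strings.

    detect(query) -> language code or None   (hypothetical detector)
    search(wiki, query) -> number of hits    (hypothetical lookup)
    """
    baseline, rerouted = [], []
    for q in queries:
        hits = search(home, q)
        baseline.append(hits)
        # Only retry elsewhere when the home wiki found nothing.
        if hits == 0:
            lang = detect(q)
            if lang is not None and lang != home:
                hits = search(lang, q)
        rerouted.append(hits)
    return zero_results_rate(baseline), zero_results_rate(rerouted)

# Toy stand-ins, invented for illustration.
fake_index = {("en", "cat"): 5, ("fr", "chat noir"): 3}
search = lambda wiki, q: fake_index.get((wiki, q), 0)
detect = lambda q: {"chat noir": "fr", "asdfgh": None}.get(q, "en")

base_rate, new_rate = simulate_rerouting(["cat", "chat noir", "asdfgh"],
                                         detect, search)
```

The difference between the two rates is the number I'd want to see for each
candidate detector; a query like "asdfgh" stays a zero-result query either way.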

Moderately pretty pictures included.

—Trey

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search