I've written up my analysis of the Elasticsearch language detection plugin
that Erik recently enabled:


The short version is that it really likes Romanian (and Italian, and has a
bit of a thing for French), so it assigns those languages to queries more
often than it should. Precision on English is great, but recall is poor,
probably because enwiki gets a lot of typos and other crap that is still
technically "English". Chinese and Arabic are good.
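
For anyone who wants the precision/recall distinction spelled out, here is a
minimal Python sketch of how per-language scores can be computed from
hand-labeled queries. It is not the code used in the analysis. The function
name and the toy data are made up for illustration and do not reflect real
results.

```python
from collections import Counter


def per_language_precision_recall(pairs):
    """Compute (precision, recall) per language.

    pairs: iterable of (true_lang, predicted_lang) tuples.
    A score is None when its denominator is zero.
    """
    true_counts = Counter()   # how often each language is the correct answer
    pred_counts = Counter()   # how often the detector chose each language
    correct = Counter()       # how often the detector chose it and was right
    for true_lang, pred_lang in pairs:
        true_counts[true_lang] += 1
        pred_counts[pred_lang] += 1
        if true_lang == pred_lang:
            correct[true_lang] += 1

    results = {}
    for lang in set(true_counts) | set(pred_counts):
        precision = correct[lang] / pred_counts[lang] if pred_counts[lang] else None
        recall = correct[lang] / true_counts[lang] if true_counts[lang] else None
        results[lang] = (precision, recall)
    return results


# Hypothetical toy data: (true language, detected language)
sample = [
    ("en", "en"), ("en", "ro"), ("en", "it"), ("en", "en"),
    ("ro", "ro"), ("zh", "zh"),
]
scores = per_language_precision_recall(sample)
```

In this toy sample, everything labeled "en" by the detector really is English
(high precision), but half of the English queries get labeled as something
else (low recall), while "ro" picks up queries that are not Romanian (low
precision).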

I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero results
rate (i.e., simulate sending queries to the right place and see how much of
a difference it makes).

Moderately pretty pictures included.


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
