Yay! Thank you for this awesome research, Trey. Evaluating language plugins sounds like it would make a /great/ blog post. What alternatives are up next?
On 4 September 2015 at 18:45, Trey Jones <[email protected]> wrote:
> I've written up my analysis of the ElasticSearch language detection plugin
> that Erik recently enabled:
>
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation
>
> The short version is that it really likes Romanian (and Italian, and has a
> bit of a thing for French), and precision on English is great, but recall is
> poor (probably because of all the typos and other crap that go to enwiki
> that is still technically "English"). Chinese and Arabic are good.
>
> I think we could do better, and we should evaluate (a) other language
> detectors and (b) the effect of a good language detector on zero results
> rate (i.e., simulate sending queries to the right place and see how much of
> a difference it makes).
>
> Moderately pretty pictures included.
>
> —Trey
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> _______________________________________________
> Wikimedia-search mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>

--
Oliver Keyes
Count Logula
Wikimedia Foundation
