Thanks!
This is awesome.
Concerning soburdia: the typo is in the first 2 chars, so our misspelling
identification will fail; searching for sucurbia properly displays
"suburbia" as a "did you mean" suggestion. This was one of the
enhancements we tried to implement, but we are currently blocked by a
bug in Elasticsearch. I hope it's not a common pattern, because we'll
add a second error with language detection...
Is it possible to identify how many queries are one word/two words/three
words?
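To make the question concrete, here is a minimal sketch of the kind of count I have in mind. It assumes the queries are available as plain strings and that splitting on whitespace is a good enough definition of "word" (it is not for Chinese and other unsegmented scripts). The function name and the sample queries are only illustrative.

```python
from collections import Counter


def word_count_histogram(queries):
    """Count queries by their number of whitespace-separated words."""
    histogram = Counter()
    for query in queries:
        n_words = len(query.split())
        if n_words:  # skip empty or whitespace-only queries
            histogram[n_words] += 1
    return histogram


# Illustrative sample only, not real query-log data.
sample = ["suburbia", "граничащее", "language detection", "zero results rate"]
hist = word_count_histogram(sample)
for n_words in sorted(hist):
    print(n_words, hist[n_words])
```

Run over a real query sample, the share of the one-word bucket would tell us how much the weakness described below matters.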
I'm asking this question because there's another weakness in this
language detector. Characters at word boundaries seem to carry some
valuable information about language features, and the detector fails
to benefit from it when the query is a single word. Running the
detector with additional leading and trailing spaces changed the
results significantly.
For example, граничащее (Russian):
Detecting "граничащее" returns bg at 0.99
But detecting " граничащее " returns ru at 0.57 and bg at 0.42
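A small sketch of why padding could matter, assuming the detector works on character n-grams (I have not checked how the plugin actually tokenizes, so treat this as an illustration only). With padding, a single word yields extra n-grams that mark its start and end; without it, those features never appear. The helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n=3):
    """Return all contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


word = "граничащее"
bare = set(char_ngrams(word))
padded = set(char_ngrams(" " + word + " "))

# N-grams that exist only when the word boundaries are made explicit.
boundary_only = sorted(padded - bare)
print(boundary_only)
```

Here the padded form adds exactly two trigrams, one for the word prefix and one for the suffix, which is the sort of boundary information a one-word query otherwise lacks.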
But in the end I agree with your analysis in "Stupid language
detection", mainly because the detector does not weight its results by
wiki size (ru should be weighted higher because ruwiki is larger
than bgwiki), and that is what matters for us. We're looking for
results; we don't care too much about the actual language of the query.
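One possible way to do that weighting, as a rough sketch and not something the plugin offers: multiply each detector score by the log of the wiki's size and renormalize. The log scaling is my own assumption, and the wiki sizes below are made-up placeholders, not real article counts; only the ru/bg scores come from the example above.

```python
import math


def reweight_by_wiki_size(scores, wiki_sizes):
    """Scale detector scores by log(wiki size), then renormalize to sum to 1.

    Languages without a known wiki size are dropped.
    """
    weighted = {
        lang: score * math.log(wiki_sizes[lang])
        for lang, score in scores.items()
        if lang in wiki_sizes
    }
    total = sum(weighted.values())
    if total <= 0:
        return {}
    return {lang: value / total for lang, value in weighted.items()}


detector_scores = {"ru": 0.57, "bg": 0.42}    # from the example above
wiki_sizes = {"ru": 1_000_000, "bg": 200_000}  # placeholder values, NOT real counts
reweighted = reweight_by_wiki_size(detector_scores, wiki_sizes)
print(reweighted)
```

With these placeholder sizes the ru score moves up and bg moves down; a weighting this mild would not be enough to overturn the 0.99 bg result for the unpadded query.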
On 05/09/2015 00:45, Trey Jones wrote:
I've written up my analysis of the ElasticSearch language detection
plugin that Erik recently enabled:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation
The short version is that it really likes Romanian (and Italian, and
has a bit of a thing for French), and precision on English is great,
but recall is poor (probably because of all the typos and other crap
that go to enwiki that is still technically "English"). Chinese and
Arabic are good.
I think we could do better, and we should evaluate (a) other language
detectors and (b) the effect of a good language detector on zero
results rate (i.e., simulate sending queries to the right place and
see how much of a difference it makes).
Moderately pretty pictures included.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
Wikimedia-search mailing list
Wikimedia-search@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search