Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-08 Thread Trey Jones
There's a lot to catch up on, but some quick and easy stuff first, in response to David's comments. For queries that are marked as "language" (775 queries), the distribution of token counts (word counts) up to 10 is below:
160  1 tokens
152  2 tokens
141  3 tokens
 91  4 tokens
 63  5 tokens
 49  6 tokens
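A tally like this token-count distribution can be produced in a few lines. The sketch below is only an illustration: it assumes tokens are whitespace-separated, and the sample queries are hypothetical, not taken from the enwiki zero-results data.

```python
from collections import Counter


def token_count_distribution(queries):
    """Map token count -> number of queries with that many tokens.

    Assumption: a "token" is a whitespace-separated word.
    """
    return Counter(len(q.split()) for q in queries)


# Hypothetical sample queries, for illustration only.
sample = [
    "suburbia",
    "википедия",
    "el niño",
    "guerre et paix",
    "der zauberberg roman",
]

dist = token_count_distribution(sample)
for n in sorted(dist):
    # Same "count, then token length" layout as the list in the message.
    print(f"{dist[n]:>4}  {n} tokens")
```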

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-07 Thread David Causse
Thanks! This is awesome. Concerning soburdia: the typo is in the first 2 chars, so our misspelling identification will fail; searching for sucurbia properly displays "suburbia" as a "did you mean" suggestion. This was one of the enhancements we tried to implement, but we are currently blocked by a
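To illustrate why a typo in the first two characters can defeat a suggester, here is a minimal sketch. It assumes the suggester only considers dictionary terms that share an exact prefix with the query (comparable to the `prefix_length` option of the Elasticsearch suggesters) and then filters by edit distance. The function names, the tiny dictionary, and the parameter values are hypothetical, not the production configuration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def suggest(query, dictionary, prefix_length=2, max_edits=2):
    """Return terms sharing the query's first `prefix_length` chars
    and lying within `max_edits` edits of the query."""
    prefix = query[:prefix_length]
    return [t for t in dictionary
            if t != query
            and t.startswith(prefix)
            and edit_distance(query, t) <= max_edits]


# Hypothetical dictionary, for illustration only.
terms = ["suburbia", "suburban", "sober"]

print(suggest("sucurbia", terms))                   # typo after the prefix
print(suggest("soburdia", terms))                   # typo inside the prefix
print(suggest("soburdia", terms, prefix_length=0))  # no prefix requirement
```

Under these assumptions, "sucurbia" keeps the "su" prefix and is matched, while "soburdia" is only matched once the prefix requirement is dropped, at the cost of comparing against many more candidate terms.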

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-05 Thread Oliver Keyes
Ooh, excellent! Thanks Nemo!
On 5 September 2015 at 12:33, Federico Leva (Nemo) wrote:
> Oliver Keyes, 05/09/2015 02:24:
>>
>> Well, we have the implementation of Kolkus's algorithm in Java -
>> although it's a training-based model so it'll need a known dataset to
>> run off.
>
>
> Niklas made a

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-05 Thread Federico Leva (Nemo)
Oliver Keyes, 05/09/2015 02:24:
> Well, we have the implementation of Kolkus's algorithm in Java - although it's a training-based model so it'll need a known dataset to run off.
Niklas made a dataset for one of the available language detectors, using some millions of translatewiki.net documents in

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-04 Thread Oliver Keyes
Well, we have the implementation of Kolkus's algorithm in Java - although it's a training-based model so it'll need a known dataset to run off.
On 4 September 2015 at 20:08, Trey Jones wrote:
> Thanks, Oliver!
>
> I'm not sure what's up next. We could look around for other available
> detectors,

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-04 Thread Trey Jones
Thanks, Oliver! I'm not sure what's up next. We could look around for other available detectors, algorithms, or ideas to try. Fortunately we don't need to integrate them to test them—we can just run the queries and evaluate the results. We could also try something of our own devising, because it'
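Running a detector over the queries offline and scoring it might look like the sketch below. It assumes each query has a manually assigned language tag and a detector prediction, and reports per-language precision and recall. The helper name and the sample pairs are hypothetical, not results from the analysis.

```python
from collections import Counter


def precision_recall(pairs):
    """pairs: iterable of (gold_language, predicted_language).

    Returns {language: (precision, recall)}; a ratio with a zero
    denominator is reported as 0.0.
    """
    true_pos, predicted, gold = Counter(), Counter(), Counter()
    for g, p in pairs:
        gold[g] += 1
        predicted[p] += 1
        if g == p:
            true_pos[g] += 1
    scores = {}
    for lang in set(gold) | set(predicted):
        prec = true_pos[lang] / predicted[lang] if predicted[lang] else 0.0
        rec = true_pos[lang] / gold[lang] if gold[lang] else 0.0
        scores[lang] = (prec, rec)
    return scores


# Hypothetical (gold, predicted) pairs, for illustration only.
pairs = [("es", "es"), ("es", "pt"), ("ru", "ru"), ("de", "de"), ("fr", "es")]

scores = precision_recall(pairs)
for lang in sorted(scores):
    prec, rec = scores[lang]
    print(f"{lang}: precision={prec:.2f} recall={rec:.2f}")
```

Because this only needs the detector's output for each query, any candidate detector can be compared this way without being integrated into search first.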

Re: [Wikimedia-search] Analysis of ElasticSearch language detection plugin against enwiki zero-results queries

2015-09-04 Thread Oliver Keyes
Yay! Thank you for this awesome research, Trey. Evaluating language plugins sounds like it would make a /great/ blog post. What alternatives are up next?
On 4 September 2015 at 18:45, Trey Jones wrote:
> I've written up my analysis of the ElasticSearch language detection plugin
> that Erik recent