Re: [Wikimedia-search] testing the value of a reverse index

David Causse Thu, 30 Jul 2015 08:31:23 -0700

On 30/07/2015 16:50, Trey Jones wrote:

Thanks for all the technical details! So much going on... so much to learn!
I didn't know/remember that suggester only works on titles and redirects. Then, obviously, using just that would be great! That's gotta be a 98%+ reduction in text.


Yes, and Erik suggested that we could try to inject more content.

This option already exists and we could turn it on, but I suspect it was disabled in the wmf config for good reasons.

I like your reasonable process—it's quite reasonable!
You asked about which wikis to look at. Are en, fr, de, it and es the ones we can best read? (I'm okay with that to start, by the way; it optimizes developer time.) By number of zero-result queries from my 500K sample, the top five are en, de, pt, ja, and ru, though that sample is small. By overall size, it's en, sv, de, nl, and fr. Clearly enwiki dominates, and I'm guessing the performance will differ across languages, so I don't have a clear suggestion here. But enwiki makes sense because it's the biggest on every front, and itwiki, because it does the most interesting crosswiki stuff.
Hmm. Is enwiki big enough to drag everything else along if it's very beneficial there?

If we have a process that works for enwiki it'd be "easy" to reiterate over other wikis. I'd say we could start with enwiki.

    We have some technical restrictions here: if we activate this
    setting on one wiki we'll need to reindex most of the wikis
    because we have cross-wiki searches.

    wikiA can query wikiB's index; if wikiB's index is not updated
    with the correct settings, the query will fail.

...

    So it's hard to work with mixed settings with the current
    architecture :(
I'm a bit confused. Will elasticsearch do really bad things if you ask it to search in a way that isn't enabled on a particular index? Does fail mean zero results, or does it waste lots of CPU and start throwing errors? Is there a reasonable way to assess what features a query needs and whether a given index supports those features? Sounds terribly ugly, but I had to ask.


"Fails" means a big red message displayed to the user :)

Elasticsearch can run a single query over multiple indexes. If you ask for a suggest field that's missing in one of the indexes you requested, the whole query will fail. Today we have a config per wiki and not a config per index. Having a config per index would imply a big refactoring, and we would have to drop this convenient "multi-index" feature.

    Note that we will not be able to measure things like:
    search is a better suggestion than samech for the query saerch.


    This seems impossible to check without human review. We could do
    another run with queries where a suggestion was found and generate
    a diff that will be reviewed by hand:
    user_query: saerch
    prod_suggestion: samech
    with_reverse: search
Are you thinking of manual review of the suggestions, or of a diff of the results of the suggestions? I'm assuming just looking at the terms: I feel that a fluent speaker could easily tell that search is better than samech just by looking at the words. (So I could help review in English, at least.)

Yes, the idea was to extract only the suggestions that differ from the ones we have in the search logs.
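
A minimal sketch of that extraction step, in Python. It assumes both runs are available as plain query-to-suggestion mappings; the function name, the data layout, and the second sample query are hypothetical and only serve the illustration. The output rows reuse the user_query / prod_suggestion / with_reverse fields from the example above.

```python
from typing import Dict, List


def suggestions_that_differ(
    prod: Dict[str, str], with_reverse: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return one row per query whose suggestion changed between the two runs.

    Both arguments map a user query to the suggestion returned for it
    (a hypothetical layout, not the actual log format).
    """
    rows = []
    for query, prod_suggestion in prod.items():
        new_suggestion = with_reverse.get(query)
        # Skip queries missing from the second run and unchanged suggestions.
        if new_suggestion is None or new_suggestion == prod_suggestion:
            continue
        rows.append(
            {
                "user_query": query,
                "prod_suggestion": prod_suggestion,
                "with_reverse": new_suggestion,
            }
        )
    return rows


# "saerch" comes from the thread; "wikipedai" is made up for illustration.
prod_run = {"saerch": "samech", "wikipedai": "wikipedia"}
reverse_run = {"saerch": "search", "wikipedai": "wikipedia"}

diff_rows = suggestions_that_differ(prod_run, reverse_run)
for row in diff_rows:
    for field, value in row.items():
        print(f"{field}: {value}")
```

Only the rows where the two runs disagree are kept, so the pile handed to a human reviewer stays small.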

That said, there are two things I can think of that would make for at least a weak heuristic: edit distance and frequency.
Since there are only going to be a small number of suggestions in each case, running full edit distance on them offline wouldn't be too costly. There are many versions of edit distance you could use. With plain dumb E.D., these are both distance 2, but with reversals counting less than a full insert + delete, "search" is better than "samech". You can also do more generic weighted edit distance to allow typos (x is more likely for z than p for z) or likely spelling errors (mixing up vowels or double vs single letters) to count less.
As for frequency, you could look at overall term frequency or document frequency in the index, or if that's too expensive, get a generic frequency list for the language in question. "search" is clearly better than "samech" by any frequency metric.
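
The edit-distance point can be sketched in a few lines of Python. This is only an illustration, not the method used anywhere in CirrusSearch: the function name and the transposition_cost parameter are made up here, and the variant shown is the "optimal string alignment" form of Damerau-Levenshtein, where swapping two adjacent letters is a single operation.

```python
from typing import Optional


def edit_distance(a: str, b: str, transposition_cost: Optional[float] = None) -> float:
    """Levenshtein distance between a and b.

    If transposition_cost is given, swapping two adjacent characters is
    allowed at that cost (optimal string alignment variant).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            best = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution,
            )
            if (
                transposition_cost is not None
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # The last two characters are swapped: count it as one move.
                best = min(best, d[i - 2][j - 2] + transposition_cost)
            d[i][j] = best
    return d[-1][-1]


for candidate in ("search", "samech"):
    plain = edit_distance("saerch", candidate)
    with_swaps = edit_distance("saerch", candidate, transposition_cost=1)
    print(candidate, plain, with_swaps)
```

With plain distance both candidates are 2 edits away from "saerch"; once a swap counts as one edit, "search" drops to 1 while "samech" stays at 2. A weighted variant for keyboard typos or vowel mix-ups would replace the fixed substitution cost of 1 with a lookup table.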

With an index in lab I can extract the frequencies; you'll have something like:


search:1345
search engine:122
google search:32
google search engine:2

You will have to filter on space to keep only unigrams if it's better for you.
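
A rough Python sketch of that filtering, assuming the dump really is one "term:count" entry per line as in the sample above (the function names are hypothetical):

```python
from typing import Dict, Iterable


def parse_frequencies(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'term:count' lines into a dict, skipping blank lines."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so a colon inside the term is preserved.
        term, _, count = line.rpartition(":")
        freqs[term] = int(count)
    return freqs


def unigrams_only(freqs: Dict[str, int]) -> Dict[str, int]:
    """Keep only the entries whose term contains no space."""
    return {term: count for term, count in freqs.items() if " " not in term}


sample = """
search:1345
search engine:122
google search:32
google search engine:2
"""

all_freqs = parse_frequencies(sample.splitlines())
unigram_freqs = unigrams_only(all_freqs)
print(unigram_freqs)
```

The unigram counts could then serve as the frequency signal when comparing two candidate suggestions for the same query.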

We could take hand-reviewed results (seems like it'd be quick work; I'd do a pile from enwiki) as training data to fit a model that would allow us to predict which suggestions are likely to be better.
If/when we do roll it out to production, we could obviously further test by giving multiple suggestions and seeing which ones users like.


This is another very good idea :)

_______________________________________________
Wikimedia-search mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
