Can I say how hilarious it is that we're discussing stop words in the context of Zipf Distributions? ;)
On 27 August 2015 at 16:29, Trey Jones <[email protected]> wrote:
> So, the technical term (in English) for these filler words is "stop
> words",[1] and stripping them is common practice (esp. back in the bad old
> days when we had less powerful computers, though it made searching for "to
> be or not to be" really really hard). Stripping them when a query fails is
> a reasonable fallback plan, as Kevin suggests. (And "between" is usually on
> the list, too, so searching /connection power laws zipf distribution/ gives
> fine results, and I'd certainly include "what's" and other contractions on
> the list.)
>
> The wiki link at [1] has links to several lists, including one with 29
> languages [2], though the link there is broken (but I found it on
> archive.org [3]). The Spanish and French, at least, are a little light
> (part of the problem is all the forms of a given verb, which they don't
> seem to include, just the most common ones). (And I'd suggest adding
> variants without diacritics.)
>
> Alternatively, a native speaker could take a frequency list of terms taken
> from search queries (or maybe just zero-result search queries) and make a
> custom list of stop words (which may account for question words showing up
> more, and other ways that queries differ from random text). It takes a
> couple of hours at most given the list. (I've recently done this for a
> personal project.)
>
> Anyway, I don't know if doing this in English will help a whole lot for
> full text search. The recent analysis I did for Dan on full text zero rates
> indicates that enwiki is not the problem.[4] enwiki had ~14% zero results
> over a one-week period in both July and August. Given the level of crap we
> see in our searches, I can't imagine that going below 10% (for full text),
> which would only lower the overall rate by ~2%.
>
> Let's ignore itwiki* for the moment; my analysis doesn't take into account
> the interwiki search there. Are we 100% sure dashboards do? I believe it
> does, I just don't want it to be true. :(
>
> It looks like we're going to have to pull down numbers for lots of
> individual non-English wikis, though we may get lucky if we look into
> individual ones and find big stupid activities (like nlwiktionary's .de
> domain name searches accounting for their 99% zero results rate).
>
> Anyway, I like stripping stop words better than relaxing AND to OR, unless
> there's some additional post-search ranking to sort the results into a more
> AND-ish order.
>
> - Trey
>
> [1] https://en.wikipedia.org/wiki/Stop_words
> [2] https://code.google.com/p/stop-words/
> [3]
> https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-2014-02-24.zip
> [4]
> https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Change_in_Zero_Results_Rate_by_Wiki_.28July_to_August.29
>
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Aug 27, 2015 at 9:21 AM, David Causse <[email protected]> wrote:
>>
>> On 27/08/2015 17:59, Kevin Smith wrote:
>>
>>
>> On Thu, Aug 27, 2015 at 4:30 AM, David Causse <[email protected]>
>> wrote:
>>>
>>> There's another feature we could work on after this one:
>>> Review the default AND operator between words. This seems to be in line
>>> with Moiz's survey results and "somewhat" related to the paper reviewed
>>> by Trey:
>>> Users ask questions, not keywords. For example, this query:
>>> what's the connection between power laws and zipf law [1]
>>> returns no result
>>>
>>> but:
>>> power laws zipf distribution [2]
>>> returns good results
>>
>>
>>
>> Earlier, I suggested ignoring "filler" words, but we thought elastic was
>> already doing scoring adjustments that would have a similar effect.
>> Apparently not, because a search for:
>>
>> connection between power laws zipf distribution
>>
>> brings up what look like pretty reasonable results. Throwing away
>> "what's", "the", and "and" before running the search would help a lot (at
>> least in this case).
>>
>>
>> Yes, the term that prevents finding the result is "what".
>> Elasticsearch will limit the effect of such words in the score but the
>> default AND will force all these words to be in the document.
>>
>> We also have some trouble with "what's" vs "what is"... I'll have a look.
>>

--
Oliver Keyes
Count Logula
Wikimedia Foundation

_______________________________________________
Wikimedia-search mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
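
For anyone who wants to play with the two ideas from the thread (strip stop words only when a query comes back empty, and derive a candidate stop list from query frequencies), here is a rough Python sketch. It is not what CirrusSearch or Elasticsearch does: `STOP_WORDS` is a tiny made-up list, `search` is a hypothetical callable standing in for whatever backend is in use, and `candidate_stop_words` only produces candidates that a native speaker would still need to review.

```python
from collections import Counter
from typing import Callable, Iterable, List

# Tiny illustrative stop list (an assumption for this sketch); a real list
# would be curated per language or derived from query logs.
STOP_WORDS = frozenset(
    {"a", "an", "and", "between", "is", "of", "the", "what", "what's"}
)


def strip_stop_words(query: str, stop_words: frozenset = STOP_WORDS) -> str:
    """Drop stop words, but keep the query intact if nothing would remain."""
    kept = [t for t in query.split() if t.lower() not in stop_words]
    return " ".join(kept) if kept else query


def search_with_fallback(
    query: str, search: Callable[[str], List[str]]
) -> List[str]:
    """Run the query as typed; retry without stop words only on zero results."""
    results = search(query)
    if results:
        return results
    stripped = strip_stop_words(query)
    if stripped.split() == query.split():
        return results  # nothing was removed, so a retry would be identical
    return search(stripped)


def candidate_stop_words(queries: Iterable[str], top_n: int) -> List[str]:
    """Return the top_n most frequent query terms as stop-word candidates."""
    counts = Counter(t.lower() for q in queries for t in q.split())
    return [term for term, _ in counts.most_common(top_n)]
```

With this toy list, `strip_stop_words("what's the connection between power laws and zipf law")` gives `connection power laws zipf law`. The "keep the query if nothing remains" guard is there for the "to be or not to be" case Trey mentions.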
