Can I say how hilarious it is that we're discussing stop words in the
context of Zipf Distributions? ;)

On 27 August 2015 at 16:29, Trey Jones <> wrote:
> So, the technical term (in English) for these filler words is "stop
> words",[1] and stripping them is common practice (esp. back in the bad old
> days when we had less powerful computers—though it made searching for "to be
> or not to be" really really hard). Stripping them when a query fails is a
> reasonable fallback plan, as Kevin suggests. (And "between" is usually on
> the list, too, so searching /connection power laws zipf distribution/ gives
> fine results, and I'd certainly include "what's" and other contractions on
> the list.)
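
(A minimal sketch of the fallback described above, in Python. The stop list, the `search` callable, and both function names are hypothetical stand-ins for illustration, not anything in CirrusSearch or Elasticsearch. The idea is only: run the query as typed, and retry without stop words when it returns nothing.)

```python
import re

# Tiny illustrative stop list (an assumption); a real one would come from a
# per-language list such as those linked from [1].
STOP_WORDS = {
    "what's", "what", "is", "the", "and", "between",
    "a", "an", "of", "to", "be", "or", "not",
}

def strip_stop_words(query, stop_words=STOP_WORDS):
    """Return the query with stop words removed (case-insensitive)."""
    tokens = re.findall(r"[\w']+", query)
    kept = [t for t in tokens if t.lower() not in stop_words]
    # Never return an empty query: "to be or not to be" is all stop words.
    return " ".join(kept) if kept else query

def search_with_fallback(query, search):
    """Run `search`; if it returns nothing, retry once without stop words."""
    results = search(query)
    if results:
        return results
    stripped = strip_stop_words(query)
    if stripped != query:
        return search(stripped)
    return results
```

Because the stripping only happens after a failed search, a query made up entirely of stop words is still tried verbatim first, and is never reduced to an empty string.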
> The wiki link at [1] has links to several lists, including one with 29
> languages [2]—though the link there is broken (but I found it on
> [3]). The Spanish and French, at least, are a little light (part
> of the problem is all the forms of a given verb—which they don't seem to
> include, just the most common ones). (And I'd suggest adding variants
> without diacritics.)
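
(One possible way to add the diacritic-free variants, sketched with Python's standard `unicodedata` module. The function names are made up for this example. Note that dropping accents can make a stop word collide with an ordinary word in some languages, so the output would still want a native speaker's review.)

```python
import unicodedata

def strip_diacritics(word):
    """Remove combining marks, e.g. an acute accent, from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def with_plain_variants(stop_words):
    """Return the stop list plus a diacritic-free copy of every entry."""
    return set(stop_words) | {strip_diacritics(w) for w in stop_words}
```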
> Alternatively, a native speaker could take a frequency list of terms taken
> from search queries (or maybe just zero search queries) and make a custom
> list of stop words (which may account for question words showing up more,
> and other ways that queries differ from random text). It takes a couple of
> hours at most given the list. (I've recently done this for a personal
> project.)
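
(The frequency list itself is cheap to produce. A rough sketch, assuming the queries are available as plain strings; the tokenizer regex and the function name are illustrative choices, and picking the actual stop words from the output remains a manual job for a native speaker.)

```python
from collections import Counter
import re

def term_frequencies(queries, top_n=50):
    """Count terms across queries and return the top_n as (term, count) pairs."""
    counts = Counter()
    for query in queries:
        counts.update(re.findall(r"[\w']+", query.lower()))
    return counts.most_common(top_n)
```

Feeding it only the zero-result queries would give the variant mentioned above, where question words are likely to rank higher than they would in running text.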
> Anyway, I don't know if doing this in English will help a whole lot for full
> text search. The recent analysis I did for Dan on full text zero rates
> indicates that enwiki is not the problem.[4] enwiki had ~14% zero results
> over a one-week period in both July and August. Given the level of crap we
> see in our searches, I can't imagine that going below 10% (for full text),
> which would only lower the overall rate by ~2%.
> Let's ignore itwiki* for the moment; my analysis doesn't take into account
> the interwiki search there. Are we 100% sure the dashboards do? I believe
> they do, I just don't want it to be true. :(
> It looks like we're going to have to pull down numbers for lots of
> individual non-English wikis, though we may get lucky if we look into
> individual ones and find big stupid activities (like nlwiktionary's .de
> domain name searches accounting for their 99% zero results rate.)
> Anyway, I like stripping stop words better than relaxing AND to OR, unless
> there's some additional post-search ranking to sort the results into a more
> AND-ish order.
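
(To make "AND-ish order" concrete, here is a toy sketch over an in-memory dict of documents. It is not how Elasticsearch scores anything, and every name in it is hypothetical. It matches with OR, then ranks documents by how many distinct query terms they contain, so documents that would have satisfied AND come first.)

```python
def or_search_and_ranked(query_terms, documents):
    """OR-match documents, ranked by number of distinct query terms matched.

    `documents` maps a document id to its text. Ties are broken by id so the
    ordering is deterministic.
    """
    terms = {t.lower() for t in query_terms}
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        matched = len(terms & words)
        if matched:  # OR semantics: one matching term is enough
            scored.append((matched, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```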
> —Trey
> [1]
> [2]
> [3]
> [4]
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
> On Thu, Aug 27, 2015 at 9:21 AM, David Causse <> wrote:
>> Le 27/08/2015 17:59, Kevin Smith a écrit :
>> On Thu, Aug 27, 2015 at 4:30 AM, David Causse <>
>> wrote:
>>> There's another feature we could work on after this one:
>>> Review the default AND operator between words. This seems to be in line
>>> with Moiz's survey results and "somewhat" related to the paper reviewed by
>>> Trey :
>>> Users ask questions not keywords, for example this query :
>>> what's the connection between power laws and zipf law [1]
>>> returns no result
>>> but:
>>> power laws zipf distribution [2]
>>> returns good results
>> Earlier, I suggested ignoring "filler" words, but we thought elastic was
>> already doing scoring adjustments that would have a similar effect.
>> Apparently not, because a search for:
>> connection between power laws zipf distribution
>> brings up what look like pretty reasonable results. Throwing away
>> "what's", "the", and "and" before running the search would help a lot (at
>> least in this case).
>> Yes, the term that prevents finding the result is "what".
>> Elasticsearch will limit the effect of such words in the score but the
>> default AND will force all these words to be in the document.
>> We also have some trouble with "what's" vs "what is"... I'll have a look.
>> _______________________________________________
>> Wikimedia-search mailing list

Oliver Keyes
Count Logula
Wikimedia Foundation