Can I say how hilarious it is that we're discussing stop words in the
context of Zipf Distributions? ;)

On 27 August 2015 at 16:29, Trey Jones <tjo...@wikimedia.org> wrote:
> So, the technical term (in English) for these filler words is "stop
> words",[1] and stripping them is common practice (esp. back in the bad old
> days when we had less powerful computers—though it made searching for "to be
> or not to be" really really hard). Stripping them when a query fails is a
> reasonable fallback plan, as Kevin suggests. (And "between" is usually on
> the list, too, so searching /connection power laws zipf distribution/ gives
> fine results, and I'd certainly include "what's" and other contractions on
> the list.)
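
A minimal sketch of the "strip stop words when a query fails" fallback, in Python. The stop list here is a tiny hypothetical one for illustration, and `search` is an assumed callable that returns a list of results; neither reflects the real search backend.

```python
# Hypothetical, deliberately tiny stop list; a real list would be much longer.
STOP_WORDS = {
    "what's", "the", "and", "between", "a", "an", "of", "is",
    "to", "be", "or", "not",
}


def strip_stop_words(query, stop_words=STOP_WORDS):
    """Drop stop words from a whitespace-tokenized query.

    If every term is a stop word (e.g. "to be or not to be"), return the
    query unchanged rather than an empty string.
    """
    kept = [w for w in query.split() if w.lower() not in stop_words]
    return " ".join(kept) if kept else query


def search_with_fallback(query, search):
    """Run the query as typed; retry without stop words only if it fails."""
    results = search(query)
    if results:
        return results
    stripped = strip_stop_words(query)
    if stripped == query:
        return results  # nothing to strip, so a retry would be identical
    return search(stripped)
```

Because the stripped query is only tried after the original returns nothing, a query that consists entirely of stop words is still searched exactly as typed.
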
>
> The wiki link at [1] has links to several lists, including one with 29
> languages [2], though the link there is broken (I found it on
> archive.org).[3] The Spanish and French lists, at least, are a little light
> (part of the problem is all the forms of a given verb, which they don't seem
> to include, just the most common ones). (And I'd suggest adding variants
> without diacritics.)
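
One way the "variants without diacritics" idea could be sketched in Python, using Unicode decomposition from the standard library. The function names are made up for this example, and the sample words are illustrative rather than taken from the linked lists.

```python
import unicodedata


def strip_diacritics(word):
    """Return the word with combining marks removed (e.g. "été" -> "ete")."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def with_plain_variants(stop_words):
    """Return the stop list plus a diacritic-free variant of every entry."""
    return set(stop_words) | {strip_diacritics(w) for w in stop_words}
```

This only handles accents that decompose into a base letter plus combining marks; letters such as "ø" or "ß" have no such decomposition and would need explicit mappings.
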
>
> Alternatively, a native speaker could take a frequency list of terms taken
> from search queries (or maybe just zero-results queries) and make a custom
> list of stop words (which may account for question words showing up more,
> and other ways that queries differ from random text). It takes a couple of
> hours at most given the list. (I've recently done this for a personal
> project.)
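
A rough Python sketch of building the frequency list a native speaker would review. It assumes the queries are already available as an iterable of strings and uses plain whitespace tokenization; both are simplifications, and `top_n` is an arbitrary cutoff chosen for the example.

```python
from collections import Counter


def term_frequencies(queries, top_n=200):
    """Count lowercased, whitespace-separated terms across all queries.

    Returns the top_n (term, count) pairs, most frequent first, as
    candidates for manual review.
    """
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common(top_n)
```

The output is only a candidate list: frequent content words (like "law" in a batch of queries about power laws) still have to be weeded out by hand.
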
>
> Anyway, I don't know if doing this in English will help a whole lot for full
> text search. The recent analysis I did for Dan on full text zero rates
> indicates that enwiki is not the problem.[4] enwiki had ~14% zero results
> over a one-week period in both July and August. Given the level of crap we
> see in our searches, I can't imagine that going below 10% (for full text),
> which would only lower the overall rate by ~2%.
>
> Let's ignore itwiki* for the moment; my analysis doesn't take into account
> the interwiki search there. Are we 100% sure the dashboards do? I believe
> they do, I just don't want it to be true. :(
>
> It looks like we're going to have to pull down numbers for lots of
> individual non-English wikis, though we may get lucky if we look into
> individual ones and find big stupid activities (like nlwiktionary's .de
> domain name searches accounting for their 99% zero results rate).
>
> Anyway, I like stripping stop words better than relaxing AND to OR, unless
> there's some additional post-search ranking to sort the results into a more
> AND-ish order.
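
To illustrate what "relax AND to OR, then re-rank into a more AND-ish order" could mean, here is a toy Python sketch. It is not how Elasticsearch scores documents; it simply matches any query term and sorts by how many distinct terms each document contains. The `docs` mapping of id to text is an assumption made for the example.

```python
def rank_and_ish(query, docs):
    """OR-match the query terms, then order results AND-first.

    docs maps a document id to its text. Documents containing more of the
    distinct query terms come first; ties are broken by document id.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        matched = len(terms & set(text.lower().split()))
        if matched:  # OR semantics: one matching term is enough
            scored.append((matched, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

With this ordering, documents that would have satisfied the strict AND query still come out on top, and partial matches only fill in below them.
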
>
> —Trey
>
> [1] https://en.wikipedia.org/wiki/Stop_words
> [2] https://code.google.com/p/stop-words/
> [3] https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-2014-02-24.zip
> [4] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Survey_of_Zero-Results_Queries#Change_in_Zero_Results_Rate_by_Wiki_.28July_to_August.29
>
>
> Trey Jones
> Software Engineer, Discovery
> Wikimedia Foundation
>
> On Thu, Aug 27, 2015 at 9:21 AM, David Causse <dcau...@wikimedia.org> wrote:
>>
>> On 27/08/2015 17:59, Kevin Smith wrote:
>>
>>
>> On Thu, Aug 27, 2015 at 4:30 AM, David Causse <dcau...@wikimedia.org>
>> wrote:
>>>
>>> There's another feature we could work on after this one:
>>> Review the default AND operator between words. This seems to be in line
>>> with Moiz's survey results and "somewhat" related to the paper reviewed by
>>> Trey:
>>> Users ask questions, not keywords. For example, this query:
>>> what's the connection between power laws and zipf law [1]
>>> returns no result
>>>
>>> but:
>>> power laws zipf distribution [2]
>>> returns good results
>>
>>
>>
>> Earlier, I suggested ignoring "filler" words, but we thought elastic was
>> already doing scoring adjustments that would have a similar effect.
>> Apparently not, because a search for:
>>
>> connection between power laws zipf distribution
>>
>> brings up what look like pretty reasonable results. Throwing away
>> "what's", "the", and "and" before running the search would help a lot (at
>> least in this case).
>>
>>
>> Yes, the term that prevents the result from being found is "what".
>> Elasticsearch will limit the effect of such words in the score but the
>> default AND will force all these words to be in the document.
>>
>> We also have some trouble with "what's" vs "what is"... I'll have a look.
>>
>>
>>
>> _______________________________________________
>> Wikimedia-search mailing list
>> Wikimedia-search@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>>
>



-- 
Oliver Keyes
Count Logula
Wikimedia Foundation
