Yeah, it looks like Common Terms is a low-effort, high-value way of dealing
with this issue. Of course ES is going to have some clever way of dealing
with stop words.

Here's a more readable blog post about Common Terms:
https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query

And, for reference, ES has stop word lists for >30 languages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Aug 28, 2015 at 1:34 AM, David Causse <dcau...@wikimedia.org> wrote:

> On 27/08/2015 22:29, Trey Jones wrote:
>
>> Anyway, I like stripping stop words better than relaxing AND to OR,
>> unless there's some additional post-search ranking to sort the results into
>> a more AND-ish order.
>>
>
> I think my previous mail was misleading; I don't want to replace AND with
> OR. What I mean is that when the query contains a lot of words (e.g. a
> question), the default AND is not appropriate, because a single missing
> stopword could hide a good result. We could use the minimum_should_match
> parameter, which lets us require a minimum number of terms to match (e.g.
> 90% of the query terms must match).
>
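
To make the 90% example concrete, here is a minimal sketch of what such a
request body could look like, plus a rough calculation of how many terms
end up required. The field name "text" and the helper names are made up for
illustration, the word count uses a plain whitespace split rather than the
real analyzer, and required_terms() only covers the "N%" and "N<P%" forms as
I read them in the minimum_should_match docs.

```python
import json

QUESTION = "What's the connection between power laws and zipf distribution"


def required_terms(spec: str, n_terms: int) -> int:
    """Simplified reading of a minimum_should_match spec (hypothetical helper).

    Only handles a positive percentage ("90%") and a single conditional
    ("3<80%"); every other form is deliberately left out of this sketch.
    """
    if "<" in spec:
        threshold, rest = spec.split("<", 1)
        # At or below the threshold, every term is required.
        if n_terms <= int(threshold):
            return n_terms
        return required_terms(rest, n_terms)
    if spec.endswith("%"):
        # The number computed from a percentage is rounded down.
        return n_terms * int(spec[:-1]) // 100
    raise ValueError(f"form not covered by this sketch: {spec!r}")


def match_body(field: str, text: str, spec: str) -> dict:
    """Build a match query body carrying a minimum_should_match value."""
    return {
        "query": {
            "match": {
                field: {"query": text, "minimum_should_match": spec}
            }
        }
    }


n = len(QUESTION.split())  # 9 whitespace-separated words
body = match_body("text", QUESTION, "90%")  # "text" is an assumed field name
print(json.dumps(body, indent=2))
print(required_terms("90%", n))    # 8: one word may be missing
print(required_terms("3<80%", n))  # 7: more than 3 words, so 80% applies
```

So for this nine-word question, "90%" would tolerate exactly one missing
word, which is the "single missing stopword" case described above.
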
> There's also another interesting query that does the "stopword
> stripping" automagically: the common terms query [1].
> In short, this query detects stopwords by analyzing word frequency
> at query time, so the query:
>
> What's the connection between power laws and zipf distribution
> will be split into 2 clauses:
> - connection power laws zipf distribution
> - what's the between and
>
> And we can control the boolean operator of each clause independently,
> e.g. OR for high-frequency words and AND for low-frequency words. Or even
> more complex stuff like "3<80%" [2]: if there are more than 3 words, only
> 80% of them are required.
>
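
For reference, a rough sketch of the split described above and of a common
terms request body. Elasticsearch does the split itself from index
statistics; split_by_frequency() below only imitates it with made-up
document frequencies so the two clauses are visible. The field name "text",
the frequency table, and the cutoff value are all assumptions, while the
parameter names (cutoff_frequency, low_freq_operator, high_freq_operator)
are the ones from the docs in [1].

```python
import json

QUESTION = "What's the connection between power laws and zipf distribution"

# Made-up share of documents containing each term (illustration only).
# Terms not listed here are treated as rare.
TOY_DOC_FREQ = {"what's": 0.30, "the": 0.95, "between": 0.40, "and": 0.90}


def split_by_frequency(text, doc_freq, cutoff):
    """Imitate the low/high frequency split using a toy frequency table."""
    low, high = [], []
    for term in text.lower().split():
        # Terms more frequent than the cutoff go to the high-frequency group.
        if doc_freq.get(term, 0.0) > cutoff:
            high.append(term)
        else:
            low.append(term)
    return low, high


def common_terms_body(field, text, cutoff):
    """Build a common terms query: AND for rare words, OR for frequent ones."""
    return {
        "query": {
            "common": {
                field: {
                    "query": text,
                    "cutoff_frequency": cutoff,
                    "low_freq_operator": "and",
                    "high_freq_operator": "or",
                }
            }
        }
    }


low, high = split_by_frequency(QUESTION, TOY_DOC_FREQ, cutoff=0.001)
print("low freq: ", " ".join(low))   # connection power laws zipf distribution
print("high freq:", " ".join(high))  # what's the between and
print(json.dumps(common_terms_body("text", QUESTION, 0.001), indent=2))
```

With these toy numbers the two groups come out the same as in the example
above; on a real index the split depends on the actual term frequencies and
the chosen cutoff.
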
> [1]
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
> [2]
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html
>
>
> _______________________________________________
> Wikimedia-search mailing list
> Wikimedia-search@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
>