Yeah, it looks like Common Terms is a low-effort, high-value way of dealing
with this issue. Of course ES is going to have some clever way of dealing
with stop words.
Here's a more readable blog post about Common Terms:
And, for reference, ES has stop word lists for >30 languages:
Software Engineer, Discovery
On Fri, Aug 28, 2015 at 1:34 AM, David Causse <dcau...@wikimedia.org> wrote:
> On 27/08/2015 22:29, Trey Jones wrote:
>> Anyway, I like stripping stop words better than relaxing AND to OR,
>> unless there's some additional post-search ranking to sort the results into
>> a more AND-ish order.
> I think my previous mail was misleading: I don't want to replace AND by
> OR. I mean that when the query contains a lot of words (questions) the
> default AND is not appropriate, because a single missing stopword could
> hide a good result. We could use the minimum_should_match attribute, which
> lets us require a minimum number of terms to match (e.g. 90% of the query
> terms should match).
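
For anyone who wants to try the minimum_should_match idea, here is a minimal sketch of what the request body might look like, written as a Python dict. The field name "text" and the helper name are hypothetical, not our actual mapping or code; the "90%" value is just the example figure from above.

```python
import json

def long_query_match(query_text, field="text", min_match="90%"):
    """Build a match query body that tolerates a few missing terms.

    `field` is a hypothetical field name; `min_match` is passed through
    as the query's minimum_should_match value.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": query_text,
                    # Terms are optional (OR), but most of them must match.
                    "operator": "or",
                    "minimum_should_match": min_match,
                }
            }
        }
    }

body = long_query_match(
    "What's the connection between power laws and zipf distribution"
)
print(json.dumps(body, indent=2))
```

With the default OR operator, minimum_should_match is what keeps the results AND-ish while still allowing one or two terms to be missing.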
> There's also another interesting query that does the "stopword
> stripping" automagically: the common terms query.
> In a few words, this query detects stopwords by analyzing word
> frequencies at query time, so the query:
> What's the connection between power laws and zipf distribution
> will be split into 2 clauses:
> - connection power laws zipf distribution
> - what's the between and
> And we can control the boolean operator of these clauses independently,
> e.g. OR for high freq words and AND for low freq words. Or even more
> complex stuff like "3<80%": if there are more than 3 words, only 80% of
> them are required.
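
To make the description above concrete, here is a rough sketch of a common terms query body, plus a small helper showing how I read the "3<80%" rule. The field name "text", the cutoff value 0.001, and both function names are assumptions for illustration only, and the helper handles just the simple "N<P%" form, rounding the percentage down.

```python
import json

def required_matches(num_terms, spec="3<80%"):
    """How many terms must match under a conditional spec like '3<80%'.

    Up to the threshold, every term is required; above it, the
    percentage applies (rounded down). Only the 'N<P%' form is handled.
    """
    threshold, _, rest = spec.partition("<")
    if num_terms <= int(threshold):
        return num_terms
    percent = int(rest.rstrip("%"))
    return num_terms * percent // 100

def common_terms_query(query_text, field="text", cutoff=0.001,
                       low_freq_min=None):
    """Build a common terms query body (hypothetical field and cutoff)."""
    clause = {
        "query": query_text,
        # Terms more frequent than this are treated as stopwords.
        "cutoff_frequency": cutoff,
        "high_freq_operator": "or",
    }
    if low_freq_min is None:
        # Simple variant: every low-frequency term is required.
        clause["low_freq_operator"] = "and"
    else:
        # Relaxed variant: e.g. "3<80%" for the low-frequency terms.
        clause["minimum_should_match"] = {"low_freq": low_freq_min}
    return {"query": {"common": {field: clause}}}

question = "What's the connection between power laws and zipf distribution"
print(json.dumps(common_terms_query(question, low_freq_min="3<80%"), indent=2))
print(required_matches(5))  # 5 low-frequency terms, as in the example above
```

For the five low-frequency terms in David's example (connection, power, laws, zipf, distribution), "3<80%" would require four of them to match.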
> Wikimedia-search mailing list