Yeah, it looks like Common Terms is a low-effort, high-value way of dealing with this issue. Of course ES is going to have some clever way of dealing with stop words.
Here's a more readable blog post about Common Terms:
https://www.elastic.co/blog/stop-stopping-stop-words-a-look-at-common-terms-query

And, for reference, ES has stop word lists for >30 languages:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Fri, Aug 28, 2015 at 1:34 AM, David Causse <[email protected]> wrote:

> On 27/08/2015 22:29, Trey Jones wrote:
>
>> Anyway, I like stripping stop words better than relaxing AND to OR,
>> unless there's some additional post-search ranking to sort the results into
>> a more AND-ish order.
>
> I think my previous mail was misleading: I don't want to replace AND with
> OR. I mean that when the query contains a lot of words (questions), the
> default AND is not appropriate, because a single missing stopword could
> hide a good result. We could use the minimum_should_match attribute, which
> lets us require a minimum number of terms to match (e.g. 90% of the query
> terms should match).
>
> There's also another interesting query which will do the "stopwords
> stripping" automagically: the common terms query [1].
> In short, this query is able to detect stopwords by analyzing word
> frequency at query time, so the query:
>
> What's the connection between power laws and zipf distribution
>
> will be split into 2 clauses:
> - connection power laws zipf distribution
> - what's the between and
>
> And we can control the boolean operator of these clauses independently,
> e.g. OR for high freq words and AND for low freq words. Or even more
> complex stuff like "3<80%" [2]: if there are more than 3 words, only 80%
> of them are required.
>
> [1] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-common-terms-query.html
> [2] https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html
>
> _______________________________________________
> Wikimedia-search mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikimedia-search
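
For what it's worth, here is a rough sketch of the idea David describes above. The first function only imitates the frequency split locally so the two clauses are visible; ES does this itself at query time, and its real logic is more involved. The document frequencies, the cutoff of 0.1, and the field name "text" are all made up for illustration. The second function builds a request body for the common terms query using the parameter names from the documentation linked as [1]; treat it as an assumption-laden starting point, not a tested CirrusSearch configuration.

```python
import json

# Hypothetical document frequencies in a toy index of 1000 documents.
TOTAL_DOCS = 1000
DOC_FREQ = {
    "what's": 300, "the": 950, "between": 400, "and": 900,
    "connection": 40, "power": 60, "laws": 30, "zipf": 2, "distribution": 50,
}
QUERY = "What's the connection between power laws and zipf distribution"


def split_by_frequency(terms, doc_freq, total_docs, cutoff_frequency):
    """Split terms into (low_freq, high_freq) lists.

    A term counts as high frequency when the fraction of documents that
    contain it exceeds cutoff_frequency. This is a simplification of what
    the search engine does internally.
    """
    low, high = [], []
    for term in terms:
        ratio = doc_freq.get(term, 0) / total_docs
        if ratio > cutoff_frequency:
            high.append(term)
        else:
            low.append(term)
    return low, high


def common_terms_query(field, text, cutoff_frequency=0.001):
    """Build a common terms query body: AND for rare words, OR for common ones."""
    return {
        "query": {
            "common": {
                field: {  # the field name is supplied by the caller
                    "query": text,
                    "cutoff_frequency": cutoff_frequency,
                    "low_freq_operator": "and",
                    "high_freq_operator": "or",
                }
            }
        }
    }


if __name__ == "__main__":
    low, high = split_by_frequency(QUERY.lower().split(), DOC_FREQ, TOTAL_DOCS, 0.1)
    print("low freq: ", " ".join(low))
    print("high freq:", " ".join(high))
    print(json.dumps(common_terms_query("text", QUERY), indent=2))
```

With the invented frequencies above, the split reproduces the two clauses from David's example. On a real index the cutoff would need tuning, since which words land in the high-frequency group depends entirely on the corpus.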
