On 29/07/2015 00:32, Erik Bernhardson wrote:
It seems we will have a number of different options to try. I wonder
if it's better to have independent rules or tie them all together into
a more generic rule.
Converting + into space (or just urldecoding)
_ and + are already handled by the lucene analysis chain. If the query
"article_title" doesn't match, "article title" won't match either:
Do you have an example where query_with_underscore returned no result
and query with underscore returned a result?
Quote stripping (the bad `quot` ones, but also things that are
legitimately quoted but the quoted query has no results)
A highly generic rule that would probably get more (but worse) results:
Either remove or convert into a space everything that's not alphanumeric
Maybe even join the words with 'OR' instead of 'AND' if there are
Reformatting the query at character level can be quite dangerous because
it can conflict with the analysis chain.
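To make the risk concrete, here is a minimal Python sketch of the generic rule quoted above (replace everything that is not a letter or digit with a space). The function name strip_non_alphanumeric is hypothetical and this is not code from cirrus; it only illustrates how such a rewrite runs before the analysis chain ever sees the query.

```python
import re


def strip_non_alphanumeric(query: str) -> str:
    """Replace each run of non-alphanumeric characters with one space.

    Hypothetical helper for illustration only. [\\W_] matches anything
    that is not a letter or digit (underscore included), and is
    Unicode-aware for str patterns.
    """
    return re.sub(r"[\W_]+", " ", query).strip()


# Harmless case: the analysis chain would have split these anyway.
print(strip_non_alphanumeric('"article_title"+foo'))  # article title foo

# Risky cases: the characters carried meaning that is now lost
# before any analyzer gets a chance to handle them.
print(strip_non_alphanumeric("C++"))   # C
print(strip_non_alphanumeric("AT&T"))  # AT T
```

The last two calls show the kind of conflict meant here: whatever special handling the analyzers might apply to such tokens never happens, because the characters are already gone.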
Concerning OR and AND I agree, but we have to make sure it won't hurt
the scoring. This is the purpose of query expansion.
Today we have only one query expansion profile, which allows use of the
full syntax offered by cirrus. IMHO the current profile is optimized for
But we could implement different profiles. To illustrate this idea, look
at the query word1 word2: today the expansion is an AND query over
all.plain with boost 1 and over all with boost 0.5.
- all.plain contains exact words
- all contains exact words + stems
Another expansion profile could be:
- AND over all.plain boost 1
- AND over all boost 0.5
- OR over all.plain with boost 0.2
- OR over all with boost 0.1
This is oversimplified, but if we could refactor cirrus so that it is
easy to implement different query expansion profiles, it would be great.
We could get rid of query_string for some profiles and use more advanced
DSL query clauses (dis_max, boosting query, common terms query...).
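
As a rough sketch of what pluggable profiles could look like, the Python below builds an Elasticsearch request body from a list of (field, operator, boost) tuples, using match clauses inside a bool query instead of query_string. The names expand, AND_ONLY and AND_THEN_OR are made up for this example, and the mapping of each profile line to a match clause is my assumption, not how cirrus builds its queries today.

```python
import json


def expand(query, profile):
    """Build a request body with one `match` clause per profile entry.

    Hypothetical helper: `profile` is a list of (field, operator, boost)
    tuples. The clauses are combined as `should` clauses, so a document
    matching any of them is returned and the boosts shape the ranking.
    """
    clauses = [
        {"match": {field: {"query": query, "operator": op, "boost": boost}}}
        for field, op, boost in profile
    ]
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


# Roughly the current behaviour described above: AND only.
AND_ONLY = [("all.plain", "and", 1.0), ("all", "and", 0.5)]

# The alternative profile: the same AND clauses plus low-boost OR clauses.
AND_THEN_OR = AND_ONLY + [("all.plain", "or", 0.2), ("all", "or", 0.1)]

if __name__ == "__main__":
    print(json.dumps(expand("word1 word2", AND_THEN_OR), indent=2))
```

With a structure like this, adding a profile means adding a list of tuples. Whether these boosts keep the AND matches ranked above the OR-only matches would still have to be checked against real queries.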
Wikimedia-search mailing list