On 29/07/2015 00:32, Erik Bernhardson wrote:
It seems we will have a number of different options to try. I wonder if it's better to have independent rules or to tie them all together into a more generic rule.

For example:
  Underscore stripping
  Converting + into space (or just urldecoding)

_ and + are already handled by the Lucene analysis chain. If the query "article_title" doesn't match, "article title" won't match either:

- third_term[0]
- third+term[1]

Do you have an example where the query "query_with_underscore" returned no results and "query with underscore" returned some?

  Quote stripping (the bad `quot` ones, but also things that are legitimately quoted but the quoted query has no results)
  Timestamp stripping?

A highly generic rule that would probably get more (but worse) results:
   Either remove or convert into a space everything that's not alphanumeric
Maybe even join the words with 'OR' instead of 'AND' if there are enough tokens
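
To make the generic rule above concrete, here is a minimal Python sketch. It is only an illustration of the idea, not anything cirrus does today: the function name relax_query and the token threshold are hypothetical, and the rewrite runs on the raw query string before any analysis.

```python
import re

# Hypothetical threshold: only switch to OR when the query has this many tokens.
MIN_TOKENS_FOR_OR = 4


def relax_query(query: str, min_tokens_for_or: int = MIN_TOKENS_FOR_OR) -> str:
    """Replace non-alphanumeric characters with spaces, optionally joining with OR."""
    # \W matches anything that is not a letter, digit or underscore;
    # the underscore is added explicitly so it is converted to a space too.
    cleaned = re.sub(r"[\W_]+", " ", query).strip()
    tokens = cleaned.split()
    if len(tokens) >= min_tokens_for_or:
        # Enough tokens: trade precision for recall.
        return " OR ".join(tokens)
    # Short queries keep the default (AND) semantics.
    return " ".join(tokens)
```

With these assumptions, both third_term and third+term become "third term", and a five-token query is joined with OR.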

Reformatting the query at the character level can be quite dangerous because it can conflict with the analysis chain. Concerning OR and AND I agree, but we have to make sure it won't hurt the scoring. This is the purpose of query expansion[2]. Today we have only one query expansion profile, which allows use of the full syntax offered by cirrus. IMHO the current profile is optimized for precision, but we could implement different profiles. To illustrate this idea, look at the query word1 word2[3]: today the expansion is an AND query over all.plain with boost 1 and over all with boost 0.5, where:
  - all.plain contains exact words
  - all contains exact words + stems

Another expansion profile could be:
  - AND over all.plain boost 1
  - AND over all boost 0.5
  - OR over all.plain with boost 0.2
  - OR over all with boost 0.1
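
As a rough sketch of what such a profile could look like in the Elasticsearch query DSL, the Python below builds one query_string clause per line of the list above and combines them in a bool query. This is an assumption about one possible shape, not what cirrus generates: the names RECALL_PROFILE, clause and expand are hypothetical, and only the field names and boosts come from the list.

```python
import json

# (field, default operator, boost), taken from the profile listed above.
RECALL_PROFILE = [
    ("all.plain", "AND", 1.0),
    ("all", "AND", 0.5),
    ("all.plain", "OR", 0.2),
    ("all", "OR", 0.1),
]


def clause(query, field, operator, boost):
    """Build one query_string clause restricted to a single field."""
    return {
        "query_string": {
            "query": query,
            "fields": [field],
            "default_operator": operator,
            "boost": boost,
        }
    }


def expand(query, profile=RECALL_PROFILE):
    """Combine all clauses of a profile; a document must match at least one."""
    return {
        "query": {
            "bool": {
                "should": [clause(query, f, op, b) for f, op, b in profile],
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(expand("word1 word2"), indent=2))
```

Because bool adds up the scores of every matching should clause, a document matching the AND clauses also matches the OR clauses and ranks above documents that match only the OR clauses. Whether the boosts are right would still have to be checked against real queries.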

This is oversimplified, but if we could refactor cirrus so that it is easy to implement different query expansion profiles, it would be great. We could get rid of query_string for some profiles and use more advanced DSL query clauses (dismax, boosting query, common terms query...).

[0] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch=third_term&namespace=0&limit=10&list=search
[1] https://en.wikipedia.org/w/api.php?action=query&format=json&srsearch=third%2Bterm&namespace=0&limit=10&list=search
[2] https://en.wikipedia.org/wiki/Query_expansion
[3] https://en.wikipedia.org/w/index.php?search=word1+word2&title=Special%3ASearch&go=Go&cirrusDumpQuery

Wikimedia-search mailing list
