KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma Tue, 13 Nov 2018 00:52:32 -0800

Hello, apologies for this long-winded e-mail.

Our fields have KeywordRepeat and language-specific filters such as a stemmer; 
the final filter at query time is SynonymGraph. We do not use 
RemoveDuplicatesFilter, in case you wonder why when you see the parsed 
queries below; this is due to [1].


We use a custom QParser extending edismax and also extend 
ExtendedSolrQueryParser, so we are able to override newFieldQuery in case we 
have to. The problem also directly applies to Solr's vanilla edismax. The file 
synonyms.txt contains the stemmed versions of the original terms.

Consider this example synonym set [bier,brouw] where bier means beer and brouw 
is the stemmed version of brouwsel (brewage, concoction), and consider these 
parameters on /select: qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25.
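To make the mm value concrete, here is a minimal Python sketch of how a conditional spec such as the one above resolves for a given number of optional clauses. It follows the documented semantics of the mm parameter (for 1 or 2 clauses all are required, for 3 to 5 all but one, for 6 all but two, above 6 90%). The function name min_should_match is hypothetical and this is a simplified reimplementation, not Solr's own code. Note that %25 in the URL is just the encoded percent sign.

```python
import re
from urllib.parse import unquote


def min_should_match(clause_count: int, spec: str) -> int:
    """Resolve an mm spec for a given number of optional clauses.

    Simplified sketch of the documented mm semantics; not Solr's code.
    """
    spec = spec.strip()
    result = clause_count
    if "<" in spec:
        # Conditional spec(s): "upper<value", separated by whitespace.
        spec = re.sub(r"\s*<\s*", "<", spec)
        for part in spec.split():
            upper, sub = part.split("<", 1)
            if clause_count <= int(upper):
                return result
            result = min_should_match(clause_count, sub)
        return result
    if spec.endswith("%"):
        # Percentages are truncated; negative values count back from the total.
        raw = clause_count * int(spec[:-1]) / 100
        result = clause_count + int(raw) if raw < 0 else int(raw)
    else:
        value = int(spec)
        result = clause_count + value if value < 0 else value
    return max(0, min(clause_count, result))


# The mm value as it appears in the request, URL-decoded first.
MM = unquote("2<-1 5<-2 6<90%25")

if __name__ == "__main__":
    for n in range(1, 8):
        print(n, min_should_match(n, MM))
```

With two optional clauses this resolves to 2, which is consistent with the ~2 in the parsed queries below.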

The queries q=bier and q=brouw both parse to the following query and give the 
desired results (notice the missing RemoveDuplicates here):
+(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
content_nl:brouw))~2))

However, for q=brouwsel something (partially) unexpected happens:
+(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))

This results in a BooleanQuery where, due to mm=2, both clauses need to match, 
giving very few matches. Removing KeywordRepeat or setting mm=1 of course fixes 
the problem, but that is not what we want.
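The effect can be illustrated with a toy model in Python. It treats each clause as a set of alternative terms and a document as a set of indexed terms, and it assumes that a document containing "brouwsel" is indexed as both "brouwsel" and "brouw" (KeywordRepeat plus stemmer, no synonyms at index time). The names matches, doc_beer and doc_brew are hypothetical and the scoring side of Lucene is ignored entirely.

```python
def matches(doc_terms, clauses, mm):
    """True if at least mm clauses have a term present in the document."""
    hits = sum(1 for clause in clauses if clause & doc_terms)
    return hits >= mm


synonym = {"bier", "brouw"}

# Clause lists mirroring the two parsed queries above.
q_bier = [synonym, synonym]            # q=bier and q=brouw
q_brouwsel = [{"brouwsel"}, synonym]   # q=brouwsel

# Assumed index-time terms for two example documents.
doc_beer = {"bier"}                    # text mentions only "bier"
doc_brew = {"brouwsel", "brouw"}       # text mentions "brouwsel"

if __name__ == "__main__":
    print(matches(doc_beer, q_bier, 2))       # both synonym clauses hit
    print(matches(doc_beer, q_brouwsel, 2))   # the brouwsel clause misses
    print(matches(doc_beer, q_brouwsel, 1))   # with mm=1 it matches again
```

In this model q=brouwsel only finds documents that literally contain "brouwsel", so the synonym no longer widens the result set.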

What is also unexpected, and may be related to the problem, is that when 
checking the analyzer output via the GUI, we see the position incrementing when 
KeywordRepeat and SynonymGraph are combined. When these filters are not 
combined, the positions are always 1, as expected. When combined we get this 
for 'brouw':
term: bier  brouw  bier  brouw
pos:  1     1      2     2

or for 'brouwsel':
term: brouwsel  bier  brouw
pos:  1         2     2
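The positions above line up with the parsed queries if one assumes that a query builder emits one clause per position and wraps several terms at the same position in a Synonym clause. The Python sketch below is only that assumption written out; clauses_by_position and render are hypothetical helpers and not Lucene's QueryBuilder logic.

```python
def clauses_by_position(tokens):
    """Group (term, position) pairs into one term list per position."""
    grouped = {}
    for term, pos in tokens:
        terms = grouped.setdefault(pos, [])
        if term not in terms:
            terms.append(term)
    return [grouped[pos] for pos in sorted(grouped)]


def render(clauses, field="content_nl"):
    """Render clauses in the style of the parsed queries in this mail."""
    parts = []
    for terms in clauses:
        body = " ".join(f"{field}:{term}" for term in terms)
        parts.append(f"Synonym({body})" if len(terms) > 1 else body)
    return " ".join(parts)


# Analyzer output as observed in the GUI for both inputs.
brouw = [("bier", 1), ("brouw", 1), ("bier", 2), ("brouw", 2)]
brouwsel = [("brouwsel", 1), ("bier", 2), ("brouw", 2)]

if __name__ == "__main__":
    print(render(clauses_by_position(brouw)))
    print(render(clauses_by_position(brouwsel)))
```

Under that assumption the two-position token stream would explain why a single-term query ends up as a two-clause BooleanQuery, which is where the minimum should match then bites.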

ExtendedSolrQueryParser, and everything underneath it, is a complicated piece 
of code. In the end it extends Lucene's QueryBuilder, but it does not always 
seem to rely on its results. Edismax, for example, 'resets' minShouldMatch in 
SolrPluginUtils.setMinShouldMatch(). I am a bit too deep in this unfamiliar 
area and need help here.

So, my question is: how do I solve this problem, or how should I approach it? 
What is the actual problem? How can I get the same stable results for both 
queries? Does the odd position increment have anything to do with it (it seems 
Lucene's QueryBuilder does something with it)? What do I need to do?

Many thanks,
Markus

PS: this is on Solr 7.2.1 and 7.5.0.

[1] 
http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
