Hello, apologies for this long-winded e-mail.

Our fields use KeywordRepeat and language-specific filters such as a stemmer; the final filter at query time is SynonymGraph. In case you wonder when you see the parsed queries below: we do not use RemoveDuplicatesFilter, for the reason given in [1].

We use a custom QParser extending edismax and also extend ExtendedSolrQueryParser, so we are able to override newFieldQuery if we have to. The problem also applies directly to Solr's vanilla edismax. The file synonyms.txt contains the stemmed versions of the original terms.

Consider this example synonym set [bier,brouw], where bier means beer and brouw is the stemmed version of brouwsel (brewage, concoction), and consider these parameters on /select:

    qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25

The queries q=bier and q=brouw both parse to the following query and give the desired results (notice the missing RemoveDuplicates here):

    +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier content_nl:brouw))~2))

However, for q=brouwsel something (partially) unexpected happens:

    +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))

This results in a BooleanQuery where, due to mm=2, both clauses need to match, giving very few matches. Removing KeywordRepeat or setting mm=1 of course fixes the problem, but that is not what we want.

What is also unexpected, and may be related to the problem, is that when checking the analyzer output via the GUI, we see the position incrementing when KeywordRepeat and SynonymGraph are combined. When these filters are not combined, the positions are always 1, as expected. When combined, we get this for 'brouw':

    term: bier  brouw  bier  brouw
    pos:  1     1      2     2

or for 'brouwsel':

    term: brouwsel  bier  brouw
    pos:  1         2     2

ExtendedSolrQueryParser, and everything underneath it, is a complicated piece of code. Ultimately it extends Lucene's QueryBuilder, but it does not always seem to rely on its results. Edismax, for example, 'resets' minShouldMatch in SolrPluginUtils.setMinShouldMatch(). So this is a complicated web of code, I am a bit too deep in this unfamiliar area, and I am in need of help here.

So, my question is: how do I solve this problem? Or how should I approach it? What is the actual problem?
How can I get the same stable results for both queries? Does the odd position increment have anything to do with it (it seems Lucene's QueryBuilder does something with it)? What do I need to do?

Many thanks,
Markus

PS: this is on Solr 7.2.1 and 7.5.0.

[1] http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
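
To make the mm=2 effect described above concrete, here is a toy model in plain Java with no Lucene or Solr dependency. It is only a sketch: each query is a list of clauses, one per token position as shown in the analysis output above, and a clause counts as matched when the document contains any of its terms. The class name MmToyModel, the two documents, and the assumption that index-time analysis is KeywordRepeat plus the stemmer without synonyms are all hypothetical and not taken from the real configuration.

```java
import java.util.List;
import java.util.Set;

// Toy model only: plain sets stand in for Lucene queries and indexed terms.
class MmToyModel {

    // One set per token position; terms sharing a position form one Synonym(...) clause.
    // Per the parsed queries above, q=bier yields the same clauses as q=brouw.
    static final List<Set<String>> Q_BROUW = List.of(
            Set.of("bier", "brouw"),   // pos 1
            Set.of("bier", "brouw"));  // pos 2
    static final List<Set<String>> Q_BROUWSEL = List.of(
            Set.of("brouwsel"),        // pos 1: original token kept by KeywordRepeat
            Set.of("bier", "brouw"));  // pos 2: stem plus its synonym

    // Hypothetical documents. ASSUMPTION: the index holds the original token
    // and its stem, and no synonyms are applied at index time.
    static final Set<String> DOC_BIER = Set.of("bier");
    static final Set<String> DOC_BROUWSEL = Set.of("brouwsel", "brouw");

    // Counts the clauses that have at least one term present in the document.
    static int matchingClauses(List<Set<String>> clauses, Set<String> docTerms) {
        int count = 0;
        for (Set<String> clause : clauses) {
            for (String term : clause) {
                if (docTerms.contains(term)) {
                    count++;
                    break; // a clause counts once, however many of its terms match
                }
            }
        }
        return count;
    }

    // A document matches when at least mm clauses match (minimum-should-match).
    static boolean matches(List<Set<String>> clauses, Set<String> docTerms, int mm) {
        return matchingClauses(clauses, docTerms) >= mm;
    }

    public static void main(String[] args) {
        int mm = 2;
        System.out.println("q=brouw    vs doc 'bier':     " + matches(Q_BROUW, DOC_BIER, mm));
        System.out.println("q=brouw    vs doc 'brouwsel': " + matches(Q_BROUW, DOC_BROUWSEL, mm));
        System.out.println("q=brouwsel vs doc 'bier':     " + matches(Q_BROUWSEL, DOC_BIER, mm));
        System.out.println("q=brouwsel vs doc 'brouwsel': " + matches(Q_BROUWSEL, DOC_BROUWSEL, mm));
    }
}
```

In this model the document containing only 'bier' satisfies q=brouw but not q=brouwsel, because the literal content_nl:brouwsel clause never matches it and mm=2 requires both clauses. With mm=1 the same document matches again, which is consistent with the observation that mm=1 hides the problem.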