RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

Markus Jelsma Thu, 29 Nov 2018 15:23:26 -0800

Hello, 

Sorry for trying this once more. Is there anyone around who can help me, and 
perhaps others, on this subject and the linked Jira ticket and failing test?


I could really use some help from someone who is familiar with the edismax 
code and the underlying QueryBuilder parts that are used and then replaced 
by Solr code.

Many thanks,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Thursday 22nd November 2018 15:39
> To: solr-user@lucene.apache.org; solr-user <solr-user@lucene.apache.org>
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> should match (edismax)
> 
> Hello,
> 
> I have opened SOLR-13009 describing the problem. The attached patch 
> contains a unit test proving the problem, i.e. the test fails. Any help would 
> be greatly appreciated.
> 
> Many thanks,
> Markus
> 
> https://issues.apache.org/jira/browse/SOLR-13009
> 
>  
>  
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Sunday 18th November 2018 23:21
> > To: solr-user@lucene.apache.org; solr-user <solr-user@lucene.apache.org>
> > Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> > should match (edismax)
> > 
> > Hello,
> > 
> > Apologies for bothering you all again, but I really need some help in this 
> > matter. How can we resolve this issue? Are we dealing with a bug here (then 
> > I'll open a ticket), or am I doing something wrong?
> > 
> > Is there anyone here who has had the same issue or understands the problem?
> > 
> > Many thanks,
> > Markus 
> > 
> >  
> >  
> > -----Original message-----
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Tuesday 13th November 2018 9:52
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: KeywordRepeat, stemming, (single term) synonyms and minimum 
> > > should match (edismax)
> > > 
> > > Hello, apologies for this long-winded e-mail.
> > > 
> > > Our fields have KeywordRepeat and language-specific filters such as a 
> > > stemmer; the final filter at query time is SynonymGraph. For those of you 
> > > wondering when you see the parsed queries below: we do not use 
> > > RemoveDuplicatesFilter, because of [1]. 
> > > 
> > > We use a custom QParser extending edismax and also extend 
> > > ExtendedSolrQueryParser, so we are able to override newFieldQuery in case 
> > > we have to. The problem also directly applies to Solr's vanilla edismax. 
> > > The file synonyms.txt contains the stemmed versions of the original terms.
> > > 
> > > Consider this example synonym set [bier,brouw] where bier means beer and 
> > > brouw is the stemmed version of brouwsel (brewage, concoction), and 
> > > consider these parameters on /select: 
> > > qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25.
> > > 
> > > The queries q=bier and q=brouw both parse to the following query and give 
> > > the desired results (notice the missing RemoveDuplicates here):
> > > +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> > > content_nl:brouw))~2))
> > > 
> > > However, for q=brouwsel something (partially) unexpected happens:
> > > +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> > > 
> > > This results in a BooleanQuery where, due to mm=2, both clauses need to 
> > > match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> > > course fixes the problem, but that is not what we want.
> > > 
> > > What is also unexpected, and may be related to the problem, is that when 
> > > checking the analyzer output via the GUI, we see the position incrementing 
> > > when KeywordRepeat and SynonymGraph are combined. When these filters are 
> > > not combined, the positions are always 1, as expected. When combined we 
> > > get this for 'brouw':
> > > term: bier  brouw  bier  brouw
> > > pos:  1     1      2     2
> > > 
> > > or for 'brouwsel':
> > > term: brouwsel  bier  brouw
> > > pos:  1         2     2
> > > 
> > > ExtendedSolrQueryParser, and everything underneath, is a complicated 
> > > piece of code. In the end it extends Lucene's QueryBuilder, but it does 
> > > not always seem to rely on its results. Edismax, for example, 'resets' 
> > > minShouldMatch in SolrPluginUtils.setMinShouldMatch(). It is a 
> > > complicated web of code, I am a bit too deep in this unfamiliar area, 
> > > and I need help here.
> > > 
> > > So, my question is: how do I solve this problem, or how should I approach 
> > > it? What is the actual problem? How can I get the same stable results for 
> > > both queries? Does the odd position increment have anything to do with it 
> > > (it seems Lucene's QueryBuilder does something with it)? What do I need 
> > > to do?
> > > 
> > > Many thanks,
> > > Markus
> > > 
> > > ps. this is on Solr 7.2.1 and 7.5.0.
> > > 
> > > [1] 
> > > http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
> > > 
> > 
>
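
To illustrate why q=brouwsel returns so few results with the mm value from the original message: as far as I understand the documented mm syntax, 2<-1 5<-2 6<90% (the %25 in the request is just the URL-encoded %) requires every clause to match when there are at most two optional clauses. The sketch below is plain Python, not Solr code. The helper names min_should_match and matches are hypothetical, and the indexed terms of the example document (a document that only contains 'bier') are an assumption.

```python
import re


def min_should_match(optional_clauses: int, spec: str) -> int:
    """Sketch of the documented mm semantics: how many clauses must match."""
    spec = re.sub(r"\s*<\s*", "<", spec.strip())
    result = optional_clauses
    if "<" in spec:
        # Conditional form "N<value": the value applies only when there are
        # more than N optional clauses; otherwise the previous result stands.
        for part in spec.split():
            bound, value = part.split("<", 1)
            if optional_clauses <= int(bound):
                return result
            result = min_should_match(optional_clauses, value)
        return result
    if spec.endswith("%"):
        calc = optional_clauses * int(spec[:-1]) / 100
    else:
        calc = int(spec)
    # Negative values mean "all but this many"; int() truncates toward zero.
    result = optional_clauses + int(calc) if calc < 0 else int(calc)
    return max(0, min(result, optional_clauses))


def matches(doc_terms: set, clauses: list, mm_spec: str) -> bool:
    """A clause (a set of synonymous terms) hits if the document has any of them."""
    hits = sum(1 for clause in clauses if clause & doc_terms)
    return hits >= min_should_match(len(clauses), mm_spec)


MM = "2<-1 5<-2 6<90%"
SYNONYM = {"bier", "brouw"}           # Synonym(content_nl:bier content_nl:brouw)
QUERY_BIER = [SYNONYM, SYNONYM]       # parsed form of q=bier and q=brouw
QUERY_BROUWSEL = [{"brouwsel"}, SYNONYM]  # parsed form of q=brouwsel
DOC_BIER = {"bier"}                   # assumed indexed terms of one document

print(min_should_match(2, MM))                # 2: both clauses are required
print(matches(DOC_BIER, QUERY_BIER, MM))      # True
print(matches(DOC_BIER, QUERY_BROUWSEL, MM))  # False
```

Under these assumptions, the document matches q=bier because the duplicated Synonym clause counts twice, but it fails q=brouwsel because only the Synonym clause hits while the unstemmed content_nl:brouwsel clause does not.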
