RE: Multiple languages, boosting and, stemming and KeywordRepeat

Markus Jelsma Fri, 18 May 2018 12:10:41 -0700

Hi Alessandro,

I looked at the parsed_query again and spotted something that could be the 
problem. We extend ExtendedDismaxQParser for payload support among other 
things. I suspect something is going wrong with rewriting the clauses of 
SynonymQuery there.


Thanks for making me look at that part again, I clearly missed it the last 
time.

Thanks,
Markus
 
 
-----Original message-----
> From:Alessandro Benedetti <a.benede...@sease.io>
> Sent: Friday 18th May 2018 12:54
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple languages, boosting and, stemming and KeywordRepeat
> 
> Hi Markus,
> can you show all the query parameters used when submitting the request to
> the request handler?
> Can you also include the parsed query (from the debug output)?
> 
> I am curious to investigate this case.
> 
> Cheers
> 
> --------------------------
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
> 
> On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello,
> >
> > And sorry to disturb again. Does any of you have a meaningful opinion
> > on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> > but with a query-time KeywordRepeat filter it causes trouble in some cases.
> > Is it normal for the clauses to be absent in the debug output, but the
> > boost doubled in value?
> >
> > I like this behaviour, but is it a side effect that is considered a bug in
> > later versions? And where is the documentation on this? I cannot find
> > anything in the Lucene or Solr Javadocs, or the reference manual.
> >
> > Many thanks, again,
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Markus Jelsma <markus.jel...@openindex.io>
> > > Sent: Wednesday 9th May 2018 17:39
> > > To: solr-user <solr-user@lucene.apache.org>
> > > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> > >
> > > Hello,
> > >
> > > First, apologies for the weird subject line.
> > >
> > > We index many languages and search over all those languages at once, but
> > boost the language of the user's preference. To differentiate between
> > stemmed tokens and unstemmed tokens we use KeywordRepeat and
> > RemoveDuplicates, this works very well.
> > >
> > > > However, we just stumbled over the following example: q=australia is not
> > stemmed in English, but its suffix is removed by the Romanian stemmer,
> > causing the Romanian results to be returned on top of English results,
> > despite language boosting.
> > >
> > > This is because the Romanian part of the query consists of the stemmed
> > and unstemmed version of the word, but the English part of the query is
> > just one clause per field (title, content etc). Thus the Romanian results
> > score roughly twice that of English results.
> > >
> > > Now, this is of course really obvious, but the 'solution' is not. To
> > work around the problem I removed the RemoveDuplicates filter so I get two
> > clauses for English as well, really ugly but it works. What I don't
> > understand is the debug output, it doesn't list two identical clauses,
> > instead, it doubled the boost on the field, so instead of:
> > >
> > > >     27.048403 = PayloadSpanQuery, product of:
> > > >       27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> > > >         27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > > > ), product of:
> > > >           7.4 = boost
> > > >           3.084852 = idf(docFreq=14539, docCount=317894)
> > > >           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> > > >             4.0 = phraseFreq=4.0
> > > >             0.3 = parameter k1
> > > >             0.5 = parameter b
> > > >             15.08689 = avgFieldLength
> > > >             24.0 = fieldLength
> > > >       1.0 = AveragePayloadFunction.docScore()
> > >
> > > I now get
> > >
> > > >     54.096806 = PayloadSpanQuery, product of:
> > > >       54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> > > >         54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > > > ), product of:
> > > >           14.8 = boost
> > > >           3.084852 = idf(docFreq=14539, docCount=317894)
> > > >           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> > > >             4.0 = phraseFreq=4.0
> > > >             0.3 = parameter k1
> > > >             0.5 = parameter b
> > > >             15.08689 = avgFieldLength
> > > >             24.0 = fieldLength
> > > >       1.0 = AveragePayloadFunction.docScore()
> > >
> > > > So instead of the two clauses I expected in the debug output, I get one with a
> > doubled boost.
> > >
> > > The question is, is this supposed to be like this?
> > >
> > > Also, are there any real solutions to this problem? Removing the
> > RemoveDuplicates filter looks really silly.
> > >
> > > Many thanks!
> > > Markus
> > >
> >
>
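
For reference, the arithmetic in the two explain outputs quoted above can be checked by hand. Below is a minimal Python sketch, not anything taken from Solr itself. The tfNorm formula is copied from the explain output; the idf formula is not printed there, so the one used here is an assumption based on how Lucene's BM25 similarity is commonly documented. The function names are made up for illustration. It also shows why two identical clauses summed together are indistinguishable, score-wise, from one clause with a doubled boost.

```python
import math


def bm25_idf(doc_freq, doc_count):
    # Assumed idf formula (not shown in the explain output).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # Formula exactly as printed in the explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def clause_score(boost, idf, tf_norm):
    # The explain output reports the score as the product of these three.
    return boost * idf * tf_norm


# Values taken from the quoted explain output for doc 15850.
idf = bm25_idf(doc_freq=14539, doc_count=317894)
tf_norm = bm25_tf_norm(
    freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689
)

single = clause_score(7.4, idf, tf_norm)    # one clause, boost 7.4
doubled = clause_score(14.8, idf, tf_norm)  # one clause, boost 14.8
two_clauses = single + single               # two identical clauses summed

print("idf        :", idf)
print("tfNorm     :", tf_norm)
print("boost 7.4  :", single)
print("boost 14.8 :", doubled)
print("2 x 7.4    :", two_clauses)
```

The printed values should agree with the explain output up to float rounding (Lucene computes in single precision). Since the score is linear in the boost, summing two identical clauses and doubling the boost of one clause give the same number, which is consistent with the debug output showing one clause with boost 14.8.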
