Re: Multiple languages, boosting and, stemming and KeywordRepeat

Alessandro Benedetti Fri, 18 May 2018 03:54:54 -0700

Hi Markus,
can you show all the query parameters used when submitting the request to
the request handler?
Can you also include the parsed query (from the debug output)?


I am curious to investigate this case.

Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io>
wrote:

> Hello,
>
> And sorry to disturb again. Does any of you have a meaningful opinion
> on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> but with a query-time KeywordRepeat filter it causes trouble in some cases.
> Is it normal for the clauses to be absent in the debug output, but the
> boost doubled in value?
>
> I like this behaviour, but is it a side effect that is considered a bug in
> later versions? And where is the documentation on this? I cannot find
> anything in the Lucene or Solr Javadocs, or the reference manual.
>
> Many thanks, again,
> Markus
>
>
>
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Wednesday 9th May 2018 17:39
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> >
> > Hello,
> >
> > First, apologies for the weird subject line.
> >
> > We index many languages and search over all those languages at once, but
> boost the language of the user's preference. To differentiate between
> stemmed tokens and unstemmed tokens we use KeywordRepeat and
> RemoveDuplicates; this works very well.
> >
> > However, we just stumbled over the following example: q=australia is not
> stemmed in English, but its suffix is removed by the Romanian stemmer,
> causing the Romanian results to be returned on top of English results,
> despite language boosting.
> >
> > This is because the Romanian part of the query consists of the stemmed
> and unstemmed version of the word, but the English part of the query is
> just one clause per field (title, content etc). Thus the Romanian results
> score roughly twice that of English results.
> >
> > Now, this is of course really obvious, but the 'solution' is not. To
> work around the problem I removed the RemoveDuplicates filter so I get two
> clauses for English as well; really ugly, but it works. What I don't
> understand is the debug output: it doesn't list two identical clauses.
> Instead, it doubled the boost on the field, so instead of:
> >
> >     27.048403 = PayloadSpanQuery, product of:
> >       27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> >         27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
> >           7.4 = boost
> >           3.084852 = idf(docFreq=14539, docCount=317894)
> >           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> >             4.0 = phraseFreq=4.0
> >             0.3 = parameter k1
> >             0.5 = parameter b
> >             15.08689 = avgFieldLength
> >             24.0 = fieldLength
> >       1.0 = AveragePayloadFunction.docScore()
> >
> > I now get
> >
> >     54.096806 = PayloadSpanQuery, product of:
> >       54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> >         54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
> >           14.8 = boost
> >           3.084852 = idf(docFreq=14539, docCount=317894)
> >           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> >             4.0 = phraseFreq=4.0
> >             0.3 = parameter k1
> >             0.5 = parameter b
> >             15.08689 = avgFieldLength
> >             24.0 = fieldLength
> >       1.0 = AveragePayloadFunction.docScore()
> >
> > So where I expected two clauses in the debug output, I get one, but with a
> doubled boost.
> >
> > The question is, is this supposed to be like this?
> >
> > Also, are there any real solutions to this problem? Removing the
> RemoveDuplicates filter looks really silly.
> >
> > Many thanks!
> > Markus
> >
>
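
To make the clause-count asymmetry described above concrete, here is a toy model in Python of a query-time KeywordRepeat -> stemmer -> RemoveDuplicates chain. It is only a sketch, not Lucene code: the function names are made up, and both "stemmers" are hypothetical stand-ins (the Romanian one just strips a trailing "a", which may not be what the real stemmer does to this word). It only illustrates why one language can end up with one clause per field and another with two.

```python
# Toy model of a query-time chain: KeywordRepeat -> stemmer -> RemoveDuplicates.
# All names here are illustrative; this is not the Lucene implementation.

def keyword_repeat(terms):
    """Emit every term twice at the same position: a protected copy and a stemmable copy."""
    tokens = []
    for position, term in enumerate(terms):
        tokens.append((position, term, True))   # keyword copy, never stemmed
        tokens.append((position, term, False))  # copy the stemmer may change
    return tokens


def apply_stemmer(tokens, stemmer):
    """Stem only the tokens that are not flagged as keywords."""
    return [(pos, text if is_keyword else stemmer(text), is_keyword)
            for pos, text, is_keyword in tokens]


def remove_duplicates(tokens):
    """Drop a token when the same text was already seen at the same position."""
    seen = set()
    kept = []
    for pos, text, is_keyword in tokens:
        if (pos, text) not in seen:
            seen.add((pos, text))
            kept.append((pos, text, is_keyword))
    return kept


def analyze(query, stemmer, dedupe=True):
    """Return the term texts that would become query clauses for one field."""
    tokens = apply_stemmer(keyword_repeat(query.split()), stemmer)
    if dedupe:
        tokens = remove_duplicates(tokens)
    return [text for _, text, _ in tokens]


def english_stem(term):
    # Assumption: "australia" is left unchanged, as reported in the thread.
    return term


def romanian_stem(term):
    # Assumption: a made-up suffix rule, NOT the real Romanian stemmer.
    return term[:-1] if term.endswith("a") else term


if __name__ == "__main__":
    print("en:", analyze("australia", english_stem))
    print("ro:", analyze("australia", romanian_stem))
    print("en, no dedupe:", analyze("australia", english_stem, dedupe=False))
```

With deduplication on, the English field keeps a single term while the Romanian field keeps both the unstemmed and the stemmed form; with deduplication off, the English field gets the same term twice, which is the workaround described in the original message.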

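The two explain outputs quoted above can be checked by hand. The sketch below, again in Python, recomputes tfNorm from the formula printed in the debug output and multiplies it by the quoted idf and boost values. The function names are made up for illustration. The numbers are consistent with two identical clauses having been collapsed into one clause whose boost is the sum of both, but that is only an interpretation of the arithmetic; the thread itself does not confirm what Lucene does internally here.

```python
# Recompute the scores from the quoted explain output.
# Function names are illustrative; the constants are copied from the debug output.

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as printed in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def clause_score(boost, idf, tf_norm):
    """The explain output reports the score as the product of these three factors."""
    return boost * idf * tf_norm


IDF = 3.084852  # idf(docFreq=14539, docCount=317894), taken from the output

tf_norm = bm25_tf_norm(freq=4.0, k1=0.3, b=0.5,
                       field_length=24.0, avg_field_length=15.08689)

single = clause_score(7.4, IDF, tf_norm)    # with RemoveDuplicates in the chain
doubled = clause_score(14.8, IDF, tf_norm)  # without RemoveDuplicates

if __name__ == "__main__":
    print("tfNorm:", tf_norm)
    print("boost 7.4:", single)
    print("boost 14.8:", doubled)
    print("ratio:", doubled / single)
```

Small differences in the last digits against the quoted values are expected, since Lucene computes these scores in single precision.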