RE: Multiple languages, boosting and, stemming and KeywordRepeat

Markus Jelsma Thu, 17 May 2018 14:53:59 -0700

Hello,

And sorry to disturb again. Does anyone have an opinion on this peculiar 
matter? The RemoveDuplicates filter exists for a reason, but combined with a 
query-time KeywordRepeat filter it causes trouble in some cases. Is it normal 
for the duplicate clauses to be absent from the debug output, while the boost 
is doubled in value?
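
For reference, the arithmetic of the two explain outputs quoted below can be 
checked with a minimal sketch. It assumes the BM25 idf formula that Lucene 
prints in its explain output, log(1 + (docCount - docFreq + 0.5) / (docFreq + 
0.5)), and the tfNorm formula from the quoted output. The function names are 
made up for illustration. It only shows that one clause with a doubled boost 
scores the same as two identical clauses summed; it does not show where in 
Lucene or Solr the clauses are merged.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf as printed by Lucene's BM25 explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # tfNorm formula copied from the quoted explain output
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def clause_score(boost, idf, tf_norm):
    # a clause score is the product of boost, idf and tfNorm
    return boost * idf * tf_norm

# values taken from the quoted explain output for title_en:australia
idf = bm25_idf(doc_freq=14539, doc_count=317894)
tf_norm = bm25_tf_norm(freq=4.0, k1=0.3, b=0.5,
                       field_length=24.0, avg_field_length=15.08689)

single = clause_score(7.4, idf, tf_norm)          # one clause, boost 7.4
two_clauses = single + single                     # two identical clauses summed
doubled_boost = clause_score(14.8, idf, tf_norm)  # one clause, boost 14.8

print(round(single, 4), round(two_clauses, 4), round(doubled_boost, 4))
```

Because the boost is a plain multiplicative factor, summing two identical 
clauses and doubling the boost on one clause are numerically indistinguishable 
in the final score.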


I like this behaviour, but is it a side effect that is considered a bug in 
later versions? And where is this documented? I cannot find anything in the 
Lucene or Solr Javadocs, or in the reference manual.

Many thanks, again,
Markus

 
 
-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 9th May 2018 17:39
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> 
> Hello,
> 
> First, apologies for the weird subject line.
> 
> We index many languages and search over all those languages at once, but 
> boost the language of the user's preference. To differentiate between stemmed 
> tokens and unstemmed tokens we use KeywordRepeat and RemoveDuplicates; this 
> works very well.
> 
> However, we just stumbled over the following example: q=australia is not 
> stemmed in English, but its suffix is removed by the Romanian stemmer, 
> causing the Romanian results to be returned on top of English results, 
> despite language boosting.
> 
> This is because the Romanian part of the query consists of both the stemmed 
> and the unstemmed version of the word, but the English part of the query is 
> just one clause per field (title, content etc). Thus the Romanian results 
> score roughly twice as high as the English results.
> 
> Now, this is of course really obvious, but the 'solution' is not. To work 
> around the problem I removed the RemoveDuplicates filter so I get two clauses 
> for English as well; really ugly, but it works. What I don't understand is 
> the debug output: it doesn't list two identical clauses. Instead, it doubled 
> the boost on the field, so instead of:
> 
>     27.048403 = PayloadSpanQuery, product of:
>       27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>         27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> ), product of:
>           7.4 = boost
>           3.084852 = idf(docFreq=14539, docCount=317894)
>           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>             4.0 = phraseFreq=4.0
>             0.3 = parameter k1
>             0.5 = parameter b
>             15.08689 = avgFieldLength
>             24.0 = fieldLength
>       1.0 = AveragePayloadFunction.docScore()
> 
> I now get:
> 
>     54.096806 = PayloadSpanQuery, product of:
>       54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>         54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> ), product of:
>           14.8 = boost
>           3.084852 = idf(docFreq=14539, docCount=317894)
>           1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>             4.0 = phraseFreq=4.0
>             0.3 = parameter k1
>             0.5 = parameter b
>             15.08689 = avgFieldLength
>             24.0 = fieldLength
>       1.0 = AveragePayloadFunction.docScore()
> 
> So instead of the two clauses I expected in the debug output, I get one 
> clause with a doubled boost.
> 
> The question is, is this supposed to be like this?
> 
> Also, are there any real solutions to this problem? Removing the 
> RemoveDuplicates filter looks really silly.
> 
> Many thanks!
> Markus
>
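
The clause-count asymmetry described in the quoted message can be illustrated 
with a toy simulation of the filter chain. This is only a sketch in Python, 
not Lucene code: the Token class and the two stemmer functions are 
hypothetical stand-ins. The "Romanian" stemmer here simply strips a trailing 
"a", which is an assumption, since the message only says that the suffix is 
removed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False

def keyword_repeat(tokens):
    # Emit every token twice at the same position:
    # once flagged as keyword (protected from stemming), once not.
    out = []
    for t in tokens:
        out.append(Token(t.text, t.position, keyword=True))
        out.append(Token(t.text, t.position, keyword=False))
    return out

def stem(tokens, stemmer):
    # Stem only the tokens that are not flagged as keyword.
    return [t if t.keyword else Token(stemmer(t.text), t.position)
            for t in tokens]

def remove_duplicates(tokens):
    # Drop tokens whose text was already seen at the same position.
    seen, out = set(), []
    for t in tokens:
        key = (t.position, t.text)
        if key not in seen:
            seen.add(key)
            out.append(t)
    return out

def toy_english_stemmer(word):
    # Assumption: leaves "australia" unchanged, as described in the message.
    return word

def toy_romanian_stemmer(word):
    # Assumption: strips a trailing "a"; the real stemmer output may differ.
    return word[:-1] if word.endswith("a") else word

def analyze(text, stemmer, dedupe=True):
    tokens = [Token(w, i) for i, w in enumerate(text.split())]
    tokens = stem(keyword_repeat(tokens), stemmer)
    if dedupe:
        tokens = remove_duplicates(tokens)
    return [t.text for t in tokens]

english = analyze("australia", toy_english_stemmer)
romanian = analyze("australia", toy_romanian_stemmer)
english_no_dedupe = analyze("australia", toy_english_stemmer, dedupe=False)
print(english, romanian, english_no_dedupe)
```

With duplicate removal, the English chain yields one term and the Romanian 
chain two, which matches the one-clause versus two-clause imbalance described 
above. Without duplicate removal, the English chain also yields two (identical) 
terms.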
