Hello,

Sorry to disturb again. Does anyone have an opinion on this peculiar matter? The RemoveDuplicates filter exists for a reason, but combined with a query-time KeywordRepeat filter it causes trouble in some cases. Is it normal for the duplicate clauses to be absent from the debug output while the boost is doubled in value?
I like this behaviour, but is it a side effect that is considered a bug in later versions? And where is the documentation on this? I cannot find anything in the Lucene or Solr Javadocs, or in the reference manual.

Many thanks, again,
Markus

-----Original message-----
> From:Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 9th May 2018 17:39
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Multiple languages, boosting and, stemming and KeywordRepeat
>
> Hello,
>
> First, apologies for the weird subject line.
>
> We index many languages and search over all those languages at once, but
> boost the language of the user's preference. To differentiate between stemmed
> tokens and unstemmed tokens we use KeywordRepeat and RemoveDuplicates; this
> works very well.
>
> However, we just stumbled over the following example: q=australia is not
> stemmed in English, but its suffix is removed by the Romanian stemmer,
> causing the Romanian results to be returned on top of English results,
> despite language boosting.
>
> This is because the Romanian part of the query consists of the stemmed and
> unstemmed version of the word, but the English part of the query is just one
> clause per field (title, content etc). Thus the Romanian results score
> roughly twice that of English results.
>
> Now, this is of course really obvious, but the 'solution' is not. To work
> around the problem I removed the RemoveDuplicates filter so I get two clauses
> for English as well; really ugly, but it works.
> What I don't understand is the debug output: it doesn't list two identical
> clauses; instead, it doubled the boost on the field. So instead of:
>
> 27.048403 = PayloadSpanQuery, product of:
>   27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
>       7.4 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
>
> I now get:
>
> 54.096806 = PayloadSpanQuery, product of:
>   54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
>       14.8 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
>
> So instead of the two clauses I expected in the debug, I get one clause with
> a doubled boost.
>
> The question is, is this supposed to be like this?
>
> Also, are there any real solutions to this problem? Removing the
> RemoveDuplicates filter looks really silly.
>
> Many thanks!
> Markus
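
For anyone who wants to check the numbers, here is a minimal Python sketch of the arithmetic in the explain output above. It assumes the clause score is simply boost * idf * tfNorm * payload factor, as the explain tree suggests, and takes all values from that output. The function names are made up for illustration and are not Lucene or Solr API. It only shows that summing two identical clauses gives the same number as one clause with a doubled boost; it says nothing about why the debug output reports it that way.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as written in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def clause_score(boost, idf, tfn, payload_factor=1.0):
    """Assumed structure: product of the factors listed in the explain tree."""
    return boost * idf * tfn * payload_factor


# Values copied from the explain output for title_en:australia in doc 15850.
IDF = 3.084852
TFN = tf_norm(freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689)

one_clause = clause_score(7.4, IDF, TFN)      # single clause, boost 7.4
two_clauses = one_clause + one_clause         # two identical clauses summed
doubled_boost = clause_score(14.8, IDF, TFN)  # one clause, boost 14.8

print(TFN, one_clause, two_clauses, doubled_boost)
```

Because the score is linear in the boost, two_clauses and doubled_boost agree up to floating-point rounding, so the two forms cannot be told apart from the final score alone.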