Hi Markus, can you show all the query parameters used when submitting the request to the request handler? Can you also include the parsed query (in the debug output)?
I am curious to investigate this case.

Cheers

--------------------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello,
>
> And sorry to disturb again. Does any of you have any meaningful opinion
> on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> but with a query-time KeywordRepeat filter it causes trouble in some
> cases. Is it normal for the clauses to be absent in the debug output, but
> the boost doubled in value?
>
> I like this behaviour, but is it a side effect that is considered a bug
> in later versions? And where is the documentation on this? I cannot find
> anything in the Lucene or Solr Javadocs, or the reference manual.
>
> Many thanks, again,
> Markus
>
> -----Original message-----
> > From:Markus Jelsma <markus.jel...@openindex.io>
> > Sent: Wednesday 9th May 2018 17:39
> > To: solr-user <solr-user@lucene.apache.org>
> > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> >
> > Hello,
> >
> > First, apologies for the weird subject line.
> >
> > We index many languages and search over all those languages at once,
> > but boost the language of the user's preference. To differentiate
> > between stemmed tokens and unstemmed tokens we use KeywordRepeat and
> > RemoveDuplicates; this works very well.
> >
> > However, we just stumbled over the following example: q=australia is
> > not stemmed in English, but its suffix is removed by the Romanian
> > stemmer, causing the Romanian results to be returned on top of English
> > results, despite language boosting.
> >
> > This is because the Romanian part of the query consists of the stemmed
> > and unstemmed version of the word, but the English part of the query is
> > just one clause per field (title, content etc). Thus the Romanian
> > results score roughly twice that of English results.
> >
> > Now, this is of course really obvious, but the 'solution' is not. To
> > work around the problem I removed the RemoveDuplicates filter so I get
> > two clauses for English as well, really ugly but it works. What I don't
> > understand is the debug output: it doesn't list two identical clauses,
> > instead, it doubled the boost on the field, so instead of:
> >
> > 27.048403 = PayloadSpanQuery, product of:
> >   27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> >     27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >       7.4 = boost
> >       3.084852 = idf(docFreq=14539, docCount=317894)
> >       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> >         4.0 = phraseFreq=4.0
> >         0.3 = parameter k1
> >         0.5 = parameter b
> >         15.08689 = avgFieldLength
> >         24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > I now get
> >
> > 54.096806 = PayloadSpanQuery, product of:
> >   54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
> >     54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >       14.8 = boost
> >       3.084852 = idf(docFreq=14539, docCount=317894)
> >       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
> >         4.0 = phraseFreq=4.0
> >         0.3 = parameter k1
> >         0.5 = parameter b
> >         15.08689 = avgFieldLength
> >         24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > So instead of the two clauses I expected in the debug, I get one but
> > with a doubled boost.
> >
> > The question is, is this supposed to be like this?
> >
> > Also, are there any real solutions to this problem? Removing the
> > RemoveDuplicates filter looks really silly.
> >
> > Many thanks!
> > Markus
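
For reference, the numbers in the two explain outputs quoted above can be reproduced by hand. Below is a minimal Python sketch, assuming the score is simply boost * idf * tfNorm with the tfNorm formula printed in the explain output. The function names are made up for illustration and are not Lucene or Solr API. It only shows that one clause with a doubled boost is arithmetically the same as the sum of two identical clauses; it does not say which component performs the merge.

```python
# Reproduce the explain arithmetic from the quoted debug output.
# Function names are hypothetical helpers, not part of Lucene or Solr.

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as written in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def clause_score(boost, idf, tf_norm):
    """Score of one clause: product of boost, idf and tfNorm."""
    return boost * idf * tf_norm

# Values copied from the explain output for title_en:australia, doc 15850.
idf = 3.084852
tf_norm = bm25_tf_norm(
    freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689
)

single = clause_score(7.4, idf, tf_norm)    # with RemoveDuplicates
doubled = clause_score(14.8, idf, tf_norm)  # without RemoveDuplicates
two_clauses = single + single               # two identical clauses summed

print(f"tfNorm       = {tf_norm:.6f}")
print(f"boost 7.4    = {single:.4f}")
print(f"boost 14.8   = {doubled:.4f}")
print(f"two clauses  = {two_clauses:.4f}")
```

Because the score is linear in the boost, the last two values come out identical, which matches the 27.048403 versus 54.096806 seen in the debug output.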