RE: Multiple languages, boosting and, stemming and KeywordRepeat
Hi Alessandro,

I looked at the parsed_query again and spotted something that could be the problem. We extend ExtendedDismaxQParser for payload support, among other things. I suspect something is going wrong there when rewriting the clauses of the SynonymQuery.

Thanks for letting me look at that part again; I clearly missed it the last time.

Thanks,
Markus

-----Original message-----
> From: Alessandro Benedetti <a.benede...@sease.io>
> Sent: Friday 18th May 2018 12:54
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple languages, boosting and, stemming and KeywordRepeat
>
> Hi Markus,
> can you show all the query parameters used when submitting the request to
> the request handler?
> Can you also include the parsed query (in the debug output)?
>
> I am curious to investigate this case.
>
> Cheers
>
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
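For readers following the thread, here is a minimal Python sketch of the analysis chain described in the original question: KeywordRepeat emits each token twice at the same position, the stemmer changes only the non-keyword copy, and RemoveDuplicates drops the copy when stemming changed nothing. The two toy stemmers are hypothetical stand-ins (the real English and Romanian stemmers may produce different stems); they only mimic the reported behaviour that "australia" is left alone in English but loses its suffix in Romanian.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, int]  # (term, position)


def keyword_repeat(tokens: List[Token]) -> List[Tuple[str, int, bool]]:
    """Emit every token twice at the same position: keyword copy first, then a stemmable copy."""
    out = []
    for term, pos in tokens:
        out.append((term, pos, True))   # protected from stemming
        out.append((term, pos, False))  # may be stemmed
    return out


def apply_stemmer(tokens, stemmer: Callable[[str], str]) -> List[Token]:
    """Stem only the copies that are not marked as keywords."""
    return [(term if is_keyword else stemmer(term), pos) for term, pos, is_keyword in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop tokens whose term and position were already seen."""
    seen, out = set(), []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


def analyze(text: str, stemmer: Callable[[str], str], dedupe: bool = True) -> List[Token]:
    tokens = [(term, pos) for pos, term in enumerate(text.lower().split())]
    tokens = apply_stemmer(keyword_repeat(tokens), stemmer)
    return remove_duplicates(tokens) if dedupe else tokens


# Hypothetical toy stemmers, for illustration only.
def toy_english_stem(term: str) -> str:
    return term  # assumption: "australia" is not stemmed in English


def toy_romanian_stem(term: str) -> str:
    return term[:-1] if term.endswith("a") else term  # assumption: a suffix is stripped


if __name__ == "__main__":
    print("en:", analyze("australia", toy_english_stem))
    print("ro:", analyze("australia", toy_romanian_stem))
    print("en, no RemoveDuplicates:", analyze("australia", toy_english_stem, dedupe=False))
```

With deduplication the English chain yields one term and the Romanian chain two, which is the clause imbalance the question is about; without deduplication the English chain yields the same term twice at one position, and those are the clauses the query parser then has to combine.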
Re: Multiple languages, boosting and, stemming and KeywordRepeat
Hi Markus,
can you show all the query parameters used when submitting the request to the request handler?
Can you also include the parsed query (in the debug output)?

I am curious to investigate this case.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello,
>
> And sorry to disturb again. Does anyone have an opinion on this
> peculiar matter? The RemoveDuplicates filter exists for a reason,
> but with a query-time KeywordRepeat filter it causes trouble in some cases.
> Is it normal for the clauses to be absent in the debug output, but the
> boost doubled in value?
>
> I like this behaviour, but is it a side effect that is considered a bug in
> later versions? And where is this documented? I cannot find
> anything in the Lucene or Solr Javadocs, or the reference manual.
>
> Many thanks, again,
> Markus
RE: Multiple languages, boosting and, stemming and KeywordRepeat
Hello,

And sorry to disturb again. Does anyone have an opinion on this peculiar matter? The RemoveDuplicates filter exists for a reason, but with a query-time KeywordRepeat filter it causes trouble in some cases. Is it normal for the clauses to be absent in the debug output, but the boost doubled in value?

I like this behaviour, but is it a side effect that is considered a bug in later versions? And where is this documented? I cannot find anything in the Lucene or Solr Javadocs, or the reference manual.

Many thanks, again,
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Wednesday 9th May 2018 17:39
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Multiple languages, boosting and, stemming and KeywordRepeat
>
> Hello,
>
> First, apologies for the weird subject line.
>
> We index many languages and search over all those languages at once, but
> boost the language of the user's preference. To differentiate between stemmed
> tokens and unstemmed tokens we use KeywordRepeat and RemoveDuplicates. This
> works very well.
>
> However, we just stumbled over the following example: q=australia is not
> stemmed in English, but its suffix is removed by the Romanian stemmer,
> causing the Romanian results to be returned on top of the English results,
> despite language boosting.
>
> This is because the Romanian part of the query consists of the stemmed and
> unstemmed version of the word, but the English part of the query is just one
> clause per field (title, content etc.). Thus the Romanian results score
> roughly twice as high as the English results.
>
> Now, this is of course really obvious, but the 'solution' is not. To work
> around the problem I removed the RemoveDuplicates filter so I get two clauses
> for English as well. Really ugly, but it works. What I don't understand is the
> debug output: it doesn't list two identical clauses. Instead, it doubled the
> boost on the field, so instead of:
>
> 27.048403 = PayloadSpanQuery, product of:
>   27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
>       7.4 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
>
> I now get
>
> 54.096806 = PayloadSpanQuery, product of:
>   54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
>       14.8 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
>
> So instead of the two clauses I expected in the debug output, I get one
> with a doubled boost.
>
> The question is: is this supposed to be like this?
>
> Also, are there any real solutions to this problem? Removing the
> RemoveDuplicates filter looks really silly.
>
> Many thanks!
> Markus
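As a sanity check on the two explain outputs quoted above, the numbers can be reproduced by hand. The Python sketch below plugs the quoted values into the tfNorm formula printed in the debug output and multiplies by idf and boost. It then compares one clause with a doubled boost against the sum of two identical clauses. That the two clauses would simply be summed is an assumption made for the comparison, not a statement about what the query parser actually does; the function names are illustrative only.

```python
def tf_norm(freq: float, k1: float, b: float, field_length: float, avg_field_length: float) -> float:
    """tfNorm exactly as written in the explain output."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


def clause_score(boost: float, idf: float, tfn: float, payload: float = 1.0) -> float:
    """Score of one clause: boost * idf * tfNorm, times the payload factor (1.0 in the output)."""
    return boost * idf * tfn * payload


# Values copied from the explain output in the message above.
IDF = 3.084852
TFN = tf_norm(freq=4.0, k1=0.3, b=0.5, field_length=24.0, avg_field_length=15.08689)

single_clause = clause_score(boost=7.4, idf=IDF, tfn=TFN)
doubled_boost = clause_score(boost=14.8, idf=IDF, tfn=TFN)

# Assumption: two identical clauses would contribute the sum of their scores.
two_identical_clauses = single_clause + single_clause

if __name__ == "__main__":
    print("tfNorm:", TFN)
    print("one clause, boost 7.4:", single_clause)
    print("one clause, boost 14.8:", doubled_boost)
    print("two clauses, boost 7.4 each:", two_identical_clauses)
```

Because the score is linear in the boost, one clause with boost 14.8 and two summed clauses with boost 7.4 give the same number, so the debug output alone cannot tell the two cases apart.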