RE: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Markus Jelsma
Hi Alessandro,

I looked at the parsed_query again and spotted something that could be the 
problem. We extend ExtendedDismaxQParser for payload support among other 
things. I suspect something is going wrong with rewriting the clauses of 
SynonymQuery there.

Thanks for letting me look at that part again, I clearly missed it the last 
time.

Thanks,
Markus
 
 
-Original message-
> From:Alessandro Benedetti <a.benede...@sease.io>
> Sent: Friday 18th May 2018 12:54
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple languages, boosting and, stemming and KeywordRepeat
> 
> Hi Markus,
> can you show all the query parameters used when submitting the request to
> the request handler?
> Can you also include the parsed query (in the debug output)?
> 
> I am curious to investigate this case.
> 
> Cheers
> 
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io


Re: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Alessandro Benedetti
Hi Markus,
can you show all the query parameters used when submitting the request to
the request handler?
Can you also include the parsed query (in the debug output)?

I am curious to investigate this case.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io



RE: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-17 Thread Markus Jelsma
Hello,

And sorry to disturb again. Does anyone have a meaningful opinion on this 
peculiar matter? The RemoveDuplicates filter exists for a reason, but with a 
query-time KeywordRepeat filter it causes trouble in some cases. Is it normal 
for the clauses to be absent in the debug output, but the boost doubled in 
value?
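
For readers following along, here is a toy Python model of the query-time chain discussed here (KeywordRepeat, then a stemmer, then RemoveDuplicates). It is only a sketch under stated assumptions: the two stemmer functions are hypothetical stand-ins for the real English and Romanian stemmers, the stem "australi" is an assumed example that does not come from this thread, and token positions are simplified to list indexes. It illustrates why a word the stemmer leaves alone ends up as one query term, while a word the stemmer changes ends up as two.

```python
# Toy model of KeywordRepeat -> stemmer -> RemoveDuplicates at query time.
# Tokens are (position, text, is_keyword) tuples.

def keyword_repeat(words):
    """Emit every word twice at the same position: once protected, once stemmable."""
    tokens = []
    for pos, word in enumerate(words):
        tokens.append((pos, word, True))   # keyword copy, skipped by the stemmer
        tokens.append((pos, word, False))  # copy the stemmer may change
    return tokens

def apply_stemmer(tokens, stemmer):
    """Stem only the tokens that are not flagged as keywords."""
    return [(pos, text if kw else stemmer(text), kw) for pos, text, kw in tokens]

def remove_duplicates(tokens):
    """Drop a token if the same text was already seen at the same position."""
    seen, kept = set(), []
    for pos, text, kw in tokens:
        if (pos, text) not in seen:
            seen.add((pos, text))
            kept.append((pos, text, kw))
    return kept

# Hypothetical stand-ins, NOT the real stemmers.
def english_stub(word):
    return word  # assumption: leaves "australia" untouched

def romanian_stub(word):
    return word[:-1] if word.endswith("a") else word  # assumption: strips a final "a"

def query_terms(words, stemmer, dedupe=True):
    tokens = apply_stemmer(keyword_repeat(words), stemmer)
    if dedupe:
        tokens = remove_duplicates(tokens)
    return [text for _pos, text, _kw in tokens]

if __name__ == "__main__":
    print("english :", query_terms(["australia"], english_stub))
    print("romanian:", query_terms(["australia"], romanian_stub))
    print("english, no RemoveDuplicates:",
          query_terms(["australia"], english_stub, dedupe=False))
```

With the deduplication step switched off, the English chain also emits two terms, which corresponds to the workaround described in the original message.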

I like this behaviour, but is it a side effect that is considered a bug in 
later versions? And where is the documentation on this? I cannot find anything 
in the Lucene or Solr Javadocs, or the reference manual.

Many thanks, again,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Wednesday 9th May 2018 17:39
> To: solr-user 
> Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> 
> Hello,
> 
> First, apologies for the weird subject line.
> 
> We index many languages and search over all those languages at once, but 
> boost the language of the user's preference. To differentiate between stemmed 
> tokens and unstemmed tokens we use KeywordRepeat and RemoveDuplicates; this 
> works very well.
> 
> However, we just stumbled over the following example: q=australia is not 
> stemmed in English, but its suffix is removed by the Romanian stemmer, 
> causing the Romanian results to be returned on top of English results, 
> despite language boosting.
> 
> This is because the Romanian part of the query consists of the stemmed and 
> unstemmed version of the word, but the English part of the query is just one 
> clause per field (title, content, etc.). Thus the Romanian results score 
> roughly twice that of English results.
> 
> Now, this is of course really obvious, but the 'solution' is not. To work 
> around the problem I removed the RemoveDuplicates filter so I get two clauses 
> for English as well; really ugly, but it works. What I don't understand is the 
> debug output: it doesn't list two identical clauses but instead doubles the 
> boost on the field, so instead of:
> 
> 27.048403 = PayloadSpanQuery, product of:
>   27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> ), product of:
>       7.4 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
> 
> I now get 
> 
> 54.096806 = PayloadSpanQuery, product of:
>   54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
>     54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> ), product of:
>       14.8 = boost
>       3.084852 = idf(docFreq=14539, docCount=317894)
>       1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
>         4.0 = phraseFreq=4.0
>         0.3 = parameter k1
>         0.5 = parameter b
>         15.08689 = avgFieldLength
>         24.0 = fieldLength
>   1.0 = AveragePayloadFunction.docScore()
> 
> So instead of the two clauses I expected in the debug output, I get one with 
> a doubled boost.
> 
> The question is, is this supposed to be like this?
> 
> Also, are there any real solutions to this problem? Removing the 
> RemoveDuplicates filter looks really silly.
> 
> Many thanks!
> Markus
>
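
The numbers in the two explain outputs quoted above can be checked by hand. The following Python sketch recomputes tfNorm from the formula printed in the explain output and multiplies it by the boost and idf values shown there. The function names are illustrative only and are not Lucene API; the idf value is copied from the output, not recomputed. It also shows that, because the score is a plain product, two identical clauses with boost 7.4 sum to the same value as one clause with boost 14.8, so the single doubled-boost clause is numerically equivalent to the two clauses one might expect to see. This says nothing about which component actually merges the clauses.

```python
# Recompute the scores from the explain output quoted in this thread.
# Function names are illustrative, not part of any Lucene or Solr API.

def tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as printed in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

def clause_score(boost, idf, tfn):
    """The 'product of' line: boost * idf * tfNorm."""
    return boost * idf * tfn

# Values copied from the explain output for doc 15850, field title_en.
IDF = 3.084852
TFN = tf_norm(freq=4.0, k1=0.3, b=0.5, field_length=24.0,
              avg_field_length=15.08689)

single_clause = clause_score(7.4, IDF, TFN)     # with RemoveDuplicates
doubled_boost = clause_score(14.8, IDF, TFN)    # without RemoveDuplicates
two_clauses = single_clause + single_clause     # two identical clauses summed

if __name__ == "__main__":
    print("tfNorm               :", TFN)
    print("one clause, boost 7.4:", single_clause)
    print("one clause, boost 14.8:", doubled_boost)
    print("two clauses, boost 7.4:", two_clauses)
```

Small differences in the last digits against the quoted output are to be expected, since this sketch uses double precision arithmetic.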