RE: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Markus Jelsma
Hi Alessandro,

I looked at the parsed_query again and spotted something that could be the 
problem. We extend ExtendedDismaxQParser for payload support, among other 
things. I suspect something is going wrong with rewriting the clauses of 
SynonymQuery there.

Thanks for making me look at that part again; I clearly missed it the last 
time.

Thanks,
Markus
 
 


Re: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Alessandro Benedetti
Hi Markus,
can you show all the query parameters used when submitting the request to
the request handler?
Can you also include the parsed query (from the debug output)?

I am curious to investigate this case.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io



RE: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-17 Thread Markus Jelsma
Hello,

Sorry to bother you again. Does any of you have an opinion on this peculiar 
matter? The RemoveDuplicates filter exists for a reason, but combined with a 
query-time KeywordRepeat filter it causes trouble in some cases. Is it normal 
for the duplicate clauses to be absent from the debug output while the boost 
is doubled in value?

I like this behaviour, but is it a side effect that is considered a bug in 
later versions? And where is the documentation on this? I cannot find anything 
in the Lucene or Solr Javadocs, or in the reference manual.

Many thanks, again,
Markus

 
 


Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-09 Thread Markus Jelsma
Hello,

First, apologies for the weird subject line.

We index many languages and search over all those languages at once, but boost 
the language of the user's preference. To differentiate between stemmed tokens 
and unstemmed tokens we use KeywordRepeat and RemoveDuplicates; this works very 
well.

However, we just stumbled over the following example: q=australia is not 
stemmed in English, but its suffix is removed by the Romanian stemmer, causing 
the Romanian results to be returned on top of the English results, despite 
language boosting.
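
To make the mechanism concrete, here is a minimal Python sketch of the 
analysis chain described above (repeat each token, stem the unprotected copy, 
drop duplicates at the same position). It is only a simulation, not Lucene 
code. The two stemmer functions are hypothetical stand-ins for the real 
English and Romanian stemmers, and the stem "austral" is assumed purely for 
illustration; the thread does not say what the Romanian stemmer actually 
produces.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, int]  # (term text, position)


def keyword_repeat(tokens: List[Token]) -> List[Tuple[str, int, bool]]:
    """Emit every token twice at the same position: one protected copy, one stemmable copy."""
    repeated = []
    for text, pos in tokens:
        repeated.append((text, pos, True))   # keyword copy, must not be stemmed
        repeated.append((text, pos, False))  # copy that the stemmer may change
    return repeated


def stem(tokens: List[Tuple[str, int, bool]], stemmer: Callable[[str], str]) -> List[Token]:
    """Apply the stemmer to unprotected copies only."""
    return [(text if is_keyword else stemmer(text), pos) for text, pos, is_keyword in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop tokens whose text and position were already seen."""
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


def toy_english_stem(term: str) -> str:
    # Hypothetical: leaves "australia" untouched, as reported in the thread.
    return term


def toy_romanian_stem(term: str) -> str:
    # Hypothetical: strips a trailing "ia"; the real stemmer output is not given.
    return term[:-2] if term.endswith("ia") else term


def analyze(text: str, stemmer: Callable[[str], str]) -> List[Token]:
    tokens = [(word, pos) for pos, word in enumerate(text.lower().split())]
    return remove_duplicates(stem(keyword_repeat(tokens), stemmer))


english_terms = analyze("australia", toy_english_stem)
romanian_terms = analyze("australia", toy_romanian_stem)
print("English: ", english_terms)
print("Romanian:", romanian_terms)
```

With these toy stemmers the English chain ends up with a single term and the 
Romanian chain with two, which is the asymmetry that shows up as a different 
number of query clauses per language.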

This is because the Romanian part of the query consists of the stemmed and 
unstemmed version of the word, but the English part of the query is just one 
clause per field (title, content etc). Thus the Romanian results score roughly 
twice as high as the English results.
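
A rough sketch of why the language boost loses out, assuming for simplicity 
that every matching clause contributes the same base score and that a field's 
score is the boosted sum of its clauses. The boost values and base score below 
are made up for illustration; in a real index the stemmed and unstemmed terms 
have different statistics, so the factor is only roughly two.

```python
def field_score(language_boost: float, clause_scores: list) -> float:
    """Boosted sum of the clause scores for one field (simplified model)."""
    return language_boost * sum(clause_scores)


# Hypothetical numbers: English is the preferred language and gets a 1.5x boost.
english_score = field_score(1.5, [1.0])        # one clause: unstemmed == stemmed
romanian_score = field_score(1.0, [1.0, 1.0])  # two clauses: unstemmed + stemmed

print("English: ", english_score)
print("Romanian:", romanian_score)
```

Under this model any language boost below 2 is outweighed by the extra 
Romanian clause.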

Now, the cause is of course really obvious, but the 'solution' is not. To work 
around the problem I removed the RemoveDuplicates filter so I get two clauses 
for English as well; really ugly, but it works. What I don't understand is the 
debug output: it doesn't list two identical clauses but instead doubles the 
boost on the field. So instead of:

27.048403 = PayloadSpanQuery, product of:
  27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
    27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
      7.4 = boost
      3.084852 = idf(docFreq=14539, docCount=317894)
      1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        4.0 = phraseFreq=4.0
        0.3 = parameter k1
        0.5 = parameter b
        15.08689 = avgFieldLength
        24.0 = fieldLength
  1.0 = AveragePayloadFunction.docScore()

I now get 

54.096806 = PayloadSpanQuery, product of:
  54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
    54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
      14.8 = boost
      3.084852 = idf(docFreq=14539, docCount=317894)
      1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        4.0 = phraseFreq=4.0
        0.3 = parameter k1
        0.5 = parameter b
        15.08689 = avgFieldLength
        24.0 = fieldLength
  1.0 = AveragePayloadFunction.docScore()

So instead of the two clauses I expected in the debug output, I get one clause 
with a doubled boost.
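
The numbers in the two explain outputs can be checked with a few lines of 
Python. The sketch below recomputes tfNorm with the formula printed in the 
debug output and multiplies it by the boost, idf and payload factor. It then 
compares two identical clauses with boost 7.4 against one clause with boost 
14.8. The function names are made up for this example. That identical clauses 
are merged into one with their boosts summed is an assumption that fits the 
numbers; the thread itself does not confirm it.

```python
import math


def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """tfNorm exactly as spelled out in the explain output."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


def clause_score(boost, idf, tf_norm, payload_factor=1.0):
    """Score of one clause: boost * idf * tfNorm, times the payload docScore."""
    return boost * idf * tf_norm * payload_factor


# Values copied from the explain output for doc 15850.
tf_norm = bm25_tf_norm(freq=4.0, k1=0.3, b=0.5,
                       field_length=24.0, avg_field_length=15.08689)
idf = 3.084852

single_clause = clause_score(7.4, idf, tf_norm)       # with RemoveDuplicates
two_clauses = single_clause + single_clause           # two identical clauses, summed
merged_clause = clause_score(2 * 7.4, idf, tf_norm)   # one clause, boost 14.8

print("tfNorm:        ", tf_norm)
print("single clause: ", single_clause)
print("two clauses:   ", two_clauses)
print("merged clause: ", merged_clause)
print("same score:    ", math.isclose(two_clauses, merged_clause))
```

Since the boost is a plain multiplicative factor, summing two identical 
clauses and doubling the boost on a single clause give the same score, so the 
ranking is unaffected either way.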

The question is, is this supposed to be like this?

Also, are there any real solutions to this problem? Removing the 
RemoveDuplicates filter looks really silly.

Many thanks!
Markus