Re: discountOverlaps option for QueryParser

2015-09-20 Thread Ahmet Arslan
Hi Robert,

As I understand it, with SynonymQuery all expansion is recommended to be
performed at query time only,
and SynonymQuery will take care of the problem below:

"A query for text:TV will expand into (text:TV text:Television) and the lower
docFreq for text:Television will give the documents that match "Television" a
much higher score than docs that match "TV" comparably -- which may be somewhat
counter intuitive to the client. Index time expansion (or reduction) will
result in the same idf for all documents regardless of which term the original
text contained."


At the end of the query analysis, if there are tokens at the same position, I 
need to create my SynonymQuery programmatically, right?


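Let me explain my concern with another example:

[The analyzer definition was not preserved in the archive; the rest of the
thread implies ASCII folding with preserveOriginal enabled.]
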
With the above analyzer, the query "foo bör" will boost the term "bör" for no
reason,
just because bör is expanded into two terms: bor and bör.
Its contribution to the total score is counted twice. I think this is very
trappy.

With the SynonymQuery solution, I will index with StandardTokenizer only,
with no expansion at index time.
I will construct the query: new TermQuery('foo') + new SynonymQuery('bor',
'bör');

Thanks,
Ahmet




On Monday, September 21, 2015 12:33 AM, Robert Muir wrote:
Hi Ahmet, maybe have a look at the SynonymQuery added in
https://issues.apache.org/jira/browse/LUCENE-6789

For query-time synonyms, it just tries to approximate what happens if
you instead do this work at index-time, by creating a "pseudo-term"
(disjunction of all terms at that same position) summing up the term
frequency across all matching terms before passing to score(). For the
statistics side it takes the maximum DF as the representative DF, and
the sum of the TTF as the representative TTF.

I did relevance experiments with this and the results were positive
over the existing query generated (BooleanQuery with coord disabled),
especially for scoring systems that don't do anything with coord.


On Sun, Sep 20, 2015 at 1:56 PM, Ahmet Arslan wrote:
> Hello,
>
> Assume that term t1 is expanded into multiple terms (at the same position)
> during both indexing and query time.
> This is possible with KeywordRepeat, SynonymFilter, or filters that have a
> preserveOriginal option, for instance.
>
> When a two-term query (t1 t2) is executed, term t1 is boosted artificially:
> its score contribution is counted multiple times.
> If t1 expands into three terms, it is as if the query were issued with
> boosts: t1^3 t2
> This behaviour boosts expanded terms and may not always be desired,
> e.g. when t2 is a content-bearing word.
>
> I think there should be a flag/switch analogous to the relationship
> between discountOverlaps and document length.
> With this control, the scores of overlapping query terms (tokens with a
> position increment of zero) are counted once.
> The remaining (non-overlapping) terms are not affected.
>
> Bruno asked for this functionality in the past:
> http://find.searchhub.org/document/bb99e435ba35f2b1
>
> What do you think about this? How difficult would it be to implement?
> Would this be a Lucene or Solr issue?
>
> Thanks,
> Ahmet
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org




Re: discountOverlaps option for QueryParser

2015-09-20 Thread Doug Turnbull
Another option, Ahmet, would be to create two fields: one that doesn't do
ASCII folding, and another that does, *without* preserving the original. The
ASCII-folded version is a less exacting representation of the text, and the
version without ASCII folding is a more exacting one.

My first pass at a solution to your problem would be summing the two fields'
scores. Scoring the ASCII-folded field provides a higher-recall signal.
I'll call this the "base score." Scoring the non-folded field provides a
more precise ranking signal. It kicks in only when the searcher types the
exact non-folded term. In a sense it acts the way most people
think of a boost: bonus points for harder-to-meet but valuable criteria.

In other words, if you match on just bor, you just get the base score. If
you match on bör you'd gain the benefit of the base and the additional
boost scores. The more exacting, non ASCII folded version of the field acts
as a boost.
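
A minimal sketch of the summing idea above, in plain Java with no Lucene dependency. The per-field scores (1.0 for the folded field, 0.5 for the exact field) and the class and method names are made up for illustration; real scores would come from the similarity.

```java
public class TwoFieldScoreSketch {
    // Adds the base (recall) score from the ASCII-folded field to the bonus
    // from the non-folded field. The bonus is 0 when the exact form does not match.
    static double combined(double foldedFieldScore, double exactFieldScore) {
        return foldedFieldScore + exactFieldScore;
    }

    public static void main(String[] args) {
        // Hypothetical scores for a query that contains the accented form.
        double docWithFoldedFormOnly = combined(1.0, 0.0); // matches only after folding
        double docWithExactForm = combined(1.0, 0.5);      // matches in both fields

        System.out.println("folded-only match: " + docWithFoldedFormOnly);
        System.out.println("exact match:       " + docWithExactForm);
    }
}
```

With these assumed numbers the exact match ranks above the folded-only match, while both documents are still retrieved.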

On the other hand, if you don't care to differentiate between a match on an
ASCII-folded or non-folded version, then simply create the base ASCII-folded
field and score against that.

Shameless plug, this is exactly the sort of thing we talk quite a bit about
in John Berryman's and my book, Relevant Search (http://manning.com/turnbull).
You might find it useful.

Cheers
-Doug



Re: discountOverlaps option for QueryParser

2015-09-20 Thread Ahmet Arslan
Hi Doug,

Boosting exact matches is not my primary concern.
By the way, the ideal way to aggregate scores coming from different fields
remains unclear.
Maybe the geometric mean is better than summing the field scores?
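
To make the question concrete, here is a small sketch comparing the two aggregations on made-up field scores. The class name and the numbers are assumptions for illustration only. One property worth noting is that the geometric mean drops to zero as soon as one field does not match, whereas the sum does not.

```java
public class FieldAggregationSketch {
    // Sum of two per-field scores.
    static double sum(double a, double b) {
        return a + b;
    }

    // Geometric mean of two per-field scores.
    static double geometricMean(double a, double b) {
        return Math.sqrt(a * b);
    }

    public static void main(String[] args) {
        // Hypothetical document that matches in both fields.
        System.out.println("both fields: sum=" + sum(4.0, 1.0)
                + " geometricMean=" + geometricMean(4.0, 1.0));
        // Hypothetical document that matches in only one field.
        System.out.println("one field:   sum=" + sum(4.0, 0.0)
                + " geometricMean=" + geometricMean(4.0, 0.0));
    }
}
```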

I just want to warn people: if filters that produce multiple tokens at the same
position are used carelessly, they can cause non-obvious boosting in a
query.

Thanks,
Ahmet


Re: discountOverlaps option for QueryParser

2015-09-20 Thread Robert Muir
On Sun, Sep 20, 2015 at 6:52 PM, Ahmet Arslan wrote:
> Hi Robert,
>
> As I understand it, with SynonymQuery all expansion is recommended to be
> performed at query time only,
> and SynonymQuery will take care of the problem below:

It's not that I recommend query-time expansion (vs. index-time); it's
just that Lucene needed to deal with that option a little better than
before.

>
> "A query for text:TV will expand into (text:TV text:Television) and the lower
> docFreq for text:Television will give the documents that match "Television" a
> much higher score than docs that match "TV" comparably -- which may be
> somewhat counter intuitive to the client. Index time expansion (or reduction)
> will result in the same idf for all documents regardless of which term the
> original text contained."

That is correct. Additionally, if a document contains one instance of
TV and one instance of Television, the two term frequencies are added
up: the pair is treated as a single term with tf=2 for that document
and sent to the similarity like that. So it tries to behave as if TV
and Television were one index term. This is important so that term
frequency normalization is applied correctly, to represent the
information gain of each additional occurrence.
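
A rough sketch of why summing the frequencies first matters. It uses a saturating tf formula in the style of BM25 with document-length normalization and idf left out; the formula choice, the k1 value, and the class name are assumptions for illustration, not what any particular similarity is guaranteed to compute.

```java
public class SummedTfSketch {
    // Saturating term-frequency component in the style of BM25.
    // Length normalization and idf are deliberately omitted.
    static double tfScore(double tf, double k1) {
        return tf * (k1 + 1) / (tf + k1);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // assumed saturation parameter

        // Two separate clauses, each seeing tf=1, added together.
        double separateClauses = tfScore(1, k1) + tfScore(1, k1);
        // One pseudo-term seeing the summed frequency tf=2.
        double pseudoTerm = tfScore(2, k1); // roughly 1.375

        System.out.println("separate clauses: " + separateClauses);
        System.out.println("pseudo-term:      " + pseudoTerm);
    }
}
```

Under these assumptions the pseudo-term scores lower than the two separate clauses, because the second occurrence is discounted by the saturation instead of being counted at full weight.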

>
> At the end of the query analysis, if there are tokens at the same position, I 
> need to create my SynonymQuery programmatically, right?

QueryBuilder (used by the query parsers) will generate a SynonymQuery when it
sees the posInc=0 situation in the token stream.
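
The grouping step can be pictured roughly as follows. This is a simplified illustration in plain Java, not QueryBuilder's actual code: the class, the method, and the array-based token representation are hypothetical, and the token order in the example is assumed.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionGroupingSketch {
    // Groups analyzed terms by position: a term with position increment 0
    // joins the group of the previous term, any other term starts a new group.
    static List<List<String>> groupByPosition(String[] terms, int[] positionIncrements) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            if (positionIncrements[i] == 0 && !groups.isEmpty()) {
                groups.get(groups.size() - 1).add(terms[i]);
            } else {
                List<String> group = new ArrayList<>();
                group.add(terms[i]);
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Assumed analysis output: the folded form followed by the original
        // form at the same position (position increment 0).
        String[] terms = {"foo", "bor", "b\u00f6r"};
        int[] increments = {1, 1, 0};

        // A group with one term would become a plain term clause,
        // a group with several terms would become one synonym clause.
        System.out.println(groupByPosition(terms, increments));
    }
}
```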

>
> Let me explain my concern with another example:
>
> [The analyzer definition was not preserved in the archive; the rest of the
> thread implies ASCII folding with preserveOriginal enabled.]
>
> With the above analyzer, the query "foo bör" will boost the term "bör" for no
> reason,
> just because bör is expanded into two terms: bor and bör.
> Its contribution to the total score is counted twice. I think this is very
> trappy.
>
> With the SynonymQuery solution, I will index with StandardTokenizer only,
> with no expansion at index time.
> I will construct the query: new TermQuery('foo') + new SynonymQuery('bor',
> 'bör');

Yes, that is exactly it.




Re: discountOverlaps option for QueryParser

2015-09-20 Thread Robert Muir
Hi Ahmet, maybe have a look at the SynonymQuery added in
https://issues.apache.org/jira/browse/LUCENE-6789

For query-time synonyms, it just tries to approximate what happens if
you instead do this work at index-time, by creating a "pseudo-term"
(disjunction of all terms at that same position) summing up the term
frequency across all matching terms before passing to score(). For the
statistics side it takes the maximum DF as the representative DF, and
the sum of the TTF as the representative TTF.
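
The statistics side described above can be sketched in a few lines of plain Java. The class name, method names, and index statistics below are made up for illustration; only the max-DF and sum-of-TTF rules come from the description.

```java
public class PseudoTermStatsSketch {
    // Representative document frequency: the maximum over all synonyms.
    static long representativeDocFreq(long[] docFreqs) {
        long max = 0;
        for (long df : docFreqs) {
            max = Math.max(max, df);
        }
        return max;
    }

    // Representative total term frequency: the sum over all synonyms.
    static long representativeTotalTermFreq(long[] totalTermFreqs) {
        long sum = 0;
        for (long ttf : totalTermFreqs) {
            sum += ttf;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical statistics for two synonyms, e.g. TV and Television.
        long[] docFreqs = {1000, 50};
        long[] totalTermFreqs = {2500, 60};

        System.out.println("df=" + representativeDocFreq(docFreqs)
                + " ttf=" + representativeTotalTermFreq(totalTermFreqs));
        // prints "df=1000 ttf=2560"
    }
}
```

Because both synonyms share one set of statistics, the rarer term no longer receives a higher idf than the common one.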

I did relevance experiments with this and the results were positive
over the existing query generated (BooleanQuery with coord disabled),
especially for scoring systems that don't do anything with coord.





discountOverlaps option for QueryParser

2015-09-20 Thread Ahmet Arslan
Hello,

Assume that term t1 is expanded into multiple terms (at the same position)
during both indexing and query time.
This is possible with KeywordRepeat, SynonymFilter, or filters that have a
preserveOriginal option, for instance.

When a two-term query (t1 t2) is executed, term t1 is boosted artificially:
its score contribution is counted multiple times.
If t1 expands into three terms, it is as if the query were issued with
boosts: t1^3 t2
This behaviour boosts expanded terms and may not always be desired,
e.g. when t2 is a content-bearing word.

I think there should be a flag/switch analogous to the relationship
between discountOverlaps and document length.
With this control, the scores of overlapping query terms (tokens with a
position increment of zero) are counted once.
The remaining (non-overlapping) terms are not affected.
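
A minimal sketch of what such a switch could do, in plain Java with no Lucene dependency. Everything here is hypothetical: the class, the flag, and the per-term scores are invented for illustration, and "counted once" is interpreted as taking the best score per query position, which is only one possible reading.

```java
public class DiscountQueryOverlapsSketch {
    // scoresByPosition[p] holds the scores of all matching terms at query position p.
    // Without the flag every term counts; with it each position counts once
    // (here: its maximum score, an assumed interpretation).
    static double score(double[][] scoresByPosition, boolean discountOverlaps) {
        double total = 0.0;
        for (double[] position : scoresByPosition) {
            double sum = 0.0;
            double max = 0.0;
            for (double s : position) {
                sum += s;
                max = Math.max(max, s);
            }
            total += discountOverlaps ? max : sum;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical scores: t1 expanded into three terms, t2 left alone.
        double[][] query = {
            {1.0, 1.0, 1.0}, // position of t1
            {1.0}            // position of t2
        };

        System.out.println("without discount: " + score(query, false)); // 4.0, t1 weighs 3x
        System.out.println("with discount:    " + score(query, true));  // 2.0, t1 and t2 weigh the same
    }
}
```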

Bruno asked for this functionality in the past:
http://find.searchhub.org/document/bb99e435ba35f2b1

What do you think about this? How difficult would it be to implement?
Would this be a Lucene or Solr issue?

Thanks,
Ahmet
