Re: [lucy-user] C library - Scoring mechanism

2017-11-27 Thread serkanmula...@gmail.com
Thank you very much Nick and Marvin. Your replies were really helpful.

On 2017-11-23 11:38, Marvin Humphrey  wrote: 
> On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer  wrote:
> > On 21/11/2017 18:42, serkanmula...@gmail.com wrote:
> 
> >> 2- (same question but for multiple indexes and polysearcher) If I use
> >> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
> >> Or would they be calculated separately for each index?
> >
> > I don't know off the top of my head. It's possible that indexes are searched
> > separately and the results are simply merged by normalized score. I'd have
> > to look at the code to answer the question, but maybe Marvin can chime in.
> 
> The scores will be consistent.
> 
> To calculate IDF for a term accurately across a composite corpus
> formed from multiple indexes, you need to know two things:
> 
> 1. The total number of documents in the corpus. (Doc_Max())
> 2. The total number of documents which contain the term. (Doc_Freq(field, term))
> 
> Both PolySearcher and ClusterSearcher calculate their doc_max on
> construction by summing the doc_max totals of all subsearchers.
> Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
> responses for all subsearchers.
> 
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
> https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
> https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
> 
> This approach trades away some performance for the sake of accuracy,
> particularly with Doc_Freq -- query normalization takes longer when
> you have to wait for a lot of subsearchers to report Doc_Freq numbers
> for N terms. However, the alternative is occasional bizarre search
> results.
> 
> The best anecdote I ever heard illustrating why it's important to
> calculate aggregate IDF consistently was an application searching a
> multi-shard index containing news articles split by year.  If you
> searched for "iphone", it would be a very common term after the first
> release of the Apple iPhone. However, in the years prior to the Apple
> iPhone's release, if "iphone" existed in a shard it was likely a typo,
> so it would be very rare **and thus heavily weighted**. So the top hit
> for "iphone", without consistent IDF calculation, would be a typo'd
> article.
> 
> (A performance improvement on this stratagem is to create a shared
> Doc_Freq source. So long as it contains all the common terms across
> all shards, it doesn't have to be updated often -- Doc_Freq values
> don't change very fast as indexes are updated.)
> 
> Marvin Humphrey
> 


Re: [lucy-user] C library - Scoring mechanism

2017-11-23 Thread Marvin Humphrey
On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer  wrote:
> On 21/11/2017 18:42, serkanmula...@gmail.com wrote:

>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
>
> I don't know off the top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.

The scores will be consistent.

To calculate IDF for a term accurately across a composite corpus
formed from multiple indexes, you need to know two things:

1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term. (Doc_Freq(field, term))

Both PolySearcher and ClusterSearcher calculate their doc_max on
construction by summing the doc_max totals of all subsearchers.
Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
responses for all subsearchers.

https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
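
The summation described above can be sketched in a few lines. This is a
minimal illustration, not Lucy code: the dictionaries stand in for
subsearchers, the numbers are made up, and the IDF formula is Lucene's
classic 1 + ln(doc_max / (doc_freq + 1)), which is assumed here rather than
taken from the linked sources.

```python
import math

# Hypothetical per-index statistics; names and numbers are illustrative only.
subsearchers = [
    {"doc_max": 1000, "doc_freq": {"lucy": 10}},
    {"doc_max": 3000, "doc_freq": {"lucy": 90}},
]

def aggregate_doc_max(subs):
    # Corpus size is the sum of every subsearcher's doc_max.
    return sum(s["doc_max"] for s in subs)

def aggregate_doc_freq(subs, term):
    # Term frequency across the corpus is the sum of per-index doc_freq.
    return sum(s["doc_freq"].get(term, 0) for s in subs)

def idf(doc_freq, doc_max):
    # Assumed formula (Lucene's classic IDF); Lucy may differ in detail.
    return 1.0 + math.log(doc_max / (doc_freq + 1))

doc_max = aggregate_doc_max(subsearchers)
doc_freq = aggregate_doc_freq(subsearchers, "lucy")
shared_idf = idf(doc_freq, doc_max)  # one IDF value used for every index
```

Because every index is scored with the same shared_idf, hits from different
indexes can be merged by score without further adjustment.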

This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when
you have to wait for a lot of subsearchers to report Doc_Freq numbers
for N terms. However, the alternative is occasional bizarre search
results.

The best anecdote I ever heard illustrating why it's important to
calculate aggregate IDF consistently was an application searching a
multi-shard index containing news articles split by year.  If you
searched for "iphone", it would be a very common term after the first
release of the Apple iPhone. However, in the years prior to the Apple
iPhone's release, if "iphone" existed in a shard it was likely a typo,
so it would be very rare **and thus heavily weighted**. So the top hit
for "iphone", without consistent IDF calculation, would be a typo'd
article.
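
To put rough numbers on the anecdote, here is a small sketch comparing
per-shard IDF with aggregate IDF. The shard sizes and document frequencies
are invented for illustration, and the IDF formula is again the classic
Lucene one, assumed rather than quoted from Lucy.

```python
import math

def idf(doc_freq, doc_max):
    # Assumed classic formula: 1 + ln(doc_max / (doc_freq + 1)).
    return 1.0 + math.log(doc_max / (doc_freq + 1))

# Invented statistics for the term "iphone" in two yearly shards.
shards = {
    "2005": {"doc_max": 100_000, "doc_freq": 2},      # typos only
    "2010": {"doc_max": 100_000, "doc_freq": 5_000},  # common term
}

# Inconsistent approach: each shard computes IDF from its own statistics.
local_idf = {year: idf(s["doc_freq"], s["doc_max"]) for year, s in shards.items()}

# Consistent approach: sum the statistics first, then compute a single IDF.
total_docs = sum(s["doc_max"] for s in shards.values())
total_freq = sum(s["doc_freq"] for s in shards.values())
global_idf = idf(total_freq, total_docs)
```

With per-shard statistics the 2005 shard assigns "iphone" a much larger
weight than the 2010 shard does, so its typo'd article can outrank genuine
iPhone coverage. With the aggregate value both shards use the same weight.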

(A performance improvement on this stratagem is to create a shared
Doc_Freq source. So long as it contains all the common terms across
all shards, it doesn't have to be updated often -- Doc_Freq values
don't change very fast as indexes are updated.)
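
One way such a shared Doc_Freq source might look is sketched below. The
class name, its methods, and the fallback to asking the shards for uncached
terms are all assumptions made for illustration; nothing like this is
described in more detail in the thread.

```python
class SharedDocFreq:
    """Hypothetical cache of aggregate doc_freq values for common terms."""

    def __init__(self, shards, common_terms):
        self.shards = shards                  # each shard: {term: doc_freq}
        self.common_terms = list(common_terms)
        self.remote_calls = 0                 # counts simulated shard requests
        self.refresh()

    def refresh(self):
        # Rebuild the snapshot; this only needs to run occasionally.
        self.cache = {t: self._ask_shards(t) for t in self.common_terms}

    def _ask_shards(self, term):
        # Stand-in for waiting on every subsearcher to report Doc_Freq.
        self.remote_calls += len(self.shards)
        return sum(shard.get(term, 0) for shard in self.shards)

    def doc_freq(self, term):
        if term in self.cache:
            return self.cache[term]           # no shard round trips
        return self._ask_shards(term)         # rare term: ask the shards


shards = [{"iphone": 2, "apple": 300}, {"iphone": 5000, "apple": 400}]
source = SharedDocFreq(shards, ["iphone", "apple"])
calls_after_build = source.remote_calls
cached_freq = source.doc_freq("iphone")
calls_after_cached = source.remote_calls
rare_freq = source.doc_freq("zune")
calls_after_rare = source.remote_calls
```

Lookups for cached terms cost no shard requests, while a term missing from
the snapshot still gets an accurate answer at the usual price.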

Marvin Humphrey


Re: [lucy-user] C library - Scoring mechanism

2017-11-22 Thread Nick Wellnhofer

On 21/11/2017 18:42, serkanmula...@gmail.com wrote:

1- Are the tf/idf scores consistent across all the segments in a non-optimized 
index? Or are they calculated separately for each segment (tf would not 
change but idf might be different)?


tf/idf is computed for the whole index.


2- (same question but for multiple indexes and polysearcher) If I use 
polysearcher with 2 or more indexes, will the tf/idf scores be consistent? Or 
would they be calculated separately for each index?


I don't know off the top of my head. It's possible that indexes are searched 
separately and the results are simply merged by normalized score. I'd have to 
look at the code to answer the question, but maybe Marvin can chime in.


Nick


Re: [lucy-user] C library - Scoring mechanism

2017-11-21 Thread serkanmula...@gmail.com
Thank you very much Nick for your response.

I would like to ask two more questions:
1- Are the tf/idf scores consistent across all the segments in a non-optimized 
index? Or are they calculated separately for each segment (tf would not 
change but idf might be different)?
2- (same question but for multiple indexes and polysearcher) If I use 
polysearcher with 2 or more indexes, will the tf/idf scores be consistent? Or 
would they be calculated separately for each index?

Regards,
Serkan

On 2017-11-21 01:49, Nick Wellnhofer  wrote: 
> 
> On Nov 21, 2017, at 02:09 , serkanmula...@gmail.com wrote:
> > I have a question regarding the scoring mechanism for relevancy. Is the 
> > scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer in 
> > the schema? What happens when multiple terms are used? Are tf/idf's summed?
> 
> Lucy uses Lucene's Practical Scoring Function by default:
> 
> https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
> 
> Essentially, tf/idf values are summed after being multiplied with each term's 
> boost and normalization factor.
> 
> > How does it incorporate the location of the words into the scoring 
> > mechanism for queries with multiple words?
> 
> > How about the fields which have a RegexTokenizer? Is it still the same 
> > mechanism? Does the type of the tokenizer affect the scoring?  I believe 
> > the important thing is the generated tokens (and not related to the 
> > tokenizer), and maybe the order of the tokens in a document.
> 
> If you use the core Tokenizers, neither the type of Tokenizer nor the 
> location of terms in a document affects scoring. But you can write a custom 
> Tokenizer that sets different boost values for each Token, for example 
> depending on the location within the document.
> 
> > One more thing, if I were to change the scoring mechanism for different 
> > fields, how can I do it? Are there any predefined mechanisms, e.g. tf/idf, 
> > doc2vec, etc.? Or if I want to go further and come up with my own, how can 
> > I do it?
> 
> You can tweak the scoring formula by supplying your own Similarity subclass 
> for each FieldType, possibly in conjunction with your own 
> Query/Compiler/Matcher subclasses:
> 
> https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html
> 
> The public documentation for Similarity is incomplete, unfortunately. But the 
> class is similar to Lucene’s. The .cfh file contains more details:
> 
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD
> 
> You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
> 
> Nick
> 
> 


Re: [lucy-user] C library - Scoring mechanism

2017-11-21 Thread Nick Wellnhofer

On Nov 21, 2017, at 02:09 , serkanmula...@gmail.com wrote:
> I have a question regarding the scoring mechanism for relevancy. Is the 
> scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer in 
> the schema? What happens when multiple terms are used? Are tf/idf's summed?

Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, tf/idf values are summed after being multiplied with each term's 
boost and normalization factor.
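
As a rough sketch of that formula, the code below follows the Lucene 3.6
documentation linked above: score = coord * queryNorm * sum over terms of
tf * idf^2 * boost * norm. The function names, the data layout, and the
example numbers are assumptions for illustration, and Lucy's implementation
may differ in detail (for example in how norms are encoded).

```python
import math

def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, doc_max):
    return 1.0 + math.log(doc_max / (doc_freq + 1))

def length_norm(num_tokens):
    return 1.0 / math.sqrt(num_tokens)

def score(query_terms, doc, doc_max):
    """query_terms: {term: (boost, doc_freq)}
    doc: {"freqs": {term: freq}, "length": token count of the field}"""
    # Query normalization: 1 / sqrt(sum of squared term weights).
    sum_sq = sum((idf(df, doc_max) * boost) ** 2
                 for boost, df in query_terms.values())
    query_norm = 1.0 / math.sqrt(sum_sq)
    norm = length_norm(doc["length"])

    total = 0.0
    matched = 0
    for term, (boost, df) in query_terms.items():
        freq = doc["freqs"].get(term, 0)
        if freq == 0:
            continue
        matched += 1
        total += tf(freq) * idf(df, doc_max) ** 2 * boost * norm

    # coord rewards documents that match more of the query terms.
    coord = matched / len(query_terms)
    return coord * query_norm * total


query = {"apple": (1.0, 50), "pie": (1.0, 200)}
doc_both = {"freqs": {"apple": 1, "pie": 1}, "length": 100}
doc_one = {"freqs": {"apple": 1}, "length": 100}
doc_none = {"freqs": {"cake": 3}, "length": 100}
score_both = score(query, doc_both, doc_max=1000)
score_one = score(query, doc_one, doc_max=1000)
score_none = score(query, doc_none, doc_max=1000)
```

A document matching both terms outscores one matching a single term, both
because its sum has more entries and because its coord factor is larger.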

> How does it incorporate the location of the words into the scoring mechanism 
> for queries with multiple words?

> How about the fields which have a RegexTokenizer? Is it still the same 
> mechanism? Does the type of the tokenizer affect the scoring?  I believe the 
> important thing is the generated tokens (and not related to the tokenizer), 
> and maybe the order of the tokens in a document.

If you use the core Tokenizers, neither the type of Tokenizer nor the location 
of terms in a document affects scoring. But you can write a custom Tokenizer 
that sets different boost values for each Token, for example depending on the 
location within the document.
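
The idea of a location-dependent boost can be illustrated with a standalone
sketch. It does not use Lucy's Tokenizer or Token classes; the function
name, the parameters, and the dictionary layout are hypothetical and only
show the kind of logic a custom Tokenizer could apply.

```python
import re

def tokenize_with_position_boost(text, lead_tokens=5, lead_boost=2.0):
    """Split text into word tokens and give the first few a higher boost."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        # Tokens near the start of the document get lead_boost, the rest 1.0.
        boost = lead_boost if position < lead_tokens else 1.0
        tokens.append({
            "text": match.group().lower(),
            "pos": position,
            "boost": boost,
        })
    return tokens


tokens = tokenize_with_position_boost(
    "Apple unveils phone today in San Francisco", lead_tokens=2)
```

Here the first two tokens carry a boost of 2.0 and the remaining ones 1.0.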

> One more thing, if I were to change the scoring mechanism for different 
> fields, how can I do it? Are there any predefined mechanisms, e.g. tf/idf, 
> doc2vec, etc.? Or if I want to go further and come up with my own, how can I 
> do it?

You can tweak the scoring formula by supplying your own Similarity subclass for 
each FieldType, possibly in conjunction with your own Query/Compiler/Matcher 
subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the 
class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
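
The override pattern can be shown with a Python analogy. This is not the
Lucy API: the class names and method signatures below are assumptions that
only mirror the method names TF and Length_Norm, and the real signatures
should be taken from Similarity.cfh.

```python
import math

class BaseSimilarity:
    """Stand-in for a default Similarity with classic tf and length norm."""

    def tf(self, freq):
        return math.sqrt(freq)

    def length_norm(self, num_tokens):
        # Longer fields are penalized: 1 / sqrt(token count).
        return 1.0 / math.sqrt(num_tokens)

    def term_weight(self, freq, num_tokens):
        return self.tf(freq) * self.length_norm(num_tokens)


class FlatLengthSimilarity(BaseSimilarity):
    """Variant for fields where length should not matter, e.g. titles."""

    def length_norm(self, num_tokens):
        return 1.0


default_weight = BaseSimilarity().term_weight(freq=4, num_tokens=4)
flat_weight = FlatLengthSimilarity().term_weight(freq=4, num_tokens=4)
```

Only length_norm is overridden, so tf is inherited unchanged. In Lucy the
equivalent subclass would be attached to the FieldType of the field it
should apply to, as described above.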

Nick