Re: [lucy-user] C library - Scoring mechanism

Nick Wellnhofer Tue, 21 Nov 2017 01:49:29 -0800

On Nov 21, 2017, at 02:09, [email protected] wrote:
> I have a question regarding the scoring mechanism for relevancy. Is the 
> scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer in 
> the schema? What happens when multiple terms are used? Are the tf/idf values 
> summed?


Lucy uses Lucene's Practical Scoring Function by default:

https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, the score is the sum over the query terms of each term's tf/idf 
value multiplied by that term's boost and normalization factor.

> How does it incorporate the location of the words into the scoring 
> mechanism for queries with multiple words?

> How about fields which use a RegexTokenizer? Is it still the same 
> mechanism? Does the type of the tokenizer affect the scoring? I believe the 
> important thing is the generated tokens (and not the tokenizer itself), 
> and maybe the order of the tokens in a document.

If you use the core Tokenizers, neither the type of Tokenizer nor the location 
of terms in a document affects scoring. But you can write a custom Tokenizer 
that sets a different boost value for each Token, for example depending on 
the Token's location within the document.
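
As a rough illustration of position-dependent boosts, the following plain C 
sketch raises the boost of the first few tokens of a field and leaves the 
rest neutral. It is not Lucy's Tokenizer API: SketchToken and 
boost_leading_tokens are hypothetical names, and a real custom Tokenizer 
would set the boost on Lucy's own Token objects instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token record; Lucy's real Token class differs. */
typedef struct {
    const char *text;   /* token text */
    size_t      pos;    /* zero-based token position within the field */
    float       boost;  /* per-token boost, 1.0 = neutral */
} SketchToken;

/* Give the first `lead` positions the boost `lead_boost`; reset all
 * later tokens to the neutral boost of 1.0. */
void boost_leading_tokens(SketchToken *tokens, size_t n,
                          size_t lead, float lead_boost) {
    assert(tokens != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        tokens[i].boost = tokens[i].pos < lead ? lead_boost : 1.0f;
    }
}
```

With lead = 2 and lead_boost = 2.0, a match in the first two tokens of a 
field would count twice as much as a match further down.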

> One more thing, if I were to change the scoring mechanism for different 
> fields, how can I do it? Are there any predefined mechanisms, e.g. tf/idf, 
> doc2vec, etc.? Or if I want to go further and come up with my own, how can I 
> do it?

You can tweak the scoring formula by supplying your own Similarity subclass for 
each FieldType, possibly in conjunction with your own Query/Compiler/Matcher 
subclasses:

https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But the 
class is similar to Lucene’s. The .cfh file contains more details:

https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override the methods TF, IDF, Coord, Length_Norm, or Query_Norm.

Nick
