On Nov 21, 2017, at 02:09, [email protected] wrote:

> I have a question regarding the scoring mechanism for relevancy. Is the
> scoring mechanism tf/idf when the field indexed with the EasyAnalyzer in the
> schema? What happens when multiple terms are used? Are tf/idf's summed?
Lucy uses Lucene's Practical Scoring Function by default:

    https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html

Essentially, tf/idf values are summed after being multiplied with each
term's boost and normalization factor.

> How does the incorporate the location of the words to the scoring mechanism
> for queries with multiple words?

> How about the fields which has RegexTokenizer? Is it still the same
> mechanism? Does the type of the tokenizer affect the scoring? I believe the
> important thing is the generated tokens (and not related to the tokenizer),
> and maybe the order of the tokens in a document.

If you use the core Tokenizers, the type of Tokenizer or the location of
terms in a document don’t affect scoring. But you can write a custom
Tokenizer that sets different boost values for each Token, for example
depending on the location within the document.

> One more thing, if I were to change the scoring mechanism for different
> fields, how can I do it? Are there any predefined mechanisms eg. tf/idf
> doc2vec etc. Or if I want to go further and come up with my own how can I do
> it?

You can tweak the scoring formula by supplying your own Similarity subclass
for each FieldType, possibly in conjunction with your own
Query/Compiler/Matcher subclasses:

    https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html

The public documentation for Similarity is incomplete, unfortunately. But
the class is similar to Lucene’s. The .cfh file contains more details:

    https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD

You’d typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.

Nick
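
To make the summation described above concrete, here is a minimal Python sketch of the Practical Scoring Function. It is not Lucy's API: the class SketchSimilarity, the function score, and all parameter names are hypothetical, and the individual formulas are the ones documented for Lucene 3.6's DefaultSimilarity, which Lucy's defaults may or may not match exactly. The method names only mirror TF, IDF, Coord, Length_Norm, and Query_Norm so that the effect of overriding one of them is visible.

```python
import math


class SketchSimilarity:
    """Hypothetical stand-in for a Similarity class (not Lucy's real API).

    Formulas follow Lucene 3.6's documented DefaultSimilarity (assumption:
    Lucy's defaults are similar, as the Lucy docs suggest).
    """

    def tf(self, freq):
        # Term frequency factor: grows with the square root of the count.
        return math.sqrt(freq)

    def idf(self, doc_freq, num_docs):
        # Inverse document frequency: rarer terms weigh more.
        return 1.0 + math.log(num_docs / (doc_freq + 1))

    def coord(self, overlap, max_overlap):
        # Fraction of the query terms that the document matched.
        return overlap / max_overlap

    def length_norm(self, num_tokens):
        # Shorter fields get a higher normalization factor.
        return 1.0 / math.sqrt(num_tokens) if num_tokens else 0.0

    def query_norm(self, sum_of_squared_weights):
        # Makes scores from different queries roughly comparable.
        if sum_of_squared_weights == 0:
            return 1.0
        return 1.0 / math.sqrt(sum_of_squared_weights)


def score(sim, query_terms, doc_tokens, doc_freqs, num_docs, boosts=None):
    """Score one field of one document against a list of distinct query terms."""
    if not query_terms:
        return 0.0
    boosts = boosts or {}
    idfs = {t: sim.idf(doc_freqs.get(t, 0), num_docs) for t in query_terms}
    # Query normalization uses the squared (idf * boost) weight of every term.
    q_norm = sim.query_norm(
        sum((idfs[t] * boosts.get(t, 1.0)) ** 2 for t in query_terms)
    )
    norm = sim.length_norm(len(doc_tokens))
    total, matched = 0.0, 0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        matched += 1
        # Per-term contribution: tf * idf^2 * boost * norm, summed over terms.
        total += sim.tf(freq) * idfs[term] ** 2 * boosts.get(term, 1.0) * norm
    return sim.coord(matched, len(query_terms)) * q_norm * total


class FlatTFSimilarity(SketchSimilarity):
    """Example override: ignore how often a term repeats in the field."""

    def tf(self, freq):
        return 1.0 if freq > 0 else 0.0


if __name__ == "__main__":
    doc = ["apache", "lucy", "search", "lucy"]   # made-up field tokens
    doc_freqs = {"lucy": 1, "search": 3}         # made-up index statistics
    for sim in (SketchSimilarity(), FlatTFSimilarity()):
        print(type(sim).__name__, score(sim, ["lucy", "search"], doc, doc_freqs, 10))
```

Note that token positions never enter the calculation, which is why term location has no effect unless a custom Tokenizer assigns per-Token boosts. Swapping in FlatTFSimilarity changes only the tf factor; the rest of the sum is untouched, which is the same kind of change a per-FieldType Similarity subclass would make.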
