Thank you very much, Nick, for your response. I would like to ask two more questions:

1- Are the tf/idf scores consistent across all segments in a non-optimized index? Or are they calculated separately for each segment (tf would not change, but idf might differ)?

2- The same question, but for multiple indexes and PolySearcher: if I use PolySearcher with two or more indexes, will the tf/idf scores be consistent? Or would they be calculated separately for each index?
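
To make the question concrete, here is a minimal sketch in Python (not Lucy code) of the effect I am asking about. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)), from the Similarity documentation linked in the quoted message; whether Lucy computes idf exactly this way, and whether it does so per segment, is precisely what I do not know. The segment sizes and document frequencies are made up.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style idf (assumed): 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical segments (or indexes): (total docs, docs containing the term).
segments = [(1000, 10), (50, 10)]

# idf if each segment only sees its own statistics.
per_segment = [idf(df, n) for n, df in segments]

# idf if statistics are pooled across all segments before scoring.
total_docs = sum(n for n, _ in segments)
total_df = sum(df for _, df in segments)
global_idf = idf(total_df, total_docs)

print("per-segment idf:", per_segment)
print("pooled idf:", global_idf)
```

If idf were computed per segment, the same term with the same tf would score higher in the large segment than in the small one, so the merged hit list would not be ordered on a common scale.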
Regards,
Serkan

On 2017-11-21 01:49, Nick Wellnhofer <[email protected]> wrote:
>
> On Nov 21, 2017, at 02:09 , [email protected] wrote:
> > I have a question regarding the scoring mechanism for relevancy. Is the
> > scoring mechanism tf/idf when the field indexed with the EasyAnalyzer in
> > the schema? What happens when multiple terms are used? Are tf/idf's summed?
>
> Lucy uses Lucene's Practical Scoring Function by default:
>
> https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
>
> Essentially, tf/idf values are summed after being multiplied with each term's
> boost and normalization factor.
>
> > How does the incorporate the location of the words to the scoring mechanism
> > for queries with multiple words?
>
> > How about the fields which has RegexTokenizer? Is it still the same
> > mechanism? Does the type of the tokenizer affect the scoring? I believe
> > the important thing is the generated tokens (and not related to the
> > tokenizer), and maybe the order of the tokens in a document.
>
> If you use the core Tokenizers, the type of Tokenizer or the location of
> terms in a document don't affect scoring. But you can write a custom
> Tokenizer that sets different boost values for each Token, for example
> depending on the location within the document.
>
> > One more thing, if I were to change the scoring mechanism for different
> > fields, how can I do it? Are there any predefined mechanisms eg. tf/idf
> > doc2vec etc. Or if I want to go further and come up with my own how can I
> > do it?
>
> You can tweak the scoring formula by supplying your own Similarity subclass
> for each FieldType, possibly in conjunction with your own
> Query/Compiler/Matcher subclasses:
>
> https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html
>
> The public documentation for Similarity is incomplete, unfortunately. But the
> class is similar to Lucene's. The .cfh file contains more details:
>
> https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD
>
> You'd typically override methods TF, IDF, Coord, Length_Norm, or Query_Norm.
>
> Nick
