Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Romaric Pighetti
Yes, I meant the frequency of the term inside the document, sorry. That is what I was afraid of. Thank you for your help and advice. I will look into implementing it as a query, because I need to extract that value for other processes and doing it in Solr/Lucene is more convenient for me.

Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Adrien Grand
Did you mean termFreq rather than docFreq? I'm afraid that this scoring function can't be implemented as a Similarity given Lucene's new requirement that scores must
 - be non-negative
 - not increase when the norm increases.
We had to remove a number of DFR similarities that we used to support
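
The two constraints mentioned in this message can be spot-checked numerically. The following is a minimal plain-Java sketch, not Lucene code: it treats the norm simply as a document length (an assumption, since Lucene stores norms as encoded longs), and the class name `ScoreConstraintCheck`, the method `holdsOnGrid`, and both sample scoring functions are hypothetical and used only for illustration.

```java
import java.util.function.DoubleBinaryOperator;

// Hypothetical helper: samples a scoring function on a small grid and reports
// whether it stays non-negative and never increases as the length grows.
final class ScoreConstraintCheck {

    // score takes (freq, length) and returns a score.
    // Only lengths >= freq are sampled, since a term cannot occur more often
    // than the document has tokens.
    static boolean holdsOnGrid(DoubleBinaryOperator score, int maxFreq, int maxLength) {
        for (int freq = 1; freq <= maxFreq; freq++) {
            double previous = Double.POSITIVE_INFINITY;
            for (int length = freq; length <= maxLength; length++) {
                double s = score.applyAsDouble(freq, length);
                if (s < 0 || s > previous) {
                    return false; // negative score, or score grew with length
                }
                previous = s;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Toy function that is positive and shrinks as the document gets longer.
        DoubleBinaryOperator saturating = (freq, length) -> freq / (freq + length);
        // Toy function whose logarithm is taken of a value below 1, so it is negative.
        DoubleBinaryOperator logRatio = (freq, length) -> Math.log(0.5 * freq / length);

        System.out.println("saturating holds: " + holdsOnGrid(saturating, 10, 100));
        System.out.println("logRatio holds:   " + holdsOnGrid(logRatio, 10, 100));
    }
}
```

A grid check like this can only find counterexamples; passing it does not prove that a function satisfies the constraints everywhere.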

Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Romaric Pighetti
Hi Adrien, I thought about merging the two into one value that I could use in the scoring function but failed to find a way to do so. The scoring function is: log(1 + max(docFreq - delta, 0) / (delta * d_u * p(w|c))) + log(delta * d_u / |d|) (also included as an image below)
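
To make the formula in this message easier to experiment with, here is a minimal plain-Java sketch that uses no Lucene classes. It assumes that d_u is the number of unique terms in the document, |d| is the document length, p(w|c) is the collection probability of the term, and delta is a discount between 0 and 1. The class name `AbsoluteDiscountScore` and all parameter names are hypothetical.

```java
// Plain-Java sketch of the scoring formula quoted in the message.
// All names are hypothetical and chosen only for illustration.
final class AbsoluteDiscountScore {

    // freq:           the frequency term of the formula
    // delta:          the discount (assumed to lie in (0, 1))
    // uniqueTerms:    d_u, assumed to be the number of unique terms in the document
    // collectionProb: p(w|c), assumed to be the collection probability of the term
    // docLength:      |d|, the document length
    static double score(double freq, double delta, double uniqueTerms,
                        double collectionProb, double docLength) {
        double discounted = Math.max(freq - delta, 0.0);
        double first = Math.log(1.0 + discounted / (delta * uniqueTerms * collectionProb));
        double second = Math.log(delta * uniqueTerms / docLength);
        return first + second;
    }

    public static void main(String[] args) {
        // A frequent term in the collection, seen once in the document.
        System.out.println(score(1, 0.7, 50, 0.01, 100));
        // A rare term in the collection, seen five times in the document.
        System.out.println(score(5, 0.7, 50, 0.0001, 100));
    }
}
```

Under these assumptions, delta * d_u / |d| is below 1 whenever delta < 1 and d_u <= |d|, so the second logarithm is negative and the sum can drop below zero when the first term is small, as in the first call above.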

Re: Distinct terms within a document for new Similarity class

2019-05-21 Thread Adrien Grand
Hi Romaric, Indeed, similarities are not expected to create doc value fields; they should only populate norms. The similarity API changed in 8.0 and similarities no longer have access to the reader context; they are now expected to work with only term frequency and a length normalization

Re: Distinct terms within a document for new Similarity class

2019-05-21 Thread Romaric Pighetti
Hi, Thanks Adrien for the quick and accurate answer. Digging into the implementation, I saw that the document length is already stored there, and as I need both the unique term count and the length, I can't just replace one with the other. The Similarity class documentation states that it is

Re: Distinct terms within a document for new Similarity class

2019-05-20 Thread Adrien Grand
Hi Romaric, You are right, computeNorm is the right place to compute and record the number of unique terms of a document. Your computeNorm function would look something like this: @Override public final long computeNorm(FieldInvertState state) { return

Distinct terms within a document for new Similarity class

2019-05-20 Thread Romaric Pighetti
Hi, I am currently implementing a new similarity class in Lucene which is based on a language model with absolute discount. I am basing my work on the work already done in the LMDirichletSimilarity and LMJelinekMercerSimilarity, which are really close. However, to finish my implementation I need