Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Romaric Pighetti

Yes, I meant the frequency of the term inside the document, sorry. That is what I was afraid of. Thank you for your help and advice. I will look into implementing it as a query, because I need to extract that value for other processes and doing it inside Solr/Lucene is more convenient for me.

Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Adrien Grand

Did you mean termFreq rather than docFreq? I'm afraid that this scoring function can't be implemented as a Similarity given Lucene's new requirement that scores must (1) be non-negative and (2) not increase when the norm increases. We had to remove a number of DFR similarities that we used to support

Re: Distinct terms within a document for new Similarity class

2019-05-22 Thread Romaric Pighetti

Hi Adrien, I thought about merging the two into one value that I could use in the scoring function but failed to find a way to do so. The scoring function is: log(1 + max(docFreq - delta, 0) / (delta * d_u * p(w|c))) + log(delta * d_u / |d|) (also included as an image below)
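As a rough illustration of the formula quoted in this message, here is a minimal plain-Java sketch that does not use Lucene. The class name, method name, and parameter names are hypothetical, and the input values are made up. It uses the within-document term frequency as the first argument, which the later messages in the thread clarify was the intent. It only evaluates the expression as written, and shows that some inputs give a negative value, which is relevant to the non-negativity requirement mentioned in the reply.

```java
final class AbsoluteDiscountSketch {

    /**
     * Evaluates log(1 + max(freq - delta, 0) / (delta * d_u * p)) + log(delta * d_u / |d|).
     * All names are illustrative assumptions, not Lucene API:
     *   freq           term frequency inside the document
     *   delta          discount parameter, assumed to be in (0, 1)
     *   uniqueTerms    d_u, number of distinct terms in the document
     *   docLength      |d|, total number of terms in the document
     *   collectionProb p, probability of the term in the collection model
     */
    static double score(double freq, double delta, int uniqueTerms,
                        int docLength, double collectionProb) {
        double discounted = Math.max(freq - delta, 0.0);
        double termPart = Math.log(1.0 + discounted / (delta * uniqueTerms * collectionProb));
        // Negative whenever delta * uniqueTerms < docLength.
        double lengthPart = Math.log(delta * uniqueTerms / (double) docLength);
        return termPart + lengthPart;
    }

    public static void main(String[] args) {
        // Made-up inputs: a frequent collection term seen once in a long document.
        double low = score(1, 0.7, 50, 200, 0.5);
        // Made-up inputs: a rare collection term seen often in a short document.
        double high = score(10, 0.7, 50, 60, 0.001);
        System.out.println("low  = " + low + " (negative: " + (low < 0) + ")");
        System.out.println("high = " + high + " (negative: " + (high < 0) + ")");
    }
}
```

With delta below 1 and d_u never larger than |d|, the second logarithm is always negative, so the sum is negative whenever the first logarithm is small enough.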

Re: Distinct terms within a document for new Similarity class

2019-05-21 Thread Adrien Grand

Hi Romaric, Indeed, similarities are not expected to create doc value fields; they should only populate norms. The similarity API was changed in 8.0, and similarities no longer have access to the reader context; they are now expected to work with only term frequency and a length normalization

Re: Distinct terms within a document for new Similarity class

2019-05-21 Thread Romaric Pighetti

Hi, Thanks Adrien for the quick and accurate answer. Digging into the implementation, I saw that the document length is already stored there, and since I need both the unique term count and the length, I can't just replace one with the other. The Similarity class documentation states that it is

Re: Distinct terms within a document for new Similarity class

2019-05-20 Thread Adrien Grand

Hi Romaric, You are right, computeNorm is the right place to compute and record the number of unique terms of a document. Your computeNorm function would look something like this: @Override public final long computeNorm(FieldInvertState state) { return
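The snippet in this message is cut off, and what it returns is not reproduced here. As a Lucene-independent illustration of the two per-document quantities the thread discusses (document length and number of unique terms), here is a small sketch; the class and method names are hypothetical, and the token list is made up. In Lucene itself, FieldInvertState exposes comparable values through getLength() and getUniqueTermCount().

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class DocStatsSketch {

    /** Total number of tokens in the document, duplicates included (|d| in the thread). */
    static int length(List<String> tokens) {
        return tokens.size();
    }

    /** Number of distinct terms in the document (d_u in the thread). */
    static int uniqueTermCount(List<String> tokens) {
        return new HashSet<>(tokens).size();
    }

    public static void main(String[] args) {
        // Made-up, already-tokenized document.
        List<String> tokens = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println("length = " + length(tokens));
        System.out.println("unique = " + uniqueTermCount(tokens));
    }
}
```

The two values differ as soon as any term repeats, which is why one cannot stand in for the other when a scoring function needs both.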

Distinct terms within a document for new Similarity class

2019-05-20 Thread Romaric Pighetti

Hi, I am currently implementing a new similarity class in Lucene, based on a language model with absolute discounting. I am basing my work on the existing LMDirichletSimilarity and LMJelinekMercerSimilarity, which are really close. However, to finish my implementation I need
