Yes, I meant the frequency of the term inside the document, sorry.
That is what I was afraid of.
Thank you for your help and advice. I will look into implementing it as
a query, because I need to extract that value for other processes and
doing it in Solr/Lucene is more convenient for me.
Did you mean termFreq rather than docFreq?
I'm afraid that this scoring function can't be implemented as a
Similarity, given Lucene's new requirements that scores must
- be non-negative
- not increase when the norm increases.
We had to remove a number of DFR similarities that we used to support.
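For intuition, here is a hypothetical pure-Java check (the parameter names follow the absolute-discount formula quoted in this thread; none of this is Lucene API) showing that the second term, log(delta * d_u / |d|), can make the total score negative for a common term in a long document:

```java
// Hypothetical numeric check, not Lucene code: parameter names mirror the
// absolute-discount formula from this thread. It shows the score can go
// negative, which the Lucene 8 Similarity contract forbids.
public class AbsoluteDiscountCheck {
    // log(1 + max(tf - delta, 0) / (delta * dU * pWC)) + log(delta * dU / |d|)
    static double score(double tf, double delta, double dU, double pWC, double docLen) {
        return Math.log(1 + Math.max(tf - delta, 0) / (delta * dU * pWC))
             + Math.log(delta * dU / docLen);
    }

    public static void main(String[] args) {
        // A common term (p(w|c) = 0.01) occurring once in a 200-term document
        // with 50 unique terms and delta = 0.7: the second log term,
        // log(35 / 200), dominates and the total score is negative.
        System.out.println(score(1, 0.7, 50, 0.01, 200));
    }
}
```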
Hi Adrien,
I thought about merging the two into one value that I could use in the
scoring function but failed to find a way to do so.
The scoring function is:
log(1 + max(docFreq - delta, 0) / (delta * d_u * p(w|c))) + log(delta * d_u / |d|)
(also included as an image below)
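One way to merge the two counts, sketched below under the assumption that both fit in 32 bits, is to bit-pack them into the single long that computeNorm returns. The helper names here are illustrative, not Lucene API, and note that a full 64-bit norm will typically cost more index space than the compressed encoding the built-in similarities use:

```java
// Hypothetical sketch: pack the unique-term count and the document length
// into one long, since a Similarity's computeNorm returns a single long.
// packNorm / uniqueTermCount / docLength are illustrative names, not Lucene API.
public class NormPacking {
    static long packNorm(int uniqueTermCount, int docLength) {
        // upper 32 bits: unique-term count, lower 32 bits: document length
        return ((long) uniqueTermCount << 32) | (docLength & 0xFFFFFFFFL);
    }

    static int uniqueTermCount(long norm) {
        return (int) (norm >>> 32);
    }

    static int docLength(long norm) {
        return (int) norm; // lower 32 bits
    }

    public static void main(String[] args) {
        long norm = packNorm(120, 4096);
        System.out.println(uniqueTermCount(norm) + " " + docLength(norm)); // 120 4096
    }
}
```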
Hi Romaric,
Indeed, similarities are not expected to create doc value fields; they
should only populate norms. The similarity API was changed in 8.0, and
similarities no longer have access to the reader context: they are now
expected to work with only a term frequency and a length normalization
factor.
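To make that contract concrete, here is a minimal stand-in (illustrative names only, not the real org.apache.lucene.search.similarities API) in which the scorer sees nothing but the term frequency and the long norm recorded at index time:

```java
// Minimal stand-in for the Lucene 8 scoring contract: the scorer receives
// only the term frequency and the encoded norm. Class and method names are
// illustrative, not the real org.apache.lucene.search.similarities API.
public class FreqNormScorer {
    // In real Lucene the norm is whatever computeNorm(FieldInvertState)
    // recorded at index time; here we simply treat it as the document length.
    static float score(float freq, long norm) {
        double docLength = (double) norm;
        // Toy length-normalized tf score: tf / (tf + |d|). Longer documents
        // (larger norms) never increase the score, as the contract requires.
        return (float) (freq / (freq + docLength));
    }

    public static void main(String[] args) {
        System.out.println(score(3f, 100L));
    }
}
```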
Hi,
Thanks Adrien for the quick and accurate answer.
Digging into the implementation, I saw that the document length is
already stored there, and since I need both the unique term count and
the document length, I can't just replace one with the other.
The Similarity class documentation states that it is
Hi Romaric,
You are right, computeNorm is the right place to compute and record
the number of unique terms of a document. Your computeNorm function
would look something like this:
@Override
public final long computeNorm(FieldInvertState state) {
  return state.getUniqueTermCount();
}
Hi,
I am currently implementing a new similarity class in Lucene which is
based on a language model with absolute discounting.
I am basing my work on the existing LMDirichletSimilarity and
LMJelinekMercerSimilarity, which are really close to what I need.
However, to finish my implementation I need