During indexing, an inverted index is built from the terms of the documents,
and the term frequency, document frequency, etc. are stored. If I understand
correctly, the exact document length is not stored in the index, to keep the
index small. Instead, a normalized length is stored for each document.
However, for
The BM25 similarity computes the normalized length as the number of tokens,
ignoring synonyms (tokens at the same position).
Then it encodes this length as an 8-bit integer in the index using this
logic:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/SmallFl
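
To make the 8-bit encoding concrete, here is a minimal, self-contained sketch of this kind of lossy scheme. It is written from my reading of the linked class and is not copied from it, so treat the class name NormLengthSketch, the method names encode/decode, and the exact bit layout (3 mantissa bits plus a shift) as assumptions. Check the linked source for the authoritative logic.

```java
/** Hypothetical sketch of a lossy int-to-byte length encoding; not Lucene's actual source. */
final class NormLengthSketch {

    // Number of small values stored exactly, so that the largest int still fits in 8 bits.
    static final int NUM_FREE_VALUES = 255 - longToInt4(Integer.MAX_VALUE);

    // Encode a non-negative long as a tiny float: 3 explicit mantissa bits plus a shift.
    static int longToInt4(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        int numBits = 64 - Long.numberOfLeadingZeros(value);
        if (numBits < 4) {
            return (int) value; // fewer than 4 significant bits: store as-is
        }
        int shift = numBits - 4;
        // Keep the 4 most significant bits, then drop the leading 1 (it is implicit).
        int encoded = (int) (value >>> shift) & 0x07;
        // Store shift + 1, because a shift field of 0 marks the "as-is" case above.
        return encoded | ((shift + 1) << 3);
    }

    // Inverse of longToInt4; returns the lower bound of the encoded bucket.
    static long int4ToLong(int encoded) {
        long bits = encoded & 0x07;
        int shift = (encoded >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    // Encode a document length (token count) into a single byte.
    static byte encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length: " + length);
        }
        if (length < NUM_FREE_VALUES) {
            return (byte) length; // short lengths are kept exactly
        }
        return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
    }

    // Decode the byte back into an approximate length (rounded down).
    static int decode(byte b) {
        int i = Byte.toUnsignedInt(b);
        if (i < NUM_FREE_VALUES) {
            return i;
        }
        return Math.toIntExact(NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
    }

    public static void main(String[] args) {
        for (int length : new int[] {10, 40, 41, 1000}) {
            System.out.println(length + " -> " + decode(encode(length)));
        }
    }
}
```

In this sketch, lengths up to 40 survive the round trip exactly, 41 decodes to 40, and 1000 decodes to 984. That is the practical consequence for scoring: two documents whose lengths fall into the same bucket get the same length normalization.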