Hi,

bumping my question after 10 days. Any clarification is appreciated.

Best
Sascha


Hi folks,

my Solr index consists of one document with a single-valued field "title" of type
"text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 8 9. The field
type text_general uses a StandardTokenizer, which should produce 9 tokens. Accordingly, the
length of the title field in this document should be 9.

The field type is defined as follows:

   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
     <analyzer type="index">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
       <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272.

But why does Solr 6.3 state that the length of field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
   0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
     0.2876821 = idf(docFreq=1, docCount=1)
     0.94664377 = tfNorm, computed from:
       1.0 = termFreq=1.0
       1.2 = parameter k1
       0.75 = parameter b
       9.0 = avgFieldLength
       10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.
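
For what it's worth, plugging the values from the explain output into the BM25 formula reproduces the reported score, so fieldLength = 10.24 really is what enters the computation. A minimal Python sketch, assuming the usual Lucene BM25 formulation (the function names are mine, not Lucene API):

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    length_factor = 1 - b + b * field_length / avg_field_length
    return freq * (k1 + 1) / (freq + k1 * length_factor)

idf = bm25_idf(doc_freq=1, doc_count=1)

# Score with the field length Solr reports (10.24)
reported = idf * bm25_tf_norm(1.0, 10.24, 9.0)

# Score with the field length I expected (9 tokens)
expected = idf * bm25_tf_norm(1.0, 9.0, 9.0)

print(f"reported fieldLength: {reported:.6f}")
print(f"expected fieldLength: {expected:.6f}")
```

With fieldLength = 9 the length factor is 1, so tfNorm would be 1.0 and the score would simply equal the idf.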

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
   0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
     0.18232156 = idf(docFreq=2, docCount=2)
     0.7757405 = tfNorm, computed from:
       1.0 = termFreq=1.0
       1.2 = parameter k1
       0.75 = parameter b
       6.0 = avgFieldLength
       10.24 = fieldLength

Again, fieldLength is reported as 10.24 rather than the expected 8.
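
Going the other way, the reported tfNorm can be solved for the field length it implies, which confirms that 10.24 is not just a display artifact. A small Python sketch, assuming the same BM25 formulation that the explain output suggests (the helper name is hypothetical):

```python
def implied_field_length(tf_norm, freq, avg_field_length, k1=1.2, b=0.75):
    # Invert tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * L / avg)) for L
    denominator = freq * (k1 + 1) / tf_norm
    length_factor = (denominator - freq) / k1
    return avg_field_length * (length_factor - (1 - b)) / b

# Values taken from the explain output for doc2
length = implied_field_length(tf_norm=0.7757405, freq=1.0, avg_field_length=6.0)
print(round(length, 2))
```

The result rounds to 10.24, so the scoring itself uses that value instead of 8.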

Is there some "magic" applied to the value of field length that goes beyond the
standard BM25 score formula?

If so, what is the idea behind this modification? If not, is this a Lucene /
Solr bug?

Best regards,
Sascha






--
Sascha Szott :: KOBV/ZIB :: +49 30 84185-457
