Re: field length within BM25 score calculation in Solr 6.3

2016-12-15 Thread Sascha Szott

Hi,

bumping my question after 10 days. Any clarification is appreciated.

Best
Sascha



--
Sascha Szott :: KOBV/ZIB :: +49 30 84185-457


field length within BM25 score calculation in Solr 6.3

2016-12-04 Thread Sascha Szott
Hi folks,

my Solr index consists of one document with a single-valued field "title" of 
type "text_general". The title field was indexed with the content: 1 2 3 4 5 6 7 
8 9. The field type text_general uses a StandardTokenizer, which should result 
in 9 tokens. Accordingly, the length of the title field in this document should be 9.

The field type is defined as follows:

[fieldType XML for text_general not preserved in the archive]


I’ve checked that none of the nine tokens (1, 2, …, 9) is a stop word.

As expected, the query title:1 returns the given document. The BM25 score of 
the document for the given query is 0.272. 

But why does Solr 6.3 state that the length of field title is 10.24?

0.27233246 = weight(title_alt:1 in 0) [SchemaSimilarity], result of:
  0.27233246 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
    0.2876821 = idf(docFreq=1, docCount=1)
    0.94664377 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      9.0 = avgFieldLength
      10.24 = fieldLength

In contrast, the value of avgFieldLength is correct.

The same observation can be made if the index consists of two simple documents:

doc1: title = 1 2 3 4
doc2: title = 1 2 3 4 5 6 7 8

The BM25 score calculation of doc2 is explained as:

0.14143422 = weight(title_alt:1 in 1) [SchemaSimilarity], result of:
  0.14143422 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
    0.18232156 = idf(docFreq=2, docCount=2)
    0.7757405 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      6.0 = avgFieldLength
      10.24 = fieldLength

The value of fieldLength does not match 8.
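
To check that the reported scores really depend on fieldLength = 10.24, the two explain outputs can be recomputed by hand. The Python sketch below assumes the usual BM25 formulas that match the numbers shown in the explain output, idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)). The function names are hypothetical and not part of Solr or Lucene.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf term, assumed to be ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq: float, k1: float, b: float,
                 avg_field_length: float, field_length: float) -> float:
    """Length-normalised term frequency component of BM25."""
    length_ratio = field_length / avg_field_length
    return freq * (k1 + 1.0) / (freq + k1 * (1.0 - b + b * length_ratio))


def bm25_score(doc_freq: int, doc_count: int, freq: float, k1: float, b: float,
               avg_field_length: float, field_length: float) -> float:
    """Product of idf and tfNorm, as in the explain output."""
    return bm25_idf(doc_freq, doc_count) * bm25_tf_norm(
        freq, k1, b, avg_field_length, field_length)


if __name__ == "__main__":
    # Single-document index: reported fieldLength vs. the expected 9 tokens.
    print(bm25_score(1, 1, 1.0, 1.2, 0.75, 9.0, 10.24))
    print(bm25_score(1, 1, 1.0, 1.2, 0.75, 9.0, 9.0))
    # Two-document index, doc2: reported fieldLength vs. the expected 8 tokens.
    print(bm25_score(2, 2, 1.0, 1.2, 0.75, 6.0, 10.24))
    print(bm25_score(2, 2, 1.0, 1.2, 0.75, 6.0, 8.0))
```

With fieldLength = 10.24 the sketch reproduces both reported scores (about 0.2723 and 0.1414). With the expected lengths 9 and 8 it yields different values, so the 10.24 is not just a display artifact of the explain output.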

Is there some "magic" applied to the value of field length that goes beyond the 
standard BM25 score formula? 

If so, what is the idea behind this modification? If not, is this a Lucene / 
Solr bug?
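
One candidate explanation, stated here as an assumption and not confirmed anywhere in this thread: Lucene 6.x may not store the exact token count per document, but a one-byte lossy encoding of 1/sqrt(length), which is decoded back to a length at scoring time. The Python sketch below re-implements that encoding the way Lucene's SmallFloat.floatToByte315 / byte315ToFloat are understood to work. The Python function names are hypothetical, and the bit layout is an assumption about Lucene internals.

```python
import math
import struct

# Assumed layout: 2 explicit mantissa bits plus exponent bits packed into one
# byte, with the zero exponent at 15 (hence "315").
ZERO_POINT = (63 - 15) << 3


def float_to_byte315(value: float) -> int:
    """Lossy one-byte encoding of a float32 (returned as an int in 0..255)."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21  # keep sign, exponent and the top mantissa bits
    if small <= ZERO_POINT:
        return 0 if bits <= 0 else 1  # underflow
    if small >= ZERO_POINT + 0x100:
        return 255  # overflow
    return small - ZERO_POINT


def byte315_to_float(encoded: int) -> float:
    """Inverse of float_to_byte315; the dropped mantissa bits come back as zero."""
    if encoded == 0:
        return 0.0
    bits = ((encoded & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def decoded_field_length(num_tokens: int) -> float:
    """Length seen by the scorer, assuming the norm is stored as 1/sqrt(length)."""
    stored = float_to_byte315(1.0 / math.sqrt(num_tokens))
    norm = byte315_to_float(stored)
    return 1.0 / (norm * norm)


if __name__ == "__main__":
    for n in (4, 8, 9):
        print(n, decoded_field_length(n))
```

Under these assumptions, lengths 8 and 9 both map to the same stored byte and decode to 10.24, while length 4 decodes to exactly 4.0. avgFieldLength would be unaffected if it is derived from index-level term counts instead of the per-document norm, which would explain why it looks correct in both explain outputs.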

Best regards,
Sascha