That is caused by the size of the documents. The principle is pretty
intuitive: if one of your documents is the entire three volumes of The Lord
of the Rings and you search for "tree", I know that The Lord of the Rings
will be in the results, and I haven't memorized the entire text of that book
:p
It is a matter of probability: in a big (big!) text, any given word has a
greater chance of being found than in a short letter. So if the word appears
in both, one can infer that the letter is more relevant than the big text.
That is the principle applied here, and Lucene does that when building the
ranking.
The first document is bigger than the second one (remember that all the
values of a multivalued field are merged into one field in the index, so you
cannot tell one value apart from another). In the first one you have roughly
[Fred, Fred, coolest, guy, town] and in the second [Fred, Anderson], so the
second document gets the higher fieldNorm (0.625 against 0.375) and is scored
as more relevant than the first one.
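
To make the arithmetic concrete, here is a small sketch in Python that
recomputes the two scores from your explain output. It assumes the classic
Lucene formulas, tf = sqrt(termFreq) and idf = 1 + ln(maxDocs / (docFreq + 1)),
and takes the fieldNorm values as given from the explain output rather than
deriving them. The function names are made up for illustration; they are not
Lucene or Solr APIs.

```python
import math


def tf(term_freq):
    # Assumed classic similarity: square root of the raw term frequency.
    return math.sqrt(term_freq)


def idf(doc_freq, max_docs):
    # Assumed classic similarity: 1 + ln(maxDocs / (docFreq + 1)).
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight in the explain output is the product of the three factors.
    return tf(term_freq) * idf(doc_freq, max_docs) * field_norm


# Numbers copied from the explain output in the original question.
DOC_FREQ = 7306
MAX_DOCS = 12586425

record_1 = field_weight(2, DOC_FREQ, MAX_DOCS, 0.375)  # "Fred" twice, longer field
record_2 = field_weight(1, DOC_FREQ, MAX_DOCS, 0.625)  # "Fred" once, shorter field

print("RECORD 1:", round(record_1, 4))
print("RECORD 2:", round(record_2, 4))
```

The extra occurrence of Fred only raises tf from 1.0 to about 1.41, while the
longer field lowers fieldNorm from 0.625 to 0.375, so the norm wins and
RECORD 2 comes out on top.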

To avoid this length normalization you can set omitNorms="true" on the field
(and reindex). That should make the first document more relevant, because
Fred appears twice in it (not because Fred appears alone in one of the
values).
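
As a rough check of what the ranking would look like with norms omitted, the
sketch below drops the fieldNorm factor, which I assume is equivalent to a
fieldNorm of 1.0, and assumes nothing else in the score changes. The idf value
is copied from your explain output; the function name is only illustrative.

```python
import math

# idf copied from the explain output in the original question.
IDF = 8.451541


def score_without_norms(term_freq, idf=IDF):
    # Assumption: with norms omitted, the score is just tf * idf,
    # where tf = sqrt(termFreq) as in the explain output.
    return math.sqrt(term_freq) * idf


record_1 = score_without_norms(2)  # "Fred" appears twice
record_2 = score_without_norms(1)  # "Fred" appears once

print("RECORD 1:", round(record_1, 4))
print("RECORD 2:", round(record_2, 4))
print("RECORD 1 ranks first:", record_1 > record_2)
```

Under those assumptions only the term frequency separates the two documents,
so RECORD 1 moves ahead of RECORD 2.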

Regards
Emmanuel

2011/7/26 Brian Lamb <brian.l...@journalexperts.com>

> Hi all,
>
> I am a little confused as to why the scoring is working the way it is:
>
> I have a field defined as:
>
> <field name="myname" type="text" indexed="true" stored="true"
> required="false" multivalued="true" />
>
> And I have several documents where that value is:
>
> RECORD 1
> <arr name="myname">
>  <str>Fred</str>
>  <str>Fred (the coolest guy in town)</str>
> </arr>
>
> OR
>
> RECORD 2
> <arr name="myname">
>  <str>Fred Anderson</str>
> </arr>
>
> What happens when I do a search for
> http://localhost:8983/solr/search/?q=myname:Fred I get RECORD 2
> returned before RECORD 1.
>
> RECORD 2
> 5.282213 = (MATCH) fieldWeight(myname:Fred in 256575), product of:
>  1.0 = tf(termFreq(myname:Fred)=1)
>  8.451541 = idf(docFreq=7306, maxDocs=12586425)
>  0.625 = fieldNorm(field=myname, doc=256575)
>
> RECORD 1
> 4.482106 = (MATCH) fieldWeight(myname:Fred in 215), product of:
>  1.4142135 = tf(termFreq(myname:Fred)=2)
>  8.451541 = idf(docFreq=7306, maxDocs=12586425)
>  0.375 = fieldNorm(field=myname, doc=215)
>
> So the difference is fieldNorm obviously but I think that's only part
> of the story. Why is RECORD 2 returned with a higher score than RECORD
> 1 even though RECORD 1 matches "Fred" exactly? And how should I do
> this differently so that I am getting the results I am expecting?
>
> Thanks,
>
> Brian Lamb
>
