Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-12 Thread Karl Koch
Hello Doron (and all the others reading here) :),

thank you for your effort and your time. I really appreciate it. :)

I understand why normalisation is done in general. Mainly, it counteracts the
bias towards oversized documents. In the literature I have read so far, there is
usually considerable effort put into document normalisation (simply because
documents may differ greatly in size), and usually no normalisation of queries
(simply because they typically consist of only a few words anyway).

For this reason, I do not understand why Lucene (in version 1.2) normalises the
query(!) with

norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a rather elaborate technique
that, in all systems I have seen so far, is applied to documents only(!).
For the documents, Lucene employs its norm_d_t, which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document
(since I always search over all fields). I would have expected cosine
normalisation here...

The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstand the Lucene
formula, or am I misinterpreting something? I have enclosed the formula as a
graphic (rendered from LaTeX) so you can have a look, should that be the case here...
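
For readers who cannot open the attachment, here is a LaTeX sketch of the 1.2
formula as it is stated in plain text elsewhere in this thread (my reconstruction,
not the original graphic):

score(q,d) = coord(q,d) \cdot \sum_{t \in q}
             \frac{tf_{t,q} \cdot idf_t}{norm_q} \cdot
             \frac{tf_{t,d} \cdot idf_t}{norm_{d,t}} \cdot boost_t ,
\qquad
norm_q = \sqrt{\sum_{t \in q} (tf_{t,q} \cdot idf_t)^2}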

I will post the other questions separately, since they also relate to the new
Lucene scoring algorithm (the questions themselves have not changed). Thank you
for your time again :)

Karl


 Original Message 
Date: Mon, 11 Dec 2006 22:41:56 -0800
From: Doron Cohen [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Subject: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring
formula needed)

  Well it doesn't, since there is no justification of why it is the
  way it is. It's like saying: here is a car with 5 wheels... enjoy
  driving.

   - I think the explanations there would also answer at least some of
     your questions.
 
 I hoped it would answer *some* of the questions... (not all)
 
 Let me try some more :-)
 
  2a) Why does Lucene normalise with Cosine Normalisation for the
  query? In a range of different IR system variations (as shown in
  Chisholm and Kolda's paper in the table on page 8) queries were not
  normalised at all. Is there a good reason or perhaps any empirical
  evidence that would support this decision?
 
 I think why normalizing at all is answered in
 http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
 ... queryNorm(q) is a normalizing factor used to make scores between
 queries comparable. This factor does not affect document ranking (since
 all
 ranked documents are multiplied by the same factor), but rather just
 attempts to make scores from different queries (or even different indexes)
 comparable. This is a search time factor computed by the Similarity in
 effect at search time. The default computation in DefaultSimilarity is:
 queryNorm(q) = 1/sqrt(sumOfSquaredWeights)
 
 I think there were discussions on this normalization. Anyhow I am not the
 expert to justify this.
 
  2b) What is the motivation for the normalisation imposed on the
  documents (norm_d_t) which I have not seen before in any other
  system. Again, does anybody have pointers to literature on this?
 
 If I understand what you are asking, the logic is that verrry long
 documents could virtually contain the entire collection, and therefore
 should be punished for being too long; otherwise they would have an
 unfair advantage over short documents. Google for "search engine document
 length normalization" for more on this, or see also
 http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf
 
  Any answer or partial answer on any of the questions would be
  greatly appreciated!
 
 I hope this (though partial) helps,
 Doron

Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-12 Thread Soeren Pekrul

Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with 


norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a rather elaborate technique
that, in all systems I have seen so far, is applied to documents only(!).


I hope I have understood 
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm 
and your problem correctly: queryNorm(q) is a normalizing factor used 
to make scores between queries comparable.


For normal searches you don't need to compare queries; you only have to
compare the documents returned for a single query. Queries in a normal search
usually have different semantics, so you can't really compare the results of
different queries.


If you use Lucene, for instance, for classification of documents, it is
necessary to compare the results of different queries. You have the
documents to classify indexed on one side and the classes on the other
side (thread Store a document-like map,
http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you
can generate queries from the classes and search against the documents.
The score of a matching document is the similarity of the document to
the query built from the class. Now the queries have to be comparable.


You can transform a document into a query and a query into a document.
That could be the reason for normalising a query like a document.
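
A minimal sketch of that classification idea (this code is not from the thread;
the index path, field name and class texts are made up, and it assumes the
Lucene 2.x API current at the time): one query is built per class, the same
index is searched with each, and the scores are compared across queries, which
is exactly where queryNorm matters.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class ClassifyByQuery {
    public static void main(String[] args) throws Exception {
        // Index containing the documents that should be classified (hypothetical path).
        IndexSearcher searcher = new IndexSearcher("/path/to/document/index");
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

        // One query per class, e.g. built from the terms describing each class.
        String[] classQueries = { "sports football league", "politics election parliament" };

        for (int i = 0; i < classQueries.length; i++) {
            Query q = parser.parse(classQueries[i]);
            Hits hits = searcher.search(q);
            if (hits.length() > 0) {
                // The score of a hit is read as "similarity of this document to the class";
                // comparing these values across queries relies on the query normalization.
                System.out.println(classQueries[i] + " -> top score " + hits.score(0));
            }
        }
        searcher.close();
    }
}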



For the documents Lucene employs its norm_d_t which is explained as:

norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document (since I always search over all fields). I would have expected cosine normalisation here...


The paper you provided uses document normalisation in the following way:

norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t.


norm(t,d) = doc.getBoost() • lengthNorm(field) • ∏ f.getBoost() (product over all fields f in d named as t)
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

That seems to be independent of the document's length. The factor
lengthNorm(field), however, uses the document's length, or rather the field
length: "Computes the normalization value for a field, given the total number
of terms contained in a field."
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm).


Implemented as 1/sqrt(numTerms) 
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int))
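
As a sketch (not from the posts), this is roughly what the default computation
looks like and how one could swap in the parametric normalisation from the paper
discussed in this thread; the class name and the avgDocLength value are made up,
the Lucene 2.x Similarity API is assumed, and note that numTerms counts the
tokens of one field rather than the unique terms of the whole document, so this
only approximates the paper's formula.

import org.apache.lucene.search.DefaultSimilarity;

// DefaultSimilarity.lengthNorm(field, numTerms) is essentially 1 / sqrt(numTerms).
// This subclass replaces it with the parametric form 1 / sqrt(0.8*avg + 0.2*length).
public class PaperStyleSimilarity extends DefaultSimilarity {
    private final float avgDocLength; // collection-wide average length, supplied by the caller

    public PaperStyleSimilarity(float avgDocLength) {
        this.avgDocLength = avgDocLength;
    }

    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(0.8 * avgDocLength + 0.2 * numTerms));
    }
}

// It would be set both at indexing and at search time, e.g.
//   writer.setSimilarity(new PaperStyleSimilarity(200f));
//   searcher.setSimilarity(new PaperStyleSimilarity(200f));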


Sören




Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-12 Thread Doron Cohen
Karl Koch [EMAIL PROTECTED] wrote:

 For the documents Lucene employs
 its norm_d_t which is explained as:

 norm_d_t : square root of number of tokens in d in the same field as t

Actually (by default) it is:
   1 / sqrt(#tokens in d with same field as t)

 basically just the square root of the number of unique terms in the
 document (since I always search over all fields). I would have
 expected cosine normalisation here...

 The paper you provided uses document normalisation in the following way:

 norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

 I am not sure how this relates to norm_d_t.

That system is less field oriented than Lucene, so you could say the
normalization there goes over all the fields.
The {0.8, 0.2} arguments are parametric and control how aggressive this
normalization is.
If you used {0, 1} there you would get
  1 / sqrt(#unique terms in d)
and that would be similar to Lucene's
  1 / sqrt(#tokens in d with same field as t)
However, (in that system) that would have punished long documents too much and
boosted trivially short documents too much, and that's why the
{0.8, 0.2} values were introduced there.
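
A made-up numeric illustration of this effect (the numbers are purely for
illustration): with avgDocLength = 200 and a document containing 1000 unique terms,

  {0.8, 0.2}:  1 / sqrt(0.8*200 + 0.2*1000) = 1 / sqrt(360)  ~= 0.053
  {0, 1}:      1 / sqrt(1000)                                ~= 0.032

so the parametric form punishes the long document noticeably less than the pure
1 / sqrt(#unique terms) normalization would.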






Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-11 Thread Karl Koch
Well it doesn't, since there is no justification of why it is the way it is.
It's like saying: here is a car with 5 wheels... enjoy driving.

Karl 
 Original Message 
Date: Sun, 10 Dec 2006 13:12:29 -0800
From: Doron Cohen [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Subject: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula
needed)

 [EMAIL PROTECTED] wrote:
  According to these sources, the Lucene scoring formula in version 1.2
 is:
 
  score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t *
  boost_t) * coord_q_d
 
 Hi Karl,
 
 A slightly more readable version of (the same) scoring formula is now in
 Lucene's Similarity jdocs -
 http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
  - I think the explanations there would also answer at least some of your
 questions.
 
 



Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

2006-12-11 Thread Doron Cohen
 Well it doesn't, since there is no justification of why it is the
 way it is. It's like saying: here is a car with 5 wheels... enjoy
 driving.

   - I think the explanations there would also answer at least some of
     your questions.

I hoped it would answer *some* of the questions... (not all)

Let me try some more :-)

 2a) Why does Lucene normalise with Cosine Normalisation for the
 query? In a range of different IR system variations (as shown in
 Chisholm and Kolda's paper in the table on page 8) queries were not
 normalised at all. Is there a good reason or perhaps any empirical
 evidence that would support this decision?

I think why normalizing at all is answered in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
... queryNorm(q) is a normalizing factor used to make scores between
queries comparable. This factor does not affect document ranking (since all
ranked documents are multiplied by the same factor), but rather just
attempts to make scores from different queries (or even different indexes)
comparable. This is a search time factor computed by the Similarity in
effect at search time. The default computation in DefaultSimilarity is:
queryNorm(q) = 1/sqrt(sumOfSquaredWeights)
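
As an illustration (this code is not part of the original post; the weight values
are made up), the default query normalization can be sketched like this, where the
inputs stand for the per-term query weights tf_q*idf_t:

public class QueryNormExample {
    // Mirrors the idea of DefaultSimilarity.queryNorm: 1 / sqrt(sum of squared weights).
    static float queryNorm(float[] termWeights) {
        float sumOfSquaredWeights = 0f;
        for (int i = 0; i < termWeights.length; i++) {
            sumOfSquaredWeights += termWeights[i] * termWeights[i];
        }
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // Two-term query with hypothetical weights 2.0 and 1.5:
        // 1 / sqrt(4 + 2.25) = 1 / 2.5 = 0.4.
        // Every document's score for this query is multiplied by the same 0.4,
        // so the ranking is unchanged; only the absolute scores shift.
        System.out.println(queryNorm(new float[] { 2.0f, 1.5f }));
    }
}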

I think there were discussions on this normalization. Anyhow I am not the
expert to justify this.

 2b) What is the motivation for the normalisation imposed on the
 documents (norm_d_t) which I have not seen before in any other
 system. Again, does anybody have pointers to literature on this?

If I understand what you are asking, the logic is that verrry long
documents could virtually contain the entire collection, and therefore
should be punished for being too long; otherwise they would have an
unfair advantage over short documents. Google for "search engine document
length normalization" for more on this, or see also
http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf

 Any answer or partial answer on any of the questions would be
 greatly appreciated!

I hope this (though partial) helps,
Doron

