Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Hello Doron (and all the others reading here) :), thank you for your effort and your time. I really appreciate it. :)

I understand why normalisation is done in general: mainly, to counter the bias towards oversized documents. In the literature I have read so far there is usually a great deal of effort spent on document normalisation (simply because documents may differ greatly in size), and usually no normalisation of queries (simply because they are usually only a few words anyway). For this reason, I do not understand why Lucene (in version 1.2) normalises the query(!) with

  norm_q : sqrt(sum_t((tf_q*idf_t)^2))

which is also called cosine normalisation. This is a rather comprehensive technique, and in all systems I have seen so far it is used for documents only(!). For the documents, Lucene employs its norm_d_t, which is explained as:

  norm_d_t : square root of number of tokens in d in the same field as t

basically just the square root of the number of unique terms in the document (since I always search over all fields). I would have expected cosine normalisation here...

The paper you provided uses document normalisation in the following way:

  norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))

I am not sure how this relates to norm_d_t. Did I misunderstand the Lucene formula, or am I misinterpreting something? I have enclosed the formula as a graphic (from LaTeX) so you can have a look, should that be the case here...

I will post the other questions separately, since they also relate to the new Lucene scoring algorithm (they have not changed).

Thank you for your time again :)

Karl

-------- Original Message --------
Date: Mon, 11 Dec 2006 22:41:56 -0800
From: Doron Cohen [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Subject: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

> Well it doesn't, since there is no justification of why it is the way
> it is. It's like saying: here is that car with 5 wheels... enjoy
> driving.
>
> > I think the explanations there would also answer at least some of
> > your questions.

I hoped it would answer *some* of the questions... (not all). Let me try some more :-)

> 2a) Why does Lucene normalise with cosine normalisation for the query?
> In a range of different IR system variations (as shown in Chisholm and
> Kolda's paper in the table on page 8) queries were not normalised at
> all. Is there a good reason, or perhaps any empirical evidence, that
> would support this decision?

I think the question of why we normalize at all is answered in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

  queryNorm(q) is a normalizing factor used to make scores between
  queries comparable. This factor does not affect document ranking
  (since all ranked documents are multiplied by the same factor), but
  rather just attempts to make scores from different queries (or even
  different indexes) comparable. This is a search time factor computed
  by the Similarity in effect at search time. The default computation
  in DefaultSimilarity is:

    queryNorm(q) = 1/sqrt(sumOfSquaredWeights)

I think there were discussions on this normalization. Anyhow, I am not the expert to justify this.

> 2b) What is the motivation for the normalisation imposed on the
> documents (norm_d_t), which I have not seen before in any other
> system? Again, does anybody have pointers to literature on this?
If I understand what you are asking, the logic is that very long documents could virtually contain the entire collection, and therefore should be punished for being too long; otherwise they would have an unfair advantage over short documents. Google "search engine document length normalization" for more on this, or see also
http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf

> Any answer or partial answer on any of the questions would be greatly
> appreciated!

I hope this (though partial) helps,

Doron
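To make the queryNorm factor quoted above concrete, here is a minimal, self-contained Java sketch of the default computation the javadoc describes. This is not Lucene source; the class name and the term weights are invented for illustration:

    // queryNorm(q) = 1/sqrt(sumOfSquaredWeights), per DefaultSimilarity
    public class QueryNormSketch {
        public static void main(String[] args) {
            // weight of each query term t, e.g. tf_q * idf_t (boosts ignored)
            double[] termWeights = { 1.2, 0.4 };  // hypothetical values
            double sumOfSquaredWeights = 0.0;
            for (double w : termWeights) {
                sumOfSquaredWeights += w * w;
            }
            double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
            // The same factor multiplies every document's score for this
            // query, so the ranking is unchanged; only the absolute
            // scores are rescaled to be comparable across queries.
            System.out.println("queryNorm = " + queryNorm);
        }
    }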
Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Hello Karl,

I’m very interested in the details of Lucene’s scoring as well.

Karl Koch wrote:
> For this reason, I do not understand why Lucene (in version 1.2)
> normalises the query(!) with
>
>   norm_q : sqrt(sum_t((tf_q*idf_t)^2))
>
> which is also called cosine normalisation. This is a technique that is
> rather comprehensive and usually used for documents only(!) in all
> systems I have seen so far.

I hope I have understood
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm
and your problem correctly:

  queryNorm(q) is a normalizing factor used to make scores between
  queries comparable.

For normal searches you don’t need to compare queries; you only have to compare the documents retrieved for a single query. Queries in a normal search usually have different semantics, so you can’t really compare the results of different queries.

If you use Lucene for classification of documents, however, it is necessary to compare the results of different queries. You have the documents to classify indexed on one side and the classes on the other side (see the thread "Store a document-like map", http://www.gossamer-threads.com/lists/lucene/java-user/42816). Then you can generate queries from the classes and search them against the documents. The score of a matching document is the similarity of the document to the query built from the class. Now the queries have to be comparable. Since you can transform a document into a query and a query into a document, that could be the reason for normalising a query like a document.

> For the documents Lucene employs its norm_d_t which is explained as:
>
>   norm_d_t : square root of number of tokens in d in the same field as t
>
> basically just the square root of the number of unique terms in the
> document (since I do search over all fields always). I would have
> expected cosine normalisation here... The paper you provided uses
> document normalisation in the following way:
>
>   norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
>
> I am not sure how this relates to norm_d_t.

  norm(t,d) = doc.getBoost() • lengthNorm(field) • ∏ f.getBoost()

(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_norm)

That seems to be independent of the document’s length, but the factor lengthNorm(field) does use the document’s length, or rather the field’s length: “Computes the normalization value for a field, given the total number of terms contained in a field.” (same link). It is implemented as 1/sqrt(numTerms)
(http://lucene.apache.org/java/docs/api/org/apache/lucene/search/DefaultSimilarity.html#lengthNorm(java.lang.String,%20int)).

Sören
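Sören's last point can be seen directly in the Similarity API of that era: lengthNorm(fieldName, numTerms) is the hook for document (field) length normalisation. Here is a sketch of plugging in the paper's parametric norm in place of the default 1/sqrt(numTerms). The subclass name and the average-length constant are invented, and note that numTerms here is the field's total token count, not the unique-term count the paper uses, so this is only an approximation:

    import org.apache.lucene.search.DefaultSimilarity;

    public class PivotedLengthSimilarity extends DefaultSimilarity {
        // assumed average field length; a real implementation would
        // measure this from the collection
        private static final double AVG_LENGTH = 300.0;

        public float lengthNorm(String fieldName, int numTerms) {
            // Juru-style parametric norm instead of 1/sqrt(numTerms)
            return (float) (1.0 / Math.sqrt(0.8 * AVG_LENGTH + 0.2 * numTerms));
        }
    }

Since lengthNorm is baked into the norms at index time, such a Similarity would have to be set on the IndexWriter before indexing, not only on the searcher.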
Re: Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Karl Koch [EMAIL PROTECTED] wrote:
> For the documents Lucene employs its norm_d_t which is explained as:
>
>   norm_d_t : square root of number of tokens in d in the same field as t

Actually, (by default) it is:

  1 / sqrt(#tokens in d with same field as t)

> basically just the square root of the number of unique terms in the
> document (since I do search over all fields always). I would have
> expected cosine normalisation here... The paper you provided uses
> document normalisation in the following way:
>
>   norm = 1 / sqrt(0.8*avgDocLength + 0.2*(# of unique terms in d))
>
> I am not sure how this relates to norm_d_t.

That system is less field oriented than Lucene, so you could say the normalization there goes over all the fields. The {0.8, 0.2} args are parametric and control how aggressive this normalization is. If you used {0, 1} there, you would get

  1 / sqrt(#unique terms in d)

which would be similar to Lucene's

  1 / sqrt(#tokens in d with same field as t)

However, (in that system) that would have punished long documents too much and boosted trivially short documents too much, and that's why the {0.8, 0.2} values were introduced there.
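A quick toy comparison of the two normalisations Doron contrasts above (all lengths invented; for simplicity the same count is fed to both formulas, though Lucene counts tokens per field and the paper counts unique terms):

    public class NormComparison {
        static double luceneNorm(int numTokens) {
            return 1.0 / Math.sqrt(numTokens);       // Lucene's default
        }
        static double parametricNorm(double avgDocLength, int uniqueTerms) {
            // the {0.8, 0.2} variant from the paper; {0, 1} would reduce
            // this to 1/sqrt(uniqueTerms)
            return 1.0 / Math.sqrt(0.8 * avgDocLength + 0.2 * uniqueTerms);
        }
        public static void main(String[] args) {
            double avgLen = 300.0;                   // hypothetical average
            for (int len : new int[] { 30, 300, 30000 }) {
                System.out.printf("len=%6d  lucene=%.5f  parametric=%.5f%n",
                        len, luceneNorm(len), parametricNorm(avgLen, len));
            }
        }
    }

Across these lengths the Lucene norm varies by a factor of about 32, the parametric one only by a factor of about 5: the {0.8, 0.2} pair damps both the punishment of long documents and the boost of short ones.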
Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
Well, it doesn't, since there is no justification of why it is the way it is. It's like saying: here is that car with 5 wheels... enjoy driving.

Karl

-------- Original Message --------
Date: Sun, 10 Dec 2006 13:12:29 -0800
From: Doron Cohen [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Subject: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)

[EMAIL PROTECTED] wrote:
> According to these sources, the Lucene scoring formula in version 1.2 is:
>
>   score(q,d) = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d

Hi Karl,

A slightly more readable version of (the same) scoring formula is now in Lucene's Similarity jdocs:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html

I think the explanations there would also answer at least some of your questions.
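For readers who want to see the quoted 1.2 formula in action, here is a toy computation for a two-term query against one document. Every number is invented; only the shape of the computation follows the formula quoted above:

    public class Lucene12ScoreSketch {
        public static void main(String[] args) {
            double[] tfQ   = { 1.0, 1.0 };   // tf of term t in the query
            double[] tfD   = { 3.0, 0.0 };   // tf of term t in document d
            double[] idf   = { 2.1, 0.7 };   // idf_t (hypothetical)
            double[] boost = { 1.0, 1.0 };   // boost_t

            // norm_q: cosine normalisation over the query term weights
            double normQ = 0.0;
            for (int t = 0; t < tfQ.length; t++) {
                normQ += Math.pow(tfQ[t] * idf[t], 2);
            }
            normQ = Math.sqrt(normQ);

            double normDT = Math.sqrt(120);  // sqrt(#tokens in d's field)

            int overlap = 0;
            double score = 0.0;
            for (int t = 0; t < tfQ.length; t++) {
                if (tfD[t] > 0) overlap++;
                score += (tfQ[t] * idf[t] / normQ)
                       * (tfD[t] * idf[t] / normDT)
                       * boost[t];
            }
            // coord_q_d: fraction of query terms found in d
            double coord = (double) overlap / tfQ.length;
            System.out.println("score(q,d) = " + score * coord);
        }
    }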
Re: Re: Questions about Lucene scoring (was: Lucene 1.2 - scoring formula needed)
> Well it doesn't, since there is no justification of why it is the way
> it is. It's like saying: here is that car with 5 wheels... enjoy
> driving.
>
> > I think the explanations there would also answer at least some of
> > your questions.

I hoped it would answer *some* of the questions... (not all). Let me try some more :-)

> 2a) Why does Lucene normalise with cosine normalisation for the query?
> In a range of different IR system variations (as shown in Chisholm and
> Kolda's paper in the table on page 8) queries were not normalised at
> all. Is there a good reason, or perhaps any empirical evidence, that
> would support this decision?

I think the question of why we normalize at all is answered in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_queryNorm

  queryNorm(q) is a normalizing factor used to make scores between
  queries comparable. This factor does not affect document ranking
  (since all ranked documents are multiplied by the same factor), but
  rather just attempts to make scores from different queries (or even
  different indexes) comparable. This is a search time factor computed
  by the Similarity in effect at search time. The default computation
  in DefaultSimilarity is:

    queryNorm(q) = 1/sqrt(sumOfSquaredWeights)

I think there were discussions on this normalization. Anyhow, I am not the expert to justify this.

> 2b) What is the motivation for the normalisation imposed on the
> documents (norm_d_t), which I have not seen before in any other
> system? Again, does anybody have pointers to literature on this?

If I understand what you are asking, the logic is that very long documents could virtually contain the entire collection, and therefore should be punished for being too long; otherwise they would have an unfair advantage over short documents. Google "search engine document length normalization" for more on this, or see also
http://trec.nist.gov/pubs/trec10/papers/JuruAtTrec.pdf

> Any answer or partial answer on any of the questions would be greatly
> appreciated!

I hope this (though partial) helps,

Doron
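A toy illustration of the point about very long documents (invented numbers, no Lucene API involved): without length normalisation, a document that simply contains everything matches every query term and is never penalised for its size.

    public class LongDocSketch {
        public static void main(String[] args) {
            // both query terms matched once, hypothetical idf values
            double tfIdfSum = 1 * 2.1 + 1 * 0.7;
            // Unnormalized, a 50-token doc and a 50,000-token doc tie:
            System.out.println("no norm, either doc: " + tfIdfSum);
            // With Lucene's 1/sqrt(#tokens), the short document wins:
            System.out.println("short doc: " + tfIdfSum / Math.sqrt(50));
            System.out.println("long doc:  " + tfIdfSum / Math.sqrt(50000));
        }
    }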