Re: A question about scoring function in Lucene
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query).

I question whether such scores are more meaningful. Yes, such scores would be guaranteed to be between zero and one, but would 0.8 really be meaningful? I don't think so. Do you have pointers to research which demonstrates this? E.g., that when such a scoring method is used, thresholding by score is useful across queries?

Doug
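For reference, the pure-vector-space cosine under discussion has the standard textbook form (the weights below are the usual tf*idf factors, not Lucene's exact ones):

score(d, q) = sum_t( w_t_d * w_t_q ) / ( sqrt(sum_t( w_t_d^2 )) * sqrt(sum_t( w_t_q^2 )) )

with w_t_d = tf_t_d * idf_t and w_t_q = tf_t_q * idf_t. Since every weight is non-negative, the Cauchy-Schwarz inequality bounds the ratio between 0 and 1, which is the cross-query comparability being claimed.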
Re: A question about scoring function in Lucene
: I question whether such scores are more meaningful. Yes, such scores : would be guaranteed to be between zero and one, but would 0.8 really be : meaningful? I don't think so. Do you have pointers to research which : demonstrates this? E.g., when such a scoring method is used, that : thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions, but I believe what the OP was referring to was not any intrinsic benefit in having a score between 0 and 1, but of having a uniform normalization of scores regardless of search terms.

For example, using the current scoring equation, if I do a search for Doug Cutting and the results/scores I get back are...

1: 0.9
2: 0.3
3: 0.21
4: 0.21
5: 0.1

...then there are at least two meaningful pieces of data I can glean:
a) document #1 is significantly better than the other results
b) documents #3 and #4 are both equally relevant to Doug Cutting

If I then do a search for Chris Hostetter and get back the following results/scores...

9: 0.9
8: 0.3
7: 0.21
6: 0.21
5: 0.1

...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally as good).

However, I *cannot* say either of the following:
x) document #9 is as relevant for Chris Hostetter as document #1 is relevant to Doug Cutting
y) document #5 is equally relevant to both Chris Hostetter and Doug Cutting

I think the OP is arguing that if the scoring algorithm were modified in the way they suggested, then you would be able to make statements x and y. If they are correct, then I for one can see a definite benefit in that, if for no other reason than in making minimum score thresholds more meaningful.

-Hoss
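To make the within-query comparison concrete, here is a rough sketch against the Lucene 1.4-era Hits API that reports each hit's score relative to the top hit of the same query (the "id" field name is just an assumption for illustration). Ratios like these back up statements (a) and (b), but say nothing about a different query's hits, i.e. (x)/(y):

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class RelativeScores {
    // Report each hit's score as a fraction of the best score for this query.
    public static void print(IndexSearcher searcher, Query query) throws Exception {
        Hits hits = searcher.search(query);
        if (hits.length() == 0) {
            return;
        }
        float top = hits.score(0); // Hits come back ordered by descending score
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i).get("id") + ": " + hits.score(i)
                    + " (" + (hits.score(i) / top) + " of the top hit)");
        }
    }
}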
Re: A question about scoring function in Lucene
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score is above some threshold X. So that is where the absolute value of the score would be useful (a rough sketch of such an alert check appears after the quoted message below).

I believe Chuck submitted some code that fixes this, which also helps with MultiSearcher, where you have to have this constant score in order to properly order hits from different Searchers, but I didn't dare to touch that code without further studying, for which I didn't have time.

Otis

--- Doug Cutting [EMAIL PROTECTED] wrote:
Chuck Williams wrote: I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query).

I question whether such scores are more meaningful. Yes, such scores would be guaranteed to be between zero and one, but would 0.8 really be meaningful? I don't think so. Do you have pointers to research which demonstrates this? E.g., when such a scoring method is used, that thresholding by score is useful across queries?

Doug
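A minimal sketch of the alert idea, assuming the Lucene 1.4-era API; the field names ("contents", "path") and the fixed threshold are illustrative assumptions only. Whether one threshold can serve all saved queries is exactly the open question in this thread:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SavedSearchAlert {
    // Re-run a saved query against the current index and report every hit whose
    // score meets the fixed, query-independent threshold.
    public static void check(String indexDir, String savedQuery, float threshold) throws Exception {
        IndexSearcher searcher = new IndexSearcher(indexDir);
        try {
            Query query = QueryParser.parse(savedQuery, "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query);
            // Hits are ordered by descending score, so we can stop at the first miss.
            for (int i = 0; i < hits.length() && hits.score(i) >= threshold; i++) {
                System.out.println("ALERT: " + hits.doc(i).get("path")
                        + " scored " + hits.score(i) + " for [" + savedQuery + "]");
            }
        } finally {
            searcher.close();
        }
    }
}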
RE: A question about scoring function in Lucene
-Original Message- From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 15, 2004 12:35 PM To: Lucene Users List Subject: Re: A question about scoring function in Lucene

Chris Hostetter wrote:
For example, using the current scoring equation, if I do a search for Doug Cutting and the results/scores I get back are...

1: 0.9
2: 0.3
3: 0.21
4: 0.21
5: 0.1

...then there are at least two meaningful pieces of data I can glean:
a) document #1 is significantly better than the other results
b) documents #3 and #4 are both equally relevant to Doug Cutting

If I then do a search for Chris Hostetter and get back the following results/scores...

9: 0.9
8: 0.3
7: 0.21
6: 0.21
5: 0.1

...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally as good).

However, I *cannot* say either of the following:
x) document #9 is as relevant for Chris Hostetter as document #1 is relevant to Doug Cutting
y) document #5 is equally relevant to both Chris Hostetter and Doug Cutting

That's right. Thanks for the nice description of the issue.

I think the OP is arguing that if the scoring algorithm was modified in the way they suggested, then you would be able to make statements x and y.

And I am not convinced that, with the changes Chuck describes, one can be any more confident of x and y.

Doug
RE: A question about scoring function in Lucene
Thanks for your answer. In the Lucene scoring function, they use only norm_q, but for one query norm_q is the same for all documents, so norm_q does not actually affect the score. But norm_d is different: each document has a different norm_d, and it affects the score of document d for query q. If you drop it, the score information is not correct anymore, or it is not the vector space model anymore. Could you explain it a little bit? I think it is expensive to compute in incremental indexing, because when one document is added the idf of each term changes. But dropping it is not a good choice. What is the role of norm_d_t?

Nhan.

--- Chuck Williams [EMAIL PROTECTED] wrote:

Nhan, Re. your two differences:

1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction; the result is the same.

2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d was used it would have to be recomputed for ALL documents. This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a final square root.

Chuck

-Original Message- From: Vikas Gupta [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:32 PM To: Lucene Users List Subject: Re: A question about scoring function in Lucene

Lucene uses the vector space model. To understand that:
-Read section 2.1 of the Space optimizations for Total Ranking paper (linked here http://lucene.sourceforge.net/publications.html)
-Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

Hi all, Lucene scores a document based on the correlation between the query q and document d (this is the raw function; I don't pay attention to the boost_t, coord_q_d factors):

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? Because: I have also read the book Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (hope you have read it too).
On page 27, they also suggest a scoring function for the vector model based on the correlation between query q and document d as follows (I use different symbols):

score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / (norm_d * norm_q)   (**)

where
weight_t_d = tf_d * idf_t
weight_t_q = tf_q * idf_t
norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / (norm_d * norm_q)   (***)

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t is just for the numerator, but in (*), the sum_t is for everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain.
2. There is no factor that defines the norm of the document, norm_d, in the function (*). Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody give me documents or papers that explain this function in detail, so that when I apply Lucene to my system, I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards, thanks everybody,
Đặng Nhân
RE: A question about scoring function in Lucene
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring model to deviate from the pure vector space model. However, including norm_d would cause other problems -- e.g., with short queries, as are typical in reality, the resulting scores with norm_d would all be extremely small. You are also correct that since norm_q is invariant, it does not affect relevance ranking. Norm_q is simply part of the normalization of final scores.

There are many different formulas for scoring and relevance ranking in IR. All of these have some intuitive justification, but in the end can only be evaluated empirically. There is no correct formula.

I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize. The pure vector space model implements a cosine in the strictly positive sector of the coordinate space. This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query). Lucene does not have this property. Its formula produces scores of arbitrary magnitude depending on the query. The results cannot be compared meaningfully across queries; i.e., 0.8 means nothing intrinsically. To keep final scores between 0 and 1, Lucene introduces an ad hoc query-dependent final normalization in Hits: viz., it divides all scores by the highest score if the highest score happens to be greater than 1. This makes it impossible for an application to properly inform its users about the quality of the results, to cut off bad results, etc. Applications may do that, but in fact what they are doing is random, not what they think they are doing.

I've proposed a fix for this -- there was a long thread on Lucene-dev. It is possible to revise Lucene's scoring to keep its efficiency, keep its current per-query relevance ranking, and yet intrinsically normalize its scores so that they are meaningful across queries. I posted a fairly detailed spec of how to do this in the Lucene-dev thread. I'm hoping to have time to build it and submit it as a proposed update to Lucene, but it is a large effort that would involve changing just about every scoring class in Lucene. I'm not sure it would be incorporated even if I did it, as that would take considerable work from a developer. There doesn't seem to be much concern about these various scoring and relevancy ranking issues among the general Lucene community.

Chuck

-Original Message- From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 15, 2004 1:18 AM To: Lucene Users List Subject: RE: A question about scoring function in Lucene

Thanks for your answer. In the Lucene scoring function, they use only norm_q, but for one query norm_q is the same for all documents, so norm_q does not actually affect the score. But norm_d is different: each document has a different norm_d, and it affects the score of document d for query q. If you drop it, the score information is not correct anymore, or it is not the vector space model anymore. Could you explain it a little bit? I think it is expensive to compute in incremental indexing, because when one document is added the idf of each term changes. But dropping it is not a good choice. What is the role of norm_d_t?

Nhan.

--- Chuck Williams [EMAIL PROTECTED] wrote:

Nhan, Re. your two differences:

1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction; the result is the same.

2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d was used it would have to be recomputed for ALL documents. This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a final square root.
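The "final square root" correction mentioned above could look roughly like this, sketched against the Lucene 1.4-era Similarity API (the exact method signature should be treated as an assumption). Because idf enters the score once through the query weight and once through the term weight, returning sqrt(idf) makes the net contribution linear rather than squared:

import org.apache.lucene.search.DefaultSimilarity;

// Counteract the idf^2 effect: since idf is multiplied into the score twice,
// handing back sqrt(idf) yields a net contribution of (sqrt(idf))^2 = idf.
public class SqrtIdfSimilarity extends DefaultSimilarity {
    public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}

It would be installed with something like searcher.setSimilarity(new SqrtIdfSimilarity()); since idf is applied at search time, the searcher is the place that matters for this particular change.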
Re: A question about scoring function in Lucene
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring would be useful, and I think Chuck already mentioned this 1-2 months ago. For instance, having such scores would allow one to create alert applications where queries run by some scheduler would trigger an alert whenever the score is above some threshold X. So that is where the absolute value of the score would be useful.

Right, but the question is, would a single score threshold be effective for all queries, or would one need a separate score threshold for each query? My hunch is that the latter is better, regardless of the scoring algorithm.

Also, just because Lucene's default scoring does not guarantee scores between zero and one does not necessarily mean that these scores are less meaningful.

Doug
Re: A question about scoring function in Lucene
Chris Hostetter wrote:
For example, using the current scoring equation, if I do a search for Doug Cutting and the results/scores I get back are...

1: 0.9
2: 0.3
3: 0.21
4: 0.21
5: 0.1

...then there are at least two meaningful pieces of data I can glean:
a) document #1 is significantly better than the other results
b) documents #3 and #4 are both equally relevant to Doug Cutting

If I then do a search for Chris Hostetter and get back the following results/scores...

9: 0.9
8: 0.3
7: 0.21
6: 0.21
5: 0.1

...then I can assume the same corresponding information is true about my new search term (#9 is significantly better, and #7/#8 are equally as good).

However, I *cannot* say either of the following:
x) document #9 is as relevant for Chris Hostetter as document #1 is relevant to Doug Cutting
y) document #5 is equally relevant to both Chris Hostetter and Doug Cutting

That's right. Thanks for the nice description of the issue.

I think the OP is arguing that if the scoring algorithm was modified in the way they suggested, then you would be able to make statements x and y.

And I am not convinced that, with the changes Chuck describes, one can be any more confident of x and y.

Doug
RE: A question about scoring function in Lucene
Nhan, Re. your two differences:

1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction; the result is the same.

2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d was used it would have to be recomputed for ALL documents. This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, Understanding inverse document frequency: on theoretical arguments for IDF, available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a final square root.

Chuck

-Original Message- From: Vikas Gupta [mailto:[EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 9:32 PM To: Lucene Users List Subject: Re: A question about scoring function in Lucene

Lucene uses the vector space model. To understand that:
-Read section 2.1 of the Space optimizations for Total Ranking paper (linked here http://lucene.sourceforge.net/publications.html)
-Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

Hi all, Lucene scores a document based on the correlation between the query q and document d (this is the raw function; I don't pay attention to the boost_t, coord_q_d factors):

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? Because: I have also read the book Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (hope you have read it too). On page 27, they also suggest a scoring function for the vector model based on the correlation between query q and document d as follows (I use different symbols):

score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / (norm_d * norm_q)   (**)

where
weight_t_d = tf_d * idf_t
weight_t_q = tf_q * idf_t
norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / (norm_d * norm_q)   (***)

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t is just for the numerator, but in (*), the sum_t is for everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain.
2. There is no factor that defines the norm of the document, norm_d, in the function (*). Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody give me documents or papers that explain this function in detail, so that when I apply Lucene to my system, I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards, thanks everybody,
Đặng Nhân
Re: A question about scoring function in Lucene
Lucene uses the vector space model. To understand that:
-Read section 2.1 of the Space optimizations for Total Ranking paper (linked here http://lucene.sourceforge.net/publications.html)
-Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

Hi all, Lucene scores a document based on the correlation between the query q and document d (this is the raw function; I don't pay attention to the boost_t, coord_q_d factors):

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? Because: I have also read the book Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (hope you have read it too). On page 27, they also suggest a scoring function for the vector model based on the correlation between query q and document d as follows (I use different symbols):

score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / (norm_d * norm_q)   (**)

where
weight_t_d = tf_d * idf_t
weight_t_q = tf_q * idf_t
norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / (norm_d * norm_q)   (***)

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t is just for the numerator, but in (*), the sum_t is for everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain.
2. There is no factor that defines the norm of the document, norm_d, in the function (*). Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody give me documents or papers that explain this function in detail, so that when I apply Lucene to my system, I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards, thanks everybody,
Đặng Nhân
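As a self-contained, worked illustration of formula (***) (toy numbers only, not tied to any Lucene index), the following computes the cosine of tf*idf-weighted vectors; because every weight is non-negative, the result always lands between 0 and 1:

public class CosineScoreDemo {
    // Toy illustration of formula (***): the cosine of tf*idf-weighted vectors.
    // All numbers below are made up; t indexes the terms of a tiny vocabulary.
    static double score(double[] tfD, double[] tfQ, double[] idf) {
        double dot = 0.0, normD = 0.0, normQ = 0.0;
        for (int t = 0; t < idf.length; t++) {
            double wD = tfD[t] * idf[t]; // weight_t_d
            double wQ = tfQ[t] * idf[t]; // weight_t_q
            dot += wD * wQ;
            normD += wD * wD;
            normQ += wQ * wQ;
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] idf = { 1.0, 2.0, 0.5 };
        double[] doc = { 3.0, 0.0, 1.0 }; // tf_d per term
        double[] qry = { 1.0, 1.0, 0.0 }; // tf_q per term
        System.out.println(score(doc, qry, idf)); // prints roughly 0.44, always in [0, 1]
    }
}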