Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chuck Williams wrote:
I believe the biggest problem with Lucene's approach relative to the pure vector space model is that Lucene does not properly normalize.  The pure vector space model implements a cosine in the strictly positive sector of the coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and produces scores that can be compared across distinct queries (i.e., 0.8 means something about the result quality independent of the query).
I question whether such scores are more meaningful.  Yes, such scores 
would be guaranteed to be between zero and one, but would 0.8 really be 
meaningful?  I don't think so.  Do you have pointers to research which 
demonstrates this?  E.g., when such a scoring method is used, that 
thresholding by score is useful across queries?

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: A question about scoring function in Lucene

2004-12-15 Thread Chris Hostetter
: I question whether such scores are more meaningful.  Yes, such scores
: would be guaranteed to be between zero and one, but would 0.8 really be
: meaningful?  I don't think so.  Do you have pointers to research which
: demonstrates this?  E.g., when such a scoring method is used, that
: thresholding by score is useful across queries?

I freely admit that I'm way out of my league on these scoring discussions,
but I believe what the OP was referring to was not any intrinsic benefit in
having a score between 0 and 1, but of having a uniform normalization of
scores regardless of search terms.

For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to "Doug Cutting"

If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1

...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)

However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
     relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
     "Doug Cutting"


I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x and y.

If they are correct, then I for one can see a definite benefit in that.
If for no other reason than in making minimum score thresholds more
meaningful.



-Hoss





Re: A question about scoring function in Lucene

2004-12-15 Thread Otis Gospodnetic
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.

I believe Chuck submitted some code that fixes this, which also helps
with MultiSearcher, where you have to have this constant score in order
to properly order hits from different Searchers, but I didn't dare to
touch that code without further studying, for which I didn't have time.

Otis


--- Doug Cutting [EMAIL PROTECTED] wrote:

 Chuck Williams wrote:
  I believe the biggest problem with Lucene's approach relative to
 the pure vector space model is that Lucene does not properly
 normalize.  The pure vector space model implements a cosine in the
 strictly positive sector of the coordinate space.  This is guaranteed
 intrinsically to be between 0 and 1, and produces scores that can be
 compared across distinct queries (i.e., 0.8 means something about
 the result quality independent of the query).
 
 I question whether such scores are more meaningful.  Yes, such scores
 
 would be guaranteed to be between zero and one, but would 0.8 really
 be 
 meaningful?  I don't think so.  Do you have pointers to research
 which 
 demonstrates this?  E.g., when such a scoring method is used, that 
 thresholding by score is useful across queries?
 
 Doug
 
 
 





RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
   -Original Message-
   From: Doug Cutting [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 15, 2004 12:35 PM
   To: Lucene Users List
   Subject: Re: A question about scoring function in Lucene
   
   Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to "Doug Cutting"

If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1

...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)

However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
     relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
     "Doug Cutting"

   That's right.  Thanks for the nice description of the issue.

I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x and y.

   And I am not convinced that, with the changes Chuck describes, one can
   be any more confident of x and y.

   Doug
   
  





RE: A question about scoring function in Lucene

2004-12-15 Thread Nhan Nguyen Dang
Thanks for your answer.
In the Lucene scoring function, only norm_q is used,
but for one query, norm_q is the same for all
documents, so norm_q does not actually affect the score.
But norm_d is different: each document has a different
norm_d, and it affects the score of document d for query q.
If you drop it, the score information is not correct
anymore, or it is not the vector space model anymore.  Could
you explain it a little bit?

I think that it's expensive to compute in incremental
indexing, because when one document is added, the idf of
each term changes.  But dropping it is not a good choice.

What is the role of norm_d_t ?
Nhan.

--- Chuck Williams [EMAIL PROTECTED] wrote:

 Nhan,
 
 Re.  your two differences:
 
 1 is not a difference.  Norm_d and Norm_q are both
 independent of t, so summing over t has no effect on
 them.  I.e., Norm_d * Norm_q is constant wrt the
 summation, so it doesn't matter if the sum is over
 just the numerator or over the entire fraction, the
 result is the same.
 
 2 is a difference.  Lucene uses Norm_q instead of
 Norm_d because Norm_d is too expensive to compute,
 especially in the presence of incremental indexing. 
 E.g., adding or deleting any document changes the
 idf's, so if Norm_d was used it would have to be
 recomputed for ALL documents.  This is not feasible.
 
 Another point you did not mention is that the idf
 term is squared (in both of your formulas).  Salton,
 the originator of the vector space model, dropped
 one idf factor from his formula as it improved
 results empirically.  More recent theoretical
 justifications of tf*idf provide intuitive
 explanations of why idf should only be included
 linearly.  tf is best thought of as the real vector
 entry, while idf is a weighting term on the
 components of the inner product.  E.g., see the
 excellent paper by Robertson, "Understanding inverse
 document frequency: on theoretical arguments for
 IDF", available here:
 http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
 if you sign up for an eval.
 
 It's easy to correct for idf^2 by using a custom
 Similarity that takes a final square root.
 
 Chuck
 
RE: A question about scoring function in Lucene

2004-12-15 Thread Chuck Williams
Nhan,

You are correct that dropping the document norm does cause Lucene's scoring 
model to deviate from the pure vector space model.  However, including norm_d 
would cause other problems -- e.g., with short queries, as are typical in 
reality, the resulting scores with norm_d would all be extremely small.  You 
are also correct that since norm_q is invariant, it does not affect relevance 
ranking.  Norm_q is simply part of the normalization of final scores.  There 
are many different formulas for scoring and relevance ranking in IR.  All of 
these have some intuitive justification, but in the end can only be evaluated 
empirically.  There is no correct formula.

I believe the biggest problem with Lucene's approach relative to the pure 
vector space model is that Lucene does not properly normalize.  The pure vector 
space model implements a cosine in the strictly positive sector of the 
coordinate space.  This is guaranteed intrinsically to be between 0 and 1, and 
produces scores that can be compared across distinct queries (i.e., 0.8 means 
something about the result quality independent of the query).

Lucene does not have this property.  Its formula produces scores of arbitrary 
magnitude depending on the query.  The results cannot be compared meaningfully 
across queries; i.e., 0.8 means nothing intrinsically.  To keep final scores 
between 0 and 1, Lucene introduces an ad hoc query-dependent final 
normalization in Hits:  viz., it divides all scores by the highest score if the 
highest score happens to be greater than 1.  This makes it impossible for an 
application to properly inform its users about the quality of the results, to 
cut off bad results, etc.  Applications may do that, but in fact what they are 
doing is random, not what they think they are doing.

I've proposed a fix for this -- there was a long thread on Lucene-dev.  It is 
possible to revise Lucene's scoring to keep its efficiency, keep its current 
per-query relevance ranking, and yet intrinsically normalize its scores so that 
they are meaningful across queries.  I posted a fairly detailed spec of how to 
do this in the Lucene-dev thread.  I'm hoping to have time to build it and 
submit it as a proposed update to Lucene, but it is a large effort that would 
involve changing just about every scoring class in Lucene.  I'm not sure it 
would be incorporated even if I did it as that would take considerable work 
from a developer.  There doesn't seem to be much concern about these various 
scoring and relevancy ranking issues among the general Lucene community.

Chuck
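To make the normalization contrast concrete, here is a minimal sketch (toy term-weight dictionaries and made-up queries, not Lucene code) contrasting the pure cosine score with a Lucene-style score that drops the document norm:

```python
import math

def cosine_score(q, d):
    # Pure vector-space model: dot product divided by both norms.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d)

def lucene_like_score(q, d):
    # Same dot product, but the document norm is dropped, so the
    # score's magnitude depends on the document's weights.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / norm_q

doc = {"doug": 2.0, "cutting": 3.0, "lucene": 5.0}
q1 = {"doug": 1.0, "cutting": 1.0}
q2 = {"lucene": 1.0}

print(cosine_score(q1, doc))       # stays in [0, 1] for any query
print(lucene_like_score(q2, doc))  # exceeds 1 here; scale is query-dependent
```

The first score is a cosine and can be read across queries; the second can only be compared within one query, which is exactly the complaint above.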

   -Original Message-
   From: Nhan Nguyen Dang [mailto:[EMAIL PROTECTED]
   Sent: Wednesday, December 15, 2004 1:18 AM
   To: Lucene Users List
   Subject: RE: A question about scoring function in Lucene
   
   Thanks for your answer.
   In the Lucene scoring function, only norm_q is used,
   but for one query, norm_q is the same for all
   documents, so norm_q does not actually affect the score.
   But norm_d is different: each document has a different
   norm_d, and it affects the score of document d for query q.
   If you drop it, the score information is not correct
   anymore, or it is not the vector space model anymore.  Could
   you explain it a little bit?

   I think that it's expensive to compute in incremental
   indexing, because when one document is added, the idf of
   each term changes.  But dropping it is not a good choice.

   What is the role of norm_d_t ?
   Nhan.
   

Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Otis Gospodnetic wrote:
There is one case that I can think of where this 'constant' scoring
would be useful, and I think Chuck already mentioned this 1-2 months
ago.  For instance, having such scores would allow one to create alert
applications where queries run by some scheduler would trigger an alert
whenever the score is > X.  So that is where the absolute value of the
score would be useful.
Right, but the question is, would a single score threshold be effective 
for all queries, or would one need a separate score threshold for each 
query?  My hunch is that the latter is better, regardless of the scoring 
algorithm.

Also, just because Lucene's default scoring does not guarantee scores 
between zero and one does not necessarily mean that these scores are 
less meaningful.

Doug


Re: A question about scoring function in Lucene

2004-12-15 Thread Doug Cutting
Chris Hostetter wrote:
For example, using the current scoring equation, if i do a search for
"Doug Cutting" and the results/scores i get back are...
  1:   0.9
  2:   0.3
  3:   0.21
  4:   0.21
  5:   0.1
...then there are at least two meaningful pieces of data I can glean:
   a) document #1 is significantly better than the other results
   b) document #3 and #4 are both equally relevant to "Doug Cutting"
If I then do a search for "Chris Hostetter" and get back the following
results/scores...
  9:   0.9
  8:   0.3
  7:   0.21
  6:   0.21
  5:   0.1
...then I can assume the same corresponding information is true about my
new search term (#9 is significantly better, and #7/#8 are equally as good)
However, I *cannot* say either of the following:
  x) document #9 is as relevant for "Chris Hostetter" as document #1 is
     relevant to "Doug Cutting"
  y) document #5 is equally relevant to both "Chris Hostetter" and
     "Doug Cutting"
That's right.  Thanks for the nice description of the issue.
I think the OP is arguing that if the scoring algorithm was modified in
the way they suggested, then you would be able to make statements x and y.
And I am not convinced that, with the changes Chuck describes, one can 
be any more confident of x and y.

Doug


RE: A question about scoring function in Lucene

2004-12-14 Thread Chuck Williams
Nhan,

Re.  your two differences:

1 is not a difference.  Norm_d and Norm_q are both independent of t, so summing 
over t has no effect on them.  I.e., Norm_d * Norm_q is constant wrt the 
summation, so it doesn't matter if the sum is over just the numerator or over 
the entire fraction, the result is the same.
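A quick numeric check of this point (toy numbers, not Lucene code): a normalizer that is constant with respect to t gives the same result whether it is applied inside or outside the sum:

```python
# Toy per-term numerators tf_q*idf_t * tf_d*idf_t for one (q, d) pair.
num = [0.4, 0.1, 0.25]
N = 3.7  # stands in for Norm_d * Norm_q: independent of t

inside = sum(x / N for x in num)  # sum over the entire fraction, as in (*)
outside = sum(num) / N            # sum over the numerator only, as in (***)
print(abs(inside - outside) < 1e-12)  # True: the two forms agree
```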

2 is a difference.  Lucene uses Norm_q instead of Norm_d because Norm_d is too 
expensive to compute, especially in the presence of incremental indexing.  
E.g., adding or deleting any document changes the idf's, so if Norm_d was used 
it would have to be recomputed for ALL documents.  This is not feasible.
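A tiny sketch of why (the idf formula here, log(numDocs/(docFreq+1)) + 1, is an assumption for illustration, not quoted from Lucene's source): adding a single document changes numDocs, so every term's idf, and with it every precomputed Norm_d, changes:

```python
import math

def idf(doc_freq, num_docs):
    # Assumed classic idf form, used only for this illustration.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

# idf of a term whose docFreq did not change, before and after
# one new document is added to the index:
before = idf(5, 1000)
after = idf(5, 1001)
print(before != after)  # True: even untouched terms get a new idf,
                        # so every stored Norm_d would be stale
```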

Another point you did not mention is that the idf term is squared (in both of 
your formulas).  Salton, the originator of the vector space model, dropped one 
idf factor from his formula as it improved results empirically.  More recent 
theoretical justifications of tf*idf provide intuitive explanations of why idf 
should only be included linearly.  tf is best thought of as the real vector 
entry, while idf is a weighting term on the components of the inner product.  
E.g., see the excellent paper by Robertson, "Understanding inverse document 
frequency: on theoretical arguments for IDF", available here:  
http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.

It's easy to correct for idf^2 by using a custom Similarity that takes a 
final square root.

Chuck
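As a numeric sketch of that square-root correction (again assuming the classic log(numDocs/(docFreq+1)) + 1 idf form, not quoting Lucene's source): if each of the two weights contributes sqrt(idf) instead of idf, their product becomes linear in idf:

```python
import math

def idf(doc_freq, num_docs):
    # Assumed classic idf form, used only for this illustration.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

df, n = 5, 1000
w = idf(df, n)

squared = w * w                          # idf enters both weights: idf^2
corrected = math.sqrt(w) * math.sqrt(w)  # square-rooted idf in each weight
print(abs(corrected - w) < 1e-9)  # True: the product is now linear in idf
```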

   -Original Message-
   From: Vikas Gupta [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, December 14, 2004 9:32 PM
   To: Lucene Users List
   Subject: Re: A question about scoring function in Lucene
   
   Lucene uses the vector space model. To understand that:
   
   -Read section 2.1 of the "Space optimizations for Total Ranking" paper
   (Linked
   here http://lucene.sourceforge.net/publications.html)
   -Read section 6 to 6.4 of
   http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
   -Read section 1 of
   http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps
   
   Vikas
   
   



Re: A question about scoring function in Lucene

2004-12-14 Thread Vikas Gupta
Lucene uses the vector space model. To understand that:

-Read section 2.1 of the "Space optimizations for Total Ranking" paper (Linked
here http://lucene.sourceforge.net/publications.html)
-Read section 6 to 6.4 of
http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of
http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

 Hi all,
 Lucene scores a document based on the correlation between
 the query q and document d:
 (this is the raw function; I don't pay attention to the
 boost_t, coord_q_d factors)

 score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
 / norm_d_t)  (*)

 Could anybody explain it in detail ? Or are there any
 papers, documents about this function ? Because:

 I have also read the book: Modern Information
 Retrieval, author: Ricardo Baeza-Yates and Berthier
 Ribeiro-Neto, Addison Wesley (Hope you have read it
 too). In page 27, they also suggest a scoring funtion
 for vector model based on the correlation between
 query q and document d as follow (I use different
 symbol):

                   sum_t( weight_t_d * weight_t_q )
  score_d(d, q) = ----------------------------------  (**)
                          norm_d * norm_q

 where weight_t_d = tf_d * idf_t
   weight_t_q = tf_q * idf_t
   norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
   norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

 Substituting the weights into (**):

                   sum_t( tf_q*idf_t * tf_d*idf_t )
  score_d(d, q) = ----------------------------------  (***)
                          norm_d * norm_q

 The two functions, (*) and (***), have 2 differences:
 1. In (***), the sum_t is just over the numerator, but
 in (*), the sum_t is over the entire fraction.  So, with
 norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
 calculated twice.  Is this right?  Please explain.

 2. There is no factor for the norm of the document (norm_d)
 in function (*).  Can you explain this?  What is the
 role of the factor norm_d_t ?

 One more question: could anybody give me documents or
 papers that explain this function in detail, so that when I
 apply Lucene to my system, I can adapt the documents
 and fields and still receive correct
 scoring information from Lucene?

 Best regard,
 Thanks every body,

 =
 Đặng Nhân
