RE: A question about scoring function in Lucene

2004-12-15 Thread Nhan Nguyen Dang
Thanks for your answer.
Lucene's scoring function uses only norm_q, but for a
given query norm_q is the same for all documents, so
norm_q does not actually affect the ranking.
norm_d, however, is different for each document, so it
does affect the score of document d for query q.
If you drop it, the score is no longer correct, or at
least it is no longer the vector space model. Could
you explain this a little more?

I think norm_d is expensive to compute with incremental
indexing because the idf of each term changes whenever
a document is added. But dropping it is not a good choice.

What is the role of norm_d_t?
Nhan.
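
A small sketch of the point about norm_q versus norm_d, using made-up tf*idf weights. The helper names and the numbers are illustrative assumptions, not Lucene code. Dividing by norm_q rescales every score by the same constant and leaves the ranking unchanged, while dividing by norm_d can reorder the documents.

```python
import math

def norm(weights):
    """Euclidean length of a sparse weight vector."""
    return math.sqrt(sum(w * w for w in weights.values()))

def dot(q, d):
    """Inner product over the terms of the query."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

# Toy tf*idf weight vectors (invented numbers).
query = {"lucene": 1.0, "score": 1.0}
docs = {
    "short": {"lucene": 1.0, "score": 1.0},
    "long": {"lucene": 2.0, "score": 1.0, "index": 5.0, "search": 5.0},
}

def rank(score_fn):
    """Document names ordered by descending score."""
    return sorted(docs, key=lambda name: score_fn(docs[name]), reverse=True)

raw = rank(lambda d: dot(query, d))
with_norm_q = rank(lambda d: dot(query, d) / norm(query))
cosine = rank(lambda d: dot(query, d) / (norm(query) * norm(d)))

print(raw, with_norm_q, cosine)
```

Here the long document wins on the raw inner product, with or without norm_q, but the short document wins once norm_d is applied.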

--- Chuck Williams [EMAIL PROTECTED] wrote:

 Nhan,
 
 Re.  your two differences:
 
 1 is not a difference.  Norm_d and Norm_q are both
 independent of t, so summing over t has no effect on
 them.  I.e., Norm_d * Norm_q is constant wrt the
 summation, so it doesn't matter if the sum is over
 just the numerator or over the entire fraction, the
 result is the same.
 
 2 is a difference.  Lucene uses Norm_q instead of
 Norm_d because Norm_d is too expensive to compute,
 especially in the presence of incremental indexing. 
 E.g., adding or deleting any document changes the
 idf's, so if Norm_d was used it would have to be
 recomputed for ALL documents.  This is not feasible.
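
To illustrate why Norm_d is costly under incremental indexing, here is a toy sketch. It assumes the textbook idf = log(N / df), which is not necessarily Lucene's exact formula, and the documents are invented. Adding one document changes N and some df values, so the norm of every previously indexed document shifts.

```python
import math

def idf(term, docs):
    # Assumption: textbook idf, log(N / df).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def norm_d(doc, docs):
    # norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) ), as defined in the thread.
    return math.sqrt(sum((tf * idf(t, docs)) ** 2 for t, tf in doc.items()))

# Each document is a term -> term frequency map (made-up data).
docs = [
    {"lucene": 2, "index": 1},
    {"lucene": 1, "score": 3},
    {"vector": 1, "score": 1},
]
before = [norm_d(d, docs) for d in docs]

docs.append({"lucene": 1, "vector": 2})  # index one more document

after = [norm_d(d, docs) for d in docs[:3]]
changed = [not math.isclose(b, a) for b, a in zip(before, after)]
print(changed)
```

In this example all three existing norms change after a single addition, which is the recomputation Chuck describes as infeasible at index scale.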
 
 Another point you did not mention is that the idf
 term is squared (in both of your formulas).  Salton,
 the originator of the vector space model, dropped
 one idf factor from his formula as it improved
 results empirically.  More recent theoretical
 justifications of tf*idf provide intuitive
 explanations of why idf should only be included
 linearly.  tf is best thought of as the real vector
 entry, while idf is a weighting term on the
 components of the inner product.  E.g., see the
 excellent paper by Robertson, "Understanding inverse
 document frequency: on theoretical arguments for
 IDF", available here: 
 http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl
 if you sign up for an eval.
 
 It's easy to correct for idf^2 by using a custom
 Similarity that takes a final square root.
 
 Chuck
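
The arithmetic behind the square-root correction can be sketched as follows. This is plain Python, not Lucene API; in Lucene the change would presumably live in a Similarity subclass. The idf form below is an assumption modelled on log(numDocs / (docFreq + 1)) + 1, and the function names are hypothetical.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed idf form: log(numDocs / (docFreq + 1)) + 1.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def sqrt_idf(doc_freq, num_docs):
    # The "final square root": idf then enters the product only linearly.
    return math.sqrt(idf(doc_freq, num_docs))

def term_score(tf_q, tf_d, idf_fn, doc_freq, num_docs):
    # One term of the sum in (*): (tf_q * idf_t) * (tf_d * idf_t), norms omitted.
    w = idf_fn(doc_freq, num_docs)
    return (tf_q * w) * (tf_d * w)

squared = term_score(1, 1, idf, 9, 100)       # contains idf^2
linear = term_score(1, 1, sqrt_idf, 9, 100)   # contains idf once
print(squared, linear)
```

With the square root applied inside the idf function, the two idf factors in each term multiply back to a single idf.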
 
-Original Message-
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:32 PM
To: Lucene Users List
Subject: Re: A question about scoring function in Lucene

Lucene uses the vector space model. To understand that:

-Read section 2.1 of the "Space optimizations for Total Ranking" paper
 (linked here: http://lucene.sourceforge.net/publications.html)
-Read sections 6 to 6.4 of
 http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
-Read section 1 of
 http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas


A question about scoring function in Lucene

2004-12-14 Thread Nhan Nguyen Dang
Hi all,
Lucene scores a document based on the correlation between
the query q and the document d:
(this is the raw function; I am ignoring the
boost_t and coord_q_d factors)

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t
/ norm_d_t)  (*)

Could anybody explain it in detail? Or are there any
papers or documents about this function? I ask because:

I have also read the book Modern Information
Retrieval by Ricardo Baeza-Yates and Berthier
Ribeiro-Neto, Addison Wesley (I hope you have read it
too). On page 27, they also suggest a scoring function
for the vector model based on the correlation between
query q and document d, as follows (I use different
symbols):

score_d(d, q) = sum_t( weight_t_d * weight_t_q )
                / ( norm_d * norm_q )            (**)

where weight_t_d = tf_d * idf_t
      weight_t_q = tf_q * idf_t
      norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
      norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Substituting the weights into (**) gives:

score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t )
                / ( norm_d * norm_q )            (***)

The two functions, (*) and (***), have 2 differences:
1. In (***), the sum_t is over the numerator only, but
in (*), the sum_t is over everything. So, with
norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is
calculated twice. Is this right? Please explain.

2. There is no factor for the norm of the document, norm_d,
in function (*). Can you explain this? What is the
role of the factor norm_d_t?
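
For reference, formula (***) can be sketched in a few lines. This is a toy illustration with invented tf and idf values and hypothetical function names, not Lucene's implementation. It also checks difference 1 numerically: because norm_d and norm_q do not depend on t, dividing each term of the sum gives the same result as dividing the whole sum once.

```python
import math

def weights(tf, idf):
    """weight_t = tf_t * idf_t for every term in the vector."""
    return {t: f * idf[t] for t, f in tf.items()}

def norm(w):
    return math.sqrt(sum(x * x for x in w.values()))

def score(tf_q, tf_d, idf):
    """Formula (***): sum over the numerator, then one division."""
    w_q, w_d = weights(tf_q, idf), weights(tf_d, idf)
    numerator = sum(w * w_d.get(t, 0.0) for t, w in w_q.items())
    return numerator / (norm(w_d) * norm(w_q))

def score_termwise(tf_q, tf_d, idf):
    """Same formula with the division applied inside the sum."""
    w_q, w_d = weights(tf_q, idf), weights(tf_d, idf)
    n = norm(w_d) * norm(w_q)
    return sum(w * w_d.get(t, 0.0) / n for t, w in w_q.items())

# Made-up statistics for three terms.
idf = {"lucene": 1.5, "score": 2.0, "index": 1.0}
tf_q = {"lucene": 1, "score": 1}
tf_d = {"lucene": 3, "index": 2}

print(score(tf_q, tf_d, idf), score_termwise(tf_q, tf_d, idf))
```

Note that this equivalence relies on the norms being constant across terms; it says nothing about the per-term factor norm_d_t in (*).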

One more question: could anybody point me to documents
or papers that explain this function in detail, so that when I
apply Lucene in my system, I can adapt the documents
and the fields and still receive correct
scoring information from Lucene?

Best regards,
Thanks everybody,

Đặng Nhân







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Using Lucene to store document

2004-11-14 Thread Nhan Nguyen Dang
Hi,
When the index is read into memory for searching, which data in the segment/
index will be loaded?
I mean, all the indexed fields/terms? Are the stored fields loaded?
Thanks,


Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hello,

HEAD version means that you should check out Lucene straight out of
CVS. How to work with CVS is another story, probably described
somewhere on the jakarta.apache.org site.

Otis



Re: Using Lucene to store document

2004-11-10 Thread Nhan Nguyen Dang
Hi Otis,
Please let me know what the HEAD version of Lucene is.
Actually, I'm considering the advantages of storing documents using Lucene stored
fields for my search engine.
I've tested with thousands of documents and see that retrieving a document (in this
case an XML file) with Lucene is a little faster than using the file system. But I cannot
test with a large enough amount of data to make an accurate comparison.
So my question is whether Lucene can support millions of documents, still stay
balanced, and retrieve them at an appropriate speed.
Nhan



Using Lucene to store document

2004-11-09 Thread Nhan Nguyen Dang
Hi all,
I'm using Lucene to index XML documents/files (possibly millions of documents in
the future, each about 5-10KB).
Besides the index for searching, I want to use Lucene to store the whole document
content in an UnIndexed content field (instead of storing each document in
an XML file). All the document content will be stored in a separate index. Each
time I want to access a document, I will let Lucene retrieve it.
 
I am weighing this against another option: using the file system to store document
content in separate XML files, which means 400K documents will be stored in 400K
XML files in the file system.
 
The purpose of this is to access each document rapidly. Can anybody who
has experience with this problem advise me which method is more suitable?
Is it better to collect all documents into a Lucene index or to store them
separately in the file system?
 
Thanks,
Dang Nhan




