RE: A question about scoring function in Lucene
Thanks for your answer.

In the Lucene scoring function only norm_q is used, but for a given query norm_q is the same for all documents, so norm_q does not actually affect the ranking. norm_d is different: each document has its own norm_d, and it affects the score of document d for query q. If you drop it, the score is no longer correct, or at least it is no longer the vector space model. Could you explain this a little more? I understand that it is expensive to compute under incremental indexing, because when one document is added the idf of each term changes. But dropping it does not seem like a good choice.

What is the role of norm_d_t?

Nhan.

--- Chuck Williams [EMAIL PROTECTED] wrote:

Nhan,

Re. your two differences:

1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant wrt the summation, so it doesn't matter if the sum is over just the numerator or over the entire fraction, the result is the same.

2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d was used it would have to be recomputed for ALL documents. This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula as it improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly. tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF", available here: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl if you sign up for an eval.
It's easy to correct for idf^2 by using a custom Similarity that takes a final square root.

Chuck

-----Original Message-----
From: Vikas Gupta [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 9:32 PM
To: Lucene Users List
Subject: Re: A question about scoring function in Lucene

Lucene uses the vector space model. To understand that:

- Read section 2.1 of the "Space optimizations for Total Ranking" paper (linked here: http://lucene.sourceforge.net/publications.html)
- Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
- Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:

Hi all,

Lucene scores a document based on the correlation between the query q and the document d (this is the raw function; I am ignoring the boost_t and coord_q_d factors):

    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )    (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? I ask because I have also read the book Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (I hope you have read it too). On page 27 they also suggest a scoring function for the vector model based on the correlation between query q and document d, as follows (I use different symbols):

    score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )    (**)

where

    weight_t_d = tf_d * idf_t
    weight_t_q = tf_q * idf_t
    norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
    norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

    score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / ( norm_d * norm_q )    (***)

The two functions, (*) and (***), have 2 differences:

1. In (***), the sum_t covers only the numerator, but in (*) the sum_t covers everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain.

2. There is no factor in (*) that defines the norm of the document, norm_d. Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody point me to documents or papers that explain this function in detail, so that when I apply Lucene in my system I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards,
Thanks everybody,

Đặng Nhân
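Chuck's second point above, that adding a document changes the idf values and therefore norm_d for every existing document, can be illustrated with a small sketch. This is not Lucene code: the toy documents are hypothetical, tf_d is taken as a raw term count, and the idf form 1 + ln(N / (df + 1)) is an assumption chosen for the example.

```python
import math

def idf(term, docs):
    # Assumed idf form for this sketch: 1 + ln(N / (df + 1)).
    df = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / (df + 1))

def norm_d(doc, docs):
    # norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) ), with tf_d the raw count of t in doc.
    return math.sqrt(sum((doc.count(t) * idf(t, docs)) ** 2 for t in set(doc)))

# Hypothetical two-document index.
docs = [
    ["lucene", "scoring", "function"],
    ["lucene", "index", "index"],
]
before = [norm_d(d, docs) for d in docs]

# Add a single new document; the two original documents are untouched.
docs_after = docs + [["lucene", "scoring", "query"]]
after = [norm_d(d, docs_after) for d in docs]

# True for each original document whose norm_d moved after the addition.
norms_changed = [not math.isclose(b, a) for b, a in zip(before, after)]

for i, (b, a) in enumerate(zip(before, after)):
    print(f"doc {i}: norm_d before = {b:.3f}, after = {a:.3f}")
```

Neither original document was modified, yet both norms change, which is why a stored norm_d of this form would have to be recomputed for every document after each update.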
A question about scoring function in Lucene
Hi all,

Lucene scores a document based on the correlation between the query q and the document d (this is the raw function; I am ignoring the boost_t and coord_q_d factors):

    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )    (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? I ask because I have also read the book Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (I hope you have read it too). On page 27 they also suggest a scoring function for the vector model based on the correlation between query q and document d, as follows (I use different symbols):

    score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )    (**)

where

    weight_t_d = tf_d * idf_t
    weight_t_q = tf_q * idf_t
    norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
    norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

Expanding (**):

    score_d(d, q) = sum_t( tf_q*idf_t * tf_d*idf_t ) / ( norm_d * norm_q )    (***)

The two functions, (*) and (***), have 2 differences:

1. In (***), the sum_t covers only the numerator, but in (*) the sum_t covers everything. So, with norm_q = sqrt(sum_t((tf_q*idf_t)^2)), sum_t is calculated twice. Is this right? Please explain.

2. There is no factor in (*) that defines the norm of the document, norm_d. Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody point me to documents or papers that explain this function in detail, so that when I apply Lucene in my system I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards,
Thanks everybody,

Đặng Nhân

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
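For readers following the two questions above, here is a minimal sketch of formula (***) on toy data. It is not Lucene's implementation: the idf values, the query, and the two documents are hypothetical, and both norms are computed exactly as defined in the message (so norm_d_t from (*) is not modelled). It checks that dividing each term by the norms inside the sum gives the same value as dividing the whole sum once, and it compares the rankings obtained when norm_q or norm_d is dropped.

```python
import math

# Hypothetical idf values and term frequencies, chosen only for illustration.
idf = {"lucene": 1.0, "score": 2.0, "index": 1.5, "store": 1.5}
tf_q = {"lucene": 1, "score": 1}
docs = {
    "short": {"lucene": 1, "score": 1},
    "long": {"lucene": 3, "score": 2, "index": 5, "store": 4},
}

def norm(tf):
    # sqrt( sum_t( (tf_t * idf_t)^2 ) ), used for both norm_q and norm_d.
    return math.sqrt(sum((f * idf[t]) ** 2 for t, f in tf.items()))

def dot(tf_d):
    # Numerator of (***): sum_t( tf_q*idf_t * tf_d*idf_t ).
    return sum(f * idf[t] * tf_d.get(t, 0) * idf[t] for t, f in tf_q.items())

def rank(scores):
    # Document names ordered from highest to lowest score.
    return sorted(scores, key=scores.get, reverse=True)

norm_q = norm(tf_q)

# (***): the norms divide the finished sum once.
full = {name: dot(d) / (norm(d) * norm_q) for name, d in docs.items()}

# Same norms, but applied to every term inside the sum.
per_term = {
    name: sum(
        f * idf[t] / norm_q * d.get(t, 0) * idf[t] / norm(d)
        for t, f in tf_q.items()
    )
    for name, d in docs.items()
}

no_norm_q = {name: dot(d) / norm(d) for name, d in docs.items()}
no_norm_d = {name: dot(d) / norm_q for name, d in docs.items()}

print("full ranking:     ", rank(full))
print("without norm_q:   ", rank(no_norm_q))
print("without norm_d:   ", rank(no_norm_d))
```

In this toy setup the per-term and factored forms agree, dropping norm_q leaves the order unchanged, and dropping norm_d lets the longer document overtake the shorter one.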
Re: Using Lucene to store document
Hi,

When the index is read into memory for searching, which data in the segment/index is loaded? Are all the indexed fields/terms loaded? Are the stored fields loaded as well?

Thanks,

Otis Gospodnetic [EMAIL PROTECTED] wrote:

Hello,

HEAD version means that you should check out Lucene straight out of CVS. How to work with CVS is another story, probably described somewhere on the jakarta.apache.org site.

Otis

--- Nhan Nguyen Dang wrote:

Hi Otis,

Please let me know what the HEAD version of Lucene is.

Actually, I'm considering the advantages of storing documents using Lucene stored fields for my search engine. I've tested with thousands of documents and found that retrieving a document (in this case an XML file) with Lucene is a little bit faster than using the file system. But I cannot test with a large amount of data to get an accurate comparison. So the question is whether Lucene can support millions of documents, stay balanced, and still retrieve them at an appropriate speed.

Nhan
Re: Using Lucene to store document
Hi Otis,

Please let me know what the HEAD version of Lucene is.

Actually, I'm considering the advantages of storing documents using Lucene stored fields for my search engine. I've tested with thousands of documents and found that retrieving a document (in this case an XML file) with Lucene is a little bit faster than using the file system. But I cannot test with a large amount of data to get an accurate comparison. So the question is whether Lucene can support millions of documents, stay balanced, and still retrieve them at an appropriate speed.

Nhan
Using Lucene to store document
Hi all,

I'm using Lucene to index XML documents/files (possibly millions of documents in the future, each about 5-10KB).

Besides the index for searching, I want to use Lucene to store the whole document content in an UnIndexed field, the content field, instead of storing each document in its own XML file. All the document content would be stored in a separate index. Each time I want to access a document, I would let Lucene retrieve it.

I am weighing this against another option: using the file system to store the document content as separate XML documents. That means 400K documents will be stored as 400K XML files in the file system. The purpose of this is that I can access each document rapidly.

Can anybody who has experience with this problem advise me which method is suitable? Is it better to collect all the documents into a Lucene index or to store them separately in the file system?

Thanks,
Dang Nhan