RE: A question about scoring function in Lucene
Nhan,

Re. your two differences:

1 is not a difference. Norm_d and Norm_q are both independent of t, so summing over t has no effect on them. I.e., Norm_d * Norm_q is constant with respect to the summation, so it doesn't matter whether the sum covers just the numerator or the entire fraction; the result is the same.

2 is a difference. Lucene uses Norm_q instead of Norm_d because Norm_d is too expensive to compute, especially in the presence of incremental indexing. E.g., adding or deleting any document changes the idf's, so if Norm_d were used it would have to be recomputed for ALL documents. This is not feasible.

Another point you did not mention is that the idf term is squared (in both of your formulas). Salton, the originator of the vector space model, dropped one idf factor from his formula because doing so improved results empirically. More recent theoretical justifications of tf*idf provide intuitive explanations of why idf should only be included linearly: tf is best thought of as the real vector entry, while idf is a weighting term on the components of the inner product. E.g., see the excellent paper by Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF", available here if you sign up for an eval: http://www.emeraldinsight.com/rpsv/cgi-bin/emft.pl

It's easy to correct for idf^2 by using a custom Similarity that takes a final square root.

Chuck

> -----Original Message-----
> From: Vikas Gupta [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 14, 2004 9:32 PM
> To: Lucene Users List
> Subject: Re: A question about scoring function in Lucene
>
> Lucene uses the vector space model. [rest of quoted message trimmed]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
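Chuck's two numerical points — that the norms can be factored out of the sum, and that a final square root on idf undoes the squaring — can be checked with a few lines of plain Java (a standalone sketch with made-up weights, not Lucene code):

```java
// Standalone sketch (plain Java, made-up weights -- not Lucene code).
class ScoringChecks {
    // per-term weights w = tf * idf for a 3-term query and document
    static final double[] WQ = {0.5, 1.2, 0.8};
    static final double[] WD = {1.0, 0.3, 2.0};

    static double norm(double[] w) {
        double s = 0;
        for (double x : w) s += x * x;
        return Math.sqrt(s);
    }

    // (***)-style: sum the numerator only, divide by the norms once
    static double sumOutside() {
        double num = 0;
        for (int t = 0; t < WQ.length; t++) num += WQ[t] * WD[t];
        return num / (norm(WQ) * norm(WD));
    }

    // (*)-style: the whole fraction inside the sum
    static double sumInside() {
        double n = norm(WQ) * norm(WD);
        double score = 0;
        for (int t = 0; t < WQ.length; t++) score += WQ[t] * WD[t] / n;
        return score;
    }

    // a term's numerator contribution is (tf_q*idf)*(tf_d*idf) = tf_q*tf_d*idf^2;
    // substituting sqrt(idf) for idf in both factors collapses it to tf_q*tf_d*idf
    static double sqrtFixed(double tfQ, double tfD, double idf) {
        return (tfQ * Math.sqrt(idf)) * (tfD * Math.sqrt(idf));
    }

    public static void main(String[] args) {
        System.out.println(sumOutside());
        System.out.println(sumInside());        // same value as above
        System.out.println(sqrtFixed(2, 3, 5)); // linear in idf, not squared
    }
}
```

In Lucene itself the square-root fix would presumably live in a custom Similarity whose idf() returns the square root of the default value.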
LUCENE1.4.1 - LUCENE1.4.2 - LUCENE1.4.3 Exception
Hi guys,

Could somebody tell me why I am getting this exception? Please help.

System specifications:
O/S: Linux Gentoo
Appserver: Apache Tomcat/4.1.24
JDK: build 1.4.2_03-b02
Lucene: 1.4.1, 1.4.2, 1.4.3

Note: this exception is displayed on every 2nd query after Tomcat is started.

java.io.IOException: Stale NFS file handle
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(RandomAccessFile.java:307)
        at org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:420)
        at org.apache.lucene.store.InputStream.readBytes(InputStream.java:61)
        at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:220)
        at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
        at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
        at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
        at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:142)
        at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
        at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:143)
        at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:137)
        at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:253)
        at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:69)
        at org.apache.lucene.search.Similarity.idf(Similarity.java:255)
        at org.apache.lucene.search.TermQuery$TermWeight.sumOfSquaredWeights(TermQuery.java:47)
        at org.apache.lucene.search.Query.weight(Query.java:86)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
        at org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:251)

With warm regards, have a nice day,
[N.S.KARTHIK]
Re: Indexing a large number of DB records
Thanks Otis! What do you mean by building it in batches? Does it mean I should close the IndexWriter every 1000 rows and reopen it? Does that release references to the document objects so that they can be garbage-collected? I'm calling optimize() only at the end.

I agree that 1500 documents is very small. I'm building the index on a PC with 512 megs, and the indexing process quickly gobbles up around 400 megs when I index around 1800 documents, and the whole machine grinds to a virtual halt. I'm using the latest DotLucene .NET port, so maybe there's a memory leak in it.

I have experience with AltaVista search (acquired by FastSearch), where I used to call MakeStable() every 20,000 documents to flush memory structures to disk. There doesn't seem to be an equivalent in Lucene.

-- Homam

--- Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> Hello,
>
> There are a few things you can do:
> [rest of quoted message trimmed]
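The close-and-reopen batching being asked about can be sketched against the Lucene 1.4 Java API roughly as follows (the batch size, directory path, and makeDocument() helper are made up for illustration; the DotLucene port should have equivalent calls):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hedged sketch: close and reopen the IndexWriter every BATCH docs so
// queued in-memory structures can be flushed and garbage-collected,
// and call optimize() only once at the very end.
public class BatchedIndexer {
    static final int BATCH = 1000;  // made-up batch size, tune as needed

    public static void index(String indexDir, int totalRows) throws Exception {
        IndexWriter iw = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        for (int row = 0; row < totalRows; row++) {
            iw.addDocument(makeDocument(row));
            if ((row + 1) % BATCH == 0) {  // flush this batch to disk
                iw.close();
                iw = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            }
        }
        iw.optimize();  // only at the end
        iw.close();
    }

    // placeholder: in reality, build the Document from the current DB row
    static Document makeDocument(int row) {
        Document doc = new Document();
        doc.add(Field.UnStored("id", Integer.toString(row)));
        return doc;
    }
}
```

Raising IndexWriter's minMergeDocs (mentioned by Otis below) trades memory for fewer on-disk segment merges, so the batch size and that setting interact.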
Re: A question about scoring function in Lucene
Lucene uses the vector space model. To understand that:

- Read section 2.1 of the "Space Optimizations for Total Ranking" paper (linked here: http://lucene.sourceforge.net/publications.html)
- Read sections 6 to 6.4 of http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
- Read section 1 of http://www.cs.utexas.edu/users/inderjit/courses/dm2004/lecture5.ps

Vikas

On Tue, 14 Dec 2004, Nhan Nguyen Dang wrote:
> Hi all,
> Lucene scores a document based on the correlation between
> the query q and document d: [rest of quoted message trimmed]
A question about scoring function in Lucene
Hi all,

Lucene scores a document based on the correlation between the query q and document d (this is the raw function; I don't pay attention to the boost_t and coord_q_d factors):

  score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )   (*)

Could anybody explain it in detail? Or are there any papers or documents about this function? Because:

I have also read the book Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley (I hope you have read it too). On page 27, they also suggest a scoring function for the vector model, based on the correlation between query q and document d, as follows (I use different symbols):

  score_d(d, q) = sum_t( weight_t_d * weight_t_q ) / ( norm_d * norm_q )   (**)

where

  weight_t_d = tf_d * idf_t
  weight_t_q = tf_q * idf_t
  norm_d = sqrt( sum_t( (tf_d * idf_t)^2 ) )
  norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) )

so (**) becomes:

  score_d(d, q) = sum_t( tf_q * idf_t * tf_d * idf_t ) / ( norm_d * norm_q )   (***)

The two functions, (*) and (***), have 2 differences:

1. In (***), the sum_t covers just the numerator, but in (*) the sum_t covers everything. So, with norm_q = sqrt( sum_t( (tf_q * idf_t)^2 ) ), sum_t is calculated twice. Is this right? Please explain.

2. There is no factor defining the norm of the document (norm_d) in function (*). Can you explain this? What is the role of the factor norm_d_t?

One more question: could anybody give me documents or papers that explain this function in detail, so that when I apply Lucene to my system, I can adapt the documents and fields and still receive correct scoring information from Lucene?

Best regards, and thanks everybody,

Đặng Nhân
Re: Indexing a large number of DB records
Hello,

There are a few things you can do:

1) Don't just pull all rows from the DB at once. Do that in batches.

2) If you can get a Reader from your SqlDataReader, consider this: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/document/Field.html#Text(java.lang.String,%20java.io.Reader)

3) Give the JVM more memory to play with by using the -Xms and -Xmx JVM parameters.

4) See IndexWriter's minMergeDocs parameter.

5) Are you calling optimize() at some point by any chance? Leave that call for the end.

1500 documents with 30 columns of short String/number values is not a lot. You may be doing something else, not Lucene related, that's slowing things down.

Otis

--- "Homam S.A." <[EMAIL PROTECTED]> wrote:
> I'm trying to index a large number of records from the
> DB (a few millions). [rest of quoted message trimmed]
Indexing a large number of DB records
I'm trying to index a large number of records from the DB (a few million). Each record will be stored as a document with about 30 fields; most of them are UnStored and represent small strings or numbers. No huge DB Text fields.

But I'm running out of memory very fast, and the indexing is slowing down to a crawl once I hit around 1500 records. The problem is that each document is holding references to the string objects returned from ToString() on the DB fields, and the IndexWriter is holding references to all these document objects in memory, so the garbage collector isn't getting a chance to clean these up.

How do you guys go about indexing a large DB table? Here's a snippet of my code (this method is called for each record in the DB):

private void IndexRow(SqlDataReader rdr, IndexWriter iw) {
    Document doc = new Document();
    for (int i = 0; i < BrowseFieldNames.Length; i++) {
        doc.Add(Field.UnStored(BrowseFieldNames[i], rdr.GetValue(i).ToString()));
    }
    iw.AddDocument(doc);
}
Re: TFIDF Implementation
Bruce Ritchie wrote:

You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2&item=source

Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code.

I agree, it's a good way to exclude the source doc.

I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call.

I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html) does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guaranteed, thus it's safer to put the size() and getTerms() calls outside the loop.

for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(
        new Term("subject", vector.getTerms()[j]));

Regards,
Bruce Ritchie
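The hoisting being discussed could look roughly like this (a hedged fragment, not complete code; it assumes `vector` is a TermFreqVector fetched from the index, and uses the Lucene 1.4 BooleanQuery.add(query, required, prohibited) signature):

```java
// Hoisted: getTerms() is called once instead of on every iteration, since
// the javadoc doesn't guarantee that size()/getTerms() results are cached.
String[] terms = vector.getTerms();          // one call, reused below
BooleanQuery likeThis = new BooleanQuery();
for (int j = 0; j < terms.length; j++) {     // terms.length == vector.size()
    TermQuery tq = new TermQuery(new Term("subject", terms[j]));
    likeThis.add(tq, false, false);          // optional (SHOULD) clause
}
```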
Re: TFIDF Implementation
Bruce Ritchie wrote:

From the code I looked at, those calls don't recalculate on every call.

I was referring to this fragment below from BooksLikeThis.docsLike(), and was mentioning it as the javadoc (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html) does not say that the values returned by size() and getTerms() are cached, and while the impl may cache them (haven't checked) it's not guaranteed, thus it's safer to put the size() and getTerms() calls outside the loop.

for (int j = 0; j < vector.size(); j++) {
    TermQuery tq = new TermQuery(
        new Term("subject", vector.getTerms()[j]));

I agree on your overall point that it's probably best to put those calls outside of the loop; I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.

Oh oh oh, sorry, 10-4, no prob.

Regards,
Bruce Ritchie
RE: TFIDF Implementation
> > From the code I looked at, those calls don't recalculate on
> > every call.
>
> I was referring to this fragment below from BooksLikeThis.docsLike(),
> and was mentioning it as the javadoc
> http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/TermFreqVector.html
> does not say that the values returned by size() and getTerms() are
> cached, and while the impl may cache them (haven't checked) it's not
> guaranteed, thus it's safer to put the size() and getTerms() call
> outside the loop.
>
> for (int j = 0; j < vector.size(); j++) {
>     TermQuery tq = new TermQuery(
>         new Term("subject", vector.getTerms()[j]));

I agree on your overall point that it's probably best to put those calls outside of the loop; I was just saying that I did look at the implementation and the calls do not recalculate anything. I'm sorry I didn't explain myself clearly enough.

Regards,
Bruce Ritchie
Re: Opinions: Using Lucene as a thin database
On Tuesday 14 December 2004 20:13, Monsur Hossain wrote:
> My concern is that this just shifts the scaling issue to Lucene, and I
> haven't found much info on how to scale Lucene vertically.

You can easily use MultiSearcher to search over several indices. If you want the distribution to be more transparent, have a look at Nutch.

Regards
Daniel

--
http://www.danielnaber.de
RE: Opinions: Using Lucene as a thin database
Well, one could always partition an index, distribute the pieces horizontally across multiple 'search servers', and use the built-in RMI-based and parallel search features. Nutch uses something similar for search scaling.

Otis

--- Monsur Hossain <[EMAIL PROTECTED]> wrote:
> > My concern is that this just shifts the scaling issue to
> > Lucene, and I haven't found much info on how to scale Lucene
> > vertically.
>
> By "vertically", of course, I meant "horizontally". Basically scaling
> it across servers as one might do with a relational database.
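A minimal sketch of searching partitioned indices as one, using ParallelMultiSearcher from the Lucene 1.4 API (the index paths are placeholders; for cross-machine distribution the array entries could be RMI RemoteSearchables instead of local IndexSearchers):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

// Hedged sketch: each partition is opened separately, then searched
// in parallel and merged into one Hits result.
Searchable[] shards = {
    new IndexSearcher("/indexes/part1"),   // placeholder path
    new IndexSearcher("/indexes/part2"),   // placeholder path
};
ParallelMultiSearcher searcher = new ParallelMultiSearcher(shards);
Hits hits = searcher.search(query);        // 'query' built elsewhere
```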
RE: TFIDF Implementation
> > You can also see 'Books like this' example from here
> > https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
>
> Well done, uses a term vector, instead of reparsing the orig
> doc, to form the similarity query. Also I like the way you
> exclude the source doc in the query, I didn't think of doing
> that in my code.

I agree, it's a good way to exclude the source doc.

> I don't trust calling vector.size() and vector.getTerms()
> within the loop but I haven't looked at the code to see if it
> calculates the results each time or caches them...

From the code I looked at, those calls don't recalculate on every call.

Regards,
Bruce Ritchie
RE: TFIDF Implementation
You can also see 'Books like this' example from here https://secure.manning.com/catalog/view.php?book=hatcher2&item=source

Otis

--- Bruce Ritchie <[EMAIL PROTECTED]> wrote:
> Christoph,
>
> I'm not entirely certain if this is what you want, but a while back
> David Spencer did code up a 'More Like This' class which can be used
> for generating similarities between documents. I can't seem to find
> this class in the sandbox so I've attached it here. Just repackage
> and test.
>
> Regards,
>
> Bruce Ritchie
> http://www.jivesoftware.com/
>
> [rest of quoted messages trimmed]
[RFE] IndexWriter.updateDocument()
Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, hence this Request For Enhancement: please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory details involved in such an operation.

Thanks in advance.

Cheers,

PA.
RE: Opinions: Using Lucene as a thin database
You can see a Flickr-like tag (lookup) system at my Simpy site (http://www.simpy.com). It uses Lucene as the backend for lookups, but still uses an RDBMS as the primary storage. I find that keeping the RDBMS and the Lucene indices in sync is a bit of a pain and error prone. A _thin_ storage layer with simple requirements will be okay just using Lucene, while applications with more complex domain models will quickly run into limitations (a using-the-wrong-tool-for-the-job type of problem).

Otis

--- Monsur Hossain <[EMAIL PROTECTED]> wrote:
> I think this is a great idea, and one that I've been mulling over to
> implement keyword lookups (similar to Flickr.com's tag system). I
> believe the advantage over a relational database comes from Lucene's
> inverted index, which is highly optimized for this kind of lookup.
>
> My concern is that this just shifts the scaling issue to Lucene, and
> I haven't found much info on how to scale Lucene vertically.
> [rest of quoted message trimmed]
RE: Opinions: Using Lucene as a thin database
I think this is a great idea, and one that I've been mulling over to implement keyword lookups (similar to Flickr.com's tag system). I believe the advantage over a relational database comes from Lucene's inverted index, which is highly optimized for this kind of lookup.

My concern is that this just shifts the scaling issue to Lucene, and I haven't found much info on how to scale Lucene vertically.

> -----Original Message-----
> From: Kevin L. Cobb [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 14, 2004 9:40 AM
> To: [EMAIL PROTECTED]
> Subject: Opinions: Using Lucene as a thin database
>
> I use Lucene as a legitimate search engine, which is cool. But I am
> also using it as a simple database too. I build an index with a
> couple of keyword fields that allows me to retrieve values based on
> exact matches in those fields. This is all I need to do, so it works
> just fine for my needs. I also love the speed. The index is small
> enough that it is wicked fast. Was wondering if anyone out there was
> doing the same, or if there are any dissenting opinions on using
> Lucene for this purpose.
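The keyword-field lookup described above can be sketched against the Lucene 1.4 API roughly like this (field names and the index path are made up; error handling omitted). Keyword fields are stored and indexed untokenized, so a TermQuery gives an exact-match lookup:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Hedged sketch of the "thin database" pattern: index a row...
Document doc = new Document();
doc.add(Field.Keyword("tag", "lucene"));   // exact-match lookup key
doc.add(Field.Keyword("itemId", "42"));    // value to retrieve
IndexWriter iw = new IndexWriter("/tagindex", new StandardAnalyzer(), true);
iw.addDocument(doc);
iw.close();

// ...then look it up by exact term, as a SELECT-by-key would
IndexSearcher searcher = new IndexSearcher("/tagindex");
Hits hits = searcher.search(new TermQuery(new Term("tag", "lucene")));
for (int i = 0; i < hits.length(); i++) {
    System.out.println(hits.doc(i).get("itemId"));
}
searcher.close();
```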
Re: TFIDF Implementation
Otis Gospodnetic wrote:
> You can also see 'Books like this' example from here
> https://secure.manning.com/catalog/view.php?book=hatcher2&item=source

Well done, uses a term vector, instead of reparsing the orig doc, to form the similarity query. Also I like the way you exclude the source doc in the query, I didn't think of doing that in my code.

I don't trust calling vector.size() and vector.getTerms() within the loop but I haven't looked at the code to see if it calculates the results each time or caches them...

> Otis
>
> --- Bruce Ritchie <[EMAIL PROTECTED]> wrote:
> > Christoph,
> >
> > I'm not entirely certain if this is what you want, but a while back
> > David Spencer did code up a 'More Like This' class which can be used
> > for generating similarities between documents. I can't seem to find
> > this class in the sandbox so I've attached it here. Just repackage
> > and test.
> >
> > Regards,
> >
> > Bruce Ritchie
> > http://www.jivesoftware.com/
> >
> > [rest of quoted messages trimmed]
Re: Opinions: Using Lucene as a thin database
: select * from MY_TABLE where MY_NUMERIC_FIELD > 80
:
: as far as I know you have only the range query so you will have to say:
:
: my_numeric_field:[80 TO ??]
:
: but this would not work in the a/m example, or am I missing something?

RangeQuery allows you an open-ended range: you can tell the QueryParser to leave your range open-ended using the keyword "null", i.e...

my_numeric_field:[80 TO null]

-Hoss
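The same open-ended range can be built programmatically (a sketch against the Lucene 1.4 API; the field name follows the example above):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.RangeQuery;

// Hedged sketch: a null upper Term leaves the range open-ended, the
// programmatic equivalent of my_numeric_field:[0080 TO null]. Note that
// range queries compare terms lexicographically, so numeric fields should
// be indexed zero-padded to a fixed width for the ordering to be correct.
RangeQuery q = new RangeQuery(
    new Term("my_numeric_field", "0080"),  // lower bound (zero-padded)
    null,                                  // no upper bound
    true);                                 // inclusive
```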
TFIDF Implementation
Hi,

My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that similar functions are already implemented in the classes Similarity and DefaultSimilarity, but I don't know exactly how to use them. At the moment my index has about 25000 (small) documents and there are about 75000 terms stored in total.

Now, my question is simple: has anybody done this before, or could you point me to another location for help?

Thanks for any help in advance.

Christoph

--
Christoph Kiefer
Department of Informatics, University of Zurich
Office: Uni Irchel 27-K-32
Phone: +41 (0) 44 / 635 67 26
Email: [EMAIL PROTECTED]
Web: http://www.ifi.unizh.ch/ddis/christophkiefer.0.html
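The quantity being asked about can be sketched in plain Java without any Lucene classes (a toy corpus and one common smoothed idf variant, purely illustrative; in practice the term statistics would come from the index, e.g. via term vectors):

```java
import java.util.*;

// Hedged sketch: TFIDF cosine similarity between documents over a toy corpus.
class TfidfDemo {
    static final String[][] CORPUS = {
        {"lucene", "is", "a", "search", "library"},
        {"lucene", "scores", "documents", "with", "tfidf"},
        {"databases", "store", "rows"},
    };

    // smoothed idf variant: 1 + ln(N / (df + 1))
    static double idf(String term) {
        int df = 0;
        for (String[] doc : CORPUS)
            if (Arrays.asList(doc).contains(term)) df++;
        return 1.0 + Math.log((double) CORPUS.length / (df + 1));
    }

    // per-term weights w = tf * idf for one document
    static Map<String, Double> weights(String[] doc) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1.0, Double::sum);
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet())
            w.put(e.getKey(), e.getValue() * idf(e.getKey()));
        return w;
    }

    // cosine of the two weight vectors: dot / (|a| * |b|)
    static double cosine(String[] a, String[] b) {
        Map<String, Double> wa = weights(a), wb = weights(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : wa.entrySet())
            dot += e.getValue() * wb.getOrDefault(e.getKey(), 0.0);
        for (double x : wa.values()) na += x * x;
        for (double x : wb.values()) nb += x * x;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        System.out.println(cosine(CORPUS[0], CORPUS[1])); // share "lucene"
        System.out.println(cosine(CORPUS[0], CORPUS[2])); // no shared terms -> 0.0
    }
}
```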
RE: Opinions: Using Lucene as a thin database
> My concern is that this just shifts the scaling issue to > Lucene, and I haven't found much info on how to scale Lucene > vertically. By "vertically", of course, I meant "horizontally". Basically scaling it across servers as one might do with a relational database. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: [RFE] IndexWriter.updateDocument()
petite_abeille wrote: Well, the subject says it all... If there is one thing which is overly cumbersome in Lucene, it's updating documents, therefore this Request For Enhancement: Please consider enhancing the IndexWriter API to include an updateDocument(...) method to take care of all the gory details involved in such an operation. I agree, this is always a hassle to do right due to having to use IndexWriter and IndexReader and properly opening/closing them. I have a prelim version of a "batched index writer" that I use. The code is kinda messy, but for discussion here's what it does: Briefly the methods are: // [1] // the ctr has parameters: //'batch size # docs' e.g. it will flush pending updates every 100 docs //'batch freq' e.g. auto flush every 60 sec // [2] // queue a document to be added to the index // 'key' is the primary key name e.g. "url" // 'val' is the primary key val e.g. "http://www.tropo.com/" // 'doc' is the doc to be added update( String key, String val, Document doc) // [3] // queue a document for removal // 'key' and 'val' are the params, as in [2] remove( String key, String val) // [4] // periodic flush, called automatically or on demand, 5 steps: // 1. call IndexReader.delete() on all pending (key,val) pairs // 2. close IndexReader // 3. call IndexWriter.add() on all pending documents // 4. optionally call optimize() // 5. close IndexWriter flush() // So in normal usage you just keep calling update() and it periodically flushes the pending updates to the index. By its nature this uses memory; however, it's tunable as to how many documents it'll queue in memory. Does the algorithm above, esp. flush(), sound correct? It seems to work right for me and I can post this if people want to see it... - Dave Thanks in advance. Cheers, PA. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
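To make the comment-only sketch above concrete, here is a compact pure-Java model of the same queue-then-flush logic, with a plain Map standing in for the index (the real version would drive IndexReader.delete() in the first pass and IndexWriter.addDocument() in the second; class and method names are illustrative, and the time-based flush trigger is omitted):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Models the batched updater: queue updates/removes, then flush in two passes,
// deletes first (IndexReader territory), adds second (IndexWriter territory).
public class BatchedUpdater {
    private final Map<String, String> index;  // stand-in for the Lucene index, keyed by primary key
    private final List<String> pendingDeletes = new ArrayList<>();
    private final Map<String, String> pendingAdds = new HashMap<>();
    private final int batchSize;              // flush after this many queued deletes

    public BatchedUpdater(Map<String, String> index, int batchSize) {
        this.index = index;
        this.batchSize = batchSize;
    }

    // Queue a replacement: the old doc with this key is deleted, the new one added.
    public void update(String key, String doc) {
        pendingDeletes.add(key);
        pendingAdds.put(key, doc);
        if (pendingDeletes.size() >= batchSize) flush();
    }

    // Queue a removal; also cancels any pending add for the same key.
    public void remove(String key) {
        pendingDeletes.add(key);
        pendingAdds.remove(key);
        if (pendingDeletes.size() >= batchSize) flush();
    }

    // Pass 1: apply all queued deletes. Pass 2: apply all queued adds.
    public void flush() {
        for (String key : pendingDeletes) index.remove(key);
        pendingDeletes.clear();
        index.putAll(pendingAdds);
        pendingAdds.clear();
    }
}
```

Callers just keep calling update(); nothing becomes visible until flush() runs, which matches the memory/visibility trade-off Dave describes.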
Re: TFIDF Implementation
Bruce Ritchie wrote: Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox Uh oh, sorry, I'll try to get this checked in soonish. For me it's always one thing to do a prelim version of a piece of code, but another matter to get it correctly packaged. so I've attached it here. Just repackage and test. An alternate approach to find "similar" docs is to use all (possibly unique) tokens in the source doc to form a large query. This is code I use: 'srch' is the entire untokenized text of the source doc 'a' is the analyzer you want to use 'field' is the field you want to search on e.g. "contents" or "body" 'stop' is an optional set of stop words to ignore It returns a query, which you then use to search for "similar" docs, and then in the returned result you need to make sure you ignore the source doc, which will probably come back first. You can use stemming, synonyms, or fuzzy expansion for each term too. public static Query formSimilarQuery( String srch, Analyzer a, String field, Set stop) throws org.apache.lucene.queryParser.ParseException, IOException { TokenStream ts = a.tokenStream( "foo", new StringReader( srch)); org.apache.lucene.analysis.Token t; BooleanQuery tmp = new BooleanQuery(); Set already = new HashSet(); // skip terms we've seen before while ( (t = ts.next()) != null) { String word = t.termText(); if ( stop != null && stop.contains( word)) continue; if ( ! already.add( word)) continue; TermQuery tq = new TermQuery( new Term( field, word)); tmp.add( tq, false, false); // optional clause, not required or prohibited } return tmp; } Regards, Bruce Ritchie http://www.jivesoftware.com/ -Original Message- From: Christoph Kiefer [mailto:[EMAIL PROTECTED] Sent: December 14, 2004 11:45 AM To: Lucene Users List Subject: TFIDF Implementation Hi, My current task/problem is the following: I need to implement TFIDF document term ranking using Jakarta Lucene to compute a similarity rank between arbitrary documents in the constructed index. I saw from the API that there are similar functions already implemented in the class Similarity and DefaultSimilarity but I don't know exactly how to use them. At the time my index has about 25000 (small) documents and there are about 75000 terms stored in total. Now, my question is simple. Does anybody has done this before or could point me to another location for help? Thanks for any help in advance. Christoph -- Christoph Kiefer Department of Informatics, University of Zurich Office: Uni Irchel 27-K-32 Phone: +41 (0) 44 / 635 67 26 Email: [EMAIL PROTECTED] Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: TFIDF Implementation
Christoph, I'm not entirely certain if this is what you want, but a while back David Spencer did code up a 'More Like This' class which can be used for generating similarities between documents. I can't seem to find this class in the sandbox so I've attached it here. Just repackage and test. Regards, Bruce Ritchie http://www.jivesoftware.com/ > -Original Message- > From: Christoph Kiefer [mailto:[EMAIL PROTECTED] > Sent: December 14, 2004 11:45 AM > To: Lucene Users List > Subject: TFIDF Implementation > > Hi, > My current task/problem is the following: I need to implement > TFIDF document term ranking using Jakarta Lucene to compute a > similarity rank between arbitrary documents in the constructed index. > I saw from the API that there are similar functions already > implemented in the class Similarity and DefaultSimilarity but > I don't know exactly how to use them. At the time my index > has about 25000 (small) documents and there are about 75000 > terms stored in total. > Now, my question is simple. Does anybody has done this before > or could point me to another location for help? > > Thanks for any help in advance. > Christoph > > -- > Christoph Kiefer > > Department of Informatics, University of Zurich > > Office: Uni Irchel 27-K-32 > Phone: +41 (0) 44 / 635 67 26 > Email: [EMAIL PROTECTED] > Web:http://www.ifi.unizh.ch/ddis/christophkiefer.0.html > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opinions: Using Lucene as a thin database
On Dec 14, 2004, at 15:40, Kevin L. Cobb wrote: Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. ZOE [1] [2] takes the same approach and uses Lucene as a relational engine of sorts. However, for both practical and ideological reasons, it does not store any raw data in the Lucene indices themselves but instead uses JDBM [3] for that purpose. All things considered, update issues aside, Lucene turns out to be a very flexible "thin database". Cheers, PA. [1] http://zoe.nu/ [2] http://cvs.sourceforge.net/viewcvs.py/zoe/ZOE/Frameworks/SZObject/ [3] http://jdbm.sourceforge.net/ - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opinions: Using Lucene as a thin database
Hmm. So far all our fields are just strings. But I would guess you should be able to use Integer.MAX_VALUE or something on the upper bound. Or there might be a better way of doing it. Praveen - Original Message - From: "Akmal Sarhan" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Tuesday, December 14, 2004 10:23 AM Subject: Re: Opinions: Using Lucene as a thin database that sounds very interesting but how do you handle queries like select * from MY_TABLE where MY_NUMERIC_FIELD > 80 as far as I know you have only the range query so you will have to say my_numeric_filed:[80 TO ??] but this would not work in the a/m example or am I missing something? regards Akmal On Tue, 14.12.2004 at 16:07, Praveen Peddi wrote: Even we use lucene for similar purpose except that we index and store quite a few fields. Infact I also update partial documents as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating partial document. The reason we do this is due to the speed. I found the lucene search on a millions objects is 4 to 5 times faster than our oracle queries (ofcourse this might be due to our pitiful database design :) ). It works great so far. the only caveat that we had till now was incremental updates. But now I am implementing real-time updates so that the data in lucene index is almost always in sync with data in database. So now, our search does not goto the database at all. Praveen - Original Message - From: "Kevin L. Cobb" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Tuesday, December 14, 2004 9:40 AM Subject: Opinions: Using Lucene as a thin database I use Lucene as a legitimate search engine which is cool. But, I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do so it works just fine for my needs. I also love the speed.
The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opinions: Using Lucene as a thin database
that sounds very interesting but how do you handle queries like select * from MY_TABLE where MY_NUMERIC_FIELD > 80 as far as I know you have only the range query so you will have to say my_numeric_filed:[80 TO ??] but this would not work in the a/m example or am I missing something? regards Akmal On Tue, 14.12.2004 at 16:07, Praveen Peddi wrote: > Even we use lucene for similar purpose except that we index and store quite > a few fields. Infact I also update partial documents as people suggested. I > store all the indexed fields so I don't have to build the whole document > again while updating partial document. The reason we do this is due to the > speed. I found the lucene search on a millions objects is 4 to 5 times > faster than our oracle queries (ofcourse this might be due to our pitiful > database design :) ). It works great so far. the only caveat that we had > till now was incremental updates. But now I am implementing real-time > updates so that the data in lucene index is almost always in sync with data > in database. So now, our search does not goto the database at all. > > Praveen > - Original Message - > From: "Kevin L. Cobb" <[EMAIL PROTECTED]> > To: <[EMAIL PROTECTED]> > Sent: Tuesday, December 14, 2004 9:40 AM > Subject: Opinions: Using Lucene as a thin database > > > I use Lucene as a legitimate search engine which is cool. But, I am also > using it as a simple database too. I build an index with a couple of > keyword fields that allows me to retrieve values based on exact matches > in those fields. This is all I need to do so it works just fine for my > needs. I also love the speed. The index is small enough that it is > wicked fast. Was wondering if anyone out there was doing the same of it > there are any dissenting opinions on using Lucene for this purpose. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED]
> - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: finalize delete without optimize
Hello John, Once you make your change locally, use 'cvs diff -u IndexWriter.java > indexwriter.patch' to make a patch. Then open a new Bugzilla entry. Finally, attach your patch to that entry. Note that Document deletion is actually done from IndexReader, so your patch may have to be on IndexReader, not IndexWriter. Thanks, Otis --- John Wang <[EMAIL PROTECTED]> wrote: > Hi Otis: > > Thanks for you reply. > > I am looking for more of an API call than a tool. e.g. > IndexWriter.finalizeDelete() > > If I implement this, how would I go about submitting a patch? > > thanks > > -John > > > On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic > <[EMAIL PROTECTED]> wrote: > > Hello John, > > > > I believe you didn't get any replies to this. What you are > describing > > cannot be done using the public, but maaay (no source code on this > > machine, so I can't double-check that) be doable if you use some of > the > > 'internal' methods. > > > > I don't have the need for this, but others might, so it may be > worth > > developing a tool that purges Documents marked as deleted without > the > > expensive segment merging, iff that is possible. If you put this > tool > > under the approprite org.apache.lucene... package, you'll get > access to > > 'internal' methods, of course. If you end up creating this, we > could > > stick it in the Sandbox, where we should really create a new > section > > for handy command-line tools that manipulate the index. > > > > Otis > > > > > > > > > > --- John Wang <[EMAIL PROTECTED]> wrote: > > > > > Hi: > > > > > >Is there a way to finalize delete, e.g. actually remove them > from > > > the segments and make sure the docIDs are contiguous again. > > > > > >The only explicit way to do this is by calling > > > IndexWriter.optmize(). But this call does a lot more (also merges > all > > > the segments), hence is very expensive. Is there a way to simply > just > > > finalize the deletes without having to merge all the segments? 
> > > > > > If not, I'd be glad to submit an implementation of this > feature > > > if > > > the Lucene devs agree this is useful. > > > > > > Thanks > > > > > > -John > > > > > > > - > > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > > > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opinions: Using Lucene as a thin database
How big do you expect it to get, and how often do you expect to update it? We've been using Lucene for about 1M records (19 fields each) with incremental updates every 10 minutes. The performance during updates wasn't wonderful, so it took some seriously intense code to sort that out. As you mentioned, it comes down to what you need the thin DB for: Lucene is a wonderful search engine, but if I were looking for a fast and dirty relational DB, MySQL wins hands down. Put them both together and you've really got something. My 2 cents Nader Henein Kevin L. Cobb wrote: I use Lucene as a legitimate search engine which is cool. But, I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Opinions: Using Lucene as a thin database
We use Lucene for a similar purpose, except that we index and store quite a few fields. In fact I also update partial documents as people suggested. I store all the indexed fields so I don't have to build the whole document again while updating a partial document. The reason we do this is speed: I found that Lucene search on a million objects is 4 to 5 times faster than our Oracle queries (of course this might be due to our pitiful database design :) ). It works great so far. The only caveat that we had till now was incremental updates. But now I am implementing real-time updates so that the data in the Lucene index is almost always in sync with the data in the database. So now, our search does not go to the database at all. Praveen - Original Message - From: "Kevin L. Cobb" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Tuesday, December 14, 2004 9:40 AM Subject: Opinions: Using Lucene as a thin database I use Lucene as a legitimate search engine which is cool. But, I am also using it as a simple database too. I build an index with a couple of keyword fields that allows me to retrieve values based on exact matches in those fields. This is all I need to do so it works just fine for my needs. I also love the speed. The index is small enough that it is wicked fast. Was wondering if anyone out there was doing the same or if there are any dissenting opinions on using Lucene for this purpose. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
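Praveen's approach -- store every indexed field so a partial update never has to rebuild the document from the original source -- can be sketched without Lucene as: read the stored copy, merge the changed fields, delete, re-add. All names below are illustrative; in Lucene the delete would go through IndexReader and the re-add through IndexWriter:

```java
import java.util.HashMap;
import java.util.Map;

// Partial update when the index stores all fields: rebuild the document from
// its stored fields, overwrite only what changed, then delete + re-add whole.
public class PartialUpdate {

    // Stand-in for the index: primary key -> stored field map.
    public static final Map<String, Map<String, String>> index = new HashMap<>();

    public static void partialUpdate(String key, Map<String, String> changedFields) {
        Map<String, String> doc = new HashMap<>(index.get(key)); // rebuild from stored copy
        doc.putAll(changedFields);                               // merge in the changes
        index.remove(key);                                       // delete old version
        index.put(key, doc);                                     // re-add full document
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "old title");
        doc.put("body", "unchanged body");
        index.put("id42", doc);

        Map<String, String> change = new HashMap<>();
        change.put("title", "new title");
        partialUpdate("id42", change);

        System.out.println(index.get("id42").get("title")); // new title
        System.out.println(index.get("id42").get("body"));  // unchanged body
    }
}
```

The untouched fields survive the round trip, which is exactly why storing everything spares a trip back to the database during updates.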
Re: Hit 2 has score less than Hit 3
On Dec 14, 2004, at 4:53 AM, Vikas Gupta wrote: I have come across a scenario where the hits returned are not sorted. Or maybe they are sorted but the explanation is not correct. Take a look at http://cofferdam.cs.utexas.edu:8080/search.jsp? query=space+odyssey&hitsPerPage=10&hitsPerSite=0 This site was down when I tried to access it. Look at the top 3 results. Score of Hit 1 is 1.0188559 Score of Hit 2 is 0.9934416 Score of Hit 3 is 1.0188559 I can't explain how score of hit 2 can be < hit 3. I thought the hits that were returned were sorted. Hits should be in descending score order by default. Are you using the new Sort facility at all? Are you walking through Hits properly (i.e. hits.doc(i), i is the i-th hit, not document id i)? What version of Lucene are you using? Sorry - figured I'd rattle off some standard troubleshooting questions :) FYI, the docs corresponding to hits 1,2 and 3 have exactly the same scoring fields(By scoring fields, I mean the fields used in the query). Use the IndexSearcher.explain() feature to get the real scoop on why a score is computed the way it is. Erik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: finalize delete without optimize
Hi Otis: Thanks for your reply. I am looking for more of an API call than a tool. e.g. IndexWriter.finalizeDelete() If I implement this, how would I go about submitting a patch? thanks -John On Mon, 13 Dec 2004 22:24:12 -0800 (PST), Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hello John, > > I believe you didn't get any replies to this. What you are describing > cannot be done using the public, but maaay (no source code on this > machine, so I can't double-check that) be doable if you use some of the > 'internal' methods. > > I don't have the need for this, but others might, so it may be worth > developing a tool that purges Documents marked as deleted without the > expensive segment merging, iff that is possible. If you put this tool > under the approprite org.apache.lucene... package, you'll get access to > 'internal' methods, of course. If you end up creating this, we could > stick it in the Sandbox, where we should really create a new section > for handy command-line tools that manipulate the index. > > Otis > > > > > --- John Wang <[EMAIL PROTECTED]> wrote: > > > Hi: > > > >Is there a way to finalize delete, e.g. actually remove them from > > the segments and make sure the docIDs are contiguous again. > > > >The only explicit way to do this is by calling > > IndexWriter.optmize(). But this call does a lot more (also merges all > > the segments), hence is very expensive. Is there a way to simply just > > finalize the deletes without having to merge all the segments? > > > > If not, I'd be glad to submit an implementation of this feature > > if > > the Lucene devs agree this is useful. > > > > Thanks > > > > -John > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: finalize delete without optimize
Hello John, I believe you didn't get any replies to this. What you are describing cannot be done using the public API, but maaay (no source code on this machine, so I can't double-check that) be doable if you use some of the 'internal' methods. I don't have the need for this, but others might, so it may be worth developing a tool that purges Documents marked as deleted without the expensive segment merging, iff that is possible. If you put this tool under the appropriate org.apache.lucene... package, you'll get access to 'internal' methods, of course. If you end up creating this, we could stick it in the Sandbox, where we should really create a new section for handy command-line tools that manipulate the index. Otis --- John Wang <[EMAIL PROTECTED]> wrote: > Hi: > >Is there a way to finalize delete, e.g. actually remove them from > the segments and make sure the docIDs are contiguous again. > >The only explicit way to do this is by calling > IndexWriter.optmize(). But this call does a lot more (also merges all > the segments), hence is very expensive. Is there a way to simply just > finalize the deletes without having to merge all the segments? > > If not, I'd be glad to submit an implementation of this feature > if > the Lucene devs agree this is useful. > > Thanks > > -John > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]