Re: URGENT: Help indexing large document set
From: [EMAIL PROTECTED]
Sent: Monday, November 22, 2004 12:35 PM
To: Lucene Users List
Subject: Re: Index in RAM - is it really worth it?

In my test I have 12,900 documents. Each document is small: a few discrete fields (Keyword type) and one Text field containing only one sentence. With both mergeFactor and maxMergeDocs set to 1000, the indexing job took about 9.2 seconds using RAMDirectory and about 122 seconds not using RAMDirectory. I am not calling optimize. This is on Windows XP running Java 1.5. Is there something very wrong or different in my setup to cause such a big difference?

Thanks
-John

On Mon, 22 Nov 2004 09:23:40 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:

> For the Lucene book I wrote some test cases that compare FSDirectory and
> RAMDirectory. What I found was that with certain settings FSDirectory was
> almost as fast as RAMDirectory. Personally, I would push FSDirectory and
> hope that the OS and the filesystem do their share of work and caching for
> me before looking for ways to optimize my code.
>
> Otis
>
> --- [EMAIL PROTECTED] wrote:
>
> > I did the following test: I created the RAM folder on my Red Hat box and
> > copied c. 1 GB of indexes there. I expected the queries to run much
> > quicker. In reality it was sometimes even slower (sic!). Lucene has its
> > own RAM disk functionality. If I implement it, would it bring any
> > benefits?
> >
> > Thanks in advance
> > J.

---
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-----Original Message-----
From: John Wang [mailto:[EMAIL PROTECTED]]
Sent: Saturday, November 27, 2004 11:50 AM
To: Chuck Williams
Subject: Re: URGENT: Help indexing large document set

I found the reason for the degradation. It is because I was writing to a RAMDirectory and then adding to an FSDirectory-based writer. I guess it makes sense, since the addIndexes call would slow down as the index grows.
I guess it is not a good idea to use a RAMDirectory if there are many small batches. Are there any performance numbers that would tell me when to use (or not use) a RAMDirectory?

thanks
-John

On Wed, 24 Nov 2004 15:23:49 -0800, John Wang [EMAIL PROTECTED] wrote:

> Hi Chuck:
>
> The reason I am not using localReader.delete(term) is because I have some
> logic to check whether to delete the term based on a flag. I am testing
> with the keys sorted. I am not doing anything weird, just committing
> batches of 500 documents to the index, 2000 batches in all. I don't see
> why it is having this linear slowdown...
>
> Thanks
> -John

On Wed, 24 Nov 2004 12:32:52 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

> Does keyIter return the keys in sorted order? This should reduce seeks,
> especially if the keys are dense. Also, you should be able to call
> localReader.delete(term) instead of iterating over the docs (of which I
> presume there is only one, since keys are unique). This won't improve
> performance, as IndexReader.delete(Term) does exactly what your code does,
> but it will be cleaner.
>
> A linear slowdown with the number of docs doesn't make sense, so something
> else must be wrong. I'm not sure what the default buffer size is (it
> appears it used to be 128, but is dynamic now, I think). You might find
> the slowdown stops after a certain point, especially if you increase your
> batch size.
>
> Chuck
>
> -----Original Message-----
> From: John Wang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, November 24, 2004 12:21 PM
> To: Lucene Users List
> Subject: Re: URGENT: Help indexing large document set
>
> Thanks Paul! Using your suggestion, I have changed the update check code
> to use only the IndexReader:
>
>     try {
>         localReader = IndexReader.open(path);
>         while (keyIter.hasNext()) {
>             key = (String) keyIter.next();
>             term = new Term(key, key);
>             TermDocs tDocs = localReader.termDocs(term);
>             if (tDocs != null) {
>                 try {
>                     while (tDocs.next()) {
>                         localReader.delete(tDocs.doc());
>                     }
>                 } finally {
>                     tDocs.close();
>                 }
>             }
>         }
>     } finally {
>         if (localReader != null) {
>             localReader.close();
>         }
>     }
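A plausible explanation for the batch-by-batch slowdown described above: in the Lucene 1.4-era API, IndexWriter.addIndexes() optimizes the target index, so each small RAMDirectory batch can trigger a rewrite of everything indexed so far. A self-contained back-of-envelope model of that cost (illustrative only; no Lucene involved, and the class and method names are made up):

```java
// Back-of-envelope model: if every batch merge rewrites the whole target
// index (as an optimize-on-addIndexes would), total work grows
// quadratically with the number of documents, while adding documents
// directly to one FSDirectory-based writer stays roughly linear.
public class MergeCostSketch {

    // Total docs rewritten when each of (totalDocs / batchSize) batches
    // triggers a full rewrite of the index built so far.
    static long batchedMergeCost(long totalDocs, long batchSize) {
        long cost = 0;
        for (long indexed = batchSize; indexed <= totalDocs; indexed += batchSize) {
            cost += indexed; // rewrite everything indexed so far
        }
        return cost;
    }

    // Each doc written once (ignoring Lucene's normal background merges).
    static long directAddCost(long totalDocs) {
        return totalDocs;
    }

    public static void main(String[] args) {
        long total = 1000000L, batch = 500L;
        System.out.println("batched merges: " + batchedMergeCost(total, batch));
        System.out.println("direct adds:    " + directAddCost(total));
    }
}
```

Under this model each batch rewrites one extra batch-worth of documents compared to the previous one, i.e. a constant additional cost per document per batch, which would show up as exactly the kind of linear per-document degradation reported in this thread.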
Re: URGENT: Help indexing large document set
On Wednesday 24 November 2004 00:37, John Wang wrote:

> Hi:
>
> I am trying to index 1M documents, in batches of 500. Each document has a
> unique text key, which is added as a Field.Keyword(name, value). For each
> batch of 500, I need to make sure I am not adding a document with a key
> that is already in the current index. To do this, I am calling
> IndexSearcher.docFreq for each document and deleting the document
> currently in the index with the same key:
>
>     while (keyIter.hasNext()) {
>         String objectID = (String) keyIter.next();
>         term = new Term(key, objectID);
>         int count = localSearcher.docFreq(term);

To speed this up a bit, make sure that the iterator gives the terms in sorted order. I'd use an IndexReader instead of a searcher, but that will probably not make a difference.

Adding the documents can be done with multiple threads. Last time I checked, there was a moderate speedup using three threads instead of one on a single-CPU machine. Tuning the values of minMergeDocs and maxMergeDocs may also help to increase the performance of adding documents.

Regards,
Paul Elschot
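Paul's first tip (feed the terms in sorted order) can be sketched without Lucene: collect each batch's keys into a TreeSet so the iterator walks them in term order, letting the term-dictionary lookups move forward through the index instead of seeking back and forth. The class below is illustrative; only keyIter and the sorted-order idea come from the thread:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

// Keys arrive in arbitrary order; a TreeSet yields them sorted (and also
// removes duplicates within the batch as a side effect).
public class SortedKeys {

    static List<String> sortedBatch(List<String> rawKeys) {
        TreeSet<String> sorted = new TreeSet<String>(rawKeys);
        List<String> out = new ArrayList<String>();
        for (Iterator<String> it = sorted.iterator(); it.hasNext();) {
            out.add(it.next()); // feed these to new Term(field, key) in order
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = java.util.Arrays.asList("doc42", "doc07", "doc13");
        System.out.println(sortedBatch(raw));
    }
}
```

If the keys are generated sorted in the first place (as John later says his are), this step is free; otherwise the O(n log n) sort per batch of 500 is negligible next to the index I/O it saves.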
Re: URGENT: Help indexing large document set
Thanks Paul! Using your suggestion, I have changed the update check code to use only the IndexReader:

    try {
        localReader = IndexReader.open(path);
        while (keyIter.hasNext()) {
            key = (String) keyIter.next();
            term = new Term(key, key);
            TermDocs tDocs = localReader.termDocs(term);
            if (tDocs != null) {
                try {
                    while (tDocs.next()) {
                        localReader.delete(tDocs.doc());
                    }
                } finally {
                    tDocs.close();
                }
            }
        }
    } finally {
        if (localReader != null) {
            localReader.close();
        }
    }

Unfortunately it didn't seem to make any dramatic difference. I also see that the CPU is only 30-50% busy, so I am guessing it's spending a lot of time in I/O. Any way of making the CPU work harder? Is a batch size of 500 too small for 1 million documents? Currently I am seeing a linear speed degradation of 0.3 milliseconds per document.

Thanks
-John

On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote:

> On Wednesday 24 November 2004 00:37, John Wang wrote:
>
> > Hi:
> >
> > I am trying to index 1M documents, in batches of 500. Each document has
> > a unique text key, which is added as a Field.Keyword(name, value). For
> > each batch of 500, I need to make sure I am not adding a document with
> > a key that is already in the current index. To do this, I am calling
> > IndexSearcher.docFreq for each document and deleting the document
> > currently in the index with the same key:
> >
> >     while (keyIter.hasNext()) {
> >         String objectID = (String) keyIter.next();
> >         term = new Term(key, objectID);
> >         int count = localSearcher.docFreq(term);
>
> To speed this up a bit, make sure that the iterator gives the terms in
> sorted order. I'd use an IndexReader instead of a searcher, but that will
> probably not make a difference.
>
> Adding the documents can be done with multiple threads. Last time I
> checked, there was a moderate speedup using three threads instead of one
> on a single-CPU machine. Tuning the values of minMergeDocs and
> maxMergeDocs may also help to increase the performance of adding
> documents.
>
> Regards,
> Paul Elschot
RE: URGENT: Help indexing large document set
Does keyIter return the keys in sorted order? This should reduce seeks, especially if the keys are dense. Also, you should be able to call localReader.delete(term) instead of iterating over the docs (of which I presume there is only one, since keys are unique). This won't improve performance, as IndexReader.delete(Term) does exactly what your code does, but it will be cleaner.

A linear slowdown with the number of docs doesn't make sense, so something else must be wrong. I'm not sure what the default buffer size is (it appears it used to be 128, but is dynamic now, I think). You might find the slowdown stops after a certain point, especially if you increase your batch size.

Chuck

-----Original Message-----
From: John Wang [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 24, 2004 12:21 PM
To: Lucene Users List
Subject: Re: URGENT: Help indexing large document set

Thanks Paul! Using your suggestion, I have changed the update check code to use only the IndexReader:

    try {
        localReader = IndexReader.open(path);
        while (keyIter.hasNext()) {
            key = (String) keyIter.next();
            term = new Term(key, key);
            TermDocs tDocs = localReader.termDocs(term);
            if (tDocs != null) {
                try {
                    while (tDocs.next()) {
                        localReader.delete(tDocs.doc());
                    }
                } finally {
                    tDocs.close();
                }
            }
        }
    } finally {
        if (localReader != null) {
            localReader.close();
        }
    }

Unfortunately it didn't seem to make any dramatic difference. I also see that the CPU is only 30-50% busy, so I am guessing it's spending a lot of time in I/O. Any way of making the CPU work harder? Is a batch size of 500 too small for 1 million documents? Currently I am seeing a linear speed degradation of 0.3 milliseconds per document.

Thanks
-John

On Wed, 24 Nov 2004 09:05:39 +0100, Paul Elschot [EMAIL PROTECTED] wrote:

> On Wednesday 24 November 2004 00:37, John Wang wrote:
>
> > Hi:
> >
> > I am trying to index 1M documents, in batches of 500. Each document has
> > a unique text key, which is added as a Field.Keyword(name, value). For
> > each batch of 500, I need to make sure I am not adding a document with
> > a key that is already in the current index. To do this, I am calling
> > IndexSearcher.docFreq for each document and deleting the document
> > currently in the index with the same key:
> >
> >     while (keyIter.hasNext()) {
> >         String objectID = (String) keyIter.next();
> >         term = new Term(key, objectID);
> >         int count = localSearcher.docFreq(term);
>
> To speed this up a bit, make sure that the iterator gives the terms in
> sorted order. I'd use an IndexReader instead of a searcher, but that will
> probably not make a difference.
>
> Adding the documents can be done with multiple threads. Last time I
> checked, there was a moderate speedup using three threads instead of one
> on a single-CPU machine. Tuning the values of minMergeDocs and
> maxMergeDocs may also help to increase the performance of adding
> documents.
>
> Regards,
> Paul Elschot
URGENT: Help indexing large document set
Hi:

I am trying to index 1M documents, in batches of 500. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term(key, objectID);
        int count = localSearcher.docFreq(term);
        if (count != 0) {
            localReader.delete(term);
        }
    }

Then I proceed with adding the documents. This turns out to be extremely expensive. I looked into the code, and I see that TermInfosReader.get(Term term) does a linear lookup for each term, so as the index grows the above operation degrades at a linear rate. For each commit we are doing a docFreq for 500 documents.

I also tried creating a BooleanQuery composed of 500 TermQueries and doing one search per batch, and the performance didn't get better. And if the batch size increases to, say, 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs.

Is there a better way to do this? Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order; are they?

This is very urgent; thanks in advance for all your help.

-John
RE: URGENT: Help indexing large document set
Are you sure you have a performance problem with TermInfosReader.get(Term)? It looks to me like it scans sequentially only within a small buffer window (of size SegmentTermEnum.indexInterval), and that it uses binary search otherwise. See TermInfosReader.getIndexOffset(Term).

Chuck

-----Original Message-----
From: John Wang [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 23, 2004 3:38 PM
To: [EMAIL PROTECTED]
Subject: URGENT: Help indexing large document set

Hi:

I am trying to index 1M documents, in batches of 500. Each document has a unique text key, which is added as a Field.Keyword(name, value). For each batch of 500, I need to make sure I am not adding a document with a key that is already in the current index. To do this, I am calling IndexSearcher.docFreq for each document and deleting the document currently in the index with the same key:

    while (keyIter.hasNext()) {
        String objectID = (String) keyIter.next();
        term = new Term(key, objectID);
        int count = localSearcher.docFreq(term);
        if (count != 0) {
            localReader.delete(term);
        }
    }

Then I proceed with adding the documents. This turns out to be extremely expensive. I looked into the code, and I see that TermInfosReader.get(Term term) does a linear lookup for each term, so as the index grows the above operation degrades at a linear rate. For each commit we are doing a docFreq for 500 documents.

I also tried creating a BooleanQuery composed of 500 TermQueries and doing one search per batch, and the performance didn't get better. And if the batch size increases to, say, 50,000, creating a BooleanQuery composed of 50,000 TermQuery instances may introduce huge memory costs.

Is there a better way to do this? Can TermInfosReader.get(Term term) be optimized to do a binary lookup instead of a linear walk? Of course that depends on whether the terms are stored in sorted order; are they?

This is very urgent; thanks in advance for all your help.

-John
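Chuck's description of the term-dictionary lookup can be sketched in plain Java: a sparse in-memory index holds every Nth term (N being the index interval); a binary search over that sparse index picks the window, and only the window is scanned linearly, so the scan is bounded by N regardless of index size. This is a stand-alone illustrative model, not Lucene's actual code:

```java
import java.util.Arrays;

// Model of an indexInterval-style term lookup: binary-search a sparse
// in-memory index (every intervalth term), then scan at most
// `interval` entries linearly within the chosen window.
public class TermIndexSketch {
    final String[] terms;       // all terms, sorted (like the .tis file)
    final String[] sparseIndex; // every intervalth term (like the .tii file)
    final int interval;

    TermIndexSketch(String[] sortedTerms, int interval) {
        this.terms = sortedTerms;
        this.interval = interval;
        int n = (sortedTerms.length + interval - 1) / interval;
        sparseIndex = new String[n];
        for (int i = 0; i < n; i++) {
            sparseIndex[i] = sortedTerms[i * interval];
        }
    }

    // Returns the position of term, or -1 if absent.
    int position(String term) {
        // Binary search the sparse index for the window start...
        int i = Arrays.binarySearch(sparseIndex, term);
        int windowStart = (i >= 0) ? i * interval
                                   : Math.max(0, (-i - 2)) * interval;
        // ...then scan linearly within a single window.
        int end = Math.min(terms.length, windowStart + interval);
        for (int p = windowStart; p < end; p++) {
            if (terms[p].equals(term)) return p;
        }
        return -1;
    }

    public static void main(String[] args) {
        TermIndexSketch t = new TermIndexSketch(
            new String[]{"a", "b", "c", "d", "e", "f", "g"}, 3);
        System.out.println(t.position("e"));
    }
}
```

With an interval of 128 (the old default Chuck mentions), each lookup is a binary search over the sparse index plus at most 128 sequential term comparisons, which is why per-term cost should not grow linearly with index size.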
Re: URGENT: Help indexing large document set
Thanks Chuck! I missed the call to getIndexOffset. I am profiling it again to pinpoint where the performance problem is.

-John

On Tue, 23 Nov 2004 16:13:22 -0800, Chuck Williams [EMAIL PROTECTED] wrote:

> Are you sure you have a performance problem with
> TermInfosReader.get(Term)? It looks to me like it scans sequentially only
> within a small buffer window (of size SegmentTermEnum.indexInterval), and
> that it uses binary search otherwise. See
> TermInfosReader.getIndexOffset(Term).
>
> Chuck
>
> -----Original Message-----
> From: John Wang [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, November 23, 2004 3:38 PM
> To: [EMAIL PROTECTED]
> Subject: URGENT: Help indexing large document set
>
> Hi:
>
> I am trying to index 1M documents, in batches of 500. Each document has a
> unique text key, which is added as a Field.Keyword(name, value). For each
> batch of 500, I need to make sure I am not adding a document with a key
> that is already in the current index. To do this, I am calling
> IndexSearcher.docFreq for each document and deleting the document
> currently in the index with the same key:
>
>     while (keyIter.hasNext()) {
>         String objectID = (String) keyIter.next();
>         term = new Term(key, objectID);
>         int count = localSearcher.docFreq(term);
>         if (count != 0) {
>             localReader.delete(term);
>         }
>     }
>
> Then I proceed with adding the documents. This turns out to be extremely
> expensive. I looked into the code, and I see that
> TermInfosReader.get(Term term) does a linear lookup for each term, so as
> the index grows the above operation degrades at a linear rate. For each
> commit we are doing a docFreq for 500 documents.
>
> I also tried creating a BooleanQuery composed of 500 TermQueries and
> doing one search per batch, and the performance didn't get better. And if
> the batch size increases to, say, 50,000, creating a BooleanQuery
> composed of 50,000 TermQuery instances may introduce huge memory costs.
>
> Is there a better way to do this? Can TermInfosReader.get(Term term) be
> optimized to do a binary lookup instead of a linear walk? Of course that
> depends on whether the terms are stored in sorted order; are they?
>
> This is very urgent; thanks in advance for all your help.
>
> -John