Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, reference test is done: on JDK 1.4.2 Lucene 1.4.1 really seems to run fine: just a moderate number of SegmentTermEnums that is controlled by gc (about 500 for the 1900 test objects). Daniel Taurat wrote: Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (ibm). Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 environment. Results: I patched the Lucene1.4.1: it has improved not much: after indexing 1897 Objects the number of SegmentTermEnum is up to 17936. To be realistic: This is even a deterioration :((( My next check will be with a JDK1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Ab
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
IBM JDK1.4.2 should work fine. AFAIK JDK1.3.1 is usable if you disable JIT. John Daniel Taurat wrote: Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (ibm). Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 environment. Results: I patched the Lucene1.4.1: it has improved not much: after indexing 1897 Objects the number of SegmentTermEnum is up to 17936. To be realistic: This is even a deterioration :((( My next check will be with a JDK1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memor
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Doug, you are absolutely right about the older version of the JDK: it is 1.3.1 (ibm). Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 environment. Results: I patched the Lucene1.4.1: it has improved not much: after indexing 1897 Objects the number of SegmentTermEnum is up to 17936. To be realistic: This is even a deterioration :((( My next check will be with a JDK1.4.2 for the test environment, but this can only be a reference run for now. Thanks, Daniel Doug Cutting wrote: It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Taurat wrote: Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) Depends on what OS and with what patches... Linux on i386 seems to have a physical limit of 1.7G (256M for VM) ... There are some patches to apply to get 3G but only on really modern kernels. I just need to get Athlon systems :-/ Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
It sounds like the ThreadLocal in TermInfosReader is not getting correctly garbage collected when the TermInfosReader is collected. Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is that you're running in an older JVM. Is that right? I've attached a patch which should fix this. Please tell me if it works for you. Doug Daniel Taurat wrote: Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all..
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Okay, that (1.4rc3)worked fine, too! Got only 257 SegmentTermEnums for 1900 objects. Now I will go for the final test on the production server with the 1.4rc3 version and about 40.000 objects. Daniel Daniel Taurat schrieb: Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAI
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all, here is some update for you: I switched back to Lucene 1.3-final and now the number of the SegmentTermEnum objects is controlled by gc again: it goes up to about 1000 and then it is down again to 254 after indexing my 1900 test-objects. Stay tuned, I will try 1.4RC3 now, the last version before FieldCache was introduced... Daniel Rupinder Singh Mazara schrieb: hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) -Original Message- From: Daniel Taurat [mailto:[EMAIL PROTECTED] Sent: 10 September 2004 14:42 To: Lucene Users List Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAI
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
The Parser is pdfBox. pdf is about 25% of the over all indexing volume on the productive system. I also have word-docs and loads of hmtl resources to be indexed. In my testing environment I merely have 5 pdf docs and still those permanent object hanging around, though. Cheers, Daniel Ben Litchfield wrote: I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. <> What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents
hi all I had a similar problem, i have database of documents with 24 fields, and a average content of 7K, with 16M+ records i had to split the jobs into slabs of 1M each and merging the resulting indexes, submissions to our job queue looked like java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22 and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally Exception in thread "main" java.lang.OutOfMemoryError at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled Code)) at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined Compiled Code)) at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled Code)) at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366) at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code)) at lucene.Indexer.main(CDBIndexer.java:168) >-Original Message- >From: Daniel Taurat [mailto:[EMAIL PROTECTED] >Sent: 10 September 2004 14:42 >To: Lucene Users List >Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number >of documents > > >Hi Pete, >good hint, but we actually do have physical memory of 4Gb on the >system. But then: we also have experienced that the gc of ibm jdk1.3.1 >that we use is sometimes >behaving strangely with too large heap space anyway. (Limit seems to be >1.2 Gb) >I can say that gc is not collecting these objects since I forced gc >runs when indexing every now and then (when parsing pdf-type objects, >that is): No effect. > >regards, > >Daniel > > >Pete Lewis wrote: > >>Hi all >> >>Reading the thread with interest, there is another way I've come >across out >>of memory errors when indexing large batches of documents. >> >>If you have your heap space settings too high, then you get >swapping (which >>impacts performance) plus you never reach the trigger for garbage >>collection, hence you don't garbage collect and hence you run out >of memory. >> >>Can you check whether or not your garbage collection is being triggered? >> >>Anomalously therefore if this is the case, by reducing the heap space you >>can improve performance get rid of the out of memory errors. >> >>Cheers >>Pete Lewis >> >>----- Original Message - >>From: "Daniel Taurat" <[EMAIL PROTECTED]> >>To: "Lucene Users List" <[EMAIL PROTECTED]> >>Sent: Friday, September 10, 2004 1:10 PM >>Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large >number of >>documents >> >> >> >> >>>Daniel Aber schrieb: >>> >>> >>> >>>>On Thursday 09 September 2004 19:47, Daniel Taurat wrote: >>>> >>>> >>>> >>>> >>>> >>>>>I am facing an out of memory problem using Lucene 1.4.1. >>>>> >>>>> >>>>> >>>>> >>>>Could you try with a recent CVS version? There has been a fix >about files >>>>not being deleted after 1.4.1. Not sure if that could cause the problems >>>>you're experiencing. >>>> >>>>Regards >>>>Daniel >>>> >>>> >>>> >>>> >>>> >>>Well, it seems not to be files, it looks more like those SegmentTermEnum >>>objects accumulating in memory. >>>#I've seen some discussion on these objects in the developer-newsgroup >>>that had taken place some time ago. >>>I am afraid this is some kind of runaway caching I have to deal with. >>>Maybe not correctly addressed in this newsgroup, after all... >>> >>>Anyway: any idea if there is an API command to re-init caches? >>> >>>Thanks, >>> >>>Daniel >>> >>> >>> >>>- >>>To unsubscribe, e-mail: [EMAIL PROTECTED] >>>For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> >>> >> >> >>- >>To unsubscribe, e-mail: [EMAIL PROTECTED] >>For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> > > > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
> I can say that gc is not collecting these objects since I forced gc > runs when indexing every now and then (when parsing pdf-type objects, > that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi Pete, good hint, but we actually do have physical memory of 4Gb on the system. But then: we also have experienced that the gc of ibm jdk1.3.1 that we use is sometimes behaving strangely with too large heap space anyway. (Limit seems to be 1.2 Gb) I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. regards, Daniel Pete Lewis wrote: Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi all Reading the thread with interest, there is another way I've come across out of memory errors when indexing large batches of documents. If you have your heap space settings too high, then you get swapping (which impacts performance) plus you never reach the trigger for garbage collection, hence you don't garbage collect and hence you run out of memory. Can you check whether or not your garbage collection is being triggered? Anomalously therefore if this is the case, by reducing the heap space you can improve performance get rid of the out of memory errors. Cheers Pete Lewis - Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, September 10, 2004 1:10 PM Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents > Daniel Aber schrieb: > > >On Thursday 09 September 2004 19:47, Daniel Taurat wrote: > > > > > > > >>I am facing an out of memory problem using Lucene 1.4.1. > >> > >> > > > >Could you try with a recent CVS version? There has been a fix about files > >not being deleted after 1.4.1. Not sure if that could cause the problems > >you're experiencing. > > > >Regards > > Daniel > > > > > > > Well, it seems not to be files, it looks more like those SegmentTermEnum > objects accumulating in memory. > #I've seen some discussion on these objects in the developer-newsgroup > that had taken place some time ago. > I am afraid this is some kind of runaway caching I have to deal with. > Maybe not correctly addressed in this newsgroup, after all... > > Anyway: any idea if there is an API command to re-init caches? > > Thanks, > > Daniel > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
Daniel Aber schrieb: On Thursday 09 September 2004 19:47, Daniel Taurat wrote: I am facing an out of memory problem using Lucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel Well, it seems not to be files, it looks more like those SegmentTermEnum objects accumulating in memory. #I've seen some discussion on these objects in the developer-newsgroup that had taken place some time ago. I am afraid this is some kind of runaway caching I have to deal with. Maybe not correctly addressed in this newsgroup, after all... Anyway: any idea if there is an API command to re-init caches? Thanks, Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents
On Thursday 09 September 2004 19:47, Daniel Taurat wrote: > I am facing an out of memory problem using ÂLucene 1.4.1. Could you try with a recent CVS version? There has been a fix about files not being deleted after 1.4.1. Not sure if that could cause the problems you're experiencing. Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Out of memory in lucene 1.4.1 when re-indexing large number of documents
Hi, I am facing an out of memory problem using Lucene 1.4.1. I am re-indexing a pretty large number ( about 30.000 ) of documents. I identify old instances by checking for a unique ID field, delete those with indexReader.delete() and add the new document version. HeapDump says I am having a huge number of HashMaps with SegmentTermEnum objects (256891) . IndexReader is closed directly after delete(term)... Seems to me that this did not happen with version1.2 (same number of objects and all...). Has anyone an idea how I get these "hanging" objects? Or what to do in order to avoid them? Thanks Daniel - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]