Re: Number of documents

2004-12-20 Thread Erik Hatcher
On Dec 20, 2004, at 4:08 AM, Daniel Cortes wrote:
I have to show my boss whether Lucene is the best option for building a 
search engine for a new portal.
I want to know how many documents you have in your index,
and how big your database is.
I highly recommend you use Luke to examine the index.  It is a great 
tool to have handy.  It shows these statistics and many others.
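
For reference, the same basic numbers Luke displays can also be read 
programmatically. A minimal sketch against the Lucene 1.4 API ("index" is 
a placeholder path to an existing index directory):

  import org.apache.lucene.index.IndexReader;

  public class IndexStats {
    public static void main(String[] args) throws Exception {
      IndexReader reader = IndexReader.open("index");
      System.out.println("documents: " + reader.numDocs()); // live documents
      System.out.println("maxDoc:    " + reader.maxDoc());  // includes deletions not yet merged away
      reader.close();
    }
  }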

The formats the portal has to support are HTML, JSP, TXT, DOC, PDF, 
and PPT.
HTML, TXT, DOC, and PDF are all quite straightforward to do.  PPT is 
possible, perhaps POI will do the trick.  JSP depends on how you want 
to analyze it.  If any text in the file should be indexed (including 
JSP directives, taglibs, and HTML) then you can treat it as a text 
file.  If you need to eliminate the tags then you'll need to parse the 
JSP somehow; however, I strongly recommend that content not reside in 
JSP pages but rather in a content management system, database, or the like.
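
If you do decide to index JSP files as text after stripping the markup, a 
crude approach is a regular-expression tag strip before building the 
Document. This is only a sketch (the file name and field names are 
placeholders, and a real JSP/HTML parser is more robust than a regex):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;

  public class IndexJspAsText {
    public static void main(String[] args) throws IOException {
      // read the whole file as plain text ("page.jsp" is a placeholder)
      BufferedReader in = new BufferedReader(new FileReader("page.jsp"));
      StringBuffer buf = new StringBuffer();
      String line;
      while ((line = in.readLine()) != null) {
        buf.append(line).append("\n");
      }
      in.close();

      // crude tag removal; directives, taglibs and HTML tags all disappear
      String contents = buf.toString().replaceAll("<[^>]*>", " ");

      Document doc = new Document();
      doc.add(Field.Keyword("path", "page.jsp"));    // stored, not analyzed
      doc.add(Field.UnStored("contents", contents)); // analyzed, not stored

      IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
      writer.addDocument(doc);
      writer.close();
    }
  }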

Another question I have:
I'm playing with the files from the book Lucene in Action and trying to 
use the document-handling example. The data folder contains 5 files, and 
the created index contains five documents, but the only one that contains 
any words in the index is the .html file.
Does everybody get the same result?
Perhaps you are taking the output you see from "ant 
ExtensionFileHandler" as an indication of what words were indexed.  
This output, however, is showing Document.toString() which only shows 
the text in stored fields.  This particular example does not actually 
index the documents - it shows the generalized handling framework and 
the parsing of the files into a Lucene Document.  Most of the file 
handlers use unstored fields.  The output I get is shown below.  The 
handlers have successfully extracted the text from the files.  Maybe 
you're referring to the FileIndexer example?  We did not expose this 
one to the Ant launcher.  If FileIndexer is the code you're trying, let 
me know what you've tried and how you're looking for the words that you 
expect to see.  Again, most of the fields are unstored (meaning the 
original content is not stored in the index, only the terms extracted 
through analysis).

Erik
# to make the output cleaner for e-mailing I set ANT_ARGS like this:
% echo $ANT_ARGS
-logger org.apache.tools.ant.NoBannerLogger -emacs -Dnopause=true
% ant ExtensionFileHandler 
-Dfile=src/lia/handlingtypes/data/addressbook-entry.xml
Buildfile: build.xml

ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger 
(org.apache.commons.digester.Digester.sax).
log4j:WARN Please initialize the log4j system properly.
Document Keyword 
Keyword Keyword 
Keyword Keyword Keyword 
Keyword>

% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/HTML.html
Buildfile: build.xml
ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document Text>

% ant ExtensionFileHandler 
-Dfile=src/lia/handlingtypes/data/PlainText.txt
Buildfile: build.xml

ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
Document>
% ant ExtensionFileHandler -Dfile=src/lia/handlingtypes/data/PDF.pdf
Buildfile: build.xml
ExtensionFileHandler:
  This example demonstrates the file extension document handler.
  Documents with extensions .xml, .rtf, .doc, .pdf, .html, and .txt 
are
  all handled by the framework.  The contents of the Lucene Document
  built for the specified file is displayed.

skipping input as property nopause has already been set.
skipping input as property file has already been set.
Running lia.handlingtypes.framework.ExtensionFileHandler...
log4j:WARN No appenders could be found for logger 
(org.pdfbox.pdfparser.PDFParser).
log4j:WARN Please initialize the log4j system properly.
Document>

% ant Extensio
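
The stored-versus-unstored distinction described above is easy to see in 
isolation. A small sketch with made-up field names, using the Lucene 1.4 
field constructors:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class ToStringDemo {
    public static void main(String[] args) {
      Document doc = new Document();
      doc.add(Field.Text("title", "An example"));       // stored and indexed
      doc.add(Field.UnStored("contents", "body text")); // indexed only, not stored

      // Per the explanation above, only the stored field's text shows up
      // here; "contents" is searchable but its original text is not kept.
      System.out.println(doc);
    }
  }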

Number of documents

2004-12-20 Thread Daniel Cortes
I have to show my boss whether Lucene is the best option for building a 
search engine for a new portal.
I want to know how many documents you have in your index,
and how big your database is.
The formats the portal has to support are HTML, JSP, TXT, DOC, PDF, 
and PPT.

Another question I have:
I'm playing with the files from the book Lucene in Action and trying to 
use the document-handling example. The data folder contains 5 files, and 
the created index contains five documents, but the only one that contains 
any words in the index is the .html file.
Does everybody get the same result?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Number of documents to be optimized

2004-11-12 Thread Ravi
How do I know the number of documents still to be optimized (that is, if
I have one large index, the number of documents that are in the other
segments) at any time?

Thanks in advance,
Ravi. 
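
Lucene 1.4 has no public API that reports per-segment document counts 
directly, but two rough indicators can be read without touching internals. 
A sketch with a placeholder index path:

  import java.io.File;
  import org.apache.lucene.index.IndexReader;

  public class OptimizePending {
    public static void main(String[] args) throws Exception {
      String path = "index"; // placeholder

      IndexReader reader = IndexReader.open(path);
      // deleted documents that will only disappear once segments are merged
      System.out.println("deletions pending merge: "
          + (reader.maxDoc() - reader.numDocs()));
      reader.close();

      // heuristic: with the compound file format (the 1.4 default) each
      // segment is a single .cfs file, so the count shows how fragmented
      // the index is (it drops to 1 after a full optimize()).
      File[] files = new File(path).listFiles();
      int segments = 0;
      for (int i = 0; i < files.length; i++) {
        if (files[i].getName().endsWith(".cfs")) {
          segments++;
        }
      }
      System.out.println("segment (.cfs) files: " + segments);
    }
  }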



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Large number of documents

2004-10-26 Thread Otis Gospodnetic
Hello Gard,

This is certainly doable; it just depends on your hardware, the complexity
of your queries, the query frequency, and such.  There is a benchmark page
on the Lucene site that you may want to check to get some ideas.

Otis



--- Gard Arneson Haugen <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I have just started looking at Lucene and am not an experienced user of
> Java, but from what I've been reading this search tool should manage
> large amounts of documents.
> 
> I'm wondering if anyone has experience using Lucene on a large number
> of documents. I need to be able to index and search through 20-30
> million documents of around 8 KB each. They are all simple text
> documents with some attributes to restrict the search results on.
> 
> Any feedback would be appreciated.
> 
> Best regards,
> Gard Arneson Haugen
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Large number of documents

2004-10-26 Thread Gard Arneson Haugen
Hi,
I have just started looking at Lucene and am not an experienced user of 
Java, but from what I've been reading this search tool should manage 
large amounts of documents.

I'm wondering if anyone has experience using Lucene on a large number of 
documents. I need to be able to index and search through 20-30 million 
documents of around 8 KB each. They are all simple text documents 
with some attributes to restrict the search results on.

Any feedback would be appreciated.
Best regards,
Gard Arneson Haugen
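
Attributes used to restrict results are usually indexed as untokenized 
keyword fields and applied as a filter at search time. A minimal sketch 
against the Lucene 1.4 API (field names, values and the index path are 
made up for illustration):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Filter;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.QueryFilter;
  import org.apache.lucene.search.TermQuery;

  public class AttributeFilterDemo {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
      Document doc = new Document();
      doc.add(Field.UnStored("contents", "a simple text document"));
      doc.add(Field.Keyword("lang", "en"));     // attribute to restrict on
      doc.add(Field.Keyword("docId", "doc-1")); // stored identifier
      writer.addDocument(doc);
      writer.close();

      IndexSearcher searcher = new IndexSearcher("index");
      Query query = QueryParser.parse("simple text", "contents", new StandardAnalyzer());
      // only documents whose attribute matches pass the filter
      Filter filter = new QueryFilter(new TermQuery(new Term("lang", "en")));
      Hits hits = searcher.search(query, filter);
      System.out.println("hits: " + hits.length());
      searcher.close();
    }
  }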

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Okay, the reference test is done:
on JDK 1.4.2, Lucene 1.4.1 really seems to run fine: just a moderate 
number of SegmentTermEnums, kept in check by gc (about 500 for the 
1900 test objects).

Daniel Taurat wrote:
Hi Doug,
you are absolutely right about the older version of the JDK: it is 
1.3.1 (ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 
4 environment.
Results:
I patched the Lucene1.4.1:
it has improved not much: after indexing 1897 Objects  the number of 
SegmentTermEnum is up to 17936.
To be realistic: This is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but 
this can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i 
created was to after every 200K, documents create a temp 
directory, and merge them together, this was done to do the first 
production run, updates are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Ab

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread John Moylan
IBM JDK1.4.2 should work fine. AFAIK JDK1.3.1 is usable if you disable JIT.
John
Daniel Taurat wrote:
Hi Doug,
you are absolutely right about the older version of the JDK: it is 1.3.1 
(ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 
environment.
Results:
I patched the Lucene1.4.1:
it has improved not much: after indexing 1897 Objects  the number of 
SegmentTermEnum is up to 17936.
To be realistic: This is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but this 
can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and 
merge them together, this was done to do the first production run, 
updates are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memor

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-13 Thread Daniel Taurat
Hi Doug,
you are absolutely right about the older version of the JDK: it is 1.3.1 
(ibm).
Unfortunately we cannot upgrade since we are bound to IBM Portalserver 4 
environment.
Results:
I patched Lucene 1.4.1:
it has not improved much: after indexing 1897 objects the number of 
SegmentTermEnum instances is up to 17936.
To be realistic: this is even a deterioration :(((
My next check will be with a JDK1.4.2 for the test environment, but this 
can only be a reference run for now.

Thanks,
Daniel
Doug Cutting wrote:
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess 
is that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it 
works for you.

Doug
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before 
FieldCache was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and 
merge them together, this was done to do the first production run, 
updates are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems 
to be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced 
gc runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memory problem using  Lucene 1.4.1.

Could you try with a recent CVS version? There has been

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Kevin A. Burton
Daniel Taurat wrote:
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
Depends on what OS and with what patches...
Linux on i386 seems to have a physical limit of 1.7G (256M for VM) ... 
There are some patches to apply to get 3G but only on really modern kernels.

I just need to get Athlon systems :-/
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Doug Cutting
It sounds like the ThreadLocal in TermInfosReader is not getting 
correctly garbage collected when the TermInfosReader is collected. 
Researching a bit, this was a bug in JVMs prior to 1.4.2, so my guess is 
that you're running in an older JVM.  Is that right?

I've attached a patch which should fix this.  Please tell me if it works 
for you.

Doug
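
The patch itself is an attachment and is not reproduced in this archive. 
Purely as an illustration of the general pattern Doug describes, and not 
the actual Lucene fix: a per-thread cache whose entry is cleared when its 
owner is closed, so that pre-1.4.2 JVMs can reclaim the cached value.

  public class CachedResource {
    // per-thread cache, analogous to TermInfosReader's per-thread enumerator
    private final ThreadLocal cache = new ThreadLocal();

    Object get() {
      Object value = cache.get();
      if (value == null) {
        value = new Object(); // stand-in for the expensive per-thread object
        cache.set(value);
      }
      return value;
    }

    void close() {
      // On JVMs prior to 1.4.2 the ThreadLocal entry may otherwise never be
      // collected; nulling it out lets the cached value be reclaimed.
      cache.set(null);
    }
  }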
Daniel Taurat wrote:
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and merge 
them together, this was done to do the first production run, updates 
are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memory problem using  Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.

Regards
Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not correctly addressed in this newsgroup, after all...
Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Okay, that (1.4rc3)worked fine, too!
Got only 257 SegmentTermEnums for 1900 objects.
Now I will go for the final test on the production server with the 
1.4rc3 version  and about 40.000 objects.

Daniel
Daniel Taurat schrieb:
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after 
indexing my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all
 I had a similar problem, i have  database of documents with 24 
fields, and a average content of 7K, with  16M+ records

 i had to split the jobs into slabs of 1M each and merging the 
resulting indexes, submissions to our job queue looked like

 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created 
was to after every 200K, documents create a temp directory, and merge 
them together, this was done to do the first production run, updates 
are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at 
org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at 
org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at 
org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)

 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
number
of documents

Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm 
jdk1.3.1 that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to 
be 1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type 
objects, that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memory problem using  Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.

Regards
Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAI

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Hi all,
here is some update for you:
I switched back to Lucene 1.3-final and now the  number of the  
SegmentTermEnum objects is controlled by gc again:
it goes up to about 1000 and then it is down again to 254 after indexing 
my 1900 test-objects.
Stay tuned, I will try 1.4RC3 now, the last version before FieldCache 
was introduced...

Daniel
Rupinder Singh Mazara schrieb:
hi all 

 I had a similar problem, i have  database of documents with 24 fields, and a average 
content of 7K, with  16M+ records
 i had to split the jobs into slabs of 1M each and merging the resulting indexes, 
submissions to our job queue looked like
 java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22
 
and i still had outofmemory exception , the solution that i created was to after every 200K, documents create a temp directory, and merge them together, this was done to do the first production run, updates are now being handled incrementally

 

Exception in thread "main" java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
 

-Original Message-
From: Daniel Taurat [mailto:[EMAIL PROTECTED]
Sent: 10 September 2004 14:42
To: Lucene Users List
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number
of documents
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to be 
1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type objects, 
that is): No effect.

regards,
Daniel
Pete Lewis wrote:

Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
- Original Message - From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

I am facing an out of memory problem using  Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.

Regards
Daniel

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAI

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
The parser is PDFBox. PDF is about 25% of the overall indexing volume 
on the production system. I also have Word docs and loads of HTML 
resources to be indexed.
In my testing environment I have merely 5 PDF docs and still those 
permanent objects hanging around, though.
Cheers,
Daniel

Ben Litchfield wrote:
I can say that gc is not collecting these objects since I  forced gc
runs when indexing every now and then (when parsing pdf-type objects,
that is): No effect.
   

What PDF parser are you using? Is the problem within the parser and not
lucene? Are you releasing all resources?
Ben
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
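
On the question of releasing parser resources raised above: the usual 
precaution with PDFBox is to close the PDDocument in a finally block so 
the parser's temporary resources are freed even when extraction fails. 
This is a sketch only, assuming a PDFBox release of that era where 
PDDocument.load and PDFTextStripper.getText are available; exact 
signatures may differ between versions:

  import org.pdfbox.pdmodel.PDDocument;
  import org.pdfbox.util.PDFTextStripper;

  public class PdfTextExtractor {
    public static String extract(String path) throws Exception {
      PDDocument document = null;
      try {
        document = PDDocument.load(path);
        return new PDFTextStripper().getText(document);
      } finally {
        // free the parser's resources even if text extraction failed
        if (document != null) {
          document.close();
        }
      }
    }
  }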


RE: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Rupinder Singh Mazara


Hi all,

I had a similar problem: I have a database of documents with 24 fields and 
an average content of 7K, with 16M+ records.

I had to split the job into slabs of 1M each and merge the resulting 
indexes. Submissions to our job queue looked like

  java -Xms100M -Xcompactexplicitgc -cp $CLASSPATH lucene.Indexer 22

and I still had OutOfMemory exceptions. The solution I came up with was to 
create a temp directory after every 200K documents and merge them together. 
This was done for the first production run; updates are now being handled 
incrementally.
 
  

Exception in thread "main" java.lang.OutOfMemoryError
at org.apache.lucene.store.RAMOutputStream.flushBuffer(RAMOutputStream.java(Compiled 
Code))
at org.apache.lucene.store.OutputStream.flush(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeByte(OutputStream.java(Inlined 
Compiled Code))
at org.apache.lucene.store.OutputStream.writeBytes(OutputStream.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.copyFile(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java(Compiled 
Code))
at 
org.apache.lucene.index.SegmentMerger.createCompoundFile(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java(Compiled 
Code))
at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:366)
at lucene.Indexer.doIndex(CDBIndexer.java(Compiled Code))
at lucene.Indexer.main(CDBIndexer.java:168)
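
For the slab-and-merge approach described above, the Lucene 1.4 API already 
provides the merge step. A sketch with placeholder directory names, not the 
poster's actual code:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeSlabs {
    public static void main(String[] args) throws Exception {
      // sub-indexes written by the separate slab jobs (placeholder paths)
      Directory[] slabs = new Directory[] {
        FSDirectory.getDirectory("index-slab-0", false),
        FSDirectory.getDirectory("index-slab-1", false)
      };

      IndexWriter writer = new IndexWriter("index-merged", new StandardAnalyzer(), true);
      writer.addIndexes(slabs); // merges all segments of the sub-indexes into this index
      writer.close();
    }
  }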

>-Original Message-
>From: Daniel Taurat [mailto:[EMAIL PROTECTED]
>Sent: 10 September 2004 14:42
>To: Lucene Users List
>Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number
>of documents
>
>
>Hi Pete,
>good hint, but we actually do have physical memory of  4Gb on the 
>system. But then: we also have experienced that the gc of ibm jdk1.3.1 
>that we use is sometimes
>behaving strangely with too large heap space anyway. (Limit seems to be 
>1.2 Gb)
>I can say that gc is not collecting these objects since I  forced gc 
>runs when indexing every now and then (when parsing pdf-type objects, 
>that is): No effect.
>
>regards,
>
>Daniel
>
>
>Pete Lewis wrote:
>
>>Hi all
>>
>>Reading the thread with interest, there is another way I've come 
>across out
>>of memory errors when indexing large batches of documents.
>>
>>If you have your heap space settings too high, then you get 
>swapping (which
>>impacts performance) plus you never reach the trigger for garbage
>>collection, hence you don't garbage collect and hence you run out 
>of memory.
>>
>>Can you check whether or not your garbage collection is being triggered?
>>
>>Anomalously therefore if this is the case, by reducing the heap space you
>>can improve performance get rid of the out of memory errors.
>>
>>Cheers
>>Pete Lewis
>>
>>- Original Message - 
>>From: "Daniel Taurat" <[EMAIL PROTECTED]>
>>To: "Lucene Users List" <[EMAIL PROTECTED]>
>>Sent: Friday, September 10, 2004 1:10 PM
>>Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large 
>number of
>>documents
>>
>>
>>  
>>
>>>Daniel Aber schrieb:
>>>
>>>
>>>
>>>>On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
>>>>
>>>>
>>>>
>>>>  
>>>>
>>>>>I am facing an out of memory problem using  Lucene 1.4.1.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>Could you try with a recent CVS version? There has been a fix 
>about files
>>>>not being deleted after 1.4.1. Not sure if that could cause the problems
>>>>you're experiencing.
>>>>
>>>>Regards
>>>>Daniel
>>>>
>>>>
>>>>
>>>>  
>>>>
>>>Well, it seems not to be files, it looks more like those SegmentTermEnum
>>>objects accumulating in memory.
>>>#I've seen some discussion on these objects in the developer-newsgroup
>>>that had taken place some time ago.
>>>I am afraid this is some kind of runaway caching I have to deal with.
>>>Maybe not  correctly addressed in this newsgroup, after all...
>>>
>>>Anyway: any idea if there is an API command to re-init caches?
>>>
>>>Thanks,
>>>
>>>Daniel
>>>
>>>
>>>
>>>-
>>>To unsubscribe, e-mail: [EMAIL PROTECTED]
>>>For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>
>>
>>-
>>To unsubscribe, e-mail: [EMAIL PROTECTED]
>>For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>  
>>
>
>
>
>-
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Ben Litchfield
> I can say that gc is not collecting these objects since I  forced gc
> runs when indexing every now and then (when parsing pdf-type objects,
> that is): No effect.

What PDF parser are you using?  Is the problem within the parser and not
lucene?  Are you releasing all resources?

Ben

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Hi Pete,
good hint, but we actually do have physical memory of  4Gb on the 
system. But then: we also have experienced that the gc of ibm jdk1.3.1 
that we use is sometimes
behaving strangely with too large heap space anyway. (Limit seems to be 
1.2 Gb)
I can say that gc is not collecting these objects since I  forced gc 
runs when indexing every now and then (when parsing pdf-type objects, 
that is): No effect.

regards,
Daniel
Pete Lewis wrote:
Hi all
Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.
If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.
Can you check whether or not your garbage collection is being triggered?
Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.
Cheers
Pete Lewis
- Original Message - 
From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents

 

Daniel Aber schrieb:
   

On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

 

I am facing an out of memory problem using  Lucene 1.4.1.
   

Could you try with a recent CVS version? There has been a fix about files
not being deleted after 1.4.1. Not sure if that could cause the problems
you're experiencing.
Regards
Daniel

 

Well, it seems not to be files, it looks more like those SegmentTermEnum
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...
Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Pete Lewis
Hi all

Reading the thread with interest, there is another way I've come across out
of memory errors when indexing large batches of documents.

If you have your heap space settings too high, then you get swapping (which
impacts performance) plus you never reach the trigger for garbage
collection, hence you don't garbage collect and hence you run out of memory.

Can you check whether or not your garbage collection is being triggered?

Anomalously therefore if this is the case, by reducing the heap space you
can improve performance get rid of the out of memory errors.

Cheers
Pete Lewis
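
One way to see whether garbage collection is being triggered at all is to 
run the indexing job with GC logging switched on. The heap sizes below are 
only illustrative; -verbose:gc is the Sun JVM spelling, and the IBM JDK 
discussed earlier in this thread uses -verbosegc instead:

  java -verbose:gc -Xms100M -Xmx512M -cp $CLASSPATH lucene.Indexer 22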

- Original Message - 
From: "Daniel Taurat" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, September 10, 2004 1:10 PM
Subject: Re: Out of memory in lucene 1.4.1 when re-indexing large number of
documents


> Daniel Aber schrieb:
>
> >On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
> >
> >
> >
> >>I am facing an out of memory problem using  Lucene 1.4.1.
> >>
> >>
> >
> >Could you try with a recent CVS version? There has been a fix about files
> >not being deleted after 1.4.1. Not sure if that could cause the problems
> >you're experiencing.
> >
> >Regards
> > Daniel
> >
> >
> >
> Well, it seems not to be files, it looks more like those SegmentTermEnum
> objects accumulating in memory.
> #I've seen some discussion on these objects in the developer-newsgroup
> that had taken place some time ago.
> I am afraid this is some kind of runaway caching I have to deal with.
> Maybe not  correctly addressed in this newsgroup, after all...
>
> Anyway: any idea if there is an API command to re-init caches?
>
> Thanks,
>
> Daniel
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Daniel Taurat
Daniel Aber schrieb:
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:
 

I am facing an out of memory problem using  Lucene 1.4.1.
   

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
Daniel
 

Well, it seems not to be files, it looks more like those SegmentTermEnum 
objects accumulating in memory.
#I've seen some discussion on these objects in the developer-newsgroup 
that had taken place some time ago.
I am afraid this is some kind of runaway caching I have to deal with.
Maybe not  correctly addressed in this newsgroup, after all...

Anyway: any idea if there is an API command to re-init caches?
Thanks,
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Naber
On Thursday 09 September 2004 19:47, Daniel Taurat wrote:

> I am facing an out of memory problem using Lucene 1.4.1.

Could you try with a recent CVS version? There has been a fix about files 
not being deleted after 1.4.1. Not sure if that could cause the problems 
you're experiencing.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-09 Thread Daniel Taurat
Hi,
I am facing an out of memory problem using  Lucene 1.4.1.
I am  re-indexing a pretty large number ( about 30.000 ) of documents.
I identify old instances by checking for a unique ID field, delete those 
with indexReader.delete() and add the new document version.

HeapDump says I am having  a huge number of HashMaps with 
SegmentTermEnum objects (256891) .

IndexReader is closed directly after delete(term)...
Seems to me that this did not happen with version1.2 (same number of 
objects and  all...).
Does anyone have an idea why I get these "hanging" objects, or what to do 
in order to avoid them?

Thanks
Daniel
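
For reference, the delete-then-add update pattern described above, sketched 
against the Lucene 1.4 API; "uid" and "contents" are placeholder field 
names, not the poster's schema:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.Term;

  public class Reindexer {
    public static void update(String indexPath, String uid, String contents) throws Exception {
      // 1. delete the old version, identified by its unique ID field
      IndexReader reader = IndexReader.open(indexPath);
      reader.delete(new Term("uid", uid));
      reader.close();

      // 2. add the new version, appending to the existing index (create = false)
      IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
      Document doc = new Document();
      doc.add(Field.Keyword("uid", uid));            // stored, untokenized ID
      doc.add(Field.UnStored("contents", contents)); // indexed only
      writer.addDocument(doc);
      writer.close();
    }
  }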
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]