Hi fellow jackrabbit users,

When reindexing the entire Jackrabbit 1.4 repository I run into the following problem. With Sun JRE 6 I get the following stack trace (Java 5 doesn't print any):

java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2734)
   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
   at java.util.ArrayList.add(ArrayList.java:351)
   at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
   at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
   at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
   at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
   at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
   at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
   at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
   at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
   at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
   at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
   at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
   at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
   at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
   at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)

This is very unexpected, because memory usage stays between 128 and 256 MB while the maximum heap size is set to 1.3 GB. System memory is also readily available.
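
One thing that could still be checked is whether the configured maximum heap size is actually picked up by the JVM that runs the reindexing. A minimal sketch of such a check, assuming it can be run (or its two methods called) inside that same JVM; the class name HeapCheck is made up for illustration and is not part of Jackrabbit:

```java
public class HeapCheck {

    private static final long MB = 1024L * 1024L;

    /** Upper limit the JVM will try to use for the heap (the effective -Xmx), in MB. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / MB;
    }

    /** Heap currently in use: heap allocated so far minus the free space inside it, in MB. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / MB;
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB):  " + maxHeapMb());
        System.out.println("used heap (MB): " + usedHeapMb());
    }
}
```

If the reported maximum is well below 1.3 GB, the heap setting is not reaching this JVM. If it is close to 1.3 GB, the error would more likely come from a single very large allocation during PDF parsing, which periodic memory monitoring can miss.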

It may be related to:

https://issues.apache.org/jira/browse/PDFBOX-313

Is this resolved in a newer 1.4.x release of Jackrabbit? We have a text-extractors build with the following info in its META-INF pom.properties file:

#Generated by Maven
#Fri Jan 11 14:40:02 EET 2008
version=1.4
groupId=org.apache.jackrabbit
artifactId=jackrabbit-text-extractors

Regards,

Johannes
