Hi fellow jackrabbit users,
While reindexing an entire Jackrabbit 1.4 repository I run into an
OutOfMemoryError. With Sun JRE 6 I get the following stack trace (Java 5
doesn't give one):
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2734)
at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
at java.util.ArrayList.add(ArrayList.java:351)
at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
This is very unexpected, because memory usage stays between 128 and 256
MB and the maximum heap size is set to 1.3 GB. Plenty of system memory
is available as well.
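One thing that could be checked is whether the heap limit really reaches the JVM that runs the reindex. The following is only a minimal sketch: the class name HeapCheck and its two methods are made up for illustration and are not part of Jackrabbit or PDFBox. It uses only the standard java.lang.management API (available since Java 5) and would have to run inside the same JVM as the repository to be meaningful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of Jackrabbit or PDFBox.
public class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Returns a one-line summary of the heap as seen by the running JVM. */
    static String describeHeap() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the JVM does not define a maximum.
        String max = heap.getMax() < 0
                ? "undefined"
                : toMegabytes(heap.getMax()) + " MB";
        return "heap used=" + toMegabytes(heap.getUsed()) + " MB"
                + ", committed=" + toMegabytes(heap.getCommitted()) + " MB"
                + ", max=" + max;
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

If the reported maximum is far below the configured 1.3 GB, the -Xmx option is probably not being passed to the process that does the indexing.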
It may be related to:
https://issues.apache.org/jira/browse/PDFBOX-313
Is this resolved in a newer 1.4.x version of Jackrabbit? Our
text-extractors build has the following info in its META-INF
pom.properties file:
#Generated by Maven
#Fri Jan 11 14:40:02 EET 2008
version=1.4
groupId=org.apache.jackrabbit
artifactId=jackrabbit-text-extractors
Regards,
Johannes