Hm,
Sweet. I'll probably backport it to 1.4. I guess that by "catching any
runtime exception" you mean catching any Throwable ;-)
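The distinction matters here because OutOfMemoryError is an Error, not a RuntimeException, so a catch block for RuntimeException lets it through. Below is a minimal sketch of the difference. It is not the actual Jackrabbit 1.5 code: GuardedExtraction, guardRuntimeOnly and guardThrowable are made-up names, the extraction is simulated with a Supplier, and it uses Java 8 syntax for brevity.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names exist in Jackrabbit.
public class GuardedExtraction {

    // Catches RuntimeException only: an Error such as OutOfMemoryError
    // is not a RuntimeException and propagates to the caller.
    static String guardRuntimeOnly(Supplier<String> extraction) {
        try {
            return extraction.get();
        } catch (RuntimeException e) {
            return ""; // index the node without extracted text
        }
    }

    // Catches Throwable: covers both Exception and Error subclasses.
    static String guardThrowable(Supplier<String> extraction) {
        try {
            return extraction.get();
        } catch (Throwable t) {
            return ""; // index the node without extracted text
        }
    }

    public static void main(String[] args) {
        // Simulated failure; no memory is actually exhausted.
        Supplier<String> failing = () -> {
            throw new OutOfMemoryError("simulated");
        };

        System.out.println("[" + guardThrowable(failing) + "]");

        try {
            guardRuntimeOnly(failing);
        } catch (OutOfMemoryError e) {
            System.out.println("escaped: " + e.getMessage());
        }
    }
}
```

Catching Throwable keeps the indexer going, but after a real OutOfMemoryError the JVM may be in a poor state, so this is a workaround rather than a fix.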
Thanks and keep up the great work!
Johannes
Marcel Reutegger wrote:
Hi,
this is 'fixed' in the 1.5 release by catching any runtime exception
that might be thrown during pdf text extraction. it's not perfect, but
it keeps the system running.
regards
marcel
2009/6/22 Johannes Boneschanscher <[email protected]>
Hi fellow jackrabbit users,
While reindexing an entire Jackrabbit 1.4 repository I ran into the following
problem. With Sun JRE 6 I get the following stack trace (Java 5 doesn't give
any):
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2734)
at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
at java.util.ArrayList.add(ArrayList.java:351)
at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
This is very unexpected because heap usage stays between 128 and 256 MB and
the maximum heap size is set to 1.3 GB. System memory is also readily
available.
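One way to double-check that the heap limit is really in effect inside the JVM that does the reindexing is to print the figures from java.lang.Runtime at startup. This is only a sketch: HeapReport and describeHeap are made-up names, not part of Jackrabbit. Note also that the trace fails inside ArrayList growth (Arrays.copyOf), so a single very large allocation may be what runs out even when average usage looks low.

```java
// Hypothetical diagnostic helper; not part of Jackrabbit.
public class HeapReport {

    // Returns the JVM's own view of its heap, in megabytes.
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        long max = rt.maxMemory() / mb;      // upper bound the JVM will try to use
        long total = rt.totalMemory() / mb;  // heap currently reserved
        long used = (rt.totalMemory() - rt.freeMemory()) / mb;
        return "max=" + max + "MB total=" + total + "MB used=" + used + "MB";
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

If the reported max is far below 1.3 GB, the heap setting is probably being applied to a different process than the one running the repository.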
It may be related to:
https://issues.apache.org/jira/browse/PDFBOX-313
Is this resolved in a newer 1.4 version of Jackrabbit? We have a text-extractor
build with the following info in the META-INF pom.properties file:
#Generated by Maven
#Fri Jan 11 14:40:02 EET 2008
version=1.4
groupId=org.apache.jackrabbit
artifactId=jackrabbit-text-extractors
Regards,
Johannes