Hm,

Sweet. I'll probably backport it to 1.4. I guess that by "catching any runtime exception" you mean catching any Throwable ;-)

Thanks and keep up the great work!

Johannes

Marcel Reutegger wrote:
Hi,

this is 'fixed' in the 1.5 release by catching any runtime exception
that might be thrown during pdf text extraction. it's not perfect, but
it keeps the system running.
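
The difference between catching a runtime exception and catching any throwable can be sketched as below. This is only an illustration, not the actual Jackrabbit 1.5 code: the class SafeExtraction, both method names and the simulated failure are hypothetical. It shows that an OutOfMemoryError, being an Error rather than a RuntimeException, is only intercepted by a catch of Throwable.

```java
import java.util.function.Supplier;

// Hypothetical illustration; none of these names come from Jackrabbit or PDFBox.
class SafeExtraction {

    // Catches only RuntimeException. An Error such as OutOfMemoryError
    // is not a RuntimeException, so it propagates to the caller.
    static String extractCatchingRuntime(Supplier<String> extractor) {
        try {
            return extractor.get();
        } catch (RuntimeException e) {
            return ""; // index the document without extracted text
        }
    }

    // Catches Throwable, which covers both exceptions and errors.
    static String extractCatchingThrowable(Supplier<String> extractor) {
        try {
            return extractor.get();
        } catch (Throwable t) {
            return ""; // index the document without extracted text
        }
    }

    public static void main(String[] args) {
        // Simulated extractor that fails the way the PDF parser does in the trace.
        Supplier<String> broken = () -> {
            throw new OutOfMemoryError("simulated");
        };

        System.out.println("throwable variant returned: ["
                + extractCatchingThrowable(broken) + "]");
        try {
            extractCatchingRuntime(broken);
        } catch (OutOfMemoryError e) {
            System.out.println("runtime variant let the error escape: " + e.getMessage());
        }
    }
}
```

Catching Throwable keeps the indexer going, but after a real OutOfMemoryError the JVM may be in a fragile state, so it is a workaround rather than a fix.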

regards
 marcel

2009/6/22 Johannes Boneschanscher <[email protected]>
Hi fellow jackrabbit users,

When reindexing the entire Jackrabbit 1.4 repository I run into the following
problem. With Sun JRE 6 I get the following stack trace (Java 5 doesn't give
any):

java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:2734)
  at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
  at java.util.ArrayList.add(ArrayList.java:351)
  at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
  at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
  at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
  at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
  at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
  at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
  at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
  at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
  at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
  at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
  at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
  at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
  at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
  at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
  at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
  at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
  at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
  at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)

This is very unexpected because memory usage stays between 128 and 256 MB and
the maximum heap size is set to 1.3 GB. System memory is also readily
available.
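
One way to double-check that the heap limit really applies inside the running JVM is to print what the Runtime reports. The sketch below uses only standard java.lang.Runtime calls; the class name HeapReport and its helper methods are made up for illustration.

```java
// Hypothetical helper; prints the heap limits as seen by the running JVM.
class HeapReport {

    // Converts a byte count to whole megabytes (rounded down).
    static long toMb(long bytes) {
        return bytes / (1024L * 1024L);
    }

    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();          // upper bound the JVM will try to use
        long committed = rt.totalMemory();  // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();
        return String.format("max=%d MB, committed=%d MB, used=%d MB",
                toMb(max), toMb(committed), toMb(used));
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the reported maximum is far below the configured value, the heap setting is probably not reaching the JVM that runs the repository.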

It may be related to:

https://issues.apache.org/jira/browse/PDFBOX-313

Is this resolved in a newer 1.4 version of Jackrabbit? Our text-extractors
build has the following info in its META-INF pom.properties file:

#Generated by Maven
#Fri Jan 11 14:40:02 EET 2008
version=1.4
groupId=org.apache.jackrabbit
artifactId=jackrabbit-text-extractors

Regards,

Johannes
