Hi,

This is 'fixed' in the 1.5 release by catching any runtime exception
that might be thrown during PDF text extraction. It's not perfect, but
it keeps the system running.
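
A minimal sketch of that kind of guard, for illustration only. The
Extractor interface and the extractSafely helper are hypothetical names
made up for this example; this is not the actual Jackrabbit 1.5 code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SafeExtractionSketch {

    /** Hypothetical stand-in for a text extractor; not the Jackrabbit interface. */
    interface Extractor {
        String extractText(InputStream stream) throws IOException;
    }

    /**
     * Runs the delegate and returns empty text if it fails, so one bad
     * document does not abort the whole indexing run.
     */
    static String extractSafely(Extractor delegate, InputStream stream) {
        try {
            return delegate.extractText(stream);
        } catch (IOException | RuntimeException e) {
            // Note: java.lang.Error subclasses such as OutOfMemoryError are
            // not RuntimeExceptions and are not caught here.
            System.err.println("Text extraction failed, indexing without text: " + e);
            return "";
        }
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        // A well-behaved extractor passes its text through.
        System.out.println(extractSafely(in -> "hello", empty));
        // A failing extractor yields empty text instead of propagating.
        System.out.println(extractSafely(in -> {
            throw new IllegalStateException("corrupt PDF");
        }, empty).isEmpty());
    }
}
```

The node still gets indexed in this sketch, just without the extracted
text of the document that failed.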

regards
 marcel

2009/6/22 Johannes Boneschanscher <[email protected]>
>
> Hi fellow jackrabbit users,
>
> On reindexing the entire Jackrabbit 1.4 repository I get the following
> problem. With Sun JRE 6 I got the following stack trace (Java 5
> doesn't give any):
>
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:2734)
>   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>   at java.util.ArrayList.add(ArrayList.java:351)
>   at org.pdfbox.pdfparser.PDFStreamParser.parse(PDFStreamParser.java:105)
>   at org.pdfbox.cmapparser.CMapParser.parse(CMapParser.java:97)
>   at org.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:326)
>   at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:174)
>   at org.pdfbox.util.PDFTextStripper.showString(PDFTextStripper.java:461)
>   at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:690)
>   at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:128)
>   at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:268)
>   at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:200)
>   at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:172)
>   at org.apache.jackrabbit.extractor.PdfTextExtractor.extractText(PdfTextExtractor.java:75)
>   at org.apache.jackrabbit.extractor.CompositeTextExtractor.extractText(CompositeTextExtractor.java:90)
>   at org.apache.jackrabbit.core.query.lucene.JackrabbitTextExtractor.extractText(JackrabbitTextExtractor.java:195)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addBinaryValue(NodeIndexer.java:393)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.addValue(NodeIndexer.java:282)
>   at org.apache.jackrabbit.core.query.lucene.NodeIndexer.createDoc(NodeIndexer.java:221)
>   at org.apache.jackrabbit.core.query.lucene.SearchIndex.createDocument(SearchIndex.java:861)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:803)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createDocument(MultiIndex.java:818)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex$AddNode.execute(MultiIndex.java:1519)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.executeAndLog(MultiIndex.java:936)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1017)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>   at org.apache.jackrabbit.core.query.lucene.MultiIndex.createIndex(MultiIndex.java:1023)
>
> This is very unexpected because memory usage stays between 128 and 256 MB
> and the maximum heap size is set to 1.3 GB. Also, system memory is
> readily available.
>
> It may be related to:
>
> https://issues.apache.org/jira/browse/PDFBOX-313
>
> Is this resolved in a newer 1.4 version of Jackrabbit? We have a 
> text-extractor build with the following info in the META-INF pom.properties 
> file:
>
> #Generated by Maven
> #Fri Jan 11 14:40:02 EET 2008
> version=1.4
> groupId=org.apache.jackrabbit
> artifactId=jackrabbit-text-extractors
>
> Regards,
>
> Johannes
