Has anyone else encountered recent problems with FlateFilter and OutOfMemoryError? Is there any way to trap the problem before it results in an OutOfMemoryError?
Thanks
Doug

On Mon, Apr 1, 2013 at 2:13 PM, Doug Sackin <[email protected]> wrote:
> I appear to have something similar to the bug identified and fixed in
> PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.
>
> I'm doing text extraction through Twister Data Framework using Tika 1.2,
> which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
> is JDK 1.6.0_37.
>
> The offending exception is below:
>
> Caused by: java.lang.OutOfMemoryError
>     at java.util.zip.Inflater.inflateBytes(Native Method)
>     at java.util.zip.Inflater.inflate(Inflater.java:238)
>     at java.util.zip.Inflater.inflate(Inflater.java:256)
>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>     at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>     at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>
> Before that, I have a long string of exceptions from PDFBox attempts on
> PDF files, interspersed with "FlateFilter: stop reading corrupt stream due
> to a DataFormatException". These are in the attached log file.
>
> The other exceptions are IndexOutOfBounds, ClassCastException,
> NegativeArraySizeException, NullPointerException, IOException (regarding
> font(COSName{F2}) in map{}), and IllegalArgumentException. These may or
> may not be related (the exceptions appear on different files), but I
> wonder if they corrupted the stream sufficiently that PDFBox attempted to
> inflate corrupt data.
>
> If it is the same issue, it was reported to be fixed in 0.8. If it is a
> new issue, is it possible to fix it? I cannot provide any of the source
> PDF files (client data), but I am attaching the log output containing all
> of the exception traces, including the final OutOfMemoryError.
>
> Thanks for any insights.
>
> Doug
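On the question of trapping the error: one possible stopgap, sketched here and not taken from PDFBox or Tika, is to catch OutOfMemoryError around each per-file extraction call so that one bad PDF does not abort the whole run. The class GuardedExtraction, its Result type, and the run method are hypothetical names made up for this illustration; the Callable would wrap whatever call actually performs the extraction. Catching OutOfMemoryError is generally discouraged because the JVM may be left short of memory, so this is a fallback for skipping and logging the offending file, not a fix for the underlying decode problem. The sketch avoids lambdas so it compiles on JDK 1.6.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class GuardedExtraction {

    /** Outcome of one extraction attempt: the text, or the failure that stopped it. */
    static final class Result {
        final String text;       // null when the attempt failed
        final Throwable failure; // null when the attempt succeeded

        Result(String text, Throwable failure) {
            this.text = text;
            this.failure = failure;
        }

        boolean ok() {
            return failure == null;
        }
    }

    /**
     * Runs one extraction task and converts an OutOfMemoryError or any
     * Exception into a failed Result instead of letting it propagate.
     */
    static Result run(Callable<String> task) {
        try {
            return new Result(task.call(), null);
        } catch (OutOfMemoryError e) {
            // Assumption: the caller prefers to skip this file and continue.
            return new Result(null, e);
        } catch (Exception e) {
            // Covers the checked and runtime exceptions a parser may throw.
            return new Result(null, e);
        }
    }

    public static void main(String[] args) {
        // Stand-in task that simulates the failure; a real task would call the parser.
        Result bad = run(new Callable<String>() {
            public String call() {
                throw new OutOfMemoryError("simulated");
            }
        });
        System.out.println("extraction ok: " + bad.ok());
    }
}
```

In a batch job, the caller would check ok() after each file, log the failure along with the file name, and move on to the next document.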

