Re: submitting non-public PDFs for bugfixing

Andreas Lehmkühler Wed, 18 Jul 2012 03:56:59 -0700

Hi,



Wolfgang Kronberg <[email protected]> wrote on 17 July 2012 at 18:50:

>
> Hi Maruan,
>
> Thank you for pointing me to the NonSequentialParser. I hadn't noticed
> that one before, and it indeed works much better: I could now extract
> the text from all files except one. That file still displays
> correctly in Adobe Reader, but Adobe Reader warns that an
> embedded font is missing. NonSequentialParser throws this exception:
>
> java.io.IOException: Error reading stream using length value.
> Expected='endstream' actual='H‰tV T”Ç þæ"òXÞ " '
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1327)
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1032)
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:955)
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:929)
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:337)
>         at
> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
>         at
> org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
>         at
> org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
>
> The standard parser does not throw any exception but treats the document as
> empty.
>
> What I don't like about this solution is that I need to provide the PDF
> as a file, not as a stream. In my application that means that I first


Feel free to provide a patch ;-) You are not the only one who stumbled upon
this.


> have to copy my stream to a temporary file. Also, the RandomAccess must
> be fully rebuilt before each use (e.g. with new RandomAccessBuffer()),
> because in some cases it will be closed implicitly, leaving me with a
> NullPointerException on next access... Anyway, all of that is not a
> problem for my current application. So thanks a lot, problem solved!
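
The workaround described above (copy the incoming stream to a temporary
file, then hand that file to the non-sequential parser) can be sketched
with the JDK alone. This is only an illustration under assumptions: the
class and method names (PdfTempFiles, copyToTempFile) are hypothetical,
java.nio.file requires Java 7 or later, and the PDFBox call mentioned in
the comment is taken from the stack trace in this thread, with its
parameter list assumed rather than verified.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PDFBox.
final class PdfTempFiles {
    private PdfTempFiles() {}

    /**
     * Copies the whole stream into a newly created temporary file.
     * The caller owns the returned file and should delete it after parsing.
     */
    static File copyToTempFile(InputStream in) throws IOException {
        File tmp = File.createTempFile("pdfbox-input-", ".pdf");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            tmp.delete(); // do not leave a partial copy behind
            throw e;
        }
    }

    // Assumed usage with PDFBox 1.7 (signature not verified here):
    //   File tmp = PdfTempFiles.copyToTempFile(stream);
    //   PDDocument doc = PDDocument.loadNonSeq(tmp, new RandomAccessBuffer());
    // A fresh RandomAccessBuffer per call sidesteps the implicit close
    // mentioned above.
}
```

Deleting the temporary file in a finally block after the document has
been closed keeps the workaround from leaking files.
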
>
> Nevertheless, perhaps some PDFBox developer is still interested in
> getting the (now three) PDFs from me which exhibit PDFBox bugs? If so,
> please drop me a note! :)


Send those PDFs to me. I'll have a look and can share them with the other devs
if necessary.


>
> Best Regards,
> Wolfgang


BR
Andreas Lehmkühler


> >
> On 17.07.2012 16:53, Maruan Sahyoun wrote:
> > Hello Wolfgang,
> >
> > Did you try the NonSequentialParser? It is a new addition in 1.7 that
> > improves the parsing of PDF documents. See
> > https://issues.apache.org/jira/browse/PDFBOX-1199 for details.
> >
> > With kind regards
> >
> > Maruan
> >
> >
> > On 17.07.2012 at 16:09, Wolfgang Kronberg wrote:
> >
> >>
> >> Hello,
> >>
> >> I have recently converted some 2500 PDF files to text using PDFBox
> >> 1.7.0. While doing so, I ran into two problems on a minority of the PDF
> >> files (some 5% are affected for each problem). Usually, I would now file
> >> a bug and attach a sample PDF so that the problem can be reproduced.
> >>
> >> However, the PDFs in question are not public, and I am not entitled to
> >> publish them. Is there anyone to whom I could mail two
> >> affected PDF files, so that that person could nail down the actual bug
> >> for a good bug description while keeping the files themselves confidential?
> >>
> >> In either case, here is what I see. In all cases, the affected document can
> >> be displayed with no problems in Adobe Reader.
> >>
> >> Problem 1: The document is parsed as empty (no pages), although it in
> >> fact contains more than 50 pages full of text. Running PDFDebugger on this
> >> document produces this output (WARNUNG = WARNING):
> >> 17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver
> >> setStartxref
> >> WARNUNG: Did not found XRef object at specified startxref position 116
> >>
> >> Problem 2: On attempting to parse the document, I get an IOException.
> >> PDFDebugger outputs the following on this document (SCHWERWIEGEND =
> >> SEVERE):
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a
> >> DataFormatException
> >> PDFDebugger failed with the following exception:
> >> java.io.IOException
> >>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
> >>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
> >>        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
> >>        at
> >> org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
> >>        at
> >> org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
> >>        at
> >> org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
> >>        at
> >> org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
> >>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> >>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
> >>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
> >>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
> >>        at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
> >>        at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
> >>        at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
> >>        at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
> >> Caused by: java.util.zip.DataFormatException: unknown compression method
> >>        at java.util.zip.Inflater.inflateBytes(Native Method)
> >>        at java.util.zip.Inflater.inflate(Unknown Source)
> >>        at java.util.zip.Inflater.inflate(Unknown Source)
> >>        at
> >> org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
> >>        at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
> >>        ... 14 more
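
The innermost frames of the trace above are java.util.zip.Inflater
rejecting the stream data. As a rough, JDK-only illustration of that
failure mode (not PDFBox's own code; the class and method names FlateCheck
and inflatesCleanly are hypothetical), the following sketch reports
whether a byte array inflates as a complete zlib stream:

```java
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox.
final class FlateCheck {
    private FlateCheck() {}

    /** Returns true only if the data inflates as one complete zlib stream. */
    static boolean inflatesCleanly(byte[] data) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and more input (or a preset dictionary) wanted:
                // the stream is truncated or cannot be decoded from this data.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    return false;
                }
            }
            return true;
        } catch (DataFormatException e) {
            // Same exception type as in the "Caused by" line of the trace.
            return false;
        } finally {
            inflater.end(); // release native zlib resources
        }
    }
}
```

Running such a check on the raw bytes of a suspect stream, extracted by
other means, can help tell corrupt data apart from a wrong offset or
length in the parser.
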
> >>
> >> Best Regards,
> >> Wolfgang
> >>
> >> --
> >> Dipl.-Math.
> >> Wolfgang Kronberg
> >> Senior Software Architect
> >>
> >> financial.com AG
> >>
> >> (t) +49 89 318528-75
> >> (f) +49 89 318528-28
> >> e-mail: [email protected]
> >> http://www.financial.com
> >>
> >>
> >> financial.com AG
> >>
> >> Munich head office/Hauptsitz München: Georg-Muche-Straße 3 | 80807 München
> >> | Germany | Tel. +49 89 318528-0 | Google Maps: http://g.co/maps/4wcz
> >> Frankfurt branch office/Niederlassung Frankfurt: Messeturm |
> >> Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
> >> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer |
> >> Dr. Yann Samson | Matthias Wiederwach
> >> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden
> >> (Chairman/Vorsitzender)
> >> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID
> >> number/St.Nr.: DE205 370 553
> >
>
