Hi,
Wolfgang Kronberg <[email protected]> wrote on 17 July 2012 at 18:50:

>
> Hi Maruan,
>
> thank you for pointing me to the NonSequentialParser. I hadn't noticed
> that one before, and it works much better indeed - I could now extract
> the text for all files except for one. This one file still shows
> correctly in AdobeReader, but AdobeReader issues a warning that one
> embedded font is missing. NonSequentialParser issues this exception:
>
> java.io.IOException: Error reading stream using length value.
> Expected='endstream' actual='H‰tV T”Ç þæ"òXÞ " '
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseCOSStream(NonSequentialPDFParser.java:1327)
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1032)
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:955)
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseDictObjects(NonSequentialPDFParser.java:929)
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:337)
>   at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:574)
>   at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1124)
>   at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1107)
>
> Standard parser does not throw any exception but regards the document as
> empty.
>
> What I don't like about this solution is that I need to provide the PDF
> as a file, not as a stream. In my application that means that I first

Feel free to provide a patch ;-) You are not the only one who stumbled upon this.

> have to copy my stream to a temporary file. Also, the RandomAccess must
> be fully rebuilt before each use (e.g. with new RandomAccessBuffer()),
> because in some cases it will be closed implicitly, leaving me with a
> NullPointerException on next access... Anyway, all of that is not a
> problem for my current application. So thanks a lot, problem solved!
>
> Nevertheless, perhaps some PDFBox developer is still interested in
> getting the (now three) PDFs from me which exhibit PDFBox bugs? If so,
> please drop me a note! :)

Send those PDFs to me. I'll have a look and can share them with the other devs if necessary.

>
> Best Regards,
> Wolfgang

BR
Andreas Lehmkühler

>
>
> On 17.07.2012 16:53, Maruan Sahyoun wrote:
> > Hello Wolfgang,
> >
> > did you try using the NonSequentialParser, which was a new addition in 1.7
> > improving the parsing of PDF documents? See
> > https://issues.apache.org/jira/browse/PDFBOX-1199 for details.
> >
> > With kind regards
> >
> > Maruan
> >
> >
> > On 17.07.2012 at 16:09, Wolfgang Kronberg wrote:
> >
> >>
> >> Hello,
> >>
> >> I have recently converted some 2500 PDF files to text using PDFBox
> >> 1.7.0. While doing so, I ran into two problems on a minority of the PDF
> >> files (some 5% are affected for each problem). Usually, I would now file
> >> a bug and attach a sample PDF so that the problem can be reproduced.
> >>
> >> However, the PDFs in question are not public, and I am not entitled to
> >> publish them. Is there any person to whom I could mail two
> >> affected PDF files, so that that person could nail down the actual bug
> >> for a good bug description while keeping the actual files secret?
> >>
> >> In either case, here is what I see. In all cases, the affected document can
> >> be displayed with no problems in Adobe Reader.
> >>
> >> Problem 1: The document is parsed to be empty (no pages), although it in
> >> fact contains more than 50 pages full of text.
> >> Running PDFDebugger on this
> >> document produces this output (WARNUNG = WARNING):
> >> 17.07.2012 14:01:50 org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
> >> WARNUNG: Did not found XRef object at specified startxref position 116
> >>
> >> Problem 2: On attempting to parse the document, I get an IOException.
> >> PDFDebugger outputs the following on this document (SCHWERWIEGEND =
> >> SEVERE):
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> 17.07.2012 14:01:10 org.apache.pdfbox.filter.FlateFilter decode
> >> SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
> >> PDFDebugger failed with the following exception:
> >> java.io.IOException
> >>   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
> >>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
> >>   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
> >>   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
> >>   at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.<init>(PDFXrefStreamParser.java:61)
> >>   at org.apache.pdfbox.pdfparser.PDFParser.parseXrefStream(PDFParser.java:846)
> >>   at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:574)
> >>   at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
> >>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1071)
> >>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1038)
> >>   at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1009)
> >>   at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:408)
> >>   at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:388)
> >>   at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:376)
> >>   at org.apache.pdfbox.PDFBox.main(PDFBox.java:48)
> >> Caused by: java.util.zip.DataFormatException: unknown compression method
> >>   at java.util.zip.Inflater.inflateBytes(Native Method)
> >>   at java.util.zip.Inflater.inflate(Unknown Source)
> >>   at java.util.zip.Inflater.inflate(Unknown Source)
> >>   at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
> >>   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
> >>   ... 14 more
> >>
> >> Best Regards,
> >> Wolfgang
> >>
> >> --
> >> Dipl.-Math.
> >> Wolfgang Kronberg
> >> Senior Software Architect
> >>
> >> financial.com AG
> >>
> >> (t) +49 89 318528-75
> >> (f) +49 89 318528-28
> >> e-mail: [email protected]
> >> http://www.financial.com
> >>
> >>
> >> financial.com AG
> >>
> >> Munich head office/Hauptsitz München: Georg-Muche-Straße 3 | 80807 München
> >> | Germany | Tel. +49 89 318528-0 | Google Maps: http://g.co/maps/4wcz
> >> Frankfurt branch office/Niederlassung Frankfurt: Messeturm |
> >> Friedrich-Ebert-Anlage 49 | 60327 Frankfurt | Germany
> >> Management board/Vorstand: Dr. Steffen Boehnert | Dr. Alexis Eisenhofer |
> >> Dr. Yann Samson | Matthias Wiederwach
> >> Supervisory board/Aufsichtsrat: Dr. Dr. Ernst zur Linden
> >> (Chairman/Vorsitzender)
> >> Register court/Handelsregister: Munich – HRB 128 972 | Sales tax ID
> >> number/St.Nr.: DE205 370 553
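
For anyone else who only has an InputStream: the temporary-file workaround Wolfgang describes could be sketched roughly as below. This is only an illustration using plain JDK classes. TempFileBridge, FileAction and withTempFile are hypothetical names that are not part of PDFBox, and the PDFBox call shown in the comment assumes that 1.7 offers a loadNonSeq(File, RandomAccess) overload, which is not verified here.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of PDFBox): copies a stream to a temporary
// file, hands that file to a callback, and always deletes it afterwards.
final class TempFileBridge {

    /** Hypothetical callback type for work that needs a real file on disk. */
    interface FileAction<T> {
        T apply(File file) throws IOException;
    }

    static <T> T withTempFile(InputStream in, FileAction<T> action) throws IOException {
        File tmp = File.createTempFile("pdf-input-", ".pdf");
        try {
            // createTempFile already created an empty file, so replace it.
            Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            // Clean up even if the callback throws.
            Files.deleteIfExists(tmp.toPath());
        }
    }

    private TempFileBridge() {
    }

    // Assumed usage with PDFBox 1.7 (unverified signature), creating a fresh
    // RandomAccessBuffer for every call because it may be closed implicitly:
    //
    //   String text = TempFileBridge.<String>withTempFile(stream, f -> {
    //       PDDocument doc = PDDocument.loadNonSeq(f, new RandomAccessBuffer());
    //       try {
    //           return new PDFTextStripper().getText(doc);
    //       } finally {
    //           doc.close();
    //       }
    //   });
}
```

The callback has to close the document before it returns, otherwise deleting the temporary file may fail on some platforms.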

