Hi Tilman, I just tested the link after logging out of my Dropbox account and it didn't require registration. Please can you try again?
https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0

You might have to click 'X' to close the login dialog, if prompted.

Thanks,

Zeke

-----Original Message-----
From: Tilman Hausherr <[email protected]>
Sent: 14 May 2019 04:45
To: [email protected]
Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"

The link requires registration, which I don't want to do.

Tilman

On 14.05.2019 at 00:01, Zeke Steer wrote:
> Hi Tilman,
>
> Thanks for your reply and for sending the FAQ link.
>
> Regarding the non-sequential parser being the only parser in version
> 2, I'm definitely seeing different behaviours when I set the -nonSeq
> flag versus when I don't. Also, I'm not destroying any of the files. I
> extract the text to a different output directory and the PDF file
> remains in the original location.
>
> I've uploaded the PDF document to my Dropbox and shared it with you.
> You can also download it here:
> https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0
> Hoping you can reproduce the issue at your end; it's very consistent
> at mine.
>
> Thanks again for your help,
>
> Zeke
>
> P.S. I've cc'ed my supervisor for visibility.
>
> -----Original Message-----
> From: Tilman Hausherr <[email protected]>
> Sent: 13 May 2019 16:54
> To: [email protected]
> Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"
>
> On 13.05.2019 at 16:15, Zeke Steer wrote:
>> Hi,
>>
>> I'm using the latest version of the pdfbox command line tools
>> (pdfbox-app-2.0.15.jar) to extract the text from UK company annual
>> reports. I invoke the command line tools from a Python script,
>> extracting each page of the company annual report .pdf document in
>> turn.
>>
>> I've noticed that some pages of the annual reports aren't extracted
>> correctly. I originally observed this problem in an earlier version
>> of the command line tools (pdfbox-app-2.0.6.jar). However, moving to
>> the latest version of the tools hasn't fixed the issue.
>>
>> I've attached a sample report which consistently reproduces the issue
>> (00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in
>> Adobe Reader but pdfbox is unable to extract it. The issue manifests
>> differently depending on whether the sequential (default) or
>> non-sequential parser is used.
>>
> The non-sequential parser is the only parser in 2.0.*
>
> Using the option should produce a FileNotFoundException with 2.0.15;
> that is what I get.
>
> "Header doesn't contain versioninfo" occurs with empty files. I suspect
> one of your calls used the PDF file as destination and you destroyed it.
>
> Re different text extractions, please read
>
> https://pdfbox.apache.org/2.0/faq.html#text-extraction
>
> Your PDF file attachment didn't get through; please upload it to a
> file-sharing host.
>
> Tilman
>
>
>> _Sequential Parser_
>>
>> I was initially executing the following command with the -nonSeq flag
>> unset:
>>
>> java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
>>
>> This would generate a large number of Unicode warnings in the
>> console, e.g.:
>>
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H
>>
>> The pdfbox output was missing a large amount of text present on the
>> first page of the report. See the pdfbox output in the attached 1.txt
>> file and compare this to the first page of the company annual report,
>> also attached.
>>
>> _Non-Sequential Parser_
>>
>> I found the issue affected several of the annual reports within my
>> dataset. Investigating further, I read about the non-sequential
>> parser. You advise using this if the sequential parser fails, so I
>> tried executing the following command instead, with the -nonSeq flag set:
>>
>> java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
>>
>> However, this consistently fails with a 'java.io.IOException: Error:
>> Header doesn't contain versioninfo'. See the full exception stack
>> trace below:
>>
>> Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
>>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
>>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
>>     at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
>>     at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
>>     at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
>>
>> failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf': Command 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' returned non-zero exit status 1.
>>
>> I found a similar issue reported on your JIRA issue tracker here:
>> https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22
>> However, it was closed without being resolved as the original
>> reporter failed to provide a PDF document which reproduced the
>> issue. Hopefully with the information I've supplied, you'll be
>> able to reopen the bug and take another look.
>>
>> Please can you keep me updated?
>>
>> Many thanks,
>>
>> Zeke
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
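
Tilman's empty-file hypothesis can be checked mechanically before each ExtractText call. The following is a minimal Python sketch, not taken from either poster's code: the helper names `looks_like_pdf` and `build_extract_command` are hypothetical, and the 1024-byte search window for the `%PDF-` marker is an assumption about where a header may reasonably sit, not a statement of how pdfbox itself parses. The command line mirrors the one quoted in the thread (without -nonSeq).

```python
import os
from pathlib import Path

# Assumption: a valid file carries the "%PDF-" marker within its first 1024 bytes.
HEADER_WINDOW = 1024


def looks_like_pdf(path):
    """Return True if the file exists, is non-empty and has a %PDF- marker near its start."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size == 0:
        return False
    with p.open("rb") as fh:
        return b"%PDF-" in fh.read(HEADER_WINDOW)


def build_extract_command(jar, pdf, out_txt, page):
    """Build the ExtractText argument list for one page.

    Raises ValueError if the output path is the input PDF, which would
    overwrite (and so destroy) the source file.
    """
    if os.path.abspath(pdf) == os.path.abspath(out_txt):
        raise ValueError("output path must differ from the input PDF")
    return [
        "java", "-jar", jar, "ExtractText",
        "-startPage", str(page), "-endPage", str(page),
        pdf, out_txt,
    ]
```

One possible use: call `looks_like_pdf(pdf)` first and log the file as empty or damaged if it returns False; otherwise pass the list from `build_extract_command(...)` to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string also sidesteps quoting problems with the spaces in the Windows paths.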

