Hi Tilman,

I just tested the link after logging out of my Dropbox account and it didn't 
require registration. Please can you try again? 

https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0

You might have to click 'X' to close the login dialog, if prompted.

Thanks,

Zeke 

-----Original Message-----
From: Tilman Hausherr <[email protected]> 
Sent: 14 May 2019 04:45
To: [email protected]
Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"

The link requires registration, which I don't want to do.

Tilman

On 14.05.2019 at 00:01, Zeke Steer wrote:
> Hi Tilman,
>
> Thanks for your reply and for sending the FAQ link.
>
> Regarding the non-sequential parser being the only parser in version 
> 2, I'm definitely seeing different behaviours when I set the -nonSeq 
> flag versus when I don't. Also I'm not destroying any of the files. I 
> extract the text to a different output directory and the PDF file 
> remains in the original location.
>
> I've uploaded the PDF document to my Dropbox and shared it with you. 
> You can also download it here:
> https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0
> I hope you can reproduce the issue at your end; it's very consistent 
> at mine.
>
> Thanks again for your help,
>
> Zeke
>
> P.S. I've cc'ed my supervisor for visibility.
>
> -----Original Message-----
> From: Tilman Hausherr <[email protected]>
> Sent: 13 May 2019 16:54
> To: [email protected]
> Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"
>
> On 13.05.2019 at 16:15, Zeke Steer wrote:
>> Hi,
>>
>> I'm using the latest version of the pdfbox command line tools
>> (pdfbox-app-2.0.15.jar) to extract the text from UK company annual 
>> reports. I invoke the command line tools from a python script, 
>> extracting each page of the company annual report .pdf document in 
>> turn.
>>
>> I've noticed that some pages of the annual reports aren't extracted 
>> correctly. I originally observed this problem in an earlier version 
>> of the command line tools (pdfbox-app-2.0.6.jar). However, moving to 
>> the latest version of the tools hasn't fixed the issue.
>>
>> I've attached a sample report which consistently reproduces the issue 
>> (00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in 
>> Adobe Reader but pdfbox is unable to extract it. The issue manifests 
>> differently depending on whether the sequential (default) or 
>> non-sequential parser is used.
>>
> The non-sequential parser is the only parser in 2.0.*
>
> Using the option should produce a FileNotFoundException with 2.0.15; 
> that is what I get.
>
> "Header doesn't contain versioninfo" occurs with empty files. I suspect 
> one of your calls used the PDF file as destination and you destroyed it.
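
One way to test that suspicion from the calling script is to check the file before handing it to pdfbox. This is only a rough sketch: the helper name `looks_like_pdf` and the 1024-byte probe window are assumptions made for illustration, not a description of how PDFBox itself validates the header.

```python
import os


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the file is non-empty and has a %PDF- marker near its start.

    Hypothetical helper: the probe window size is an arbitrary assumption.
    """
    # An empty (e.g. accidentally truncated) file can never be a valid PDF.
    if os.path.getsize(path) == 0:
        return False
    # Read only the first few bytes and look for the PDF header marker.
    with open(path, "rb") as handle:
        return b"%PDF-" in handle.read(probe_bytes)
```

If this returns False for the source PDF after an extraction run, the file was probably overwritten or truncated at some point.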
>
> Re different text extractions, please read
>
> https://pdfbox.apache.org/2.0/faq.html#text-extraction
>
> Your PDF file attachment didn't get through, please upload it to a 
> file-sharing service.
>
> Tilman
>
>
>> _Sequential Parser_
>>
>> I was initially executing the following command with the -nonSeq flag
>> unset:
>>
>> java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 
>> "E:\Analyst Reports\2019-05-13 PDF Extraction Issue 
>> Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual
>> Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
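
A minimal sketch of how the per-page call above could be assembled from Python, assuming the script uses `subprocess`. The helper names `build_extract_command` and `extract_page`, and the same-file guard, are illustrative and not taken from the original script; only `ExtractText`, `-startPage` and `-endPage` come from the command shown.

```python
import os
import subprocess


def build_extract_command(jar, pdf_path, txt_path, page):
    """Return the argument list for extracting one page with ExtractText.

    Hypothetical helper: the name and the guard below are assumptions.
    """
    # Refuse to write the text output over the source PDF.
    if os.path.abspath(pdf_path) == os.path.abspath(txt_path):
        raise ValueError("output path must differ from the input PDF")
    # Passing a list (not one string) means paths containing spaces
    # need no manual quoting.
    return [
        "java", "-jar", jar, "ExtractText",
        "-startPage", str(page), "-endPage", str(page),
        pdf_path, txt_path,
    ]


def extract_page(jar, pdf_path, txt_path, page):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_extract_command(jar, pdf_path, txt_path, page), check=True)
```

Building the arguments as a list keeps the input and output paths in fixed positions, which makes it harder to pass the PDF as the destination by accident.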
>>
>> This would generate a large number of Unicode warnings in the console,
>> e.g.:
>>
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
>> May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
>> WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H
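
Because these warnings go to the console, the calling script could tally them to see which fonts and character IDs are affected. A small sketch, assuming the console output has been captured as text and each warning sits on a single line; `missing_cids_by_font` is a hypothetical helper, not part of pdfbox.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
WARNING_RE = re.compile(r"No Unicode mapping for CID\+(\d+) \(\d+\) in font (\S+)")


def missing_cids_by_font(log_text):
    """Map each font name to the sorted list of CIDs reported as unmapped."""
    found = defaultdict(set)
    # findall yields one (cid, font) tuple per matching warning line.
    for cid, font in WARNING_RE.findall(log_text):
        found[font].add(int(cid))
    return {font: sorted(cids) for font, cids in found.items()}
```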
>>
>> The pdfbox output was missing a large amount of text present on the 
>> first page of the report. See the pdfbox output in the attached 1.txt 
>> file and compare this to the first page of the company annual report, 
>> also attached.
>>
>> _Non-Sequential Parser_
>>
>> I found the issue affected several of the annual reports within my 
>> dataset. Investigating further, I read about the non-sequential 
>> parser. You advise using this if the sequential parser fails so I 
>> tried executing the following command instead, with the -nonSeq flag set:
>>
>> java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 
>> -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue 
>> Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual
>> Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
>>
>> However, this consistently fails with a 'java.io.IOException: Error:
>> Header doesn't contain versioninfo'. See the full exception stack 
>> trace below:
>>
>> Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
>>          at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
>>          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
>>          at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
>>          at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
>>          at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
>>          at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
>>
>> failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf':
>> Command 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' returned non-zero exit status 1.
>>
>> I found a similar issue reported on your JIRA issue tracker here:
>> https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22
>> However, it was closed without being resolved as the 
>> original reporter failed to provide a PDF document which reproduced 
>> the issue. Hopefully with the information I've supplied, you'll be 
>> able to reopen the bug and take another look.
>>
>> Please can you keep me updated?
>>
>> Many thanks,
>>
>> Zeke
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]