I can view it too. It is the text extraction (CTRL-A, CTRL-C, editor,
CTRL-V) that fails.
Tilman
On 14.05.2019 at 19:08, Zeke Steer wrote:
Hi Tilman,
Thanks for looking into this for me. I can view the PDF document in Adobe
Acrobat Reader DC just fine. Please see the screenshot here:
https://www.dropbox.com/s/huubrxt7vvmb90r/image.png?dl=0.
Can I please ask what version of Adobe Reader you're using to view the PDF?
I'll try to obtain the same version and see whether I'm also unable to view the
PDF using that version.
Thanks again,
Zeke
________________________________________
From: Tilman Hausherr [[email protected]]
Sent: 14 May 2019 17:46
To: [email protected]
Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"
On 14.05.2019 at 10:32, Zeke Steer wrote:
Hi Tilman,
I just tested the link after logging out of my Dropbox account and it didn't
require registration. Please can you try again?
https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0
You might have to click 'X' to close the login dialog, if prompted.
Indeed, sorry.
I tried with Adobe Reader, and here is what I got:
Dear Shareholder
Members
William W. (Bill) Douglas III
Committee Chair
Coca-Cola HBC
2016 Integrated Annual Report
Strategic Report Corporate Governance Financial Statements
Coca-Cola HBC
2016 Integrated Annual Report
Swiss Statutory Reporting Supplementary Information
Coca-Cola HBC
2016 Integrated Annual Report
����
Strategic Report Corporate Governance Financial Statements
Coca-Cola HBC
2016 Integrated Annual Report
Swiss Statutory Reporting Supplementary Information
So there's nothing we can do. See also
https://pdfbox.apache.org/2.0/faq.html#text-extraction
When I tried the -nonSeq option, I got this:
Exception in thread "main" java.io.FileNotFoundException: -nonSeq (The
system cannot find the file specified)
Tilman
Thanks,
Zeke
-----Original Message-----
From: Tilman Hausherr <[email protected]>
Sent: 14 May 2019 04:45
To: [email protected]
Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"
The link requires registration, which I don't want to do.
Tilman
On 14.05.2019 at 00:01, Zeke Steer wrote:
Hi Tilman,
Thanks for your reply and for sending the FAQ link.
Regarding the non-sequential parser being the only parser in version
2, I'm definitely seeing different behaviours when I set the -nonSeq
flag versus when I don't. Also I'm not destroying any of the files. I
extract the text to a different output directory and the PDF file
remains in the original location.
I've uploaded the PDF document to my Dropbox and shared it with you.
You can also download it here:
https://www.dropbox.com/s/ic04eojpyhqm2kt/00054_CCH_Annual%20Report_2016-82-85.pdf?dl=0.
Hoping you can reproduce the issue at your end, it's very consistent at mine.
Thanks again for your help,
Zeke
P.S. I've cc'ed my supervisor for visibility.
-----Original Message-----
From: Tilman Hausherr <[email protected]>
Sent: 13 May 2019 16:54
To: [email protected]
Subject: Re: pdfbox parse error "Header doesn't contain versioninfo"
On 13.05.2019 at 16:15, Zeke Steer wrote:
Hi,
I'm using the latest version of the pdfbox command line tools
(pdfbox-app-2.0.15.jar) to extract the text from UK company annual
reports. I invoke the command line tools from a python script,
extracting each page of the company annual report .pdf document in
turn.
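A minimal sketch of the per-page invocation described above, assuming Python 3 and that java is on the PATH. The function names (build_extract_command, extract_page) and the output layout are hypothetical; only the ExtractText, -startPage and -endPage arguments are taken from this thread. Passing the arguments as a list avoids quoting problems with spaces in Windows paths.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar, pdf, out_txt, page):
    """Build the ExtractText argument list for a single page (1-based)."""
    return [
        "java", "-jar", str(jar), "ExtractText",
        "-startPage", str(page),
        "-endPage", str(page),
        str(pdf), str(out_txt),
    ]


def extract_page(jar, pdf, out_dir, page):
    """Run ExtractText for one page and return the output file path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_txt = out_dir / f"{page}.txt"
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_extract_command(jar, pdf, out_txt, page), check=True)
    return out_txt
```

Note that the input PDF and the output text file are separate positional arguments, in that order, so the output never shares a path with the source document.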
I've noticed that some pages of the annual reports aren't extracted
correctly. I originally observed this problem in an earlier version
of the command line tools (pdfbox-app-2.0.6.jar). However, moving to
the latest version of the tools hasn't fixed the issue.
I've attached a sample report which consistently reproduces the issue
(00054_CCH_Annual Report_2016-82-85.pdf). The report opens fine in
Adobe Reader but pdfbox is unable to extract it. The issue manifests
differently depending on whether the sequential (default) or
non-sequential parser is used.
The non-sequential parser is the only parser in 2.0.*
Using the option should produce a FileNotFoundException with 2.0.15;
that is what I get.
"Header doesn't contain versioninfo" occurs with empty files. I suspect
one of your calls used the PDF file as the destination and you destroyed it.
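One way to rule out an empty or overwritten input before calling the tool is a quick header check. This is only a heuristic sketch, not PDFBox's own validation: the function name looks_like_pdf and the 1024-byte probe window are assumptions made for illustration.

```python
from pathlib import Path


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the file is non-empty and contains a '%PDF-' marker
    within its first probe_bytes bytes."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    with path.open("rb") as handle:
        return b"%PDF-" in handle.read(probe_bytes)
```

Running this on the source file before and after an extraction call would show whether a call has emptied it.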
Re different text extractions, please read
https://pdfbox.apache.org/2.0/faq.html#text-extraction
Your PDF file attachment didn't get through, please upload it to a
sharehoster.
Tilman
_Sequential Parser_
I was initially executing the following command with the -nonSeq flag
unset:
java -jar pdfbox-app-2.0.15.jar ExtractText -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
This would generate a large number of Unicode warnings in the console,
e.g.:
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+36 (36) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+88 (88) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+71 (71) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+76 (76) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+87 (87) in font Effra-Medium-Identity-H
May 13, 2019 10:12:18 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+3 (3) in font Effra-Medium-Identity-H
The pdfbox output was missing a large amount of text present on the
first page of the report. See the pdfbox output in the attached 1.txt
file and compare this to the first page of the company annual report,
also attached.
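To see which fonts and character IDs are affected across many pages, the warnings can be summarised. The sketch below assumes the console output has already been captured as text; the function name missing_cids_by_font is hypothetical, and the regular expression is written only against the warning format quoted in this thread.

```python
import re
from collections import defaultdict

# Matches the warning message format quoted in this thread; \s+ tolerates
# a line wrap between "font" and the font name.
WARNING_RE = re.compile(
    r"No Unicode mapping for CID\+(\d+) \(\d+\) in font\s+(\S+)"
)


def missing_cids_by_font(log_text):
    """Map each font name to the sorted list of CIDs reported as unmapped."""
    missing = defaultdict(set)
    for match in WARNING_RE.finditer(log_text):
        cid, font = match.groups()
        missing[font].add(int(cid))
    return {font: sorted(cids) for font, cids in missing.items()}
```

A font with many unmapped CIDs lacks usable Unicode mapping information, which is the situation the text-extraction FAQ entry describes.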
_Non-Sequential Parser_
I found the issue affected several of the annual reports within my
dataset. Investigating further, I read about the non-sequential
parser. You advise using this if the sequential parser fails so I
tried executing the following command instead, with the -nonSeq flag set:
java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"
However, this consistently fails with a 'java.io.IOException: Error:
Header doesn't contain versioninfo'. See the full exception stack
trace below:
Exception in thread "main" java.io.IOException: Error: Header doesn't contain versioninfo
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:221)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1070)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1008)
    at org.apache.pdfbox.tools.ExtractText.startExtraction(ExtractText.java:216)
    at org.apache.pdfbox.tools.ExtractText.main(ExtractText.java:96)
    at org.apache.pdfbox.tools.PDFBox.main(PDFBox.java:60)
failed to extract text from 'E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf':
Command 'java -jar pdfbox-app-2.0.15.jar ExtractText -nonSeq -startPage 1 -endPage 1 "E:\Analyst Reports\2019-05-13 PDF Extraction Issue Investigation\00054_CCH_Annual Report_2016-82-85\00054_CCH_Annual Report_2016-82-85.pdf" "out\00054_CCH_Annual Report_2016-82-85\1.txt"' returned non-zero exit status 1.
I found a similar issue reported on your JIRA issue tracker here:
https://issues.apache.org/jira/browse/PDFBOX-4203?jql=text%20~%20%22versioninfo%22.
However, it was closed without being resolved as the
original reporter failed to provide a PDF document which reproduced
the issue. Hopefully with the information I've supplied, you'll be
able to reopen the bug and take another look.
Please can you keep me updated?
Many thanks,
Zeke
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]