On 12.05.2017 at 17:00, sunny hisa wrote:
When I use PDFBox to extract text, I get lots of warnings and the
output is only garbage. But when I use Adobe Acrobat to export the
attached PDF file to text, it works fine.
No, it doesn't work fine, here is what I get with Adobe Reader:
ATTENTION
!
"
!"&!" #"!""% !
"#" $"
I have attached the original PDF file, the text output and the log
with warnings. Besides, the PDF file seems to have a Type 1 font
embedded with a custom encoding.
The PDF didn't get through; you should have uploaded it to a
file hosting service. I could access it only because I'm a moderator.
The PDFBox version is pdfbox-app-2.0.5.
The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText FileWithIssue.pdf
I have checked many reports on the JIRA issue tracker but still found no
way to solve it. I am looking forward to hearing from you.
See here: https://pdfbox.apache.org/2.0/faq.html#gibberish
The problem with your file is that it uses incorrect glyph names in the
/Differences table, like "C0046" for a ".", or "C0065" for an "A".
Working around that in the source code produces this output:
Preface
ATTENTION
Personnel, accessing Rack equipment described in
this document, should be familiar with and observe Safety
instructions.
The safety instructions and the meaning of the warning labels on
the equipment are given in 1.
This is still not complete: APOLT is missing (no idea why), and there are
NUL characters (which are in the PDF too).
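
To illustrate the glyph-name problem described above, here is a rough
sketch. It is not the actual change made in the PDFBox source, and it uses
no PDFBox API. It assumes the nonstandard names always have the form "C"
followed by a four-digit decimal character code, as in the two examples
("C0046", "C0065"). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class GlyphNameFix {
    // Assumed pattern: 'C' followed by exactly four decimal digits, e.g. "C0046".
    private static final Pattern C_DECIMAL = Pattern.compile("C(\\d{4})");

    /** Returns the character for names like "C0065", or null if the name does not match. */
    static String toUnicode(String glyphName) {
        Matcher m = C_DECIMAL.matcher(glyphName);
        if (!m.matches()) {
            return null; // leave standard names (e.g. "period") to the normal lookup
        }
        int codePoint = Integer.parseInt(m.group(1)); // decimal, so "0046" -> 46
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("C0046"));  // prints "."
        System.out.println(toUnicode("C0065"));  // prints "A"
        System.out.println(toUnicode("period")); // prints "null"
    }
}
```

Such a fallback would only be consulted for names that the regular glyph
list does not know; whether other files use the same naming scheme is
something I can't say.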
Tilman
Thanks & Best Regards
Sunny Xia