On 12.05.2017 at 17:00, sunny hisa wrote:
When I use PDFBox to extract text, I get lots of warnings, and the output is only garbage. But when I use Adobe Acrobat to export the attached PDF file to text, it works fine.

No, it doesn't work fine, here is what I get with Adobe Reader:


 
ATTENTION
􀀀 􀀀􀀀    􀀀􀀀
􀀀
􀀀􀀀􀀀
􀀀􀀀!􀀀􀀀
 􀀀
"

􀀀!"&􀀀!" #"!􀀀􀀀"􀀀􀀀􀀀"􀀀% 􀀀!􀀀
"􀀀#"􀀀 􀀀$􀀀􀀀" 􀀀





I have attached the original PDF file, the text output and the log with warnings. In addition,
the PDF file seems to have a Type 1 font embedded with a custom encoding.

The PDF didn't get through; you should have uploaded it to a file-hosting service. I could access it only because I'm a moderator.



The PDFBox version is pdfbox-app-2.0.5
The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText FileWithIssue.pdf

I have checked many reports on the JIRA issue tracker but still found no way to solve it. I am looking forward to hearing from you.

See here:  https://pdfbox.apache.org/2.0/faq.html#gibberish

The problem with your file is that it uses incorrect glyph names in the /Differences table, like "C0046" for a ".", or "C0065" for an "A".
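To illustrate the pattern, here is a minimal sketch of how such names could be mapped back to characters. It assumes that every broken name in this file is a "C" followed by four decimal digits giving the character code, which matches the two examples above but is not verified for the rest of the file. The class GlyphNameFix and its method toUnicode are hypothetical names for illustration only; they are not part of PDFBox.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the PDFBox API.
class GlyphNameFix {
    // Assumption: broken names are "C" plus four decimal digits, e.g. "C0065".
    private static final Pattern CODE_NAME = Pattern.compile("C(\\d{4})");

    // Returns the character for a "Cdddd" name, or empty for any other name.
    static Optional<String> toUnicode(String glyphName) {
        Matcher m = CODE_NAME.matcher(glyphName);
        if (!m.matches()) {
            return Optional.empty();
        }
        // The digits are read as a decimal character code: 0046 is 46, 0065 is 65.
        int codePoint = Integer.parseInt(m.group(1));
        return Optional.of(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("C0046").orElse("?")); // prints "."
        System.out.println(toUnicode("C0065").orElse("?")); // prints "A"
        System.out.println(toUnicode("period").isPresent()); // prints "false"
    }
}
```

Names that do not match the pattern are left alone, so regular glyph names such as "period" would still go through the normal lookup.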

Handling those names in the source code gives this output:


Preface
ATTENTION
Personnel,  accessing    Rack  equipment  described  in
this  document,  should  be  familiar  with  and  observe  Safety
instructions.
The  safety  instructions  and  the  meaning  of  the  warning labels  on
the  equipment  are  given  in    1.


This is still not complete: APOLT is missing (no idea why), and there are NUL characters (which are in the PDF too).
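If the NUL characters get in the way, one option is to drop them from the extracted text afterwards. This is only a post-processing sketch on a plain Java string; NulStripper and stripNul are hypothetical names, and it does nothing about the missing text.

```java
// Hypothetical post-processing helper; not part of the PDFBox API.
class NulStripper {
    // Removes every NUL (U+0000) character and leaves all other text untouched.
    static String stripNul(String text) {
        return text.replace("\u0000", "");
    }

    public static void main(String[] args) {
        String extracted = "Rack\u0000 equipment";
        System.out.println(stripNul(extracted)); // prints "Rack equipment"
    }
}
```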


Tilman



Thanks & Best Regards
Sunny Xia


