Re: Getting lots of warnings "No Unicode mapping for..." when extract text

Tilman Hausherr Fri, 12 May 2017 08:57:02 -0700

On 12.05.2017 at 17:00, sunny hisa wrote:

When I use PDFBox to extract text, I get lots of warnings, and as output I only get garbage. But when I use Adobe Acrobat to export the attached PDF file to text, it works fine.


No, it doesn't work fine, here is what I get with Adobe Reader:


 
ATTENTION
􀀀 􀀀􀀀    􀀀􀀀
􀀀
􀀀􀀀􀀀
􀀀􀀀!􀀀􀀀
 􀀀
"

􀀀!"&􀀀!" #"!􀀀􀀀"􀀀􀀀􀀀"􀀀% 􀀀!􀀀
"􀀀#"􀀀 􀀀$􀀀􀀀" 􀀀

I have attached the original PDF file, the text output, and the log with warnings. Besides, the
PDF file seems to have a Type-1 font embedded with a custom encoding.

The PDF didn't get through; you should have uploaded it to a sharehoster. I accessed it because I'm a moderator.

The PDFbox version is pdfbox-app-2.0.5
The command I use is: java -jar pdfbox-app-2.0.5.jar ExtractText FileWithIssue.pdf
I have checked lots of reports on the JIRA issue tracker, but still found no way to solve it. I am looking forward to hearing from you.


See here:  https://pdfbox.apache.org/2.0/faq.html#gibberish

The problem with your file is that it uses incorrect glyph names in the /Differences table, like "C0046" for a "." or "C0065" for an "A".


Changing that in the source code brings this output:


Preface
ATTENTION
Personnel,  accessing    Rack  equipment  described  in
this  document,  should  be  familiar  with  and  observe  Safety
instructions.
The  safety  instructions  and  the  meaning  of  the  warning labels  on
the  equipment  are  given  in    1.

This is still not complete: APOLT is missing (no idea why), and there are NUL characters (which are in the PDF too).
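
To illustrate the glyph name problem, here is a minimal Java sketch. It assumes, based only on the two examples above ("C0046" for "." and "C0065" for "A"), that the non-standard names consist of "C" followed by a four-digit decimal character code. The class GlyphNameSketch and the method toUnicode are hypothetical names made up for this illustration; this is not the actual change made in the PDFBox source, and other files may use different naming schemes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlyphNameSketch {
    // ASSUMPTION: names look like 'C' plus exactly four decimal digits, e.g. "C0046".
    private static final Pattern C_DECIMAL = Pattern.compile("C(\\d{4})");

    // Returns the decoded character as a String, or null if the name
    // does not match the assumed pattern or decodes to NUL.
    static String toUnicode(String glyphName) {
        Matcher m = C_DECIMAL.matcher(glyphName);
        if (!m.matches()) {
            return null;
        }
        int code = Integer.parseInt(m.group(1)); // "0046" -> 46 (decimal)
        if (code == 0) {
            return null; // do not map anything to a NUL character
        }
        return new String(Character.toChars(code));
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("C0046")); // prints "."
        System.out.println(toUnicode("C0065")); // prints "A"
        System.out.println(toUnicode("period")); // prints "null": not in the assumed pattern
    }
}
```

Standard glyph names such as "period" or "A" would still need to be resolved through the regular glyph list; the sketch only covers the assumed non-standard pattern.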



Tilman



Thanks & Best Regards
Sunny Xia


