Re: "No Unicode mapping for" when extracting text from a PDF

Tilman Hausherr Thu, 04 Jan 2018 11:28:46 -0800

On 04.01.2018 at 20:20, Luca Loiodice wrote:

I am trying to migrate a project from a commercial Windows PDF library to PDFBox, but I see reduced accuracy when I extract text from arbitrary files.
For example, I have a PDF (enclosed) that does not have Unicode mappings for certain glyphs ... and so when I try to extract the text using PDFBox I get the following:


Attachments are stripped by the mailing list; you'd need to upload the file to a file-sharing site.

WARNING: No Unicode mapping for G70 (112) in font HAGLDF+MSTT31c5ed
Jan 04, 2018 10:24:02 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
The Windows library returns the correct text for the glyph with the missing character mapping.
Is there a way for me to add some code to make PDFBox or my program figure out what the text is in this case?

Yes, but you'd need to build from source because G70 is non-standard; the change is described in

https://issues.apache.org/jira/browse/PDFBOX-3962
at the bottom.
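
To illustrate the kind of fallback involved, here is a minimal sketch of one possible heuristic. It is not the change from the JIRA issue and not PDFBox API: the class GlyphNameFallback and the method guessUnicode are hypothetical names. It assumes that a name like G70 is "G" plus the character code in hex (0x70 = 112, which matches the code in the warning) and that the font uses an ASCII-compatible encoding. Both assumptions may be wrong for a given font, so any result should be checked against the rendered page.

```java
import java.util.Optional;

// Hypothetical helper, not part of PDFBox.
final class GlyphNameFallback {
    private GlyphNameFallback() {}

    /**
     * Guesses a character for a non-standard glyph name such as "G70".
     * Assumption: the name is 'G' or 'g' followed by the character code in hex,
     * and the font is ASCII-compatible. Returns empty if the guess does not apply.
     */
    static Optional<String> guessUnicode(String glyphName, int code) {
        if (glyphName == null || !glyphName.matches("[Gg][0-9A-Fa-f]{1,4}")) {
            return Optional.empty();
        }
        int value = Integer.parseInt(glyphName.substring(1), 16);
        // Only trust the guess when the name agrees with the code from the
        // content stream and the result is printable ASCII.
        if (value != code || value < 0x20 || value > 0x7E) {
            return Optional.empty();
        }
        return Optional.of(String.valueOf((char) value));
    }

    public static void main(String[] args) {
        // Values taken from the warning above: glyph name G70, code 112.
        System.out.println(guessUnicode("G70", 112).orElse("?"));
    }
}
```

The cross-check against the character code is deliberate: if the name and the code disagree, the naming scheme is probably something else and it is safer to return nothing than to emit a wrong character.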

Tilman


Thanks for any help,
Luca

