On 18.07.2016 at 11:08, OYEBISI, Daniel wrote:
Hi,
While extracting text from a PDF (screenshot attached), I came across a No
Unicode Mapping warning. The resulting extracted text does not contain the
Wingdings 3 characters present in the PDF. I have been trying to debug this PDF
for some time now but I can't seem to understand the issues involved.
Please can someone explain why PDFBox is unable to correctly extract these
symbols?
The codes are missing in the ToUnicode CMap:
/CIDInit /ProcSet findresource begin 12 dict begin begincmap
/CIDSystemInfo <<
/Registry (LNDPFO+TT11+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /LNDPFO+TT11+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0003> endcodespacerange
1 beginbfchar
<0003> <0020> <=======================
endbfchar
endcmap CMapName currentdict /CMap defineresource pop end end
All you have is code 3, which maps to a space. Every other code used by this
font has no entry, hence the "No Unicode mapping" warning.
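
To check which codes a ToUnicode CMap actually covers, one can list its
bfchar entries. The following is only a rough sketch in plain Java, not
PDFBox code: the class name BfCharDump and its methods are made up for
illustration, it assumes the CMap stream has already been decoded to text,
and it ignores bfrange sections entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class BfCharDump {
    // Everything between "beginbfchar" and "endbfchar".
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // One "<code> <utf16be>" pair, both given as hex strings.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    // Returns character code -> mapped Unicode string for all bfchar entries.
    static Map<Integer, String> parseBfChar(String cmap) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher section = SECTION.matcher(cmap);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                result.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return result;
    }

    // Decodes a hex string as UTF-16BE code units, four hex digits each.
    static String decodeUtf16Be(String hex) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 4 <= hex.length(); i += 4) {
            sb.append((char) Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The bfchar part of the CMap quoted above.
        String cmap = "1 begincodespacerange <0003> <0003> endcodespacerange\n"
                + "1 beginbfchar\n<0003> <0020>\nendbfchar\n";
        for (Map.Entry<Integer, String> e : parseBfChar(cmap).entrySet()) {
            System.out.printf("code 0x%04X -> U+%04X%n",
                    e.getKey(), (int) e.getValue().charAt(0));
        }
    }
}
```

For the CMap above this yields a single entry, so any other code drawn with
this font cannot be turned into text without some other source of mapping.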
Tilman
Kindly find the links related to this PDF below:
PDF file on Dropbox
https://www.dropbox.com/s/57cvb36h4x2v96k/page2.pdf?dl=0
Screenshot (Text extraction)
https://www.dropbox.com/s/ftb3tuwvq3npg8o/page2%20no%20unicode%20mapping.PNG?dl=0