Greetings,

Some of the PDF files we process do not have Unicode information defined for their Type 3 fonts. I am in the process of migrating ancient code (based on version 1.8) to the latest version.

Since the characters are limited to ASCII, we dumped the checksum of each glyph together with its character into a map. By processing enough files, we managed to collect checksums for all the characters we care about. At runtime, we take the font glyph, compute its checksum, and look up the character in that map.
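The lookup itself is conceptually simple. The sketch below is only illustrative and is written against the 2.x API; checksumToChar stands for the map we built offline, and CRC32 stands in for whatever checksum is actually used:

    import java.io.IOException;
    import java.util.Map;
    import java.util.zip.CRC32;

    import org.apache.pdfbox.io.IOUtils;
    import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
    import org.apache.pdfbox.pdmodel.font.PDType3Font;

    public final class GlyphLookup
    {
        // checksumToChar: checksum of a glyph's content stream -> character,
        // built offline from documents whose text we already know.
        public static String lookupUnicode(PDType3Font font, int code,
                                           Map<Long, String> checksumToChar) throws IOException
        {
            PDType3CharProc charProc = font.getCharProc(code);
            if (charProc == null)
            {
                return null; // no glyph program for this code
            }
            byte[] glyphBytes = IOUtils.toByteArray(charProc.getContents());
            CRC32 crc = new CRC32();
            crc.update(glyphBytes);
            return checksumToChar.get(crc.getValue()); // null if we have never seen this glyph
        }
    }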
Once the character is known, we set the equivalent Unicode on the font with code similar to the following (1.8 API):

    font.getFontEncoding().addCharacterEncoding(letterChar, charName);
    font.getToUnicodeCMap().addMapping(new byte[] { (byte) i }, letter);

With these changes, the rest of the text stripper code works as expected, since it is able to find the required information.

We are now trying to migrate to the latest released version of PDFBox. I believe some of these methods are now package protected, e.g. org.apache.pdfbox.pdmodel.font.encoding.Encoding.add(int, String), and the comment on that method seems to discourage our workaround. I am also not able to figure out which method to call for the ToUnicode mapping in the second line of the code above.

What would be a good way to handle this? The approach of mapping glyphs to characters does work for us, even though we created the map manually.
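In case there is no supported API for this, one COS-level fallback would be to write a ToUnicode CMap stream into the document itself before running the text stripper. Below is a rough, untested sketch of that idea; codeToUnicode is a hypothetical map from single-byte character code to the looked-up string, and I assume the stream has to be in place before PDFBox constructs the PDFont (or the file saved and re-opened), since the font seems to read its ToUnicode entry when it is created.

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;

    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.common.PDStream;

    public final class ToUnicodePatcher
    {
        // Builds a minimal single-byte ToUnicode CMap from codeToUnicode and
        // attaches it to the Type 3 font's dictionary at the COS level.
        public static void attachToUnicode(PDDocument doc, COSDictionary fontDict,
                                           Map<Integer, String> codeToUnicode) throws IOException
        {
            StringBuilder cmap = new StringBuilder();
            cmap.append("/CIDInit /ProcSet findresource begin\n")
                .append("12 dict begin\nbegincmap\n")
                .append("/CMapName /Custom-ToUnicode def\n")
                .append("/CMapType 2 def\n")
                .append("1 begincodespacerange\n<00> <FF>\nendcodespacerange\n")
                .append(codeToUnicode.size()).append(" beginbfchar\n");
            for (Map.Entry<Integer, String> entry : codeToUnicode.entrySet())
            {
                // ASCII only, so a single UTF-16 code unit per mapping is enough
                cmap.append(String.format("<%02X> <%04X>%n",
                        entry.getKey(), (int) entry.getValue().charAt(0)));
            }
            cmap.append("endbfchar\nendcmap\nend\nend\n");

            PDStream stream = new PDStream(doc, new ByteArrayInputStream(
                    cmap.toString().getBytes(StandardCharsets.US_ASCII)));
            fontDict.setItem(COSName.TO_UNICODE, stream);
        }
    }

Would something along those lines be a reasonable route, or is there a supported way to register the mappings directly on the font?

Regards,
Niranjan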