Re: Missing unicode information from a font.

Tilman Hausherr Sun, 13 Apr 2025 06:49:59 -0700

Hi,

We're usually reluctant to make things public, because doing so brings new risks and more support requests, and also prevents us from changing that API.


Tilman

On 12.04.2025 02:52, NH Rao wrote:

Greetings,

Thank you for the reply. I managed to get it working using reflection.
However I'm a bit worried that I am accessing methods that are not part of
the public API.

Essentially, my solution is to override the showGlyph method from
PDFTextStripper, create a wrapper for the font, and call the base class
method with the wrapper. This works. The only issue is with two
methods, getStandard14Width and encode, of the
org.apache.pdfbox.pdmodel.font.PDFont class. I am forced to use reflection
because these methods are not public. Is it possible to make these methods
public?

For all practical purposes, my wrapper just implements the toUnicode method
and delegates everything else to the wrapped object.
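
A minimal sketch of the reflection step described above, using only the JDK. LibraryFont is a hypothetical stand-in for a library class with a non-public method; it is not PDFBox's PDFont, and the width value 500 is a placeholder. The helper name invokeWidth is likewise an assumption made for illustration.

```java
import java.lang.reflect.Method;

class ReflectionSketch {
    /** Hypothetical stand-in for a library class; not the real PDFont. */
    static class LibraryFont {
        protected float getStandard14Width(int code) {
            return code == 65 ? 500f : 0f; // placeholder width, not real font metrics
        }
    }

    /** Calls the non-public getStandard14Width(int), searching up the class hierarchy. */
    static float invokeWidth(Object font, int code) {
        for (Class<?> c = font.getClass(); c != null; c = c.getSuperclass()) {
            Method m;
            try {
                m = c.getDeclaredMethod("getStandard14Width", int.class);
            } catch (NoSuchMethodException e) {
                continue; // not declared on this class, try the superclass
            }
            try {
                m.setAccessible(true); // bypass the access check for this Method object
                return (Float) m.invoke(font, code);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalStateException("no getStandard14Width(int) on " + font.getClass());
    }
}
```

A lookup by method name like this breaks silently if the library renames or changes the method, which is the risk of depending on a non-public API.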

Thank you,

Niranjan

On Fri, Apr 11, 2025 at 5:40 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

No, there is no official solution to handle this.

Here's what could be done (in addition to just forking the project or
using reflection):

- do a setUnicode() on the TextPosition elements in the stripper
- create the encoding and replace the fonts before extracting. For that
you'd have to find out how the encoding is stored (probably in
"differences").
If it doesn't work, you may have to disable the cache or use your own.

Tilman

On 10.04.2025 23:48, NH Rao wrote:

Greetings,

Some of the PDF files we process do not have Unicode information defined
for their Type 3 fonts. I am in the process of migrating ancient code
(based on version 1.8) to the latest version. Since the characters are
limited to ASCII, we dumped the checksum of each glyph and its character
to a map. After processing enough files, we managed to get checksums for
all the characters we care about. At runtime, we get the font glyph,
compute its checksum, and set the equivalent Unicode value using code
similar to the following:

font.getFontEncoding().addCharacterEncoding(letterChar, charName);
font.getToUnicodeCMap().addMapping(new byte[] { (byte) i }, letter);

With these changes, the rest of the text stripper code works as expected,
as it's able to find the required information.
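
The checksum lookup described above can be sketched in plain Java as follows. This is an illustration under assumptions, not the code from the original application: the thread does not say which checksum was used, so CRC-32 is assumed here, and the class name GlyphChecksumMap and its methods are hypothetical. The glyph is represented as a raw byte array; how those bytes are obtained from the font is outside the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.zip.CRC32;

/** Maps a checksum of a glyph's raw bytes to the text that glyph represents. */
class GlyphChecksumMap {
    private final Map<Long, String> checksumToText = new HashMap<>();

    /** CRC-32 is an assumption; any stable checksum of the glyph bytes would do. */
    public static long checksum(byte[] glyphBytes) {
        CRC32 crc = new CRC32();
        crc.update(glyphBytes);
        return crc.getValue();
    }

    /** Records a manually identified glyph, as done while building the map. */
    public void learn(byte[] glyphBytes, String text) {
        checksumToText.put(checksum(glyphBytes), text);
    }

    /** Returns the text for a glyph seen before, or empty if it is unknown. */
    public Optional<String> lookup(byte[] glyphBytes) {
        return Optional.ofNullable(checksumToText.get(checksum(glyphBytes)));
    }
}
```

Returning an empty Optional for unknown glyphs lets the caller decide whether to skip the character or log it for later manual mapping.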

We're trying to migrate to the latest released version of PDFBox. I believe
some of these methods are now package protected,
e.g. org.apache.pdfbox.pdmodel.font.encoding.Encoding.add(int, String).
Also, the comment on the method seems to discourage our workaround.

I am not able to figure out which method I need to call for the Unicode
mapping in the second line of the above code example.

What would be a solution to handle this? Mapping glyphs to characters
does work for us, even though we created the map manually.

Regards,

Niranjan


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org


