Re: Empty cmap in TTF Files.

Tilman Hausherr Wed, 24 Mar 2021 20:38:59 -0700

On 24.03.2021 at 14:40, Gunnar Brand wrote:

Hi.


I am working on merging original PDFs and the PDF/HOCR output of Tesseract to 
create a searchable PDF. Transplanting the glyphless font used by Tesseract 
was no problem. It doesn’t matter whether I simply use the font in the original 
PDF or use cloneutil: when saving the file, the font is embedded properly.

The problem is that when I show text using a content stream, I get a “No Glyph 
for …” exception. I traced this down to the glyphless font containing empty cmap 
tables. There is a CIDToGIDMap. Coincidentally, PDFBOX-5103 just addressed this 
issue with a reverse mapping if the cmap is null. But here the cmap is not null, 
just empty, and will return 0 for any character code, so this new feature will 
never work in this case.

For testing, I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that it 
ignores empty cmap subtables (even the fallback at the end of the method is now 
a loop). With this, PDFBox will happily use the Tesseract glyphless font. 
I don’t know whether empty cmaps make any sense at all. If they do, I will 
simply write raw show text commands, but maybe it is something to consider?
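
A minimal sketch of the idea described above, not the actual patch: it uses hypothetical stand-in classes (SketchSubtable, CmapSelectionSketch) rather than the real FontBox types, and the preference order of platform/encoding pairs is an assumption made for illustration. It only shows the selection logic: empty subtables are skipped, and if nothing usable remains the method returns null, so that a null-cmap fallback such as the reverse mapping mentioned for PDFBOX-5103 could apply.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one cmap subtable; NOT the FontBox class.
final class SketchSubtable {
    final int platformId;
    final int encodingId;
    private final Map<Integer, Integer> codeToGid = new HashMap<>();

    SketchSubtable(int platformId, int encodingId) {
        this.platformId = platformId;
        this.encodingId = encodingId;
    }

    // Fluent helper so sample data can be built inline.
    SketchSubtable map(int codePoint, int gid) {
        codeToGid.put(codePoint, gid);
        return this;
    }

    boolean isEmpty() {
        return codeToGid.isEmpty();
    }

    // An unmapped code yields glyph 0, which is why an empty subtable
    // answers 0 for every character code.
    int getGlyphId(int codePoint) {
        return codeToGid.getOrDefault(codePoint, 0);
    }
}

final class CmapSelectionSketch {
    // Assumed preference order of (platformId, encodingId) pairs,
    // chosen for illustration only.
    private static final int[][] PREFERRED = { {0, 4}, {3, 10}, {0, 3}, {3, 1} };

    // Returns the first preferred non-empty subtable, then any non-empty
    // subtable, and null when every subtable is empty.
    static SketchSubtable selectUnicodeCmap(List<SketchSubtable> subtables) {
        for (int[] pair : PREFERRED) {
            for (SketchSubtable table : subtables) {
                if (table.platformId == pair[0]
                        && table.encodingId == pair[1]
                        && !table.isEmpty()) {
                    return table;
                }
            }
        }
        // Fallback is also a loop, so an empty first subtable is not picked.
        for (SketchSubtable table : subtables) {
            if (!table.isEmpty()) {
                return table;
            }
        }
        return null;
    }
}
```

With a font whose subtables are all empty, selectUnicodeCmap returns null instead of a table that maps everything to glyph 0. With a mix of empty and populated subtables, the populated one is chosen even if an empty one ranks higher in the preference order.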

Gunnar

I tried Tesseract some time ago and it generates searchable PDFs out of the box, why not use that?

Can you upload one of your files to a sharehoster so that I understand what this is about?


Tilman


