Hi.

I am working on merging original PDFs with the PDF/HOCR output of Tesseract to 
create a searchable PDF. Transplanting the glyphless font used by Tesseract 
was no problem: whether I simply use the font object in the original PDF or 
copy it with cloneutil, the font is embedded properly when the file is saved.

The problem arises when I show text using a content stream: I get a “No Glyph 
for …” exception. I traced this down to the glyphless font containing empty 
cmap tables. The font does have a CIDToGIDMap. Coincidentally, PDFBOX-5103 just 
addressed this issue with a reverse mapping that is used if the cmap is null. 
But here the cmap is not null, just empty, and returns 0 for any character 
code, so this new feature will never take effect in this case.

For testing, I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that it 
ignores empty cmap subtables (the fallback at the end of the method is now a 
loop as well). With this change, PDFBox happily uses the Tesseract glyphless 
font. I don't know whether empty cmaps make any sense at all. If they do, I 
will simply write raw show-text commands instead, but maybe this is something 
to consider?
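
To make the idea concrete, here is a rough, self-contained sketch of the 
selection logic I mean. It is not the PDFBox code and not my actual patch: the 
Subtable class, pickSubtable, and the platform/encoding pairs in the preference 
list are hypothetical stand-ins chosen for illustration, and "empty" is assumed 
to mean "maps no character code to a non-zero glyph id".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class CmapSelectionSketch {

    /** Hypothetical stand-in for one cmap subtable; not a PDFBox class. */
    static final class Subtable {
        final int platformId;
        final int encodingId;
        final Map<Integer, Integer> codeToGid;

        Subtable(int platformId, int encodingId, Map<Integer, Integer> codeToGid) {
            this.platformId = platformId;
            this.encodingId = encodingId;
            this.codeToGid = codeToGid;
        }

        /** Glyph id for a character code, or 0 (.notdef) if unmapped. */
        int getGlyphId(int code) {
            return codeToGid.getOrDefault(code, 0);
        }

        /** Assumption: a subtable is "empty" if no code maps to a non-zero glyph id. */
        boolean isEmpty() {
            return codeToGid.values().stream().allMatch(gid -> gid == 0);
        }
    }

    /**
     * Walks the preferred (platformId, encodingId) pairs in order and skips
     * empty subtables. The fallback is a loop over all subtables rather than
     * "take the first one", so an empty first subtable is skipped as well.
     */
    static Optional<Subtable> pickSubtable(List<Subtable> subtables, int[][] preferred) {
        for (int[] ids : preferred) {
            for (Subtable s : subtables) {
                if (s.platformId == ids[0] && s.encodingId == ids[1] && !s.isEmpty()) {
                    return Optional.of(s);
                }
            }
        }
        for (Subtable s : subtables) {
            if (!s.isEmpty()) {
                return Optional.of(s);
            }
        }
        // Every subtable is empty: the caller has to use another route.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Subtable empty = new Subtable(3, 1, Collections.<Integer, Integer>emptyMap());
        Subtable usable = new Subtable(1, 0, Collections.singletonMap(65, 7));
        int[][] preferred = { {0, 4}, {3, 1} }; // illustrative order only
        Optional<Subtable> chosen = pickSubtable(Arrays.asList(empty, usable), preferred);
        System.out.println(chosen.map(s -> s.getGlyphId(65)).orElse(0)); // prints 7
    }
}
```

If every subtable is empty, as in the Tesseract glyphless font, the sketch 
returns an empty Optional, which is the point where a caller could treat the 
cmap as missing instead of getting glyph id 0 for every character code.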

Gunnar
