Hi. I am working on merging original PDFs with the PDF/hOCR output of Tesseract in order to create a searchable PDF. Transplanting the glyphless font used by Tesseract was no problem: whether I simply use the font directly in the original PDF or use cloneutil, the font is embedded properly when the file is saved.
The problem arises when I show text using a content stream: I get a “No Glyph for …” exception. I traced this down to the glyphless font containing empty cmap tables. There is a CIDToGIDMap, though. Coincidentally, PDFBOX-5103 just addressed this issue with a reverse mapping that is used if the cmap is null. But here the cmap is not null, just empty, and it returns 0 for any character code, so the new feature will never kick in for this case.

For testing, I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that it ignores empty cmap subtables (the fallback at the end of the method is now a loop as well). With this change, PDFBox will happily use the Tesseract glyphless font.

I lack the knowledge to judge whether empty cmaps make any sense at all. If they do, I will simply write raw show-text commands, but maybe this is something to consider?

Gunnar
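
P.S. To make the idea easier to discuss, here is a rough, self-contained sketch of "skip empty cmap subtables". It is not the FontBox code and not my actual patch: the `Subtable` class, its `isEmpty()` method and `firstUsable()` are hypothetical stand-ins that only illustrate the selection logic. The assumption is that the caller treats a `null` result the same way as a missing cmap, so that the reverse CIDToGIDMap lookup from PDFBOX-5103 gets a chance to run.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CmapSelectionSketch {

    // Hypothetical stand-in for a cmap subtable (character code -> glyph ID).
    // This is NOT the FontBox CmapSubtable class.
    static final class Subtable {
        final String name;
        private final Map<Integer, Integer> codeToGid;

        Subtable(String name, Map<Integer, Integer> codeToGid) {
            this.name = name;
            this.codeToGid = codeToGid;
        }

        // True if the subtable maps no character code at all.
        boolean isEmpty() {
            return codeToGid.isEmpty();
        }

        // Returns 0 (.notdef) for unmapped codes, as an empty cmap does for every code.
        int getGlyphId(int code) {
            Integer gid = codeToGid.get(code);
            return gid == null ? 0 : gid;
        }
    }

    // Walks the candidates in preference order and returns the first one that
    // actually contains mappings, or null if all of them are missing or empty.
    static Subtable firstUsable(List<Subtable> candidates) {
        for (Subtable candidate : candidates) {
            if (candidate != null && !candidate.isEmpty()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Subtable emptyUnicode = new Subtable("unicode", Collections.<Integer, Integer>emptyMap());
        Subtable emptySymbol = new Subtable("symbol", Collections.<Integer, Integer>emptyMap());

        // Every subtable is empty, as in the glyphless font, so nothing is selected
        // and the caller can fall back to another lookup strategy.
        Subtable chosen = firstUsable(Arrays.asList(emptyUnicode, emptySymbol));
        System.out.println("usable cmap found: " + (chosen != null));
    }
}
```

With a font whose subtables are all empty, `firstUsable` returns `null` instead of a subtable that answers 0 for everything, which is the behaviour I was after in my test modification.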