On 24.03.2021 at 14:40, Gunnar Brand wrote:
Hi.
I am working on merging original PDFs with the PDF/HOCR output of Tesseract in
order to create a searchable PDF. Transplanting the glyphless font used by
Tesseract was no problem: whether I simply use the font in the original PDF or
use cloneutil, the font is embedded properly when the file is saved.
The problem is that when I show text using a content stream, I get a “No Glyph
for …” exception. I traced this down to the glyphless font containing empty cmap
tables. There is a CIDToGIDMap. Coincidentally, PDFBOX-5103 just addressed this
issue with a reverse mapping if the cmap is null. But here the cmap is not null,
just empty, and it returns 0 for any character code, so this new feature never
takes effect in this case.
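The difference between a null cmap and an empty one can be illustrated with a small, self-contained model. This is only a sketch: `CmapModel`, `lookupNullCheckOnly` and `lookupTreatingEmptyAsMissing` are hypothetical names invented for illustration and are not PDFBox or FontBox classes, and the `fallback` map merely stands in for whatever the reverse CIDToGIDMap lookup would yield.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a cmap subtable; NOT the FontBox class.
class CmapModel {
    private final Map<Integer, Integer> codeToGlyphId;

    CmapModel(Map<Integer, Integer> codeToGlyphId) {
        this.codeToGlyphId = codeToGlyphId;
    }

    // Returns 0 (the "missing glyph" ID) for every unmapped code.
    int getGlyphId(int code) {
        return codeToGlyphId.getOrDefault(code, 0);
    }

    boolean isEmpty() {
        return codeToGlyphId.isEmpty();
    }
}

class EmptyCmapSketch {
    // The fallback is consulted only when the cmap is null.
    static int lookupNullCheckOnly(CmapModel cmap, Map<Integer, Integer> fallback, int code) {
        if (cmap == null) {
            return fallback.getOrDefault(code, 0);
        }
        return cmap.getGlyphId(code);
    }

    // Assumed variant: an empty cmap is treated the same as a missing one.
    static int lookupTreatingEmptyAsMissing(CmapModel cmap, Map<Integer, Integer> fallback, int code) {
        if (cmap == null || cmap.isEmpty()) {
            return fallback.getOrDefault(code, 0);
        }
        return cmap.getGlyphId(code);
    }

    public static void main(String[] args) {
        CmapModel emptyCmap = new CmapModel(new HashMap<Integer, Integer>());
        Map<Integer, Integer> fallback = new HashMap<Integer, Integer>();
        fallback.put(0x41, 1); // illustrative entry: code 0x41 -> glyph 1

        // Empty but non-null cmap: the null check never fires, so the result is 0.
        System.out.println(lookupNullCheckOnly(emptyCmap, fallback, 0x41));
        // With the emptiness check the fallback is used and glyph 1 is found.
        System.out.println(lookupTreatingEmptyAsMissing(emptyCmap, fallback, 0x41));
    }
}
```

In this model the first lookup yields glyph 0 for every code, which is the condition that would surface as a missing-glyph error, while the second reaches the fallback table.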
For testing, I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that it
ignores empty cmap subtables (even the fallback at the end of the method is now
a loop). With this change, PDFBox will happily use the Tesseract glyphless font.
I do not know whether empty cmaps make any sense at all. If they do, I will
simply write raw show text commands, but maybe this is something to consider?
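For the raw show text route, a minimal sketch of building the operator by hand could look like the following. It rests on an assumption not stated in this thread: that the glyphless font uses two-byte character codes equal to the UTF-16BE code units of the text. `RawShowTextSketch`, `toHexString` and `showTextOperator` are hypothetical helper names, and how the resulting string is written into the content stream is left out.

```java
import java.nio.charset.StandardCharsets;

class RawShowTextSketch {
    // Encodes text as a PDF hex string of UTF-16BE code units.
    // ASSUMPTION: the font's character codes are 2 bytes wide and equal
    // to the UTF-16 code units of the text.
    static String toHexString(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16BE); // no BOM is written
        StringBuilder sb = new StringBuilder("<");
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    // Builds a complete show-text operation: hex string operand plus the Tj operator.
    static String showTextOperator(String text) {
        return toHexString(text) + " Tj\n";
    }

    public static void main(String[] args) {
        // "Hi" is U+0048 U+0069, so this prints: <00480069> Tj
        System.out.print(showTextOperator("Hi"));
    }
}
```

Because the operand is written as raw character codes, this path does not depend on any Unicode-to-glyph lookup through the font's cmap.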
Gunnar
I tried Tesseract some time ago and it generates searchable PDFs out of
the box, so why not use that?
Can you upload one of your files to a file hosting service so that I can
understand what this is about?
Tilman