Attachments are stripped by the mailing list. In the meantime I looked at a PDF file created with tesseract. Yes, PDFBox reports that the TTF file has a cmap table with two empty subtables. I experimented a bit and got nowhere. I think the solution would be to create an improved font by misusing PDCIDFontType2Embedder, feeding it correct cmap data, and then saving that font. Alternatively, use a real font but make it invisible (text rendering mode 3).

The behavior of PDFBox makes sense: it prevents the user from using glyphs that don't exist.

One odd thing is that DTL OTMaster claims the two subtables aren't empty (but maybe this is some sort of default). At the same time, it reports that the length is 10.

Tilman

On 25.03.2021 at 13:30, Gunnar Brand wrote:
Hi.

The process is as follows:
1) For images: use the image.
   For PDFs: render each page at 300 dpi (since optimized PDFs don't
   necessarily have a single big image), maybe even with text if text
   extraction returned gibberish (missing Unicode mapping).
2) Use tesseract to OCR the image/page with PDF and HOCR output (for pages:
   create an imageless PDF). The HOCR is used for additional page layout
   information and word confidence values.
3) For images: use the HOCR to filter the PDF text stream and add layout
   information.
   For PDFs: insert the tesseract PDF text stream into the original PDF's
   page (+ add that glyphless font), then use the HOCR to filter and add
   layout information.
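
The OCR call in step 2 could be scripted roughly as below. This is a sketch, not the setup used in this thread: the file names are hypothetical, and it assumes a tesseract 4.x or later command line where `--dpi`, the `textonly_pdf` variable, and the `pdf` and `hocr` configs are available.

```java
import java.util.ArrayList;
import java.util.List;

final class TesseractCommand {

    /**
     * Builds the argument list for one OCR run that writes both
     * outputBase.pdf and outputBase.hocr.
     * textOnly = true asks for an imageless PDF (assumed: textonly_pdf variable).
     */
    static List<String> build(String image, String outputBase, boolean textOnly) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);        // input image or rendered page
        cmd.add(outputBase);   // output path without extension
        cmd.add("--dpi");
        cmd.add("300");        // matches the 300 dpi page rendering of step 1
        if (textOnly) {
            cmd.add("-c");
            cmd.add("textonly_pdf=1");
        }
        cmd.add("pdf");        // config: PDF output
        cmd.add("hocr");       // config: HOCR output
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical file names, printed only; nothing is executed here.
        System.out.println(String.join(" ", build("page-001.png", "page-001", true)));
    }
}
```

The list can be handed to `new ProcessBuilder(cmd)` to run the tool; building it separately keeps the sketch testable without tesseract installed.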

For step 3, I would like to use a normal PDPageContentStream to add the content 
instead of working with a raw stream. But that step fails since I cannot use 
the showText() method with a Font that has an empty cmap.

I attached an empty tesseract PDF with the glyphless font. Appending text using 
the font to the single page in there will fail immediately with the exception 
due to the empty cmap. Adding the font to any other PDF and trying to show text 
using it will fail as well.

I can probably get away with just creating/transferring the Tj commands raw, but 
I was wondering whether the empty cmap behaviour is OK, or whether it would be 
better to ignore empty cmaps (i.e. look for a non-empty one first and return 
null if none can be found in TrueTypeFont.getUnicodeCmapImpl).
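
For reference, the raw-Tj route could look roughly like the sketch below. It only builds the operator text and does not touch PDFBox. It rests on two assumptions that should be checked against the actual file: that the glyphless font takes two-byte codes equal to UTF-16BE code units, and that the font is registered under the (hypothetical) resource name F1. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class RawShowText {

    /** Encodes text as a PDF hex string of UTF-16BE code units (assumed encoding). */
    static String toHexString(String text) {
        StringBuilder sb = new StringBuilder("<");
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /** Builds one invisible text object; fontName is the font's resource name. */
    static String textObject(String fontName, float size, float x, float y, String text) {
        return "BT\n"
                + "3 Tr\n"  // text rendering mode 3: neither fill nor stroke
                + String.format(Locale.ROOT, "/%s %.2f Tf\n", fontName, size)
                + String.format(Locale.ROOT, "1 0 0 1 %.2f %.2f Tm\n", x, y)
                + toHexString(text) + " Tj\n"
                + "ET\n";
    }

    public static void main(String[] args) {
        // Hypothetical font name, size and position.
        System.out.print(textObject("F1", 8f, 72f, 700f, "Hi"));
    }
}
```

Because the codes are written directly, no cmap lookup is involved, which is why this route sidesteps the exception.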

Gunnar



-----Original Message-----
From: Tilman Hausherr <thaush...@t-online.de>
Sent: Thursday, 25 March 2021 04:37
To: users@pdfbox.apache.org
Subject: Re: Empty cmap in TTF Files.

On 24.03.2021 at 14:40, Gunnar Brand wrote:
Hi.

I am working on merging original PDFs and the PDF/HOCR output of Tesseract, so 
as to create a searchable PDF. Transplanting the glyphless font used by tesseract 
was no problem: it doesn’t matter whether I simply use the font in the original 
PDF or use cloneutil; when saving the file, the font is embedded properly.

The problem is when I show text using a content stream, I get a “No Glyph for 
…” exception. I traced this down to the glyphless font containing empty cmap 
tables. There is a CIDToGIDMap. Coincidentally PDFBOX-5103 just addressed this 
issue with a reverse mapping if the cmap is null. But the cmap is just empty 
and will return 0 for any character code, so this new feature will never work 
in this case.

For testing I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so that it 
ignores empty cmap subtables (the fallback at the end of the method is now a 
loop as well). With this, PDFBox will happily use the tesseract glyphless font. 
I don't know whether empty cmaps make any sense at all. If they do, I will 
simply write raw show text commands, but maybe it is something to consider?
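
The idea of skipping empty subtables can be illustrated in isolation. The sketch below is not the modification described above and does not use PDFBox's classes: `Subtable` and `firstNonEmpty` are hypothetical stand-ins, and the only behaviour shown is "return the first subtable that maps at least one code, or null if there is none".

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class CmapSelection {

    /** Hypothetical stand-in for a cmap subtable; not a PDFBox type. */
    static final class Subtable {
        final int platformId;
        final int encodingId;
        final Map<Integer, Integer> codeToGid;

        Subtable(int platformId, int encodingId, Map<Integer, Integer> codeToGid) {
            this.platformId = platformId;
            this.encodingId = encodingId;
            this.codeToGid = codeToGid;
        }

        boolean isEmpty() {
            return codeToGid.isEmpty();
        }

        /** Returns 0 (the missing-glyph id) for unmapped codes. */
        int getGlyphId(int code) {
            return codeToGid.getOrDefault(code, 0);
        }
    }

    /** Returns the first subtable with at least one mapping, or null if all are empty. */
    static Subtable firstNonEmpty(List<Subtable> subtables) {
        for (Subtable subtable : subtables) {
            if (!subtable.isEmpty()) {
                return subtable;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Subtable empty = new Subtable(0, 3, Collections.<Integer, Integer>emptyMap());
        // Two empty subtables, as reported for the glyphless font: no usable cmap.
        System.out.println(firstNonEmpty(List.of(empty, empty)) == null);
    }
}
```

With a null result, a caller can fall back to another lookup instead of receiving glyph 0 for every code from an empty table.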

Gunnar

I tried tesseract some time ago and it generates searchable PDFs out of the 
box, why not use that?

Can you upload one of your files to a sharehoster so that I understand what 
this is about?

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: users-h...@pdfbox.apache.org



