Hi Constantine

I worked on it a bit.
In the end I don't render the HOCR directly but simply transfer the Tesseract 
PDF data over to the target PDF.
For this I clone the glyphless font (which is embedded as a resource in the 
jar), add it to the target page and then, while transferring the PDF data, 
change the font commands to the new font name on the fly. I also add a lot of 
marked content, since I have the info from the HOCR file (something Tesseract 
should have done).

I still tried to get the HOCR rendering working, which is a big PITA once you 
enter the realm of sloped or rotated text. Basically all HOCR does is draw a 
rectangle around words and lines and give you a single pointer to the 
baseline of the first word. So you later need to calculate the intersections 
of the baseline with the word rectangles, and so on. I probably spent more 
time on it than I should have, but I wanted to get it working.
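
To make the baseline calculation concrete, here is a minimal sketch. It assumes 
the Tesseract form of the hOCR property, "baseline <slope> <offset>", read as a 
line relative to the bottom-left corner of the line's bbox in image coordinates 
(y grows downward). The class and method names are made up for illustration and 
are not the parser classes mentioned in this mail.

```java
/** Hypothetical helper: evaluates an hOCR line baseline in image coordinates. */
final class HocrBaseline {
    private final double left;    // left edge of the line bbox, in pixels
    private final double bottom;  // bottom edge of the line bbox, in pixels (y grows downward)
    private final double slope;   // first value of the hOCR "baseline" property
    private final double offset;  // second value of the hOCR "baseline" property

    HocrBaseline(double left, double bottom, double slope, double offset) {
        this.left = left;
        this.bottom = bottom;
        this.slope = slope;
        this.offset = offset;
    }

    /** Image y of the baseline at image x, e.g. at the left edge of a word bbox. */
    double yAt(double x) {
        return bottom + slope * (x - left) + offset;
    }

    /**
     * Counter-clockwise angle of the baseline in a y-up system such as PDF.
     * The sign flips because image y grows downward (an assumption about the
     * convention the caller wants).
     */
    double angleDegrees() {
        return Math.toDegrees(-Math.atan(slope));
    }
}
```

The start point of a word is then (word.left, yAt(word.left)), which is the 
intersection of the baseline with the left edge of the word rectangle.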

When doing the PDF transfer I found a way to mix PDPageContentStream with raw 
commands in a nicer way, and this works fine with the "broken" glyphless font, 
too.
All you do is set the font size, set the text matrix or use newLineAtOffset, 
set the horizontal scaling (text zoom) and then write a raw COSString and a raw 
Tj operator.
There are issues with RTL text I haven't looked at yet (you probably need to 
reverse the order as Tesseract does), and vertical text is another thing that 
probably does not work.
I will dump the code right here. Please note that you don't really need to 
create a form XObject (PDFormXObject); you can use the page directly and append 
the text data.
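
The two calculations behind the raw Tj call can be sketched without PDFBox: the 
string bytes are plain UTF-16BE without a BOM, and the horizontal scaling 
stretches the fixed-width glyphs to the word width. The helper names below are 
hypothetical; the formula mirrors the one used in the code further down, where 
glyphScale is derived from the font bounding box.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for text shown with a glyphless font. */
final class GlyphlessText {

    /** UTF-16BE bytes of the text; this charset does not prepend a byte order mark. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE);
    }

    /**
     * Horizontal scaling in percent so that the text fills wordWidth.
     * glyphScale is assumed to be 100 * (font bbox height / font bbox width).
     */
    static float horizontalScaling(float glyphScale, float wordWidth,
                                   float fontSize, String text) {
        int glyphs = text.codePointCount(0, text.length());
        return glyphScale * wordWidth / (fontSize * glyphs);
    }
}
```

The result of horizontalScaling would be passed to setHorizontalScaling, and 
the bytes from encode wrapped in a COSString before writing the Tj operator.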

The HOCR classes I am using are a hastily written HOCR parser for the Tesseract 
hocr files. All it does is parse block, paragraph, line and word XML into 
Java classes and transfer the coordinates into the PDF coordinate system 
(bottom to top). It does some more involved math for the baseline and rotation 
handling, especially to get the start coordinates for lines and words (slope 
only makes sense for unrotated text, though). If all you have is text rotated 
in 90 degree steps, it is fairly easy to calculate these.
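
As an illustration of the coordinate transfer, a small sketch that flips an 
hOCR bbox (pixels, origin at the top left) into PDF user space (points, origin 
at the bottom left). The class name, the parameter list and the idea of passing 
the image height and resolution explicitly are assumptions made for this 
example, not the actual parser code.

```java
/** Hypothetical helper: converts an hOCR "bbox x0 y0 x1 y1" to PDF coordinates. */
final class BBoxFlip {

    /**
     * Returns {x, y, width, height} in PDF points with a lower-left origin.
     * imageHeightPx is the height of the OCRed image, dpi its resolution.
     */
    static float[] toPdf(int x0, int y0, int x1, int y1, int imageHeightPx, float dpi) {
        float s = 72f / dpi;                 // pixels to points
        return new float[] {
            x0 * s,                          // left edge is unchanged by the flip
            (imageHeightPx - y1) * s,        // the image bottom edge becomes the PDF lower y
            (x1 - x0) * s,
            (y1 - y0) * s
        };
    }
}
```

For a 300 dpi page image that is 3300 pixels high, a bbox of 300 600 900 750 
ends up at roughly x=72, y=612 with a size of 144 by 36 points.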

I added some marked content code (excessively, with confidence properties at 
word level) that might make text extractors happy later on. It also requires 
the raw method, since the content stream class does not do inline dictionaries 
at all. (There is an inline dictionary property "ActualText" that can override 
the whole content of the glyphs inside, maybe worth trying out, too.)
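
For the ActualText idea, an untested sketch of what the raw marked-content 
operator could look like. Unlike the string shown with Tj, ActualText is a PDF 
text string, so a UTF-16BE value does start with the FE FF byte order mark. The 
helper below only builds the operator text with the JDK; its names are 
hypothetical, and whether text extractors honour the property is something to 
try out.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: formats a "/Span <</ActualText ...>> BDC" operator. */
final class ActualTextSpan {

    /** PDF hex string with a UTF-16BE byte order mark followed by the text. */
    static String hexTextString(String text) {
        StringBuilder sb = new StringBuilder("<FEFF");
        for (byte b : text.getBytes(StandardCharsets.UTF_16BE)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.append('>').toString();
    }

    /** Begin-marked-content operator with an inline property dictionary. */
    static String beginSpan(String actualText) {
        return "/Span <</ActualText " + hexTextString(actualText) + ">> BDC";
    }
}
```

The sequence is closed with EMC, which is what endMarkedContent() writes.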

http://kba.cloud/hocr-spec/1.2/

Gunnar

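// Imports for PDFBox 2.0.x. HocrPage, HocrBlock, HocrParagraph, HocrLine and
// HocrWord are my own parser classes and are not shown here; Rectangle is
// assumed to be java.awt.Rectangle.
import java.awt.Rectangle;
import java.awt.geom.Point2D;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.Iterator;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.OperatorName;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.multipdf.PDFCloneUtility;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import org.apache.pdfbox.util.Matrix;

class PageMerger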
{
        private final static Charset UTF16BE = Charset.forName("UTF-16BE");
        
        private final static Operator BDC = 
Operator.getOperator(OperatorName.BEGIN_MARKED_CONTENT_SEQ);
        private final static COSName DIV = COSName.getPDFName("Div");
        private final static COSName P = COSName.getPDFName("P");
        private final static COSName SPAN = COSName.getPDFName("Span");
        // might be a good idea to think about this, see PDF spec
        private final static COSName REVERSED_CHARS = 
COSName.getPDFName("ReversedChars");

        private final static COSName   LANG = COSName.getPDFName("Lang");
        private final static COSName   WRITING_MODE = 
COSName.getPDFName("WritingMode");
        private final static COSString LTR = new COSString("LrTb");
        private final static COSString RTL = new COSString("RlTb");
        private final static COSName   CONFIDENCE = 
COSName.getPDFName("X_Confidence");

        private final PDDocument _doc;

        private PDFont _glyphless;

        public PageMerger(PDDocument doc) throws IOException
        {
                _doc = doc;
        }


        public void addHocrText(PDPage page, HocrPage hocr) throws IOException
        {
                final PDRectangle pageBox = page.getCropBox(), formBox = 
rectangle(hocr.bbox);

                PDFormXObject form = new PDFormXObject(_doc);
                form.setBBox(formBox);
                form.setResources(new PDResources());
                final COSName fname = addGlyphless(form.getResources());
                final float glyphScale = 100f * 
_glyphless.getBoundingBox().getHeight() / 
_glyphless.getBoundingBox().getWidth();

                PDPageContentStream cs = new PDPageContentStream(_doc, form, 
form.getContentStream().createOutputStream(COSName.FLATE_DECODE));
                ContentStreamWriter csw = new ContentStreamWriter(new 
PDPageContentStreamAdapter(cs));
                
                for ( HocrBlock block : hocr.blocks ) {
                        cs.beginMarkedContent(DIV);
                        cs.beginText();
                        cs.setRenderingMode(RenderingMode.NEITHER);
                        float size = -1f;
                        for ( HocrParagraph para : block.paragraphs ) {
                                if ( para.lang==null && para.dir==null ) {
                                        cs.beginMarkedContent(P);
                                } else {
                                        COSDictionary dict = new 
COSDictionary();
                                        dict.setDirect(true);
                                        if ( para.lang!=null ) 
dict.setString(LANG, para.lang);
                                        // compare string contents, not references
                                        if ( para.dir!=null ) dict.setItem(WRITING_MODE, "ltr".equals(para.dir) ? LTR : RTL);
                                        csw.writeTokens(P, dict, BDC);
                                }

                                for ( HocrLine line : para.lines ) {
                                        cs.beginMarkedContent(SPAN);
                                        Point2D p = line.getStart();
                                        
cs.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(line.getRotation()), 
(float)p.getX(), (float)p.getY()));
                                        if ( line.size!=size ) 
cs.setFont(_glyphless, size = line.size);
                                        HocrWord last = null;
                                        for ( Iterator<HocrWord> lit = 
line.words.iterator(); lit.hasNext(); ) {
                                                HocrWord word = lit.next();
                                                
                                                COSDictionary dict = new 
COSDictionary();
                                                dict.setDirect(true);
                                                dict.setFloat(CONFIDENCE, 
word.confidence);
                                                if ( word.lang!=null ) 
dict.setString(LANG, word.lang);
                                                // compare string contents, not references
                                                if ( word.dir!=null ) dict.setItem(WRITING_MODE, "ltr".equals(word.dir) ? LTR : RTL);
                                                csw.writeTokens(SPAN, dict, 
BDC);
                                                
                                                // preferable but less reliable (produces the same error as tesseract does) :(
                                                if ( last!=null ) 
cs.newLineAtOffset(word.getDistance() - last.getDistance(), 0);
                                                last = word;
                                                
                                                // overkill but places words at the right position
//                                              p = word.getStart();
//                                              cs.setTextMatrix(Matrix.getRotateInstance(Math.toRadians(line.getRotation()), (float)p.getX(), (float)p.getY()));
                                                
                                                String text = word.text;
                                                if ( lit.hasNext() && 
!text.isBlank() ) text += " "; 
                                                float zoom = glyphScale * 
word.getWidth() / (line.size * text.codePointCount(0, text.length()));
                                                cs.setHorizontalScaling(zoom);
                                                // COSString constructor with java string argument adds a BOM in the beginning, that ain't good.
                                                csw.writeToken(new 
COSString(text.getBytes(UTF16BE)));
                                                
csw.writeToken(Operator.getOperator(OperatorName.SHOW_TEXT));
                                                cs.endMarkedContent();
                                        }
                                        cs.endMarkedContent();
                                }
                                cs.endMarkedContent();
                        }
                        cs.endText();
                        cs.endMarkedContent();
                }
                cs.close();

                // Rotation matrix a b c d e f: cos sin -sin cos 0 0
                // x = a*x + c*y + e 
                // y = b*x + d*y + f
                int rotation = page.getRotation();
                final float x1 = pageBox.getLowerLeftX(),  y1 = 
pageBox.getLowerLeftY();
                final float x2 = pageBox.getUpperRightX(), y2 = 
pageBox.getUpperRightY();
                final float s = pageBox.getWidth() / ((rotation % 180 == 0)  ? 
formBox.getWidth() : formBox.getHeight());
                Matrix m = null;
                switch(rotation) {
                        case 0:   m = new Matrix( s,  0,  0,  s, x1, y1); break;
                        case 180: m = new Matrix(-s,  0,  0, -s, x2, y2); break; // translate to the upper-right corner (x2, y2)
                        case 90:  m = new Matrix(0,   s, -s,  0, x2, y1); break;
                        case 270: m = new Matrix(0,  -s,  s,  0, x1, y2); break;
                }
                cs = new PDPageContentStream(_doc, page, AppendMode.APPEND, 
true, true);
                cs.transform(m);
                cs.drawForm(form);
                cs.close();
        }


        private COSName addGlyphless(PDResources target) throws IOException
        {
                if ( _glyphless!=null ) return target.add(_glyphless);
                try (
                        InputStream in = 
PageMerger.class.getResourceAsStream("glyphless.pdf");
                        PDDocument template = PDDocument.load(in)
                ) {
                        PDResources source = template.getPage(0).getResources();
                        PDFont font = cloneFont(source, 
source.getFontNames().iterator().next());
                        COSName name = target.add(font);
                        _glyphless = target.getFont(name);
                        return name;
                }
        }

        private PDFont cloneFont(PDResources source, COSName name) throws 
IOException
        {
                PDFont f1 = source.getFont(name);
                PDFCloneUtility c = new PDFCloneUtility(_doc);
                return new 
PDType0Font((COSDictionary)c.cloneForNewDocument(f1.getCOSObject()));
        }

        private static PDRectangle rectangle(Rectangle bbox)
        {
                return new PDRectangle(bbox.x, bbox.y, bbox.width, bbox.height);
        }

        @SuppressWarnings("deprecation")
        private static class PDPageContentStreamAdapter extends OutputStream
        {
                private final PDPageContentStream stream;

                PDPageContentStreamAdapter(PDPageContentStream stream) {
                        this.stream = stream;
                }
                
                @Override public void write(int b) throws IOException {
                        stream.appendRawCommands(b);
                }
                
                @Override public void write(byte[] b) throws IOException {
                        stream.appendRawCommands(b);
                }
                
                @Override public void write(byte[] b, int off, int len) throws 
IOException {
                        stream.appendRawCommands(off==0 && len==b.length ? b : 
Arrays.copyOfRange(b, off, off + len));
                }
        }
}
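
The placement matrices at the end of addHocrText can be checked without PDFBox. 
The sketch below uses java.awt.geom.AffineTransform, whose six-argument 
constructor takes its values in the same a b c d e f order as a PDF matrix. It 
follows the switch in the code, with the 180 degree case translated to the 
upper-right corner (x2, y2); the class and method names are made up for 
illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper: form-to-page matrix for a page rotation in 90 degree steps. */
final class FormPlacement {

    /** (x1, y1) is the lower-left and (x2, y2) the upper-right corner of the crop box. */
    static AffineTransform placement(int rotation, double s,
                                     double x1, double y1, double x2, double y2) {
        switch (rotation) {
            case 0:   return new AffineTransform( s,  0,  0,  s, x1, y1);
            case 90:  return new AffineTransform( 0,  s, -s,  0, x2, y1);
            case 180: return new AffineTransform(-s,  0,  0, -s, x2, y2);
            case 270: return new AffineTransform( 0, -s,  s,  0, x1, y2);
            default:
                throw new IllegalArgumentException("unsupported rotation: " + rotation);
        }
    }

    /** Maps a point from form space to page space. */
    static Point2D map(AffineTransform t, double x, double y) {
        return t.transform(new Point2D.Double(x, y), null);
    }
}
```

Mapping the four corners of the form box and checking that they land on the 
corners of the crop box is a quick way to catch a wrong translation value.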


-----Original Message-----
From: Constantine Dokolas <cdoko...@gmail.com> 
Sent: Friday, 26 March 2021 11:38
To: users@pdfbox.apache.org
Subject: Re: Empty cmap in TTF Files.

Hi, Gunnar,

Do you think this SO question
<https://stackoverflow.com/questions/49363954/using-arialmt-for-arabic-text-without-embedding-font-with-pdfbox>
is related? I'm the OP and the (admittedly somewhat niche) case for no-glyph 
(i.e. non-renderable) chars on a PDF is a "capability" that's been missing for 
me.

To give some context, at work I'm responsible for a library that, among other 
things, overlays OCRed text (from diverse sources) on images placed in PDF 
pages. There have been issues I've overcome (especially concerning Unicode), 
but "glyphless font" embedding is something that would really make a noticeable 
impact on PDF size. Most OCR software that produces PDFs from images does this 
in some way, Tesseract included.

I think PDFBox is a great library for reading and generating PDFs, and I'm 
seriously considering contributing as soon as possible. A big thanks to 
everyone working to make this project successful.

C.D.
--
There is a computer disease that anybody who works with computers knows about. 
It's a very serious disease and it interferes completely with the work. The 
trouble with computers is that you 'play' with them!
- Richard P. Feynman


On Thu, Mar 25, 2021 at 2:30 PM Gunnar Brand < 
gunnar.br...@interface-projects.de> wrote:

> Hi.
>
> The process is as follows:
> 1) For images: use the image
>     For PDFs: render each page to 300 dpi (since optimized PDFs don't 
> necessarily have a single big image), maybe even with text if text 
> extraction returned gibberish (missing unicode mapping).
> 2) Use tesseract to OCR image/page with PDF and HOCR output. (for pages:
> create an imageless PDF). The HOCR is used for additional page layout 
> information and word confidence values.
> 3) For images, use the HOCR to filter the PDF text stream and add 
> layout information
>     For PDFs, insert the tesseract PDF text stream into the original 
> PDF's page (+add that glyphless font), use the HOCR to filter and add 
> layout information.
>
> For step 3, I would like to use a normal PDPageContentStream to add 
> the content instead of working with a raw stream. But that step fails 
> since I cannot use the showText() method with a Font that has an empty cmap.
>
> I attached an empty tesseract PDF with the glyphless font. Appending 
> text using the font to the single page in there will fail immediately 
> with the exception due to the empty cmap. Adding the font to any other 
> PDF and trying to show text using it will fail as well.
>
> I can probably get away with just creating/transferring the Tj commands 
> raw, but I was wondering if the empty cmap behaviour is ok or would it 
> be better to ignore empty cmaps (i.e. look for a non empty one first 
> and return null if none can be found in TrueTypeFont.getUnicodeCmapImpl).
>
> Gunnar
>
>
>
> -----Original Message-----
> From: Tilman Hausherr <thaush...@t-online.de>
> Sent: Thursday, 25 March 2021 04:37
> To: users@pdfbox.apache.org
> Subject: Re: Empty cmap in TTF Files.
>
> On 24.03.2021 at 14:40, Gunnar Brand wrote:
> > Hi.
> >
> > I am working on merging original PDFs and the PDF/HOCR output of
> > Tesseract, as to create a searchable PDF. Transplanting the glyphless
> > font used by tesseract was no problem, it doesn’t matter if I simply
> > use the font in the original PDF or use cloneutil, when saving the
> > file the font is embedded properly.
> >
> > The problem is when I show text using a content stream, I get a “No
> > Glyph for …” exception. I traced this down to the glyphless font
> > containing empty cmap tables. There is a CIDToGIDMap. Coincidentally
> > PDFBOX-5103 just addressed this issue with a reverse mapping if the
> > cmap is null. But the cmap is just empty and will return 0 for any
> > character code, so this new feature will never work in this case.
> >
> > For testing I modified TrueTypeFont.getUnicodeCmapImpl(isStrict) so
> > that it ignores empty cmap subtables (even the fallback at the end of
> > the method now being a loop). With this PDFBox will happily use the
> > tesseract glyphless font. Now I lack the knowledge if empty cmaps make
> > any sense at all and if they do I will simply write raw show text
> > commands, but maybe it is something to consider?
> >
> > Gunnar
>
> I tried tesseract some time ago and it generates searchable PDFs out 
> of the box, why not use that?
>
> Can you upload one of your files to a sharehoster so that I understand 
> what this is about?
>
> Tilman
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> For additional commands, e-mail: users-h...@pdfbox.apache.org