[tesseract-ocr] Re: jbig2 encoding in PDF output file

Jeff Breidenbach Fri, 17 Jul 2015 21:12:07 -0700

JBIG2 is a multipage image format, but unlike multipage TIFF, for example,
the images are not independently compressed. They share compression data,
specifically a symbol dictionary.
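
To make the shared-dictionary point concrete, here is a toy sketch in
Python. It is not real JBIG2: strings stand in for glyph bitmaps, and the
function names are made up for illustration. It only shows why a page
cannot be decoded without the dictionary that all pages share.

```python
# Toy model only: real JBIG2 stores glyph bitmaps in entropy-coded segments,
# not Python strings. All names here are hypothetical.

def encode_pages(pages):
    """Encode pages (lists of glyphs) against one shared symbol dictionary."""
    symbols = []   # shared dictionary: one entry per distinct glyph
    index = {}     # glyph -> position in `symbols`
    encoded = []   # per-page lists of references into `symbols`
    for page in pages:
        refs = []
        for glyph in page:
            if glyph not in index:
                index[glyph] = len(symbols)
                symbols.append(glyph)
            refs.append(index[glyph])
        encoded.append(refs)
    return symbols, encoded


def decode_page(symbols, refs):
    """Rebuild one page; this needs the shared dictionary, not just the page."""
    return [symbols[i] for i in refs]
```

With multipage TIFF, by contrast, each page carries everything needed to
decode it.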

There are three possible approaches here:

1. Have Tesseract accept JBIG2 images produced by jbig2enc and embed them
into PDF without modification.

2. Have Tesseract actually do JBIG2 compression.

3. Have Tesseract do image segmentation, compress some parts of the page
as JBIG2, other parts as JP2K, and store the results in PDF in a mixed
raster format.

I'm only going to discuss #1 because it is simplest and matches the current
'try to never transcode' philosophy. We'd need a JBIG2 decoder in Leptonica.
That's probably straightforward but still a very solid chunk of work.

Then, there is what to do in Tesseract. The PDF rendering module would need
to learn about the symbol dictionary (or dictionaries) and add it to the
collection of PDF objects. It will need a much better understanding of
what's going on than what we currently use, which is simply 'Hey, what
image file belongs to this page? Let's try to inline it unchanged.'

https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L811
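
For a rough idea of the extra PDF objects involved, here is a minimal
sketch in Python. It is not the pdfrenderer.cpp code. In PDF, a JBIG2 image
uses the /JBIG2Decode filter, and the shared segments live in a separate
stream that the image points to through /JBIG2Globals in its /DecodeParms.
The function names and object numbers below are hypothetical, and the
sketch assumes the input bytes are already in the embedded form PDF expects
(for example from jbig2enc's PDF mode), which I have not verified here.

```python
def globals_object(obj_num, sym_data):
    """Stream object holding the shared JBIG2 segments (symbol dictionary)."""
    header = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(sym_data))
    return header + sym_data + b"\nendstream\nendobj\n"


def page_image_object(obj_num, globals_num, width, height, page_data):
    """Image XObject for one page, pointing at the shared globals stream."""
    header = (
        b"%d 0 obj\n"
        b"<< /Type /XObject /Subtype /Image\n"
        b"   /Width %d /Height %d\n"
        b"   /ColorSpace /DeviceGray /BitsPerComponent 1\n"
        b"   /Filter /JBIG2Decode\n"
        b"   /DecodeParms << /JBIG2Globals %d 0 R >>\n"
        b"   /Length %d >>\n"
        b"stream\n"
    ) % (obj_num, width, height, globals_num, len(page_data))
    return header + page_data + b"\nendstream\nendobj\n"
```

Every page image would reference the same globals object, so the renderer
has to emit that object once and track its number across pages. A real
writer also has to record each object's byte offset for the xref table.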

Now the good news is the PDF rendering module is really small and is not
cemented down by a whole bunch of unnecessary abstraction layers. And I
know it's possible because I've personally done it with colleagues
elsewhere.

But it is a pretty significant effort, and I'm honestly not sure it's worth
putting inside Tesseract. Maybe a better approach is post-processing, with
a PDF-to-PDF converter that uses approach #3. This is the winning strategy
for linearization, which can be done on a Tesseract-produced PDF using
QPDF.
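
As an illustration of the post-processing route, linearization with QPDF is
a single command, `qpdf --linearize in.pdf out.pdf`. A small Python wrapper
might look like the sketch below. The function names and file names are
placeholders, and it assumes the `qpdf` binary is on the PATH.

```python
import subprocess


def linearize_command(src, dst):
    """Build the qpdf command line that writes a linearized copy of src."""
    return ["qpdf", "--linearize", src, dst]


def linearize(src, dst):
    """Run qpdf; check=True raises CalledProcessError on any nonzero exit."""
    subprocess.run(linearize_command(src, dst), check=True)
```

Usage would be something like `linearize("tesseract-out.pdf", "web.pdf")`
after Tesseract has written its PDF.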
