Thanks for the analysis and feedback, Jeff.

Unfortunately, I don't know much about QPDF (and SourceForge's storage
problems are preventing me from learning any more), but doing #3 externally
using a tool like QPDF, perhaps in conjunction with doing #1 in Tesseract
itself, sounds like a reasonable option.

Tom

On Sat, Jul 18, 2015 at 12:11 AM, Jeff Breidenbach <[email protected]>
wrote:

> JBIG2 is a multipage image format, but it differs from, for example,
> multipage TIFF because the images are not independently compressed. They
> share compression data, specifically a symbol dictionary.
>
> There are three possible approaches here:
>
> 1. Have Tesseract accept JBIG2 images produced by jbig2enc and embed them
> into PDF without modification.
>
> 2. Have Tesseract actually do JBIG2 compression.
>
> 3. Have Tesseract do image segmentation, compress some parts of the page
> as JBIG2, other parts as JP2K, and store the results in PDF in a mixed
> raster format.
>
> I'm only going to discuss #1 because it is simplest and matches the current
> 'try to never transcode' philosophy. We'd need a JBIG2 decoder in
> Leptonica. That's probably straightforward but still a very solid chunk of
> work.
>
> Then, there is what to do in Tesseract. The PDF rendering module would
> need to learn about the symbol dictionary (or dictionaries) and add it to
> the collection of PDF objects. It will need a much better understanding of
> what's going on than what we currently use, which is simply 'Hey, what
> image file belongs to this page? Let's try to inline it unchanged.'
>
>
> https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L811
>
> Now the good news is the PDF rendering module is really small and is not
> cemented down by a whole bunch of unnecessary abstraction layers. And I
> know it's possible because I've personally done it with colleagues
> elsewhere.
>
> But it is a pretty significant effort, and I'm honestly not sure it's
> worth putting inside Tesseract. Maybe a better approach is post-processing,
> with a PDF-to-PDF converter that uses approach #3. This is the winning
> strategy for linearization, which can be done on a Tesseract-produced PDF
> using QPDF.
>
>  --
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/NPfR1_ZkoTA/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/0f9e6702-a759-4053-b9be-42bc96c1d547%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/0f9e6702-a759-4053-b9be-42bc96c1d547%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
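
For reference, the linearization step mentioned in the quoted message could
look roughly like the following. This is only a minimal sketch, assuming the
qpdf command-line tool is installed and on PATH; the helper names
(linearize_command, linearize) and the file names are made up for
illustration and are not part of Tesseract or QPDF.

```python
import shutil
import subprocess


def linearize_command(src, dst, qpdf="qpdf"):
    """Build the argument list for `qpdf --linearize SRC DST`."""
    return [qpdf, "--linearize", str(src), str(dst)]


def linearize(src, dst, qpdf="qpdf"):
    """Write a linearized copy of src to dst using the qpdf CLI."""
    if shutil.which(qpdf) is None:
        raise FileNotFoundError(f"{qpdf} not found on PATH")
    result = subprocess.run(linearize_command(src, dst, qpdf))
    # qpdf exits with 0 on success and 3 when it succeeded with warnings;
    # anything else is treated as a failure here.
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed with exit status {result.returncode}")
    return dst


# Hypothetical usage on a PDF previously written by Tesseract:
#   linearize("ocr.pdf", "ocr-linearized.pdf")
```

Doing this as a separate pass leaves the Tesseract-produced PDF untouched and
writes the linearized copy to a new file.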
