Thanks for the analysis and feedback, Jeff. Unfortunately, I don't know much about QPDF (and SourceForge's storage problems are preventing me from learning any more), but doing #3 externally using a tool like QPDF, perhaps in conjunction with doing #1 in Tesseract itself, sounds like a reasonable option.

Tom

On Sat, Jul 18, 2015 at 12:11 AM, Jeff Breidenbach <[email protected]> wrote:

> JBIG2 is a multipage image format, but it differs from, for example,
> multipage TIFF because the images are not independently compressed.
> They share compression data, specifically a symbol dictionary.
>
> There are three possible approaches here:
>
> 1. Have Tesseract accept JBIG2 images produced by jbig2enc and embed
>    them into PDF without modification.
>
> 2. Have Tesseract actually do JBIG2 compression.
>
> 3. Have Tesseract do image segmentation, compress some parts of the
>    page as JBIG2, other parts as JP2K, and store the results in PDF in
>    a mixed raster format.
>
> I'm only going to discuss #1 because it is simplest and matches the
> current 'try to never transcode' philosophy. We'd need a JBIG2 decoder
> in Leptonica. That's probably straightforward but still a very solid
> chunk of work.
>
> Then, there is what to do in Tesseract. The PDF rendering module would
> need to learn about the symbol dictionary (or dictionaries) and add it
> to the collection of PDF objects. It will need a much better
> understanding of what's going on than what we currently use, which is
> simply 'Hey, what image file belongs to this page? Let's try to inline
> it unchanged.'
>
> https://github.com/tesseract-ocr/tesseract/blob/master/api/pdfrenderer.cpp#L811
>
> Now the good news is the PDF rendering module is really small and is
> not cemented down by a whole bunch of unnecessary abstraction layers.
> And I know it's possible because I've personally done it with
> colleagues elsewhere.
>
> But it is a pretty significant effort, and I'm honestly not sure it's
> worth putting inside Tesseract. Maybe a better approach is
> post-processing, with a PDF to PDF converter that uses approach #3.
> This is the winning strategy for linearization, which can be done on a
> Tesseract-produced PDF using QPDF.
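
As a rough illustration of the quoted point that JBIG2 pages share a symbol dictionary rather than being compressed independently, here is a toy Python sketch. It is not JBIG2 and does not reflect how jbig2enc, Leptonica, or Tesseract work; the sample "pages" and every function name are made up for illustration. It only counts dictionary entries to show why one shared dictionary is smaller than one dictionary per page, and why a page can then no longer be decoded without that shared object.

```python
# Toy model only: each "page" is a list of glyphs, with characters standing in
# for glyph bitmaps. Real JBIG2 symbol coding is far more involved.

def per_page_dictionaries(pages):
    """Build one symbol dictionary per page (independent compression)."""
    return [sorted(set(page)) for page in pages]

def shared_dictionary(pages):
    """Build a single symbol dictionary used by every page."""
    return sorted({glyph for page in pages for glyph in page})

def encode(page, dictionary):
    """Replace each glyph with its index in the given dictionary."""
    index = {glyph: i for i, glyph in enumerate(dictionary)}
    return [index[glyph] for glyph in page]

def decode(indices, dictionary):
    """Recover the glyphs; this requires the same dictionary used to encode."""
    return [dictionary[i] for i in indices]

# Hypothetical three-page document.
pages = [list("tesseract"), list("leptonica"), list("pdf renderer")]

independent = per_page_dictionaries(pages)
shared = shared_dictionary(pages)

# Total number of stored symbols under each scheme.
independent_size = sum(len(d) for d in independent)
shared_size = len(shared)

# Every page is encoded against the one shared dictionary.
encoded = [encode(page, shared) for page in pages]
roundtrip = [decode(indices, shared) for indices in encoded]

print(f"symbols stored, per-page dictionaries: {independent_size}")
print(f"symbols stored, shared dictionary:     {shared_size}")
```

In this toy model the shared dictionary plays the role of the extra object that a PDF writer would have to emit alongside the page images, which is why simply inlining each page's image file unchanged is not enough.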

