I don't know where all this complexity came from. PDF rasterizers have existed since the format was invented. Ghostscript is one popular open source option. It can be used either directly or through a tool that calls it, such as ImageMagick.
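As a rough illustration of the "use Ghostscript directly" route, here is a minimal Python sketch that builds and runs a rasterization command. The function names, the output file pattern, and the 300 dpi default are my own assumptions for illustration; the switches are standard Ghostscript options, and `gs` is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def ghostscript_raster_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders every page to a 24-bit PNG.

    The helper name and the page-%03d.png pattern are illustrative choices.
    """
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs",
        "-dSAFER",            # restrict file access while interpreting the PDF
        "-dBATCH",            # exit when the input has been processed
        "-dNOPAUSE",          # do not prompt between pages
        "-sDEVICE=png16m",    # 24-bit RGB PNG output device
        f"-r{dpi}",           # rendering resolution in dots per inch
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def rasterize(pdf_path, out_dir, dpi=300):
    """Run Ghostscript and return the page images it wrote, in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_raster_command(pdf_path, out_dir, dpi), check=True)
    return sorted(Path(out_dir).glob("page-*.png"))
```

Calling `rasterize("scan.pdf", "pages")` would then leave one PNG per page in `pages/`, ready to be handed to the OCR step.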
Tools like Apache PDFBox can be used to add the hidden text layer (e.g. as an optional content group) back into the PDF. You could write a custom program that calls the APIs of the various components, or you could just use shell scripts to string together a series of commands. I'm sure there are a lot of fiddly little details to get it to work, such as having the text properly registered, but I'm pretty sure it's possible to do.

Tom

On Thu, Sep 10, 2015 at 8:20 PM, Jeff Breidenbach <[email protected]> wrote:

> If the PDF embedded vector graphics similar to how it does rasters, then
> this becomes somewhat practical. For example, if the vector graphics were
> an embedded SVG then we'd pull that out (similar to how pdfimages from
> poppler can pull out embedded rasters.) Then we'd teach Leptonica to
> read the SVG into a pix, which is a rasterization operation. At PDF
> generation time we'd throw out the pix and instead use the original SVG.
> But I don't think it works that way at all. I think the vector graphics
> commands are likely to be direct PDF primitives and therefore way, way,
> way too hard to play with in this fashion.
>
> I think a more likely approach (but still very unlikely!) is using a
> modified Tesseract to create a new PDF containing invisible text
> layer and nothing else. Then hope someone has written a general
> purpose "composite two PDF files on top of each other" program
> and use it to merge. Such a merging program would be pretty difficult
> to write, and it is hard for me to imagine why it would exist.
>
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/pA415qJRRkQ/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
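To make the "string together a bunch of commands" idea a bit more concrete, here is a hedged Python sketch of the last two steps for a single page: produce a PDF that holds only the invisible text layer, then composite it over the original. It rests on assumptions that are not from this thread: a Tesseract build that supports the `textonly_pdf` config variable, and qpdf (with its `--overlay` option) standing in as the merging tool instead of PDFBox. The function names are hypothetical.

```python
import subprocess


def text_only_pdf_command(image_path, out_base, lang="eng"):
    """Build a Tesseract command that writes <out_base>.pdf with only invisible text.

    Assumes a Tesseract version that understands the textonly_pdf variable.
    """
    return [
        "tesseract", str(image_path), str(out_base),
        "-l", lang,
        "-c", "textonly_pdf=1",   # omit the page image, keep the text layer
        "pdf",                    # use the built-in PDF output config
    ]


def overlay_command(original_pdf, text_pdf, merged_pdf):
    """Build a qpdf command that lays text_pdf over the pages of original_pdf.

    qpdf is an assumed substitute for the PDFBox-based merge mentioned above.
    """
    return [
        "qpdf", str(original_pdf),
        "--overlay", str(text_pdf), "--",
        str(merged_pdf),
    ]


def add_text_layer(page_image, original_pdf, merged_pdf, work_base="textlayer"):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    subprocess.run(text_only_pdf_command(page_image, work_base), check=True)
    subprocess.run(overlay_command(original_pdf, work_base + ".pdf", merged_pdf), check=True)
```

Whether the text ends up properly registered depends on the text-only page having the same size as the original page, which in turn depends on the resolution recorded in the raster; multi-page handling is left out of this sketch.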

