If the PDF embedded vector graphics the same way it embeds rasters, this would become somewhat practical. For example, if the vector graphics were an embedded SVG, we'd pull that out (similar to how pdfimages from poppler can pull out embedded rasters). Then we'd teach Leptonica to read the SVG into a pix, which is a rasterization operation. At PDF generation time we'd throw out the pix and use the original SVG instead. But I don't think it works that way at all. I think the vector graphics commands are likely to be direct PDF primitives and therefore way, way, way too hard to play with in this fashion.
I think a more likely approach (but still very unlikely!) is using a modified Tesseract to create a new PDF containing an invisible text layer and nothing else. Then hope someone has written a general-purpose "composite two PDF files on top of each other" program and use it to merge. Such a merging program would be pretty difficult to write, and it is hard for me to imagine why it would exist.
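To make the "invisible text layer and nothing else" idea a bit more concrete, here is a rough Python sketch of the page content such a PDF would carry. It relies on one fact from the PDF format: text rendering mode 3 (the `3 Tr` operator) draws text with neither fill nor stroke, so it is selectable but not visible. Everything else is an assumption for illustration: the function names, the `(text, x, y, size)` word tuples, and the `/F1` font resource name are hypothetical, and this is not how Tesseract itself is organized. It only builds the operator string; wrapping it in a complete PDF file and compositing it over the original are separate problems.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, font="/F1"):
    """Build content-stream operators that place each word invisibly.

    words: iterable of (text, x, y, size) tuples in PDF points, with the
    origin at the bottom-left of the page (a hypothetical OCR output shape).
    font: name of a font resource assumed to exist on the page.
    """
    lines = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible
    for text, x, y, size in words:
        lines.append(f"{font} {size} Tf")    # select font and size
        lines.append(f"1 0 0 1 {x} {y} Tm")  # absolute position for this word
        lines.append(f"({escape_pdf_string(text)}) Tj")  # show the string
    lines.append("ET")  # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(invisible_text_ops([("Hello", 72, 700, 12), ("(world)", 110, 700, 12)]))
```

Note that the sketch only handles text a simple font can encode; real OCR output in other scripts would need a proper font and encoding setup.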

