[tesseract-ocr] Re: Suggestions on running PDFs through Tesseract without losing vector graphics?

Jeff Breidenbach Thu, 10 Sep 2015 17:21:06 -0700

If PDF embedded vector graphics the same way it embeds rasters, this 
would be somewhat practical. For example, if the vector graphics were 
an embedded SVG, we could pull it out (much as pdfimages from poppler 
pulls out embedded rasters). Then we'd teach Leptonica to read the SVG 
into a pix, which is a rasterization operation. At PDF generation time 
we'd throw out the pix and use the original SVG instead. But I don't 
think it works that way at all. I think the vector graphics commands 
are likely to be direct PDF primitives, and therefore way, way, way 
too hard to play with in this fashion.


I think a more likely approach (but still very unlikely!) is using a 
modified Tesseract to create a new PDF containing an invisible text 
layer and nothing else. Then hope someone has written a general 
purpose "composite two PDF files on top of each other" program 
and use it to merge the two. Such a merging program would be pretty 
difficult to write, and it is hard for me to imagine why it would exist.
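
To make the idea concrete, here is a minimal sketch of what a text-only 
layer and a naive composite could look like at the level of PDF page 
content streams. It is not Tesseract's renderer and not an existing 
merge tool: the function names are hypothetical, the font resource 
/F1 is assumed to be declared in the page's resources, and the text is 
assumed to be plain ASCII. It relies only on standard PDF operators: 
"3 Tr" selects text rendering mode 3 (neither fill nor stroke, so the 
text is invisible but still selectable), and q/Q save and restore the 
graphics state.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_stream(words, font="/F1"):
    """Build a content stream that places words in rendering mode 3 (invisible).

    `words` is an iterable of (text, x, y, size) tuples in PDF points,
    measured from the bottom-left corner of the page. The font name is an
    assumption: it must match an entry in the page's font resources.
    """
    ops = ["BT", "3 Tr"]  # begin text object; mode 3 = no fill, no stroke
    for text, x, y, size in words:
        ops.append(f"{font} {size:g} Tf")            # select font and size
        ops.append(f"1 0 0 1 {x:g} {y:g} Tm")        # position the text
        ops.append(f"({escape_pdf_string(text)}) Tj")  # show the string
    ops.append("ET")  # end text object
    return "\n".join(ops)


def overlay(original_stream: str, text_stream: str) -> str:
    """Naive composite: isolate the original drawing in q/Q, then add the text."""
    return f"q\n{original_stream}\nQ\n{text_stream}"


if __name__ == "__main__":
    layer = invisible_text_stream([("Hello (world)", 72, 700, 12)])
    print(overlay("0 0 100 100 re S", layer))
```

Concatenating streams is the easy part. A real merging program would 
also have to reconcile the two files' resource dictionaries, page 
boxes, rotation, and coordinate systems, which is where most of the 
difficulty lies.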


