[tesseract-ocr] Suggestions on running PDFs through Tesseract without losing vector graphics?

[email protected] Sun, 30 Aug 2015 11:04:49 -0700

Hello everyone,

I have a digital copy of a book I own that was delivered in what might be 
the most inconvenient of formats: one PDF per page, with all non-image 
content on the page, text included, converted to vector shapes. I can 
recombine the pages and add bookmarks, page numbers, etc. with jPDFTweak, 
but the book still cannot be searched, because none of the text exists as 
actual text.

I thought I would use Tesseract, but I can't find the latest Windows 
binaries, and I can't determine whether there is a workflow for running 
OCR on a PDF and merging the hOCR output back into that same PDF. I would 
like to avoid converting the PDFs to TIFFs and building a new PDF from the 
TIFFs, as that would rasterize the vector shapes.
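
To make the kind of workflow I mean concrete, here is a rough Python 
sketch. It is only a sketch under assumptions, not something I know to 
work: it assumes Poppler's pdftoppm, qpdf, and a Tesseract build that 
supports the textonly_pdf variable and the --dpi option are all installed, 
and the function names and file-name suffixes are made up for 
illustration. The idea is to rasterize a temporary copy of the page for 
OCR only, have Tesseract write a PDF containing nothing but invisible 
text, and overlay that onto the untouched vector original.

```python
from pathlib import Path
import subprocess


def build_commands(page_pdf, work_dir, dpi=300):
    """Return the three command lines for one single-page PDF.

    Hypothetical helper: the "-raster", "-text" and "-searchable"
    suffixes are illustrative choices, not a convention of any tool.
    """
    page = Path(page_pdf)
    work = Path(work_dir)
    image_base = work / f"{page.stem}-raster"
    text_base = work / f"{page.stem}-text"
    output = work / f"{page.stem}-searchable.pdf"
    return [
        # 1. Throwaway raster copy, used only as OCR input.
        ["pdftoppm", "-r", str(dpi), "-png", "-singlefile",
         str(page), str(image_base)],
        # 2. OCR to a PDF that holds only the invisible text layer
        #    (assumes this Tesseract build knows textonly_pdf and --dpi).
        ["tesseract", f"{image_base}.png", str(text_base),
         "--dpi", str(dpi), "-c", "textonly_pdf=1", "pdf"],
        # 3. Lay the text-only page over the original vector page.
        ["qpdf", str(page), "--overlay", f"{text_base}.pdf", "--",
         str(output)],
    ]


def make_searchable(page_pdf, work_dir, dpi=300):
    """Run the commands in order, stopping at the first failure."""
    for command in build_commands(page_pdf, work_dir, dpi):
        subprocess.run(command, check=True)
```

Calling make_searchable("page-001.pdf", "out") would then be repeated for 
each page before recombining them. Whether the text layer lines up with 
the vector shapes depends on the rasterized page and the original having 
the same page size, which I have not verified.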

Can anyone provide some insight on how to do this without pulling my hair 
out?

