Hello everyone, I have a digital copy of a book I own that was delivered to me in what might be the most inconvenient of formats - one PDF per page, with all non-image data on the page - text included - converted to vector shapes. I can recombine the pages and add bookmarks, page numbers, etc. with jPDFTweak, but that still leaves me unable to search the book, since none of the text is stored as actual text.
I thought I would use Tesseract, but I can't seem to find the latest Windows binaries or determine whether or not there's some workflow for doing OCR on a PDF, then mixing the hOCR output back into the same PDF without having to convert the image to a TIFF first. I'd like to not have to convert the PDFs into TIFFs and merge the TIFF into a PDF, as this would cause the vector shapes to get converted to a raster format. Can anyone provide some insight on how to do this without pulling my hair out?

