zdenko,

Thanks for the reply.

> You did not specified language but in case of python

I am pretty agnostic about language as long as it can run via the CLI on Linux - the OCR process is on the backend.

In case anyone else runs across this: I am an OCR noob, so the past few days have been pretty enlightening. I have run across a number of options to marry hOCR with an image to generate searchable PDFs. Unfortunately, hocr2pdf is one of the most prominent ones. It shows up pretty high in a lot of searches and is mentioned in various forums, blogs, etc. I have found that hocr2pdf generates fairly unusable searchable PDFs - the searchable text is interleaved and really out of position.

Luckily, there are a number of other options in various languages. The first OSS tool I have found that generates very usable searchable PDFs from tesseract hOCR files is pdfbeads, a Ruby gem. It has worked well with a diverse sample of documents.

At this time my primary concern with pdfbeads is that it is a pretty niche library and it encapsulates all of the logic to generate the PDF file. pdfbeads doesn't rely on other, more heavily used/vetted/current PDF generation libs to generate the PDF. It would have been a little more comforting if pdfbeads concentrated on parsing the hOCR files and added the text layer via another lib ... assuming that is possible.

If this holds up I suspect that we are going to slot this into our OCR process.

Carlos
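
For anyone wondering what the "just parse the hOCR" half of that split might look like, here is a rough sketch in Python using only the standard library. It is not how pdfbeads does it. It assumes the hOCR marks each word with a span of class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1", which is what tesseract's hOCR output looks like as far as I can tell. The names HocrWordParser and parse_hocr_words are made up for illustration, and placing the words as an invisible text layer would still be left to whatever PDF lib you trust.

```python
from html.parser import HTMLParser
import re


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans in an hOCR document."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1)) in pixel coordinates
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []    # text fragments seen inside the open word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Assumption: the title attribute looks like "bbox 10 10 60 40; x_wconf 95"
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Each bbox is in image pixels with the origin at the top left, so a PDF lib that measures from the bottom left would need the y values flipped and scaled by the page DPI before drawing the text.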

