Re: using tesseract hocr output to create a searchable PDF

Carlos Sat, 03 Dec 2011 00:14:43 -0800

zdenko,

Thanks for the reply.


> You did not specify a language, but in case of python

I am pretty agnostic about language as long as it can run via the CLI
on Linux - the OCR process runs on the backend.

In case anyone else runs across this:

I am an OCR noob so the past few days have been pretty enlightening.
I have run across a number of other options to marry hOCR w/ an image
to generate searchable PDFs.  Unfortunately, hocr2pdf is one of the
most prominent ones.  It shows up pretty high on a lot of searches and
is mentioned in various forums/blogs etc.  I have found that hocr2pdf
generates fairly unusable searchable PDFs - the searchable text is
interleaved and really out of position.

Luckily, there are a number of other options in various languages.
The first OSS tool I have found that generates very usable searchable
PDFs from tesseract hOCR files is pdfbeads - a Ruby gem.  It has
worked well with a diverse sample of documents.

At this time my primary concern with pdfbeads is that it is a pretty
niche library and it encapsulates all of the logic to generate the PDF
file.  pdfbeads doesn't rely on other more heavily used/vetted/current
PDF generation libs to generate the PDF.  It would have been a little
more comforting if pdfbeads concentrated on parsing the hOCR files and
adding the text layer via another lib ... assuming that is possible.

If pdfbeads holds up in further testing I suspect that we are going to
slot it into our OCR process.
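
For anyone curious what the "parse the hOCR, leave the PDF writing to
another lib" split mentioned above could look like, here is a minimal
Python sketch of the parsing half only.  It is not how pdfbeads works.
It assumes words are marked with span elements of class ocrx_word or
ocr_word whose title attribute carries "bbox x0 y0 x1 y1" in pixels
(check your own tesseract output), and the names HocrWordParser,
parse_hocr_words and bbox_to_pdf_points are made up for illustration.
The last function converts a pixel box (origin top-left) into PDF
points (origin bottom-left, 72 points per inch), which is the form a
PDF library would typically want for placing the text layer.

```python
import re
from html.parser import HTMLParser

# Assumption: word spans use one of these class names.
WORD_CLASSES = {"ocrx_word", "ocr_word"}
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each hOCR word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []
        self._depth = 0    # span nesting depth inside the open word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            # A span nested inside a word span: just track the depth.
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if match and WORD_CLASSES.intersection(classes):
            self._bbox = tuple(int(v) for v in match.groups())
            self._text = []
            self._depth = 1

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def bbox_to_pdf_points(bbox, page_height_px, dpi):
    """Convert a pixel bbox (top-left origin) to PDF points (bottom-left origin)."""
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi
    return (x0 * scale, (page_height_px - y1) * scale,
            x1 * scale, (page_height_px - y0) * scale)
```

The page height in pixels and the scan DPI have to come from the image
itself; getting either one wrong shifts or scales the whole text layer.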

Carlos
