Using Tesseract hOCR output to create a searchable PDF

Carlos Tue, 29 Nov 2011 19:12:52 -0800

Tesseract 3.01
hocr2pdf 0.8.5

My project has been using Tesseract to OCR documents for some time and
we are really happy with the results.


We were recently asked to offer the documents in our system as
searchable PDFs.

My initial attempt was to create a searchable PDF by running the hOCR
output generated by Tesseract through hocr2pdf
(http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/).

The placement of the text in the resulting PDF has some strange
quirks: words overlaying one another, words with oversized fonts,
strange line breaks, etc. The problems are severe enough that our
current results are not usable as a solution.

I don't know very much about the hOCR format; however, the overlaid
words don't seem to be caused by Tesseract's hOCR output. I have
verified a number of times that words overlaid in the searchable PDF
have bbox coordinates in the hOCR file that do not overlap at all.
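
For reference, the check described above can be scripted rather than done
by eye. This is only a minimal sketch in Python (standard library only).
It assumes each word is a span with class ocr_word or ocrx_word whose
title attribute carries "bbox x0 y0 x1 y1". The names WordBoxParser and
overlapping_words are made up for illustration and are not part of
Tesseract or hocr2pdf.

```python
import re
from html.parser import HTMLParser
from itertools import combinations

# Assumption: word boxes appear in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class WordBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each hOCR word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._stack = []  # one entry per open <span>: [bbox, text] or None

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        m = BBOX_RE.search(a.get("title") or "")
        if m and classes & WORD_CLASSES:
            self._stack.append([tuple(int(v) for v in m.groups()), ""])
        else:
            self._stack.append(None)

    def handle_data(self, data):
        # Text inside nested spans belongs to every enclosing word span.
        for entry in self._stack:
            if entry is not None:
                entry[1] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            entry = self._stack.pop()
            if entry is not None:
                self.words.append((entry[1].strip(), entry[0]))


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_words(hocr_text):
    """Return pairs of word texts whose bounding boxes overlap."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return [
        (w1, w2)
        for (w1, b1), (w2, b2) in combinations(parser.words, 2)
        if overlaps(b1, b2)
    ]
```

Calling overlapping_words() on the contents of the hOCR file and getting
an empty list back would support the observation that the overlap is
introduced after Tesseract, not by it.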

- Does anyone have experience generating searchable PDFs from
Tesseract output?
- Does anyone know of a simple way to visually inspect the placement
of the words specified by the hOCR output, for instance by creating a
TIFF from it? I would like to confirm that the Tesseract hOCR output
is positioning the words correctly.
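
One possible way to do the visual inspection, sketched here in Python
with the standard library only: draw each word's bbox as a rectangle in
an SVG file and open that in a browser. SVG is used instead of TIFF only
because it needs no imaging library. The sketch assumes word spans have
class ocr_word or ocrx_word and a title containing "bbox x0 y0 x1 y1".
BoxCollector, hocr_to_svg and the image_href parameter are illustrative
names, not an existing tool.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr

# Assumption: word boxes appear in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
WORD_CLASSES = {"ocr_word", "ocrx_word"}


class BoxCollector(HTMLParser):
    """Collect the (x0, y0, x1, y1) box of every hOCR word element."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = set((a.get("class") or "").split())
        m = BBOX_RE.search(a.get("title") or "")
        if m and classes & WORD_CLASSES:
            self.boxes.append(tuple(int(v) for v in m.groups()))


def hocr_to_svg(hocr_text, image_href=None):
    """Return an SVG document with one red rectangle per word box.

    image_href is optional: a path or URL to the page image in a format
    the viewer can display (for example PNG), drawn underneath the boxes.
    """
    collector = BoxCollector()
    collector.feed(hocr_text)
    collector.close()
    # Canvas size is taken from the largest coordinates seen.
    width = max((b[2] for b in collector.boxes), default=0)
    height = max((b[3] for b in collector.boxes), default=0)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    if image_href:
        parts.append(
            f'<image href={quoteattr(image_href)} '
            f'width="{width}" height="{height}"/>'
        )
    for x0, y0, x1, y1 in collector.boxes:
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" '
            f'height="{y1 - y0}" fill="none" stroke="red"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Writing the returned string to a file such as page.svg and opening it
shows where the hOCR places each word. If the rectangles line up with
the words on the page image, the hOCR output is probably fine and the
problem is more likely in the PDF generation step.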

Sorry if this issue doesn't relate exclusively to Tesseract. At
this point I am not certain what the cause of the problem is.

Carlos

