Tesseract 3.01
hocr2pdf 0.8.5

My project has been using Tesseract to OCR documents for some time and we are really happy with the results.
We have recently been asked to offer the documents in our system as searchable PDFs. My initial attempt has been to create a searchable PDF from the hOCR output generated by Tesseract, using hocr2pdf (http://www.exactcode.de/site/open_source/exactimage/hocr2pdf/).

The placement of the text in the resulting PDF has some strange quirks: words overlaying one another, words with oversized fonts, strange line breaks, etc. The problems are severe enough that our current results are not a viable solution.

I don't know very much about the hOCR format, but the overlaid words don't seem to be caused by Tesseract's hOCR output. I have verified a number of times that words which overlay each other in the searchable PDF have bbox coordinates in the hOCR file that do not overlap at all.

- Does anyone have experience generating searchable PDFs from Tesseract output?
- Does anyone know of a simple way to visually inspect the placement of the words specified by the hOCR output, for instance by creating a TIFF from the hOCR output? I would like to confirm that the Tesseract hOCR output is positioning the words correctly.

Sorry if this issue doesn't relate exclusively to Tesseract; at this point I am not certain what the cause of the problem is.

Carlos
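
For reference, the bbox overlap check described above can be automated along these lines. This is only a minimal sketch, not part of Tesseract or hocr2pdf. It assumes that word-level elements carry the class `ocr_word` or `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`, which may differ between Tesseract versions. The names `WordBoxParser`, `word_boxes`, `overlaps` and `overlapping_pairs` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: each bbox appears in a title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordBoxParser(HTMLParser):
    """Collects (id, (x0, y0, x1, y1)) for each word-level hOCR element."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Assumption: words are marked with class ocr_word or ocrx_word.
        if "ocr_word" not in classes and "ocrx_word" not in classes:
            return
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:  # elements without a bbox in their title are skipped
            box = tuple(int(g) for g in match.groups())
            self.boxes.append((attrs.get("id"), box))


def word_boxes(hocr_text):
    """Return the word boxes found in an hOCR document given as a string."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


def overlaps(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(boxes):
    """Return the id pairs of all word boxes that overlap each other."""
    return [
        (id_a, id_b)
        for i, (id_a, box_a) in enumerate(boxes)
        for id_b, box_b in boxes[i + 1:]
        if overlaps(box_a, box_b)
    ]
```

Reading the hOCR file into a string and calling `overlapping_pairs(word_boxes(text))` returns an empty list when no two word boxes overlap, which would point to the PDF conversion step rather than the hOCR output.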

