Hello,

I want to OCR an image with a colored background. I have seen that tesseract produces bad results in that case. As a workaround, I want to convert my image to black and white and run the OCR on that image to produce an hOCR file. After that I want to combine the hOCR file and the original image (with the colored background) to get a searchable PDF. To convert the hOCR file I use hocr2pdf, but I get bad results.

The black-and-white image is "incas1_modif.tif". The resulting hOCR file is "incas.hocr". I wanted to merge the hOCR file with another image, not the "incas2_modif.tif". The result of the merge was poor, so I tried to create a PDF from the hOCR file that contains only the text and no image. I created it with:
hocr2pdf -i incas1_modif.tif -s -o incas_test.pdf < incas.hocr

The resulting PDF "incas_test.pdf" is very strange: some text is overlaid, sometimes the font is much bigger than the font in the original picture, and some text has disappeared. I found this thread <https://groups.google.com/forum/#!topic/tesseract-ocr/tdfaEfDnPPY> and I assume that the result of hocr2pdf is sometimes bad.

So my question is: how can I produce a PDF from an hOCR file? So far I have had no success. Or: do you have another idea for getting good results from a colored image with tesseract?

I'm using tesseract 3.03. I didn't attach the file "incas2_modif.tif" because it was too big.

Thank you!
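For reference, the workflow I am aiming for could be sketched roughly as below. This is only a sketch under assumptions, not a tested solution: the ImageMagick "convert" step, the 50% threshold, the function name ocr_to_pdf, the DRY_RUN switch and the "_bw.tif" file name are placeholders of mine, and I am assuming that tesseract 3.03 writes its hOCR output to <base>.hocr. As far as I understand, the bounding boxes in the hOCR file are pixel coordinates of the image that was OCRed, so the color image given to hocr2pdf presumably needs the same pixel dimensions as the black-and-white one.

```sh
#!/bin/sh
# Sketch only: binarize, OCR to hOCR, then merge the hOCR with the color image.
# Assumes ImageMagick ("convert"), tesseract and hocr2pdf are on the PATH.
# Set DRY_RUN=1 to print the commands instead of running them.

ocr_to_pdf() {
    color_image=$1            # original image with the colored background
    base=$2                   # base name for the generated files (placeholder)
    bw_image="${base}_bw.tif" # hypothetical name for the black-and-white copy
    hocr_file="${base}.hocr"  # assumption: tesseract 3.03 writes <base>.hocr

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "convert $color_image -colorspace Gray -threshold 50% $bw_image"
        echo "tesseract $bw_image $base hocr"
        echo "hocr2pdf -i $color_image -o $base.pdf < $hocr_file"
        return 0
    fi

    # 1. Black-and-white copy for OCR; 50% is an arbitrary starting threshold.
    convert "$color_image" -colorspace Gray -threshold 50% "$bw_image" || return 1
    # 2. OCR the black-and-white copy; the "hocr" config selects hOCR output.
    tesseract "$bw_image" "$base" hocr || return 1
    # 3. Combine the recognized text with the original color image.
    hocr2pdf -i "$color_image" -o "$base.pdf" < "$hocr_file"
}
```

With my files this would be called as: ocr_to_pdf incas2_modif.tif incas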
incas.hocr
Description: Binary data
incas_test.pdf
Description: Adobe PDF document

