[tesseract-ocr] convert hocr to pdf

Cédric Tue, 07 Oct 2014 01:26:20 -0700

Hello, I want to OCR an image with a colored background. I have seen that 
Tesseract produces poor results in that case.
As a workaround, I want to convert my image to black and white and run the 
OCR on that image to produce an hOCR file. After that I want to combine the 
hOCR file with the original image (with the colored background) to get a 
searchable PDF.

To convert the hOCR file I use hocr2pdf, but I get bad results. The black 
and white image is "incas1_modif.tif". The resulting hOCR file is 
incas.hocr. I wanted to merge the hOCR file with another image, not the 
"incas2_modif.tif". The result of the merge was poor, so I tried to create 
a PDF from the hOCR file containing only the text and no image. 
I got it with


hocr2pdf -i incas1_modif.tif -s -o incas_test.pdf < incas.hocr
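
For reference, the whole workflow described above (binarize, OCR to hOCR, 
merge with the colored image) could be sketched roughly as follows. This is 
only a sketch under assumptions: the file names are hypothetical, the 
binarization uses ImageMagick's convert with a 50% threshold (my choice 
here, not something prescribed by Tesseract), and the hOCR output is 
assumed to be written as <base>.hocr, as it is for incas.hocr above. The 
hocr2pdf call uses only the -i and -o options and reads the hOCR on stdin, 
as in the command above.

```python
import subprocess


def build_steps(color_image, bw_image, hocr_base, pdf_out):
    """Return (argv, stdin_path) pairs for the three steps; nothing runs here."""
    return [
        # 1. Binarize the colored image (assumption: ImageMagick is installed
        #    and a fixed 50% threshold is good enough for this scan).
        (["convert", color_image, "-colorspace", "Gray",
          "-threshold", "50%", bw_image], None),
        # 2. OCR the binarized image; the trailing "hocr" selects hOCR output.
        (["tesseract", bw_image, hocr_base, "hocr"], None),
        # 3. Merge the hOCR text with the colored image; hocr2pdf reads the
        #    hOCR file on stdin (assumed to be named <hocr_base>.hocr).
        (["hocr2pdf", "-i", color_image, "-o", pdf_out], hocr_base + ".hocr"),
    ]


def run_steps(steps):
    """Run each step in order and stop at the first failing command."""
    for argv, stdin_path in steps:
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path, "rb") as stdin:
                subprocess.run(argv, stdin=stdin, check=True)
```

With the tools installed, run_steps(build_steps("color.tif", "bw.tif", 
"page", "page.pdf")) would run the three commands in sequence. Both images 
need the same pixel dimensions, since the hOCR coordinates refer to the 
image that was OCRed.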

The resulting PDF "incas_test.pdf" is very strange: some text is overlaid, 
sometimes the font is much bigger than the font in the original picture, and
some text has disappeared. I have found this thread 
<https://groups.google.com/forum/#!topic/tesseract-ocr/tdfaEfDnPPY> and I 
assume that the output of hocr2pdf is sometimes bad. 
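
One way to check whether the problem is already in the hOCR file, rather 
than in hocr2pdf, would be to list the word boxes it contains and compare 
them with the image size. The sketch below uses only the Python standard 
library and assumes that words are marked up as spans with class 
"ocrx_word" and a title attribute of the form "bbox x0 y0 x1 y1", as in 
Tesseract's hOCR output. The names word_boxes and out_of_bounds are 
hypothetical helpers, not part of any tool.

```python
import re
from html.parser import HTMLParser

# hOCR stores the pixel box in the title attribute, e.g. "bbox 10 10 90 40".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class WordBoxParser(HTMLParser):
    """Collect (bbox, text) for every span with class "ocrx_word"."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # [bbox, text] of the word span being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._current = [tuple(int(v) for v in match.groups()), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        # Inner tags such as <strong> or <em> do not end the word span.
        if tag == "span" and self._current is not None:
            self.words.append((self._current[0], self._current[1].strip()))
            self._current = None


def word_boxes(hocr_text):
    """Return a list of ((x0, y0, x1, y1), text) tuples from hOCR markup."""
    parser = WordBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


def out_of_bounds(words, width, height):
    """Return the words whose box extends past a width x height pixel image."""
    return [(bbox, text) for bbox, text in words
            if bbox[2] > width or bbox[3] > height]
```

Calling word_boxes on the contents of incas.hocr and passing the result to 
out_of_bounds with the pixel size of the image given to hocr2pdf would show 
whether any boxes fall outside that image, which would suggest that the 
hOCR file and the image do not match.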

So my question is: how can I produce a PDF from an hOCR file? So far I 
have had no success. Or: do you have another idea for getting good results
from a colored image with Tesseract?

I'm using tesseract 3.03. I didn't attach the file "incas2_modif.tif" 
because it was too big. 

Thank you!



incas.hocr
Description: Binary data

incas_test.pdf
Description: Adobe PDF document
