Hello, I'm trying to use tesseract to convert scientific articles stored in image files to HTML. The text recognition works as it should, but for some reason the formatting is not preserved (I'm mainly interested in bold text and section headers).
Image file: https://docs.google.com/file/d/0BwXwnqv_LzoMWGQzczh0bWNZRHM/edit?usp=sharing
HTML output: https://docs.google.com/file/d/0BwXwnqv_LzoMVGdzQjBWand6a0k/edit?usp=sharing

Since the text seems really clear, I'm guessing it's just a matter of changing configuration options, but I didn't have much luck working out which ones from the list at http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version .

I'm calling tesseract like this:

    tesseract 1_001.png 1_001 -l eng+deu conf

where conf is a config file containing just one line:

    tessedit_create_hocr T

Could someone point me in the right direction?

Thank you,
Stanko

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
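P.S. In case it helps anyone reproduce this, here is a rough sketch of how one might check whether the hOCR output contains any bold markup at all. It rests on an assumption: that words recognized as bold would be wrapped in <strong> (or <b>) tags in the hOCR file. The helper name bold_words and the sample fragment are made up for illustration and are not taken from the real output.

```python
from html.parser import HTMLParser


class BoldCollector(HTMLParser):
    """Collect the text found inside <strong> or <b> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting depth of bold tags
        self.words = []  # text fragments seen while inside a bold tag

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b") and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.words.append(data.strip())


def bold_words(hocr_text):
    """Return the text fragments that appear inside bold markup."""
    parser = BoldCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical hOCR fragment, for illustration only (not real tesseract output).
SAMPLE = (
    "<span class='ocr_line' title='bbox 10 10 200 30'>"
    "<span class='ocrx_word'><strong>Abstract</strong></span> "
    "<span class='ocrx_word'>text</span>"
    "</span>"
)

if __name__ == "__main__":
    print(bold_words(SAMPLE))
```

To try it on real output, read the generated file (assumed here to be 1_001.html) as text and pass it to bold_words. An empty list would suggest that no bold information reached the hOCR output in the first place.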

