Preserving font formatting when parsing scientific articles

Stanko Arbutina Mon, 06 May 2013 18:58:36 -0700

Hello,

I'm trying to use Tesseract to convert scientific articles stored as image 
files to HTML.
The text itself is recognized correctly, but the font formatting is not 
preserved (I'm mainly interested in bold text and section headers).


Image file:
https://docs.google.com/file/d/0BwXwnqv_LzoMWGQzczh0bWNZRHM/edit?usp=sharing

HTML output:
https://docs.google.com/file/d/0BwXwnqv_LzoMVGdzQjBWand6a0k/edit?usp=sharing
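
One way to check whether any formatting made it into the output is to scan the
HTML for formatting tags. The sketch below is only an illustration: it assumes
that bold words would be wrapped in <strong> and italic words in <em>, which is
not confirmed anywhere in this post, and the SAMPLE markup and the names
FormattedWordFinder and find_formatted_words are made up for the example.

```python
from html.parser import HTMLParser


class FormattedWordFinder(HTMLParser):
    """Collect text found inside <strong> or <em> elements."""

    def __init__(self):
        super().__init__()
        self._open = []   # formatting tags that are currently open
        self.bold = []    # text seen inside <strong>
        self.italic = []  # text seen inside <em>

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "em"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the most recently opened tag of this kind.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "strong" in self._open:
            self.bold.append(text)
        if "em" in self._open:
            self.italic.append(text)


def find_formatted_words(hocr_html):
    """Return (bold_words, italic_words) found in an hOCR/HTML string."""
    finder = FormattedWordFinder()
    finder.feed(hocr_html)
    finder.close()
    return finder.bold, finder.italic


# Hypothetical hOCR fragment, written by hand for this example only.
SAMPLE = (
    "<span class='ocr_line' title='bbox 10 10 200 30'>"
    "<span class='ocrx_word' title='bbox 10 10 80 30'>"
    "<strong>Abstract</strong></span> "
    "<span class='ocrx_word' title='bbox 90 10 200 30'>text</span>"
    "</span>"
)

if __name__ == "__main__":
    bold, italic = find_formatted_words(SAMPLE)
    print("bold:", bold)
    print("italic:", italic)
```

If both lists come back empty for the real output file, the formatting was
dropped before the HTML was written, not while reading it back.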

Since the text in the image is really clear, I'm guessing it's just a matter of 
changing configuration options, but I couldn't work out which ones apply from 
the list at 
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version .

I'm calling tesseract like this:
tesseract 1_001.png 1_001 -l eng+deu conf

conf is a config file (just one line):
tessedit_create_hocr       T
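
For anyone who wants to reproduce this from a script, here is a minimal sketch
of the same call in Python. It only mirrors the command line and the one-line
config shown above; the helper names write_config, build_command and
run_tesseract are made up for this example, and it assumes the tesseract
binary is on the PATH.

```python
import subprocess
from pathlib import Path


def write_config(path):
    """Write the one-line config file that turns on hOCR output."""
    Path(path).write_text("tessedit_create_hocr T\n")
    return path


def build_command(image, output_base, languages, config):
    """Build the argument list equivalent to the command line above."""
    return ["tesseract", image, output_base, "-l", "+".join(languages), config]


def run_tesseract(image, output_base, languages, config):
    """Run tesseract; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(image, output_base, languages, config),
                   check=True)


if __name__ == "__main__":
    write_config("conf")
    # Same as: tesseract 1_001.png 1_001 -l eng+deu conf
    run_tesseract("1_001.png", "1_001", ["eng", "deu"], "conf")
```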

Could someone point me in the right direction?

Thank you,
Stanko

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en