Hi Tesseract Community!

I have a problem recognizing some sample text from a scanned image (TIFF,
300 dpi). The sample is a table of contents.

I want to extract font attributes of the text, e.g. bold, italic, serif
and point size. My idea is to use this information to tell parts of the
text apart, so that I can use it in an automatic table-of-contents
parser. Unfortunately, Tesseract does not recognize these attributes very
well. For example, the first words of the sample picture are set in a
blocky, bold, Arial-like font with a large point size, followed by
smaller italic text.
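
To make the idea concrete, here is a minimal sketch of the kind of grouping
I have in mind. It assumes hOCR output produced with the hocr_font_info
config variable enabled, so that (as far as I understand) each word carries
an x_fsize value in its title and bold or italic words are wrapped in
<strong> or <em>. The names HocrWordParser, parse_hocr and style_runs, and
the sample markup, are made up for illustration; the sample is hand-written
and is not real Tesseract output.

```python
from html.parser import HTMLParser
from itertools import groupby
import re


class HocrWordParser(HTMLParser):
    """Collect text, point size, bold and italic for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or ""):
            # Assumption: the point size appears as "x_fsize N" in the title.
            size = re.search(r"x_fsize\s+(\d+)", attrs.get("title") or "")
            self._current = {
                "text": "",
                "pointsize": int(size.group(1)) if size else None,
                "bold": False,
                "italic": False,
            }
        elif self._current is not None and tag == "strong":
            self._current["bold"] = True
        elif self._current is not None and tag == "em":
            self._current["italic"] = True

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def parse_hocr(hocr):
    """Return the list of word dicts found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def style_runs(words):
    """Group consecutive words that share point size, bold and italic."""
    def style(word):
        return (word["pointsize"], word["bold"], word["italic"])

    return [(key, " ".join(w["text"] for w in group))
            for key, group in groupby(words, key=style)]


# Hand-written sample markup (hypothetical, not real Tesseract output).
SAMPLE_HOCR = """
<span class='ocr_line' title='bbox 0 0 900 60'>
<span class='ocrx_word' title='bbox 0 0 200 60; x_wconf 90; x_fsize 18'><strong>Chapter</strong></span>
<span class='ocrx_word' title='bbox 210 0 300 60; x_wconf 90; x_fsize 18'><strong>One</strong></span>
<span class='ocrx_word' title='bbox 320 20 360 60; x_wconf 90; x_fsize 10'><em>an</em></span>
<span class='ocrx_word' title='bbox 370 20 600 60; x_wconf 90; x_fsize 10'><em>introduction</em></span>
</span>
"""

for run_style, run_text in style_runs(parse_hocr(SAMPLE_HOCR)):
    print(run_style, run_text)
```

Each run would then be one candidate part of a table-of-contents entry
(heading, subtitle and so on), which only works if the reported point size
and bold/italic flags are reliable per word.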

For some reason, Tesseract reports the same point size for every letter
on the line. It also does not get the "r" in "York" right, and it even
reports the first word, "Cape", as italic and serif, which it is not.

I tried different page segmentation modes, and I also tried converting the
image to monochrome first. Each of these changes the result slightly, but
none gives accuracy good enough for the parsing I have in mind.

I use version 3.04.01 with the eng.traineddata provided by Google.

Any ideas?

