[tesseract-ocr] Tesseract fails to extract meta information from sample correctly

'Bastian Fischer' via tesseract-ocr Tue, 19 Apr 2016 01:08:38 -0700

Hi Tesseract Community!

I have a problem recognizing sample text in a scanned image (TIFF at
300 dpi). The sample is a table of contents.

I want to extract meta information about the text, e.g. bold, italic, serif
and point size. My idea was to use this information to differentiate
parts of the text from each other, so that I can use it as part of an
automatic table-of-contents parser. Unfortunately, Tesseract does not
recognize this information very well. For example, the first words of the
sample picture are written in a blocky, bold, Arial-like font with a large
point size, followed by smaller italic text.
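
To make the parsing idea concrete, here is a minimal sketch of how per-word
font attributes could be used to split a line into styled segments. It
assumes the attributes have already been obtained somehow; the Word class,
the split_by_style function and the sample values are all hypothetical and
hand-written for illustration, not real Tesseract output.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One recognized word plus its reported font attributes (hypothetical container)."""
    text: str
    pointsize: int
    bold: bool
    italic: bool


Style = Tuple[int, bool, bool]


def style_key(word: Word) -> Style:
    # Words count as the same style only if all three attributes match exactly.
    return (word.pointsize, word.bold, word.italic)


def split_by_style(words: List[Word]) -> List[Tuple[Style, str]]:
    """Group consecutive words that share a style into one text segment."""
    segments = []
    for style, run in groupby(words, key=style_key):
        segments.append((style, " ".join(w.text for w in run)))
    return segments


# Hand-written sample values (assumption), not output from Tesseract.
sample = [
    Word("Cape", 14, True, False),
    Word("York", 14, True, False),
    Word("a", 9, False, True),
    Word("short", 9, False, True),
    Word("description", 9, False, True),
]

segments = split_by_style(sample)
for style, text in segments:
    print(style, text)
```

A parser built this way depends entirely on the attributes being reliable:
a single word with a wrong point size or italic flag would start a new
segment, which is why the recognition quality matters so much here.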

For some reason, Tesseract thinks that all letters in the line have the
same point size. It also does not get the "r" in "York" right, and it
even reports the first word, "Cape", as italic and serif, which it is
not.

I tried different page segmentation modes. I also converted the image to
monochrome first, etc. These all change the result slightly, but not
enough to reach the accuracy my parsing idea needs.

I use version 3.04.01 with the eng.traineddata provided by Google.

Any ideas?

