Have you tried the hOCR output option? It may provide additional
information. There is also a TSV output option in the development version.
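
A minimal sketch of reading per-word font hints from hOCR output, in case
it helps. It assumes the hOCR was produced with something like
`tesseract sample.tif out -c hocr_font_info=1 hocr` (the file names are
placeholders, and whether your build emits `x_font`/`x_fsize` in the word
`title` attribute is an assumption to verify). The sample markup below is
hand-written for illustration, not real Tesseract output, and `HocrWords`
and `parse_hocr` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

# Hand-written sample of hOCR word markup (illustrative only).
SAMPLE_HOCR = (
    "<span class='ocrx_word' id='word_1_1' "
    "title='bbox 36 92 173 134; x_wconf 90; x_font Arial_Bold; x_fsize 14'>"
    "<strong>Cape</strong></span> "
    "<span class='ocrx_word' id='word_1_2' "
    "title='bbox 190 100 260 130; x_wconf 85; x_fsize 9'>"
    "<em>town</em></span>"
)


class HocrWords(HTMLParser):
    """Collect text, bounding box and style hints for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._cur = None  # word currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            title = attrs.get("title", "")
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            size = re.search(r"x_fsize (\d+)", title)
            self._cur = {
                "text": "",
                "bbox": tuple(int(v) for v in bbox.groups()) if bbox else None,
                "size": int(size.group(1)) if size else None,
                "bold": False,
                "italic": False,
            }
        elif self._cur is not None and tag == "strong":
            self._cur["bold"] = True
        elif self._cur is not None and tag == "em":
            self._cur["italic"] = True

    def handle_data(self, data):
        # Only text inside a word span is kept; whitespace between spans is skipped.
        if self._cur is not None:
            self._cur["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._cur is not None:
            self.words.append(self._cur)
            self._cur = None


def parse_hocr(hocr_text):
    parser = HocrWords()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in parse_hocr(SAMPLE_HOCR):
        print(word)
```

The bounding-box height of each word can serve as a rough size cue even
when the reported point size is unreliable.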

- sent from my phone. excuse the brevity.
On 19-Apr-2016 1:38 pm, "'Bastian Fischer' via tesseract-ocr" <
[email protected]> wrote:

Hi Tesseract Community!

I have a problem recognizing some sample text in a scanned image (a TIFF
at 300 dpi). The image shows part of a table of contents.

I want to extract metadata about the text, e.g. bold, italic, serif and
point size. My idea is to use this information to differentiate parts of
the text from each other, so that I can use it as part of an automatic
table-of-contents parser. Unfortunately, Tesseract does not recognize
these attributes very well. For example, the first words of the sample
picture are set in a blocky, bold, Arial-like font with a large point
size, followed by smaller italic text.

For some reason, Tesseract reports the same point size for every letter
in the line. It also gets the "r" in "York" wrong. And it even reports
the first word, "Cape", as italic and serif, which it is not.

I tried different page segmentation modes. I also tried converting the
image to monochrome first, among other things. Each of these changes the
result slightly, but none gives accuracy good enough for the parsing I
have in mind.

I use version 3.04.01 with the eng.traineddata provided by Google.

Any ideas?


-- 
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/3360168e-b7f1-4317-9057-1d4221807196%40googlegroups.com
<https://groups.google.com/d/msgid/tesseract-ocr/3360168e-b7f1-4317-9057-1d4221807196%40googlegroups.com?utm_medium=email&utm_source=footer>
.
For more options, visit https://groups.google.com/d/optout.

