Re: Get italic info from Tesseract 3 command line?

jsbien Sun, 14 Apr 2013 22:59:13 -0700

In my experiments I see the same problem that was raised in this old 
thread: italics are practically never recognised.


My primary question is: how should the output of tessedit_debug_fonts be 
interpreted, and where can I find documentation for it?

A sample is available here:

http://fleksem.klf.uw.edu.pl/~jsbien/Linde4tesseract/

And below is a small quote:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 
4 font URW_Bookman_L_Bold (300) font2 Verdana_Bold (317)
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 
9 font Arial_Bold (25) font2 Century_Schoolbook_L (59)
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 
8 font Century_Schoolbook_L_Bold (60) font2 Georgia (135)
Word modal font=URW_Bookman_L_Bold, score=2, 2nd choice Century_Schoolbook_L_Bold/2

In particular, I'm confused by the order of the characters in this output, 
and I'm curious where the font names come from.
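
While the meaning of the fields is still unclear, the quoted lines can at 
least be split into their parts for comparison across words. The following 
is a minimal Python sketch, assuming the debug output keeps exactly the 
layout shown in the sample above. The function name parse_debug_fonts and 
the tuple layout are hypothetical, and the numbers in parentheses are kept 
as opaque integers because their meaning is not documented here.

```python
import re

# Patterns derived only from the quoted sample; the field meanings are assumptions.
# Example: "4 font URW_Bookman_L_Bold (300) font2 Verdana_Bold (317)"
CHAR_LINE = re.compile(r"^(\S+) font (\S+) \((\d+)\) font2 (\S+) \((\d+)\)$")
# Example: "Word modal font=URW_Bookman_L_Bold, score=2, 2nd choice Century_Schoolbook_L_Bold/2"
MODAL = re.compile(
    r"Word modal font=([^,\s]+), score=(\d+), 2nd choice ([^/\s]+)/(\d+)"
)


def parse_debug_fonts(text):
    """Return (per_character, per_word) lists parsed from the debug text."""
    per_character = []
    for line in text.splitlines():
        match = CHAR_LINE.match(line.strip())
        if match:
            char, font, num, font2, num2 = match.groups()
            per_character.append((char, font, int(num), font2, int(num2)))

    # Collapse all whitespace first so a mail-wrapped "Word modal" line still matches.
    flat = " ".join(text.split())
    per_word = [
        (font, int(score), second, int(second_score))
        for font, score, second, second_score in MODAL.findall(flat)
    ]
    return per_character, per_word
```

Calling parse_debug_fonts on the captured output gives one tuple per 
"font ... font2 ..." line and one tuple per "Word modal font" line, which 
makes it easier to see how often an italic font name appears among the 
candidates at all.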

Best regards

Janusz

On Friday, April 29, 2011 7:47:25 PM UTC+2, Dmitri Silaev wrote:
>
> I believe I have already read something about font recognition. It
> said that font support was a matter for the future.
> Indeed, some investigation shows that Tesseract tries to do font
> matching but non-italic fonts outweigh italic ones, even in such an
> obvious example image (see attached).
>
> To examine how font matching works and switch to hOCR output, use the
> following lines in a config file:
> tessedit_debug_fonts      T
> tessedit_create_hocr       T
>
> Also I attach the hOCR result file. Notice, there's only one word
> recognized as italic:
> <em>works</em>
>
> If anyone knows more on this subject, please share...
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, Apr 29, 2011 at 9:30 AM, Nikse <[email protected]> wrote:
> > Thx for your answer Quan Nguyen, and sorry for my unclear question!
> >
> > I can get hocr output... but it does not contain any "<em>" tags when
> > ocr'ing italic texts.
> > Is this working for anybody?
> >
> >
> > On Apr 29, 5:46 am, Quan Nguyen <[email protected]> wrote:
> >> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2f4...
> >> http://code.google.com/p/tesseract-ocr/issues/detail?id=377#c5
> >>
> >> On Apr 28, 7:54 am, Nikse <[email protected]> wrote:
> >>
> >> > I can see that in baseapi.cpp in method "GetHOCRText" there seems to
> >> > be support for italic in line 936/937:
> >> >       if (word->italic > 0)
> >> >         hocr_str += "<em>";
> >>
> >> > Does anybody know if that's supposed to work?
> >>
> >> > TIA
> >> > Nikolaj
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
> >
>
>
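
To check how many words Tesseract actually marked as italic, the <em> 
tags mentioned in the quoted messages can be collected from the hOCR 
file. This is a minimal Python sketch using only the standard library. 
It assumes, as the quoted baseapi.cpp snippet suggests, that italic words 
are wrapped in <em>...</em>. The names EmCollector and italic_words are 
hypothetical helpers, not part of Tesseract.

```python
from html.parser import HTMLParser


class EmCollector(HTMLParser):
    """Collect the text found inside <em>...</em> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth of currently open <em> tags
        self.words = []     # finished italic text runs
        self._buffer = []   # text pieces of the run being read

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "em" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                word = "".join(self._buffer).strip()
                if word:
                    self.words.append(word)
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def italic_words(hocr_text):
    """Return the list of text runs that the hOCR marks with <em>."""
    parser = EmCollector()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

For a file produced with tessedit_create_hocr, reading it into a string 
and passing it to italic_words returns the words flagged as italic. An 
empty list corresponds to the situation Nikse describes, where no <em> 
tags are emitted at all.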

--- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
