In my experiments I observe the same problem that was raised in this old thread: italics are practically never recognised.

My primary question is: how should I interpret the output of tessedit_debug_fonts, and where can I find documentation for it?

A sample is available here:
http://fleksem.klf.uw.edu.pl/~jsbien/Linde4tesseract/

Below is a short excerpt:

Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 4 font URW_Bookman_L_Bold (300) font2 Verdana_Bold (317)
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 9 font Arial_Bold (25) font2 Century_Schoolbook_L (59)
Examining fonts in 4 [34 ]0 9 [39 ]0 8 [38 ]0 8 font Century_Schoolbook_L_Bold (60) font2 Georgia (135)
Word modal font=URW_Bookman_L_Bold, score=2, 2nd choice Century_Schoolbook_L_Bold/2

In particular, I am confused by the order of the characters in this output, and I am curious where the font names come from.

Best regards
Janusz

On Friday, April 29, 2011 7:47:25 PM UTC+2, Dmitri Silaev wrote:
> It seems to me, I already read some info about font recognition. It
> was saying, fonts are just a matter of the future.
> Indeed, some investigation shows that Tesseract tries to do font
> matching but non-italic fonts outweigh italic ones, even in such an
> obvious example image (see attached).
>
> To examine how font matching works and switch to hOCR output, use the
> following lines in a config file:
> tessedit_debug_fonts T
> tessedit_create_hocr T
>
> Also I attach the hOCR result file. Notice, there's only one word
> recognized as italic:
> <em>works</em>
>
> If anyone knows more on this subject, please share...
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Fri, Apr 29, 2011 at 9:30 AM, Nikse <[email protected]> wrote:
> > Thx for your answer Quan Nguyen, and sorry for my unclear question!
> >
> > I can get hocr output... but it does not contain any "<em>" tags when
> > ocr'ing italic texts.
> > Is this working for anybody?
> >
> > On Apr 29, 5:46 am, Quan Nguyen <[email protected]> wrote:
> >> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/2f4...
> >> http://code.google.com/p/tesseract-ocr/issues/detail?id=377#c5
> >>
> >> On Apr 28, 7:54 am, Nikse <[email protected]> wrote:
> >> > I can see that in baseapi.cpp in method "GetHOCRText" there seems to
> >> > be support for italic in line 936/937:
> >> >     if (word->italic > 0)
> >> >         hocr_str += "<em>";
> >> >
> >> > Does anybody know if that's supposed to work?
> >> >
> >> > TIA
> >> > Nikolaj
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
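
P.S. For anyone who wants an overview of a long tessedit_debug_fonts log, a small script can tally the "Word modal font" lines. The following is only a sketch: the regular expression is inferred from the excerpt quoted in this message, not from any Tesseract documentation; the names summarise_fonts and italic_share are made up for illustration; and treating a font as italic when its name contains "Italic" is an assumption about how the fonts are named.

```python
import re
from collections import Counter

# Layout of the per-word summary line, inferred only from the excerpt
# quoted in this message (an assumption, not a documented format).
WORD_LINE = re.compile(
    r"Word modal font=(?P<font>\S+?), score=(?P<score>\d+), "
    r"2nd choice (?P<font2>\S+?)/(?P<score2>\d+)"
)


def summarise_fonts(debug_text):
    """Count how often each font name is reported as the word's modal font."""
    counts = Counter()
    for match in WORD_LINE.finditer(debug_text):
        counts[match.group("font")] += 1
    return counts


def italic_share(counts):
    """Fraction of words whose modal font name contains 'italic'.

    Matching on the substring 'italic' is a guess about font naming.
    """
    total = sum(counts.values())
    italic = sum(n for name, n in counts.items() if "italic" in name.lower())
    return italic / total if total else 0.0


if __name__ == "__main__":
    sample = (
        "Word modal font=URW_Bookman_L_Bold, score=2, "
        "2nd choice Century_Schoolbook_L_Bold/2"
    )
    counts = summarise_fonts(sample)
    print(counts)
    print(italic_share(counts))
```

Run over the complete log of a page that is known to be set in italics, a share close to zero would confirm the observation made in the quoted thread that non-italic fonts outweigh italic ones.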

