OK, I have figured out the 'why' of some of the problems I was facing. I am noting it here, in case others come across the same issues.
1. I was seeing "Illegal ambiguity specification" for every line in the san.config, san.unicharset and san.unicharambigs files. This was because san.unicharambigs had been saved with Windows-style EOL characters. I converted it to UNIX EOL in Notepad++, made sure it was encoded as UTF-8 without BOM, saved it again, and the warnings disappeared.

2. The unicharambigs file lists some replacements as mandatory and others as optional, depending on whether the last number on the line is 1 or 0. The debug output calls these 'Replaceable Ambiguities' and 'Dangerous Ambiguities' respectively. Sample pasted below:

Reading ambiguities
read line 2 । । 1 ॥ 1
read line 2 र व 1 ख 0
read line 1 1 1 । 0
Illegal unichar 1 in ambiguity specification
Illegal ambiguity specification on line 4
Replaceable Ambiguities for । [964 ]p:
wrong_ngram:। । ( 73 73 ) correct_fragments:|॥|0|2 |॥|1|2 ( 137 138 )
Dangerous Ambiguities for र [930 ]x:
wrong_ngram:र व ( 28 14 ) correct_fragments:|ख|0|2 |ख|1|2 ( 139 140 )
Tesseract Open Source OCR Engine v3.02 with Leptonica

3. Illegal unichar 1 in ambiguity specification - Illegal ambiguity specification on line 4

This means that one of the characters on that line of the unicharambigs file is not present in the unicharset file. This is also noted in the training wiki. I was creating replacement strings based on the OCR output, but I also need to check the unicharset to make sure each character is one of the units in its first column.

4. ambigs.train config file

I used this config option to get more info about the inner workings of Tesseract. Here is a sample of the output:

Tesseract Open Source OCR Engine v3.02 with Leptonica
TODO(antonova): clean up recog_training_segmented; It examined only a small fraction of the ambigs image.
recog_training_segmented: examined 9 / 136 words.
ॐ ॐ 1.6375 -0.2251
श्री श्री 2.1137 -0.2634
धी श्री 21.3329 -2.6583
श्री श्री 1.1497 -0.1474
धी श्री 20.7599 -2.6615
को को 1.4187 -0.1785
। । 0.3590 -0.2081
॥ । 0.3590 -0.2081
। । 0.3590 -0.2081
॥ । 0.4610 -0.2672
।। ॥ 0.8614 -0.2533
शूर शूर 4.4771 -0.4700
शुर शूर 33.3172 -4.5999
श्शूर शूर 53.8488 -8.5634
शूकमा शून्य 45.1160 -5.4195
श्शून्य शून्य 63.2227 -6.6297

So this gives an idea of how Tesseract is treating ambiguous character units (especially useful for complex scripts such as Devanagari). What I haven't figured out yet is how to use this info to influence and improve the training so that I can minimise the errors.

Any suggestions, anyone?

On Monday, April 22, 2013 10:00:18 AM UTC+5:30, sdk wrote:
> Zdenko,
>
> I am also getting the message regarding "too many ambiguities on line ..." when processing with the newly trained data for san.
>
> I saw that there are two closed issues on the topic but could not figure out what needs to be done to get rid of these errors/warnings. The OCR output is getting created.
>
> I set ambigs_debug_level to 5 using a config file and the resulting output shows:
>
> Illegal ambiguity specification
>
> for every line in the san.config file, san.unicharset file and san.unicharambigs file.
>
> Do I need to train for ambiguities?
>
> Or is this something that happens because we are running tesseract on windows?
>
> Thanks!
>
> On Wednesday, December 19, 2012 12:41:10 AM UTC+5:30, zdenop wrote:
>> I do apologize, but I am not familiar with Chinese (or other Asian languages ;-) ). So I tried
>>
>> tesseract original.jpg original -l chi_sim
>>
>> and the message was:
>>
>> Too many unichars in ambiguity on line 0
>> Too many unichars in ambiguity on line 0
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>
>> It created output.
>>
>> Messages before "Tesseract Open Source..." are from init phase. So it looks like there could be some problem in "chi_sim" language file.
>> Messages after "Tesseract Open Source..." are from OCR phase,
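
The checks described in points 1 and 3 of the post (line endings/BOM, and units missing from the unicharset) could be automated before running Tesseract. What follows is only a rough Python sketch, not part of Tesseract or its training tools. The function names are made up for illustration. It assumes each unicharambigs line reads as: count, wrong units, count, replacement units, then the 1/0 flag, separated by whitespace, as in the sample lines above. It also assumes the first line of the unicharset is a count and that a leading "v1"-style version line in unicharambigs can be skipped. Only the "wrong" side is checked, because that is the side the error in the sample output points at.

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"


def normalize_bytes(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM and turn CRLF (or lone CR) line endings into LF."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def unicharset_units(text: str) -> set:
    """Collect the first column of each unicharset line.

    Assumption: the first line only holds the number of entries, so it is skipped.
    """
    units = set()
    for line in text.splitlines()[1:]:
        tokens = line.split()
        if tokens:
            units.add(tokens[0])
    return units


def parse_ambig_line(line: str):
    """Return (wrong_units, replacement_units, flag), or None if the line is malformed."""
    tokens = line.split()
    try:
        n_wrong = int(tokens[0])
        wrong = tokens[1:1 + n_wrong]
        n_right = int(tokens[1 + n_wrong])
        right = tokens[2 + n_wrong:2 + n_wrong + n_right]
        flag = int(tokens[2 + n_wrong + n_right])
    except (IndexError, ValueError):
        return None
    return wrong, right, flag


def missing_source_units(ambigs_text: str, known: set) -> list:
    """List (line_number, unit) for wrong-side units absent from the unicharset.

    A malformed line is reported as (line_number, None).
    """
    problems = []
    for lineno, line in enumerate(ambigs_text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: a header such as "v1" is a version marker, not an ambiguity.
        if not stripped or re.fullmatch(r"v\d+", stripped):
            continue
        parsed = parse_ambig_line(stripped)
        if parsed is None:
            problems.append((lineno, None))
            continue
        wrong, _right, _flag = parsed
        for unit in wrong:
            if unit not in known:
                problems.append((lineno, unit))
    return problems
```

To try it on real files, read san.unicharambigs in binary mode, pass the bytes through normalize_bytes, write the result back, then decode it as UTF-8 and hand it to missing_source_units together with unicharset_units applied to the contents of san.unicharset. Any (line, unit) pair it reports is a candidate for the "Illegal unichar" message.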

