Too Many Unichar Ambiguities

sdk Tue, 30 Apr 2013 23:50:21 -0700

OK, I have figured out the 'why' of some of the problems I was facing. I am 
noting it here, in case others come across the same issues.

1. I was seeing "Illegal ambiguity specification" for every line in the 
san.config, san.unicharset and san.unicharambigs files.

I found that this was because san.unicharambigs had been saved with 
Windows (CRLF) end-of-line characters. I converted it to UNIX (LF) EOL in 
Notepad++, made sure it was encoded as UTF-8 without BOM, and saved it 
again. The warnings then disappeared.
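
For anyone who prefers to script the same cleanup instead of using an editor, here is a minimal sketch in Python. The function names are my own, and it assumes the file is already UTF-8 and only needs the BOM and the Windows line endings removed:

```python
import codecs


def normalize_text_file(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF (or lone CR) to LF."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def normalize_in_place(path: str) -> None:
    """Rewrite the file at `path` with the BOM removed and UNIX line endings."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(normalize_text_file(data))
```

Calling `normalize_in_place("san.unicharambigs")` before combining the traineddata should have the same effect as the Notepad++ steps above. Keep a backup of the file first.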

2. The unicharambigs file lists some replacements as mandatory and others 
as optional, depending on whether the last number on the line is 1 or 0. 
The debug output calls these 'Replaceable Ambiguities' and 'Dangerous 
Ambiguities' respectively. Sample pasted below:

Reading ambiguities
read line 2    । ।    1    ॥    1
read line 2    र व    1    ख    0
read line 1    1    1    ।    0
Illegal unichar 1 in ambiguity specification
Illegal ambiguity specification on line 4
Replaceable Ambiguities for । [964 ]p:
wrong_ngram:। । ( 73 73 )
correct_fragments:|॥|0|2 |॥|1|2 ( 137 138 )
Dangerous Ambiguities for र [930 ]x:
wrong_ngram:र व ( 28 14 )
correct_fragments:|ख|0|2 |ख|1|2 ( 139 140 )
Tesseract Open Source OCR Engine v3.02 with Leptonica
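
To make the line layout concrete, here is a rough Python sketch of how one unicharambigs line could be split up. It assumes the five fields are tab-separated (count, wrong units separated by spaces, count, replacement units, type), which is what the "read line" debug output above suggests. The names `Ambiguity` and `parse_ambig_line` are mine, not Tesseract's:

```python
from typing import List, NamedTuple


class Ambiguity(NamedTuple):
    wrong: List[str]      # units as the OCR engine produced them
    correct: List[str]    # units they should be replaced with
    mandatory: bool       # True for type 1, False for type 0


def parse_ambig_line(line: str) -> Ambiguity:
    """Parse one tab-separated ambiguity line (assumed layout, see above)."""
    # Only strip "\n": a leftover "\r" from Windows EOL then fails below,
    # which is similar to what happened in point 1.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields, got %d" % len(fields))
    n_wrong, wrong, n_correct, correct, kind = fields
    wrong_units = wrong.split(" ")
    correct_units = correct.split(" ")
    if len(wrong_units) != int(n_wrong) or len(correct_units) != int(n_correct):
        raise ValueError("unit count does not match the declared count")
    if kind not in ("0", "1"):
        raise ValueError("type must be 0 or 1, got %r" % kind)
    return Ambiguity(wrong_units, correct_units, kind == "1")
```

Note that this only checks the shape of the line. Whether each unit is legal depends on the unicharset, which is a separate check.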

3. Illegal unichar 1 in ambiguity specification - Illegal ambiguity 
specification on line 4

This means that one of the characters on that line of the unicharambigs 
file is not present in the unicharset file. This is also noted in the 
training wiki. 
I was creating replacement strings based on the OCR output, but each 
unit also needs to be checked against the unicharset: it must appear as 
an entry in the first column.
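
This check can be done before running Tesseract. The sketch below is only an illustration: it assumes the first line of the unicharset is the entry count and that every later line starts with the unit followed by whitespace, and it assumes the tab-separated unicharambigs layout seen in the debug output. `load_unichars` and `missing_units` are hypothetical helper names:

```python
from typing import List, Set


def load_unichars(unicharset_text: str) -> Set[str]:
    """Collect the first column of a unicharset (assumed layout, see above)."""
    lines = unicharset_text.splitlines()
    # Skip the first line (assumed to be the count) and any blank lines.
    return {ln.split()[0] for ln in lines[1:] if ln.strip()}


def missing_units(ambig_line: str, unichars: Set[str]) -> List[str]:
    """Return the units on one ambiguity line that are not in the unicharset."""
    fields = ambig_line.rstrip("\n").split("\t")
    # Field 1 holds the wrong units, field 3 the replacement units.
    units = fields[1].split(" ") + fields[3].split(" ")
    return [u for u in units if u not in unichars]
```

Any line for which `missing_units` returns a non-empty list is a candidate for the "Illegal unichar" message.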

4. ambigs.train config file

I used this config option to get more information about the inner 
workings of Tesseract. Here is a sample of the output:

Tesseract Open Source OCR Engine v3.02 with Leptonica
TODO(antonova): clean up recog_training_segmented;  It examined only a 
small fraction of the ambigs image.
recog_training_segmented: examined 9 / 136 words.

ॐ    ॐ    1.6375    -0.2251
श्री    श्री    2.1137    -0.2634
धी    श्री    21.3329    -2.6583
श्री    श्री    1.1497    -0.1474
धी    श्री    20.7599    -2.6615
को    को    1.4187    -0.1785
।    ।    0.3590    -0.2081
॥    ।    0.3590    -0.2081
।    ।    0.3590    -0.2081
॥    ।    0.4610    -0.2672
।।    ॥    0.8614    -0.2533
शूर    शूर    4.4771    -0.4700
शुर    शूर    33.3172    -4.5999
श्‍शूर    शूर    53.8488    -8.5634
शूकमा    शून्य    45.1160    -5.4195
श्‍शून्य    शून्य    63.2227    -6.6297

So this gives an idea of how Tesseract is treating ambiguous character 
units (especially useful for complex scripts such as Devanagari).
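
To go through a long listing like the one above, a small script can pull out the rows where the first two columns differ. This is only a sketch: the post does not say what the four columns mean, so the code treats them as two text columns followed by two numbers and makes no further assumption. `parse_row` and `mismatches` are names I made up:

```python
from typing import List, Tuple

Row = Tuple[str, str, float, float]


def parse_row(line: str) -> Row:
    """Split one output line into two text columns and two numeric columns."""
    # Take the two numbers from the right, then split the rest in two.
    left, num_a, num_b = line.rsplit(None, 2)
    first, second = left.split(None, 1)
    return first, second.strip(), float(num_a), float(num_b)


def mismatches(lines: List[str]) -> List[Row]:
    """Return only the rows whose first and second columns differ."""
    rows = [parse_row(ln) for ln in lines if ln.strip()]
    return [r for r in rows if r[0] != r[1]]
```

In the sample above, the rows this picks out are also the ones with the largest values in the third column.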

What I haven't figured out yet is how to use this information to 
influence and improve the training so that I can minimise the errors.

Any suggestions, anyone?

On Monday, April 22, 2013 10:00:18 AM UTC+5:30, sdk wrote:
>
> Zdenko,
>
> I am also getting the message regarding 'too many ambiguities on line 
> ...' when processing with the newly trained data for san.
>
> I saw that there are two closed issues on the topic but could not figure 
> out what needs to be done to get rid of these errors/warnings. The OCR 
> output is getting created.
>
> I set ambigs_debug_level to 5 using a config file and the resulting output 
> shows:
>
> Illegal ambiguity specification 
>
> for every line in the san.config file, san.unicharset file and 
> san.unicharambigs file.
>
> Do I need to train for ambiguities?
>
> Or is this something that happens because we are running tesseract on 
> windows?
>
> Thanks!
>
> On Wednesday, December 19, 2012 12:41:10 AM UTC+5:30, zdenop wrote:
>
>> I do apologize, but I am not familiar with Chinese (or other Asian 
>> languages ;-) ). So I tried
>>
>> tesseract original.jpg original -l chi_sim
>>
>> and the message was:
>>
>> Too many unichars in ambiguity on line 0
>> Too many unichars in ambiguity on line 0
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>
>> It created output.
>>
>> Messages before "Tesseract Open Source..." are from init phase. So it 
>> looks like there could be some problem in "chi_sim" language file. Messages 
>> after "Tesseract Open Source..." are from OCR phase, 
