Matt,

I am also facing similar issues with unicharambigs. I have found it helpful
to look in the unicharset file. Sometimes the values being recognized are
not what was put in the box file, and looking at the unicharset helps
identify them.

In fact, there are times when I have had to define a character in the box
file in a way that I know is not correct, just so that I can get that
character unit into the unicharset. Once it is there, I can use it in a
substitution to replace it. Weird!!

The following will be visible only with a unicode devanagari font such as
mangal, sanskrit2003, etc.

e.g. The character श was being recognized as श्ा (probably because of
chopping)

To fix this, I had to split the character in the box file as
 श्
ा
and then add the following to the unicharambigs file:
2    श्‍ ा        1    श    1
as a mandatory replacement.
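
As a rough illustration (my own sketch, not something from the Tesseract
tools), a line like the one above could be assembled in Python as follows.
The helper name ambig_line is hypothetical, and I am assuming the fields are
tab-separated: source count, source units separated by spaces, target count,
target units, and a final 1 for a mandatory replacement. The units are
written as escapes and contain no invisible joiner characters, so adjust
them to match exactly what is in your own unicharset.

```python
def ambig_line(source, target, mandatory=True):
    """Build one unicharambigs entry from lists of character units.

    Assumption: fields are tab-separated and units within a field are
    separated by single spaces.
    """
    fields = [
        str(len(source)),            # number of source units
        " ".join(source),            # source units as they appear in unicharset
        str(len(target)),            # number of target units
        " ".join(target),            # target units
        "1" if mandatory else "0",   # 1 = mandatory replacement
    ]
    return "\t".join(fields)


# SHA + virama and the AA vowel sign, replaced by plain SHA.
line = ambig_line(["\u0936\u094d", "\u093e"], ["\u0936"])
print(line)
```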

Anyway, in your situation above, check your unicharset file. I have a
feeling that the ligature - "U+017FU+0068" is not there in it, but U+EAB1
is there. That is why you could use the latter in your substitutions.
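
A quick way to check is a small script along these lines. It is only a
sketch: it assumes the first line of the unicharset is the entry count and
that each following line starts with the character unit followed by its
properties. The function names and the sample lines are made up for
illustration; the property fields are placeholders, not real training output.

```python
def read_unichars(lines):
    """Return the first whitespace-separated field of every entry line."""
    entries = [ln for ln in lines if ln.strip()]
    # Assumption: the first non-empty line is the entry count, not a unit.
    return [ln.split()[0] for ln in entries[1:]]


def report(lines, wanted):
    """Map each wanted unit to True/False depending on whether it is listed."""
    present = set(read_unichars(lines))
    return {unit: unit in present for unit in wanted}


# Hypothetical unicharset contents (placeholder property fields).
sample = [
    "4",
    "NULL 0",
    "s 3",
    "h 3",
    "\uEAB1 3",
]

# Long s + h as a two-code-point unit, versus the single MUFI code point.
result = report(sample, ["\u017F\u0068", "\uEAB1"])
for unit, found in result.items():
    codes = "".join("U+%04X" % ord(ch) for ch in unit)
    print(codes, "present" if found else "missing")
```

To run it against a real file, pass open("unicharset", encoding="utf-8")
in place of sample, using whatever path your training run produced.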

Shree


Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Tue, May 7, 2013 at 10:50 PM, matthew christy <[email protected]> wrote:

> Thanks for the information Nick.
>
> I tried my experiment and used the unicharambigs file to turn all my
> ligatures into modern character equivalents. It did not substantially
> improve the dictionary lookup results. I'll have to try increasing my
> confidence in the dictionary using the parameters that I've found mentioned
> in this group (although I'm still trying to figure out what file those
> parameters are in). However, it does look like the unicharambigs stuff is
> done BEFORE the dictionary lookup, which is good to know/confirm.
>
> One odd caveat on behavior with the unicharambigs work that I noticed:
> putting a bunch of lines like "1 ſt 2 s t 1" worked well. But I did have
> one instance where it did not work at all. In my boxfile I had a two
> letter combination defined rather than a single, ligaturized character
> (i.e. a combo of long-s and h "ſh", which is a ligature, but one which is
> not defined in the standard unicode set). I had several occurrences of
> these in my training image, and in the boxfile the box value defined for
> this ligature was "U+017FU+0068". We have been told by folks at Google that
> doing this was OK, and indeed, it does work. Every instance of this
> ligature was correctly identified and turned into "ſh" in the result
> document. However, I could do nothing in the unicharambigs file to turn
> this into an "sh". The only way to get this to work was to change the
> boxfile to identify this ligature as a single character; in this case, I
> used the Medieval Unicode Font Initiative's (MUFI) value of U+EAB1. When I
> did that I was then able to add "1  2 s h 1" (that unidentified
> character having the unicode value of U+EAB1) to the unicharambigs file and
> get the correct results that I wanted.
>
> I don't really understand this behavior. It's almost as if using a
> two-letter character combination in the boxfile short-circuits the ability
> of unicharambigs to identify and convert it. Maybe it's a result somehow of
> the timing of when things are done in the code. I don't know, but I wanted
> to put it out there.
>
> Matt
>
> On Tuesday, May 7, 2013 3:57:22 AM UTC-5, Nick White wrote:
>>
>> Hi Matt,
>>
>>
>> > I'm also not sure how these two files are different, or if maybe
>> > DangAmbigs is from an earlier version of Tesseract or something. I'm
>> > using 3.02.
>>
>> Yes, that guess was correct. unicharambigs used to be called DangAmbigs
>> before Tesseract 3. That is mentioned at:
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>
>> > answers another question I have about unicharambigs: is any
>> > ambiguity found taken into account before or after dictionary lookup.
>> > Is the unicharambigs processed before or after the dictionary is
>> > consulted?
>>
>> I'm not sure, but I think the unicharambigs step happens before the
>> dictionary step. You'd have to check the code to be sure.
>>
>> > Also, I'm finding unicharambigs only seems to really work when I've got
>> > more than one character on either side of the "equation". For single
>> > character substitutions (t -> r, or vice versa) it doesn't really work so
>> > well. I'm curious whether anyone else is finding the same thing.
>>
>> I have found in general that using the '2' ('DEFINITE_AMBIG') option
>> didn't make as much difference as I was expecting.
>>
>> Nick
>>
>  --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
