Thanks for the information, Nick.

I tried my experiment and used the unicharambigs file to turn all my 
ligatures into modern character equivalents. It did not substantially 
improve the dictionary lookup results. I'll have to try increasing my 
confidence in the dictionary using the parameters that I've found mentioned 
in this group (although I'm still trying to figure out what file those 
parameters are in). However, it does look like the unicharambigs stuff is 
done BEFORE the dictionary lookup, which is good to know/confirm.

One odd caveat on behavior with the unicharambigs work that I noticed: 
putting a bunch of lines like "1 ſt 2 s t 1" worked well. But I did have one 
instance where it did not work at all. In my boxfile I had a two-letter 
combination defined rather than a single, ligaturized character (i.e. a 
combo of long-s and h "ſh", which is a ligature, but one which is not 
defined in the standard Unicode set). I had several occurrences of these in 
my training image, and in the boxfile the box value defined for this 
ligature was "U+017FU+0068". We have been told by folks at Google that 
doing this was OK, and indeed, it does work. Every instance of this 
ligature was correctly identified and turned into "ſh" in the result 
document. However, I could do nothing in the unicharambigs file to turn 
this into an "sh". The only way to get this to work was to change the 
boxfile to identify this ligature as a single character; in this case, I 
used the Medieval Unicode Font Initiative's (MUFI) value of U+EAB1. When I 
did that I was then able to add "1 <U+EAB1> 2 s h 1" (where <U+EAB1> stands 
for the single private-use character with that code point) to the 
unicharambigs file and get the results that I wanted.
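
For anyone wanting to try the same thing, here is a minimal Python sketch of 
how such replacement lines could be generated. It assumes the tab-separated 
layout described on the TrainingTesseract3 wiki page (count, space-separated 
unichars, count, space-separated unichars, type flag, with 1 meaning 
mandatory replacement) and a "v1" header line. The helper name ambig_line 
and the ligature table are made up for illustration, so check the output 
against a unicharambigs file that you know works.

```python
def ambig_line(source, target, mandatory=True):
    """Build one unicharambigs line from two lists of unichars.

    Each list element is one unichar as it appears in the unicharset;
    the field layout (tab-separated) is an assumption taken from the wiki.
    """
    fields = [
        str(len(source)),
        " ".join(source),
        str(len(target)),
        " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


# Hypothetical ligature table: single-unichar ligature -> modern letters.
LIGATURES = {
    "\ufb05": ["s", "t"],  # U+FB05, the long-s + t ligature
    "\ueab1": ["s", "h"],  # U+EAB1, MUFI private-use long-s + h ligature
}

# "v1" header line is assumed; compare with an existing unicharambigs file.
lines = ["v1"] + [ambig_line([lig], repl) for lig, repl in LIGATURES.items()]

if __name__ == "__main__":
    print("\n".join(lines))
```

Keeping the ligatures in a table like this makes it easy to regenerate the 
file whenever a new ligature is added to the boxfile.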

I don't really understand this behavior. It's almost as if using a 
two-letter character combination in the boxfile short-circuits the ability 
of unicharambigs to identify and convert it. Maybe it's a result somehow of 
the timing of when things are done in the code. I don't know, but I wanted 
to put it out there.
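
To make the difference between the two box labels concrete, this small 
Python sketch lists the code points in each. It only illustrates that "ſh" 
is a two-code-point string while the MUFI character is a single one; it says 
nothing about how Tesseract treats either internally, and the helper name 
codepoints is hypothetical.

```python
def codepoints(label):
    """Return the code points of a box label in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in label]


two_letter_label = "\u017fh"  # long-s followed by h, as in the first boxfile
mufi_label = "\ueab1"         # single private-use character from MUFI

if __name__ == "__main__":
    print(codepoints(two_letter_label))
    print(codepoints(mufi_label))
```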

Matt

On Tuesday, May 7, 2013 3:57:22 AM UTC-5, Nick White wrote:
>
> Hi Matt, 
>
>
> > I'm also not sure how these two files are different, or if maybe 
> > DangAmbigs is from an earlier version of Tesseract or something. I'm 
> > using 3.02. 
>
> Yes, that guess was correct. unicharambigs used to be called DangAmbigs 
> before Tesseract 3. That is mentioned at: 
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>
> > answers another question I have about unicharambigs: is any 
> > ambiguity found taken into account before or after dictionary lookup. Is 
> > the unicharambigs processed before or after the dictionary is consulted? 
>
> I'm not sure, but I think the unicharambigs step happens before the 
> dictionary step. You'd have to check the code to be sure. 
>
> > Also, I'm finding unicharambigs only seems to really work when I've got 
> > more than one character on either side of the "equation". For single 
> > character substitutions (t -> r, or vice versa) it doesn't really work so 
> > well. I'm curious whether anyone else is finding the same thing. 
>
> I have found in general that using the '2' ('DEFINITE_AMBIG') option 
> didn't make as much difference as I was expecting. 
>
> Nick 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
