Thanks for the information, Nick. I ran my experiment, using the unicharambigs file to turn all my ligatures into modern character equivalents. It did not substantially improve the dictionary lookup results. I'll have to try increasing the confidence given to the dictionary using the parameters that I've found mentioned in this group (although I'm still trying to figure out which file those parameters are in). However, it does look like the unicharambigs substitution is done BEFORE the dictionary lookup, which is good to know/confirm.
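For anyone trying the same thing, here is a minimal sketch of how ligature substitution lines for a unicharambigs file could be generated. It assumes the layout described on the TrainingTesseract3 wiki page: a "v1" header line, then tab-separated fields (source count, space-separated source unichars, target count, space-separated target unichars, and a type flag where 1 means mandatory replacement). The names `ambig_line`, `build_unicharambigs`, and `LIGATURES` are hypothetical helpers for illustration and are not part of Tesseract.

```python
# Sketch only: the field layout and the "v1" header are assumptions taken
# from the Tesseract 3 training wiki, not verified against the source code.

def ambig_line(source, target, mandatory=True):
    """Build one unicharambigs entry.

    source and target are sequences of unichars (each item is one
    recognized unit, which may be a single ligature code point).
    """
    fields = [
        str(len(source)),
        " ".join(source),
        str(len(target)),
        " ".join(target),
        "1" if mandatory else "0",  # assumed: 1 = mandatory replacement
    ]
    return "\t".join(fields)


# Hypothetical example mapping: Unicode ligatures to modern letter pairs.
LIGATURES = [
    (["\ufb05"], ["s", "t"]),  # U+FB05 LATIN SMALL LIGATURE LONG S T
    (["\ufb06"], ["s", "t"]),  # U+FB06 LATIN SMALL LIGATURE ST
]


def build_unicharambigs(entries):
    """Return the full file text: version header plus one line per entry."""
    lines = ["v1"]
    lines.extend(ambig_line(src, tgt) for src, tgt in entries)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_unicharambigs(LIGATURES), end="")
```

Passing each side as a list of unichars, rather than a plain string, keeps the count field tied to recognized units instead of code points. That distinction matters when a box-file entry is made of more than one code point.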
One odd caveat on behavior with the unicharambigs work that I noticed: putting a bunch of lines like "1 ſt 2 s t 1" worked well. But I did have one instance where it did not work at all. In my boxfile I had a two-letter combination defined rather than a single, ligaturized character (i.e. a combo of long-s and h, "ſh", which is a ligature, but one which is not defined in the standard Unicode set). I had several occurrences of these in my training image, and in the boxfile the box value defined for this ligature was "U+017FU+0068". We have been told by folks at Google that doing this was OK, and indeed, it does work. Every instance of this ligature was correctly identified and turned into "ſh" in the result document. However, I could do nothing in the unicharambigs file to turn this into an "sh". The only way to get this to work was to change the boxfile to identify this ligature as a single character; in this case, I used the Medieval Unicode Font Initiative's (MUFI) value of U+EAB1. When I did that, I was then able to add "1 [U+EAB1] 2 s h 1" (where [U+EAB1] stands for the single private-use character with that Unicode value) to the unicharambigs file and get the results that I wanted.

I don't really understand this behavior. It's almost as if using a two-letter character combination in the boxfile short-circuits the ability of unicharambigs to identify and convert it. Maybe it's a result somehow of the timing of when things are done in the code. I don't know, but I wanted to put it out there.

Matt

On Tuesday, May 7, 2013 3:57:22 AM UTC-5, Nick White wrote:
> Hi Matt,
>
> > I'm also not sure how these two files are different, or if maybe DangAmbigs is
> > from an earlier version of Tesseract or something. I'm using 3.02.
>
> Yes, that guess was correct. unicharambigs used to be called DangAmbigs
> before Tesseract 3.
> That is mentioned at:
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> > answers another question I have about unicharambigs: is any
> > ambiguity found taken into account before or after dictionary lookup. Is the
> > unicharambigs processed before or after the dictionary is consulted?
>
> I'm not sure, but I think the unicharambigs step happens before the
> dictionary step. You'd have to check the code to be sure.
>
> > Also, I'm finding unicharambigs only seems to really work when I've got more
> > than one character on either side of the "equation". For single character
> > substitutions (t -> r, or vice versa) it doesn't really work so well. I'm
> > curious whether anyone else is finding the same thing.
>
> I have found in general that using the '2' ('DEFINITE_AMBIG') option
> didn't make as much difference as I was expecting.
>
> Nick

