Hi Matt,

The long-s/h ligature not working in unicharambigs sounds like a bug to
me. You should report it at:
https://code.google.com/p/tesseract-ocr/issues

As for using parameters to increase the dictionary confidence, it is a
bit awkward at the moment, but I just got a commit into the latest SVN
that simplifies things. The main way to do it is to add whatever
configuration you want (e.g. 'language_model_penalty_non_dict_word 0.5')
to $TESSDATA/configs/myfile and then call Tesseract like this:

    tesseract in.png out myfile

If you use the latest SVN you can alternatively do something like this:

    tesseract in.png out -c language_model_penalty_non_dict_word=0.5

Nick

On Tue, May 07, 2013 at 10:20:12AM -0700, matthew christy wrote:
> Thanks for the information Nick.
>
> I tried my experiment and used the unicharambigs file to turn all my
> ligatures into modern character equivalents. It did not substantially
> improve the dictionary lookup results. I'll have to try increasing my
> confidence in the dictionary using the parameters that I've found
> mentioned in this group (although I'm still trying to figure out what
> file those parameters are in). However, it does look like the
> unicharambigs stuff is done BEFORE the dictionary lookup, which is good
> to know/confirm.
>
> One odd caveat on behavior with the unicharambigs work that I noticed:
> putting a bunch of lines like "1 ſt 2 s t 1" worked well. But I did have
> one instance where it did not work at all. In my boxfile I had a
> two-letter combination defined rather than a single, ligaturized
> character (i.e. a combo of long-s and h "ſh", which is a ligature, but
> one which is not defined in the standard unicode set). I had several
> occurrences of these in my training image, and in the boxfile the box
> value defined for this ligature was "U+017FU+0068". We have been told by
> folks at Google that doing this was OK, and indeed, it does work. Every
> instance of this ligature was correctly identified and turned into "ſh"
> in the result document. However, I could do nothing in the
> unicharambigs file to turn this into an "sh".
> The only way to get this to work was to change the boxfile to identify
> this ligature as a single character; in this case, I used the Medieval
> Unicode Font Initiative's (MUFI) value of U+EAB1. When I did that I was
> then able to add "1 2 s h 1" (that unidentified character having the
> unicode value of U+EAB1) to the unicharambigs file and get the correct
> results that I wanted.
>
> I don't really understand this behavior. It's almost as if using a
> two-letter character combination in the boxfile short-circuits the
> ability of unicharambigs to identify and convert it. Maybe it's a result
> somehow of the timing of when things are done in the code. I don't know,
> but I wanted to put it out there.
>
> Matt
>
> On Tuesday, May 7, 2013 3:57:22 AM UTC-5, Nick White wrote:
> > Hi Matt,
> >
> > > I'm also not sure how these two files are different, or if maybe
> > > DangAmbigs is from an earlier version of Tesseract or something.
> > > I'm using 3.02.
> >
> > Yes, that guess was correct. unicharambigs used to be called
> > DangAmbigs before Tesseract 3. That is mentioned at:
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> >
> > > answers another question I have about unicharambigs: is any
> > > ambiguity found taken into account before or after dictionary
> > > lookup. Is the unicharambigs processed before or after the
> > > dictionary is consulted?
> >
> > I'm not sure, but I think the unicharambigs step happens before the
> > dictionary step. You'd have to check the code to be sure.
> >
> > > Also, I'm finding unicharambigs only seems to really work when I've
> > > got more than one character on either side of the "equation". For
> > > single character substitutions (t -> r, or vice versa) it doesn't
> > > really work so well. I'm curious whether anyone else is finding the
> > > same thing.
> >
> > I have found in general that using the '2' ('DEFINITE_AMBIG') option
> > didn't make as much difference as I was expecting.
> >
> > Nick

--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
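
The two ways of setting a parameter described at the top of this message can be sketched as a small shell script. This is only an illustration under stated assumptions: $TESSDATA stands for the tessdata directory as in the message, but the script points it at a scratch directory from mktemp so that no real installation is touched; "myfile", in.png and out are the placeholder names used above; and the tesseract commands are printed rather than run, so the script works without Tesseract or an input image.

```shell
#!/bin/sh
# Sketch only. TESSDATA is a placeholder for the tessdata directory;
# a scratch directory is used here instead of a real installation.
TESSDATA="$(mktemp -d)"
CONFIG_NAME="myfile"                      # placeholder name from the message
CONFIG_FILE="$TESSDATA/configs/$CONFIG_NAME"

PARAM="language_model_penalty_non_dict_word"
VALUE="0.5"

# Approach 1: put "name value" in a file under $TESSDATA/configs/.
mkdir -p "$TESSDATA/configs"
printf '%s %s\n' "$PARAM" "$VALUE" > "$CONFIG_FILE"

# The config file is then named as the last argument on the command line.
CMD_CONFIG="tesseract in.png out $CONFIG_NAME"

# Approach 2 (latest SVN only, per the message): pass name=value with -c.
CMD_INLINE="tesseract in.png out -c $PARAM=$VALUE"

# Print the commands instead of running them.
echo "config file written to: $CONFIG_FILE"
echo "$CMD_CONFIG"
echo "$CMD_INLINE"
```

To try it for real, point TESSDATA at the actual tessdata directory and run either printed command against a real image. Note the difference in syntax: the config file separates name and value with a space, while -c uses an equals sign.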

