Hi Matt,

The long-s + h ligature (ſh) not being handled by unicharambigs sounds
like a bug to me. You should report it at:
https://code.google.com/p/tesseract-ocr/issues

As for using parameters to increase the dictionary confidence: it is
a bit awkward at the moment, but I just got a commit into the latest
SVN that simplifies things. The main way to do it is to put whatever
configuration settings you want, one per line
(e.g. 'language_model_penalty_non_dict_word 0.5'), into
$TESSDATA/configs/myfile and then name that file as the last argument
when you call Tesseract:
  tesseract in.png out myfile
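
For example, a minimal sketch of creating that config file from the
shell. It assumes $TESSDATA points at your tessdata directory (it falls
back to a scratch directory here so nothing real gets touched), and
"myfile" is only a placeholder name:

```shell
# Assumption: TESSDATA is your tessdata directory; use a scratch dir if unset.
TESSDATA="${TESSDATA:-$(mktemp -d)}"
mkdir -p "$TESSDATA/configs"

# Config files hold one "name value" pair per line, separated by whitespace.
printf '%s\n' 'language_model_penalty_non_dict_word 0.5' \
  > "$TESSDATA/configs/myfile"

# Show what was written.
cat "$TESSDATA/configs/myfile"
```

More settings can go in the same file, one per line.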

If you use the latest SVN you can instead set the parameter directly on
the command line, something like this:

  tesseract in.png out -c language_model_penalty_non_dict_word=0.5
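
If you want to compare a few values, something along these lines might
help. It is only a sketch: the run_sweep function, the DRY_RUN switch
and the out_<value> output names are made up for illustration, and it
needs a build that has the -c option:

```shell
# Hypothetical helper: run Tesseract once per penalty value.
# With DRY_RUN set, each command is printed instead of executed.
run_sweep() {
  img="$1"; shift
  for p in "$@"; do
    ${DRY_RUN:+echo} tesseract "$img" "out_$p" \
      -c "language_model_penalty_non_dict_word=$p"
  done
}

# Print the commands that would run for three candidate values.
DRY_RUN=1 run_sweep in.png 0.3 0.5 0.7
```

Drop the DRY_RUN=1 prefix to actually run Tesseract; each run then
writes its own output file, so the results can be diffed.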

Nick

On Tue, May 07, 2013 at 10:20:12AM -0700, matthew christy wrote:
> Thanks for the information Nick.
> 
> I tried my experiment and used the unicharambigs file to turn all my ligatures
> into modern character equivalents. It did not substantially improve the
> dictionary lookup results. I'll have to try increasing my confidence in the
> dictionary using the parameters that I've found mentioned in this group
> (although I'm still trying to figure out what file those parameters are in).
> However, it does look like the unicharambigs stuff is done BEFORE the
> dictionary lookup, which is good to know/confirm.
> 
> One odd caveat on behavior with the unicharambigs work that I noticed: putting
> a bunch of lines like "1 ſt 2 s t 1" worked well. But I did have one instance
> where it did not work at all. In my boxfile I had a two-letter combination
> defined rather than a single, ligaturized character (i.e. a combo of long-s and
> h "ſh", which is a ligature, but one which is not defined in the standard
> Unicode set). I had several occurrences of these in my training image, and in
> the boxfile the box value defined for this ligature was "U+017FU+0068". We have
> been told by folks at Google that doing this was OK, and indeed, it does work.
> Every instance of this ligature was correctly identified and turned into "ſh"
> in the result document. However, I could do nothing in the unicharambigs file
> to turn this into an "sh". The only way to get this to work was to change the
> boxfile to identify this ligature as a single character; in this case, I used
> the Medieval Unicode Font Initiative's (MUFI) value of U+EAB1. When I did that
> I was then able to add "1  2 s h 1" (that unidentified character having the
> unicode value of U+EAB1) to the unicharambigs file and get the correct results
> that I wanted.
> 
> I don't really understand this behavior. It's almost as if using a two-letter
> character combination in the boxfile short-circuits the ability
> of unicharambigs to identify and convert it. Maybe it's a result somehow of the
> timing of when things are done in the code. I don't know, but I wanted to put
> it out there.
> 
> Matt
> 
> On Tuesday, May 7, 2013 3:57:22 AM UTC-5, Nick White wrote:
> 
>     Hi Matt,
> 
> 
>     > I'm also not sure how these two files are different, or if maybe
>     > DangAmbigs is from an earlier version of Tesseract or something. I'm
>     > using 3.02.
> 
>     Yes, that guess was correct. unicharambigs used to be called DangAmbigs
>     before Tesseract 3. That is mentioned at:
>     http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> 
>     > answers another question I have about unicharambigs: is any
>     > ambiguity found taken into account before or after dictionary lookup. Is
>     > the unicharambigs processed before or after the dictionary is consulted?
> 
>     I'm not sure, but I think the unicharambigs step happens before the
>     dictionary step. You'd have to check the code to be sure.
> 
>     > Also, I'm finding unicharambigs only seems to really work when I've got
>     > more than one character on either side of the "equation". For single
>     > character substitutions (t -> r, or vice versa) it doesn't really work
>     > so well. I'm curious whether anyone else is finding the same thing.
> 
>     I have found in general that using the '2' ('DEFINITE_AMBIG') option
>     didn't make as much difference as I was expecting.
> 
>     Nick
> 
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>  
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>  
>  


