Thanks for the feedback. I understand that a replacement is not guaranteed, but in the cases I've tested, it seems unusual that a replacement is *never* made. Consider this experiment I performed:
I scanned a sample one-page document. My dictionary contained all the words in the document, nothing more and nothing less. After performing OCR once on the document, I noted which characters were misrecognized and added rules for those characters to unicharambigs with a type indicator of 0 (NOT_AMBIG). I then performed OCR on the document again, and recognition did not improve at all. The results were the same with and without unicharambigs. Very strange.

On Monday, January 21, 2013 12:42:23 PM UTC+3, Nick White wrote:
>
> Hi Preston,
>
> > However, when I set the type indicator to 0 (NOT_AMBIG) the replacement
> > never happens even when a replacement would change a word from a
> > non-dictionary word into a dictionary word. I've assured that the
> > dictionary contains the necessary word.
>
> It's my understanding that type 0 doesn't necessarily ensure a
> potential change from a non-dictionary word to a dictionary word. It
> uses weighting to decide whether to make the change, so for example
> if it's pretty confident (however erroneously) that e.g. the
> characters are 'c l' and not 'd', due to spacing or whatever, it
> won't necessarily make the switch. That said I haven't tested it too
> much, or read the code. But that would explain why it isn't always
> working where you expect it to.
>
> > Furthermore 2 (DEFINITE_AMBIG), 3 (SIMILAR_AMBIG) and 4 (CASE_AMBIG) don't
> > seem to have any effect, though I'm not clear what they're supposed to do
> > anyways.
>
> Yes, it would be great to get some proper documentation on these. I
> also don't have a good idea of what they're supposed to do (though
> they are used in some of the .traineddata files).
>
> Hope this helps.
>
> Nick
>
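For concreteness, here is a rough sketch of how a rule of the kind discussed above could be generated. It uses the 'c l' versus 'd' pair from the quoted message as a stand-in and is not taken from the actual test file. It assumes the tab-separated "v1" layout described in the Tesseract training documentation (source count, space-separated source characters, target count, space-separated target characters, type indicator). The helper names `format_ambig_rule` and `build_unicharambigs` are made up for illustration and are not part of Tesseract.

```python
# Sketch only: builds unicharambigs text in the assumed "v1" layout.
# The function names are hypothetical helpers, not Tesseract APIs.

def format_ambig_rule(source, target, ambig_type=0):
    """Return one tab-separated rule line.

    Assumes each Python character is a single unichar, which may not hold
    for scripts where one unichar spans several code points.
    """
    src = list(source)
    tgt = list(target)
    fields = [
        str(len(src)),    # number of characters in the source sequence
        " ".join(src),    # source characters, space separated
        str(len(tgt)),    # number of characters in the target sequence
        " ".join(tgt),    # target characters, space separated
        str(ambig_type),  # type indicator, e.g. 0 (NOT_AMBIG)
    ]
    return "\t".join(fields)


def build_unicharambigs(rules):
    """Return the file contents: a version line followed by one line per rule."""
    lines = ["v1"]
    lines.extend(format_ambig_rule(s, t, k) for s, t, k in rules)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical rule: 'c l' may be a misreading of 'd', type 0.
    print(build_unicharambigs([("cl", "d", 0)]), end="")
```

Generating the file this way at least rules out formatting mistakes (spaces where tabs are expected, wrong character counts) as the reason a rule appears to have no effect.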
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

