Thanks for the feedback.

I understand that a replacement is not guaranteed, but in the cases I've 
tested, it seems unusual that a replacement is *never* made. Consider this 
experiment I performed:

I scanned a sample one-page document. My dictionary contained all the words 
in the document -- nothing more, nothing less. After performing OCR once on 
the document, I noted which characters were misrecognized and added rules 
for those characters to unicharambigs with a type indicator of 0 (NOT_AMBIG). 
I then performed OCR on the document again, and recognition did not improve 
at all. Results were the same with and without unicharambigs.
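
The experiment above can be sketched roughly as follows. This is an
illustration, not the script actually used: the function names are
hypothetical, and the tab-separated "v1" field layout (source length, source
characters, target length, target characters, type indicator) is an
assumption based on the Tesseract training documentation, so check it against
the version you are running. It diffs the OCR output against the known text
and turns each substitution into a candidate unicharambigs rule.

```python
import difflib


def find_substitutions(truth, ocr):
    """Return (ocr_fragment, truth_fragment) pairs where the OCR text differs.

    Only pure substitutions are reported. Spans containing whitespace are
    skipped, a simplification, because the assumed rule format separates
    characters with spaces.
    """
    matcher = difflib.SequenceMatcher(None, ocr, truth, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        wrong, right = ocr[i1:i2], truth[j1:j2]
        if any(ch.isspace() for ch in wrong + right):
            continue
        pairs.append((wrong, right))
    return pairs


def ambig_line(source, target, type_indicator=0):
    """Format one rule in the assumed v1 layout (tab-separated fields).

    Assumption: each Python character corresponds to one unicharset entry,
    which may not hold for scripts with multi-codepoint units.
    """
    fields = [
        str(len(source)),
        " ".join(source),
        str(len(target)),
        " ".join(target),
        str(type_indicator),
    ]
    return "\t".join(fields)


def build_unicharambigs(pairs, type_indicator=0):
    """Build the text of a unicharambigs file, dropping duplicate rules."""
    lines = ["v1"]
    seen = set()
    for source, target in pairs:
        if (source, target) in seen:
            continue
        seen.add((source, target))
        lines.append(ambig_line(source, target, type_indicator))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy strings standing in for the scanned page and its OCR output.
    rules = find_substitutions(truth="modern", ocr="modem")
    print(build_unicharambigs(rules), end="")
```

Running OCR with and without the generated file and repeating the diff gives a
simple count of remaining substitutions to compare between the two runs.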

Very strange.

On Monday, January 21, 2013 12:42:23 PM UTC+3, Nick White wrote:
>
> Hi Preston, 
>
> > However, when I set the type indicator to 0 (NOT_AMBIG) the replacement 
> > never happens even when a replacement would change a word from a 
> > non-dictionary word into a dictionary word. I've ensured that the 
> > dictionary contains the necessary word. 
>
> It's my understanding that type 0 doesn't necessarily ensure a 
> potential change from a non-dictionary word to a dictionary word. It 
> uses weighting to decide whether to make the change, so for example 
> if it's pretty confident (however erroneously) that e.g. the 
> characters are 'c l' and not 'd', due to spacing or whatever, it 
> won't necessarily make the switch. That said I haven't tested it too 
> much, or read the code. But that would explain why it isn't always 
> working where you expect it to. 
>
> > Furthermore 2 (DEFINITE_AMBIG), 3 (SIMILAR_AMBIG) and 4 (CASE_AMBIG) 
> > don't seem to have any effect, though I'm not clear what they're 
> > supposed to do anyways. 
>
> Yes, it would be great to get some proper documentation on these. I 
> also don't have a good idea of what they're supposed to do (though 
> they are used in some of the .traineddata files). 
>
> Hope this helps. 
>
> Nick 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
