Re: Character confidence level/ probability score using Tesseract 2.04

patrickq Mon, 18 Jan 2010 17:56:28 -0800

I have been dutifully gathering and storing confidence values in my
application in case these values one day become reliable. In my own
experience, however, they are not usable: I have routinely seen higher
(meaning less reliable) numbers for characters that were in fact
recognized better than the characters returned for the same section of
the image (but scanned differently).


In addition, I think the confidence numbers are set to the same value
for all characters in the same word.

Unfortunately, I am therefore ignoring these numbers completely and
applying different logic (such as examining the percentage of
non-letter characters).
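
To illustrate the kind of check I mean, here is a minimal Python sketch.
It is not my actual application code; the function names and the 0.3
threshold are made up for this example and would need tuning for real
input.

```python
def non_letter_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    non_letters = sum(1 for c in chars if not c.isalpha())
    return non_letters / len(chars)


def looks_unreliable(text: str, threshold: float = 0.3) -> bool:
    """Flag OCR output whose share of non-letter characters exceeds the threshold.

    The default threshold is an arbitrary illustrative value, not a
    recommendation.
    """
    return non_letter_ratio(text) > threshold
```

This only makes sense for text that is expected to be mostly letters.
For numeric input such as "09063" the test would have to count
non-digits instead.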

Disclaimer: it is certainly possible that my findings are caused by
some error on my part; Tesseract is still very much a black box to me.

Patrick

On Jan 18, 2:05 pm, Nik <[email protected]> wrote:
> Hi,
>         I am using Tesseract version 2.04 and trying to extract the
> confidence level for each character. There has been a previous
> discussion about this issue, but it hasn't been active for the past
> two and a half years, so I wanted to get some new input.
>
> The previous thread was:
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/1cd...
>
> Tesseract works fine for the most part. However, when a certain
> character is not recognized, it chooses the most likely option out of
> the character set and prints it. In this case I would like to output
> an error or a special character whenever a character in the input
> file cannot be recognized with a certain confidence level.
>
> I have been able to follow the previous thread (thanks to all the
> members) and have been able to print a final file containing the
> probability of each character. But I don't know how to make sense of
> the different iterations that take place to correct an image to
> improve its clarity and matching characteristics.
>
> If someone could explain the format in which the traces are printed by
> the tprintf function, it would be greatly appreciated.
>
> Example output for an image containing "09063" as input :
>
> Tesseract Open Source OCR Engine
> chop_word:
> 10.79 -2.03 : 0 [30 ]0
> chop_word:
> 6.03 -1.49 : 9 [39 ]0
> chop_word:
> 8.08 -1.52 : 0 [30 ]0
> chop_word:
> 16.86 -3.94 : 6 [36 ]0
> chop_word:
> 5.20 -1.12 : 3 [33 ]0
> improve 1:
> 20.42 -5.92 : 6 [36 ]0
> improve 2:
> 16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
> pieces:
> 80.98 -9.23 : 0 [30 ]0
> pieces:
> 58.07 -9.68 : 3 [33 ]0
> rebuild
> 16.86 -3.94 : 6 [36 ]0
> chop_word:
> 0.42 -0.08 : 0 [30 ]0
> chop_word:
> 6.03 -1.49 : 9 [39 ]0
> chop_word:
> 6.14 -1.15 : 0 [30 ]0
> chop_word:
> 16.86 -3.94 : 6 [36 ]0
> chop_word:
> 5.20 -1.12 : 3 [33 ]0
> improve 1:
> 20.42 -5.92 : 6 [36 ]0
> improve 2:
> 16.65 -12.33 : : [3a ] 17.86 -13.23 : 0 [30 ]0
> pieces:
> 80.98 -9.23 : 0 [30 ]0
> pieces:
> 58.07 -9.68 : 3 [33 ]0
> rebuild
> 16.86 -3.94 : 6 [36 ]0
>
> Thanks,
>  Nik
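
For anyone who wants to compare the passes in a trace like the one
quoted above, here is a rough Python sketch that splits it into
records. It assumes only what is visible in the sample: a label line
(such as "chop_word:" or "rebuild") followed by entries of the form
"number number : char [hex ]", where the hex value is the character
code (30 is "0", 3a is ":"). What the two numbers mean is not settled
in this thread, so the sketch keeps them as plain `first` and `second`
fields. The names `Choice` and `parse_trace` are invented for this
example.

```python
import re
from typing import List, NamedTuple, Optional


class Choice(NamedTuple):
    label: Optional[str]  # most recent label line, e.g. "chop_word"
    first: float          # first number on the entry (meaning not assumed)
    second: float         # second number on the entry (meaning not assumed)
    char: str             # character decoded from the bracketed hex code


# One entry looks like: "16.65 -12.33 : : [3a ]"
ENTRY = re.compile(
    r"(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+:\s+\S+\s+\[([0-9a-fA-F]+)\s*\]"
)


def parse_trace(text: str) -> List[Choice]:
    """Split a trace dump into one Choice per entry, tagged with its label."""
    label = None
    choices = []
    for raw in text.splitlines():
        # Drop email quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        if not line:
            continue
        matches = list(ENTRY.finditer(line))
        if matches:
            for m in matches:
                char = chr(int(m.group(3), 16))
                choices.append(
                    Choice(label, float(m.group(1)), float(m.group(2)), char)
                )
        else:
            # Any line without an entry is treated as a label.
            label = line.rstrip(":")
    return choices
```

Grouping the resulting records by `label` makes it easier to see which
entries change between the first and second run in the sample.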
