Zdenko Podobný wrote:
As you can see, the dictionary improved the result, especially in the case of "l" vs. "1".

I can see that "all" with lowercase L is better than
"aII" with uppercase i.

But is there some algorithm, or some existing program,
that can tell me that the OCR accuracy in this case
is 33%, because just one out of three letters was right?

My naive idea of such an algorithm is this:

#!/bin/sh
echo "all types of files" >correct.txt
echo "aII types of fiIes" >ocr.txt
# byte count of the correct text (includes the trailing newline)
size=`wc -c < correct.txt`
# put one character per line; \n in the replacement needs GNU sed
sed 's/\(.\)/\1\n/g' correct.txt >temp
# count the lines diff marks as only in the correct text ("<"),
# i.e. characters missing or misread in the OCR output
error=`sed 's/\(.\)/\1\n/g' ocr.txt | diff temp - | grep -c '^<'`
echo $error $size | awk '{printf "%.2f %% accuracy\n", 100*(1.0 - $1/$2); }'

This example prints "84.21 % accuracy".

Is that the right way to do it? Should whitespace
be excluded when computing the accuracy?
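
If whitespace should not count, I suppose the texts could be
normalised before the comparison, something like this (the
.nows file names are just made up):

# delete spaces, tabs and newlines before comparing (POSIX tr)
tr -d ' \t\n' <correct.txt >correct.nows
tr -d ' \t\n' <ocr.txt >ocr.nows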

In my example, only missing or misread characters are
counted as errors; extra characters inserted by the OCR
are not.
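
If insertions should count too, maybe what I want is the
Levenshtein edit distance, where each substitution, deletion
and insertion counts as one error. Here is a rough awk sketch
of that idea (just my naive attempt, not an existing program;
it assumes one line of text per file):

#!/bin/sh
awk '
NR == 1 { ref = $0 }   # the line from correct.txt
NR == 2 { ocr = $0 }   # the line from ocr.txt
END {
    n = length(ref); m = length(ocr)
    # d[i,j] = edit distance between the first i characters
    # of ref and the first j characters of ocr
    for (i = 0; i <= n; i++) d[i,0] = i
    for (j = 0; j <= m; j++) d[0,j] = j
    for (i = 1; i <= n; i++)
        for (j = 1; j <= m; j++) {
            cost = (substr(ref, i, 1) == substr(ocr, j, 1)) ? 0 : 1
            best = d[i-1,j] + 1                      # deletion
            if (d[i,j-1] + 1 < best)
                best = d[i,j-1] + 1                  # insertion
            if (d[i-1,j-1] + cost < best)
                best = d[i-1,j-1] + cost             # substitution
            d[i,j] = best
        }
    printf "%.2f %% accuracy\n", 100 * (1.0 - d[n,m] / n)
}' correct.txt ocr.txt

On the same example this prints "83.33 % accuracy": the three
misread letters each count as one error, and the trailing
newline is no longer part of the count (awk strips it), so the
distance is 3 out of 18 characters.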


--
 Lars Aronsson ([email protected])
 Aronsson Datateknik - http://aronsson.se

