On Thursday, May 9, 2013 10:31:30 AM UTC+5:30, zdenop wrote:
>
> On Wed, May 8, 2013 at 7:09 PM, Shree Devi Kumar
> <[email protected]> wrote:
>
>> Hi Nick,
>>
>> The testing reports are going to be helpful :-) Thanks!
>>
>> I just used the ocrevalutf8 command directly to test one set of files. I
>> haven't tried the batch on multiple files in a directory; hopefully that
>> will also work on Cygwin.
>>
>> Is there a way I can use existing OCRed output files in a directory and
>> compare them to the groundtruth files without running tesseract again?
>>
>> In DOS, I can use something like:
>>
>> for /f "delims=|" %%F in ('dir san.input.*.txt /b') do ( .... )
>>
>>
> Describe the logic/aim of this and we can help you with bash (or other
> Linux/Unix tools)...
>
Hi Zdenko and Nick,

Here are the results from the testing suite, using a small sample of test
data for Devanagari script (Hindi language) in the sanskrit2003 font. I used
the accsum and wordaccsum programs to add up the info from multiple accuracy
reports.

I would like to use the results to help create additional training texts.
Specifically, I would like to delete the lines that have 100% recognition,
so that what is left are the lines in error. The per-word lines in the
wordacc reports look like this (count, missed, %right, word):
3 0 100.00 ख्य
2 0 100.00 ख्या
1 0 100.00 ख्याल
1 1 0.00 ख्यि
1 1 0.00 ख्यी
1 0 100.00 ख्र
1 0 100.00 ख्व
1 0 100.00 ख्वा
2 2 0.00 ख्स
1 1 0.00 ख्सि
1 0 100.00 खड़े
5 0 100.00 ग
2 2 0.00 गँ
3 0 100.00 गं
1 0 100.00 गंभीर
It should be easy to say: ignore all lines that have 100.00 in them.
Can you tell me what command I can use on a Win7 Cygwin installation to
take the report and output just the text in error?
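
One possible approach, as a minimal sketch rather than a tested recipe: awk (included in a standard Cygwin install) can drop every line whose third field is 100.00 and print only the word. The function name errors_only and the file names are hypothetical, and the sketch assumes the input holds only per-word lines of the form "Count Missed %Right Word" like the sample above; the header and summary tables of a full wordacc report would need to be trimmed first.

```shell
# Hypothetical helper: print only the misrecognized words from wordacc-style
# lines. Assumes each line is "Count Missed %Right Word", whitespace-separated,
# as in the sample above; other report sections are not handled.
errors_only() {
    # Skip short/blank lines, drop lines whose third field is exactly
    # "100.00", and print the fourth field (the word) for the rest.
    awk 'NF >= 4 && $3 != "100.00" { print $4 }' "$@"
}

# Example (file names are placeholders):
#   errors_only wordacc-lines.txt > words-in-error.txt
```

If the whole line is wanted instead of just the word, `grep -v '100\.00' wordacc-lines.txt` keeps every line that does not contain 100.00, though it would also drop a line where 100.00 happens to appear in another column.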
And, yes, here are the accuracy results:
UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
   32220   Characters
    1122   Errors
   96.52%  Accuracy

       1   Reject Characters
       1   Suspect Markers
       1   False Marks
    0.01%  Characters Marked
   96.52%  Accuracy After Correction

     Ins    Subst      Del   Errors
       0        0        0        0   Marked
     324      310      488     1122   Unmarked
     324      310      488     1122   Total

   Count   Missed   %Right
       3        3     0.00   Unassigned
    6775      175    97.42   ASCII Spacing Characters
      31        2    93.55   ASCII Special Symbols
       6        0   100.00   ASCII Digits
      26        6    76.92   ASCII Uppercase Letters
      26        1    96.15   ASCII Lowercase Letters
   25340      445    98.24   Devanagari
       9        2    77.78   General Punctuation
       4        0   100.00   Halfwidth and Fullwidth Forms
   32220      634    98.03   Total

And, here are the word accuracy results:
UNLV-ISRI OCR Word Accuracy Report Version 5.1
----------------------------------------------
    6770   Words
     673   Misrecognized
   90.06%  Accuracy

Stopwords
   Count   Missed   %Right   Length
      12        0   100.00   6
     288        4    98.61   12
     208        1    99.52   18
     254        7    97.24   24
      38        0   100.00   30
      18        0   100.00   36
     818       12    98.53   Total

Non-stopwords
   Count   Missed   %Right   Length
      81        4    95.06   1
     283       22    92.23   6
     703      147    79.09   12
    1555      181    88.36   18
    1732      175    89.90   24
       1        1     0.00   25
     856       64    92.52   30
     331       26    92.15   36
     203       22    89.16   42
     207       19    90.82   48
    5952      661    88.89   Total

Distinct Non-stopwords
   Count   Missed   %Right   Occurs
    5405      608    88.75   1
     183        7    96.17   2
      12        2    83.33   3
       4        1    75.00   4
       3        1    66.67   5
       1        0   100.00   6
       2        1    50.00   7
       2        0   100.00   >10
    5612      620    88.95   Total

Phrases
   Count   Missed   %Right   Length
    6770      673    90.06   1
    6760     1093    83.83   2
    6750     1472    78.19   3
    6740     1813    73.10   4
    6730     2121    68.48   5
    6720     2405    64.21   6
    6710     2665    60.28   7
    6700     2896    56.78   8
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en