Re: tesseract testing suite

sdk Mon, 03 Jun 2013 00:43:45 -0700


On Thursday, May 9, 2013 10:31:30 AM UTC+5:30, zdenop wrote:
>
> On Wed, May 8, 2013 at 7:09 PM, Shree Devi Kumar <[email protected]> wrote:
>
>> Hi Nick,
>>
>> The testing reports are going to be helpful :-) Thanks!
>>
>> I just used the ocrevalutf8 command directly to test one set of files. I 
>> haven't tried the batch run on multiple files in a directory; hopefully that 
>> will also work on Cygwin.
>>
>> Is there a way I can use existing OCRed output files in a directory and 
>> compare them to the groundtruth files without running tesseract again? 
>>
>> In DOS, I can use something like:
>>
>> for /f "delims=|" %%F in ('dir san.input.*.txt /b') do ( .... )
>>
>>


> Describe the logic/aim of this and we can help you with bash (or other 
> Linux/Unix tools)...
>


Hi Zdenko and Nick,

Here are the results from the testing suite, using a small sample of test 
data for Devanagari script (Hindi language) in the sanskrit2003 font. I used 
the accsum and wordaccsum programs to combine the figures from multiple 
accuracy reports.

I would like to use the results to help create additional training texts. 
Specifically, I would like to delete the lines that show 100% recognition, 
so that only the words in error remain. The per-word lines in the wordacc 
reports look like this:

       3        0   100.00   ख्य
       2        0   100.00   ख्या
       1        0   100.00   ख्याल
       1        1     0.00   ख्यि
       1        1     0.00   ख्यी
       1        0   100.00   ख्र
       1        0   100.00   ख्व
       1        0   100.00   ख्वा
       2        2     0.00   ख्स
       1        1     0.00   ख्सि
       1        0   100.00   खड़े
       5        0   100.00   ग
       2        2     0.00   गँ
       3        0   100.00   गं
       1        0   100.00   गंभीर

It should be easy to say "ignore all lines that have 100.00 in them".

Can you tell me what command I can use on a Win7 Cygwin installation to 
take the report and output just the text in error?
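
One possible filter, as a minimal sketch only. It assumes Python 3.7 or later is available under Cygwin, that the report is UTF-8, and that each per-word row has exactly four whitespace-separated columns (Count, Missed, %Right, word). The names words_in_error and filter_report, and the file names in the usage comment, are hypothetical and not part of the ISRI tools. It tests the Missed column rather than searching for the string 100.00, and it skips the summary tables, whose last column is a number, ">10", or "Total".

```python
def _is_summary_label(field):
    # Assumption: summary tables end in a length/occurrence number,
    # ">10", or "Total", while per-word rows end in the word itself.
    return field == "Total" or (field.isascii() and field.lstrip(">").isdigit())


def words_in_error(lines):
    """Yield the word from each per-word row whose Missed count is non-zero."""
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # headings, blank lines, two-column totals
        count, missed, _pct, word = fields
        if not (count.isdigit() and missed.isdigit()):
            continue  # column headers such as "Count Missed %Right Length"
        if _is_summary_label(word):
            continue  # rows of the summary tables, not real words
        if int(missed) > 0:
            yield word


def filter_report(in_path, out_path):
    """Write one misrecognized word per line from a wordacc report."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for word in words_in_error(src):
            dst.write(word + "\n")


# Hypothetical usage:
# filter_report("wordacc_report.txt", "words_in_error.txt")
```

One limitation of this sketch: a real word made up only of ASCII digits would be skipped along with the summary rows.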

And, yes, here are the accuracy results:

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
   32220   Characters
    1122   Errors
   96.52%  Accuracy

       1   Reject Characters
       1   Suspect Markers
       1   False Marks
    0.01%  Characters Marked
   96.52%  Accuracy After Correction

     Ins    Subst      Del   Errors
       0        0        0        0   Marked
     324      310      488     1122   Unmarked
     324      310      488     1122   Total

   Count   Missed   %Right
       3        3     0.00   Unassigned
    6775      175    97.42   ASCII Spacing Characters
      31        2    93.55   ASCII Special Symbols
       6        0   100.00   ASCII Digits
      26        6    76.92   ASCII Uppercase Letters
      26        1    96.15   ASCII Lowercase Letters
   25340      445    98.24   Devanagari
       9        2    77.78   General Punctuation
       4        0   100.00   Halfwidth and Fullwidth Forms
   32220      634    98.03   Total

And, here are the word accuracy results:

UNLV-ISRI OCR Word Accuracy Report Version 5.1
----------------------------------------------
    6770   Words
     673   Misrecognized
   90.06%  Accuracy

Stopwords
   Count   Missed   %Right   Length
      12        0   100.00        6
     288        4    98.61       12
     208        1    99.52       18
     254        7    97.24       24
      38        0   100.00       30
      18        0   100.00       36
     818       12    98.53    Total

Non-stopwords
   Count   Missed   %Right   Length
      81        4    95.06        1
     283       22    92.23        6
     703      147    79.09       12
    1555      181    88.36       18
    1732      175    89.90       24
       1        1     0.00       25
     856       64    92.52       30
     331       26    92.15       36
     203       22    89.16       42
     207       19    90.82       48
    5952      661    88.89    Total

Distinct Non-stopwords
   Count   Missed   %Right   Occurs
    5405      608    88.75        1
     183        7    96.17        2
      12        2    83.33        3
       4        1    75.00        4
       3        1    66.67        5
       1        0   100.00        6
       2        1    50.00        7
       2        0   100.00      >10
    5612      620    88.95    Total

Phrases
   Count   Missed   %Right   Length
    6770      673    90.06        1
    6760     1093    83.83        2
    6750     1472    78.19        3
    6740     1813    73.10        4
    6730     2121    68.48        5
    6720     2405    64.21        6
    6710     2665    60.28        7
    6700     2896    56.78        8




 

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

