Re: tesseract testing suite

Shree Devi Kumar Mon, 03 Jun 2013 09:58:40 -0700

Thanks, Nick,

Unix is indeed cool, when one knows how :-)


Thanks so much for the commands. Appreciate the help.

Shree

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Mon, Jun 3, 2013 at 4:34 PM, Nick White <[email protected]> wrote:

> Hi,
>
> I'm very glad you're finding the test suite useful :)
>
> I'll reply to you below.
>
> On Mon, Jun 03, 2013 at 12:32:32AM -0700, sdk wrote:
> > I would like to use the results to help create additional training texts.
> > Specifically, I would like to delete the lines which have 100% recognition,
> > so that what is left are the lines in error from the wordacc reports, which
> > look like:
> >
> >        3        0   100.00   ख्य
> >        2        0   100.00   ख्या
> >        1        0   100.00   ख्याल
> >        1        1     0.00   ख्यि
> >        1        1     0.00   ख्यी
> >        1        0   100.00   ख्र
> >        1        0   100.00   ख्व
> >        1        0   100.00   ख्वा
> >        2        2     0.00   ख्स
> >        1        1     0.00   ख्सि
> >        1        0   100.00   खड़े
> >        5        0   100.00   ग
> >        2        2     0.00   गँ
> >        3        0   100.00   गं
> >        1        0   100.00   गंभीर
> >
> > It should be easy to say, ignore all lines that have 100.00 in them.
> >
> > Can you tell me what command I can use on a Win7 Cygwin installation to
> > take the report and output just the text in error?
>
> Sure. As you're in cygwin this is pretty easy, as it's exactly the
> sort of thing unix tools are good for.
>
> If you just want to remove all lines which have 100% recognition,
> you can add an 'awk' command like this:
>
> ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' > results.txt
>
> or if you've already got a results file you want to change, you can
> do this:
>
> awk '$3 != 100 {print $0}' < results.txt > newresults.txt
>
> If you only want the last sections where things are broken down by
> word, you can add a sed command, like this:
>
> ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^   Count   Missed %Right $/,$ !d' | awk '$3 != 100 {print $0}' > results.txt
>
> See, isn't unix cool? :)
>
> Your accuracy results look great - good job on the training!
>
> I hope this helps, let me know if you need anything else.
>
> Nick
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>
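
For anyone who wants to try the quoted pipeline without running ocrevalutf8 first, here is a minimal sketch on a made-up miniature report. The sample text, the ASCII stand-in words, and the loose header match '^ *Count' are assumptions for illustration, not the tool's exact output. It uses 'sed -n ... p' in place of the '!d' form quoted above (both keep the header line through the end of input) and prints only the fourth field, so that just the words in error remain.

```shell
#!/bin/sh
# Hypothetical miniature report: a summary line, a header, then per-word rows
# (count, missed, percent right, word). ASCII words stand in for real output.
report='summary text that should be skipped
   Count   Missed   %Right
       3        0   100.00   alpha
       1        1     0.00   beta
       2        2     0.00   gamma
       5        0   100.00   delta'

# Keep the per-word section only: print from the header line to end of input.
section=$(printf '%s\n' "$report" | sed -n '/^ *Count/,$p')

# From that section, take rows whose third field is a number other than 100
# and print only the word in the fourth field (this also skips the header).
words=$(printf '%s\n' "$section" | awk '$3 ~ /^[0-9.]+$/ && $3 != 100 {print $4}')

printf '%s\n' "$words"
```

On a real report the same two filters would sit after 'ocrevalutf8 wordacc ground.txt ocr.txt |', with the header pattern adjusted to match whatever the report actually prints.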

