Re: Tess v3 not recognising accented Esperanto characters.

Nick White Thu, 04 Oct 2012 04:00:14 -0700

On Wed, Oct 03, 2012 at 08:08:16PM -0700, Donaldo wrote:
> This time I got 0.3% character errors (cf. 1.3% before).


Awesome!

> Here are some of the main errors:
> 
> 10 -> l0
> 0,1 % -> O,l %
> 0,0000001 -> 0,000000l
> [It apparently is stuck on USA decimal separator, so it thinks that this is 
> alphabetic?]

You may well be able to improve these by creating a number-dawg
file. I can't find documentation for it now, but hopefully searching
the mailing list should provide something. Use the English one as a
guide (indeed you may well be able to copy it wholesale.)

To see the English one, unpack the eng.traineddata with
  combine_tessdata -u eng.traineddata eng.
and then undawg the number-dawg file with
  dawg2wordlist eng.unicharset eng.number-dawg numbers
and the 'numbers' file is what you want.
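As a rough sketch of the whole round trip, assuming your language code is
"epo" and that epo.unicharset and epo.traineddata from your training run are
in the current directory (both names are my guesses, adjust to taste), it
would look something like the following. I haven't run this against your
data, so treat it as an outline rather than a recipe.

```shell
# Assumption: "epo" is the language code used for the new traineddata.
lang=epo

# Unpack English, dump its number patterns to the text file 'numbers',
# rebuild them against the new unicharset, then write the result into
# the existing traineddata. The chain stops at the first failing step.
combine_tessdata -u eng.traineddata eng. &&
dawg2wordlist eng.unicharset eng.number-dawg numbers &&
wordlist2dawg numbers "$lang.number-dawg" "$lang.unicharset" &&
combine_tessdata -o "$lang.traineddata" "$lang.number-dawg"
```

Since your texts use a decimal comma, it is probably worth looking through
the 'numbers' file before the wordlist2dawg step and adding comma variants
of any patterns that only use a full stop.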

> tro -> tro Ŭ  [strange character at end of line.]

I suspect that character was a smudge or mark on the paper after
the real text? If so, more preprocessing would be a solution; I don't
know whether Tesseract can be made stricter about ignoring such marks.
With your accuracy so high already though, I wouldn't worry about
it.

What are you using to test the accuracy, by the way?

I don't know the answers to your other questions. Hopefully others
may be able to help.

Good job though, and thanks again for sharing the process!

Nick

