Re: How to improve recognition on TIFF black-and white Romanian text?

Nick White Wed, 22 Aug 2012 08:07:35 -0700

On Wed, Aug 22, 2012 at 05:50:06PM +0300, Jani Monoses wrote:
> So there's no way of just adding new words to the existing dictionary
> without redoing the whole training?


Yes, there is. Create a ron.user-words file in your tessdata
directory, along with a config file containing:

user_words_suffix    user-words

(I think the config file is needed, but I'm not sure.) The
ron.user-words file should contain a list of words, one per line,
encoded as UTF-8.
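
A minimal sketch of the steps above, in case it helps. The config
file name (ron_user), its location under tessdata/configs, and the
sample words are my assumptions, not something confirmed in this
thread; a temporary directory stands in for tessdata unless
TESSDATA_DIR is already set.

```shell
# Assumption: TESSDATA_DIR is your tessdata directory. A temporary
# directory is used when it is unset, so the sketch is safe to run as-is.
TESSDATA_DIR="${TESSDATA_DIR:-$(mktemp -d)}"
mkdir -p "$TESSDATA_DIR/configs"

# Word list: one word per line, UTF-8 encoded. The words are placeholders.
printf '%s\n' 'știință' 'țară' 'împărat' > "$TESSDATA_DIR/ron.user-words"

# Config file (hypothetical name "ron_user") naming the word-list suffix.
printf 'user_words_suffix user-words\n' > "$TESSDATA_DIR/configs/ron_user"

# Then pass the config name on the command line, roughly like this
# (input.tif and output are placeholders):
#   tesseract input.tif output -l ron ron_user
```

If the extra words make no difference, that may mean the config file
is not being picked up, so try passing its full path instead of just
the name.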

> Are there any other tunables such as the above that you think may be
> worth looking into?

I found 'enable_new_segsearch 1' to be very helpful, but it might
already be enabled for Romanian (run combine_tessdata -u and check
the unpacked .config file if you want to see). Other than that, I
can't really advise. Most of the configuration variables are
undocumented, so they're in the realm of "black magic". Running

  grep -R VAR_H * | grep -v '^Binary ' | grep -v 'svn-base'

on the source tree will give you a listing of things to try if you
feel like exploring.
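
For checking whether a variable is already set, something like the
following might do. It is only a sketch: show_setting is a helper name
I made up, and it assumes ron.traineddata sits in the current
directory and that unpacking it yields a ron.config file.

```shell
# show_setting FILE NAME: print the line that sets NAME in a config
# file, or a note if NAME does not appear. (Hypothetical helper.)
show_setting() {
    grep -E "^$2[[:space:]]" "$1" || echo "$2 is not set in $1"
}

# Unpack the language data into ron.* component files, then inspect
# ron.config. Skipped quietly if the tool or the data file is missing.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f ron.traineddata ]; then
    combine_tessdata -u ron.traineddata ron.
    if [ -f ron.config ]; then
        show_setting ron.config enable_new_segsearch
    fi
fi
```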

Nick

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
