Re: Tess v3 not recognising accented Esperanto characters.

Nick White Tue, 02 Oct 2012 02:33:36 -0700

Hi Donaldo,

It's great to hear how you're getting on. Thanks for sharing in so
much detail!

I'll reply / comment below.

On Mon, Oct 01, 2012 at 04:04:36PM -0700, Donaldo wrote:
> I ran tesseract to train it up on a few fonts. The txt files produced were
> full of blank characters. It seems to be important to separate the tokens in
> each file name with a hyphen.

You mean with lazytrain? Can you explain further? I'm not following.

> Running mftraining produced shapetable file which is not mentioned in the
> documentation, as well as epo.unicharset, pffmtable, inttemp; cftraining
> produced normproto.

Yep, shapetable will be added to the documentation once 3.02 is
released (I presume). It is new to 3.02, which is why it isn't there
yet.

> I found a comment on the tesseract-ocr group that it is better to use png
> files.

Yes. TIFF files are somewhat unreliable just because there are so
many different types of TIFF. png is indeed better.

> Results: 1.5% character errors. Most accented letters recognised. Frequent
> errors: l → I, e → c, il → ü, li → h, o → O

Great! I'm happy to hear that.

> What should I do next? Dictionaries? I have a list of nearly 500,000 Esperanto
> words. Is that too big? Ambigs?

Yes, word lists and ambigs are indeed good places to turn next. The
freq-words list should be pretty small, around 100 words. The
full word list can be pretty big, though. The one I used was around
330,000 words. I don't know whether in Esperanto you can be confident
that there won't be many words outside of the dictionary, but if so (as
is the case with Ancient Greek), consider increasing the weight
given to the dictionaries. You can do this by altering a couple of
config variables, like so:

language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3

Save those lines in a file called <langcode>.config. The numbers to
use should be based on testing; mine are probably too high for most
languages. The default values can be found by grepping through the
source code (I don't have it in front of me, but IIRC they were 0.1
and 0.15 respectively).

Also, if you haven't already, try using the new segsearch algorithm.
Most of the trainings have it enabled. I don't really know what it
does, but it improved things for me. To turn it on, add
'enable_new_segsearch 1' to <langcode>.config as well.
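
As a rough sketch of the config file described above, the following
Python renders the three variables mentioned in this message as
<langcode>.config text. The helper names render_config and write_config
are hypothetical, and the values are only the ones quoted above, so
treat them as starting points to tune by testing.

```python
from pathlib import Path

# Values quoted in this message; starting points only, to be tuned by testing.
SETTINGS = {
    "language_model_penalty_non_freq_dict_word": "0.2",
    "language_model_penalty_non_dict_word": "0.3",
    "enable_new_segsearch": "1",
}


def render_config(settings):
    """Return config text with one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def write_config(langcode, directory=".", settings=SETTINGS):
    """Write <langcode>.config into directory and return its path.

    Hypothetical helper: the file name pattern comes from the message,
    everything else here is an illustrative assumption.
    """
    path = Path(directory) / f"{langcode}.config"
    path.write_text(render_config(settings), encoding="utf-8")
    return path
```

For Esperanto, write_config("epo") would create epo.config in the
current directory.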

As for <langcode>.unicharambigs, a good place to start would be to
add the common errors you found as 'suggestions', e.g. for li → h:

2       l i     1       h       0

I didn't find that unicharambigs made as much difference as I was
hoping, but it's still good to have around.
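
The confusions listed earlier in the thread can be turned into lines of
the same shape as the li → h example with a short script. This is only
a sketch: ambig_line is a hypothetical helper, it assumes the fields
are tab-separated and that every character is its own unicharset entry,
it keeps the left-to-right order of the example, and it always writes 0
as the last field. It does not emit any header line the file may need.

```python
# Confusion pairs as written earlier in the thread (left side, right side).
# "\u00fc" is the precomposed u-with-diaeresis, so it counts as one character.
CONFUSIONS = [("l", "I"), ("e", "c"), ("il", "\u00fc"), ("li", "h"), ("o", "O")]


def ambig_line(left, right, flag=0):
    """Format one line shaped like the 'li -> h' example in this message."""
    fields = [
        str(len(left)),    # number of characters on the left side
        " ".join(left),    # those characters, space-separated
        str(len(right)),   # number of characters on the right side
        " ".join(right),   # those characters, space-separated
        str(flag),         # final field, 0 as in the example (assumed default)
    ]
    return "\t".join(fields)


for left, right in CONFUSIONS:
    print(ambig_line(left, right))
```

The printed lines could be pasted into <langcode>.unicharambigs and
then checked by testing, the same way as the config values.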

Hope this helps, keep us updated!

Nick

