Hi Donaldo,

It's great to hear how you're getting on. Thanks for sharing in so much detail!
I'll reply / comment below.

On Mon, Oct 01, 2012 at 04:04:36PM -0700, Donaldo wrote:
> I ran tesseract to train it up on a few fonts. The txt files produced were
> full of blank characters. It seems to be important to separate the tokens
> in each file name with a hyphen.

Do you mean with lazytrain? Can you explain further? I'm not following.

> Running mftraining produced shapetable file which is not mentioned in the
> documentation, as well as epo.unicharset, pffmtable, inttemp; cftraining
> produced normproto.

Yep, shapetable will be added to the documentation once 3.02 is released (I presume). It is new in 3.02, which is why it isn't there yet.

> I found a comment on the tesseract-ocr group that it is better to use png
> files.

Yes. TIFF files are somewhat unreliable, simply because there are so many different types of TIFF. PNG is indeed better.

> Results: 1.5% character errors. Most accented letters recognised. Frequent
> errors: l → I, e → c, il → ü, li → h, o → O

Great! I'm happy to hear that.

> What should I do next? Dictionaries? I have a list of nearly 500,000 Esperanto
> words. Is that too big? Ambigs?

Yes, word lists and ambigs are indeed good places to turn next. The freq-words list should be pretty small, around 100 words. The full word list can be pretty big, though; the one I used was around 330,000 words.

I don't know whether, in Esperanto, you can be confident that there won't be many words outside of the dictionary, but if so (as is the case with Ancient Greek), consider increasing the weight given to the dictionaries. You can do this by altering a couple of config variables, like so:

  language_model_penalty_non_freq_dict_word 0.2
  language_model_penalty_non_dict_word 0.3

Save that in a file called <langcode>.config. The numbers to use should be based on testing; mine are probably too high for most languages.
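If you end up testing several penalty values, a small script can save retyping the file each time. This is only a rough sketch: the function names are made up for illustration, "epo" is assumed as the langcode because of your epo.unicharset, and the only thing taken from above is the "name value" per line layout and the two variable names.

```python
from pathlib import Path


def format_config(settings):
    """Render settings as one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def write_lang_config(langcode, settings, directory="."):
    """Write the rendered settings to <directory>/<langcode>.config."""
    path = Path(directory) / f"{langcode}.config"
    path.write_text(format_config(settings), encoding="utf-8")
    return path


# The values quoted above; treat them as a starting point for testing,
# not as recommended settings.
penalties = {
    "language_model_penalty_non_freq_dict_word": 0.2,
    "language_model_penalty_non_dict_word": 0.3,
}

if __name__ == "__main__":
    # "epo" is an assumed langcode; the file lands in the current directory.
    print(write_lang_config("epo", penalties))
```

Changing a number in the dictionary and rerunning the script gives a fresh config to test against the same sample pages.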
The default values can be found by grepping through the source code (I don't have it in front of me, but IIRC they were 0.1 and 0.15 respectively).

Also, if you haven't already, try using the new segsearch algorithm. Most of the trainings have it enabled. I don't really know what it does, but it improved things for me. It goes in <langcode>.config again:

  enable_new_segsearch 1

As for <langcode>.unicharambigs, a good place to start would be to add the common errors you found as 'suggestions', e.g. for li → h:

  2 l i 1 h 0

I didn't find that unicharambigs made as much difference as I was hoping, but it's still good to have around.

Hope this helps, keep us updated!

Nick

