Sorry for the noise. I've looked into this more, and discovered more :)

On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote:
> On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote:
> > When I download the traineddata files and extract the unicharset file
> > from them I notice that some are extremely different from the ones on
> > SVN in the folder training/langdata.
> >
> > For example:
> > Bengali, Hebrew, Greek, Kannada, Malayam, Tamil, Telugu, Thai.
> >
> > These files differ significantly.
> > So for example Greek has a size of 9 kB in the traineddata file
> > tesseract-ocr-3.02.ell.tar.gz and defines 151 characters.
> > But Greek.unicharset in the folder training/langdata has a size of
> > 216 kB and defines 2820 unichars.
>
> I am guessing, but it looks likely that Ray/Google has some internal
> tools that look replace any line in the extracted .unicharset with
> a line from the "pregenerated" one in training/langdata.

This tool actually already exists: it is set_unicharset_properties in
training/. I had been using it, but not paying attention to the
--script_dir argument. That gives a directory to look for .unicharset
files in, and adds any metrics found there to the unicharset file it
writes.

Good news, eh?

I need to write some manpages for the tools in training/ soon. For my
own sake, if no one else's ;)

Nick
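
For anyone following along, here is a rough sketch of how the tool might be driven from a script. Only the tool name and --script_dir come from the message above. The -U (input unicharset) and -O (output unicharset) flags are what I recall from the Tesseract training documentation, so treat them as an assumption and check the tool's own help output. The file names and the build_command/run helpers are hypothetical, made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(input_unicharset, output_unicharset, script_dir):
    """Build the argument list for set_unicharset_properties.

    Assumption: -U names the input unicharset and -O the output one.
    --script_dir is the directory searched for per-script .unicharset
    files (for example training/langdata).
    """
    return [
        "set_unicharset_properties",
        "-U", str(input_unicharset),
        "-O", str(output_unicharset),
        f"--script_dir={script_dir}",
    ]


def run(input_unicharset, output_unicharset, script_dir):
    """Run the tool, failing early if script_dir does not exist."""
    if not Path(script_dir).is_dir():
        raise FileNotFoundError(f"script_dir not found: {script_dir}")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(
        build_command(input_unicharset, output_unicharset, script_dir),
        check=True,
    )


if __name__ == "__main__":
    # Hypothetical file names; nothing is executed here, only printed.
    print(" ".join(build_command(
        "ell.unicharset", "ell.with-props.unicharset", "training/langdata")))
```

Without a valid --script_dir there is nowhere for the tool to find the pregenerated metrics, which is why the sketch checks the directory before running anything.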

