Sorry for the noise. I've looked into this more, and discovered more 
:)

On Tue, Jul 15, 2014 at 10:54:06AM -0400, Nick White wrote:
> On Mon, Jul 14, 2014 at 01:10:07PM -0700, Albrecht Hilker wrote: 
> > When I download the traineddata files and extract the unicharset file
> > from them I notice that some are extremely different from the ones on
> > SVN in the folder training/langdata.
> > 
> > For example:
> > Bengali, Hebrew, Greek, Kannada, Malayalam, Tamil, Telugu, Thai.
> > 
> > These files differ significantly.
> > So for example Greek has a size of 9 kB in the traineddata file
> > tesseract-ocr-3.02.ell.tar.gz  and defines 151 characters.
> > But Greek.unicharset in the folder training/langdata has a size of
> > 216 kB and defines 2820 unichars.
> 
> I am guessing, but it looks likely that Ray/Google has some internal 
> tools that replace any line in the extracted .unicharset with 
> a line from the "pregenerated" one in training/langdata.

This tool actually already exists: it is set_unicharset_properties, 
in training/.

I had been using it, but not paying attention to the --script_dir 
argument. That gives a directory to look for .unicharset files in, 
and adds any metrics found there to the unicharset file it writes.
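
As a rough sketch of what such an invocation looks like, here is a small 
Python helper that only builds and prints the command line. The -U (input) 
and -O (output) flags are my recollection of the tool's options and should 
be checked against your build; the file names are hypothetical.

```python
import shlex


def build_command(input_unicharset, output_unicharset, script_dir):
    """Build an argv list for set_unicharset_properties.

    Assumption: -U names the unicharset to read and -O the one to write.
    --script_dir is the directory searched for the pregenerated
    .unicharset files (e.g. training/langdata).
    """
    return [
        "set_unicharset_properties",
        "-U", input_unicharset,
        "-O", output_unicharset,
        "--script_dir=" + script_dir,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_command("ell.unicharset", "ell.out.unicharset",
                        "training/langdata")
    # Print rather than run, so this is safe without tesseract installed.
    print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) from 
a checkout where the training tools have been built.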

Good news, eh?

I need to write some manpages for the tools in training/ soon. For 
my own sake, if no one else's ;)

Nick
