On 8/2/2015 10:31 AM, ShreeDevi Kumar wrote:
+ tesseract-dev google group

Thank you, Marco. I will download the training tools packages and
give it a try.

In future updates to the tesseract package, may I suggest packaging of
more languages from 'tessdata' - https://github.com/tesseract-ocr/tessdata

especially the ones which have multiple files for the language, such as
ara, hin etc.

The languages that have just one file for traineddata can be downloaded
easily as a zip from the 'raw' link. It would be very helpful to have a
single tar/zip for the others.


All the language data in tessdata is > 1GB,
so I assume very few will need all of it,
and most will not appreciate a single file of
346M (compressed with xz).

Maybe a script to list/download/update from
  https://github.com/tesseract-ocr/tessdata
would be more useful.
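
A rough sketch of what such a script could look like, in Python. This is
only an illustration, not an existing tool: it assumes the repository
root can be listed through the GitHub "contents" API, that each entry
carries "name", "type" and "download_url" fields, and that the language
code is whatever precedes the first dot in the file name. The function
names are made up for this example.

```python
import json
import os
import urllib.request
from collections import defaultdict

# Assumption: GitHub "contents" API listing for the root of the tessdata repo.
API_URL = "https://api.github.com/repos/tesseract-ocr/tessdata/contents"


def fetch_entries(url=API_URL):
    """Return the directory listing of the repository as a list of dicts."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def group_by_language(entries):
    """Group file entries by language code (text before the first dot)."""
    groups = defaultdict(list)
    for entry in entries:
        # Skip sub-directories; only plain files are grouped.
        if entry.get("type", "file") != "file":
            continue
        lang = entry["name"].split(".", 1)[0]
        groups[lang].append(entry)
    return dict(groups)


def list_languages(groups):
    """Return (language, file count, total size in bytes), sorted by language."""
    return [
        (lang, len(files), sum(f.get("size", 0) for f in files))
        for lang, files in sorted(groups.items())
    ]


def download_language(lang, groups, dest="."):
    """Download every file belonging to one language into dest."""
    os.makedirs(dest, exist_ok=True)
    for entry in groups.get(lang, []):
        target = os.path.join(dest, entry["name"])
        urllib.request.urlretrieve(entry["download_url"], target)
```

With this, download_language("rus", group_by_language(fetch_entries()))
would fetch rus.traineddata together with the rus.cube.* files, and
list_languages() gives a per-language overview before downloading.
An "update" step (comparing against files already on disk) is left out.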

Question:
Why does tessdata include files other than traineddata?

$ ls -s1 rus*
1.0K rus.cube.fold
1.0K rus.cube.lm
892K rus.cube.nn
1.0K rus.cube.params
 15M rus.cube.size
6.8M rus.cube.word-freq
 16M rus.traineddata

From the wiki I had the impression that
traineddata should include all the other files inside.

Are all the files for a language needed, or only
{lang}.traineddata?


Langdata includes a different set of files:

 $ ls -s1 rus*
total 22M
1.0K desired_characters
8.0K rus.cube-unicharset
1.3M rus.cube-word-dawg
4.0K rus.numbers
8.0K rus.punc
 16K rus.training_text
 96K rus.training_text.bigram_freqs
4.0K rus.training_text.unigram_freqs
8.0K rus.unicharambigs
 11M rus.word.bigrams
 11M rus.wordlist

Is there a description of the different types of data?


Marco
