On Sun, Aug 2, 2015 at 3:25 PM, Marco Atzeri <[email protected]> wrote:
> On 8/2/2015 10:31 AM, ShreeDevi Kumar wrote:
>
>> + tesseract-dev google group
>>
>> Thank you, Marco. I will download the training tools packages and
>> give it a try.
>>
>> In future updates to the tesseract package, may I suggest packaging
>> more languages from 'tessdata' -
>> https://github.com/tesseract-ocr/tessdata
>>
>> especially the ones which have multiple files for the language, such as
>> ara, hin etc.
>>
>> The languages that have just one file for traineddata can be downloaded
>> easily as a zip from the 'raw' link. It would be very helpful to have a
>> single tar/zip for the others.
>>
>
> all the language data in tessdata is > 1GB
> so I assume very few will need all of it,
> and most will not appreciate a single file of
> 346M (compressed with xz)
>

You are right. What I meant was that for languages with just one file, e.g. guj, users can download it using
https://github.com/tesseract-ocr/tessdata/blob/master/guj.traineddata?raw=true

But there is no easy way to download the multiple files for hin.* from the same GitHub directory.

> Maybe a script to list/download/update from
> https://github.com/tesseract-ocr/tessdata
> will be more useful.

Yes, that is a good idea.

> Question:
> why does tessdata include files other than traineddata?
>
> $ ls -s1 rus*
> 1.0K rus.cube.fold
> 1.0K rus.cube.lm
> 892K rus.cube.nn
> 1.0K rus.cube.params
> 15M rus.cube.size
> 6.8M rus.cube.word-freq
> 16M rus.traineddata
>
> From the wiki I had the impression that
> traineddata should include all the other files inside.

Some languages were trained using the 'cube' engine. The trained data for those languages includes these extra files. Please see
http://packages.ubuntu.com/wily/all/tesseract-ocr-ara/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-eng/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-hin/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-rus/filelist
etc.

> Are all the files for a language needed or only the
> {lang}.traineddata ?

I think some of the cube files are required during recognition. Ray or other developers can offer a more complete answer.

> Langdata includes a different set of files
>
> $ ls -s1 rus*
> total 22M
> 1.0K desired_characters
> 8.0K rus.cube-unicharset
> 1.3M rus.cube-word-dawg
> 4.0K rus.numbers
> 8.0K rus.punc
> 16K rus.training_text
> 96K rus.training_text.bigram_freqs
> 4.0K rus.training_text.unigram_freqs
> 8.0K rus.unicharambigs
> 11M rus.word.bigrams
> 11M rus.wordlist

Langdata files are required only by those who want to train for that particular language, perhaps to improve the traineddata provided by Google or to customize it to their needs.

> Is there a description of the different types of data?
>
> Marco
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/55BDE8F4.8010609%40gmail.com.
> For more options, visit https://groups.google.com/d/optout.
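
As a rough sketch of the list/download script idea discussed above, something along these lines might work. It only reuses the 'raw' URL pattern quoted earlier in this message (`blob/master/<file>?raw=true`); the function names `files_for_language`, `raw_url` and `download_language` are made up for illustration, and the list of file names in the directory is assumed to be supplied by the caller, since the thread does not settle how to obtain it. The repository layout may also have changed since this was written.

```python
import os
import urllib.request
from urllib.parse import quote

# URL pattern taken from the guj.traineddata link quoted in this thread.
RAW_URL = "https://github.com/tesseract-ocr/tessdata/blob/master/{name}?raw=true"


def files_for_language(names, lang):
    """Return the sorted file names that belong to one language code.

    `names` is a caller-supplied listing of the tessdata directory
    (an assumption: how that listing is obtained is left open here).
    Matching on "<lang>." keeps 'rus' from also matching a code like 'ru'.
    """
    prefix = lang + "."
    return sorted(n for n in names if n.startswith(prefix))


def raw_url(name):
    """Build the 'raw' download URL for a single file name."""
    return RAW_URL.format(name=quote(name))


def download_language(names, lang, dest_dir="."):
    """Download every file of `lang` into `dest_dir` (hypothetical helper)."""
    for name in files_for_language(names, lang):
        target = os.path.join(dest_dir, name)
        urllib.request.urlretrieve(raw_url(name), target)
        print("saved", target)


if __name__ == "__main__":
    # File names copied from the 'ls -s1 rus*' listing quoted above.
    listing = [
        "guj.traineddata",
        "rus.cube.fold",
        "rus.cube.lm",
        "rus.cube.nn",
        "rus.cube.params",
        "rus.cube.size",
        "rus.cube.word-freq",
        "rus.traineddata",
    ]
    # Print the URLs only; call download_language(listing, "rus") to fetch.
    for name in files_for_language(listing, "rus"):
        print(raw_url(name))
```

Running it as-is only prints the seven rus.* URLs, so nothing is fetched until `download_language` is called explicitly. Whether all of those files are actually needed at recognition time is the open question above.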

