Re: [tesseract-ocr] building on cygwin with training data

ShreeDevi Kumar Sun, 02 Aug 2015 03:12:43 -0700

On Sun, Aug 2, 2015 at 3:25 PM, Marco Atzeri <[email protected]> wrote:


> On 8/2/2015 10:31 AM, ShreeDevi Kumar wrote:
>
>> + tesseract-dev google group
>>
>> Thank you, Marco. I will download the training tools packages and
>> give it a try.
>>
>> In future updates to the tesseract package, may I suggest packaging of
>> more languages from 'tessdata' -
>> https://github.com/tesseract-ocr/tessdata
>>
>> especially the ones which have multiple files for the language, such as
>> ara, hin, etc.
>>
>> The languages that have just one file for traineddata can be downloaded
>> easily as a zip from the 'raw' link. It would be very helpful to have a
>> single tar/zip for the others.
>>
>>
> all the language data in tessdata is > 1GB,
> so I assume very few will need all of it,
> and most will not appreciate a single file of
> 346M (compressed with xz)
>

You are right. What I meant was that for languages with just one file, e.g.
guj, users can download it using
https://github.com/tesseract-ocr/tessdata/blob/master/guj.traineddata?raw=true

But there is no easy way to download the multiple files for hin.* from the
same GitHub directory.
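
To illustrate the "single tar/zip" idea, here is a minimal Python sketch. It assumes the files for one language are already present in a local tessdata directory; the function name `bundle_language` and the choice of gzip compression are my own, not something the Cygwin package provides.

```python
import tarfile
from pathlib import Path


def bundle_language(tessdata_dir, lang, archive_path):
    """Pack every file named '<lang>.*' in tessdata_dir into one .tar.gz.

    Hypothetical helper: returns the sorted list of file names it archived.
    """
    # Collect the matches first, so the archive itself is never picked up.
    members = sorted(Path(tessdata_dir).glob(lang + ".*"))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in members:
            # Store only the file name so the archive unpacks flat.
            tar.add(str(path), arcname=path.name)
    return [path.name for path in members]
```

For example, `bundle_language("tessdata", "hin", "hin.tar.gz")` would collect hin.traineddata together with any hin.cube.* files found in that directory.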


>
> Maybe a script to list/download/update from
>   https://github.com/tesseract-ocr/tessdata
> will be more useful.
>

Yes, that is a good idea.
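
A rough sketch of what such a script could look like in Python. The download URL follows the "?raw=true" pattern shown earlier in this message. The listing step assumes GitHub's repository "contents" API returns the file names for this repository, which I have not verified here. All function names are hypothetical, and the "update" part is not covered.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: GitHub's "contents" API lists the repository's top-level files.
LISTING_URL = "https://api.github.com/repos/tesseract-ocr/tessdata/contents/"
# Same pattern as the guj.traineddata link quoted in this thread.
RAW_URL = "https://github.com/tesseract-ocr/tessdata/blob/master/{name}?raw=true"


def list_remote_files():
    """Return the names of all files in the tessdata repository root."""
    with urllib.request.urlopen(LISTING_URL) as response:
        entries = json.load(response)
    return [entry["name"] for entry in entries if entry["type"] == "file"]


def files_for_language(names, lang):
    """Keep only names of the form '<lang>.<something>', e.g. hin.cube.lm."""
    prefix = lang + "."
    return sorted(name for name in names if name.startswith(prefix))


def raw_url(name):
    """Build the raw download link for one file."""
    return RAW_URL.format(name=name)


def download_language(lang, dest_dir, names=None):
    """Download every file for one language into dest_dir."""
    if names is None:
        names = list_remote_files()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    wanted = files_for_language(names, lang)
    for name in wanted:
        urllib.request.urlretrieve(raw_url(name), str(dest / name))
    return wanted
```

Calling `download_language("hin", "tessdata")` would then fetch hin.traineddata and any other hin.* files in one go, provided the listing assumption holds.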


>
> Question:
> why does tessdata include files other than traineddata?
>
> $ ls -s1 rus*
> 1.0K rus.cube.fold
> 1.0K rus.cube.lm
> 892K rus.cube.nn
> 1.0K rus.cube.params
>  15M rus.cube.size
> 6.8M rus.cube.word-freq
>  16M rus.traineddata
>
> From the wiki I had the impression that
> traineddata should include all the other files inside.
>

Some languages were trained using the 'cube' engine. The trained data for
them includes these extra files alongside {lang}.traineddata. Please see
http://packages.ubuntu.com/wily/all/tesseract-ocr-ara/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-eng/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-hin/filelist
http://packages.ubuntu.com/wily/all/tesseract-ocr-rus/filelist
etc



>
> Are all the files for a language needed, or only
> {lang}.traineddata?
>

I think some of the cube files are required during recognition.
Ray or other developers can offer a more complete answer.


>
>
> Langdata includes a different set of files
>
>  $ ls -s1 rus*
> total 22M
> 1.0K desired_characters
> 8.0K rus.cube-unicharset
> 1.3M rus.cube-word-dawg
> 4.0K rus.numbers
> 8.0K rus.punc
>  16K rus.training_text
>  96K rus.training_text.bigram_freqs
> 4.0K rus.training_text.unigram_freqs
> 8.0K rus.unicharambigs
>  11M rus.word.bigrams
>  11M rus.wordlist
>

Langdata files are required only by those who want to train for that
particular language - maybe in an effort to improve the traineddata
provided by Google or to customize it to their needs.




>
> Is there a description of the different types of data?
>
>
> Marco
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/55BDE8F4.8010609%40gmail.com
> .
>
> For more options, visit https://groups.google.com/d/optout.
>

