On 7/29/2015 11:40 AM, ShreeDevi Kumar wrote:
> Marco,
> Thanks for building the training tools for cygwin. Till now just the
> additional binaries have been shipped as part of the tesseract package.
> With Tesseract 3.04.00 there are additional scripts provided to help
> with training. Google has also provided the language data which can be
> used for training different languages and building the traineddata
> files. Hence my request to include these.
> Not all users will be interested in training for a new language or
> trying to improve an existing traineddata, so in my opinion, it may be
> better to package these separately.

Hi ShreeDevi,
I am uploading 3.04.00-2.
The training tools are in the new package
tesseract-training-util
while the training language files are split between
tesseract-training-core
tesseract-training-{lang}
I have not changed the previous directory structure,
just added an additional level
/usr/share/tessdata/training
and the two test files are in
/usr/share/tessdata/testing/eurotext.tif
/usr/share/tessdata/testing/phototest.tif
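
For anyone who wants to confirm the layout after installing, here is a minimal sketch of a check. The function name check_tessdata_layout is hypothetical and is not shipped in any of the packages; it only tests that the training directory and the two test images named above exist under a given root, defaulting to /usr/share/tessdata.

```shell
#!/bin/sh
# Hypothetical helper, not part of any tesseract package.
# Reports whether the paths mentioned in this mail exist under a
# tessdata root (first argument, default /usr/share/tessdata).
check_tessdata_layout() {
    root=${1:-/usr/share/tessdata}
    missing=0
    for path in training testing/eurotext.tif testing/phototest.tif; do
        if [ -e "$root/$path" ]; then
            echo "found:   $root/$path"
        else
            echo "missing: $root/$path"
            missing=1
        fi
    done
    # Exit status 0 only when every expected path is present.
    return "$missing"
}
```

If the check passes, running something like "tesseract /usr/share/tessdata/testing/eurotext.tif stdout" should be a quick smoke test, assuming the base tesseract package and the English traineddata are installed.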
$ cygcheck -l tesseract-training-util
/usr/bin/ambiguous_words.exe
/usr/bin/classifier_tester.exe
/usr/bin/cntraining.exe
/usr/bin/combine_tessdata.exe
/usr/bin/dawg2wordlist.exe
/usr/bin/mftraining.exe
/usr/bin/set_unicharset_properties.exe
/usr/bin/shapeclustering.exe
/usr/bin/text2image.exe
/usr/bin/unicharset_extractor.exe
/usr/bin/wordlist2dawg.exe
/usr/bin/language-specific.sh
/usr/bin/tesstrain.sh
/usr/bin/tesstrain_utils.sh
$ cygcheck -l tesseract-training-core
/usr/share/tessdata/training/Arabic.unicharset
/usr/share/tessdata/training/Arabic.xheights
...
/usr/share/tessdata/training/Cherokee.xheights
/usr/share/tessdata/training/common.punc
/usr/share/tessdata/training/common.unicharambigs
/usr/share/tessdata/training/Common.unicharset
/usr/share/tessdata/training/Cyrillic.unicharset
...
/usr/share/tessdata/training/Ethiopic.xheights
/usr/share/tessdata/training/font_properties
/usr/share/tessdata/training/forbidden_characters_default
/usr/share/tessdata/training/Georgian.unicharset
...
/usr/share/tessdata/training/Tibetan.unicharset
$ cygcheck -l tesseract-training-eng
/usr/share/tessdata/training/eng/desired_characters
/usr/share/tessdata/training/eng/eng.cube-unicharset
/usr/share/tessdata/training/eng/eng.cube-word-dawg
/usr/share/tessdata/training/eng/eng.numbers
/usr/share/tessdata/training/eng/eng.punc
/usr/share/tessdata/training/eng/eng.training_text
/usr/share/tessdata/training/eng/eng.training_text.bigram_freqs
/usr/share/tessdata/training/eng/eng.training_text.unigram_freqs
/usr/share/tessdata/training/eng/eng.unicharambigs
/usr/share/tessdata/training/eng/eng.word.bigrams
/usr/share/tessdata/training/eng/eng.wordlist
Regards
Marco