Re: [tesseract-ocr] building on cygwin with training data

Marco Atzeri Sun, 02 Aug 2015 00:24:07 -0700

On 7/29/2015 11:40 AM, ShreeDevi Kumar wrote:

Marco,


Thanks for building the training tools for cygwin. Till now just the
additional binaries have been shipped as part of the tesseract package.

With Tesseract 3.04.00 there are additional scripts provided to help
with training. Google has also provided the language data which can be
used for training different languages and building the traineddata
files. Hence my request to include these.

Not all users will be interested in training for a new language or
trying to improve an existing traineddata, so in my opinion, it may be
better to package these separately.


Hi ShreeDevi,
I am uploading 3.04.00-2.

The training tools are in the new package
  tesseract-training-util

while the training language files are split between
  tesseract-training-core
  tesseract-training-{lang}

I have not changed the previous data structure,
just added an additional level
  /usr/share/tessdata/training

and the two test files are in
  /usr/share/tessdata/testing/eurotext.tif
  /usr/share/tessdata/testing/phototest.tif
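
A quick way to see which per-language training packages are present is to
look for the matching directory under the new level. This is only a sketch:
the helper name has_training_data and the TESS_TRAINING_DIR variable are
made up for illustration, and "deu" is just an example of a language that
may or may not be installed.

```shell
# Base directory as described above; override it to point elsewhere.
TESS_TRAINING_DIR="${TESS_TRAINING_DIR:-/usr/share/tessdata/training}"

# has_training_data LANG (hypothetical helper):
# succeeds if the per-language directory exists under the base directory.
has_training_data() {
    [ -d "${TESS_TRAINING_DIR}/$1" ]
}

# "deu" is only an example language code, not something from this message.
for lang in eng deu; do
    if has_training_data "$lang"; then
        echo "$lang: training data present"
    else
        echo "$lang: not found (tesseract-training-$lang may be missing)"
    fi
done
```

The check looks only at the directory, not at the individual files in it.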


$ cygcheck -l tesseract-training-util
/usr/bin/ambiguous_words.exe
/usr/bin/classifier_tester.exe
/usr/bin/cntraining.exe
/usr/bin/combine_tessdata.exe
/usr/bin/dawg2wordlist.exe
/usr/bin/mftraining.exe
/usr/bin/set_unicharset_properties.exe
/usr/bin/shapeclustering.exe
/usr/bin/text2image.exe
/usr/bin/unicharset_extractor.exe
/usr/bin/wordlist2dawg.exe
/usr/bin/language-specific.sh
/usr/bin/tesstrain.sh
/usr/bin/tesstrain_utils.sh

$ cygcheck -l tesseract-training-core
/usr/share/tessdata/training/Arabic.unicharset
/usr/share/tessdata/training/Arabic.xheights
...
/usr/share/tessdata/training/Cherokee.xheights
/usr/share/tessdata/training/common.punc
/usr/share/tessdata/training/common.unicharambigs
/usr/share/tessdata/training/Common.unicharset
/usr/share/tessdata/training/Cyrillic.unicharset
...
/usr/share/tessdata/training/Ethiopic.xheights
/usr/share/tessdata/training/font_properties
/usr/share/tessdata/training/forbidden_characters_default
/usr/share/tessdata/training/Georgian.unicharset
...
/usr/share/tessdata/training/Tibetan.unicharset

$ cygcheck -l tesseract-training-eng
/usr/share/tessdata/training/eng/desired_characters
/usr/share/tessdata/training/eng/eng.cube-unicharset
/usr/share/tessdata/training/eng/eng.cube-word-dawg
/usr/share/tessdata/training/eng/eng.numbers
/usr/share/tessdata/training/eng/eng.punc
/usr/share/tessdata/training/eng/eng.training_text
/usr/share/tessdata/training/eng/eng.training_text.bigram_freqs
/usr/share/tessdata/training/eng/eng.training_text.unigram_freqs
/usr/share/tessdata/training/eng/eng.unicharambigs
/usr/share/tessdata/training/eng/eng.word.bigrams
/usr/share/tessdata/training/eng/eng.wordlist
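
With the util, core and eng packages installed, a training run could be
started roughly as below. Treat it as a sketch, not a tested recipe: I am
assuming that tesstrain.sh accepts --lang, --langdata_dir, --tessdata_dir
and --output_dir as in the upstream 3.04 script, and that suitable fonts
are available to it. The helper tesstrain_cmd and the output directory are
hypothetical; the helper only prints the command line so it can be
reviewed before anything is run.

```shell
# Paths taken from the package listings above.
LANGDATA_DIR=/usr/share/tessdata/training
TESSDATA_DIR=/usr/share/tessdata

# tesstrain_cmd LANG OUTPUT_DIR (hypothetical helper):
# prints a tesstrain.sh command line without executing it.
# The flags are assumed from the upstream 3.04 tesstrain.sh.
tesstrain_cmd() {
    printf 'tesstrain.sh --lang %s --langdata_dir %s --tessdata_dir %s --output_dir %s\n' \
        "$1" "$LANGDATA_DIR" "$TESSDATA_DIR" "$2"
}

# Example: show the command for English; the output directory is arbitrary.
tesstrain_cmd eng "$HOME/tesstrain-out"
```

Once the printed line looks right, it can be pasted into the shell to
start the actual run.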

Regards
Marco
