Marco,

Thanks for building the training tools for Cygwin. Until now, only the additional binaries have been shipped as part of the tesseract package.

With Tesseract 3.04.00, additional scripts are provided to help with training. Google has also provided the language data, which can be used to train different languages and build the traineddata files. Hence my request to include these. Not all users will be interested in training a new language or improving an existing traineddata, so in my opinion it may be better to package these separately.

Given the above, the following are my suggestions (from a user's perspective). I hope this will prompt developers and other packagers to provide their feedback too.

1. Package the training tools separately.

2. Modify the way tessdata is packaged (both as part of the training tools and in tesseract-ocr-core). Instead of packaging it under ./usr/share/tessdata, I suggest adding another directory level above tessdata and providing it as ./usr/share/tesseract/tessdata. This would allow all tesseract-related files to be kept under the tesseract directory.

3. Include the training tools' exe files as well as the following training bash scripts in the ./usr/bin directory:

./usr/bin/tesstrain.sh
./usr/bin/tesstrain_utils.sh
./usr/bin/language-specific.sh

Alternatively, the training scripts could be kept under ./usr/share/tesseract/training/

4. Provide the tifs from the testing directory for easy testing of the install and as example usage. It may be useful in the future to add samples for non-Latin scripts too.

./usr/share/tesseract/testing/phototest.tif
./usr/share/tesseract/testing/eurotext.tif

5. Regarding langdata, the readme says:

"To re-create the training of a single language, *lang,* you need the following:
- All the data in the *lang* directory.
- The corresponding unicharset/xheights files for the script(s) used by *lang.*
- All the remaining non-lang-specific files in the top-level directory, such as font_properties."

5.1 So, I would suggest that the training tools include the langdata for English by default (similar to the packaging of tesseract-ocr itself).

5.2 Include ALL the files in the top-level directory, including the unicharset/xheights files for ALL the scripts.

5.3 Package or link to the language data for the different languages, which is available in separate subfolders.

The file list would then look similar to the following:

./usr/share/tesseract/tessdata/configs/ambigs.train
./usr/share/tesseract/tessdata/configs/api_config
./usr/share/tesseract/tessdata/configs/bigram
./usr/share/tesseract/tessdata/configs/box.train
./usr/share/tesseract/tessdata/configs/box.train.stderr
./usr/share/tesseract/tessdata/configs/digits
./usr/share/tesseract/tessdata/configs/hocr
./usr/share/tesseract/tessdata/configs/inter
./usr/share/tesseract/tessdata/configs/kannada
./usr/share/tesseract/tessdata/configs/linebox
./usr/share/tesseract/tessdata/configs/logfile
./usr/share/tesseract/tessdata/configs/makebox
./usr/share/tesseract/tessdata/configs/pdf
./usr/share/tesseract/tessdata/configs/quiet
./usr/share/tesseract/tessdata/configs/rebox
./usr/share/tesseract/tessdata/configs/strokewidth
./usr/share/tesseract/tessdata/configs/unlv
./usr/share/tesseract/tessdata/pdf.ttf
./usr/share/tesseract/tessdata/tessconfigs/batch
./usr/share/tesseract/tessdata/tessconfigs/batch.nochop
./usr/share/tesseract/tessdata/tessconfigs/matdemo
./usr/share/tesseract/tessdata/tessconfigs/msdemo
./usr/share/tesseract/tessdata/tessconfigs/nobatch
./usr/share/tesseract/tessdata/tessconfigs/segdemo
./usr/share/tesseract/testing/phototest.tif
./usr/share/tesseract/testing/eurotext.tif
./usr/share/tesseract/training/tesstrain.sh
./usr/share/tesseract/training/tesstrain_utils.sh
./usr/share/tesseract/training/language-specific.sh
./usr/share/tesseract/training/langdata/Arabic.unicharset
./usr/share/tesseract/training/langdata/Arabic.xheights
./usr/share/tesseract/training/langdata/Armenian.unicharset
./usr/share/tesseract/training/langdata/Armenian.xheights
./usr/share/tesseract/training/langdata/Bengali.unicharset
./usr/share/tesseract/training/langdata/Bengali.xheights
./usr/share/tesseract/training/langdata/Bopomofo.unicharset
./usr/share/tesseract/training/langdata/Bopomofo.xheights
./usr/share/tesseract/training/langdata/Canadian_Aboriginal.unicharset
./usr/share/tesseract/training/langdata/Canadian_Aboriginal.xheights
./usr/share/tesseract/training/langdata/Cherokee.unicharset
./usr/share/tesseract/training/langdata/Cherokee.xheights
./usr/share/tesseract/training/langdata/Common.unicharset
./usr/share/tesseract/training/langdata/Cyrillic.unicharset
./usr/share/tesseract/training/langdata/Cyrillic.xheights
./usr/share/tesseract/training/langdata/Devanagari.unicharset
./usr/share/tesseract/training/langdata/Devanagari.xheights
./usr/share/tesseract/training/langdata/Ethiopic.unicharset
./usr/share/tesseract/training/langdata/Ethiopic.xheights
./usr/share/tesseract/training/langdata/Georgian.unicharset
./usr/share/tesseract/training/langdata/Georgian.xheights
./usr/share/tesseract/training/langdata/Greek.unicharset
./usr/share/tesseract/training/langdata/Greek.xheights
./usr/share/tesseract/training/langdata/Gujarati.unicharset
./usr/share/tesseract/training/langdata/Gujarati.xheights
./usr/share/tesseract/training/langdata/Gurmukhi.unicharset
./usr/share/tesseract/training/langdata/Gurmukhi.xheights
./usr/share/tesseract/training/langdata/Han.unicharset
./usr/share/tesseract/training/langdata/Han.xheights
./usr/share/tesseract/training/langdata/Hangul.unicharset
./usr/share/tesseract/training/langdata/Hangul.xheights
./usr/share/tesseract/training/langdata/Hebrew.unicharset
./usr/share/tesseract/training/langdata/Hebrew.xheights
./usr/share/tesseract/training/langdata/Hiragana.unicharset
./usr/share/tesseract/training/langdata/Hiragana.xheights
./usr/share/tesseract/training/langdata/Kannada.unicharset
./usr/share/tesseract/training/langdata/Kannada.xheights
./usr/share/tesseract/training/langdata/Katakana.unicharset
./usr/share/tesseract/training/langdata/Katakana.xheights
./usr/share/tesseract/training/langdata/Khmer.unicharset
./usr/share/tesseract/training/langdata/Khmer.xheights
./usr/share/tesseract/training/langdata/Lao.unicharset
./usr/share/tesseract/training/langdata/Lao.xheights
./usr/share/tesseract/training/langdata/Latin.unicharset
./usr/share/tesseract/training/langdata/Latin.xheights
./usr/share/tesseract/training/langdata/Malayalam.unicharset
./usr/share/tesseract/training/langdata/Malayalam.xheights
./usr/share/tesseract/training/langdata/Myanmar.unicharset
./usr/share/tesseract/training/langdata/Myanmar.xheights
./usr/share/tesseract/training/langdata/Ogham.unicharset
./usr/share/tesseract/training/langdata/Ogham.xheights
./usr/share/tesseract/training/langdata/Oriya.unicharset
./usr/share/tesseract/training/langdata/Oriya.xheights
./usr/share/tesseract/training/langdata/Runic.unicharset
./usr/share/tesseract/training/langdata/Runic.xheights
./usr/share/tesseract/training/langdata/Sinhala.unicharset
./usr/share/tesseract/training/langdata/Sinhala.xheights
./usr/share/tesseract/training/langdata/Syriac.unicharset
./usr/share/tesseract/training/langdata/Syriac.xheights
./usr/share/tesseract/training/langdata/Tamil.unicharset
./usr/share/tesseract/training/langdata/Tamil.xheights
./usr/share/tesseract/training/langdata/Telugu.unicharset
./usr/share/tesseract/training/langdata/Telugu.xheights
./usr/share/tesseract/training/langdata/Thai.unicharset
./usr/share/tesseract/training/langdata/Thai.xheights
./usr/share/tesseract/training/langdata/Tibetan.unicharset
./usr/share/tesseract/training/langdata/common.punc
./usr/share/tesseract/training/langdata/common.unicharambigs
./usr/share/tesseract/training/langdata/font_properties
./usr/share/tesseract/training/langdata/forbidden_characters_default
./usr/share/tesseract/training/langdata/eng/desired_characters
./usr/share/tesseract/training/langdata/eng/eng.cube-unicharset
./usr/share/tesseract/training/langdata/eng/eng.cube-word-dawg
./usr/share/tesseract/training/langdata/eng/eng.numbers
./usr/share/tesseract/training/langdata/eng/eng.punc
./usr/share/tesseract/training/langdata/eng/eng.training_text
./usr/share/tesseract/training/langdata/eng/eng.training_text.bigram_freqs
./usr/share/tesseract/training/langdata/eng/eng.training_text.unigram_freqs
./usr/share/tesseract/training/langdata/eng/eng.unicharambigs
./usr/share/tesseract/training/langdata/eng/eng.word.bigrams
./usr/share/tesseract/training/langdata/eng/eng.wordlist

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Wed, Jul 29, 2015 at 4:17 AM, Marco Atzeri <[email protected]> wrote:

> Hi,
> I just completed the build of tesseract-ocr-3.04.00
> including the training portion.
>
> Attached the patch I used together with
>
> configure LIBS="$(pkg-config --libs icu-i18n)"
>
> to correctly include the icu dependency.
> For what I see the additional steps
>
> make training
> make training-install
>
> are only installing these additional files
>
> /usr/bin/ambiguous_words.exe
> /usr/bin/classifier_tester.exe
> /usr/bin/cntraining.exe
> /usr/bin/combine_tessdata.exe
> /usr/bin/dawg2wordlist.exe
> /usr/bin/mftraining.exe
> /usr/bin/set_unicharset_properties.exe
> /usr/bin/shapeclustering.exe
> /usr/bin/text2image.exe
> /usr/bin/unicharset_extractor.exe
> /usr/bin/wordlist2dawg.exe
>
> full list attached.
>
> Questions:
> - anything missing ?
> - which portion of
> https://github.com/tesseract-ocr/langdata
> you would like to see in a training data package ?
>
> The current splits is available at:
> https://cygwin.com/packages/x86_64/tesseract-ocr/tesseract-ocr-3.04.00-1
>
> https://cygwin.com/packages/x86_64/tesseract-ocr-devel/tesseract-ocr-devel-3.04.00-1
>
> https://cygwin.com/packages/x86_64/libtesseract-ocr_3/libtesseract-ocr_3-3.04.00-1
>
> only English language is installed by default and it also contain the osd
> data:
>
> https://cygwin.com/packages/x86_64/tesseract-ocr-eng/tesseract-ocr-eng-3.04-1
>
> Others :
> tesseract-ocr-deu/
> tesseract-ocr-fra/
> tesseract-ocr-ita/
> tesseract-ocr-nld/
> tesseract-ocr-por/
> tesseract-ocr-spa/
> tesseract-ocr-vie/
>
>
> Regards
> Marco
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/55B80674.4070709%40gmail.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
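
P.S. To make suggestion 5 a little more concrete, here is a rough sketch of how a packaged langdata directory could be checked against what the readme asks for, for a single language. The function name check_langdata and its three arguments are my own illustration, not part of tesseract or of any Cygwin package, and it only looks for font_properties as a stand-in for the other top-level files. Note that in the file list above Common and Tibetan have no .xheights file, so the check as written would flag those scripts.

```shell
#!/usr/bin/env bash
# Sketch only: check_langdata and its arguments are hypothetical names,
# not something shipped with tesseract or the Cygwin packages.
#
# Usage: check_langdata LANGDATA_DIR LANG SCRIPT
# Prints one "missing: ..." line per absent item; returns 0 if nothing
# is missing and 1 otherwise.
check_langdata() {
    local root=$1 lang=$2 script=$3
    local missing=0 f

    # Readme item 1: all the data in the lang directory.
    # Here we only verify that the directory exists and is not empty.
    if [ ! -d "$root/$lang" ] || [ -z "$(ls -A "$root/$lang" 2>/dev/null)" ]; then
        echo "missing: $root/$lang/"
        missing=1
    fi

    # Readme item 2: unicharset/xheights files for the script used by lang.
    # Readme item 3: non-lang-specific top-level files (font_properties
    # is the one the readme names; others are not checked here).
    for f in "$script.unicharset" "$script.xheights" font_properties; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f"
            missing=1
        fi
    done

    return "$missing"
}
```

With the layout proposed above, it could be run for English as: check_langdata /usr/share/tesseract/training/langdata eng Latin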

