>Where are these scripts, or how can I otherwise generate training text from dictionary/corpus data?
These are (most probably) internal scripts at Google that have not been open sourced.

Please see https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ for Ray's comments regarding the generation of training text.

ShreeDevi

On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang <[email protected]> wrote:

> Dear all,
>
> I'm trying to generate a training text (chi_sim) for training tesseract
> because I have a better dictionary and unigram/bigram data than the
> defaults. I've found the following comments in
> training/language-specific.sh
>
> (line 845)
> # Set language-specific values for several global variables, including
> #   ${TEXT_CORPUS}
> #      holds the text corpus file for the language, used in phase F
> #   ${FONTS[@]}
> #      holds a sequence of applicable fonts for the language, used in
> #      phase F & I. only set if not already set, i.e. from command line
> #   ${TRAINING_DATA_ARGUMENTS}
> #      non-default arguments to the training_data program used in phase T
> #   ${FILTER_ARGUMENTS} -
> #      character-code-specific filtering to distinguish between scripts
> #      (eg. CJK) used by filter_borbidden_characters in phase F
> #   ${WORDLIST2DAWG_ARGUMENTS}
> #      specify fixed length dawg generation for non-space-delimited lang
> # TODO(dsl): We can refactor these into functions that assign FONTS,
> # TEXT_CORPUS, etc. separately.
>
> So I suppose there are scripts called training_data (phase T)
> and filter_borbidden_characters (sic, phase F) to create the training
> text from some wordlists and unigram/bigram frequency data.
>
> Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?
>
> Thanks.
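On the second half of the question (generating training text from dictionary/frequency data without the internal scripts), here is a minimal sketch of one possible approach. It is not the internal Google tooling and not necessarily what Ray describes in the linked thread. It simply samples words in proportion to their unigram counts and joins them into lines. The function name `generate_training_text`, its parameters, and the toy frequency table are all hypothetical, made up for illustration. The separator defaults to an empty string on the assumption that the language (such as chi_sim) is not space-delimited.

```python
import random


def generate_training_text(unigrams, num_lines=5, words_per_line=8, sep="", seed=0):
    """Build lines of text by sampling words in proportion to their counts.

    unigrams: dict mapping word -> frequency count (hypothetical input format).
    sep: string placed between words; "" for non-space-delimited languages.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    words = list(unigrams)
    weights = [unigrams[w] for w in words]
    lines = []
    for _ in range(num_lines):
        # Weighted sampling with replacement: frequent words appear more often.
        sampled = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(sep.join(sampled))
    return lines


if __name__ == "__main__":
    # Toy frequency table; real data would come from your own corpus.
    toy_unigrams = {"的": 50, "中国": 20, "文字": 10, "识别": 5}
    for line in generate_training_text(toy_unigrams, num_lines=3):
        print(line)
```

Unigram sampling ignores word order, so the lines will not read as natural text. If you have bigram data, sampling each word conditioned on the previous one would be a natural extension, and you may also want to check that rare characters appear often enough in the output.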

