You can also see https://ancientgreekocr.org/ for Nick White's method of creating training data for Ancient Greek.
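If you only have a wordlist with unigram counts, one rough way to get some text to render is to sample words in proportion to their frequency. The snippet below is only a sketch under that assumption. It is not the unreleased Google tooling and not Nick White's method, and the function name `make_training_text`, its parameters, and the sample counts are all hypothetical. For a language without spaces between words, such as chi_sim, pass an empty separator.

```python
import random


def make_training_text(unigram_freq, num_lines=5, words_per_line=8,
                       separator=" ", seed=0):
    """Build lines of text by sampling words in proportion to their counts.

    unigram_freq maps each word to a positive count (hypothetical input format).
    """
    rng = random.Random(seed)          # fixed seed so the output is repeatable
    words = list(unigram_freq)
    weights = [unigram_freq[w] for w in words]
    lines = []
    for _ in range(num_lines):
        # random.choices samples with replacement, weighted by frequency
        sampled = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(separator.join(sampled))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up counts, for illustration only
    freq = {"的": 50, "是": 20, "中国": 10, "文字": 5}
    print(make_training_text(freq, num_lines=3, words_per_line=10, separator=""))
```

Sampling words independently ignores the bigram data, so the result is not natural text. It only gives frequent words more coverage in the rendered lines.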

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 16, 2017 at 8:18 AM, ShreeDevi Kumar <[email protected]> wrote:

> > Where are these scripts, or how can I otherwise generate training text
> > from dictionary/corpus data?
>
> These are (most probably) internal scripts at Google which have not been
> open sourced.
>
> Please see
> https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ
> which has Ray's comments regarding the generation of training text.
>
> ShreeDevi
>
> On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang <[email protected]> wrote:
>
>> Dear all,
>>
>> I'm trying to generate a training text (chi_sim) for training tesseract
>> because I have a better dictionary and unigram/bigram data than the
>> defaults. I've found the following comments in
>> training/language-specific.sh:
>>
>> (line 845)
>> # Set language-specific values for several global variables, including
>> #   ${TEXT_CORPUS}
>> #      holds the text corpus file for the language, used in phase F
>> #   ${FONTS[@]}
>> #      holds a sequence of applicable fonts for the language, used in
>> #      phase F & I. only set if not already set, i.e. from command line
>> #   ${TRAINING_DATA_ARGUMENTS}
>> #      non-default arguments to the training_data program used in phase T
>> #   ${FILTER_ARGUMENTS} -
>> #      character-code-specific filtering to distinguish between scripts
>> #      (eg. CJK) used by filter_borbidden_characters in phase F
>> #   ${WORDLIST2DAWG_ARGUMENTS}
>> #      specify fixed length dawg generation for non-space-delimited lang
>> # TODO(dsl): We can refactor these into functions that assign FONTS,
>> # TEXT_CORPUS, etc. separately.
>>
>> So I suppose there are scripts called training_data (phase T)
>> and filter_borbidden_characters (sic, phase F) to create the training
>> text from some wordlists and unigram/bigram frequency data.
>>
>> Where are these scripts, or how can I otherwise generate training text
>> from dictionary/corpus data?
>>
>> Thanks.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduX%2BXYv4%3D1GrrGjaPpxmjVz7zDzCqrkzTzOEVRemXtzx6Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

