> Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?

These are most probably internal Google scripts that have not been
open-sourced.

Please see
https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ
which has Ray's comments regarding the generation of training text.
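
In the meantime, if you already have unigram counts, one rough alternative
is to sample words in proportion to their frequency and write the result
out as training text. The sketch below is only an illustration of that
idea, not the Google scripts mentioned in language-specific.sh. The
function names (load_unigrams, generate_training_text) and the
"word<TAB>count" input format are my own assumptions, and bigram data is
not used here.

```python
import random


def load_unigrams(lines):
    """Parse 'word<TAB>count' lines (an assumed format) into words and weights."""
    words, weights = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        words.append(word)
        weights.append(int(count))
    return words, weights


def generate_training_text(words, weights, num_lines, words_per_line,
                           sep=" ", seed=0):
    """Build text lines by sampling words in proportion to their counts.

    Use sep="" for languages written without spaces between words,
    such as chi_sim.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for _ in range(num_lines):
        sample = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(sep.join(sample))
    return "\n".join(lines)


if __name__ == "__main__":
    # Tiny made-up sample; replace with your own unigram file contents.
    demo = ["的\t100", "中国\t20", "文字\t5"]
    demo_words, demo_weights = load_unigrams(demo)
    print(generate_training_text(demo_words, demo_weights,
                                 num_lines=3, words_per_line=8, sep=""))
```

Sampling single words independently loses word order, so the text will not
read naturally. It should still cover the characters in your wordlist
roughly in proportion to how often they occur.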

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang <[email protected]>
wrote:

> Dear all,
>
> I'm trying to generate a training text (chi_sim) for training tesseract
> because I have a better dictionary and unigram/bigram data than the
> defaults. I've found the following comments in
> training/language-specific.sh
>
> (line 845)
> # Set language-specific values for several global variables, including
> #   ${TEXT_CORPUS}
> #      holds the text corpus file for the language, used in phase F
> #   ${FONTS[@]}
> #      holds a sequence of applicable fonts for the language, used in
> #      phase F & I. only set if not already set, i.e. from command line
> #   ${TRAINING_DATA_ARGUMENTS}
> #      non-default arguments to the training_data program used in phase T
> #   ${FILTER_ARGUMENTS} -
> #      character-code-specific filtering to distinguish between scripts
> #      (eg. CJK) used by filter_borbidden_characters in phase F
> #   ${WORDLIST2DAWG_ARGUMENTS}
> #      specify fixed length dawg generation for non-space-delimited lang
> # TODO(dsl): We can refactor these into functions that assign FONTS,
> # TEXT_CORPUS, etc. separately.
>
> So I suppose there are scripts called training_data (phase T)
> and filter_borbidden_characters (sic, phase F) to create the training
> text from some wordlists and unigram/bigram frequency data.
>
> Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?
>
> Thanks.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/9a5c68ce-43d5-449e-81c1-ff7237133053%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

