Re: [tesseract-ocr] How to regenerate the training text

ShreeDevi Kumar Thu, 15 Jun 2017 19:52:00 -0700

You can also see https://ancientgreekocr.org/ for Nick White's method of
creating training data for Ancient Greek.
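
For the original question about building a training text from a wordlist with unigram counts, here is a minimal sketch in Python. It is not the internal Google tooling mentioned in the quoted reply, and it is not Nick White's method either. The function name make_training_text and all of its parameters are hypothetical. It samples words in proportion to their counts and ignores bigram data entirely. For a non-space-delimited language such as chi_sim you would probably pass sep="".

```python
import random


def make_training_text(freqs, n_lines=5, words_per_line=8, sep=" ", seed=0):
    """Build text lines by sampling words in proportion to their unigram counts.

    freqs is a dict mapping word -> count (hypothetical input format).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    words = list(freqs)
    weights = [freqs[w] for w in words]
    lines = []
    for _ in range(n_lines):
        # random.choices samples with replacement, weighted by count
        sample = rng.choices(words, weights=weights, k=words_per_line)
        lines.append(sep.join(sample))
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy counts for illustration only, not real corpus statistics
    demo = {"的": 50, "中国": 20, "文字": 5}
    print(make_training_text(demo, n_lines=3, words_per_line=6, sep=""))
```

The output could then be saved as a plain-text file and used in place of the default training text, assuming the rest of the training setup accepts a custom text file.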


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 16, 2017 at 8:18 AM, ShreeDevi Kumar <[email protected]>
wrote:

> >Where are these scripts, or how can I otherwise generate training text
> from dictionary/corpus data?
>
> These are (most probably) internal scripts at Google which have not been
> open sourced.
>
> Please see
> https://groups.google.com/forum/#!searchin/tesseract-ocr/training$20text%7Csort:date/tesseract-ocr/-B0mWBwki5w/zuR4R6AGAgAJ
> which has Ray's comments regarding the generation of training text.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Thu, Jun 15, 2017 at 7:05 PM, Dingyuan Wang <[email protected]>
> wrote:
>
>> Dear all,
>>
>> I'm trying to generate a training text (chi_sim) for training tesseract
>> because I have a better dictionary and unigram/bigram data than the
>> defaults. I've found the following comments in
>> training/language-specific.sh
>>
>> (line 845)
>> # Set language-specific values for several global variables, including
>> #   ${TEXT_CORPUS}
>> #      holds the text corpus file for the language, used in phase F
>> #   ${FONTS[@]}
>> #      holds a sequence of applicable fonts for the language, used in
>> #      phase F & I. only set if not already set, i.e. from command line
>> #   ${TRAINING_DATA_ARGUMENTS}
>> #      non-default arguments to the training_data program used in phase T
>> #   ${FILTER_ARGUMENTS} -
>> #      character-code-specific filtering to distinguish between scripts
>> #      (eg. CJK) used by filter_borbidden_characters in phase F
>> #   ${WORDLIST2DAWG_ARGUMENTS}
>> #      specify fixed length dawg generation for non-space-delimited lang
>> # TODO(dsl): We can refactor these into functions that assign FONTS,
>> # TEXT_CORPUS, etc. separately.
>>
>> So I suppose there are scripts called training_data (phase T)
>> and filter_borbidden_characters (sic, phase F) to create the training
>> text from some wordlists and unigram/bigram frequency data.
>>
>> Where are these scripts, or how can I otherwise generate training text
>> from dictionary/corpus data?
>>
>> Thanks.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9a5c68ce-43d5-449e-81c1-ff7237133053%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

