Comments from Ray regarding training text

> For Latin-based languages, the existing model data provided has been
trained on about 400000 textlines spanning about 4500 fonts
<https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>.
For other scripts, not so many fonts are available, but they have still
been trained on a similar number of textlines. Instead of taking a few
minutes to a couple of hours to train, Tesseract 4.00 takes a few *days* to
a couple of *weeks.*

> The text corpus is from *all* the www, taken several years ago, plus more
recent data from wiki-something. The text is divided by language
automatically, so there is a separate stream for each of the
Devanagari-based languages (as there is for the Latin-based languages) and
clipped to 1GB for each language. For each language, the text is frequency
counted and cleaned by multiple methods, and sometimes this cleaning is too
stringent automatically, or not stringent enough, so forbidden_characters
and desired_characters are used as a guide in the cleanup process. There
are other lang-specific numbers like a 1-in-n discard ratio for the
frequency. For some languages, the amount of data produced at the end is
very thin.
>
> The unicharset is extracted from what remains, as is the wordlist that is
published in langdata.
>
> For the LSTM training, I resorted to using Google's parallel infrastructure
to render enough text in all the languages.
>
> However much or little corpus text there is, the rendering process makes
50000 chunks of 50 words to render in a different combination of font and
random degradation, which results in 400000-800000 rendered textlines. The
words are chosen to approximately echo the real frequency of conjunct
clusters (characters in most languages) in the source text, while also
using the most frequent words.
>
> This process is all done without significant manual intervention, but
counts of the number of generated textlines indicate when it has gone
badly, usually due to a lack of fonts, or a lack of corpus text. I recently
stopped training chr, iku, khm, mya after discovering that I have no
rendered textlines that contain anything other than digits and punctuation.
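
To make the chunking step Ray describes a bit more concrete, here is a rough Python sketch of my own. It is not Ray's code and not part of the Tesseract training tools. The function names `make_chunks` and `assign_fonts` are made up for illustration, and it samples by plain word frequency, which is a simplification of the conjunct-cluster frequency matching mentioned above. Degradation and the rendering itself are left out.

```python
import random
from collections import Counter


def make_chunks(corpus_text, num_chunks=50000, words_per_chunk=50, seed=0):
    """Build chunks of words sampled in proportion to corpus word frequency.

    Hypothetical helper: the defaults mirror the 50000 x 50 figures quoted
    above, but the sampling rule is an assumption, not the real pipeline.
    """
    counts = Counter(corpus_text.split())
    if not counts:
        return []  # an empty corpus gives nothing to render
    words = list(counts)
    weights = [counts[w] for w in words]
    rng = random.Random(seed)
    return [
        rng.choices(words, weights=weights, k=words_per_chunk)
        for _ in range(num_chunks)
    ]


def assign_fonts(chunks, fonts, seed=0):
    """Pair each chunk with a randomly chosen font name (hypothetical step)."""
    rng = random.Random(seed)
    return [(rng.choice(fonts), chunk) for chunk in chunks]


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat too"
    chunks = make_chunks(sample, num_chunks=3, words_per_chunk=5)
    for font, chunk in assign_fonts(chunks, ["FontA", "FontB"]):
        print(font, " ".join(chunk))
```

Even in a toy version like this, a very thin corpus gives chunks that repeat the same few words, which fits Ray's point that the textline counts show when a language has gone badly.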

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 4:59 PM, Ibr <[email protected]> wrote:

> Hi,
>
> assume that I have created 20 LSTM files for English, for example, each
> LSTM file for a different font. When I run detection against an image
> with the command: *tesseract image results -l eng --tessdata-dir
> ./tessdata --oem 1* does Tesseract check the image against all the LSTM
> files, or just take one of them and run detection against it?
>
> I'm assuming that to make detection more accurate I should create
> many LSTM files for different fonts, because images can differ in font
> from each other, so it would be more accurate to have an
> LSTM file for every possible font. Is that correct?
>
> Thanks
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/729fe287-e7b1-4f06-903b-25151b8126c6%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/729fe287-e7b1-4f06-903b-25151b8126c6%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

