Comments from Ray regarding training text:

> For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few *days* to a couple of *weeks*.

> The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages) and clipped to 1GB for each language. For each language, the text is frequency counted and cleaned by multiple methods, and sometimes this cleaning is too stringent automatically, or not stringent enough, so forbidden_characters and desired_characters are used as a guide in the cleanup process. There are other lang-specific numbers like a 1-in-n discard ratio for the frequency. For some languages, the amount of data produced at the end is very thin.
>
> The unicharset is extracted from what remains, and the wordlist that is published in langdata.
>
> For the LSTM training, I resorted to using Google's parallel infrastructure to render enough text in all the languages.
>
> However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.
>
> This process is all done without significant manual intervention, but counts of the number of generated textlines indicates when it has gone badly, usually due to a lack of fonts, or a lack of corpus text. I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.

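As a rough illustration of the chunking Ray describes, here is a minimal Python sketch. It is not the actual Google tooling: the function names, the font names and the `degradation` field are hypothetical, and it samples whole words by their corpus frequency, which is a simplification of the conjunct-cluster frequency matching mentioned above. Only the figures 50000, 50 and 400000-800000 come from the quoted text.

```python
import random
from collections import Counter

NUM_CHUNKS = 50000      # figure taken from the quoted description
WORDS_PER_CHUNK = 50    # figure taken from the quoted description


def make_chunks(corpus_words, fonts, num_chunks=NUM_CHUNKS,
                words_per_chunk=WORDS_PER_CHUNK, seed=0):
    """Build chunks of words sampled by corpus frequency.

    Each chunk is paired with a randomly chosen font and a random
    degradation level in [0, 1). Both are hypothetical stand-ins for
    whatever the real rendering pipeline uses.
    """
    rng = random.Random(seed)
    counts = Counter(corpus_words)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    chunks = []
    for _ in range(num_chunks):
        chunks.append({
            # Frequent words are drawn more often (sampling with replacement).
            "words": rng.choices(vocab, weights=weights, k=words_per_chunk),
            "font": rng.choice(fonts),
            "degradation": rng.random(),
        })
    return chunks


def lines_per_chunk(total_textlines, num_chunks=NUM_CHUNKS):
    """Average number of rendered textlines produced per chunk."""
    return total_textlines / num_chunks


if __name__ == "__main__":
    demo = make_chunks("the cat sat on the mat the end".split(),
                       fonts=["Font A", "Font B"], num_chunks=3)
    print(len(demo), len(demo[0]["words"]))
    print(lines_per_chunk(400000), lines_per_chunk(800000))
```

With the quoted figures, 400000-800000 textlines from 50000 chunks works out to 8 to 16 textlines per chunk on average. How those lines are split between line wrapping and repeated renderings of the same chunk is not stated in the quoted text.
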
ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 4:59 PM, Ibr wrote:

> Hi,
>
> assume that I have created 20 LSTM files for English, for example, each LSTM file for a different font. When I run detection against an image with the command *tesseract image results -l eng --tessdata-dir ./tessdata --oem 1*, does tesseract check the image against all the LSTM files, or does it just take one of them and run detection against it?
>
> I'm assuming that to make detection more accurate I should create many LSTM files for different fonts, because images can use different fonts from each other, so this way it would be more accurate since I would have an LSTM file for every possible font. Is that correct?
>
> Thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduV9RqoWqcG2jeeEfesEaQYca%3Dr9_Xrm6gjsFNbM_wVy_Q%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.

