> Assume that I have created 20 LSTM files for English, for example, each LSTM file for a different font. When I run detection against an image with the command *tesseract image results -l eng --tessdata-dir ./tessdata --oem 1*, does Tesseract check the image against all LSTM files, or does it just take one of them and run detection against it?
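
For reference, here is a minimal Python sketch of the invocation from the question, with `-l eng` and `--tessdata-dir` passed as separate arguments. The helper name `build_tesseract_command` is hypothetical, and the sketch only builds and prints the argument list. Actually running it assumes the `tesseract` binary is installed and that `./tessdata` contains `eng.traineddata`.

```python
import shlex


def build_tesseract_command(image, output_base, lang="eng",
                            tessdata_dir="./tessdata", oem=1):
    """Return the argument list for a Tesseract run (hypothetical helper)."""
    return [
        "tesseract", image, output_base,
        "-l", lang,                      # language: selects <lang>.traineddata
        "--tessdata-dir", tessdata_dir,  # directory holding the traineddata files
        "--oem", str(oem),               # OCR engine mode 1: LSTM only
    ]


cmd = build_tesseract_command("image", "results")
print(shlex.join(cmd))

# To run it for real (assumes tesseract is on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```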
The .lstmf files are created per font/image. lstmtraining processes all of them together to create one .lstm file for the language. Maybe it keeps the .lstmf files internally. I do not know whether it checks against just one of them or creates a combined version to use for recognition.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 7:05 PM, ShreeDevi Kumar wrote:

> Comments from Ray regarding training text
>
> > For Latin-based languages, the existing model data provided has been trained on about 400000 textlines spanning about 4500 fonts <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>. For other scripts, not so many fonts are available, but they have still been trained on a similar number of textlines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few *days* to a couple of *weeks*.
>
> > The text corpus is from *all* the www, taken several years ago, plus more recent data from wiki-something. The text is divided by language automatically, so there is a separate stream for each of the Devanagari-based languages (as there is for the Latin-based languages) and clipped to 1GB for each language. For each language, the text is frequency counted and cleaned by multiple methods, and sometimes this cleaning is too stringent automatically, or not stringent enough, so forbidden_characters and desired_characters are used as a guide in the cleanup process. There are other lang-specific numbers like a 1-in-n discard ratio for the frequency. For some languages, the amount of data produced at the end is very thin.
>
> > The unicharset is extracted from what remains, and the wordlist that is published in langdata.
>
> > For the LSTM training, I resorted to using Google's parallel infrastructure to render enough text in all the languages.
>
> > However much or little corpus text there is, the rendering process makes 50000 chunks of 50 words to render in a different combination of font and random degradation, which results in 400000-800000 rendered textlines. The words are chosen to approximately echo the real frequency of conjunct clusters (characters in most languages) in the source text, while also using the most frequent words.
>
> > This process is all done without significant manual intervention, but counts of the number of generated textlines indicate when it has gone badly, usually due to a lack of fonts, or a lack of corpus text. I recently stopped training chr, iku, khm, mya after discovering that I have no rendered textlines that contain anything other than digits and punctuation.
>
> ShreeDevi
>
> On Mon, Jun 5, 2017 at 4:59 PM, Ibr wrote:
>
>> Hi,
>>
>> Assume that I have created 20 LSTM files for English, for example, each LSTM file for a different font. When I run detection against an image with the command *tesseract image results -l eng --tessdata-dir ./tessdata --oem 1*, does Tesseract check the image against all LSTM files, or does it just take one of them and run detection against it?
>>
>> I'm assuming that to make the detection more accurate I should create many LSTM files for different fonts, because images can have different fonts from each other, so this way it would be more accurate since I would have an LSTM file for every possible font. Is that correct?
>>
>> Thanks
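
To make the numbers in Ray's comments concrete, here is a rough Python sketch of planning render jobs: 50000 chunks of 50 words, each paired with a font and a random degradation. This is only an illustration under stated assumptions, not the actual pipeline. The function `plan_render_jobs`, the toy word list, the font names, and the simple frequency-weighted sampling are all hypothetical stand-ins, and the real word selection (echoing conjunct-cluster frequency) is more involved than this.

```python
import random

N_CHUNKS = 50_000       # chunks per language, from Ray's comments
WORDS_PER_CHUNK = 50    # words per chunk, from Ray's comments


def plan_render_jobs(word_freq, fonts, n_chunks, words_per_chunk, seed=0):
    """Plan render jobs: each job gets a font, a degradation seed and a word chunk.

    Assumption: words are sampled in proportion to their corpus frequency,
    a simplification of the selection described in the thread.
    """
    rng = random.Random(seed)
    vocab = list(word_freq)
    weights = [word_freq[w] for w in vocab]
    jobs = []
    for _ in range(n_chunks):
        jobs.append({
            "font": rng.choice(fonts),                 # one font per chunk
            "degradation_seed": rng.randrange(2**32),  # stands in for random degradation
            "words": rng.choices(vocab, weights=weights, k=words_per_chunk),
        })
    return jobs


# Toy inputs (hypothetical); a small n_chunks keeps the demo fast.
word_freq = {"the": 50, "of": 30, "tesseract": 5, "font": 3}
fonts = ["Font A", "Font B", "Font C"]
jobs = plan_render_jobs(word_freq, fonts, n_chunks=100,
                        words_per_chunk=WORDS_PER_CHUNK)

# Arithmetic implied by the quoted figures.
total_words = N_CHUNKS * WORDS_PER_CHUNK                            # 2,500,000 words
lines_per_chunk = (400_000 // N_CHUNKS, 800_000 // N_CHUNKS)        # 8 to 16 textlines
print(total_words, lines_per_chunk)
```

So the quoted figures work out to 2.5 million rendered words per language, at roughly 8 to 16 textlines per 50-word chunk. All fonts feed one training set for one model per language, rather than one model per font.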

