Re: [tesseract-ocr] Detection Using LSTM Files

ShreeDevi Kumar Mon, 05 Jun 2017 06:42:33 -0700

>Assume that I have created 20 LSTM files for English, for example, each
LSTM file for a different font. When I run detection on an image
with the command *tesseract image results -l eng --tessdata-dir
./tessdata --oem 1*, does Tesseract check the image against all LSTM
files, or just take one of them and run detection against it?


The .lstmf files are created per font/image. lstmtraining processes all
of them together to create one .lstm file for the language.
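
To illustrate "all of them together": lstmtraining reads its training data from a text file that lists the .lstmf files (the --train_listfile option), so the per-font files end up in one training run. Below is a minimal sketch, assuming the per-font .lstmf files sit in a single directory. The function name, directory and list-file name are hypothetical, chosen only for this example.

```python
from pathlib import Path


def write_train_listfile(lstmf_dir, listfile):
    """Write one .lstmf path per line into listfile and return the paths.

    Hypothetical helper: the resulting file is the kind of list that
    lstmtraining accepts via --train_listfile.
    """
    # Collect every per-font .lstmf file in the directory, in a stable order.
    paths = sorted(str(p) for p in Path(lstmf_dir).glob("*.lstmf"))
    # One path per line, newline-terminated.
    Path(listfile).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths
```

The list file is then what gets passed to lstmtraining, which is why the per-font files do not turn into per-font models.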

Maybe it keeps the .lstmf files internally. I do not know whether it
checks against just one of them or creates a combined version to use for
recognition.
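
For reference, here is the command from the question written out as an argument list, with the space between "eng" and "--tessdata-dir" that the original was missing. As far as I understand it, "-l eng" makes tesseract load a single file named eng.traineddata from the tessdata directory, so there is one model file per language name rather than one per font. This is only a sketch: the two helper functions are hypothetical, and the image and output names are the placeholders from the question.

```python
def build_tesseract_command(image, output_base, lang="eng",
                            tessdata_dir="./tessdata", oem=1):
    """Return the tesseract command line as a list of arguments."""
    return [
        "tesseract", image, output_base,
        "-l", lang,                      # language name, selects <lang>.traineddata
        "--tessdata-dir", tessdata_dir,  # directory holding the traineddata files
        "--oem", str(oem),               # 1 = LSTM engine only
    ]


def expected_model_path(lang, tessdata_dir):
    """Path of the single traineddata file that '-l <lang>' refers to."""
    return f"{tessdata_dir.rstrip('/')}/{lang}.traineddata"
```

The list from build_tesseract_command can be handed to subprocess.run if tesseract is installed.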


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Jun 5, 2017 at 7:05 PM, ShreeDevi Kumar <[email protected]>
wrote:

> Comments from Ray regarding training text
>
> > For Latin-based languages, the existing model data provided has been
> trained on about 400000 textlines spanning about 4500 fonts
> <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>.
> For other scripts, not so many fonts are available, but they have still
> been trained on a similar number of textlines. Instead of taking a few
> minutes to a couple of hours to train, Tesseract 4.00 takes a few *days* to
> a couple of *weeks.*
>
> > The text corpus is from *all* the www, taken several years ago, plus more
> recent data from wiki-something. The text is divided by language
> automatically, so there is a separate stream for each of the
> Devanagari-based languages (as there is for the Latin-based languages) and
> clipped to 1GB for each language. For each language, the text is frequency
> counted and cleaned by multiple methods, and sometimes this cleaning is too
> stringent automatically, or not stringent enough, so forbidden_characters
> and desired_characters are used as a guide in the cleanup process. There
> are other lang-specific numbers like a 1-in-n discard ratio for the
> frequency. For some languages, the amount of data produced at the end is
> very thin.
>
> > The unicharset is extracted from what remains, as is the wordlist that is
> published in langdata.
>
> > For the LSTM training, I resorted to using Google's parallel
> infrastructure to render enough text in all the languages.
>
> > However much or little corpus text there is, the rendering process makes
> 50000 chunks of 50 words to render in a different combination of font and
> random degradation, which results in 400000-800000 rendered textlines. The
> words are chosen to approximately echo the real frequency of conjunct
> clusters (characters in most languages) in the source text, while also
> using the most frequent words.
>
> > This process is all done without significant manual intervention, but
> counts of the number of generated textlines indicate when it has gone
> badly, usually due to a lack of fonts, or a lack of corpus text. I recently
> stopped training chr, iku, khm, mya after discovering that I have no
> rendered textlines that contain anything other than digits and punctuation.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, Jun 5, 2017 at 4:59 PM, Ibr <[email protected]> wrote:
>
>> Hi,
>>
>> Assume that I have created 20 LSTM files for English, for example, each
>> LSTM file for a different font. When I run detection on an image
>> with the command *tesseract image results -l eng --tessdata-dir
>> ./tessdata --oem 1*, does Tesseract check the image against all LSTM
>> files, or just take one of them and run detection against it?
>>
>> I'm assuming that to make detection more accurate I should create
>> many LSTM files for different fonts, because images can use different
>> fonts. That way I would have an LSTM file for every possible font and
>> detection would be more accurate. Is that correct?
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/729fe287-e7b1-4f06-903b-25151b8126c6%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
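
The rendering step in Ray's comments above (50000 chunks of 50 words, each rendered with a different combination of font and random degradation) can be sketched roughly as follows. This is a simplified illustration, not Ray's actual pipeline: it samples words in proportion to their plain word frequency, whereas the real process is described as echoing the frequency of conjunct clusters. The function name, the dictionary keys and the font list are all hypothetical.

```python
import random
from collections import Counter


def make_chunks(corpus_words, fonts, num_chunks=50000, words_per_chunk=50, seed=0):
    """Build text chunks, each paired with a font and a degradation seed.

    Hypothetical sketch: words are drawn with probability proportional to
    how often they occur in corpus_words.
    """
    rng = random.Random(seed)
    counts = Counter(corpus_words)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]
    chunks = []
    for _ in range(num_chunks):
        # Frequency-weighted sampling with replacement.
        words = rng.choices(vocab, weights=weights, k=words_per_chunk)
        chunks.append({
            "font": rng.choice(fonts),                  # font used for this chunk
            "degradation_seed": rng.randrange(2 ** 32),  # stands in for random degradation
            "text": " ".join(words),
        })
    return chunks
```

With the figures quoted above, 50000 chunks producing 400000-800000 textlines works out to roughly 8 to 16 rendered textlines per chunk.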

