Re: [tesseract-ocr] Fine-turning LSTM for Japanese

ShreeDevi Kumar Sun, 28 May 2017 22:58:33 -0700

Also look at all three scripts used for training

https://github.com/tesseract-ocr/tesseract/blob/master/training/language-specific.sh


https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain_utils.sh

https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh

https://github.com/tesseract-ocr/tesseract/blob/8e79297dcefecdb929d753d28554fec51417ec39/ccutil/unicharcompress.cpp

// Most simple scripts
// will encode a single index to a UTF8-string, but Chinese, Japanese, Korean
// and the Indic scripts will contain a many-to-many mapping.
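
To make the comment above a bit more concrete, here is a small Python
sketch of how a single unichar can correspond to several smaller units.
This is only my illustration, not Tesseract's code: it uses Unicode NFD
decomposition as a stand-in, and the helper name `components` is made up.
The real encoding in unicharcompress.cpp may differ in detail.

```python
import unicodedata


def components(unichar):
    """Split one unichar into smaller units (illustrative stand-in only)."""
    # NFD breaks a precomposed Hangul syllable into its jamo; a Devanagari
    # conjunct is already a sequence of code points and is left as it is.
    return list(unicodedata.normalize("NFD", unichar))


if __name__ == "__main__":
    hangul = "\ud55c"              # HANGUL SYLLABLE HAN
    akshara = "\u0915\u094d\u0937"  # DEVANAGARI KA + VIRAMA + SSA
    for unichar in ("a", hangul, akshara):
        print(repr(unichar), "->", len(components(unichar)), "unit(s)")
```

A Latin letter stays a single unit, while the Hangul syllable and the
Devanagari conjunct each break into three.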

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 29, 2017 at 10:44 AM, ShreeDevi Kumar <[email protected]>
wrote:

> Ray is the best person to answer your questions. I can only share my
> experience trying to train using Devanagari script.
>
> Fine-tuning will work if all you want to change is the font while keeping
> the same unicharset. This works well for Latin-script languages but not
> for complex scripts.
>
> E.g. for Devanagari, the consonants, vowel marks and combining marks
> together make an 'akshara' glyph, and the unicharset in the language model
> contains these aksharas. If the new training text has additional akshara
> glyphs, fine-tune training gives errors such as "Encoding of string failed!"
>
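
A rough way to spot this before training is to check whether the new
training text can be spelled out using only entries that are already in
the unicharset. The sketch below is only an illustration under stated
assumptions: the entries are passed in as a plain Python set, the helper
name `uncovered_spans` and the sample entries are hypothetical, and greedy
longest match is a simplification of what Tesseract really does.

```python
def uncovered_spans(text, unichars):
    """Return the characters of `text` that cannot be encoded with `unichars`.

    Uses greedy longest match and skips whitespace (a simplification).
    """
    longest = max((len(u) for u in unichars), default=0)
    missing = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, then shorter ones.
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in unichars:
                i += n
                break
        else:
            # No entry matches at this position.
            missing.append(text[i])
            i += 1
    return missing


if __name__ == "__main__":
    # Hypothetical unicharset entries: KA, KA + vowel sign I, SSA.
    entries = {"\u0915", "\u0915\u093f", "\u0937"}
    # Training text: "KA+I", then the conjunct KA + VIRAMA + SSA.
    print(uncovered_spans("\u0915\u093f \u0915\u094d\u0937", entries))
```

Anything it reports is a likely candidate for the encoding error, since
the existing unicharset has no entry that covers it.
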
> For Devanagari, I have tried training by changing the top layer. This adds
> the new akshara glyphs. However, for good accuracy, training has to continue
> until the error rate reaches 0.01%, which takes very long. I have not been
> able to reach that level in my training. Also, this may affect accuracy on
> the originally trained fonts. Currently, using --eval_listfile to evaluate
> on a different set of images during training does not work.
>
> The -dawgs (directed acyclic word graphs) are a way of compressing the
> wordlists. See https://tesseract-ocr.repairfaq.org/allaboutdawg.html
>
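
As a toy illustration of why a DAWG is smaller than a plain word list or
trie, the Python sketch below builds a trie and then counts how many nodes
remain once identical subtrees are merged. It is only a sketch of the
general idea, not Tesseract's dawg file format; all names in it are made
up for the example.

```python
def build_trie(words):
    """Build a nested-dict trie; "$" marks end of word (assumed unused in words)."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def count_trie_nodes(node):
    """Count every node of the trie, including the root."""
    return 1 + sum(count_trie_nodes(child)
                   for key, child in node.items() if key != "$")


def count_dawg_nodes(root):
    """Count the nodes left after merging identical subtrees."""
    registry = {}  # subtree signature -> canonical node id

    def canon(node):
        children = tuple(sorted((key, canon(child))
                                for key, child in node.items() if key != "$"))
        signature = ("$" in node, children)
        return registry.setdefault(signature, len(registry))

    canon(root)
    return len(registry)


if __name__ == "__main__":
    trie = build_trie(["tap", "taps", "top", "tops"])
    print("trie nodes:", count_trie_nodes(trie))
    print("dawg nodes:", count_dawg_nodes(trie))
```

For these four words the trie needs 8 nodes, but the shared "p"/"ps"
endings collapse, leaving 5 nodes in the merged graph.
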
> There is no way to finetune the legacy engine.
>
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Mon, May 29, 2017 at 9:19 AM, Akira Hayakawa <[email protected]>
> wrote:
>
>> Thanks for the reply. I understand.
>>
>> There are a couple of questions related to this topic.
>>
>> 1)
>>
>> Should training_text include only the text for the next (new) round of
>> learning? For example, if the LSTM net has already learned the line
>> "I have a pen" and we need it to learn the line "I have a pineapple",
>> should training_text include only the pineapple line, with the pen line
>> removed?
>>
>> 2)
>>
>> In
>> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh
>>
>> the files in langdata other than training_text are said to be optional.
>> I suppose these files are internally handled as hints. Am I right?
>> And what if these files are inconsistent with training_text? For example,
>> wordlist may contain fairly irrelevant words.
>> Should I erase the optional files if they are inconsistent?
>>
>> 3)
>>
>> Closely related to 2).
>> When the langdata doesn't have these optional files, does Tesseract
>> internally generate them from training_text?
>>
>> 4)
>>
>> Is there no way to fine-tune legacy tesseract?
>>
>> 5)
>>
>> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> There is a note:
>>
>>> NOTE Tesseract 4.00 will now run happily with a traineddata file that
>>> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
>>> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
>>> No bigrams, unichar ambigs or any of the other files are needed or even have
>>> any effect if present.
>>
>>
>> Does this mean that if we use LSTM only (legacy Tesseract is going to be
>> removed in a future release, right?), the optional files like wordlist are
>> entirely unnecessary? This sounds natural to me because, as far as I
>> understand, the LSTM net only learns a text line from a sequence of bytes
>> or an image.
>> By the way, what does "dawgs" mean?
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/5d410061-f281-42bd-98f5-04a746700dca%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

