Re: [tesseract-ocr] Fine-turning LSTM for Japanese

ShreeDevi Kumar Sun, 28 May 2017 22:15:06 -0700

Ray is the best person to answer your questions. I can only share my
experience trying to train using Devanagari script.


Fine tuning works if all you want to change is the font and the unicharset
stays the same. This works well for Latin-script languages but not for
complex scripts.

E.g. for Devanagari, the consonants, vowel marks and combining marks together
make up an 'akshara' glyph, and the unicharset in the language model lists
these. If the new training text contains akshara glyphs that are not in that
unicharset, fine-tune training fails with errors such as
"Encoding of string failed!".
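
As a rough illustration of why new akshara glyphs break fine tuning, here is
a minimal Python sketch that lists the clusters in a new training text that
are missing from a known symbol list. The clustering rule (a base character
plus following combining marks, with a virama joining the next consonant) is
my own simplification and not how Tesseract encodes text; split_aksharas,
missing_aksharas and the sample data are hypothetical names made up for this
example.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant); joins consonants


def split_aksharas(text):
    """Roughly group code points into akshara-like clusters.

    Simplified rule (an assumption, not Tesseract's algorithm): a combining
    mark attaches to the previous cluster, and so does any character that
    directly follows a virama.
    """
    clusters = []
    for word in text.split():
        word_clusters = []
        for ch in word:
            is_mark = unicodedata.category(ch).startswith("M")
            after_virama = bool(word_clusters) and word_clusters[-1].endswith(VIRAMA)
            if word_clusters and (is_mark or after_virama):
                word_clusters[-1] += ch
            else:
                word_clusters.append(ch)
        clusters.extend(word_clusters)
    return clusters


def missing_aksharas(known_symbols, new_text):
    """Return clusters of new_text that are absent from known_symbols."""
    known = set(known_symbols)
    missing = []
    for cluster in split_aksharas(new_text):
        if cluster not in known and cluster not in missing:
            missing.append(cluster)
    return missing


# Hypothetical data: the "unicharset" knows only two symbols.
known = ["\u0915\u094d\u0937", "\u092f"]  # kSSa conjunct, ya
new_text = "\u0915\u094d\u0937\u0924\u094d\u0930\u093f\u092f"  # "kshatriya"
print(missing_aksharas(known, new_text))
```

If the printed list is not empty, the new text needs symbols the existing
model cannot encode, which is the situation where plain fine tuning fails
for me.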

For Devanagari, I have tried training by replacing the top layer. This adds
the new akshara glyphs. However, for good accuracy, training has to continue
until the error rate is down to 0.01%, which takes very long; I have not been
able to reach that level in my training. Also, this may affect accuracy on the
originally trained fonts. Currently, using --eval_listfile with a different
set of images during training does not work.

The -dawgs are a compressed form of the wordlists (DAWG = Directed Acyclic
Word Graph).
https://tesseract-ocr.repairfaq.org/allaboutdawg.html
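
To show the idea behind the compression, here is a small Python sketch that
builds a plain trie from a wordlist and then merges identical subtrees, which
is what turns a trie into a DAWG. This is only an illustration of the data
structure; it is not Tesseract's actual dawg file format or code, and the
function names and the four-word list are made up for the example.

```python
def build_trie(words):
    """Build a nested-dict trie; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def count_trie_nodes(node):
    """Count every node in the trie, including end-of-word markers."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def minimize(node, registry):
    """Assign one id per distinct subtree, so identical subtrees are shared."""
    signature = tuple(
        sorted((ch, minimize(child, registry)) for ch, child in node.items())
    )
    if signature not in registry:
        registry[signature] = len(registry)
    return registry[signature]


words = ["tap", "taps", "top", "tops"]  # hypothetical wordlist
trie = build_trie(words)
registry = {}
minimize(trie, registry)

trie_nodes = count_trie_nodes(trie)
dawg_nodes = len(registry)
print(trie_nodes, dawg_nodes)  # prints "12 6"
```

The shared endings ("p", "ps") are stored once in the merged graph instead of
once per word, which is why a large wordlist shrinks so much.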

There is no way to fine-tune the legacy engine.



ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, May 29, 2017 at 9:19 AM, Akira Hayakawa <[email protected]> wrote:

> Thanks for the reply. I understand.
>
> There are a couple of questions related to this topic.
>
> 1)
>
> Should training_text include only the text for the next (or new) round of
> learning? For example, if the LSTM net has learned a line "I have a pen"
> and we need it to learn a line "I have a pineapple", should training_text
> include only the pineapple line, with the pen line removed?
>
> 2)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-tesstrain.sh
>
> the files in langdata other than training_text are said to be optional.
> I suppose these files are internally handled as hints. Am I right?
> And what if these files are inconsistent with training_text? For example,
> wordlist may contain fairly irrelevant words.
> Should I erase the optional files if they are inconsistent?
>
> 3)
>
> Closely related to 2).
> When the langdata doesn't have these optional files, does Tesseract
> internally generate them from training_text?
>
> 4)
>
> Is there no way to fine-tune legacy tesseract?
>
> 5)
>
> In https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> There is a note:
>
>> NOTE Tesseract 4.00 will now run happily with a traineddata file that
>> contains just lang.lstm. The lstm-*-dawgs are optional, and none of the
>> other files are required or used with OEM_LSTM_ONLY as the OCR engine mode.
>> No bigrams, unichar ambigs or any of the other files are needed or even have
>> any effect if present.
>
>
> Does this mean that if we use LSTM only (the legacy engine is going to be
> removed in a future release, right?), the optional files like wordlist are
> entirely unnecessary? This sounds natural to me because, as far as I
> understand, the LSTM net only learns a text line from a sequence of bytes
> or an image.
> By the way, what does "dawgs" mean?
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/5d410061-f281-42bd-98f5-04a746700dca%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/5d410061-f281-42bd-98f5-04a746700dca%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
> For more options, visit https://groups.google.com/d/optout.
>
