From what I understand from Ray Smith's documentation on LSTM training,
the models were trained on hundreds of thousands of lines and hundreds
of fonts. The network spec used for training from scratch is therefore
likely tuned for such large models.

You seem to have a different requirement, hence I suggested building the
legacy tesseract model.

You can experiment and see if it is better.
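
One way to run that experiment, as a rough sketch only: OCR the same image
list with each model and diff the results. `--oem 0` selects the legacy
engine and `--oem 1` the LSTM engine. The model name `iqi_legacy` and the
output file names are assumptions for illustration; `iqi` and the image list
path are taken from the commands quoted further down.

```shell
#!/bin/sh
# Image list from the original post; adjust to your own setup.
IMG_LIST="iqitrain2/iqi.training_img.txt"

# Run tesseract on the image list.
#   $1 = traineddata name (-l), $2 = engine mode (0 = legacy, 1 = LSTM)
run_ocr() {
    tesseract "$IMG_LIST" stdout -l "$1" --oem "$2"
}

# Save the output of both models and show where they differ.
# "iqi_legacy" is a hypothetical name for the legacy traineddata.
compare_models() {
    run_ocr iqi_legacy 0 > legacy.txt
    run_ocr iqi 1 > lstm.txt
    diff legacy.txt lstm.txt
}
```

Calling `compare_models` from the training directory prints the lines that
differ; no output means both models produced the same text.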

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Jun 1, 2018 at 12:23 PM, Julien Jemine <[email protected]>
wrote:

> Hi Shree,
>
> Thanks for your answer.
> If you don't mind, could you explain why it would be better?
>
> On Thursday, May 31, 2018 at 17:25:47 UTC+2, shree wrote:
>>
>> >I've trained an LSTM model for a custom language from scratch as explained
>> here
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>
>> >The language only has about 100 words and 17 characters, so it's pretty
>> simple.
>>
>> For such a small model, try to build the legacy version rather than LSTM.
>>
>> $tesstrain_dir/tesstrain.sh \
>>    --lang $Lang \
>>    --exposures "0" \
>>    --fonts_dir $fonts_dir \
>>    --fontlist $fonts_for_training \
>>    --langdata_dir $langdata_dir \
>>    --tessdata_dir $tessdata_dir \
>>    --training_text $langdata_dir/$Lang/$Lang.training_text \
>>    --output_dir $train_output_dir
>>
>>
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Thu, May 31, 2018 at 3:43 PM, Julien Jemine <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I've trained an LSTM model for a custom language from scratch as
>>> explained here
>>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00>.
>>>
>>> The language only has about 100 words and 17 characters, so it's pretty
>>> simple.
>>>
>>> When I run lstmeval on my model, I get a perfect match:
>>> [icm@u16-offcao-07] train1$ lstmeval --model /home/icm/share/tessdata/iqi.traineddata --eval_listfile iqitrain2/iqi.training_files.txt --verbosity 2
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Arial.exp0.lstmf
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Calibri.exp0.lstmf
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.lstmf
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Loaded 2/2 pages (1-2) of document /home/icm/train1/iqitrain2/iqi.Verdana.exp0.lstmf
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Truth:6CUEN 6 CU EN
>>> OCR  :6CUEN 6 CU EN
>>> Truth:ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> OCR  :ASTM 10FEEN 10 FE EN 13CUEN 13 CU EN 02B 11 16
>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>
>>> However, when I put my iqi.traineddata file in my tessdata folder and
>>> try to run tesseract on the same tif file, I get errors:
>>> [icm@u16-offcao-07] train1$ tesseract iqitrain2/iqi.training_img.txt stdout -l iqi
>>> Page 0 : /home/icm/train1/iqitrain2/iqi.Arial.exp0.tif
>>> 6CFEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEN
>>> 6CUEN 1 CU EN
>>> Page 1 : /home/icm/train1/iqitrain2/iqi.Calibri.exp0.tif
>>>
>>> 6CM 10FEEN 0 6 FEE 13CUEN 11 6 FE EEN 1116
>>> 6UEN 16 FE
>>> Page 2 : /home/icm/train1/iqitrain2/iqi.Lucida_Sans_Typewriter_Semi-Condensed.exp0.tif
>>>
>>> 6TM 13CUEN 13 1 EN 11CUE 11 CU EN 12B 11 16
>>> 6 6 CU EN
>>> Page 3 : /home/icm/train1/iqitrain2/iqi.Verdana.exp0.tif
>>>
>>> ASTM 103UEEN 13 1CU EN 13CUEN 13 6 FE EEN 11 16
>>> 6CUEN 6 CU EN
>>>
>>>
>>> Now the really frustrating part: I have the opposite phenomenon with the
>>> "eng" language! (with eng.traineddata taken from tessdata_best)
>>> lstmeval gives me a few errors (Eval Char error rate=2.4665552, Word
>>> error rate=16.666667)
>>> tesseract gives me the right answer! (But the images are generated with
>>> tesstrain.sh and very common fonts, it's probably to be expected).
>>>
>>> Am I doing something wrong?
>>> What's going on here?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/67286720-c624-4239-a812-3c76d7603cf1%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
