Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

Shree Devi Kumar Thu, 16 Jan 2020 03:05:22 -0800

tesseract unpack is a new feature by @stweil that is not yet in the master
branch. I was using it to check that your lstmf files are read correctly, and
they are.


For tesstrain, all you need are single line images and their gt.txt.
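
To make the pairing concrete: tesstrain expects each line image to sit next
to a ground-truth file with the same base name, for example line1.png and
line1.gt.txt. Below is a minimal shell sketch that lists images without a
non-empty gt.txt. The helper name find_missing_gt is made up for this
example (it is not part of tesstrain), and the .png/.tif extensions are an
assumption about your data.

```shell
# Print every line image in a directory that has no non-empty .gt.txt partner.
# Assumes the tesstrain naming convention: NAME.png or NAME.tif + NAME.gt.txt.
# find_missing_gt is a hypothetical helper, not a tesstrain command.
find_missing_gt() {
    dir=$1
    for img in "$dir"/*.png "$dir"/*.tif; do
        [ -e "$img" ] || continue      # glob matched nothing, skip the pattern
        base=${img%.*}                 # strip the image extension
        # -s is true only if the file exists and is not empty
        [ -s "$base.gt.txt" ] || printf '%s\n' "$img"
    done
}
```

Running find_missing_gt on your ground-truth directory should print nothing
before you start training; any path it prints is an image that still needs
a transcription.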

I ran lstmtraining using your lstmf files, which worked fine.

If you want to test, try the following in a directory where you have the
two sample lstmf files.
Change ~/tessdata_best to wherever you keep the best traineddata file.

ls -1 *.lstmf > all-lstmf
mkdir -p ./testdir
combine_tessdata -e ~/tessdata_best/eng.traineddata   ./testdir/eng.lstm

time lstmtraining \
   --debug_interval  -1 \
   --model_output ./testdir/impact \
   --continue_from ./testdir/eng.lstm \
   --train_listfile all-lstmf \
   --traineddata ~/tessdata_best/eng.traineddata \
   --max_iterations 400
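
If you want a more defensive version of the `ls -1 *.lstmf > all-lstmf`
step, something like the sketch below may help. It writes only non-empty
.lstmf files to the list and fails when nothing usable is found. This is
only a guess at one thing that can go wrong with a list file, not a
diagnosed cause of the "Deserialize header failed" message, and
make_listfile is a made-up helper name.

```shell
# Write the paths of all non-empty .lstmf files in a directory to a list file.
# Empty files are reported on stderr and left out of the list.
# make_listfile is a hypothetical helper, not part of Tesseract.
make_listfile() {
    src_dir=$1
    listfile=$2
    : > "$listfile"                        # start from an empty list
    for f in "$src_dir"/*.lstmf; do
        [ -e "$f" ] || continue            # glob matched nothing
        if [ -s "$f" ]; then
            printf '%s\n' "$f" >> "$listfile"
        else
            printf 'skipping empty file: %s\n' "$f" >&2
        fi
    done
    # Return non-zero if no usable file was listed.
    [ -s "$listfile" ]
}
```

For example, `make_listfile "$PWD" all-lstmf` writes absolute paths, so the
list keeps working if lstmtraining is started from another directory.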





On Thu, Jan 16, 2020 at 3:59 PM 'Fabio Lugli' via tesseract-ocr <
tesseract-ocr@googlegroups.com> wrote:

> The command *tesseract unpack* is not recognized by my version of
> tesseract. Is it a utility that you have yourself, or is it already in
> any release?
> Anyway, does it only extract the *.box*, *.gt.txt* and *.tif* files? If
> that's the case, can I simply copy those files into the folder?
>
> On Thursday, 16 January 2020 at 10:45:59 UTC+1, shree wrote:
>>
>> Are you sure you have the files in the right places? It seems to work for
>> me...
>>
>> ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro1.lstmf
>> Extracting eng.test.pro1.lstmf...
>> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls
>> eng.test.pro1_0.gt.txt  eng.test.pro1_0.png  eng.test.pro1.box
>>  eng.test.pro1.lstmf  eng.test.pro1.tif  eng.test.pro5.box
>>  eng.test.pro5.lstmf  eng.test.pro5.tif  fabio
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro5.lstmf
>> Extracting eng.test.pro5.lstmf...
>> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls -1 *.lstmf > all-lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$
>> ubuntu@tesseract-ocr:~/TEST/lstmf$  rm -rf ./lowercase_cursive
>> ubuntu@tesseract-ocr:~/TEST/lstmf$  mkdir -p ./lowercase_cursive
>> ubuntu@tesseract-ocr:~/TEST/lstmf$  #
>> ubuntu@tesseract-ocr:~/TEST/lstmf$  combine_tessdata -e ~/tessdata_best/eng.traineddata \
>> >  ./lowercase_cursive/eng.lstm
>> Extracting tessdata components from
>> /home/ubuntu/tessdata_best/eng.traineddata
>> Wrote ./lowercase_cursive/eng.lstm
>> Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
>> 17:lstm:size=11689099, offset=192
>> 18:lstm-punc-dawg:size=4322, offset=11689291
>> 19:lstm-word-dawg:size=3694794, offset=11693613
>> 20:lstm-number-dawg:size=4738, offset=15388407
>> 21:lstm-unicharset:size=6360, offset=15393145
>> 22:lstm-recoder:size=1012, offset=15399505
>> 23:version:size=80, offset=15400517
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ #
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ time lstmtraining \
>> >   --debug_interval  -1 \
>> >   --model_output ./lowercase_cursive/impact \
>> >   --continue_from ./lowercase_cursive/eng.lstm \
>> >   --train_listfile /home/ubuntu/TEST/lstmf/all-lstmf \
>> >   --traineddata ~/tessdata_best/eng.traineddata \
>> >   --max_iterations 400
>> Loaded file ./lowercase_cursive/eng.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from ./lowercase_cursive/eng.lstm
>> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
>> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
>> Iteration 0: GROUND  TRUTH : nominating any more Labour life Peers
>> Iteration 0: ALIGNED TRUTH : nominating any moree Labour life Peers
>> Iteration 0: BEST OCR TEXT : wominadng  ang wow.  Lobowr Lfe_ "Paoro
>> File eng.test.pro1.lstmf line 0 :
>> Mean rms=3.82%, delta=18.848%, train=75.676%(100%), skip ratio=0%
>> Iteration 1: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
>> Iteration 1: ALIGNED TRUTH : Griiffiths, MP for Mancheste Exchanngee
>> Iteration 1: BEST OCR TEXT : Galbhtha , UP Roe Mowomadl) Cxerlaomqre
>> File eng.test.pro5.lstmf line 0 :
>> Mean rms=3.908%, delta=20.581%, train=86.449%(100%), skip ratio=0%
>> Iteration 2: GROUND  TRUTH : nominating any more Labour life Peers
>> Iteration 2: BEST OCR TEXT : wominading any wone. Lobowr Lfe. "Paoro
>> File eng.test.pro1.lstmf line 0 :
>> Mean rms=3.74%, delta=19.305%, train=75.651%(94.444%), skip ratio=0%
>> Iteration 3: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
>> Iteration 3: ALIGNED TRUTH : Griffiths, MP for Mancheste Exchanngee
>> Iteration 3: BEST OCR TEXT : Galbhtha , MUP foe Manomadl) Cxclaomgle
>> File eng.test.pro5.lstmf line 0 :
>> Mean rms=3.708%, delta=18.921%, train=78.266%(95.833%), skip ratio=0%
>> Iteration 4: GROUND  TRUTH : nominating any more Labour life Peers
>> Iteration 4: BEST OCR TEXT : wominading any wone Loabour Lfe. "Paro
>>
>> On Wed, Jan 15, 2020 at 8:15 PM 'Fabio Lugli' via tesseract-ocr <
>> tesser...@googlegroups.com> wrote:
>>
>>> Yes, I forgot to do it in the latest post. I am sharing a couple of the
>>> images and their corresponding *.box* and *.lstmf* files. The others
>>> that I have tried so far are very similar to these.
>>>
>>> On Wednesday, 15 January 2020 at 15:38:23 UTC+1, shree wrote:
>>>>
>>>> Please share a couple of lstmf files for testing.
>>>>
>>>> On Wed, Jan 15, 2020 at 8:03 PM 'Fabio Lugli' via tesseract-ocr <
>>>> tesser...@googlegroups.com> wrote:
>>>>
>>>>> After some work I am able to:
>>>>> - Use the method *lstmbox* of *tesseract.exe* to obtain the *.box* files
>>>>> for my *.tif* images
>>>>> - Use the third-party software *JTessBoxEditor* to correct the
>>>>> recognized characters, leaving the boxes around the full line of text
>>>>> - Use the method *lstm.train* of *tesseract.exe* to obtain the
>>>>> *.lstmf* files from the *.box* files
>>>>>
>>>>> Now, when I try to run *lstmtraining.exe* using *eng.traineddata* as
>>>>> the starter traineddata, I get this error:
>>>>>
>>>>> Deserialize header failed: [myfile1].lstmf
>>>>> Deserialize header failed: [myfile2].lstmf
>>>>> Deserialize header failed: [myfile3].lstmf
>>>>> Loaded 1/1 lines (1-1) of document [myfile4].lstmf
>>>>> Load of images failed!!
>>>>>
>>>>> From this I understand that there is an error either in the process of
>>>>> creating the *.lstmf* files or in the images that I have selected.
>>>>> Any suggestions are welcome.
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/7e075fb6-ac4d-4125-96a6-98d520b88ca3%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/7e075fb6-ac4d-4125-96a6-98d520b88ca3%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUb7Yv3DiSfbpmgtiP1Lc49s6FJv01C18XPoPgi3_41Vw%40mail.gmail.com.
