Re: [tesseract-ocr] Re: Training Tesseract 5.0.0 to recognize digital handwriting

'Fabio Lugli' via tesseract-ocr Thu, 16 Jan 2020 02:30:55 -0800

The command *tesseract unpack* is not recognized by my version of 
tesseract. Is it a utility of your own, or is it already included in 
a release?
Anyway, does it only extract the *.box*, *.gt.txt* and *.tif* files? If 
that's the case, can I simply copy those files into the folder?
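
In the meantime, here is a minimal sketch of how the list file could be rebuilt while ruling out missing or empty files. It assumes a POSIX shell like the one in the session quoted below. make_lstmf_list is a hypothetical helper name, not a Tesseract tool, and the idea that an empty or unreadable .lstmf file is what triggers "Deserialize header failed" is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Sketch: write the paths of all non-empty .lstmf files found in a
# directory to a list file, one path per line.
# make_lstmf_list is a hypothetical helper, not part of Tesseract.
make_lstmf_list() {
    dir=$1
    out=$2
    : > "$out"                        # start from an empty list file
    for f in "$dir"/*.lstmf; do
        [ -e "$f" ] || continue       # the glob matched nothing
        if [ -s "$f" ]; then          # file exists and is not empty
            printf '%s\n' "$f" >> "$out"
        else
            echo "skipping empty file: $f" >&2
        fi
    done
}
```

For example, make_lstmf_list ~/TEST/lstmf ~/TEST/lstmf/all-lstmf would write a list that can then be passed to lstmtraining with --train_listfile. Any file reported as skipped would need to be regenerated first.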


On Thursday, 16 January 2020 at 10:45:59 UTC+1, shree wrote:
>
> Are you sure you have the files in the right places? It seems to work for 
> me...
>
> ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/lstmf
> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro1.lstmf
> Extracting eng.test.pro1.lstmf...
> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls
> eng.test.pro1_0.gt.txt  eng.test.pro1_0.png  eng.test.pro1.box 
>  eng.test.pro1.lstmf  eng.test.pro1.tif  eng.test.pro5.box 
>  eng.test.pro5.lstmf  eng.test.pro5.tif  fabio
> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack  eng.test.pro5.lstmf
> Extracting eng.test.pro5.lstmf...
> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls -1 *.lstmf > all-lstmf
> ubuntu@tesseract-ocr:~/TEST/lstmf$
> ubuntu@tesseract-ocr:~/TEST/lstmf$  rm -rf ./lowercase_cursive
> ubuntu@tesseract-ocr:~/TEST/lstmf$  mkdir -p ./lowercase_cursive
> ubuntu@tesseract-ocr:~/TEST/lstmf$  #
> ubuntu@tesseract-ocr:~/TEST/lstmf$  combine_tessdata -e ~/tessdata_best/eng.traineddata \
> >  ./lowercase_cursive/eng.lstm
> Extracting tessdata components from /home/ubuntu/tessdata_best/eng.traineddata
> Wrote ./lowercase_cursive/eng.lstm
> Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
> 17:lstm:size=11689099, offset=192
> 18:lstm-punc-dawg:size=4322, offset=11689291
> 19:lstm-word-dawg:size=3694794, offset=11693613
> 20:lstm-number-dawg:size=4738, offset=15388407
> 21:lstm-unicharset:size=6360, offset=15393145
> 22:lstm-recoder:size=1012, offset=15399505
> 23:version:size=80, offset=15400517
> ubuntu@tesseract-ocr:~/TEST/lstmf$ #
> ubuntu@tesseract-ocr:~/TEST/lstmf$ time lstmtraining \
> >   --debug_interval  -1 \
> >   --model_output ./lowercase_cursive/impact \
> >   --continue_from ./lowercase_cursive/eng.lstm \
> >   --train_listfile /home/ubuntu/TEST/lstmf/all-lstmf \
> >   --traineddata ~/tessdata_best/eng.traineddata \
> >   --max_iterations 400
> Loaded file ./lowercase_cursive/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Continuing from ./lowercase_cursive/eng.lstm
> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
> Iteration 0: GROUND  TRUTH : nominating any more Labour life Peers
> Iteration 0: ALIGNED TRUTH : nominating any moree Labour life Peers
> Iteration 0: BEST OCR TEXT : wominadng  ang wow.  Lobowr Lfe_ "Paoro
> File eng.test.pro1.lstmf line 0 :
> Mean rms=3.82%, delta=18.848%, train=75.676%(100%), skip ratio=0%
> Iteration 1: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
> Iteration 1: ALIGNED TRUTH : Griiffiths, MP for Mancheste Exchanngee
> Iteration 1: BEST OCR TEXT : Galbhtha , UP Roe Mowomadl) Cxerlaomqre
> File eng.test.pro5.lstmf line 0 :
> Mean rms=3.908%, delta=20.581%, train=86.449%(100%), skip ratio=0%
> Iteration 2: GROUND  TRUTH : nominating any more Labour life Peers
> Iteration 2: BEST OCR TEXT : wominading any wone. Lobowr Lfe. "Paoro
> File eng.test.pro1.lstmf line 0 :
> Mean rms=3.74%, delta=19.305%, train=75.651%(94.444%), skip ratio=0%
> Iteration 3: GROUND  TRUTH : Griffiths, MP for Mancheste Exchange
> Iteration 3: ALIGNED TRUTH : Griffiths, MP for Mancheste Exchanngee
> Iteration 3: BEST OCR TEXT : Galbhtha , MUP foe Manomadl) Cxclaomgle
> File eng.test.pro5.lstmf line 0 :
> Mean rms=3.708%, delta=18.921%, train=78.266%(95.833%), skip ratio=0%
> Iteration 4: GROUND  TRUTH : nominating any more Labour life Peers
> Iteration 4: BEST OCR TEXT : wominading any wone Loabour Lfe. "Paro
>
> On Wed, Jan 15, 2020 at 8:15 PM 'Fabio Lugli' via tesseract-ocr <
> tesser...@googlegroups.com> wrote:
>
>> Yes, I forgot to do it in my last post. I am sharing a couple of the 
>> images and their corresponding *.box* and *.lstmf* files. The others that 
>> I have tried so far are very similar to these.
>>
>> On Wednesday, 15 January 2020 at 15:38:23 UTC+1, shree wrote:
>>>
>>> Please share a couple of lstmf files for testing.
>>>
>>> On Wed, Jan 15, 2020 at 8:03 PM 'Fabio Lugli' via tesseract-ocr <
>>> tesser...@googlegroups.com> wrote:
>>>
>>>> After some work I am able to:
>>>> - Use the method *lstmbox* of *tesseract.exe* to obtain the *.box* files 
>>>> of my *.tif* images
>>>> - Use the third-party software *JTessBoxEditor* to correct the 
>>>> recognized characters, leaving boxes around the full line of text
>>>> - Use the method *lstm.train* of *tesseract.exe* to obtain the *.lstmf* 
>>>> files from the *.box* files
>>>>
>>>> Now when I try to use *lstmtraining.exe*, with *eng.traineddata* as the 
>>>> starting traineddata, I get the error:
>>>>
>>>> *Deserialize header failed: [myfile1].lstmf*
>>>> *Deserialize header failed: [myfile2].lstmf*
>>>> *Deserialize header failed: [myfile3].lstmf*
>>>> *Loaded 1/1 lines (1-1) of document [myfile4].lstmf*
>>>> *Load of images failed!!*
>>>>
>>>> From this I understand that there is an error either in the process of 
>>>> creating the *.lstmf* files or in the images that I have selected. 
>>>> Any suggestion is welcome.
>>>>
>>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesser...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/7e075fb6-ac4d-4125-96a6-98d520b88ca3%40googlegroups.com.
>>
>
>
> -- 
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/5c4e3998-ff4c-43be-b207-c5068c921c0a%40googlegroups.com.
