tesseract unpack is a new feature by @stweil that is not yet in the master branch. I was testing whether your lstmf files are read correctly, and they are.
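If you want a cheap sanity check on your side before training, something like the shell sketch below may help. It is only an illustration: check_listfile is a hypothetical helper, not part of Tesseract, and it only confirms that every path in a list file (the kind of file passed to --train_listfile) exists and is non-empty. It does not parse the lstmf format, so it cannot detect a corrupt header.

```shell
# check_listfile LISTFILE
# Hypothetical helper (not a Tesseract tool). Reads one path per line
# from LISTFILE, reports entries that are missing or empty, and returns
# non-zero if any problem was found.
check_listfile() {
    listfile=$1
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue      # skip blank lines
        if [ ! -f "$path" ]; then
            echo "missing: $path"
            status=1
        elif [ ! -s "$path" ]; then
            echo "empty: $path"
            status=1
        fi
    done < "$listfile"
    return $status
}
```

For example, running check_listfile all-lstmf before lstmtraining would flag an lstmf file that was written with zero bytes or a path that no longer exists.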
For tesstrain, all you need are single-line images and their gt.txt files.

I ran lstmtraining using your lstmf files, which worked fine. If you want to test, try the following in a directory where you have the two sample lstmf files. Change ~/tessdata_best to wherever you have the best traineddata file.

ls -1 *.lstmf > all-lstmf
mkdir -p ./testdir
combine_tessdata -e ~/tessdata_best/eng.traineddata ./testdir/eng.lstm
time lstmtraining \
  --debug_interval -1 \
  --model_output ./testdir/impact \
  --continue_from ./testdir/eng.lstm \
  --train_listfile all-lstmf \
  --traineddata ~/tessdata_best/eng.traineddata \
  --max_iterations 400

On Thu, Jan 16, 2020 at 3:59 PM 'Fabio Lugli' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> The command *tesseract unpack* is not recognized by my version of
> tesseract. Is it a utility that you have yourself, or is it already in
> any release?
> Anyway, does it only extract the *.box*, *.gt.txt* and *.tif* files? If that's
> the case, can I simply copy those files into the folder?
>
> On Thursday, 16 January 2020 at 10:45:59 UTC+1, shree wrote:
>>
>> Are you sure you have the files in the right places? It seems to work for
>> me...
>>
>> ubuntu@tesseract-ocr:~/tesseract$ cd ../TEST/lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack eng.test.pro1.lstmf
>> Extracting eng.test.pro1.lstmf...
>> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls
>> eng.test.pro1_0.gt.txt eng.test.pro1_0.png eng.test.pro1.box
>> eng.test.pro1.lstmf eng.test.pro1.tif eng.test.pro5.box
>> eng.test.pro5.lstmf eng.test.pro5.tif fabio
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ tesseract unpack eng.test.pro5.lstmf
>> Extracting eng.test.pro5.lstmf...
>> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ ls -1 *.lstmf > all-lstmf
>> ubuntu@tesseract-ocr:~/TEST/lstmf$
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ rm -rf ./lowercase_cursive
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ mkdir -p ./lowercase_cursive
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ #
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ combine_tessdata -e ~/tessdata_best/eng.traineddata \
>> > ./lowercase_cursive/eng.lstm
>> Extracting tessdata components from /home/ubuntu/tessdata_best/eng.traineddata
>> Wrote ./lowercase_cursive/eng.lstm
>> Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
>> 17:lstm:size=11689099, offset=192
>> 18:lstm-punc-dawg:size=4322, offset=11689291
>> 19:lstm-word-dawg:size=3694794, offset=11693613
>> 20:lstm-number-dawg:size=4738, offset=15388407
>> 21:lstm-unicharset:size=6360, offset=15393145
>> 22:lstm-recoder:size=1012, offset=15399505
>> 23:version:size=80, offset=15400517
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ #
>> ubuntu@tesseract-ocr:~/TEST/lstmf$ time lstmtraining \
>> > --debug_interval -1 \
>> > --model_output ./lowercase_cursive/impact \
>> > --continue_from ./lowercase_cursive/eng.lstm \
>> > --train_listfile /home/ubuntu/TEST/lstmf/all-lstmf \
>> > --traineddata ~/tessdata_best/eng.traineddata \
>> > --max_iterations 400
>> Loaded file ./lowercase_cursive/eng.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from ./lowercase_cursive/eng.lstm
>> Loaded 1/1 lines (1-1) of document eng.test.pro1.lstmf
>> Loaded 1/1 lines (1-1) of document eng.test.pro5.lstmf
>> Iteration 0: GROUND TRUTH : nominating any more Labour life Peers
>> Iteration 0: ALIGNED TRUTH : nominating any moree Labour life Peers
>> Iteration 0: BEST OCR TEXT : wominadng ang wow. Lobowr Lfe_ "Paoro
>> File eng.test.pro1.lstmf line 0 :
>> Mean rms=3.82%, delta=18.848%, train=75.676%(100%), skip ratio=0%
>> Iteration 1: GROUND TRUTH : Griffiths, MP for Mancheste Exchange
>> Iteration 1: ALIGNED TRUTH : Griiffiths, MP for Mancheste Exchanngee
>> Iteration 1: BEST OCR TEXT : Galbhtha , UP Roe Mowomadl) Cxerlaomqre
>> File eng.test.pro5.lstmf line 0 :
>> Mean rms=3.908%, delta=20.581%, train=86.449%(100%), skip ratio=0%
>> Iteration 2: GROUND TRUTH : nominating any more Labour life Peers
>> Iteration 2: BEST OCR TEXT : wominading any wone. Lobowr Lfe. "Paoro
>> File eng.test.pro1.lstmf line 0 :
>> Mean rms=3.74%, delta=19.305%, train=75.651%(94.444%), skip ratio=0%
>> Iteration 3: GROUND TRUTH : Griffiths, MP for Mancheste Exchange
>> Iteration 3: ALIGNED TRUTH : Griffiths, MP for Mancheste Exchanngee
>> Iteration 3: BEST OCR TEXT : Galbhtha , MUP foe Manomadl) Cxclaomgle
>> File eng.test.pro5.lstmf line 0 :
>> Mean rms=3.708%, delta=18.921%, train=78.266%(95.833%), skip ratio=0%
>> Iteration 4: GROUND TRUTH : nominating any more Labour life Peers
>> Iteration 4: BEST OCR TEXT : wominading any wone Loabour Lfe. "Paro
>>
>> On Wed, Jan 15, 2020 at 8:15 PM 'Fabio Lugli' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>
>>> Yes, I forgot to do it in the latest post. I am sharing a couple of the
>>> images and their corresponding *.box* and *.lstmf* files. The others
>>> that I have tried so far are very similar to these.
>>>
>>> On Wednesday, 15 January 2020 at 15:38:23 UTC+1, shree wrote:
>>>>
>>>> Please share a couple of lstmf files for testing.
>>>>
>>>> On Wed, Jan 15, 2020 at 8:03 PM 'Fabio Lugli' via tesseract-ocr <tesser...@googlegroups.com> wrote:
>>>>
>>>>> After some work I am able to:
>>>>> - Use the method *lstmbox* of *tesseract.exe* to obtain the *.box* files
>>>>> for my *.tif* images
>>>>> - Use the third-party software *JTessBoxEditor* to correct the
>>>>> recognized characters, leaving boxes around the full line of text
>>>>> - Use the method *lstm.train* of *tesseract.exe* to obtain the
>>>>> *.lstmf* files from the *.box* files
>>>>>
>>>>> Now when I try to use *lstmtraining.exe* with *eng.traineddata* as the
>>>>> starter traineddata, I get the error:
>>>>>
>>>>> *Deserialize header failed: [myfile1].lstmf*
>>>>> *Deserialize header failed: [myfile2].lstmf*
>>>>> *Deserialize header failed: [myfile3].lstmf*
>>>>> *Loaded 1/1 lines (1-1) of document [myfile4].lstmf*
>>>>> *Load of images failed!!*
>>>>>
>>>>> From this I understand that there is an error either in the process of
>>>>> creating the *.lstmf* files or in the images that I have
>>>>> selected. Any suggestions are welcome.
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/7e075fb6-ac4d-4125-96a6-98d520b88ca3%40googlegroups.com
>>> .
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>