Test with 5-10 files first to work out the correct process. The files are probably not in the correct location or format.
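
A rough sketch of that kind of small-scale check, in bash. It looks at the first few entries of a list file and reports any path that does not exist, is empty, starts with a literal ~ (the shell expands ~, but a program reading paths out of a text file generally does not), or ends in a carriage return left over from Windows line endings. The function name check_listfile and its two arguments are made up for illustration and are not part of Tesseract. Paths like ~/icr/train/a01-014-03.lstmf in the log below would be flagged by the tilde check, but whether that is the actual cause here is only a guess.

```bash
#!/usr/bin/env bash
# Sketch only: sanity-check the first few entries of a training list file.
# check_listfile is a hypothetical helper, not a Tesseract tool.
# Usage: check_listfile LISTFILE [LIMIT]   (LIMIT defaults to 10)
# Relative paths are tested against the current directory, so run this
# from the same directory you would run lstmtraining from.

check_listfile() {
  local listfile=$1
  local limit=${2:-10}
  local bad=0 count=0 path

  # Read line by line; the "|| [ -n ... ]" keeps a last line that has
  # no trailing newline.
  while { IFS= read -r path || [ -n "$path" ]; } && [ "$count" -lt "$limit" ]; do
    [ -z "$path" ] && continue        # skip blank lines
    count=$((count + 1))

    case $path in
      *$'\r')
        # Carriage return at end of line: list was saved with DOS line endings.
        echo "CRLF ${path%$'\r'}"
        bad=$((bad + 1))
        continue
        ;;
      "~"*)
        # Literal tilde: nothing expands this when the path is read from a file.
        echo "UNEXPANDED_TILDE $path"
        bad=$((bad + 1))
        continue
        ;;
    esac

    if [ ! -f "$path" ]; then
      echo "MISSING $path"
      bad=$((bad + 1))
    elif [ ! -s "$path" ]; then
      echo "EMPTY $path"
      bad=$((bad + 1))
    else
      echo "OK $path"
    fi
  done < "$listfile"

  # Succeed only if every checked entry looked fine.
  [ "$bad" -eq 0 ]
}

# Example (paths taken from the message below):
#   check_listfile tune/list.train 10
```

If every line comes back OK for a 5-10 file list and training still fails, the problem is more likely in how the .lstmf files themselves were generated than in where they are.
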

On Tue, 3 Sep 2019, 17:10 Pranav Budhwant, <[email protected]> wrote:

> I tried the same with Tesseract 4.1, and I generated all the files on
> Ubuntu instead of creating them on Windows and then converting to Unix
> formats. It still gives the same error. Please can anyone help me out here?
> I don't know what I'm doing wrong.
>
> On Friday, August 30, 2019 at 7:18:28 PM UTC+5:30, Pranav Budhwant wrote:
>>
>> *Tesseract Version:*
>>
>> tesseract 5.0.0-alpha leptonica-1.75.3
>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>
>> Found AVX2
>> Found AVX
>> Found SSE
>>
>> *Platform:*
>> Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1
>> 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>> *Current Behavior:*
>> I am trying to fine-tune tesseract lstm. I have done the following:
>>
>> 1. Downloaded & Extracted the current trained model for eng:
>>
>> cd tesseract/src/training/
>> mkdir extracted
>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>>
>> 2. Generated the *.lstmf files from *.tif and *.box files using:
>>
>> for file in *.tif; do
>> echo $file
>> base=`basename $file .tif`
>> tesseract $file $base --psm 7 nobatch lstm.train
>> done
>>
>> 3. Generated all-lstmf and list.train, list.eval files using:
>>
>> ls -1 *.lstmf | sort -R > all-lstmf
>> head -n 500 all-lstmf > list.eval
>> tail -n +500 all-lstmf > list.train
>>
>> While generating the *.lstmf files, Tesseract threw the following warning:
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>
>> 4. Training the model using:
>>
>> lstmtraining \
>> --model_output ~/icr/train_output/ \
>> --continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
>> --traineddata /usr/local/share/tessdata/eng.traineddata \
>> --train_listfile tune/list.train \
>> --eval_listfile tune/list.eval
>>
>> This however, throws the following error:
>>
>> Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
>> Deserialize header failed: ~/icr/train/a01-014-03.lstmf
>> Deserialize header failed: ~/icr/train/n04-107-01.lstmf
>> Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
>> Deserialize header failed: ~/icr/train/r03-090-03.lstmf
>> Deserialize header failed: ~/icr/train/r03-084-09.lstmf
>> Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
>> Load of page 0 failed!
>> Load of images failed!!
>> Deserialize header failed: ~/icr/train/j01-066-09.lstmf
>> Deserialize header failed: ~/icr/train/k04-075-02.lstmf
>> Deserialize header failed: ~/icr/train/n02-127-00.lstmf
>>
>> I have generated the *.box files in Windows, following the guidelines for
>> tesseract 4.0. I have converted the EOL of these box files to unix using
>> dos2unix format converter.
>> I have attached a sample .box file and the all-lstmf file for reference.
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2ab512d6-5406-4571-a5de-7ed4e4e023d3%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/2ab512d6-5406-4571-a5de-7ed4e4e023d3%40googlegroups.com?utm_medium=email&utm_source=footer>
> .

