@Shree, thanks for the help! There were actually two things wrong with what I was doing. First, I had forgotten to add a TAB at the end to mark the end of line. Second, I regenerated the box files in Ubuntu this time (rather than in Windows), and it works now!
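In case it helps anyone who hits the same error, here is a rough sketch of the check I could have run on my box files beforehand. `check_box` is a hypothetical helper of my own, not a Tesseract tool, and it assumes that the end-of-line marker is a box entry whose first character is a TAB, and that leftover Windows line endings show up as carriage returns. It only flags these two symptoms; it does not validate the box coordinates.

```shell
# check_box: hypothetical helper, not part of Tesseract.
# Flags two problems in a box file:
#   1. carriage returns left over from Windows line endings
#   2. no entry starting with a TAB (assumed end-of-line marker)
check_box() {
  box_file=$1
  box_status=0
  cr=$(printf '\r')
  tab=$(printf '\t')

  # Any carriage return means the file still has CRLF line endings.
  if grep -q "$cr" "$box_file"; then
    echo "$box_file: has CRLF line endings"
    box_status=1
  fi

  # At least one line should begin with a TAB to mark the end of a text line.
  if ! grep -q "^$tab" "$box_file"; then
    echo "$box_file: no TAB end-of-line entry"
    box_status=1
  fi

  return "$box_status"
}
```

Calling `check_box some-file.box` for each box file before the `lstm.train` step prints one message per problem found and returns a non-zero status if the file looks suspicious.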
On Friday, August 30, 2019 at 7:18:28 PM UTC+5:30, Pranav Budhwant wrote:
>
> *Tesseract Version:*
>
>     tesseract 5.0.0-alpha
>      leptonica-1.75.3
>       libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>      Found AVX2
>      Found AVX
>      Found SSE
>
> *Platform:*
>
>     Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> *Current Behavior:*
>
> I am trying to fine-tune tesseract lstm. I have done the following:
>
> 1. Downloaded & Extracted the current trained model for eng:
>
>     cd tesseract/src/training/
>     mkdir extracted
>     combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>
> 2. Generated the *.lstmf files from *.tif and *.box files using:
>
>     for file in *.tif; do
>       echo $file
>       base=`basename $file .tif`
>       tesseract $file $base --psm 7 nobatch lstm.train
>     done
>
> 3. Generated all-lstmf and list.train, list.eval files using:
>
>     ls -1 *.lstmf | sort -R > all-lstmf
>     head -n 500 all-lstmf > list.eval
>     tail -n +500 all-lstmf > list.train
>
> While generating the *.lstmf files, Tesseract threw the following warning:
>
>     Warning. Invalid resolution 0 dpi. Using 70 instead.
>
> 4. Training the model using:
>
>     lstmtraining \
>       --model_output ~/icr/train_output/ \
>       --continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
>       --traineddata /usr/local/share/tessdata/eng.traineddata \
>       --train_listfile tune/list.train \
>       --eval_listfile tune/list.eval
>
> This however, throws the following error:
>
>     Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
>     Warning: LSTMTrainer deserialized an LSTMRecognizer!
>     Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
>     Deserialize header failed: ~/icr/train/a01-014-03.lstmf
>     Deserialize header failed: ~/icr/train/n04-107-01.lstmf
>     Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
>     Deserialize header failed: ~/icr/train/r03-090-03.lstmf
>     Deserialize header failed: ~/icr/train/r03-084-09.lstmf
>     Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
>     Load of page 0 failed!
>     Load of images failed!!
>     Deserialize header failed: ~/icr/train/j01-066-09.lstmf
>     Deserialize header failed: ~/icr/train/k04-075-02.lstmf
>     Deserialize header failed: ~/icr/train/n02-127-00.lstmf
>
> I have generated the *.box files in Windows, following the guidelines for
> tesseract 4.0. I have converted the EOL of these box files to unix using
> dos2unix format converter.
> I have attached a sample .box file and the all-lstmf file for reference.

