@Shree, thanks for the help! There were actually two things wrong with what 
I was doing. First, I had forgotten to add a TAB at the end to mark the end 
of line. Second, I had generated the box files on Windows; I have now 
generated them in Ubuntu, and it works!
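
In case it helps anyone else, a quick check along these lines might have caught both problems earlier. It is only a sketch: it assumes the Tesseract 4 LSTM box format, where the end of a text line is marked by an entry that starts with a TAB character, and it also flags leftover Windows carriage returns. The function name check_box_file is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report two common problems in one .box file.
# Prints "<file>: ok" when neither is found; returns non-zero otherwise.
check_box_file() {
  box=$1
  tab=$(printf '\t')
  cr=$(printf '\r')
  status=0

  # Assumption: an end-of-line entry is a line whose first character is a TAB.
  if ! grep -q "^${tab}" "$box"; then
    echo "$box: no TAB end-of-line entry"
    status=1
  fi

  # Leftover carriage returns suggest Windows (CRLF) line endings.
  if grep -q "${cr}" "$box"; then
    echo "$box: contains CR characters"
    status=1
  fi

  if [ "$status" -eq 0 ]; then
    echo "$box: ok"
  fi
  return "$status"
}
```

It could be run over a directory with something like: for f in *.box; do check_box_file "$f"; done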

On Friday, August 30, 2019 at 7:18:28 PM UTC+5:30, Pranav Budhwant wrote:
>
>
> *Tesseract Version: *
>
> tesseract 5.0.0-alpha leptonica-1.75.3
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : 
> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
>
> *Platform:* 
> Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 
> 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> *Current Behavior:*
> I am trying to fine-tune the Tesseract LSTM model. I have done the following:
>
> 1. Downloaded & Extracted the current trained model for eng:
> cd tesseract/src/training/
> mkdir extracted
> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>
>
> 2. Generated the *.lstmf files from *.tif and *.box files using:
> for file in *.tif; do
>   echo "$file"
>   base=$(basename "$file" .tif)
>   tesseract "$file" "$base" --psm 7 nobatch lstm.train
> done
>
>
> 3. Generated all-lstmf and list.train, list.eval files using:
>
> ls -1 *.lstmf | sort -R > all-lstmf
> head -n 500 all-lstmf > list.eval
> tail -n +501 all-lstmf > list.train
>
> While generating the *.lstmf files, Tesseract threw the following warning:
> Warning. Invalid resolution 0 dpi. Using 70 instead.
>
> 4. Training the model using:
>
> lstmtraining \
> --model_output ~/icr/train_output/ \
> --continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
> --traineddata /usr/local/share/tessdata/eng.traineddata \
> --train_listfile tune/list.train \
> --eval_listfile tune/list.eval
>
>
> This, however, throws the following error:
>
> Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
> Deserialize header failed: ~/icr/train/a01-014-03.lstmf
> Deserialize header failed: ~/icr/train/n04-107-01.lstmf
> Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
> Deserialize header failed: ~/icr/train/r03-090-03.lstmf
> Deserialize header failed: ~/icr/train/r03-084-09.lstmf
> Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
> Load of page 0 failed!
> Load of images failed!!
> Deserialize header failed: ~/icr/train/j01-066-09.lstmf
> Deserialize header failed: ~/icr/train/k04-075-02.lstmf
> Deserialize header failed: ~/icr/train/n02-127-00.lstmf
>
>
> I generated the *.box files on Windows, following the guidelines for 
> Tesseract 4.0, and converted their line endings to Unix format using the 
> dos2unix converter.
> I have attached a sample .box file and the all-lstmf file for reference.

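When "Deserialize header failed" shows up for many files, it may also be worth checking the list files themselves before starting a training run. The sketch below assumes one path per line, as produced by the ls command in step 3. It reports listed files that are missing or empty, and entries shared between the evaluation and training lists. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: check that every path in a list file points at an
# existing, non-empty file. Prints each problem path and returns non-zero
# if any were found.
check_list_file() {
  list=$1
  status=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue      # skip blank lines
    if [ ! -s "$path" ]; then       # -s: file exists and has size > 0
      echo "missing or empty: $path"
      status=1
    fi
  done < "$list"
  return "$status"
}

# Hypothetical helper: print the lines of the second list that also appear,
# as exact whole lines, in the first list. No output means the two lists
# are disjoint.
shared_entries() {
  grep -Fxf "$1" "$2"
}
```

For example: check_list_file list.train, then shared_entries list.eval list.train. An existing, non-empty file can still be unreadable for training, so this only rules out the simplest causes.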