I tried the same steps with Tesseract 4.1, this time generating all the 
files directly on Ubuntu rather than creating them on Windows and converting 
them to Unix format. I still get the same error. Can anyone help me out 
here? I don't know what I'm doing wrong.

On Friday, August 30, 2019 at 7:18:28 PM UTC+5:30, Pranav Budhwant wrote:
>
>
> *Tesseract Version: *
>
> tesseract 5.0.0-alpha leptonica-1.75.3
>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : 
> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>
>  Found AVX2
>  Found AVX
>  Found SSE
>
>
> *Platform:* 
> Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 
> 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
>
>
> *Current Behavior:*
> I am trying to fine-tune the Tesseract LSTM model. I have done the following:
>
> 1. Downloaded & Extracted the current trained model for eng:
> cd tesseract/src/training/
> mkdir extracted
> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>
>
> 2. Generated the *.lstmf files from *.tif and *.box files using:
> for file in *.tif; do
>   echo "$file"
>   base=$(basename "$file" .tif)
>   tesseract "$file" "$base" --psm 7 nobatch lstm.train
> done
>
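
The `lstm.train` config reads the `.box` file that shares a basename with each image, so a missing box file or an empty `.lstmf` output is worth ruling out before training. A minimal sketch of such a check, assuming images, box files and outputs all sit in one directory; `check_training_files` is a hypothetical helper, not part of Tesseract:

```shell
# Hypothetical helper (not a Tesseract tool): for every .tif in a directory,
# report a missing .box file or a missing/empty .lstmf output.
# Usage: check_training_files <dir>
check_training_files() {
  local dir tif base status
  dir=$1
  status=0
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue        # the glob matched nothing
    base=${tif%.tif}
    if [ ! -f "$base.box" ]; then
      echo "no box file: $tif"
      status=1
    fi
    if [ ! -s "$base.lstmf" ]; then  # -s: exists and is larger than 0 bytes
      echo "no lstmf output: $tif"
      status=1
    fi
  done
  return "$status"
}
```

Running `check_training_files .` in the image directory after the loop prints one line per problem and returns non-zero if it found any.
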
>
> 3. Generated all-lstmf and list.train, list.eval files using:
>
> ls -1 *.lstmf | sort -R > all-lstmf
> head -n 500 all-lstmf > list.eval
> tail -n +501 all-lstmf > list.train
>
> While generating the *.lstmf files, Tesseract threw the following warning:
> Warning. Invalid resolution 0 dpi. Using 70 instead.
>
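
The split in step 3 can be wrapped in a small function that keeps the two lists disjoint and writes absolute paths. `tail -n +K` starts at line K, so after taking N lines for evaluation the training list has to start at line N+1. A minimal sketch; `make_lists` and its arguments are hypothetical names, and `sort -R` assumes GNU coreutils, as in the original commands:

```shell
# Hypothetical helper (not a Tesseract tool): build non-overlapping
# list.eval and list.train files that contain absolute paths.
# Usage: make_lists <lstmf_dir> <eval_count> <out_dir>
make_lists() {
  local src_dir eval_count out_dir
  src_dir=$(cd "$1" && pwd) || return 1   # absolute path, no "~" or "."
  eval_count=$2
  out_dir=$3
  mkdir -p "$out_dir" || return 1

  # One absolute path per line, in random order.
  find "$src_dir" -maxdepth 1 -type f -name '*.lstmf' | sort -R \
    > "$out_dir/all-lstmf"

  # The first N lines go to eval; lines N+1 onward go to train.
  head -n "$eval_count" "$out_dir/all-lstmf" > "$out_dir/list.eval"
  tail -n "+$((eval_count + 1))" "$out_dir/all-lstmf" > "$out_dir/list.train"
}
```

For example, `make_lists . 500 tune` would write `tune/list.eval` and `tune/list.train`, matching the file names passed to `lstmtraining`.
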
> 4. Trained the model using:
>
> lstmtraining \
> --model_output ~/icr/train_output/ \
> --continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
> --traineddata /usr/local/share/tessdata/eng.traineddata \
> --train_listfile tune/list.train \
> --eval_listfile tune/list.eval
>
>
> This, however, throws the following error:
>
> Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm, unpacking...
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
> Deserialize header failed: ~/icr/train/a01-014-03.lstmf
> Deserialize header failed: ~/icr/train/n04-107-01.lstmf
> Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
> Deserialize header failed: ~/icr/train/r03-090-03.lstmf
> Deserialize header failed: ~/icr/train/r03-084-09.lstmf
> Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
> Load of page 0 failed!
> Load of images failed!!
> Deserialize header failed: ~/icr/train/j01-066-09.lstmf
> Deserialize header failed: ~/icr/train/k04-075-02.lstmf
> Deserialize header failed: ~/icr/train/n02-127-00.lstmf
>
>
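
The failing paths in the log begin with a literal `~`. A shell expands `~` only in words it parses itself; a program that reads paths from a list file receives the string unchanged and cannot open it. Whether that is the cause here is an assumption, since nothing in the thread confirms it. A minimal sketch of a pre-flight check over a list file, where `check_list` is a hypothetical helper and not part of Tesseract:

```shell
# Hypothetical helper (not a Tesseract tool): report list-file entries that
# a program opening the path literally could not read.
# Usage: check_list <list_file>
check_list() {
  local list_file path bad
  list_file=$1
  bad=0
  while IFS= read -r path; do
    [ -n "$path" ] || continue       # skip blank lines
    case $path in
      "~"*)
        # The tilde is only expanded by a shell, never by the reading program.
        echo "unexpanded tilde: $path"
        bad=1
        continue
        ;;
    esac
    if [ ! -s "$path" ]; then        # -s: exists and is larger than 0 bytes
      echo "missing or empty: $path"
      bad=1
    fi
  done < "$list_file"
  return "$bad"
}
```

Running `check_list tune/list.train && check_list tune/list.eval` from the directory where `lstmtraining` is started also catches relative paths that do not resolve from there.
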
> I generated the *.box files on Windows, following the guidelines for 
> Tesseract 4.0, and converted their line endings to Unix format with 
> dos2unix.
> I have attached a sample .box file and the all-lstmf file for reference.
>
