Re: [tesseract-ocr] Re: Error: Deserialize header failed while fine-tuning Tesseract

Shree Devi Kumar Tue, 03 Sep 2019 09:22:16 -0700

Test with 5-10 files first to work out the correct process. The files are
probably not in the correct location or format.
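
A minimal sketch of such a small-sample check, assuming the list file holds one .lstmf path per line. check_listfile is a hypothetical helper, not a Tesseract tool. It only looks at the first few entries and reports paths that are missing, empty, or end in a Windows carriage return.

```shell
#!/bin/sh
# check_listfile: hypothetical helper (not part of Tesseract).
# Usage: check_listfile LISTFILE [LIMIT]
# Inspects the first LIMIT entries (default 10) of LISTFILE and reports
# entries that a reader of the list is unlikely to be able to open.
check_listfile() {
  listfile=$1
  limit=${2:-10}
  cr=$(printf '\r')
  n=0
  bad=0
  while [ "$n" -lt "$limit" ] && IFS= read -r path; do
    n=$((n + 1))
    case $path in
      *"$cr")
        # A CR at the end of the line becomes part of the file name.
        echo "trailing carriage return (Windows line ending?): $path"
        bad=$((bad + 1))
        continue
        ;;
    esac
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$path" ]; then
      echo "missing or empty: $path"
      bad=$((bad + 1))
    fi
  done < "$listfile"
  echo "checked $n entries, $bad problem(s)"
  [ "$bad" -eq 0 ]
}
```

For example, `check_listfile tune/list.train 5` would look at the first five entries only. Run it from the same directory lstmtraining is started in, since relative paths are resolved from there.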


On Tue, 3 Sep 2019, 17:10 Pranav Budhwant, <[email protected]> wrote:

> I tried the same with Tesseract 4.1, and I generated all the files on
> Ubuntu instead of creating them on Windows and then converting to Unix
> formats. It still gives the same error. Please can anyone help me out here?
> I don't know what I'm doing wrong.
>
> On Friday, August 30, 2019 at 7:18:28 PM UTC+5:30, Pranav Budhwant wrote:
>>
>>
>> *Tesseract Version: *
>>
>> tesseract 5.0.0-alpha leptonica-1.75.3
>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 :
>> libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>>
>>  Found AVX2
>>  Found AVX
>>  Found SSE
>>
>>
>> *Platform:*
>> Linux pranav-vm 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1
>> 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>>
>>
>> *Current Behavior:*
>> I am trying to fine-tune the Tesseract LSTM model. I have done the following:
>>
>> 1. Downloaded & Extracted the current trained model for eng:
>> cd tesseract/src/training/
>> mkdir extracted
>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>>
>>
>> 2. Generated the *.lstmf files from *.tif and *.box files using:
>> for file in *.tif; do
>>   echo "$file"
>>   base=$(basename "$file" .tif)
>>   tesseract "$file" "$base" --psm 7 nobatch lstm.train
>> done
>>
>>
>> 3. Generated all-lstmf and list.train, list.eval files using:
>>
>> ls -1 *.lstmf | sort -R > all-lstmf
>> head -n  500 all-lstmf > list.eval
>> tail -n +500 all-lstmf > list.train
>>
>> While generating the *.lstmf files, Tesseract threw the following warning:
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
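
A side note on the split in step 3: `head -n 500` keeps lines 1 to 500 and `tail -n +500` starts at line 500, so line 500 lands in both list.eval and list.train. That overlap is unlikely to be the cause of the error reported here, but a non-overlapping split could look like the sketch below. split_list and its arguments are hypothetical names used only for illustration.

```shell
#!/bin/sh
# split_list: hypothetical helper.
# Usage: split_list ALL_LIST N_EVAL OUT_DIR
# Writes the first N_EVAL lines of ALL_LIST to OUT_DIR/list.eval and
# every remaining line to OUT_DIR/list.train, with no line in both.
split_list() {
  all=$1
  n_eval=$2
  out_dir=$3
  head -n "$n_eval" "$all" > "$out_dir/list.eval"
  # tail -n +K starts output at line K, so start one past the eval block.
  tail -n "+$((n_eval + 1))" "$all" > "$out_dir/list.train"
}
```

With the numbers from step 3 this would be `split_list all-lstmf 500 .`, which is equivalent to using `tail -n +501` for list.train.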
>>
>> 4. Training the model using:
>>
>> lstmtraining \
>> --model_output ~/icr/train_output/ \
>> --continue_from /home/pranav/tesseract/src/training/extracted/eng.lstm \
>> --traineddata /usr/local/share/tessdata/eng.traineddata \
>> --train_listfile tune/list.train \
>> --eval_listfile tune/list.eval
>>
>>
>> This, however, throws the following error:
>>
>> Loaded file /home/pranav/tesseract/src/training/extracted/eng.lstm,
>> unpacking...
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from /home/pranav/tesseract/src/training/extracted/eng.lstm
>> Deserialize header failed: ~/icr/train/a01-014-03.lstmf
>> Deserialize header failed: ~/icr/train/n04-107-01.lstmf
>> Deserialize header failed: ~/icr/train/g06-037f-02.lstmf
>> Deserialize header failed: ~/icr/train/r03-090-03.lstmf
>> Deserialize header failed: ~/icr/train/r03-084-09.lstmf
>> Deserialize header failed: ~/icr/train/g06-037e-02.lstmf
>> Load of page 0 failed!
>> Load of images failed!!
>> Deserialize header failed: ~/icr/train/j01-066-09.lstmf
>> Deserialize header failed: ~/icr/train/k04-075-02.lstmf
>> Deserialize header failed: ~/icr/train/n02-127-00.lstmf
>>
>>
>> I generated the *.box files on Windows, following the guidelines for
>> Tesseract 4.0, and converted their line endings to Unix format with the
>> dos2unix converter.
>> I have attached a sample .box file and the all-lstmf file for reference.
>>
>>
>>
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2ab512d6-5406-4571-a5de-7ed4e4e023d3%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/2ab512d6-5406-4571-a5de-7ed4e4e023d3%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>
