Dear reader,
I want to improve Devanagari recognition.
I have images and manually corrected text with line coordinates.
From those, I've generated .box files;
see the attached file, which produces the error below.

Complete error message from lstmtraining:
»Encoding of string failed! Failure bytes: 9 32 37 38 ffffffe0 ffffffa4 
ffffff98 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa8 ffffffe0 
ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb6 ffffffe0 fffff...
Can't encode transcription: 'श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं 
प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि     
278घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरा...

...
...«

The .lstmf files are generated using »tesseract $tiff $box --tessdata-dir 
~/tessdata_best -l script/Devanagari lstm.train«

Training is run with
»combine_tessdata -u ~/tessdata_best/script/Devanagari.traineddata 
/tmp/Deva.trta
mkdir /tmp/deva
ls -1 *.lstmf >/tmp/list.txt
lstmtraining --model_output /tmp/deva --continue_from /tmp/Deva.trta.lstm  
--traineddata ~/tessdata_best/script/Devanagari.traineddata 
--train_listfile /tmp/list.txt«

I have double-checked that only characters from 
Devanagari.traineddata.lstm-unicharset are in the .box files.
No tabs, no control characters.
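A check of this kind can be sketched roughly as follows. This is only an illustration, not the exact check that was run: the function name `suspicious_symbols` is hypothetical, it assumes the common box layout of "symbol left bottom right top page" per line, and the set of allowed characters is assumed to have been collected from the unicharset beforehand.

```python
import unicodedata


def suspicious_symbols(box_text, allowed):
    """Return (line number, symbol, reason) for every questionable box entry.

    Assumes each box line looks like "symbol left bottom right top page".
    `allowed` is a set of characters taken from the unicharset (assumption).
    """
    findings = []
    for lineno, line in enumerate(box_text.split("\n"), start=1):
        if not line:
            continue  # skip empty lines, e.g. after the final newline
        # Split from the right so that the symbol may itself contain spaces.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            findings.append((lineno, line, "malformed line"))
            continue
        symbol = parts[0]
        for ch in symbol:
            if unicodedata.category(ch) == "Cc":
                # Cc covers tab, carriage return and other control codes.
                findings.append((lineno, symbol, "control character"))
            elif ch not in allowed:
                findings.append((lineno, symbol, "not in allowed set"))
    return findings


if __name__ == "__main__":
    sample = "\u0936 1 2 3 4 0\n\t 5 6 7 8 0\nx 1 2 3 4 0\n"
    for finding in suspicious_symbols(sample, {"\u0936"}):
        print(finding)
```

Splitting from the right keeps a symbol that is itself whitespace (such as a lone tab) visible instead of silently swallowing it.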

But the "9" from the error message above sounds like tab...?
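To see what the failure bytes spell out, they can be decoded as UTF-8. The sketch below assumes the dump is hexadecimal and that the leading "ffffff" is just sign extension of a single byte; the helper name `decode_failure_bytes` is made up for illustration.

```python
def decode_failure_bytes(dump):
    """Decode a 'Failure bytes' dump into text.

    Assumes space-separated hexadecimal values, where entries such as
    ffffffe0 are sign-extended single bytes (only the low 8 bits matter).
    """
    values = []
    for token in dump.split():
        try:
            values.append(int(token, 16) & 0xFF)
        except ValueError:
            break  # stop at a truncated token such as "fffff..."
    # A trailing incomplete UTF-8 sequence is dropped rather than guessed.
    return bytes(values).decode("utf-8", errors="ignore")


if __name__ == "__main__":
    dump = (
        "9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d "
        "ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 ffffffe0 "
        "ffffffa4 ffffffb6 ffffffe0 fffff..."
    )
    print(repr(decode_failure_bytes(dump)))
```

Under these assumptions the dump decodes to a tab followed by "278घ्नेश", which is the same spot where "278" appears inside the quoted transcription.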

Any ideas?

Kind regards, Jochen

PS: latest tesseract 4.1.0-rc1; tessdata_best: commit 95593f0b017280...


Attachment: durggapatha1890_-_001.box