Dear reader, I want to improve Devanagari recognition. I have images and manually corrected text with line coordinates. From those, I've generated .box files; see the attached file, which produces the error above.
Complete error message from lstmtraining:
»Encoding of string failed! Failure bytes: 9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb6 ffffffe0 fffff...
Can't encode transcription: 'श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि 278घ्नेश्वरायवरदायसुरप्रियाय लम्बोदरा... ... ...«

The .lstmf files are generated using
»tesseract $tiff $box --tessdata-dir ~/tessdata_best -l script/Devanagari lstm.train«

Training is run with
»combine_tessdata -u ~/tessdata_best/script/Devanagari.traineddata /tmp/Deva.trta
mkdir /tmp/deva
ls -1 *.lstmf >/tmp/list.txt
lstmtraining --model_output /tmp/deva --continue_from /tmp/Deva.trta.lstm --traineddata ~/tessdata_best/script/Devanagari.traineddata --train_listfile /tmp/list.txt«

I have double-checked that the .box files contain only characters from Devanagari.traineddata.lstm-unicharset: no tabs, no control characters. But the "9" at the start of the failure bytes in the error message looks like a tab...? Any ideas?

Kind regards,
Jochen

PS: latest tesseract 4.1.0-rc1; tessdata_best: commit 95593f0b017280...
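In case it helps with diagnosis, here is a minimal Python sketch that turns the "Failure bytes" dump from the error message back into readable text. It rests on an assumption of mine: that values such as ffffffe0 are single bytes printed as sign-extended chars, so only the low 8 bits matter. The function name decode_failure_bytes is made up for this illustration and is not part of tesseract.

```python
import re


def decode_failure_bytes(dump: str) -> str:
    """Decode a 'Failure bytes' hex dump from the log into text.

    Assumption: each value is one byte, possibly sign-extended
    (ffffffe0 stands for 0xe0), so we keep only the low 8 bits.
    """
    values = []
    for token in dump.split():
        # Stop at a token that the log cut off, e.g. 'fffff...'.
        if not re.fullmatch(r"[0-9a-fA-F]{1,8}", token):
            break
        values.append(int(token, 16) & 0xFF)
    # errors="replace" keeps an incomplete trailing UTF-8 sequence
    # from raising; it shows up as U+FFFD instead.
    return bytes(values).decode("utf-8", errors="replace")


dump = (
    "9 32 37 38 ffffffe0 ffffffa4 ffffff98 ffffffe0 ffffffa5 ffffff8d "
    "ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa5 ffffff87 "
    "ffffffe0 ffffffa4 ffffffb6 ffffffe0 fffff..."
)
text = decode_failure_bytes(dump)
print(repr(text))
```

For the dump above, the decoded text starts with a tab character (byte 9) followed by 278घ्नेश, which matches the "278घ्नेश्वराय..." part of the transcription. So the string that fails to encode does seem to contain a tab directly before "278".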
Attachment: durggapatha1890_-_001.box

