Hello Jochen,

I prefer the WordStr format since it makes it easier to correct the text against ground truth, so I have not tested with the lstmbox file. A cursory glance at your lstmbox file shows that it does not contain lines for the spaces between words.
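As a rough illustration of why WordStr boxes are convenient to correct, here is a minimal Python sketch that swaps the recognized text on each WordStr line for the matching ground-truth line. It assumes the box file layout is `WordStr <left> <bottom> <right> <top> <page> #<text>`, with each such line followed by a tab-prefixed end-of-line box; please check that against your own files. The function name `apply_ground_truth` and the coordinates in the example are hypothetical.

```python
def apply_ground_truth(box_lines, truth_lines):
    """Replace the text after '#' on each WordStr line with ground truth.

    Assumes one ground-truth line per WordStr box, in the same order.
    Lines that do not start with 'WordStr ' (for example the tab-prefixed
    end-of-line boxes) are copied through unchanged.
    """
    truth = iter(truth_lines)
    out = []
    for line in box_lines:
        if line.startswith("WordStr "):
            # Keep the coordinates, drop the old text after the first '#'.
            coords, _, _old_text = line.partition("#")
            try:
                out.append(coords + "#" + next(truth))
            except StopIteration:
                raise ValueError("fewer ground-truth lines than WordStr boxes") from None
        else:
            out.append(line)
    if next(truth, None) is not None:
        raise ValueError("more ground-truth lines than WordStr boxes")
    return out


# Illustrative data only: the coordinates are made up for this example.
boxes = [
    "WordStr 112 4667 451 4708 0 #श्रीगणेशाय नम ।।",
    "\t 452 4667 456 4708 0",
]
fixed = apply_ground_truth(boxes, ["श्रीगणेशाय नमः ।।"])
print("\n".join(fixed))
```

The coordinates stay untouched, so only the transcription changes. The sketch raises an error when the number of ground-truth lines does not match the number of WordStr boxes, since a silent mismatch would pair text with the wrong line.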
Another point to remember when training with images is that the transcription used as ground truth needs to be of gold-standard quality; otherwise the training will not improve the results. I noticed a few typos in the corrected OCR text for page 1 and many for page 10.

I used pages 1-12 for a test run of training with WordStr boxes, and that does lead to improved results. I used smaller images, so the coordinates may not match yours. I created the ground truth files using text from the website and corrected errors (mainly in pages 1 and 10); I did not review all of them for accuracy.

I will zip all the files and the training script so that you can test at your end. I am not getting the encoding-related errors. Please use `--psm 6` with the `lstm.train` command.

On Tue, Apr 23, 2019 at 1:53 PM Jochen Barth <[email protected]> wrote:

> Dear Shree,
> I've tried it with the format below and combined letter-and-sign-symbols
> (see attached file)
> and with WordStr-Format (see attached file),
> but still the same error...
>
> Kind regards, Jochen
>
> On 18.04.19 at 17:40, Shree Devi Kumar wrote:
>
> The following format (as in your box file) will not work for Devanagari.
>
> श 278 1253 2860 1413 0
> ् 278 1253 2860 1413 0
> र 278 1253 2860 1413 0
> ी 278 1253 2860 1413 0
> ग 278 1253 2860 1413 0
> ण 278 1253 2860 1413 0
> े 278 1253 2860 1413 0
> श 278 1253 2860 1413 0
> ा 278 1253 2860 1413 0
> य 278 1253 2860 1413 0
> न 278 1253 2860 1413 0
> म 278 1253 2860 1413 0
> ः 278 1253 2860 1413 0
> 278 1253 2860 1413 0
>
> See files in attached zip file which show the box/tiff pairs as created by
> text2image using the text with Murty Sanskrit font.
>
> श्री 112 4669 160 4708 0
> ग 156 4669 189 4701 0
> णे 185 4668 225 4708 0
> शा 221 4667 272 4700 0
> य 268 4668 301 4700 0
> न 297 4668 329 4700 0
> मः 326 4668 370 4700 0
> 370 4667 402 4701 0
> । 402 4667 407 4701 0
> । 428 4667 433 4700 0
> 433 4667 451 4700 0
>
> The above format works for training.
>
> Box files created by using the new `lstmbox` or `wordstrbox` formats
> should also work.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5ee99b59-c1d8-874e-043a-771b00f4b434%40ub.uni-heidelberg.de.
> For more options, visit https://groups.google.com/d/optout.

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

