Glad you figured out the problem. Please consider sharing the improved traineddata file (once you have completed training) with the tessdata_contrib repo.
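For anyone who hits the same error: the following is a minimal Python sketch, not the NKP.sh script from the thread (which uses sed), of how one might check a WordStr box file for the missing space after the tab and swap in ground-truth text. The line layout is an assumption inferred from the thread: a `WordStr <left> <bottom> <right> <top> <page> #<text>` line followed by a marker line consisting of a tab, a space, and five integers. The function names `check_wordstr_box` and `replace_text` are hypothetical.

```python
import re

# Assumed header layout: "WordStr <left> <bottom> <right> <top> <page> #<text>"
WORDSTR_RE = re.compile(r"^WordStr (\d+) (\d+) (\d+) (\d+) (\d+) #(.*)$")
# Assumed marker layout: a tab, then a space, then five integers.
MARKER_RE = re.compile(r"^\t \d+ \d+ \d+ \d+ \d+$")


def check_wordstr_box(text):
    """Return a list of (line_number, message) problems in a WordStr box file."""
    problems = []
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty piece after a trailing newline
    i = 0
    while i < len(lines):
        if not WORDSTR_RE.match(lines[i]):
            problems.append((i + 1, "expected a WordStr line"))
            i += 1
            continue
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        if nxt is None or WORDSTR_RE.match(nxt):
            # The marker line is absent; the next line (if any) is a new header.
            problems.append((i + 1, "no marker line after WordStr line"))
            i += 1
            continue
        if not MARKER_RE.match(nxt):
            if nxt.startswith("\t") and not nxt.startswith("\t "):
                problems.append((i + 2, "missing space after tab"))
            else:
                problems.append((i + 2, "malformed marker line"))
        i += 2
    return problems


def replace_text(box_text, truths):
    """Replace the text of each WordStr line with the matching ground-truth line.

    Coordinates and marker lines are copied through unchanged.
    """
    out = []
    used = 0
    for line in box_text.split("\n"):
        match = WORDSTR_RE.match(line)
        if match:
            if used >= len(truths):
                raise ValueError("fewer ground-truth lines than WordStr lines")
            coords = " ".join(match.groups()[:5])
            line = f"WordStr {coords} #{truths[used]}"
            used += 1
        out.append(line)
    if used != len(truths):
        raise ValueError("more ground-truth lines than WordStr lines")
    return "\n".join(out)
```

Running `check_wordstr_box` on each corrected box file before generating the lstmf files would flag a marker line that lost its space during editing.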
On Tue, 23 Apr 2019, 16:24 Jochen Barth, <[email protected]> wrote:

> Thanks a lot.
>
> The error seems to be the missing space after the tab character in the line
> below »WordStr«!
>
> Kind regards,
> Jochen
>
> On 23.04.19 at 12:02, Shree Devi Kumar wrote:
>
> > Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit
> > See NKP.sh and folder NKP
> >
> > The first part of the script loops through the images and creates WordStr
> > box files for them using tesseract.
> > It then uses sed to replace the recognized text with the ground truth.
> > This corrected box file is then used to create the lstmf files.
> >
> > lstmtraining is done on text lines.
> > Use of --psm 6 causes text in the margins to be included as part
> > of the line, e.g.
> > दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि
> >
> > You can use any other mechanism to create/correct box files.
> >
> > With 12 pages used for training for 700 iterations and one page used for
> > eval, the results are as follows:
> >
> > tessdata_best/san
> > At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595
> > NKP/NKP-eval.gt.txt: 142 words 63 44% common 0 0% deleted 79 56% changed
> > build/NKP-eval-san.txt: 142 words 63 44% common 2 1% inserted 77 54% changed
> >
> > tessdata_best/script/Devanagari
> > At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793
> > NKP/NKP-eval.gt.txt: 142 words 85 60% common 0 0% deleted 57 40% changed
> > build/NKP-eval-deva.txt: 141 words 85 60% common 0 0% inserted 56 40% changed
> >
> > san_NKP
> > At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221
> > NKP/NKP-eval.gt.txt: 142 words 108 76% common 0 0% deleted 34 24% changed
> > build/NKP-eval.txt: 142 words 108 76% common 0 0% inserted 34 24% changed
> >
> > san_NKP_int
> > At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509
> > NKP/NKP-eval.gt.txt: 142 words 106 75% common 0 0% deleted 36 25% changed
> > build/NKP_int-eval.txt: 142 words 106 75% common 0 0% inserted 36 25% changed
> >
> > On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar <[email protected]> wrote:
> >
> > > The zip file is too big. Let me do an alternative upload.
> > >
> > > Training runs OK for me:
> > >
> > > Warning: LSTMTrainer deserialized an LSTMRecognizer!
> > > Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
> > > Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
> > > Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
> > > Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
> > > Iteration 0: GROUND TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं निरस्त
> > > Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं निरस्त
> > > File NKP/dp10.lstmf line 0 :
> > > Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
> > > Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
> > > Iteration 1: GROUND TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः चक्रतुः यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
> > > Iteration 1: BEST OCR TEXT : T
> > > File NKP/dp11.lstmf line 0 :
> > > Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
> > > Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
> > > Iteration 2: GROUND TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
> > > Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८ उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
> > > Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम् २८' उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
> > > File NKP/dp12.lstmf line 0 :
> > > Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
> > > Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
> > > Iteration 3: GROUND TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा
> > > Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि स्व्वाशास्रेषु विशा
> > > Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु विशा
> > > File NKP/dp1.lstmf line 0 :
> > > Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
> > > Iteration 4: GROUND TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> > > Iteration 4: BEST OCR TEXT : महाभागः भागः भाग्यं सःअष्टमःमनुः महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> > > File NKP/dp2.lstmf line 0 :
> > > Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
> > > Iteration 5: GROUND TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
> > > Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः अतिप्रबलदणिडनः तस्थ तेःसह युद्धम् अतिप्रबलश्चासौदण्डशच अतिप्रबल
> > > File NKP/dp3.lstmf line 0 :
> > > Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
> > > Iteration 6: GROUND TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ: सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
> > > Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
> > > File NKP/dp4.lstmf line 0 :
> > > Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
> > > Iteration 7: GROUND TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> > > Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> > > Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप
> > >
> > > --
> > > ____________________________________________________________
> > > भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
> >
> > --
> > ____________________________________________________________
> > भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> --
> Jochen Barth * Universitätsbibliothek Heidelberg, IT * Phone 06221 54-2580

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWOEQ8uBUq4faTdkSuEoptc_HuRiX5Og-i-3GCFO%2BNwiQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

