Glad you figured out the problem.

Please consider contributing the improved traineddata file to the
tessdata_contrib repo once you have finished training.

On Tue, 23 Apr 2019, 16:24 Jochen Barth, <[email protected]> wrote:

> Thanks a lot.
>
> The error seems to have been the missing space after the tab character in the
> line below »WordStr«!
>
> Kind regards,
> Jochen
>
>
> On 23.04.19 at 12:02, Shree Devi Kumar wrote:
>
> Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit
>
> See NKP.sh and folder NKP
>
> The first part of the script loops through the images and uses tesseract to
> create a WordStr box file for each of them.
> It then uses sed to replace the recognized text with the ground truth.
> The corrected box files are then used to create the lstmf files.
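
The replacement step can also be sketched in Python instead of sed. This is
not the code from NKP.sh, only an illustration under two assumptions: each
text line in the box file is a line of the form
"WordStr <left> <bottom> <right> <top> <page> #<text>", and the ground
truth has exactly one line per WordStr line, in the same order. The name
apply_ground_truth is hypothetical.

```python
def apply_ground_truth(box_text: str, truth_lines: list[str]) -> str:
    """Replace the text after '#' on each WordStr line with the ground truth."""
    lines = box_text.splitlines()
    n_boxes = sum(1 for line in lines if line.startswith("WordStr "))
    if n_boxes != len(truth_lines):
        raise ValueError(
            f"{n_boxes} WordStr lines but {len(truth_lines)} ground-truth lines"
        )
    truth = iter(truth_lines)
    out = []
    for line in lines:
        if line.startswith("WordStr "):
            # Keep the coordinates, swap only the recognized text.
            coords, _, _ = line.partition("#")
            line = coords + "#" + next(truth)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    box = "WordStr 10 20 300 60 0 #recognised txt\n\t 301 20 305 60 0\n"
    print(apply_ground_truth(box, ["recognized text"]), end="")
```

The count check is there because a segmentation that yields more or fewer
lines than the ground truth would otherwise pair text with the wrong boxes
without any warning.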
>
> lstmtraining works on text lines.
> Using --psm 6 causes text in the margins to be included as part of the
> line, e.g.:
> दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं
> सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि
>
> You can use any other alternative mechanism to create/correct box files.
>
> With 12 pages used for training for 700 iterations and one page used for
> eval, the results are as follows:
>
> tessdata_best/san
>   At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595
>   NKP/NKP-eval.gt.txt: 142 words  63 44% common  0 0% deleted  79 56% changed
>   build/NKP-eval-san.txt: 142 words  63 44% common  2 1% inserted  77 54% changed
>
> tessdata_best/script/Devanagari
>   At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793
>   NKP/NKP-eval.gt.txt: 142 words  85 60% common  0 0% deleted  57 40% changed
>   build/NKP-eval-deva.txt: 141 words  85 60% common  0 0% inserted  56 40% changed
>
> san_NKP
>   At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221
>   NKP/NKP-eval.gt.txt: 142 words  108 76% common  0 0% deleted  34 24% changed
>   build/NKP-eval.txt: 142 words  108 76% common  0 0% inserted  34 24% changed
>
> san_NKP_int
>   At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509
>   NKP/NKP-eval.gt.txt: 142 words  106 75% common  0 0% deleted  36 25% changed
>   build/NKP_int-eval.txt: 142 words  106 75% common  0 0% inserted  36 25% changed
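
To spot-check numbers like these on your own ground truth and OCR output,
an error rate can be approximated as edit distance divided by the length of
the reference. The sketch below is the textbook definition, not a copy of
what Tesseract's evaluation does internally, so the values may not match
the figures above exactly. It also counts Unicode code points, which for
Devanagari is not the same as counting visible characters. All function
names are made up for illustration.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp) -> float:
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def char_error_rate(truth: str, ocr: str) -> float:
    return error_rate(truth, ocr)


def word_error_rate(truth: str, ocr: str) -> float:
    return error_rate(truth.split(), ocr.split())


if __name__ == "__main__":
    truth = "one two three four"
    ocr = "one tvo three four"
    print(char_error_rate(truth, ocr), word_error_rate(truth, ocr))
```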
>
> On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar <[email protected]>
> wrote:
>
>> The zip file is too big. Let me upload it another way.
>>
>> Training runs OK for me:
>>
>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>> Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
>> Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
>> Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
>> Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
>> Iteration 0: GROUND  TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः
>> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं
>> निरस्त
>> Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः
>> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान्‌ त्वं
>> निरस्त
>> File NKP/dp10.lstmf line 0 :
>> Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
>> Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
>> Iteration 1: GROUND  TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः
>> चक्रतुः यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
>> Iteration 1: BEST OCR TEXT :  T
>> File NKP/dp11.lstmf line 0 :
>> Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
>> Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
>> Iteration 2: GROUND  TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
>> उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
>> Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
>> उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
>> Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम्‌ २८'
>> उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
>> File NKP/dp12.lstmf line 0 :
>> Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
>> Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
>> Iteration 3: GROUND  TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
>> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि
>> सर्व्वाशास्त्रेषु विशा
>> Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
>> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि
>> स्व्वाशास्रेषु विशा
>> Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे
>> महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु
>> विशा
>> File NKP/dp1.lstmf line 0 :
>> Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
>> Iteration 4: GROUND  TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः
>> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
>> Iteration 4: BEST OCR TEXT :  महाभागः भागः भाग्यं सःअष्टमःमनुः
>> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
>> File NKP/dp2.lstmf line 0 :
>> Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
>> Iteration 5: GROUND  TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः
>> अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
>> Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः
>> अतिप्रबलदणिडनः तस्थ तेःसह युद्धम्‌ अतिप्रबलश्चासौदण्डशच अतिप्रबल
>> File NKP/dp3.lstmf line 0 :
>> Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
>> Iteration 6: GROUND  TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ:
>> सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
>> Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः
>> सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
>> File NKP/dp4.lstmf line 0 :
>> Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
>> Iteration 7: GROUND  TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं
>> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
>> Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं
>> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
>> Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं
>> प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXi%2BWOepAfkv7eF73c7U8R3_EpBG8EE7Wgqf0a-iKuBmA%40mail.gmail.com
> .
> For more options, visit https://groups.google.com/d/optout.
>
>
> --
> Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580
