I have uploaded the files to https://github.com/Shreeshrii/tessdata_sanskrit

See NKP.sh and the NKP folder.

The first part of the script loops through the images and uses tesseract
to create a WordStr box file for each one.
It then uses sed to replace the recognized text with the ground truth.
The corrected box file is then used to create the lstmf files.
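
The text replacement step can also be sketched in Python. This is not the
code in NKP.sh, which uses sed. It assumes each text line in the box file
is a line of the form "WordStr left bottom right top page #text", and that
the ground truth file has one line per text line, in the same order. The
function name replace_wordstr_text is made up for this example.

```python
def replace_wordstr_text(box_lines, gt_lines):
    """Swap the recognized text on each WordStr line for the ground truth.

    box_lines: lines of a WordStr box file (without trailing newlines).
    gt_lines:  ground-truth text lines, in the same order as the WordStr lines.
    Lines that do not start with "WordStr " are copied through unchanged.
    """
    gt_iter = iter(gt_lines)
    corrected = []
    for line in box_lines:
        if line.startswith("WordStr ") and "#" in line:
            # Keep the coordinates before the first '#', drop the OCR text.
            head, _, _ocr_text = line.partition("#")
            try:
                truth = next(gt_iter)
            except StopIteration:
                raise ValueError("more WordStr lines than ground-truth lines") from None
            corrected.append(head + "#" + truth.strip())
        else:
            corrected.append(line)
    if next(gt_iter, None) is not None:
        raise ValueError("more ground-truth lines than WordStr lines")
    return corrected


if __name__ == "__main__":
    # Made-up coordinates and text, for illustration only.
    box = [
        "WordStr 10 20 300 60 0 #wrong text",
        "\t 301 20 305 60 0",
    ]
    for out_line in replace_wordstr_text(box, ["right text"]):
        print(out_line)
```

The line-count check matters: if tesseract segments a page into a different
number of lines than the ground truth has, the texts would otherwise be
attached to the wrong boxes without any warning.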

lstmtraining works on text lines.
With --psm 6, text in the margins is included as part of the line, e.g.:
दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं
सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि

You can use any other mechanism to create or correct the box files.

With 12 pages used for training for 700 iterations and one page used for
eval, the results are as follows:

tessdata_best/san
      At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595

      NKP/NKP-eval.gt.txt: 142 words  63 44% common  0 0% deleted  79 56% changed
      build/NKP-eval-san.txt: 142 words  63 44% common  2 1% inserted  77 54% changed

tessdata_best/script/Devanagari
      At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793

      NKP/NKP-eval.gt.txt: 142 words  85 60% common  0 0% deleted  57 40% changed
      build/NKP-eval-deva.txt: 141 words  85 60% common  0 0% inserted  56 40% changed

san_NKP
      At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221

      NKP/NKP-eval.gt.txt: 142 words  108 76% common  0 0% deleted  34 24% changed
      build/NKP-eval.txt: 142 words  108 76% common  0 0% inserted  34 24% changed

san_NKP_int
      At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509

      NKP/NKP-eval.gt.txt: 142 words  106 75% common  0 0% deleted  36 25% changed
      build/NKP_int-eval.txt: 142 words  106 75% common  0 0% inserted  36 25% changed
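
To compare the four models side by side, the error rates can be pulled out
of the "Eval Char error rate=..., Word error rate=..." lines with a small
script. This is only a sketch: the figures are copied from the results
above, and the helper names parse_eval and relative_reduction are made up
for this example.

```python
import re

# Matches the evaluation summary line. \s+ tolerates a line break between
# "Word error" and "rate=", as happens when the line is wrapped by email.
EVAL_RE = re.compile(
    r"Eval Char error rate=([0-9.]+),\s+Word error\s+rate=([0-9.]+)"
)


def parse_eval(line):
    """Return (char_error_rate, word_error_rate) in percent."""
    match = EVAL_RE.search(line)
    if match is None:
        raise ValueError("no error rates found in: " + line)
    return float(match.group(1)), float(match.group(2))


def relative_reduction(baseline, value):
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Summary lines copied from the results reported in this message.
RESULTS = {
    "tessdata_best/san": "Eval Char error rate=22.924007, Word error rate=62.127595",
    "tessdata_best/script/Devanagari": "Eval Char error rate=13.307604, Word error rate=47.984793",
    "san_NKP": "Eval Char error rate=7.8737359, Word error rate=33.76221",
    "san_NKP_int": "Eval Char error rate=7.5598106, Word error rate=32.463509",
}

if __name__ == "__main__":
    # Training continued from script/Devanagari, so use it as the baseline.
    base_cer, _ = parse_eval(RESULTS["tessdata_best/script/Devanagari"])
    for name, line in RESULTS.items():
        cer, wer = parse_eval(line)
        drop = relative_reduction(base_cer, cer)
        print(f"{name:32s} CER {cer:6.2f}%  WER {wer:6.2f}%  CER vs Devanagari {drop:+.0%}")
```

With a single eval page of 142 words, small differences such as the one
between san_NKP and san_NKP_int should not be over-interpreted.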

On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar <[email protected]>
wrote:

> The zip file is too big, so let me upload it another way.
>
> Training runs OK for me:
>
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
> Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
> Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
> Iteration 0: GROUND  TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः
> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं
> निरस्त
> Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः
> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान्‌ त्वं
> निरस्त
> File NKP/dp10.lstmf line 0 :
> Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
> Iteration 1: GROUND  TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः चक्रतुः
> यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
> Iteration 1: BEST OCR TEXT :  T
> File NKP/dp11.lstmf line 0 :
> Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
> Iteration 2: GROUND  TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
> उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
> Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
> उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
> Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम्‌ २८'
> उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
> File NKP/dp12.lstmf line 0 :
> Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
> Iteration 3: GROUND  TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि
> सर्व्वाशास्त्रेषु विशा
> Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि
> स्व्वाशास्रेषु विशा
> Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे
> महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु
> विशा
> File NKP/dp1.lstmf line 0 :
> Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
> Iteration 4: GROUND  TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः
> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> Iteration 4: BEST OCR TEXT :  महाभागः भागः भाग्यं सःअष्टमःमनुः
> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> File NKP/dp2.lstmf line 0 :
> Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
> Iteration 5: GROUND  TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः
> अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
> Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः
> अतिप्रबलदणिडनः तस्थ तेःसह युद्धम्‌ अतिप्रबलश्चासौदण्डशच अतिप्रबल
> File NKP/dp3.lstmf line 0 :
> Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
> Iteration 6: GROUND  TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ:
> सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
> Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः
> सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
> File NKP/dp4.lstmf line 0 :
> Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
> Iteration 7: GROUND  TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं
> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं
> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं
> प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
