I have uploaded the files to https://github.com/Shreeshrii/tessdata_sanskrit
See NKP.sh and the NKP folder.
The first part of the script loops through the images and uses tesseract to
create a WordStr box file for each one.
It then uses sed to replace the recognized text with the ground truth.
The corrected box files are then used to create the lstmf files.
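
The text-replacement step described above can also be done in a few lines of Python instead of sed. This is only a rough sketch, not what NKP.sh does: it assumes the usual WordStr layout (a `WordStr left bottom right top page #text` line followed by a tab-prefixed end-of-line marker) and a ground-truth file with exactly one line per text line. The function name `apply_ground_truth` is made up for illustration.

```python
from typing import List


def apply_ground_truth(box_text: str, gt_lines: List[str]) -> str:
    """Replace the text after '#' on each WordStr line with ground truth.

    Assumption: the n-th WordStr line corresponds to the n-th ground-truth
    line. Coordinates and the tab-prefixed marker lines are left untouched.
    """
    truth = iter(gt_lines)
    out = []
    for line in box_text.splitlines():
        if line.startswith("WordStr "):
            # Everything before the first '#' is the bounding box and page.
            coords, _, _ = line.partition("#")
            try:
                line = coords + "#" + next(truth)
            except StopIteration:
                raise ValueError("fewer ground-truth lines than WordStr lines") from None
        out.append(line)
    if next(truth, None) is not None:
        raise ValueError("more ground-truth lines than WordStr lines")
    return "\n".join(out) + "\n"
```

Raising on a line-count mismatch is deliberate: with --psm 6 a margin note can change how many lines tesseract finds, and a silent misalignment would pair every later line with the wrong text.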
lstmtraining works on text lines.
Using --psm 6 causes text in the margins to be included as part of the
line, e.g.:
दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं
सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि
You can use any other mechanism to create or correct the box files.
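
For anyone scripting this differently, the two tesseract passes mentioned above (WordStr box creation, then lstmf creation) could be assembled roughly as below. A sketch only: `wordstrbox` and `lstm.train` are the standard tesseract config names, but the helper names, the `.png` extension in the usage note and the `-l san` default are assumptions made for illustration, not copied from NKP.sh.

```python
from pathlib import Path
from typing import List


def wordstr_box_cmd(image: str, lang: str = "san", psm: int = 6) -> List[str]:
    """Command that writes <base>.box in WordStr format for one image."""
    base = str(Path(image).with_suffix(""))  # output base = image path without extension
    return ["tesseract", image, base, "--psm", str(psm), "-l", lang, "wordstrbox"]


def lstmf_cmd(image: str, lang: str = "san", psm: int = 6) -> List[str]:
    """Command that writes <base>.lstmf from the image and its (corrected) box file."""
    base = str(Path(image).with_suffix(""))
    return ["tesseract", image, base, "--psm", str(psm), "-l", lang, "lstm.train"]
```

Each list can be passed to `subprocess.run(cmd, check=True)`, for example `wordstr_box_cmd("NKP/dp1.png")`, with the ground-truth correction done between the two passes.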
With 12 pages used for training for 700 iterations and one page used for
eval, the results are as follows:
tessdata_best/san
At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595
NKP/NKP-eval.gt.txt: 142 words 63 44% common 0 0% deleted 79 56% changed
build/NKP-eval-san.txt: 142 words 63 44% common 2 1% inserted 77 54% changed

tessdata_best/script/Devanagari
At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793
NKP/NKP-eval.gt.txt: 142 words 85 60% common 0 0% deleted 57 40% changed
build/NKP-eval-deva.txt: 141 words 85 60% common 0 0% inserted 56 40% changed

san_NKP
At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221
NKP/NKP-eval.gt.txt: 142 words 108 76% common 0 0% deleted 34 24% changed
build/NKP-eval.txt: 142 words 108 76% common 0 0% inserted 34 24% changed

san_NKP_int
At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509
NKP/NKP-eval.gt.txt: 142 words 106 75% common 0 0% deleted 36 25% changed
build/NKP_int-eval.txt: 142 words 106 75% common 0 0% inserted 36 25% changed
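
To compare the runs above without reading the numbers by eye, the Eval summary lines can be parsed and the relative change in character error rate computed. A small sketch; the helper names are hypothetical, and the strings are the Eval lines reported above.

```python
import re
from typing import Tuple

# Tolerates a line break between "Word error" and "rate=", as mail wrapping produces.
EVAL_RE = re.compile(r"Eval Char error rate=([\d.]+),\s*Word\s+error\s+rate=([\d.]+)")

EVAL_LINES = {
    "tessdata_best/san": "Eval Char error rate=22.924007, Word error rate=62.127595",
    "tessdata_best/script/Devanagari": "Eval Char error rate=13.307604, Word error rate=47.984793",
    "san_NKP": "Eval Char error rate=7.8737359, Word error rate=33.76221",
    "san_NKP_int": "Eval Char error rate=7.5598106, Word error rate=32.463509",
}


def parse_eval(text: str) -> Tuple[float, float]:
    """Return (char_error_rate, word_error_rate) in percent from an Eval line."""
    match = EVAL_RE.search(text)
    if match is None:
        raise ValueError("no Eval error rates found")
    return float(match.group(1)), float(match.group(2))


def relative_reduction(before: float, after: float) -> float:
    """Percentage by which 'after' is lower than 'before'."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    base_cer, _ = parse_eval(EVAL_LINES["tessdata_best/script/Devanagari"])
    for name, line in EVAL_LINES.items():
        cer, wer = parse_eval(line)
        change = relative_reduction(base_cer, cer)
        print(f"{name}: CER {cer:.2f}%, WER {wer:.2f}%, CER change vs Devanagari {change:.1f}%")
```

The script/Devanagari model is used as the baseline here because the training log quoted below continues from Devanagari.lstm.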
On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar <[email protected]>
wrote:
> zip file is too big. Let me do an alternative upload.
>
> Training runs ok for me -
>
> Warning: LSTMTrainer deserialized an LSTMRecognizer!
> Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
> Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
> Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
> Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
> Iteration 0: GROUND TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः पुत्रदाराः
> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं
> निरस्त
> Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः पुत्रदाराः
> आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः कलत्राणि भवान् त्वं
> निरस्त
> File NKP/dp10.lstmf line 0 :
> Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
> Iteration 1: GROUND TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः चक्रतुः
> यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
> Iteration 1: BEST OCR TEXT : T
> File NKP/dp11.lstmf line 0 :
> Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
> Iteration 2: GROUND TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
> उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
> Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम् २८
> उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
> Iteration 2: BEST OCR TEXT : कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम् २८'
> उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
> File NKP/dp12.lstmf line 0 :
> Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
> Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
> Iteration 3: GROUND TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सर्व्वशास्त्राणि
> सर्व्वाशास्त्रेषु विशा
> Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बोधने हे
> महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि सव्वशास्त्राणि
> स्व्वाशास्रेषु विशा
> Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा तत्सम्बीधने हे
> महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि सव्वशास्त्राणि सव्वशाख्रषु
> विशा
> File NKP/dp1.lstmf line 0 :
> Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
> Iteration 4: GROUND TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः
> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> Iteration 4: BEST OCR TEXT : महाभागः भागः भाग्यं सःअष्टमःमनुः
> महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
> File NKP/dp2.lstmf line 0 :
> Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
> Iteration 5: GROUND TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः
> अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
> Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः सहयुद्धेज्ञितः
> अतिप्रबलदणिडनः तस्थ तेःसह युद्धम् अतिप्रबलश्चासौदण्डशच अतिप्रबल
> File NKP/dp3.lstmf line 0 :
> Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
> Iteration 6: GROUND TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य राज्ञ:
> सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
> Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य राज्ञः
> सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
> File NKP/dp4.lstmf line 0 :
> Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
> Iteration 7: GROUND TRUTH : श्वापदाकीर्णः तं प्रशान्तश्वापदाकीर्णं
> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं
> प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
> Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं
> प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>