Thanks a lot.
The error seems to be the missing space after the tab character in the
line below "WordStr"!
Kind regards,
Jochen
On 23.04.19 at 12:02, Shree Devi Kumar wrote:
I uploaded the files to https://github.com/Shreeshrii/tessdata_sanskrit
See NKP.sh and the NKP folder.
The first part of the script loops through the images and uses tesseract
to create WordStr box files for them.
It then uses sed to replace the recognized text with the ground truth.
The corrected box files are then used to create the lstmf files.
lstmtraining is done on text lines.
Using --psm 6 causes text in the margins to be included as part of the
line, e.g.:
दु°टी° श्रीगणेशायनमः ।। अलिकुलमण्डितगण्डं प्रत्यूहतिमिरमार्त्तण्डं
सिन्दूरारुणशुण्डं देवंवेतण्डमुण्डमवलम्बे १ वि
You can use any other alternative mechanism to create/correct box files.
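As one such alternative, here is a minimal Python sketch of the correction step. It is not what NKP.sh does (that script uses sed). It assumes each text line in the box file is a "WordStr ... #text" line followed by an end-of-line marker line that starts with a tab and a space, and that the ground truth has exactly one line of text per WordStr line. The function names are hypothetical. The second helper flags marker lines where the space after the tab is missing.

```python
def replace_wordstr_text(box_text, truth_lines):
    """Replace the text after '#' on each WordStr line with ground truth.

    Assumes every WordStr line contains a '#' before the recognized text.
    All other lines (including the tab marker lines) are kept unchanged.
    """
    truth = list(truth_lines)
    used = 0
    out = []
    for line in box_text.split("\n"):
        if line.startswith("WordStr "):
            if used >= len(truth):
                raise ValueError("more WordStr lines than ground-truth lines")
            head = line.split("#", 1)[0]  # "WordStr l b r t page "
            line = head + "#" + truth[used]
            used += 1
        out.append(line)
    if used != len(truth):
        raise ValueError("more ground-truth lines than WordStr lines")
    return "\n".join(out)


def find_bad_eol_markers(box_text):
    """Return 1-based numbers of lines that start with a tab but lack the
    space that separates the tab symbol from its coordinates."""
    bad = []
    for number, line in enumerate(box_text.split("\n"), start=1):
        if line.startswith("\t") and not line.startswith("\t "):
            bad.append(number)
    return bad


if __name__ == "__main__":
    # Made-up coordinates and text, for illustration only.
    sample = (
        "WordStr 10 20 300 60 0 #wrong text\n"
        "\t 300 20 304 60 0\n"
    )
    print(replace_wordstr_text(sample, ["right text"]))
    print(find_bad_eol_markers(sample))
```

Running the check before creating the lstmf files should catch the kind of missing-space problem reported at the top of this thread.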
With 12 pages used for training for 700 iterations and one page used
for eval, the results are as follows:
tessdata_best/san
At iteration 0, stage 0, Eval Char error rate=22.924007, Word error rate=62.127595
NKP/NKP-eval.gt.txt: 142 words 63 44% common 0 0% deleted 79 56% changed
build/NKP-eval-san.txt: 142 words 63 44% common 2 1% inserted 77 54% changed
tessdata_best/script/Devanagari
At iteration 0, stage 0, Eval Char error rate=13.307604, Word error rate=47.984793
NKP/NKP-eval.gt.txt: 142 words 85 60% common 0 0% deleted 57 40% changed
build/NKP-eval-deva.txt: 141 words 85 60% common 0 0% inserted 56 40% changed
san_NKP
At iteration 0, stage 0, Eval Char error rate=7.8737359, Word error rate=33.76221
NKP/NKP-eval.gt.txt: 142 words 108 76% common 0 0% deleted 34 24% changed
build/NKP-eval.txt: 142 words 108 76% common 0 0% inserted 34 24% changed
san_NKP_int
At iteration 0, stage 0, Eval Char error rate=7.5598106, Word error rate=32.463509
NKP/NKP-eval.gt.txt: 142 words 106 75% common 0 0% deleted 36 25% changed
build/NKP_int-eval.txt: 142 words 106 75% common 0 0% inserted 36 25% changed
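For anyone who wants to cross-check figures like these on their own pages, here is a rough Python sketch of character and word error rates based on plain edit distance. It is an assumption that this approximates the reported numbers. It is not the code tesseract uses for evaluation, so the values will not necessarily match. The function names are hypothetical.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences; insert, delete and
    substitute each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def error_rate(truth, ocr, unit="char"):
    """Edit distance as a percentage of the ground-truth length.

    unit="char" compares Unicode code points (not grapheme clusters,
    so a Devanagari conjunct counts as several units);
    unit="word" compares whitespace-separated words.
    """
    if unit == "char":
        ref, hyp = list(truth), list(ocr)
    else:
        ref, hyp = truth.split(), ocr.split()
    if not ref:
        raise ValueError("ground truth is empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy strings, for illustration only.
    print(error_rate("abcd", "abed"))
    print(error_rate("a b c d", "a x c d", unit="word"))
```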
On Tue, Apr 23, 2019 at 2:52 PM Shree Devi Kumar <[email protected]> wrote:
The zip file is too big. Let me do an alternative upload.
Training runs OK for me:
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm
Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp1.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp2.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp11.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp12.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp4.lstmf
Loaded 12/12 lines (1-12) of document NKP/dp6.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp3.lstmf
Loaded 13/13 lines (1-13) of document NKP/dp5.lstmf
Iteration 0: GROUND TRUTH : दु°टी° श्च दाराश्च ते पुत्रदाराः
पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः
कलत्राणि भवान् त्वं निरस्त
Iteration 0: BEST OCR TEXT : हु०ी० *च दाराशच ते पुत्रदाराः
पुत्रदाराः आदिर्येषां ते पुत्रदारादयः तैः पुत्रदारादिभिः दाराः
कलत्राणि भवान् त्वं निरस्त
File NKP/dp10.lstmf line 0 :
Mean rms=1.216%, delta=2.957%, train=10.656%(25%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp7.lstmf
Iteration 1: GROUND TRUTH : पविष्ठौ तौ वैश्यपार्थिवौ काश्चित्कथाः
चक्रतुः यथान्यायं यथाशास्त्रं यथायोग्यं तेन मुनिना संबिदं भाषां उपविष्टौ
Iteration 1: BEST OCR TEXT : T
File NKP/dp11.lstmf line 0 :
Mean rms=3.579%, delta=50.635%, train=55.328%(62.5%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp8.lstmf
Iteration 2: GROUND TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम्
२८ उपविष्टौकथाःकाचिच्चक्रतुर्व्वैश्यपार्थिवौ ॥ राजो
Iteration 2: ALIGNED TRUTH : कृत्वातुतौयथान्यायंयथार्हन्तेनसंविदम्
२८ उपविष्टौकथाःकाचिच्चकरतुर्व्वैश्यपार्थिवौ ॥ राजो
Iteration 2: BEST OCR TEXT :
कृत्वातुलोयथान्यायंयथार्हन्तेनसंविदम् २८'
उपविंष्लोकथाःकाचिन्चकतुव्वश्यपा्धिबोौ ॥ राजो
File NKP/dp12.lstmf line 0 :
Mean rms=3.012%, delta=37.121%, train=46.249%(61.667%), skip ratio=0%
Loaded 13/13 lines (1-13) of document NKP/dp9.lstmf
Iteration 3: GROUND TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा
तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि
सर्व्वशास्त्राणि सर्व्वाशास्त्रेषु विशा
Iteration 3: ALIGNED TRUTH : प्राज्ञा यस्यसःमहाप्राज्ञा
तत्सम्बोधने हे महाप्राज्ञ प्रज्ञाबुद्धिः सर्व्वाणिचतानि शास्त्राणि
सव्वशास्त्राणि स्व्वाशास्रेषु विशा
Iteration 3: BEST OCR TEXT : म्राज्ञा यस्यसःमहाप्राज्ञा
तत्सम्बीधने हे महाप्रा्ञ प्रज्ञाबुद्धिः सरव्वाणिचतानि शास्राणि
सव्वशास्त्राणि सव्वशाख्रषु विशा
File NKP/dp1.lstmf line 0 :
Mean rms=2.611%, delta=29.082%, train=38.07%(62.159%), skip ratio=0%
Iteration 4: GROUND TRUTH : महाभागः भागः भाग्यं सःअष्टमःमनुः
महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
Iteration 4: BEST OCR TEXT : महाभागः भागः भाग्यं सःअष्टमःमनुः
महामायानुभावेन महतीमाया यस्याःसा महामाया महामाहायाःअनुभावः महा
File NKP/dp2.lstmf line 0 :
Mean rms=2.324%, delta=24.051%, train=30.666%(49.727%), skip ratio=0%
Iteration 5: GROUND TRUTH : अपि तैः कोलाविध्वंसिभिः सहयुद्धेजितः
अतिप्रबलदण्डिनः तस्य तैःसह युद्धम् अतिप्रबलश्वासौदण्डश्च अतिप्रबल
Iteration 5: BEST OCR TEXT : अपि तैः कोलाविध्वंसिभिः
सहयुद्धेज्ञितः अतिप्रबलदणिडनः तस्थ तेःसह युद्धम् अतिप्रबलश्चासौदण्डशच
अतिप्रबल
File NKP/dp3.lstmf line 0 :
Mean rms=2.187%, delta=21.104%, train=27.189%(51.439%), skip ratio=0%
Iteration 6: GROUND TRUTH : क्षीणबलस्य ततः तस्यइति ततः तस्य
राज्ञ: सुरथस्य कोशः अर्थसंचयः अपहृतः आत्मसात्कृतः स्वाधीनः किंच बलं
Iteration 6: BEST OCR TEXT : क्षाणबलस्य ततः तस्यईति ततः तस्य
राज्ञः सुरथस्य कोडाः अर्थसंचयः अपद्वतः आत्मसात्रुतः स्वाधीनः किंच बत्त
File NKP/dp4.lstmf line 0 :
Mean rms=2.103%, delta=19.153%, train=26.624%(51.234%), skip ratio=0%
Iteration 7: GROUND TRUTH : श्वापदाकीर्णः तं
प्रशान्तश्वापदाकीर्णं प्रशान्ताः परहिंसारहिताः श्वापदाः
व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
Iteration 7: ALIGNED TRUTH : श्वापदाकीर्णः तं प्रशान्तशवापदाकीर्णं
प्रशान्ताः परहिंसारहिताः श्वापदाः व्याघ्रादयः आकीर्णं व्याप्तं मुनिशिष्योप
Iteration 7: BEST OCR TEXT : इवापदाकीणः तं प्रशान्तरवापदाकीर्णं
प्रशञान्ताः परहिंसारहिताः इंवापदाः व्याघ्रादयः आकीर्णं व्याप्त सुनिशिष्योप
--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXi%2BWOepAfkv7eF73c7U8R3_EpBG8EE7Wgqf0a-iKuBmA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
--
Jochen Barth * Universitätsbibliothek Heidelberg, IT * Telefon 06221 54-2580