- Make a collection of Unicode Devanagari fonts (look at fonts.google.com).
- Make a large training text with Nepali text.
- Review and improve the wordlist in tesseract-ocr/langdata for Nepali.

I will share my modified training scripts, which use small sections of the large training text for each font.

Please note that so far I have not had success in improving the accuracy of the Hindi traineddata with my experiments.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2017-05-10 22:07 GMT+05:30 ShreeDevi Kumar <shreesh...@gmail.com>:

> see
>
> https://github.com/tesseract-ocr/langdata/tree/master/nep
>
> http://crubadan.org/languages/ne
>
> https://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
>
> ShreeDevi
>
> 2017-05-10 20:51 GMT+05:30 Nirajan Pant <niraja...@gmail.com>:
>
>> Thank you @shree. Can you help me generate langdata for training
>> Tesseract 4.0?
>>
>> On Wednesday, 10 May 2017 17:25:56 UTC+5:45, shree wrote:
>>>
>>> Please open an issue in the langdata repo with any specific errors that
>>> you see for Nepali. Take a look at the wordlist and training_text.
>>>
>>> ShreeDevi
>>>
>>> 2017-05-10 17:03 GMT+05:30 Nirajan Pant <nira...@gmail.com>:
>>>
>>>> Yeah! I got the same result as yours with hin.traineddata, which is
>>>> better than nep.traineddata. I think the langdata needs some revisions.
>>>> I have attached the ground truth text for the image.
>>>>
>>>> On Tuesday, 9 May 2017 22:38:25 UTC+5:45, shree wrote:
>>>>>
>>>>> Attached is the output I get with
>>>>>
>>>>> tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> 2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shree...@gmail.com>:
>>>>>
>>>>>> Thanks. Please provide the 'ground truth', i.e. the original accurate
>>>>>> text for the image.
>>>>>>
>>>>>> Have you tried to OCR the same image with these options?
>>>>>>
>>>>>> --oem 1 --psm 6 -l hin
>>>>>>
>>>>>> Sometimes the Hindi traineddata gives better results.
>>>>>>
>>>>>> On May 9, 2017 9:05 PM, "Nirajan Pant" <nira...@gmail.com> wrote:
>>>>>>
>>>>>>> Here is a sample image:
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/AAAAAAAADOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>
>>>>>>>
>>>>>>> And the result is:
>>>>>>>
>>>>>>> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
>>>>>>> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
>>>>>>> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्
>>>>>>> दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>>>>>>>
>>>>>>> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको
>>>>>>> सम्झनामा सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
>>>>>>> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान,
>>>>>>> कति संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
>>>>>>> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>>>>>>>
>>>>>>> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको
>>>>>>> सुन्छु " हे भगवान, कति एक्लो जीवन !"
>>>>>>>
>>>>>>> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै
>>>>>>> उ प्रश्न गर्छ -
>>>>>>> 'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
>>>>>>> किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको
>>>>>>> उदाहरणीय र अनुकरणीय प्रयोग भएको छ |
>>>>>>>
>>>>>>> "अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व
>>>>>>> या उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी
>>>>>>>
>>>>>>> अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले
>>>>>>> पढ्छन पनि।"
>>>>>>>
>>>>>>> मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?
>>>>>>>
>>>>>>> On Tuesday, 9 May 2017 12:54:31 UTC+5:45, shree wrote:
>>>>>>>>
>>>>>>>> Please provide a sample of 'not giving good results' and samples of
>>>>>>>> lines not being recognized correctly. Images and ground truth files
>>>>>>>> will be helpful.
>>>>>>>>
>>>>>>>> ShreeDevi
>>>>>>>>
>>>>>>>> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant <nira...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The trained data provided here
>>>>>>>>> <https://github.com/tesseract-ocr/tessdata> is not giving good
>>>>>>>>> results with Nepali text image documents. It is unable to recognize
>>>>>>>>> some lines correctly. Can anybody help me re-train Tesseract 4.0
>>>>>>>>> for the Nepali language?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7761b739-6f6e-4343-9039-501f7c60782c%40googlegroups.com
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
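To illustrate the idea above of using small sections of the large training text for each font, here is a rough sketch. It is not the modified training scripts mentioned in this thread. The function name `split_text_per_font`, the chunking scheme (contiguous sections that wrap around at the end of the text), and the font names in the usage note are all assumptions made for illustration only.

```python
from typing import Dict, List


def split_text_per_font(
    lines: List[str], fonts: List[str], lines_per_font: int
) -> Dict[str, List[str]]:
    """Give each font its own contiguous section of the training text.

    Hypothetical helper: sections are taken one after another and wrap
    around to the start of the text when the end is reached.
    """
    if lines_per_font <= 0:
        raise ValueError("lines_per_font must be positive")
    if not lines:
        raise ValueError("training text is empty")

    sections: Dict[str, List[str]] = {}
    for i, font in enumerate(fonts):
        # Start where the previous font's section ended.
        start = (i * lines_per_font) % len(lines)
        sections[font] = [
            lines[(start + k) % len(lines)] for k in range(lines_per_font)
        ]
    return sections
```

For example, `split_text_per_font(text_lines, ["Lohit Devanagari", "Noto Sans Devanagari"], 500)` would return a dictionary with one 500-line section per font, where `text_lines` is the large Nepali training text read into a list of lines. Each section could then be written to its own file and rendered with the matching font.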