Make a collection of Unicode Devanagari fonts (look at fonts.google.com).

Make a large training text from Nepali text.

Review and improve the wordlist in tesseract-ocr/langdata for Nepali.

I will share my modified training scripts, which use a small section of the
large training text for each font.
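
To illustrate the idea of giving each font its own small section of the
large training text, here is a minimal Python sketch. It is not the modified
training scripts mentioned above, which are not shown in this message. The
function name, the section size, and the font names are assumptions chosen
for illustration only.

```python
from typing import Dict, List


def sections_per_font(lines: List[str], fonts: List[str],
                      lines_per_section: int) -> Dict[str, List[str]]:
    """Give each font its own consecutive slice of the training text.

    Hypothetical helper: if the fonts need more lines than the text
    contains, the slices wrap around to the start of the text.
    """
    if not lines or lines_per_section <= 0:
        raise ValueError("need a non-empty text and a positive section size")
    sections = {}
    for i, font in enumerate(fonts):
        start = (i * lines_per_section) % len(lines)
        # Take lines_per_section lines, wrapping around the end of the text.
        sections[font] = [lines[(start + k) % len(lines)]
                          for k in range(lines_per_section)]
    return sections


# Stand-in data: replace with the real training text and installed fonts.
text = [f"line {n}" for n in range(10)]
fonts = ["Noto Sans Devanagari", "Lohit Devanagari"]  # example font names
for font, section in sections_per_font(text, fonts, 4).items():
    print(font, len(section))
```

Each section could then be written to its own file and rendered with the
matching font by whatever training tooling is in use.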

Please note that so far my experiments have not succeeded in improving the
accuracy of the Hindi traineddata.
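
One way to put a number on "accuracy" when comparing traineddata files is
the character error rate of the OCR output against the ground truth text.
The sketch below is an assumption about how this could be measured, not the
method used in these experiments. It counts edits per Unicode code point, so
a single Devanagari syllable made of several code points can contribute more
than one error.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, per Unicode code point."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance divided by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

Running the same image through two traineddata files and comparing the two
rates against one ground truth file gives a like-for-like comparison.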

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

2017-05-10 22:07 GMT+05:30 ShreeDevi Kumar <shreesh...@gmail.com>:

> See:
>
> https://github.com/tesseract-ocr/langdata/tree/master/nep
>
> http://crubadan.org/languages/ne
>
> https://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
>
> ShreeDevi
>
> 2017-05-10 20:51 GMT+05:30 Nirajan Pant <niraja...@gmail.com>:
>
>> Thank you @shree. Can you help me with generating langdata for training
>> Tesseract 4.0?
>>
>> On Wednesday, 10 May 2017 17:25:56 UTC+5:45, shree wrote:
>>>
>>> Please open an issue in the langdata repo with any specific errors that
>>> you see for Nepali. Take a look at the wordlist and training_text.
>>>
>>> ShreeDevi
>>>
>>> 2017-05-10 17:03 GMT+05:30 Nirajan Pant <nira...@gmail.com>:
>>>
>>>> Yes, I got the same result as you with hin.traineddata, which is
>>>> better than the result with nep.traineddata. I think the langdata
>>>> needs some revision. I have attached the ground truth text for the image.
>>>>
>>>>
>>>>
>>>> On Tuesday, 9 May 2017 22:38:25 UTC+5:45, shree wrote:
>>>>>
>>>>> Attached is the output I get with
>>>>>
>>>>> tesseract nep_text_11.png nep_text_11 --oem 1 --psm 6 -l hin
>>>>>
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> 2017-05-09 21:11 GMT+05:30 ShreeDevi Kumar <shree...@gmail.com>:
>>>>>
>>>>>> Thanks. Please provide the 'ground truth', i.e. the original accurate
>>>>>> text for the image.
>>>>>>
>>>>>> Have you tried to OCR the same image with these options?
>>>>>>
>>>>>> --oem 1 --psm 6 -l hin
>>>>>>
>>>>>> Sometimes the Hindi traineddata gives better results.
>>>>>>
>>>>>> On May 9, 2017 9:05 PM, "Nirajan Pant" <nira...@gmail.com> wrote:
>>>>>>
>>>>>>> Here is a sample image:
>>>>>>>
>>>>>>>
>>>>>>> <https://lh3.googleusercontent.com/-4WrfbKY7lFk/WRHhTrz5F-I/AAAAAAAADOU/drzKr-Gl1E4MHjhCErwiH_BnYe1CPk8XQCLcB/s1600/nep_text_11.png>
>>>>>>>
>>>>>>> And the result is:
>>>>>>>
>>>>>>> त्यसपछि कसरी उ इजरायल प्रवेश गर्यो,, घर बनायो ? जागीर खायो? उफ~~ सबै
>>>>>>> बिर्सिइयो , आफ्नै जीवनकथा सिलसिला मिलाएर
>>>>>>> सम्हानसकोक्षमत्तापत्तिफ्लिअबउसमा|स्वारणशक्तिक्षीणहुदेंगएकोछ्
>>>>>>> दुकौंपन्निदुत्सकोस्पष्टहेक्कारह्दैन।
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> मन्दिर जाने बाटो र प्रार्थनाका एक दुइ ऋचा मन्त्रहरु बाहेक उसको
>>>>>>> सम्झनामा सबै कुरा अधुरा छन । दिनभरिको अधिकांश समय यिनै
>>>>>>> कुरामा सिमित गर्दै आएको यो बुढो मान्छे संग कति खुसिका क्षणहरु होलान,
>>>>>>> कति संघर्ष वा दुखका कहानीहरु होलान ? म बारम्बार
>>>>>>> सोध्ने यत्न गर्छु, उ मुस्काई मात्र रहन्छ ।
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> आज त त्यो मुस्कान पनि उसले बिर्से जस्तो छ, घरिघरि एक्लै बर्बराएको
>>>>>>> सुन्छु " हे भगवान, कति एक्लो जीवन !"
>>>>>>>
>>>>>>>
>>>>>>> एक कप तातो कफी पिई सकेपछि बल्ल् अने मुखबाट उठेको बाफ पर पर फ्याक्दै
>>>>>>> उ प्रश्न गर्छ -
>>>>>>> 'म्झिचकोबिषयमाकतिलेखिइन्यग्यौत?पुस्तककहिलेतय1रहुनात्तिम्रो?"
>>>>>>> किबुच एक प्रकारको सामुदायिक विकासको अवधारणा हो, इजरायलमा यसको
>>>>>>> उदाहरणीय र अनुकरणीय प्रयोग भएको छ |
>>>>>>>
>>>>>>>
>>>>>>> "अहँ आधा पनि सकेको छैन, यस्ता खाले पुस्तकको हाम्रो देशमा खासै महत्व
>>>>>>> या उपयोगिता होला जस्तो पनि लाग्दैन । त्यसैले यी
>>>>>>>
>>>>>>>
>>>>>>> अहिले त कथा पो लेखन थालेको छु, फेसबुकतिर टाँस्दिन्छु , एक दुइ जनाले
>>>>>>> पढ्छन पनि।"
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> मेरो नजीक आएर अन्छ उ, त्यसो भए आज के लेख्यौँ त, सुनाउन त ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tuesday, 9 May 2017 12:54:31 UTC+5:45, shree wrote:
>>>>>>>>
>>>>>>>> Please provide samples of the results that are 'not good' and of
>>>>>>>> lines that are not recognized correctly. Images and ground truth
>>>>>>>> files will be helpful.
>>>>>>>>
>>>>>>>> ShreeDevi
>>>>>>>>
>>>>>>>> On Tue, May 9, 2017 at 12:16 PM, Nirajan Pant <nira...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The trained data provided here
>>>>>>>>> <https://github.com/tesseract-ocr/tessdata> is not giving good
>>>>>>>>> results with Nepali text image documents. It is unable to
>>>>>>>>> recognize some lines correctly. Can anybody help me re-train
>>>>>>>>> Tesseract 4.0 for the Nepali language?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/7761b739-6f6e-4343-9039-501f7c60782c%40googlegroups.com.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>
>
