Re: [tesseract-ocr] Re: Do we have Sanskrit training images and box files online?

rohit saluja Thu, 30 Jun 2016 03:55:43 -0700

Hi,

I just OCRed 30 pages of a Sanskrit book with Sanskrit OCR and got a word 
error rate (WER) of 54% and a character error rate (CER) of 24%.
With Indsenz I get a WER of 20% and a CER of 8%. Have you tried 
comparing Indsenz with Sanskrit OCR? Which one is better, and in which cases?
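
For anyone who wants to reproduce this kind of comparison, here is a minimal
sketch of one common way to compute WER and CER: the Levenshtein edit distance
between the ground truth and the OCR output, divided by the length of the
ground truth. It is not necessarily the exact procedure behind the numbers
above. The function names are made up for illustration, and counting
"characters" as Unicode code points (not Devanagari grapheme clusters) is an
assumption that will shift the CER somewhat.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the number of reference units."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference, hypothesis):
    """Word error rate: units are whitespace-separated tokens."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: units are Unicode code points (assumption)."""
    return error_rate(list(reference), list(hypothesis))
```

Call wer(ground_truth, ocr_output) and cer(ground_truth, ocr_output) on the
text of each page, or on the concatenated text of all pages. Both values can
exceed 1.0 when the OCR output contains many insertions.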


On Tuesday, 21 June 2016 12:36:23 UTC+5:30, rohit saluja wrote:
>
> Hey, thanks a lot. Your replies are really helpful.
>
> Rohit
>
> On Saturday, 18 June 2016 23:41:13 UTC+5:30, shree wrote:
>>
>> I do not know the training process for cube; it is not documented. 
>>
>> I have uploaded the box/tif pairs generated by text2image under Windows 
>> for Sanskrit. There are two versions, s21 and s95, which use different 
>> fonts and exposure levels. Please see
>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
>>
>> In s21, each font is used at 3 different exposure levels: -1, 0 and 1.
>>
>>   tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./ \
>>     --exposures "-1 0 1"
>>
>> In s95, each font is used only at exposure level 0.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja <[email protected]> 
>> wrote:
>>
>>> Hey, thanks a lot for your reply. Using the hin data with a Sanskrit 
>>> wordlist seems like a great idea.
>>>
>>> Still, I am interested in learning how to build things from scratch.
>>> So I used some box files and images I created for the Sanskrit 2003 font, 
>>> took the Hindi config file from 
>>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>>> and renamed it san3ds.config. san3ds (3 for 2003, ds for Devanagari 
>>> split) is the name I am giving to my new training data.
>>>
>>> I was able to train san3ds without any config file before.
>>>
>>> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg and kept the 
>>> remaining files as they are.
>>> I could build the san3ds.traineddata file, but I get an error during 
>>> recognition:
>>>
>>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model 
>>> params from /usr/local/share/tessdata/san3ds.cube.lm
>>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext 
>>> object
>>> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file 
>>> tessedit.cpp, line 214
>>> Segmentation fault (core dumped)
>>>
>>> Can anyone help me understand why this is happening? Is it wrong to 
>>> rename word-dawg? I cannot find any separate option for generating 
>>> cube-word-dawg.
>>>
>>> Thanks in advance
>>> Rohit
>>>
>>>
>>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <[email protected]> 
>>> wrote:
>>>
>>>> If you look at the readme files in the different subdirectories whose 
>>>> names start with OCR under 
>>>> https://github.com/Shreeshrii/imagessan/tree/master you will see the 
>>>> character-level and word-level accuracy results. Depending on the font, 
>>>> character-level accuracy is around 80% and word-level accuracy around 60%.
>>>>
>>>> I have not used it for actual OCR of any text because the sanskritocr 
>>>> software by Dr. Oliver Hellwig gives better results. 
>>>>
>>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>>
>>>> - sent from my phone. excuse the brevity.
>>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:
>>>>
>>>>> Yes, the hin traineddata with cube gives better results than san.
>>>>>
>>>>> I did some rudimentary testing with the new traineddata I made. It 
>>>>> does not use cube. Look at the config files; they have some options for 
>>>>> Devanagari processing.
>>>>>
>>>>> As an experiment, you could try to unpack the hin traineddata, remake 
>>>>> the DAWG files using Sanskrit wordlists, and combine them again.
>>>>>
>>>>> If you have a Unicode version of the font used in the documents you 
>>>>> want to OCR, then train using that.
>>>>>
>>>>> - sent from my phone. excuse the brevity.
>>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>>>>>
>>>>>> Thanks again for replying. I will surely check them out.
>>>>>>
>>>>>> My experience is that OCR on Sanskrit data with hin.traineddata gives 
>>>>>> better results than san.traineddata. I do not know whether this is due 
>>>>>> to cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>>>>>
>>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>>> Please let me know whether you are using cube mode in your 
>>>>>> traineddata, and whether you are using Devanagari preprocessing.
>>>>>>
>>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]> 
>>>>>> wrote:
>>>>>>
>>>>>>> Google has not provided the images and box files for the 
>>>>>>> san.traineddata released for 3.04.
>>>>>>>
>>>>>>> I tried training using text2image with a combination of different 
>>>>>>> fonts and training text. Results are at 
>>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>>
>>>>>>> You can give these a try to see if recognition is any better.
>>>>>>>
>>>>>>> You can unpack any traineddata file using the -u option of 
>>>>>>> combine_tessdata to see the config files etc. 
>>>>>>>
>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>>
>>>>>>> Use dawg2wordlist to look at the various dictionary word lists 
>>>>>>> used.
>>>>>>>
>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>>>
>>>>>>> - sent from my phone. excuse the brevity.
>>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey, thanks for replying.
>>>>>>>> Which options should I use with the text2image command? Also, is 
>>>>>>>> there a configuration file and a list of fonts?
>>>>>>>>
>>>>>>>> I tried the default options of text2image with the Tesseract GitHub 
>>>>>>>> training data and the Sanskrit 2003 font, but the recognition results 
>>>>>>>> are far from those of the san.traineddata file on GitHub.
>>>>>>>> Any help in matching the san.traineddata results, starting from 
>>>>>>>> scratch, would be highly appreciated.
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>> Rohit
>>>>>>>>
>>>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote: 
>>>>>>>>
>>>>>>>>> Do we have Sanskrit training images and box files available online?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Rohit
>>>>>>>>>
>>>>>>>> -- 
>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>> send an email to [email protected].
>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit 
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>>>>>  
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>
>>>
>>
>>

