Hey, thanks a lot. Your replies are really helpful.

Rohit

On Saturday, 18 June 2016 23:41:13 UTC+5:30, shree wrote:
>
> I do not know about the training process for cube; it is not documented.
>
> I have uploaded the box/tif pairs generated by text2image under Windows
> for Sanskrit. There are two versions, s21 and s95, which use different fonts
> and exposure levels. Please see
> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
>
> In s21, each font is used at 3 different exposure levels: -1, 0 and 1.
> tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./ 
> --exposures "-1 0 1" 
>
> In s95, each font is used only at 0 exposure level.
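
As a rough sketch of what the s21 exposure setting amounts to, the loop below renders one font at each of the three exposure levels with text2image. The function name, training text file, font name and output prefix are placeholders made up for illustration; only the -1/0/1 levels come from the message above.

```shell
#!/bin/sh
# Render one box/tif pair per exposure level for a single font.
# Usage: render_exposures TRAINING_TEXT FONT_NAME OUT_PREFIX
# (render_exposures is a hypothetical helper, not part of Tesseract.)
render_exposures() {
    text_file=$1
    font=$2
    prefix=$3
    for exp in -1 0 1; do
        # One text2image run per exposure level; output is named
        # PREFIX.exp-1, PREFIX.exp0, PREFIX.exp1.
        text2image --text="$text_file" \
                   --font="$font" \
                   --exposure="$exp" \
                   --outputbase="${prefix}.exp${exp}" || return 1
    done
}

# Example call (hypothetical file and font names):
# render_exposures san.training_text "Sanskrit 2003" san.Sanskrit_2003
```

For the s95 style, the same call with the loop reduced to exposure 0 would be the equivalent.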
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja wrote:
>
>> Hey, thanks a lot for your reply. Using hin data with a Sanskrit wordlist
>> seems to be a great idea.
>>
>> Still, I am interested in learning how to build things from scratch.
>> So I used some box files and images I created for the Sanskrit 2003 font and
>> used the Hindi config file from
>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>> and renamed it to san3ds.config. san3ds (3 for 2003, ds for devanagari
>> split) is the name I am giving to my new training data.
>>
>> I was able to train san3ds without any config file before.
>>
>> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg. I kept the
>> remaining files as they are.
>> I could build the san3.traineddata file, but I am getting an error during
>> recognition:
>>
>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model 
>> params from /usr/local/share/tessdata/san3ds.cube.lm
>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext 
>> object
>> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file 
>> tessedit.cpp, line 214
>> Segmentation fault (core dumped)
>>
>> Any help with why this is happening? Is it wrong to rename
>> word-dawg? I cannot find any separate option for generating cube-word-dawg.
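
The error above names a file, san3ds.cube.lm, that is looked up in the tessdata directory next to the traineddata file. As a rough sketch, the check below reports which cube side files are missing for a language. Only the cube.lm suffix comes from the error message; the other suffixes are an assumption based on the files shipped alongside hin.traineddata in the 3.04 tessdata, and check_cube_files is a made-up helper name.

```shell
#!/bin/sh
# Report cube side files that are missing from a tessdata directory.
# Usage: check_cube_files TESSDATA_DIR LANG
# Returns 0 if every file in the list exists, 1 otherwise.
check_cube_files() {
    tessdata_dir=$1
    lang=$2
    missing=0
    # cube.lm is taken from the error message; the remaining suffixes are
    # assumed from the hin.* files in the 3.04 tessdata.
    for suffix in cube.lm cube.nn cube.params cube.fold \
                  cube.bigrams cube.word-freq tesseract_cube.nn; do
        f="$tessdata_dir/$lang.$suffix"
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the path from the error message:
# check_cube_files /usr/local/share/tessdata san3ds
```

If files are reported missing, renaming the word-dawg alone would not be enough for cube mode, which would be consistent with the error shown.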
>>
>> Thanks in advance
>> Rohit
>>
>>
>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar wrote:
>>
>>> If you look at the readme files in the different subdirectories starting with
>>> OCR under
>>> https://github.com/Shreeshrii/imagessan/tree/master you will see
>>> results of character and word level accuracy. Depending on the font,
>>> character level accuracy is around 80% and word level accuracy around 60%.
>>>
>>> I have not used it for actual OCR of any text because the sanskritocr
>>> software by Dr. Oliver Hellwig gives better results.
>>>
>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>
>>> - sent from my phone. excuse the brevity.
>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" wrote:
>>>
>>>> Yes, hin traineddata with cube gives better results than san.
>>>>
>>>> I did some rudimentary testing with the new traineddata I made. It does
>>>> not use cube. Look at the config files; they have some options for
>>>> Devanagari processing.
>>>>
>>>> You could try to unpack the hin traineddata, remake the dawg
>>>> files using Sanskrit wordlists, and combine them again as an experiment.
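
A minimal sketch of that experiment, assuming the standard combine_tessdata and wordlist2dawg tools are on the PATH. The function name, the work directory and the wordlist file are placeholders, and only the main word-dawg is rebuilt here; any other dawg files would need the same treatment.

```shell
#!/bin/sh
# Unpack hin.traineddata, rebuild its word-dawg from a Sanskrit wordlist,
# and recombine the components into WORKDIR/hin.traineddata.
# Usage: remake_word_dawg HIN_TRAINEDDATA WORDLIST WORKDIR
# (remake_word_dawg is a hypothetical helper name.)
remake_word_dawg() {
    src=$1
    wordlist=$2
    workdir=$3
    mkdir -p "$workdir" || return 1
    # Unpack every component as WORKDIR/hin.<component>
    combine_tessdata -u "$src" "$workdir/hin." || return 1
    # Overwrite the unpacked word-dawg with one built from the new wordlist,
    # using the unicharset that was just unpacked
    wordlist2dawg "$wordlist" "$workdir/hin.word-dawg" \
        "$workdir/hin.unicharset" || return 1
    # Recombine all WORKDIR/hin.* components into WORKDIR/hin.traineddata
    combine_tessdata "$workdir/hin."
}

# Example call (hypothetical file names):
# remake_word_dawg hin.traineddata san.wordlist ./work
```

Working in a separate directory keeps the original hin.traineddata untouched, so the two can be compared on the same images.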
>>>>
>>>> If you have a Unicode version of the font used in the docs you want to
>>>> OCR, then train using that.
>>>>
>>>> - sent from my phone. excuse the brevity.
>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" wrote:
>>>>
>>>>> Thanks again for replying. I will surely check them out.
>>>>>
>>>>> My experience is that OCR on Sanskrit data with hin.traineddata gives
>>>>> better results than san.traineddata. I do not know whether this is due to
>>>>> cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>>>>
>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>> Please let me know whether you are using cube mode in your traineddata,
>>>>> and whether you are using Devanagari preprocessing.
>>>>>
>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar wrote:
>>>>>
>>>>>> Google has not provided images and box files for the san.traineddata
>>>>>> released for 3.04.
>>>>>>
>>>>>> I tried training using text2image with a combination of different 
>>>>>> fonts and training text. Results are at 
>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>
>>>>>> You can give these a try to see if recognition is any better.
>>>>>>
>>>>>> You can unpack any traineddata file using the -u option of
>>>>>> combine_tessdata to see the config files etc.
>>>>>>
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>
>>>>>> Use dawg2wordlist to look at the various dictionary word lists
>>>>>> used.
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
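
The two steps above can be sketched roughly as follows, assuming combine_tessdata and dawg2wordlist are installed. The function name and output directory are placeholders, and only the word-dawg and freq-dawg components are dumped; other dawgs in a given traineddata file are left out of this sketch.

```shell
#!/bin/sh
# Unpack a traineddata file and dump its main dictionaries as text.
# Usage: inspect_traineddata TRAINEDDATA LANG OUTDIR
# (inspect_traineddata is a hypothetical helper name.)
inspect_traineddata() {
    traineddata=$1
    lang=$2
    outdir=$3
    mkdir -p "$outdir" || return 1
    # Writes OUTDIR/LANG.config, OUTDIR/LANG.unicharset, OUTDIR/LANG.word-dawg, ...
    combine_tessdata -u "$traineddata" "$outdir/$lang." || return 1
    for component in word-dawg freq-dawg; do
        dawg="$outdir/$lang.$component"
        # Skip components that this traineddata file does not contain
        [ -e "$dawg" ] || continue
        # Argument order: UNICHARSET DAWG WORDLIST
        dawg2wordlist "$outdir/$lang.unicharset" "$dawg" "$dawg.txt" || return 1
    done
}

# Example call (hypothetical paths):
# inspect_traineddata /usr/local/share/tessdata/hin.traineddata hin ./hin_unpacked
```

The unpacked LANG.config file can then be read directly to see which options the traineddata sets.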
>>>>>>
>>>>>> - sent from my phone. excuse the brevity.
>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" wrote:
>>>>>>
>>>>>>> Hey, thanks for replying.
>>>>>>> Which options should I use with the text2image command? Also, is there a
>>>>>>> configuration file and a list of fonts?
>>>>>>>
>>>>>>> I tried text2image with its default options on the Tesseract GitHub
>>>>>>> training data with the Sanskrit 2003 font, but the recognition results are
>>>>>>> far from those of the san.traineddata file on GitHub.
>>>>>>> Any help in matching the san.traineddata results, starting from
>>>>>>> scratch, would be highly appreciated.
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Rohit
>>>>>>>
>>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote: 
>>>>>>>
>>>>>>>> Do we have Sanskrit training images and box files available online?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rohit
>>>>>>>>
>>>>>>> -- 
>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>> send an email to [email protected] <javascript:>.
>>>>>>> To post to this group, send email to [email protected] 
>>>>>>> <javascript:>.
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit 
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>>>>  
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>
>>
>
>

