Hey, thanks a lot. Your replies are really helpful.

Rohit

On Saturday, 18 June 2016 23:41:13 UTC+5:30, shree wrote:
>
> I do not know about the training process for cube; it is not documented.
>
> I have uploaded the box/tif pairs generated by text2image under Windows
> for Sanskrit. There are two versions, s21 and s95, which use different
> fonts and exposure levels. Please see
> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
>
> In s21, each font is used at 3 different exposure levels: -1, 0 and 1.
> tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./ --exposures "-1 0 1"
>
> In s95, each font is used only at exposure level 0.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja <[email protected]> wrote:
>
>> Hey, thanks a lot for your reply. Using the hin data with a Sanskrit
>> wordlist seems like a great idea.
>>
>> Still, I am interested in learning how to build things from scratch.
>> So I used some box files and images I created for the Sanskrit 2003 font
>> and used the Hindi config file from
>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>> which I renamed to san3ds.config. san3ds (3 for 2003, ds for devanagari
>> split) is the name I am giving to my new training data.
>>
>> I was able to train san3ds without any config file before.
>>
>> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg. I kept the
>> remaining files as they are.
>> I could form the san3.traineddata file, but I am getting an error during
>> recognition:
>>
>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model params from /usr/local/share/tessdata/san3ds.cube.lm
>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
>> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 214
>> Segmentation fault (core dumped)
>>
>> Any idea why this is happening? Is it wrong to rename word-dawg? I
>> cannot find any separate option for generating cube-word-dawg.
>>
>> Thanks in advance
>> Rohit
>>
>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <[email protected]> wrote:
>>
>>> If you look at the readme files in the different subdirectories
>>> starting with OCR under
>>> https://github.com/Shreeshrii/imagessan/tree/master you will see
>>> results for character and word level accuracy. Depending on the font,
>>> character level accuracy is around 80% and word level accuracy around 60%.
>>>
>>> I have not used it for actual OCR of any text because the sanskritocr
>>> software by Dr. Oliver Hellwig gives better results.
>>>
>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>
>>> - sent from my phone. excuse the brevity.
>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:
>>>
>>>> Yes, hin traineddata with cube gives better results than san.
>>>>
>>>> I did some rudimentary testing with the new traineddata I made. It does
>>>> not use cube. Look at the config files; they have some options for
>>>> devanagari processing.
>>>>
>>>> As an experiment, you could try to unpack the hin traineddata, remake
>>>> the dawg files using Sanskrit wordlists, and combine them again.
>>>>
>>>> If you have a Unicode version of the font used in the docs you want to
>>>> OCR, then train using that.
>>>>
>>>> - sent from my phone. excuse the brevity.
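
The experiment suggested above (unpack hin, rebuild the word dawg from a Sanskrit wordlist, recombine) could look roughly like the sketch below. It is not a tested recipe. `san.wordlist`, `hinsan.traineddata`, the `work` directory and the `rebuild_word_dawg` helper are hypothetical names chosen for illustration, and only the main word dawg is replaced. With `DRY_RUN=1` (the default here) the commands are only printed, so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: replace the word dawg in a copy of an existing traineddata file.
# Hypothetical names: san.wordlist, hinsan.traineddata, work/ directory.
# DRY_RUN=1 (default) prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rebuild_word_dawg SRC.traineddata WORDLIST OUT.traineddata WORKDIR
rebuild_word_dawg() {
    src="$1"; wordlist="$2"; out="$3"; work="$4"
    lang=$(basename "$src" .traineddata)

    run mkdir -p "$work"
    # Unpack all components as WORKDIR/<lang>.<component>
    run combine_tessdata -u "$src" "$work/$lang."
    # Build a new word dawg from the wordlist, using the unpacked unicharset
    run wordlist2dawg "$wordlist" "$work/$lang.word-dawg" "$work/$lang.unicharset"
    # Work on a copy so the original traineddata stays untouched
    run cp "$src" "$out"
    # Overwrite only the word-dawg component inside the copy
    run combine_tessdata -o "$out" "$work/$lang.word-dawg"
}
```

For example, `rebuild_word_dawg hin.traineddata san.wordlist hinsan.traineddata work` prints the five commands; setting `DRY_RUN=0` would execute them. Whether the result actually recognises Sanskrit better has to be checked on real pages.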
>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>>>>
>>>>> Thanks again for replying. I will surely check them out.
>>>>>
>>>>> My experience is that OCR on Sanskrit data with hin.traineddata gives
>>>>> better results than san.traineddata. I do not know whether that is due
>>>>> to cube mode or to the devanagari preprocessing (segmentation, I guess).
>>>>>
>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>> Please let me know whether you are using cube mode in your traineddata
>>>>> or not, and whether you are using devanagari preprocessing.
>>>>>
>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]> wrote:
>>>>>
>>>>>> Google has not provided images and box files for the san.traineddata
>>>>>> released for 3.04.
>>>>>>
>>>>>> I tried training using text2image with a combination of different
>>>>>> fonts and training text. Results are at
>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>
>>>>>> You can give these a try to see if recognition is any better.
>>>>>>
>>>>>> You can unpack any traineddata file using the -u option of
>>>>>> combine_tessdata to see the config files etc.
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>
>>>>>> Use dawg2wordlist to look at the various dictionary word lists used.
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>>
>>>>>> - sent from my phone. excuse the brevity.
>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]> wrote:
>>>>>>
>>>>>>> Hey, thanks for replying.
>>>>>>> Which options should I use with the text2image command? Also, is
>>>>>>> there any configuration file and fonts list?
>>>>>>>
>>>>>>> I tried the default options of text2image with the Tesseract GitHub
>>>>>>> training data and Sanskrit 2003, but the recognition results are far
>>>>>>> from those of the san.traineddata file on GitHub.
>>>>>>> Any help in matching the san.traineddata results, starting from
>>>>>>> scratch, would be highly appreciated.
>>>>>>>
>>>>>>> Thanks in advance
>>>>>>> Rohit
>>>>>>>
>>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>>>>
>>>>>>>> Do we have Sanskrit training images and box files available online?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Rohit
>>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to [email protected].
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com.
>>>>>>> For more options, visit https://groups.google.com/d/optout.
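
On the Cube error quoted in the thread: the message shows Cube looking for a standalone file, `san3ds.cube.lm`, in the tessdata directory, so renaming the word dawg alone would not be enough. The sketch below assumes that the Cube companion files for the reference language sit next to its traineddata as `<lang>.cube.*` (plus `<lang>.tesseract_cube.nn`, where present) and reports which of them have no counterpart for the new language. `missing_cube_files` is a hypothetical helper. It only diagnoses what is missing and does not claim that copying the hin files would give a working Cube model.

```shell
#!/bin/sh
# Sketch: list Cube companion files that exist for a reference language
# but not for a new one. Assumes the files are named <lang>.cube.* and
# <lang>.tesseract_cube.nn inside the tessdata directory.
# missing_cube_files TESSDATA_DIR REF_LANG NEW_LANG
missing_cube_files() {
    dir="$1"; ref="$2"; new="$3"
    for f in "$dir/$ref".cube.* "$dir/$ref".tesseract_cube.nn; do
        # Skip unmatched glob patterns and absent files
        [ -e "$f" ] || continue
        # Strip "<dir>/<ref>" to get the suffix, e.g. ".cube.lm"
        suffix="${f#"$dir/$ref"}"
        # Report the file name the new language would need
        [ -e "$dir/$new$suffix" ] || echo "$new$suffix"
    done
}
```

For example, `missing_cube_files /usr/local/share/tessdata hin san3ds` prints one file name per missing companion file and nothing if all of them are present.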

