Hey, thanks a lot for your reply. Using the hin data with a Sanskrit wordlist seems like a great idea.
Still, I am interested in learning how to build things from scratch. So I used some box files and images I created for the Sanskrit 2003 font, took the Hindi config file from https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config and renamed it to san3ds.config. san3ds (3 for 2003, ds for devanagari split) is the name I am giving my new training data.

I was able to train san3ds without any config file before. This time I simply renamed san3ds.word-dawg to san3ds.cube-word-dawg and left the remaining files as they were. I could build the san3ds.traineddata file, but I get an error during recognition:

    Cube ERROR (CubeRecoContext::Load): unable to read cube language model params from /usr/local/share/tessdata/san3ds.cube.lm
    Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
    init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 214
    Segmentation fault (core dumped)

Any idea why this is happening? Is it wrong to rename word-dawg? I cannot find any separate option for generating cube-word-dawg.

Thanks in advance
Rohit

On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <[email protected]> wrote:

> If you look at the readme files in the different subdirectories starting with OCR under
> https://github.com/Shreeshrii/imagessan/tree/master you will see results of character and word level accuracy. Depending on the font, character level accuracy is around 80% and word level accuracy around 60%.
>
> I have not used it for actual OCR of any text because the sanskritocr software by Dr. Oliver Hellwig gives better results.
>
> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>
> - sent from my phone. excuse the brevity.
> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:
>
>> Yes, hin traineddata with cube gives better results than san.
>>
>> I did some rudimentary testing with the new traineddata I made. It does not use cube.
>> Look at the config files; they have some options for Devanagari processing.
>>
>> You could try to unpack the hin traineddata and then remake the dawg files using Sanskrit wordlists and combine them as an experiment.
>>
>> If you have a Unicode version of the font used for the docs you want to OCR, then train using that.
>>
>> - sent from my phone. excuse the brevity.
>> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>>
>>> Thanks again for replying. I will surely check them out.
>>>
>>> My experience is that OCR on Sanskrit data with hin.traineddata gives better results than san.traineddata. I do not know whether this is due to cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>>
>>> I wonder why such preprocessing is not applied in san.traineddata. Please let me know whether you are using cube mode in your traineddata, and whether you are using Devanagari preprocessing.
>>>
>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]> wrote:
>>>
>>>> Google has not provided images and box files for the san.traineddata released for 3.04.
>>>>
>>>> I tried training using text2image with a combination of different fonts and training text. Results are at
>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>
>>>> You can give these a try to see if recognition is any better.
>>>>
>>>> You can unpack any traineddata file using the -u option of combine_tessdata to see the config files etc.
>>>>
>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>
>>>> Use dawg2wordlist to look at the various dictionary word lists used.
>>>>
>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>
>>>> - sent from my phone. excuse the brevity.
>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]> wrote:
>>>>
>>>>> Hey, thanks for replying.
>>>>> Which options should I use with the text2image command?
>>>>> Also, is there any configuration file and fonts list?
>>>>>
>>>>> I tried the default options of text2image with the Tesseract GitHub training data and Sanskrit 2003, but the recognition results are far from those of the san.traineddata file on GitHub.
>>>>> Any help in matching the san.traineddata results, starting from scratch, would be highly appreciated.
>>>>>
>>>>> Thanks in advance
>>>>> Rohit
>>>>>
>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>>
>>>>>> Do we have Sanskrit training images and box files available online?
>>>>>>
>>>>>> Thanks
>>>>>> Rohit
>>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>> For more options, visit https://groups.google.com/d/optout.
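
A note on the Cube error quoted at the top of this thread: the message says Tesseract looked for san3ds.cube.lm as a separate file in the tessdata directory, so renaming a dawg inside the traineddata would not by itself supply it. The Python sketch below is a minimal check, under stated assumptions, that lists which cube sidecar files are absent for a given language. The function name missing_cube_files and the suffix list are hypothetical: only .cube.lm is confirmed by the error above, and the other suffixes are assumed from the files distributed alongside hin.traineddata for 3.04.

```python
from pathlib import Path

# Assumption: sidecar suffixes modeled on the files shipped next to
# hin.traineddata for Tesseract 3.04. Only ".cube.lm" is confirmed by
# the error message in this thread; adjust the list as needed.
ASSUMED_CUBE_SUFFIXES = (
    ".cube.lm",
    ".cube.params",
    ".cube.nn",
    ".cube.fold",
    ".cube.bigrams",
    ".cube.word-freq",
    ".tesseract_cube.nn",
)


def missing_cube_files(tessdata_dir, lang, suffixes=ASSUMED_CUBE_SUFFIXES):
    """Return the expected cube sidecar paths that do not exist on disk."""
    base = Path(tessdata_dir)
    missing = []
    for suffix in suffixes:
        candidate = base / f"{lang}{suffix}"
        if not candidate.is_file():
            missing.append(str(candidate))
    return missing


if __name__ == "__main__":
    # Path and language name taken from the error message above.
    for path in missing_cube_files("/usr/local/share/tessdata", "san3ds"):
        print("missing:", path)
```

Run with the tessdata path from the error, it prints one line per file it could not find. An empty result only means the assumed files are present, not that they are valid.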

