I do not know about the training process for cube; it is not documented. I have uploaded the box/tif pairs generated by text2image under Windows for Sanskrit. There are two versions, s21 and s95, which use different fonts and exposure levels. Please see
https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
In s21, each font is rendered at three exposure levels: -1, 0 and 1.

    tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./ --exposures "-1 0 1"

In s95, each font is used only at exposure level 0.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja <rohitsaluj...@gmail.com> wrote:

> Hey, thanks a lot for your reply. Using the hin data with a Sanskrit
> wordlist seems to be a great idea.
>
> Still, I am interested in learning how to build things from scratch.
> So I used some box files and images I created for the Sanskrit 2003 font
> and used the Hindi config file from
> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
> which I renamed to san3ds.config. san3ds (3 for 2003, ds for Devanagari
> split) is the name I am giving to my new training data.
>
> I was able to train san3ds without any config file before.
>
> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg. I kept the
> remaining files as they are.
> I could build the san3.traineddata file, but I am getting an error during
> recognition:
>
> Cube ERROR (CubeRecoContext::Load): unable to read cube language model
> params from /usr/local/share/tessdata/san3ds.cube.lm
> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file
> tessedit.cpp, line 214
> Segmentation fault (core dumped)
>
> Any help with this? Why is this happening? Is it wrong to rename
> word-dawg? I cannot find any separate option for generating cube-word-dawg.
>
> Thanks in advance
> Rohit
>
>
> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <shreesh...@gmail.com>
> wrote:
>
>> If you look at the readme files in the different subdirectories starting
>> with OCR under
>> https://github.com/Shreeshrii/imagessan/tree/master you will see results
>> of character and word level accuracy.
>> Depending on the font, character level accuracy is around 80% and
>> word level accuracy around 60%.
>>
>> I have not used it for actual OCR of any text because the sanskritocr
>> software by Dr. Oliver Hellwig gives better results.
>>
>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>
>> - sent from my phone. excuse the brevity.
>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <shreesh...@gmail.com> wrote:
>>
>>> Yes, hin traineddata with cube gives better results than san.
>>>
>>> I did some rudimentary testing with the new traineddata I made. It does
>>> not use cube. Look at the config files; they have some options for
>>> Devanagari processing.
>>>
>>> You could try to unpack the hin traineddata, remake the dawg files
>>> using Sanskrit wordlists, and combine them again as an experiment.
>>>
>>> If you have a Unicode version of the font used in the documents you
>>> want to OCR, then train using that.
>>>
>>> - sent from my phone. excuse the brevity.
>>> On 13-Jun-2016 4:47 pm, "rohit saluja" <rohitsaluj...@gmail.com> wrote:
>>>
>>>> Thanks again for replying. I will surely check them out.
>>>>
>>>> My experience is that OCR on Sanskrit data with hin.traineddata gives
>>>> better results than san.traineddata. I do not know whether this is due
>>>> to cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>>>
>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>> Please let me know whether you are using cube mode in your traineddata,
>>>> and whether you are using Devanagari preprocessing.
>>>>
>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <shreesh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Google has not provided images and box files for the san.traineddata
>>>>> released for 3.04.
>>>>>
>>>>> I tried training using text2image with a combination of different
>>>>> fonts and training text.
>>>>> Results are at
>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>
>>>>> You can give these a try to see if recognition is any better.
>>>>>
>>>>> You can unpack any traineddata file using the -u option of
>>>>> combine_tessdata to see the config files etc.
>>>>>
>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>
>>>>> Use dawg2wordlist to look at the various dictionary word lists
>>>>> used.
>>>>>
>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>
>>>>> - sent from my phone. excuse the brevity.
>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <rohitsaluj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey, thanks for replying.
>>>>>> Which options should I use with the text2image command? Also, is
>>>>>> there any configuration file and fonts list?
>>>>>>
>>>>>> I tried the default options of text2image with the Tesseract GitHub
>>>>>> training data and Sanskrit 2003, but the recognition results are far
>>>>>> from those of the san.traineddata file on GitHub.
>>>>>> Any help in matching the san.traineddata results, starting from
>>>>>> scratch, would be highly appreciated.
>>>>>>
>>>>>> Thanks in advance
>>>>>> Rohit
>>>>>>
>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>>>
>>>>>>> Do we have Sanskrit training images and box files available online?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Rohit
>>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
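
For reference, the experiment suggested earlier in this thread (unpack the hin traineddata, remake the word dawg from a Sanskrit wordlist, combine again) could look roughly like the sketch below. It is only a sketch under assumptions: the working directory `unpacked`, the wordlist file `san.wordlist` and the helper functions `run` and `rebuild_word_dawg` are hypothetical names, and it assumes the unpacked hin data contains a `word-dawg` component. By default it only prints the commands; set DRY_RUN=0 to execute them with the Tesseract 3.0x training tools on PATH.

```shell
#!/bin/sh
# Sketch only: prints the commands by default. Set DRY_RUN=0 to execute them
# (needs combine_tessdata, dawg2wordlist and wordlist2dawg on PATH).
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical names, not taken from the thread.
SRC_LANG="${SRC_LANG:-hin}"          # traineddata to start from
WORKDIR="${WORKDIR:-unpacked}"       # scratch directory for the components
WORDLIST="${WORDLIST:-san.wordlist}" # one Sanskrit word per line, UTF-8

# Print the command in dry-run mode (quoting is not shown), otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

rebuild_word_dawg() {
    prefix="$WORKDIR/$SRC_LANG."
    # Work on a copy so the installed traineddata stays untouched.
    run mkdir -p "$WORKDIR"
    run cp "$SRC_LANG.traineddata" "$WORKDIR/$SRC_LANG.traineddata"
    # 1. Unpack all components (config, unicharset, dawgs, ...).
    run combine_tessdata -u "$WORKDIR/$SRC_LANG.traineddata" "$prefix"
    # 2. Dump the existing dictionary so it can be inspected.
    run dawg2wordlist "${prefix}unicharset" "${prefix}word-dawg" "${prefix}wordlist.orig"
    # 3. Build a new word dawg from the Sanskrit wordlist.
    run wordlist2dawg "$WORDLIST" "${prefix}word-dawg" "${prefix}unicharset"
    # 4. Overwrite only that component in the copied traineddata.
    run combine_tessdata -o "$WORKDIR/$SRC_LANG.traineddata" "${prefix}word-dawg"
}

rebuild_word_dawg
```

The copy keeps the `hin` name on purpose: the error quoted above shows cube looking for a file named after the language (`san3ds.cube.lm`), so renaming the language is left out of this sketch. Words in the wordlist that use characters missing from the hin unicharset will probably be rejected by wordlist2dawg, so check its output.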