Oliver released the first version of sanskritocr for free; the new version is commercial, with a demo, and is sold by Indsenz. I assume the newer one may be better; it also allows training for particular fonts.
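
If you want to compare the two engines on the same pages, the WER and CER figures quoted below can be reproduced with a plain edit-distance count. This is only a minimal sketch: it assumes the ground truth and the OCR output are available as plain Unicode text, it counts characters as Unicode code points (so a Devanagari matra or virama counts as one character), and the function names are my own, not part of any OCR tool. Other evaluation scripts may normalise whitespace or count aksharas differently, so the numbers may not match theirs exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by the length of the reference."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref_text, hyp_text):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref_text.split(), hyp_text.split())


def cer(ref_text, hyp_text):
    """Character error rate: tokens are Unicode code points."""
    return error_rate(ref_text, hyp_text)
```

Calling `wer(ground_truth, ocr_output)` and `cer(ground_truth, ocr_output)` on the same page for both engines gives figures that can be compared directly, since both are normalised by the length of the same reference text.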
- sent from my phone. excuse the brevity.
On 30-Jun-2016 4:25 pm, "rohit saluja" <[email protected]> wrote:

> Hi
>
> I just ocred 30 pages of a sanskrit book on Sanskrit OCR. I got WER of 54%
> and CER of 24%.
> Whereas I get WER of 20% on Indsenz and CER of 8%. Have you tried
> comparing Indsenz with Sanskrit OCR. Which one is better where?
>
> On Tuesday, 21 June 2016 12:36:23 UTC+5:30, rohit saluja wrote:
>>
>> Hey thanks a lot. Your replies are really helpful.
>>
>> Rohit
>>
>> On Saturday, 18 June 2016 23:41:13 UTC+5:30, shree wrote:
>>>
>>> I do not know about the training process for cube, it is not documented.
>>>
>>> I have uploaded the box/tif pairs generated by text2image under windows
>>> for sanskrit - there are two versions s21 and s95 - using different fonts
>>> and exposure levels. Please see
>>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
>>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
>>>
>>> In s21, each font is used for 3 different exposure levels, -1, 0 and 1.
>>> tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./ --exposures "-1 0 1"
>>>
>>> In s95, each font is used only at 0 exposure level.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja <[email protected]>
>>> wrote:
>>>
>>>> Hey thanks a lot for your reply. This seems to be a great idea to use
>>>> hin data with sanskrit wordlist.
>>>>
>>>> Still I am interested in knowing the things building from scratch.
>>>> So I used some boxfiles and images I created for sanskrit 2003 font and
>>>> used the hindi config file from
>>>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>>>> and I renamed it as san3ds.config. san3ds (3 for 2003, ds for devanagari
>>>> split) is the new name I am giving for my new training data.
>>>>
>>>> I was able to train san3ds without any config file before.
>>>>
>>>> I just renamed san3ds.word-dawg as san3ds.cube-word-dawg. Remaining
>>>> files I kept as it is.
>>>> I could form san3.traineddata file, but I am getting an error while
>>>> recognition:-
>>>>
>>>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model params from /usr/local/share/tessdata/san3ds.cube.lm
>>>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
>>>> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 214
>>>> Segmentation fault (core dumped)
>>>>
>>>> Any help in this, why this is happening? Is it wrong in renaming
>>>> word-dawg, I cannot find any separate option for generating cube-word-dawg.
>>>>
>>>> Thanks in advance
>>>> Rohit
>>>>
>>>>
>>>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <[email protected]>
>>>> wrote:
>>>>
>>>>> If you look at the readme files in the diff subdirectories starting
>>>>> with OCR under
>>>>> https://github.com/Shreeshrii/imagessan/tree/master you will see
>>>>> results of character and word level accuracy. Depending on the font,
>>>>> character level accuracy is around 80% and word level accuracy around 60%
>>>>>
>>>>> I have not used it for actual OCR of any text because sanskritocr
>>>>> software by dr. Oliver hellwig gives better results.
>>>>>
>>>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>>>
>>>>> - sent from my phone. excuse the brevity.
>>>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:
>>>>>
>>>>>> Yes, hin traineddata with cube gives better results than san.
>>>>>>
>>>>>> I did some rudimentary testing with the new traineddata I made. It
>>>>>> does not use cube. Look at the config files, it has some options for
>>>>>> devanagari processing.
>>>>>>
>>>>>> You could try to unpack the hin traineddata and then remake the Dawg
>>>>>> files using sanskrit wordlists and combine them as an experiment.
>>>>>>
>>>>>> If you have a unicode version of the font used for the docs you want to
>>>>>> OCR, then train using that.
>>>>>>
>>>>>> - sent from my phone. excuse the brevity.
>>>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks again for replying. I will surely check them out.
>>>>>>>
>>>>>>> My experience is that OCR on sanskrit data with hin.traineddata
>>>>>>> gives better results than san.traineddata. I do not know whether it is
>>>>>>> due to cube mode or to the devanagari preprocessing (segmentation, I
>>>>>>> guess) in hin.traineddata.
>>>>>>>
>>>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>>>> Please let me know whether you are using cube mode in your
>>>>>>> traineddata or not, and are you using devanagari preprocessing?
>>>>>>>
>>>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Google has not provided images and box files for san.traineddata
>>>>>>>> released for 3.04
>>>>>>>>
>>>>>>>> I tried training using text2image with a combination of different
>>>>>>>> fonts and training text. Results are at
>>>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>>>
>>>>>>>> You can give these a try to see if recognition is any better.
>>>>>>>>
>>>>>>>> You can unpack any trained data file using the -u option with
>>>>>>>> combine_tessdata to see the config files etc.
>>>>>>>>
>>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>>>
>>>>>>>> Use dawg2wordlist to look at the various dictionary word lists
>>>>>>>> used.
>>>>>>>>
>>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>>>>
>>>>>>>> - sent from my phone. excuse the brevity.
>>>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey thanks for replying.
>>>>>>>>> Which options to use with text2image command? Also, is there any
>>>>>>>>> configuration file and fonts list?
>>>>>>>>>
>>>>>>>>> I tried the default option of text2image with tesseract github
>>>>>>>>> training data with sanskrit 2003, but the recognition results are far
>>>>>>>>> away from san.traineddata file on github.
>>>>>>>>> Any help in matching san.traineddata results, starting from the
>>>>>>>>> scratch, would be highly appreciated.
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>> Rohit
>>>>>>>>>
>>>>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>>>>>>
>>>>>>>>>> Do we have Sanskrit training images and box files available
>>>>>>>>>> online?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Rohit
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.

