If you look at the readme files in the different subdirectories starting with OCR under https://github.com/Shreeshrii/imagessan/tree/master you will see the character-level and word-level accuracy results. Depending on the font, character-level accuracy is around 80% and word-level accuracy is around 60%.
I have not used it for actual OCR of any text, because the SanskritOCR software by Dr. Oliver Hellwig gives better results. See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing

- sent from my phone. excuse the brevity.

On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:

> Yes, hin traineddata with cube gives better results than san.
>
> I did some rudimentary testing with the new traineddata I made. It does
> not use cube. Look at the config files; they have some options for
> Devanagari processing.
>
> As an experiment, you could unpack the hin traineddata, remake the DAWG
> files using Sanskrit wordlists, and combine them again.
>
> If you have a Unicode version of the font used in the documents you want
> to OCR, then train using that.
>
> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>
>> Thanks again for replying. I will surely check them out.
>>
>> My experience is that OCR on Sanskrit data with hin.traineddata gives
>> better results than san.traineddata. I do not know whether that is due
>> to cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>
>> I wonder why such preprocessing is not applied in san.traineddata.
>> Please let me know whether you are using cube mode in your traineddata,
>> and whether you are using Devanagari preprocessing.
>>
>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]>
>> wrote:
>>
>>> Google has not provided images and box files for the san.traineddata
>>> released for 3.04.
>>>
>>> I tried training using text2image with a combination of different fonts
>>> and training text. Results are at
>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>
>>> You can give these a try to see if recognition is any better.
>>>
>>> You can unpack any traineddata file using the -u option of
>>> combine_tessdata to see the config files etc.
>>>
>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>
>>> Use the dawg2wordlist to look at the various dictionary word lists used.
>>>
>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>
>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]> wrote:
>>>
>>>> Hey thanks for replying.
>>>> Which options to use with text2image command? Also, is there any
>>>> configuration file and fonts list?
>>>>
>>>> I tried the default option of text2image with tesseract github training
>>>> data with sanskrit 2003, but the recognition results are far away from
>>>> san.traineddata file on github.
>>>> Any help in matching san.traineddata results, starting from the
>>>> scratch, would be highly appreciated.
>>>>
>>>> Thanks in advance
>>>> Rohit
>>>>
>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>
>>>>> Do we have Sanskrit training images and box files available online?
>>>>>
>>>>> Thanks
>>>>> Rohit
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
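
For anyone who wants to try the unpack-and-rebuild experiment mentioned in the thread, here is a minimal sketch of the command sequence, written as a small Python helper that only builds and prints the commands (it does not run them). It assumes the Tesseract 3.x training tools combine_tessdata, dawg2wordlist and wordlist2dawg are installed. The function name, the `work` directory and the file `san.wordlist.txt` are hypothetical names chosen for illustration. The sketch replaces only the main word DAWG; the cube files and the other dictionaries in hin.traineddata are left as they are, and whether this actually improves Sanskrit recognition is untested.

```python
import shlex


def rebuild_commands(traineddata_copy, wordlist, workdir="work", lang="hin"):
    """Build the argv lists for the unpack / rebuild-DAWG / recombine experiment.

    traineddata_copy should be a *copy* of the original traineddata file,
    because the last step overwrites a component inside it.
    All paths here are illustrative assumptions, not fixed conventions.
    """
    prefix = f"{workdir}/{lang}."          # components unpack to <workdir>/<lang>.*
    unicharset = prefix + "unicharset"
    word_dawg = prefix + "word-dawg"
    return [
        # 1. Unpack every component (config, unicharset, DAWGs, ...) to files
        #    whose names start with the given prefix.
        ["combine_tessdata", "-u", traineddata_copy, prefix],
        # 2. Dump the existing word DAWG as plain text, for comparison.
        ["dawg2wordlist", unicharset, word_dawg, prefix + "wordlist.orig.txt"],
        # 3. Build a new word DAWG from the replacement word list, using the
        #    unpacked unicharset.
        ["wordlist2dawg", wordlist, word_dawg, unicharset],
        # 4. Overwrite just that component inside the traineddata copy.
        ["combine_tessdata", "-o", traineddata_copy, word_dawg],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in rebuild_commands("work/hin.traineddata", "san.wordlist.txt"):
        print(shlex.join(argv))
```

Printing the commands instead of executing them keeps the sketch safe to run anywhere; once the paths look right, each line can be pasted into a shell.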

