Oliver Hellwig released the first version of sanskritocr for free; the newer
version is commercial (with a demo) and is sold by Indsenz. I assume the newer
one is better; it also allows training for particular fonts.

- sent from my phone. excuse the brevity.
On 30-Jun-2016 4:25 pm, "rohit saluja" <[email protected]> wrote:

> Hi
>
> I just OCRed 30 pages of a Sanskrit book with Sanskrit OCR and got a WER of
> 54% and a CER of 24%, whereas with Indsenz I get a WER of 20% and a CER of 8%.
> Have you tried comparing Indsenz with Sanskrit OCR? Which one is better
> where?
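
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how WER and CER are commonly computed: the edit distance between the OCR output and the ground truth, divided by the length of the ground truth. The function names are made up for illustration, and the CER here counts Unicode code points rather than Devanagari aksharas, which is an assumption; the figures quoted above may have been computed differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over code points (assumption: not aksharas)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "a b c d"
    ocr = "a x c d"
    print(f"WER={wer(truth, ocr):.2%} CER={cer(truth, ocr):.2%}")
```

Both rates can exceed 100% when the OCR output contains many insertions, so they are error rates rather than the complement of an accuracy.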
>
> On Tuesday, 21 June 2016 12:36:23 UTC+5:30, rohit saluja wrote:
>>
>> Hey thanks a lot. Your replies are really helpful.
>>
>> Rohit
>>
>> On Saturday, 18 June 2016 23:41:13 UTC+5:30, shree wrote:
>>>
>>> I do not know about the training process for cube; it is not documented.
>>>
>>> I have uploaded the box/tif pairs generated by text2image under Windows
>>> for Sanskrit. There are two versions, s21 and s95, which use different fonts
>>> and exposure levels. Please see
>>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s21
>>> https://github.com/Shreeshrii/imagessan/tree/master/trainingdata-s95
>>>
>>> In s21, each font is used at 3 different exposure levels: -1, 0 and 1.
>>> tesstrain.sh --lang san --langdata_dir ./langdata --tessdata_dir ./
>>> --exposures "-1 0 1"
>>>
>>> In s95, each font is used only at 0 exposure level.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja <[email protected]>
>>> wrote:
>>>
>>>> Hey, thanks a lot for your reply. Using the hin data with a Sanskrit
>>>> wordlist seems to be a great idea.
>>>>
>>>> Still, I am interested in learning how to build things from scratch.
>>>> So I used some box files and images I created for the Sanskrit 2003 font,
>>>> took the Hindi config file from
>>>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>>>> and renamed it san3ds.config. san3ds (3 for 2003, ds for Devanagari
>>>> split) is the name I am giving my new training data.
>>>>
>>>> I was able to train san3ds without any config file before.
>>>>
>>>> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg. I kept the
>>>> remaining files as they were.
>>>> I could build the san3.traineddata file, but I get an error during
>>>> recognition:
>>>>
>>>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model
>>>> params from /usr/local/share/tessdata/san3ds.cube.lm
>>>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext
>>>> object
>>>> init_cube_objects(true, &tessdata_manager):Error:Assert failed:in file
>>>> tessedit.cpp, line 214
>>>> Segmentation fault (core dumped)
>>>>
>>>> Any idea why this is happening? Is it wrong to rename word-dawg? I cannot
>>>> find any separate option for generating cube-word-dawg.
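
The error above says cube tried to read san3ds.cube.lm from /usr/local/share/tessdata and could not, which suggests cube expects files in that directory beyond the renamed dawg. A minimal sketch for checking this before running recognition follows. The helper name is hypothetical, and the default suffix list contains only the one file named in the error message; any other cube files your setup needs are an assumption you would have to add yourself.

```python
from pathlib import Path

# Only "cube.lm" comes from the error message in this thread. Extend this
# tuple with any other cube files your reference language ships with
# (assumption: they follow the same "<lang>.<suffix>" naming).
REQUIRED_SUFFIXES = ("cube.lm",)


def missing_cube_files(tessdata_dir, lang, suffixes=REQUIRED_SUFFIXES):
    """Return the expected '<lang>.<suffix>' files not found in tessdata_dir."""
    base = Path(tessdata_dir)
    return [
        f"{lang}.{suffix}"
        for suffix in suffixes
        if not (base / f"{lang}.{suffix}").is_file()
    ]


if __name__ == "__main__":
    for name in missing_cube_files("/usr/local/share/tessdata", "san3ds"):
        print("missing:", name)
```

An empty result only means the files exist; it says nothing about whether their contents are valid for the new language.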
>>>>
>>>> Thanks in advance
>>>> Rohit
>>>>
>>>>
>>>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar <[email protected]>
>>>> wrote:
>>>>
>>>>> If you look at the readme files in the different subdirectories starting
>>>>> with OCR under
>>>>> https://github.com/Shreeshrii/imagessan/tree/master you will see
>>>>> results for character- and word-level accuracy. Depending on the font,
>>>>> character-level accuracy is around 80% and word-level accuracy around 60%.
>>>>>
>>>>> I have not used it for actual OCR of any text because the sanskritocr
>>>>> software by Dr. Oliver Hellwig gives better results.
>>>>>
>>>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>>>
>>>>> - sent from my phone. excuse the brevity.
>>>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" <[email protected]> wrote:
>>>>>
>>>>>> Yes, hin traineddata with cube gives better results than san.
>>>>>>
>>>>>> I did some rudimentary testing with the new traineddata I made. It
>>>>>> does not use cube. Look at the config files; they have some options for
>>>>>> Devanagari processing.
>>>>>>
>>>>>> As an experiment, you could try to unpack the hin traineddata, remake
>>>>>> the dawg files using Sanskrit wordlists, and combine them again.
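
A rough sketch of that experiment, written as a small Python helper that only builds the command lines rather than running them. The helper name, the working directory, and the wordlist filename are hypothetical; the three tools and their argument order follow the combine_tessdata and wordlist2dawg man pages. Only word-dawg is rebuilt here, as an assumption; the other dawg components would be rebuilt the same way from their own wordlists.

```python
import shlex


def dawg_swap_plan(src_lang, wordlist, workdir="unpacked"):
    """Build the commands to unpack a traineddata, rebuild its word-dawg
    from a different wordlist, and recombine the components.

    workdir must already exist; the result lands in
    '<workdir>/<src_lang>.traineddata'.
    """
    prefix = f"{workdir}/{src_lang}."
    return [
        # 1. Unpack every component to files named '<prefix><component>'.
        ["combine_tessdata", "-u", f"{src_lang}.traineddata", prefix],
        # 2. Rebuild the word dawg against the unpacked unicharset.
        ["wordlist2dawg", wordlist, f"{prefix}word-dawg", f"{prefix}unicharset"],
        # 3. Recombine all '<prefix>*' components into one traineddata.
        ["combine_tessdata", prefix],
    ]


if __name__ == "__main__":
    # "san.wordlist" is a placeholder for your own Sanskrit wordlist.
    for cmd in dawg_swap_plan("hin", "san.wordlist"):
        print(shlex.join(cmd))
```

Words containing characters that are not in the hin unicharset cannot be encoded in the rebuilt dawg, so check the tool's output for skipped entries.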
>>>>>>
>>>>>> If you have a Unicode version of the font used in the docs you want to
>>>>>> OCR, then train using that.
>>>>>>
>>>>>> - sent from my phone. excuse the brevity.
>>>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks again for replying. I will surely check them out.
>>>>>>>
>>>>>>> My experience is that OCR on Sanskrit data with hin.traineddata
>>>>>>> gives better results than san.traineddata. I do not know whether this
>>>>>>> is due to cube mode or to the Devanagari preprocessing (segmentation,
>>>>>>> I guess).
>>>>>>>
>>>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>>>> Please let me know whether you are using cube mode in your
>>>>>>> traineddata, and whether you are using Devanagari preprocessing.
>>>>>>>
>>>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar <[email protected]
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Google has not provided the images and box files for the
>>>>>>>> san.traineddata released for 3.04.
>>>>>>>>
>>>>>>>> I tried training using text2image with a combination of different
>>>>>>>> fonts and training text. Results are at
>>>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>>>
>>>>>>>> You can give these a try to see if recognition is any better.
>>>>>>>>
>>>>>>>> You can unpack any traineddata file using the -u option of
>>>>>>>> combine_tessdata to see the config files etc.
>>>>>>>>
>>>>>>>>
>>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>>>
>>>>>>>> Use dawg2wordlist to look at the various dictionary word lists
>>>>>>>> used.
>>>>>>>>
>>>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>>>>
>>>>>>>> - sent from my phone. excuse the brevity.
>>>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey, thanks for replying.
>>>>>>>>> Which options should I use with the text2image command? Also, is
>>>>>>>>> there any configuration file and fonts list?
>>>>>>>>>
>>>>>>>>> I tried the default options of text2image with the tesseract github
>>>>>>>>> training data and Sanskrit 2003, but the recognition results are far
>>>>>>>>> from those of the san.traineddata file on github.
>>>>>>>>> Any help in matching the san.traineddata results, starting from
>>>>>>>>> scratch, would be highly appreciated.
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>> Rohit
>>>>>>>>>
>>>>>>>>> On Friday, 6 May 2016 12:59:44 UTC+5:30, rohit saluja wrote:
>>>>>>>>>
>>>>>>>>>> Do we have Sanskrit training images and box files available
>>>>>>>>>> online?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Rohit
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/45767a89-cd11-4f39-9622-3fe7b4d20a4a%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>
>>>>
>>>

