Re: Hindi training data - unicharset_extractor error

2013-06-01 Thread sdk
After two months of experimentation with Tesseract OCR for Hindi/Sanskrit, I would like to update the forum members, who have been very helpful in providing information and guidance, on the results so far. I have posted the training source files as well as the traineddata for Hindi -

Re: Hindi training data - unicharset_extractor error

2013-04-18 Thread zdenko podobny
On Thu, Apr 18, 2013 at 5:35 AM, sdk shreesh...@gmail.com wrote: Zdenko, You wrote: He can create another data and use it together with data provided by google. Does this mean that we can use Tesseract's ability to recognize multiple languages in order to use multiple traineddata

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread sdk
Thanks. Yes, Google has provided hin.traineddata, which gives good results. I was trying to see whether it was possible to train it further with additional fonts. On Tuesday, April 16, 2013 10:50:24 PM UTC+5:30, rākēśvara rāvu wrote: I think Google has an internal traineddata file for

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Sven Pedersen
This is covered in the FAQ: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_add_just_one_character_or_one_font_to_my_favourite_lang which links to the training WIKI https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 --Sven On Wed, Apr 17, 2013 at 7:24 AM, sdk

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Robert Komar
On Wed, 17 Apr 2013, Sven Pedersen wrote: This is covered in the FAQ: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_add_just_one_character_or_one_font_to_my_favourite_lang which links to the training WIKI https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 --Sven

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Sven Pedersen
Rob, You can add fonts to existing languages. Just follow the combine instructions. Sven On Wednesday, April 17, 2013, Robert Komar wrote: On Wed, 17 Apr 2013, Sven Pedersen wrote: This is covered in the FAQ: https://code.google.

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread zdenko podobny
On Wed, Apr 17, 2013 at 10:41 PM, Sven Pedersen sven.peder...@gmail.com wrote: Rob, You can add fonts to existing languages. Just follow the combine instructions. As far as I know, it is not possible. He can create another data and use it together with data provided by google. Sven On

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread zdenko podobny
On Wed, Apr 17, 2013 at 10:36 PM, Robert Komar rko...@telus.net wrote: On Wed, 17 Apr 2013, Sven Pedersen wrote: This is covered in the FAQ: https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks. I did follow the training wiki. However, since Hindi uses Cube mode, it is not possible to train for that. I am trying to train for san (Sanskrit), which uses the same Devanagari script, in non-Cube mode. On Thu, Apr 18, 2013 at 1:34 AM, Sven Pedersen sven.peder...@gmail.com wrote:

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread Shree Devi Kumar
Thanks, Zdenko! I think it would be helpful to add this to the training pages wiki in the next update. If possible, also add a list of the languages that use the Cube mode. On Thu, Apr 18, 2013 at 3:05 AM, zdenko podobny zde...@gmail.com wrote: I remember one user post, that he

Re: Hindi training data - unicharset_extractor error

2013-04-17 Thread sdk
Zdenko, You wrote: He can create another data and use it together with data provided by google. Does this mean that we can use Tesseract's ability to recognize multiple languages in order to use multiple traineddata files for the same 'real' language but with different language codes?

Re: Hindi training data - unicharset_extractor error

2013-04-16 Thread Rakesh A
I think Google has an internal traineddata file for Devanagari, because sometimes when you search for Sanskrit material it returns results from Google Books. So it is possible. On Sat, Mar 30, 2013 at 7:18 PM, sdk shreesh...@gmail.com wrote: Hello, I have recently installed tesseract-ocr 3.02 on

Hindi training data - unicharset_extractor error

2013-03-30 Thread sdk
Hello, I have recently installed tesseract-ocr 3.02 on Windows 7 and am training it for Hindi with the sanskrit2003 font. 1. While running unicharset_extractor I received the error: Utf8 buffer too big, size=57 for à☼½à☼_à☼¿à¥?à¥?à¥,ृà¥,à¥.à¥+à¥╪à¥^à¥%à¥Sà¥à¥ Oà¥?à¥Zà¥? Is this just a warning or
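
For anyone hitting the same message: a size of 57 bytes would correspond to 19 Devanagari code points, since each code point in that block takes 3 bytes in UTF-8. This suggests a single box-file entry holding a long run of characters. The following is only a rough sketch for locating such entries, not part of Tesseract. The function name `oversized_box_entries` and the default limit of 24 bytes are assumptions for illustration; check the `UNICHAR_LEN` constant in your own Tesseract source for the actual limit. It also assumes the usual box-file layout, where the glyph is the first whitespace-separated field on each line.

```python
def oversized_box_entries(lines, max_bytes=24):
    """Return (line_number, glyph, byte_length) for each box entry whose
    glyph field is longer than max_bytes when encoded as UTF-8.

    max_bytes=24 is an ASSUMED limit for illustration only; verify it
    against the UNICHAR_LEN constant in your Tesseract version.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        glyph = fields[0]  # assumed layout: glyph left bottom right top page
        size = len(glyph.encode("utf-8"))
        if size > max_bytes:
            hits.append((number, glyph, size))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python check_box.py hin.sanskrit2003.exp0.box  (hypothetical file name)
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, glyph, size in oversized_box_entries(handle):
            print(f"line {number}: {size} bytes")
```

Any line it reports is a candidate for splitting into shorter entries in the box file before re-running unicharset_extractor.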