Dear Sriranga,

I can see that something interesting is happening with Phonetic English and the Oriental scripts, but unfortunately I cannot understand what exactly. Please elaborate on this. All I can suppose at the moment is that the English script is better trained and simpler by nature than the Oriental ones, which would be why you don't get the same accuracy.
Looking forward to your more detailed explanation of what you are trying to achieve by using Phonetic English.

Warm regards,
Dmitri Silaev

2011/4/15 Sriranga(78yrsold) <[email protected]>:
> Dear Dimitry,
> Since my post to the tesseract-ocr forum did not appear, I am forwarding this
> to you directly for your valuable guidance. Will you kindly tell me which
> source code I have to look into, and how to test whether the output for a tif
> is according to the unicharset file? I am ready to follow whatever procedure
> is needed and to send feedback to you for further guidance. What I want to
> know is how the output is generated from the tif file, and which source files
> tesseract uses for this purpose. Kindly note that I am neither a programmer
> nor a developer; as such, your expert guidance is solicited.
> With warmest Regards,
> -sriranga(78yrs)
>
> ---------- Forwarded message ----------
> From: Sriranga(78yrsold) <[email protected]>
> Date: 2011/4/15
> Subject: Re: [Indic-OCR] What next?
> To: Debayan Banerjee <[email protected]>, [email protected]
> Cc: Ray Smith <[email protected]>
>
> From the attached files, it can be seen that there are no problems with
> maatraas for Bengali script (I may be wrong). tesseract -r527 and WinXP
> were used.
> I translated from Kannada script to Bengali script, which was further
> converted to Latin phonetic English. The generated tif, box, ke.unicharset
> and ke.traineddata files are all in Latin Phonetic English.
>
> When tested as "tesseract kanE.tif outputkanE -l ke", I was shocked and
> surprised to note that outputkanE.txt did not contain any misspellings but
> was 100% accurate. Please note that the output was in Latin Phonetic English
> and agrees with the tif file.
>
> To make sure, outputkanE.txt was converted to Bengali as well as Kannada
> script. Both scripts were found to be 100% accurate.
>
> Now the question is: when a Bengali or Kannada tif is tested following the
> same procedure as for Latin Phonetic English, the output text is not 100%
> accurate in its own script (i.e. Bengali or Kannada).
> Why this happens I could not understand, i.e. how the output for the same
> text can be 100% accurate in Latin Phonetic English, whereas the output in
> its original script is only 70-80% accurate. Why? This requires
> investigation by experts.
>
> I have now attached all data files generated in Latin Phonetic English.
> Data files generated in Bengali, Kannada or even Hindi will be forwarded on
> request from the experts.
> With warmest Regards,
> -sriranga(78yrs)
>
> On Mon, Apr 11, 2011 at 9:42 AM, pranay prateek <[email protected]>
> wrote:
>>
>> For the descending vowel thing, finding the minima in the histogram doesn't
>> seem to be working as well as expected.
>> Sometimes a minimum doesn't exist. Since there are only a few
>> descending vowels, like उ, ऊ and रे कार, can't
>> we just do simple template matching for the lower part of the alphabet?
>> It might be computationally intensive, but
>> it might work.
>> On Mon, Apr 11, 2011 at 12:11 AM, Debayan Banerjee <[email protected]>
>> wrote:
>>>
>>> http://hacking-tesseract.blogspot.com/2011/04/what-next.html
>>>
>>> This blog post uses Bengali script as example. Hindi is very similar
>>> for the purpose and hence the discussion is applicable to Hindi script
>>> as well.
>>>
>>> --
>>> Debayan Banerjee
>>
>> --
>> "You aren't remembered for doing what is expected of you."
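
The accuracy figures quoted in this thread (100% for Latin Phonetic English, 70-80% for the original scripts) are easier to compare when both are measured the same way. The following is only a minimal sketch of one possible measurement, a character-level match ratio between a ground-truth string and the OCR output. The function name char_accuracy is hypothetical and is not part of tesseract. The strings are NFC-normalized on the assumption that both inputs are Unicode text, so that differently composed but equivalent sequences are not counted as errors. Note that for Bengali or Kannada one visible glyph is often several code points (consonant plus maatraa), so this figure counts code points, not glyphs.

```python
import difflib
import unicodedata


def char_accuracy(expected: str, actual: str) -> float:
    """Fraction of ground-truth code points that the OCR output reproduces.

    Hypothetical helper for illustration only; not a tesseract API.
    """
    # Normalize so canonically equivalent sequences compare as equal.
    expected = unicodedata.normalize("NFC", expected)
    actual = unicodedata.normalize("NFC", actual)
    if not expected:
        return 1.0 if not actual else 0.0
    # autojunk=False keeps the matcher from discarding frequent characters.
    matcher = difflib.SequenceMatcher(None, expected, actual, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)


if __name__ == "__main__":
    # Toy strings, not data from the attached files.
    print(char_accuracy("abcd", "abxd"))
```

To use it on real data, one would read the ground-truth text and a file such as outputkanE.txt with encoding="utf-8" and pass both strings to the function. Running the same measurement on the Latin and on the native-script outputs gives figures that can be compared directly.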

