[Indic-OCR] What next?

Dmitri Silaev Sat, 16 Apr 2011 08:15:04 -0700

Dear Sriranga,

I see that something interesting is happening with Phonetic English
and the Oriental scripts, but unfortunately I cannot understand what
exactly. Please elaborate on this. All I can suppose at the moment is
that the English script is better trained and simpler by nature
than the Oriental ones, which would explain why you don't get the
same accuracy.


Looking forward to a more detailed explanation of what you are
trying to achieve by using Phonetic English.

Warm regards,
Dmitri Silaev





2011/4/15 Sriranga(78yrsold) <[email protected]>:
> Dear Dimitry,
> Since my post to the tesseract-ocr forum did not appear, I am forwarding this to
> you directly for your valuable guidance. Will you kindly tell me which
> source code I have to look into, and how to test whether the output for a tif
> agrees with the unicharset file? I am ready to follow whatever procedure is needed
> and to send feedback to you for further guidance. What I want to know is how the
> output is generated from the tif file and which source files tesseract uses for
> this purpose. Kindly note that I am neither a programmer nor a developer, so your
> expert guidance is solicited.
> With warmest Regards,
> -sriranga(78yrs)
>
> ---------- Forwarded message ----------
> From: Sriranga(78yrsold) <[email protected]>
> Date: 2011/4/15
> Subject: Re: [Indic-OCR] What next?
> To: Debayan Banerjee <[email protected]>, [email protected]
> Cc: Ray Smith <[email protected]>
>
>
> From the attached files, it can be seen that there are no problems with
> maatraas for Bengali script (I may be wrong). Tesseract r527 and WinXP
> were used.
> I converted the text from Kannada script to Bengali script, and then
> to Latin phonetic English. The generated tif, box, ke.unicharset and
> ke.traineddata files are all in Latin phonetic English.
>
> When tested as "tesseract kanE.tif outputkanE -l ke", I was shocked and
> surprised to note that outputkanE.txt contained no misspellings and was
> 100% accurate. Please note that the output was in Latin phonetic English and
> agreed with the tif file.
>
> To make sure, outputkanE.txt was converted to Bengali as well as Kannada
> script. Both were found to be 100% accurate.
>
> Now the question: when a Bengali or Kannada tif is tested following the same
> procedure used for Latin phonetic English, the output text is not
> 100% accurate in its own script (i.e. Bengali or Kannada).
> I could not understand why this happens,
> i.e. how the output for the same text can be 100% accurate in Latin phonetic
> English but only 70-80% accurate in its original script.
> This requires investigation by experts.
>
> I have attached all the data files generated in Latin phonetic English.
> The data files generated in Bengali, Kannada or even Hindi will be
> forwarded on request from the experts.
> With warmest Regards,
> -sriranga(78yrs)
>
>
> On Mon, Apr 11, 2011 at 9:42 AM, pranay prateek <[email protected]>
> wrote:
>>
>> For the descending vowels, finding the minimum in the histogram doesn't
>> seem to be working as well as expected.
>> Sometimes a minimum doesn't exist. Since there are only a few
>> descending vowels, like उ, ऊ and रे कार, can't
>> we just do simple template matching on the lower part of the letter?
>> It might be computationally intensive, but
>> it might work.
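
The histogram problem described above can be illustrated with a minimal
sketch. It assumes the "histogram" is a horizontal projection profile (ink
pixels per row) of a binarized glyph, and that the descender is separated
from the body by a local minimum in the lower part of that profile. The
function names, the start_fraction parameter and the toy glyphs are
hypothetical and are not taken from tesseract or from the blog post.

```python
def row_profile(glyph):
    """Horizontal projection: number of ink pixels (1s) in each row."""
    return [sum(row) for row in glyph]


def find_valley(profile, start_fraction=0.5):
    """Return the row index of the deepest local minimum in the lower
    part of the profile, or None when no local minimum exists there.

    start_fraction is an assumed heuristic: only rows below this
    fraction of the glyph height are searched.
    """
    start = max(1, int(len(profile) * start_fraction))
    best = None
    for y in range(start, len(profile) - 1):
        # A valley row has less ink than the row above it and no more
        # ink than the row below it.
        if profile[y] < profile[y - 1] and profile[y] <= profile[y + 1]:
            if best is None or profile[y] < profile[best]:
                best = y
    return best


# Toy glyph: a wide body, a thin neck, then a wide descender.
with_neck = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1],
]

# Toy glyph whose ink only thins out towards the bottom: no valley.
tapering = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

print(find_valley(row_profile(with_neck)))   # 3
print(find_valley(row_profile(tapering)))    # None
```

The second glyph shows the failure case mentioned above: when the vowel sign
joins the body without a thin neck, the profile has no local minimum and a
fallback such as template matching on the lower part would be needed.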
>> On Mon, Apr 11, 2011 at 12:11 AM, Debayan Banerjee <[email protected]>
>> wrote:
>>>
>>> http://hacking-tesseract.blogspot.com/2011/04/what-next.html
>>>
>>> This blog post uses Bengali script as example. Hindi is very similar
>>> for the purpose and hence the discussion is applicable to Hindi script
>>> as well.
>>>
>>> --
>>> Debayan Banerjee
>>
>>
>>
>> --
>> "You aren't remembered for doing what is expected of you."
>
>
>
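
The accuracy figures discussed above (100% for Latin phonetic output versus
70-80% for native script) can be put on a common footing with a character
accuracy measure. The following is a minimal sketch, assuming accuracy is
defined as 1 minus the Levenshtein edit distance divided by the length of
the ground-truth text, counted in Unicode code points after NFC
normalization. That definition and the function names are assumptions for
illustration and are not part of tesseract.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """Fraction of the ground truth recovered by the OCR output."""
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


# One wrong character out of four.
print(char_accuracy("abcd", "abxd"))  # 0.75
```

In practice the two arguments would be the text used to render the tif and
the contents of the .txt file written by tesseract. Note that one Indic
akshara usually spans several code points (consonant, virama, vowel sign),
so a single misread maatraa counts as one error here, whereas a Latin
phonetic transcription spreads the same syllable over several simple
letters.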

