Hi,
Many thanks for the response. 
We tried many methods, and splitting the characters into their basic
components was one of them. We wanted an expert opinion on that approach.
I do not know what is meant by "inscript keyboard". Kindly elaborate. I would
welcome more information on what you have available. We have been using the
normal method of boxing and entering strings into the boxes, in the same way
as your files indicate.
I welcome further exchange and communication on this.
Thanks, 
MNS Rao



----- Original Message ----- 
  From: Rakesh A 
  To: [email protected] 
  Sent: Monday, September 05, 2011 4:51 AM
  Subject: Re: Training Kannada


  Hi Satyanarayana Rao, 

  I have worked a bit with Kannada and Telugu. Attached are sample files. The
picture is Telugu but the characters in the .box file are Kannada. I just leave
the combinations as they are, like న, నా, ని, ను etc.

  I very strongly suggest that you use the inscript keyboard if you plan to be
in Indic computing for the long run.
  That way you have complete control over what you are typing, instead of
relying on an intermediary program like Baraha (who knows what it does).

  I can also send you a post-processor file to handle things like న్యా etc. But
I will do that later, once you understand this.

  - Rakeshvara Rao


  On Wed, Aug 24, 2011 at 8:40 AM, Sathyanarayanarao Magadi Nanjappa 
<[email protected]> wrote:

          Training Tesseract for Kannada is posing problems because the script
is very complex. We have tried making a box file from a normal image of each
character and entering the string in the standard way, but the recognition
accuracy is not improving. The main problem is that the training data does not
contain all combinations with the required number of repetitions. One
transliteration scheme (equivalent to writing Kannada on an English keyboard,
in the same way as writing your name in English) is purely phonetic. By
inserting a ZWJ between consonant and vowel, a live consonant can be broken
apart, and the ZWJ can similarly be removed during post-processing to get the
output in the normal form. I was wondering whether using ZWJ during training
would help in meeting the requirement for combinations and repetitions.

            

         i.e    ಕ್‍ಅ= ಕ | ಕ್‍ಆ= ಕಾ | ಕ್‍ಇ = ಕಿ
             

          i.e  ನಾ = ನ್‍ಆ | ನು=ನ್‍ಉ | ನ = ನ್‍ಅ
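
          The split and the post-processing step described above can be
sketched in Python roughly as follows. This is only an illustration, not
something from the thread or from Tesseract: the function names are made up,
only the common vowel signs are covered, and it assumes the input text does
not already contain a virama followed by ZWJ and an independent vowel.

```python
import re

ZWJ = "\u200d"        # zero width joiner
VIRAMA = "\u0ccd"     # Kannada sign virama (halant)
INHERENT_A = "\u0c85"  # independent vowel A

# Dependent vowel sign -> independent vowel (common Kannada vowels only).
SIGN_TO_VOWEL = {
    "\u0cbe": "\u0c86", "\u0cbf": "\u0c87", "\u0cc0": "\u0c88",
    "\u0cc1": "\u0c89", "\u0cc2": "\u0c8a", "\u0cc3": "\u0c8b",
    "\u0cc6": "\u0c8e", "\u0cc7": "\u0c8f", "\u0cc8": "\u0c90",
    "\u0cca": "\u0c92", "\u0ccb": "\u0c93", "\u0ccc": "\u0c94",
}
VOWEL_TO_SIGN = {v: k for k, v in SIGN_TO_VOWEL.items()}


def is_consonant(ch):
    """True for code points in the Kannada consonant range KA..HA."""
    return "\u0c95" <= ch <= "\u0cb9"


def split_syllables(text):
    """Rewrite consonant+vowel as consonant + virama + ZWJ + vowel."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_consonant(ch) and nxt in SIGN_TO_VOWEL:
            out.append(ch + VIRAMA + ZWJ + SIGN_TO_VOWEL[nxt])
            i += 2
        elif is_consonant(ch) and nxt == VIRAMA:
            # Dead consonant or first half of a conjunct: copy unchanged.
            out.append(ch + VIRAMA)
            i += 2
        elif is_consonant(ch):
            # Bare consonant carries the inherent vowel A.
            out.append(ch + VIRAMA + ZWJ + INHERENT_A)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


_SPLIT_FORM = re.compile(
    "([\u0c95-\u0cb9])" + VIRAMA + ZWJ + "([\u0c85-\u0c94])"
)


def join_syllables(text):
    """Post-processing: undo split_syllables on recognised text."""
    def repl(match):
        cons, vowel = match.group(1), match.group(2)
        if vowel == INHERENT_A:
            return cons
        sign = VOWEL_TO_SIGN.get(vowel)
        # Leave unknown vowels untouched rather than guess.
        return cons + sign if sign else match.group(0)
    return _SPLIT_FORM.sub(repl, text)
```

          Running split_syllables over the training text and join_syllables
over the recognised output would correspond to the two halves of the idea.
Whether the engine actually benefits from the split form is exactly the open
question here.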



          In the first instance, ‘ka’ is split as ‘k’ and ‘a’ separated by
ZWJ ‘^’ while preparing the image and box file.


          In the second instance, the image as normally rendered is used for
boxing, and the string within the box is split as shown at the top of the box
using "^".

          Here again, as my knowledge of the Tesseract engine is poor, I am not
able to decide whether ZWJ should be used while creating the data image or in
the strings in the box file.

          Another question is whether this is really a solution at all.

          I would appreciate help from somebody in the group who has good
insight into the working of the OCR engine, and also into fonts that use
transliteration schemes with a normal keyboard and the phonetic method.

          N.B.  ^ = ZWJ (zero width joiner)

                 ^^ = ZWNJ (zero width non-joiner)

          as used in the Baraha software

          MNS Rao





    -- 
    You received this message because you are subscribed to the Google
    Groups "tesseract-ocr" group.
    To post to this group, send email to [email protected]
    To unsubscribe from this group, send email to
    [email protected]
    For more options, visit this group at
    http://groups.google.com/group/tesseract-ocr?hl=en





