2010/8/13 రాకేశ్వర రావు <[email protected]>:
> Actually, I have had this problem of symbols being chopped up.
> I do not blame the engine, though. Some symbols in my language are two or
> three times the average length.
> E.g. మూ [muu] is almost three times as long as రీ [rii]. A similar problem
> must exist in English with m, i, etc.
> Unfortunately for me, the most frequent letter in Telugu is ము [mu],
> which is being chopped up
> into a మ [ma] and a tail ఎ [é].

Yes, that's almost exactly the same issue as 'm' becoming 'iii' or 'rn'.

> ARE THERE ANY variables or code segments that I can look into and tweak for
> this?
>

DON'T SHOUT.

Variables? No, not really. Code? Yes, but it's not documented, and it's
not entirely clear how it should be used. I'm looking into it, but it's
one of many things I'm looking into.

> Secondly,
> The symbol ము 'mu' coming out as మ 'ma' plus the standalone vowel ఎ 'é' is
> erroneous, as the symbol మ 'ma' already has a vowel in it and hence a
> standalone vowel cannot follow it.
> IS THERE ANY way I can specify that some characters cannot come after some
> other characters?

Well, that's what 'DangAmbigs' are for -- impossible sequences, and
their corrections.

There is newer code to use those ambiguities in a less 'dangerous'
way, but it's not documented... see above.
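
To make "impossible sequences, and their corrections" concrete, here is a
minimal Python sketch of the same idea applied to the recognized text after
the fact. It is not Tesseract's DangAmbigs mechanism or file format; the
table entries and the function name are assumptions chosen for illustration.

```python
# Hypothetical correction table: impossible sequence -> intended text.
# These entries are illustrative assumptions, not taken from any
# Tesseract data file.
CORRECTIONS = {
    "\u0c2e\u0c0e": "\u0c2e\u0c41",  # ma + standalone e  ->  mu (ma + vowel sign u)
    "iii": "m",                      # three i's in a row ->  m
}


def fix_impossible_sequences(text, corrections=CORRECTIONS):
    """Replace every listed impossible sequence with its correction."""
    # Try longer patterns first so a short pattern cannot break up a long one.
    for bad in sorted(corrections, key=len, reverse=True):
        text = text.replace(bad, corrections[bad])
    return text
```

Note that 'rn' is deliberately left out of the table: it is a valid
sequence in English, so it cannot be corrected without more context.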

There is also a statistical post-editing tool for tesseract - googling
that phrase should lead you to it.

> Like specifying that ma-é is an invalid combination? (In general, I cannot
> have standalone vowel symbols like ఎ 'é' in the middle of a word, as each
> symbol is a syllable in Telugu, as in Tibetan.)
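
A rule like that can at least be checked on the output without a
dictionary. The sketch below flags any Telugu independent vowel that
directly follows another Telugu character. It takes the rule exactly as
stated above as its assumption, and the function names are made up for the
example; the code point ranges are those of the Unicode Telugu block.

```python
def is_telugu(ch):
    """True if ch lies in the Unicode Telugu block (U+0C00..U+0C7F)."""
    return "\u0c00" <= ch <= "\u0c7f"


def is_independent_vowel(ch):
    """True for the Telugu independent vowels U+0C05..U+0C14."""
    return "\u0c05" <= ch <= "\u0c14"


def misplaced_vowels(text):
    """Return the indexes of independent vowels found inside a word.

    Assumption: a standalone vowel is only valid at the start of a word,
    so one that directly follows another Telugu character is suspect.
    """
    return [
        i
        for i in range(1, len(text))
        if is_independent_vowel(text[i]) and is_telugu(text[i - 1])
    ]
```

The returned positions only mark suspects; deciding what each one should
have been still needs a correction table or a human.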

> I am looking for
> non-dictionary based solutions.

Why?

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

