2010/8/13 రాకేశ్వర రావు <[email protected]>:
> actually I have had this problem of symbols being chopped up.
> I do not blame the engine though. Some symbols in my language are twice or
> thrice that of the average.
> Eg:- మూ [muu] is almost thrice in length as రీ[rii]. Similar problem must be
> there in English with m , i etc.
> Unfortunately for me the most frequent letter in my Telugu language is a ము
> [mu]. which is being chopped up.
> in to a మ[ma] and a tail ఎ[é].

Yes, that's almost exactly the same issue as 'm' becoming 'iii' or 'rn'.

> ARE THERE ANY variables or code segments that I can look into and tweak for
> this?

DON'T SHOUT.

Variables? No, not really. Code? Yes, but it's not documented, and it's not
entirely clear how it would be used. I'm looking into it, but it's one of
many things I'm looking into.

> Secondly,
> Symbol ము'mu' coming out as మ'ma' and the standalone vowel ఎ'é' is erratic
> as the symbol మ'ma' already has a vowel in it and hence a standalone can not
> follow it.
> IS THERE ANY way I can specify that some characters can not come after some
> other characters?

Well, that's what 'DangAmbigs' are for: impossible sequences, and their
corrections. There is newer code to use those ambiguities in a less
'dangerous' way, but it's not documented... see above.

There is also a statistical post-editing tool for tesseract; googling that
phrase should lead you to it.

> Like specify, ma-é is an invalid combination? (In general I can not have
> stand-alone-vowel-symbols like ఎ'é' coming in the middle of a word as each
> symbol is a syllable in Telugu, like in Tibetan)।
> I am looking for
> non-dictionary based solutions.

Why?
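
To make the "impossible sequences, and their corrections" idea concrete, here is a rough Python sketch of a post-processing pass over the OCR output text. To be clear, this is not Tesseract's DangAmbigs mechanism or its file format; the rule table and both function names are made up for illustration. It assumes the rule quoted above, that a standalone vowel should not follow another Telugu letter inside a word, and the single correction it knows is the ma + é -> mu case from this thread.

```python
import re

# Hypothetical post-OCR cleanup; none of this is part of Tesseract.

# Telugu independent (standalone) vowel letters: U+0C05..U+0C14,
# plus the vocalic RR/LL letters U+0C60 and U+0C61.
INDEPENDENT_VOWELS = "\u0C05-\u0C14\u0C60\u0C61"

# Assumed table of impossible sequences and their corrections.
CORRECTIONS = {
    "\u0C2E\u0C0E": "\u0C2E\u0C41",  # ma + standalone e  ->  ma + vowel sign u (mu)
}

# An independent vowel directly preceded by any character of the Telugu block.
MID_WORD_VOWEL = re.compile(
    "(?<=[\u0C00-\u0C7F])[" + INDEPENDENT_VOWELS + "]"
)


def fix_impossible_sequences(text):
    """Replace each known impossible sequence with its correction."""
    for bad, good in CORRECTIONS.items():
        text = text.replace(bad, good)
    return text


def find_mid_word_vowels(text):
    """Return the indices of standalone vowels that follow a Telugu character."""
    return [m.start() for m in MID_WORD_VOWEL.finditer(text)]


if __name__ == "__main__":
    ocr_output = "\u0C2E\u0C0E"  # ma followed by standalone e, as in the report
    print(find_mid_word_vowels(ocr_output))
    print(fix_impossible_sequences(ocr_output))
```

The finder only flags suspect positions; it does not guess a replacement. Anything it reports that the table does not cover would still need a rule added by hand.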

