Hi Jimmy,

Thank you for your replies.

About the shouting: I used capitals only so that people could easily locate
the actual questions.

> Well, that's what 'DangAmbigs' are for -- impossible sequences, and their
> corrections.

From my understanding of DangAmbigs, I did not infer that they could be used
for impossible sequences, since in English no sequence is really impossible
('bam' and 'barn' are both valid words, right?).

But with Unicode characters, a lot of sequences are impossible, e.g. a vowel
sign without a consonant preceding it, or an invalid sequence of combining
characters ( http://en.wikipedia.org/wiki/Combining_character ).
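
To make the idea concrete, here is a rough sketch in Python of the kind of
check I have in mind. It is not Tesseract code: the function name
find_impossible and the two rules are my own simplification, and the code
point ranges are taken from the Telugu Unicode block (U+0C00 to U+0C7F).
It only looks at adjacent characters and ignores many details (virama
handling, anusvara, legitimate exceptions).

```python
# Simplified Telugu character classes, by code point (assumed ranges
# from the Telugu Unicode block; unassigned gaps are harmless here).
CONSONANTS = set(range(0x0C15, 0x0C3A))          # KA .. HA
DEPENDENT_VOWELS = set(range(0x0C3E, 0x0C4D))    # vowel signs AA .. AU
INDEPENDENT_VOWELS = set(range(0x0C05, 0x0C15))  # standalone A .. AU


def find_impossible(word):
    """Return (index, reason) pairs for sequences that break two
    simplified rules:
      1. a vowel sign must directly follow a consonant;
      2. a standalone vowel may only appear at the start of a word.
    """
    problems = []
    prev = None  # code point of the previous character, if any
    for i, ch in enumerate(word):
        cp = ord(ch)
        if cp in DEPENDENT_VOWELS and prev not in CONSONANTS:
            problems.append((i, "vowel sign without a preceding consonant"))
        elif cp in INDEPENDENT_VOWELS and i > 0:
            problems.append((i, "standalone vowel in the middle of a word"))
        prev = cp
    return problems


if __name__ == "__main__":
    # U+0C2E U+0C41: 'mu' written correctly (consonant + vowel sign).
    print(find_impossible("\u0C2E\u0C41"))
    # U+0C2E U+0C0E: 'ma' followed by a standalone vowel, as in my OCR output.
    print(find_impossible("\u0C2E\u0C0E"))
```

A check like this could run over the recognised text as a post-processing
step and flag (or reject) candidates such as 'ma' followed by a standalone
vowel, without needing any word list.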



"WHY NON DICTIONARY based solutions?"
1) In English, you have 'dry' words, which stand on their own.
If you are familiar with German, you will know that words can be joined, as
in 'Autobahn'. This is taken to the limit in Indian languages.
2) In agglutinative languages like Telugu, the phrase 'from in between them'
could be just the word వాఱిరువురిమధ్యలోనుండి।
Hence there is no limit to the number of words that a dictionary would have
to contain.
3) In English, a word such as 'fast' is pronounced differently in different
countries, but written the same.
In Telugu and similar languages, however, the spelling follows the
pronunciation: 'poyaadu' becomes 'poyindu' in another accent.
4) Sandhi rules alter a word's spelling depending on the next or previous
word.  http://en.wikipedia.org/wiki/Sandhi

The above reasons make building a comprehensive dictionary impossible.
Hence making provisions for these cases in the code seemed like the smarter,
though harder, approach.

Regards,
Rakesh


On 13 August 2010 22:23, Jimmy O'Regan <[email protected]> wrote:

> 2010/8/13 రాకేశ్వర రావు <[email protected]>:
> > actually I have had this problem of symbols being chopped up.
> > I do not blame the engine though. Some symbols in my language are twice or
> > thrice that of the average.
> > Eg:- మూ [muu] is almost thrice in length as రీ[rii]. Similar problem must be
> > there in English with m , i etc.
> > Unfortunately for me the most frequent letter in my Telugu language is a ము
> > [mu]. which is being chopped up.
> > in to a మ[ma] and a tail ఎ[é].
>
> Yes, that's almost exactly the same issue as 'm' becoming 'iii' or 'rn'
>
> > ARE THERE ANY variables or code segments that I can look into and tweak for
> > this?
> >
>
> DON'T SHOUT.
>
> Variables? No, not really. Code? Yes, but it's not documented, and not
> entirely clear how it would be used. I'm looking into it, but it's one
> of many things I'm looking into.
>
> > Secondly,
> > Symbol ము'mu' coming out as మ'ma' and the standalone vowel ఎ'é'  is erratic
> > as the symbol మ'ma' already has a vowel in it and hence a standalone can not
> > follow it.
> > IS THERE ANY way I can specify that some characters can not come after some
> > other characters?
>
> Well, that's what 'DangAmbigs' are for -- impossible sequences, and
> their corrections.
>
> There is newer code to use those ambiguities in a less 'dangerous'
> way, but it's not documented... see above.
>
> There is also a statistical post-editing tool for tesseract - googling
> that phrase should lead you to it.
>
> > Like specify, ma-é is an invalid combination? (In general I can not have
> > stand-alone-vowel-symbols like ఎ'é' coming in the middle of a word as each
> > symbol is a syllable in Telugu, like in Tibetan)।
>
> > I am looking for
> > non-dictionary based solutions.
>
> Why?
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>
