2010/8/14 రాకేశ్వర రావు <[email protected]>:
> Hi Jimmy,
>
> Thank you for your replies.
>
> Shouting - so that people find it easy to locate the actual questions.
>

Don't do it. There are other, well-established ways of adding emphasis.

> < Well, that's what 'DangAmbigs' are for -- impossible sequences, and their
> corrections.
>
> From my understanding of DangAmbigs, I did NOT infer that it could be used
> for impossible sequences.

Well, instead of 'impossible', let's say 'highly improbable'.

> As in English no sequence is impossible (bam and barn are both valid right).

Yes. That would make a bad addition to DangAmbigs. 'iii' is /possible/,
but highly improbable; it's so improbable that you can assume that any
occurrence should really have been 'm'. That happens to be the sort of
situation you were asking about, so that's the answer.

> But with Unicode characters, a lot of sequences are impossible. Eg:- A vowel
> symbol without a consonant preceding it.
> Invalid sequence of combining characters etc. (
> http://en.wikipedia.org/wiki/Combining_character )
>

Again, this is what DangAmbigs are for.

>
> "WHY NON DICTIONARY based solutions?"
> 1) That is because in English, you have 'dry' words.
> If you are familiar with German you will know that words can be joined. like
> 'autobahn'. This is taken to the limit in Indian languages.
> 2) In agglutinative languages like Telugu, the phrase 'from in between them'
> could be just the word వాఱిరువురిమధ్యలోనుండి।
> Hence there is no limit to the number of words that a dictionary should
> have.
> 3) In English, a word, say 'fast' is pronounced differently in different
> countries, but written the same.
> But in Telugu etc. 'poyaadu' in another accent becomes 'poyindu'.

I'm waiting for you to explain how that could possibly be relevant to
OCR. English also has words that have different spellings in different
regions - I write 'colour', the Americans write 'color'. Just add both
forms to the wordlist.

> 4) There are Sandhis that alter a word's spelling based on the next or
> previous words.
> http://en.wikipedia.org/wiki/Sandhi
>
> The above reasons make building a comprehensive dictionary impossible.

I might find your (condescending) argument more convincing if I knew
nothing about NLP, but I *do* know something about it, and I find your
arguments unconvincing.

I think the crux of the matter is this:

> Hence making provisions for them in the code seemed like a smarter, though
> harder, thing.

In terms of the rest of what you said, I read this to /really/ mean "I
think my language is extra special and needs specific treatment". It's
not, it doesn't.

I'm not trying to insult your language; I'm trying to drive home that
you seem to have exactly the wrong attitude for NLP. Treat it as a set
of individual features, not as a white elephant. Saying 'English doesn't
have...' is a red herring in a multilingual engine. Think of the
individual features in terms of languages that /do/ have them.

First, read http://en.wikipedia.org/wiki/Zipf%27s_law
Zipf's law is universal. Build a frequency dawg.

1) Compounds. You cited German... Tesseract supports German.

2) Agglutination. Hungarian and Basque are highly agglutinative, and
Tesseract has support for those. It /would/ be somewhat of an issue in
creating the regular dictionary, and I would normally point you to the
Tesseract paper, where they talk about a method for generating
inflected/agglutinated suffixes to increase the dictionary /at creation
time/, but the fact is, it's not necessary. IIIT Hyderabad have some
quite extensive resources for Telugu; you could use them to generate the
agglutinated list.

4) Sandhi is a rather inexact concept that's about as clear as saying
'internal changes' in English. I take it that you mean liaison, rather
than epenthesis, or ablaut, or...
('liaison' is the change that causes the prefix that is 'con' in
'confession' to become 'com' in 'compression')

This is only really relevant in terms of compounds; I'll concede that
better support could be achieved with some extra treatment of liaison,
epenthesis, ablaut, vowel harmony, etc. - for Swedish and Hungarian too -
but 1) the improvement will be /slight/ and 2) I, frankly, am not going
to waste my time discussing it with someone who is clearly thinking in
biased terms.

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

