Hi Jimmy,

Thank you for your valuable time and feedback. I am about to start learning NLP formally in a few months. I have to admit I do not completely understand your feedback yet; I shall come back to it after some time. I am very interested in how Finnish, German, Tibetan, etc. use the engine. Also, thank you for reminding me to contact other people working on Indic NLP.

Regards,
- Rakesh

On 14 August 2010 14:52, Jimmy O'Regan <[email protected]> wrote:
> 2010/8/14 రాకేశ్వర రావు <[email protected]>:
> > Hi Jimmy,
> >
> > Thank you for your replies.
> >
> > Shouting - so that people find it easy to locate the actual questions.
> >
>
> Don't do it. There are other, well established ways of adding emphasis.
>
> > < Well, that's what 'DangAmbigs' are for -- impossible sequences, and their
> > corrections.
> >
> > From my understanding of DangAmbigs, I did NOT infer that it could be used
> > for impossible sequences.
>
> Well, instead of 'impossible', let's say 'highly improbable'.
>
> > As in English no sequence is impossible (bam and barn are both valid right).
>
> Yes. That would make a bad addition to DangAmbigs.
>
> 'iii' is /possible/, but highly improbable; it's so improbable, that
> you can assume that any occurrence should really have been 'm'. That
> happens to be the sort of situation you were asking about, so that's
> the answer.
>
> > But with Unicode characters, a lot of sequences are impossible. Eg:- A vowel
> > symbol without a consonant preceding it.
> > Invalid sequence of combining characters etc. (
> > http://en.wikipedia.org/wiki/Combining_character )
> >
>
> Again, this is what DangAmbigs are for.
>
> >
> > "WHY NON DICTIONARY based solutions?"
> > 1) That is because in English, you have 'dry' words.
> > If you are familiar with German you will know that words can be joined, like
> > 'autobahn'. This is taken to the limit in Indian languages.
> > 2) In agglutinative languages like Telugu, the phrase 'from in between them'
> > could be just the word వాఱిరువురిమధ్యలోనుండి।
> > Hence there is no limit to the number of words that a dictionary should
> > have.
> > 3) In English, a word, say 'fast' is pronounced differently in different
> > countries, but written the same.
> > But in Telugu etc. 'poyaadu' in another accent becomes 'poyindu'.
>
> I'm waiting for you to explain how that could possibly be relevant to
> OCR. English also has words that have different spellings in different
> regions - I write 'colour', the Americans write 'color'. Just add both
> forms to the wordlist.
>
> > 4) There are Sandhis that alter a word's spelling based on the next or
> > previous words. http://en.wikipedia.org/wiki/Sandhi
> >
> > The above reasons make building a comprehensive dictionary impossible.
>
> I might find your (condescending) argument more convincing if I knew
> nothing about NLP, but I *do* know something about it, and I find your
> arguments unconvincing.
>
> I think the crux of the matter is this:
>
> > Hence making provisions for them in the code seemed like a smarter, though
> > harder, thing.
>
> In terms of the rest of what you said, I read this to /really/ mean "I
> think my language is extra special and needs specific treatment".
>
> It's not, it doesn't.
>
> I'm not trying to insult your language; I'm trying to drive home that
> you seem to have exactly the wrong attitude for NLP. Treat it as a set
> of individual features, not as a white elephant.
>
> Saying 'English doesn't have...' is a red herring in a multilingual
> engine. Think of the individual features in terms of languages that
> /do/ have them.
>
> First, read http://en.wikipedia.org/wiki/Zipf%27s_law
> Zipf's law is universal. Build a frequency dawg.
>
> 1) Compounds. You cited German... Tesseract supports German.
> 2) Agglutination. Hungarian and Basque are highly agglutinative, and
> Tesseract has support for those.
>
> It /would/ be somewhat of an issue in creating the regular dictionary,
> and I would normally point you to the Tesseract paper, where they talk
> about a method for generating inflected/agglutinated suffixes to
> increase the dictionary /at creation time/, but the fact is, it's not
> necessary.
> IIIT Hyderabad have some quite extensive resources for
> Telugu, you could use their resources to generate the agglutinated
> list.
>
> 4) Sandhi is a rather inexact concept that's about as clear as saying
> 'internal changes' in English. I take it that you mean liaison, rather
> than epenthesis, or ablaut, or...
>
> ('liaison' is the change that causes the prefix that is 'con' in
> 'confession' to become 'com' in 'compression')
>
> This is only really relevant in terms of compounds; I'll concede that
> better support could be achieved with some extra treatment of liaison,
> epenthesis, ablaut, vowel harmony, etc. - for Swedish and Hungarian
> too - but 1) the improvement will be /slight/ and 2) I, frankly, am
> not going to waste my time discussing it with someone who is clearly
> thinking in biased terms.
>
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
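
A note on the "Just add both forms to the wordlist" and "Build a frequency dawg" advice quoted above. What follows is only a minimal sketch, in Python, of the first step: ranking the words of a corpus by frequency so that the common forms (including regional spellings such as 'colour' and 'color') end up in the list. The function names tokenize, frequency_wordlist and coverage, and the sample sentence, are hypothetical and not part of Tesseract. Turning the ranked list into a DAWG would still be done with Tesseract's own tooling, which is not shown here.

```python
import unicodedata
from collections import Counter


def tokenize(text):
    """Split on whitespace and trim punctuation from both ends of each token.

    Whitespace splitting is used rather than a word-character regex, on the
    assumption that the corpus may contain scripts with combining vowel
    signs, which such a regex would split apart.
    """
    tokens = []
    for raw in text.split():
        chars = list(raw)
        # Unicode general categories starting with "P" are punctuation.
        while chars and unicodedata.category(chars[0]).startswith("P"):
            chars.pop(0)
        while chars and unicodedata.category(chars[-1]).startswith("P"):
            chars.pop()
        if chars:
            tokens.append("".join(chars))
    return tokens


def frequency_wordlist(text, min_count=1):
    """Return (word, count) pairs, most frequent first, ties in sorted order."""
    counts = Counter(tokenize(text))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, count) for word, count in ranked if count >= min_count]


def coverage(ranked, top_n):
    """Fraction of all counted tokens accounted for by the top_n words."""
    total = sum(count for _, count in ranked)
    covered = sum(count for _, count in ranked[:top_n])
    return covered / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample text; a real list would come from a large corpus.
    sample = "the colour of the car, and the color of the van."
    ranked = frequency_wordlist(sample)
    for word, count in ranked:
        print(count, word)
    print("top-2 coverage:", coverage(ranked, 2))
```

The coverage figure is there to illustrate the Zipf's law point: on a real corpus a fairly small number of top-ranked forms tends to account for a large share of the running text, which is why a frequency-ranked list can be useful even when the full vocabulary is open-ended.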

