Re: No treatment for touching letters?

రాకేశ్వర రావు Mon, 16 Aug 2010 08:23:38 -0700

Hi Jimmy,

Thank you for your valuable time and feedback. I am about to learn NLP
formally myself in a few months.
I have to admit I do not completely understand your feedback yet. I shall
come back to it after some time.
I am very interested in how Finnish, German, Tibetan etc. are using the
engine.
Also, thank you for reminding me to contact other people working on Indic
NLP.


Regards,
- Rakesh

On 14 August 2010 14:52, Jimmy O'Regan <[email protected]> wrote:

> 2010/8/14 రాకేశ్వర రావు <[email protected]>:
> > Hi Jimmy,
> >
> > Thank you for your replies.
> >
> > Shouting - so that people find it easy to locate the actual questions.
> >
>
> Don't do it. There are other, well established ways of adding emphasis.
>
> > < Well, that's what 'DangAmbigs' are for -- impossible sequences, and
> > their corrections. >
> > From my understanding of DangAmbigs, I did NOT infer that it could be
> > used for impossible sequences.
>
> Well, instead of 'impossible', let's say 'highly improbable'.
>
> > As in English no sequence is impossible ('bam' and 'barn' are both valid,
> > right?).
>
> Yes. That would make a bad addition to DangAmbigs.
>
> 'iii' is /possible/, but highly improbable; it's so improbable, that
> you can assume that any occurrence should really have been 'm'. That
> happens to be the sort of situation you were asking about, so that's
> the answer.
>
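The 'iii' versus 'm' case above can be sketched as a plain substitution table. This is only an illustration of the idea: the table contents and the names `AMBIGUITIES` and `apply_ambiguities` are made up for the example, and it does not reproduce Tesseract's DangAmbigs file format or how the engine applies it internally.

```python
# Hypothetical table: highly improbable sequence -> what was probably meant.
# 'rn' -> 'm' is deliberately left out, since 'barn' and 'bam' are both valid.
AMBIGUITIES = {
    "iii": "m",
}


def apply_ambiguities(text, table=AMBIGUITIES):
    """Replace each improbable sequence with its correction, longest first."""
    for wrong in sorted(table, key=len, reverse=True):
        text = text.replace(wrong, table[wrong])
    return text


print(apply_ambiguities("coiiimon"))
```

A word such as 'barn' passes through unchanged, which is the point: only sequences that are improbable enough to be safely rewritten belong in the table.
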
> > But with Unicode characters, a lot of sequences are impossible. E.g. a
> > vowel symbol without a consonant preceding it,
> > an invalid sequence of combining characters, etc. (
> > http://en.wikipedia.org/wiki/Combining_character )
> >
>
> Again, this is what DangAmbigs are for.
>
> >
> >
> > "WHY NON DICTIONARY based solutions?"
> > 1) That is because in English, you have 'dry' words.
> > If you are familiar with German you will know that words can be joined,
> > like 'Autobahn'. This is taken to the limit in Indian languages.
> > 2) In agglutinative languages like Telugu, the phrase 'from in between
> > them' could be just the word వాఱిరువురిమధ్యలోనుండి।
> > Hence there is no limit to the number of words that a dictionary should
> > have.
> > 3) In English, a word, say 'fast' is pronounced differently in different
> > countries, but written the same.
> > But in Telugu etc. 'poyaadu' in another accent becomes 'poyindu'.
>
> I'm waiting for you to explain how that could possibly be relevant to
> OCR. English also has words that have different spellings in different
> regions - I write 'colour', the Americans write 'color'. Just add both
> forms to the wordlist.
>
> > 4) There are Sandhis that alter a word's spelling based on the next or
> > previous words.  http://en.wikipedia.org/wiki/Sandhi
> >
> > The above reasons make building a comprehensive dictionary impossible.
>
> I might find your (condescending) argument more convincing if I knew
> nothing about NLP, but I *do* know something about it, and I find your
> arguments unconvincing.
>
> I think the crux of the matter is this:
>
> > Hence making provisions for them in the code seemed like a smarter,
> > though harder, thing.
>
> In terms of the rest of what you said, I read this to /really/ mean "I
> think my language is extra special and needs specific treatment".
>
> It's not, it doesn't.
>
> I'm not trying to insult your language; I'm trying to drive home that
> you seem to have exactly the wrong attitude for NLP. Treat it as a set
> of individual features, not as a white elephant.
>
> Saying 'English doesn't have...' is a red herring in a multilingual
> engine. Think of the individual features in terms of languages that
> /do/ have them.
>
> First, read http://en.wikipedia.org/wiki/Zipf%27s_law
> Zipf's law is universal. Build a frequency dawg.
>
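As a rough sketch of the first step towards a frequency dawg: count the words in a corpus and keep only the most frequent ones, which by Zipf's law cover a large share of running text. The function name `frequent_words` and the whitespace tokenisation are assumptions for the example; compiling the resulting list into a dawg is left to whatever tool ships with the Tesseract version in use.

```python
from collections import Counter


def frequent_words(corpus_text, top_n):
    """Return the top_n most frequent whitespace-separated words."""
    counts = Counter(corpus_text.split())
    return [word for word, _ in counts.most_common(top_n)]


sample = "a b a c a b"
print(frequent_words(sample, 2))
```

Splitting on whitespace is the simplest possible tokeniser; a real corpus would also need punctuation stripped before counting.
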
> 1) Compounds. You cited German... Tesseract supports German.
> 2) Agglutination. Hungarian and Basque are highly agglutinative, and
> Tesseract has support for those.
>
> It /would/ be somewhat of an issue in creating the regular dictionary,
> and I would normally point you to the Tesseract paper, where they talk
> about a method for generating inflected/agglutinated suffixes to
> increase the dictionary /at creation time/, but the fact is, it's not
> necessary. IIIT Hyderabad have some quite extensive resources for
> Telugu; you could use them to generate the agglutinated
> list.
>
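Generating an agglutinated word list at creation time could look roughly like the following. It is a naive sketch, not the method from the Tesseract paper: the name `expand`, the idea of ordered suffix "slots", and the tiny Finnish sample data are all assumptions for illustration, and it ignores sandhi, vowel harmony and any other change at the join.

```python
from itertools import product


def expand(stems, suffix_slots):
    """Yield each stem followed by one choice from every suffix slot, in order.

    An empty string in a slot means that slot may be left out.
    """
    for stem in stems:
        for combo in product(*suffix_slots):
            yield stem + "".join(combo)


# Illustrative data only: 'talo' (house), case slot, possessive slot.
forms = list(expand(["talo"], [["", "ssa"], ["", "ni"]]))
print(forms)
```

The list grows as the product of the slot sizes, so in practice it would be filtered against corpus frequencies rather than used whole.
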
> 4) Sandhi is a rather inexact concept that's about as clear as saying
> 'internal changes' in English. I take it that you mean liaison, rather
> than epenthesis, or ablaut, or...
>
> ('liaison' is the change that causes the prefix that is 'con' in
> 'confession' to become 'com' in 'compression')
>
> This is only really relevant in terms of compounds; I'll concede that
> better support could be achieved with some extra treatment of liaison,
> epenthesis, ablaut, vowel harmony, etc. - for Swedish and Hungarian
> too - but 1) the improvement will be /slight/ and 2) I, frankly, am
> not going to waste my time discussing it with someone who is clearly
> thinking in biased terms.
>
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected]<tesseract-ocr%[email protected]>
> .
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

