2010/8/14 రాకేశ్వర రావు <[email protected]>:
> Hi Jimmy,
>
> Thank you for your replies.
>
> Shouting - so that people find it easy to locate the actual questions.
>

Don't do it. There are other, well-established ways of adding emphasis.

> < Well, that's what 'DangAmbigs' are for -- impossible sequences, and their
> corrections. >
> From my understanding of DangAmbigs, I did NOT infer that it could be used
> for impossible sequences.

Well, instead of 'impossible', let's say 'highly improbable'.

> As in English no sequence is impossible (bam and barn are both valid right).

Yes. That would make a bad addition to DangAmbigs.

'iii' is /possible/, but highly improbable; so improbable that you
can assume that any occurrence should really have been 'm'. That
happens to be the sort of situation you were asking about, so that's
the answer.
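
To make the idea concrete, here is a minimal sketch in Python. It is an
illustration only: the table and the function name are made up, and this
is not Tesseract's actual DangAmbigs file format, just the substitution
idea behind it.

```python
# Hypothetical table: highly improbable sequence -> what it most likely was.
# Illustration only; this is NOT Tesseract's DangAmbigs file format.
IMPROBABLE = {
    "iii": "m",
}


def fix_improbable(text, table=IMPROBABLE):
    """Replace each highly improbable sequence with its likely correction."""
    # Try longer sequences first so they win over their own substrings.
    for seq in sorted(table, key=len, reverse=True):
        text = text.replace(seq, table[seq])
    return text


# 'rn' is deliberately absent from the table: 'bam' and 'barn' are both
# valid words, so rewriting one into the other would do more harm than good.
```

The point of the design is that only sequences that are (almost) never
right belong in the table; anything that is merely ambiguous stays out.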

> But with Unicode characters, a lot of sequences are impossible. Eg:- A vowel
> symbol without a consonant preceding it.
> Invalid sequence of combining characters etc. (
> http://en.wikipedia.org/wiki/Combining_character )
>

Again, this is what DangAmbigs are for.

>
>
> "WHY NON DICTIONARY based solutions?"
> 1) That is because in English, you have 'dry' words.
> If you are familiar with German you will know that words can be joined. like
> 'autobahn'. This is taken to the limit in Indian languages.
> 2) In agglutinative languages like Telugu, the phrase 'from in between them'
> could be just the word వాఱిరువురిమధ్యలోనుండి।
> Hence there is no limit to the number of words that a dictionary should
> have.
> 3) In English, a word, say 'fast' is pronounced differently in different
> countries, but written the same.
> But in Telugu etc. 'poyaadu' in another accent becomes 'poyindu'.

I'm waiting for you to explain how that could possibly be relevant to
OCR. English also has words that have different spellings in different
regions - I write 'colour', the Americans write 'color'. Just add both
forms to the wordlist.

> 4) There are Sandhis that alter a word's spelling based on the next or
> previous words.  http://en.wikipedia.org/wiki/Sandhi
>
> The above reasons make building a comprehensive dictionary impossible.

I might find your (condescending) argument more convincing if I knew
nothing about NLP, but I *do* know something about it, and I find your
arguments unconvincing.

I think the crux of the matter is this:

> Hence making provisions for them in the code seemed like a smarter, though
> harder, thing.

In terms of the rest of what you said, I read this to /really/ mean "I
think my language is extra special and needs specific treatment".

It's not, it doesn't.

I'm not trying to insult your language; I'm trying to drive home that
you seem to have exactly the wrong attitude for NLP. Treat it as a set
of individual features, not as a white elephant.

Saying 'English doesn't have...' is a red herring in a multilingual
engine. Think of the individual features in terms of languages that
/do/ have them.

First, read http://en.wikipedia.org/wiki/Zipf%27s_law
Zipf's law is universal. Build a frequency dawg.
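
A rough sketch of the first step, assuming you have some plain-text
corpus to hand. The function name and the whitespace tokenisation are my
own simplifications; the resulting list is what you would then hand to
whatever tool builds the dawg.

```python
from collections import Counter


def frequency_wordlist(corpus_text, top_n=None):
    """Return the distinct words of corpus_text, most frequent first.

    Splitting on whitespace is a simplifying assumption; a real corpus
    would need punctuation handling suited to the script.
    """
    counts = Counter(corpus_text.split())
    # most_common(None) returns every word; ties keep first-seen order.
    return [word for word, _ in counts.most_common(top_n)]
```

Because of Zipf's law, a fairly short head of this list covers most of
the running text, which is why a frequency list is worth having even
when the full vocabulary is open-ended.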

1) Compounds. You cited German... Tesseract supports German.
2) Agglutination. Hungarian and Basque are highly agglutinative, and
Tesseract has support for those.

It /would/ be somewhat of an issue in creating the regular dictionary,
and I would normally point you to the Tesseract paper, where they talk
about a method for generating inflected/agglutinated suffixes to
increase the dictionary /at creation time/, but the fact is, it's not
necessary. IIIT Hyderabad have some quite extensive resources for
Telugu; you could use those to generate the agglutinated
list.
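
For what it's worth, generating such a list at creation time can be
sketched in a few lines. This is only an assumption about how one might
do it, not the method from the paper: the stems and suffix slots below
are placeholders rather than real morphemes, and real data would also
need the spelling changes at the joins.

```python
from itertools import product


def expand(stems, suffix_slots):
    """Return every stem followed by one choice from each suffix slot.

    suffix_slots is a list of lists; include "" in a slot to make that
    slot optional. Plain concatenation is a simplification: it ignores
    any spelling change where stem and suffix meet.
    """
    forms = set()
    for stem in stems:
        for combo in product(*suffix_slots):
            forms.add(stem + "".join(combo))
    return sorted(forms)


# Placeholder data, purely for illustration.
stems = ["ab"]
suffix_slots = [["", "x"], ["", "y"]]
wordlist = expand(stems, suffix_slots)
```

The list grows multiplicatively with each slot, so in practice you would
restrict it to the combinations that actually occur in a corpus.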

4) Sandhi is a rather inexact concept that's about as clear as saying
'internal changes' in English. I take it that you mean liaison, rather
than epenthesis, or ablaut, or...

('liaison' is the change that causes the prefix that is 'con' in
'confession' to become 'com' in 'compression')

This is only really relevant in terms of compounds; I'll concede that
better support could be achieved with some extra treatment of liaison,
epenthesis, ablaut, vowel harmony, etc. - for Swedish and Hungarian
too - but 1) the improvement will be /slight/ and 2) I, frankly, am
not going to waste my time discussing it with someone who is clearly
thinking in biased terms.


-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
