Re: [OT] RE: FW: extracting words

2001-02-11 Thread Jungshik Shin

On Sun, 11 Feb 2001, Thomas Chan wrote:

> On Sun, 11 Feb 2001, Mike Lischke wrote:
>
> > > If you are willing to give up precision, then you can use heuristics.
> > >
> > > It's ugly but perhaps ok for a simple editor. You can improve the
> > > precision with better heuristics and more data, so you get to decide
> > > how much is good enough...
> >
> > So using white spaces for general word breaking and ideographs for CJK
> > would be an acceptable approach? What I wonder about is how to handle
>
> The handling of Japanese and Korean text is different from that of Chinese
> (lumping them together as "CJK" is inappropriate in this context), but I

I'm glad to see this. Lumping them together as "CJK" is inappropriate not
only in this context but in other cases as well. Chinese, Japanese and
Korean text processing certainly have a lot in common. However, there
are a lot of differences as well. In the case of Korean, the writing
system, Hangul, is not just syllabic (as Japanese Kana is) but also
alphabetic (which means that in some cases it needs to be handled the way
Thai and Indic scripts are). This point should not be overlooked if
half-baked Korean support is to be avoided.
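
To illustrate the alphabetic side of Hangul: each precomposed syllable in
the Unicode Hangul Syllables block (U+AC00..U+D7A3) can be split
arithmetically into a leading consonant, a vowel and an optional trailing
consonant (jamo). The following is only a minimal Python sketch; the
function name decompose_hangul is made up here for illustration and does
not come from any message in this thread.

```python
# Constants for arithmetic Hangul syllable decomposition (Unicode layout).
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant jamo
V_BASE = 0x1161   # first vowel jamo
T_BASE = 0x11A7   # one before the first trailing consonant jamo
V_COUNT = 21
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
N_COUNT = V_COUNT * T_COUNT   # syllables per leading consonant
S_COUNT = 19 * N_COUNT        # total number of precomposed syllables


def decompose_hangul(ch):
    """Split one precomposed Hangul syllable into conjoining jamo.

    Any character outside the Hangul Syllables block is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trailing = chr(T_BASE + t_index) if t_index else ""
    return leading + vowel + trailing
```

A syllable such as U+D55C thus comes apart into three letters, which is
why code that treats every Hangul syllable as an indivisible unit can fall
short for tasks like searching or sorting.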

The other day, somebody wrote to this list that most morphemes in CJK
might be monosyllabic. That's true of Chinese (as far as I can tell),
but could not be further from the truth for Japanese and Korean (although
it does hold for Chinese loanwords in Korean). Chinese is an isolating
language. Japanese and Korean, on the other hand, are agglutinative
languages. (Geographic closeness doesn't necessarily lead to linguistic
closeness. The distance between Chinese on the one hand and Japanese and
Korean on the other is much, much greater than that between English and
Sanskrit, both of which belong to the Indo-European language family.)
IMHO, this difference makes it harder to extract word roots (for search
engines, DBs, etc.) out of Japanese and Korean text (and text in highly
inflected languages) than out of Chinese text.


Jungshik Shin




[OT] RE: FW: extracting words

2001-02-11 Thread Thomas Chan

On Sun, 11 Feb 2001, Mike Lischke wrote:

> > If you are willing to give up precision, then you can use heuristics.
> >
> > It's ugly but perhaps ok for a simple editor. You can improve the
> > precision with better heuristics and more data, so you get to decide
> > how much is good enough...
> 
> So using white spaces for general word breaking and ideographs for CJK
> would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese.  Chinese text does not use white 
space anywhere.[1]  What was described was that it is tolerable (but not
perfect--e.g., punctuation is not handled properly) to break *lines* in
Chinese text between Chinese characters.  To break *words* properly in
Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted
after each Chinese character, but that is a hack to make word-wrapping
behave properly on Chinese-unaware software (which would otherwise treat
an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word",
but this is technically wrong from a linguistic standpoint, despite
cultural claims to the contrary, and will have implications.
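
As a rough idea of what dictionary-based word breaking can look like, here
is a minimal Python sketch of greedy forward maximum matching. This is one
simple, well-known technique picked for illustration, not necessarily what
any real product does. The function name max_match, the max_len parameter
and the three-entry toy dictionary are all assumptions made up for this
example.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry (up to max_len
    characters) that starts there; if none matches, fall back to emitting
    a single character.
    """
    words = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a usable one would need many thousands of
# entries.
toy_dictionary = {"中文", "文本", "处理"}
segments = max_match("中文文本处理", toy_dictionary)
```

Greedy matching still picks the wrong break when entries overlap
ambiguously, so even with a dictionary the result is a heuristic.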


The handling of Japanese and Korean text is different from that of Chinese
(lumping them together as "CJK" is inappropriate in this context), but I
will leave them for others to provide a better treatment.  (Jungshik Shin
has already explained the Korean case.)


Thomas Chan
[EMAIL PROTECTED]




Re: FW: extracting words

2001-02-11 Thread David Starner

On Sun, Feb 11, 2001 at 11:14:36AM -0800, Mike Lischke wrote:
> Can I use this simple approach for, say, Cherokee and Arabic scripts too?

Yes, at least for Cherokee.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg



RE: FW: extracting words

2001-02-11 Thread Mike Lischke

> If you are willing to give up precision, then you can use heuristics.
>
> It's ugly but perhaps ok for a simple editor. You can improve the
> precision with better heuristics and more data, so you get to decide
> how much is good enough...

So using white spaces for general word breaking and ideographs for CJK
would be an acceptable approach? What I wonder about is how to handle all
those languages I don't speak/understand (in fact almost all :-)). Can I
use this simple approach for, say, Cherokee and Arabic scripts too? I don't
even know which has white spaces and which has not.

Ciao, Mike




Re: FW: extracting words

2001-02-11 Thread Tex Texin

If you are willing to give up precision, then you can use heuristics.

The grossest heuristics are not really word breaking at all, but
give users who do not know the language a compatible way of working
with the text. For example, some vendors whose western European
software did word breaking on spaces simply extended it to break
after each ideograph when moving it to CJK markets. Although this is
in no way "word" breaking, it gives users predictable behavior for
"control-right-arrow" functions that execute "next word".

Although it gives some kind of upward and "global" compatibility,
it does mean that next character and next word do pretty much the
same thing for ideographs.
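
The gross heuristic described above could be sketched in Python roughly as
follows. The names is_ideograph and heuristic_words are made up for this
illustration, and checking only the CJK Unified Ideographs block
(U+4E00..U+9FFF) is a simplifying assumption; extension blocks, kana,
Hangul and punctuation are deliberately ignored.

```python
def is_ideograph(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def heuristic_words(text):
    """Split on white space, and treat each ideograph as its own 'word'."""
    words = []
    current = ""
    for ch in text:
        if ch.isspace():
            # White space ends the current run and is itself dropped.
            if current:
                words.append(current)
                current = ""
        elif is_ideograph(ch):
            # An ideograph ends any pending run and stands alone.
            if current:
                words.append(current)
                current = ""
            words.append(ch)
        else:
            current += ch
    if current:
        words.append(current)
    return words
```

With this, "next word" on a run of ideographs advances exactly one
character at a time, which is the trade-off mentioned above.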

It's ugly but perhaps ok for a simple editor. You can improve the
precision with better heuristics and more data, so you get to decide
how much is good enough...

tex

Mike Lischke wrote:
> 
> >
> > Yes, we have had it for a long time; no, nobody has solved it
> > entirely; and yes, this approach is wrong. Breaking a string into
> > words may require a thorough understanding of the vocabulary and
> > grammar of the language, and even that may not be enough.
> 
> But how can we then ever have a reliable word-break algorithm? It cannot
> be that, say, for a simple editor (be it written in Java or whatever) you
> have to supply a database with language specific details just to do
> automatic word wrap.
> 
> Ciao, Mike

-- 
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin  Director, International Business
mailto:[EMAIL PROTECTED]  +1-781-280-4271 Fax:+1-781-280-4655
Progress Software Corp.  14 Oak Park, Bedford, MA 01730

http://www.Progress.com  #1 Embedded Database

Globalization Program   
http://www.Progress.com/partners/globalization.htm
---