Re: [OT] RE: FW: extracting words
On Sun, 11 Feb 2001, Thomas Chan wrote:

> On Sun, 11 Feb 2001, Mike Lischke wrote:
> > > If you are willing to give up precision, then you can use heuristics.
> > >
> > > It's ugly but perhaps ok for a simple editor. You can improve the
> > > precision with better heuristics and more data, so you get to decide
> > > how much is good enough...
> >
> > So using white spaces for general word breaking and ideographs for CJK
> > would be an acceptable approach? What I wonder about is how to handle
>
> The handling of Japanese and Korean text is different from that of Chinese
> (lumping them together as "CJK" is inappropriate in this context), but I

I'm glad to see this. Lumping them together as "CJK" is inappropriate not only in this context but in other cases as well. Chinese, Japanese and Korean text processing certainly have a lot in common, but there are many differences too. In the case of Korean, the writing system, Hangul, is not just syllabic (as Japanese Kana is) but also alphabetic, which means that in some cases it needs to be handled the way Thai and Indic scripts are. This point should not be overlooked, or the result is half-baked Korean support.

The other day, somebody wrote to this list that most morphemes in CJK might be monosyllabic. That's true of Chinese (as far as I can tell), but could not be further from the truth for Japanese and Korean (although it does hold for Chinese loanwords in Korean). Chinese is an isolating language, whereas Japanese and Korean are agglutinating languages. Geographic closeness doesn't necessarily lead to linguistic closeness: the distance between Chinese on the one hand and Japanese and Korean on the other is much, much greater than that between English and Sanskrit, both of which belong to the Indo-European language family.

IMHO, this difference makes it harder to extract word roots (for search engines, DBs, etc.) out of Japanese and Korean text (and text in highly inflected languages) than out of Chinese text.

Jungshik Shin
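
To illustrate the point that Hangul is alphabetic as well as syllabic, here is a minimal sketch that splits a precomposed Hangul syllable into its letters (conjoining jamo) using the arithmetic decomposition defined in the Unicode Standard. The function name hangul_to_jamo is made up for this example; the constants are the standard ones for the Hangul Syllables block.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT           # 11172 precomposed syllables in total


def hangul_to_jamo(ch: str) -> list[str]:
    """Return the jamo that make up one precomposed Hangul syllable.

    Any character outside the Hangul Syllables block is returned unchanged.
    """
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        return [ch]
    leading = chr(L_BASE + index // N_COUNT)
    vowel = chr(V_BASE + (index % N_COUNT) // T_COUNT)
    jamo = [leading, vowel]
    trailing_index = index % T_COUNT
    if trailing_index:           # 0 means the syllable has no final consonant
        jamo.append(chr(T_BASE + trailing_index))
    return jamo


if __name__ == "__main__":
    for syllable in "한글":
        letters = hangul_to_jamo(syllable)
        print(syllable, [f"U+{ord(j):04X}" for j in letters])
        # Canonical decomposition (NFD) yields the same sequence of jamo.
        assert "".join(letters) == unicodedata.normalize("NFD", syllable)
```

Code that treats each Hangul syllable as an indivisible unit never sees these letters, which is one reason Korean cannot simply be handled like ideographic text.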
[OT] RE: FW: extracting words
On Sun, 11 Feb 2001, Mike Lischke wrote:

> > If you are willing to give up precision, then you can use heuristics.
> >
> > It's ugly but perhaps ok for a simple editor. You can improve the
> > precision with better heuristics and more data, so you get to decide
> > how much is good enough...
>
> So using white spaces for general word breaking and ideographs for CJK
> would be an acceptable approach? What I wonder about is how to handle

No, that is not acceptable for Chinese. Chinese text does not use white space anywhere.[1] What was described was that it is tolerable (but not perfect; e.g., punctuation is not handled properly) to break *lines* in Chinese text between Chinese characters. To break *words* properly in Chinese text, you really need a dictionary.[2]

[1] There is some Chinese text with spaces, where a space is inserted after each Chinese character, but that is a hack to make word-wrapping behave properly in Chinese-unaware software (which would otherwise treat an entire paragraph of Chinese text as a single "word").

[2] You might get away with treating each Chinese character as a "word", but this is technically wrong from a linguistic standpoint, despite cultural claims to the contrary, and will have implications.

The handling of Japanese and Korean text is different from that of Chinese (lumping them together as "CJK" is inappropriate in this context), but I will leave them for others to provide a better treatment. (Jungshik Shin has already explained the Korean case.)

Thomas Chan
[EMAIL PROTECTED]
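
As a rough illustration of what "you really need a dictionary" can mean in practice, here is a minimal sketch of greedy forward maximum matching, one simple dictionary-based technique. It is not something described in this thread; the function name, the max_len parameter and the four-entry toy dictionary are all assumptions made for the example, and a real segmenter needs a far larger lexicon and a way to resolve ambiguous matches.

```python
def segment_max_match(text: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching over a word list.

    At each position, take the longest dictionary entry that starts there.
    A character with no dictionary match is emitted on its own.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"中文", "文本", "需要", "词典"}

if __name__ == "__main__":
    print(segment_max_match("中文需要词典", TOY_DICTIONARY))
```

Greedy matching picks the wrong split whenever a longer dictionary entry overlaps the intended word boundary, which is part of why dictionary lookup alone does not fully solve the problem.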
Re: FW: extracting words
On Sun, Feb 11, 2001 at 11:14:36AM -0800, Mike Lischke wrote:
> Can I use this simple approach for, say, Cherokee and Arabic scripts too?

Yes, at least for Cherokee.

--
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and
laughs at me. In fact, I'd be rather honored." - Joseph_Greg
RE: FW: extracting words
> If you are willing to give up precision, then you can use heuristics.
>
> It's ugly but perhaps ok for a simple editor. You can improve the
> precision with better heuristics and more data, so you get to decide
> how much is good enough...

So using white space for general word breaking and ideographs for CJK would be an acceptable approach? What I wonder about is how to handle all those languages I don't speak/understand (in fact almost all :-)). Can I use this simple approach for, say, Cherokee and Arabic scripts too? I don't even know which scripts use white space and which don't.

Ciao, Mike
Re: FW: extracting words
If you are willing to give up precision, then you can use heuristics.

The grossest heuristics are not really word breaking at all, but they give users who do not know the language a compatible way of working with the text.

For example, some software that did word breaking at spaces for western European languages has been extended, when moved to CJK markets, to simply break after each ideograph. Although this is in no way "word" breaking, it gives users predictable behavior for "control-right-arrow" functions that execute "next word". Although it gives some kind of upward and "global" compatibility, it does mean that next character and next word do pretty much the same thing for ideographs.

It's ugly but perhaps ok for a simple editor. You can improve the precision with better heuristics and more data, so you get to decide how much is good enough...

tex

Mike Lischke wrote:
> > Yes, we have had it for a long time; no, nobody has solved it
> > entirely; and yes, this approach is wrong. Breaking a string into
> > words may require a thorough understanding of the vocabulary and
> > grammar of the language, and even that may not be enough.
>
> But how can we then ever have a reliable word-break algorithm? It cannot
> be that, say, for a simple editor (be it written in Java or whatever) you
> have to supply a database with language specific details just to do
> automatic word wrap.
>
> Ciao, Mike

--
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                    Director, International Business
mailto:[EMAIL PROTECTED]
+1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.
14 Oak Park, Bedford, MA 01730
http://www.Progress.com        #1 Embedded Database
Globalization Program
http://www.Progress.com/partners/globalization.htm
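
For concreteness, the "break at spaces, and after each ideograph" heuristic described above could be sketched roughly as follows. This is only an illustration, not code from any product mentioned in the thread: the function names are made up, and the code point ranges cover only a few CJK ideograph blocks chosen for the example.

```python
# A partial list of ideograph ranges, chosen for this example (an assumption,
# not a complete inventory of CJK ideographs).
IDEOGRAPH_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_ideograph(ch: str) -> bool:
    """Return True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IDEOGRAPH_RANGES)


def heuristic_units(text: str) -> list[str]:
    """Split text into "next word" stops: whitespace-delimited runs,
    with every ideograph treated as a unit of its own."""
    units = []
    run = []
    for ch in text:
        if is_ideograph(ch) or ch.isspace():
            if run:                      # close the pending non-ideograph run
                units.append("".join(run))
                run = []
            if not ch.isspace():         # an ideograph stands alone
                units.append(ch)
        else:
            run.append(ch)
    if run:
        units.append("".join(run))
    return units


if __name__ == "__main__":
    print(heuristic_units("a simple editor 中文 text"))
```

Punctuation stays attached to the neighboring run here, and Kana, Hangul and scripts such as Thai are not handled at all, which matches the "ugly but perhaps ok" caveat.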