To support word segmentation in Chinese, a dictionary is indispensable. I understand your doubts, but I'm not going to deal with these details and reinvent the wheel. As far as I know, there are some open source Chinese word segmentation libraries available on the web. It would be much better if one of them could be integrated into Vim. Although I've been using Vim for a while and I'm a programmer, I'm still new to the Vim source. I don't know whether the concept of a "word" in Vim can be extended to meet this challenge. If it can't, the effort would be a waste of time.
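
To make the dictionary point concrete, here is a minimal sketch of greedy forward maximum matching, one of the simplest dictionary-based segmentation schemes. It is only an illustration: the function name `segment` and the toy dictionaries are made up for this example, it is not taken from any existing segmentation library, and it says nothing about how Vim's own "word" logic works.

```python
def segment(text, dictionary, max_len=None):
    """Split text into words by greedy forward maximum matching.

    `dictionary` is a set of known words (hypothetical toy data here).
    A character that starts no known word is emitted on its own.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionaries, assumed purely for illustration.
TOY_DICT = {"中文", "分词", "需要", "词典"}
AMBIGUOUS_DICT = {"研究", "研究生", "生命"}

print(segment("中文分词需要词典", TOY_DICT))
print(segment("研究生命", AMBIGUOUS_DICT))
```

The second call shows why a dictionary alone is not enough: the greedy match takes "研究生" and leaves "命" stranded, although "研究" + "生命" is the intended reading. Real segmentation libraries add statistics or extra rules on top of the dictionary to resolve this kind of ambiguity.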
-- Xie

On Jan 21, 12:57 pm, Tony Mechelynck <[email protected]> wrote:
> On 21/01/09 04:11, Xie wrote:
>
> > Thank you for your reply, Tony. I don't know if my English is enough
> > to make myself clear but I'll try.
>
> > In English, semantically, a "word" sequence of characters (a-zA-Z) and
> > is the smallest meaningful unit. Word segmentation is not needed in
> > English because the "word" is naturally separated by whitespaces. The
> > situation is different in CJK languages. It takes several CJK
> > characters to form a "word" but this "word" exists in a serial of
> > characters and is not easily distinguishable for computer. That's why
> > Word Segmentation algorithm is needed to recognize a "word".
>
> > As far as I know, Vim simply takes a sequence of whatever characters
> > (not ,./?><...) as a "word", which is correct semantically for
> > English, but not for CJK languages. What I want to know is that if Vim
> > has ever thought about adding support to this.
>
> > Thanks
> > Xie
>
> I don't think it would be feasible, especially since OT1H there exist
> compound words which can exist either as distinct words or as part of
> larger compounds, and OTOH there exist characters which cannot appear as
> separate words in contemporary Chinese but can do so in poetic or
> archaic language (and you wouldn't prevent Vim from being usable with,
> let's say, commentaries of ancient writers, would you?). So IIUC Vim
> would need an extensive dictionary of compounds, and the logic to go
> with it, in order to "intelligently" break CJK words (and I'm not sure
> it could do so when spelling is not being checked). So I suppose
> treating all ideograms (but not ideographic punctuation) as "word"
> characters may be less than perfect but at least it's doable (and
> someone who doesn't speak CJK languages can program it and test it).
>
> What might be possible (but I'm not sure it is) would be to define
> spelling dictionaries for mainland Chinese, Taiwanese, Hong Kong
> Chinese, Japanese, South Korean and North Korean, containing only the
> "acceptable" isolated words and "indivisible" compounds. This might give
> a basis for what you're asking for; but how would you treat a CJK
> character which is not used alone in some language, and appears (maybe
> as a result of some typo, or maybe in a quotation from some other CJK
> language) in a context where it doesn't make an "acceptable" compound
> with the hanzi-kanji-hanja/kana/hangeul-chosŏngŭl surrounding it? In
> alphabetic languages you could scan both ways to the nearest space, tab,
> linebreak or punctuation mark; but I'm not sure how to do it with CJK text.
>
> Best regards,
> Tony.
> --
> When a fly lands on the ceiling, does it do a half roll or a half
> loop?

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
