On 21/01/09 04:11, Xie wrote:
> Thank you for your reply, Tony. I don't know if my English is good
> enough to make myself clear, but I'll try.
>
> In English, semantically, a "word" is a sequence of characters
> (a-zA-Z) and is the smallest meaningful unit. Word segmentation is not
> needed in English because words are naturally separated by whitespace.
> The situation is different in CJK languages. It takes several CJK
> characters to form a "word", but this "word" sits inside an unbroken
> series of characters and is not easily distinguishable for a computer.
> That's why a word segmentation algorithm is needed to recognize a
> "word".
>
> As far as I know, Vim simply takes a sequence of whatever characters
> (not ,./?><...) as a "word", which is correct semantically for
> English, but not for CJK languages. What I want to know is whether Vim
> has ever thought about adding support for this.
>
> Thanks
> Xie

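
To make the contrast in the quoted message concrete, here is a minimal Python sketch. It is not Vim code: the regular expression only approximates "a word is a run of word characters", and the names `naive_words`, `longest_match_words` and `TOY_DICTIONARY` are made up for illustration. The second function shows one common family of segmentation algorithms, greedy longest match against a dictionary, using a three-entry toy dictionary as a stand-in for a real one.

```python
import re

# Rough approximation of "a word is a run of word characters".
# In Python 3, \w is Unicode-aware, so CJK ideographs count as
# word characters just like a-z do.
WORD_RUN = re.compile(r"\w+")


def naive_words(text):
    """Split text into maximal runs of word characters."""
    return WORD_RUN.findall(text)


def longest_match_words(text, dictionary, max_len=4):
    """Greedy forward longest-match segmentation against a dictionary.

    At each position, try the longest candidate first. A character that
    starts no known word falls back to a one-character "word".
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one would need many thousands
# of entries.
TOY_DICTIONARY = {"我", "喜欢", "编辑器"}

print(naive_words("I like editors"))   # ['I', 'like', 'editors']
print(naive_words("我喜欢编辑器"))       # ['我喜欢编辑器']
print(longest_match_words("我喜欢编辑器", TOY_DICTIONARY))
# ['我', '喜欢', '编辑器']
```

The naive splitter returns the whole Chinese sentence as a single "word", while the dictionary-based one is only as good as its dictionary: anything it does not know is cut into single characters.
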
I don't think it would be feasible, especially since on the one hand
there exist compound words whose parts can stand either as distinct
words or as part of larger compounds, and on the other hand there exist
characters which cannot appear as separate words in contemporary
Chinese but can do so in poetic or archaic language (and you wouldn't
prevent Vim from being usable with, let's say, commentaries on ancient
writers, would you?). So, if I understand correctly, Vim would need an
extensive dictionary of compounds, and the logic to go with it, in
order to "intelligently" break CJK words (and I'm not sure it could do
so when spelling is not being checked).

So I suppose treating all ideograms (but not ideographic punctuation)
as "word" characters may be less than perfect, but at least it's doable
(and someone who doesn't speak CJK languages can program it and test
it).

What might be possible (but I'm not sure it is) would be to define
spelling dictionaries for mainland Chinese, Taiwanese, Hong Kong
Chinese, Japanese, South Korean and North Korean, containing only the
"acceptable" isolated words and "indivisible" compounds. This might
give a basis for what you're asking for; but how would you treat a CJK
character which is not used alone in some language, and appears (maybe
as a result of some typo, or maybe in a quotation from some other CJK
language) in a context where it doesn't make an "acceptable" compound
with the hanzi-kanji-hanja/kana/hangeul-chosŏngŭl surrounding it? In
alphabetic languages you could scan both ways to the nearest space,
tab, linebreak or punctuation mark; but I'm not sure how to do it with
CJK text.

Best regards,
Tony.
-- 
When a fly lands on the ceiling, does it do a half roll or a half loop?

You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
