Re: word segmentation in Vim

Tony Mechelynck Tue, 20 Jan 2009 20:57:46 -0800

On 21/01/09 04:11, Xie wrote:
> Thank you for your reply, Tony. I don't know if my English is enough
> to make myself clear but I'll try.
>
> In English, semantically, a "word" is a sequence of characters (a-zA-Z)
> and is the smallest meaningful unit. Word segmentation is not needed in
> English because "words" are naturally separated by whitespace. The
> situation is different in CJK languages. It takes several CJK
> characters to form a "word", but this "word" sits inside an unbroken
> series of characters and is not easily distinguishable by a computer.
> That's why a word segmentation algorithm is needed to recognize a "word".
>
> As far as I know, Vim simply takes a sequence of whatever characters
> (not ,./?><...) as a "word", which is correct semantically for
> English, but not for CJK languages. What I want to know is whether the
> Vim developers have ever thought about adding support for this.
>
> Thanks
> Xie


I don't think it would be feasible, especially since on the one hand
there exist compound words which can exist either as distinct words or
as part of larger compounds, and on the other hand there exist
characters which cannot appear as separate words in contemporary Chinese
but can do so in poetic or archaic language (and you wouldn't prevent
Vim from being usable with, let's say, commentaries of ancient writers,
would you?). So IIUC Vim
would need an extensive dictionary of compounds, and the logic to go 
with it, in order to "intelligently" break CJK words (and I'm not sure 
it could do so when spelling is not being checked). So I suppose 
treating all ideograms (but not ideographic punctuation) as "word" 
characters may be less than perfect but at least it's doable (and 
someone who doesn't speak CJK languages can program it and test it).
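
To make the compound problem concrete, here is a minimal Python sketch of
greedy longest-match segmentation against a dictionary. It is only an
illustration under stated assumptions: the dictionary TOY_DICT, the
function forward_max_match and the max_len limit are all hypothetical,
and nothing like this exists in Vim.

```python
# Hypothetical toy dictionary; a usable one would need a great many entries.
TOY_DICT = {"中国", "中国人", "人民"}


def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation, scanning left to right.

    A character that starts no dictionary entry becomes a one-character word.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(forward_max_match("中国人民", TOY_DICT))
```

With this toy dictionary the greedy scan splits the text as 中国人 + 民,
although 中国 + 人民 is the reading a human would pick: the same compound
can be a word by itself or part of a larger one, and a plain dictionary
lookup cannot tell which is meant.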

What might be possible (but I'm not sure it is) would be to define 
spelling dictionaries for mainland Chinese, Taiwanese, Hong Kong 
Chinese, Japanese, South Korean and North Korean, containing only the 
"acceptable" isolated words and "indivisible" compounds. This might give 
a basis for what you're asking for; but how would you treat a CJK 
character which is not used alone in some language, and appears (maybe 
as a result of some typo, or maybe in a quotation from some other CJK 
language) in a context where it doesn't make an "acceptable" compound 
with the hanzi-kanji-hanja/kana/hangeul-chosŏngŭl surrounding it? In 
alphabetic languages you could scan both ways to the nearest space, tab, 
linebreak or punctuation mark; but I'm not sure how to do it with CJK text.
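
As a rough illustration of the "scan both ways" approach, the Python
sketch below finds the word around a given position by walking left and
right to the nearest whitespace or punctuation. The names is_separator
and word_at are hypothetical, and the separator test (Unicode whitespace
or a punctuation category) is an assumption made for the example, not a
description of Vim's own word rules.

```python
import unicodedata


def is_separator(ch):
    # Whitespace, or any Unicode punctuation category (P*); the latter
    # covers ideographic punctuation such as the full stop U+3002.
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def word_at(text, pos):
    """Return the run of non-separator characters around index pos."""
    if is_separator(text[pos]):
        return ""
    start = pos
    while start > 0 and not is_separator(text[start - 1]):
        start -= 1
    end = pos
    while end < len(text) - 1 and not is_separator(text[end + 1]):
        end += 1
    return text[start:end + 1]


print(word_at("the quick, brown fox", 5))
print(word_at("我是中国人。你好", 3))
```

For the English sample the scan stops at the space and the comma and
yields "quick". For the Chinese sample it stops only at the ideographic
full stop, so it yields the whole clause 我是中国人 rather than a single
word, which is exactly the difficulty described above.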


Best regards,
Tony.
-- 
When a fly lands on the ceiling, does it do a half roll or a half
loop?

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
