To support word segmentation in Chinese, a dictionary is
indispensable. I understand your doubts, but I'm not going to deal with
these details and reinvent the wheel. As far as I know, there are some
open-source Chinese word segmentation libraries available on the web.
It would be much better if they could be integrated into Vim. Although
I've been using Vim for a while and I'm a programmer, I'm still new to
the Vim source. I don't know whether Vim's concept of a "word" can be
extended to meet this challenge. If not, it would be a waste of time.
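
For what it's worth, the core of many of these libraries is a fairly
simple dictionary-driven algorithm. Here is a rough Python sketch of
forward maximum matching, just to illustrate the idea; the function name
and the toy lexicon are my own inventions, not from any real library:

```python
def segment(text, dictionary, max_word_len=4):
    """Greedily match the longest dictionary word starting at each
    position; fall back to a single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one char.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary for illustration only.
lexicon = {"中国", "人民", "中国人"}
print(segment("中国人民", lexicon))  # greedy matching picks "中国人" first
```

Note that the greedy strategy already shows the ambiguity Tony mentions:
with "中国人" in the lexicon it splits 中国人民 as 中国人 / 民 rather
than 中国 / 人民, which is why serious libraries add statistical models
on top of the dictionary.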

--
Xie

On Jan 21, 12:57 pm, Tony Mechelynck <[email protected]>
wrote:
> On 21/01/09 04:11, Xie wrote:
>
>
>
> > Thank you for your reply, Tony. I don't know if my English is enough
> > to make myself clear but I'll try.
>
> > In English, semantically, a "word" is a sequence of characters (a-zA-Z)
> > and is the smallest meaningful unit. Word segmentation is not needed in
> > English because words are naturally separated by whitespace. The
> > situation is different in CJK languages. It takes several CJK
> > characters to form a "word", but this "word" sits inside a series of
> > characters and is not easily distinguishable for a computer. That's why
> > a word segmentation algorithm is needed to recognize a "word".
>
> > As far as I know, Vim simply takes a sequence of whatever characters
> > (not ,./?><...) as a "word", which is correct semantically for
> > English, but not for CJK languages. What I want to know is whether
> > Vim has ever considered adding support for this.
>
> > Thanks
> > Xie
>
> I don't think it would be feasible, especially since OT1H there exist
> compound words which can exist either as distinct words or as part of
> larger compounds, and OTOH there exist characters which cannot appear as
> separate words in contemporary Chinese but can do so in poetic or
> archaic language (and you wouldn't prevent Vim from being usable with,
> let's say, commentaries of ancient writers, would you?). So IIUC Vim
> would need an extensive dictionary of compounds, and the logic to go
> with it, in order to "intelligently" break CJK words (and I'm not sure
> it could do so when spelling is not being checked). So I suppose
> treating all ideograms (but not ideographic punctuation) as "word"
> characters may be less than perfect but at least it's doable (and
> someone who doesn't speak CJK languages can program it and test it).
>
> What might be possible (but I'm not sure it is) would be to define
> spelling dictionaries for mainland Chinese, Taiwanese, Hong Kong
> Chinese, Japanese, South Korean and North Korean, containing only the
> "acceptable" isolated words and "indivisible" compounds. This might give
> a basis for what you're asking for; but how would you treat a CJK
> character which is not used alone in some language, and appears (maybe
> as a result of some typo, or maybe in a quotation from some other CJK
> language) in a context where it doesn't make an "acceptable" compound
> with the hanzi-kanji-hanja/kana/hangeul-chosŏngŭl surrounding it? In
> alphabetic languages you could scan both ways to the nearest space, tab,
> linebreak or punctuation mark; but I'm not sure how to do it with CJK text.
>
> Best regards,
> Tony.
> --
> When a fly lands on the ceiling, does it do a half roll or a half
> loop?
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
