On 28/05/12 08:10, Chris Jones wrote:
On Sun, May 27, 2012 at 09:25:30PM EDT, William Fugy wrote:
On Mon, May 28, 2012 at 10:15 AM, Xell Liu <[email protected]> wrote:

[..]

Unless I missed something, and if you absolutely need to do this,
you could bypass the limitation by breaking up the range like so:


| :g/[一-仿伀-俿倀-儀 ... 鼀-龻]/

Good one! I'll give it a try. But so many characters...

Depends how much one needs a regex that works for all cases or if
something more relaxed can do the job at hand. I was also thinking that
depending on the particular use case it might be possible to have
a script create the regex and initialize a variable/register and use its
contents in interactive commands to simulate a [:CJK:] character class
more conveniently.

This corresponds to ranges:

| \u4e00-\u4eff
| \u4f00-\u4fff
| \u5000-\u50ff
| ..
| \u9f00-\u9fbb¹

Trouble is, this is going to add up to something like 80+ subranges and
may cause you to run into other limitations. I haven't tested the whole
range, only the above (it works here), but if nobody comes up with
a better idea, and you choose to go down this path, I would suggest
generating the regex programmatically.
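
For what it's worth, generating the subranges could look something like the
sketch below. It is only an illustration: the 256-code-point chunk size is an
assumption taken from the \u4e00-\u4eff example above, and the helper names
(cjk_subranges, vim_collection) are made up for this post.

```python
# Sketch: build a Vim collection for U+4E00..U+9FBB out of small subranges.
START = 0x4E00   # first code point of the range discussed in the thread
END = 0x9FBB     # last code point given in the footnote
CHUNK = 256      # assumed subrange size, matching \u4e00-\u4eff


def cjk_subranges(start=START, end=END, chunk=CHUNK):
    """Yield (low, high) code point pairs covering start..end inclusive."""
    low = start
    while low <= end:
        high = min(low + chunk - 1, end)
        yield low, high
        low = high + 1


def vim_collection(ranges):
    """Join the subranges into one collection using \\uXXXX escapes."""
    body = "".join(f"\\u{lo:04x}-\\u{hi:04x}" for lo, hi in ranges)
    return f"[{body}]"


ranges = list(cjk_subranges())
pattern = vim_collection(ranges)
print(len(ranges))    # 82 subranges, in line with the "80+" estimate
print(pattern[:40])   # start of the generated collection
```

The printed pattern could then be pasted into a :let @c = '...' line in a
script, so that the register contents can be pulled into :g/ commands with
CTRL-R c. I have not tried the full 82-subrange pattern in Vim.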


Thank you. Apparently it just has to be done this way. For now I'm
dealing with this problem in Perl. I hope Vim can handle it as well.

I don't use Perl but I would have expected it to provide native support
for Unicode blocks. In this instance ‘\p{InCJK_Unified_Ideographs}’,
which corresponds precisely to U+4E00...U+9FFF.

See this:

http://www.regular-expressions.info/unicode.html
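
For comparison, where \p{...} block properties are not available, the same
block can be written as an explicit code point range. Here is a minimal
sketch in Python, whose built-in re module has no \p{...} syntax. The
function name cjk_lines and the sample lines are made up; the idea is to
mimic what :g/[...]/ would select.

```python
import re

# Explicit range standing in for Perl's \p{InCJK_Unified_Ideographs}
# (U+4E00..U+9FFF); Python's built-in re module does not support \p{...}.
CJK_UNIFIED = re.compile("[\u4e00-\u9fff]")


def cjk_lines(lines):
    """Return the lines that contain at least one CJK Unified Ideograph."""
    return [line for line in lines if CJK_UNIFIED.search(line)]


sample = ["plain ASCII text", "Vim 编辑器", "no ideographs here"]
print(cjk_lines(sample))
```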

¹ I think \u4e00-\u9fbb is the correct CJK range

Yes, it's accurate.

Sorry.. in fact, correct was the wrong word.. I really meant something
like ‘effectively assigned’.. \u9fbc-\u9fff do belong to the Unicode
block but afaict no characters have been assigned. Which makes it
impossible to refer to them by character.. only by code point.

CJ


There are additional "rare" CJK characters outside the BMP (in plane 2), and there are other CJK "wide" characters elsewhere in the BMP (e.g. fullwidth space, U+3000). For details, see "East Asian Scripts" in the rightmost column of http://www.unicode.org/charts/ — hovering your mouse over a link will display the codepoint range in a tooltip.
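
To illustrate how a check might cover more than the main block, here is a
small Python sketch. The block list is deliberately partial (a few blocks
picked as examples, not the full "East Asian Scripts" column), and the names
CJK_BLOCKS and in_cjk_blocks are hypothetical.

```python
# Illustrative subset of blocks; not a complete list of East Asian scripts.
CJK_BLOCKS = [
    (0x3000, 0x303F),    # CJK Symbols and Punctuation (includes U+3000)
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B (plane 2)
]


def in_cjk_blocks(ch):
    """Return True if the character's code point falls in a listed block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BLOCKS)


print(in_cjk_blocks("\u3000"))      # ideographic space, in the BMP
print(in_cjk_blocks("\U00020000"))  # first code point of plane 2
print(in_cjk_blocks("A"))           # plain ASCII letter
```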

However, there is also a limitation in Vim, namely, a collection can only match (IIRC) at most 257 different individual characters at the same point. 4E00..9FFF alone is already much more than that.


Best regards,
Tony.
--
Genderplex, n.:
        The predicament of a person in a restaurant who is unable to
determine his or her designated restroom (e.g., turtles and
tortoises).
                -- Rich Hall, "Sniglets"

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
