On Wed, Mar 15, 2017 at 11:16:49PM +0100, Bram Moolenaar wrote:
>
> Kazunobu Kuriyama wrote:
>
> > > But it seems strange that we need to restrict [:cntrl:] and [:graph:] to
> > > ASCII only.
> >
> > Quite understandable. But otherwise, we will have to either rely
> > entirely on the is*() functions provided by the OS in use or define
> > our own character classes independently of any of it.
> >
> > The former case implies that the behavior of Vim scripts using
> > [:class:] depends on the OS in use. Surely, the latter case is
> > expected to resolve the flaw of the former, but I'm not sure we can
> > specify character classes in such a way that almost all users on
> > various platforms are satisfied with them.
> >
> > So, I think at the moment that the ASCII restriction is a reasonable
> > compromise. But I'm still quite open to other better solutions.
>
> It's a difficult choice. Either we say the regexp should be portable,
> and we let Vim define exactly what those classes mean, or we say we must
> follow how the current system considers characters to be classified. I
> wonder when the system knows better, perhaps when something in the
> system configuration, e.g. the country or language, changes what
> characters mean?

Yes, it does, at least on glibc (i.e. GNU). According to iswcntrl(3), the
"cntrl" class is disjoint from "print", and therefore also from its subclasses
"graph", "alpha", "upper", "lower", "digit", "xdigit" and "punct".

The problem is that "alpha" and "punct" are affected by the locale settings,
and therefore so is "cntrl". In other words, a simple regex in Vim would
likely classify either all characters above 0x80 as [:cntrl:] or none of
them. Either choice may be wrong, since some UTF-8 characters in the higher
ranges are not printable.
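
To see the locale dependence directly, something like the following could be
used. It is only a rough sketch: cntrl_under_locale() is a name I made up for
illustration (it is not Vim code), and it assumes the locale passed in is
installed on the system.

```c
#include <locale.h>
#include <stdio.h>
#include <wctype.h>

/* Hypothetical helper, not part of Vim: returns 1 if wc is in class "cntrl"
 * under the given LC_CTYPE locale, 0 if it is not, and -1 if the locale
 * could not be selected. */
int cntrl_under_locale(const char *locale_name, wint_t wc)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);
    int result;

    /* Copy the current setting first: the string returned by setlocale()
     * may be overwritten by the next call. */
    snprintf(saved, sizeof saved, "%s", cur != NULL ? cur : "C");

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    result = iswcntrl(wc) != 0;

    /* Restore whatever LC_CTYPE was active before. */
    setlocale(LC_CTYPE, saved);
    return result;
}
```

Calling it for the same code point (say 0x9f) with "C" and with a UTF-8 locale
such as en_US.UTF-8 (an assumed name, use whatever is installed) lets you
compare the answers on a given system.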

So the restriction to ASCII values, or better to 0x0000-0x00ff, makes sense.
For example (using a UTF-8 aware terminal emulator, a UTF-8 locale and a
printf that understands \u escapes, such as the bash builtin):

  1. printf "\u00c0"   # will print an À
  2. printf "\u009f"   # will give the same [:cntrl:] character that 0x9f
                       # gives under LC_CTYPE=latin1
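
A fixed definition for that range could be as small as the sketch below. The
function name is hypothetical, and the ranges are my assumption of what
[:cntrl:] should mean for 0x00-0xff (the C0 controls, DEL and the C1 controls,
as in ISO 8859-1 and the first 256 Unicode code points), not what Vim does
today.

```c
#include <stdbool.h>

/* Hypothetical, locale-independent [:cntrl:] test for code points
 * 0x00-0xff. Anything above 0xff is reported as "not cntrl" here, so a
 * caller would have to hand those code points to the system instead. */
bool latin1_iscntrl(unsigned int c)
{
    return c <= 0x1f                    /* C0 controls */
        || (c >= 0x7f && c <= 0x9f);    /* DEL and the C1 controls */
}
```

With this, 0x9f counts as [:cntrl:] and 0xc0 (À) does not, whatever LC_CTYPE
happens to be.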

iswgraph(3) also has a note that it depends on LC_CTYPE, but the definition of
how this happens seems more convoluted.

For non-UTF-8 locales things should be simpler to express as a regex, I guess.

Yet for UTF-8, different versions of glibc have different Unicode tables, and
other OSes may be more or less up to date with the Unicode Consortium.

I'd make a regex for all the ISO8859, KOI8 and EUC locales and leave
the system to deal with the others. Then, on *nix, LC_CTYPE=C should work like
latin1 (iso8859-1), and on MS Windows *I believe* that you can set the locale
to latin1 on any version of it.
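
The split between "our own definition" and "ask the system" could be decided
from the character set name, roughly as sketched here. Both function names are
made up, and the list of prefixes is only my guess at how such charsets are
usually spelled; real names vary between systems.

```c
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only lowercasing, so this check does not itself depend on LC_CTYPE. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* True if s starts with prefix, ignoring ASCII case. */
static bool has_prefix_nocase(const char *s, const char *prefix)
{
    for (; *prefix != '\0'; ++s, ++prefix)
        if (ascii_lower((unsigned char)*s) != ascii_lower((unsigned char)*prefix))
            return false;
    return true;
}

/* Hypothetical policy: use built-in class definitions for ISO 8859, KOI8
 * and EUC charsets, and leave every other charset to the system. */
bool use_builtin_classes(const char *codeset)
{
    static const char *const prefixes[] = {
        "iso-8859", "iso8859", "iso_8859", "koi8", "euc"
    };
    size_t i;

    if (codeset == NULL)
        return false;
    for (i = 0; i < sizeof prefixes / sizeof prefixes[0]; ++i)
        if (has_prefix_nocase(codeset, prefixes[i]))
            return true;
    return false;
}
```

On POSIX systems the name could come from nl_langinfo(CODESET); a name such as
"ISO-8859-1" or "KOI8-R" would then select the built-in definitions, while
"UTF-8" would fall through to the system.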

I will not test all locales, but there will be at least some tests.

--
Mike Grochmal
GPG key ID 0xC840C4F6