Re: [vim/vim] Not all regexp classes [:...:] were not tested. (#1560)

Michal Grochmal Wed, 15 Mar 2017 16:41:38 -0700

On Wed, Mar 15, 2017 at 11:16:49PM +0100, Bram Moolenaar wrote:
> 
> Kazunobu Kuriyama wrote:
> 
> > > But it seems strange that we need to restrict [:cntrl:] and [:graph:] to 
> > > ASCII only.
> > 
> > Quite understandable.  But otherwise, we will have to either rely
> > entirely on the is*() functions provided by the OS in use or define
> > our own character classes independently of any of it.
> > 
> > The former case implies that the behavior of Vim scripts using
> > [:class:] depends on the OS in use.   Surely, the latter case is
> > expected to resolve the flaw of the former, but I'm not sure we can
> > specify character classes in such a way that almost all users on
> > various platforms are satisfied with them.
> > 
> > So, I think at the moment that the ASCII restriction is a reasonable
> > compromise.  But I'm still quite open to other better solutions.
> 
> It's a difficult choice.  Either we say the regexp should be portable,
> and we let Vim define exactly what those classes mean, or we say we must
> follow how the current system considers characters to be classified.  I
> wonder when the system knows better, perhaps when something in the
> system configuration, e.g. the country or language, changes what
> characters mean?


Yes, it does.  At least on glibc (i.e. GNU).  iswcntrl(3) is defined as any
character that is *not* part of "print", "alpha", "upper", "lower", "digit",
"xdigit", "punct".

The problem is that "alpha" and "punct" are affected by locale settings,
therefore "cntrl" is affected too.  In other words, a simple fixed regex in
Vim would likely end up classifying either all characters above 0x80 as
[:cntrl:] or none of them.  Either choice may be wrong, since some characters
in the higher ranges are not printable.

So, the restriction to ASCII values, or better to 0x00-0x00ff, makes sense.
For example (using some UTF-8 aware terminal emulator and a UTF-8 locale):

1.   printf '\u00c0'  # will print an À
2.   printf '\u009f'  # will give the same [:cntrl:] character that 0x9f
                      # gives under LC_CTYPE=latin1
     # (\u needs a printf that understands it, e.g. the bash builtin)
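
A rough cross-check of the two code points in the example above (U+00C0 and
U+009F) that does not depend on the terminal or on LC_CTYPE.  This is only a
minimal sketch, assuming Python 3 and its bundled unicodedata tables; it does
not call the C library, so it shows what the Unicode data says rather than
what iswcntrl(3) returns on a given system.  The helper names are made up for
illustration.

```python
import unicodedata

# Code points from the example above: U+00C0 and U+009F.
SAMPLES = [0x00C0, 0x009F]


def general_category(cp):
    """Return the Unicode general category of a code point, e.g. 'Lu' or 'Cc'."""
    return unicodedata.category(chr(cp))


def is_control(cp):
    """True if the code point is in general category Cc (control)."""
    return general_category(cp) == "Cc"


if __name__ == "__main__":
    for cp in SAMPLES:
        print(f"U+{cp:04X}: category={general_category(cp)} "
              f"control={is_control(cp)}")
```

U+00C0 comes out as an uppercase letter and U+009F as a control character,
which matches what the two printf calls are meant to show.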

iswgraph(3) also has a note that it depends on LC_CTYPE, but the definition of
how this happens seems more convoluted.

For non-UTF-8 encodings things should be simpler to express as a regex, I
guess.

Yet, still for UTF-8, different versions of glibc do have different Unicode
tables.  Other OSes may be more or less up to date with the Unicode
Consortium's data, and may be updated more or less often.

I'd make a regex for all the ISO8859, KOI8 and EUC locales and leave
the system to deal with the others.  Then, on *nix LC_CTYPE=C should work like
latin1 (iso8859-1) and on MS Windows *I believe* that you can set the locale to
latin1 on any version of it.

That will not test all locales, but there will be at least some tests.
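
To make the idea of a fixed regex for the 0x00-0xff range concrete, here is a
minimal sketch in Python (not Vim's regexp engine, and not a proposed patch).
It assumes that "control" in latin1 means the C0 range 0x00-0x1f, DEL (0x7f)
and the C1 range 0x80-0x9f; the names CNTRL_LATIN1 and is_cntrl_latin1 are
hypothetical.

```python
import re

# Assumption: "control" in the 0x00-0xFF range means C0 (0x00-0x1F),
# DEL (0x7F) and C1 (0x80-0x9F), as in ISO 8859-1.
CNTRL_LATIN1 = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def is_cntrl_latin1(ch):
    """Locale-independent [:cntrl:] test for one character up to U+00FF."""
    if len(ch) != 1 or ord(ch) > 0xFF:
        raise ValueError("expected one character in the range 0x00-0xFF")
    return CNTRL_LATIN1.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ["\t", "a", "\x7f", "\x9f", "\xc0"]:
        print(f"0x{ord(ch):02x}: cntrl={is_cntrl_latin1(ch)}")
```

Because the ranges are spelled out, the result is the same on every OS and
under every locale, which is the portability being argued for here.  The cost
is that anything above 0xff is simply rejected rather than classified.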

-- 
Mike Grochmal
GPG key ID 0xC840C4F6

-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
