Re: \+ not same as [^\t ]\+

Benjamin R. Haskell Sat, 07 Aug 2010 11:42:39 -0700

On Thu, 5 Aug 2010, Tim Chase wrote:

> On 08/05/10 00:17, Bee wrote:
> > Too subtle for me!
> > 
> > I have looked and searched, this is the only difference I can find.
> > 
> > *[:blank:]*     [:blank:]     space and tab characters
> > *[:space:]*     [:space:]     whitespace characters
> > 
> > What other whitespace characters are there besides space and tab?
> > 
> > On the Mac, non-breaking space is xA0, and neither finds it.
> > 
> > Search for /\%xA0 finds the Mac non-breaking space.
> 
> You have the right idea.  Remember that a [...] character-class can be 
> prefixed by "\_" to include newlines, so you might do something like
> 
>   /the\_[[:space:]]\+brackets
> 
> (finds a match in my help on those POSIX-style character-classes) 
> whereas it won't find a match if you use [[:blank:]]
> 
> There are other Unicode whitespace characters (such as thin-space and 
> perhaps your non-breaking space, and other similar variants) so 
> [:blank:] is "JUST tabs and spaces" while [:space:] should find any of 
> the more generic whitespace.


So, Vim's [:space:] and [:blank:] don't seem to match Unicode spaces, 
which differs from Perl's [:space:] and [:blank:].

Here's a complete list of what matches for me in perl 5.12.1, using a 
test program[1]:

Unicode name             ║Hex   ║Dec  ║:space:║:blank:
═════════════════════════╬══════╬═════╬═══════╬═══════
CHARACTER TABULATION     ║\u0009║9    ║1      ║1
LINE FEED (LF)           ║\u000a║10   ║1      ║0
LINE TABULATION          ║\u000b║11   ║1      ║0
FORM FEED (FF)           ║\u000c║12   ║1      ║0
CARRIAGE RETURN (CR)     ║\u000d║13   ║1      ║0
SPACE                    ║\u0020║32   ║1      ║1
OGHAM SPACE MARK         ║\u1680║5760 ║1      ║1
MONGOLIAN VOWEL SEPARATOR║\u180e║6158 ║1      ║1
EN QUAD                  ║\u2000║8192 ║1      ║1
EM QUAD                  ║\u2001║8193 ║1      ║1
EN SPACE                 ║\u2002║8194 ║1      ║1
EM SPACE                 ║\u2003║8195 ║1      ║1
THREE-PER-EM SPACE       ║\u2004║8196 ║1      ║1
FOUR-PER-EM SPACE        ║\u2005║8197 ║1      ║1
SIX-PER-EM SPACE         ║\u2006║8198 ║1      ║1
FIGURE SPACE             ║\u2007║8199 ║1      ║1
PUNCTUATION SPACE        ║\u2008║8200 ║1      ║1
THIN SPACE               ║\u2009║8201 ║1      ║1
HAIR SPACE               ║\u200a║8202 ║1      ║1
LINE SEPARATOR           ║\u2028║8232 ║1      ║0
PARAGRAPH SEPARATOR      ║\u2029║8233 ║1      ║0
NARROW NO-BREAK SPACE    ║\u202f║8239 ║1      ║1
MEDIUM MATHEMATICAL SPACE║\u205f║8287 ║1      ║1
IDEOGRAPHIC SPACE        ║\u3000║12288║1      ║1

But, using a short test Vim script[2], I get only the ASCII whitespace 
characters:

dec 9  space 1 blank 1
dec 10 space 1 blank 0
dec 11 space 1 blank 0
dec 12 space 1 blank 0
dec 13 space 1 blank 0
dec 32 space 1 blank 1

I was surprised by that, but also surprised that NO-BREAK SPACE (\u00a0 
[decimal 160]) didn't show up in either list.

Any reason the Unicode spaces in general don't match in Vim?
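
In the meantime, here is a rough workaround sketch: spell out the blanks 
from the Perl table by code point inside a [] collection, plus NO-BREAK 
SPACE, which is my own addition. It assumes 'encoding' is utf-8 and 
default 'cpoptions' (no "l" flag), and the variable name g:uniblank is 
made up for this example. I have only reasoned through it, so treat it 
as a starting point rather than a tested solution.

```vim
" Sketch only: list the Unicode blanks explicitly, because [[:blank:]]
" matched just tab and space in the test above.
" Assumes encoding=utf-8; g:uniblank is a hypothetical name.
let g:uniblank = '[ \t\u00a0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]'

" match() returns 0 when the first character matches, -1 when nothing does.
echo match(nr2char(0x2003), g:uniblank)
echo match(nr2char(0x2003), '[[:blank:]]')
echo match('x', g:uniblank)
```

The same collection can be used directly in a search, e.g. 
/[ \t\u00a0\u2000-\u200a]\+ for a run of such characters.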

-- 
Best,
Ben


[1] Perl 'one-liner':
perl -CDS -Mcharnames=:full -lwe '
  BEGIN{print "Unicode name\tHex\tDec\t:space:\t:blank:";}
  for (map chr, 1..0xd700) {
    $s = /[[:space:]]/; $b = /[[:blank:]]/;
    print join "\t", charnames::viacode(ord), sprintf("\\u%04x",ord), ord,
      $s?1:0, $b?1:0 if $s or $b }'

[2] Vim script (saved as /tmp/script.vim, then :so /tmp/script.vim)
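" Report every code point 1..0x4000 that matches [:space:] or [:blank:].
" nr2char() follows 'encoding', so this assumes encoding=utf-8.
for ord in range(1, 0x4000)
        let c = nr2char(ord)
        " match() returns -1 when the pattern does not match
        let sp = match(c, '[[:space:]]') >= 0
        let bl = match(c, '[[:blank:]]') >= 0
        if sp || bl
                echo "dec" ord "space" sp "blank" bl
        endif
endfor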
