On 08/08/10 01:48, Bee wrote:
On Aug 7, 2:50 pm, Tony Mechelynck<[email protected]>
wrote:
On 07/08/10 20:41, Benjamin R. Haskell wrote:
On Thu, 5 Aug 2010, Tim Chase wrote:
On 08/05/10 00:17, Bee wrote:
Too subtle for me!
I have looked and searched, this is the only difference I can find.
*[:blank:]* [:blank:] space and tab characters
*[:space:]* [:space:] whitespace characters
What other whitespace characters are there besides space and tab?
On the Mac the non-breaking space is 0xA0, and neither class finds it.
Search for /\%xA0 finds the Mac non-breaking space.
You have the right idea. Remember that a [...] character-class can be
prefixed by "\_" to include newlines, so you might do something like
/the\_[[:space:]]\+brackets
(finds a match in my help on those POSIX-style character-classes)
whereas it won't find a match if you use [[:blank:]]
There are other Unicode whitespace characters (such as thin-space and
perhaps your non-breaking space, and other similar variants) so
[:blank:] is "JUST tabs and spaces" while [:space:] should find any of
the more generic whitespace.
So, Vim's [:space:] and [:blank:] don't seem to match Unicode spaces,
which differs from Perl's [:space:] and [:blank:].
Here's a complete list of what matches for me in perl 5.12.1, using a
test program[1]:
Unicode name             ║Hex   ║Dec  ║:space:║:blank:
═════════════════════════╬══════╬═════╬═══════╬═══════
CHARACTER TABULATION     ║\u0009║9    ║1      ║1
LINE FEED (LF)           ║\u000a║10   ║1      ║0
LINE TABULATION          ║\u000b║11   ║1      ║0
FORM FEED (FF)           ║\u000c║12   ║1      ║0
CARRIAGE RETURN (CR)     ║\u000d║13   ║1      ║0
SPACE                    ║\u0020║32   ║1      ║1
OGHAM SPACE MARK         ║\u1680║5760 ║1      ║1
MONGOLIAN VOWEL SEPARATOR║\u180e║6158 ║1      ║1
EN QUAD                  ║\u2000║8192 ║1      ║1
EM QUAD                  ║\u2001║8193 ║1      ║1
EN SPACE                 ║\u2002║8194 ║1      ║1
EM SPACE                 ║\u2003║8195 ║1      ║1
THREE-PER-EM SPACE       ║\u2004║8196 ║1      ║1
FOUR-PER-EM SPACE        ║\u2005║8197 ║1      ║1
SIX-PER-EM SPACE         ║\u2006║8198 ║1      ║1
FIGURE SPACE             ║\u2007║8199 ║1      ║1
PUNCTUATION SPACE        ║\u2008║8200 ║1      ║1
THIN SPACE               ║\u2009║8201 ║1      ║1
HAIR SPACE               ║\u200a║8202 ║1      ║1
LINE SEPARATOR           ║\u2028║8232 ║1      ║0
PARAGRAPH SEPARATOR      ║\u2029║8233 ║1      ║0
NARROW NO-BREAK SPACE    ║\u202f║8239 ║1      ║1
MEDIUM MATHEMATICAL SPACE║\u205f║8287 ║1      ║1
IDEOGRAPHIC SPACE        ║\u3000║12288║1      ║1
But, using a short test Vim script[2], I get only the non-Unicode
spaces:
dec 9 space 1 blank 1
dec 10 space 1 blank 0
dec 11 space 1 blank 0
dec 12 space 1 blank 0
dec 13 space 1 blank 0
dec 32 space 1 blank 1
I was surprised by that, but also surprised that NO-BREAK SPACE (\u00a0
[decimal 160]) didn't show up in either list.
Any reason the Unicode spaces in general don't match in Vim?
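As a rough cross-check of the Perl table above, here is a minimal Python
sketch (Python is not used anywhere in this thread; it is only a convenient
way to query the Unicode character database). The helper name describe() and
the selection of code points are illustrative assumptions. Python ships a
newer Unicode database than perl 5.12.1 did, so individual entries can
differ; U+180E, for example, was later reclassified as a format character.

```python
import unicodedata

# A few code points from the table above, plus NO-BREAK SPACE (U+00A0),
# which is missing from both lists in the thread.
CODE_POINTS = [0x0009, 0x0020, 0x00A0, 0x2028, 0x2029, 0x3000]


def describe(cp):
    """Return (general category, str.isspace() result) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.isspace()


for cp in CODE_POINTS:
    category, is_space = describe(cp)
    print(f"U+{cp:04X}  category={category}  isspace={is_space}")
```

Category Zs is "space separator", Zl and Zp are the line and paragraph
separators, and Cc covers the ASCII control characters such as tab.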
It's documented, a few paragraphs below the [:list:]:
These items only work for 8-bit characters.
Characters in the range 0x80 to 0xFF are 8-bit in 8-bit encodings, but
in UTF-8 everything above 0x7F is multibyte (and what counts here is not
the 'fileencoding' used to represent your data on disk, but the
'encoding' used to represent it in memory, which is how the data is
represented when the search looks at it). For instance the no-break
space U+00A0 is represented as 0xC2 0xA0 (2 bytes), the ideographic
space U+3000 is represented as 0xE3 0x80 0x80 (3 bytes), etc.
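The byte sequences quoted above can be reproduced with a short Python sketch
(Python is used here only for illustration; hex_bytes() is a hypothetical
helper name, not something from Vim):

```python
NBSP = "\u00a0"         # NO-BREAK SPACE
IDEOGRAPHIC = "\u3000"  # IDEOGRAPHIC SPACE


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as a string like '0xC2 0xA0'."""
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))


print(hex_bytes(NBSP, "latin-1"))       # 0xA0 (one byte in an 8-bit encoding)
print(hex_bytes(NBSP, "utf-8"))         # 0xC2 0xA0
print(hex_bytes(IDEOGRAPHIC, "utf-8"))  # 0xE3 0x80 0x80
```

The same character is a single byte in Latin1 and a two-byte sequence in
UTF-8, which is why the value of 'encoding' decides whether an 8-bit-only
pattern item can see it.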
vim 7.2.446 Mac terminal
I searched for [:list:] and helpgrep for :list: (just to be sure) and
could find nothing. Then went to ftp.nluug.nl::Vim/runtime/doc to get
(maybe) a more recent pattern.txt file. Still nothing.
Is that info from vim 7.3?
The expression [:list:] above was shorthand for "list of POSIX
collections using a word between colons between brackets". There is no
[:list:] sequence (as bra-, colon, ell, eye, ess, tee, colon, -cket) as
such in the help. That list starts at ":help [:alnum:]". The indented
line I copied is a little lower, at line 1027 of the latest Vim 7.3e
pattern.txt helpfile. The rest is commentary of my own.
I did find the phrase: "These items only work for 8-bit characters."
But no more info.
What is or is not single-byte comes from the general info about Unicode
that I absorbed since Vim 6.1 times. Here is how a Unicode codepoint is
represented in UTF-8:
- Bytes 0x00 to 0x7F have the same meaning as in US-ASCII, representing
codepoints U+0000 to U+007F respectively. These are "single bytes".
- Bytes 0x80 to 0xBF can be any byte except the first in a multibyte
sequence. In binary, this means 10xxxxxx where the two high bits say
"this is a trailer byte" and the 6 lower bits are the payload
- Bytes above 0xBF are header bytes, the first byte of a multibyte
sequence. The number of high "one" bits is the total number of bytes in
the sequence, then there is one "zero" bit, the rest are payload bits
- UTF-8 is always big-endian (it is byte-oriented, and in each multibyte
sequence, high-weight payload bits come in the first byte)
- In principle, as few bytes as possible must be used for each codepoint
(this means that bytes 0xC0 and 0xC1 are invalid); and codepoints above
U+10FFFF, as well as codepoints U+xxFFFE and U+xxFFFF where xx is any
hex value, will never correspond to any character, not even a
private-use character. (Vim accepts Unicode codepoints up to U+7FFFFFFF,
which corresponds to an earlier state of the ISO-10646 standard.)
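The rules listed above can be sketched as a small hand-written encoder. This
is an illustration in Python, not how Vim does it; utf8_encode() is a
hypothetical name, it stops at U+10FFFF (so it does not cover the longer
sequences Vim accepts), and it makes no attempt to reject surrogates or the
U+xxFFFE/U+xxFFFF non-characters.

```python
def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) in the shortest UTF-8 form."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if cp <= 0x7F:
        return bytes([cp])  # single byte, same value as US-ASCII
    if cp <= 0x7FF:
        length = 2
    elif cp <= 0xFFFF:
        length = 3
    else:
        length = 4

    # Trailer bytes: 10xxxxxx, six payload bits each, lowest bits last.
    trailers = []
    for _ in range(length - 1):
        trailers.append(0x80 | (cp & 0x3F))
        cp >>= 6

    # Header byte: 'length' one-bits, a zero bit, then the remaining payload.
    header = ((0xFF << (8 - length)) & 0xFF) | cp
    return bytes([header] + trailers[::-1])


print(utf8_encode(0x00A0).hex())  # c2a0
print(utf8_encode(0x3000).hex())  # e38080
```

Picking the smallest length that fits is what enforces the "as few bytes as
possible" rule, and it is also why 0xC0 and 0xC1 never appear as header
bytes.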
Search for /\%xA0 finds the Mac non-breaking space.
I have not seen it preceded by 0xC2.
I guess I need to use something like:
[[:space:]\xA0]\+
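The idea of widening an ASCII-only whitespace class by hand can be tried
outside Vim too. The sketch below uses Python's re module as a loose
analogue; it says nothing about Vim's regexp engine. The assumption is that
\s with the re.ASCII flag stands in for an 8-bit-only [:space:], and the
sample text is made up for the illustration.

```python
import re

# 'a', NO-BREAK SPACE, 'b', SPACE, 'c'
text = "a\u00a0b c"

# ASCII-only whitespace class: the no-break space is not matched.
ascii_only = re.findall(r"\s", text, flags=re.ASCII)

# Same class with U+00A0 added explicitly, as in the pattern above.
widened = re.findall(r"[\s\xa0]+", text, flags=re.ASCII)

# Unicode-aware whitespace class (Python's default for str patterns).
unicode_aware = re.findall(r"\s", text)

print(ascii_only)     # [' ']
print(widened)        # ['\xa0', ' ']
print(unicode_aware)  # ['\xa0', ' ']
```

Listing the extra code point explicitly only helps for the characters you
remember to list; the other Unicode spaces in the table still need to be
added one by one.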
If I copy the above commented "unicode table" from the vim_use website,
which has HTML non-breaking spaces, and paste it into vim.app (a gvim for
the Mac I like better than MacVim), it shows as " =", that is 0x20 0xA0
(2 bytes), the "=" representing 0xA0.
'encoding' was set to Latin1 maybe? That's an ordinary space followed by
a no-break space.
BUT pasted into terminal vim with:
set pastetoggle=<F11>
then it does show as 0xC2 0xA0 (2 bytes)
--AND--
/[[:space:]]
will find the 2 byte sequence!
Well, I tried in gvim, inserting the characters with |i_CTRL-V_digit|,
and /\_[[:space:]] skipped no-break spaces, ideographic spaces, etc.,
except when found at the end of a line. 'encoding' was set to utf-8.
Thank you for the explanation.
-Bill
Best regards,
Tony.
--
"Life would be much simpler and things would get done much faster if it
weren't for other people"
-- Blore
--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php