On Sunday, February 8, 2015 at 5:16:21 AM UTC+8, Jacky Liu wrote:
> Here is the VimL code I wrote:
>
> " Use some weird Unicode chars to mark the region; '+' is included here
> " as a contrast.
> syntax region myCmdLine matchgroup=myCmdLine_ start=/[⣱+]/ end=/[⡇⡗⡧+]/
> hi link myCmdLine _LightGreen_233b5a
> hi link myCmdLine_ Normal
>
> syntax keyword myCmdName man bind less containedin=myCmdLine contained
> hi link myCmdName _Green_233b5a
>
> And here's its effect on some simple demonstration text (see attached image
> file).
>
> With '+' as the marker, all three syntax keywords were correctly recognized,
> but not with the unusual Unicode chars.
>
> Another thing: using '*' to do a quick search works normally, as does the
> following search command:
>
> /\<man\|bind\|less\>
>
> The 'iskeyword' and 'regexpengine' options seem to have no effect here.
>
> Should this be considered a bug?
Update:
I've found a solution. Although it involves a slight modification to the Vim
source, it solves the problem without any apparent side effects.
The method is to change the classification of certain characters as desired,
by modifying this file, vim74/src/mbyte.c:
/*
 * Get class of a Unicode character.
 * 0: white space
 * 1: punctuation
 * 2 or bigger: some class of word character.
 */
    int
utf_class(c)
    int c;
{
    /* sorted list of non-overlapping intervals */
    static struct clinterval
    {
        unsigned short first;
        unsigned short last;
        unsigned short class;
    } classes[] =
    {
        {0x037e, 0x037e, 1},    /* Greek question mark */
        {0x0387, 0x0387, 1},    /* Greek ano teleia */
        {0x055a, 0x055f, 1},    /* Armenian punctuation */
        {0x0589, 0x0589, 1},    /* Armenian full stop */
        {0x05be, 0x05be, 1},
        {0x05c0, 0x05c0, 1},
        ... ...
The list above, in mbyte.c, defines ranges of the Unicode table and how each
range is classified. Changing the last value of an entry to 1 makes that range
punctuation characters; after recompiling and reinstalling, word boundaries
then apply wherever those characters appear.
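
To make the idea concrete, here is a minimal standalone sketch of an interval
lookup of this kind. It is not Vim's actual code: the demo_ names are made up
for illustration, the ASCII handling is a simplification, and the Braille row
(U+2800 to U+28FF, class 1) is an assumption about which range one would edit,
based on the marker characters in the syntax rule quoted above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same three fields as the quoted table. Vim names the last field "class";
 * it is called "cls" here only so the sketch also compiles as C++. */
struct demo_clinterval {
    unsigned short first;
    unsigned short last;
    unsigned short cls;
};

/* Hypothetical excerpt: sorted, non-overlapping intervals. */
static const struct demo_clinterval demo_classes[] = {
    {0x037e, 0x037e, 1},    /* Greek question mark */
    {0x0387, 0x0387, 1},    /* Greek ano teleia */
    {0x055a, 0x055f, 1},    /* Armenian punctuation */
    {0x0589, 0x0589, 1},    /* Armenian full stop */
    {0x2800, 0x28ff, 1},    /* ASSUMED edit: Braille block as punctuation */
};

/* Returns 0 for white space, 1 for punctuation, 2 for a word character. */
static int demo_utf_class(int c)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_classes) / sizeof(demo_classes[0]);

    assert(hi > 0);

    /* Simplified ASCII handling; the real function consults Vim's options. */
    if (c < 0x80) {
        if (c == ' ' || c == '\t' || c == 0)
            return 0;
        return (isalnum(c) || c == '_') ? 2 : 1;
    }

    /* Binary search over the sorted intervals. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c > demo_classes[mid].last)
            lo = mid + 1;
        else if (c < demo_classes[mid].first)
            hi = mid;
        else
            return demo_classes[mid].cls;
    }
    return 2;   /* not listed: treated as a generic word character here */
}

/* A character counts as part of a word when its class is 2 or bigger. */
static int demo_is_word_char(int c)
{
    return demo_utf_class(c) >= 2;
}
```

With the assumed Braille row in place, demo_is_word_char(0x2847) is false, so
a word that follows such a character starts at a word boundary. Remove that
row from the sketch and the same code point falls through to class 2.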
There's another data structure in the same file which specifies the display
width of characters:
/*
 * For UTF-8 character "c" return 2 for a double-width character, 1 for others.
 * Returns 4 or 6 for an unprintable character.
 * Is only correct for characters >= 0x80.
 * When p_ambw is "double", return 2 for a character with East Asian Width
 * class 'A'(mbiguous).
 */
    int
utf_char2cells(c)
    int c;
{
    /* Sorted list of non-overlapping intervals of East Asian double width
     * characters, generated with ../runtime/tools/unicode.vim. */
    static struct interval doublewidth[] =
    {
        {0x1100, 0x115f},
        {0x11a3, 0x11a7},
        {0x11fa, 0x11ff},
        {0x2329, 0x232a},
        {0x2e80, 0x2e99},
        {0x2e9b, 0x2ef3},
        ... ...
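Characters specified by this list are drawn as double width. According to the
comment above, characters of East Asian Width class 'A' (ambiguous) are also
drawn as double width when the 'ambiwidth' option is set to "double".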
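
As a rough illustration of how such a width table can be consulted, here is a
small standalone sketch. It uses only the six intervals quoted above; the
demo_ names are hypothetical, and it deliberately ignores the unprintable and
ambiguous-width cases that the real function handles.

```c
#include <assert.h>
#include <stddef.h>

struct demo_interval {
    long first;
    long last;
};

/* The intervals quoted above; the real table continues past this point. */
static const struct demo_interval demo_doublewidth[] = {
    {0x1100, 0x115f},
    {0x11a3, 0x11a7},
    {0x11fa, 0x11ff},
    {0x2329, 0x232a},
    {0x2e80, 0x2e99},
    {0x2e9b, 0x2ef3},
};

/* Binary search: return 1 when c lies inside one of the n sorted intervals. */
static int demo_in_table(const struct demo_interval *table, size_t n, long c)
{
    size_t lo = 0;
    size_t hi = n;

    assert(n > 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (c > table[mid].last)
            lo = mid + 1;
        else if (c < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Simplified: 2 cells for a listed character, 1 cell for everything else. */
static int demo_char2cells(long c)
{
    size_t n = sizeof(demo_doublewidth) / sizeof(demo_doublewidth[0]);

    return demo_in_table(demo_doublewidth, n, c) ? 2 : 1;
}
```

For example, demo_char2cells(0x1100) gives 2, while 0x2e9a, which falls in the
gap between the last two quoted intervals, gives 1. Adding or removing a row
changes the drawn width of that range in the same way that editing a row of
the class table changes its classification.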
The Unicode table is so immense that no single classification of characters
can suit everybody, so I think local changes like the above are sometimes
unavoidable.
Thank you ~
--
You received this message from the "vim_use" maillist.