Re: Word boundary would not work when using some weird Unicode chars with the 'contained' syntax

Jacky Liu Tue, 24 Feb 2015 09:12:02 -0800

On Sunday, February 8, 2015 at 5:16:21 AM UTC+8, Jacky Liu wrote:
> Here is the VimL code I wrote:
> 
>       " Use some weird Unicode chars to mark the region; '+' is included here as a contrast.
>       syntax  region  myCmdLine       matchgroup=myCmdLine_           start=/[⣱+]/    end=/[⡇⡗⡧+]/
>       hi      link    myCmdLine       _LightGreen_233b5a
>       hi      link    myCmdLine_      Normal
> 
>       syntax  keyword myCmdName       man bind less   containedin=myCmdLine   contained
>       hi      link    myCmdName       _Green_233b5a
> 
> And here's its effect on some simple demonstrating text (see attached image 
> file)
> 
> With '+' as the marker, all three syntax keywords were correctly recognized, 
> but not with the unusual Unicode chars.
> 
> Another thing: using '*' to do a quick search works normally, as does 
> the following search command:
> 
>       /\<man\|bind\|less\>
> 
> Neither the 'iskeyword' nor the 'regexpengine' option seems to have any effect here.
> 
> Should this be considered a bug?




Update:

I've found a solution. Although it involves a slight modification to the Vim 
source, it solves the problem without any apparent side effects.

The method is to change the classification of certain characters as desired, 
by modifying this file: vim74/src/mbyte.c:

/*
 * Get class of a Unicode character.
 * 0: white space
 * 1: punctuation
 * 2 or bigger: some class of word character.
 */
    int
utf_class(c)
    int         c;
{
    /* sorted list of non-overlapping intervals */
    static struct clinterval
    {
        unsigned short first;
        unsigned short last;
        unsigned short class;
    } classes[] =
    {
        {0x037e, 0x037e, 1},            /* Greek question mark */
        {0x0387, 0x0387, 1},            /* Greek ano teleia */
        {0x055a, 0x055f, 1},            /* Armenian punctuation */
        {0x0589, 0x0589, 1},            /* Armenian full stop */
        {0x05be, 0x05be, 1},
        {0x05c0, 0x05c0, 1},
         ... ...


The above list in mbyte.c defines character ranges within the Unicode table and 
how they are to be classified. Changing the last value of an entry to '1' makes 
that range punctuation characters, and after recompiling and installing, word 
boundaries apply where those characters appear.
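
To make the effect of such an edit concrete, here is a minimal standalone 
sketch of an interval-table lookup. It is not Vim's actual code: the names 
demo_interval, demo_classes and demo_utf_class are made up for illustration, 
the table is shortened, and the special handling that the real function has 
for characters below 0x100 is left out. It also assumes the marker characters 
from my earlier mail lie in the Braille Patterns block (U+2800 to U+28FF) and 
that this range has been given class 1 by a local edit.

```c
#include <stddef.h>

/* Hypothetical, shortened stand-in for the table in mbyte.c. */
struct demo_interval
{
    unsigned short first;
    unsigned short last;
    unsigned short cls;     /* 0: white space, 1: punctuation, >= 2: word */
};

/* Sorted list of non-overlapping intervals. */
static const struct demo_interval demo_classes[] =
{
    {0x037e, 0x037e, 1},    /* Greek question mark */
    {0x0387, 0x0387, 1},    /* Greek ano teleia */
    {0x2800, 0x28ff, 1},    /* ASSUMED local edit: Braille as punctuation */
};

/*
 * Binary search for the interval containing "c".
 * Characters not covered by any interval count as word characters (2).
 */
int demo_utf_class(int c)
{
    size_t lo = 0;
    size_t hi = sizeof(demo_classes) / sizeof(demo_classes[0]);

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (c > demo_classes[mid].last)
            lo = mid + 1;
        else if (c < demo_classes[mid].first)
            hi = mid;
        else
            return demo_classes[mid].cls;
    }
    return 2;
}
```

With this table, demo_utf_class(0x28f1) yields 1 (punctuation), while a code 
point outside every interval, such as 0x4e00, falls through to 2 and is 
treated as a word character. That fall-through default is why an unlisted 
character placed next to a keyword can keep a word boundary from being seen.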

There's another data structure in the same file which specifies the display 
width of characters:

/*
 * For UTF-8 character "c" return 2 for a double-width character, 1 for others.
 * Returns 4 or 6 for an unprintable character.
 * Is only correct for characters >= 0x80.
 * When p_ambw is "double", return 2 for a character with East Asian Width
 * class 'A'(mbiguous).
 */
    int
utf_char2cells(c)
    int         c;
{
    /* Sorted list of non-overlapping intervals of East Asian double width
     * characters, generated with ../runtime/tools/unicode.vim. */
    static struct interval doublewidth[] =
    {
        {0x1100, 0x115f},
        {0x11a3, 0x11a7},
        {0x11fa, 0x11ff},
        {0x2329, 0x232a},
        {0x2e80, 0x2e99},
        {0x2e9b, 0x2ef3},
         ... ...

Characters in this list are always drawn as double width. Per the comment 
above, characters with East Asian Width class 'A' (ambiguous) are also drawn 
as double width when the 'ambiwidth' option is set to "double".
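
As a rough sketch of how I read the function comment quoted above, the width 
decision can be modelled with two interval tables. Everything named demo_* 
here is hypothetical. The first table holds only the doublewidth entries 
quoted above, the second is a placeholder for an ambiguous-width table with 
ranges I picked for illustration, and the 4 or 6 cell results for unprintable 
characters are left out.

```c
#include <stddef.h>

struct demo_range
{
    int first;
    int last;
};

/* The doublewidth entries quoted above (the rest of the list is elided). */
static const struct demo_range demo_doublewidth[] =
{
    {0x1100, 0x115f},
    {0x11a3, 0x11a7},
    {0x11fa, 0x11ff},
    {0x2329, 0x232a},
    {0x2e80, 0x2e99},
    {0x2e9b, 0x2ef3},
};

/* PLACEHOLDER ambiguous-width ranges, chosen only for illustration. */
static const struct demo_range demo_ambiguous[] =
{
    {0x00a1, 0x00a1},
    {0x2460, 0x24e9},
};

/* Return 1 when "c" lies inside one of the sorted, non-overlapping ranges. */
static int demo_in_table(const struct demo_range *table, size_t n, int c)
{
    size_t lo = 0;
    size_t hi = n;

    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (c > table[mid].last)
            lo = mid + 1;
        else if (c < table[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/*
 * Return 2 for a double-width character, 1 otherwise.
 * "ambw_double" is nonzero when 'ambiwidth' is set to "double".
 */
int demo_char2cells(int c, int ambw_double)
{
    size_t nd = sizeof(demo_doublewidth) / sizeof(demo_doublewidth[0]);
    size_t na = sizeof(demo_ambiguous) / sizeof(demo_ambiguous[0]);

    if (demo_in_table(demo_doublewidth, nd, c))
        return 2;
    if (ambw_double && demo_in_table(demo_ambiguous, na, c))
        return 2;
    return 1;
}
```

In this model demo_char2cells(0x1100, 0) is 2 regardless of the option, while 
demo_char2cells(0x00a1, ...) is 2 only when the second argument is nonzero.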

The Unicode table is so immense that no single classification of characters 
can suit everybody, so I think changes like the above are sometimes 
inevitable.

Thank you ~
