Re: Issue in match() function with multi-byte characters

ZyX Sun, 30 Mar 2014 06:58:26 -0700

On Sunday, March 30, 2014 5:00:13 PM UTC+4, Andre Sihera wrote:
> On 30/03/14 20:32, Yasuhiro MATSUMOTO wrote:
> Now this is interesting.
> 
> index() does indeed split on character, not byte boundaries. However, even if
> I can do this:
> 
>      split("こんにちわ世界", '\zs')
> 
> to get this:
> 
>      ['こ', 'ん', 'に', 'ち', 'わ', '世', '界']
> 
> it still doesn't allow me to do a search for "世界" (i.e. a word) and get the
> answer 5. Instead I have to break my search word into individual characters
> and then perform a manual character by character comparison - in ViM script.
> Absolutely no good for performance, especially if I'm processing big text
> files.


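First find the byte offset with `match()`, `stridx()` or similar. Then slice off 
the prefix before the match and count what is in it:

    " byte offset of the match, -1 if not found
    let idx = stridx(s, '世界')
    " strpart() gives an empty prefix for idx == 0, unlike s[: idx - 1]
    let prefix = strpart(s, 0, idx)
    let codepoint_offset = strchars(prefix)
    let characters_offset = len(split(prefix, '\m'))

Some notes: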

1. `split(s, '\zs')` is said not to work if vim is compiled without +syntax. So I 
usually use `\m` or `.\@!` instead, even though I know a vim instance with 
-syntax and +eval is unlikely to exist.
2. split() *does not split on unicode codepoints*. In Japanese text it may not 
matter, but each list item holds a character together with all of the composing 
characters that follow it. If you need a single unicode codepoint and have a 
byte offset, you have to use `nr2char(char2nr(s[offset :]))`.
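
The steps above can be wrapped into one function. This is only a minimal 
sketch: the name CharIndex() is made up for illustration, and it assumes 
'encoding' is utf-8. It counts codepoints with strchars(); count the items of 
split() instead if composing characters must not be counted separately.

```vim
" Hypothetical helper: index of needle in haystack counted in codepoints,
" or -1 if needle does not occur.
function! CharIndex(haystack, needle)
  " stridx() works on bytes, so it is fast and needs no manual comparison
  let byte_idx = stridx(a:haystack, a:needle)
  if byte_idx < 0
    return -1
  endif
  " Count the codepoints in the bytes before the match.
  " strpart() yields an empty prefix when byte_idx is 0.
  return strchars(strpart(a:haystack, 0, byte_idx))
endfunction

echo CharIndex('こんにちわ世界', '世界')
```

The search itself stays inside stridx(), so no character by character loop in 
script is needed.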

> Incidentally, checking this yielded yet another inconsistency. The reverse of
> index() is the array subscript operator "[...]" which works directly on
> strings to get a character. e.g.
> 
>                      1111
>            01234567890123
>      echo "this is a test"[5]
> 
> correctly yields "i". However, if I do this:
> 
>            ０１２３４５６
>      echo "こんにちわ世界"[5]
> 
> instead of getting "世" (6th character), it wrongly returns the 6th byte and
> gives me "<93>", which I presume is a byte midway through a UTF-8 character
> sequence.

According to the help, string indexing returns *a single byte*. This is 
*completely correct* behavior; see the first sentence of :h expr8.
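
If a character is what you want, convert the character index to a byte offset 
first. A minimal sketch, assuming 'encoding' is utf-8; CharAt() is a made-up 
name, while byteidx() and matchstr() are the standard functions:

```vim
" Hypothetical helper: the character at character index n, or '' if the
" string has fewer characters than that.
function! CharAt(s, n)
  " byteidx() maps a character index to a byte offset, -1 if out of range
  let byte_idx = byteidx(a:s, a:n)
  if byte_idx < 0
    return ''
  endif
  " '.' matches one whole character starting at that byte offset
  return matchstr(a:s, '.', byte_idx)
endfunction

let s = 'こんにちわ世界'
" character at index 5
echo CharAt(s, 5)
" value of the single byte at byte index 5
echo char2nr(s[5])
```

Here s[5] stays a byte lookup, as documented, and the character lookup is a 
separate, explicit operation.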

> This is not good. These inconsistencies need to be fixed.

String indexing *must not* be changed. As I said, there are a number of plugins 
that need *exactly* bytes: any plugin implementing a hash function, for example. 
char2nr(s[i]) is guaranteed to return a value between 0x00 and 0xFF (inclusive); 
0x00 is returned only if s[i] is an empty string.
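
To illustrate the kind of code that relies on this, here is a toy hash over the 
bytes of a string. It is not taken from any real plugin: the name ByteHash() 
and the constants 31 and 65521 are arbitrary choices made for this sketch.

```vim
" Hypothetical toy hash: folds every byte of s into a small number.
function! ByteHash(s)
  let h = 0
  " len() counts bytes and s[i] is a single byte, so each step sees a
  " value in the range 0x00..0xFF regardless of multi-byte characters
  for i in range(len(a:s))
    " the modulus keeps h small, so h * 31 cannot overflow a 32-bit Number
    let h = (h * 31 + char2nr(a:s[i])) % 65521
  endfor
  return h
endfunction

echo ByteHash('こんにちわ世界')
```

If s[i] returned characters instead, a function like this would see codepoints 
of arbitrary size and would give different results.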
