Re: Issue in match() function with multi-byte characters

Nikolay Pavlov Sun, 30 Mar 2014 01:38:22 -0700

On Mar 30, 2014 11:50 AM, "Andre Sihera" <[email protected]> wrote:
>
>
>
> On 30/03/14 16:40, Nikolay Pavlov wrote:
>>
>>
>> On Mar 30, 2014 5:54 AM, "Andre Sihera" <[email protected]> wrote:
>> >
>> >
>> > On 30/03/14 09:03, Nikolay Pavlov wrote:
>> >>
>> >>
>> >> On Mar 30, 2014 3:35 AM, "Dmitry Frank" <[email protected]> wrote:
>> >> >
>> >> > Hello all.
>> >> >
>> >> > match() function returns the index of the first match, but if there are multi-byte chars before the first match, then each multi-byte char is interpreted as several chars, so the index becomes wrong.
>> >> >
>> >> > Say, match("foobar", "bar") returns 3, which is correct. But match("яfoobar", "bar") returns 5, which is wrong (should be 4).
>> >>
>> >> This is completely correct. What are you going to do with 4? "яfoobar"[4] is "o" (specifically, the second one).
>> >
>> >
>> > This is only marginally correct, even according to my documentation (7.3.475), which *starts* by talking about characters and *ends* by talking about bytes, even when referring to the same notions. stridx(), strpart(), and most other functions talk about bytes from the outset, with no mention of characters. At minimum, the OP was probably misled by match()'s description.
>> >
>> >>
>> >> > But we surely need to make match() work as expected when &encoding is "utf-8" too.
>> >>
>> >> >
>> >>
>> >> Also col(), string indexing, /\%Nc and so on? Not going to happen; this is an incompatible change.
>> >
>> >
>> > This kind of flat-refusal mentality gets nobody anywhere.
>> >
>> > You can't go touting ViM around as a multilingual editor and fill it with lots of features and settings that handle multi-byte encodings and ISO-10646 support if this kind of English-only support prevails in the script language and prevents you from processing what the user has input in the first place.
>> >
>> > There are so many easy real-life examples I could cherry-pick as to why the OP's thinking is correct that it isn't funny.
>> >
>> > For example, say in Japanese (the input language I use) I'm processing buffer lines or user input where the first 20 characters are not useful. So you think I can go and just do this?
>> >
>> >     match(szUserInput, szSearchString, 20)
>> >
>> > In 8-bit *legacy* encodings, maybe. But in UTF-8? You must be kidding! Here's what I have as my input:
>> >
>> >     "今日 時間 日 本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。",
>> >
>> > I am looking for "勉強" in the right-hand portion (character 33). Just how on earth do I specify the position *in bytes*, as match() expects, of the 20th *character*? By forcing me, the user, to *binary dump* every string I want to use to extract the byte index? What if that position has to be calculated dynamically based on previous user/file input? (This is typically necessary, as even whitespace can vary in width in Japanese, meaning an isspace()-like whitespace test succeeds but the number of bytes occupied varies.)
>> >
>> > Incidentally, in the above example, character 20 is the first character of "今日", the word after the larger whitespace portion in the middle. However, *byte* 20 is the "語" of the 3rd word "日本語". Thus, the ViM script:
>> >
>> >     szLine = "今日 時間 日 本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。"
>> >     szInput = input(...)
>> >     ...
>> >     match(szLine, szInput, 20)
>> >
>> > comes back with 24 (byte 24). At minimum, I want it to come back with 79 (the byte index of what I'm looking for), except that there was no easy way to dynamically compute 40, the byte position where the search actually needs to start.
>>
>> Usually match(str, '.\{20}') is used in this case. I would ask, though, where you obtained the number 20.
>
>
> The code in the example was:
>
> match(szLine, szInput, 20)
>
> I want to start matching from character index 20 (i.e. I want to skip the first 20 characters in the string). I don't want to match character U+0020. ViM can already do that.


You cannot use my regex to match U+0020; it matches 20 characters.
Composing characters are counted as part of the previous character, though.
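
A minimal sketch of how that regex can be combined with match(), assuming 'encoding' is utf-8. CharMatch() is a hypothetical helper name, not an existing Vim function. It uses matchend() with '^.\{N}' to turn a character offset into a byte offset, then strchars() to turn the byte index returned by match() back into a character index. One caveat: '.' treats a composing character as part of the preceding character, while strchars() counts it separately, so the two disagree on text that contains composing characters.

```vim
scriptencoding utf-8

" Hypothetical helper (not part of Vim): like match(), but the optional
" start argument and the return value are character indexes, not byte indexes.
function! CharMatch(str, pat, ...) abort
  let l:startchar = a:0 ? a:1 : 0
  " Byte offset just past the first l:startchar characters;
  " matchend() returns -1 if the string is shorter than that.
  let l:startbyte = matchend(a:str, '^.\{' . l:startchar . '}')
  if l:startbyte < 0
    return -1
  endif
  " match() takes a byte offset and returns a byte index.
  let l:idx = match(a:str, a:pat, l:startbyte)
  if l:idx < 0
    return -1
  endif
  " Count the characters in the bytes that precede the match.
  return strchars(strpart(a:str, 0, l:idx))
endfunction
```

With this, CharMatch("яfoobar", "bar") gives 4, where match("яfoobar", "bar") gives 5. Existing functions keep their byte semantics, so nothing that depends on byte indexes is affected.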

>
>
>
>> >
>> > This basic lack of support in the script language for multi-lingual features needs to be addressed, either through new functions or through fixing the existing ones so they match the behaviour that the user expects when modifying *related* settings like encoding, fileencoding, etc.
>>
>> Indexing a string to get a character would be a good idea for most use cases and would fix a number of plugins. But unfortunately there is a whole *class* of plugins that would be *broken* by this change: any plugin implementing a hash calculation function. You might have expected such a change in neovim (not as long as I am responsible for the new VimL implementation), but Bram hates including incompatible changes (and I do not like them either). So you cannot expect existing functions to be fixed.
>>
>> About adding new functions: I do not know. Maybe if somebody writes a patch to add mbstrlen() (an alias to the existing strchars(), for consistency), mbmatch(), mbmatchend(), mbmatchstr(), mbmatchlist(), mbstrpart(), mbstridx(), mbstrridx(), mbcol() and /\%NC, they will be included.
>>
>> >
>> >
>> >
>> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Dmitry
>> >> >
>> >> > --
>> >> > --
>> >> > You received this message from the "vim_dev" maillist.
>> >> > Do not top-post! Type your reply below the text you are replying to.
>> >> > For more information, visit http://www.vim.org/maillist.php
>> >> >
>> >> > ---
>> >> > You received this message because you are subscribed to the Google Groups "vim_dev" group.
>> >> > To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> >> > For more options, visit https://groups.google.com/d/optout.
