Re: Issue in match() function with multi-byte characters

Andre Sihera Sun, 30 Mar 2014 00:51:07 -0700


On 30/03/14 16:40, Nikolay Pavlov wrote:
>
>
> On Mar 30, 2014 5:54 AM, "Andre Sihera" <[email protected]> wrote:
> >
> >
> > On 30/03/14 09:03, Nikolay Pavlov wrote:
> >>
> >>
> >> On Mar 30, 2014 3:35 AM, "Dmitry Frank" <[email protected]> wrote:
> >> >
> >> > Hello all.
> >> >
> >> > match() function returns the index of the first match, but if
> >> > there are multi-byte chars before the first match, then each
> >> > multi-byte char is interpreted as several chars, so the index
> >> > becomes wrong.
> >> >
> >> > Say, match("foobar", "bar") returns 3, which is correct. But
> >> > match("яfoobar", "bar") returns 5, which is wrong (should be 4)
> >>
> >> This is completely correct. What are you going to do with 4?
> >> "яfoobar"[4] is "o" (specifically, second one).
> >
> >
> > This is only marginally correct, even according to my documentation
> > (7.3.475), which *starts* by talking about characters and *ends* by
> > talking about bytes, even when referring to the same notions.
> > stridx(), strpart(), and most other functions talk about bytes from
> > the outset with no mention of characters. At minimum, the OP was
> > probably misled by match()'s description.
> >
> >>
> >> > But we surely need to make match() work as expected when
> >> > &encoding is "utf-8" too.
> >>
> >> >
> >>
> >> Also col(), string indexing, /\%Nc and so on? Not going to happen,
> >> this is an incompatible change.
> >
> >
> > This kind of flat-refusal mentality gets nobody anywhere.
> >
> > You can't go touting ViM around as a multilingual editor and fill it
> > with lots of features and settings that handle multi-byte encodings
> > and ISO-10646 support if this kind of English-only support prevails
> > in the script language and prevents you from processing what the
> > user has input in the first place.
> >
> > There are so many easy real-life examples I could cherry-pick as to
> > why the OP's thinking is correct it isn't funny.
> >
> > For example, say in Japanese (the input language I use) I'm
> > processing buffer lines or user input where the first 20 characters
> > are not useful. So you think I can go and just do this?
> >
> > match(szUserInput, szSearchString, 20)
> >
> > In 8-bit *legacy* encodings, maybe. But in UTF-8? You must be
> > kidding! Here's what I have as my input:
> >
> > "今日 時間 日本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。",
> >
> > I am looking for "勉強" in the right hand portion (character 33).
> > Just how on earth do I specify the position *in bytes*, as match()
> > expects, of the 20th *character*? By forcing me, the user, to
> > *binary dump* every string I want to use to extract the byte index?
> > What if that position has to be calculated dynamically based on
> > previous user/file input (this is typically necessary as even
> > whitespace can vary in width in Japanese, meaning an isspace()-like
> > whitespace test succeeds but the number of bytes occupied varies)?
> >
> > Incidentally, in the above example, character 20 is the first
> > character of "今日", the word after the larger whitespace portion in
> > the middle. However, *byte* 20 is the "語" of the 3rd word "日本語".
> > Thus, the ViM script:
> >
> > let szLine = "今日 時間 日本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。"
> > let szInput = input(...)
> > ...
> > match(szLine, szInput, 20)
> >
> > comes back with 24 (byte 24). At minimum, I want it to come back
> > with 79 (the byte index of what I'm looking for), except that there
> > was no easy way to dynamically compute 40, the byte position of
> > where the search actually needs to start from.
>
> Usually match(str, '.\{20}') is used in this case. I would ask though
> where did you obtain the number 20.
>


The code in the example was:

match(szLine, szInput, 20)

I want to start matching from character index 20 (i.e. I want to skip
the first 20 characters in the string). I don't want to match character
U+0020. ViM can already do that.
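
To make the character/byte distinction concrete, here is a minimal
sketch of the conversion being discussed. It assumes 'encoding' is
utf-8, that byteidx() and strchars() are available in the build in
use, and that the sample line has six ASCII spaces after "思" (an
assumption that reproduces the numbers 40, 79 and 33 quoted above).
The names nSkipChars, nStart, nByte and nChar are illustrative only.

```vim
scriptencoding utf-8

" Sample line; the six ASCII spaces after 思 are an assumption.
let szLine = "今日 時間 日本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。"
let nSkipChars = 20

" Convert the character index into the byte index that match() expects.
let nStart = byteidx(szLine, nSkipChars)

" match() starts at byte nStart and returns a byte index (or -1).
let nByte = match(szLine, '勉強', nStart)

" Convert the resulting byte index back into a character index.
let nChar = nByte < 0 ? -1 : strchars(strpart(szLine, 0, nByte))

echo [nStart, nByte, nChar]
```

With the line exactly as written here, this should echo [40, 79, 33]:
byte 40 is where character 20 starts, and "勉強" is found at byte 79,
which is character 33. The point stands that the caller has to do both
conversions by hand.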


> >
> > This basic lack of support in the script language for multi-lingual
> > features needs to be addressed, either through new functions or
> > through fixing of the existing ones so they match the behaviour that
> > the user expects when modifying *related* settings like encoding,
> > fileencoding, etc.
>
> Indexing a string to get a character would be a good idea for most
> use-cases and would fix a number of plugins. But unfortunately there
> is a whole *class* of plugins that will be *broken* by this change:
> any plugin implementing a hash calculation function. You might have
> expected this in neovim (not as long as I am responsible for the new
> VimL implementation), but Bram hates including incompatible changes
> (and I do not like them either). So you cannot expect existing
> functions to be fixed.
>
> About adding new functions: I do not know. Maybe if somebody writes a
> patch to add mbstrlen() (alias to existing strchars() for
> consistency), mbmatch(,end,str,list), mbstrpart(), mbstridx(),
> mbstrridx(), mbcol() and //\%NC they will be included.
>
> >
> >
> >
> >
> >> >
> >> > --
> >> > Regards,
> >> > Dmitry
> >> >
> >> > --
> >> > --
> >> > You received this message from the "vim_dev" maillist.
> >> > Do not top-post! Type your reply below the text you are replying to.
> >> > For more information, visit http://www.vim.org/maillist.php
> >> >
> >> > ---
> >> > You received this message because you are subscribed to the
> Google Groups "vim_dev" group.
> >> > To unsubscribe from this group and stop receiving emails from it,
> send an email to [email protected]
> <mailto:vim_dev%[email protected]>.
> >> > For more options, visit https://groups.google.com/d/optout.

