Re: Issue in match() function with multi-byte characters

Dmitry Frank Sun, 30 Mar 2014 01:42:07 -0700

2014-03-30 11:43 GMT+04:00 Andre Sihera <[email protected]>:


>
>
> On 30/03/14 16:40, Nikolay Pavlov wrote:
>
>
> On Mar 30, 2014 5:54 AM, "Andre Sihera" <[email protected]>
> wrote:
> >
> >
> > On 30/03/14 09:03, Nikolay Pavlov wrote:
> >>
> >>
> >> On Mar 30, 2014 3:35 AM, "Dmitry Frank" <[email protected]> wrote:
> >> >
> >> > Hello all.
> >> >
> >> > The match() function returns the index of the first match, but if there
> >> > are multi-byte characters before the first match, each multi-byte
> >> > character is counted as several characters, so the index comes out wrong.
> >> >
> >> > Say, match("foobar", "bar") returns 3, which is correct. But
> >> > match("яfoobar", "bar") returns 5, which is wrong (it should be 4).
> >>
> >> This is completely correct. What are you going to do with 4?
> >> "яfoobar"[4] is "o" (specifically, the second one).
> >
> >
> > This is only marginally correct, even according to my documentation
> > (7.3.475), which *starts* by talking about characters and *ends* by talking
> > about bytes, even when referring to the same notions. stridx(), strpart(),
> > and most other functions start from the outset by talking about bytes with
> > no mention of characters. At minimum, the OP was probably misled by
> > match()'s description.
> >
> >>
> >> > But we surely need to make match() work as expected when &encoding is
> >> > "utf-8" too.
> >>
> >> >
> >>
> >> Also col(), string indexing, /\%Nc and so on? Not going to happen; this
> >> is an incompatible change.
> >
> >
> > This kind of flat-refusal mentality gets nobody anywhere.
> >
> > You can't go touting ViM around as a multilingual editor and fill it with
> > lots of features and settings that handle multi-byte encodings and
> > ISO-10646 support if this kind of English-only support prevails in the
> > script language and prevents you from processing what the user has input
> > in the first place.
> >
> > There are so many easy real-life examples I could cherry-pick as to why
> > the OP's thinking is correct, it isn't funny.
> >
> > For example, say in Japanese (the input language I use) I'm processing
> > buffer lines or user input where the first 20 characters are not useful.
> > So you think I can go and just do this?
> >
> >     match(szUserInput, szSearchString, 20)
> >
> > In 8-bit *legacy* encodings, maybe. But in UTF-8? You must be kidding!
> > Here's what I have as my input:
> >
> >     "今日 時間 日 本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。",
> >
> > I am looking for "勉強" in the right-hand portion (character 33). Just how
> > on earth do I specify the position *in bytes*, as match() expects, of the
> > 20th *character*? By forcing me, the user, to *binary dump* every string I
> > want to use to extract the byte index? What if that position has to be
> > calculated dynamically based on previous user/file input (this is typically
> > necessary, as even whitespace can vary in width in Japanese, meaning an
> > isspace()-like whitespace test succeeds but the number of bytes occupied
> > varies)?
> >
> > Incidentally, in the above example, character 20 is the first character of
> > "今日", the word after the larger whitespace portion in the middle. However,
> > *byte* 20 is the "語" of the 3rd word "日本語". Thus, the ViM script:
> >
> >     let szLine = "今日 時間 日 本語 勉強 思      今日は２時間ぐらい日本語を勉強したいと思います。"
> >     let szInput = input(...)
> >     ...
> >     match(szLine, szInput, 20)
> >
> > comes back with 24 (byte 24). At minimum, I want it to come back with 79
> > (the byte index of what I'm looking for), except that there was no easy way
> > to dynamically compute 40, the byte position where the search actually
> > needs to start.
>
> Usually match(str, '.\{20}') is used in this case. I would ask, though,
> where you obtained the number 20.
>
>
> The code in the example was:
>
> match(szLine, szInput, 20)
>
> I want to start matching from character index 20 (i.e. I want to skip the
> first 20 characters in the string). I don't want to match character U+0020.
> ViM can already do that.
>
>
>
>  >
> > This basic lack of support in the script language for multi-lingual
> > features needs to be addressed, either through new functions or through
> > fixing the existing ones so they match the behaviour that the user expects
> > when modifying *related* settings like encoding, fileencoding, etc.
>
> Indexing a string to get a character would be a good idea for most use cases
> and would fix a number of plugins. But unfortunately there is a whole
> *class* of plugins that would be *broken* by this change: any plugin
> implementing a hash calculation function. You might have expected this in
> neovim (not as long as I am responsible for the new VimL implementation), but
> Bram hates including incompatible changes (and I do not like them either).
> So you cannot expect existing functions to be fixed.
>
> About adding new functions: I do not know. Maybe if somebody writes a patch
> to add mbstrlen() (an alias for the existing strchars(), for consistency),
> mbmatch(,end,str,list), mbstrpart(), mbstridx(), mbstrridx(), mbcol() and
> //\%NC, they will be included.
>
> >
> >
> >
> >
> >> >
> >> > --
> >> > Regards,
> >> > Dmitry
> >> >
> >> > --
> >> > --
> >> > You received this message from the "vim_dev" maillist.
> >> > Do not top-post! Type your reply below the text you are replying to.
> >> > For more information, visit http://www.vim.org/maillist.php
> >> >
> >> > ---
> >> > You received this message because you are subscribed to the Google
> Groups "vim_dev" group.
> >> > To unsubscribe from this group and stop receiving emails from it,
> send an email to [email protected].
> >> > For more options, visit https://groups.google.com/d/optout.


Well, incompatible changes are surely something we should avoid if
possible. So I would vote for new mb...() functions; they would help a lot.
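
To illustrate, here is a minimal sketch of what one such function could look
like, built only from existing functions (byteidx(), match(), strpart() and
strchars()). The name MbMatch() and its arguments are hypothetical, not an
existing or proposed Vim API. It assumes &encoding is "utf-8" and makes no
attempt to handle composing characters consistently.

```vim
scriptencoding utf-8

" Hypothetical helper, not part of Vim: behaves like match(), except that the
" optional start position and the returned index count characters, not bytes.
function! MbMatch(str, pat, ...) abort
  " Optional third argument: character index to start searching from.
  let l:start_char = a:0 > 0 ? a:1 : 0
  if l:start_char < 0
    let l:start_char = 0
  endif

  " byteidx() converts a character index into a byte index; it returns -1
  " when the string has fewer characters than requested.
  let l:start_byte = byteidx(a:str, l:start_char)
  if l:start_byte < 0
    return -1
  endif

  " The built-in match() works on byte offsets throughout.
  let l:byte = match(a:str, a:pat, l:start_byte)
  if l:byte < 0
    return -1
  endif

  " Count the characters that precede the matching byte offset.
  return strchars(strpart(a:str, 0, l:byte))
endfunction

" Byte index from the built-in function: 5
echo match("яfoobar", "bar")
" Character index from the sketch: 4
echo MbMatch("яfoobar", "bar")
```

With something like this, skipping the first 20 characters becomes
MbMatch(szLine, szInput, 20), with no need to work out the byte offset by
hand. Existing functions keep their byte-based behaviour, so nothing that
relies on it would break.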
