Re: Issue in match() function with multi-byte characters

Andre Sihera Sat, 29 Mar 2014 18:55:24 -0700


On 30/03/14 09:03, Nikolay Pavlov wrote:

On Mar 30, 2014 3:35 AM, "Dmitry Frank" wrote:
>
> Hello all.
>
> The match() function returns the index of the first match, but if there are multi-byte chars before the first match, then each multi-byte char is interpreted as several chars, so the index becomes wrong.
>
> Say, match("foobar", "bar") returns 3, which is correct. But match("яfoobar", "bar") returns 5, which is wrong (should be 4).
This is completely correct. What are you going to do with 4? "яfoobar"[4] is "o" (specifically, the second one).

This is only marginally correct, even according to my documentation
(7.3.475), which *starts* by talking about characters and *ends* by
talking about bytes, even when referring to the same notions. stridx(),
strpart(), and most other functions start from the outset by talking
about bytes with no mention of characters. At minimum, the OP was
probably misled by match()'s description.
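
To make the byte/character distinction concrete, here is a minimal sketch using the string from the original post. It assumes 'encoding' is utf-8 and that the script file itself is saved as UTF-8; the variable names are illustrative only, and the character-counting line uses the substitute() idiom described under :help strlen().

```vim
" Assumption: 'encoding' is utf-8 and this file is saved as UTF-8.
let text = "яfoobar"

" match() reports a byte offset; the leading Cyrillic letter occupies
" two bytes in UTF-8, so everything after it is shifted by one.
let byte_idx = match(text, "bar")
echo byte_idx

" Count the characters in front of the match to get a character offset.
" Every character is replaced by a one-byte "x" before strlen() is taken.
let char_idx = strlen(substitute(strpart(text, 0, byte_idx), ".", "x", "g"))
echo char_idx
```

Under those assumptions this should echo 5 and then 4, the two values discussed in the original post.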

> But we surely need to make match() work as expected when &encoding is "utf-8" too.
>
Also col(), string indexing, /\%Nc and so on? Not going to happen, this is an incompatible change.


This kind of flat-refusal mentality gets nobody anywhere.

You can't go touting ViM around as a multilingual editor and fill it
with lots of features and settings that handle multi-byte encodings and
ISO-10646 support if this kind of English-only support prevails in the
script language and prevents you from processing what the user has input
in the first place.

There are so many easy real-life examples I could cherry-pick as to why
the OP's thinking is correct it isn't funny.

For example, say in Japanese (the input language I use) I'm processing
buffer lines or user input where the first 20 characters are not useful.
So you think I can go and just do this?

    match(szUserInput, szSearchString, 20)

In 8-bit *legacy* encodings, maybe. But in UTF-8? You must be kidding!
Here's what I have as my input:

    "????? ???? ?     ???2????????????????????",

I am looking for "??" in the right hand portion (character 33). Just how
on earth do I specify the position *in bytes*, as match() expects, of
the 20th *character*? By forcing me, the user, to *binary dump* every
string I want to use to extract the byte index? What if that position
has to be calculated dynamically based on previous user/file input (this
is typically necessary as even whitespace can vary in width in Japanese,
meaning an isspace()-like whitespace test succeeds but the number of
bytes occupied varies)?

Incidentally, in the above example, character 20 is the first character
of "??", the word after the larger whitespace portion in the middle.
However, *byte* 20 is the "?" of the 3rd word "???". Thus, the ViM
script:

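    " Assignments in Vim script need a leading "let".
    let szLine = "????? ???? ?     ???2?????????? ??????????"
    let szSearch = input(...)
    ...
    " The third argument is a *byte* offset, not a character offset.
    echo match(szLine, szSearch, 20)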

comes back with 24 (byte 24). At minimum, I want it to come back with 79
(the byte index of what I'm looking for) except that there was no easy
way to dynamically compute 40, the byte position of where the search
actually needs to start from.

This basic lack of support in the script language for multi-lingual
features needs to be addressed, either through new functions or through
fixing of the existing ones so they match the behaviour that the user
expects when modifying *related* settings like encoding, fileencoding,
etc.
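
As a rough sketch of what a "new function" along those lines might look like, the wrapper below accepts the start position as a character count and converts it with byteidx() before calling match(). MatchFromChar is a hypothetical name, not something Vim provides, and the sketch assumes byteidx() is available in the build at hand and that 'encoding' is utf-8. Note that byteidx() counts composing characters separately, so this is only an approximation of what a user means by "character".

```vim
" Hypothetical helper, not part of Vim: behaves like match(), except
" that {start} is a character count instead of a byte count.
function! MatchFromChar(text, pat, start)
  " byteidx() converts a character index into a byte index and returns
  " -1 when the string has fewer characters than requested.
  let l:byte_start = byteidx(a:text, a:start)
  if l:byte_start < 0
    return -1
  endif
  " The result is still a byte index, exactly as match() returns it.
  return match(a:text, a:pat, l:byte_start)
endfunction
```

With the variables from the script above, the call would read `echo MatchFromChar(szLine, szSearch, 20)`. It only works around the start-position problem; the return value remains a byte index, so it does not change what match() itself reports.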

>
> --
> Regards,
> Dmitry
>
> --
> --
> You received this message from the "vim_dev" maillist.
> Do not top-post! Type your reply below the text you are replying to.
> For more information, visit http://www.vim.org/maillist.php
>
> ---
> You received this message because you are subscribed to the Google Groups "vim_dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
