Re: Issue in match() function with multi-byte characters

Andre Sihera Sun, 30 Mar 2014 08:30:11 -0700


On 30/03/14 23:52, ZyX wrote:

>> Thanks for your insights. It's not perfect but it's much better than
>> I was originally thinking.
>>
>> I didn't say that string indexing has to be fixed. I said the
>> inconsistencies had to be fixed. There is a difference.
>>
>> The inconsistency of looking at this:
>>
>> 1) "this is a test"
>>
>> and this:
>>
>> 2) "こんにちわ世界"
>>
>> and having to say "(1) is a string, but (2) is a byte sequence" just
>> because I live in a part of the world that doesn't use English (8-bit
>> characters) is an inconsistency that needs to be addressed. Any
>> reasonable language that provides string constants should treat
>> everything in between double quotes consistently.

There is no inconsistency: both variants are byte sequences. I do not say it is
good, but you must remember two things:

1. Vim is supposed to be able to edit any text file. It is not impossible for an
edited file to be in UTF-8 encoding *and* contain the sequence 0x20 0xFF 0x20,
which is not valid UTF-8. Reasons for this may vary: e.g. somebody assumed it
was a good idea to embed binary data directly in a C string instead of using
escape sequences, the file became corrupt due to a power failure, etc.

True, and the script language should therefore provide the means for
script writers to decide for themselves whether their scripts should
treat 0x20 0xFF 0x20 as a valid byte sequence or an invalid character
sequence.

So far we have all byte support and little-to-no character support for
what should be consistently and fully supported from both points of
view, particularly as the editor allows input of both character and byte
data indiscriminately.
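
As an illustration of that choice (not Vim script: a minimal Python 3 sketch,
with hypothetical helper names `as_bytes` and `as_text`), the same three bytes
can be accepted as raw bytes, rejected as invalid UTF-8, or decoded leniently,
depending on what the script writer asks for:

```python
# Byte values taken from the example above; helper names are hypothetical.
RAW = b"\x20\xff\x20"  # space, 0xFF, space: 0xFF never occurs in valid UTF-8


def as_bytes(data):
    """Byte view: every value 0..255 is acceptable, nothing is decoded."""
    return list(data)


def as_text(data, errors="strict"):
    """Character view: decode as UTF-8 using the caller's error policy."""
    return data.decode("utf-8", errors=errors)


if __name__ == "__main__":
    print(as_bytes(RAW))
    try:
        as_text(RAW)
    except UnicodeDecodeError as exc:
        print("strict decoding rejected the data:", exc.reason)
    # "replace" substitutes U+FFFD for the undecodable byte.
    print(repr(as_text(RAW, errors="replace")))
```

The point of the sketch is only that the policy is picked by the caller, which
is the kind of control argued for above; it says nothing about how Vim itself
would or should implement it.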

2. It is not the only language with byte indexing by default. Originally there
was no such thing as UTF, only a number of local 8-bit encodings. So lots of
languages carry this history into today: Perl uses single-byte strings without
`use utf8;`, and there are a number of places where you must add code to convert
a string to a UTF-8 string; Lua has no Unicode support without external
libraries; Python 2 strings are byte sequences by default (and there were some
issues with that situation that led to the incompatible changes in Python 3);
and so on.

    There are basically three approaches to dealing with the situation: (Perl)
add a solution that requires a number of hacks, (Lua) just ignore it, and
(Python) make a new version that is not compatible with the old one. Vim mostly
uses the second approach. I do not see any fourth approach that would not end
up as one of those already mentioned.
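
To make the byte/character distinction concrete, here is a minimal Python 3
sketch (an illustration only, not Vim script; the function names `lengths` and
`first_unit` are hypothetical). It applies both views to the two string
constants quoted earlier in the thread:

```python
ASCII_TEXT = "this is a test"
JAPANESE_TEXT = "こんにちわ世界"


def lengths(text):
    """Return (character count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


def first_unit(text):
    """Return (first character, first UTF-8 byte as an integer)."""
    return text[0], text.encode("utf-8")[0]


if __name__ == "__main__":
    print(lengths(ASCII_TEXT))       # (14, 14): the two views agree for ASCII
    print(lengths(JAPANESE_TEXT))    # (7, 21): three bytes per character here
    print(first_unit(JAPANESE_TEXT))
```

For the ASCII string the byte index and the character index coincide, which is
why byte indexing by default goes unnoticed in English-only text. For the
Japanese string, index 0 is either the whole character or only the first of
its three bytes, depending on which view the language exposes.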
>> The fact we have plug-ins that operate only in byte mode or rely on
>> that particular behaviour is an issue that prevents the necessary
>> changes from being applied cleanly and easily.
>>
>> Again, thanks for your insights. Very useful.
--
--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.


