On 13/03/09 17:22, Mike Williams wrote:
>
> Matt Wozniski wrote:
>> On Fri, Mar 13, 2009 at 12:01 PM, Mike Williams wrote:
>>> Matt Wozniski wrote:
>>>> This sounds like a very good idea to me.  I don't know of any other
>>>> programs that allow you to change the encoding used internally, and we
>>>> would be in good company if we chose to always use a unicode encoding
>>>> internally: Java uses UTF-16 internally, and I believe python does as
>>>> well.  Is there any time when it would be desirable to use a
>>>> non-unicode 'encoding' (assuming, of course, that +multi_byte is
>>>> available)?  I can't think of any.
>>> Yes, editing very large (say a few 100MB) data files that are in a
>>> single-byte encoding.  For my day job I regularly enjoy having to spelunk my
>>> way around large files containing a mix of readable ASCII and binary
>>> data.  Using a Unicode encoding could make this prohibitive.  Yes, this
>>> is essentially a raw file edit mode, perhaps that should be an option -
>>> or would it be part of setting binary mode?
>>
>> How would using Unicode for 'enc' in any way affect this?  Sure, you'd
>> want to use a single-byte 'fenc', but no one is suggesting that the
>> 'fenc' option should be removed.  If there is a reason why editing
>> binary files should be affected at all by what encoding the editor
>> uses for storing the buffer text internally, I don't see it and you'll
>> need to elaborate.
>
> With a UTF-16 internal encoding a 250MB data file blossoms into a nice
> round 500MB.  For all the cheap memory these days this will still have
> an effect on system performance - time to allocate, paging out of idle
> apps to disk, etc.

Vim doesn't use UTF-16 internally but UTF-8 -- even if you set 
'encoding' to, say, utf-16le -- because Vim cannot tolerate actual NUL 
bytes in the middle of lines. This also means there is no space loss for 
7-bit ASCII, which is represented identically in ASCII, Latin1, UTF-8, 
and indeed in most iso-8859 encodings.
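
A quick way to convince yourself of this from inside Vim, assuming a 
+multi_byte build with 'encoding' set to utf-8: put the cursor on a 
non-ASCII character, an e-acute say, and hit g8 in Normal mode:

        g8
        c3 a9

the second line being what Vim echoes for an e-acute: the UTF-8 bytes it 
actually holds for that character. A plain ASCII character shows up as a 
single byte.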

>
> And will VIM internally use a canonical Unicode form?  What happens if I
> want to insert some 8-bit data whose unicode character has multiple
> forms?  Which one is used?  How will I know that the 8-bit value I
> intend does not appear as a composed sequence?  I haven't used VIM for
> editing unicode with composing characters (damn my native English
> country) - I see there is some discussion on composing but at first
> glance it is not clear whether it is automatic or not.  In my case I
> would not want deletion of a data byte to result in other bytes being
> deleted as well.
>
> At the moment I cannot see how supporting Unicode semantics maps to
> editing binary data files.  Not saying it is impossible, I'd just like
> to see the possible way out of the woods if we did go this way.
>
> TTFN
>
> Mike

IMHO, binary data should be read "as if" it were in an 8-bit 
'fileencoding', because in an 8-bit encoding there are no "invalid" byte 
sequences -- and probably Latin1, because the conversion Latin1 <=> UTF-8 
is trivial and requires no iconv library. An alternative possibility (but 
to be used only at the user's explicit request, IMHO) is to convert binary 
to hex and vice-versa via xxd.
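
For the record, the xxd route looks roughly like this (a sketch only, 
with foobar.bin as a placeholder name; xxd ships with Vim, so it should 
already be on your PATH):

        :e ++bin foobar.bin
        :%!xxd
        (edit the hex columns, not the printable column at the right)
        :%!xxd -r
        :w

Reading with ++bin keeps Vim from doing end-of-line or encoding 
conversions, and xxd -r rebuilds the binary from the hex columns, 
ignoring any changes made to the printable part.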

However, this is not what Vim does if you read a file with ++bin: what 
it does is "no conversion", which means that if 'encoding' is set to 
UTF-8 you'll probably get invalid UTF-8 sequences at many places in your 
code. For instance, an a-acute in Spanish Latin1 text will appear as <e1> 
instead of á, and an e-circumflex in French Latin1 text will appear as 
<ea> instead of ê. Not very convenient if they happen to be within text 
strings -- messages, maybe, to be typed out on the screen. So even if 
you know that the code is binary you might prefer to use

        :e ++enc=latin1 ++ff=unix foobar.bin

and omit the 'binary' setting. The result, if you make changes and save 
them, could be an extra 0x0A at the very end if there wasn't one 
already, but I don't expect trouble even if it happens. (Overlong lines 
might be split if you were on a 16-bit machine, but on 32-bit machines 
the maximum line size and the maximum file size are both 2GB, and even 
on a 64-bit machine I don't expect you'll often have to edit a binary 
file containing a 2GB stretch of code without a single 0x0A in it.)
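
If you want to check afterwards whether that extra 0x0A really was added, 
something like (Unix tools assumed, foobar.bin as above)

        :!tail -c 4 foobar.bin | xxd

run from within Vim after writing will show the last few bytes of the 
file in hex.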

Of course, the utmost care should be used when editing binary files 
because, if the file is e.g. program code:
- the code can contain displacements in binary, which will become 
invalid if the length of the intervening text is modified
- executable code should in general not be touched
- compressed binaries are probably not editable in any way
- and what if the program includes a binary hash of its ASCII text 
somewhere?

As for canonical forms: I don't think Vim will spontaneously convert 
either way between a spacing character + combining character(s) combo 
and a precomposed character. If you type a then Ctrl-V u 0301, you'll get 
a spacing a and a combining acute. If your keyboard allows "keying" an 
a-acute character, or if you type Ctrl-V x e1, you'll get a precomposed 
a-acute. The two results will be indistinguishable if you have a "good" 
font, but Vim doesn't know that, and searching for the precomposed 
character will not match the ASCII letter + combining accent two-codepoint 
combo.
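
A minimal way to see the difference, assuming 'encoding' is utf-8 and a 
Vim recent enough to accept \u in double-quoted strings: a followed by 
the combining acute is stored as the three bytes 61 cc 81, the 
precomposed a-acute as the two bytes c3 a1 (g8 on each shows this), so 
the two compare unequal:

        :echo "a\u0301" ==# "\u00e1"
        0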


Best regards,
Tony.
-- 
"Can you hammer a 6-inch spike into a wooden plank with your penis?"

"Uh, not right now."

"Tsk.  A girl has to have some standards."
                -- "Real Genius"
