Re: UTF-8 bomb showing up after :%!sort

Nikolai Weibull Tue, 20 May 2008 08:17:42 -0700

On Tue, May 20, 2008 at 12:47 PM, Tony Mechelynck
<[EMAIL PROTECTED]> wrote:
>
> On 20/05/08 09:11, Nikolai Weibull wrote:
>> On Sun, May 18, 2008 at 8:44 PM, Tony Mechelynck
>> <[EMAIL PROTECTED]>  wrote:
>>
>>>> Note, that you probably do not want to use BOM with UTF-8.
>>>> See http://unicode.org/faq/utf_bom.html#29 (Q: Can a UTF-8 data stream
>>>> contain the BOM character (in UTF-8 form)? If yes, then can I still
>>>> assume the remaining UTF-8 bytes are in big-endian order?)
>>
>>> The BOM can also be used in UTF-8, not to determine endianness (which is
>>> not relevant for UTF-8 -- one could argue that UTF-8 is always
>>> big-endian) but to distinguish UTF-8 from other encodings including
>>> UTF-16 and UTF-32.
>>
>> How can you argue that?  UTF-8 is neither big-endian nor
>> little-endian.  It's just a sequence of 8-bit bytes.
>
> Exactly. Some people (including you, apparently) say that endianness is
> not a property of bytes but of words (or doublewords, quadwords, etc.)
> when written to disk. Since UTF-8 doesn't use 2^n-word data items (with
> n in the set {0, 1, 2, ...}), those people would say that it is neither
> big-endian nor little-endian. According to a different definition of
> endianness (used, apparently, by Ilya Bobir), any sequence of two or
> more bytes representing a single integer (and for instance a Unicode
> codepoint number) can be big-endian (if the bits of higher weight come
> first) or little-endian (if it's the bits of lower weight). According to
> this latter definition, UTF-8 is always big-endian.


Ilya Bobir is simply linking to the FAQ, which doesn't mention any
such definition.
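
To make the byte-order point concrete, here is a minimal sketch (not taken from the FAQ or from anyone's earlier message) that encodes U+FEFF three ways. The helper name `encoded_forms` is made up for this illustration. The two UTF-16 forms differ in byte order, while UTF-8 has exactly one serialization, so its BOM can only act as an encoding signature:

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark


def encoded_forms(ch):
    """Return the bytes of `ch` in UTF-8 and in both UTF-16 byte orders.

    Hypothetical helper for illustration only.
    """
    return {
        "utf-8": ch.encode("utf-8"),
        "utf-16-le": ch.encode("utf-16-le"),
        "utf-16-be": ch.encode("utf-16-be"),
    }


forms = encoded_forms(BOM)
for name, data in forms.items():
    print(name, data.hex())

# UTF-8 is defined as a fixed byte sequence, so the standard library
# exposes a single UTF-8 BOM constant, but two for UTF-16.
print(codecs.BOM_UTF8 == forms["utf-8"])
print(codecs.BOM_UTF16_LE == forms["utf-16-le"])
print(codecs.BOM_UTF16_BE == forms["utf-16-be"])
```

The UTF-8 form is the three bytes EF BB BF on every machine; only the UTF-16 forms depend on which byte order was chosen.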

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---
