On 14/11/08 13:37, Ben Schmidt wrote:
[James Kanze wrote]
[...]
>> I presume that the BOM testing only concerns the first 4 bytes,
>
> I think it can be up to 6. But I don't really know. I could look it up,
> but I couldn't be bothered. It doesn't matter. Yes, it's only a few
> bytes and the BOM detection will fail after that and fall through to the
> next of the fencs encodings.
[...]

The BOM occupies the first two bytes in UTF-16 (or in its UCS-2 subset,
which Vim, rightly, no longer recognizes when reading an existing file
with a BOM), the first three in UTF-8, and the first four in UTF-32
(also called UCS-4). See my other post for the details.
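
To make those lengths concrete, here is a rough Python sketch of BOM
sniffing. It is not how Vim does it internally; detect_bom is a name I
made up for illustration, and the only real API used is the BOM
constants from Python's standard codecs module. The one subtlety is the
order of the tests: the UTF-32 little-endian BOM (FF FE 00 00) begins
with the UTF-16 little-endian BOM (FF FE), so the longer one has to be
tried first.

```python
import codecs

# Longer BOMs come first: FF FE 00 00 (UTF-32 LE) starts with
# FF FE (UTF-16 LE), so testing UTF-16 first would misfire.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def detect_bom(data):
    """Return (encoding name, BOM length) or (None, 0) if no BOM matches.

    Hypothetical helper for illustration only; it looks at no more than
    the first four bytes of data.
    """
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that a UTF-16 LE file whose first character after the BOM is NUL
would be taken for UTF-32 LE by this sketch; that ambiguity is inherent
in the byte patterns, not a quirk of the code.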

- UCS-2 can only represent codepoints up to U+FFFF and uses 2 bytes per 
codepoint.
- UTF-16 extends UCS-2 by using a pair of surrogates (2 bytes each) for
every codepoint from U+10000 to U+10FFFF, so these take 4 bytes. It
still uses 2 bytes per codepoint below U+10000.
- UTF-32 uses 4 bytes (one 32-bit doubleword) per codepoint.
- UTF-8 was originally designed to use between 1 and 6 bytes per
codepoint to represent codepoints from U+0000 to U+7FFFFFFF. Since then,
the Unicode Consortium has decided that no codepoint above U+10FFFF
would ever be valid. Unless that decision is rescinded in the future, 4
bytes or fewer are enough to represent any valid codepoint in UTF-8.
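
The byte counts in the list above can be checked with a few lines of
Python. This is only a sketch: encoded_lengths and surrogate_pair are
names I invented for the occasion, and the explicit "-be" codecs are
used so that Python does not prepend a BOM and inflate the counts.

```python
def encoded_lengths(cp):
    """Bytes needed for codepoint cp in each encoding (no BOM).

    Illustrative helper; cp must be a valid non-surrogate codepoint
    in the range U+0000 to U+10FFFF.
    """
    ch = chr(cp)
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


def surrogate_pair(cp):
    """Split a codepoint from U+10000 to U+10FFFF into its UTF-16
    high and low surrogates (illustrative helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need surrogates")
    v = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 + (v >> 10)        # top 10 bits
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits
    return high, low


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10000, 0x10FFFF):
        print("U+%04X" % cp, encoded_lengths(cp))
```

For U+20AC (the euro sign), for example, this gives 3 bytes in UTF-8, 2
in UTF-16 and 4 in UTF-32; anything from U+10000 upward takes 4 bytes in
all three.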


Best regards,
Tony.
-- 
Chicago Transit Authority Rider's Rule #36:
        Never ever ask the tough looking gentleman wearing El Rukn
headgear where he got his "pyramid powered pizza warmer".
                -- Chicago Reader 3/27/81

