On 14/11/08 13:37, Ben Schmidt wrote:
[James Kanze wrote]
[...]
>> I presume that the BOM testing only concerns the first 4 bytes,
>
> I think it can be up to 6. But I don't really know. I could look it up,
> but I couldn't be bothered. It doesn't matter. Yes, it's only a few
> bytes and the BOM detection will fail after that and fall through to the
> next of the fencs encodings.
[...]
The BOM occupies the first two bytes in UTF-16 (or its UCS-2 subset
which, rightly, Vim no longer recognizes when reading an existing file
with a BOM), the first three in UTF-8, and the first four in UTF-32
(which can also be called UCS-4). See my other post for the details.
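
To make those byte counts concrete, here is a minimal Python sketch of
BOM sniffing. It is only an illustration, not how Vim implements the
check; the name sniff_bom and the table BOMS are hypothetical. The
UTF-32 signatures are tested first because the UTF-32 little-endian BOM
(FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the order of the checks matters.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # 4 bytes
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 4 bytes
    (codecs.BOM_UTF8, "utf-8"),          # 3 bytes
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 2 bytes
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 2 bytes
]


def sniff_bom(head):
    """Return (encoding, bom_length) for a BOM at the start of head.

    Returns (None, 0) when no known BOM is present. Hypothetical helper,
    written for illustration only.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

At most the first four bytes of the file are ever needed to decide, so
a caller could pass something like f.read(4) as head.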
- UCS-2 can only represent codepoints up to U+FFFF and uses 2 bytes per
codepoint.
- UTF-16 extends UCS-2 by encoding each codepoint from U+10000 to
U+10FFFF as a pair of surrogates (2 bytes each), hence 4 bytes. It
still uses 2 bytes per codepoint below U+10000.
- UTF-32 uses 4 bytes (one 32-bit doubleword) per codepoint.
- UTF-8 was originally designed to use between 1 and 6 bytes per
codepoint to represent codepoints from U+0000 to U+7FFFFFFF. Since then,
the Unicode Consortium has decided that no codepoint above U+10FFFF
would ever be valid. Unless that decision is rescinded in the future, 4
bytes or fewer are enough to represent any valid codepoint in UTF-8.
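
The per-codepoint sizes listed above can be checked with a short Python
sketch. The helper name encoded_lengths is hypothetical; the "-le"
codec variants are used so that Python does not prepend a BOM to the
encoded bytes.

```python
def encoded_lengths(ch):
    """Return the encoded size in bytes of one character per encoding.

    The explicit little-endian codecs are used so that no BOM is added
    to the output. Hypothetical helper, written for illustration only.
    """
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


for cp in (0x41, 0x20AC, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+%04X" % cp, encoded_lengths(chr(cp)))
```

Codepoints below U+10000 take 2 bytes in UTF-16 and 1 to 3 in UTF-8;
from U+10000 up to U+10FFFF all three encodings take 4 bytes.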
Best regards,
Tony.
--
Chicago Transit Authority Rider's Rule #36:
Never ever ask the tough looking gentleman wearing El Rukn
headgear where he got his "pyramid powered pizza warmer".
-- Chicago Reader 3/27/81