Oops: ... two words are needed for codepoints 10000 to 10FFFF...
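
For anyone who wants to check the corrected range, here is a minimal Python sketch. The helper name utf16_units is made up for this illustration and is not anything in Vim; it only counts how many 16-bit words UTF-16 needs for a codepoint and compares the answer with Python's own encoder.

```python
def utf16_units(cp: int) -> int:
    """Number of 16-bit words UTF-16 needs for codepoint cp (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not representable in UTF-16")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate codepoints are not characters")
    # One word inside the BMP, a surrogate pair (two words) above it.
    return 1 if cp <= 0xFFFF else 2


for cp in (0xFFFF, 0x10000, 0x1FFFF, 0x10FFFF):
    # Cross-check against Python's encoder: UTF-16 uses 2 bytes per word.
    assert utf16_units(cp) == len(chr(cp).encode("utf-16-le")) // 2
    print(f"U+{cp:04X}: {utf16_units(cp)} word(s)")
```

The switch from one word to two happens at U+10000, not at U+1FFFF.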

Best regards,
Tony.

On Sun, Nov 1, 2015 at 12:45 AM, Tony Mechelynck
<[email protected]> wrote:
> On Sat, Oct 31, 2015 at 11:46 PM, Andre Sihera
> <[email protected]> wrote:
>> On 01/11/15 00:01, mattn wrote:
>>>>
>>>> Thanks.  Is there any corner case where we would need a few more bytes
>>>> than MAXPATHL?
>>>
>>> In utf-8, max bytes of letter should be 4. So MAXPATHL * 4.
>>>
>> No, the maximum length of a UTF-8 character is 6 bytes, as that is the
>> maximum required to encode all characters in ISO10646. The currently
>> defined character space only uses 4 bytes but new characters are always
>> being added.
>>
>> Note that ISO10646 is *not* a linear space. New characters can be
>> added anywhere in the space, including the very last character at the
>> top end (0xFFFFFFFF).
>>
>> We don't want to be changing this every time new characters are added
>> to the ISO standard, and it's hardly an issue of memory, so just set to the
>> maximum from the start.
>>
>
> The Unicode Consortium has decided that no codepoints will _ever_ be
> added above plane 10, and since the last two codepoints of every plane
> are also "noncharacters", this means that the highest codepoint which
> will ever be valid for public or private use is U+10FFFD. (I don't
> count "internal use" codepoints, which may be used in memory for
> temporary use, but will never be used even for private communication.)
> In addition, planes F and 10 are allocated to private use areas, and
> require agreement between the sender and the receiver. Prohibiting
> planes 11 and higher is consistent with the fact that UTF-16 cannot
> represent anything above U+10FFFF, even with surrogates.
>
> Now even if the whole UTF-32 range were to be valid, any codepoint up
> to U+1FFFFF can be represented in UTF-8 with no more than 4 bytes, and
> anything in the BMP (i.e. until U+FFFF) can be represented with no more
> than 3 bytes.
>
> Ken Takata said that the maximum path length is 260 UTF-16 words. In
> UTF-16, one word can represent anything in the BMP, two words are
> needed for codepoints 1FFFF to 10FFFF, and nothing higher can be
> represented. Since codepoints above the BMP count for two each, this
> means up to 3 UTF-8 bytes per UTF-16 word in the BMP, and 4 bytes per
> UTF-16 doubleword (or 2 bytes per word) above the BMP. So 3 * 260
> UTF-8 bytes are enough, and 1024 (the proposed buffer size) is just
> short of 4 * 260, which is more than enough.
>
>
> Best regards,
> Tony.
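
The buffer arithmetic in the quoted message can be sketched in Python as follows. This is only an illustration under the thread's stated assumption of a 260-word UTF-16 path limit; the names MAX_PATH_WORDS, utf8_len and worst_case_utf8_bytes are invented here and do not come from the Vim source.

```python
MAX_PATH_WORDS = 260   # path limit in UTF-16 words, as stated in the thread
PROPOSED_BUFFER = 1024  # proposed buffer size in bytes, as stated in the thread


def utf8_len(cp: int) -> int:
    """Bytes needed to encode codepoint cp in UTF-8 (up to the 4-byte forms)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp < 0x200000:
        return 4
    raise ValueError("needs more than 4 bytes")


def worst_case_utf8_bytes(words: int) -> int:
    """Largest UTF-8 size of any text that fits in `words` UTF-16 words.

    Each surrogate pair uses 2 words and 4 bytes; every remaining word is
    a BMP codepoint of at most 3 bytes. Try every possible number of pairs.
    """
    return max(
        utf8_len(0x10FFFF) * pairs + utf8_len(0xFFFF) * (words - 2 * pairs)
        for pairs in range(words // 2 + 1)
    )


worst = worst_case_utf8_bytes(MAX_PATH_WORDS)
# Cross-check: 260 copies of a 3-byte BMP codepoint hit the same bound.
assert worst == len(("\uffff" * MAX_PATH_WORDS).encode("utf-8"))
assert worst <= PROPOSED_BUFFER < 4 * MAX_PATH_WORDS
print(worst, PROPOSED_BUFFER, 4 * MAX_PATH_WORDS)
```

The worst case is a path made up entirely of 3-byte BMP codepoints, because every surrogate pair costs only 2 bytes per word.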

-- 
-- 
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

