On 01/11/15 08:50, Tony Mechelynck wrote:
Oops: ... two words are needed for codepoints 10000 to 10FFFF...

Best regards,
Tony.

On Sun, Nov 1, 2015 at 12:45 AM, Tony Mechelynck
<[email protected]> wrote:
On Sat, Oct 31, 2015 at 11:46 PM, Andre Sihera
<[email protected]> wrote:
On 01/11/15 00:01, mattn wrote:
Thanks.  Is there any corner case where we would need a few more bytes
than MAXPATHL?
In UTF-8, a character should take at most 4 bytes, so MAXPATHL * 4.

No, the maximum length of a UTF-8 character is 6 bytes, as that is the
maximum required to encode all characters in ISO10646. The currently
defined character space only uses 4 bytes but new characters are always
being added.

Note that ISO10646 is *not* a linear space. New characters can be
added anywhere in the space, including the very last character at the
top end (0xFFFFFFFF).

We don't want to be changing this every time new characters are added
to the ISO standard, and it's hardly an issue of memory, so just set it to
the maximum from the start.

The Unicode Consortium has decided that no codepoints will _ever_ be
added above plane 10, and since the last two codepoints of every plane
are also "noncharacters", this means that the highest codepoint which
will ever be valid for public or private use is U+10FFFD. (I don't
count "internal use" codepoints, which may be used in memory for
temporary use, but will never be used even for private communication.)
In addition, planes F and 10 are allocated to private use areas, and
require agreement between the sender and the receiver. Prohibiting
planes 11 and higher is consistent with the fact that UTF-16 cannot
represent anything above U+10FFFF, even with surrogates.
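
As an illustration of the UTF-16 limit mentioned above, here is a minimal Python sketch of surrogate-pair encoding. The function names (to_utf16_units, from_surrogates) are made up for this example and are not taken from Vim. It shows that the largest possible surrogate pair decodes to exactly U+10FFFF, so nothing above plane 10 can be expressed.

```python
def to_utf16_units(cp):
    """Return the UTF-16 code units (16-bit words) for codepoint cp."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not representable in UTF-16")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate values do not encode characters")
    if cp <= 0xFFFF:
        return [cp]                      # BMP: a single word
    v = cp - 0x10000                     # 20 bits to split over two words
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def from_surrogates(high, low):
    """Combine a high and a low surrogate back into a codepoint."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest high surrogate is 0xDBFF and the largest low surrogate is 0xDFFF.
print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

The pair carries 10 + 10 = 20 payload bits on top of the 0x10000 offset, which is where the 16 supplementary planes come from.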

Now even if the whole UTF-32 range were to be valid, any codepoint up
to U+1FFFFF can be represented in UTF-8 with no more than 4 bytes, and
anything in the BMP (i.e. up to U+FFFF) can be represented with no more
than 3 bytes.
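
To make these byte counts concrete, here is a small Python sketch of the length rule from the original UTF-8 scheme, the one that allowed up to 6 bytes. The name utf8_len is hypothetical and chosen only for this example; it is not a Vim function.

```python
def utf8_len(cp):
    """Bytes needed for codepoint cp under the original (up to 6-byte) UTF-8 scheme."""
    if cp < 0:
        raise ValueError("negative codepoint")
    # Inclusive upper bound of each length class: 7, 11, 16, 21, 26 and 31 bits.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    raise ValueError("not encodable in UTF-8")


print(utf8_len(0xFFFF))    # 3: top of the BMP
print(utf8_len(0x10FFFF))  # 4: highest Unicode codepoint
print(utf8_len(0x1FFFFF))  # 4: largest value that still fits in 4 bytes
```

Five and six bytes only come into play from 0x200000 upward, well beyond anything Unicode will assign.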

Ken Takata said that the maximum path length is 260 UTF-16 words. In
UTF-16, one word can represent anything in the BMP, two words are
needed for codepoints 1FFFF to 10FFFF, and nothing higher can be
represented. Since codepoints above the BMP count for two each, this
means up to 3 UTF-8 bytes per UTF-16 word in the BMP, and 4 bytes per
UTF-16 doubleword (or 2 bytes per word) above the BMP. So 3 * 260
UTF-8 bytes are enough, and 1024 (the proposed buffer size) is just
short of 4 * 260, which is more than enough.
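
The arithmetic above can be checked with a short Python sketch. It assumes the 260-word limit and the 1024-byte buffer quoted in this thread, and takes the two-word range as U+10000 to U+10FFFF. The constant and function names are invented for the example.

```python
MAX_PATH_UTF16_UNITS = 260   # path limit in UTF-16 words, as quoted in the thread
PROPOSED_BUFFER = 1024       # proposed buffer size in bytes


def worst_case_utf8_bytes(utf16_units):
    # A BMP character costs 1 word and at most 3 bytes: 3 bytes per word.
    # A supplementary character costs 2 words and 4 bytes: 2 bytes per word.
    # The worst case is therefore a path made only of 3-byte BMP characters.
    return 3 * utf16_units


def utf16_units(s):
    """Number of 16-bit words Python needs to encode s as UTF-16."""
    return len(s.encode("utf-16-le")) // 2


bmp_path = "\u20ac" * MAX_PATH_UTF16_UNITS              # 260 words of a 3-byte character
supp_path = "\U0001F600" * (MAX_PATH_UTF16_UNITS // 2)  # 130 characters, 2 words each

print(worst_case_utf8_bytes(MAX_PATH_UTF16_UNITS))      # 780
print(utf16_units(bmp_path), len(bmp_path.encode("utf-8")))    # 260 780
print(utf16_units(supp_path), len(supp_path.encode("utf-8")))  # 260 520
print(worst_case_utf8_bytes(MAX_PATH_UTF16_UNITS) <= PROPOSED_BUFFER)  # True
```

A path of supplementary characters is actually cheaper per word than one of 3-byte BMP characters, which is why 3 * 260 is the bound that matters.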


Best regards,
Tony.

Given that Unicode's 17-plane limit is the deciding factor here,
1024 bytes appears to be adequate.

--
You received this message from the "vim_dev" maillist.