On Sat, Oct 31, 2015 at 11:46 PM, Andre Sihera <[email protected]> wrote:
> On 01/11/15 00:01, mattn wrote:
>>>
>>> Thanks. Is there any corner case where we would need a few more bytes
>>> than MAXPATHL?
>>
>> In utf-8, max bytes of letter should be 4. So MAXPATHL * 4.
>>
> No, the maximum length of a UTF-8 character is 6 bytes, as that is the
> maximum required to encode all characters in ISO10646. The currently
> defined character space only uses 4 bytes but new characters are always
> being added.
>
> Note that ISO10646 is *not* a linear space. New characters can be
> added anywhere in the space, including the very last character at the
> top end (0xFFFFFFFF).
>
> We don't want to be changing this every time new characters are added
> to the ISO standard, and its hardly an issue of memory, so just set to the
> maximum from the start.
>

The Unicode Consortium has decided that no codepoints will _ever_ be added above plane 10 (hexadecimal, i.e. above U+10FFFF), and since the last two codepoints of every plane are also "noncharacters", this means that the highest codepoint which will ever be valid for public or private use is U+10FFFD. (I don't count "internal use" codepoints, which may be used in memory for temporary use, but will never be used even for private communication.) In addition, planes F and 10 are allocated to private use areas, and require agreement between the sender and the receiver.

Prohibiting planes 11 and higher is consistent with the fact that UTF-16 cannot represent anything above U+10FFFF, even with surrogates.

Now even if the whole UTF-32 range were to be valid, any codepoint up to U+1FFFFF can be represented in UTF-8 with no more than 4 bytes, and anything in the BMP (i.e. up to U+FFFF) can be represented with no more than 3 bytes.

Ken Takata said that the maximum path length is 260 UTF-16 words. In UTF-16, one word can represent anything in the BMP, two words (a surrogate pair) are needed for codepoints U+10000 to U+10FFFF, and nothing higher can be represented. Since codepoints above the BMP count for two words each, this means up to 3 UTF-8 bytes per UTF-16 word in the BMP, and 4 bytes per UTF-16 doubleword (or 2 bytes per word) above the BMP. So 3 * 260 = 780 UTF-8 bytes are enough, and 1024 (the proposed buffer size) is comfortably above that: it is just short of 4 * 260 = 1040, which would itself be more than enough.

Best regards,
Tony.
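
P.S. As a rough sanity check of the arithmetic above, here is a minimal C sketch. It is not Vim code: the names utf8_len, utf16_len, bound_holds and max_utf8_bytes are made up for illustration, and it assumes the usual UTF-8 and UTF-16 encoding lengths for codepoints up to U+10FFFF.

```c
#include <assert.h>

/* UTF-8 bytes needed for codepoint c (4 bytes cover everything up to U+1FFFFF). */
static int utf8_len(unsigned long c)
{
    if (c < 0x80UL)    return 1;
    if (c < 0x800UL)   return 2;
    if (c < 0x10000UL) return 3;   /* rest of the BMP */
    return 4;                      /* U+10000 and above */
}

/* UTF-16 words needed for codepoint c: 1 in the BMP, 2 (a surrogate pair) above it. */
static int utf16_len(unsigned long c)
{
    assert(c <= 0x10FFFFUL);       /* UTF-16 cannot represent anything higher */
    return c < 0x10000UL ? 1 : 2;
}

/* Returns 1 if no codepoint needs more than 3 UTF-8 bytes per UTF-16 word. */
static int bound_holds(void)
{
    unsigned long c;

    for (c = 0; c <= 0x10FFFFUL; c++) {
        if (c >= 0xD800UL && c <= 0xDFFFUL)
            continue;              /* surrogates are not characters themselves */
        if (utf8_len(c) > 3 * utf16_len(c))
            return 0;
    }
    return 1;
}

/* Worst-case UTF-8 size, in bytes, of a string of n UTF-16 words. */
static int max_utf8_bytes(int n)
{
    return 3 * n;
}
```

With these definitions, bound_holds() returns 1 and max_utf8_bytes(260) is 780, which fits in a 1024-byte buffer with room left over for a terminating NUL.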
