Oops: ... two words are needed for codepoints 10000 to 10FFFF...

Best regards,
Tony.

On Sun, Nov 1, 2015 at 12:45 AM, Tony Mechelynck <[email protected]> wrote:
> On Sat, Oct 31, 2015 at 11:46 PM, Andre Sihera
> <[email protected]> wrote:
>> On 01/11/15 00:01, mattn wrote:
>>>>
>>>> Thanks. Is there any corner case where we would need a few more bytes
>>>> than MAXPATHL?
>>>
>>> In utf-8, max bytes of letter should be 4. So MAXPATHL * 4.
>>>
>> No, the maximum length of a UTF-8 character is 6 bytes, as that is the
>> maximum required to encode all characters in ISO10646. The currently
>> defined character space only uses 4 bytes but new characters are always
>> being added.
>>
>> Note that ISO10646 is *not* a linear space. New characters can be
>> added anywhere in the space, including the very last character at the
>> top end (0xFFFFFFFF).
>>
>> We don't want to be changing this every time new characters are added
>> to the ISO standard, and it's hardly an issue of memory, so just set to the
>> maximum from the start.
>>
>
> The Unicode Consortium has decided that no codepoints will _ever_ be
> added above plane 10, and since the last two codepoints of every plane
> are also "noncharacters", this means that the highest codepoint which
> will ever be valid for public or private use is U+10FFFD. (I don't
> count "internal use" codepoints, which may be used in memory for
> temporary use, but will never be used even for private communication.)
> In addition, planes F and 10 are allocated to private use areas, and
> require agreement between the sender and the receiver. Prohibiting
> planes 11 and higher is consistent with the fact that UTF-16 cannot
> represent anything above U+10FFFF, even with surrogates.
>
> Now even if the whole UTF-32 range were to be valid, any codepoint up
> to U+1FFFFF can be represented in UTF-8 with no more than 4 bytes, and
> anything in the BMP (i.e. until U+FFFF) can be represented with no more
> than 3 bytes.
>
> Ken Takata said that the maximum path length is 260 UTF-16 words.
> In UTF-16, one word can represent anything in the BMP, two words are
> needed for codepoints 1FFFF to 10FFFF, and nothing higher can be
> represented. Since codepoints above the BMP count for two each, this
> means up to 3 UTF-8 bytes per UTF-16 word in the BMP, and 4 bytes per
> UTF-16 doubleword (or 2 bytes per word) above the BMP. So 3 * 260
> UTF-8 bytes are enough, and 1024 (the proposed buffer size) is just
> short of 4 * 260, which is more than enough.
>
> Best regards,
> Tony.
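
For anyone who wants to check the byte-count arithmetic in the quoted message, here is a minimal sketch. It is not Vim code: the function names and the MAX_PATH_UNITS and PROPOSED_BUFFER constants are made up for illustration, the 260 and 1024 figures are simply the ones quoted in the thread, and the upper bound of U+10FFFF is an assumption taken from the current Unicode range.

```python
# Illustrative only: hypothetical helper names, not part of Vim.

MAX_PATH_UNITS = 260    # path limit in UTF-16 words, as quoted in the thread
PROPOSED_BUFFER = 1024  # proposed buffer size in bytes, as quoted in the thread


def utf8_len(cp: int) -> int:
    """Number of UTF-8 bytes needed for code point cp (up to U+10FFFF)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    if cp <= 0x10FFFF:
        return 4
    raise ValueError("code point above U+10FFFF")


def utf16_units(cp: int) -> int:
    """Number of UTF-16 words needed for code point cp."""
    if cp < 0x10000:
        return 1          # anything in the BMP fits in one word
    if cp <= 0x10FFFF:
        return 2          # surrogate pair
    raise ValueError("code point above U+10FFFF")


def worst_case_utf8_bytes(units: int) -> int:
    """Upper bound on UTF-8 bytes for a string of `units` UTF-16 words."""
    all_bmp = 3 * units             # every word is a 3-byte BMP character
    all_astral = 4 * (units // 2)   # every pair of words is a 4-byte character
    return max(all_bmp, all_astral)


if __name__ == "__main__":
    # Cross-check the helpers against Python's own encoders at the boundaries.
    for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
        ch = chr(cp)
        assert utf8_len(cp) == len(ch.encode("utf-8"))
        assert utf16_units(cp) == len(ch.encode("utf-16-le")) // 2
    need = worst_case_utf8_bytes(MAX_PATH_UNITS)
    print(need, need <= PROPOSED_BUFFER)
```

The worst case comes from BMP characters (3 bytes per word), since characters above the BMP cost only 2 bytes per word on average.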
