2015-11-01 2:54 GMT+03:00 Dominique Pellé <[email protected]>:
> Andre Sihera <[email protected]> wrote:
>
>> On 01/11/15 00:01, mattn wrote:
>>>>
>>>> Thanks. Is there any corner case where we would need a few more bytes
>>>> than MAXPATHL?
>>>
>>> In utf-8, max bytes of letter should be 4. So MAXPATHL * 4.
>>>
>> No, the maximum length of a UTF-8 character is 6 bytes, as that is the
>> maximum required to encode all characters in ISO10646. The currently
>> defined character space only uses 4 bytes but new characters are always
>> being added.
>>
>> Note that ISO10646 is *not* a linear space. New characters can be
>> added anywhere in the space, including the very last character at the
>> top end (0xFFFFFFFF).
>>
>> We don't want to be changing this every time new characters are added
>> to the ISO standard, and it's hardly an issue of memory, so just set to the
>> maximum from the start.
>
> No, Unicode is limited to U+10FFFF. Yes, UTF-8 could encode more
> but it's not allowed. So the maximum allowed sequence size is 4 bytes.
> See:
>
> https://en.wikipedia.org/wiki/UTF-8
> http://stackoverflow.com/questions/5924105/how-many-characters-can-be-mapped-with-unicode
>
> According to https://en.wikipedia.org/wiki/Unicode, Unicode-8.0 (the latest)
> currently defines 120,737 characters. So there are still plenty of available
> code points.
>
> Having said that, a file may contain invalid utf-8 with
> sequences longer than 4 bytes.
> They should be treated as errors and should not crash Vim.
How is it an error? Encoding is an application-level concern; most
filesystem drivers have no idea that the file names they store have any
encoding *at all*. From the driver's point of view a file name is a
sequence of *bytes*, without any meaning or encoding applied to them, so
a five-byte UTF-8-encoded sequence is a completely valid sequence for a
file name. You can easily get something like this from zip archives with
non-ASCII file names (because some of them contain names in an 8-bit
encoding, which unarchiver programs on Linux do not handle easily, not
because they have embedded invalid UTF-8 sequences for whatever reason).

Note that this is true for Linux. Windows and Mac think differently.

>
> Dominique
>
> --
> --
> You received this message from the "vim_dev" maillist.
> Do not top-post! Type your reply below the text you are replying to.
> For more information, visit http://www.vim.org/maillist.php
>
> ---
> You received this message because you are subscribed to the Google Groups
> "vim_dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
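
To make the distinction in this thread concrete, here is a minimal C sketch (not Vim code). It assumes the RFC 3629 rules Dominique refers to (at most 4 bytes per sequence, nothing above U+10FFFF) and a typical Linux filesystem that only forbids '/' and NUL in a name component. The function names utf8_lead_len, is_valid_utf8 and is_linux_name_bytes are made up for illustration, and the NAME_MAX length limit is deliberately ignored.

```c
#include <stddef.h>

/* Sequence length implied by a lead byte under the original UTF-8 design,
 * which allowed up to 6 bytes. Returns 0 for a byte that cannot start a
 * sequence (continuation byte, 0xFE, 0xFF). */
static int utf8_lead_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;   /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;   /* legacy form, not allowed by RFC 3629 */
    if (b < 0xFE) return 6;   /* legacy form, not allowed by RFC 3629 */
    return 0;
}

/* Returns 1 if s[0..n) is well-formed UTF-8 per RFC 3629: at most 4 bytes
 * per sequence, no overlong forms, no surrogates, nothing above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        int len = utf8_lead_len(s[i]);
        unsigned long cp;
        int k;

        if (len == 0 || len > 4 || (size_t)len > n - i)
            return 0;
        if (len == 1) {
            i++;
            continue;
        }
        cp = s[i] & (0xFF >> (len + 1));      /* payload bits of lead byte */
        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)    /* must be 10xxxxxx */
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800)
                || (len == 4 && cp < 0x10000))
            return 0;                         /* overlong encoding */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0;
        i += (size_t)len;
    }
    return 1;
}

/* Returns 1 if s[0..n) could be one path component on a typical Linux
 * filesystem: non-empty and free of '/' and NUL. Assumption: no length
 * limit is checked, and no encoding is checked at all. */
static int is_linux_name_bytes(const unsigned char *s, size_t n)
{
    size_t i;

    if (n == 0)
        return 0;
    for (i = 0; i < n; i++)
        if (s[i] == '/' || s[i] == '\0')
            return 0;
    return 1;
}
```

With the five bytes F8 88 80 80 80, is_linux_name_bytes() returns 1 while is_valid_utf8() returns 0: the name can exist on disk even though it is not valid UTF-8. The same holds for a Latin-1 name such as the bytes 63 61 66 E9 ("café" in ISO 8859-1), which is the zip archive case mentioned above.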
