Well, originally Unicode codepoints were foreseen as possibly someday extending from U+0000 to U+7FFFFFFF; UTF-32 (aka UCS-4) and UTF-8 (in its original form of up to 6 bytes) can address that, and so can Vim; but UCS-2 could only address the BMP (the Basic Multilingual Plane), i.e. up to U+FFFF. Later UCS-2 was expanded to UTF-16 by means of surrogate code points, and UTF-16 can go as high as U+10FFFF but no higher, so the authorities responsible for Unicode decided that no codepoint higher than U+10FFFF would ever be given a value, or indeed considered valid. Now the earlier maximum, U+7FFFFFFF, is represented by the hex bytes FD BF BF BF BF BF (6 bytes) while the newer maximum, U+10FFFF, is represented as F4 8F BF BF (4 bytes). Since Vim goes by the earlier standard, it still reserves 6 bytes per spacing character.
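To make those byte counts concrete, here is a minimal C sketch of the original 31-bit UTF-8 scheme. The names utf8_len_31bit and utf8_encode_31bit are made up for this illustration; they are not Vim's own routines, and the code only assumes the well-known 1-to-6-byte layout of the older UTF-8 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes needed for code point c under the original UTF-8
 * scheme, which covers U+0000..U+7FFFFFFF in 1 to 6 bytes.
 * (Hypothetical helper, not taken from Vim's source.) */
static int utf8_len_31bit(uint32_t c)
{
    assert(c <= 0x7FFFFFFFu);
    if (c < 0x80u)      return 1;
    if (c < 0x800u)     return 2;
    if (c < 0x10000u)   return 3;   /* everything in the BMP fits here */
    if (c < 0x200000u)  return 4;   /* includes U+10FFFF */
    if (c < 0x4000000u) return 5;
    return 6;
}

/* Encode c into buf (room for at least 6 bytes) and return the number
 * of bytes written. (Hypothetical helper, not taken from Vim's source.) */
static int utf8_encode_31bit(uint32_t c, unsigned char *buf)
{
    /* Lead-byte prefixes indexed by sequence length (index 0 unused). */
    static const unsigned char lead[7] =
        {0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};
    int len = utf8_len_31bit(c);
    int i;

    /* Fill continuation bytes (10xxxxxx) from the end, 6 bits at a time. */
    for (i = len - 1; i > 0; --i) {
        buf[i] = (unsigned char)(0x80u | (c & 0x3Fu));
        c >>= 6;
    }
    /* Whatever is left of c fits in the payload bits of the lead byte. */
    buf[0] = (unsigned char)(lead[len] | c);
    return len;
}
```

Calling utf8_encode_31bit(0x7FFFFFFF, buf) yields the 6 bytes FD BF BF BF BF BF, and utf8_encode_31bit(0x10FFFF, buf) yields the 4 bytes F4 8F BF BF, while utf8_len_31bit(0xFFFF) returns 3, which is why 3 bytes are enough for any BMP character.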
But this is not all. Unicode also knows combining characters, which occupy no space by themselves but are printed on top of the previous codepoint, sometimes modifying its shape (think of accents, underlines, overlines, etc.). Each of these also gets its own codepoint, and there may be several on a single spacing character. The 'maxcombine' option, which can range from 0 to 6 (default 2), defines how many Vim will accept. Arabic can usually print even the most complex vocalised Quranic text with no more than 2 combining characters per spacing character, while Hebrew may require 4 in some cases, so Vim took some safety margin and allows up to 6.

But why only three bytes for each combining character? Well, 3 UTF-8 bytes can address everything in the BMP (i.e. U+0000 to U+FFFF) and I suppose that combining characters higher than that were not foreseen. Additionally, as I read the code you quoted, Vim assumes that only BMP spacing characters (U+0000 to U+FFFF) will ever need combining characters, so we arrive at either 6 bytes for a spacing character, no matter how high, with no combining characters, or 7 times 3 = 21 bytes for one spacing character plus up to 6 combining characters, all of them in the BMP. The larger of the two, 21, is the value of MB_MAXBYTES.

Best regards,
Tony.

On Sat, Jun 6, 2020 at 2:21 AM Matt Anonyme <[email protected]> wrote:
> hi,
>
> I am trying to hack on vim's codebase but there is something I don't get,
> that is the value of MB_MAXBYTES defined at:
>
> https://github.com/vim/vim/blob/c17e66c5c0acd5038f1eb3d7b3049b64bb6ea30b/src/vim.h#L1771
>
> Here is the description:
> ====
> /// Maximum number of bytes in a multi-byte character. It can be one 32-bit
> /// character of up to 6 bytes, or one 16-bit character of up to three bytes
> /// plus six following composing characters of three bytes each.
> #define MB_MAXBYTES 21
> ====
>
> I understand that 3 + 6 * 3 = 21 but I don't get how we can input a
> multibyte character of 21 bytes ? In what encoding/way is it possible ?
>
> Cheers

--
You received this message from the "vim_dev" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

---
You received this message because you are subscribed to the Google Groups "vim_dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/vim_dev/CAJkCKXvysWrJ8tWrT6PkCoYe6cd5Qt9ws7oMeD2e-L3xiE-%2BXA%40mail.gmail.com.
