On 19.03.2017 at 23:00, Christian Bartolomaeus via RT wrote:
Looking at https://en.wikipedia.org/wiki/Code_point and
http://www.unicode.org/glossary/#code_point I understand that
U+10ffff is indeed the maximum Unicode code point.
Yes, that's the maximum value UTF-8 is allowed to encode in four bytes,
see https://en.wikipedia.org/wiki/UTF-8#Description.
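
A quick way to check this, as a minimal sketch that assumes Python 3 (whose `chr` follows the same U+10ffff ceiling); the variable names are only illustrative:

```python
# Encode the highest code point and confirm it takes four UTF-8 bytes.
encoded = chr(0x10FFFF).encode("utf-8")
print(len(encoded), encoded.hex())  # 4 f48fbfbf

# One past the limit is not a code point at all, so chr() refuses it.
try:
    chr(0x110000)
    beyond_limit_rejected = False
except ValueError:
    beyond_limit_rejected = True
print(beyond_limit_rejected)  # True
```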
I was wondering how the Unicode consortium might extend this limit, so I
investigated a bit.
TL;DR
I can confirm that U+10ffff is going to remain the maximum for the
foreseeable future.
DETAILS
Technical limits:
UTF-8 could be extended up to 0x4108410ffff [1]
UTF-16 ("surrogate pairs") cannot be extended beyond 0x10ffff
UTF-32 can be extended up to 0xffffffff (32 bits available)
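
The UTF-16 ceiling follows directly from how a surrogate pair is combined. A small sketch of that arithmetic (the function name is made up for illustration):

```python
def decode_surrogate_pair(high, low):
    """Combine a UTF-16 high/low surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate contributes 10 bits on top of the 0x10000 offset.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest possible pair gives the largest UTF-16 code point.
utf16_max = decode_surrogate_pair(0xDBFF, 0xDFFF)
# A UTF-32 code unit is 32 bits wide.
utf32_max = 2**32 - 1

print(hex(utf16_max))  # 0x10ffff
print(hex(utf32_max))  # 0xffffffff
```

Both surrogate ranges are fully used by this scheme, which is why there is no room left to extend UTF-16.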
Political limits:
Since Java chose to use surrogate pairs, and UTF-16 is not extensible,
any motion to extend the Unicode code range would be met with opposition
from Oracle, and from any language community that has a JVM
implementation and wants to be interoperable with Java libraries.
Code space exhaustion:
Unicode assigns code points like this:
characters: 128,237 code points
private use: U+e000 to U+f8ff (6,400 code points) [2]
             U+f0000 to U+ffffd (65,534) [2]
             U+100000 to U+10fffd (65,534) [2]
surrogates: U+d800 to U+dfff (2,048 code points) [2]
So out of the
0x10ffff = 1,114,111 available code points,
128,237 + 6,400 + 65,534 + 65,534 = 265,705 are in use, leaving
848,406 free for future character set extension.
Assuming 10,000 new characters per year (which is conservative given the
numbers in [3]), the current encoding space will be exhausted in ca. 85
years.
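
For anyone who wants to re-run the estimate, here is the same arithmetic as a short sketch. It follows the accounting used in this message (the character count is taken from [2], and the 10,000 per year growth rate is an assumption, not a measured value):

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the figure used in this message

# Block sizes are derived from the range endpoints (inclusive).
in_use_by_category = {
    "characters": 128_237,                          # count from [2]
    "private use, BMP": 0xF8FF - 0xE000 + 1,        # 6,400
    "private use, plane 15": 0xFFFFD - 0xF0000 + 1,   # 65,534
    "private use, plane 16": 0x10FFFD - 0x100000 + 1, # 65,534
}

in_use = sum(in_use_by_category.values())
free = MAX_CODE_POINT - in_use

NEW_CHARACTERS_PER_YEAR = 10_000  # assumed growth rate
years_left = free / NEW_CHARACTERS_PER_YEAR

print(in_use)             # 265705
print(free)               # 848406
print(round(years_left))  # 85
```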
Regards,
Jo
[1]
The Unicode consortium could extend the maximum value of UTF-8 by using
more prefixes:
111110xx for 5-byte encoding
1111110x for 6-byte encoding
11111110 for 7-byte encoding
11111111 for 8-byte encoding
No prefixes are possible for a longer encoding.
Bit counts for each prefix are:
5-byte: nr of 4-byte encoding bits (21) + 5 = 26
6-byte: 26 + 5 = 31
7-byte: 31 + 5 = 36
8-byte: 36 + 6 (prefix does not lose a bit) = 42
The maximum 8-byte-encoded value is
0x10ffff + 2^26 + 2^31 + 2^36 + 2^42 = 0x4108410ffff
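
The bit counts and the sum can be reproduced with a few lines. This is only a sketch of the hypothetical extension described in this footnote, not anything the Unicode consortium has specified; `payload_bits` is an illustrative name:

```python
def payload_bits(length):
    """Payload bits of a UTF-8-style sequence of `length` bytes (2..8)."""
    if not 2 <= length <= 8:
        raise ValueError("length must be between 2 and 8")
    # The lead byte spends `length` one-bits plus a zero bit on the prefix;
    # from 7 bytes on, the prefix fills the whole lead byte.
    lead_bits = max(0, 7 - length)
    # Every continuation byte (10xxxxxx) carries 6 payload bits.
    return lead_bits + 6 * (length - 1)

bits = {n: payload_bits(n) for n in range(4, 9)}
print(bits)  # {4: 21, 5: 26, 6: 31, 7: 36, 8: 42}

# The sum as written in this footnote: the current maximum plus the
# number of values added by each longer encoding.
total = 0x10FFFF + sum(2 ** bits[n] for n in range(5, 9))
print(hex(total))  # 0x4108410ffff
```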
[2]
Numbers taken from
http://unicodebook.readthedocs.io/unicode.html#statistics
[3]
See https://en.wikipedia.org/wiki/Unicode#Versions