Re: [perl #130914] [BUG] chr() aliases codepoint numbers mod 2**32

2017-03-20 Thread Joachim Durchholz

On 19.03.2017 at 23:00, Christian Bartolomaeus via RT wrote:

> Looking at https://en.wikipedia.org/wiki/Code_point and
> http://www.unicode.org/glossary/#code_point I understand that
> U+10FFFF is indeed the maximum Unicode code point.


Yes, that's the maximum value that UTF-8 as currently specified can
encode, using four bytes; see
https://en.wikipedia.org/wiki/UTF-8#Description.
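
A quick way to check the four-byte claim, sketched in Python rather than
in the language of the ticket (Python's chr() happens to enforce the same
upper bound, which is an incidental property of Python, not of Rakudo):

```python
# Encode the highest Unicode code point and look at its UTF-8 bytes.
top = 0x10FFFF
encoded = chr(top).encode("utf-8")
print(len(encoded), encoded.hex())  # 4 f48fbfbf

# Python refuses anything above U+10FFFF.
try:
    chr(top + 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```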


I was wondering how the Unicode consortium might extend this limit, so I 
investigated a bit.



TL;DR

I can confirm that U+10ffff is going to remain the maximum for the 
foreseeable future.



DETAILS


Technical limits:

  UTF-8 could be extended up to 0x4108410ffff [1]
  UTF-16 ("surrogate pairs") cannot be extended beyond 0x10ffff
  UTF-32 can be extended up to 0xffffffff (32 bits available)
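
The UTF-16 and UTF-32 ceilings can be reproduced with a few lines of
Python. This is only a sketch of the standard surrogate-pair arithmetic;
the helper name decode_pair is made up for illustration:

```python
# Largest value a UTF-16 surrogate pair can express.
HIGH_MIN, HIGH_MAX = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_MIN, LOW_MAX = 0xDC00, 0xDFFF     # low (trail) surrogates

def decode_pair(high: int, low: int) -> int:
    """Combine a surrogate pair into a code point (standard UTF-16 rule)."""
    return 0x10000 + ((high - HIGH_MIN) << 10) + (low - LOW_MIN)

utf16_max = decode_pair(HIGH_MAX, LOW_MAX)
utf32_max = 2**32 - 1                 # all 32 bits set

print(hex(utf16_max))  # 0x10ffff
print(hex(utf32_max))  # 0xffffffff
```

Each surrogate carries 10 payload bits, so a pair covers 2^20 values on
top of the 0x10000 reachable without surrogates; there is no spare bit
pattern left to grow into.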


Political limits:

Since Java chose to use surrogate pairs, and UTF-16 is not extensible,
any motion to extend the Unicode code range would be met with opposition 
from Oracle, and from any language community that has a JVM 
implementation and wants to be interoperable with Java libraries.



Code space exhaustion:

Unicode assigns code points like this:
  characters:  128,237 code points
  private use: U+e000..U+f8ff (6,400 code points) [2]
               U+f0000..U+ffffd (65,534) [2]
               U+100000..U+10fffd (65,534) [2]
  surrogates:  U+d800..U+dfff (2,048 code points) [2]
So out of the
  0x10ffff = 1,114,111 available code points,
  128,237 + 6,400 + 65,534 + 65,534 = 265,705 are in use, leaving
  848,406 free for future character set extension.

Assuming 10,000 new characters per year (which is conservative given the 
numbers in [3]), the current encoding space will be exhausted in about 85 
years.
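
The arithmetic behind that estimate, as a small Python sketch. All input
figures are the ones quoted in this mail (from [2]); the growth rate of
10,000 characters per year is the assumption stated above, not a
measured value:

```python
# Figures as quoted in the mail (taken from its reference [2]).
assigned_characters = 128_237
private_use = 6_400 + 65_534 + 65_534
highest_code_point = 0x10FFFF          # 1,114,111

in_use = assigned_characters + private_use
free = highest_code_point - in_use

# Assumed growth rate: 10,000 new characters per year.
years_left = free / 10_000

print(in_use)             # 265705
print(free)               # 848406
print(round(years_left))  # 85
```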



Regards,
Jo

[1]
The Unicode consortium could extend the maximum value of UTF-8 by using 
more prefixes:

  111110xx for 5-byte encoding
  1111110x for 6-byte encoding
  11111110 for 7-byte encoding
  11111111 for 8-byte encoding
No prefixes are possible for a longer encoding.

Bit counts for each prefix are:
  5-byte: nr of 4-byte encoding bits (21) + 5 = 26
  6-byte: 26 + 5 = 31
  7-byte: 31 + 5 = 36
  8-byte: 36 + 6 (prefix does not lose a bit) = 42
The maximum 8-byte-encoded value is
  0x10ffff+2^26+2^31+2^36+2^42 = 0x4108410ffff
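
The bit counts can be checked with a short Python sketch. It assumes the
hypothetical lead bytes are a run of one-bits followed by a zero bit
where one still fits, and that every continuation byte carries 6 payload
bits; payload_bits is an illustrative name, not part of any standard.
The last lines simply evaluate the sum as written in this footnote:

```python
def payload_bits(length: int) -> int:
    """Payload bits of a UTF-8-style sequence of `length` bytes (4..8)."""
    # Lead byte: `length` one-bits plus a terminating zero bit, except
    # that the 8-byte lead byte is all ones and has no free bits.
    lead_free = max(0, 8 - (length + 1))
    # Each continuation byte (10xxxxxx) contributes 6 bits.
    return lead_free + 6 * (length - 1)

# Prints 4 21, 5 26, 6 31, 7 36, 8 42 (one pair per line).
for n in range(4, 9):
    print(n, payload_bits(n))

# The sum as written in the footnote.
total = 0x10FFFF + 2**26 + 2**31 + 2**36 + 2**42
print(hex(total))  # 0x4108410ffff
```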

[2]
Numbers taken from
http://unicodebook.readthedocs.io/unicode.html#statistics

[3]
See https://en.wikipedia.org/wiki/Unicode#Versions


Re: [perl #130914] [BUG] chr() aliases codepoint numbers mod 2**32

2017-03-04 Thread Elizabeth Mattijsen
Fixed with https://github.com/rakudo/rakudo/commit/20fa14be7a , tests needed.

> On 4 Mar 2017, at 11:24, Zefram (via RT) wrote:
> 
> # New Ticket Created by  Zefram 
> # Please include the string:  [perl #130914]
> # in the subject line of all future correspondence about this issue. 
> # https://rt.perl.org/Ticket/Display.html?id=130914
> 
> 
>> chr(0x100000001).ords
> (1)
>> "\x[100000001]".ords
> (1)
>> chr(-0xffffffff).ords
> (1)
> 
> chr() is reducing the supplied codepoint number mod 2**32.  The output
> produced is not what the user asked for.  chr() should instead just
> signal an error for any codepoint outside the supported [0, 2**31) range.
> 
> -zefram
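
The aliasing in the report is what a 32-bit wraparound produces. A
minimal Python sketch of the wraparound and of the range check the
report asks for follows; it is not Rakudo's implementation, and
check_codepoint is a hypothetical helper:

```python
MOD = 2**32

# Reducing mod 2**32 maps both of these inputs onto code point 1.
print(0x100000001 % MOD)   # 1
print(-0xFFFFFFFF % MOD)   # 1

def check_codepoint(codepoint: int) -> int:
    """Hypothetical helper: reject out-of-range input instead of wrapping."""
    if not 0 <= codepoint < 2**31:
        raise ValueError(f"codepoint {codepoint:#x} outside [0, 2**31)")
    return codepoint

print(check_codepoint(1))  # 1
```

Calling check_codepoint(0x100000001) or check_codepoint(-0xFFFFFFFF)
raises ValueError rather than silently returning 1.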