RE: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-05 Thread Doug Ewell via Unicode
Martin J. Dürst wrote:

> Assuming (conservatively) that it will take about a century to fill up
> all 17 (well, actually 15, because two are private) planes, this would
> give us another century.

Current estimates seem to indicate that 800 years is closer to the mark.
 
--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-05 Thread William_J_G Overington via Unicode
Martin J. Dürst > Sorry to be late with this, but if 20.1 bits turn out to not 
be enough, what about 21 bits?

Martin J. Dürst > That would still limit UTF-8 to four bytes, but would almost 
double the code space. Assuming (conservatively) that it will take about a 
century to fill up all 17 (well, actually 15, because two are private) planes, 
this would give us another century.

Martin J. Dürst > Just one more crazy idea :-(.

An interesting possibility for application of some of the code points of those 
extra planes is to encode one code point for each Esperanto word that is in the 
PanLex database.

https://www.panlex.org/

That could provide a platform for assisting communication through the language 
barrier.

William Overington

Monday 5 June 2017




Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-05 Thread Richard Wordingham via Unicode
On Mon, 5 Jun 2017 13:08:06 +0900
"Martin J. Dürst via Unicode"  wrote:

> On 2017/06/02 04:54, Doug Ewell via Unicode wrote:
> > Richard Wordingham wrote:
> >   
> >> even supporting 6-byte patterns just in case 20.1 bits eventually
> >> turn out not to be enough,  
> 
> Sorry to be late with this, but if 20.1 bits turn out to not be
> enough, what about 21 bits?
> 
> That would still limit UTF-8 to four bytes, but would almost double
> the code space. Assuming (conservatively) that it will take about a
> century to fill up all 17 (well, actually 15, because two are
> private) planes, this would give us another century.

It all depends on how the lead byte is parsed.  With a block-if
construct ignorant of the original design, or with a look-up table, it
may be simplest to treat F5 onwards as out-and-out errors and not
expect any trailing bytes.  Code handling attempted 6-byte code points
was the most complex case.  Of course, one **might** want to handle a
list of mostly small positive integers, at which point the old UTF-8
design might be useful.
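The look-up-table approach Richard describes can be sketched as follows. This is an illustrative sketch, not code from the thread; the table maps each possible lead byte to its expected sequence length per RFC 3629 / Table 3-7, with 0xF5..0xFF (and stray continuation bytes) classified as immediate errors that consume no trailing bytes:

```python
def lead_byte_class(b: int) -> int:
    """Expected total sequence length for a UTF-8 lead byte, or 0 for a
    byte that cannot begin any well-formed sequence.  0xF5..0xFF are
    hard errors with no trailing bytes expected."""
    if b <= 0x7F:
        return 1                      # ASCII
    if 0xC2 <= b <= 0xDF:
        return 2
    if 0xE0 <= b <= 0xEF:
        return 3
    if 0xF0 <= b <= 0xF4:
        return 4
    return 0                          # 0x80..0xC1 and 0xF5..0xFF: error

# Precomputed 256-entry look-up table, as the message suggests:
LEAD_TABLE = bytes(lead_byte_class(b) for b in range(256))
```

With such a table, 0xF5 onwards never causes the parser to expect trailing bytes; each invalid lead byte is reported on its own.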

Richard.



Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-04 Thread David Starner via Unicode
On Sun, Jun 4, 2017 at 9:13 PM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> Sorry to be late with this, but if 20.1 bits turn out to not be enough,
> what about 21 bits?
>
> That would still limit UTF-8 to four bytes, but would almost double the
> code space. Assuming (conservatively) that it will take about a century
> to fill up all 17 (well, actually 15, because two are private) planes,
> this would give us another century.
>
> Just one more crazy idea :-(.
>

It seems hard to estimate the value of that, without knowing why we ran out
of characters. A slow collection of a huge number of Chinese ideographs and
new Native American scripts, maybe. Access to a library with a trillion
works over billions of years from millions of species, probably not. Given
that we're at no risk of running out of characters right now, speculating
on this seems pointless.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-04 Thread Martin J. Dürst via Unicode

On 2017/06/02 04:54, Doug Ewell via Unicode wrote:

> Richard Wordingham wrote:
>
>> even supporting 6-byte patterns just in case 20.1 bits eventually turn
>> out not to be enough,


Sorry to be late with this, but if 20.1 bits turn out to not be enough, 
what about 21 bits?


That would still limit UTF-8 to four bytes, but would almost double the 
code space. Assuming (conservatively) that it will take about a century 
to fill up all 17 (well, actually 15, because two are private) planes, 
this would give us another century.
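Martin's arithmetic checks out: a four-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx) carries 3 + 6 + 6 + 6 = 21 payload bits, so widening the code space from 10FFFF to the full 21-bit range would not require a fifth byte. A quick sanity check (illustrative only):

```python
current = 0x10FFFF + 1          # 1,114,112 code points today
widened = 1 << 21               # 2,097,152 with all 21 payload bits used

print(widened / current)        # ≈ 1.88: "almost double" the code space
```

On the encoder side, the only change would be re-admitting lead bytes F5..F7 and the currently excluded part of the F4 range.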


Just one more crazy idea :-(.

Regards,   Martin.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 8:32 PM, Richard Wordingham via Unicode wrote:

> TUS Section 3 is like the Augean Stables.  It is a complete mess as a
> standards document,


That is a matter of editorial taste, I suppose.


> imputing mental states to computing processes.


That, however, is false. The rhetorical turn in the Unicode Standard's 
conformance clauses, "A process shall interpret..." and "A process shall 
not interpret..." has been in the standard for 21 years, and seems to 
have done its general job in guiding interoperable, conformant 
implementations fairly well. And everyone -- well, perhaps almost 
everyone -- has been able to figure out that such wording is a shorthand 
for something along the lines of "Any person implementing software 
conforming to the Unicode Standard in which a process does X shall 
implement it in such a way that that process when doing X shall follow 
the specification part Y, relevant to doing X, exactly according to that 
specification of Y...", rather than a misguided assumption that software 
processes are cognitive agents equipped with mental states that the 
standard can "tell what to think".


And I contend that the shorthand works just fine.



> Table 3-7 for example, should be a consequence of a 'definition' that
> UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
> forms'.


Well, Definition D92 does already explicitly limit UTF-8 to Unicode 
scalar values, and explicitly limits the form to sequences of one to 
four bytes. The reason why it doesn't explicitly include the exclusion 
of "non-shortest form" in the definition, but instead refers to Table 
3-7 for the well-formed sequences (which, btw explicitly rule out all 
the non-shortest forms), is because that would create another 
terminological conundrum -- trying to specify an air-tight definition of 
"non-shortest form (of UTF-8)" before UTF-8 itself is defined. It is 
terminologically cleaner to let people *derive* non-shortest form from 
the explicit exclusions of Table 3-7.



> Instead, the exclusion of the sequence  is presented
> as a brute definition, rather than as a consequence of 0xD800 not being
> a Unicode scalar value. Likewise, 0xFC fails to be legal because it
> would define either a 'non-shortest form' or a value that is not a
> Unicode scalar value.


Actually 0xFC fails quite simply and unambiguously, because it is not in 
Table 3-7. End of story.


Same for 0xFF. There is nothing architecturally special about 
0xF5..0xFF. All are simply and unambiguously excluded from any 
well-formed UTF-8 byte sequence.




> The differences are a matter of presentation; the outcome as to what is
> permitted is the same.  The difference lies rather in whether the rules
> are comprehensible.  A comprehensible definition is more likely to be
> implemented correctly.  Where the presentation makes a difference is in
> how malformed sequences are naturally handled.


Well, I don't think implementers have all that much trouble figuring out 
what *well-formed* UTF-8 is these days.


As for "how malformed sequences are naturally handled", I can't really 
say. Nor do I think the standard actually requires any particular 
handling to be conformant. It says thou shalt not emit them, and if you 
encounter them, thou shalt not interpret them as Unicode characters. 
Beyond that, it would be nice, of course, if people converged their 
error handling for malformed sequences in cooperative ways, but there is 
no conformance statement to that effect in the standard.


I have no trouble with the contention that the wording about "best 
practice" and "recommendations" regarding the handling of U+FFFD has 
caused some confusion and differences of interpretation among 
implementers. I'm sure the language in that area could use cleanup, 
precisely because it has led to contending, incompatible interpretations 
of the text. As to what actually *is* best practice in use of U+FFFD 
when attempting to convert ill-formed sequences handed off to UTF-8 
conversion processes, or whether the Unicode Standard should attempt to 
narrow down or change practice in that area, I am completely agnostic. 
Back to the U+FFFD thread for that discussion.


--Ken



Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 19:19:51 -0700
Ken Whistler via Unicode  wrote:

> >   and therefore should start a
> > sequence of 6 characters.  
> 
> That is completely false, and has nothing to do with the current 
> definition of UTF-8.
> 
> The current, normative definition of UTF-8, in the Unicode Standard,
> and in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly
> "obsoletes and replaces RFC 2279") states clearly that 0xFC cannot
> start a sequence of anything identifiable as UTF-8.

TUS Section 3 is like the Augean Stables.  It is a complete mess as a
standards document, imputing mental states to computing processes.

Table 3-7 for example, should be a consequence of a 'definition' that
UTF-8 only represents Unicode Scalar values and excludes 'non-shortest
forms'. Instead, the exclusion of the sequence  is presented
as a brute definition, rather than as a consequence of 0xD800 not being
a Unicode scalar value. Likewise, 0xFC fails to be legal because it
would define either a 'non-shortest form' or a value that is not a
Unicode scalar value.

The differences are a matter of presentation; the outcome as to what is
permitted is the same.  The difference lies rather in whether the rules
are comprehensible.  A comprehensible definition is more likely to be
implemented correctly.  Where the presentation makes a difference is in
how malformed sequences are naturally handled.

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 6:21 PM, Richard Wordingham via Unicode wrote:

>> By definition D39b, either sequence of bytes, if encountered by a
>> conformant UTF-8 conversion process, would be interpreted as a
>> sequence of 6 maximal subparts of an ill-formed subsequence.
>
> ("D39b" is a typo for "D93b".)


Sorry about that. :)



> Conformant with what?  There is no mandatory *requirement* for a UTF-8
> conversion process conformant with Unicode to have any concept of
> 'maximal subpart'.


Conformant with the definition of UTF-8. I agree that nothing forces a 
conversion *process* to care anything about maximal subparts, but if 
*any* process using a conformant definition of UTF-8 then goes on to 
have any concept of "maximal subpart of an ill-formed subsequence" that 
departs from definition D93b in the Unicode Standard, then it is just 
making s**t up.





>> I don't see a good reason to build in special logic to treat FC 80 80
>> 80 80 80 as somehow privileged as a unit for conversion fallback,
>> simply because *if* UTF-8 were defined as the Unix gods intended
>> (which it ain't no longer) then that sequence *could* be interpreted
>> as an out-of-bounds scalar value (which it ain't) on spec that the
>> codespace *might* be extended past 10FFFF at some indefinite time in
>> the future (which it won't).
>
> Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
> invalid sequence.


That would be equally true of FF FF FF FF FF FF. Which was my point, 
actually.



> FC is not ASCII,


True, of course. But irrelevant. Because we are talking about UTF-8 
here. And just because some non-UTF-8 character encoding happened to 
include 0xFC as a valid (or invalid) value, might not require any 
special case processing. A simple 8-bit to 8-bit conversion table could 
be completely regular in its processing of 0xFC for a conversion.



> and has more than one leading bit
> set.  It has the six leading bits set,


True, of course.


> and therefore should start a
> sequence of 6 characters.


That is completely false, and has nothing to do with the current 
definition of UTF-8.


The current, normative definition of UTF-8, in the Unicode Standard, and 
in ISO/IEC 10646:2014, and in RFC 3629 (which explicitly "obsoletes and 
replaces RFC 2279") states clearly that 0xFC cannot start a sequence of 
anything identifiable as UTF-8.


--Ken








Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode  wrote:

> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D39b, either sequence of bytes, if encountered by a
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

There is a very good argument that 0xFC and 0xFF are not code units
(D77) - they are not used in the representation of any Unicode scalar
values.  By that argument, you have 5 maximal subparts and seven
garbage bytes.

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 1 Jun 2017 17:10:54 -0700
Ken Whistler via Unicode  wrote:

> On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:
> > You were implicitly invited to argue that there was no need to
> > handle 5 and 6 byte invalid sequences.
> >  
> 
> Well, working from the *current* specification:
> 
> FC 80 80 80 80 80
> and
> FF FF FF FF FF FF
> 
> are equal trash, uninterpretable as *anything* in UTF-8.
> 
> By definition D39b, either sequence of bytes, if encountered by a
> conformant UTF-8 conversion process, would be interpreted as a
> sequence of 6 maximal subparts of an ill-formed subsequence.

("D39b" is a typo for "D93b".)

Conformant with what?  There is no mandatory *requirement* for a UTF-8
conversion process conformant with Unicode to have any concept of
'maximal subpart'.

> I don't see a good reason to build in special logic to treat FC 80 80
> 80 80 80 as somehow privileged as a unit for conversion fallback,
> simply because *if* UTF-8 were defined as the Unix gods intended
> (which it ain't no longer) then that sequence *could* be interpreted
> as an out-of-bounds scalar value (which it ain't) on spec that the
> codespace *might* be extended past 10FFFF at some indefinite time in
> the future (which it won't).

Arguably, it requires special logic to treat FC 80 80 80 80 80 as an
invalid sequence.  FC is not ASCII, and has more than one leading bit
set.  It has the six leading bits set, and therefore should start a
sequence of 6 characters.
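Under the obsolete RFC 2279 design Richard alludes to, the number of leading 1 bits in the first byte announced the sequence length, so 0xFC (binary 11111100) would open a six-byte sequence. A sketch of that old rule (illustrative only; the function name is mine, and this is not part of any current standard):

```python
def rfc2279_lead_len(b: int) -> int:
    """Sequence length signalled by a lead byte under the obsolete
    RFC 2279 rule: count the leading 1 bits.  Zero leading ones means a
    single ASCII byte; note that exactly one leading one (10xxxxxx)
    marks a trailing byte, not a lead byte."""
    n = 0
    while n < 8 and b & (0x80 >> n):
        n += 1
    return 1 if n == 0 else n
```

0xFC has six leading ones, hence Richard's point; RFC 3629 later removed the five- and six-byte forms entirely.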

Richard.


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Ken Whistler via Unicode


On 6/1/2017 2:39 PM, Richard Wordingham via Unicode wrote:

> You were implicitly invited to argue that there was no need to handle
> 5- and 6-byte invalid sequences.



Well, working from the *current* specification:

FC 80 80 80 80 80
and
FF FF FF FF FF FF

are equal trash, uninterpretable as *anything* in UTF-8.

By definition D39b, either sequence of bytes, if encountered by a
conformant UTF-8 conversion process, would be interpreted as a sequence
of 6 maximal subparts of an ill-formed subsequence. Whatever your 
particular strategy for conversion fallbacks for uninterpretable 
sequences, it ought to treat either one of those trash sequences the 
same, in my book.
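Ken's claim can be checked mechanically. The sketch below (illustrative; the names are mine, not from the standard) walks a byte string using the Table 3-7 ranges and splits it into well-formed sequences and maximal subparts of ill-formed subsequences per D93b; both trash sequences come out as six one-byte subparts:

```python
# Second-byte ranges from Table 3-7; all other positions use 0x80-0xBF.
SPECIAL_SECOND = {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
                  0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}

def expected_len(lead):
    """Length of the well-formed sequence a lead byte starts, or 0."""
    if lead <= 0x7F: return 1
    if 0xC2 <= lead <= 0xDF: return 2
    if 0xE0 <= lead <= 0xEF: return 3
    if 0xF0 <= lead <= 0xF4: return 4
    return 0            # 0x80-0xC1 and 0xF5-0xFF never start a sequence

def split_d93b(data):
    """Return ('ok', seq) items for well-formed sequences and
    ('bad', subpart) items for maximal subparts of ill-formed
    subsequences, per definition D93b."""
    out, i = [], 0
    while i < len(data):
        n = expected_len(data[i])
        if n == 0:                       # invalid lead: 1-byte subpart
            out.append(('bad', data[i:i+1])); i += 1; continue
        lo, hi = SPECIAL_SECOND.get(data[i], (0x80, 0xBF))
        j = i + 1
        while j < len(data) and j - i < n and lo <= data[j] <= hi:
            lo, hi = 0x80, 0xBF          # later trailing bytes
            j += 1
        out.append(('ok' if j - i == n else 'bad', data[i:j])); i = j
    return out
```

Running this over FC 80 80 80 80 80 and FF FF FF FF FF FF gives six one-byte subparts in each case, so a fallback strategy keyed on maximal subparts does indeed treat the two sequences identically.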


I don't see a good reason to build in special logic to treat FC 80 80 80 
80 80 as somehow privileged as a unit for conversion fallback, simply 
because *if* UTF-8 were defined as the Unix gods intended (which it 
ain't no longer) then that sequence *could* be interpreted as an 
out-of-bounds scalar value (which it ain't) on spec that the codespace
*might* be extended past 10FFFF at some indefinite time in the future
(which it won't).


--Ken


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Philippe Verdy via Unicode
This is still very unlikely to occur. There has been a lot of discussion
about emoji, but they still don't account for much of the total.
The major additions were expected for CJK ideographs, but even their rate
has slowed down; we will eventually have another sinographic plane, but it
will not come soon and will be very slow to fill in. This still leaves
enough planes for several decades or more.

Maybe in the next century a new encoding will be designed, but we have
ample time to prepare for it and to reflect the best practices and
experience acquired. It will probably happen not because we lack code
points, but because experimentation will have shown that another encoding
performs better and is less complex to manage (much like the ongoing
transition from XML to JSON for the UCD), and because the current
supporters of Unicode will prefer the new format and will have implemented
it (starting with automatic conversion from the existing Unicode and
ISO 10646 encoding, which will no longer be needed in deployed client
applications).

I bet it will still be an 8-bit-based encoding using 7-bit ASCII (at least
the graphic part plus a few controls, though some other controls may be
remapped), but it could also simply be a new 32-bit or 64-bit encoding.

Before any such change occurs, it will need to be shown that the new
encoding performs better and allows a smooth transition with excellent
compatibility (possibly with efficient transcoders), and many
implementation "quirks" will have to be resolved (including security
risks).

2017-06-01 21:54 GMT+02:00 Doug Ewell via Unicode :

> Richard Wordingham wrote:
>
> > even supporting 6-byte patterns just in case 20.1 bits eventually turn
> > out not to be enough,
>
> Oh, gosh, here we go with this.
>
> What will we do if 31 bits turn out not to be enough?
>
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Richard Wordingham via Unicode
On Thu, 01 Jun 2017 12:54:45 -0700
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
> 
> > even supporting 6-byte patterns just in case 20.1 bits eventually
> > turn out not to be enough,  
> 
> Oh, gosh, here we go with this.

You were implicitly invited to argue that there was no need to handle
5- and 6-byte invalid sequences.

> What will we do if 31 bits turn out not to be enough?

A compatible extension of UTF-16 to unbounded length has already been
designed.  Prefix bytes 0xFF can be used to extend the length for UTF-8
by 8 bytes at a time.  Extending UTF-32 is not beyond the wit of man,
and we know that UTF-16 could have been done better if the need had
been foreseen.

While it seems natural to hold a Unicode scalar value in a single
machine word of some length, this is not necessary, just highly
convenient.

In short, it won't be a big problem intrinsically.  The UCD may get a
bit unwieldy, which may be a problem for small systems without Internet
access.

Richard.


Running out of code points, redux (was: Re: Feedback on the proposal...)

2017-06-01 Thread Doug Ewell via Unicode
Richard Wordingham wrote:

> even supporting 6-byte patterns just in case 20.1 bits eventually turn
> out not to be enough,

Oh, gosh, here we go with this.

What will we do if 31 bits turn out not to be enough?
 
 
--
Doug Ewell | Thornton, CO, US | ewellic.org