Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

Asmus Freytag Fri, 19 Aug 2011 17:39:05 -0700

On 8/19/2011 3:24 PM, Ken Whistler wrote:

On 8/19/2011 2:07 PM, Doug Ewell wrote:
Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).
Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't released until 1995, but the 10646 architecture dates back to circa 1986.


Yep.

So more likely it was a nod to C implementations which would, it was supposed, have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, and which would have wanted a signed 32-bit type to work. I suspect, by the way, that that limitation was probably originally brought to WG2 by the U.S. national body,
as they would have been the ones most worried about the C implementations
of 10646 multi-octet forms.

No, it was the Japanese NB, as represented by the individual from Toppan Printing.

This limitation was insisted upon in 1991, after the accord on the merger between
Unicode and 10646, when 10646 was changed to use a "flat" codespace, not the
ISO 2022-like scheme.

And the original architecture was also not really a full 32K planes in the sense
that we now understand planes for Unicode and 10646. The original design
for 10646 was for a 1- to 4-octet encoding, with all octets conforming to the
ISO 2022 specification. It used the option that the "working sets" for the
encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and
for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were
not used except for the single-octet form, as in 2022-conformant schemes
still used today for some East Asian character encodings.

And the octets were then designated G (group), P (plane), R (row), and C (cell).

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t implementations
out of negative signed range.)

The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for "plane swapping" and private
use start getting taken into account)
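The arithmetic above can be re-checked with a short sketch. This is only an illustration, not anything from the original 10646 drafts: the names `form_sizes` and `as_signed_32` are made up here, and the single-octet count assumes that the "95 + 96" refers to 0x20..0x7E plus 0xA0..0xFF. The last two values illustrate why restricting the Group octet to the low set keeps a 4-octet value out of negative range when it is stored in a signed 32-bit integer.

```python
# Working sets for each octet, as described in the message (ISO 2022 style).
G0 = range(0x21, 0x7F)   # 94 values: 0x21..0x7E
G1 = range(0xA1, 0xFF)   # 94 values: 0xA1..0xFE


def form_sizes():
    """Return the number of code positions in each of the four forms."""
    per_octet = len(G0) + len(G1)      # 188 usable values per octet
    one = 95 + 96                      # assumption: 0x20..0x7E plus 0xA0..0xFF
    two = per_octet ** 2
    three = per_octet ** 3
    four = len(G0) * per_octet ** 3    # Group octet limited to the low set
    return {1: one, 2: two, 3: three, 4: four}


def as_signed_32(value):
    """Reinterpret an unsigned 32-bit value as a two's-complement signed one."""
    return value - (1 << 32) if value & 0x80000000 else value


sizes = form_sizes()
total = sum(sizes.values())

# Largest value with the Group octet in the low set, and the smallest value
# that a Group octet from the high set would have produced.
largest_low_group = int.from_bytes(bytes([0x7E, 0xFE, 0xFE, 0xFE]), "big")
smallest_high_group = int.from_bytes(bytes([0xA1, 0x21, 0x21, 0x21]), "big")

for octets, count in sizes.items():
    print(f"{octets}-octet form: {count:,}")
print(f"total: {total:,}")
print(as_signed_32(largest_low_group) >= 0)    # low-set Group stays non-negative
print(as_signed_32(smallest_high_group) < 0)   # high-set Group would go negative
```

Any Group octet from the high set (0xA1..0xFE) has its top bit set, which is the sign bit of a 32-bit two's-complement integer.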

This was so mind-bogglingly complicated that it was a deal breaker for many companies. Unicode's more restrictive concept of a character or its combining technology or many other innovations weren't initially seen as its primary benefits by people being faced with evaluating the differences between the formal ISO-backed project and the de-facto industry collaboration forming around Apple and Xerox. But the flat codespace, now you were talking.

Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.
So a lot less than 2.1 billion characters. But I think Doug's point is still valid:
631 million plus code points was still overkill for the problem to
be addressed.
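For a sense of scale, the three figures in this exchange can be put side by side. The sketch below assumes the standard UTF-16 surrogate ranges (0xD800..0xDBFF and 0xDC00..0xDFFF) as the reason for the 17 planes; the variable names are illustrative only.

```python
PLANE_SIZE = 0x10000                  # 65,536 code points per plane
HIGH_SURROGATES = 0xDC00 - 0xD800     # 1,024 lead values (0xD800..0xDBFF)
LOW_SURROGATES = 0xE000 - 0xDC00      # 1,024 trail values (0xDC00..0xDFFF)

# Code points reachable through surrogate pairs, and the resulting plane count.
supplementary = HIGH_SURROGATES * LOW_SURROGATES
planes = 1 + supplementary // PLANE_SIZE      # BMP plus supplementary planes
codespace = planes * PLANE_SIZE               # 0x110000

signed_32_positions = 2 ** 31                 # the "2.1 billion" figure
original_10646_total = 631_279_375            # grand total quoted in this thread

print(f"UTF-16 era codespace: {planes} planes, {codespace:,} code points")
print(f"original 10646 forms: {original_10646_total:,}")
print(f"signed 32-bit range:  {signed_32_positions:,}")
```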
And I think that we can thank our lucky stars that it isn't *that* architecture for a universal character encoding that we would now be implementing and debating on
the alternative universe version of this email list. ;-)


Even remembering it makes my head hurt.

A./
