On 8/19/2011 3:24 PM, Ken Whistler wrote:
On 8/19/2011 2:07 PM, Doug Ewell wrote:
Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).

Well, yes, but it didn't really have anything to do with Java. Remember that Java wasn't released until 1995, but the 10646 architecture dates back to circa 1986.

Yep.

So more likely it was a nod to C implementations, which, it was supposed, would have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, and which would have wanted a signed 32-bit type to work. I suspect, by the way, that that limitation was probably originally brought to WG2 by the U.S. national body,
as they would have been the ones most worried about the C implementations
of 10646 multi-octet forms.

No, it was the Japanese NB, as represented by the individual from Toppan Printing.

This limitation was insisted upon in 1991, after the accord on the merger between
Unicode and 10646, when 10646 was changed to use a "flat" codespace, not the
ISO 2022-like scheme.


And the original architecture was also not really a full 32K planes in the sense
that we now understand planes for Unicode and 10646. The original design
for 10646 was for a 1- to 4-octet encoding, with all octets conforming to the
ISO 2022 specification. It used the option that the "working sets" for the
encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and
for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were
not used except for the single-octet form, as in 2022-conformant schemes
still used today for some East Asian character encodings.

And the octets were then designated G (group), P (plane), R (row), and C (cell).

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions.

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions.

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t implementations
out of negative signed range.)

The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions.

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for "plane swapping" and private
use start getting taken into account)
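
The tallies above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic as described in this message: the names G0, G1, UNITS and code_positions are made up for illustration, and the 1-octet figure (95 + 96) is taken as given rather than derived from byte ranges.

```python
# Working sets as described in the message: G0 = 0x21..0x7E, G1 = 0xA1..0xFE.
G0 = range(0x21, 0x7E + 1)
G1 = range(0xA1, 0xFE + 1)
UNITS = len(G0) + len(G1)  # 94 + 94 usable values per octet


def code_positions():
    """Return the number of code positions for each octet form."""
    return {
        1: 95 + 96,                # single-octet form, figure from the message
        2: UNITS ** 2,             # R and C octets
        3: UNITS ** 3,             # P, R and C octets
        4: len(G0) * UNITS ** 3,   # G octet limited to the low set of 94
    }


counts = code_positions()
total = sum(counts.values())

for octets, n in counts.items():
    print(f"{octets}-octet form: {n:,}")
print(f"grand total: {total:,}")
```

The only asymmetry is in the 4-octet form, where the Group octet contributes a factor of 94 rather than 188.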

This was so mind-bogglingly complicated that it was a deal breaker for many companies. People evaluating the formal ISO-backed project against the de facto industry collaboration forming around Apple and Xerox did not initially see Unicode's more restrictive concept of a character, its combining technology, or its many other innovations as its primary benefits. But the flat code space, now you were talking.

Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.

So a lot less than 2.1 billion characters. But I think Doug's point is still valid:
631 million plus code points was still overkill for the problem to
be addressed.
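
For scale, the three codespace sizes mentioned in this thread can be put side by side. This is a rough sketch, not anything from the standards themselves: the variable names are invented here, a plane is assumed to be 65,536 code points, and the UTF-16 figure assumes the usual 1,024 high by 1,024 low surrogate values.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane (assumed plane size)

# 32,768 planes: every code point fits in a non-negative signed 32-bit integer.
signed_32bit_limit = 32_768 * PLANE_SIZE  # 2**31, the "2.1 billion" figure

# UTF-16: one plane addressed directly, plus surrogate pairs for the rest.
supplementary = 0x400 * 0x400             # high surrogates x low surrogates
utf16_total = PLANE_SIZE + supplementary
utf16_planes = utf16_total // PLANE_SIZE

# Grand total for the original 1- to 4-octet design, as quoted in this message.
original_10646_total = 631_279_375

print(f"32,768 planes:         {signed_32bit_limit:,}")
print(f"original 10646 design: {original_10646_total:,}")
print(f"UTF-16 reachable:      {utf16_total:,} ({utf16_planes} planes)")
```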

And I think that we can thank our lucky stars that it isn't *that* architecture for a universal character encoding that we would now be implementing and debating on
the alternative universe version of this email list. ;-)

Even remembering it makes my head hurt.

A./
