Bernard Miller wrote:

> I’m afraid I have a little bit of a beef about the
> Unicode documentation here, forgive me if this has
> already been brought up. How come UAX #27 says that
> Unicode 3.0 had 34 non characters, 32 of which are in
> supplementary planes? First of all, there are no
> characters defined in supplementary planes in Unicode
> 3.0.

Correct.  However, the codepoints FFFE and FFFF in
*every* plane have been non-characters since Unicode
2.0 or even earlier.  They were mentioned in ISO 10646
if not in Unicode itself.  Two per plane across 17 planes
gives 34, of which 32 fall in the 16 supplementary planes.
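
To see where the 34 and 32 in UAX #27 come from, here is a small
Python sketch.  The names are mine, not from any Unicode document;
it simply lists the codepoints ending in FFFE or FFFF in each of
the 17 planes.

```python
# Illustrative only: enumerate the codepoints ending in FFFE/FFFF
# in each of the 17 planes (0 = BMP, 1..16 = supplementary).
PLANE_COUNT = 17

plane_end_nonchars = [
    (plane << 16) | low
    for plane in range(PLANE_COUNT)
    for low in (0xFFFE, 0xFFFF)
]

# Anything above FFFF lies outside the BMP, i.e. in a supplementary plane.
supplementary = [cp for cp in plane_end_nonchars if cp > 0xFFFF]

print(len(plane_end_nonchars))  # 34
print(len(supplementary))       # 32
```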

> How many planes are defined in Unicode 3.1? UAX #27
> seems to indicate that it depends on what
> transformation format is used (“A process shall
> interpret the Unicode code units in accordance with
> the Unicode Transformation Format used.”). UTF-8 seems
> to only define 17 planes but UTF-32 seems to have 128
> groups of 256 planes.

There are only 17 planes, period.  Code units in UTF-32
greater than 0x10FFFF are not valid codepoints.
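
As a rough sketch of what that means for a process reading UTF-32
(the function name is hypothetical, and this is a range check only;
it does not reject the surrogate range D800..DFFF):

```python
MAX_CODEPOINT = 0x10FFFF  # last codepoint of plane 16, the 17th plane


def is_in_codespace(code_unit: int) -> bool:
    """Return True if a 32-bit UTF-32 code unit lies within the 17 planes."""
    return 0 <= code_unit <= MAX_CODEPOINT


print(is_in_codespace(0x10FFFF))  # True
print(is_in_codespace(0x110000))  # False
```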

> UAX #27 says that Unicode 3.1
> defines 3 new supplementary planes... including plane
> 14. I have difficulty with that statement.. does that
> mean that there are only 3 new planes, or that there
> are (at least) 14 new planes, but only 3 of which have
> plane names and characters in them? At least 17 planes
> must be defined in order to define the 32 non
> characters in 16 supplementary planes, that’s what
> common sense would say anyway. 

Unicode 3.1 defined characters in three of the
existing 16 supplementary planes (planes 1, 2, and 14).
The planes themselves have been there since 2.0.

> This whole “plane” business suffers from a lack of
> documentation. UAX #27 talks about planes as if it’s
> ancient history but the Unicode 3.0 book does not
> mention planes once (it’s not in the index anyway). I
> would like the Unicode documentation to explain
> exactly what a plane is without requiring the 10646
> documentation which is only available for a fee. In
> fact, according to UAX #27 the planes are defined in
> terms of what WILL be in 10646-2. 

A plane is a run of 65536 consecutive Unicode scalar values
(in the terminology of Unicode 2.0) that starts on a boundary
divisible by 65536.
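
In code, that definition comes down to a shift and a multiply.  A
minimal Python sketch, with helper names of my own choosing:

```python
PLANE_SIZE = 0x10000  # 65536 codepoints per plane
PLANE_COUNT = 17      # planes 0 through 16


def plane_of(codepoint: int) -> int:
    """Return the plane number (0..16) that contains the codepoint."""
    if not 0 <= codepoint < PLANE_COUNT * PLANE_SIZE:
        raise ValueError(f"not a Unicode codepoint: {codepoint:#x}")
    return codepoint >> 16


def plane_bounds(plane: int) -> tuple[int, int]:
    """Return the (first, last) codepoints of the given plane."""
    if not 0 <= plane < PLANE_COUNT:
        raise ValueError(f"no such plane: {plane}")
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


print(plane_of(0xE0001))                      # 14
print([hex(cp) for cp in plane_bounds(14)])   # ['0xe0000', '0xeffff']
```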

> I’m trying to get a grasp on exactly how many planes
> are defined in Unicode in part because it seems to
> affect the number of non characters that are defined.
> I also want to know the maximum number of characters
> that Unicode can encode. So far I reckon there are
> 1114112 (assuming 17 planes) minus 2048 (half
> surrogates) minus 2 (special non characters) minus 32
> (“hidden” non characters) minus 32 (non characters due
> to some arbitrary association between 16 higher planes
> code values and the special non characters code
> values) = 1111998 code positions available for
> characters. 

Your reasoning is sound.
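
The subtraction can be checked mechanically.  This Python sketch just
restates the figures from the quoted paragraph; the variable names
are mine.

```python
PLANE_COUNT = 17

total_codepoints = PLANE_COUNT * 0x10000         # 1114112
surrogates = 0xDFFF - 0xD800 + 1                 # 2048 surrogate codepoints
bmp_fffe_ffff = 2                                # FFFE and FFFF in plane 0
fdd0_to_fdef = 0xFDEF - 0xFDD0 + 1               # 32 non-characters in the BMP
supplementary_fffe_ffff = 2 * (PLANE_COUNT - 1)  # 32, two per higher plane

available = (total_codepoints - surrogates - bmp_fffe_ffff
             - fdd0_to_fdef - supplementary_fffe_ffff)

print(available)  # 1111998
```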

> What’s with this 1114111 number I’ve seen
> on this list? 

I have no clue, unless it is simply 0x10FFFF, the last
codepoint, written in decimal.

> BTW, it doesn’t make sense for every code position
> ending in FFFF or FFFE to be a non character.

It doesn't make much sense, but it is the rule anyway.

> Why isn’t the same rule applied to the “hidden” non
> characters, so that every code value ending in FDD0 to
> FDEF is also a non character? Is it to contribute to
> their “hidden” nature? 

No.  There is simply no reason to reserve them on the other planes.
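
The asymmetry shows up plainly if the two rules are written as a
single test.  This is only a sketch under the rules discussed in
this thread, with a function name of my own invention:

```python
def is_noncharacter(codepoint: int) -> bool:
    """Return True for the non-characters described in this thread."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {codepoint:#x}")
    # FDD0..FDEF is reserved in the BMP only, not repeated in other planes.
    if 0xFDD0 <= codepoint <= 0xFDEF:
        return True
    # The last two codepoints of every plane: low 16 bits are FFFE or FFFF.
    return (codepoint & 0xFFFE) == 0xFFFE


print(is_noncharacter(0xFDD0))    # True
print(is_noncharacter(0x1FDD0))   # False: the rule stops at the BMP
print(is_noncharacter(0x1FFFE))   # True
```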

-- 
John Cowan           http://www.ccil.org/~cowan
Please leave your values        |       Check your assumptions.  In fact,
   at the front desk.           |          check your assumptions at the door.
     --sign in Paris hotel      |            --Miles Vorkosigan
