Billancourt, April 1, 2001,

I was thinking about this while reading the thread about UTF-8s.
If the binary order of UTF-16 is of such prime interest that the
(numerous) users of UTF-8 should slightly modify their code
to co-operate with UTF-16-based database engines, by
accepting UTF-8s rather than UTF-8 on input (which is a minor
annoyance), and sending UTF-8s rather than UTF-8 for the 4-byte
sequences (again, this is rather easy to achieve, thanks to the
easy-to-notice barrier), then I believe the (seldom) users of
UTF-32 should be prepared to modify their code when
the problem surfaces for them too (clearly, at the moment it
hasn't).
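To make the underlying problem concrete, here is a small illustration (mine, not from the thread) of why UTF-8 and UTF-16 disagree on binary order: UTF-8 bytes sort by scalar value, while UTF-16 code units put the surrogates (D800..DFFF), and hence all supplementary-plane characters, before E000..FFFF.

```python
# U+E000 is a BMP character just after the surrogate block;
# U+10000 is the first supplementary-plane character.
a = "\uE000"
b = "\U00010000"

# Sort the two strings by their encoded byte sequences.
utf8_order = sorted([a, b], key=lambda s: s.encode("utf-8"))
utf16_order = sorted([a, b], key=lambda s: s.encode("utf-16-be"))

# UTF-8 compares scalar values: EE 80 80 < F0 90 80 80, so U+E000 first.
assert utf8_order == [a, b]
# UTF-16 compares code units: D800 DC00 < E000, so U+10000 first.
assert utf16_order == [b, a]
```

The two sorted orders differ, which is exactly the discrepancy UTF-8s (and the proposal below) tries to paper over.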

So I suggest correcting the problem before it arises.
And I would like to propose UTF-32s.

Since there is a lot of unused space in UTF-32, it is easy to solve
the problem: you just need to "shift" the "incorrectly sorted"
characters into the "correct" place.

A first solution would be to specify that every character from the
planes 1-16 be encoded in UTF-32s as a pair of 32-bit values,
the first one being of the form 0000D8xx/0000DBxx, and
the second of the form 0000DCxx/0000DFxx. Of course, the
relationship is the same as with UTF-16.
The advantage of this "solution" is that it is then trivial to map
from UTF-32s to UTF-16 and vice versa.
The main problem, however, is that it loses the principal
characteristic of UTF-32, the fact that characters are of fixed
length. This is clearly unacceptable (?)
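For completeness, this rejected pair-based scheme can be sketched in a few lines (the function name is mine); a supplementary code point becomes two 32-bit values using the usual UTF-16 surrogate arithmetic, zero-extended to 32 bits:

```python
def to_pair_utf32s(cp):
    """Sketch of the rejected pair-based UTF-32s: BMP characters stay
    as one 32-bit value, supplementary characters become two."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # Same high/low surrogate split as UTF-16, held in 32-bit units.
    return [0x0000D800 + (v >> 10), 0x0000DC00 + (v & 0x3FF)]
```

This makes the mapping to and from UTF-16 trivial, but, as noted, it gives up the fixed-length property.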

So instead, I propose to shift the characters U+E000 to U+FFFD
toward the positions 0011Exxx/0011Fxxx.
Yes, it is clearly a hack, and it does add some complexity for
BMP characters while doing nothing for the other ones, which
are supposed to be less useful. However, it is quite easy to
convert the data (the "most" difficult part is the conversion from
plain UTF-32 to UTF-32s, because it needs a floor-and-ceiling
comparison to detect the characters in the U+E000 to U+FFFD
range).
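The shift itself is one addition guarded by that range comparison; a minimal sketch (function names are mine):

```python
def utf32_to_utf32s(cp):
    """Shift U+E000..U+FFFD up to 0011E000..0011FFFD so that plain
    32-bit binary comparison matches UTF-16 code-unit order."""
    # The floor-and-ceiling comparison: only this range is remapped.
    if 0xE000 <= cp <= 0xFFFD:
        return cp + 0x110000
    return cp

def utf32s_to_utf32(v):
    """Inverse mapping, back to standard UTF-32 code points."""
    if 0x11E000 <= v <= 0x11FFFD:
        return v - 0x110000
    return v
```

After the shift, the supplementary planes (00010000..0010FFFD) compare below the relocated BMP range, mirroring the UTF-16 order in which surrogate pairs sort before E000..FFFD.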


Now, the astute reader will certainly have remarked that one can
conceive a variant of UTF-8s which is only 4 bytes long (instead
of 6) for the surrogates, while still preserving the sacred
binary order of UTF-16: just apply the standard algorithm
for UTF-8, but taking as input the UTF-32s code. As a result:

     code points             UTF-32s                  UTF-8s'
  U+0000 .. 007F       00000000 .. 0000007F    00..7F
  U+0080 .. 07FF       00000080 .. 000007FF    C0..DF + 80..BF
  U+0800 .. D7FF       00000800 .. 0000D7FF    E0..ED + 80..BF + 80..BF
 U+10000 .. 4FFFD      00010000 .. 0004FFFD    F0 + 80..BF + 80..BF + 80..BF
 U+50000 .. 8FFFD      00050000 .. 0008FFFD    F1 + 80..BF + 80..BF + 80..BF
 U+90000 .. CFFFD      00090000 .. 000CFFFD    F2 + 80..BF + 80..BF + 80..BF
 U+D0000 .. 10FFFD     000D0000 .. 0010FFFD    F3 + 80..BF + 80..BF + 80..BF
  U+E000 .. FFFD       0011E000 .. 0011FFFD    F4 + 9E..9F + 80..BF + 80..BF
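The table can be checked mechanically: encode the shifted UTF-32s value with the ordinary UTF-8 bit-packing. A sketch (the function name is mine):

```python
def utf8s_prime(cp):
    """Encode a code point in the hypothetical UTF-8s' variant:
    standard UTF-8 packing applied to the shifted UTF-32s value."""
    # Apply the UTF-32s shift first.
    v = cp + 0x110000 if 0xE000 <= cp <= 0xFFFD else cp
    # Then the ordinary UTF-8 algorithm.
    if v < 0x80:
        return bytes([v])
    if v < 0x800:
        return bytes([0xC0 | v >> 6, 0x80 | v & 0x3F])
    if v < 0x10000:
        return bytes([0xE0 | v >> 12,
                      0x80 | (v >> 6) & 0x3F, 0x80 | v & 0x3F])
    return bytes([0xF0 | v >> 18, 0x80 | (v >> 12) & 0x3F,
                  0x80 | (v >> 6) & 0x3F, 0x80 | v & 0x3F])
```

For example, U+E000 comes out as F4 9E 80 80 and U+FFFD as F4 9F BF BD, matching the last row of the table, while everything below the surrogates is byte-identical to plain UTF-8.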



Antoine
