From: "John Cowan" <[EMAIL PROTECTED]>

> Mark Crispin scripsit:
> > I thought about UTF-18, but I couldn't think of a good way to represent
> > Unicode in 18 bits without surrogates. On the other hand, the idea to cover
> > 0/1/2/14 (BMP/SMP/SIP/SSP) in a UTF-18 is interesting.
>
> I agree, and think it makes sense.
My best choice would be to cover planes 0/1/2/3 with a single code unit in UTF-18, expecting that a huge number of characters will soon have to be encoded in a second supplementary ideographic plane.

For your information, surrogates only exist in the BMP, not in the other planes. They would have to be used in UTF-18 to cover the whole Unicode set, using the same decomposition (of characters that don't fit in a single code unit) as in UTF-16, as that simplifies things. (This means that in UTF-18, the high surrogate code units normally needed for characters in planes 1 to 3 would become invalid.)

If someone ever resurrects 9-bit bytes in some new 72-bit RISC architecture with very long parallel instruction words of 144 bits (18 octets, or 16 nonets), such an idea would of course make a lot of sense, as would UTF-9... On such systems, the extra bit in each byte could be used on I/O as a parity bit on bytes, or as a CRC code on words, and filesystems could be updated to include a storage attribute for disks, specifying whether these bits are used to remap and verify octet-based data, or serve as plain storage to save space. But the main issue would come from devices like IDE and SCSI disks.

If computer speeds continue to grow, the auto-correcting CRC capability of very high speed buses (including within the processor itself) may become a requirement for all fast I/O operations, to help prevent the bad effects of external electromagnetic pollution and bursts, while also maintaining good interoperability with legacy octet-based software...
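To make the scheme concrete, here is a minimal sketch (not any standardized UTF-18; the function names are my own) of an encoder/decoder under the assumptions above: planes 0-3 (U+0000..U+3FFFF) fit exactly in one 18-bit code unit since 2^18 = 0x40000, planes 4-16 fall back to the UTF-16 surrogate decomposition, and any surrogate pair that would decode into planes 1 to 3 is rejected as invalid:

```python
def utf18_encode(cp):
    """Hypothetical UTF-18: one 18-bit unit for planes 0-3,
    UTF-16-style surrogate pairs for planes 4-16."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if cp <= 0x3FFFF:              # planes 0-3: direct, single unit
        return [cp]
    if cp <= 0x10FFFF:             # planes 4-16: same split as UTF-16
        v = cp - 0x10000
        return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
    raise ValueError("beyond Unicode")

def utf18_decode(units):
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: expect a pair
            cp = 0x10000 + ((u & 0x3FF) << 10) + (units[i + 1] & 0x3FF)
            if cp <= 0x3FFFF:
                # pair decodes into planes 1-3, which must be encoded
                # directly here, so these high surrogates are invalid
                raise ValueError("invalid surrogate pair in UTF-18")
            out.append(cp)
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

For example, a plane-2 ideograph such as U+2F800 becomes a single code unit, while U+40000 (plane 4) round-trips through a surrogate pair.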

