Last year I became aware of, and frustrated with, a new Unicode TES called 
UTF-5, proposed in an Internet Draft by James Seng, Martin Dürst, and Tin Wee 
Tan.  It was intended for encoding internationalized domain names (IDN) 
without breaking the existing DNS structure.  It used a fairly clever scheme 
of transforming the hex representation of a Unicode code point into a 
variable-length byte sequence.  However, I felt the Internet Draft was poorly 
written: there were no guidelines as to which characters were to be encoded 
and which (beyond the obvious U+002E) were not, and the examples (and 
reference encoder and decoder at www.idns.org) were self-contradictory.
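
For readers who have not seen the scheme, here is a minimal sketch of the 
core idea as I understand it, not the draft's reference code.  It assumes 
that each hex digit of the code point becomes one 5-bit "quintet" written 
with the symbols 0-9 and A-V, and that the first quintet of each character 
has its high bit set (so it falls in G-V) to mark where a character starts.  
The names utf5_encode and utf5_decode are hypothetical, and because the 
draft gave no guidance on which characters to leave alone, the sketch 
simply encodes everything it is given.

```python
# Assumed alphabet: one symbol per 5-bit quintet value (0..31).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text):
    """Encode each code point as its hex digits, flagging the first digit."""
    out = []
    for ch in text:
        hex_digits = format(ord(ch), "X")      # hex form, no leading zeros
        first = int(hex_digits[0], 16)
        out.append(DIGITS[16 + first])         # high bit set: start of character
        out.append(hex_digits[1:])             # remaining digits are plain hex
    return "".join(out)


def utf5_decode(encoded):
    """Reverse utf5_encode; a symbol in G-V begins a new code point."""
    chars = []
    value = None
    for sym in encoded:
        quintet = DIGITS.index(sym)            # ValueError on unknown symbols
        if quintet >= 16:
            if value is not None:
                chars.append(chr(value))
            value = quintet - 16
        else:
            if value is None:
                raise ValueError("continuation digit before any start digit")
            value = (value << 4) | quintet
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)
```

Under these assumptions, utf5_encode("A") gives "K1" and U+4E00 becomes 
"KE00"; the variable length comes from dropping leading zero digits.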

Ken Whistler pointed out, in a reply to my diatribe, that UTF-5 had the 
additional problem of not being a true UTF, despite the name.  It was really 
a TES (transfer encoding syntax), because the intent was to provide a 
reversible transform of Unicode characters to avoid violating the DNS naming 
requirements.  At least the "-5" part was accurate, though.

Subsequently I discovered Internet Drafts describing an assortment of what 
had come to be called ACEs (ASCII-Compatible Encodings), all intending to 
solve the IDN problem.  Mark Davis and Paul Hoffman created things called 
LACE (Length-based ACE) and RACE (Row-based ACE), which had elaborate 
compression schemes but which at least appeared to be completely specified.  
Each encoding came with a special signature, "--bq", to indicate the presence 
of a LACE- or RACE-based domain name.

I'm not sure exactly how "ASCII-compatible" these ACEs are, since ASCII 
characters (those below U+0080) seem to be encoded along with everything 
else, but that seems to be the accepted name, and at least it doesn't 
conflict with Unicode usage.

Now, upon visiting the Internet Drafts index once again, I see a 
proliferation of ACEs, including schemes called BRACE and DUNCE.  (I can't 
tell from the spec whether DUNCE is intended as a joke or not, and I think 
that says a lot.)  The big question now is which of these burgeoning ACEs 
will emerge as the standard, or -- horrors -- whether *more than one* might 
be adopted.

But there's more.  While looking, out of curiosity, for an update to the 
since-expired UTF-5 document, I found:

    draft-ietf-idn-utf6-00.txt

by Mark Welter and Brian W. Spolarich, which claims to describe something 
called UTF-6.  (Yes!)  This document, like many others that plagiarize freely 
from François Yergeau's RFC 2279 on UTF-8, copies text and structure from the 
BRACE proposal.  This type of copying isn't always a bad idea, but it always 
raises the question of whether the author fully understood the underlying 
concepts or just copied and pasted the words.

So what is this UTF-6?  Get ready... it's nothing more than a rehash of Seng 
et al.'s UTF-5, with a two-level run-length compression scheme added on.  
That's it.  It suffers from the same problems as UTF-5, adds compression that 
mainly benefits small alphabets (every proponent of a DNS solution seems to 
be motivated by a desire to support a specific language or script, often CJK; 
Welter and Spolarich seem to have been motivated to support Arabic DNS 
names), and of course proposes its own signature, "--wq", to differentiate it 
from all the other ACEs and jokers.

That name, "UTF-6", is particularly annoying.  As Whistler observed, these 
things aren't really UTFs at all, but because of the widespread distribution 
and mindless copying of well-written documents like RFC 2279 that describe 
well-specified encoding schemes like UTF-8, everybody now claims to have 
developed a "UTF."  (Compare this to some of the Gedankenexperiments by 
Unicode list members, which will never be adopted but which at least qualify 
as true UTFs.)  The "6" in UTF-6 doesn't refer to anything except the idea 
that UTF-6 is an enhancement to UTF-5.  Nothing is done in groups of six 
bits, bytes, characters, or anything.

There is an observation in the classical music world about the English horn, 
to the effect that it is neither English nor a horn.  (A similar remark has 
been made about the "Holy Roman Empire.")  This is the situation with UTF-6: 
it is not a UTF, nor is there anything "6" about it.

Much of the discussion on this list concerning Oracle's proposed UTF-8s 
mentions the very real problems with proliferating UTFs.  They add confusion 
to Unicode, especially among non-experts.  The explosion of IDN solutions is 
similar, except that there are even more proposals out there and even more 
confusion.  Many companies seem to have developed their own scheme, instead 
of adopting an existing proposal, for no good reason except visions of 
patent rights and royalties.

I hope that some order comes to the IDN scene soon, so that the Internet can 
have ONE well-defined scheme that allows the use of Unicode in the DNS, does 
not leak into the outside world any more than necessary, solves the problem 
it was intended to solve in a way that everyone can agree on, isn't 
extraordinarily difficult to implement, and DOESN'T call itself a UTF.  That 
would be music to just about everyone's ears.

-Doug Ewell
 Fullerton, California
