Jill Ramonsky wrote:

> Who were this "certain committee"? And why did they have so much
> control over the Unicode Consortium that they could force the
> introduction of a new character block that nobody had ever previously
> used? What was this "abuse of UTF-8" of which you speak. Indeed, what
> is an "abuse" of UTF-8? What does the phrase even mean?

The so-called "Multi-Lingual String Format" was described in an
Internet-Draft, draft-ietf-acap-mlsf-01.txt, written by Chris Newman of
Innosoft in June 1997. It was an attempt to define a lightweight,
inline language tagging protocol for ACAP (Application Configuration
Access Protocol) using invalid UTF-8 sequences, such as <E0 E5 EE> for
"en".

The protocol was described as "another layer of encoding on top of
UTF-8," but since there was no signature mechanism or other way for
UTF-8 processors to tell this MLSF from normal (corrupted) UTF-8 text,
it was effectively a non-standard extension of UTF-8.

At the time this was proposed, UTF-8 was still new and not very widely
adopted, and there was apparently great concern within the UTC that
this non-standard extension would undermine the stability of the UTF-8
format (just as the tacit approval of non-shortest UTF-8 sequences was
criticized as a security hole years later). Plane 14 tags were
introduced as an equally lightweight countermeasure to persuade the
ACAP people to abandon MLSF in favor of an official tagging mechanism
that used real (but out-of-the-way) Unicode characters and did not
break the rules of UTF-8.

> How can you possibly add a block of characters to Unicode and then say
> "the UTC sincerely hopes that they never get used at all"?
> (Particularly when there are still people around whose actual real
> characters are still not being added).

First, the comparison between adding this special-purpose tagging
mechanism and adding "actual real characters" that are part of some
writing system is disingenuous. Nobody ever made a choice between
encoding Tai Lue, Rejang, or Plane 14 tags.

Second, there are those of us (outside the UTC) who do feel that Plane
14 language tags have a valid use, since not all text that may benefit
from language tagging is necessarily in a marked-up format. But the
writing is on the wall, and "those of us" have given up our battle.

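To make the contrast between the two tagging approaches concrete, here
is a minimal Python sketch. It checks only what a strict UTF-8 decoder
does with the bytes; it does not implement MLSF itself, and the helper
names is_valid_utf8 and plane14_language_tag are hypothetical, chosen
for illustration. The Plane 14 side assumes the usual layout: U+E0001
LANGUAGE TAG followed by tag characters at U+E0000 plus the ASCII code.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # Python's decoder is strict by default
        return True
    except UnicodeDecodeError:
        return False


def plane14_language_tag(lang: str) -> str:
    """Build a Plane 14 language tag (hypothetical helper).

    U+E0001 LANGUAGE TAG, then one tag character per ASCII character,
    each located at U+E0000 + the ASCII code point.
    """
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)


# The MLSF-style byte sequence quoted above for "en".
mlsf_like = bytes([0xE0, 0xE5, 0xEE])

# The same language expressed with Plane 14 tag characters.
tagged = plane14_language_tag("en") + "hello"
tagged_bytes = tagged.encode("utf-8")

print(is_valid_utf8(mlsf_like))      # False: E5 is not a continuation byte
print(is_valid_utf8(tagged_bytes))   # True: ordinary 4-byte sequences
print(is_valid_utf8(b"\xc0\x80"))    # False: non-shortest form of U+0000
print(tagged_bytes[:12].hex(" "))    # f3 a0 80 81 f3 a0 81 a5 f3 a0 81 ae
```

A conforming decoder cannot distinguish the <E0 E5 EE> sequence from
corrupted input, while the Plane 14 version passes through as ordinary,
if invisible, characters.
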
> If this "certain committee" had intended to (falsely) declare
> something as UTF-8 and then embed something like:
>
> <XXX>lang=en-uk<YYY>
>
> where <XXX> and <YYY> are invalid UTF-8 byte-sequences, then so what?
> That would simply mean that "a certain committee"'s code wouldn't then
> interoperate with the rest of the world. Why is that any business of
> the UC's?

Because they were publishing their mechanism as an Internet-Draft,
which would soon have graduated to being an RFC, and then other groups
might have picked it up. Again, if you think back to 1997, the most
commonly referenced definition of UTF-8 itself was an RFC.

> Hell, if only the KLI had thought to implement the Klingon alphabet in
> invalid UTF-8 sequences - then maybe the UC would have added Klingon
> characters just to shut them up, saying things like "it's not really a
> script", and "the UTC sincerely hopes that they never get used at
> all". Could have saved an awful lot of time!

With all respect, this completely misrepresents the intent and working
process of the UTC.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

