Marco Cimarosti <marco dot cimarosti at essetre dot it> wrote:

> (Warning: I have probably succeeded in the impossible task of being
> more verbose than Mr. Overington. Please start reading only if you
> have a few free time... :-)

There's a difference between "verbose," which implies a high ratio of
words per idea, and "long." Marco's post was definitely long, but of
necessity.

> I will be pretending that William is "Overington Inc.", one of the key
> customers of the company I work with, and that they are asking me to
> implement a protocol to send text over the famous "Overington
> Multimedia Broadcasting (OMB)", with the following requirements:

William took exception to being "reduced" to a company in this way, but
I think it makes the scenario a bit more realistic. In the software
business, our customers are usually companies rather than individuals.
The net result of this is that more than one person is responsible for
the customer requirements, and trying to get clarifications or
modifications to them takes more than a simple one-on-one chat.

> 1. The text MUST be transmitted in UTF-8 (because the CEO of
> Overington Inc. thinks that UTF-8 is cute).

That's a perfectly legitimate requirement. BTW, I think UTF-8 is cute
too. :-)

> I convert the sample text file to XML (see <wo.xml> in the attached
> ZIP file), and here comes the first surprise: while the Plane-14
> tagged file <wo.txt> wad 445 bytes long, the XML files is only 322
> bytes long!
>
> This seems strange, at first: because of the "/" each pair of my XML
> language tags is one character longer than the corresponding pair of
> Plane-14 tags. Moreover, the syntactical overhead in X.1 above cannot
> be less than 30 characters. Of course, the reason for the 123-byte
> spare is that, in UTF-8, the characters composing XML tags only take
> one byte each, while Plane-14 tag character take four bytes each.

Too bad the customer in this scenario didn't think SCSU was cute.

> a. An XML file is human readable and may be edited with any text
> editor; although the Plain-14 file claims to be "plain text", each
> language tag character appears as a three black boxes in any UTF-8
> editor (and as a random twelve "accented" characters in a non-UTF-8
> editor).

While I'm no longer in the business of defending Plane 14 tags, it
should be mentioned that rendering engines are *not* supposed to display
tag characters as black boxes (although they all do). From UAX #27,
Section 13.7:

"... the tag characters themselves have no display and do not affect
line breaking, character shaping or joining, or any other format or
layout properties."

As for the non-UTF-8 editor, well, UTF-8 was a customer requirement, so
not only will the tags display badly, so will every other character
outside the Basic Latin range.

But the rest of Marco's arguments for XML are certainly sound. In
particular, XML information and support is everywhere, and as soon as
the functional requirements expand beyond language tagging, Plane 14
tags are no longer adequate.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
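
As a footnote to the byte counts quoted above, here is a minimal sketch of why a Plane 14 tag costs more in UTF-8 than an ASCII markup tag of the same length. The helper names (`plane14_language_tag`, `utf8_len`) and the `<en>` markup are hypothetical and chosen only for illustration; they are not taken from the files attached to the original post.

```python
# Sketch only: compares UTF-8 sizes of a Plane 14 language tag and an
# ASCII markup tag. The "<en>" element is a hypothetical stand-in, not
# the markup used in the original wo.xml.

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG
TAG_BASE = 0xE0000           # tag characters mirror ASCII at U+E0000 + code


def plane14_language_tag(lang: str) -> str:
    """Return U+E0001 followed by the tag-character form of an ASCII code."""
    for ch in lang:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not a printable ASCII character: {ch!r}")
    return LANGUAGE_TAG + "".join(chr(TAG_BASE + ord(ch)) for ch in lang)


def utf8_len(text: str) -> int:
    """Number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))


plane14 = plane14_language_tag("en")
markup = "<en>"  # hypothetical ASCII markup for the same language

print(len(plane14), utf8_len(plane14))  # 3 code points, 12 bytes
print(len(markup), utf8_len(markup))    # 4 code points, 4 bytes
```

Every character at or above U+10000 needs four bytes in UTF-8, so the three-character Plane 14 tag is three times the size of the four-character ASCII tag.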

