Michael Everson <everson at evertype dot com> wrote:

>> 3. Is there any method of tagging, anywhere, that is lighter-weight
>> than Plane 14? (Corollary: Is "lightweight" important?)
>
> HTML and XML markup?

and <Peter_Constable at sil dot org> replied:

> Doug was already comparing the plane 14 characters to HTML and XML,
> and clearly considers the latter to be relatively heavy -- and
> certainly they are heavier.

Certainly I don't want to claim, as some have, that HTML and XML and
SGML are *very* heavy. But there is definitely a difference.

HTML language tags (used here to include the slightly more complex XML
syntax as well) are of the form <lang="xx">, whereas Plane 14 tags are
of the form ?xx where ? represents U+E0001 and xx, the language
identifier, is translated to Plane 14. (HTML allows the alternative
form <lang=xx> without quotation marks, but XML does not.)

Either way, there is clearly more parsing to be done for HTML:

* the spelling of the tag "lang" must be checked;
* alternatively, it might be another type of tag altogether (not a
  language tag);
* the equal sign = must be checked;
* there must be exactly 0 (HTML optional) or 2 quotation marks
  surrounding the identifier;
* the greater-than sign > must be checked.

Plane 14 tags begin with a single, dedicated code point that means
"language tag," so no syntax checking is needed at that point. The
language identifier itself is encoded by dedicated code points, so
checking for "the end of the tag" is simpler (last character in the
tag range, or end of stream).

Parsing the cancel tag is likewise simpler: </lang> vs. U+E0001
U+E007F. For that matter, a Plane 14 cancel tag is not always
necessary, which is not true in HTML.

Any syntax checking of the identifier itself (e.g. "en" is valid but
"em" is not) must be performed regardless of the mechanism, so neither
approach holds an advantage there.

Peter continued:

>> 2. What extra processing is necessary to ignore Plane 14 tags that
>> wouldn't be necessary to ignore any other Unicode character(s)?
>
> None.
> And if some form of light-weight markup were used, then there
> would inevitably be a need for some kind of character escape mechanism,
> so ignoring language tagging would still entail interpreting of the
> escapes. E.g.
>
> #LT=en#This is English text, #LT=fr# et ce texte ci est en français.
> #LT=en#To use the pound character in text, as in "He's in room ##4,"
> you have to encode it twice.

Exactly. With the dedicated code points in Plane 14, you don't need
either the closing tag or the double-# escaping scheme.

I am not arguing that it takes Herculean effort to program a parser for
ASCII-based language tags, only that Plane 14 tags are even simpler, and
that some text applications call for the mechanism of greater
simplicity.

-Doug Ewell
 Fullerton, California
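
P.S. To make the "simpler" claim concrete, here is a minimal Python
sketch of the two operations discussed above: ignoring Plane 14 tags
altogether, and splitting text into language runs. It is an
illustration under stated assumptions, not a reference implementation:
the function names (language_tag, strip_plane14_tags, language_runs)
are hypothetical, and the only facts relied on are the code point
values U+E0001 (language tag), U+E0020 through U+E007E (tag
characters), and U+E007F (cancel tag).

```python
# Illustrative sketch only; all function names here are hypothetical.
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_FIRST = 0xE0020     # first tag character (mirrors ASCII 0x20)
TAG_LAST = 0xE007E      # last tag character (mirrors ASCII 0x7E)
CANCEL_TAG = 0xE007F    # U+E007F CANCEL TAG


def language_tag(ident: str) -> str:
    """Build a Plane 14 language tag from an ASCII identifier such as "en"."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in ident)


def strip_plane14_tags(text: str) -> str:
    """Ignore tagging entirely: a range test per code point, no escapes."""
    return "".join(
        ch for ch in text
        if not (ord(ch) == LANGUAGE_TAG or TAG_FIRST <= ord(ch) <= CANCEL_TAG)
    )


def language_runs(text: str):
    """Split text into (language, run) pairs; language is None if untagged."""
    runs, buf, lang = [], [], None
    i, n = 0, len(text)

    def flush():
        # Emit the pending run, if any, under the current language.
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    while i < n:
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            ident = []
            # The identifier ends at the first code point outside the tag range.
            while i < n and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                ident.append(chr(ord(text[i]) - 0xE0000))
                i += 1
            if ident:
                lang = "".join(ident)
            elif i < n and ord(text[i]) == CANCEL_TAG:
                lang = None  # U+E0001 U+E007F cancels the language tag
                i += 1
        elif cp == CANCEL_TAG:
            flush()
            lang = None      # a lone U+E007F cancels all tags
            i += 1
        elif TAG_FIRST <= cp <= TAG_LAST:
            i += 1           # stray tag character outside a tag: skip it
        else:
            buf.append(text[i])
            i += 1
    flush()
    return runs
```

Note that a literal "#" in the text needs no special handling in either
function, because the tag code points can never be confused with
ordinary text. Validating the identifier itself ("en" vs. "em") is
deliberately left out, since that work is the same under any tagging
mechanism.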

