John Cowan asked:

> > D17a Defective combining character sequence: A combining character
> > sequence that does not start with a base character.
> >
> > * Defective combining character sequences occur when a sequence
> > of combining characters appears at the start of a string or
> > follows a control or format character. Such sequences are
> > defective from the point of view of handling of combining
> > marks, but are not ill-formed.
> >        ^^^^^^^^^^^^^^^^^^^^^^
>
> What, if anything, does the term "ill-formed" mean when attached to
> a sequence of characters?

Nothing, really. The bullet goes on to point to the definition (D30)
of "ill-formed", which applies to code *unit* sequences in the context
of the encoding forms. The rewrite of Chapter 3 of the Unicode Standard
dispensed with the ill-advised ;-) and confusing distinction between
"illegal", "irregular", and "ill-formed" "code value sequences" in the
context of the discussion of "transformations", in favor of a much
starker and simpler distinction: a code unit sequence is either
well-formed or it is not.

> I understood that every sequence of
> characters whatsoever is permitted.

As regards code *point* sequences, these sequences can either be
conformant to the standard or not conformant to the standard. They are
conformant if they meet the conformance requirements (the "C" clauses
of Chapter 3). And as regards sequences of characters, that basically
comes down to not trying to interchange reserved or noncharacter code
points.

So if you include a reserved (unassigned) code point (for a particular
version of the Unicode Standard) in an interchanged data stream, a
recipient could claim that data stream is not conformant to (that
version of) the standard. Shorthand: the data contains "illegal"
characters. But even that is relative to the version of the standard,
since a recipient of reserved code points is obliged to preserve their
values -- they may, after all, be "legal" assigned code points in a
future version of the standard that that particular implementation is
not supporting.

So, yeah, basically every sequence of code points "assigned to
abstract characters" is "legal" for interchange. What you cannot
interchange are code points with gc=Cs (U+D800..U+DFFF) or code points
with gc=Cn (noncharacters and reserved).

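
The two levels distinguished above can be sketched in a few lines of Python. This is only an illustration, not anything taken from the standard: the function names are made up, UTF-8 stands in for "an encoding form", and the gc=Cn test is relative to whatever Unicode version the local `unicodedata` module happens to ship, which is exactly the version-relativity described above.

```python
import unicodedata


def is_well_formed_utf8(data: bytes) -> bool:
    """Code *unit* level: is this byte sequence well-formed UTF-8?"""
    try:
        # Python's strict UTF-8 decoder rejects ill-formed sequences
        # (overlong forms, encoded surrogates, truncated sequences).
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def not_interchangeable(text: str) -> list[str]:
    """Code *point* level: list code points with gc=Cs or gc=Cn.

    Hypothetical helper. gc=Cn covers both noncharacters and code
    points unassigned in this interpreter's version of the Unicode data.
    """
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) in ("Cs", "Cn")
    ]


# A lone combining acute followed by "a": odd, but nothing is flagged.
print(not_interchangeable("\u0301a"))
# A noncharacter and an unpaired surrogate code point are flagged.
print(not_interchangeable("a\uffff\ud800"))
# An overlong two-byte form is ill-formed at the code unit level.
print(is_well_formed_utf8(b"\xc0\x80"))
```

Note that the first check never looks at what the characters mean, and the second never looks at bytes; neither has anything to say about whether a combining mark has a base.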
What D17a is trying to tell people is that while certain sequences of
Unicode characters may be "defective" from the point of view of
certain kinds of processing -- in this case rendering of combining
character sequences -- that does not make them ill-formed (for which
see the specification of encoding forms), nor does it make them
nonconformant to the standard.

There are many sequences of Unicode characters that we could dream up
which would be abominable, distasteful, problematical, defective,
implementation-busting, or just plain screwy, but the standard itself
isn't prohibiting people from conformantly creating such sequences and
then challenging Microsoft or anybody else to display them without
blowing a gasket.

One of the reasons why we have to be so incredibly careful now before
introducing conceptually new *types* of characters, like the COMBINING
GRAPHEME JOINER or such things as INVISIBLE BASE CHARACTER or
COMBINING CLASS CHANGER or whatnot, is precisely that it gets harder
and harder to program defensively against all the possible
combinations and interactions that such beasties might have when mixed
with everything else that is available.

--Ken
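
For anyone who wants to spot the D17a case in data, here is a rough Python sketch. It rests on simplifying assumptions that are not in the text above: any character with General Category M counts as a combining character, and a run of them is reported only when it sits at the start of the string or right after a gc=Cc or gc=Cf character, following the wording of the bullet. The function names are hypothetical.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: gc=Mn, Mc, or Me means "combining character".
    return unicodedata.category(ch).startswith("M")


def defective_sequences(text: str) -> list[tuple[int, str]]:
    """Return (index, run) for each run of combining marks with no base."""
    found = []
    i = 0
    while i < len(text):
        if not is_combining(text[i]):
            i += 1
            continue
        # Extend j to the end of this run of combining characters.
        j = i
        while j < len(text) and is_combining(text[j]):
            j += 1
        prev = text[i - 1] if i > 0 else None
        # Defective if nothing precedes the run, or a control/format does.
        if prev is None or unicodedata.category(prev) in ("Cc", "Cf"):
            found.append((i, text[i:j]))
        i = j
    return found


sample = "\u0301a\u0301\n\u0308\u0301"
print(defective_sequences(sample))
# The same string still encodes without complaint: defective for
# rendering purposes, but not ill-formed.
print(sample.encode("utf-8"))
```

A check like this is a processing convenience only; as the message says, finding such a run does not make the data ill-formed or nonconformant.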

