An anonymous wag who picks the nits even finer than I did wishes the following clarification to be posted regarding an assertion I made about which Unicode code points are interchangeable. ;-)
------------- Begin Forwarded Message -------------

> So, yeah, basically every sequence of code points "assigned to
> abstract characters" is "legal" for interchange. What you cannot
> interchange are code points with gc=Cs (U+D800..U+DFFF) or
> code points with gc=Cn (noncharacters and reserved).

You *can* interchange reserved characters. You *should* not originate them, but if you are passed a string with them, you should preserve them, and pass them on. And in most circumstances you can depend on them being preserved.

For noncharacters you can interchange, but should not depend on them being preserved.

You *can* also interchange Cs characters; just not within conformant UTF encoding scheme/forms. But it is perfectly legal for me to have a record with a field containing an *arbitrary Unicode code point*, serialize that record, and send it off.

---------------End Forwarded Message ------------------

I concur with the general intent of this clarification, but this is definitely in the gray area as regards exactly what the conformance claims for the standard mean. It is certainly good practice, and the most robust approach, for an implementation to behave the way suggested here, but note also the following letter of the law from 10646, to which the Unicode Standard itself claims conformance:

<quote>

2.2 Conformance of information interchange

A code-character-data-element (CC-data-element) within coded information for interchange is in conformance with ISO/IEC 10646 if

a) all the coded representations of graphic characters within that CC-data-element conform to clauses 6 and 7, ...

b) all the graphic characters represented within that CC-data-element are taken from those within an identified subset (clause 12)

...

7. General requirements for the UCS

...

b. Code positions to which a character is not allocated, except for the positions reserved for private use characters or for transformation formats, are reserved for future standardization and shall not be used for any other purpose.

...

</quote>

2.2.a and 7.b imply that it is not conformant to interchange reserved code points, and 2.2.b implies that what you can interchange are only the assigned characters from a subset (in the Unicode case, of course, the subset of the whole).

So the way I would summarize this is:

I. Reserved code points

A conformant implementation should not originate them, but because conformant implementations may be designed to work with multiple versions of the standard and may encounter uplevel data, good implementation practice is to follow the Unicode recommendations about not munging uninterpreted code points and about passing them along unharmed.

II. Noncharacters

These cannot be used in open interchange, although they can, of course, be used in "internal" interchange, which is essentially a private agreement (perhaps with oneself) regarding what noncharacter usage those code points have. No external recipient can interpret them, nor is an external recipient obliged to preserve them if received.

III. Surrogate code points

I would claim, contra the above, that these *cannot* be interchanged in conformance with the standard -- at all. If one is attempting to interchange arbitrary Unicode code points, including Cs code points (U-0000D800..U-0000DFFF), this cannot be done with a well-formed encoding form, and thus cannot be done in conformance with the standard. If one claims to be *interchanging* such code points in the context of a Unicode string (which does not, of course, have to be well-formed to constitute a "Unicode string" by the definition in the standard), then such interchange is effectively a protocol built on top of the standard, rather than something in conformance with the standard itself.

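For anyone who wants to poke at the three cases above, here is a minimal Python sketch. The function names (`is_noncharacter`, `interchange_class`, `utf8_roundtrips`) are made up for illustration and are not from any standard or library. The "reserved" answer comes from the General_Category data in Python's `unicodedata` module, so it depends on the Unicode version bundled with the interpreter. The last function relies on the fact that Python's UTF-8 codec refuses lone surrogates by default.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def interchange_class(cp: int) -> str:
    """Sort a code point into the categories discussed above (illustrative labels)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"        # gc=Cs
    if is_noncharacter(cp):
        return "noncharacter"     # gc=Cn, but permanently not a character
    if unicodedata.category(chr(cp)) == "Cn":
        return "reserved"         # unassigned in this Python's Unicode data
    return "assigned"             # includes private use (gc=Co)


def utf8_roundtrips(cp: int) -> bool:
    """True if the code point survives a strict UTF-8 encode/decode."""
    try:
        s = chr(cp)
        return s.encode("utf-8").decode("utf-8") == s
    except UnicodeEncodeError:
        # Raised for surrogate code points: no well-formed UTF-8 exists for them.
        return False


if __name__ == "__main__":
    for cp in (0x0041, 0xD800, 0xFDD0, 0x10FFFF):
        print(f"U+{cp:04X}", interchange_class(cp), utf8_roundtrips(cp))
```

Note that reserved code points and noncharacters pass through the strict codec untouched, while a surrogate does not. Getting a surrogate onto the wire needs something like Python's `surrogatepass` error handler, which is exactly the kind of private protocol layered on top of the standard described under III.
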
At any rate, that is how *I* would pick the nits.

--Ken

