Lars Marius Garshol asked:

> I'm working on a specification for a data model and would like to
> check that my definition of the string type makes sense.
Well, language designers and data modelers may want to chime in with
alternate opinions, but here is my two cents on this topic.

> The definition currently says:
>
> <dt>String</dt>
> <dd><p>Strings are sequences of Unicode code points
> conforming to Unicode Normalization Form C <xref to="unicode"/>.</p>

I really think this is asking for trouble. A string data type should
be specified in terms of specific code units, unless you are dealing
with a level of abstraction where you really are talking about
*characters* -- in which case any operations you define on such
abstract strings will also be rather abstract and difficult to tie to
specific implementations of operations (even such simple things as
specification of storage and field size, etc.).

Also, it is asking for trouble to tie a string data type to a
particular normalization form. If you do so, you would have
distinctions between legal and illegal data in your data type, which
would then put you in the position of having to verify legality for
any operation involving your string data type.

Contrast the official Unicode definition of a "Unicode string":

    "D29a Unicode string: A code unit sequence containing code units
    of a particular Unicode encoding form."

That then lets you go on to define a "Unicode 8-bit string", a
"Unicode 16-bit string", or a "Unicode 32-bit string", depending on
which encoding form is appropriate for your purposes. Note that the
definition of the string per se does not even require the content of
the "Unicode string" to be well-formed, because to do so would put
constraints on the efficiency of low-level string processing. Even
less so would the definition of the string require the data to be in
a particular normalization form.

That said, you may still want to impose well-formedness conditions in
your data model for strings. I just don't see that as part of the
data type definition itself. If you want the data, at some
appropriate level of abstraction, to always be nominally in NFC, that
would be fine; it is comparable to the way some commercial databases
handle Unicode data: they normalize on input, so that internal
comparisons are always done on normalized strings and so that
ill-formed data doesn't make it into the stores.
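As a minimal sketch of what "normalize on input" looks like in
practice -- here in Python, with the standard unicodedata module
standing in for whatever normalization facility your implementation
would actually provide, and store_string a made-up name:

    import unicodedata

    def store_string(s):
        # Normalize on input, so that everything in the store is in
        # NFC and internal comparisons can be plain binary
        # comparisons on the stored data.
        return unicodedata.normalize("NFC", s)

    # U+0041 U+030A (A + combining ring above) is stored as U+00C5.
    assert store_string("A\u030A") == "\u00C5"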
> <p>Strings are equal if they consist of the exact same sequence of
> abstract Unicode characters. This implies that all comparisons are
> case-sensitive.</p>

You can do this, of course. But you might as well be defining a
binary comparison on the code *unit* string, which is how this is
going to end up being implemented, anyway.
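For illustration (Python again, assuming UTF-8 as the encoding form):
once the data is well-formed and normalized on input, "exact same
sequence of code points" reduces to a byte-for-byte comparison of the
code unit strings; equal_strings is, again, just a made-up name:

    def equal_strings(a, b):
        # a and b are UTF-8 code unit sequences (bytes); equality of
        # the strings is a plain binary comparison of the code units.
        return a == b

    # U+00C7 U+0061 ("Ça") encodes as the code units C3 87 61.
    assert equal_strings("Ça".encode("utf-8"), b"\xc3\x87\x61")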
> Does this make sense? Is "code point" the right term, or should I say
> "scalar value"?

There is a subtle distinction, since "code point" includes the
surrogate code points, which are always ill-formed. "Scalar value",
by definition, excludes the surrogate code points.
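Python happens to make the distinction visible: its str type ranges
over code points, but its encoders accept only scalar values:

    # U+D800 is a code point, but not a scalar value: no Unicode
    # encoding form can represent a lone surrogate.
    try:
        "\ud800".encode("utf-8")
    except UnicodeEncodeError:
        print("a code point, but not a scalar value")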
> And what about "abstract character"?

That's not what you want, since some abstract characters are not
encoded (yet), and some abstract characters have two or more
representations in Unicode. See Figure 2-8 in the Unicode Standard,
Version 4.0, and the further discussion in Section 2.7.
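A concrete example of the duplicate representations, once more in
Python: the abstract character Å is encoded both as U+00C5 LATIN
CAPITAL LETTER A WITH RING ABOVE and as U+212B ANGSTROM SIGN, and NFC
folds the duplicate back to U+00C5:

    import unicodedata

    # One abstract character, two encoded representations:
    assert "\u00C5" != "\u212B"
    assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"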
> Are two equal
> sequences of code points in NFC necessarily composed of the same
> sequence of abstract characters?

Yes. Because the mapping of code points to abstract characters is
fixed (standardized) by the character encoding itself.

--Ken

> Thanks for any help!
> --
> Lars Marius Garshol, Ontopian    <URL: http://www.ontopia.net >
> GSM: +47 98 21 55 50             <URL: http://www.garshol.priv.no >

