Use of the Unicode standard does *not* require constant validation of strings. The standard carefully distinguishes between Unicode strings (D29a-d, page 74) and UTFs. Unicode strings are in-memory representations of Unicode, and do not have to be valid UTFs; so all valid UTF-X strings are Unicode X-bit strings, but not the converse. When interpreting a UTF you must validate the input, and when generating a UTF you must produce valid results (for that UTF format). But you don't have to do this with Unicode strings.
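[Python's str type illustrates this distinction well: it is a Unicode string in the standard's sense and may hold a lone surrogate, which no UTF can encode. A minimal sketch (the helper name here is mine, not anything standard):

```python
# A Python str is a "Unicode string" in the standard's sense: it may
# contain a lone surrogate code point, which no UTF can represent.
s = "abc\ud800def"  # a Unicode string, but not well-formed for any UTF

def is_utf_serializable(text: str) -> bool:
    """Return True if text can be serialized as a valid UTF
    (i.e. it contains no lone surrogate code points)."""
    try:
        text.encode("utf-8")  # generating a UTF must yield valid output
        return True
    except UnicodeEncodeError:
        return False

print(is_utf_serializable("abc"))  # True
print(is_utf_serializable(s))      # False
```

So validation is a property of the conversion boundary, not of the in-memory string itself.]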
So there are two strategies for handling Unicode in API libraries:

1. Your internal string format is guaranteed to be a UTF, so all operations
have to maintain that. Thus in each API you validate all input and output
strings. (Of course, if an operation is guaranteed to maintain the validity
of a UTF-X string, like concatenation, you don't need to recheck the
output.)

2. Your internal string format is a Unicode string (but not necessarily a
UTF). In that case, the APIs have to "tolerate" odd code units, and may
produce strings that contain odd code units. (But if all the inputs are UTF
strings, any higher-level API should not produce non-UTF strings.) Note
that whenever you export to a protocol requiring a UTF (e.g. saving to a
file), you *do* have to validate, either stripping the odd code units or
providing some other error handling (see UTS #22 for more info).

Both of these strategies are legitimate. For 16-bit Unicode, I don't know
of any significant package in practice that does #1; it is just too
expensive and cumbersome, both to check and to handle any exceptions that
arise. And the low odds of encountering a loose surrogate, and the ease of
tolerating it (just treat it like an unassigned code point), make #2 very
reasonable. I'm not as familiar with packages using 8-bit Unicode, but
certainly the base C routines make no such guarantees, so I doubt it would
be viable to try to follow strategy #1 in practice, at least in C.

--Mark

----- Original Message -----
From: "Arcane Jill" <[EMAIL PROTECTED]>
To: "Unicode" <[EMAIL PROTECTED]>
Sent: Friday, December 10, 2004 06:46
Subject: When to validate?

> Here's something that's been bothering me. Suppose I write a function -
> let's call it trim(), which removes leading and trailing spaces from a
> string, represented as one of the UTFs. If I've understood this
> correctly, I'm supposed to validate the input, yes?
>
> Okay, now suppose I write a second function - let's call it tolower(),
> which lowercases a string, again represented as one of the UTFs. Again, I
> guess I'm supposed to validate the input, yes?
>
> And yet, in an expression such as tolower(trim(s)), the second validation
> is unnecessary. The input to tolower() /must/ be valid, because it is the
> output of trim(). But on the other hand, tolower() could be called with
> arbitrary input, so I can't skip the validation.
>
> For efficiency, I /could/ assume that all input was already valid - but
> then, what if it isn't? Or I could validate all input - but that's
> inefficient. Or I could write two versions of each function, one
> validating, the other not, but that adds too much complexity. It seems to
> me that not validating input to such functions would give you the best
> performance, but then in order to remain compliant you'd have to do the
> validation somewhere else - for example something like
>
> t = tolower(trim(validate(s)))
>
> where validate(s) does nothing but throw an exception if s is invalid.
>
> Other people must have had to make decisions like this. What's the
> preferred strategy?
>
> Arcane Jill
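[One way to reconcile the two positions in this thread, sketched in Python; the function names mirror Jill's example and are illustrative, not from any library. Validate once at the boundary, let validity-preserving operations skip rechecking, and handle odd code units only on export, as strategy #2 suggests:

```python
def validate(s: str) -> str:
    """Throw if s is not well-formed for a UTF; otherwise return it unchanged."""
    s.encode("utf-8")  # raises UnicodeEncodeError on lone surrogates
    return s

def trim(s: str) -> str:
    return s.strip()   # validity-preserving: cannot introduce odd code units

def tolower(s: str) -> str:
    return s.lower()   # likewise validity-preserving

# Validate once at the boundary; trim() and tolower() need no rechecks.
t = tolower(trim(validate("  Hello World  ")))  # "hello world"

def export_utf8(s: str) -> bytes:
    """Strategy #2's export step: strip the odd code units (here, by
    replacing lone surrogates with U+FFFD) so the output is always
    valid UTF-8."""
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s
    )
    return cleaned.encode("utf-8")

export_utf8("abc\ud800def")  # b'abc\xef\xbf\xbddef'
```

Under this design the interior functions tolerate odd code units (they simply pass them through), and the cost of validation is paid only where a UTF is actually interpreted or generated.]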

