
If, for example, I sit on a committee that devises a new encoding form, I need to be concerned with the question of which /sequences of Unicode code points/ are sound. If that is the same as "sequences of Unicode scalar values", I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection, btw). If, for example, I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+<surrogate values> can be assumed not to exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc. etc. So I do think there are a number of very general use cases where this question arises.

    For example, the original C datatype named "string", as it is
    understood and manipulated by the C standard library, has an
    /absolute/ prohibition against U+0000 anywhere inside.


That's not so much a prohibition as an artifact of NUL-termination of strings. In more modern libraries, the string contents and its explicit length are stored together, so you can store a 0x00 byte just fine, for example in a C++ std::string.

Yep.

If my question is really underspecified or ill-formed, a listing of possible interpretations somewhere (with case-specific answers) might be useful.

Stephan
