
If, for example, I sit on a committee that devises a new encoding form, I need to be concerned with the question of which /sequences of Unicode code points/ are sound. If that is the same as "sequences of Unicode scalar values", I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection, btw). If, for example, I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If U+<surrogate values> can be assumed not to exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc. etc. So I do think there are a number of very general use cases where this question arises.

    For example, the original C datatype named "string", as it is
    understood and manipulated by the C standard library, has an
    /absolute/ prohibition against U+0000 anywhere inside.


That's not so much a prohibition as an artifact of NUL-termination of strings. In more modern libraries, the string contents and its explicit length are stored together, so you can store a 0x00 byte just fine, for example in a C++ std::string.

Yep.

If my question is really underspecified or ill-formed, a listing of possible interpretations somewhere (with case-specific answers) might be useful.

Stephan
