One of the reasons why the Unicode Standard avoids the term “valid string” is
that it immediately begs the question: valid *for what*?
The Unicode string <U+0061, U+FFFF, U+0062> is just a sequence of 3 Unicode
characters. It is valid *for* use in internal processing, because for my own
processing I can decide I need to use the noncharacter value U+FFFF for some
internal sentinel (or whatever). It is not, however, valid *for* open
interchange, because there is no conformant way by the standard (by design) for
me to communicate to you how to interpret U+FFFF in that string. However, the
string <U+0061, U+FFFF, U+0062> is valid *as* a NFC-normalized Unicode string,
because the normalization algorithm must correctly process all Unicode code
points, including noncharacters.
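Ken's point about normalization can be checked directly. A small sketch using
Python's unicodedata module, which implements the Unicode normalization
algorithm:

```python
import unicodedata

# A string containing the noncharacter U+FFFF between 'a' and 'b'.
s = "\u0061\uFFFF\u0062"

# The normalization algorithm must process all code points, including
# noncharacters: U+FFFF has no decomposition and combining class 0,
# so NFC passes it through unchanged.
assert unicodedata.normalize("NFC", s) == s
```

So the string is valid *as* an NFC-normalized string even though it is not
valid *for* open interchange.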
The Unicode string <U+0061, U+E000, U+0062> contains a private use character,
U+E000. That string is valid *for* open interchange, but it is not
interpretable according to the standard itself. It requires an external
agreement as to the interpretation of U+E000.
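The private-use status can be read off the code point's general category; a
quick illustration, again with Python's unicodedata:

```python
import unicodedata

# U+E000 is the first code point of the BMP Private Use Area.
# Its general category is "Co" (Other, private use): the standard
# will never assign it an interpretation, so any meaning must come
# from an external agreement between the interchanging parties.
assert unicodedata.category("\uE000") == "Co"

# An ordinary letter, by contrast, is fully interpretable:
assert unicodedata.category("\u0061") == "Ll"
```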
The Unicode string <U+0061, U+002A, U+0062> (“a*b”) is not valid *as* an
identifier, because it contains a pattern-syntax character, the asterisk.
However, it is certainly valid *for* use as an expression, for example.
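Python happens to make this particular distinction concrete: its identifier
rules are defined in terms of the Unicode XID properties (PEP 3131), so
str.isidentifier() illustrates "valid *as* an identifier" versus "valid *for*
use in an expression":

```python
# "a*b" is not valid *as* an identifier: '*' is a pattern-syntax
# character, excluded from XID_Continue.
assert not "a*b".isidentifier()
assert "ab".isidentifier()

# The very same string is valid *for* use as an expression:
a, b = 2, 3
assert eval("a*b") == 6
```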
And so on up the chain of potential uses to which a Unicode string could be put.
People (and particularly programmers) should not get too hung up on the notion
of validity of a Unicode string, IMO. It is not some absolute kind of condition
which should be tested in code with a bunch of assert() conditions every time a
string hits an API. That way lies bad implementations of bad code. ;-)
Essentially, most Unicode string handling APIs just pass through string
pointers (or string objects) the same way old ASCII-based programs passed
around ASCII strings. Checks for “validity” are only done at points where they
make sense, and where the context is available for determining what the
conditions for validity actually are. For example, a character set conversion
API absolutely should be checking for ill-formed UTF-8 and have appropriate
error-handling, as well as checking for uninterpretable conversions (mappings
not in the table), again with appropriate error-handling.
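As a sketch of what such a conversion boundary looks like, here is Python's
built-in codec machinery rejecting ill-formed UTF-8, with the error-handling
policy chosen explicitly by the caller:

```python
# 0xFF can never appear anywhere in well-formed UTF-8.
ill_formed = b"\x61\xff\x62"

# Strict mode: the converter reports the ill-formedness.
try:
    ill_formed.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# Or the caller opts into substitution with U+FFFD instead:
assert ill_formed.decode("utf-8", errors="replace") == "a\uFFFDb"
```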
But, on the other hand, an API which converts Unicode strings between UTF-8 and
UTF-16, for example, absolutely should not – must not – concern itself with the
presence of a defective combining character sequence. If it doesn’t convert the
defective combining character sequence in UTF-8 into the corresponding
defective combining character sequence in UTF-16, then the API is just broken.
Never mind the fact that the defective combining character sequence itself
might not then be valid *for* some other operation, say a display algorithm
which detects that as an unacceptable edge condition and inserts a virtual base
for the combining mark in order not to break the display.
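The pass-through behavior Ken describes is easy to demonstrate; a minimal
sketch, using Python's codecs as stand-ins for a UTF-8/UTF-16 conversion API:

```python
# A "defective" combining character sequence: a combining acute accent
# (U+0301) with no base character in front of it.
defective = "\u0301b"

# A UTF-8 <-> UTF-16 converter must carry it through unchanged;
# whether the sequence is sensible for display is not its concern.
utf8 = defective.encode("utf-8")
utf16 = defective.encode("utf-16-le")
assert utf8.decode("utf-8") == defective
assert utf16.decode("utf-16-le") == defective
```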
--Ken
What does it mean to not be a valid string in Unicode?
Is there a concise answer in one place? For example, if one uses the
noncharacters just mentioned by Ken Whistler ("intended for process-internal
uses, but [...] not permitted for interchange"), what precisely does that mean?
Naively, all strings over the alphabet {U+0000, ..., U+10FFFF} seem "valid",
but section 16.7 clarifies that noncharacters are "forbidden for use in open
interchange of Unicode text data". I'm assuming there is a set of
isValidString(...)-type ICU calls that deals with this? Yes, I'm sure this has
been asked before and ICU documentation has an answer, but this page
http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets where it's, imo, unclear how to add them
up. An implementation can use characters that are "invalid in interchange",
but I wouldn't expect implementation-internal aspects of anything to be
subject to any standard in the first place (so why write this?). It also
makes me wonder about the runtime of an algorithm that checks whether a
Unicode string of a particular length is valid. Complexity-wise the answer is
of course "linear", but since the check (or a variation of it, depending on
how one treats holes and noncharacters) depends on the positioning of those
special characters, how fast does this function perform in practice? This
also relates to Markus Scherer's reply to
the "holes" thread just now.
Stephan
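The noncharacter check Stephan asks about is indeed a single linear pass over
the code points, independent of where the special characters sit. A sketch in
Python (the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points
of every plane, U+nFFFE and U+nFFFF):

```python
def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+FFFE/U+FFFF of every plane
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def interchange_ok(s: str) -> bool:
    # One linear scan; the cost depends only on the string's length.
    return not any(is_noncharacter(ord(c)) for c in s)

assert not interchange_ok("a\uFFFFb")   # noncharacter: not for interchange
assert interchange_ok("a\uE000b")       # private use: fine for interchange
```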