What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler ("intended for process-internal uses, but [...] not permitted for interchange"), what precisely does that mean? /Naively/, every string over the alphabet {U+0000, ..., U+10FFFF} seems "valid", but section 16.7 clarifies that noncharacters are "forbidden for use in open interchange of Unicode text data". I'm assuming there is a set of isValidString(...)-type ICU calls that deal with this? Yes, I'm sure this has been asked before and the ICU documentation has an answer, but this page
    http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets, and it's imo unclear how to add them up. An implementation may use characters that are "invalid in interchange", but I wouldn't expect implementation-internal aspects of anything to be subject to any standard in the first place (so why write this?). It also makes me wonder about the runtime of an algorithm that checks whether a string of a given length is a valid Unicode string. Complexity-wise the answer is of course "linear", but since the check (or a variation of it, depending on how one treats holes and noncharacters) depends on the positions of those special characters, how fast does such a function perform in practice? This also relates to Markus Scherer's reply to the "holes" thread just now.
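To make concrete what I mean by a linear-time check (this is just my own illustrative sketch, not an ICU API; the function names are made up), one strict reading would be: every element must be a Unicode scalar value (no surrogates) and, for open interchange, not one of the 66 noncharacters:

```python
# Illustrative sketch only -- not ICU. Function names are hypothetical.

def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code
    # points of each of the 17 planes (U+nFFFE and U+nFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_scalar_value(cp: int) -> bool:
    # Scalar values exclude the surrogate range U+D800..U+DFFF.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def is_interchange_valid(code_points) -> bool:
    # O(n): one constant-time test per code point, so positioning of
    # the special characters only matters for early-exit behavior.
    return all(is_scalar_value(cp) and not is_noncharacter(cp)
               for cp in code_points)
```

So in this reading the check is branch-per-code-point and trivially linear; the practical-speed question is then about constant factors, not about where the special characters sit.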

Stephan
