What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the noncharacters just mentioned by Ken Whistler ("intended for process-internal uses, but [...] not permitted for interchange"), what precisely does that mean? /Naively/, every string over the alphabet {U+0000, ..., U+10FFFF} seems "valid", but section 16.7 clarifies that noncharacters are "forbidden for use in open interchange of Unicode text data". I'm assuming there is a set of isValidString(...)-type ICU calls that deal with this? Yes, I'm sure this has been asked before and the ICU documentation has an answer, but this page
    http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets, and it's imo unclear how to add them up. An implementation may use characters that are "invalid in interchange", but I wouldn't expect implementation-internal aspects of anything to be subject to any standard in the first place (so why write this?). It also makes me wonder about the runtime of an algorithm that checks whether a string of a given length is a valid Unicode string. Complexity-wise the answer is of course "linear", but since the check (or a variation of it, depending on how one treats holes and noncharacters) depends on the positions of those special characters, how fast does such a function perform in practice? This also relates to Markus Scherer's reply to the "holes" thread just now.
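To make concrete what I mean by a linear-time check (this is just my own illustrative sketch, not an ICU API; the function names are made up), one strict reading would be: every element must be a Unicode scalar value (no surrogates) and, for open interchange, not one of the 66 noncharacters:

```python
# Illustrative sketch only -- not ICU. Function names are hypothetical.

def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code
    # points of each of the 17 planes (U+nFFFE and U+nFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_scalar_value(cp: int) -> bool:
    # Scalar values exclude the surrogate range U+D800..U+DFFF.
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def is_interchange_valid(code_points) -> bool:
    # O(n): one constant-time test per code point, so positioning of
    # the special characters only matters for early-exit behavior.
    return all(is_scalar_value(cp) and not is_noncharacter(cp)
               for cp in code_points)
```

So in this reading the check is branch-per-code-point and trivially linear; the practical-speed question is then about constant factors, not about where the special characters sit.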

Stephan
