Erik Ostermueller wrote:

> I'm dealing with an API that claims it doesn't support
> unicode characters with embedded nulls.
> I'm trying to figure out how much of a liability this is.

If by "embedded nulls" they mean bytes of value zero, that library can work *only* with UTF-8; the other two UTFs (UTF-16 and UTF-32) cannot be supported in this way. But are you sure you understood correctly? Didn't they perhaps write "Unicode *strings* with embedded nulls"? In that case they could have meant null *characters* inside strings, i.e. strings containing the Unicode character U+0000, which they don't support because that code is used as a string terminator. That would be a common and accepted limitation.
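To see why a NUL-terminated API chokes on U+0000, here is a minimal sketch in Python (my choice of language for illustration; the sample string is arbitrary):

    import ctypes

    # In UTF-8, U+0000 encodes as the single byte 0x00, which is
    # indistinguishable from a C string terminator.
    data = "ab\u0000cd".encode("utf-8")
    print(len(data))                    # 5 -- Python tracks the real length
    print(ctypes.c_char_p(data).value)  # b'ab' -- a NUL-terminated view stops early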
> What is my best plan of attack for discovering precisely
> which code points have embedded nulls
> given a particular encoding?
> Didn't find it in the maillist archive.
> I've googled for quite a while with no luck.

Put that way, the question doesn't quite make sense: zero bytes are a property of the encoded form, not of the code point itself. However:

UTF-8: only one character is affected, U+0000 itself.

UTF-16: in the range U+0000..U+FFFF (the Basic Multilingual Plane), exactly 511 code points are affected (all those of the form U+00xx or U+xx00), 484 of which are actually assigned. A few of these code points are high or low surrogates, however, which means that many characters in the range U+010000..U+10FFFF are affected as well.

UTF-32: all characters are affected, because the high byte of a UTF-32 code unit is always 0x00 (code points only go up to U+10FFFF).
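The UTF-16 arithmetic above is easy to check mechanically. A sketch in Python (the "surrogatepass" error handler is my way of letting the codec emit lone surrogates, which it otherwise rejects):

    # Count BMP code points whose UTF-16 encoding contains a zero byte.
    # "surrogatepass" lets the eight zero-byte surrogates (U+D800, U+D900,
    # ..., U+DF00) through, so the total matches the 511 quoted above.
    count = sum(
        0x00 in chr(cp).encode("utf-16-be", "surrogatepass")
        for cp in range(0x10000)
    )
    print(count)  # 511

    # A supplementary-plane example: U+10000 encodes as the surrogate
    # pair D800 DC00, i.e. two of its four bytes are zero.
    print(chr(0x10000).encode("utf-16-be").hex())  # d800dc00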
> I'll want to do this for a few different versions of unicode
> and a few different encodings.

Most single- and double-byte encodings behave like UTF-8, i.e. a zero byte is only needed to encode U+0000 itself.

> What if I write a program using some of the data files
> available at unicode.org?
> Am I crazy (I'm new at this stuff) or am I getting warm?
> Perhaps this data file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
>
> Algorithm:
> INPUT: Name of unicode code point file
> INPUT: Name of encoding (perhaps UTF-8)
>
> Read code point from file.
> Expand code point to encoded format for the given encoding.
> Test all constituent bytes for 0x00.
> Goto next code point from file.

That would be totally useless, I am afraid: the only UTF for which such a census makes sense is UTF-8, and the result is "one".
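For what it's worth, the test needs no data file at all: the codecs shipped with any modern language do the expansion for you, and the counts depend on the encoding, not on the Unicode version. A Python sketch (the function name and the list of encodings are just my examples):

    def zero_byte_code_points(encoding):
        """Yield each code point whose encoded form contains a 0x00 byte."""
        for cp in range(0x110000):
            if 0xD800 <= cp <= 0xDFFF:
                continue  # surrogates are not characters and cannot stand alone
            try:
                encoded = chr(cp).encode(encoding)
            except UnicodeEncodeError:
                continue  # code point not representable in this encoding
            if 0x00 in encoded:
                yield cp

    for enc in ("utf-8", "utf-16-be", "utf-32-be", "latin-1"):
        print(enc, sum(1 for _ in zero_byte_code_points(enc)))
    # utf-8 and latin-1 both print 1 (only U+0000), as argued above;
    # utf-32-be counts every encodable code point.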
_ Marco