Erik Ostermueller wrote:
> I'm dealing with an API that claims it doesn't support 
> unicode characters with embedded nulls.
> I'm trying to figure out how much of a liability this is.

If by "embedded nulls" they mean bytes of value zero, that library can
*only* work with UTF-8. The other two UTF's cannot be supported in this way.

But are you sure you understood correctly? Didn't they perhaps write "Unicode
*strings* with embedded nulls"? In that case they could have meant null
*characters* inside strings, i.e., that they don't support strings containing
the Unicode character U+0000, because that code is used as a string
terminator. That would be a common and accepted limitation.
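
To see why, here is a minimal sketch (Python, purely for illustration):
U+0000 is encoded in UTF-8 as a single 0x00 byte, and any NUL-terminated
consumer, such as a C string function, stops reading right there:

    import ctypes

    s = "abc\u0000def"
    data = s.encode("utf-8")
    print(data)   # b'abc\x00def' -- U+0000 became a literal zero byte

    # A C-style, NUL-terminated reader sees only the part before the NUL:
    print(ctypes.create_string_buffer(data).value)   # b'abc'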

> What is my best plan of attack for discovering precisely 
> which code points have embedded nulls
> given a particular encoding?  Didn't find it in the mailing-list archive.
> I've googled for quite a while with no luck.  

The question doesn't make sense. However:

UTF-8: Only one character is affected (U+0000 itself);

UTF-16: In the range U+0000..U+FFFF (the Basic Multilingual Plane), there are
of course exactly 511 code points affected (all those of the form U+00xx or
U+xx00), 484 of which are actually assigned. However, eight of these values
(U+D800, U+D900, ..., U+DF00) are high or low surrogates; any supplementary
character whose surrogate pair contains one of those code units also gets a
zero byte, so many characters in the range U+010000..U+10FFFF are affected as
well.

UTF-32: All characters are affected, because the high byte of a UTF-32 code
unit is always 0x00 (no code point exceeds U+10FFFF, which fits in 21 bits).
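
If you want to double-check these figures, a brute-force scan is only a few
lines in any scripting language. A sketch in Python (the codec names are
Python's and the totals in the comments are my own arithmetic; surrogates are
skipped because they are not encodable on their own in any UTF):

    def affected(encoding):
        """Count scalar values whose encoded form contains a zero byte."""
        count = 0
        for cp in range(0x110000):
            if 0xD800 <= cp <= 0xDFFF:   # surrogates are not scalar values
                continue
            if b"\x00" in chr(cp).encode(encoding):
                count += 1
        return count

    print(affected("utf-8"))      # 1 -- only U+0000 itself
    print(affected("utf-16-be"))  # 8679 by my count: 503 BMP + 8176 supplementary
    print(affected("utf-32-be"))  # 1112064 -- every scalar value there is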

> I'll want to do this for a few different versions of unicode 
> and a few different encodings.

Most single-byte and double-byte encodings behave like UTF-8 in this respect
(i.e., a zero byte occurs only in the encoding of U+0000 itself).
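
The same scan works for legacy encodings, except that most code points are
simply not representable in them, so encoding errors have to be skipped.
Another Python sketch (the codec names are Python's; the particular codecs
are just examples):

    def zero_byte_codepoints(encoding):
        hits = []
        for cp in range(0x110000):
            if 0xD800 <= cp <= 0xDFFF:
                continue
            try:
                if b"\x00" in chr(cp).encode(encoding):
                    hits.append(cp)
            except UnicodeEncodeError:
                pass   # code point not representable in this encoding
        return hits

    for enc in ("latin-1", "cp1252", "shift_jis", "big5"):
        print(enc, [f"U+{cp:04X}" for cp in zero_byte_codepoints(enc)])
        # each of these prints only U+0000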

> What if I write a program using some of the data files 
> available at unicode.org?
> Am I crazy (I'm new at this stuff) or am I getting warm?
> Perhaps this data file: 
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
> 
> Algorithm:
> INPUT: Name of unicode code point file
> INPUT: Name of encoding (perhaps UTF-8)
> 
> Read code point from file.
> Expand code point to encoded format for the given encoding.
> Test all constituent bytes for 0x00.
> Goto next code point from file.

That would be totally useless, I am afraid.

The only UTF for which this count makes sense is UTF-8, and the result is
"one".

_ Marco
