Re: U+0000 in C strings

Peter Kirk Mon, 15 Nov 2004 10:21:28 -0800

On 15/11/2004 16:38, Doug Ewell wrote:

> Peter Kirk <peterkirk at qaya dot org> wrote:
>
>>> I'd still like to know what practical, real-world TEXT-related benefits would derive from allowing U+0000 in strings of TEXT in a C program.
>>
>> The practical situation which I have in mind (although not important to me personally as I do very little programming - I am making this point more for the general good) is when (hypothetically) I am trying to write a program in C, or Java, or whatever, to process an arbitrary string of Unicode characters, perhaps received from the Internet, before handing them on to a higher level processor. My program works fine until someone, for whatever (possibly malicious) reason, sends a string containing U+0000. At that point my program crashes, or does something I did not intend which may be a security risk. It might well be a security risk if the task of my program is to scan the string for security issues, and if none are found it passes on the Unicode string including U+0000 and what follows it.
>
> The key to your scenario is "an arbitrary string of Unicode characters." Text processing is a special case of arbitrary "binary" data processing (a misnomer, of course, since all computer data is "binary," but we have no better term for "non-text").

OK, maybe by your strict definition what I am talking about is not TEXT processing, but neither is it "binary". It is the processing of a valid sequence of Unicode characters, as defined for example in Unicode conformance clause C10:

C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

Suppose I am implementing a process, any process, which "purports not to modify the interpretation of a valid coded character representation" and so must conform to C10. Since U+0000 is not a noncharacter code point, my process must not delete U+0000, nor may it delete or ignore the characters which follow U+0000. If my process acts non-conformantly by doing either of these things, it damages valid data and creates a security risk. My process therefore needs to store its data in a data type which accepts U+0000 in the middle of a sequence. A UTF-8 encoded C string is not such a type, because the 0x00 byte which encodes U+0000 is also the string terminator, and so it cannot be used in a process conforming to C10. The Java type which people are objecting to is such a type, and so can be used in a process conforming to C10.
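
To make the point concrete, here is a minimal sketch in C. It assumes a hypothetical length-carrying type, which I have called counted_str, and two hypothetical helper functions; none of these names come from any real library. It is only an illustration of the difference between the two storage conventions, not a recommendation for how a real program should be structured.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-carrying string type (an assumption for this
   sketch): the byte count is stored explicitly, so a 0x00 byte, which
   is U+0000 in UTF-8, is treated as ordinary data. */
struct counted_str {
    const unsigned char *bytes;
    size_t len;
};

/* Count occurrences of byte c using the C-string convention:
   the scan stops at the first 0x00 byte. */
static size_t count_cstr(const char *s, char c)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s == c)
            n++;
    }
    return n;
}

/* The same scan over the counted type: every one of the len bytes
   is examined, including any that follow a 0x00 byte. */
static size_t count_counted(struct counted_str s, unsigned char c)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (s.bytes[i] == c)
            n++;
    }
    return n;
}
```

Given the six bytes 'a', 'b', 0x00, '<', 'c', 'd', the C-string scan for '<' stops at the 0x00 and finds nothing, and strlen reports a length of 2, whereas the counted scan examines all six bytes and finds the '<'. A security scanner built on the first convention would pass the data on without ever having looked at what follows U+0000.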

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
