On 15/11/2004 16:38, Doug Ewell wrote:

Peter Kirk <peterkirk at qaya dot org> wrote:



I'd still like to know what practical, real-world TEXT-related
benefits would derive from allowing U+0000 in strings of TEXT in a C
program.


The practical situation which I have in mind (although not important
to me personally as I do very little programming - I am making this
point more for the general good) is when (hypothetically) I am trying
to write a program in C, or Java, or whatever, to process an arbitrary
string of Unicode characters, perhaps received from the Internet,
before handing them on to a higher level processor. My program works
fine until someone, for whatever (possibly malicious) reason, sends a
string containing U+0000. At that point my program crashes, or does
something I did not intend which may be a security risk. It might well
be a security risk if the task of my program is to scan the string for
security issues, and if none are found it passes on the Unicode string
including U+0000 and what follows it.



The key to your scenario is "an arbitrary string of Unicode characters."
Text processing is a special case of arbitrary "binary" data processing
(a misnomer, of course, since all computer data is "binary," but we have
no better term for "non-text").



OK, maybe by your strict definition what I am talking about is not TEXT processing. But neither is it "binary". It is the processing of a valid sequence of Unicode characters, of the kind covered, for example, by Unicode conformance clause C10:


C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.


Suppose I am implementing a process, any process, which "purports not to modify the interpretation of a valid coded character representation" and so must conform to C10. Since U+0000 is not a noncharacter code point, my process must not delete U+0000, nor may it delete or ignore the characters which follow U+0000. If my process acts non-conformantly by doing either of these things, it damages valid data and creates a security risk. My process therefore needs to store its data in a data type which accepts U+0000 in the middle of a sequence. A NUL-terminated, UTF-8 encoded C string is not such a type, because the single byte 0x00 that encodes U+0000 is also the string terminator, and so it cannot be used in a process conforming to C10. The Java type which people are objecting to is such a type, and so can be used in a process conforming to C10.
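To make the point concrete, here is a minimal sketch in C, not a definitive implementation. It assumes that "the Java type" means Java's modified UTF-8, in which U+0000 is written as the two bytes C0 80 (an overlong form that standard UTF-8 does not allow). The names counted_str, c_string_length and encode_modified_utf8 are hypothetical, invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: the length is stored explicitly,
   so a 0x00 byte is ordinary data rather than a terminator. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
} counted_str;

/* Length as any NUL-terminated C string routine would see it:
   everything from the first 0x00 byte onwards is invisible. */
static size_t c_string_length(const unsigned char *s)
{
    return strlen((const char *)s);
}

/* Copy UTF-8 bytes into out, writing each 0x00 byte (U+0000) as the
   two-byte sequence C0 80 and appending a terminating 0x00.
   Assumption: this mirrors Java's "modified UTF-8" convention.
   Returns the number of bytes written, not counting the terminator,
   or (size_t)-1 if out is too small. */
static size_t encode_modified_utf8(counted_str in,
                                   unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(out != NULL);
    if (cap == 0)
        return (size_t)-1;

    for (size_t i = 0; i < in.len; i++) {
        size_t need = (in.bytes[i] == 0x00) ? 2 : 1;

        /* Always leave room for the final terminator. */
        if (n + need + 1 > cap)
            return (size_t)-1;

        if (in.bytes[i] == 0x00) {
            out[n++] = 0xC0;
            out[n++] = 0x80;
        } else {
            out[n++] = in.bytes[i];
        }
    }
    out[n] = 0x00;
    return n;
}
```

Given the five bytes 'a', 'b', 0x00, 'c', 'd', c_string_length reports only the two bytes before the U+0000, while a counted_str of length 5 keeps all of them. After encode_modified_utf8 the same data occupies six bytes with no embedded 0x00, so ordinary C string routines see the whole sequence, at the price of a byte sequence that a strict UTF-8 decoder would reject.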

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




