An implementation based on an internal UTF-32 code unit representation could use, privately only, the range that is NOT assigned to valid Unicode code points: such an application would convert these bytes into code points higher than 0x10FFFF. But the application would then no longer conform to strict UTF-32 requirements: it would be representing binary data which is NOT bound to Unicode rules and which cannot be valid plain text.
For example, {0xFF0000+n} where n is the byte value to encapsulate. Don't call it "UTF-32", because it MUST remain for private use only!
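As a rough sketch only (the function names are invented; only the {0xFF0000+n} formula comes from the example above), such a private encapsulation could look like this in Python:

    # Sketch of the private "out-of-range code point" scheme described
    # above (NOT UTF-32): invalid bytes found in a nearly-UTF-8 input are
    # parked above 0x10FFFF, for internal use only.
    ESCAPE_BASE = 0xFF0000    # as in the example {0xFF0000+n}

    def decode_to_private_units(data: bytes) -> list[int]:
        """Decode to 32-bit code units, escaping bytes that are not valid UTF-8."""
        units, i = [], 0
        while i < len(data):
            for length in (4, 3, 2, 1):            # try the longest sequence first
                try:
                    ch = data[i:i + length].decode('utf-8')
                except UnicodeDecodeError:
                    continue
                if len(ch) == 1:                   # exactly one code point consumed
                    units.append(ord(ch))
                    i += length
                    break
            else:
                units.append(ESCAPE_BASE + data[i])   # invalid byte, escaped privately
                i += 1
        return units

    def encode_from_private_units(units: list[int]) -> bytes:
        """Re-emit the original byte stream, restoring escaped bytes."""
        out = bytearray()
        for u in units:
            if u >= ESCAPE_BASE:
                out.append(u - ESCAPE_BASE)
            else:
                out += chr(u).encode('utf-8')
        return bytes(out)

Any byte stream round-trips through these two functions, but the list of units produced is not valid UTF-32 and must never leave the application.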
This is more complex if the application uses UTF-16 code units, because there are only TWO code units that can be used to mark such invalid-text data within a text stream. It is possible to do, but with MUCH care:
For example, by encoding 0xFFFE before each byte value converted to some 16-bit code unit. The problem is that backward parsing of strings only checks whether a code unit is a low surrogate, to decide whether a second backward step is needed to reach the leading high surrogate. So U+FFFE would have to be used (privately only) as another lead "high surrogate" with a special internal meaning for round-trip compatibility, and the best choice for the code unit encoding the invalid byte value is then a standard low surrogate. A qualifying internal representation would therefore be {0xFFFE, 0xDC00+n}, where n is the byte value to encapsulate.
Don't call this "UTF-16", because it is not UTF-16.
An implementation that uses UTF-8 for valid strings could use the invalid lead-byte ranges to encapsulate invalid byte values. Note however that the invalid bytes you would need to represent have 256 possible values, whereas the reserved UTF-8 lead bytes 0xC0 and 0xC1 only cover 64 codes each, if you want a two-byte encoding. The alternative is to use the UTF-8 lead-byte values that were initially assigned to byte sequences longer than 4 bytes, and that are now unassigned/invalid in standard UTF-8. For example: {0xF8+(n/64), 0x80+(n%64)}.
Here too it is a private encoding that should NOT be named UTF-8, and the application should clearly document that it accepts not only any valid Unicode string but also some invalid data, for which it offers some round-trip compatibility.
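For instance (a sketch only; the helper names are made up, and only the formula above is taken from the scheme described):

    # Private two-byte encapsulation described above (NOT UTF-8): an
    # invalid byte n becomes {0xF8 + n/64, 0x80 + n%64}, reusing the lead
    # bytes 0xF8..0xFB once defined for 5-byte sequences and now invalid
    # in standard UTF-8.

    def escape_byte(n: int) -> bytes:
        """Encapsulate an invalid byte value n (0..255) as two private bytes."""
        return bytes([0xF8 + n // 64, 0x80 + n % 64])

    def try_unescape(pair: bytes) -> int | None:
        """Return the original byte if `pair` is a private escape, else None."""
        if len(pair) == 2 and 0xF8 <= pair[0] <= 0xFB and 0x80 <= pair[1] <= 0xBF:
            return (pair[0] - 0xF8) * 64 + (pair[1] - 0x80)
        return None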
So what is the problem? Suppose that the application, internally, starts to generate strings containing occurrences of such private sequences; it then becomes possible for the application to emit on its output a byte stream that would NOT have round-trip compatibility back to the private representation. Round-tripping is only guaranteed for streams converted FROM a UTF-8 source in which some invalid sequences are present and must be preserved by the internal representation. So the transformation is not bijective as you might think, and this potentially creates lots of possible security issues.
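Python's standard "surrogateescape" error handler (PEP 383) happens to use a scheme close to the low-surrogate trick above, and it shows the loss of bijectivity concretely:

    raw = b'\xc3\xa9'                                   # valid UTF-8 for 'é'

    s1 = raw.decode('utf-8', errors='surrogateescape')  # 'é', decoded from input
    s2 = '\udcc3\udca9'        # internally generated escape units, never decoded

    print(s1 == s2)                                              # False
    print(s1.encode('utf-8', errors='surrogateescape') == raw)   # True
    print(s2.encode('utf-8', errors='surrogateescape') == raw)   # True
    # Two different internal strings serialize to the same bytes, so the
    # mapping is not injective; s2 does not round-trip (decoding `raw`
    # again yields s1, not s2).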
So for such an application, it would be much more appropriate to use different datatypes and structures to represent streams of binary bytes and streams of characters, and to recognize them independently. The need for a bijective representation means that the input stream must contain an encapsulation that indicates *exactly* whether the stream is text or binary.
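For example (a sketch under assumed conventions; the one-byte tags are purely illustrative), the encapsulation can be as simple as an explicit tag written at the serialization boundary:

    # Tag each serialized stream explicitly as text or binary, instead of
    # smuggling raw bytes into character strings. Tag values are made up.
    TEXT_TAG, BINARY_TAG = b'T', b'B'

    def dump_stream(value) -> bytes:
        if isinstance(value, str):
            return TEXT_TAG + value.encode('utf-8')   # strict encode: text only
        return BINARY_TAG + bytes(value)

    def load_stream(raw: bytes):
        tag, payload = raw[:1], raw[1:]
        if tag == TEXT_TAG:
            return payload.decode('utf-8')            # strict decode: must be valid
        if tag == BINARY_TAG:
            return payload
        raise ValueError('unknown stream tag')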
If the application is a filesystem storing filenames and there's no place in the filesystem to record whether a filename is binary or text, then you are left without any secure solution!
So the best thing you can do to secure your application is to REJECT/IGNORE all files whose names do not match the strict UTF-8 encoding rules that your application expects: everything happens as if those files were not present. This may still create problems if an application that sees no files in a directory wants to delete that directory, assuming it is empty. In that case the application must be ready to accept the presence of directories without any visible content, and must not depend on the presence of a directory to conclude that it has contents. On secured filesystems such things can happen anyway, due to access restrictions completely unrelated to the encoding of filenames, so it is not unreasonable to prepare the application to behave correctly when faced with inaccessible files or directories; such an application will then also correctly handle the fact that the same filesystem contains filenames that are not plain text, or that are inaccessible.
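On a POSIX-style filesystem where names are byte strings, the REJECT/IGNORE policy sketched above could look like this (illustrative only):

    import os

    def list_strict_utf8_names(directory: bytes) -> list[str]:
        """List a directory at the byte level, skipping names that are not
        strict UTF-8, as if those files were not present."""
        names = []
        for entry in os.listdir(directory):        # bytes in, bytes out
            try:
                names.append(entry.decode('utf-8'))
            except UnicodeDecodeError:
                continue                           # reject/ignore
        return names

    # A directory may then look empty while still being non-removable, so
    # callers must tolerate rmdir() failing on "empty-looking" directories.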
Anyway, the solutions exposed above demonstrate that there is absolutely NO need to reserve a range in Unicode to represent pure binary data, even for round-tripping in situations like the one you expose. Instead, programmers must clearly study the impact of using invalid byte sequences and how they can be tracked throughout the application. The various Unicode encoding forms leave enough space to allow such implementations, but Unicode will NOT assign code points to binary data that has no character semantics, because such assigned code points would become characters valid in plain text!
So why all these discussions? Because there are various interpretations of what is or is not "plain text". As soon as a system imports a plain-text specification but applies some restrictions to it, the result is NOT plain text. A filename in a filesystem is NOT plain text, because it is only a restricted subset of what plain text can represent. For Unicode and for ISO/IEC 10646, plain-text data is ANY sequence of characters in the valid Unicode range (U+0000 to U+10FFFD, minus the surrogates and all non-characters), in ANY order and of ANY size. Plain text also mandates a few interpretations for some characters, notably the end-of-line characters (CR, LF, NL, LS, PS...), but no interpretation for any other character, and no limitation on line lengths (measured in whatever unit: bytes, code units, characters, combining sequences, grapheme clusters, ems, millimeters/picas, pixels, percentage of a container width...). Plain text is just an ordered list of line records, containing valid characters (including control characters, not to be confused with control bytes), each optionally terminated by end-of-line character(s).
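As an illustration of that definition (the function name below is made up), a strict check only needs to exclude the surrogate code points and the noncharacters:

    def is_unicode_plain_text(s: str) -> bool:
        """True if every code point is a Unicode scalar value and not a
        noncharacter (U+FDD0..U+FDEF, or any code point ending in FFFE/FFFF)."""
        for cp in map(ord, s):
            if 0xD800 <= cp <= 0xDFFF:             # surrogate code point
                return False
            if 0xFDD0 <= cp <= 0xFDEF:             # noncharacter block
                return False
            if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):  # plane-final noncharacters
                return False
        return True

    print(is_unicode_plain_text('Line 1\r\nLine 2\u2028Line 3'))  # True
    print(is_unicode_plain_text('bad \ufffe marker'))             # False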
For stricter definitions of plain text, you need to create a specification and make sure that this specification comes first in the encapsulation encoding/decoding; if this encapsulation allows any plain text to be represented using some escaping mechanism, this mechanism MUST be a mandatory part of the protocol specification. This is where most legacy filesystems have failed: their specification is incomplete, or simply wrong when it says that filenames are plain text. In fact, all filesystems place restrictions on valid filenames, because filenames also need to be encapsulated into other text protocols, or even in text files that have other restrictions (for example within shell commands, in URLs, or on single lines), and the designers do not want these filenames to be complicated to specify in those external applications (for example, encapsulating a filename within a URI using the "tricky" URI-escaping mechanism). But I think this is a bad argument, made only for lazy programmers, who often skip the mandatory parts of these specifications that document how encapsulation can be performed safely.
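For instance, the URI-escaping mechanism mentioned above is not actually hard to apply correctly; a quick sketch with Python's standard library (the filename here is invented):

    from urllib.parse import quote, unquote

    filename = 'rapport final #2 (été).txt'
    escaped = quote(filename)            # percent-encodes the UTF-8 bytes
    print(escaped)                       # rapport%20final%20%232%20%28%C3%A9t%C3%A9%29.txt
    print(unquote(escaped) == filename)  # True: the filename round-trips exactly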
Notably, the concept of filenames is a legacy and badly designed one, inherited from times when storage space was very limited and designers wanted a compact (but often cryptic) representation.
The concept of filenames combines too many independent things:
- a unique moniker used to reference files, and allowing the creation of links with possible security restrictions.
- a summary identification of the content type (with file extensions present on most filesystems, including Unix as a nearly universal but unreliable convention).
- sometimes a version identifier or number (on VMS devices, or on CDFS/ISO9660), for archival purposes.
- sometimes a data channel identifier (on filesystems that support multiple data streams with independent datatypes and storage for the same file, such as NTFS streams and MacOS data/resource forks); however, this concept is quite similar to the concept of hierarchical folders that can be used as valid resources with default contents.
- a description of the content (which is meta-data), but in a form that is too truncated and must nearly always be interpreted relative to the description of the directory in which the filename is stored.
--- Lars Kristan wrote:
I am not particularly thrilled about it. In fact it should be discussed. Constructively. Simply assuming everything will break is not helpful. But if you want an answer, yes, I would go for it. Actually, there are fewer concerns involved than people think. Security is definitely an issue. But again, one shouldn't assume it breaks just like that. Let me risk a bold statement: security is typically implicitly centralized. And if comparison is always done in the same UTF, it won't break. The simple fact that two different UTF-16 strings compare equal in UTF-8 (after relaxed conversion) does not introduce a security issue. Today, two invalid UTF-8 strings compare the same in UTF-16 after a valid conversion (using a single replacement char, U+FFFD), and they compare different in their original form if you use strcmp. But you probably don't. Either you do everything in UTF-8, or everything in UTF-16. Not always, but typically. If comparisons are not always done in the same UTF, then you need to validate. And not validate while converting, but validate on its own. And now many designers will remember that they didn't. So, all UTF-8 programs (of that kind) will need to be fixed. Well, might as well adopt my broken conversion and fix all UTF-16 programs. Again, of that kind, not all in general, so there are few. And even those would not all be affected. It would depend on which conversion is used where. Things could be worked out. Even if we would start changing all the conversions. Even more so if a new conversion is added and only used when specifically requested.

> Furthermore, I was proposing this concept to be used, but not unconditionally. So, you can, possibly even should, keep using whatever you are using.
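A concrete illustration of the comparison claim above: two different invalid UTF-8 byte strings compare different as raw bytes, but equal after a lossy conversion that substitutes U+FFFD (the equivalent of a valid conversion to UTF-16 with a replacement char).

    x = b'abc\x80def'       # stray continuation byte
    y = b'abc\xfedef'       # invalid lead byte

    print(x == y)                                      # False: differ as raw bytes
    print(x.decode('utf-8', errors='replace'))         # 'abc\ufffddef'
    print(x.decode('utf-8', errors='replace') ==
          y.decode('utf-8', errors='replace'))         # True: equal after conversion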
So you prefer to make programs misbehave in unpredictable ways (when they pass the data from a component which uses relaxed rules to a component which uses strict rules) rather than have a clear and unambiguous notion of a valid UTF-8?
There is cost and there are risks. Nothing should be done hastily. But let's go back and ask ourselves what are the benefits. And evaluate the whole.
Yes, this is where it all started. I cannot afford not to access the files. I am not writing a notepad.
> Perhaps I can convert mine, but I cannot convert all filenames on a user's system.
Then you can't access his files.
It is important to have a way to write programs that can. And, there is definitely nothing to be fixed about the filenames. They are there and nobody will bother to change them. It is the programs that need to be fixed. And if Unicode needs to be fixed to allow that, then that is what is supposed to happen. Eventually.
With your proposal you couldn't either, because you don't make them valid unconditionally. Some programs would access them and some would break, and it's not clear what should be fixed: the programs or the filenames.
Lars

