> From : "Lars Kristan" 
> Philippe VERDY wrote: 
> > If a source sequence is invalid, and you want to preserve it, 
> > then this sequence must remain invalid if you change its encoding. 
> > So there's no need for Unicode to assign valid code points 
> > for invalid source data. 
> Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a 
> known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 
> data so you don't gain much. The data is at risk of being rejected or filtered 
> out at any time. And that misses the whole point.
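For readers unfamiliar with the technique Lars mentions: Python's "surrogateescape" error handler (PEP 383) implements essentially this UTF-8B idea, so it can serve as a concrete sketch. Each invalid byte 0xNN maps to the lone surrogate U+DCNN, which is itself invalid in well-formed UTF-16:

```python
# UTF-8B-style decoding via Python's "surrogateescape" handler:
# every undecodable byte 0xNN becomes the lone surrogate U+DCNN.
raw = b"abc\xff\xfe"                     # 0xFF and 0xFE are invalid in UTF-8

text = raw.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in text[3:]])   # ['0xdcff', '0xdcfe']

# The intermediate string is NOT well-formed Unicode, so a strict
# encoder rejects it -- exactly the "rejected or filtered out" risk:
try:
    text.encode("utf-8")                 # strict by default
except UnicodeEncodeError:
    print("rejected by a strict encoder")
```

This illustrates both halves of the argument: the invalid bytes survive the conversion, but only as data that any strict consumer is entitled to refuse.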

I don't think I miss the point. My suggested approach, performing roundtrip 
conversions between UTFs while keeping all invalid sequences invalid (for the 
standard UTFs), is much less risky than converting them to valid code points 
(and by consequence to valid code units, because every valid code point is 
encoded with valid code units in the UTF encoding forms).

An application doing that simply preserves the original byte sequences for its 
internal needs, but it cannot expose such invalid sequences to other 
applications or modules without the same risks: those other modules need their 
own strategy, and their strategy may simply be to reject invalid sequences, 
assuming that all valid sequences encode valid code points. This is the risk 
you take with your proposal to assign valid code points to invalid byte 
sequences in a UTF-8 stream; a module implementing your proposal would remove 
important security features.

Note also that once your proposal is implemented, all valid code points become 
convertible across all UTFs, without notice (this is the principle of the UTFs: 
they allow transparent conversion between each other).

Suppose that your proposal is accepted, and that each invalid byte 0xNN in a 
UTF-8 source (such bytes are necessarily between 0x80 and 0xFF) gets encoded 
as some valid code point U+0mmmNN (in a new range U+mmm80 to U+mmmFF). These 
code points then become immediately and transparently convertible to valid 
UTF-16 or even valid UTF-8. Your assumption that the byte sequence will be 
preserved is wrong, because each encoded binary byte becomes a valid sequence 
of 3 or 4 UTF-8 bytes (one lead byte in 0xE0..0xEF if the code points are in 
the BMP, or in 0xF0..0xF7 if they are in a supplementary plane, followed by 
2 or 3 trail bytes in 0x80..0xBF).
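The transparency problem can be made concrete. The mapping range below (BASE plus the byte value) is made up purely for illustration; the point is that ANY valid code point chosen for the mapping re-encodes as perfectly valid multi-byte UTF-8:

```python
# Hypothetical mapping (NOT part of Unicode): invalid byte 0xNN is
# assigned the valid code point BASE + 0xNN. BASE is an assumption
# made here only to illustrate the argument.
BASE = 0x10FC00

bad_byte = 0xFF
cp = BASE + bad_byte                     # U+10FCFF: a valid code point

utf8 = chr(cp).encode("utf-8")           # four VALID UTF-8 bytes
print(utf8)                              # b'\xf4\x8f\xb3\xbf'
assert utf8.decode("utf-8") == chr(cp)   # accepted by any strict decoder
```

Once re-encoded this way, nothing in the byte stream marks the sequence as having originated from an invalid byte.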

How do you think other applications will treat these sequences? They won't 
notice that these valid sequences originally stood for invalid bytes, and the 
byte sequence itself would be transmitted across modules without any warning 
(applications most often don't check whether code points are assigned, just 
that they are valid and properly encoded).

Which application will take the responsibility of converting these valid 3- or 
4-byte sequences back to invalid 1-byte sequences, given that your data will 
already be treated by them as valid, and already encoded with valid UTF code 
units or encoding schemes?

Coming back to your filesystem problem: suppose that there ARE filenames that 
already contain these valid 3- or 4-byte sequences. This hypothetical 
application will blindly convert the valid 3- or 4-byte sequences to invalid 
1-byte sequences, and then won't be able to access those files, even though 
they were already correctly UTF-8 encoded. So your proposal breaks valid UTF-8 
encoding of filenames. In addition, it creates dangerous aliases that will 
redirect accesses from one filename to another (so yes, it is also a security 
problem).
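The aliasing can be demonstrated directly (the mapping range BASE + 0xNN is again a made-up assumption for illustration): a filename that already contains the valid encoded form and a filename containing the raw invalid byte collapse into one and the same string.

```python
BASE = 0x10FC00                          # made-up mapping range, illustration only

name_a = b"report\xff"                   # ends with an invalid byte
name_b = b"report" + chr(BASE + 0xFF).encode("utf-8")   # already valid UTF-8

# Under the proposal, name_a is decoded by mapping 0xFF to BASE + 0xFF:
decoded_a = "report" + chr(BASE + 0xFF)
# name_b needs no mapping at all -- it is plain valid UTF-8:
decoded_b = name_b.decode("utf-8")

assert decoded_a == decoded_b            # one Unicode name...
assert name_a != name_b                  # ...for two different files on disk
```

Whichever on-disk name a converter then regenerates from the shared Unicode string, accesses to the other file are silently redirected, which is the security problem mentioned above.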

My opinion is then that we must not allow the conversion of any invalid byte 
sequences to valid code points. All your application can do is convert them to 
invalid code unit sequences, to preserve their invalid status. It is then up 
to that application to perform this conversion privately, and to restore the 
original byte sequence before communicating again with the external system. 
Another process or module can do the same if it wishes, but none of them may 
communicate directly with each other using their private code unit sequences. 
The decision to accept invalid byte sequences must remain local to each module 
and is not transmissible.
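This local, private strategy can be sketched with Python's surrogateescape handler (one possible invalid-to-invalid mapping; any mapping with a guaranteed round trip would serve — the function names here are mine):

```python
def to_private(raw: bytes) -> str:
    # Internal form only: invalid bytes become lone surrogates,
    # which remain invalid and so cannot leak as valid data.
    return raw.decode("utf-8", errors="surrogateescape")

def to_external(text: str) -> bytes:
    # Restore the exact original byte sequence before crossing
    # the module boundary back to the external system.
    return text.encode("utf-8", errors="surrogateescape")

original = b"entry \xc3\x28 truncated"   # 0xC3 0x28 is malformed UTF-8
assert to_external(to_private(original)) == original
```

The round trip is lossless, but the intermediate string is strictly for internal use: handing it to another module as if it were valid Unicode reintroduces every risk discussed above.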

This means that permanent files containing invalid byte sequences must not be 
converted to another UTF as long as they contain an invalid byte sequence. 
Such a file converter should fail, and warn the user about file contents or 
filenames that could not be converted. It is then up to the user to decide 
whether to:
- drop these files
- use a filter to remove invalid sequences (if it's a filename, the filter may 
need to append some indexing string to keep filenames unique in a directory)
- use a filter to replace some invalid sequences with a user-specified valid 
substitution string
- use a filter that will automatically generate valid substitution strings.
- use other programs that will accept and be able to process invalid files as 
opaque sequences of bytes instead of as a stream of Unicode characters.
- change the meta-data file-type so that it will no longer be considered as 
plain-text
- change the meta-data encoding label, so that it will be treated as ISO-8859-1 
or some other complete 8-bit charset with 256 valid positions (like CP850, 
CP437, ISO-8859-2, MacRoman...).
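A minimal sketch of the fail-and-warn converter behaviour described above (the function name and target form are my own choices for illustration):

```python
def convert_utf8_to_utf16(raw: bytes) -> bytes:
    """Convert strictly, refusing files with invalid byte sequences."""
    try:
        text = raw.decode("utf-8")       # strict decoding: no silent repair
    except UnicodeDecodeError as e:
        # Fail loudly and tell the user WHERE the file is broken,
        # so they can pick one of the options listed above.
        raise ValueError(
            f"cannot convert: invalid byte sequence at offset {e.start}"
        ) from e
    return text.encode("utf-16-le")

assert convert_utf8_to_utf16(b"ok") == "ok".encode("utf-16-le")
try:
    convert_utf8_to_utf16(b"bad \xff byte")
except ValueError as err:
    print(err)                           # reports the offending offset
```

Refusing to convert is the conservative default: every other option in the list above is a deliberate, user-visible decision rather than a silent transformation.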
