Philippe VERDY wrote:
> I don't think I miss the point. My suggested approach to
> perform roundtrip conversions between UTF's while keeping all
> invalid sequences as invalid (for the standard UTFs), is much
> less risky than converting them to valid codepoints (and by
> consequence to valid code units, because all valid code
> points need valid code units in UTF encoding forms).
I still think you are missing the point. About two years ago I started a similar thread. At that time I was pursuing the UTF-8B conversion, which uses one kind of invalid sequence to represent another: unpaired low surrogates. It works rather well, but one of the readers alerted me that I cannot expect a Unicode database to be able (or, rather, willing) to process such data. Since I am not in the habit of writing every piece of code myself (or having my team write it), I chose to use a third-party database. The data I have is mainly UTF-8, and users expect it to be interpreted as such, but they do not expect purism in the form of rejecting data (filenames) that contains invalid sequences. I am thankful to the person who pointed this out, and I have since moved to using the PUA.

The rest of the responses were much like what I am getting now: useless. Telling me to reject invalid sequences, or to rewrite everything and treat the data as binary. Or to use an escaping technique, forgetting that everything they find wrong with the codepoint approach is also true of escaping, except that escaping has a lot of overhead and there is a real risk of those escape sequences already being present in today's files. Not the ones on UNIX, but the ones on Windows. It should work both ways.
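For readers unfamiliar with it, here is the UTF-8B idea in miniature. Python's 'surrogateescape' error handler happens to realize the same mapping and is used here purely as an illustration; the filename is invented:

name = b'report-\xe9.txt'                   # not valid UTF-8 (stray 0xE9)
text = name.decode('utf-8', 'surrogateescape')
assert text == 'report-\udce9.txt'          # the invalid byte became U+DCE9
assert text.encode('utf-8', 'surrogateescape') == name   # lossless roundtrip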
>
> The application doing that just preserves the original byte
> sequences, for its internal needs, but will not expose to
> other applications or modules such invalid sequences without
> the same risks: these other modules need their own strategy,
> and their strategy could simply be rejecting invalid
> sequences, assuming that all other valid sequences are
> encoding valid codepoints (this is the risk you take with
> your proposal to assign valid codepoints to invalid byte
> sequences in a UTF-8 stream, and a module that would
> implement your proposal would remove important security features).
Only applications that use the new conversion need to worry about security issues, and only those, of course, to which security issues apply in the first place. All other applications can and should treat those codepoints like letters and convert them to UTF-8 just as they would any other valid codepoint. I may have suggested otherwise at some point, but this is my current position.
> Note also that once your proposal is implemented, all valid
> codepoints become convertible across all UTFs, without notice
> (this is the principle of UTF that they allow transparent
> conversions between each other).
The existing conversions are not modified. I am explaining how an alternate conversion works simply to prove that it is useful. It does not convert to UTF-8; it converts to byte sequences, and it can be used wherever one has to interface with such data, for example UNIX filenames. Nor is 'supposedly UTF-8' the only case: the same technique can be applied to 'supposedly Latin 3' data. The new conversions are used in pairs, and the existing UTF conversions remain as they are. Any security issues are the concern of whoever decides to use the new conversions; there are none for those who do not.
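To make the rest of this mail concrete, here is a rough sketch of such a conversion pair in Python. The range U+EE80..U+EEFF and the function names are placeholders I am choosing only for this sketch (the proposal merely asks for 128 codepoints somewhere, and my current code uses the PUA), so do not read anything into the exact values:

ESCAPE_BASE = 0xEE00   # placeholder: escape codepoints at U+EE80..U+EEFF
ESCAPE_RANGE = range(ESCAPE_BASE + 0x80, ESCAPE_BASE + 0x100)

def bytes_to_text(data: bytes) -> str:
    """'Supposedly UTF-8' bytes -> Unicode string, preserving invalid bytes."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Expected length of a UTF-8 sequence, judged from the lead byte.
        n = (1 if b < 0x80 else 2 if 0xC2 <= b <= 0xDF else
             3 if 0xE0 <= b <= 0xEF else 4 if 0xF0 <= b <= 0xF4 else 0)
        cp = None
        if n:
            try:
                ch = data[i:i + n].decode('utf-8')
                cp = ord(ch)
            except UnicodeDecodeError:
                pass
        # Escape invalid bytes; also escape bytes that happen to encode one
        # of the escape codepoints themselves, so that bytes -> text -> bytes
        # always restores the original byte sequence.
        if cp is None or cp in ESCAPE_RANGE:
            out.append(chr(ESCAPE_BASE + b))
            i += 1
        else:
            out.append(ch)
            i += n
    return ''.join(out)

def text_to_bytes(text: str) -> bytes:
    """Reverse conversion: escape codepoints become raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp in ESCAPE_RANGE:
            out.append(cp - ESCAPE_BASE)
        else:
            out.extend(ch.encode('utf-8'))
    return bytes(out)

The pair is deliberately lossless in the bytes -> text -> bytes direction; that is the only property the UNIX side needs.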
>
> Suppose that your proposal is accepted, and that invalid
> bytes 0xnn in UTF-8 sources (these bytes are necessarily
> between 0x80 and 0xFF) get encoded to some valid code units
> U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they
> become immediately and transparently convertible to valid
> UTF-16 or even valid UTF-8. Your assumption that the byte
> sequence will be preserved will be wrong, because each
> encoded binary byte will become valid sequences of 3 or 4
> UTF-8 bytes (one lead byte in 0xE0..EF if code points are in
> the BMP, or in 0xF0..0xF7 if they are in a supplementary
> plane, and 2 or 3 trail bytes in 0x80..0xBF).
Again, a UTF-8 to UTF-16 converter does not need to (and should not) encode invalid sequences as valid codepoints. The existing rules apply: signal the error, reject the data, or replace the sequence with U+FFFD.
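For reference, those existing options as they appear in an ordinary codec API (Python used purely as an illustration; the filename is invented):

bad = b'caf\xe9.txt'                      # invalid as UTF-8
try:
    bad.decode('utf-8')                   # default: signal the error / reject
except UnicodeDecodeError as exc:
    print('rejected:', exc)
print(bad.decode('utf-8', errors='replace'))   # replace: 'caf\ufffd.txt'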
>
> How do you think that other applications will treat these
> sequences: they won't notice that they are originally
> equivalent to the new valid sequences, and the byte sequence
> itself would be transmitted across modules without any
> warning (applications most often don't check whether
> codepoints are assigned, just that they are valid and
> properly encoded).
Exactly. This is why nothing breaks. A Unicode application should treat the new codepoints exactly the same way it treats them today. Today they are unassigned and are converted according to the existing rules; once they are assigned, they merely gain some properties, but they are still treated as valid and should be converted as before.
>
> Which application will take the responsibility to convert
> back these 3-4 bytes valid sequences back to invalid 1-byte
> sequences, given that your data will already be treated by
> them as valid, and already encoded with valid UTF code units
> or encoding schemes?
Typically, the application that generated them. This technique allows the application to use Unicode sublayers, databases, sorting and so on to process the data. Most of the data IS valid UTF-8 text, and, I can tell you from experience, the rest of it sorts (collates) usefully. Let's not make up examples where this is not true; for data in UNIX filesystems it is.
Now, it is true that data from two applications using this technique can become intermixed. But this is not something to fear; on the contrary, it is why I want to standardize the approach, because in most cases what happens is exactly what one expects. If each of the two applications instead chose an arbitrary escaping technique to solve the problem, you would get a bigger mess.
Each time I show that something works, someone steps in and finds abuses. Yes, it can be abused, but there are cases where there are no security issues and the abuser merely ends up amused, nothing more. We can discuss the possible abuses, exactly what they cause, how they can be prevented, and in which cases they really need to be prevented. I have covered some of that in other replies, but I am willing to discuss it with anyone.
>
> Come back to your filesystem problem. Suppose that there ARE
> filenames that already contain these valid 3-4 byte
> sequences. This hypothetic application will blindly convert
> the valid 3-4 bytes sequences to invalid 1-byte sequences,
> and then won't be able to access these files, despite they
> were already correctly UTF-8 encoded. So your proposal breaks
> valid UTF-8 encoding of filenames. In addition it creates
> dangerous aliases that will redirect accesses from one
> filename to another (so yes it is also a security problem).
We need to separate the UNIX and Windows sides here. Using my conversion, Windows can access any file on UNIX, because the conversion guarantees the roundtrip UX=>Win=>UX (I cannot say UTF-8=>UTF-16=>UTF-8, because it is not UTF-8). This holds even if an encoded replacement codepoint is already present in the name, because such codepoints are escaped themselves (but only in this conversion, not in a regular UTF-8 interpretation).
The Win=>UX=>Win roundtrip is not guaranteed. I admit that, and said so a long time ago. It fails only if the names contain certain sequences of the new codepoints. Note that the sequences generated by the UX=>Win conversion do roundtrip, and those are the ones we can mostly expect to see. Existing filenames shouldn't contain any of the codepoints in question, because these codepoints are still unused; if you want to suppose otherwise, it becomes the same case as an abuse attempt, and we will deal with that next. But let me stress that the fact that there shouldn't be any effectively means there aren't any. OK, now suppose there are some, or that some concatenation was done, or that someone attempts to abuse the concept.
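With the placeholder bytes_to_text/text_to_bytes pair sketched earlier in this mail, both halves of that claim are easy to check:

# UX => Win => UX always restores the original bytes, invalid or not.
unix_name = b'caf\xe9.txt'
assert text_to_bytes(bytes_to_text(unix_name)) == unix_name

# Win => UX => Win is not guaranteed: a Windows name that already uses the
# escape codepoints (here the placeholders U+EEC3 and U+EEA9) maps to bytes
# that are valid UTF-8, so it decodes back to a different string.
win_name = 'caf\uEEC3\uEEA9.txt'
assert text_to_bytes(win_name) == b'caf\xc3\xa9.txt'
assert bytes_to_text(text_to_bytes(win_name)) == 'caf\xe9.txt'   # i.e. 'café.txt'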
I described this in another mail, but let's go through it again. It is a fact that you already have multiple representations that map to the same filename: filenames are not case-sensitive on Windows. That's it. Are there security issues? It depends on what you do. If you let the system handle security and rely entirely on it, there are no problems. Making double checks and early assumptions is neither wise nor efficient anyway. Security is by definition centralized, and when it is, bijectivity is not a requirement.
I am supposing that most security layers work the way I described above. Some may not. Well, if they rely on bijectivity, they need strict validation. If they use the plain UTF-8 conversion, they can again remain as they are. Only if someone wants to extend such a security layer to allow invalid sequences would they need to strengthen the validation. That can be done simply by roundtripping through UTF-8 and either using the result as-is, or comparing it to the original if rejection is desired. It can be made even simpler: a very strict security layer could reject all the new codepoints. But perhaps even before that, it should reject U+FFFD, which may present a security risk even today; the new codepoints actually present less of a risk. A pure Unicode security layer does not need to reject them at all, since it doesn't use the new conversions. If anyone chooses to obtain a Unicode string via the new conversions and feed it to such a security layer, that is no problem, as long as you do not compare such strings yourself and let the security layer do all the work.

An example where something would apparently break: suppose you have validated a user via such a security layer, via the new conversion, and on your (UTF-16) system the application generated your home directory. You then use a different string that you know will map to the same user. Well, you ARE the same user; you had to use your password and everything. It is no different from case insensitivity. The only risk is that you are now not getting your home directory. Well, your loss; you shot yourself in the foot, but the security didn't break. And suppose this happened not only in malicious attempts but was really submitted as a bug. The fix is simple: you just need to roundtrip the 'broken' data through Unicode to get the same string the user database is getting.
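A sketch of that strengthened validation, again in terms of the placeholder helpers above (the function names are my own):

def canonicalize(name: str) -> str:
    # Strings that denote the same byte sequence collapse to one form.
    return bytes_to_text(text_to_bytes(name))

def validate_strict(name: str) -> str:
    canonical = canonicalize(name)
    if canonical != name:
        raise ValueError('non-canonical use of escape codepoints')
    if '\ufffd' in name:
        raise ValueError('U+FFFD not accepted')   # the even stricter option
    return canonical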
>
> My opinion is then that we must not allow the conversion of
> any invalid byte sequences to valid code points. All what
> your application can do is to convert them to invalid
> sequences of code units, to preserve the invalid status. Then
> it's up to that application to make this conversion privately
> and restore the original byte sequence before communicating
> again with the external system. Another process or module can
> do the same if it wishes to, but none will communicate
> directly to each other with their private code unit
> sequences. The decision to accept invalid byte sequences must
> remain local to each module and is not transmissible.
Applications are built from building blocks, and limiting the choice of blocks to those that are willing to process invalid data is not a good idea. I won't go into the discussion of whether building blocks should or should not process invalid data in the first place. Or should I? I think they should have the ability and should only validate when told to do so. But not everybody will agree, and even if they did, it would take ages to fix all the building blocks (functions, databases, conversions, etc.). The straightforward solution is to make the data valid by assigning valid codepoints to it. Whoever chooses to interpret those codepoints in a special way will also need to worry about the consequences; the rest can and should remain as it is.
>
> This means that permanent files containing invalid byte
> sequences must not be converted and replaced to another UTF
> as long as they contain an invalid byte sequence. Such file
> converter should fail, and warn the user about file contents
> or filenames that could not be converted. Then it's up to the
> user to decide if it wishes to:
> - drop these files
Oh, please.
> - use a filter to remove invalid sequences (if it's a
> filename, the filter may need to append some indexing string
> to keep filenames unique in a directory)
Possibly valid if you are renaming the files (though with serious risks involved), but very impractical if you simply want to present the files on the network.
> - use a filter to replace some invalid sequences by a user
> specified valid substitution string
> - use a filter that will automatically generate valid
> substitution strings.
That's escaping, and it has all the problems you brought up against my approach. It also contradicts one of your basic premises - that invalid sequences should not be replaced with valid ones. So, if you are now suggesting that invalid sequences CAN be replaced by valid ones, let's drop that premise. We just need to choose the most appropriate escaping technique, and assigning codepoints is the best choice.
> - use other programs that will accept and will be able to
> process invalid files as opaque sequences of bytes instead of
> as a stream of Unicode characters.
Text-based programs were usable with Latin 1, and they will be usable with UTF-8 once there are no invalid sequences anywhere. Why should complex programs be rewritten to treat data as binary just to get over a period of time in which some invalid sequences will be present? Is this cost-effective? Is it as easy as you make it sound? Eventually that binary data will need to be displayed, or entered; the UI is text, isn't it? Or should we start displaying all UNIX filenames as hex codes?

And saying that the text-based approach will work once everything really is clean UTF-8 is also not entirely true: there will always be occasional invalid sequences. Suppose you are accessing a UNIX filesystem from Windows and, somehow, one file has an invalid sequence. Isn't it better to be able to access that file, or at least rename it, from Windows? You want to signal the error on the UNIX side, which is not where the user is. Force the user to log in to that system? Why? Because of some fear that my conversion will break everything? Because one can use philosophy to prove that UNIX filenames are sequences of bytes, even though we are all aware they are text?
> - change the meta-data file-type so that it will no longer be
> considered as plain-text
> - change the meta-data encoding label, so that it will be
> treated as ISO-8859-1 or some other complete 8-bit charset
> with 256 valid positions (like CP850, CP437, ISO-8859-2, MacRoman...).
And have all the other names displayed wrong? There may be applications running at the same time that depend on accessing the files by the names they had at an earlier point in time. It also depends on where the conversion is done - what if the setting is on the share side? Then fix the application, right? Can you weigh the cost of that against your desire not to have some 128 codepoints in Unicode, just because you THINK they are not needed?
Lars

