Title: RE: Roundtripping in Unicode

Marcin 'Qrczak' Kowalczyk wrote:
> UTF-8 is painful to process in the first place. You are making it
> even harder by demanding that all functions which process UTF-8 do
> something sensible for bytes which don't form valid UTF-8. They even
> can't temporarily convert it to UTF-32 for internal processing for
> convenience.
My point exactly. I am proposing to provide a conversion so that you can. All you need is to assign 128 codepoints and define their properties. They would be printable, non-space characters, with no upper/lower case properties, and they would collate (for example) after all letters but before any special characters, and so on. Then you don't need to fix anything, not in the functions. You just need to convert (that is, convert the raw byte stream to UTF-8) at the boundaries where you expect such data, and decide whether you need to prevent anything for security reasons. If not, then you're done.
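To make this concrete, here is a minimal Python sketch of such a conversion. The names (bytes_to_text, text_to_bytes, ESCAPE_BASE) and the placement of the 128 codepoints at U+EF80..U+EFFF are my own illustrative assumptions, not part of any standard; internally it just reuses Python's surrogateescape machinery as a convenient way to locate the undecodable bytes.

# A minimal sketch, assuming the 128 codepoints land at U+EF80..U+EFFF
# (a hypothetical choice; the actual assignment would be up to the UTC).
ESCAPE_BASE = 0xEF80

def bytes_to_text(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, mapping each byte that is not part of a
    valid sequence (always 0x80..0xFF) to one of the 128 escape codepoints."""
    # surrogateescape turns every undecodable byte B into the lone
    # surrogate U+DC00+B; we then remap those to the proposed codepoints.
    tmp = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ESCAPE_BASE + ord(c) - 0xDC80) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in tmp
    )

def text_to_bytes(text: str) -> bytes:
    """Reverse conversion: each escape codepoint becomes the original byte.
    Note: raw input that already validly encodes one of these 128 codepoints
    is ambiguous; that is the known trade-off of any such scheme."""
    tmp = "".join(
        chr(0xDC80 + ord(c) - ESCAPE_BASE)
        if ESCAPE_BASE <= ord(c) <= ESCAPE_BASE + 0x7F else c
        for c in text
    )
    return tmp.encode("utf-8", errors="surrogateescape")

# Round trip: the invalid byte 0xFF survives unchanged.
assert text_to_bytes(bytes_to_text(b"abc\xffdef")) == b"abc\xffdef"

The only hard requirement is that the mapping is reversible for arbitrary byte input, so whatever you read at a boundary can always be handed back unchanged.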

So, no, I am not demanding that UTF-8 functions behave differently. Existing functions work perfectly well, provided you convert to UTF-8 first (that is, use three bytes to represent each invalid byte as a valid codepoint). It would be beneficial if the functions could handle raw invalid sequences directly, but that is a separate issue. It would need to be determined which functions could do so: maybe all could, maybe only some, maybe none should. That needs to be investigated before anything is changed. This is in line with what I said about validation: processing functions may do validation implicitly, but it is not a requirement unless you make it one. In my opinion it is better to separate validation from processing. In that case you can even prescribe exactly what the functions should do with invalid data, namely exactly what they would do if the data had been converted to UTF-8 according to my conversion. But again, this is a next step that needn't be taken at all.
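A short follow-up sketch, again with my own hypothetical names (had_invalid_bytes, plus bytes_to_text and ESCAPE_BASE from above): validation stays a separate, explicit step at the boundary, while ordinary string processing just sees ordinary characters.

def had_invalid_bytes(text: str) -> bool:
    """True if the original byte string was not valid UTF-8, i.e. the
    converted text contains at least one of the 128 escape codepoints."""
    return any(ESCAPE_BASE <= ord(c) <= ESCAPE_BASE + 0x7F for c in text)

# Processing needs no changes: to existing string functions the escape
# codepoints are just characters.
name = bytes_to_text(b"r\xe9sum\xe9.txt")   # Latin-1 bytes in a UTF-8 world
assert name.endswith(".txt")                # normal processing still works
assert had_invalid_bytes(name)              # a boundary check can still reject it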

>
> > Listing files in a directory should not signal anything. It MUST
> > return all files and it should also return them in a way that this
> > list can be used to access each of the files.
>
> Which implies that they can't be interpreted as UTF-8.
>
> By masking an error you are not encouraging users to fix it.
> Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
Failure to process such files is also an error. Think of virus scanners and backup tools.
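For instance, a backup or scanning loop could look roughly like the following on a POSIX system, reusing the hypothetical bytes_to_text / text_to_bytes from the earlier sketch; every file stays reachable no matter how its name is encoded.

import os

# Sketch of the backup / virus-scanner case: nothing may be skipped,
# whatever the filename's encoding.
for raw_name in os.listdir(b"."):            # ask for bytes so nothing is dropped
    if not os.path.isfile(raw_name):
        continue
    display = bytes_to_text(raw_name)        # always representable as text
    with open(text_to_bytes(display), "rb") as f:   # round-trips to the same file
        data = f.read()                      # scan or archive the contents here
    print("processed %r (%d bytes)" % (display, len(data)))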

> > The interesting thing is that if you do start using my conversion,
> > you can actually get rid of the need to validate UTF-8 strings
> > in the first scenario. That of course means you will allow users
> > with invalid UTF-8 sequences, but if one determines that this is
> > acceptable (or even desired), then it makes things easier. But the
> > choice is yours.
>
> For me it's not acceptable, so I will not support declaring it valid.
I said the choice is yours. My proposal does not prevent you from doing it your way: you don't need to change anything, and it will still work the way it worked before. OK? I just want 128 codepoints so that I can make my own choice. And, once and for all, you can treat those 128 codepoints just as you do today.


Lars
