Lars Kristan suggested:

> OK, another way of looking at all this. I believe you would accept three
> options:
> A - Reject the stream.
> B - Drop the invalid data.

If you were defining an application concerned with security, and if you had a clearly defined conversion you were performing, yes, these would be valid options, since if your conversion table is correctly defined, anything that falls outside it means you are being fed garbage.

> C - Replace the invalid characters with U+FFFD (the replacement character).

This, however, is the more graceful and robust way to handle conversions that are undefined in your conversion table -- and is the way recommended by the Unicode Standard.

Your concern about old software behaving gracefully when dealing with an updated version of a data stream is a valid one that we know we will run into -- the additions for the euro sign in many code pages were a recent case in point. But if software designers follow the fallback guidelines (U+FFFD for unavailable conversion, missing glyph for display, and so on), then older software shouldn't choke when encountering previously unencoded characters in newer data streams.
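[Editor's note: options A, B, and C correspond directly to the error handlers exposed by most codec APIs. A minimal sketch of this, using Python's built-in euc_jp codec -- my illustration, not code from the original exchange:]

    # Hypothetical input: "ni" (A4 CB), one illegal byte (0x80 can never
    # start an EUC-JP sequence), then "ho" (A4 DB).
    data = b"\xa4\xcb\x80\xa4\xdb"

    # Option A - reject the stream:
    #   data.decode("euc_jp", errors="strict")  # raises UnicodeDecodeError
    # Option B - drop the invalid data:
    #   data.decode("euc_jp", errors="ignore")  # the illegal byte vanishes
    # Option C - the recommended fallback:
    print(data.decode("euc_jp", errors="replace"))  # illegal byte -> U+FFFD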
> Then my proposal could be viewed as an addition to option C, with one
> difference. Instead of one replacement character, I propose to have 256
> (though in most cases only 128 would be used). Now, what does that violate?

Parsimony and good sense. And it seems to have overlooked the fact that not all conversions to Unicode are defined on single-byte character encodings. What if you were converting EUC-JP to Unicode? Of the 65,536 two-byte combinations, 40,253 are illegal; 7,359 involve at least one control code and might be questionable to convert, depending; and of the 8,836 legal A1..FE/A1..FE combinations, many are not actually defined for JIS X 0208. And then there are 3-byte combinations in 0x8F..., most of which are also illegal or undefined.
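[Editor's note: those figures follow from straightforward byte-range arithmetic. The tally below is my own reconstruction, not code from the thread; in particular, the "ASCII pair" and "half-width katakana" buckets are my inference about how the 9,088 combinations the quoted figures leave unmentioned were counted.]

    # Structural check only: a "legal" pair may still be unassigned in
    # JIS X 0208, and 0x8F-prefixed pairs land in "illegal" here because
    # their required third byte is missing.
    from collections import Counter

    def classify(b1: int, b2: int) -> str:
        if 0xA1 <= b1 <= 0xFE and 0xA1 <= b2 <= 0xFE:
            return "legal JIS X 0208 structure"    # 94 * 94 = 8,836
        if b1 == 0x8E and 0xA1 <= b2 <= 0xDF:
            return "legal half-width katakana"     # SS2 sequences: 63
        if b1 <= 0x7F and b2 <= 0x7F:              # two single-byte ASCII codes
            if b1 <= 0x1F or b1 == 0x7F or b2 <= 0x1F or b2 == 0x7F:
                return "involves a control code"   # 128*128 - 95*95 = 7,359
            return "legal ASCII pair"              # 95 * 95 = 9,025
        return "illegal"                           # everything else: 40,253

    tally = Counter(classify(b1, b2) for b1 in range(256) for b2 in range(256))
    for label, count in tally.most_common():
        print(f"{label}: {count}")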
Are you proposing that we use 256 "GARBAGE CONVERSION BYTE-00".."GARBAGE CONVERSION BYTE-FF" characters in arbitrary sequences to replicate all these illegal values into a Unicode stream if garbage purporting to be EUC-JP gets pumped at a convertor, just so you can maintain round-trippability of the garbage? I don't think this is any more useful than throwing an exception (to the error handler, by the way, not to the secretary on the third floor) and dumping the input into a sanitary can labelled "invalid data which was labelled 'EUC-JP' on input".

By the way, just to turn the screw here a little bit: how is legacy software that uses U+FFFD correctly for dealing with unavailable conversions supposed to react when it comes across new GARBAGE CONVERSION BYTE characters that were undefined when it was written? How do you expect unaware conversion implementations to deal with your mechanism for maintaining convertibility for older software unable to deal with new data streams? Right -- it won't handle them correctly, your garbage convertibility hints will be garbaged away, and you still can't get your roundtrip garbage.

--Ken
