Title: RE: Is it roundtripping or transfer-encoding

Philippe Verdy wrote:
> No, it won't happen, because Unicode and ISO/IEC-10646
> already states that
> it encodes abstract characters.

I see that as a technicality. What matters are the consequences of rules, not the rules themselves. The consequences of breaking a rule should be analyzed (thoroughly and carefully), and if they are acceptable (manageable) and the change proves useful, then the rules should be reinterpreted. So, I think the UTC needs to interpret the rules, not follow them literally. The rest of us should be allowed to try to interpret the rules on our own and make suggestions. An attempt to break a rule should not constitute a show-stopper for a useful concept, especially not while analysis of the consequences is still in progress.


>
> This finally mean that you want these codepoints recognized
> as characters
> sometimes,

And that is exactly how they should be treated by UTFs. And they already are. There is no conflict there.


> but not when you perform the conversion with a
> transform-encoding-syntax. A transform-encoding-syntax must
> also not modify
> the codepoints represented by an encoding scheme (or
> charset), and UTFs have
> also the property of having a single representation of these
> codepoints

OK, so it introduces multiple representations of the same codepoints. Every escaping technique does that, and it is not a problem. All you need to do is define the normalization procedure and use it where it applies. In many cases its use is not even necessary. Specifically, a Unicode system does not need to (and should not) normalize the escape codepoints. The need for normalization only has to be determined for an application that uses the TES itself, and it applies only in a few cases.
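As it happens, Python's `surrogateescape` error handler is an existing implementation of exactly this kind of escaping: it maps each undecodable byte to one of 128 reserved code points (lone surrogates U+DC80..U+DCFF, where the proposal here would use real assigned characters). A minimal sketch of the roundtrip:

```python
# "surrogateescape" maps each undecodable byte 0x80..0xFF to a lone
# surrogate U+DC80..U+DCFF -- 128 escape code points, analogous to the
# scheme discussed here (which would use assigned characters instead).
raw = b"name-\xff\xfe-file"                  # not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
assert text == "name-\udcff\udcfe-file"      # each bad byte -> one code point
back = text.encode("utf-8", errors="surrogateescape")
assert back == raw                           # lossless roundtrip
```

A Unicode application that never re-encodes with the same handler simply sees those code points as (unusual) code points, which matches the "treat them as characters" behaviour described above.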

CESU-8 has similar problems. If it is misinterpreted as Unicode, it self-normalizes once it trips through UTF-16. My data self-normalizes if it trips through NON-UTF-8 (or shall we call it MUTF-8, Mostly-UTF-8, at the risk of being called a mutant:). CESU-8 is slightly simpler, because it self-normalizes completely and can also always be normalized back to a CESU-8 representation. My conversion only normalizes partially (it only normalizes completely after length/3 trips, in the worst case). Also, after a full normalization you can no longer tell how many times the data was escaped in its original form. In real life, this is often acceptable, and far better than not being able to handle invalid sequences as gracefully as the MUTF-8 conversion does.
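The CESU-8 self-normalization can be demonstrated directly. Python has no `cesu-8` codec, so the sketch below builds the CESU-8 bytes by hand with the `surrogatepass` handler; the encodings involved are real, only the construction is an illustration:

```python
# CESU-8 represents a supplementary character as two 3-byte sequences,
# one per UTF-16 surrogate, instead of one 4-byte UTF-8 sequence.
ch = "\U00010000"
utf8 = ch.encode("utf-8")
assert utf8 == b"\xf0\x90\x80\x80"

# Build the CESU-8 form: encode each surrogate of the pair separately.
cesu8 = ("\ud800".encode("utf-8", "surrogatepass")
         + "\udc00".encode("utf-8", "surrogatepass"))
assert cesu8 == b"\xed\xa0\x80\xed\xb0\x80"

# Misread as (lenient) UTF-8 and tripped through UTF-16, the data
# self-normalizes: the surrogate pair rejoins into the real character.
misread = cesu8.decode("utf-8", "surrogatepass")
tripped = misread.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
assert tripped == ch
assert tripped.encode("utf-8") == utf8       # back to well-formed UTF-8
```

One trip through UTF-16 is enough here, which is why CESU-8 normalizes completely while the escaping conversion above may need several trips.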

The above is a loose description of what happens. Not all cases are covered systematically, but they can be. You can define, for example, that escape sequences which normalize to new escape sequences or to invalid UTF-8 sequences are valid (or expected). Those that normalize to other codepoints could be considered invalid, or ill-formed. But again, that only matters in a few specific cases. It matters if you'd be handling users this way, but doesn't if you are mapping filenames. Even less if one wants to apply this technique to editing text files.

There are two options for using this technique:

A - You can treat it as 'use it in rare cases'. UTF-8 then remains what it is and existing Unicode applications already treat those codepoints exactly as they should.

B - You can start using it wherever you convert to or (well, and) from UTF-8. Typically you need to do it in both directions or else you risk over-escaping in one case and self-normalization in the other. The latter can even be useful in some cases, specifically where graceful handling is desired, but roundtripping is not required.

Now, case B is what I said I would not be trying to do, and that is replacing the UTF-8 conversions with a new conversion. But the consequences of that can be determined. In the long run it actually reduces the risks of over-escaping and self-normalization. The major 'problem' most people brought up is that it threatens to introduce invalid sequences into UTF-8, which would mean that all UTF-8 readers would need to start handling them. Perhaps. If they knew how, it wouldn't be that hard anyway. But then again, what about the period when they don't, and what if they decide never to? Well, does it really matter whether they got the data directly from a corrupted source or from an application that managed to preserve it and reconstructed it? So, it is not introducing, it is preserving.

It is a question of signalling, or raising an exception. Some applications have no way of signalling an error. Signalling "as early as possible" is, in my opinion, an excuse in this case. Signalling should be done at the point where the user can make decisions and is able to fix the problem. And even at that point, you have users that do want the signalling and users that don't, and the latter are the majority. From the perspective of a standardizer that can be seen as unwise, but in real life, usability prevails. Did you ever see an ls command on UNIX warn you about invalid sequences? Of course not; it would be completely unusable. Well, the fact is many UTF-8 decoders (or renderers) don't even use U+FFFD, they simply drop the sequence. Very bad. But no matter how you improve it, signalling will never be an option, not in ls, not while rendering. And U+FFFD is not a very good option either.
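The difference between substituting U+FFFD and silently dropping is easy to see with any lenient decoder; in Python terms it is the difference between the `replace` and `ignore` error handlers:

```python
raw = b"abc\xff\xfedef"                      # two invalid bytes

# U+FFFD substitution: lossy, but the damage stays visible.
assert raw.decode("utf-8", errors="replace") == "abc\ufffd\ufffddef"

# Silent dropping: the reader cannot even tell data was lost.
assert raw.decode("utf-8", errors="ignore") == "abcdef"
```

Neither variant can be roundtripped back to the original bytes, which is the gap the 128 escape code points are meant to close.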


> If you want to strictly limit the case where escaping of
> valid characters
> will happen, the best option you have in Unicode is to use
> PUAs which are
> the least likely to happen in original strings (of characters and
> non-characters), in absence of an explicit agreement.

Assigning new characters is then even better.


>
> Note that a Transfer-Encoding-Syntax, to be usable, requires
> an explicit
> mutual agreement to allow the conversion in either direction.

That explicit agreement is one of the things I am trying to avoid. It can be avoided, and that is the intent of standards.

But I am not so sure this should be called TES after all. It has often been suggested or implied that what I do is completely internal and enclosed. But that is not true. I started by storing the filenames in UTF-16. But, eventually, the filenames can be displayed on Windows. Or created in a Windows filesystem (with a few additional restrictions compared to displaying, but only those that had already existed before).


> PUA, or it may even ignore "silently" these PUAs in the
> rendered graphic,
> signaling elsewhere to the user that not all characters could
> be rendered

I would say, "may, but *only* if it signals". And the same goes for invalid sequences. But it is not done that way, far too often. Lots of it will need to be fixed. By using U+FFFD? There is a better way: use 128 new characters. You can look at all this from the other end. First, allow (and provide a means for) renderers to display an invalid UTF-8 sequence (for example in an ls command). A useful thing. The rest comes naturally.


> There are tons of existing TES used everyday in many
> applications, and none
> of them required the allocation of distinct codepoints for
> the encoded
> strings they generate. Why do you want new characters for
> this mapping? It's
> not necessary as demonstrated by all the other existing TES...

Four reasons:
1 - Display. Having new characters (or, escape codepoints with their appearance defined) allows the text to remain visually similar. Length of the text is preserved in many cases, words are easier to read (deduce), and line breaks cause less problems. All pretty similar to how mixed encoding environments have behaved all this time. No other escaping technique can provide this. BTW, U+FFFD can, but is lossy.

2 - Other escaping techniques do not retain the usual assumption that UTF-16 is at most twice as big (in bytes) as UTF-8 or MUTF-8, which can lead to bugs and increased memory consumption.

3 - The PUA solution works well, but has some inherent risks. And cannot be standardized.
4 - Anyone who encounters the same problem I have encountered might devise a new escaping technique, adding a few kilos to those tons.
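The sizing assumption in reason 2 can be checked directly. A conventional printable escape such as "%FF" turns one byte into three UTF-16 code units, while a dedicated escape code point costs exactly one; the surrogate below stands in for one of the proposed 128 characters:

```python
# A conventional printable escape ("%FF") for one invalid byte:
percent = "%FF"
assert len(percent.encode("utf-16-le")) == 6   # 6 bytes for 1 original byte

# One dedicated escape code point for the same byte (surrogateescape
# style, a stand-in for one of the proposed 128 characters):
escaped = "\udcff"
assert len(escaped.encode("utf-16-le", "surrogatepass")) == 2  # 2 bytes

# Only the code-point approach keeps UTF-16 within twice the size of
# the original byte stream.
```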

It was impossible to afford something like assigning any, let alone 128, codepoints in an SBCS. Nobody thought of it in MBCS either, but those were dealing with conversions from SBCS, which don't have invalid sequences and have very few unassigned positions, and they were able to preserve the invalid sequences. If UTF-8 were to replace them all, we wouldn't need it either, since UTF-8 also CAN preserve invalid sequences. Well, it would be nice if they could be displayed and collated, but perhaps even that would succeed, since there would be no other UTFs and the many-to-one issue would not exist. The problem of invalid sequences is a Unicode problem. Not addressing it will not make it go away.


Lars
