Title: RE: Roundtripping Solved

Peter Kirk wrote:
> But this last requirement provides the proof that you can't have what
> you want.
>
> The current situation is:
>
> 1. for all valid UTF-8 strings s8, f(s8) is a valid UTF-16 string and
> g(f(s8)) = s8
> 2. for all valid UTF-16 strings s16, g(s16) is a valid UTF-8 string and
> f(g(s16)) = s16
>
> Your requirements are apparently:
>
> 3. for all INVALID UTF-8 strings t8, f(t8) is a valid UTF-16 string and
> g(f(t8)) = t8
>
> But if f(t8) is a valid UTF-16 string, by rule 2 g(f(t8)) is a valid
> UTF-8 string, and by rule 3 g(f(t8)) = t8. But we have already stated
> that t8 is an INVALID UTF-8 string. So there is a mathematically proved
> inconsistency in your requirements.

This only proves that all three requirements cannot be met by a single conversion pair. If they could be met, then such a conversion could be used immediately for converting to and from UTF-8.

However, requirements 1 and 2 are actually taken from the Unicode standard; they are not my requirements.

How's that? Well, they are my requirements also, but instead of "for all valid UTF-x strings", in my case the requirement is relaxed to "for all valid UTF-8 strings that do not contain the 128 replacement codepoints".


> The only way round this is to break the functionality of g so that it
> does not correctly convert all valid UTF-16 strings to UTF-8. That will
> certainly be unacceptable to the UTC.

Why not? g does not claim to produce UTF-8 and is not intended to. f(x) is used on "unclean, not-really-binary, but-mostly-UTF-8" data. And g(y) produces such data.

g(f(x)) is very useful. It preserves all the data and roundtrips.
f(g(y)) is not problematic. It behaves like UTF16(UTF8(s16)) for all codepoints except the infamous 128, which is acceptable in my case. Or, well, it would be if everyone agreed on what those 128 codepoints are and what their purpose is.

What is more, f(x) only produces sequences of the 128 codepoints for which f(g(y)) = y actually holds.

Furthermore, today, y should not contain any of the 128 codepoints (assuming the UTC takes unassigned codepoints and assigns them today). Any occurrences after today shall be interpreted according to their intended meaning.

Sequences for which f(g(y)) is NOT y can be declared invalid sequences. Applications dealing with security could reject them. For the rest, anything that happens will only be amusing, rarely confusing, never dangerous. No more so than with any other escaping technique. And considerably less so than being unable to access files, or having files displayed with missing characters (or no characters at all).
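
To make the behaviour of f and g concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a proposal for the actual assignment: the 128 codepoints are placed, purely hypothetically, at U+E080..U+E0FF in the Private Use Area (ESC_BASE is my invention), and Python's str stands in for a valid UTF-16 string. The helper name is_roundtrip_safe is also made up. Python's existing "surrogateescape" error handler is used only as an intermediate step to find the undecodable bytes.

```python
# Hypothetical placement of the 128 escape codepoints: U+E080..U+E0FF.
# A real assignment would be up to the UTC; this range is an assumption.
ESC_BASE = 0xE000
ESC_LO, ESC_HI = ESC_BASE + 0x80, ESC_BASE + 0xFF


def f(data: bytes) -> str:
    """Convert mostly-UTF-8 bytes to a valid Unicode string."""
    out = []
    # 'surrogateescape' turns each undecodable byte 0xNN into U+DCNN.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: map it to its escape codepoint.
            out.append(chr(ESC_BASE + (cp - 0xDC00)))
        elif ESC_LO <= cp <= ESC_HI:
            # A validly encoded escape codepoint in the input is treated
            # as raw bytes, so that g(f(x)) still returns x.
            out.extend(chr(ESC_BASE + b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


def g(text: str) -> bytes:
    """Convert back: escape codepoints become single raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESC_LO <= cp <= ESC_HI:
            out.append(cp - ESC_BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)


def is_roundtrip_safe(y: str) -> bool:
    """True if f(g(y)) == y, i.e. y is a sequence f could have produced."""
    return f(g(y)) == y
```

With this sketch, g(f(x)) == x for any byte string x, for example the Latin-1 filename b"caf\xe9.txt". A string such as "\ue0c3\ue0a9" is one of the sequences a security-sensitive application could reject: g turns it into the bytes C3 A9, which f then reads as a plain "é", so is_roundtrip_safe returns False for it.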



> Alternatively, you need to relax your requirement that f(t8) is a valid
> UTF-16 string, and instead allow that it can be a UTF-16-like string but
> including something invalid like a noncharacter or an unpaired
> surrogate. This will not be technically valid for interchange, of
> course. But my suggestion of using a noncharacter as an escape is a way
> in which this could be done.

No, this is the most important requirement. The idea is to obtain a VALID UTF-16 string. Interchange is vital. Otherwise I cannot even use a Unicode database to store these strings. Obtaining a semi-valid string achieves nothing. I might as well stick with the original 'binary' stream (well, an 8-bit, opaque, NUL-terminated string), which is terribly impractical.


Lars
