Lars Kristan <[EMAIL PROTECTED]> writes:

> And once we understand that things are manageable and not as
> frightening as it seems at first, then we can stop using this as an
> argument against introducing 128 codepoints. People who will find
> them useful should and will bother with the consequences. Others
> don't need to and can roundtrip them as today.
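The roundtripping in question maps each byte that fails to decode onto
one of 128 reserved codepoints, and maps it back to the original byte
on output. A minimal sketch of that shape, using Python's
"surrogateescape" error handler (PEP 383) as an analogue; it reserves
the 128 surrogates U+DC80..U+DCFF rather than the codepoints proposed
in this thread:

    raw = b"caf\xe9"   # Latin-1 bytes, not valid UTF-8

    # A strict UTF-8 decoder rejects the data outright:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print("strict decoder:", e)

    # A roundtripping decoder smuggles the bad byte through as a
    # reserved codepoint instead (printed as codepoints, since lone
    # surrogates often can't be written to a terminal):
    text = raw.decode("utf-8", errors="surrogateescape")
    print([hex(ord(c)) for c in text])
    # ['0x63', '0x61', '0x66', '0xdce9']

    # ...and the original byte sequence is recreated exactly:
    assert text.encode("utf-8", errors="surrogateescape") == raw

Note that the two decoders above already disagree about whether the
same UTF-8-labeled data is valid at all, which is exactly the problem
discussed below.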
A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people and programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8,
expecting them to accept the data.

> So, interpreting the 128 codepoints as 'recreate the original byte
> sequence' is an option.

Which guarantees that different programs will have different views of
the validity and meaning of the same data labeled with the same
encoding. Long live standardization.

> Even I will do the same where I just want to represent Unicode in
> UTF-8. I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

> The fact that my conversion actually produces UTF-8 from most of
> Unicode points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities for
mislabeling data and processing it with the wrong libraries (it works
"in most cases", so the error is not detected immediately), and a
harder life for programs which aim to support all data.

Think further than the immediate moment, in which many people are
performing a transition from something else to UTF-8.

Look at what happened with the interpretation of HTML in web browsers.
If the standard had stood firmly from the beginning in disallowing
"guessing" what malformed HTML was supposed to mean, people would have
learned how to produce correct HTML and its interpretation would be
unambiguous. But browsers tried to accept arbitrary contents and
interpret whatever parts of HTML they found there, guessing how errors
should be resolved, being "friendly" to careless webmasters. The
effect is that too often a webmaster published a page after checking
that it worked in his browser, when in fact it had basic syntax
errors. Other browsers interpreted the errors differently, and the
page was inaccessible or looked bad. When designing XML, they learned
from this mistake:

http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject the balkanization of UTF-8 into
variations with subtle differences, like Java-modified UTF-8 (a
concrete sketch of that problem follows at the end of this message).

> Inaccessible filenames are something we shouldn't accept. All your
> discussion of non-empty empty directories is just approaching the
> problem from the wrong end.

One should fix the root cause, not the consequences. The root cause is
that users and programs use different encodings in different places,
and thus Unix filenames can't be unambiguously and context-freely
interpreted as character sequences. Unfortunately it's hard to fix.
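To make the Java-modified UTF-8 point concrete, here is a minimal
sketch (Python is used purely for illustration). The byte sequence is
the real modified-UTF-8 encoding of U+0000, which that variation
writes as the overlong pair C0 80; a conforming UTF-8 decoder must
reject it:

    # "A", then U+0000 as Java-modified UTF-8 writes it, then "B":
    modified = b"\x41\xc0\x80\x42"

    try:
        modified.decode("utf-8")       # strict, conforming UTF-8
    except UnicodeDecodeError as e:
        print("rejected:", e)          # overlong C0 80 is invalid

The same bytes are perfectly valid to a modified-UTF-8 reader, so the
two sides disagree about the validity of identically labeled data.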
-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
