Lars Kristan <[EMAIL PROTECTED]> writes:

> And once we understand that things are manageable and not as
> frightening as it seems at first, then we can stop using this as an
> argument against introducing 128 codepoints. People who will find
> them useful should and will bother with the consequences. Others
> don't need to and can roundtrip them as today.
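The roundtripping in question maps each byte that fails to decode onto
one of 128 reserved codepoints, and maps it back to the original byte
on output. A minimal sketch of that shape, using Python's
"surrogateescape" error handler (PEP 383) as an analogue; it reserves
the 128 surrogates U+DC80..U+DCFF rather than the codepoints proposed
in this thread:

    raw = b"caf\xe9"   # Latin-1 bytes, not valid UTF-8

    # A strict UTF-8 decoder rejects the data outright:
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print("strict decoder:", e)

    # A roundtripping decoder smuggles the bad byte through as a
    # reserved codepoint instead (printed as codepoints, since lone
    # surrogates often can't be written to a terminal):
    text = raw.decode("utf-8", errors="surrogateescape")
    print([hex(ord(c)) for c in text])
    # ['0x63', '0x61', '0x66', '0xdce9']

    # ...and the original byte sequence is recreated exactly:
    assert text.encode("utf-8", errors="surrogateescape") == raw

Note that the two decoders above already disagree about whether the
same UTF-8-labeled data is valid at all, which is exactly the problem
discussed below.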
A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people and programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8,
expecting them to accept the data.

> So, interpreting the 128 codepoints as 'recreate the original byte
> sequence' is an option.

Which guarantees that different programs will have different views of
the validity and meaning of the same data labeled with the same
encoding. Long live standardization.

> Even I will do the same where I just want to represent Unicode in
> UTF-8. I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

> The fact that my conversion actually produces UTF-8 from most of
> Unicode points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities for
mislabeling data and processing it with the wrong libraries (it works
"in most cases", so the error is not detected immediately), and a
harder life for programs which aim to support all data.

Think further than the immediate moment, in which many people are
performing a transition from something else to UTF-8.

Look at what happened with the interpretation of HTML in web browsers.
If the standard had stood firmly from the beginning in disallowing
"guessing" what malformed HTML was supposed to mean, people would have
learned how to produce correct HTML and its interpretation would be
unambiguous. But browsers tried to accept arbitrary contents and
interpret whatever parts of HTML they found there, guessing how errors
should be resolved, being "friendly" to careless webmasters. The
effect is that too often a webmaster published a page after checking
that it worked in his browser, when in fact it had basic syntax
errors. Other browsers interpreted the errors differently, and the
page was inaccessible or looked bad. When designing XML, they learned
from this mistake:

http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject the balkanization of UTF-8 into
variations with subtle differences, like Java-modified UTF-8 (a
concrete sketch of that problem follows at the end of this message).

> Inaccessible filenames are something we shouldn't accept. All your
> discussion of non-empty empty directories is just approaching the
> problem from the wrong end.

One should fix the root cause, not the consequences. The root cause is
that users and programs use different encodings in different places,
and thus Unix filenames can't be unambiguously and context-freely
interpreted as character sequences. Unfortunately it's hard to fix.
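To make the Java-modified UTF-8 point concrete, here is a minimal
sketch (Python is used purely for illustration). The byte sequence is
the real modified-UTF-8 encoding of U+0000, which that variation
writes as the overlong pair C0 80; a conforming UTF-8 decoder must
reject it:

    # "A", then U+0000 as Java-modified UTF-8 writes it, then "B":
    modified = b"\x41\xc0\x80\x42"

    try:
        modified.decode("utf-8")       # strict, conforming UTF-8
    except UnicodeDecodeError as e:
        print("rejected:", e)          # overlong C0 80 is invalid

The same bytes are perfectly valid to a modified-UTF-8 reader, so the
two sides disagree about the validity of identically labeled data.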
-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
