Re: utf-8 != latin-1
One of the main features of XML is that it has quite strict rules about how to handle errors. The goal, I believe, is to ensure that we are not awash in malformed files that have no clear interpretation. And this is clearly an error: the acceptable code points are quite clearly stated:

http://www.w3.org/TR/2000/REC-xml-20001006#dt-character

Converting an illegal UTF-8 sequence into a valid -- BUT WRONG -- sequence of valid code points is clearly against the intent of this production rule. XML could have taken the opposite tack -- that illegal code points and illegal code unit sequences are to be ignored. But it didn't.

Mark

BTW, I have a simple browser-based UTF converter (in JavaScript) at http://www.macchiato.com/unicode/charts.html (click on Converter). It lets you convert back and forth between different UTFs, with various choices of format. And it checks for illegal UTF-8 sequences!

----- Original Message -----
From: "Doug Ewell" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, October 13, 2000 21:59
Subject: Re: utf-8 != latin-1

> "Steven R. Loomis" [EMAIL PROTECTED] wrote:
>
> > What happened was that the sequence AD 63 61 73 was interpreted as
> > U+E54E U+DC73..
>
> Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as
> anything. John Cowan's "utf" perl script (which carries the appropriate
> disclaimers about no error checking) converts that sequence to
> U+D94E U+DC73, which seems a bit more reasonable -- at least it's a
> complete surrogate pair.
>
> -Doug Ewell
>  Fullerton, California
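[Editor's illustration, not part of the original thread: Mark's reject-don't-repair point can be sketched with a modern strict decoder. Python's default 'strict' error handler happens to implement exactly the behavior XML mandates -- any ill-formed byte sequence is an error, never silently reinterpreted.]

```python
# A minimal sketch of strict UTF-8 validation, in the spirit of XML's
# "reject, don't repair" rule: the decoder raises on any illegal sequence.
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if `data` is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# The problematic bytes from the thread: a raw Latin-1 soft hyphen (0xAD)
# followed by "cas". 0xAD is a continuation byte and cannot start a UTF-8
# sequence, so a strict decoder rejects it outright.
print(is_valid_utf8(b"\xadcas"))                          # False
print(is_valid_utf8("lower\u00adcase".encode("utf-8")))   # True
```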
Re: utf-8 != latin-1
Hello,

I didn't get it. So what happens if a company had a job site in Unicode, and people were copying resume text from Word, written in ISO 8859-1, and pasting it into a text window in the browser? Does the character set automatically convert correctly, or does the user need to use a character set converter like Recode?

Thanks,
George

On Sat, 14 Oct 2000, you wrote:

> Here's a gotcha story ..
>
> Someone was working on documentation files in XML. The PDF generator all
> of a sudden started choking, complaining that there was "Illegal
> character U+DC73" somewhere in the late stages of PDF generation. Well,
> the low surrogate certainly didn't belong there. Software bug? Memory
> corruption?
>
> I converted the 1.1 MB intermediate file into literal \u notation and
> searched for DC73. Sure enough, there was lower\uE54E\uDC73e (U+E54E, a
> PUA character, and U+DC73) .. in place of what was "lower-case" in the
> source text. Definitely memory corruption.. But wait..
>
> On a hunch, I deleted the hyphen and retyped it, which somehow fixed it.
> I was told that the text "lower-case" was copied from another document.
> Further inspection showed that the offending hyphen was actually \xAD,
> "soft hyphen". Since the XML document had no encoding tag, it defaults
> to... UTF-8! What happened was that the sequence AD 63 61 73 was
> interpreted as U+E54E U+DC73..
>
> So the moral: BE CAREFUL when you are pasting text into UTF-8
> documents..
>
> -steven
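[Editor's illustration, not part of the original thread: the answer to George's question is that someone, somewhere, must transcode the text using the source encoding. A sketch of the correct conversion, with names and the Python language chosen for illustration only:]

```python
# What *should* happen when Latin-1 text is pasted into a UTF-8 document:
# the soft hyphen U+00AD must be transcoded, not copied byte-for-byte.
latin1_bytes = b"lower\xadcase"        # "lower-case" with a soft hyphen, in Latin-1
text = latin1_bytes.decode("latin-1")  # decode using the *source* encoding
utf8_bytes = text.encode("utf-8")      # re-encode for the UTF-8 document

# The soft hyphen is now the two-byte UTF-8 sequence C2 AD, which any
# conforming decoder accepts; a bare AD byte is what broke the XML file.
print(utf8_bytes)   # b'lower\xc2\xadcase'
```

When the paste path skips this step and copies raw bytes, the bare AD byte lands in the UTF-8 file, which is exactly the gotcha in Steven's story.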
Re: utf-8 != latin-1
Doug Ewell wrote:

> Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as
> anything.

It wasn't interpreted as anything. Processing halted at that point in the text, as an error.

George Zeigler wrote:

> I didn't get it. So what happens if a company had a job site in Unicode,
> and people were copying resume text from Word, written in ISO 8859-1,
> and pasting it into a text window in the browser? Does the character set
> automatically convert correctly, or does the user need to use a
> character set converter like Recode?

The text was pasted into Windows Notepad or some other editor that was editing an XML file. XML files, unless otherwise tagged, are UTF-8, but the editor thought the file was in something like Windows-1252. So, the right thing to do *might* have been to tag the file as 'windows-1252'. A better solution would be to use only UTF-8 aware editors. My point is that it was hard to tell visually whether the data being copied was a 'safe' subset of both UTF-8 and Windows-1252 [such as ASCII].

-s
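[Editor's illustration, not part of the original thread: the 'safe subset' Steven mentions is the ASCII range, since bytes 0x00-0x7F encode the same characters in ASCII, UTF-8, and Windows-1252. A hypothetical sketch of that check:]

```python
# Text confined to the ASCII range has identical byte representations in
# UTF-8 and Windows-1252, so it survives the encoding confusion unchanged.
def is_encoding_safe(text: str) -> bool:
    """True if `text` encodes to the same bytes in UTF-8 and cp1252 (i.e. is ASCII)."""
    return all(ord(ch) < 0x80 for ch in text)

print(is_encoding_safe("lower-case"))        # True  -- plain ASCII hyphen-minus
print(is_encoding_safe("lower\u00adcase"))   # False -- soft hyphen is outside ASCII
```

The two strings above look identical on screen, which is precisely why the problem was "hard to tell visually".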
utf-8 != latin-1
Here's a gotcha story ..

Someone was working on documentation files in XML. The PDF generator all of a sudden started choking, complaining that there was "Illegal character U+DC73" somewhere in the late stages of PDF generation. Well, the low surrogate certainly didn't belong there. Software bug? Memory corruption?

I converted the 1.1 MB intermediate file into literal \u notation and searched for DC73. Sure enough, there was lower\uE54E\uDC73e (U+E54E, a PUA character, and U+DC73) .. in place of what was "lower-case" in the source text. Definitely memory corruption.. But wait..

On a hunch, I deleted the hyphen and retyped it, which somehow fixed it. I was told that the text "lower-case" was copied from another document. Further inspection showed that the offending hyphen was actually \xAD, "soft hyphen". Since the XML document had no encoding tag, it defaults to... UTF-8! What happened was that the sequence AD 63 61 73 was interpreted as U+E54E U+DC73..

So the moral: BE CAREFUL when you are pasting text into UTF-8 documents..

-steven
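[Editor's illustration, not part of the original thread: the debugging step of dumping the file in \u notation and hunting for impossible code points can be sketched as a small scanner. The function name and thresholds are hypothetical, chosen for this example; lone surrogates (U+D800-DFFF) and private-use characters (U+E000-F8FF) are both red flags in document text.]

```python
# Scan text for code points that usually signal corruption: lone
# surrogates and Private Use Area characters, like the U+E54E U+DC73
# pair that replaced "-cas" in the story above.
def find_suspects(text: str):
    """Yield (index, 'U+XXXX') for surrogate and PUA code points."""
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF or 0xE000 <= cp <= 0xF8FF:
            yield i, f"U+{cp:04X}"

corrupted = "lower\ue54e\udc73e"        # the string found in the intermediate file
print(list(find_suspects(corrupted)))   # [(5, 'U+E54E'), (6, 'U+DC73')]
```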
Re: utf-8 != latin-1
"Steven R. Loomis" [EMAIL PROTECTED] wrote:

> What happened was that the sequence AD 63 61 73 was interpreted as
> U+E54E U+DC73..

Why? As an illegal UTF-8 sequence, it shouldn't be interpreted as anything. John Cowan's "utf" perl script (which carries the appropriate disclaimers about no error checking) converts that sequence to U+D94E U+DC73, which seems a bit more reasonable -- at least it's a complete surrogate pair.

-Doug Ewell
 Fullerton, California
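[Editor's illustration, not part of the original thread: Doug's point is that the byte sequence has no legitimate interpretation at all. A sketch contrasting a strict decoder with a lenient one, using Python purely as the demonstration language:]

```python
# 0xAD is a UTF-8 continuation byte with no lead byte before it, so the
# sequence AD 63 61 73 has no correct decoding -- a conforming decoder
# must reject it.
bad = b"\xad\x63\x61\x73"   # Latin-1 soft hyphen followed by "cas"

try:
    bad.decode("utf-8")     # strict by default
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

# The most a lenient decoder may do is substitute U+FFFD REPLACEMENT
# CHARACTER; producing code points like U+E54E or U+D94E, as the buggy
# tools in this thread did, amounts to inventing data.
print(bad.decode("utf-8", errors="replace"))   # '\ufffdcas'
```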