Re: utf-8 != latin-1

2000-10-17 Thread Mark Davis

One of the main features of XML is that it has quite strict rules about how
to handle errors. The goal, I believe, is to ensure that we are not awash in
malformed files that have no clear interpretation.

And this is clearly an error: the acceptable code points are quite clearly
stated:

http://www.w3.org/TR/2000/REC-xml-20001006#dt-character

Converting an illegal UTF-8 sequence into a valid -- BUT WRONG -- sequence of
code points is clearly against the intent of this production rule.
XML could have taken the opposite tack -- that illegal code points and
illegal code unit sequences are to be ignored. But it didn't.
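
A rough Python sketch of what that production boils down to (illustrative
only, not the spec's wording); note that a lone surrogate such as U+DC73
fails it outright, while a private-use character like U+E54E is actually
allowed:

    def is_xml_char(cp):
        # XML 1.0 Char production, as ranges of code points:
        # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    print(is_xml_char(0xDC73))   # False -- surrogates are excluded
    print(is_xml_char(0xE54E))   # True  -- private-use characters are fine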

Mark

BTW, I have a simple browser-based UTF converter (in Javascript) at
http://www.macchiato.com/unicode/charts.html (click on Converter). It lets
you convert back and forth to different UTFs, with various choices for
format. And it checks for illegal UTF-8 sequences!

- Original Message -
From: "Doug Ewell" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Friday, October 13, 2000 21:59
Subject: Re: utf-8 != latin-1


"Steven R. Loomis" [EMAIL PROTECTED] wrote:

 What happened was that the sequence  AD 63 61 73 was
 interpreted as U+E54E U+DC73..

Why?  As an illegal UTF-8 sequence, it shouldn't be interpreted as
anything.

John Cowan's "utf" perl script (which carries the appropriate
disclaimers about no error checking) converts that sequence to U+D94E
U+DC73, which seems a bit more reasonable -- at least it's a complete
surrogate pair.

-Doug Ewell
 Fullerton, California







Re: utf-8 != latin-1

2000-10-14 Thread George Zeigler

Hello,

  I didn't get it.  So what happens if a company had a job site in Unicode,
and people were copying resume text from Word, written in ISO 8859-1,
and pasting it into a text window in the browser?  Does the character set
automatically convert correctly?  Or does the user need to use a character-set
converter like Recode?

Thanks
George 

On Sat, 14 Oct 2000, you wrote:
 Here's a gotcha story ..
 
 Someone was working on documentation files in XML.  The PDF generator
 all of a sudden started choking, complaining that there was "Illegal
 character U+DC73" somewhere in the late stages of PDF generation. Well,
 the low surrogate certainly didn't belong there. Software bug? Memory
 corruption?
 
 I converted the 1.1mb intermediate file into literal \u notation and
 searched for DC73. Sure enough, there was lower\uE54E\uDC73e  (U+E54E, a
 PUA, and U+DC73) .. in place of what was "lower-case" in the source
 text. Definitely memory corruption.. But wait..
  On a hunch, I deleted the hyphen and replaced it, which worked somehow.
 I was told that the text "lower-case" was copied from another document.
 
 Further inspection showed that the offending hyphen was actually \xAD,
 "soft hyphen".  Since the XML document had no encoding tag, it defaults
 to .. UTF-8!  What happened was that the sequence  AD 63 61 73 was
 interpreted as U+E54E U+DC73.. 
 
 So moral: BE CAREFUL when you are pasting text into utf-8 documents..
 
 -steven




Re: utf-8 != latin-1

2000-10-14 Thread Steven R. Loomis

Doug Ewell wrote:
 Why?  As an illegal UTF-8 sequence, it shouldn't be interpreted as anything.

 It wasn't interpreted as anything. It halted processing at that point
in the text, as an error.

George Zeigler wrote:
   I didn't get it.  So what happens if a company had a job site in Unicode,
 and people were copying resume text from Word, written in ISO 8859-1,
 and pasting it into a text window in the browser?  Does the character set
 automatically convert correctly?  Or does the user need to use a character-set
 converter like Recode?

 It was pasted into Windows Notepad or some other editor editing an XML
file. XML files, unless otherwise tagged, are UTF-8, but the editor
thought it was something like Windows-1252. So the right thing to do
*might* be to tag the file as 'windows-1252'.  A better solution
would be to use UTF-8-aware editors only.
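
A minimal sketch of that first option (file name and content are made up,
just to show where the declaration goes):

    # Write the document with an explicit encoding declaration so a
    # windows-1252 paste is at least labelled as such.
    body = "<doc>lower\u00ADcase</doc>"   # U+00AD SOFT HYPHEN in the middle
    with open("doc.xml", "w", encoding="windows-1252") as f:
        f.write('<?xml version="1.0" encoding="windows-1252"?>\n')
        f.write(body)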

 My point is that it was hard to tell visually whether the data being
copied was a 'safe' subset of both utf-8 and windows-1252 [such as
ASCII].
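
One way to make that check mechanical (a rough sketch, not what was
actually used): the bytes that mean the same thing in windows-1252 and
in UTF-8 are exactly the 7-bit ASCII range.

    def is_paste_safe(data: bytes) -> bool:
        # True only if every byte is 7-bit ASCII, i.e. reads identically
        # whether the file is treated as windows-1252 or as UTF-8.
        return all(b < 0x80 for b in data)

    print(is_paste_safe(b"lower-case"))     # True  -- ordinary hyphen-minus
    print(is_paste_safe(b"lower\xadcase"))  # False -- soft hyphen, 0xAD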

-s



utf-8 != latin-1

2000-10-13 Thread Steven R. Loomis

Here's a gotcha story ..

Someone was working on documentation files in XML.  The PDF generator
all of a sudden started choking, complaining that there was "Illegal
character U+DC73" somewhere in the late stages of PDF generation. Well,
the low surrogate certainly didn't belong there. Software bug? Memory
corruption?

I converted the 1.1mb intermediate file into literal \u notation and
searched for DC73. Sure enough, there was lower\uE54E\uDC73e  (U+E54E, a
PUA, and U+DC73) .. in place of what was "lower-case" in the source
text. Definitely memory corruption.. But wait..
 On a hunch, I deleted the hyphen and replaced it, which worked somehow.
I was told that the text "lower-case" was copied from another document.

Further inspection showed that the offending hyphen was actually \xAD,
"soft hyphen".  Since the XML document had no encoding tag, it defaults
to .. UTF-8!  What happened was that the sequence  AD 63 61 73 was
interpreted as U+E54E U+DC73.. 
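
A small sketch of the mismatch (illustrative; this isn't the original
toolchain) -- the same lower-(soft hyphen)-case text as windows-1252
bytes versus UTF-8 bytes:

    text = "lower\u00ADcase"                  # U+00AD SOFT HYPHEN in the middle
    print(text.encode("windows-1252").hex())  # 6c6f776572ad63617365   (lone AD)
    print(text.encode("utf-8").hex())         # 6c6f776572c2ad63617365 (C2 AD)
    # A lone 0xAD can never start a legal UTF-8 sequence, so a strict
    # decoder has nothing sensible to make of "AD 63 61 73 65".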

So moral: BE CAREFUL when you are pasting text into utf-8 documents..

-steven



Re: utf-8 != latin-1

2000-10-13 Thread Doug Ewell

"Steven R. Loomis" [EMAIL PROTECTED] wrote:

 What happened was that the sequence  AD 63 61 73 was
 interpreted as U+E54E U+DC73.. 

Why?  As an illegal UTF-8 sequence, it shouldn't be interpreted as
anything.

John Cowan's "utf" perl script (which carries the appropriate
disclaimers about no error checking) converts that sequence to U+D94E
U+DC73, which seems a bit more reasonable -- at least it's a complete
surrogate pair.
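
For what it's worth, a rough sketch (not John's script) of the two
defensible behaviours -- refuse the sequence outright, or substitute a
replacement character -- using Python's decoder:

    data = bytes([0xAD, 0x63, 0x61, 0x73])   # the bytes from Steven's example
    try:
        data.decode("utf-8")                 # strict: 0xAD is not a legal lead byte
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)         # "invalid start byte"

    # A decoder that has to keep going substitutes U+FFFD rather than
    # inventing other code points:
    print(data.decode("utf-8", errors="replace"))   # '\ufffdcas'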

-Doug Ewell
 Fullerton, California