From: "Philippe Verdy" <[EMAIL PROTECTED]>
To: "Doug Ewell" <[EMAIL PROTECTED]>
Sent: Tuesday, October 12, 2004 8:24 PM
Subject: Re: UTF-8 stress test file?
From: "Doug Ewell" <[EMAIL PROTECTED]>
Theodore H. Smith <delete at elfdata dot com> wrote:
- the file mixes UTF-8 and UTF-16
Does this file mix UTF-8 and UTF-16? I thought it just had surrogates encoded into UTF-8? Of course a surrogate should never exist in UTF-8.
You are right. Philippe's statement was incorrect, and also puzzling.
What is much more puzzling is the text contained in the referenced file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Examples of bad assumptions that a reader could make:
- [quote](...) Experience so far suggests that most first-time authors of UTF-8 decoders find at least one serious problem in their decoder by using this file.[/quote]
This suggests to readers that if their browser or editor does not display the test text as indicated, there is a problem in that application. But given that the file does not conform to UTF-8 because of the "errors" it contains *on purpose*, no assumption should be made about how a browser or text editor will behave with the content of that file. Any difference from what is "expected" by the text is really not a bug, given that the whole file is incorrect and is *not* UTF-8 encoded. In fact, if your browser or editor still lets you view it as if it were UTF-8, and indicates that it is UTF-8 encoded without warning you about the encoding violations that should be detected, I really think that this browser or editor is not conforming. A conforming browser or editor should instead load that document, without encoding violations, by assuming it is encoded with ISO-8859-1 or ISO-8859-2 or any other complete 8-bit encoding (an encoding that has no invalid code position, so ISO-8859-4 should not work without similar warnings). The only thing that can be said is that the document respects only the ISO 10646-1:2000 standard, but not its later version and not Unicode (so a browser or editor could still accept the document as being encoded with UTF-8:2000, but not with UTF-8).
- [quote](...) All lines in this file are exactly 79 characters long (plus the line
feed). In addition, all lines end with "|", except for the two test
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
U+0000 and U+007F. If you display this file with a fixed-width font,
these "|" characters should all line up in column 79 (right margin).[/quote]
Nothing is wrong if lines are displayed with more or fewer characters, or if the "|" characters are not vertically aligned when using a fixed-width font.
- [quote] (...) 1 Some correct UTF-8 text
You should see the Greek word 'kosme': "κόσμε" (...) [/quote]
You can see the Greek word here in this message (because this message is properly UTF-8 encoded), but nothing is wrong with your editor or browser if the word is not readable as indicated and you see, for example, the string "ÃÂÃÂÂÃÆÃÂÃÂ" when your editor or browser loads the file as ISO-8859-1 text.
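To make this concrete, here is a minimal Python sketch of what happens when the bytes of that Greek word are read as ISO-8859-1 instead of UTF-8. The exact code points used for the word are my assumption about what the test file contains, and the variable names are only illustrative.

```python
# Assumption: the word 'kosme' in the test file is made of these five
# code points (U+03BA, U+1F79, U+03C3, U+03BC, U+03B5).
word = "\u03ba\u1f79\u03c3\u03bc\u03b5"

# The bytes as they would be stored in a UTF-8 file.
utf8_bytes = word.encode("utf-8")

# Reading the same bytes with two different encodings.
as_utf8 = utf8_bytes.decode("utf-8")         # the Greek word again
as_latin1 = utf8_bytes.decode("iso-8859-1")  # one character per byte

print(len(as_utf8), "characters when read as UTF-8")
print(len(as_latin1), "characters when read as ISO-8859-1")
print(repr(as_latin1))
```

Since ISO-8859-1 maps every byte to exactly one character, the decoding never fails, but the line becomes longer than the file's text claims, which is also why the "|" markers stop lining up.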
- All of section 3 "Malformed sequences" should not be readable at all, or could display random characters when the text is loaded as ISO-8859-1. Don't expect to see "?" even if Internet Explorer displays them without warning the user (this is a violation of the current UTF-8 encoding rules).
- Same thing for section 4 "Overlong sequences" (prohibited in UTF-8, but tolerated in UTF-8:2000, i.e. the RFC version used by ISO 10646:2000). If you see "?" characters without other warnings, your browser is not conforming, exactly like browsers that would display the indicated slash "/".
- Section 5 "Illegal code positions" (single and paired "UTF-16" surrogates) is the one that should immediately throw an exception in the browser's UTF-8 decoder, to force it to retry with another encoding (possibly with UTF-8:2000, or with ISO-8859-1). Nothing is wrong with your browser if you see sequences like "Ã â" or "ÃÂÂ" when the file is loaded as Windows-1252, or if lines do not line up or have a strange layout when the file is loaded as ISO-8859-1.
- Subsection 5.3 "Other illegal code positions" also forgets all the illegal *code points* (not "code positions"!) that are permanently reserved in the 16 other planes (outside the BMP), and the illegal positions found in the Arabic compatibility block.
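The decoder behaviour argued for in the points above can be sketched roughly as follows in Python: try strict UTF-8 first, and on any violation retry with ISO-8859-1. The function name `decode_with_fallback` and the sample byte strings are hypothetical, chosen only to resemble sections 3, 4 and 5 of the test file. Python's built-in strict `utf-8` codec stands in here for a conforming decoder; this is not a description of what any particular browser does.

```python
def decode_with_fallback(data: bytes):
    """Return (text, encoding_used).

    Strict UTF-8 is tried first; any encoding violation makes the
    decoder give up and reread the bytes as ISO-8859-1.
    """
    try:
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 assigns a character to every byte, so this cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


# Hypothetical samples in the spirit of the test file's sections.
samples = {
    "well-formed": b"\xce\xba",            # U+03BA in two bytes
    "lone continuation": b"\x80",          # malformed sequence
    "overlong slash": b"\xc0\xaf",         # U+002F stretched to two bytes
    "encoded surrogate": b"\xed\xa0\x80",  # U+D800 written as three bytes
}

results = {name: decode_with_fallback(raw)[1] for name, raw in samples.items()}

for name, encoding in results.items():
    print(f"{name}: read as {encoding}")
```

In this sketch only the well-formed sample is accepted as UTF-8; the three others fall back to ISO-8859-1 rather than being silently shown as "?" or "/". A real application would presumably also warn the user that the fallback happened.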
So who's puzzling here? Not me! It's the content of the text itself.

