[whatwg] Valid Unicode

Elliotte Harold Fri, 01 Dec 2006 04:39:03 -0800

In 9.1.3 we see:

Text must consist of valid Unicode characters other than U+0000. Text should not contain control characters other than space characters.



Later in 9.2.3.1 we find:

If the number is not a valid Unicode character (e.g. if the number is higher than 1114111), or if the number is zero, then return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.
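
Read literally, the only two conditions this rule spells out are zero and numbers higher than 1114111 (0x10FFFF). A minimal sketch under that narrow reading follows; the function name char_for_reference is hypothetical, and treating every other number in range as acceptable is an assumption of the sketch, not something the draft states.

```python
MAX_CODE_POINT = 0x10FFFF  # 1114111, the highest Unicode code point
REPLACEMENT_CHARACTER = "\uFFFD"


def char_for_reference(number: int) -> str:
    """Map the number from a numeric character reference to a character.

    Assumption: only the two cases named in the quoted rule (zero, or a
    value above 1114111) are rejected. Negative input is rejected too,
    purely as a defensive measure in this sketch.
    """
    if number <= 0 or number > MAX_CODE_POINT:
        return REPLACEMENT_CHARACTER
    # Everything else passes through unchanged, including surrogates,
    # noncharacters and private use code points.
    return chr(number)
```

Under this reading, a reference such as &#xD800; would pass straight through as a lone surrogate, which is probably not what anyone intends.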

I do not think the Unicode spec defines the notion of a "valid Unicode character". (It does define a valid Unicode code unit sequence, but that's a little different. A code unit sequence generally consists of more than one character.) Thus I suggest we need to be more precise here about what is and is not a valid Unicode character. In particular:



1. Are private use characters allowed?
2. Are control characters allowed? (probably yes, based on other parts of the spec)
3. Are surrogate characters allowed? (probably no)
4. Are non-characters beyond 10FFFF allowed? (no)
5. Are reserved but currently undefined characters allowed? (yes)
6. Are noncharacters U+FDD0..U+FDEF allowed? (?)
7. Are the noncharacters from the last two characters of each plane allowed? (?)
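
Each of the seven categories above can be told apart mechanically, so whichever answers are chosen could be stated in terms of code point ranges. The following is only a sketch of that classification; the function name classify and the label strings are hypothetical. It relies on Python's standard unicodedata module, so the "unassigned" and "private use" answers depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def classify(cp: int) -> str:
    """Place a number into one of the categories asked about above."""
    if cp < 0 or cp > 0x10FFFF:
        return "out of range"        # question 4
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"           # question 3
    if 0xFDD0 <= cp <= 0xFDEF:
        return "noncharacter"        # question 6
    if (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"        # question 7: U+xxFFFE and U+xxFFFF
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"         # question 1
    if category == "Cc":
        return "control"             # question 2
    if category == "Cn":
        return "unassigned"          # question 5
    return "assigned"


# Example: list the category for one sample from each question.
for sample in (0xE000, 0x0001, 0xD800, 0x110000, 0x40000, 0xFDD0, 0x1FFFF):
    print(f"{sample:#08x} {classify(sample)}")
```

The noncharacter checks come before the general category lookup because noncharacters share the general category Cn with unassigned code points.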



--
Elliotte Rusty Harold  [EMAIL PROTECTED]
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
