On 06/09/2013 04:05, Kang-Hao (Kenny) Lu wrote:
> (2013/09/06 6:08), Geoffrey Sneddon wrote:
>> The phrasing content section states:
>>
>>   Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.
>>
>> And the pre-processing the input-stream section states:
>>
>>   Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
>>
>> Note the first uses "Unicode characters", the second "characters"; the former excludes surrogates as a conformance requirement. Note that every disallowed non-surrogate character is a parse error.
>
> Except U+0000, or am I missing something?
U+0000 is handled inline in the parser, as noted in the preprocessing section. It sometimes gets passed through as U+0000, sometimes gets changed to U+FFFD, and sometimes gets ignored, but it always creates a parse error.
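To make the quoted list concrete, here is a minimal sketch of a predicate for the code points that the preprocessing section calls parse errors. The function name isPreprocessingParseError is made up for illustration and is not anything from the spec; note that it returns false for both U+0000 (handled inline in the parser instead) and surrogates (not in the list at all).

```javascript
// Sketch only: isPreprocessingParseError is a hypothetical helper name.
// It mirrors the list quoted from the preprocessing section, nothing more.
function isPreprocessingParseError(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + cp);
  }
  return (
    (cp >= 0x0001 && cp <= 0x0008) ||
    cp === 0x000B ||
    (cp >= 0x000E && cp <= 0x001F) ||
    (cp >= 0x007F && cp <= 0x009F) ||
    (cp >= 0xFDD0 && cp <= 0xFDEF) ||
    // U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF: the last two code points
    // of every plane all have their low 16 bits equal to FFFE or FFFF.
    (cp & 0xFFFE) === 0xFFFE
  );
}

console.log(isPreprocessingParseError(0x000B));  // true
console.log(isPreprocessingParseError(0x1FFFE)); // true
console.log(isPreprocessingParseError(0x0000));  // false (handled in the parser)
console.log(isPreprocessingParseError(0xD800));  // false (surrogate, not listed)
```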
>> Therefore, it would make sense to make surrogates parse errors. It should be noted that they can only occur in the input stream if they come from script (as they cannot be decoded from the input byte stream as the decoders will never emit a surrogate).
>
> which means that this seems ... cumbersome ... to implement in a conformance checker. Which reminds me, does
>
> # Conformance checkers must report at least one parse error
> # condition to the user if one or more parse error conditions exist
> # in the document and must not report parse error conditions if none
> # exist in the document. Conformance checkers may report more than
> # one parse error condition if more than one parse error condition
> # exists in the document.
>
> mean validator.nu and Firefox view source are non-conforming because they do nothing about document.write()? I think we should exempt conformance checkers from scripts instead.
They already are. From the "Conformance classes" section:
Conformance checkers must check that the input document conforms when parsed without a browsing context (meaning that no scripts are run, and that the parser's scripting flag is disabled), and should also check that the input document conforms when parsed with a browsing context in which scripts execute, and that the scripts never cause non-conforming states to occur other than transiently during script execution itself. (This is only a "SHOULD" and not a "MUST" requirement because it has been proven to be impossible. [COMPUTABLE])
(I feel like being pedantic and pointing out that this is untrue: it has not been proven impossible to do, it has been proven impossible to do in general. It wouldn't be that hard to design a conformance checker to check "<html><script>document.write("<p>")</script>".)
On the other hand, a JS console can reasonably report parse errors from script, so the parse errors are still worthwhile to have.
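As a rough illustration of what a script-side check might look like, here is a sketch that finds lone surrogates in a string before it would be handed to something like document.write(). The helper name findLoneSurrogates is invented for this example and is not how any particular console or parser does it. The last lines assume a runtime that provides TextDecoder, and show that a surrogate encoded as bytes does not survive decoding.

```javascript
// Sketch only: findLoneSurrogates is a hypothetical helper name.
// Returns the indices of UTF-16 code units that are unpaired surrogates.
function findLoneSurrogates(str) {
  const indices = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh) {
      const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair: skip the low half as well
        continue;
      }
      indices.push(i); // high surrogate with no low surrogate after it
    } else if (isLow) {
      indices.push(i); // low surrogate with no high surrogate before it
    }
  }
  return indices;
}

console.log(findLoneSurrogates("a\uD800b"));     // [ 1 ]
console.log(findLoneSurrogates("\uD83D\uDE00")); // []

// Assumption: TextDecoder is available. The bytes ED A0 80 would be U+D800
// if UTF-8 allowed surrogates; the decoder emits U+FFFD instead.
const decoded = new TextDecoder("utf-8").decode(new Uint8Array([0xED, 0xA0, 0x80]));
console.log(findLoneSurrogates(decoded));        // []
```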
/Geoffrey.