Re: [whatwg] Parse errors for invalid characters

Geoffrey Sneddon Sat, 07 Sep 2013 04:01:36 -0700

On 06/09/2013 04:05, Kang-Hao (Kenny) Lu wrote:

(2013/09/06 6:08), Geoffrey Sneddon wrote:

The phrasing content section states:

Text nodes and attribute values must consist of Unicode characters,
must not contain U+0000 characters, must not contain permanently
undefined Unicode characters (noncharacters), and must not contain
control characters other than space characters. This specification
includes extra constraints on the exact value of Text nodes and
attribute values depending on their precise context.


And the pre-processing the input-stream section states:

Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).
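
For reference, the set quoted above is small enough to test by code point. The following is only a sketch of that list in Python; the function name is hypothetical and not taken from the spec. It deliberately says "no" for U+0000 (handled separately by the parser) and for surrogates (not in the quoted list, which is the gap under discussion).

```python
def is_input_stream_parse_error(cp: int) -> bool:
    """Return True if code point `cp` is in the quoted pre-processing list.

    Hypothetical helper: it mirrors only the ranges quoted above.
    U+0000 and the surrogates U+D800..U+DFFF are not in that list,
    so this returns False for them.
    """
    # Control characters: U+0001..U+0008 and U+000E..U+001F
    if 0x0001 <= cp <= 0x0008 or 0x000E <= cp <= 0x001F:
        return True
    # U+007F..U+009F, and the noncharacter block U+FDD0..U+FDEF
    if 0x007F <= cp <= 0x009F or 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+000B (line tabulation) is listed on its own
    if cp == 0x000B:
        return True
    # U+nFFFE and U+nFFFF for every plane n from 0 to 16
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Tab, LF, FF, CR and space are the allowed "space characters".
allowed = [cp for cp in (0x09, 0x0A, 0x0C, 0x0D, 0x20)
           if not is_input_stream_parse_error(cp)]
```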


Note the first uses "Unicode characters", the second "characters" — the
former excludes surrogates as a conformance requirement.

Note that every disallowed non-surrogate character is a parse error.


Except U+0000 or am I missing something?

This is handled inline in the parser, as noted in the preprocessing
section. It sometimes gets passed through as U+0000, sometimes gets
changed to U+FFFD, sometimes gets ignored, but always creates a parse
error.

Therefore, it would make sense to make surrogates parse errors.

It should be noted that they can only occur in the input stream if they
come from script (as they cannot be decoded from the input byte stream
as the decoders will never emit a surrogate).
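
As a rough illustration of that distinction, here is a Python sketch. It uses Python's own UTF-8 decoder as a stand-in, which is an assumption for illustration only and is not the decoder the spec refers to; the helper name is likewise hypothetical. The point is just that a string built in code can carry a surrogate code point, while text that came out of a decoder does not.

```python
def find_surrogates(text: str) -> list[int]:
    """Return the indices of surrogate code points (U+D800..U+DFFF).

    Hypothetical helper for a checker that wants to flag surrogates.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


# A string constructed by script can contain a lone surrogate.
from_script = "a\ud800b"

# The bytes ED A0 80 are what U+D800 would look like if UTF-8 allowed
# it. Python's decoder rejects them; with errors="replace" it emits
# replacement characters rather than a surrogate.
from_bytes = b"a\xed\xa0\x80b".decode("utf-8", errors="replace")

surrogates_from_script = find_surrogates(from_script)
surrogates_from_bytes = find_surrogates(from_bytes)
```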


which means that this seems ... cumbersome ... to implement in a
conformance checker. Which reminds me, does

    # Conformance checkers must report at least one parse error
    # condition to the user if one or more parse error conditions exist
    # in the document and must not report parse error conditions if none
    # exist in the document. Conformance checkers may report more than
    # one parse error condition if more than one parse error condition
    # exists in the document.

mean validator.nu and Firefox view source are non-conforming because
they do nothing about document.write() ?

I think we should exempt conformance checkers from scripts instead.


They already are. From the "Conformance classes" section:

Conformance checkers must check that the input document conforms when parsed without a browsing 
context (meaning that no scripts are run, and that the parser's scripting flag is disabled), and 
should also check that the input document conforms when parsed with a browsing context in which 
scripts execute, and that the scripts never cause non-conforming states to occur other than 
transiently during script execution itself. (This is only a "SHOULD" and not a 
"MUST" requirement because it has been proven to be impossible. [COMPUTABLE])

(I feel like pedanting and pointing out this is untrue: it has not been
proven impossible to do, it has been proven impossible to do in general.
It wouldn't be that hard to design a conformance checker to check
"<html><script>document.write("<p>")</script>".)

On the other hand, a JS console can reasonably report parse errors from
script, so the parse errors are still worthwhile to have.


/Geoffrey.
