Re: [whatwg] Parse errors for invalid characters

2013-11-26 Thread Ian Hickson

On Sat, 7 Sep 2013, Geoffrey Sneddon wrote:
  
  [...] this seems ... cumbersome ... to implement in a conformance
  checker. Which reminds me, does
  
  # Conformance checkers must report at least one parse error
  # condition to the user if one or more parse error conditions exist
  # in the document and must not report parse error conditions if none
  # exist in the document. Conformance checkers may report more than
  # one parse error condition if more than one parse error condition
  # exists in the document.
  
  mean validator.nu and Firefox view source are non-conforming because
  they do nothing about document.write() ?
  
  I think we should exempt conformance checkers from scripts instead.
 
 They already are. From the Conformance classes section:
 
  Conformance checkers must check that the input document conforms when parsed
  without a browsing context (meaning that no scripts are run, and that the
  parser's scripting flag is disabled), and should also check that the input
  document conforms when parsed with a browsing context in which scripts
  execute, and that the scripts never cause non-conforming states to occur
  other than transiently during script execution itself. (This is only a
  SHOULD and not a MUST requirement because it has been proven to be
  impossible. [COMPUTABLE])

Right.


 (I feel like pedanting and pointing out this is untrue — it has not been 
 proven impossible to do, it has been proven impossible to do in general. 

I'm not sure what the distinction is here.


 It wouldn't be that hard to design a conformance checker to check 
 <html><script>document.write("<p>")</script>.)

It wouldn't be very useful to have a conformance checker only check that 
literal string, and as soon as you start allowing more things, the 
complexity becomes astronomically high very quickly.

But I'm all in favour of conformance checkers checking these things as 
much as possible.


 On the other hand, a JS console can reasonably report parse errors from 
 script, so the parse errors are still worthwhile to have.

Right.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Parse errors for invalid characters

2013-09-13 Thread Ian Hickson
On Thu, 5 Sep 2013, Geoffrey Sneddon wrote:

 The phrasing content section states:
 
  Text nodes and attribute values must consist of Unicode characters, 
  must not contain U+0000 characters, must not contain permanently
  undefined Unicode characters (noncharacters), and must not contain 
  control characters other than space characters.
 
 And the pre-processing the input-stream section states:
 
  Any occurrences of any characters in the ranges U+0001 to U+0008, 
  U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters 
  U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
  U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
  U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
  U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
  U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
  errors. These are all control characters or permanently undefined 
  Unicode characters (noncharacters).
 
 Note the first uses "Unicode characters", the second "characters"; the
 former excludes surrogates as a conformance requirement.
 
 Note that every disallowed non-surrogate character is a parse error.

 Therefore, it would make sense to make surrogates parse errors.

Done.
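
For illustration, the set of characters quoted above, extended with the surrogates, can be written as a small predicate. This is a minimal sketch in Python and not code from the specification or from any parser; the names PARSE_ERROR_RANGES and is_parse_error_code_point are hypothetical. U+0000 is deliberately left out because, as noted elsewhere in this thread, the parser handles it inline.

```python
# Hypothetical helper; a sketch of the list quoted above, not spec text.
PARSE_ERROR_RANGES = (
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
    (0xD800, 0xDFFF),  # surrogates, the addition discussed in this thread
)


def is_parse_error_code_point(cp: int) -> bool:
    """Return True if the code point is in the preprocessing parse-error set."""
    if cp == 0x000B:
        return True
    # U+FFFE, U+FFFF and the last two code points of every later plane.
    if (cp & 0xFFFE) == 0xFFFE and cp <= 0x10FFFF:
        return True
    return any(lo <= cp <= hi for lo, hi in PARSE_ERROR_RANGES)


def offending_code_points(text: str) -> list:
    """List the code points in text that the predicate flags."""
    return [ord(ch) for ch in text if is_parse_error_code_point(ord(ch))]
```

Calling offending_code_points("a\ud800b\x0b") flags the lone surrogate and the U+000B, while tab, line feed, form feed and carriage return pass.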


 It should be noted that they can only occur in the input stream if they 
 come from script (as they cannot be decoded from the input byte stream 
 as the decoders will never emit a surrogate).

Done.
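
The point about decoders can be seen outside a browser too. The sketch below uses Python's built-in UTF-8 decoder as a stand-in for the decoders the thread refers to (an assumption made for illustration; they are not the same implementation). The helper name has_surrogate and the sample values are hypothetical. Bytes that would spell a lone surrogate are rejected or replaced by the decoder, whereas a string built directly in code, the analogue of script input, can carry one.

```python
def has_surrogate(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Bytes that would encode the lone surrogate U+D800 if UTF-8 permitted it.
raw = b"a\xed\xa0\x80b"

# Decoding never yields the surrogate; the invalid bytes become U+FFFD.
decoded = raw.decode("utf-8", errors="replace")

# A string constructed in code can contain a lone surrogate directly.
from_script = "a\ud800b"

print(has_surrogate(decoded), has_surrogate(from_script))
```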

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Parse errors for invalid characters

2013-09-07 Thread Geoffrey Sneddon

On 06/09/2013 04:05, Kang-Hao (Kenny) Lu wrote:

(2013/09/06 6:08), Geoffrey Sneddon wrote:

The phrasing content section states:


Text nodes and attribute values must consist of Unicode characters,
must not contain U+0000 characters, must not contain permanently
undefined Unicode characters (noncharacters), and must not contain
control characters other than space characters. This specification
includes extra constraints on the exact value of Text nodes and
attribute values depending on their precise context.


And the pre-processing the input-stream section states:


Any occurrences of any characters in the ranges U+0001 to U+0008,
U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
errors. These are all control characters or permanently undefined
Unicode characters (noncharacters).


Note the first uses "Unicode characters", the second "characters"; the
former excludes surrogates as a conformance requirement.

Note that every disallowed non-surrogate character is a parse error.


Except U+0000 or am I missing something?


This is handled inline in the parser, as noted in the preprocessing
section. It sometimes gets passed through as U+0000, sometimes gets
changed to U+FFFD, sometimes gets ignored, but always creates a parse
error.
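
As a rough illustration of that reply, the three outcomes for U+0000 can be modelled as a lookup that always reports a parse error. This is a simplified sketch, not the parser's algorithm: the context labels "data", "rcdata" and "in_body" are hypothetical shorthand for the tokenizer's data state, its RCDATA state, and the "in body" stage of tree construction, and which outcome belongs to which context is an assumption based on my reading of the parsing algorithm, not something stated in this thread.

```python
from enum import Enum
from typing import Tuple


class NullOutcome(Enum):
    PASSED_THROUGH = "emitted unchanged as U+0000"
    REPLACED = "replaced with U+FFFD"
    IGNORED = "dropped from the output"


# Hypothetical, simplified context labels; a real parser has many more states.
NULL_OUTCOMES = {
    "data": NullOutcome.PASSED_THROUGH,
    "rcdata": NullOutcome.REPLACED,
    "in_body": NullOutcome.IGNORED,
}


def handle_null(context: str) -> Tuple[NullOutcome, bool]:
    """Return (outcome, is_parse_error) for a U+0000 seen in the given context."""
    # The outcome depends on the context; the parse error does not.
    return NULL_OUTCOMES[context], True
```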



Therefore, it would make sense to make surrogates parse errors.

It should be noted that they can only occur in the input stream if they
come from script (as they cannot be decoded from the input byte stream
as the decoders will never emit a surrogate).


which means that this seems ... cumbersome ... to implement in a
conformance checker. Which reminds me, does

# Conformance checkers must report at least one parse error
# condition to the user if one or more parse error conditions exist
# in the document and must not report parse error conditions if none
# exist in the document. Conformance checkers may report more than
# one parse error condition if more than one parse error condition
# exists in the document.

mean validator.nu and Firefox view source are non-conforming because
they do nothing about document.write() ?

I think we should exempt conformance checkers from scripts instead.


They already are. From the Conformance classes section:


Conformance checkers must check that the input document conforms when parsed without a browsing 
context (meaning that no scripts are run, and that the parser's scripting flag is disabled), and 
should also check that the input document conforms when parsed with a browsing context in which 
scripts execute, and that the scripts never cause non-conforming states to occur other than 
transiently during script execution itself. (This is only a SHOULD and not a 
MUST requirement because it has been proven to be impossible. [COMPUTABLE])


(I feel like pedanting and pointing out this is untrue — it has not been 
proven impossible to do, it has been proven impossible to do in general. 
It wouldn't be that hard to design a conformance checker to check 
<html><script>document.write("<p>")</script>.)


On the other hand, a JS console can reasonably report parse errors from 
script, so the parse errors are still worthwhile to have.


/Geoffrey.


Re: [whatwg] Parse errors for invalid characters

2013-09-05 Thread Kang-Hao (Kenny) Lu
(2013/09/06 6:08), Geoffrey Sneddon wrote:
 The phrasing content section states:
 
 Text nodes and attribute values must consist of Unicode characters,
 must not contain U+0000 characters, must not contain permanently
 undefined Unicode characters (noncharacters), and must not contain
 control characters other than space characters. This specification
 includes extra constraints on the exact value of Text nodes and
 attribute values depending on their precise context.
 
 And the pre-processing the input-stream section states:
 
 Any occurrences of any characters in the ranges U+0001 to U+0008,
 U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters
 U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE,
 U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF,
 U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE,
 U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF,
 U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse
 errors. These are all control characters or permanently undefined
 Unicode characters (noncharacters).
 
 Note the first uses "Unicode characters", the second "characters"; the
 former excludes surrogates as a conformance requirement.
 
 Note that every disallowed non-surrogate character is a parse error.

Except U+0000 or am I missing something?

 Therefore, it would make sense to make surrogates parse errors.
 
 It should be noted that they can only occur in the input stream if they
 come from script (as they cannot be decoded from the input byte stream
 as the decoders will never emit a surrogate).

which means that this seems ... cumbersome ... to implement in a
conformance checker. Which reminds me, does

   # Conformance checkers must report at least one parse error
   # condition to the user if one or more parse error conditions exist
   # in the document and must not report parse error conditions if none
   # exist in the document. Conformance checkers may report more than
   # one parse error condition if more than one parse error condition
   # exists in the document.

mean validator.nu and Firefox view source are non-conforming because
they do nothing about document.write() ?

I think we should exempt conformance checkers from scripts instead.


Cheers,
Kenny
-- 
Web Specialist, Opera Sphinx Game Force, Oupeng Browser, Beijing
Try Oupeng: http://www.oupeng.com/