[whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer

2014-06-08 Thread Geoffrey Sneddon
It would aid programmatic conversion of the spec, and make the spec less
confusing to read (thereby avoiding bugs like 25871), if these states
matched the model of the rest of the tokenizer.

Thus I propose the bogus comment state becomes:

 Consume the next input character:
 
 U+003E GREATER-THAN SIGN (>):
 
 Switch to the data state. Emit the comment token.
 
 U+0000 NULL:
 
 Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
 
 EOF:
 
 Switch to the data state. Emit the comment token. Reconsume the EOF character.
 
 Anything else:
 
 Append the current input character to the comment token's data.

This also necessitates creating a new comment token prior to entering
the bogus comment state.
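
As a rough illustration of the state above, here is a minimal Python sketch.
The function name, the (data, position, hit_eof) return value, and the use of
a plain string for the comment token's data are all assumptions made for the
example; they come from neither the spec nor any real tokenizer.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def bogus_comment_state(text, pos, comment_data=""):
    """Run the proposed bogus comment state over text, starting at pos.

    comment_data stands in for the comment token that the caller is assumed
    to have created before entering this state.
    Returns (comment token data, next position, whether EOF was hit).
    """
    data = list(comment_data)
    while True:
        if pos >= len(text):
            # EOF: emit the comment token; pos is left at the end so the
            # data state can reconsume the EOF.
            return "".join(data), pos, True
        ch = text[pos]
        pos += 1  # consume the next input character
        if ch == ">":
            # U+003E: switch to the data state and emit the comment token.
            return "".join(data), pos, False
        if ch == "\0":
            # U+0000: append U+FFFD instead of the NULL character.
            data.append(REPLACEMENT_CHARACTER)
        else:
            # Anything else: append the current input character.
            data.append(ch)
```

Every branch handles exactly one consumed character, which is the shape the
other tokenizer states already have.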

The CDATA section state should become:

 Consume the next input character:
 
 U+005D RIGHT SQUARE BRACKET (]):
 
 If the three characters starting from the current input character are U+005D 
 RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN 
 (]]>), then consume those characters and switch to the data state. Otherwise, 
 emit the current input character as a character token.
 
 EOF:
 
 Switch to the data state. Reconsume the EOF character.
 
 Anything else:
 
 Emit the current input character as a character token.

No changes are needed elsewhere for this. (There is no consistent style
for lookahead — and most cases are ASCII case-insensitive words — so I
went with what seems sane here!)
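
For what it's worth, the lookahead version of the CDATA section state could
be sketched like this in Python. The function name and return value are
hypothetical, and the sketch assumes the "anything else" branch emits the
character as a character token, as the ] branch does.

```python
def cdata_section_state(text, pos):
    """Run the proposed CDATA section state over text, starting at pos.

    Returns (characters emitted as character tokens, next position).
    """
    emitted = []
    while pos < len(text):
        ch = text[pos]
        if ch == "]" and text.startswith("]]>", pos):
            # The three characters starting at the current input character
            # are "]]>": consume them and switch to the data state.
            return "".join(emitted), pos + 3
        # A lone "]" or anything else: emit as a character token.
        emitted.append(ch)
        pos += 1
    # EOF: switch to the data state; pos stays at the end for reconsuming.
    return "".join(emitted), pos
```

For example, cdata_section_state("a]b]]>c", 0) returns ("a]b", 6).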

/Geoffrey


Re: [whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer

2014-06-08 Thread Adam Barth
In Blink's implementation, we actually use two additional tokenizer
states for CDATA:

CDATASectionRightSquareBracketState,
CDATASectionDoubleRightSquareBracketState,
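
A lookahead-free version with two extra states might look roughly like the
Python sketch below. Only the two state names come from this message; the
transitions, the enum, and the function are assumptions made for
illustration, not Blink's actual code.

```python
from enum import Enum, auto


class State(Enum):
    CDATA_SECTION = auto()
    CDATA_SECTION_RIGHT_SQUARE_BRACKET = auto()
    CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET = auto()


def cdata_section_with_states(text, pos):
    """Scan a CDATA section one character at a time, with no lookahead.

    Returns (characters emitted as character tokens, next position).
    """
    state = State.CDATA_SECTION
    emitted = []
    while pos < len(text):
        ch = text[pos]
        if state is State.CDATA_SECTION:
            if ch == "]":
                state = State.CDATA_SECTION_RIGHT_SQUARE_BRACKET
            else:
                emitted.append(ch)
            pos += 1
        elif state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
            if ch == "]":
                state = State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
                pos += 1
            else:
                # The pending "]" was ordinary text; reconsume ch.
                emitted.append("]")
                state = State.CDATA_SECTION
        else:  # CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET
            if ch == ">":
                # Saw "]]>": the section ends; switch to the data state.
                return "".join(emitted), pos + 1
            if ch == "]":
                # "]]]": the oldest bracket is text, two remain pending.
                emitted.append("]")
                pos += 1
            else:
                # Both pending brackets were ordinary text; reconsume ch.
                emitted.append("]]")
                state = State.CDATA_SECTION
    # EOF: flush any brackets that were still pending.
    if state is State.CDATA_SECTION_RIGHT_SQUARE_BRACKET:
        emitted.append("]")
    elif state is State.CDATA_SECTION_DOUBLE_RIGHT_SQUARE_BRACKET:
        emitted.append("]]")
    return "".join(emitted), pos
```

The trade-off is two more states in exchange for never looking past the
current input character.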

Adam

