Re: [whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer

2014-06-08 Thread Adam Barth
In Blink's implementation, we actually use two additional tokenizer
states for CDATA:

CDATASectionRightSquareBracketState,
CDATASectionDoubleRightSquareBracketState,
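
A minimal sketch of how two extra states can replace lookahead when scanning for the closing "]]>". This is not Blink's code: only the two state names above come from the message, and the State enum, scan_cdata function and token representation are hypothetical names invented for illustration.

```python
from enum import Enum, auto


class State(Enum):
    # Names loosely mirror the states mentioned above; the enum itself is hypothetical.
    CDATA_SECTION = auto()
    RIGHT_SQUARE_BRACKET = auto()
    DOUBLE_RIGHT_SQUARE_BRACKET = auto()


def scan_cdata(chars, pos=0):
    """Scan CDATA content starting at chars[pos] without any lookahead.

    Returns (emitted_characters, position_after_section).
    """
    out = []
    state = State.CDATA_SECTION
    while pos < len(chars):
        c = chars[pos]
        if state is State.CDATA_SECTION:
            if c == "]":
                state = State.RIGHT_SQUARE_BRACKET
            else:
                out.append(c)
            pos += 1
        elif state is State.RIGHT_SQUARE_BRACKET:
            if c == "]":
                state = State.DOUBLE_RIGHT_SQUARE_BRACKET
                pos += 1
            else:
                # The pending "]" was ordinary text; reconsume c in the CDATA state.
                out.append("]")
                state = State.CDATA_SECTION
        else:  # State.DOUBLE_RIGHT_SQUARE_BRACKET
            if c == ">":
                return out, pos + 1  # "]]>" seen: the section is closed
            if c == "]":
                # "]]]": the oldest bracket is text, two are still pending.
                out.append("]")
                pos += 1
            else:
                # Both pending brackets were text; reconsume c in the CDATA state.
                out.extend("]]")
                state = State.CDATA_SECTION
    # EOF: flush any brackets that were held back.
    if state is State.RIGHT_SQUARE_BRACKET:
        out.append("]")
    elif state is State.DOUBLE_RIGHT_SQUARE_BRACKET:
        out.extend("]]")
    return out, pos
```

The trade-off is that each state only ever inspects the current input character, at the cost of holding back up to two brackets until it is known whether they belong to the terminator.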

Adam


On Sun, Jun 8, 2014 at 6:24 PM, Geoffrey Sneddon wrote:
> It would aid programmatic conversion of the spec, and confuse me less
> when reading the spec (thereby avoiding bugs like 25871), if these states
> matched the model of the rest of the tokenizer.
>
> Thus I propose the bogus comment state becomes:
>
>> Consume the next input character:
>>
>> U+003E GREATER-THAN SIGN (>):
>>
>> Switch to the data state. Emit the comment token.
>>
>> U+0000 NULL:
>>
>> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
>>
>> EOF:
>>
>> Switch to the data state. Emit the comment token. Reconsume the EOF 
>> character.
>>
>> Anything else:
>>
>> Append the current input character to the comment token's data.
>
> This also necessitates creating a new comment token prior to entering
> the bogus comment state.
>
> The CDATA section state should become:
>
>> Consume the next input character:
>>
>> U+005D RIGHT SQUARE BRACKET (]):
>>
>> If the three characters starting from the current input character are U+005D 
>> RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN 
>> (]]>), then consume those characters and switch to the data state. 
>> Otherwise, emit the current input character as a character token.
>>
>> EOF:
>>
>> Switch to the data state. Reconsume the EOF character.
>>
>> Anything else:
>>
>> Append the current input character to the comment token's data.
>
> No changes are needed elsewhere for this. (There is no consistent style
> for lookahead — and most cases are ASCII case-insensitive words — so I
> went with what seems sane here!)
>
> /Geoffrey


[whatwg] Bogus comment state and CDATA section state do not stylistically fit in the tokenizer

2014-06-08 Thread Geoffrey Sneddon
It would aid programmatic conversion of the spec, and confuse me less
when reading the spec (thereby avoiding bugs like 25871), if these states
matched the model of the rest of the tokenizer.

Thus I propose the bogus comment state becomes:

> Consume the next input character:
> 
> U+003E GREATER-THAN SIGN (>):
> 
> Switch to the data state. Emit the comment token.
> 
> U+0000 NULL:
> 
> Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
> 
> EOF:
> 
> Switch to the data state. Emit the comment token. Reconsume the EOF character.
> 
> Anything else:
> 
> Append the current input character to the comment token's data.

This also necessitates creating a new comment token prior to entering
the bogus comment state.
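
As a rough illustration, the proposed bogus comment state could be sketched as a per-character loop like the one below. The function name, the EOF sentinel and the tuple used for the token are all hypothetical, and the caller is assumed to have created the comment token before entering the state, as noted above.

```python
EOF = None  # hypothetical sentinel standing in for the end-of-file "character"


def bogus_comment_state(chars, pos, comment):
    """Run the proposed bogus comment state starting at chars[pos].

    `comment` is the list of characters already held by the comment token
    that the caller created before entering this state.
    Returns (comment_token, next_position); the tokenizer is then back in
    the data state.
    """
    while True:
        # Consume the next input character.
        c = chars[pos] if pos < len(chars) else EOF
        if c == ">":
            # Switch to the data state. Emit the comment token.
            return ("comment", "".join(comment)), pos + 1
        if c is EOF:
            # Emit the comment token; EOF is reconsumed by the data state,
            # so the position is not advanced.
            return ("comment", "".join(comment)), pos
        if c == "\u0000":
            # U+0000 NULL becomes U+FFFD REPLACEMENT CHARACTER.
            comment.append("\ufffd")
        else:
            comment.append(c)
        pos += 1
```

Every branch looks only at the current input character, which is what makes the state fit the model used by the rest of the tokenizer.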

The CDATA section state should become:

> Consume the next input character:
> 
> U+005D RIGHT SQUARE BRACKET (]):
> 
> If the three characters starting from the current input character are U+005D 
> RIGHT SQUARE BRACKET U+005D RIGHT SQUARE BRACKET U+003E GREATER-THAN SIGN 
> (]]>), then consume those characters and switch to the data state. Otherwise, 
> emit the current input character as a character token.
> 
> EOF:
> 
> Switch to the data state. Reconsume the EOF character.
> 
> Anything else:
> 
> Append the current input character to the comment token's data.

No changes are needed elsewhere for this. (There is no consistent style
for lookahead — and most cases are ASCII case-insensitive words — so I
went with what seems sane here!)
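
A minimal sketch of the proposed CDATA section state, using the three-character lookahead described above. The function name and return shape are hypothetical. One assumption to flag loudly: the "Anything else" case here emits a character token, matching the "Otherwise" branch of the U+005D case, since no comment token exists while in a CDATA section.

```python
def cdata_section_state(chars, pos):
    """Run the proposed CDATA section state over the string `chars` from `pos`.

    Returns (emitted_character_tokens, next_position); the tokenizer is
    then back in the data state.
    """
    out = []
    while True:
        if pos >= len(chars):
            # EOF: switch to the data state and reconsume EOF there,
            # so the position is not advanced.
            return out, pos
        c = chars[pos]
        # Lookahead: do the three characters starting at the current
        # input character spell "]]>"?
        if c == "]" and chars[pos:pos + 3] == "]]>":
            return out, pos + 3
        # ASSUMPTION: both a lone "]" and "anything else" are emitted as
        # character tokens rather than appended to a comment token.
        out.append(c)
        pos += 1
```

For example, cdata_section_state("a]b]]>c", 0) emits "a", "]" and "b" and leaves the position just after the "]]>".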

/Geoffrey


Re: [whatwg] Proposal: Inline pronounce element

2014-06-08 Thread timeless
Tab wrote:
> This is already theoretically addressed by ,
> linking to a well-defined pronunciation file format. Nobody
> implements that, but nobody implements anything new either, of course.

Brett wrote:
> I think it'd be a lot easier for sites, say along the lines of 
> Wikipedia, to support inline markup to allow users to get a word 
> referenced at the beginning of an article, for example, pronounced 
> accurately.

Wikipedia can easily use data:... if it needs to. 
And wiktionary already has a solution...

A better challenge is explaining to a screen reader if "read" is "rEd" or 
"rehD" in a page where you want to define and use both. I claim that this can 
be addressed with id= on the link and a ref= (or similar) on the use. 

But before User Agents are asked to support this, I'd want to see real
sites showing an interest.

Screen Reader vendors seem ok with the current state - they sell the 
pronunciation tables...