Re: [whatwg] Bug in Before DOCTYPE name state?

2007-06-18 Thread Ian Hickson
On Fri, 22 Dec 2006, Thomas Broyer wrote:
 2006/12/22, Ian Hickson:
  On Thu, 21 Dec 2006, Thomas Broyer wrote:
  
   Why is the DOCTYPE marked in error in the former case?
 
  Because otherwise this document:
 
 !DOCTYPEH
 
  ...would emit a DOCTYPE that is not in error (since the token would be 
  emitted before the bit at the end of the DOCTYPE name state).
 
 Doh! right.

This changed recently, by the way. If someone could check that the spec 
still causes the right errors to be flagged, that would be great. (I 
think it does, though some errors moved from the tokeniser to the tree 
construction phase.)


   In other words, why would !DOCTYPE html be in error while 
   !DOCTYPE Html wouldn't?
 
  Both would be not in error, because of the sentence at the end of the 
  DOCTYPE name state.
 
 OK, now understood (thank you, Simon, for having enlightened me)

Note that this is now handled quite differently.


  On Thu, 21 Dec 2006, Thomas Broyer wrote:
  
   But it also has this note, which is quite confusing: Because 
   lowercase letters in the name are uppercased by the algorithm above, 
   the HTML letters are actually case-insensitive relative to the 
   markup.
 
  How is it confusing? I would clarify it, but I don't know what is 
  confusing.
 
 Maybe there's no need to clarify it, it might just have been me…

Ok.


   It remains that the tokenization stage is a bit confusing…
 
  Yes. The tree construction stage is even worse. Just implement it 
  exactly as written with no interpretation and you should be fine. ;-)
 
 My problem is that I'm not implementing an emitting parser (à la 
 SAX) but a pulling parser, so I'm stopping as soon as I've found a 
 token and return true to say hey, I've changed the TokenType, Name, 
 Value, etc. properties to reflect a new token. ...so I'm interpreting 
 ;-)
 
 Re tree construction, I'm about to implement it in two parts: in the 
 pull parser when possible (handling omitted tags and misnested 
 formatting elements) and in a tree fixer otherwise (move the meta 
 and link into head, etc.)

How has that worked for you? Is the spec ok for that approach?

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

[whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Thomas Broyer

Before DOCTYPE name state:
http://www.whatwg.org/specs/web-apps/current-work/#before1

↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
   Create a new DOCTYPE token. Set the token's name to the
uppercase version of the current input character (subtract 0x0020 from
the character's code point), and mark it as being in error. Switch to
the DOCTYPE name state.


DOCTYPE name state
http://www.whatwg.org/specs/web-apps/current-work/#doctype1

↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
   Append the uppercase version of the current input character
(subtract 0x0020 from the character's code point) to the current
DOCTYPE token's name. Stay in the DOCTYPE name state.
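
Read literally, the two lowercase-letter rules quoted above might be sketched as follows. This is only an illustration in Python: `DoctypeToken`, `before_doctype_name` and `doctype_name` are hypothetical names, and only the lowercase-letter branches quoted here are modelled.

```python
from dataclasses import dataclass


@dataclass
class DoctypeToken:
    """Hypothetical token record; the field names are assumptions."""
    name: str = ""
    in_error: bool = False


def before_doctype_name(ch: str) -> DoctypeToken:
    """Lowercase-letter rule of the Before DOCTYPE name state, as quoted."""
    if not "a" <= ch <= "z":
        raise ValueError("only the quoted lowercase-letter rule is modelled")
    # Uppercase by subtracting 0x0020 from the code point, and mark in error.
    return DoctypeToken(name=chr(ord(ch) - 0x0020), in_error=True)


def doctype_name(token: DoctypeToken, ch: str) -> None:
    """Lowercase-letter rule of the DOCTYPE name state, as quoted."""
    if not "a" <= ch <= "z":
        raise ValueError("only the quoted lowercase-letter rule is modelled")
    # Append the uppercased character; this rule leaves the flag untouched.
    token.name += chr(ord(ch) - 0x0020)


token = before_doctype_name("h")
for ch in "tml":
    doctype_name(token, ch)
print(token)
```

With these two rules alone, the name ends up as HTML while the token is still flagged, which is the situation the question in this message is about.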

Why is the DOCTYPE marked in error in the former case?

In other words, why would !DOCTYPE html be in error while
!DOCTYPE Html wouldn't?

My guess is that it's a bug in the Before DOCTYPE name state.

--
Thomas Broyer


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Thomas Broyer

2006/12/21, Anne van Kesteren:

On Thu, 21 Dec 2006 18:09:43 +0100, Thomas Broyer wrote:
 But it also has this note, which is quite confusing: Because
 lowercase letters in the name are uppercased by the algorithm above,
 the HTML letters are actually case-insensitive relative to the
 markup.

During tokenization you store the lowercase ASCII characters as uppercase.
So you can do a case-sensitive comparison with HTML in the end (HTML
will also end up in the DOM or whatever model you use there).

In the markup it could be written as !doctype html which is what is
suggested there.
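
A small Python sketch of that idea, using hypothetical helper names (`ascii_upper`, `stored_doctype_name`) that do not come from the spec:

```python
def ascii_upper(ch: str) -> str:
    """Uppercase a single ASCII lowercase letter; leave anything else alone."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 0x0020)
    return ch


def stored_doctype_name(markup_name: str) -> str:
    """Name as a tokenizer would store it under the uppercasing rule."""
    return "".join(ascii_upper(ch) for ch in markup_name)


# Every case variant in the markup stores the same name, so a plain
# case-sensitive comparison against "HTML" is enough afterwards.
for variant in ("html", "HTML", "Html", "hTmL"):
    assert stored_doctype_name(variant) == "HTML"
```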


Ah, ok, that's what I thought.

So what's the purpose of marking the DOCTYPE in error in the before
DOCTYPE name state when it finds a lowercase 'h' if it's set back to
correct in DOCTYPE name state if it actually was followed by the
three letters tml (case-insensitively)?


 However, section 8.1.1 says:
 http://www.whatwg.org/specs/web-apps/current-work/#doctype
 
 In other words, !DOCTYPE HTML, case-insensitively.
 

 So I guess you're right.

Learned this when writing the implementation of it :-)


So !doctype html should not produce a parse error? Or should it?

--
Thomas Broyer


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Thomas Broyer

2006/12/21, Anne van Kesteren:

On Thu, 21 Dec 2006 11:08:51 +0100, Thomas Broyer wrote:

 In other words, why would !DOCTYPE html be in error while
 !DOCTYPE Html wouldn't?

 My guess is that it's a bug in the Before DOCTYPE name state.

It's not. The DOCTYPE name state also has this paragraph: Then, if the
name of the DOCTYPE token is exactly the four letters HTML, then mark
the token as being correct. Otherwise, mark it as being in error.
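
The quoted paragraph amounts to a single comparison. A minimal Python sketch, in which `DoctypeToken` and `end_of_doctype_name` are hypothetical names and the question of exactly when the check runs is left open:

```python
from dataclasses import dataclass


@dataclass
class DoctypeToken:
    """Hypothetical token record; the field names are assumptions."""
    name: str
    in_error: bool


def end_of_doctype_name(token: DoctypeToken) -> None:
    """Apply the quoted paragraph: only the exact name HTML is correct."""
    # The flag is overwritten either way, whatever it was set to earlier.
    token.in_error = token.name != "HTML"


flagged = DoctypeToken(name="HTML", in_error=True)
end_of_doctype_name(flagged)
assert flagged.in_error is False
```

Because the check overwrites the flag, an earlier "in error" marking does not survive once the name is exactly HTML.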


Additional note: as I read this, if the DOCTYPE was previously marked
as being in error, it should then be rolled back to being correct
if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked
in error.

That's probably not what's intended.

So I'll just code it so that these are correct:
!doctype html
!DOCTYPE HTML
and every other lowercase/uppercase variant;
and these are in error:
!doctypehtml
!DOCTYPEHTML
and every other lowercase/uppercase variant.

--
Thomas Broyer


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Anne van Kesteren
On Thu, 21 Dec 2006 19:03:34 +0100, Thomas Broyer wrote:

So what's the purpose of marking the DOCTYPE in error in the before
DOCTYPE name state when it finds a lowercase 'h' if it's set back to
correct in DOCTYPE name state if it actually was followed by the
three letters tml (case-insensitively)?


I suppose it's just the way the specification is written. You're free to  
implement this stuff whatever way you feel like as long as


  input -> magic -> output

yields the same result.



However, section 8.1.1 says:
http://www.whatwg.org/specs/web-apps/current-work/#doctype

In other words, !DOCTYPE HTML, case-insensitively.


So I guess you're right.


Learned this when writing the implementation of it :-)


So !doctype html should not produce a parse error? Or should it?


It should not; it's correct.


--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Anne van Kesteren
On Thu, 21 Dec 2006 19:42:24 +0100, Thomas Broyer wrote:

Additional note: as I read this, if the DOCTYPE was previously marked
as being in error, it should then be rolled back to being correct
if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked
in error.


It would be a parse error, but



That's probably not what's intended.


yeah, I suppose not. I don't really see directly how you can nicely deal  
with that during tokenizing. Hmm. (It would work with a flag, but that's  
not nice.)




So I'll just code it so that these are correct:
!doctype html
!DOCTYPE HTML
and every other lowercase/uppercase variant;
and these are in error:
!doctypehtml
!DOCTYPEHTML
and every other lowercase/uppercase variant.



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Thomas Broyer

2006/12/21, Simon Pieters:


From: Thomas Broyer
Additional note: as I read this, if the DOCTYPE was previously marked
as being in error, it should then be rolled back to being correct
if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked
in error.

Correct.

That's probably not what's intended.

It is intended. Browsers trigger standards mode for !doctypehtml. It does,
however, generate a parse error, but the doctype itself is not in error.
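
The distinction between a parse error and the token's own flag can be sketched in Python. Everything here is hypothetical: the function name, the error message, and the simplification that the input is just the text following the DOCTYPE keyword.

```python
from typing import List, Tuple


def after_doctype_keyword(rest: str) -> Tuple[str, bool, List[str]]:
    """Return (name, in_error, parse_errors) for the text after the keyword.

    Assumption for illustration: a missing space before the name is
    reported as a parse error but does not affect the token's own flag.
    """
    parse_errors: List[str] = []
    if not rest[:1].isspace():
        parse_errors.append("expected whitespace before the DOCTYPE name")
    name = rest.strip().upper()  # ASCII names assumed
    in_error = name != "HTML"
    return name, in_error, parse_errors


name, in_error, errors = after_doctype_keyword("html")
assert (name, in_error) == ("HTML", False)
assert len(errors) == 1
```

Keeping the two results separate lets the no-space form report a parse error while the DOCTYPE token itself stays correct.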


OK, understood; thanks a lot.

--
Thomas Broyer


Re: [whatwg] Bug in Before DOCTYPE name state?

2006-12-21 Thread Thomas Broyer

2006/12/22, Ian Hickson:

On Thu, 21 Dec 2006, Thomas Broyer wrote:

 Why is the DOCTYPE marked in error in the former case?

Because otherwise this document:

   !DOCTYPEH

...would emit a DOCTYPE that is not in error (since the token would be
emitted before the bit at the end of the DOCTYPE name state).
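
One way to picture this is the Python sketch below. It rests on an assumed model of the two states, spelled out in the docstring, which is one reading of this thread and not the spec's wording; `emitted_doctype` and its parameters are hypothetical.

```python
from typing import Tuple


def emitted_doctype(name_chars: str, mark_first_in_error: bool) -> Tuple[str, bool]:
    """Return (name, in_error) of the DOCTYPE token at the moment it is emitted.

    Assumed model, for illustration only:
    - the first character is handled by the Before DOCTYPE name state,
      which creates the token and runs no final check;
    - every later character is handled by the DOCTYPE name state, whose
      final check runs after the character has been appended;
    - the token is emitted when the input runs out, before any further check.
    """
    if not name_chars:
        raise ValueError("at least one name character is required")
    name = name_chars[0].upper()  # ASCII letters assumed
    in_error = mark_first_in_error
    for ch in name_chars[1:]:
        name += ch.upper()
        in_error = name != "HTML"  # the sentence at the end of the state
    return name, in_error


# A one-letter name never reaches the final check, so only the initial
# marking can flag it.
assert emitted_doctype("H", mark_first_in_error=True) == ("H", True)
assert emitted_doctype("H", mark_first_in_error=False) == ("H", False)
```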


Doh! right.


 In other words, why would !DOCTYPE html be in error while
 !DOCTYPE Html wouldn't?

Both would be not in error, because of the sentence at the end of the
DOCTYPE name state.


OK, now understood (thank you, Simon, for having enlightened me)


On Thu, 21 Dec 2006, Thomas Broyer wrote:

 But it also has this note, which is quite confusing: Because lowercase
 letters in the name are uppercased by the algorithm above, the HTML
 letters are actually case-insensitive relative to the markup.

How is it confusing? I would clarify it, but I don't know what is
confusing.


Maybe there's no need to clarify it, it might just have been me…


 It remains that the tokenization stage is a bit confusing…

Yes. The tree construction stage is even worse. Just implement it exactly
as written with no interpretation and you should be fine. ;-)


My problem is that I'm not implementing an emitting parser (à la
SAX) but a pulling parser, so I'm stopping as soon as I've found a
token and return true to say hey, I've changed the TokenType, Name,
Value, etc. properties to reflect a new token.
...so I'm interpreting ;-)
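
To make the emitting/pulling distinction concrete, here is a minimal Python sketch. The `Token` shape, `emit_all` and `PullReader` are hypothetical, and a pre-built token list stands in for the real tokenizer.

```python
from typing import Callable, Iterable, Iterator, Tuple

Token = Tuple[str, str, str]  # (token type, name, value); shape is an assumption


def emit_all(tokens: Iterable[Token], on_token: Callable[[Token], None]) -> None:
    """Emitting (push) style: the tokenizer drives and calls back per token."""
    for token in tokens:
        on_token(token)


class PullReader:
    """Pull style: the caller drives by calling read() once per token."""

    def __init__(self, tokens: Iterable[Token]) -> None:
        self._tokens: Iterator[Token] = iter(tokens)
        self.token_type = ""
        self.name = ""
        self.value = ""

    def read(self) -> bool:
        """Advance to the next token; return False when none is left."""
        try:
            self.token_type, self.name, self.value = next(self._tokens)
        except StopIteration:
            return False
        return True


reader = PullReader([("DOCTYPE", "HTML", ""), ("Character", "", "hi")])
while reader.read():
    print(reader.token_type, reader.name, reader.value)
```

In the pull style the reader has to stop as soon as one token is complete, which is why steps the spec phrases as "emit, then continue" need to be reinterpreted.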

Re tree construction, I'm about to implement it in two parts: in the
pull parser when possible (handling omitted tags and misnested
formatting elements) and in a tree fixer otherwise (move the meta
and link into head, etc.)

--
Thomas Broyer