Re: [whatwg] Bug in Before DOCTYPE name state?
On Fri, 22 Dec 2006, Thomas Broyer wrote:
> 2006/12/22, Ian Hickson:
>> On Thu, 21 Dec 2006, Thomas Broyer wrote:
>>> Why is the DOCTYPE marked in error in the former case?
>>
>> Because otherwise this document:
>> !DOCTYPEH
>> ...would emit a DOCTYPE that is not in error (since the token would be emitted before the bit at the end of the DOCTYPE name state).
>
> Doh! right.

This changed recently, by the way. If someone could check that the spec is indeed still causing the right errors to be flagged, that would be great. (I think it is, though some errors moved from the tokeniser to the tree construction phase.)

>>> In other words, why would !DOCTYPE html be in error while !DOCTYPE Html wouldn't?
>>
>> Both would be not in error, because of the sentence at the end of the DOCTYPE name state.
>
> OK, now understood (thank you Simon for having enlightened me)

Note that this is now handled quite differently.

>> On Thu, 21 Dec 2006, Thomas Broyer wrote:
>>> But it also has this note, which is quite confusing:
>>> Because lowercase letters in the name are uppercased by the algorithm above, the HTML letters are actually case-insensitive relative to the markup.
>>
>> How is it confusing? I would clarify it, but I don't know what is confusing.
>
> Maybe there's no need to clarify it, it might just have been me…

Ok.

>>> It remains that the tokenization stage is a bit confusing…
>>
>> Yes. The tree construction stage is even worse. Just implement it exactly as written with no interpretation and you should be fine. ;-)
>
> My problem is that I'm not implementing an emitting parser (à la SAX) but a pulling parser, so I'm stopping as soon as I've found a token and return true to say "hey, I've changed the TokenType, Name, Value, etc. properties to reflect a new token".
> ...so I'm interpreting ;-)
>
> Re tree construction, I'm about to implement it in two parts: in the pull parser when possible (handling omitted tags and misnested formatting elements) and in a tree fixer otherwise (move the meta and link into head, etc.)

How has that worked for you?
Is the spec ok for that approach?

-- Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.
[whatwg] Bug in Before DOCTYPE name state?
Before DOCTYPE name state:
http://www.whatwg.org/specs/web-apps/current-work/#before1

↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
Create a new DOCTYPE token. Set the token's name to the uppercase version of the current input character (subtract 0x0020 from the character's code point), and mark it as being in error. Switch to the DOCTYPE name state.

DOCTYPE name state:
http://www.whatwg.org/specs/web-apps/current-work/#doctype1

↪ U+0061 LATIN SMALL LETTER A through to U+007A LATIN SMALL LETTER Z
Append the uppercase version of the current input character (subtract 0x0020 from the character's code point) to the current DOCTYPE token's name. Stay in the DOCTYPE name state.

Why is the DOCTYPE marked in error in the former case? In other words, why would !DOCTYPE html be in error while !DOCTYPE Html wouldn't?

My guess is that it's a bug in the Before DOCTYPE name state.

-- Thomas Broyer
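For readers following along, the two quoted rules can be sketched roughly as below. This is only an illustration of the lowercase-letter branches quoted above; the names `DoctypeToken`, `before_doctype_name` and `doctype_name` are hypothetical, and every other branch of both states is left out.

```python
from dataclasses import dataclass


def uppercase_ascii(ch: str) -> str:
    """Uppercase a-z by subtracting 0x20 from the code point, as the quoted rules say."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 0x20)
    return ch


@dataclass
class DoctypeToken:
    # Hypothetical token shape; the thread only mentions a name and an error mark.
    name: str = ""
    in_error: bool = False


def before_doctype_name(ch: str) -> DoctypeToken:
    """Lowercase-letter branch of the Before DOCTYPE name state (other branches omitted)."""
    token = DoctypeToken()
    token.name = uppercase_ascii(ch)
    token.in_error = True  # marked in error at creation, per the quoted rule
    return token


def doctype_name(token: DoctypeToken, ch: str) -> None:
    """Lowercase-letter branch of the DOCTYPE name state: append and stay in the state."""
    token.name += uppercase_ascii(ch)


token = before_doctype_name("h")
for ch in "tml":
    doctype_name(token, ch)
print(token)
```

With only these two branches the token is still marked in error after reading "html", which is exactly the situation the question above is about.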
Re: [whatwg] Bug in Before DOCTYPE name state?
2006/12/21, Anne van Kesteren:
> On Thu, 21 Dec 2006 18:09:43 +0100, Thomas Broyer wrote:
>> But it also has this note, which is quite confusing:
>> Because lowercase letters in the name are uppercased by the algorithm above, the HTML letters are actually case-insensitive relative to the markup.
>
> During tokenization you store the lowercase ASCII characters as uppercase. So you can do a case-sensitive comparison with HTML in the end (HTML will also end up in the DOM or whatever model you use there). In the markup it could be written as !doctype html which is what is suggested there.

Ah, OK, that's what I thought. So what's the purpose of marking the DOCTYPE in error in the Before DOCTYPE name state when it finds a lowercase 'h', if it's set back to correct in the DOCTYPE name state when it is actually followed by the three letters tml (case-insensitively)?

>> However, section 8.1.1 says:
>> http://www.whatwg.org/specs/web-apps/current-work/#doctype
>> In other words, !DOCTYPE HTML, case-insensitively. So I guess you're right.
>
> Learned this when writing the implementation of it :-)

So !doctype html should not produce a parse error? Or should it?

-- Thomas Broyer
Re: [whatwg] Bug in Before DOCTYPE name state?
2006/12/21, Anne van Kesteren:
> On Thu, 21 Dec 2006 11:08:51 +0100, Thomas Broyer wrote:
>> In other words, why would !DOCTYPE html be in error while !DOCTYPE Html wouldn't? My guess is that it's a bug in the Before DOCTYPE name state.
>
> It's not. The DOCTYPE name state also has this paragraph:
> Then, if the name of the DOCTYPE token is exactly the four letters HTML, then mark the token as being correct. Otherwise, mark it as being in error.

Additional note: as I read this, if the DOCTYPE was previously marked as being in error, it should then be rolled back to being correct if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked in error.

That's probably not what's intended.

So I'll just code it so that these are correct: !doctype html, !DOCTYPE HTML and every other lowercase/uppercase variant; and these are in error: !doctypehtml, !DOCTYPEHTML and every other lowercase/uppercase variant.

-- Thomas Broyer
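A small sketch of how the quoted paragraph interacts with the name accumulation may help here. It assumes, as one reading of the quoted paragraph, that the "exactly the four letters HTML" check is re-run each time a character has been appended; `name_state_history` is a hypothetical helper, not anything from the spec.

```python
from typing import List, Tuple


def name_state_history(raw_name: str) -> List[Tuple[str, bool]]:
    """Return (name so far, in_error) after each character of the DOCTYPE name.

    Assumption: the correct/in-error mark is recomputed after every appended
    character, following the quoted "exactly the four letters HTML" paragraph.
    """
    name = ""
    history = []
    for ch in raw_name:
        # ASCII-only uppercasing: subtract 0x20 from a-z, leave everything else alone.
        name += chr(ord(ch) - 0x20) if "a" <= ch <= "z" else ch
        in_error = name != "HTML"
        history.append((name, in_error))
    return history


for step in name_state_history("html"):
    print(step)
```

Under that assumption the mark only flips to "correct" once the fourth letter arrives, and it flips back to "in error" if any further name character follows.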
Re: [whatwg] Bug in Before DOCTYPE name state?
On Thu, 21 Dec 2006 19:03:34 +0100, Thomas Broyer wrote:
> So what's the purpose of marking the DOCTYPE in error in the Before DOCTYPE name state when it finds a lowercase 'h', if it's set back to correct in the DOCTYPE name state when it is actually followed by the three letters tml (case-insensitively)?

I suppose it's just the way the specification is written. You're free to implement this stuff whatever way you feel like as long as input - magic - output yields the same result.

>>> However, section 8.1.1 says:
>>> http://www.whatwg.org/specs/web-apps/current-work/#doctype
>>> In other words, !DOCTYPE HTML, case-insensitively. So I guess you're right.
>>
>> Learned this when writing the implementation of it :-)
>
> So !doctype html should not produce a parse error? Or should it?

It should not; it's correct.

-- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Bug in Before DOCTYPE name state?
On Thu, 21 Dec 2006 19:42:24 +0100, Thomas Broyer wrote:
> Additional note: as I read this, if the DOCTYPE was previously marked as being in error, it should then be rolled back to being correct if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked in error.

It would be a parse error, but

> That's probably not what's intended.

yeah, I suppose not. I don't really see directly how you can nicely deal with that during tokenizing. Hmm. (It would work with a flag, but that's not nice.)

> So I'll just code it so that these are correct: !doctype html, !DOCTYPE HTML and every other lowercase/uppercase variant; and these are in error: !doctypehtml, !DOCTYPEHTML and every other lowercase/uppercase variant.

-- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Bug in Before DOCTYPE name state?
2006/12/21, Simon Pieters:
> From: Thomas Broyer
>> Additional note: as I read this, if the DOCTYPE was previously marked as being in error, it should then be rolled back to being correct if the DOCTYPE name is HTML: !DOCTYPEHTML would *not* be marked in error.
>
> Correct.
>
>> That's probably not what's intended.
>
> It is intended. Browsers trigger standards mode for !doctypehtml. It does, however, generate a parse error, but the doctype itself is not in error.

OK, understood; thanks a lot.

-- Thomas Broyer
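The distinction drawn above (a parse error is reported, yet the DOCTYPE token itself is not in error) can be illustrated with a deliberately simplified sketch. It works on a whole string instead of a character-by-character state machine, and `scan_doctype` and its error message are hypothetical names chosen only for this example.

```python
from typing import List, Tuple


def scan_doctype(markup: str) -> Tuple[bool, List[str]]:
    """Return (token_in_error, parse_errors) for markup such as '<!doctypehtml>'.

    Simplified sketch: only handles a complete '<!DOCTYPE ...>' string and
    tracks the two outcomes separately, as described in the message above.
    """
    prefix = "<!doctype"
    if not markup.lower().startswith(prefix) or not markup.endswith(">"):
        raise ValueError("not a complete DOCTYPE in this simplified sketch")

    parse_errors: List[str] = []
    rest = markup[len(prefix):-1]

    # Missing whitespace after the keyword is reported as a parse error...
    if not rest[:1].isspace():
        parse_errors.append("missing whitespace before DOCTYPE name")

    # ...but the token's own mark depends only on the (ASCII-uppercased) name.
    name = "".join(
        chr(ord(ch) - 0x20) if "a" <= ch <= "z" else ch for ch in rest.strip()
    )
    token_in_error = name != "HTML"
    return token_in_error, parse_errors


print(scan_doctype("<!doctypehtml>"))
print(scan_doctype("<!DOCTYPE html>"))
```

Keeping the two results apart is what lets a consumer pick standards mode from the token while still reporting the markup problem.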
Re: [whatwg] Bug in Before DOCTYPE name state?
2006/12/22, Ian Hickson:
> On Thu, 21 Dec 2006, Thomas Broyer wrote:
>> Why is the DOCTYPE marked in error in the former case?
>
> Because otherwise this document:
> !DOCTYPEH
> ...would emit a DOCTYPE that is not in error (since the token would be emitted before the bit at the end of the DOCTYPE name state).

Doh! right.

>> In other words, why would !DOCTYPE html be in error while !DOCTYPE Html wouldn't?
>
> Both would be not in error, because of the sentence at the end of the DOCTYPE name state.

OK, now understood (thank you Simon for having enlightened me)

> On Thu, 21 Dec 2006, Thomas Broyer wrote:
>> But it also has this note, which is quite confusing:
>> Because lowercase letters in the name are uppercased by the algorithm above, the HTML letters are actually case-insensitive relative to the markup.
>
> How is it confusing? I would clarify it, but I don't know what is confusing.

Maybe there's no need to clarify it, it might just have been me…

>> It remains that the tokenization stage is a bit confusing…
>
> Yes. The tree construction stage is even worse. Just implement it exactly as written with no interpretation and you should be fine. ;-)

My problem is that I'm not implementing an emitting parser (à la SAX) but a pulling parser, so I'm stopping as soon as I've found a token and return true to say "hey, I've changed the TokenType, Name, Value, etc. properties to reflect a new token".
...so I'm interpreting ;-)

Re tree construction, I'm about to implement it in two parts: in the pull parser when possible (handling omitted tags and misnested formatting elements) and in a tree fixer otherwise (move the meta and link into head, etc.)

-- Thomas Broyer
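To make the emitting versus pulling distinction concrete, here is a toy sketch. It tokenizes whitespace-separated words, not HTML, and the names `emit_tokens`, `PullReader`, `token_type` and `value` are hypothetical stand-ins for the TokenType/Name/Value properties mentioned above.

```python
from typing import Callable, Iterator, Optional, Tuple

Token = Tuple[str, str]  # (token_type, value); hypothetical shape for this sketch


def _tokens(text: str) -> Iterator[Token]:
    """Toy tokenizer: one 'word' token per whitespace-separated chunk."""
    for word in text.split():
        yield ("word", word)


def emit_tokens(text: str, emit: Callable[[Token], None]) -> None:
    """Emitting (push) style: the tokenizer drives and calls back for each token."""
    for token in _tokens(text):
        emit(token)


class PullReader:
    """Pulling style: the caller drives; read() stops as soon as one token is found."""

    def __init__(self, text: str) -> None:
        self._source = _tokens(text)
        self.token_type: Optional[str] = None
        self.value: Optional[str] = None

    def read(self) -> bool:
        """Advance to the next token; return True if the properties now describe it."""
        try:
            self.token_type, self.value = next(self._source)
        except StopIteration:
            return False
        return True


reader = PullReader("doctype html")
while reader.read():
    print(reader.token_type, reader.value)
```

In the pull style, any rule phrased as "emit the token" has to become "suspend and return true", which is where the interpretation mentioned above comes in.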