2006/12/22, Ian Hickson:
On Thu, 21 Dec 2006, Thomas Broyer wrote:
>
> Why is the DOCTYPE marked "in error" in the former case?
Because otherwise this document:
<!DOCTYPEH
...would emit a DOCTYPE that is not in error (since the token would be
emitted before the bit at the end of the DOCTYPE name state).
Doh! Right.
> In other words, why would <!DOCTYPE html> be "in error" while
> <!DOCTYPE Html> wouldn't?
Both would be not in error, because of the sentence at the end of the
DOCTYPE name state.
OK, now understood (thank you, Simon, for having enlightened me).
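
For readers following along, the point above can be sketched roughly as follows. This is only a simplified model of the behaviour described in this thread, not the spec's wording: the function name `tokenize_doctype_name` and its return shape are made up for illustration, and whitespace handling is reduced to "stop at the first space". The token starts out in error, and the check against "HTML" runs only after a character has been appended, so a token emitted early (as with `<!DOCTYPEH` followed by end of input) stays in error.

```python
def ascii_upper(ch):
    """Uppercase only a-z, leaving every other character untouched."""
    return ch.upper() if "a" <= ch <= "z" else ch


def tokenize_doctype_name(chars):
    """Hypothetical helper: takes the text after "<!DOCTYPE" (leading
    whitespace already skipped) and returns (name, in_error)."""
    it = iter(chars)
    first = next(it, None)
    if first is None or first == ">":
        return "", True            # no name at all: in error
    name = ascii_upper(first)
    in_error = True                # assumption: a new token starts in error
    for ch in it:
        if ch == ">" or ch.isspace():
            return name, in_error  # token leaves the name state here
        name += ascii_upper(ch)
        # The "sentence at the end of the name state": re-evaluate each time.
        in_error = name != "HTML"
    return name, in_error          # input ended inside the name


for sample in ("html>", "Html>", "H", "htm>"):
    print(sample, tokenize_doctype_name(sample))
```

Under this model both `<!DOCTYPE html>` and `<!DOCTYPE Html>` come out as not in error, because the name is uppercased before the comparison.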
On Thu, 21 Dec 2006, Thomas Broyer wrote:
>
> But it also has this note, which is quite confusing: "Because lowercase
> letters in the name are uppercased by the algorithm above, the "HTML"
> letters are actually case-insensitive relative to the markup."
How is it confusing? I would clarify it, but I don't know what is
confusing.
Maybe there's no need to clarify it, it might just have been me…
> It remains that the tokenization stage is a bit confusing…
Yes. The tree construction stage is even worse. Just implement it exactly
as written with no interpretation and you should be fine. ;-)
My "problem" is that I'm not implementing an "emitting" parser (à la
SAX) but a "pulling" parser, so I'm stopping as soon as I've found a
token and returning true to say "hey, I've changed the TokenType, Name,
Value, etc. properties to reflect a new token".
...so I'm interpreting ;-)
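
To make the difference concrete, here is a minimal sketch of the two styles. It is a toy, not my actual code: it only knows bare start tags, end tags and text, and the names `PullTokenizer`, `read` and `emit_tokens` are invented for the example. The pull reader stops at each token and updates its properties; the emitting style can then be layered on top with a callback.

```python
class PullTokenizer:
    """Toy pull-style tokenizer: the caller asks for the next token."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self.token_type = None
        self.name = None
        self.value = None

    def read(self):
        """Advance to the next token; return False at end of input."""
        text, pos = self._text, self._pos
        if pos >= len(text):
            return False
        end = text.find(">", pos) if text[pos] == "<" else -1
        if end != -1:
            body = text[pos + 1:end]
            if body.startswith("/"):
                self.token_type, self.name = "EndTag", body[1:].lower()
            else:
                self.token_type, self.name = "StartTag", body.lower()
            self.value = None
            self._pos = end + 1
        else:
            # Text runs up to the next "<" (or to the end of the input).
            nxt = text.find("<", pos + 1)
            if nxt == -1:
                nxt = len(text)
            self.token_type, self.name = "Text", None
            self.value = text[pos:nxt]
            self._pos = nxt
        return True


def emit_tokens(text, on_token):
    """Emitting (SAX-like) style built on top of the pull reader."""
    reader = PullTokenizer(text)
    while reader.read():
        on_token(reader.token_type, reader.name, reader.value)


emit_tokens("<p>Hi</p>", lambda kind, name, value: print(kind, name, value))
```

The catch is that the spec's states sometimes emit a token and keep going, so in the pull style each "emit" has to become a point where the state is saved and control returns to the caller.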
Re tree construction, I'm about to implement it in two parts: in the
"pull parser" when possible (handling omitted tags and misnested
formatting elements) and in a "tree fixer" otherwise (move the <meta>
and <link> into <head>, etc.)
--
Thomas Broyer