Re: [whatwg] [WA1] Formatting elements
On Wed, 19 Jul 2006, Stewart Brodie wrote:
> > > I know it's hard to see when written out textually, but note that for the text node 'jkl', the I and B elements are the wrong way around!
> >
> > Wrong way with respect to what? They're the right way if you look at the end tags: </b> closes first, so it must be innermost! ;-)
>
> I disagree because the 'jkl' is the bit I'm interested in here. Are you saying that the desirable tree order is defined in terms only of the closing tags rather than the open tags?

No, I'm saying that it doesn't really matter. The content is malformed, so what we do with it doesn't really matter -- so long as it is well-defined, works with existing content, and isn't an undue burden on implementations for the correct case and the common case (if that's not the correct case).

> In the original source, there haven't been any close tags at all at the time the 'jkl' is parsed; ignoring the other text nodes, the tree is:
>
>     <DIV> <B> <I> <P> jkl
>
> (I don't really like the P being there, though, to be honest).

What would you do instead? (Considering the performance concerns given below?)

> At this point, jkl has a logical element hierarchy above it in the DOM tree that matches what was in the original HTML source. In CSS selector terms, DIV B I. The subsequent processing of the </B> token causes such a selector to no longer match (it has now changed to DIV I B):
>
>     <DIV> <B> <I> </I> </B> <P> <I> <B> jkl
>
> Surely it is reasonable to expect the jkl to retain its ancestry - i.e. be a child of the cloned I, which is a child of the cloned B, regardless of the tag closure (of the B) that's about to occur, which would convert it to ...
>
>     <DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)
>
> I suppose the root of my concern is how to apply CSS selector matching in a reasonable looking manner to the DOM tree if the parser has reversed the parentage of the formatting elements.

The entire basis of the Adoption Agency algorithm is that in fact the ancestry is not kept. I don't know of an alternative that works in as many cases.
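To make the selector concern concrete, here is a minimal Python sketch. It is not a CSS engine: it only tests whether the type names of a descendant selector occur, in order, along the ancestor chain of the 'jkl' text node. The function name and the two chain variables are hypothetical, and the chains are copied by hand from the two trees discussed above.

```python
def some_ancestor_matches(chain, selector):
    """Return True if the selector's type names occur, in order, along
    the ancestor chain (outermost element first)."""
    remaining = iter(chain)
    # `part in remaining` consumes the iterator up to the first hit,
    # so this is an in-order subsequence test.
    return all(part in remaining for part in selector.lower().split())


# Ancestors of the 'jkl' text node, outermost first (hand-copied).
after_adoption_agency = ["div", "p", "i", "b"]  # tree the algorithm yields
ancestry_preserved = ["div", "p", "b", "i"]     # tree Firefox yields

for label, chain in [("adoption agency", after_adoption_agency),
                     ("ancestry preserved", ancestry_preserved)]:
    for selector in ("DIV B I", "DIV I B"):
        print(label, "|", selector, "|", some_ancestor_matches(chain, selector))
```

Under these assumptions the selector DIV B I only finds a match above 'jkl' in the ancestry-preserving tree, which is the reversal being described.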
I agree that it isn't optimal, but the problem is that the input is ill-formed in the first place, so any attempt to make it into a tree will be flawed in some way.

> > It gets more obviously bad to do what Mozilla does when you consider a case like:
> >
> >     <b><p>...<p>...<p>...<p>...<p>...<p>...
> >
> > ...which is very common. With that exact markup, Safari, IE7, and the spec all end up with the exact same DOM tree (from the body down, at least), and with the same number of element nodes (from body down, 8). Mozilla ends up with 13 nodes (from the body down). That doesn't scale -- there are pages with hundreds of nodes like this. And it gets much worse if it was all wrapped in a <u> and <em> too.
>
> The key is, as you mention in one of the blog entries linked below, that the behaviour differs depending on whether or not the content is well-formed in terms of matching order of start and end tags, or not.

In the Mozilla case, it depends on more than just whether the document is well-formed -- it depends on where the TCP packet boundaries lie. This is, IMHO, completely unacceptable, far less acceptable than moving the nodes around after their birth.

> I just don't like the idea of having to detach nodes from the DOM tree once they have been attached. The current algorithm is to allow any element inside any other (pretty much) until a problem crops up at which point there's a reorganisation required and that requires detachment (almost always).

Right. I'm not a huge fan of it either, but it works (Safari does it), and it doesn't have the (IMHO much worse) problems that the other algorithms have. Note that it actually is compatible with non-tree parsing modes (where the parser doesn't construct a DOM but instead marks the start and end of each tag, with tags possibly overlapping). The handling of broken content in table tags isn't. This, to me, is a much worse situation to be in, and there we really have no choice (all browsers are basically interoperable on that case).
> > > The problem here may simply be that appending any node due to opening any non-formatting/non-phrasing open tag when in "in body" should cause any formatting/phrasing elements to be popped off the stack of open elements, and then NOT execute "reconstruct the active formatting elements" (because it'll be executed automatically when opening the next formatting/phrasing element or text node anyway).
> >
> > Isn't that already the case? You only reconstruct for inline elements and text nodes, as far as I can tell.
>
> No, on both counts. Firstly, you just append the new node regardless of what's already on the stack; secondly, the algorithm as stated causes the reconstruction to happen for P too. That may be an error?

I don't understand what you are describing here. Could you explain further?

> I'm also wondering about a change of behaviour for the
Re: [whatwg] [WA1] Formatting elements
Ian Hickson [EMAIL PROTECTED] wrote:
> On Mon, 17 Jul 2006, Stewart Brodie wrote:
> > I tried dry-running the algorithm for handling mis-nested formatting elements, but I ended up with a tree that looked very odd. I can't believe that the output I ended up with is what the desired result of the algorithm is, so there is a mistake somewhere: either in my execution of the algorithm or in the algorithm itself. I took the following fragment of HTML:
> >
> >     <DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu
> >
> > the result I ended up with was equivalent to:
> >
> >     <DIV> abc <B> def <I> ghi </I> </B> <I> </I> <P> <I> <B> jkl </B> mno </I> pqr </P> stu </DIV>
>
> Looks right. With that as input, my implementation outputs:
>
>     5: Parse error: missing document type declaration.
>     38: Parse error: mismatched b element end tag (misnested tags).
>     47: Parse error: mismatched i element end tag (misnested tags).
>     57: Parse error: mismatched body element end tag (premature end of file?).
>     <html><head></head><body><div> abc <b> def <i> ghi </i></b><i></i><p><i><b> jkl </b> mno </i> pqr </p> stu</div></body></html>

Good - we do end up with exactly the same thing.

> > I know it's hard to see when written out textually, but note that for the text node 'jkl', the I and B elements are the wrong way around!
>
> Wrong way with respect to what? They're the right way if you look at the end tags: </b> closes first, so it must be innermost! ;-)

I disagree because the 'jkl' is the bit I'm interested in here. Are you saying that the desirable tree order is defined in terms only of the closing tags rather than the open tags? In the original source, there haven't been any close tags at all at the time the 'jkl' is parsed; ignoring the other text nodes, the tree is:

    <DIV> <B> <I> <P> jkl

(I don't really like the P being there, though, to be honest).

At this point, jkl has a logical element hierarchy above it in the DOM tree that matches what was in the original HTML source. In CSS selector terms, DIV B I.
The subsequent processing of the </B> token causes such a selector to no longer match (it has now changed to DIV I B):

    <DIV> <B> <I> </I> </B> <P> <I> <B> jkl

Surely it is reasonable to expect the jkl to retain its ancestry - i.e. be a child of the cloned I, which is a child of the cloned B, regardless of the tag closure (of the B) that's about to occur, which would convert it to ...

    <DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)

I suppose the root of my concern is how to apply CSS selector matching in a reasonable looking manner to the DOM tree if the parser has reversed the parentage of the formatting elements.

> The point is this is error-correction logic, there is no right way (well, until the spec is a standard, I guess).

Indeed I suspect that it may not be possible to define the one true way in such a way that satisfies all content.

> > It all seems to start going wrong for me in step 7 of the algorithm. During the handling of the </B> tag, the clone of I gets created and that's the node that ends up being the childless I node that has the DIV as its parent (during step 5 of handling the </I> tag when the I is cloned for a second time to be the child of the P and adopt the original children of the P).
> >
> > Firefox generates what I think I would expect and prefer:
> >
> >     <DIV> abc <B> def <I> ghi </I> </B> <P> <B> <I> jkl </I> </B> <I> mno </I> pqr </P> stu </DIV>
>
> It's the same number of tags, in this case. It gets more obviously bad to do what Mozilla does when you consider a case like:
>
>     <b><p>...<p>...<p>...<p>...<p>...<p>...
>
> ...which is very common. With that exact markup, Safari, IE7, and the spec all end up with the exact same DOM tree (from the body down, at least), and with the same number of element nodes (from body down, 8). Mozilla ends up with 13 nodes (from the body down). That doesn't scale -- there are pages with hundreds of nodes like this. And it gets much worse if it was all wrapped in a <u> and <em> too.
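The 8 versus 13 figures quoted above can be reproduced with a rough counting model. This is only a sketch under an assumption that the thread does not spell out: that in the cloning case every paragraph ends up carrying its own copy of each wrapping formatting element, while in the other case the wrappers exist once. The function name and parameters are hypothetical.

```python
def element_count(paragraphs, wrappers, clone_per_paragraph):
    """Count element nodes from body down for markup such as
    <b><p>...<p>... with `wrappers` formatting elements opened once
    before `paragraphs` unclosed <p> tags.

    ASSUMPTION: with cloning, each paragraph carries one copy of every
    wrapper; without cloning, each wrapper exists exactly once.
    """
    body = 1
    if clone_per_paragraph:
        return body + paragraphs * (1 + wrappers)
    return body + wrappers + paragraphs


# The case from the message: one <b>, six <p> elements.
print(element_count(6, 1, clone_per_paragraph=False))
print(element_count(6, 1, clone_per_paragraph=True))

# Same six paragraphs wrapped in <b>, <u> and <em> (three wrappers).
print(element_count(6, 3, clone_per_paragraph=False))
print(element_count(6, 3, clone_per_paragraph=True))
```

If the assumption holds, the cloned count grows with the product of paragraphs and wrappers, whereas the shared count grows with their sum, which is one way to read the "doesn't scale" remark.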
The key is, as you mention in one of the blog entries linked below, that the behaviour differs depending on whether or not the content is well-formed in terms of matching order of start and end tags, or not.

> > For comparison, Internet Explorer 6 on the other hand treats the P no differently to the B or I and ends up with:
> >
> >     <DIV> abc <B> def <I> ghi <P> jkl </P> </I> </B> <I> <P> mno </P> </I> <P> pqr </P> stu </DIV>
>
> Actually IE has only one P element (and only one B and only one I). Look closer and you'll find that the P element isn't closed -- it's just that the mno and pqr text nodes' parentNodes point to the P, while the DIV element's childNodes array actually also mentions those text nodes. Yes, IE generates DOM trees that aren't trees. See also:
>
>     http://ln.hixie.ch/?start=1037910467count=1
>     http://ln.hixie.ch/?start=1138169545count=1
>     http://ln.hixie.ch/?start=1137740632count=1
>     http://ln.hixie.ch/?start=1026485588count=1
>     http://ln.hixie.ch/?start=1137799947count=1

Yes, I have already read many of your blog entries on this topic. I got the
[whatwg] [WA1] Formatting elements
I tried dry-running the algorithm for handling mis-nested formatting elements, but I ended up with a tree that looked very odd. I can't believe that the output I ended up with is what the desired result of the algorithm is, so there is a mistake somewhere: either in my execution of the algorithm or in the algorithm itself. I took the following fragment of HTML:

    <DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu

The DIV is chosen to provide a suitable context for testing everything else. B and I were chosen as formatting elements with short names, P was chosen as it has no special behaviour as an open tag when in the "in body" state (possibly a mistake? I'm not certain).

One filled whiteboard later, the result I ended up with was equivalent to:

    <DIV> abc <B> def <I> ghi </I> </B> <I> </I> <P> <I> <B> jkl </B> mno </I> pqr </P> stu </DIV>

I know it's hard to see when written out textually, but note that for the text node 'jkl', the I and B elements are the wrong way around!

It all seems to start going wrong for me in step 7 of the algorithm. During the handling of the </B> tag, the clone of I gets created and that's the node that ends up being the childless I node that has the DIV as its parent (during step 5 of handling the </I> tag when the I is cloned for a second time to be the child of the P and adopt the original children of the P).

Firefox generates what I think I would expect and prefer:

    <DIV> abc <B> def <I> ghi </I> </B> <P> <B> <I> jkl </I> </B> <I> mno </I> pqr </P> stu </DIV>

This behaviour would be consistent with disallowing non-phrasing and non-formatting elements on the stack of open elements when there are phrasing/formatting elements on the bottom of the stack. IOW, the P implicitly closes the B and I elements, leaving them in the list of active formatting elements, and then NOT executing "reconstruct the active formatting elements" before appending the new P element, leaving that for when the 'jkl' text node is encountered.
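The behaviour described in the previous paragraph can be sketched in a few lines of Python. This is a toy tree builder written for illustration only: it is neither the algorithm in the draft nor Mozilla's code, it knows only B and I as formatting elements, and it accepts only bare tags without attributes. The names Node, serialize, build and FORMATTING are hypothetical.

```python
import re

FORMATTING = {"b", "i"}  # assumption: the only formatting elements handled


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # child Node objects and text strings, in order


def serialize(node):
    inner = "".join(c if isinstance(c, str) else serialize(c)
                    for c in node.children)
    return f"<{node.name}>{inner}</{node.name}>"


def build(source):
    root = Node("root")
    stack = [root]  # stack of open elements
    active = []     # active formatting elements, outermost first

    def reconstruct():
        # Re-open (as clones) any active formatting element not on the stack.
        for idx, el in enumerate(active):
            if el not in stack:
                clone = Node(el.name)
                stack[-1].children.append(clone)
                stack.append(clone)
                active[idx] = clone

    for closing, name, text in re.findall(r"<(/?)(\w+)>|([^<]+)", source):
        if text:
            reconstruct()
            stack[-1].children.append(text)
            continue
        name = name.lower()
        if not closing:
            if name in FORMATTING:
                reconstruct()
            else:
                # A block start tag pops formatting elements off the stack
                # but leaves them in the active list, without reconstructing.
                while stack[-1].name in FORMATTING:
                    stack.pop()
            el = Node(name)
            stack[-1].children.append(el)
            stack.append(el)
            if name in FORMATTING:
                active.append(el)
        else:
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == name:
                    closed = stack[i]
                    del stack[i:]  # elements above it stay in `active`
                    if closed in active:
                        active.remove(closed)
                    break
    return serialize(root.children[0])


print(build("<DIV>abc<B>def<I>ghi<P>jkl</B>mno</I>pqr</P>stu"))
```

Run on the test fragment (with the spaces around the text removed for brevity), this toy builder keeps 'jkl' inside a cloned I inside a cloned B, the same nesting as the Firefox tree above. It never detaches a node once attached, but it does create extra clones.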
For comparison, Internet Explorer 6 on the other hand treats the P no differently to the B or I and ends up with:

    <DIV> abc <B> def <I> ghi <P> jkl </P> </I> </B> <I> <P> mno </P> </I> <P> pqr </P> stu </DIV>

The problem here may simply be that appending any node due to opening any non-formatting/non-phrasing open tag when in "in body" should cause any formatting/phrasing elements to be popped off the stack of open elements, and then NOT execute "reconstruct the active formatting elements" (because it'll be executed automatically when opening the next formatting/phrasing element or text node anyway).

--
Stewart Brodie
Software Engineer
ANT Software Limited