Re: [whatwg] [WA1] Formatting elements

2006-07-21 Thread Ian Hickson
On Wed, 19 Jul 2006, Stewart Brodie wrote:
  
   I know it's hard to see when written out textually, but note that 
   for the text node 'jkl', the I and B elements are the wrong way 
   around!
  
  Wrong way with respect to what? They're the right way if you look at 
  the end tags: </b> closes first, so it must be innermost! ;-)
 
 I disagree because the 'jkl' is the bit I'm interested in here.  Are you
 saying that the desirable tree order is defined in terms only of the closing
 tags rather than the open tags?

No, I'm saying that it doesn't really matter. The content is malformed, so 
what we do with it doesn't really matter -- so long as it is well-defined, 
works with existing content, and isn't an undue burden on implementations 
for the correct case and the common case (if that's not the correct case).


 In the original source, there haven't been any close tags at all at the 
 time the 'jkl' is parsed.  Ignoring the other text nodes, the tree is:
 
 <DIV> <B> <I> <P> jkl

 (I don't really like the P being there, though, to be honest).

What would you do instead? (Considering the performance concerns given 
below?)


 At this point, jkl has a logical element hierarchy above it in the DOM 
 tree that matches what was in the original HTML source.  In CSS selector 
 terms, DIV > B > I.  The subsequent processing of the </B> token 
 causes such a selector to no longer match (it has now changed to 
 DIV > I > B):
 
 <DIV> <B> <I> </I> </B> <P> <I> <B> jkl
 
 Surely it is reasonable to expect the jkl to retain its ancestry - i.e. 
 be a child of the cloned I, which is a child of the cloned B, regardless 
 of the tag closure (of the B) that's about to occur, which would convert 
 it to ...
 
 <DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)
 
 I suppose the root of my concern is how to apply CSS selector matching 
 in a reasonable looking manner to the DOM tree if the parser has 
 reversed the parentage of the formatting elements.

The entire basis of the Adoption Agency algorithm is that in fact the 
ancestry is not kept. I don't know of an alternative that works in as many 
cases. I agree that it isn't optimal, but the problem is that the input is 
ill-formed in the first place, so any attempt to make it into a tree will 
be flawed in some way.
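
The ancestry difference under discussion can be made concrete with a small sketch. It assumes Python's standard html.parser module and uses the two result trees quoted elsewhere in this thread, re-serialized here as well-formed lowercase markup. The names AncestryFinder, ancestry, SPEC_RESULT and FIREFOX_RESULT are illustrative and do not come from the thread or the spec.

```python
from html.parser import HTMLParser


class AncestryFinder(HTMLParser):
    """Records the open-element chain above the first text node containing `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []    # names of currently open elements, outermost first
        self.chain = None  # filled in when the target text is seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # The inputs below are well-formed, so the top of the stack always matches.
        self.stack.pop()

    def handle_data(self, data):
        if self.chain is None and self.target in data:
            self.chain = list(self.stack)


def ancestry(markup, target="jkl"):
    finder = AncestryFinder(target)
    finder.feed(markup)
    finder.close()
    return finder.chain


# Assumption: these strings are hand-written serializations of the two trees
# described in the thread, not the output of any browser.
SPEC_RESULT = "<div>abc<b>def<i>ghi</i></b><i></i><p><i><b>jkl</b>mno</i>pqr</p>stu</div>"
FIREFOX_RESULT = "<div>abc<b>def<i>ghi</i></b><p><b><i>jkl</i></b><i>mno</i>pqr</p>stu</div>"

print(" > ".join(ancestry(SPEC_RESULT)))
print(" > ".join(ancestry(FIREFOX_RESULT)))
```

Under these assumptions the first chain ends in i, b and the second in b, i, so a selector such as B > I matches the parent of 'jkl' only in the second tree.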


  It gets more obviously bad to do what Mozilla does when you consider a 
  case like:
  
 <b><p>...<p>...<p>...<p>...<p>...<p>...
  
  ...which is very common. With that exact markup, Safari, IE7, and the 
  spec all end up with the exact same DOM tree (from the body down, at 
  least), and with the same number of element nodes (from body down, 
  8).
  
  Mozilla ends up with 13 nodes (from the body down). That doesn't scale 
  -- there are pages with hundreds of nodes like this.
 
 And it gets much worse if it was all wrapped in a <u> and <em> too. The 
 key is, as you mention in one of the blog entries linked below, that the 
 behaviour differs depending on whether or not the content is well-formed 
 in terms of matching order of start and end tags.

In the Mozilla case, it depends on more than just whether the document is 
well-formed -- it depends on where the TCP packet boundaries lie. This is, 
IMHO, completely unacceptable, far less acceptable than moving the nodes 
around after their birth.


 I just don't like the idea of having to detach nodes from the DOM tree 
 once they have been attached.  The current algorithm is to allow any 
 element inside any other (pretty much) until a problem crops up at which 
 point there's a reorganisation required and that requires detachment 
 (almost always)

Right. I'm not a huge fan of it either, but it works (Safari does it), and 
it doesn't have the (IMHO much worse) problems that the other algorithms 
have.

Note that it actually is compatible with non-tree parsing modes (where the 
parser doesn't construct a DOM but instead marks the start and end of each 
tag, with tags possibly overlapping). The handling of broken content in 
table tags isn't. This, to me, is a much worse situation to be in, and 
there we really have no choice (all browsers are basically interoperable 
on that case).


   The problem here may simply be that appending any node due to 
   opening any non-formatting/non-phrasing open tag when in "in body" 
   should cause any formatting/phrasing elements to be popped off the 
   stack of open elements, and then NOT execute "reconstruct the active 
   formatting elements" (because it'll be executed automatically when 
   opening the next formatting/phrasing element or text node anyway).
  
  Isn't that already the case? You only reconstruct for inline elements 
  and text nodes, as far as I can tell.
 
 No, on both counts.  Firstly, you just append the new node regardless of 
 what's already on the stack; secondly, the algorithm as stated causes 
 the reconstruction to happen for P too.  That may be an error?

I don't understand what you are describing here. Could you explain 
further?


 I'm also wondering about a change of behaviour for the 

Re: [whatwg] [WA1] Formatting elements

2006-07-19 Thread Stewart Brodie
Ian Hickson wrote:

 On Mon, 17 Jul 2006, Stewart Brodie wrote:
  
  I tried dry-running the algorithm for handling mis-nested formatting 
  elements, but I ended up with a tree that looked very odd.  I can't 
  believe that the output I ended up with is what the desired result of 
  the algorithm is, so there is a mistake somewhere: either in my 
  execution of the algorithm or in the algorithm itself.  I took the 
  following fragment of HTML:
  
  <DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu
 
  the result I ended up with was equivalent to:
  
  <DIV> abc <B> def <I> ghi </I> </B> <I> </I> <P> <I> <B> jkl </B> mno
  </I> pqr </P> stu </DIV>
 
 Looks right.  With that as input, my implementation outputs:
 
5: Parse error: missing document type declaration.
38: Parse error: mismatched b element end tag (misnested tags).
47: Parse error: mismatched i element end tag (misnested tags).
57: Parse error: mismatched body element end tag (premature end of file?).
<html><head></head><body><div> abc <b> def <i> ghi 
</i></b><i></i><p><i><b> jkl </b> mno </i> pqr </p> 
stu</div></body></html>

Good - we do end up with exactly the same thing.


  I know it's hard to see when written out textually, but note that for 
  the text node 'jkl', the I and B elements are the wrong way around!
 
 Wrong way with respect to what? They're the right way if you look at the
 end tags: </b> closes first, so it must be innermost! ;-)

I disagree because the 'jkl' is the bit I'm interested in here.  Are you
saying that the desirable tree order is defined in terms only of the closing
tags rather than the open tags?  In the original source, there haven't been
any close tags at all at the time the 'jkl' is parsed.  Ignoring the other
text nodes, the tree is:

<DIV> <B> <I> <P> jkl

(I don't really like the P being there, though, to be honest).  At this
point, jkl has a logical element hierarchy above it in the DOM tree that
matches what was in the original HTML source.  In CSS selector terms,
DIV > B > I.  The subsequent processing of the </B> token causes such a
selector to no longer match (it has now changed to DIV > I > B):

<DIV> <B> <I> </I> </B> <P> <I> <B> jkl

Surely it is reasonable to expect the jkl to retain its ancestry - i.e. be a
child of the cloned I, which is a child of the cloned B, regardless of the
tag closure (of the B) that's about to occur, which would convert it to ...

<DIV> <B> <I> </I> </B> <P> <B> <I> jkl </I> </B> <I> (mno...)

I suppose the root of my concern is how to apply CSS selector matching in a
reasonable looking manner to the DOM tree if the parser has reversed the
parentage of the formatting elements.


 The point is this is error-correction logic, there is no right way 
 (well, until the spec is a standard, I guess).

Indeed I suspect that it may not be possible to define the one true way in
such a way that satisfies all content.


  It all seems to start going wrong for me in step 7 of the algorithm.  
  During the handling of the </B> tag, the clone of I gets created and 
  that's the node that ends up being the childless I node that has the DIV
  as its parent (during step 5 of handling the </I> tag when the I is 
  cloned for a second time to be the child of the P and adopt the original
  children of the P). Firefox generates what I think I would expect and 
  prefer:
  
  <DIV> abc <B> def <I> ghi </I> </B> <P> <B> <I> jkl </I> </B> <I> mno
  </I> pqr </P> stu </DIV>
 
 It's the same number of tags, in this case.
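
The "same number of tags" observation can be checked mechanically. The sketch below assumes Python's standard html.parser and collections modules and counts start tags in hand-written, well-formed serializations of the two trees from this thread. ElementCounter, count_elements and the two constants are illustrative names, not anything defined by the spec or by a browser.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCounter(HTMLParser):
    """Counts elements by tag name, one per start tag seen."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_elements(markup):
    counter = ElementCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# Assumption: re-serializations of the trees quoted above, written by hand.
SPEC_TREE = "<div>abc<b>def<i>ghi</i></b><i></i><p><i><b>jkl</b>mno</i>pqr</p>stu</div>"
FIREFOX_TREE = "<div>abc<b>def<i>ghi</i></b><p><b><i>jkl</i></b><i>mno</i>pqr</p>stu</div>"

spec_counts = count_elements(SPEC_TREE)
firefox_counts = count_elements(FIREFOX_TREE)
print(sum(spec_counts.values()), sum(firefox_counts.values()))
print(spec_counts == firefox_counts)
```

For this particular fragment both trees contain seven elements with identical per-tag counts; only the nesting differs.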
 
 It gets more obviously bad to do what Mozilla does when you consider a 
 case like:
 
<b><p>...<p>...<p>...<p>...<p>...<p>...
 
 ...which is very common. With that exact markup, Safari, IE7, and the spec
 all end up with the exact same DOM tree (from the body down, at least), 
 and with the same number of element nodes (from body down, 8).
 
 Mozilla ends up with 13 nodes (from the body down). That doesn't scale -- 
 there are pages with hundreds of nodes like this.

And it gets much worse if it was all wrapped in a <u> and <em> too. The key
is, as you mention in one of the blog entries linked below, that the
behaviour differs depending on whether or not the content is well-formed in
terms of matching order of start and end tags.


  For comparison, Internet Explorer 6 on the other hand treats the P no
  differently to the B or I and ends up with:  <DIV> abc <B> def <I> ghi
  <P> jkl </P> </I> </B> <I> <P> mno </P> </I> <P> pqr </P> stu </DIV>
 
 Actually IE has only one P element (and only one B and only one I). Look 
 closer and you'll find that the P element isn't closed -- it's just that 
 the mno and pqr text nodes' parentNodes point to the P, while the DIV 
 element's childNodes array actually also mentions those text nodes. Yes, 
 IE generates DOM trees that aren't trees. See also:
 
http://ln.hixie.ch/?start=1037910467&count=1
http://ln.hixie.ch/?start=1138169545&count=1
http://ln.hixie.ch/?start=1137740632&count=1
http://ln.hixie.ch/?start=1026485588&count=1
http://ln.hixie.ch/?start=1137799947&count=1

Yes, I have already read many of your blog entries on this topic.  I got the

[whatwg] [WA1] Formatting elements

2006-07-17 Thread Stewart Brodie

I tried dry-running the algorithm for handling mis-nested formatting
elements, but I ended up with a tree that looked very odd.  I can't believe
that the output I ended up with is what the desired result of the algorithm
is, so there is a mistake somewhere: either in my execution of the algorithm
or in the algorithm itself.  I took the following fragment of HTML:

<DIV> abc <B> def <I> ghi <P> jkl </B> mno </I> pqr </P> stu

The DIV is chosen to provide a suitable context for testing everything else.
B and I were chosen as formatting elements with short names, P was chosen as
it has no special behaviour as an open tag when in the "in body" state (possibly
a mistake?  I'm not certain).  One filled whiteboard later, the result I
ended up with was equivalent to:

<DIV> abc <B> def <I> ghi </I> </B> <I> </I> <P> <I> <B> jkl </B> mno </I>
pqr </P> stu </DIV>

I know it's hard to see when written out textually, but note that for the
text node 'jkl', the I and B elements are the wrong way around!  It all
seems to start going wrong for me in step 7 of the algorithm.  During the
handling of the </B> tag, the clone of I gets created and that's the node
that ends up being the childless I node that has the DIV as its parent
(during step 5 of handling the </I> tag when the I is cloned for a second
time to be the child of the P and adopt the original children of the P).
Firefox generates what I think I would expect and prefer:

<DIV> abc <B> def <I> ghi </I> </B> <P> <B> <I> jkl </I> </B> <I> mno </I>
pqr </P> stu </DIV>

This behaviour would be consistent with disallowing non-phrasing and
non-formatting elements on the stack of open elements when there are
phrasing/formatting elements on the bottom of the stack.  IOW, the P
implicitly closes the B and I elements, leaving them in the list of active
formatting elements, and then NOT executing "reconstruct the active
formatting elements" before appending the new P element, leaving that for
when the 'jkl' text node is encountered.

For comparison, Internet Explorer 6 on the other hand treats the P no
differently to the B or I and ends up with:  <DIV> abc <B> def <I> ghi <P>
jkl </P> </I> </B> <I> <P> mno </P> </I> <P> pqr </P> stu </DIV>

The problem here may simply be that appending any node due to opening any
non-formatting/non-phrasing open tag when in "in body" should cause any
formatting/phrasing elements to be popped off the stack of open elements,
and then NOT execute "reconstruct the active formatting elements" (because
it'll be executed automatically when opening the next formatting/phrasing
element or text node anyway).
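
As a rough illustration of the proposal above, here is a minimal Python sketch of a toy tree builder. It is not the spec's algorithm and not Firefox's implementation; it only models the suggested rule (a non-formatting start tag pops formatting elements off the stack of open elements but leaves them in the list of active formatting elements, and reconstruction is deferred to the next formatting element or text node). The token format, the Node class, and the FORMATTING set are assumptions made for this example.

```python
FORMATTING = {"b", "i"}  # assumption: only the formatting elements used in this thread


class Node:
    def __init__(self, name):
        self.name = name
        self.children = []  # Node objects or plain text strings


def serialize(node):
    inner = "".join(c if isinstance(c, str) else serialize(c) for c in node.children)
    return f"<{node.name}>{inner}</{node.name}>"


def parse(tokens):
    root = Node("root")
    stack = [root]  # stack of open elements
    active = []     # list of active formatting elements

    def reconstruct():
        # Reopen, as a fresh clone, every active formatting element not on the stack.
        for idx, el in enumerate(active):
            if not any(el is open_el for open_el in stack):
                clone = Node(el.name)
                stack[-1].children.append(clone)
                stack.append(clone)
                active[idx] = clone

    def open_element(name):
        el = Node(name)
        stack[-1].children.append(el)
        stack.append(el)
        return el

    for kind, value in tokens:
        if kind == "text":
            reconstruct()
            stack[-1].children.append(value)
        elif kind == "start" and value in FORMATTING:
            reconstruct()
            active.append(open_element(value))
        elif kind == "start":
            # Proposed rule: implicitly close formatting elements, keep them
            # active, and do NOT reconstruct before appending the new element.
            while stack[-1].name in FORMATTING:
                stack.pop()
            open_element(value)
        elif kind == "end":
            if value not in [el.name for el in stack[1:]]:
                continue  # stray end tag: ignore
            while stack.pop().name != value:
                pass
            if value in FORMATTING:
                for idx in range(len(active) - 1, -1, -1):
                    if active[idx].name == value:
                        del active[idx]
                        break
    return root.children


# The test fragment from this message, as (kind, value) tokens.
TOKENS = [
    ("start", "div"), ("text", "abc"), ("start", "b"), ("text", "def"),
    ("start", "i"), ("text", "ghi"), ("start", "p"), ("text", "jkl"),
    ("end", "b"), ("text", "mno"), ("end", "i"), ("text", "pqr"),
    ("end", "p"), ("text", "stu"),
]

print(serialize(parse(TOKENS)[0]))
```

On the test fragment this toy builder yields the same shape as the Firefox result quoted above, with 'jkl' inside I inside B inside P. It says nothing about how the rule would behave on other content, such as tables.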


-- 
Stewart Brodie
Software Engineer
ANT Software Limited