On Aug 29, 2005, at 22:29, Henri Sivonen wrote:

What kind of approach to tag inference can HTML5 be expected to take? For an SGML validator that is parsing HTML 4, the set of possible element names is finite. However, a browser needs to deal with an infinite set of potential element names. Therefore, it makes a difference whether end tag inference is based on what is allowed as a child of an element or on what elements are not allowed.

Example:
<p><foo>
Is 'foo' an element that is not allowed as a child of 'p' and, therefore, implicitly closes the 'p'? Or is 'foo' not on the list of elements that close 'p' and, therefore, does not implicitly close it? Which way are the inference rules going to be defined?

I think the latter approach should be chosen, because otherwise it would be impossible to extend HTML in the future with an element that can occur as a child of 'p'.
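
To make the difference concrete, here is a minimal sketch of the two policies applied to the <p><foo> example. The two sets are deliberately tiny, hypothetical tables chosen for illustration only, and the function names are made up for this sketch.

```python
# Hypothetical, deliberately incomplete tables for illustration only.
ALLOWED_CHILDREN_OF_P = {"em", "strong", "a", "span"}  # "what is allowed" policy
CLOSES_P = {"p", "div", "ul", "ol", "table"}           # "what closes" policy


def closes_p_by_allowed_children(name):
    """Close 'p' whenever the new element is not a known child of 'p'."""
    return name not in ALLOWED_CHILDREN_OF_P


def closes_p_by_closer_list(name):
    """Close 'p' only when the new element is on the explicit closer list."""
    return name in CLOSES_P


# An unknown element closes 'p' under the first policy only.
print(closes_p_by_allowed_children("foo"))  # True
print(closes_p_by_closer_list("foo"))       # False
```

Under the second policy an element introduced later can still nest inside 'p' without any change to the table.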

Therefore:

End tag inference

I made the following list based on the HTML 4.01 Transitional DTD. Before the colon on each line there is an element whose end tag is optional. After the colon, there is the list of elements whose start tag can cause the end tag to be inferred. How should this list be augmented for HTML5? E.g., should a start tag for <section> close a paragraph?

p: p, h1, h2, h3, h4, h5, h6, ol, ul, pre, dl, div, center, noscript, noframes, blockquote, form, isindex, hr, table, fieldset, address
dt: dt, dd
dd: dt, dd
li: li
thead: tfoot, tbody
tfoot: tbody
tbody: tbody
colgroup: colgroup, thead, tfoot, tbody, tr
tr: tr, tfoot, tbody
td: td, th, tr, tfoot, tbody
th: td, th, tr, tfoot, tbody
html:
body:
head: ANY BUT script, style, meta, link, object, title, isindex, base
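
The list above could be encoded as a lookup table roughly as follows. This is only a sketch: the names END_TAG_CLOSERS, HEAD_CONTENT and start_tag_closes are made up here, and treating 'head' as a special case is my reading of the "ANY BUT" line.

```python
# Elements that may stay inside 'head' (the "ANY BUT" exceptions above).
HEAD_CONTENT = frozenset(
    "script style meta link object title isindex base".split()
)

# Element with optional end tag -> start tags that imply that end tag.
_RAW = {
    "p": "p h1 h2 h3 h4 h5 h6 ol ul pre dl div center noscript noframes "
         "blockquote form isindex hr table fieldset address",
    "dt": "dt dd",
    "dd": "dt dd",
    "li": "li",
    "thead": "tfoot tbody",
    "tfoot": "tbody",
    "tbody": "tbody",
    "colgroup": "colgroup thead tfoot tbody tr",
    "tr": "tr tfoot tbody",
    "td": "td th tr tfoot tbody",
    "th": "td th tr tfoot tbody",
    "html": "",
    "body": "",
}
END_TAG_CLOSERS = {name: frozenset(v.split()) for name, v in _RAW.items()}


def start_tag_closes(open_element, new_start):
    """Return True if a start tag for new_start implies the end of open_element."""
    if open_element == "head":
        # 'head' is the one inverted entry: anything but head content closes it.
        return new_start not in HEAD_CONTENT
    # Elements without an entry (required end tag) are never closed implicitly.
    return new_start in END_TAG_CLOSERS.get(open_element, frozenset())
```

With this table, start_tag_closes("p", "foo") is false, so an unknown element stays a child of the paragraph.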


Start tag inference

* If the top of the stack is 'table' and the element start is 'tr', infer 'tbody'.
* If the stack is empty and the element start is anything but 'html', infer 'html'.
* If the top of the stack is 'html', the element start is not 'head' and 'head' has not been seen yet, infer 'head'.
* If the top of the stack is 'html', the element start is not 'body' and 'head' has been seen, infer 'body'.
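
A minimal sketch of these four rules as a single function. The name infer_start_tag and the head_seen flag are assumptions of this sketch; the stack is a plain list whose last item is the top.

```python
def infer_start_tag(stack, start, head_seen):
    """Return the element to open before `start`, or None if nothing is implied.

    stack     -- list of open element names, top of the stack last
    start     -- name of the element whose start tag was just seen
    head_seen -- True once a 'head' element has been opened
    """
    if not stack:
        # Empty stack: everything except 'html' itself implies 'html'.
        return "html" if start != "html" else None
    top = stack[-1]
    if top == "table" and start == "tr":
        return "tbody"
    if top == "html" and not head_seen and start != "head":
        return "head"
    if top == "html" and head_seen and start != "body":
        return "body"
    return None
```

The four conditions are mutually exclusive, so the order in which they are checked does not change the result.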

Should (in memory of HTML 4.01 Transitional) character data imply the start of body?

As far as I can tell, there are four kinds of inference needed when parsing *conforming* documents (i.e. no second stack for residual style):

1) Element end causes the end of the element that is on the top of the stack.

If the top of the stack does not match the element end event, see if the top of the stack is on the list of elements whose end tag is optional. Pop and report the end of the popped element if yes. Err if not. Repeat.

2) End of the data stream causes the end of the element that is on the top of the stack.

See if the top of the stack is on the list of elements whose end tag is optional. Pop and report the end of the popped element if yes. Err if not. Repeat.
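
Cases 1) and 2) could be sketched like this. The names OPTIONAL_END, ParseError, handle_end_tag and handle_end_of_stream are made up for this sketch, and raising an error for an end tag that matches nothing on the stack is my assumption; the description above does not cover that case.

```python
# Elements whose end tag is optional (left-hand side of the list above).
OPTIONAL_END = frozenset(
    "p dt dd li thead tfoot tbody colgroup tr td th html body head".split()
)


class ParseError(Exception):
    """Raised where the description above says 'Err'."""


def handle_end_tag(stack, name, report):
    """Case 1: pop optional-end elements until `name` is on top, then pop it."""
    while stack and stack[-1] != name:
        top = stack[-1]
        if top not in OPTIONAL_END:
            raise ParseError("end tag for %r required before </%s>" % (top, name))
        stack.pop()
        report(top)
    if not stack:
        # Assumption: an end tag that matches no open element is an error.
        raise ParseError("stray end tag </%s>" % name)
    stack.pop()
    report(name)


def handle_end_of_stream(stack, report):
    """Case 2: at the end of the data, every open element must be optional-end."""
    while stack:
        top = stack[-1]
        if top not in OPTIONAL_END:
            raise ParseError("end tag for %r required before end of stream" % top)
        stack.pop()
        report(top)
```

For example, with the stack html, body, ul, li, p and an end tag for 'ul', the sketch reports the ends of 'p', 'li' and 'ul' in that order.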

3) Element start causes the end of the element that is on the top of the stack.
4) Element start causes another element start before itself.

a) Perform end tag inference repeatedly according to the lists given above until no inference can be made.
b) Perform the start tag inference once.
Repeat from a) until additional inference cannot be performed. Then let the original element start go through.
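
Steps a) and b) could be combined roughly as follows. This is a self-contained sketch, so it repeats the tables from the lists above in compact form; all function names, the event tuples and the state dictionary are made up here. It pushes every element on the stack and does not treat empty elements such as 'meta' specially.

```python
HEAD_CONTENT = frozenset("script style meta link object title isindex base".split())
_RAW = {
    "p": "p h1 h2 h3 h4 h5 h6 ol ul pre dl div center noscript noframes "
         "blockquote form isindex hr table fieldset address",
    "dt": "dt dd", "dd": "dt dd", "li": "li",
    "thead": "tfoot tbody", "tfoot": "tbody", "tbody": "tbody",
    "colgroup": "colgroup thead tfoot tbody tr", "tr": "tr tfoot tbody",
    "td": "td th tr tfoot tbody", "th": "td th tr tfoot tbody",
}
CLOSERS = {name: frozenset(v.split()) for name, v in _RAW.items()}


def closes(top, start):
    """End tag inference: does a start tag for `start` end the element `top`?"""
    if top == "head":
        return start not in HEAD_CONTENT
    return start in CLOSERS.get(top, frozenset())


def implied_start(stack, start, head_seen):
    """Start tag inference: element to open before `start`, or None."""
    if not stack:
        return "html" if start != "html" else None
    top = stack[-1]
    if top == "table" and start == "tr":
        return "tbody"
    if top == "html" and not head_seen and start != "head":
        return "head"
    if top == "html" and head_seen and start != "body":
        return "body"
    return None


def handle_start_tag(stack, start, state, events):
    """Cases 3 and 4: alternate steps a) and b) until neither applies."""
    progressed = True
    while progressed:
        progressed = False
        # a) end tag inference, repeated until no inference can be made
        while stack and closes(stack[-1], start):
            events.append(("end", stack.pop()))
            progressed = True
        # b) start tag inference, performed once per round
        implied = implied_start(stack, start, state["head_seen"])
        if implied is not None:
            stack.append(implied)
            events.append(("start", implied))
            state["head_seen"] = state["head_seen"] or implied == "head"
            progressed = True
    # Let the original element start go through.
    stack.append(start)
    events.append(("start", start))
    state["head_seen"] = state["head_seen"] or start == "head"
```

Starting from an empty stack with state = {"head_seen": False}, a start tag for 'title' first implies 'html' and then 'head' before 'title' itself is opened.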

Is this correct for *conforming* documents (i.e. without residual style, etc.)?

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/
