On 23 May 2008, at 03:50, Ian Hickson wrote:

On Thu, 28 Jun 2007, Øistein E. Andersen wrote:

1) Is it useful to handle unterminated entities followed by an
alphanumerical character like IE does? [...]

2) HTML 4.01 allows the semicolon to be omitted in certain cases.
[...] Firefox and Safari both support this, and it would
seem meaningless to change the way conforming documents are parsed
[...]

3) Will new entities ever be needed? If yes, can new entities adopt
existing conformance criteria and parsing rules?

[...]

New entities have since been added, and the rules for parsing entities
(sorry, "named character references") have been changed a bit. However, I am reluctant to change this from what we have now, since what we have now
works well. How strongly do you feel about this?

I think I may have expressed my concern in rather too abstract terms previously.

The named character references currently present in HTML5 can be subdivided (roughly) into the following subsets:

        IE4 < HTML4 < HTML5

Approximately 100 named character references are included in the IE4 set, 200 in the HTML4 set, and 2,000 in the HTML5 set.

When a named character reference is followed by a semicolon, it clearly has to be expanded, but how to handle non-semicolon-terminated character references is less obvious.

Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named character reference from the IE4 (resp. HTML4, HTML5) set, let . (full stop) represent any character other than a semicolon, and let ^ (circumflex) represent any character that is (roughly) not an ASCII letter or digit (i.e., [^a-zA-Z0-9]). Not completely unreasonable sets of character references to expand (outside of attribute values) include:

        1) &IE4^
        2) &IE4.
        3) &HTML4^
        4) &IE4. &HTML4^
        5) &HTML4.
        6) &IE4. &HTML5^
        7) &HTML4. &HTML5^
        8) &HTML5.

(The set of character references to be expanded in attribute values could be obtained by replacing . by ^ above.)
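
To make the eight options concrete, here is a rough Python sketch of how each one could be expressed as a list of (set, follower) rules. The three tables are tiny stand-ins whose membership is assumed for illustration only, the names expand, OPTIONS and _match are made up for this example, and end of input is assumed to count as a non-alphanumeric follower. It is not meant to describe how any browser actually implements this.

```python
import re

# Toy tables standing in for the three nested sets (IE4 < HTML4 < HTML5).
# The real sets are far larger; the membership shown here is assumed.
IE4 = {"amp": "&", "copy": "\u00a9"}
HTML4 = {**IE4, "ndash": "\u2013"}
HTML5 = {**HTML4, "check": "\u2713"}

# Each option is a list of (table, follower) rules, where follower "." means
# "any character other than semicolon" and "^" means "not an ASCII letter or digit".
OPTIONS = {
    1: [(IE4, "^")],
    2: [(IE4, ".")],
    3: [(HTML4, "^")],
    4: [(IE4, "."), (HTML4, "^")],
    5: [(HTML4, ".")],
    6: [(IE4, "."), (HTML5, "^")],
    7: [(HTML4, "."), (HTML5, "^")],
    8: [(HTML5, ".")],
}

_ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")


def _match(text, start, rules):
    """Return (end_index, replacement) for a reference starting at `start`, or None."""
    m = _ALNUM_RUN.match(text, start)
    if not m:
        return None
    run = m.group()
    # A semicolon-terminated reference from the full set is always expanded.
    if text[m.end():m.end() + 1] == ";" and run in HTML5:
        return m.end() + 1, HTML5[run]
    best = None
    for table, follower in rules:
        for name, char in table.items():
            if not run.startswith(name):
                continue
            # Under "^" the name must be the whole alphanumeric run, so that the
            # next character (or end of input) is not a letter or digit.
            if follower == "^" and name != run:
                continue
            if best is None or len(name) > len(best[0]):
                best = (name, char)
    if best is None:
        return None
    return start + len(best[0]), best[1]


def expand(text, option, in_attribute=False):
    """Expand named character references in `text` according to option 1..8."""
    rules = OPTIONS[option]
    if in_attribute:
        # In attribute values, every "." rule is replaced by a "^" rule.
        rules = [(table, "^") for table, _ in rules]
    out = []
    i = 0
    while i < len(text):
        if text[i] == "&":
            found = _match(text, i + 1, rules)
            if found:
                i, char = found
                out.append(char)
                continue
        out.append(text[i])
        i += 1
    return "".join(out)
```

With these toy tables, expand("10&ndash20", 3) leaves the text unchanged whereas expand("10&ndash20", 5) expands the reference, and expand("a &ndash b", 4) expands it while expand("a &ndash b", 1) does not.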

Currently, Opera follows 1), IE 2), and Safari and Firefox 3).

My main concern is that &HTML4^ is actually legitimate in HTML4 and works in both Safari and Firefox today, and that HTML5 should not change the rendering of valid HTML4 pages unless there is a good reason to do so.

4) does not break any valid HTML4 pages, nor does it cause any character references to be expanded that are not already expanded in either IE or both Safari and Firefox, so it should be possible to implement.

[Options 5), 6) and 8) can, to a greater or lesser extent, be specified more easily, but might be too controversial. There are pages that rely on, e.g., `10&ndash20' working, though, so handling character references in a more liberal way would actually have some benefits; only invalid mark-up would be affected in any case; and the negative effects would to a certain extent be offset by the more conservative treatment in attribute values. That being said, I do of course realise that it will be seen as safer not to expand too many character references as long as the actual impact remains difficult to quantify.]

--
Øistein E. Andersen
