Re: [whatwg] Fwd: Entity parsing
On Sat, 18 Jul 2009, Øistein E. Andersen wrote:
> Non-semicolon-terminated entities that were conforming in HTML4, like pi and mdash when they are not followed by a letter or digit (roughly speaking), are currently expanded in Safari and Firefox, and requiring this to change would be a regression affecting existing pages.
>
> > As far as I can tell HTML5 more or less matches what legacy pages need,
>
> You keep repeating this, and also that much work has been done to get entity parsing right and that you really do not want to change it.
>
> It seems to me that you have tried to follow IE's behaviour closely, which is not completely unreasonable. I have not seen evidence of any analysis of legacy pages supporting this decision, though; on the contrary, more or less anecdotal evidence sent to the mailing list(s) seems to suggest that certain modifications might make the algorithm work better for legacy pages. Replicating IE may well be good enough and seems like a reasonably safe option, but HTML5 does not completely follow IE in other areas, and I do not quite see why entity parsing should be treated differently.

It's certainly the case that we can find individual pages that depend on particular behaviours to support any argument.

I do not want to change the current parsing spec unless we have _very_ good reasons to do so, because there are now multiple implementations and tests, and any change can introduce bugs and incompatibilities.

If you have strong data showing that a particular change to the spec would be highly beneficial, then it's something I'd be happy to consider. But I'm not willing to make changes just to change the spec from being compatible with IE to being compatible with WebKit, or some such. I need data showing that the change is needed.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

Re: [whatwg] Fwd: Entity parsing
On 5 Jun 2009, at 00:49, Ian Hickson wrote:
> Could you give an example of what you mean? I'm having trouble following your description
>
> On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
> > Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named character reference from the IE4 (resp. HTML4, HTML5) set,

IE4 includes eacute, iuml; HTML4 includes in addition pi, oelig; and HTML5 includes in addition SHcy, rcaron.

> > and let . (full stop) represent any character other than semicolon, and ^ (circumflex) any character which is (roughly) not an ASCII letter or digit (i.e., [^a-zA-Z0-9]).
> >
> > Not completely unreasonable sets of character references to expand (outside of attribute values) include:
> >
> > 1) IE4^
        e.g., caf&eacute (café)
> > 2) IE4.
        e.g., na&iumlve (naïve)
> > 3) HTML4^
        e.g., 2&pi (2π)
> > 4) IE4. HTML4^
        e.g., na&iumlve (naïve), 2&pi (2π)
> > 5) HTML4.
        e.g., hors d'&oeliguvre (hors d'œuvre)
> > 6) IE4. HTML5^
        e.g., na&iumlve (naïve), &SHcy(A/K) [Ш(A/K)]
> > 7) HTML4. HTML5^
        e.g., hors d'&oeliguvre (hors d'œuvre), &SHcy(A/K) [Ш(A/K)]
> > 8) HTML5.
        e.g., Dvo&rcaron&aacutek (Dvořák)
> >
> > [...]
> >
> > Currently, Opera follows 1),
        i.e., expands caf&eacute, but not na&iumlve or 2&pi
> > IE 2),
        i.e., expands caf&eacute and na&iumlve, but not 2&pi
> > and Safari and Firefox 3).
        i.e., expands caf&eacute and 2&pi, but not na&iumlve
> >
> > My main concern is that HTML4^ is actually legitimate in HTML4 and works in both Safari and Firefox today, and that HTML5 should not change the rendering of valid HTML4 pages unless there is a good reason to do so.

Non-semicolon-terminated entities that were conforming in HTML4, like pi and mdash when they are not followed by a letter or digit (roughly speaking), are currently expanded in Safari and Firefox, and requiring this to change would be a regression affecting existing pages.

> As far as I can tell HTML5 more or less matches what legacy pages need,

You keep repeating this, and also that much work has been done to get entity parsing right and that you really do not want to change it.

It seems to me that you have tried to follow IE's behaviour closely, which is not completely unreasonable. I have not seen evidence of any analysis of legacy pages supporting this decision, though; on the contrary, more or less anecdotal evidence sent to the mailing list(s) seems to suggest that certain modifications might make the algorithm work better for legacy pages. Replicating IE may well be good enough and seems like a reasonably safe option, but HTML5 does not completely follow IE in other areas, and I do not quite see why entity parsing should be treated differently.

--
Øistein E. Andersen

Re: [whatwg] Fwd: Entity parsing
On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
> When a named character reference is followed by a semicolon, it clearly has to be expanded, but how to handle non-semicolon-terminated character references is less obvious.
>
> Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named character reference from the IE4 (resp. HTML4, HTML5) set, and let . (full stop) represent any character other than semicolon, and ^ (circumflex) any character which is (roughly) not an ASCII letter or digit (i.e., [^a-zA-Z0-9]).
>
> Not completely unreasonable sets of character references to expand (outside of attribute values) include:
>
> 1) IE4^
> 2) IE4.
> 3) HTML4^
> 4) IE4. HTML4^
> 5) HTML4.
> 6) IE4. HTML5^
> 7) HTML4. HTML5^
> 8) HTML5.
>
> (The set of character references to be expanded in attribute values could be obtained by replacing . by ^ above.)
>
> Currently, Opera follows 1), IE 2), and Safari and Firefox 3).
>
> My main concern is that HTML4^ is actually legitimate in HTML4 and works in both Safari and Firefox today, and that HTML5 should not change the rendering of valid HTML4 pages unless there is a good reason to do so.

Could you give an example of what you mean? I'm having trouble following your description above. As far as I can tell HTML5 more or less matches what legacy pages need, but if there are specific entities that should be parsed in a different way than HTML5 says they should, I'm happy to fix this.

--
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

[whatwg] Fwd: Entity parsing
On 23 May 2008, at 03:50, Ian Hickson wrote:
> On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
> > 1) Is it useful to handle unterminated entities followed by an alphanumerical character like IE does? [...]
> >
> > 2) HTML 4.01 allows the semicolon to be omitted in certain cases. [...] Firefox and Safari both support this, and it would seem meaningless to change the way conforming documents are parsed [...]
> >
> > 3) Will new entities ever be needed? If yes, can new entities adopt existing conformance criteria and parsing rules? [...]
>
> New entities have since been added, and the rules for parsing entities (sorry, named character references) have been changed a bit. However, I am reluctant to change this from what we have now, since what we have now works well. How strongly do you feel about this?

I think I may have expressed my concern in rather too abstract terms previously.

The named character references currently present in HTML5 can be subdivided (roughly) into the following subsets:

  IE4
  HTML4
  HTML5

Approximately 100 named character references are included in the IE4 set, 200 in the HTML4 set, and 2,000 in the HTML5 set.

When a named character reference is followed by a semicolon, it clearly has to be expanded, but how to handle non-semicolon-terminated character references is less obvious.

Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named character reference from the IE4 (resp. HTML4, HTML5) set, and let . (full stop) represent any character other than semicolon, and ^ (circumflex) any character which is (roughly) not an ASCII letter or digit (i.e., [^a-zA-Z0-9]).

Not completely unreasonable sets of character references to expand (outside of attribute values) include:

1) IE4^
2) IE4.
3) HTML4^
4) IE4. HTML4^
5) HTML4.
6) IE4. HTML5^
7) HTML4. HTML5^
8) HTML5.

(The set of character references to be expanded in attribute values could be obtained by replacing . by ^ above.)

Currently, Opera follows 1), IE 2), and Safari and Firefox 3).

My main concern is that HTML4^ is actually legitimate in HTML4 and works in both Safari and Firefox today, and that HTML5 should not change the rendering of valid HTML4 pages unless there is a good reason to do so.

4) does not break any valid HTML4 pages and does also not cause any character references to be expanded which are not already expanded in either IE or both Safari and Firefox, so this should be possible to implement.

[Options 5), 6) and 8) can, to a greater or lesser extent, be specified more easily, but might be too controversial. There are pages relying on, e.g., `10&ndash20' to work, though, so handling character references in a more liberal way would actually have some benefits; only invalid mark-up would be affected in any case; and the negative effects are to a certain extent compounded by the more conservative treatment in attribute values. That being said, I do of course realise that it will be seen as safer not to expand too many character references as long as the actual impact remains difficult to quantify.]

--
Øistein E. Andersen
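
The eight options above can be modelled as pairs of sets, which makes it easy to check the claim about option 4). What follows is only a rough sketch under stated assumptions, not any browser's actual algorithm: the three sets are tiny samples built from the names mentioned in this thread (the real sets hold roughly 100, 200 and 2,000 names), the function name expands and the OPTIONS table are hypothetical, and the reference name is assumed to have been isolated already, so name matching itself is out of scope.

```python
import string

# Tiny illustrative samples (assumption); the real sets are far larger.
IE4 = {"eacute", "iuml"}
HTML4 = IE4 | {"pi", "oelig"}
HTML5 = HTML4 | {"SHcy", "rcaron"}

# "^" in the thread's notation: roughly anything that is not an ASCII
# letter or digit. ALNUM holds the characters that "^" excludes.
ALNUM = set(string.ascii_letters + string.digits)

# option -> (names expanded before any non-semicolon character ".",
#            names expanded only before a non-alphanumeric character "^")
OPTIONS = {
    1: (set(), IE4),
    2: (IE4, IE4),
    3: (set(), HTML4),
    4: (IE4, HTML4),
    5: (HTML4, HTML4),
    6: (IE4, HTML5),
    7: (HTML4, HTML5),
    8: (HTML5, HTML5),
}


def expands(option, name, next_char):
    """Return True if the unterminated reference `name`, followed by
    `next_char` (not a semicolon; "" for end of input), would be
    expanded in body text under the given option."""
    any_follow, strict_follow = OPTIONS[option]
    if name in any_follow:
        return True
    return name in strict_follow and next_char not in ALNUM


# caf&eacute (followed by a space), na&iumlve, 2&pi (followed by a space)
for option, browser in [(1, "Opera"), (2, "IE"), (3, "Safari/Firefox")]:
    print(browser,
          expands(option, "eacute", " "),
          expands(option, "iuml", "v"),
          expands(option, "pi", " "))
# Opera True False False
# IE True True False
# Safari/Firefox True False True
```

In this model, option 4) expands a reference exactly when option 2) (IE) or option 3) (Safari and Firefox) does, which is the property described above. Attribute values would be handled by using the second set of each pair only.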