Re: [whatwg] Fwd: Entity parsing

2009-07-30 Thread Ian Hickson
On Sat, 18 Jul 2009, Øistein E. Andersen wrote:
 
> Non-semicolon-terminated entities that were conforming in HTML4, like
> &pi and &mdash when they are not followed by a letter or digit (roughly
> speaking), are currently expanded in Safari and Firefox, and requiring
> this to change would be a regression affecting existing pages.
>
> > As far as I can tell HTML5 more or less matches what legacy pages
> > need,
>
> You keep repeating this, and also that much work has been done to get
> entity parsing right and that you really do not want to change it.  It
> seems to me that you have tried to follow IE's behaviour closely, which
> is not completely unreasonable.  I have not seen evidence of any
> analysis of legacy pages supporting this decision, though; on the
> contrary, more or less anecdotal evidence sent to the mailing list(s)
> seems to suggest that certain modifications might make the algorithm
> work better for legacy pages. Replicating IE may well be good enough and
> seems like a reasonably safe option, but HTML5 does not completely
> follow IE in other areas, and I do not quite see why entity parsing
> should be treated differently.

It's certainly the case that we can find individual pages that depend on 
particular behaviours to support any argument.

I do not want to change the current parsing spec unless we have _very_ 
good reasons to do so, because there are now multiple implementations
and tests, and any change can introduce bugs and incompatibilities.

If you have strong data showing that a particular change to the spec would 
be highly beneficial, then it's something I'd be happy to consider. But 
I'm not willing to make changes just to change the spec from being 
compatible with IE to being compatible with WebKit, or some such. I need 
data showing that the change is needed.

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

Re: [whatwg] Fwd: Entity parsing

2009-07-17 Thread Øistein E. Andersen

On 5 Jun 2009, at 00:49, Ian Hickson wrote:

> Could you give an example of what you mean? I'm having trouble
> following your description
>
> On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
>
> > Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named
> > character reference from the IE4 (resp. HTML4, HTML5) set,

IE4 includes &eacute and &iuml,
HTML4 includes in addition &pi and &oelig, and
HTML5 includes in addition &SHcy and &rcaron.

> > and let . (full stop) represent any character other than semicolon,
> > and ^ (circumflex) any character which is (roughly) not an ASCII
> > letter or digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable
> > sets of character references to expand (outside of attribute values)
> > include:

> > 1) IE4^

  e.g., caf&eacute (café)

> > 2) IE4.

  e.g., na&iumlve (naïve)

> > 3) HTML4^

  e.g., 2&pi (2π)

> > 4) IE4. HTML4^

  e.g., na&iumlve (naïve), 2&pi (2π)

> > 5) HTML4.

  e.g., hors d'&oeliguvre (hors d'œuvre)

> > 6) IE4. HTML5^

  e.g., na&iumlve (naïve), &SHcy(A/K) [Ш(A/K)]

> > 7) HTML4. HTML5^

  e.g., hors d'&oeliguvre (hors d'œuvre), &SHcy(A/K) [Ш(A/K)]

> > 8) HTML5.

  e.g., Dvo&rcaron&aacutek (Dvořák)


> > [...]
> > Currently, Opera follows 1),

  i.e., expands caf&eacute, but not na&iumlve or 2&pi

> > IE 2),

  i.e., expands caf&eacute and na&iumlve, but not 2&pi

> > and Safari and Firefox 3).

  i.e., expand caf&eacute and 2&pi, but not na&iumlve
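
The three behaviours just listed can be sketched as a small Python model. This is only an illustration under stated assumptions, not any browser's actual algorithm: the three dictionaries hold just the names used as examples in this message, the helper names (`match`, `expand`, `OPERA`, `IE`, `SAFARI_FIREFOX`) are hypothetical, and end of input is assumed to count as a non-alphanumeric follower.

```python
import string

# Illustrative subsets only, taken from the examples in this thread; the real
# sets hold roughly 100 (IE4), 200 (HTML4) and 2,000 (HTML5) names.
IE4 = {"eacute": "é", "iuml": "ï"}
HTML4 = {**IE4, "pi": "π", "oelig": "œ"}
HTML5 = {**HTML4, "SHcy": "Ш", "rcaron": "ř"}
SETS = {"IE4": IE4, "HTML4": HTML4, "HTML5": HTML5}
ALNUM = set(string.ascii_letters + string.digits)

# A policy is a list of (set name, follower) rules: "." accepts any following
# character other than a semicolon, "^" only one that is not in [a-zA-Z0-9].
OPERA = [("IE4", "^")]             # option 1
IE = [("IE4", ".")]                # option 2
SAFARI_FIREFOX = [("HTML4", "^")]  # option 3


def match(text, start, policy):
    """Return (replacement, end index) for a reference name at `start`, or None."""
    for name in sorted(HTML5, key=len, reverse=True):  # longest name first
        if not text.startswith(name, start):
            continue
        end = start + len(name)
        follower = text[end:end + 1]  # "" at end of input (assumed to count as ^)
        if follower == ";":
            return HTML5[name], end + 1  # semicolon-terminated: always expand
        for set_name, rule in policy:
            if name in SETS[set_name] and (rule == "." or follower not in ALNUM):
                return HTML5[name], end
    return None


def expand(text, policy):
    """Expand named character references in `text` (outside attribute values)."""
    out, i = [], 0
    while i < len(text):
        hit = match(text, i + 1, policy) if text[i] == "&" else None
        if hit:
            out.append(hit[0])
            i = hit[1]
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


for label, policy in [("Opera", OPERA), ("IE", IE), ("Safari/Firefox", SAFARI_FIREFOX)]:
    print(label, [expand(s, policy) for s in ("caf&eacute", "na&iumlve", "2&pi")])
```

Options 4) to 8) can be expressed in the same way, e.g. `[("IE4", "."), ("HTML4", "^")]` for option 4).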



> > My main concern is that HTML4^ is actually legitimate in HTML4 and
> > works in both Safari and Firefox today, and that HTML5 should not
> > change the rendering of valid HTML4 pages unless there is a good
> > reason to do so.


Non-semicolon-terminated entities that were conforming in HTML4, like
&pi and &mdash when they are not followed by a letter or digit
(roughly speaking), are currently expanded in Safari and Firefox, and
requiring this to change would be a regression affecting existing pages.

> As far as I can tell HTML5 more or less matches what legacy pages
> need,


You keep repeating this, and also that much work has been done to get  
entity parsing right and that you really do not want to change it.  It  
seems to me that you have tried to follow IE's behaviour closely,  
which is not completely unreasonable.  I have not seen evidence of any  
analysis of legacy pages supporting this decision, though; on the  
contrary, more or less anecdotal evidence sent to the mailing list(s)  
seems to suggest that certain modifications might make the algorithm  
work better for legacy pages. Replicating IE may well be good enough  
and seems like a reasonably safe option, but HTML5 does not completely  
follow IE in other areas, and I do not quite see why entity parsing  
should be treated differently.


--
Øistein E. Andersen

Re: [whatwg] Fwd: Entity parsing

2009-06-04 Thread Ian Hickson
On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
 
> When a named character reference is followed by a semicolon, it clearly
> has to be expanded, but how to handle non-semicolon-terminated character
> references is less obvious.
>
> Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named
> character reference from the IE4 (resp. HTML4, HTML5) set, and let .
> (full stop) represent any character other than semicolon, and ^
> (circumflex) any character which is (roughly) not an ASCII letter or
> digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of
> character references to expand (outside of attribute values) include:
>
>   1) IE4^
>   2) IE4.
>   3) HTML4^
>   4) IE4. HTML4^
>   5) HTML4.
>   6) IE4. HTML5^
>   7) HTML4. HTML5^
>   8) HTML5.
>
> (The set of character references to be expanded in attribute values
> could be obtained by replacing . by ^ above.)
>
> Currently, Opera follows 1), IE 2), and Safari and Firefox 3).
>
> My main concern is that HTML4^ is actually legitimate in HTML4 and
> works in both Safari and Firefox today, and that HTML5 should not change
> the rendering of valid HTML4 pages unless there is a good reason to do
> so.

Could you give an example of what you mean? I'm having trouble following 
your description above.

As far as I can tell HTML5 more or less matches what legacy pages need, 
but if there are specific entities that should be parsed in a different 
way than HTML5 says they should, I'm happy to fix this.

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

[whatwg] Fwd: Entity parsing

2009-04-24 Thread Øistein E. Andersen

On 23 May 2008, at 03:50, Ian Hickson wrote:

> On Thu, 28 Jun 2007, Øistein E. Andersen wrote:
>
> > 1) Is it useful to handle unterminated entities followed by an
> > alphanumerical character like IE does? [...]
> >
> > 2) HTML 4.01 allows the semicolon to be omitted in certain cases.
> > [...] Firefox and Safari both support this, and it would
> > seem meaningless to change the way conforming documents are parsed
> > [...]
> >
> > 3) Will new entities ever be needed? If yes, can new entities adopt
> > existing conformance criteria and parsing rules?
> >
> > [...]
>
> New entities have since been added, and the rules for parsing entities
> (sorry, named character references) have been changed a bit. However, I
> am reluctant to change this from what we have now, since what we have
> now works well. How strongly do you feel about this?


I think I may have expressed my concern in rather too abstract terms  
previously.


The named character references currently present in HTML5 can be  
subdivided (roughly) into the following subsets:


IE4 ⊂ HTML4 ⊂ HTML5

Approximately 100 named character references are included in the IE4  
set, 200 in the HTML4 set, and 2,000 in the HTML5 set.


When a named character reference is followed by a semicolon, it  
clearly has to be expanded, but how to handle non-semicolon-terminated  
character references is less obvious.


Let IE4 (resp. HTML4, HTML5) be a non-semicolon-terminated named  
character reference from the IE4 (resp. HTML4, HTML5) set, and let .  
(full stop) represent any character other than semicolon, and ^  
(circumflex) any character which is (roughly) not an ASCII letter or  
digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of  
character references to expand (outside of attribute values) include:


1) IE4^
2) IE4.
3) HTML4^
4) IE4. HTML4^
5) HTML4.
6) IE4. HTML5^
7) HTML4. HTML5^
8) HTML5.

(The set of character references to be expanded in attribute values  
could be obtained by replacing . by ^ above.)
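
As a rough sketch of the notation, each option can be written down as a list of (set, follower) rules, and the attribute-value variant is then obtained by replacing every `.` with `^`. The names `OPTIONS` and `attribute_variant` are hypothetical and only serve this illustration.

```python
# Each option is a list of (set name, follower) rules. "." allows any following
# character other than a semicolon, "^" only one that is not in [a-zA-Z0-9].
OPTIONS = {
    1: [("IE4", "^")],
    2: [("IE4", ".")],
    3: [("HTML4", "^")],
    4: [("IE4", "."), ("HTML4", "^")],
    5: [("HTML4", ".")],
    6: [("IE4", "."), ("HTML5", "^")],
    7: [("HTML4", "."), ("HTML5", "^")],
    8: [("HTML5", ".")],
}


def attribute_variant(policy):
    """Replace every "." follower by "^", as suggested for attribute values."""
    return [(set_name, "^") for set_name, _ in policy]


for number, policy in OPTIONS.items():
    print(number, policy, "->", attribute_variant(policy))
```

Under this reading, the attribute-value forms of options 1) and 2) coincide, as do those of 3) and 5).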


Currently, Opera follows 1), IE 2), and Safari and Firefox 3).

My main concern is that HTML4^ is actually legitimate in HTML4 and  
works in both Safari and Firefox today, and that HTML5 should not  
change the rendering of valid HTML4 pages unless there is a good  
reason to do so.


4) does not break any valid HTML4 pages and does also not cause any  
character references to be expanded which are not already expanded in  
either IE or both Safari and Firefox, so this should be possible to  
implement.
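
The claim about option 4) can be checked with a small set computation. This is only a sketch: it assumes the three sets are nested (IE4 within HTML4 within HTML5), classifies each unterminated reference by the smallest set containing its name and by whether the following character is alphanumeric, and uses hypothetical names (`RANK`, `expanded_cases`).

```python
from itertools import product

# Assumed nesting: every IE4 name is also an HTML4 name, and so on.
RANK = {"IE4": 0, "HTML4": 1, "HTML5": 2}


def expanded_cases(policy):
    """Return the (smallest set containing the name, follower class) pairs
    that a policy expands; follower class is "alnum" or "other"."""
    cases = set()
    for origin, follower in product(RANK, ("alnum", "other")):
        for set_name, rule in policy:
            in_set = RANK[origin] <= RANK[set_name]
            if in_set and (rule == "." or follower == "other"):
                cases.add((origin, follower))
    return cases


option2 = expanded_cases([("IE4", ".")])                    # IE
option3 = expanded_cases([("HTML4", "^")])                  # Safari and Firefox
option4 = expanded_cases([("IE4", "."), ("HTML4", "^")])

print(sorted(option4))
print("nothing new beyond 2) and 3):", option4 <= option2 | option3)
print("covers HTML4^:", option3 <= option4)
```

In this model, option 4) expands exactly the union of what options 2) and 3) expand, and it includes every HTML4^ case.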


[Options 5), 6) and 8) can, to a greater or lesser extent, be  
specified more easily, but might be too controversial. There are pages  
relying on, e.g., `10&ndash20' to work, though, so handling character  
references in a more liberal way would actually have some benefits;  
only invalid mark-up would be affected in any case; and the negative  
effects are to a certain extent compounded by the more conservative  
treatment in attribute values.  That being said, I do of course  
realise that it will be seen as safer not to expand too many character  
references as long as the actual impact remains difficult to quantify.]
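
For a concrete point of comparison, Python's standard `html.unescape` is documented as using the rules and the named character reference list of the HTML 5 standard. The snippet below simply runs it on the examples from this thread. Two caveats: it reflects the standard as implemented in Python, which may differ from the draft under discussion here, and it makes no distinction between text and attribute values.

```python
import html

# Examples from this thread, plus one semicolon-terminated reference.
samples = ["caf&eacute", "na&iumlve", "2&pi", "2&pi;", "10&ndash20"]

for sample in samples:
    print(repr(sample), "->", repr(html.unescape(sample)))
```

Names from the older Latin-1 group are expanded without a semicolon even when a letter follows, whereas `&pi` and `&ndash` are left alone unless terminated.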


--
Øistein E. Andersen