Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-16 Thread Iñigo
 Edward Z. Yang:
   Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm
   wrong?)

 Ian Hickson:
  Mostly, yes. (There are exceptions, but they're not things you'd really
  want to be using anyway, e.g. obscure SGML features.)

 Note though that it's not possible to write a document that is both
 valid HTML 4 and HTML 5, since they both require a different DOCTYPE to
 be used.


That's right, Cameron, but you would need to change nothing more than the
DOCTYPE in order to validate, provided you have chosen a suitable subset of
HTML 4.x.

iñigo




 --
 Cameron McCormack ≝ http://mcc.id.au/




-- 
Iñigo Medina García
Technology

http://www.toprural.com
Your guide to rural tourism


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-16 Thread Edward Z. Yang
Ian Hickson wrote:
 Mostly, yes. (There are exceptions, but they're not things you'd really 
 want to be using anyway, e.g. obscure SGML features.)

Are these exceptions, by any chance, documented somewhere?

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-16 Thread Ian Hickson
On Tue, 16 Dec 2008, Edward Z. Yang wrote:
 Ian Hickson wrote:
  Mostly, yes. (There are exceptions, but they're not things you'd really 
  want to be using anyway, e.g. obscure SGML features.)
 
 Are these exceptions, by any chance, documented somewhere?

   http://wiki.whatwg.org/wiki/Differences_from_HTML4
   http://dev.w3.org/html5/html4-differences/

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread James Graham

Edward Z. Yang wrote:

The reason I'd like to know this is because I am the author of a tool
named HTML Purifier, which takes user-input HTML and cleans it for
standards-compliance as well as XSS. We insist on output being standards
compliant, because the result is unambiguous.
  


Nothing in section 8 is going to ensure that you get output that passes 
a conformance check. If you do transform the output into something that 
is conforming then you have to make up the rules yourself so you have 
just shifted the ambiguity from the client (where it will hopefully 
disappear in a few years once the HTML5 algorithm has large-scale 
adoption) to the sanitizer implementation.




Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Geoffrey Sneddon wrote:
 If you do start work on a PHP implementation, please do seriously
 consider adding it to the html5lib project (which currently contains
 Python and Ruby implementations) as MIT licensed — there are also a fair
 number of test cases there.

I'd be quite interested in reusing the html5lib test-cases, but I prefer
to do my development on Git which means that it won't be hosted on
Google Code. This might be a winter break project for me.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 In general you should be able to just implement what the spec says and 
 then either leave the HTML5 support in (it's unlikely to cause any harm) 
 or just comment out the support for the new elements, that should be 
 relatively easy.

Right, this is mostly what I intended to do. But from what I can tell,
there's a difference between the design philosophies of HTML 5 and XHTML
2.0; XHTML tries to make everything extensible and able to be imported
from other places, while HTML 5 attempts to document what exists, and
then make sensible additions as necessary. HTML 5 pragmatism makes sense
for a user-agent, but the XHTML extensibility is useful for a sanitizer,
which doesn't actually have to render anything and needs to support
multiple dialects and variants.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
James Graham wrote:
 Nothing in section 8 is going to ensure that you get output that passes
 a conformance check. If you do transform the output into something that
 is conforming then you have to make up the rules yourself

Yes, which I suppose is slightly concerning. My philosophy is to first
reconstruct the DOM as closely as possible to what browsers build, and
then, for non-compliant DOMs, move things around so they become compliant
but still *look* the same as the non-compliant original.

 so you have
 just shifted the ambiguity from the client (where it will hopefully
 disappear in a few years once the HTML5 algorithm has large-scale
 adoption) to the sanitizer implementation.

I feel like this is preferable in many cases. There's only one sanitizer
implementation to worry about, as opposed to many browser
implementations. Also, the sanitizer can transparently add cross-browser
compatibility code for poorly supported elements.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Ian Hickson
On Mon, 15 Dec 2008, Edward Z. Yang wrote:
 Ian Hickson wrote:
  In general you should be able to just implement what the spec says and 
  then either leave the HTML5 support in (it's unlikely to cause any harm) 
  or just comment out the support for the new elements, that should be 
  relatively easy.
 
 Right, this is mostly what I intended to do. But from what I can tell, 
 there's a difference between the design philosophies of HTML 5 and XHTML 
 2.0; XHTML tries to make everything extensible and able to be imported 
 from other places, while HTML 5 attempts to document what exists, and 
 then make sensible additions as necessary. HTML 5 pragmatism makes sense 
 for a user-agent, but the XHTML extensibility is useful for a sanitizer, 
 which doesn't actually have to render anything and needs to support 
 multiple dialects and variants.

Extensibility certainly isn't a priority for HTML5 in text/html, at least 
not compared to compatibility, indeed.

I don't really see why a sanitiser needs extensibility though. Could you 
elaborate on this? Surely you just want to filter anything that isn't 
valid or safe, and only leave the valid safe stuff, using a whitelist.
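
A minimal sketch of such a whitelist filter, assuming Python and its standard
html.parser module. The ALLOWED table, SAFE_SCHEMES and the sanitize() helper
are made up for illustration; this is not HTML Purifier's API, and it does not
attempt to balance tags or reproduce the HTML5 parsing algorithm.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> attribute names allowed on it.
ALLOWED = {"p": set(), "em": set(), "strong": set(), "a": {"href"}}
# Assumption: only absolute http(s) links are considered safe here.
SAFE_SCHEMES = ("http://", "https://")


class WhitelistFilter(HTMLParser):
    """Re-emit only whitelisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unknown element: drop the tag, keep its text content
        parts = [tag]
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and any other scheme
            parts.append('{}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize(markup):
    parser = WhitelistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Anything not named in the table (elements, attributes, comments, the DOCTYPE)
simply never reaches the output, which is the point of whitelisting as opposed
to blacklisting.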



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 I don't really see why a sanitiser needs extensibility though. Could you 
 elaborate on this? Surely you just want to filter anything that isn't 
 valid or safe, and only leave the valid safe stuff, using a whitelist.

In theory, I could write separate sanitizers for HTML 4, XHTML 1.0,
XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as
possible between these cases, since I'm a lazy developer. Perhaps
extensibility is not the right word here; it's more like reusability
of components.

A side-note: something we've been looking into is bolting on extensions
to the HTML language. A user might write something in HTML 5, but the
website is in HTML 4, so the sanitizer converts the HTML 5 into a more
ugly but functional HTML 4 version, and returns that. The future, today!
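
A rough sketch of that kind of downgrade, assuming Python and a tree that has
already been parsed (xml.etree.ElementTree stands in for a real HTML parser
here). The DOWNGRADE table and the "html5-" class prefix are invented for the
example and are not taken from HTML Purifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from HTML5 sectioning elements to HTML 4 stand-ins.
DOWNGRADE = {
    "section": "div",
    "article": "div",
    "nav": "div",
    "aside": "div",
    "header": "div",
    "footer": "div",
}


def downgrade(root):
    """Rename HTML5-only elements in place, recording the old name as a class."""
    for elem in root.iter():
        target = DOWNGRADE.get(elem.tag)
        if target is None:
            continue
        classes = elem.get("class", "").split()
        classes.append("html5-" + elem.tag)  # remember what it used to be
        elem.set("class", " ".join(classes))
        elem.tag = target
    return root


def downgrade_markup(markup):
    root = ET.fromstring(markup)  # assumes well-formed input
    return ET.tostring(downgrade(root), encoding="unicode")
```

Keeping the original element name as a class lets a stylesheet target the
converted elements, so the HTML 4 output can still be styled as if it were
the HTML 5 original.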

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Ian Hickson
On Mon, 15 Dec 2008, Edward Z. Yang wrote:
 
 In theory, I could write separate sanitizers for HTML 4, XHTML 1.0, 
 XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as 
 possible between these cases, since I'm a lazy developer. Perhaps 
 extensibility is not the right word here; it's more like reusability 
 of components.

Oh well that's just a matter of having pluggable modules for different 
things to filter. You can equally support SVG and MathML in this way. You 
just need the core processing to be made independent of the filtering.
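
One way this separation could look, as a sketch in Python. FilterModule,
HTML_TEXT, MATHML and sanitize() are hypothetical names, the element lists
are deliberately tiny, and xml.etree.ElementTree stands in for a real HTML
parser (so the input must be well-formed).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterModule:
    """A pluggable vocabulary: element name -> allowed attribute names."""
    name: str
    elements: dict


# Illustrative modules only; real ones would list far more elements.
HTML_TEXT = FilterModule("html-text", {"p": set(), "em": set(), "a": {"href"}})
MATHML = FilterModule("mathml", {"math": set(), "mi": set(), "mo": set()})


def merge(modules):
    """Combine the selected modules into a single whitelist."""
    allowed = {}
    for module in modules:
        for tag, attrs in module.elements.items():
            allowed.setdefault(tag, set()).update(attrs)
    return allowed


def filter_tree(elem, allowed):
    """Core pass: knows nothing about dialects, only about the whitelist."""
    for name in list(elem.attrib):
        if name not in allowed.get(elem.tag, set()):
            del elem.attrib[name]
    prev = None
    for child in list(elem):
        if child.tag in allowed:
            filter_tree(child, allowed)
            prev = child
            continue
        # Drop the whole subtree, but keep the text that followed it.
        tail = child.tail or ""
        if prev is None:
            elem.text = (elem.text or "") + tail
        else:
            prev.tail = (prev.tail or "") + tail
        elem.remove(child)


def sanitize(markup, modules):
    allowed = merge(modules)
    root = ET.fromstring(markup)
    if root.tag not in allowed:
        raise ValueError("root element is not whitelisted")
    filter_tree(root, allowed)
    return ET.tostring(root, encoding="unicode")
```

Supporting another vocabulary then means adding a module to the list passed
to sanitize(); filter_tree() itself does not change.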


 A side-note: something we've been looking into is bolting on extensions 
 to the HTML language. A user might write something in HTML 5, but the 
 website is in HTML 4, so the sanitizer converts the HTML 5 into a more 
 ugly but functional HTML 4 version, and returns that. The future, today!

I wouldn't really worry about 4 vs 5. What matters is what works in 
browsers, or whatever tools your users are using. (This is one reason in 
HTML5 we do away with having the version number in the DOCTYPE.) I'd 
recommend just using the HTML5 DOCTYPE and then filtering the content to 
be whatever you want it to be.



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 Oh well that's just a matter of having pluggable modules for different 
 things to filter. You can equally support SVG and MathML in this way. You 
 just need the core processing to be made independent of the filtering.

I just realized I was wrong to think that I would need to modify the
parsing algorithm; that would only be the case if I tried to integrate
filtering with the core processing. If it's a two-stage process, the
core processing merely has special rules for certain elements embedded
in it, but otherwise acts normally. Performance *is* an issue (getting
things to be standards compliant is relatively CPU/memory intensive),
but getting things to work comes first.

 I wouldn't really worry about 4 vs 5. What matters is what works in 
 browsers, or whatever tools your users are using. (This is one reason in 
 HTML5 we do away with having the version number in the DOCTYPE.) I'd 
 recommend just using the HTML5 DOCTYPE and then filtering the content to 
 be whatever you want it to be.

HTML Purifier puts a high value on standards-compliance, and we've been
attacked on several occasions because of it: "Standards suck." To this I
have to say, standards compliance has helped defend against a number of
XSS attacks--enforcing it lowers attack surface and makes behavior much
more well-defined. So I feel like it's a goal worth striving for, in and
of itself, especially since you can't enforce semantics with computers.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Ian Hickson
On Mon, 15 Dec 2008, Edward Z. Yang wrote:
 
  I wouldn't really worry about 4 vs 5. What matters is what works 
  in browsers, or whatever tools your users are using. (This is one 
  reason in HTML5 we do away with having the version number in the 
  DOCTYPE.) I'd recommend just using the HTML5 DOCTYPE and then 
  filtering the content to be whatever you want it to be.
 
 HTML Purifier puts a high value on standards-compliance, and we've been 
 attacked on several occasions because of it: "Standards suck." To this I 
 have to say, standards compliance has helped defend against a number of 
 XSS attacks--enforcing it lowers attack surface and makes behavior much 
 more well-defined. So I feel like it's a goal worth striving for, in and 
 of itself, especially since you can't enforce semantics with computers.

I'm not saying don't be standards-compliant; I'm just saying use a subset 
of HTML5 that you feel comfortable with (which might also be a subset of 
HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have 
to worry about exactly which version you want to follow).
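
As a small illustration of how little has to change, here is a sketch in
Python that swaps whatever DOCTYPE a document carries for the HTML5 one. The
use_html5_doctype() helper is hypothetical, and it assumes the DOCTYPE, if
present, is the first thing in the document.

```python
import re

# Matches a leading DOCTYPE declaration of any flavour (HTML 4.01, XHTML 1.0, ...).
DOCTYPE_RE = re.compile(r"\A\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def use_html5_doctype(document):
    """Replace (or add) the DOCTYPE; the rest of the markup is left untouched."""
    body = DOCTYPE_RE.sub("", document, count=1)
    return "<!DOCTYPE html>\n" + body.lstrip()
```

Whether the result validates still depends on the content staying within a
subset that both versions accept; the function only deals with the DOCTYPE.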



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 I'm not saying don't be standards-compliant; I'm just saying use a subset 
 of HTML5 that you feel comfortable with (which might also be a subset of 
 HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have 
 to worry about exactly which version you want to follow).

Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm
wrong?)



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Tab Atkins Jr.
On Mon, Dec 15, 2008 at 3:32 PM, Edward Z. Yang
edwardzy...@thewritingpot.com wrote:
 Ian Hickson wrote:
 I'm not saying don't be standards-compliant; I'm just saying use a subset
 of HTML5 that you feel comfortable with (which might also be a subset of
 HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have
 to worry about exactly which version you want to follow).

 Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm
 wrong?)

By its nature, it has to be.  There are several parts of html4 which
*shouldn't* be used in html5, but by necessity we must deal with those
things existing.  For example, the center element.

~TJ


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Cameron McCormack
Edward Z. Yang:
  Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm 
  wrong?)

Ian Hickson:
 Mostly, yes. (There are exceptions, but they're not things you'd really 
 want to be using anyway, e.g. obscure SGML features.)

Note though that it’s not possible to write a document that is both
valid HTML 4 and HTML 5, since they both require a different DOCTYPE to
be used.

-- 
Cameron McCormack ≝ http://mcc.id.au/


[whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Hello all,

I was curious to know how stable/complete HTML 5's tokenizing and DOM
algorithms are (specifically section 8). A cursory glance through the
section reveals a few red warning boxes, but these are largely issues of
whether or not the specification should follow browser implementations,
and not actual errors in the specification.

The reason I'd like to know this is because I am the author of a tool
named HTML Purifier, which takes user-input HTML and cleans it for
standards-compliance as well as XSS. We insist on output being standards
compliant, because the result is unambiguous.

As far as I can tell, this is quite unlike the tools that HTML5 is
geared towards: compliance checkers, user agents and data miners. There
certainly is overlap: we have our own parsing and DOM-building
algorithms which work decently well, although they do trip up on a
number of edge-cases (active formatting elements being one notable
example). However, using the HTML5 algorithm wholesale is not possible
for several reasons:

1. Users input HTML fragments, not actual HTML documents. A parser I
would use needs to be able to enter parsing in a specific state, and has
to ignore any requests by the user to exit that state (e.g. a </body> tag).

2. No one actually codes their HTML in HTML5 (yet), so the only parts of
the algorithm I want to use are the ones that are emulating browser
behavior with HTML4. However, HTML5 interweaves its additions with the
browser research it has done.

I'd be really interested to hear what you all have to say about this
matter. Thanks!

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Anne van Kesteren
On Sun, 14 Dec 2008 22:37:40 +0100, Edward Z. Yang  
edwardzy...@thewritingpot.com wrote:

 1. Users input HTML fragments, not actual HTML documents. A parser I
 would use needs to be able to enter parsing in a specific state, and has
 to ignore any requests by the user to exit that state (e.g. a </body>
 tag)


Could you explain what is not sufficient about the "Parsing HTML
fragments" section?

http://www.whatwg.org/specs/web-apps/current-work/multipage/serializing-html-fragments.html#parsing-html-fragments



 2. No one actually codes their HTML in HTML5 (yet), so the only parts of
 the algorithm I want to use are the ones that are emulating browser
 behavior with HTML4. However, HTML5 interweaves its additions with the
 browser research it has done.


Are there any specific differences that pose problems?


--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Anne van Kesteren wrote:
 Could you explain what is not sufficient about the Parsing HTML
 fragments section:

I must admit, I had not seen that section! That seems to be quite
sufficient. My bad. :o)

 Are there any specific differences that pose problems?

Not that I know of yet, since I haven't started on an implementation
yet. Which brings me back to my original question: how stable is section
8? I would rather not be chasing a moving target.

Cheers,
Edward



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Geoffrey Sneddon


On 14 Dec 2008, at 21:55, Edward Z. Yang wrote:


  Are there any specific differences that pose problems?

 Not that I know of yet, since I haven't started on an implementation
 yet. Which brings me back to my original question: how stable is section
 8? I would rather not be chasing a moving target.


It's not really a moving target: its behaviour is largely constrained by
the requirement to parse pre-existing documents (which rely on almost
every possible bit of behaviour).


If you do start work on a PHP implementation, please do seriously  
consider adding it to the html5lib project (which currently contains  
Python and Ruby implementations) as MIT licensed — there are also a  
fair number of test cases there.



--
Geoffrey Sneddon
http://gsnedders.com/



Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Ian Hickson
On Sun, 14 Dec 2008, Edward Z. Yang wrote:
 
 I was curious to know how stable/complete HTML 5's tokenizing and DOM 
 algorithms are (specifically section 8).

Pretty stable. There are some known issues [1], and more issues will 
surely be found as implementations grow in usage, but the basic 
architecture is unlikely to change and the specifics are unlikely to 
change much. The only major pending change is adding SVG, but that will 
likely be done in a way similar to what is currently specified but 
commented out.

[1] Mostly listed here: http://www.whatwg.org/issues/#parsing


 The reason I'd like to know this is because I am the author of a tool 
 named HTML Purifier, which takes user-input HTML and cleans it for 
 standards-compliance as well as XSS. We insist on output being standards 
 compliant, because the result is unambiguous.
 
 As far as I can tell, this is quite unlike the tools that HTML5 is 
 geared towards: compliance checkers, user agents and data miners. There 
 certainly is overlap: we have our own parsing and DOM-building 
 algorithms which work decently well, although they do trip up on a 
 number of edge-cases (active formatting elements being one notable 
 example). However, using the HTML5 algorithm wholesale is not possible 
 for several reasons:
 
 1. Users input HTML fragments, not actual HTML documents. A parser I 
 would use needs to be able to enter parsing in a specific state, and has 
 to ignore any requests by the user to exit that state (e.g. a </body> 
 tag)

As Anne pointed out, we do have a section to handle that case (it's 
similar to innerHTML in browsers); if there's anything I can do to make 
those sections more helpful to you, please let me know.
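
To make the requirement concrete, here is a sketch in Python of a fragment
pass that never lets input close something it did not open. It is not the
spec's fragment algorithm (nor html5lib's); FragmentBalancer, VOID,
DOCUMENT_LEVEL and parse_fragment() are hypothetical names, and the code only
balances tags without filtering anything.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take an end tag (partial list, for the sketch only).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}
# Start tags that make no sense inside a fragment are dropped outright.
DOCUMENT_LEVEL = {"html", "head", "body"}


class FragmentBalancer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open = []  # stack of elements opened by the fragment itself

    def handle_starttag(self, tag, attrs):
        if tag in DOCUMENT_LEVEL:
            return
        parts = [tag]
        for name, value in attrs:
            parts.append('{}="{}"'.format(name, escape(value or "", quote=True)))
        self.out.append("<" + " ".join(parts) + ">")
        if tag not in VOID:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end tag such as </body> or </div>: ignore it
        while self.open:
            top = self.open.pop()
            self.out.append("</" + top + ">")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def parse_fragment(markup):
    parser = FragmentBalancer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    while parser.open:  # close whatever the input left open
        parser.out.append("</" + parser.open.pop() + ">")
    return "".join(parser.out)
```

With this, input such as `<b>bold</body></div> <i>text` comes out as
`<b>bold <i>text</i></b>`: the stray end tags are ignored and the open
elements are closed at the end of the fragment, so the surrounding page
structure cannot be escaped.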


 2. No one actually codes their HTML in HTML5 (yet), so the only parts of 
 the algorithm I want to use are the ones that are emulating browser 
 behavior with HTML4. However, HTML5 interweaves its additions with the 
 browser research it has done.

In general you should be able to just implement what the spec says and 
then either leave the HTML5 support in (it's unlikely to cause any harm) 
or just comment out the support for the new elements; that should be 
relatively easy.

HTH,