Re: [whatwg] Stability of tokenizing/dom algorithms
Cameron McCormack wrote: Edward Z. Yang: Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm wrong?) Ian Hickson: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Note though that it's not possible to write a document that is both valid HTML 4 and HTML 5, since they both require a different DOCTYPE to be used. That's right, Cameron, but if you have chosen the right subset of HTML 4.x, you could change nothing more than the DOCTYPE in order to validate. Iñigo -- Iñigo Medina García, Technology, http://www.toprural.com (Your rural tourism guide)
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Are these exceptions, by any chance, documented somewhere? Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
On Tue, 16 Dec 2008, Edward Z. Yang wrote: Ian Hickson wrote: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Are these exceptions, by any chance, documented somewhere? http://wiki.whatwg.org/wiki/Differences_from_HTML4 http://dev.w3.org/html5/html4-differences/ -- Ian Hickson http://ln.hixie.ch/ Things that are impossible just take longer.
Re: [whatwg] Stability of tokenizing/dom algorithms
Edward Z. Yang wrote: The reason I'd like to know this is because I am the author of a tool named HTML Purifier, which takes user-input HTML and cleans it for standards-compliance as well as XSS. We insist on output being standards compliant, because the result is unambiguous. Nothing in section 8 is going to ensure that you get output that passes a conformance check. If you do transform the output into something that is conforming then you have to make up the rules yourself so you have just shifted the ambiguity from the client (where it will hopefully disappear in a few years once the HTML5 algorithm has large-scale adoption) to the sanitizer implementation.
Re: [whatwg] Stability of tokenizing/dom algorithms
Geoffrey Sneddon wrote: If you do start work on a PHP implementation, please do seriously consider adding it to the html5lib project (which currently contains Python and Ruby implementations) as MIT licensed — there are also a fair number of test cases there. I'd be quite interested in reusing the html5lib test-cases, but I prefer to do my development on Git which means that it won't be hosted on Google Code. This might be a winter break project for me. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: In general you should be able to just implement what the spec says and then either leave the HTML5 support in (it's unlikely to cause any harm) or just comment out the support for the new elements, that should be relatively easy. Right, this is mostly what I intended to do. But from what I can tell, there's a difference between the design philosophies of HTML 5 and XHTML 2.0; XHTML tries to make everything extensible and able to be imported from other places, while HTML 5 attempts to document what exists, and then make sensible additions as necessary. HTML 5 pragmatism makes sense for a user-agent, but the XHTML extensibility is useful for a sanitizer, which doesn't actually have to render anything and needs to support multiple dialects and variants. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
James Graham wrote: Nothing in section 8 is going to ensure that you get output that passes a conformance check. If you do transform the output into something that is conforming then you have to make up the rules yourself Yes, which I suppose is slightly concerning. My philosophy is to first reconstruct the DOM the way browsers would, as closely as possible, and then for non-compliant DOMs move things around so they become compliant, but *look* the same as the non-compliant DOM. so you have just shifted the ambiguity from the client (where it will hopefully disappear in a few years once the HTML5 algorithm has large-scale adoption) to the sanitizer implementation. I feel like this is preferable in many cases. There's only one sanitizer implementation to worry about, as opposed to many browser implementations. Also, the sanitizer can transparently add cross-browser compatibility code for poorly supported elements. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
On Mon, 15 Dec 2008, Edward Z. Yang wrote: Ian Hickson wrote: In general you should be able to just implement what the spec says and then either leave the HTML5 support in (it's unlikely to cause any harm) or just comment out the support for the new elements, that should be relatively easy. Right, this is mostly what I intended to do. But from what I can tell, there's a difference between the design philosophies of HTML 5 and XHTML 2.0; XHTML tries to make everything extensible and able to be imported from other places, while HTML 5 attempts to document what exists, and then make sensible additions as necessary. HTML 5 pragmatism makes sense for a user-agent, but the XHTML extensibility is useful for a sanitizer, which doesn't actually have to render anything and needs to support multiple dialects and variants. Extensibility certainly isn't a priority for HTML5 in text/html, at least not compared to compatibility, indeed. I don't really see why a sanitiser needs extensibility though. Could you elaborate on this? Surely you just want to filter anything that isn't valid or safe, and only leave the valid safe stuff, using a whitelist. -- Ian Hickson http://ln.hixie.ch/ Things that are impossible just take longer.
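A minimal sketch of the whitelist approach described above, assuming Python's standard html.parser as a stand-in for a real HTML5 parser. The ALLOWED table, SAFE_SCHEMES tuple and sanitize() helper are hypothetical names chosen for illustration, not part of HTML Purifier or the spec, and the sketch does not balance tags:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tag name -> attribute names permitted on that tag.
ALLOWED = {"p": set(), "em": set(), "strong": set(), "a": {"href"}}
# Hypothetical list of URL schemes considered safe for href values.
SAFE_SCHEMES = ("http://", "https://", "mailto:")
# Elements whose text content is dropped together with the tag.
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED:
            return  # unknown tag: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def sanitize(fragment):
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Anything not named in the table is dropped rather than repaired, which is the point of a whitelist: new or unknown markup is rejected by default instead of having to be anticipated.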
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: I don't really see why a sanitiser needs extensibility though. Could you elaborate on this? Surely you just want to filter anything that isn't valid or safe, and only leave the valid safe stuff, using a whitelist. In theory, I could write separate sanitizers for HTML 4, XHTML 1.0, XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as possible between these cases, since I'm a lazy developer. Perhaps extensibility is not the right word here; it's more like reusability of components. A side-note: something we've been looking into is bolting on extensions to the HTML language. A user might write something in HTML 5, but the website is in HTML 4, so the sanitizer converts the HTML 5 into a more ugly but functional HTML 4 version, and returns that. The future, today! Cheers, Edward
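The HTML 5 to HTML 4 conversion mentioned in the side-note could look roughly like the sketch below. It assumes Python's standard html.parser, a hypothetical DOWNLEVEL set of HTML5 sectioning elements, and a hypothetical rule that each becomes a div carrying the original element name as a class; none of this is taken from HTML Purifier itself, and no sanitizing is done here:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical set of HTML5 elements to rewrite as <div class="name">.
DOWNLEVEL = {"section", "article", "nav", "aside", "header", "footer"}


class Downleveler(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Valueless (boolean) attributes are dropped to keep the sketch short.
        attrs = [(k, v) for k, v in attrs if v is not None]
        if tag in DOWNLEVEL:
            # Preserve the original element name as the first class token.
            classes = [v for k, v in attrs if k == "class"]
            attrs = [(k, v) for k, v in attrs if k != "class"]
            attrs.insert(0, ("class", " ".join([tag] + classes)))
            tag = "div"
        rendered = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in attrs)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag in DOWNLEVEL else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def downlevel(markup):
    parser = Downleveler()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

For instance, `downlevel('<nav><a href="/">Home</a></nav>')` yields a div with class "nav" wrapping the unchanged link.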
Re: [whatwg] Stability of tokenizing/dom algorithms
On Mon, 15 Dec 2008, Edward Z. Yang wrote: In theory, I could write separate sanitizers for HTML 4, XHTML 1.0, XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as possible between these cases, since I'm a lazy developer. Perhaps extensibility is not the right word here; it's more like reusability of components. Oh well that's just a matter of having pluggable modules for different things to filter. You can equally support SVG and MathML in this way. You just need the core processing to be made independent of the filtering. A side-note: something we've been looking into is bolting on extensions to the HTML language. A user might write something in HTML 5, but the website is in HTML 4, so the sanitizer converts the HTML 5 into a more ugly but functional HTML 4 version, and returns that. The future, today! I wouldn't really worry about 4 vs 5. What matters is what works in browsers, or whatever tools your users are using. (This is one reason in HTML5 we do away with having the version number in the DOCTYPE.) I'd recommend just using the HTML5 DOCTYPE and then filtering the content to be whatever you want it to be. -- Ian Hickson http://ln.hixie.ch/ Things that are impossible just take longer.
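One way to read "core processing independent of the filtering" is sketched below: a tokenizer that knows no policy, plus filter modules that are plain data and can be combined. It assumes Python's standard html.parser instead of an HTML5 parser, and the module tables HTML_CORE and SVG_BASIC, build_policy() and sanitize() are hypothetical names. Attribute value checks (URL schemes and so on) are deliberately left out:

```python
from html import escape
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Core processing: turns markup into tokens and applies no policy."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, None))

    def handle_data(self, data):
        self.tokens.append(("text", data, None))


# Hypothetical filter modules: tag name -> allowed attribute names.
HTML_CORE = {"p": set(), "em": set()}
SVG_BASIC = {"svg": {"width", "height"}, "circle": {"cx", "cy", "r"}}


def build_policy(*modules):
    """Merge any number of filter modules into one whitelist."""
    policy = {}
    for module in modules:
        for tag, attrs in module.items():
            policy.setdefault(tag, set()).update(attrs)
    return policy


def sanitize(markup, policy):
    tokenizer = Tokenizer()
    tokenizer.feed(markup)
    tokenizer.close()
    out = []
    for kind, name, attrs in tokenizer.tokens:
        if kind == "text":
            out.append(escape(name, quote=False))
        elif name in policy and kind == "start":
            kept = "".join(
                f' {k}="{escape(v, quote=True)}"'
                for k, v in attrs.items()
                if k in policy[name] and v is not None
            )
            out.append(f"<{name}{kept}>")
        elif name in policy:
            out.append(f"</{name}>")
    return "".join(out)
```

Supporting another vocabulary then means adding one more table and passing it to build_policy(); the tokenizer is untouched.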
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: Oh well that's just a matter of having pluggable modules for different things to filter. You can equally support SVG and MathML in this way. You just need the core processing to be made independent of the filtering. I just realized an error in my thought that I would need to modify the parsing algorithm; that would only be the case if I tried to integrate filtering with the core processing. If it's a two-stage process, the core processing merely has special rules for certain elements embedded in it, but otherwise acts normally. Performance *is* an issue (getting things to be standards compliant is relatively CPU/memory intensive), but getting things to work is first. I wouldn't really worry about 4 vs 5. What matters is what works in browsers, or whatever tools your users are using. (This is one reason in HTML5 we do away with having the version number in the DOCTYPE.) I'd recommend just using the HTML5 DOCTYPE and then filtering the content to be whatever you want it to be. HTML Purifier puts a high value on standards-compliance, and we've been attacked on several occasions because of it. Standards suck. To this I have to say, standards compliance has helped defend against a number of XSS attacks--enforcing it lowers attack surface and makes behavior much more well-defined. So I feel like it's a goal worth striving for, in and of itself, especially since you can't enforce semantics with computers. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
On Mon, 15 Dec 2008, Edward Z. Yang wrote: I wouldn't really worry about 4 vs 5. What matters is what works in browsers, or whatever tools your users are using. (This is one reason in HTML5 we do away with having the version number in the DOCTYPE.) I'd recommend just using the HTML5 DOCTYPE and then filtering the content to be whatever you want it to be. HTML Purifier puts a high value on standards-compliance, and we've been attacked on several occasions because of it. Standards suck. To this I have to say, standards compliance has helped defend against a number of XSS attacks--enforcing it lowers attack surface and makes behavior much more well-defined. So I feel like it's a goal worth striving for, in and of itself, especially since you can't enforce semantics with computers. I'm not saying don't be standards-compliant; I'm just saying use a subset of HTML5 that you feel comfortable with (which might also be a subset of HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have to worry about exactly which version you want to follow). -- Ian Hickson http://ln.hixie.ch/ Things that are impossible just take longer.
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: I'm not saying don't be standards-compliant; I'm just saying use a subset of HTML5 that you feel comfortable with (which might also be a subset of HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have to worry about exactly which version you want to follow). Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm wrong?)
Re: [whatwg] Stability of tokenizing/dom algorithms
On Mon, Dec 15, 2008 at 3:32 PM, Edward Z. Yang edwardzy...@thewritingpot.com wrote: Ian Hickson wrote: I'm not saying don't be standards-compliant; I'm just saying use a subset of HTML5 that you feel comfortable with (which might also be a subset of HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have to worry about exactly which version you want to follow). Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm wrong?) By its nature, it has to be. There are several parts of html4 which *shouldn't* be used in html5, but by necessity we must deal with those things existing. Frex, the center element. ~TJ
Re: [whatwg] Stability of tokenizing/dom algorithms
Edward Z. Yang: Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm wrong?) Ian Hickson: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Note though that it’s not possible to write a document that is both valid HTML 4 and HTML 5, since they both require a different DOCTYPE to be used. -- Cameron McCormack ≝ http://mcc.id.au/
[whatwg] Stability of tokenizing/dom algorithms
Hello all, I was curious to know how stable/complete HTML 5's tokenizing and DOM algorithms are (specifically section 8). A cursory glance through the section reveals a few red warning boxes, but these are largely issues of whether or not the specification should follow browser implementations, and not actual errors in the specification. The reason I'd like to know this is because I am the author of a tool named HTML Purifier, which takes user-input HTML and cleans it for standards-compliance as well as XSS. We insist on output being standards compliant, because the result is unambiguous. As far as I can tell, this is quite unlike the tools that HTML5 is tooled towards: compliance checkers, user agents and data miners. There certainly is overlap: we have our own parsing and DOM-building algorithms which work decently well, although they do trip up on a number of edge-cases (active formatting elements being one notable example). However, using the HTML5 algorithm wholesale is not possible for several reasons: 1. Users input HTML fragments, not actual HTML documents. A parser I would use needs to be able to enter parsing in a specific state, and has to ignore any requests by the user to exit that state (i.e. a </body> tag) 2. No one actually codes their HTML in HTML5 (yet), so the only parts of the algorithm I want to use are the ones that are emulating browser behavior with HTML4. However, HTML5 interweaves its additions with the browser research it has done. I'd be really interested to hear what you all have to say about this matter. Thanks! Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
On Sun, 14 Dec 2008 22:37:40 +0100, Edward Z. Yang edwardzy...@thewritingpot.com wrote: 1. Users input HTML fragments, not actual HTML documents. A parser I would use needs to be able to enter parsing in a specific state, and has to ignore any requests by the user to exit that state (i.e. a </body> tag) Could you explain what is not sufficient about the Parsing HTML fragments section: http://www.whatwg.org/specs/web-apps/current-work/multipage/serializing-html-fragments.html#parsing-html-fragments ? 2. No one actually codes their HTML in HTML5 (yet), so the only parts of the algorithm I want to use are the ones that are emulating browser behavior with HTML4. However, HTML5 interweaves its additions with the browser research it has done. Are there any specific differences that pose problems? -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
Re: [whatwg] Stability of tokenizing/dom algorithms
Anne van Kesteren wrote: Could you explain what is not sufficient about the Parsing HTML fragments section: I must admit, I had not seen that section! That seems to be quite sufficient. My bad. :o) Are there any specific differences that pose problems? Not that I know of yet, since I haven't started on an implementation yet. Which brings me back to my original question: how stable is section 8? I would rather not be chasing a moving target. Cheers, Edward
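The fragment requirement discussed here (stay inside the container, ignore attempts to leave it) can be illustrated with the much-simplified sketch below. It is not the spec's fragment parsing algorithm: it assumes Python's standard html.parser, a hypothetical STRUCTURAL set of document-level tags to ignore, a partial VOID list, and a simple stack that closes whatever the user left open:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical: document-level tags a fragment is never allowed to open or close.
STRUCTURAL = {"html", "head", "body"}
# Partial list of void elements, which never go on the open-element stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class FragmentParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._out = []
        self._open = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL:
            return
        rendered = "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in attrs)
        self._out.append(f"<{tag}{rendered}>")
        if tag not in VOID:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags, including </body> and </html>, are ignored.
        if tag in STRUCTURAL or tag not in self._open:
            return
        # Close everything opened since the matching start tag.
        while self._open:
            top = self._open.pop()
            self._out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self._out.append(escape(data, quote=False))

    def finish(self):
        # Close anything the fragment left open so it cannot leak into the page.
        while self._open:
            self._out.append(f"</{self._open.pop()}>")
        return "".join(self._out)


def parse_fragment(markup):
    parser = FragmentParser()
    parser.feed(markup)
    parser.close()
    return parser.finish()
```

A real implementation would follow the spec's context-element rules instead of a fixed ignore list; the sketch only shows why a stray end tag need not end the fragment.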
Re: [whatwg] Stability of tokenizing/dom algorithms
On 14 Dec 2008, at 21:55, Edward Z. Yang wrote: Are there any specific differences that pose problems? Not that I know of yet, since I haven't started on an implementation yet. Which brings me back to my original question: how stable is section 8? I would rather not be chasing a moving target. It's not really a moving target — what it is is largely constrained by the requirement to parse pre-existing documents (which rely on almost every possible bit of behaviour). If you do start work on a PHP implementation, please do seriously consider adding it to the html5lib project (which currently contains Python and Ruby implementations) as MIT licensed — there are also a fair number of test cases there. -- Geoffrey Sneddon http://gsnedders.com/
Re: [whatwg] Stability of tokenizing/dom algorithms
On Sun, 14 Dec 2008, Edward Z. Yang wrote: I was curious to know how stable/complete HTML 5's tokenizing and DOM algorithms are (specifically section 8). Pretty stable. There are some known issues [1], and more issues will surely be found as implementations grow in usage, but the basic architecture is unlikely to change and the specifics are unlikely to change much. The only major pending change is adding SVG, but that will likely be done in a way similar to what is currently specified but commented out. [1] Mostly listed here: http://www.whatwg.org/issues/#parsing The reason I'd like to know this is because I am the author of a tool named HTML Purifier, which takes user-input HTML and cleans it for standards-compliance as well as XSS. We insist on output being standards compliant, because the result is unambiguous. As far as I can tell, this is quite unlike the tools that HTML5 is tooled towards; compliance checkers, user agents and data miners. There certainly is overlap: we have our own parsing and DOM-building algorithms which work decently well, although they do trip up on a number of edge-cases (active formatting elements being one notable example). However, using the HTML5 algorithm wholesale is not possible for several reasons: 1. Users input HTML fragments, not actual HTML documents. A parser I would use needs to be able to enter parsing in a specific state, and has to ignore any requests by the user to exit that state (i.e. a </body> tag) As Anne pointed out, we do have a section to handle that case (it's similar to innerHTML in browsers); if there's anything I can do to make those sections more helpful to you, please let me know. 2. No one actually codes their HTML in HTML5 (yet), so the only parts of the algorithm I want to use are the ones that are emulating browser behavior with HTML4. However, HTML5 interweaves its additions with the browser research it has done.
In general you should be able to just implement what the spec says and then either leave the HTML5 support in (it's unlikely to cause any harm) or just comment out the support for the new elements, that should be relatively easy. HTH, -- Ian Hickson http://ln.hixie.ch/ Things that are impossible just take longer.