Re: [whatwg] Parsing, syntax, and content model feedback
Ian Hickson wrote: On Mon, 22 Dec 2008, Edward Z. Yang wrote: in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF U+000B is not a range. While this is technically true, I don't really see a better way to phrase this that isn't verbose (e.g. "ranges and codepoints" or some such). If it helps, consider the whole set of subranges and code points to be a single discontinuous range, hence the use of the singular "range". :-) The spec made me double-take when I read it (since it fairly clearly separates ranges from codepoints). Also, I messed up the copy-paste while quoting, so the text I cited is not actually what's there; it's: "in the ranges U+0001 to U+0008, U+000B, U+000E to U+001F, U+007F to U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDDF, and characters U+FFFE..." It seems fairly clear to me that U+000B should be moved to the list of characters (at the cost of the nice ordering), or we should collapse ranges/characters into one range. On Tue, 23 Dec 2008, Edward Z. Yang wrote: You're still checking the next input character at that point, so P is still the next input character, so the next six are PUBLIC. At least, that's how I'm defending what the spec says. :-) The spec is pretty unambiguous about this: "The next input character is the first character in the input stream that has not yet been consumed. Initially, the next input character is the first character in the input." and, at the beginning of the section: "Consume the next input character:" So, the spec is wrong. In practice I think having the text be clear (PUBLIC) is less confusing than having it be pedantic ("P and UBLIC" or "this and the next five" or some such). It's not like people are going to assume the spec is allowing XPUBLIC or *PUBLIC and so forth, right? I understand this consideration, and there are several ways we could go about doing this. I think the easiest would be to un-consume a character, then perform the checks, and then reconsume the character. As for people making this mistake... well, you're looking at one. :-) Cheers, Edward (accidentally emailed only Ian; re-sending to WHATWG list)
[whatwg] Error in 8.2.4.26 After DOCTYPE name state
In section 8.2.4.26 the spec says: If the next six characters are an ASCII case-insensitive match for the word PUBLIC, then consume those characters and switch to the before DOCTYPE public identifier state. The P has already been consumed at the beginning of this section. Thus, I believe it should read: If this character and the next five characters are an ASCII case-insensitive match for the word PUBLIC, etc. Same goes for the match for SYSTEM. Cheers, Edward
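The off-by-one described above can be sketched in Python. This is only an illustration of the two readings, not the spec's algorithm: the function names are hypothetical, and `consumed_at` is assumed to be the index of the character the state has already consumed.

```python
def ascii_lower(text):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def matches_at(stream, start, keyword):
    """ASCII case-insensitive match of `keyword` at index `start`."""
    candidate = stream[start:start + len(keyword)]
    return ascii_lower(candidate) == ascii_lower(keyword)


def literal_reading(stream, consumed_at, keyword="PUBLIC"):
    # "The next six characters", taken literally: the window starts
    # AFTER the character that has already been consumed.
    return matches_at(stream, consumed_at + 1, keyword)


def proposed_reading(stream, consumed_at, keyword="PUBLIC"):
    # "This character and the next five": the window starts AT the
    # consumed character.
    return matches_at(stream, consumed_at, keyword)
```

Under these assumptions, the literal reading rejects an ordinary `PUBLIC` and accepts `XPUBLIC`, which is the behavior the wording change is meant to rule out.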
[whatwg] 8.2.4.4 Close tag open state
The condition here is really long. Is there any way we can make it shorter? Cheers, Edward
[whatwg] 8.2.4.37: EOF handling
Hello all, I think EOF should be handled explicitly in the states after we Consume the U+0023 NUMBER SIGN, since the spec as it stands right now implies that there will always be another character after the number sign. Or am I being a little redundant? Cheers, Edward
Re: [whatwg] 8.2.4.37: EOF handling
Philip Taylor wrote: EOF is always treated as if it were a character, e.g. lots of places say Consume the next input character: ... EOF - ... Reconsume the EOF character in the data state. That seems fair, although most implementations won't have an actual end of file character; they'll be checking their string index to see if they've gone out of bounds. But the spec is internally consistent (I'm just used to seeing an EOF special case on almost every state). Thanks, Edward
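One way to reconcile the two views is to let the bounds check produce an EOF sentinel, so that every state can treat EOF like any other consumed character. The sketch below is an assumption-laden illustration: `InputStream`, `EOF` and `after_number_sign` are hypothetical names, and the state logic is heavily simplified compared with the spec.

```python
EOF = None  # sentinel standing in for the spec's "EOF character"


class InputStream:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def consume(self):
        # The out-of-bounds check lives here, once, instead of in every state.
        ch = self.text[self.pos] if self.pos < len(self.text) else EOF
        self.pos += 1
        return ch

    def unconsume(self):
        # Step back so the last character (or EOF) is consumed again later.
        self.pos -= 1


def after_number_sign(stream):
    """Classify what follows '#' in a character reference (simplified)."""
    ch = stream.consume()
    if ch is EOF:
        stream.unconsume()
        return "not-a-reference"
    if ch in ("x", "X"):
        return "hex"
    if ch in "0123456789":
        stream.unconsume()  # leave the digit for the number parser
        return "decimal"
    stream.unconsume()
    return "not-a-reference"
```

With this shape, a state that has no explicit EOF case still behaves sensibly, because EOF simply falls through to the "anything else" branch.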
[whatwg] Consuming ampersands
Hello all, When I'm consuming a character reference, when does the ampersand get consumed? This doesn't seem to be obvious from the documentation, which talks of consuming character references and number hash signs, but never the ampersand. Cheers, Edward
[whatwg] Minor typo in 8.2.4.37
"in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF" U+000B is not a range. Cheers, Edward
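A sketch of the "ranges plus lone code points" reading, in Python. The ranges are the ones given in the corrected quotation elsewhere in this thread (U+0001 to U+0008 and so on); the spec's list of individual characters is truncated in that quotation, so only U+000B is included here. The names are hypothetical.

```python
# Ranges as quoted in the thread (inclusive on both ends).
DISALLOWED_RANGES = [
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xD800, 0xDFFF),
    (0xFDD0, 0xFDDF),
]

# Lone code points. The quoted list of further single characters is cut
# off in the thread, so it is deliberately NOT reproduced here.
DISALLOWED_CODE_POINTS = {0x000B}


def is_disallowed(code_point):
    """True if the code point falls in a listed range or is listed singly."""
    if code_point in DISALLOWED_CODE_POINTS:
        return True
    return any(low <= code_point <= high for low, high in DISALLOWED_RANGES)
```

Keeping the two collections separate mirrors the suggestion that U+000B belongs with the individual characters rather than with the ranges.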
[whatwg] Byte-wise tokenization algorithm
I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the tokenization algorithm with byte-wise operations only? I think it's possible, since all of the character matching parts of the algorithm map to characters in ASCII space. 2. Would such an implementation be conforming? Cheers, Edward
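The property behind question 1 is that, in valid UTF-8, every byte below 0x80 encodes exactly one ASCII character, and multi-byte sequences never contain such bytes. A small Python sketch (Python rather than PHP, purely for illustration; the function names are hypothetical) shows a byte-wise scan for `<` agreeing with a character-wise scan. It says nothing about conformance, which is question 2.

```python
def find_tag_opens_bytes(data):
    """Byte offsets of every 0x3C ('<') byte in UTF-8 encoded input."""
    return [i for i, b in enumerate(data) if b == 0x3C]


def find_tag_opens_chars(text):
    """Character offsets of every '<' in decoded input."""
    return [i for i, c in enumerate(text) if c == "<"]


def byte_offset_to_char_offset(data, offset):
    # An ASCII byte always sits on a character boundary in valid UTF-8,
    # so the prefix before it decodes cleanly.
    return len(data[:offset].decode("utf-8"))
```

For example, with `text = "caf\u00e9 <b>\u65e5\u672c</b>"` and `data = text.encode("utf-8")`, mapping each byte hit through `byte_offset_to_char_offset` gives the same positions as `find_tag_opens_chars(text)`.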
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Are these exceptions, by any chance, documented somewhere? Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Geoffrey Sneddon wrote: If you do start work on a PHP implementation, please do seriously consider adding it to the html5lib project (which currently contains Python and Ruby implementations) as MIT licensed — there are also a fair number of test cases there. I'd be quite interested in reusing the html5lib test-cases, but I prefer to do my development on Git which means that it won't be hosted on Google Code. This might be a winter break project for me. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: In general you should be able to just implement what the spec says and then either leave the HTML5 support in (it's unlikely to cause any harm) or just comment out the support for the new elements, that should be relatively easy. Right, this is mostly what I intended to do. But from what I can tell, there's a difference between the design philosophies of HTML 5 and XHTML 2.0; XHTML tries to make everything extensible and able to be imported from other places, while HTML 5 attempts to document what exists, and then make sensible additions as necessary. HTML 5 pragmatism makes sense for a user-agent, but the XHTML extensibility is useful for a sanitizer, which doesn't actually have to render anything and needs to support multiple dialects and variants. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
James Graham wrote: Nothing in section 8 is going to ensure that you get output that passes a conformance check. If you do transform the output into something that is conforming then you have to make up the rules yourself Yes, which I suppose is slightly concerning. My philosophy is to first reconstruct the DOM as closely as possible to what browsers do, and then, for non-compliant DOMs, move things around so they become compliant but *look* the same as the non-compliant DOM. so you have just shifted the ambiguity from the client (where it will hopefully disappear in a few years once the HTML5 algorithm has large-scale adoption) to the sanitizer implementation. I feel like this is preferable in many cases. There's only one sanitizer implementation to worry about, as opposed to many browser implementations. Also, the sanitizer can transparently add cross-browser compatibility code for poorly supported elements. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: I don't really see why a sanitiser needs extensibility though. Could you elaborate on this? Surely you just want to filter anything that isn't valid or safe, and only leave the valid safe stuff, using a whitelist. In theory, I could write separate sanitizers for HTML 4, XHTML 1.0, XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as possible between these cases, since I'm a lazy developer. Perhaps extensibility is not the right word here; it's more like reusability of components. A side-note: something we've been looking into is bolting on extensions to the HTML language. A user might write something in HTML 5, but the website is in HTML 4, so the sanitizer converts the HTML 5 into a more ugly but functional HTML 4 version, and returns that. The future, today! Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: Oh well that's just a matter of having pluggable modules for different things to filter. You can equally support SVG and MathML in this way. You just need the core processing to be made independent of the filtering. I just realized an error in my thought that I would need to modify the parsing algorithm; that would only be the case if I tried to integrate filtering with the core processing. If it's a two-stage process, the core processing merely has special rules for certain elements embedded in it, but otherwise acts normally. Performance *is* an issue (getting things to be standards compliant is relatively CPU/memory intensive), but getting things to work comes first. I wouldn't really worry about 4 vs 5. What matters is what works in browsers, or whatever tools your users are using. (This is one reason in HTML5 we do away with having the version number in the DOCTYPE.) I'd recommend just using the HTML5 DOCTYPE and then filtering the content to be whatever you want it to be. HTML Purifier puts a high value on standards-compliance, and we've been attacked on several occasions because of it: "Standards suck." To this I have to say, standards compliance has helped defend against a number of XSS attacks--enforcing it lowers attack surface and makes behavior much more well-defined. So I feel like it's a goal worth striving for, in and of itself, especially since you can't enforce semantics with computers. Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Ian Hickson wrote: I'm not saying don't be standards-compliant; I'm just saying use a subset of HTML5 that you feel comfortable with (which might also be a subset of HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have to worry about exactly which version you want to follow). Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm wrong?)
[whatwg] Stability of tokenizing/dom algorithms
Hello all, I was curious to know how stable/complete HTML 5's tokenizing and DOM algorithms are (specifically section 8). A cursory glance through the section reveals a few red warning boxes, but these are largely issues of whether or not the specification should follow browser implementations, and not actual errors in the specification. The reason I'd like to know this is that I am the author of a tool named HTML Purifier, which takes user-input HTML and cleans it for standards-compliance as well as XSS. We insist on output being standards compliant, because the result is unambiguous. As far as I can tell, this is quite unlike the tools that HTML5 is geared towards: compliance checkers, user agents and data miners. There certainly is overlap: we have our own parsing and DOM-building algorithms which work decently well, although they do trip up on a number of edge-cases (active formatting elements being one notable example). However, using the HTML5 algorithm wholesale is not possible for several reasons: 1. Users input HTML fragments, not actual HTML documents. A parser I would use needs to be able to enter parsing in a specific state, and has to ignore any requests by the user to exit that state (e.g. a </body> tag). 2. No one actually codes their HTML in HTML5 (yet), so the only parts of the algorithm I want to use are the ones that are emulating browser behavior with HTML4. However, HTML5 interweaves its additions with the browser research it has done. I'd be really interested to hear what you all have to say about this matter. Thanks! Cheers, Edward
Re: [whatwg] Stability of tokenizing/dom algorithms
Anne van Kesteren wrote: Could you explain what is not sufficient about the Parsing HTML fragments section: I must admit, I had not seen that section! That seems to be quite sufficient. My bad. :o) Are there any specific differences that pose problems? Not that I know of yet, since I haven't started on an implementation yet. Which brings me back to my original question: how stable is section 8? I would rather not be chasing a moving target. Cheers, Edward
Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web
Michal Zalewski wrote: More importantly, since the dictionary of possible inputs is rather limited, it would be pretty trivial to build a dictionary of site - hash pairs and crack the values. May protect xyzzy2984.eur.int.example.com, but would still reveal to me you are coming from playboy.com. Salt it. Problem solved.
Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web
Michal Zalewski wrote: Not really? I just need to rebuild my dictionary for that salt, but to check against say a million or ten million of common domains, it wouldn't be very expensive. And it's not very expensive to build such a list of domains, too. In that case, you are certainly correct; adding a salt only hinders an attacker. But if we're worried about Origin giving away a secret intranet website, I think things should be reasonable. Of course, they can still dictionary brute-force it... (whoops, forgot to CC list)
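The point about salting can be made concrete with a short Python sketch. The thread does not specify a hash function or how the salt would be conveyed, so SHA-256 and a salt that the receiving site can see are both assumptions, and the function names are hypothetical.

```python
import hashlib


def origin_token(host, salt):
    """Salted hash of a host name (SHA-256 is an assumption, not a proposal)."""
    return hashlib.sha256(salt + host.encode("utf-8")).hexdigest()


def crack(token, salt, candidate_hosts):
    """Dictionary attack: re-hash each candidate with the known salt."""
    for host in candidate_hosts:
        if origin_token(host, salt) == token:
            return host
    return None
```

If the salt is known to whoever receives the token, rebuilding the dictionary costs one hash per candidate domain, so common public domains stay recoverable while an unguessable intranet host name does not appear in the list and is not revealed.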
Re: [whatwg] Can var possibly work?
Ozob the Great wrote: Then var steps on MathML's toes: It duplicates functionality. Not necessarily; a program variable should certainly not be marked up with MathML.
Re: [whatwg] The iframe element and sandboxing ideas
Warning: This is going to be a little bit of an HTML Purifier evangelising post. Frode Børli wrote: Yeah, I thought about that also. Then we have more complex attributes such as style='font-family: expression&#40;a+5&#41;;'... So your sanitizer must also parse CSS properly - including unescaping entities. The way HTML Purifier handles this is by unescaping all entities (hex, dec and named) before handling HTML. Output text is always in UTF-8 and thus never has entities. Also, it should be noted that &#40; is HTML escaping, not CSS escaping. CSS has its own escaping syntax. HTML Purifier handles that too. For all I know - a future invention may introduce a new method of encoding entities also, so your sanitizer must support all future entity encodings. I don't know what you really mean by this, but by converting entities to characters this is not a problem. Of course we can skip supporting the style attribute - but there are not many other ways to style content in XHTML. The style attribute is supported. A bank wants an HTML-messaging system where the customer can write HTML-based messages to customer support through the online banking system. Customer support personnel have access to perform transactions worth millions of dollars through the intranet web interface (where they also receive HTML-based messages from customers). A few problems with this theoretical situation: 1. Why does the bank need an HTML messaging system? 2. Why is this system on the same domain as the intranet web interface? 3. Why do customer support personnel have access to the transaction interface? But whatever, it's not really relevant to the topic at hand. Security depends on a perfect sanitizer. Would you sell your sanitizer to this bank without any disclaimers, and say that your sanitizer will be valid for eternity and for all browsers that the bank decides to use internally in the future? Well, it's an open-source sanitizer. But that aside, say I was selling them a support contract: I would not say "valid for eternity". However, I would be very confident that a bug would be more likely than a future browser breaking the sanitizer. And the reason I say this is because of the principle of backwards-compatibility: my sanitizer only allows HTML/CSS that has well-defined behavior in all current browsers. colspan=expr(3+4) is theoretically valid and safe HTML, but it doesn't have well-defined behavior in browsers, so it is sanitized out. colspan=4 is well-defined, valid and safe, and unless a browser decides 4 is a magic number that should trigger the execution of JavaScript code in a nearby node, it's safe. Today I would not allow HTML-based messages since I could never be sure enough that the sanitizer was perfect. I encourage you to try out HTML Purifier http://htmlpurifier.org. It's certainly not perfect (we've had a total of two security problems with the core code (three if you count a Shift_JIS related vulnerability, and four if you count an XSS vulnerability in a testing script for the library)), but I hope it approaches it.
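The two ideas in this message, decoding entities before inspecting a value and only letting through values with well-defined behavior, can be sketched in Python. This is not HTML Purifier's code (which is PHP); `decode_entities` and `sanitize_colspan` are hypothetical names, and the "positive integer" rule for colspan is an assumption chosen to match the example above.

```python
import html
import re

# Whitelist: a plain positive decimal integer and nothing else.
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")


def decode_entities(raw):
    """Turn hex, decimal and named character references into characters."""
    return html.unescape(raw)


def sanitize_colspan(raw):
    """Return a safe colspan value, or None if the value is dropped."""
    value = decode_entities(raw).strip()
    return value if _POSITIVE_INT.fullmatch(value) else None
```

Because decoding happens first, an escaped value such as `expression&#40;a+5&#41;` is seen by the filter as `expression(a+5)`, and anything that is not an explicitly allowed form is removed rather than guessed at.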
[whatwg] Pre, code and semantics in HTML5: Wishful thinking?
I was reading through the HTML5 spec the other day and I noticed this tidbit: "To represent a block of computer code, the pre element can be used with a code element; to represent a block of computer output the pre element can be used with a samp element. Similarly, the kbd element can be used within a pre element to indicate text that the user is to enter." The implication is that document authors are recommended to use <pre><code> to wrap all of their programming code instead of a lone pre, if they wish to be fully semantic. This feels needlessly verbose and abusive of code, which traditionally has been used to mark single-liners. It also makes it extremely difficult to style pre as a block for code, as the only semantic indication that the contents of the pre block are computer code is its child. You'd end up having to say <pre class="code"><code> if you wanted to style pre as well. At the same time, I still think the semantics of whether a pre tag indicates a plaintext file, a piece of ASCII art, or computer code are somewhat important. However, I think this information would be more appropriately given as an attribute. Thanks for reading, Edward P.S. Please CC my address on all replies. -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier http://htmlpurifier.org Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]