Re: [whatwg] Parsing, syntax, and content model feedback

2008-12-25 Thread Edward Z. Yang
Ian Hickson wrote: On Mon, 22 Dec 2008, Edward Z. Yang wrote: in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF , 0xFDD0 to 0xFDDF in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF U+000B

[whatwg] Error in 8.2.4.26 After DOCTYPE name state

2008-12-23 Thread Edward Z. Yang
In section 8.2.4.26 the spec says: If the next six characters are an ASCII case-insensitive match for the word PUBLIC, then consume those characters and switch to the before DOCTYPE public identifier state. The P has already been consumed at the beginning of this section. Thus, I believe it
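A minimal sketch of the off-by-one described above, assuming a tokenizer that holds the input as a string plus an index. The helper names `ascii_lower` and `ascii_ci_match` are hypothetical and do not come from the spec. It only shows why a six-character comparison fails once the P has been consumed; it does not say which wording the spec should adopt.

```python
def ascii_lower(ch):
    """Lowercase A-Z only, leaving every other character untouched."""
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch


def ascii_ci_match(text, pos, word):
    """True if the characters starting at pos match word, ignoring ASCII case."""
    chunk = text[pos:pos + len(word)]
    if len(chunk) != len(word):
        return False  # not enough input left to match
    return all(ascii_lower(a) == ascii_lower(b) for a, b in zip(chunk, word))


doctype_tail = 'public "-//W3C//DTD HTML 4.01//EN"'

# Before anything is consumed, six characters match PUBLIC.
before_consuming = ascii_ci_match(doctype_tail, 0, "PUBLIC")

# After the P has been consumed (index 1), the next six characters do not,
# but the next five do match the remainder of the word.
six_after_p = ascii_ci_match(doctype_tail, 1, "PUBLIC")
five_after_p = ascii_ci_match(doctype_tail, 1, "UBLIC")
```

Comparing from the current character (index 0) or comparing five characters from the next one are both consistent readings; the sketch takes no position on which is intended.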

[whatwg] 8.2.4.4 Close tag open state

2008-12-22 Thread Edward Z. Yang
The condition here is really long. Is there any way we can make it shorter? Cheers, Edward

[whatwg] 8.2.4.37: EOF handling

2008-12-22 Thread Edward Z. Yang
Hello all, I think EOF should be handled explicitly in the states after we Consume the U+0023 NUMBER SIGN, since the spec as it stands right now implies that there will always be another character after the number sign. Or am I being a little redundant? Cheers, Edward
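One way to picture the concern: a tokenizer that looks at the character after U+0023 NUMBER SIGN needs a defined result when the input ends right there. The sketch below assumes a hypothetical `EOF` sentinel and `peek` helper (neither name is from the spec) and treats end of input as a distinct value rather than reading past the end of the buffer.

```python
# Hypothetical sentinel standing in for "end of file"; not a name from the spec.
EOF = None


def peek(text, pos):
    """Return the character at pos, or EOF when pos is past the end of input."""
    return text[pos] if pos < len(text) else EOF


def classify_after_number_sign(text, pos):
    """Classify what follows a '#' located at text[pos].

    Returns 'eof' at end of input, 'hex' for x or X, 'decimal' for an
    ASCII digit, and 'other' for anything else.
    """
    if peek(text, pos) != "#":
        raise ValueError("expected '#' at pos")
    nxt = peek(text, pos + 1)
    if nxt is EOF:
        return "eof"
    if nxt in ("x", "X"):
        return "hex"
    if nxt in "0123456789":
        return "decimal"
    return "other"
```

For example, `classify_after_number_sign("&#", 1)` reports `'eof'` instead of raising an index error, which is the explicit handling the message asks about.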

Re: [whatwg] 8.2.4.37: EOF handling

2008-12-22 Thread Edward Z. Yang
Philip Taylor wrote: EOF is always treated as if it were a character, e.g. lots of places say Consume the next input character: ... EOF - ... Reconsume the EOF character in the data state. That seems fair, although most implementations won't have an actual end of file character; they'll be

[whatwg] Consuming ampersands

2008-12-22 Thread Edward Z. Yang
Hello all, When I'm consuming a character reference, when does the ampersand get consumed? This doesn't seem to be obvious from the documentation, which talks of consuming character references and number hash signs, but never the ampersand. Cheers, Edward

[whatwg] Minor typo in 8.2.4.37

2008-12-22 Thread Edward Z. Yang
in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF , 0xFDD0 to 0xFDDF in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 0x009F, 0xD800 to 0xDFFF , 0xFDD0 to 0xFDDF in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F

[whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Edward Z. Yang
I am currently working on a PHP5 implementation of the HTML5 specification. PHP has abysmal Unicode support, and implementing Unicode streams in userspace may be unacceptably slow. Thus, my questions: 1. Given an input stream that is known to be valid UTF-8, is it possible to implement the
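As background for the byte-wise question, here is a small sketch (in Python rather than PHP, and with hypothetical function names) of one property of valid UTF-8 that is relevant: every byte of a multi-byte sequence has its high bit set, so a byte below 0x80 always stands for an ASCII character. Scanning the raw bytes for an ASCII delimiter such as `<` therefore finds the same positions as decoding first. This illustrates that property only; it is not a claim about the whole tokenizer.

```python
def find_ascii_bytewise(data, ch):
    """Byte offsets of ASCII character ch in UTF-8 encoded bytes."""
    target = ord(ch)
    if target >= 0x80:
        raise ValueError("ch must be an ASCII character")
    return [i for i, b in enumerate(data) if b == target]


def find_ascii_by_code_point(text, ch):
    """Same offsets, computed by walking decoded code points instead."""
    offsets = []
    byte_pos = 0
    for c in text:
        if c == ch:
            offsets.append(byte_pos)
        byte_pos += len(c.encode("utf-8"))  # width of this code point in bytes
    return offsets


sample = "caf\u00e9 <b>\u65e5\u672c\u8a9e</b>"
bytewise = find_ascii_bytewise(sample.encode("utf-8"), "<")
decoded = find_ascii_by_code_point(sample, "<")
```

Both scans agree on where the `<` characters sit, even though the sample mixes two-byte and three-byte sequences.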

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-16 Thread Edward Z. Yang
Ian Hickson wrote: Mostly, yes. (There are exceptions, but they're not things you'd really want to be using anyway, e.g. obscure SGML features.) Are these exceptions, by any chance, documented somewhere? Cheers, Edward

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Geoffrey Sneddon wrote: If you do start work on a PHP implementation, please do seriously consider adding it to the html5lib project (which currently contains Python and Ruby implementations) as MIT licensed — there are also a fair number of test cases there. I'd be quite interested in

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote: In general you should be able to just implement what the spec says and then either leave the HTML5 support in (it's unlikely to cause any harm) or just comment out the support for the new elements, that should be relatively easy. Right, this is mostly what I intended to

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
James Graham wrote: Nothing in section 8 is going to ensure that you get output that passes a conformance check. If you do transform the output into something that is conforming then you have to make up the rules yourself Yes, which I suppose is slightly concerning. My philosophy is to first

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote: I don't really see why a sanitiser needs extensibility though. Could you elaborate on this? Surely you just want to filter anything that isn't valid or safe, and only leave the valid safe stuff, using a whitelist. In theory, I could write separate sanitizers for HTML 4,

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote: Oh well that's just a matter of having pluggable modules for different things to filter. You can equally support SVG and MathML in this way. You just need the core processing to be made independent of the filtering. I just realized an error in my thought that I would need

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote: I'm not saying don't be standards-compliant; I'm just saying use a subset of HTML5 that you feel comfortable with (which might also be a subset of HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have to worry about exactly which version you want to

[whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Hello all, I was curious to know how stable/complete HTML 5's tokenizing and DOM algorithms are (specifically section 8). A cursory glance through the section reveals a few red warning boxes, but these are largely issues of whether or not the specification should follow browser implementations,

Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Anne van Kesteren wrote: Could you explain what is not sufficient about the Parsing HTML fragments section: I must admit, I had not seen that section! That seems to be quite sufficient. My bad. :o) Are there any specific differences that pose problems? Not that I know of yet, since I

Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web

2008-09-30 Thread Edward Z. Yang
Michal Zalewski wrote: More importantly, since the dictionary of possible inputs is rather limited, it would be pretty trivial to build a dictionary of site - hash pairs and crack the values. May protect xyzzy2984.eur.int.example.com, but would still reveal to me you are coming from

Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web

2008-09-30 Thread Edward Z. Yang
Michal Zalewski wrote: Not really? I just need to rebuild my dictionary for that salt, but to check against say a million or ten million of common domains, it wouldn't be very expensive. And it's not very expensive to build such a list of domains, too. In that case, you are certainly correct;
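The dictionary attack described in the quoted text can be sketched as follows. The hashing scheme here (SHA-256 over the salt concatenated with the domain) and the names `salted_hash` and `crack` are assumptions made for illustration; the thread does not specify a scheme. The point is only that a known salt and a modest list of candidate domains are enough to reverse the hash.

```python
import hashlib


def salted_hash(salt, domain):
    """Assumed scheme: hex SHA-256 of salt + domain (not from the thread)."""
    return hashlib.sha256((salt + domain).encode("utf-8")).hexdigest()


def crack(salt, observed_hash, candidate_domains):
    """Return the first candidate whose salted hash matches, else None."""
    for domain in candidate_domains:
        if salted_hash(salt, domain) == observed_hash:
            return domain
    return None


# Illustrative inputs only; a real list would hold millions of common domains.
common_domains = ["example.com", "example.org", "example.net"]
salt = "per-request-salt"
leaked = salted_hash(salt, "example.org")
recovered = crack(salt, leaked, common_domains)
```

The cost is one hash per candidate per salt, so changing the salt forces a rebuild of the dictionary but does not make the search expensive.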

Re: [whatwg] Can var possibly work?

2008-09-20 Thread Edward Z. Yang
Ozob the Great wrote: Then var steps on MathML's toes: It duplicates functionality. Not necessarily; a program variable should certainly not be marked up with MathML.

Re: [whatwg] The iframe element and sandboxing ideas

2008-07-25 Thread Edward Z. Yang
Warning: This is going to be a little bit of an HTML Purifier evangelising post. Frode Børli wrote: Yeah, I thought about that also. Then we have more complex attributes such as style='font-family: expression&#40;a+5&#41;;'... So your sanitizer must also parse CSS properly - including unescaping

[whatwg] Pre, code and semantics in HTML5: Wishful thinking?

2008-06-22 Thread Edward Z. Yang
Thanks for reading, Edward P.S. Please CC my address on all replies. -- Edward Z. Yang GnuPG: 0x869C48DA HTML Purifier http://htmlpurifier.org Anti-XSS Filter [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]