Re: [whatwg] Parsing, syntax, and content model feedback

2008-12-25 Thread Edward Z. Yang
Ian Hickson wrote:
 On Mon, 22 Dec 2008, Edward Z. Yang wrote:
 in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to 
 0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF

 U+000B is not a range.
 
 While this is technically true, I don't really see a better way to phrase 
 this that isn't verbose (e.g. ranges and codepoints or some such).
 
 If it helps, consider the whole set of subranges and code points to be a 
 single discontinuous range, hence the use of the singular range. :-)

The spec made me double-take when I read it (since it fairly clearly
separates range from codepoints). Also, I messed up the copypaste while
quoting, so the text I cited is not actually what's there, it's:

 in the ranges U+0001 to U+0008,  U+000B,  U+000E to U+001F,  U+007F  to 
 U+009F, U+D800 to U+DFFF, U+FDD0 to U+FDDF, and characters U+FFFE...

It seems fairly clear to me that U+000B should be moved to the list of
characters (at the cost of the nice ordering) or we should collapse
ranges/characters into one range.
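
A minimal Python sketch of the "collapse into one range" option, treating the
lone U+000B as a one-element subrange of a single discontinuous set. The names
DISALLOWED_RANGES and is_disallowed are made up for illustration, and the table
only covers the code points quoted above; the spec text continues after
U+FFFE, and those entries are deliberately left out here.

```python
# Inclusive (low, high) pairs taken from the spec text quoted above.
# U+000B becomes a range of length one, so "range" stays singular.
# Assumption: the entries after U+FFFE are omitted because the quote is cut off.
DISALLOWED_RANGES = [
    (0x0001, 0x0008),
    (0x000B, 0x000B),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xD800, 0xDFFF),
    (0xFDD0, 0xFDDF),
]


def is_disallowed(code_point):
    """Return True if code_point falls in any of the listed subranges."""
    return any(low <= code_point <= high for low, high in DISALLOWED_RANGES)
```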

 On Tue, 23 Dec 2008, Edward Z. Yang wrote:
 You're still checking the next input character at that point, so P is 
 still the next input character, so the next six are PUBLIC.
 
 At least, that's how I'm defending what the spec says. :-)

The spec is pretty unambiguous about this:

 The next input character is the first character in the input stream that has 
 not yet been consumed. Initially, the next input character is the first 
 character in the input.

and, at the beginning of the section:

 Consume the next input character:

So, the spec is wrong.

 In practice I think having the text be clear (PUBLIC) is less confusing 
 than having it be pedantic (P and UBLIC or this and the next five or 
 some such). It's not like people are going to assume the spec is allowing 
 XPUBLIC or *PUBLIC and so forth, right?

I understand this consideration, and there are several ways we could go
about doing this. I think the easiest would be to un-consume a
character, and then perform the checks, and then reconsume the character.

As for people making this mistake... well, you're looking at one. :-)

Cheers,
Edward

(accidentally emailed only Ian; re-sending to WHATWG list)


[whatwg] Error in 8.2.4.26 After DOCTYPE name state

2008-12-23 Thread Edward Z. Yang
In section 8.2.4.26 the spec says:

 If the next six characters are an ASCII case-insensitive match for the
 word PUBLIC, then consume those characters and switch to the before
 DOCTYPE public identifier state.

The P has already been consumed at the beginning of this section. Thus,
I believe it should read:

If this character and the next five characters are an ASCII
case-insensitive match for the word PUBLIC, etc.

Same goes for the match for SYSTEM.
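
To make the off-by-one concrete, here is a small Python sketch of the check as
proposed: the comparison starts at the character that was just consumed and
covers the five after it. The function names and the index-based interface are
assumptions for illustration, not anything the spec defines.

```python
def ascii_lower(text):
    """Lowercase A-Z only; str.lower() would also fold non-ASCII letters."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def current_and_next_match(data, current_index, keyword):
    """ASCII case-insensitive match starting AT the consumed character.

    current_index is the position of the character that the state has
    already consumed (the "P" of PUBLIC), not the next input character.
    """
    candidate = data[current_index:current_index + len(keyword)]
    return ascii_lower(candidate) == ascii_lower(keyword)
```

With data = 'html pUbLiC "x"' and current_index = 5, the call for "PUBLIC"
matches; starting one character later (the spec's literal "next six
characters") it does not.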

Cheers,
Edward


[whatwg] 8.2.4.4 Close tag open state

2008-12-22 Thread Edward Z. Yang
The condition here is really long. Is there any way we can make it
shorter?

Cheers,
Edward


[whatwg] 8.2.4.37: EOF handling

2008-12-22 Thread Edward Z. Yang
Hello all,

I think EOF should be handled explicitly in the states after we Consume
the U+0023 NUMBER SIGN, since the spec as it stands right now implies
that there will always be another character after the number sign. Or am
I being a little redundant?

Cheers,
Edward


Re: [whatwg] 8.2.4.37: EOF handling

2008-12-22 Thread Edward Z. Yang
Philip Taylor wrote:
 EOF is always treated as if it were a character, e.g. lots of places
 say Consume the next input character: ... EOF - ... Reconsume the
 EOF character in the data state. 

That seems fair, although most implementations won't have an actual end
of file character; they'll be checking their string index to see if
they've gone out of bounds. But the spec is internally consistent (I'm
just used to seeing an EOF special case on almost every state).
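
A rough Python sketch of how an index-based implementation can still present
EOF "as if it were a character". The InputStream class and the EOF sentinel
are hypothetical names, not taken from any existing implementation.

```python
EOF = object()  # sentinel standing in for the spec's "EOF character"


class InputStream:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def consume(self):
        """Return the next character, or EOF once the index is out of bounds."""
        if self.pos >= len(self.text):
            return EOF
        ch = self.text[self.pos]
        self.pos += 1
        return ch

    def unconsume(self, ch):
        """Push back the last consumed item so it can be reconsumed."""
        # EOF never advanced the index, so there is nothing to undo for it.
        if ch is not EOF:
            self.pos -= 1
```

Because consume() keeps returning EOF at the end of input, "reconsume the EOF
character in the data state" needs no special case in the caller.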

Thanks,
Edward


[whatwg] Consuming ampersands

2008-12-22 Thread Edward Z. Yang
Hello all,

When I'm consuming a character reference, when does the ampersand get
consumed? This doesn't seem to be obvious from the documentation, which
talks of consuming character references and number hash signs, but never
the ampersand.

Cheers,
Edward


[whatwg] Minor typo in 8.2.4.37

2008-12-22 Thread Edward Z. Yang
 in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to
0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF

U+000B is not a range.

Cheers,
Edward


[whatwg] Byte-wise tokenization algorithm

2008-12-20 Thread Edward Z. Yang
I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:

1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the tokenization algorithm with byte-wise operations only?
I think it's possible, since all of the character matching parts of the
algorithm map to characters in ASCII space.

2. Would such an implementation be conforming?
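
The property behind question 1 can be checked with a short Python sketch: in
valid UTF-8, every byte of a multi-byte sequence is 0x80 or above, so an ASCII
byte such as 0x3C ("<") can never appear inside an encoded non-ASCII
character. The two helper functions below are hypothetical and only compare a
byte-wise scan with a character-wise one; they say nothing about conformance.

```python
def find_tag_opens_bytes(data):
    """Byte offsets of every '<' found by scanning the raw UTF-8 bytes."""
    return [i for i, b in enumerate(data) if b == 0x3C]


def find_tag_opens_chars(text):
    """The same byte offsets, computed from the decoded characters."""
    offsets = []
    byte_pos = 0
    for ch in text:
        if ch == "<":
            offsets.append(byte_pos)
        # Advance by the encoded width of this character.
        byte_pos += len(ch.encode("utf-8"))
    return offsets
```

For any valid UTF-8 input the two functions return the same offsets, which is
what makes byte-wise matching of ASCII delimiters safe.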

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-16 Thread Edward Z. Yang
Ian Hickson wrote:
 Mostly, yes. (There are exceptions, but they're not things you'd really 
 want to be using anyway, e.g. obscure SGML features.)

Are these exceptions, by any chance, documented somewhere?

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Geoffrey Sneddon wrote:
 If you do start work on a PHP implementation, please do seriously
 consider adding it to the html5lib project (which currently contains
 Python and Ruby implementations) as MIT licensed — there are also a fair
 number of test cases there.

I'd be quite interested in reusing the html5lib test-cases, but I prefer
to do my development on Git which means that it won't be hosted on
Google Code. This might be a winter break project for me.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 In general you should be able to just implement what the spec says and 
 then either leave the HTML5 support in (it's unlikely to cause any harm) 
 or just comment out the support for the new elements, that should be 
 relatively easy.

Right, this is mostly what I intended to do. But from what I can tell,
there's a difference between the design philosophies of HTML 5 and XHTML
2.0; XHTML tries to make everything extensible and able to be imported
from other places, while HTML 5 attempts to document what exists, and
then make sensible additions as necessary. HTML 5 pragmatism makes sense
for a user-agent, but the XHTML extensibility is useful for a sanitizer,
which doesn't actually have to render anything and needs to support
multiple dialects and variants.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
James Graham wrote:
 Nothing in section 8 is going to ensure that you get output that passes
 a conformance check. If you do transform the output into something that
 is conforming then you have to make up the rules yourself

Yes, which I suppose is slightly concerning. My philosophy is to first
reconstruct the DOM as closely as possible to what browsers build, and
then for non-compliant DOMs move things around so they become compliant
but *look* the same as the non-compliant DOM.

 so you have
 just shifted the ambiguity from the client (where it will hopefully
 disappear in a few years once the HTML5 algorithm has large-scale
 adoption) to the sanitizer implementation.

I feel like this is preferable in many cases. There's only one sanitizer
implementation to worry about, as opposed to many browser
implementations. Also, the sanitizer can transparently add cross-browser
compatibility code for poorly supported elements.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 I don't really see why a sanitiser needs extensibility though. Could you 
 elaborate on this? Surely you just want to filter anything that isn't 
 valid or safe, and only leave the valid safe stuff, using a whitelist.

In theory, I could write separate sanitizers for HTML 4, XHTML 1.0,
XHTML 2.0, HTML 5, etc. In practice, I want to reuse as much code as
possible between these cases, since I'm a lazy developer. Perhaps
extensibility is not the right word here; it's more like reusability
of components.

A side-note: something we've been looking into is bolting on extensions
to the HTML language. A user might write something in HTML 5, but the
website is in HTML 4, so the sanitizer converts the HTML 5 into a more
ugly but functional HTML 4 version, and returns that. The future, today!

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 Oh well that's just a matter of having pluggable modules for different 
 things to filter. You can equally support SVG and MathML in this way. You 
 just need the core processing to be made independent of the filtering.

I just realized an error in my thought that I would need to modify the
parsing algorithm; that would only be the case if I tried to integrate
filtering with the core processing. If it's a two-stage process, the
core processing merely has special rules for certain elements embedded
in it, but otherwise acts normally. Performance *is* an issue (getting
things to be standards compliant is relatively CPU/memory intensive),
but getting things to work is first.

 I wouldn't really worry about 4 vs 5. What matters is what works in 
 browsers, or whatever tools your users are using. (This is one reason in 
 HTML5 we do away with having the version number in the DOCTYPE.) I'd 
 recommend just using the HTML5 DOCTYPE and then filtering the content to 
 be whatever you want it to be.

HTML Purifier puts a high value on standards-compliance, and we've been
attacked on several occasions because of it: "Standards suck." To this I
have to say, standards compliance has helped defend against a number of
XSS attacks--enforcing it lowers attack surface and makes behavior much
more well-defined. So I feel like it's a goal worth striving for, in and
of itself, especially since you can't enforce semantics with computers.

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-15 Thread Edward Z. Yang
Ian Hickson wrote:
 I'm not saying don't be standards-compliant; I'm just saying use a subset 
 of HTML5 that you feel comfortable with (which might also be a subset of 
 HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have 
 to worry about exactly which version you want to follow).

Sounds good, since HTML4 is a strict subset of HTML5 (correct me if I'm
wrong?)



[whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Hello all,

I was curious to know how stable/complete HTML 5's tokenizing and DOM
algorithms are (specifically section 8). A cursory glance through the
section reveals a few red warning boxes, but these are largely issues of
whether or not the specification should follow browser implementations,
and not actual errors in the specification.

The reason I'd like to know this is because I am the author of a tool
named HTML Purifier, which takes user-input HTML and cleans it for
standards-compliance as well as XSS. We insist on output being standards
compliant, because the result is unambiguous.

As far as I can tell, this is quite unlike the tools that HTML5 is
geared towards: compliance checkers, user agents and data miners. There
certainly is overlap: we have our own parsing and DOM-building
algorithms which work decently well, although they do trip up on a
number of edge-cases (active formatting elements being one notable
example). However, using the HTML5 algorithm wholesale is not possible
for several reasons:

1. Users input HTML fragments, not actual HTML documents. A parser I
would use needs to be able to enter parsing in a specific state, and has
to ignore any requests by the user to exit that state (e.g. a </body> tag).

2. No one actually codes their HTML in HTML5 (yet), so the only parts of
the algorithm I want to use are the ones that are emulating browser
behavior with HTML4. However, HTML5 interweaves its additions with the
browser research it has done.

I'd be really interested to hear what you all have to say about this
matter. Thanks!

Cheers,
Edward


Re: [whatwg] Stability of tokenizing/dom algorithms

2008-12-14 Thread Edward Z. Yang
Anne van Kesteren wrote:
 Could you explain what is not sufficient about the Parsing HTML
 fragments section:

I must admit, I had not seen that section! That seems to be quite
sufficient. My bad. :o)

 Are there any specific differences that pose problems?

Not that I know of yet, since I haven't started on an implementation
yet. Which brings me back to my original question: how stable is section
8? I would rather not be chasing a moving target.

Cheers,
Edward



Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web

2008-09-30 Thread Edward Z. Yang
Michal Zalewski wrote:
 More importantly, since the dictionary of possible inputs is rather
 limited, it would be pretty trivial to build a dictionary of site -
 hash pairs and crack the values. May protect
 xyzzy2984.eur.int.example.com, but would still reveal to me you are
 coming from playboy.com.

Salt it. Problem solved.


Re: [whatwg] Dealing with UI redress vulnerabilities inherent to the current web

2008-09-30 Thread Edward Z. Yang
Michal Zalewski wrote:
 Not really? I just need to rebuild my dictionary for that salt, but to
 check against say a million or ten million of common domains, it
 wouldn't be very expensive. And it's not very expensive to build such a
 list of domains, too.

In that case, you are certainly correct; adding a salt only hinders an
attacker. But if we're worried about Origin giving away a secret
intranet website, I think things should be reasonable. Of course, they
can still dictionary brute-force it...
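
A small Python sketch of the exchange above, assuming the salt is visible to
the attacker (as in Michal's scenario) and using SHA-256 purely as a stand-in
hash. The function names and the candidate list are invented for illustration.

```python
import hashlib


def salted_hash(salt, origin):
    """Hash an origin together with a salt the attacker is assumed to know."""
    return hashlib.sha256((salt + origin).encode("utf-8")).hexdigest()


def crack(observed_hash, salt, candidates):
    """Dictionary attack: rehash every candidate domain with the known salt."""
    for origin in candidates:
        if salted_hash(salt, origin) == observed_hash:
            return origin
    return None
```

A common domain in the candidate list is recovered after one pass over the
dictionary, while an unguessable intranet host name simply never shows up in
it. The salt only forces the attacker to redo that pass per salt.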

(whoops, forgot to CC list)


Re: [whatwg] Can var possibly work?

2008-09-20 Thread Edward Z. Yang
Ozob the Great wrote:
 Then var steps on MathML's toes: It duplicates functionality.

Not necessarily; a program variable should certainly not be marked up
with MathML.


Re: [whatwg] The iframe element and sandboxing ideas

2008-07-25 Thread Edward Z. Yang
Warning: This is going to be a little bit of an HTML Purifier
evangelising post.

Frode Børli wrote:
 Yeah, I thought about that also. Then we have more complex attributes
 such as style='font-family: expression&#40;a+5&#41;;'... So your
 sanitizer must also parse CSS properly - including unescaping
 entities.

The way HTML Purifier handles this is unescaping all entities (hex, dec
and named) before handling HTML. Output text is always in UTF-8 and thus
never has entities.

Also, it should be noted that &#40; is HTML escaping, not CSS escaping.
CSS has its own set of escaping syntax. HTML Purifier handles that too.
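
The decode-first idea can be sketched with Python's standard html.unescape,
which resolves decimal, hex and named character references. This is not HTML
Purifier's code (that library is PHP); decode_attribute and the naive
"expression(" check are assumptions made only to show why decoding has to
happen before any inspection.

```python
import html


def decode_attribute(value):
    """Resolve HTML character references before looking at the value."""
    return html.unescape(value)


def looks_like_css_expression(value):
    """Toy check, run on the decoded text rather than on the raw markup."""
    return "expression(" in decode_attribute(value).lower()
```

On the raw string 'font-family: expression&#40;a+5&#41;;' a substring search
for "expression(" finds nothing; after decoding, the value reads
'font-family: expression(a+5);' and the check fires. CSS escapes are a
separate layer and are not handled by this sketch.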

 For all I know - a future invention may introduce a new method of
 encoding entities also, so your sanitizer must support all future
 entity encodings.

I don't know what you really mean by this, but by converting entities to
characters this is not a problem.

 Of course we can skip supporting the style attribute - but there are
 not many other ways to style content in XHTML.

Style attribute is supported.

 A bank wants an HTML-messaging system where the customer can write
 HTML-based messages to customer support through the online banking
 system. Customer support personnel have access to perform transactions
 worth millions of dollars through the intranet web interface (where
 they also receive HTML-based messages from customers).

A few problems with this theoretical situation:

1. Why does the bank need an HTML messaging system?
2. Why is this system on the same domain as the intranet web interface?
3. Why do customer support personnel have access to the transaction
interface?

But whatever, it's not really relevant to the topic at hand.

 Security depends on a perfect sanitizer. Would you sell your
 sanitizer to this bank without any disclaimers, and say that your
 sanitizer will be valid for eternity and for all browsers that the
 bank decides to use internally in the future?

Well, it's an open-source sanitizer. But that aside, say I was selling
them a support contract: I would not say "valid for eternity". However,
I would be very confident that a bug would be more likely than a future
browser breaking the sanitizer. And the reason I say this is because of
the principle of backwards-compatibility: my sanitizer only allows
HTML/CSS that has well-defined behavior by all current browsers.
colspan=expr(3+4) is theoretically valid and safe HTML, but it doesn't
have well-defined behavior with browsers, so it is sanitized out.
colspan=4 is well-defined, valid and safe, and unless a browser
decides 4 is a magic number that should trigger the execution of
JavaScript code in a nearby node, it's safe.
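
The whitelist principle for colspan can be sketched in a few lines of Python.
This is an illustrative stand-in, not HTML Purifier's actual rule:
sanitize_colspan and the positive-integer pattern are assumptions.

```python
import re

# Only plain ASCII digits with no leading zero are accepted.
_POSITIVE_INT = re.compile(r"[1-9][0-9]*")


def sanitize_colspan(value):
    """Return the cleaned value if it is a positive integer, else None."""
    value = value.strip()
    if _POSITIVE_INT.fullmatch(value):
        return value
    return None  # anything not on the whitelist is dropped
```

Here sanitize_colspan("4") is kept, while sanitize_colspan("expr(3+4)") is
dropped, regardless of whether some future browser would give it meaning.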

 Today I would not allow HTML-based messages since I could never be
 sure enough that the sanitizer was perfect.

I encourage you to try out HTML Purifier http://htmlpurifier.org. It's
certainly not perfect (we've had a total of two security problems with
the core code (three if you count a Shift_JIS related vulnerability, and
four if you count an XSS vulnerability in a testing script for the
library)), but I hope it certainly approaches it.


[whatwg] Pre, code and semantics in HTML5: Wishful thinking?

2008-06-22 Thread Edward Z. Yang
I was reading through the HTML5 spec the other day and I noticed this
tidbit:

 To represent a block of computer code, the pre element can be used
 with a code element; to represent a block of computer output the pre
 element can be used with a samp element. Similarly, the kbd element
 can be used within a pre element to indicate text that the user is to
 enter.

The implication is that document authors are recommended to use
<pre><code> to wrap all of their programming code instead of a lone
<pre>, if they wish to be fully semantic. This feels needlessly verbose
and abusive of <code>, which traditionally has been used to mark
single-liners.

It also makes it extremely difficult to style <pre> as a block for code,
as the only semantic indication that the contents of the <pre> block are
computer code is its child. You'd end up having to say <pre
class="code"><code> if you wanted to style <pre> as well.

At the same time, I still think the semantics of whether or not a pre
tag indicates a plaintext file, or a piece of ASCII art, or computer
code, is somewhat important. However, I think this information would be
more appropriately given as an attribute.

Thanks for reading,
Edward

P.S. Please CC my address on all replies.

--
 Edward Z. Yang    GnuPG: 0x869C48DA
 HTML Purifier http://htmlpurifier.org Anti-XSS Filter
 [[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]