Ian Hickson wrote:
On Mon, 22 Dec 2008, Edward Z. Yang wrote:
in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to
0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF
U+000B
In section 8.2.4.26 the spec says:
If the next six characters are an ASCII case-insensitive match for the
word PUBLIC, then consume those characters and switch to the before
DOCTYPE public identifier state.
The P has already been consumed at the beginning of this section. Thus,
I believe it
The condition here is really long. Is there any way we can make it
shorter?
Cheers,
Edward
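
If the long condition meant here is the code point range list quoted earlier in this listing (the snippet does not say, so that is an assumption), one way an implementation might shorten the check is to keep the ranges in a table and test membership in one expression. A minimal Python sketch; the names DISALLOWED_RANGES and is_disallowed are made up for illustration, and the lower bound of the first range is missing from the quoted text, so 0x0001 below is only a placeholder:

```python
# Inclusive (low, high) code point ranges taken from the quoted list.
# ASSUMPTION: the quoted text reads "0x to 0x0008", so the lower bound of
# the first range is unknown; 0x0001 is used here purely as a placeholder.
DISALLOWED_RANGES = (
    (0x0001, 0x0008),
    (0x000B, 0x000B),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xD800, 0xDFFF),
    (0xFDD0, 0xFDDF),
)


def is_disallowed(code_point):
    """Return True if code_point falls inside any of the listed ranges."""
    return any(low <= code_point <= high for low, high in DISALLOWED_RANGES)
```

This only shortens the implementation, not the wording of the spec; the table still has to be kept in step with whatever the spec lists.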
Hello all,
I think EOF should be handled explicitly in the states after we Consume
the U+0023 NUMBER SIGN, since the spec as it stands right now implies
that there will always be another character after the number sign. Or am
I being a little redundant?
Cheers,
Edward
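
To make the question concrete, here is a rough Python sketch of what explicit EOF handling after the number sign could look like. It is not the spec's algorithm, only an illustration: the EOF sentinel, peek and consume_numeric_reference are hypothetical names, and the function expects to be called with the position of the "#".

```python
EOF = None  # sentinel standing in for "end of file" (hypothetical)


def peek(text, pos):
    """Return the character at pos, or EOF when pos is past the end."""
    return text[pos] if pos < len(text) else EOF


def consume_numeric_reference(text, pos):
    """Try to consume a numeric reference starting at the '#' at pos.

    Returns (code_point, new_pos) on success, or (None, pos) with
    nothing consumed when the input is not a numeric reference.
    """
    assert peek(text, pos) == "#"
    start = pos
    pos += 1
    ch = peek(text, pos)
    if ch is EOF:
        # Explicit EOF case: nothing follows the number sign.
        return None, start
    base, digits = 10, "0123456789"
    if ch in ("x", "X"):
        base, digits = 16, "0123456789abcdefABCDEF"
        pos += 1
        if peek(text, pos) is EOF:
            # Explicit EOF case: nothing follows "#x".
            return None, start
    first_digit = pos
    while peek(text, pos) is not EOF and peek(text, pos) in digits:
        pos += 1
    if pos == first_digit:
        # No digits at all: not a character reference.
        return None, start
    value = int(text[first_digit:pos], base)
    if peek(text, pos) == ";":
        pos += 1  # the trailing semicolon is consumed when present
    return value, pos
```

In this sketch the two explicit EOF checks are indeed redundant, because the "no digits" check would catch the same inputs; whether to spell them out is a readability choice.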
Philip Taylor wrote:
EOF is always treated as if it were a character, e.g. lots of places
say "Consume the next input character: ... EOF - ... Reconsume the
EOF character in the data state."
That seems fair, although most implementations won't have an actual end
of file character; they'll be
Hello all,
When I'm consuming a character reference, when does the ampersand get
consumed? This doesn't seem to be obvious from the documentation, which
talks of consuming character references and number hash signs, but never
the ampersand.
Cheers,
Edward
in the range 0x to 0x0008, U+000B, U+000E to 0x001F, 0x007F to
0x009F, 0xD800 to 0xDFFF, 0xFDD0 to 0xFDDF
I am currently working on a PHP5 implementation of the HTML5
specification. PHP has abysmal Unicode support, and implementing Unicode
streams in userspace may be unacceptably slow. Thus, my questions:
1. Given an input stream that is known to be valid UTF-8, is it possible
to implement the
Ian Hickson wrote:
Mostly, yes. (There are exceptions, but they're not things you'd really
want to be using anyway, e.g. obscure SGML features.)
Are these exceptions, by any chance, documented somewhere?
Cheers,
Edward
Geoffrey Sneddon wrote:
If you do start work on a PHP implementation, please do seriously
consider adding it to the html5lib project (which currently contains
Python and Ruby implementations) as MIT licensed — there are also a fair
number of test cases there.
I'd be quite interested in
Ian Hickson wrote:
In general you should be able to just implement what the spec says and
then either leave the HTML5 support in (it's unlikely to cause any harm)
or just comment out the support for the new elements, that should be
relatively easy.
Right, this is mostly what I intended to
James Graham wrote:
Nothing in section 8 is going to ensure that you get output that passes
a conformance check. If you do transform the output into something that
is conforming then you have to make up the rules yourself
Yes, which I suppose is slightly concerning. My philosophy is to first
Ian Hickson wrote:
I don't really see why a sanitiser needs extensibility though. Could you
elaborate on this? Surely you just want to filter anything that isn't
valid or safe, and only leave the valid safe stuff, using a whitelist.
In theory, I could write separate sanitizers for HTML 4,
Ian Hickson wrote:
Oh well that's just a matter of having pluggable modules for different
things to filter. You can equally support SVG and MathML in this way. You
just need the core processing to be made independent of the filtering.
I just realized an error in my thought that I would need
Ian Hickson wrote:
I'm not saying don't be standards-compliant; I'm just saying use a subset
of HTML5 that you feel comfortable with (which might also be a subset of
HTML4, for that matter, just with the HTML5 DOCTYPE so that you don't have
to worry about exactly which version you want to
Hello all,
I was curious to know how stable/complete HTML 5's tokenizing and DOM
algorithms are (specifically section 8). A cursory glance through the
section reveals a few red warning boxes, but these are largely issues of
whether or not the specification should follow browser implementations,
Anne van Kesteren wrote:
Could you explain what is not sufficient about the Parsing HTML
fragments section:
I must admit, I had not seen that section! That seems to be quite
sufficient. My bad. :o)
Are there any specific differences that pose problems?
Not that I know of yet, since I
Michal Zalewski wrote:
More importantly, since the dictionary of possible inputs is rather
limited, it would be pretty trivial to build a dictionary of site -
hash pairs and crack the values. May protect
xyzzy2984.eur.int.example.com, but would still reveal to me you are
coming from
Michal Zalewski wrote:
Not really? I just need to rebuild my dictionary for that salt, but to
check against say a million or ten million of common domains, it
wouldn't be very expensive. And it's not very expensive to build such a
list of domains, too.
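
The point about salts can be illustrated with a short Python sketch. The thread does not say which hash function or salting scheme is under discussion, so SHA-256 over salt plus domain is an assumption here, and hash_domain, build_dictionary and the sample domains are made up for the example:

```python
import hashlib


def hash_domain(domain, salt=""):
    """Hash a domain name with an optional salt prefix (assumed scheme)."""
    return hashlib.sha256((salt + domain).encode("utf-8")).hexdigest()


def build_dictionary(domains, salt=""):
    """Map hash -> domain for every candidate domain under the given salt."""
    return {hash_domain(d, salt): d for d in domains}


# A tiny stand-in for "a million or ten million common domains".
common = ["example.com", "example.org", "example.net"]

# What the attacker observes: a salted hash, with the salt known.
observed = hash_domain("example.org", salt="s3")

# Rebuilding the dictionary for that salt is one pass over the list.
lookup = build_dictionary(common, salt="s3")
recovered = lookup.get(observed)
```

A domain that is not in the candidate list, such as xyzzy2984.eur.int.example.com, simply fails the lookup, which matches the "may protect" remark in the earlier message.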
In that case, you are certainly correct;
Ozob the Great wrote:
Then var steps on MathML's toes: It duplicates functionality.
Not necessarily; a program variable should certainly not be marked up
with MathML.
Warning: This is going to be a little bit of an HTML Purifier
evangelising post.
Frode Børli wrote:
Yeah, I thought about that also. Then we have more complex attributes
such as style='font-family: expression&#40;a+5&#41;;'... So your
sanitizer must also parse CSS properly - including unescaping.
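
As a rough illustration of why the unescaping matters, the Python sketch below decodes a style value in two passes before inspecting it. It assumes the attribute in the quoted example used numeric character references for the parentheses; decode_style, unescape_css and looks_dangerous are hypothetical helpers, and a real sanitizer would need an actual CSS parser rather than a substring test.

```python
import html
import re

# A CSS escape: backslash, 1-6 hex digits, optionally one whitespace char.
_CSS_HEX_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})\s?")


def unescape_css(value):
    """Decode CSS hex escapes such as "\\78 " (a simplified sketch)."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Out-of-range values become U+FFFD instead of raising.
        return chr(code_point) if code_point <= 0x10FFFF else "\ufffd"
    return _CSS_HEX_ESCAPE.sub(replace, value)


def decode_style(raw_attribute):
    """Undo HTML character references first, then CSS escapes."""
    return unescape_css(html.unescape(raw_attribute))


def looks_dangerous(raw_attribute):
    """Naive check for expression( after both decoding passes."""
    return "expression(" in decode_style(raw_attribute).lower()


raw = "font-family: expression&#40;a+5&#41;;"
decoded = decode_style(raw)
```

Checking the raw string for "expression(" would miss this value entirely; only the decoded form contains the parentheses.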
Thanks for reading,
Edward
P.S. Please CC my address on all replies.
--
Edward Z. Yang    GnuPG: 0x869C48DA
HTML Purifier http://htmlpurifier.org Anti-XSS Filter
[[ 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA ]]