Re: [whatwg] The problem of duplicate ID as a security issue

2007-06-08 Thread Ian Hickson
On Thu, 7 Jun 2007, Alexey Feldgendler wrote:

 On Thu, 07 Jun 2007 00:42:31 +0200, Ian Hickson [EMAIL PROTECTED] wrote:
 
   IDs in user-supplied content are only useful as fragment identifiers for
   URLs, and mangling them like that defeats this use case because you
   don't know N before you post the comment, and therefore can't make
   internal links within the body (and it's also unobvious when you try to
   make links to parts of your article afterwards).
 
  True. I don't have a good solution to this that doesn't involve code on
  the server-side, though.
 
 Some form of sandboxing would be one.

If sandboxing would solve it then I'll treat this issue as closed and deal 
with the sandboxing problems separately.

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.


Re: [whatwg] CR entities and LFCR

2007-06-08 Thread Anne van Kesteren
On Thu, 07 Jun 2007 23:12:38 +0200, Michael A. Puls II  
[EMAIL PROTECTED] wrote:

On 6/7/07, Anne van Kesteren [EMAIL PROTECTED] wrote:

These should be converted to LF too. One thing that might be interesting
to look into is the handling of LFCR in browsers (as opposed to CRLF). I
haven't done that yet... Some browsers (just tested Opera) also normalize
two newline entities following each other (CRLF pair).


Not sure if it'll help, but whenever I do newline normalization to LF, I:

Convert all CR + LF pairs to LF.
Then, I convert any CRs left over to LF.


Sure, that's what the specification says to do as well. I was wondering if  
some user agents do something special for LFCR. For instance, if I  
remember correctly using \n\r in JavaScript gives a single newline in  
Firefox and two in Opera.
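
The two-step normalization quoted above can be sketched in JavaScript as follows. The function name normalizeNewlines is hypothetical, and the sketch only illustrates the CRLF-then-CR rule; it does not reproduce any particular browser's behavior. Under this rule an LFCR sequence comes out as two newlines, because the LF is kept and the stray CR becomes a second LF.

```javascript
// Hypothetical helper illustrating the two-step rule; not taken from any spec text.
function normalizeNewlines(text) {
  return text
    .replace(/\r\n/g, "\n") // step 1: each CR+LF pair becomes a single LF
    .replace(/\r/g, "\n");  // step 2: any CR still left becomes LF
}

// CRLF collapses to one newline; LFCR ends up as two.
console.log(JSON.stringify(normalizeNewlines("a\r\nb"))); // "a\nb"
console.log(JSON.stringify(normalizeNewlines("a\n\rb"))); // "a\n\nb"
```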



--
Anne van Kesteren
http://annevankesteren.nl/
http://www.opera.com/


Re: [whatwg] The problem of duplicate ID as a security issue

2007-06-08 Thread Alexey Feldgendler

On Fri, 08 Jun 2007 08:13:07 +0200, Ian Hickson [EMAIL PROTECTED] wrote:

True. I don't have a good solution to this that doesn't involve code  
on the server-side, though.



Some form of sandboxing would be one.


If sandboxing would solve it then I'll treat this issue as closed and  
deal with the sandboxing problems separately.


Only some form of sandboxing would solve this, not any form. To solve this  
issue, the sandboxing solution has to meet an additional requirement:  
addressability of content in sandboxes, possibly using a qualified form  
(e.g. URL#sandboxID+innerID).
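
A minimal sketch of how such a qualified fragment could be split, assuming the "+" separator from the example above. The function name and the shape of the returned object are hypothetical; no such addressing scheme is defined anywhere in this thread.

```javascript
// Hypothetical parser for fragments of the form URL#sandboxID+innerID.
function parseQualifiedFragment(url) {
  const hashIndex = url.indexOf("#");
  if (hashIndex === -1) return null; // no fragment at all

  const fragment = url.slice(hashIndex + 1);
  const plusIndex = fragment.indexOf("+");
  if (plusIndex === -1) {
    // Unqualified fragment: an ordinary ID outside any sandbox.
    return { sandboxId: null, innerId: fragment };
  }
  return {
    sandboxId: fragment.slice(0, plusIndex),
    innerId: fragment.slice(plusIndex + 1),
  };
}

// Example with made-up IDs: the comment's sandbox is "comment42",
// and "intro" is an ID chosen by the comment's author.
console.log(parseQualifiedFragment("http://example.org/post#comment42+intro"));
```

With this split, the author can choose innerID before posting, and only the sandbox ID has to be assigned by the server.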



--
Alexey Feldgendler [EMAIL PROTECTED]
[ICQ: 115226275] http://feldgendler.livejournal.com


Re: [whatwg] CR entities and LFCR

2007-06-08 Thread Henri Sivonen

On Jun 7, 2007, at 15:00, Anne van Kesteren wrote:

These should be converted to LF too. One thing that might be  
interesting to look into is the handling of LFCR in browsers (as  
opposed to CRLF). I haven't done that yet... Some browsers (just  
tested Opera) also normalize two newline entities following each  
other (CRLF pair).


This requires more code. I haven't analyzed the perf impact, but  
intuitively this requires either naïve and inefficient buffer  
retraversal in the tree builder or additional complexity to the  
tokenizer's buffer management (assuming the tokenizer is doing  
efficient buffering to begin with).


You can't protect the DOM from getting CRs if someone insists on  
putting them there using JS or XML. Is it worthwhile to prevent  
escaped CRs from ending up in the DOM as CRs in HTML? Is special  
handling required for compat?


I'd try doing exactly what XML does here unless compat requires  
otherwise.


--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/




Re: [whatwg] CR entities and LFCR

2007-06-08 Thread Kristof Zelechovski
Oops.  I would have sworn that text mode input is performed by the operating
system.  It turns out I was wrong and the POSIX compatibility layer is
provided by the compiler vendor.  That means the exact behavior does indeed
depend on what does the reading.
Thanks for the clarification.
Cheers
Chris
(You never know what you know)

-Original Message-
From: Henri Sivonen [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 08, 2007 1:45 PM
To: Kristof Zelechovski
Cc: 'Michel Fortin'; 'WHATWG List'
Subject: Re: [whatwg] CR entities and LFCR

On Jun 8, 2007, at 09:24, Kristof Zelechovski wrote:

 Reading a file in text mode ignores all carriage return control  
 characters.
 Stray carriage returns are ignored as well.

Depends on what does the reading.





Re: [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser

2007-06-08 Thread Thomas Broyer

2007/6/8, Michel Fortin:

Perhaps someone will find this raw data interesting. I've made a
script to run the HTML5Lib test cases against the built-in HTML
parser in PHP 5. And here's the result:

http://www.michelf.com/docs/html5libtests-vs-php5html.html


Have you tried PH5P (pure PHP HTML5 parser)?

http://jero.net/lab/ph5p/

[CC'd Implementors list, please follow-up there]

--
Thomas Broyer


Re: [whatwg] CR entities and LFCR

2007-06-08 Thread Michael A. Puls II

On 6/8/07, Anne van Kesteren [EMAIL PROTECTED] wrote:

On Thu, 07 Jun 2007 23:12:38 +0200, Michael A. Puls II
[EMAIL PROTECTED] wrote:
 On 6/7/07, Anne van Kesteren [EMAIL PROTECTED] wrote:
 These should be converted to LF too. One thing that might be interesting
 to look into is the handling of LFCR in browsers (as opposed to CRLF). I
 haven't done that yet... Some browsers (just tested Opera) also normalize
 two newline entities following each other (CRLF pair).

 Not sure if it'll help, but whenever I do newline normalization to LF, I:

 Convert all CR + LF pairs to LF.
 Then, I convert any CRs left over to LF.

Sure, that's what the specification says to do as well. I was wondering if
some user agents do something special for LFCR. For instance, if I
remember correctly using \n\r in JavaScript gives a single newline in
Firefox and two in Opera.


I believe Boris told me for FF, newline normalization (including
entities) is only done for parsing into the DOM and that any setting
of a string property in JS does zero newline normalization. So, if you
set \n\r, \n\r is stored as-is (which is visually equivalent to having
2 newlines) and if there needs to be any normalization, it needs to be
done by the author of the JS code.

As a side note, when checking how newlines are stored in js, I usually
do alert(encodeURIComponent(element.nodeValue)) for example, so I
can for sure see what newline characters are present.
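
The inspection trick can be tried without a DOM. In this sketch a plain string stands in for element.nodeValue, which is an assumption made only to keep the example self-contained; the standard function is encodeURIComponent.

```javascript
// Assumption: a plain string stands in for element.nodeValue, so no DOM is needed.
const stored = "line1\n\rline2"; // LF followed by CR, set from script as-is

// encodeURIComponent percent-encodes control characters, making them visible:
// LF shows up as %0A and CR as %0D.
const visible = encodeURIComponent(stored);
console.log(visible); // line1%0A%0Dline2
```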

--
Michael


[whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser

2007-06-08 Thread Michel Fortin
Perhaps someone will find this raw data interesting. I've made a  
script to run the HTML5Lib test cases against the built-in HTML  
parser in PHP 5. And here's the result:


http://www.michelf.com/docs/html5libtests-vs-php5html.html

As far as I know, PHP 5 uses libxml2 as its HTML parser.


Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/




Re: [whatwg] Still more comments and questions on Web Apps 1.0

2007-06-08 Thread Ian Hickson
On Mon, 20 Mar 2006, Henri Sivonen wrote:
 
 5.1.1.
 I think the spec should suggest shift-return as the key combo for inserting a
 line separator to make it even more clear that plain return should break the
 block.

Done.


 5.1.1.
 (Updating the default* DOM attributes causes content attributs to be updated
 as well.)
 
 attributes

Fixed.


 6.1.
 The canvas element is characterized as a bitmap canvas. However, it is
 conceptually a vector graphics drawing context. I think the spec should not
 require bitmapping. If I had a way to inform my UA I intend to print, I would
 sure prefer the UA collecting the canvas drawing operation in a CGPDFContext
 as opposed to a CGBitmapContext on OS X. (Compare with what the spec itself
 says about requesting the image as image/svg+xml.)

Could you elaborate on exactly what it is you think should be changed?


 6.1.
 Is omitting the height and/or width of canvas conforming?

This should be clear now.


 6.1.1.3.
 WA defines xor like this: Exclusive OR of the source and destination images.
 Apple defines it more restrictively: Exclusive OR of the source and
 destination images. Works only with black and white images and is not
 recommended for color images.
 http://developer.apple.com/documentation/AppleApplications/Reference/SafariJSRef/Classes/Canvas.html#//apple_ref/doc/uid/30001240-54491

This is now defined in terms of Porter-Duff. Is that ok?
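
For reference, a sketch of the Porter-Duff "xor" operator applied to a single pixel, assuming premultiplied RGBA components in the range 0 to 1. The function name and the array layout are hypothetical and are not taken from the spec text.

```javascript
// Porter-Duff xor: keep the part of each image that lies outside the other.
// Pixels are [r, g, b, a] with premultiplied alpha, each component in 0..1.
function porterDuffXor(src, dst) {
  const srcFactor = 1 - dst[3]; // source survives where the destination is absent
  const dstFactor = 1 - src[3]; // destination survives where the source is absent
  return src.map((s, i) => s * srcFactor + dst[i] * dstFactor);
}

// Opaque red drawn onto an empty pixel stays opaque red.
console.log(porterDuffXor([1, 0, 0, 1], [0, 0, 0, 0]));
// Two opaque pixels cancel out to a fully transparent result.
console.log(porterDuffXor([1, 0, 0, 1], [0, 0, 1, 1]));
```

Unlike a bitwise exclusive OR of color values, this definition is stated purely in terms of coverage, so it is not limited to black and white images.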


 I am not an expert here, but IIRC, the underlying PDF/Quartz imaging model
 does not allow general xoring. I think it is important to ensure that canvas
 can be implemented on top of a Quartz 2D drawing context on OS X without breaking
 hardware acceleration or PDF output.

I can drop 'xor', I guess...


 6.2.
 The spec doesn't name any patent-free audio format that UAs would be required
 to support as a baseline. Linear PCM in WAV or AIFF would probably be
 sufficiently safe although not really suitable for the network.
 
 Vorbis can still be subject to submarine patents. MP3 and AAC obviously won't
 do. IIRC, AMR is a potential lawyer bomb as well.
 
 This is one of the few cases where HTTP content negotiation could actually be
 useful. Perhaps the spec should remind UA implementors to send an Accept
 header that lists the audio formats supported by their Audio object
 implementation (and no other media types) when loading the audio data.

Wave PCM is now required. There's a defined mechanism for content 
negotiation on the client.


 8.2.
 Authors interested in using SGML tools in their authoring pipeline are
 encouraged to use the XML serialisation of HTML5 instead of the HTML
 serialisation.
 
 Since HTML5 is not an application of SGML, SGML tools are inappropriate. I
 think authors interested in using SGML tools with (X)HTML5 should be actively
 discouraged from using SGML tools and encouraged to use XML tools instead.

Changed.


 8.2.
 This specification defines the parsing rules for HTML documents, whether they
 are syntactically valid or not. 
 
 Valid used loosely. :-)

No longer a problem now that we've defined HTML validator. :-)

(let me know if you want this changed anyway)


 8.2.
 A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present. Surely it
 should only be dropped for encodings where the BOM acts as an encoding
 signature. That is, with UTF-8 and UTF-16 it should be dropped but with
 UTF-16LE it should count as erroneous garbage.

Treating it as garbage will make a mess of the DOM. It seems like it would 
be very unlikely that that was intended rather than having intended the 
BOM to just be eaten despite the incantation being slightly off.

I agree it is an error (that's covered by the encoding specs and Unicode) 
but I don't see why we would want to actively go out of our way to punish 
authors in such cases.
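
A minimal sketch of the lenient behavior argued for here, assuming the input has already been decoded into a string. The helper name is hypothetical, and the sketch deliberately ignores which encoding label was declared.

```javascript
// Hypothetical helper: drop one leading U+FEFF from already-decoded text,
// regardless of the declared encoding.
function dropLeadingBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(dropLeadingBom("\uFEFF<html>")); // <html>
```

Only the first character is examined, so a U+FEFF appearing later in the stream is left in place.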


 8.2.1. Attribute value (unquoted) state
 I think in cases where an unquoted attribute value contains characters that
 were not formally allowed in unquoted values in HTML 4.01 the document should
 be considered non-conforming. That way keeping documents conforming would be a
 reasonable precaution against hairy interactions with legacy parsers out
 there.

I disagree. It doesn't cause ambiguities, the legacy browsers handle them 
fine, and it would just make authors think that HTML syntax was a black 
art with arbitrary rules. We have the opportunity here to clean up the 
rules, I think we should take it.


 8.2.1. Comment end state
 Shouldn't the Anything else branch be a parse error?

It is now.


 8.2.1.1.
 I think an NCR expanding to zero, above the Unicode range or to a surrogate
 should be a parse error.

It is now.
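
The three cases named in the comment can be sketched as a single check on the numeric value of the reference. The function name is hypothetical, and the sketch says nothing about which character a parser should emit after reporting the error.

```javascript
// Hypothetical check: does a numeric character reference expand to a value
// that the comment above says should be a parse error?
function isNcrParseError(codePoint) {
  const isZero = codePoint === 0;
  const aboveUnicode = codePoint > 0x10ffff;
  const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
  return isZero || aboveUnicode || isSurrogate;
}

console.log(isNcrParseError(0x41));     // false
console.log(isNcrParseError(0xd800));   // true
console.log(isNcrParseError(0x110000)); // true
```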


 8.2.2.1.
 Append that character to the Document node.
 
 Having text nodes outside the root elements is at least a bit surprising if
 nothing else.

I don't disagree. Should we just drop these spaces on the floor? It 
doesn't seem like the best thing but I guess I'm not opposed. What do 
other people think?


 8.2.2.3.1. and later references to the stack of open 

Re: [whatwg] id and xml:id

2007-06-08 Thread Ian Hickson
On Sun, 2 Apr 2006, Henri Sivonen wrote:

 Since UAs handle whitespace in the id attribute inconsistently (see 
 below)

Note that there is interoperability (in that, we have two browsers that do 
the same thing, and one of those is IE, even).


 old specs imply or require whitespace trimming

Old specs imply or require a lot of things. ;-)


 and ids with whitespace are unreferencable from whitespace-separated 
 lists of ids,

True.


 I suggest adding the following language concerning document conformance:
 
 The value of the id attribute must be a string that consists of one or 
 more characters matching the following production: 
 [#x21-#xD7FF]|[#xE000-#xFFFD]|[#x10000-#x10FFFF] (any XML 1.0 character 
 excluding whitespace).

I've made it non-conforming for an ID to contain a whitespace character.
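
A sketch of a conformance check along the lines of the suggested production, assuming the XML 1.0 character ranges with whitespace removed (U+0021 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The names are hypothetical, and the sketch follows the stricter suggested production rather than the adopted rule, which only forbids whitespace.

```javascript
// One or more characters from the XML 1.0 Char ranges, minus whitespace.
// The "u" flag makes the pattern operate on code points, not UTF-16 units.
const ID_PATTERN = /^[\u0021-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]+$/u;

// Hypothetical helper: true when the value matches the suggested production.
function isConformingId(value) {
  return ID_PATTERN.test(value);
}

console.log(isConformingId("a"));     // true
console.log(isConformingId(" c "));   // false
console.log(isConformingId("\nd\n")); // false
```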


 Also, I suggest requiring that elements must not have both id and xml:id 
 and requiring that xml:id must not occur in the HTML serialization. 
 (Again, from the document conformance point of view--not disputing 
 requirements on browsers.)

I don't really want to mention xml:id. If someone wants to write a spec 
that affects our spec, that's their business. I don't think it makes sense 
for us to go ahead and then ban their spec. That's not to say that xml:id 
is good or bad, it just doesn't seem relevant to mention it in our spec.


 If an element had both an id attribute and an xml:id attribute with different
 values, the document would not be HTML-serializable, which would be bad.

That applies to any document that has nodes from other namespaces. xml:id 
isn't special in that sense.


 If an element was allowed to have an id attribute and an xml:id attribute with
 the same value, the following constraint from xml:id spec would be violated
 even for conforming docs:
 An xml:id processor should assure that the following constraint holds:
* The values of all attributes of type “ID” (which includes all xml:id
 attributes) within a document are unique.
 ( http://www.w3.org/TR/xml-id/ )

I don't really understand what you mean there.


 Finally, as the ultimate ID nitpicking, the spec should state that it is 
 naughty of authors to turn attributes other than id and xml:id into IDs 
 via the DTD. (Well, using a DTD at all is naughty. :-)

Again, if they want to do that, that's their business. I don't see that as 
a big problem.


 Test case: http://hsivonen.iki.fi/test/wa10/adhoc/id.html
 The script tries every id with a whitespaceless value to see if whitespace is
 trimmed before ID assignment.

 Safari and IE 6:
 
 id='a' PASS
 id='2' PASS
 id='&lt;' PASS
 id=',' PASS
 id='&auml;' PASS
 id=' c ' FAIL
 id='\nd\n' FAIL
 id='\t\te\t\t' FAIL
 id='&#13;f&#13;' FAIL

That's what the spec requires today.

-- 
Ian Hickson
http://ln.hixie.ch/
Things that are impossible just take longer.

Re: [whatwg] Still more comments and questions on Web Apps 1.0

2007-06-08 Thread Michel Fortin

Le 2007-06-08 à 21:05, Ian Hickson a écrit :


8.2.2.1.
Append that character to the Document node.

Having text nodes outside the root elements is at least a bit surprising if
nothing else.


I don't disagree. Should we just drop these spaces on the floor? It
doesn't seem like the best thing but I guess I'm not opposed. What do
other people think?


I'd agree they're mostly useless in a browser context, but when  
reading HTML with the intent of reserializing it later, preserving  
the whitespace around the document type declaration, the comments and  
the root element can be beneficial for the readability of the final  
output. I'd keep them there, just like XML does.



Michel Fortin
[EMAIL PROTECTED]
http://www.michelf.com/