Re: [whatwg] The problem of duplicate ID as a security issue
On Thu, 7 Jun 2007, Alexey Feldgendler wrote: On Thu, 07 Jun 2007 00:42:31 +0200, Ian Hickson [EMAIL PROTECTED] wrote: IDs in user-supplied content are only useful as fragment identifiers for URLs, and mangling them like that defeats this use case because you don't know N before you post the comment, and therefore can't make internal links within the body (and it's also unobvious when you try to make links to parts of your article afterwards). True. I don't have a good solution to this that doesn't involve code on the server-side, though. Some form of sandboxing would be one. If sandboxing would solve it then I'll treat this issue as closed and deal with the sandboxing problems separately. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] CR entities and LFCR
On Thu, 07 Jun 2007 23:12:38 +0200, Michael A. Puls II [EMAIL PROTECTED] wrote: On 6/7/07, Anne van Kesteren [EMAIL PROTECTED] wrote: These should be converted to LF too. One thing that might be interesting to look into is the handling of LFCR in browsers (as opposed to CRLF). I haven't done that yet... Some browsers (just tested Opera) also normalize two newline entities following each other (CRLF pair). Not sure if it'll help, but whenever I do newline normalization to LF, I: Convert all CR + LF pairs to LF. Then, I convert any CRs left over to LF. Sure, that's what the specification says to do as well. I was wondering if some user agents do something special for LFCR. For instance, if I remember correctly using \n\r in JavaScript gives a single newline in Firefox and two in Opera. -- Anne van Kesteren http://annevankesteren.nl/ http://www.opera.com/
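The two-pass normalization described above (CR + LF pairs first, then leftover CRs) can be sketched in JavaScript as follows. This is only an illustration: the name normalizeNewlines is hypothetical and not taken from any specification or browser source. Note that with this approach an LFCR sequence is not treated as a pair, so it comes out as two newlines.

```javascript
// Hypothetical helper, for illustration only: normalize newlines to LF.
function normalizeNewlines(text) {
  return text
    .replace(/\r\n/g, "\n") // pass 1: every CR + LF pair becomes a single LF
    .replace(/\r/g, "\n");  // pass 2: any CR left over becomes LF
}

// CRLF collapses to one newline; LFCR ends up as two.
console.log(JSON.stringify(normalizeNewlines("a\r\nb"))); // "a\nb"
console.log(JSON.stringify(normalizeNewlines("a\n\rb"))); // "a\n\nb"
```

Doing the pair replacement first matters: running the single-CR pass first would turn every CRLF into two LFs.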
Re: [whatwg] The problem of duplicate ID as a security issue
On Fri, 08 Jun 2007 08:13:07 +0200, Ian Hickson [EMAIL PROTECTED] wrote: True. I don't have a good solution to this that doesn't involve code on the server-side, though. Some form of sandboxing would be one. If sandboxing would solve it then I'll treat this issue as closed and deal with the sandboxing problems separately. Only some form of sandboxing would solve this, not any form. To solve this issue, the sandboxing solution has to meet an additional requirement: addressability of content in sandboxes, possibly using a qualified form (e.g. URL#sandboxID+innerID). -- Alexey Feldgendler [EMAIL PROTECTED] [ICQ: 115226275] http://feldgendler.livejournal.com
Re: [whatwg] CR entities and LFCR
On Jun 7, 2007, at 15:00, Anne van Kesteren wrote: These should be converted to LF too. One thing that might be interesting to look into is the handling of LFCR in browsers (as opposed to CRLF). I haven't done that yet... Some browsers (just tested Opera) also normalize two newline entities following each other (CRLF pair). This requires more code. I haven't analyzed the perf impact, but intuitively this requires either naïve and inefficient buffer retraversal in the tree builder or additional complexity in the tokenizer's buffer management (assuming the tokenizer is doing efficient buffering to begin with). You can't protect the DOM from getting CRs if someone insists on putting them there using JS or XML. Is it worthwhile to prevent escaped CRs from ending up in the DOM as CRs in HTML? Is special handling required for compat? I'd try doing exactly what XML does here unless compat requires otherwise. -- Henri Sivonen [EMAIL PROTECTED] http://hsivonen.iki.fi/
Re: [whatwg] CR entities and LFCR
Oops. I would swear that text mode input is performed by the operating system. It turns out I was wrong and the POSIX compatibility layer is provided by the compiler vendor. That means the exact behavior does indeed depend on what does the reading. Thanks for the clarification. Cheers Chris (You never know what you know) -Original Message- From: Henri Sivonen [mailto:[EMAIL PROTECTED] Sent: Friday, June 08, 2007 1:45 PM To: Kristof Zelechovski Cc: 'Michel Fortin'; 'WHATWG List' Subject: Re: [whatwg] CR entities and LFCR On Jun 8, 2007, at 09:24, Kristof Zelechovski wrote: Reading a file in text mode ignores all carriage return control characters. Stray carriage returns are ignored as well. Depends on what does the reading.
Re: [whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser
2007/6/8, Michel Fortin: Perhaps someone will find this raw data interesting. I've made a script to run the HTML5Lib test cases against the built-in HTML parser in PHP 5. And here's the result: http://www.michelf.com/docs/html5libtests-vs-php5html.html Have you tried PH5P (pure PHP HTML5 parser)? http://jero.net/lab/ph5p/ [CC'd Implementors list, please follow-up there] -- Thomas Broyer
Re: [whatwg] CR entities and LFCR
On 6/8/07, Anne van Kesteren [EMAIL PROTECTED] wrote: On Thu, 07 Jun 2007 23:12:38 +0200, Michael A. Puls II [EMAIL PROTECTED] wrote: On 6/7/07, Anne van Kesteren [EMAIL PROTECTED] wrote: These should be converted to LF too. One thing that might be interesting to look into is the handling of LFCR in browsers (as opposed to CRLF). I haven't done that yet... Some browsers (just tested Opera) also normalize two newline entities following each other (CRLF pair). Not sure if it'll help, but whenever I do newline normalization to LF, I: Convert all CR + LF pairs to LF. Then, I convert any CRs left over to LF. Sure, that's what the specification says to do as well. I was wondering if some user agents do something special for LFCR. For instance, if I remember correctly using \n\r in JavaScript gives a single newline in Firefox and two in Opera. I believe Boris told me for FF, newline normalization (including entities) is only done for parsing into the DOM and that any setting of a string property in JS does zero newline normalization. So, if you set \n\r, \n\r is stored as-is (which is visually equivalent to having 2 newlines) and if there needs to be any normalization, it needs to be done by the author of the JS code. As a side note, when checking how newlines are stored in JS, I usually do alert(encodeURIComponent(element.nodeValue)) for example, so I can see for sure what newline characters are present. -- Michael
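The inspection trick mentioned in the side note above relies on the standard encodeURIComponent function, which percent-encodes control characters so that LF shows up as %0A and CR as %0D. A minimal sketch follows; the wrapper name showNewlines is made up for this example, and it takes a plain string so it can run outside a browser.

```javascript
// Hypothetical wrapper, for illustration only: make newline characters visible.
function showNewlines(value) {
  // encodeURIComponent turns LF into %0A and CR into %0D.
  return encodeURIComponent(value);
}

console.log(showNewlines("a\n\rb")); // a%0A%0Db (LF then CR, stored as-is)
console.log(showNewlines("a\r\nb")); // a%0D%0Ab (CR then LF)

// In a browser, one would pass a DOM string instead, for example:
//   alert(showNewlines(element.nodeValue));
```

Because nothing is normalized when a string is set from script, the encoded output shows exactly what was stored.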
[whatwg] HTML5Lib Test Suite vs. PHP 5 HTML Parser
Perhaps someone will find this raw data interesting. I've made a script to run the HTML5Lib test cases against the built-in HTML parser in PHP 5. And here's the result: http://www.michelf.com/docs/html5libtests-vs-php5html.html As far as I know, PHP 5 uses libxml2 as its HTML parser. Michel Fortin [EMAIL PROTECTED] http://www.michelf.com/
Re: [whatwg] Still more comments and questions on Web Apps 1.0
On Mon, 20 Mar 2006, Henri Sivonen wrote: 5.1.1. I think the spec should suggest shift-return as the key combo for inserting a line separator to make it even more clear that plain return should break the block. Done. 5.1.1. (Updating the default* DOM attributes causes content attributs to be updated as well.) attributes Fixed. 6.1. The canvas element is characterized as a bitmap canvas. However, it is conceptually a vector graphics drawing context. I think the spec should not require bitmapping. If I had a way to inform my UA I intend to print, I would sure prefer the UA collecting the canvas drawing operation in a CGPDFContext as opposed to a CGBitmapContext on OS X. (Compare with what the spec itself says about requesting the image as image/svg+xml.) Could you elaborate on exactly what it is you think should be changed? 6.1. Is omitting the height and/or width of canvas conforming? This should be clear now. 6.1.1.3. WA defines xor like this: Exclusive OR of the source and destination images. Apple defines it more restrictively: Exclusive OR of the source and destination images. Works only with black and white images and is not recommended for color images. http://developer.apple.com/documentation/AppleApplications/Reference/SafariJSRef/Classes/Canvas.html#//apple_ref/doc/uid/30001240-54491 This is now defined in terms of Porter-Duff. Is that ok? I am not an expert here, but IIRC, the underlying PDF/Quartz imaging model does not allow general xoring. I think it is important to ensure that canvas can be implemented on top a Quartz 2D drawing context on OS X without breaking hardware acceleration or PDF output. I can drop 'xor', I guess... 6.2. The spec doesn't name any patent-free audio format that UAs would be required to support as a baseline. Linear PCM in WAV or AIFF would probably be sufficiently safe although not really suitable for the network. Vorbis can still be subject to submarine patents. MP3 and AAC obviously won't do. 
IIRC, AMR is a potential lawyer bomb as well. This is one of the few cases where HTTP content negotiation could actually be useful. Perhaps the spec should remind UA implementors to send an Accept header that lists the audio formats supported by their Audio object implementation (and no other media types) when loading the audio data. Wave PCM is now required. There's a defined mechanism for content negotiation on the client. 8.2. Authors interested in using SGML tools in their authoring pipeline are encouraged to use the XML serialisation of HTML5 instead of the HTML serialisation. Since HTML5 is not an application of SGML, SGML tools are inappropriate. I think authors interested in using SGML tools with (X)HTML5 should be actively discouraged to use SGML tools and encouraged to use XML tools instead. Changed. 8.2. This specification defines the parsing rules for HTML documents, whether they are syntactically valid or not. Valid used loosely. :-) No longer a problem now that we've defined HTML validator. :-) (let me know if you want this changed anyway) 8.2. A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present. Surely it should only be dropped for encodings where the BOM acts as an encoding signature. That is, with UTF-8 and UTF-16 it should be dropped but with UTF-16LE it should count as an erroneous garbage. Treating it as garbage will make a mess of the DOM. It seems like it would be very unlikely that that was intended rather than having intended the BOM to just be eaten despite the incantation being slightly off. I agree it is an error (that's covered by the encoding specs and Unicode) but I don't see why we would want to actively go out of our way to punish authors in such cases. 8.2.1. Attribute value (unquoted) state I think in cases where an unquoted attribute value contains characters that were not formally allowed in unquoted values in HTML 4.01 the document should be considered non-conforming. 
That way keeping document conforming would be a reasonable precaution against hairy interactions with legacy parsers out there. I disagree. It doesn't cause ambiguities, the legacy browsers handle them fine, and it would just make authors think that HTML syntax was a black art with arbitrary rules. We have the opportunity here to clean up the rules, I think we should take it. 8.2.1. Comment end state Shouldn't the Anything else branch be a parse error? It is now. 8.2.1.1. I think an NCR expanding to zero, above the Unicode range or to a surrogate should be a parse error. It is now. 8.2.2.1. Append that character to the Document node. Having text nodes outside the root elements is at least a bit surprising if nothing else. I don't disagree. Should we just drop these spaces on the floor? It doesn't seem like the best thing but I guess I'm not opposed. What do other people think? 8.2.2.3.1. and later references to the stack of open
Re: [whatwg] id and xml:id
On Sun, 2 Apr 2006, Henri Sivonen wrote: Since UAs handle whitespace in the id attribute inconsistently (see below) Note that there is interoperability (in that, we have two browsers that do the same thing, and one of those is IE, even). old specs imply or require whitespace trimming Old specs imply or require a lot of things. ;-) and ids with whitespace are unreferencable from whitespace-separated lists of ids, True. I suggest adding the following language concerning document conformance: The value of the id attribute must be a string that consists of one or more characters matching the following production: [#x21-#xD7FF]|[#xE000-#xFFFD]|[#x10000-#x10FFFF] (any XML 1.0 character excluding whitespace). I've made it non-conforming for an ID to contain a whitespace character. Also, I suggest requiring that elements must not have both id and xml:id and requiring that xml:id must not occur in the HTML serialization. (Again, from the document conformance point of view--not disputing requirements on browsers.) I don't really want to mention xml:id. If someone wants to write a spec that affects our spec, that's their business. I don't think it makes sense for us to go ahead and then ban their spec. That's not to say that xml:id is good or bad, it just doesn't seem relevant to mention it in our spec. If an element had both an id attribute and an xml:id attribute with different values, the document would not be HTML-serializable, which would be bad. That applies to any document that has nodes from other namespaces. xml:id isn't special in that sense. If an element was allowed to have an id attribute and an xml:id attribute with the same value, the following constraint from xml:id spec would be violated even for conforming docs: An xml:id processor should assure that the following constraint holds: * The values of all attributes of type “ID” (which includes all xml:id attributes) within a document are unique. 
( http://www.w3.org/TR/xml-id/ ) I don't really understand what you mean there. Finally, as the ultimate ID nitpicking, the spec should state that it is naughty of authors to turn attributes other than id and xml:id into IDs via the DTD. (Well, using a DTD at all is naughty. :-) Again, if they want to do that, that's their business. I don't see that as a big problem. Test case: http://hsivonen.iki.fi/test/wa10/adhoc/id.html The script tries every id with a whitespaceless value to see if whitespace is trimmed before ID assignment. Safari and IE 6: id='a' PASS id='2' PASS id='&lt;' PASS id=',' PASS id='&auml;' PASS id=' c ' FAIL id='\nd\n' FAIL id='\t\te\t\t' FAIL id='&#13;f&#13;' FAIL That's what the spec requires today. -- Ian Hickson U+1047E)\._.,--,'``.fL http://ln.hixie.ch/ U+263A/, _.. \ _\ ;`._ ,. Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
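The production suggested in this message for id values can be written as a JavaScript regular expression, roughly as below. This is a sketch of the suggested production only, not of the rule the spec finally adopted (which simply forbids whitespace characters). The function name isConformingId is hypothetical, and the last range is assumed to be the XML 1.0 supplementary range, U+10000 to U+10FFFF.

```javascript
// Hypothetical checker, for illustration only. One or more characters from
// [#x21-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] (assumed last range).
// The "u" flag makes the pattern match whole code points, not UTF-16 units.
const ID_PATTERN = /^[\u0021-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]+$/u;

function isConformingId(value) {
  return ID_PATTERN.test(value);
}

console.log(isConformingId("a"));           // true
console.log(isConformingId(" c "));         // false (U+0020 is below #x21)
console.log(isConformingId("\nd\n"));       // false (LF is below #x21)
console.log(isConformingId("\t\te\t\t"));   // false (TAB is below #x21)
console.log(isConformingId(""));            // false (needs at least one character)
```

Since the lowest allowed code point is U+0021, space, tab, LF and CR are all excluded without being listed explicitly.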
Re: [whatwg] Still more comments and questions on Web Apps 1.0
Le 2007-06-08 à 21:05, Ian Hickson a écrit : 8.2.2.1. Append that character to the Document node. Having text nodes outside the root elements is at least a bit surprising if nothing else. I don't disagree. Should we just drop these spaces on the floor? It doesn't seem like the best thing but I guess I'm not opposed. What do other people think? I'd agree they're mostly useless in a browser context, but when reading HTML with the intent of reserializing it later, preserving the whitespace around the document type declaration, the comments and the root element can be beneficial for the readability of the final output. I'd keep them there, just like XML does. Michel Fortin [EMAIL PROTECTED] http://www.michelf.com/