Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian, We don't have any data that says that we need to support this for innerHTML. I think it's a win if we can drop the hack from innerHTML. Okay, so allowing some HTML elements to break out of foreign content is a hack added for historical reasons, that will surprise authors and complicate implementations and is thus regrettable, but necessary. Then there are two possibilities for fragment parsing: (1) The hack can be left out of fragment parsing, as there is no historical justification for it. Since the hack is bad, removing it from as many situations as possible is good. (2) The hack can apply to fragment parsing in the same way as it applies to regular parsing. This makes parsing behaviour more consistent across different situations, which is good. I'm strongly in favour of (2), as it seems that omitting the hack from some rare situations doesn't save authors any trouble, and doesn't follow the principle of least surprise. In an ideal world it would be possible to grab any subsection of a document, parse that in isolation as a fragment, and get the same result as if it was parsed in its original document context. This is possible in XML, but not HTML, due to the existing author-friendly hacks, and making the parsing behaviour even more context sensitive doesn't seem like a good thing. Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian, The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an svg can't possibly break out of the svg if it sees one of these tags, since that's the root of what is being parsed. Yes, HTML has already lost the composability of parsing that XML and other languages have, that's long gone. But that doesn't mean we should try to make it even more irregular :) Currently Firefox, Chrome, and Prince all treat the fragment case the same as the whole document case, so we already have interoperable behaviour on this issue. Since the HTML spec is supposed to reflect reality, it seems pointless to deliberately introduce an inconsistency in the parsing model that requires changes in all user agents to implement. Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Ian, I ended up removing this from the spec for other reasons, so this should be resolved now. Let me know if it's not. (No, I don't know what I had originally intended.) I don't think the new spec is correct. The question is what happens if we are tokenizing some foreign content, and we see an HTML start tag. In the normal case, we pop off all the foreign elements until we get back to the HTML namespace, then reprocess the token. In the fragment case, the context element may be a foreign element, so there was the wrinkle of having to handle that appropriately when we have this fake root html element that makes everything confusing. The new text reads: If the parser was originally created for the HTML fragment parsing algorithm, then act as described in the "any other start tag" entry below. (fragment case) This always just adds the HTML element in place inside the foreign content, even if the fragment context element *is* an HTML element! This can't be right, as it means parsing document.body.innerHTML will behave totally differently to parsing <html><body>, for no reason. Looking back a couple of years, this section of the spec seems to be drifting in a random walk away from reality. We can study this further and try suggesting some text based on what we have implemented so far. Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content
Hi Adam, Since the stack of open elements always has html at the top of the stack, the element in scope algorithm will always find it, and as a result, the first part of the condition will always fail. Even in the fragment case? (Note the parenthetical remark in the spec about this text applying only in the fragment case.) Yes, see 12.4, the stack of open elements always contains a html root in the fragment case when there is a context element: Let root be a new html element with no attributes. ... Set up the parser's stack of open elements so that it contains just the single element root. Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
[whatwg] [dom] attributes collection not fully defined?
Hi,

In the definition of the Element.attributes collection here:

http://dom.spec.whatwg.org/#dom-element-attributes

It doesn't seem to describe the behaviour for setting direct properties of the attributes collection, and how they map to attributes. For example, setting an attribute will create a property with the same name as the attribute:

    div = document.createElement("div");
    div.setAttribute("foo", "bar");
    alert(div.attributes.foo); // [object Attr]

Except for read-only properties like length, which will not be shadowed by attributes:

    div.setAttribute("length", 99);
    alert(div.attributes.length); // 2

So far so good. Things get weird, though:

    div.attributes.fruit = "apple";
    alert(div.attributes.fruit); // apple
    div.setAttribute("fruit", "orange");
    alert(div.attributes.fruit); // [object Attr]
    div.removeAttribute("fruit");
    alert(div.attributes.fruit); // apple (!!!)

Firefox and Chrome seem to be inconsistent on this, but at least in some situations they will shadow the property with an attribute, then restore the original property when the attribute is removed. You can have more fun by using Object.defineProperty to make the property read-only or unconfigurable, which Firefox and Chrome will again treat inconsistently. The mind boggles.

How are these pseudo-properties supposed to be implemented? What magic hook calls them to life? The reason I ask is that jQuery = 1.9 uses div.attributes in its feature detection code, and it's causing us problems.

Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
Re: [whatwg] [dom] attributes collection not fully defined?
Hi Boris,

Thank you for the detailed explanation. Having the WebIDL named getter definition helps to simplify things. This part still seems inconsistent with current browsers:

> 4) Setting a property name that is currently exposed does a Reject (which means throw in strict mode, silently do nothing in non-strict mode). Unless there is a named setter, of course.

If I set the property name which has already been used for an attribute, it still seems to store the value:

    div.setAttribute("fruit", "orange");
    div.attributes.fruit = "apple";
    div.removeAttribute("fruit");
    alert(div.attributes.fruit); // apple

except for a very strange bug in Firefox only, where if I *read* the value before removing it, the attribute doesn't go away later:

    div.setAttribute("fruit", "orange");
    div.attributes.fruit = "apple";
    alert(div.attributes.fruit.value); // orange
    div.removeAttribute("fruit");
    alert(div.attributes.fruit); // [object Attr] ???

Just adding the extra alert in the middle changes the value after removing the attribute, so that the Attr object is still returned.

Anyway, doing nothing or throwing if the user tries to write to a property which is currently exposed seems like a much better option.

Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
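The "Reject" behaviour in rule 4 can be illustrated with a toy model. This is only a sketch built on a JavaScript Proxy, not the WebIDL algorithm or any browser's implementation; the names `makeAttributesModel`, `attrs`, and `expandos` are hypothetical, and the lookup order in the `get` trap is a simplifying assumption.

```javascript
// Toy stand-in for element.attributes (hypothetical, for illustration only).
// `attrs` holds the element's attributes; `expandos` holds ordinary
// properties that script has stored on the collection.
function makeAttributesModel() {
  const attrs = new Map();
  const expandos = new Map();
  const collection = new Proxy({}, {
    get(_target, name) {
      // Assumption: an exposed attribute name wins over a script-set property.
      if (attrs.has(name)) return attrs.get(name);
      return expandos.get(name);
    },
    set(_target, name, value) {
      // Rule 4: writing to a name that is currently exposed is rejected.
      // Returning false throws in strict mode and is ignored otherwise.
      if (attrs.has(name)) return false;
      expandos.set(name, value);
      return true;
    },
  });
  return { attrs, collection };
}

const model = makeAttributesModel();
model.attrs.set("fruit", { name: "fruit", value: "orange" }); // setAttribute
const accepted = Reflect.set(model.collection, "fruit", "apple"); // rejected
model.attrs.delete("fruit"); // removeAttribute
console.log(accepted, model.collection.fruit); // false undefined
```

Under this model the "apple" value never reappears after the attribute is removed, which is the outcome argued for above.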
Re: [whatwg] [dom] attributes collection not fully defined?
And this is why we should make named getter/setters a thing of the past. New specs are still being written which use these WebIDL features and almost all of them end up with confusing behavior like this. +1 +1 +1 +1e100 :) Michael -- Prince: Print with CSS! http://www.princexml.com
[whatwg] Pull requests for HTML5 spec?
Hi, There are various branches and versions of the W3C and WHAT-WG HTML specifications hosted on Github. Is there any standard procedure in place for pull requests, if you have editorial changes to suggest? Or is there a better way to track these kinds of changes? Cheers, Michael -- Prince: Print with CSS! http://www.princexml.com
Re: [whatwg] Pull requests for HTML5 spec?
Hi Silvia, If you want to contribute to the WHATWG spec, you should register a bug on https://www.w3.org/Bugs/Public/describecomponents.cgi?product=WHATWG . WHATWG patches eventually get cherry-picked into the W3C spec, too, unless there is strong opposition in the HTML WG. If the WHATWG spec is hosted on Subversion, I guess that means pull requests to that branch on Github will be ignored? Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
[whatwg] Spec ambiguity and Firefox bug for newlines following pre and textarea
Hi, If a newline character token follows a pre or textarea start tag, it is supposed to be ignored as an authoring convenience. However, what if a NULL character token gets in the way? Consider these two cases, where NULL represents a literal U+0000 character: <pre>NULL&#xA; <textarea>NULL&#xA; For textarea, the tokenizer will be in the rcdata state, which generates replacement character (U+FFFD) tokens for each NULL. So the newline will not be the next token following the start tag, and should not be ignored. Chrome gets this right; Firefox gets this wrong, and displays the replacement character *and* strips the newline. For pre, the tokenizer will be in the data state, which emits NULL characters as-is. The NULL character token is then ignored by the "in body" insertion mode. Does this mean it doesn't count as the next token after the start tag? Both browsers seem to think so. In general, the concept of "next token" is not well defined; in fact I don't think it is ever explicitly defined in the spec. If a token is ignored, is it still the next token? Since this concept is only used for the specific case of ignoring newlines at the start of pre, listing, and textarea, perhaps a better mechanism could be found to describe how it should work. Best regards, Michael -- Prince: Print with CSS! http://www.princexml.com
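One way to make "next token" concrete is a flag that is armed by the start tag and disarmed by whatever token arrives next. The sketch below takes that literal reading; the token shape (`type`, `name`, `data`) and the function name `dropLeadingNewline` are hypothetical, and this is not text from the spec or any browser's code.

```javascript
// Literal reading of "next token": the flag set by a pre/listing/textarea
// start tag is cleared by the very next token, whatever it is.
function dropLeadingNewline(tokens) {
  const out = [];
  let skipNextLf = false;
  for (const tok of tokens) {
    const wasArmed = skipNextLf;
    skipNextLf = false; // any token disarms the flag
    if (wasArmed && tok.type === "char" && tok.data === "\n") continue;
    if (tok.type === "start" &&
        ["pre", "listing", "textarea"].includes(tok.name)) {
      skipNextLf = true;
    }
    out.push(tok);
  }
  return out;
}

// textarea case: the tokenizer has already turned NULL into U+FFFD,
// so the newline is not the next token and is kept.
const textareaTokens = [
  { type: "start", name: "textarea" },
  { type: "char", data: "\uFFFD" },
  { type: "char", data: "\n" },
];
console.log(dropLeadingNewline(textareaTokens).length); // 3
```

Under this reading a NULL character token after a pre start tag would also disarm the flag and the newline would be kept, which differs from the browser behaviour reported above for pre. That gap is exactly the ambiguity being raised.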
Re: [whatwg] Spec ambiguity and Firefox bug for newlines following pre and textarea
Hi Peter, You should report this issue and your previous issue (HTML5 is broken: menuitem causes infinite loop) in Bugzilla. The WHATWG HTML spec makes it easy. Thanks, I've done this now. Michael -- Prince: Print with CSS! http://www.princexml.com
[whatwg] HTML5 is broken: menuitem causes infinite loop
Hilarious spec bug of the week: HTML5 requires implementations to loop indefinitely if they see a menuitem start tag.

12.2.5.4.7 "in body" insertion mode => see a menuitem start tag, process using rules for "in head"
12.2.5.4.4 "in head" insertion mode => see menuitem, act as if </head> and reprocess
12.2.5.4.6 "after head" insertion mode => see menuitem, act as if <body> and reprocess

...and we're back at the "in body" insertion mode, and will continue to bounce around with the menuitem start tag token making absolutely no progress whatsoever.

What is the menuitem tag supposed to be, anyway? A test to ensure that implementers are awake, like the </sarcasm> close tag?

Cheers, Michael
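The cycle can be checked mechanically. The following is a minimal sketch, assuming the three rules exactly as summarised above; the table `menuitemRules` and the function `traceToken` are hypothetical names, and the model only tracks which insertion mode reprocesses the token next.

```javascript
// For a menuitem start tag: which insertion mode handles the token next,
// according to the three rules quoted above.
const menuitemRules = {
  "in body": "in head",     // process using the rules for "in head"
  "in head": "after head",  // act as if </head> was seen, then reprocess
  "after head": "in body",  // act as if <body> was seen, then reprocess
};

// Follows the token from mode to mode until it is consumed (no rule hands
// it on) or a mode is visited twice, which means it can never be consumed.
function traceToken(rules, startMode) {
  const visited = [startMode];
  let mode = startMode;
  while (true) {
    mode = rules[mode];
    if (mode === undefined) return { loops: false, visited };
    if (visited.includes(mode)) return { loops: true, visited };
    visited.push(mode);
  }
}

const trace = traceToken(menuitemRules, "in body");
console.log(trace.loops, trace.visited.join(" -> "));
// true in body -> in head -> after head
```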
[whatwg] adjusted current node in 12.2.5.5
Hi, Recently the spec has been changed to introduce the concept of the adjusted current node defined in 12.2.3.2 The stack of open elements. The intention seems to be to handle the case of setting innerHTML on a MathML or SVG element, and hence triggering the fragment parsing algorithm in a foreign content context. Since the math or svg element will not be in the stack of open elements, this would otherwise cause problems with child elements not in the right namespace, and CDATA sections not being parsed properly. However, 12.2.5.5 The rules for parsing tokens in foreign content still only refers to the current node, not the adjusted current node. For example, the rules for parsing Any other start tag: If the current node is an element in the MathML namespace, adjust MathML attributes for the token. Since the current node in the fragment parsing case is still html, this will not have the desired effect. Should this section be changed to refer to the adjusted current node? Best regards, Michael
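For reference, the adjusted current node as the spec defines it is the context element when the parser was created for fragment parsing and the stack of open elements has only one element, and the current node otherwise. A minimal sketch follows; the `parser` object shape and its field names are hypothetical.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";
const SVG_NS = "http://www.w3.org/2000/svg";

// `parser` is a hypothetical plain object:
//   openElements: stack of open elements, last entry is the current node
//   context: the fragment context element, or null for a full document parse
function adjustedCurrentNode(parser) {
  const stack = parser.openElements;
  if (parser.context !== null && stack.length === 1) {
    return parser.context; // fragment case: only the root html is open
  }
  return stack[stack.length - 1];
}

// Setting innerHTML on an svg element: the stack holds only the root html,
// so the current node is html but the adjusted current node is svg.
const fragmentParser = {
  openElements: [{ name: "html", ns: HTML_NS }],
  context: { name: "svg", ns: SVG_NS },
};
console.log(adjustedCurrentNode(fragmentParser).ns === SVG_NS); // true
```

A rule in 12.2.5.5 that tests the namespace of the plain current node would see the HTML namespace here, which is the mismatch described above.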
Re: [whatwg] canvas miterLimit property
Hi Ian, The main thing driving this API is back-compat with canvas implementations, not consistency with SVG. :-) As always, whatever random crap gets implemented first becomes the official standard we have to support forever in the name of backwards compatibility because it already has a few dozen users :) Cheers, Michael
Re: [whatwg] Canvas arcTo method
Hi Ian, Yeah, that's why the spec hand-waves to transform the line too... but I agree that that doesn't really work. Do you have any suggestion of how to spec this better? This is the most general arcTo situation: setTransform(M0) lineTo(x0, y0) setTransform(M) arcTo(x1, y1, x2, y2, radius, ...) To generate the arc we need three points: P0, P1, P2, all in the same coordinate system. The three points are: P0 = inverse(M) * M0 * (x0, y0) P1 = (x1, y1) P2 = (x2, y2) We are transforming (x0, y0) by M0, which is the transform current at the time the point was added to the path. This gives us a point in canvas coordinates that we can transform by the inverse of M, which is the transform current at the time the arc is added to the path. This gives us a point in the same coordinate space as P1 and P2. In the common case where M = M0, the transforms cancel each other out and P0 = (x0, y0). Once we have the three points in the same coordinate space we can generate the arc and then apply M to all of the points in the generated arc to draw the arc in canvas coordinates. Does this make sense? I don't think it is possible to specify this process without requiring an inverse transformation somewhere, to get all three points into the same coordinate space. If so, it is probably best to describe this explicitly, rather than ambiguously implying the need for it. Best regards, Michael
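The calculation of P0 can be sketched in a few lines. This is an illustration only, assuming affine matrices in the canvas [a, b, c, d, e, f] layout; the helper names `apply`, `invert`, and `arcToStartPoint` are hypothetical.

```javascript
// Affine matrices use the canvas [a, b, c, d, e, f] layout:
//   x' = a*x + c*y + e,   y' = b*x + d*y + f
function apply(m, [x, y]) {
  const [a, b, c, d, e, f] = m;
  return [a * x + c * y + e, b * x + d * y + f];
}

function invert(m) {
  const [a, b, c, d, e, f] = m;
  const det = a * d - b * c;
  if (det === 0) throw new Error("matrix is not invertible");
  return [d / det, -b / det, -c / det, a / det,
          (c * f - d * e) / det, (b * e - a * f) / det];
}

// P0 = inverse(M) * M0 * (x0, y0): take the last path point to canvas
// coordinates with M0, then into the coordinate space of the arcTo call.
function arcToStartPoint(m0, m, lastPoint) {
  return apply(invert(m), apply(m0, lastPoint));
}

const identity = [1, 0, 0, 1, 0, 0];
const scale21 = [2, 0, 0, 1, 0, 0]; // ctx.scale(2, 1)
const p0 = arcToStartPoint(identity, scale21, [100, 100]);
console.log(p0[0], p0[1]); // 50 100
```

With lineTo(100, 100) issued before scale(2, 1), P0 comes out as (50, 100) in the scaled space. When M equals M0 the two transforms cancel and P0 is simply (x0, y0).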
[whatwg] canvas miterLimit property
Hi, The canvas miterLimit property has a default value of 10, while the SVG stroke-miterlimit property has a default value of 4. Is there a reason for this inconsistency? For reference, the PDF rendering model also has a default value of 10 for miterLimit, making SVG apparently the odd one out here. Cheers, Michael
Re: [whatwg] canvas miterLimit property
Hi Rik, I'm unsure why SVG is different. While we are on the subject, in SVG stroke-miterlimit must be >= 1.0, whereas in the canvas it must be >= 0.0. In Prince we are clamping it to 1.0, as the PDF spec is consistent with SVG this time, and Adobe Reader will fail if the miter limit is dropped below 1.0. Best regards, Michael
Re: [whatwg] Canvas arcTo method
Hi Rik, The 'scale(2,1)' set up a different coordinate system. You can rewrite your code from this: ctx.lineTo(100, 100); ctx.scale(2, 1); ctx.arcTo(100, 100, 100, 200, 100); to this: ctx.scale(2, 1); *ctx.lineTo(50, 100);* ctx.arcTo(100, 100, 100, 200, 100); Right, these will produce the same arc. But how should this be implemented in the user agent? It's almost like it is getting the last point in the previous subpath, transforming it by the inverse of the current transformation matrix, generating the arc, and then transforming the arc by the matrix. Is this what Firefox and Chrome do? There is no hint of this in the spec, which is quite ambiguous about how the current transform should affect previous subpaths. Cheers, Michael
Re: [whatwg] Canvas arcTo method
Hi Rik, Yes, that is one way of implementing it. This is not specific to arcTo; this happens with all drawing operators. It is not quite the same with other drawing operators, for example: ctx.setTransform(...T1...); ctx.lineTo(100, 100); ctx.setTransform(...T2...); ctx.lineTo(100, 100); This will draw a line from T1*(100,100) to T2*(100,100), and these points can be calculated immediately in absolute canvas coordinates, so there is no need to apply any inverse transformations. For arcTo, it's much less obvious how the arc should be generated from the three control points when the first control point is transformed by a different matrix to the last two; in this case you cannot just remember the three points in absolute canvas coordinates, but the specification does not clarify this. I don't know. It just depends how they implemented it. They might apply the CTM to all the coordinates or keep the coordinates and pass them along with the CTM to the drawing system. In our case we are rendering to PDF, which cannot change the transformation matrix halfway through a path. Even if it could, it does not support arc primitives. But anyway, regardless of the exact details of how the browsers implement it, there is the question of how to describe the algorithm to someone such that it can be implemented with pencil and paper. Currently it is very non-obvious how arcTo should work when a new transform has been applied since the last drawing command. Best regards, Michael
Re: [whatwg] Canvas arcTo method
Hi Rik, Yes you can go to absolute canvas coordinates but you need to remember that the radius is transformed too. You cannot transform the three control points, and then generate the arc. If you do this, you will always get circular arcs, whereas a scale(2, 1) will produce an elliptical arc. You have to generate the arc, then scale it. I am sure that it's supposed to work. Do you have an example where this is not the case? (Maybe you're using PDFL?) This is getting off-topic, but in the PDF 1.7 specification, section 4.1 Graphics Objects, it states that inside a path object the only allowed operators are the path construction operators, followed by the path painting operators. This does not include page description level operators that change the graphics state, such as the transformation. Cheers, Michael
[whatwg] Canvas arcTo method
Hi, The canvas arcTo method generates an arc that touches two tangent lines. The first tangent line is from the last point in the previous subpath to the first point passed to the arcTo method. What happens in this situation: ctx.lineTo(100, 100); ctx.scale(2, 1); ctx.arcTo(100, 100, 100, 200, 100); The current transformation matrix should be used to transform the generated arc, not to transform its control points. However, in this case the first untransformed control point is equal to the last point in the previous subpath, which means it must generate a straight line and not an arc. Firefox and Chrome do not do this, as can be seen by viewing the attached HTML file. What is the correct behaviour in this case? Best regards, Michael
Re: [whatwg] Canvas arcTo method
Firefox and Chrome do not do this, as can be seen by viewing the attached HTML file. Or since attachments are stripped, here is the file: http://www.princexml.com/arcto.html Cheers, Michael
[whatwg] title/meta elements outside of head
Hi,

Currently the spec seems to indicate that title and meta elements found in the body will stay where they are and not be added to the head. However, if these elements occur after the head and before the body then they will be added to the head. Is this intentional?

Sample document #1:

    <html>
    <head>
    </head>
    <body>
    <title>This will stay in the body</title>

Sample document #2:

    <html>
    <head>
    </head>
    <title>This will be moved to the head</title>

Sample document #3:

    <html>
    <head>
    abc
    <title>Now we are in the body, where this will stay</title>

What is the reason why title/meta elements are not always moved to the head, regardless of where they appear?

Best regards, Michael -- Print XML with Prince! http://www.princexml.com
[whatwg] Minor clarification of meta charset sniffing
Hi, A minor point relating to comment skipping in the charset sniffing algorithm described in section 8.2.2 of HTML5. The existing text says: Advance the position pointer so that it points at the first 0x3E byte which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' sequence) and comes after the second 0x2D byte that was found. (The two 0x2D bytes cannot be the same as those in the '<!--' sequence.) If no such byte is found before the nth byte, abort this two step algorithm. This clearly says that '<!-->' is not a complete comment, as the second pair of hyphens cannot be the same as the first. However, it doesn't clearly say whether '<!--->' is a complete comment or not. One option would be to say that the second two 0x2D bytes come after the second 0x2D byte that was found, not just the 0x3E byte coming after the second 0x2D byte that was found. Best regards, Michael -- Print XML with Prince! http://www.princexml.com
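The stricter reading suggested above, where both closing hyphens must come after the opening pair, could be sketched as follows. This is one interpretation only, not the spec's wording; `findCommentEnd` and `ascii` are hypothetical helper names, and the byte limit on the search is left out for brevity.

```javascript
// Converts an ASCII string to an array of byte values (test helper).
const ascii = (s) => Array.from(s, (ch) => ch.charCodeAt(0));

// Returns the index of the 0x3E byte that ends the comment opened at
// `start` (which must point at the 0x3C of "<!--"), or -1 if none is found.
function findCommentEnd(bytes, start) {
  // The opening hyphens sit at start + 2 and start + 3. Under this reading
  // the closing "--" may reuse neither, so the earliest closing 0x3E is at
  // start + 6.
  for (let i = start + 6; i < bytes.length; i++) {
    if (bytes[i] === 0x3e && bytes[i - 1] === 0x2d && bytes[i - 2] === 0x2d) {
      return i;
    }
  }
  return -1;
}

console.log(findCommentEnd(ascii("<!-->"), 0));   // -1
console.log(findCommentEnd(ascii("<!--->"), 0));  // -1
console.log(findCommentEnd(ascii("<!---->"), 0)); // 6
```

Under this interpretation '<!--->' is not a complete comment, because its third hyphen would have to pair with one of the opening hyphens.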
[whatwg] Minor bug in meta charset sniffing
Hi, In the text "0x3C 0x2D (ASCII '<!')", the 0x2D should be 0x21. Cheers, Michael -- Print XML with Prince! http://www.princexml.com
[whatwg] Drop UTF-32
Hi,

Suggestion: drop UTF-32 from the character encoding detection section of HTML5, and even better, discourage or forbid user agents from implementing support for UTF-32.

Why:

- It's not widely used. In fact, has UTF-32 ever been used at all, outside of test suites?
- It's not widely implemented. For example, the expat XML parser does not support it, and nobody cares.
- When it is supported, people get it wrong, and the bugs are never fixed because no one uses UTF-32 anyway and no one cares. For an example of this, see html5lib 0.9, which implements the BOM detection algorithm, but gets it wrong by checking for UTF-16 before checking for UTF-32. Because the UTF-16 BOM (FF FE) is a prefix of the UTF-32 BOM (FF FE 00 00), the UTF-16 test will always succeed and UTF-32 will always be misidentified as UTF-16. But no one cares, as no one uses UTF-32 anyway.
- UTF-32 is horrendously inefficient for just about all real world text and its use should not be encouraged on the web.

Really, UTF-32 only exists as a tutorial example of how Unicode can be encoded, not as a practical character encoding that people should actually use. Please, drop UTF-32 and save implementors from worrying about it when no one uses it and no one should use it.

Thanks, Michael -- Print XML with Prince! http://www.princexml.com
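The ordering problem is easy to see in code. The sketch below is not html5lib's implementation; `detectBom` is a hypothetical helper that shows why an implementation supporting UTF-32 has to test its BOM before the UTF-16 one.

```javascript
// Hypothetical helper: returns the encoding implied by a byte order mark,
// or null if the byte array does not start with one.
function detectBom(bytes) {
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  // UTF-32 must be tested first: its little-endian BOM (FF FE 00 00)
  // begins with the UTF-16LE BOM (FF FE).
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return "UTF-32BE";
  if (startsWith([0xef, 0xbb, 0xbf])) return "UTF-8";
  if (startsWith([0xff, 0xfe])) return "UTF-16LE";
  if (startsWith([0xfe, 0xff])) return "UTF-16BE";
  return null;
}

console.log(detectBom([0xff, 0xfe, 0x00, 0x00, 0x41])); // UTF-32LE
console.log(detectBom([0xff, 0xfe, 0x41, 0x00]));       // UTF-16LE
```

Swapping the first and fourth checks reproduces the bug described above. Even with the right order, a UTF-16LE document whose first character is U+0000 is indistinguishable from UTF-32LE by its leading bytes, which is one more reason the extra encoding adds risk.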
Re: [whatwg] Resurrection of HTML+'s image
Hi Anne, Oh yes, lets upgrade DOCTYPE sniffing to the 20th century. Fricking awesome. 21st century -- or to put it another way, <discworld>let's drag DOCTYPE kicking and screaming into the century of the fruitbat</discworld> Michael -- Print XML with Prince! http://www.princexml.com
Re: [whatwg] Configure Apache to send the right MIME type for XHTML
Hi David, Or export them to PDF via PrinceXML, for example. The ability to mark up content once but publish it twice, in a usable, attractive format both for the web and for print, gives XHTML tremendous practical value for web publishers. It isn't just theoretical or fashionable anymore. While I agree that XHTML is indeed great, Prince also supports regular HTML documents, too. This can be convenient when grabbing content off the web that you need to print, or reusing your existing HTML content. One downside of using HTML is that errors in the document can cause odd behaviour and can be harder to track down than errors in XML/XHTML. Best regards, Michael -- Print XML with Prince! http://www.princexml.com
Re: [whatwg] Distinguishing XML and HTML by content sniffing
Hi Simon, If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that. Some Linux file systems might potentially be capable of storing extra metadata in extended attributes, but in practice I haven't seen any Linux distributions actually use this functionality for storing the content type of files. This basically leaves us with file extensions, just like Windows. .htm and .html are HTML. I know of lots of HTML documents that start with an XML declaration but are not well-formed if parsed as XML. (For starters, some version of DreamWeaver emitted XML declarations for documents, but did not ensure well-formedness and the result is often not well-formed.) Even if it was well-formed, it probably wasn't tested under XML conditions so it's likely that style sheets and scripts only work correctly under HTML conditions. Given that Prince serves a different niche than most user agents, our users tend to be more likely to use XML with embedded SVG etc., and less likely to run Prince on documents created by DreamWeaver. When Prince is run on a document retrieved over HTTP it obeys the Content-Type header, so that documents on the web will be parsed as HTML. However, it is true that if a document that appears to be XML but actually isn't is downloaded and saved as a file then Prince will try to load it as XML rather than HTML after sniffing the content in the absence of a Content-Type header. The user will then receive error messages if the document is not well-formed. 
In practice, this case does not seem to arise very often, but if it encourages people to either fix their XML and make it well-formed or stop pretending that their HTML is XML then that doesn't sound like such a bad thing :) If an author authored a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then it is likely that that document would break in browsers (because XML-only features don't work in HTML). This comes back to the thorny issue of how MathML is supposed to work on the web. It seems to require that content be served up as XHTML, which no one does, or that HTML documents contain XML islands, which is not well specified at all. It would be nice if HTML5 could tackle this in a way that makes sense. HTML5 has specified content-sniffing rules, FWIW: http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing Yes, these rules never seem to identify a document as being XML, though. See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500 Prince always respects the Content-Type header, and only sniffs document content when no such metadata is available. Best regards, Michael -- Print XML with Prince! http://www.princexml.com
Re: [whatwg] Distinguishing XML and HTML by content sniffing
Hi Julian, What, except efficiency, prevents you from parsing the whole file with an XML parser? If it parses, it is XML. Otherwise it isn't. This approach would suffer from the opposite problem: documents that the author intended to be treated as XML would be treated as HTML if there was a single well-formedness error anywhere in the document. The resulting behaviour would be quite confusing for users, as an XHTML file containing SVG and MathML content would suddenly stop working if a tag was left unclosed. However, since the file would probably still parse correctly as HTML, especially if the unclosed tag was something like img or br, the user might not get any error messages relating to the well-formedness error. Instead, they could get error messages relating to the unknown SVG and MathML tags in their HTML document. Our heuristics are an attempt to guess the intentions of users. Specifying an XML declaration or other XML-specific content is an indication that the document should be treated as XML. In the absence of any XML-specific signs, a .html file really has to be treated like a HTML document, even if it would potentially be successfully parsed by an XML parser. Any other policy would appear to lead to very confusing behaviour. Best regards, Michael -- Print XML with Prince! http://www.princexml.com
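The policy described in this thread (metadata first, then XML-specific signs, otherwise HTML) might be sketched as below. This is a deliberately simplified illustration and not Prince's actual heuristics: the function name `chooseParser`, the input shape, and the choice of an XML declaration as the only sign checked are all assumptions.

```javascript
// Hypothetical sketch: decide how to parse a document.
//   contentType: value of the Content-Type header, or null if unavailable
//   text: the start of the document as a string
function chooseParser({ contentType = null, text = "" }) {
  if (contentType !== null) {
    // Metadata wins; content is only sniffed when none is available.
    return /xml/i.test(contentType) ? "xml" : "html";
  }
  // Assumption: an XML declaration at the very start (after an optional
  // BOM) is the only XML-specific sign this sketch looks for.
  return /^\uFEFF?<\?xml\s/.test(text) ? "xml" : "html";
}

console.log(chooseParser({ text: '<?xml version="1.0"?><html/>' })); // xml
console.log(chooseParser({ text: "<html><body>hi</body></html>" })); // html
console.log(chooseParser({
  contentType: "text/html",
  text: '<?xml version="1.0"?><html/>',
})); // html
```

Note that the decision never depends on whether the whole file is well-formed, so a single unclosed tag cannot silently flip an intended XML document into HTML parsing.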
[whatwg] Distinguishing XML and HTML by content sniffing
Hi all, For user agents like Prince that support XML and HTML content it is sometimes necessary to distinguish whether a .html file is actually XML or HTML in order for it to be processed correctly. I've written an article for XML.com explaining exactly how Prince performs content sniffing to distinguish XML and HTML documents: What Does XML Smell Like? http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html Any feedback would be greatly appreciated. No doubt at some point it will be necessary to revise our heuristics for HTML5 :) Best regards, Michael -- Print XML with Prince! http://www.princexml.com