Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-03 Thread Michael Day

Hi Ian,


> We don't have any data that says that we need to support this for
> innerHTML. I think it's a win if we can drop the hack from innerHTML.


Okay, so allowing some HTML elements to break out of foreign content is 
a hack added for historical reasons, that will surprise authors and 
complicate implementations and is thus regrettable, but necessary.


Then there are two possibilities for fragment parsing:

(1) The hack can be left out of fragment parsing, as there is no 
historical justification for it. Since the hack is bad, removing it from 
as many situations as possible is good.


(2) The hack can apply to fragment parsing in the same way as it applies 
to regular parsing. This makes parsing behaviour more consistent across 
different situations, which is good.


I'm strongly in favour of (2), as it seems that omitting the hack from 
some rare situations doesn't save authors any trouble, and doesn't 
follow the principle of least surprise.


In an ideal world it would be possible to grab any subsection of a 
document, parse that in isolation as a fragment, and get the same result 
as if it was parsed in its original document context. This is possible 
in XML, but not HTML, due to the existing author-friendly hacks, and 
making the parsing behaviour even more context sensitive doesn't seem 
like a good thing.


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-03 Thread Michael Day

Hi Ian,


> The problem is that we can't do (2) in _all_ cases, e.g. innerHTML on an
> svg can't possibly break out of the svg if it sees one of these tags,
> since that's the root of what is being parsed.


Yes, HTML has already lost the composability of parsing that XML and 
other languages have, that's long gone. But that doesn't mean we should 
try to make it even more irregular :)


Currently Firefox, Chrome, and Prince all treat the fragment case the 
same as the whole document case, so we already have interoperable 
behaviour on this issue.


Since the HTML spec is supposed to reflect reality, it seems pointless 
to deliberately introduce an inconsistency in the parsing model that 
requires changes in all user agents to implement.


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-07-02 Thread Michael Day

Hi Ian,


> I ended up removing this from the spec for other reasons, so this should
> be resolved now. Let me know if it's not.
>
> (No, I don't know what I had originally intended.)


I don't think the new spec is correct. The question is what happens if 
we are tokenizing some foreign content, and we see an HTML start tag.


In the normal case, we pop off all the foreign elements until we get 
back to the HTML namespace, then reprocess the token.
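
A minimal sketch of the normal case described above, assuming the stack of open elements is modelled as an array of {name, ns} records (a hypothetical representation). It ignores the integration-point checks a real parser also needs, and reprocessing the token is left to the caller.

```javascript
const HTML_NS = "http://www.w3.org/1999/xhtml";
const SVG_NS = "http://www.w3.org/2000/svg";

// Pop foreign elements until the current node (the last entry) is in the
// HTML namespace. The bottom entry is never popped.
function popForeignElements(stack) {
  while (stack.length > 1 && stack[stack.length - 1].ns !== HTML_NS) {
    stack.pop();
  }
  return stack;
}

const stack = [
  { name: "html", ns: HTML_NS },
  { name: "body", ns: HTML_NS },
  { name: "svg", ns: SVG_NS },
  { name: "g", ns: SVG_NS },
];
popForeignElements(stack);
console.log(stack.map((e) => e.name).join(" > ")); // html > body
```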


In the fragment case, the context element may be a foreign element, so 
there was the wrinkle of having to handle that appropriately when we 
have this fake root html element that makes everything confusing.


The new text reads:

If the parser was originally created for the HTML fragment parsing 
algorithm, then act as described in the "any other start tag" entry 
below. (fragment case)


This always just adds the HTML element in place inside the foreign 
content, even if the fragment context element *is* a HTML element!


This can't be right, as it means parsing document.body.innerHTML will 
behave totally differently to parsing <html><body>, for no reason.


Looking back a couple of years, this section of the spec seems to be 
drifting in a random walk away from reality. We can study this further 
and try suggesting some text based on what we have implemented so far.


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Another issue in 12.2.5.5 parsing tokens in foreign content

2013-06-23 Thread Michael Day

Hi Adam,


>> Since the stack of open elements always has html at the top of the stack,
>> the element in scope algorithm will always find it, and as a result, the
>> first part of the condition will always fail.
>
> Even in the fragment case?  (Note the parenthetical remark in the spec
> about this text applying only in the fragment case.)


Yes, see 12.4, the stack of open elements always contains a html root 
in the fragment case when there is a context element:


Let root be a new html element with no attributes.
...
Set up the parser's stack of open elements so that it contains just
the single element root.

Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


[whatwg] [dom] attributes collection not fully defined?

2013-05-29 Thread Michael Day

Hi,

In the definition of the Element.attributes collection here:

http://dom.spec.whatwg.org/#dom-element-attributes

It doesn't seem to describe the behaviour for setting direct properties 
of the attributes collection, and how they map to attributes.


For example, setting an attribute will create a property with the same 
name as the attribute:


div = document.createElement("div");
div.setAttribute("foo", "bar");
alert(div.attributes.foo); // [object Attr]

Except for read-only properties like length, which will not be shadowed 
by attributes:


div.setAttribute("length", 99);
alert(div.attributes.length); // 2

So far so good. Things get weird, though:

div.attributes.fruit = "apple";
alert(div.attributes.fruit); // apple
div.setAttribute("fruit", "orange");
alert(div.attributes.fruit); // [object Attr]
div.removeAttribute("fruit");
alert(div.attributes.fruit); // apple (!!!)

Firefox and Chrome seem to be inconsistent on this, but at least in some 
situations they will shadow the property with an attribute, then restore 
the original property when the attribute is removed.
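
The shadow-then-restore behaviour can be imitated outside a browser with a Proxy. This is only a toy model under stated assumptions: makeElement, the Map of attributes, and the plain {name, value} record standing in for an Attr are all hypothetical, and nothing here claims to be how Firefox or Chrome implement the collection.

```javascript
// Toy model: attributes live in a Map, ordinary ("expando") properties live
// on the proxy target. Reads prefer an attribute of the same name.
function makeElement() {
  const attrs = new Map();
  const attributes = new Proxy({}, {
    get(target, prop) {
      if (prop === "length") return attrs.size; // never shadowed
      if (typeof prop === "string" && attrs.has(prop)) {
        return { name: prop, value: attrs.get(prop) }; // stands in for an Attr
      }
      return target[prop];
    },
  });
  return {
    attributes,
    setAttribute: (name, value) => { attrs.set(name, String(value)); },
    removeAttribute: (name) => { attrs.delete(name); },
  };
}

const div = makeElement();
div.attributes.fruit = "apple";          // stored on the target
div.setAttribute("fruit", "orange");     // attribute now shadows it
const shadowed = div.attributes.fruit.value;
div.removeAttribute("fruit");            // old property becomes visible again
const restored = div.attributes.fruit;
console.log(shadowed, restored); // orange apple
```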


You can have more fun by using Object.defineProperty to make the 
property read-only or unconfigurable, which Firefox and Chrome will 
again treat inconsistently.


The mind boggles. How are these pseudo-properties supposed to be 
implemented? What magic hook calls them to life?


The reason I ask is that jQuery = 1.9 uses div.attributes in its 
feature detection code, and it's causing us problems.


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] [dom] attributes collection not fully defined?

2013-05-29 Thread Michael Day

Hi Boris,

Thank you for the detailed explanation. Having the WebIDL named getter 
definition helps to simplify things.


This part still seems inconsistent with current browsers:


> 4)  Setting a property name that is currently exposed does a Reject
>  (which means throw in strict mode, silently do nothing in
>  non-strict mode).  Unless there is a named setter, of course.


If I set the property name which has already been used for an attribute, 
it still seems to store the value:


div.setAttribute("fruit", "orange");
div.attributes.fruit = "apple";
div.removeAttribute("fruit");
alert(div.attributes.fruit); // apple

except for a very strange bug in Firefox only, where if I *read* the 
value before removing it, the attribute doesn't go away later:


div.setAttribute("fruit", "orange");
div.attributes.fruit = "apple";
alert(div.attributes.fruit.value); // orange
div.removeAttribute("fruit");
alert(div.attributes.fruit); // [object Attr] ???

Just adding the extra alert in the middle changes the value after 
removing the attribute, so that the Attr object is still returned.


Anyway, doing nothing or throwing if the user tries to write to a 
property which is currently exposed seems like a much better option.
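
For comparison, the "Reject" behaviour quoted above can be reproduced in plain JavaScript with a Proxy whose set trap returns false: strict-mode code then gets a TypeError, while non-strict code silently does nothing. This is a sketch only; makeAttributes and strictWrite are hypothetical helpers, and the fixed set of exposed names stands in for the element's current attributes.

```javascript
// Writes to a currently exposed name are rejected; other writes are stored.
function makeAttributes(exposedNames) {
  return new Proxy({}, {
    set(target, prop, value) {
      if (exposedNames.has(prop)) return false; // "Reject"
      target[prop] = value;
      return true;
    },
  });
}

// Performs the write in strict mode and reports what happened.
function strictWrite(obj, name, value) {
  "use strict";
  try {
    obj[name] = value;
    return "stored";
  } catch (e) {
    return e instanceof TypeError ? "TypeError" : "other error";
  }
}

const attributes = makeAttributes(new Set(["fruit"]));
console.log(strictWrite(attributes, "fruit", "apple")); // TypeError
console.log(strictWrite(attributes, "colour", "red"));  // stored
```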


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] [dom] attributes collection not fully defined?

2013-05-29 Thread Michael Day

> And this is why we should make named getter/setters a thing of the
> past. New specs are still being written which use these WebIDL
> features and almost all of them end up with confusing behavior like
> this.


+1 +1 +1 +1e100 :)

Michael

--
Prince: Print with CSS!
http://www.princexml.com


[whatwg] Pull requests for HTML5 spec?

2013-05-14 Thread Michael Day

Hi,

There are various branches and versions of the W3C and WHAT-WG HTML 
specifications hosted on Github.


Is there any standard procedure in place for pull requests, if you have 
editorial changes to suggest?


Or is there a better way to track these kinds of changes?

Cheers,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Pull requests for HTML5 spec?

2013-05-14 Thread Michael Day

Hi Silvia,


> If you want to contribute to the WHATWG spec, you should register a
> bug on https://www.w3.org/Bugs/Public/describecomponents.cgi?product=WHATWG
> . WHATWG patches eventually get cherry-picked into the W3C spec, too,
> unless there is strong opposition in the HTML WG.


If the WHATWG spec is hosted on Subversion, I guess that means pull 
requests to that branch on Github will be ignored?


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


[whatwg] Spec ambiguity and Firefox bug for newlines following pre and textarea

2013-05-13 Thread Michael Day

Hi,

If a newline character token follows a pre or textarea start tag, it 
is supposed to be ignored as an authoring convenience.


However, what if a NULL character token gets in the way? Consider these 
two cases, where NULL represents a literal U+0000 character:


<pre>NULL&#xA;

<textarea>NULL&#xA;

For textarea, the tokenizer will be in the rcdata state, which generates 
replacement character (U+FFFD) tokens for each NULL. So the newline will 
not be the next token following the start tag, and should not be 
ignored. Chrome gets this right; Firefox gets this wrong, and displays 
the replacement character *and* strips the newline.


For pre, the tokenizer will be in the data state, which emits NULL 
characters as-is. The NULL character token is then ignored by the in 
body insertion mode. Does this mean it doesn't count as the next token 
after the start tag? Both browsers seem to think so.


In general, the concept of "next token" is not well defined; in fact I 
don't think it is ever explicitly defined in the spec. If a token is 
ignored, is it still the next token?


Since this concept is only used for the specific case of ignoring 
newlines at the start of pre, listing, and textarea, perhaps a 
better mechanism could be found to describe how it should work.
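
One possible mechanism is an explicit flag that is set by the start tag and cleared by whatever counts as the next token. The sketch below makes the open question a parameter: nullClearsFlag decides whether an ignored NULL still counts. The token shape ({type, name} and {type, data}) and the function name are assumptions for illustration, not spec text.

```javascript
// Returns the character data that would be inserted for a token stream.
function insertedChars(tokens, nullClearsFlag) {
  let skipNextLF = false;
  let out = "";
  for (const tok of tokens) {
    if (tok.type === "start") {
      skipNextLF = ["pre", "listing", "textarea"].includes(tok.name);
      continue;
    }
    if (tok.data === "\0") {
      // The NULL is dropped either way; the question is whether it
      // uses up the "next token" slot.
      if (nullClearsFlag) skipNextLF = false;
      continue;
    }
    if (tok.data === "\n" && skipNextLF) {
      skipNextLF = false; // ignore exactly one leading newline
      continue;
    }
    skipNextLF = false;
    out += tok.data;
  }
  return out;
}

const tokens = [
  { type: "start", name: "pre" },
  { type: "char", data: "\0" },
  { type: "char", data: "\n" },
  { type: "char", data: "x" },
];
console.log(JSON.stringify(insertedChars(tokens, true)));  // "\nx"
console.log(JSON.stringify(insertedChars(tokens, false))); // "x"
```

The second call matches the reading attributed to both browsers above for pre: the ignored NULL does not count, so the newline is still stripped.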


Best regards,

Michael

--
Prince: Print with CSS!
http://www.princexml.com


Re: [whatwg] Spec ambiguity and Firefox bug for newlines following pre and textarea

2013-05-13 Thread Michael Day

Hi Peter,


> You should report this issue and your previous issue (HTML5 is broken:
> menuitem causes infinite loop)
> in Bugzilla.  The WHATWG HTML spec makes it easy.


Thanks, I've done this now.

Michael

--
Prince: Print with CSS!
http://www.princexml.com


[whatwg] HTML5 is broken: menuitem causes infinite loop

2013-05-08 Thread Michael Day
Hilarious spec bug of the week: HTML5 requires implementations to loop 
indefinitely if they see a menuitem start tag.


12.2.5.4.7 "in body" insertion mode
 => see a <menuitem> start tag, process using rules for "in head"

12.2.5.4.4 "in head" insertion mode
 => see <menuitem>, act as if </head> and reprocess

12.2.5.4.6 "after head" insertion mode
 => see <menuitem>, act as if <body> and reprocess

...and we're back at in body insertion mode, and will continue to 
bounce around with the menuitem start tag token making absolutely no 
progress whatsoever.
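
The cycle can be checked mechanically. The sketch below encodes only the three transitions listed above as a table (the rules object and traceMenuitem are hypothetical names) and follows the token until an insertion mode repeats.

```javascript
// For each insertion mode: where the menuitem start tag is processed next.
const rules = {
  "in body": "in head",     // process using the rules for "in head"
  "in head": "after head",  // act as if </head>, reprocess
  "after head": "in body",  // act as if <body>, reprocess
};

// Follows the token through the modes and reports a revisit, if any.
function traceMenuitem(startMode, maxSteps) {
  const seen = new Set();
  let mode = startMode;
  for (let step = 0; step < maxSteps; step++) {
    if (seen.has(mode)) return { loops: true, revisited: mode, steps: step };
    seen.add(mode);
    mode = rules[mode];
  }
  return { loops: false, revisited: null, steps: maxSteps };
}

const result = traceMenuitem("in body", 10);
console.log(result.loops, result.revisited, result.steps); // true in body 3
```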


What is the <menuitem> tag supposed to be, anyway? A test to ensure that 
implementers are awake, like the </sarcasm> close tag?


Cheers,

Michael


[whatwg] adjusted current node in 12.2.5.5

2013-04-15 Thread Michael Day

Hi,

Recently the spec has been changed to introduce the concept of the 
adjusted current node defined in 12.2.3.2 The stack of open elements.


The intention seems to be to handle the case of setting innerHTML on a 
MathML or SVG element, and hence triggering the fragment parsing 
algorithm in a foreign content context. Since the math or svg 
element will not be in the stack of open elements, this would otherwise 
cause problems with child elements not in the right namespace, and CDATA 
sections not being parsed properly.


However, 12.2.5.5 The rules for parsing tokens in foreign content 
still only refers to the current node, not the adjusted current node.


For example, the rules for parsing Any other start tag:


If the current node is an element in the MathML namespace, adjust MathML 
attributes for the token.


Since the current node in the fragment parsing case is still html, 
this will not have the desired effect.


Should this section be changed to refer to the adjusted current node?

Best regards,

Michael


Re: [whatwg] canvas miterLimit property

2012-09-20 Thread Michael Day

Hi Ian,


> The main thing driving this API is back-compat with canvas
> implementations, not consistency with SVG. :-)


As always, whatever random crap gets implemented first becomes the 
official standard we have to support forever in the name of backwards 
compatibility because it already has a few dozen users :)


Cheers,

Michael



Re: [whatwg] Canvas arcTo method

2012-09-12 Thread Michael Day

Hi Ian,


> Yeah, that's why the spec hand-waves to transform the line too... but I
> agree that that doesn't really work.
>
> Do you have any suggestion of how to spec this better?


This is the most general arcTo situation:

setTransform(M0)
lineTo(x0, y0)
setTransform(M)
arcTo(x1, y1, x2, y2, radius, ...)

To generate the arc we need three points: P0, P1, P2, all in the same 
coordinate system. The three points are:


P0 = inverse(M) * M0 * (x0, y0)
P1 = (x1, y1)
P2 = (x2, y2)

We are transforming (x0, y0) by M0, which is the transform current at 
the time the point was added to the path. This gives us a point in 
canvas coordinates that we can transform by the inverse of M, which is 
the transform current at the time the arc is added to the path. This 
gives us a point in the same coordinate space as P1 and P2.


In the common case where M = M0, the transforms cancel each other out 
and P0 = (x0, y0).


Once we have the three points in the same coordinate space we can 
generate the arc and then apply M to all of the points in the generated 
arc to draw the arc in canvas coordinates.


Does this make sense?

I don't think it is possible to specify this process without requiring 
an inverse transformation somewhere, to get all three points into the 
same coordinate space. If so, it is probably best to describe this 
explicitly, rather than ambiguously implying the need for it.
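
The computation of P0 can be sketched in a few lines, assuming transforms are stored as the six numbers [a, b, c, d, e, f] in the same order as the arguments to setTransform. The helper names apply, invert and arcStartPoint are hypothetical, and generating the arc itself is left out.

```javascript
// Applies the affine transform m = [a, b, c, d, e, f] to a point.
function apply(m, [x, y]) {
  const [a, b, c, d, e, f] = m;
  return [a * x + c * y + e, b * x + d * y + f];
}

// Inverts an affine transform; throws if it has no inverse.
function invert(m) {
  const [a, b, c, d, e, f] = m;
  const det = a * d - b * c;
  if (det === 0) throw new Error("matrix is not invertible");
  return [
    d / det, -b / det, -c / det, a / det,
    (c * f - d * e) / det, (b * e - a * f) / det,
  ];
}

// P0 = inverse(M) * M0 * (x0, y0)
function arcStartPoint(m0, m, point) {
  return apply(invert(m), apply(m0, point));
}

const identity = [1, 0, 0, 1, 0, 0];
const scale21 = [2, 0, 0, 1, 0, 0]; // equivalent to scale(2, 1)
const p0 = arcStartPoint(identity, scale21, [100, 100]);
console.log(p0[0], p0[1]); // 50 100
```

With M0 as the identity and M as scale(2, 1), the point (100, 100) maps to (50, 100), which agrees with the lineTo(50, 100) rewrite suggested earlier in the thread. A non-invertible M has to be handled separately; the sketch simply throws.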


Best regards,

Michael


[whatwg] canvas miterLimit property

2012-09-11 Thread Michael Day

Hi,

The canvas miterLimit property has a default value of 10, while the SVG 
stroke-miterlimit property has a default value of 4. Is there a reason 
for this inconsistency?


For reference, the PDF rendering model also has a default value of 10 
for miterLimit, making SVG apparently the odd one out here.


Cheers,

Michael


Re: [whatwg] canvas miterLimit property

2012-09-11 Thread Michael Day

Hi Rik,


> I'm unsure why SVG is different.


While we are on the subject, in SVG stroke-miterlimit must be >= 1.0, 
whereas in the canvas it must be >= 0.0.


In Prince we are clamping it to 1.0, as the PDF spec is consistent with 
SVG this time, and Adobe Reader will fail if the miter limit is dropped 
below 1.0.


Best regards,

Michael



Re: [whatwg] Canvas arcTo method

2012-08-20 Thread Michael Day

Hi Rik,


> The 'scale(2,1)' set up a different coordinate system. You can rewrite
> your code from this:
>
> ctx.lineTo(100, 100);
> ctx.scale(2, 1);
> ctx.arcTo(100, 100, 100, 200, 100);
>
> to this:
>
> ctx.scale(2, 1);
> *ctx.lineTo(50, 100);*
> ctx.arcTo(100, 100, 100, 200, 100);


Right, these will produce the same arc. But how should this be 
implemented in the user agent?


It's almost like it is getting the last point in the previous subpath, 
transforming it by the inverse of the current transformation matrix, 
generating the arc, and then transforming the arc by the matrix.


Is this what Firefox and Chrome do? There is no hint of this in the 
spec, which is quite ambiguous about how the current transform should 
affect previous subpaths.


Cheers,

Michael



Re: [whatwg] Canvas arcTo method

2012-08-20 Thread Michael Day

Hi Rik,


> Yes, that is one way of implementing it. This is not specific to arcTo;
> this happens with all drawing operators.


It is not quite the same with other drawing operators, for example:

ctx.setTransform(...T1...);
ctx.lineTo(100, 100);
ctx.setTransform(...T2...);
ctx.lineTo(100, 100);

This will draw a line from T1*(100,100) to T2*(100,100), and these 
points can be calculated immediately in absolute canvas coordinates, 
there is no need to apply any inverse transformations.


For arcTo, it's much less obvious how the arc should be generated from 
the three control points, when the first control point is transformed by 
a different matrix to the last two; in this case you cannot just 
remember the three points in absolute canvas coordinates, but the 
specification does not clarify this.



> I don't know. It just depends how they implemented it. They might apply
> the CTM to all the coordinates or keep the coordinates and pass them
> along with the CTM to the drawing system.


In our case we are rendering to PDF, which cannot change the 
transformation matrix halfway through a path. Even if it could, it does 
not support arc primitives.


But anyway, regardless of the exact details of how the browsers 
implement it, there is the question of how to describe the algorithm to 
someone such that it can be implemented with pencil and paper.


Currently it is very non-obvious how arcTo should work when a new 
transform has been applied since the last drawing command.


Best regards,

Michael


Re: [whatwg] Canvas arcTo method

2012-08-20 Thread Michael Day

Hi Rik,


> Yes you can go to absolute canvas coordinates but you need to remember
> that the radius is transformed too.


You cannot transform the three control points, and then generate the 
arc. If you do this, you will always get circular arcs, whereas a 
scale(2, 1) will produce an elliptical arc. You have to generate the 
arc, then scale it.



> I am sure that it's supposed to work. Do you have an example where this
> is not the case? (Maybe you're using PDFL?)


This is getting off-topic, but in the PDF 1.7 specification, section 4.1 
Graphics Objects, it states that inside a path object the only allowed 
operators are the path construction operators, followed by the path 
painting operators. This does not include page description level 
operators that change the graphics state, such as the transformation.


Cheers,

Michael


[whatwg] Canvas arcTo method

2012-08-19 Thread Michael Day

Hi,

The canvas arcTo method generates an arc that touches two tangent lines. 
The first tangent line is from the last point in the previous subpath to 
the first point passed to the arcTo method.


What happens in this situation:

ctx.lineTo(100, 100);
ctx.scale(2, 1);
ctx.arcTo(100, 100, 100, 200, 100);

The current transformation matrix should be used to transform the 
generated arc, not to transform its control points.


However, in this case the first untransformed control point is equal to 
the last point in the previous subpath, which means it must generate a 
straight line and not an arc.


Firefox and Chrome do not do this, as can be seen by viewing the 
attached HTML file.


What is the correct behaviour in this case?

Best regards,

Michael


Re: [whatwg] Canvas arcTo method

2012-08-19 Thread Michael Day

> Firefox and Chrome do not do this, as can be seen by viewing the
> attached HTML file.


Or since attachments are stripped, here is the file:

http://www.princexml.com/arcto.html

Cheers,

Michael


[whatwg] title/meta elements outside of head

2012-01-18 Thread Michael Day

Hi,

Currently the spec seems to indicate that title and meta elements found 
in the body will stay where they are and not be added to the head.


However, if these elements occur after the head and before the body then 
they will be added to the head.


Is this intentional?

Sample document #1:

<html>
<head>
</head>
<body>
<title>This will stay in the body</title>

Sample document #2:

<html>
<head>
</head>
<title>This will be moved to the head</title>

Sample document #3:

<html>
<head>
abc
<title>Now we are in the body, where this will stay</title>

What is the reason why title/meta elements are not always moved to the 
head, regardless of where they appear?


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com


[whatwg] Minor clarification of meta charset sniffing

2007-05-23 Thread Michael Day

Hi,

A minor point relating to comment skipping in the charset sniffing 
algorithm described in section 8.2.2 of HTML5. The existing text says:


Advance the position pointer so that it points at the first 0x3E byte 
which is preceded by two 0x2D bytes (i.e. at the end of an ASCII '-->' 
sequence) and comes after the second 0x2D byte that was found. (The two 
0x2D bytes cannot be the same as those in the '<!--' sequence.) If 
no such byte is found before the nth byte, abort this two step algorithm.


This clearly says that '<!-->' is not a complete comment, as the second 
pair of hyphens cannot be the same as the first. However, it doesn't 
clearly say whether '<!--->' is a complete comment or not.


One option would be to say that the second two 0x2D bytes come after the 
second 0x2D byte that was found, not just the 0x3E byte coming after the 
second 0x2D byte that was found.
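
The two readings can be compared with a small sketch. It assumes the bytes are held in an array and that start is the index of the 0x3C byte that opens the comment; findCommentEnd and its allowPartialOverlap flag are hypothetical, and the nth-byte limit from the spec text is omitted.

```javascript
// Returns the index of the 0x3E byte that ends the comment, or -1.
// The opening hyphens sit at start + 2 and start + 3.
function findCommentEnd(bytes, start, allowPartialOverlap) {
  // Earliest index at which the closing pair of hyphens may begin.
  const firstPairStart = start + (allowPartialOverlap ? 3 : 4);
  for (let i = firstPairStart + 2; i < bytes.length; i++) {
    if (bytes[i] === 0x3e && bytes[i - 1] === 0x2d && bytes[i - 2] === 0x2d) {
      return i;
    }
  }
  return -1;
}

const ascii = (s) => Array.from(s, (ch) => ch.charCodeAt(0));

console.log(findCommentEnd(ascii("<!-->"), 0, true));    // -1
console.log(findCommentEnd(ascii("<!--->"), 0, true));   // 5
console.log(findCommentEnd(ascii("<!--->"), 0, false));  // -1
console.log(findCommentEnd(ascii("<!---->"), 0, false)); // 6
```

With allowPartialOverlap set to false, both closing hyphens must come after the second opening hyphen, which is the clarification proposed above.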


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com


[whatwg] Minor bug in meta charset sniffing

2007-05-23 Thread Michael Day

Hi,

0x3C 0x2D (ASCII '<!')

the 0x2D should be 0x21.

Cheers,

Michael

--
Print XML with Prince!
http://www.princexml.com


[whatwg] Drop UTF-32

2007-05-15 Thread Michael Day

Hi,

Suggestion: drop UTF-32 from the character encoding detection section of 
HTML5, and even better, discourage or forbid user agents from 
implementing support for UTF-32.


Why:

 - It's not widely used. In fact, has UTF-32 ever been used at all, 
outside of test suites?


 - It's not widely implemented. For example, the expat XML parser does 
not support it, and nobody cares.


 - When it is supported, people get it wrong, and the bugs are never 
fixed because no one uses UTF-32 anyway and no one cares.


For an example of this, see html5lib 0.9, which implements the BOM 
detection algorithm, but gets it wrong by checking for UTF-16 before 
checking for UTF-32. Because the UTF-16 BOM (FF FE) is a substring of 
the UTF-32 BOM (FF FE 00 00) the test will always succeed and UTF-32 
will always be misidentified as UTF-16. But no one cares, as no one uses 
UTF-32 anyway.


 - UTF-32 is horrendously inefficient for just about all real world 
text and its use should not be encouraged on the web. Really, UTF-32 
only exists as a tutorial example of how UNICODE can be encoded, not as 
a practical character encoding that people should actually use.
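
The ordering problem in the html5lib example above is easy to see in code. In this sketch (sniffBom is a hypothetical helper, not html5lib's API) the four-byte UTF-32 signatures are tested before the two-byte UTF-16 ones; swapping the order would make the UTF-32LE branch unreachable.

```javascript
// Returns an encoding label based on a leading byte order mark, or null.
function sniffBom(bytes) {
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  // Longer signatures first: FF FE is a prefix of FF FE 00 00.
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return "UTF-32BE";
  if (startsWith([0xff, 0xfe])) return "UTF-16LE";
  if (startsWith([0xfe, 0xff])) return "UTF-16BE";
  if (startsWith([0xef, 0xbb, 0xbf])) return "UTF-8";
  return null;
}

console.log(sniffBom([0xff, 0xfe, 0x00, 0x00])); // UTF-32LE
console.log(sniffBom([0xff, 0xfe, 0x41, 0x00])); // UTF-16LE
```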


Please, drop UTF-32 and save implementors from worrying about it when no 
one uses it and no one should use it.


Thanks,

Michael

--
Print XML with Prince!
http://www.princexml.com


Re: [whatwg] Resurrection of HTML+'s image

2007-03-20 Thread Michael Day

Hi Anne,

> Oh yes, lets upgrade DOCTYPE sniffing to the 20th century. Fricking
> awesome.


21st century -- or to put it another way, <discworld>let's drag DOCTYPE 
kicking and screaming into the century of the fruitbat</discworld>


Michael

--
Print XML with Prince!
http://www.princexml.com


Re: [whatwg] Configure Apache to send the right MIME type for XHTML

2007-03-07 Thread Michael Day

Hi David,

Or export them to PDF via PrinceXML, for example. The ability to mark up 
content once but publish it twice, in a usable, attractive format both 
for the web and for print, gives XHTML tremendous practical value for 
web publishers. It isn't just theoretical or fashionable anymore. 


While I agree that XHTML is indeed great, Prince also supports regular 
HTML documents, too. This can be convenient when grabbing content off 
the web that you need to print, or reusing your existing HTML content.


One downside of using HTML is that errors in the document can cause odd 
behaviour and can be harder to track down than errors in XML/XHTML.


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com


Re: [whatwg] Distinguishing XML and HTML by content sniffing

2007-03-05 Thread Michael Day

Hi Simon,

> If you load a file from disk, then use any meta information the OS can
> provide. (I think Linux can store Content-Type information for files.)
> If the OS relies on file extensions (like Windows does) then use that.


Some Linux file systems might potentially be capable of storing extra
metadata in extended attributes, but in practice I haven't seen any
Linux distributions actually use this functionality for storing the
content type of files. This basically leaves us with file extensions,
just like Windows.

> .htm and .html are HTML. I know of lots of HTML documents that start
> with an XML declaration but are not well-formed if parsed as XML. (For
> starters, some version of DreamWeaver emitted XML declarations for
> documents, but did not ensure well-formedness and the result is often
> not well-formed.) Even if it was well-formed, it probably wasn't tested
> under XML conditions so it's likely that style sheets and scripts only
> work correctly under HTML conditions.


Given that Prince serves a different niche than most user agents, our 
users tend to be more likely to use XML with embedded SVG etc., and less 
likely to run Prince on documents created by DreamWeaver. When Prince is 
run on a document retrieved over HTTP it obeys the Content-Type header, 
so that documents on the web will be parsed as HTML.


However, it is true that if a document that appears to be XML but 
actually isn't is downloaded and saved as a file then Prince will try to 
load it as XML rather than HTML after sniffing the content in the 
absence of a Content-Type header. The user will then receive error 
messages if the document is not well-formed. In practice, this case does 
not seem to arise very often, but if it encourages people to either fix 
their XML and make it well-formed or stop pretending that their HTML is 
XML then that doesn't sound like such a bad thing :)


> If an author authored a document and tested it with Prince, finding
> that XML-only features work even with a .html file extension, then it is
> likely that that document would break in browsers (because XML-only
> features don't work in HTML).


This comes back to the thorny issue of how MathML is supposed to work on 
the web. It seems to require that content be served up as XHTML, which 
no one does, or that HTML documents contain XML islands, which is not 
well specified at all. It would be nice if HTML5 could tackle this in a 
way that makes sense.


> HTML5 has specified content-sniffing rules, FWIW:
> http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing


Yes, these rules never seem to identify a document as being XML, though.


See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500


Prince always respects the Content-Type header, and only sniffs document 
content when no such metadata is available.


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com


Re: [whatwg] Distinguishing XML and HTML by content sniffing

2007-03-04 Thread Michael Day

Hi Julian,

> What, except efficiency, prevents you from parsing the whole file with
> an XML parser? If it parses, it is XML. Otherwise it isn't.


This approach would suffer from the opposite problem: documents that the 
author intended to be treated as XML would be treated as HTML if there 
was a single well-formedness error anywhere in the document.


The resulting behaviour would be quite confusing for users, as an XHTML 
file containing SVG and MathML content would suddenly stop working if a 
tag was left unclosed. However, since the file would probably still 
parse correctly as HTML, especially if the unclosed tag was something 
like img or br, the user might not get any error messages relating 
to the well-formedness error. Instead, they could get error messages 
relating to the unknown SVG and MathML tags in their HTML document.


Our heuristics are an attempt to guess the intentions of users. 
Specifying an XML declaration or other XML-specific content is an 
indication that the document should be treated as XML. In the absence of 
any XML-specific signs, a .html file really has to be treated like a 
HTML document, even if it would potentially be successfully parsed by an 
XML parser. Any other policy would appear to lead to very confusing 
behaviour.
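
As a rough illustration of this kind of heuristic (not Prince's actual rules, which are described in the XML.com article), the sketch below treats a leading XML declaration as the only XML-specific sign and defaults to HTML otherwise. The function name guessSyntax is hypothetical.

```javascript
// Guesses how a document with no Content-Type metadata should be parsed.
function guessSyntax(text) {
  // A byte order mark may come before the XML declaration.
  const head = text.startsWith("\uFEFF") ? text.slice(1) : text;
  if (head.startsWith("<?xml")) return "xml"; // explicit XML-specific sign
  return "html"; // default, even if the file would parse as XML
}

console.log(guessSyntax('<?xml version="1.0"?><html/>')); // xml
console.log(guessSyntax("<html><br></html>"));            // html
```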


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com


[whatwg] Distinguishing XML and HTML by content sniffing

2007-03-03 Thread Michael Day

Hi all,

For user agents like Prince that support XML and HTML content it is 
sometimes necessary to distinguish whether a .html file is actually XML 
or HTML in order for it to be processed correctly.


I've written an article for XML.com explaining exactly how Prince 
performs content sniffing to distinguish XML and HTML documents:


What Does XML Smell Like?
http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html

Any feedback would be greatly appreciated. No doubt at some point it 
will be necessary to revise our heuristics for HTML5 :)


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com