Ian Hickson wrote:
On Mon, 18 May 2009, Brett Zamir wrote:
Section 10.1, "Writing XHTML documents" observes: "According to the XML
specification, XML processors are not guaranteed to process the external
DTD subset referenced in the DOCTYPE."

While this is true, since no doubt the majority of web browsers are
already able to process external stylesheets or scripts, might the very
useful feature of external entity files be employed by XHTML 5 as a
stricter subset of XML (similar to how XML Namespaces re-annexed the
colon character), in order to allow this feature to work for XHTML
(to give access to HTML entities or other useful entities, for one, as
well as to enable a poor man's localization, etc.)?

While there are arguments on both sides of whether this is a good idea or
not, I think the more important concern in this case is whether we can
extend XML in this way. I think in practice we should leave this up to the
XML specs and their successors. I don't think it would be appropriate for
us to profile the XML spec in this way.


While it is not my purpose to extend the debate on external DTDs, I wanted to bring up the following points (prompted by a recent re-review of the spec), because they raise a few serious issues which I believe current browsers are failing at; if the browsers do not address them, claims of real XHTML 5 support (as with XHTML 1.* and plain XML support) would be unworkable. While I agree that any changes to XML itself should be up to the XML specs, from what I can now tell, closer adherence to the existing spec would solve most of the existing problems, so I wanted to share the points which I think could resolve most of the issues, if browsers make the required changes.

I was pleasantly surprised to find that the spec seems to recommend solutions which, I believe, avoid the more serious single-point-of-failure problems.

(The other complaints about DTDs, such as blocking cross-domain DTDs for security or to avoid denial-of-service attacks, could be treated as an optional matter if that, combined with adherence to the existing recommendations, satisfies those concerns, though I personally do not think the risk is comparable to that of including cross-domain scripts.)

So what follows is what I have gleaned from these various statements as applied to current browsers. I can provide specific citations, but I did not wish to expand this post unnecessarily (though I list references at the end).

The major issues which I think ought to be resolved by certain browsers, since their current behavior does not seem to be in accord with the XML spec and, as a result, creates interoperability problems:

1) Firefox and WebKit should not treat a missing entity as a single point of failure, as they do now (unless they switch to a validating parser which finds no declaration in the external file and the user is in validation mode), since such failures in a document with an external DTD are NOT well-formedness errors unless the document deliberately declares standalone="yes" (a minimal example follows below).

2) Internet Explorer, which in IE8 no longer seems to require that the document be completely described by the DTD as I believe it did earlier (though it will report errors if the document violates rules which are declared), should, per the spec, report validation errors only at user option (ideally, I would say, off by default, and activatable on a case-by-case as well as a preference basis). Being able to disable the option could also speed things up, and it would let the browser work with documents which fail validation. But this issue is not as serious as #1, since #1 prevents even valid documents from being viewed interoperably on the web.
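To make #1 concrete, here is a minimal sketch of the kind of document at issue (the file name entities.ent and the entity name are only placeholders):

  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  <!-- standalone is "no" (the default) and an external subset is declared,
       so an undeclared entity reference below is at most a validity error,
       not a well-formedness error -->
  <!DOCTYPE html SYSTEM "entities.ent">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Example</title></head>
    <body><p>&possiblyMissing; and the rest of the paragraph.</p></body>
  </html>

A non-validating parser which chooses not to read entities.ent should still display the rest of the document (leaving the reference unexpanded at worst) rather than aborting with a fatal error.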

If these issues are addressed by those aiming for compliance, the only disadvantages which will remain (and which are inherent in XML's allowing validating and non-validating parsers to co-exist) are those described in http://www.w3.org/TR/REC-xml/#safe-behavior and http://www.w3.org/TR/REC-xml/#proc-types, namely that:

1) Some entity-related /well-formedness/ errors (e.g., an entity which is used but never defined) will go undetected by a non-validating parser, since it does not need to load the entity replacement. This is not a big problem, since a document author should presumably have checked, with an application which does perform external entity substitution, that their entities integrate properly with the text. It is less important that they check for /validation/ errors, since, as mentioned above, those need only be reported optionally (the external-subset sketch below shows the kind of declarations such a parser may skip).

2) The application may not be notified by its processor of, e.g., entity replacement values, if the processor is non-validating (though non-validating processors may also make such replacements). But since these cases are, as mentioned above, not well-formedness errors, there is no single point of failure here either; there may be some missing content, but it is still indicated by the entity reference left in the displayed output.

3) A few validity issues, such as duplicate declarations (which can include attribute defaults), can lead to undefined behavior (though, given that validation is only optional even for validating applications, it seems all applications will have to deal with this).
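For instance, the external subset in the sketch above might contain nothing more than the following (again, the names are placeholders):

  <!-- entities.ent: declarations a processor which reads the external
       subset would apply -->
  <!ENTITY possiblyMissing "some replacement text">
  <!ENTITY chapter1 SYSTEM "chapter1.xml"> <!-- an external parsed entity -->

A processor which skips this file cannot detect that the replacement text (or the content of chapter1.xml) is itself ill-formed, and cannot hand the replacement values to the application; it can only report that it recognized, but did not read, the entities, which is exactly the safe, non-fatal behavior described above.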

In other words, as I read the spec, users going from one browser to another will not face problems unless:

1) They visit invalid documents with the option to validate turned on (validation is only supposed to be an option) and expect other browsers to report the same errors. This is not a big issue, since a document which declares its validity constraints and then breaks them is basically asking for trouble, and even then the user is supposed to have the option to view the document without validation (a small example follows below).

2) They expect to see the entity replacement text. At least this is not a single point of failure, and in many cases, such as when entities are merely used to represent symbols, the text can be read in full without any disruption in the document flow. Of course, performing the replacements would be even better, and doing so does not require supporting validation.
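As an illustration of the first point, a document such as the following (a contrived sketch) is well-formed but invalid; only a user who has opted into validation should ever be shown an error for it, and every browser should still be able to display it:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE list [
    <!ELEMENT list (item+)>
    <!ELEMENT item (#PCDATA)>
  ]>
  <list>
    <wrong>violates the declared content model of "list"</wrong>
  </list>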

There are also the following optional issues which browsers might wish to consider (though if these are not implemented, the above fixes alone would address the most serious problems):

1) Since even a non-validating processor is supposed to inform the application when it has recognized, but not read, an entity (i.e., when it does not replace the reference with content found in an external DTD), a browser like Opera (the only one which, as far as I can tell, does not report such cases, even though it correctly avoids a single point of failure) might, if not implementing #2 below, wish to consider doing so; a compliant processor is at least supposed to report these cases to the application, to do with as it sees fit. Admittedly there is no obligation on the application to surface them, and in any case such reporting is not to become a single point of failure. But it still might be nice to distinguish the display of entities which were not found from deliberately escaped ones (e.g., &myEnt; produced by a missing entity currently appears the same in Opera, except in source view, as a deliberately escaped &amp;myEnt;).

2) Opera, Firefox, and WebKit (after the latter two fix the more serious issue mentioned above) might also wish to consider expanding their XML support for their users to:

a) show a link which optionally expands each external parsed entity reference (or other entities), if they do not do the following;

b) build on a non-validating parser to perform automatic entity and default attribute value replacement, and attribute value normalization, using an external DTD (at least a same-domain one). The XML spec only warns against relying on this so that an application stays free to switch between non-validating parsers, which may or may not take these actions; this does not hurt interoperability for users (it only improves it), so even without any desire to support validation, browsers could still offer entity replacement and the like (see the sketch after this list);

c) implement a validating parser which can do entity and default attribute value replacement and attribute value normalization from an external DTD, and which can also validate the document at the user's discretion. This should not slow things down for the user, since the spec itself indicates that reporting of validation errors is required only "at user option". It would give the user the best of both worlds: the opportunity to fully read XML/XHTML files online (without any requirement to pay a validation performance cost), and, if they are, for example, a document author, the choice to take a client-side performance hit to check validity. Of course, the external files need to be loaded in either case in order to perform the replacements, but the document author does NOT need to provide full DTD validation in the external DTD, so users will not be forced to download DTDs reflecting the whole document structure unless the document author wishes to reference such files. Indeed, authors might be encouraged not to include such content in their DTDs (performing validation offline) so that they and their users can reduce bandwidth, unless their purpose is to show the validation transparently (though DTD validation is of course not very strong).
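As a rough sketch of what 2b/2c would pick up from an external subset (the file, entity, and attribute names here are only illustrative):

  <!-- shared.ent: declarations usable even without full validation -->
  <!ENTITY sitename "The Example Gazette">
  <!ATTLIST a
      rel    CDATA   #IMPLIED
      target NMTOKEN "_self"> <!-- a default attribute value -->

A processor which reads this can expand &sitename;, supply target="_self" on any <a> element which omits the attribute, and normalize NMTOKEN attribute values (discarding leading and trailing spaces and collapsing internal runs); one which does not read it simply leaves the reference unexpanded and the default unapplied, again without any fatal error.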

References:
http://www.w3.org/TR/REC-xml/#wf-entdeclared
http://www.w3.org/TR/REC-xml/#proc-types
http://www.w3.org/TR/REC-xml/#safe-behavior
http://www.w3.org/TR/REC-xml/#dt-vc (validity constraint definition)
http://www.w3.org/TR/REC-xml/#include-if-valid

regards,
Brett
