Ian Hickson wrote:
On Mon, 18 May 2009, Brett Zamir wrote:
Section 10.1, "Writing XHTML documents" observes: "According to the XML
specification, XML processors are not guaranteed to process the external
DTD subset referenced in the DOCTYPE."

While this is true, since no doubt the majority of web browsers are
already able to process external stylesheets or scripts, might the very
useful feature of external entity files be employed by XHTML 5 as a
stricter subset of XML (similar to how XML Namespaces re-annexed the
colon character), in order to allow this feature to work for XHTML
(to give access to HTML entities or other useful entities, for one, as
well as to enable a poor man's localization, etc.)?

While there are arguments on both sides of whether this is a good idea or
not, I think the more important concern in this case is whether we can
extend XML in this way. I think in practice we should leave this up to the
XML specs and their successors. I don't think it would be appropriate for
us to profile the XML spec in this way.


While it is not my purpose to extend the debate on external DTDs, I wanted to bring up the following points (prompted by a recent re-review of the spec), because they raise a few serious issues which I believe current browsers are failing at; if the browsers do not address them, claims of real XHTML 5 support (as with XHTML 1.* and plain XML support) would be unworkable. While I agree that any changes to XML itself should be up to the XML specs, from what I can now tell, closer adherence to the existing spec would solve most of the existing problems, so I wanted to share the points which I think could resolve most of the issues, if browsers make the required changes.

I was pleasantly surprised to find that the spec seems to recommend solutions which, I believe, avoid the more serious single-point-of-failure problems.

(The other complaints about DTDs, such as blocking cross-domain DTDs for security or to avoid denial-of-service attacks, could be treated as an optional matter if that, combined with adherence to the existing recommendations, satisfies those concerns, though I personally do not think the risk is comparable to that of including cross-domain scripts.)

So what follows is what I have gleaned from these various statements as applied to current browsers. I can provide specific citations, but I did not wish to expand this post unnecessarily (though I list references at the end).

The major issues which I think ought to be resolved by certain browsers, since their current behavior does not seem to be in accord with the XML spec and, as a result, creates interoperability problems:

1) Firefox and WebKit should not treat a missing entity as a single point of failure, as they do now (unless they switch to a validating parser which finds no declaration in the external file and the user is in validation mode), since such failures in a document with an external DTD are NOT well-formedness errors unless the document deliberately declares standalone="yes" (a minimal example follows below).

2) Internet Explorer, which in IE8 no longer seems to require that the document be completely described by the DTD as I believe it did earlier (though it will report errors if the document violates rules which are declared), should, per the spec, report validation errors only at user option (ideally, I would say, off by default, and activatable on a case-by-case as well as a preference basis). Being able to disable the option could also speed things up, and it would let the browser work with documents which fail validation. But this issue is not as serious as #1, since #1 prevents even valid documents from being viewed interoperably on the web.
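To make #1 concrete, here is a minimal sketch of the kind of document at issue (the file name entities.ent and the entity name are only placeholders):

  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  <!-- standalone is "no" (the default) and an external subset is declared,
       so an undeclared entity reference below is at most a validity error,
       not a well-formedness error -->
  <!DOCTYPE html SYSTEM "entities.ent">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head><title>Example</title></head>
    <body><p>&possiblyMissing; and the rest of the paragraph.</p></body>
  </html>

A non-validating parser which chooses not to read entities.ent should still display the rest of the document (leaving the reference unexpanded at worst) rather than aborting with a fatal error.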

If these issues are addressed by those aiming for compliance, the only disadvantages which will remain (and which are inherent in XML's allowing validating and non-validating parsers to co-exist) are those described in http://www.w3.org/TR/REC-xml/#safe-behavior and http://www.w3.org/TR/REC-xml/#proc-types, namely that:

1) Some entity-related /well-formedness/ errors (e.g., an entity which is used but never defined) will go undetected by a non-validating parser, since it does not need to load the entity replacement. This is not a big problem, since a document author should presumably have checked, with an application which does perform external entity substitution, that their entities integrate properly with the text. It is less important that they check for /validation/ errors, since, as mentioned above, those need only be reported optionally (the external-subset sketch below shows the kind of declarations such a parser may skip).

2) The application may not be notified by its processor of, e.g., entity replacement values, if the processor is non-validating (though non-validating processors may also make such replacements). But since these cases are, as mentioned above, not well-formedness errors, there is no single point of failure here either; there may be some missing content, but it is still indicated by the entity reference left in the displayed output.

3) A few validity issues, such as duplicate declarations (which can include attribute defaults), can lead to undefined behavior (though, given that validation is only optional even for validating applications, it seems all applications will have to deal with this).
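For instance, the external subset in the sketch above might contain nothing more than the following (again, the names are placeholders):

  <!-- entities.ent: declarations a processor which reads the external
       subset would apply -->
  <!ENTITY possiblyMissing "some replacement text">
  <!ENTITY chapter1 SYSTEM "chapter1.xml"> <!-- an external parsed entity -->

A processor which skips this file cannot detect that the replacement text (or the content of chapter1.xml) is itself ill-formed, and cannot hand the replacement values to the application; it can only report that it recognized, but did not read, the entities, which is exactly the safe, non-fatal behavior described above.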

In other words, as I read the spec, users going from one browser to another will not face problems unless:

1) They visit invalid documents with the option to validate turned on (validation is only supposed to be an option) and expect other browsers to report the same errors. This is not a big issue, since a document which declares its validity constraints and then breaks them is basically asking for trouble, and even then the user is supposed to have the option to view the document without validation (a small example follows below).

2) They expect to see the entity replacement text. At least this is not a single point of failure, and in many cases, such as when entities are merely used to represent symbols, the text can be read in full without any disruption in the document flow. Of course, performing the replacements would be even better, and doing so does not require supporting validation.
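As an illustration of the first point, a document such as the following (a contrived sketch) is well-formed but invalid; only a user who has opted into validation should ever be shown an error for it, and every browser should still be able to display it:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE list [
    <!ELEMENT list (item+)>
    <!ELEMENT item (#PCDATA)>
  ]>
  <list>
    <wrong>violates the declared content model of "list"</wrong>
  </list>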

There are also the following optional issues which browsers might wish to consider (though if these are not implemented, the above fixes alone would address the most serious problems):

1) Since even a non-validating processor is supposed to inform the application when it has recognized, but not read, an entity (i.e., when it does not replace the reference with content found in an external DTD), a browser like Opera (the only one which, as far as I can tell, does not report such cases, even though it correctly avoids a single point of failure) might, if not implementing #2 below, wish to consider doing so; a compliant processor is at least supposed to report these cases to the application, to do with as it sees fit. Admittedly there is no obligation on the application to surface them, and in any case such reporting is not to become a single point of failure. But it still might be nice to distinguish the display of entities which were not found from deliberately escaped ones (e.g., &myEnt; produced by a missing entity currently appears the same in Opera, except in source view, as a deliberately escaped &amp;myEnt;).

2) Opera, Firefox, and WebKit (after the latter two fix the more serious issue mentioned above) might also wish to consider expanding their XML support for their users to:

a) show a link which optionally expands each external parsed entity reference (or other entities), if they do not do the following;

b) build on a non-validating parser to perform automatic entity and default attribute value replacement, and attribute value normalization, using an external DTD (at least a same-domain one). The XML spec only warns against relying on this so that an application stays free to switch between non-validating parsers, which may or may not take these actions; this does not hurt interoperability for users (it only improves it), so even without any desire to support validation, browsers could still offer entity replacement and the like (see the sketch after this list);

c) implement a validating parser which can do entity and default attribute value replacement and attribute value normalization from an external DTD, and which can also validate the document at the user's discretion. This should not slow things down for the user, since the spec itself indicates that reporting of validation errors is required only "at user option". It would give the user the best of both worlds: the opportunity to fully read XML/XHTML files online (without any requirement to pay a validation performance cost), and, if they are, for example, a document author, the choice to take a client-side performance hit to check validity. Of course, the external files need to be loaded in either case in order to perform the replacements, but the document author does NOT need to provide full DTD validation in the external DTD, so users will not be forced to download DTDs reflecting the whole document structure unless the document author wishes to reference such files. Indeed, authors might be encouraged not to include such content in their DTDs (performing validation offline) so that they and their users can reduce bandwidth, unless their purpose is to show the validation transparently (though DTD validation is of course not very strong).
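As a rough sketch of what 2b/2c would pick up from an external subset (the file, entity, and attribute names here are only illustrative):

  <!-- shared.ent: declarations usable even without full validation -->
  <!ENTITY sitename "The Example Gazette">
  <!ATTLIST a
      rel    CDATA   #IMPLIED
      target NMTOKEN "_self"> <!-- a default attribute value -->

A processor which reads this can expand &sitename;, supply target="_self" on any <a> element which omits the attribute, and normalize NMTOKEN attribute values (discarding leading and trailing spaces and collapsing internal runs); one which does not read it simply leaves the reference unexpanded and the default unapplied, again without any fatal error.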

References:
http://www.w3.org/TR/REC-xml/#wf-entdeclared
http://www.w3.org/TR/REC-xml/#proc-types
http://www.w3.org/TR/REC-xml/#safe-behavior
http://www.w3.org/TR/REC-xml/#dt-vc (validity constraint definition)
http://www.w3.org/TR/REC-xml/#include-if-valid

regards,
Brett
