From: "Jon Hanna" <[EMAIL PROTECTED]>
> > If this is
> > different, then it is not XML but a derived language (for example HTML or
> > SGML which are using more "relaxed" syntaxes).
>
> XML is derived from SGML, not the other way around. Still doesn't matter.

I did not say that, even if my sentence may have suggested it. Of course XML was born on the ground of SGML and its HTML application, but it now contains enough differences that it can no longer be considered an application of SGML: it is both a subset and a superset of SGML (XML allows things that are forbidden in SGML, and forbids things that are completely valid in SGML).

Additionally, the DTD syntax profile used in XML is very limited compared to SGML, and even this DTD syntax is not enough to represent in SGML XML features like namespaces (in XML, namespace prefixes can be freely substituted without requiring a new DTD, and are resolved as URIs instead of being part of the element or attribute names). Naming conventions in XML are based on two orthogonal dimensions, unlike HTML and SGML, which use a single namespace.

Finally, DTDs are being deprecated in XML, because they cannot correctly represent the semantics of allowed attributes, or even the allowed content models (so an XML document may validate against a DTD where it would not validate against a schema defined more precisely in XSD). Nearly all the DTDs I have seen for XML, HTML and SGML contain important comments that cannot be represented in a parsable way.

OK, I used the term DOM instead of InfoSet, but what I said was a "DOM-like" data representation (meaning the InfoSet, if this is what is used to represent the document). I won't discuss the case of element names or attribute names, which are by nature constrained by XML datatypes and do not represent arbitrary Unicode text. But CDATA sections, attribute values (in non-validating parsers) and anonymous text nodes are where the handling of leading and trailing whitespace, as well as of whitespace sequences, causes problems. This is clearly NOT markup but plain-text data, which may or may not be constrained by datatype facets, without even the need to specify a special xml:space attribute in the markup of the document itself.

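To make the point about datatype facets concrete, here is a minimal sketch of the three values of the XSD whiteSpace facet (preserve, replace, collapse) applied to a SPACE + combining accent sequence. The function name apply_whitespace_facet and the sample string are my own, chosen for illustration only; this is not code from any schema processor.

```python
import re


def apply_whitespace_facet(value: str, facet: str) -> str:
    """Apply an XSD-style whiteSpace facet to a lexical value (illustrative helper)."""
    if facet == "preserve":
        # 'preserve': the value is left untouched.
        return value
    # 'replace': every TAB, LF and CR becomes one SPACE.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # 'collapse': after 'replace', squeeze runs of SPACE and trim both ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown whiteSpace facet: {facet!r}")


# SPACE (U+0020) + COMBINING ACUTE ACCENT (U+0301): a spacing diacritic
# encoded as a combining sequence on a space.
spacing_acute = " \u0301"

preserved = apply_whitespace_facet(spacing_acute, "preserve")
collapsed = apply_whitespace_facet(spacing_acute, "collapse")

print([f"U+{ord(c):04X}" for c in preserved])  # ['U+0020', 'U+0301']
print([f"U+{ord(c):04X}" for c in collapsed])  # ['U+0301']
```

With "collapse" the base SPACE disappears and the combining accent is left without its base, so it will attach to whatever text the application places before it.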
As validating a document against its definition is an optional step when processing XML, normalization of whitespace sequences occurs only if the schema is known. In the case of standardized schemas, like XHTML, it becomes mandatory, and there is no way to bypass this rule, as any client could assume and load the corresponding schema and preprocess the DOM-like data of the parsed document to create data that no longer exposes unnormalized whitespace. So the behavior of spaces must be assumed by authors, who cannot predict whether the XML parser will validate the parsed document or not.

It is clearly not a rendering issue in fonts, XSLT processors or stylesheets. I see absolutely no place where an XML author can create a valid XML schema instance that will work with parsers if the author wants to use SPACE+diacritic sequences in the document. The only way to bypass this behavior safely is to use unparsed entities to represent the leading SPACE, or the whole combining sequence.

It is really a shame that there is no "XML-safe" base character in Unicode to represent leading spacing diacritics in actual documents (in HTML, XML or SGML, or in other rich-text formats such as TeX, RTF, or proprietary formats like MS-Doc, or PDF, which already can and do use Unicode as their now preferred encoding). Ignoring the huge number of applications that assign this role to spaces is therefore a critical caveat, as such rules cannot be changed easily.
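A small sketch of how the same attribute value can come out differently depending on what the parser knows about its declaration. It uses Python's standard xml.etree.ElementTree (a non-validating, expat-based parser), and a DTD internal subset rather than an XSD, since the standard library has no schema validator. The element name "note", the attribute name "label" and the ATTLIST declaration are hypothetical, invented for this example.

```python
import xml.etree.ElementTree as ET

# An attribute whose value is SPACE + COMBINING ACUTE ACCENT (U+0301).
BODY = '<note label=" \u0301"/>'

# The same element, preceded by a hypothetical internal DTD subset that
# declares the attribute with a tokenized (non-CDATA) type.
DECLARED = '<!DOCTYPE note [<!ATTLIST note label NMTOKEN #IMPLIED>]>' + BODY

# No declaration known: the attribute is treated as CDATA and the
# leading SPACE is kept.
undeclared_value = ET.fromstring(BODY).get("label")

# Declaration known: attribute-value normalization for non-CDATA types
# discards leading and trailing spaces, so the accent loses its base.
declared_value = ET.fromstring(DECLARED).get("label")

for name, value in (("undeclared", undeclared_value),
                    ("declared", declared_value)):
    print(name, [f"U+{ord(c):04X}" for c in value])
```

The element markup is byte-for-byte identical in both cases; only the parser's knowledge of the declaration differs, which is exactly what an author cannot control.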

