On 16 Nov 2008, at 00:20, Matt Mullenweg wrote:
Geoffrey Sneddon wrote:
I can see no way to improve the XML export without breaking
backwards compatibility as WXR is so far from XML. It is also
impossible to change the version in the URI you use as a namespace,
as the current importer attempts to parse even unknown major versions.
How would making it better-formed break backward compatibility, in
practice?
Looking more closely at what's done, it seems anything that I thought
would be broken is already broken (try exporting custom fields with
both name and value set as "This < foo & bar", this doesn't round-trip
and causes the export to not be XML). However, if we removed the CDATA
sections (which are undesirable, as there is no way for one to contain
"]]>" as a literal string, whereas outside of a CDATA section it can
simply be written as "]]&gt;"), then existing versions of WP would break
on the input in places: where CDATA is used, the importer seems to rely
on CDATA for doing all escaping so that it doesn't have to do any parsing
of entities, which is another violation of the XML spec. Just bumping the version
doesn't help (despite the comment saying it's there for when we might
break compat.) as the current importer completely ignores the version
number (and shipping something that starts caring doesn't fix the
backwards compat. issue, as you still have the millions of copies of
WP that have already been shipped with support for WXR) - regardless,
bumping the version in such a case would seem extremely kludgy as the
format would be entirely compatible with itself before, it would just
be working around a bug in the current de-facto implementation. Either
backwards compatibility has to be broken, or WXR has to admit to not
being XML.
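
To make the CDATA problem above concrete, here is a minimal Python sketch.
It is not WP code: the meta_value element name, the sample string and the
parses_to helper are illustrative assumptions only. It compares entity
escaping with naive CDATA wrapping for a value that contains "]]>", using a
conforming XML parser to check the result.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical custom-field value containing "]]>", "<" and "&".
value = "This ]]> foo < bar & baz"

# Entity escaping can represent any string of legal XML characters.
escaped_doc = "<meta_value>" + escape(value) + "</meta_value>"

# Naive CDATA wrapping: the section ends at the first "]]>" inside the
# value, and the rest of the value is then read as markup.
naive_cdata_doc = "<meta_value><![CDATA[" + value + "]]></meta_value>"


def parses_to(doc):
    """Return the element text, or None if doc is not well-formed XML."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


print("escaped round-trips:", parses_to(escaped_doc) == value)
print("naive CDATA is well-formed:", parses_to(naive_cdata_doc) is not None)
```

An importer that only understands the CDATA form cannot read the escaped
form, which is exactly the compatibility problem described above.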
We can be liberal in what we accept, conservative in what we output.
If you want to follow Postel's Law, then XML certainly isn't what you
want. XML is absolutely clear that any errors should be fatal. We
can't be liberal in what we accept (and that means disallowing some
edge case backups that are currently supported such as the above
example).
Any patches to improve our output are always welcome - I know we're
far from perfect in the content we output sometimes but that doesn't
mean we shouldn't strive to be.
Is there any interest in moving over to a fully fledged XML serializer
(which would at the very least mean any XML conformance bug would be
in one place)? This could be used not only for WXR but also for
generic RSS/Atom. <http://hsivonen.iki.fi/producing-xml/> covers most
of the advantages for using a serializer.
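
As a rough illustration of what using a serializer buys you, here is a
minimal Python sketch using the standard library's ElementTree rather than
anything WP ships. The postmeta, meta_key and meta_value element names and
the serialize_postmeta helper are assumptions made for the example, not the
real WXR vocabulary. All escaping happens inside the serializer, so the
calling code never concatenates markup by hand.

```python
import xml.etree.ElementTree as ET


def serialize_postmeta(key, value):
    """Build a small element tree and let the serializer do the escaping."""
    # Element names are illustrative, not taken from the WXR exporter.
    meta = ET.Element("postmeta")
    ET.SubElement(meta, "meta_key").text = key
    ET.SubElement(meta, "meta_value").text = value
    return ET.tostring(meta, encoding="unicode")


# The same kind of edge case as the custom-field example.
tricky = "This < foo & bar ]]> baz"
doc = serialize_postmeta(tricky, tricky)

# Round-trip through a conforming parser to confirm nothing was lost.
parsed = ET.fromstring(doc)
print(parsed.find("meta_key").text == tricky)
print(parsed.find("meta_value").text == tricky)
```

Any conformance bug would then live in the serializer alone, rather than in
every template that prints angle brackets.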
The importer should be easier - WP already has a built-in RSS parser
(in the form of MagpieRSS, which doesn't comply with XML either,
especially the Namespaces in XML spec, but is a lot lot lot closer), and
I am still unable to think of any reason why it wasn't used initially
(which would have avoided most of the issue that now arises), as it
would involve a lot less code. See trac ticket #7400 for this.
BTW Movable Type has a well-working WXR importer (or so they claim)
so obviously it isn't impossible to make something else work from it.
I guess provided you ignore edge cases you can pretty much get away
with it from an import POV - from an exporter POV you would need to very
carefully make sure you follow the subset of XML-like byte-streams
that the importer supports.
Also, you said back in August 2007 (<http://ma.tt/2007/08/movabletype-4-vs-wordpress-22/#div-comment-424413>):
However I still do plan to get a spec doc up for it one day. If that
were a condition of them [MT] supporting it I'd happily prioritize it.
When I brought this up on wp-hackers in July (2008), I received the
reply (from Otto):
If you want it documented, then look at it and write a document for
it.
Either there's a disconnect, or there's a change of plan. Is it still
the case that there is a spec coming? I know the main reason why
Habari to this day does not have a WXR importer is because it is, as
it stands, an undefined XML-look-alike byte-stream. This makes it very
hard to support any export that isn't XML, and on the principle of
trying to implement edge-cases first (thereby making all more normal
cases work fine) supporting something that isn't XML is hard. If it
were the case that it was possible to be reasonably certain that the
output will be XML (which, IMO, it isn't) then I would be willing to
write a spec sometime (as then it could be done in terms of a DOM, and
not in terms of a byte-stream, massively simplifying it), though it
would probably be unlikely to happen until March/April '09 (I would
guess it would probably be around a day's work).
That said, I think having some standardized format would be better,
and something based upon RSS probably isn't a good way to do that,
mainly because RSS is uselessly vague and WXR would almost certainly
have to be defined as an extension of a subset of it (e.g., is the
title element text or HTML? Different implementations do different
things here, some using heuristics to try to determine the answer),
but also because Atom already has far more of what is needed
standardized.
--
Geoffrey Sneddon
<http://gsnedders.com/>