Re: [automattic] False Marketing

Geoffrey Sneddon Fri, 12 Dec 2008 11:00:52 -0800


On 16 Nov 2008, at 00:20, Matt Mullenweg wrote:

Geoffrey Sneddon wrote:
I can see no way to improve the XML export without breaking backwards compatibility as WXR is so far from XML. It is also impossible to change the version in the URI you use as a namespace, as currently even unknown major versions are attempted to be parsed.
How would making it better-formed break backward compatibility, in practice?

Looking more closely at what's done, it seems anything that I thought would be broken is already broken (try exporting custom fields with both name and value set as "This < foo & bar"; this doesn't round-trip and causes the export to not be XML). However, if we removed the CDATA sections (which are undesirable as there is no way to have them containing "]]>" as a literal string (outside of a CDATA section "]]>" works fine)), then existing versions of WP would break on the input in places (where CDATA is used, the importer seems to rely on CDATA for doing all escaping so it doesn't have to do any parsing of entities, another violation of the XML spec). Just bumping the version doesn't help (despite the comment saying it's there for when we might break compat.) as the current importer completely ignores the version number (and shipping something that starts caring doesn't fix the backwards compat. issue, as you still have the millions of copies of WP that have already been shipped with support for WXR). Regardless, bumping the version in such a case would seem extremely kludgy as the format would be entirely compatible with itself before; it would just be working around a bug in the current de-facto implementation. Either backwards compatibility has to be broken, or WXR has to admit to not being XML.
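
To make the CDATA point concrete, here is a minimal sketch in Python (not the WP exporter, which is PHP). The element name `value` and both helper functions are hypothetical, made up purely for illustration; the only assumption is that the exporter wraps values in CDATA without inspecting them. It compares that approach with plain entity escaping, which writes the troublesome string as "]]&gt;".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def naive_cdata(value):
    # Hypothetical exporter behaviour: wrap the raw value in a CDATA
    # section without looking at what it contains.
    return "<value><![CDATA[" + value + "]]></value>"


def escaped(value):
    # Entity-escape &, < and > instead of relying on CDATA.
    return "<value>" + escape(value) + "</value>"


def round_trips(document, expected):
    # True only if a conforming XML parser accepts the document and
    # hands back exactly the string that was exported.
    try:
        return ET.fromstring(document).text == expected
    except ET.ParseError:
        return False


tricky = "a ]]> b"
print(round_trips(naive_cdata(tricky), tricky))  # CDATA ends early
print(round_trips(escaped(tricky), tricky))      # escaping survives
```

The first call reports a failed round-trip and the second a successful one; the same escaping also carries "This < foo & bar" through intact.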

We can be liberal in what we accept, conservative in what we output.

If you want to follow Postel's Law, then XML certainly isn't what you want. XML is absolutely clear that any well-formedness errors should be fatal. We can't be liberal in what we accept (and that means disallowing some edge case backups that are currently supported such as the above example).
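
As a rough illustration of the difference, the sketch below contrasts a strict XML parse with a hypothetical "liberal" importer that pattern-matches on the text. Neither function is taken from WP; the element names and the regular expression are assumptions chosen only to show the behaviour.

```python
import re
import xml.etree.ElementTree as ET

# The custom field example from above, exported without any escaping.
BROKEN = "<item><title>This < foo & bar</title></item>"


def strict_title(document):
    # A conforming XML processor stops at the first well-formedness
    # error; ElementTree raises ParseError rather than guessing.
    return ET.fromstring(document).findtext("title")


def lenient_title(document):
    # Hypothetical liberal importer: treats the input as a byte-stream
    # and never parses it as XML at all.
    match = re.search(r"<title>(.*?)</title>", document, re.S)
    return match.group(1) if match else None


print(lenient_title(BROKEN))
try:
    strict_title(BROKEN)
except ET.ParseError as error:
    print("fatal:", error)
```

The lenient version happily returns the title, which is exactly why input like this cannot keep being accepted once the format is required to be XML.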

Any patches to improve our output are always welcome - I know we're far from perfect in the content we output sometimes but that doesn't mean we shouldn't strive to be.

Is there any interest in moving over to a fully fledged XML serializer (which would at the very least mean any XML conformance bug would be in one place)? This could be used not only for WXR but also for generic RSS/Atom. <http://hsivonen.iki.fi/producing-xml/> covers most of the advantages for using a serializer.
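
A minimal sketch of what producing the export through a serializer looks like, using Python's standard ElementTree in place of whatever PHP library WP would actually use. The element names (`item`, `postmeta`, `meta_key`, `meta_value`) and the `build_item` helper are hypothetical placeholders, not the real namespaced WXR vocabulary.

```python
import xml.etree.ElementTree as ET


def build_item(title, meta):
    # Build a tree and let the serializer do all escaping, so no
    # calling code ever concatenates markup by hand.
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    for key, value in meta.items():
        field = ET.SubElement(item, "postmeta")
        ET.SubElement(field, "meta_key").text = key
        ET.SubElement(field, "meta_value").text = value
    return ET.tostring(item, encoding="unicode")


document = build_item("a ]]> b", {"This < foo & bar": "This < foo & bar"})
parsed = ET.fromstring(document)
print(parsed.findtext("title"))
print(parsed.findtext("postmeta/meta_key"))
```

Because every string passes through one code path, an escaping bug would have a single place to be fixed, and the output parses back to the original values.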

The importer should be easier: WP already has a built-in RSS parser (in the form of MagpieRSS, which doesn't comply with XML either (esp. the Namespaces in XML spec), but is a lot, lot closer), and I am still unable to think of any reason why it wasn't used initially (which would avoid most of the issues that now arise) as it would involve a lot less code. See trac ticket #7400 for this.

BTW Movable Type has a well-working WXR importer (or so they claim) so obviously it isn't impossible to make something else work from it.

I guess provided you ignore edge cases you can pretty much get away with it from an import POV; from an exporter POV you would need to very carefully make sure you follow the subset of XML-like byte-streams that the importer supports.

Also, you said back in August 2007 (<http://ma.tt/2007/08/movabletype-4-vs-wordpress-22/#div-comment-424413>):

However I still do plan to get a spec doc up for it one day. If that were a condition of them [MT] supporting it I’d happily prioritize it.

When I brought this up on wp-hackers in July (2008), I received the reply (from Otto):

If you want it documented, then look at it and write a document for it.

Either there's a disconnect, or there's a change of plan. Is it still the case that there is a spec coming? I know the main reason why Habari to this day does not have a WXR importer is because it is, as it stands, an undefined XML-look-alike byte-stream. This makes it very hard to support any export that isn't XML, and on the principle of trying to implement edge-cases first (thereby making all more normal cases work fine) supporting something that isn't XML is hard. If it were the case that it was possible to be reasonably certain that the output will be XML (which, IMO, it isn't) then I would be willing to write a spec sometime (as then it could be done in terms of a DOM, and not in terms of a byte-stream, massively simplifying it), though it would probably be unlikely to happen until March/April '09 (I would guess it would probably be around a day's work).

That said, I think having some standardized format would be better, and something based upon RSS probably isn't a good way to do that (mainly because RSS is uselessly vague, and WXR would almost certainly have to be defined as an extension of a subset of that (e.g., is the title element text or HTML? Different implementations do different things here, some performing heuristics to try and determine the answer), but also because Atom has far more of what is needed already standardized).



--
Geoffrey Sneddon
<http://gsnedders.com/>
