* Daniel Friesen <[email protected]> [Thu, 10 Feb 2011 01:37:18 -0800]:
> I've been experimenting with a mixed xml/html based template syntax for
> skinning[1].
> However I've been having issues with the parsing of it.
>
> - DOMDocument::loadHTML throws warnings, and when I output it, it strips
> out namespaces, turning <mw:foo> into <foo>
> - SimpleHTMLDOM was the most promising, in fact my current experiments
> got very far with it, however when I got to the need to insert a node
> before/after an element it completely messed up. I'm also not optimistic
> about its performance, since there are no dom operations and its "insert"
> is essentially "concatenate some html with the outertext and set
> outertext to it"
> - html5lib choked on namespaces, presumably other than its built-in
> handling of things like svg:.
> - phpQuery is just a wrapper around DOMDocument
> - tidy's plugin is supposed to support dom parsing, but that is not
> deployed on every server, and even people using tidy through mw might
> not be using the plugin since we support the executable as well. Not to
> mention tidy seemed to share the issues of stripping or choking on
> <mw:...> tags when it came to my editsection stuff. So even the idea of
> piping through tidy then using loadXML on it is out.
> - wiseparser, well I couldn't even get that to execute.
> - XML_HTMLSax is so old and unmaintained I couldn't really get into
> looking at it.
>
>
> The requirements ideally are that it should support the normal html
> parsing we already have (ie: boolean attributes and quoteless attributes
> <div foo bar=baz>, perhaps the simple implicitly closed tags like <br>),
> but also support parsing tags and attributes with mw: in them, in other
> words XML namespaces.
>
> Is there anyone willing to help out building a parser for it?
> Possibilities could be custom parsing directly to dom, custom parsing
> and calling a SAX-like api, or at its simplest a light parser that
> parses the html and outputs xml we can parse with loadXML instead (I
> believe the issue in DOMDocument is its html processing, not issues with
> namespaces), that would end up being a potential tidy replacement. Tidy
> can't be used in this case because it too messes up namespaced stuff.
>
> [1]:
> http://www.mediawiki.org/wiki/User:Dantman/Skinning_system#xml.2Fhtml_template_syntax
>

Why not just use XMLReader / XMLWriter, as WikiImporter does? Are there
performance concerns? It uses libxml, so shouldn't that be good enough?

Dmitriy
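
To make the "light parser that parses the html and outputs xml" option from the quoted message concrete, here is a rough sketch. It is written in Python rather than PHP purely for brevity, so it is not a drop-in for MediaWiki. The namespace URI bound to mw:, the `<root>` wrapper element, the short list of void tags, and the names `HtmlToXml` and `html_to_xml` are all assumptions made for illustration; nothing in the thread defines them. It rewrites boolean and quoteless attributes and implicitly closed tags into well-formed XML, leaving mw: prefixes alone so a namespace-aware XML parser can take over.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the mw: prefix (not defined in the thread).
MW_NS = "http://example.org/mw"

# Tags that HTML closes implicitly; a small illustrative subset only.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Re-serialize lenient HTML as well-formed XML, keeping mw: prefixes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, self_close):
        parts = [tag]
        for name, value in attrs:
            # A boolean attribute such as <div foo> arrives with value None;
            # it is written out as foo="foo".
            parts.append(name + "=" + quoteattr(name if value is None else value))
        self.out.append("<" + " ".join(parts) + ("/>" if self_close else ">"))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, tag in VOID_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        # Void tags were already emitted self-closed, so skip a stray </br>.
        if tag not in VOID_TAGS:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(escape(data))


def html_to_xml(fragment):
    """Wrap the converted fragment in a root element that declares mw:."""
    parser = HtmlToXml()
    parser.feed(fragment)
    parser.close()
    return '<root xmlns:mw="' + MW_NS + '">' + "".join(parser.out) + "</root>"


if __name__ == "__main__":
    xml = html_to_xml("<div foo bar=baz><mw:foo mw:id=1>a &amp; b<br></mw:foo></div>")
    print(xml)
    # The result parses with an ordinary namespace-aware XML parser.
    node = ET.fromstring(xml).find(".//{%s}foo" % MW_NS)
    print(node.get("{%s}id" % MW_NS))
```

This sketch does not balance unclosed tags such as a missing </p>, and it drops comments and doctypes, so real templates would need more work before the output could be handed to loadXML or XMLReader.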
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
