Dmitry Dzhus wrote:
> My aim is to apply XSLT to some HTML document (which may be broken
> just a little).
>
> I'm using standard Python libxml2/libxslt bindings.
>
> My code is:
>
> mf_extract = libxslt.parseStylesheetFile("mf-extract.xsl")
>
> doc = libxml2.readHtmlFile(url, None, libxml2.HTML_PARSE_RECOVER)
>
> mf_extract.applyStylesheet(doc, None)
>
> Applying the stylesheet produces output as if the `doc` tree had no
> content at all. Using `readFile` instead of `readHtmlFile` works as
> expected.
>
> I tried `print doc` after both `readHtmlFile` and `readFile` and
> noticed that, when the input document is well-formed, the output
> differs only in the XML declaration at the very beginning.
>
> As I understand it (and as `document.type` indicates), `readFile` and
> `readHtmlFile` produce different kinds of documents, `document_xml`
> and `document_html`, and applying XSLT is only possible with a
> `document_xml` one. Is there any way to convert `document_html` to
> `document_xml`?
Consider using lxml.
http://codespeak.net/lxml/
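Untested:

import lxml.etree as et

parser = et.HTMLParser()
doc = et.parse(url, parser)

# The HTML parser returns elements without a namespace. Move them into
# the XHTML namespace first, in case the stylesheet matches XHTML names.
for el in doc.iter("*"):
    if '{' not in el.tag:
        el.tag = "{http://www.w3.org/1999/xhtml}" + el.tag

# Apply the stylesheet after the rewrite and keep the result tree.
result = doc.xslt(et.parse("mf-extract.xsl"))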
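The tag rewrite above is probably the important part. If the stylesheet matches elements in the XHTML namespace (an assumption, since mf-extract.xsl is not shown here), then a tree whose elements have no namespace gives it nothing to match. A minimal sketch of that effect follows. It uses the standard library's xml.etree.ElementTree in place of lxml and a namespace-qualified findall() in place of XSLT matching. The sample markup and the helper name to_xhtml_namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def to_xhtml_namespace(root):
    """Move every un-namespaced element under root into the XHTML namespace."""
    for el in root.iter():
        # Skip non-element nodes (non-string tags) and already-qualified tags.
        if isinstance(el.tag, str) and "{" not in el.tag:
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)
    return root


# Hypothetical input standing in for what an HTML parser hands back:
# plain element names, no namespace.
root = ET.fromstring("<html><body><p>one</p><p>two</p></body></html>")

# A lookup that asks for XHTML-qualified <p> elements.
qualified_p = ".//{%s}p" % XHTML_NS

before = len(root.findall(qualified_p))  # nothing is in the namespace yet
to_xhtml_namespace(root)
after = len(root.findall(qualified_p))   # both paragraphs are now found

print(before, after)
```

The same reasoning would explain why readFile worked on a well-formed XHTML file: the XML parser keeps the xmlns declaration, so the element names already carry the namespace.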
Stefan