Dmitry Dzhus wrote:
> My aim is to apply XSLT to some HTML document (which may be broken
> just a little). 
> 
> I'm using standard Python libxml2/libxslt bindings.
> 
> My code is:
> 
>    mf_extract = libxslt.parseStylesheetFile("mf-extract.xsl")
>    
>    doc = libxml2.readHtmlFile(url, None, libxml2.HTML_PARSE_RECOVER)
>    
>    mf_extract.applyStylesheet(doc, None)
> 
> Applying XSLT results as if there were no content in `doc` tree at
> all. Using `readFile` instead of `readHtmlFile` works fine as
> expected.
> 
> I tried to `print doc` after using both `readHtmlFile` and `readFile`
> and noticed that, given the input document is well-formed, the output
> differs only in XML declaration at the very beginning.
> 
> As I understand (and `document.type` indicates), using `readFile` and
> `readHtmlFile` results in different kinds of documents --
> `document_xml` and `document_html` -- while applying XSLT is only
> possible with `document_xml` one. Is there any way to convert
> `document_html` to `document_xml`?


Consider using lxml.

http://codespeak.net/lxml/

untested:

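   import lxml.etree as et

   # The HTML parser recovers from broken markup.
   parser = et.HTMLParser()
   doc = et.parse(url, parser)

   # HTML elements are parsed without a namespace. If the stylesheet
   # matches XHTML elements, move them into that namespace first.
   for el in doc.iter("*"):
       if '{' not in el.tag:
           el.tag = "{http://www.w3.org/1999/xhtml}" + el.tag

   result = doc.xslt(et.parse("mf-extract.xsl"))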
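For something you can run without a URL or a stylesheet file, here is a minimal sketch of the same idea. The input HTML, the stylesheet and the `extract_items` helper are made up for illustration; only `HTMLParser`, `fromstring` and `XSLT` come from lxml. The stylesheet matches un-namespaced elements, which is what the HTML parser produces.

```python
from lxml import etree

# Hypothetical, slightly broken input: the <li> tags are never closed.
BROKEN_HTML = "<html><body><ul><li>one<li>two</ul></body></html>"

# Hypothetical stylesheet that matches elements without a namespace.
STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <items><xsl:apply-templates select="//li"/></items>
  </xsl:template>
  <xsl:template match="li">
    <item><xsl:value-of select="normalize-space(.)"/></item>
  </xsl:template>
</xsl:stylesheet>
"""


def extract_items(html_text):
    """Parse lenient HTML and apply the stylesheet to the resulting tree."""
    root = etree.fromstring(html_text, etree.HTMLParser())
    transform = etree.XSLT(etree.fromstring(STYLESHEET))
    return transform(root)


result = extract_items(BROKEN_HTML)
items = [el.text for el in result.getroot().iter("item")]
print(items)
```

If your stylesheet matches XHTML elements instead, the tree has to be moved into the XHTML namespace before the transform is applied.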

Stefan
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml
