[XML-SIG] Parsing malformed XHTML

Lars Kellogg-Stedman Fri, 19 May 2006 18:10:59 -0700

Hello all,

There is a document out there on the 'net that appears to be an XHTML document:


<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en"
  "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
  xmlns:v="urn:schemas-microsoft-com:vml">

Great, right?  Unfortunately, it's malformed in a number of ways
(mismatched tags, tag case problems, unescaped '&' in URLs, etc.).
Neither minidom.parseStream() nor
xml.dom.ext.reader.Sax2.Reader.fromStream() will parse it correctly:

  xml.sax._exceptions.SAXParseException: foo.html:2:0: syntax error

And even if one gets rid of the bogus doctype declaration, the rest of
the document just makes the parsers fall over:

  xml.sax._exceptions.SAXParseException: foo.html:14:53: not well-formed (invalid token)
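
The second failure is easy to reproduce in isolation. The following is
only a sketch: the snippet is a made-up stand-in for the real page, and
it assumes a current Python standard library, where minidom reports the
problem as xml.parsers.expat.ExpatError rather than the
SAXParseException shown above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for the real page: the '&' in the URL is not escaped.
SNIPPET = (
    '<html><body>'
    '<a href="http://example.com/?a=1&b=2">link</a>'
    '</body></html>'
)


def try_parse(text):
    """Return (document, None) on success or (None, message) on failure."""
    try:
        return minidom.parseString(text), None
    except ExpatError as exc:
        return None, str(exc)


doc, error = try_parse(SNIPPET)
print(error)  # the parser rejects the bare '&' inside the attribute value

# Escaping the ampersand is enough to make this particular snippet parse.
fixed_doc, fixed_error = try_parse(SNIPPET.replace("&", "&amp;"))
```

A strict XML parser stops at the first such error, so every problem in
the page would have to be repaired before a DOM could be built this way.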

My next thought was to parse this with
xml.dom.ext.reader.HtmlLib...but HtmlLib doesn't like the namespace
declarations:

  xml.dom.NamespaceErr: Invalid or illegal namespace operation

I need to parse this document into a DOM, make some changes, and then
spit back out the modified file as (X?)HTML (ideally well-formed).  Am
I going to be able to do this with PyXML?  If not, I'd love to hear
your suggestions for the appropriate tools.
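
The intended round trip (parse, modify, serialize) can be sketched on a
well-formed fragment. This is an illustration only: the fragment, the
retarget_links() helper, and the example.org URL are all made up, and it
uses the standard-library minidom rather than any PyXML reader.

```python
from xml.dom import minidom

# Hypothetical well-formed fragment; the real page would need repair first.
WELL_FORMED = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><a href="http://example.com/?a=1&amp;b=2">link</a></body></html>'
)


def retarget_links(text, new_href):
    """Parse text into a DOM, point every <a> at new_href, and serialize."""
    doc = minidom.parseString(text)
    for anchor in doc.getElementsByTagName("a"):
        anchor.setAttribute("href", new_href)
    # toxml() escapes '&' in attribute values, so the output stays well-formed.
    return doc.documentElement.toxml()


result = retarget_links(WELL_FORMED, "http://example.org/?x=1&y=2")
print(result)
```

The open question is how to get from the malformed input to a DOM on
which this kind of edit is possible.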

Thanks!

-- Lars

-- 
Lars Kellogg-Stedman <[EMAIL PROTECTED]>
