Re: [xml] Possible to get XHTML output from HTMLparser?

Martin (gzlist) Sat, 20 Dec 2008 04:26:16 -0800

On 19/12/2008, R. Steven Rainwater <[email protected]> wrote:
> I'm using libxml2 for an application that generates XHTML output. I've
>  recently needed to parse some nasty HTML tag soup input and incorporate
>  it into some of my pages. Libxml2's HTMLparser does a great job of
>  fixing up the bad HTML but it outputs HTML v4 markup. Is there any
>  existing function that will output XHTML markup from the HTMLparser?
>
>  ... I'm assuming I'd just need to walk the HTMLparser output
>  tree, closing empty elements, expanding stand-alone attributes, and
>  such. Looks like HTMLparser already fixes some things like making sure
>  attribute values are quoted.


Those are serialisation details that the tree doesn't care about.

In libxml2, htmlDoc objects *are* xmlDoc objects, so if you only care
about well-formedness, any of the normal XML functions will do. You will
need to walk the tree to set the correct namespace on all the
nodes, however.
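
A rough sketch of that tree walk, using lxml (a Python binding for
libxml2) rather than the C API discussed in this thread, so treat it as
an illustration of the idea and not as the libxml2 way of doing it. The
function name tag_soup_to_xhtml is made up for the example. It assumes
lxml is installed, and it only aims at well-formed output, not at
validity against any XHTML DTD.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"


def tag_soup_to_xhtml(soup: str) -> str:
    """Parse tag soup with the HTML parser, then serialise it as XML.

    Hypothetical helper for illustration only.
    """
    # etree.HTML() runs libxml2's recovering HTML parser and returns
    # the root <html> element of the repaired tree.
    root = etree.HTML(soup)

    # Walk the whole tree and move every element into the XHTML
    # namespace. Comments and processing instructions have a
    # non-string .tag, so they are skipped.
    for el in root.iter():
        if isinstance(el.tag, str) and not el.tag.startswith("{"):
            el.tag = "{%s}%s" % (XHTML_NS, el.tag)

    # Serialising with the XML method (not the HTML one) is what closes
    # empty elements and writes out minimised attributes in full.
    return etree.tostring(root, method="xml", encoding="unicode")


if __name__ == "__main__":
    print(tag_soup_to_xhtml("<p>one<br>two<p>three"))
```

The walk itself changes nothing except the namespace. The closing of
empty elements and the expansion of stand-alone attributes come from
choosing the XML serialiser for output.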

If you also care about validity according to a particular XHTML DTD,
you'd have to do considerable tree modifications to turn arbitrary tag
soup into something correct. Browsers have complex heuristics to, for
instance, make sanity out of form elements inside tables.

Martin
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
