Re: [xml] Parsing tag-soup HTML

Daniel Veillard Sun, 17 Jun 2007 07:18:41 -0700

On Sun, Jun 17, 2007 at 02:39:57PM +0100, Nick Kew wrote:
> I've been using libxml2 for some years to parse both XML and HTML
> in the context of Apache filter modules.  All these modules use the
> parseChunk API, which is the only reasonable option in the context
> of the Apache filter architecture.  My most widely-used libxml2-based
> module is mod_proxy_html, which serves to rewrite HTML links in a
> reverse proxy.
> 
> A FAQ arising in this context is why some pages get mangled.
> The straight answer is that they're hopelessly malformed tag-soup,
> and HTMLparser is somewhat less forgiving than mainstream browsers.
> Common examples include:
>   - Documents that start with a <meta ...>, followed by
>     <html>(normal contents)</html>
>   - <script> sections that are prematurely closed by things
>     like document.write("<p>foo</p>");
>   - Documents with multiple <html> or multiple <body> tags.
> 
> I have some hacks to error-correct for some of these: for example
> as described at
> http://bahumbug.wordpress.com/2006/10/12/mod_proxy_html-revisited/
> But now I'm looking at providing a systematically more forgiving
> parser as an option to my users.  That leaves me two options:
>   (1) Write a new tag-soup parser from scratch, and make the
>       choice of parser a configuration option for users.
>   (2) Work within your existing HTMLparser to make it (optionally)
>       more forgiving.
> The second is only realistically an option if I can feed back
> changes to the libxml2 codebase and not land myself with an
> unmaintainable branch.
> 
> So, what do you think?  Is this something the libxml2 project
> would like to see, or would you prefer to steer well clear?


  I'm not averse to adding a new HTML parsing option for 'tag soup',
but you would have to define clearly what the new parsing strategy is
before I (and others on this list) can say yes or no to that option.
So what would the 'tag soup' parser do that the current HTML parser
does not, and vice versa? If you can define this other than by an
accumulation of specific cases, then that's probably viable. But if
it's just an ever-growing list of individual preferences on a
case-by-case basis, I don't see how we could say yes to your selection
rather than to some other application's own set.
  Does that make sense?

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
http://mail.gnome.org/mailman/listinfo/xml
