On Sun, Jun 17, 2007 at 02:39:57PM +0100, Nick Kew wrote:
> I've been using libxml2 for some years to parse both XML and HTML
> in the context of Apache filter modules. All these modules use the
> parseChunk API, which is the only reasonable option in the context
> of the Apache filter architecture. My most widely-used libxml2-based
> module is mod_proxy_html, which serves to rewrite HTML links in a
> reverse proxy.
>
> A FAQ arising in this context is why some pages get mangled.
> The straight answer is that they're hopelessly malformed tag-soup,
> and HTMLparser is somewhat less forgiving than mainstream browsers.
> Common examples include:
> - Documents that start with a <meta ...>, followed by
> <html>(normal contents)</html>
> - <script> sections that are prematurely closed by things
> like document.write("<p>foo</p>");
> - Documents with multiple <html> or multiple <body> tags.
>
> I have some hacks to error-correct for some of these: for example
> as described at
> http://bahumbug.wordpress.com/2006/10/12/mod_proxy_html-revisited/
> But now I'm looking at providing a systematically more forgiving
> parser as an option to my users. That leaves me two options:
> (1) Write a new tag-soup parser from scratch, and make the
> choice of parser a configuration option for users.
> (2) Work within your existing HTMLparser to make it (optionally)
> more forgiving.
> The second is only realistically an option if I can feed back
> changes to the libxml2 codebase and not land myself with an
> unmaintainable branch.
>
> So, what do you think? Is this something the libxml2 project
> would like to see, or would you prefer to steer well clear?
I'm not averse to adding a new HTML parsing option for 'tag soup',
but you would have to define the new parsing strategy clearly
before I (and others on this list) can say yes or no to that option.
So what would the 'tag soup' parser do that the current HTML parser
does not, and vice versa? If you can define this as a general rule
rather than as an accumulation of specific cases, then it's probably
viable. But if it's just an ever-growing list of individual preferences
decided case by case, I don't see how we could say yes to your
selection rather than to some other application's own set.
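
To illustrate the difference between a rule and a list of cases, here is
a minimal sketch of one possible rule, taken from the <script> example
quoted above: treat everything inside <script> as raw text until the
literal "</script" closer, so that document.write("<p>foo</p>") no longer
ends the element early. The helper name find_script_end is hypothetical;
it is not part of libxml2 and says nothing about how HTMLparser works
today. It only shows the kind of definition that could be discussed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT a libxml2 function.
 * Returns the offset of the first "</script" (any letter case) in the
 * first len bytes of buf, or -1 if it does not occur. Any other "</"
 * sequence, such as "</p>" inside a string literal, is skipped. */
static long find_script_end(const char *buf, size_t len)
{
    static const char closer[] = "</script";
    const size_t clen = sizeof(closer) - 1;

    if (len < clen)
        return -1;
    for (size_t i = 0; i + clen <= len; i++) {
        size_t j = 0;
        /* Compare case-insensitively against the lowercase closer. */
        while (j < clen && tolower((unsigned char)buf[i + j]) == closer[j])
            j++;
        if (j == clen)
            return (long)i;
    }
    return -1;
}
```

With a chunked (push) interface the closer may be split across two
chunks, so a real implementation would have to hold back the last few
bytes of each chunk before deciding. The sketch assumes the whole script
body is already in one buffer.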
Does that make sense?
Daniel
--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
[EMAIL PROTECTED] | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/