Re: [xml] Parsing tag-soup HTML

Daniel Veillard Sun, 17 Jun 2007 08:42:16 -0700

On Sun, Jun 17, 2007 at 03:52:28PM +0100, Nick Kew wrote:
> On Sun, 17 Jun 2007 10:18:29 -0400
> Daniel Veillard <[EMAIL PROTECTED]> wrote:
> 
> > > So, what do you think?  Is this something the libxml2 project
> > > would like to see, or would you prefer to steer well clear?
> > 
> >   I'm not averse to adding a new HTML parsing option for 'tag soup',
> > but you would have to define clearly what the new parsing strategy is
> > before I (and others on this list) can say yes or no to that option.
> > So what would the 'tag soup' parser do that the current HTML parser
> > does not, and vice versa? If you can define this other than by an
> > accumulation of specific cases then that's probably viable, but if
> > it's just an ever-growing list of individual preferences on a
> > case-by-case basis, I don't see why we should say yes to your
> > selection rather than to some other application's own set.
> >   Makes sense?
> 
> Thanks for the quick response.
> 
> Yes, of course I didn't expect a straight "yes" to such a vague
> proposal.  My question concerned whether I should invest the time
> and effort to determine the details of how this should look in the
> context of HTMLparser.
> 
> I'll take your reply as a yes in principle, and dive into the code
> to think it through a little more.  If it looks promising, I'll
> come back to you with more concrete proposals.


 Coming back with some kind of definition of what tag soup parser
behaviour is would probably be more important than digging into libxml2
code. I am not sure we can emulate the behaviour of web browser parsers.
There is John Cowan's TagSoup, which is probably what most people will
think of in terms of implementation:

  http://ccil.org/~cowan/XML/tagsoup/

  "It does guarantee well-structured results: tags will wind up properly
   nested, default attributes will appear appropriately, and so on"

but also

  "For example, if the first tag is LI, it will supply the application
   with enclosing HTML, BODY, and UL tags."

which, I guess, would defeat your first example.
The real problem is to come back with a set of guarantees and
behaviour rules. Reading the slides linked from the end of that page
may help. I'm not sure it's what you want, but since you use the
same name, it should hopefully be close.
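
To illustrate what one such behaviour rule could look like when written
down, here is a minimal sketch of the "enclosing tags" guarantee quoted
above. It is in Python rather than libxml2's C, and it is neither
TagSoup's algorithm nor anything in HTMLparser: the REQUIRED_PARENT table
and the names implied_ancestors and wrap are made up for this example.

```python
# Hypothetical rule table: element -> parent it must appear inside.
# The entries are illustrative only; they are not taken from TagSoup
# or from libxml2's HTML parser.
REQUIRED_PARENT = {
    "li": "ul",
    "ul": "body",
    "p": "body",
    "body": "html",
    "td": "tr",
    "tr": "table",
    "table": "body",
}


def implied_ancestors(first_tag):
    """Return the enclosing tags to synthesize, outermost first."""
    chain = []
    tag = first_tag.lower()
    # Walk up the rule table until we reach a tag with no required parent.
    while tag in REQUIRED_PARENT:
        tag = REQUIRED_PARENT[tag]
        chain.append(tag)
    chain.reverse()
    return chain


def wrap(first_tag, content):
    """Serialize content inside first_tag plus its implied ancestors."""
    tag = first_tag.lower()
    ancestors = implied_ancestors(tag)
    opening = "".join(f"<{a}>" for a in ancestors)
    closing = "".join(f"</{a}>" for a in reversed(ancestors))
    return f"{opening}<{tag}>{content}</{tag}>{closing}"


print(implied_ancestors("LI"))  # ['html', 'body', 'ul']
print(wrap("LI", "item"))
```

A rule stated this way is a single table plus one traversal, not a list
of special cases, and a parser option that must not add the wrappers
would simply skip the traversal.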

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/