On Sun, Jun 17, 2007 at 03:52:28PM +0100, Nick Kew wrote:
> On Sun, 17 Jun 2007 10:18:29 -0400
> Daniel Veillard <[EMAIL PROTECTED]> wrote:
>
> > > So, what do you think? Is this something the libxml2 project
> > > would like to see, or would you prefer to steer well clear?
> >
> > I'm not adverse to adding a new HTML parsing option for 'tag soup'
> > but you would have to define clearly what is the new parsing strategy
> > before I (and others on this list) can say yes or no to that option.
> > So what would the 'tag soup' parser do that the current HTML parser
> > does not and vice-versa ? If you could define this other than by an
> > accumulation of specific cases then that's probably viable, but if
> > it's just an ever growing list of individual preferences on a case
> > by case basis, this doesn't sound okay to say yes to your selection
> > rather than someone else application own set.
> > Makes sense ?
>
> Thanks for the quick response.
>
> Yes, of course I didn't expect a straight "yes" to such a vague
> proposal. My question concerned whether I should invest the time
> and effort to determine the details of how this should look in the
> context of HTMLparser.
>
> I'll take your reply as a yes in principle, and dive into the code
> to think it through a little more. If it looks promising, I'll
> come back to you with more concrete proposals.

Coming back with some kind of definition of what tag soup parser
behaviour is would probably be more important than digging into the
libxml2 code. I am not sure we can emulate web browser parsers' behaviour.

There is John Cowan's TagSoup, which is probably what most people will
think of in terms of implementation:

  http://ccil.org/~cowan/XML/tagsoup/

  "It does guarantee well-structured results: tags will wind up properly
   nested, default attributes will appear appropriately, and so on"

but also

  "For example, if the first tag is LI, it will supply the application
   with enclosing HTML, BODY, and UL tags."

which, it seems, would defeat your first example.

The real problem is to come back with a set of guarantees and behaviour
rules. Reading the slides linked from the end of that page may help.
I'm not sure it's what you want, but since you use the same name, it
should hopefully be close.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
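
To make the quoted TagSoup guarantee concrete (a leading LI gets enclosing HTML, BODY, and UL tags, and everything ends up properly nested), here is a minimal sketch of a rule-driven repair pass. It is an illustration only and is neither TagSoup's nor libxml2's implementation. It uses Python's standard html.parser module, and the names REQUIRED_PARENTS, SoupSketch, and repair, as well as the contents of the rule table, are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical rule table (an assumption for illustration):
# each tag maps to the ancestor chain it requires, outermost first.
REQUIRED_PARENTS = {
    "body": ["html"],
    "ul": ["html", "body"],
    "li": ["html", "body", "ul"],
    "p": ["html", "body"],
}


class SoupSketch(HTMLParser):
    """Toy repair pass that records a well-nested list of events."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.events = []  # ("start", tag), ("end", tag), ("text", data)

    def _open(self, tag):
        self.stack.append(tag)
        self.events.append(("start", tag))

    def handle_starttag(self, tag, attrs):
        # Rule 1: supply any required ancestor that is not already open.
        for parent in REQUIRED_PARENTS.get(tag, []):
            if parent not in self.stack:
                self._open(parent)
        self._open(tag)

    def handle_endtag(self, tag):
        # Rule 2: ignore an end tag that matches no open element.
        if tag not in self.stack:
            return
        # Rule 3: close everything down to and including the match.
        while self.stack:
            top = self.stack.pop()
            self.events.append(("end", top))
            if top == tag:
                break

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

    def close(self):
        # Flush buffered input, then close whatever is still open.
        super().close()
        while self.stack:
            self.events.append(("end", self.stack.pop()))


def repair(markup):
    """Return the repaired event list for a markup fragment."""
    parser = SoupSketch()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in repair("<li>one"):
        print(event)
```

The point of the sketch is that its behaviour follows from three stated rules and one table rather than from a growing list of special cases. It deliberately omits much of what a real tag soup parser needs, such as implicitly closing a previous sibling LI or handling attributes.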
