On Sat, Oct 14, 2006 at 09:07:59PM -0700, Jeffrey Bigham wrote:
> > * Jeffrey Bigham wrote:
> > >libxml correctly messes this up because the closing HTML tags between
> > >the </script> tags aren't correctly written as <\/name>. Is there a
> > >way to use libxml (I'm currently using the SAX parser) without having
> > >it try to fix things for me? If not, is there another C library that
> > >people know of that can just return each tag to me, one at a time,
> > >without enforcing adherence to the standard?
> >
> > HTML Tidy (http://tidy.sf.net/) is able to cope with most of these cases
> > and you could use it as a replacement or as a pre-processor (e.g., you could
> > use it to convert the tag soup into well-formed XML and parse that with
> > libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
> > is also written in C and can handle such tag soup in a similar way.
>
> Thanks for the suggestions. Tidy isn't attractive because time is of
> paramount concern and I don't really want to have to do two passes
> over the data. I took a look at the Perl version and I think it
> could probably work for my purposes, although it doesn't look like
> there's an easy way to just drop it into my current project.
>
> Isn't there a flag or something I could set in libxml that would tell
> it not to output a tag if it doesn't exist in the original source? If
> not, why not?
In your case, it *is* present in the original source. Reread the HTML spec
on the condition for closing <script>: the <script> element *is* closed at
that point, and the next < marks the beginning of an opening tag. Sorry, but
the behaviour you want from the application in that case is to ignore tags
which *are* present.

Stating that libxml2 should not add tags which don't exist is just a
reformulation of the problem. The input is broken, not libxml2, and you must
agree that special, diverging processing is needed to cope with it. Being
willing to parse and accept specifically broken input costs everybody a lot,
and you must be ready to pay that cost if you want to accept this input in a
broken way. It is a sad situation, but it is the current one.

If you start changing the parser to make the broken behaviour the default,
then you will break correctly written pages as far as I can tell, so the
choice is relatively obvious to me.

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
                     | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
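
For readers who want to see the rule in question: as I read HTML 4, the content of a <script> element ends at the first "</" followed by a letter, regardless of which tag name comes next. The snippet below is only a rough sketch of that rule in Python, not libxml2 code; the name script_content_end and the sample page are made up for illustration.

```python
import re

# HTML 4 rule for CDATA elements such as <script>: the content ends at the
# first "</" that is followed by a letter, whatever tag name comes after it.
ETAGO = re.compile(r"</[A-Za-z]")


def script_content_end(html: str, start: int) -> int:
    """Return the index where script content beginning at `start` ends.

    Returns len(html) when no end-tag open delimiter is found.
    (Hypothetical helper, for illustration only.)
    """
    match = ETAGO.search(html, start)
    return match.start() if match else len(html)


# Sample input with an unescaped closing tag inside the script text.
page = '<script>document.write("<b>hi</b>");</script>'
start = page.index(">") + 1            # first character after <script>
end = script_content_end(page, start)

script_text = page[start:end]          # what a strict parser sees as script
remainder = page[end:]                 # parsed as ordinary markup again
```

With this input the script text stops just before the inner </b>, so everything from there on is handed back to the normal tag parser. Writing the inner tag as <\/b> moves the first match to the real </script>, which is why the escaped form parses the way the page author intended.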
