> * Jeffrey Bigham wrote:
> >libxml correctly messes this up because the closing HTML tags between
> >the </script> tags aren't correctly written as <\/name>. Is there a
> >way to use libxml (I'm currently using the SAX parser) without having
> >it try to fix things for me? If not, is there another C library that
> >people know of that can just return each tag to me, one at a time,
> >without enforcing adherence to the standard?
>
> HTML Tidy (http://tidy.sf.net/) is able to cope with most of these cases
> and you could use it as replacement or as pre-processor (e.g., you could
> use it to convert the tag soup into well-formed XML and parse that with
> libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
> is also written in C and can handle such tag soup in a similar way.

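To make the request concrete, here is roughly the kind of single-pass scan I have in mind: it hands back each tag exactly as it appears in the source, one at a time, and never invents or repairs anything. This is only a minimal sketch, not libxml or HTML::Parser code. The names soup_scanner, soup_tag and soup_next_tag are made up for illustration, and it assumes NUL-terminated input, ignores attributes, and does not handle comments, CDATA, or a '>' inside a quoted attribute value.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* One tag as found in the input; nothing is inferred or repaired. */
typedef struct {
    char name[32];   /* lower-cased tag name, truncated if too long */
    int  is_close;   /* 1 for </name>, 0 for <name ...> */
} soup_tag;

/* Scanner state (hypothetical type, for illustration only). */
typedef struct {
    const char *pos;       /* next unread byte of the NUL-terminated input */
    int         in_script; /* nonzero while between <script> and </script> */
} soup_scanner;

/* Case-insensitive test: does s start with the NUL-terminated prefix p? */
int starts_with_ci(const char *s, const char *p)
{
    for (; *p; s++, p++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*p))
            return 0;
    }
    return 1;
}

/*
 * Report the next tag in document order.
 * Returns 1 and advances the scanner past the tag, or 0 at end of input.
 * While inside <script>, everything up to the next "</script" is treated
 * as raw text, so markup inside JavaScript strings is not reported.
 */
int soup_next_tag(soup_scanner *sc, soup_tag *out)
{
    const char *p = sc->pos;

    while (*p) {
        if (*p != '<') { p++; continue; }

        /* Inside a script, only "</script" ends the raw-text region. */
        if (sc->in_script && !starts_with_ci(p, "</script")) { p++; continue; }

        const char *q = p + 1;
        int is_close = (*q == '/');
        if (is_close) q++;

        /* A tag name must start with a letter; otherwise this '<' is
           treated as plain text (e.g. "a < b" or "<!"). */
        if (!isalpha((unsigned char)*q)) { p++; continue; }

        size_t n = 0;
        while (isalnum((unsigned char)*q)) {
            if (n + 1 < sizeof out->name)
                out->name[n++] = (char)tolower((unsigned char)*q);
            q++;
        }
        out->name[n] = '\0';
        out->is_close = is_close;

        /* Skip attributes up to the closing '>' (quoted '>' not handled). */
        while (*q && *q != '>') q++;
        if (*q == '>') q++;

        /* Enter raw-text mode after <script>, leave it after </script>. */
        sc->in_script = (!is_close && strcmp(out->name, "script") == 0);
        sc->pos = q;
        return 1;
    }

    sc->pos = p;
    return 0;
}
```

Initialising a soup_scanner with { buffer, 0 } and calling soup_next_tag in a loop until it returns 0 visits every tag in one pass over the data. An unbalanced or misnested tag is simply reported as it stands.
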
Thanks for the suggestions. Tidy isn't attractive because time is of
paramount concern and I don't really want to make two passes over the
data. I took a look at the Perl version and I think it could probably
work for my purposes, although it doesn't look like there's an easy way
to just drop it into my current project.

Isn't there a flag or something I could set in libxml that would tell it
not to output a tag if it doesn't exist in the original source? If not,
why not?

Thanks again!
Jeff

> --
> Björn Höhrmann · mailto:[EMAIL PROTECTED] · http://bjoern.hoehrmann.de
> Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
> 68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
_______________________________________________
xml mailing list, project page http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
