Re: [xml] Parsing a file that I didn't create

Daniel Veillard Sun, 15 Oct 2006 02:20:45 -0700

On Sat, Oct 14, 2006 at 09:07:59PM -0700, Jeffrey Bigham wrote:
> > * Jeffrey Bigham wrote:
> > >libxml correctly messes this up because the closing HTML tags between
> > >the </script> tags aren't correctly written as <\/name>.  Is there a
> > >way to use libxml (I'm currently using the SAX parser) without having
> > >it try to fix things for me?  If not, is there another C library that
> > >people know of that can just return each tag to me, one at a time,
> > >without enforcing adherence to the standard?
> >
> > HTML Tidy (http://tidy.sf.net/) is able to cope with most of these cases
> > and you could use it as replacement or as pre-processor (e.g., you could
> > use it to convert the tag soup into well-formed XML and parse that with
> > libxml2). Perl's HTML::Parser (http://search.cpan.org/dist/HTML-Parser/)
> > is also written in C and can handle such tag soup in a similar way.
> 
> Thanks for the suggestions.  Tidy isn't attractive because time is of
> paramount concern and I don't really want to have to do two passes
> over the data.  I took a look at the Perl version and I think it
> could probably work for my purposes, although it doesn't look like
> there's an easy way to just drop it into my current project.
> 
> Isn't there a flag or something I could set in libxml that would tell
> it not to output a tag if it doesn't exist in the original source?  If
> not, why not?


 In your case, the tag *is* present in the original source. Reread the HTML
spec on the conditions for closing <script>: the first "</" followed by a
letter ends the script content, so <script> *is* closed at that point and the
next < marks the beginning of an opening tag. Sorry, but the behaviour you
want from the application is to ignore tags which *are* present.
Stating that libxml2 should not add tags which do not exist is a reformulation
of the problem: the input is broken, not libxml2, and you must agree
that special, diverging processing is needed to cope with such input.
Parsing and accepting specially broken input costs everybody a lot, and you
must be ready to bear that cost if you want to accept this input in a
broken way. It is a sad situation, but it is the current one. If you start
changing the parser to make the broken behaviour the default, then you
will break correctly written pages as far as I can tell, so the choice is
relatively obvious to me.
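
To make the closing rule concrete, here is a minimal sketch in plain C of
how the end of a script's content is located under the HTML 4.01 rule (the
data ends at the first "</" followed by a letter). The function name
script_content_end is hypothetical and is not part of libxml2; it only
illustrates the rule and is not the parser's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII letters only, so the result does not depend on the C locale. */
static int is_name_start(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper, not a libxml2 function.
 * 's' is the text that follows a <script> start tag. Returns a pointer to
 * the first "</" that is followed by a letter, which is where the script
 * content ends under the HTML 4.01 rule, or NULL if there is none. */
const char *script_content_end(const char *s)
{
    assert(s != NULL);
    for (const char *p = strstr(s, "</"); p != NULL; p = strstr(p + 2, "</")) {
        if (is_name_start(p[2]))
            return p;
    }
    return NULL;
}
```

Given the input document.write("<b>hi</b>");</script> the function returns
a pointer to </b>, not to </script>, which is why the parser treats the
script as closed early. With the escaped form <\/b> the first match is the
real </script>.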

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
