I see a lot of interest in HTML parsing recently :-)
A short explanation. XML, whether in the form of DOM or SAX events, must
always be well formed; that is, every element must be closed. An
HTML parser will therefore report a well-formed stream of SAX events, or a
DOM document (which is always well formed).
What the HTML parser will do is attempt to convert non-well-formed HTML
into well-formed XML using the following logic:
1. Some elements are always empty, IMG for example. Simply create an empty
element (equivalent to <IMG/> in XML).
2. Some elements have an optional closing tag, and a set of tags known
to mark the end of these elements (e.g. <LI> closes the previously open
<LI>, while </UL> closes the last open <LI>)
3. When all else fails, simply figure out which open elements should
have been closed by now, and close them.
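
The three rules above can be sketched in a few lines. This is only an
illustration in Python on top of the standard html.parser module, not how
any particular HTML parser implements it. The names EMPTY, CLOSED_BY_START,
Fixer and well_formed_events are made up for this sketch, and the two
tables are deliberately tiny assumptions; a real parser would derive them
from the HTML DTD.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete tables (hypothetical, for illustration).
EMPTY = {"img", "br", "hr"}                      # rule 1: always-empty elements
CLOSED_BY_START = {"li": {"li"}, "p": {"p"}}     # rule 2: open tag -> start tags that end it


class Fixer(HTMLParser):
    """Turns tag soup into a well-formed list of SAX-like events."""

    def __init__(self):
        super().__init__()
        self.open = []     # stack of currently open element names
        self.events = []   # the well-formed event stream

    def _close_top(self):
        self.events.append(("endElement", self.open.pop()))

    def handle_starttag(self, tag, attrs):
        # Rule 2: this start tag may implicitly end the innermost open element.
        if self.open and tag in CLOSED_BY_START.get(self.open[-1], ()):
            self._close_top()
        self.events.append(("startElement", tag))
        if tag in EMPTY:
            # Rule 1: empty elements are closed immediately.
            self.events.append(("endElement", tag))
        else:
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end tag with nothing to close: drop it
        # Rules 2 and 3: close everything still open inside this element,
        # then the element itself.
        while self.open[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        self.events.append(("characters", data))

    def close(self):
        super().close()
        # Anything still open at end of input gets closed too.
        while self.open:
            self._close_top()


def well_formed_events(html):
    fixer = Fixer()
    fixer.feed(html)
    fixer.close()
    return fixer.events


if __name__ == "__main__":
    for event in well_formed_events("<ul><li>a<li>b</ul>"):
        print(*event)
```

Feeding it <ul><li>a<li>b</ul> gives a stream in which the first LI is
closed before the second one starts, and the second is closed before the
UL ends.
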
So, the following HTML:
<b>123<i>456</b>789</i>
would come out as (SAX):
startElement b
characters 123
startElement i
characters 456
endElement i
endElement b
characters 789
or (DOM):
element b
  text 123
  element i
    text 456
text 789
(the two are equivalent, with the exception that one is represented as a
traversable tree, the other as a stream of events)
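
To make the equivalence concrete, here is a rough Python sketch that folds
the event stream above into a nested tree and unfolds it back into the
same events. The helper names events_to_tree and tree_to_events and the
(name, children) tuple layout are invented for this example (they are not
a DOM API), and the end events use the SAX name endElement.

```python
def events_to_tree(events):
    """Fold a well-formed event stream into nested (name, children) tuples."""
    root = ("document", [])
    stack = [root]                      # innermost open node is stack[-1]
    for kind, value in events:
        if kind == "startElement":
            node = (value, [])
            stack[-1][1].append(node)   # attach to the current parent
            stack.append(node)
        elif kind == "endElement":
            stack.pop()
        elif kind == "characters":
            stack[-1][1].append(value)  # text nodes are plain strings
    return root


def tree_to_events(node):
    """Walk the tree depth-first and yield the equivalent event stream."""
    _, children = node
    for child in children:
        if isinstance(child, str):
            yield ("characters", child)
        else:
            yield ("startElement", child[0])
            yield from tree_to_events(child)
            yield ("endElement", child[0])


EVENTS = [
    ("startElement", "b"), ("characters", "123"),
    ("startElement", "i"), ("characters", "456"),
    ("endElement", "i"), ("endElement", "b"),
    ("characters", "789"),
]

if __name__ == "__main__":
    tree = events_to_tree(EVENTS)
    print(tree)
    print(list(tree_to_events(tree)) == EVENTS)
```

The round trip only works because the stream is well formed: every
endElement pops exactly the node its startElement pushed.
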
arkin
Mike Pogue wrote:
>
> No, you can't do this. A lot of HTML is "not well-formed XML", or "badly
> formed XML", and so the parser will generate errors.
>
> Here's another example,
>
> <b>123<i>456</b>789</i>
>
> generates an XML well-formedness error, but it's also ambiguous.
> Should it generate:
>
> start b
> text 123
> start i
> text 456
> end b
> text 789
> end i
>
> which is an illegal XML event stream, or, maybe this?
>
> start b
> text 123
> start i
> text 456
> end i <--inserted
> end b
> text 789
> end i <-- error
>
> or maybe this?
>
> start b
> text 123
> start i
> text 456
> end i <--inserted
> end b
> start i <-- inserted, but the parser had to go backwards to do it
>             (it didn't know to insert it until it got the next end i)
> text 789
> end i
>
> We're looking into getting submissions for a true HTML parser, one that is
> forgiving and basically creates something reasonable from the input (a DOM
> tree, maybe?), for this reason.
>
> Mike
>
> [EMAIL PROTECTED] wrote:
> >
> > I was too, but this is the only thing I can come up with and I'm hoping
> > someone might be able to correct me:
> >
> > The DOM parser is built off the SAX parser which in itself wouldn't need
> > well-formedness but because the DOM parser needs proper end tags, etc. the
> > SAX parser does also???? I was assuming that with the SAX parser I could
> > simply handle startElement() and grab all attributes associated with IMG --
> > this doesn't work though because my sample HTML doc is not well-formed --
> > certain end tags are left out (which is acceptable in HTML land).
> >
> > -Heather
> >
> > -----Original Message-----
> > From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:57 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: HTML parsing
> >
> > Well, I hope it can be done. Couldn't you just trap elements that contain
> > the tag IMG as you parse the Instance? You know like using startElement and
> > EndElement. I would be blown away if the Sax Parser couldn't handle this.
> > Regards,
> > Ward
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:32 AM
> > To: [EMAIL PROTECTED]
> > Subject: HTML parsing
> >
> > From what I can tell, I cannot expect to be able to parse an HTML doc with
> > the xerces parser? I was hoping to use the C++ SAX parser to find <IMG>
> > tags but I don't think I will be able to do that. Can someone confirm this
> > dreadful fact?
> >
> > Thanks,
> > Heather Matthews