I see a lot of interest in HTML parsing recently :-)

A short explanation. XML, whether in the form of DOM or SAX events, must always be well formed, which means elements must always be closed. An HTML parser will therefore report a well-formed stream of SAX events, or a DOM document (which is always well formed).
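To make the well-formedness point concrete, here is a small sketch of a strict XML SAX parser rejecting HTML-style markup. It is in Python rather than Xerces, purely for illustration, and uses the standard xml.sax module. The EventLogger class and the try_parse helper are names invented for this example.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records the SAX events it receives as short strings."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement " + name)

    def endElement(self, name):
        self.events.append("endElement " + name)

    def characters(self, content):
        self.events.append("characters " + content)


def try_parse(markup):
    """Parse markup as XML; return (events seen so far, error message or None)."""
    handler = EventLogger()
    try:
        xml.sax.parseString(markup.encode("utf-8"), handler)
        return handler.events, None
    except xml.sax.SAXParseException as exc:
        # A strict XML parser stops at the first well-formedness error.
        return handler.events, exc.getMessage()


# Overlapping tags: legal-looking to a browser, fatal to an XML parser.
events, error = try_parse("<b>123<i>456</b>789</i>")

# The same content with properly nested tags parses without complaint.
ok_events, ok_error = try_parse("<b>123<i>456</i></b>")
```

Here error holds the parser's complaint for the overlapping tags, while ok_error is None for the properly nested version.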
What the HTML parser will do is attempt to convert non-well-formed HTML
into well-formed XML using the following logic:

1. Some elements are empty, for example IMG. Simply create an empty
   element (equivalent to <IMG/> in XML).

2. Some elements have an optional closing tag, and a set of tags known to
   mark the end of these elements (e.g. <LI> closes the previously open
   <LI>, while </UL> closes the last open <LI>).

3. When all else fails, simply figure out which open element should be
   closed, and close it.

So, the following HTML:

  <b>123<i>456</b>789</i>

would come out as (SAX):

  startElement b
  characters 123
  startElement i
  characters 456
  endElement i
  endElement b
  characters 789

or (DOM):

  element b
    text 123
    element i
      text 456
  text 789

(the two are equivalent, except that one is represented as a traversable
tree, the other as a stream of events)

arkin

Mike Pogue wrote:
>
> No, you can't do this. A lot of HTML is "not well-formed XML", or "badly
> formed XML", and so the parser will generate errors.
>
> Here's another example,
>
> <b>123<i>456</b>789</i>
>
> generates an XML well-formedness error, but it's also ambiguous.
> Should it generate:
>
> start b
> text 123
> start i
> text 456
> end b
> text 789
> end i
>
> which is an illegal XML event stream, or, maybe this?
>
> start b
> text 123
> start i
> text 456
> end i <--inserted
> end b
> text 789
> end i <-- error
>
> or maybe this?
>
> start b
> text 123
> start i
> text 456
> end i <--inserted
> end b
> start i <-- inserted, but the parser had to go backwards to do it
>             (it didn't know to insert it until it got the next end i)
> text 789
> end i
>
> We're looking into getting submissions for a true HTML parser, that is
> forgiving, and basically creates something reasonable from it (a DOM tree,
> maybe?), for this reason.
>
> Mike
>
> [EMAIL PROTECTED] wrote:
> >
> > I was too but this is the only thing I can come up with and I'm hoping
> > someone might be able to correct me:
> >
> > The DOM parser is built off the SAX parser which in itself wouldn't need
> > well-formedness but because the DOM parser needs proper end tags, etc. the
> > SAX parser does also???? I was assuming that with the SAX parser I could
> > simply handle startElement() and grab all attributes associated with IMG --
> > this doesn't work though because my sample HTML doc is not well-formed --
> > certain end tags are left out (which is acceptable in HTML land).
> >
> > -Heather
> >
> > -----Original Message-----
> > From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:57 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: HTML parsing
> >
> > Well, I hope it can be done. Couldn't you just trap elements that contain
> > the tag IMG as you parse the instance? You know, like using startElement
> > and endElement. I would be blown away if the SAX parser couldn't handle
> > this.
> > Regards,
> > Ward
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:32 AM
> > To: [EMAIL PROTECTED]
> > Subject: HTML parsing
> >
> > From what I can tell, I cannot expect to be able to parse an HTML doc with
> > the xerces parser? I was hoping to use the C++ SAX parser to find <IMG>
> > tags but I don't think I will be able to do that. Can someone confirm this
> > dreadful fact?
> >
> > Thanks,
> > Heather Matthews
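
P.S. For anyone who wants to experiment, the three recovery rules at the top of this message (empty elements, optional closing tags, close whatever is still open) can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any particular HTML parser is implemented. It uses Python's standard html.parser module as a lenient tokenizer. The Repairer class, the repair helper, and the tiny EMPTY and IMPLIED_END tables are all made up for the example; a real parser would carry the full HTML element lists.

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny tables (hypothetical, not the full HTML lists).
EMPTY = {"img", "br", "hr"}          # rule 1: elements with no content
IMPLIED_END = {"li": {"li"}}         # rule 2: start tag -> open element it closes


class Repairer(HTMLParser):
    """Turns forgiving HTML tokens into a properly nested event list."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of currently open elements
        self.events = []   # repaired SAX-like events

    def _close_top(self):
        self.events.append("endElement " + self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Rule 2: e.g. a new <li> closes the previously open <li>.
        if self.stack and self.stack[-1] in IMPLIED_END.get(tag, ()):
            self._close_top()
        self.events.append("startElement " + tag)
        if tag in EMPTY:
            # Rule 1: report the empty element as opened and closed at once.
            self.events.append("endElement " + tag)
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag (already closed or never opened): drop it
        # Rules 2 and 3: close everything opened inside `tag`, then `tag`.
        while self.stack:
            closed = self.stack[-1]
            self._close_top()
            if closed == tag:
                break

    def handle_data(self, data):
        self.events.append("characters " + data)

    def close(self):
        super().close()
        # End of input: close anything still open.
        while self.stack:
            self._close_top()


def repair(html):
    """Return the repaired event list for an HTML fragment."""
    repairer = Repairer()
    repairer.feed(html)
    repairer.close()
    return repairer.events


overlap_events = repair("<b>123<i>456</b>789</i>")
list_events = repair("<ul><li>a<li>b</ul>")
```

With these assumptions, overlap_events closes i and then b when </b> arrives and silently drops the late </i>, which is the same event order as the SAX listing near the top of this message. Note that html.parser reports tag names in lowercase.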