No, you can't do this. A lot of HTML is "not well-formed XML", or "badly formed XML", and so the parser will generate errors.
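To make "not well-formed" concrete, here is a minimal sketch of the strict nesting rule an XML parser enforces. It is not Xerces code: `firstNestingError` is a hypothetical helper written for illustration, and it assumes bare tags only (no attributes, comments, or self-closing tags).

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Xerces): returns an empty string when every
// start tag in `markup` is closed in strict nesting order, otherwise a short
// description of the first violation. Assumes bare tags only: attributes,
// comments, and self-closing tags are deliberately not handled.
std::string firstNestingError(const std::string& markup) {
    std::vector<std::string> open;  // stack of currently open element names
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t close = markup.find('>', pos);
        if (close == std::string::npos) {
            return "unterminated tag";
        }
        const bool isEnd = (close > pos + 1 && markup[pos + 1] == '/');
        const std::size_t nameStart = pos + (isEnd ? 2 : 1);
        const std::string name = markup.substr(nameStart, close - nameStart);
        if (isEnd) {
            if (open.empty()) {
                return "end tag </" + name + "> has no matching start tag";
            }
            // XML requires the end tag to match the most recently opened element.
            if (open.back() != name) {
                return "expected </" + open.back() + "> but found </" + name + ">";
            }
            open.pop_back();
        } else {
            open.push_back(name);
        }
        pos = close + 1;
    }
    if (!open.empty()) {
        return "<" + open.back() + "> is never closed";
    }
    return "";
}
```

Calling it on `<ul><li>one<li>two</ul>`, which is ordinary HTML with the optional `</li>` end tags left out, reports a mismatch at `</ul>`. The same markup with both `</li>` tags present passes.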
Here's another example: <b>123<i>456</b>789</i> generates an XML well-formedness error, but it's also ambiguous. Should it generate:

    start b
    text 123
    start i
    text 456
    end b
    text 789
    end i

which is an illegal XML event stream, or maybe this?

    start b
    text 123
    start i
    text 456
    end i    <-- inserted
    end b
    text 789
    end i    <-- error

or maybe this?

    start b
    text 123
    start i
    text 456
    end i    <-- inserted
    end b
    start i  <-- inserted, but the parser had to go backwards to do it
                 (it didn't know to insert it until it got the next end i)
    text 789
    end i

For this reason, we're looking into getting submissions for a true HTML parser, one that is forgiving and basically creates something reasonable from the input (a DOM tree, maybe?).

Mike

[EMAIL PROTECTED] wrote:
>
> I was too, but this is the only thing I can come up with and I'm hoping
> someone might be able to correct me:
>
> The DOM parser is built off the SAX parser, which in itself wouldn't need
> well-formedness, but because the DOM parser needs proper end tags, etc., the
> SAX parser does also???? I was assuming that with the SAX parser I could
> simply handle startElement() and grab all attributes associated with IMG --
> this doesn't work though because my sample HTML doc is not well-formed --
> certain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -----Original Message-----
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL PROTECTED]
> Subject: RE: HTML parsing
>
> Well, I hope it can be done. Couldn't you just trap elements that contain
> the tag IMG as you parse the instance? You know, like using startElement and
> endElement. I would be blown away if the SAX parser couldn't handle this.
>
> Regards,
> Ward
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:32 AM
> To: [EMAIL PROTECTED]
> Subject: HTML parsing
>
> From what I can tell, I cannot expect to be able to parse an HTML doc with
> the xerces parser? I was hoping to use the C++ SAX parser to find <IMG> tags
> but I don't think I will be able to do that. Can someone confirm this
> dreadful fact?
>
> Thanks,
> Heather Matthews