No, you can't do this. A lot of HTML is not well-formed XML, so the parser
will generate errors.
Here's another example:
<b>123<i>456</b>789</i>
This generates an XML well-formedness error, but it's also ambiguous.
Should it generate:
start b
text 123
start i
text 456
end b
text 789
end i
which is an illegal XML event stream? Or maybe this?
start b
text 123
start i
text 456
end i <-- inserted
end b
text 789
end i <-- error
or maybe this?
start b
text 123
start i
text 456
end i <-- inserted
end b
start i <-- inserted, but the parser had to go backwards to do it (it didn't
know to insert it until it got the next end i)
text 789
end i
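To make the "illegal XML event stream" point concrete, here is a minimal sketch of the nesting rule, assuming events are represented by a small hypothetical Event struct (this is not a Xerces class, and firstNestingError and firstCandidate are names made up for the illustration). An end tag is only legal if it closes the most recently opened element, which a stack checks directly:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parse event. `kind` is "start", "end" or "text"; `value` holds the
// element name or the character data. (Hypothetical type, not a Xerces class.)
struct Event {
    std::string kind;
    std::string value;
};

// Returns the index of the first event that violates XML nesting, the event
// count if elements are still open at the end, or -1 if the stream is legal.
int firstNestingError(const std::vector<Event>& events) {
    std::vector<std::string> open;  // names of elements not yet closed
    for (std::size_t i = 0; i < events.size(); ++i) {
        const Event& e = events[i];
        if (e.kind == "start") {
            open.push_back(e.value);
        } else if (e.kind == "end") {
            // An end tag must close the most recently opened element.
            if (open.empty() || open.back() != e.value) {
                return static_cast<int>(i);
            }
            open.pop_back();
        }
    }
    return open.empty() ? -1 : static_cast<int>(events.size());
}

// The first candidate stream for <b>123<i>456</b>789</i>.
std::vector<Event> firstCandidate() {
    return {{"start", "b"}, {"text", "123"}, {"start", "i"}, {"text", "456"},
            {"end", "b"},   {"text", "789"}, {"end", "i"}};
}
```

Calling firstNestingError(firstCandidate()) points at the "end b" event, because "i" is still the innermost open element when it arrives.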
For this reason, we're looking into getting submissions for a true HTML
parser: one that is forgiving and basically creates something reasonable
from input like this (a DOM tree, maybe?).
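As a rough idea of what "forgiving" could mean for the example above, here is a sketch of one possible repair policy. It is only an assumption for illustration, not how any existing parser is known to work, and Event and repairNesting are hypothetical names. When an end tag arrives for an element that is not innermost, the sketch closes the inner elements, emits the end tag, and reopens the inner elements right away:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical event type: `kind` is "start", "end" or "text".
struct Event {
    std::string kind;
    std::string value;
};

// Rewrites a badly nested event stream into a properly nested one.
std::vector<Event> repairNesting(const std::vector<Event>& in) {
    std::vector<Event> out;
    std::vector<std::string> open;  // names of elements not yet closed
    for (const Event& e : in) {
        if (e.kind == "start") {
            open.push_back(e.value);
            out.push_back(e);
        } else if (e.kind == "end") {
            // Find the nearest open element with this name.
            auto match = std::find(open.rbegin(), open.rend(), e.value);
            if (match == open.rend()) {
                continue;  // stray end tag with nothing to close: drop it
            }
            // Elements opened after the matched one, outermost first.
            std::vector<std::string> inner(match.base(), open.end());
            for (auto it = inner.rbegin(); it != inner.rend(); ++it) {
                out.push_back({"end", *it});  // inserted end tag
            }
            out.push_back(e);
            open.erase(match.base() - 1, open.end());
            for (const std::string& name : inner) {
                open.push_back(name);
                out.push_back({"start", name});  // inserted start tag
            }
        } else {
            out.push_back(e);
        }
    }
    // Close whatever is still open at end of input.
    for (auto it = open.rbegin(); it != open.rend(); ++it) {
        out.push_back({"end", *it});
    }
    return out;
}
```

Fed the events for <b>123<i>456</b>789</i>, this yields the third stream above. Because it reopens "i" eagerly instead of waiting for the later "end i", it never has to go backwards, at the cost of sometimes emitting an empty element when that later end tag never comes.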
Mike
[EMAIL PROTECTED] wrote:
>
> I was too, but this is the only thing I can come up with, and I'm hoping
> someone might be able to correct me:
>
> The DOM parser is built off the SAX parser which in itself wouldn't need
> well-formedness but because the DOM parser needs proper end tags, etc. the
> SAX parser does also???? I was assuming that with the SAX parser I could
> simply handle startElement() and grab all attributes associated with IMG --
> this doesn't work though because my sample HTML doc is not well-formed --
> certain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -----Original Message-----
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL PROTECTED]
> Subject: RE: HTML parsing
>
> Well, I hope it can be done. Couldn't you just trap elements that contain
> the tag IMG as you parse the instance? You know, like using startElement and
> endElement. I would be blown away if the SAX parser couldn't handle this.
> Regards,
> Ward
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:32 AM
> To: [EMAIL PROTECTED]
> Subject: HTML parsing
>
> From what I can tell, I cannot expect to be able to parse an HTML doc with
> the Xerces parser? I was hoping to use the C++ SAX parser to find <IMG> tags,
> but I don't think I will be able to do that. Can someone confirm this
> dreadful fact?
>
> Thanks,
> Heather Matthews