Re: HTML parsing

Mike Pogue 13 Mar 2000 22:54:58 -0000

No, you can't do this.  A lot of HTML is not well-formed XML, so an XML
parser such as Xerces will report well-formedness errors on it.
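
To see the failure concretely, here is a minimal sketch.  It uses Python's
standard xml.sax module (expat-based) as a stand-in for the Xerces SAX parser;
that substitution is an assumption made only to keep the example
self-contained, and the names ImgCollector and try_parse are made up for
illustration.  The exact error text will differ between parsers.

```python
import xml.sax


class ImgCollector(xml.sax.ContentHandler):
    """Collects the src attribute of every IMG start tag it is shown."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def startElement(self, name, attrs):
        if name.lower() == "img":
            self.sources.append(attrs.get("src"))


def try_parse(markup):
    """Return (sources seen so far, error message or None)."""
    handler = ImgCollector()
    try:
        xml.sax.parseString(markup.encode("utf-8"), handler)
    except xml.sax.SAXParseException as exc:
        # A fatal well-formedness error stops the parse at this point.
        return handler.sources, exc.getMessage()
    return handler.sources, None


# Well-formed: the IMG element is explicitly closed, so no error.
print(try_parse('<html><body><img src="a.png"/></body></html>'))

# Typical HTML: IMG has no end tag, so </body> is a mismatched tag.
print(try_parse('<html><body><img src="a.png"></body></html>'))
```

The handler may still see some start tags before the fatal error is reported,
but nothing after the error is delivered, so this is not a dependable way to
scan a whole HTML page.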


Here's another example:

<b>123<i>456</b>789</i>

generates an XML well-formedness error, but it's also ambiguous.  
Should it generate:

start b
text 123
start i
text 456
end b
text 789
end i

which is an illegal XML event stream, or, maybe this?

start b
text 123
start i
text 456
end i   <--inserted
end b
text 789
end i <-- error

or maybe this?

start b
text 123
start i
text 456
end i   <--inserted
end b
start i <-- inserted, but the parser had to go backwards to do it
            (it didn't know to insert it until it got the next end i)
text 789
end i 
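
The last repair can be sketched in a few lines.  This is only an illustration
under stated assumptions, not what Xerces does: it takes the raw events from
Python's forgiving html.parser module and rebalances them with a stack.  The
names RawEvents, raw_events and repair are hypothetical.  Note one difference
from the description above: instead of going backwards, it reopens the
interrupted element eagerly, as soon as the outer element is closed.

```python
from html.parser import HTMLParser


class RawEvents(HTMLParser):
    """Records start/text/end events exactly as they appear in the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def raw_events(markup):
    parser = RawEvents()
    parser.feed(markup)
    parser.close()
    return parser.events


def repair(events):
    """Rebalance an event stream so every end matches the innermost start."""
    out, stack = [], []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
            out.append((kind, value))
        elif kind == "end":
            if value not in stack:
                continue  # stray end tag with no open element: drop it
            reopened = []
            while stack[-1] != value:
                inner = stack.pop()
                out.append(("end", inner))  # inserted end tag
                reopened.append(inner)
            stack.pop()
            out.append(("end", value))
            for inner in reversed(reopened):
                stack.append(inner)
                out.append(("start", inner))  # inserted start tag
        else:
            out.append((kind, value))
    while stack:
        out.append(("end", stack.pop()))  # close whatever is still open
    return out


for kind, value in repair(raw_events("<b>123<i>456</b>789</i>")):
    print(kind, value)
```

Reopening eagerly avoids any lookahead, at the cost of sometimes emitting an
empty element when the interrupted tag is closed immediately afterwards.  The
sketch also ignores HTML's void elements (IMG, BR and so on), which a real
parser would have to treat as self-closing.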

For this reason, we're looking into getting submissions for a true HTML
parser: one that is forgiving and basically creates something reasonable
(a DOM tree, maybe?) from input like this.
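
For the original IMG question, a forgiving tokenizer is enough.  As a sketch
only (Python's standard html.parser is used here because it is easy to run;
it is not part of Xerces, and ImgFinder and find_images are made-up names):

```python
from html.parser import HTMLParser


class ImgFinder(HTMLParser):
    """Collects the src attribute of every IMG tag, closed or not."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def find_images(markup):
    finder = ImgFinder()
    finder.feed(markup)
    finder.close()
    return finder.sources


print(find_images('<P>Hi<IMG SRC="a.png"><p>there<img src=b.gif>'))
```

Because no tree is built and no end tags are required, the missing end tags
that stop an XML parser are simply not an issue here.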

Mike

[EMAIL PROTECTED] wrote:
> 
> I was too, but this is the only thing I can come up with, and I'm hoping
> someone might be able to correct me:
> 
> The DOM parser is built on the SAX parser, which in itself wouldn't need
> well-formedness, but because the DOM parser needs proper end tags, etc., the
> SAX parser does also?  I was assuming that with the SAX parser I could
> simply handle startElement() and grab all attributes associated with IMG.
> This doesn't work, though, because my sample HTML doc is not well-formed:
> certain end tags are left out (which is acceptable in HTML land).
> 
> -Heather
> 
> -----Original Message-----
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL PROTECTED]
> Subject: RE: HTML parsing
> 
> Well, I hope it can be done. Couldn't you just trap elements that contain
> the tag IMG as you parse the instance? You know, like using startElement and
> endElement. I would be blown away if the SAX parser couldn't handle this.
> Regards,
> Ward
> 
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:32 AM
> To: [EMAIL PROTECTED]
> Subject: HTML parsing
> 
> From what I can tell, I cannot expect to be able to parse an HTML doc with
> the Xerces parser?  I was hoping to use the C++ SAX parser to find <IMG>
> tags, but I don't think I will be able to do that.  Can someone confirm this
> dreadful fact?
> 
> Thanks,
> Heather Matthews
