Re: HTML parsing

Assaf Arkin 14 Mar 2000 01:23:51 -0000

I see a lot of interest in HTML parsing recently :-)

A short explanation. XML, whether in the form of a DOM tree or SAX events,
must always be well formed, which means every element must be closed. An
HTML parser will therefore report a well-formed stream of SAX events, or a
DOM document (which is always well formed).


What the HTML parser will do is attempt to convert non-well-formed HTML
into well-formed XML using the following logic:

1. Some elements are always empty, such as IMG. Simply create an empty
element (equivalent to <IMG/> in XML).

2. Some elements have an optional closing tag, and a set of tags known
to mark the end of these elements (e.g. <LI> closes the previously open
<LI>, while </UL> closes the last open <LI>).

3. When all else fails, simply figure out which element is open that
should be closed, and close it.
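
The three rules can be sketched in a few lines of Python. This is only an
illustration built on the standard html.parser module, not how any
particular HTML parser does it. The EMPTY and CLOSED_BY_START tables and
the repair() helper are names I made up for the sketch, the tables are far
from complete, and attributes are dropped to keep it short:

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny tables; real HTML needs many more entries.
EMPTY = {"img", "br", "hr"}
# Elements with an optional end tag, mapped to the start tags that close them.
CLOSED_BY_START = {"li": {"li"}}


class Repairer(HTMLParser):
    """Turns tag soup into a balanced list of SAX-like events."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.open_tags = []  # stack of elements that are still open

    def _close_top(self):
        self.events.append(("endElement", self.open_tags.pop()))

    def handle_starttag(self, tag, attrs):
        # Rule 2: a start tag may implicitly close the innermost open element.
        while self.open_tags and tag in CLOSED_BY_START.get(self.open_tags[-1], ()):
            self._close_top()
        self.events.append(("startElement", tag))
        if tag in EMPTY:
            # Rule 1: empty elements are closed immediately.
            self.events.append(("endElement", tag))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag with nothing to close: ignore it
        # Rules 2 and 3: close whatever is still open inside the element,
        # then close the element itself.
        while self.open_tags[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        self.events.append(("characters", data))

    def close(self):
        super().close()
        # End of input: close anything left open.
        while self.open_tags:
            self._close_top()


def repair(html):
    parser = Repairer()
    parser.feed(html)
    parser.close()
    return parser.events


for event in repair("<ul><li>one<li>two<img src=x></ul>"):
    print(*event)
```

Here the second <li> closes the first one, the <img> is closed on the
spot, and </ul> closes the <li> that is still open before closing the
list itself.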

So, the following HTML:

  <b>123<i>456</b>789</i>

would come out as (SAX):

  startElement b
  characters   123
  startElement i
  characters   456
  endElement   i
  endElement   b
  characters   789

or (DOM):

  element b
    text 123
    element i
      text 456
  text 789

(the two are equivalent, with the exception that one is represented as a
traversable tree, the other as a stream of events)
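
To make the equivalence concrete, here is a rough Python sketch that folds
a balanced event stream into a nested tree using a stack. The tuple-based
node shape and the build_tree() name are just assumptions for the
illustration, not the DOM API:

```python
def build_tree(events):
    """Fold a well-formed list of (kind, value) events into nested tuples.

    Each element is (name, children); text nodes are plain strings.
    """
    root = ("#document", [])
    stack = [root]  # innermost open element is last
    for kind, value in events:
        if kind == "startElement":
            node = (value, [])
            stack[-1][1].append(node)  # attach to the current parent
            stack.append(node)         # and descend into it
        elif kind == "endElement":
            stack.pop()                # back up to the parent
        elif kind == "characters":
            stack[-1][1].append(value)
    return root


events = [
    ("startElement", "b"),
    ("characters", "123"),
    ("startElement", "i"),
    ("characters", "456"),
    ("endElement", "i"),
    ("endElement", "b"),
    ("characters", "789"),
]
tree = build_tree(events)
print(tree)
```

Because the events are balanced, a single stack is enough: the tree is
the same information, just kept around so it can be traversed later.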

arkin


Mike Pogue wrote:
> 
> No, you can't do this.  A lot of HTML is "not well-formed XML", or "badly
> formed XML", and so the parser will generate errors.
> 
> Here's another example,
> 
> <b>123<i>456</b>789</i>
> 
> generates an XML well-formedness error, but it's also ambiguous.
> Should it generate:
> 
> start b
> text 123
> start i
> text 456
> end b
> text 789
> end i
> 
> which is an illegal XML event stream, or, maybe this?
> 
> start b
> text 123
> start i
> text 456
> end i   <--inserted
> end b
> text 789
> end i <-- error
> 
> or maybe this?
> 
> start b
> text 123
> start i
> text 456
> end i   <--inserted
> end b
> start i <-- inserted, but the parser had to go backwards to do it (it didn't
>                 know to insert it until it got the next end i)
> text 789
> end i
> 
> We're looking into getting submissions for a true HTML parser, that is
> forgiving, and basically creates something reasonable from it (a DOM tree,
> maybe?), for this reason.
> 
> Mike
> 
> [EMAIL PROTECTED] wrote:
> >
> > I was too, but this is the only thing I can come up with and I'm hoping someone
> > might be able to correct me:
> >
> > The DOM parser is built off the SAX parser which in itself wouldn't need
> > well-formedness but because the DOM parser needs proper end tags, etc. the
> > SAX parser does also????  I was assuming that with the SAX parser I could
> > simply handle startElement() and grab all attributes associated with IMG --
> > this doesn't work though because my sample HTML doc is not well-formed --
> > certain end tags are left out (which is acceptable in HTML land).
> >
> > -Heather
> >
> > -----Original Message-----
> > From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:57 AM
> > To: [EMAIL PROTECTED]
> > Subject: RE: HTML parsing
> >
> > Well, I hope it can be done. Couldn't you just trap elements that contain
> > the tag IMG as you parse the Instance? You know like using startElement and
> > EndElement. I would be blown away if the Sax Parser couldn't handle this.
> > Regards,
> > Ward
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Monday, March 13, 2000 10:32 AM
> > To: [EMAIL PROTECTED]
> > Subject: HTML parsing
> >
> > From what I can tell, I cannot expect to be able to parse an HTML doc with
> > the xerces parser?  I was hoping to use the C++ SAX parser to find <IMG>
> > tags but I don't think I will be able to do that.  Can someone confirm this
> > dreadful fact?
> >
> > Thanks,
> > Heather Matthews
