Re: HTML parsing

Assaf Arkin 14 Mar 2000 01:40:54 -0000

> The OpenXML one is the same as the one being contributed from ExOffice.
> I'm surprised that it is unforgiving, because I know it's been used for
> a lot of web spiders out there, and it would have to be pretty forgiving
> to work for that application!


Forgiving, but also reporting, which is kind of confusing at first.

OpenXML uses auto-correction, which means it finds the error, reports it,
figures out how to get around it, and constructs a well-formed document
in spite of it. If your document is really bad (as most HTML documents
are), OpenXML will read it and will construct a valid document including
all the elements/text content it finds, but will spit out a lot of errors
along the way.

Those using it for spiders simply ignore the errors. I actually list and
print these errors in the examples, which can get confusing.
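
As a rough illustration of the "report but keep going" pattern described
above, here is a minimal sketch. It does not use OpenXML's own API; it
assumes the standard org.xml.sax.ErrorHandler interface and the JDK's
built-in DOM parser, and the class and method names (CollectingErrors,
Collector, parse) are made up for the example. The JDK parser does no
auto-correction, so the sketch only covers recoverable (validation)
errors: they are collected rather than thrown, and the document is still
built. A spider would simply never look at the collected list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class CollectingErrors {

    /** Hypothetical handler: records reported problems instead of aborting. */
    static class Collector implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        public void warning(SAXParseException e) {
            messages.add("warning: " + e.getMessage());
        }

        public void error(SAXParseException e) {
            // Recoverable error: note it and let the parser continue.
            messages.add("error: " + e.getMessage());
        }

        public void fatalError(SAXParseException e) throws SAXParseException {
            // The JDK parser cannot recover from these, so rethrow.
            throw e;
        }
    }

    /** Parses the given text with validation on, reporting to the collector. */
    static Document parse(String xml, Collector collector) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(collector);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Well-formed, but <extra/> violates the inline DTD.
        String xml =
            "<!DOCTYPE html [<!ELEMENT html (body)><!ELEMENT body (#PCDATA)>]>"
            + "<html><body>hi</body><extra/></html>";

        Collector collector = new Collector();
        Document doc = parse(xml, collector);

        // The document is complete even though errors were reported.
        System.out.println("root: " + doc.getDocumentElement().getTagName());
        for (String message : collector.messages) {
            System.out.println(message);
        }
    }
}
```

Printing the list, as in main above, is the "listing" style; dropping the
loop gives the "ignore" style.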

arkin

> 
> Mike
> 
> [EMAIL PROTECTED] wrote:
> >
> > Can anyone send us links to the Sun and IBM versions please?  Are these
> > Java or C++ implementations?  I'm using the Java Tidy HTML parser because
> > the OpenXML one is way too unforgiving of non-well-formed HTML.  But I'd
> > prefer to use something else (also Java) because Tidy's not built for
> > speed.
> >
> > Thanks in advance,
> >
> > --Susan
> >
> >
> > Mike Pogue <[EMAIL PROTECTED]> wrote on 03/13/00 11:41 AM:
> > To:      [EMAIL PROTECTED]
> > cc:
> > Subject: Re: HTML parsing
> > Please respond to xerces-dev
> >
> >
> >
> > Note that we have a couple of people who would like to donate an
> > HTML parser to xml.apache.org, to be added to Xerces.  The ones I know of
> > are:
> >
> >      ExOffice (extremely well tested, used for web spiders),
> >      Sun (I haven't seen it yet), and
> >      IBM (I haven't seen it yet either).
> >
> > I suspect that if people are interested in this, we ought to have people
> > look at all three, and figure out whether one is better, or whether they
> > should be merged somehow before being checked in...assuming there's
> > interest in this!
> >
> > Any volunteers?
> >
> > Mike
> >
> > Cox Andy wrote:
> > >
> > > If the HTML is not well-formed XML (which most is not), you are correct.
> > >
> > > Andy
> > >
> > > | -----Original Message-----
> > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > | Sent: Monday, March 13, 2000 10:32 AM
> > > | To: [EMAIL PROTECTED]
> > > | Subject: HTML parsing
> > > |
> > > |
> > > | From what I can tell, I cannot expect to be able to parse an HTML doc
> > > | with the xerces parser?  I was hoping to use the C++ SAX parser to
> > > | find <IMG> tags but I don't think I will be able to do that.  Can
> > > | someone confirm this dreadful fact?
