Re: HTML parsing

Assaf Arkin 14 Mar 2000 21:05:14 -0000

:-)

arkin


[EMAIL PROTECTED] wrote:
> 
> Arkin--
> 
> I'm sorry to have publicly maligned the OpenXML HTML parser!  I clearly
> misunderstood the error reports to be fatal, while they were not.  Will
> resume use of your product immediately!!
> 
> --Susan
> 
> 
> From:    Assaf Arkin <[EMAIL PROTECTED]ce.com>
> To:      [EMAIL PROTECTED]
> Subject: Re: HTML parsing
> Date:    03/13/00 05:38 PM
> Please respond to xerces-dev
> 
> 
> 
> > The OpenXML one is the same as the one being contributed from ExOffice.
> > I'm surprised that it is unforgiving, because I know it's been used for
> > a lot of web spiders out there, and it would have to be pretty forgiving
> > to work for that application!
> 
> Forgiving, but also reporting, which is kind of confusing at first.
> 
> OpenXML uses auto-correction, which means it finds the error, reports it,
> figures out how to get around it, and constructs a well-formed document
> in spite of it. If your document is really bad (as most HTML documents
> are), OpenXML will read it and construct a valid document including
> all the elements and text content it finds, but will spit out a lot of
> errors along the way.
> 
> Those using it for spiders simply ignore the errors. I actually list and
> print these errors in the examples, which can get confusing.
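
As a rough illustration of the "report it, then keep going" behavior described above: this sketch does not use the OpenXML API (which isn't shown in this thread). It assumes the standard JAXP/SAX interfaces that ship with the JDK as a stand-in, and the names ReportingParseDemo, CollectingErrorHandler, parse and SAMPLE are made up for the example. A validating parse hands recoverable errors to the handler, which records them instead of throwing, so the document still gets built. A strict XML parser will still stop on fatal (well-formedness) errors, which is exactly where an auto-correcting HTML parser would behave differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical demo class; stands in for a "forgiving but reporting" parser setup.
class ReportingParseDemo {

    /** Records recoverable problems instead of aborting the parse. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            record("warning", e);
        }

        @Override
        public void error(SAXParseException e) {
            // Recoverable: note it and let the parser carry on.
            record("error", e);
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXParseException {
            // A strict XML parser cannot recover from this, so rethrow.
            record("fatal", e);
            throw e;
        }

        private void record(String level, SAXParseException e) {
            problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    // Well-formed, but invalid against its own DTD (the <extra/> element is not allowed).
    static final String SAMPLE =
        "<!DOCTYPE html [\n"
        + "  <!ELEMENT html (body)>\n"
        + "  <!ELEMENT body (#PCDATA)>\n"
        + "]>\n"
        + "<html><body>hi</body><extra/></html>";

    static Document parse(String xml, CollectingErrorHandler handler) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // validity errors are reported as recoverable errors
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(handler);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        Document doc = parse(SAMPLE, handler);
        // The document was built even though problems were reported.
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
        for (String problem : handler.problems) {
            System.out.println(problem);
        }
    }
}
```

A spider would simply never look at handler.problems, while an example program that prints the list gets the noisy output mentioned above.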
> 
> arkin
> 
> >
> > Mike
> >
> > [EMAIL PROTECTED] wrote:
> > >
> > > Can anyone send us links to the Sun and IBM versions please?  Are these
> > > Java or C++ implementations?  I'm using the Java Tidy HTML parser
> > > because the OpenXML one is way too unforgiving of non-well-formed HTML.
> > > But I'd prefer to use something else (also Java) because Tidy's not
> > > built for speed.
> > >
> > > Thanks in advance,
> > >
> > > --Susan
> > >
> > >
> > > From:    Mike Pogue <[EMAIL PROTECTED]e.org>
> > > To:      [EMAIL PROTECTED]
> > > Subject: Re: HTML parsing
> > > Date:    03/13/00 11:41 AM
> > > Please respond to xerces-dev
> > >
> > >
> > >
> > > Note that we have a couple of people who would like to donate an
> > > HTML parser to xml.apache.org, to be added to Xerces.  The ones I know
> > > of are:
> > >
> > >      ExOffice (extremely well tested, used for web spiders),
> > >      Sun (I haven't seen it yet), and
> > >      IBM (I haven't seen it yet either).
> > >
> > > I suspect that if people are interested in this, we ought to have
> > > people look at all three, and figure out whether one is better, or
> > > whether they should be merged somehow before being checked in...
> > > assuming there's interest in this!
> > >
> > > Any volunteers?
> > >
> > > Mike
> > >
> > > Cox Andy wrote:
> > > >
> > > > If the HTML is not well-formed XML (which most is not), you are
> > > > correct.
> > > >
> > > > Andy
> > > >
> > > > | -----Original Message-----
> > > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > > | Sent: Monday, March 13, 2000 10:32 AM
> > > > | To: [EMAIL PROTECTED]
> > > > | Subject: HTML parsing
> > > > |
> > > > |
> > > > | From what I can tell, I cannot expect to be able to parse an HTML
> > > > | doc with the Xerces parser?  I was hoping to use the C++ SAX parser
> > > > | to find <IMG> tags but I don't think I will be able to do that.
> > > > | Can someone confirm this dreadful fact?
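
For reference, here is roughly what the SAX approach from the original question looks like when the input does happen to be well-formed. This is a sketch in Java using the JDK's built-in SAX parser rather than the C++ parser mentioned in the question, and the class and method names (ImgTagFinder, findImages) are invented for the example. On typical non-well-formed HTML the parse throws instead, as confirmed earlier in the thread.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects the src attribute of every <img> element.
class ImgTagFinder extends DefaultHandler {
    final List<String> sources = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // XML names are case-sensitive, so match IMG and img alike.
        if ("img".equalsIgnoreCase(qName)) {
            sources.add(attrs.getValue("src")); // null if the tag has no src attribute
        }
    }

    /** Parses well-formed markup and returns the image sources in document order. */
    static List<String> findImages(String xhtml) throws Exception {
        ImgTagFinder finder = new ImgTagFinder();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), finder);
        return finder.sources;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><img src=\"a.png\"/>"
            + "<p>text<IMG src=\"b.gif\"/></p></body></html>";
        System.out.println(findImages(page)); // prints [a.png, b.gif]
    }
}
```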

-- 
----------------------------------------------------------------------
Assaf Arkin                                           www.exoffice.com
CTO, Exoffice Technologies, Inc.                        www.exolab.org
