Re: HTML parsing

Mike Pogue 14 Mar 2000 01:10:49 -0000

Rajiv,

        Could you post a little information on the parser itself, such as
what it does and how much it has been tested (e.g. via a web spider)?


        I'm trying to get the same info from the IBM folks who are considering
going open source.  The ExOffice HTML parser (already open source) is the one 
in OpenXML.

Thanks, 
Mike            

Rajiv Mordani wrote:
> 
> The xhtml parser from Sun is an internal only version which will be made
> available for Apache as soon as the licensing issues are cleared.
> 
> - Rajiv
> 
> On Mon, 13 Mar 2000, Mike Pogue wrote:
> 
> > Note that we have a couple of people who would like to donate an
> > HTML parser to xml.apache.org, to be added to Xerces.  The ones I know of
> > are:
> >
> >       ExOffice (extremely well tested, used for web spiders),
> >       Sun (I haven't seen it yet), and
> >       IBM (I haven't seen it yet either).
> >
> > I suspect that if people are interested in this, we ought to have people
> > look at all three, and figure out whether one is better, or whether they
> > should be merged somehow before being checked in...assuming there's
> > interest in this!
> >
> > Any volunteers?
> >
> > Mike
> >
> > Cox Andy wrote:
> > >
> > > If the HTML is not well-formed XML (which most is not), you are correct.
> > >
> > > Andy
> > >
> > > | -----Original Message-----
> > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > | Sent: Monday, March 13, 2000 10:32 AM
> > > | To: [EMAIL PROTECTED]
> > > | Subject: HTML parsing
> > > |
> > > |
> > > | From what I can tell, I cannot expect to be able to parse an HTML doc
> > > | with the xerces parser?  I was hoping to use the C++ SAX parser to find
> > > | <IMG> tags but I don't think I will be able to do that.  Can someone
> > > | confirm this dreadful fact?
> >
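
For readers following the thread, the point Andy confirms (a strict XML/SAX
parser rejects HTML that is not well-formed XML) can be illustrated with a
rough sketch. This is not Xerces and not any of the parsers discussed above;
it assumes Python's standard xml.sax and html.parser modules as stand-ins for
a strict XML parser and a lenient HTML parser, and the sample document and the
names ImgFinder, parses_as_xml and find_images are made up for illustration.

```python
import xml.sax
from html.parser import HTMLParser

# Hypothetical sample: typical "tag soup" with an unclosed <br>, an unclosed
# <IMG>, and an unquoted attribute value. None of this is well-formed XML.
HTML_DOC = (
    '<html><body><p>Logo:<br>'
    '<IMG SRC="logo.gif"><img src=photo.jpg>'
    '</body></html>'
)


class ImgFinder(HTMLParser):
    """Lenient HTML parser that records the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def parses_as_xml(text):
    """Return True if a strict SAX parser accepts the text as XML."""
    try:
        xml.sax.parseString(text.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


def find_images(text):
    """Return the src values of all <img> tags found by the lenient parser."""
    finder = ImgFinder()
    finder.feed(text)
    finder.close()
    return finder.sources


if __name__ == "__main__":
    print("accepted by strict SAX parser:", parses_as_xml(HTML_DOC))
    print("images found by lenient parser:", find_images(HTML_DOC))
```

The strict parser gives up at the first well-formedness error, while the
lenient one still reports both image sources; that gap is what a dedicated
HTML parser would be expected to close.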
