:-) arkin
[EMAIL PROTECTED] wrote:
>
> Arkin--
>
> I'm sorry to have publicly maligned the OpenXML HTML parser! I clearly
> misunderstood the error reports to be fatal, while they were not. Will
> resume use of your product immediately!!
>
> --Susan
>
>
> Assaf Arkin <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> cc:
> Subject: Re: HTML parsing
> 03/13/00 05:38 PM
> Please respond to xerces-dev
>
>
> > The OpenXML one is the same as the one being contributed from ExOffice.
> > I'm surprised that it is unforgiving, because I know it's been used for
> > a lot of web spiders out there, and it would have to be pretty forgiving
> > to work for that application!
>
> Forgiving, but also reporting, which is kind of confusing at first.
>
> OpenXML uses auto correction, which means it finds the error, reports it,
> figures out how to get around it, and constructs a well-formed document
> in spite of it. If your document is really bad (as most HTML documents
> are), OpenXML will read it and will construct a valid document including
> all the elements/text content it finds, but will spit out a lot of errors
> along the way.
>
> Those using it for spiders simply ignore the errors. I actually list and
> print these errors in the examples, which can get confusing.
>
> arkin
>
> > Mike
> >
> > [EMAIL PROTECTED] wrote:
> > >
> > > Can anyone send us links to the Sun and IBM versions please? Are these
> > > Java or C++ implementations? I'm using the Java Tidy HTML parser because
> > > the OpenXML one is way too unforgiving of non-well-formed HTML. But I'd
> > > prefer to use something else (also Java) because Tidy's not built for
> > > speed.
> > >
> > > Thanks in advance,
> > >
> > > --Susan
> > >
> > >
> > > Mike Pogue <[EMAIL PROTECTED]>
> > > To: [EMAIL PROTECTED]
> > > cc:
> > > Subject: Re: HTML parsing
> > > 03/13/00 11:41 AM
> > > Please respond to xerces-dev
> > >
> > >
> > > Note that we have a couple of people who would like to donate an
> > > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
> > > are:
> > >
> > > ExOffice (extremely well tested, used for web spiders),
> > > Sun (I haven't seen it yet), and
> > > IBM (I haven't seen it yet either).
> > >
> > > I suspect that if people are interested in this, we ought to have people
> > > look at all three, and figure out whether one is better, or whether they
> > > should be merged somehow before being checked in...assuming there's
> > > interest in this!
> > >
> > > Any volunteers?
> > >
> > > Mike
> > >
> > > Cox Andy wrote:
> > > >
> > > > If the HTML is not well-formed XML (which most is not), you are correct.
> > > >
> > > > Andy
> > > >
> > > > | -----Original Message-----
> > > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > > | Sent: Monday, March 13, 2000 10:32 AM
> > > > | To: [EMAIL PROTECTED]
> > > > | Subject: HTML parsing
> > > > |
> > > > |
> > > > | From what I can tell, I cannot expect to be able to parse an HTML doc
> > > > | with the xerces parser? I was hoping to use the C++ SAX parser to find
> > > > | <IMG> tags but I don't think I will be able to do that. Can someone
> > > > | confirm this dreadful fact?

--
----------------------------------------------------------------------
Assaf Arkin                                           www.exoffice.com
CTO, Exoffice Technologies, Inc.                        www.exolab.org
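
For anyone reading this thread later, here is a rough sketch of the "forgiving, but also reporting" idea described above, applied to the original question of finding <IMG> tags in HTML that is not well-formed. It is not OpenXML, Xerces, or Tidy code. It assumes Python's standard html.parser module as a stand-in lenient parser, and the names ImgCollector, find_images, and VOID_TAGS are made up for illustration. Problems are recorded in a list and parsing carries on, so a caller such as a spider can simply ignore them.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are not tracked as "open".
# (Illustrative subset, not a complete list.)
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class ImgCollector(HTMLParser):
    """Collects <img> sources and notes problems instead of aborting."""

    def __init__(self):
        super().__init__()
        self.sources = []    # src value of every <img> seen
        self.problems = []   # human-readable notes; nothing is ever raised
        self.open_tags = []  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is None:
                self.problems.append("<img> without src")
            else:
                self.sources.append(src)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. the implicit end tag of <img ... />
        if tag in self.open_tags:
            # Report, then close, any elements left open inside this one.
            while self.open_tags[-1] != tag:
                inner = self.open_tags.pop()
                self.problems.append(f"<{inner}> not closed before </{tag}>")
            self.open_tags.pop()
        else:
            self.problems.append(f"</{tag}> has no matching start tag")


def find_images(html):
    """Return (image sources, list of reported problems) for an HTML string."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    for tag in reversed(parser.open_tags):
        parser.problems.append(f"<{tag}> never closed")
    return parser.sources, parser.problems


sources, problems = find_images(
    '<p>One<IMG SRC="a.gif"><b>two<img src=b.png></p></i>'
)
print(sources)
for note in problems:
    print("report:", note)
```

The sample input has an unclosed <b> and a stray </i>, yet both image sources are still collected. The reports are informational only, which is the same distinction between reported and fatal errors that caused the confusion earlier in the thread.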