Re: HTML parsing

2000-03-24 Thread Mike Pogue
The IBM HTML parser code isn't public. I talked to the IBM group who wrote it, and it's in Java, and I think it does less than what Sun and ExOffice have. So, I don't think it's an option. When we get the HTML parser into the Java code base, it would be great to get it ported to work with the

Re: HTML parsing

2000-03-23 Thread Michael Mason
There was lots of discussion on this, but not really much of a conclusion. We're currently using Tidy as a preprocessor for nasty random HTML pages, but it seems to be overkill. It does lots of stuff so we get perfect HTML out the other end, rather than just creating something that's well formed.

Re: HTML parsing

2000-03-20 Thread Mike Pogue
's stuff is Java based - a lot of > people on this list are working in C++. Whatever path is followed, I want > to voice my enthusiasm for C++ support as early as possible. > > Ed > > -Original Message- > From: Pierpaolo Fumagalli [mailto:[EMAIL PROTECTED] > S

RE: HTML parsing

2000-03-18 Thread Ed Draper
iday, March 17, 2000 12:59 PM To: [EMAIL PROTECTED] Subject: Re: HTML parsing Mike Pogue wrote: > > I haven't seen the Sun one, though, so I think we should take a look > before we start checking in the OpenXML one. Let's look at all the > possibilities, before we choose o

Re: HTML parsing

2000-03-18 Thread Arkin
Pierpaolo Fumagalli wrote: > > Mike Pogue wrote: > > > > Note that we have a couple of people who would like to donate an > > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of > > are: > > > > ExOffice (extremely well tested, used for web spiders), > > Sun (

Re: HTML parsing

2000-03-17 Thread Pierpaolo Fumagalli
Mike Pogue wrote: > > I haven't seen the Sun one, though, so I think we should take a look > before we start checking in the OpenXML one. Let's look at all the > possibilities, before we choose one. Sounds good to me... Less stuff to do if we decide to go w/ Sun :) Pier -- ---

Re: HTML parsing

2000-03-17 Thread Mike Pogue
I propose we wait and see what Sun has, before we pull the OpenXML one in. After talking with the IBM folks in Tokyo, I suspect that the IBM HTML parser will *not* be suitable for our needs (at this point, I think OpenXML is better, because it supports HTML 4.0, instead of 3.2, and it handles th

Re: HTML parsing

2000-03-17 Thread Pierpaolo Fumagalli
Mike Pogue wrote: > > Note that we have a couple of people who would like to donate an > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of > are: > > ExOffice (extremely well tested, used for web spiders), > Sun (I haven't seen it yet), and > IBM (I

Re: HTML parsing

2000-03-14 Thread Edwin Goei
Wong Kok Wai wrote: > > There is a HTML parser in Swing/JFC. It is also event-based but not SAX. > Another bad news is the > parsed object tree is not DOM-based. > > Rajiv Mordani wrote: > > > The xhtml parser from Sun is an internal only version which will be made > > available for Apache as s

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
> Assaf Arkin > <[EMAIL PROTECTED]To: [EMAIL PROTECTED] > ce.com> cc: > Subject: Re: HTML parsing > 03/13/00 >

Re: HTML parsing

2000-03-14 Thread Wong Kok Wai
That's a lot of baggage! Not to mention the Swing parser only supports HTML 3.2, if I remember correctly. Rajiv Mordani wrote: > Well, the XHTML parser is in fact just a small handler that builds on the > Swing HTML parser using SAX events. > > - Rajiv > >

Re: HTML parsing

2000-03-14 Thread Rajiv Mordani
Well, the XHTML parser is in fact just a small handler that builds on the Swing HTML parser using SAX events. - Rajiv On Tue, 14 Mar 2000, Wong Kok Wai wrote: > There is an HTML parser in Swing/JFC. It is also event-based, but not SAX. > More bad news: the > parsed object tree is not DOM-based

Re: HTML parsing

2000-03-14 Thread Wong Kok Wai
There is an HTML parser in Swing/JFC. It is also event-based, but not SAX. More bad news: the parsed object tree is not DOM-based. Rajiv Mordani wrote: > The xhtml parser from Sun is an internal-only version which will be made > available for Apache as soon as the licensing issues are cleared
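
For readers unfamiliar with the Swing parser mentioned above, here is a minimal sketch of its event-based (non-SAX) callback style, assuming the standard javax.swing.text.html classes are available. The class name SwingImgCollector and the method collect are hypothetical names chosen for illustration; this is not the Sun handler discussed in the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects IMG src values using the Swing HTML parser.
public class SwingImgCollector {

    public static List<String> collect(String html) throws IOException {
        List<String> sources = new ArrayList<>();

        // The Swing parser reports events through ParserCallback, not SAX.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
                // IMG has no end tag in HTML, so it arrives as a "simple" tag.
                if (tag == HTML.Tag.IMG) {
                    Object src = atts.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };

        // Third argument: ignore any charset declared inside the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sources;
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy HTML: unclosed <p> and no </html>.
        System.out.println(collect("<html><body><p>Hello<img src=\"a.png\"><p>World</body>"));
    }
}
```

Because the callback receives HTML.Tag and MutableAttributeSet objects rather than SAX events or DOM nodes, a small adapter layer would be needed to feed this into SAX- or DOM-based code.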

Re: HTML parsing

2000-03-14 Thread susan_levine

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
t; > prefer to use something else (also java) because Tidy's not built for > > speed. > > > > Thanks in advance, > > > > --Susan > > > > > > Mike Pogue > > <[EMAIL PROTECTED] To: [E

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
though because my sample HTML doc is not well-formed -- > certain end tags are left out (which is acceptable in HTML land). > > -Heather > > -Original Message- > From: Ward D. Cannon [mailto:[EMAIL PROTECTED] > Sent: Monday, March 13, 2000 10:57 AM > To: [EMAIL P

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
e the DOM parser needs proper end tags, etc. the > > SAX parser does also I was assuming that with the SAX parser I could > > simply handle startElement() and grab all attributes associated with IMG -- > > this doesn't work though because my sample HTML doc is not

Re: HTML parsing

2000-03-14 Thread Mike Pogue
Rajiv, Could you post a little information on the parser itself? Like what does it do, and how much has it been tested (e.g. via web spider)? I'm trying to get the same info from the IBM folks who are considering going open source. The ExOffice HTML parser (already open source)

Re: HTML parsing

2000-03-14 Thread Rajiv Mordani
The xhtml parser from Sun is an internal only version which will be made available for Apache as soon as the licensing issues are cleared. - Rajiv On Mon, 13 Mar 2000, Mike Pogue wrote: > Note that we have a couple of people who would like to donate an > HTML parser to xml.apache.org, to be add

Re: HTML parsing

2000-03-13 Thread Mike Pogue
ain end tags are left out (which is acceptable in HTML land). > > -Heather > > -Original Message- > From: Ward D. Cannon [mailto:[EMAIL PROTECTED] > Sent: Monday, March 13, 2000 10:57 AM > To: [EMAIL PROTECTED] > Subject: RE: HTML parsing > > Well, I hope it

RE: HTML parsing

2000-03-13 Thread heather_matthews
-- From: Ward D. Cannon [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2000 10:57 AM To: [EMAIL PROTECTED] Subject: RE: HTML parsing Well, I hope it can be done. Couldn't you just trap elements that contain the tag IMG as you parse the Instance? You know like using startElement and EndEleme

Re: HTML parsing

2000-03-13 Thread Mike Pogue
not built for > speed. > > Thanks in advance, > > --Susan > > > Mike Pogue > <[EMAIL PROTECTED]To: [EMAIL PROTECTED] > e.org> cc: >

Re: HTML parsing

2000-03-13 Thread susan_levine

Re: HTML parsing

2000-03-13 Thread Mike Pogue
Note that we have a couple of people who would like to donate an HTML parser to xml.apache.org, to be added to Xerces. The ones I know of are: ExOffice (extremely well tested, used for web spiders), Sun (I haven't seen it yet), and IBM (I haven't seen it yet either).

Re: HTML parsing

2000-03-13 Thread Dmitry Volpyansky
Heather, You can look into the HTML Tidy utility: http://www.w3.org/People/Raggett/tidy/ Partially supported by HP, BTW. Thanks, Dmitry Volpyansky - Original Message - From: "Cox Andy" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, March 13, 2000 1:0

RE: HTML parsing

2000-03-13 Thread Cox Andy
If the HTML is not well-formed XML (which most is not), you are correct. Andy | -Original Message- | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] | Sent: Monday, March 13, 2000 10:32 AM | To: [EMAIL PROTECTED] | Subject: HTML parsing | | | For what I can tell, I cannot expect to be a

RE: HTML parsing

2000-03-13 Thread Ward D. Cannon
Well, I hope it can be done. Couldn't you just trap IMG elements as you parse the instance? You know, by using startElement and endElement. I would be blown away if the SAX parser couldn't handle this. Regards, Ward -Original Message- From: [EMAIL PROTECTED] [mailto
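
The startElement approach suggested above can be sketched as follows, assuming a JAXP-style SAX parser is on the classpath. The class name ImgCollector and the method collect are hypothetical names for illustration. The sketch also shows the limitation raised elsewhere in the thread: the same handler fails once an end tag is missing, because a SAX parser requires well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects IMG src values from well-formed markup via SAX.
public class ImgCollector {

    public static List<String> collect(String markup) throws Exception {
        List<String> sources = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Trap IMG elements as they start and grab the src attribute.
                if ("img".equalsIgnoreCase(qName)) {
                    String src = atts.getValue("src");
                    if (src != null) {
                        sources.add(src);
                    }
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        // Well-formed input: the IMG element is explicitly closed.
        System.out.println(collect("<html><body><img src=\"a.png\"/></body></html>"));

        // Typical HTML: IMG has no end tag, so the parser reports a fatal error.
        try {
            collect("<html><body><img src=\"a.png\"></body></html>");
        } catch (SAXException e) {
            System.out.println("not well-formed: " + e.getMessage());
        }
    }
}
```

This is why the thread turns to preprocessors such as Tidy or to dedicated HTML parsers: the handler logic itself is simple, but the input has to be made well-formed before a SAX parser will accept it.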