Re: HTML parsing

2000-03-24 Thread Mike Pogue
The IBM HTML parser code isn't public. I talked to the IBM group who wrote it, and it's in Java, and I think it does less than what Sun and ExOffice have. So, I don't think it's an option. When we get the HTML parser into the Java code base, it would be great to get it ported to work with the

Re: HTML parsing

2000-03-23 Thread Michael Mason
There was lots of discussion on this, but not really much of a conclusion. We're currently using Tidy as a preprocessor for nasty random HTML pages, but it seems to be overkill. It does lots of stuff so we get perfect HTML out the other end, rather than just creating something that's well formed.

Re: HTML parsing

2000-03-20 Thread Mike Pogue
's stuff is Java based - a lot of > people on this list are working in C++. Whatever path is followed, I want > to voice my enthusiasm for C++ support as early as possible. > > Ed > > -Original Message- > From: Pierpaolo Fumagalli [mailto:[EMAIL PROTECTED] > S

RE: HTML parsing

2000-03-18 Thread Ed Draper
iday, March 17, 2000 12:59 PM To: [EMAIL PROTECTED] Subject: Re: HTML parsing Mike Pogue wrote: > > I haven't seen the Sun one, though, so I think we should take a look > before we start checking in the OpenXML one. Let's look at all the > possibilities, before we choose o

Re: HTML parsing

2000-03-18 Thread Arkin
Pierpaolo Fumagalli wrote: > > Mike Pogue wrote: > > > > Note that we have a couple of people who would like to donate an > > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of > > are: > > > > ExOffice (extremely well tested, used for web spiders), > > Sun (

Re: HTML parsing

2000-03-17 Thread Pierpaolo Fumagalli
Mike Pogue wrote: > > I haven't seen the Sun one, though, so I think we should take a look > before we start checking in the OpenXML one. Let's look at all the > possibilities, before we choose one. Sounds good to me... Less stuff to do if we decide to go w/ Sun :) Pier -- ---

Re: HTML parsing

2000-03-17 Thread Mike Pogue
I propose we wait and see what Sun has, before we pull the OpenXML one in. After talking with the IBM folks in Tokyo, I suspect that the IBM HTML parser will *not* be suitable for our needs (at this point, I think OpenXML is better, because it supports HTML 4.0, instead of 3.2, and it handles th

Re: HTML parsing

2000-03-17 Thread Pierpaolo Fumagalli
Mike Pogue wrote: > > Note that we have a couple of people who would like to donate an > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of > are: > > ExOffice (extremely well tested, used for web spiders), > Sun (I haven't seen it yet), and > IBM (I

Re: HTML parsing

2000-03-14 Thread Edwin Goei
Wong Kok Wai wrote: > > There is an HTML parser in Swing/JFC. It is also event-based but not SAX. > The other bad news is that the > parsed object tree is not DOM-based. > > Rajiv Mordani wrote: > > > The xhtml parser from Sun is an internal only version which will be made > > available for Apache as s

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
> Assaf Arkin > <[EMAIL PROTECTED]To: [EMAIL PROTECTED] > ce.com> cc: > Subject: Re: HTML parsing > 03/13/00 >

Re: HTML parsing

2000-03-14 Thread Wong Kok Wai
That's a lot of baggage! Not to mention the Swing parser only supports HTML 3.2, if I remember correctly. Rajiv Mordani wrote: > Well, the xhtml parser is in fact just a small handler that builds on the > swing html parser using sax events. > > - Rajiv > >

Re: HTML parsing

2000-03-14 Thread Rajiv Mordani
Well, the xhtml parser is in fact just a small handler that builds on the swing html parser using sax events. - Rajiv On Tue, 14 Mar 2000, Wong Kok Wai wrote: > There is an HTML parser in Swing/JFC. It is also event-based but not SAX. > The other bad news is that the > parsed object tree is not DOM-based

Re: HTML parsing

2000-03-14 Thread Wong Kok Wai
There is an HTML parser in Swing/JFC. It is also event-based but not SAX. The other bad news is that the parsed object tree is not DOM-based. Rajiv Mordani wrote: > The xhtml parser from Sun is an internal only version which will be made > available for Apache as soon as the licensing issues are cleared
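
For readers unfamiliar with the Swing/JFC parser mentioned above, here is a minimal sketch of its callback style. It is only an illustration: the class name SwingImgFinder and the helper findImgSources are made up for this example, and it uses the parser that ships with the JDK today, which may differ from what was available in 2000. The callbacks are Swing's own (HTMLEditorKit.ParserCallback), not SAX's ContentHandler, which is the "event-based but not SAX" point.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this sketch only.
class SwingImgFinder {

    // Collects the src attribute of every IMG tag the Swing parser reports.
    static List<String> findImgSources(String html) throws IOException {
        List<String> sources = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // IMG has no end tag, so it is reported as a "simple" tag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet atts, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = atts.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        sources.add(src.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sources;
    }

    public static void main(String[] args) throws IOException {
        // End tags for P are left out on purpose, as is common in real HTML.
        String page = "<html><body><p>One<img src=\"a.png\"><p>Two<img src=\"b.png\"></body></html>";
        System.out.println(findImgSources(page));
    }
}
```

Note that the input does not need to be well formed, which is the difference from an XML parser that the thread keeps coming back to.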

Re: HTML parsing

2000-03-14 Thread susan_levine
ce.com> cc: Subject: Re: HTML parsing

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
t; > prefer to use something else (also java) because Tidy's not built for > > speed. > > > > Thanks in advance, > > > > --Susan > > > > > > Mike Pogue > > <[EMAIL PROTECTED] To: [E

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
though because my sample HTML doc is not well-formed -- > certain end tags are left out (which is acceptable in HTML land). > > -Heather > > -Original Message- > From: Ward D. Cannon [mailto:[EMAIL PROTECTED] > Sent: Monday, March 13, 2000 10:57 AM > To: [EMAIL P

Re: HTML parsing

2000-03-14 Thread Assaf Arkin
I see a lot of interest in HTML parsing recently :-) A short explanation. XML, whether in the form of DOM or SAX events, must always be well formed, that means elements must always be closed. An HTML parser will report a well formed stream of SAX events, or a DOM document (always well formed

Re: HTML parsing

2000-03-14 Thread Mike Pogue
> > > Andy > > > > > > | -Original Message- > > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > > > | Sent: Monday, March 13, 2000 10:32 AM > > > | To: [EMAIL PROTECTED] > > > | Subject: HTML parsing > > > | >

Re: HTML parsing

2000-03-14 Thread Rajiv Mordani
the HTML is not well-formed XML (which most is not), you are correct. > > > > Andy > > > > | -Original Message- > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > > | Sent: Monday, March 13, 2000 10:32 AM > > | To: [EMAIL PROTECTED] > >

Re: HTML parsing

2000-03-13 Thread Mike Pogue
ain end tags are left out (which is acceptable in HTML land). > > -Heather > > -Original Message- > From: Ward D. Cannon [mailto:[EMAIL PROTECTED] > Sent: Monday, March 13, 2000 10:57 AM > To: [EMAIL PROTECTED] > Subject: RE: HTML parsing > > Well, I hope it

RE: HTML parsing

2000-03-13 Thread heather_matthews
-- From: Ward D. Cannon [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2000 10:57 AM To: [EMAIL PROTECTED] Subject: RE: HTML parsing Well, I hope it can be done. Couldn't you just trap elements that contain the tag IMG as you parse the Instance? You know like using startElement and EndEleme
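
The startElement idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: it is written in Java against the JDK's bundled JAXP SAX parser rather than the Xerces C++ parser the original question asked about, and the names ImgTagFinder and findImgSources are invented for the example. It also shows the limitation raised elsewhere in the thread: the handler only ever sees IMG elements when the markup is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this sketch only.
class ImgTagFinder {

    // Returns the src attribute of every img element in well-formed markup.
    // Markup that is not well formed makes the parser throw a SAXException.
    static List<String> findImgSources(String markup) throws Exception {
        List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The default factory is not namespace-aware, so match on qName.
                if ("img".equalsIgnoreCase(qName)) {
                    sources.add(atts.getValue("src"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        // Well formed: the img element is explicitly closed.
        System.out.println(findImgSources(
                "<html><body><p><img src=\"logo.png\"/></p></body></html>"));
        try {
            // Typical HTML: img and p are never closed, so the XML parser gives up.
            findImgSources("<html><body><p><img src=\"logo.png\"></body></html>");
        } catch (SAXException e) {
            System.out.println("Not well-formed: " + e.getMessage());
        }
    }
}
```

The second call fails at the first mismatched end tag, so pages with omitted end tags need an HTML-aware parser or a cleanup pass (Tidy, as mentioned in the thread) first.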

Re: HTML parsing

2000-03-13 Thread Mike Pogue
not built for > speed. > > Thanks in advance, > > --Susan > > > Mike Pogue > <[EMAIL PROTECTED]To: [EMAIL PROTECTED] > e.org> cc: >

Re: HTML parsing

2000-03-13 Thread susan_levine
<[EMAIL PROTECTED]To: [EMAIL PROTECTED] e.org> cc: Subject:

Re: HTML parsing

2000-03-13 Thread Mike Pogue
> If the HTML is not well-formed XML (which most is not), you are correct. > > Andy > > | -Original Message- > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > | Sent: Monday, March 13, 2000 10:32 AM > | To: [EMAIL PROTECTED] > | Subject: HTML parsing > | >

Re: HTML parsing

2000-03-13 Thread Dmitry Volpyansky
7 PM Subject: RE: HTML parsing > If the HTML is not well-formed XML (which most is not), you are correct. > > Andy > > | -Original Message- > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > | Sent: Monday, March 13, 2000 10:32 AM > | To: [EMAIL PROTECTED] > | Subj

RE: HTML parsing

2000-03-13 Thread Cox Andy
If the HTML is not well-formed XML (which most is not), you are correct. Andy | -Original Message- | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] | Sent: Monday, March 13, 2000 10:32 AM | To: [EMAIL PROTECTED] | Subject: HTML parsing | | | From what I can tell, I cannot expect to be

RE: HTML parsing

2000-03-13 Thread Ward D. Cannon
PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Monday, March 13, 2000 10:32 AM To: [EMAIL PROTECTED] Subject: HTML parsing From what I can tell, I cannot expect to be able to parse an HTML doc with the xerces parser? I was hoping to use the C++ SAX parser to find tags but I don't think I will be able t

HTML parsing

2000-03-13 Thread heather_matthews
From what I can tell, I cannot expect to be able to parse an HTML doc with the xerces parser? I was hoping to use the C++ SAX parser to find tags but I don't think I will be able to do that. Can someone confirm this dreadful fact? Thanks, Heather Matthews