The IBM HTML parser code isn't public. I talked to the IBM group who wrote it,
and it's in Java, and I think it does less than what Sun and ExOffice have.
So, I don't think it's an option. When we get the HTML parser into the Java
code base, it would be great to get it ported to work with the
There was lots of discussion on this, but not really much of a
conclusion. We're currently using Tidy as a preprocessor for nasty
random HTML pages, but it seems to be overkill. It does lots of stuff so
we get perfect HTML out the other end, rather than just creating
something that's well formed.
's stuff is Java based - a lot of
> people on this list are working in C++. Whatever path is followed, I want
> to voice my enthusiasm for C++ support as early as possible.
>
> Ed
>
> -Original Message-
> From: Pierpaolo Fumagalli [mailto:[EMAIL PROTECTED]
> S
iday, March 17, 2000 12:59 PM
To: [EMAIL PROTECTED]
Subject: Re: HTML parsing
Mike Pogue wrote:
>
> I haven't seen the Sun one, though, so I think we should take a look
> before we start checking in the OpenXML one. Let's look at all the
> possibilities, before we choose o
Pierpaolo Fumagalli wrote:
>
> Mike Pogue wrote:
> >
> > Note that we have a couple of people who would like to donate an
> > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
> > are:
> >
> > ExOffice (extremely well tested, used for web spiders),
> > Sun (
Mike Pogue wrote:
>
> I haven't seen the Sun one, though, so I think we should take a look
> before we start checking in the OpenXML one. Let's look at all the
> possibilities, before we choose one.
Sounds good to me... Less stuff to do if we decide to go w/ Sun :)
Pier
--
---
I propose we wait and see what Sun has, before we pull the OpenXML one in.
After talking with the IBM folks in Tokyo, I suspect that the IBM HTML parser
will *not* be suitable for our needs (at this point, I think OpenXML is
better, because it supports HTML 4.0, instead of 3.2, and it handles th
Mike Pogue wrote:
>
> Note that we have a couple of people who would like to donate an
> HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
> are:
>
> ExOffice (extremely well tested, used for web spiders),
> Sun (I haven't seen it yet), and
> IBM (I
Wong Kok Wai wrote:
>
> There is an HTML parser in Swing/JFC. It is also event-based, but not SAX.
> More bad news: the parsed object tree is not DOM-based.
>
> Rajiv Mordani wrote:
>
> > The xhtml parser from Sun is an internal only version which will be made
> > available for Apache as s
> Assaf Arkin
> <[EMAIL PROTECTED]To: [EMAIL PROTECTED]
> ce.com> cc:
> Subject: Re: HTML parsing
> 03/13/00
>
That's a lot of baggage! Not to mention the Swing parser only supports HTML
3.2, if I remember correctly.
Rajiv Mordani wrote:
> Well, the xhtml parser is in fact just a small handler that builds on the
> Swing HTML parser using SAX events.
>
> - Rajiv
>
>
Well, the xhtml parser is in fact just a small handler that builds on the
Swing HTML parser using SAX events.
- Rajiv
On Tue, 14 Mar 2000, Wong Kok Wai wrote:
> There is an HTML parser in Swing/JFC. It is also event-based, but not SAX.
> More bad news: the parsed object tree is not DOM-based
There is an HTML parser in Swing/JFC. It is also event-based, but not SAX.
More bad news: the parsed object tree is not DOM-based.
Rajiv Mordani wrote:
> The xhtml parser from Sun is an internal only version which will be made
> available for Apache as soon as the licensing issues are cleared
ce.com> cc:
Subject: Re: HTML parsing
> > prefer to use something else (also java) because Tidy's not built for
> > speed.
> >
> > Thanks in advance,
> >
> > --Susan
> >
> >
> > Mike Pogue
> > <[EMAIL PROTECTED] To: [E
though because my sample HTML doc is not well-formed --
> certain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -Original Message-
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL P
e the DOM parser needs proper end tags, etc. the
> > SAX parser does also. I was assuming that with the SAX parser I could
> > simply handle startElement() and grab all attributes associated with IMG --
> > this doesn't work though because my sample HTML doc is not
Rajiv,
Could you post a little information on the parser itself? Like
what does it do, and how much has it been tested (e.g. via web spider)?
I'm trying to get the same info from the IBM folks who are considering
going open source. The ExOffice HTML parser (already open source)
The xhtml parser from Sun is an internal only version which will be made
available for Apache as soon as the licensing issues are cleared.
- Rajiv
On Mon, 13 Mar 2000, Mike Pogue wrote:
> Note that we have a couple of people who would like to donate an
> HTML parser to xml.apache.org, to be add
ain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -Original Message-
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL PROTECTED]
> Subject: RE: HTML parsing
>
> Well, I hope it
--
From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2000 10:57 AM
To: [EMAIL PROTECTED]
Subject: RE: HTML parsing
Well, I hope it can be done. Couldn't you just trap elements that contain
the tag IMG as you parse the Instance? You know like using startElement and
EndEleme
not built for
> speed.
>
> Thanks in advance,
>
> --Susan
>
>
> Mike Pogue <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> cc:
>
<[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
cc:
Subject:
Note that we have a couple of people who would like to donate an
HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
are:
ExOffice (extremely well tested, used for web spiders),
Sun (I haven't seen it yet), and
IBM (I haven't seen it yet either).
Heather,
You can look into the HTML Tidy utility: http://www.w3.org/People/Raggett/tidy/
It is partially supported by HP, BTW.
Thanks,
Dmitry Volpyansky
- Original Message -
From: "Cox Andy" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, March 13, 2000 1:0
If the HTML is not well-formed XML (which most is not), you are correct.
Andy
| -Original Message-
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
| Sent: Monday, March 13, 2000 10:32 AM
| To: [EMAIL PROTECTED]
| Subject: HTML parsing
|
|
| From what I can tell, I cannot expect to be a
Well, I hope it can be done. Couldn't you just trap the IMG elements as you
parse the instance? You know, like using startElement and endElement. I would
be blown away if the SAX parser couldn't handle this.
Regards,
Ward
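The startElement approach suggested above can be sketched in Java as follows.
This is a minimal illustration, not code from the thread: it assumes the JAXP
SAX API bundled with a current JDK (javax.xml.parsers, org.xml.sax), and the
class name ImgCollector and method collectImageSources are hypothetical. It
only works when the input is well-formed XML; for HTML with omitted end tags
the parser raises a SAXParseException, which is the limitation Heather and
Andy describe.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the src attribute of every IMG element.
class ImgCollector {

    /** Returns the src values of all img elements, in document order. */
    static List<String> collectImageSources(String xhtml) throws Exception {
        final List<String> sources = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // XML names are case-sensitive, so accept both img and IMG.
                if ("img".equalsIgnoreCase(qName)) {
                    String src = attrs.getValue("src");
                    if (src != null) {
                        sources.add(src);
                    }
                }
            }
        };

        // The default factory is not namespace-aware, so qName carries the tag name.
        // Input that is not well-formed makes parse() throw a SAXParseException.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return sources;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p><img src=\"logo.png\" alt=\"logo\"/></p>"
                + "<IMG src=\"photo.gif\"/></body></html>";
        System.out.println(collectImageSources(page));
    }
}
```

Pages that are not well-formed would first need to go through a cleanup step
such as Tidy, or through one of the HTML parsers discussed in this thread.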
-Original Message-
From: [EMAIL PROTECTED] [mailto