The IBM HTML parser code isn't public. I talked to the IBM group who wrote it,
and it's in Java, and I think it does less than what Sun and ExOffice have.
So, I don't think it's an option. When we get the HTML parser into the Java
code base, it would be great to get it ported to work with the
There was lots of discussion on this, but not really much of a
conclusion. We're currently using Tidy as a preprocessor for nasty
random HTML pages, but it seems to be overkill. It does lots of stuff so
we get perfect HTML out the other end, rather than just creating
something that's well formed.
's stuff is Java based - a lot of
> people on this list are working in C++. Whatever path is followed, I want
> to voice my enthusiasm for C++ support as early as possible.
>
> Ed
>
> -Original Message-
> From: Pierpaolo Fumagalli [mailto:[EMAIL PROTECTED]
> S
iday, March 17, 2000 12:59 PM
To: [EMAIL PROTECTED]
Subject: Re: HTML parsing
Mike Pogue wrote:
>
> I haven't seen the Sun one, though, so I think we should take a look
> before we start checking in the OpenXML one. Let's look at all the
> possibilities, before we choose o
Pierpaolo Fumagalli wrote:
>
> Mike Pogue wrote:
> >
> > Note that we have a couple of people who would like to donate an
> > HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
> > are:
> >
> > ExOffice (extremely well tested, used for web spiders),
> > Sun (
Mike Pogue wrote:
>
> I haven't seen the Sun one, though, so I think we should take a look
> before we start checking in the OpenXML one. Let's look at all the
> possibilities, before we choose one.
Sounds good to me... Less stuff to do if we decide to go w/ Sun :)
Pier
--
---
I propose we wait and see what Sun has, before we pull the OpenXML one in.
After talking with the IBM folks in Tokyo, I suspect that the IBM HTML parser
will *not* be suitable for our needs (at this point, I think OpenXML is
better, because it supports HTML 4.0, instead of 3.2, and it handles th
Mike Pogue wrote:
>
> Note that we have a couple of people who would like to donate an
> HTML parser to xml.apache.org, to be added to Xerces. The ones I know of
> are:
>
> ExOffice (extremely well tested, used for web spiders),
> Sun (I haven't seen it yet), and
> IBM (I
Wong Kok Wai wrote:
>
> There is an HTML parser in Swing/JFC. It is also event-based but not SAX.
> More bad news: the parsed object tree is not DOM-based.
>
> Rajiv Mordani wrote:
>
> > The xhtml parser from Sun is an internal only version which will be made
> > available for Apache as s
> Assaf Arkin
> <[EMAIL PROTECTED]To: [EMAIL PROTECTED]
> ce.com> cc:
> Subject: Re: HTML parsing
> 03/13/00
>
That's a lot of baggage! Not to mention the Swing parser only supports HTML
3.2, if I remember correctly.
Rajiv Mordani wrote:
> Well, the xhtml parser is in fact just a small handler that builds on the
> Swing HTML parser using SAX events.
>
> - Rajiv
>
>
Well, the xhtml parser is in fact just a small handler that builds on the
Swing HTML parser using SAX events.
- Rajiv
On Tue, 14 Mar 2000, Wong Kok Wai wrote:
> There is an HTML parser in Swing/JFC. It is also event-based but not SAX.
> More bad news: the parsed object tree is not DOM-based
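The "small handler" idea described above can be sketched roughly as follows. This is not Sun's xhtml parser, which was not public at the time of this thread; it is a minimal illustration that assumes the JDK's javax.swing.text.html.parser.ParserDelegator as the Swing HTML parser. The class name SwingSaxBridge and the helper method events are hypothetical names made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical bridge: forwards Swing HTML parser callbacks as SAX events. */
class SwingSaxBridge extends HTMLEditorKit.ParserCallback {
    private final ContentHandler out;

    SwingSaxBridge(ContentHandler out) { this.out = out; }

    /** Copies real HTML attributes, skipping the parser's internal marker keys. */
    private static Attributes toSax(MutableAttributeSet a) {
        AttributesImpl atts = new AttributesImpl();
        for (Enumeration<?> e = a.getAttributeNames(); e.hasMoreElements();) {
            Object key = e.nextElement();
            if (key instanceof HTML.Attribute) {
                atts.addAttribute("", key.toString(), key.toString(), "CDATA",
                                  String.valueOf(a.getAttribute(key)));
            }
        }
        return atts;
    }

    @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        try { out.startElement("", t.toString(), t.toString(), toSax(a)); }
        catch (SAXException e) { throw new IllegalStateException(e); }
    }

    @Override public void handleEndTag(HTML.Tag t, int pos) {
        try { out.endElement("", t.toString(), t.toString()); }
        catch (SAXException e) { throw new IllegalStateException(e); }
    }

    @Override public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        handleStartTag(t, a, pos);   // empty elements such as img get a start ...
        handleEndTag(t, pos);        // ... immediately followed by an end
    }

    @Override public void handleText(char[] data, int pos) {
        try { out.characters(data, 0, data.length); }
        catch (SAXException e) { throw new IllegalStateException(e); }
    }

    /** Parses HTML and returns a log of the element events that were reported. */
    static List<String> events(String html) {
        List<String> log = new ArrayList<>();
        DefaultHandler recorder = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q, Attributes a) {
                String src = a.getValue("src");
                log.add("start:" + q + (src == null ? "" : " src=" + src));
            }
            @Override public void endElement(String u, String l, String q) {
                log.add("end:" + q);
            }
        };
        try {
            recorder.startDocument();
            new ParserDelegator().parse(new StringReader(html),
                                        new SwingSaxBridge(recorder), true);
            recorder.endDocument();
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }
}
```

Calling SwingSaxBridge.events("<p>Logo: <img src=\"logo.gif\"><p>Second") should report the img element as a start/end pair even though the input has no closing tags. Any SAX ContentHandler could be plugged in where the recorder is used here.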
There is an HTML parser in Swing/JFC. It is also event-based but not SAX.
More bad news: the parsed object tree is not DOM-based.
Rajiv Mordani wrote:
> The xhtml parser from Sun is an internal only version which will be made
> available for Apache as soon as the licensing issues are cleared
ce.com> cc:
Subject: Re: HTML parsing
> > prefer to use something else (also java) because Tidy's not built for
> > speed.
> >
> > Thanks in advance,
> >
> > --Susan
> >
> >
> > Mike Pogue
> > <[EMAIL PROTECTED] To: [E
though because my sample HTML doc is not well-formed --
> certain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -Original Message-
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL P
I see a lot of interest in HTML parsing recently :-)
A short explanation. XML, whether in the form of DOM or SAX events, must
always be well formed, that means elements must always be closed. An
HTML parser will report a well formed stream of SAX events, or a DOM
document (always well formed
> > > Andy
> > >
> > > | -Original Message-
> > > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > > | Sent: Monday, March 13, 2000 10:32 AM
> > > | To: [EMAIL PROTECTED]
> > > | Subject: HTML parsing
> > > |
>
the HTML is not well-formed XML (which most is not), you are correct.
> >
> > Andy
> >
> > | -Original Message-
> > | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > | Sent: Monday, March 13, 2000 10:32 AM
> > | To: [EMAIL PROTECTED]
> >
ain end tags are left out (which is acceptable in HTML land).
>
> -Heather
>
> -Original Message-
> From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
> Sent: Monday, March 13, 2000 10:57 AM
> To: [EMAIL PROTECTED]
> Subject: RE: HTML parsing
>
> Well, I hope it
--
From: Ward D. Cannon [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2000 10:57 AM
To: [EMAIL PROTECTED]
Subject: RE: HTML parsing
Well, I hope it can be done. Couldn't you just trap elements that contain
the tag IMG as you parse the Instance? You know like using startElement and
EndEleme
not built for
> speed.
>
> Thanks in advance,
>
> --Susan
>
>
> Mike Pogue
> <[EMAIL PROTECTED]To: [EMAIL PROTECTED]
> e.org> cc:
>
<[EMAIL PROTECTED]To: [EMAIL PROTECTED]
e.org> cc:
Subject:
> If the HTML is not well-formed XML (which most is not), you are correct.
>
> Andy
>
> | -Original Message-
> | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> | Sent: Monday, March 13, 2000 10:32 AM
> | To: [EMAIL PROTECTED]
> | Subject: HTML parsing
> |
>
7 PM
Subject: RE: HTML parsing
> If the HTML is not well-formed XML (which most is not), you are correct.
>
> Andy
>
> | -Original Message-
> | From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> | Sent: Monday, March 13, 2000 10:32 AM
> | To: [EMAIL PROTECTED]
> | Subj
If the HTML is not well-formed XML (which most is not), you are correct.
Andy
| -Original Message-
| From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
| Sent: Monday, March 13, 2000 10:32 AM
| To: [EMAIL PROTECTED]
| Subject: HTML parsing
|
|
| From what I can tell, I cannot expect to be
PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, March 13, 2000 10:32 AM
To: [EMAIL PROTECTED]
Subject: HTML parsing
From what I can tell, I cannot expect to be able to parse an HTML doc with the
Xerces parser? I was hoping to use the C++ SAX parser to find tags but I
don't think I will be able t
From what I can tell, I cannot expect to be able to parse an HTML doc with the
Xerces parser? I was hoping to use the C++ SAX parser to find tags but I
don't think I will be able to do that. Can someone confirm this dreadful fact?
Thanks,
Heather Matthews
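For readers who want to see the behaviour discussed in this thread, here is a small sketch. It uses the JAXP SAX parser bundled with the JDK rather than the Xerces C++ parser Heather asked about, so treat it as an illustration of the well-formedness rule and not of any particular Xerces API. It traps IMG elements in startElement, as suggested in the replies, and shows that the same parser rejects HTML whose end tags are left out. ImgFinder, findImages and isWellFormed are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects img sources from well-formed (X)HTML via SAX. */
class ImgFinder {

    /** Returns the src of every img element; rejects input that is not well-formed. */
    static List<String> findImages(String markup) {
        List<String> sources = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Trap the IMG tag as the parser reports each element start.
                if ("img".equalsIgnoreCase(qName)) {
                    sources.add(atts.getValue("src"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        } catch (SAXException e) {
            // An XML parser must stop at the first well-formedness error.
            throw new IllegalArgumentException("not well-formed XML", e);
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return sources;
    }

    /** True if the markup parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            findImages(markup);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

With the well-formed input `<html><body><p>Logo: <img src="logo.gif"/></p></body></html>`, findImages returns a list containing logo.gif. With the typical HTML form, where `<img src="logo.gif">` and `<p>` are never closed, isWellFormed returns false, which is why a tolerant HTML parser or a preprocessor such as Tidy is needed first.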