Re: HTMLDocument

2004-02-04 Thread lucene
On Monday 02 February 2004 10:41, John Moylan wrote:
> Another easy HTML parser is HTMLparser.sf.net

This one doesn't seem to be a SAX parser...:-\

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: HTMLDocument

2004-02-02 Thread John Moylan
Another easy HTML parser is HTMLparser.sf.net

John

On Sun, 2004-02-01 at 11:19, [EMAIL PROTECTED] wrote:
> Hi!
>
> Is there any HTMLDocument out there? The one in the demo package of lucene
> does not handle non-wellformed HTML files (what about nekohtml?) and seems to
> have some other inabilities and bugs as well (and why isn't it part of the
> distro but in a demo package?!)?
-- 
John Moylan
--
ePublishing
Radio Telefis Eireann,
Montrose House,
Donnybrook,
Dublin 4,
Eire
t:+353 1 2083564
e:[EMAIL PROTECTED]





Re: HTMLDocument

2004-02-02 Thread lucene
On Sunday 01 February 2004 15:27, Felix Huber wrote:
> Of course it's there: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/

Thanks. But I didn't find that contribution/ant directory there either...:-(





HTMLDocument

2004-02-01 Thread lucene
Hi!

Is there an alternative HTMLDocument implementation out there? The one in the
Lucene demo package does not handle malformed HTML files (what about
NekoHTML?) and seems to have some other limitations and bugs as well. And why
is it in a demo package rather than part of the distribution?





Re: HTMLDocument

2004-02-01 Thread Erik Hatcher
On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:
> Hi!
>
> Is there any HTMLDocument out there? The one in the demo package of lucene
> does not handle non-wellformed HTML files (what about nekohtml?) and seems to
> have some other inabilities and bugs as well (and why isn't it part of the
> distro but in a demo package?!)?

Nutch uses NekoHTML, so you can browse around that codebase and borrow 
its implementation.  The sandbox has a contribution/ant directory 
containing an HTMLDocument that uses JTidy to parse HTML; JTidy does a 
pretty good job of handling bad HTML.

Why isn't it in the distribution?  Parsing HTML and turning it into a 
Lucene document is not always done the same way, and doing so really 
sits on top of the core rather than being integral to it.

	Erik
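
For readers following the thread, here is a minimal sketch of the step discussed above: parsing sloppy HTML leniently and pulling out the text that would become the fields of a Lucene document. It deliberately uses neither JTidy nor NekoHTML nor any Lucene class; as a stand-in it relies on the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name HtmlFieldExtractor and the field names "title" and "contents" are assumptions made for illustration, not anything taken from the sandbox code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts a title and body text from possibly
// malformed HTML. The resulting map stands in for the fields one would
// add to a Lucene document; no Lucene API is used here.
class HtmlFieldExtractor {

    static Map<String, String> extract(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder contents = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Route text to the title or to the body, separating chunks with a space.
                StringBuilder target = inTitle ? title : contents;
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(data);
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations
            // in the markup, since the input is already a decoded String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", title.toString());       // assumed field name
        fields.put("contents", contents.toString()); // assumed field name
        return fields;
    }

    public static void main(String[] args) {
        // Unclosed <b>, <p> and <html> tags: not well-formed, but still parseable.
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>First <b>bold<p>Second</body>";
        System.out.println(extract(html));
    }
}
```

In a real indexer each map entry would be turned into a field on a Lucene document, and which fields are stored, tokenized or indexed is exactly the kind of choice that differs between applications.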



Re: HTMLDocument

2004-02-01 Thread lucene
On Sunday 01 February 2004 13:21, Erik Hatcher wrote:
> On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:
>
> Nutch uses NekoHTML, so you can browse around that codebase and borrow

Nutch(.org)? No code there...

