Re: HTMLDocument
On Monday 02 February 2004 10:41, John Moylan wrote:
> Another easy HTML parser is HTMLparser.sf.net

This one doesn't seem to be a SAX parser... :-\

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: HTMLDocument
Another easy HTML parser is HTMLparser.sf.net

John

On Sun, 2004-02-01 at 11:19, [EMAIL PROTECTED] wrote:
> Hi! Is there any HTMLDocument out there? The one in the demo package of
> Lucene does not handle non-well-formed HTML files (what about nekohtml?)
> and seems to have some other limitations and bugs as well. And why isn't
> it part of the distro, rather than in a demo package?

--
John Moylan
ePublishing
Radio Telefis Eireann, Montrose House, Donnybrook, Dublin 4, Eire
t:+353 1 2083564 e:[EMAIL PROTECTED]
Re: HTMLDocument
On Sunday 01 February 2004 15:27, Felix Huber wrote:
> Of course it's there: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/

Thanks. But I didn't find that contribution/ant directory there anyway... :-(
HTMLDocument
Hi! Is there any HTMLDocument out there? The one in the demo package of Lucene does not handle non-well-formed HTML files (what about nekohtml?) and seems to have some other limitations and bugs as well. And why isn't it part of the distro, rather than in a demo package?
Re: HTMLDocument
On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:
> Hi! Is there any HTMLDocument out there? The one in the demo package of
> Lucene does not handle non-well-formed HTML files (what about nekohtml?)
> and seems to have some other limitations and bugs as well. And why isn't
> it part of the distro, rather than in a demo package?

Nutch uses NekoHTML, so you can browse around that codebase and borrow its implementation. The sandbox has a contribution/ant directory which contains an HTMLDocument that uses JTidy to parse HTML, and JTidy does a pretty good job of handling bad HTML.

Why isn't it in the distribution? Parsing HTML and turning it into a Lucene document is not always done the same way, and doing so is really something built on top of the core, not integral to it.

Erik
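To illustrate the point about HTML-to-document conversion living on top of the core, here is a minimal sketch of that step. It is not the sandbox HTMLDocument and it uses neither JTidy nor NekoHTML: as a stand-in it relies on the lenient HTML parser that ships with the JDK (javax.swing.text.html), and a plain Map stands in for a Lucene Document. The class name HtmlFieldExtractor and the field names "title" and "contents" are assumptions chosen for this example only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls a title and body text out of possibly malformed HTML.
final class HtmlFieldExtractor {

    // Returns a map with "title" and "contents" entries (field names are assumptions).
    static Map<String, String> extract(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder contents = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Route text to the title or the contents buffer, space-separated.
                StringBuilder target = inTitle ? title : contents;
                target.append(' ').append(data);
            }
        };

        try {
            // The JDK parser tolerates unclosed and misnested tags.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", normalize(title));
        fields.put("contents", normalize(contents));
        return fields;
    }

    // Collapses runs of whitespace and trims the ends.
    private static String normalize(CharSequence text) {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String badHtml = "<html><head><title>Hello</title></head>"
                + "<body><p>Unclosed <b>bold text<p>Second</html>";
        System.out.println(extract(badHtml));
    }
}
```

In a real indexer each map entry would become a field on a Lucene Document. Which fields to create, and whether to store or tokenize them, is exactly the part that differs from one application to the next.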
Re: HTMLDocument
On Sunday 01 February 2004 13:21, Erik Hatcher wrote:
> On Feb 1, 2004, at 6:19 AM, [EMAIL PROTECTED] wrote:
> Nutch uses NekoHTML, so you can browse around that codebase and borrow

Nutch(.org)? No code there...