Re: How to dump the crawled Html pages?

Paul Lypaczewski Fri, 17 Dec 2010 11:32:34 -0800

Thanks, Markus. I will check it out.

--- On Fri, 12/17/10, Markus Jelsma <[email protected]> wrote:

From: Markus Jelsma <[email protected]>
Subject: Re: How to dump the crawled Html pages?
To: [email protected]
Cc: "Paul Lypaczewski" <[email protected]>
Received: Friday, December 17, 2010, 7:25 PM

Hi,

Check out the readseg command.
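
For readers of the archive, here is a minimal sketch of how readseg might be driven to dump the raw content of every segment. It assumes a Nutch 1.x layout with the launcher at bin/nutch and segments under crawl/segments; those paths, the output directory name, and the helper function names are illustrative assumptions, not part of the original reply. The -no* options ask readseg to skip everything except the fetched content.

```python
import subprocess
from pathlib import Path

# readseg options that skip everything except the raw fetched content.
SKIP_FLAGS = ["-nofetch", "-nogenerate", "-noparse", "-noparsedata", "-noparsetext"]


def readseg_dump_command(segment_dir, output_dir, nutch_bin="bin/nutch"):
    """Build the argument list for `nutch readseg -dump` on one segment.

    nutch_bin is an assumed path to the Nutch launcher script.
    """
    return [nutch_bin, "readseg", "-dump", str(segment_dir), str(output_dir), *SKIP_FLAGS]


def dump_segments(segments_root, output_root, nutch_bin="bin/nutch", runner=subprocess.run):
    """Run readseg -dump once per segment directory under segments_root.

    Each segment is dumped into its own subdirectory of output_root.
    Returns the list of commands that were run.
    """
    commands = []
    for segment in sorted(Path(segments_root).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        cmd = readseg_dump_command(segment, Path(output_root) / segment.name, nutch_bin)
        runner(cmd, check=True)  # raises CalledProcessError if readseg fails
        commands.append(cmd)
    return commands


# Example call (hypothetical paths, run from the Nutch install directory):
#   dump_segments("crawl/segments", "segment_dumps")
```

Note that readseg writes its records into a plain-text dump rather than one file per page, so splitting the dump into individual HTML files would still need a further step.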

Cheers,

> Hi
> I am new to Nutch. I just started using Nutch to crawl an intranet and
> extract a certain field from the HTML pages. The first step I would like
> to do is to dump all the HTML pages to a directory. I guess I should add a
> filter class to do it, but I have no idea where I should start. Can
> someone give me some advice on how to start or which class's source code I
> should read? Thank you very much!
> Paul
