Thanks, Markus. I will check it out.

--- On Fri, 12/17/10, Markus Jelsma <[email protected]> wrote:

From: Markus Jelsma <[email protected]>
Subject: Re: How to dump the crawled Html pages?
To: [email protected]
Cc: "Paul Lypaczewski" <[email protected]>
Received: Friday, December 17, 2010, 7:25 PM

Hi,

Check out the readseg command.

Cheers,

> Hi
> I am new to Nutch. I just started to use Nutch to crawl an intranet and
> extract a certain field from the html pages. The first step I would like
> to do is to dump all the html pages to a directory. I guess I should add a
> filter class to do it, but I have no idea where should I start. Can
> someone give me some advice on how to start or which class's source code I
> should read? Thank you very much!
> Paul

