The best method is to read or dump the contents of your crawldb and work from that.
Please have a look at the wiki for using the readdb tool.

On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <[email protected]> wrote:
> Hi,
> I am using Nutch to generate a small dataset of the web; a dataset on
> which I am planning to run a focused crawler later.
>
> I did a test crawl and I have the 'segments' folder built up. Now I need
> to get the exact HTML pages it fetched from the seed URL(s).
>
> Is it possible to create a dataset this way? If so, how do I get those
> HTML pages?
>
> Thanks a lot!

--
*Lewis*
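As a rough sketch of the advice above, the `readdb` tool can inspect or dump the crawldb, and `readseg -dump` will write out a segment's fetched content (including the raw HTML). The `crawl/` directory layout and the segment timestamp below are illustrative assumptions; substitute your own paths.

```shell
# Assumes a Nutch 1.x-style crawl directory: crawl/crawldb, crawl/segments/*
# (paths and the segment name are examples only -- adjust to your crawl)

# Print summary statistics for the crawldb
bin/nutch readdb crawl/crawldb -stats

# Dump the full crawldb to plain text for inspection
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Dump one segment's data, including the fetched page content, to text
bin/nutch readseg -dump crawl/segments/20120122103000 segment_dump
```

The `readseg -dump` output interleaves crawl metadata with the raw fetched content, so some post-processing is usually needed to split out the individual HTML pages.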

