As Lewis mentioned, I dumped the crawldb using the readdb tool, as below (on Cygwin):

$ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir
But the dump (outdir) contains only two files, '.part-0000.crc' and 'part-0000'. It doesn't contain the HTML pages I wanted. What should I do?

On Sun, Jan 22, 2012 at 4:32 PM, Lewis John Mcgibbney <[email protected]> wrote:

> The best method is to read or dump the contents of your crawldb and work
> based on this.
>
> Please have a look on the wiki for using the readdb tool.
>
> On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> [email protected]> wrote:
>
> > Hi,
> > I am using Nutch to generate a small dataset of the web, a dataset on
> > which I am planning to run a focused crawler later.
> >
> > I did a test crawl and I have the 'segments' folder built up. Now I need
> > to get the exact HTML pages it fetched out of the seed URL(s).
> >
> > Is it possible to create a dataset this way? If so, how do I get those
> > HTML pages?
> >
> > Thanks a lot!
>
> --
> *Lewis*
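For context on the question above: the crawldb stores only crawl metadata (URLs, fetch status, scores), not page content, so its dump will never contain HTML. The fetched pages live in the segments, which can be dumped to plain text with the readseg tool. Below is a minimal sketch of splitting such a text dump into per-URL records. The record markers (`Recno::`, `URL::`) are assumptions based on the Nutch 1.x readseg output format; verify them against an actual dump before relying on this, and note the sample string here is purely hypothetical test data.

```python
# Sketch: split a Nutch `readseg -dump` style text dump into per-URL records.
# ASSUMPTION: records begin with "Recno:: N" and carry a "URL:: ..." line,
# as in Nutch 1.x readseg output; check your own dump's format first.

def split_records(dump_text):
    """Return a list of (url, raw_record_text) tuples from a readseg-style dump."""
    records = []
    # Everything before the first "Recno:: " marker is header noise; skip it.
    for chunk in dump_text.split("Recno:: ")[1:]:
        url = None
        for line in chunk.splitlines():
            if line.startswith("URL:: "):
                url = line[len("URL:: "):].strip()
                break
        records.append((url, chunk))
    return records

# Hypothetical two-record dump, used only to exercise the parser.
sample = (
    "Recno:: 0\nURL:: http://example.com/a\nContent::\n<html>A</html>\n\n"
    "Recno:: 1\nURL:: http://example.com/b\nContent::\n<html>B</html>\n"
)

for url, _ in split_records(sample):
    print(url)
```

Each raw record still contains the full dump text for that URL, so the HTML body can be sliced out of it once the exact `Content::` layout of your Nutch version is confirmed.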

