Hi, I tried the readdb command, but I can't get the HTML pages with it. Thanks, Sameendra
On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:

> Hi Sameendra,
>
> Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
>
> For instance, the following command will read your database and output the
> crawled URLs to the directory output_dir:
>
>     bin/nutch readdb crawl/crawldb -dump output_dir
>
> Remi
>
> On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
>
>> The best method is to read or dump the contents of your crawldb and work
>> based on this.
>>
>> Please have a look on the wiki for using the readdb tool.
>>
>> On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama
>> <[email protected]> wrote:
>>
>>> Hi,
>>> I am using Nutch to generate a small dataset of the web, on which I am
>>> planning to run a focused crawler later.
>>>
>>> I did a test crawl and I have the 'segments' folder built up. Now I need
>>> to get the exact HTML pages it fetched from the seed URL(s).
>>>
>>> Is it possible to create a dataset this way? If so, how do I get those
>>> HTML pages?
>>>
>>> Thanks a lot!
>>
>> --
>> *Lewis*
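For anyone following along: `readdb -dump` writes out crawldb metadata (URLs, fetch status, scores), not page content, which would explain why no HTML comes out of it. The fetched content lives in the segments, so something along these lines may work (the segment name below is an example from a local crawl, not a real path from this thread):

```shell
# readdb dumps crawldb metadata only -- URLs, fetch status, scores. No HTML.
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# The raw fetched content is stored per segment. readseg -dump writes it
# out; the -no* flags suppress the other record types so the dump is
# mostly the raw page content. Replace the segment name with your own.
bin/nutch readseg -dump crawl/segments/20120122103000 segment_dump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

The result is a single plain-text dump under the output directory; splitting it back into one HTML file per URL would take a small post-processing script over the record boundaries.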

