I tried dumping the timestamp-named folders inside 'segments' using the readseg tool.
In the dump folder I don't have any HTML pages, just a dump file (and a .crc file). The dump file contains CrawlDatums, and only one CrawlDatum has Content. Where is the content for the other CrawlDatums (was the crawl unsuccessful)? How do I get the crawled HTML pages?

On Sun, Jan 22, 2012 at 9:12 PM, Markus Jelsma <[email protected]> wrote:

> No, it's the readseg tool you need. It will dump, by default, all contents of
> the segment(s).
>
> > As Lewis mentioned, I dumped the crawldb using the readdb tool as below.
> >
> > $ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir (on Cygwin)
> >
> > But the dump (outdir) contains only two files named '.part-0000.crc' and
> > 'part-0000'. It doesn't have the HTML pages I wanted. What should I do?
> >
> > On Sun, Jan 22, 2012 at 4:32 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > The best method is to read or dump the contents of your crawldb and work
> > > based on this.
> > >
> > > Please have a look on the wiki for using the readdb tool.
> > >
> > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > > [email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using Nutch to generate a small dataset of the web; a dataset on
> > > > which I am planning to run a focused crawler later.
> > > >
> > > > I did a test crawl and I have the 'segments' folder built up. Now I
> > > > need to get the exact HTML pages it fetched from the seed URL(s).
> > > >
> > > > Is it possible to create a dataset this way? If so, how do I get those
> > > > HTML pages?
> > > >
> > > > Thanks a lot!
> > >
> > > --
> > > *Lewis*
>
> --

--
Regards,
Sameendra
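[Editor's note] For reference, on Nutch 1.x a content-only segment dump can be produced with something like `bin/nutch readseg -dump <segment_dir> dump_out -nofetch -nogenerate -noparse -noparsedata -noparsetext` (flag names are from the 1.x SegmentReader; check `bin/nutch readseg` usage for your version). The result is a single plain-text dump file, not one HTML file per page. A minimal sketch of splitting such a dump into per-record files, assuming the "Recno::"-delimited record layout shown below (the sample records are illustrative, not real crawl output):

```shell
# A readseg dump is one plain-text file; each record starts with a
# "Recno::" line followed by URL, CrawlDatum, and (if fetched) Content
# sections. The layout below is an assumption based on Nutch 1.x output.
printf '%s\n' \
  'Recno:: 0' 'URL:: http://example.com/' 'Content::' '<html>page 0</html>' \
  'Recno:: 1' 'URL:: http://example.com/a' 'Content::' '<html>page 1</html>' \
  > dump

# Split the dump into one file per record at the "Recno::" markers,
# so each fetched page ends up in its own record-N.txt file:
awk '/^Recno::/ {n++; close(f); f = "record-" n ".txt"} n {print > f}' dump
```

From each record-N.txt you can then strip everything above the `Content::` header to recover the raw HTML.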

