No, it's the readseg tool you need. By default it dumps all contents of 
the segment(s). 
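For example, a dump of a single segment might look like the following. The timestamped segment directory name and the output directory `segdump` are illustrative; use the names on your own system:

```shell
# Dump the full contents of one segment (raw content, parse text, parse data).
# Segment directories are named by fetch timestamp; substitute your own.
./bin/nutch readseg -dump crawl-tinysite/segments/20120122103000 segdump

# The dump is plain text; the raw fetched HTML appears under the
# "Content::" sections for each URL.
less segdump/dump
```

If you only want the raw page content, the readseg tool accepts flags to suppress the other record types; check `./bin/nutch readseg` with no arguments for the exact options in your Nutch version.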

> As Lewis mentioned, I dumped the crawldb using the readdb tool as below.
> 
> $ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir   (on Cygwin)
> 
> But the dump (outdir) contains only two files named '.part-0000.crc' and
> 'part-0000'.
> It doesn't have the html pages I wanted. What should I do?
> 
> 
> On Sun, Jan 22, 2012 at 4:32 PM, Lewis John Mcgibbney <
> 
> [email protected]> wrote:
> > The best method is to read or dump the contents of your crawldb and work
> > based on this.
> > 
> > Please have a look on the wiki for using the readdb tool.
> > 
> > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > 
> > [email protected]> wrote:
> > > Hi,
> > > I am using Nutch to generate a small dataset of the web, on which I am
> > > planning to run a focused crawler later.
> > > 
> > > I did a test crawl and I have the 'segments' folder built up. Now I need
> > > to get the exact HTML pages it fetched from the seed URL/s.
> > > 
> > > Is it possible to create a dataset this way? If so, how do I get those
> > > HTML pages?
> > > 
> > > Thanks a lot!
> > 
> > --
> > *Lewis*
