I tried dumping the timestamp-named folders inside 'segments' using the readseg tool.
In the dump folder I don't have any HTML pages, just a dump file (and a .crc file). The dump file contains CrawlDatums, and only one CrawlDatum has Content. Where is the content for the other CrawlDatums (was the crawl unsuccessful)? How do I get the crawled HTML pages?

On Sun, Jan 22, 2012 at 9:12 PM, Markus Jelsma <[email protected]> wrote:

> No, it's the readseg tool you need. It will dump, by default, all contents of
> the segment(s).
>
> > As Lewis mentioned, I dumped the crawldb using the readdb tool as below.
> >
> > $ ./bin/nutch readdb crawl-tinysite/crawldb/ -dump outdir (on Cygwin)
> >
> > But the dump (outdir) contains only two files named '.part-0000.crc' and
> > 'part-0000'. It doesn't have the HTML pages I wanted. What should I do?
> >
> > On Sun, Jan 22, 2012 at 4:32 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > The best method is to read or dump the contents of your crawldb and work
> > > based on this.
> > >
> > > Please have a look on the wiki for using the readdb tool.
> > >
> > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > > [email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am using Nutch to generate a small dataset of the web; a dataset on
> > > > which I am planning to run a focused crawler later.
> > > >
> > > > I did a test crawl and I have the 'segments' folder built up. Now I
> > > > need to get the exact HTML pages it fetched from the seed URL(s).
> > > >
> > > > Is it possible to create a dataset this way? If so, how do I get those
> > > > HTML pages?
> > > >
> > > > Thanks a lot!
> > >
> > > --
> > > *Lewis*
>
> --

--
Regards,
Sameendra
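[Editor's note] For reference, on Nutch 1.x a content-only segment dump can be produced with something like `bin/nutch readseg -dump <segment_dir> dump_out -nofetch -nogenerate -noparse -noparsedata -noparsetext` (flag names are from the 1.x SegmentReader; check `bin/nutch readseg` usage for your version). The result is a single plain-text dump file, not one HTML file per page. A minimal sketch of splitting such a dump into per-record files, assuming the "Recno::"-delimited record layout shown below (the sample records are illustrative, not real crawl output):

```shell
# A readseg dump is one plain-text file; each record starts with a
# "Recno::" line followed by URL, CrawlDatum, and (if fetched) Content
# sections. The layout below is an assumption based on Nutch 1.x output.
printf '%s\n' \
  'Recno:: 0' 'URL:: http://example.com/' 'Content::' '<html>page 0</html>' \
  'Recno:: 1' 'URL:: http://example.com/a' 'Content::' '<html>page 1</html>' \
  > dump

# Split the dump into one file per record at the "Recno::" markers,
# so each fetched page ends up in its own record-N.txt file:
awk '/^Recno::/ {n++; close(f); f = "record-" n ".txt"} n {print > f}' dump
```

From each record-N.txt you can then strip everything above the `Content::` header to recover the raw HTML.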

