Re: Getting html pages through a Nutch crawl (for a dataset)

remi tassing Sun, 22 Jan 2012 22:45:33 -0800

Hi Sameendra,

Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb


For instance, the following command reads your crawl database (crawldb) and
writes the crawled URLs to the directory output_dir:

bin/nutch readdb crawl/crawldb -dump output_dir
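
Once the dump exists, a short helper can pull the bare URL list out of it.
This is only a sketch, not part of Nutch: the function name list_dumped_urls
is made up for illustration, and it assumes that the dump is plain text, that
the files in output_dir are named part-*, and that each record starts with a
line of the form "<url><TAB>Version: <n>". Check one of your own dump files
before relying on it.

```shell
# Hypothetical helper (not a Nutch command): print the URLs found in a
# text dump written by `bin/nutch readdb crawl/crawldb -dump output_dir`.
#
# Assumptions, to be verified against your own dump:
#   - the dump files inside the directory are named part-*
#   - each record begins with "<url><TAB>Version: <n>"
list_dumped_urls() {
    # $1: the dump directory, e.g. output_dir
    # Split each line on tabs and keep the first field of every line
    # whose second field starts with "Version:".
    awk -F '\t' '$2 ~ /^Version:/ { print $1 }' "$1"/part-*
}

# Example call, after the readdb command above has been run:
#   list_dumped_urls output_dir > urls.txt
```

Note that this dump lists URLs together with their fetch status; it does not
contain the HTML of the pages themselves.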

Remi

On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> The best method is to read or dump the contents of your crawldb and work
> from that.
>
> Please have a look at the wiki page for the readdb tool.
>
> On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> [email protected]> wrote:
>
> > Hi,
> > I am using Nutch to generate a small dataset of web pages, on which I am
> > planning to run a focused crawler later.
> >
> > I did a test crawl and I have the 'segments' folder built up. Now I need
> > to get the exact HTML pages it fetched from the seed URL(s).
> >
> > Is it possible to create a dataset this way? If so, how do I get those
> > HTML pages?
> >
> > Thanks a lot!
> >
>
>
>
> --
> *Lewis*
>
