Hi, I tried the readdb command, but I can't get the HTML pages with it. Thanks, Sameendra
On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]> wrote:

> Hi Sameendra,
>
> Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
>
> For instance, the following command will read your database and output the
> crawled URLs to the directory output_dir:
>
>     bin/nutch readdb crawl/crawldb -dump output_dir
>
> Remi
>
> On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney
> <[email protected]> wrote:
>
>> The best method is to read or dump the contents of your crawldb and work
>> based on this.
>>
>> Please have a look on the wiki for using the readdb tool.
>>
>> On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama
>> <[email protected]> wrote:
>>
>>> Hi,
>>> I am using Nutch to generate a small dataset of the web, on which I am
>>> planning to run a focused crawler later.
>>>
>>> I did a test crawl and I have the 'segments' folder built up. Now I need
>>> to get the exact HTML pages it fetched from the seed URL(s).
>>>
>>> Is it possible to create a dataset this way? If so, how do I get those
>>> HTML pages?
>>>
>>> Thanks a lot!
>>
>> --
>> *Lewis*
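For anyone following along: `readdb -dump` writes out crawldb metadata (URLs, fetch status, scores), not page content, which would explain why no HTML comes out of it. The fetched content lives in the segments, so something along these lines may work (the segment name below is an example from a local crawl, not a real path from this thread):

```shell
# readdb dumps crawldb metadata only -- URLs, fetch status, scores. No HTML.
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# The raw fetched content is stored per segment. readseg -dump writes it
# out; the -no* flags suppress the other record types so the dump is
# mostly the raw page content. Replace the segment name with your own.
bin/nutch readseg -dump crawl/segments/20120122103000 segment_dump \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

The result is a single plain-text dump under the output directory; splitting it back into one HTML file per URL would take a small post-processing script over the record boundaries.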

