Hi,

In your output directory, you should see two files:

1. .part-00000.crc
2. part-00000

Open the second one with a text editor and you should be able to see the
crawled URLs. If there is no html in there, you probably didn't crawl any.

Remi

On Mon, Jan 23, 2012 at 4:08 PM, Sameendra Samarawickrama <
[email protected]> wrote:

> Hi,
> I tried the readdb command, but I can't get the html pages with it.
>
> Thanks,
> Sameendra
>
> On Mon, Jan 23, 2012 at 12:14 PM, remi tassing <[email protected]>
> wrote:
>
> > Hi Sameendra,
> >
> > Read this page: http://wiki.apache.org/nutch/bin/nutch_readdb
> >
> > For instance, the following command will read your database and output
> > the crawled URLs to the directory output_dir:
> >
> > bin/nutch readdb crawl/crawldb -dump output_dir
> >
> > Remi
> >
> > On Sun, Jan 22, 2012 at 1:02 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > The best method is to read or dump the contents of your crawldb and
> > > work based on this.
> > >
> > > Please have a look on the wiki for using the readdb tool.
> > >
> > > On Sun, Jan 22, 2012 at 10:51 AM, Sameendra Samarawickrama <
> > > [email protected]> wrote:
> > >
> > > > Hi,
> > > > I am using Nutch to generate a small dataset of the web, a dataset
> > > > on which I am planning to run a focused crawler later.
> > > >
> > > > I did a test crawl and I have the 'segments' folder built up. Now
> > > > I need to get the exact html pages it fetched from the seed url/s.
> > > >
> > > > Is it possible to create a dataset this way? If so, how do I get
> > > > those html pages?
> > > >
> > > > Thanks a lot!
> > >
> > > --
> > > *Lewis*
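[Editor's note] The part-00000 file produced by `readdb -dump` is plain text, so it can be post-processed with a short script. The sketch below pulls URLs out of such a dump; it assumes each record starts with a line whose first field is the URL (e.g. "http://example.org/\tVersion: 7"). That record layout and the sample string are assumptions for illustration, not taken from the thread, so check your own part-00000 first.

```python
import re

def extract_urls(dump_text):
    """Collect URLs from a readdb -dump text file.

    Assumption: each record's first line begins with the URL, followed by
    metadata lines such as "Status: ..." that do not start with a URL.
    """
    urls = []
    for line in dump_text.splitlines():
        # \S+ stops at the tab separating the URL from the metadata field.
        match = re.match(r"(https?://\S+)", line)
        if match:
            urls.append(match.group(1))
    return urls

# Hypothetical sample resembling one crawldb dump record:
sample = "http://example.org/\tVersion: 7\nStatus: 2 (db_fetched)\n"
print(extract_urls(sample))  # ['http://example.org/']
```

Note this only recovers URLs and their metadata; as Lewis and Remi say, the crawldb holds crawl state, not page bodies, so the fetched HTML itself is not in this dump.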

