My hadoop.log file has this at the end.

2011-09-22 13:18:52,971 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:53,971 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,231 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-09-22 13:18:54,972 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,972 INFO fetcher.Fetcher - -activeThreads=0
2011-09-22 13:19:23,425 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-09-22 13:31:37,660 INFO fetcher.Fetcher - Fetcher: finished at 2011-09-22 13:31:37, elapsed: 00:24:06
I'm trying to figure out what's going on during that 10-15 minutes at the end. The machine loads up one core during that time and nothing shows up in the console.

On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:

> Hi Bai,
>
> You haven't mentioned which Nutch version you're using... this would be
> good if you could.
>
> You haven't injected any seed URLs into your crawldb. From memory I think
> the -topN parameter should be passed to the generate command.
>
> Just to note, it is not necessary to set noParsing while executing the
> fetch command. This is already default behaviour. Not sure why your
> machine is churning but this shouldn't be happening. Do you have any log
> data to suggest why this is the case?
>
> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>
> > So I was able to get Nutch up and working using the crawl command. I
> > set my depth and topN and it ran and indexed the pages for me.
> >
> > But now I'm trying to split out the separate pieces in order to
> > distribute them and add my own parser. I'm running the following.
> >
> > bin/nutch generate crawl/crawldb crawl/segments
> > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > bin/nutch fetch $SEGMENT -noParsing
> > bin/nutch parse $SEGMENT
> > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> >
> > I don't see any way to determine how deep to crawl. Is this possible,
> > or do I have to manually manage the db? And if so, how do I do that?
> >
> > And as a side note, why does Nutch invoke Hadoop during the fetch
> > command even though I have noParsing set? After fetching my links, my
> > machine churns for around twenty minutes before finally ending, even
> > though all the fetch threads completed already.
> >
> > Thanks.
>
> --
> *Lewis*
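On the "how deep to crawl" question in the quoted mail: when the steps are run individually, depth isn't a parameter of any one command; the usual approach is to wrap the generate/fetch/parse/updatedb sequence in a loop, where the number of iterations plays the role of the crawl command's -depth. A minimal sketch, assuming the same crawl/crawldb and crawl/segments paths used in the thread (the NUTCH variable and the crawl_loop function name are illustrative, not Nutch-provided):

```shell
#!/bin/sh
# Sketch: emulate "crawl -depth N -topN M" by looping the individual
# steps N times. Paths match the thread; NUTCH is an assumption.

crawl_loop() {
  depth=$1                 # number of generate/fetch/parse/updatedb rounds
  topn=$2                  # -topN belongs on the generate step
  nutch=${NUTCH:-bin/nutch}
  i=1
  while [ "$i" -le "$depth" ]; do
    $nutch generate crawl/crawldb crawl/segments -topN "$topn"
    # Newest segment is the one generate just created.
    segment=crawl/segments/$(ls -tr crawl/segments | tail -1)
    $nutch fetch "$segment"
    $nutch parse "$segment"
    $nutch updatedb crawl/crawldb "$segment" -filter -normalize
    i=$((i + 1))
  done
}

# e.g. crawl_loop 3 1000  # roughly what "crawl -depth 3 -topN 1000" does
```

Each updatedb folds the newly discovered links back into the crawldb, so the next generate round selects the next "level" of URLs; stopping the loop after N rounds is what bounds the depth.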

