My hadoop.log file has this at the end.

2011-09-22 13:18:52,971 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:53,971 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,231 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-09-22 13:18:54,972 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-09-22 13:18:54,972 INFO fetcher.Fetcher - -activeThreads=0
2011-09-22 13:19:23,425 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2011-09-22 13:31:37,660 INFO fetcher.Fetcher - Fetcher: finished at 2011-09-22 13:31:37, elapsed: 00:24:06
I'm trying to figure out what's going on during that 10-15 minutes at the end. The machine loads up one core during that time and nothing shows up in the console.

On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:

> Hi Bai,
>
> You haven't mentioned which Nutch version you're using... this would be
> good if you could.
>
> You haven't injected any seed URLs into your crawldb. From memory I think
> the -topN parameter should be passed to the generate command.
>
> Just to note, it is not necessary to set noParsing while executing the
> fetch command. This is already default behaviour. Not sure why your
> machine is churning but this shouldn't be happening. Do you have any log
> data to suggest why this is the case?
>
> On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
>
> > So I was able to get Nutch up and working using the crawl command. I
> > set my depth and topN and it ran and indexed the pages for me.
> >
> > But now I'm trying to split out the separate pieces in order to
> > distribute them and add my own parser. I'm running the following.
> >
> > bin/nutch generate crawl/crawldb crawl/segments
> > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > bin/nutch fetch $SEGMENT -noParsing
> > bin/nutch parse $SEGMENT
> > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> >
> > I don't see any way to determine how deep to crawl. Is this possible,
> > or do I have to manually manage the db? And if so, how do I do that?
> >
> > And as a side note, why does Nutch invoke Hadoop during the fetch
> > command even though I have noParsing set? After fetching my links, my
> > machine churns for around twenty minutes before finally ending, even
> > though all the fetch threads completed already.
> >
> > Thanks.
>
> --
> *Lewis*
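On the "how deep to crawl" question in the quoted mail: when the steps are run individually, depth isn't a parameter of any one command; the usual approach is to wrap the generate/fetch/parse/updatedb sequence in a loop, where the number of iterations plays the role of the crawl command's -depth. A minimal sketch, assuming the same crawl/crawldb and crawl/segments paths used in the thread (the NUTCH variable and the crawl_loop function name are illustrative, not Nutch-provided):

```shell
#!/bin/sh
# Sketch: emulate "crawl -depth N -topN M" by looping the individual
# steps N times. Paths match the thread; NUTCH is an assumption.

crawl_loop() {
  depth=$1                 # number of generate/fetch/parse/updatedb rounds
  topn=$2                  # -topN belongs on the generate step
  nutch=${NUTCH:-bin/nutch}
  i=1
  while [ "$i" -le "$depth" ]; do
    $nutch generate crawl/crawldb crawl/segments -topN "$topn"
    # Newest segment is the one generate just created.
    segment=crawl/segments/$(ls -tr crawl/segments | tail -1)
    $nutch fetch "$segment"
    $nutch parse "$segment"
    $nutch updatedb crawl/crawldb "$segment" -filter -normalize
    i=$((i + 1))
  done
}

# e.g. crawl_loop 3 1000  # roughly what "crawl -depth 3 -topN 1000" does
```

Each updatedb folds the newly discovered links back into the crawldb, so the next generate round selects the next "level" of URLs; stopping the loop after N rounds is what bounds the depth.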

