> So I was able to get Nutch up and working using the crawl command. I set
> my depth and topN and it ran and indexed the pages for me.
>
> But now I'm trying to split out the separate pieces in order to distribute
> them and add my own parser. I'm running the following:
>
> bin/nutch generate crawl/crawldb crawl/segments
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> bin/nutch fetch $SEGMENT -noParsing
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>
> I don't see any way to determine how deep to crawl. Is this possible, or
> do I have to manually manage the db? And if so, how do I do that?
Not really. Once you've got many links pointing to each other, the concept of depth no longer really applies. You don't have to manage the DB manually; it will regulate itself (optionally via a custom fetch scheduler). Nutch will select URLs due for fetching and will eventually exhaust the full list of URLs, unless you're crawling the whole internet. Fetched URLs will be refetched over time.

> And as a side note, why does Nutch invoke hadoop during the fetch command
> even though I have noParsing set? After fetching my links, my machine
> churns for around twenty minutes before finally ending, even though all the
> fetch threads completed already.

Because the fetcher runs as a Hadoop mapred job. When the actual fetch finishes, Hadoop must still write the contents, merge spilled records, etc. This is part of how mapred works.

> Thanks.
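For what it's worth, if you do want crawl-style depth behaviour while using the individual commands, you can approximate it by running the generate/fetch/parse/updatedb cycle a fixed number of rounds, which is essentially what the crawl command does internally. A sketch, reusing the paths from your commands (the DEPTH value and -topN limit are placeholders you'd tune yourself):

```shell
# Approximate a fixed crawl depth by running DEPTH rounds of the
# generate/fetch/parse/updatedb cycle. DEPTH and -topN are example
# values; the crawl/ paths match the commands quoted above.
DEPTH=3
for i in $(seq 1 $DEPTH); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick the newest segment created by the generate step
  SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done
```

Each round only generates URLs discovered in earlier rounds, so stopping after DEPTH iterations gives you roughly the same effect as the depth parameter of the crawl command.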

