So I was able to get Nutch up and working using the crawl command.  I set my
depth and topN and it ran and indexed the pages for me.

But now I'm trying to split out the separate steps so I can distribute
them and add my own parser.  I'm running the following:

bin/nutch generate crawl/crawldb crawl/segments
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
bin/nutch fetch $SEGMENT -noParsing
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize


I don't see any way to control how deep the crawl goes when running the
steps individually.  Is this possible, or do I have to manage the crawldb
manually?  And if so, how do I do that?
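For what it's worth, my current guess is that each
generate/fetch/parse/updatedb cycle corresponds to one level of depth, so
something like the loop below might emulate the crawl command's -depth
option.  The DEPTH and -topN values here are just placeholders I made up;
I don't know if this is actually equivalent to what the crawl command does:

```shell
#!/bin/sh
# Guess: one generate/fetch/parse/updatedb cycle per depth level.
DEPTH=3      # placeholder, like the crawl command's -depth
TOPN=1000    # placeholder, like the crawl command's -topN

for i in $(seq 1 $DEPTH); do
  # Generate a new segment of URLs due for fetching
  bin/nutch generate crawl/crawldb crawl/segments -topN $TOPN

  # Pick the segment that was just created (newest directory)
  SEGMENT=crawl/segments/$(ls -tr crawl/segments | tail -1)

  # Fetch, parse, and fold the results back into the crawldb so the
  # next iteration can generate the next level of links
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done
```

Is that roughly the right idea, or does the crawldb need more manual
management than that?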

And as a side note, why does Nutch invoke Hadoop during the fetch command
even though I have -noParsing set?  After fetching my links, my machine
churns for around twenty minutes before finally finishing, even though all
the fetch threads have already completed.

Thanks.
