> So I was able to get Nutch up and working using the crawl command. I set
> my depth and topN and it ran and indexed the pages for me.
>
> But now I'm trying to split out the separate pieces in order to distribute
> them and add my own parser. I'm running the following:
>
> bin/nutch generate crawl/crawldb crawl/segments
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> bin/nutch fetch $SEGMENT -noParsing
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>
> I don't see any way to determine how deep to crawl. Is this possible, or
> do I have to manually manage the db? And if so, how do I do that?
Not really. Once you've got many links pointing to each other, the concept of depth no longer really applies. You don't have to manage the DB manually; it will regulate itself (optionally via a custom fetch scheduler). Nutch will select URLs due for fetching and will eventually exhaust the full list of URLs, unless you're crawling the whole internet. Fetched URLs will be refetched over time.

> And as a side note, why does Nutch invoke hadoop during the fetch command
> even though I have noParsing set? After fetching my links, my machine
> churns for around twenty minutes before finally ending, even though all the
> fetch threads completed already.

Because the fetcher runs as a Hadoop mapred job. When the actual fetch finishes, Hadoop must still write the contents, merge spilled records, etc. This is part of how mapred works.

> Thanks.
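For what it's worth, if you do want crawl-style depth behaviour while using the individual commands, you can approximate it by running the generate/fetch/parse/updatedb cycle a fixed number of rounds, which is essentially what the crawl command does internally. A sketch, reusing the paths from your commands (the DEPTH value and -topN limit are placeholders you'd tune yourself):

```shell
# Approximate a fixed crawl depth by running DEPTH rounds of the
# generate/fetch/parse/updatedb cycle. DEPTH and -topN are example
# values; the crawl/ paths match the commands quoted above.
DEPTH=3
for i in $(seq 1 $DEPTH); do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  # pick the newest segment created by the generate step
  SEGMENT=crawl/segments/$(ls crawl/segments | sort | tail -1)
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done
```

Each round only generates URLs discovered in earlier rounds, so stopping after DEPTH iterations gives you roughly the same effect as the depth parameter of the crawl command.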

