Folks, as mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN cluster.
In order to scale, I need to fetch concurrently with multiple map tasks on multiple nodes. I think the first step is to generate multiple segments in the generate phase, so that multiple fetch map tasks can operate in parallel. I have made the following changes to get Generate to produce multiple segments, but so far without success.

I added the *maxNumSegments* and *numFetchers* parameters to the generate call in the *bin/crawl* script, as shown below:

    $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

(Here $numFetchers has a value of 15.)

*generate.max.count*, *generate.count.mode* and *topN* are all left at their default values, meaning I am not providing any values for them.

The crawldb status before the generate phase is shown below. It shows more than *75 million* unfetched URLs, so it's not that there are too few URLs for Generate to produce multiple segments.

CrawlDB status
    db_fetched=318708
    db_gone=4774
    db_notmodified=2274
    db_redir_perm=2253
    db_redir_temp=2527
    db_unfetched=75246666

However, I consistently see this message in the logs during the generate phase:

    Generator: jobtracker is 'local', generating exactly one partition.

Is this "one partition" referring to the single segment that is going to be generated? If so, how do I address this?

I feel like I have exhausted all the options, but I am unable to get the Generate phase to produce more than one segment at a time. Can someone let me know if there is anything else I should be trying here?

Thanks, and any help is much appreciated!

