Just wanted to update everyone: the issue with only a single map task running for fetch occurred because Generator.java had logic keyed on the MRv1 property *mapred.job.tracker*. Since I am running this on YARN, I had to change that logic, and now multiple fetch tasks operate on a single segment.
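For anyone hitting the same log message, here is a minimal, self-contained sketch of the kind of check involved. It is not Nutch's actual code and not necessarily the change described above: the class and method names are hypothetical, a plain Map stands in for the Hadoop job configuration, and switching the check to *mapreduce.framework.name* is an assumption about one way to make it YARN-aware. Treating an unset property as "local" is also an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the partition-count decision; not Nutch's actual code. */
final class PartitionCheck {

    // MRv1 property that the original check is described as reading.
    static final String MRV1_KEY = "mapred.job.tracker";
    // MRv2/YARN property; using it here is an assumption, not a confirmed fix.
    static final String MRV2_KEY = "mapreduce.framework.name";

    /** MRv1-style check: an unset or "local" job tracker forces a single partition. */
    static int partitionsMrv1(Map<String, String> conf, int requested) {
        String tracker = conf.getOrDefault(MRV1_KEY, "local");
        return "local".equals(tracker) ? 1 : requested;
    }

    /** YARN-aware variant: only a "local" framework forces a single partition. */
    static int partitionsYarnAware(Map<String, String> conf, int requested) {
        String framework = conf.getOrDefault(MRV2_KEY, "local");
        return "local".equals(framework) ? 1 : requested;
    }

    public static void main(String[] args) {
        // A YARN-style configuration: the MRv1 key is simply not set.
        Map<String, String> yarnConf = new HashMap<>();
        yarnConf.put(MRV2_KEY, "yarn");

        System.out.println(partitionsMrv1(yarnConf, 15));      // prints 1
        System.out.println(partitionsYarnAware(yarnConf, 15)); // prints 15
    }
}
```

With the MRv1-style check, a YARN configuration that never sets *mapred.job.tracker* looks "local", which would match the "generating exactly one partition" message quoted below.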
Also, I had misunderstood that multiple segments would need to be generated to achieve parallelism. That does not seem to be the case: parallelism at fetch time is achieved by having multiple fetch tasks operate on a single segment.

Thanks everyone for your help on resolving this issue.

On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan <[email protected]> wrote:

> Folks,
>
> As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
> cluster.
>
> In order to scale, I need to fetch concurrently with multiple map
> tasks on multiple nodes. I think the first step is to generate multiple
> segments in the generate phase so that multiple fetch map tasks can
> operate in parallel. To generate multiple segments at generate time I
> have made the following changes, but unfortunately I have been
> unsuccessful.
>
> I added the *maxNumSegments* and *numFetchers* parameters to the call to
> generate in the *bin/crawl* script, as can be seen below.
>
> *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
> $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers
> -noFilter*
>
> (Here $numFetchers has a value of 15.)
>
> The *generate.max.count*, *generate.count.mode* and *topN* settings are
> all at their default values, meaning I am not providing any values for
> them.
>
> Also, the crawldb status before the generate phase is shown below. It
> shows that the number of unfetched URLs is more than *75 million*, so
> it's not that there are not enough URLs for Generate to generate
> multiple segments.
>
> *CrawlDB status*
> *db_fetched=318708*
> *db_gone=4774*
> *db_notmodified=2274*
> *db_redir_perm=2253*
> *db_redir_temp=2527*
> *db_unfetched=75246666*
>
> However, I do see this message in the logs consistently during the
> generate phase.
>
> *Generator: jobtracker is 'local', generating exactly one partition.*
>
> Is this "one partition" referring to the single segment that is going
> to be generated? If so, how do I address this?
>
> I feel like I have exhausted all the options, but I am unable to have the
> generate phase produce more than one segment at a time.
>
> Can someone let me know if there is anything else that I should be trying
> here?
>
> *Thanks and any help is much appreciated!*

