Thanks for sharing this, Meraj. It's already proving useful to other users.

On 25 September 2014 17:04, Meraj A. Khan <[email protected]> wrote:

> Just wanted to update and let everyone know that this issue with a single map
> task for fetch was occurring because Generator.java had logic around the MRv1
> property *mapred.job.tracker*. I had to change that logic, as I am
> running this on YARN, and now multiple fetch tasks operate on a single
> segment.
>
> Also, I misunderstood that multiple segments would need to be generated to
> achieve parallelism. That does not seem to be the case: parallelism at
> fetch time is achieved by having multiple fetch tasks operate on a single
> segment.
>
> Thanks everyone for your help on resolving this issue.
>
>
> On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan <[email protected]> wrote:
>
> > Folks,
> >
> > As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
> > cluster.
> >
> > In order to scale I would need to fetch concurrently with multiple map
> > tasks on multiple nodes. I think that the first step to do so would be to
> > generate multiple segments in the generate phase, so that multiple fetch map
> > tasks can operate in parallel. In order to generate multiple segments
> > at generate time I have made the following changes, but unfortunately I
> > have been unsuccessful in doing so.
> >
> > I have tweaked the following parameters in bin/crawl to do so.
> >
> > I added the *maxNumSegments* and *numFetchers* parameters in the call to
> > generate in the *bin/crawl* script, as can be seen below.
> >
> >
> > *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter*
> >
> > (Here $numFetchers has a value of 15)
> >
> > The *generate.max.count*, *generate.count.mode* and *topN* are all
> > default values, meaning I am not providing any values for them.
> >
> > Also, the crawldb status before the generate phase is as shown below. It
> > shows that the number of unfetched URLs is more than *75 million*, so
> > it's not that there are not enough URLs for Generate to generate multiple
> > segments.
> >
> > *CrawlDB status*
> > *db_fetched=318708*
> > *db_gone=4774*
> > *db_notmodified=2274*
> > *db_redir_perm=2253*
> > *db_redir_temp=2527*
> > *db_unfetched=75246666*
> >
> > However, I do see this message in the logs consistently during the generate
> > phase.
> >
> > *Generator: jobtracker is 'local', generating exactly one partition.*
> >
> > Is this "one partition" referring to the single segment that is going
> > to be generated? If so, how do I address this?
> >
> >
> > I feel like I have exhausted all the options, but I am unable to have the
> > generate phase generate more than one segment at a time.
> >
> > Can someone let me know if there is anything else that I should be trying
> > here?
> >
> > *Thanks and any help is much appreciated!*
> >
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
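
For anyone finding this thread later, here is a rough, self-contained sketch of the behaviour described above. It is not the Nutch source and not necessarily the change Meraj made: the class and method names are hypothetical, a plain Map stands in for the Hadoop job configuration, and keying the check on mapreduce.framework.name is only one assumed way of making it YARN-aware. The point it illustrates is that on YARN the MRv1 property mapred.job.tracker is typically left unset, so a check against it sees 'local' and collapses the requested number of fetch partitions to one.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {

    // Hypothetical helper: a check keyed on the MRv1 property.
    // If mapred.job.tracker is unset it is treated as "local",
    // and the requested partition count is overridden to 1.
    static int mrv1Partitions(Map<String, String> conf, int numFetchers) {
        String tracker = conf.getOrDefault("mapred.job.tracker", "local");
        if ("local".equals(tracker) && numFetchers != 1) {
            return 1;
        }
        return numFetchers;
    }

    // Hypothetical alternative (an assumption, not the fix from this thread):
    // key the same check on mapreduce.framework.name, which is "yarn"
    // when jobs are submitted to a YARN cluster.
    static int yarnAwarePartitions(Map<String, String> conf, int numFetchers) {
        String framework = conf.getOrDefault("mapreduce.framework.name", "local");
        if ("local".equals(framework) && numFetchers != 1) {
            return 1;
        }
        return numFetchers;
    }

    public static void main(String[] args) {
        // A YARN-style configuration: no mapred.job.tracker entry at all.
        Map<String, String> yarnConf = new HashMap<>();
        yarnConf.put("mapreduce.framework.name", "yarn");

        int requested = 15; // the $numFetchers value from the thread

        System.out.println("MRv1-style check:  " + mrv1Partitions(yarnConf, requested));
        System.out.println("YARN-aware check:  " + yarnAwarePartitions(yarnConf, requested));
    }
}
```

With the YARN-style configuration the first check yields a single partition, which matches the single fetch map task reported in the thread, while the second keeps all 15. Each partition of the one segment is then handled by its own fetch map task, so multiple segments are not needed for parallel fetching.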

