Just wanted to update everyone: the issue with only a single map task
running for fetch occurred because Generator.java has logic around the MRv1
property *mapred.job.tracker*. Since I am running this on YARN, I had to
change that logic, and now multiple fetch tasks operate on a single
segment.
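
For anyone hitting the same thing, here is a rough, self-contained sketch of the kind of check involved. It is not the actual Generator.java code, and the thread does not spell out the exact change that was made. The class name, method names, and the use of a plain Map in place of the Hadoop job configuration are all hypothetical. The second method shows one possible adjustment, assuming the YARN-era property *mapreduce.framework.name* is the right one to consult.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; a plain Map stands in for the Hadoop job
// configuration so the sketch runs without any Hadoop dependency.
public class PartitionCountSketch {

    // MRv1-style check: if mapred.job.tracker is unset or "local", force a
    // single partition. On YARN this property is typically not set, so the
    // check fires even on a real cluster.
    static int partitionsWithMrv1Check(Map<String, String> conf, int numFetchers) {
        String tracker = conf.getOrDefault("mapred.job.tracker", "local");
        if ("local".equals(tracker) && numFetchers != 1) {
            System.out.println(
                "Generator: jobtracker is 'local', generating exactly one partition.");
            return 1;
        }
        return numFetchers;
    }

    // One possible adjustment (an assumption, not necessarily the change
    // described above): consult mapreduce.framework.name instead.
    static int partitionsWithFrameworkCheck(Map<String, String> conf, int numFetchers) {
        String framework = conf.getOrDefault("mapreduce.framework.name", "local");
        if ("local".equals(framework) && numFetchers != 1) {
            return 1;
        }
        return numFetchers;
    }

    public static void main(String[] args) {
        Map<String, String> yarnConf = new HashMap<>();
        yarnConf.put("mapreduce.framework.name", "yarn");

        // The MRv1-style check collapses 15 requested fetchers to 1.
        System.out.println(partitionsWithMrv1Check(yarnConf, 15));
        // The adjusted check keeps all 15.
        System.out.println(partitionsWithFrameworkCheck(yarnConf, 15));
    }
}
```

With a YARN-style configuration, the first method returns 1 (matching the log message quoted below) and the second returns 15.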

Also, I had wrongly assumed that multiple segments would need to be
generated to achieve parallelism. That does not seem to be the case:
parallelism at fetch time is achieved by having multiple fetch tasks
operate on a single segment.

Thanks everyone for your help on resolving this issue.



On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan <[email protected]> wrote:

> Folks,
>
> As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
> cluster.
>
> In order to scale, I need to fetch concurrently with multiple map
> tasks on multiple nodes. I think the first step is to generate multiple
> segments in the generate phase so that multiple fetch map tasks can
> operate in parallel. To generate multiple segments at generate time I
> made the changes described below, but unfortunately I have been
> unsuccessful.
>
> I tweaked the call to generate in the *bin/crawl* script, adding the
> *maxNumSegments* and *numFetchers* parameters, as can be seen below.
>
>
> *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
> $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers
> -noFilter*
>
> (Here $numFetchers has a value of 15)
>
> *generate.max.count*, *generate.count.mode*, and *topN* are all left at
> their default values, meaning I am not providing any values for them.
>
> Also, the crawldb status before the generate phase is shown below. It
> shows that the number of unfetched URLs is more than *75 million*, so
> it's not that there are too few URLs for generate to produce multiple
> segments.
>
> * CrawlDB status*
> * db_fetched=318708*
> * db_gone=4774*
> * db_notmodified=2274*
> * db_redir_perm=2253*
> * db_redir_temp=2527*
> * db_unfetched=75246666*
>
> However, I do see this message in the logs consistently during the
> generate phase.
>
>  *Generator: jobtracker is 'local', generating exactly one partition.*
>
> Is this "one partition" referring to the single segment that is going
> to be generated? If so, how do I address this?
>
>
> I feel like I have exhausted all the options but I am unable to have the
> Generate phase generate more than one segment at a time.
>
> Can someone let me know if there is anything else that I should be trying
> here?
>
> *Thanks and any help is much appreciated!*
>
>
>
