Folks,

As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
cluster.

In order to scale, I need to fetch concurrently with multiple map tasks on
multiple nodes. I think the first step is to generate multiple segments in
the generate phase so that multiple fetch map tasks can operate in
parallel. To generate multiple segments at generate time I have made the
following changes, but unfortunately I have been unsuccessful so far.

I have tweaked the following parameters in bin/crawl to do so.

I added the *maxNumSegments* and *numFetchers* parameters to the generate
call in the *bin/crawl* script, as can be seen below.


*$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
$CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers $numFetchers
-noFilter*

(Here $numFetchers has a value of 15)

The *generate.max.count*, *generate.count.mode* and *topN* settings are all
at their default values, meaning I am not providing any values for them.

Also, the crawldb status before the generate phase is shown below. The
number of unfetched URLs is more than *75 million*, so it's not that there
are too few URLs for Generate to produce multiple segments.

* CrawlDB status*
* db_fetched=318708*
* db_gone=4774*
* db_notmodified=2274*
* db_redir_perm=2253*
* db_redir_temp=2527*
* db_unfetched=75246666*

However, I do see this message in the logs consistently during the
generate phase.

 *Generator: jobtracker is 'local', generating exactly one partition.*

Is this "one partition" referring to the single segment that is going to
be generated? If so, how do I address this?
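
For reference, this is roughly how I would check which job tracker value
my Hadoop config files actually specify. It is only a minimal sketch: I am
assuming the message above is driven by the mapred.job.tracker property,
and that the config lives in mapred-site.xml under $HADOOP_CONF_DIR. The
helper name get_hadoop_prop is my own, it is not part of Nutch or Hadoop.

```shell
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop *-site.xml file. Plain text matching, not a real XML parser, so it
# assumes each <value>...</value> sits on a single line.
get_hadoop_prop() {
  awk -v p="$2" '
    # Start looking once the line carrying the requested property name is seen.
    index($0, "<name>" p "</name>") { found = 1 }
    # Print the first <value> at or after that line, without the tags.
    found && match($0, /<value>[^<]*<\/value>/) {
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$1"
}

# Example call (property name and path are assumptions on my part):
#   get_hadoop_prop "$HADOOP_CONF_DIR/mapred-site.xml" mapred.job.tracker
```

If this prints nothing, the property is not set in that file and whatever
default applies would be used instead.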


I feel like I have exhausted all the options but I am unable to have the
Generate phase generate more than one segment at a time.

Can someone let me know if there is anything else that I should be trying
here?

*Thanks and any help is much appreciated!*
