Thanks for sharing this, Meraj. It's already proving useful to other users.
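
For anyone hitting the same symptom, here is a minimal sketch of the kind of check described in the update quoted below. It is not the actual Generator.java code: the class and method names are hypothetical, and a plain Map stands in for the Hadoop job configuration. Consulting mapreduce.framework.name on YARN is an assumption on my part, since the update does not spell out the exact change that was made.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionCountSketch {

    // MRv1 property named in the thread.
    static final String MRV1_KEY = "mapred.job.tracker";
    // MRv2/YARN property; assumed here to be the one worth checking instead.
    static final String MRV2_KEY = "mapreduce.framework.name";

    /** MRv1-style check: an unset job tracker looks like "local". */
    static int legacyPartitions(Map<String, String> conf, int requested) {
        String tracker = conf.getOrDefault(MRV1_KEY, "local");
        return "local".equals(tracker) ? 1 : requested;
    }

    /** YARN-aware variant: collapse to one partition only for a local run. */
    static int yarnAwarePartitions(Map<String, String> conf, int requested) {
        String framework = conf.getOrDefault(MRV2_KEY, "local");
        return "local".equals(framework) ? 1 : requested;
    }

    public static void main(String[] args) {
        // A YARN cluster typically sets only the MRv2 property.
        Map<String, String> conf = new HashMap<>();
        conf.put(MRV2_KEY, "yarn");

        System.out.println(legacyPartitions(conf, 15));    // prints 1
        System.out.println(yarnAwarePartitions(conf, 15)); // prints 15
    }
}
```

With only the MRv2 property set, the legacy check sees no job tracker, falls back to "local" and forces a single partition, which would match the "generating exactly one partition" log message in the original question.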

On 25 September 2014 17:04, Meraj A. Khan <[email protected]> wrote:

> Just wanted to update everyone: the issue of fetch running as a single map
> task occurred because Generator.java has logic keyed on the MRv1 property
> *mapred.job.tracker*. Since I am running this on YARN, I had to change that
> logic, and now multiple fetch tasks operate on a single segment.
>
> Also, I had misunderstood that multiple segments would need to be generated
> to achieve parallelism. That does not seem to be the case: parallelism at
> fetch time is achieved by having multiple fetch tasks operate on a single
> segment.
>
> Thanks, everyone, for your help in resolving this issue.
>
>
>
> On Wed, Sep 24, 2014 at 6:14 PM, Meraj A. Khan <[email protected]> wrote:
>
> > Folks,
> >
> > As mentioned previously, I am running Nutch 1.7 on an Apache Hadoop YARN
> > cluster.
> >
> > In order to scale, I need to fetch concurrently with multiple map tasks
> > on multiple nodes. I think the first step would be to generate multiple
> > segments in the generate phase so that multiple fetch map tasks can
> > operate in parallel. To generate multiple segments at generate time I
> > made the following changes, but unfortunately without success.
> >
> > I tweaked bin/crawl as follows: I added the *maxNumSegments* and
> > *numFetchers* parameters to the call to generate in the *bin/crawl*
> > script, as can be seen below.
> >
> >
> > *$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb
> > $CRAWL_PATH/segments -maxNumSegments $numFetchers -numFetchers
> > $numFetchers -noFilter*
> >
> > (Here $numFetchers has a value of 15)
> >
> > *generate.max.count*, *generate.count.mode* and *topN* are all left at
> > their default values; I am not providing any values for them.
> >
> > Also, the crawldb status before the generate phase is shown below. The
> > number of unfetched URLs is more than *75 million*, so it is not that
> > there are too few URLs for generate to produce multiple segments.
> >
> > * CrawlDB status*
> > * db_fetched=318708*
> > * db_gone=4774*
> > * db_notmodified=2274*
> > * db_redir_perm=2253*
> > * db_redir_temp=2527*
> > * db_unfetched=75246666*
> >
> > However, I consistently see this message in the logs during the generate
> > phase.
> >
> >  *Generator: jobtracker is 'local', generating exactly one partition.*
> >
> > Does this "one partition" refer to the single segment that is going to
> > be generated? If so, how do I address this?
> >
> >
> > I feel like I have exhausted all the options but I am unable to have the
> > Generate phase generate more than one segment at a time.
> >
> > Can someone let me know if there is anything else that I should be trying
> > here?
> >
> > *Thanks and any help is much appreciated!*
> >
> >
> >
>
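
On the second point in the update, that fetch parallelism comes from several fetch tasks sharing one segment: as far as I understand Nutch 1.x, the generator splits the selected URLs into one fetch list per requested fetcher inside the segment, and each fetch map task consumes one list. The sketch below only illustrates that idea. The class name, the method name and the plain host-hash rule are hypothetical stand-ins, not Nutch's actual partitioner.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FetchListSketch {

    /** Assigns each URL to one of numFetchers lists, keyed by its host. */
    static List<List<String>> partitionByHost(List<String> urls, int numFetchers) {
        List<List<String>> lists = new ArrayList<>();
        for (int i = 0; i < numFetchers; i++) {
            lists.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // Fall back to the whole URL if no host can be parsed.
            String key = (host != null) ? host : url;
            // Mask the sign bit so the index is never negative.
            int index = (key.hashCode() & Integer.MAX_VALUE) % numFetchers;
            lists.get(index).add(url);
        }
        return lists;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://b.example/1", "http://c.example/1");
        List<List<String>> lists = partitionByHost(urls, 3);
        for (int i = 0; i < lists.size(); i++) {
            System.out.println("fetch list " + i + ": " + lists.get(i));
        }
    }
}
```

Keeping all URLs of one host in the same list means a single task handles that host, so one segment can be fetched by several tasks without two of them contacting the same server.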



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
