Hi Meraj,

Nutch and Hadoop abstract all of that for you, so you don’t need to worry about 
it. When you execute the fetch command for a segment, it will be parallelized 
across the nodes in your cluster.
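
That said, if you did want several segments in flight from one script, something along these lines could work. This is an untested sketch, not the stock bin/crawl script: it assumes the standard bin/nutch CLI, the usual $CRAWL_PATH/segments/<timestamp> layout, and hypothetical values for the paths, segment count, and thread count.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: generate several segments, fetch them concurrently,
# then update the crawldb once every fetch has finished.
CRAWL_PATH=crawl      # assumed crawl directory
numFetchers=4         # assumed number of segments / fetch jobs

# 1. Ask the generator for up to $numFetchers segments in one pass.
bin/nutch generate "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments" \
  -maxNumSegments "$numFetchers" -numFetchers "$numFetchers" -noFilter

# 2. Launch one fetch per segment in the background.
for SEGMENT in "$CRAWL_PATH"/segments/*; do
  bin/nutch fetch "$SEGMENT" -threads 50 &
done

# 3. Block until every background fetch has exited -- the "detect that
#    they've all finished" logic -- before moving on to the update step.
wait

bin/nutch updatedb "$CRAWL_PATH/crawldb" "$CRAWL_PATH"/segments/*
```

Note this only parallelizes the driver-side job submissions; within each fetch job, Hadoop still handles the map-task parallelism across the cluster.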

Cheers

Jake

On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <[email protected]> wrote:

> Julien,
> 
> How would you achieve parallelism on a Hadoop cluster then? Am I missing
> something here? My understanding was that we could scale the crawl by
> allowing the fetch to happen in multiple map tasks on multiple nodes in a
> Hadoop cluster; otherwise I am stuck sequentially crawling a large set
> of URLs spread across multiple domains.
> 
> If that is indeed the way to scale the crawl, then we would need to
> generate multiple segments at generate time so that they could be
> fetched in parallel.
> 
> So I guess I really need help with:
> 
> 
>   1. Making the generate phase produce multiple segments.
>   2. Being able to fetch these segments in parallel.
> 
> 
> Can you please let me know if my approach to scaling the crawl sounds right
> to you?
> 
> 
> Thanks, and I much appreciate all the help I have gotten so far.
> 
> 
> 
> On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> [email protected]> wrote:
> 
>> The fetching operates segment by segment and won't fetch more than one at
>> the same time. You can get the generation step to build multiple segments
>> in one go, but you'd need to modify the script so that the fetching step is
>> called as many times as you have segments, and you'd probably need to add
>> some logic to detect that they've all finished before you move on to the
>> update step.
>> Out of curiosity: why do you want to fetch multiple segments at the same
>> time?
>> 
>> On 19 September 2014 06:00, Meraj A. Khan <[email protected]> wrote:
>> 
>>> Hello Folks,
>>> 
>>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
>>> 
>>> Based on Julien's suggestion, I am using the bin/crawl script and made the
>>> following tweaks to trigger a fetch with multiple map tasks; however, I
>>> am unable to do so.
>>> 
>>> 1. Added the maxNumSegments and numFetchers parameters to the generate
>>> phase:
>>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
>>> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
>>> 
>>> 2. Removed the topN parameter and the noParsing parameter, because I want
>>> the parsing to happen at fetch time:
>>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
>>> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
>>> 
>>> The generate phase is not generating more than one segment.
>>> 
>>> And as a result the fetch phase is not creating multiple map tasks. Also,
>>> I believe that, the way the script is written, it does not allow the fetch
>>> to fetch multiple segments in parallel even if the generate phase were to
>>> generate multiple segments.
>>> 
>>> Can someone please let me know how they got the script to run in a
>>> distributed Hadoop cluster? Or is there a different version of the script
>>> that should be used?
>>> 
>>> Thanks.
>>> 
>> 
>> 
>> 
>> --
>> 
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>> 
