Hi Meraj,

Nutch and Hadoop abstract all of that for you, so you don't need to worry about it. When you execute the fetch command for a segment, it is parallelized across the nodes in your cluster.
Cheers,
Jake

On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <[email protected]> wrote:

> Julien,
>
> How would you achieve parallelism then on a Hadoop cluster? Am I missing
> something here? My understanding was that we could scale the crawl by
> allowing the fetch to happen in multiple map tasks on multiple nodes in a
> Hadoop cluster; otherwise I am stuck sequentially crawling a large set of
> URLs spread across multiple domains.
>
> If that is indeed the way to scale the crawl, then we would need to
> generate multiple segments at generate time so that they could be fetched
> in parallel.
>
> So I really need help with:
>
> 1. Making the generate phase produce multiple segments.
> 2. Being able to fetch those segments in parallel.
>
> Can you please let me know if my approach to scaling the crawl sounds
> right to you?
>
> Thanks, and much appreciated, all the help I have gotten so far...
>
>
> On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> [email protected]> wrote:
>
>> The fetching operates segment by segment and won't fetch more than one at
>> the same time. You can get the generation step to build multiple segments
>> in one go, but you'd need to modify the script so that the fetching step
>> is called as many times as you have segments, and you'd probably need to
>> add some logic to detect that they have all finished before you move on
>> to the update step.
>>
>> Out of curiosity: why do you want to fetch multiple segments at the same
>> time?
>>
>> On 19 September 2014 06:00, Meraj A. Khan <[email protected]> wrote:
>>
>>> Hello Folks,
>>>
>>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
>>>
>>> Based on Julien's suggestion I am using the bin/crawl script and made the
>>> following tweaks to trigger a fetch with multiple map tasks; however, I
>>> am unable to do so.
>>>
>>> 1. Added the maxNumSegments and numFetchers parameters to the generate
>>> phase:
>>>
>>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments \
>>>   -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
>>>
>>> 2. Removed the topN parameter and removed the noParsing parameter because
>>> I want the parsing to happen at fetch time:
>>>
>>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
>>>   $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
>>>
>>> The generate phase is not generating more than one segment.
>>>
>>> As a result, the fetch phase is not creating multiple map tasks. Also, I
>>> believe that the way the script is written, it does not allow the fetch
>>> to fetch multiple segments in parallel even if generate were to produce
>>> multiple segments.
>>>
>>> Can someone please let me know how they got the script to run on a
>>> distributed Hadoop cluster? Or is there a different version of the
>>> script that should be used?
>>>
>>> Thanks.
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
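[Editor's note] The modification Julien describes in the quoted thread — calling the fetch step once per generated segment and detecting that all fetches have finished before the update step — could be sketched roughly as below. This is a hedged sketch, not a drop-in replacement for bin/crawl: it reuses the variables from the script quoted above ($bin, $commonOptions, $CRAWL_PATH, $numFetchers, $numThreads, $timeLimitFetch), assumes a local filesystem for the segment listing (on a distributed cluster you would list segments with hadoop fs -ls instead of ls), and the updatedb invocation would need checking against your Nutch version's usage.

```shell
#!/bin/bash
# Sketch: generate several segments, fetch them concurrently, then wait
# for all fetches before updating the crawldb.

# 1. Generate up to $numFetchers segments in one go.
$bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments \
  -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter

# 2. Launch one fetch job per segment in the background ('&').
#    (Distributed mode: replace 'ls' with 'hadoop fs -ls' and parse paths.)
for SEGMENT in $(ls "$CRAWL_PATH/segments/"); do
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch \
    "$CRAWL_PATH/segments/$SEGMENT" -threads $numThreads &
done

# 3. Block here until every background fetch has exited -- this is the
#    "detect that they've all finished" logic Julien mentions.
wait

# 4. Only now move on to the update step, passing all fetched segments.
$bin/nutch updatedb $commonOptions $CRAWL_PATH/crawldb \
  $CRAWL_PATH/segments/*
```

Note that running several fetch jobs concurrently will compete for map slots on the cluster, so whether this actually improves throughput over one large segment depends on your cluster capacity and politeness settings.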

