Jake, I am not sure how to make that happen. Every time I run the Nutch 1.7 job on YARN, I see a single segment being generated and a single map task being launched, underutilizing the capacity of the cluster and slowing the crawl.
Are you suggesting I should be seeing multiple fetch map tasks for a single segment? If so, I am not. Thanks.

On Sep 19, 2014 5:13 PM, "Jake Dodd" <j...@ontopic.io> wrote:

> Hi Meraj,
>
> Nutch and Hadoop abstract all of that for you, so you don’t need to worry
> about it. When you execute the fetch command for a segment, it will be
> parallelized across the nodes in your cluster.
>
> Cheers
>
> Jake
>
> On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <mera...@gmail.com> wrote:
>
> > Julien,
> >
> > How would you achieve parallelism then on a Hadoop cluster? Am I missing
> > something here? My understanding was that we could scale the crawl by
> > allowing fetch to happen in multiple map tasks on multiple nodes in a
> > Hadoop cluster; otherwise I am stuck sequentially crawling a large set
> > of URLs spread across multiple domains.
> >
> > If that is indeed the way to scale the crawl, then we would need to
> > generate multiple segments at generate time so that these could be
> > fetched in parallel.
> >
> > So I guess I really need help in:
> >
> > 1. Making the generate phase generate multiple segments
> > 2. Being able to fetch these segments in parallel
> >
> > Can you please let me know if my approach to scaling the crawl sounds
> > right to you?
> >
> > Thanks, and much appreciated, all the help I have gotten so far....
> >
> > On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> >> The fetching operates segment by segment and won't fetch more than one
> >> at the same time. You can get the generation step to build multiple
> >> segments in one go, but you'd need to modify the script so that the
> >> fetching step is called as many times as you have segments, plus you'd
> >> probably need to add some logic for detecting that they've all finished
> >> before you move on to the update step.
> >> Out of curiosity: why do you want to fetch multiple segments at the
> >> same time?
> >>
> >> On 19 September 2014 06:00, Meraj A. Khan <mera...@gmail.com> wrote:
> >>
> >>> Hello Folks,
> >>>
> >>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop
> >>> YARN.
> >>>
> >>> Based on Julien's suggestion I am using the bin/crawl script and made
> >>> the following tweaks to trigger a fetch with multiple map tasks;
> >>> however, I am unable to do so.
> >>>
> >>> 1. Added the maxNumSegments and numFetchers parameters to the generate
> >>> phase.
> >>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
> >>> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
> >>>
> >>> 2. Removed the topN parameter and removed the noParsing parameter
> >>> because I want the parsing to happen at the time of fetch.
> >>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> >>> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
> >>>
> >>> The generate phase is not generating more than one segment.
> >>>
> >>> As a result, the fetch phase is not creating multiple map tasks. Also,
> >>> I believe that the way the script is written, it does not allow the
> >>> fetch to fetch multiple segments in parallel even if the generate
> >>> phase were to generate multiple segments.
> >>>
> >>> Can someone please let me know how they got the script to run in a
> >>> distributed Hadoop cluster? Or if there is a different version of the
> >>> script that should be used?
> >>>
> >>> Thanks.
> >>
> >>
> >> --
> >>
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
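For anyone following this thread, here is a minimal sketch of the change Julien describes: fetch every generated segment in parallel and wait for all of them before the update step. This is an illustration, not the stock bin/crawl script. The variable names ($CRAWL_PATH, $commonOptions, $numThreads, $timeLimitFetch) are borrowed from the commands quoted above, $FETCH_CMD is a hypothetical stand-in for the real bin/nutch fetch invocation, and the wait-based completion check is one possible approach, not necessarily what your script should do verbatim.

```shell
#!/usr/bin/env bash
# Sketch: after "nutch generate ... -maxNumSegments N" has produced several
# segments, launch one fetch per segment in the background, then block until
# every fetch has finished before moving on to updatedb.
#
# Note the stock script only ever picks the newest segment with something
# roughly like:
#   SEGMENT=$(ls "$CRAWL_PATH/segments/" | sort -n | tail -n 1)
# which is why only one segment gets fetched per round.

fetch_all_segments() {
  local segments_dir="$1"   # e.g. "$CRAWL_PATH/segments" on a local FS;
                            # on HDFS you would list with "hadoop fs -ls".
  local pids=()
  local segment
  for segment in "$segments_dir"/*/; do
    segment=${segment%/}
    # In the real script this line would be something like:
    #   $bin/nutch fetch $commonOptions \
    #     -D fetcher.timelimit.mins=$timeLimitFetch \
    #     "$segment" -threads $numThreads &
    "$FETCH_CMD" "$segment" &      # $FETCH_CMD: hypothetical stand-in
    pids+=("$!")
  done
  # Completion check: wait for every background fetch, failing if any failed.
  local pid
  for pid in "${pids[@]}"; do
    wait "$pid" || return 1
  done
}
```

Each background invocation submits its own Hadoop fetch job, so with -maxNumSegments above 1 this launches one job per segment; whether running them concurrently actually helps depends on cluster capacity and your politeness settings.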