Jake, I am not sure how to make that happen. Every time I run the Nutch 1.7 job on YARN, I see a single segment being generated and a single map task being launched, underutilizing the capacity of the cluster and slowing the crawl.
Are you suggesting I should be seeing multiple fetch map tasks for a single segment? If so, I am not. Thanks.

On Sep 19, 2014 5:13 PM, "Jake Dodd" <j...@ontopic.io> wrote:

> Hi Meraj,
>
> Nutch and Hadoop abstract all of that for you, so you don’t need to worry
> about it. When you execute the fetch command for a segment, it will be
> parallelized across the nodes in your cluster.
>
> Cheers
>
> Jake
>
> On Sep 19, 2014, at 1:52 PM, Meraj A. Khan <mera...@gmail.com> wrote:
>
> > Julien,
> >
> > How would you achieve parallelism then on a Hadoop cluster? Am I missing
> > something here? My understanding was that we could scale the crawl by
> > allowing fetch to happen in multiple map tasks on multiple nodes in a
> > Hadoop cluster; otherwise I am stuck sequentially crawling a large set
> > of URLs spread across multiple domains.
> >
> > If that is indeed the way to scale the crawl, then we would need to
> > generate multiple segments at generate time so that these could be
> > fetched in parallel.
> >
> > So I guess I really need help in:
> >
> > 1. Making the generate phase generate multiple segments
> > 2. Being able to fetch these segments in parallel
> >
> > Can you please let me know if my approach to scaling the crawl sounds
> > right to you?
> >
> > Thanks, and much appreciated, all the help I have gotten so far....
> >
> > On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche <
> > lists.digitalpeb...@gmail.com> wrote:
> >
> >> The fetching operates segment by segment and won't fetch more than one
> >> at the same time. You can get the generation step to build multiple
> >> segments in one go, but you'd need to modify the script so that the
> >> fetching step is called as many times as you have segments, plus you'd
> >> probably need to add some logic for detecting that they've all finished
> >> before you move on to the update step.
> >> Out of curiosity: why do you want to fetch multiple segments at the
> >> same time?
> >>
> >> On 19 September 2014 06:00, Meraj A. Khan <mera...@gmail.com> wrote:
> >>
> >>> Hello Folks,
> >>>
> >>> I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop
> >>> YARN.
> >>>
> >>> Based on Julien's suggestion I am using the bin/crawl script and made
> >>> the following tweaks to trigger a fetch with multiple map tasks;
> >>> however, I am unable to do so.
> >>>
> >>> 1. Added the maxNumSegments and numFetchers parameters to the generate
> >>> phase.
> >>> $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
> >>> -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
> >>>
> >>> 2. Removed the topN parameter and removed the noParsing parameter
> >>> because I want the parsing to happen at the time of fetch.
> >>> $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
> >>> $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
> >>>
> >>> The generate phase is not generating more than one segment.
> >>>
> >>> As a result, the fetch phase is not creating multiple map tasks. Also,
> >>> I believe that the way the script is written, it does not allow the
> >>> fetch to fetch multiple segments in parallel even if the generate
> >>> phase were to generate multiple segments.
> >>>
> >>> Can someone please let me know how they got the script to run in a
> >>> distributed Hadoop cluster? Or if there is a different version of the
> >>> script that should be used?
> >>>
> >>> Thanks.
> >>
> >>
> >> --
> >>
> >> Open Source Solutions for Text Engineering
> >>
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
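For anyone following this thread, here is a minimal sketch of the change Julien describes: fetch every generated segment in parallel and wait for all of them before the update step. This is an illustration, not the stock bin/crawl script. The variable names ($CRAWL_PATH, $commonOptions, $numThreads, $timeLimitFetch) are borrowed from the commands quoted above, $FETCH_CMD is a hypothetical stand-in for the real bin/nutch fetch invocation, and the wait-based completion check is one possible approach, not necessarily what your script should do verbatim.

```shell
#!/usr/bin/env bash
# Sketch: after "nutch generate ... -maxNumSegments N" has produced several
# segments, launch one fetch per segment in the background, then block until
# every fetch has finished before moving on to updatedb.
#
# Note the stock script only ever picks the newest segment with something
# roughly like:
#   SEGMENT=$(ls "$CRAWL_PATH/segments/" | sort -n | tail -n 1)
# which is why only one segment gets fetched per round.

fetch_all_segments() {
  local segments_dir="$1"   # e.g. "$CRAWL_PATH/segments" on a local FS;
                            # on HDFS you would list with "hadoop fs -ls".
  local pids=()
  local segment
  for segment in "$segments_dir"/*/; do
    segment=${segment%/}
    # In the real script this line would be something like:
    #   $bin/nutch fetch $commonOptions \
    #     -D fetcher.timelimit.mins=$timeLimitFetch \
    #     "$segment" -threads $numThreads &
    "$FETCH_CMD" "$segment" &      # $FETCH_CMD: hypothetical stand-in
    pids+=("$!")
  done
  # Completion check: wait for every background fetch, failing if any failed.
  local pid
  for pid in "${pids[@]}"; do
    wait "$pid" || return 1
  done
}
```

Each background invocation submits its own Hadoop fetch job, so with -maxNumSegments above 1 this launches one job per segment; whether running them concurrently actually helps depends on cluster capacity and your politeness settings.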