Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
Julien,

How would you achieve parallelism then on a Hadoop cluster? Am I missing
something here? My understanding was that we could scale the crawl by
allowing the fetch to happen in multiple map tasks on multiple nodes in a
Hadoop cluster; otherwise I am stuck sequentially crawling a large set
of URLs spread across multiple domains.

If that is indeed the way to scale the crawl, then we would need to
generate multiple segments at generate time so that they can be
fetched in parallel.

So I guess I really need help with:


   1. Making the generate phase generate multiple segments
   2. Being able to fetch these segments in parallel.


Can you please let me know whether my approach to scaling the crawl sounds
right to you?


Thanks, and I much appreciate all the help I have gotten so far.



On Fri, Sep 19, 2014 at 10:40 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 The fetching operates segment by segment and won't fetch more than one at
 the same time. You can get the generation step to build multiple segments
 in one go, but you'd need to modify the script so that the fetching step is
 called as many times as you have segments, and you'd probably need to add
 some logic to detect that they've all finished before you move on to the
 update step.
 Out of curiosity: why do you want to fetch multiple segments at the same
 time?
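The loop described above might look something like this sketch. Here `fetch_segment` is a hypothetical stand-in for the real `$bin/nutch fetch ... -threads $numThreads` call, and the simulated segments directory exists only to make the sketch self-contained:

```shell
#!/usr/bin/env bash
# Sketch of the bin/crawl change: fetch every generated segment in a
# background job, then wait for all of them before moving on to updatedb.
set -eu

# Hypothetical stand-in for: $bin/nutch fetch $commonOptions <segment> -threads $numThreads
fetch_segment() {
  echo "fetching $1"
}

# Simulate a segments directory holding three freshly generated segments.
segdir=$(mktemp -d)
mkdir -p "$segdir"/20140919000001 "$segdir"/20140919000002 "$segdir"/20140919000003

njobs=0
for seg in "$segdir"/*; do
  fetch_segment "$seg" &    # one fetch job per segment, run in parallel
  njobs=$((njobs + 1))
done
wait   # do not start the update step until every fetch job has finished
echo "fetched $njobs segments"
rm -rf "$segdir"
```

In the real script, the `wait` would be followed by the updatedb step, run once over all the fetched segments.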

 On 19 September 2014 06:00, Meraj A. Khan mera...@gmail.com wrote:

  Hello Folks,
 
  I am unable to run multiple fetch map tasks for Nutch 1.7 on Hadoop YARN.
 
  Based on Julien's suggestion I am using the bin/crawl script and made the
  following tweaks to trigger a fetch with multiple map tasks; however, I am
  still unable to do so.
 
  1. Added maxNumSegments and numFetchers parameters to the generate phase:
  $bin/nutch generate $commonOptions $CRAWL_PATH/crawldb $CRAWL_PATH/segments
  -maxNumSegments $numFetchers -numFetchers $numFetchers -noFilter
 
  2. Removed the topN parameter, and removed the noParsing parameter because
  I want the parsing to happen at fetch time:
  $bin/nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch
  $CRAWL_PATH/segments/$SEGMENT -threads $numThreads #-noParsing#
 
  The generate phase is not generating more than one segment.
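One thing worth double-checking in the generate call above: if I remember the Generator's semantics correctly, -topN acts as a per-segment cap, so dropping it can let every selected URL fit into a single segment, leaving -maxNumSegments with nothing to split. A hedged variant of the command (the 50000 cap is an arbitrary illustration, not a recommended value):

```shell
# Hypothetical variant: keep a per-segment -topN so the Generator can
# actually spill the fetch list into up to $numFetchers segments.
# (Assumption: without -topN, all selected URLs land in one segment.)
$bin/nutch generate $commonOptions "$CRAWL_PATH/crawldb" "$CRAWL_PATH/segments" \
    -topN 50000 -maxNumSegments "$numFetchers" -numFetchers "$numFetchers" -noFilter
```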
 
  As a result, the fetch phase is not creating multiple map tasks. Also, I
  believe that, the way the script is written, it does not allow the fetch
  to fetch multiple segments in parallel even if the generate step were to
  produce multiple segments.
 
  Can someone please let me know how they got the script to run in a
  distributed Hadoop cluster? Or is there a different version of the script
  that should be used?
 
  Thanks.
 



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble



Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Jake Dodd
Hi Meraj,

Nutch and Hadoop abstract all of that for you, so you don’t need to worry about 
it. When you execute the fetch command for a segment, it will be parallelized 
across the nodes in your cluster.

Cheers

Jake



Re: Running multiple fetch map tasks on a Hadoop Cluster.

2014-09-19 Thread Meraj A. Khan
Jake,

I am not sure how to make that happen. Every time I run the Nutch 1.7 job
on YARN, I see a single segment being generated and a single map task
being launched, underutilizing the capacity of the cluster and slowing the
crawl.

Are you suggesting that I should be seeing multiple fetch map tasks for a
single segment? If so, I am not.
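One quick sanity check, assuming the usual segment layout in which the Generator writes one fetch-list partition per intended fetch map task under crawl_generate: count the part files in the segment to see how many fetch map tasks that segment can drive.

```shell
# Hypothetical check: one fetch map task per crawl_generate partition.
# If this prints 1, the generate step only produced a single fetch list.
hadoop fs -ls "$CRAWL_PATH/segments/$SEGMENT/crawl_generate" | grep -c 'part-'
```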

Thanks.