Re: create and run a nutch crawler using aws emr on a schedule

Sebastian Nagel Thu, 26 Jan 2017 01:59:26 -0800

Hi,

> I would like to export the crawled output to s3
> (already have the seed file stored in s3)


Please also have a look at
  https://issues.apache.org/jira/browse/NUTCH-2281
(It would be great to have a second test of the patch / pull request.)

At first glance, all three approaches seem feasible.
Personally, I only have experience with shell scripting
and AWS CLI commands to launch the cluster. It's quite
flexible, but sometimes cumbersome.
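
A minimal sketch of the AWS CLI approach, under stated assumptions: the
bucket name, release label, instance settings, and the two helper scripts
(install-nutch.sh, a bootstrap action that copies Nutch onto the nodes, and
run-crawl.sh, which runs the crawl and copies the output to S3) are all
hypothetical placeholders and not part of Nutch or EMR. The script only
prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: launch a transient EMR cluster that runs one Nutch crawl as a
# step and shuts itself down afterwards. Every name below is a placeholder.
set -eu

BUCKET="${BUCKET:-s3://my-crawl-bucket}"   # hypothetical bucket
RELEASE="${RELEASE:-emr-5.3.0}"            # placeholder release label
DRY_RUN="${DRY_RUN:-1}"                    # 1 = print only, 0 = really launch

# One step: command-runner.jar executes a script on the master node.
# run-crawl.sh is a hypothetical script placed there by the bootstrap action.
STEP="Type=CUSTOM_JAR,Name=NutchCrawl,ActionOnFailure=TERMINATE_CLUSTER"
STEP="$STEP,Jar=command-runner.jar,Args=[bash,/home/hadoop/run-crawl.sh]"

# Assemble the command as positional parameters so that each argument is
# passed through unchanged (the step string contains commas and brackets).
set -- aws emr create-cluster \
  --name nutch-crawl \
  --release-label "$RELEASE" \
  --applications Name=Hadoop \
  --use-default-roles \
  --instance-type m4.large \
  --instance-count 3 \
  --log-uri "$BUCKET/logs/" \
  --bootstrap-actions "Path=$BUCKET/bootstrap/install-nutch.sh" \
  --steps "$STEP" \
  --auto-terminate

if [ "$DRY_RUN" = "1" ]; then
  printf '%s\n' "$*"    # show what would be run
else
  "$@"                  # launch the cluster
fi
```

With --auto-terminate the cluster shuts down once the step has finished, so
a plain cron entry on any machine with AWS credentials should be enough to
run this on a schedule.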

Best,
Sebastian

On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
> Hi Nutch users,
> 
> I am trying to run a Nutch crawler periodically on a schedule (like a cron
> job). I am running my Nutch setup in AWS EMR to avoid setting up and
> maintaining infrastructure. I would like to export the crawled output to S3
> (already have the seed file stored in S3) and then terminate the EMR
> cluster, as my Nutch job would not run for more than half a day (at least for
> now).
> 
> Here is my question:
> 
> How can I automate the AWS EMR cluster creation with Nutch installed and my
> configurations (both EMR and Nutch) updated, and also terminate the cluster
> once Nutch finishes?
> 
> Here are some ideas I can think of, based purely on my reading; I have not
> tried any of them yet.
> 
> - write a script using AWS CLI commands to create the EMR cluster, run
> the Nutch job, and terminate the cluster once it's done
> - use CloudFormation to create the EMR cluster with the necessary application
> (Nutch in this case)
> - use AWS Data Pipeline and create a schedule and pipeline for this flow (I
> don't know whether Data Pipeline can achieve what I want)
> 
> I would be curious to hear how others have approached a similar requirement.
> 
> Thanks
> Srini
>
