Thanks Sebastian. It's interesting to note that you have a patch to write
directly to S3; I will check it out. I am curious how you approached shutting
down the EMR cluster for Nutch. Did you do that from the shell script, by
checking the exit status of the crawl command?
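
To make the exit-status idea concrete, here is a rough sketch of what I have in mind. It is untested against a real cluster and is not Sebastian's method: the function name run_crawl_then_terminate, the cluster id, and the crawl command are placeholders of mine, and it assumes the AWS CLI is installed and configured wherever the script runs. The only AWS call used is `aws emr terminate-clusters --cluster-ids`.

```shell
#!/bin/sh
# Hypothetical sketch: run a crawl command, then terminate the EMR cluster
# whatever the outcome, and hand the crawl's exit status back to the caller.
# Usage: run_crawl_then_terminate <cluster-id> <crawl command> [args...]

run_crawl_then_terminate() {
    cluster_id="$1"
    shift

    # Run the crawl command (everything after the cluster id) and capture
    # its exit status without aborting if the caller uses `set -e`.
    crawl_status=0
    "$@" || crawl_status=$?

    if [ "$crawl_status" -ne 0 ]; then
        echo "crawl failed with exit status $crawl_status" >&2
    fi

    # Terminate the cluster even when the crawl failed, so that a broken
    # run does not leave the cluster up.
    aws emr terminate-clusters --cluster-ids "$cluster_id"

    return "$crawl_status"
}
```

It would be called as `run_crawl_then_terminate "$CLUSTER_ID" <crawl command and arguments>`, with the return value telling a cron wrapper whether the crawl itself succeeded.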

Will CloudFormation make my job easier, or will it lack the flexibility
of a shell script? Has anyone tried that approach?

Thanks
Srini




On Thu, Jan 26, 2017 at 1:58 AM, Sebastian Nagel <[email protected]
> wrote:

> Hi,
>
> > I would like to export the crawled output to s3
> > (already have the seed file stored in s3)
>
> Please, also have a look at
>   https://issues.apache.org/jira/browse/NUTCH-2281
> (would be great to have a second test for the patch / pull request)
>
> At first glance, all 3 approaches seem feasible.
> Personally, I only have experience with shell scripting
> and AWS CLI commands to launch the cluster. It's quite
> flexible, but sometimes cumbersome.
>
> Best,
> Sebastian
>
> On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
> > Hi Nutch users,
> >
> > I am trying to run a Nutch crawler periodically on a schedule (like a
> > cron job). I am running my Nutch setup in AWS EMR to avoid setting up
> > and maintaining infrastructure. I would like to export the crawled
> > output to s3 (already have the seed file stored in s3) and then
> > terminate the EMR cluster, as my Nutch job would not run for more than
> > half a day (at least for now).
> >
> > Here is my question:
> >
> > How can I automate the AWS EMR cluster creation with Nutch installed
> > and my configurations (both EMR and Nutch) updated, and also terminate
> > the cluster once Nutch finishes?
> >
> > Here are some ideas I can think of, purely based on my reading; I have
> > not tried any of them yet.
> >
> > - write a script using AWS CLI commands to create the EMR cluster, run
> > the Nutch job, and terminate the cluster once it's done
> > - use CloudFormation to create the EMR cluster with the necessary
> > application (Nutch in this case)
> > - use AWS Data Pipeline and create a schedule and pipeline for this flow
> > (I don't know whether Data Pipeline can achieve what I want)
> >
> > I would be curious to hear how others approached a similar requirement.
> >
> > Thanks
> > Srini
> >
>
>
