Thanks Sebastian,

It's interesting to note that you have a patch to write directly to S3. I will check it out. I am curious how you approached shutting down the EMR cluster for Nutch. Did you do that in the shell script by checking the exit status of the crawl command?
Will CloudFormation make my job easier, or will it lack the flexibility of a shell script? Has anyone tried that approach?

Thanks,
Srini

On Thu, Jan 26, 2017 at 1:58 AM, Sebastian Nagel <[email protected]> wrote:
> Hi,
>
> > I would like to export the crawled output to s3
> > (already have the seed file stored in s3)
>
> Please, also have a look at
> https://issues.apache.org/jira/browse/NUTCH-2281
> (would be great to have a second test for the patch / pull request)
>
> At a first glance, all 3 approaches seem feasible.
> Personally, I only have experience with shell scripting
> and AWS CLI commands to launch the cluster. It's quite
> flexible, but sometimes cumbersome.
>
> Best,
> Sebastian
>
> On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
> > Hi Nutch users,
> >
> > I am trying to run a nutch crawler periodically on a schedule (like a cron
> > job). I am running my nutch setup in AWS EMR to avoid setting up and
> > maintaining infrastructure. I would like to export the crawled output to s3
> > (already have the seed file stored in s3) and then terminate the EMR
> > cluster as my nutch job would not run for more than half a day (atleast for
> > now).
> >
> > Here is my question:
> >
> > How can i automate the AWS EMR cluster creation with nutch installed and my
> > configurations (both emr and nutch) updated and also terminate the cluster
> > once nutch finishes ?
> >
> > Here are some ideas i can think of, purely based on my reading not tried
> > any of them yet.
> >
> > - write a script using AWS CLI commands to create the emr cluster and run
> >   the nutch job and terminate once its done
> > - use cloudformation to create the emr cluster with necessary application
> >   (nutch in this case)
> > - use AWS data pipeline and create a schedule and pipeline for this flow (i
> >   dont know whether data pipeline can achieve what i want)
> >
> > I would be curious to hear how others approached similar requirement.
> >
> > Thanks
> > Srini
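
For reference, the first idea in the list above (a script around AWS CLI commands that runs the crawl and then terminates the cluster based on the crawl's exit status) could look roughly like the sketch below. This is only an illustration, not how anyone in this thread actually does it. The function names, the cluster name, the release label, the instance settings and the crawl command are all hypothetical placeholders; only the `aws emr create-cluster` and `aws emr terminate-clusters` subcommands and the flags shown are taken from the AWS CLI.

```python
import subprocess
from typing import Callable, List, Sequence


def default_run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def create_cluster_cmd(name: str, release_label: str,
                       instance_type: str, instance_count: int) -> List[str]:
    """Build the AWS CLI call that launches an EMR cluster.

    All argument values are placeholders chosen by the caller; installing
    Nutch itself (bootstrap action, step, etc.) is not covered here.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        # Print only the new cluster id so a wrapper script can capture it.
        "--query", "ClusterId", "--output", "text",
    ]


def terminate_cluster_cmd(cluster_id: str) -> List[str]:
    """Build the AWS CLI call that terminates one EMR cluster."""
    return ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id]


def crawl_then_terminate(crawl_cmd: Sequence[str], cluster_id: str,
                         run: Callable[[Sequence[str]], int] = default_run) -> int:
    """Run the crawl, then always terminate the cluster.

    The crawl's exit status is returned so a scheduler such as cron can
    tell a failed crawl from a successful one.
    """
    try:
        status = run(crawl_cmd)
    finally:
        # Terminate even if the crawl failed, so the cluster does not keep running.
        run(terminate_cluster_cmd(cluster_id))
    return status
```

Passing the command runner in as a parameter keeps the control flow testable without touching AWS. How the crawl command actually reaches the cluster (SSH, an EMR step, or something else) is left open here.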

