Hi Srini,

> I will check it out.

Thanks, would like to see whether it works.

> I am curious how you approached shutting
> down the emr cluster for nutch

I'm running Nutch on Cloudera CDH. When the crawl is done (which is
checked manually), a script terminates all EC2 instances of the cluster
(they are identified by a tag).

Best,
Sebastian

On 01/26/2017 07:16 PM, Srinivasan Ramaswamy wrote:
> Thanks Sebastian, it's interesting to note that you have a patch to directly
> write to S3. I will check it out. I am curious how you approached shutting
> down the emr cluster for nutch? Did you do that using the shell script by
> listening to the exit status of the crawl command?
>
> Will cloudformation make my job easier, or will it not have the flexibility
> of using a shell script? Has anyone tried that approach?
>
> Thanks
> Srini
>
> On Thu, Jan 26, 2017 at 1:58 AM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi,
>>
>>> I would like to export the crawled output to s3
>>> (already have the seed file stored in s3)
>>
>> Please, also have a look at
>> https://issues.apache.org/jira/browse/NUTCH-2281
>> (would be great to have a second test for the patch / pull request)
>>
>> At a first glance, all 3 approaches seem feasible.
>> Personally, I only have experience with shell scripting
>> and AWS CLI commands to launch the cluster. It's quite
>> flexible, but sometimes cumbersome.
>>
>> Best,
>> Sebastian
>>
>> On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
>>> Hi Nutch users,
>>>
>>> I am trying to run a nutch crawler periodically on a schedule (like a cron
>>> job). I am running my nutch setup in AWS EMR to avoid setting up and
>>> maintaining infrastructure. I would like to export the crawled output to s3
>>> (already have the seed file stored in s3) and then terminate the EMR
>>> cluster, as my nutch job would not run for more than half a day (at least for
>>> now).
>>>
>>> Here is my question:
>>>
>>> How can I automate the AWS EMR cluster creation with nutch installed and my
>>> configurations (both emr and nutch) updated, and also terminate the cluster
>>> once nutch finishes?
>>>
>>> Here are some ideas I can think of, purely based on my reading; I have not
>>> tried any of them yet.
>>>
>>> - write a script using AWS CLI commands to create the emr cluster, run
>>>   the nutch job, and terminate once it's done
>>> - use cloudformation to create the emr cluster with the necessary application
>>>   (nutch in this case)
>>> - use AWS data pipeline and create a schedule and pipeline for this flow (I
>>>   don't know whether data pipeline can achieve what I want)
>>>
>>> I would be curious to hear how others approached a similar requirement.
>>>
>>> Thanks
>>> Srini
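
The tag-based shutdown mentioned at the top of this thread, combined with the "listen to the exit status of the crawl command" idea, could be sketched roughly as follows. This is only a minimal Python sketch under stated assumptions, not the script actually used on the CDH cluster: the function names, the tag key/value, and the crawl wrapper command are all hypothetical. It relies on the AWS CLI being installed and configured, and it terminates the instances whether or not the crawl succeeded (an assumption chosen so that a failed crawl does not leave the cluster running).

```python
import subprocess
from typing import Any, Callable, List, Sequence


def find_instances_cmd(tag_key: str, tag_value: str) -> List[str]:
    """Build the AWS CLI call that lists running instances carrying the tag."""
    return [
        "aws", "ec2", "describe-instances",
        "--filters",
        f"Name=tag:{tag_key},Values={tag_value}",
        "Name=instance-state-name,Values=running",
        "--query", "Reservations[].Instances[].InstanceId",
        "--output", "text",
    ]


def terminate_instances_cmd(instance_ids: Sequence[str]) -> List[str]:
    """Build the AWS CLI call that terminates the given instances."""
    return ["aws", "ec2", "terminate-instances", "--instance-ids", *instance_ids]


def crawl_then_terminate(
    crawl_cmd: Sequence[str],
    tag_key: str,
    tag_value: str,
    run: Callable[..., Any] = subprocess.run,
) -> int:
    """Run the crawl, terminate the tagged instances, return the crawl's exit status.

    `run` is injectable so the flow can be exercised without touching AWS.
    """
    # Run the crawl and remember its exit status (no check=True: a failed
    # crawl should still lead to the cluster being shut down in this sketch).
    crawl = run(list(crawl_cmd))

    # Look up the instance IDs by tag; text output is whitespace-separated.
    listing = run(
        find_instances_cmd(tag_key, tag_value),
        capture_output=True, text=True, check=True,
    )
    instance_ids = listing.stdout.split()

    # Only call terminate when there is something to terminate.
    if instance_ids:
        run(terminate_instances_cmd(instance_ids), check=True)

    return crawl.returncode
```

A cron-driven wrapper might call `crawl_then_terminate(["./run_crawl.sh"], "cluster", "nutch-crawl")` and use the returned status to decide whether to send an alert; `./run_crawl.sh`, `cluster`, and `nutch-crawl` are placeholders. On EMR specifically, the analogous final step would presumably be `aws emr terminate-clusters --cluster-ids <id>` rather than terminating EC2 instances directly.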

