Hi,

> I would like to export the crawled output to s3
> (already have the seed file stored in s3)

Please also have a look at
  https://issues.apache.org/jira/browse/NUTCH-2281
(it would be great to have a second test for the patch / pull request).

At first glance, all 3 approaches seem feasible. Personally, I only
have experience with shell scripting and AWS CLI commands to launch the
cluster. It's quite flexible, but sometimes cumbersome.

Best,
Sebastian

On 01/26/2017 03:09 AM, Srinivasan Ramaswamy wrote:
> Hi Nutch users,
>
> I am trying to run a nutch crawler periodically on a schedule (like a cron
> job). I am running my nutch setup in AWS EMR to avoid setting up and
> maintaining infrastructure. I would like to export the crawled output to s3
> (already have the seed file stored in s3) and then terminate the EMR
> cluster as my nutch job would not run for more than half a day (at least for
> now).
>
> Here is my question:
>
> How can I automate the AWS EMR cluster creation with nutch installed and my
> configurations (both emr and nutch) updated and also terminate the cluster
> once nutch finishes?
>
> Here are some ideas I can think of, purely based on my reading; I have not
> tried any of them yet.
>
> - write a script using AWS CLI commands to create the emr cluster and run
>   the nutch job and terminate once it's done
> - use cloudformation to create the emr cluster with necessary application
>   (nutch in this case)
> - use AWS data pipeline and create a schedule and pipeline for this flow (I
>   don't know whether data pipeline can achieve what I want)
>
> I would be curious to hear how others approached a similar requirement.
>
> Thanks
> Srini
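
For illustration, the first idea in the quoted list (a script around AWS CLI commands) might look roughly like the sketch below. It only assembles an `aws emr create-cluster` command line and prints it; nothing is launched. Everything specific here is an assumption and not taken from this thread: the bucket paths, the bootstrap script `install-nutch.sh`, the wrapper `/home/hadoop/run-crawl.sh` (assumed to be put in place by that bootstrap action and to copy the crawl output to S3 itself), the release label, and the instance settings. The `--auto-terminate` flag is what asks EMR to shut the cluster down once the steps have finished.

```python
import shlex


def build_create_cluster_cmd(name, log_uri, bootstrap_script, seed_uri, output_uri,
                             release_label="emr-5.3.0",   # placeholder release
                             instance_type="m4.large",    # placeholder size
                             instance_count=3):
    """Return the argv list for an `aws emr create-cluster` call.

    The step runs a hypothetical wrapper script on the master node through
    command-runner.jar, passing the seed and output locations as arguments.
    """
    step = (
        "Type=CUSTOM_JAR,Name=nutch-crawl,"
        "ActionOnFailure=TERMINATE_CLUSTER,"
        "Jar=command-runner.jar,"
        # /home/hadoop/run-crawl.sh is hypothetical: assumed to be installed
        # by the bootstrap action below.
        f"Args=[/home/hadoop/run-crawl.sh,{seed_uri},{output_uri}]"
    )
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
        "--log-uri", log_uri,
        # Bootstrap action: a user-supplied script (hypothetical) that
        # installs Nutch and its configuration on every node.
        "--bootstrap-actions", f"Path={bootstrap_script},Name=install-nutch",
        "--steps", step,
        # Terminate the cluster after the last step completes.
        "--auto-terminate",
    ]


if __name__ == "__main__":
    cmd = build_create_cluster_cmd(
        name="nutch-crawl",
        log_uri="s3://my-bucket/emr-logs/",                 # hypothetical
        bootstrap_script="s3://my-bucket/install-nutch.sh",  # hypothetical
        seed_uri="s3://my-bucket/seeds/",                    # hypothetical
        output_uri="s3://my-bucket/crawl-output/",           # hypothetical
    )
    # Dry run: print the command instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch the cluster, the list could be passed to `subprocess.run(cmd, check=True)`, and the script itself could be started from cron for the periodic schedule mentioned in the question.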

