Hi Nutch users,

I am trying to run a Nutch crawler periodically on a schedule (like a cron job). I am running my Nutch setup on AWS EMR to avoid setting up and maintaining infrastructure. I would like to export the crawled output to S3 (the seed file is already stored in S3) and then terminate the EMR cluster, since my Nutch job should not run for more than half a day (at least for now).
Here is my question: how can I automate creating the AWS EMR cluster with Nutch installed and my configurations (both EMR and Nutch) applied, and then terminate the cluster once Nutch finishes?

Here are some ideas I can think of, based purely on my reading; I have not tried any of them yet.

- Write a script using AWS CLI commands to create the EMR cluster, run the Nutch job, and terminate the cluster once it is done.
- Use CloudFormation to create the EMR cluster with the necessary application (Nutch in this case).
- Use AWS Data Pipeline to create a schedule and pipeline for this flow (I don't know whether Data Pipeline can achieve what I want).

I would be curious to hear how others have approached a similar requirement.

Thanks,
Srini
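
P.S. To make the first idea more concrete, here is a rough, untested sketch of what I have in mind, written against boto3's EMR `run_job_flow` call rather than the raw AWS CLI. Everything specific in it is an assumption on my part: the S3 paths, the bootstrap script that would install Nutch (as far as I can tell Nutch is not one of EMR's built-in applications), the crawl script path on the master node, the release label, the instance types, and the default role names. The part I care about is `KeepJobFlowAliveWhenNoSteps=False`, which as I understand it makes the cluster shut itself down once the last step finishes.

```python
def build_emr_request(seed_uri, output_uri, bootstrap_uri,
                      release_label="emr-5.36.0",        # assumed release
                      instance_type="m5.xlarge",         # assumed instance type
                      instance_count=3):
    """Build the keyword arguments for boto3's EMR run_job_flow call.

    seed_uri, output_uri and bootstrap_uri are hypothetical S3 locations;
    the bootstrap script is assumed to install Nutch and my configs.
    """
    # Hypothetical path where the bootstrap script would place a crawl wrapper.
    crawl_script = "/home/hadoop/nutch/run-crawl.sh"
    return {
        "Name": "nutch-scheduled-crawl",
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Terminate the cluster automatically when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "BootstrapActions": [{
            "Name": "install-nutch",
            "ScriptBootstrapAction": {"Path": bootstrap_uri, "Args": []},
        }],
        "Steps": [{
            "Name": "nutch-crawl",
            # Shut the cluster down if the crawl step fails.
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["bash", crawl_script, seed_uri, output_uri],
            },
        }],
        # Default role names; these may differ in another account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def launch_cluster(request, region_name="us-east-1"):
    """Submit the request to EMR and return the new cluster id."""
    import boto3  # imported here so building the request needs no AWS setup

    emr = boto3.client("emr", region_name=region_name)
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]
```

The idea would be to call `launch_cluster(build_emr_request(...))` from cron (or some other scheduler) with my real S3 paths, and let the crawl wrapper copy the output to S3 as its last action before the cluster terminates.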

