Hi Jim,

> I could reverse engineer the bin/nutch script to get such a list of jar calls,

Adding
  set -x
near the top of bin/nutch and then running bin/crawl with a sample crawl
covering all steps should log every command with its full list of arguments.
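A minimal sketch of the idea, using a stand-in script since bin/nutch isn't at hand here (file names and the Nutch class/arguments below are just illustrative):

```shell
# fake-nutch.sh stands in for bin/nutch. With "set -x" active, the shell
# echoes every command to stderr before running it, prefixed with "+ " and
# with all variables expanded - which is exactly the list of jar calls
# you would need to turn into EMR steps.
cat > fake-nutch.sh <<'EOF'
set -x
JOB=apache-nutch-2.3.1.job
# ":" is the shell no-op builtin: the command is traced (arguments
# expanded) but nothing is actually executed.
: hadoop jar "$JOB" org.apache.nutch.crawl.InjectorJob urls/
EOF

sh fake-nutch.sh 2> trace.log
cat trace.log
# prints lines like:
#   + JOB=apache-nutch-2.3.1.job
#   + : hadoop jar apache-nutch-2.3.1.job org.apache.nutch.crawl.InjectorJob urls/
```

Against the real bin/crawl you could equally run "bash -x bin/crawl ... 2> trace.log" and grep the trace for "hadoop jar".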

> all the config can be brought down from S3

You could copy it via
  aws s3 cp ...
from S3 to the local filesystem of the master.
But on EMR it should be possible to reference the Nutch job file directly
by an s3:// URL (I haven't tried it this way, though).
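A sketch of both variants - bucket name, key paths, and the Nutch class/arguments are placeholders, not taken from this thread:

```shell
# Variant 1: copy job file and config from S3 to the master's local
# filesystem (bucket and paths are placeholders).
aws s3 cp s3://my-bucket/nutch/apache-nutch-2.3.1.job /home/hadoop/
aws s3 cp s3://my-bucket/nutch/conf/ /home/hadoop/conf/ --recursive

# Variant 2 (untested, per the note above): reference the job file
# directly by its s3:// URL when submitting the job.
hadoop jar s3://my-bucket/nutch/apache-nutch-2.3.1.job \
    org.apache.nutch.crawl.InjectorJob urls/
```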

> simply add "&& init 0"

  aws emr terminate-clusters ...
should do the job cleanly. Also have a look at the other subcommands of
  aws emr
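For example (the cluster id is a placeholder):

```shell
# Terminate by cluster id - this shuts the cluster down cleanly, without
# the "failed cluster" status that "init 0" on the master produces.
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX

# Related subcommands worth a look:
aws emr list-clusters --active                        # find running cluster ids
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX # state, master DNS, steps
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://steps.json
```

Note also that "aws emr create-cluster" accepts an --auto-terminate flag, which shuts the cluster down by itself once its last step finishes - that would remove the manual termination step entirely.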

Sebastian

P.S.: I haven't done this myself and am still shutting down the cluster on AWS
      manually - but that doesn't matter much since the crawl takes over a week.

On 11/16/2016 01:24 PM, Jim Lamb wrote:
> Hello,
>  
> I am looking for a way to automate Nutch 2.3.1 crawls on Amazon EMR. I have 
> seen lots of documentation and examples of SSHing to the master node in the 
> cluster and running bin/crawl from there, but it would be much cleaner to be 
> able to add a set of  "steps" to the EMR create-cluster script where the job 
> file is called with the appropriate jarfile name and parameters. That way, 
> the cluster could be started from a script and would terminate once it had 
> completed, having pushed its data into our external Solr index.
>  
> I could reverse engineer the bin/nutch script to get such a list of jar 
> calls, but the one point that I cannot quite grasp is how to emulate the loop 
> of rounds that the bin/crawl script performs. Since the removal of 
> org.apache.nutch.crawl.Crawl I can't see how to do more than one round, short 
> of repeating the same steps over and over (except inject) in the 
> create-cluster command.
>  
> At the moment, everything works on EMR (3.11.0) with Nutch 2.3.1 and HBase 
> 0.98.0 both installed as bootstrap actions, but I have to get the master node 
> IP address, then ssh in and run bin/crawl manually, I then have to keep 
> checking whether it has finished to go terminate the cluster to not incur 
> extra cost (though a workaround here is to simply add "&& init 0" to the 
> command so that the master node dies and takes the cluster with it, albeit 
> always showing as a failed cluster to EMR).
>  
> It would be very desirable to automate this, as we have the need to run 
> many separate ad-hoc Nutch runs and all the config can be brought down from 
> S3, leaving just one manual step.
>  
> Any help/pointers, particularly from anyone who has done this, would be 
> appreciated.
>  
> Regards,
>  
> Jim
> 
