Hello,
 
I am looking for a way to automate Nutch 2.3.1 crawls on Amazon EMR. I have 
seen lots of documentation and examples of SSHing to the master node of the 
cluster and running bin/crawl from there, but it would be much cleaner to be 
able to add a set of "steps" to the EMR create-cluster command, with each 
step calling the Nutch job file with the appropriate jar name and parameters. 
That way, the cluster could be started from a script and would terminate once 
it had completed, having pushed its data into our external Solr index.
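 
Roughly what I have in mind is sketched below. The bucket, script, and jar 
paths are all placeholders, and I am not certain whether a CUSTOM_JAR step 
will accept a local path to the job file (put there by the bootstrap action) 
or whether it has to live in S3:

    # build one step definition; shorthand syntax for the AWS CLI --steps option
    STEP="Type=CUSTOM_JAR,Name=inject,ActionOnFailure=TERMINATE_CLUSTER"
    STEP="$STEP,Jar=/home/hadoop/nutch/apache-nutch-2.3.1.job"
    STEP="$STEP,MainClass=org.apache.nutch.crawl.InjectorJob"
    STEP="$STEP,Args=[s3://mybucket/seeds,-crawlId,webcrawl]"

    aws emr create-cluster \
      --name nutch-crawl \
      --ami-version 3.11.0 \
      --instance-type m1.large --instance-count 3 \
      --bootstrap-actions Path=s3://mybucket/install-nutch-hbase.sh \
      --auto-terminate \
      --steps "$STEP"

With --auto-terminate the cluster should shut itself down after the last 
step, which is exactly the behaviour I am after.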
 
I could reverse-engineer the bin/nutch script to get such a list of jar 
invocations, but the one point that I cannot quite grasp is how to emulate 
the loop of rounds that the bin/crawl script performs. Since the removal of 
org.apache.nutch.crawl.Crawl, I can't see how to do more than one round, 
short of repeating the same set of steps (everything except inject) over and 
over in the create-cluster command.
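 
The best I have come up with is to generate those repeated steps 
mechanically, along the lines of the sketch below. The class names are from 
my reading of bin/nutch for 2.x, and the per-job arguments are simplified: 
bin/crawl really generates a fresh batch id each round and passes it to 
fetch/parse/updatedb, whereas here -all stands in for that:

    JOB=/home/hadoop/nutch/apache-nutch-2.3.1.job
    ID=webcrawl
    STEPS="Type=CUSTOM_JAR,Name=inject,Jar=$JOB,MainClass=org.apache.nutch.crawl.InjectorJob,Args=[s3://mybucket/seeds,-crawlId,$ID]"

    for ROUND in $(seq 1 5); do
      for CLASS in org.apache.nutch.crawl.GeneratorJob \
                   org.apache.nutch.fetcher.FetcherJob \
                   org.apache.nutch.parse.ParserJob \
                   org.apache.nutch.crawl.DbUpdaterJob; do
        case $CLASS in
          *GeneratorJob) ARGS="[-crawlId,$ID]" ;;
          *)             ARGS="[-all,-crawlId,$ID]" ;;  # batch id stand-in
        esac
        STEPS="$STEPS Type=CUSTOM_JAR,Name=${CLASS##*.}-$ROUND,Jar=$JOB,MainClass=$CLASS,Args=$ARGS"
      done
    done

    # word-splitting on $STEPS is intentional: each step is one shell word
    aws emr create-cluster --ami-version 3.11.0 --instance-type m1.large \
      --instance-count 3 --auto-terminate --steps $STEPS

(plus a final org.apache.nutch.indexer.IndexingJob step to push into Solr). 
But this feels like exactly the kind of repetition that bin/crawl exists to 
avoid, so I am hoping there is a cleaner way.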
 
At the moment, everything works on EMR (AMI 3.11.0) with Nutch 2.3.1 and 
HBase 0.98.0 both installed via bootstrap actions, but I have to look up the 
master node's IP address, SSH in, and run bin/crawl manually. I then have to 
keep checking whether it has finished so that I can terminate the cluster and 
not incur extra cost (though a workaround here is simply to append "&& init 
0" to the command, so that the master node dies and takes the cluster with 
it, albeit always showing as a failed cluster to EMR).
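 
For reference, the manual step currently looks roughly like this (the cluster 
id, key, paths, Solr URL, and round count are all placeholders, and init 0 
may need sudo depending on the AMI):

    CLUSTER_ID=j-XXXXXXXXXXXXX
    MASTER=$(aws emr describe-cluster --cluster-id $CLUSTER_ID \
               --query Cluster.MasterPublicDnsName --output text)

    # run the whole crawl, then pull the master out from under the cluster
    ssh -i mykey.pem hadoop@$MASTER \
        'cd nutch && bin/crawl urls webcrawl http://solr.example.com:8983/solr/nutch 5 && sudo init 0'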
 
It would be very desirable to automate all of this, as we need to run many 
separate ad-hoc Nutch crawls; all of the config can be brought down from S3, 
leaving just this one manual step.
 
Any help/pointers, particularly from anyone who has done this, would be 
appreciated.
 
Regards,
 
Jim
