Hello,
I am looking for a way to automate Nutch 2.3.1 crawls on Amazon EMR. I have
seen lots of documentation and examples of SSHing to the master node in the
cluster and running bin/crawl from there, but it would be much cleaner to be
able to add a set of "steps" to the EMR create-cluster script where the job
file is called with the appropriate jarfile name and parameters. That way, the
cluster could be started from a script and would terminate once it had
completed, having pushed its data into our external Solr index.
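For the record, what I have in mind is something along these lines, passed as "--steps file://steps.json" to create-cluster (a sketch only: the jar path, main class, seed location, and crawl id are assumptions for a Nutch 2.x job jar deployed on the master node):

```json
[
  {
    "Name": "nutch-inject",
    "Type": "CUSTOM_JAR",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Jar": "/home/hadoop/nutch/runtime/deploy/apache-nutch-2.3.1.job",
    "MainClass": "org.apache.nutch.crawl.InjectorJob",
    "Args": ["s3://my-bucket/seeds/", "-crawlId", "webcrawl"]
  }
]
```

Combined with "--auto-terminate", the cluster would shut itself down once the last step finishes.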
I could reverse engineer the bin/nutch script to get such a list of jar calls,
but the one point that I cannot quite grasp is how to emulate the loop of
rounds that the bin/crawl script performs. Since the removal of
org.apache.nutch.crawl.Crawl, I can't see how to do more than one round, short
of repeating the same set of steps over and over (minus inject) in the
create-cluster command.
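The best I can come up with is to unroll the loop when building the steps list, e.g. with a small script that emits one step definition per Nutch job per round, in the order bin/crawl runs them. A sketch (the jar path, seed location, and crawl id are placeholders, and the class names are my reading of the Nutch 2.x bin/crawl sequence; the real script also threads a generated batch id through fetch/parse/updatedb, which I've omitted here):

```shell
#!/bin/sh
# Sketch: unroll N crawl rounds into EMR step definitions in aws-cli
# shorthand syntax, suitable for splicing into "aws emr create-cluster --steps".
JOB_JAR=${JOB_JAR:-/home/hadoop/nutch/runtime/deploy/apache-nutch-2.3.1.job}
CRAWL_ID=${CRAWL_ID:-webcrawl}

emit_steps() {
  rounds=$1
  # Inject once...
  echo "Type=CUSTOM_JAR,Jar=$JOB_JAR,MainClass=org.apache.nutch.crawl.InjectorJob,Args=[s3://my-bucket/seeds/,-crawlId,$CRAWL_ID]"
  i=1
  while [ "$i" -le "$rounds" ]; do
    # ...then generate/fetch/parse/updatedb per round, mirroring bin/crawl.
    # (bin/crawl also passes a batch id to fetch/parse/updatedb; omitted.)
    for class in org.apache.nutch.crawl.GeneratorJob \
                 org.apache.nutch.fetcher.FetcherJob \
                 org.apache.nutch.parse.ParserJob \
                 org.apache.nutch.crawl.DbUpdaterJob; do
      echo "Type=CUSTOM_JAR,Jar=$JOB_JAR,MainClass=$class,Args=[-crawlId,$CRAWL_ID]"
    done
    i=$((i + 1))
  done
}

emit_steps 3
```

That works, but it feels like I'm reimplementing bin/crawl by hand, which is why I'm hoping someone has a cleaner approach.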
At the moment, everything works on EMR (3.11.0) with Nutch 2.3.1 and HBase
0.98.0 both installed as bootstrap actions, but I have to get the master node
IP address, ssh in, and run bin/crawl manually, and then keep checking whether
it has finished so I can terminate the cluster and avoid extra cost (though a
workaround here is simply to append "&& init 0" to the command so that the
master node shuts down and takes the cluster with it, albeit always showing as
a failed cluster in EMR).
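I can at least script parts of that manual workflow with the AWS CLI (assuming configured credentials; the cluster id is a placeholder, and with EXECUTE unset the sketch below only prints the commands it would run):

```shell
#!/bin/sh
# Sketch: script the get-IP-and-wait part of the workflow with the aws CLI.
# CLUSTER_ID is a placeholder; set EXECUTE=1 to actually run the commands.
CLUSTER_ID=${CLUSTER_ID:-j-XXXXXXXXXXXXX}

run() {
  if [ -n "$EXECUTE" ]; then "$@"; else echo "would run: $*"; fi
}

# Look up the master node address instead of copying it from the console.
run aws emr describe-cluster --cluster-id "$CLUSTER_ID" \
    --query Cluster.MasterPublicDnsName --output text

# Block until the cluster terminates rather than polling by hand.
run aws emr wait cluster-terminated --cluster-id "$CLUSTER_ID"
```

But that still leaves the ssh-and-run-bin/crawl step in the middle, which is the part I'd really like to replace with proper EMR steps.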
It would be very desirable to automate this, as we need to run many separate
ad-hoc Nutch crawls; all the config can be pulled down from S3, leaving just
one manual step.
Any help/pointers, particularly from anyone who has done this, would be
appreciated.
Regards,
Jim