Two Whirr runs on EC2 yesterday failed, seemingly at random, to install Hadoop components. The first time it happened on the master node; when it then happened on one slave but not another, I was able to diff the jclouds logs under /tmp/logs/ from the two slaves. A third run worked fine. All runs used the same scripts driving Whirr, the same AMI, the same number of nodes, the same region, etc. The snippets of /tmp/logs/stderr.log below show that apt-get update failed with "Could not get lock /var/lib/dpkg/lock" on one slave but not on the other.

This is a serious reliability issue.  What is non-deterministic here?

Paul

------------ slave 1 -------------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ which dpkg
+ apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ apt-get -y install hadoop-0.20

-------------- slave 2 ---------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
+ which dpkg
+ apt-get update
+ apt-get -y install hadoop-0.20
dpkg-preconfigure: unable to re-open stdin:
+ cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
+ update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.dist 90
+ install_cdh_hbase -c aws-ec2 -u http://apache.cs.utah.edu/hbase/hbase-0.90.3/hbase-0.90.3.tar.gz

-------------
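
The error text itself suggests another process was holding the dpkg lock when apt-get update ran, and the slave 1 trace shows the script carrying on to the install step anyway. As a possible workaround, and only as a sketch, the apt-get calls could be wrapped in a retry loop. The retry function below is a hypothetical helper, not something from the Whirr scripts, and it assumes apt-get exits non-zero when it cannot take the lock.

```shell
#!/bin/sh
# retry: hypothetical helper, not part of Whirr or jclouds.
# Runs a command up to MAX times, sleeping DELAY seconds after each failure.
# Usage: retry MAX DELAY command [args...]
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Stop as soon as the wrapped command succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Possible use in the install script (assumed values, not run here):
#   retry 10 5 sudo apt-get update
```

This would only paper over the race. It does not explain which process is holding the lock on some nodes and not others.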
