Two whirr runs on EC2 yesterday failed, seemingly at random, to install
Hadoop components. The first failure was on the master node. The second
hit one slave but not another, which let me diff the jclouds logs under
/tmp/logs/ from the two slaves. A third run worked fine. All three runs
used the same scripts driving whirr, same AMI, same number of nodes,
same region, etc.
The snippets of /tmp/logs/stderr.log below show that apt-get update
failed with "Could not get lock /var/lib/dpkg/lock" on one slave but
not on the other.
This is a serious reliability issue. What is non-deterministic here?
Paul
------------ slave 1 -------------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ which dpkg
+ apt-get update
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
+ apt-get -y install hadoop-0.20
-------------- slave 2 ---------------
+ register_cloudera_repo
+ which dpkg
+ cat
+ curl -s http://archive.cloudera.com/debian/archive.key
+ sudo apt-key add -
+ sudo apt-get update
+ which dpkg
+ apt-get update
+ apt-get -y install hadoop-0.20
dpkg-preconfigure: unable to re-open stdin:
+ cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
+ update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.dist 90
+ install_cdh_hbase -c aws-ec2 -u http://apache.cs.utah.edu/hbase/hbase-0.90.3/hbase-0.90.3.tar.gz
-------------
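One possible workaround is to wrap the apt-get calls in a retry loop.
This is only a sketch, and it assumes the lock is held briefly by some
other apt/dpkg process on the node (the logs above do not confirm what
holds it). The retry helper is hypothetical, not something whirr or
jclouds provides, and the attempt count and delay are guesses.

```shell
#!/bin/sh
# Hypothetical helper (not part of whirr or jclouds): run a command up to
# MAX times, sleeping DELAY seconds between failed attempts.
# Usage: retry MAX DELAY command [args...]
retry() {
  retry_max=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while [ "$retry_n" -le "$retry_max" ]; do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    echo "attempt $retry_n/$retry_max failed: $*" >&2
    # Do not sleep after the final attempt.
    if [ "$retry_n" -lt "$retry_max" ]; then
      sleep "$retry_delay"
    fi
    retry_n=$((retry_n + 1))
  done
  return 1
}

# Intended use in the install script (assumption: whatever holds
# /var/lib/dpkg/lock releases it within about a minute):
#   retry 12 5 sudo apt-get update
#   retry 12 5 apt-get -y install hadoop-0.20
```

If the retries run out, retry returns non-zero, so the calling script can
stop instead of carrying on to the install step as slave 1 did above.
This hides the race rather than explaining it, so the question of what
else is touching dpkg on these nodes still stands.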