Good catch, Karel! I have tried to investigate this in the past, but I never considered that it might be a race condition with a cron job (most of the synchronisation tests we've added are designed to prove that this is not a condition triggered by Whirr).
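If the failure really is a race on the dpkg lock, one stopgap would be to retry the apt-get step a bounded number of times instead of failing on the first attempt. A minimal bash sketch, assuming apt-get exits non-zero when it cannot take the lock; `retry_cmd` and its arguments are hypothetical names, not something that exists in Whirr or jclouds:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Whirr): retry a command a bounded
# number of times, sleeping between attempts.
# Usage: retry_cmd MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_cmd() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_cmd 10 5 sudo apt-get update` would try up to ten times with five seconds between attempts. This only papers over the race; it does not tell us which process is holding the lock.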
What if we stop the crond service while running the install/configure scripts?

http://www.cyberciti.biz/faq/howto-linux-unix-start-restart-cron/

> In my opinion, as much of the installation/configuration steps should
> be done using a config management tool (puppet/chef).
>

Totally agree + we have the needed infrastructure for this.

> Once the configuration is published to each node you can trigger
> puppet/chef it as much as you like, and eventually you should reach a
> good state. Running the complete whirr-generated script(s) multiple
> times is going to be slower and much more error prone.
>

+ it's hard to make retry-friendly bash scripts.

>
> Regards,
> Karel
>
> On Mon, Oct 3, 2011 at 10:22 PM, Paul Baclace <[email protected]> wrote:
> > Two runs of whirr on EC2 yesterday randomly failed to install Hadoop
> > components. First it occurred on the master node, but when it occurred in
> > one slave and not another, I could find the diff of the /tmp/logs/ from
> > jclouds. In a third run, everything worked fine. Same scripts driving
> > whirr, same AMI, same number of nodes, same region, etc. Snippets of
> > /tmp/logs/stderr.log shown below indicate that apt-get update had "Could not
> > get lock /var/lib/dpkg/lock" on one slave, but not another.
> >
> > This is a serious reliability issue. What is non-deterministic here?
> >
> > Paul
> >
> > ------------ slave 1 -------------------
> > + register_cloudera_repo
> > + which dpkg
> > + cat
> > + curl -s http://archive.cloudera.com/debian/archive.key
> > + sudo apt-key add -
> > + sudo apt-get update
> > E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
> > E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
> > + which dpkg
> > + apt-get update
> > E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
> > E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?
> > + apt-get -y install hadoop-0.20
> >
> > -------------- slave 2 ---------------
> > + register_cloudera_repo
> > + which dpkg
> > + cat
> > + curl -s http://archive.cloudera.com/debian/archive.key
> > + sudo apt-key add -
> > + sudo apt-get update
> > + which dpkg
> > + apt-get update
> > + apt-get -y install hadoop-0.20
> > dpkg-preconfigure: unable to re-open stdin:
> > + cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
> > + update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.dist 90
> > + install_cdh_hbase -c aws-ec2 -u http://apache.cs.utah.edu/hbase/hbase-0.90.3/hbase-0.90.3.tar.gz
> >
> > -------------
> >
>
> --
> Karel Vervaeke
> http://outerthought.org/
> Open Source Content Applications
> Makers of Kauri, Daisy CMS and Lily
>
