Hi Alain,

You are totally right, this may be a problem when instantiating multiple VMs at the same time. I've filed an issue to find the best way to generate the TID [1].
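One possible direction, as a rough sketch only: serialize TID allocation behind flock(1), so two concurrent clone scripts can never observe the same "next" TID. The LOCK and COUNTER paths below are illustrative, not part of OpenNebula:

```shell
# Sketch: allocate TIDs under an exclusive lock. Illustrative paths,
# not anything OpenNebula ships today.
LOCK=/tmp/one-tid.lock
COUNTER=/tmp/one-tid.counter

next_tid() {
    (
        flock -x 9                                # block until we hold the lock
        tid=$(( $(cat "$COUNTER" 2>/dev/null || echo 0) + 1 ))
        echo "$tid" > "$COUNTER"                  # persist for the next caller
        echo "$tid"                               # hand the TID to the caller
    ) 9> "$LOCK"
}
```

The read-increment-write happens entirely inside the critical section, so the race between "compute next TID" and "create target" disappears as long as every allocator goes through the lock.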
We'd be interested in updating the tgtadm_next_tid function in scripts_common.sh. Also, if the tgt server is getting overloaded by these simultaneous deployments, there are several ways to limit the concurrency of the TM (e.g. the -t option in oned.conf).

Thanks for the feedback!

Ruben

[1] http://dev.opennebula.org/issues/1682

On Thu, Dec 6, 2012 at 1:52 PM, Alain Pannetrat <[email protected]> wrote:
> Hi all,
>
> I'm new to OpenNebula and this mailing list, so forgive me if I
> stumble over a topic that may have already been discussed.
>
> I'm currently discovering OpenNebula 3.8.1 with a simple 3-node
> system: a control node, a compute node and a datastore node
> (iscsi+lvm).
>
> I have been testing the bulk instantiation of virtual machines in
> Sunstone, where I initiate the bulk creation of 8 virtual machines in
> parallel. I have noticed that between 2 and 4 machines just fail to
> instantiate correctly, with the typical following error message:
>
> Thu Dec 6 14:40:08 2012 [TM][I]: Command execution fail:
> /var/lib/one/remotes/tm/iscsi/clone
> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26
> compute.admin.lan:/var/lib/one//datastores/0/111/disk.0 111 101
> Thu Dec 6 14:40:08 2012 [TM][E]: clone: Command " set -e
> Thu Dec 6 14:40:08 2012 [TM][I]: set -x
> Thu Dec 6 14:40:08 2012 [TM][I]:
> Thu Dec 6 14:40:08 2012 [TM][I]: # get size
> Thu Dec 6 14:40:08 2012 [TM][I]: SIZE=$(sudo lvs --noheadings -o
> lv_size "/dev/vg-one/lv-one-26")
> Thu Dec 6 14:40:08 2012 [TM][I]:
> Thu Dec 6 14:40:08 2012 [TM][I]: # create lv
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo lvcreate -L${SIZE} vg-one -n
> lv-one-26-111
> Thu Dec 6 14:40:08 2012 [TM][I]:
> Thu Dec 6 14:40:08 2012 [TM][I]: # clone lv with dd
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo dd if=/dev/vg-one/lv-one-26
> of=/dev/vg-one/lv-one-26-111 bs=64k
> Thu Dec 6 14:40:08 2012 [TM][I]:
> Thu Dec 6 14:40:08 2012 [TM][I]: # new iscsi target
> Thu Dec 6 14:40:08 2012 [TM][I]: TID=$(sudo tgtadm --lld iscsi --op
> show --mode target | grep "Target" | tail -n 1 |
> awk '{split($2,tmp,":"); print tmp[1]+1;}')
> Thu Dec 6 14:40:08 2012 [TM][I]:
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
> --mode target --tid $TID --targetname
> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op bind
> --mode target --tid $TID -I ALL
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
> --mode logicalunit --tid $TID --lun 1 --backing-store
> /dev/vg-one/lv-one-26-111
> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgt-admin --dump |sudo tee
> /etc/tgt/targets.conf > /dev/null 2>&1" failed: + sudo lvs
> --noheadings -o lv_size /dev/vg-one/lv-one-26
> Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records in
> Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records out
> Thu Dec 6 14:40:08 2012 [TM][I]: 8589934592 bytes (8.6 GB) copied,
> 898.903 s, 9.6 MB/s
> Thu Dec 6 14:40:08 2012 [TM][I]: tgtadm: this target already exists
> Thu Dec 6 14:40:08 2012 [TM][E]: Error cloning
> compute.admin.lan:/dev/vg-one/lv-one-26-111
> Thu Dec 6 14:40:08 2012 [TM][I]: ExitCode: 22
> Thu Dec 6 14:40:08 2012 [TM][E]: Error executing image transfer
> script: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111
> Thu Dec 6 14:40:09 2012 [DiM][I]: New VM state is FAILED
>
> After adding traces in the code, I found that there seems to be a race
> condition in /var/lib/one/remotes/tm/iscsi/clone, where the following
> commands get executed:
>
> TID=\$($SUDO $(tgtadm_next_tid))
> $SUDO $(tgtadm_target_new "\$TID" "$NEW_IQN")
>
> These commands are typically expanded to something like this:
>
> TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target"
> | tail -n 1 | awk '{split($2,tmp,":"); print tmp[1]+1;}')
> sudo tgtadm --lld iscsi --op new --mode target --tid $TID
> --targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
>
> What seems to happen is that two (or more) calls to the first command
> (tgtadm_next_tid) happen simultaneously before the second command gets
> a chance to execute, and then TID has the same value for two (or
> more) VMs.
>
> The workaround I found is to replace the line:
> TID=\$($SUDO $(tgtadm_next_tid))
> with
> TID=$VMID
> in /var/lib/one/remotes/tm/iscsi/clone
>
> Since $VMID is globally unique, no race conditions can happen here.
> I've tested this and the failures don't happen anymore in my setting.
> Of course I'm not sure this is the ideal fix, since perhaps VMID can
> take values that are out of range for tgtadm. So further testing would
> be needed.
>
> I'd be happy to get your thoughts/feedback on this issue.
>
> Best,
>
> Alain
> _______________________________________________
> Users mailing list
> [email protected]
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - The Open Source Solution for Data Center Virtualization
www.OpenNebula.org | [email protected] | @OpenNebula
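For anyone reproducing the race Alain describes: the "next TID" pipeline from the clone script can be exercised in isolation against canned tgtadm output. The sample below is illustrative (two pre-existing targets); any two scripts that run this pipeline before either one has created its target compute the same TID:

```shell
# Canned output mimicking "tgtadm --lld iscsi --op show --mode target";
# the target names are illustrative.
sample='Target 1: iqn.2012-02.org.opennebula:san.vg-one.lv-one-25
Target 2: iqn.2012-02.org.opennebula:san.vg-one.lv-one-26'

# Same read-then-add-one logic as tgtadm_next_tid: take the last
# "Target N:" line, strip the colon, and add 1.
TID=$(echo "$sample" | grep "Target" | tail -n 1 \
      | awk '{split($2,tmp,":"); print tmp[1]+1;}')
echo "$TID"    # prints 3
```

Because the read and the later target creation are two separate tgtadm invocations, nothing stops a second clone script from running the same read in between, which is exactly the "tgtadm: this target already exists" failure in the log.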
