Nice! OK, I'll mark the issue as related to the iSCSI drivers. Merging the new iSCSI drivers upstream will probably happen next week...
Cheers,
Ruben

On Thu, Dec 6, 2012 at 9:24 PM, Alain Pannetrat <[email protected]> wrote:
> Dear Mark, Ruben,
>
> From reading the code of those patches, I think they do indeed greatly
> improve the iSCSI driver and solve my problem.
>
> Tonight I was also confronted with the issue that in some cases it
> takes more than two seconds between "iscsiadm_login "$NEW_IQN"
> "$TARGET_HOST"" and the appearance of
> "/dev/disk/by-path/*$NEW_IQN-lun-1" in the DISCOVERY_CMD inside
> iscsi/clone, so the "sleep 2" inserted there is not enough. This is
> also fixed by the proposed patches.
> Nice work!
>
> All the best,
>
> Alain
>
>
> On Thu, Dec 6, 2012 at 9:09 PM, Mark Gergely <[email protected]> wrote:
>> Dear Ruben, Alain,
>>
>> Our improved iSCSI driver set that we proposed before should solve
>> this issue. As mentioned in the ticket, it makes it possible to
>> simultaneously start hundreds of non-persistent virtual machines.
>> The TM concurrency level is 15.
>> You can check the details at: http://dev.opennebula.org/issues/1592
>>
>> All the best,
>> Mark Gergely
>> MTA-SZTAKI LPDS
>>
>> On 2012.12.06., at 20:01, "Ruben S. Montero" <[email protected]>
>> wrote:
>>
>>> Hi Alain,
>>>
>>> You are totally right, this may be a problem when instantiating
>>> multiple VMs at the same time. I've filed an issue to look for the
>>> best way to generate the TID [1].
>>>
>>> We'd be interested in updating the tgtadm_next_tid function in
>>> scripts_common.sh. Also, if the tgt server is getting overloaded by
>>> these simultaneous deployments, there are several ways to limit the
>>> concurrency of the TM (e.g. the -t option in oned.conf).
>>>
>>> THANKS for the feedback!
>>>
>>> Ruben
>>>
>>> [1] http://dev.opennebula.org/issues/1682
>>>
>>> On Thu, Dec 6, 2012 at 1:52 PM, Alain Pannetrat
>>> <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I'm new to OpenNebula and this mailing list, so forgive me if I
>>>> stumble over a topic that may have already been discussed.
>>>>
>>>> I'm currently discovering OpenNebula 3.8.1 with a simple 3-node
>>>> system: a control node, a compute node and a datastore node
>>>> (iSCSI+LVM).
>>>>
>>>> I have been testing the bulk instantiation of virtual machines in
>>>> Sunstone, where I initiate the creation of 8 virtual machines in
>>>> parallel. I have noticed that between 2 and 4 machines fail to
>>>> instantiate correctly, with the following typical error message:
>>>>
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: Command execution fail:
>>>> /var/lib/one/remotes/tm/iscsi/clone
>>>> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26
>>>> compute.admin.lan:/var/lib/one//datastores/0/111/disk.0 111 101
>>>> Thu Dec 6 14:40:08 2012 [TM][E]: clone: Command " set -e
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: set -x
>>>> Thu Dec 6 14:40:08 2012 [TM][I]:
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: # get size
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: SIZE=$(sudo lvs --noheadings -o
>>>> lv_size "/dev/vg-one/lv-one-26")
>>>> Thu Dec 6 14:40:08 2012 [TM][I]:
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: # create lv
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo lvcreate -L${SIZE} vg-one -n
>>>> lv-one-26-111
>>>> Thu Dec 6 14:40:08 2012 [TM][I]:
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: # clone lv with dd
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo dd if=/dev/vg-one/lv-one-26
>>>> of=/dev/vg-one/lv-one-26-111 bs=64k
>>>> Thu Dec 6 14:40:08 2012 [TM][I]:
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: # new iscsi target
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: TID=$(sudo tgtadm --lld iscsi --op
>>>> show --mode target | grep "Target" | tail -n 1 |
>>>> awk '{split($2,tmp,":"); print tmp[1]+1;}')
>>>> Thu Dec 6 14:40:08 2012 [TM][I]:
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
>>>> --mode target --tid $TID --targetname
>>>> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op bind
>>>> --mode target --tid $TID -I ALL
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
>>>> --mode logicalunit --tid $TID --lun 1 --backing-store
>>>> /dev/vg-one/lv-one-26-111
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: sudo tgt-admin --dump |sudo tee
>>>> /etc/tgt/targets.conf > /dev/null 2>&1" failed: + sudo lvs
>>>> --noheadings -o lv_size /dev/vg-one/lv-one-26
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records in
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: 131072+0 records out
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: 8589934592 bytes (8.6 GB) copied,
>>>> 898.903 s, 9.6 MB/s
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: tgtadm: this target already exists
>>>> Thu Dec 6 14:40:08 2012 [TM][E]: Error cloning
>>>> compute.admin.lan:/dev/vg-one/lv-one-26-111
>>>> Thu Dec 6 14:40:08 2012 [TM][I]: ExitCode: 22
>>>> Thu Dec 6 14:40:08 2012 [TM][E]: Error executing image transfer
>>>> script: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111
>>>> Thu Dec 6 14:40:09 2012 [DiM][I]: New VM state is FAILED
>>>>
>>>> After adding traces in the code, I found that there seems to be a
>>>> race condition in /var/lib/one/remotes/tm/iscsi/clone where the
>>>> following commands get executed:
>>>>
>>>> TID=\$($SUDO $(tgtadm_next_tid))
>>>> $SUDO $(tgtadm_target_new "\$TID" "$NEW_IQN")
>>>>
>>>> These commands typically expand to something like this:
>>>>
>>>> TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target"
>>>> | tail -n 1 | awk '{split($2,tmp,":"); print tmp[1]+1;}')
>>>> sudo tgtadm --lld iscsi --op new --mode target --tid $TID
>>>> --targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
>>>>
>>>> What seems to happen is that two (or more) calls to the first
>>>> command, tgtadm_next_tid, run simultaneously before the second
>>>> command gets a chance to execute, and then TID has the same value
>>>> for two (or more) VMs.
>>>>
>>>> The workaround I found is to replace the line:
>>>> TID=\$($SUDO $(tgtadm_next_tid))
>>>> with
>>>> TID=$VMID
>>>> in /var/lib/one/remotes/tm/iscsi/clone
>>>>
>>>> Since $VMID is globally unique, no race condition can happen here.
>>>> I've tested this and the failures no longer occur in my setting.
>>>> Of course, I'm not sure this is the ideal fix, since perhaps VMID
>>>> can take values that are out of range for tgtadm, so further testing
>>>> would be needed.
>>>>
>>>> I'd be happy to get your thoughts/feedback on this issue.
>>>>
>>>> Best,
>>>>
>>>> Alain
>>>> _______________________________________________
>>>> Users mailing list
>>>> [email protected]
>>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>> --
>>> Ruben S. Montero, PhD
>>> Project co-Lead and Chief Architect
>>> OpenNebula - The Open Source Solution for Data Center Virtualization
>>> www.OpenNebula.org | [email protected] | @OpenNebula

--
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - The Open Source Solution for Data Center Virtualization
www.OpenNebula.org | [email protected] | @OpenNebula
_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
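[Editor's note] The read-then-create race Alain describes (two clone scripts both read the current highest TID before either registers its target) can be closed without giving up sequential TIDs by serializing the two steps under a lock. The sketch below is a minimal illustration, not the stock driver code: flock(1) from util-linux guards the critical section, and the tgtadm calls are replaced by stubs that keep the target list in a plain file so the logic can be exercised safely. The parsing in next_tid is the same grep/tail/awk pipeline as the stock tgtadm_next_tid; the function names clone_target, show_targets and new_target are illustrative.

```shell
#!/bin/sh
# Sketch: atomic TID allocation for concurrent iscsi/clone runs.
# Assumptions: flock(1) is available; tgtadm is stubbed with a state file.

STATE=$(mktemp)   # stands in for the tgt daemon's target list
LOCK=$(mktemp)    # hypothetical lock file serializing allocation

show_targets() { cat "$STATE"; }                              # ~ tgtadm --op show --mode target
new_target()   { printf 'Target %s: %s\n' "$1" "$2" >> "$STATE"; }  # ~ tgtadm --op new

next_tid() {
    # same parsing as the stock tgtadm_next_tid: highest TID + 1
    show_targets | grep "Target" | tail -n 1 \
        | awk '{split($2,tmp,":"); print tmp[1]+1;}'
}

clone_target() {
    # flock makes "read highest TID" + "create target" one atomic step,
    # closing the window between the two tgtadm calls in iscsi/clone
    (
        flock -x 9
        TID=$(next_tid)
        [ -n "$TID" ] || TID=1   # empty target list: start at 1
        new_target "$TID" "$1"
    ) 9>"$LOCK"
}

# two concurrent clones, as in the bulk-instantiation scenario
clone_target iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111 &
clone_target iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-112 &
wait
show_targets
```

With the lock in place both targets get distinct TIDs regardless of scheduling; Alain's TID=$VMID workaround achieves the same uniqueness without a lock, at the cost of coupling TID values to VM IDs.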
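[Editor's note] The other issue Alain mentions, the fixed "sleep 2" in DISCOVERY_CMD sometimes being shorter than the time it takes /dev/disk/by-path/*$NEW_IQN-lun-1 to appear after iscsiadm_login, is commonly handled by polling with a timeout instead of a fixed delay. The sketch below is an illustration of that pattern only; the function name wait_for_path and the 30-second default are assumptions, not part of the stock driver.

```shell
#!/bin/sh
# Sketch: bounded poll for a device node instead of a fixed sleep.
wait_for_path() {
    # $1: glob pattern to wait for, $2: timeout in seconds (default 30)
    _glob=$1
    _left=${2:-30}
    while [ "$_left" -gt 0 ]; do
        # unquoted expansion is intentional: let the shell expand the glob
        for _p in $_glob; do
            # if nothing matches, $_p is the literal pattern and -e fails
            [ -e "$_p" ] && { echo "$_p"; return 0; }
        done
        sleep 1
        _left=$((_left - 1))
    done
    return 1   # device never appeared within the timeout
}
```

A caller would use it roughly as `DEV=$(wait_for_path "/dev/disk/by-path/*$NEW_IQN-lun-1" 30) || exit 1`, succeeding as soon as the node shows up rather than failing when login takes longer than two seconds.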
