Nice!

Ok, I'll mark the issue as related to the iSCSI drivers. Merging the
new iSCSI drivers upstream will probably happen next week...

Cheers

Ruben

On Thu, Dec 6, 2012 at 9:24 PM, Alain Pannetrat
<[email protected]> wrote:
> Dear Mark, Ruben,
>
> From reading the code in those patches, they do indeed seem to
> greatly improve the iSCSI driver and solve my problem.
>
> Tonight I also ran into the issue that in some cases it takes more
> than two seconds between "iscsiadm_login "$NEW_IQN" "$TARGET_HOST""
> and the appearance of "/dev/disk/by-path/*$NEW_IQN-lun-1" in the
> DISCOVERY_CMD inside iscsi/clone, so the "sleep 2" inserted there is
> not enough. This is also fixed by the proposed patches.
> Nice work!
>
> All the best,
>
> Alain
>
>
> On Thu, Dec 6, 2012 at 9:09 PM, Mark Gergely <[email protected]> 
> wrote:
>> Dear Ruben, Alain,
>>
>> the improved iSCSI driver set we proposed earlier should solve this
>> issue. As mentioned in the ticket, it makes it possible to
>> simultaneously start hundreds of non-persistent virtual machines
>> with a TM concurrency level of 15.
>> You can check the details at: http://dev.opennebula.org/issues/1592
>>
>> All the best,
>> Mark Gergely
>> MTA-SZTAKI LPDS
>>
>> On 2012.12.06., at 20:01, "Ruben S. Montero" <[email protected]> 
>> wrote:
>>
>>> Hi Alain,
>>>
>>> You are totally right, this may be a problem when instantiating
>>> multiple VMs at the same time. I've filed an issue to find the
>>> best way to generate the TID [1].
>>>
>>> We'd be interested in updating the tgtadm_next_tid function in
>>> scripts_common.sh. Also, if the tgt server is getting overloaded by
>>> these simultaneous deployments, there are several ways to limit the
>>> concurrency of the TM (e.g. the -t option in oned.conf).
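[For reference, not part of the original mail: the TM concurrency is set through the -t argument of the TM_MAD driver section in oned.conf. The fragment below is illustrative of a 3.8-era configuration; the exact driver list depends on the installation:]

```
TM_MAD = [
    executable = "one_tm",
    arguments  = "-t 15 -d dummy,lvm,shared,qcow2,ssh,iscsi" ]
```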
>>>
>>> THANKS for the feedback!
>>>
>>> Ruben
>>>
>>> [1]  http://dev.opennebula.org/issues/1682
>>>
>>> On Thu, Dec 6, 2012 at 1:52 PM, Alain Pannetrat
>>> <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I'm new to OpenNebula and this mailing list, so forgive me if I
>>>> stumble over a topic that may have already been discussed.
>>>>
>>>> I'm currently discovering opennebula 3.8.1 with a simple 3 node
>>>> system: a control node, a compute node and a datastore node
>>>> (iscsi+lvm).
>>>>
>>>> I have been testing the bulk instantiation of virtual machines in
>>>> sunstone, where I initiate the bulk creation of 8 virtual machines in
>>>> parallel. I have noticed that between 2 and 4 machines just fail to
>>>> instantiate correctly with the typical following error message:
>>>>
>>>> 08 2012 [TM][I]: Command execution fail:
>>>> /var/lib/one/remotes/tm/iscsi/clone
>>>> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26
>>>> compute.admin.lan:/var/lib/one//datastores/0/111/disk.0 111 101
>>>> Thu Dec  6 14:40:08 2012 [TM][E]: clone: Command "    set -e
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: set -x
>>>> Thu Dec  6 14:40:08 2012 [TM][I]:
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: # get size
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: SIZE=$(sudo lvs --noheadings -o
>>>> lv_size "/dev/vg-one/lv-one-26")
>>>> Thu Dec  6 14:40:08 2012 [TM][I]:
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: # create lv
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo lvcreate -L${SIZE} vg-one -n
>>>> lv-one-26-111
>>>> Thu Dec  6 14:40:08 2012 [TM][I]:
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: # clone lv with dd
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo dd if=/dev/vg-one/lv-one-26
>>>> of=/dev/vg-one/lv-one-26-111 bs=64k
>>>> Thu Dec  6 14:40:08 2012 [TM][I]:
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: # new iscsi target
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: TID=$(sudo tgtadm --lld iscsi --op
>>>> show --mode target |             grep "Target" | tail -n 1 |
>>>>  awk '{split($2,tmp,":"); print tmp[1]+1;}')
>>>> Thu Dec  6 14:40:08 2012 [TM][I]:
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
>>>> --mode target --tid $TID  --targetname
>>>> iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op bind
>>>> --mode target --tid $TID -I ALL
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgtadm --lld iscsi --op new
>>>> --mode logicalunit --tid $TID  --lun 1 --backing-store
>>>> /dev/vg-one/lv-one-26-111
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: sudo tgt-admin --dump |sudo tee
>>>> /etc/tgt/targets.conf > /dev/null 2>&1" failed: + sudo lvs
>>>> --noheadings -o lv_size /dev/vg-one/lv-one-26
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: 131072+0 records in
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: 131072+0 records out
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: 8589934592 bytes (8.6 GB) copied,
>>>> 898.903 s, 9.6 MB/s
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: tgtadm: this target already exists
>>>> Thu Dec  6 14:40:08 2012 [TM][E]: Error cloning
>>>> compute.admin.lan:/dev/vg-one/lv-one-26-111
>>>> Thu Dec  6 14:40:08 2012 [TM][I]: ExitCode: 22
>>>> Thu Dec  6 14:40:08 2012 [TM][E]: Error executing image transfer
>>>> script: Error cloning compute.admin.lan:/dev/vg-one/lv-one-26-111
>>>> Thu Dec  6 14:40:09 2012 [DiM][I]: New VM state is FAILED
>>>>
>>>> After adding traces in the code, I found that there seems to be a race
>>>> condition in /var/lib/one/remotes/tm/iscsi/clone where the following
>>>> commands get executed:
>>>>
>>>> TID=\$($SUDO $(tgtadm_next_tid))
>>>> $SUDO $(tgtadm_target_new "\$TID" "$NEW_IQN")
>>>>
>>>> These commands are typically expanded to something like this:
>>>>
>>>> TID=$(sudo tgtadm --lld iscsi --op show --mode target | grep "Target"
>>>> | tail -n 1 | awk '{split($2,tmp,":"); print tmp[1]+1;}')
>>>> sudo tgtadm --lld iscsi --op new --mode target --tid $TID
>>>> --targetname iqn.2012-02.org.opennebula:san.vg-one.lv-one-26-111
>>>>
>>>> What seems to happen is that two (or more) calls to the first
>>>> command, tgtadm_next_tid, run simultaneously before the second
>>>> command gets a chance to execute, so TID ends up with the same
>>>> value for two (or more) VMs.
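[As an illustration, not from the driver code: the race can be closed by making the read-increment-use sequence atomic. A minimal sketch of that pattern using a counter file guarded by flock(1); the file names and helper are hypothetical:]

```shell
#!/bin/sh
# Hypothetical sketch: hand out unique TIDs from a counter file while
# holding an exclusive flock, so two concurrent clone scripts can
# never read the same "highest TID" snapshot.
TIDFILE=/tmp/one_tid_counter

alloc_tid() {
    (
        flock -x 9                            # block until we own the lock
        last=$(cat "$TIDFILE" 2>/dev/null || echo 0)
        tid=$(( ${last:-0} + 1 ))
        echo "$tid" > "$TIDFILE"              # persist while still locked
        echo "$tid"                           # return the unique TID
    ) 9>"$TIDFILE.lock"                       # fd 9 backs the lock
}
```

In the real driver the lock would have to be held across both tgtadm_next_tid and tgtadm_target_new, since the race is between those two steps, not inside either one.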
>>>>
>>>> The workaround I found is to replace the line:
>>>> TID=\$($SUDO $(tgtadm_next_tid))
>>>> with
>>>> TID=$VMID
>>>> in /var/lib/one/remotes/tm/iscsi/clone
>>>>
>>>> Since $VMID is globally unique, no race condition can happen here.
>>>> I've tested this and the failures don't happen anymore in my setup.
>>>> Of course, I'm not sure this is the ideal fix, since VMID may take
>>>> values that are out of range for tgtadm, so further testing would
>>>> be needed.
>>>>
>>>> I'd be happy to get your thoughts/feedback on this issue.
>>>>
>>>> Best,
>>>>
>>>> Alain
>>>> _______________________________________________
>>>> Users mailing list
>>>> [email protected]
>>>> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>>>
>>>
>>>
>>> --
>>> Ruben S. Montero, PhD
>>> Project co-Lead and Chief Architect
>>> OpenNebula - The Open Source Solution for Data Center Virtualization
>>> www.OpenNebula.org | [email protected] | @OpenNebula



-- 
Ruben S. Montero, PhD
Project co-Lead and Chief Architect
OpenNebula - The Open Source Solution for Data Center Virtualization
www.OpenNebula.org | [email protected] | @OpenNebula
_______________________________________________
Users mailing list
[email protected]
http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
