Re: [ovirt-users] How to create more than 1 vm from template

2014-06-09 Thread John Xue
On Mon, Jun 9, 2014 at 1:45 PM, John Xue xgxj...@gmail.com wrote:
 Dear all,

As you know, we can create one VM from a template, but how do we create many VMs
 at the same time? We call this a pool.

 --
 Regards,
 John Xue



-- 
Regards,
John Xue
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Ovirt Guest Agent Windows 7

2014-06-09 Thread Vinzenz Feenstra

On 06/05/2014 10:21 PM, Jeff Clay wrote:
I have the spice guest agent/tools installed, but I'm reading that I 
also need to install/setup the ovirt-guest-agent to get proper 
reporting of resources, etc. I'm following the instructions in 
https://github.com/oVirt/ovirt-guest-agent/blob/master/ovirt-guest-agent/README-windows.txt 



I am confused by this step:

Update the AGENT_CONFIG global variable in OVirtGuestService.py to
point to the right configuration location.

I can find the file without issue; the value I'm asked to change
has a default value of: AGENT_CONFIG = 'ovirt-guest-agent.ini'



I cannot locate a file named ovirt-guest-agent.ini within the 
C:\ovirt-guest-agent-master\ovirt-guest-agent folder, so I'm not sure 
what to set this value to.

The file is located in ovirt-guest-agent-master\configurations\.
Please copy all *.ini files into the same folder as the executable.
Then it should work.
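
For illustration, once the .ini files sit next to the executable the default
relative value in OVirtGuestService.py already resolves; alternatively you can
point it at an absolute path. A minimal sketch (the absolute path below is only
an example, not a required location):

[code]
# In OVirtGuestService.py -- illustrative only.
AGENT_CONFIG = 'ovirt-guest-agent.ini'   # works once the .ini files are copied
                                         # into the same folder as the executable
# AGENT_CONFIG = 'C:\\ovirt-guest-agent\\ovirt-guest-agent.ini'  # or an absolute path
[/code]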




___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users



--
Regards,

Vinzenz Feenstra | Senior Software Engineer
Red Hat Engineering Virtualization R&D
Phone: +420 532 294 625
IRC: vfeenstr or evilissimo

Better technology. Faster innovation. Powered by community collaboration.
See how it works at redhat.com

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] How to create more than 1 vm from template

2014-06-09 Thread Maor Lipchuk
Hi John,

You are right: if you want to create many VMs from a template, you can
create a pool.
I think the main difference between creating a single VM and creating a
pool is that in a pool you cannot create a VM with cloned disks.

regards,
Maor
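
For illustration, pool creation can also be scripted. A rough sketch with the
Python SDK of that era (ovirt-engine-sdk 3.x); the engine URL, credentials and
object names are placeholders, and the exact parameter names may need adjusting
to your SDK version:

[code]
# Rough sketch, not a verified recipe: create a pool of 5 VMs from a template
# using the oVirt Python SDK 3.x. URL, credentials and names are made up.
from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://engine.example.com/api',
          username='admin@internal', password='secret', insecure=True)

pool = params.VmPool(name='mypool',
                     size=5,                                    # number of VMs in the pool
                     cluster=api.clusters.get(name='Default'),
                     template=api.templates.get(name='mytemplate'))
api.vmpools.add(pool)
api.disconnect()
[/code]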

On 06/09/2014 08:45 AM, John Xue wrote:
 Dear all,
 
As you know, we can create 1 vm from template, but how to create many
 vm at the same time? We call it is a pool.
 
 -- 
 Regards,
 John Xue
 
 
 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users
 

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] iSCSI and multipath

2014-06-09 Thread Nicolas Ecarnot

Hi,

Context here :
- 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts
- connected to some LUNs in iSCSI on a dedicated physical network

Every host has two interfaces used for management and end-user LAN 
activity. Every host also has 4 additional NICs dedicated to the iSCSI 
network.

Those 4 NICs were set up from the oVirt web GUI in a bond with a 
single IP address and connected to the SAN.


Everything is working fine. I just had to manually tweak some points 
(MTU, other small things) but it is working.



Recently, our SAN dealer told us that using bonding in an iSCSI context 
was terrible, and that the recommendation is to use multipathing.
My pre-oVirt experience agrees with that. Long story 
short, when setting up the host from oVirt it was so 
convenient to click, set up bonding, and watch it work that I did 
not pay further attention (and we seem to have no bottleneck yet).

Anyway, I dedicated a host to experimenting, but things are not clear to me.
I know how to set up NICs, iSCSI and multipath to present a 
partition or a logical volume to the host OS, using multipathing instead of bonding.


But in this precise case, what disturbs me is that many of the layers 
described above are managed by oVirt (mounting/unmounting of LVs, creation of 
bridges on top of bonded interfaces, managing the WWIDs across the cluster).

And I see nothing related to multipath at the NIC level.
Though I can set everything up fine on the host, this setup does not 
match what oVirt expects: oVirt expects a bridge named after the 
iSCSI network and able to connect to the SAN.
My multipathing offers access to the partitions of the LUNs, which 
is not the same thing.


I saw that multipathing is discussed here:
http://www.ovirt.org/Feature/iSCSI-Multipath

There I read:

Add an iSCSI Storage to the Data Center
Make sure the Data Center contains networks.
Go to the Data Center main tab and choose the specific Data Center
At the sub tab choose iSCSI Bond


The only tabs I see are Storage/Logical Networks/Network 
QoS/Clusters/Permissions.


In this datacenter, I have one iSCSI master storage domain, two iSCSI 
storage domains and one NFS export domain.


What did I miss?


Press the new button to add a new iSCSI Bond
Configure the networks you want to add to the new iSCSI Bond.


Anyway, I'm not sure I understand the point of this wiki page and this 
implementation: it looks like multipathing at a much higher level, over 
virtual networks, and not at all what I'm talking about above...?

Well, as you see, I need enlightenment.

--
Nicolas Ecarnot
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Artyom Lukianov
I just blocked the connection to storage for testing, and as a result I got this 
error: Failed to acquire lock, error -243, so I added it to the reproduction steps.
If you know other steps that reproduce this error without blocking the connection 
to storage, it would be wonderful if you could provide them.
Thanks
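
For reference, the storage-blocking reproduction step mentioned above amounts to
inserting and later removing a single DROP rule; a small sketch (sd_ip stands
for the storage domain's IP address, which you would substitute):

[code]
# Sketch of the reproduction step: block/unblock the engine VM's access to the
# storage domain IP. Meant to run inside the engine VM; sd_ip is a placeholder.
import subprocess

def block_storage(sd_ip):
    subprocess.check_call(['iptables', '-I', 'INPUT', '-s', sd_ip, '-j', 'DROP'])

def unblock_storage(sd_ip):
    subprocess.check_call(['iptables', '-D', 'INPUT', '-s', sd_ip, '-j', 'DROP'])
[/code]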

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. 
 Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience with it, will be of use to you. It happened to
 me
 two weeks ago, ovirt-engine was current (3.4.1) and there was no fix
 available.

 Regards,

 Ivan

 On 06/06/2014 05:12 AM, Andrew Lau wrote:

 Hi,

 I'm seeing this weird message in my engine log

 2014-06-06 03:06:09,380 INFO
 [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo]
 (DefaultQuartzScheduler_Worker-79) 

Re: [ovirt-users] Recommended setup for a FC based storage domain

2014-06-09 Thread combuster

OK, I have good news and bad news :)

Good news is that I can run different VMs on different nodes when all 
of their drives are on the FC storage domain. I don't think that all I/O 
is running through the SPM, but I need to test that. Simply put, for every 
virtual disk that you create on the shared FC storage domain, oVirt will 
present that vdisk only to the node which is running the VM itself. They 
all can see the domain infrastructure (inbox, outbox, metadata), but the LV for 
the virtual disk itself is visible only to the node that is 
running that particular VM. There is no limitation (except for the free 
space on the storage).


Bad news!

I can create the virtual disk on the FC storage for a VM, but when I 
start the VM itself, the node which hosts the VM that I'm starting goes 
non-operational and quickly comes up again (the iLO fencing agent checks if 
the node is OK and brings it back up). During that time, the VM starts on 
another node (the Default Host parameter was ignored - the assigned host was not 
available). I can manually migrate it later to the intended node; that 
works. Lucky me, on two nodes (of the four) in the cluster there were 
no VMs running (I tried this on both, with two different VMs created 
from scratch, and I got the same result).


I've killed everything above WARNING because it was killing the 
performance of the cluster. vdsm.log:


[code]
Thread-305::WARNING::2014-06-09 
12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 
12:17:25,013::utils::129::root::(rmFile) File: 
/rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 
already removed
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 
12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) 
Could not find mapping for lv 
55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05
Thread-305::WARNING::2014-06-09 
12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 
12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 
12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-1857::WARNING::2014-06-09 
12:32:45,639::libvirtconnection::116::root::(wrapper) connection to 
libvirt broken. ecode: 1 edom: 7
Thread-1857::CRITICAL::2014-06-09 
12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling 
process down.
Thread-17704::WARNING::2014-06-09 
12:32:48,009::libvirtconnection::116::root::(wrapper) connection to 
libvirt broken. ecode: 1 edom: 7
Thread-17704::CRITICAL::2014-06-09 
12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling 
process down.
Thread-17704::ERROR::2014-06-09 
12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) 
vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed

Traceback (most recent call last):
  File /usr/share/vdsm/vm.py, line 2245, in _startUnderlyingVm
self._run()
  File /usr/share/vdsm/vm.py, line 3185, in _run
self._connection.createXML(domxml, flags),
  File /usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py, 
line 110, in wrapper

__connections.get(id(target)).pingLibvirt()
  File /usr/lib64/python2.6/site-packages/libvirt.py, line 3389, in 
getLibVersion
if ret == -1: raise libvirtError ('virConnectGetLibVersion() 
failed', conn=self)

libvirtError: internal error client socket is closed
Thread-1857::WARNING::2014-06-09 
12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus) 
vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::trying to set state to 
Powering down when already Down
Thread-1857::WARNING::2014-06-09 
12:32:50,815::utils::129::root::(rmFile) File: 
/var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.com.redhat.rhevm.vdsm 
already removed
Thread-1857::WARNING::2014-06-09 
12:32:50,816::utils::129::root::(rmFile) File: 
/var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.org.qemu.guest_agent.0 
already removed
MainThread::WARNING::2014-06-09 
12:33:03,770::fileUtils::167::Storage.fileUtils::(createdir) Dir 
/rhev/data-center/mnt already exists
MainThread::WARNING::2014-06-09 
12:33:05,738::clientIF::181::vds::(_prepareBindings) Unable to load the 
json rpc server module. Please make sure it is installed.
storageRefresh::WARNING::2014-06-09 
12:33:06,133::fileUtils::167::Storage.fileUtils::(createdir) Dir 
/rhev/data-center/hsm-tasks already exists
Thread-35::ERROR::2014-06-09 
12:33:08,375::sdc::137::Storage.StorageDomainCache::(_findDomain) 
looking for unfetched domain 55338570-e537-412b-97a9-635eea1ecb10
Thread-35::ERROR::2014-06-09 

Re: [ovirt-users] Recommended setup for a FC based storage domain

2014-06-09 Thread combuster

The bad news happens only when running a VM for the first time, if that helps...

On 06/09/2014 01:30 PM, combuster wrote:

OK, I have good news and bad news :)

Good news is that I can run different VM's on different nodes when all 
of their drives are on FC Storage domain. I don't think that all of 
I/O is running through SPM, but I need to test that. Simply put, for 
every virtual disk that you create on the shared fc storage domain, 
ovirt will present that vdisk only to the node wich is running the VM 
itself. They all can see domain infrastructure (inbox,outbox,metadata) 
but the LV for the virtual disk itself for that VM is visible only to 
the node that is running that particular VM. There is no limitation 
(except for the free space on the storage).


Bad news!

I can create the virtual disk on the fc storage for a vm, but when I 
start the VM itself, node wich hosts the VM that I'm starting is going 
non-operational, and quickly goes up again (ilo fencing agent checks 
if the node is ok and bring it back up). During that time, vm starts 
on another node (Default Host parameter was ignored - assigned Host 
was not available). I can manualy migrate it later to the intended 
node, that works. Lucky me, on two nodes (of the four) in the cluster, 
there were no vm's running (i tried this on both, with two different 
vm's created from scratch and i got the same result.


I've killed everything above WARNING because it was killing the 
performance of the cluster. vdsm.log :


[code]
Thread-305::WARNING::2014-06-09 
12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 
12:17:25,013::utils::129::root::(rmFile) File: 
/rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 
already removed
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 
12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) 
Could not find mapping for lv 
55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05
Thread-305::WARNING::2014-06-09 
12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 
12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 
12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) 
data has no embedded checksum - trust it as it is
Thread-1857::WARNING::2014-06-09 
12:32:45,639::libvirtconnection::116::root::(wrapper) connection to 
libvirt broken. ecode: 1 edom: 7
Thread-1857::CRITICAL::2014-06-09 
12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling 
process down.
Thread-17704::WARNING::2014-06-09 
12:32:48,009::libvirtconnection::116::root::(wrapper) connection to 
libvirt broken. ecode: 1 edom: 7
Thread-17704::CRITICAL::2014-06-09 
12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling 
process down.
Thread-17704::ERROR::2014-06-09 
12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) 
vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed

Traceback (most recent call last):
  File /usr/share/vdsm/vm.py, line 2245, in _startUnderlyingVm
self._run()
  File /usr/share/vdsm/vm.py, line 3185, in _run
self._connection.createXML(domxml, flags),
  File /usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py, 
line 110, in wrapper

__connections.get(id(target)).pingLibvirt()
  File /usr/lib64/python2.6/site-packages/libvirt.py, line 3389, in 
getLibVersion
if ret == -1: raise libvirtError ('virConnectGetLibVersion() 
failed', conn=self)

libvirtError: internal error client socket is closed
Thread-1857::WARNING::2014-06-09 
12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus) 
vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::trying to set state to 
Powering down when already Down
Thread-1857::WARNING::2014-06-09 
12:32:50,815::utils::129::root::(rmFile) File: 
/var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.com.redhat.rhevm.vdsm 
already removed
Thread-1857::WARNING::2014-06-09 
12:32:50,816::utils::129::root::(rmFile) File: 
/var/lib/libvirt/qemu/channels/2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9.org.qemu.guest_agent.0 
already removed
MainThread::WARNING::2014-06-09 
12:33:03,770::fileUtils::167::Storage.fileUtils::(createdir) Dir 
/rhev/data-center/mnt already exists
MainThread::WARNING::2014-06-09 
12:33:05,738::clientIF::181::vds::(_prepareBindings) Unable to load 
the json rpc server module. Please make sure it is installed.
storageRefresh::WARNING::2014-06-09 
12:33:06,133::fileUtils::167::Storage.fileUtils::(createdir) Dir 
/rhev/data-center/hsm-tasks already exists
Thread-35::ERROR::2014-06-09 

Re: [ovirt-users] iSCSI and multipath

2014-06-09 Thread Maor Lipchuk
Hi Nicolas,

Which DC level are you using?
iSCSI multipath is only supported starting from a DC with compatibility
version 3.4.

regards,
Maor

On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote:
 Hi,
 
 Context here :
 - 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts
 - connected to some LUNs in iSCSI on a dedicated physical network
 
 Every host has two interfaces used for management and end-user LAN
 activity. Every host also have 4 additional NICs dedicated to the iSCSI
 network.
 
 Those 4 NICs were setup from the oVirt web GUI in a bonding with a
 unique IP address and connected to the SAN.
 
 Everything is working fine. I just had to manually tweak some points
 (MTU, other small things) but it is working.
 
 
 Recently, our SAN dealer told us that using bonding in an iSCSI context
 was terrible, and the recommendation is to use multipathing.
 My previous experience pre-oVirt was to agree with that. Long story
 short is just that when setting up the host from oVirt, it was so
 convenient to click and setup bonding, and observe it working that I did
 not pay further attention. (and we seem to have no bottleneck yet).
 
 Anyway, I dedicated a host to experiment, I things are not clear to me.
 I know how to setup NICs, iSCSI and multipath to present the host OS a
 partition or a logical volume, using multipathing instead of bonding.
 
 But in this precise case, what is disturbing me is that many layers
 described above are managed by oVirt (mount/unmount of LV, creation of
 bridges on top of bonded interfaces, managing the WWID amongst the
 cluster).
 
 And I see nothing related to multipath at the NICs level.
 Though I can setup everything fine in the host, this setup does not
 match what oVirt is expecting : oVirt is expecting a bridge named as the
 iSCSI network, and able to connect to the SAN.
 My multipathing is offering the access to the partition of the LUNs, it
 is not the same.
 
 I saw that multipathing is talked here :
 http://www.ovirt.org/Feature/iSCSI-Multipath
 
 I here read :
 Add an iSCSI Storage to the Data Center
 Make sure the Data Center contains networks.
 Go to the Data Center main tab and choose the specific Data Center
 At the sub tab choose iSCSI Bond
 
 The only tabs I see are Storage/Logical Networks/Network
 QoS/Clusters/Permissions.
 
 In this datacenter, I have one iSCSI master storage domain, two iSCSI
 storage domains and one NFS export domain.
 
 What did I miss?
 
 Press the new button to add a new iSCSI Bond
 Configure the networks you want to add to the new iSCSI Bond.
 
 Anyway, I'm not sure to understand the point of this wiki page and this
 implementation : it looks like a much higher level of multipathing over
 virtual networks, and not at all what I'm talking about above...?
 
 Well as you see, I need enlightenments.
 

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
Interesting, my storage network is L2 only and doesn't run on
ovirtmgmt (which is the only network the HostedEngine sees), but I've only
seen this issue when running ctdb in front of my NFS server. I
was previously using localhost, as all my hosts had the NFS server on
them (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known problem. 
 Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 Lucky
 me, I had already exported all of the VM's on the first sign of trouble
 and
 then installed ovirt-engine on the dedicated server and attached the
 export
 domain.

 So while really a usefull feature, and it's working (for the most part
 ie,
 automatic migration works), manually migrating VM with the hosted-engine
 will lead to troubles.

 I hope that my experience 

Re: [ovirt-users] iSCSI and multipath

2014-06-09 Thread Nicolas Ecarnot

Le 09-06-2014 13:55, Maor Lipchuk a écrit :

Hi Nicolas,

Which DC level are you using?
iSCSI multipath should be supported only from DC with compatibility
version of 3.4


Hi Maor,

Oops, you're right: both of my 3.4 datacenters are using the 3.3 level.
I migrated recently.

How safe or risky is it to raise this DC level?



regards,
Maor

On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote:

Hi,

Context here :
- 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 
hosts

- connected to some LUNs in iSCSI on a dedicated physical network

Every host has two interfaces used for management and end-user LAN
activity. Every host also have 4 additional NICs dedicated to the 
iSCSI

network.

Those 4 NICs were setup from the oVirt web GUI in a bonding with a
unique IP address and connected to the SAN.

Everything is working fine. I just had to manually tweak some points
(MTU, other small things) but it is working.


Recently, our SAN dealer told us that using bonding in an iSCSI 
context

was terrible, and the recommendation is to use multipathing.
My previous experience pre-oVirt was to agree with that. Long story
short is just that when setting up the host from oVirt, it was so
convenient to click and setup bonding, and observe it working that I 
did

not pay further attention. (and we seem to have no bottleneck yet).

Anyway, I dedicated a host to experiment, I things are not clear to 
me.

I know how to setup NICs, iSCSI and multipath to present the host OS a
partition or a logical volume, using multipathing instead of bonding.

But in this precise case, what is disturbing me is that many layers
described above are managed by oVirt (mount/unmount of LV, creation of
bridges on top of bonded interfaces, managing the WWID amongst the
cluster).

And I see nothing related to multipath at the NICs level.
Though I can setup everything fine in the host, this setup does not
match what oVirt is expecting : oVirt is expecting a bridge named as 
the

iSCSI network, and able to connect to the SAN.
My multipathing is offering the access to the partition of the LUNs, 
it

is not the same.

I saw that multipathing is talked here :
http://www.ovirt.org/Feature/iSCSI-Multipath

I here read :

Add an iSCSI Storage to the Data Center
Make sure the Data Center contains networks.
Go to the Data Center main tab and choose the specific Data 
Center

At the sub tab choose iSCSI Bond


The only tabs I see are Storage/Logical Networks/Network
QoS/Clusters/Permissions.

In this datacenter, I have one iSCSI master storage domain, two iSCSI
storage domains and one NFS export domain.

What did I miss?


Press the new button to add a new iSCSI Bond
Configure the networks you want to add to the new iSCSI Bond.


Anyway, I'm not sure to understand the point of this wiki page and 
this
implementation : it looks like a much higher level of multipathing 
over

virtual networks, and not at all what I'm talking about above...?

Well as you see, I need enlightenments.



--
Nicolas Ecarnot
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] iSCSI and multipath

2014-06-09 Thread Maor Lipchuk
Basically, you should upgrade your DC to 3.4, and then also upgrade the
clusters you desire to 3.4.

You might need to upgrade your hosts to be compatible with the cluster's
emulated machines, or they might become non-operational if qemu-kvm does
not support them.

Either way, you can always ask for advice on the mailing list if you
encounter any problem.

Regards,
Maor
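
For illustration, the compatibility bump can also be done through the Python
SDK; a rough sketch in 3.x SDK style (the setter and attribute names are from
memory, so treat them as assumptions and verify against your SDK, and note that
in practice the cluster level is raised before the DC level, as confirmed later
in this thread):

[code]
# Rough, unverified sketch: raise cluster and DC compatibility to 3.4 with the
# oVirt Python SDK 3.x. URL, credentials and names are placeholders.
from ovirtsdk.api import API
from ovirtsdk.xml import params

api = API(url='https://engine.example.com/api',
          username='admin@internal', password='secret', insecure=True)

cluster = api.clusters.get(name='Default')
cluster.set_version(params.Version(major='3', minor='4'))   # cluster level first
cluster.update()

dc = api.datacenters.get(name='Default')
dc.set_version(params.Version(major='3', minor='4'))        # then the DC level
dc.update()

api.disconnect()
[/code]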

On 06/09/2014 03:30 PM, Nicolas Ecarnot wrote:
 Le 09-06-2014 13:55, Maor Lipchuk a écrit :
 Hi Nicolas,

 Which DC level are you using?
 iSCSI multipath should be supported only from DC with compatibility
 version of 3.4
 
 Hi Maor,
 
 Oops you're right, my both 3.4 datacenters are using 3.3 level.
 I migrated recently.
 
 How safe or risky is it to increase this DC level ?
 

 regards,
 Maor

 On 06/09/2014 01:06 PM, Nicolas Ecarnot wrote:
 Hi,

 Context here :
 - 2 setups (2 datacenters) in oVirt 3.4.1 with CentOS 6.4 and 6.5 hosts
 - connected to some LUNs in iSCSI on a dedicated physical network

 Every host has two interfaces used for management and end-user LAN
 activity. Every host also have 4 additional NICs dedicated to the iSCSI
 network.

 Those 4 NICs were setup from the oVirt web GUI in a bonding with a
 unique IP address and connected to the SAN.

 Everything is working fine. I just had to manually tweak some points
 (MTU, other small things) but it is working.


 Recently, our SAN dealer told us that using bonding in an iSCSI context
 was terrible, and the recommendation is to use multipathing.
 My previous experience pre-oVirt was to agree with that. Long story
 short is just that when setting up the host from oVirt, it was so
 convenient to click and setup bonding, and observe it working that I did
 not pay further attention. (and we seem to have no bottleneck yet).

 Anyway, I dedicated a host to experiment, I things are not clear to me.
 I know how to setup NICs, iSCSI and multipath to present the host OS a
 partition or a logical volume, using multipathing instead of bonding.

 But in this precise case, what is disturbing me is that many layers
 described above are managed by oVirt (mount/unmount of LV, creation of
 bridges on top of bonded interfaces, managing the WWID amongst the
 cluster).

 And I see nothing related to multipath at the NICs level.
 Though I can setup everything fine in the host, this setup does not
 match what oVirt is expecting : oVirt is expecting a bridge named as the
 iSCSI network, and able to connect to the SAN.
 My multipathing is offering the access to the partition of the LUNs, it
 is not the same.

 I saw that multipathing is talked here :
 http://www.ovirt.org/Feature/iSCSI-Multipath

 I here read :
 Add an iSCSI Storage to the Data Center
 Make sure the Data Center contains networks.
 Go to the Data Center main tab and choose the specific Data Center
 At the sub tab choose iSCSI Bond

 The only tabs I see are Storage/Logical Networks/Network
 QoS/Clusters/Permissions.

 In this datacenter, I have one iSCSI master storage domain, two iSCSI
 storage domains and one NFS export domain.

 What did I miss?

 Press the new button to add a new iSCSI Bond
 Configure the networks you want to add to the new iSCSI Bond.

 Anyway, I'm not sure to understand the point of this wiki page and this
 implementation : it looks like a much higher level of multipathing over
 virtual networks, and not at all what I'm talking about above...?

 Well as you see, I need enlightenments.

 

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] iSCSI and multipath

2014-06-09 Thread Nicolas Ecarnot

Le 09-06-2014 14:44, Maor Lipchuk a écrit :

basically, you should upgrade your DC to 3.4, and then upgrade the
clusters you desire also to 3.4.


Well, that seems to have worked, except I had to raise the cluster level 
first, then the DC level.


Now, I can see the iSCSI multipath tab has appeared.
But I confirm what I wrote below :


I saw that multipathing is talked here :
http://www.ovirt.org/Feature/iSCSI-Multipath


Add an iSCSI Storage to the Data Center
Make sure the Data Center contains networks.
Go to the Data Center main tab and choose the specific Data 
Center

At the sub tab choose iSCSI Bond
Press the new button to add a new iSCSI Bond
Configure the networks you want to add to the new iSCSI Bond.


Anyway, I'm not sure to understand the point of this wiki page and 
this
implementation : it looks like a much higher level of multipathing 
over

virtual networks, and not at all what I'm talking about above...?


I am actually trying to find out whether bonding interfaces (at a low level) 
for the iSCSI network is a bad thing, as my storage 
provider told me.


--
Nicolas Ecarnot
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] iSCSI and multipath

2014-06-09 Thread John Taylor
On Mon, Jun 9, 2014 at 9:23 AM, Nicolas Ecarnot nico...@ecarnot.net wrote:
 Le 09-06-2014 14:44, Maor Lipchuk a écrit :

 basically, you should upgrade your DC to 3.4, and then upgrade the
 clusters you desire also to 3.4.


 Well, that seems to have worked, except I had to raise the cluster level
 first, then the DC level.

 Now, I can see the iSCSI multipath tab has appeared.
 But I confirm what I wrote below :

 I saw that multipathing is talked here :
 http://www.ovirt.org/Feature/iSCSI-Multipath

 Add an iSCSI Storage to the Data Center
 Make sure the Data Center contains networks.
 Go to the Data Center main tab and choose the specific Data Center
 At the sub tab choose iSCSI Bond
 Press the new button to add a new iSCSI Bond
 Configure the networks you want to add to the new iSCSI Bond.


 Anyway, I'm not sure to understand the point of this wiki page and this
 implementation : it looks like a much higher level of multipathing over
 virtual networks, and not at all what I'm talking about above...?


 I am actually trying to know whether bonding interfaces (at low level) for
 the iSCSI network is a bad thing, as was told by my storage provider?

 --
 Nicolas Ecarnot


Hi Nicolas,
I think naming the managed iSCSI multipathing feature a bond
might be a bit confusing. It's not an ethernet/NIC bond, but a way to
group networks and targets together, so it's not bonding interfaces.
Behind the scenes it creates iSCSI
ifaces (/var/lib/iscsi/ifaces) and changes the way the iscsiadm calls
are constructed to use those ifaces (instead of the default) to
connect and log in to the targets.
Hope that helps.

-John
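
To illustrate what John describes, the per-network iface handling boils down to
iscsiadm calls along these lines (a sketch only; the iface, NIC, target and
portal names below are made up):

[code]
# Sketch of the kind of iscsiadm calls issued per iSCSI-bond network instead of
# using the default iface. All names are placeholders.
import subprocess

def login_via_iface(iface, nic, target, portal):
    # create a dedicated iSCSI iface bound to one NIC
    # (stored under /var/lib/iscsi/ifaces)
    subprocess.check_call(['iscsiadm', '-m', 'iface', '-I', iface, '--op', 'new'])
    subprocess.check_call(['iscsiadm', '-m', 'iface', '-I', iface, '--op', 'update',
                           '-n', 'iface.net_ifacename', '-v', nic])
    # discover targets through that iface (creates node records bound to it) ...
    subprocess.check_call(['iscsiadm', '-m', 'discovery', '-t', 'sendtargets',
                           '-p', portal, '-I', iface])
    # ... then log in through that iface only
    subprocess.check_call(['iscsiadm', '-m', 'node', '-T', target, '-p', portal,
                           '-I', iface, '--login'])

login_via_iface('ovirt-iscsi-1', 'eth2',
                'iqn.2014-06.com.example:storage.lun0', '10.10.10.10:3260')
[/code]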
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] oVirt - Node install on CentOS

2014-06-09 Thread Simon Barrett
Could anyone please confirm the correct process to run oVirt Node on a standard 
CentOS install, rather than using the node ISO?

I'm currently doing the following:

- Install CentOS 6.5
- Install the qemu-kvm-rhev RPMs to resolve live snapshot issues with the 
  CentOS-supplied RPMs
- Yum install vdsm ovirt-node-plugin-vdsm vdsm-reg 
  (I have to remove noexec from /tmp or the configuration fails)
- I then add the node from the ovirt-engine GUI

After resolving some problems with group memberships and vdsm requiring sudo 
access, all is working. Live snapshots and storage migration are OK (tested NFS 
and Gluster as well).

I couldn't really find any docs on how to do this, so I just wanted to confirm 
whether what I am doing makes sense.

I also don't have the text configuration interface that I would normally get on 
the oVirt Node ISO. Can I install this and use it on a non-node-ISO install?

Many thanks for any assistance.

Simon


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] Hacking in Ceph rather then Gluster.

2014-06-09 Thread Nathan Stratton
So I understand that the news is still fresh and there may not be much
going on yet in making Ceph work with oVirt, but I thought I would reach
out and see if it was possible to hack them together and still use librbd
rather than NFS.

I know, why not just use Gluster... the problem is that I have tried to use
Gluster for VM storage for years and I still don't think it is ready. Ceph
still has work to do in other areas, but this is one area where I think it
shines. This is a new lab cluster and I would like to try to use Ceph over
Gluster if possible.

Unless I am missing something, can anyone tell me they are happy with
Gluster as a backend image store? This will be a small 16-node 10 GbE
cluster of shared compute/storage (yes, I know people want to keep them
separate).


nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
www.broadsoft.com
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] oVirt - Node install on CentOS

2014-06-09 Thread Joop

Simon Barrett wrote:


Could anyone please confirm the correct process to run oVirt Node on a 
standard CentOS install, rather than using the node ISO?

I'm currently doing the following:

- Install CentOS 6.5
- Install the qemu-kvm-rhev RPMs to resolve live snapshot issues with the 
  CentOS-supplied RPMs
- Yum install vdsm ovirt-node-plugin-vdsm vdsm-reg 
  (I have to remove noexec from /tmp or the configuration fails)
- I then add the node from the ovirt-engine GUI

After resolving some problems with group memberships and vdsm 
requiring sudo access, all is working. Live snapshots and storage 
migration are OK (tested NFS and Gluster as well).

I couldn't really find any docs on how to do this, so I just wanted to 
confirm whether what I am doing makes sense.

I also don't have the text configuration interface that I would 
normally get on the oVirt Node ISO. Can I install this and use it on a 
non-node-ISO install?

If you install a minimal CentOS 6.5, add the oVirt repository and 
then add the host using the web UI of the engine, it will install all the 
needed packages (vdsm/libvirt/kvm) and you're done. You can then replace 
the standard qemu with the one that will do live snapshots. Depending on 
where your storage is located, you shouldn't have to tinker with 
memberships etc.


Regards,

Joop

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Spam Re: Spam Re: Spam Windows guest agent

2014-06-09 Thread Lev Veyde
Hi Bob,

Thanks for your feedback.
We fixed the issue, and the new version of the oVirt WGT ISO (3.5-2 alpha) is now 
available from the oVirt website:

http://resources.ovirt.org/pub/ovirt-master-snapshot-static/iso/ovirt-guest-tools/ovirt-guest-tools-3.5-2.iso

as well as the updated installer:

http://resources.ovirt.org/pub/ovirt-master-snapshot-static/exe/ovirt-guest-tools/ovirt-guest-tools-3.5-2.exe

Please also note that currently, upgrades between versions require manually 
stopping all the relevant services (the SPICE and oVirt agents) before performing an 
upgrade; we're working on getting this fixed as well.

Thanks in advance,
Lev Veyde.

- Original Message -
From: Bob Doolittle b...@doolittle.us.com
To: Sandro Bonazzola sbona...@redhat.com, Maurice James 
mja...@media-node.com, Joop jvdw...@xs4all.nl
Cc: Lev Veyde lve...@redhat.com, users@ovirt.org
Sent: Friday, June 6, 2014 5:44:42 PM
Subject: Re: [ovirt-users] Spam  Re:  Spam Re:  Spam  Windows guest agent

Just gave this a try on Windows Server 2008 R2, and it worked almost 
perfectly!

The one small problem I had: 
https://bugzilla.redhat.com/show_bug.cgi?id=1105624
Service was configured with Type Manual rather than Autostart, so did 
not restart upon reboot.
Easy workaround.

Thanks guys - this will be an enormous help! :)

-Bob

P.S. On my system the What would you like me to do with this CD? 
AutoPlay dialog has a goofy option - Import photos and videos (Using 
Dropbox). Not sure if that's something you can control.

On 06/06/2014 09:56 AM, Sandro Bonazzola wrote:
 Il 06/06/2014 15:29, Maurice James ha scritto:
 I think I got it. Just a few key steps that are not obvious for us python 
 for windows virgins. I will send in some screen shots with text so that
 someone with write access to the wiki can edit and post it

 I suggest to try the shiny new ovirt-guest-tools iso. You can find more info 
 here:
 http://www.ovirt.org/Features/oVirt_Windows_Guest_Tools




 --
 *From: *Joop jvdw...@xs4all.nl
 *To: *users@ovirt.org
 *Sent: *Friday, June 6, 2014 8:27:53 AM
 *Subject: *Re: [ovirt-users] Spam Re:  Spam  Windows guest agent

 On 6-6-2014 14:14, Karli Sjöberg wrote:


  Den 6 jun 2014 13:39 skrev Maurice James mja...@media-node.com:
  
   Yes that FM in particular :)


 The step of py2exe doesn't work: can't open File 'setup.py': [Errno 2] No 
 such file or directory
 My cd is where README-windows is located and the archive of today (-10min 
 ago)

 Just copying the parent dir to 'Program Files' and execute what is in the 
 README will work though. Thats how I have done it all the time.

 Joop


 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users



 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Hacking in Ceph rather then Gluster.

2014-06-09 Thread Itamar Heim

On 06/09/2014 01:28 PM, Nathan Stratton wrote:

So I understand that the news is still fresh and there may not be much
going on yet in making Ceph work with ovirt, but I thought I would reach
out and see if it was possible to hack them together and still use
librdb rather then NFS.

I know, why not just use Gluster... the problem is I have tried to use
Gluster for VM storage for years and I still don't think it is ready.
Ceph still has work in other areas, but this is one area where I think
it shines. This is a new lab cluster and I would like to try to use ceph
over gluster if possible.

Unless I am missing something, can anyone tell me they are happy with
Gluster as a backend image store? This will be a small 16 node 10 gig
cluster of shared compute / storage (yes I know people want to keep them
separate).

 
nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
www.broadsoft.com http://www.broadsoft.com


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users



There was a thread about this recently. AFAICT, Ceph support will 
require adding a specific Ceph storage domain to the engine and VDSM, which 
is a full-blown feature (I assume you could try and hack it somewhat 
with a custom hook). Waiting for the next version planning cycle to see 
if/how it gets pushed.
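
To give an idea of the custom-hook route mentioned above, here is a very rough,
untested sketch of a before_vm_start VDSM hook that rewrites one disk into a
librbd network disk; every name in it (pool/image, monitor, target device, hook
path) is made up, and this is not an endorsed or supported approach:

[code]
#!/usr/bin/python
# Hypothetical /usr/libexec/vdsm/hooks/before_vm_start/50_ceph -- untested sketch.
# Replaces the source of the disk attached as 'vdb' with a Ceph RBD network disk.
import hooking

RBD_IMAGE = 'rbd_pool/vm_disk_1'              # placeholder pool/image
MONITOR = ('ceph-mon1.example.com', '6789')   # placeholder monitor host/port

domxml = hooking.read_domxml()
for disk in domxml.getElementsByTagName('disk'):
    target = disk.getElementsByTagName('target')
    if not target or target[0].getAttribute('dev') != 'vdb':
        continue  # only touch the disk we want to redirect
    disk.setAttribute('type', 'network')
    for src in disk.getElementsByTagName('source'):
        disk.removeChild(src)                 # drop the original source element
    source = domxml.createElement('source')
    source.setAttribute('protocol', 'rbd')
    source.setAttribute('name', RBD_IMAGE)
    host = domxml.createElement('host')
    host.setAttribute('name', MONITOR[0])
    host.setAttribute('port', MONITOR[1])
    source.appendChild(host)
    disk.appendChild(source)
hooking.write_domxml(domxml)
[/code]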

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] Live migration - quest VM stall

2014-06-09 Thread Markus Stockhausen
Hello,

At the moment we are investigating stalls of Windows XP VMs during
live migration. Our environment consists of:

- FC20 hypervisor nodes
- qemu 1.6.2
- oVirt 3.4.1
- Guest: Windows XP SP2
- VM disks: virtio and IDE tested
- SPICE / VNC: both tested
- Balloon: with and without tested
- Cluster compatibility: 3.4 - CPU Nehalem

After 2-10 live migrations the Windows XP guest is no longer responsive.

At first we thought it might be related to SPICE, because we were
no longer able to log on to the console. So we installed the XP telnet server in
the VM, but that showed similar behaviour:

- The telnet welcome dialogue is always available (the network seems OK).
- Sometimes after a live migration, if you enter the password the telnet
  session gives no response.

In parallel, the SPICE console allows moving open windows, but as soon
as one clicks on the Start menu the system gives no response.

Even after updating to qemu 2.0 from the virt-preview repositories, the
behaviour stays the same. Looks like the system cannot access

Any ideas?

Markus



___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Hacking in Ceph rather then Gluster.

2014-06-09 Thread Nathan Stratton
Thanks, I will take a look at it. Is anyone else currently using Gluster for
backend images in production?



nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
www.broadsoft.com


On Mon, Jun 9, 2014 at 2:55 PM, Itamar Heim ih...@redhat.com wrote:

 On 06/09/2014 01:28 PM, Nathan Stratton wrote:

 So I understand that the news is still fresh and there may not be much
 going on yet in making Ceph work with ovirt, but I thought I would reach
 out and see if it was possible to hack them together and still use
 librdb rather then NFS.

 I know, why not just use Gluster... the problem is I have tried to use
 Gluster for VM storage for years and I still don't think it is ready.
 Ceph still has work in other areas, but this is one area where I think
 it shines. This is a new lab cluster and I would like to try to use ceph
 over gluster if possible.

 Unless I am missing something, can anyone tell me they are happy with
 Gluster as a backend image store? This will be a small 16 node 10 gig
 cluster of shared compute / storage (yes I know people want to keep them
 separate).

  
 nathan stratton | vp technology | broadsoft, inc | +1-240-404-6580 |
 www.broadsoft.com http://www.broadsoft.com


 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users


 there was a threat about this recently. afaict, ceph support will require
 adding a specific ceph storage domain to engine and vdsm, which is a full
 blown feature (I assume you could try and hack it somewhat with a custom
 hook). waiting for next version planning cycle to see if/how it gets pushed.

___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
So after adding L3 capabilities to my storage network, I'm no
longer seeing this issue. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? The engine couldn't access
the storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known 
 problem. Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us 
 wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's 
 related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't want to
 start) but the database was damaged enough and acted pretty weird
 (showed
 that storage domains were down but the vm's were running fine etc).
 

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:
 So after adding the L3 capabilities to my storage network, I'm no
 longer seeing this issue anymore. So the engine needs to be able to
 access the storage domain it sits on? But that doesn't show up in the
 UI?

 Ivan, was this also the case with your setup? Engine couldn't access
 storage domain?

 On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:
 I just blocked connection to storage for testing, but on result I had this 
 error: Failed to acquire lock error -243, so I added it in reproduce 
 steps.
 If you know another steps to reproduce this error, without blocking 
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
 error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:
 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:
 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:
 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was nfs
 on top of gluster. So if you have this setup, then it's a known 
 problem. Or
 you should double check if you hosts have different ids otherwise they 
 would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after your
 manual migration? It's a little frustrating this is happening as I was
 hoping to get this into a production environment. It was all working
 except that log message :(

 Thanks,
 Andrew


 On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us 
 wrote:

 Hi Andrew,

 this is something that I saw in my logs too, first on one node and 
 then
 on
 the other three. When that happend on all four of them, engine was
 corrupted
 beyond repair.

 First of all, I think that message is saying that sanlock can't get a
 lock
 on the shared storage that you defined for the hostedengine during
 installation. I got this error when I've tried to manually migrate the
 hosted engine. There is an unresolved bug there and I think it's 
 related
 to
 this one:

 [Bug 1093366 - Migration of hosted-engine vm put target host score to
 zero]
 https://bugzilla.redhat.com/show_bug.cgi?id=1093366

 This is a blocker bug (or should be) for the selfhostedengine and, 
 from
 my
 own experience with it, shouldn't be used in the production enviroment
 (not
 untill it's fixed).

 Nothing that I've done couldn't fix the fact that the score for the
 target
 node was Zero, tried to reinstall the node, reboot the node, restarted
 several services, tailed a tons of logs etc but to no avail. When only
 one
 node was left (that was actually running the hosted engine), I brought
 the
 engine's vm down gracefully (hosted-engine --vm-shutdown I belive) and
 after
 that, when I've tried to start the vm - it wouldn't load. Running VNC
 showed
 that the filesystem inside the vm was corrupted and when I ran fsck 
 and
 finally started up - it was too badly damaged. I succeded to start the
 engine itself (after repairing postgresql service that wouldn't 

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread combuster
Nah, I explicitly allowed the hosted-engine VM to access the NAS device as 
well as the NFS share itself, before the deploy procedure even started. But 
I'm puzzled at how you can reproduce the bug; everything was fine on my setup 
until I started a manual migration of the engine's VM. Even auto migration 
worked before that (tested it). Does it just happen without any action on the 
engine itself? Is the score 0 for just one node, or for two of the three?
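
For the record, the ids Jirka mentioned are trivial to double-check; this is a 
rough sketch of how it can be done on each node (an assumption on my part that 
the id is still kept as host_id in /etc/ovirt-hosted-engine/hosted-engine.conf):

[code]
#!/usr/bin/env python
# Sketch only: print the hosted-engine host_id configured on this node.
# Run it on every node; each one should report a different number.
CONF = "/etc/ovirt-hosted-engine/hosted-engine.conf"

with open(CONF) as f:
    for line in f:
        line = line.strip()
        if line.startswith("host_id="):
            print("%s: %s" % (CONF, line))
            break
[/code]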

On 06/10/2014 01:02 AM, Andrew Lau wrote:

nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue anymore. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? Engine couldn't access
storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, my storage network is a L2 only and doesn't run on the
ovirtmgmt (which is the only thing HostedEngine sees) but I've only
seen this issue when running ctdb in front of my NFS server. I
previously was using localhost as all my hosts had the nfs server on
it (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com wrote:

I just blocked connection to storage for testing, but on result I had this error: 
Failed to acquire lock error -243, so I added it in reproduce steps.
If you know another steps to reproduce this error, without blocking connection 
to storage it also can be wonderful if you can provide them.
Thanks

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message: internal 
error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a score of 0 now.

I'm also curious, in the BZ there's a note about:

where engine-vm block connection to storage domain(via iptables -I
INPUT -s sd_ip -j DROP)

What's the purpose for that?

On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com wrote:

Ignore that, the issue came back after 10 minutes.

I've even tried a gluster mount + nfs server on top of that, and the
same issue has come back.

On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, I put it all into global maintenance. Shut it all down
for 10~ minutes, and it's regained it's sanlock control and doesn't
seem to have that issue coming up in the log.

On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us wrote:

It was pure NFS on a NAS device. They all had different ids (had no
redeployements of nodes before problem occured).

Thanks Jirka.


On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

I've seen that problem in other threads, the common denominator was nfs
on top of gluster. So if you have this setup, then it's a known problem. Or
you should double check if you hosts have different ids otherwise they would
be trying to acquire the same lock.

--Jirka

On 06/06/2014 08:03 AM, Andrew Lau wrote:

Hi Ivan,

Thanks for the in depth reply.

I've only seen this happen twice, and only after I added a third host
to the HA cluster. I wonder if that's the root problem.

Have you seen this happen on all your installs or only just after your
manual migration? It's a little frustrating this is happening as I was
hoping to get this into a production environment. It was all working
except that log message :(

Thanks,
Andrew


On Fri, Jun 6, 2014 at 3:20 PM, combuster combus...@archlinux.us wrote:

Hi Andrew,

this is something that I saw in my logs too, first on one node and then
on
the other three. When that happend on all four of them, engine was
corrupted
beyond repair.

First of all, I think that message is saying that sanlock can't get a
lock
on the shared storage that you defined for the hostedengine during
installation. I got this error when I've tried to manually migrate the
hosted engine. There is an unresolved bug there and I think it's related
to
this one:

[Bug 1093366 - Migration of hosted-engine vm put target host score to
zero]
https://bugzilla.redhat.com/show_bug.cgi?id=1093366

This is a blocker bug (or should be) for the selfhostedengine and, from
my
own experience with it, shouldn't be used in the production enviroment
(not
untill it's fixed).

Nothing that I've done couldn't fix the fact that the score for the
target
node was Zero, tried to reinstall the node, reboot the node, restarted
several services, tailed a tons of logs etc but to no avail. When only
one
node was left (that was actually running 

Re: [ovirt-users] Recommended setup for a FC based storage domain

2014-06-09 Thread combuster
Hm, another update on this one. If I create another VM with another 
virtual disk on a node that already has a VM running from the FC 
storage, then libvirt doesn't break. I guess it only happens the first 
time on any given node. If that's the case, I would have to move the VMs 
off the other two nodes in this four-node cluster and start a VM from the 
FC storage on each of them, just to make sure it doesn't break 
during working hours. I guess it would be fine then.


It seems to me that this is some sort of timeout issue that happens 
when I start the VM for the first time on the FC storage domain. It could 
have something to do with the FC card driver settings, or libvirt may not 
wait for ovirt-engine to present the new LV to the targeted node. I don't see 
why ovirt-engine waits for the first launch of the VM to present the LV 
at all; shouldn't it do this at the time of virtual disk creation, given 
that I have selected a specific host to run it on?
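
In case anyone wants to check this on their own nodes: the disk image is just 
an LV inside the storage domain's VG, so whether a node can "see" it comes 
down to whether that LV is active there. A minimal sketch (not vdsm code; the 
VG and LV uuids are simply the ones from my log below, swap in your own):

[code]
#!/usr/bin/env python
# Rough sketch: check whether the LV backing a virtual disk is active on
# *this* host. VG = storage domain uuid, LV = image volume uuid.
import subprocess

VG = "55338570-e537-412b-97a9-635eea1ecb10"
LV = "242a1bce-a434-4246-ad24-b62f99c03a05"

def lv_is_active(vg, lv):
    # the 5th character of lv_attr is 'a' when the LV is active on this host
    attr = subprocess.check_output(
        ["lvs", "--noheadings", "-o", "lv_attr", "%s/%s" % (vg, lv)]
    ).decode().strip()
    return len(attr) >= 5 and attr[4] == "a"

print("%s/%s active here: %s" % (VG, LV, lv_is_active(VG, LV)))
[/code]

Running it on the node the VM is pinned to, before and after the first start, 
should show whether the LV only gets activated at launch time.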


On 06/09/2014 01:49 PM, combuster wrote:

Bad news happens only when running a VM for the first time, if it helps...

On 06/09/2014 01:30 PM, combuster wrote:

OK, I have good news and bad news :)

Good news is that I can run different VMs on different nodes when 
all of their drives are on the FC storage domain. I don't think all 
of the I/O is running through the SPM, but I need to test that. Simply put, 
for every virtual disk that you create on the shared FC storage 
domain, oVirt will present that vdisk only to the node which is 
running the VM itself. They can all see the domain infrastructure 
(inbox, outbox, metadata), but the LV for that VM's virtual disk 
is visible only to the node that is running that particular 
VM. There is no limitation (except for the free space on the storage).


Bad news!

I can create the virtual disk on the FC storage for a VM, but when I 
start the VM itself, the node which hosts the VM I'm starting goes 
non-operational, and quickly comes up again (the iLO fencing agent 
checks whether the node is OK and brings it back up). During that time, the 
VM starts on another node (the Default Host parameter was ignored, since 
the assigned host was not available). I can manually migrate it later to the 
intended node; that works. Lucky me, two nodes (of the four) in 
the cluster had no VMs running; I tried this on both, with 
two different VMs created from scratch, and got the same result.


I've silenced everything below WARNING because the verbose logging was 
killing the performance of the cluster. vdsm.log:


[code]
Thread-305::WARNING::2014-06-09 12:15:53,236::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,013::utils::129::root::(rmFile) File: /rhev/data-center/a0500f5c-e8d9-42f1-8f04-15b23514c8ed/55338570-e537-412b-97a9-635eea1ecb10/images/90659ad8-bd90-4a0a-bb4e-7c6afe90e925/242a1bce-a434-4246-ad24-b62f99c03a05 already removed
55809e40-ccf3-4f7c-aeec-802bc1c326a7::WARNING::2014-06-09 12:17:25,074::blockSD::761::Storage.StorageDomain::(_getOccupiedMetadataSlots) Could not find mapping for lv 55338570-e537-412b-97a9-635eea1ecb10/242a1bce-a434-4246-ad24-b62f99c03a05
Thread-305::WARNING::2014-06-09 12:20:54,341::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:25:55,378::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-305::WARNING::2014-06-09 12:30:56,424::persistentDict::256::Storage.PersistentDict::(refresh) data has no embedded checksum - trust it as it is
Thread-1857::WARNING::2014-06-09 12:32:45,639::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-1857::CRITICAL::2014-06-09 12:32:45,640::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::WARNING::2014-06-09 12:32:48,009::libvirtconnection::116::root::(wrapper) connection to libvirt broken. ecode: 1 edom: 7
Thread-17704::CRITICAL::2014-06-09 12:32:48,013::libvirtconnection::118::root::(wrapper) taking calling process down.
Thread-17704::ERROR::2014-06-09 12:32:48,018::vm::2285::vm.Vm::(_startUnderlyingVm) vmId=`2bee9d79-b8d1-4a5a-a4f7-8092d1c803d9`::The vm start process failed

Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2245, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3185, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 110, in wrapper
    __connections.get(id(target)).pingLibvirt()
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 3389, in getLibVersion
    if ret == -1: raise libvirtError ('virConnectGetLibVersion() failed', conn=self)
libvirtError: internal error client socket is closed
Thread-1857::WARNING::2014-06-09 12:32:50,673::vm::1963::vm.Vm::(_set_lastStatus)

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread Andrew Lau
I'm really having a hard time finding out why it's happening.

If I set the cluster to global maintenance for a minute or two, the scores will
reset back to 2400. Set maintenance mode to none, and all will be fine
until a migration occurs. It seems it tries to migrate, fails, and sets
the score to 0 permanently, rather than for the 10 (?) minutes mentioned in
one of the oVirt slides.
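
For completeness, the "reset" I keep doing is nothing more than toggling
maintenance from one of the hosts; roughly like this (just a sketch wrapping
the hosted-engine CLI, assuming it behaves the same on your build):

[code]
#!/usr/bin/env python
# Sketch of the "reset" described above: put the HA cluster into global
# maintenance, wait a minute or two, then drop back to none and check scores.
import subprocess
import time

def set_maintenance(mode):
    subprocess.check_call(["hosted-engine", "--set-maintenance", "--mode=%s" % mode])

set_maintenance("global")
time.sleep(120)
set_maintenance("none")
subprocess.call(["hosted-engine", "--vm-status"])  # scores should be back at 2400
[/code]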

When I have two hosts, the score goes to 0 only when a migration occurs
(and just on the host which doesn't have the engine up). The score only
hits 0 after it has tried to migrate because I set the host to local
maintenance. Migrating the VM from the UI has worked quite a few
times, but it has recently started to fail.

When I have three hosts, after about 5 minutes of them all being up, the score
will hit 0 on the hosts not running the VMs. It doesn't even have to
attempt a migration before the score goes to 0. Stopping the ha agent
on one host and resetting it with the global maintenance method
brings it back to the 2-host scenario above.
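
A crude way to catch exactly when the drop happens is to poll the status
output and timestamp it, e.g. (only a sketch; it assumes --vm-status keeps
printing the Hostname and Score fields the way it does here):

[code]
#!/usr/bin/env python
# Sketch: poll hosted-engine --vm-status once a minute and note any host whose
# score has dropped to 0, so it can be lined up with agent.log afterwards.
import re
import subprocess
import time

while True:
    out = subprocess.check_output(["hosted-engine", "--vm-status"]).decode()
    for host, score in re.findall(r"Hostname\s+:\s+(\S+).*?Score\s+:\s+(\d+)", out, re.S):
        if score == "0":
            print("%s score 0 on %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), host))
    time.sleep(60)
[/code]

Lining those timestamps up against agent.log should at least show whether the
drop coincides with a migration attempt.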

I may move on and just go back to a standalone engine, as I'm not having
much luck with this.

On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote:
 Nah, I've explicitly allowed hosted-engine vm to be able to access the NAS
 device as the NFS share itself, before the deploy procedure even started.
 But I'm puzzled at how you can reproduce the bug, all was well on my setup
 before I've stated manual migration of the engine's vm. Even auto migration
 worked before that (tested it). Does it just happen without any procedure on
 the engine itself? Is the score 0 for just one node, or two of three of
 them?

 On 06/10/2014 01:02 AM, Andrew Lau wrote:

 nvm, just as I hit send the error has returned.
 Ignore this..

 On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

 So after adding the L3 capabilities to my storage network, I'm no
 longer seeing this issue anymore. So the engine needs to be able to
 access the storage domain it sits on? But that doesn't show up in the
 UI?

 Ivan, was this also the case with your setup? Engine couldn't access
 storage domain?

 On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

 Interesting, my storage network is a L2 only and doesn't run on the
 ovirtmgmt (which is the only thing HostedEngine sees) but I've only
 seen this issue when running ctdb in front of my NFS server. I
 previously was using localhost as all my hosts had the nfs server on
 it (gluster).

 On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com
 wrote:

 I just blocked connection to storage for testing, but on result I had
 this error: Failed to acquire lock error -243, so I added it in 
 reproduce
 steps.
 If you know another steps to reproduce this error, without blocking
 connection to storage it also can be wonderful if you can provide them.
 Thanks

 - Original Message -
 From: Andrew Lau and...@andrewklau.com
 To: combuster combus...@archlinux.us
 Cc: users users@ovirt.org
 Sent: Monday, June 9, 2014 3:47:00 AM
 Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
 internal error Failed to acquire lock error -243

 I just ran a few extra tests, I had a 2 host, hosted-engine running
 for a day. They both had a score of 2400. Migrated the VM through the
 UI multiple times, all worked fine. I then added the third host, and
 that's when it all fell to pieces.
 Other two hosts have a score of 0 now.

 I'm also curious, in the BZ there's a note about:

 where engine-vm block connection to storage domain(via iptables -I
 INPUT -s sd_ip -j DROP)

 What's the purpose for that?

 On Sat, Jun 7, 2014 at 4:16 PM, Andrew Lau and...@andrewklau.com
 wrote:

 Ignore that, the issue came back after 10 minutes.

 I've even tried a gluster mount + nfs server on top of that, and the
 same issue has come back.

 On Fri, Jun 6, 2014 at 6:26 PM, Andrew Lau and...@andrewklau.com
 wrote:

 Interesting, I put it all into global maintenance. Shut it all down
 for 10~ minutes, and it's regained it's sanlock control and doesn't
 seem to have that issue coming up in the log.

 On Fri, Jun 6, 2014 at 4:21 PM, combuster combus...@archlinux.us
 wrote:

 It was pure NFS on a NAS device. They all had different ids (had no
 redeployements of nodes before problem occured).

 Thanks Jirka.


 On 06/06/2014 08:19 AM, Jiri Moskovcak wrote:

 I've seen that problem in other threads, the common denominator was
 nfs
 on top of gluster. So if you have this setup, then it's a known
 problem. Or
 you should double check if you hosts have different ids otherwise
 they would
 be trying to acquire the same lock.

 --Jirka

 On 06/06/2014 08:03 AM, Andrew Lau wrote:

 Hi Ivan,

 Thanks for the in depth reply.

 I've only seen this happen twice, and only after I added a third
 host
 to the HA cluster. I wonder if that's the root problem.

 Have you seen this happen on all your installs or only just after
 your
 manual migration? It's a little frustrating this is happening as I
 was
 

Re: [ovirt-users] VM HostedEngie is down. Exist message: internal error Failed to acquire lock error -243

2014-06-09 Thread combuster

On 06/10/2014 07:19 AM, Andrew Lau wrote:

I'm really having a hard time finding out why it's happening..

If I set the cluster to global for a minute or two, the scores will
reset back to 2400. Set maintenance mode to none, and all will be fine
until a migration occurs. It seems it tries to migrate, fails and sets
the score to 0 permanently rather than the 10? minutes mentioned in
one of the ovirt slides.

When I have two hosts, it's score 0 only when a migration occurs.
(Just on the host which doesn't have engine up). The score 0 only
happens when it's tried to migrate when I set the host to local
maintenance. Migrating the VM from the UI has worked quite a few
times, but it's recently started to fail.

When I have three hosts, after 5~ mintues of them all up the score
will hit 0 on the hosts not running the VMs. It doesn't even have to
attempt to migrate before the score goes to 0. Stopping the ha agent
on one host, and resetting it with the global maintenance method
brings it back to the 2 host scenario above.

I may move on and just go back to a standalone engine as this is not
getting very much luck..

Well, I've done this already. I can't really afford so much unplanned 
downtime on my critical VMs, especially since it would take me several hours 
(even a whole day) to install a dedicated engine, then set up the nodes if 
need be, and then import the VMs from the export domain. I would love to help 
more to resolve this one, but I was pressed for time; I already had oVirt 3.3 
running (rock solid stable for a year and a half, started from 3.1 I think), 
and I couldn't spare more than a day trying to get around this bug (I had to 
have a setup running by the end of the weekend). I wasn't using gluster at 
all, so at least we now know that gluster is not a must in the mix. Besides, 
Artyom already described it nicely in the bug report, so I haven't had 
anything to add.


You were lucky, Andrew. When I tried the global maintenance method and 
restarted the VM, I got a corrupted filesystem on the engine VM and it 
wouldn't even start on the one node that had a good score. It was bad 
health or unknown state on all of the nodes. I managed to repair 
the fs on the VM via VNC and then just barely bring the services online, but 
the postgres db was too badly damaged, so the engine misbehaved.


At the time, I explained it to myself :) as the locking mechanism 
failing to prevent one node from trying to start (or write to) the VM while it 
was already running on another node, because the filesystem was so damaged 
that I couldn't believe it; in 15 years I've never seen an extX fs so 
badly damaged, and the fact that this happened during migration only 
reinforced that thought.
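
If someone wants to sanity-check that theory on a live setup, sanlock itself
can be asked what it holds; a rough sketch (assuming the sanlock client tool
from the sanlock package is available on the hosts):

[code]
#!/usr/bin/env python
# Sketch: list the lockspaces ("s ...") and resources ("r ...") sanlock holds
# on this host. Run it on every node and compare.
import subprocess

out = subprocess.check_output(["sanlock", "client", "status"]).decode()
for line in out.splitlines():
    if line.startswith(("s ", "r ")):
        print(line)
[/code]

Run on every node at the same time, the hosted engine's lease should show up
on at most one of them; if it ever shows up on two, that would explain the
kind of fs damage I saw.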


On Tue, Jun 10, 2014 at 3:11 PM, combuster combus...@archlinux.us wrote:

Nah, I've explicitly allowed hosted-engine vm to be able to access the NAS
device as the NFS share itself, before the deploy procedure even started.
But I'm puzzled at how you can reproduce the bug, all was well on my setup
before I've stated manual migration of the engine's vm. Even auto migration
worked before that (tested it). Does it just happen without any procedure on
the engine itself? Is the score 0 for just one node, or two of three of
them?

On 06/10/2014 01:02 AM, Andrew Lau wrote:

nvm, just as I hit send the error has returned.
Ignore this..

On Tue, Jun 10, 2014 at 9:01 AM, Andrew Lau and...@andrewklau.com wrote:

So after adding the L3 capabilities to my storage network, I'm no
longer seeing this issue anymore. So the engine needs to be able to
access the storage domain it sits on? But that doesn't show up in the
UI?

Ivan, was this also the case with your setup? Engine couldn't access
storage domain?

On Mon, Jun 9, 2014 at 9:56 PM, Andrew Lau and...@andrewklau.com wrote:

Interesting, my storage network is a L2 only and doesn't run on the
ovirtmgmt (which is the only thing HostedEngine sees) but I've only
seen this issue when running ctdb in front of my NFS server. I
previously was using localhost as all my hosts had the nfs server on
it (gluster).

On Mon, Jun 9, 2014 at 9:15 PM, Artyom Lukianov aluki...@redhat.com
wrote:

I just blocked connection to storage for testing, but on result I had
this error: Failed to acquire lock error -243, so I added it in reproduce
steps.
If you know another steps to reproduce this error, without blocking
connection to storage it also can be wonderful if you can provide them.
Thanks

- Original Message -
From: Andrew Lau and...@andrewklau.com
To: combuster combus...@archlinux.us
Cc: users users@ovirt.org
Sent: Monday, June 9, 2014 3:47:00 AM
Subject: Re: [ovirt-users] VM HostedEngie is down. Exist message:
internal error Failed to acquire lock error -243

I just ran a few extra tests, I had a 2 host, hosted-engine running
for a day. They both had a score of 2400. Migrated the VM through the
UI multiple times, all worked fine. I then added the third host, and
that's when it all fell to pieces.
Other two hosts have a