Mmm... I think there is a bug with the ISO domain, and I am not sure if one has already been opened.

Can you help me debug this and see if it's related? :)

I think you have some intermittent network issues to the ISO domain, and every time it happens, the VMs that were booted with a CD (even if you detached it) pause.

I have a second suspicion... is it possible that the VMs that paused had a CD and you ejected it at some point, perhaps after or during the network issues you had on the 14th? Can you run dumpxml from libvirt? Let me know if you need help with this command.
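
For example, something like this on the host where the VM is running ("babbage" is just a placeholder for the VM name):

  virsh -r dumpxml babbage | grep -B2 -A6 cdrom

If the XML still shows a cdrom device pointing at the ISO domain, that would explain the pauses.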

Thanks,

Dafna

On 01/29/2014 02:16 PM, Neil wrote:
Hi Dafna,


On Wed, Jan 29, 2014 at 1:14 PM, Dafna Ron <d...@redhat.com> wrote:
The reason I asked about the size is because this was the original issue, no?
VMs pausing on lack of space?
Apologies, I just wanted to make sure it was still about this pausing
and not the original migration issue that I think you were also
helping me with a few weeks back.

You're having a problem with your data domains.
Can you check the route from the hosts to the storage? I think you have
some disconnection to the storage from the hosts.
Since it's random and not from all the VMs, I would suggest it's a
routing problem.
Thanks,
Dafna
The connection to the main data domain is 8Gb Fibre Channel directly
from each of the hosts to the FC SAN, so if it is a connection issue
then I can't understand how anything would be working. Or am I barking
up the wrong tree completely? There were some Ethernet network
bridging changes on each of the hosts in early January, but these would
only affect the NFS-mounted ISO domain. Could this be the cause of
the problems?
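
If it helps, I can run something like the following from each host to re-check the path to the ISO domain (the NFS server name and address below are only placeholders):

  ping -c 3 iso-store.example.com
  ip route get 192.0.2.10      # replace with the ISO domain server's address
  grep nfs /proc/mounts        # confirm the ISO domain is still mounted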

Is this disconnection causing the huge log files that I sent previously?

Thank you.

Regards.

Neil Wilson.


On 01/29/2014 08:00 AM, Neil wrote:
Sorry, more on this issue: I see my logs are rapidly filling up my
disk space on node02 with this error in /var/log/messages...

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
vmId=`dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 513, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/vm.py", line 835, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 76, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1814, in blockInfo
    if ret is None: raise libvirtError('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path
/rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/6c3e5ae8-23fc-4196-ba42-778bdc0fbad8
not assigned to domain

Jan 29 09:56:53 node02 vdsm vm.Vm ERROR
vmId=`ac2a3f99-a6db-4cae-955d-efdfb901abb7`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x1c2fb90>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'

Not sure if this is related at all though?
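
If it would help, I can check from node02 whether that volume path is actually visible, with something along these lines (if I have it right, the VG name should be the blockSD UUID from the error above):

  ls -l /rhev/data-center/mnt/blockSD/0e6991ae-6238-4c61-96d2-ca8fed35161e/images/fac8a3bb-e414-43c0-affc-6e2628757a28/
  lvs 0e6991ae-6238-4c61-96d2-ca8fed35161e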

Thanks.

Regards.

Neil Wilson.

On Wed, Jan 29, 2014 at 9:02 AM, Neil <nwilson...@gmail.com> wrote:
Hi Dafna,

Thanks for clarifying that. I found the migration issue, and it was
resolved once I sorted out the ISO domain problem.

I'm sorry, I don't understand your last question:
"> after the engine restart, do you still see a problem with the size,
or did the reported size change?"

The migration issue was resolved; I'm now just trying to track down
why the two VMs paused on their own, one on the 8th of Jan (I think)
and one on the 19th of Jan.

Thank you.


Regards.

Neil Wilson.


On Tue, Jan 28, 2014 at 8:18 PM, Dafna Ron <d...@redhat.com> wrote:
Yes - the engine lost communication with vdsm, and it has no way of
knowing whether the host is down or there was a network issue, so a
network issue would cause the same errors that I see in the logs.

The error you posted on the ISO domain is the reason the VMs have
failed migration - if a VM is run with a CD and the CD is gone, then
the VM will not be able to be migrated.

After the engine restart, do you still see a problem with the size,
or did the reported size change?

Dafna


On 01/28/2014 01:02 PM, Neil wrote:
Hi Dafna,

Thanks for coming back to me. I'll try answer your queries one by one.

On Tue, Jan 28, 2014 at 1:38 PM, Dafna Ron <d...@redhat.com> wrote:
You had a problem with your storage on the 14th of Jan and one of the
hosts rebooted (if you have the vdsm log from that day then I can see
what happened on the vdsm side).
In the engine log I could see a problem with the export domain, and
this should not have caused a reboot.
1.) Unfortunately I don't have logs going back that far. Looking at
all three hosts' uptime, the one with the least uptime is 21 days and
the others are all over 40 days, so there definitely wasn't a host
that rebooted on the 14th of Jan. Would a network or firewall issue
also cause the error you've seen to look as if a host rebooted? There
was a bonding mode change on the 14th of January, so perhaps this
caused the issue?


Can you tell me if you had a problem with the data domain as well, or
was it just the export domain? Were you having any VMs
exported/imported at that time?
In any case - this is a bug.
2.) I think this was the same day that the bonding mode was changed
(by mistake) on the host while it was live and running SPM. I haven't
done any importing or exporting for a few years on this oVirt setup.
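
For what it's worth, I can double-check the bond state on that host with something like the following (assuming the bond is bond0):

  cat /proc/net/bonding/bond0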


As for the VMs - if the VMs are no longer in a migrating state, then
please restart the ovirt-engine service (it looks like a cache issue).
3.) Restarted ovirt-engine; logging now appears to be normal, without
any errors.


If they are in a migrating state - there should have been a timeout a
long time ago.
Can you please run 'vdsClient -s 0 list table' and 'virsh -r list' on
all the hosts?
4.) Ran on all hosts...

node01.blabla.com
63da7faa-f92a-4652-90f2-b6660a4fb7b3  11232  adam                 Up
502170aa-0fc6-4287-bb08-5844be6e0352  13986  babbage              Up
ff9036fb-1499-45e4-8cde-e350eee3c489  26733  reports              Up
2736197b-6dc3-4155-9a29-9306ca64881d  13804  tux                  Up
0a3af7b2-ea94-42f3-baeb-78b950af4402  25257  Moodle               Up

    Id    Name                           State
----------------------------------------------------
    1     adam                           running
    2     reports                        running
    4     tux                            running
    6     Moodle                         running
    7     babbage                        running

node02.blabla.com
dfa2cf7c-3f0e-42e3-b495-10ccb3e0c71b   2879  spam                 Up
23b9212c-1e25-4003-aa18-b1e819bf6bb1  32454  proxy02              Up
ac2a3f99-a6db-4cae-955d-efdfb901abb7   5605  software             Up
179c293b-e6a3-4ec6-a54c-2f92f875bc5e   8870  zimbra               Up

    Id    Name                           State
----------------------------------------------------
    9     proxy02                        running
    10    spam                           running
    12    software                       running
    13    zimbra                         running

node03.blabla.com
e42b7ccc-ce04-4308-aeb2-2291399dd3ef  25809  dhcp                 Up
16d3f077-b74c-4055-97d0-423da78d8a0c  23939  oliver               Up

    Id    Name                           State
----------------------------------------------------
    13    oliver                         running
    14    dhcp                           running


Last thing is that your ISO domain seems to be having issues as well.
This should not affect the host status, but if any of the VMs were
booted from an ISO or have an ISO attached in the boot sequence, this
would explain the migration issue.
There was an ISO domain issue a while back, but this was corrected
about 2 weeks ago after iptables re-enabled itself on boot after
running updates. I've checked now and the ISO domain appears to be
fine, and I can see all the images stored within.
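
(For reference, I checked with something roughly like the following from one of the hosts; the server name and export path below are placeholders:)

  showmount -e iso-store.example.com
  ls /rhev/data-center/mnt/iso-store.example.com:_export_iso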

I've stumbled across what appears to be another error; all three
hosts are showing it over and over in /var/log/messages, and I'm not
sure if it's related...

Jan 28 14:58:59 node01 vdsm vm.Vm ERROR
vmId=`63da7faa-f92a-4652-90f2-b6660a4fb7b3`::Stats function failed:
<AdvancedStatsFunction _highWrite at 0x2ce0998>
Traceback (most recent call last):
  File "/usr/share/vdsm/sampling.py", line 351, in collect
    statsFunction()
  File "/usr/share/vdsm/sampling.py", line 226, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/vm.py", line 509, in _highWrite
    if not vmDrive.blockDev or vmDrive.format != 'cow':
AttributeError: 'Drive' object has no attribute 'format'

I've attached the full vdsm log from node02 to this reply.

Please shout if you need anything else.

Thank you.

Regards.

Neil Wilson.

On 01/28/2014 09:28 AM, Neil wrote:
Hi guys,

Sorry for the very late reply; I've been out of the office doing
installations.
Unfortunately, due to the time delay, my oldest logs only go as far
back as the attached.

I've only grep'd for Thread-286029 in the vdsm log. For the
engine.log I'm not sure what info is required, so the full log is
attached.

Please shout if you need any info or further details.

Thank you very much.

Regards.

Neil Wilson.


On Fri, Jan 24, 2014 at 10:55 AM, Meital Bourvine
<mbour...@redhat.com>
wrote:
Could you please attach the engine.log from the same time?

thanks!

----- Original Message -----
From: "Neil" <nwilson...@gmail.com>
To: d...@redhat.com
Cc: "users" <users@ovirt.org>
Sent: Wednesday, January 22, 2014 1:14:25 PM
Subject: Re: [Users] Vm's being paused

Hi Dafna,

Thanks.

The vdsm logs are quite large, so I've only attached the logs for
the
pause of the VM called Babbage on the 19th of Jan.

As for snapshots, Babbage has one from June 2013 and Reports has
two
from June and Oct 2013.

I'm using FC storage, with 11 VMs and 3 nodes/hosts; 9 of the 11 VMs
have thin-provisioned disks.

Please shout if you'd like any further info or logs.

Thank you.

Regards.

Neil Wilson.

On Wed, Jan 22, 2014 at 10:58 AM, Dafna Ron <d...@redhat.com>
wrote:
Hi Neil,

Can you please attach the vdsm logs?
Also, as for the VMs, do they have any snapshots?
From your suggestion to allocate more LUNs, are you using iSCSI or FC?

Thanks,

Dafna


On 01/22/2014 08:45 AM, Neil wrote:
Thanks for the replies guys,

Looking at my two VMs that have paused so far through the oVirt GUI,
the following sizes show under Disks.

VM Reports:
Virtual Size 35GB, Actual Size 41GB
Looking on the CentOS OS side, disk size is 33G and used is 12G with
19G available (40% usage).

VM Babbage:
Virtual Size is 40GB, Actual Size 53GB
On the Server 2003 OS side, disk size is 39.9GB and used is 16.3GB,
so under 50% usage.


Do you see any issues with the above stats?

Then my main Datacenter storage is as follows...

Size: 6887 GB
Available: 1948 GB
Used: 4939 GB
Allocated: 1196 GB
Over Allocation: 61%

Could there be a problem here? I can allocate additional LUNs if you
feel the space isn't correctly allocated.

Apologies for going on about this, but I'm really concerned that
something isn't right and I might have a serious problem if an
important machine locks up.

Thank you and much appreciated.

Regards.

Neil Wilson.

On Tue, Jan 21, 2014 at 7:02 PM, Dafna Ron <d...@redhat.com>
wrote:
The storage space threshold is configured in percentages and not in
physical size, so if 20G is less than 10% (the default config) of your
storage, it will pause the VMs regardless of how many GB you still
have. This is configurable, though, so you can change it to less than
10% if you like.
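
Just to illustrate the arithmetic (the numbers below are made up):

  TOTAL=7000; FREE=650    # GB, made-up example values
  if [ $(( FREE * 100 / TOTAL )) -lt 10 ]; then echo "below the default 10% threshold"; fi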

To answer the second question, VMs will not pause on an ENOSPC error
if they run out of space internally, but only if the external storage
cannot be consumed. So only if you run out of space on the storage,
and not if a VM runs out of space on its own filesystem.



On 01/21/2014 09:51 AM, Neil wrote:
Hi Dan,

Sorry, attached is the engine.log; I've taken out the two sections
where each of the VMs was paused.

Does the error "VM babbage has paused due to no Storage space error"
mean the main storage domain has run out of storage, or that the VM
has run out?

Both VMs appear to have been running on node01 when they were paused.
My vdsm versions are all...

vdsm-cli-4.13.0-11.el6.noarch
vdsm-python-cpopen-4.13.0-11.el6.x86_64
vdsm-xmlrpc-4.13.0-11.el6.noarch
vdsm-4.13.0-11.el6.x86_64
vdsm-python-4.13.0-11.el6.x86_64

I currently have a 61% over allocation ratio on my primary
storage
domain, with 1948GB available.

Thank you.

Regards.

Neil Wilson.


On Tue, Jan 21, 2014 at 11:24 AM, Neil <nwilson...@gmail.com>
wrote:
Hi Dan,

Sorry for only coming back to you now.
The VMs are thin provisioned. The Server 2003 VM hasn't run out of
disk space - there is about 20GB free, and the usage barely grows as
the VM only shares printers. The other VM that paused is also on
thin-provisioned disks and also has plenty of space; this guest is
running CentOS 6.3 64-bit and only runs basic reporting.

After the 2003 guest was rebooted, the network card showed up as
unplugged in oVirt, and we had to remove it and re-add it in order to
correct the issue. The CentOS VM did not have the same issue.

I'm concerned that this might happen to a VM that's quite critical.
Any thoughts or ideas?

The only recent changes have been updating from Dreyou 3.2 to the
official CentOS repo and updating to 3.3.1-2. Prior to updating I
hadn't had this issue.

Any assistance is greatly appreciated.

Thank you.

Regards.

Neil Wilson.

On Sun, Jan 19, 2014 at 8:20 PM, Dan Yasny <dya...@gmail.com>
wrote:
Do you have the VMs on thin-provisioned storage or sparse disks?

Pausing happens when the VM has an I/O error or runs out of space on
the storage domain, and it is done intentionally, so that the VM will
not experience disk corruption. If you have thin-provisioned disks,
and the VM writes to its disks faster than the disks can grow, this is
exactly what you will see.
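
If this is block storage, a quick way to compare the allocated size with the virtual size is to look at the logical volumes on a host, for example (the VG name is the storage domain UUID - a placeholder here):

  lvs -o lv_name,lv_size,lv_attr <storage-domain-uuid>

and then compare that with the virtual size shown in the UI.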


On Sun, Jan 19, 2014 at 10:04 AM, Neil <nwilson...@gmail.com>
wrote:
Hi guys,

I've had two different VMs randomly pause this past week, and inside
oVirt the error received is something like 'vm ran out of storage and
was paused'.
Resuming the VMs didn't work, and I had to force them off and then on,
which resolved the issue.

Has anyone had this issue before?

I realise this is very vague, so please let me know which logs to
send in.

Thank you

Regards.

Neil Wilson


_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


--
Dafna Ron