Re: [ovirt-users] VM get stuck randomly

2016-03-30 Thread Kevin Wolf
Hi Christophe,

Am 30.03.2016 um 13:45 hat Christophe TREFOIS geschrieben:
> Another host went down, so I have to prepare info for this one.
> 
> I could not SSH to it anymore.
> Console would show login screen, but no keystrokes were registered.
> 
> I could “suspend” the VM and “run” it, but still can’t SSH to it.
> Before suspension, all QEMU threads were around 0%, after resuming, 3 of them 
> hover at 100%.
> 
> Attached you could find the gdb, core dump, and other logs.
> 
> Logs: https://dl.dropboxusercontent.com/u/63261/ubuntu2-logs.tar.gz
> 
> Core Dump: https://dl.dropboxusercontent.com/u/63261/core-ubuntu2.tar.gz
> 
> Is there anything else we could provide?

This sounds much like it's not qemu that hangs (because then stopping
and resuming wouldn't work any more), but just the guest OS that is
running inside the VM.

We've had cases before where qemu was reported to hang with 100% CPU
usage and in the end it turned out that the guest kernel had panicked.
Can you check whether a guest kernel crash could be the cause? If this
is reproducible, maybe the easiest way would be to attach a serial
console to the VM and let the kernel print its messages there.
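
For reference, a minimal sketch of that setup on a libvirt host (the guest name "ubuntu2" and the serial settings are assumptions; adjust them to your VM):

```shell
# Find the host-side pty behind the guest's serial console (assumes the
# guest already has a <serial>/<console> device in its libvirt XML):
virsh ttyconsole ubuntu2

# Inside the guest, make the kernel log to the serial console as well:
# append "console=tty0 console=ttyS0,115200n8" to GRUB_CMDLINE_LINUX in
# /etc/default/grub, regenerate the grub config, and reboot the guest.

# Back on the host, capture everything the guest kernel prints, so a
# panic is preserved even if the guest locks up afterwards:
virsh console ubuntu2 2>&1 | tee ubuntu2-serial.log
```

With that in place, a guest kernel panic should end up in ubuntu2-serial.log even when SSH and the graphical console are already dead.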

Kevin
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM get stuck randomly

2016-03-30 Thread Christophe TREFOIS
Hi Kevin,

OK, thanks for the feedback.

Do you have experience with kernel panics, and more specifically with how to 
capture a meaningful message when one happens?

According to syslog on the guest (at least after the reboot), there is no 
indication that anything went wrong.

Since this seems to leave the qemu world, I'm now not sure where to ask for 
support on this :(

Kind regards,

--
Christophe 
Sent from my iPhone

> On 30 Mar 2016, at 14:14, Kevin Wolf  wrote:
> 
> Hi Christophe,
> 
> Am 30.03.2016 um 13:45 hat Christophe TREFOIS geschrieben:
>> Another host went down, so I have to prepare info for this one.
>> 
>> I could not SSH to it anymore.
>> Console would show login screen, but no keystrokes were registered.
>> 
>> I could “suspend” the VM and “run” it, but still can’t SSH to it.
>> Before suspension, all QEMU threads were around 0%, after resuming, 3 of 
>> them hover at 100%.
>> 
>> Attached you could find the gdb, core dump, and other logs.
>> 
>> Logs: https://dl.dropboxusercontent.com/u/63261/ubuntu2-logs.tar.gz
>> 
>> Core Dump: https://dl.dropboxusercontent.com/u/63261/core-ubuntu2.tar.gz
>> 
>> Is there anything else we could provide?
> 
> This sounds much like it's not qemu that hangs (because then stopping
> and resuming wouldn't work any more), but just the guest OS that is
> running inside the VM.
> 
> We've had cases before where qemu was reported to hang with 100% CPU
> usage and in the end it turned out that the guest kernel had panicked.
> Can you check whether a guest kernel crash could be the cause? If this
> is reproducible, maybe the easiest way would be to attach a serial
> console to the VM and let the kernel print its messages there.
> 
> Kevin




Re: [ovirt-users] VM get stuck randomly

2016-03-30 Thread Christophe TREFOIS
Hi Kevin,

Another host went down, so I have to prepare info for this one.

I could not SSH to it anymore.
Console would show login screen, but no keystrokes were registered.

I could “suspend” the VM and “run” it, but still can’t SSH to it.
Before suspension, all QEMU threads were around 0%, after resuming, 3 of them 
hover at 100%.

Attached you can find the gdb output, core dump, and other logs.

Logs: https://dl.dropboxusercontent.com/u/63261/ubuntu2-logs.tar.gz

Core Dump: https://dl.dropboxusercontent.com/u/63261/core-ubuntu2.tar.gz

Is there anything else we could provide?

Since this is a test machine, I will leave it “hanging” for now.

Best,

Dr Christophe Trefois, Dipl.-Ing.  
Technical Specialist / Post-Doc

UNIVERSITÉ DU LUXEMBOURG

LUXEMBOURG CENTRE FOR SYSTEMS BIOMEDICINE
Campus Belval | House of Biomedicine  
6, avenue du Swing 
L-4367 Belvaux  
T: +352 46 66 44 6124 
F: +352 46 66 44 6949  
http://www.uni.lu/lcsb




This message is confidential and may contain privileged information. 
It is intended for the named recipient only. 
If you receive it in error please notify me and permanently delete the original 
message and any copies. 


  

> On 29 Mar 2016, at 15:40, Kevin Wolf  wrote:
> 
> Am 27.03.2016 um 22:38 hat Christophe TREFOIS geschrieben:
>> Hi,
>> 
>> MS does not like my previous email, so here it is again with a link to 
>> Dropbox
>> instead of as attached.
>> 
>> ——
>> Hi Nir,
>> 
>> Inside the core dump tarball is also the output of the two gdb commands you
>> mentioned.
>> 
>> Understandably, you might not want to download the big files just for that, so I
>> attached them here separately.
> 
> The gdb dump looks pretty much like an idle qemu that just sits there
> and waits for events. The vcpu threads seem to be running guest code,
> the I/O thread and SPICE thread are in poll() waiting for events to
> respond to, and finally the RCU thread is idle as well.
> 
> Does the qemu process still respond to monitor commands, so for example
> can you still pause and resume the guest?
> 
> Kevin
> 
>> For the other logs, here you go.
>> 
>> For gluster I didn’t know which, so I sent all.
>> 
>> I got the icinga notification at 17:06 CEST on March 27th (today). So for 
>> vdsm,
>> I provided logs from 16h-18h.
>> The check said that the VM was down for 11 minutes at that time.
>> 
>> https://dl.dropboxusercontent.com/u/63261/bioservice-1.tar.gz
>> 
>> Please do let me know if there is anything else I can provide.
>> 
>> Best regards,
>> 
>> 
>>> On 27 Mar 2016, at 21:24, Nir Soffer  wrote:
>>> 
>>> On Sun, Mar 27, 2016 at 8:39 PM, Christophe TREFOIS
>>>  wrote:
 Hi Nir,
 
 Here is another one, this time with strace of children and gdb dump.
 
 Interestingly, this time, the qemu seems stuck 0%, vs 100% for other cases.
 
 The files for strace are attached.
>>> 
>>> Hopefully Kevin can take a look.
>>> 
>>> 
 The gdb + core dump is found here (too
 big):
 
 https://dl.dropboxusercontent.com/u/63261/gdb-core.tar.gz
>>> 
>>> I think it will be more useful to extract a traceback of all threads
>>> and send the tiny traceback.
>>> 
>>> gdb --pid  --batch --eval-command='thread apply all bt'
>>> 
 If it helps, most machines get stuck on the host hosting the self-hosted
 engine, which runs a local 1-node glusterfs.
>>> 
>>> And getting also /var/log/messages, sanlock, vdsm, glusterfs and
>>> libvirt logs for this timeframe
>>> would be helpful.
>>> 
>>> Nir
>>> 
 
 Thank you for your help,
 
 —
 Christophe
 
 
> On 25 Mar 2016, at 11:53, Nir Soffer  wrote:
> 
> gdb --pid  --batch --eval-command='thread apply all bt'



Re: [ovirt-users] VM get stuck randomly

2016-03-30 Thread Kevin Wolf
Am 27.03.2016 um 22:38 hat Christophe TREFOIS geschrieben:
> Hi,
> 
> MS does not like my previous email, so here it is again with a link to Dropbox
> instead of as attached.
> 
> ——
> Hi Nir,
> 
> Inside the core dump tarball is also the output of the two gdb commands you
> mentioned.
> 
> Understandably, you might not want to download the big files just for that, so I
> attached them here separately.

The gdb dump looks pretty much like an idle qemu that just sits there
and waits for events. The vcpu threads seem to be running guest code,
the I/O thread and SPICE thread are in poll() waiting for events to
respond to, and finally the RCU thread is idle as well.

Does the qemu process still respond to monitor commands, so for example
can you still pause and resume the guest?

Kevin

> For the other logs, here you go.
> 
> For gluster I didn’t know which, so I sent all.
> 
> I got the icinga notification at 17:06 CEST on March 27th (today). So for vdsm,
> I provided logs from 16h-18h.
> The check said that the VM was down for 11 minutes at that time.
> 
> https://dl.dropboxusercontent.com/u/63261/bioservice-1.tar.gz
> 
> Please do let me know if there is anything else I can provide.
> 
> Best regards,
>  
> 
> > On 27 Mar 2016, at 21:24, Nir Soffer  wrote:
> >
> > On Sun, Mar 27, 2016 at 8:39 PM, Christophe TREFOIS
> >  wrote:
> >> Hi Nir,
> >>
> >> Here is another one, this time with strace of children and gdb dump.
> >>
> >> Interestingly, this time, the qemu seems stuck 0%, vs 100% for other cases.
> >>
> >> The files for strace are attached.
> >
> > Hopefully Kevin can take a look.
> >
> >
> >> The gdb + core dump is found here (too
> >> big):
> >>
> >> https://dl.dropboxusercontent.com/u/63261/gdb-core.tar.gz
> >
> > I think it will be more useful to extract a traceback of all threads
> > and send the tiny traceback.
> >
> > gdb --pid  --batch --eval-command='thread apply all bt'
> >
> >> If it helps, most machines get stuck on the host hosting the self-hosted
> >> engine, which runs a local 1-node glusterfs.
> >
> > And getting also /var/log/messages, sanlock, vdsm, glusterfs and
> > libvirt logs for this timeframe
> > would be helpful.
> >
> > Nir
> >
> >>
> >> Thank you for your help,
> >>
> >> —
> >> Christophe
> >>
> >>> On 25 Mar 2016, at 11:53, Nir Soffer  wrote:
> >>>
> >>> gdb --pid  --batch --eval-command='thread apply all bt'
> >>
> 




Re: [ovirt-users] VM get stuck randomly

2016-03-27 Thread Christophe TREFOIS
FILE QUARANTINED

Microsoft Forefront Protection for Exchange Server removed a file since it was 
found to be infected.
File name: "winmail.dat"
Malware name: "ExceedinglyNested"


Re: [ovirt-users] VM get stuck randomly

2016-03-27 Thread Christophe TREFOIS
Hi Nir,

Here is another one, this time with strace of children and gdb dump.

Interestingly, this time qemu seems stuck at 0%, vs. 100% in the other cases.

The files for strace are attached. The gdb + core dump can be found here (too big to attach):

https://dl.dropboxusercontent.com/u/63261/gdb-core.tar.gz




If it helps, most machines get stuck on the host hosting the self-hosted 
engine, which runs a local 1-node glusterfs.

Thank you for your help,

—
Christophe


> On 25 Mar 2016, at 11:53, Nir Soffer  wrote:
>
> gdb --pid  --batch --eval-command='thread apply all bt'





Re: [ovirt-users] VM get stuck randomly

2016-03-25 Thread Nir Soffer
On Thu, Mar 24, 2016 at 10:43 PM, Christophe TREFOIS
<christophe.tref...@uni.lu> wrote:
> Hi Nir,
>
> I restarted the VM now so I can't provide more info until the next time.
>
> I could try strace -p  -f &> strace.log next time it hangs.
>
> Could you just point me on how to obtain a dump with gdb?

I think you should install the debug info package for qemu, something like:

debuginfo-install qemu-kvm-ev

Then you can extract a backtrace of all threads like this:

gdb --pid  --batch --eval-command='thread apply all bt'

Sometimes "bt full" returns more useful info:

gdb --pid  --batch --eval-command='thread apply all bt full'

To generate a core dump you can do:

gcore -o filename 

This is a generic way that works with anything; there may be a better
qemu-specific way.
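
Putting those commands together, a debugging session might look roughly like this (the pgrep pattern and file names are examples only; match them to your VM):

```shell
# Locate the qemu process for the stuck VM (pattern is an assumption):
pid=$(pgrep -f 'qemu-kvm .*-name test-ubuntu-uni-lu' | head -n 1)

# Short backtrace of every thread; readable symbols need the debuginfo
# package installed first:
gdb --pid "$pid" --batch --eval-command='thread apply all bt' > bt.txt

# "bt full" additionally prints the local variables in every frame:
gdb --pid "$pid" --batch --eval-command='thread apply all bt full' > bt-full.txt

# Core dump for offline analysis; the process is briefly stopped while
# gcore writes the file:
gcore -o qemu-core "$pid"
```

The small bt.txt is usually enough to attach to a mail or bug report; the core dump only matters if a developer asks for it.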

Nir

> Do I have to do anything special in order to catch the required contents?
>
> For the idle vs stuck in a loop, I guess the VM has 4 child qemu threads, 
> and one of them was at 100%.
>
> Thank you for your help,
>
> --
> Christophe
>
>> -Original Message-
>> From: Nir Soffer [mailto:nsof...@redhat.com]
>> Sent: jeudi 24 mars 2016 20:17
>> To: Christophe TREFOIS <christophe.tref...@uni.lu>; Kevin Wolf
>> <kw...@redhat.com>; Francesco Romani <from...@redhat.com>
>> Cc: users <users@ovirt.org>; lcsb-sysadmins <lcsb-sysadm...@uni.lu>
>> Subject: Re: [ovirt-users] VM get stuck randomly
>>
>> On Thu, Mar 24, 2016 at 7:51 PM, Christophe TREFOIS
>> <christophe.tref...@uni.lu> wrote:
>> > Hi Nir,
>> >
>> > And the second one is down now too. see some comments below.
>> >
>> >> On 13 Mar 2016, at 12:51, Nir Soffer <nsof...@redhat.com> wrote:
>> >>
>> >> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
>> >> <christophe.tref...@uni.lu> wrote:
>> >>> Dear all,
>> >>>
>> >>> I have had a problem for a couple of weeks where, randomly, 1 VM (not
>> always the same) becomes completely unresponsive.
>> >>> We find this out because our Icinga server complains that host is down.
>> >>>
>> >>> Upon inspection, we find we can’t open a console to the VM, nor can we
>> login.
>> >>>
>> >>> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
>> usage shows 0% and CPU usage shows 100% or 75% depending on number of
>> cores.
>> >>> The only way to recover is to force shutdown the VM via 2-times
>> shutdown from the engine.
>> >>>
>> >>> Could you please help me to start debugging this?
>> >>> I can provide any logs, but I’m not sure which ones, because I couldn’t
>> see anything with ERROR in the vdsm logs on the host.
>> >>
>> >> I would inspect this vm on the host when it happens.
>> >>
>> >> What is vdsm cpu usage? what is the qemu process (for this vm) cpu
>> usage?
>> >
>> > vdsm cpu usage is going up and down to 15%.
>> >
>> > qemu process usage for the VM was 0, except for 1 of the threads “stuck”
>> at 100%, rest was idle.
>>
>> 0% may be a deadlock, 100% a thread stuck in endless loop, but this is just a
>> wild guess.
>>
>> >
>> >>
>> >> strace output of this qemu process (all threads) or a core dump can
>> >> help qemu developers to understand this issue.
>> >
>> > I attached an strace on the process for:
>> >
>> > qemu 15241 10.6  0.4 4742904 1934988 ? Sl   Mar23 131:41
>> /usr/libexec/qemu-kvm -name test-ubuntu-uni-lu -S -machine pc-i440fx-
>> rhel7.2.0,accel=kvm,usb=off -cpu SandyBridge -m
>> size=4194304k,slots=16,maxmem=4294967296k -realtime mlock=off -smp
>> 4,maxcpus=64,sockets=16,cores=4,threads=1 -numa node,nodeid=0,cpus=0-
>> 3,mem=4096 -uuid 754871ec-0339-4a65-b490-6a766aaea537 -smbios
>> type=1,manufacturer=oVirt,product=oVirt Node,version=7-
>> 2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-
>> B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537 -no-user-config
>> -nodefaults -chardev
>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-
>> lu/monitor.sock,server,nowait -mon
>> chardev=charmonitor,id=monitor,mode=control -rtc base=2016-03-
>> 23T22:06:01,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -
>> no-shutdown -boot strict=on -device piix3-usb-
>> uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-
>> pci,id=scsi0,bus=pci.0,addr=0x4 -device virti

Re: [ovirt-users] VM get stuck randomly

2016-03-24 Thread Christophe TREFOIS
Hi Nir,

I restarted the VM now so I can't provide more info until the next time.

I could try strace -p  -f &> strace.log next time it hangs.

Could you just point me on how to obtain a dump with gdb?
Do I have to do anything special in order to catch the required contents?

For the idle vs stuck in a loop, I guess the VM has 4 child qemu threads, 
and one of them was at 100%.
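
A quick way to see which of those threads is the busy one is to read the per-thread CPU tick counters under /proc; a minimal sketch (Linux-only, field positions per proc(5)):

```shell
# Print "<tid> <cpu-ticks>" for every thread of the given process.
# utime and stime are fields 14 and 15 of /proc/<pid>/task/<tid>/stat;
# the comm field (2) may contain spaces, so strip up to its closing ")".
busy_threads() {
  local pid=$1 task stat rest
  for task in /proc/"$pid"/task/*; do
    stat=$(cat "$task/stat") || continue
    rest=${stat##*) }          # drop "pid (comm) ", leaving state onward
    set -- $rest               # state is now $1, utime $12, stime $13
    echo "${task##*/} $(( ${12} + ${13} ))"
  done
}

# Sample twice, a few seconds apart; the thread whose tick count grows
# fastest is the one pegged at 100%.
busy_threads "$$"
```

The tid printed here matches the LWP column of "ps -Lp <pid>" and the thread ids shown in a gdb backtrace, so the spinning thread can be matched to its stack.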

Thank you for your help,

--
Christophe

> -Original Message-
> From: Nir Soffer [mailto:nsof...@redhat.com]
> Sent: jeudi 24 mars 2016 20:17
> To: Christophe TREFOIS <christophe.tref...@uni.lu>; Kevin Wolf
> <kw...@redhat.com>; Francesco Romani <from...@redhat.com>
> Cc: users <users@ovirt.org>; lcsb-sysadmins <lcsb-sysadm...@uni.lu>
> Subject: Re: [ovirt-users] VM get stuck randomly
> 
> On Thu, Mar 24, 2016 at 7:51 PM, Christophe TREFOIS
> <christophe.tref...@uni.lu> wrote:
> > Hi Nir,
> >
> > And the second one is down now too. see some comments below.
> >
> >> On 13 Mar 2016, at 12:51, Nir Soffer <nsof...@redhat.com> wrote:
> >>
> >> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
> >> <christophe.tref...@uni.lu> wrote:
> >>> Dear all,
> >>>
> >>> I have had a problem for a couple of weeks where, randomly, 1 VM (not
> always the same) becomes completely unresponsive.
> >>> We find this out because our Icinga server complains that host is down.
> >>>
> >>> Upon inspection, we find we can’t open a console to the VM, nor can we
> login.
> >>>
> >>> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
> usage shows 0% and CPU usage shows 100% or 75% depending on number of
> cores.
> >>> The only way to recover is to force shutdown the VM via 2-times
> shutdown from the engine.
> >>>
> >>> Could you please help me to start debugging this?
> >>> I can provide any logs, but I’m not sure which ones, because I couldn’t
> see anything with ERROR in the vdsm logs on the host.
> >>
> >> I would inspect this vm on the host when it happens.
> >>
> >> What is vdsm cpu usage? what is the qemu process (for this vm) cpu
> usage?
> >
> > vdsm cpu usage is going up and down to 15%.
> >
> > qemu process usage for the VM was 0, except for 1 of the threads “stuck”
> at 100%, rest was idle.
> 
> 0% may be a deadlock, 100% a thread stuck in endless loop, but this is just a
> wild guess.
> 
> >
> >>
> >> strace output of this qemu process (all threads) or a core dump can
> >> help qemu developers to understand this issue.
> >
> > I attached an strace on the process for:
> >
> > qemu 15241 10.6  0.4 4742904 1934988 ? Sl   Mar23 131:41
> /usr/libexec/qemu-kvm -name test-ubuntu-uni-lu -S -machine pc-i440fx-
> rhel7.2.0,accel=kvm,usb=off -cpu SandyBridge -m
> size=4194304k,slots=16,maxmem=4294967296k -realtime mlock=off -smp
> 4,maxcpus=64,sockets=16,cores=4,threads=1 -numa node,nodeid=0,cpus=0-
> 3,mem=4096 -uuid 754871ec-0339-4a65-b490-6a766aaea537 -smbios
> type=1,manufacturer=oVirt,product=oVirt Node,version=7-
> 2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-
> B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537 -no-user-config
> -nodefaults -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-
> lu/monitor.sock,server,nowait -mon
> chardev=charmonitor,id=monitor,mode=control -rtc base=2016-03-
> 23T22:06:01,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -
> no-shutdown -boot strict=on -device piix3-usb-
> uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-
> pci,id=scsi0,bus=pci.0,addr=0x4 -device virtio-serial-pci,id=virtio-
> serial0,max_ports=16,bus=pci.0,addr=0x5 -drive if=none,id=drive-ide0-1-
> 0,readonly=on,format=raw,serial= -device ide-
> cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive file=/rhev/data-
> center/0002-0002-0002-0002-03d5/8253a89b-651e-4ff4-865b-
> 57adef05d383/images/9d60ae41-bf17-48b4-b0e6-29625b248718/47a6916c-
> c902-4ea3-8dfb-a3240d7d9515,if=none,id=drive-virtio-
> disk0,format=qcow2,serial=9d60ae41-bf17-48b4-b0e6-
> 29625b248718,cache=none,werror=stop,rerror=stop,aio=threads -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-
> disk0,bootindex=1 -netdev tap,fd=108,id=hostnet0,vhost=on,vhostfd=109 -
> device virtio-net-
> pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e5:12:0f,bus=pci.0,addr=0x3,boo
> tindex=2 -chardev
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/754871ec-
> 0339-4a65-b490-6a766aaea537.com.redhat.rhevm.vdsm,se

Re: [ovirt-users] VM get stuck randomly

2016-03-24 Thread Nir Soffer
On Thu, Mar 24, 2016 at 7:51 PM, Christophe TREFOIS
 wrote:
> Hi Nir,
>
> And the second one is down now too. see some comments below.
>
>> On 13 Mar 2016, at 12:51, Nir Soffer  wrote:
>>
>> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
>>  wrote:
>>> Dear all,
>>>
>>> I have had a problem for a couple of weeks where, randomly, 1 VM (not always the 
>>> same) becomes completely unresponsive.
>>> We find this out because our Icinga server complains that host is down.
>>>
>>> Upon inspection, we find we can’t open a console to the VM, nor can we 
>>> login.
>>>
>>> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM 
>>> usage shows 0% and CPU usage shows 100% or 75% depending on number of cores.
>>> The only way to recover is to force the VM off by issuing shutdown twice 
>>> from the engine.
>>>
>>> Could you please help me to start debugging this?
>>> I can provide any logs, but I’m not sure which ones, because I couldn’t see 
>>> anything with ERROR in the vdsm logs on the host.
>>
>> I would inspect this vm on the host when it happens.
>>
>> What is vdsm cpu usage? what is the qemu process (for this vm) cpu usage?
>
> vdsm cpu usage is going up and down to 15%.
>
> qemu process usage for the VM was 0, except for 1 of the threads “stuck” at 
> 100%, rest was idle.

0% may be a deadlock, 100% a thread stuck in an endless loop, but this is
just a wild guess.

>
>>
>> strace output of this qemu process (all threads) or a core dump can help qemu
>> developers to understand this issue.
>
> I attached an strace on the process for:
>
> qemu 15241 10.6  0.4 4742904 1934988 ? Sl   Mar23 131:41 
> /usr/libexec/qemu-kvm -name test-ubuntu-uni-lu -S -machine 
> pc-i440fx-rhel7.2.0,accel=kvm,usb=off -cpu SandyBridge -m 
> size=4194304k,slots=16,maxmem=4294967296k -realtime mlock=off -smp 
> 4,maxcpus=64,sockets=16,cores=4,threads=1 -numa 
> node,nodeid=0,cpus=0-3,mem=4096 -uuid 754871ec-0339-4a65-b490-6a766aaea537 
> -smbios type=1,manufacturer=oVirt,product=oVirt 
> Node,version=7-2.1511.el7.centos.2.10,serial=4C4C4544-0048-4610-8052-B4C04F575831,uuid=754871ec-0339-4a65-b490-6a766aaea537
>  -no-user-config -nodefaults -chardev 
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-test-ubuntu-uni-lu/monitor.sock,server,nowait
>  -mon chardev=charmonitor,id=monitor,mode=control -rtc 
> base=2016-03-23T22:06:01,driftfix=slew -global 
> kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on 
> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device 
> virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -device 
> virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x5 -drive 
> if=none,id=drive-ide0-1-0,readonly=on,format=raw,serial= -device 
> ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -drive 
> file=/rhev/data-center/0002-0002-0002-0002-03d5/8253a89b-651e-4ff4-865b-57adef05d383/images/9d60ae41-bf17-48b4-b0e6-29625b248718/47a6916c-c902-4ea3-8dfb-a3240d7d9515,if=none,id=drive-virtio-disk0,format=qcow2,serial=9d60ae41-bf17-48b4-b0e6-29625b248718,cache=none,werror=stop,rerror=stop,aio=threads
>  -device 
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>  -netdev tap,fd=108,id=hostnet0,vhost=on,vhostfd=109 -device 
> virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:e5:12:0f,bus=pci.0,addr=0x3,bootindex=2
>  -chardev 
> socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.com.redhat.rhevm.vdsm,server,nowait
>  -device 
> virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm
>  -chardev 
> socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/754871ec-0339-4a65-b490-6a766aaea537.org.qemu.guest_agent.0,server,nowait
>  -device 
> virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0
>  -device usb-tablet,id=input0 -vnc 10.79.2.2:76,password -device 
> cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> http://paste.fedoraproject.org/344756/84131214

You connected only to one thread. I would try to use -f to see all threads,
or connect with gdb and get a backtrace of all threads.

Adding Kevin to suggest how to continue.

I think we need a qemu bug for this.

Nir

>
> This is CentOS 7.2, latest patches and latest 3.6.4 oVirt.
>
> Thank you for any help / pointers.
>
> Could it be memory ballooning?
>
> Best,
>
>>
>>>
>>> The host is running
>>>
>>> OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
>>> Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
>>> KVM Version:2.1.2 - 23.el7_1.8.1
>>> LIBVIRT Version:libvirt-1.2.8-16.el7_1.4
>>> VDSM Version:   vdsm-4.16.26-0.el7.centos
>>> SPICE Version:  0.12.4 - 9.el7_1.3
>>> GlusterFS Version:  glusterfs-3.7.5-1.el7
>>
>> You are running old versions, missing a lot of fixes. 

Re: [ovirt-users] VM get stuck randomly

2016-03-24 Thread Christophe TREFOIS
Dear list,

An Ubuntu 14.04 VM got stuck again on the latest 3.6.4 with all patches applied.

Do you have any advice for me now, to try and figure out what could be wrong?

Does anybody else face issues with Ubuntu 14.04 and kernel 3.13.0-79-generic ?

Thank you,

—
Christophe


> On 24 Mar 2016, at 10:45, Christophe TREFOIS <christophe.tref...@uni.lu> 
> wrote:
> 
> Hi,
> 
> We finally upgraded to 3.6.3 across the whole data center and will now see if 
> this issue reappears.
> 
> The upgrade went quite smooth, first from 3.5.4 to 3.5.6 and then to 3.6.3.
> 
> Thank you,
> 
> --
> Christophe
> 
>> -Original Message-
>> From: Nir Soffer [mailto:nsof...@redhat.com]
>> Sent: dimanche 13 mars 2016 12:51
>> To: Christophe TREFOIS <christophe.tref...@uni.lu>
>> Cc: users <users@ovirt.org>
>> Subject: Re: [ovirt-users] VM get stuck randomly
>> 
>> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
>> <christophe.tref...@uni.lu> wrote:
>>> Dear all,
>>> 
>>> I have had a problem for a couple of weeks where, randomly, 1 VM (not always
>> the same) becomes completely unresponsive.
>>> We find this out because our Icinga server complains that host is down.
>>> 
>>> Upon inspection, we find we can’t open a console to the VM, nor can we
>> login.
>>> 
>>> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
>> usage shows 0% and CPU usage shows 100% or 75% depending on number of
>> cores.
>>> The only way to recover is to force shutdown the VM via 2-times shutdown
>> from the engine.
>>> 
>>> Could you please help me to start debugging this?
>>> I can provide any logs, but I’m not sure which ones, because I couldn’t see
>> anything with ERROR in the vdsm logs on the host.
>> 
>> I would inspect this vm on the host when it happens.
>> 
>> What is vdsm cpu usage? what is the qemu process (for this vm) cpu usage?
>> 
>> strace output of this qemu process (all threads) or a core dump can help
>> qemu developers to understand this issue.
>> 
>>> 
>>> The host is running
>>> 
>>> OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
>>> Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
>>> KVM Version:2.1.2 - 23.el7_1.8.1
>>> LIBVIRT Version:libvirt-1.2.8-16.el7_1.4
>>> VDSM Version:   vdsm-4.16.26-0.el7.centos
>>> SPICE Version:  0.12.4 - 9.el7_1.3
>>> GlusterFS Version:  glusterfs-3.7.5-1.el7
>> 
>> You are running old versions, missing a lot of fixes. Nothing specific to your
>> problem, but this lowers the chance of getting a working system.
>> 
>> It would be nice if you can upgrade to ovirt-3.6 and report if it made any
>> change.
>> Or at least the latest ovirt-3.5.
>> 
>>> We use a locally exported gluster as storage domain (eg, storage is on the
>> same machine exposed via gluster). No replica.
>>> We run around 50 VMs on that host.
>> 
>> Why use gluster for this? Do you plan to add more gluster servers in the
>> future?
>> 
>> Nir



Re: [ovirt-users] VM get stuck randomly

2016-03-24 Thread Christophe TREFOIS
Hi,

We finally upgraded to 3.6.3 across the whole data center and will now see if 
this issue reappears.

The upgrade went quite smooth, first from 3.5.4 to 3.5.6 and then to 3.6.3.

Thank you,

--
Christophe

> -Original Message-
> From: Nir Soffer [mailto:nsof...@redhat.com]
> Sent: dimanche 13 mars 2016 12:51
> To: Christophe TREFOIS <christophe.tref...@uni.lu>
> Cc: users <users@ovirt.org>
> Subject: Re: [ovirt-users] VM get stuck randomly
> 
> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
> <christophe.tref...@uni.lu> wrote:
> > Dear all,
> >
> > I have had a problem for a couple of weeks where, randomly, 1 VM (not always
> the same) becomes completely unresponsive.
> > We find this out because our Icinga server complains that host is down.
> >
> > Upon inspection, we find we can’t open a console to the VM, nor can we
> login.
> >
> > In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
> usage shows 0% and CPU usage shows 100% or 75% depending on number of
> cores.
> > The only way to recover is to force shutdown the VM via 2-times shutdown
> from the engine.
> >
> > Could you please help me to start debugging this?
> > I can provide any logs, but I’m not sure which ones, because I couldn’t see
> anything with ERROR in the vdsm logs on the host.
> 
> I would inspect this vm on the host when it happens.
> 
> What is vdsm cpu usage? what is the qemu process (for this vm) cpu usage?
> 
> strace output of this qemu process (all threads) or a core dump can help
> qemu developers to understand this issue.
> 
> >
> > The host is running
> >
> > OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
> > Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
> > KVM Version:2.1.2 - 23.el7_1.8.1
> > LIBVIRT Version:libvirt-1.2.8-16.el7_1.4
> > VDSM Version:   vdsm-4.16.26-0.el7.centos
> > SPICE Version:  0.12.4 - 9.el7_1.3
> > GlusterFS Version:  glusterfs-3.7.5-1.el7
> 
> You are running old versions, missing a lot of fixes. Nothing specific to your
> problem, but this lowers the chance of getting a working system.
> 
> It would be nice if you can upgrade to ovirt-3.6 and report if it made any
> change.
> Or at least the latest ovirt-3.5.
> 
> > We use a locally exported gluster as storage domain (eg, storage is on the
> same machine exposed via gluster). No replica.
> > We run around 50 VMs on that host.
> 
> Why use gluster for this? Do you plan to add more gluster servers in the
> future?
> 
> Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] VM get stuck randomly

2016-03-14 Thread Pavel Gashev
Hello,

I saw the same issue at least once. There were the following lines in
/var/log/libvirt/qemu/VMNAME.log at that moment:

main_channel_link: add main channel client
main_channel_handle_parsed: net test: latency 539.767000 ms, bitrate 7289423 
bps (6.951735 Mbps) LOW BANDWIDTH
red_dispatcher_set_cursor_peer:
inputs_connect: inputs channel client create
red_channel_client_disconnect: rcc=0x7fd368324000 (channel=0x7fd366428000 
type=1 id=0)
main_channel_client_on_disconnect: rcc=0x7fd368324000
red_client_destroy: destroy client 0x7fd366332200 with #channels=4
red_channel_client_disconnect: rcc=0x7fd3683aa000 (channel=0x7fd36643 
type=3 id=0)
red_dispatcher_disconnect_display_peer:
red_channel_client_disconnect: rcc=0x7fd3681e6000 (channel=0x7fd366fea600 
type=2 id=0)
red_channel_client_disconnect: rcc=0x7fd36758a000 (channel=0x7fd3663eab00 
type=4 id=0)
red_dispatcher_disconnect_cursor_peer:

Host software:

OS Version: RHEL - 7 - 2.1511.el7.centos.2.10
Kernel Version: 3.10.0 - 327.10.1.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.7.1
LIBVIRT Version: libvirt-1.2.17-13.el7_2.3
VDSM Version: vdsm-4.17.23-0.el7.centos
SPICE Version: 0.12.4 - 15.el7

The VM is a quite old FC9, so there are no oVirt/QEMU guest agents installed inside.

And I have no Gluster there.

On Sun, 2016-03-13 at 07:46 +, Christophe TREFOIS wrote:

Dear all,

I have had a problem for a couple of weeks now, where randomly one VM (not always
the same) becomes completely unresponsive.
We find this out because our Icinga server complains that the host is down.

Upon inspection, we find we can’t open a console to the VM, nor can we log in.

In oVirt engine, the VM looks like “up”. The only weird thing is that RAM usage
shows 0% and CPU usage shows 100% or 75%, depending on the number of cores.
The only way to recover is to force-shutdown the VM by issuing shutdown twice
from the engine.

Could you please help me to start debugging this?
I can provide any logs, but I’m not sure which ones, because I couldn’t see 
anything with ERROR in the vdsm logs on the host.

The host is running

OS Version:         RHEL - 7 - 1.1503.el7.centos.2.8
Kernel Version:     3.10.0 - 229.14.1.el7.x86_64
KVM Version:        2.1.2 - 23.el7_1.8.1
LIBVIRT Version:    libvirt-1.2.8-16.el7_1.4
VDSM Version:       vdsm-4.16.26-0.el7.centos
SPICE Version:      0.12.4 - 9.el7_1.3
GlusterFS Version:  glusterfs-3.7.5-1.el7

We use a locally exported Gluster volume as the storage domain (i.e., the storage
is on the same machine, exposed via Gluster). No replica.
We run around 50 VMs on that host.

Thank you for your help in this,

—
Christophe




Re: [ovirt-users] VM get stuck randomly

2016-03-13 Thread Yaniv Kaul
On Sun, Mar 13, 2016 at 1:14 PM, Christophe TREFOIS <
christophe.tref...@uni.lu> wrote:

> Hi Yaniv,
>
>
>
> See my answers / questions below under [CT].
>
>
>
> *From:* Yaniv Kaul [mailto:yk...@redhat.com]
> *Sent:* dimanche 13 mars 2016 12:08
> *To:* Christophe TREFOIS <christophe.tref...@uni.lu>
> *Cc:* users <users@ovirt.org>
> *Subject:* Re: [ovirt-users] VM get stuck randomly
>
>
>
>
>
>
>
> On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS <
> christophe.tref...@uni.lu> wrote:
>
> Dear all,
>
> > I have had a problem for a couple of weeks now, where randomly one VM (not
> > always the same) becomes completely unresponsive.
> > We find this out because our Icinga server complains that the host is down.
> >
> > Upon inspection, we find we can’t open a console to the VM, nor can we log
> > in.
>
>
>
> I assume 3.6's console feature, or is it Spice/VNC?
>
> *[CT] *
>
>
>
> This is 3.5, VNC/Spice yes. Sometimes we can connect, but there’s no way
> to do anything, eg type or so on.
>
>
>
>
> > In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
> > usage shows 0% and CPU usage shows 100% or 75%, depending on the number of cores.
>
>
>
> Any chance there's really something bad going on within the VM? Anything
> in its journal or /var/log/messages or ... depending on the OS?
>
> Y.
>
> *[CT] *
>
> *It is possible. It seems to be mostly VMs with Ubuntu 14.04 and the latest
> kernels. I read somewhere, though I can’t find it now, that there’s perhaps a
> bug in the 3.x kernel with regards to libvirt / vdsm. But my knowledge is too
> limited to even know where to begin the investigation* ☺
>
>
>
> *On the VM logs, we just see normal VM stuff, then nothing, and then when
> the VM was rebooted, there’s a couple of lines of ^@^@^@ characters
> repeating. But nothing else really.*
>
> *Initially we thought it was a bug with aufs on Docker, but the machines
> getting stuck now don’t run Docker either.*
>
>
>
> *From your answer, I deduce that if vdsm, libvirt, or the SPM saw a problem
> with storage / memory / CPU, it would suspend the VM and report that info to
> ovirt-engine? *
>
> *Since this is not happening, you think it could be related to the inside
> of the VM rather than the oVirt environment, correct?*
>

Either that, or to libvirt/QEMU.
I suggest, if possible, to upgrade the components first to newer versions
(as Nir suggested).
Y.


>
> *Thank you for your help (especially on a Sunday)* ☺
>
>
>
>
>
> The only way to recover is to force-shutdown the VM by issuing shutdown twice
> from the engine.
>
> Could you please help me to start debugging this?
> I can provide any logs, but I’m not sure which ones, because I couldn’t
> see anything with ERROR in the vdsm logs on the host.
>
> The host is running
>
> OS Version:         RHEL - 7 - 1.1503.el7.centos.2.8
> Kernel Version:     3.10.0 - 229.14.1.el7.x86_64
> KVM Version:        2.1.2 - 23.el7_1.8.1
> LIBVIRT Version:    libvirt-1.2.8-16.el7_1.4
> VDSM Version:       vdsm-4.16.26-0.el7.centos
> SPICE Version:      0.12.4 - 9.el7_1.3
> GlusterFS Version:  glusterfs-3.7.5-1.el7
>
> We use a locally exported Gluster volume as the storage domain (i.e., the
> storage is on the same machine, exposed via Gluster). No replica.
> We run around 50 VMs on that host.
>
> Thank you for your help in this,
>
> —
> Christophe
>
>


Re: [ovirt-users] VM get stuck randomly

2016-03-13 Thread Christophe TREFOIS
Hi Yaniv,

See my answers / questions below under [CT].

From: Yaniv Kaul [mailto:yk...@redhat.com]
Sent: dimanche 13 mars 2016 12:08
To: Christophe TREFOIS <christophe.tref...@uni.lu>
Cc: users <users@ovirt.org>
Subject: Re: [ovirt-users] VM get stuck randomly



On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS 
<christophe.tref...@uni.lu<mailto:christophe.tref...@uni.lu>> wrote:
Dear all,

I have had a problem for a couple of weeks now, where randomly one VM (not always
the same) becomes completely unresponsive.
We find this out because our Icinga server complains that the host is down.

Upon inspection, we find we can’t open a console to the VM, nor can we log in.

I assume 3.6's console feature, or is it Spice/VNC?
[CT]

This is 3.5, VNC/Spice yes. Sometimes we can connect, but there’s no way to do
anything, e.g. typing.


In oVirt engine, the VM looks like “up”. The only weird thing is that RAM usage
shows 0% and CPU usage shows 100% or 75%, depending on the number of cores.

Any chance there's really something bad going on within the VM? Anything in its 
journal or /var/log/messages or ... depending on the OS?
Y.
[CT]
It is possible. It seems to be mostly VMs with Ubuntu 14.04 and the latest
kernels. I read somewhere, though I can’t find it now, that there’s perhaps a bug
in the 3.x kernel with regards to libvirt / vdsm. But my knowledge is too limited
to even know where to begin the investigation ☺

On the VM logs, we just see normal VM stuff, then nothing, and then when the VM 
was rebooted, there’s a couple of lines of ^@^@^@ characters repeating. But 
nothing else really.
Initially we thought it was a bug with aufs on Docker, but the machines getting
stuck now don’t run Docker either.

From your answer, I deduce that if vdsm, libvirt, or the SPM saw a problem with
storage / memory / CPU, it would suspend the VM and report that info to
ovirt-engine?
Since this is not happening, you think it could be related to the inside of the 
VM rather than the oVirt environment, correct?

Thank you for your help (especially on a Sunday) ☺


The only way to recover is to force-shutdown the VM by issuing shutdown twice
from the engine.

Could you please help me to start debugging this?
I can provide any logs, but I’m not sure which ones, because I couldn’t see 
anything with ERROR in the vdsm logs on the host.

The host is running

OS Version:         RHEL - 7 - 1.1503.el7.centos.2.8
Kernel Version:     3.10.0 - 229.14.1.el7.x86_64
KVM Version:        2.1.2 - 23.el7_1.8.1
LIBVIRT Version:    libvirt-1.2.8-16.el7_1.4
VDSM Version:       vdsm-4.16.26-0.el7.centos
SPICE Version:      0.12.4 - 9.el7_1.3
GlusterFS Version:  glusterfs-3.7.5-1.el7

We use a locally exported Gluster volume as the storage domain (i.e., the storage
is on the same machine, exposed via Gluster). No replica.
We run around 50 VMs on that host.

Thank you for your help in this,

—
Christophe




Re: [ovirt-users] VM get stuck randomly

2016-03-13 Thread Nir Soffer
On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS
 wrote:
> Dear all,
>
> I have had a problem for a couple of weeks now, where randomly one VM (not
> always the same) becomes completely unresponsive.
> We find this out because our Icinga server complains that the host is down.
>
> Upon inspection, we find we can’t open a console to the VM, nor can we log in.
>
> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
> usage shows 0% and CPU usage shows 100% or 75%, depending on the number of
> cores.
> The only way to recover is to force-shutdown the VM by issuing shutdown twice
> from the engine.
>
> Could you please help me to start debugging this?
> I can provide any logs, but I’m not sure which ones, because I couldn’t see 
> anything with ERROR in the vdsm logs on the host.

I would inspect this VM on the host when it happens.

What is vdsm's CPU usage? What is the CPU usage of the qemu process for this VM?

strace output of this qemu process (all threads) or a core dump can help qemu
developers understand this issue.
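
The inspection above can be sketched as a few host-side commands. This is a hedged example, not an official oVirt procedure: the VM name `myvm` and the `/tmp` output paths are placeholders, and strace/gcore must be installed on the hypervisor.

```shell
# Run as root on the hypervisor. "myvm" and the /tmp paths are placeholders.

# Find the qemu process for the VM (the guest name appears on its command line).
QEMU_PID=$(pgrep -of 'qemu.*myvm' || echo $$)  # $$ fallback only so the demo runs anywhere

# Per-thread CPU usage and state of that process (-L lists threads).
ps -L -o pid,tid,pcpu,state,comm -p "$QEMU_PID"

# strace all threads for 30 seconds (uncomment on the real host):
# timeout 30 strace -f -tt -p "$QEMU_PID" -o /tmp/qemu-strace.log

# Core dump without killing the process (gcore ships with gdb; uncomment):
# gcore -o /tmp/qemu-core "$QEMU_PID"
```

If the stuck threads show 100% CPU while the strace log stays nearly empty, the spinning is likely happening inside the guest rather than in qemu's I/O path.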

>
> The host is running
>
> OS Version: RHEL - 7 - 1.1503.el7.centos.2.8
> Kernel Version: 3.10.0 - 229.14.1.el7.x86_64
> KVM Version:2.1.2 - 23.el7_1.8.1
> LIBVIRT Version:libvirt-1.2.8-16.el7_1.4
> VDSM Version:   vdsm-4.16.26-0.el7.centos
> SPICE Version:  0.12.4 - 9.el7_1.3
> GlusterFS Version:  glusterfs-3.7.5-1.el7

You are running old versions, missing a lot of fixes. Nothing specific to your
problem, but it lowers the chance of getting a working system.

It would be nice if you could upgrade to ovirt-3.6 and report whether it makes
any change, or at least to the latest ovirt-3.5.

> We use a locally exported Gluster volume as the storage domain (i.e., the
> storage is on the same machine, exposed via Gluster). No replica.
> We run around 50 VMs on that host.

Why use gluster for this? Do you plan to add more gluster servers in the future?

Nir


Re: [ovirt-users] VM get stuck randomly

2016-03-13 Thread Yaniv Kaul
On Sun, Mar 13, 2016 at 9:46 AM, Christophe TREFOIS <
christophe.tref...@uni.lu> wrote:

> Dear all,
>
> I have had a problem for a couple of weeks now, where randomly one VM (not
> always the same) becomes completely unresponsive.
> We find this out because our Icinga server complains that the host is down.
>
> Upon inspection, we find we can’t open a console to the VM, nor can we log in.
>

I assume 3.6's console feature, or is it Spice/VNC?


>
> In oVirt engine, the VM looks like “up”. The only weird thing is that RAM
> usage shows 0% and CPU usage shows 100% or 75%, depending on the number of cores.
>

Any chance there's really something bad going on within the VM? Anything in
its journal or /var/log/messages or ... depending on the OS?
Y.
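
To act on this suggestion inside the guest after the next forced reboot, a hedged sketch (assuming a systemd guest; older distros such as FC9 only have the flat log files):

```shell
# Run inside the guest after the forced reboot; paths are the common defaults.

# systemd guests: errors and worse from the previous boot.
journalctl -b -1 -p err --no-pager 2>/dev/null | tail -n 50

# Any guest: look for traces of a kernel crash in the flat log files.
grep -iE 'panic|oops|soft lockup|hung task' \
    /var/log/messages /var/log/syslog 2>/dev/null | tail -n 20
```

Note that a hard panic often never reaches the on-disk logs, which is why an attached serial console is the more reliable way to capture it.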



> The only way to recover is to force-shutdown the VM by issuing shutdown twice
> from the engine.
>
> Could you please help me to start debugging this?
> I can provide any logs, but I’m not sure which ones, because I couldn’t
> see anything with ERROR in the vdsm logs on the host.
>
> The host is running
>
> OS Version:         RHEL - 7 - 1.1503.el7.centos.2.8
> Kernel Version:     3.10.0 - 229.14.1.el7.x86_64
> KVM Version:        2.1.2 - 23.el7_1.8.1
> LIBVIRT Version:    libvirt-1.2.8-16.el7_1.4
> VDSM Version:       vdsm-4.16.26-0.el7.centos
> SPICE Version:      0.12.4 - 9.el7_1.3
> GlusterFS Version:  glusterfs-3.7.5-1.el7
>
> We use a locally exported Gluster volume as the storage domain (i.e., the
> storage is on the same machine, exposed via Gluster). No replica.
> We run around 50 VMs on that host.
>
> Thank you for your help in this,
>
> —
> Christophe
>
>