[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

2019-06-14 Thread Strahil
Hi Maria,

I guess the memory usage is very specific to the environment.

My setup includes 2 VDOs, 8 gluster volumes, 2 clusters, 12 VMs and only 1 
user - the built-in admin.
As a result, my engine is using 4GB of RAM.
How many users/storage domains/clusters/VMs do you have?

When you login on the engine, what is the process eating most of the RAM?
My suspicion is the DB. If so, maybe someone else can advise whether performing 
a vacuum on the DB during the upgrade would be beneficial.
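To see what is eating the RAM, something like the following should be enough 
(just a sketch - the database name and the exact PostgreSQL tooling may differ 
per setup):
#ps aux --sort=-%mem | head -n 10
#su - postgres -c 'vacuumdb --analyze engine'
The second command is only relevant if the engine's PostgreSQL turns out to be 
the culprit, and I believe recent engine-setup versions also offer to run a 
full vacuum during the upgrade itself.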

Best Regards,
Strahil Nikolov

On Jun 13, 2019 15:55, souvaliotima...@mail.com wrote:
>
> Hello and thank you very much for your reply. 
>
> I'm terribly sorry for being so late to respond. 
>
> I thought the same, that dropping the cache was more of a workaround and not 
> a real solution, but truthfully I was stuck and couldn't think of anything 
> other than how much I need to upgrade the memory on the nodes. I have been 
> trying to find info about other oVirt virtualization set-ups and the amount 
> of memory allocated, so I can get an idea of what my set-up needs. The only 
> thing I found was that one admin had set oVirt up with 128GB, still needed 
> more because of the growing needs of the system and its users, and was about 
> to upgrade its memory too. I'm just worried that oVirt is very memory-hungry 
> and no matter how much I "feed" it, it will still ask for more. I'm also 
> worried that there are one, two or even more tweaks in the configuration that 
> I am still missing and that would solve the memory problem. 
>
> Anyway, KSM is enabled. Sar shows that, when a Windows 10 VM is also active 
> (alongside the Hosted Engine of course, and two Linux VMs - 1 CentOS, 1 
> Debian), the committed memory on the specific host it runs on (together with 
> the Debian VM) is around 89% and has reached up to 98%. 
>
> You are correct about the monitoring system too. I have set up a PRTG 
> environment and there's Nagios running, but they can't see oVirt yet. I will 
> set them up correctly in the next few days. 
>
> I haven't made any changes to my tuned profile; it's the default from oVirt. 
> Specifically, the active profile says it's set to virtual-host. 
>
> Again, I'm very sorry it took me so long to reply, and thank you very much 
> for your response. 
>
> Best Regards, 
> Maria Souvalioti
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZTBB7JJY2OLEIF7UZOIRB2BNG35GIXE2/


[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

2019-06-13 Thread souvaliotimaria
Hello and thank you very much for your reply.

I'm terribly sorry for being so late to respond. 

I thought the same, that dropping the cache was more of a workaround and not a 
real solution, but truthfully I was stuck and couldn't think of anything other 
than how much I need to upgrade the memory on the nodes. I have been trying to 
find info about other oVirt virtualization set-ups and the amount of memory 
allocated, so I can get an idea of what my set-up needs. The only thing I found 
was that one admin had set oVirt up with 128GB, still needed more because of 
the growing needs of the system and its users, and was about to upgrade its 
memory too. I'm just worried that oVirt is very memory-hungry and no matter how 
much I "feed" it, it will still ask for more. I'm also worried that there are 
one, two or even more tweaks in the configuration that I am still missing and 
that would solve the memory problem. 

Anyway, KSM is enabled. Sar shows that, when a Windows 10 VM is also active 
(alongside the Hosted Engine of course, and two Linux VMs - 1 CentOS, 1 
Debian), the committed memory on the specific host it runs on (together with 
the Debian VM) is around 89% and has reached up to 98%.
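For reference, those figures come from sar's memory report and /proc/meminfo; 
something along these lines (the exact invocation is just an example):
#sar -r 1 3
#grep -i -e committed_as -e commitlimit /proc/meminfo
The %commit column of sar -r is the committed-memory percentage quoted above.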

You are correct about the monitoring system too. I have set up a PRTG 
environment and there's Nagios running, but they can't see oVirt yet. I will 
set them up correctly in the next few days.

I haven't made any changes to my tuned profile; it's the default from oVirt. 
Specifically, the active profile says it's set to virtual-host.
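For anyone checking the same thing, the active profile can be viewed and 
changed with tuned-adm, e.g.:
#tuned-adm active
#tuned-adm profile virtual-host
As far as I know, virtual-host is the profile oVirt Node selects by default on 
hypervisor hosts.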

Again, I'm very sorry it took me so long to reply, and thank you very much for 
your response. 

Best Regards,
Maria Souvalioti
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/G4YELWF5L4AKUT3OH4C4QJHHEEJPCI3G/


[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

2019-06-06 Thread Strahil

On Jun 6, 2019 12:52, souvaliotima...@mail.com wrote:
>
> Hello, 
>
> I came upon a problem the previous month that I figured it would be good to 
> discuss here. I'm sorry I didn't post here earlier, but time slipped away from me. 
>
> I have set up a glustered, hyperconverged oVirt environment for experimental 
> use, as a means to see its behaviour and get used to its management and 
> performance before setting it up as a production environment for use in our 
> organization. The environment has been up and running since October 2018. The 
> three nodes are HP ProLiant DL380 G7s with the following characteristics: 
>
> Mem: 22GB 
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx 
> HDD: 5x 300GB 
> Network: BCM5709C with dual-port Gigabit 
> OS: RedHat 7.5.1804 (Core, kernel 3.10.0-862.3.2.el7.x86_64 x86_64) - oVirt 
> Node 4.2.3.1 
>
> As I was working on the environment, the engine stopped working. 
> Not long before the HE stopped, I was in the web interface managing my VMs 
> when the browser froze, and the HE also stopped responding to ICMP requests. 
>
> The first thing I did was to connect via ssh to all nodes and run the command 
> #hosted-engine --vm-status 
> which showed that the HE was down on nodes 1 and 2 and up on the 3rd node. 
>
> After executing 
> #virsh -r list 
> the VM list that was shown contained two of the VMs I had previously created, 
> which were up; the HE was nowhere to be seen. 
>
> I tried to restart the HE with the 
> #hosted-engine --vm-start 
> but it didn't work. 
>
> I then put all nodes in maintenance mode with the command 
> #hosted-engine --set-maintenance --mode=global 
> (I guess I should have done that earlier) and re-ran 
> #hosted-engine --vm-start 
> which had the same result as before. 
>
> After checking the mails the system sent to the root user, I saw there were 
> several mails on the 3rd node (where the HE had been), informing me of the 
> HE's state. The messages were alternating between EngineDown-EngineStart, 
> EngineStart-EngineStarting, EngineStarting-EngineMaybeAway, 
> EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown, 
> EngineDown-EngineStart and so forth. 
>
> I continued by searching the following logs on all nodes: 
> /var/log/libvirt/qemu/HostedEngine.log 
> /var/log/libvirt/qemu/win10.log 
> /var/log/libvirt/qemu/DNStest.log 
> /var/log/vdsm/vdsm.log 
> /var/log/ovirt-hosted-engine-ha/agent.log 
>
> After that I spotted an error that had started appearing almost a month 
> earlier on node #2: 
> ERROR Internal server error 
> Traceback (most recent call last): 
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request 
>     res = method(**params) 
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod 
>     result = fn(*methodArgs) 
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList 
>     return self._gluster.logicalVolumeList() 
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper 
>     rv = func(*args, **kwargs) 
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList 
>     status = self.svdsmProxy.glusterLogicalVolumeList() 
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__ 
>     return callMethod() 
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda> 
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args, 
> AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList' 
>
>
> The outputs of the following commands were also checked as a way to see if 
> there was a mandatory process missing/killed, a memory problem or even a disk 
> space shortage that led to the sudden death of a process: 
> #ps -A 
> #top 
> #free -h 
> #df -hT 
>
> Finally, after some time delving into the logs, the output of 
> #journalctl --dmesg 
> showed the following message: 
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child. 
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB, 
> file-rss:2336kB, shmem-rss:12kB" 
> after which the ovirtmgmt network stopped responding. 
If you run out of memory, you should take that seriously. Dropping the cache 
seems like a workaround and not a fix.
Check if KSM is enabled - it will merge identical memory pages of your VMs in 
exchange for some CPU cycles - still better than getting a VM killed.
Also, you can protect the HostedEngine from the OOM killer.
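For example (only a sketch - the pgrep pattern is a guess at how the HE qemu 
process is named on your hosts, so adjust to your setup):
#cat /sys/kernel/mm/ksm/run    (1 means KSM is active)
#systemctl status ksm ksmtuned
#HE_PID=$(pgrep -f 'qemu.*HostedEngine' | head -n1)
#echo -1000 > /proc/${HE_PID}/oom_score_adj    (-1000 exempts the process from the OOM killer)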

> I tried to restart vhostmd by executing 
> #/etc/rc.d/init.d/vhostmd start 
> but it didn't work. 
>
> Finally, I decided to run the HE restart command on the other nodes as well 
> (I'd figured that since the HE was last running on node #3, that's where I 
> should try to restart it). So, I ran 
> #hosted-engine --vm-start 
> and the output was 
> "Command VM.getStats with args {'vmID':'...<the HE's ID>'} failed: 
> (code=1,message=Virtual machine does 

[ovirt-users] Re: Hosted Engine Abruptly Stopped Responding - Unexpected Shutdown

2019-06-06 Thread Edward Berger
When I read your intro and hit the memory figure, I was saying to myself
that I'd definitely increase the memory if possible - as high as you can
affordably fit into the servers.
The engine asks for 16GB at installation time; add some for the gluster
services and you're at your limits before you add a single user VM.
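If you want to double-check how much memory the engine VM itself was given,
something like
#virsh -r dumpxml HostedEngine | grep -i memory
on the host currently running it should show the allocation - just a quick
sanity check, not an official procedure.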

My first non-hyperconverged hosted-engine install used a 32GB and a 24GB
dual-Xeon machine, with only 8GB allocated for the engine VM.
I felt more confident in it when I upgraded the 24GB node to 48GB.  So 48GB
would be my minimum, 64GB is OK, and the more the better.
Later, I was able to find some used 144GB Supermicro servers with which I
replaced the above nodes.

Modern 64-bit CentOS likes to have around 2GB per core for basic server
functions.
For desktops, I say have at least 8GB because web browsers eat up RAM.



On Thu, Jun 6, 2019 at 5:52 AM  wrote:

> Hello,
>
> I came upon a problem the previous month that I figured it would be good
> to discuss here. I'm sorry I didn't post here earlier, but time slipped away
> from me.
>
> I have set up a glustered, hyperconverged oVirt environment for
> experimental use, as a means to see its behaviour and get used to its
> management and performance before setting it up as a production environment
> for use in our organization. The environment has been up and running since
> October 2018. The three nodes are HP ProLiant DL380 G7s with the following
> characteristics:
>
> Mem: 22GB
> CPU: 2x Hexa Core - Intel Xeon Hexa Core E56xx
> HDD: 5x 300GB
> Network: BCM5709C with dual-port Gigabit
> OS: RedHat 7.5.1804 (Core, kernel 3.10.0-862.3.2.el7.x86_64 x86_64) - oVirt
> Node 4.2.3.1
>
> As I was working on the environment, the engine stopped working.
> Not long before the HE stopped, I was in the web interface managing my VMs
> when the browser froze, and the HE also stopped responding to ICMP requests.
>
> The first thing I did was to connect via ssh to all nodes and run the
> command
> #hosted-engine --vm-status
> which showed that the HE was down on nodes 1 and 2 and up on the 3rd node.
>
> After executing
> #virsh -r list
> the VM list that was shown contained two of the VMs I had previously
> created, which were up; the HE was nowhere to be seen.
>
> I tried to restart the HE with the
> #hosted-engine --vm-start
> but it didn't work.
>
> I then put all nodes in maintenance mode with the command
> #hosted-engine --set-maintenance --mode=global
> (I guess I should have done that earlier) and re-ran
> #hosted-engine --vm-start
> which had the same result as before.
>
> After checking the mails the system sent to the root user, I saw there
> were several mails on the 3rd node (where the HE had been), informing me of
> the HE's state. The messages were alternating between EngineDown-EngineStart,
> EngineStart-EngineStarting, EngineStarting-EngineMaybeAway,
> EngineMaybeAway-EngineUnexpectedlyDown, EngineUnexpectedlyDown-EngineDown,
> EngineDown-EngineStart and so forth.
>
> I continued by searching the following logs on all nodes:
> /var/log/libvirt/qemu/HostedEngine.log
> /var/log/libvirt/qemu/win10.log
> /var/log/libvirt/qemu/DNStest.log
> /var/log/vdsm/vdsm.log
> /var/log/ovirt-hosted-engine-ha/agent.log
>
> After that I spotted an error that had started appearing almost a month
> earlier on node #2:
> ERROR Internal server error
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request
>     res = method(**params)
>   File "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, in _dynamicMethod
>     result = fn(*methodArgs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/apiwrapper.py", line 85, in logicalVolumeList
>     return self._gluster.logicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 90, in wrapper
>     rv = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/gluster/api.py", line 808, in logicalVolumeList
>     status = self.svdsmProxy.glusterLogicalVolumeList()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__
>     return callMethod()
>   File "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 52, in <lambda>
>     getattr(self._supervdsmProxy._svdsm, self._funcName)(*args,
> AttributeError: 'AutoProxy[instance]' object has no attribute 'glusterLogicalVolumeList'
>
>
> The outputs of the following commands were also checked as a way to see if
> there was a mandatory process missing/killed, a memory problem or even a disk
> space shortage that led to the sudden death of a process:
> #ps -A
> #top
> #free -h
> #df -hT
>
> Finally, after some time delving into the logs, the output of
> #journalctl --dmesg
> showed the following message:
> "Out of memory: Kill process 5422 (qemu-kvm) score 514 or sacrifice child.
> Killed process 5422 (qemu-kvm) total-vm:17526548kB, anon-rss:9310396kB,
> file-rss:2336kB, shmem-rss:12kB"
> after which the ovirtmgmt network stopped responding.
>
> I tried to restart vhostmd by