[ovirt-users] Re: VMs unexpectedly restarted

Sahina Bose Mon, 29 Oct 2018 23:34:07 -0700

On Sun, Oct 28, 2018 at 5:17 PM fsoyer <[email protected]> wrote:
>
>
> Well guys,
> I can now say that I have a real problem, maybe between oVirt and gluster
> storage, but I can't be sure. Yesterday, I wanted to clone a VM (named
> "crij2") from a snapshot, but (this is another problem, I think) the UI
> never gave me the popup (just a blank window with the cursor, then a 400
> message after a timeout). So I decided to export it, then import it.
> The export/import finally worked, but while it was running, some VMs
> randomly became unresponsive, and one restarted on error. At that time, the
> engine was on the "ginger" node. Copy of the event log:
> 27 oct. 2018 20:32:12 VM crij2 started on Host victor.local.systea.fr
> 27 oct. 2018 20:31:37 VM crij2 was started by admin@internal-authz (Host: 
> victor.local.systea.fr).
> 27 oct. 2018 20:26:39 Vm crij2 was imported successfully to Data Center 
> Default, Cluster Default
> 27 oct. 2018 20:22:53 VM logcollector is not responding.
> 27 oct. 2018 20:22:10 VM Sogov3 is not responding.
> 27 oct. 2018 20:17:53 VM cerbere4 is not responding.
> 27 oct. 2018 20:17:49 VM cerbere3 is not responding.
> 27 oct. 2018 20:17:48 VM logcollector is not responding.
> 27 oct. 2018 20:16:38 VM Sogov3 is not responding.
> 27 oct. 2018 20:16:38 VM cerbere4 is not responding.
> 27 oct. 2018 20:16:38 VM op2drugs1 is not responding.
> 27 oct. 2018 20:16:33 VM cerbere3 is not responding.
> 27 oct. 2018 20:07:30 VM op2drugs1 is not responding.
> 27 oct. 2018 20:06:14 VM cerbere3 is not responding.
> 27 oct. 2018 20:02:27 VM cerbere3 is not responding.
> 27 oct. 2018 20:01:11 VM logcollector is not responding.
> 27 oct. 2018 20:00:56 VM zabbix is not responding.
> 27 oct. 2018 19:57:42 VM zabbix is not responding.
> 27 oct. 2018 19:57:42 VM cerbere3 is not responding.
> 27 oct. 2018 19:57:42 VM logcollector is not responding.
> 27 oct. 2018 19:54:40 VM zabbix is not responding.
> 27 oct. 2018 19:53:25 VM cerbere3 is not responding.
> 27 oct. 2018 19:53:25 VM cerbere4 is not responding.
> 27 oct. 2018 19:48:29 Starting to import Vm crij2 to Data Center Default, 
> Cluster Default
> 27 oct. 2018 19:47:41 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 19:46:46 VM crij2 was renamed from crij2 to crij2_ok by admin.
> 27 oct. 2018 19:46:46 VM crij2 configuration was updated by 
> admin@internal-authz.
> 27 oct. 2018 19:46:12 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 19:42:36 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 19:37:22 Vm crij2 was exported successfully to EXPORT
> 27 oct. 2018 19:36:04 VM HostedEngine is not responding.
> 27 oct. 2018 19:33:03 VM op2drugs1 is not responding.
> 27 oct. 2018 19:32:48 VM altern8 is not responding.
> 27 oct. 2018 19:32:48 VM patjoub1 is not responding.
> 27 oct. 2018 19:31:03 VM op2drugs1 is not responding.
> 27 oct. 2018 19:30:48 VM altern8 is not responding.
> 27 oct. 2018 19:30:48 VM patjoub1 is not responding.
> 27 oct. 2018 19:28:37 VM Sogov3 is not responding.
> 27 oct. 2018 19:28:07 VM altern8 is not responding.
> 27 oct. 2018 19:28:07 VM op2drugs1 is not responding.
> 27 oct. 2018 19:28:07 VM patjoub1 is not responding.
> 27 oct. 2018 19:25:10 VM Mint19 is not responding.
> 27 oct. 2018 19:25:10 VM zabbix is not responding.
> 27 oct. 2018 19:24:55 VM HostedEngine is not responding.
> 27 oct. 2018 19:23:33 VM op2drugs1 is not responding.
> 27 oct. 2018 19:23:18 VM altern8 is not responding.
> 27 oct. 2018 19:23:18 VM patjoub1 is not responding.
> 27 oct. 2018 19:21:52 VM op2drugs1 is not responding.
> 27 oct. 2018 19:20:06 VM patjoub1 is not responding.
> 27 oct. 2018 19:19:51 VM Sogov3 is not responding.
> 27 oct. 2018 19:18:26 Host ginger.local.systea.fr power management was 
> verified successfully.
> 27 oct. 2018 19:18:26 Status of host ginger.local.systea.fr was set to Up.
> 27 oct. 2018 19:18:25 Manually synced the storage devices from host 
> ginger.local.systea.fr
> 27 oct. 2018 19:17:51 Executing power management status on Host 
> ginger.local.systea.fr using Proxy Host victor.local.systea.fr and Fence 
> Agent ipmilan:10.0.0.225.
> 27 oct. 2018 19:17:39 Host ginger.local.systea.fr is not responding. It will 
> stay in Connecting state for a grace period of 82 seconds and after that an 
> attempt to fence the host will be issued.
> 27 oct. 2018 19:17:21 VM altern8 is not responding.
> 27 oct. 2018 19:17:21 Invalid status on Data Center Default. Setting Data 
> Center status to Non Responsive (On host ginger.local.systea.fr, Error: 
> Network error during communication with the Host.).
> 27 oct. 2018 19:17:21 VM patjoub1 is not responding.
> 27 oct. 2018 19:17:20 VM HostedEngine is not responding.
> 27 oct. 2018 19:17:20 VM op2drugs1 is not responding.
> 27 oct. 2018 19:17:19 VDSM ginger.local.systea.fr command SpmStatusVDS 
> failed: Connection timeout for host 'ginger.local.systea.fr', last response 
> arrived 17279 ms ago.
> 27 oct. 2018 19:16:16 Failed to update VMs/Templates OVF data for Storage
> Domain DATA02 in Data Center Default.
> 27 oct. 2018 19:16:16 Failed to update OVF disks
> 85d67951-d610-49b3-aaab-a81850621e35, OVF data isn't updated on those OVF
> stores (Data Center Default, Storage Domain DATA02).
> 27 oct. 2018 19:16:16 VDSM command SetVolumeDescriptionVDS failed: Resource 
> timeout: ()
> 27 oct. 2018 19:16:16 VM patjoub1 is not responding.
> 27 oct. 2018 19:16:16 VM op2drugs1 is not responding.
> 27 oct. 2018 19:14:46 VM patjoub1 is not responding.
> 27 oct. 2018 19:14:46 VM op2drugs1 is not responding.
> 27 oct. 2018 19:13:18 Host ginger.local.systea.fr power management was 
> verified successfully.
> 27 oct. 2018 19:13:18 Status of host ginger.local.systea.fr was set to Up.
> 27 oct. 2018 19:13:03 Manually synced the storage devices from host 
> ginger.local.systea.fr
> 27 oct. 2018 19:12:51 VM altern8 is not responding.
> 27 oct. 2018 19:12:51 VM HostedEngine is not responding.
> 27 oct. 2018 19:12:51 VM op2drugs1 is not responding.
> 27 oct. 2018 19:12:48 Executing power management status on Host 
> ginger.local.systea.fr using Proxy Host victor.local.systea.fr and Fence 
> Agent ipmilan:10.0.0.225.
> 27 oct. 2018 19:12:44 Host ginger.local.systea.fr does not enforce SELinux. 
> Current status: DISABLED
> 27 oct. 2018 19:12:36 Invalid status on Data Center Default. Setting Data 
> Center status to Non Responsive (On host ginger.local.systea.fr, Error: 
> Network error during communication with the Host.).
> 27 oct. 2018 19:12:28 Host ginger.local.systea.fr is not responding. It will 
> stay in Connecting state for a grace period of 82 seconds and after that an 
> attempt to fence the host will be issued.
> 27 oct. 2018 19:12:28 VDSM ginger.local.systea.fr command SpmStatusVDS 
> failed: Connection timeout for host 'ginger.local.systea.fr', last response 
> arrived 25225 ms ago.
> 27 oct. 2018 19:10:06 VM altern8 is not responding.
> 27 oct. 2018 19:10:06 VM patjoub1 is not responding.
> 27 oct. 2018 19:10:06 VM op2drugs1 is not responding.
> 27 oct. 2018 19:08:49 VM op2drugs1 is not responding.
> 27 oct. 2018 19:08:45 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 19:08:34 VM altern8 is not responding.
> 27 oct. 2018 19:08:34 VM patjoub1 is not responding.
> 27 oct. 2018 19:08:34 VM HostedEngine is not responding.
> 27 oct. 2018 19:04:01 VM op2drugs1 is not responding.
> 27 oct. 2018 19:01:08 VM HostedEngine is not responding.
> 27 oct. 2018 19:00:53 VM zabbix is not responding.
> 27 oct. 2018 19:00:01 Trying to restart VM npi2 on Host victor.local.systea.fr
> 27 oct. 2018 18:59:14 Trying to restart VM npi2 on Host victor.local.systea.fr
> 27 oct. 2018 18:59:13 Highly Available VM np2 failed. It will be restarted 
> automatically.
> 27 oct. 2018 18:59:13 VM npi2 is down with error. Exit message: VM has been 
> terminated on the host.
> 27 oct. 2018 18:59:05 VM altern8 is not responding.
> 27 oct. 2018 18:58:44 Storage domain DATA02 experienced a high latency of 
> 6.16279 seconds from host ginger.local.systea.fr. This may cause performance 
> and functional issues. Please consult your Storage Administrator.
> 27 oct. 2018 18:57:19 VM altern8 is not responding.
> 27 oct. 2018 18:57:19 VM patjoub1 is not responding.
> 27 oct. 2018 18:57:19 VM HostedEngine is not responding.
> 27 oct. 2018 18:57:19 VM op2drugs1 is not responding.
> 27 oct. 2018 18:55:56 VM altern8 is not responding.
> 27 oct. 2018 18:55:41 VM op2drugs1 is not responding.
> 27 oct. 2018 18:55:00 VM altern8 is not responding.
> 27 oct. 2018 18:54:45 VM op2drugs1 is not responding.
> 27 oct. 2018 18:52:21 VM Sogov3 is not responding.
> 27 oct. 2018 18:52:21 VM npi2 is not responding.
> 27 oct. 2018 18:50:50 VM altern8 is not responding.
> 27 oct. 2018 18:50:47 VM zabbix is not responding.
> 27 oct. 2018 18:48:16 VM op2drugs1 is not responding.
> 27 oct. 2018 18:48:03 VM altern8 is not responding.
> 27 oct. 2018 18:48:03 VM HostedEngine is not responding.
> 27 oct. 2018 18:45:48 Starting export Vm crij2 to EXPORT
> 27 oct. 2018 18:42:57 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 18:40:44 Refresh image list succeeded for domain(s): ISO (ISO 
> file type)
> 27 oct. 2018 18:40:04 VM crij2 is down. Exit message: User shut down from 
> within the guest
> 27 oct. 2018 18:39:25 User admin@internal-authz got disconnected from VM 
> crij2.
> I checked the network and gluster while it was running but saw absolutely
> nothing special. The storage network was not heavily loaded (bwm-ng showed a
> maximum of 50MB/s on bond1). The Gluster logs showed no errors either (even
> though the engine reported some timeouts).
>
> This morning I started to investigate and wanted to post some logs to this
> thread, but I found something that had not caught my attention before, so I
> am asking about it first.
>
> To recap the configuration:
> 3 hosts with gluster (replica 2 + arbiter). The volumes are on a separate
> network (bond1 is an aggregation of 2 Gb cards, while ovirtmgmt is on bond0,
> 2 NICs in backup mode).
> For now, I have only declared the first 2 nodes in the engine GUI as oVirt
> nodes, because the arbiter is a small machine with a smaller CPU (and only
> 8 GB of memory), which would have required downgrading the cluster from
> Sandybridge to Nehalem. Maybe that was a mistake. The storage network on
> bond1 was also declared in the GUI, but not yet as gluster storage.
>
> The Gluster volumes themselves were declared on the storage network, using
> the names defined in /etc/hosts for the bond1 network. Here is the volume
> status:
> # gluster volume status
> Status of volume: DATA01
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/d
> ata01/data01/brick                          49152     0          Y       2489
> Brick gingerstorage.local.systea.fr:/home/d
> ata01/data01/brick                          49152     0          Y       2531
> Brick eskarinastorage.local.systea.fr:/home
> /data01/data01/brick                        49153     0          Y       28119
> Self-heal Daemon on localhost               N/A       N/A        Y       24859
> Self-heal Daemon on eskarinastorage.local.s
> ystea.fr                                    N/A       N/A        Y       30725
> Self-heal Daemon on victorstorage.local.sys
> tea.fr                                      N/A       N/A        Y       2810
>
> Task Status of Volume DATA01
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: DATA02
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/d
> ata02/data02/brick                          49153     0          Y       2553
> Brick gingerstorage.local.systea.fr:/home/d
> ata02/data02/brick                          49153     0          Y       2561
> Brick eskarinastorage.local.systea.fr:/home
> /data01/data02/brick                        49154     0          Y       28204
> Self-heal Daemon on localhost               N/A       N/A        Y       24859
> Self-heal Daemon on eskarinastorage.local.s
> ystea.fr                                    N/A       N/A        Y       30725
> Self-heal Daemon on victorstorage.local.sys
> tea.fr                                      N/A       N/A        Y       2810
>
> Task Status of Volume DATA02
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: ENGINE
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/d
> ata02/engine/brick                          49154     0          Y       2571
> Brick gingerstorage.local.systea.fr:/home/d
> ata02/engine/brick                          49154     0          Y       2610
> Brick eskarinastorage.local.systea.fr:/home
> /data01/engine/brick                        49152     0          Y       28013
> Self-heal Daemon on localhost               N/A       N/A        Y       24859
> Self-heal Daemon on eskarinastorage.local.s
> ystea.fr                                    N/A       N/A        Y       30725
> Self-heal Daemon on victorstorage.local.sys
> tea.fr                                      N/A       N/A        Y       2810
>
> Task Status of Volume ENGINE
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: EXPORT
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/d
> ata01/export/brick                          49155     0          Y       2588
> Brick gingerstorage.local.systea.fr:/home/d
> ata01/export/brick                          49155     0          Y       2629
> Brick eskarinastorage.local.systea.fr:/home
> /data01/export/brick                        49156     0          Y       28384
> Self-heal Daemon on localhost               N/A       N/A        Y       24859
> Self-heal Daemon on eskarinastorage.local.s
> ystea.fr                                    N/A       N/A        Y       30725
> Self-heal Daemon on victorstorage.local.sys
> tea.fr                                      N/A       N/A        Y       2810
>
> Task Status of Volume EXPORT
> ------------------------------------------------------------------------------
> There are no active volume tasks
>
> Status of volume: ISO
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick victorstorage.local.systea.fr:/home/d
> ata01/iso/brick                             49156     0          Y       2595
> Brick gingerstorage.local.systea.fr:/home/d
> ata01/iso/brick                             49156     0          Y       2636
> Brick eskarinastorage.local.systea.fr:/home
> /data01/iso/brick                           49155     0          Y       28292
> Self-heal Daemon on localhost               N/A       N/A        Y       24859
> Self-heal Daemon on eskarinastorage.local.s
> ystea.fr                                    N/A       N/A        Y       30725
> Self-heal Daemon on victorstorage.local.sys
> tea.fr                                      N/A       N/A        Y       2810
>
> Task Status of Volume ISO
> -------------------------------
> But a df on the nodes shows that all volumes except ENGINE are mounted over
> the ovirtmgmt network (host names without "storage"):
>
> gingerstorage.local.systea.fr:/ENGINE   5,0T    226G  4,7T   5% 
> /rhev/data-center/mnt/glusterSD/gingerstorage.local.systea.fr:_ENGINE
> victor.local.systea.fr:/DATA01          1,3T    425G  862G  33% 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA01
> victor.local.systea.fr:/DATA02          5,0T    226G  4,7T   5% 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_DATA02
> victor.local.systea.fr:/ISO             1,3T    425G  862G  33% 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_ISO
> victor.local.systea.fr:/EXPORT          1,3T    425G  862G  33% 
> /rhev/data-center/mnt/glusterSD/victor.local.systea.fr:_EXPORT
>
> I can't remember how it was declared at install time; maybe I did not notice
> it. But if I try to add a gluster-managed domain now, it does indeed offer me
> only the nodes by their ovirtmgmt names, not their storage names.
>
> The names are only known in /etc/hosts on all nodes and the engine; there is
> no DNS for these local addresses.
>
> So, in your opinion, could this configuration be a (or the) source of my
> problems? And do you have an idea how I could correct this now without
> losing anything?


I don't think this is the cause of your issues.
Are there errors in vdsm logs? Do you have issues with storage latency
(can you check the gluster volume profile output?)
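
For reference, a minimal sketch of how both checks could be run from a shell
on one of the gluster nodes. The helper names vdsm_errors and profile_volume
are made up for illustration, /var/log/vdsm/vdsm.log is assumed to be the log
location, and the filter assumes each vdsm log line starts with a
"YYYY-MM-DD HH:MM:SS" timestamp followed by the log level; adjust it if your
logs look different.

```shell
# Hypothetical helper: print WARN/ERROR lines from a vdsm log whose
# timestamp starts with the given prefix (e.g. '2018-10-27 19:1').
# Assumes lines look like: "2018-10-27 19:16:16,123+0200 ERROR (...) ..."
vdsm_errors() {
    # $1 = log file, $2 = timestamp prefix (matched literally at line start)
    awk -v p="$2" 'index($0, p) == 1 && / (ERROR|WARN) /' "$1"
}

# Hypothetical helper: collect gluster profile data for one volume while
# the slow operation (for example the export) is running.
profile_volume() {
    # $1 = volume name, $2 = seconds to sample (default 60)
    gluster volume profile "$1" start || return 1
    sleep "${2:-60}"
    gluster volume profile "$1" info
    gluster volume profile "$1" stop
}
```

For example, vdsm_errors /var/log/vdsm/vdsm.log '2018-10-27 19:1' would list
the warnings and errors logged between 19:10 and 19:19, and
profile_volume DATA02 120 would profile DATA02 for two minutes. In the profile
output, the per-brick latency figures for each file operation are the part to
compare between bricks.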

>
> Thanks for all suggestions.
>
> --
>
> Regards,
>
> Frank
>
>
> On Thursday, October 18, 2018 23:13 CEST, Nir Soffer <[email protected]>
> wrote:
> On Thu, Oct 18, 2018 at 3:43 PM fsoyer <[email protected]> wrote:
> Hi,
> I forgot to look in the /var/log/messages file on the host! What a shame :/
> Here are the messages from the time of the error:
> https://gist.github.com/fsoyer/4d1247d4c3007a8727459efd23d89737
> At the same time, the second host has no particular messages in its log
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/AXYRWIFKQN2V7P2GDFR6OWLQZPEPUXEJ/
