Re: [Users] Nodes lose storage at random
Gents, I'm sending this email for archiving purposes. It's been a while since my last update on this topic. It turned out that although only one node at a time, at random, lost its connection to storage, the issue was not with oVirt at all, but with the storage. I'd like to refer to these two threads for more information: http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038061.html and http://lists.freebsd.org/pipermail/freebsd-net/2014-February/037967.html

Keywords: ovirt freebsd 9.2 zfs ixgbe intel 10gbit NFS

On Mon, Feb 24, 2014 at 3:55 PM, Ronen Hod r...@redhat.com wrote:

On 02/24/2014 11:48 AM, Nir Soffer wrote:

- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Monday, February 24, 2014 2:45:59 AM
Subject: Re: [Users] Nodes lose storage at random

Interestingly enough - the same thing happened today, around the same time. Logs from this host are attached. Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot the node and the node is fine again.

So we made some progress: we know that it is not a problem with an old kernel. In messages we see the same picture:

1. sanlock fails to renew the lease
2. after 80 seconds, sanlock kills vdsm
3. sanlock and vdsm cannot access the storage
4. the kernel complains about nfs server timeouts (which explains why sanlock failed to renew the lease)
5. after a reboot, nfs is accessible again
6. after a few days, go to step 1

This looks like a kernel nfs issue. It could also be a kvm issue (running BSD on one of the VMs?), or some incompatibility with the nfs server - maybe you are using esoteric configuration options?

CCing Ronen, in case this is related to kvm.

This does not seem to be related to KVM. Adding Ric Wheeler.

Ronen.

thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

Nir

--
Met vriendelijke groeten / With kind regards,
Johan Kooijman
T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com
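For anyone landing on this thread with the same symptom, a client-side probe of the storage domain mount can show whether the NFS server itself stalls, independently of oVirt/vdsm - the kind of "extra tooling" mentioned later in the thread. The sketch below is only an illustration: the mount path is the one that appears in the sanlock logs further down, and the log file name is made up; adjust both for your environment.

#!/bin/bash
# Sketch: once per second, time a stat() on an NFS storage domain mount and log
# failures or slow responses. Paths below are examples, not recommendations.
MOUNT="/rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data"
LOG="/var/log/nfs-probe.log"
while true; do
    start=$(date +%s)
    if ! timeout 30 stat "$MOUNT" >/dev/null 2>&1; then
        echo "$(date '+%F %T') stat on $MOUNT failed or timed out" >> "$LOG"
    else
        took=$(( $(date +%s) - start ))
        # a getattr on an otherwise healthy mount should be near-instant
        [ "$took" -ge 2 ] && echo "$(date '+%F %T') slow stat: ${took}s" >> "$LOG"
    fi
    sleep 1
done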
Re: [Users] Nodes lose storage at random
- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Monday, February 24, 2014 2:45:59 AM
Subject: Re: [Users] Nodes lose storage at random

Interestingly enough - the same thing happened today, around the same time. Logs from this host are attached. Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot the node and the node is fine again.

So we made some progress: we know that it is not a problem with an old kernel. In messages we see the same picture:

1. sanlock fails to renew the lease
2. after 80 seconds, sanlock kills vdsm
3. sanlock and vdsm cannot access the storage
4. the kernel complains about nfs server timeouts (which explains why sanlock failed to renew the lease)
5. after a reboot, nfs is accessible again
6. after a few days, go to step 1

This looks like a kernel nfs issue. It could also be a kvm issue (running BSD on one of the VMs?), or some incompatibility with the nfs server - maybe you are using esoteric configuration options?

CCing Ronen, in case this is related to kvm.

thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

Nir
Re: [Users] Nodes lose storage at random
On 02/24/2014 11:48 AM, Nir Soffer wrote:

- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Monday, February 24, 2014 2:45:59 AM
Subject: Re: [Users] Nodes lose storage at random

Interestingly enough - the same thing happened today, around the same time. Logs from this host are attached. Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot the node and the node is fine again.

So we made some progress: we know that it is not a problem with an old kernel. In messages we see the same picture:

1. sanlock fails to renew the lease
2. after 80 seconds, sanlock kills vdsm
3. sanlock and vdsm cannot access the storage
4. the kernel complains about nfs server timeouts (which explains why sanlock failed to renew the lease)
5. after a reboot, nfs is accessible again
6. after a few days, go to step 1

This looks like a kernel nfs issue. It could also be a kvm issue (running BSD on one of the VMs?), or some incompatibility with the nfs server - maybe you are using esoteric configuration options?

CCing Ronen, in case this is related to kvm.

This does not seem to be related to KVM. Adding Ric Wheeler.

Ronen.

thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

Nir
Re: [Users] Nodes lose storage at random
- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Sunday, February 23, 2014 3:48:25 AM
Subject: Re: [Users] Nodes lose storage at random

Been reinstalling to stock CentOS 6.5 last night, all successful. Until roughly midnight GMT, 2 out of 4 hosts were showing the same errors. Any more suggestions?

Let's see the logs from these hosts.
Re: [Users] Nodes lose storage at random
- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Wednesday, February 19, 2014 2:34:36 PM
Subject: Re: [Users] Nodes lose storage at random

Messages: https://t-x.dignus.nl/messages.txt
Sanlock: https://t-x.dignus.nl/sanlock.log.txt

We can see in /var/log/messages that sanlock failed to write to the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose its host id lease. In this case, sanlock kills vdsm [3], which dies after 11 retries [4]. Then vdsm is respawned again [5]. This is expected.

We don't know why sanlock failed to write to the storage, but in [6] the kernel tells us that the nfs server is not responding. Since the nfs server is accessible from other machines, it means you have some issue with this host.

Later the machine reboots [7], and the nfs server is still not accessible. Then you have a lot of WARN_ON call traces [8] that look related to network code.

We can see that you are not running the most recent kernel [7]. We experienced various nfs issues during the 6.5 beta. I would try to get help from the kernel folks about this.

[1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833 [21345]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids

[2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 check_our_lease failed 80

[3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 kill 19317 sig 15 count 1

[4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892 [14753]: dead 19317 ci 3 count 11

[5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave

[6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding, timed out

[7] Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013

[8] Feb 18 18:29:53 hv5 kernel: [ cut here ]
Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 skb_gso_segment+0x1df/0x2b0() (Not tainted)
Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596 data_len=0 ip_summed=0
Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack xt_multiport ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64 aes_generic vhost_net macvtap macvlan tun kvm_intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas scsi_transport_sas 3w_sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted 2.6.32-358.18.1.el6.x86_64 #1
Feb 18 18:29:53 hv5 kernel: Call Trace:
Feb 18 18:29:53 hv5 kernel: IRQ [8106e3e7] ?
warn_slowpath_common+0x87/0xc0 Feb 18 18:29:53 hv5 kernel: [8106e4d6] ? warn_slowpath_fmt+0x46/0x50 Feb 18 18:29:53 hv5 kernel: [a020bd62] ? igb_get_drvinfo+0x82/0xe0 [igb] Feb 18 18:29:53 hv5 kernel: [81448e7f] ? skb_gso_segment+0x1df/0x2b0 Feb 18 18:29:53 hv5 kernel: [81449260] ? dev_hard_start_xmit+0x1b0/0x530 Feb 18 18:29:53 hv5 kernel: [8146773a] ? sch_direct_xmit+0x15a/0x1c0 Feb 18 18:29:53 hv5 kernel: [8144d0c0] ? dev_queue_xmit+0x3b0/0x550 Feb 18 18:29:53 hv5 kernel: [a04af65c] ? br_dev_queue_push_xmit+0x6c/0xa0 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af6e8] ? br_forward_finish+0x58/0x60 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0 [bridge] Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110 Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70 [bridge] Feb 18 18:29:53 hv5 kernel: [a04b0609] ? br_handle_frame_finish+0x179/0x2a0 [bridge] Feb 18 18:29:53 hv5 kernel: [a04b08da] ? br_handle_frame+0x1aa/0x250 [bridge] Feb 18 18:29:53 hv5 kernel: [a0331690] ? pit_timer_fn+0x0/0x80 [kvm] Feb 18 18:29:53 hv5 kernel: [81448929] ? __netif_receive_skb+0x529/0x750 Feb 18 18:29:53 hv5 kernel: [81448bea] ? process_backlog+0x9a/0x100 Feb 18 18:29
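The six-step sequence described above can be reconstructed directly from the host's logs. A rough sketch, assuming the default log locations used in this thread (the grep patterns only cover the messages quoted above and are not exhaustive):

# On the affected host: pull out the lease-renewal failures and the vdsm kill (steps 1-2)...
grep -E 'delta_renew|check_our_lease failed|kill .* sig 15' /var/log/sanlock.log
# ...and the vdsm respawn plus the NFS timeouts the kernel reports (steps 4-5).
grep -E 'sanlock|respawning slave|nfs: server .* not responding' /var/log/messages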
Re: [Users] Nodes lose storage at random
Thanks for looking into it. I've been running the ovirt ISO untill now, will switch to stock C6.5 to see if it makes a difference. On Sat, Feb 22, 2014 at 8:57 PM, Nir Soffer nsof...@redhat.com wrote: - Original Message - From: Johan Kooijman m...@johankooijman.com To: Nir Soffer nsof...@redhat.com Cc: users users@ovirt.org Sent: Wednesday, February 19, 2014 2:34:36 PM Subject: Re: [Users] Nodes lose storage at random Messages: https://t-x.dignus.nl/messages.txt Sanlock: https://t-x.dignus.nl/sanlock.log.txt We can see in /var/log/messages, that sanlock failed to write to the ids lockspace [1], which after 80 seconds [2], caused vdsm to loose its host id lease. In this case, sanlock kill vdsm [3], which die after 11 retries [4]. Then vdsm is respawned again [5]. This is expected. We don't know why sanlock failed to write to the storage, but in [6] the kernel tell us that the nfs server is not responding. Since the nfs server is accessible from other machines, it means you have some issue with this host. Later the machine reboots [7], and nfs server is still not accessible. Then you have lot of WARN_ON call traces [8], that looks related to network code. We can see that you are not running most recent kernel [7]. We experienced various nfs issues during the 6.5 beta. I would try to get help from kernel folks about this. [1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833 [21345]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.0.24.1: _santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids [2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 check_our_lease failed 80 [3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 kill 19317 sig 15 count 1 [4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892 [14753]: dead 19317 ci 3 count 11 [5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave [6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding, timed out [7] Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started. 
Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 ( mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013 [8] Feb 18 18:29:53 hv5 kernel: [ cut here ] Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 skb_gso_segment+0x1df/0x2b0() (Not tainted) Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596 data_len=0 ip_summed=0 Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack xt_multi port ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64 aes_generic vhost_net macvtap macvlan tun kvm_ intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas scsi_transport_sas 3w_ sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xx x iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan] Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted 2.6.32-358.18.1.el6.x86_64 #1 Feb 18 18:29:53 hv5 kernel: Call Trace: Feb 18 18:29:53 hv5 kernel: IRQ [8106e3e7] ? warn_slowpath_common+0x87/0xc0 Feb 18 18:29:53 hv5 kernel: [8106e4d6] ? warn_slowpath_fmt+0x46/0x50 Feb 18 18:29:53 hv5 kernel: [a020bd62] ? igb_get_drvinfo+0x82/0xe0 [igb] Feb 18 18:29:53 hv5 kernel: [81448e7f] ? skb_gso_segment+0x1df/0x2b0 Feb 18 18:29:53 hv5 kernel: [81449260] ? dev_hard_start_xmit+0x1b0/0x530 Feb 18 18:29:53 hv5 kernel: [8146773a] ? sch_direct_xmit+0x15a/0x1c0 Feb 18 18:29:53 hv5 kernel: [8144d0c0] ? dev_queue_xmit+0x3b0/0x550 Feb 18 18:29:53 hv5 kernel: [a04af65c] ? br_dev_queue_push_xmit+0x6c/0xa0 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af6e8] ? br_forward_finish+0x58/0x60 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0 [bridge] Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110 Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70 [bridge] Feb 18 18:29:53 hv5 kernel: [a04b0609] ? br_handle_frame_finish+0x179/0x2a0 [bridge] Feb 18 18:29:53 hv5 kernel: [a04b08da] ? br_handle_frame
Re: [Users] Nodes lose storage at random
Been reinstalling to stocj CentOS 6.5 last night, all successful. Until roughly midnight GMT, 2 out of 4 hosts were showing the same errors. Any more suggestions? On Sat, Feb 22, 2014 at 8:57 PM, Nir Soffer nsof...@redhat.com wrote: - Original Message - From: Johan Kooijman m...@johankooijman.com To: Nir Soffer nsof...@redhat.com Cc: users users@ovirt.org Sent: Wednesday, February 19, 2014 2:34:36 PM Subject: Re: [Users] Nodes lose storage at random Messages: https://t-x.dignus.nl/messages.txt Sanlock: https://t-x.dignus.nl/sanlock.log.txt We can see in /var/log/messages, that sanlock failed to write to the ids lockspace [1], which after 80 seconds [2], caused vdsm to loose its host id lease. In this case, sanlock kill vdsm [3], which die after 11 retries [4]. Then vdsm is respawned again [5]. This is expected. We don't know why sanlock failed to write to the storage, but in [6] the kernel tell us that the nfs server is not responding. Since the nfs server is accessible from other machines, it means you have some issue with this host. Later the machine reboots [7], and nfs server is still not accessible. Then you have lot of WARN_ON call traces [8], that looks related to network code. We can see that you are not running most recent kernel [7]. We experienced various nfs issues during the 6.5 beta. I would try to get help from kernel folks about this. [1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833 [21345]: s2 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.0.24.1: _santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids [2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 check_our_lease failed 80 [3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 [14753]: s2 kill 19317 sig 15 count 1 [4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892 [14753]: dead 19317 ci 3 count 11 [5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning slave [6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding, timed out [7] Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started. 
Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 ( mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013 [8] Feb 18 18:29:53 hv5 kernel: [ cut here ] Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 skb_gso_segment+0x1df/0x2b0() (Not tainted) Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596 data_len=0 ip_summed=0 Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack xt_multi port ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64 aes_generic vhost_net macvtap macvlan tun kvm_ intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas scsi_transport_sas 3w_ sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xx x iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan] Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted 2.6.32-358.18.1.el6.x86_64 #1 Feb 18 18:29:53 hv5 kernel: Call Trace: Feb 18 18:29:53 hv5 kernel: IRQ [8106e3e7] ? warn_slowpath_common+0x87/0xc0 Feb 18 18:29:53 hv5 kernel: [8106e4d6] ? warn_slowpath_fmt+0x46/0x50 Feb 18 18:29:53 hv5 kernel: [a020bd62] ? igb_get_drvinfo+0x82/0xe0 [igb] Feb 18 18:29:53 hv5 kernel: [81448e7f] ? skb_gso_segment+0x1df/0x2b0 Feb 18 18:29:53 hv5 kernel: [81449260] ? dev_hard_start_xmit+0x1b0/0x530 Feb 18 18:29:53 hv5 kernel: [8146773a] ? sch_direct_xmit+0x15a/0x1c0 Feb 18 18:29:53 hv5 kernel: [8144d0c0] ? dev_queue_xmit+0x3b0/0x550 Feb 18 18:29:53 hv5 kernel: [a04af65c] ? br_dev_queue_push_xmit+0x6c/0xa0 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af6e8] ? br_forward_finish+0x58/0x60 [bridge] Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0 [bridge] Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110 Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70 [bridge] Feb 18 18:29:53 hv5 kernel: [a04b0609] ? br_handle_frame_finish+0x179/0x2a0 [bridge] Feb 18 18:29:53 hv5 kernel
Re: [Users] Nodes lose storage at random
Meital,

It's been 4 days since the last crash - but 5 minutes ago one of the nodes had the same issues. I've been running the script on the SPM as you mentioned. It turns out that at the time the node went down, the SPM didn't have more remoteFileHandler processes than before or after the crash - 29. I'm not sure what to make of this piece of information.

On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.com wrote:

Hi Johan,

Can you please run something like this on the spm node?

while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

When it happens again, please stop the script and write here the maximum number and the time at which it happened.

Also, please check if process_pool_max_slots_per_domain is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (If it's not defined there, the default is 10.)

Thanks!

--
From: Johan Kooijman m...@johankooijman.com
To: Meital Bourvine mbour...@redhat.com
Cc: users users@ovirt.org
Sent: Tuesday, February 18, 2014 2:55:11 PM
Subject: Re: [Users] Nodes lose storage at random

To follow up on this: the setup has only ~80 VMs active right now. The 2 bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (create/delete VMs), and there are only 4 directories in /rhev/data-center/mnt/.

On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.com wrote:

Meital,

I'm running the latest stable oVirt, 3.3.3 on CentOS 6.5. For my nodes I use the node ISO CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline, this error pops up. Could the fact that lockd and statd were not running on the NFS host cause this error? Is there a workaround available that we know of?
On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 -- *From: *Johan Kooijman m...@johankooijman.com *To: *users users@ovirt.org *Sent: *Tuesday, February 18, 2014 1:32:56 PM *Subject: *[Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage
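If the counting loop from the quoted instructions is still writing to /tmp/handler_num.txt, the peak can be pulled out of that file after the fact rather than eyeballed. A small sketch, assuming each line ends with the process count as produced by that loop:

# Print the highest remoteFileHandler count seen and the full line it occurred on.
awk '{ if ($NF + 0 > max) { max = $NF + 0; line = $0 } } END { print max; print line }' /tmp/handler_num.txt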
Re: [Users] Nodes lose storage at random
- Original Message -
From: Johan Kooijman m...@johankooijman.com
To: users users@ovirt.org
Sent: Tuesday, February 18, 2014 1:32:56 PM
Subject: [Users] Nodes lose storage at random

Hi All,

We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, seemingly at random, a node loses its connection to storage and recovers it a minute later. The other nodes usually don't lose their storage at that moment - just one, or two at a time. We've set up extra tooling to verify the storage performance at those moments and its availability for other systems. It's always online; just the nodes don't think so.

In the logs, we see that vdsm was restarted:

MainThread::DEBUG::2014-02-18 10:48:35,809::vdsm::45::vds::(sigtermHandler) Received signal 15

But we don't know why it happened. Please also attach /var/log/messages and /var/log/sanlock.log around the time that vdsm was restarted.

Thanks,
Nir
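One way to produce exactly that kind of excerpt is to slice both logs around the restart timestamp. A sketch, assuming the timestamp formats shown elsewhere in this thread (the 20-minute window is arbitrary):

# vdsm was restarted around 10:48 on Feb 18; grab roughly 10:40-10:59 from both logs.
grep -E '^Feb 18 10:(4|5)[0-9]' /var/log/messages > /tmp/messages-window.txt
grep -E '^2014-02-18 10:(4|5)[0-9]' /var/log/sanlock.log > /tmp/sanlock-window.txt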
Re: [Users] Nodes lose storage at random
Nir, Messages: https://t-x.dignus.nl/messages.txt Sanlock: https://t-x.dignus.nl/sanlock.log.txt Any input is more than welcome! On Wed, Feb 19, 2014 at 10:38 AM, Nir Soffer nsof...@redhat.com wrote: - Original Message - From: Johan Kooijman m...@johankooijman.com To: users users@ovirt.org Sent: Tuesday, February 18, 2014 1:32:56 PM Subject: [Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just the nodes don't think so. In the logs, we see that vdsm was restarted: MainThread::DEBUG::2014-02-18 10:48:35,809::vdsm::45::vds::(sigtermHandler) Received signal 15 But we don't know why it happened. Please attach also /var/log/messages and /var/log/sanlock.log around the time that vdsm was restarted. Thanks, Nir -- Met vriendelijke groeten / With kind regards, Johan Kooijman T +31(0) 6 43 44 45 27 F +31(0) 162 82 00 01 E m...@johankooijman.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users
Re: [Users] Nodes lose storage at random
Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 - Original Message - From: Johan Kooijman m...@johankooijman.com To: users users@ovirt.org Sent: Tuesday, February 18, 2014 1:32:56 PM Subject: [Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just the nodes don't think so. The engine tells me this: 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. 
vds: hv5 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing. 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.). The export and data domain live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issue at all. Attached are the logfiles for the relevant time period for both the engine server and the node. The node by the way, is a deployment of the node ISO, not a full blown installation. Any clues on where to begin searching? The NFS server shows no issues nor anything in the logs. I did notice that the statd and lockd daemons were not running, but I wonder if that can have anything to do with the issue. -- Met vriendelijke groeten / With kind regards, Johan Kooijman m...@johankooijman.com ___ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users ___ Users mailing list
Re: [Users] Nodes lose storage at random
Meital, I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6. I have no ways of reproducing just yet. I can confirm that it's happening on all nodes in the cluster. And every time a node goes offline, this error pops up. Could the fact that lockd statd were not running on the NFS host cause this error? Is there a workaround available that we know of? On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 -- *From: *Johan Kooijman m...@johankooijman.com *To: *users users@ovirt.org *Sent: *Tuesday, February 18, 2014 1:32:56 PM *Subject: *[Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just the nodes don't think so. 
The engine tells me this: 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing. 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error during communication with the Host.). The export and data domain live over NFS. There's another domain, ISO, that lives on the engine machine, also shared over NFS. That domain doesn't have any issue at all. Attached are the logfiles for the relevant time period for both the engine server and the node. The node by the way, is a deployment of
Re: [Users] Nodes lose storage at random
To follow up on this: The setup has only ~80 VM's active right now. The 2 bugreports are not in scope for this setup, the issues occur at random, even when there's no activity (create/delete VM's) and there are only 4 directories in /rhev/data-center/mnt/. On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.comwrote: Meital, I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6. I have no ways of reproducing just yet. I can confirm that it's happening on all nodes in the cluster. And every time a node goes offline, this error pops up. Could the fact that lockd statd were not running on the NFS host cause this error? Is there a workaround available that we know of? On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 -- *From: *Johan Kooijman m...@johankooijman.com *To: *users users@ovirt.org *Sent: *Tuesday, February 18, 2014 1:32:56 PM *Subject: *[Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). 
Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just the nodes don't think so. The engine tells me this: 2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5 2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds = 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing. 2014-02-18 11:48:45,070 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS. Setting Data Center status to Non Responsive (On host hv5, Error: Network error
Re: [Users] Nodes lose storage at random
Hi Johan, Can you please run something like this on the spm node? while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` /tmp/handler_num.txt; sleep 1; done When it'll happen again, please stop the script, and write here the maximum number and the time that it happened. Also, please check if process_pool_max_slots_per_domain is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined there, the default is 10) Thanks! - Original Message - From: Johan Kooijman m...@johankooijman.com To: Meital Bourvine mbour...@redhat.com Cc: users users@ovirt.org Sent: Tuesday, February 18, 2014 2:55:11 PM Subject: Re: [Users] Nodes lose storage at random To follow up on this: The setup has only ~80 VM's active right now. The 2 bugreports are not in scope for this setup, the issues occur at random, even when there's no activity (create/delete VM's) and there are only 4 directories in / rhev/data-center/mnt/. On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.com wrote: Meital, I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6. I have no ways of reproducing just yet. I can confirm that it's happening on all nodes in the cluster. And every time a node goes offline, this error pops up. Could the fact that lockd statd were not running on the NFS host cause this error? Is there a workaround available that we know of? On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.com wrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. 
Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 From: Johan Kooijman m...@johankooijman.com To: users users@ovirt.org Sent: Tuesday, February 18, 2014 1:32:56 PM Subject: [Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just
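To answer the configuration question above without guessing, something like this on the SPM node shows whether the option is set explicitly (the echoed fallback simply restates the default of 10 mentioned above):

grep -E '^[[:space:]]*process_pool_max_slots_per_domain' /etc/vdsm/vdsm.conf \
    || echo "process_pool_max_slots_per_domain not set - vdsm default (10) applies"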
Re: [Users] Nodes lose storage at random
Ok, will do. The process_pool_max_slots_per_domain is not defined, default node values. On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Can you please run something like this on the spm node? while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` /tmp/handler_num.txt; sleep 1; done When it'll happen again, please stop the script, and write here the maximum number and the time that it happened. Also, please check if process_pool_max_slots_per_domain is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined there, the default is 10) Thanks! -- *From: *Johan Kooijman m...@johankooijman.com *To: *Meital Bourvine mbour...@redhat.com *Cc: *users users@ovirt.org *Sent: *Tuesday, February 18, 2014 2:55:11 PM *Subject: *Re: [Users] Nodes lose storage at random To follow up on this: The setup has only ~80 VM's active right now. The 2 bugreports are not in scope for this setup, the issues occur at random, even when there's no activity (create/delete VM's) and there are only 4 directories in /rhev/data-center/mnt/. On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.comwrote: Meital, I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6. I have no ways of reproducing just yet. I can confirm that it's happening on all nodes in the cluster. And every time a node goes offline, this error pops up. Could the fact that lockd statd were not running on the NFS host cause this error? Is there a workaround available that we know of? On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 
10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 -- *From: *Johan Kooijman m...@johankooijman.com *To: *users users@ovirt.org *Sent: *Tuesday, February 18, 2014 1:32:56 PM *Subject: *[Users] Nodes lose storage at random Hi All, We're seeing some weird issues in our ovirt setup. We have 4 nodes connected and an NFS (v3) filestore (FreeBSD/ZFS). Once in a while, it seems at random, a node loses their connection to storage, recovers it a minute later. The other nodes usually don't lose their storage at that moment. Just one, or two at a time. We've setup extra tooling to verify the storage performance at those moments and the availability for other systems. It's always online, just the nodes don't think so. The engine tells me this: 2014-02-18 11:48:03,598 WARN
Re: [Users] Nodes lose storage at random
One other interesting fact is that each node has 4 NFS mountpoints: 2 (data and export) to the main SAN, 1 to the engine machine for ISO, and one to the legacy SAN. When this issue occurs, the only mountpoints in a problem state seem to be the 2 mountpoints to the main SAN:

2014-02-18 11:48:03,598 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in problem. vds: hv5
2014-02-18 11:48:18,909 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in problem. vds: hv5

On Tue, Feb 18, 2014 at 3:04 PM, Johan Kooijman m...@johankooijman.com wrote:

Ok, will do. The process_pool_max_slots_per_domain is not defined - default node values.

On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.com wrote:

Hi Johan,

Can you please run something like this on the spm node?

while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

When it happens again, please stop the script and write here the maximum number and the time at which it happened.

Also, please check if process_pool_max_slots_per_domain is defined in /etc/vdsm/vdsm.conf, and if so, what's the value? (If it's not defined there, the default is 10.)

Thanks!

--
From: Johan Kooijman m...@johankooijman.com
To: Meital Bourvine mbour...@redhat.com
Cc: users users@ovirt.org
Sent: Tuesday, February 18, 2014 2:55:11 PM
Subject: Re: [Users] Nodes lose storage at random

To follow up on this: the setup has only ~80 VMs active right now. The 2 bug reports are not in scope for this setup; the issues occur at random, even when there's no activity (create/delete VMs), and there are only 4 directories in /rhev/data-center/mnt/.

On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.com wrote:

Meital,

I'm running the latest stable oVirt, 3.3.3 on CentOS 6.5. For my nodes I use the node ISO CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

I have no way of reproducing it just yet. I can confirm that it's happening on all nodes in the cluster, and every time a node goes offline, this error pops up. Could the fact that lockd and statd were not running on the NFS host cause this error? Is there a workaround available that we know of?
On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote: Hi Johan, Please take a look at this error (from vdsm.log): Thread-636938::DEBUG::2014-02-18 10:48:06,374::task::579::TaskManager.Task::(_updateState) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state preparing Thread-636938::INFO::2014-02-18 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', spUUID='59980e09-b329-4254-b66e-790abd69e194', imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None) Thread-636938::ERROR::2014-02-18 10:48:06,376::task::850::TaskManager.Task::(_setError) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error Thread-636938::DEBUG::2014-02-18 10:48:06,415::task::869::TaskManager.Task::(_run) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', '59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::1194::TaskManager.Task::(stop) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force False) Thread-636938::DEBUG::2014-02-18 10:48:06,416::task::974::TaskManager.Task::(_decref) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True Thread-636938::INFO::2014-02-18 10:48:06,416::task::1151::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No free file handlers in pool' - code 100 Thread-636938::DEBUG::2014-02-18 10:48:06,417::task::1156::TaskManager.Task::(prepare) Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file handlers in pool And then you can see after a few seconds: MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan (2.6.32-358.18.1.el6.x86_64) Meaning that vdsm was restarted. Which oVirt version are you using? I see that there are a few old bugs that describes the same behaviour, but with different reproduction steps, for example [1], [2]. Can you think of any reproduction steps that might be causing this issue? [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 -- *From: *Johan
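Since only the two main-SAN mountpoints go into a problem state, a per-mount probe run during an incident can confirm that the ISO and legacy mounts stay responsive while the main SAN hangs. A sketch under the same assumptions as the probe earlier in the thread (mounts are discovered at run time; the 30-second timeout is arbitrary):

#!/bin/bash
# Time a stat() on every NFS mountpoint under the oVirt mount root and flag the ones that hang.
for m in /rhev/data-center/mnt/*; do
    [ -d "$m" ] || continue
    if timeout 30 stat "$m" >/dev/null 2>&1; then
        echo "$(date '+%F %T') OK   $m"
    else
        echo "$(date '+%F %T') HANG $m"
    fi
done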