Re: [Users] Nodes lose storage at random

2014-03-12 Thread Johan Kooijman
Gents,

I'm sending this email for archiving purposes:
It's been a while since my last update on this topic. It turned out that
although only one node at a time, seemingly at random, lost its connection to
storage, the issue was not with oVirt at all, but with the storage.

I'd like to refer to these 2 topics for more information:
http://lists.freebsd.org/pipermail/freebsd-net/2014-March/038061.html and
http://lists.freebsd.org/pipermail/freebsd-net/2014-February/037967.html.

Keywords: ovirt freebsd 9.2 zfs ixgbe intel 10gbit NFS


On Mon, Feb 24, 2014 at 3:55 PM, Ronen Hod r...@redhat.com wrote:

 On 02/24/2014 11:48 AM, Nir Soffer wrote:

 - Original Message -

 From: Johan Kooijman m...@johankooijman.com
 To: Nir Soffer nsof...@redhat.com
 Cc: users users@ovirt.org
 Sent: Monday, February 24, 2014 2:45:59 AM
 Subject: Re: [Users] Nodes lose storage at random

 Interestingly enough - same thing happened today, around the same time.
 Logs from this host are attached.

 Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot
 the
 node and the node is fine again.

 So we have made some progress: we know that it is not a problem with the old
 kernel.

 In /var/log/messages we see the same picture:

 1. sanlock fails to renew the lease
 2. after 80 seconds, sanlock kills vdsm
 3. sanlock and vdsm cannot access the storage
 4. the kernel complains about nfs server timeouts
 (which explains why sanlock failed to renew the lease)
 5. after a reboot, nfs is accessible again
 6. after a few days, go to step 1

 This looks like a kernel nfs issue.

 It could also be a kvm issue (running bsd on one of the vms?)

 It could also be some incompatibility with the nfs server - maybe you are
 using esoteric configuration options?

 CCing Ronen, in case this is related to kvm.


 It does not seem to be related to KVM.
 Adding Ric Wheeler.

 Ronen.


 thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

 Nir





-- 
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-24 Thread Nir Soffer
- Original Message -
 From: Johan Kooijman m...@johankooijman.com
 To: Nir Soffer nsof...@redhat.com
 Cc: users users@ovirt.org
 Sent: Monday, February 24, 2014 2:45:59 AM
 Subject: Re: [Users] Nodes lose storage at random
 
 Interestingly enough - same thing happened today, around the same time.
 Logs from this host are attached.
 
 Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot the
 node and the node is fine again.

So we have made some progress: we know that it is not a problem with the old kernel.

In /var/log/messages we see the same picture:

1. sanlock fails to renew the lease
2. after 80 seconds, sanlock kills vdsm
3. sanlock and vdsm cannot access the storage
4. the kernel complains about nfs server timeouts
   (which explains why sanlock failed to renew the lease)
5. after a reboot, nfs is accessible again
6. after a few days, go to step 1

This looks like a kernel nfs issue.

It could also be a kvm issue (running bsd on one of the vms?)

It could also be some incompatibility with the nfs server - maybe you are using
esoteric configuration options?
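
A minimal way to check that on the host, assuming a standard CentOS 6 install
with nfs-utils, is to look at the mount options the client actually negotiated
with the server:

  # show each NFS mount with the options in effect
  nfsstat -m

  # or read the same information straight from the kernel's mount table
  grep ' nfs ' /proc/mounts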

CCing Ronen, in case this is related to kvm.

thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-24 Thread Ronen Hod

On 02/24/2014 11:48 AM, Nir Soffer wrote:

- Original Message -

From: Johan Kooijman m...@johankooijman.com
To: Nir Soffer nsof...@redhat.com
Cc: users users@ovirt.org
Sent: Monday, February 24, 2014 2:45:59 AM
Subject: Re: [Users] Nodes lose storage at random

Interestingly enough - same thing happened today, around the same time.
Logs from this host are attached.

Around 1:10 AM stuff starts to go wrong again. Same pattern - we reboot the
node and the node is fine again.

So we have made some progress: we know that it is not a problem with the old kernel.

In /var/log/messages we see the same picture:

1. sanlock fails to renew the lease
2. after 80 seconds, sanlock kills vdsm
3. sanlock and vdsm cannot access the storage
4. the kernel complains about nfs server timeouts
(which explains why sanlock failed to renew the lease)
5. after a reboot, nfs is accessible again
6. after a few days, go to step 1

This looks like a kernel nfs issue.

It could also be a kvm issue (running bsd on one of the vms?)

It could also be some incompatibility with the nfs server - maybe you are using
esoteric configuration options?

CCing Ronen, in case this is related to kvm.


It does not seem to be related to KVM.
Adding Ric Wheeler.

Ronen.



thread: http://lists.ovirt.org/pipermail/users/2014-February/021507.html

Nir


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-23 Thread Nir Soffer
- Original Message -
 From: Johan Kooijman m...@johankooijman.com
 To: Nir Soffer nsof...@redhat.com
 Cc: users users@ovirt.org
 Sent: Sunday, February 23, 2014 3:48:25 AM
 Subject: Re: [Users] Nodes lose storage at random
 
 Been reinstalling to stock CentOS 6.5 last night, all successful. Until
 roughly midnight GMT, 2 out of 4 hosts were showing the same errors.
 
 Any more suggestions?

Let's see the logs from these hosts.
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-22 Thread Nir Soffer
- Original Message -
 From: Johan Kooijman m...@johankooijman.com
 To: Nir Soffer nsof...@redhat.com
 Cc: users users@ovirt.org
 Sent: Wednesday, February 19, 2014 2:34:36 PM
 Subject: Re: [Users] Nodes lose storage at random
 
 Messages: https://t-x.dignus.nl/messages.txt
 Sanlock: https://t-x.dignus.nl/sanlock.log.txt

We can see in /var/log/messages that sanlock failed to write to
the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose
its host id lease. In this case, sanlock kills vdsm [3], which dies after 11
retries [4]. Then vdsm is respawned again [5]. This is expected.

We don't know why sanlock failed to write to the storage, but in [6] the
kernel tells us that the nfs server is not responding. Since the nfs server
is accessible from other machines, it means you have some issue with this host.

Later the machine reboots [7], and the nfs server is still not accessible. Then
you have a lot of WARN_ON call traces [8], which look related to the network code.

We can also see that you are not running the most recent kernel [7]. We
experienced various nfs issues during the 6.5 beta.

I would try to get help from the kernel folks about this.
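
For reference, the lines cited below can be pulled out of the log and lined up
by timestamp with something like this (a rough sketch, assuming the default
/var/log/messages location):

  # collect sanlock renewal failures, vdsm respawns and NFS timeouts in one view
  grep -E 'sanlock|respawn|nfs: server' /var/log/messages | less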

[1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833 
[21345]: s2 delta_renew read rv -202 offset 0 
/rhev/data-center/mnt/10.0.24.1:_santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids

[2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 
[14753]: s2 check_our_lease failed 80

[3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882 
[14753]: s2 kill 19317 sig 15 count 1

[4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892 
[14753]: dead 19317 ci 3 count 11

[5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died, respawning 
slave

[6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding, timed out

[7]
Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 
(mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 
4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013

[8]
Feb 18 18:29:53 hv5 kernel: [ cut here ]
Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759 
skb_gso_segment+0x1df/0x2b0() (Not tainted)
Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596 data_len=0 
ip_summed=0
Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache auth_rpcgss 
nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe libfcoe libfc 
scsi_transport_fc scsi_tgt
 lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter 
ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state 
nf_conntrack xt_multi
port ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc sha256_generic cbc 
cryptoloop dm_crypt aesni_intel cryptd aes_x86_64 aes_generic vhost_net macvtap 
macvlan tun kvm_
intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp 
dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci libsas 
scsi_transport_sas 3w_
sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash 
dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 
mdio libiscsi_tcp qla4xx
x iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted 
2.6.32-358.18.1.el6.x86_64 #1
Feb 18 18:29:53 hv5 kernel: Call Trace:
Feb 18 18:29:53 hv5 kernel: IRQ  [8106e3e7] ? 
warn_slowpath_common+0x87/0xc0
Feb 18 18:29:53 hv5 kernel: [8106e4d6] ? warn_slowpath_fmt+0x46/0x50
Feb 18 18:29:53 hv5 kernel: [a020bd62] ? igb_get_drvinfo+0x82/0xe0 
[igb]
Feb 18 18:29:53 hv5 kernel: [81448e7f] ? skb_gso_segment+0x1df/0x2b0
Feb 18 18:29:53 hv5 kernel: [81449260] ? 
dev_hard_start_xmit+0x1b0/0x530
Feb 18 18:29:53 hv5 kernel: [8146773a] ? sch_direct_xmit+0x15a/0x1c0
Feb 18 18:29:53 hv5 kernel: [8144d0c0] ? dev_queue_xmit+0x3b0/0x550
Feb 18 18:29:53 hv5 kernel: [a04af65c] ? 
br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
Feb 18 18:29:53 hv5 kernel: [a04af6e8] ? br_forward_finish+0x58/0x60 
[bridge]
Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0 
[bridge]
Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110
Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70 [bridge]
Feb 18 18:29:53 hv5 kernel: [a04b0609] ? 
br_handle_frame_finish+0x179/0x2a0 [bridge]
Feb 18 18:29:53 hv5 kernel: [a04b08da] ? br_handle_frame+0x1aa/0x250 
[bridge]
Feb 18 18:29:53 hv5 kernel: [a0331690] ? pit_timer_fn+0x0/0x80 [kvm]
Feb 18 18:29:53 hv5 kernel: [81448929] ? 
__netif_receive_skb+0x529/0x750
Feb 18 18:29:53 hv5 kernel: [81448bea] ? process_backlog+0x9a/0x100
Feb 18 18:29

Re: [Users] Nodes lose storage at random

2014-02-22 Thread Johan Kooijman
Thanks for looking into it. I've been running the oVirt ISO until now,
will switch to stock CentOS 6.5 to see if it makes a difference.


On Sat, Feb 22, 2014 at 8:57 PM, Nir Soffer nsof...@redhat.com wrote:

 - Original Message -
  From: Johan Kooijman m...@johankooijman.com
  To: Nir Soffer nsof...@redhat.com
  Cc: users users@ovirt.org
  Sent: Wednesday, February 19, 2014 2:34:36 PM
  Subject: Re: [Users] Nodes lose storage at random
 
  Messages: https://t-x.dignus.nl/messages.txt
  Sanlock: https://t-x.dignus.nl/sanlock.log.txt

 We can see in /var/log/messages that sanlock failed to write to
 the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose
 its host id lease. In this case, sanlock kills vdsm [3], which dies after 11
 retries [4]. Then vdsm is respawned again [5]. This is expected.

 We don't know why sanlock failed to write to the storage, but in [6] the
 kernel tells us that the nfs server is not responding. Since the nfs server
 is accessible from other machines, it means you have some issue with this
 host.

 Later the machine reboots [7], and the nfs server is still not accessible. Then
 you have a lot of WARN_ON call traces [8], which look related to the network
 code.

 We can also see that you are not running the most recent kernel [7]. We
 experienced various nfs issues during the 6.5 beta.

 I would try to get help from the kernel folks about this.

 [1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833
 [21345]: s2 delta_renew read rv -202 offset 0
 /rhev/data-center/mnt/10.0.24.1:
 _santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids

 [2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882
 [14753]: s2 check_our_lease failed 80

 [3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882
 [14753]: s2 kill 19317 sig 15 count 1

 [4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892
 [14753]: dead 19317 ci 3 count 11

 [5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died,
 respawning slave

 [6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding,
 timed out

 [7]
 Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
 Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (
 mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013

 [8]
 Feb 18 18:29:53 hv5 kernel: [ cut here ]
 Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759
 skb_gso_segment+0x1df/0x2b0() (Not tainted)
 Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
 Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596
 data_len=0 ip_summed=0
 Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache
 auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe
 libfcoe libfc scsi_transport_fc scsi_tgt
  lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6
 nf_defrag_ipv6 xt_state nf_conntrack xt_multi
 port ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc
 sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64
 aes_generic vhost_net macvtap macvlan tun kvm_
 intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp
 dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci
 libsas scsi_transport_sas 3w_
 sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash
 dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi
 cxgb3 mdio libiscsi_tcp qla4xx
 x iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded:
 scsi_wait_scan]
 Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted
 2.6.32-358.18.1.el6.x86_64 #1
 Feb 18 18:29:53 hv5 kernel: Call Trace:
 Feb 18 18:29:53 hv5 kernel: IRQ  [8106e3e7] ?
 warn_slowpath_common+0x87/0xc0
 Feb 18 18:29:53 hv5 kernel: [8106e4d6] ?
 warn_slowpath_fmt+0x46/0x50
 Feb 18 18:29:53 hv5 kernel: [a020bd62] ?
 igb_get_drvinfo+0x82/0xe0 [igb]
 Feb 18 18:29:53 hv5 kernel: [81448e7f] ?
 skb_gso_segment+0x1df/0x2b0
 Feb 18 18:29:53 hv5 kernel: [81449260] ?
 dev_hard_start_xmit+0x1b0/0x530
 Feb 18 18:29:53 hv5 kernel: [8146773a] ?
 sch_direct_xmit+0x15a/0x1c0
 Feb 18 18:29:53 hv5 kernel: [8144d0c0] ?
 dev_queue_xmit+0x3b0/0x550
 Feb 18 18:29:53 hv5 kernel: [a04af65c] ?
 br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04af6e8] ?
 br_forward_finish+0x58/0x60 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0
 [bridge]
 Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110
 Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70
 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04b0609] ?
 br_handle_frame_finish+0x179/0x2a0 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04b08da] ?
 br_handle_frame

Re: [Users] Nodes lose storage at random

2014-02-22 Thread Johan Kooijman
Been reinstalling to stock CentOS 6.5 last night, all successful. Until
roughly midnight GMT, 2 out of 4 hosts were showing the same errors.

Any more suggestions?


On Sat, Feb 22, 2014 at 8:57 PM, Nir Soffer nsof...@redhat.com wrote:

 - Original Message -
  From: Johan Kooijman m...@johankooijman.com
  To: Nir Soffer nsof...@redhat.com
  Cc: users users@ovirt.org
  Sent: Wednesday, February 19, 2014 2:34:36 PM
  Subject: Re: [Users] Nodes lose storage at random
 
  Messages: https://t-x.dignus.nl/messages.txt
  Sanlock: https://t-x.dignus.nl/sanlock.log.txt

 We can see in /var/log/messages that sanlock failed to write to
 the ids lockspace [1], which after 80 seconds [2] caused vdsm to lose
 its host id lease. In this case, sanlock kills vdsm [3], which dies after 11
 retries [4]. Then vdsm is respawned again [5]. This is expected.

 We don't know why sanlock failed to write to the storage, but in [6] the
 kernel tells us that the nfs server is not responding. Since the nfs server
 is accessible from other machines, it means you have some issue with this
 host.

 Later the machine reboots [7], and the nfs server is still not accessible. Then
 you have a lot of WARN_ON call traces [8], which look related to the network
 code.

 We can also see that you are not running the most recent kernel [7]. We
 experienced various nfs issues during the 6.5 beta.

 I would try to get help from the kernel folks about this.

 [1] Feb 18 10:47:46 hv5 sanlock[14753]: 2014-02-18 10:47:46+ 1251833
 [21345]: s2 delta_renew read rv -202 offset 0
 /rhev/data-center/mnt/10.0.24.1:
 _santank_ovirt-data/e9f70496-f181-4c9b-9ecb-d7f780772b04/dom_md/ids

 [2] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882
 [14753]: s2 check_our_lease failed 80

 [3] Feb 18 10:48:35 hv5 sanlock[14753]: 2014-02-18 10:48:35+ 1251882
 [14753]: s2 kill 19317 sig 15 count 1

 [4] Feb 18 10:48:45 hv5 sanlock[14753]: 2014-02-18 10:48:45+ 1251892
 [14753]: dead 19317 ci 3 count 11

 [5] Feb 18 10:48:45 hv5 respawn: slave '/usr/share/vdsm/vdsm' died,
 respawning slave

 [6] Feb 18 10:57:36 hv5 kernel: nfs: server 10.0.24.1 not responding,
 timed out

 [7]
 Feb 18 11:03:01 hv5 kernel: imklog 5.8.10, log source = /proc/kmsg started.
 Feb 18 11:03:01 hv5 kernel: Linux version 2.6.32-358.18.1.el6.x86_64 (
 mockbu...@c6b10.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
 4.4.7-3) (GCC) ) #1 SMP Wed Aug 28 17:19:38 UTC 2013

 [8]
 Feb 18 18:29:53 hv5 kernel: [ cut here ]
 Feb 18 18:29:53 hv5 kernel: WARNING: at net/core/dev.c:1759
 skb_gso_segment+0x1df/0x2b0() (Not tainted)
 Feb 18 18:29:53 hv5 kernel: Hardware name: X9DRW
 Feb 18 18:29:53 hv5 kernel: igb: caps=(0x12114bb3, 0x0) len=1596
 data_len=0 ip_summed=0
 Feb 18 18:29:53 hv5 kernel: Modules linked in: ebt_arp nfs fscache
 auth_rpcgss nfs_acl bonding softdog ebtable_nat ebtables bnx2fc fcoe
 libfcoe libfc scsi_transport_fc scsi_tgt
  lockd sunrpc bridge ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
 iptable_filter ip_tables xt_physdev ip6t_REJECT nf_conntrack_ipv6
 nf_defrag_ipv6 xt_state nf_conntrack xt_multi
 port ip6table_filter ip6_tables ext4 jbd2 8021q garp stp llc
 sha256_generic cbc cryptoloop dm_crypt aesni_intel cryptd aes_x86_64
 aes_generic vhost_net macvtap macvlan tun kvm_
 intel kvm sg sb_edac edac_core iTCO_wdt iTCO_vendor_support ioatdma shpchp
 dm_snapshot squashfs ext2 mbcache dm_round_robin sd_mod crc_t10dif isci
 libsas scsi_transport_sas 3w_
 sas ahci ixgbe igb dca ptp pps_core dm_multipath dm_mirror dm_region_hash
 dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi
 cxgb3 mdio libiscsi_tcp qla4xx
 x iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded:
 scsi_wait_scan]
 Feb 18 18:29:53 hv5 kernel: Pid: 5462, comm: vhost-5458 Not tainted
 2.6.32-358.18.1.el6.x86_64 #1
 Feb 18 18:29:53 hv5 kernel: Call Trace:
 Feb 18 18:29:53 hv5 kernel: IRQ  [8106e3e7] ?
 warn_slowpath_common+0x87/0xc0
 Feb 18 18:29:53 hv5 kernel: [8106e4d6] ?
 warn_slowpath_fmt+0x46/0x50
 Feb 18 18:29:53 hv5 kernel: [a020bd62] ?
 igb_get_drvinfo+0x82/0xe0 [igb]
 Feb 18 18:29:53 hv5 kernel: [81448e7f] ?
 skb_gso_segment+0x1df/0x2b0
 Feb 18 18:29:53 hv5 kernel: [81449260] ?
 dev_hard_start_xmit+0x1b0/0x530
 Feb 18 18:29:53 hv5 kernel: [8146773a] ?
 sch_direct_xmit+0x15a/0x1c0
 Feb 18 18:29:53 hv5 kernel: [8144d0c0] ?
 dev_queue_xmit+0x3b0/0x550
 Feb 18 18:29:53 hv5 kernel: [a04af65c] ?
 br_dev_queue_push_xmit+0x6c/0xa0 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04af6e8] ?
 br_forward_finish+0x58/0x60 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04af79a] ? __br_forward+0xaa/0xd0
 [bridge]
 Feb 18 18:29:53 hv5 kernel: [81474f34] ? nf_hook_slow+0x74/0x110
 Feb 18 18:29:53 hv5 kernel: [a04af81d] ? br_forward+0x5d/0x70
 [bridge]
 Feb 18 18:29:53 hv5 kernel: [a04b0609] ?
 br_handle_frame_finish+0x179/0x2a0 [bridge]
 Feb 18 18:29:53 hv5 kernel

Re: [Users] Nodes lose storage at random

2014-02-21 Thread Johan Kooijman
Meital,

It's been 4 days since the last crash - but 5 minutes ago one of the nodes
had the same issues. I've been running the script on the SPM as you
mentioned. It turns out that at the time the node went down, the SPM didn't
have more remoteFileHandler processes than before or after the crash - 29.

I'm not sure what to make of this piece of information.



On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Can you please run something like this on the spm node?
 while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

 When it'll happen again, please stop the script, and write here the
 maximum number and the time that it happened.

 Also, please check if process_pool_max_slots_per_domain is defined in
 /etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined
 there, the default is 10)

 Thanks!


 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *Meital Bourvine mbour...@redhat.com
 *Cc: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 2:55:11 PM
 *Subject: *Re: [Users] Nodes lose storage at random


 To follow up on this: The setup has only ~80 VM's active right now. The 2
 bugreports are not in scope for this setup, the issues occur at random,
 even when there's no activity (create/delete VM's) and there are only 4
 directories in /rhev/data-center/mnt/.



 On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.comwrote:

 Meital,

 I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
 use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

 I have no ways of reproducing just yet. I can confirm that it's happening
 on all nodes in the cluster. And every time a node goes offline, this error
 pops up.

 Could the fact that lockd and statd were not running on the NFS host cause
 this error? Is there a workaround available that we know of?


 On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Please take a look at this error (from vdsm.log):


 Thread-636938::DEBUG::2014-02-18 
 10:48:06,374::task::579::TaskManager.Task::(_updateState) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - 
 state preparing
 Thread-636938::INFO::2014-02-18 
 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
 getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 spUUID='59980e09-b329-4254-b66e-790abd69e194', 
 imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
 Thread-636938::ERROR::2014-02-18 
 10:48:06,376::task::850::TaskManager.Task::(_setError) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,415::task::869::TaskManager.Task::(_run) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
 f4ce9a6e-0292-4071-9a24-a8d8fba7222b 
 ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 '59980e09-b329-4254-b66e-790abd69e194', 
 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::1194::TaskManager.Task::(stop) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing 
 (force False)
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::974::TaskManager.Task::(_decref) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
 Thread-636938::INFO::2014-02-18 
 10:48:06,416::task::1151::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: 
 u'No free file handlers in pool' - code 100
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,417::task::1156::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file 
 handlers in pool



 And then you can see after a few seconds:

 MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 
 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
 (2.6.32-358.18.1.el6.x86_64)



 Meaning that vdsm was restarted.

 Which oVirt version are you using?
 I see that there are a few old bugs that describes the same behaviour, but 
 with different reproduction steps, for example [1], [2].


 Can you think of any reproduction steps that might be causing this issue?


 [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210


 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011



 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 1:32:56 PM
 *Subject: *[Users] Nodes lose storage at random


 Hi All,

 We're seeing some weird issues in our ovirt setup. We have 4 nodes
 connected and an NFS (v3) filestore (FreeBSD/ZFS).

 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage

Re: [Users] Nodes lose storage at random

2014-02-19 Thread Nir Soffer
- Original Message -
 From: Johan Kooijman m...@johankooijman.com
 To: users users@ovirt.org
 Sent: Tuesday, February 18, 2014 1:32:56 PM
 Subject: [Users] Nodes lose storage at random
 
 Hi All,
 
 We're seeing some weird issues in our ovirt setup. We have 4 nodes connected
 and an NFS (v3) filestore (FreeBSD/ZFS).
 
 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage at that moment. Just one, or two at a time.
 
 We've setup extra tooling to verify the storage performance at those moments
 and the availability for other systems. It's always online, just the nodes
 don't think so.

In the logs, we see that vdsm was restarted:
MainThread::DEBUG::2014-02-18 10:48:35,809::vdsm::45::vds::(sigtermHandler) 
Received signal 15

But we don't know why it happened.

Please attach also /var/log/messages and /var/log/sanlock.log around the time 
that
vdsm was restarted.

Thanks,
Nir
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-19 Thread Johan Kooijman
Nir,

Messages: https://t-x.dignus.nl/messages.txt
Sanlock: https://t-x.dignus.nl/sanlock.log.txt

Any input is more than welcome!


On Wed, Feb 19, 2014 at 10:38 AM, Nir Soffer nsof...@redhat.com wrote:

 - Original Message -
  From: Johan Kooijman m...@johankooijman.com
  To: users users@ovirt.org
  Sent: Tuesday, February 18, 2014 1:32:56 PM
  Subject: [Users] Nodes lose storage at random
 
  Hi All,
 
  We're seeing some weird issues in our ovirt setup. We have 4 nodes
 connected
  and an NFS (v3) filestore (FreeBSD/ZFS).
 
  Once in a while, it seems at random, a node loses their connection to
  storage, recovers it a minute later. The other nodes usually don't lose
  their storage at that moment. Just one, or two at a time.
 
  We've setup extra tooling to verify the storage performance at those
 moments
  and the availability for other systems. It's always online, just the
 nodes
  don't think so.

 In the logs, we see that vdsm was restarted:
 MainThread::DEBUG::2014-02-18
 10:48:35,809::vdsm::45::vds::(sigtermHandler) Received signal 15

 But we don't know why it happened.

 Please attach also /var/log/messages and /var/log/sanlock.log around the
 time that
 vdsm was restarted.

 Thanks,
 Nir




-- 
Met vriendelijke groeten / With kind regards,
Johan Kooijman

T +31(0) 6 43 44 45 27
F +31(0) 162 82 00 01
E m...@johankooijman.com
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [Users] Nodes lose storage at random

2014-02-18 Thread Meital Bourvine
Hi Johan, 

Please take a look at this error (from vdsm.log): 

Thread-636938::DEBUG::2014-02-18 
10:48:06,374::task::579::TaskManager.Task::(_updateState) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state 
preparing
Thread-636938::INFO::2014-02-18 
10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
spUUID='59980e09-b329-4254-b66e-790abd69e194', 
imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
Thread-636938::ERROR::2014-02-18 
10:48:06,376::task::850::TaskManager.Task::(_setError) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
Thread-636938::DEBUG::2014-02-18 
10:48:06,415::task::869::TaskManager.Task::(_run) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
'59980e09-b329-4254-b66e-790abd69e194', 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
'68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
Thread-636938::DEBUG::2014-02-18 
10:48:06,416::task::1194::TaskManager.Task::(stop) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing (force 
False)
Thread-636938::DEBUG::2014-02-18 
10:48:06,416::task::974::TaskManager.Task::(_decref) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
Thread-636938::INFO::2014-02-18 
10:48:06,416::task::1151::TaskManager.Task::(prepare) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No 
free file handlers in pool' - code 100
Thread-636938::DEBUG::2014-02-18 
10:48:06,417::task::1156::TaskManager.Task::(prepare) 
Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file 
handlers in pool 

And then you can see after a few seconds: 
MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) I 
am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
(2.6.32-358.18.1.el6.x86_64) 

Meaning that vdsm was restarted. 

Which oVirt version are you using? 
I see that there are a few old bugs that describe the same behaviour, but with 
different reproduction steps, for example [1], [2]. 
Can you think of any reproduction steps that might be causing this issue? 

[1] https://bugzilla.redhat.com/show_bug.cgi?id=948210 
[2] https://bugzilla.redhat.com/show_bug.cgi?id=853011 
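
As a rough first check of how often the handler pool is actually exhausted,
something like this could be run on the host (a sketch, assuming vdsm logs to
the default /var/log/vdsm/vdsm.log):

  # total number of pool-exhaustion errors
  grep -c 'No free file handlers in pool' /var/log/vdsm/vdsm.log

  # same errors grouped per hour (the timestamp is the third ::-separated field)
  grep 'No free file handlers in pool' /var/log/vdsm/vdsm.log \
    | awk -F'::' '{ split($3, t, ":"); print t[1] }' | sort | uniq -c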

- Original Message -

 From: Johan Kooijman m...@johankooijman.com
 To: users users@ovirt.org
 Sent: Tuesday, February 18, 2014 1:32:56 PM
 Subject: [Users] Nodes lose storage at random

 Hi All,

 We're seeing some weird issues in our ovirt setup. We have 4 nodes connected
 and an NFS (v3) filestore (FreeBSD/ZFS).

 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage at that moment. Just one, or two at a time.

 We've setup extra tooling to verify the storage performance at those moments
 and the availability for other systems. It's always online, just the nodes
 don't think so.

 The engine tells me this:

 2014-02-18 11:48:03,598 WARN
 [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in
 problem. vds: hv5
 2014-02-18 11:48:18,909 WARN
 [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in
 problem. vds: hv5
 2014-02-18 11:48:45,021 WARN [org.ovirt.engine.core.vdsbroker.VdsManager]
 (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds =
 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
 2014-02-18 11:48:45,070 INFO
 [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
 (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894, Call
 Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center GS.
 Setting Data Center status to Non Responsive (On host hv5, Error: Network
 error during communication with the Host.).

 The export and data domain live over NFS. There's another domain, ISO, that
 lives on the engine machine, also shared over NFS. That domain doesn't have
 any issue at all.

 Attached are the logfiles for the relevant time period for both the engine
 server and the node. The node by the way, is a deployment of the node ISO,
 not a full blown installation.

 Any clues on where to begin searching? The NFS server shows no issues nor
 anything in the logs. I did notice that the statd and lockd daemons were not
 running, but I wonder if that can have anything to do with the issue.

 --
 Met vriendelijke groeten / With kind regards,
 Johan Kooijman

 m...@johankooijman.com

 ___
 Users mailing list
 Users@ovirt.org
 http://lists.ovirt.org/mailman/listinfo/users
___
Users mailing list
Users

Re: [Users] Nodes lose storage at random

2014-02-18 Thread Johan Kooijman
Meital,

I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

I have no ways of reproducing just yet. I can confirm that it's happening
on all nodes in the cluster. And every time a node goes offline, this error
pops up.

Could the fact that lockd and statd were not running on the NFS host cause
this error? Is there a workaround available that we know of?
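
One way to sanity-check that part from a node is to ask the NFS server which
RPC services it has registered - a rough sketch, using the 10.0.24.1 NFS
server address that appears in the sanlock log excerpts elsewhere in this
thread:

  # nlockmgr is lockd, status is statd; both should be listed if they are running
  rpcinfo -p 10.0.24.1 | grep -E 'nfs|mountd|nlockmgr|status'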


On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Please take a look at this error (from vdsm.log):

 Thread-636938::DEBUG::2014-02-18 
 10:48:06,374::task::579::TaskManager.Task::(_updateState) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state 
 preparing
 Thread-636938::INFO::2014-02-18 
 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
 getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 spUUID='59980e09-b329-4254-b66e-790abd69e194', 
 imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
 Thread-636938::ERROR::2014-02-18 
 10:48:06,376::task::850::TaskManager.Task::(_setError) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,415::task::869::TaskManager.Task::(_run) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
 f4ce9a6e-0292-4071-9a24-a8d8fba7222b ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 '59980e09-b329-4254-b66e-790abd69e194', 
 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::1194::TaskManager.Task::(stop) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing 
 (force False)
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::974::TaskManager.Task::(_decref) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
 Thread-636938::INFO::2014-02-18 
 10:48:06,416::task::1151::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No 
 free file handlers in pool' - code 100
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,417::task::1156::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file 
 handlers in pool



 And then you can see after a few seconds:

 MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) 
 I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
 (2.6.32-358.18.1.el6.x86_64)


 Meaning that vdsm was restarted.

 Which oVirt version are you using?
 I see that there are a few old bugs that describes the same behaviour, but 
 with different reproduction steps, for example [1], [2].
 Can you think of any reproduction steps that might be causing this issue?


 [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011



 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 1:32:56 PM
 *Subject: *[Users] Nodes lose storage at random


 Hi All,

 We're seeing some weird issues in our ovirt setup. We have 4 nodes
 connected and an NFS (v3) filestore (FreeBSD/ZFS).

 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage at that moment. Just one, or two at a time.

 We've setup extra tooling to verify the storage performance at those
 moments and the availability for other systems. It's always online, just
 the nodes don't think so.

 The engine tells me this:

 2014-02-18 11:48:03,598 WARN
  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in
 problem. vds: hv5
 2014-02-18 11:48:18,909 WARN
  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in
 problem. vds: hv5
 2014-02-18 11:48:45,021 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager]
 (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds =
 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
 2014-02-18 11:48:45,070 INFO
  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
 (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894,
 Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data
 Center GS. Setting Data Center status to Non Responsive (On host hv5,
 Error: Network error during communication with the Host.).

 The export and data domain live over NFS. There's another domain, ISO,
 that lives on the engine machine, also shared over NFS. That domain doesn't
 have any issue at all.

 Attached are the logfiles for the relevant time period for both the engine
 server and the node. The node by the way, is a deployment

Re: [Users] Nodes lose storage at random

2014-02-18 Thread Johan Kooijman
To follow up on this: The setup has only ~80 VM's active right now. The 2
bugreports are not in scope for this setup, the issues occur at random,
even when there's no activity (create/delete VM's) and there are only 4
directories in /rhev/data-center/mnt/.



On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.comwrote:

 Meital,

 I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
 use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

 I have no ways of reproducing just yet. I can confirm that it's happening
 on all nodes in the cluster. And every time a node goes offline, this error
 pops up.

 Could the fact that lockd and statd were not running on the NFS host cause
 this error? Is there a workaround available that we know of?


 On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Please take a look at this error (from vdsm.log):

 Thread-636938::DEBUG::2014-02-18 
 10:48:06,374::task::579::TaskManager.Task::(_updateState) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - state 
 preparing
 Thread-636938::INFO::2014-02-18 
 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
 getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 spUUID='59980e09-b329-4254-b66e-790abd69e194', 
 imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
 Thread-636938::ERROR::2014-02-18 
 10:48:06,376::task::850::TaskManager.Task::(_setError) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,415::task::869::TaskManager.Task::(_run) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
 f4ce9a6e-0292-4071-9a24-a8d8fba7222b 
 ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 '59980e09-b329-4254-b66e-790abd69e194', 
 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::1194::TaskManager.Task::(stop) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing 
 (force False)
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::974::TaskManager.Task::(_decref) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
 Thread-636938::INFO::2014-02-18 
 10:48:06,416::task::1151::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: u'No 
 free file handlers in pool' - code 100
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,417::task::1156::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file 
 handlers in pool



 And then you can see after a few seconds:

 MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 1450) 
 I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
 (2.6.32-358.18.1.el6.x86_64)


 Meaning that vdsm was restarted.

 Which oVirt version are you using?
 I see that there are a few old bugs that describes the same behaviour, but 
 with different reproduction steps, for example [1], [2].

 Can you think of any reproduction steps that might be causing this issue?


 [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210

 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011



 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 1:32:56 PM
 *Subject: *[Users] Nodes lose storage at random


 Hi All,

 We're seeing some weird issues in our ovirt setup. We have 4 nodes
 connected and an NFS (v3) filestore (FreeBSD/ZFS).

 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage at that moment. Just one, or two at a time.

 We've setup extra tooling to verify the storage performance at those
 moments and the availability for other systems. It's always online, just
 the nodes don't think so.

 The engine tells me this:

 2014-02-18 11:48:03,598 WARN
  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in
 problem. vds: hv5
 2014-02-18 11:48:18,909 WARN
  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
 (pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in
 problem. vds: hv5
 2014-02-18 11:48:45,021 WARN
  [org.ovirt.engine.core.vdsbroker.VdsManager]
 (DefaultQuartzScheduler_Worker-18) [46683672] Failed to refresh VDS , vds =
 66e6aace-e51d-4006-bb2f-d85c2f1fd8d2 : hv5, VDS Network Error, continuing.
 2014-02-18 11:48:45,070 INFO
  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
 (DefaultQuartzScheduler_Worker-41) [2ef1a894] Correlation ID: 2ef1a894,
 Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data
 Center GS. Setting Data Center status to Non Responsive (On host hv5,
 Error: Network error

Re: [Users] Nodes lose storage at random

2014-02-18 Thread Meital Bourvine
Hi Johan, 

Can you please run something like this on the spm node? 
while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done 

When it'll happen again, please stop the script, and write here the maximum 
number and the time that it happened. 

Also, please check if process_pool_max_slots_per_domain is defined in 
/etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined there, 
the default is 10) 

Thanks! 
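
Once the loop has been stopped, the collected file can be summarized with a
small sketch like this (the handler count is the last field on each line):

  # print the line with the highest remotefilehandler count, including its timestamp
  awk '{ if ($NF+0 > max) { max = $NF+0; when = $0 } } END { print when }' /tmp/handler_num.txt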

- Original Message -

 From: Johan Kooijman m...@johankooijman.com
 To: Meital Bourvine mbour...@redhat.com
 Cc: users users@ovirt.org
 Sent: Tuesday, February 18, 2014 2:55:11 PM
 Subject: Re: [Users] Nodes lose storage at random

 To follow up on this: The setup has only ~80 VM's active right now. The 2
 bugreports are not in scope for this setup, the issues occur at random, even
 when there's no activity (create/delete VM's) and there are only 4
 directories in /rhev/data-center/mnt/.

 On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman  m...@johankooijman.com 
 wrote:

  Meital,
 

  I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
  use
  the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.
 

  I have no ways of reproducing just yet. I can confirm that it's happening
  on
  all nodes in the cluster. And every time a node goes offline, this error
  pops up.
 

   Could the fact that lockd and statd were not running on the NFS host cause
  this
  error? Is there a workaround available that we know of?
 

  On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine  mbour...@redhat.com 
  wrote:
 

   Hi Johan,
  
 

   Please take a look at this error (from vdsm.log):
  
 

   Thread-636938::DEBUG::2014-02-18
   10:48:06,374::task::579::TaskManager.Task::(_updateState)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init -
   state
   preparing
  
 
   Thread-636938::INFO::2014-02-18
   10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect:
   getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04',
   spUUID='59980e09-b329-4254-b66e-790abd69e194',
   imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb',
   volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
  
 
   Thread-636938::ERROR::2014-02-18
   10:48:06,376::task::850::TaskManager.Task::(_setError)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
  
 
   Thread-636938::DEBUG::2014-02-18
   10:48:06,415::task::869::TaskManager.Task::(_run)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run:
   f4ce9a6e-0292-4071-9a24-a8d8fba7222b
   ('e9f70496-f181-4c9b-9ecb-d7f780772b04',
   '59980e09-b329-4254-b66e-790abd69e194',
   'd50ecfbb-dc98-40cf-9b19-4bd402952aeb',
   '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
  
 
   Thread-636938::DEBUG::2014-02-18
   10:48:06,416::task::1194::TaskManager.Task::(stop)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing
   (force False)
  
 
   Thread-636938::DEBUG::2014-02-18
   10:48:06,416::task::974::TaskManager.Task::(_decref)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
  
 
   Thread-636938::INFO::2014-02-18
   10:48:06,416::task::1151::TaskManager.Task::(prepare)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted:
   u'No
   free file handlers in pool' - code 100
  
 
   Thread-636938::DEBUG::2014-02-18
   10:48:06,417::task::1156::TaskManager.Task::(prepare)
   Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free
   file
   handlers in pool
  
 

   And then you can see after a few seconds:
  
 
   MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID:
   1450)
   I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan
   (2.6.32-358.18.1.el6.x86_64)
  
 

   Meaning that vdsm was restarted.
  
 

   Which oVirt version are you using?
  
 
   I see that there are a few old bugs that describes the same behaviour,
   but
   with different reproduction steps, for example [1], [2].
  
 
   Can you think of any reproduction steps that might be causing this issue?
  
 

   [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210
  
 
   [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011
  
 

From: Johan Kooijman  m...@johankooijman.com 
   
  
 
To: users  users@ovirt.org 
   
  
 
Sent: Tuesday, February 18, 2014 1:32:56 PM
   
  
 
Subject: [Users] Nodes lose storage at random
   
  
 

Hi All,
   
  
 

We're seeing some weird issues in our ovirt setup. We have 4 nodes
connected
and an NFS (v3) filestore (FreeBSD/ZFS).
   
  
 

Once in a while, it seems at random, a node loses their connection to
storage, recovers it a minute later. The other nodes usually don't lose
their storage at that moment. Just one, or two at a time.
   
  
 

We've setup extra tooling to verify the storage performance at those
moments
and the availability for other systems. It's always online, just

Re: [Users] Nodes lose storage at random

2014-02-18 Thread Johan Kooijman
Ok, will do. process_pool_max_slots_per_domain is not defined, so the node is
using the default value.


On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Can you please run something like this on the spm node?
 while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

 When it'll happen again, please stop the script, and write here the
 maximum number and the time that it happened.

 Also, please check if process_pool_max_slots_per_domain is defined in
 /etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined
 there, the default is 10)

 Thanks!


 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *Meital Bourvine mbour...@redhat.com
 *Cc: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 2:55:11 PM
 *Subject: *Re: [Users] Nodes lose storage at random


 To follow up on this: The setup has only ~80 VM's active right now. The 2
 bugreports are not in scope for this setup, the issues occur at random,
 even when there's no activity (create/delete VM's) and there are only 4
 directories in /rhev/data-center/mnt/.



 On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman m...@johankooijman.comwrote:

 Meital,

 I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
 use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

 I have no ways of reproducing just yet. I can confirm that it's happening
 on all nodes in the cluster. And every time a node goes offline, this error
 pops up.

 Could the fact that lockd and statd were not running on the NFS host cause
 this error? Is there a workaround available that we know of?


 On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Please take a look at this error (from vdsm.log):


 Thread-636938::DEBUG::2014-02-18 
 10:48:06,374::task::579::TaskManager.Task::(_updateState) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - 
 state preparing
 Thread-636938::INFO::2014-02-18 
 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
 getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 spUUID='59980e09-b329-4254-b66e-790abd69e194', 
 imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
 Thread-636938::ERROR::2014-02-18 
 10:48:06,376::task::850::TaskManager.Task::(_setError) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,415::task::869::TaskManager.Task::(_run) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
 f4ce9a6e-0292-4071-9a24-a8d8fba7222b 
 ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 '59980e09-b329-4254-b66e-790abd69e194', 
 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::1194::TaskManager.Task::(stop) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing 
 (force False)
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::974::TaskManager.Task::(_decref) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
 Thread-636938::INFO::2014-02-18 
 10:48:06,416::task::1151::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: 
 u'No free file handlers in pool' - code 100
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,417::task::1156::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free file 
 handlers in pool



 And then you can see after a few seconds:

 MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 
 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
 (2.6.32-358.18.1.el6.x86_64)



 Meaning that vdsm was restarted.

 Which oVirt version are you using?
 I see that there are a few old bugs that describes the same behaviour, but 
 with different reproduction steps, for example [1], [2].


 Can you think of any reproduction steps that might be causing this issue?


 [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210


 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011



 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 1:32:56 PM
 *Subject: *[Users] Nodes lose storage at random


 Hi All,

 We're seeing some weird issues in our ovirt setup. We have 4 nodes
 connected and an NFS (v3) filestore (FreeBSD/ZFS).

 Once in a while, it seems at random, a node loses their connection to
 storage, recovers it a minute later. The other nodes usually don't lose
 their storage at that moment. Just one, or two at a time.

 We've setup extra tooling to verify the storage performance at those
 moments and the availability for other systems. It's always online, just
 the nodes don't think so.

 The engine tells me this:

 2014-02-18 11:48:03,598 WARN

Re: [Users] Nodes lose storage at random

2014-02-18 Thread Johan Kooijman
One other interesting fact is that each node has 4 NFS mountpoints: 2 (data
and export) to the main SAN, 1 to the engine machine for ISO and one to the
legacy SAN.

When this issue occurs, the only mountpoints in a problem state seem to be
the 2 mountpoints to the main SAN:

2014-02-18 11:48:03,598 WARN
 [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(pool-6-thread-48) domain d88764c8-ecc3-4f22-967e-2ce225ac4498:Export in
problem. vds: hv5
2014-02-18 11:48:18,909 WARN
 [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand]
(pool-6-thread-48) domain e9f70496-f181-4c9b-9ecb-d7f780772b04:Data in
problem. vds: hv5
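
A quick way to confirm which mountpoint belongs to which server is to map the
oVirt NFS mounts to their exports - a small sketch, reading the kernel's mount
table on the node:

  # print 'server:export -> mountpoint' for every domain mounted under /rhev
  grep '/rhev/data-center/mnt' /proc/mounts | awk '{ print $1, "->", $2 }'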


On Tue, Feb 18, 2014 at 3:04 PM, Johan Kooijman m...@johankooijman.comwrote:

 Ok, will do. process_pool_max_slots_per_domain is not defined, so the node is
 using the default value.


 On Tue, Feb 18, 2014 at 2:56 PM, Meital Bourvine mbour...@redhat.comwrote:

 Hi Johan,

 Can you please run something like this on the spm node?
  while true; do echo `date; ps ax | grep -i remotefilehandler | wc -l` >> /tmp/handler_num.txt; sleep 1; done

 When it'll happen again, please stop the script, and write here the
 maximum number and the time that it happened.

 Also, please check if process_pool_max_slots_per_domain is defined in
 /etc/vdsm/vdsm.conf, and if so, what's the value? (if it's not defined
 there, the default is 10)

 Thanks!


 --

 *From: *Johan Kooijman m...@johankooijman.com
 *To: *Meital Bourvine mbour...@redhat.com
 *Cc: *users users@ovirt.org
 *Sent: *Tuesday, February 18, 2014 2:55:11 PM
 *Subject: *Re: [Users] Nodes lose storage at random


 To follow up on this: The setup has only ~80 VM's active right now. The 2
 bugreports are not in scope for this setup, the issues occur at random,
 even when there's no activity (create/delete VM's) and there are only 4
 directories in /rhev/data-center/mnt/.



 On Tue, Feb 18, 2014 at 1:51 PM, Johan Kooijman 
 m...@johankooijman.comwrote:

 Meital,

 I'm running the latest stable oVirt, 3.3.3 on Centos 6.5. For my nodes I
 use the node iso CentOS 6 oVirt Node - 3.0.1 - 1.0.2.el6.

 I have no ways of reproducing just yet. I can confirm that it's
 happening on all nodes in the cluster. And every time a node goes offline,
 this error pops up.

 Could the fact that lockd and statd were not running on the NFS host cause
 this error? Is there a workaround available that we know of?


 On Tue, Feb 18, 2014 at 12:57 PM, Meital Bourvine 
 mbour...@redhat.comwrote:

 Hi Johan,

 Please take a look at this error (from vdsm.log):


 Thread-636938::DEBUG::2014-02-18 
 10:48:06,374::task::579::TaskManager.Task::(_updateState) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::moving from state init - 
 state preparing
 Thread-636938::INFO::2014-02-18 
 10:48:06,375::logUtils::44::dispatcher::(wrapper) Run and protect: 
 getVolumeSize(sdUUID='e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 spUUID='59980e09-b329-4254-b66e-790abd69e194', 
 imgUUID='d50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 volUUID='68fefe24-0346-4d0d-b377-ddd7be7be29c', options=None)
 Thread-636938::ERROR::2014-02-18 
 10:48:06,376::task::850::TaskManager.Task::(_setError) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Unexpected error
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,415::task::869::TaskManager.Task::(_run) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Task._run: 
 f4ce9a6e-0292-4071-9a24-a8d8fba7222b 
 ('e9f70496-f181-4c9b-9ecb-d7f780772b04', 
 '59980e09-b329-4254-b66e-790abd69e194', 
 'd50ecfbb-dc98-40cf-9b19-4bd402952aeb', 
 '68fefe24-0346-4d0d-b377-ddd7be7be29c') {} failed - stopping task
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::1194::TaskManager.Task::(stop) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::stopping in state preparing 
 (force False)
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,416::task::974::TaskManager.Task::(_decref) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::ref 1 aborting True
 Thread-636938::INFO::2014-02-18 
 10:48:06,416::task::1151::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::aborting: Task is aborted: 
 u'No free file handlers in pool' - code 100
 Thread-636938::DEBUG::2014-02-18 
 10:48:06,417::task::1156::TaskManager.Task::(prepare) 
 Task=`f4ce9a6e-0292-4071-9a24-a8d8fba7222b`::Prepare: aborted: No free 
 file handlers in pool



 And then you can see after a few seconds:

 MainThread::INFO::2014-02-18 10:48:45,258::vdsm::101::vds::(run) (PID: 
 1450) I am the actual vdsm 4.12.1-2.el6 hv5.ovirt.gs.cloud.lan 
 (2.6.32-358.18.1.el6.x86_64)




 Meaning that vdsm was restarted.

 Which oVirt version are you using?
 I see that there are a few old bugs that describes the same behaviour, but 
 with different reproduction steps, for example [1], [2].



 Can you think of any reproduction steps that might be causing this issue?


 [1] https://bugzilla.redhat.com/show_bug.cgi?id=948210



 [2] https://bugzilla.redhat.com/show_bug.cgi?id=853011



 --

 *From: *Johan