Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2016-02-17 Thread Scottix
Looks like the bug with the kernel using Ceph and XFS was fixed. I haven't tested it yet but just wanted to give an update. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1527062 On Tue, Dec 8, 2015 at 8:05 AM Scottix wrote: > I can confirm it seems to be kernels greater than 3.16, we had

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Scottix
I can confirm it seems to be kernels greater than 3.16; we had this problem where servers would lock up and had to perform restarts on a weekly basis. We downgraded to 3.16, and since then we have not had to do any restarts. I did find this thread in the XFS forums and I am not sure if it has been fixed

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We run deep scrubs via cron with a script so we know when deep scrubs are happening, and we've seen nodes fail both during deep scrubbing and while no deep scrubs are occurring, so I'm pretty sure it's not related. On Tue, Dec 8, 2015 at 2:42 AM, Benedikt Fraunhofer wrote: > Hi Tom, > > 2015-12-0

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Mykola Dvornik
The same thing happens to my setup with CentOS7.x + non-stock kernel (kernel-ml from elrepo). I was not happy with the IOPS I got out of the stock CentOS7.x kernel, so I did the kernel upgrade and crashes started to happen, until some of the OSDs became non-bootable altogether. The funny thing is that I was no

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom, 2015-12-08 10:34 GMT+01:00 Tom Christensen : > We didn't go forward to 4.2 as it's a large production cluster, and we just > needed the problem fixed. We'll probably test out 4.2 in the next couple Unfortunately we don't have the luxury of a test cluster. And to add to that, we couldn't s

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We didn't go forward to 4.2 as it's a large production cluster, and we just needed the problem fixed. We'll probably test out 4.2 in the next couple months, but this one slipped past us as it didn't occur in our test cluster until after we had upgraded production. In our experience it takes about

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Benedikt Fraunhofer
Hi Tom, > We have been seeing this same behavior on a cluster that has been perfectly > happy until we upgraded to the ubuntu vivid 3.19 kernel. We are in the I can't recall when we gave 3.19 a shot, but now that you say it... The cluster was happy for >9 months with 3.16. Did you try 4.2 or do y

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Tom Christensen
We have been seeing this same behavior on a cluster that has been perfectly happy until we upgraded to the ubuntu vivid 3.19 kernel. We are in the process of "upgrading" back to the 3.16 kernel across our cluster as we've not seen this behavior on that kernel for over 6 months and we're pretty str

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Jan Schermer
> On 08 Dec 2015, at 08:57, Benedikt Fraunhofer wrote: > > Hi Jan, > >> Doesn't look near the limit currently (but I suppose you rebooted it in the >> meantime?). > > the box these numbers came from has an uptime of 13 days > so it's one of the boxes that did survive yesterday's half-cluster-wi

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan, > Doesn't look near the limit currently (but I suppose you rebooted it in the > meantime?). The box these numbers came from has an uptime of 13 days, so it's one of the boxes that did survive yesterday's half-cluster-wide reboot. > Did iostat say anything about the drives? (btw dm-1 and dm

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
Doesn't look near the limit currently (but I suppose you rebooted it in the meantime?). Did iostat say anything about the drives? (btw dm-1 and dm-6 are what? Are those your data drives?) - were they really overloaded? Jan > On 08 Dec 2015, at 08:41, Benedikt Fraunhofer wrote: > > Hi Jan, > >
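
A minimal sketch of how one could answer both questions on a box like this, assuming the sysstat package is installed and that the dm-N names belong to LVM or dm-crypt volumes:

  # map dm-1, dm-6, ... back to the underlying block devices
  lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT
  dmsetup ls --tree

  # extended per-device statistics every 5 seconds; sustained high %util and
  # await on the OSD data/journal devices would mean they really are overloaded
  iostat -x 5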

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan, we had 65k for pid_max, which made kernel.threads-max = 1030520, or kernel.threads-max = 256832 (looks like it depends on the number of cpus?). Currently we have: root@ceph1-store209:~# sysctl -a | grep -e thread -e pid kernel.cad_pid = 1 kernel.core_uses_pid = 0 kernel.ns_last_pid = 60298 k

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
And how many pids do you have currently? This should do it, I think: # ps axH | wc -l Jan > On 08 Dec 2015, at 08:26, Benedikt Fraunhofer wrote: > > Hi Jan, > > we initially had to bump it once we had more than 12 osds > per box. But it'll change that to the values you provided. > > Thx! > > Be
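
For reference, a quick way to compare the live task count against the configured ceilings (just a sketch; ps axH counts every thread of every process, which is what the pid_max limit applies to):

  ps axH | wc -l
  sysctl kernel.pid_max kernel.threads-max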

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hi Jan, we initially had to bump it once we had more than 12 osds per box. But we'll change that to the values you provided. Thx! Benedikt 2015-12-08 8:15 GMT+01:00 Jan Schermer : > What is the setting of sysctl kernel.pid_max? > You really need to have this: > kernel.pid_max = 4194304 > (I thi

Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Jan Schermer
What is the setting of sysctl kernel.pid_max? You really need to have this: kernel.pid_max = 4194304 (I think it also sets this as well: kernel.threads-max = 4194304) I think you are running out of process IDs. Jan > On 08 Dec 2015, at 08:10, Benedikt Fraunhofer wrote: > > Hello Cephers, > >
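
A minimal sketch of applying and persisting the values suggested above, assuming root and a distribution that reads /etc/sysctl.d; the file name below is purely illustrative:

  # apply immediately
  sysctl -w kernel.pid_max=4194304
  sysctl -w kernel.threads-max=4194304

  # persist across reboots
  printf '%s\n' 'kernel.pid_max = 4194304' 'kernel.threads-max = 4194304' \
      > /etc/sysctl.d/90-pid-max.conf
  sysctl --system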

[ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-07 Thread Benedikt Fraunhofer
Hello Cephers, lately our ceph cluster started to show some weird behavior: the osd boxes show a load of 5000-15000 before the osds get marked down. Usually the box is fully usable, even "apt-get dist-upgrade" runs smoothly, and you can read and write to any disk; the only things you can't do are strace
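
When a box gets into this state, one reasonable first check (a sketch only, assuming root and that magic sysrq is enabled) is whether a pile of tasks is stuck in uninterruptible sleep (D state) behind xfsaild, since blocked tasks inflate the load average without the CPUs actually being busy:

  # tasks in D state, with the kernel function they are waiting in
  ps -eo state,pid,ppid,wchan:32,comm | awk '$1 == "D"'

  # ask the kernel to log stack traces of all blocked tasks
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 100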