Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-11 Thread Karan Singh
Thanks Sage, I will create a “new feature” request on http://tracker.ceph.com/ so that this discussion does not get buried in the mailing list. Developers can implement it at their convenience. Karan

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-10 Thread Sage Weil
On Tue, 10 Mar 2015, Christian Eichelmann wrote: Hi Sage, we hit this problem a few months ago as well and it took us quite a while to figure out what's wrong. As a system administrator I don't like the idea that daemons or even init scripts are changing system-wide configuration

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-10 Thread Christian Eichelmann
Hi Sage, we hit this problem a few months ago as well and it took us quite a while to figure out what's wrong. As a system administrator I don't like the idea that daemons or even init scripts are changing system-wide configuration parameters, so I wouldn't like to see the OSDs do it

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Mohamed Pakkeer
Hi Karan, We faced the same issue and resolved it after increasing the open file limit and the maximum number of threads. Config reference: in /etc/security/limits.conf: root hard nofile 65535; then: sysctl -w kernel.pid_max=4194303. http://tracker.ceph.com/issues/10554#change-47024 Cheers Mohamed Pakkeer On Mon, Mar
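
To check whether a node already meets the two limits quoted above, something like the following could be run on an OSD host. This is a minimal sketch and not part of Ceph: the helper names `meets_limit` and `read_pid_max` and the `WANTED_*` constants are hypothetical and only mirror the values from the message. It assumes a Linux host where `/proc/sys/kernel/pid_max` is readable.

```python
import resource

# Target values taken from the message above; adjust for your own cluster.
WANTED_NOFILE = 65535
WANTED_PID_MAX = 4194303


def meets_limit(current, wanted):
    """Return True if `current` is at least `wanted` (RLIM_INFINITY always passes)."""
    return current == resource.RLIM_INFINITY or current >= wanted


def read_pid_max(path="/proc/sys/kernel/pid_max"):
    """Read kernel.pid_max from procfs (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    # getrlimit reports the limits of the *current* process, not of a running OSD.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("nofile hard limit ok:", meets_limit(hard, WANTED_NOFILE))
    try:
        print("kernel.pid_max ok:", meets_limit(read_pid_max(), WANTED_PID_MAX))
    except OSError as exc:
        print("could not read kernel.pid_max:", exc)
```

The open-file check only reflects the user and session the script runs under, so it would need to run as the same user that starts the OSD daemons to be meaningful.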

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Christian Eichelmann
Hi Karan, as you are actually writing in your own book, the problem is the sysctl setting kernel.pid_max. I've seen in your bug report that you were setting it to 65536, which is still too low for high-density hardware. In our cluster, one OSD server has in an idle situation about 66,000 threads
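
On Linux every thread takes an ID from the same space that kernel.pid_max bounds, which is why a host with about 66,000 threads cannot fit under a limit of 65536. The comparison can be sketched as below. This is an illustration under stated assumptions, not a Ceph tool: the function names `count_threads` and `pid_headroom` are hypothetical, the count comes from listing `/proc/<pid>/task`, and the headroom figure is approximate because the kernel reserves some low IDs.

```python
import os


def count_threads(proc_root="/proc"):
    """Count all threads on the host by listing /proc/<pid>/task entries."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            total += len(os.listdir(os.path.join(proc_root, entry, "task")))
        except OSError:
            pass  # the process exited (or is inaccessible) while we were scanning
    return total


def pid_headroom(used, pid_max):
    """Return roughly how many more thread IDs fit below pid_max (never negative)."""
    return max(pid_max - used, 0)


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/pid_max") as f:
            pid_max = int(f.read())
        used = count_threads()
        print(f"threads={used} pid_max={pid_max} headroom={pid_headroom(used, pid_max)}")
    except OSError as exc:
        print("procfs not available:", exc)
```

With the figures from this thread, 66,000 threads leave no headroom under 65536 but plenty under 4194303.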

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Karan Singh
Thanks Guys kernel.pid_max=4194303 did the trick. - Karan - On 09 Mar 2015, at 14:48, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Karan, as you are actually writing in your own book, the problem is the sysctl setting kernel.pid_max. I've seen in your bug report that

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Nicheal
Umm.. Too many threads are created in SimpleMessenger: every pipe creates two working threads, one for sending and one for receiving messages. Thus, AsyncMessenger would be promising, but it is still in development. Regards Ning Yao 2015-03-09 20:48 GMT+08:00 Christian Eichelmann
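
The two-threads-per-pipe point lends itself to a back-of-the-envelope estimate. The sketch below assumes every pipe costs exactly two threads, as described in the message; the function name and the example figures (OSDs per host, pipes per OSD) are hypothetical and are not measurements from this cluster. Real OSDs also run other threads that are not counted here.

```python
THREADS_PER_PIPE = 2  # one sender and one receiver thread per pipe, per the message above


def messenger_threads(osds_per_host, pipes_per_osd):
    """Estimate messenger threads on one host: OSDs x pipes per OSD x 2."""
    return osds_per_host * pipes_per_osd * THREADS_PER_PIPE


if __name__ == "__main__":
    # Hypothetical host: 45 OSDs, each holding 800 open pipes.
    estimate = messenger_threads(45, 800)
    print(estimate, "messenger threads; above 65536:", estimate > 65536)
```

Even with modest per-OSD connection counts, a dense host can plausibly pass a pid_max of 65536 on messenger threads alone.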

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Tony Harris
I know I'm not even close to this type of a problem yet with my small cluster (both test and production clusters) - but it would be great if something like that could appear in the cluster HEALTH_WARN, if Ceph could determine the number of processes in use and compare it against the current limit
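
The check proposed here could look roughly like the following. This is only a sketch of the idea and not an existing Ceph health check: the function name `thread_limit_status` and the 80% warning threshold are assumptions made for illustration.

```python
def thread_limit_status(used, limit, warn_ratio=0.8):
    """Return (status, message); warn once usage reaches warn_ratio of the limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= warn_ratio:
        return "HEALTH_WARN", f"{used}/{limit} thread IDs in use ({ratio:.0%})"
    return "HEALTH_OK", ""


if __name__ == "__main__":
    # Figures from this thread: about 66,000 threads against the two pid_max values.
    print(thread_limit_status(66000, 65536))
    print(thread_limit_status(66000, 4194303))
```

Warning at a fraction of the limit rather than at the limit itself gives an operator time to raise kernel.pid_max before OSDs start failing to create threads.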

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Udo Lembke
Hi Tony, sounds like a good idea! Udo On 09.03.2015 21:55, Tony Harris wrote: I know I'm not even close to this type of a problem yet with my small cluster (both test and production clusters) - but it would be great if something like that could appear in the cluster HEALTH_WARN, if Ceph could

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Sage Weil
On Mon, 9 Mar 2015, Karan Singh wrote: Thanks Guys kernel.pid_max=4194303 did the trick. Great to hear! Sorry we missed that you only had it at 65536. This is a really common problem that people hit when their clusters start to grow. Is there somewhere in the docs we can put this to catch

[ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Karan Singh
Hello community, I need help fixing a long-running Ceph problem. The cluster is unhealthy and multiple OSDs are down. When I try to restart the OSDs I get this error: 2015-03-09 12:22:16.312774 7f760dac9700 -1 common/Thread.cc: In function 'void Thread::create(size_t)' thread 7f760dac9700 time

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Azad Aliyar
*Check Max Threadcount:* If you have a node with a lot of OSDs, you may be hitting the default maximum number of threads (e.g., usually 32k), especially during recovery. You can increase the number of threads using sysctl to see if increasing the maximum number of threads to the maximum possible

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Azad Aliyar
Great Karan. On Mon, Mar 9, 2015 at 9:32 PM, Karan Singh karan.si...@csc.fi wrote: Thanks Guys kernel.pid_max=4194303 did the trick. - Karan - On 09 Mar 2015, at 14:48, Christian Eichelmann christian.eichelm...@1und1.de wrote: Hi Karan, as you are actually writing in your own book,

Re: [ceph-users] Ceph BIG outage : 200+ OSD are down , OSD cannot create thread

2015-03-09 Thread Nicheal
2015-03-10 3:01 GMT+08:00 Sage Weil s...@newdream.net: On Mon, 9 Mar 2015, Karan Singh wrote: Thanks Guys kernel.pid_max=4194303 did the trick. Great to hear! Sorry we missed that you only had it at 65536. This is a really common problem that people hit when their clusters start to grow.