Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
On Wed, Oct 19, 2016 at 3:22 PM, Yoann Moulin wrote:
> Hello,
>
>>> We have a cluster in Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>>> composed of 12 nodes; each node has 10 OSDs with journals on disk.
>>>
>>> We have one rbd partition and a radosGW with 2 data pools,
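
[Editor's note: the "hit suicide timeout" in the subject refers to OSD worker threads
that stay stuck past an internal grace period, causing the OSD daemon to assert and
die. For reference, a minimal ceph.conf sketch of the Jewel-era options involved; the
values shown are assumed defaults, not settings recommended anywhere in this thread:]

    [osd]
    # An OSD op worker thread stuck longer than this many seconds
    # triggers an assert that kills the whole OSD daemon.
    osd_op_thread_suicide_timeout = 150
    # The same mechanism for FileStore op threads.
    filestore_op_thread_suicide_timeout = 180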

Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Hello,

>> We have a cluster in Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>> composed of 12 nodes; each node has 10 OSDs with journals on disk.
>>
>> We have one rbd partition and a radosGW with 2 data pools, one replicated,
>> one EC (8+2).
>>
>> Attached are a few details on our cluster.
>>

Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Christian Balzer
Hello, no specific ideas, but this sounds somewhat familiar. One thing first: you already stopped client traffic, but to make sure your cluster really becomes quiescent, stop all scrubs as well. That's always a good idea in any recovery or overload situation. Have you verified CPU load (are those
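
[Editor's note: stopping all scrubs cluster-wide is done with the noscrub flags. A
minimal sketch of the standard Ceph CLI commands implied here, to be unset again once
the cluster has recovered:]

    # Prevent new scrubs and deep scrubs from being scheduled.
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Re-enable scrubbing once the cluster is healthy again.
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub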

Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
Hi Yoann,

On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin wrote:
> Dear List,
>
> We have a cluster in Jewel 10.2.2 under Ubuntu 16.04. The cluster is
> composed of 12 nodes; each node has 10 OSDs with journals on disk.
>
> We have one rbd partition and a radosGW with 2 data

[ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Dear List,

We have a cluster in Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk.

We have one rbd partition and a radosGW with 2 data pools, one replicated, one EC (8+2). Attached are a few details on our cluster.

Currently, our
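
[Editor's note: an EC (8+2) radosGW data pool corresponds to an erasure-code profile
with k=8 data chunks and m=2 coding chunks. A minimal sketch of how such a pool is
typically created; the profile name, pool name, and PG count of 128 are hypothetical
and not taken from this thread:]

    # Define an 8+2 erasure-code profile: 8 data chunks, 2 coding chunks.
    ceph osd erasure-code-profile set ec-8-2 k=8 m=2

    # Create a radosGW data pool using that profile (PG count is illustrative).
    ceph osd pool create default.rgw.buckets.data 128 128 erasure ec-8-2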