Re: [ceph-users] OSD had suicide timed out

2018-08-09 Thread Josef Zelenka
The only reason I can think of is some kind of network issue, even though different clusters run on the same switch with the same settings and we don't register any issues there. One thing I recall - one of my colleagues was testing something on this cluster, and after he
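
Even when switch graphs look clean, per-NIC error counters can reveal a quiet link problem. A minimal check on each OSD host (a sketch; eth0 stands in for the actual cluster interface):

    # RX/TX error and drop counters since boot
    ip -s link show dev eth0

    # Driver-level counters; field names vary by NIC driver
    ethtool -S eth0 | grep -iE 'err|drop|discard'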

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
If, in the above case, osd 13 was not too busy to respond (resource shortage), then you need to find out why else osd 5, etc. could not contact it. On Wed, Aug 8, 2018 at 6:47 PM, Josef Zelenka wrote: > Checked the system load on the host with the OSD that is suiciding currently > and it's fine,
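
One way to test the "too busy to respond" theory is the OSD admin socket on the host carrying osd.13 (a sketch; assumes the default admin socket location):

    # Operations currently queued or executing on osd.13
    ceph daemon osd.13 dump_ops_in_flight

    # Recently completed slow ops, with per-stage timings
    ceph daemon osd.13 dump_historic_ops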

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
Do you see "internal heartbeat not healthy" messages in the log of the osd that suicides? On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard wrote: > What is the load like on the osd host at the time and what does the > disk utilization look like? > > Also, what does the transaction look like from one

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Brad Hubbard
What is the load like on the osd host at the time, and what does the disk utilization look like? Also, what does the transaction look like from one of the osds that sends the "you died" message, with debug_osd 20 and debug_ms 1 enabled? On Wed, Aug 8, 2018 at 5:34 PM, Josef Zelenka wrote: > Thank
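
Both the disk-utilization question and the debug capture can be handled from the shell without restarting anything (a sketch; osd.5 stands in for whichever OSD sends the "you died" message):

    # Raise OSD and messenger debugging at runtime on a reporting peer
    ceph tell osd.5 injectargs '--debug-osd 20 --debug-ms 1'

    # Sample per-device utilization while the problem reproduces
    iostat -x 1 30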

Re: [ceph-users] OSD had suicide timed out

2018-08-08 Thread Josef Zelenka
Thank you for your suggestion - tried it, and it really seems like the other OSDs think this OSD is dead (if I understand this right); however, the networking seems absolutely fine between the nodes (no issues in graphs etc.).    -13> 2018-08-08 09:13:58.466119 7fe053d41700  1 -- 10.12.3.17:0/706864 <==
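
Graphs can miss short stalls, so a direct test between the OSD hosts is worth running. A minimal sketch using the address from the log above (assumes iperf3 is installed on both ends):

    # On the suspect OSD's host: start a throughput server
    iperf3 -s

    # From a peer OSD node: sustained throughput to 10.12.3.17
    iperf3 -c 10.12.3.17 -t 30

    # Latency and packet-loss spot check
    ping -c 100 -i 0.2 10.12.3.17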

Re: [ceph-users] OSD had suicide timed out

2018-08-07 Thread Brad Hubbard
Try to work out why the other osds are saying this one is down. Is it because this osd is too busy to respond, or something else? Setting debug_ms = 1 will show you some message debugging, which may help. On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka wrote: > To follow up, I did some further digging
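
To see which peers are reporting the OSD down, the cluster log on a monitor host records failure reports, and debug_ms can then be enabled on the affected OSD (a sketch; the "reported failed" message text is my assumption for this release, so verify against your own logs):

    # On a monitor host: which OSDs reported this one as failed
    grep "reported failed" /var/log/ceph/ceph.log

    # On the OSD's host: enable messenger debugging via the admin socket
    ceph daemon osd.0 config set debug_ms 1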

Re: [ceph-users] OSD had suicide timed out

2018-08-07 Thread Josef Zelenka
To follow up, I did some further digging with debug_osd=20/20, and it appears as if there's no traffic to the OSD, even though it comes up for the cluster (this started happening on another OSD in the cluster today, same stuff):    -27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560
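
If an OSD is marked up but receives no traffic, it is worth confirming which addresses the cluster has recorded for it and that the daemon is actually listening on them (a sketch; osd.0 as in the log above):

    # The public/cluster addresses the cluster map holds for osd.0
    ceph osd find 0

    # On the OSD's host: confirm the daemon's listening sockets
    ss -tlnp | grep ceph-osd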

[ceph-users] OSD had suicide timed out

2018-08-06 Thread Josef Zelenka
Hi, I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the configuration is 3 nodes, 6 drives each (though I have encountered this on a different cluster with similar hardware, only the drives were HDDs instead of SSDs - same usage). I have recently seen a bug(?) where one of the OSDs suddenly
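
For context: the suicide timeout fires when an OSD worker thread fails to check in with the daemon's internal heartbeat map for longer than the configured limit, at which point the OSD aborts itself. The thresholds involved can be inspected on a running OSD (a sketch; assumes the default admin socket and that the Luminous option names match your build):

    # Timeout-related settings currently in effect on osd.0
    ceph daemon osd.0 config show | grep -E 'suicide|heartbeat_grace'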