The only reason that I could think of is some kind of a network issue,
even though the different clusters run on the same switch with the same
settings and we don't register any issues there. One thing I recall -
one of my colleagues was testing something out on this cluster and after
he
If, in the above case, osd 13 was not too busy to respond (a resource
shortage), then you need to find out what else could have prevented
osd 5, etc. from contacting it.
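One way to narrow that down is to grep the peer's log for failed heartbeats (a sketch, assuming the default Ceph log location; the osd ids 5 and 13 are just the ones mentioned in this thread):

```shell
# Assumption: default log path /var/log/ceph/; osd ids taken from this thread.
# On the host carrying osd.5, look for heartbeats to osd.13 that went
# unanswered before the "you died" verdict:
grep "heartbeat_check: no reply from" /var/log/ceph/ceph-osd.5.log | grep "osd.13"
```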
On Wed, Aug 8, 2018 at 6:47 PM, Josef Zelenka wrote:
> Checked the system load on the host with the OSD that is suiciding currently
> and it's fine,
Do you see "internal heartbeat not healthy" messages in the log of the
osd that suicides?
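If the log is large, a quick grep can confirm whether that message appears (a sketch; the osd id and log path are assumptions, adjust for your cluster):

```shell
# Assumption: osd.13 is the suiciding osd; adjust the id and log path.
grep -i "internal heartbeat not healthy" /var/log/ceph/ceph-osd.13.log
# The related "had timed out" lines name the internal thread that stalled:
grep "had timed out" /var/log/ceph/ceph-osd.13.log | tail -n 20
```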
On Wed, Aug 8, 2018 at 5:45 PM, Brad Hubbard wrote:
What is the load like on the osd host at the time and what does the
disk utilization look like?
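A minimal way to capture that on the osd host while the problem reproduces (assuming the sysstat package, which provides iostat, is installed):

```shell
uptime            # 1/5/15-minute load averages
iostat -x 1 10    # extended per-device stats: %util, await, queue sizes
```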
Also, what does the transaction look like from one of the osds that
sends the "you died" message with debugging osd 20 and ms 1 enabled?
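Those debug levels can be raised at runtime without restarting the daemon (a sketch; osd.5 is just an example id for one of the peers sending the message):

```shell
# Raise debugging on one of the reporting peers:
ceph tell osd.5 injectargs '--debug_osd 20 --debug_ms 1'
# ...reproduce the failure, save the log, then drop back to the defaults:
ceph tell osd.5 injectargs '--debug_osd 1/5 --debug_ms 0/5'
```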
On Wed, Aug 8, 2018 at 5:34 PM, Josef Zelenka wrote:
Thank you for your suggestion. I tried it, and it really seems like the
other osds think the osd is dead (if I understand this right); however,
the networking seems absolutely fine between the nodes (no issues in
graphs etc).
-13> 2018-08-08 09:13:58.466119 7fe053d41700 1 -- 10.12.3.17:0/706864 <==
Try to work out why the other osds are saying this one is down. Is it
because this osd is too busy to respond, or is it something else?
debug_ms = 1 will show you some message debugging, which may help.
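To keep that setting across restarts, it can also go into ceph.conf (a sketch; putting it in the [osd] section applies it to every osd on the host, and the osd id below is an assumption):

```shell
# Persistent alternative to runtime injection: add to /etc/ceph/ceph.conf
#   [osd]
#   debug ms = 1
# then restart the affected daemon (osd id is an assumption):
systemctl restart ceph-osd@13
```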
On Tue, Aug 7, 2018 at 10:34 PM, Josef Zelenka wrote:
To follow up, I did some further digging with debug_osd=20/20 and it
appears as if there's no traffic to the OSD, even though it comes UP for
the cluster (this started happening on another OSD in the cluster today,
same stuff):
-27> 2018-08-07 14:10:55.146531 7f9fce3cd700 10 osd.0 12560
Hi,
I'm running a cluster on Luminous (12.2.5), Ubuntu 16.04 - the
configuration is 3 nodes, 6 drives each (though I have encountered this
on a different cluster, similar hardware, only the drives were HDDs
instead of SSDs - same usage). I have recently seen a bug(?) where one
of the OSDs suddenly