Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> Thanks for the log dump command, I'll keep that in the back pocket, it
> would have been helpful in a few situations.
>
> I'm trying to microbenchmark the new Weighted Round Robin queue I've
> been working on and just trying to dump the info to the logs

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
Thanks for the log dump command, I'll keep that in the back pocket, it would have been helpful in a few situations. I'm trying to microbenchmark the new Weighted Round Robin queue I've been working on and just trying to dump the info to the logs so that I can see it at runtime. So this is in a bra

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
I saw posts about that in the mailing lists. According to SAR, there wasn't an abnormal amount of page faults. We have swap disabled and have min_free_kbytes set to 6GB which has worked well for us so far. We kicked around still setting swappiness to

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> Is there a way through the admin socket or inject args that can tell
> the OSD process to dump the in memory logs without crashing? Do you
Yep, 'ceph daemon osd.NN log dump'.
> have an idea of the

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
Is there a way through the admin socket or inject args that can tell the OSD process to dump the in memory logs without crashing? Do you have an idea of the overhead? From the code it looks like it is always evaluated, just depends on if it is stored

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Mark Nelson
FWIW, if you've got collectl per-process logs, you might look for major pagefaults associated with the osd processes. I've seen process swapping cause heartbeat timeouts in the past. Not to say that's the issue, but worth confirming it's not happening.
Mark
On 11/23/2015 01:03 PM, Robert Le

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Sage Weil
On Mon, 23 Nov 2015, Robert LeBlanc wrote:
> We set the debugging to 0/0, but are you talking about lines like:
>
>    -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793
> heartbeat_check: no reply from osd.133 since back 2015-11-2

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
We set the debugging to 0/0, but are you talking about lines like: -12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793 heartbeat_check: no reply from osd.133 since back 2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156 (cutof
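A heartbeat_check line like the one quoted above can be turned into a number: how long the reporting OSD had gone without a reply from its peer when the line was logged. The following is a minimal Python sketch, not part of the original thread. The regular expression is an assumption based only on the fields visible in the truncated line (anything after the "front" timestamp is ignored), and the name parse_heartbeat_check is hypothetical.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the quoted log line.
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+"
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Assumed layout: <logged> <thread id> <level> <reporter> <epoch>
# heartbeat_check: no reply from <peer> since back <ts> front <ts>
HEARTBEAT_RE = re.compile(
    r"(?P<logged>" + TS + r") \S+ +-?\d+ (?P<reporter>osd\.\d+) \d+ "
    r"heartbeat_check: no reply from (?P<peer>osd\.\d+) "
    r"since back (?P<back>" + TS + r") front (?P<front>" + TS + r")"
)


def parse_heartbeat_check(line):
    """Return reporter, peer and seconds since the last back/front reply,
    or None if the line is not a heartbeat_check message."""
    match = HEARTBEAT_RE.search(line)
    if match is None:
        return None
    logged = datetime.strptime(match.group("logged"), TS_FORMAT)
    back = datetime.strptime(match.group("back"), TS_FORMAT)
    front = datetime.strptime(match.group("front"), TS_FORMAT)
    return {
        "reporter": match.group("reporter"),
        "peer": match.group("peer"),
        "back_age": (logged - back).total_seconds(),
        "front_age": (logged - front).total_seconds(),
    }


# The line from the message above, up to the point where the archive cuts it off.
SAMPLE = (
    "-12> 2015-11-20 20:59:47.138746 7f70067de700 -1 osd.177 103793 "
    "heartbeat_check: no reply from osd.133 since back "
    "2015-11-20 20:57:32.413156 front 2015-11-20 20:57:32.413156"
)

if __name__ == "__main__":
    print(parse_heartbeat_check(SAMPLE))
```

For the sample line, both ages come out at roughly 134.7 seconds, i.e. osd.177 had heard nothing from osd.133 for a little over two minutes when it logged the message.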

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Gregory Farnum
On Mon, Nov 23, 2015 at 12:03 PM, Robert LeBlanc wrote:
> This is one of our production clusters which is dual 40 Gb Ethernet
> using VLANs for cluster and public networks. I don't think this is
> unusual, not like my dev cluster which runs Inf

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
This is one of our production clusters which is dual 40 Gb Ethernet using VLANs for cluster and public networks. I don't think this is unusual, not like my dev cluster which runs Infiniband and IPoIB. The client nodes are connected at 10 Gb Ethernet.

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Gregory Farnum
On Mon, Nov 23, 2015 at 11:27 AM, Robert LeBlanc wrote:
> I checked the SAR data and the disks for all the OSDs showed usual
> performance until 20:57:32 when over the next few minutes the IOPS,
> bandwidth and latency all decreased. The only

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Robert LeBlanc
I checked the SAR data and the disks for all the OSDs showed usual performance until 20:57:32 when over the next few minutes the IOPS, bandwidth and latency all decreased. The only thing that I can think of is that some replies to the client got hun

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Gregory Farnum
On Mon, Nov 23, 2015 at 11:03 AM, Robert LeBlanc wrote:
> The backtrace is:
>
> 2015-11-20 20:59:48.856679 7f7012ff7700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
> const char*, time_t)'

Re: Multiple OSDs suicide because of client issues?

2015-11-23 Thread Gregory Farnum
On Sat, Nov 21, 2015 at 1:34 AM, Robert LeBlanc wrote:
> We had two interesting issues today. In both cases multiple OSDs
> suicided at the exact same moment. The first incident had four OSDs,
> the second had 12.
>
> First set:
> 145,159,79,17

Multiple OSDs suicide because of client issues?

2015-11-20 Thread Robert LeBlanc
We had two interesting issues today. In both cases multiple OSDs suicided at the exact same moment. The first incident had four OSDs, the second had 12. First set: 145,159,79,176 Second Set: osd.177 down at 20:59:48, osd.131, osd.136, osd.133, osd.