Re: [ceph-users] Potential OSD deadlock?

2015-10-16 Thread Max A. Krasilnikov
Hello! On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote: > Have you tried running iperf between the nodes? Capturing a pcap of the > (failing) Ceph comms from both sides could help narrow it down. > Is there any SDN layer involved that could add overhead/padding to the frames? > What about

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 It seems in our situation the cluster is just busy, usually with really small RBD I/O. We have gotten things to where it doesn't happen as much in a steady state, but when we have an OSD fail (mostly from an XFS log bug we hit at least once a week),

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Sage Weil
On Wed, 14 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > It seems in our situation the cluster is just busy, usually with > really small RBD I/O. We have gotten things to where it doesn't happen > as much in a steady state, but when we have an OSD fail

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I'm sure I have a log of a 1,000 second block somewhere, I'll have to look around for it. I'll try turning that knob and see what happens. I'll come back with the results. Thanks, - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4

Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Haomai Wang
On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil wrote: > On Mon, 12 Oct 2015, Robert LeBlanc wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA256 >> >> After a weekend, I'm ready to hit this from a different direction. >> >> I replicated the issue with Firefly so it doesn't

Re: [ceph-users] Potential OSD deadlock?

2015-10-13 Thread Sage Weil
On Mon, 12 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > After a weekend, I'm ready to hit this from a different direction. > > I replicated the issue with Firefly so it doesn't seem an issue that > has been introduced or resolved in any nearby version.

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Are there any errors on the NICs? (ethtool -s ethX) Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled? We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to
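Jan's two checks can be sketched as a short script. The interface name `eth0` and the counter values below are illustrative, not from the thread; to keep the sketch runnable it filters a captured sample of `ethtool -S` output rather than a live NIC.

```shell
# On a live node you would run:
#   ethtool -S eth0 | grep -Ei 'err|drop|pause'   # NIC error/pause counters
#   ethtool -a eth0                               # flow-control (pause) settings
# Self-contained stand-in: filter sample "ethtool -S" output for the
# counters that matter when hunting congestion (errors, drops, pause frames).
sample_stats() {
cat <<'EOF'
     rx_packets: 184903757
     tx_packets: 191482392
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 12
     tx_flow_control_xon: 337
     tx_flow_control_xoff: 341
EOF
}
sample_stats | grep -Ei 'err|drop|xon|xoff'
```

Non-zero xon/xoff counters are the symptom Jan describes: with flow control enabled, one congested path can pause all traffic on the port.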

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello! On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > Sage, > After trying to bisect this issue (all test moved the bisect towards > Infernalis) and eventually testing the Infernalis branch again, it > looks like the problem still

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Dzianis Kahanovich
Additional issues about Intel NICs: some of them (I*GB series, not e1000e) are multiqueue. Default qdisc - "mq", not "pfifo_fast". I have half of the cluster with e1000e and half - IGB (every - 2x with bonding+bridge, no jumbo, txqueuelen 2000). So, on my MQ NICs irqbalance produces massive network

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello! On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote: > Are there any errors on the NICs? (ethtool -s ethX) No errors. Neither on nodes, nor on switches. > Also take a look at the switch and look for flow control statistics - do you > have flow control enabled or disabled? flow control

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down. Is there any SDN layer involved that could add overhead/padding to the frames? What about some intermediate MTU like 8000 - does that work? Oh and if there's
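The intermediate-MTU probe Jan suggests can be sketched like this. The 8000-byte MTU is from the thread; the 28-byte overhead is the IPv4 header (20) plus the ICMP header (8), and `cephnode2` is a hypothetical peer name.

```shell
# Largest ICMP payload that fits in a given MTU without fragmenting:
mtu=8000
payload=$((mtu - 28))   # 20-byte IPv4 header + 8-byte ICMP header
echo "ping payload for MTU ${mtu}: ${payload}"
# On a live pair of nodes you would then run:
#   ping -M do -s "${payload}" cephnode2   # -M do sets DF, so oversize packets fail loudly
#   iperf -s                               # on one node
#   iperf -c cephnode2 -P 4 -t 60          # on the other, 4 parallel streams
```

If the DF-bit ping fails at a payload the MTU should allow, something in the path (an SDN layer, encapsulation, or a mis-set switch port) is eating frame headroom.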

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Max A. Krasilnikov
Hello! On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote: > Have you tried running iperf between the nodes? Capturing a pcap of the > (failing) Ceph comms from both sides could help narrow it down. > Is there any SDN layer involved that could add overhead/padding to the frames? No

Re: [ceph-users] Potential OSD deadlock?

2015-10-08 Thread Dzianis Kahanovich
I have probably similar situation on latest hammer & 4.1+ kernels on spinning OSDs (journal - leased partition on same HDD): eventual slow requests, etc. Try: 1) even on leased partition journal - "journal aio = false"; 2) single-queue "noop" scheduler (OSDs); 3) reduce nr_requests to 32 (OSDs);
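Dzianis's three knobs can be sketched as follows. The disk name `sdb` is a placeholder, and the `dry_run` guard is added here so the sketch runs without root or real hardware; on an OSD node you would set `dry_run=0`.

```shell
dry_run=1
apply() {   # apply <value> <sysfs-file>: write a tunable, or just report it
  if [ "$dry_run" -eq 1 ]; then
    echo "would write '$1' to $2"
  else
    echo "$1" > "$2"
  fi
}
apply noop /sys/block/sdb/queue/scheduler     # 2) single-queue "noop" scheduler
apply 32   /sys/block/sdb/queue/nr_requests   # 3) shrink the request queue to 32
# 1) goes in ceph.conf instead, [osd] section:
#      journal aio = false
```

These only make sense for spinners with the journal on the same HDD, as in Dzianis's setup; on SSD-journal clusters the defaults are usually the better choice.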

Re: [ceph-users] Potential OSD deadlock?

2015-10-08 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Sage, After trying to bisect this issue (all test moved the bisect towards Infernalis) and eventually testing the Infernalis branch again, it looks like the problem still exists although it is handled a tad better in Infernalis. I'm going to test

Re: [ceph-users] Potential OSD deadlock?

2015-10-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 We forgot to upload the ceph.log yesterday. It is there now. - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc wrote: > -BEGIN PGP SIGNED

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote: > Thanks for your time Sage. It sounds like a few people may be helped if you > can find something. > > I did a recursive chown as in the instructions (although I didn't know about > the doc at the time). I did an osd debug at 20/20 but didn't see

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
Thanks for your time Sage. It sounds like a few people may be helped if you can find something. I did a recursive chown as in the instructions (although I didn't know about the doc at the time). I did an osd debug at 20/20 but didn't see anything. I'll also do ms and make the logs available. I'll

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 This was from the monitor (can't bring it up with Hammer now, complete cluster is down, this is only my lab, so no urgency). I got it up and running this way: 1. Upgrade the mon node to Infernalis and started the mon. 2. Downgraded the OSDs to

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d > (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got > messages when the OSD was marked out: > > 2015-10-06

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > I can't think of anything. In my dev cluster the only thing that has > changed is the Ceph versions (no reboot). What I like is even though > the disks are 100% utilized, it is preforming as I expect

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got messages when the OSD was marked out: 2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 : cluster

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I can't think of anything. In my dev cluster the only thing that has changed is the Ceph versions (no reboot). What I like is even though the disks are 100% utilized, it is performing as I expect now. Client I/O is slightly degraded during the

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 On my second test (a much longer one), it took nearly an hour, but a few messages have popped up over a 20 window. Still far less than I have been seeing. - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I'll capture another set of logs. Is there any other debugging you want turned up? I've seen the same thing where I see the message dispatched to the secondary OSD, but the message just doesn't show up for 30+ seconds in the secondary OSD logs. -

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I upped the debug on about everything and ran the test for about 40 minutes. I took OSD.19 on ceph1 down and then brought it back in. There was at least one op on osd.19 that was blocked for over 1,000 seconds. Hopefully this will have something

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Max A. Krasilnikov
Hello! On Mon, Oct 05, 2015 at 09:35:26PM -0600, robert wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > With some off-list help, we have adjusted > osd_client_message_cap=1. This seems to have helped a bit and we > have seen some OSDs have a value up to 4,000 for client

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Mon, 5 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > With some off-list help, we have adjusted > osd_client_message_cap=1. This seems to have helped a bit and we > have seen some OSDs have a value up to 4,000 for client messages. But > it does not

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Ken Dreyer
On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil wrote: > Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build) > first. They won't be allowed to boot until that happens... all upgrades > must stop at 0.94.4 first. This sounds pretty crucial. is there Redmine

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Sage Weil
On Tue, 6 Oct 2015, Robert LeBlanc wrote: > I downgraded to the hammer gitbuilder branch, but it looks like I've > passed the point of no return: > > 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data > includes unsupported features: > compat={},rocompat={},incompat={7=support shec

Re: [ceph-users] Potential OSD deadlock?

2015-10-06 Thread Robert LeBlanc
I downgraded to the hammer gitbuilder branch, but it looks like I've passed the point of no return: 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data includes unsupported features: compat={},rocompat={},incompat={7=support shec erasure code} 2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1

Re: [ceph-users] Potential OSD deadlock?

2015-10-05 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 With some off-list help, we have adjusted osd_client_message_cap=1. This seems to have helped a bit and we have seen some OSDs have a value up to 4,000 for client messages. But it does not solve the problem with the blocked I/O. One thing that

Re: [ceph-users] Potential OSD deadlock?

2015-10-05 Thread Josef Johansson
Hi, Looking over disks etc and comparing to our setup, we got a bit different hardware, but they should be comparable. Running Hitachi 4TB (HUS724040AL), Intel DC S3700 and SAS3008 instead. In our old cluster (almost same hardware in new and old) we have overloaded the cluster and had to wait

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Josef Johansson
Hi, I don't know what brand those 4TB spindles are, but I know that mine are very bad at doing write at the same time as read. Especially small read write. This has an absurdly bad effect when doing maintenance on ceph. That being said we see a lot of difference between dumpling and hammer in

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Sage Weil
On Sat, 3 Oct 2015, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > We are still struggling with this and have tried a lot of different > things. Unfortunately, Inktank (now Red Hat) no longer provides > consulting services for non-Red Hat systems. If there are some

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Alex Gorbachev
We had multiple issues with 4TB drives and delays. Here is the configuration that works for us fairly well on Ubuntu (but we are about to significantly increase the IO load so this may change). NTP: always use NTP and make sure it is working - Ceph is very sensitive to time being precise

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Josef Johansson
I would start with defrag the drives, the good part is that you can just run the defrag with the time parameter and it will take all available xfs drives. On 4 Oct 2015 6:13 pm, "Robert LeBlanc" wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > These are

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 These are Toshiba MG03ACA400 drives. sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series

Re: [ceph-users] Potential OSD deadlock?

2015-10-04 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I have eight nodes running the fio job rbd_test_real to different RBD volumes. I've included the CRUSH map in the tarball. I stopped one OSD process and marked it out. I let it recover for a few minutes and then I started the process again and

Re: [ceph-users] Potential OSD deadlock?

2015-10-03 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 We are still struggling with this and have tried a lot of different things. Unfortunately, Inktank (now Red Hat) no longer provides consulting services for non-Red Hat systems. If there are some certified Ceph consultants in the US that we can do

Re: [ceph-users] Potential OSD deadlock?

2015-09-25 Thread Robert LeBlanc
We dropped the replication on our cluster from 4 to 3 and it looks like all the blocked I/O has stopped (no entries in the log for the last 12 hours). This makes me believe that there is some issue with the number of sockets or some other TCP issue. We have not messed with Ephemeral ports and

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Mark Nelson
FWIW, we've got some 40GbE Intel cards in the community performance cluster on a Mellanox 40GbE switch that appear (knock on wood) to be running fine with 3.10.0-229.7.2.el7.x86_64. We did get feedback from Intel that older drivers might cause problems though. Here's ifconfig from one of the

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 We were able to only get ~17Gb out of the XL710 (heavily tweaked) until we went to the 4.x kernel where we got ~36Gb (no tweaking). It seems that there were some major reworks in the network handling in the kernel to efficiently handle that network

Re: [ceph-users] Potential OSD deadlock?

2015-09-23 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 OK, here is the update on the saga... I traced some more of blocked I/Os and it seems that communication between two hosts seemed worse than others. I did a two way ping flood between the two hosts using max packet sizes (1500). After 1.5M packets,

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I'm starting to wonder if this has to do with some OSDs getting full or the 0.94.3 code. Earlier this afternoon, I cleared out my test cluster so there were no pools. I created a new rbd pool and started filling it with 6 - 1TB fio jobs replication 3

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 This is IPoIB and we have the MTU set to 64K. There were some issues pinging hosts with "No buffer space available" (hosts are currently configured for 4GB to test SSD caching rather than page cache). I found that MTU under 32K worked reliably for

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 OK, looping in ceph-devel to see if I can get some more eyes. I've extracted what I think are important entries from the logs for the first blocked request. NTP is running all the servers so the logs should be close in terms of time. Logs for 12:50

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Samuel Just
I looked at the logs, it looks like there was a 53 second delay between when osd.17 started sending the osd_repop message and when osd.13 started reading it, which is pretty weird. Sage, didn't we once see a kernel issue which caused some messages to be mysteriously delayed for many 10s of
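A gap like Sam's 53 seconds falls straight out of two log timestamps. The timestamps below are fabricated for illustration (they are not from the actual osd.17/osd.13 logs), and `date -d` is GNU-specific.

```shell
# Time between osd.17 queueing the osd_repop and osd.13 reading it:
sent="2015-09-22 12:50:15"      # hypothetical send time on osd.17
seen="2015-09-22 12:51:08"      # hypothetical read time on osd.13
t0=$(date -d "$sent" +%s)
t1=$(date -d "$seen" +%s)
echo "osd_repop delay: $((t1 - t0))s"
```

Since NTP keeps the nodes' clocks close, a multi-second gap computed this way across two hosts' logs points at the transport, not at clock skew.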

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 4.2.0-1.el7.elrepo.x86_64 - - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, Sep 22, 2015 at 3:41 PM, Samuel Just wrote: > I looked at

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Sage Weil
On Tue, 22 Sep 2015, Samuel Just wrote: > I looked at the logs, it looks like there was a 53 second delay > between when osd.17 started sending the osd_repop message and when > osd.13 started reading it, which is pretty weird. Sage, didn't we > once see a kernel issue which caused some messages

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Gregory Farnum
On Mon, Sep 21, 2015 at 11:43 PM, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > I'm starting to wonder if this has to do with some OSDs getting full > or the 0.94.3 code. Earlier this afternoon, I cleared out my test > cluster so there was no

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Is there some way to tell in the logs that this is happening? I'm not seeing much I/O, CPU usage during these times. Is there some way to prevent the splitting? Is there a negative side effect to doing so? We've had I/O block for over 900 seconds

Re: [ceph-users] Potential OSD deadlock?

2015-09-22 Thread Gregory Farnum
On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA256 > > Is there some way to tell in the logs that this is happening? You can search for the (mangled) name _split_collection > I'm not > seeing much I/O, CPU usage during
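Greg's search can be sketched over a fabricated log line (the exact message format may differ between releases); on a node you would grep the real OSD logs instead.

```shell
# On a live node:
#   grep _split_collection /var/log/ceph/ceph-osd.*.log
# Self-contained stand-in with two made-up filestore debug lines:
sample_log() {
cat <<'EOF'
2015-09-22 07:24:01.123456 7f2a3c1e0700 10 filestore _split_collection 3.1f7 ...
2015-09-22 07:24:02.000000 7f2a3c1e0700 10 filestore queue_op ...
EOF
}
sample_log | grep -c _split_collection
```

Hits clustered around the stall windows would support the PG-directory-splitting theory even when CPU and I/O otherwise look idle.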

Re: [ceph-users] Potential OSD deadlock?

2015-09-21 Thread Gregory Farnum
So it sounds like you've got two different things here: 1) You get a lot of slow operations that show up as warnings. 2) Rarely, you get blocked op warnings that don't seem to go away until the cluster state changes somehow. (2) is the interesting one. Since you say the cluster is under heavy

Re: [ceph-users] Potential OSD deadlock?

2015-09-21 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 In my lab cluster I can saturate the disks and I'm not seeing any of the blocked I/Os from the Ceph side, although the client shows that I/O stops for a while. I'm not convinced that it is load related. I was looking through the logs using the

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 I was able to catch the tail end of one of these and increased the logging on it. I had to kill it a minute or two after the logging was increased because of the time of the day. I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz .

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 We had another incident of 100 long blocked I/O this morning, but I didn't get to it in time. It wound up clearing itself after almost 1,000 seconds. One interesting note is that the blocked I/O kept creeping up until I saw a bunch of entries in the

Re: [ceph-users] Potential OSD deadlock?

2015-09-20 Thread Robert LeBlanc
We set the logging on an OSD that had problems pretty frequently, but cleared up in less than 30 seconds. The logs are at http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed at 8.6GB. Some of the messages we were seeing in ceph -w are: 2015-09-20 20:55:44.029041 osd.112 [WRN] 10

[ceph-users] Potential OSD deadlock?

2015-09-19 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 We have had two situations where I/O just seems to be indefinitely blocked on our production cluster today (0.94.3). In the case this morning, it was just normal I/O traffic, no recovery or backfill. The case this evening, we were backfilling to