Hello!
On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:
> Have you tried running iperf between the nodes? Capturing a pcap of the
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?
> What about
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
It seems in our situation the cluster is just busy, usually with
really small RBD I/O. We have gotten things to where it doesn't happen
as much in a steady state, but when we have an OSD fail (mostly from
an XFS log bug we hit at least once a week),
On Wed, 14 Oct 2015, Robert LeBlanc wrote:
> It seems in our situation the cluster is just busy, usually with
> really small RBD I/O. We have gotten things to where it doesn't happen
> as much in a steady state, but when we have an OSD fail
I'm sure I have a log of a 1,000 second block somewhere, I'll have to
look around for it.
I'll try turning that knob and see what happens. I'll come back with
the results.
Thanks,
-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4
On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil wrote:
> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> After a weekend, I'm ready to hit this from a different direction.
>>
>> I replicated the issue with Firefly so it doesn't
On Mon, 12 Oct 2015, Robert LeBlanc wrote:
> After a weekend, I'm ready to hit this from a different direction.
>
> I replicated the issue with Firefly so it doesn't seem an issue that
> has been introduced or resolved in any nearby version.
Are there any errors on the NICs? (ethtool -S ethX)
Also take a look at the switch and look for flow control statistics - do you
have flow control enabled or disabled?
We had to disable flow control as it would pause all IO on the port whenever
any path got congested, which you don't want to
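For reference, a sketch of how these checks look from the command line (eth0 is a placeholder interface name; run on each node):

```shell
# Per-NIC error and pause-frame counters (nonzero err/drop/pause is suspect):
ethtool -S eth0 | grep -iE 'err|drop|pause|disc'
# Show current flow-control (pause) settings:
ethtool -a eth0
# Disable flow control, as described above (needs driver support):
ethtool -A eth0 autoneg off rx off tx off
```

Check the same counters on the switch ports facing these NICs so you can tell which side is generating the pauses.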
Hello!
On Thu, Oct 08, 2015 at 11:44:09PM -0600, robert wrote:
> Sage,
> After trying to bisect this issue (all tests moved the bisect towards
> Infernalis) and eventually testing the Infernalis branch again, it
> looks like the problem still
An additional issue with Intel NICs: some of them (the I*GB series, not
e1000e) are multiqueue, so the default qdisc is "mq", not "pfifo_fast". I have
half of the cluster on e1000e and half on IGB (each with 2x NICs in
bonding+bridge, no jumbo frames, txqueuelen 2000). So, on my multiqueue NICs
irqbalance produces massive network
Hello!
On Fri, Oct 09, 2015 at 11:05:59AM +0200, jan wrote:
> Are there any errors on the NICs? (ethtool -S ethX)
No errors. Neither on the nodes nor on the switches.
> Also take a look at the switch and look for flow control statistics - do you
> have flow control enabled or disabled?
flow control
Have you tried running iperf between the nodes? Capturing a pcap of the
(failing) Ceph comms from both sides could help narrow it down.
Is there any SDN layer involved that could add overhead/padding to the frames?
What about some intermediate MTU like 8000 - does that work?
Oh and if there's
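A sketch of the iperf/pcap test suggested above (HOST is a placeholder; 6800-7300 is the default OSD port range):

```shell
# Raw TCP throughput between two nodes:
iperf -s                      # on the receiving node
iperf -c HOST -t 30 -P 4      # on the sender: 30 seconds, 4 parallel streams

# Capture the Ceph traffic on BOTH sides while reproducing the stall,
# then line the two captures up in wireshark/tshark:
tcpdump -i eth0 -s 0 -w ceph-$(hostname).pcap tcp portrange 6800-7300
</imports>
```

Capturing from both ends is the point: it shows whether a delayed message left the sender late or sat unread on the receiver.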
Hello!
On Fri, Oct 09, 2015 at 01:45:42PM +0200, jan wrote:
> Have you tried running iperf between the nodes? Capturing a pcap of the
> (failing) Ceph comms from both sides could help narrow it down.
> Is there any SDN layer involved that could add overhead/padding to the frames?
No
I have a probably similar situation on latest hammer & 4.1+ kernels on
spinning OSDs (journal on a dedicated partition of the same HDD): occasional
slow requests, etc. Try:
1) "journal aio = false", even with the journal on a dedicated partition;
2) the single-queue "noop" scheduler (on the OSDs);
3) reducing nr_requests to 32 (on the OSDs);
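A sketch of applying those three knobs on an OSD node (sdX is a placeholder for an OSD data disk; the ceph.conf line belongs in the [osd] section — try on one node first):

```shell
# 1) in ceph.conf, [osd] section:
#      journal aio = false
# 2) single-queue noop elevator for the OSD disk:
echo noop > /sys/block/sdX/queue/scheduler
# 3) shrink the request queue:
echo 32 > /sys/block/sdX/queue/nr_requests
```

The sysfs changes do not survive a reboot; persist them via udev rules or rc.local if they help.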
Sage,
After trying to bisect this issue (all tests moved the bisect towards
Infernalis) and eventually testing the Infernalis branch again, it
looks like the problem still exists although it is handled a tad
better in Infernalis. I'm going to test
We forgot to upload the ceph.log yesterday. It is there now.
-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Tue, Oct 6, 2015 at 5:40 PM, Robert LeBlanc wrote:
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> Thanks for your time Sage. It sounds like a few people may be helped if you
> can find something.
>
> I did a recursive chown as in the instructions (although I didn't know about
> the doc at the time). I did an osd debug at 20/20 but didn't see
Thanks for your time Sage. It sounds like a few people may be helped if you
can find something.
I did a recursive chown as in the instructions (although I didn't know
about the doc at the time). I did an osd debug at 20/20 but didn't see
anything. I'll also do ms and make the logs available. I'll
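For reference, a sketch of turning that debugging up at runtime (osd.N is a placeholder id; the log path is the usual default):

```shell
# Raise osd and messenger (ms) debugging on one daemon:
ceph tell osd.N injectargs '--debug_osd 20 --debug_ms 20'
# ... reproduce the blocked I/O, then collect /var/log/ceph/ceph-osd.N.log ...
# Drop the levels back down afterwards; 20/20 logging is very heavy:
ceph tell osd.N injectargs '--debug_osd 0/5 --debug_ms 0/5'
```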
This was from the monitor (can't bring it up with Hammer now, complete
cluster is down, this is only my lab, so no urgency).
I got it up and running this way:
1. Upgrade the mon node to Infernalis and started the mon.
2. Downgraded the OSDs to
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
> (4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
> messages when the OSD was marked out:
>
> 2015-10-06
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> I can't think of anything. In my dev cluster the only thing that has
> changed is the Ceph versions (no reboot). What I like is even though
> the disks are 100% utilized, it is performing as I expect
OK, an interesting point. Running ceph version 9.0.3-2036-g4f54a0d
(4f54a0dd7c4a5c8bdc788c8b7f58048b2a28b9be) looks a lot better. I got
messages when the OSD was marked out:
2015-10-06 11:52:46.961040 osd.13 192.168.55.12:6800/20870 81 :
cluster
I can't think of anything. In my dev cluster the only thing that has
changed is the Ceph versions (no reboot). What I like is even though
the disks are 100% utilized, it is performing as I expect now. Client
I/O is slightly degraded during the
On my second test (a much longer one), it took nearly an hour, but a
few messages have popped up over a 20-minute window. Still far less than I
have been seeing.
-
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2
I'll capture another set of logs. Is there any other debugging you
want turned up? I've seen the same thing where I see the message
dispatched to the secondary OSD, but the message just doesn't show up
for 30+ seconds in the secondary OSD logs.
-
I upped the debug on about everything and ran the test for about 40
minutes. I took OSD.19 on ceph1 down and then brought it back in.
There was at least one op on osd.19 that was blocked for over 1,000
seconds. Hopefully this will have something
Hello!
On Mon, Oct 05, 2015 at 09:35:26PM -0600, robert wrote:
> With some off-list help, we have adjusted
> osd_client_message_cap=1. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client
On Mon, 5 Oct 2015, Robert LeBlanc wrote:
> With some off-list help, we have adjusted
> osd_client_message_cap=1. This seems to have helped a bit and we
> have seen some OSDs have a value up to 4,000 for client messages. But
> it does not
On Tue, Oct 6, 2015 at 8:38 AM, Sage Weil wrote:
> Oh.. I bet you didn't upgrade the osds to 0.94.4 (or latest hammer build)
> first. They won't be allowed to boot until that happens... all upgrades
> must stop at 0.94.4 first.
This sounds pretty crucial. Is there a Redmine
On Tue, 6 Oct 2015, Robert LeBlanc wrote:
> I downgraded to the hammer gitbuilder branch, but it looks like I've
> passed the point of no return:
>
> 2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
> includes unsupported features:
> compat={},rocompat={},incompat={7=support shec
I downgraded to the hammer gitbuilder branch, but it looks like I've
passed the point of no return:
2015-10-06 09:44:52.210873 7fd3dd8b78c0 -1 ERROR: on disk data
includes unsupported features:
compat={},rocompat={},incompat={7=support shec erasure code}
2015-10-06 09:44:52.210922 7fd3dd8b78c0 -1
With some off-list help, we have adjusted
osd_client_message_cap=1. This seems to have helped a bit and we
have seen some OSDs have a value up to 4,000 for client messages. But
it does not solve the problem with the blocked I/O.
One thing that
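For anyone following along, a sketch of how a cap like that is adjusted at runtime (the value shown is purely an example, not a recommendation):

```shell
# Inject a client message cap into all OSDs without restarting them:
ceph tell osd.* injectargs '--osd_client_message_cap 10000'
# Verify on one daemon via its admin socket:
ceph daemon osd.0 config get osd_client_message_cap
```

Put the same setting in ceph.conf under [osd] to make it persist across restarts.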
Hi,
Looking over disks etc. and comparing to our setup, we have slightly different
hardware, but they should be comparable. Running Hitachi 4TB (HUS724040AL),
Intel DC S3700 and SAS3008 instead.
In our old cluster (almost the same hardware in new and old) we overloaded the
cluster and had to wait
Hi,
I don't know what brand those 4TB spindles are, but I know that mine are
very bad at doing writes at the same time as reads, especially small reads
mixed with writes.
This has an absurdly bad effect when doing maintenance on ceph. That being
said we see a lot of difference between dumpling and hammer in
On Sat, 3 Oct 2015, Robert LeBlanc wrote:
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are some
We had multiple issues with 4TB drives and delays. Here is the
configuration that works for us fairly well on Ubuntu (but we are about to
significantly increase the IO load so this may change).
NTP: always use NTP and make sure it is working - Ceph is very sensitive to
time being precise
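A quick sketch of verifying that NTP is actually holding the nodes in sync (run on each node; small offsets are what you want):

```shell
ntpq -p     # list peers with offset (ms) and jitter; look for the '*' peer
ntpstat     # one-line sync summary; nonzero exit status means unsynchronised
```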
I would start with defragging the drives; the good part is that you can just
run the defrag with the time parameter and it will take care of all available
XFS drives.
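Concretely, this is a sketch of what that looks like (sdX1 is a placeholder device):

```shell
# Check fragmentation on one OSD filesystem first:
xfs_db -r -c frag /dev/sdX1
# xfs_fsr with no filesystem argument walks all mounted XFS filesystems;
# -t bounds the run time in seconds (here: at most one hour), -v is verbose:
xfs_fsr -t 3600 -v
```

Running it from cron during off-peak hours keeps the reorganisation load away from client I/O.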
On 4 Oct 2015 6:13 pm, "Robert LeBlanc" wrote:
> These are
These are Toshiba MG03ACA400 drives.
sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gbps
sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79
series
I have eight nodes running the fio job rbd_test_real to different RBD
volumes. I've included the CRUSH map in the tarball.
I stopped one OSD process and marked it out. I let it recover for a
few minutes and then I started the process again and
We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are some
certified Ceph consultants in the US that we can do
We dropped the replication on our cluster from 4 to 3 and it looks
like all the blocked I/O has stopped (no entries in the log for the
last 12 hours). This makes me believe that there is some issue with
the number of sockets or some other TCP issue. We have not messed with
ephemeral ports and
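For anyone wanting to try the same, a sketch of the pool-size change ("rbd" is an example pool name; repeat per pool):

```shell
ceph osd pool set rbd size 3      # drop replication from 4 to 3
ceph osd pool get rbd size        # confirm the new value
```

Expect a burst of data movement as the extra replicas are removed.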
FWIW, we've got some 40GbE Intel cards in the community performance
cluster on a Mellanox 40GbE switch that appear (knock on wood) to be
running fine with 3.10.0-229.7.2.el7.x86_64. We did get feedback from
Intel that older drivers might cause problems though.
Here's ifconfig from one of the
We were able to get only ~17Gb/s out of the XL710 (heavily tweaked)
until we went to the 4.x kernel, where we got ~36Gb/s (no tweaking). It
seems that there were some major reworks in the network handling in
the kernel to efficiently handle that network
OK, here is the update on the saga...
I traced some more of the blocked I/Os and it seems that communication
between two hosts was worse than between others. I did a two-way ping flood
between the two hosts using max packet sizes (1500). After 1.5M
packets,
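The flood test above can be sketched like this (PEER is a placeholder; a 1500-byte MTU minus 28 bytes of IP+ICMP header leaves a 1472-byte payload):

```shell
# Flood 1.5M maximum-size packets and watch the loss/duplicate summary:
ping -f -s 1472 -c 1500000 PEER
```

Run it in both directions; asymmetric loss points at one NIC, cable, or switch port rather than the path as a whole.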
I'm starting to wonder if this has to do with some OSDs getting full
or the 0.94.3 code. Earlier this afternoon, I cleared out my test
cluster so there were no pools. I created a new RBD pool and started
filling it with six 1TB fio jobs at replication 3
This is IPoIB and we have the MTU set to 64K. There were some issues
pinging hosts with "No buffer space available" (hosts are currently
configured for 4GB to test SSD caching rather than page cache). I
found that an MTU under 32K worked reliably for
OK, looping in ceph-devel to see if I can get some more eyes. I've
extracted what I think are important entries from the logs for the
first blocked request. NTP is running on all the servers so the logs
should be close in terms of time. Logs for 12:50
I looked at the logs, it looks like there was a 53 second delay
between when osd.17 started sending the osd_repop message and when
osd.13 started reading it, which is pretty weird. Sage, didn't we
once see a kernel issue which caused some messages to be mysteriously
delayed for many 10s of
4.2.0-1.el7.elrepo.x86_64
- -
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Tue, Sep 22, 2015 at 3:41 PM, Samuel Just wrote:
> I looked at
On Tue, 22 Sep 2015, Samuel Just wrote:
> I looked at the logs, it looks like there was a 53 second delay
> between when osd.17 started sending the osd_repop message and when
> osd.13 started reading it, which is pretty weird. Sage, didn't we
> once see a kernel issue which caused some messages
On Mon, Sep 21, 2015 at 11:43 PM, Robert LeBlanc wrote:
> I'm starting to wonder if this has to do with some OSDs getting full
> or the 0.94.3 code. Earlier this afternoon, I cleared out my test
> cluster so there was no
Is there some way to tell in the logs that this is happening? I'm not
seeing much I/O, CPU usage during these times. Is there some way to
prevent the splitting? Is there a negative side effect to doing so?
We've had I/O block for over 900 seconds
On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote:
> Is there some way to tell in the logs that this is happening?
You can search for the (mangled) name _split_collection
> I'm not
> seeing much I/O, CPU usage during
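A quick way to do that search; the log path is just an example (e.g. /var/log/ceph/ceph-osd.8.log), and the sample lines below are made up for illustration:

```shell
# Count PG folder-split events in an OSD log; point LOG at a real file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
2015-09-22 07:24:01.123 7f1c 20 filestore _split_collection 3.14_head
2015-09-22 07:24:05.456 7f1c 10 osd.8 pg_epoch: 1234 active+clean
EOF
grep -c '_split_collection' "$LOG"
rm -f "$LOG"
```

With real logs, also grep with -n and compare the timestamps around each split against when the slow-request warnings fired.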
So it sounds like you've got two different things here:
1) You get a lot of slow operations that show up as warnings.
2) Rarely, you get blocked op warnings that don't seem to go away
until the cluster state changes somehow.
(2) is the interesting one. Since you say the cluster is under heavy
In my lab cluster I can saturate the disks and I'm not seeing any of
the blocked I/Os from the Ceph side, although the client shows that
I/O stops for a while. I'm not convinced that it is load related.
I was looking through the logs using the
I was able to catch the tail end of one of these and increased the
logging on it. I had to kill it a minute or two after the logging was
increased because of the time of the day.
I've put the logs at https://robert.leblancnet.us/ceph-osd.8.log.xz .
We had another incident of ~100 long blocked I/Os this morning, but I
didn't get to it in time; it wound up clearing itself after almost
1,000 seconds. One interesting note is that the blocked I/O kept
creeping up until I saw a bunch of entries in the
We set the logging on an OSD that had problems pretty frequently, but
cleared up in less than 30 seconds. The logs are at
http://162.144.87.113/files/ceph-osd.112.log.xz and are uncompressed
at 8.6GB. Some of the messages we were seeing in ceph -w are:
2015-09-20 20:55:44.029041 osd.112 [WRN] 10
We have had two situations where I/O just seems to be indefinitely
blocked on our production cluster today (0.94.3). In the case this
morning, it was just normal I/O traffic, no recovery or backfill. The
case this evening, we were backfilling to