We are still struggling with this and have tried a lot of different
things. Unfortunately, Inktank (now Red Hat) no longer provides
consulting services for non-Red Hat systems. If there are any
certified Ceph consultants in the US who can do both remote and
on-site engagements, please let us know.
This certainly seems to be network related, but somewhere in the
kernel. We have tried increasing the network and TCP buffers and the
number of TCP sockets, and reducing the FIN_WAIT2 timeout. There is
about 25% idle CPU on the boxes; the disks are busy but not constantly
at 100% (they cycle from <10% up to 100%, but don't stay at 100% for
more than a few seconds at a time). There seems to be no reasonable
explanation for why I/O is blocked, fairly frequently, for longer than
30 seconds. We have verified jumbo frames by pinging from/to each node
with 9000-byte packets. The network admins have verified that packets
are not being dropped in the switches for these nodes. We have tried
different kernels, including the recent Google patch to CUBIC. This is
showing up on three clusters (two Ethernet and one IPoIB). I booted
one cluster into Debian Jessie (from CentOS 7.1) with similar results.
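For reference, this is roughly how we verify jumbo frames end to end (the
don't-fragment bit makes the ping fail if any hop can't pass the frame), and
the kind of buffer/timeout sysctls we have been experimenting with. The
values below are examples rather than recommendations, and the peer address
is just one of our OSD IPs:

```shell
# Verify a 9000-byte MTU end to end:
# 9000 - 20 (IPv4 header) - 8 (ICMP header) = 8972 bytes of payload.
# -M do sets the don't-fragment bit so an undersized hop MTU shows up as an error.
ping -c 3 -M do -s 8972 10.208.16.25

# Example TCP/buffer sysctls of the sort we have been adjusting (values are examples)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
sysctl -w net.ipv4.tcp_fin_timeout=20   # shortens time spent in FIN_WAIT2
```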
The messages seem slightly different:
2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
100.087155 secs
2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
cluster [WRN] slow request 30.041999 seconds old, received at
2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
rbd_data.13fdcb2ae8944a.0001264f [read 975360~4096]
11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
points reached
I don't know what "no flag points reached" means.
The problem is most pronounced when we have to reboot an OSD node (1
of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
seconds, and it takes a good 15 minutes for things to settle down. The
production cluster is very busy, normally doing 8,000 IOPS and peaking
at 15,000. This is all 4 TB spindles with SSD journals, and the disks
are between 25-50% full. We are currently splitting PGs to distribute
the load better across the disks, but we are having to do this 10 PGs
at a time because otherwise we get blocked I/O. We have max_backfills
and max_recovery set to 1, and client op priority is set higher than
recovery priority. We tried increasing the number of op threads, but
this didn't seem to help. It seems that as soon as PGs are finished
being checked they become active, which could be the cause of slow I/O
while the other PGs are being checked.
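For completeness, the throttles mentioned above look roughly like this in
ceph.conf (a sketch; the priority numbers are illustrative examples, not our
exact values):

```ini
[osd]
osd max backfills = 1
osd recovery max active = 1
osd client op priority = 63   ; example: keep client ops above recovery
osd recovery op priority = 1  ; example value
```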
What I don't understand is why the messages are delayed. As soon as
the message is received by the Ceph OSD process, it is very quickly
committed to the journal and a response is sent back to the primary
OSD, which is received very quickly as well. I've adjusted
min_free_kbytes and it seems to keep the OSDs from crashing, but it
doesn't solve the main problem. We don't have swap, and there is 64 GB
of RAM per node for 10 OSDs.
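For reference, the min_free_kbytes change is just a sysctl; the value below
is an example for a 64 GB node (and the file name is arbitrary), not a
recommendation:

```
# /etc/sysctl.d/99-ceph.conf (example value for a 64 GB node)
# Reserve ~2 GB so atomic kernel allocations (e.g. network receive) don't fail under pressure
vm.min_free_kbytes = 2097152
```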
Is there something that could cause the kernel to receive a packet but
not be able to dispatch it to Ceph, which could explain why we are
seeing I/O blocked for 30+ seconds? Are there any pointers to tracing
Ceph messages from the network buffer through the kernel to the Ceph
process?
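In case it helps anyone suggest a next step, this is the sort of thing we
can run to watch traffic between the wire and the OSD process. The interface
name is an assumption for our boxes, and 6800 is just the messenger port
from the log excerpt above:

```shell
# Capture OSD messenger traffic to correlate with slow-request timestamps later
tcpdump -i eth0 -s 128 -w osd-6800.pcap 'tcp port 6800'

# Per-socket TCP internals: retransmits, RTO, cwnd, and send-queue backlog
ss -tio 'sport = :6800 or dport = :6800'

# Watch for packets being dropped inside the kernel itself
dropwatch -l kas
```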
We can really use some pointers, no matter how outrageous. We've had
over six people looking into this for weeks now and just can't think
of anything else.
Thanks,
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc wrote:
> We dropped the replication on our cluster from 4 to 3 and it looks
> like all the blocked I/O has stopped (no entries in the log for the
> last 12 hours). This makes me believe there is an issue with the
> number of sockets or some other TCP problem. We have not messed with
> ephemeral ports and TIME_WAIT at this point. There are 130 OSDs and
> 8 KVM hosts hosting about 150 VMs. Open files is set at 32K for the
> OSD processes.
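Following up on the socket theory above, a quick sanity check we can run to
see how close each OSD is to the 32K descriptor limit, and how many of those
descriptors are sockets (this uses only standard /proc paths, nothing
Ceph-specific):

```shell
# For each running ceph-osd, count open fds and how many of them are sockets
for pid in $(pidof ceph-osd); do
    fds=$(ls /proc/"$pid"/fd | wc -l)
    socks=$(ls -l /proc/"$pid"/fd 2>/dev/null | grep -c 'socket:')
    echo "ceph-osd pid $pid: $fds open fds, $socks sockets"
done
```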