Re: [ceph-users] Testing CephFS
Need to be more careful but probably you're right -; ./net/ceph/messenger.c Shinobu On Mon, Aug 24, 2015 at 8:53 PM, Simon Hallam s...@pml.ac.uk wrote: The clients are: [root@gridnode50 ~]# uname -a Linux gridnode50 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux [root@gridnode50 ~]# ceph -v ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70) I don't think it is a reconnect timeout, as they don't even attempt to reconnect until I plug the Ethernet cable back into the original MDS? Cheers, Simon -Original Message- From: Yan, Zheng [mailto:z...@redhat.com] Sent: 24 August 2015 12:28 To: Simon Hallam Cc: ceph-users@lists.ceph.com; Gregory Farnum Subject: Re: [ceph-users] Testing CephFS On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote: On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam s...@pml.ac.uk wrote: Hi Greg, The MDS' detect that the other one went down and started the replay. I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error: [Aug24 10:53] ceph: mds0 caps stale [Aug24 10:54] ceph: mds0 caps stale [Aug24 10:58] ceph: mds0 hung [Aug24 11:03] ceph: mds0 came back [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN) [ +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon [Aug24 11:04] ceph: mds0 reconnect start [ +0.084938] libceph: mon2 10.15.0.3:6789 session established [ +0.008475] ceph: mds0 reconnect denied Oh, this might be a kernel bug, failing to ask for mdsmap updates when the connection goes away. Zheng, does that sound familiar? -Greg This seems like reconnect timeout. you can try enlarging mds_reconnect_timeout config option. Which version of kernel are you using? Yan, Zheng 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable. 
This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping): 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby *cable plugged back in* 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 10.15.0.3:6833/16993 up:boot 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45) 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45) 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45) 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from
[ceph-users] Ceph for multi-site operation
Hello, First, let me say I'm really a noob with Ceph, since I have only read some documentation. I'm now trying to deploy a Ceph cluster for testing purposes. The cluster is based on 3 (more if necessary) hypervisors running Proxmox 3.4. Before going further, I have an essential question: is Ceph usable for multi-site storage? Long story: my goal is to run hypervisors in 2 datacenters separated by 4 ms of latency. Bandwidth is currently 1 Gbps but will be upgraded in the near future. So is it possible to run an active/active Ceph cluster to get shared storage between the two sites? Of course, I'll have to be sure that no machine is running at the same time on both sites; the hypervisor will be in charge of this. Is there a way to ask Ceph to keep at least one copy (or two) at each site and to serve all block reads from the nearest location? I'm aware that writes would have to be replicated and that there's only a synchronous mode for this. I've read a lot of documentation and use cases about Ceph, and some seem to say it can be used for this kind of replication while others say it can't. Whether erasure coding is needed isn't clear either. Just hoping my English is clear enough to explain my case ;-) Thanks for your help, Julien Escario
[ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric
Hello Ceph Geeks, I am planning to develop a Python plugin that pulls out cluster *recovery IO* and *client IO* metrics, to be used with collectd. *For example, I need to extract these values:* *recovery io 814 MB/s, 101 objects/s* *client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s* Could you please help me understand how the *ceph -s* and *ceph -w* outputs print the cluster recovery IO and client IO information? Where does this information come from? *Is it coming from perf dump*? If yes, which section of the perf dump output should I focus on? If not, how can I get these values? I tried *ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump*, but it generates a huge amount of information and I am not sure which section of the output to use. Please help. Thanks in advance, Vickey
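For reference, the recovery io / client io lines printed by ceph -s and ceph -w are not taken from an individual OSD's perf dump - they are computed by the monitors from the aggregated placement-group statistics (the pgmap), so the easiest place for a plugin to read them is the JSON form of the same status output. A rough sketch (the pgmap key names below are from a Hammer-era cluster, only appear while there is matching activity, and may differ between releases):

   ceph status --format json | python -c '
   import json, sys
   pg = json.load(sys.stdin)["pgmap"]
   # client IO; the keys are simply absent when there is no client traffic
   print("client io: %s B/s rd, %s B/s wr, %s op/s" % (
       pg.get("read_bytes_sec", 0), pg.get("write_bytes_sec", 0), pg.get("op_per_sec", 0)))
   # recovery IO; the keys are absent when nothing is recovering
   print("recovery io: %s B/s, %s objects/s" % (
       pg.get("recovering_bytes_per_sec", 0), pg.get("recovering_objects_per_sec", 0)))
   '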
[ceph-users] radosgw secret_key
When I create a new user using radosgw-admin, most of the time the secret key gets escaped with a backslash, making it not work. Something like secret_key: xx\/\/. Why would the / need to be escaped? Why is it printing \/ instead of the / that does work? Usually I just remove the backslash and it works fine. I've seen this on several different clusters. Is it just me? This may require opening a bug in the tracking tool, but just asking here first.
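For reference, \/ is simply the (optional) JSON escape sequence for /: radosgw-admin prints its output as JSON, and its encoder chooses to escape the solidus. Any JSON parser hands back the real key, so the backslashes only need stripping when the string is copied by hand out of the raw output. A quick check (the key value here is made up):

   python -c 'import json; print(json.loads("\"AQ\\/x9\\/zK\""))'
   # prints: AQ/x9/zK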
Re: [ceph-users] Ceph for multi-site operation
Le 24/08/2015 15:11, Julien Escario a écrit : Hello, First, let me advise I'm really a noob with Cephsince I have only read some documentation. I'm now trying to deploy a Ceph cluster for testing purposes. The cluster is based on 3 (more if necessary) hypervisors running proxmox 3.4. Before going futher, I have an essential question : is Ceph usable in a case of multiple sites storage ? It depends on what you really need it to do (access patterns and behaviour when a link goes down). Long story : My goal is to run hypervisors on 2 datacenters separated by 4ms latency. Note : unless you are studying Ceph behaviour in this case this goal is in fact a method to reach a goal. If you describe the actual goal you might get different suggestions. Bandwidth is 1Gbps actually but will be upgraded in a near future. So is it possible to run a an active/active Ceph cluster to get a shared storage between the two sites. It is but it probably won't behave correctly in your case. The latency and the bandwidth will hurt a lot. Any application requiring that data is confirmed stored on disk will be hit by the 4ms latency and 1Gbps will have to be shared between inter-site replication traffic and regular VM disk accesses. Your storage will most probably behave like a very slow single hard drive shared between all your VMs. Some workloads might work correctly (if you don't have any significant writes and most of your data will fit in caches for example). When the link between your 2 datacenters is severed, in the worst case (no quorum reachable or a crushmap that won't allow each pg to reach min_size with only one datacenter) everything will freeze, in the best case (giving priority to a single datacenter by running more monitors on it and a crushmap storing at least min_size replicas on it) when the link will be going down everything will run on this datacenter. You can get around a part of the performance problems by going with a 3-way replication, 2 replicas on your primary datacenter and 1 on the secondary where all OSD are configured with primary affinity 0. All reads will be served from the primary datacenter and only writes would go to the secondary. You'll have to run all your VM on the primary datacenter and setup your monitors such that the elected master will be in the primary datacenter (I believe it is chosen by the first name according to alphabetical order). You'll have a copy of your data on the secondary datacenter in case of a disaster on the primary but recovering will be hard (you'll have to reach a quorum of monitors in the secondary datacenter and I'm not sure how to proceed if you only have one out of 3 for example). Of course, I'll have to be sure that no machien is running at the same time on both sites. With your bandwidth and latency, without knowing more about your workloads it's probable that running VM on both sites will get you very slow IOs. Multi datacenter for simple object storage using RGW seems to work, but RBD volumes accesses are usually more demanding. Hypervisor will be in charge of this. Is there a mean to ask Ceph to keep at least one copy (or two) in each site and ask it to make all blocs reads from the nearest location ? I'm aware that writes would have to be replicated and there's only a synchronous mode for this. I've read many documentation and use cases about Ceph and it seems some are saying it could be used in such replication and others are not. Need of erasure coding isn't clear too. Don't use erasure coding for RBD volumes. 
You'll need a caching tier and it seems tricky to get right and might not be fully tested (I've seen a snapshot bug discussed here last week). Best regards, Lionel
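For reference, the layout Lionel describes (2 copies in the primary datacenter, 1 in the secondary, all reads served by the primary site) could be sketched roughly as follows. This is illustration only - the bucket, pool and OSD names are made up, and the CRUSH rule has to be edited into the decompiled crush map:

   # CRUSH rule: 2 replicas in dc1, 1 in dc2
   #   step take dc1
   #   step chooseleaf firstn 2 type host
   #   step emit
   #   step take dc2
   #   step chooseleaf firstn 1 type host
   #   step emit
   ceph osd pool set rbd size 3
   ceph osd pool set rbd min_size 2
   # keep all primaries (and therefore all reads) in dc1; on older releases
   # the monitors must have "mon osd allow primary affinity = true" first
   ceph osd primary-affinity osd.12 0     # repeat for every OSD in dc2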
Re: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?
Hi, I'm not sure about krbd, but with librbd, using trim/discard on the client does not do a trim/discard on the OSD's physical disk. It simply writes zeroes in the RBD image. The zero writes can be skipped since this commit (librbd related): https://github.com/xiaoxichen/ceph/commit/e7812b8416012141cf8faef577e7b27e1b29d5e3 +OPTION(rbd_skip_partial_discard, OPT_BOOL, false) Then you can still manage fstrim manually on the OSD servers. - Original Message - From: Chad William Seys cws...@physics.wisc.edu To: ceph-users ceph-us...@ceph.com Sent: Saturday, 22 August 2015 04:26:38 Subject: [ceph-users] TRIM / DISCARD run at low priority by the OSDs? Hi All, Is it possible to give TRIM / DISCARD initiated by krbd low priority on the OSDs? I know it is possible to run fstrim at Idle priority on the rbd mount point, e.g. ionice -c Idle fstrim -v $MOUNT. But this Idle priority (it appears) only applies within the context of the node executing fstrim. Even if the node executing fstrim is Idle, the OSDs are very busy and performance suffers. Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at low priority also? Thanks! Chad.
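A rough sketch of the two suggestions above, assuming the OSD data disks actually support TRIM and that your librbd release already carries the rbd_skip_partial_discard option:

   # ceph.conf on the librbd client side:
   #   [client]
   #   rbd skip partial discard = true
   #
   # and on each OSD server, trim the OSD filesystems at idle IO priority
   # (the path is the default OSD mount point - adjust as needed):
   ionice -c 3 fstrim -v /var/lib/ceph/osd/ceph-0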
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
This can be tuned in the iSCSI initiation on VMware - look in advanced settings on your ESX hosts (at least if you use the software initiator). Thanks, Jan. I asked this question of Vmware as well, I think the problem is specific to a given iSCSI session, so wondering if that's strictly the job of the target? Do you know of any specific SCSI settings that mitigate this kind of issue? Basically, give up on a session and terminate it and start a new one should an RBD not respond? As I understand, RBD simply never gives up. If an OSD does not respond but is still technically up and in, Ceph will retry IOs forever. I think RBD and Ceph need a timeout mechanism for this. Best regards, Alex Jan On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote: Hi Alex, Currently RBD+LIO+ESX is broken. The problem is caused by the RBD device not handling device aborts properly causing LIO and ESXi to enter a death spiral together. If something in the Ceph cluster causes an IO to take longer than 10 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, as you have seen it never recovers. Mike Christie from Redhat is doing a lot of work on this currently, so hopefully in the future there will be a direct RBD interface into LIO and it will all work much better. Either tgt or SCST seem to be pretty stable in testing. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev Sent: 23 August 2015 02:17 To: ceph-users ceph-users@lists.ceph.com Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs Hello, this is an issue we have been suffering from and researching along with a good number of other Ceph users, as evidenced by the recent posts. In our specific case, these issues manifest themselves in a RBD - iSCSI LIO - ESXi configuration, but the problem is more general. When there is an issue on OSD nodes (examples: network hangs/blips, disk HBAs failing, driver issues, page cache/XFS issues), some OSDs respond slowly or with significant delays. ceph osd perf does not show this, neither does ceph osd tree, ceph -s / ceph -w. Instead, the RBD IO hangs to a point where the client times out, crashes or displays other unsavory behavior - operationally this crashes production processes. Today in our lab we had a disk controller issue, which brought an OSD node down. Upon restart, the OSDs started up and rejoined into the cluster. However, immediately all IOs started hanging for a long time and aborts from ESXi - LIO were not succeeding in canceling these IOs. The only warning I could see was: root@lab2-mon1:/var/log/ceph# ceph health detail HEALTH_WARN 30 requests are blocked 32 sec; 1 osds have slow requests 30 ops are blocked 2097.15 sec 30 ops are blocked 2097.15 sec on osd.4 1 osds have slow requests However, ceph osd perf is not showing high latency on osd 4: root@lab2-mon1:/var/log/ceph# ceph osd perf osd fs_commit_latency(ms) fs_apply_latency(ms) 0 0 13 1 00 2 00 3 172 208 4 00 5 00 6 01 7 00 8 174 819 9 6 10 10 01 11 01 12 35 13 01 14 7 23 15 01 16 00 17 59 18 01 1910 18 20 00 21 00 22 01 23 5 10 SMART state for osd 4 disk is OK. 
The OSD in up and in: root@lab2-mon1:/var/log/ceph# ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -80 root ssd -7 14.71997 root platter -3 7.12000 host croc3 22 0.89000 osd.22 up 1.0 1.0 15 0.89000 osd.15 up 1.0 1.0 16 0.89000 osd.16 up 1.0 1.0 13 0.89000 osd.13 up 1.0 1.0 18 0.89000 osd.18 up 1.0 1.0 8 0.89000 osd.8 up 1.0 1.0 11 0.89000 osd.11 up 1.0 1.0 20 0.89000 osd.20 up 1.0 1.0 -4 0.47998 host croc2 10 0.06000
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
I never actually set up iSCSI with VMware, I just had to research various VMware storage options when we had a SAN-probelm at a former job... But I can take a look at it again if you want me to. Is it realy deadlocked when this issue occurs? What I think is partly responsible for this situation is that the iSCSI LUN queues fill up and that's what actually kills your IO - VMware lowers queue depth to 1 in that situation and it can take a really long time to recover (especially if one of the LUNs on the target constantly has problems, or when heavy IO hammers the adapter) - you should never fill this queue, ever. iSCSI will likely be innocent victim in the chain, not the cause of the issues. Ceph should gracefully handle all those situations, you just need to set the timeouts right. I have it set so that whatever happens the OSD can only delay work for 40s and then it is marked down - at that moment all IO start flowing again. You should take this to VMware support, they should be able to tell whether the problem is in iSCSI target (then you can take a look at how that behaves) or in the initiator settings. Though in my experience after two visits from their foremost experts I had to google everything myself because they were clueless - YMMV. The root cause is however slow ops in Ceph, and I have no idea why you'd have them if the OSDs come back up - maybe one of them is really deadlocked or backlogged in some way? I found that when OSDs are dead but up they don't respond to ceph tell osd.xxx ... so try if they all respond in a timely manner, that should help pinpoint the bugger. Jan On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote: This can be tuned in the iSCSI initiation on VMware - look in advanced settings on your ESX hosts (at least if you use the software initiator). Thanks, Jan. I asked this question of Vmware as well, I think the problem is specific to a given iSCSI session, so wondering if that's strictly the job of the target? Do you know of any specific SCSI settings that mitigate this kind of issue? Basically, give up on a session and terminate it and start a new one should an RBD not respond? As I understand, RBD simply never gives up. If an OSD does not respond but is still technically up and in, Ceph will retry IOs forever. I think RBD and Ceph need a timeout mechanism for this. Best regards, Alex Jan On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote: Hi Alex, Currently RBD+LIO+ESX is broken. The problem is caused by the RBD device not handling device aborts properly causing LIO and ESXi to enter a death spiral together. If something in the Ceph cluster causes an IO to take longer than 10 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, as you have seen it never recovers. Mike Christie from Redhat is doing a lot of work on this currently, so hopefully in the future there will be a direct RBD interface into LIO and it will all work much better. Either tgt or SCST seem to be pretty stable in testing. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev Sent: 23 August 2015 02:17 To: ceph-users ceph-users@lists.ceph.com Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs Hello, this is an issue we have been suffering from and researching along with a good number of other Ceph users, as evidenced by the recent posts. 
In our specific case, these issues manifest themselves in a RBD - iSCSI LIO - ESXi configuration, but the problem is more general. When there is an issue on OSD nodes (examples: network hangs/blips, disk HBAs failing, driver issues, page cache/XFS issues), some OSDs respond slowly or with significant delays. ceph osd perf does not show this, neither does ceph osd tree, ceph -s / ceph -w. Instead, the RBD IO hangs to a point where the client times out, crashes or displays other unsavory behavior - operationally this crashes production processes. Today in our lab we had a disk controller issue, which brought an OSD node down. Upon restart, the OSDs started up and rejoined into the cluster. However, immediately all IOs started hanging for a long time and aborts from ESXi - LIO were not succeeding in canceling these IOs. The only warning I could see was: root@lab2-mon1:/var/log/ceph# ceph health detail HEALTH_WARN 30 requests are blocked 32 sec; 1 osds have slow requests 30 ops are blocked 2097.15 sec 30 ops are blocked 2097.15 sec on osd.4 1 osds have slow requests However, ceph osd perf is not showing high latency on osd 4: root@lab2-mon1:/var/log/ceph# ceph osd perf osd fs_commit_latency(ms) fs_apply_latency(ms) 0 0 13 1 00 2 00 3 172
[ceph-users] rbd du
Hi all, The online manual (http://ceph.com/docs/master/man/8/rbd/) for rbd has documentation for the 'du' command. I'm running ceph 0.94.2 and that command isn't recognized, nor is it in the man page. Is there another command that will calculate the provisioned and actual disk usage of all images and associated snapshots within the specified pool?
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
HI Jan, On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote: I never actually set up iSCSI with VMware, I just had to research various VMware storage options when we had a SAN-probelm at a former job... But I can take a look at it again if you want me to. Thank you, I don't want to waste your time as I have asked Vmware TAP to research that - I will communicate back anything with which they respond. Is it realy deadlocked when this issue occurs? What I think is partly responsible for this situation is that the iSCSI LUN queues fill up and that's what actually kills your IO - VMware lowers queue depth to 1 in that situation and it can take a really long time to recover (especially if one of the LUNs on the target constantly has problems, or when heavy IO hammers the adapter) - you should never fill this queue, ever. iSCSI will likely be innocent victim in the chain, not the cause of the issues. Completely agreed, so iSCSI's job then is to properly communicate to the initiator that it cannot do what it is asked to do and quit the IO. Ceph should gracefully handle all those situations, you just need to set the timeouts right. I have it set so that whatever happens the OSD can only delay work for 40s and then it is marked down - at that moment all IO start flowing again. What setting in ceph do you use to do that? is that mon_osd_down_out_interval? I think stopping slow OSDs is the answer to the root of the problem - so far I only know to do ceph osd perf and look at latencies. You should take this to VMware support, they should be able to tell whether the problem is in iSCSI target (then you can take a look at how that behaves) or in the initiator settings. Though in my experience after two visits from their foremost experts I had to google everything myself because they were clueless - YMMV. I am hoping the TAP Elite team can do better...but we'll see... The root cause is however slow ops in Ceph, and I have no idea why you'd have them if the OSDs come back up - maybe one of them is really deadlocked or backlogged in some way? I found that when OSDs are dead but up they don't respond to ceph tell osd.xxx ... so try if they all respond in a timely manner, that should help pinpoint the bugger. I think I know in this case - there are some PCIe AER/Bus errors and TLP Header messages strewing across the console of one OSD machine - ceph osd perf showing latencies aboce a second per OSD, but only when IO is done to those OSDs. I am thankful this is not production storage, but worried of this situation in production - the OSDs are staying up and in, but their latencies are slowing clusterwide IO to a crawl. I am trying to envision this situation in production and how would one find out what is slowing everything down without guessing. Regards, Alex Jan On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote: This can be tuned in the iSCSI initiation on VMware - look in advanced settings on your ESX hosts (at least if you use the software initiator). Thanks, Jan. I asked this question of Vmware as well, I think the problem is specific to a given iSCSI session, so wondering if that's strictly the job of the target? Do you know of any specific SCSI settings that mitigate this kind of issue? Basically, give up on a session and terminate it and start a new one should an RBD not respond? As I understand, RBD simply never gives up. If an OSD does not respond but is still technically up and in, Ceph will retry IOs forever. 
I think RBD and Ceph need a timeout mechanism for this. Best regards, Alex Jan On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote: Hi Alex, Currently RBD+LIO+ESX is broken. The problem is caused by the RBD device not handling device aborts properly causing LIO and ESXi to enter a death spiral together. If something in the Ceph cluster causes an IO to take longer than 10 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, as you have seen it never recovers. Mike Christie from Redhat is doing a lot of work on this currently, so hopefully in the future there will be a direct RBD interface into LIO and it will all work much better. Either tgt or SCST seem to be pretty stable in testing. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev Sent: 23 August 2015 02:17 To: ceph-users ceph-users@lists.ceph.com Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs Hello, this is an issue we have been suffering from and researching along with a good number of other Ceph users, as evidenced by the recent posts. In our specific case, these issues manifest themselves in a RBD - iSCSI LIO - ESXi configuration, but the problem is more general. When there is an issue on OSD nodes (examples: network hangs/blips, disk HBAs
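For reference, Jan doesn't name his exact settings above, but the options usually involved in how quickly a silent OSD is marked down (as opposed to out) are the heartbeat ones; mon_osd_down_out_interval only controls how long after being marked down an OSD is also marked out and rebalancing starts. A hedged sketch:

   # ceph.conf, [osd] / [global] section:
   #   osd heartbeat interval = 6    # how often OSDs ping their peers
   #   osd heartbeat grace    = 20   # seconds of silence before peers report the OSD failed
   #
   # inspect the values actually in effect (run on the node hosting the OSD):
   ceph daemon osd.4 config show | grep -E 'heartbeat|down_out'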
[ceph-users] EXT4 for Production and Journal Question?
Building off a discussion earlier this month [1], how supported is EXT4 for OSDs? It seems that some people are getting good results with it and I'll be testing it in our environment. The other question is if the EXT4 journal is even necessary if you are using Ceph SSD journals. My thoughts are thus: Incoming I/O is written to the SSD journal. The journal then flushes to the EXT4 partition. Only after the write is completed (I understand that this is a direct sync write) does Ceph free the SSD journal entry. Doesn't this provide the same reliability as the EXT4 journal? If an OSD crashed in the middle of the write with no EXT4 journal, the file system would be repaired and then Ceph would rewrite the last transaction that didn't complete? I'm sure I'm missing something here... Thanks, [1] http://www.spinics.net/lists/ceph-users/msg20839.html Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
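For reference, the configuration being asked about (a filestore OSD on ext4 with the ext4 journal disabled) would be created roughly like this - shown purely to illustrate the question, not as a recommendation:

   mkfs.ext4 -O ^has_journal /dev/sdb1
   mount -o rw,noatime,user_xattr /dev/sdb1 /var/lib/ceph/osd/ceph-0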
[ceph-users] v9.0.3 released
This is the second to last batch of development work for the Infernalis cycle. The most intrusive change is an internal (non user-visible) change to the OSD's ObjectStore interface. Many fixes and improvements elsewhere across RGW, RBD, and another big pile of CephFS scrub/repair improvements. Upgrading - * The return code for librbd's rbd_aio_read and Image::aio_read API methods no longer returns the number of bytes read upon success. Instead, it returns 0 upon success and a negative value upon failure. * 'ceph scrub', 'ceph compact' and 'ceph sync force are now DEPRECATED. Users should instead use 'ceph mon scrub', 'ceph mon compact' and 'ceph mon sync force'. * 'ceph mon_metadata' should now be used as 'ceph mon metadata'. There is no need to deprecate this command (same major release since it was first introduced). * The `--dump-json` option of osdmaptool is replaced by `--dump json`. * The commands of pg ls-by-{pool,primary,osd} and pg ls now take recovering instead of recovery, to include the recovering pgs in the listed pgs. Notable Changes --- * autotools: fix out of tree build (Krxysztof Kosinski) * autotools: improve make check output (Loic Dachary) * buffer: add invalidate_crc() (Piotr Dalek) * buffer: fix zero bug (#12252 Haomai Wang) * build: fix junit detection on Fedora 22 (Ira Cooper) * ceph-disk: install pip 6.1 (#11952 Loic Dachary) * cephfs-data-scan: many additions, improvements (John Spray) * ceph: improve error output for 'tell' (#11101 Kefu Chai) * ceph-objectstore-tool: misc improvements (David Zafman) * ceph-objectstore-tool: refactoring and cleanup (John Spray) * ceph_test_rados: test pipelined reads (Zhiqiang Wang) * common: fix bit_vector extent calc (#12611 Jason Dillaman) * common: make work queue addition/removal thread safe (#12662 Jason Dillaman) * common: optracker improvements (Zhiqiang Wang, Jianpeng Ma) * crush: add --check to validate dangling names, max osd id (Kefu Chai) * crush: cleanup, sync with kernel (Ilya Dryomov) * crush: fix subtree base weight on adjust_subtree_weight (#11855 Sage Weil) * crypo: fix NSS leak (Jason Dillaman) * crypto: fix unbalanced init/shutdown (#12598 Zheng Yan) * doc: misc updates (Kefu Chai, Owen Synge, Gael Fenet-Garde, Loic Dachary, Yannick Atchy-Dalama, Jiaying Ren, Kevin Caradant, Robert Maxime, Nicolas Yong, Germain Chipaux, Arthur Gorjux, Gabriel Sentucq, Clement Lebrun, Jean-Remi Deveaux, Clair Massot, Robin Tang, Thomas Laumondais, Jordan Dorne, Yuan Zhou, Valentin Thomas, Pierre Chaumont, Benjamin Troquereau, Benjamin Sesia, Vikhyat Umrao) * erasure-code: cleanup (Kefu Chai) * erasure-code: improve tests (Loic Dachary) * erasure-code: shec: fix recovery bugs (Takanori Nakao, Shotaro Kawaguchi) * libcephfs: add pread, pwrite (Jevon Qiao) * libcephfs,ceph-fuse: cache cleanup (Zheng Yan) * librados: add src_fadvise_flags for copy-from (Jianpeng Ma) * librados: respect default_crush_ruleset on pool_create (#11640 Yuan Zhou) * librbd: fadvise for copy, export, import (Jianpeng Ma) * librbd: handle NOCACHE fadvise flag (Jinapeng Ma) * librbd: optionally disable allocation hint (Haomai Wang) * librbd: prevent race between resize requests (#12664 Jason Dillaman) * log: fix data corruption race resulting from log rotation (#12465 Samuel Just) * mds: expose frags via asok (John Spray) * mds: fix setting entire file layout in one setxattr (John Spray) * mds: fix shutdown (John Spray) * mds: handle misc corruption issues (John Spray) * mds: misc fixes (Jianpeng Ma, Dan van der Ster, Zhang Zhi) * mds: misc snap fixes (Zheng 
Yan) * mds: store layout on header object (#4161 John Spray) * misc performance and cleanup (Nathan Cutler, Xinxin Shu) * mon: add NOFORWARD, OBSOLETE, DEPRECATE flags for mon commands (Joao Eduardo Luis) * mon: add PG count to 'ceph osd df' output (Michal Jarzabek) * mon: clean up, reorg some mon commands (Joao Eduardo Luis) * mon: disallow 2 tiers (#11840 Kefu Chai) * mon: fix log dump crash when debugging (Mykola Golub) * mon: fix metadata update race (Mykola Golub) * mon: fix refresh (#11470 Joao Eduardo Luis) * mon: make blocked op messages more readable (Jianpeng Ma) * mon: only send mon metadata to supporting peers (Sage Weil) * mon: periodic background scrub (Joao Eduardo Luis) * mon: prevent pgp_num pg_num (#12025 Xinxin Shu) * mon: reject large max_mds values (#1 John Spray) * msgr: add ceph_perf_msgr tool (Hoamai Wang) * msgr: async: fix seq handling (Haomai Wang) * msgr: xio: fastpath improvements (Raju Kurunkad) * msgr: xio: sync with accellio v1.4 (Vu Pham) * osd: clean up temp object if promotion fails (Jianpeng Ma) * osd: constrain collections to meta and PGs (normal and temp) (Sage Weil) * osd: filestore: clone using splice (Jianpeng Ma) * osd: filestore: fix
Re: [ceph-users] OSD GHz vs. Cores Question
-BEGIN PGP SIGNED MESSAGE- Hash: SHA256 Thanks to all the responses. There has been more to think about which is what I was looking for. We have MySQL running on this cluster so we will have some VMs with fairly low queue depths. Our Ops teams are not excited about unplugging cables and pulling servers to replace fixed disks, so we are looking at hot swap options. I'll try and do some testing in our lab, but I won't be able to get a very good spread of data due to clock and core limitations in the existing hardware. - Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Sat, Aug 22, 2015 at 2:42 PM, Luis Periquito wrote: I've been meaning to write an email with the experience we had at the company I work. For the lack of a more complete one I'll just tell some of the findings. Please note these are my experiences, and are correct for my environment. The clients are running on openstack, and all servers are trusty. Tests were made with Hammer (0.94.2). TLDR: if performance is your objective buy 1S boxes with high frequency, good journal SSDs, and not many SSDs. Also change the cpu to performance mode, instead the default ondemand. And don't forget 10Gig is a must. Replicated pools are also a must for performance. We wanted to have a small cluster (30TB RAW), performance was important (IOPS and latency), network was designed to be 10G copper with BGP attached hosts. There was complete leeway in design and some in budget. Starting with the network that required us to only create a single network, but both links are usable - iperf between boxes is usually around 17-19Gbits. We could choose the nodes, we evaluated dual cpu and single cpu nodes. The dual cpus would have 24 2.5'' drive bays on a 2U chassis whereas the single were 8 2.5'' drive bays on a 1U chassis. Long story short we chose the single cpu (E3 1241 v3). On the CPU all the tests we did with the scaling governors shown that performance would give us a 30-50% boost in IOPS. Latency also improved but not by much. The downside was that each system increased power usage by 5W (!?). For the difference in price (£80) we bought the boxes with 32G of ram. As for the disks, as we wanted fast IO we had to go with SSDs. Due to the budget we had we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We also tested the P3600, but one of the critical IO clients had far worse performance with it. From benchmarking the write performance is that of the Intel SSD. We made tests with Intel SSD with journal + different Intel SSD with data and performance was within margin for error the same that Intel SSD for journal + Samsung SSD for data. Single SSD performance was slightly lower with either one (around 10%). From what I've seen: on very big sequential read and write I can get up to 700-800 MBps. On random IO (8k, random writes, reads or mixed workloads) we still haven't finished all the tests, but so far it indicates the SSDs are the bottleneck on the writes, and ceph latency on the reads. However we've been able to extract 400 MBps read IO with 4 clients, each doing 32 threads. I don't have the numbers here but that represents around 50k IOPS out of a smallish cluster. Stuff we still have to do revolves around jemalloc vs tcmalloc - trusty has the bug on the thread cache bytes variable. Also we still have to test various tunable options, like threads, caches, etc... Hope this helps. On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk wrote: Another thing that is probably worth considering is the practical side as well. 
A lot of the Xeon E5 boards tend to have more SAS/SATA ports and onboard 10GB, this can make quite a difference to the overall cost of the solution if you need to buy extra PCI-E cards. Unless I've missed one, I've not spotted a Xeon-D board with a large amount of onboard sata/sas ports. Please let me know if such a system exists as I would be very interested. We settled on the Hadoop version of the Supermicro Fat Twin. 12 x 3.5 disks + 2x 2.5 SSD's per U, onboard 10GB-T and the fact they share chassis and PSU's keeps the price down. For bulk storage one of these with a single 8 core low clocked E5 Xeon is ideal in my mind. I did a spreadsheet working out U space, power and cost per GB for several different types of server, this solution came out ahead in nearly every category. If there is a requirement for a high perf SSD tier I would probably look at dedicated SSD nodes as I doubt you could cram enough CPU power into a single server to drive 12xSSD's. You mentioned low latency was a key requirement, is this always going to be at low queue depths? If you just need very low latency but won't actually be driving the SSD's very hard you will probably find a very highly clocked E3 is the best bet with 2-4 SSD's per node. However if you drive the SSD's hard, a single one can easily max out several cores.
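For reference, the governor change Luis mentions (performance instead of the default ondemand) looks like this on most Linux distributions; run as root on each OSD node:

   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # current governor
   for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
       echo performance > "$g"
   done
   # equivalently: cpupower frequency-set -g performance (if cpupower is installed)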
Re: [ceph-users] rbd du
That rbd CLI command is a new feature that will be included with the upcoming infernalis release. In the meantime, you can use this approach [1] to estimate your RBD image usage. [1] http://ceph.com/planet/real-size-of-a-ceph-rbd-image/ -- Jason Dillaman Red Hat Ceph Storage Engineering dilla...@redhat.com http://www.redhat.com - Original Message - From: Allen Liao aliao.svsga...@gmail.com To: ceph-users@lists.ceph.com Sent: Monday, August 24, 2015 1:03:03 PM Subject: [ceph-users] rbd du Hi all, The online manual ( http://ceph.com/docs/master/man/8/rbd/ ) for rbd has documentation for the 'du' command. I'm running ceph 0.94.2 and that command isn't recognized, nor is it in the man page. Is there another command that will calculate the provisioned and actual disk usage of all images and associated snapshots within the specified pool?
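For reference, the workaround in the linked post boils down to summing the extents reported by rbd diff. A rough per-image example (pool and image names are made up); provisioned size comes from rbd info, and the whole pool can be covered by looping over rbd ls:

   rbd diff rbd/myimage | awk '{ used += $2 } END { print used/1024/1024 " MB used" }'
   rbd info rbd/myimage   # shows the provisioned size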
Re: [ceph-users] EXT4 for Production and Journal Question?
On 24/08/2015 19:34, Robert LeBlanc wrote: Building off a discussion earlier this month [1], how supported is EXT4 for OSDs? It seems that some people are getting good results with it and I'll be testing it in our environment. The other question is if the EXT4 journal is even necessary if you are using Ceph SSD journals. My thoughts are thus: Incoming I/O is written to the SSD journal. The journal then flushes to the EXT4 partition. Only after the write is completed (I understand that this is a direct sync write) does Ceph free the SSD journal entry. Doesn't this provide the same reliability as the EXT4 journal? If an OSD crashed in the middle of the write with no EXT4 journal, the file system would be repaired and then Ceph would rewrite the last transaction that didn't complete? I'm sure I'm missing something here... I didn't try this configuration but what you're missing is probably: - the file system recovery time when there's no journal available. e2fsck on large filesystems can be long and may need user interaction. You don't want that if you just had a cluster-wide (or even partial, but involving tens of disks some of which might be needed to reach min_size) power failure. - the less tested behaviour: I'm not sure there's even a guarantee from ext4 without a journal that e2fsck can recover properly after a crash (i.e. with data consistent with the Ceph journal). Lionel
Re: [ceph-users] Testing CephFS
Hi Greg, The MDS' detect that the other one went down and started the replay. I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error: [Aug24 10:53] ceph: mds0 caps stale [Aug24 10:54] ceph: mds0 caps stale [Aug24 10:58] ceph: mds0 hung [Aug24 11:03] ceph: mds0 came back [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN) [ +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon [Aug24 11:04] ceph: mds0 reconnect start [ +0.084938] libceph: mon2 10.15.0.3:6789 session established [ +0.008475] ceph: mds0 reconnect denied 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable. This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping): 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby *cable plugged back in* 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 
10.15.0.3:6833/16993 up:boot 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45) 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45) 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45) 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed interval 45) 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed interval 45) I did just notice that none of the times match up. So may try again once I fix ntp/chrony and see if that makes a difference. Cheers, Simon -Original Message- From: Gregory Farnum [mailto:gfar...@redhat.com] Sent: 21 August 2015 12:16 To: Simon Hallam Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Testing CephFS On Thu, Aug 20, 2015 at 11:07 AM, Simon Hallam s...@pml.ac.uk wrote: Hey all, We are currently testing CephFS on a small (3 node) cluster. The setup is currently: Each server has 12 OSDs, 1 Monitor and 1 MDS running on it: The servers are running: 0.94.2-0.el7 The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64 ceph -s cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd health HEALTH_OK monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} election epoch 20, quorum 0,1,2 ceph1,ceph2,ceph3 mdsmap e12: 1/1/1 up {0=ceph3=up:active}, 2 up:standby osdmap e389: 36 osds: 36 up, 36 in pgmap v19370: 8256 pgs, 3 pools, 51217 MB data, 14035 objects 95526 MB used, 196 TB / 196 TB
Re: [ceph-users] Testing CephFS
On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam s...@pml.ac.uk wrote: Hi Greg, The MDS' detect that the other one went down and started the replay. I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error: [Aug24 10:53] ceph: mds0 caps stale [Aug24 10:54] ceph: mds0 caps stale [Aug24 10:58] ceph: mds0 hung [Aug24 11:03] ceph: mds0 came back [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN) [ +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon [Aug24 11:04] ceph: mds0 reconnect start [ +0.084938] libceph: mon2 10.15.0.3:6789 session established [ +0.008475] ceph: mds0 reconnect denied Oh, this might be a kernel bug, failing to ask for mdsmap updates when the connection goes away. Zheng, does that sound familiar? -Greg 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable. This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping): 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby *cable plugged back in* 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 
10.15.0.3:6833/16993 up:boot 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45) 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45) 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45) 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed interval 45) 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed interval 45) I did just notice that none of the times match up. So may try again once I fix ntp/chrony and see if that makes a difference. Cheers, Simon -Original Message- From: Gregory Farnum [mailto:gfar...@redhat.com] Sent: 21 August 2015 12:16 To: Simon Hallam Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Testing CephFS On Thu, Aug 20, 2015 at 11:07 AM, Simon Hallam s...@pml.ac.uk wrote: Hey all, We are currently testing CephFS on a small (3 node) cluster. The setup is currently: Each server has 12 OSDs, 1 Monitor and 1 MDS running on it: The servers are running: 0.94.2-0.el7 The clients are running: Ceph: 0.80.10-1.fc21, Kernel: 4.0.6-200.fc21.x86_64 ceph -s cluster 4ed5ecdd-0c5b-4422-9d99-c9e42c6bd4cd health HEALTH_OK monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0}
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
This can be tuned in the iSCSI initiation on VMware - look in advanced settings on your ESX hosts (at least if you use the software initiator). Jan On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote: Hi Alex, Currently RBD+LIO+ESX is broken. The problem is caused by the RBD device not handling device aborts properly causing LIO and ESXi to enter a death spiral together. If something in the Ceph cluster causes an IO to take longer than 10 seconds(I think!!!) ESXi submits an iSCSI abort message. Once this happens, as you have seen it never recovers. Mike Christie from Redhat is doing a lot of work on this currently, so hopefully in the future there will be a direct RBD interface into LIO and it will all work much better. Either tgt or SCST seem to be pretty stable in testing. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev Sent: 23 August 2015 02:17 To: ceph-users ceph-users@lists.ceph.com Subject: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs Hello, this is an issue we have been suffering from and researching along with a good number of other Ceph users, as evidenced by the recent posts. In our specific case, these issues manifest themselves in a RBD - iSCSI LIO - ESXi configuration, but the problem is more general. When there is an issue on OSD nodes (examples: network hangs/blips, disk HBAs failing, driver issues, page cache/XFS issues), some OSDs respond slowly or with significant delays. ceph osd perf does not show this, neither does ceph osd tree, ceph -s / ceph -w. Instead, the RBD IO hangs to a point where the client times out, crashes or displays other unsavory behavior - operationally this crashes production processes. Today in our lab we had a disk controller issue, which brought an OSD node down. Upon restart, the OSDs started up and rejoined into the cluster. However, immediately all IOs started hanging for a long time and aborts from ESXi - LIO were not succeeding in canceling these IOs. The only warning I could see was: root@lab2-mon1:/var/log/ceph# ceph health detail HEALTH_WARN 30 requests are blocked 32 sec; 1 osds have slow requests 30 ops are blocked 2097.15 sec 30 ops are blocked 2097.15 sec on osd.4 1 osds have slow requests However, ceph osd perf is not showing high latency on osd 4: root@lab2-mon1:/var/log/ceph# ceph osd perf osd fs_commit_latency(ms) fs_apply_latency(ms) 0 0 13 1 00 2 00 3 172 208 4 00 5 00 6 01 7 00 8 174 819 9 6 10 10 01 11 01 12 35 13 01 14 7 23 15 01 16 00 17 59 18 01 1910 18 20 00 21 00 22 01 23 5 10 SMART state for osd 4 disk is OK. The OSD in up and in: root@lab2-mon1:/var/log/ceph# ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -80 root ssd -7 14.71997 root platter -3 7.12000 host croc3 22 0.89000 osd.22 up 1.0 1.0 15 0.89000 osd.15 up 1.0 1.0 16 0.89000 osd.16 up 1.0 1.0 13 0.89000 osd.13 up 1.0 1.0 18 0.89000 osd.18 up 1.0 1.0 8 0.89000 osd.8 up 1.0 1.0 11 0.89000 osd.11 up 1.0 1.0 20 0.89000 osd.20 up 1.0 1.0 -4 0.47998 host croc2 10 0.06000 osd.10 up 1.0 1.0 12 0.06000 osd.12 up 1.0 1.0 14 0.06000 osd.14 up 1.0 1.0 17 0.06000 osd.17 up 1.0 1.0 19 0.06000 osd.19 up 1.0 1.0 21 0.06000 osd.21 up 1.0 1.0 9 0.06000 osd.9 up 1.0 1.0 23 0.06000 osd.23 up 1.0 1.0 -2 7.12000 host croc1 7 0.89000 osd.7 up 1.0
Re: [ceph-users] ceph osd debug question / proposal
I'm not talking about IO happening, I'm talking about file descriptors staying open. If they weren't open you could umount it without the -l. Once you hit the OSD again all those open files will start working and if more need to be opened it will start looking for them... Jan On 24 Aug 2015, at 03:07, Goncalo Borges gonc...@physics.usyd.edu.au wrote: Hi Jan... Thank for the reply. Yes, I did an 'umount -l' but I was sure that no I/O was happening at the time. So, I was almost 100% sure that there were no real incoherence in terms of open files in the OS. On 08/20/2015 07:31 PM, Jan Schermer wrote: Just to clarify - you unmounted the filesystem with umount -l? That almost never a good idea, and it puts the OSD in a very unusual situation where IO will actually work on the open files, but it can't open any new ones. I think this would be enough to confuse just about any piece of software. Yes, I did an 'umount -l' but I was sure that no I/O was happening at the time. So, I was almost 100% sure that there were no real incoherence in terms of open files in the OS. Was journal on the filesystem or on a separate partition/device? The journal in on the same disk, but in a different partition. It's not the same as R/O filesystem (I hit that once and no such havoc happened), in my experience the OSD traps and exits when something like that happens. It would be interesting to know what would happen if you just did rm -rf /var/lib/ceph/osd/ceph-4/current/* - that could be an equivalent to umount -l, more or less :-) Will try that today and report back here. Cheers Goncalo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Testing CephFS
On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote: On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam s...@pml.ac.uk wrote: Hi Greg, The MDS' detect that the other one went down and started the replay. I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error: [Aug24 10:53] ceph: mds0 caps stale [Aug24 10:54] ceph: mds0 caps stale [Aug24 10:58] ceph: mds0 hung [Aug24 11:03] ceph: mds0 came back [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN) [ +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon [Aug24 11:04] ceph: mds0 reconnect start [ +0.084938] libceph: mon2 10.15.0.3:6789 session established [ +0.008475] ceph: mds0 reconnect denied Oh, this might be a kernel bug, failing to ask for mdsmap updates when the connection goes away. Zheng, does that sound familiar? -Greg This seems like reconnect timeout. you can try enlarging mds_reconnect_timeout config option. Which version of kernel are you using? Yan, Zheng 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable. This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping): 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby *cable plugged back in* 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 
10.15.0.3:6833/16993 up:boot 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45) 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45) 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45) 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed interval 45) 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed interval 45) I did just notice that none of the times match up. So may try again once I fix ntp/chrony and see if that makes a difference. Cheers, Simon -Original Message- From: Gregory Farnum [mailto:gfar...@redhat.com] Sent: 21 August 2015 12:16 To: Simon Hallam Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Testing CephFS On Thu, Aug 20, 2015 at 11:07 AM, Simon Hallam s...@pml.ac.uk wrote: Hey all, We are currently testing CephFS on a small (3 node) cluster. The setup is currently: Each server has 12 OSDs, 1 Monitor and 1 MDS running on it: The servers are running: 0.94.2-0.el7 The clients are running: Ceph: 0.80.10-1.fc21, Kernel:
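For reference, a minimal sketch of Zheng's suggestion - the MDS name and the 120 s value are assumptions, and the 45 s shown as the allowed interval in the denied-reconnect messages appears to be the default window - plus a quick clock check, since mismatched timestamps and ntp/chrony were mentioned above:

  # runtime change on the active MDS (name 'ceph2' is an assumption); the
  # persistent form is 'mds reconnect timeout = 120' under [mds] in ceph.conf
  ceph tell mds.ceph2 injectargs '--mds_reconnect_timeout 120'
  # clock sanity checks before re-testing (chrony and ntpd variants)
  ceph health detail | grep -i clock
  chronyc tracking
  ntpq -p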
Re: [ceph-users] Testing CephFS
The clients are: [root@gridnode50 ~]# uname -a Linux gridnode50 4.0.8-200.fc21.x86_64 #1 SMP Fri Jul 10 21:09:54 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux [root@gridnode50 ~]# ceph -v ceph version 0.80.10 (ea6c958c38df1216bf95c927f143d8b13c4a9e70) I don't think it is a reconnect timeout, as they don't even attempt to reconnect until I plug the Ethernet cable back into the original MDS? Cheers, Simon -Original Message- From: Yan, Zheng [mailto:z...@redhat.com] Sent: 24 August 2015 12:28 To: Simon Hallam Cc: ceph-users@lists.ceph.com; Gregory Farnum Subject: Re: [ceph-users] Testing CephFS On Aug 24, 2015, at 18:38, Gregory Farnum gfar...@redhat.com wrote: On Mon, Aug 24, 2015 at 11:35 AM, Simon Hallam s...@pml.ac.uk wrote: Hi Greg, The MDS' detect that the other one went down and started the replay. I did some further testing with 20 client machines. Of the 20 client machines, 5 hung with the following error: [Aug24 10:53] ceph: mds0 caps stale [Aug24 10:54] ceph: mds0 caps stale [Aug24 10:58] ceph: mds0 hung [Aug24 11:03] ceph: mds0 came back [ +8.803334] libceph: mon2 10.15.0.3:6789 socket closed (con state OPEN) [ +0.18] libceph: mon2 10.15.0.3:6789 session lost, hunting for new mon [Aug24 11:04] ceph: mds0 reconnect start [ +0.084938] libceph: mon2 10.15.0.3:6789 session established [ +0.008475] ceph: mds0 reconnect denied Oh, this might be a kernel bug, failing to ask for mdsmap updates when the connection goes away. Zheng, does that sound familiar? -Greg This seems like reconnect timeout. you can try enlarging mds_reconnect_timeout config option. Which version of kernel are you using? Yan, Zheng 10.15.0.3 was the active MDS at the time I unplugged the Ethernet cable. This was the output of ceph -w as I ran the test (I've removed a lot of the pg remapping): 2015-08-24 11:02:39.547529 mon.1 [INF] mon.ceph2 calling new monitor election 2015-08-24 11:02:40.011995 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:02:45.245869 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1 2015-08-24 11:02:45.257440 mon.0 [INF] HEALTH_WARN; 1 mons down, quorum 0,1 ceph1,ceph2 2015-08-24 11:02:45.535369 mon.0 [INF] monmap e1: 3 mons at {ceph1=10.15.0.1:6789/0,ceph2=10.15.0.2:6789/0,ceph3=10.15.0.3:6789/0} 2015-08-24 11:02:45.535444 mon.0 [INF] pgmap v15803: 8256 pgs: 8256 active+clean; 1248 GB data, 2503 GB used, 193 TB / 196 TB avail; 47 B/s wr, 0 op/s 2015-08-24 11:02:45.535541 mon.0 [INF] mdsmap e38: 1/1/1 up {0=ceph3=up:active}, 2 up:standby 2015-08-24 11:02:45.535629 mon.0 [INF] osdmap e197: 36 osds: 36 up, 36 in 2015-08-24 11:03:01.946397 mon.0 [INF] mdsmap e39: 1/1/1 up {0=ceph2=up:replay}, 1 up:standby 2015-08-24 11:03:02.993880 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:reconnect 2015-08-24 11:03:02.993930 mon.0 [INF] mdsmap e40: 1/1/1 up {0=ceph2=up:reconnect}, 1 up:standby 2015-08-24 11:03:51.461248 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:rejoin 2015-08-24 11:03:55.807131 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:03:55.807195 mon.0 [INF] mdsmap e42: 1/1/1 up {0=ceph2=up:active}, 1 up:standby 2015-08-24 11:06:48.036736 mon.0 [INF] mds.0 10.15.0.2:6849/17644 up:active 2015-08-24 11:06:48.036799 mon.0 [INF] mdsmap e43: 1/1/1 up {0=ceph2=up:active}, 1 up:standby *cable plugged back in* 2015-08-24 11:13:13.230714 mon.0 [INF] osd.32 10.15.0.3:6832/11565 boot 2015-08-24 11:13:13.230765 mon.0 [INF] osdmap e212: 36 osds: 25 up, 25 in 2015-08-24 11:13:13.230809 mon.0 [INF] mds.? 
10.15.0.3:6833/16993 up:boot 2015-08-24 11:13:13.230837 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:30.799429 mon.2 [INF] mon.ceph3 calling new monitor election 2015-08-24 11:13:30.826158 mon.0 [INF] mon.ceph1 calling new monitor election 2015-08-24 11:13:30.926331 mon.0 [INF] mon.ceph1@0 won leader election with quorum 0,1,2 2015-08-24 11:13:30.968739 mon.0 [INF] mdsmap e47: 1/1/1 up {0=ceph2=up:active}, 2 up:standby 2015-08-24 11:13:28.383203 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24155 10.10.10.95:0/3238635414 after 625.375507 (allowed interval 45) 2015-08-24 11:13:29.721653 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24146 10.10.10.99:0/3454703638 after 626.713952 (allowed interval 45) 2015-08-24 11:13:31.113004 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24140 10.10.10.60:0/359606080 after 628.105302 (allowed interval 45) 2015-08-24 11:13:50.933020 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24152 10.10.10.67:0/3475305031 after 647.925323 (allowed interval 45) 2015-08-24 11:13:51.037681 mds.0 [INF] denied reconnect attempt (mds is up:active) from client.24149 10.10.10.68:0/22416725 after 648.029988 (allowed interval 45) I did just notice that none of the
Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev Sent: 24 August 2015 18:06 To: Jan Schermer j...@schermer.cz Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] Slow responding OSDs are not OUTed and cause RBD client IO hangs Hi Jan, On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer j...@schermer.cz wrote: I never actually set up iSCSI with VMware, I just had to research various VMware storage options when we had a SAN problem at a former job... But I can take a look at it again if you want me to. Thank you, I don't want to waste your time as I have asked VMware TAP to research that - I will communicate back anything with which they respond. Is it really deadlocked when this issue occurs? What I think is partly responsible for this situation is that the iSCSI LUN queues fill up and that's what actually kills your IO - VMware lowers the queue depth to 1 in that situation and it can take a really long time to recover (especially if one of the LUNs on the target constantly has problems, or when heavy IO hammers the adapter) - you should never fill this queue, ever. iSCSI will likely be an innocent victim in the chain, not the cause of the issues. Completely agreed, so iSCSI's job then is to properly communicate to the initiator that it cannot do what it is asked to do and quit the IO. It's not a queue full or queue throttling issue. ESXi detects a slow IO (which I believe is when an IO takes longer than 10 seconds) and then tries to send an abort message to the target so it can retry. However, the RBD client doesn't handle the abort message passed to it from LIO. I'm not quite sure what happens next, but between LIO and ESXi neither makes the decision to ignore the abort, and so both enter a standoff with each other. Ceph should gracefully handle all those situations, you just need to set the timeouts right. I have it set so that whatever happens the OSD can only delay work for 40s and then it is marked down - at that moment all IO starts flowing again. What setting in ceph do you use to do that? Is that mon_osd_down_out_interval? I think stopping slow OSDs is the answer to the root of the problem - so far I only know to do ceph osd perf and look at latencies. You can maybe adjust some of the timeouts to make Ceph pause for less time, to hopefully make sure all IO is processed in under 10s, but you increase the risk of OSDs randomly dropping out and there are probably still quite a few cases where IO could still take longer than 10s. You should take this to VMware support, they should be able to tell whether the problem is in the iSCSI target (then you can take a look at how that behaves) or in the initiator settings. Though in my experience, after two visits from their foremost experts I had to google everything myself because they were clueless - YMMV. I am hoping the TAP Elite team can do better...but we'll see... The root cause is however slow ops in Ceph, and I have no idea why you'd have them if the OSDs come back up - maybe one of them is really deadlocked or backlogged in some way? I found that when OSDs are dead but up they don't respond to ceph tell osd.xxx ... so check whether they all respond in a timely manner, that should help pinpoint the bugger. I think I know in this case - there are some PCIe AER/Bus errors and TLP Header messages strewn across the console of one OSD machine - ceph osd perf is showing latencies above a second per OSD, but only when IO is done to those OSDs.
I am thankful this is not production storage, but worried about this situation in production - the OSDs are staying up and in, but their latencies are slowing clusterwide IO to a crawl. I am trying to envision this situation in production and how one would find out what is slowing everything down without guessing. Regards, Alex Jan On 24 Aug 2015, at 18:26, Alex Gorbachev a...@iss-integration.com wrote: This can be tuned in the iSCSI initiator on VMware - look in advanced settings on your ESX hosts (at least if you use the software initiator). Thanks, Jan. I asked this question of VMware as well. I think the problem is specific to a given iSCSI session, so I am wondering if that's strictly the job of the target? Do you know of any specific SCSI settings that mitigate this kind of issue? Basically, give up on a session, terminate it and start a new one should an RBD not respond? As I understand, RBD simply never gives up. If an OSD does not respond but is still technically up and in, Ceph will retry IOs forever. I think RBD and Ceph need a timeout mechanism for this. Best regards, Alex Jan On 23 Aug 2015, at 21:28, Nick Fisk n...@fisk.me.uk wrote: Hi Alex, Currently RBD+LIO+ESX is broken. The problem is caused by the RBD device not handling device aborts properly causing LIO
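For what it's worth, a hedged sketch of the knobs being discussed - the values are examples, not recommendations, and setting them too low will make healthy OSDs flap: the heartbeat grace controls how quickly peers report an unresponsive OSD down, mon_osd_down_out_interval only controls how long a down OSD stays 'in', and ceph osd perf is the quickest way to spot the disk throwing the PCIe errors:

  # runtime injection shown; persistent settings belong in ceph.conf
  ceph tell osd.* injectargs '--osd_heartbeat_grace 40'              # report a silent OSD down after ~40s
  ceph tell mon.ceph1 injectargs '--mon_osd_down_out_interval 300'   # mon name is an assumption
  ceph osd perf                                                      # per-OSD commit/apply latency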
Re: [ceph-users] TRIM / DISCARD run at low priority by the OSDs?
Hi Alexandre, Thanks for the note. I was not clear enough. The fstrim I was running was only on the krbd mountpoints. The backend OSDs only have standard hard disks, not SSDs, so they don't need to be trimmed. Instead I was reclaiming free space as reported by Ceph. Running fstrim on the rbd mountpoints caused the OSDs to become very busy, affecting all rbds, not just those being trimmed. I was hoping someone had an idea of how to make the OSDs not become busy while running fstrim on the rbd mountpoints. E.g. if Ceph made a distinction between trim operations on RBDs and other types, it could give those operations lower priority. Thanks again! Chad. On Monday, August 24, 2015 18:26:30 you wrote: Hi, I'm not sure about krbd, but with librbd, using trim/discard on the client doesn't do trim/discard on the OSD physical disk; it simply writes zeroes in the rbd image. Zero writes can be skipped since this commit (librbd related) https://github.com/xiaoxichen/ceph/commit/e7812b8416012141cf8faef577e7b27e1b29d5e3 +OPTION(rbd_skip_partial_discard, OPT_BOOL, false) Then you can still manage fstrim manually on the osd servers - Original Message - From: Chad William Seys cws...@physics.wisc.edu To: ceph-users ceph-us...@ceph.com Sent: Saturday, 22 August 2015 04:26:38 Subject: [ceph-users] TRIM / DISCARD run at low priority by the OSDs? Hi All, Is it possible to give TRIM / DISCARD initiated by krbd low priority on the OSDs? I know it is possible to run fstrim at Idle priority on the rbd mount point, e.g. ionice -c Idle fstrim -v $MOUNT. But this Idle priority (it appears) only applies within the context of the node executing fstrim. Even if the node executing fstrim runs it at Idle priority, the OSDs become very busy and performance suffers. Is it possible to tell the OSD daemons (or whatever) to perform the TRIMs at low priority also? Thanks! Chad.
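There is no OSD-side priority knob for client-initiated discards that I know of, so one workaround sketch is to spread the trim out from the client side, trimming the mountpoint in fixed-size chunks with a pause between them. The chunk size, sleep and mountpoint below are arbitrary assumptions to tune; fstrim's -o/-l options take a byte offset and length:

  #!/bin/bash
  # Hedged sketch: chunked fstrim so the OSDs absorb deletes in small batches.
  MOUNT=/mnt/rbd0                       # assumption: the krbd mountpoint
  CHUNK=$((10 * 1024 * 1024 * 1024))    # trim 10 GiB per pass
  SIZE=$(df -B1 --output=size "$MOUNT" | tail -1)
  for ((off = 0; off < SIZE; off += CHUNK)); do
      ionice -c 3 fstrim -v -o "$off" -l "$CHUNK" "$MOUNT"   # idle I/O class on the client
      sleep 30                          # give the OSDs time between batches
  done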
[ceph-users] Opensource plugin for pulling out cluster recovery and client IO metric
Hello Ceph Geeks I am planning to develop a python plugin that pulls out cluster *recovery IO* and *client IO* operation metrics, which can then be used with collectd. *For example, I need to take out these values* *recovery io 814 MB/s, 101 objects/s* *client io 85475 kB/s rd, 1430 kB/s wr, 32 op/s* Could you please help me understand how the *ceph -s* and *ceph -w* outputs *print cluster recovery IO and client IO information*? Where is this information coming from? *Is it coming from perf dump*? If yes, which section of the perf dump output should I focus on? If not, how can I get these values? I tried *ceph --admin-daemon /var/run/ceph/ceph-osd.48.asok perf dump*, but it generates a huge amount of information and I am confused about which section of the output I should use. Please help. Thanks in advance
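Not an answer on the perf dump side, but those two lines are part of the pgmap summary the monitors build from PG stats, and the same counters show up in the JSON form of ceph -s, which is probably easier for a collectd plugin to parse. A hedged sketch (the field names are what I recall from hammer-era output and may differ by version; jq is assumed to be installed):

  ceph -s --format json-pretty | less    # eyeball the full structure first
  # the rate fields live under .pgmap and only appear while there is activity
  ceph -s --format json | jq '.pgmap | {read_bytes_sec, write_bytes_sec, op_per_sec, recovering_bytes_per_sec, recovering_objects_per_sec}'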
Re: [ceph-users] ceph osd debug question / proposal
Hi Jan... We were interested in the situation where an rm -Rf is done in the current directory of the OSD. Here are my findings: 1. In this exercise, we simply deleted all the content of /var/lib/ceph/osd/ceph-23/current. # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) /dev/sdj1 2918054776434548 2917620228 1% /var/lib/ceph/osd/ceph-23 2. After some time, ceph enters in error state because it thinks it has an inconsistent PG and several scrub errors # ceph -s cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc health HEALTH_ERR 1 pgs inconsistent 1850 scrub errors monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0} election epoch 24, quorum 0,1,2 mon1,mon3,mon2 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay osdmap e1903: 32 osds: 32 up, 32 in pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects 14424 GB used, 74627 GB / 89051 GB avail 2175 active+clean 1 active+clean+inconsistent client io 989 B/s rd, 1 op/s 3. Looking to ceph.log in the mon, it is possible to check which is the PG affected and which OSD is responsible for the error: # tail -f /var/log/ceph/ceph.log (...) 2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: *5.336 shard 23* missing e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336 shard 23* missing dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps: 5.336 shard 23 missing cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps: 5.336 shard 23 missing e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps: 5.336 shard 23 missing 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps: 5.336 shard 23 missing e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps: 5.336 shard 23 missing 4eda0336/100019e.36ad/head//5 (...) 2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects 2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR] 5.336 scrub 1850 errors 4. We have tried to restart the problematic osd, but that fails. # /etc/init.d/ceph stop osd.23 === osd.23 === Stopping Ceph osd.23 on osd3...done [root@osd3 ~]# /etc/init.d/ceph start osd.23 === osd.23 === create-or-move updated item name 'osd.23' weight 2.72 at location {host=osd3,root=default} to crush map Starting Ceph osd.23 on osd3... 
starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23 /var/lib/ceph/osd/ceph-23/journal # tail -f /var/log/ceph/ceph-osd.23.log 2015-08-24 11:48:12.189322 7fa24d85d800 0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 7266 2015-08-24 11:48:12.389747 7fa24d85d800 0 filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic 0x58465342) 2015-08-24
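As the follow-up below notes, the cluster recovered once the problematic OSD was tagged down and out; a minimal sketch of doing that by hand (osd.23 from this test; re-creating the emptied OSD afterwards is a separate step):

  /etc/init.d/ceph stop osd.23   # stop the damaged daemon if it is still running
  ceph osd down 23               # usually redundant once the daemon is stopped
  ceph osd out 23                # re-replicate its PGs onto the surviving OSDs
  ceph -w                        # watch recovery until the PGs are active+clean again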
Re: [ceph-users] ceph osd debug question / proposal
Hope nobody never does that. Anyway that's good to know in case of disaster recovery. Thank you! Shinobu On Tue, Aug 25, 2015 at 12:10 PM, Goncalo Borges gonc...@physics.usyd.edu.au wrote: Hi Shinobu Human mistake, for example :-) Not very frequent, but it happens. Nevertheless, the idea is to test ceph against different DC scenarios, triggered by different problems. On this particular situation, the cluster recovered ok ONCE the problematic OSD daemon was tagged as 'down' and 'out' Cheers Goncalo On 08/25/2015 01:06 PM, Shinobu wrote: So what is the situation where you need to do: # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) I'm quite sure that is not normal. Shinobu On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges gonc...@physics.usyd.edu.augonc...@physics.usyd.edu.au wrote: Hi Jan... We were interested in the situation where an rm -Rf is done in the current directory of the OSD. Here are my findings: 1. In this exercise, we simply deleted all the content of /var/lib/ceph/osd/ceph-23/current. # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) /dev/sdj1 2918054776434548 2917620228 1% /var/lib/ceph/osd/ceph-23 2. After some time, ceph enters in error state because it thinks it has an inconsistent PG and several scrub errors # ceph -s cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc health HEALTH_ERR 1 pgs inconsistent 1850 scrub errors monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0} election epoch 24, quorum 0,1,2 mon1,mon3,mon2 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay osdmap e1903: 32 osds: 32 up, 32 in pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects 14424 GB used, 74627 GB / 89051 GB avail 2175 active+clean 1 active+clean+inconsistent client io 989 B/s rd, 1 op/s 3. Looking to ceph.log in the mon, it is possible to check which is the PG affected and which OSD is responsible for the error: # tail -f /var/log/ceph/ceph.log (...) 
2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: *5.336 shard 23* missing e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336 shard 23* missing dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps: 5.336 shard 23 missing cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps: 5.336 shard 23 missing e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps: 5.336 shard 23 missing 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps: 5.336 shard 23 missing e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps: 5.336 shard 23 missing 4eda0336/100019e.36ad/head//5 (...) 2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects 2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR] 5.336 scrub 1850 errors 4. We have tried to restart the problematic osd, but that fails. # /etc/init.d/ceph stop osd.23 === osd.23 === Stopping Ceph osd.23 on osd3...done [root@osd3 ~]# /etc/init.d/ceph start osd.23 === osd.23 === create-or-move updated item name 'osd.23' weight 2.72 at location {host=osd3,root=default} to crush map Starting Ceph osd.23 on osd3... starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23 /var/lib/ceph/osd/ceph-23/journal # tail -f /var/log/ceph/ceph-osd.23.log 2015-08-24 11:48:12.189322 7fa24d85d800
Re: [ceph-users] ceph osd debug question / proposal
Hi Shinobu Human mistake, for example :-) Not very frequent, but it happens. Nevertheless, the idea is to test ceph against different DC scenarios, triggered by different problems. On this particular situation, the cluster recovered ok ONCE the problematic OSD daemon was tagged as 'down' and 'out' Cheers Goncalo On 08/25/2015 01:06 PM, Shinobu wrote: So what is the situation where you need to do: # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) I'm quite sure that is not normal. Shinobu On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges gonc...@physics.usyd.edu.au mailto:gonc...@physics.usyd.edu.au wrote: Hi Jan... We were interested in the situation where an rm -Rf is done in the current directory of the OSD. Here are my findings: 1. In this exercise, we simply deleted all the content of /var/lib/ceph/osd/ceph-23/current. # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) /dev/sdj1 2918054776434548 2917620228 1% /var/lib/ceph/osd/ceph-23 2. After some time, ceph enters in error state because it thinks it has an inconsistent PG and several scrub errors # ceph -s cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc health HEALTH_ERR 1 pgs inconsistent 1850 scrub errors monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0} election epoch 24, quorum 0,1,2 mon1,mon3,mon2 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay osdmap e1903: 32 osds: 32 up, 32 in pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects 14424 GB used, 74627 GB / 89051 GB avail 2175 active+clean 1 active+clean+inconsistent client io 989 B/s rd, 1 op/s 3. Looking to ceph.log in the mon, it is possible to check which is the PG affected and which OSD is responsible for the error: # tail -f /var/log/ceph/ceph.log (...) 2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: *5.336 shard 23* missing e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336 shard 23* missing dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps: 5.336 shard 23 missing cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps: 5.336 shard 23 missing e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps: 5.336 shard 23 missing 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps: 5.336 shard 23 missing e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps: 
5.336 shard 23 missing 4eda0336/100019e.36ad/head//5 (...) 2015-08-24 11:31:14.336760 osd.13
Re: [ceph-users] ceph osd debug question / proposal
So what is the situation where you need to do: # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) I'm quite sure that is not normal. Shinobu On Tue, Aug 25, 2015 at 9:41 AM, Goncalo Borges gonc...@physics.usyd.edu.au wrote: Hi Jan... We were interested in the situation where an rm -Rf is done in the current directory of the OSD. Here are my findings: 1. In this exercise, we simply deleted all the content of /var/lib/ceph/osd/ceph-23/current. # cd /var/lib/ceph/osd/ceph-23/current # rm -Rf * # df (...) /dev/sdj1 2918054776434548 2917620228 1% /var/lib/ceph/osd/ceph-23 2. After some time, ceph enters in error state because it thinks it has an inconsistent PG and several scrub errors # ceph -s cluster eea8578f-b3ac-4dfb-a0c5-da40509f5cdc health HEALTH_ERR 1 pgs inconsistent 1850 scrub errors monmap e1: 3 mons at {mon1=X.X.X.X:6789/0,mon2=X.X.X.X:6789/0,mon3=X.X.X.X:6789/0} election epoch 24, quorum 0,1,2 mon1,mon3,mon2 mdsmap e162: 1/1/1 up {0=mds=up:active}, 1 up:standby-replay osdmap e1903: 32 osds: 32 up, 32 in pgmap v1041261: 2176 pgs, 2 pools, 4930 GB data, 1843 kobjects 14424 GB used, 74627 GB / 89051 GB avail 2175 active+clean 1 active+clean+inconsistent client io 989 B/s rd, 1 op/s 3. Looking to ceph.log in the mon, it is possible to check which is the PG affected and which OSD is responsible for the error: # tail -f /var/log/ceph/ceph.log (...) 2015-08-24 11:31:10.139239 osd.13 X.X.X.X:6804/20104 2384 : cluster [ERR] be_compare_scrubmaps: *5.336 shard 23* missing e300336/10001b0.2825/head//5be_compare_scrubmaps: 5.336 shard 23 missing 32600336/1000109.0754/head//5be_compare_scrubmaps: *5.336 shard 23* missing dd700336/10001ab.0b91/head//5be_compare_scrubmaps: 5.336 shard 23 missing bc220336/10001bd.387c/head//5be_compare_scrubmaps: 5.336 shard 23 missing f9320336/1000201.2e96/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1a920336/1000228.d501/head//5be_compare_scrubmaps: 5.336 shard 23 missing 24a20336/10001bc.3e06/head//5be_compare_scrubmaps: 5.336 shard 23 missing cd20336/1000227.4775/head//5be_compare_scrubmaps: 5.336 shard 23 missing cef20336/10001b9.2260/head//5be_compare_scrubmaps: 5.336 shard 23 missing ba240336/10001d8.0630/head//5be_compare_scrubmaps: 5.336 shard 23 missing 3e740336/10001b1.2089/head//5be_compare_scrubmaps: 5.336 shard 23 missing e840336/10001ba.2618/head//5be_compare_scrubmaps: 5.336 shard 23 missing 17b40336/1e9.0287/head//5be_compare_scrubmaps: 5.336 shard 23 missing b7950336/1e4.0800/head//5be_compare_scrubmaps: 5.336 shard 23 missing 94560336/10001b4.2834/head//5be_compare_scrubmaps: 5.336 shard 23 missing 71370336/151.0179/head//5be_compare_scrubmaps: 5.336 shard 23 missing 62370336/10001b5.3b5b/head//5be_compare_scrubmaps: 5.336 shard 23 missing e9670336/1000120.03f8/head//5be_compare_scrubmaps: 5.336 shard 23 missing 1b480336/100019a.0d4b/head//5be_compare_scrubmaps: 5.336 shard 23 missing 11880336/10001e8.03e9/head//5be_compare_scrubmaps: 5.336 shard 23 missing 56c80336/183.0255/head//5be_compare_scrubmaps: 5.336 shard 23 missing 97790336/10001e7.0668/head//5be_compare_scrubmaps: 5.336 shard 23 missing e4ca0336/10001b6.278c/head//5be_compare_scrubmaps: 5.336 shard 23 missing 4eda0336/100019e.36ad/head//5 (...) 2015-08-24 11:31:14.336760 osd.13 X.X.X.X:6804/20104 2476 : cluster [ERR] 5.336 scrub 1850 missing, 0 inconsistent objects 2015-08-24 11:31:14.336764 osd.13 X.X.X.X:6804/20104 2477 : cluster [ERR] 5.336 scrub 1850 errors 4. We have tried to restart the problematic osd, but that fails. 
# /etc/init.d/ceph stop osd.23 === osd.23 === Stopping Ceph osd.23 on osd3...done [root@osd3 ~]# /etc/init.d/ceph start osd.23 === osd.23 === create-or-move updated item name 'osd.23' weight 2.72 at location {host=osd3,root=default} to crush map Starting Ceph osd.23 on osd3... starting osd.23 at :/0 osd_data /var/lib/ceph/osd/ceph-23 /var/lib/ceph/osd/ceph-23/journal # tail -f /var/log/ceph/ceph-osd.23.log 2015-08-24 11:48:12.189322 7fa24d85d800 0 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3), process ceph-osd, pid 7266 2015-08-24 11:48:12.389747 7fa24d85d800 0 filestore(/var/lib/ceph/osd/ceph-23) backend xfs (magic 0x58465342) 2015-08-24 11:48:12.391370 7fa24d85d800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP ioctl is supported and appears to work 2015-08-24 11:48:12.391381 7fa24d85d800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-23) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2015-08-24 11:48:12.404785 7fa24d85d800 0