Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-19 Thread Stefan Priebe - Profihost AG
Hi, we were able to solve these issues. We switched the bcache OSDs from ssd to hdd in the ceph osd tree and lowered max recovery from 3 to 1. Thanks for your help! Greets, Stefan. On 18.10.2018 at 15:42 David Turner wrote: > What are your OSD node stats? CPU, RAM, quantity and size of OSD disks.
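
A rough sketch of what those two changes might look like on the CLI, assuming the Luminous+ device-class mechanism and that "max recovery" refers to osd_recovery_max_active (osd.12 is only a placeholder ID):

   # move a bcache-backed OSD from the ssd to the hdd device class in the CRUSH tree
   ceph osd crush rm-device-class osd.12
   ceph osd crush set-device-class hdd osd.12

   # throttle concurrent recovery per OSD from 3 to 1 at runtime
   ceph tell osd.* injectargs '--osd_recovery_max_active 1'

   # and persist it across restarts in ceph.conf
   [osd]
   osd_recovery_max_active = 1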

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-19 Thread Konstantin Shalygin
For some time we have been experiencing service outages in our Ceph cluster whenever there is any change to the HEALTH status, e.g. swapping storage devices, adding storage devices, rebooting Ceph hosts, during backfills etc. Just now I had a recent situation, where several VMs hung after I rebooted one

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-18 Thread David Turner
What are your OSD node stats? CPU, RAM, quantity and size of OSD disks. You might need to modify some bluestore settings to speed up the time it takes to peer, or perhaps you're just underpowering the number of OSD disks you're trying to run and your servers and OSD daemons are going as fast
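
A few standard commands that would gather the numbers asked for here; nothing cluster-specific is assumed beyond the default admin-socket layout, and osd.19 is only an example ID:

   ceph osd df tree                                  # per-OSD size, utilisation and placement
   free -h; nproc                                    # RAM and CPU count on an OSD node
   ceph daemon osd.19 config show | grep bluestore   # bluestore settings of a running OSD (run on its host)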

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-13 Thread Stefan Priebe - Profihost AG
and a 3rd one: health: HEALTH_WARN 1 MDSs report slow metadata IOs 1 MDSs report slow requests 2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1 included below; oldest blocked for > 199.922552 secs 2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN]
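
To see which requests the MDS considers slow, the admin socket of the affected daemon can be queried; mds.cloud1-1473 is taken from the log line above, and the command assumes a Luminous-or-later MDS:

   ceph daemon mds.cloud1-1473 dump_ops_in_flight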

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-13 Thread Stefan Priebe - Profihost AG
osd.19 is a bluestore OSD on a healthy 2TB SSD. The log of osd.19 is here: https://pastebin.com/raw/6DWwhS0A On 13.10.2018 at 21:20 Stefan Priebe - Profihost AG wrote: > Hi David, > > I think this should be the problem - from a new log from today: > > 2018-10-13 20:57:20.367326 mon.a [WRN]

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-13 Thread Stefan Priebe - Profihost AG
Hi David, I think this should be the problem - from a new log from today: 2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down (OSD_DOWN) ... 2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data availability: 3 pgs peering (PG_AVAILABILITY) ... 2018-10-13

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread Paul Emmerich
PGs switching to the peering state after a failure is normal and expected. The important thing is how long they stay in that state; it shouldn't be longer than a few seconds. It looks like less than 5 seconds from your log. What might help here is the ceph -w log (or mon cluster log file) during
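
A simple way to capture that log around the next reboot test (plain shell; the file name is arbitrary, and /var/log/ceph/ceph.log is the default cluster log location on a mon host):

   ceph -w | tee ceph-w-during-reboot.log
   # or, after the fact, pull the peering-related lines out of the mon cluster log
   grep -i peering /var/log/ceph/ceph.log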

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread David Turner
The number of PGs per OSD does not change unless the OSDs are marked out. You have noout set, so that doesn't change at all during this test. All of your PGs peered quickly at the beginning and then were active+undersized the rest of the time, you never had any blocked requests, and you always had
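
For reference, the noout flag mentioned here is the standard way to keep OSDs from being marked out during a planned reboot:

   ceph osd set noout     # before the reboot
   # ... reboot the host ...
   ceph osd unset noout   # afterwards, so failed OSDs can be marked out again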

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread Nils Fahldieck - Profihost AG
Hi, in our `ceph.conf` we have: mon_max_pg_per_osd = 300 While the host is offline (9 OSDs down): 4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD If all OSDs are online: 4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD ... so this doesn't seem to be the issue. If I understood you right, that's what
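
The same back-of-the-envelope calculation as shell arithmetic, using the PG and OSD counts from the message:

   echo $((4352 * 3 / 62))   # -> 210 PGs per OSD with the host offline
   echo $((4352 * 3 / 71))   # -> 183 PGs per OSD with all OSDs up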

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread Burkhard Linke
Hi, On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote: I rebooted a Ceph host and logged `ceph status` & `ceph health detail` every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data availability: pgs peering'. At the same time some VMs hung as described before.

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-12 Thread Nils Fahldieck - Profihost AG
I rebooted a Ceph host and logged `ceph status` & `ceph health detail` every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data availability: pgs peering'. At the same time some VMs hung as described before. See the log here: https://pastebin.com/wxUKzhgB PG_AVAILABILITY is noted
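
A loop along these lines would produce such a log (plain bash; the interval and output file follow the description in the message):

   while true; do
       date
       ceph status
       ceph health detail
       sleep 5
   done >> ceph-status-during-reboot.log 2>&1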

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-11 Thread David Turner
You should definitely stop using `size 3 min_size 1` on your pools. Go back to the default `min_size 2`. I'm a little confused why you have 3 different CRUSH rules. They're all identical. You only need different CRUSH rules if you're using Erasure Coding or targeting a different set of OSDs
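
The recommended change is one command per affected pool; "rbd" below is only a placeholder pool name:

   ceph osd pool set rbd min_size 2
   ceph osd pool ls detail | grep min_size   # verify the new value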

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-11 Thread Nils Fahldieck - Profihost AG
Thanks for your reply. I'll capture a `ceph status` the next time I encounter a non-working RBD. Here's the other output you asked for: $ ceph osd crush rule dump [ { "rule_id": 0, "rule_name": "data", "ruleset": 0, "type": 1, "min_size": 1,

Re: [ceph-users] Troubleshooting hanging storage backend whenever there is any cluster change

2018-10-11 Thread David Turner
My first guess is to ask what your crush rules are. `ceph osd crush rule dump` along with `ceph osd pool ls detail` would be helpful. Also, a `ceph status` output from a time when the VM RBDs aren't working might explain something. On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck -
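
The requested information can be collected with these standard commands, ideally while a VM is actually hanging:

   ceph osd crush rule dump
   ceph osd pool ls detail
   ceph status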