Hi,
We were able to solve these issues. We switched the bcache OSDs from ssd to
hdd in the ceph osd tree and lowered the max recovery setting from 3 to 1.
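For anyone else hitting this, the changes were roughly the following (a
sketch; osd.12 stands in for each bcache-backed OSD, and I'm assuming
"max recover" refers to osd_recovery_max_active, whose default is 3):

$ ceph osd crush rm-device-class osd.12
$ ceph osd crush set-device-class hdd osd.12
$ ceph tell osd.* injectargs '--osd_recovery_max_active=1'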
Thanks for your help!
Greets,
Stefan
On 18.10.2018 at 15:42, David Turner wrote:
> What are your OSD node stats? CPU, RAM, quantity and size of OSD disks?
For some time we have been experiencing service outages in our Ceph
cluster whenever there is any change to the HEALTH status, e.g. swapping
storage devices, adding storage devices, rebooting Ceph hosts, or during
backfills. Just now I had a situation where several VMs hung after I
rebooted one
What are your OSD node stats? CPU, RAM, quantity and size of OSD disks?
You might need to modify some bluestore settings to speed up the time it
takes to peer, or perhaps you are simply underpowering the number of OSD
disks you're running and your servers and OSD daemons are going as fast
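To see whether the OSDs themselves are the bottleneck, the admin socket
is handy (a sketch, run on the OSD host; osd.19 is just an example):

$ ceph daemon osd.19 dump_historic_ops   # recent slow ops with per-phase timings
$ ceph daemon osd.19 perf dump           # bluestore and OSD perf counters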
and a 3rd one:
health: HEALTH_WARN
1 MDSs report slow metadata IOs
1 MDSs report slow requests
2018-10-13 21:44:08.150722 mds.cloud1-1473 [WRN] 7 slow requests, 1
included below; oldest blocked for > 199.922552 secs
2018-10-13 21:44:08.150725 mds.cloud1-1473 [WRN]
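To see what those blocked metadata requests are waiting on, the MDS
admin socket can help (a sketch, run on the MDS host):

$ ceph daemon mds.cloud1-1473 dump_ops_in_flight   # the slow requests and their current state
$ ceph daemon mds.cloud1-1473 objecter_requests    # RADOS ops the MDS has in flight to the OSDs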
osd.19 is a bluestore OSD on a healthy 2TB SSD.
Log of osd.19 is here:
https://pastebin.com/raw/6DWwhS0A
On 13.10.2018 at 21:20, Stefan Priebe - Profihost AG wrote:
Hi David,
I think this should be the problem - from a new log from today:
2018-10-13 20:57:20.367326 mon.a [WRN] Health check update: 4 osds down
(OSD_DOWN)
...
2018-10-13 20:57:41.268674 mon.a [WRN] Health check update: Reduced data
availability: 3 pgs peering (PG_AVAILABILITY)
...
2018-10-13
PGs switching to the peering state after a failure is normal and
expected. The important thing is how long they stay in that state; it
shouldn't be longer than a few seconds. It looks like less than 5
seconds from your log.
What might help here is the ceph -w log (or mon cluster log file)
during
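One way to capture that during the next reboot test (a sketch; the log
file name is made up):

$ ceph -w | tee ceph-w-reboot-test.log   # cluster log while the host goes down
$ ceph pg dump_stuck inactive            # any PGs still stuck peering afterwards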
The PG count per OSD does not change unless the OSDs are marked out. You
have noout set, so that doesn't change at all during this test. All of
your PGs peered quickly at the beginning and then were active+undersized
for the rest of the time, you never had any blocked requests, and you
always had
Hi, in our `ceph.conf` we have:
mon_max_pg_per_osd = 300
While the host is offline (9 OSDs down):
4352 PGs * 3 / 62 OSDs ~ 210 PGs per OSD
If all OSDs are online:
4352 PGs * 3 / 71 OSDs ~ 183 PGs per OSD
... so this doesn't seem to be the issue.
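The back-of-envelope numbers can be cross-checked against what the
cluster actually reports (the PGS column shows the real per-OSD count):

$ ceph osd df tree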
If I understood you right, that's what
Hi,
On 10/12/2018 01:55 PM, Nils Fahldieck - Profihost AG wrote:
I rebooted a Ceph host and logged `ceph status` & `ceph health detail`
every 5 seconds. During this I encountered 'PG_AVAILABILITY Reduced data
availability: pgs peering'. At the same time some VMs hung as described
before.
See the log here: https://pastebin.com/wxUKzhgB
PG_AVAILABILITY is noted
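For reference, the 5-second capture was done roughly like this (a
sketch; the output file name is made up):

$ while true; do date; ceph status; ceph health detail; sleep 5; done >> ceph-watch.log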
You should definitely stop using `size 3 min_size 1` on your pools. Go
back to the default `min_size 2`. I'm a little confused why you have 3
different CRUSH rules. They're all identical. You only need different
CRUSH rules if you're using Erasure Coding or targeting a different set of
OSDs
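Something like the following, once per pool listed in `ceph osd pool ls
detail` (the pool name is a placeholder):

$ ceph osd pool set <pool-name> min_size 2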
Thanks for your reply. I'll capture a `ceph status` the next time I
encounter a non-working RBD. Here's the other output you asked for:
$ ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "data",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
My first guess is to ask what your crush rules are. `ceph osd crush rule
dump` along with `ceph osd pool ls detail` would be helpful. Also, a
`ceph status` output from a time when the VM RBDs aren't working might
explain something.
On Thu, Oct 11, 2018 at 1:12 PM Nils Fahldieck -