[ceph-users] PG Balancer Upmap mode not working

2019-12-12 Thread Philippe D'Anjou
@Wido Den Hollander  Regarding the amount of PGs, and I quote from the docs: "If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution."
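A rough way to sanity-check that figure against a live cluster (the pool size and OSD count below are only illustrative):

    # Rough estimate, assuming a single replicated pool; real clusters sum over all pools:
    #   PGs per OSD ~= pg_num * replica_count / number_of_OSDs, e.g. 4096 * 3 / 120 ~= 102
    ceph osd df    # the PGS column shows the actual per-OSD count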

[ceph-users] Cluster in ERR status when rebalancing

2019-12-11 Thread Philippe D'Anjou
Has finally been addressed in 14.2.5; check the changelog of that release.

[ceph-users] PG Balancer Upmap mode not working

2019-12-10 Thread Philippe D'Anjou
My full OSD list (also here as pastebin: https://paste.ubuntu.com/p/XJ4Pjm92B5/ )
ID  CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META    AVAIL   %USE  VAR  PGS STATUS
14   hdd 9.09470  1.0 9.1 TiB 6.9 TiB 6.8 TiB  71 KiB  18 GiB 2.2 TiB 75.34 1.04  69 up
19   hdd 9.09470

[ceph-users] PG Balancer Upmap mode not working

2019-12-08 Thread Philippe D'Anjou
It's only getting worse after raising PGs now. Anything between:
96   hdd 9.09470  1.0 9.1 TiB 4.9 TiB 4.9 TiB  97 KiB  13 GiB  4.2 TiB 53.62 0.76  54 up
and
89   hdd 9.09470  1.0 9.1 TiB 8.1 TiB 8.1 TiB  88 KiB  21 GiB 1001 GiB 89.25 1.27  87 up
How is that possible? I don't

[ceph-users] PG Balancer Upmap mode not working

2019-12-08 Thread Philippe D'Anjou
@Wido Den Hollander  Still think this is acceptable?
51   hdd 9.09470  1.0 9.1 TiB 6.1 TiB 6.1 TiB  72 KiB  16 GiB 3.0 TiB 67.23 0.98  68 up
52   hdd 9.09470  1.0 9.1 TiB 6.7 TiB 6.7 TiB 3.5 MiB  18 GiB 2.4 TiB 73.99 1.08  75 up
53   hdd 9.09470  1.0 9.1 TiB 8.0 TiB 7.9

[ceph-users] PG Balancer Upmap mode not working

2019-12-07 Thread Philippe D'Anjou
I never had those issues with Luminous, not once; since Nautilus this is a constant headache. My issue is that I have OSDs that are over 85% while others are at 63%. My issue is that every time I do a rebalance or add new disks, Ceph moves PGs onto near-full OSDs and almost causes pool failures.

[ceph-users] PG Balancer Upmap mode not working

2019-12-07 Thread Philippe D'Anjou
@Wido Den Hollander  First of all, the docs say: "In most cases, this distribution is "perfect," which an equal number of PGs on each OSD (+/-1 PG, since they might not divide evenly)." Either this is just false information or it is very badly stated. I increased PGs and see no difference. I pointed
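For what it's worth, the mgr balancer can score the current distribution itself; a minimal check (assuming the balancer module is enabled) would be:

    ceph balancer status   # active mode and whether a plan is being executed
    ceph balancer eval     # distribution score; lower means more even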

[ceph-users] PG Balancer Upmap mode not working

2019-12-07 Thread Philippe D'Anjou
@Wido Den Hollander  That doesn't explain why it's between 76 and 92 PGs; that is far from equal. Raising PGs to 100 is an old recommendation anyway; anything 60+ should be fine. Not an excuse for the distribution failure in this case. I am expecting more or less equal PGs/OSD

[ceph-users] PG Balancer Upmap mode not working

2019-12-07 Thread Philippe D'Anjou
Hi, the docs say the upmap mode is trying to achieve perfect distribution, as in an equal amount of PGs/OSD. This is what I got (v14.2.4):
0   ssd 3.49219  1.0 3.5 TiB 794 GiB 753 GiB  38 GiB 3.4 GiB 2.7 TiB 22.20 0.32  82 up
1   ssd 3.49219  1.0 3.5 TiB 800 GiB 751 GiB  45 GiB
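For context, the usual sequence to switch the balancer to upmap mode (a sketch, assuming all clients are at least Luminous) looks like:

    ceph osd set-require-min-compat-client luminous   # upmap requires Luminous or newer clients
    ceph balancer mode upmap
    ceph balancer on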

[ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

2019-11-09 Thread Philippe D'Anjou
This only happens with this one specific node? Checked system logs? Checked SMART on all disks? I mean, technically it's expected to have slower writes when the third node is there; that's by Ceph design.
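A quick per-node check along those lines (the device name is a placeholder) might be:

    smartctl -a /dev/sdX                   # per-disk SMART health and error counters
    dmesg -T | grep -iE 'error|ata|blk'    # kernel-side I/O errors on the suspect node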

[ceph-users] Zombie OSD filesystems rise from the grave during bluestore conversion

2019-11-09 Thread Philippe D'Anjou
Zap had an issue back then and never properly worked; you have to manually dd. We always played it safe and went 2-4 GB in just to be sure. That should fix your issue.
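A sketch of that manual wipe (device name and size are placeholders; triple-check the device before running):

    # Zero the first 4 GiB of the old OSD device so leftover metadata can't be re-detected
    dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct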

[ceph-users] rebalance stuck backfill_toofull, OSD NOT full

2019-11-08 Thread Philippe D'Anjou
v14.2.4. Following issue:
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 1.285 is active+remapped+backfill_toofull, acting [118,94,84]
BUT:
118   hdd 9.09470  0.8 9.1 TiB  7.4 TiB  7.4 TiB  12 KiB  19 GiB 1.7 TiB 81.53 1.16  38 up
Even with adjusted
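For reference, the thresholds involved can be inspected and, if needed, raised temporarily (the 0.91 below is only an example value):

    ceph osd dump | grep -E 'full_ratio|backfillfull_ratio|nearfull_ratio'
    ceph osd set-backfillfull-ratio 0.91    # default is 0.90; revert once the backfill completes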

[ceph-users] feature set mismatch CEPH_FEATURE_MON_GV kernel 5.0?

2019-10-31 Thread Philippe D'Anjou
So it seems like for some reason librados is used now instead of the kernel module, and this produces the error. But we have all the latest Nautilus repos installed on the clients... so why would librados throw a compatibility issue? Client compatibility level is set to Luminous.
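One way to see what the cluster actually negotiates with each client (rather than what the installed packages suggest):

    ceph features    # lists connected clients/daemons grouped by the feature release they report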

Re: [ceph-users] changing set-require-min-compat-client will cause hiccup?

2019-10-31 Thread Philippe D'Anjou
Hi, it is NOT safe. All clients fail to mount RBDs now :(
On Wednesday, 30 October 2019, 09:33:16 OEZ, Konstantin Shalygin wrote:
Hi, I need to change set-require-min-compat-client to use upmap mode for the PG balancer. Will this cause a disconnect of all clients?

[ceph-users] feature set mismatch CEPH_FEATURE_MON_GV kernel 5.0?

2019-10-31 Thread Philippe D'Anjou
Hi, we're on v14.2.4 and nothing but that. All clients and servers run Ubuntu 18.04 LTS with kernel 5.0.0-20. We're seeing this error:
    MountVolume.WaitForAttach failed for volume "pvc-45a86719-edb9-11e9-9f38-02000a030111" : fail to check rbd image status with: (exit status 110), rbd output:

[ceph-users] changing set-require-min-compat-client will cause hiccup?

2019-10-30 Thread Philippe D'Anjou
Hi, I need to change set-require-min-compat-client to use upmap mode for the PG balancer. Will this cause a disconnect of all clients? We're talking CephFS and RBD images for VMs. Or is it safe to switch that live? Thanks
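For reference, a cautious way to approach the switch (a sketch, not a guarantee of zero client impact):

    ceph osd get-require-min-compat-client            # current setting
    ceph osd set-require-min-compat-client luminous   # refuses if connected clients report older feature sets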

[ceph-users] very high ram usage by OSDs on Nautilus

2019-10-30 Thread Philippe D'Anjou
Yes, you were right: somehow there was an unusually high memory target set, not sure where that came from. I set it back to normal now; that should fix it, I guess. Thanks
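For anyone hitting the same thing, the relevant knob can be checked and reset roughly like this (osd.0 and the 4 GiB value are placeholders):

    ceph config get osd.0 osd_memory_target           # effective value for one OSD
    ceph config set osd osd_memory_target 4294967296  # back to the ~4 GiB default, cluster-wide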

[ceph-users] very high ram usage by OSDs on Nautilus

2019-10-29 Thread Philippe D'Anjou
Ok, looking at the mempool, what does it tell me? This affects multiple OSDs; I get crashes almost every hour.
{
    "mempool": {
        "by_pool": {
            "bloom_filter": {
                "items": 0,
                "bytes": 0
            },
            "bluestore_alloc": {
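The dump above comes from the OSD admin socket; a minimal way to pull it per OSD (osd.0 is a placeholder id) is:

    ceph daemon osd.0 dump_mempools    # run on the host where that OSD lives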

Re: [ceph-users] Ceph is moving data ONLY to near-full OSDs [BUG]

2019-10-28 Thread Philippe D'Anjou
% of full. PGs are not equally distributed, otherwise it'd be a PG size issue. Thanks
On Sunday, 27 October 2019, 20:33:11 OEZ, Wido den Hollander wrote:
On 10/26/19 8:01 AM, Philippe D'Anjou wrote: > V14.2.4 > So, this is not new, this happens ever

[ceph-users] very high ram usage by OSDs on Nautilus

2019-10-28 Thread Philippe D'Anjou
Hi, we are seeing quite high memory usage by OSDs since Nautilus, averaging 10 GB per OSD for 10 TB HDDs. But I had OOM issues on 128 GB systems because some single OSD processes used up to 32%. Here's an example of how they look on average: https://i.imgur.com/kXCtxMe.png Is that normal? I never seen

[ceph-users] Ceph is moving data ONLY to near-full OSDs [BUG]

2019-10-26 Thread Philippe D'Anjou
V14.2.4. So, this is not new; this happens every time there is a rebalance, now because of raising PGs. The PG balancer is disabled because I thought it was the reason, but apparently it's not; it isn't helping either, though. Ceph is totally borked: it's only moving data onto nearfull OSDs, causing issues.
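As a stop-gap while the balancer is off, utilization-based reweighting can be dry-run first (the 120 threshold is only an example):

    ceph osd test-reweight-by-utilization 120   # dry run: shows which OSDs would be reweighted
    ceph osd reweight-by-utilization 120        # only touches OSDs above 120% of mean utilization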

[ceph-users] How to reset compat weight-set changes caused by PG balancer module?

2019-10-22 Thread Philippe D'Anjou
Apparently the PG balancer's crush-compat mode adds some crush bucket weights. Those cause major havoc in our cluster; our PG distribution is all over the place. Seeing things like this:
...
97 hdd 9.09470 1.0 9.1 TiB 6.3 TiB 6.3 TiB 32 KiB 17 GiB 2.8 TiB 69.03 1.08 28 up
98
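If the goal is to get rid of those compat weights entirely, the relevant command (expect data movement once it runs) is roughly:

    ceph osd crush weight-set rm-compat   # removes the crush-compat weight-set added by the balancer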

[ceph-users] OSD PGs are not being removed - Full OSD issues

2019-10-17 Thread Philippe D'Anjou
This is related to https://tracker.ceph.com/issues/42341 and to http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-October/037017.html After closer inspection yesterday we found that PGs are not being removed from OSDs, which then leads to near-full errors and explains why reweights don't

Re: [ceph-users] Issues with data distribution on Nautilus / weird filling behavior

2019-10-16 Thread Philippe D'Anjou
86.83 1.46  38 up
54   hdd  9.09470  1.0 9.1 TiB 5.0 TiB 5.0 TiB 136 KiB  13 GiB 4.1 TiB 54.80 0.92  24 up
...
Now I again have to manually reweight to prevent bigger issues. How to fix this?
On Wednesday, 2 October 2019, 08:49:50 OESZ, Philippe D'Anjou wrote the following

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-10 Thread Philippe D'Anjou
On Wednesday, 9 October 2019, 20:19:42 OESZ, Gregory Farnum wrote:
On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou wrote: > > Hi, > unfortunately it's a single mon, because we had a major outage on this cluster > and it's just being used to copy off data now. We wer

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-10 Thread Philippe D'Anjou
How do I import an osdmap in Nautilus? I saw documentation for an older version, but it seems one can now only export but not import anymore?
On Thursday, 10 October 2019, 08:52:03 OESZ, Philippe D'Anjou wrote:
I don't think this has anything to do with CephFS

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-09 Thread Philippe D'Anjou
wrote the following: On Mon, Oct 7, 2019 at 11:11 PM Philippe D'Anjou wrote: > > Hi, > unfortunately it's a single mon, because we had a major outage on this cluster > and it's just being used to copy off data now. We weren't able to add more > mons because once a secon

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-08 Thread Philippe D'Anjou
no issues and all commands run fine.
On Monday, 7 October 2019, 21:59:20 OESZ, Gregory Farnum wrote:
On Sun, Oct 6, 2019 at 1:08 AM Philippe D'Anjou wrote: > > I had to use the rocksdb repair tool before because the rocksdb files got > corrupted, for anoth

Re: [ceph-users] mon sudden crash loop - pinned map

2019-10-06 Thread Philippe D'Anjou
d from the remainder, or do they all exhibit this bug? On Fri, Oct 4, 2019 at 5:44 AM Philippe D'Anjou wrote: > > Hi, > our mon is acting up all of a sudden and dying in crash loop with the > following: > > > 2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc

[ceph-users] mon sudden crash loop - pinned map

2019-10-04 Thread Philippe D'Anjou
Hi, our mon is acting up all of a sudden and dying in a crash loop with the following:
2019-10-04 14:00:24.339583 lease_expire=0.00 has v0 lc 4549352
    -3> 2019-10-04 14:00:24.335 7f6e5d461700  5 mon.km-fsn-1-dc4-m1-797678@0(leader).paxos(paxos active c 4548623..4549352) is_readable = 1 -

[ceph-users] Issues with data distribution on Nautilus / weird filling behavior

2019-10-01 Thread Philippe D'Anjou
Hi, this is a fresh Nautilus cluster, but there is a second, older one that was upgraded from Luminous to Nautilus; both show the same symptoms. First of all, the data distribution on the OSDs is very bad. Now, that could be due to low PGs, although I get no recommendation to raise the PG number

[ceph-users] hanging/stopped recovery/rebalance in Nautilus

2019-10-01 Thread Philippe D'Anjou
Hi, I have often observed now that recovery/rebalance in Nautilus starts quite fast but gets extremely slow (2-3 objects/s), even if there are around 20 OSDs involved. Right now I am moving (reweighted to 0) 16x 8TB disks; it has been running for 4 days and for the last 12h it has been kind of stuck at
  cluster:
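A couple of knobs that are commonly checked when recovery crawls like this (the values below are only examples, not recommendations):

    ceph config set osd osd_max_backfills 2          # concurrent backfills per OSD
    ceph config set osd osd_recovery_max_active 4    # concurrent recovery ops per OSD
    ceph -s                                          # watch whether the recovery rate changes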