[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Eugen Block
Hi, this is not an easy topic and there is no formula that can be applied to all clusters. From my experience, it is exactly how the discussion went in the thread you mentioned: trial & error. Looking at your session ls output, this reminds me of a debug session we had a few years ago:
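As a rough sketch (the MDS name is a placeholder and the jq filter is only illustrative, not taken from that debug session), per-client cap counts can be pulled out of the session list like this:

  # list each CephFS client session with the number of caps it currently holds
  ceph tell mds.<your-mds> session ls | jq '.[] | {id, num_caps}'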

[ceph-users] Re: Upgrading nautilus / centos7 to octopus / ubuntu 20.04. - Suggestions and hints?

2024-01-16 Thread Szabo, Istvan (Agoda)
Hi Goetz, Which method did you finally choose? We've done a successful migration from CentOS 8 to Ubuntu 20.04, but we have a CentOS 7 Nautilus cluster which we'd like to move to Ubuntu 20.04 Octopus, same as you. I wonder whether any of you tried to skip Rocky 8 in the flow? Thank you

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
This is my active MDS perf dump output: root@ud-01:~# ceph tell mds.ud-data.ud-02.xcoojt perf dump { "AsyncMessenger::Worker-0": { "msgr_recv_messages": 17179307, "msgr_send_messages": 15867134, "msgr_recv_bytes": 445239812294, "msgr_send_bytes": 42003529245,
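For readability, the MDS-specific sections of that dump can be filtered with jq; a minimal sketch assuming the same daemon name as above:

  # show only the mds and mds_mem counter sections of the perf dump
  ceph tell mds.ud-data.ud-02.xcoojt perf dump | jq '{mds: .mds, mds_mem: .mds_mem}'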

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
All of my clients are servers located 2 hops away on a 10Gbit network, with 2x Xeon CPUs (16+ cores), a minimum of 64GB RAM, and an SSD OS drive + 8GB spare. I use the ceph kernel mount only and this is the command: - mount.ceph admin@$fsid.ud-data=/volumes/subvolumegroup ${MOUNT_DIR} -o
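For reference, a generic kernel-mount invocation looks roughly like this; the options shown are illustrative placeholders, not the ones cut off above:

  # new-style device spec: <user>@<fsid>.<fsname>=<path>; secretfile/noatime are example options
  mount -t ceph admin@<fsid>.ud-data=/volumes/subvolumegroup /mnt/ud-data -o secretfile=/etc/ceph/admin.secret,noatime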

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Let me share some outputs about my cluster. root@ud-01:~# ceph fs status ud-data - 84 clients === RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 active ud-data.ud-02.xcoojt Reqs: 31 /s 3022k 3021k 52.6k 385k POOL TYPE USED AVAIL
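One related check (not shown in the quoted output) is how close the MDS cache is to its configured limit; a minimal sketch using the same daemon name:

  # report MDS cache usage versus mds_cache_memory_limit
  ceph tell mds.ud-data.ud-02.xcoojt cache status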

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Özkan Göksu
Hello Eugen. Thank you for the answer. Based on the findings and test results in this issue: https://github.com/ceph/ceph/pull/38574 I followed their advice and applied the following changes: max_mds = 4 standby_mds = 1 mds_cache_memory_limit = 16GB mds_recall_max_caps = 4 When I set
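If it helps anyone follow along, those settings would typically be applied roughly like this; the values only mirror the ones listed above, and standby_count_wanted is my guess at what "standby_mds" refers to:

  ceph fs set ud-data max_mds 4
  ceph fs set ud-data standby_count_wanted 1
  ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB expressed in bytes
  ceph config set mds mds_recall_max_caps <value>          # the exact value is cut off above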

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Drew Weaver
> Groovy. Channel drives are IMHO a pain, though in the case of certain manufacturers it can be the only way to get firmware updates. Channel drives often only have a 3 year warranty, vs 5 for generic drives. Yes, we have run into this with Kioxia as far as being able to find new firmware.

[ceph-users] Re: Stuck in upgrade process to reef

2024-01-16 Thread Igor Fedotov
Hi Jan, I've just filed an upstream ticket for your case, see https://tracker.ceph.com/issues/64053 for more details. You might want to tune (or preferably just remove) your custom bluestore_cache_.*_ratio settings to fix the issue. It is reproducible and fixable in my lab this way.
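For anyone in the same spot, removing such overrides would look roughly like this; the exact ratio options present will depend on what ceph config dump shows in your cluster:

  # find and drop custom bluestore cache ratio overrides
  ceph config dump | grep bluestore_cache
  ceph config rm osd bluestore_cache_meta_ratio
  ceph config rm osd bluestore_cache_kv_ratio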

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Anthony D'Atri
>> NVMe SSDs shouldn’t cost significantly more than SATA SSDs. Hint: certain tier-one chassis manufacturers mark both the fsck up. You can get a better warranty and pricing by buying drives from a VAR. > We stopped buying “Vendor FW” drives a long time ago. Groovy.

[ceph-users] Re: How does mclock work?

2024-01-16 Thread Frédéric Nass
Sridhar, Thanks a lot for this explanation. It's clearer now. So at the end of the day (at least with the balanced profile) it's a lower bound with no upper limit, and a balanced distribution between client and cluster IOPS. Regards, Frédéric. -Original Message- From:
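For context, the profile being discussed is selected per OSD; a minimal sketch (osd.0 is just an example daemon):

  # select the balanced mclock profile and verify what the OSD is running with
  ceph config set osd osd_mclock_profile balanced
  ceph config show osd.0 | grep osd_mclock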

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-16 Thread Drew Weaver
> By HBA I suspect you mean a non-RAID HBA? Yes, something like the HBA355. > NVMe SSDs shouldn’t cost significantly more than SATA SSDs. Hint: certain tier-one chassis manufacturers mark both the fsck up. You can get a better warranty and pricing by buying drives from a VAR. We

[ceph-users] Re: ceph pg mark_unfound_lost delete results in confused ceph

2024-01-16 Thread Oliver Dzombic
Hi, just in case someone else runs into this or a similar issue, the following helped to solve it: 1. restarting the active mgr brought the pg into inactive without last acting: pg 10.17 is stuck inactive for 18m, current state unknown, last acting [] 2. so we recreated the
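The message is truncated above, but the steps described map roughly onto commands like these; the pg id comes from the quoted output and the force-create step is only my assumption about what "recreated" refers to:

  # fail over the active mgr, then recreate the permanently lost pg
  ceph mgr fail
  ceph osd force-create-pg 10.17 --yes-i-really-mean-it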

[ceph-users] OSD read latency grows over time

2024-01-16 Thread Roman Pashin
Hello Ceph users, we are seeing a strange issue on our most recent Ceph installation, v17.6.2. We store data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an NVMe partition. Benchmarks didn't expose any issues with the cluster, but since we placed production load on it we see constantly growing OSD
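A quick way to watch per-OSD latency from the cluster side (not part of the original mail; osd.0 is just an example) is:

  # commit/apply latency overview, then the read/write op latency counters of one OSD
  ceph osd perf
  ceph tell osd.0 perf dump | jq '.osd | {op_r_latency, op_w_latency}'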

[ceph-users] Re: [quincy 17.2.7] ceph orchestrator not doing anything

2024-01-16 Thread Boris
Good morning Eugen, I just found this thread and saw that I had a test image for rgw in the config. After removing the global and the rgw config values everything was instantly fine. Cheers and a happy week, Boris. On Tue, 16 Jan 2024 at 10:20, Eugen Block wrote: > Hi, > > there have
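For anyone hitting the same thing, a sketch of finding and removing such leftovers (the exact sections depend on how the test image was set):

  # locate container_image overrides and drop the global one
  ceph config dump | grep container_image
  ceph config rm global container_image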

[ceph-users] Email duplicates.

2024-01-16 Thread Roman Pashin
Hi owners of the ceph-users list, I've been trying to post a new message for the first time. The first one bounced because I had registered but not subscribed to the list. Then I subscribed and sent a message with a picture, which was larger than the allowed 500KB and got into quarantine as well. I've decided

[ceph-users] Re: [quincy 17.2.7] ceph orchestrator not doing anything

2024-01-16 Thread Eugen Block
Hi, there have been a few threads on this topic; one of them is this one [1]. The issue there was that different ceph container images were in use. Can you check your container versions? If you don't configure a global image for all ceph daemons, e.g.: quincy-1:~ # ceph config set
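The command above is cut off; checking the running versions and pinning a single image typically looks something like this (the image tag is only an example):

  # compare the image/version each daemon runs, then set one image for all ceph daemons
  ceph orch ps
  ceph config set global container_image quay.io/ceph/ceph:v17.2.7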

[ceph-users] Re: erasure-code-lrc Questions regarding repair

2024-01-16 Thread Eugen Block
Hi, I don't really have an answer, I just wanted to mention that I created a tracker issue [1] because I believe there's a bug in the LRC plugin. But there hasn't been any response yet. [1] https://tracker.ceph.com/issues/61861 Quoting Ansgar Jazdzewski: hi folks, I currently test
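For reference, a minimal LRC profile is created along these lines; the parameters are purely illustrative and not Ansgar's actual setup:

  # simple LRC form: k data chunks, m coding chunks, l locality group size
  ceph osd erasure-code-profile set lrc_test plugin=lrc k=4 m=2 l=3 crush-failure-domain=host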

[ceph-users] Re: Unable to locate "bluestore_compressed_allocated" & "bluestore_compressed_original" parameters while executing "ceph daemon osd.X perf dump" command.

2024-01-16 Thread Eugen Block
Hi, could you provide more details on what exactly you tried and which configs you set? Which compression mode are you running? In a small Pacific test cluster I just set the mode to "force" (default "none"): storage:~ # ceph config set osd bluestore_compression_mode force And then after a
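Once compression actually kicks in, the counters in question appear in the OSD perf dump; a sketch with osd.0 as a placeholder:

  # show the bluestore compression counters for one OSD
  ceph daemon osd.0 perf dump | jq '.bluestore | {bluestore_compressed, bluestore_compressed_allocated, bluestore_compressed_original}'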

[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)

2024-01-16 Thread Eugen Block
Hi, I have dealt with this topic multiple times; the SUSE team helped me understand what's going on under the hood. The summary can be found in this thread [1]. What helped in our case was to reduce mds_recall_max_caps from 30k (default) to 3k. We tried it in steps of 1k IIRC. So I
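Lowering the recall limit is a single runtime setting; a sketch using the 3k value mentioned above:

  # reduce how many caps the MDS asks a client to release per recall event
  ceph config set mds mds_recall_max_caps 3000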

[ceph-users] Re: [v18.2.1] problem with wrong osd device symlinks after upgrade to 18.2.1

2024-01-16 Thread Eugen Block
Did you find an existing tracker issue for that? I suggest reporting your findings there. Thanks! Eugen Quoting Reto Gysi: Hi Eugen, the LV tags seem to look ok to me. LV_tags: - root@zephir:~# lvs -a -o +devices,tags | egrep 'osd1| LV' | grep -v osd12 LV
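Besides the LV tags, the device mapping ceph-volume has recorded can be cross-checked on the OSD host (not something from the quoted mail):

  # list OSDs with the devices/LVs ceph-volume associates with them
  ceph-volume lvm list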