[ceph-users] Re: Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true

2021-10-20 Thread Igor Fedotov
Hey mgrzybowski! Never seen that before, but perhaps some omaps have been improperly converted to the new format and aren't being read any more... I'll take a more detailed look at what's happening during that load_pgs call and what exact information is missing. Meanwhile could you please set

[ceph-users] Upgrade to 16.2.6 and osd+mds crash after bluestore_fsck_quick_fix_on_mount true

2021-10-20 Thread mgrzybowski
Hi. Recently I performed upgrades on the single-node cephfs server I have. # ceph fs ls name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ecpoolk3m1osd ecpoolk5m1osd ecpoolk4m2osd ~# ceph osd pool ls detail pool 20 'cephfs_data' replicated size 3 min_size 2 crush_rule 0

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri
> On Oct 20, 2021, at 1:49 PM, Josh Salomon wrote: > > but in the extreme case (some capacity on 1TB devices and some on 6TB > devices) the workload can't be balanced. It's also super easy in such a scenario to a) Have the larger drives not uniformly spread across failure domains, which

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Anthony D'Atri
> Doesn't the existing mgr balancer already balance the PGs for each pool > individually? So in your example, the PGs from the loaded pool will be > balanced across all osds, as will the idle pool's PGs. So the net load is > uniform, right? If there’s a single CRUSH root and all pools share

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh, Okay, but do you agree that for any given pool the load is uniform across its PGs? Doesn't the existing mgr balancer already balance the PGs for each pool individually? So in your example, the PGs from the loaded pool will be balanced across all osds, as will the idle pool's

[ceph-users] v15.2.15 Octopus released

2021-10-20 Thread David Galloway
We're happy to announce the 15th backport release in the Octopus series. We recommend that users update to this release. For detailed release notes with links and changelog, please refer to the official blog entry at https://ceph.io/en/news/blog/2021/v15-2-15-octopus-released Notable Changes

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Josh, That's another interesting dimension... Indeed a cluster that has plenty of free capacity could be balanced by workload/iops, but once it reaches maybe 60 or 70% full, then I think capacity would need to take priority. But to be honest I don't really understand the workload/iops

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi, I don't quite understand your "huge server" scenario, other than a basic understanding that the balancer cannot do magic in some impossible cases. But anyway, I wonder if this sort of higher order balancing could/should be added as a "part two" to the mgr balancer. The existing code does a

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Jonas Jelten
Hi Dan, I'm not kidding, these were real-world observations, hence my motivation to create this balancer :) First I tried "fixing" the mgr balancer, but after understanding the exact algorithm there I thought of a completely different approach. For us the main reason things got out of balance

[ceph-users] Re: jj's "improved" ceph balancer

2021-10-20 Thread Dan van der Ster
Hi Jonas, From your readme: "the best possible solution is some OSDs having an offset of 1 PG to the ideal count. As a PG-distribution-optimization is done per pool, without checking other pool's distribution at all, some devices will be the +1 more often than others. At worst one OSD is the +1
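The "+1" effect quoted from the readme follows from integer division: a pool's PG count rarely divides evenly by the OSD count, so a remainder of OSDs must each carry one extra PG. A minimal sketch, with made-up pool and OSD counts:

```shell
#!/bin/sh
# 1024 PGs of one pool spread over 100 OSDs (illustrative numbers only)
pgs=1024
osds=100
base=$((pgs / osds))   # minimum PGs every OSD receives
extra=$((pgs % osds))  # number of OSDs that must carry one extra PG
echo "base=$base extra=$extra"  # prints: base=10 extra=24
```

Because each pool's optimization runs independently, nothing prevents the same OSDs from being chosen as the "+1" for several pools, which is the cumulative skew the readme describes.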

[ceph-users] Re: clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread Magnus HAGDORN
We have increased the cache on our MDS, which makes this issue mostly go away. It is due to an interaction between the MDS and the ganesha NFS server, which keeps its own cache. I believe newer versions of ganesha can deal with it. Sent from Android device On 20 Oct 2021 09:37, Marc wrote: This

[ceph-users] Re: config db host filter issue

2021-10-20 Thread Josh Baergen
Hey Richard, On Tue, Oct 19, 2021 at 8:37 PM Richard Bade wrote: > user@cstor01 DEV:~$ sudo ceph config set osd/host:cstor01 osd_max_backfills 2 > user@cstor01 DEV:~$ sudo ceph config get osd.0 osd_max_backfills > 2 > ... > Are others able to reproduce? Yes, we've found the same thing on
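For readers who want to check their own clusters, the reported mismatch can be probed by comparing what the config database resolves with what the daemon actually runs with. A sketch, reusing the hostname from the thread as a placeholder:

```shell
# Set an option with a host mask, then compare the two views:
ceph config set osd/host:cstor01 osd_max_backfills 2

# What the mon-side config database resolves for osd.0:
ceph config get osd.0 osd_max_backfills

# What the running daemon is actually using:
ceph config show osd.0 | grep osd_max_backfills
```

If `config get` and `config show` disagree for an OSD on the masked host, you are seeing the same behaviour Richard reported.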

[ceph-users] Re: clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread 胡 玮文
I don’t know if it is related, but we routinely get warnings about 1-4 clients failing to respond to cache pressure. It seems to be harmless, though. We are running 16.2.6, 2 active MDSes, over 20 kernel cephfs clients, with the latest 5.11 kernel from Ubuntu. > On 20 Oct 2021, at 16:36, Marc wrote: >

[ceph-users] Re: monitor not joining quorum

2021-10-20 Thread Michael Moyles
Have you checked sync status and progress? A mon status command on the leader and problematic monitor should show if any sync is going on. When datastores (/var/lib/ceph/mon/ by default) get large the sync can take a long time, assuming the default sync settings, and needs to complete before a

[ceph-users] jj's "improved" ceph balancer

2021-10-20 Thread Jonas Jelten
Hi! I've been working on this for quite some time now and I think it's ready for some broader testing and feedback. https://github.com/TheJJ/ceph-balancer It's an alternative standalone balancer implementation, optimizing for equal OSD storage utilization and PG placement across all pools.
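For context, usage looks roughly like the following; the script name and flags are taken from the linked README at the time of writing, so verify against the repository before running. The tool prints `ceph osd pg-upmap-items` commands for review rather than applying anything itself:

```shell
git clone https://github.com/TheJJ/ceph-balancer
cd ceph-balancer

# Propose up to 10 PG movements; output is a list of
# "ceph osd pg-upmap-items ..." commands
./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps

# Apply only after reviewing the proposed movements
bash /tmp/balance-upmaps
```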

[ceph-users] Re: ceph-ansible stable-5.0 repository must be quincy?

2021-10-20 Thread Guillaume Abrioux
Hi Simon, are you indeed using the latest version of stable-5.0? Regards, On Wed, 20 Oct 2021 at 14:19, Simon Oosthoek wrote: > Hi > > we're trying to get ceph-ansible working again for our current version > of ceph (octopus), in order to be able to add some osd nodes to our > cluster.

[ceph-users] ceph-ansible stable-5.0 repository must be quincy?

2021-10-20 Thread Simon Oosthoek
Hi, we're trying to get ceph-ansible working again for our current version of ceph (octopus), in order to be able to add some osd nodes to our cluster. (Obviously there's a longer story here, but just a quick question for now...) When we add in all.yml: ceph_origin: repository

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Szabo, Istvan (Agoda)
Have you tried to repair pg? Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com --- On 2021. Oct 20., at 9:04, Glaza

[ceph-users] Expose rgw using consul or service discovery

2021-10-20 Thread Pierre GINDRAUD
Hello, I'm migrating from puppet to cephadm to deploy a ceph cluster, and I'm using consul to expose radosgateway. Before, with puppet, we deployed radosgateway with "apt install radosgw" and applied upgrades using "apt upgrade radosgw". In our consul service a simple healthcheck on this

[ceph-users] CEPH Zabbix MGR unable to send TLS Data

2021-10-20 Thread Marc Riudalbas Clemente
Hello, we are trying to monitor our Ceph Cluster using the native Zabbix Module from CEPH. (ceph mgr zabbix). We have configured our Zabbix Server to only accept TLS (PSK) connections. When we send data with the Zabbix Sender to the Zabbix Server this way: */usr/bin/zabbix_sender -vv
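As a point of comparison, a manual zabbix_sender invocation with PSK looks like the following (the server name, item key, PSK identity, and key file path are all placeholders). If this works but the mgr module's sends fail, the module is likely not passing the TLS options at all:

```shell
/usr/bin/zabbix_sender -vv \
  -z zabbix.example.com -p 10051 \
  -s "$(hostname -s)" \
  -k ceph.health -o 0 \
  --tls-connect psk \
  --tls-psk-identity "ceph-mgr-psk" \
  --tls-psk-file /etc/zabbix/ceph_psk.key
```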

[ceph-users] clients failing to respond to cache pressure (nfs-ganesha)

2021-10-20 Thread Marc
If I restart nfs-ganesha this message disappears. Is there another solution (server side) that would clear this message, without the need to restart nfs or have some sort of service interruption? ___ ceph-users mailing list -- ceph-users@ceph.io

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza
Sorry Marc, didn't see the second question. As the upgrade process states, rgws are the last to be upgraded, so they are still on nautilus (centos7). Those logs showed up after the upgrade of the first osd host. It is a multisite setup so I am a little afraid of upgrading rgw now. Etienne:

[ceph-users] Re: Expose rgw using consul or service discovery

2021-10-20 Thread Sebastian Wagner
Am 20.10.21 um 09:12 schrieb Pierre GINDRAUD: > Hello, > > I'm migrating from puppet to cephadm to deploy a ceph cluster, and I'm > using consul to expose radosgateway. Before, with puppet, we were > deploying radosgateway with "apt install radosgw" and applying upgrade > using "apt upgrade

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza
I did it only on MON servers. OSDs are on centos 7. The process was: 1. stop mon 2. backup /var/lib/ceph 3. reinstall server as centos 8 and install ceph nautilus 4. restore /var/lib/ceph and start mon 5. wait a few days 6. upgrade mon to octopus On 20.10.2021 at 09:51, Marc wrote: How did you do the
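The six steps described can be sketched as shell commands; the paths and unit names below assume a default package install, so adapt them to the host:

```shell
# 1-2: stop the mon and back up its data directory
systemctl stop ceph-mon@$(hostname -s)
tar czf /root/ceph-mon-backup.tgz /var/lib/ceph

# 3: reinstall the host as CentOS 8 and install the Nautilus
#    packages (done out of band; not shown here)

# 4: restore the data and start the mon again
tar xzf /root/ceph-mon-backup.tgz -C /
systemctl start ceph-mon@$(hostname -s)

# 5-6: let it settle for a few days, then upgrade this mon to Octopus
```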

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza
Yes I did, and despite "Too many repaired reads on 1 OSDs" health is back to HEALTH_OK. But it is the second time it has happened and I do not know whether I should go forward with the update or hold on. Or maybe it was a bad move to run compaction right after the migration to 15.2.14. On 20.10.2021 at 09:21, Szabo,

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Marc
How did you do the upgrade from centos7 to centos8? I assume you kept the osd configs etc? > upgrading nautilus (14.2.22) to octopus (15.2.14) on centos7 (Mon/Mgr > were additionally migrated to centos8 beforehand). Each day I upgraded > one host and after all osds were up, I manually compacted

[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Etienne Menguy
Hi, You should check for inconsistency root cause. https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent - Etienne Menguy etienne.men...@croit.io > On 20 Oct

[ceph-users] inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Glaza
Hi Everyone, I am in the process of upgrading nautilus (14.2.22) to octopus (15.2.14) on centos7 (Mon/Mgr were additionally migrated to centos8 beforehand). Each day I upgraded one host and after all osds were up, I manually compacted them one by one. Today (8 hosts upgraded, 7 still to go) I
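For reference, the per-OSD compaction mentioned here can be done online or offline; both commands exist in Nautilus and Octopus, though the offline variant requires the OSD to be stopped (osd.0 and the default data path are placeholders):

```shell
# Online: ask the running OSD to compact its RocksDB
ceph tell osd.0 compact

# Offline: with the OSD stopped, compact via the kvstore tool
systemctl stop ceph-osd@0
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
systemctl start ceph-osd@0
```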

[ceph-users] Re: Trying to debug "Failed to send data to Zabbix"

2021-10-20 Thread Konstantin Shalygin
Hi, Check your zabbix binary and zabbix server network reachability; the mgr calls zabbix_sender, but the exit code is bad: "/usr/bin/zabbix_sender exited non-zero" k Sent from my iPhone > On 20 Oct 2021, at 00:46, shubjero wrote: > > Hey all, > > Recently upgraded to Ceph Octopus

[ceph-users] Re: monitor not joining quorum

2021-10-20 Thread Konstantin Shalygin
Do you have any backfilling operations? In our case, once backfilling was done, the mon joined the quorum immediately. k Sent from my iPhone > On 20 Oct 2021, at 08:52, Denis Polom wrote: > >  > Hi, > > I've checked it, there is no IP address collision, arp tables are OK, mtu > also and according