[ceph-users] Re: osd crash randomly

2022-10-24 Thread can zhu
The same OSD crashed today:

 0> 2022-10-24T06:30:00.875+ 7f0bbf3bc700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f0bbf3bc700 thread_name:bstore_kv_final

 ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12ce0)
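
Since Pacific records daemon crashes in the crash module, one way to pull the full backtrace for this event is via the crash commands (a minimal sketch; the crash ID is a placeholder):

  ceph crash ls                  # list recorded crashes with their IDs
  ceph crash info <crash-id>     # full backtrace and daemon metadata for one crash
  ceph crash archive <crash-id>  # acknowledge it so the RECENT_CRASH warning clears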

[ceph-users] Re: Understanding rbd objects, with snapshots

2022-10-24 Thread Chris Dunlop
Hi Maged, Thanks for taking the time to go into a detailed explanation. It's certainly not as easy as working out the appropriate object to get via rados. As you suggest, I'll have to look into ceph-objectstore-tool and perhaps librados to get any further. Thanks again, Chris On Mon, Oct
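
For reference, a minimal sketch of inspecting objects with ceph-objectstore-tool (assumes OSD id 0, and the OSD has to be stopped while the tool runs):

  systemctl stop ceph-osd@0
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list | head
  systemctl start ceph-osd@0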

[ceph-users] Re: How to remove remaining bucket index shard objects

2022-10-24 Thread 伊藤 祐司
Hi, The large omap alert appears to have resolved itself last week, although I don't know the underlying reason. When I got your email and tried to gather the data, I noticed that the alerts had stopped. The OMAP size was 0 bytes, as shown below. To make sure, I ran a deep scrub and waited for a while, but the alert has not
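
For anyone checking the same thing, a rough way to look at the omap key counts of bucket index shard objects (the pool and object names here are placeholders; adjust them to your zone's index pool):

  rados -p default.rgw.buckets.index ls | head
  rados -p default.rgw.buckets.index listomapkeys '.dir.<bucket-marker>.<shard>' | wc -l
  ceph pg deep-scrub <pgid>    # re-scrub the PG holding the object so the warning state refreshes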

[ceph-users] Dashboard device health info missing

2022-10-24 Thread Wyll Ingersoll
Looking at the device health info for the OSDs in our cluster sometimes shows "No SMART data available". This appears to occur only for SCSI-type disks in our cluster. ATA disks have their full SMART health data displayed, but the non-ATA disks do not. The actual SMART data (JSON formatted) is
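
A rough way to compare what the dashboard sees with what smartctl reports directly (the device ID and /dev path are placeholders; --json needs smartmontools 7.x):

  ceph device ls                               # device IDs and the daemons using them
  ceph device get-health-metrics <device-id>   # the health data Ceph has scraped for that device
  smartctl -a --json /dev/sdX                  # what smartctl itself returns for the disk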

[ceph-users] Re: CephFS constant high write I/O to the metadata pool

2022-10-24 Thread Olli Rajala
I tried my luck and upgraded to 17.2.4, but unfortunately that didn't make any difference here either. I also looked again at all kinds of client op and request stats, which only made me even more certain that this I/O is not caused by any clients. What internal MDS operation or
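
One way to see whether the writes come from the MDS journal rather than client requests is to watch the MDS perf counters (a sketch; run on the active MDS host, mds.<name> is a placeholder):

  ceph daemon mds.<name> perf dump mds_log     # journal events, segments, flushes
  ceph daemon mds.<name> dump_ops_in_flight    # confirms whether any client ops are pending
  ceph daemonperf mds.<name>                   # live per-second view of the counters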

[ceph-users] Re: Temporary shutdown of subcluster and cephfs

2022-10-24 Thread Patrick Donnelly
On Wed, Oct 19, 2022 at 7:54 AM Frank Schilder wrote: > > Hi Dan, > > I know that "fs fail ..." is not ideal, but we will not have time for a clean > "fs down true" and wait for journal flush procedure to complete (on our > cluster this takes at least 20 minutes, which is way too long). My
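
For comparison, the two shutdown paths being discussed look roughly like this (a sketch; <fs_name> is a placeholder):

  # clean shutdown: flushes the MDS journal, can take a long time on a busy cluster
  ceph fs set <fs_name> down true
  ceph status                      # wait until all MDS ranks have stopped
  # quick shutdown: marks the fs failed immediately; the journal is replayed on restart
  ceph fs fail <fs_name>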

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Joseph Mundackal
Quick napkin math: for your 3-way replicated pool (e.g. pool 28) you have 9.9 TB across 256 PGs ~= 10137 GB / 256 PGs ~= 39 GB per PG. For the 4+2 EC pool 51, 32 TB across 128 PGs ~= 32768 GB / 128 PGs ~= 256 GB per PG; with the 4+2 profile this should be spread across 4 data parts ~= 64 GB
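
The same arithmetic as a quick check (numbers taken from above):

  echo "9.9*1024/256" | bc     # ~39 GB per PG in the replicated pool
  echo "32*1024/128" | bc      # ~256 GB per PG in the 4+2 EC pool
  echo "32*1024/128/4" | bc    # ~64 GB per data chunk per PG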

[ceph-users] Re: Failed to probe daemons or devices

2022-10-24 Thread Guillaume Abrioux
Hello Sake, Could you share the output of the vgs / lvs commands? Also, I would suggest you open a tracker [1] Thanks! [1] https://tracker.ceph.com/projects/ceph-volume On Mon, 24 Oct 2022 at 10:51, Sake Paulusma wrote: > Last friday I upgrade the Ceph cluster from 17.2.3 to 17.2.5 with "ceph
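
For reference, the kind of output that usually helps here (all read-only, run on the affected host):

  vgs                             # volume groups backing the OSDs
  lvs -o +devices                 # logical volumes and the physical devices under them
  cephadm ceph-volume lvm list    # how ceph-volume maps those LVs to OSD ids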

[ceph-users] Re: ceph-ansible install failure

2022-10-24 Thread Guillaume Abrioux
Hi Zhongzhou, I think most of the time it means that a device is not wiped correctly. Can you check that? Thanks! On Sat, 22 Oct 2022 at 01:01, Zhongzhou Cai wrote: > Hi folks, > > I'm trying to install ceph on GCE VMs (debian/ubuntu) with PD-SSDs using > ceph-ansible image. The installation
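
A minimal sketch of checking and wiping a device before retrying the playbook (/dev/sdX is a placeholder; the last two commands are destructive, so double-check the device first):

  lsblk /dev/sdX                            # confirm whether leftover partitions/LVs exist
  wipefs -a /dev/sdX                        # remove filesystem, LVM and GPT signatures
  ceph-volume lvm zap --destroy /dev/sdX    # let ceph-volume tear down any old LVs as well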

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim, Ah, it didn't sink in for me at first how many pools there were here. I think you might be hitting the issue that the author of https://github.com/TheJJ/ceph-balancer ran into, and thus their balancer might help in this case. Josh On Mon, Oct 24, 2022 at 8:37 AM Tim Bishop wrote: > >

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi Joseph, Here's some of the larger pools. Notably the largest (pool 51, 32 TiB CephFS data) doesn't have the highest number of PGs.

POOL     ID  PGS  STORED   OBJECTS  USED    %USED  MAX AVAIL
pool28   28  256  9.9 TiB  2.61M    30 TiB  43.28  13 TiB
pool29

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi Josh, On Mon, Oct 24, 2022 at 07:20:46AM -0600, Josh Baergen wrote: > > I've included the osd df output below, along with pool and crush rules. > > Looking at these, the balancer module should be taking care of this > imbalance automatically. What does "ceph balancer status" say? # ceph
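
For reference, the usual checks on the balancer side; the last three commands are only needed if it turns out to be off, and upmap mode requires all clients to be Luminous or newer:

  ceph balancer status
  ceph osd set-require-min-compat-client luminous
  ceph balancer mode upmap
  ceph balancer on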

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Joseph Mundackal
Hi Tim, You might want to check your pool utilization and see if there are enough PGs in that pool. A higher GB-per-PG ratio can result in this scenario. I am also assuming that you have the balancer module turned on; "ceph balancer status" should tell you that as well. If you have enough PGs in the bigger
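
A quick way to see the GB-per-PG ratio and whether the autoscaler thinks a pool needs more PGs (read-only):

  ceph osd pool autoscale-status            # per-pool size, rate and suggested PG_NUM
  ceph osd pool ls detail | grep pg_num     # current pg_num / pgp_num per pool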

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Anthony D'Atri
Hey, Tim. Visualization is great to help get a better sense of OSD fillage than a table of numbers. A Grafana panel works, or a quick script. Grab this from CERN: https://gitlab.cern.ch/ceph/ceph-scripts/-/blob/master/tools/histogram.py
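
Even without Grafana or the CERN script, the spread is visible straight from the CLI, for example:

  ceph osd df tree       # per-OSD %USE and VAR columns, grouped by host/rack
  ceph osd df | tail -2  # summary line with MIN/MAX VAR and STDDEV across all OSDs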

[ceph-users] Re: Advice on balancing data across OSDs

2022-10-24 Thread Josh Baergen
Hi Tim, > I've included the osd df output below, along with pool and crush rules. Looking at these, the balancer module should be taking care of this imbalance automatically. What does "ceph balancer status" say? Josh

[ceph-users] Advice on balancing data across OSDs

2022-10-24 Thread Tim Bishop
Hi all, ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable) We're having an issue with the spread of data across our OSDs. We have 108 OSDs in our cluster, all identical disk size, same number in each server, and the same number of servers in each rack. So I'd hoped

[ceph-users] Re: Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Martin Johansen
Hi, thank you. We replaced the service's domain in the text before reporting the issue; sorry, I should have mentioned that. admin.ceph.example.com was turned into admin.ceph. for privacy's sake. Best Regards, Martin Johansen On Mon, Oct 24, 2022 at 2:53 PM Murilo Morais wrote: > Hello Martin. >

[ceph-users] Re: Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Murilo Morais
Hello Martin. Apparently cephadm is not able to resolve `admin.ceph.`. Check /etc/hosts or your DNS, try pinging the host, and verify that the IPs shown in `ceph orch host ls` respond without packet loss. Try according to the documentation:
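
A minimal sketch of the checks cephadm itself runs (the hostname is a placeholder):

  ceph orch host ls                     # hostnames and addresses cephadm has recorded
  ceph cephadm check-host <hostname>    # run the connectivity / prerequisite checks for that host
  ping -c 3 <hostname>                  # confirm the name actually resolves and responds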

[ceph-users] Re: Understanding rbd objects, with snapshots

2022-10-24 Thread Maged Mokhtar
On 18/10/2022 01:24, Chris Dunlop wrote: Hi, Is there anywhere that describes exactly how rbd data (including snapshots) is stored within a pool? I can see how an rbd broadly stores its data in rados objects in the pool, although the object map is opaque. But once an rbd snap is created
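
A rough way to see the mapping for yourself (pool and image names are placeholders):

  rbd info <pool>/<image>                                     # shows block_name_prefix, e.g. rbd_data.<id>
  rados -p <pool> ls | grep rbd_data.<id> | head              # the backing data objects for the image
  rados -p <pool> listsnaps rbd_data.<id>.0000000000000000    # per-object clone/snapshot information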

[ceph-users] Re: rgw multisite octopus - bucket can not be resharded after cancelling prior reshard process

2022-10-24 Thread Boris Behrens
Cheers again. I am still stuck at this. Does anyone have an idea how to fix it? Am Fr., 7. Okt. 2022 um 11:30 Uhr schrieb Boris Behrens : > Hi, > I just wanted to reshard a bucket but mistyped the number of shards. In a > reflex I hit ctrl-c and waited. It looked like the resharding did not > finish
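
In case it helps others hitting the same state, these are the commands usually involved in inspecting an interrupted reshard (the bucket name is a placeholder; review the output before changing anything):

  radosgw-admin reshard list                          # is a reshard job still queued for the bucket?
  radosgw-admin reshard cancel --bucket=<bucket>      # drop a stuck reshard entry
  radosgw-admin reshard stale-instances list          # leftover bucket instances from old reshards
  radosgw-admin bucket check --bucket=<bucket> --fix  # rebuild index stats if they are off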

[ceph-users] Debug cluster warnings "CEPHADM_HOST_CHECK_FAILED", "CEPHADM_REFRESH_FAILED" etc

2022-10-24 Thread Martin Johansen
Hi, I deployed a Ceph cluster a week ago and have started experiencing warnings. Any pointers as to how to further debug or fix it? Here is info about the warnings: # ceph version ceph version 17.2.4 (1353ed37dec8d74973edc3d5d5908c20ad5a7332) quincy (stable) # ceph status cluster: id:
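
A few places to look for the underlying cause of those cephadm health warnings (all read-only):

  ceph health detail              # which host/daemon each warning refers to
  ceph log last 50 info cephadm   # recent messages from the cephadm module
  ceph orch host ls               # whether any host is listed as offline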

[ceph-users] MGR process regularly not responding

2022-10-24 Thread Gilles Mocellin
Hi, In our Ceph Pacific clusters (16.2.10) (1 for OpenStack and S3, 2 for backup on RBD and S3), since the upgrade to Pacific the MGR regularly stops responding and is no longer seen in ceph status. The process is still there. Nothing in the MGR log, just no more logs. Restarting the
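
When it happens again, a couple of things that may narrow it down (a sketch; the config change only raises MGR log verbosity and can be reverted afterwards):

  ceph mgr dump | grep -E 'active_name|available'   # is any MGR still considered active?
  ceph mgr fail                                     # force a failover to a standby MGR
  ceph config set mgr debug_mgr 10                  # more verbose MGR logging
  ceph config rm mgr debug_mgr                      # revert once done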

[ceph-users] Failed to probe daemons or devices

2022-10-24 Thread Sake Paulusma
Last Friday I upgraded the Ceph cluster from 17.2.3 to 17.2.5 with "ceph orch upgrade start --image localcontainerregistry.local.com:5000/ceph/ceph:v17.2.5-20221017". After some time (an hour?) I got a health warning: CEPHADM_REFRESH_FAILED: failed to probe daemons or devices. I'm using only
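
Some commands that may show what the probe is failing on (mostly read-only; the two --refresh calls just trigger a new inventory scan):

  ceph orch upgrade status        # confirm the upgrade actually finished
  ceph health detail              # which host/daemon the probe failed on
  ceph orch device ls --refresh   # force a new device scan on all hosts
  ceph orch ps --refresh          # force a new daemon inventory refresh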