[ceph-users] Re: 1 pg recovery_unfound after multiple crash of an OSD

2023-01-09 Thread Kai Stian Olstad
Hi Just a follow up, the issue was solved by running the command ceph pg 404.1ff mark_unfound_lost delete - Kai Stian Olstad On 04.01.2023 13:00, Kai Stian Olstad wrote: Hi We are running Ceph 16.2.6 deployed with Cephadm. Around Christmas OSDs 245 and 327 had about 20 read errors so I set
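For reference, a minimal sketch of the checks that normally precede such a deletion, reusing the PG id 404.1ff from the thread (an illustrative sequence, not the poster's exact steps):

    # list the objects the PG cannot find and see which OSDs were probed
    ceph pg 404.1ff list_unfound
    ceph pg 404.1ff query
    # only once the objects are confirmed unrecoverable, give them up
    ceph pg 404.1ff mark_unfound_lost delete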

[ceph-users] User migration between clusters

2023-01-09 Thread Szabo, Istvan (Agoda)
Hi, Normally I use rclone to migrate buckets across clusters. However, this time the user has close to 1000 buckets, so I wonder what the best approach would be rather than going bucket by bucket, any idea? Thank you
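One hedged way such a bulk copy could be scripted with rclone, assuming both RGW endpoints are already configured as remotes named src and dst (placeholder names):

    # enumerate the user's buckets on the source and sync each one
    for bucket in $(rclone lsd src: | awk '{print $NF}'); do
        rclone sync "src:${bucket}" "dst:${bucket}" --transfers 16 --checkers 32
    done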

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
We ship all of this to our centralized monitoring system (and a lot more) and have dashboards/proactive monitoring/alerting with 100PiB+ of Ceph. If you're running Ceph in production, I believe host-level monitoring is critical, above and beyond Ceph level. Things like inlet/outlet temperature,

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Erik Lindahl
Hi, Good points; however, given that ceph already collects all these statistics, isn't there any way to set (?) reasonable thresholds and actually have ceph detect the number of read errors and suggest that a given drive should be replaced? It seems a bit strange that we all should have to
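For what it's worth, a hedged sketch of Ceph's built-in device health monitoring, which at least centralizes per-drive SMART data even if the thresholds remain an operator judgment call (the device id is a placeholder):

    # enable SMART scraping via the mgr and inspect per-device metrics
    ceph device monitoring on
    ceph device ls
    ceph device get-health-metrics <device-id>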

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Anthony D'Atri
> On Jan 9, 2023, at 17:46, David Orman wrote: > > It's important to note we do not suggest using the SMART "OK" indicator as > meaning the drive is healthy. We monitor correctable/uncorrectable error counts, as > you can see a dramatic rise when the drives start to fail. 'OK' will be > reported

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
It's important to note we do not suggest using the SMART "OK" indicator as meaning the drive is healthy. We monitor correctable/uncorrectable error counts, as you can see a dramatic rise when the drives start to fail. 'OK' will be reported for SMART health long after the drive is throwing many
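As an illustration of the counters meant here, a hedged smartctl sketch (device paths are placeholders; attribute names differ between SATA and SAS drives):

    # SATA: reallocated/pending sectors and CRC/uncorrectable counts
    smartctl -A /dev/sdX | grep -Ei 'realloc|pending|uncorrect|crc'
    # SAS: the full report includes the read/write/verify error counter log
    smartctl -a /dev/sdY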

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Erik Lindahl
Hi, We too kept seeing this until a few months ago in a cluster with ~400 HDDs, while all the drives' SMART statistics were always A-OK. Since we use erasure coding, each PG involves up to 10 HDDs. It took us a while to realize we shouldn't expect scrub errors on healthy drives, but eventually we

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread David Orman
"dmesg" on all the linux hosts and look for signs of failing drives. Look at smart data, your HBAs/disk controllers, OOB management logs, and so forth. If you're seeing scrub errors, it's probably a bad disk backing an OSD or OSDs. Is there a common OSD in the PGs you've run the repairs on? On

[ceph-users] ceph orch osd rm - draining forever, shows -1 pgs

2023-01-09 Thread Wyll Ingersoll
Running ceph-pacific 16.2.9 using ceph orchestrator. We made a mistake adding a disk to the cluster and immediately issued a command to remove it using "ceph orch osd rm ### --replace --force". This OSD had no data on it at the time and was removed after just a few minutes. "ceph orch osd rm
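For reference, a hedged sketch of how a stuck cephadm removal can be inspected and cancelled (the OSD id is a placeholder):

    # show queued removals; a fully drained OSD should report 0 PGs, not -1
    ceph orch osd rm status
    # cancel the pending removal if it never completes
    ceph orch osd rm stop <osd-id>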

[ceph-users] Re: OSD crash on Onode::put

2023-01-09 Thread Igor Fedotov
Hi Dongdong, thanks a lot for your post, it's really helpful. Thanks, Igor On 1/5/2023 6:12 AM, Dongdong Tao wrote: I see many users recently reporting that they have been struggling with this Onode::put race condition issue[1] on both the latest Octopus and Pacific. Igor opened a PR [2]

[ceph-users] Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

2023-01-09 Thread Kuhring, Mathias
Hey all, I'd like to pick up on this topic, since we also see regular scrub errors recently. Roughly one per week for around six weeks now. It's always a different PG and the repair command always helps after a while. But the regular recurrence seems a bit unsettling. How to best
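For the record, a hedged sketch of the usual inspect-then-repair loop (the PG id is a placeholder; the inconsistent-object listing also names the OSD that returned the error, which helps spot a recurring disk):

    ceph health detail                                # names the inconsistent PG
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>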

[ceph-users] OSD crash on Onode::put

2023-01-09 Thread Dongdong Tao
-- Resending this mail, it seems ceph-users@ceph.io was down for the last few days. I see many users recently reporting that they have been struggling with this Onode::put race condition issue[1] on both the latest Octopus and Pacific. Igor opened a PR [2] to address this issue, I've been

[ceph-users] Re: Serious cluster issue - Incomplete PGs

2023-01-09 Thread Deep Dish
Thanks for the insight Eugen. Here's what basically happened: - Upgrade from Nautilus to Quincy via migration to new cluster on temp hardware; - Data from Nautilus migrated successfully to older / lab-type equipment running Quincy; - Nautilus Hardware rebuilt for Quincy, data migrated back; - As

[ceph-users] Re: VolumeGroup must have a non-empty name / 17.2.5

2023-01-09 Thread Eugen Block
Hi, if you intend to use those disks as OSDs you should wipe them, depending on your OSD configuration (drivegroup.yml) those disks will be automatically created. If you don't want that you might need to set: ceph orch apply osd --all-available-devices --unmanaged=true See [1] for more
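A hedged sketch of both options mentioned above (hostname and device path are placeholders):

    # wipe a disk so the orchestrator can reuse it as an OSD
    ceph orch device zap <host> /dev/sdX --force
    # or stop cephadm from automatically consuming available devices
    ceph orch apply osd --all-available-devices --unmanaged=true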

[ceph-users] Re: Mixing SSD and HDD disks for data in ceph cluster deployment

2023-01-09 Thread Michel Niyoyita
Thank you very much Anthony and Eugen, I followed your instructions and now it works fine, the classes are hdd and ssd, and we now have 60 OSDs, up from 48. Thanks again Michel On Mon, 9 Jan 2023, 17:00 Anthony D'Atri, wrote: > For anyone finding this thread down the road: I wrote to the poster >

[ceph-users] Re: Mixing SSD and HDD disks for data in ceph cluster deployment

2023-01-09 Thread Anthony D'Atri
For anyone finding this thread down the road: I wrote to the poster yesterday with the same observation. Browsing the ceph-ansible docs and code, to get them to deploy as they want, one may pre-create LVs and enumerate them as explicit data devices. Their configuration also enables primary
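As a hedged illustration only, assuming ceph-ansible's lvm_volumes variable (the VG/LV names are made up), pre-created LVs can be enumerated along these lines in group_vars:

    lvm_volumes:
      - data: data-lv1
        data_vg: vg-hdd1
      - data: data-lv2
        data_vg: vg-ssd1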

[ceph-users] Re: Mixing SSD and HDD disks for data in ceph cluster deployment

2023-01-09 Thread Eugen Block
Hi, it appears that the SSDs were used as db devices (/dev/sd[efgh]). According to [1] (I don't use ansible) the simple case is that: [...] most of the decisions on how devices are configured to provision an OSD are made by the Ceph tooling (ceph-volume lvm batch in this case). And I

[ceph-users] Mixing SSD and HDD disks for data in ceph cluster deployment

2023-01-09 Thread Michel Niyoyita
Hello team I have an issue with ceph deployment using ceph-ansible. We have two categories of disks, HDD and SSD; while deploying Ceph only the HDDs appear, no SSDs. The cluster is running on Ubuntu 20.04 and unfortunately no errors appear. Did I miss something in the configuration?

[ceph-users] Re: docs.ceph.com -- Do you use the header navigation bar? (RESPONSES REQUESTED)

2023-01-09 Thread Boris Behrens
I actually do not mind if I need to scroll up a line, but I also think it is a good idea to remove it. On Mon, 9 Jan 2023 at 11:06, Frank Schilder wrote: > > Hi John, > > firstly, image attachments are filtered out by the list. How about you upload > the image somewhere like

[ceph-users] Re: Erasing Disk to the initial state

2023-01-09 Thread Frank Schilder
You need to stop all daemons, remove the mon stores and wipe the OSDs with ceph-volume. Find out which OSDs were running on which host (ceph-volume inventory DEVICE) and use ceph-volume lvm zap --destroy --osd-id ID on these hosts. Best regards, Frank Schilder, AIT Risø
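Putting those steps together, a hedged sketch for a non-containerized host (IDs and devices are placeholders; --destroy wipes the data irrevocably):

    # identify which OSD lives on which device
    ceph-volume inventory /dev/sdX
    # stop the daemon, then zap its LVs and the underlying device
    systemctl stop ceph-osd@<ID>
    ceph-volume lvm zap --destroy --osd-id <ID>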

[ceph-users] NoSuchBucket when bucket exists ..

2023-01-09 Thread Shashi Dahal
Hi, In a working all-in-one (AIO) test setup of OpenStack & Ceph (where making the bucket public works from the browser): radosgw-admin bucket list [ "711138fc95764303b83002c567ce0972/demo" ] I have another cluster where OpenStack and Ceph are separate. I have set the same config options
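A hedged sketch of checks that can narrow this down (names are placeholders; a NoSuchBucket from RGW often means the request hit a different tenant or zone than expected):

    # confirm the bucket exists and which owner/tenant it belongs to
    radosgw-admin bucket stats --bucket=demo
    # list buckets owned by a specific user
    radosgw-admin bucket list --uid=<uid>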

[ceph-users] Re: Serious cluster issue - Incomplete PGs

2023-01-09 Thread Eugen Block
Hi, can you clarify what exactly you did to get into this situation? What about the undersized PGs, any chance to bring those OSDs back online? Regarding the incomplete PGs I'm not sure there's much you can do if the OSDs are lost. To me it reads like you may have destroyed/recreated

[ceph-users] Re: docs.ceph.com -- Do you use the header navigation bar? (RESPONSES REQUESTED)

2023-01-09 Thread Frank Schilder
Hi John, firstly, image attachments are filtered out by the list. How about you upload the image somewhere like https://imgur.com/ and post a link instead? In my browser, the sticky header contains only "home" and "edit on github", which are both entirely useless for a user. What exactly is

[ceph-users] Re: increasing number of (deep) scrubs

2023-01-09 Thread Frank Schilder
Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer-grained way of controlling scrubbing than just doubling or tripling it right away. Some more info: these 2 pools are data pools for a large
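For completeness, a hedged sketch of the scrub-related knobs involved (values are examples, not recommendations):

    # allow one more concurrent scrub per OSD
    ceph config set osd osd_max_scrubs 2
    # or throttle less aggressively: a shorter sleep and a higher load
    # threshold both let scrubbing make more progress
    ceph config set osd osd_scrub_sleep 0.05
    ceph config set osd osd_scrub_load_threshold 2.0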

[ceph-users] Re: mon scrub error (scrub mismatch)

2023-01-09 Thread Frank Schilder
Hi Dan, it went unnoticed and is in all log files + rotated. I also wondered about the difference in #auth keys and looked at it. However, we have only 23 auth keys (it's a small test cluster). No idea what the 77/78 mean. Maybe including some history? I went ahead and rebuilt the mon store