[ceph-users] First 6 nodes cluster with Octopus

2021-03-30 Thread mabi
Hello, I am planning to set up a small Ceph cluster for testing purposes with 6 Ubuntu nodes and have a few questions, mostly regarding planning of the infra. 1) Based on the documentation the OS requirements mention Ubuntu 18.04 LTS; is it ok to use Ubuntu 20.04 instead or should I stick with

[ceph-users] Re: First 6 nodes cluster with Octopus

2021-03-30 Thread DHilsbos
Mabi; We're running Nautilus, and I am not wholly convinced of the "everything in containers" view of the world, so take this with a small grain of salt... 1) We don't run Ubuntu, sorry. I suspect the documentation highlights 18.04 because it's the current LTS release. Personally, if I had

[ceph-users] Re: Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Reed Dier
I've not undertaken such a large data movement. The pg-upmap script may be of use here, but let's assume it's not. If I were, I would first take many backups of the current crush map. I would set the norebalance and norecover flags. Then I would verify all of the backfill settings are as
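
A rough sketch of that preparation, assuming standard Ceph CLI tooling (file names are placeholders):

    ceph osd getcrushmap -o crushmap.bin       # binary backup of the current crush map
    crushtool -d crushmap.bin -o crushmap.txt  # keep a decompiled copy as well
    ceph osd set norebalance
    ceph osd set norecover
    # ... edit and inject the new crush map / rules ...
    ceph osd unset norecover
    ceph osd unset norebalance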

[ceph-users] Re: v14.2.19 Nautilus released

2021-03-30 Thread David Caro
Thanks for the quick release! \o/ On Tue, 30 Mar 2021, 22:30 David Galloway, wrote: > This is the 19th update to the Ceph Nautilus release series. This is a > hotfix release to prevent daemons from binding to loopback network > interfaces. All nautilus users are advised to upgrade to this

[ceph-users] Re: Rados gateway static website

2021-03-30 Thread Marcel Kuiper
Casey, Many thanks. That did the trick. Regards Marcel Casey Bodley schreef op 2021-03-30 16:48: this error 2039 is ERR_NO_SUCH_WEBSITE_CONFIGURATION. if you want to access a bucket via rgw_dns_s3website_name, you have to set a website configuration on the bucket - see

[ceph-users] v14.2.19 Nautilus released

2021-03-30 Thread David Galloway
This is the 19th update to the Ceph Nautilus release series. This is a hotfix release to prevent daemons from binding to loopback network interfaces. All nautilus users are advised to upgrade to this release. Notable Changes --- * This release fixes a regression introduced in

[ceph-users] Re: ceph-fuse false passed X_OK check

2021-03-30 Thread Patrick Donnelly
It's a bug: https://tracker.ceph.com/issues/50060 On Wed, Dec 23, 2020 at 5:53 PM Alex Taylor wrote: > > Hi Patrick, > > Any updates? Looking forward to your reply :D > > > On Thu, Dec 17, 2020 at 11:39 AM Patrick Donnelly wrote: > > > > On Wed, Dec 16, 2020 at 5:46 PM Alex Taylor wrote: > > >

[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Stefan Kooman
On 3/30/21 3:02 PM, Boris Behrens wrote: I reweighted the OSD to .0 and then forced the backfilling. How long does it take for ceph to free up space? It looks like it was doing this, but it could also be the "backup cleanup job" that removed images from the buckets. I don't have any numbers

[ceph-users] Rados gateway static website

2021-03-30 Thread Marcel Kuiper
Despite the examples that can be found on the internet, I have trouble setting up a static website that serves from an S3 bucket. If anyone could point me in the right direction that would be much appreciated. Marcel I created an index.html in the bucket sky gm-rc3-jumphost01@ceph/s3cmd

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Ahh, right. I saw it fixed here https://tracker.ceph.com/issues/18749 a long time ago, but it seems the back-port never happened. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Josh Baergen Sent: 30 March 2021

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Josh Baergen
I thought that recovery below min_size for EC pools wasn't expected to work until Octopus. From the Octopus release notes: "Ceph will allow recovery below min_size for Erasure coded pools, wherever possible." Josh On Tue, Mar 30, 2021 at 6:53 AM Frank Schilder wrote: > Dear Rainer, > > hmm,

[ceph-users] Re: Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Stefan Kooman
On 3/30/21 3:00 PM, Thomas Hukkelberg wrote: Any thoughts or insight on how to achieve this with minimal data movement and risk of cluster downtime would be welcome! I would do so with Dan's "upmap-remap" script [1]. See [2] for his presentation. We have used that quite a few times now

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Sorry about the flow of messages, I forgot to mention this. Looking at other replies, the fact that the PG in question remained at 4 out of 6 OSDs until you reduced min_size might indicate that peering was blocked for some reason, but completed after the reduction. If this was the order

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Frank Schilder
Hi, this is odd. The problem with recovery when sufficiently many but less than min_size shards are present should have been resolved with osd_allow_recovery_below_min_size=true. It is really dangerous to reduce min_size below k+1 and, in fact, should never be necessary for recovery. Can you

[ceph-users] Re: Upgrade from Luminous to Nautilus now one MDS with could not get service secret

2021-03-30 Thread Dan van der Ster
Hi Robert, We get a handful of verify_authorizer warnings on some of our clusters too but they don't seem to pose any problems. I've tried without success to debug this in the past -- IIRC I started to suspect it was coming from old cephfs kernel clients but got distracted and never reached the
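
One hedged way to check that suspicion is to look at what the connected clients report (a sketch; the MDS name is a placeholder):

    ceph features                          # connected clients grouped by release/feature bits
    ceph daemon mds.<name> session ls      # per-client metadata, including kernel version for kernel clients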

[ceph-users] Re: Rados gateway static website

2021-03-30 Thread Casey Bodley
this error 2039 is ERR_NO_SUCH_WEBSITE_CONFIGURATION. if you want to access a bucket via rgw_dns_s3website_name, you have to set a website configuration on the bucket - see https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketWebsite.html On Tue, Mar 30, 2021 at 10:05 AM Marcel Kuiper
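
For reference, a minimal sketch of doing that with s3cmd (bucket name and document names are placeholders; the aws CLI's put-bucket-website would work as well):

    s3cmd ws-create --ws-index=index.html --ws-error=error.html s3://mybucket
    s3cmd ws-info s3://mybucket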

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
I raised the backfillfull_ratio to .91 to see what happens, now I am waiting. Some OSDs were around 89-91%, some are around 50-60%. The pgp_num has been at 1946 for one week. I think this will solve itself when the cluster becomes a bit more tidy. On Tue, 30 Mar 2021 at 15:23, Dan van
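
For anyone following along, the ratio change and the follow-up checks are standard CLI calls (a sketch; 0.91 is the value from this thread):

    ceph osd set-backfillfull-ratio 0.91
    ceph osd dump | grep ratio     # confirm full/backfillfull/nearfull ratios
    ceph osd df tree               # keep an eye on per-OSD %USE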

[ceph-users] Re: Device class not deleted/set correctly

2021-03-30 Thread Stefan Kooman
On 3/25/21 1:05 PM, Nico Schottelius wrote: it seems there is no reference to it in the ceph documentation. Do you have any pointers to it? Not anymore with new Ceph documentation. Out of curiosity, do you have any clue why it's not in there anymore? It might still be, but I cannot find

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
You started with 1024 PGs and are splitting to 2048. Currently there are 1946 PGs used, so it is nearly at the goal. You need to watch that value 1946 and see if it increases slowly. If it does not increase, then those backfill_toofull PGs are probably splitting PGs, and they are blocked
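
A possible way to watch that value (a sketch; the pool name is a placeholder):

    ceph osd pool get <poolname> pgp_num     # should creep from 1946 toward 2048 as splits finish
    ceph pg ls backfill_toofull | head       # PGs currently blocked on full OSDs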

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
The output from ceph osd pool ls detail tells me nothing, except that the pgp_num is not where it should be. Can you help me to read the output? How do I estimate how long the split will take? [root@s3db1 ~]# ceph status cluster: id: dca79fff-ffd0-58f4-1cff-82a2feea05f4 health:

[ceph-users] Re: Resolving LARGE_OMAP_OBJECTS

2021-03-30 Thread David Orman
Hi Ben, That was beyond helpful. Thank you so much for the thoughtful and detailed explanation. That should definitely be added to the documentation, until/unless the dynamic resharder/sharder handle this case (if there is even desire to do so) with versioned objects. Respectfully, David On

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
It would be safe to turn off the balancer, yes go ahead. To know if adding more hardware will help, we need to see how much longer this current splitting should take. This will help: ceph status ceph osd pool ls detail -- dan On Tue, Mar 30, 2021 at 3:00 PM Boris Behrens wrote: > > I

[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Boris Behrens
I reweighted the OSD to .0 and then forced the backfilling. How long does it take for ceph to free up space? It looks like it was doing this, but it could also be the "backup cleanup job" that removed images from the buckets. On Tue, 30 Mar 2021 at 14:41, Stefan Kooman wrote: > On

[ceph-users] Re: upgrade problem nautilus 14.2.15 -> 14.2.18? (Broken ceph!)

2021-03-30 Thread Sasha Litvak
Any time frame on 14.2.19? On Fri, Mar 26, 2021, 1:43 AM Konstantin Shalygin wrote: > Finally master is merged now > > > k > > Sent from my iPhone > > > On 25 Mar 2021, at 23:09, Simon Oosthoek > wrote: > > > > I'll wait a bit before upgrading the remaining nodes. I hope 14.2.19 > will be

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
I would think due to splitting, because the balancer refuses its work due to too many misplaced objects. I also think I'll turn it off for now, so it doesn't begin its work at 5% misplaced objects. Would adding more hardware help? We wanted to insert another OSD node with 7x8TB disks

[ceph-users] Preferred order of operations when changing crush map and pool rules

2021-03-30 Thread Thomas Hukkelberg
Hi all! We run a 1.5PB cluster with 12 hosts, 192 OSDs (mix of NVMe and HDD) and need to improve our failure domain by altering the crush rules and moving racks to pods, which would imply a lot of data movement. I wonder what the preferred order of operations would be when doing such changes
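
For context, the bucket moves themselves would look roughly like this ('pod' is a default CRUSH bucket type; pod and rack names are placeholders):

    ceph osd crush add-bucket pod-a pod
    ceph osd crush move pod-a root=default
    ceph osd crush move rack1 pod=pod-a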

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke
Hello, in the meantime ceph is running again normally, except for the two osds that are down because of the failed disks. What really helped in my situation was to lower min_size from 5 (k+1) to 4 in my 4+2 erasure code setup. So I am also grateful to the programmer who put the helping hint in
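
For the record, a sketch of the check and the temporary change discussed in this thread (the pool name pxa-ec is from the thread; min_size should go back to k+1=5 once the PG is active+clean again):

    ceph daemon /var/run/ceph/ceph-mon.*.asok config show | grep osd_allow_recovery_below_min_size
    ceph osd pool set pxa-ec min_size 4   # temporary, only to unblock recovery
    ceph osd pool set pxa-ec min_size 5   # restore afterwards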

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Dan van der Ster
Are those PGs backfilling due to splitting or due to balancing? If it's the former, I don't think there's a way to pause them with upmap or any other trick. -- dan On Tue, Mar 30, 2021 at 2:07 PM Boris Behrens wrote: > > One week later the ceph is still balancing. > What worries me like hell is

[ceph-users] Re: should I increase the amount of PGs?

2021-03-30 Thread Boris Behrens
One week later the ceph is still balancing. What worries me like hell is the %USE on a lot of those OSDs. Does ceph resolve this on its own? We are currently down to 5TB space in the cluster. Rebalancing single OSDs doesn't work well and it increases the "misplaced objects". I thought about

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Burkhard Linke
Hi, On 30.03.21 13:05, Rainer Krienke wrote: Hello, yes your assumptions are correct, pxa-rbd is the metadata pool for pxa-ec, which uses an erasure coding 4+2 profile. In the last hours ceph repaired most of the damage. One inactive PG remained, and ceph health detail then told me:

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke
Hello Frank, the option is actually set. On one of my monitors: # ceph daemon /var/run/ceph/ceph-mon.*.asok config show|grep osd_allow_recovery_below_min_size "osd_allow_recovery_below_min_size": "true", Thank you very much Rainer Am 30.03.21 um 13:20 schrieb Frank Schilder: Hi, this

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke
Hello, yes your assumptions are correct, pxa-rbd is the metadata pool for pxa-ec, which uses an erasure coding 4+2 profile. In the last hours ceph repaired most of the damage. One inactive PG remained, and ceph health detail then told me: - HEALTH_WARN Reduced data availability: 1

[ceph-users] Re: forceful remap PGs

2021-03-30 Thread Boris Behrens
I just moved one PG away from the OSD, but the disk space does not get freed. Do I need to do something to clean obsolete objects from the OSD? On Tue, 30 Mar 2021 at 11:47, Boris Behrens wrote: > Hi, > I have a couple OSDs that currently get a lot of data, and are running > towards 95%

[ceph-users] Re: Ceph User Survey Working Group - Next Steps

2021-03-30 Thread Mike Perez
Hi everyone, I didn't get enough responses on the previous Doodle to schedule a meeting. I'm wondering if people are OK with the previous PDF I released or if there's interest in the community to develop better survey results? https://ceph.io/community/ceph-user-survey-2019/ On Mon, Mar 22,

[ceph-users] forceful remap PGs

2021-03-30 Thread Boris Behrens
Hi, I have a couple OSDs that currently get a lot of data, and are running towards 95% fillrate. I would like to forcefully remap some PGs (they are around 100GB) to more empty OSDs and drop them from the full OSDs. I know this would lead to degraded objects, but I am not sure how long the
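
One hedged way to do this without reweighting is the pg-upmap machinery (OSD and PG ids are placeholders; note this makes the PGs misplaced rather than degraded):

    ceph pg ls-by-osd osd.12 | head                    # candidate PGs on the full OSD
    ceph osd pg-upmap-items <pgid> <from-osd> <to-osd>
    ceph osd rm-pg-upmap-items <pgid>                  # remove the explicit mapping later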

[ceph-users] Re: ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Eugen Block
Hi, from what you've sent my conclusion about the stalled I/O would be indeed the min_size of the EC pool. There's only one PG reported as incomplete, I assume that is the EC pool, not the replicated pxa-rbd, right? Both pools are for rbd so I'm guessing the rbd headers are in pxa-rbd

[ceph-users] Re: Resolving LARGE_OMAP_OBJECTS

2021-03-30 Thread Benoît Knecht
Hi David, On Tuesday, March 30th, 2021 at 00:50, David Orman wrote: > Sure enough, it is more than 200,000, just as the alert indicates. > However, why did it not reshard further? Here's the kicker - we only > see this with versioned buckets/objects. I don't see anything in the > documentation
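
A sketch of the manual check and reshard for such a bucket (bucket name and shard count are placeholders):

    radosgw-admin bucket limit check                   # objects per shard vs. the ~200k warning threshold
    radosgw-admin reshard add --bucket=<bucket> --num-shards=<n>
    radosgw-admin reshard process
    radosgw-admin reshard status --bucket=<bucket>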

[ceph-users] ceph Nautilus lost two disk over night everything hangs

2021-03-30 Thread Rainer Krienke
Hello, I run a ceph Nautilus cluster with 9 hosts and 144 OSDs. Last night we lost two disks, so two OSDs (67, 90) are down. The two disks are on two different hosts. A third OSD on a third host reports slow ops. ceph is repairing at the moment. Pools affected are e.g. these ones: pool 35
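
Typical first-look commands for this situation (a sketch):

    ceph health detail        # which PGs are degraded/undersized/incomplete
    ceph osd tree down        # confirm which OSDs are down and on which hosts
    ceph pg ls degraded | head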

[ceph-users] Re: OSD Crash During Deep-Scrub

2021-03-30 Thread Szabo, Istvan (Agoda)
I have this deep scrub issue in the index pool's PGs almost every week, which puts the cluster into a health error state, so I always need to repair that PG :/ Any solution that you have found so far? Istvan Szabo Senior Infrastructure Engineer --- Agoda
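
A sketch of the recurring repair workflow (the PG id is a placeholder taken from ceph health detail):

    ceph health detail | grep -i inconsistent
    rados list-inconsistent-obj <pgid> --format=json-pretty
    ceph pg repair <pgid>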