[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Christian Wuerdig
This tracker item should cover it: https://tracker.ceph.com/issues/51948 On Wed, 22 Sept 2021 at 11:03, Nigel Williams wrote: > > Could we see the content of the bug report please, that RH bugzilla entry > seems to have restricted access. > "You are not authorized to access bug #1996680." > > On

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Increasing day by day, this is the current situation: 1 server has 6x 15.3TB SAS SSDs, 3x SSDs are using 1x 1.92TB NVMe for db+wal. [WRN] BLUEFS_SPILLOVER: 13 OSD(s) experiencing BlueFS spillover osd.1 spilled over 56 GiB metadata from 'db' device (318 GiB used of 596 GiB) to slow device

[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Nigel Williams
Could we see the content of the bug report, please? That RH bugzilla entry seems to have restricted access: "You are not authorized to access bug #1996680." On Wed, 22 Sept 2021 at 03:32, Patrick Donnelly wrote: > You're probably hitting this bug: >

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 07:07, Szabo, Istvan (Agoda) wrote: > > Increasing day by day, this is the current situation: 1 server has 6x 15.3TB > SAS SSDs, 3x SSDs are using 1x 1.92TB NVMe for db+wal. > > [WRN] BLUEFS_SPILLOVER: 13 OSD(s) experiencing BlueFS spillover > osd.1 spilled over 56

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Sorry to hijack the thread: if I have 500GB and 700GB of mixed WAL+RocksDB on NVMe, should the level base be 50 and 70? Or does it need to be a power of 2? Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e:

[ceph-users] Re: EC CLAY production-ready or technology preview in Pacific?

2021-09-21 Thread Neha Ojha
On Thu, Aug 19, 2021 at 9:29 AM Jeremy Austin wrote: > > I cannot speak in any official capacity, but in my limited experience > (20-30TB), EC CLAY has been functioning without an error for about 2 years. > No issues in Pacific myself yet (fingers crossed). This is good to know! I don't recall too

[ceph-users] Re: Monitor issue while installation

2021-09-21 Thread Konstantin Shalygin
Hi, Your Ansible monitoring_group_name variable is not defined; define it first. k > On 21 Sep 2021, at 12:12, Michel Niyoyita wrote: > > Hello team > > I am running a ceph cluster pacific version deployed using ansible. I > would like to add other osds but it fails once it reaches the mon
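A hedged sketch of what defining that variable might look like in a ceph-ansible setup; the group name "monitoring", the file paths and the host name are illustrative assumptions, not taken from this thread:

    # group_vars/all.yml
    monitoring_group_name: monitoring

    # inventory file
    [monitoring]
    ceph-mon-1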

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 05:54, Szabo, Istvan (Agoda) wrote: > > Sorry to steal it, so if I have 500GB and 700GB mixed wal+rocksdb on nvme > the number should be the level base 50 and 70? Or needs to be > power of 2? Generally the sum of all levels (up to the max of your
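To make the level arithmetic concrete, here is a rough worked example with the RocksDB defaults discussed later in this thread (max_bytes_for_level_base=256MB, max_bytes_for_level_multiplier=10); it is a sketch, not a statement about Istvan's exact configuration:

    L1 ~ 256 MB
    L2 ~ 2.56 GB
    L3 ~ 25.6 GB
    L4 ~ 256 GB
    L1+L2+L3+L4 ~ 284 GB   -> fits on a 500 GB or 700 GB DB partition
    L5 ~ 2.56 TB           -> does not fit, so the defaults are unlikely to use space much beyond ~300 GB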

[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread David Schulz
Wow!  Thanks everyone! The bug report at https://tracker.ceph.com/issues/51948 describes exactly the behaviour that we are seeing.  I'll update and let everyone know when I've finished the upgrade.  This will probably take a few days as I need to wait for a window to do the work. Sincerely

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
On Wed, 22 Sept 2021 at 00:54, mhnx wrote: > > Thanks for the explanation. Then the first thing I did wrong I didn't add > levels to reach total space. I didn't know that and I've set : > max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10 > 536870912*10*10=50Gb > > I have

[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread Dan van der Ster
It's this: https://tracker.ceph.com/issues/51948 The fix just landed in 4.18.0-305.19.1 https://access.redhat.com/errata/RHSA-2021:3548 On Tue, 21 Sep 2021, 19:35 Marc, wrote: > > I do not have access to this page. Maybe others also not, so it is better > to paste it's content here. > > >

[ceph-users] Re: *****SPAM***** Re: Corruption on cluster

2021-09-21 Thread Marc
I do not have access to this page. Maybe others do not either, so it is better to paste its content here. > -Original Message- > From: Patrick Donnelly > Sent: Tuesday, 21 September 2021 19:30 > To: David Schulz > Cc: ceph-users@ceph.io > Subject: *SPAM* [ceph-users] Re:

[ceph-users] Re: Corruption on cluster

2021-09-21 Thread Patrick Donnelly
Hi Dave, On Tue, Sep 21, 2021 at 1:20 PM David Schulz wrote: > > Hi Everyone, > > For a couple of weeks I've been battling a corruption in Ceph FS that > happens when a writer on one node writes a line and calls sync as is > typical with logging and the file is corrupted when the same file that

[ceph-users] Corruption on cluster

2021-09-21 Thread David Schulz
Hi Everyone, For a couple of weeks I've been battling a corruption in CephFS that happens when a writer on one node writes a line and calls sync, as is typical with logging, and the file is corrupted when the same file that is being written is read from another client. The cluster is a

[ceph-users] Re: Error: UPGRADE_FAILED_PULL: Upgrade: failed to pull target image

2021-09-21 Thread Radoslav Milanov
There is a problem upgrading ceph-iscsi from 16.2.5 to 16.2.6. 2021-09-21T12:43:58.767556-0400 mgr.nj3231.wagzhn [ERR] cephadm exited with an error code: 1, stderr:Redeploy daemon iscsi.iscsi.nj3231.mqeari ... Creating ceph-iscsi config... Write file:
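A few generic cephadm commands that may help narrow this down (a hedged sketch, not a confirmed fix; the daemon name is copied from the error above):

    ceph orch upgrade status
    ceph orch ps | grep iscsi
    ceph orch daemon redeploy iscsi.iscsi.nj3231.mqeari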

[ceph-users] MDS 16.2.5-387-g7282d81d and DAEMON_OLD_VERSION

2021-09-21 Thread Выдрук Денис
Hello. I had upgraded my cluster from Nautilus to Pacific and switched it to Cephadm, adopting the services into the new container system. I wanted to start using CephFS and created it with "sudo ceph fs volume create". Everything works fine. Some days ago, the “DAEMON_OLD_VERSION” warning
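A hedged sketch of how one might confirm which daemons are behind and, if acceptable, silence the warning while they are redeployed (standard ceph CLI; whether muting is appropriate here is an assumption):

    ceph versions                             # which daemons still report the old version
    ceph health detail                        # lists the daemons flagged by DAEMON_OLD_VERSION
    ceph health mute DAEMON_OLD_VERSION 1w    # optional, while the stragglers are upgraded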

[ceph-users] Re: S3 Bucket Notification requirement

2021-09-21 Thread Sanjeev Jha
Hi Yuval, I am stuck on the first step, where I am trying to create an SNS topic but am not able to; I cannot figure out the issue. The AMQP server is ready, up and running with AMQP 0.9.1. [root@ceprgw01 ~]# aws --endpoint-url http://localhost:8000 sns create-topic --name=mytopic
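For comparison, a minimal create-topic sketch in the style of the radosgw bucket-notification docs, assuming a hypothetical AMQP broker at 127.0.0.1:5672 and exchange "ex1" (credentials and endpoints must be adjusted to the actual setup):

    aws --endpoint-url http://localhost:8000 sns create-topic --name=mytopic \
        --attributes='{"push-endpoint": "amqp://127.0.0.1:5672", "amqp-exchange": "ex1", "amqp-ack-level": "broker"}'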

[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-21 Thread Sven Kieske
On Mo, 2021-09-20 at 10:29 -0500, Mark Nelson wrote: > At least in one case for us, the user was using consumer grade SSDs > without power loss protection.  I don't think we ever fully diagnosed if > that was the cause though.  Another case potentially was related to high > memory usage on the

[ceph-users] after upgrade: HEALTH ERR ...'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes

2021-09-21 Thread Harry G. Coin
A cluster reporting no errors while running 16.2.5 features, immediately after the upgrade to 16.2.6, what seems to be an entirely bug-related dramatic 'Health Err' on the dashboard: Module 'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes. Looking at the bug tracking
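Two hedged stopgaps that are sometimes used for failed mgr modules (assumptions, not a verified fix for this specific bug): fail over the active mgr, and temporarily switch off device monitoring until a point release resolves it:

    ceph mgr fail                    # restart/fail over the active mgr
    ceph device monitoring off       # optionally pause devicehealth scraping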

[ceph-users] osd marked down

2021-09-21 Thread Abdelillah Asraoui
Hi, one of the OSDs in the cluster went down; is there a workaround to bring this OSD back? Logs from the ceph OSD pod show the following: kubectl -n rook-ceph logs rook-ceph-osd-3-6497bdc65b-pn7mg debug 2021-09-20T14:32:46.388+ 7f930fe9cf00 -1 auth: unable to find a keyring on
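A hedged sketch of initial checks from the rook toolbox (the OSD id and pod name are taken from the log above; adjust as needed):

    ceph osd tree | grep -w osd.3        # confirm the OSD's up/down and in/out state
    ceph auth get osd.3                  # verify the OSD keyring still exists in the cluster
    kubectl -n rook-ceph describe pod rook-ceph-osd-3-6497bdc65b-pn7mg   # look for init/mount errors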

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread mhnx
Thanks for the explanation. Then the first thing I did wrong is that I didn't add up the levels to reach the total space. I didn't know that, and I've set: max_bytes_for_level_base=536870912 and max_bytes_for_level_multiplier=10 536870912*10*10=50Gb I have space on the NVMes. I think I can resize the partitions. 1-
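Spelling that arithmetic out (a sketch using the quoted values): 536870912 bytes is 512 MiB, so with a multiplier of 10 the levels come out roughly as:

    L1 ~ 512 MiB
    L2 ~ 5 GiB
    L3 ~ 50 GiB
    L1+L2+L3 ~ 56 GiB    -> a DB partition of ~60+ GiB is needed to hold L1-L3 without spillover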

[ceph-users] Re: [EXTERNAL] RE: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

2021-09-21 Thread Dave Piper
I still can't find a way to get ceph-bluestore-tool working in my containerized deployment. As soon as the OSD daemon stops, the contents of /var/lib/ceph/osd/ceph- are unreachable. I've found this blog post that suggests changes to the container's entrypoint are required, but the proposed
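One hedged approach for docker/podman based ceph-container deployments (a sketch; the image name, OSD id and device path are placeholders): stop the OSD container, then run the same image with an overridden entrypoint so the tool can be pointed at the devices directly:

    systemctl stop ceph-osd@<id>              # or however the OSD container is managed
    docker run --rm -it --privileged --entrypoint /bin/bash \
        -v /dev:/dev -v /var/lib/ceph:/var/lib/ceph <ceph-container-image>
    # inside the container:
    ceph-bluestore-tool show-label --dev /dev/<osd-block-device>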

[ceph-users] Successful Upgrade from 14.2.22 to 15.2.14

2021-09-21 Thread Dan van der Ster
Dear friends, This morning we upgraded our pre-prod cluster from 14.2.22 to 15.2.14, successfully, following the procedure at https://docs.ceph.com/en/latest/releases/octopus/#upgrading-from-mimic-or-nautilus It's a 400TB cluster which is 10% used with 72 osds (block=hdd, block.db=ssd) and 40M
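For reference, a condensed sketch of the documented sequence (the linked procedure is authoritative; the package upgrades and daemon restarts for mons, mgrs, OSDs and MDS/RGW are omitted here):

    ceph osd set noout                       # before restarting daemons
    # upgrade + restart mons, then mgrs, then OSDs, then MDS/RGW
    ceph osd require-osd-release octopus     # after all OSDs run Octopus
    ceph osd unset noout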

[ceph-users] Monitor issue while installation

2021-09-21 Thread Michel Niyoyita
Hello team I am running a ceph cluster pacific version deployed using ansible. I would like to add other OSDs but it fails once it reaches the mon installation, with this fatal error: msg: |- The conditional check 'groups.get(monitoring_group_name, []) | length > 0' failed. The error was:

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Szabo, Istvan (Agoda)
Let me join in; I have 11 BlueFS spillovers in my cluster. Where are these settings coming from? Istvan Szabo Senior Infrastructure Engineer --- Agoda Services Co., Ltd. e: istvan.sz...@agoda.com

[ceph-users] Re: RocksDB options for HDD, SSD, NVME Mixed productions

2021-09-21 Thread Christian Wuerdig
It's been discussed a few times on the list, but RocksDB levels essentially grow by a factor of 10 (max_bytes_for_level_multiplier) by default, and you need (previous level)*10 of space on your drive for the next level to avoid spillover. So the sequence (by default) is 256MB -> 2.56GB -> 25.6GB -> 256GB
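A hedged way to see how much metadata an affected OSD has pushed to the slow device (standard commands; osd.1 is used as an example):

    ceph health detail | grep -A3 BLUEFS_SPILLOVER
    ceph daemon osd.1 perf dump bluefs       # compare db_used_bytes with slow_used_bytes (run on the OSD's host)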

[ceph-users] Re: [EXTERNAL] Re: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

2021-09-21 Thread Janne Johansson
On Mon, 20 Sep 2021 at 18:02, Dave Piper wrote: > Okay - I've finally got full debug logs from the flapping OSDs. The raw logs > are both 100M each - I can email them directly if necessary. (Igor I've > already sent these your way.) > Both flapping OSDs are reporting the same "bluefs _allocate

[ceph-users] Re: rocksdb corruption with 16.2.6

2021-09-21 Thread Andrej Filipcic
Hi, Some further investigation on the failed OSDs: 1 out of 8 OSDs actually has a hardware issue. [16841006.029332] sd 0:0:10:0: [sdj] tag#96 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=2s [16841006.037917] sd 0:0:10:0: [sdj] tag#34 FAILED Result: hostbyte=DID_SOFT_ERROR