[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Ing . Luis Felipe Domínguez Vega
On 2020-10-20 17:57, Ing. Luis Felipe Domínguez Vega wrote: Hi, today my infra provider had a blackout, and Ceph then tried to recover but is in an inconsistent state because many OSDs can't recover themselves: the kernel kills them by OOM. Even now an OSD that was OK has gone down, OOM-killed.
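
For OSDs that get OOM-killed while recovering, a common first step is to lower the per-OSD memory target so the BlueStore caches shrink; a minimal sketch, assuming a Nautilus-or-later cluster with centralized config (the 2 GiB value and the osd id are only illustrative):

    # cap each OSD's autotuned memory target (value is an example, not a recommendation)
    ceph config set osd osd_memory_target 2147483648
    # confirm what a running OSD actually picked up
    ceph config show osd.12 | grep osd_memory_target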

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-21 Thread Michael Thomas
On 10/21/20 6:47 AM, Frank Schilder wrote: Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign; the crush rule is OK. The fact that pg query says it doesn't have PG 1.0 points in the right direction: there is an inconsistency in the cluster. This is also indicated by

[ceph-users] Re: Large map object found

2020-10-21 Thread DHilsbos
Peter; Look into bucket sharding. Thank you, Dominic L. Hilsbos, MBA Director – Information Technology Perform Air International Inc. dhils...@performair.com www.PerformAir.com From: Peter Eisch [mailto:peter.ei...@virginpulse.com] Sent: Wednesday, October 21, 2020 12:39 PM To: ceph-users@
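
Dominic's pointer is to resharding the bucket index; a minimal sketch of doing it by hand, where the bucket name and shard count are placeholders:

    # queue a reshard of the bucket index to 128 shards, then run and check it
    radosgw-admin reshard add --bucket my-big-bucket --num-shards 128
    radosgw-admin reshard process
    radosgw-admin reshard status --bucket my-big-bucket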

[ceph-users] Re: 6 PG's stuck not-active, remapped

2020-10-21 Thread Mac Wynkoop
As an example, here's the acting and up set of one of the PG's: up: {0: 113, 1: 138, 2: 30, 3: 132, 4: 105, 5: 57, 6: 106, 7: 140, 8: 161}, acting: {0: 72, 1: 150, 2: 2147483647, 3: 2147483647, 4: 24, 5: 48, 6: 32, 7: 157, 8: 103}. So obviously there's a lot of backfilling there... but it seems it's not making an
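
The 2147483647 entries in the acting set are CRUSH_ITEM_NONE, i.e. CRUSH could not fill that slot with an OSD. A quick way to list every PG in that situation, as a hedged sketch:

    # list PGs that are remapped or otherwise not active+clean, with their up/acting sets
    ceph pg ls remapped
    ceph pg dump_stuck unclean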

[ceph-users] 6 PG's stuck not-active, remapped

2020-10-21 Thread Mac Wynkoop
We recently did some work on the Ceph cluster, and a few disks ended up offline at the same time. There are now 6 PG's that are stuck in a "remapped" state, and this is all of their recovery states: recovery_state: 0: name: Started/Primary/WaitActingChange, enter_time: 2020-10-21 18:48:02.0
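
To see which acting-set change a stuck PG is actually waiting for, querying it directly is usually the quickest route; a minimal sketch (the pgid 14.2a is purely a placeholder):

    # dump the PG's full recovery_state, including what WaitActingChange is waiting on
    ceph pg 14.2a query
    # cross-check that no OSD it wants is still down
    ceph osd tree down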

[ceph-users] Large map object found

2020-10-21 Thread Peter Eisch
Hi, My rgw.buckets.index has the cluster in WARN. I'm either not understanding the real issue or I'm making it worse, or both. OMAP_BYTES: 70461524 OMAP_KEYS: 250874 I thought I'd head this off by deleting rgw objects which would normally get deleted in the near future but this only seemed to
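
Before deleting objects it is usually worth confirming which bucket owns the oversized index object and whether it is simply under-sharded; a hedged sketch:

    # which pool tripped the LARGE_OMAP_OBJECTS warning
    ceph health detail
    # per-bucket object count versus shard count; flags buckets that need resharding
    radosgw-admin bucket limit check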

[ceph-users] Need help integrating radosgw with keystone for openstack swift

2020-10-21 Thread Bujack, Stefan
Hello, I am struggling to integrate ceph radosgw as the object store in openstack swift via keystone. Could someone please have a look at my configs and help find the issue? Many thanks in advance. ceph version 14.2.11 nautilus (stable) [root@ciosmon06 ~]# cat /etc/ceph/ceph.conf [global] fsid
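
For comparison, the Keystone-related rgw options that Swift auth normally needs look roughly like this in ceph.conf; a sketch only, assuming Keystone v3 and placeholder credentials, URLs and section name:

    [client.rgw.ciosmon06]
    rgw_keystone_url = https://keystone.example.com:5000
    rgw_keystone_api_version = 3
    rgw_keystone_admin_user = swift
    rgw_keystone_admin_password = secret
    rgw_keystone_admin_domain = Default
    rgw_keystone_admin_project = service
    rgw_keystone_accepted_roles = admin, member, _member_
    rgw_keystone_implicit_tenants = true
    rgw_swift_account_in_url = true

On the OpenStack side, the object-store endpoint registered in Keystone then has to point at the radosgw Swift URL (typically .../swift/v1).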

[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Ing . Luis Felipe Domínguez Vega
On 2020-10-21 10:08, Mark Nelson wrote: On 10/21/20 7:54 AM, Ing. Luis Felipe Domínguez Vega wrote: On 2020-10-21 08:43, Mark Nelson wrote: Theoretically we shouldn't be spiking memory as much these days during recovery, but the code is complicated and it's tough to reproduce these kinds

[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs

2020-10-21 Thread Frank Schilder
There have been threads on exactly this. Might depend a bit on your ceph version. We are running mimic and have no issues doing: - set noout, norebalance, nobackfill - add all OSDs (with weight 1) - wait for peering to complete - unset all flags and let the rebalance loose Starting with nautilus
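
A minimal shell sketch of that sequence (the flags are the real ones; how the new OSDs get created is left out on purpose):

    # freeze data movement while the new OSDs are created and peer
    ceph osd set noout
    ceph osd set norebalance
    ceph osd set nobackfill
    # ... create/start the new OSDs here (ceph-volume, orchestrator, etc.) ...
    # once 'ceph -s' shows peering has settled, let the rebalance loose
    ceph osd unset nobackfill
    ceph osd unset norebalance
    ceph osd unset noout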

[ceph-users] Re: multiple OSD crash, unfound objects

2020-10-21 Thread Frank Schilder
Hi Michael, some quick thoughts. That you can create a pool with 1 PG is a good sign; the crush rule is OK. The fact that pg query says it doesn't have PG 1.0 points in the right direction: there is an inconsistency in the cluster. This is also indicated by the fact that no upmaps seem to exist (the c
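
Two quick checks that follow from the above, as a hedged sketch (PG 1.0 is the id from the thread; the rest is illustrative):

    # ask the cluster where PG 1.0 should live, then query it there
    ceph pg map 1.0
    ceph pg 1.0 query
    # count whether any pg_upmap / pg_upmap_items entries exist at all
    ceph osd dump | grep -c pg_upmap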

[ceph-users] Rados Crashing

2020-10-21 Thread Brent Kennedy
We are performing file maintenance (deletes, essentially) and when the process gets to a certain point, all four rados gateways crash with the following: Log output: -5> 2020-10-20 06:09:53.996 7f15f1543700 2 req 7 0.000s s3:delete_obj verifying op params -4> 2020-10-20 06:09:53.996 7

[ceph-users] Difference between node exporter and ceph exporter data

2020-10-21 Thread Seena Fallah
Hi all, There is a huge difference between node exporter and ceph exporter (prometheus mgr module) data. For example, I see 120 MB/s of writes on my disk from node exporter, but the ceph exporter says it is 22 MB/s! The same goes for latency, IOPS and so on. Which one is reliable? Thanks.
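
The gap usually comes down to the two exporters measuring different layers: the mgr module reports client I/O as seen by the OSDs, while node exporter reports device I/O, which also includes replication traffic, BlueStore WAL/DB writes and compaction. A rough way to see both numbers side by side on one OSD host, as a sketch (device and OSD id are placeholders):

    # device-level write throughput, i.e. what node exporter scrapes
    iostat -xm 5 /dev/sdb
    # client write bytes as counted by the OSD itself (a raw counter, so sample it twice)
    ceph daemon osd.3 perf dump | grep op_w_in_bytes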

[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Mark Nelson
Theoretically we shouldn't be spiking memory as much these days during recovery, but the code is complicated and it's tough to reproduce these kinds of issues in-house.  If you happen to catch it in the act, do you see the pglog mempool stats also spiking up? Mark On 10/21/20 2:34 AM, Dan v
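
For anyone wanting to answer Mark's question on a live OSD, the mempool stats are exposed through the admin socket; a minimal sketch (osd.7 is a placeholder):

    # per-mempool items and bytes; watch whether osd_pglog keeps growing during recovery
    ceph daemon osd.7 dump_mempools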

[ceph-users] Re: pool pgp_num not updated

2020-10-21 Thread Toby Darling
Hi Mac, We've also tweaked osd-recovery-max-single-start => 2 and osd-recovery-sleep-hdd => 0.05 to speed things up. On 2020-10-20 16:04, Mac Wynkoop wrote: OK, so for interventions, I've pushed these configs out: ceph config set mon.* target_max_misplaced_ratio 0.05 > 0.20 ceph config get o
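
Expressed as commands, those two tweaks would look roughly like this (the values are the ones quoted above, not general recommendations):

    ceph config set osd osd_recovery_max_single_start 2
    ceph config set osd osd_recovery_sleep_hdd 0.05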

[ceph-users] Fwd: [lca-announce] linux.conf.au 2021 - Call for Sessions and Miniconfs Open

2020-10-21 Thread Tim Serong
The best F/OSS conference in the southern hemisphere is back again, virtualized, January 23-25. The CFP is open until November 6. Submit early, submit often! ;-) Forwarded Message Subject: [lca-announce] linux.conf.au 2021 - Call for Sessions and Miniconfs Open Date: Thu, 15 O

[ceph-users] How to see dprintk output

2020-10-21 Thread 展荣臻(信泰)
Hi, There are many dprintk calls in crush/mapper.c and crush/builder.c, and I want to debug the CRUSH algorithm. How do I see the output of dprintk?

[ceph-users] Re: Question about expansion existing Ceph cluster - adding OSDs

2020-10-21 Thread Ansgar Jazdzewski
Hi, You can make use of upmap so you do not need to rebalance the entire crush map every time you change the weight. https://docs.ceph.com/en/latest/rados/operations/upmap/ Hope it helps, Ansgar Kristof Coucke wrote on Wed, 21 Oct 2020, 13:29: > Hi, > > I have a cluster with 182 OS
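
Enabling the upmap balancer, as suggested, is a short sequence; a hedged sketch, assuming every client is Luminous or newer:

    # upmap needs luminous+ clients; this refuses to apply if older clients are connected
    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status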

[ceph-users] Question about expansion existing Ceph cluster - adding OSDs

2020-10-21 Thread Kristof Coucke
Hi, I have a cluster with 182 OSDs, which has been expanded to 282 OSDs. Some disks were near full. The new disks have been added with an initial weight of 0. The original plan was to increase this slowly to their full weight using the gentle reweight script. However, this is going way too sl
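
For reference, a gentle reweight round is typically just a small absolute crush-weight step per new OSD, repeated once the cluster is back to HEALTH_OK; a sketch, assuming the new OSDs sit under one new host bucket (host name and step value are placeholders):

    # set each new OSD's crush weight to the next step (absolute value, not an increment)
    for id in $(ceph osd ls-tree newhost01); do
        ceph osd crush reweight osd.$id 0.5
    done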

[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Stefan Kooman
On 2020-10-20 23:57, Ing. Luis Felipe Domínguez Vega wrote: > Hi, today my infra provider had a blackout, and Ceph then tried to > recover but is in an inconsistent state because many OSDs can't recover > themselves: the kernel kills them by OOM. Even now an OSD that was OK > has gone down, OOM-killed.

[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Dan van der Ster
Hi, This might be the pglog issue which has been coming up a few times on the list. If the OSD cannot boot without going OOM, you might have success by trimming the pglog, e.g. search this list for "ceph-objectstore-tool --op trim-pg-log" for some recipes. The thread "OSDs taking too much memory,
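
For completeness, the tool Dan mentions runs against a stopped OSD's data directory; a hedged sketch (the OSD id, path and pgid are placeholders, and the OSD must not be running while you do this):

    # stop the OSD, then trim the pg log of one PG on its store
    systemctl stop ceph-osd@11
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
        --op trim-pg-log --pgid 2.7f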

[ceph-users] Re: Huge RAM Usage on OSD recovery

2020-10-21 Thread Ing . Luis Felipe Domínguez Vega
On 2020-10-20 23:17, Anthony D'Atri wrote: On Oct 20, 2020, at 6:23 PM, Ing. Luis Felipe Domínguez Vega wrote: On 2020-10-20 19:33, Anthony D'Atri wrote: You have a *lot* of peering and recovery going on. Write a script that monitors available memory on the system and restarts the OSD p
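
A bare-bones version of the watchdog Anthony describes might look like the sketch below; the threshold, interval and systemd unit pattern are assumptions, and restarting OSDs like this is only ever a stop-gap while the underlying memory problem is addressed:

    #!/bin/bash
    # Rough OOM guard: if MemAvailable drops below a threshold, restart the
    # fattest ceph-osd on this host before the kernel OOM killer picks a victim.
    THRESHOLD_KB=2000000   # ~2 GiB, purely illustrative
    while sleep 30; do
        avail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
        if [ "$avail" -lt "$THRESHOLD_KB" ]; then
            # pick the ceph-osd with the largest resident set size
            pid=$(ps -C ceph-osd -o pid=,rss= --sort=rss | tail -1 | awk '{print $1}')
            id=$(tr '\0' ' ' < /proc/$pid/cmdline | sed -n 's/.*--id \([0-9]\+\).*/\1/p')
            [ -n "$id" ] && systemctl restart ceph-osd@"$id"
        fi
    done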