[ceph-users] Reducing RAM usage on production MDS

2020-05-27 Thread Dylan McCulloch
Hi all, The single active MDS on one of our Ceph clusters is close to running out of RAM. MDS total system RAM = 528GB, MDS current free system RAM = 4GB, mds_cache_memory_limit = 451GB, current mds cache usage = 426GB. Presumably we need to reduce our mds_cache_memory_limit and/or
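
For reference, a minimal sketch of what lowering the cache target at runtime could look like (the MDS name and the 256 GiB value below are placeholders for illustration, not recommendations):

    # what the daemon itself reports about its cache
    ceph daemon mds.<name> cache status

    # lower the cache target without a restart (Mimic or later;
    # older releases can use "ceph tell mds.* injectargs" instead)
    ceph config set mds mds_cache_memory_limit 274877906944   # 256 GiB

Worth remembering that mds_cache_memory_limit is a target for the cache, not a hard cap on the daemon's RSS, so leaving headroom below physical RAM is prudent.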

[ceph-users] Re: Nautilus to Octopus Upgrade mds without downtime

2020-05-27 Thread Konstantin Shalygin
On 5/27/20 8:43 PM, Andreas Schiefer wrote: if I understand correctly: if we upgrade from a running Nautilus cluster to Octopus we have downtime when updating the MDS. Is this correct? This is always the case when upgrading a major or minor version of the MDS. It only hangs for the restart; actually clients
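
For reference, a rough sketch of the MDS part of the documented upgrade sequence (assuming a single filesystem named cephfs and that max_mds was 2 beforehand; both are assumptions):

    # shrink to a single active MDS before upgrading
    ceph fs set cephfs max_mds 1
    ceph status        # wait until only rank 0 remains active

    # stop the standby MDS daemons, upgrade and restart the remaining active MDS,
    # then bring the standbys back and restore the previous value
    ceph fs set cephfs max_mds 2

Clients only see a short pause while the restarted (or a standby) MDS takes over rank 0, rather than a full outage.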

[ceph-users] Fwd: [IO-500] IO500 ISC20 Call for Submission

2020-05-27 Thread John Bent
FYI. Hope to see some awesome CephFS submissions for our virtual IO500 BoF! Thanks, John -- Forwarded message - From: committee--- via IO-500 Date: Fri, May 22, 2020 at 1:53 PM Subject: [IO-500] IO500 ISC20 Call for Submission To: *Deadline*: 08 June 2020 AoE The IO500

[ceph-users] Re: Cephadm Hangs During OSD Apply

2020-05-27 Thread m
I noticed the LUKS volumes were open, even though luksOpen hung. I killed cryptsetup (once per disk) and ceph-volume continued and eventually created the OSDs for the host (yes, this node will be slated for another reinstall when cephadm is stabilized). Is there a way to remove an osd service
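
On the service-spec question, a sketch only: cephadm behaviour around OSD specs changed between Octopus point releases, and <service_name> below is a placeholder for whatever spec name "ceph orch ls" reports.

    # list the specs cephadm is currently applying
    ceph orch ls

    # remove a spec by its service name; this stops cephadm from (re)applying it,
    # it does not by itself purge the OSDs that were already created
    ceph orch rm <service_name>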

[ceph-users] Nautilus to Octopus Upgrade mds without downtime

2020-05-27 Thread Andreas Schiefer
Hello, if I understand correctly: if we upgrade from a running Nautilus cluster to Octopus we have downtime when updating the MDS. Is this correct? Kind regards Andreas Schiefer Head of System Administration --- HOME OF LOYALTY CRM-

[ceph-users] Cephadm Hangs During OSD Apply

2020-05-27 Thread m
Hi, trying to migrate a second Ceph cluster to Cephadm. All the hosts successfully migrated from "legacy" except one of the OSD hosts (cephadm kept duplicating OSD ids, e.g. two "osd.5"; still not sure why). To make things easier, we re-provisioned the node (reinstalled from netinstall, applied
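
When adoption goes sideways like this, a quick diagnostic sketch for seeing where cephadm and the cluster map disagree:

    # daemons cephadm thinks it manages, per host
    ceph orch ps

    # the authoritative OSD list and CRUSH placement
    ceph osd tree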

[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Alex Gorbachev
On Wed, May 27, 2020 at 5:28 AM Daniel Aberger - Profihost AG < d.aber...@profihost.ag> wrote: > Hi, > > (un)fortunately I can't test it because I managed to repair the pg. > > snaptrim and snaptrim_wait have been a part of this particular pg's > status. As I was trying to look deeper into the

[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Dan van der Ster
Hi, I'm not sure if the repair waits for snaptrim; but it does need a scrub reservation on all the related OSDs, hence our script. And I've also observed that the repair req isn't queued up -- if the OSDs are busy with other scrubs, the repair req is forgotten. -- Dan On Wed, May 27, 2020 at
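
Not the script Dan refers to, just a rough sketch of the same idea of keeping scrub slots free so the repair request isn't dropped (<pgid> is a placeholder; whether an operator-requested repair bypasses the scrub flags varies by release):

    # stop new scrubs from grabbing the reservations
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # re-issue the repair and watch for "repair" in the PG state
    ceph pg repair <pgid>
    ceph pg <pgid> query | grep state

    # once the repair has actually run
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub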

[ceph-users] Re: Luminous, OSDs down: "osd init failed" and "failed to load OSD map for epoch ... got 0 bytes"

2020-05-27 Thread Fulvio Galeazzi
Hello Dan, all. My attempt with ceph-bluestore-tool did not lead to a working OSD, so I decided to re-create all OSDs, as there were quite a few of them and my cluster was rather unbalanced. Too bad I could not get any insight as to what caused the issue on the OSDs for object storage; however, I will
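
The exact invocation that was attempted isn't shown in the thread; for context, the general shape of an offline check/repair with ceph-bluestore-tool (run with the OSD stopped, <id> is a placeholder) is roughly:

    systemctl stop ceph-osd@<id>
    ceph-bluestore-tool fsck   --path /var/lib/ceph/osd/ceph-<id>
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>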

[ceph-users] Re: High latency spikes under jewel

2020-05-27 Thread Paul Emmerich
Common problem for FileStore and really no point in debugging this: upgrade everything to a recent version and migrate to BlueStore. 99% of random latency spikes are just fixed by doing that. Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit
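
For context, once the cluster is on a release that ships ceph-volume, the per-OSD FileStore-to-BlueStore replacement roughly follows the documented "mark out and replace" pattern below (device name and OSD id are placeholders; this is a sketch, not a full procedure):

    ID=<osd-id>
    ceph osd out $ID
    while ! ceph osd safe-to-destroy osd.$ID ; do sleep 60 ; done
    systemctl stop ceph-osd@$ID
    ceph osd destroy $ID --yes-i-really-mean-it
    ceph-volume lvm zap /dev/<device>
    ceph-volume lvm create --bluestore --data /dev/<device> --osd-id $ID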

[ceph-users] Re: 15.2.2 bluestore issue

2020-05-27 Thread Paul Emmerich
Hi, since this bug may lead to data loss when several OSDs crash at the same time (e.g., after a power outage): can we pull the release from the mirrors and docker hub? Paul -- Paul Emmerich Looking for help with your Ceph cluster? Contact us at https://croit.io croit GmbH Freseniusstr. 31h

[ceph-users] High latency spikes under jewel

2020-05-27 Thread Bence Szabo
Hi, We experienced random and relatively high latency spikes (around 0.5-10 sec) in our Ceph cluster, which consists of 6 OSD nodes; each node has 6 OSDs. One OSD is built with one spinning disk and two NVMe devices. We use a bcache device for the OSD back end (mixed with HDD and an NVMe partition as
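
A couple of low-effort checks that usually help narrow down where spikes like this come from (purely a diagnostic sketch; <id> is a placeholder):

    # commit/apply latency per OSD, as tracked by the cluster
    ceph osd perf

    # on a suspect OSD, the slowest recent operations and where they spent their time
    ceph daemon osd.<id> dump_historic_ops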

[ceph-users] Re: osds dropping out of the cluster w/ "OSD::osd_op_tp thread … had timed out"

2020-05-27 Thread thoralf schulze
hi there - On 5/19/20 3:11 PM, thoralf schulze wrote: > […] and report back … i tried to reproduce the issue with osds each using 37gb of ssd storage for db and wal. everything went fine - so yes, spillovers are to be avoided. thank you very much & with kind regards, thoralf.
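
For anyone checking their own clusters: spillover is visible without reproducing anything, e.g. (assuming a Nautilus-era cluster; <id> is a placeholder):

    # spillover surfaces as a BLUEFS_SPILLOVER health warning
    ceph health detail

    # per-OSD view: slow_used_bytes > 0 means RocksDB has spilled onto the slow device
    ceph daemon osd.<id> perf dump bluefs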

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread EDH - Manuel Rios
Can anyone share their table with other MTU values? Also interested in switch CPU load. KR, Manuel -Original Message- From: Marc Roos Sent: Wednesday, 27 May 2020 12:01 To: chris.palmer ; paul.emmerich CC: amudhan83 ; anthony.datri ; ceph-users ; doustar ; kdhall ;

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Marc Roos
Interesting table. I have this on a production cluster, 10gbit at a datacenter (obviously not doing that much).

[@]# iperf3 -c 10.0.0.13 -P 1 -M 9000
Connecting to host 10.0.0.13, port 5201
[ 4] local 10.0.0.14 port 52788 connected to 10.0.0.13 port 5201
[ ID] Interval Transfer

[ceph-users] Re: Cannot repair inconsistent PG

2020-05-27 Thread Daniel Aberger - Profihost AG
Hi, (un)fortunately I can't test it because I managed to repair the PG. snaptrim and snaptrim_wait had been part of this particular PG's status. As I was trying to look deeper into the case I had a watch on ceph health detail and noticed that snaptrim/snaptrim_wait was suddenly no longer part of

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Chris Palmer
To elaborate on some aspects that have been mentioned already and add some others:
* Test using iperf3.
* Don't try to use jumbos on networks where you don't have complete control over every host. This usually includes the main ceph network. It's just too much grief. You can consider
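
A quick way to sanity-check jumbo frames end-to-end before trusting them, assuming Linux ping and an iperf3 server already listening on the far side (<peer-ip> is a placeholder):

    # 8972 = 9000-byte MTU minus 20 (IP header) and 8 (ICMP header); -M do forbids fragmentation
    ping -M do -s 8972 -c 4 <peer-ip>

    # then compare throughput at MTU 1500 vs 9000
    iperf3 -c <peer-ip> -P 4 -t 30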

[ceph-users] Re: [External Email] Re: Ceph Nautius not working after setting MTU 9000

2020-05-27 Thread Marc Roos
I would not call a Ceph page a random tuning tip. At least I hope they are not. NVMe-only with 100Gbit is not really a standard setup. I assume with such a setup you have the luxury of not noticing many optimizations. What I mostly read is that changing to MTU 9000 will allow you to better

[ceph-users] Re: looking for telegram group in English or Chinese

2020-05-27 Thread Zhenshi Zhou
Awesome, thanks! Martin Verges wrote on Wed, May 27, 2020 at 2:04 PM: > Hello, > > as I find it a good idea and couldn't find another, I just created > https://t.me/ceph_users. > Please feel free to join and let's see if we can get this channel started ;) > > -- > Martin Verges > Managing director > > Mobile: +49

[ceph-users] Re: looking for telegram group in English or Chinese

2020-05-27 Thread Martin Verges
Hello, as I find it a good idea and couldn't find another, I just created https://t.me/ceph_users. Please feel free to join and let's see if we can get this channel started ;) -- Martin Verges Managing director Mobile: +49 174 9335695 E-Mail: martin.ver...@croit.io Chat: https://t.me/MartinVerges