Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Nick Fisk
> -----Original Message----- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jogi > Hofmüller > Sent: 20 April 2017 13:51 > To: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] slow requests and short OSD failures in small > cluster > >

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread Jogi Hofmüller
Hi, On Tuesday, 18.04.2017, 18:34 +, Peter Maloney wrote: > The 'slower with every snapshot even after CoW totally flattens it' > issue I just find easy to test; I didn't test it on hammer or > earlier, and others confirmed it but didn't keep track of the > versions. Just make an
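
A minimal way to reproduce that kind of test (the pool and image names and the size are assumptions; the thread does not give the exact procedure): benchmark a throwaway image, take a snapshot, benchmark again.

  # baseline write benchmark on a scratch image (10 GB, names are placeholders)
  rbd create rbd/snaptest --size 10240
  rbd bench-write rbd/snaptest
  # take a snapshot and rerun; with the reported behaviour the numbers drop
  rbd snap create rbd/snaptest@s1
  rbd bench-write rbd/snaptest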

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-20 Thread mj
Hi Gregory, I read your reply with great interest, thanks. Can you confirm my understanding now: - live snapshots are more expensive for the cluster as a whole than taking the snapshot when the VM is switched off? - using fstrim in VMs is (much?) more expensive when the VM has existing
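
For reference, the guest-side operation being discussed is just a plain trim; any extra cost would come from the resulting discards hitting RBD objects that are still referenced by existing snapshots:

  # inside the VM: discard unused blocks on the root filesystem
  fstrim -v /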

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-19 Thread Gregory Farnum
On Tue, Apr 18, 2017 at 11:34 AM, Peter Maloney wrote: > On 04/18/17 11:44, Jogi Hofmüller wrote: > > Hi, > > On Tuesday, 18.04.2017, 13:02 +0200, mj wrote: > > On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: > > This might have been true for hammer and

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Peter Maloney
On 04/18/17 11:44, Jogi Hofmüller wrote: > Hi, > > On Tuesday, 18.04.2017, 13:02 +0200, mj wrote: >> On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: >>> This might have been true for hammer and older versions of ceph. >>> From >>> what I can tell now, every snapshot taken reduces performance

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi, On Tuesday, 18.04.2017, 13:02 +0200, mj wrote: > > On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: > > This might have been true for hammer and older versions of ceph. > > From > > what I can tell now, every snapshot taken reduces performance of > > the > > entire cluster :( > > Really?

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Lionel Bouton
On 18/04/2017 at 11:24, Jogi Hofmüller wrote: > Hi, > > thanks for all your comments so far. > > On Thursday, 13.04.2017, 16:53 +0200, Lionel Bouton wrote: >> Hi, >> >> On 13/04/2017 at 10:51, Peter Maloney wrote: >>> Ceph snapshots really slow things down. > I can confirm that now :(

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread mj
On 04/18/2017 11:24 AM, Jogi Hofmüller wrote: This might have been true for hammer and older versions of ceph. From what I can tell now, every snapshot taken reduces performance of the entire cluster :( Really? Can others confirm this? Is this a 'well-known fact'? (unknown only to us,

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Jogi Hofmüller
Hi, thanks for all your comments so far. On Thursday, 13.04.2017, 16:53 +0200, Lionel Bouton wrote: > Hi, > > On 13/04/2017 at 10:51, Peter Maloney wrote: > > Ceph snapshots really slow things down. I can confirm that now :( > We use rbd snapshots on Firefly (and Hammer now) and I

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-14 Thread mj
ah right: _during_ the actual removal, you mean. :-) clear now. mj On 04/13/2017 05:50 PM, Lionel Bouton wrote: On 13/04/2017 at 17:47, mj wrote: Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
On 13/04/2017 at 17:47, mj wrote: > Hi, > > On 04/13/2017 04:53 PM, Lionel Bouton wrote: >> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any >> measurable impact on performance... until we tried to remove them. > > What exactly do you mean by that? Just what I said :

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread mj
Hi, On 04/13/2017 04:53 PM, Lionel Bouton wrote: We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. What exactly do you mean by that? MJ

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread David Turner
I wouldn't set the default for osd_heartbeat_grace to 5 minutes, but inject it when you see this happening. It's good to know what your cluster is up to. The fact that you aren't seeing the blocked requests any more tells me that this was your issue. It will go through, split everything, go a
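
A sketch of what injecting the grace period at runtime (rather than raising the default in ceph.conf) can look like; the 300-second value is illustrative, not taken from the thread:

  # temporarily raise the heartbeat grace on all OSDs while splitting is in progress
  ceph tell osd.* injectargs '--osd_heartbeat_grace 300'
  # revert to the stock default (20 seconds) once the cluster has settled
  ceph tell osd.* injectargs '--osd_heartbeat_grace 20'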

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
Hi, On 13/04/2017 at 10:51, Peter Maloney wrote: > [...] > Also more things to consider... > > Ceph snapshots really slow things down. We use rbd snapshots on Firefly (and Hammer now) and I didn't see any measurable impact on performance... until we tried to remove them. We usually have at
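
If it is the removal (snapshot trimming) that hurts, one throttle that existed in releases of that era is osd_snap_trim_sleep; a hedged sketch, with the value purely illustrative:

  # make the snap trimmer pause between objects so client I/O gets a chance
  ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'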

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Peter Maloney
On 04/13/17 10:34, Jogi Hofmüller wrote: > Dear David, > > On Wednesday, 12.04.2017, 13:46 +, David Turner wrote: >> I can almost guarantee what you're seeing is PG subfolder splitting. > Every day there's something new to learn about ceph ;) > >> When the subfolders in a PG get X number

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Jogi Hofmüller
Dear David, On Wednesday, 12.04.2017, 13:46 +, David Turner wrote: > I can almost guarantee what you're seeing is PG subfolder splitting. Every day there's something new to learn about ceph ;) > When the subfolders in a PG get X number of objects, it splits into > 16 subfolders. Every

Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread David Turner
I can almost guarantee what you're seeing is PG subfolder splitting. When the subfolders in a PG get X number of objects, it splits into 16 subfolders. Every cluster I manage has blocked requests and OSDs that get marked down while this is happening. To stop the OSDs getting marked down, I
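
For context, the filestore split point alluded to here ("X number of objects") is derived from two settings; a sketch of how to inspect them on one OSD (osd.0 is a placeholder), using the commonly cited formula split_multiple * abs(merge_threshold) * 16:

  # read the current thresholds from a running OSD (run on the node hosting osd.0)
  ceph daemon osd.0 config get filestore_merge_threshold
  ceph daemon osd.0 config get filestore_split_multiple
  # with the old defaults (10 and 2) a subfolder splits at 2 * 10 * 16 = 320 objects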

[ceph-users] slow requests and short OSD failures in small cluster

2017-04-12 Thread Jogi Hofmüller
Dear all, we run a small cluster [1] that is exclusively used for virtualisation (kvm/libvirt). Recently we started to run into performance problems (slow requests, failing OSDs) for no *obvious* reason (at least not for us). We do nightly snapshots of VM images and keep the snapshots for 14
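
For reference, a nightly rotation like the one described usually boils down to commands of this shape (pool and image names are placeholders, and the 14-day selection logic is not shown):

  # take tonight's snapshot of one VM image
  rbd snap create rbd/vm-100-disk-1@nightly-$(date +%Y%m%d)
  # list existing snapshots and remove those older than the retention window
  rbd snap ls rbd/vm-100-disk-1
  rbd snap rm rbd/vm-100-disk-1@nightly-20170329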