Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer
> On 10 Dec 2015, at 15:14, Sage Weil wrote: > > On Thu, 10 Dec 2015, Jan Schermer wrote: >> Removing snapshot means looking for every *potential* object the snapshot can have, and this takes a very long time (6TB snapshot will consist of 1.5M objects (i

Re: [ceph-users] Client io blocked when removing snapshot

2015-12-10 Thread Jan Schermer
Removing snapshot means looking for every *potential* object the snapshot can have, and this takes a very long time (6TB snapshot will consist of 1.5M objects (in one replica) assuming the default 4MB object size). The same applies to large thin volumes (don't try creating and then dropping a 1
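
For scale, the arithmetic behind that 1.5M figure, plus how to check an image's object size and count - a quick shell sketch (pool/image names are placeholders):

    # 6 TiB volume at the default 4 MiB object size, per replica:
    echo $(( 6 * 1024 * 1024 / 4 ))    # = 1572864, i.e. ~1.5M potential objects
    rbd info mypool/myimage            # prints the image's object size (order) and object count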

Re: [ceph-users] Blocked requests after "osd in"

2015-12-10 Thread Jan Schermer
Just try to give the booting OSD and all MONs the resources they ask for (CPU, memory). Yes, it causes disruption but only for a select group of clients, and only for a moment (<20s with my extremely high number of PGs). From a service provider perspective this might break SLAs, but until you ge
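
To keep recovery from competing with client IO while the OSD comes back in, something along these lines should help - a sketch, option names as of hammer-era releases:

    # throttle backfill/recovery so the freshly 'in' OSD doesn't starve client IO
    ceph tell 'osd.*' injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    # and make sure the MONs actually get the CPU they need, e.g. pin them to dedicated cores
    taskset -pc 0-3 $(pidof ceph-mon)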

Re: [ceph-users] How long will the logs be kept?

2015-12-03 Thread Jan Schermer
You can set up logrotate however you want - not sure what the default is for your distro. Usually logrotate doesn't touch files that are smaller than some size even if they are old. It will also not delete logs for OSDs that no longer exist. Ceph itself has nothing to do with log rotation, logro
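
A minimal /etc/logrotate.d/ceph along these lines (paths and rotation counts are illustrative, not the packaged defaults; Ceph daemons reopen their logs on SIGHUP):

    /var/log/ceph/*.log {
        rotate 7
        daily
        compress
        missingok
        notifempty
        sharedscripts
        postrotate
            killall -q -1 ceph-mon ceph-osd ceph-mds || true
        endscript
    }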

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Have you tried running iperf between the nodes? Capturing a pcap of the (failing) Ceph comms from both sides could help narrow it down. Is there any SDN layer involved that could add overhead/padding to the frames? What about some intermediate MTU like 8000 - does that work? Oh and if there's any
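
A quick way to run both checks (hostnames, interface and port range are placeholders; 6800-7300 is the default OSD messenger range):

    iperf -s                        # on node A: raw TCP throughput server
    iperf -c nodeA -P 4 -t 30       # on node B: 4 parallel streams for 30 seconds
    tcpdump -i eth0 -w ceph.pcap 'portrange 6800-7300'   # capture Ceph traffic on each side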

Re: [ceph-users] Potential OSD deadlock?

2015-10-09 Thread Jan Schermer
Are there any errors on the NICs? (ethtool -S ethX) Also take a look at the switch and look for flow control statistics - do you have flow control enabled or disabled? We had to disable flow control as it would pause all IO on the port whenever any path got congested which you don't want to happe
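
The corresponding commands (interface name is a placeholder):

    ethtool -S eth0 | grep -iE 'err|drop|pause'   # per-NIC error/drop/pause counters
    ethtool -a eth0                               # show current flow-control settings
    ethtool -A eth0 rx off tx off                 # disable flow control, as described above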

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-09 Thread Jan Schermer
> On 09/09/2015 10:54, "ceph-devel-ow...@vger.kernel.org on behalf of Jan Schermer" wrote: >> I looked at THP before. It comes enabled on RHEL6 and on our KVM hosts it merges a lot (~300GB hugepages on a 400GB KVM footprint). >> I am probably going to

Re: [ceph-users] jemalloc and transparent hugepage

2015-09-09 Thread Jan Schermer
I looked at THP before. It comes enabled on RHEL6 and on our KVM hosts it merges a lot (~300GB hugepages on a 400GB KVM footprint). I am probably going to disable it and see if it introduces any problems for me - the most important gain here is better processor memory lookup table (cache) utiliz
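
Checking and disabling THP at runtime looks like this (on RHEL6 the sysfs path is /sys/kernel/mm/redhat_transparent_hugepage instead):

    cat /sys/kernel/mm/transparent_hugepage/enabled    # e.g. [always] madvise never
    grep AnonHugePages /proc/meminfo                   # how much has been merged so far
    echo never > /sys/kernel/mm/transparent_hugepage/enabled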

Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-11 Thread Jan Schermer
Could someone clarify what the impact of this bug is? We did increase pg_num/pgp_num and we are on dumpling (0.67.12 unofficial snapshot). Most of our clients are likely restarted already, but not all. Should we be worried? Thanks Jan > On 11 Aug 2015, at 17:31, Dan van der Ster wrote: > > On

Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-05 Thread Jan Schermer
Hi, comments inline. > On 05 Aug 2015, at 05:45, Jevon Qiao wrote: > > Hi Jan, > > Thank you for the detailed suggestion. Please see my reply in-line. > On 5/8/15 01:23, Jan Schermer wrote: >> I think I wrote about my experience with this about 3 months ago, includi

Re: [ceph-users] Is it safe to increase pg number in a production environment

2015-08-04 Thread Jan Schermer
I think I wrote about my experience with this about 3 months ago, including what techniques I used to minimize impact on production. Basically we had to 1) increase pg_num in small increments only, because creating the placement groups themselves caused slow requests on OSDs 2) increase pgp_num in smal
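
The incremental procedure as a sketch (pool name, step size and target are placeholders; wait for PG creation to settle between steps):

    for pg in 4096 5120 6144 7168 8192; do
        ceph osd pool set rbd pg_num $pg
        while ceph health | grep -q creating; do sleep 10; done
    done
    ceph osd pool set rbd pgp_num 8192    # only after all PGs are created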

Re: [ceph-users] dropping old distros: el6, precise 12.04, debian wheezy?

2015-07-30 Thread Jan Schermer
Not at all. We have this: http://ceph.com/docs/master/releases/ I would expect that whatever distribution I install Ceph LTS release on will be supported for the time specified. That means if I install Hammer on CentOS 6 now it will stay supported until 3Q/2016. Of course if in the meantime the d

Re: [ceph-users] dropping old distros: el6, precise 12.04, debian wheezy?

2015-07-30 Thread Jan Schermer
I understand your reasons, but dropping support for an LTS release like this is not right. You should, as a matter of good practice, support every distribution the LTS release could have ever been installed on - that’s what the LTS label is for and what we rely on once we build a project on top of it. CentOS 6 in partic

Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Jan Schermer
…go up you can’t go down. Jan > On 01 Jun 2015, at 10:57, huang jun wrote: > > hi,jan > > 2015-06-01 15:43 GMT+08:00 Jan Schermer : >> We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. >> Wi

Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Jan Schermer
We had to disable deep scrub or the cluster would be unusable - we need to turn it back on sooner or later, though. With minimal scrubbing and recovery settings, everything is mostly good. Turned out many issues we had were due to too few PGs - once we increased them from 4K to 16K everything sp
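
The relevant knobs, for reference (values are illustrative minimal settings; osd_scrub_sleep is only available on newer releases):

    ceph osd set nodeep-scrub      # stop deep scrubs cluster-wide
    ceph osd unset nodeep-scrub    # ...and re-enable them later
    ceph tell 'osd.*' injectargs '--osd-max-scrubs 1 --osd-scrub-sleep 0.1'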