Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

2016-03-19 Thread Waiman Long

On 03/07/2016 04:33 PM, Dave Chinner wrote:

On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:

On 03/05/2016 01:34 AM, Dave Chinner wrote:

On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:

This patchset allows the degeneration of per-cpu counters back
to global counters when:

  1) The number of CPUs in the system is large, hence a high cost for
     calling percpu_counter_sum().
  2) The initial count value is small so that it has a high chance of
     excessive percpu_counter_sum() calls.

When the above 2 conditions are true, this patchset allows the
user of per-cpu counters to selectively degenerate them into
global counters with lock. This is done by calling the new
percpu_counter_set_limit() API after percpu_counter_set().
Without this call, there is no change in the behavior of the
per-cpu counters.

Patch 1 implements the new percpu_counter_set_limit() API.

Patch 2 modifies XFS to call the new API for the m_ifree and
m_fdblocks per-cpu counters.

Waiman Long (2):
   percpu_counter: Allow falling back to global counter on large system
   xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

NACK.

This change turns off the per-cpu free block counters for 32p
machines in XFS.  We proved 10 years ago that a
global lock for these counters was a massive scalability
limitation for concurrent buffered writes on 16p machines.

IOWs, this change is going to cause fast path concurrent
sequential write regressions for just about everyone, even on
empty filesystems.

That is not really the case here. The patch won't change anything
if there are enough free blocks available in the filesystem.  It
will turn on the global lock at mount time iff the number of free
blocks available is less than the given limit. In the case of XFS,
it is 12MB per CPU. On the 80-thread system that I used for
testing, it will be a bit less than 1GB. Even if the global lock is
enabled at the beginning, it will be transitioned back to percpu
mode as soon as enough free blocks become available.

Again: How is this an optimisation that is generally useful? Nobody
runs their production 80-thread workloads on filesystems with less
than 1GB of free space. This is a situation that most admins would
consider "impending doom".


In most cases, there will be enough free blocks in m_fdblocks that the
switch to the global count will never happen. However, I found that
m_ifree is a different story. On the 80-cpu system that I used, the
percpu slowpath will be activated when there are fewer than 2*80^2 =
12800 free inodes available, which is usually the case because the code
uses the default batch size (which scales linearly with # of cpus). Here,
my patch can really help.





I am aware that if there are enough threads pounding on the lock,
it can cause a scalability bottleneck. However, the qspinlock used
in x86 should greatly alleviate the scalability impact compared
with 10 years ago when we used the ticket lock.

Regardless of whether there is less contention, it still brings back
a global serialisation point and modified cacheline (the free block
counter) in the filesystem that, at some point, will limit
concurrency.


Yes, that is true, but the alternative here is to access all the
cachelines of the percpu counters and evict quite a number of other
useful cachelines along the way. My patch activates the global counter
at mount time only when the current count is too small. My test case
showed that accessing all those cachelines was worse than taking
the lock when there is a large number of cpus.


Once the counter increases past the limit, it will disable the global
counter and fall back to the usual per-cpu mode. The global counter
won't be reactivated unless you unmount and remount the filesystem.
So I don't think this case will cause any performance bottleneck that
is worse than what the existing code has.
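
The one-way switch described above can be sketched as a small userspace
model. This is only an illustration under stated assumptions: the struct
and the function names counter_set_limit() and counter_add() are
hypothetical stand-ins rather than the code from the patch, and per-cpu
batching and locking are left out so that only the mode transition is
visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a counter that can start in global mode. */
struct switchable_counter {
	int64_t count;        /* current value of the counter */
	int64_t limit;        /* threshold chosen at "mount time" */
	bool    global_mode;  /* true: updates go through the global count */
};

/*
 * Stand-in for the proposed percpu_counter_set_limit(): the mode is
 * decided once, from the count at the time of the call.
 */
static void counter_set_limit(struct switchable_counter *c, int64_t limit)
{
	c->limit = limit;
	c->global_mode = c->count < limit;
}

/*
 * Update the counter. Once the count rises past the limit, global
 * mode is switched off and is never switched back on in this model.
 */
static void counter_add(struct switchable_counter *c, int64_t amount)
{
	c->count += amount;
	if (c->global_mode && c->count > c->limit)
		c->global_mode = false;
}
```

In this model a counter that starts below the limit stays in global mode
until it first exceeds the limit. Dropping below the limit again later
does not bring global mode back, which matches the "unless you unmount
and remount" behaviour described above.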



BTW, what exactly
was the microbenchmark that you used to exercise concurrent
sequential write? I would like to try it out on the new hardware
and kernel.

Just something that HPC apps have been known to do for more than 20
years: concurrent sequential write from every CPU in the system.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf


Thanks.




The behaviour you are seeing only occurs when the filesystem is
near to ENOSPC. As I asked you last time - if you want to make
this problem go away, please increase the size of the filesystem
you are running your massively concurrent benchmarks on.

IOWs, please stop trying to optimise a filesystem slow path that:

a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
   computationally expensive as we close in on ENOSPC,
c) we start to execute blocking data flush operations that
   slow everything down massively, and
d) is indicative that the workload is about to suffer
   from a fatal, unrecoverable error (i.e. ENOSPC)


I totally agree. I am not trying to optimize a 

Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

2016-03-07 Thread Dave Chinner
On Mon, Mar 07, 2016 at 12:39:55PM -0500, Waiman Long wrote:
> On 03/05/2016 01:34 AM, Dave Chinner wrote:
> >On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> >>This patchset allows the degeneration of per-cpu counters back
> >>to global counters when:
> >>
> >>  1) The number of CPUs in the system is large, hence a high
> >>     cost for calling percpu_counter_sum().
> >>  2) The initial count value is small so that it has a high
> >>     chance of excessive percpu_counter_sum() calls.
> >>
> >>When the above 2 conditions are true, this patchset allows the
> >>user of per-cpu counters to selectively degenerate them into
> >>global counters with lock. This is done by calling the new
> >>percpu_counter_set_limit() API after percpu_counter_set().
> >>Without this call, there is no change in the behavior of the
> >>per-cpu counters.
> >>
> >>Patch 1 implements the new percpu_counter_set_limit() API.
> >>
> >>Patch 2 modifies XFS to call the new API for the m_ifree and
> >>m_fdblocks per-cpu counters.
> >>
> >>Waiman Long (2):
> >>  percpu_counter: Allow falling back to global counter on large system
> >>  xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
> >NACK.
> >
> >This change to turns off per-counter free block counters for 32p
> >for the XFS free block counters.  We proved 10 years ago that a
> >global lock for these counters was a massive scalability
> >limitation for concurrent buffered writes on 16p machines.
> >
> >IOWs, this change is going to cause fast path concurrent
> >sequential write regressions for just about everyone, even on
> >empty filesystems.
> 
> That is not really the case here. The patch won't change anything
> if there is enough free blocks available in the filesystem.  It
> will turn on global lock at mount time iff the number of free
> blocks available is less than the given limit. In the case of XFS,
> it is 12MB per CPU. On the 80-thread system that I used for
> testing, it will be a bit less than 1GB. Even if global lock is
> enabled at the beginning, it will be transitioned back to percpu
> lock as soon as enough free blocks become available.

Again: How is this an optimisation that is generally useful? Nobody
runs their production 80-thread workloads on filesystems with less
than 1GB of free space. This is a situation that most admins would
consider "impending doom".

> I am aware that if there are enough threads pounding on the lock,
> it can cause a scalability bottleneck. However, the qspinlock used
> in x86 should greatly alleviate the scalability impact compared
> with 10 years ago when we used the ticket lock.

Regardless of whether there is less contention, it still brings back
a global serialisation point and modified cacheline (the free block
counter) in the filesystem that, at some point, will limit
concurrency.

> BTW, what exactly
> was the microbenchmark that you used to exercise concurrent
> sequential write? I would like to try it out on the new hardware
> and kernel.

Just something that HPC apps have been known to do for more than 20
years: concurrent sequential write from every CPU in the system.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

> >near to ENOSPC. As i asked you last time - if you want to make
> >this problem go away, please increase the size of the filesystem
> >you are running your massively concurrent benchmarks on.
> >
> >IOWs, please stop trying to optimise a filesystem slow path that:
> >
> > a) 99.9% of production workloads never execute,
> > b) where we expect performance to degrade as allocation gets
> >    computationally expensive as we close in on ENOSPC,
> > c) we start to execute blocking data flush operations that
> >    slow everything down massively, and
> > d) is indicative that the workload is about to suffer
> >    from a fatal, unrecoverable error (i.e. ENOSPC)
> >
> 
> I totally agree. I am not trying to optimize a filesystem
> slowpath.

Where else in the kernel is there a requirement for 100%
accurate threshold detection on per-cpu counters? There isn't, is
there?

> There are use cases, however, where we may want to
> create relatively small filesystem. One example that I cited in
> patch 2 is the battery backed NVDIMM that I have played with
> recently. They can be used for log files or other small files.
> Each dimm is 8 GB. You can have a few of those available. So the
> filesystem size could be 32GB or so.  That can come close to the
> the limit where excessive percpu_counter_sum() call can happen.
> What I want to do here is to try to reduce the chance of excessive
> percpu_counter_sum() calls causing a performance problem. For a
> large filesystem that is nowhere near ENOSPC, my patch will have
> no performance impact whatsoever.

Yet your patch won't have any effect on these "small" filesystems
because unless they have less free space than your threshold at
mount time (rare!) they won't ever have this global lock turned on.
Not to mention if 

Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

2016-03-07 Thread Waiman Long

On 03/05/2016 01:34 AM, Dave Chinner wrote:

On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:

This patchset allows the degeneration of per-cpu counters back to
global counters when:

  1) The number of CPUs in the system is large, hence a high cost for
 calling percpu_counter_sum().
  2) The initial count value is small so that it has a high chance of
 excessive percpu_counter_sum() calls.

When the above 2 conditions are true, this patchset allows the user of
per-cpu counters to selectively degenerate them into global counters
with lock. This is done by calling the new percpu_counter_set_limit()
API after percpu_counter_set(). Without this call, there is no change
in the behavior of the per-cpu counters.

Patch 1 implements the new percpu_counter_set_limit() API.

Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
per-cpu counters.

Waiman Long (2):
   percpu_counter: Allow falling back to global counter on large system
   xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

NACK.

This change turns off the per-cpu free block counters for 32p
machines in XFS.  We proved 10 years ago that a global
lock for these counters was a massive scalability limitation for
concurrent buffered writes on 16p machines.

IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.


That is not really the case here. The patch won't change anything if
there are enough free blocks available in the filesystem. It will turn on
the global lock at mount time iff the number of free blocks available is
less than the given limit. In the case of XFS, it is 12MB per CPU. On
the 80-thread system that I used for testing, it will be a bit less than
1GB. Even if the global lock is enabled at the beginning, it will be
transitioned back to percpu mode as soon as enough free blocks become
available.
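
As a rough check of the figures above, the following sketch computes the
mount-time limit under the assumption that it is simply 12MB multiplied
by the number of CPUs. The helper names and the use of binary megabytes
are illustrative assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flat rule from the text: 12MB of free space per CPU. */
#define LIMIT_PER_CPU_BYTES (12ULL << 20)

/* Hypothetical helper: free-space limit for a given CPU count. */
static uint64_t global_mode_limit_bytes(unsigned int nr_cpus)
{
	return (uint64_t)nr_cpus * LIMIT_PER_CPU_BYTES;
}

/* Returns 1 if a filesystem with free_bytes free would start in global mode. */
static int starts_in_global_mode(uint64_t free_bytes, unsigned int nr_cpus)
{
	return free_bytes < global_mode_limit_bytes(nr_cpus);
}
```

For 80 CPUs this gives 80 * 12MB = 960MB, which is the "bit less than
1GB" quoted above. A filesystem with tens of gigabytes free would never
start in global mode under this rule.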


I am aware that if there are enough threads pounding on the lock, it can 
cause a scalability bottleneck. However, the qspinlock used in x86 
should greatly alleviate the scalability impact compared with 10 years 
ago when we used the ticket lock. BTW, what exactly was the 
microbenchmark that you used to exercise concurrent sequential write? I 
would like to try it out on the new hardware and kernel.


The AIM7 microbenchmark that I used was not able to generate more than
1% CPU time in spinlock contention for __percpu_counter_add() on my
80-thread test system. On the other hand, the overhead of doing
percpu_counter_sum() had consumed more than 18% of CPU time with the
same microbenchmark when the filesystem was small. If the number of
__percpu_counter_add() calls is large enough to cause significant
spinlock contention, I think the time wasted in percpu_counter_sum()
will be even more for a small filesystem. In the borderline case when the
filesystem is small enough to trigger the use of the global lock with my
patch, but not small enough to trigger excessive percpu_counter_sum()
calls, then my patch will have caused a degradation in performance.


So I don't think this patch will cause any problem with the free block
count. The other percpu count, m_ifree, however, is a problem in the
current code. It uses the default batch size, which on my 80-thread
system puts the slowpath threshold at 12800 (2*nr_cpus^2). However, the
number of free inodes in the various XFS filesystems was less than 2k. So
percpu_counter_sum() was called every time xfs_mod_ifree() was called.
This cost about 3% of CPU time with my microbenchmark, which was also
eliminated by my patch.
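
To illustrate where the 12800 figure comes from, here is a simplified
userspace model of a batched per-cpu counter comparison. It is a sketch
under assumptions: the default batch is taken to be max(32, 2 * nr_cpus)
and the rough count is trusted only when it is more than batch * nr_cpus
away from the comparison value, which is how the mainline
percpu_counter code is commonly described. The struct, the function
names, and the slow_sums statistic are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPUS 256

/* Hypothetical model: a global count plus per-cpu deltas. */
struct pcpu_counter_model {
	int64_t count;             /* approximate global count */
	int32_t pcpu[MAX_CPUS];    /* per-cpu deltas not yet folded in */
	unsigned int nr_cpus;
	unsigned long slow_sums;   /* number of precise sums performed */
};

/* Assumed default batch: max(32, 2 * nr_cpus). */
static int32_t default_batch(unsigned int nr_cpus)
{
	int32_t b = 2 * (int32_t)nr_cpus;
	return b > 32 ? b : 32;
}

/* Precise sum: walks every cpu's slot, which is the expensive part. */
static int64_t model_sum(struct pcpu_counter_model *c)
{
	int64_t sum = c->count;
	for (unsigned int i = 0; i < c->nr_cpus; i++)
		sum += c->pcpu[i];
	c->slow_sums++;
	return sum;
}

/* Compare the counter against rhs; returns -1, 0 or 1. */
static int model_compare(struct pcpu_counter_model *c, int64_t rhs)
{
	int64_t count = c->count;
	int64_t slack = (int64_t)default_batch(c->nr_cpus) * c->nr_cpus;

	/* Far enough from rhs: the rough count is good enough. */
	if (llabs(count - rhs) > slack)
		return count > rhs ? 1 : -1;

	/* Too close to call: fall back to the precise sum. */
	count = model_sum(c);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}
```

With 80 CPUs the batch is 160 and the slack is 160 * 80 = 12800, so a
counter holding about 2k free inodes that is compared against zero takes
the precise-sum path on every comparison, while a counter holding 20000
does not.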



The behaviour you are seeing only occurs when the filesystem is near
to ENOSPC. As i asked you last time - if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.

IOWs, please stop trying to optimise a filesystem slow path that:

a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
   computationally expensive as we close in on ENOSPC,
c) we start to execute blocking data flush operations that
   slow everything down massively, and
d) is indicative that the workload is about to suffer
   from a fatal, unrecoverable error (i.e. ENOSPC)



I totally agree. I am not trying to optimize a filesystem slowpath.
There are use cases, however, where we may want to create a relatively
small filesystem. One example that I cited in patch 2 is the battery
backed NVDIMM that I have played with recently. They can be used for log
files or other small files. Each DIMM is 8 GB. You can have a few of
those available. So the filesystem size could be 32GB or so. That can
come close to the limit where excessive percpu_counter_sum() calls
can happen. What I want to do here is to try to reduce the chance of
excessive percpu_counter_sum() calls causing a performance

Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter

2016-03-04 Thread Dave Chinner
On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
> This patchset allows the degeneration of per-cpu counters back to
> global counters when:
> 
>  1) The number of CPUs in the system is large, hence a high cost for
> calling percpu_counter_sum().
>  2) The initial count value is small so that it has a high chance of
> excessive percpu_counter_sum() calls.
> 
> When the above 2 conditions are true, this patchset allows the user of
> per-cpu counters to selectively degenerate them into global counters
> with lock. This is done by calling the new percpu_counter_set_limit()
> API after percpu_counter_set(). Without this call, there is no change
> in the behavior of the per-cpu counters.
> 
> Patch 1 implements the new percpu_counter_set_limit() API.
> 
> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
> per-cpu counters.
> 
> Waiman Long (2):
>   percpu_counter: Allow falling back to global counter on large system
>   xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

NACK.

This change turns off the per-cpu free block counters for 32p
machines in XFS.  We proved 10 years ago that a global
lock for these counters was a massive scalability limitation for
concurrent buffered writes on 16p machines.

IOWs, this change is going to cause fast path concurrent sequential
write regressions for just about everyone, even on empty
filesystems.

The behaviour you are seeing only occurs when the filesystem is near
to ENOSPC. As i asked you last time - if you want to make this
problem go away, please increase the size of the filesystem you are
running your massively concurrent benchmarks on.
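To make the "only near ENOSPC" point concrete, here is a small userspace
sketch of how an approximate per-cpu counter comparison typically decides
between the cheap read and the expensive sum. It is a model, not the kernel
code: every name below (approx_counter, approx_compare, APPROX_BATCH, ...)
is made up for illustration, and the 80-CPU count and batch size of 32 are
assumptions. The slow path is taken only when the shared count is within
batch * nr_cpus of the value being compared against.

```c
#include <assert.h>
#include <stdlib.h>

#define APPROX_NR_CPUS 80   /* assumption: thread count of the test machine */
#define APPROX_BATCH   32   /* assumption: illustrative per-CPU batch size */

struct approx_counter {
    long long count;            /* shared, approximate value */
    int pcp[APPROX_NR_CPUS];    /* per-CPU deltas not yet folded into count */
};

long approx_sum_calls;          /* how many times the slow path ran */

/* Set the shared value and give every CPU the same unfolded delta. */
void approx_init(struct approx_counter *c, long long count, int delta)
{
    /* A delta of BATCH or more would already have been folded in. */
    assert(abs(delta) < APPROX_BATCH);
    c->count = count;
    for (int cpu = 0; cpu < APPROX_NR_CPUS; cpu++)
        c->pcp[cpu] = delta;
}

/* Slow path: exact value, has to visit every CPU's slot. */
long long approx_sum(const struct approx_counter *c)
{
    long long total = c->count;

    approx_sum_calls++;
    for (int cpu = 0; cpu < APPROX_NR_CPUS; cpu++)
        total += c->pcp[cpu];
    return total;
}

/* Compare against rhs: returns 1 if greater, -1 if less, 0 if equal. */
int approx_compare(const struct approx_counter *c, long long rhs)
{
    /* Largest error the unfolded per-CPU deltas can introduce. */
    long long slack = (long long)APPROX_BATCH * APPROX_NR_CPUS;

    /* Far from rhs: the rough count is good enough, no sum needed. */
    if (llabs(c->count - rhs) > slack)
        return c->count > rhs ? 1 : -1;

    /* Close to rhs: only the exact sum can give the right answer. */
    long long exact = approx_sum(c);
    return (exact > rhs) - (exact < rhs);
}
```

With these assumed numbers the slack is 32 * 80 = 2560 blocks. A counter
sitting at a million free blocks never sums when compared against zero,
while one sitting at a few hundred sums on every comparison, which is the
near-ENOSPC case being discussed.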

IOWs, please stop trying to optimise a filesystem slow path that:

a) 99.9% of production workloads never execute,
b) where we expect performance to degrade as allocation gets
   computationally expensive as we close in on ENOSPC,
c) where we start to execute blocking data flush operations that
   slow everything down massively, and
d) is indicative that the workload is about to suffer
   from a fatal, unrecoverable error (i.e. ENOSPC)

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com


[RFC PATCH 0/2] percpu_counter: Enable switching to global counter

2016-03-04 Thread Waiman Long
This patchset allows the degeneration of per-cpu counters back to
global counters when:

 1) The number of CPUs in the system is large, hence a high cost for
calling percpu_counter_sum().
 2) The initial count value is small so that it has a high chance of
excessive percpu_counter_sum() calls.

When the above 2 conditions are true, this patchset allows the user of
per-cpu counters to selectively degenerate them into global counters
with a lock. This is done by calling the new percpu_counter_set_limit()
API after percpu_counter_set(). Without this call, there is no change
in the behavior of the per-cpu counters.
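For readers without the patches at hand, the idea can be sketched in
userspace roughly as follows. This is only a single-threaded model under
stated assumptions, not the code from patch 1: the model_* names, the
80-CPU count, the batch of 32, and the rule that the limit is scaled by
the number of CPUs are all assumptions made for illustration, and the
places where a real implementation would take the counter lock are marked
in comments only.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 80    /* assumption: number of CPUs in the system */
#define MODEL_BATCH   32    /* assumption: per-CPU batch size */

struct model_counter {
    long long count;                /* shared count */
    long long pcp[MODEL_NR_CPUS];   /* per-CPU deltas not yet folded in */
    bool global_mode;               /* true once degenerated to a global counter */
};

/* Model of setting the counter: clear per-CPU state, set the shared count. */
void model_set(struct model_counter *c, long long amount)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        c->pcp[cpu] = 0;
    c->count = amount;
    c->global_mode = false;
}

/*
 * Hypothetical stand-in for the proposed set_limit call: if the initial
 * count is below per_cpu_limit * nr_cpus, stop using the per-CPU slots.
 */
void model_set_limit(struct model_counter *c, long long per_cpu_limit)
{
    if (c->count < per_cpu_limit * MODEL_NR_CPUS)
        c->global_mode = true;
}

void model_add(struct model_counter *c, int cpu, long long amount)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

    if (c->global_mode) {
        /* A real implementation would hold the counter lock here. */
        c->count += amount;
        return;
    }

    long long v = c->pcp[cpu] + amount;
    if (v >= MODEL_BATCH || v <= -MODEL_BATCH) {
        /* Fold the batch into the shared count (lock held in real code). */
        c->count += v;
        c->pcp[cpu] = 0;
    } else {
        c->pcp[cpu] = v;    /* cheap path: touch only this CPU's slot */
    }
}

/* Exact value: shared count plus every CPU's unfolded delta. */
long long model_sum(const struct model_counter *c)
{
    long long total = c->count;

    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        total += c->pcp[cpu];
    return total;
}
```

As a usage example, assuming 4 kB blocks, a per-CPU limit of 12 MB is 3072
blocks, so model_set(&c, 1000) followed by model_set_limit(&c, 3072) puts
the counter in global mode, while a counter set to 100000000 blocks stays
per-cpu and behaves as it does without the call.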

Patch 1 implements the new percpu_counter_set_limit() API.

Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
per-cpu counters.

Waiman Long (2):
  percpu_counter: Allow falling back to global counter on large system
  xfs: Allow degeneration of m_fdblocks/m_ifree to global counters

 fs/xfs/xfs_mount.c             |    1 -
 fs/xfs/xfs_mount.h             |    5 +++
 fs/xfs/xfs_super.c             |    6 +++
 include/linux/percpu_counter.h |   10 +
 lib/percpu_counter.c           |   72 +++-
 5 files changed, 92 insertions(+), 2 deletions(-)


