Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-03-01 Thread Linus Torvalds
On Fri, Mar 1, 2013 at 10:18 AM, Davidlohr Bueso  wrote:
> On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
>>
>> Checking try_atomic_semop and do_smart_update, it looks like neither
>> is using atomic operations. That part of the semaphore code would
>> still benefit from spinlocks.
>
> Agreed.

Yup. As mentioned, I hadn't even looked at that part of the code, but
yes, it definitely wants the spinlock.

> How about splitting ipc_lock()/ipc_lock_control() in two calls: one to
> obtain the ipc object (rcu_read_lock + idr_find), which can be called
> when performing the permissions and security checks, and another to
> obtain the ipcp->lock [q_]spinlock when necessary.

That sounds like absolutely the right thing to do. And we can leave
the old helper functions that do both of them around, and only use the
split case for just a few places.

And if we make the RCU read-lock be explicit too, we could get rid of
some of the ugliness. Right now we have semtimedop() do things like a
conditional "find_alloc_undo()", which will get the RCU lock. It would
be much nicer if we just cleaned up the interfaces a tiny bit, said
that the *caller* has to get the RCU lock, and just do this
unconditionally before calling any of it. Because right now the RCU
details are quite messy, and we have code like

	if (un)
		rcu_read_unlock();
	error = PTR_ERR(sma);
	goto out_free;

etc, when it would actually be much simpler to just do the RCU read
locking unconditionally (we'll need it for the semaphore lookup
anyway) and just have the exit paths unlock unconditionally like we
usually do (ie a few different exit goto's that just nest the
unlocking properly).

It would simplify the odd locking both for humans and for code
generation. Right now we actually nest those RCU read locks two deep,
as far as I can see.

And it looks like this could be done in fairly small independent steps
("add explicit RCU locking", "split up helper functions", "start using
the simplified helper functions in selected paths that care").
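
For concreteness, a rough sketch of the kind of split I mean (the helper
names here are made up, and the real ipc/util.c plumbing will obviously
differ a bit):

/*
 * Phase 1: RCU-protected lookup only.  The caller has already done
 * rcu_read_lock() explicitly and keeps holding it.
 */
struct kern_ipc_perm *ipc_obtain_object(struct ipc_ids *ids, int id)
{
	struct kern_ipc_perm *out = idr_find(&ids->ipcs_idr, ipcid_to_idx(id));

	if (!out)
		return ERR_PTR(-EINVAL);
	return out;
}

/* Phase 2: take the per-object spinlock only when we actually modify things. */
static inline void ipc_lock_object(struct kern_ipc_perm *perm)
{
	spin_lock(&perm->lock);
}

The permission and security checks can then run between phase 1 and
phase 2, under the RCU read lock alone, and every exit path ends in one
unconditional rcu_read_unlock().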

It won't *solve* the locking issues, and I'm sure we'll still see
contention, but it should clean the code up and probably helps the
contention numbers visibly. Even if it's only 10% (which, judging by
my profiles, would be a safe lower bound from just moving the security
callback out of the spinlock - and it could easily be 20% or more
because contention often begets more contention) that would be in the
same kind of ballpark as the spinlock improvements. And using Michel's
work would be a largely independent scalability improvement on top.

Anybody willing to give it a try? I suspect Rik's benchmark, some
"perf record", and concentrating on just semtimedop() to begin with
would be a fairly good starting point.
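
Something along these lines would do as a first pass (semop-multi being
the test Rik posted further down in this thread; exact perf options to
taste):

  perf record -g ./semop-multi 64
  perf report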

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-03-01 Thread Rik van Riel

On 03/01/2013 01:18 PM, Davidlohr Bueso wrote:

On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:

On 02/28/2013 06:09 PM, Linus Torvalds wrote:


So I almost think that *everything* there in the semaphore code could
be done under RCU. The actual spinlock doesn't seem to much matter, at
least for semaphores. The semaphore values themselves seem to be
protected by the atomic operations, but I might be wrong about that, I
didn't even check.


Checking try_atomic_semop and do_smart_update, it looks like neither
is using atomic operations. That part of the semaphore code would
still benefit from spinlocks.


Agreed.


If we assume that calls to semop with more than one semaphore
operator are rare, we could do something smarter here.

We could turn the outer spinlock into an rwlock. If we are
doing a call that modifies the outer structure, or multiple
semops at once, we take the lock exclusively.

If we want to do just one semop, we can take the lock in
shared mode. Then each semaphore inside would have its own
spinlock, and we lock just that one.

Of course, that would just add overhead to the case where
a semaphore block has just one semaphore in it, so I'm not
sure this would be worthwhile at all...
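
Just to make the idea concrete, the layout could look something like
this (none of these fields exist today, it is only a sketch):

struct sem_array {
	struct kern_ipc_perm	sem_perm;	/* key, uid/gid, mode, ... */
	rwlock_t		big_lock;	/* exclusive: semctl and multi-sembuf semop */
	struct sem		sems[];
};

struct sem {
	int			semval;
	spinlock_t		lock;		/* single-sembuf path: big_lock shared + this */
};

The single-semop fast path would then be read_lock(&sma->big_lock) plus
the one per-semaphore spinlock, and everything that touches the whole
array would take write_lock(&sma->big_lock).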


The way the code handles a whole batch of semops all at once,
potentially to multiple semaphores at once, and with the ability
to undo all of the operations, it looks like the spinlock will
still need to be per block of semaphores.

I guess the code may still benefit from Michel's locking code,
after the permission stuff has been moved from under the spinlock.


How about splitting ipc_lock()/ipc_lock_control() in two calls: one to
obtain the ipc object (rcu_read_lock + idr_find), which can be called
when performing the permissions and security checks, and another to
obtain the ipcp->lock [q_]spinlock when necessary.


That is what I am working on now.

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-03-01 Thread Davidlohr Bueso
On Fri, 2013-03-01 at 01:42 -0500, Rik van Riel wrote:
> On 02/28/2013 06:09 PM, Linus Torvalds wrote:
> 
> > So I almost think that *everything* there in the semaphore code could
> > be done under RCU. The actual spinlock doesn't seem to much matter, at
> > least for semaphores. The semaphore values themselves seem to be
> > protected by the atomic operations, but I might be wrong about that, I
> > didn't even check.
> 
> Checking try_atomic_semop and do_smart_update, it looks like neither
> is using atomic operations. That part of the semaphore code would
> still benefit from spinlocks.

Agreed.

> 
> The way the code handles a whole batch of semops all at once,
> potentially to multiple semaphores at once, and with the ability
> to undo all of the operations, it looks like the spinlock will
> still need to be per block of semaphores.
> 
> I guess the code may still benefit from Michel's locking code,
> after the permission stuff has been moved from under the spinlock.

How about splitting ipc_lock()/ipc_lock_control() in two calls: one to
obtain the ipc object (rcu_read_lock + idr_find), which can be called
when performing the permissions and security checks, and another to
obtain the ipcp->lock [q_]spinlock when necessary.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Rik van Riel

On 02/28/2013 06:09 PM, Linus Torvalds wrote:


So I almost think that *everything* there in the semaphore code could
be done under RCU. The actual spinlock doesn't seem to much matter, at
least for semaphores. The semaphore values themselves seem to be
protected by the atomic operations, but I might be wrong about that, I
didn't even check.


Checking try_atomic_semop and do_smart_update, it looks like neither
is using atomic operations. That part of the semaphore code would
still benefit from spinlocks.

The way the code handles a whole batch of semops all at once,
potentially to multiple semaphores at once, and with the ability
to undo all of the operations, it looks like the spinlock will
still need to be per block of semaphores.

I guess the code may still benefit from Michel's locking code,
after the permission stuff has been moved from under the spinlock.

Two remaining worries are the security_sem_free call and the
non-RCU list_del calls from freeary, called from the IPC_RMID
code. They are probably fine, but worth another pair of eyes...

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Linus Torvalds
On Thu, Feb 28, 2013 at 2:38 PM, Rik van Riel  wrote:
>
> I could see doing the permission checks under a seq lock.
>
> If the permissions, or any other aspect of the semaphore
> array changed while we were doing our permission check,
> we can simply jump back to the top of the function and
> try again.

We do the normal file permissions under just the RCU read lock. If
it's good enough for files, then ipc semaphores certainly don't need
anything more.

> The ugliness of using two kinds of protection may be
> necessary, since permissions can be modified in-place,
> and RCU does not seem to do in-place modification...

The selinux AVC code does its stuff by rcu, and for files we don't
even care if there is some uid/gid change going on - we'll  get one or
the other, and live with it. There is no atomic way to change the uid
and gid anyway, so there are no new races. You get old or you get new,
or a mix of the two, any of it is "correct".

So I almost think that *everything* there in the semaphore code could
be done under RCU. The actual spinlock doesn't seem to much matter, at
least for semaphores. The semaphore values themselves seem to be
protected by the atomic operations, but I might be wrong about that, I
didn't even check. I just looked at the ipcperm() part, which
certainly doesn't seem to merit spinlocks.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Rik van Riel

On 02/28/2013 04:58 PM, Linus Torvalds wrote:


I'm not seeing any real reason the permission checking couldn't be
done just under the RCU lock, before we get the spinlock. Except for
the fact that the "helper" routines in ipc/util.c are written the way
they are, so it's a layering violation. But I really think that would
be a *reasonably* low-hanging fruit thing to do.


I could see doing the permission checks under a seq lock.

If the permissions, or any other aspect of the semaphore
array changed while we were doing our permission check,
we can simply jump back to the top of the function and
try again.
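
Roughly like this, assuming a per-array seqlock (call it perm_seq; it
does not exist today) that writers take whenever they change the
permissions in place:

static int sem_check_perms(struct ipc_namespace *ns, struct sem_array *sma,
			   short acc_mode)
{
	unsigned seq;
	int error;

	do {
		seq = read_seqbegin(&sma->perm_seq);
		error = ipcperms(ns, &sma->sem_perm, acc_mode);
	} while (read_seqretry(&sma->perm_seq, seq));	/* changed under us: redo */

	return error;
}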

Protection against IPC_RMID (removal of the semaphore
block) can probably be done with RCU.

The ugliness of using two kinds of protection may be
necessary, since permissions can be modified in-place,
and RCU does not seem to do in-place modification...

I have been staring at the code for a bit this afternoon,
and have yet to come up with a nicer idea :(

I am always open to suggestions, though :)


Changing the locking itself to be more fine-grained, and doing it
across many different ipc semaphores would be a major pain. So I do
suspect that the work Michel Lespinasse did is probably worth doing
anyway in addition to at least trying to fix the horrible lack of
scalability of the code a bit.


That would certainly be the easy fix.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Linus Torvalds
On Thu, Feb 28, 2013 at 1:14 PM, Rik van Riel  wrote:
>
> I have modified one of the semop tests to use multiple semaphores.

Ooh yeah. This shows contention quite nicely. And it's all from
ipc_lock, and looking at the top-10 offenders of the profile:

 43.01%  semop-multi  [kernel.kallsyms]   [k] _raw_spin_lock
...
  4.73%  semop-multi  [kernel.kallsyms]   [k] avc_has_perm_flags
  4.52%  semop-multi  [kernel.kallsyms]   [k] ipc_has_perm.isra.21
...
  2.43%  semop-multi  [kernel.kallsyms]   [k] ipcperms

The 43% isn't actually all that interesting, it just shows that there
is contention and we're waiting for other users. Yes, we waste almost
half the CPU time on locking, but ignore that for a moment.

The "more than 10% of the total time is spent in ipc permission code"
*is* the interesting part. Because that 10%+ is actually more like 20%
if you ignore the "wait for lock" part. And it's all done *inside* the
lock.

In other words, I can pretty much guarantee that the contention will
go down a lot if we just move the security check outside the spinlock.
According to the above numbers, we're currently spending basically
1/5th of our remaining CPU resources serialized for absolutely no good
reason. THAT is the kind of thing we shouldn't do.

The rest of the big offenders seem to be mostly done outside the
spinlock, although it's hard to tell how much of the 10% of
sys_semtimedop() is also under the lock. There's probably other
things there than just the permission checking.

I'm not seeing any real reason the permission checking couldn't be
done just under the RCU lock, before we get the spinlock. Except for
the fact that the "helper" routines in ipc/util.c are written the way
they are, so it's a layering violation. But I really think that would
be a *reasonably* low-hanging fruit thing to do.

Changing the locking itself to be more fine-grained, and doing it
across many different ipc semaphores would be a major pain. So I do
suspect that the work Michel Lespinasse did is probably worth doing
anyway in addition to at least trying to fix the horrible lack of
scalability of the code a bit.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Rik van Riel

On 02/28/2013 03:26 PM, Linus Torvalds wrote:

On Thu, Feb 28, 2013 at 10:22 AM, Linus Torvalds wrote:


I'm sure there are other things we could do to improve ipc lock times
even if we don't actually split the lock, but the security one might
be a good first step.


Btw, if somebody has a benchmark for threads using multiple ipc
semaphores (from the same semget() allocation) concurrently, and we
could have a simple way to see the contention without having to run
some big DB thing, that would also be nice. Maybe there is something
out there already? Google didn't find any, and the normal benchmarks
I'm aware of all just do one single (private) ipc semaphore per
process.

Nothing gets some people going like just having a nice benchmark to
show the effect.


I have modified one of the semop tests to use multiple semaphores.

To run the test, specify the number of threads.  If you want the
number of semaphores to be different from the number of threads,
specify a second commandline argument.

$ ./semop-multi
usage: ./semop-multi <threads> [nsems]
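
Build and run it with something like

  gcc -O2 -pthread -o semop-multi semop-multi.c
  ./semop-multi 64 16

(64 threads hammering on 16 semaphores in one semget() set).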

#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <malloc.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <sys/sem.h>

#define TEST_TIME	30
#define SEMMNI		128

int		semid;
int		state = 1;
unsigned long	*results_array;
int		threads_starting;
pthread_cond_t	thread_parent;
pthread_cond_t	thread_worker;
pthread_mutex_t	thread_lock;
int		nsems;

union semun {
	int val;
	struct semid_ds *buf;
	unsigned short int *array;
	struct seminfo *__buf;
	void *__pad;
};

void *
worker(void *arg)
{
	unsigned long	count = 0;
	int		id = (int)(unsigned long)arg;
	struct sembuf	sembuff;

	sembuff.sem_num = 0;
	sembuff.sem_flg = 0;

	pthread_mutex_lock(&thread_lock);
	threads_starting--;
	if (!threads_starting)
		pthread_cond_signal(&thread_parent);
	pthread_cond_wait(&thread_worker, &thread_lock);
	pthread_mutex_unlock(&thread_lock);

	for (;state;) {

		/* Move "id" ahead through the semaphores */
		sembuff.sem_num = (sembuff.sem_num + id) % nsems;

		/* Lock the semaphore */
		sembuff.sem_op = 1;
		if (semop(semid, &sembuff, 1) < 0) {
			perror("semop");
			exit(1);
		}

		/* Unlock the semaphore */
		sembuff.sem_op = -1;
		if (semop(semid, &sembuff, 1) < 0) {
			perror("semop");
			exit(1);
		}

		count += 2;
	}

	results_array[id] = count;

	return NULL;
}

int
main(int argc, char **argv)
{
	pthread_t	*thread_array;
	pthread_attr_t	thread_attr;
	int		thread_count;
	unsigned short	seminit[SEMMNI];
	union semun	sem_un;
	cpu_set_t	cpu;
	unsigned long	total = 0;
	int		i, ret;
	long		cpus;

	cpus = sysconf(_SC_NPROCESSORS_ONLN);

	if (argc < 2) {
		printf("usage: %s  [nsems]\n", argv[0]);
		exit(1);
	}

	thread_count = atoi(argv[1]);

	if (thread_count < 0) {
		printf("threads must be >= 0\n");
		exit(1);
	}

	if (thread_count == 0)
		thread_count = cpus;

	if (argc > 2)
		nsems = atoi(argv[2]);
	else
		nsems = thread_count;
	if (nsems > SEMMNI)
		nsems = SEMMNI;

	printf("cpus %ld, threads: %d, semaphores: %d, test duration: %d secs\n", cpus, thread_count, nsems, TEST_TIME);

	thread_array = malloc(thread_count * sizeof(pthread_t));

	if (!thread_array) {
		perror("malloc(thread_array)");
		exit(1);
	}

	results_array = malloc(thread_count * sizeof(unsigned long));

	if (!results_array) {
		perror("malloc(results_array)");
		exit(1);
	}

	semid = semget(0x12345, nsems, 0777|IPC_CREAT );

	if (semid < 0) {
		perror("semget");
		exit(1);
	}

	for (i = 0; i < SEMMNI; i++)
		seminit[i] = 200;
	sem_un.array = seminit;
  
	if (semctl(semid, 1, SETALL, sem_un) < 0) {
		perror("semctl(setall)");
		exit(1);
	}

	pthread_mutex_init(&thread_lock, NULL);
	pthread_cond_init(&thread_parent, NULL);
	pthread_cond_init(&thread_worker, NULL);
	pthread_attr_init(&thread_attr);

	threads_starting = thread_count;

	for (i = 0; i < thread_count; i++) {

		CPU_ZERO(&cpu);
		CPU_SET(i % cpus, &cpu);

		ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpu);

		if (ret) {
			printf("pthread_attr_setaffinity_np: %s\n", strerror(ret));
			exit(1);
		}

		ret = pthread_create(&thread_array[i], &thread_attr, worker, (void *)(unsigned long)i);

		if (ret) {
			printf("pthread_create: %s\n", strerror(ret));
			exit(1);
		}
	}

	pthread_attr_destroy(&thread_attr);

	pthread_mutex_lock(&thread_lock);
	while (threads_starting)
		pthread_cond_wait(&thread_parent, &thread_lock);
	pthread_cond_broadcast(&thread_worker);
	pthread_mutex_unlock(&thread_lock);

	sleep(TEST_TIME);
	state = 0;

	for (i = 0; i < thread_count; i++)
		pthread_join(thread_array[i], NULL);

	pthread_cond_destroy(&thread_parent);
	pthread_cond_destroy(&thread_worker);
	pthread_mutex_destroy(&thread_lock);

	if (semctl(semid, 1, IPC_RMID) < 0)
		perror("semctl(rmid)");

	for (i = 0; i < thread_count; i++)
		total += results_array[i];

	printf("total operations: %ld, ops/sec %ld\n", total, total / TEST_TIME);

	free(thread_array);
	free(results_array);

	return 0;
}


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Linus Torvalds
On Thu, Feb 28, 2013 at 10:22 AM, Linus Torvalds wrote:
>
> I'm sure there are other things we could do to improve ipc lock times
> even if we don't actually split the lock, but the security one might
> be a good first step.

Btw, if somebody has a benchmark for threads using multiple ipc
semaphores (from the same semget() allocation) concurrently, and we
could have a simple way to see the contention without having to run
some big DB thing, that would also be nice. Maybe there is something
out there already? Google didn't find any, and the normal benchmarks
I'm aware of all just do one single (private) ipc semaphore per
process.

Nothing gets some people going like just having a nice benchmark to
show the effect.

Because maybe I'm wrong, and there really are people who would love to
muck around with the SysV IPC code, and they just didn't have the
right reason or benchmark for it...

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Linus Torvalds
On Thu, Feb 28, 2013 at 7:13 AM, Rik van Riel  wrote:
>
> Btw, the IPC lock is already fairly fine grained. One ipc
> lock is allocated for each set of semaphores allocated through
> sys_semget. Looking up those semaphores in the namespace, when
> they are used later, is done under RCU.

Bullshit. There is no fine-grained locking, and the code has never
ever been optimized for concurrency at all.

Yes, the lookup is done with RCU, and that's good. And I can tell you
*why* it's done with RCU:  all the trivial sysv semaphore benchmarks
I've seen tend to have just one lock per semget and then they test the
performance of that semaphore.

But I'm suspecting that since there is clearly contention going on
(and hopefully the real world people aren't just all pounding on one
single mutex), maybe these loads actually allocate multiple semaphores
with semget (that 'nsems' argument). And then we scale really badly,
because we use a single spinlock for all of them, even if we didn't
have to.

THERE IS ABSOLUTELY ZERO SCALING. None. Nada. The scaling that exists
(your RCU lookup example) is for *unrelated* locks, which is exactly
what you'd expect if all the simple benchmarks around did a single
sem-lock and then the "scaling" was tested by running a thousand
copies of that program - all using their own private single ipc
semaphore.

Plus the code in ipc/sem.c really is not very good. It does that
"sem_lock()" (which gets the ipc spinlock) way too eagerly, because
the semaphore lock also protects some trivial ipc stuff, so it gets
the lock *first*, then it does all the random checks it wants to do
(security checks, you name it), and then it does all the queue undo
crap etc. All while holding that lock.

So no. There's no scaling, and the SINGLE lock that is held is held
way too f*cking long.

And that's what I'm talking about. We've never had people actually
bother about the IPC locking, because we've never had people who
cared. Everybody hates the crazy SysV IPC code, often for good reason.
People are told not to use it (use shared memory and futexes instead,
which actually *do* scale, and people *do* care about). But of course,
lots of projects do use the horrid SysV IPC code because it's
portable, and it does have those undo features etc.

And I'm sure we *could* fix it, but judging by past performance nobody
really cares enough. Which is why I do think that fixing it the wrong
way (by making contention on the ipc_lock() less painful) may well be the right
thing here for once.

Of course, if the real-world applications really only use a single
SysV semaphore, then we can't scale any better. But even then we could
still try to lower the hold-times of that ipc_lock().

And hold-time really tends to be one big killer for contention. A lot
of hardware tends to have ping-pong avoidance logic, which means that
if you unlock quickly after getting the lock, the line will still remain
in the local node caches. Hardware features like that make it much harder to
introduce the really bad contention cases even if there are tons of
CPU's trying to access the same lock. But if the lock hold times are
long, it's *much* easier to get bad contention.

Looking at the code, if we could just get the security hook outside of
the locked region, that would probably be a big improvement. Maybe do
that part in the RCU lookup part? Because selinux really is very
expensive, with the whole ipc_has_perm -> avc_has_perm_flags thing etc
etc. This is exactly the kind of thing we did with pathnames.
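
In other words, go from roughly this shape (everything under the ipc spinlock):

	sma = sem_lock_check(ns, semid);		/* spin_lock taken here */
	err = ipcperms(ns, &sma->sem_perm, acc);	/* under the lock */
	if (!err)
		err = security_sem_semop(sma, sops, nsops, alter);	/* selinux, still under the lock */

to something shaped like this (the lookup helper name is made up):

	rcu_read_lock();
	sma = sem_obtain_object(ns, semid);		/* lookup only, no spinlock */
	err = ipcperms(ns, &sma->sem_perm, acc);	/* under RCU only */
	if (!err)
		err = security_sem_semop(sma, sops, nsops, alter);
	if (!err)
		spin_lock(&sma->sem_perm.lock);		/* lock only for the real update */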

I'm sure there are other things we could do to improve ipc lock times
even if we don't actually split the lock, but the security one might
be a good first step.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-28 Thread Rik van Riel

On 02/27/2013 11:49 PM, Linus Torvalds wrote:

On Wed, Feb 27, 2013 at 8:06 PM, Davidlohr Bueso  wrote:


The attached file shows how the amount of sys time used by the ipc lock
for a 4 and 8 socket box.


I have to say, even with the improvements, that looks pretty
disgusting. It really makes me wonder if that thing couldn't be done
better some way. Is it the SysV semaphores that this all ends up
using, or what?

That said, I think the IPC layer is just about the perfect candidate
for things like this, because I'm afraid that nobody is ever going to
fix it.


There's more to it than that. Userspace expects the IPC layer
to provide exclusion and/or serialization. That makes it a
rather poor candidate for parallelism.

Btw, the IPC lock is already fairly fine grained. One ipc
lock is allocated for each set of semaphores allocated through
sys_semget. Looking up those semaphores in the namespace, when
they are used later, is done under RCU.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 8:06 PM, Davidlohr Bueso  wrote:
>
> The attached file shows how the amount of sys time used by the ipc lock
> for a 4 and 8 socket box.

I have to say, even with the improvements, that looks pretty
disgusting. It really makes me wonder if that thing couldn't be done
better some way. Is it the SysV semaphores that this all ends up
using, or what?

That said, I think the IPC layer is just about the perfect candidate
for things like this, because I'm afraid that nobody is ever going to
fix it.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Davidlohr Bueso
On Wed, 2013-02-27 at 21:58 -0500, Rik van Riel wrote:
> On 02/27/2013 05:13 PM, Linus Torvalds wrote:
> >
> > On Feb 27, 2013 1:56 PM, "Rik van Riel" wrote:
> >>
> >> No argument there, but that does in no way negate the need for some
> >> performance robustness.
> >
> > The very numbers you posted showed that the backoff was *not* more
> > robust. Quite the reverse, there was arguably more variability.
> 
> On the other hand, both MCS and the fast queue locks
> implemented by Michel showed low variability and high
> performance.
> 
> http://thread.gmane.org/gmane.linux.kernel/1427417
> 
> > So I really don't like how you make these sweeping statements
> > *again*. Numbers talk, bullshit walks.
> 
> If you read all the text in my last mail, you will see the
> link to Michel's performance results. The numbers speak for
> themselves.
> 
> > The fact is, life is complicated. The simple spinlocks tend to work
> > really well. People have tried fancy things before, and it turns out
> > it's not as simple as they think.
> 
> The numbers for both the simple spinlocks and the
> spinlock backoff kind of suck. Both of these have
> high variability, and both eventually fall down
> under heavy load.
> 
> The numbers for Michel's MCS and fast queue lock
> implementations appear to be both fast and stable.
> 
> I agree that we need numbers.

FWIW I've been doing some benchmarking for Swingbench DSS workloads
(Oracle data mining) comparing Rik and Michel's patches. With lower
amounts of contention, Rik's ticket spinlock is better, but once
contention gets high enough the queued locks perform better.

The attached file shows the amount of sys time used by the ipc lock
for a 4 and 8 socket box.
<>

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 6:58 PM, Rik van Riel  wrote:
>
> On the other hand, both MCS and the fast queue locks
> implemented by Michel showed low variability and high
> performance.

On microbenchmarks, and when implemented for only one single subsystem, yes.

> The numbers for Michel's MCS and fast queue lock
> implementations appear to be both fast and stable.

I do think that doing specialized spinlocks for special areas may be a
reasonable approach, and it's quite possible that the SysV IPC thing is
one such area.

But no, I don't think the numbers I've seen for Michel's MCS are AT
ALL comparable to the generic spinlocks, and the interface makes them
incompatible as a replacement to even test in general.

Don't get me wrong: I think the targeted approach is *better*. I just
also happen to think that you spout big words about things that aren't
all that big, and try to make this a bigger deal than it is. The
benchmark numbers you point to are micro-benchmarks, and not all
comparable.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Rik van Riel

On 02/27/2013 05:13 PM, Linus Torvalds wrote:


On Feb 27, 2013 1:56 PM, "Rik van Riel" <r...@redhat.com> wrote:


No argument there, but that does in no way negate the need for some
performance robustness.


The very numbers you posted showed that the backoff was *not* more
robust. Quite the reverse, there was arguably more variability.


On the other hand, both MCS and the fast queue locks
implemented by Michel showed low variability and high
performance.

http://thread.gmane.org/gmane.linux.kernel/1427417


So I really don't like how you make these sweeping statements
*again*. Numbers talk, bullshit walks.


If you read all the text in my last mail, you will see the
link to Michel's performance results. The numbers speak for
themselves.


The fact is, life is complicated. The simple spinlocks tend to work
really well. People have tried fancy things before, and it turns out
it's not as simple as they think.


The numbers for both the simple spinlocks and the
spinlock backoff kind of suck. Both of these have
high variability, and both eventually fall down
under heavy load.

The numbers for Michel's MCS and fast queue lock
implementations appear to be both fast and stable.

I agree that we need numbers.

I do not agree that other locks should be dismissed
out of hand without looking at the numbers.



Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Rik van Riel

On 02/27/2013 03:18 PM, Linus Torvalds wrote:

On Wed, Feb 27, 2013 at 11:53 AM, Rik van Riel  wrote:


If we have two classes of spinlocks, I suspect we would be better
off making those high-demand spinlocks MCS or LCH locks, which have
the property that having N+1 CPUs contend on the lock will never
result in slower aggregate throughput than having N CPUs contend.


I doubt that.

The fancy "no slowdown" locks almost never work in practice. They
scale well by performing really badly for the normal case, either
needing separate allocations or having memory ordering problems
requiring multiple locked cycles.


The relative costs of atomic operations, cache line acquisition, and
other things have shifted over time. On the very latest systems, the
cost of cache line acquisition appears to dominate over the cost of
atomic operations.

Michel's results from last month show that MCS has essentially the same
performance in the single thread (uncontended) situation, a degradation
on some systems for 2 and 3 thread performance in a tight micro test,
and improved performance when the contention involves 3 or more CPUs.

http://thread.gmane.org/gmane.linux.kernel/1427417


A spinlock basically needs to have a fast-case that is a single locked
instruction, and all the clever ones tend to fail that simple test.


With the cost of a cache line acquisition outweighing the cost of an
atomic operation, for how much longer will this remain true?

Michel's results suggest that on Sandybridge this no longer seems
to hold.  The cost of the atomic operation on unlock appears to have
more than paid for itself by avoiding extraneous cache line bouncing,
and the cost of cache line acquisition. Even with only two CPUs
contending on the lock...


I can certainly take profiles of various workloads, but there is
absolutely no guarantee that I will see the same bottlenecks that
eg. the people at HP have seen. The largest test system I currently
have access to has 40 cores, vs. the 80 cores in the (much more
interesting) HP results I pasted.

Would you also be interested in performance numbers (and profiles)
of a kernel that has bottleneck spinlocks replaced with MCS locks?


MCS locks don't even work, last time I saw. They need that extra lock
holder allocation, which forces people to have different calling
conventions, and is just a pain. Or am I confusing them with something
else?


Nope, those are the MCS locks alright.


They might work for the special cases like the sleeping locks, which
have one or two places that take and release the lock, but not for the
generic spinlock.


I am certainly not advocating that all spinlocks be replaced with
harder to use locks.  On the other hand, we have to realize that
Linux users do not have the luxury of upgrading their kernel to the
latest upstream whenever they run into a performance issue, so
it would be good to make Linux more robust against scalability
issues.


So before even trying anything fancy, just basic profiles would be
good to see which lock it is. Many of the really bad slowdowns are
actually about the timing details of the sleeping locks (do *not*
enable lock debugging etc for profiling, you want the mutex spinning
code to be active, for example).


No argument there, but that does in no way negate the need for
some performance robustness.

--
All rights reversed


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 11:53 AM, Rik van Riel  wrote:
>
> If we have two classes of spinlocks, I suspect we would be better
> off making those high-demand spinlocks MCS or LCH locks, which have
> the property that having N+1 CPUs contend on the lock will never
> result in slower aggregate throughput than having N CPUs contend.

I doubt that.

The fancy "no slowdown" locks almost never work in practice. They
scale well by performing really badly for the normal case, either
needing separate allocations or having memory ordering problems
requiring multiple locked cycles.

A spinlock basically needs to have a fast-case that is a single locked
instruction, and all the clever ones tend to fail that simple test.
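
For reference, the ticket-lock fast path being defended here is roughly
the following (a simplified sketch of the arch/x86 code of that era);
the entire uncontended acquisition is the single locked xadd:

static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
{
        register struct __raw_tickets inc = { .tail = 1 };

        /* Take a ticket: one locked instruction, and the whole fast path. */
        inc = xadd(&lock->tickets, inc);

        while (inc.head != inc.tail) {
                cpu_relax();
                /* Re-read the now-serving number until it matches our ticket. */
                inc.head = ACCESS_ONCE(lock->tickets.head);
        }
        barrier();      /* don't let the critical section leak upwards */
}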

> I can certainly take profiles of various workloads, but there is
> absolutely no guarantee that I will see the same bottlenecks that
> eg. the people at HP have seen. The largest test system I currently
> have access to has 40 cores, vs. the 80 cores in the (much more
> interesting) HP results I pasted.
>
> Would you also be interested in performance numbers (and profiles)
> of a kernel that has bottleneck spinlocks replaced with MCS locks?

MCS locks don't even work, last time I saw. They need that extra lock
holder allocation, which forces people to have different calling
conventions, and is just a pain. Or am I confusing them with something
else?

They might work for the special cases like the sleeping locks, which
have one or two places that take and release the lock, but not for the
generic spinlock.
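
To make the calling-convention point concrete, a minimal MCS sketch
(illustrative only, written with kernel-style primitives and with
memory barriers omitted; not Michel's actual implementation) looks
something like this; note the per-caller queue node that has to be
passed to both the lock and the unlock side:

struct mcs_node {
        struct mcs_node *next;
        int locked;
};

/* The lock itself is just a pointer to the tail of the waiter queue. */
struct mcs_lock {
        struct mcs_node *tail;
};

static void mcs_lock(struct mcs_lock *lock, struct mcs_node *node)
{
        struct mcs_node *prev;

        node->next = NULL;
        node->locked = 0;

        prev = xchg(&lock->tail, node);         /* queue ourselves at the tail */
        if (!prev)
                return;                         /* lock was free, we own it */

        prev->next = node;
        while (!ACCESS_ONCE(node->locked))      /* spin on our own cache line */
                cpu_relax();
}

static void mcs_unlock(struct mcs_lock *lock, struct mcs_node *node)
{
        struct mcs_node *next = ACCESS_ONCE(node->next);

        if (!next) {
                /* No visible successor: try to release by clearing the tail. */
                if (cmpxchg(&lock->tail, node, NULL) == node)
                        return;
                /* Someone is queueing right now; wait for the link to appear. */
                while (!(next = ACCESS_ONCE(node->next)))
                        cpu_relax();
        }
        ACCESS_ONCE(next->locked) = 1;          /* pass the lock on */
}

The node normally lives on the caller's stack, which is exactly why
this cannot be dropped in behind the existing spin_lock(&lock)
interface.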

Also, it might be worth trying current git - if it's a rwsem that is
implicated, the new lock stealing might be a win.

So before even trying anything fancy, just basic profiles would be
good to see which lock it is. Many of the really bad slowdowns are
actually about the timing details of the sleeping locks (do *not*
enable lock debugging etc for profiling, you want the mutex spinning
code to be active, for example).

   Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Rik van Riel

On 02/27/2013 12:10 PM, Linus Torvalds wrote:


Ugh. That really is rather random. "short" and fserver seem to
improve a lot (including the "new" version); the others look like they
are either unchanged or huge regressions.

Is there any way to get profiles for the improved versions vs the
regressed ones? It might well be that we have two different classes of
spinlocks. Maybe we could make the back-off version be *explicit* (ie
not part of the normal "spin_lock()", but you'd use a special
"spin_lock_backoff()" function for it) because it works well for some
cases but not for others?


If we have two classes of spinlocks, I suspect we would be better
off making those high-demand spinlocks MCS or LCH locks, which have
the property that having N+1 CPUs contend on the lock will never
result in slower aggregate throughput than having N CPUs contend.

Both the current spinlock code and the spinlock code with backoff
has the annoying property that adding more waiters to the lock can
slow down aggregate throughput.

The backoff code seems to push that point a little further out.
It appears that that may result in "lesser bottlenecks" disappearing,
which in turn triggers amplification of the worst bottlenecks.


Hmm? At the very least, it would give us an idea of *which* spinlock
it is that causes the most pain. I think your earlier indication was
that it's the mutex->wait_lock or something?


I can certainly take profiles of various workloads, but there is
absolutely no guarantee that I will see the same bottlenecks that
eg. the people at HP have seen. The largest test system I currently
have access to has 40 cores, vs. the 80 cores in the (much more
interesting) HP results I pasted.

Would you also be interested in performance numbers (and profiles)
of a kernel that has bottleneck spinlocks replaced with MCS locks?

That could make for an interesting comparison...

--
All rights reversed


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 8:42 AM, Rik van Riel  wrote:
>
> To keep the results readable and relevant, I am reporting the
> plateau performance numbers. Comments are given where required.
>
>                 3.7.6 vanilla   3.7.6 w/ backoff
>
> all_utime       333000          333000
> alltests        30-47           18-44       large variability
> compute         528000          528000
> custom          29-32           25-33       4 fast runs, 1 slow
> dbase           92              925000
> disk            10              9-12        similar plateau, wild
>                                             swings with patches
> five_sec        14              14
> fserver         16-30           25-43       w/ patch drops off at
>                                             higher number of users
> high_systime    8-11            3-125000    w/ patch mostly 40k-70k,
>                                             wild swings
> long            no performance plateau, equal performance for both
> new_dbase       96              96000
> new_fserver     15-30           21-42       vanilla drops off,
>                                             w/ patches wild swings
> shared          27-44           12-44       all runs ~equal to
>                                             vanilla up to 1000
>                                             users, one out of 5
>                                             runs slows down past
>                                             1100 users
> short           12              19

Ugh. That really is rather random. "short" and fserver seem to
improve a lot (including the "new" version); the others look like they
are either unchanged or huge regressions.

Is there any way to get profiles for the improved versions vs the
regressed ones? It might well be that we have two different classes of
spinlocks. Maybe we could make the back-off version be *explicit* (ie
not part of the normal "spin_lock()", but you'd use a special
"spin_lock_backoff()" function for it) because it works well for some
cases but not for others?
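
As a sketch of what such an opt-in variant could look like on top of
the ticket lock (hypothetical name, and a fixed "prop-N"-style delay
rather than the auto-tuned one in the actual patches):

#define SPIN_BACKOFF_UNIT       50      /* the "N": delay loops per waiter ahead of us */

static inline void spin_lock_backoff(arch_spinlock_t *lock)
{
        register struct __raw_tickets inc = { .tail = 1 };

        inc = xadd(&lock->tickets, inc);        /* take a ticket as usual */

        while (inc.head != inc.tail) {
                /* Wait longer the further back in the queue we are. */
                unsigned int delay = (inc.tail - inc.head) * SPIN_BACKOFF_UNIT;

                while (delay--)
                        cpu_relax();

                inc.head = ACCESS_ONCE(lock->tickets.head);
        }
}

Callers that are known to be heavily contended could opt into this,
while everything else keeps the plain fast path.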

Hmm? At the very least, it would give us an idea of *which* spinlock
it is that causes the most pain. I think your earlier indication was
that it's the mutex->wait_lock or something?

   Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Rik van Riel

On 02/13/2013 08:21 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel  wrote:


I have an example of the second case. It is a test case
from a customer issue, where an application is contending on
semaphores, doing semaphore lock and unlock operations. The
test case simply has N threads, trying to lock and unlock the
same semaphore.

The attached graph (which I sent out with the intro email to
my patches) shows how reducing the memory accesses from the
spinlock wait path prevents the large performance degradation
seen with the vanilla kernel. This is on a 24 CPU system with
4 6-core AMD CPUs.

The "prop-N" series are with a fixed delay proportional back-off.
You can see that a small value of N does not help much for large
numbers of cpus, and a large value hurts with a small number of
CPUs. The automatic tuning appears to be quite robust.


Ok, good, so there are some numbers. I didn't see any in the commit
messages anywhere, and since the threads I've looked at are from
tip-bot, I never saw the intro email.


Some people at HP have collected an extensive list of AIM 7 results,
all the different AIM 7 workloads, on an 80-core DL-980, with HT
disabled.

The AIM7 workloads all work by slowly increasing the number of
worker processes, all of which have some duty cycle (busy & sleep).
Adding more processes tends to increase the number of jobs/minute
completed, up to a certain point. For some workloads, the system
has a performance peak and performance levels off at or near that
peak; for other workloads, performance drops when more processes
are added beyond the peak, settling at a lower plateau.

To keep the results readable and relevant, I am reporting the
plateau performance numbers. Comments are given where required.

                3.7.6 vanilla   3.7.6 w/ backoff

all_utime       333000          333000
alltests        30-47           18-44       large variability
compute         528000          528000
custom          29-32           25-33       4 fast runs, 1 slow
dbase           92              925000
disk            10              9-12        similar plateau, wild
                                            swings with patches
five_sec        14              14
fserver         16-30           25-43       w/ patch drops off at
                                            higher number of users
high_systime    8-11            3-125000    w/ patch mostly 40k-70k,
                                            wild swings
long            no performance plateau, equal performance for both
new_dbase       96              96000
new_fserver     15-30           21-42       vanilla drops off,
                                            w/ patches wild swings
shared          27-44           12-44       all runs ~equal to
                                            vanilla up to 1000
                                            users, one out of 5
                                            runs slows down past
                                            1100 users
short           12              19

In conclusion, the spinlock backoff patches seem to significantly
improve performance in workloads where there is simple contention
on just one or two spinlocks. However, in more complex workloads,
high variability is seen, including performance regression in some
test runs.

One hypothesis is that before the spinlock backoff patches, the
workloads get contention (and bottleneck) on multiple locks. With
the patches, the contention on some of the locks is relieved, and
more tasks bunch up on the remaining bottlenecks, leading to worse
performance.


That said, it's interesting that this happens with the semaphore path.
We've had other cases where the spinlock in the *sleeping* locks have
caused problems, and I wonder if we should look at that path in
particular.


If we want to get reliable improved performance without unpredictable
performance swings, we should probably change some of the kernel's
spinlocks, especially the ones embedded in sleeping locks, into
scalable locks like Michel's implementation of MCS locks.

We may be hitting the limit of what can be done with the current
ticket lock data structure. It simply may not scale as far as the
hardware on which Linux is being run.
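
For reference, that data structure is just a head/tail ticket pair
(roughly, from arch/x86/include/asm/spinlock_types.h):

#if (CONFIG_NR_CPUS < 256)
typedef u8  __ticket_t;
typedef u16 __ticketpair_t;
#else
typedef u16 __ticket_t;
typedef u32 __ticketpair_t;
#endif

typedef struct arch_spinlock {
        union {
                __ticketpair_t head_tail;
                struct __raw_tickets {
                        __ticket_t head, tail;
                } tickets;
        };
} arch_spinlock_t;

Every waiter does a locked operation on, and then keeps re-reading,
this one word, so the cache line ping-pongs between all the waiting
CPUs; queued locks such as MCS spin on per-waiter state instead.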

--
All rights reversed


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-27 Thread Linus Torvalds
On Wed, Feb 27, 2013 at 8:06 PM, Davidlohr Bueso <davidlohr.bu...@hp.com> wrote:

> The attached file shows how much sys time is used by the ipc lock
> on a 4 and an 8 socket box.

I have to say, even with the improvements, that looks pretty
disgusting. It really makes me wonder if that thing couldn't be done
better some way. Is it the SysV semaphores that this all ends up
using, or what?

That said, I think the IPC layer is just about the perfect candidate
for things like this, because I'm afraid that nobody is ever going to
fix it.

  Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-15 Thread Ingo Molnar

* Linus Torvalds  wrote:

> Btw, it may end up that almost nobody cares. Modern CPU's are 
> really good at handling the straightforward "save/restore to 
> stack" instructions. One of the reasons I care is not 
> performance per se, but the fact that I still look at asm 
> code every time I do any performance profiling, and it's 
> absolutely horrible to see the code when you see "ugh, that's 
> pointless". I'm sensitive to the spinlocks in particular, 
> because we've gone back and forth on inlining them before, so 
> I've seen this. But right now I don't think we inline the lock 
> under normal configs *anyway*, so I guess it doesn't much 
> matter.

Yes, right now we only inline non-debug spin_unlock() and 
spin_unlock_irq() [on !PREEMPT] - because that was an 
unconditional win.

Thanks,

Ingo


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Benjamin Herrenschmidt
On Wed, 2013-02-13 at 10:30 -0800, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
>  wrote:
> >
> > Adding an external function call is *horrible*, and you might almost
> > as well just uninline the spinlock entirely if you do this. It means
> > that all the small callers now have their registers trashed, whether
> > the unlikely function call is taken or not, and now leaf functions
> > aren't leaves any more.
> 
> Btw, we've had things like this before, and I wonder if we could
> perhaps introduce the notion of a "light-weight call" for fastpath
> code that calls unlikely slow-path code..
> 
> In particular, see the out-of-line code used by the rwlocks etc (see
> "arch_read_lock()" for an example in arch/x86/include/asm/spinlock.h
> and arch/x86/lib/rwlock.S), where we end up calling things from inline
> asm, with one big reason being exactly the fact that a "normal" C call
> has such horribly detrimental effects on the caller.

This would be nice. I've been wanting to do something like that for a
while in fact... On archs like powerpc, we lose 11 GPRs on a function
call, which ends up being a lot of stupid stack spills for cases that are
often corner cases (error cases etc... in inlines).

> Sadly, gcc doesn't seem to allow specifying which registers are
> clobbered any easy way, which means that both the caller and the
> callee *both* tend to need to have some asm interface. So we bothered
> to do this for __read_lock_failed, but we have *not* bothered to do
> the same for the otherwise very similar __mutex_fastpath_lock() case,
> for example.
>
> So for rwlocks, we actually get very nice code generation with small
> leaf functions not necessarily needing stack frames, but for mutexes
> we mark a lot of registers "unnecessarily" clobbered in the caller,
> exactly because we do *not* do that asm interface for the callee. So
> we have to clobber all the standard callee-clobbered registers, which
> is really sad, and callers almost always need a stack frame, because
> if they have any data live at all across the mutex, they have to save
> it in some register that is callee-saved - which basically means that
> the function has to have that stack frame in order to save its *own*
> callee-saved registers.
> 
> So it means that we penalize the fastpath because the slow-path can't
> be bothered to do the extra register saving, unless we go to the
> lengths we went to for the rwlocks, and build a wrapper in asm to save
> the extra registers in the cold path.
> 
> Maybe we could introduce some helpers to create these kinds of asm
> wrappers to do this? Something that would allow us to say: "this
> function only clobbers a minimal set of registers and you can call it
> from asm and only mark %rax/rcx/rdx clobbered" and that allows leaf
> functions to look like leaf functions for the fastpath?

We could do something like:

define_fastcall(func [.. figure out how to deal with args ... ])

Which spits out both a trampoline for saving the nasty stuff and calling
the real func() and a call_func() inline asm for the call site.
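
A rough x86-64 sketch of the generated pair (hypothetical names,
modelled loosely on the existing __read_lock_failed wrapper): the asm
trampoline preserves all the caller-saved registers around a perfectly
ordinary C slow path, so the inline-asm call site only has to expose
the argument register and a memory clobber to the compiler:

void my_slowpath(void *arg);            /* ordinary C function, normal ABI */

asm(".pushsection .text\n"
    ".globl my_slowpath_thunk\n"
    "my_slowpath_thunk:\n"
    /*
     * Save everything the C ABI lets my_slowpath() clobber; the odd
     * number of pushes keeps the stack alignment at the inner call
     * the same as at the original call site.
     */
    "        push %rdi\n"
    "        push %rsi\n"
    "        push %rdx\n"
    "        push %rcx\n"
    "        push %rax\n"
    "        push %r8\n"
    "        push %r9\n"
    "        push %r10\n"
    "        push %r11\n"
    "        call my_slowpath\n"         /* the argument is already in %rdi */
    "        pop %r11\n"
    "        pop %r10\n"
    "        pop %r9\n"
    "        pop %r8\n"
    "        pop %rax\n"
    "        pop %rcx\n"
    "        pop %rdx\n"
    "        pop %rsi\n"
    "        pop %rdi\n"
    "        ret\n"
    ".popsection");

static inline void call_my_slowpath(void *arg)
{
        /* All the compiler sees clobbered here is memory. */
        asm volatile("call my_slowpath_thunk"
                     : : "D" (arg) : "memory");
}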

At least on archs with register-passing conventions, especially if we
make it mandatory to stick to register args only and forbid stack spills
(ie, only a handful of args), it's fairly easy to do.

For stack based archs, it gets nastier as you have to dig out the args,
save stuff, and pile them again.

But since we also don't want to lose strong typing, we probably want to
express the args in that macro, maybe like we do for the syscall
defines. A bit ugly, but that would allow to have a strongly typed
call_func() *and* allow the trampoline to know what to do about the args
for stack based calling conventions.

About to go & travel so I don't have time to actually write something,
at least not for a couple of weeks though...

Ben.

> Hmm? That would make my dislike of uninlining the slow case largely go
> away. I still think that back-off tends to be a mistake (and is often
> horrible for virtualization etc), but as long as the fastpath stays
> close to optimal, I don't care *too* much.
> 
>  Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2013 at 2:50 AM, Ingo Molnar  wrote:
>
> At least on x86, how about saving *all* volatile registers in
> the slow out of line code path (to stack)?

Sure. The reason I suggested perhaps not saving %rax/%rdx is simply
that if it's a function that returns a value, %rax obviously cannot be
saved anyway. And then I was confused about the calling convention,
and said %rdx because it's the second argument register, but that's
only true on x86-32 (with our regparm calling convention). On x86-64,
it's %rdi/%rsi that have arguments.

Anyway, argument registers *can* be useful to save, but are often
computed, so it can go either way.

> It blows up the slow path somewhat, but it would allow us to
> keep the fast-path register impact even smaller - as the slow
> path would only have memory content side effects.
>
> Am I missing something?

The best option would be to just let the user say which registers it
wants saved for any particular function. We'd probably want some
default set so that people who call "unlikely()" functions can just
pick that one, but for architecture-specific helper functions for
locking etc, we would probably want to optimize it for usage (ie "is
the argument likely to be useful to the caller after the call").

Sometimes saving everything is the right thing to do, sometimes it
might just be extra work.

And even *if* the callee saves everything, the caller still ends up
being constrained by the calling convention (ie fixed registers for
arguments), so it could mess up code generation in the caller for that
reason. But at least it would be much less painful.

Btw, it may end up that almost nobody cares. Modern CPU's are really
good at handling the straightforward "save/restore to stack"
instructions. One of the reasons I care is not performance per se,
butu the fact that I still look at asm code every time I do any
performance profiling, and it's absolutely horrible to see the code
when you see "ugh, that's pointless". I'm sensitive to the spinlocks
in particular, because we've gone back and forth on inlining them
before, so I've seen this. But right now I don't think we inline the
lock under normal configs *anyway*, so I guess it doesn't much matter.

Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
>  wrote:
> >
> > Adding an external function call is *horrible*, and you 
> > might almost as well just uninline the spinlock entirely if 
> > you do this. It means that all the small callers now have 
> > their registers trashed, whether the unlikely function call 
> > is taken or not, and now leaf functions aren't leaves any 
> > more.
> 
> Btw, we've had things like this before, and I wonder if we 
> could perhaps introduce the notion of a "light-weight call" 
> for fastpath code that calls unlikely slow-path code..
> 
> In particular, see the out-of-line code used by the rwlocks 
> etc (see "arch_read_lock()" for an example in 
> arch/x86/include/asm/spinlock.h and arch/x86/lib/rwlock.S), 
> where we end up calling things from inline asm, with one big 
> reason being exactly the fact that a "normal" C call has such 
> horribly detrimental effects on the caller.
> 
> Sadly, gcc doesn't seem to allow specifying which registers 
> are clobbered any easy way, which means that both the caller 
> and the callee *both* tend to need to have some asm interface. 
> So we bothered to do this for __read_lock_failed, but we have 
> *not* bothered to do the same for the otherwise very similar 
> __mutex_fastpath_lock() case, for example.

At least on x86, how about saving *all* volatile registers in 
the slow out of line code path (to stack)?

That means we wouldn't have to do anything fancy with the called 
functions, and the caller would see minimal register impact. It 
would also be reasonably robust and straightforward assembly 
code.

It blows up the slow path somewhat, but it would allow us to 
keep the fast-path register impact even smaller - as the slow 
path would only have memory content side effects.

Am I missing something?

Thanks,

Ingo


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Ingo Molnar

* Linus Torvalds  wrote:

> > Eric got a 45% increase in network throughput, and I saw a 
> > factor 4x or so improvement with the semaphore test. I 
> > realize these are not "real workloads", and I will give you 
> > numbers with those once I have gathered some, on different 
> > systems.
> 
> Good. This is what I want to see.

I've put these commits aside meanwhile, until we have the 
numbers and decide one way or another.

Thanks,

Ingo


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

  Eric got a 45% increase in network throughput, and I saw a 
  factor 4x or so improvement with the semaphore test. I 
  realize these are not real workloads, and I will give you 
  numbers with those once I have gathered some, on different 
  systems.
 
 Good. This is what I want to see.

I've put these commits aside meanwhile, until we have the 
numbers and decide one way or another.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  Adding an external function call is *horrible*, and you 
  might almost as well just uninline the spinlock entirely if 
  you do this. It means that all the small callers now have 
  their registers trashed, whether the unlikely function call 
  is taken or not, and now leaf functions aren't leaves any 
  more.
 
 Btw, we've had things like this before, and I wonder if we 
 could perhaps introduce the notion of a light-weight call 
 for fastpath code that calls unlikely slow-path code..
 
 In particular, see the out-of-line code used by the rwlocks 
 etc (see arch_read_lock() for an example in 
 arch/x86/include/asm/spinlock.h and arch/x86/lib/rwlock.S), 
 where we end up calling things from inline asm, with one big 
 reason being exactly the fact that a normal C call has such 
 horribly detrimental effects on the caller.
 
 Sadly, gcc doesn't seem to allow specifying which registers 
 are clobbered any easy way, which means that both the caller 
 and the callee *both* tend to need to have some asm interface. 
 So we bothered to do this for __read_lock_failed, but we have 
 *not* bothered to do the same for the otherwise very similar 
 __mutex_fastpath_lock() case, for example.

At least on x86, how about saving *all* volatile registers in 
the slow out of line code path (to stack)?

That means we wouldn't have to do anything fancy with the called 
functions, and the caller would see minimal register impact. It 
would also be reasonably robust and straightforward assembly 
code.

It blows up the slow path somewhat, but it would allow us to 
keep the fast-path register impact even smaller - as the slow 
path would only have memory content side effects.

Am I missing something?

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Linus Torvalds
On Thu, Feb 14, 2013 at 2:50 AM, Ingo Molnar mi...@kernel.org wrote:

 At least on x86, how about saving *all* volatile registers in
 the slow out of line code path (to stack)?

Sure. The reason I suggested perhaps not saving %rax/%rdx is simply
that if it's a function that returns a value, %rax obviously cannot be
saved anyway. And then I was confused about the calling convention,
and said %rx because it's the second argument register, but that's
only true on x86-32 (with our regparm calling convention). On x86-64,
it's %rdi/%rsi that have arguments.

Anyway, argument registers *can* be useful to save, but are often
computed, so it can go either way.

 It blows up the slow path somewhat, but it would allow us to
 keep the fast-path register impact even smaller - as the slow
 path would only have memory content side effects.

 Am I missing something?

The best option would be to just let the user say which registers it
wants saved for any particular function. We'd probably want some
default set so that people who call unlikely() functions can just
pick that one, but for architecture-specific helper functions for
locking etc, we would probably want to optimize it for usage (ie is
the argument likely to be useful to the caller after the call).

Sometimes saving everything is the right thing to do, sometimes it
might just be extra work.

And even *if* the callee saves everything, the caller still ends up
being constrained by the calling convention (ie fixed registers for
arguments), so it could mess up code generation in the caller for that
reason. But at least it would be much less painful.

Btw, it may end up that almost nobody cares. Modern CPU's are really
good at handling the straightforward save/restore to stack
instructions. One of the reasons I care is not performance per se,
butu the fact that I still look at asm code every time I do any
performance profiling, and it's absolutely horrible to see the code
when you see ugh, that's pointless. I'm sensitive to the spinlocks
in particular, because we've gone back and forth on inlining them
before, so I've seen this. But right now I don't think we inline the
lock under normal configs *anyway*, so I guess it doesn't much matter.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-14 Thread Benjamin Herrenschmidt
On Wed, 2013-02-13 at 10:30 -0800, Linus Torvalds wrote:
 On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  Adding an external function call is *horrible*, and you might almost
  as well just uninline the spinlock entirely if you do this. It means
  that all the small callers now have their registers trashed, whether
  the unlikely function call is taken or not, and now leaf functions
  aren't leaves any more.
 
 Btw, we've had things like this before, and I wonder if we could
 perhaps introduce the notion of a light-weight call for fastpath
 code that calls unlikely slow-path code..
 
 In particular, see the out-of-line code used by the rwlocks etc (see
 arch_read_lock() for an example in arch/x86/include/asm/spinlock.h
 and arch/x86/lib/rwlock.S), where we end up calling things from inline
 asm, with one big reason being exactly the fact that a normal C call
 has such horribly detrimental effects on the caller.

This would be nice. I've been wanting to do something like that for a
while in fact... On archs like powerpc, we lose 11 GPRs on a function
call, that ends up being a lot of stupid stack spills for cases that are
often corner cases (error cases etc... in inlines).

 Sadly, gcc doesn't seem to allow specifying which registers are
 clobbered any easy way, which means that both the caller and the
 callee *both* tend to need to have some asm interface. So we bothered
 to do this for __read_lock_failed, but we have *not* bothered to do
 the same for the otherwise very similar __mutex_fastpath_lock() case,
 for example.

 So for rwlocks, we actually get very nice code generation with small
 leaf functions not necessarily needing stack frames, but for mutexes
 we mark a lot of registers unnecessarily clobbered in the caller,
 exactly because we do *not* do that asm interface for the callee. So
 we have to clobber all the standard callee-clobbered registers, which
 is really sad, and callers almost always need a stack frame, because
 if they have any data live at all across the mutex, they have to save
 it in some register that is callee-saved - which basically means that
 the function has to have that stack frame in order to save its *own*
 callee-saved registers.
 
 So it means that we penalize the fastpath because the slow-path can't
 be bothered to do the extra register saving, unless we go to the
 lengths we went to for the rwlocks, and build a wrapper in asm to save
 the extra registers in the cold path.
 
 Maybe we could introduce some helpers to create these kinds of asm
 wrappers to do this? Something that would allow us to say: this
 function only clobbers a minimal set of registers and you can call it
 from asm and only mark %rax/rcx/rdx clobbered and that allows leaf
 functions to look like leaf functions for the fastpath?

We could so something like:

define_fastcall(func [.. figure out how to deal with args ... ])

Which spits out both a trampoline for saving the nasty stuff and calling
the real func() and a call_func() inline asm for the call site.

At least on archs with register-passing conventions, especially if we
make it mandatory to stick to register args only and forbid stack spills
(ie, only a handful of args), it's fairly easy to do.
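
To make that concrete, here is a rough userspace sketch of the register-convention case on Linux x86-64. All names here (slowpath_c, slowpath_tramp, call_slowpath) are made up, and this is only an illustration of the trampoline idea, not the proposed kernel interface: the asm wrapper saves the caller-saved GPRs itself, so the inline-asm call site only has to declare %rax/%rcx/%rdx clobbered.

/*
 * Rough sketch only (userspace, Linux x86-64, gcc): a "light-weight call"
 * where an asm trampoline saves the caller-saved GPRs, so the call site
 * below only declares %rax/%rcx/%rdx (plus flags/memory) clobbered.
 * Build with:  gcc -O2 -mno-red-zone lwcall.c
 * (-mno-red-zone because the call inside the asm statement writes below
 * the caller's stack pointer; the kernel builds that way anyway.)
 * SSE/x87 state is ignored here, which the kernel gets away with because
 * normal kernel code does not use FP registers.
 */
#include <stdio.h>

void slowpath_c(unsigned long arg)	/* ordinary C slow path */
{
	printf("slow path, arg=%lu\n", arg);
}

asm(
"	.text\n"
"	.globl slowpath_tramp\n"
"slowpath_tramp:\n"
"	push %rbp\n"
"	mov  %rsp, %rbp\n"
"	push %rdi\n"		/* save the caller-saved GPRs the C ABI would trash */
"	push %rsi\n"
"	push %r8\n"
"	push %r9\n"
"	push %r10\n"
"	push %r11\n"
"	and  $-16, %rsp\n"	/* slowpath_c expects an aligned stack */
"	call slowpath_c\n"
"	lea  -48(%rbp), %rsp\n"	/* back to just after the pushes */
"	pop  %r11\n"
"	pop  %r10\n"
"	pop  %r9\n"
"	pop  %r8\n"
"	pop  %rsi\n"
"	pop  %rdi\n"
"	pop  %rbp\n"
"	ret\n"
);

static inline void call_slowpath(unsigned long arg)
{
	register unsigned long rdi asm("rdi") = arg;

	/* the "light-weight" call: only %rax/%rcx/%rdx marked clobbered */
	asm volatile("call slowpath_tramp"
		     : "+r" (rdi)
		     :
		     : "rax", "rcx", "rdx", "memory", "cc");
}

int main(void)
{
	call_slowpath(42);	/* other live values can stay in registers */
	return 0;
}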

For stack based archs, it gets nastier as you have to dig out the args,
save stuff, and pile them again.

But since we also don't want to lose strong typing, we probably want to
express the args in that macro, maybe like we do for the syscall
defines. A bit ugly, but that would allow to have a strongly typed
call_func() *and* allow the trampoline to know what to do about the args
for stack based calling conventions.

About to go travel so I don't have time to actually write something,
at least not for a couple of weeks though...

Ben.

> Hmm? That would make my dislike of uninlining the slow case largely go
> away. I still think that back-off tends to be a mistake (and is often
> horrible for virtualization etc), but as long as the fastpath stays
> close to optimal, I don't care *too* much.
>
>  Linus




Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread H. Peter Anvin
On 02/13/2013 05:31 PM, Linus Torvalds wrote:
> On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin  wrote:
>>
>> It does for the callee, but only on a whole-file basis.  It would be a
>> lot nicer if we could do it with function attributes.
> 
> A way to just set the callee-clobbered list on a per-function basis
> would be lovely. Gcc has limited support for this on some
> architectures, where you can specify "save every register for this
> function" in order to do things like interrupt handlers etc without
> even resorting to asm. But there is no generic (or even just x86)
> support for anything like it :-(
> 
> There are other calling-convention attributes that make me suspect gcc
> could easily do this (it already supports per-function ABI
> specification, so presumably it already has some concept of
> callee-saved registers being different for different attributes), but
> from my reading you currently have to generate asm wrappers by hand
> (and call them by hand with inline asm) if you want to do something
> like this.
> 

I just filed a gcc bugzilla on this:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56314
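
For comparison, x86 gcc did eventually grow a per-function attribute along these lines (no_caller_saved_registers, around GCC 7, well after this thread); a minimal sketch with made-up function names, just to show the shape of such an annotation:

/*
 * Sketch of the sort of per-function annotation being asked for.  The
 * attribute shown is the one x86 gcc later gained; it did not exist when
 * this thread was written, and the gcc docs suggest building such
 * functions with -mgeneral-regs-only so no SSE/x87 state is involved.
 */
static long slow_hits;

__attribute__((noinline, no_caller_saved_registers))
static void slow_path(void)
{
	/* cold path: every register it touches is saved and restored */
	slow_hits++;
}

void fast_path(int contended)
{
	/* caller side: no registers need to be treated as trashed here */
	if (__builtin_expect(contended, 0))
		slow_path();
}

int main(void)
{
	fast_path(1);
	fast_path(0);
	return slow_hits == 1 ? 0 : 1;
}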

-hpa


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 5:21 PM, Linus Torvalds
 wrote:
>
> Now, on other machines you get the call chain even with pebs because
> you can get the whole

Oops, that got cut short early, because I started looking up when PEBS
and the last-branch-buffer work together, and couldn't find it, and
then came back to the email and forgot to finish the sentence.

Anyway: sometimes you get call chains with precise events, sometimes
you don't. And it turns out that I can't find out which cores do both,
and which ones don't.

Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin  wrote:
>
> It does for the callee, but only on a whole-file basis.  It would be a
> lot nicer if we could do it with function attributes.

A way to just set the callee-clobbered list on a per-function basis
would be lovely. Gcc has limited support for this on some
architectures, where you can specify "save every register for this
function" in order to do things like interrupt handlers etc without
even resorting to asm. But there is no generic (or even just x86)
support for anything like it :-(

There are other calling-convention attributes that make me suspect gcc
could easily do this (it already supports per-function ABI
specification, so presumably it already has some concept of
callee-saved registers being different for different attributes), but
from my reading you currently have to generate asm wrappers by hand
(and call them by hand with inline asm) if you want to do something
like this.

And the straightforward "just wrap it in asm" approach sadly then
causes double "call" instructions etc. So that slows down the slow
case unnecessarily much ;(

 Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel  wrote:
>
> I have an example of the second case. It is a test case
> from a customer issue, where an application is contending on
> semaphores, doing semaphore lock and unlock operations. The
> test case simply has N threads, trying to lock and unlock the
> same semaphore.
>
> The attached graph (which I sent out with the intro email to
> my patches) shows how reducing the memory accesses from the
> spinlock wait path prevents the large performance degradation
> seen with the vanilla kernel. This is on a 24 CPU system with
> 4 6-core AMD CPUs.
>
> The "prop-N" series are with a fixed delay proportional back-off.
> You can see that a small value of N does not help much for large
> numbers of cpus, and a large value hurts with a small number of
> CPUs. The automatic tuning appears to be quite robust.

Ok, good, so there are some numbers. I didn't see any in the commit
messages anywhere, and since the threads I've looked at are from
tip-bot, I never saw the intro email.

That said, it's interesting that this happens with the semaphore path.
We've had other cases where the spinlock in the *sleeping* locks have
caused problems, and I wonder if we should look at that path in
particular.

> If we have only a few CPUs contending on the lock, the delays
> will be short.

Yes. I'm more worried about the overhead, especially on I$ (and to a
lesser degree on D$ when loading hashed delay values etc). I don't
believe it would ever loop very long, it's the other overhead I'd be
worried about.

From looking at profiles of the kernel loads I've cared about (ie
largely VFS code), the I$ footprint seems to be a big deal, and
function entry (and the instruction *after* a call instruction)
actually tend to be hotspots. Which is why I care about things like
function prologues for leaf functions etc.

> Furthermore, the CPU at the head of the queue
> will run the old spinlock code with just cpu_relax() and checking
> the lock each iteration.

That's not AT ALL TRUE.

Look at the code you wrote. It does all the spinlock delay etc crap
unconditionally. Only the loop itself is conditional.

IOW, exactly all the overhead that I worry about. The function call,
the pointless turning of leaf functions into non-leaf functions, the
loading (and storing) of delay information etc etc.

The non-leaf-function thing is done even if you never hit the
slow-path, and affects the non-contention path. And the delay
information thing is done even if there is only one waiter on the
spinlock.

Did I miss anything?

> Eric got a 45% increase in network throughput, and I saw a factor 4x
> or so improvement with the semaphore test. I realize these are not
> "real workloads", and I will give you numbers with those once I have
> gathered some, on different systems.

Good. This is what I want to see.

> Are there significant cases where "perf -g" is not easily available,
> or harmful to tracking down the performance issue?

Yes. There are lots of machines where you cannot get call chain
information with CPU event buffers (pebs). And without the CPU event
buffers, you cannot get good profile data at all.

Now, on other machines you get the call chain even with pebs because
you can get the whole

> The cause of that was identified (with pause loop exiting, the  host
> effectively does the back-off for us), and the problem is avoided
> by limiting the maximum back-off value to something small on
> virtual guests.

And what if the hardware does something equivalent even when not
virtualized (ie power optimizations I already mentioned)?  That whole
maximum back-off limit seems to be just for known virtualization
issues. This is the kind of thing that makes me worry..

   Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread H. Peter Anvin
On 02/13/2013 10:30 AM, Linus Torvalds wrote:
> 
> Sadly, gcc doesn't seem to allow specifying which registers are
> clobbered any easy way, which means that both the caller and the
> callee *both* tend to need to have some asm interface. So we bothered
> to do this for __read_lock_failed, but we have *not* bothered to do
> the same for the otherwise very similar __mutex_fastpath_lock() case,
> for example.
> 

It does for the callee, but only on a whole-file basis.  It would be a
lot nicer if we could do it with function attributes.

-hpa


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 05:40 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel  wrote:


What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)


Yes.

To all three.


I will run a bunch of tests to demonstrate there are no
regressions with this code. You'll have to wait a few days
before I have a good set of numbers for that.

I have an example of the second case. It is a test case
from a customer issue, where an application is contending on
semaphores, doing semaphore lock and unlock operations. The
test case simply has N threads, trying to lock and unlock the
same semaphore.
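
Roughly, the shape of such a test (illustration only, with made-up thread and iteration counts; this is not the actual customer test case):

/*
 * N threads doing nothing but lock/unlock (semop -1/+1) on one SysV
 * semaphore.  Build with:  cc -O2 -pthread sem-stress.c
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/types.h>

#define NTHREADS	24
#define ITERS		100000

/* glibc leaves this union for the caller to define */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

static int semid;

static void *worker(void *arg)
{
	struct sembuf lock   = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
	struct sembuf unlock = { .sem_num = 0, .sem_op =  1, .sem_flg = 0 };

	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		semop(semid, &lock, 1);		/* take the semaphore */
		semop(semid, &unlock, 1);	/* and release it again */
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	union semun arg = { .val = 1 };

	semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
	if (semid < 0 || semctl(semid, 0, SETVAL, arg) < 0) {
		perror("sem setup");
		return 1;
	}

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	semctl(semid, 0, IPC_RMID);
	return 0;
}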

The attached graph (which I sent out with the intro email to
my patches) shows how reducing the memory accesses from the
spinlock wait path prevents the large performance degradation
seen with the vanilla kernel. This is on a 24 CPU system with
4 6-core AMD CPUs.

The "prop-N" series are with a fixed delay proportional back-off.
You can see that a small value of N does not help much for large
numbers of cpus, and a large value hurts with a small number of
CPUs. The automatic tuning appears to be quite robust.


If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.


Again. Exactly my point.

So you're potentially making things worse for just about everybody, in
order to fix a problem that almost nobody can actually see. And
apparently you don't even know the problem.


If we have only a few CPUs contending on the lock, the delays
will be short. Furthermore, the CPU at the head of the queue
will run the old spinlock code with just cpu_relax() and checking
the lock each iteration.


This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.


.. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.

It may make things WORSE. On all the things you haven't run to check
that it does better.

You just stated something that is not at all necessarily true. You
changed the spinlock behavior, and then you blindly state that it will
"fix" things on loads you haven't even tested.


I did not claim it will fix things.  I claim that it helps reduce
the excessive cross-cpu memory accesses (from the spinlock acquisition
path) that can cause catastrophic performance degradation.

This happens to be what I, Michel and Eric have observed.

Eric got a 45% increase in network throughput, and I saw a factor 4x
or so improvement with the semaphore test. I realize these are not
"real workloads", and I will give you numbers with those once I have
gathered some, on different systems.


For all we know, it goes exactly the other way around, and makes some
unknown load perform much much worse, because it turns something that
didn't *use* to be contended into a contended case, because the
spinlock is slower on some piece of hardware (or under
virtualization).

We already saw one virtual load regress quite noticeably, didn't we?


The cause of that was identified (with pause loop exiting, the  host
effectively does the back-off for us), and the problem is avoided
by limiting the maximum back-off value to something small on
virtual guests.


And yet you go on to say that it will fix performance problems THAT WE
DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
behavior! What f*cking planet are you from, again?

Christ, that's hubris.


I realize you do not have time to read all the email discussions
around these patches, but please do not assume that everybody else
are drooling idiots who are unable to log into their computers in
the morning without their guide dogs typing their passwords for them.

I will try to get you some performance test numbers on various kinds
of hardware over the next few days (probably early next week, depending
on hardware availability in the lab), running some mix of benchmarks
and workloads.


Besides, out-of-line spinlock loops are *horrible* for performance
monitoring. One of the *best* things about inlining the spinlock
looping code is that it's so very obvious where a hot lock is used. It
shows up as a shining beacon in profiles, without any need for
callchains etc. So not only don't we know what loads it would improve,
but those unknown loads would also be much harder to figure out.


Are there significant cases where "perf -g" is not easily available,
or harmful to tracking down the performance issue?

Even with inlined spinlocks, you have 

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel  wrote:
>
> What kind of numbers would you like?
>
> Numbers showing that the common case is not affected by this
> code?
>
> Or numbers showing that performance of something is improved
> with this code?
>
> Of course, the latter would point out a scalability issue,
> that may be fixable by changing the data structures and code
> involved :)

Yes.

To all three.

> If we have just a few CPUs bouncing the cache line around, we
> have no real performance issues and the loop with cpu_relax
> seems to be doing fine.
>
> Real issues arise when we have a larger number of CPUs contending
> on the same lock. At that point we know we are no longer bouncing
> between sibling cores.

Again. Exactly my point.

So you're potentially making things worse for just about everybody, in
order to fix a problem that almost nobody can actually see. And
apparently you don't even know the problem.

> This patch series is not as much for the spinlocks we know about
> (we can probably fix those), but to prevent catastrophic
> performance degradation when users run into contention on spinlocks
> we do NOT know about.

.. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.

It may make things WORSE. On all the things you haven't run to check
that it does better.

You just stated something that is not at all necessarily true. You
changed the spinlock behavior, and then you blindly state that it will
"fix" things on loads you haven't even tested.

That's pure bullshit.

For all we know, it goes exactly the other way around, and makes some
unknown load perform much much worse, because it turns something that
didn't *use* to be contended into a contended case, because the
spinlock is slower on some piece of hardware (or under
virtualization).

We already saw one virtual load regress quite noticeably, didn't we?

And yet you go on to say that it will fix performance problems THAT WE
DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
behavior! What f*cking planet are you from, again?

Christ, that's hubris.

Besides, out-of-line spinlock loops are *horrible* for performance
monitoring. One of the *best* things about inlining the spinlock
looping code is that it's so very obvious where a hot lock is used. It
shows up as a shining beacon in profiles, without any need for
callchains etc. So not only don't we know what loads it would improve,
but those unknown loads would also be much harder to figure out.

(And yes, this is a real problem - look at every time we have a
problem with scaling sleeping locks, and how all the time just ends up
being random scheduler time, with the actual lock being very hard to
see. Spinlocks are *easy* partly because they can be inlined)

Seriously. I'm going to NAK these kinds of changes unless you can
actually show real improvement on real loads. No more of this "in
theory this will fix all our problems" crap.

 Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 02:36 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel  wrote:

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.


I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays are often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.


What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)


And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.


If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.


As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?


Not without tons of numbers from many different platforms, it doesn't.


That makes sense, especially if you are looking to make sure these
patches do not introduce regressions.


And not without explaining which spinlock it is that is so contended
in the first place.


This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.


So I claim:

  - it's *really* hard to trigger in real loads on common hardware.


This is definitely true.  However, with many millions of Linux
users out there, there will always be users who encounter a
performance problem, upgrade their hardware, and then run into
an even worse performance problem because they run into a
scalability issue.

The object of these patches is to go from:

"We doubled the number of CPUs, and now things go even slower.", to

"We doubled the number of CPUs, but things aren't going any faster."

The first is a real disaster for Linux users. Especially when the
workload consists of multiple things, some of which run faster on
the upgraded hardware, and others which run slower. What makes it
worse is that these kinds of issues always seem to happen on the
users' most expensive servers, running their most important
workloads...

The second case is something users can temporarily tolerate, while
we fix the underlying issue.


  - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.


Agreed.


  - we absolutely should *not* pessimize the common case for this


Agreed. Are there any preferred benchmarks you would absolutely like
to see, to show that these patches do not introduce regressions?


Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because "back-off" is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off?


Whether or not we go into back-off at all depends on a number
of factors, including how many waiters there are ahead of us
waiting for the same ticket spinlock, and whether we have spun
a lot of time on this lock in the past.
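
To make the mechanism concrete, a toy userspace ticket lock whose wait loop delays in proportion to the number of waiters ahead (tail minus head) might look like the sketch below. The constant factor and everything else about the tuning here is made up, and the actual series tunes the delay automatically rather than using a fixed factor; this is only meant to show what "back off proportionally to the queue depth" means.

/* Toy proportional-backoff ticket lock.  Build: gcc -O2 -pthread ticket.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
	_Atomic uint16_t head;	/* ticket currently being served (owner) */
	_Atomic uint16_t tail;	/* next ticket to hand out */
} ticket_lock_t;

static inline void cpu_relax(void)
{
	__builtin_ia32_pause();	/* x86 "pause"; substitute on other archs */
}

static void ticket_lock(ticket_lock_t *lock)
{
	uint16_t me = atomic_fetch_add_explicit(&lock->tail, 1,
						memory_order_relaxed);
	uint16_t head = atomic_load_explicit(&lock->head,
					     memory_order_acquire);

	while (head != me) {
		/* back off in proportion to the waiters queued ahead of us */
		unsigned int delay = 50u * (uint16_t)(me - head);

		while (delay--)
			cpu_relax();
		head = atomic_load_explicit(&lock->head,
					    memory_order_acquire);
	}
}

static void ticket_unlock(ticket_lock_t *lock)
{
	/* only the owner advances head, so a plain atomic add is enough */
	atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}

/* tiny smoke test: 8 threads bump a counter under the lock */
static ticket_lock_t biglock;
static long counter;

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		ticket_lock(&biglock);
		counter++;
		ticket_unlock(&biglock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[8];

	for (int i = 0; i < 8; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(tid[i], NULL);

	printf("counter = %ld (want %d)\n", counter, 8 * 100000);
	return 0;
}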


Btw, the "notice busy loops and turn it into mwait" is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just the value loaded from memory suddenly means that hardware
has a lot harder time doing things like that.



I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself,


I looked at mwait, but the documentation suggests it only ever
works at cache line granularity (or larger). That means every
time another CPU adds itself to the queue for the ticket lock,
or every time the lock holder writes to something else in the
cache line the lock is in, mwait'ing cpus get woken up.

It looks like mwait is 

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel  wrote:
>
> The spinlock backoff code prevents these last cases from
> experiencing large performance regressions when the hardware
> is upgraded.

I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays are often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.

And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.

> As a car analogy, think of this not as an accelerator, but
> as an airbag. Spinlock backoff (or other scalable locking
> code) exists to keep things from going horribly wrong when
> we hit a scalability wall.
>
> Does that make more sense?

Not without tons of numbers from many different platforms, it doesn't.
And not without explaining which spinlock it is that is so contended
in the first place.

We've been very good at fixing spinlock contention. Now, that does
mean that what is likely left isn't exactly low-hanging fruit, but it
also means that the circumstances where it triggers are probably quite
uncommon.

So I claim:

 - it's *really* hard to trigger in real loads on common hardware.

 - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.

 - we absolutely should *not* pessimize the common case for this

So I suspect contention is something that you *may* need on some
particular platforms ("Oh, I have 64 sockets and 1024 threads, I can
trigger contention easily"), but that tends to be unusual, and any
back-off code should be damn aware of the fact that it only helps the
0.01%.

Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because "back-off" is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off? How do you account for hardware
that notices busy loops and turns them into effectively just mwait?

Btw, the "notice busy loops and turn it into mwait" is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just the value loaded from memory suddenly means that hardware
has a lot harder time doing things like that.

And "notice busy loops and turn it into mwait" is actually a big deal
for power use of a CPU. Back-off with busy-looping timing waits can be
an absolutely *horrible* thing for power use.  So we have bigger
issues than just performance, there's complex CPU power behavior too.
Being "smart" can often be really really hard.

I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself, but the patch I did see looked like
it might be absolutely horrible. How long does a "cpu_relax()" wait?
Do you know? How does "cpu_relax()" interface with the rest of the
CPU? Do you know? Because I've heard noises about cpu_relax() actually
affecting the memory pipe behavior of cache accesses of the CPU, and
thus the "cpu_relax()" in a busy loop that does *not* look at the
value (your "backoff delay loop") may actually work very very
differently from the cpu_relax() in the actual "wait for the value to
change" loop.

And how does that account for two different microarchitectures doing
totally different things? Maybe one uarch makes cpu_relax() just shut
down the front-end for a while, while another does something much
fancier and gives hints to in-flight memory accesses etc?

When do you start doing mwait vs just busy-looping with cpu_relax? How
do you tune it to do the right thing for different architectures?

So I think this is complex. At many different levels. And it's *all*
about the details. No handwaving about how "back-off is like a air
bag". Because the big picture is entirely and totally irrelevant, when
the details are almost all that actually matter.

   Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 11:20 AM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 4:06 AM, tip-bot for Rik van Riel
 wrote:


x86/smp: Move waiting on contended ticket lock out of line

Moving the wait loop for congested loops to its own function
allows us to add things to that wait loop, without growing the
size of the kernel text appreciably.


Did anybody actually look at the code generation of this?


Good catch.

This looks like something that may be fixable, though I
do not know whether it actually matters. Adding an unlikely
to the if condition where we call the contention path does
seem to clean up the code a little bit...
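
A toy example of what the unlikely() hint buys (wait_for_lock and lock_fastpath are made-up stand-ins, not the kernel patch itself; compare the generated code with and without the hint using gcc -O2 -S):

#include <stdio.h>

#define unlikely(x)	__builtin_expect(!!(x), 0)

__attribute__((noinline)) static void wait_for_lock(void)
{
	puts("contended: spinning out of line");
}

void lock_fastpath(int head, int tail)
{
	/* analogous to the if () that calls ticket_spin_lock_wait() */
	if (unlikely(head != tail))
		wait_for_lock();
	/* ... critical section entered ... */
}

int main(void)
{
	lock_fastpath(0, 0);	/* uncontended: straight-line fast path */
	lock_fastpath(0, 1);	/* contended: takes the cold branch */
	return 0;
}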


This is apparently for the auto-tuning, which came with absolutely no
performance numbers (except for the *regressions* it caused), and
which is something we have actively *avoided* in the past, because
back-off is a f*cking idiotic thing, and the only real fix for
contended spinlocks is to try to avoid the contention and fix the
caller to do something smarter to begin with.

In other words, the whole f*cking thing looks incredibly broken. At
least give some good explanations for why crap like this is needed,
instead of just implementing backoff without even numbers for real
loads. And no, don't bother to give numbers for pointless benchmarks.
It's easy to get contention on a benchmark, but spinlock backoff is
only remotely interesting on real loads.


Lock contention falls into two categories. One is contention
on resources that are used inside the kernel, which may be
fixable by changing the data and the code.

The second is lock contention driven by external factors,
like userspace processes all trying to access the same file,
or grab the same semaphore. Not all of these cases may be
fixable on the kernel side.

A further complication is that these kinds of performance
issues often get discovered on production systems, which
are stuck on a particular kernel and cannot introduce
drastic changes.

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.

None of the scalable locking systems magically make things
scale. All they do is prevent catastrophic performance drops
when moving from N to N+x CPUs, allowing user systems to
continue working while kernel developers address the actual
underlying scalability issues.

As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?

--
All rights reversed


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
 wrote:
>
> Adding an external function call is *horrible*, and you might almost
> as well just uninline the spinlock entirely if you do this. It means
> that all the small callers now have their registers trashed, whether
> the unlikely function call is taken or not, and now leaf functions
> aren't leaves any more.

Btw, we've had things like this before, and I wonder if we could
perhaps introduce the notion of a "light-weight call" for fastpath
code that calls unlikely slow-path code..

In particular, see the out-of-line code used by the rwlocks etc (see
"arch_read_lock()" for an example in arch/x86/include/asm/spinlock.h
and arch/x86/lib/rwlock.S), where we end up calling things from inline
asm, with one big reason being exactly the fact that a "normal" C call
has such horribly detrimental effects on the caller.

Sadly, gcc doesn't seem to allow specifying which registers are
clobbered any easy way, which means that both the caller and the
callee *both* tend to need to have some asm interface. So we bothered
to do this for __read_lock_failed, but we have *not* bothered to do
the same for the otherwise very similar __mutex_fastpath_lock() case,
for example.

So for rwlocks, we actually get very nice code generation with small
leaf functions not necessarily needing stack frames, but for mutexes
we mark a lot of registers "unnecessarily" clobbered in the caller,
exactly because we do *not* do that asm interface for the callee. So
we have to clobber all the standard callee-clobbered registers, which
is really sad, and callers almost always need a stack frame, because
if they have any data live at all across the mutex, they have to save
it in some register that is callee-saved - which basically means that
the function has to have that stack frame in order to save its *own*
callee-saved registers.

So it means that we penalize the fastpath because the slow-path can't
be bothered to do the extra register saving, unless we go to the
lengths we went to for the rwlocks, and build a wrapper in asm to save
the extra registers in the cold path.

Maybe we could introduce some helpers to create these kinds of asm
wrappers to do this? Something that would allow us to say: "this
function only clobbers a minimal set of registers and you can call it
from asm and only mark %rax/rcx/rdx clobbered" and that allows leaf
functions to look like leaf functions for the fastpath?

Hmm? That would make my dislike of uninlining the slow case largely go
away. I still think that back-off tends to be a mistake (and is often
horrible for virtualization etc), but as long as the fastpath stays
close to optimal, I don't care *too* much.

 Linus


[tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread tip-bot for Rik van Riel
Commit-ID:  4aef331850b637169ff036ed231f0d236874f310
Gitweb: http://git.kernel.org/tip/4aef331850b637169ff036ed231f0d236874f310
Author: Rik van Riel 
AuthorDate: Wed, 6 Feb 2013 15:04:03 -0500
Committer:  Ingo Molnar 
CommitDate: Wed, 13 Feb 2013 09:06:28 +0100

x86/smp: Move waiting on contended ticket lock out of line

Moving the wait loop for congested loops to its own function
allows us to add things to that wait loop, without growing the
size of the kernel text appreciably.

Signed-off-by: Rik van Riel 
Reviewed-by: Steven Rostedt 
Reviewed-by: Michel Lespinasse 
Reviewed-by: Rafael Aquini 
Cc: eric.duma...@gmail.com
Cc: lwood...@redhat.com
Cc: kn...@redhat.com
Cc: chegu_vi...@hp.com
Cc: raghavendra...@linux.vnet.ibm.com
Cc: Linus Torvalds 
Cc: Andrew Morton 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20130206150403.006e5...@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/spinlock.h | 11 +--
 arch/x86/kernel/smp.c   | 14 ++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 
 	inc = xadd(&lock->tickets, inc);
 
-	for (;;) {
-		if (inc.head == inc.tail)
-			break;
-		cpu_relax();
-		inc.head = ACCESS_ONCE(lock->tickets.head);
-	}
+	if (inc.head != inc.tail)
+		ticket_spin_lock_wait(lock, inc);
+
 	barrier();		/* make sure nothing creeps before the lock is taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+	for (;;) {
+		cpu_relax();
+		inc.head = ACCESS_ONCE(lock->tickets.head);
+
+		if (inc.head == inc.tail)
+			break;
+	}
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...


[tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread tip-bot for Rik van Riel
Commit-ID:  4aef331850b637169ff036ed231f0d236874f310
Gitweb: http://git.kernel.org/tip/4aef331850b637169ff036ed231f0d236874f310
Author: Rik van Riel r...@redhat.com
AuthorDate: Wed, 6 Feb 2013 15:04:03 -0500
Committer:  Ingo Molnar mi...@kernel.org
CommitDate: Wed, 13 Feb 2013 09:06:28 +0100

x86/smp: Move waiting on contended ticket lock out of line

Moving the wait loop for congested loops to its own function
allows us to add things to that wait loop, without growing the
size of the kernel text appreciably.

Signed-off-by: Rik van Riel r...@redhat.com
Reviewed-by: Steven Rostedt rost...@goodmiss.org
Reviewed-by: Michel Lespinasse wal...@google.com
Reviewed-by: Rafael Aquini aqu...@redhat.com
Cc: eric.duma...@gmail.com
Cc: lwood...@redhat.com
Cc: kn...@redhat.com
Cc: chegu_vi...@hp.com
Cc: raghavendra...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Andrew Morton a...@linux-foundation.org
Cc: Peter Zijlstra a.p.zijls...@chello.nl
Cc: Thomas Gleixner t...@linutronix.de
Link: http://lkml.kernel.org/r/20130206150403.006e5...@cuia.bos.redhat.com
Signed-off-by: Ingo Molnar mi...@kernel.org
---
 arch/x86/include/asm/spinlock.h | 11 +--
 arch/x86/kernel/smp.c   | 14 ++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..dc492f6 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,8 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+extern void ticket_spin_lock_wait(arch_spinlock_t *, struct __raw_tickets);
+
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -53,12 +55,9 @@ static __always_inline void 
__ticket_spin_lock(arch_spinlock_t *lock)
 
inc = xadd(lock-tickets, inc);
 
-   for (;;) {
-   if (inc.head == inc.tail)
-   break;
-   cpu_relax();
-   inc.head = ACCESS_ONCE(lock-tickets.head);
-   }
+   if (inc.head != inc.tail)
+   ticket_spin_lock_wait(lock, inc);
+
barrier();  /* make sure nothing creeps before the lock is 
taken */
 }
 
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..20da354 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -113,6 +113,20 @@ static atomic_t stopping_cpu = ATOMIC_INIT(-1);
 static bool smp_no_nmi_ipi = false;
 
 /*
+ * Wait on a congested ticket spinlock.
+ */
+void ticket_spin_lock_wait(arch_spinlock_t *lock, struct __raw_tickets inc)
+{
+   for (;;) {
+   cpu_relax();
+   inc.head = ACCESS_ONCE(lock-tickets.head);
+
+   if (inc.head == inc.tail)
+   break;
+   }
+}
+
+/*
  * this function sends a 'reschedule' IPI to another CPU.
  * it goes straight through and wastes no time serializing
  * anything. Worst case is that we lose a reschedule ...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 8:20 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Adding an external function call is *horrible*, and you might almost
 as well just uninline the spinlock entirely if you do this. It means
 that all the small callers now have their registers trashed, whether
 the unlikely function call is taken or not, and now leaf functions
 aren't leaves any more.

Btw, we've had things like this before, and I wonder if we could
perhaps introduce the notion of a light-weight call for fastpath
code that calls unlikely slow-path code..

In particular, see the out-of-line code used by the rwlocks etc (see
arch_read_lock() for an example in arch/x86/include/asm/spinlock.h
and arch/x86/lib/rwlock.S), where we end up calling things from inline
asm, with one big reason being exactly the fact that a normal C call
has such horribly detrimental effects on the caller.

Sadly, gcc doesn't seem to allow specifying which registers are
clobbered any easy way, which means that both the caller and the
callee *both* tend to need to have some asm interface. So we bothered
to do this for __read_lock_failed, but we have *not* bothered to do
the same for the otherwise very similar __mutex_fastpath_lock() case,
for example.

So for rwlocks, we actually get very nice code generation with small
leaf functions not necessarily needing stack frames, but for mutexes
we mark a lot of registers unnecessarily clobbered in the caller,
exactly because we do *not* do that asm interface for the callee. So
we have to clobber all the standard callee-clobbered registers, which
is really sad, and callers almost always need a stack frame, because
if they have any data live at all across the mutex, they have to save
it in some register that is callee-saved - which basically means that
the function has to have that stack frame in order to save its *own*
callee-saved registers.

So it means that we penalize the fastpath because the slow-path can't
be bothered to do the extra register saving, unless we go to the
lengths we went to for the rwlocks, and build a wrapper in asm to save
the extra registers in the cold path.

Maybe we could introduce some helpers to create these kinds of asm
wrappers to do this? Something that would allow us to say: this
function only clobbers a minimal set of registers and you can call it
from asm and only mark %rax/rcx/rdx clobbered and that allows leaf
functions to look like leaf functions for the fastpath?

Hmm? That would make my dislike of uninlining the slow case largely go
away. I still think that back-off tends to be a mistake (and is often
horrible for virtualization etc), but as long as the fastpath stays
close to optimal, I don't care *too* much.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 11:20 AM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 4:06 AM, tip-bot for Rik van Riel
r...@redhat.com wrote:


x86/smp: Move waiting on contended ticket lock out of line

Moving the wait loop for congested loops to its own function
allows us to add things to that wait loop, without growing the
size of the kernel text appreciably.


Did anybody actually look at the code generation of this?


Good catch.

This looks like something that may be fixable, though I
do not know whether it actually matters. Adding an unlikely
to the if condition where we call the contention path does
seem to clean up the code a little bit...


This is apparently for the auto-tuning, which came with absolutely no
performance numbers (except for the *regressions* it caused), and
which is something we have actively *avoided* in the past, because
back-off is a f*cking idiotic thing, and the only real fix for
contended spinlocks is to try to avoid the contention and fix the
caller to do something smarter to begin with.

In other words, the whole f*cking thing looks incredibly broken. At
least give some good explanations for why crap like this is needed,
instead of just implementing backoff without even numbers for real
loads. And no, don't bother to give numbers for pointless benchmarks.
It's easy to get contention on a benchmark, but spinlock backoff is
only remotely interesting on real loads.


Lock contention falls into two categories. One is contention
on resources that are used inside the kernel, which may be
fixable by changing the data and the code.

The second is lock contention driven by external factors,
like userspace processes all trying to access the same file,
or grab the same semaphore. Not all of these cases may be
fixable on the kernel side.

A further complication is that these kinds of performance
issues often get discovered on production systems, which
are stuck on a particular kernel and cannot introduce
drastic changes.

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.

None of the scalable locking systems magically make things
scale. All they do is prevent catastrophic performance drops
when moving from N to N+x CPUs, allowing user systems to
continue working while kernel developers address the actual
underlying scalability issues.

As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?

--
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel r...@redhat.com wrote:

 The spinlock backoff code prevents these last cases from
 experiencing large performance regressions when the hardware
 is upgraded.

I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays are often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.

And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.

 As a car analogy, think of this not as an accelerator, but
 as an airbag. Spinlock backoff (or other scalable locking
 code) exists to keep things from going horribly wrong when
 we hit a scalability wall.

 Does that make more sense?

Not without tons of numbers from many different platforms, it doesn't.
And not without explaining which spinlock it is that is so contended
in the first place.

We've been very good at fixing spinlock contention. Now, that does
mean that what is likely left isn't exactly low-hanging fruit, but it
also means that the circumstances where it triggers are probably quite
uncommon.

So I claim:

 - it's *really* hard to trigger in real loads on common hardware.

 - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.

 - we absolutely should *not* pessimize the common case for this

So I suspect contention is something that you *may* need on some
particular platforms (Oh, I have 64 sockets adn 1024 threads, I can
trigger contention easily), but that tends to be unusual, and any
back-off code should be damn aware of the fact that it only helps the
0.01%.

Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because back-off is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off? How do you account for hardware
that notices busy loops and turns them into effectively just mwait?

Btw, the notice busy loops and turn it into mwait is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just he value loaded from memory suddently means that hardware
has a lot harder time doing things like that.

And notice busy loops and turn it into mwait is actually a big deal
for power use of a CPU. Back-off with busy-looping timing waits can be
an absolutely *horrible* thing for power use.  So we have bigger
issues than just performance, there's complex CPU power behavior too.
Being smart can often be really really hard.

I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself, but the patch I did see looked like
it might be absolutely horrible. How long does a cpu_relax() wait?
Do you know? How does cpu_relax() interface with the rest of the
CPU? Do you know? Because I've heard noises about cpu_relax() actually
affecting the memory pipe behavior of cache accesses of the CPU, and
thus the cpu_relax() in a busy loop that does *not* look at the
value (your backoff delay loop) may actually work very very
differently from the cpu_relax() in the actual wait for the value to
change loop.

And how does that account for two different microarchitectures doing
totally different things? Maybe one uarch makes cpu_relax() just shut
down the front-end for a while, while another does something much
fancier and gives hints to in-flight memory accesses etc?

When do you start doing mwait vs just busy-looping with cpu_relax? How
do you tune it to do the right thing for different architectures?

So I think this is complex. At many different levels. And it's *all*
about the details. No handwaving about how back-off is like a air
bag. Because the big picture is entirely and totally irrelevant, when
the details are almost all that actually matter.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 02:36 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 11:08 AM, Rik van Riel r...@redhat.com wrote:

The spinlock backoff code prevents these last cases from
experiencing large performance regressions when the hardware
is upgraded.


I still want *numbers*.

There are real cases where backoff does exactly the reverse, and makes
things much much worse. The tuning of the backoff delays are often
*very* hardware sensitive, and upgrading hardware can turn out to do
exactly what you say - but for the backoff, not the regular spinning
code.


What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)


And we have hardware that actually autodetects some cacheline bouncing
patterns and may actually do a better job than software. It's *hard*
for software to know whether it's bouncing within the L1 cache between
threads, or across fabric in a large machine.


If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.


As a car analogy, think of this not as an accelerator, but
as an airbag. Spinlock backoff (or other scalable locking
code) exists to keep things from going horribly wrong when
we hit a scalability wall.

Does that make more sense?


Not without tons of numbers from many different platforms, it doesn't.


That makes sense, especially if you are looking to make sure these
patches do not introduce regressions.


And not without explaining which spinlock it is that is so contended
in the first place.


This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.


So I claim:

  - it's *really* hard to trigger in real loads on common hardware.


This is definitely true.  However, with many millions of Linux
users out there, there will always be users who encounter a
performance problem, upgrade their hardware, and then run into
an even worse performance problem because they run into a
scalability issue.

The object of these patches is to go from:

We doubled the number of CPUs, and now things go even slower., to

We doubled the number of CPUs, but things aren't going any faster.

The first is a real disaster for Linux users. Especially when the
workload consists of multiple things, some of which run faster on
the upgraded hardware, and others which run slower. What makes it
worse is that these kinds of issues always seem to happen on the
users' most expensive servers, running their most important
workloads...

The second case is something users can temporarily tolerate, while
we fix the underlying issue.


  - if it does trigger in any half-way reasonably common setup
(hardware/software), we most likely should work really hard at fixing
the underlying problem, not the symptoms.


Agreed.


  - we absolutely should *not* pessimize the common case for this


Agreed. Are there any preferred benchmarks you would absolutely like
to see, to show that these patches do not introduce regressions?


Hurting the 99.99% even a tiny amount should be something we should
never ever do. This is why I think the fast case is so important (and
I had another email about possibly making it acceptable), but I think
the *slow* case should be looked at a lot too. Because back-off is
absolutely *not* necessarily hugely better than plain spinning, and it
needs numbers. How many times do you spin before even looking at
back-off? How much do you back off?


Whether or not we go into back-off at all depends on a number
of factors, including how many waiters there are ahead of us
waiting for the same ticket spinlock, and whether we have spun
a lot of time on this lock in the past.


Btw, the notice busy loops and turn it into mwait is not some
theoretical magic thing. And it's exactly the kind of thing that
back-off *breaks* by making the loop too complex for hardware to
understand. Even just adding counters with conditionals that are *not*
about just he value loaded from memory suddently means that hardware
has a lot harder time doing things like that.



I don't know if you perhaps had some future plans of looking at using
mwait in the backoff code itself,


I looked at mwait, but the documentation suggests it only ever
works at cache line granularity (or larger). That means every
time another CPU adds itself to the queue for the ticket lock,
or every time the lock holder writes to something else in the
cache line the lock is in, mwait'ing cpus get woken up.
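
To make the concern concrete, an mwait-based waiter would look roughly
like the sketch below (assuming the kernel's __monitor()/__mwait()
helpers; illustrative only, not proposed code).  MONITOR arms the whole
cache line, so any store that lands in that line ends the wait:

static void mwait_ticket_wait(arch_spinlock_t *lock, __ticket_t my_ticket)
{
	while (ACCESS_ONCE(lock->tickets.head) != my_ticket) {
		/* Arm the monitor on the lock's cache line. */
		__monitor(&lock->tickets.head, 0, 0);

		/* Re-check after arming, to close the race with unlock. */
		if (ACCESS_ONCE(lock->tickets.head) == my_ticket)
			break;

		/*
		 * Wait for a write to the monitored line.  A new waiter's
		 * xadd on lock->tickets.tail, or the lock holder touching
		 * any other data sharing the line, wakes us just as surely
		 * as the head update we actually care about.
		 */
		__mwait(0, 0);
	}
}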

It looks like 

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel r...@redhat.com wrote:

 What kind of numbers would you like?

 Numbers showing that the common case is not affected by this
 code?

 Or numbers showing that performance of something is improved
 with this code?

 Of course, the latter would point out a scalability issue,
 that may be fixable by changing the data structures and code
 involved :)

Yes.

To all three.

 If we have just a few CPUs bouncing the cache line around, we
 have no real performance issues and the loop with cpu_relax
 seems to be doing fine.

 Real issues arise when we have a larger number of CPUs contending
 on the same lock. At that point we know we are no longer bouncing
 between sibling cores.

Again. Exactly my point.

So you're potentially making things worse for just about everybody, in
order to fix a problem that almost nobody can actually see. And
apparently you don't even know the problem.

 This patch series is not as much for the spinlocks we know about
 (we can probably fix those), but to prevent catastrophic
 performance degradation when users run into contention on spinlocks
 we do NOT know about.

.. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.

It may make things WORSE. On all the things you haven't run to check
that it does better.

You just stated something that is not at all necessarily true. You
changed the spinlock behavior, and then you blindly state that it will
fix things on loads you haven't even tested.

That's pure bullshit.

For all we know, it goes exactly the other way around, and makes some
unknown load perform much much worse, because it turns something that
didn't *use* to be contended into a contended case, because the
spinlock is slower on some piece of hardware (or under
virtualization).

We already saw one virtual load regress quite noticeably, didn't we?

And yet you go on to say that it will fix performance problems THAT WE
DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
behavior! What f*cking planet are you from, again?

Christ, that's hubris.

Besides, out-of-line spinlock loops are *horrible* for performance
monitoring. One of the *best* things about inlining the spinlock
looping code is that it's so very obvious where a hot lock is used. It
shows up as a shining beacon in profiles, without any need for
callchains etc. So not only don't we know what loads it would improve,
but those unknown loads would also be much harder to figure out.

(And yes, this is a real problem - look at every time we have a
problem with scaling sleeping locks, and how all the time just ends up
being random scheduler time, with the actual lock being very hard to
see. Spinlocks are *easy* partly because they can be inlined)

Seriously. I'm going to NAK these kinds of changes unless you can
actually show real improvement on real loads. No more of this "in
theory this will fix all our problems" crap.

 Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Rik van Riel

On 02/13/2013 05:40 PM, Linus Torvalds wrote:

On Wed, Feb 13, 2013 at 2:21 PM, Rik van Riel r...@redhat.com wrote:


What kind of numbers would you like?

Numbers showing that the common case is not affected by this
code?

Or numbers showing that performance of something is improved
with this code?

Of course, the latter would point out a scalability issue,
that may be fixable by changing the data structures and code
involved :)


Yes.

To all three.


I will run a bunch of tests to demonstrate there are no
regressions with this code. You'll have to wait a few days
before I have a good set of numbers for that.

I have an example of the second case. It is a test case
from a customer issue, where an application is contending on
semaphores, doing semaphore lock and unlock operations. The
test case simply has N threads, trying to lock and unlock the
same semaphore.

The attached graph (which I sent out with the intro email to
my patches) shows how reducing the memory accesses from the
spinlock wait path prevents the large performance degradation
seen with the vanilla kernel. This is on a 24 CPU system with
4 6-core AMD CPUs.

The prop-N series are with a fixed delay proportional back-off.
You can see that a small value of N does not help much for large
numbers of cpus, and a large value hurts with a small number of
CPUs. The automatic tuning appears to be quite robust.


If we have just a few CPUs bouncing the cache line around, we
have no real performance issues and the loop with cpu_relax
seems to be doing fine.

Real issues arise when we have a larger number of CPUs contending
on the same lock. At that point we know we are no longer bouncing
between sibling cores.


Again. Exactly my point.

So you're potentially making things worse for just about everybody, in
order to fix a problem that almost nobody can actually see. And
apparently you don't even know the problem.


If we have only a few CPUs contending on the lock, the delays
will be short. Furthermore, the CPU at the head of the queue
will run the old spinlock code with just cpu_relax() and checking
the lock each iteration.
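
For reference, the contended path being described here is essentially
the classic x86 ticket-lock loop, sketched from memory (not quoted
from the tree):

static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
{
	register struct __raw_tickets inc = { .tail = 1 };

	inc = xadd(&lock->tickets, inc);	/* grab our ticket number */

	while (inc.head != inc.tail) {		/* not our turn yet */
		cpu_relax();
		inc.head = ACCESS_ONCE(lock->tickets.head);
	}
	barrier();	/* don't let the critical section leak upward */
}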


This patch series is not as much for the spinlocks we know about
(we can probably fix those), but to prevent catastrophic
performance degradation when users run into contention on spinlocks
we do NOT know about.


.. and as I already explained, THAT IS PURE AND UTTER BULLSHIT.

It may make things WORSE. On all the things you haven't run to check
that it does better.

You just stated something that is not at all necessarily true. You
changed the spinlock behavior, and then you blindly state that it will
fix things on loads you haven't even tested.


I did not claim it will fix things.  I claim that it helps reduce
the excessive cross-cpu memory accesses (from the spinlock acquisition
path) that can cause catastrophic performance degradation.

This happens to be what I, Michel and Eric have observed.

Eric got a 45% increase in network throughput, and I saw a factor 4x
or so improvement with the semaphore test. I realize these are not
real workloads, and I will give you numbers with those once I have
gathered some, on different systems.


For all we know, it goes exactly the other way around, and makes some
unknown load perform much much worse, because it turns something that
didn't *use* to be contended into a contended case, because the
spinlock is slower on some piece of hardware (or under
virtualization).

We already saw one virtual load regress quite noticeably, didn't we?


The cause of that was identified (with pause loop exiting, the host
effectively does the back-off for us), and the problem is avoided
by limiting the maximum back-off value to something small on
virtual guests.
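
The mitigation amounts to something like this (illustrative names only,
reusing the made-up constants from the earlier sketch):

#define MAX_SPINLOCK_DELAY_GUEST	16	/* made-up small value */

static unsigned int __read_mostly max_spinlock_delay = MAX_SPINLOCK_DELAY;

static int __init init_spinlock_delay(void)
{
	/*
	 * Running as a guest: pause-loop exiting in the host already
	 * throttles the spin loop, so keep our own back-off small.
	 */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;

	return 0;
}
early_initcall(init_spinlock_delay);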


And yet you go on to say that it will fix performance problems THAT WE
DON'T EVEN KNOW ABOUT! After seeing *proof* to the exact contrary
behavior! What f*cking planet are you from, again?

Christ, that's hubris.


I realize you do not have time to read all the email discussions
around these patches, but please do not assume that everybody else
is a drooling idiot, unable to log into their computer in
the morning without a guide dog typing their password for them.

I will try to get you some performance test numbers on various kinds
of hardware over the next few days (probably early next week, depending
on hardware availability in the lab), running some mix of benchmarks
and workloads.


Besides, out-of-line spinlock loops are *horrible* for performance
monitoring. One of the *best* things about inlining the spinlock
looping code is that it's so very obvious where a hot lock is used. It
shows up as a shining beacon in profiles, without any need for
callchains etc. So not only don't we know what loads it would improve,
but those unknown loads would also be much harder to figure out.


Are there significant cases where perf -g is not easily available,
or harmful to tracking down the performance issue?

Even with inlined spinlocks, 

Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread H. Peter Anvin
On 02/13/2013 10:30 AM, Linus Torvalds wrote:
 
 Sadly, gcc doesn't seem to allow specifying which registers are
 clobbered in any easy way, which means that both the caller and the
 callee *both* tend to need to have some asm interface. So we bothered
 to do this for __read_lock_failed, but we have *not* bothered to do
 the same for the otherwise very similar __mutex_fastpath_lock() case,
 for example.
 

It does for the callee, but only on a whole-file basis.  It would be a
lot nicer if we could do it with function attributes.
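
As far as I can tell, the whole-file mechanism here is gcc's
-fcall-saved-REG (and -ffixed-REG) family of options: put the slow
path alone in its own translation unit and build that file so that
everything in it preserves the usually call-clobbered registers.  A
sketch, with made-up file and function names; the call site still
needs a matching inline-asm wrapper, since plain C callers assume the
normal ABI:

/*
 * Built with, e.g. (kbuild fragment, illustrative register choice):
 *   CFLAGS_spinlock_slowpath.o += -fcall-saved-r8 -fcall-saved-r9 \
 *                                 -fcall-saved-r10 -fcall-saved-r11
 */
void __ticket_spin_lock_slowpath(arch_spinlock_t *lock, __ticket_t ticket)
{
	while (ACCESS_ONCE(lock->tickets.head) != ticket)
		cpu_relax();
}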

-hpa




Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 3:41 PM, Rik van Riel r...@redhat.com wrote:

 I have an example of the second case. It is a test case
 from a customer issue, where an application is contending on
 semaphores, doing semaphore lock and unlock operations. The
 test case simply has N threads, trying to lock and unlock the
 same semaphore.

 The attached graph (which I sent out with the intro email to
 my patches) shows how reducing the memory accesses from the
 spinlock wait path prevents the large performance degradation
 seen with the vanilla kernel. This is on a 24 CPU system with
 4 6-core AMD CPUs.

 The prop-N series are with a fixed delay proportional back-off.
 You can see that a small value of N does not help much for large
 numbers of cpus, and a large value hurts with a small number of
 CPUs. The automatic tuning appears to be quite robust.

Ok, good, so there are some numbers. I didn't see any in the commit
messages anywhere, and since the threads I've looked at are from
tip-bot, I never saw the intro email.

That said, it's interesting that this happens with the semaphore path.
We've had other cases where the spinlock in the *sleeping* locks have
caused problems, and I wonder if we should look at that path in
particular.

 If we have only a few CPUs contending on the lock, the delays
 will be short.

Yes. I'm more worried about the overhead, especially on I$ (and to a
lesser degree on D$ when loading hashed delay values etc). I don't
believe it would ever loop very long, it's the other overhead I'd be
worried about.

From looking at profiles of the kernel loads I've cared about (ie
largely VFS code), the I$ footprint seems to be a big deal, and
function entry (and the instruction *after* a call instruction)
actually tend to be hotspots. Which is why I care about things like
function prologues for leaf functions etc.

 Furthermore, the CPU at the head of the queue
 will run the old spinlock code with just cpu_relax() and checking
 the lock each iteration.

That's not AT ALL TRUE.

Look at the code you wrote. It does all the spinlock delay etc crap
unconditionally. Only the loop itself is conditional.

IOW, exactly all the overhead that I worry about. The function call,
the pointless turning of leaf functions into non-leaf functions, the
loading (and storing) of delay information etc etc.

The non-leaf-function thing is done even if you never hit the
slow-path, and affects the non-contention path. And the delay
information thing is done even if there is only one waiter on the
spinlock.

Did I miss anything?

 Eric got a 45% increase in network throughput, and I saw a factor 4x
 or so improvement with the semaphore test. I realize these are not
 real workloads, and I will give you numbers with those once I have
 gathered some, on different systems.

Good. This is what I want to see.

 Are there significant cases where perf -g is not easily available,
 or harmful to tracking down the performance issue?

Yes. There are lots of machines where you cannot get call chain
information with CPU event buffers (pebs). And without the CPU event
buffers, you cannot get good profile data at all.

Now, on other machines you get the call chain even with pebs because
you can get the whole

 The cause of that was identified (with pause loop exiting, the  host
 effectively does the back-off for us), and the problem is avoided
 by limiting the maximum back-off value to something small on
 virtual guests.

And what if the hardware does something equivalent even when not
virtualized (ie power optimizations I already mentioned)?  That whole
maximum back-off limit seems to be just for known virtualization
issues. This is the kind of thing that makes me worry..

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin h...@zytor.com wrote:

 It does for the callee, but only on a whole-file basis.  It would be a
 lot nicer if we could do it with function attributes.

A way to just set the callee-clobbered list on a per-function basis
would be lovely. Gcc has limited support for this on some
architectures, where you can specify "save every register for this
function" in order to do things like interrupt handlers etc without
even resorting to asm. But there is no generic (or even just x86)
support for anything like it :-(

There are other calling-convention attributes that make me suspect gcc
could easily do this (it already supports per-function ABI
specification, so presumably it already has some concept of
callee-saved registers being different for different attributes), but
from my reading you currently have to generate asm wrappers by hand
(and call them by hand with inline asm) if you want to do something
like this.

And the straightforward "just wrap it in asm" approach sadly then
causes double call instructions etc. So that slows down the slow
case unnecessarily much ;(
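
The hand-rolled version being described looks roughly like this
(kernel context assumed, so no red zone; all names invented for the
sketch).  The asm stub is what introduces the second call:

/* Real slow path, normal C ABI. */
void __ticket_lock_contended_c(arch_spinlock_t *lock);

/*
 * Hand-written stub: preserve every normally call-clobbered register
 * around a second call into the C function (this is the "double call").
 */
asm(".pushsection .text\n"
    ".globl __ticket_lock_contended\n"
    "__ticket_lock_contended:\n\t"
    "push %rax; push %rcx; push %rdx; push %rsi; push %rdi\n\t"
    "push %r8;  push %r9;  push %r10; push %r11\n\t"
    "call __ticket_lock_contended_c\n\t"
    "pop  %r11; pop  %r10; pop  %r9;  pop  %r8\n\t"
    "pop  %rdi; pop  %rsi; pop  %rdx; pop  %rcx; pop %rax\n\t"
    "ret\n"
    ".popsection");

static inline void ticket_lock_contended(arch_spinlock_t *lock)
{
	/*
	 * As far as gcc knows, only %rdi (the argument), memory and the
	 * flags are touched, so the caller keeps its registers live
	 * across the call.
	 */
	asm volatile("call __ticket_lock_contended"
		     : : "D" (lock) : "memory", "cc");
}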

 Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread Linus Torvalds
On Wed, Feb 13, 2013 at 5:21 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Now, on other machines you get the call chain even with pebs because
 you can get the whole

Oops, that got cut short early, because I started looking up when PEBS
and the last-branch-buffer work together, and couldn't find it, and
then came back to the email and forgot to finish the sentence.

Anyway: sometimes you get call chains with precise events, sometimes
you don't. And it turns out that I can't find out which cores do both,
and which ones don't.

Linus


Re: [tip:core/locking] x86/smp: Move waiting on contended ticket lock out of line

2013-02-13 Thread H. Peter Anvin
On 02/13/2013 05:31 PM, Linus Torvalds wrote:
 On Wed, Feb 13, 2013 at 4:54 PM, H. Peter Anvin h...@zytor.com wrote:

 It does for the callee, but only on a whole-file basis.  It would be a
 lot nicer if we could do it with function attributes.
 
 A way to just set the callee-clobbered list on a per-function basis
 would be lovely. Gcc has limited support for this on some
 architectures, where you can specify "save every register for this
 function" in order to do things like interrupt handlers etc without
 even resorting to asm. But there is no generic (or even just x86)
 support for anything like it :-(
 
 There are other calling-convention attributes that make me suspect gcc
 could easily do this (it already supports per-function ABI
 specification, so presumably it already has some concept of
 callee-saved registers being different for different attributes), but
 from my reading you currently have to generate asm wrappers by hand
 (and call them by hand with inline asm) if you want to do something
 like this.
 

I just filed a gcc bugzilla on this:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=56314

-hpa
