Here is a diff relative to OpenBSD 5.3 that avoids taking a lock in
pthread_getspecific in the common case, where storage for the key is
already allocated.

I have only tested this patch for my use case, where a single key is
created once and used many times.  In that case, the bottleneck I
observed is gone, and my parallel program sees linear speedup.

This patch would be unsafe if elements were ever removed from the
per-thread list of allocated storage.  In the 5.3 implementation the
list can only grow.  Its size is bounded by PTHREAD_KEYS_MAX, so I
do not consider this a leak.  Possibly it should be an array instead
of a list.

Index: rthread_tls.c
===================================================================
RCS file: /cvs/src/lib/librthread/rthread_tls.c,v
retrieving revision 1.13
diff -c -r1.13 rthread_tls.c
*** rthread_tls.c	6 Nov 2011 11:48:59 -0000	1.13
--- rthread_tls.c	30 Sep 2013 14:27:13 -0000
***************
*** 96,102 ****
  	struct rthread_storage *rs;
  	pthread_t self;
  
! 	_spinlock(&rkeyslock);
  	if (!rkeys[key].used) {
  		rs = NULL;
  		goto out;
--- 96,107 ----
  	struct rthread_storage *rs;
  	pthread_t self;
  
! 	/* No lock is needed if
! 	   (1) the key is not valid (undefined behavior)
! 	   (2) storage is already allocated for the key (the linked
! 	       list is only ever modified by adding to the head)
! 	*/
! 
  	if (!rkeys[key].used) {
  		rs = NULL;
  		goto out;
***************
*** 109,125 ****
  			break;
  	}
  	if (!rs) {
! 		rs = calloc(1, sizeof(*rs));
! 		if (!rs)
! 			goto out;
! 		rs->keyid = key;
! 		rs->data = NULL;
! 		rs->next = self->local_storage;
! 		self->local_storage = rs;
  	}
  
  out:
- 	_spinunlock(&rkeyslock);
  	return (rs);
  }
  
--- 114,138 ----
  			break;
  	}
  	if (!rs) {
! 		_spinlock(&rkeyslock);
! 		/* Repeat search with lock held */
! 		for (rs = self->local_storage; rs; rs = rs->next) {
! 			if (rs->keyid == key)
! 				break;
! 		}
! 		if (!rs) {
! 			rs = calloc(1, sizeof(*rs));
! 			if (rs) {
! 				rs->keyid = key;
! 				rs->data = NULL;
! 				rs->next = self->local_storage;
! 				self->local_storage = rs;
! 			}
! 		}
! 		_spinunlock(&rkeyslock);
  	}
  
  out:
  	return (rs);
  }
  
> 
> Problem in a nutshell: pthread_getspecific serializes execution of
> threads using it.  Without a better implementation my program doesn't
> work on OpenBSD.
> 
> I am trying to port Cilk+ to OpenBSD (5.3).
> 
> Cilk+ is a multithreaded extension to C/C++.  It adds some bookkeeping
> operations in the prologue of some functions.  In many Cilk programs
> most function calls do this bookkeeping (dynamic count, not static).
> 
> The per-call bookkeeping calls pthread_getspecific.  pthread_getspecific
> takes out a spinlock.  The lock is apparently needed in case of a race
> with pthread_key_delete.  This is unlikely to happen, but I suppose it
> is possible.  Every function call in this multithreaded program
> serializes waiting on the lock.  Also, the cache line with the lock is
> constantly moving between processors.
> 
> This is worse than useless for Cilk.  You're much better off with a
> single-threaded program.
> 
> An older version of Cilk used a thread local storage class (__thread).
> If memory serves, the switch to pthread_getspecific was driven by a few
> considerations:
> 
> 1. Thread local variables don't get along well with shared libraries.
> 
> 2. Thread local variables are less portable.  OpenBSD doesn't really
> support them, for example. They are emulated with pthread_getspecific.
> 
> 3. On Linux/x86 pthread_getspecific is very fast, essentially a move
> instruction with a segment override.
> 
> It seems to me the implementation of pthread_getspecific doesn't need to
> be as slow as it is.
> 
> It ought to be possible to have multiple readers be always nonblocking
> as long as the key list doesn't change, and possibly even if it does
> change.  pthread_getspecific only needs a read lock rather than a mutex.
> The rwlock in librthread starts with a spinlock, so it's not the answer.
> 
> Any thoughts?
> 
