Re: [lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue

2012-12-17 Thread Mathieu Desnoyers
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
 On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
  I noticed that in addition to having:
  
  - push/enqueue returning whether the stack/queue was empty prior to the
    operation,
  - pop_all/splice, by nature, emptying the stack/queue,
  
  it can be interesting to make pop/dequeue operations return whether they
  are returning the last element of the stack/queue (therefore emptying
  it). This allows extending the test-cases covering the number of empty
  stack/queue states encountered by both push/enqueuer and pop/dequeuer
  threads not only to push/enqueue paired with pop_all/splice, but also
  to pop/dequeue.
  
  In the case of wfstack, this unfortunately requires modifying an already
  exposed API. As an RFC, one question we should answer is how we want to
  handle the way forward: should we add new functions to the wfstack API
  and leave the existing ones alone?
  
  Thoughts?
 
 Hmmm...  What is the use case, given that a push might happen immediately
 after the pop said that the stack/queue was empty?  Of course, if we
 somehow know that there are no concurrent pushes, we could instead
 check for empty.
 
 So what am I missing here?

The setup for those use-cases is the following (I'm using the stack as
an example, but the same applies to the queue):

- we have N threads doing push, using the push return value that
  states whether they pushed into an empty stack.
- we have M threads doing pop, using the return value to know whether
  they popped the last element, bringing the stack into its empty
  state. Following the locking requirements, we protect those M
  threads' pops with a mutex, but they don't need to be protected
  against push (see the sketch below).
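
To make this more concrete, here is a minimal sketch of that setup
(using the wfstack API; the extra "last" argument to
__cds_wfs_pop_blocking() is the proposed extension, not the current
signature, and the handle_*() functions are hypothetical placeholders):

#include <pthread.h>
#include <stdbool.h>
#include <urcu/wfstack.h>

static struct cds_wfs_stack s;
static pthread_mutex_t pop_mutex = PTHREAD_MUTEX_INITIALIZER;

extern void handle_push_to_empty_stack(void);
extern void handle_pop_of_last_element(struct cds_wfs_node *node);

/* One of the N push threads. */
static void *pusher(void *arg)
{
    struct cds_wfs_node *node = arg;

    cds_wfs_node_init(node);
    /* cds_wfs_push() returns 0 if the stack was empty before the push. */
    if (cds_wfs_push(&s, node) == 0)
        handle_push_to_empty_stack();
    return NULL;
}

/* One of the M pop threads: pops are serialized by a mutex, but need
 * no protection against concurrent pushes. */
static void *popper(void *arg)
{
    struct cds_wfs_node *node;
    bool last = false;

    pthread_mutex_lock(&pop_mutex);
    node = __cds_wfs_pop_blocking(&s, &last);    /* proposed signature */
    pthread_mutex_unlock(&pop_mutex);
    if (node && last)
        handle_pop_of_last_element(node);
    return NULL;
}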

Just to help understand where the idea comes from, let's start with
another, similar use-case (push/pop_all). Knowing whether we pushed
into an empty stack, combined with pop_all, becomes very useful when
you want to attach higher-level batching semantics to the elements
present within the stack.

In the case of grace period batching, for instance, I used
push/pop_all to provide this kind of semantics: if we push into an
empty stack, we know we will have to go through the grace period. If we
push into a non-empty stack, we just wait to be awakened by the first
thread, the one that pushed into the empty stack. This requires that we
use pop_all before going through the grace period.
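
A sketch of that scheme (do_grace_period(), wake_all() and
wait_to_be_awakened() are hypothetical helpers; the push return value
and the assumed cds_wfs_pop_all_blocking() are the wfstack API):

#include <urcu/wfstack.h>

static struct cds_wfs_stack waiters;

extern void do_grace_period(void);
extern void wake_all(struct cds_wfs_head *batch);
extern void wait_to_be_awakened(struct cds_wfs_node *me);

static void batched_synchronize(struct cds_wfs_node *me)
{
    cds_wfs_node_init(me);
    if (cds_wfs_push(&waiters, me) == 0) {
        /* Pushed into an empty stack: we lead this batch. */
        struct cds_wfs_head *batch;

        /* Take ownership of all queued waiters first, so that
         * waiters arriving during the grace period form the next
         * batch... */
        batch = cds_wfs_pop_all_blocking(&waiters);
        /* ...then go through a single grace period on their
         * behalf, and wake them. */
        do_grace_period();
        wake_all(batch);
    } else {
        /* Pushed into a non-empty stack: the batch leader wakes
         * us after the grace period. */
        wait_to_be_awakened(me);
    }
}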

Now more specifically about pop, one use-case I have in mind is
energy-efficient handling of empty stacks. With M threads executing
pop, let's suppose we want them to be blocked on a futex when there is
nothing to do. Now the tricky part is: how can we do this without adding
overhead (extra loads/stores) to the stack?

If we have the ability to know whether we are popping the last element
of a stack, we can use this information to go into a futex wait state
after having handled the last element. Since the threads doing push
would monitor whether they push into an empty stack, they would wake us
whenever needed.
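
A sketch of the futex protocol this enables (assuming the futex_async()
helper from urcu/futex.h and the proposed "last" output parameter; the
race between announcing sleep and a concurrent push still has to be
handled by the full protocol):

#include <limits.h>
#include <stdint.h>
#include <linux/futex.h>
#include <urcu/uatomic.h>
#include <urcu/futex.h>
#include <urcu/wfstack.h>

static struct cds_wfs_stack s;
static int32_t futex_val;    /* 0: active, -1: poppers sleeping */

/* Pop side: called once pop reported taking the last element. Note
 * that we never re-read the stack head here. */
static void popper_sleep(void)
{
    uatomic_set(&futex_val, -1);
    /* A complete protocol must recheck for a push that slipped in
     * between the store above and the wait (omitted here). */
    futex_async(&futex_val, FUTEX_WAIT, -1, NULL, NULL, 0);
}

/* Push side: wake sleeping poppers only when pushing into an empty
 * stack, so the common case adds no futex traffic. */
static void push_and_wake(struct cds_wfs_node *node)
{
    cds_wfs_node_init(node);
    if (cds_wfs_push(&s, node) == 0 &&
        uatomic_cmpxchg(&futex_val, -1, 0) == -1)
        futex_async(&futex_val, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}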

If instead we choose to simply wait until one of the M threads discovers
that the stack is actually empty, we are issuing an extra pop (which
fails) each time the stack becomes empty. In the worst case, if the
stack always flips between 0 and 1 elements, we double the number of
pops needed to handle the same number of nodes.

Otherwise, if we choose to add an explicit check to see whether the
stack is empty, we are adding an extra load of the head node for every
pop.
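
For comparison, the explicit-check variant would look like this;
cds_wfs_empty() dereferences the stack head, which is precisely the
extra load mentioned above (helper functions hypothetical):

static void popper_iteration_with_check(void)
{
    struct cds_wfs_node *node;

    /* Extra load of the stack head on every iteration: */
    if (cds_wfs_empty(&s)) {
        go_to_futex_wait_state();
        return;
    }
    node = cds_wfs_pop_blocking(&s);
    if (node)
        handle_node(node);
}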

Another use-case I see is low-overhead monitoring of stack usage
efficiency. For this kind of use-case, we might want to know, both
within push and pop threads, whether we are underutilizing our system
resources. Knowing that we are reaching the empty state, without adding
any overhead to stack memory traffic, gives us exactly that.

I must admit that the use-case for returning whether pop takes the last
element is not as strong as the batching case with push/pop_all, mainly
because AFAIU, we can achieve the same result by doing an extra check of
the stack emptiness state (either by an explicit empty() check, or by
issuing an extra pop that will see an empty stack). What we are saving
here is the extra overhead on stack cache-lines caused by this extra
check.

Another use-case, although maybe less compelling, is for validation.
With concurrent threads doing push/pop/pop_all operations on the stack,
we can perform the following check: If we empty the stack at the end of
test execution, the

  number of push-to-empty-stack

  must be equal to the

  number of pop_all-from-non-empty-stack
   + number of pop-last-element-from-non-empty-stack

We should note that this validation could not be performed if pop did
not return whether it popped the last stack element (checked
atomically with the pop operation). This is a use-case where adding an
extra check on the pop side would not work: the check needs to be
performed atomically with the pop.
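
In test code, assuming per-thread counters summed after all threads
have joined and the stack has been fully drained (counter names
hypothetical), the check boils down to:

#include <assert.h>

/* After all threads have joined and the stack is fully drained: */
assert(tot_push_to_empty_stack ==
       tot_pop_all_from_non_empty_stack + tot_pop_last_element);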

And maybe there are other use-cases ...

Re: [lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue

2012-12-14 Thread Paul E. McKenney
On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
 I noticed that in addition to having:
 
 - push/enqueue returning whether the stack/queue was empty prior to the
   operation,
 - pop_all/splice, by nature, emptying the stack/queue,
 
 it can be interesting to make pop/dequeue operations return whether they
 are returning the last element of the stack/queue (therefore emptying
 it). This allows extending the test-cases covering the number of empty
 stack/queue states encountered by both push/enqueuer and pop/dequeuer
 threads not only to push/enqueue paired with pop_all/splice, but also
 to pop/dequeue.
 
 In the case of wfstack, this unfortunately requires modifying an already
 exposed API. As an RFC, one question we should answer is how we want to
 handle the way forward: should we add new functions to the wfstack API
 and leave the existing ones alone?
 
 Thoughts?

Hmmm...  What is the use case, given that a push might happen immediately
after the pop said that the stack/queue was empty?  Of course, if we
somehow know that there are no concurrent pushes, we could instead
check for empty.

So what am I missing here?

Thanx, Paul

 Thanks,
 
 Mathieu
 
 ---
 diff --git a/tests/test_urcu_wfcq.c b/tests/test_urcu_wfcq.c
 index 91285a5..de9566d 100644
 --- a/tests/test_urcu_wfcq.c
 +++ b/tests/test_urcu_wfcq.c
 @@ -168,6 +168,7 @@ static DEFINE_URCU_TLS(unsigned long long, nr_successful_dequeues);
  static DEFINE_URCU_TLS(unsigned long long, nr_successful_enqueues);
  static DEFINE_URCU_TLS(unsigned long long, nr_empty_dest_enqueues);
  static DEFINE_URCU_TLS(unsigned long long, nr_splice);
 +static DEFINE_URCU_TLS(unsigned long long, nr_dequeue_last);
 
  static unsigned int nr_enqueuers;
  static unsigned int nr_dequeuers;
 @@ -228,11 +229,15 @@ fail:
  static void do_test_dequeue(enum test_sync sync)
  {
   struct cds_wfcq_node *node;
 + bool last;
 
   if (sync == TEST_SYNC_MUTEX)
 - node = cds_wfcq_dequeue_blocking(&head, &tail);
 + node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
   else
 - node = __cds_wfcq_dequeue_blocking(&head, &tail);
 + node = __cds_wfcq_dequeue_blocking(&head, &tail, &last);
 +
 + if (last)
 + URCU_TLS(nr_dequeue_last)++;
 
   if (node) {
   free(node);
 @@ -263,6 +268,7 @@ static void do_test_splice(enum test_sync sync)
   break;
   case CDS_WFCQ_RET_DEST_EMPTY:
   URCU_TLS(nr_splice)++;
 + URCU_TLS(nr_dequeue_last)++;
   /* ok */
   break;
   case CDS_WFCQ_RET_DEST_NON_EMPTY:
 @@ -325,16 +331,21 @@ static void *thr_dequeuer(void *_count)
   count[0] = URCU_TLS(nr_dequeues);
   count[1] = URCU_TLS(nr_successful_dequeues);
   count[2] = URCU_TLS(nr_splice);
 + count[3] = URCU_TLS(nr_dequeue_last);
   return ((void*)2);
  }
 
 -static void test_end(unsigned long long *nr_dequeues)
 +static void test_end(unsigned long long *nr_dequeues,
 + unsigned long long *nr_dequeue_last)
  {
   struct cds_wfcq_node *node;
 + bool last;
 
   do {
 - node = cds_wfcq_dequeue_blocking(&head, &tail);
 + node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
   if (node) {
 + if (last)
 + (*nr_dequeue_last)++;
   free(node);
   (*nr_dequeues)++;
   }
 @@ -367,7 +378,7 @@ int main(int argc, char **argv)
   unsigned long long tot_successful_enqueues = 0,
  tot_successful_dequeues = 0,
  tot_empty_dest_enqueues = 0,
 -tot_splice = 0;
 +tot_splice = 0, tot_dequeue_last = 0;
   unsigned long long end_dequeues = 0;
   int i, a, retval = 0;
 
 @@ -480,7 +491,7 @@ int main(int argc, char **argv)
   tid_enqueuer = malloc(sizeof(*tid_enqueuer) * nr_enqueuers);
   tid_dequeuer = malloc(sizeof(*tid_dequeuer) * nr_dequeuers);
   count_enqueuer = malloc(3 * sizeof(*count_enqueuer) * nr_enqueuers);
 - count_dequeuer = malloc(3 * sizeof(*count_dequeuer) * nr_dequeuers);
 + count_dequeuer = malloc(4 * sizeof(*count_dequeuer) * nr_dequeuers);
   cds_wfcq_init(&head, &tail);
 
   next_aff = 0;
 @@ -493,7 +504,7 @@ int main(int argc, char **argv)
   }
   for (i = 0; i < nr_dequeuers; i++) {
   err = pthread_create(&tid_dequeuer[i], NULL, thr_dequeuer,
 -  &count_dequeuer[3 * i]);
 +  &count_dequeuer[4 * i]);
   if (err != 0)
   exit(1);
   }
 @@ -533,34 +544,37 @@ int main(int argc, char **argv)
   err = pthread_join(tid_dequeuer[i], &tret);
   if (err != 0)
   exit(1);
 - tot_dequeues += count_dequeuer[3 * i];
 -