Re: [lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue
* Paul E. McKenney (paul...@linux.vnet.ibm.com) wrote:
> On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
> > I noticed that, in addition to having:
> > - push/enqueue return whether the stack/queue was empty prior to the
> >   operation,
> > - pop_all/splice, by nature, empty the stack/queue,
> > it can be interesting to make pop/dequeue operations return whether
> > they are returning the last element of the stack/queue (therefore
> > emptying it). This allows extending the test-cases covering the
> > number of empty stack/queue states encountered by both push/enqueuer
> > and pop/dequeuer threads not only to push/enqueue paired with
> > pop_all/splice, but also to pop/dequeue.
> >
> > In the case of wfstack, this unfortunately requires modifying an
> > already exposed API. As an RFC, one question we should answer is how
> > we want to handle the way forward: should we add new functions to the
> > wfstack API and leave the existing ones alone? Thoughts?
>
> Hmmm...  What is the use case, given that a push might happen
> immediately after the pop said that the stack/queue was empty?  Of
> course, if we somehow know that there are no concurrent pushes, we
> could instead check for empty.  So what am I missing here?

The setup for those use-cases is the following (I'm using the stack as
an example, but the same applies to the queue):

- we have N threads doing push, using the push return value that states
  whether they pushed into an empty stack;
- we have M threads doing pop, using the output parameter to know
  whether a pop brings the stack into an empty state.

Following the locking requirements, we protect those M threads' pops
with a mutex, but they don't need to be protected against push.

To help explain where the idea comes from, let's start with a similar
use-case (push/pop_all). Knowing whether we pushed into an empty stack,
combined with pop_all, becomes very useful when we want to give the
stack a higher-level batching semantic tied to the elements present
within it.
In the case of grace-period batching, for instance, I used push/pop_all
to provide this kind of semantic: if we push into an empty stack, we
know we will have to go through the grace period. If we push into a
non-empty stack, we just wait to be awakened by the first thread, which
was pushed into the empty stack. This requires that we use pop_all
before going through the grace period.

Now, more specifically about pop: one use-case I have in mind is
energy-efficient handling of empty stacks. With M threads executing pop,
suppose we want them to block on a futex when there is nothing to do.
The tricky part is: how can we do this without adding overhead (extra
loads/stores) to the stack? If we can know whether we are popping the
last element of the stack, we can use this information to go into a
futex wait state after having handled that last element. Since the
threads doing push monitor whether they push into an empty stack, they
would wake us whenever needed.

If we instead choose to simply wait until one of the M threads discovers
that the stack is actually empty, we issue an extra pop (which fails)
every time the stack becomes empty. In the worst case, if the stack
always flips between 0 and 1 elements, we double the number of pops
needed to handle the same number of nodes. Otherwise, if we choose to
add an explicit check for stack emptiness, we add an extra load of the
head node for every pop.

Another use-case I see is low-overhead monitoring of stack usage
efficiency. For this kind of use-case, we might want to know, both
within push and pop threads, whether we are underutilizing our system
resources. Knowing that we are reaching the empty state, without any
extra overhead on stack memory traffic, gives us this ability.
I must admit that the use-case for returning whether pop takes the last
element is not as strong as the batching case with push/pop_all, mainly
because, AFAIU, we can achieve the same result with an extra check of
the stack's emptiness (either an explicit empty() check, or an extra pop
that observes an empty stack). What we save here is the extra overhead
on stack cache lines caused by that extra check.

Another use-case, although maybe less compelling, is validation. With
concurrent threads doing push/pop/pop_all operations on the stack, we
can perform the following check: if we empty the stack at the end of
test execution, the number of pushes into an empty stack must equal the
number of pop_all operations on a non-empty stack plus the number of
pops of the last element.

We should note that this validation cannot be performed if pop does not
return whether it popped the last stack element, checked atomically with
the pop operation. This is a use-case where adding an extra check on the
pop side would not work: the check needs to be performed atomically with
the pop. And maybe there are other use-cases.
Re: [lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue
On Thu, Dec 13, 2012 at 06:44:56AM -0500, Mathieu Desnoyers wrote:
> I noticed that, in addition to having:
> - push/enqueue return whether the stack/queue was empty prior to the
>   operation,
> - pop_all/splice, by nature, empty the stack/queue,
> it can be interesting to make pop/dequeue operations return whether
> they are returning the last element of the stack/queue (therefore
> emptying it). This allows extending the test-cases covering the number
> of empty stack/queue states encountered by both push/enqueuer and
> pop/dequeuer threads not only to push/enqueue paired with
> pop_all/splice, but also to pop/dequeue.
>
> In the case of wfstack, this unfortunately requires modifying an
> already exposed API. As an RFC, one question we should answer is how we
> want to handle the way forward: should we add new functions to the
> wfstack API and leave the existing ones alone? Thoughts?

Hmmm...  What is the use case, given that a push might happen
immediately after the pop said that the stack/queue was empty?  Of
course, if we somehow know that there are no concurrent pushes, we could
instead check for empty.  So what am I missing here?
							Thanx, Paul

> Thanks, Mathieu
> ---
> [patch snipped]
[lttng-dev] [RFC PATCH urcu] Add last output parameter to pop/dequeue
I noticed that, in addition to having:
- push/enqueue return whether the stack/queue was empty prior to the
  operation,
- pop_all/splice, by nature, empty the stack/queue,
it can be interesting to make pop/dequeue operations return whether they
are returning the last element of the stack/queue (therefore emptying
it). This allows extending the test-cases covering the number of empty
stack/queue states encountered by both push/enqueuer and pop/dequeuer
threads not only to push/enqueue paired with pop_all/splice, but also to
pop/dequeue.

In the case of wfstack, this unfortunately requires modifying an already
exposed API. As an RFC, one question we should answer is how we want to
handle the way forward: should we add new functions to the wfstack API
and leave the existing ones alone? Thoughts?

Thanks, Mathieu

---
diff --git a/tests/test_urcu_wfcq.c b/tests/test_urcu_wfcq.c
index 91285a5..de9566d 100644
--- a/tests/test_urcu_wfcq.c
+++ b/tests/test_urcu_wfcq.c
@@ -168,6 +168,7 @@
 static DEFINE_URCU_TLS(unsigned long long, nr_successful_dequeues);
 static DEFINE_URCU_TLS(unsigned long long, nr_successful_enqueues);
 static DEFINE_URCU_TLS(unsigned long long, nr_empty_dest_enqueues);
 static DEFINE_URCU_TLS(unsigned long long, nr_splice);
+static DEFINE_URCU_TLS(unsigned long long, nr_dequeue_last);
 
 static unsigned int nr_enqueuers;
 static unsigned int nr_dequeuers;
@@ -228,11 +229,15 @@ fail:
 static void do_test_dequeue(enum test_sync sync)
 {
 	struct cds_wfcq_node *node;
+	bool last;
 
 	if (sync == TEST_SYNC_MUTEX)
-		node = cds_wfcq_dequeue_blocking(&head, &tail);
+		node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
 	else
-		node = __cds_wfcq_dequeue_blocking(&head, &tail);
+		node = __cds_wfcq_dequeue_blocking(&head, &tail, &last);
+
+	if (last)
+		URCU_TLS(nr_dequeue_last)++;
 
 	if (node) {
 		free(node);
@@ -263,6 +268,7 @@ static void do_test_splice(enum test_sync sync)
 		break;
 	case CDS_WFCQ_RET_DEST_EMPTY:
 		URCU_TLS(nr_splice)++;
+		URCU_TLS(nr_dequeue_last)++;
 		/* ok */
 		break;
 	case CDS_WFCQ_RET_DEST_NON_EMPTY:
@@ -325,16 +331,21 @@ static void *thr_dequeuer(void *_count)
 	count[0] = URCU_TLS(nr_dequeues);
 	count[1] = URCU_TLS(nr_successful_dequeues);
 	count[2] = URCU_TLS(nr_splice);
+	count[3] = URCU_TLS(nr_dequeue_last);
 	return ((void*)2);
 }
 
-static void test_end(unsigned long long *nr_dequeues)
+static void test_end(unsigned long long *nr_dequeues,
+		unsigned long long *nr_dequeue_last)
 {
 	struct cds_wfcq_node *node;
+	bool last;
 
 	do {
-		node = cds_wfcq_dequeue_blocking(&head, &tail);
+		node = cds_wfcq_dequeue_blocking(&head, &tail, &last);
 		if (node) {
+			if (last)
+				(*nr_dequeue_last)++;
 			free(node);
 			(*nr_dequeues)++;
 		}
@@ -367,7 +378,7 @@ int main(int argc, char **argv)
 	unsigned long long tot_successful_enqueues = 0,
 			   tot_successful_dequeues = 0,
 			   tot_empty_dest_enqueues = 0,
-			   tot_splice = 0;
+			   tot_splice = 0, tot_dequeue_last = 0;
 	unsigned long long end_dequeues = 0;
 	int i, a, retval = 0;
@@ -480,7 +491,7 @@ int main(int argc, char **argv)
 	tid_enqueuer = malloc(sizeof(*tid_enqueuer) * nr_enqueuers);
 	tid_dequeuer = malloc(sizeof(*tid_dequeuer) * nr_dequeuers);
 	count_enqueuer = malloc(3 * sizeof(*count_enqueuer) * nr_enqueuers);
-	count_dequeuer = malloc(3 * sizeof(*count_dequeuer) * nr_dequeuers);
+	count_dequeuer = malloc(4 * sizeof(*count_dequeuer) * nr_dequeuers);
 	cds_wfcq_init(&head, &tail);
 	next_aff = 0;
@@ -493,7 +504,7 @@ int main(int argc, char **argv)
 	}
 	for (i = 0; i < nr_dequeuers; i++) {
 		err = pthread_create(&tid_dequeuer[i], NULL, thr_dequeuer,
-				&count_dequeuer[3 * i]);
+				&count_dequeuer[4 * i]);
 		if (err != 0)
 			exit(1);
 	}
@@ -533,34 +544,37 @@ int main(int argc, char **argv)
 		err = pthread_join(tid_dequeuer[i], &tret);
 		if (err != 0)
 			exit(1);
-		tot_dequeues += count_dequeuer[3 * i];
-		tot_successful_dequeues += count_dequeuer[3 * i + 1];
-		tot_splice += count_dequeuer[3 * i + 2];
+		tot_dequeues += count_dequeuer[4 * i];
+		tot_successful_dequeues += count_dequeuer[4 * i + 1];
+		tot_splice += count_dequeuer[4 * i + 2];
+		tot_dequeue_last += count_dequeuer[4 * i + 3];
 	}