Re: [lock-free] Re: About my Distributed Sequential lock

2015-02-05 Thread Dmitry Vyukov
I've looked at DRWLOCK.PAS and it looks exactly the same as:
http://www.1024cores.net/home/lock-free-algorithms/reader-writer-problem/distributed-reader-writer-mutex
What is novel there?


On Thu, Feb 5, 2015 at 8:36 PM, Amine Moulay Ramdane amine...@gmail.com wrote:

 Dmitry Vyukov wrote:
I don't see any explanation regarding the difference between DSLOCK.ZIP
and DSLOCK_XE.ZIP.
I can't download either file; the server says "The requested URL was
not found on this server."


 To download the zip file, please click on the small arrow just to the
 right of the DSLock.zip file name.




 Thank you,
 Amine Moulay Ramdane.





 On Friday, January 30, 2015 at 1:38:50 PM UTC-8, Amine Moulay Ramdane wrote:

 Hello,


 I want to clarify something about my Distributed Sequential Lock
 algorithm; you can download it and read about it here:


 https://sites.google.com/site/aminer68/scalable-distributed-sequential-lock


 If you read the source code carefully you will notice that I am using
 this: myid3 := FCount6^.fcount6; inside the RLock() method on the reader
 side. When my algorithm switches from Seqlock mode to Distributed lock
 mode, that read of FCount6^.fcount6 is serialized against the writer side
 incrementing FCount6^.fcount6. On the writer side I call the distributed
 reader-writer lock when the algorithm switches to Distributed mode. So,
 because the reader side serializes the transfer of FCount6^.fcount6 on the
 bus, and the writer side also serializes the transfer of a variable on the
 bus when it calls WLock() of the distributed lock, this behaves like a
 probabilistic mechanism: it can generate contention, and the reader side
 can call RLock() on a distributed lock's reader-writer mutex that the
 writer has already entered before the reader. Even so, this probabilistic
 mechanism gives a good result and it eliminates the livelock situation of
 the Seqlock when there are many writers.




 Thank you for your time.


 Amine Moulay Ramdane.




-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net



Re: [lock-free] Question on relaxed atomic optimizations in MPMC queue

2015-01-28 Thread Dmitry Vyukov
On Thu, Jan 29, 2015 at 1:37 AM, Pedro Ramalhete prama...@gmail.com wrote:
 Hi Dmitry,

 I have a question regarding the MPMC queue on your site
 http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

 On the enqueue() method, aren't you worried that the store on 'cell->data_'
 gets re-ordered and becomes visible before the
 'enqueue_pos_.compare_exchange_weak()' ?
 This would only be a problem if there is a speculative load on 'pos', and if
 this were to happen, then a dequeue() on the previous round of sequences
 could end up getting the new (wrong) data.

 Here's a more detailed example scenario where the buffer_size is 4:
   cell.sequence_ = [4  1  2  3 ]
   cell.data_ = [d4 d1 d2 d3]
   enqueue_pos_   = 5
   dequeue_pos_   = 2

 Suppose there are three threads, T1, T2 and T3, where T1 and T2 are doing
 enqueue() operations and T3 is calling dequeue().
 Consider the following sequence of events:

 1. T1 calls enqueue(), does a speculative load on 'pos' and it is speculated
 to be 6 and (due to re-ordering) sets cell->data_ in index 2 to d6, and goes
 to sleep:
   cell.sequence_ = [4  1  2  3 ]
   cell.data_ = [d4 d1 d6 d3]
   enqueue_pos_   = 5
   dequeue_pos_   = 2

 2. T3 calls dequeue(), increments dequeue_pos_ to 3 and returns the data at
 index 2, which is (incorrectly) d6:
   cell.sequence_ = [4  1  2  3 ]
   cell.data_ = [d4 d1 d6 d3]
   enqueue_pos_   = 5
   dequeue_pos_   = 3

 3. T2 calls enqueue(), increments enqueue_pos_, sets the data and sequence
 in index 2:
   cell.sequence_ = [4  5  2  3 ]
   cell.data_ = [d4 d5 d6 d3]
   enqueue_pos_   = 6
   dequeue_pos_   = 3

 4. T1 wakes up, continues its call to enqueue(), the speculative load of
 'pos' succeeds now that enqueue_pos_ is 6 and it gets incremented to 7, and
 the sequence at index 2 is set to 6:
   cell.sequence_ = [4  5  6  3 ]
   cell.data_ = [d4 d5 d6 d3]
   enqueue_pos_   = 7
   dequeue_pos_   = 3


 A somewhat similar issue may occur on the dequeue() method, where the
 relaxed load on 'pos' becomes a speculative load, and the load on
 'cell->data_' gets re-ordered above the
 'dequeue_pos_.compare_exchange_weak()'. This would cause an old value of
 'cell->data_' to be read, which would be incorrect.

 There is a very interesting paper on this kind of issue:
 http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42967.pdf

 I might have completely missed something obvious, so feel free to prove me
 wrong  :)


 Anyways, I think the idea for this queue is good, and even if my analysis is
 accurate and the current implementation is incorrect, it's just a matter of
 adding the right barriers at the right places and it will become correct.


Hi Pedro,

Accesses to data are synchronized by cell->sequence_.
enqueue/dequeue_pos_ only arbitrate concurrent enqueue/dequeue
requests.

In your scenario you missed that there is also an acquire load of
cell->sequence_ before the store. The store cannot be speculated ahead
of the load, because that would introduce a data race into a race-free
program.
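
For reference, a paraphrased sketch of enqueue() from that page (written from memory, so treat it as a sketch rather than the exact source), with the load in question marked:

bool enqueue(T const& data)
{
    cell_t* cell;
    size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
    for (;;)
    {
        cell = &buffer_[pos & buffer_mask_];
        // Acquire load: pairs with the release store of sequence_ done by the
        // dequeue that previously consumed this cell.
        size_t seq = cell->sequence_.load(std::memory_order_acquire);
        intptr_t dif = (intptr_t)seq - (intptr_t)pos;
        if (dif == 0)
        {
            if (enqueue_pos_.compare_exchange_weak(pos, pos + 1,
                    std::memory_order_relaxed))
                break;
        }
        else if (dif < 0)
            return false; // queue full
        else
            pos = enqueue_pos_.load(std::memory_order_relaxed);
    }
    // This store cannot be hoisted above the acquire load of sequence_: doing
    // so would race with a dequeue that may still be reading data_ of this cell.
    cell->data_ = data;
    cell->sequence_.store(pos + 1, std::memory_order_release);
    return true;
}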



Re: [lock-free] Fast single-reader many-writer concurrent hash table

2015-03-16 Thread Dmitry Vyukov
On Mon, Mar 16, 2015 at 4:22 AM, sbahra sba...@repnop.org wrote:
 Hi all,

 I'd love your thoughts on
 http://backtrace.io/blog/blog/2015/03/13/workload-specialization/ or at
 least hope it's useful to some folks! I've been using this a few years now,
 and decided to spend some time documenting the mechanism. I saw some folks
 ask for this here last year. The implementation targets a TSO
 environment, and has no atomic operations / barriers there. The original
 implementation also targets a pure single-writer scenario (but extensions
 are available to make it completely linearizable, if there's interest in that).


The way you do compaction (Probe Sequence Modification) is very nice!

I don't see any mention of SMR in the Delete section. But I guess you
still require deleted elements to be recycled via SMR, right?

In general, this looks like a good example of limiting requirements
and then implementing an efficient solution for the limited problem.

Thanks for sharing!



Re: [lock-free] Pros and Concerns for using packed pointers to solve ABA instead of DCAS

2015-06-20 Thread Dmitry Vyukov
On Sat, Jun 20, 2015 at 6:56 PM, Eugene Zatepyakin zatepya...@gmail.com wrote:
 Hi Dmitry,

 thanks for your explanation.
 I did quick tests on the iOS platform and it seems to support DWCAS (those
 were quick runs and I'm not 100% sure).

 I also read in some articles that it is enough to only increment the
 counter during the pop operation; does that sound valid?


Yes, it is enough to ensure that a node cannot be reinserted with the
same counter.
You can also use per-node counters, rather than a single global counter.
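
As a minimal sketch of that scheme on a Treiber stack (assuming the platform's double-width CAS is reachable through std::atomic on a two-word struct; names are illustrative and safe memory reclamation is out of scope here):

#include <atomic>
#include <cstdint>

struct node_t { node_t* next; };

// Two-word head: pointer plus ABA counter, compared and updated as one unit.
struct head_t { node_t* ptr; std::uintptr_t aba; };

std::atomic<head_t> head{head_t{nullptr, 0}};

void push(node_t* n)
{
    head_t h = head.load();
    head_t nh;
    do {
        n->next = h.ptr;
        nh = head_t{n, h.aba};        // push leaves the counter untouched
    } while (!head.compare_exchange_weak(h, nh));
}

node_t* pop()
{
    head_t h = head.load();
    head_t nh;
    do {
        if (h.ptr == nullptr) return nullptr;
        nh = head_t{h.ptr->next, h.aba + 1}; // only pop bumps the counter, so a
                                             // node can never be re-inserted and
                                             // observed under the same (ptr, aba)
    } while (!head.compare_exchange_weak(h, nh));
    return h.ptr;
}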



[lock-free] Re: mpmc queue reset and count method

2015-11-17 Thread Dmitry Vyukov
On Mon, Nov 9, 2015 at 7:15 PM, Bernard Harmel  wrote:
> Hi,
>
> I saw your bounded mpmc queue
> (http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue)
> in the spdlog library (https://github.com/gabime/spdlog)
> This is a great piece of code and I was wondering if it was easy/possible to
> add a 'reset' function which will set both enqueue and dequeue index value
> to zero to flush the queue content asap without using any lock.
> By the way I would also like to keep track of the current number of elements
> in the queue. I have started by using an atomic increment of a variable on a
> successful enqueue op and an atomic decrement on a successful dequeue op. Is it
> the best way to do it or is it possible to compute it from the enqueue and
> dequeue index in a lock free manner ?


+lock-free group

Hi Bernard,

Size is simple: do a relaxed load of enqueue_pos_ and dequeue_pos_ and
subtract. Mind possible wrap around and reordering (dequeue pos can be
larger than enqueue pos).

Reset is doable. Option (1) call dequeue until it returns false.
Option (2) CAS dequeue_pos_ to enqueue_pos_, then walk over the range
of cells, for each cell: wait until it is fully produced and mark it
as consumed.
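
A sketch of the size part (assuming the member names of the 1024cores implementation; the result is inherently approximate under concurrency):

size_t size_estimate() const
{
    size_t enq = enqueue_pos_.load(std::memory_order_relaxed);
    size_t deq = dequeue_pos_.load(std::memory_order_relaxed);
    // Due to reordering, deq can be observed as larger than enq; clamp to 0.
    return enq > deq ? enq - deq : 0;
}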



[lock-free] Re: Lock free question

2015-11-25 Thread Dmitry Vyukov
On Sat, Nov 21, 2015 at 2:15 AM, Rajinder Yadav  wrote:
> Hello Dmitry,
>
> I would like to thank you for your excellent write up on lock-free
> programming. I have a question and was wondering if you might be able to
> explain how I might be able to update multiple variables using a lock-less
> design?
>
> For example say I have a linked list and I want to be able to
> add/remove nodes and at the same time I want to keep a count of the size of
> the list.
>
> Would you be able to show me a code snippet in C++11 that would add a new
> node at the front of the list and update the head pointer and also increment
> the count? I find it confusing how to go about doing this. I imagine I would
> need to make use of a CAS loop.


Hello Rajinder,

There is the so-called MCAS (multi-word compare-and-swap) operation that
allows several memory locations to be updated atomically:
http://www.mscs.mu.edu/~brylow/SPLASH-MARC-2013/Feldman.pdf
and there is also HTM/STM (hardware/software transactional memory):
https://en.wikipedia.org/wiki/Software_transactional_memory
However, that's probably overkill in your case (both performance- and
complexity-wise), because these techniques are required only when you
need strong atomicity guarantees.

You probably can just modify the list (using whatever algorithm you
are using now) and then increment the size. Or vice versa: increment
the size and then insert into the list, depending on whether it is
preferable to observe a larger or a smaller size than it actually is.
If you increment after insertion, then you can also observe a negative
size. Most likely a negative size can just be truncated to zero.

Here is a C++11 snippet for this:

std::atomic<int> size_;
...
size_.fetch_add(1, std::memory_order_relaxed);
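
And a sketch of the rest of the counter described above (names assumed), with the decrement on removal and a transiently negative reading truncated to zero:

// after a successful removal from the list:
size_.fetch_sub(1, std::memory_order_relaxed);

// approximate size for readers:
int approx_size()
{
    int n = size_.load(std::memory_order_relaxed);
    return n < 0 ? 0 : n; // a transiently negative value is truncated to zero
}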



Re: [lock-free] Re: low overhead mpsc and work-stealing

2016-03-19 Thread Dmitry Vyukov
1e190, 0x91e1b0, 0x91e1d0,
> 0x91e1f0, 0x91e210, 0x91e230, 0x91e250, 0x91e270, 0x91e290,
> PoP:   0x91e010, 0x91e030, 0x91e050, 0x91e070, 0x91e090, 0x91e0b0, 0x91e0d0,
> 0x91e0f0, 0x91e110, 0x91e130, 0x91e150, 0x91e170, 0x91e190, 0x91e1b0,
> 0x91e1d0, 0x91e1f0, 0x91e210, 0x91e230, 0x91e250, 0x91e270,
> Nothing to pop!
>
> Nothing to pop!
>
> Nothing to pop!
>
> Nothing to pop!
>
> Nothing to pop!
>
> ~~
>
> The stub item is being popped and the last item is not popped out of the
> queue. I just went over the code again and replacing the highlighted code
> above ( return tail;) with (return next;) would solve the problem.
>
> Cheers,
> Saman
>
>
> In your original implementation here, the stub item is
> On Thursday, 10 March 2016 12:54:22 UTC-5, Dmitry Vyukov wrote:
>>
>> Hi Saman,
>>
>> Please post full source code and point to the exact thing that should
>> not happen. I've already lost any context of this.
>>
>> Thanks
>>
>>
>> On Thu, Mar 10, 2016 at 6:47 PM, Saman Barghi <sam...@gmail.com> wrote:
>> > Hi Dmitriy,
>> >
>> > I got here from your blog post. Just wanted to point out that since the
>> > queue needs at least one item to stay in the queue for it to work
>> > properly,
>> > the last item in the queue stays in the queue until another item
>> > arrives.
>> > Trying to pop the last item returns 0 although there is 1 item left in
>> > the
>> > queue. In your blog post
>> > you have a stub member to replace the last item, but here the stub item
>> > is
>> > being popped too. Check with this code:
>> >
>> > int main()
>> > {
>> > const int MAX = 20;
>> > mpscq_t q;
>> > mpscq_create(&q, new mpscq_node_t);
>> >
>> > mpscq_node_t* n = 0;
>> > //n = spscq_pop(&q);
>> >
>> > mpscq_node_t* nodes[MAX];
>> > for(int i =0; i < MAX; i++){
>> > nodes[i] = new mpscq_node_t;
>> > nodes[i]->next = 0;
>> > spscq_push(&q, nodes[i]);
>> > printf("%p, ", nodes[i]);
>> > }
>> > printf("\n");
>> >
>> > for(int i =0; i < MAX+5; i++)
>> > {
>> > n = spscq_pop(&q);
>> > if(n !=0)
>> > printf("%p, ", n);
>> > else
>> > printf("\nNothing to pop!\n");
>> > }
>> > }
>> >
>> >
>> > Thanks,
>> > Saman
>> >
>> >
>> > On Friday, 12 November 2010 03:42:14 UTC-5, Dmitry Vyukov wrote:
>> >>
>> >> On Nov 12, 4:05 am, Pierre Habouzit <pierre.habou...@intersec-
>> >> group.com> wrote:
>> >> > I'm using a work-stealing scheduler, with a per thread bounded
>> >> > dequeue
>> >> > that can be pushed/poped at the back from the local thread, and
>> >> > popped
>> >> > from the front by thieves. Yes this means that my tasks run in LIFO
>> >> > order most of the time, and I'm fine with that.
>> >> >
>> >> > When a task is queued in a full dequeue then the task is run
>> >> > immediately instead of being queued, and when thieves cannot steal
>> >> > anything they are blocked using an eventcount.
>> >> >
>> >> > In addition to that, I'd like to have serial queues of tasks that
>> >> > ensure that:
>> >> > - tasks are run in a full FIFO fashion;
>> >> > - at most one of their task can be run at the same time.
>> >> > Though tasks from different serial queues can be run at the same time
>> >> > without problem. The idea is that a queue allows to protect a
>> >> > resource
>> >> > (think file, GUI, ...).
>> >> >
>> >> > My requirements are that:
>> >> > - queues are registered at most once in the scheduler at any time;
>> >> > - non empty queues are always registered within the scheduler;
>> >> > - empty queues are not registered in the scheduler (though it is
>> >> > acceptable that just after a queue becomes empty it remains
>> >> > registered
>> >> > when a race occurs, as long as it eventually finds its way out)
>> >> >
>> >> > I'm using a setup very similar to th

Re: [lock-free] Index pool

2017-01-15 Thread Dmitry Vyukov
On Sat, Jan 14, 2017 at 5:22 PM, 'Karl' via Scalable Synchronization
Algorithms  wrote:
> I'm looking for a non-blocking index pool, i.e., a finite set of integers in
> the range [0,n) where get() returns an arbitrary integer from the set and
> put(x) writes index x back to the set (there is no need to check for
> duplicates as only indices retrieved using get() are written back). As
> straightforward solution would be to use a bounded MPMC queue, e.g.
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue,
> initially filled with 0, 1, ..., n-1. However, as the data itself (integers)
> can be manipulated atomically and there are no constraints on the order in
> which indices have to be processed (FIFO, LIFO, etc.), there might be even
> simpler / more efficient solutions. Any ideas or hints? Thank you very much
> in advance for your help!


Hello Karl,

If n is within several thousands, then you can use an atomic bitmap,
i.e. an array of atomic words: just scan it for a non-zero element and
use fetch_and to reset a bit. This will use less memory
and will be faster than a queue.
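
A minimal sketch of that (the word size, the names and the GCC/Clang __builtin_ctzll are assumptions; a set bit means the index is free):

#include <atomic>
#include <cstdint>

constexpr int kWords = 64;                       // 64 * 64 = 4096 indices
std::atomic<std::uint64_t> bitmap[kWords];       // initialize all bits to 1 at startup

int get()
{
    for (int w = 0; w < kWords; ++w) {
        std::uint64_t word = bitmap[w].load(std::memory_order_relaxed);
        while (word != 0) {
            int bit = __builtin_ctzll(word);     // lowest set bit (std::countr_zero in C++20)
            std::uint64_t mask = 1ull << bit;
            std::uint64_t prev = bitmap[w].fetch_and(~mask, std::memory_order_acquire);
            if (prev & mask)
                return w * 64 + bit;             // the bit was still set: we claimed it
            word = prev & ~mask;                 // lost the race for this bit; try another
        }
    }
    return -1;                                   // pool exhausted
}

void put(int index)
{
    bitmap[index / 64].fetch_or(1ull << (index % 64), std::memory_order_release);
}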



Re: [lock-free] Bounded MPMC queue exception safety

2016-11-06 Thread Dmitry Vyukov
On Thu, Nov 3, 2016 at 8:36 PM, Miral Mirality  wrote:
> Hi Dmitry,
>
> Pondering the bounded MPMC queue at
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
> (very clever design, by the way, as I've said before).
>
> Consider the case where enqueue has made it all the way down to the copying of
> data_ (i.e. found and reserved a cell, but not yet filled it or bumped the
> cell's next sequence).  If this copy throws an exception, as currently
> implemented it leaves the queue permanently broken (enqueue will skip past
> the cell and fill in later cells until the queue fills, and dequeue will
> halt and return "no more items" once it reaches that cell).
>
> Is there a multi-producer-safe way to "undo" the cell reservation made in
> the first part, to unblock the consumers without actually dequeuing an
> invalid item?
>
> The best I've come up with thus far is to add an extra "invalid" bool flag
> to cell_t, which is set to true if the copy throws (while still updating
> sequence_; if dequeue finds a cell with the flag set then it resets it and
> continues searching.  I can't help wonder if there's a better way, though.
>
> (Using a reserved bit in sequence_ would also work, without the extra bool
> storage, but halves the max queue size and complicates the wraparound.  It
> might technically be safer that way since it puts it under atomic
> protection; but I'm assuming a non-atomic bool field is safe due to rarity
> of use, x86 arch, and being stored prior to a nearby memory_order_release
> store and loaded after a corresponding nearby memory_order_acquire load.)


Hi Miral,

The best solution would be to require copying of elements to not
throw. One way to achieve that is to create a copy of the element
beforehand and then use the move constructor to move it into the queue.
Move constructors should not allocate.
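
A sketch of that approach as a wrapper (not the queue's actual API; it assumes an enqueue(T&&) overload whose internal copy into the cell is a noexcept move-assignment):

#include <utility>

template <typename Queue, typename T>
bool enqueue_copy_safe(Queue& q, T const& value)
{
    T copy = value;                    // may throw, but no cell has been reserved yet
    return q.enqueue(std::move(copy)); // moving into the reserved cell must not throw
}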

Otherwise, I don't see any better solution than what you came up with
-- add a flag that an element is broken and should be skipped during
consumption. Whether this flag is a separate bool or a bit in the sequence
number looks irrelevant from an algorithmic point of view. Accesses to
the bool variable should be properly synchronized by acquire/release
on the sequence variable, so we can simply consider the bool to be part
of user data.

I don't see a better way to "undo" an enqueue.
You may also check concurrent_queue in the Intel Threading Building Blocks
(TBB) library. As far as I remember they also handle exceptions in
some way.



Re: [lock-free] Re: your MPMC queue - two problems?

2016-11-28 Thread Dmitry Vyukov
On Fri, Nov 25, 2016 at 1:50 PM, Miral Mirality <uec...@gmail.com> wrote:
> On 25/11/2016 19:46, "Dmitry Vyukov" <dvyu...@gmail.com> wrote:
>> > While uniform licensing is not strictly mandatory in Boost, AFAIK, I
>> > think
>> > technically to include the code in a Boost library would prefer Dmitry
>> > to
>> > allow relicensing it with the Boost license
>> > (http://www.boost.org/users/license.html).  It's very similar to the BSD
>> > license but not identical.
>>
>> How can I do it?
>
> Reading the license page and then formally stating that you're ok (or not)
> with it being used under those terms would probably be sufficient.


I hereby license and permit usage of my "Bounded MPMC queue"
(http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue)
under the terms of "Boost Software License - Version 1.0 - August
17th, 2003" (http://www.boost.org/users/license.html).



[lock-free] Re: Relacy CAS release + acquire error?

2016-12-18 Thread Dmitry Vyukov
On Mon, Dec 19, 2016 at 4:18 AM, Alex Khripin  wrote:
> Hi Dmitry,
> First -- relacy is great, thank you for providing this tool and all the
> other stuff on 1024cores.
>
> Second -- I ran into a hiccup. This might be something I don't properly
> understand in the C++11 spec, but, relacy does not seem to support
>
> compare_exchange_strong(foo, bar, rl::mo_release, rl::mo_acquire)
>
> When RL_VERIFY is enabled, this triggers an error:
> RELACY INTERNAL ASSERT FAILED: 'collecting_history()' at
> ../../relacy/context_base_impl.hpp:60 (exec_log)
>
> Reading the code, it looks like if the success memory order is release, then
> the only failure order supported is relaxed.
>
> atomic.hpp line 338 in the big compare_exchange switch statement
>
> I know the specification prohibits the failure memory order from being
> stronger than the success order -- I am not 100% sure, but I think the above
> ordering (success = release, failure = acquire) does satisfy that
> requirement.
>
> This is mostly academic -- I can set the success order to acq_rel -- but it
> did take me some time to figure out what was going on. I am curious now if I
> am completely off base, or if this is a missing combination.


+relacy mailing list

Hi Alex,

The "not stronger" requirement is somewhat vague, as there are unordered
memory orders. But I always read it as Relacy implements it -- the failure
order must be equal to or weaker than the success order. Do you have any
indication that release/acquire is an allowed combination?



Re: [lock-free] Work-stealing with LIFO local pop and FIFO steal

2017-08-04 Thread Dmitry Vyukov
On Mon, Jul 31, 2017 at 9:14 PM,   wrote:
> Hi all,
>
> I've been using the following lock-free work-stealing deque for a while:
> https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
> I'm not using an MPMC queue since pops by the owning worker thread should be
> from the end of the queue it's pushing from (opposite from the stealing
> end), as LIFO makes it more likely the working set will be in cache.
> I'm posting here with a couple of questions:
> 1) To inquire about whether there's a more efficient algorithm, as the pop
> in this implementation has both a full barrier and a CAS, and so it's
> expensive on the 2nd most common operation (after the push, assuming good
> initial task distribution that make steals less frequent).

Hi,

This one looks good.
Note that pop contains only a full barrier in the common case (when the
deque is not close to empty). And that is as cheap as you can get, because
pop needs to achieve consensus with steal as to who gets which elements.
Just don't schedule a single integer addition as a parallel task.


> 2) What's the most efficient way to prevent busy spinning when there's no
> work left? This occurs often since my typical workloads come from real-time
> data streams. Right now, I have workers doing exponential back-off after
> failed steal attempts (with victims chosen randomly). I was thinking of
> maybe using SPSC queues from each worker to periodically send its status
> (sleeping or not) so that new work is sent first to these threads (rather
> than simple round-robin as I'm doing now, using SPSC per worker already for
> sending new work).

Search this group and comp.programming.threads archives for
"eventcount". And also this:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/299245
Eventcounts are the generic way to wait for arbitrary predicates in
lock-free algorithms.
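
The pattern, in a deliberately simple blocking sketch (mutex + condvar, not the fine-grained eventcounts discussed in those archives; all names here are mine): a worker announces intent to wait, re-checks the predicate, and only then blocks, so a notification cannot be missed.

#include <atomic>
#include <condition_variable>
#include <mutex>

class eventcount
{
public:
    unsigned prepare_wait()
    {
        // Announce intent to wait before re-checking "is there stealable work?"
        return epoch_.load(std::memory_order_seq_cst);
    }

    void wait(unsigned key)
    {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return epoch_.load(std::memory_order_relaxed) != key; });
    }

    void notify()
    {
        {
            std::lock_guard<std::mutex> lock(mu_);
            epoch_.fetch_add(1, std::memory_order_relaxed);
        }
        cv_.notify_all();
    }

private:
    std::atomic<unsigned> epoch_{0};
    std::mutex mu_;
    std::condition_variable cv_;
};

// Stealing worker:  unsigned key = ec.prepare_wait();
//                   if (!try_steal(task)) ec.wait(key); // sleeps until notify()
// Work producers:   push the task, then call ec.notify().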


> 3) To ask whether in general there's any benefit of doing strided indexing
> of a buffer in this sort of data structure, so that operations by different
> threads on different ends of the (de)que would tend to not read/write in the
> same cache line when the buffer has only a few items. Related is whether,
> given both head and tail are accessed by the different operations in this
> algorithm, there's any benefit of putting padding between them.

Yes.
Try benchmarking it. It will also help you answer other questions and
compare with alternatives.
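
On the padding point, a minimal sketch (64 is an assumed cache-line size):

#include <atomic>
#include <cstdint>

// Keep the owner-side and thief-side indices on separate cache lines so that
// pushes/pops by the owner and steals by thieves do not false-share.
struct alignas(64) padded_index
{
    std::atomic<uint32_t> value{0};
};

struct deque_indices
{
    padded_index head; // owner end: push / LIFO pop
    padded_index tail; // thief end: FIFO steal
};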



> 4) Is there a performance benefit to making the size a template parameter as
> well, given that this class is already templated on item type?

See above.


> #include 
> #include 
>
> class Task;
>
> // Restricted deque for work-stealing parallelism
> // Based on
> https://blog.molecular-matters.com/2015/09/25/job-system-2-0-lock-free-work-stealing-part-3-going-lock-free/
> // Push and LIFO pop at one end by a single thread and FIFO pop from the
> other end by other threads
>
> template<typename T>
> class WorkStealQ
> {
> public:
> WorkStealQ(void) = delete;
> inline WorkStealQ(uint32_t bufferSize); // Must be positive power-of-two
> WorkStealQ(WorkStealQ const &) = delete;
> inline WorkStealQ(WorkStealQ &&) noexcept = default;
> WorkStealQ& operator=(WorkStealQ const &) = delete;
> inline WorkStealQ& operator=(WorkStealQ &&) = default;
> // PushUnsafe() will not check for a full queue, in the case where task
> counts are externally constrained
> // This is more efficient due to the lack of need to reference the atomic
> head variable
> inline void PushUnsafe(T &&task);
> inline void PushUnsafe(T const &task);
> inline bool Push(T &&task);
> inline bool Push(T const &task);
> inline T Pop(void);
> inline T Steal(void);
> inline uint32_t Space(void) const noexcept; // Slow as it checks atomic head
> and tail
> private:
> const uint32_t _mask;
> std::vector<T> _tasks;
> std::atomic<uint32_t> _head, _tail; /// TODO: Try padding
> };
>
> template<typename T>
> inline WorkStealQ<T>::WorkStealQ(uint32_t bufferSize) : _mask(bufferSize -
> 1), _tasks(bufferSize), _head{0}, _tail{0}
> {
> if (bufferSize < 2 || bufferSize >
> static_cast<uint32_t>(std::numeric_limits<int32_t>::max()) // Limit is
> needed for signed comparison cast in Push()
> || bufferSize & _mask) throw std::invalid_argument("bufferSize must be a
> positive power-of-two that fits in a signed 32-bit integer");
> }
>
> template<typename T>
> inline void WorkStealQ<T>::PushUnsafe(T &&task)
> {
> auto head{_head.load(std::memory_order_relaxed)};
> _tasks[head & _mask] = std::move(task);
> _head.store(head + 1, std::memory_order_release);
> }
>
> template<typename T>
> inline void WorkStealQ<T>::PushUnsafe(T const &task)
> {
> auto head{ _head.load(std::memory_order_relaxed) };
> _tasks[head & _mask] = task;
> _head.store(head + 1, std::memory_order_release);
> }
>
> template<typename T>
> inline bool WorkStealQ<T>::Push(T &&task)
> {
> auto head{_head.load(std::memory_order_relaxed)};
> if (static_cast<int32_t>(head - _tail.load(std::memory_order_relaxed)) >=
> static_cast<int32_t>(_tasks.size())) return false;
> _tasks[head & _mask] = std::move(task);
> _head.store(head + 1, 

Re: [lock-free] Re: Issues with intrusive mpsc queue

2017-05-24 Thread Dmitry Vyukov
Please post full reproducer then.

On Mon, May 22, 2017 at 4:42 PM, Jay Miller <jay.mil...@gmail.com> wrote:
> To be clear, the issue is that items get dropped altogether.  In my test I
> have 50 threads each pushing 5000 nodes to the queue with a sequence number
> and thread ID.  The single consumer thread then tracks sequence numbers by
> thread ID.  Under load I get things like:
>
>   Items popped: 1106 1107 1108 1109 1110  1112 1113 1114 1115 1118 1119
> 1120 1121 1122 1123 1124 1125 1126 1127
>
> or
>
>   Items popped: 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2836 2837
> 2838 2839 2840 2841 2842 2843 2844 2845
>
> or
>
>   Items popped: 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2891 2892
> 2893 2894 2895 2896 2897 2898 2899 2900
>
> In every example I've seen 2 consecutive nodes get dropped like in these
> examples.  This never happens when I replace the final "return nullptr" with
> the call to spin_pop.
>
> On Saturday, May 20, 2017 at 4:07:41 PM UTC-5, Dmitry Vyukov wrote:
>>
>> On Thu, May 18, 2017 at 6:48 AM, Jay Miller <jay.m...@gmail.com> wrote:
>> > Also it turns out I can consistently pass my test if I only replace the
>> > second // HERE with the spin_pop (the one after pushing _stub).
>> >
>> >
>> > On Wednesday, May 17, 2017 at 8:31:59 PM UTC-5, Jay Miller wrote:
>> >>
>> >> I have a C++ implementation of the MPSC queue.  Under normal
>> >> circumstances it works correctly, but under heavy load it skips nodes
>> >> when
>> >> releasing.  That is, in a test of many writers and a single reader,
>> >> nodes
>> >> are popped in the same order each thread wrote them, but sometimes a
>> >> node
>> >> does not get returned at all.  If anyone can see what is wrong with
>> >> this I
>> >> would appreciate it very much.
>> >>
>> >> class Node
>> >> {
>> >> public:
>> >> // Atomic here as a shortcut for placing
>> >> // memory fences.  Assignment is an atomic
>> >> // store with an acquire and release fence
>> >> std::atomic<Node*> next;
>> >> };
>> >>
>> >> template <typename T>
>> >> class MpscQueue
>> >> {
>> >> public:
>> >> MpscQueue()
>> >> : _head(&_stub)
>> >> , _tail(&_stub)
>> >> {
>> >> _stub.next = nullptr;
>> >> }
>> >>
>> >> void push(T* obj)
>> >> {
>> >> _push(obj);
>> >> }
>> >>
>> >> T* pop()
>> >> {
>> >> auto tail = _tail;
>> >> Node* next = tail->next;
>> >>
>> >> if (tail == &_stub)
>> >> {
>> >> if (!next)
>> >> {
>> >> return nullptr;
>> >> }
>> >> tail = next;
>> >> next = next->next;
>> >> }
>> >>
>> >> if (next)
>> >> {
>> >> _tail = next;
>> >> return static_cast<T*>(tail);
>> >> }
>> >>
>> >> if (tail != _head)
>> >> {
>> >> return nullptr; // HERE
>> >> }
>> >>
>> >> _push(&_stub);
>> >>
>> >> next = tail->next;
>> >> if (next)
>> >> {
>> >> _tail = next;
>> >> return static_cast<T*>(tail);
>> >> }
>> >>
>> >> return nullptr; // HERE
>> >> }
>> >>
>> >> private:
>> >> void _push(Node* node)
>> >> {
>> >> node->next = nullptr;
>> >> auto prev = _head.exchange(node);
>> >>

Re: [lock-free] Re: Issues with intrusive mpsc queue

2017-05-20 Thread Dmitry Vyukov
On Thu, May 18, 2017 at 6:48 AM, Jay Miller  wrote:
> Also it turns out I can consistently pass my test if I only replace the
> second // HERE with the spin_pop (the one after pushing _stub).
>
>
> On Wednesday, May 17, 2017 at 8:31:59 PM UTC-5, Jay Miller wrote:
>>
>> I have a C++ implementation of the MPSC queue.  Under normal
>> circumstances it works correctly, but under heavy load it skips nodes when
>> releasing.  That is, in a test of many writers and a single reader, nodes
>> are popped in the same order each thread wrote them, but sometimes a node
>> does not get returned at all.  If anyone can see what is wrong with this I
>> would appreciate it very much.
>>
>> class Node
>> {
>> public:
>> // Atomic here as a shortcut for placing
>> // memory fences.  Assignment is an atomic
>> // store with an acquire and release fence
>> std::atomic<Node*> next;
>> };
>>
>> template <typename T>
>> class MpscQueue
>> {
>> public:
>> MpscQueue()
>> : _head(&_stub)
>> , _tail(&_stub)
>> {
>> _stub.next = nullptr;
>> }
>>
>> void push(T* obj)
>> {
>> _push(obj);
>> }
>>
>> T* pop()
>> {
>> auto tail = _tail;
>> Node* next = tail->next;
>>
>> if (tail == &_stub)
>> {
>> if (!next)
>> {
>> return nullptr;
>> }
>> tail = next;
>> next = next->next;
>> }
>>
>> if (next)
>> {
>> _tail = next;
>> return static_cast<T*>(tail);
>> }
>>
>> if (tail != _head)
>> {
>> return nullptr; // HERE
>> }
>>
>> _push(&_stub);
>>
>> next = tail->next;
>> if (next)
>> {
>> _tail = next;
>> return static_cast<T*>(tail);
>> }
>>
>> return nullptr; // HERE
>> }
>>
>> private:
>> void _push(Node* node)
>> {
>> node->next = nullptr;
>> auto prev = _head.exchange(node);
>> prev->next = node;
>> }
>> std::atomic<Node*> _head;
>>
>> Node* _tail;
>> Node _stub;
>> };
>>
>>
>> As a clue, if I replace the lines above marked HERE with a call to this
>> function my tests always pass, except one time in one configuration where it
>> entered a continuous loop.
>>
>> T* spin_pop(Node* tail)
>> {
>> do
>> {
>> _tail = tail->next;
>> } while (!_tail);
>>
>>
>> return static_cast<T*>(tail);
>> }
>

Hi Jay,

This algorithm is not linearizable. If a push operation is preempted
between the exchange and the update of the next link, the linked list of
nodes is broken and there is no way pop can reach any of the subsequent
nodes (even if their pushes are completely finished). I guess that's
what you see.
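
For illustration, the _push() from the code above with the window marked (comments are mine):

void _push(Node* node)
{
    node->next = nullptr;
    // The new node becomes the published head immediately:
    auto prev = _head.exchange(node);
    // <-- if this thread is preempted here, prev->next is still null, so the
    //     consumer cannot reach `node` (or anything pushed after it) until
    //     this thread resumes and executes the next line.
    prev->next = node;
}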



Re: [lock-free] Question about JCTools MPMC queue

2017-05-29 Thread Dmitry Vyukov
On Fri, May 26, 2017 at 12:10 PM, Sharath Gururaj  wrote:
> Thanks Nitsan and Dmitry for the clear explanation of #2
> I understand it now.
>
> However, In the case of concurrent queues,
> I'm not sure if we should interpret the offer() semantics of java in such a
> strict way. (returns false iff queue is full)
> Because in a concurrent situation, the very question "is the queue full?"
> loses meaning
>
> Even when we return false from offer(), by the time we detect the false
> return, other consumers can race and make the queue not full.
> There is no way to enforce a strict offer() behaviour. Maybe relaxed{Offer,
> Poll} methods should be the default {Offer, Poll} method

I don't know what semantics the Java interfaces impose, but there are cases
like the one you described above, and there are also other cases where the
queue is provably not full yet Offer fails.
Consider we have a queue with capacity 2. We insert 2 elements
initially. Then start 2 threads. Each thread first removes 1 element
and then inserts 1 element. Under any interleaving of operations in
these threads, the queue is never full during insertion. Yet insertion
can fail with the original implementation of the queue.
Non-linearizability can be very confusing.



Re: [lock-free] regarding mpmc bounded queue

2017-10-15 Thread Dmitry Vyukov
On Sun, Oct 15, 2017 at 11:40 AM, user1990 k  wrote:
> Dear All,
>
> I was trying to understand this.
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
> Let me consider enqueue scenario [ both threads ]
>
> Thread 1 in core 0            | Thread 2 in core 0
>
> pos = 0                       |
> seq = 0                       |
> dif = 0                       |
> CAS succeeds, pos = 1         |
> (yet to increment seq)        |
>                               | pos = 1
>                               | seq = 0
>                               | dif = -1
>                               | enqueue fails even though the queue is not full

Hi,

The seq are initialized to their indexes initially, see ctor:

for (size_t i = 0; i != buffer_size; i += 1)
  buffer_[i].sequence_.store(i, std::memory_order_relaxed);

So for pos=1 seq will be 1 (diff=0).



Re: [lock-free] Single target proxy collector

2017-10-31 Thread Dmitry Vyukov
On Tue, Oct 31, 2017 at 8:32 AM, Chris M. Thomasson  wrote:
> On Friday, February 15, 2013 at 1:16:36 AM UTC-8, Anthony Williams wrote:
>>
>> On 14/02/13 23:07, jseig...@gmail.com wrote:
>> > C11 and C++11 left out the most powerful atomic primitives.   Despite
>> > that you can do some limited implementations though they're much more
>> > complicated than they should be and a PITA to develop.
>>
>> I'd love to know what primitives you think we left out. Pretty much
>> everything that was proposed for threading was included, and we're
>> actively seeking proposals for C++14 and C++17, so a clear description
>> of the desired facility could lead to it being in the next C++ standard.
>>
>> > Also not
>> > helping, stdatomic.h seems to be missing from Ubuntu 12.10.   I have no
>> > idea where it is or when or if it will ever show up.
>>
>> I didn't think gcc was targeting C11. gcc 4.7 certainly has the
>> <atomic> header from C++11.
>>
>> > http://sourceforge.net/projects/atomic-ptr-plus/files/stpc/
>>
>> Interesting. I'll have to have a good look at this.
>
>
> It seems that Joe Seigh's posts got deleted from this thread. Damn, that is
> a shame.

Humm... Maybe Joe deleted it himself? At least we have it in Anthony's reply.



Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-03 Thread Dmitry Vyukov
On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
 wrote:
> Notice how there is an acquire barrier inside of the CAS loop within the
> enqueue and dequeue functions of:
>
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
>
> ?
>
>
> Well, fwiw we can get rid of them by using stand alone fences:
>
>
> Btw, sorry for the membar abstraction. I can quickly get sick and tired of
> having to type out (std::memory_order_relaxed) all the damn time. ;^)
>
>
> Anyway, here is the simple code using Relacy:
>
> ___
>
> /* Membar Abstraction
>
> ___*/
>
> #define mbrlx std::memory_order_relaxed
>
> #define mbacq std::memory_order_acquire
>
> #define mbrel std::memory_order_release
>
> #define mb(mb_membar) std::atomic_thread_fence(mb_membar)
>
>
>
> template<typename T, unsigned int T_N>
>
> struct fifo
>
> {
>
> struct cell
>
> {
>
> VAR_T(T) m_data;
>
> std::atomic<unsigned int> m_ver;
>
> };
>
>
> std::atomic<unsigned int> m_head;
>
> std::atomic<unsigned int> m_tail;
>
> cell m_cells[T_N];
>
>
> fifo() : m_head(0), m_tail(0)
>
> {
>
> for (unsigned int i = 0; i < T_N; ++i)
>
> {
>
> m_cells[i].m_ver.store(i, mbrlx);
>
> }
>
> }
>
>
> bool push_try(T const& data)
>
> {
>
> cell* c = nullptr;
>
> unsigned int ver = m_head.load(mbrlx);
>
>
> do
>
> {
>
> c = &m_cells[ver & (T_N - 1)];
>
> if (c->m_ver.load(mbrlx) != ver) return false;
>
>
> } while (! m_head.compare_exchange_weak(ver, ver + 1, mbrlx));
>
>
> mb(mbacq);
>
> VAR(c->m_data) = data;
>
> mb(mbrel);
>
>
> c->m_ver.store(ver + 1, mbrlx);
>
>
> return true;
>
> }
>
>
> bool pop_try(T& data)
>
> {
>
> cell* c = nullptr;
>
> unsigned int ver = m_tail.load(mbrlx);
>
>
> do
>
> {
>
> c = &m_cells[ver & (T_N - 1)];
>
> if (c->m_ver.load(mbrlx) != ver + 1) return false;
>
>
> } while (! m_tail.compare_exchange_weak(ver, ver + 1, mbrlx));
>
>
> mb(mbacq);
>
> data = VAR(c->m_data);
>
> mb(mbrel);
>
>
> c->m_ver.store(ver + T_N, mbrlx);
>
>
> return true;
>
> }
>
> };
>
> ___
>
>
>
> There is no need for any membars within the loops. The version count keeps
> everything in order.

compare_exchange in C/C++ has optional failure memory ordering, so the
following looks equivalent to me ;)

m_tail.compare_exchange_weak(ver, ver + 1, acquire, relaxed)
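
That is, as a sketch against the pop_try() quoted above (same membar macros), the standalone acquire fence folds into the successful CAS:

do
{
    c = &m_cells[ver & (T_N - 1)];
    if (c->m_ver.load(mbrlx) != ver + 1) return false;
} while (!m_tail.compare_exchange_weak(ver, ver + 1,
             mbacq,    // success: takes over the role of the mb(mbacq) after the loop
             mbrlx));  // failure: plain retry, no ordering needed

data = VAR(c->m_data);
mb(mbrel);
c->m_ver.store(ver + T_N, mbrlx);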



Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-14 Thread Dmitry Vyukov
On Mon, Apr 9, 2018 at 3:38 AM, Chris M. Thomasson <cris...@charter.net> wrote:
> On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote:
>>
>> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson <cri...@charter.net>
>> wrote:
>> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote:
>> >>
>> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
>> >> <cri...@charter.net> wrote:
>> >> > Notice how there is an acquire barrier inside of the CAS loop within
>> >> > the
>> >> > enqueue and dequeue functions of:
>> >> >
>> >> >
>> >> > http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>> [...]
>> > Executing an acquire barrier on every iteration of the CAS loop is not
>> > necessary. The actual version count keeps everything in order.
>> >
>> > However, you do need a single acquire fence _after_ the CAS loop
>> > succeeds in
>> > order to get a clear view of the element.
>>
>> This is true.
>
>
> Agreed. I personally like the ability to see the membars being separated out
> and
>
> standing alone. It is a habit of mine from SPARC. Now, tagged standalone
> membars
>
> aside for a moment, perhaps ones that can include memory locations they are
>
> interested in... ;^)
>
>
>>
>>
>> I don't like standalone fences because they are plague for
>> verification. Consider, a single release fences turns _all_ subsequent
>> relaxed atomic stores ever executed by the thread into release
>> operations (they release memory state up to the fence point) and
>> handling of acquire/release operations is an O(N) operation (and
>> generally done under a mutex).
>
>
> A release operation should make sure all _prior_ operations are visible
> _before_
>
> they are visible to another thread. They have no effect on subsequent
> relaxed
>
> operations. For instance:
>
>
> // producer
>
> A = 1
>
> B = 2
>
> RELEASE
>
> C = 3
>
> D = 4
>
>
> // consumer
>
> while (D != 4) backoff;
>
> ACQUIRE
>
> assert(A == 1 && B == 2);
>
>
> Well, A and B are going be in sync with an acquire such that the assert will
> never
> fail, however C can be hoisted up and not be in sync at all! C is incoherent
> wrt the
> consumer because it was not covered by the standalone release barrier.


In this case the RELEASE turned the store to D into a release-store (a
subsequent store).
And the ACQUIRE turned the load of D into an acquire-load (a preceding load).

At least this is how this is defined in the C/C++ standards.
ACQUIRE/RELEASE fences do not establish any happens-before relations
themselves. You still need a load in one thread to observe a value
stored in another thread. And only that "materializes" standalone
fence synchronization. So a store that materializes RELEASE fence will
always be a subsequent store.


>> The same for acquire fences: a single
>> acquire fences turns _all_ loads ever executed by the thread into
>> acquire operations ton he corresponding memory locations, which means
>> that you need to handle all relaxed loads as a "shadow" acquire loads
>> for the case they will be materialized by a subsequent acquire fence.
>
>
> An acquire operation should make sure all operations wrt the release are
> visible
>
> _before_ any subsequent operations can be performed _after_ that fact is
>
> accomplished.
>
>
> Well, fwiw, the membars that can be embedded into the CAS wrt acquire and
>
> release do effect prior and subsequent activity anyway, standalone or not. A
> release will
>
> dump prior stores such that an acquire barrier will see them all. Now, when
> we
>
> are dealing with a consume barrier, well that is targeting the release
> dependency
>
> chain wrt the pointer. A consume barrier is more precisely targeted when
> compared
>
> to the wider spectrum of an acquire. Btw, iirc consume is emulated in Relacy
> as
>
> acquire right?
>
>
> Also, think of popping all nodes at once from an atomic LIFO:
>
>
> https://groups.google.com/d/topic/comp.lang.c++/V0s__czQwa0/discussion
>
>
> Well, how can we accomplish the following without using standalone fences?:
>
>
>   // try to flush all of our nodes
>   node* flush()
>   {
>   node* n = m_head.exchange(nullptr, mb_relaxed);
>
>   if (n)
>   {
>   mb_fence(mb_acquire);
>   }
>
>   return n;
>   }

I can't disagree. They are definitely more

Re: [lock-free] One reason why I like atomic_thread_fence...

2018-04-20 Thread Dmitry Vyukov
On Mon, Apr 16, 2018 at 12:32 AM, Chris M. Thomasson
<cris...@charter.net> wrote:
>
>
> On Friday, April 13, 2018 at 11:45:51 PM UTC-7, Dmitry Vyukov wrote:
>>
>> On Mon, Apr 9, 2018 at 3:38 AM, Chris M. Thomasson <cri...@charter.net>
>> wrote:
>> > On Saturday, April 7, 2018 at 1:46:20 AM UTC-7, Dmitry Vyukov wrote:
>> >>
>> >> On Thu, Apr 5, 2018 at 10:03 PM, Chris M. Thomasson
>> >> <cri...@charter.net>
>> >> wrote:
>> >> > On Tuesday, April 3, 2018 at 5:44:38 AM UTC-7, Dmitry Vyukov wrote:
>> >> >>
>> >> >> On Sat, Mar 31, 2018 at 10:41 PM, Chris M. Thomasson
>> >> >> <cri...@charter.net> wrote:
>> >> >> > Notice how there is an acquire barrier inside of the CAS loop
>> >> >> > within
>> >> >> > the
>> >> >> > enqueue and dequeue functions of:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>> >> [...]
>> >> > Executing an acquire barrier on every iteration of the CAS loop is
>> >> > not
>> >> > necessary. The actual version count keeps everything in order.
>> >> >
>> >> > However, you do need a single acquire fence _after_ the CAS loop
>> >> > succeeds in
>> >> > order to get a clear view of the element.
>> >>
>> >> This is true.
>> >
>> >
>> > Agreed. I personally like the ability to see the membars being separated
>> > out
>> > and
>> >
>> > standing alone. It is a habit of mine from SPARC. Now, tagged standalone
>> > membars
>> >
>> > aside for a moment, perhaps ones that can include memory locations they
>> > are
>> >
>> > interested in... ;^)
>> >
>> >
>> >>
>> >>
>> >> I don't like standalone fences because they are plague for
>> >> verification. Consider, a single release fences turns _all_ subsequent
>> >> relaxed atomic stores ever executed by the thread into release
>> >> operations (they release memory state up to the fence point) and
>> >> handling of acquire/release operations is an O(N) operation (and
>> >> generally done under a mutex).
>> >
>> >
>> > A release operation should make sure all _prior_ operations are visible
>> > _before_
>> >
>> > they are visible to another thread. They have no effect on subsequent
>> > relaxed
>> >
>> > operations. For instance:
>> >
>> >
>> > // producer
>> >
>> > A = 1
>> >
>> > B = 2
>> >
>> > RELEASE
>> >
>> > C = 3
>> >
>> > D = 4
>> >
>> >
>> > // consumer
>> >
>> > while (D != 4) backoff;
>> >
>> > ACQUIRE
>> >
>> > assert(A == 1 && B == 2);
>> >
>> >
>> > Well, A and B are going be in sync with an acquire such that the assert
>> > will
>> > never
>> > fail, however C can be hoisted up and not be in sync at all! C is
>> > incoherent
>> > wrt the
>> > consumer because it was not covered by the standalone release barrier.
>>
>>
>> In this case the RELEASE turned store to D into a release-store (a
>> subsequent store).
>> And ACQUIRE turned load of D into an acquire-load (a preceding load).
>
>
> D should be a pure relaxed store, and C should not be covered by the
> RELEASE. Iirc, it works this way on SPARC RMO mode. However, on x86, C will
> be covered because each store has implied release characteristics, wb memory
> aside for a moment.

C and D are completely symmetric wrt the RELEASE. Later you can
discover that there is also a thread that does:

// consumer 2
while (C != 3) backoff;
ACQUIRE
assert(A == 1 && B == 2);

And now suddenly C is a release operation exactly the same way D is.
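
A sketch of that symmetry in plain C++11 terms (all data accesses relaxed, ordering only through the standalone fences):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> A{0}, B{0}, C{0}, D{0};

void producer()
{
    A.store(1, std::memory_order_relaxed);
    B.store(2, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    C.store(3, std::memory_order_relaxed); // both of these are subsequent stores:
    D.store(4, std::memory_order_relaxed); // either one can materialize the fence
}

void consumer_via_D()
{
    while (D.load(std::memory_order_relaxed) != 4) { /* backoff */ }
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(A.load(std::memory_order_relaxed) == 1);
    assert(B.load(std::memory_order_relaxed) == 2); // guaranteed
}

void consumer_via_C()
{
    while (C.load(std::memory_order_relaxed) != 3) { /* backoff */ }
    std::atomic_thread_fence(std::memory_order_acquire);
    assert(A.load(std::memory_order_relaxed) == 1);
    assert(B.load(std::memory_order_relaxed) == 2); // equally guaranteed: C and D
                                                    // are symmetric wrt the fence
}

int main()
{
    std::thread t1(producer), t2(consumer_via_D), t3(consumer_via_C);
    t1.join(); t2.join(); t3.join();
}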


>> At least this is how this is defined in the C/C++ standards.
>> ACQUIRE/RELEASE fences do not establish any happens-before relations
>> themselves. You still need a load in one thread to observe a value
>> stored in another thread. And only that "materializes" standalone
>> fence synchronization. So a store that materializes RELEASE fence will
>> always be a subsequent store.
>
>
> Humm... That is too strict, and h

Re: [lock-free] Simple DWCAS based Proxy Collector, refined... ;^)

2018-03-29 Thread Dmitry Vyukov
On Tue, Mar 27, 2018 at 11:23 PM, Chris M. Thomasson
 wrote:
> Fwiw Dmitry, I have been working with fractals a lot lately, and was
> wondering if I could still program a proxy collector from scratch. Remember
> my old collector here:
>
> http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html
>
> ?
>
> You are using it as an example in Relacy.
>
> Wrt the following code, all of the membars are standalone fence operations
> std::atomic_thread_fence. All of the membars in the atomic operations are
> relaxed. It uses DWCAS for the acquire function because I wanted to start
> out simple. I have also replaced a CAS with XCHG in the collect, or mutate
> function if you will.  So far, it works in Relacy. :^)
>
> Here is a link to my code:
>
> https://pastebin.com/raw/RKTLyXWt
>
> ___
>
> /* Simple Proxy Collector (DWCAS) - Relacy Version
>
> http://www.1024cores.net/home/relacy-race-detector
>
> Copyright 3/27/2018
> ___*/
>
>
> #include 
> #include 
> #include 
>
>
> /* Membar Abstraction
> ___*/
> #define ct_mb_relaxed std::memory_order_relaxed
> #define ct_mb_acquire std::memory_order_acquire
> #define ct_mb_release std::memory_order_release
> #define ct_mb_seq_cst std::memory_order_seq_cst
> #define ct_mb_fence(mb_membar) std::atomic_thread_fence(mb_membar)
>
>
> /* Simple Proxy Collector C++11
> ___*/
> namespace ct {
> namespace proxy {
>
> // User object base
> struct object
> {
> virtual ~object() {}
> };
>
>
> // Proxy node
> struct node
> {
> std::atomic<std::intptr_t> count;
> VAR_T(node*) next;
> VAR_T(object*) obj;
>
> node(std::intptr_t count_, node* next_, object* obj_)
> : count(count_), next(next_), obj(obj_)
> {
>
> }
>
> ~node()
> {
>
> }
> };
>
>
> // DWCAS target
> struct anchor
> {
> std::intptr_t count;
> node* head;
>
> anchor()
> {
>
> }
>
> anchor(std::intptr_t count_, node* head_)
> {
> count = count_;
> head = head_;
> }
>
> bool operator == (anchor const& right) const
> {
> return count == right.count && head == right.head;
> }
>
> anchor operator + (anchor const& right) const
> {
> anchor res;
> res.count = count + right.count;
> res.head = right.head;
> return res;
> }
>
> anchor operator - (anchor const& right) const
> {
> anchor res;
> res.count = count - right.count;
> res.head = right.head;
> return res;
> }
> };
>
> std::ostream& operator << (std::ostream& s, anchor const& right)
> {
> return s << "{" << right.count << "," << right.head << "}";
> }
>
>
> // Collector Logic
> struct gc
> {
> std::atomic<anchor> head;
>
> gc() : head(anchor{ 0, new node(0, nullptr, nullptr) }) {}
>
> ~gc()
> {
> anchor cmp = head.load(ct_mb_relaxed);
> RL_ASSERT(cmp.count > -1);
> RL_ASSERT(VAR(cmp.head->next) == nullptr);
> prv_dtor(cmp.head);
> }
>
> // Drops a reference count on a node
> bool prv_release(node* n)
> {
> std::intptr_t count = n->count.fetch_sub(2, ct_mb_relaxed);
> if (count == 3) return true;
> return false;
> }
>
> // Deletes a node
> void prv_dtor(node* n)
> {
> object* obj = VAR(n->obj);
> if (obj != nullptr) delete obj;
> delete n;
> }
>
> // Dumps backlinked nodes
> void prv_dump(node* n)
> {
> node* cur = VAR(n->next);
> VAR(n->next) = nullptr;
>
> // Release and collect in cur LIFO
> while (prv_release(cur))
> {
> ct_mb_fence(ct_mb_acquire);
> node* next = VAR(cur->next);
> VAR(cur->next) = n;
> n = cur;
> cur = next;
> }
>
> // Destroy cur LIFO
> while (n)
> {
> node* next = VAR(n->next);
> prv_dtor(n);
> n = next;
> }
> }
>
> // Collects a node
> void prv_collect(node* n)
> {
> anchor xchg = { 0, n };
>
> ct_mb_fence(ct_mb_release);
>
> // Swap in new node
> anchor cmp = head.exchange(xchg, ct_mb_relaxed);
>
> // Backlink
> ct_mb_fence(ct_mb_acquire);
> VAR(cmp.head->next) = xchg.head;
> ct_mb_fence(ct_mb_release);
>
> // Release and mark as 

Re: [lock-free] A Wait-Free MPMC Queue

2018-10-19 Thread Dmitry Vyukov
On Thu, Oct 18, 2018 at 10:02 AM Anthony Williams  wrote:
>
> On 18/10/2018 09:23, 饶萌 wrote:
> > I've just created a real wait-free MPMC queue in 100+ lines of C++11
> > code: WFMPMC (https://github.com/MengRao/WFMPMC).
> >
> > Appreciate it if you can help check.
>
> This is not wait free. If you have two writers, and both take an index.
> The writer with the lower index then gets suspended, but the writer with
> the higher index completes.
>
> Now, the next reader has to wait, because the slot it is trying to read
> from isn't complete yet, even though there is ready data in another slot.

Hi Anthony,

This sounds more like non-linearizable rather than non-wait-free.
I guess the wait-free property only makes sense for _try_ push/pop
operations. TryPop won't block, but it can [falsely] return false
because a future pop appears to happen before a preceding push.
Non-linearizability may or may not be a problem depending on the particular
use case.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3tDw_%2BxUWxHB5z90B9pc3yCZUutvQZGuEuUX0s2GUUQVA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] A Wait-Free MPMC Queue

2018-10-19 Thread Dmitry Vyukov
On Fri, Oct 19, 2018 at 5:41 PM Anthony Williams  wrote:
>
> On 19/10/2018 17:15, Dmitry Vyukov wrote:
> > On Thu, Oct 18, 2018 at 10:02 AM Anthony Williams  
> > wrote:
> >>
> >> On 18/10/2018 09:23, 饶萌 wrote:
> >>> I've just created a real wait-free MPMC queue in 100+ lines of C++11
> >>> code: WFMPMC <https://github.com/MengRao/WFMPMC>.
> >>>
> >>> Appreciate it if you can help check.
> >>
> >> This is not wait free. If you have two writers, and both take an index.
> >> The writer with the lower index then gets suspended, but the writer with
> >> the higher index completes.
> >>
> >> Now, the next reader has to wait, because the slot it is trying to read
> >> from isn't complete yet, even though there is ready data in another slot.
> >
> > Hi Anthony,
> >
> > This sounds more like non-linerizable rather then non-wait-free.
>
> For something to be wait-free, you can suspend any subset of threads
> indefinitely and all other threads will proceed OK.
>
> In this case, if a thread is suspended mid-insert then other threads
> will be unable to proceed, even when there is meaningful work to be done.

You are right!
But non-linearizable too.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vhXmK-hGEMDQOWPg_RpeTdY2W9BuZ_zFv%3DBE05chxK6Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] wait-free combiner lock

2018-11-09 Thread Dmitry Vyukov
Hi,

An interesting wait-free combiner lock with 2 XCHG to enqueue work:
https://github.com/romkatv/action-chain/blob/master/src/action_chain.h

This is somewhat similar to the XCHG-based enqueue here:
http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

But the algorithm does not have that nasty inconsistency window
between XCHG and store when the queue links are broken. The novel part
is that it uses another XCHG to restore the next link and figure out
if it managed to do so before anybody observed the broken link or not.

This is more expensive than 1 CAS to enqueue for uncontended case, but
I guess should sustain high contention much better.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3v65geX%2BhP5W41t4K2soRDCAZXnXE007Z2OvTGx5m_yUw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Question about your MPMC code

2018-11-10 Thread Dmitry Vyukov
On Sat, Nov 10, 2018 at 6:16 PM Guang Han  wrote:
>
> Thanks for your quick response, Dmitry! Do you mean that relaxed load can't 
> observe stale data at all? I think sometimes it might observe stale data 
> because it is a relaxed load and it may not get the latest value written by 
> other threads(cores) right away.

Barriers/fences don't affect this in any way. If data has not yet
propagated no load will observe it, if it has propagated any load will
observe it. Barriers/fences affect only mutual ordering of memory
accesses, see:
http://www.1024cores.net/home/lock-free-algorithms/so-what-is-a-memory-model-and-how-to-cook-it

> But i do agree with you that in practice I never see the following happens.
>
> "else
> pos = enqueue_pos_.load(std::memory_order_relaxed);"
>
>
> When you say "eventually" or "finite period of time", normally how long it 
> will take for typical cache coherence algorithms? A couple of microseconds?

I think it's more like tens of nanoseconds.
And from some point of view it's actually immediate. Because it is a
distributed system with finite speed of data propagation, an event occurs
when you observe it by definition, there is no other way to define it
(think of relativity theory and the speed of light). If you observe it, it
happened. If you don't see it, it did not happen yet.


> On Sat, Nov 10, 2018 at 7:24 PM Dmitry Vyukov  wrote:
>>
>> On Sat, Nov 10, 2018 at 4:37 PM Guang Han  wrote:
>> >
>> > Hi Dmitry,
>> >
>> >I came across your MPMC code on 1024 cores website. It is very well 
>> > written with awesome performance. It has the best performance than any 
>> > other MPMC queue i can find online. I have one quick question I hope you 
>> > can help me with. Thanks!!
>> >
>> > Basically in the enqueue method, enqueue_pos_ is loaded again using 
>> > relaxed order when diff is larger than 0, which indicates that its load at 
>> > the beginning of the enqueue method (pos = enqueue_pos_.load(relaxed))  
>> > does not obtain the latest value of enqueue_pos_. How to ensure that the 
>> > load inside the for loop will indeed return the latest value, what if it 
>> > always returns the same value as the load at the beginning of the enqueue 
>> > method? Then the thread will keep looping until the end of its scheduling 
>> > quota?
>> >
>> > Thanks!
>> > Guang
>> >
>> > bool enqueue(T const& data)
>> > {
>> > cell_t* cell;
>> > size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
>> > for (;;)
>> > {
>> > cell = &buffer_[pos & buffer_mask_];
>> > size_t seq = cell->sequence_.load(std::memory_order_acquire);
>> > intptr_t dif = (intptr_t)seq - (intptr_t)pos;
>> > if (dif == 0)
>> > {
>> > if (enqueue_pos_.compare_exchange_weak(pos, pos + 1, 
>> > std::memory_order_relaxed))
>> > break;
>> > }
>> > else if (dif < 0)
>> > return false;
>> > else
>> > pos = enqueue_pos_.load(std::memory_order_relaxed);
>> > }
>> >
>> > cell->data_ = data;
>> > cell->sequence_.store(pos + 1, std::memory_order_release);
>> >
>> >     return true;
>> > }
>> >
>>
>> Hi Guang,
>>
>> C/C++ standards guarantee that stores become visible to loads eventually:
>>
>> C++, 1.10/25:
>> An implementation should ensure that the last value (in modification
>> order) assigned by an atomic or synchronization operation will become
>> visible to all other threads in a finite period of time.
>>
>> In practice this is provided by cache coherence. All modern processors
>> are cache coherent, so a load can't observe stale data at all.



-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3sFt4-UvKDZSXvGN3C%3DO%3DO94dDgUfcxxz6H6j-zh47jsA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Question about your MPMC code

2018-11-10 Thread Dmitry Vyukov
On Sat, Nov 10, 2018 at 4:37 PM Guang Han  wrote:
>
> Hi Dmitry,
>
>I came across your MPMC code on 1024 cores website. It is very well 
> written with awesome performance. It has the best performance than any other 
> MPMC queue i can find online. I have one quick question I hope you can help 
> me with. Thanks!!
>
> Basically in the enqueue method, enqueue_pos_ is loaded again using 
> relaxed order when diff is larger than 0, which indicates that its load at 
> the beginning of the enqueue method (pos = enqueue_pos_.load(relaxed))  does 
> not obtain the latest value of enqueue_pos_. How to ensure that the load 
> inside the for loop will indeed return the latest value, what if it always 
> returns the same value as the load at the beginning of the enqueue method? 
> Then the thread will keep looping until the end of its scheduling quota?
>
> Thanks!
> Guang
>
> bool enqueue(T const& data)
> {
> cell_t* cell;
> size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
> for (;;)
> {
> cell = &buffer_[pos & buffer_mask_];
> size_t seq = cell->sequence_.load(std::memory_order_acquire);
> intptr_t dif = (intptr_t)seq - (intptr_t)pos;
> if (dif == 0)
> {
> if (enqueue_pos_.compare_exchange_weak(pos, pos + 1, 
> std::memory_order_relaxed))
> break;
> }
> else if (dif < 0)
> return false;
> else
> pos = enqueue_pos_.load(std::memory_order_relaxed);
> }
>
> cell->data_ = data;
> cell->sequence_.store(pos + 1, std::memory_order_release);
>
> return true;
> }
>

Hi Guang,

C/C++ standards guarantee that stores become visible to loads eventually:

C++, 1.10/25:
An implementation should ensure that the last value (in modification
order) assigned by an atomic or synchronization operation will become
visible to all other threads in a finite period of time.

In practice this is provided by cache coherence. All modern processors
are cache coherent, so a load can't observe stale data at all.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vkHb9VPXhYnsfWxiiZCCEor-aqXFvuw85fefYieNOaiw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] what happened when `mpscq_node_t* prev = XCHG(&self->head, n)`

2018-11-16 Thread Dmitry Vyukov
On Fri, Nov 16, 2018 at 11:15 AM by byaiu  wrote:
>
> Hi,
> I'm reading <> and found the code in 
> mpscq_push `mpscq_node_t* prev = XCHG(&self->head, n)` confusing.
> mpscq figure

Hi by,

I don't have permissions to view the link you posted.

I assume you mean this algorithm:
http://www.1024cores.net/home/lock-free-algorithms/queues/non-intrusive-mpsc-node-based-queue

> 1. does the head and n point to the right address before and after the XCHG?

I am not sure what you mean by "right".
n is a pointer to the new node that needs to be enqueued. It is a function
argument and is not changed by XCHG.
To the best of my knowledge head points to the right address. If I knew
about a bug in the algo, I would fix it.

> 2. where should prev point to?

prev points to the node that was the head of the queue just before the XCHG operation.
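
To make it concrete, here is a compilable C++ sketch of that push (same shape
as the algorithm on the page linked above, details simplified):

#include <atomic>

struct mpscq_node
{
    std::atomic<mpscq_node*> next{nullptr};
};

struct mpscq
{
    std::atomic<mpscq_node*> head; // producers push here
    mpscq_node*              tail; // consumer pops from here
    mpscq_node               stub; // initial dummy node
    mpscq() : head(&stub), tail(&stub) {}
};

void mpscq_push(mpscq* q, mpscq_node* n)
{
    n->next.store(nullptr, std::memory_order_relaxed);
    // Serialization point wrt other producers: atomically swap in the new
    // head and get back whatever was head just before us -- that is prev.
    mpscq_node* prev = q->head.exchange(n, std::memory_order_acq_rel);
    // Between the exchange above and the store below the list is temporarily
    // unlinked; the consumer has to handle a node whose next is still null.
    prev->next.store(n, std::memory_order_release);
}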

Do you have some concrete scenario that you think is broken, or you
are confused about? If yes, describe it.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3s9hvaRLEWYUjgE5QUXiXhSk9prSgf%3DyeXoSk2oodxDWw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-23 Thread Dmitry Vyukov
On Sun, Dec 23, 2018 at 7:31 AM Chris M. Thomasson  wrote:
 > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
 > wrote:
 >>
 >> If interested, I can give more details. For now, here is the code in 
 >> the form of a Relacy Test Unit:

 Can you give a short executive summary? What are the
 advantages/disadvantages/novelty?
>>>
>>>
>>> Very quick, sorry: Should have some more time tonight.
>>>
>>> A simple proxy with per-thread mutexes. Threads enter a protected region by 
>>> taking their own lock, and leave it by releasing said lock. Very basic. 
>>> When a thread wants to defer a node for deletion it pushes it onto a global 
>>> lock-free stack via CAS. There is a reaper thread that flushes all of the 
>>> nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
>>> then tries to acquire and release all of the per-thread locks. Once it has 
>>> done this, it says quiescent state. It keeps nodes in a way that ensures at 
>>> least two periods of this state have occurred before it actually calls 
>>> delete and destroys memory. Since the reapers use a try_lock to detect a 
>>> quiescent state, it can livelock in a reaper in the sense that it never 
>>> gains a lock. However, Relacy should never detect the condition because of 
>>> the limited iteration loop for workers wrt the test code itself. There is a 
>>> work around. We can let a reaper fail for a little while before it actually 
>>> ditches try_lock and just locks the per-thread quiescence lock. It would 
>>> act just like an adaptive mutex, in a sense...
>>>
>>>
>> [...]
>>>
>>> So far, so good. It passes 1000 iterations. I am letting 
>>> rl::sched_bound run for around 65 minutes now, no problems at iteration:
>>>
>>>
>>
>>
>> Fwiw, its been running overnight using rl:sched_bound, I am at iteration:
>>
>> 99% (3506831360/3506831361)
>> 99% (3506896896/3506896897)
>> 99% (3506962432/3506962433)
>> 99% (3507027968/3507027969)
>> 99% (3507093504/3507093505)
>> 99% (3507159040/3507159041)
>> 99% (3507224576/3507224577)
>> 99% (3507290112/3507290113)
>>
>>
>> No problems found so far at 3.5 billion iterations.
>
>
>
> No problems so far, however the program has still not completed and the damn 
> battery accidentally went dead on the testing laptop! This means that I have 
> to run it again.
>
> Isn't there a way to start from a given iteration count?
>
>
>>
>> I need to code up a scenario where a thread actually iterates through all of 
>> the unode's in g_worker_stack. I think it should fail in this case. Humm...
>>
>>
>>>
>>>
>>>
>>> Also, need to get the latest version of Relacy. I am using version: 2.3.
>>>
>>>
>>> Does it bite the dust in the latest version? ;^o


Nothing major has changed in the past years, so whatever version you
have should be effectively the latest.

Yes, the exhaustive mode is not too useful for anything non-trivial.
And as you see the iteration count estimation is totally broken.
The context switch bound mode may be more useful.
There is no persistent checkpointing and no parallel mode (shame!).
Every time I start thinking about this, I can't choose between
continuing improving Relacy or supporting its main features in
ThreadSanitizer which has compiler instrumentation for memory accesses
and atomic operations and a set of interceptors so existing programs
work out-of-the-box. But in the end I can't find time for either.
Some other people expressed interest in adding some Relacy-like
features to ThreadSanitizer, e.g.:
https://groups.google.com/forum/#!topic/thread-sanitizer/c32fnH0QQe4
but nothing real has come out of that yet.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3uV-_e%3D59fwuj9J8o1vcUCmPf2FFcmtO95YCe359tsbbw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-22 Thread Dmitry Vyukov
On Thu, Dec 20, 2018 at 12:17 AM Chris M. Thomasson  wrote:
>
> On Wednesday, December 19, 2018 at 2:02:31 AM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  
>> wrote:
>> >
>> > On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson 
>> > wrote:
>> >>
>> >> If interested, I can give more details. For now, here is the code in the 
>> >> form of a Relacy Test Unit:
>>
>> Can you give a short executive summary? What are the
>> advantages/disadvantages/novelty?
>
>
> Very quick, sorry: Should have some more time tonight.
>
> A simple proxy with per-thread mutexes. Threads enter a protected region by 
> taking their own lock, and leave it by releasing said lock. Very basic. When 
> a thread wants to defer a node for deletion it pushes it onto a global 
> lock-free stack via CAS. There is a reaper thread that flushes all of the 
> nodes with XCHG, and keeps it on an internal list. The reaper thread loop 
> then tries to acquire and release all of the per-thread locks. Once it has 
> done this, it says quiescent state. It keeps nodes in a way that ensures at 
> least two periods of this state have occurred before it actually calls delete 
> and destroys memory. Since the reapers use a try_lock to detect a quiescent 
> state, it can livelock in a reaper in the sense that it never gains a lock. 
> However, Relacy should never detect the condition because of the limited 
> iteration loop for workers wrt the test code itself. There is a work around. 
> We can let a reaper fail for a little while before it actually ditches 
> try_lock and just locks the per-thread quiescence lock. It would act just 
> like an adaptive mutex, in a sense...

So the main advantage is simplicity and ability for mere mortals to
understand how it works :)
Should be especially advantageous in teaching since it does not
involve atomic operations.
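
Roughly, the quiescence detection part you describe could look like this (my
naming, not your actual code):

#include <mutex>
#include <vector>

// Each reader holds its own mutex while inside the protected region; the
// reaper knows a quiescent period has passed once it has acquired and
// released every per-thread mutex, because every reader that was inside the
// region at the start of the scan has left it at least once since then.
struct ct_per_thread
{
    std::mutex region; // held by the owning thread while accessing shared data
};

void ct_wait_for_quiescence(std::vector<ct_per_thread>& threads)
{
    for (ct_per_thread& t : threads)
    {
        t.region.lock();   // succeeds only once the owner is outside the region
        t.region.unlock();
    }
    // The reaper swaps generations here; nodes collected in a generation are
    // deleted only after two such quiescent periods, as described above.
}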


> ___
> fresh = new generation
> old = old generation
>
> // reaper loop
>
> fresh = gain new nodes
>
> if (quiescent)
> {
> dump = old;
> old = fresh;
> fresh = empty;
>
> dump.destroy(); // reclaim memory...
> }
> ___
>
>
>>
>>
>>
>> >> https://pastebin.com/raw/FpnvTqvM
>> >> (raw link, no ads! :^)
>> >>
>> >> ___
>> >>
>> >> [...]
>> >>
>> >> ___
>> >>
>> >>
>> >>
>> >> Can anybody run it with Relacy? If not, I need to learn why: what 
>> >> problems were encountered?
>>
>> Did you succeed with running it with Relacy? :)
>
>
> So far, so good. It passes 1000 iterations. I am letting rl::sched_bound 
> run for around 65 minutes now, no problems at iteration:
>
>
> 99% (129564672/129564673)
>
> 99% (129630208/129630209)
>
> 99% (129695744/129695745)
>
> 99% (129761280/129761281)
>
> 99% (129826816/129826817)
>
> 99% (129892352/129892353)
>
> 99% (129957888/129957889)
>
> 99% (130023424/130023425)
>
> [...]
>
>
>
> Also, need to get the latest version of Relacy. I am using version: 2.3.
>
>
> Does it bite the dust in the latest version? ;^o

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3tX6jE3nE5_X%2Bj_i10%3DWkYu6CqSgOhVw545RcRW%2BKPZBg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: Actor scheduler source code availability

2018-12-28 Thread Dmitry Vyukov
On Wed, Dec 12, 2018 at 1:04 PM Per  wrote:
>
> Hi Dmitry,
>
> First, I want to thank you for the fantastic resource you have put together 
> on 1024cores.net. I have found it very helpful in my studies on lockfree 
> algorithms.
>
> I have a question that I was hoping you could help me with. The "Case Study: 
> Actor Scheduler" references a scalable, fast and starvation-free actor 
> scheduler written in ~300 lines of C. I was wondering if this scheduler is 
> open sourced? If so, where can I find it?

Hi Per,

Yes, it's referenced right from the bottom of that page (acsc.zip):
http://www.1024cores.net/home/scalable-architecture/case-study-actor-scheduler
http://www.1024cores.net/home/scalable-architecture/case-study-actor-scheduler/acsc.zip

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3tCyC5Z%2BWoNneHCvwHWmS%3D7NQo%3DKvofj8YC%2Br0tif6pEw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Scalable hash map

2018-12-28 Thread Dmitry Vyukov
On Fri, Dec 28, 2018 at 1:22 PM  wrote:
>
> Hi,
>
> sorry for resurrecting this old thread!
>
> The link to the files (http://groups.google.com/group/lock-free/files) no 
> longer works.
> I've downloaded the hash_map.zip file from your homepage, but this version of 
> the code only supports pod-types of 4 or 8 bytes. I am particularly 
> interested in your solution to support non-trivial key/value types. Any 
> chance this code is still available somewhere?

Hi Manuel,

Maybe non-trivial key/value types were never supported? I don't think there
were 2 different versions.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vjd4H0S9WD-bK-Z%3Dh9V9BU%3DKYB03hXFumCtZ%3D3JV9s%3Dg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: What is the most simple SPSC queue?

2018-12-28 Thread Dmitry Vyukov
On Fri, Dec 28, 2018 at 10:57 AM segn  wrote:
>
> On Thursday, December 13, 2018 at 07:54:14 UTC+1, segn wrote:
>>
>> Basically I'm looking for implementation of the same queue. In the project, 
>> the same principle was used for both thread-thread single-direction 
>> communication (typical master thread + # of worker threads; each channel is 
>> the SPSC FIFO queue), and for kernel-thread communication.
>>
>> The second one was quite specific: it was using a reserved shared memory 
>> block, used as an C array in a way circular buffer is used (so: FIFO). I 
>> wrote the queue, but surprisingly I've lost memory on this :) I cannot tell 
>> if there was some index `int first_free_idx', it looks like there should be.
>
>
> I've found the queue that I've used, I think that many (like me and other's 
> from my team) programmers came up with it on their own, but here it is called 
> Fast-Forward Queue: 
> https://scholar.colorado.edu/cgi/viewcontent.cgi?article=1960=csci_techreports

Hi segn,

There is also something called FastFlow and my version that was faster
than FastFlow in my experiments:
http://www.1024cores.net/home/lock-free-algorithms/queues/fastflow
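
The core idea FastForward and FastFlow share is that producer and consumer
don't share head/tail indices at all; the slot contents themselves signal
full/empty. A minimal sketch of that slot-based principle (my simplification,
not the code from either project):

#include <atomic>
#include <cstddef>

template <typename T, size_t Size>
class SlotQueue
{
public:
    SlotQueue()
    {
        for (std::atomic<T*>& s : buf_)
            s.store(nullptr, std::memory_order_relaxed);
    }

    bool push(T* item) // single producer
    {
        if (buf_[head_].load(std::memory_order_acquire) != nullptr)
            return false; // queue full
        buf_[head_].store(item, std::memory_order_release);
        head_ = (head_ + 1) % Size;
        return true;
    }

    T* pop() // single consumer
    {
        T* item = buf_[tail_].load(std::memory_order_acquire);
        if (item == nullptr)
            return nullptr; // queue empty
        buf_[tail_].store(nullptr, std::memory_order_release);
        tail_ = (tail_ + 1) % Size;
        return item;
    }

private:
    std::atomic<T*> buf_[Size]; // nullptr means "empty slot"
    size_t head_ = 0; // producer-private index
    size_t tail_ = 0; // consumer-private index
};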

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3un-OZfGxi-4Xwsuc2B%3D%2BQX%3Du06ccAGTpYwsCV0Vtzt4A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: Simple Example, Very Basic Per-Thread Proxy GC...

2018-12-19 Thread Dmitry Vyukov
On Wed, Dec 19, 2018 at 7:05 AM Chris M. Thomasson  wrote:
>
> On Monday, December 17, 2018 at 11:23:20 PM UTC-8, Chris M. Thomasson wrote:
>>
>> If interested, I can give more details. For now, here is the code in the 
>> form of a Relacy Test Unit:

Can you give a short executive summary? What are the
advantages/disadvantages/novelty?


>> https://pastebin.com/raw/FpnvTqvM
>> (raw link, no ads! :^)
>>
>> ___
>>
>> [...]
>>
>> ___
>>
>>
>>
>> Can anybody run it with Relacy? If not, I need to learn why: what problems 
>> were encountered?

Did you succeed with running it with Relacy? :)

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vySJoRh0k9UixBjRi1N6x_RwDm_qNT0077PmJD60f1FQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Experimental Read/Write Mutex..

2019-02-26 Thread Dmitry Vyukov
On Wed, Feb 20, 2019 at 1:31 AM Chris M. Thomasson  wrote:
>
> On Monday, February 18, 2019 at 12:13:21 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  
>> wrote:
>> >
>> > On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>> >>
>> >> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> >> wrote:
>> [...]
>> >> I can also test on 2 CPU system with 88 hardware threads in total, but
>> >> the bench hardcodes 16 threads (there is std::hardware_concurrency()
>> >> or something like that).
>> >
>> >
>> > Indeed! Will correct this issue. Make it dynamic. :^)
>> >
>> > Although, it can be beneficial to create more threads just to see how the 
>> > algorithm handles these cases of "oversubscribed" contention.
>>
>> Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
>> 4*std::hardware_concurrency().
>
>
> Indeed.
>
>
>>
>> >> I've found that it's critical to model realistic/target ratio between
>> >> reads/writes and amount of local work and work inside of critical
>> >> section in such benchmarks. Otherwise in 100% synthetic hammering
>> >> usually the worst mutex shows the best results. Synthetic benchmarks
>> >> produce extreme contention, so a mutex that blocks threads as much as
>> >> possible and allows as few threads as possible to work, shows the best
>> >> result because contention is minimized. The best would be to run all
>> >> threads sequentially one-by-one. But that's counter productive for
>> >> real life because blocked threads don't do useful work too.
>> >
>> >
>> > Agreed. Just wondering why it performs so much better in some high 
>> > contention scenarios. Load imbalance is a factor for sure.
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>> >
>> > https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ
>>
>> I don't need to tell you that performance of sync primitives can be
>> very non-linear and counter-intuitive :)
>>
>> Perhaps the other mutexes were not super dumb, so they did not perform
>> the best under very high contention.
>
>
> :^D
>
>
>> >> Also this benchmark has fixed amount of work per thread, so I suspect
>> >> it may be subject to load imbalance when an unlucky outliner delays
>> >> total result. A better option would be to run all threads for X
>> >> seconds and then set a global var for all of them to stop and then
>> >> join all of them and measure number of iterations.
>> >
>> >
>> > Indeed. A benckmark that just modeled a list that readers iterated, and 
>> > writers pushed and popped from. Then we can show how many operations, 
>> > reads or writes, they performed per-second. So it would be the number of 
>> > reads and writes per-second, per-thread.
>> >
>> > This reminds of some of Joe Seighs tests back on comp.programming.threads.
>> >
>> >
>> >>
>> >>
>> >> Also for me it lacked #include .
>> >
>> >
>> > Fixed that. Thanks again:
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> > (reload the page...)
>> >
>> >
>> > Fwiw, I am wondering what you think about the algorithm itself? Is it 
>> > crap, or decent? :^)
>>
>>
>> I think it's one of the best algorithms out there.
>
>
> Thanks Dmitry. So far, wrt a general purpose centralized rwmutex, it is not 
> all that bad. :^)
>
>
>>
>> The complete fairness for both readers and writers is notable.
>
>
> Indeed. My algorithm has a strict bakery style about it. In this case, the 
> readers get the "main benefit" wrt the wait-free fast-path, assuming 
> fetch_add is a single operation in hardware like x86, not a CAS or LL/SC loop 
> in software. Well, that is great because who uses a rwmutex for writer 
> performance anyway? ;^)
>
>
>>
>> And the wait-free fast-path for readers.
>
>
> Big time. Imvvho, this is the most important aspect. Almost ready with a much 
> better benchmark that should show this off quite well.
>
>
>>
>> It still suffers from high
>> cache line contention even on read-only workload, but it's as good as
>> one can get with a centralized design (i.e. not per-thread/cpu
>> distributed stuff which has own problem

Re: [lock-free] Re: Experimental Read/Write Mutex..

2019-02-27 Thread Dmitry Vyukov
On Tue, Feb 26, 2019 at 11:47 PM Chris M. Thomasson  wrote:
>
> On Tuesday, February 26, 2019 at 12:10:02 AM UTC-8, Dmitry Vyukov wrote:
>>
>> On Wed, Feb 20, 2019 at 7:51 AM Chris M. Thomasson  
>> wrote:
>> >
>> > Fwiw, I wrote a crude new benchmark that measures how many reads and 
>> > writes can be performed in a given amount of time. My algorithm vs 
>> > std::shared_mutex. So, we are _primarily_ looking for how many reads can 
>> > be performed in this test at 60 seconds. The number of threads is variable 
>> > and is determined by std::hardware_concurrency * THREADS, set at 8 in the 
>> > test. So on my system the setup is:
>> > ___
>> > cpu_threads_n = 4
>> > threads_n = 32
>> > writers = 16
>> > readers = 16
>> > test duration = 60 seconds
>> > ___
>> [...]
>>
>> Benchmarked this on 2 systems (3 alternating runs for each mutex):
>>
>> [...]
>
>
> Thank you! :^)
>
>>
>>
>> Now it is your problem to interpret this :)
>
>
> std::shared_mutex might have a reader priority? Wondering if it is using a 
> distributed algorithm on the larger system?
>
> The strict bakery style interleaving in my algorithm must be doing something 
> "interesting" on the larger system. Mine seems to allow for some more writes 
> in the data, too fair? The mutex aspect of my algorithm might be kicking in 
> here. It is using Alexander Terekhov's algorithm from pthread_mutex_t in 
> pthreads-win32, actually it can use any mutual exclusion algorithm for writer 
> access:
>
> https://www.sourceware.org/pthreads-win32/
>
> Integrating writer access wrt ct_rwmutex::m_count should be beneficial...
>
>
>
>>
>>
>> Another interesting data point is time output.
>> On the large system for your mutex:
>> 848.76user 10.83system 1:00.26elapsed 1426%CPU
>>
>> for std mutex:
>> 4238.27user 26.79system 1:00.18elapsed 7086%CPU
>>
>> So whatever your mutex did, it used 5 times less CPU time for that.
>
>
> Bakery style, and the slow paths boiling down to condvar/mutex? I wonder what 
> std::shared_mutex is using on your end when it has to wait? futex?

It's just a wrapper around pthread_rwlock and I have glibc 2.24.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vmhVoGtQJV4o9%2BsoG2M4egGsNZy9f8VmpGvNvWma9N%3Dg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] TensorFlow scheduler

2019-02-27 Thread Dmitry Vyukov
Hi,

TensorFlow CPU task scheduler I wrote some time ago:

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default=file-view-default

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default=file-view-default

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default=file-view-default

This and some other fixes improved model training time on CPU up to 3x.
There is some interesting lock-free stuff in there.

Main design goal is to tailor the pieces for the specific problem at hand
and be practical, e.g. lock-free where it matters, mutex-based otherwise.
Sane level of complexity (need to get it right). Not trading practical
properties that matter (array-based queue) for formal properties
(lock-freedom).

EventCount (aka "condvar for lock-free algos") has fast common paths
and minimizes contention as much as possible. Due to non-uniform/bursty
work, blocking/unblocking threads all the time actually turned out to be
one of the most critical parts of scheduling in TF.
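
The usage pattern is always the same: announce intent to wait, re-check the
queue, and only then block. A naive mutex+condvar sketch of the idea (not the
Eigen EventCount API above, which is far more optimized; names are just for
illustration):

#include <atomic>
#include <condition_variable>
#include <mutex>

class NaiveEventCount
{
public:
    unsigned prepare_wait() // announce intent to block
    {
        return epoch_.load(std::memory_order_acquire);
    }
    void cancel_wait() {} // nothing to undo in this naive version
    void commit_wait(unsigned epoch) // block until a notify happened after prepare_wait
    {
        std::unique_lock<std::mutex> lk(m_);
        while (epoch_.load(std::memory_order_relaxed) == epoch)
            cv_.wait(lk);
    }
    void notify_all()
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            epoch_.fetch_add(1, std::memory_order_release);
        }
        cv_.notify_all();
    }
private:
    std::atomic<unsigned> epoch_{0};
    std::mutex m_;
    std::condition_variable cv_;
};

// Consumer: unsigned e = ec.prepare_wait();
//           if (queue.try_pop(&item)) ec.cancel_wait(); else ec.commit_wait(e);
// Producer: queue.push(item); ec.notify_all();
// The re-check between prepare_wait() and commit_wait() is what makes it safe
// to combine with a lock-free queue without losing wakeups.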

RunQueue (work-stealing deque) allows all operations from both sides
because work can be submitted from external threads in TF. Owner ops are
lock-free, remote ops use a mutex. This part was somewhat unique compared
to similar systems. Getting Size right is tricky!

ThreadPool (the scheduler itself) is a quite standard design based on
distributed runqueues, though you can find some interesting tricks wrt
stealing order there. Also steal partitions (not done by me). The spinning
logic required quite some tuning to balance between latency and wasted
CPU.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3uB-gOgEx%3Drwwcy-dgOZeGpcEMzKgF%3DWvvm6FgQXWUeNg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: TensorFlow scheduler

2019-03-06 Thread Dmitry Vyukov
On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson  wrote:
>>>
>>> Hi,
>>>
>>> TensorFlow CPU task scheduler I wrote some time ago:
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default=file-view-default
>>>
>>> This and some other fixes improved model training time on CPU up to 3x.
>>> There is some interesting lock-free stuff in there.
>>> [...]
>>
>>
>> The code is a work of art: Well done Dmitry!

Thanks!

> Love the:
>
>  Work PushFront(Work w) {
> unsigned front = front_.load(std::memory_order_relaxed);
> Elem* e = &array_[front & kMask];
> uint8_t s = e->state.load(std::memory_order_relaxed);
> if (s != kEmpty ||
> !e->state.compare_exchange_strong(s, kBusy, 
> std::memory_order_acquire))
>   return w;
> front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
> e->w = std::move(w);
> e->state.store(kReady, std::memory_order_release);
> return Work();
>
> }
>
> part. It reminds me of the same basic logic in your mpmc queue:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
> Ahh, I still like my little tweak of your most excellent algorithm:
>
> https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
>
> ;^)

IIRC once a thread executed XADD it has no way back, which implies
blocking/waiting on empty/full queue.
Here I needed "try" versions because this is not a producer-consumer
scenario and we don't need back-pressure in cooperative task execution
scenario. If we fail to pop from one queue, we will go and try to pop
from another. If we fail to push onto a queue, we can push onto
another or execute directly.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3v8hRs2pTJiTsokcRWLgEERz4s6gdkcBV_q4%2Bov%3DcCt-w%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Re: TensorFlow scheduler

2019-03-06 Thread Dmitry Vyukov
On Fri, Mar 1, 2019 at 8:27 AM Chris M. Thomasson  wrote:
>
> On Wednesday, February 27, 2019 at 1:35:04 AM UTC-8, Dmitry Vyukov wrote:
>>
>> Hi,
>>
>> TensorFlow CPU task scheduler I wrote some time ago:
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default=file-view-default
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default=file-view-default
>>
>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default=file-view-default
>>
>> This and some other fixes improved model training time on CPU up to 3x.
>> There is some interesting lock-free stuff in there.
>>
>> Main design goal is to tailor pieces for specific problem at hand & be
>> practical Eg lock-free where matters mutex-based otherwise. Sane level
>> of complexity(need to get it right). Not trading practical properties
>> that matter (array-based queue) for formal properties (lock-freedom)
>>
>> EventCount (aka "condvar for lock-free algos") has a fast common paths
>> and minimizes contention as much as possible. Due to
>> non-uniform/bursty work blocking/unblocking threads all the time
>> actually turned out to be one of the most critical parts of scheduling
>> in TF.
>
>
> Gotta love the eventcount. Fwiw, I remember way back when I created one in 
> 2005:
>
> https://groups.google.com/d/topic/comp.programming.threads/qoxirQbbs4A/discussion
>
> Actually, iirc, Joe Seigh created a very interesting bakery style rw 
> spinlock, need to find it. Iirc, it can be outfitted with an eventcount to 
> remove the spins.
>
>
>>
>>
>> RunQueue (work-stealing deque) allows all operations from both sides
>> as work can be submitted from external threads in TF. Owner ops are
>> lock-free, remote use mutex. This part was somewhat unique as compared
>> to similar systems. Getting Size right is tricky!
>>
>> ThreadPool (scheduler itself) is quite standard design based on
>> distributed runqueues. Though you can find some interesting tricks wrt
>> stealing order there. Also steal partitions (not done by me). Spinning
>> logic required quite some tuning to balance between latency and wasted
>> CPU
>
>
> Nice! Btw, did you ever find a use for work requesting?
>
> Something like:
>
> https://groups.google.com/d/topic/comp.programming.threads/YBnjd-Sqc-w/discussion


Not any real one...

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3v98bX3rUH_XtEMjznGg3dUqPy9HZ0noB1f3n%3D1%3D7umbA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Experimental Read/Write Mutex..

2019-02-16 Thread Dmitry Vyukov
On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  wrote:
>
> Fwiw, here is a simple benchmark for an older read/write algorithm I invented:
>
>
> https://pastebin.com/raw/xCBHY9qd
>
>
> It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a macro 
> in the code called CT_TEST_FAST_MUTEX. If it is defined then the code will 
> test my algorithm. If not defined then it will test against 
> std::shared_mutex. I am wondering if anybody can compile and run the code? 
> Test it out on as many operating systems as you can. Can you please show the 
> output?
>
>
> Thanks everybody. :^)
>
> Here is the code:
>
> /* Simple, crude read/write mutex test
>by: Chris M. Thomasson
> __*/
>
>
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
>
> #define THREADS 16UL
> #define ITERS 1000UL
> #define COUNT (THREADS * ITERS)
>
>
> // undefine to test std::shared_mutex
> #define CT_TEST_FAST_MUTEX 1
>
>
> // bare bones mutex/condvar based semaphore
> struct ct_slow_semaphore
> {
> unsigned long m_state;
> std::mutex m_mutex;
> std::condition_variable m_cond;
>
> ct_slow_semaphore(unsigned long state) : m_state(state) {}
>
> void inc()
> {
> {
> std::unique_lock lock(m_mutex);
> ++m_state;
> }
>
> m_cond.notify_one();
> }
>
> void add(unsigned long addend)
> {
> {
> std::unique_lock lock(m_mutex);
> m_state += addend;
> }
>
> m_cond.notify_all();
> }
>
> void dec()
> {
> std::unique_lock lock(m_mutex);
> while (m_state == 0) m_cond.wait(lock);
> --m_state;
> }
> };
>
>
>
>
> // bin-sema
> struct ct_auto_reset_event
> {
> bool m_state;
> std::mutex m_mutex;
> std::condition_variable m_cond;
>
> ct_auto_reset_event() : m_state(false) {}
>
> void signal()
> {
> std::unique_lock lock(m_mutex);
> m_state = true;
> m_cond.notify_one();
> }
>
> void wait()
> {
> std::unique_lock lock(m_mutex);
> while (m_state == false) m_cond.wait(lock);
> m_state = false; // auto-reset
> }
> };
>
>
> // just a layer over an auto-reset event
> struct ct_fast_mutex
> {
> std::atomic<unsigned> m_state;
> ct_auto_reset_event m_waitset;
>
> ct_fast_mutex() : m_state(0) {}
>
> void lock()
> {
> if (m_state.exchange(1, std::memory_order_acquire))
> {
> while (m_state.exchange(2, std::memory_order_acquire))
> {
> m_waitset.wait();
> }
> }
> }
>
> void unlock()
> {
> if (m_state.exchange(0, std::memory_order_release) == 2)
> {
> m_waitset.signal();
> }
> }
> };
>
>
>
> // Chris M. Thomassons Experimental Read/Write Mutex
> // Yeah, it is pretty damn fat wrt the state, however
> // it has some interesting properties...
> // The state can be compressed a bit...
> // btw, it has no loops...
> // Take a look at the lock_shared and unlock_shared functions
>
> #define RWMUTEX_COUNT_MAX LONG_MAX
>
> struct ct_rwmutex
> {
> // shared state
> std::atomic<long> m_wrstate;
> std::atomic<long> m_count;
> std::atomic<long> m_rdwake;
>
> ct_slow_semaphore m_rdwset;
> ct_slow_semaphore m_wrwset;
> ct_fast_mutex m_wrlock;
>
>
> ct_rwmutex() :
> m_wrstate(1),
> m_count(RWMUTEX_COUNT_MAX),
> m_rdwake(0),
> m_rdwset(0),
> m_wrwset(0) {
> }
>
>
> // READ, pretty slim...
> void lock_shared()
> {
> if (m_count.fetch_add(-1, std::memory_order_acquire) < 1)
> {
> m_rdwset.dec();
> }
> }
>
> void unlock_shared()
> {
> if (m_count.fetch_add(1, std::memory_order_release) < 0)
> {
> if (m_rdwake.fetch_add(-1, std::memory_order_acq_rel) == 1)
> {
> m_wrwset.inc();
> }
> }
> }
>
>
> // WRITE, more hefty
> void lock()
> {
> m_wrlock.lock();
>
> long count = m_count.fetch_add(-RWMUTEX_COUNT_MAX, 
> std::memory_order_acquire);
>
> if (count < RWMUTEX_COUNT_MAX)
> {
> long rdwake = m_rdwake.fetch_add(RWMUTEX_COUNT_MAX - count, 
> std::memory_order_acquire);
>
> if (rdwake + RWMUTEX_COUNT_MAX - count)
> {
> m_wrwset.dec();
> }
> }
> }
>
> // write unlock
> void unlock()
> {
> long count = m_count.fetch_add(RWMUTEX_COUNT_MAX, 
> std::memory_order_release);
>
> if (count < 0)
> {
> m_rdwset.add(-count);
> }
>
> m_wrlock.unlock();
> }
> };
>
>
> struct ct_shared
> {
> std::atomic m_state;
>
> #if defined (CT_TEST_FAST_MUTEX)
> ct_rwmutex m_std_rwmutex;
> #else
> std::shared_mutex 

[lock-free] Abseil/nsync Mutex

2019-02-20 Thread Dmitry Vyukov
Hi,

FYI, if you haven't seen this yet:
https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mutex.h
https://github.com/abseil/abseil-cpp/blob/master/absl/synchronization/mutex.cc

Some minor improvements are possible, but overall it's a piece of art.
Originally developed by Mike Burrows
(https://en.wikipedia.org/wiki/Michael_Burrows). If you like learning
synchronization primitives, this one will be a good time investment.
The main design goal I guess is handling all possible usage scenarios
gracefully (i.e. not just readers, or writers, or light contention, or
heavy contention, or few waiters, or thousands of waiters, etc) which
is expected of a primitive used in literally millions of place across
thousands of systems. It also has some interesting features like
contention profiling and deadlock detection. FWIW it also fits into a
single intptr (i.e. 4 bytes on 32-bit systems).

There is also a bit simplified version of the same algo in nsync library:
https://github.com/google/nsync/blob/master/README
https://github.com/google/nsync/blob/master/internal/mu.c
It's C and does not have contention profiling and deadlock detection,
so may be easier for leaning of the mutex algorithm itself.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3tstU5aQU%3Dq7oXXwrxRzQzsDzKJE66_hKSB4sUbZJsNzA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [lock-free] Experimental Read/Write Mutex..

2019-02-18 Thread Dmitry Vyukov
On Sun, Feb 17, 2019 at 11:51 PM Chris M. Thomasson  wrote:
>
> On Saturday, February 16, 2019 at 11:24:42 PM UTC-8, Dmitry Vyukov wrote:
>>
>> On Sun, Feb 17, 2019 at 12:34 AM Chris M. Thomasson  
>> wrote:
>> >
>> > Fwiw, here is a simple benchmark for an older read/write algorithm I 
>> > invented:
>> >
>> >
>> > https://pastebin.com/raw/xCBHY9qd
>> >
>> >
>> > It should compile fine on MSVC 2017 and GCC with -std=c++17. There is a 
>> > macro in the code called CT_TEST_FAST_MUTEX. If it is defined then the 
>> > code will test my algorithm. If not defined then it will test against 
>> > std::shared_mutex. I am wondering if anybody can compile and run the code? 
>> > Test it out on as many operating systems as you can. Can you please show 
>> > the output?
>> >
>> >
>> > Thanks everybody. :^)
>> >
>> > Here is the code:
>> >
>> > /* Simple, crude read/write mutex test
>> >by: Chris M. Thomasson
>> > __*/
>> [...]
>
>
>>
>>
>> Hey Chris!
>>
>> Here are my results on i7-8650U (4 HT cores) on Debian 4.19.16-1 gcc 7.3.0
>>
>> std::shared_mutex
>> msec = 56118
>> msec = 49149
>> msec = 69416
>> msec = 68737
>> msec = 59174
>> msec = 65915
>>
>> ct_rwmutex
>> msec = 62772
>> msec = 71139
>> msec = 68110
>> msec = 66038
>> msec = 60436
>> msec = 74238
>
>
>
> Thank you so much! Okay, not great, but not all that "radically" hardcore 
> terrible when compared to a std::shared_mutex on your end. Also, my work can 
> benefit from many sessions of padding and alignment therapy wrt 
> data-structure layout.
>
>
>>
>>
>> I can also test on 2 CPU system with 88 hardware threads in total, but
>> the bench hardcodes 16 threads (there is std::hardware_concurrency()
>> or something like that).
>
>
> Indeed! Will correct this issue. Make it dynamic. :^)
>
> Although, it can be beneficial to create more threads just to see how the 
> algorithm handles these cases of "oversubscribed" contention.

Agree. So std::hardware_concurrency(), 2*std::hardware_concurrency(),
4*std::hardware_concurrency().


>> I've found that it's critical to model realistic/target ratio between
>> reads/writes and amount of local work and work inside of critical
>> section in such benchmarks. Otherwise in 100% synthetic hammering
>> usually the worst mutex shows the best results. Synthetic benchmarks
>> produce extreme contention, so a mutex that blocks threads as much as
>> possible and allows as few threads as possible to work, shows the best
>> result because contention is minimized. The best would be to run all
>> threads sequentially one-by-one. But that's counter productive for
>> real life because blocked threads don't do useful work too.
>
>
> Agreed. Just wondering why it performs so much better in some high contention 
> scenarios. Load imbalance is a factor for sure.
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/L7Sjs0xOBAAJ
>
> https://groups.google.com/d/msg/comp.lang.c++/g6veUOvOAu0/hMIu5VBSBAAJ

I don't need to tell you that performance of sync primitives can be
very non-linear and counter-intuitive :)

Perhaps the other mutexes were not super dumb, so they did not perform
the best under very high contention.


>> Also this benchmark has fixed amount of work per thread, so I suspect
>> it may be subject to load imbalance when an unlucky outliner delays
>> total result. A better option would be to run all threads for X
>> seconds and then set a global var for all of them to stop and then
>> join all of them and measure number of iterations.
>
>
> Indeed. A benckmark that just modeled a list that readers iterated, and 
> writers pushed and popped from. Then we can show how many operations, reads 
> or writes, they performed per-second. So it would be the number of reads and 
> writes per-second, per-thread.
>
> This reminds of some of Joe Seighs tests back on comp.programming.threads.
>
>
>>
>>
>> Also for me it lacked #include .
>
>
> Fixed that. Thanks again:
>
> https://pastebin.com/raw/xCBHY9qd
> (reload the page...)
>
>
> Fwiw, I am wondering what you think about the algorithm itself? Is it crap, 
> or decent? :^)


I think it's one of the best algorithms out there.
The complete fairness for both readers and writers is notable.
And the wait-free fast-path for readers. It still suffers from high
cache line contention even on read-onl

[lock-free] eventually fair mutex

2019-02-18 Thread Dmitry Vyukov
Hi,

I did this change some time ago, but thought it's still worth sharing
(better late than never!):

https://go-review.googlesource.com/c/go/+/34310/
https://go-review.googlesource.com/c/go/+/34310/8/src/sync/mutex.go

There are mutexes that let threads in in strict FIFO order. This is
good for fairness, but very bad for performance as 2 threads
effectively need to go in lock-step: one thread can't immediately
reacquire the mutex if there is another thread waiting, so it has to
yield to the other thread, and then they swap roles and the second
thread needs to yield to the first one and so on.
There are also mutexes that allow the first thread that is _running_
when the mutex is free to acquire it. This is great for performance,
but bad for fairness. In the worst case scenario, a thread holds a
mutex for a long time, then briefly releases and reacquires it again. This
can starve other threads literally infinitely: whenever they wake, the
first thread has already reacquired the mutex.

The idea of the change is to get the best of both worlds: good fast-path
performance and eventual (bounded) fairness. If a thread
discovers that it has been starved for too long (say, a second), it marks the
mutex state as "starved". In this mode Unlock will directly hand off
mutex ownership to the blocked/starved thread.
When a thread discovers on a lock operation that it's not starving but
the mutex is in starvation mode, it resets the starvation flag and the
mutex returns to normal "no handoff" operation.
This allowed resolving unbounded starvation problems encountered in
real applications with no performance effect for normal use cases.
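
A much simplified, condvar-based sketch of the policy (this is NOT the actual
Go implementation; the names and the 1ms polling wait are only there to keep
the sketch short):

#include <chrono>
#include <condition_variable>
#include <mutex>

// Threads normally barge in; once a waiter has been starved past the
// threshold the mutex switches to "starvation" mode and unlock() hands
// ownership directly to an existing waiter instead of letting new arrivals
// steal the mutex.
class eventually_fair_mutex
{
public:
    void lock()
    {
        std::unique_lock<std::mutex> lk(m_);
        auto start = std::chrono::steady_clock::now();
        bool was_waiting = false;
        for (;;)
        {
            // Take the mutex if it's free, unless we are in starvation mode
            // and the handoff is reserved for a thread that already waited.
            if (!locked_ && (!starving_ || (handoff_ && was_waiting) || waiters_ == 0))
            {
                locked_ = true;
                handoff_ = false;
                if (waiters_ == 0)
                    starving_ = false; // no starved waiters left: back to normal mode
                return;
            }
            ++waiters_;
            was_waiting = true;
            cv_.wait_for(lk, std::chrono::milliseconds(1));
            --waiters_;
            if (std::chrono::steady_clock::now() - start > std::chrono::seconds(1))
                starving_ = true; // we are being starved: request direct handoff
        }
    }

    void unlock()
    {
        {
            std::lock_guard<std::mutex> lk(m_);
            locked_ = false;
            if (starving_ && waiters_ > 0)
                handoff_ = true; // reserve ownership for one of the waiters
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool locked_ = false;
    bool starving_ = false; // starvation (handoff) mode
    bool handoff_ = false;  // ownership reserved for an existing waiter
    int waiters_ = 0;
};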

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3uJa82Lqws_MRATbZ_Jh1xBzoEO06FWP60OTiUBOMdfkQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


[lock-free] Re: 3-thread consensus.

2020-01-08 Thread Dmitry Vyukov
On Wed, Jan 8, 2020 at 9:29 PM bittnkr  wrote:
>
> Dear Dmitry,
>
> I found a nice solution for the problem called 3 thread consensus, considered 
> impossible on the book The art of the multiprocessor programming. I think 
> that is a breakthrough.
>
> Debating on S.O. with someone about if the solution is solid or no, If it is 
> possible to occur data races, he referred relacy.
>
> So I'm writing you to ask your opinion about the solution.
>
> Can you take a little look on it?

+lock-free mailing list

Hi bittnkr,

At first glance the algorithm at https://github.com/bittnkr/uniq looks
blocking and non-linearizable to me.
Very similar in nature to:
http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3t9sGs3RyOHG1rCi5KH-0LhP0AX7LdEiG0EaZH-2Xyzrw%40mail.gmail.com.


[lock-free] Re: 3-thread consensus.

2020-01-10 Thread Dmitry Vyukov
On Sat, Jan 11, 2020 at 4:09 AM bittnkr  wrote:
>
> Hello Dmitry and fellows from the group.
>
> If you look carefully, you will see that there is no blocking, just simple 
> CAS, even the buffer is a common buffer.

But consider you look inside of a spin lock, or you take a
spin-lock-based algorithm and inline all spin lock code into the
algorithm. What you will see is exactly the same -- no blocking, just
a CAS. However, it does not really change anything, it's still
blocking mutex-based algorithm.
One can also combine spin lock state with some data, e.g. a particular
value of a data member means "locked" and blocks progress of other
threads. It makes spin lock even less visible, but again does not
change anything on conceptual level.

If I am reading the algorithm correctly, if a thread is preempted
between these 2 operations:

 } while ( (data[t & mask]) || (CompareExchange(tail, t+1, t) != t) )
  data[t & mask] = item // now is safe to update the buffer

It effectively locks a mutex on the queue and blocks progress of all consumers.

There is also a similar block on the consumer side here:

 } while ( !(data[h & mask]) || CompareExchange(head, h+1, h) != h )
  data[h] = 0 // release the seat


> The queue only sleeps if it is empty out of it is full. But this is not 
> locking.
>
> All protection is done by the two atomic variables head an tail.
>
> The cost is constant near a single CAS. For any number of threads.
>
> Take a look on this benchmarks. They speaks for themselves.
>
> https://github.com/bittnkr/uniq/blob/master/README.md#benchmarks
>
> Besides I don't have a formal proof, we have a test with zero checksum.
>
> This is the results I have for a producer/consumer test pushing random 
> numbers is the queue.
>
> Creating 4 producers & 4 consumers
> to flow 10.000.000 items trough the queue.
>
> Produced: 10.743.668.245.000.000
> Consumed: 5.554.289.678.184.004
> Produced: 10.743.668.245.000.000
> Consumed: 15.217.833.969.059.643
> Produced: 10.743.668.245.000.000
> Consumed: 7.380.542.769.600.801
> Produced: 10.743.668.245.000.000
> Consumed: 14.822.006.563.155.552
>
> Checksum: 0 (it must be zero)
>
> The producers increase total and the consumers decrease. The result for 10M 
> random numbers is zero.
>
> Thanks for the advices, I'll investigate about this tools.
>
> Em qui, 9 de jan de 2020 04:58, Dmitry Vyukov  escreveu:
>>
>> On Wed, Jan 8, 2020 at 9:29 PM bittnkr  wrote:
>> >
>> > Dear Dmitry,
>> >
>> > I found a nice solution for the problem called 3 thread consensus, 
>> > considered impossible on the book The art of the multiprocessor 
>> > programming. I think that is a breakthrough.
>> >
>> > Debating on S.O. with someone about if the solution is solid or no, If it 
>> > is possible to occur data races, he referred relacy.
>> >
>> > So I'm writing you to ask your opinion about the solution.
>> >
>> > Can you take a little look on it?
>>
>> +lock-free mailing list
>>
>> Hi bittnkr,
>>
>> At first glance the algorithm at https://github.com/bittnkr/uniq looks
>> blocking and non-linearizable to me.
>> Very similar in nature to:
>> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>>
>> --
>> Dmitry Vyukov
>>
>> All about lockfree/waitfree algorithms, multicore, scalability,
>> parallel computing and related topics:
>> http://www.1024cores.net



-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3uSM54%3DLWCdkEnUyn9nsUbToRuwOws8ccHP%3DS6oYkMiBg%40mail.gmail.com.


[lock-free] Re: 3-thread consensus.

2020-01-11 Thread Dmitry Vyukov
On Sat, Jan 11, 2020 at 9:26 AM bittnkr  wrote:
>
> Good observations. Thank you.
>
> If the thread preempts on those points, the seat position will be held on 
> local variables h and t.

This blocks consumers from progressing (consuming the next produced items)
and is effectively a mutex. It makes the algorithm non-termination-safe
and non-lock-free.


> After the thread line is restored, the CAS can fail, and the loop will just 
> restart in normal flow, without any locking.
>
> I updated the docs, I think is clearer now.
>
> On Sat, Jan 11, 2020 at 05:38, Dmitry Vyukov  wrote:
>>
>> On Sat, Jan 11, 2020 at 4:09 AM bittnkr  wrote:
>> >
>> > Hello Dmitry and fellows from the group.
>> >
>> > If you look carefully, you will see that there is no blocking, just simple 
>> > CAS, even the buffer is a common buffer.
>>
>> But consider you look inside of a spin lock, or you take a
>> spin-lock-based algorithm and inline all spin lock code into the
>> algorithm. What you will see is exactly the same -- no blocking, just
>> a CAS. However, it does not really change anything, it's still
>> blocking mutex-based algorithm.
>> One can also combine spin lock state with some data, e.g. a particular
>> value of a data member means "locked" and blocks progress of other
>> threads. It makes spin lock even less visible, but again does not
>> change anything on conceptual level.
>>
>> If I am reading the algorithm correctly, if a thread is preempted
>> between these 2 operations:
>>
>>  } while ( (data[t & mask]) || (CompareExchange(tail, t+1, t) != t) )
>>   data[t & mask] = item // now is safe to update the buffer
>>
>> It effectively locks a mutex on the queue and blocks progress of all 
>> consumers.
>>
>> There is also a similar blocks on consumer side here:
>>
>>  } while ( !(data[h & mask]) || CompareExchange(head, h+1, h) != h )
>>   data[h] = 0 // release the seat
>>
>>
>> > The queue only sleeps if it is empty out of it is full. But this is not 
>> > locking.
>> >
>> > All protection is done by the two atomic variables head an tail.
>> >
>> > The cost is constant near a single CAS. For any number of threads.
>> >
>> > Take a look on this benchmarks. They speaks for themselves.
>> >
>> > https://github.com/bittnkr/uniq/blob/master/README.md#benchmarks
>> >
>> > Besides I don't have a formal proof, we have a test with zero checksum.
>> >
>> > This is the results I have for a producer/consumer test pushing random 
>> > numbers is the queue.
>> >
>> > Creating 4 producers & 4 consumers
>> > to flow 10.000.000 items trough the queue.
>> >
>> > Produced: 10.743.668.245.000.000
>> > Consumed: 5.554.289.678.184.004
>> > Produced: 10.743.668.245.000.000
>> > Consumed: 15.217.833.969.059.643
>> > Produced: 10.743.668.245.000.000
>> > Consumed: 7.380.542.769.600.801
>> > Produced: 10.743.668.245.000.000
>> > Consumed: 14.822.006.563.155.552
>> >
>> > Checksum: 0 (it must be zero)
>> >
>> > The producers increase total and the consumers decrease. The result for 
>> > 10M random numbers is zero.
>> >
>> > Thanks for the advices, I'll investigate about this tools.
>> >
>> > On Thu, Jan 9, 2020 at 04:58, Dmitry Vyukov  wrote:
>> >>
>> >> On Wed, Jan 8, 2020 at 9:29 PM bittnkr  wrote:
>> >> >
>> >> > Dear Dmitry,
>> >> >
>> >> > I found a nice solution for the problem called 3 thread consensus, 
>> >> > considered impossible on the book The art of the multiprocessor 
>> >> > programming. I think that is a breakthrough.
>> >> >
>> >> > Debating on S.O. with someone about if the solution is solid or no, If 
>> >> > it is possible to occur data races, he referred relacy.
>> >> >
>> >> > So I'm writing you to ask your opinion about the solution.
>> >> >
>> >> > Can you take a little look on it?
>> >>
>> >> +lock-free mailing list
>> >>
>> >> Hi bittnkr,
>> >>
>> >> At first glance the algorithm at https://github.com/bittnkr/uniq looks
>> >> blocking and non-linearizable to me.
>> >> Very similar in nature to:
>> >> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>> >>
>> >> --
>> >> Dmitry Vyukov
>> >>
>> >> All about lockfree/waitfree algorithms, multicore, scalability,
>> >> parallel computing and related topics:
>> >> http://www.1024cores.net
>>
>>
>>
>> --
>> Dmitry Vyukov
>>
>> All about lockfree/waitfree algorithms, multicore, scalability,
>> parallel computing and related topics:
>> http://www.1024cores.net



-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3sZwtyO7ZmSgaa5NswKhUU1YkzO0OBR7u9Ko8ugEboKUA%40mail.gmail.com.


[lock-free] Re: 3-thread consensus.

2020-01-12 Thread Dmitry Vyukov
On Sun, Jan 12, 2020 at 7:33 PM bittnkr  wrote:
>
>
> > This blocks consumes from progressing (consuming next produced items)
> and is effectively a mutex.
>
> Suppose the thread A got a local copy of tail in t then is preempted,
>
> another thread will get the same tail an get the seat normally.
>
> When the previous thread retakes the line, the CAS will fail, because the 
> seat was taken.
>
> Restarting the while without any kind of blocking.
>
> Where do you see a mutex here?

I mean preemption between a succeeding CAS and writing the element/NULL to the array.

If a thread is terminated at that point, the whole queue is broken (no
termination-safety).
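
A hypothetical repro of that failure mode, reduced to a single seat
(illustrative names and structure, not code from the repository): the
producer's CAS succeeds, the thread then stops before publishing, and a
consumer can never get past the claimed seat.

  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>

  std::atomic<int> seat{0};        // one seat of the ring; 0 means "empty"
  std::atomic<long> tail{0};

  int main() {
    std::thread producer([] {
      long t = tail.load();
      // The seat is empty and the CAS succeeds: the seat is now claimed...
      if (seat.load() == 0 && tail.compare_exchange_strong(t, t + 1)) {
        return;  // ...but the thread "dies" before `seat = item` runs.
      }
    });
    std::thread consumer([] {
      auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(1);
      while (seat.load() == 0) {   // would spin forever behind the claimed seat
        if (std::chrono::steady_clock::now() > deadline) {
          std::puts("consumer is stuck: the queue made no progress");
          return;                  // give up so the demo itself terminates
        }
      }
    });
    producer.join();
    consumer.join();
  }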


> On Sun, Jan 12, 2020 at 04:49, Dmitry Vyukov  wrote:
>>
>> On Sat, Jan 11, 2020 at 9:26 AM bittnkr  wrote:
>> >
>> > Good observations. Thank you.
>> >
>> > If the thread preempts on those points, the seat position will be held on 
>> > local variables h and t.
>>
>> This blocks consumes from progressing (consuming next produced items)
>> and is effectively a mutex. This makes algorithm non-termination-safe
>> and non-lock-free.
>>
>>
>> > After the thread line is restored, the CAS can fail, and the loop will 
>> > just restart in normal flow, without any locking.
>> >
>> > I updated the docs, I think is clearer now.
>> >
>> > On Sat, Jan 11, 2020 at 05:38, Dmitry Vyukov  wrote:
>> >>
>> >> On Sat, Jan 11, 2020 at 4:09 AM bittnkr  wrote:
>> >> >
>> >> > Hello Dmitry and fellows from the group.
>> >> >
>> >> > If you look carefully, you will see that there is no blocking, just 
>> >> > simple CAS, even the buffer is a common buffer.
>> >>
>> >> But consider you look inside of a spin lock, or you take a
>> >> spin-lock-based algorithm and inline all spin lock code into the
>> >> algorithm. What you will see is exactly the same -- no blocking, just
>> >> a CAS. However, it does not really change anything, it's still
>> >> blocking mutex-based algorithm.
>> >> One can also combine spin lock state with some data, e.g. a particular
>> >> value of a data member means "locked" and blocks progress of other
>> >> threads. It makes spin lock even less visible, but again does not
>> >> change anything on conceptual level.
>> >>
>> >> If I am reading the algorithm correctly, if a thread is preempted
>> >> between these 2 operations:
>> >>
>> >>  } while ( (data[t & mask]) || (CompareExchange(tail, t+1, t) != t) )
>> >>   data[t & mask] = item // now is safe to update the buffer
>> >>
>> >> It effectively locks a mutex on the queue and blocks progress of all 
>> >> consumers.
>> >>
>> >> There is also a similar blocks on consumer side here:
>> >>
>> >>  } while ( !(data[h & mask]) || CompareExchange(head, h+1, h) != h )
>> >>   data[h] = 0 // release the seat
>> >>
>> >>
>> >> > The queue only sleeps if it is empty out of it is full. But this is not 
>> >> > locking.
>> >> >
>> >> > All protection is done by the two atomic variables head an tail.
>> >> >
>> >> > The cost is constant near a single CAS. For any number of threads.
>> >> >
>> >> > Take a look on this benchmarks. They speaks for themselves.
>> >> >
>> >> > https://github.com/bittnkr/uniq/blob/master/README.md#benchmarks
>> >> >
>> >> > Besides I don't have a formal proof, we have a test with zero checksum.
>> >> >
>> >> > This is the results I have for a producer/consumer test pushing random 
>> >> > numbers is the queue.
>> >> >
>> >> > Creating 4 producers & 4 consumers
>> >> > to flow 10.000.000 items trough the queue.
>> >> >
>> >> > Produced: 10.743.668.245.000.000
>> >> > Consumed: 5.554.289.678.184.004
>> >> > Produced: 10.743.668.245.000.000
>> >> > Consumed: 15.217.833.969.059.643
>> >> > Produced: 10.743.668.245.000.000
>> >> > Consumed: 7.380.542.769.600.801
>> >> > Produced: 10.743.668.245.000.000
>> >> > Consumed: 14.822.006.563.155.552
>> >> >
>> >> > Checksum: 0 (it must be zero)
>> >> >

[lock-free] Re: MPMC atomic queue

2020-01-13 Thread Dmitry Vyukov
On Tue, Nov 5, 2019 at 5:48 AM Erez Strauss  wrote:
>
> Hi Dmitry,
>
> Good morning.
>
> Thank you for your mpmc queue.
>
> Recently, I published on github the following MPMC atomic queue.
>
> https://github.com/erez-strauss/lockfree_mpmc_queue
>
> It is atomic lockfree queue.
> I'll appreciate your feedback.
>
> Thank you
> Best wishes,
>
> Erez

Hi Erez,

+lock-free mailing list, there are lots of knowledgeable people and
some similar discussions/algorithms.

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3ttrG8q4wVUTB7OHSr6vrftS9DKbXz9fg7SdX9PY__4wQ%40mail.gmail.com.


[lock-free] Re: 3-thread consensus.

2020-01-12 Thread Dmitry Vyukov
+lock-free group again, please don't drop it

On Mon, Jan 13, 2020 at 8:19 AM bittnkr  wrote:
>
> If a thread dies at that point, my entire system is broken...

Which means it's not lock-free.

> But It can preempts without any problem at near zero cost.

Well, if a producer is preempted, consumers are blocked from
progressing. This is 100% equivalent to a mutex. If a thread is
preempted while holding a mutex, that does not result in any
correctness problems either.
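
The equivalence can be spelled out. Rewriting the publish step with the
implied lock made explicit (a sketch with illustrative names, not the
original code), the slot's emptiness is the lock word:

  #include <atomic>

  std::atomic<int> slot{0};            // 0 == empty == "locked" for consumers

  void produce(int item) {             // item must be positive
    int expected = 0;
    // "lock": claim the empty slot (the tail CAS in the real queue).
    while (!slot.compare_exchange_weak(expected, -1)) expected = 0;
    // A producer preempted or killed here is exactly a thread that stopped
    // while holding a mutex: consumers below cannot proceed until it resumes.
    slot.store(item);                  // "unlock": consumers may now pass
  }

  int consume() {                      // single consumer assumed; the point
    int item;                          // here is the producer-side window
    while ((item = slot.load()) <= 0) { /* spin until something is published */ }
    slot.store(0);                     // hand the slot back to producers
    return item;
  }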


> On Mon, Jan 13, 2020 at 03:55, Dmitry Vyukov  wrote:
>>
>> On Sun, Jan 12, 2020 at 7:33 PM bittnkr  wrote:
>> >
>> >
>> > > This blocks consumes from progressing (consuming next produced items)
>> > and is effectively a mutex.
>> >
>> > Suppose the thread A got a local copy of tail in t then is preempted,
>> >
>> > another thread will get the same tail an get the seat normally.
>> >
>> > When the previous thread retakes the line, the CAS will fail, because the 
>> > seat was taken.
>> >
>> > Restarting the while without any kind of blocking.
>> >
>> > Where do you see a mutex here?
>>
>> I mean preemption between succeeding CAS and writing element /NULL to the 
>> array.
>>
>> If a thread is terminated at that point, the whole queue is broken (no
>> termination-safety).
>>
>>
>> > On Sun, Jan 12, 2020 at 04:49, Dmitry Vyukov  wrote:
>> >>
>> >> On Sat, Jan 11, 2020 at 9:26 AM bittnkr  wrote:
>> >> >
>> >> > Good observations. Thank you.
>> >> >
>> >> > If the thread preempts on those points, the seat position will be held 
>> >> > on local variables h and t.
>> >>
>> >> This blocks consumes from progressing (consuming next produced items)
>> >> and is effectively a mutex. This makes algorithm non-termination-safe
>> >> and non-lock-free.
>> >>
>> >>
>> >> > After the thread line is restored, the CAS can fail, and the loop will 
>> >> > just restart in normal flow, without any locking.
>> >> >
>> >> > I updated the docs, I think is clearer now.
>> >> >
>> >> > On Sat, Jan 11, 2020 at 05:38, Dmitry Vyukov  wrote:
>> >> >>
>> >> >> On Sat, Jan 11, 2020 at 4:09 AM bittnkr  wrote:
>> >> >> >
>> >> >> > Hello Dmitry and fellows from the group.
>> >> >> >
>> >> >> > If you look carefully, you will see that there is no blocking, just 
>> >> >> > simple CAS, even the buffer is a common buffer.
>> >> >>
>> >> >> But consider you look inside of a spin lock, or you take a
>> >> >> spin-lock-based algorithm and inline all spin lock code into the
>> >> >> algorithm. What you will see is exactly the same -- no blocking, just
>> >> >> a CAS. However, it does not really change anything, it's still
>> >> >> blocking mutex-based algorithm.
>> >> >> One can also combine spin lock state with some data, e.g. a particular
>> >> >> value of a data member means "locked" and blocks progress of other
>> >> >> threads. It makes spin lock even less visible, but again does not
>> >> >> change anything on conceptual level.
>> >> >>
>> >> >> If I am reading the algorithm correctly, if a thread is preempted
>> >> >> between these 2 operations:
>> >> >>
>> >> >>  } while ( (data[t & mask]) || (CompareExchange(tail, t+1, t) != t) )
>> >> >>   data[t & mask] = item // now is safe to update the buffer
>> >> >>
>> >> >> It effectively locks a mutex on the queue and blocks progress of all 
>> >> >> consumers.
>> >> >>
>> >> >> There is also a similar blocks on consumer side here:
>> >> >>
>> >> >>  } while ( !(data[h & mask]) || CompareExchange(head, h+1, h) != h )
>> >> >>   data[h] = 0 // release the seat
>> >> >>
>> >> >>
>> >> >> > The queue only sleeps if it is empty out of it is full. But this is 
>> >> >> > not locking.
>> >> >> >
>> >> >> > All protection is done by the two atomic variables head an tail.
>> >> >> >

Re: [lock-free] Re: 3-thread consensus.

2020-01-14 Thread Dmitry Vyukov
On Tue, Jan 14, 2020 at 4:03 PM  wrote:
>
> Hi Manuel,
>
> How would a processing line be broken between the CAS and the release of the 
> seat?
>
> This will happen only if the S.O. preempt the thread on that point and never 
> get back.
>
> This is what I mean with the "my entire system is broken"
>
> Only a S.O. scheduling failure can produce that.
>
> Otherwise is impossible that a thread do not terminate that processing line.
>
> Can you think another possibility?

Let me ask first: what is your definition of a lock-free algorithm?
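
For contrast, a textbook operation that does satisfy the usual definition
(quoted from Manuel below): a thread stalled at any point never prevents the
remaining threads from completing their operations. A minimal sketch of a
Treiber-style stack push:

  #include <atomic>

  struct Node { int value; Node* next; };
  std::atomic<Node*> top{nullptr};

  void push(Node* n) {
    n->next = top.load();
    // If this thread stalls or dies anywhere in this loop, `top` still points
    // to a valid list, and every other thread's CAS can still succeed:
    // system-wide progress is preserved, which is the lock-free property.
    while (!top.compare_exchange_weak(n->next, n)) {
      // On failure n->next is reloaded with the current top; just retry.
    }
  }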

> On Tuesday, January 14, 2020 at 06:55:14 UTC-2, Manuel Pöter wrote:
>>
>> The lock-free property guarantees that at any time at least one thread is 
>> making progress in a finite number of steps. Or to put this more generally: 
>> a stalled thread must not cause all other threads to stall indefinitely.
>> The arguments about lock-freedom (or the lack thereof) are usually based on 
>> the (somewhat artificial) assumption that any thread can fail at any time - 
>> i.e., simply stop executing. If such a failed thread causes the whole system 
>> to grind to a halt, then it is not lock-free.
>>
>> You wrote yourself that "If a thread dies at that point, my entire system is 
>> broken...", so it is definitely not lock-free.
>> That being said, lock-freedom is a nice property, but by no means 
>> indispensable for scalability. Several of Dmitry's algorithms are not 
>> lock-free (like the bounded MPMC queue), which does not mean that they do 
>> not scale.
>>
>> Alistarh et al. showed that lock-free algorithms are practically wait-free. 
>> I suppose the same could be shown for several concurrent algorithms that are 
>> not strictly lock-free. So it is not so much important whether an algorithm 
>> is lock-free or not, but whether it works well in practice for the use case 
>> it is designed for.
>>
>>
>> On Tuesday, 14 January 2020 03:16:23 UTC+1, bittnkr wrote:
>>>
>>> >
>>> Well, if producer is preempted, consumers are blocked from progressing.
>>>
>>> Sorry, but this isn't true.
>>>
>>> The consumers are preempted only in case of a empty queue. Which isn't a 
>>> lock.
>>>
>>> Meaning that there is nothing to do. If you don't have active producers, 
>>> three is nothing to consume.
>>>
>>> How can you see a lock there?
>>>
>>> Please
>>>
>>> On Mon, Jan 13, 2020 at 04:35, Dmitry Vyukov  wrote:
>>>>
>>>> +lock-free group again, please don't drop it
>>>>
>>>> On Mon, Jan 13, 2020 at 8:19 AM bittnkr  wrote:
>>>> >
>>>> > If a thread dies at that point, my entire system is broken...
>>>>
>>>> Which means it's not lock-free.
>>>>
>>>> > But It can preempts without any problem at near zero cost.
>>>>
>>>> Well, if producer is preempted, consumers are blocked from
>>>> progressing. This is 100% equivalent to a mutex. If a thread is
>>>> preempted while holding a mutex, it also does not result in any
>>>> correctness problems.
>>>>
>>>>
>>>> > On Mon, Jan 13, 2020 at 03:55, Dmitry Vyukov  wrote:
>>>> >>
>>>> >> On Sun, Jan 12, 2020 at 7:33 PM bittnkr  wrote:
>>>> >> >
>>>> >> >
>>>> >> > > This blocks consumes from progressing (consuming next produced 
>>>> >> > > items)
>>>> >> > and is effectively a mutex.
>>>> >> >
>>>> >> > Suppose the thread A got a local copy of tail in t then is preempted,
>>>> >> >
>>>> >> > another thread will get the same tail an get the seat normally.
>>>> >> >
>>>> >> > When the previous thread retakes the line, the CAS will fail, because 
>>>> >> > the seat was taken.
>>>> >> >
>>>> >> > Restarting the while without any kind of blocking.
>>>> >> >
>>>> >> > Where do you see a mutex here?
>>>> >>
>>>> >> I mean preemption between succeeding CAS and writing element /NULL to 
>>>> >> the array.
>>>> >>
>>>> >> If a thread is terminated at that point, the whole queue is broken (no
>>>> >> termination-safety).

Re: [lock-free] Linux membarrier syscall

2023-06-13 Thread Dmitry Vyukov
On Fri, Jun 9, 2023 at 8:00 AM Joseph Seigh  wrote:
>
> Anyone know the difference between MEMBARRIER_CMD_PRIVATE_EXPEDITED and 
> MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE for membarrier.
>
> It would be nice if it actually does what I think it does.   I really don't 
> want to have to read from procfs and do a lot of parsing and such.

Nice to see you here again, Joe

The docs are quite cryptic:
"all its running threads siblings have executed a core serializing instruction":
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/membarrier.h#L155

In reality it does this:
https://elixir.bootlin.com/linux/latest/source/kernel/sched/membarrier.c#L322
https://elixir.bootlin.com/linux/latest/source/kernel/sched/membarrier.c#L184

which boils down to:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/sync_core.h#L42

 * This function forces the icache and prefetched instruction stream to
 * catch up with reality in two very specific cases:
 *
 *  a) Text was modified using one virtual address and is about to be executed
 * from the same physical page at a different virtual address.
 *
 *  b) Text was modified on a different CPU, may subsequently be
 * executed on this CPU, and you want to make sure the new version
 * gets executed.  This generally means you're calling this in an IPI.


So it looks like it's intended for self-modifying code...
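
For completeness: the plain (non-SYNC_CORE) private expedited command is the
one typically used as a process-wide asymmetric fence. A minimal usage sketch,
assuming Linux 4.14+ where these commands exist; error handling is reduced to
a local fallback:

  #include <linux/membarrier.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  #include <atomic>
  #include <cstdio>

  static long sys_membarrier(int cmd, unsigned int flags) {
    return syscall(__NR_membarrier, cmd, flags);
  }

  // Call once at startup: expedited commands must be registered before use.
  bool init_membarrier() {
    return sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0;
  }

  // Heavy side of an asymmetric barrier: issues a memory barrier on every CPU
  // currently running a thread of this process, so fast-path threads can rely
  // on plain ordering plus compiler barriers.
  void process_wide_barrier() {
    if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) != 0) {
      std::perror("membarrier");
      // Not equivalent to the expedited barrier; a real fallback would need a
      // different strategy (e.g. seq_cst fences on both sides).
      std::atomic_thread_fence(std::memory_order_seq_cst);
    }
  }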

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3sdB0%2BrTodYvHXthgUV7xSh9iaEgGkuoJNP8c_z7gpH2g%40mail.gmail.com.


Re: [lock-free] Linux membarrier syscall

2023-06-16 Thread Dmitry Vyukov
On Fri, Jun 16, 2023 at 1:39 PM Joseph Seigh  wrote:
>
> rseq?   The concept seems vaguely familiar. :)

Yes, I meant mostly that it can now be played with easily and used in
production.

> Might not get past POC w/ smrproxy.  The idea was to write portable code 
> using c11/c17 atomics  but I don't thing that's possible given how borked its 
> memory model semantics are.  And I mean even more borked than their 
> discussions of what they think the memory model problems are.  And I really 
> don't feel like writing non portable inline assembly just to ensure the code 
> is doing what it should be doing.
>
> Joe Seigh
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups 
> "Scalable Synchronization Algorithms" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to lock-free+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/lock-free/ae8349fc-a322-421f-8acd-9a8c4aa3a32an%40googlegroups.com.



-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3t%2BFP3Ujb%2BKugHSc8qcjSYowmwKVh0xXiY2YK3Y8vR3Zg%40mail.gmail.com.


Re: [lock-free] Linux membarrier syscall

2023-06-14 Thread Dmitry Vyukov
On Tue, Jun 13, 2023 at 9:07 PM Joseph Seigh  wrote:
>
> Ok, thanks.   I'll examine it in a bit more detail, though at first glance it 
> looks like they're using ipi to speed things up by not having to wait for 
> slower occurring kernel events.

Yes, my understanding is that it always sends IPIs rather than waiting
for something passively. But it knows which CPUs are actually running
threads of the current process at the moment.

> I'm doing some work on  a hazard pointer based proxy collector w/ memory 
> barriers that I suggested ages ago.  I posted some smrproxy timing 
> comparisons.  I was going to rework the atomic reference counted proxy 
> collector in c11/c17 atomics as well but the approximated timings are so bad 
> compared to smrproxy that I think I will pass on that.

Yes, I guess it's expected for an atomic RMW on a shared location.
But it may be interesting to benchmark with 10K threads instead of 10
(10K is real for our server scenarios). I assume any HP approach needs
to iterate over all threads, so 10K may penalize it.

Btw, have you looked at rseq? It's a pretty interesting facility:
https://kib.kiev.ua/kib/rseq.pdf

We use it in tcmalloc for per-CPU caches:
https://google.github.io/tcmalloc/rseq.html

Say, it can make HP use per-CPU registers instead of per-thread ones, and
membarrier supports aborting all concurrent rseq sequences. This can
allow for even more interesting algorithms, e.g. registering a hazard
pointer even without a loop in an rseq section (effectively an atomic
memory-to-memory move wrt the current CPU).
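
A sketch of that shape (illustrative only: real code would need an rseq
critical section, or at least sched_getcpu() plus the membarrier-based abort
mentioned above, to make the per-CPU claim safe against migration; none of
that is shown here). The point is that the reclaimer's scan is bounded by the
CPU count rather than by the thread count, which is what matters for the
10K-thread case:

  #include <sched.h>       // sched_getcpu (glibc, _GNU_SOURCE)
  #include <atomic>

  constexpr int kMaxCpus = 1024;

  // One hazard-pointer slot per CPU instead of per thread.
  std::atomic<void*> hp_slot[kMaxCpus];

  void protect(void* ptr) {
    // NOTE: without rseq this store can land on a stale CPU's slot after a
    // migration; shown only to illustrate the data layout.
    hp_slot[sched_getcpu()].store(ptr, std::memory_order_release);
  }

  // A pointer may be reclaimed only if no slot holds it. The scan is
  // O(#CPUs), independent of how many threads (e.g. 10K) exist.
  bool is_protected(void* ptr) {
    for (int cpu = 0; cpu < kMaxCpus; cpu++) {
      if (hp_slot[cpu].load(std::memory_order_acquire) == ptr) return true;
    }
    return false;
  }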



> Joe Seigh
>
> On Tuesday, June 13, 2023 at 3:03:34 AM UTC-4 Dmitry Vyukov wrote:
>>
>> On Fri, Jun 9, 2023 at 8:00 AM Joseph Seigh  wrote:
>> >
>> > Anyone know the difference between MEMBARRIER_CMD_PRIVATE_EXPEDITED and 
>> > MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE for membarrier.
>> >
>> > It would be nice if it actually does what I think it does. I really don't 
>> > want to have to read from procfs and do a lot of parsing and such.
>>
>> Nice to see you here again, Joe
>>
>> The docs are quite cryptic:
>> "all its running threads siblings have executed a core serializing 
>> instruction":
>> https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/membarrier.h#L155
>>
>> In reality it does this:
>> https://elixir.bootlin.com/linux/latest/source/kernel/sched/membarrier.c#L322
>> https://elixir.bootlin.com/linux/latest/source/kernel/sched/membarrier.c#L184
>>
>> which boils down to:
>> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/sync_core.h#L42
>>
>> * This function forces the icache and prefetched instruction stream to
>> * catch up with reality in two very specific cases:
>> *
>> * a) Text was modified using one virtual address and is about to be executed
>> * from the same physical page at a different virtual address.
>> *
>> * b) Text was modified on a different CPU, may subsequently be
>> * executed on this CPU, and you want to make sure the new version
>> * gets executed. This generally means you're calling this in an IPI.
>>
>>
>> So it looks like it's intended for self-modifying code...
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups 
> "Scalable Synchronization Algorithms" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to lock-free+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/lock-free/02316b45-8d28-49f3-af60-c49ae27518ban%40googlegroups.com.



-- 
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

-- 

--- 
You received this message because you are subscribed to the Google Groups 
"Scalable Synchronization Algorithms" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to lock-free+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/lock-free/CAEeQi3vEt94W79jVL6ru2%3DT1FeDZAUai6EPJZZ1sOptsVQZ0XQ%40mail.gmail.com.