Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Peter Zijlstra
On Fri, Oct 28, 2016 at 04:33:50PM +0200, Tom Gundersen wrote:
> Ah, I see, we are talking past each other.


Ah I see where my reasoning went wobbly, not sure how to fully express
that yet. I think your solution is stronger than strictly required
though, but I'm not sure there's a better one. I'll think on it.


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Tom Gundersen
On Fri, Oct 28, 2016 at 3:58 PM, Peter Zijlstra  wrote:
> On Fri, Oct 28, 2016 at 03:47:58PM +0200, Tom Gundersen wrote:
>> On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra  wrote:
>> > On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
>
>> > And this, precisely, is what generates all the complexity found in this
>> > patch.  You want to strictly provide more than causality, which does
>> > not, as per the argument above, provide this at all.
>> >
>> > You're providing a semi-global ordering of things that are themselves
>> > not actually ordered.
>>
>> We are providing two things: causality (as in your physics example
>> above), and consistency (which, I agree, is cute, but not necessarily
>> crucial). However, the complexity comes from causality. Consistency is
>> trivial. The only thing needed for consistency is to tag each message
>> by its sender and use this to resolve conflicts in the ordering. The
>> alternative would be to just let these entries order arbitrarily
>> instead, but conceptually it would not be simpler and it would only
>> save us a few lines of code.
>
> Earlier you wrote:
>
>> >> To make this work with multicast, we must stage messages first, then
>> >> commit on a second round. That is, we must find some way to iterate
>> >> over all clocks before committing, but at the same time preventing any
>> >> races. The multicast-stability as you just described we get for free
>> >> by introducing the second-level ordering via sender-address.
>
> But you don't need the two-pass thing at all for causality. The entire
> two-pass thing, and the serialization, is part of the consistency thing.
>
> This is not virtually free.
>
> For causality, all you need is a single iteration, delivering the
> message one after the other, only ever doing local clock movements. You
> do not need to find the max clock in the multicast set and avoid races
> etc..

Ah, I see, we are talking past each other. The property we do want
(which is not trivial) is that we do not want to observe the effect
before the cause. If an event at A causes an event at B, then the two
events should be guaranteed to be observed at C in that order. i.e.,
if you send a multi-cast message from A to B and C and as a result of
receiving the message, B sends a message to C, we want to be
guaranteed that C receives the latter after the former.

If this property is not wanted, then (repeated) unicast can in most
cases be used instead of multi-cast (and a natural optimization, which
we left out for now, would be to skip the staging round for unicast
messages).

Cheers,

Tom


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Peter Zijlstra
On Fri, Oct 28, 2016 at 03:47:58PM +0200, Tom Gundersen wrote:
> On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra  wrote:
> > On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:

> > And this, precisely, is what generates all the complexity found in this
> > patch.  You want to strictly provide more than causality, which does
> > not, as per the argument above, provide this at all.
> >
> > You're providing a semi-global ordering of things that are themselves
> > not actually ordered.
> 
> We are providing two things: causality (as in your physics example
> above), and consistency (which, I agree, is cute, but not necessarily
> crucial). However, the complexity comes from causality. Consistency is
> trivial. The only thing needed for consistency is to tag each message
> by its sender and use this to resolve conflicts in the ordering. The
> alternative would be to just let these entries order arbitrarily
> instead, but conceptually it would not be simpler and it would only
> save us a few lines of code.

Earlier you wrote:

> >> To make this work with multicast, we must stage messages first, then
> >> commit on a second round. That is, we must find some way to iterate
> >> over all clocks before committing, but at the same time preventing any
> >> races. The multicast-stability as you just described we get for free
> >> by introducing the second-level ordering via sender-address.

But you don't need the two-pass thing at all for causality. The entire
two-pass thing, and the serialization, is part of the consistency thing.

This is not virtually free.

For causality, all you need is a single iteration, delivering the
message one after the other, only ever doing local clock movements. You
do not need to find the max clock in the multicast set and avoid races
etc..


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Tom Gundersen
On Fri, Oct 28, 2016 at 3:33 PM, Peter Zijlstra  wrote:
> On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
>> On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra  wrote:
>> > On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
>> >
>> >> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
>> >> the time they were sent. Moreover, atomic delivery of messages to
>> >> multiple queues are supported, without any global synchronization, i.e.,
>> >> the order of message delivery is consistent across queues.
>> >>
>> >> Messages can be destined for multiple queues, hence, we need to be
>> >> careful that all queues get a consistent order of incoming messages.
>> >
>> > So I read that to mean that if A and B both send a multi-cast message to
>> > C and D, the messages will appear in the same order for both C and D.
>>
>> That is one of the ordering guarantees, yes.
>>
>> > Why is this important? It seem that this multi-cast ordering generates
>> > much of the complexity of this patch while this Changelog fails to
>> > explain why this is a desired property.
>>
>> I don't think this is the case. The most important guarantee we give
>> is causal ordering.
>
> C and D not observing the message in the same order is consistent with
> causality (and actual physics). The cause is A sending something the
> effect is C receiving something. These two events must be ordered (which
> yields the partial order). But there is no guarantee that different
> observers would observe the same order. Esp. since A and B do not share
> a clock and these events are not in fact ordered themselves.
>
> When we go back to the example of special relativity, as per the paper,
> this is trivially observable if we put A and C together in a frame of
> reference and B and D in a different frame and have the two frames move
> (at a significant fraction of the speed of light) relative to one
> another. The signal, being an emission of light, would not arrive at
> both observers in the same order (if the signal was given sufficiently
> 'simultaneous')
>
>> To make this work with multicast, we must stage messages first, then
>> commit on a second round. That is, we must find some way to iterate
>> over all clocks before committing, but at the same time preventing any
>> races. The multicast-stability as you just described we get for free
>> by introducing the second-level ordering via sender-address.
>
> And this, precisely, is what generates all the complexity found in this
> patch.  You want to strictly provide more than causality, which does
> not, as per the argument above, provide this at all.
>
> You're providing a semi-global ordering of things that are themselves
> not actually ordered.

We are providing two things: causality (as in your physics example
above), and consistency (which, I agree, is cute, but not necessarily
crucial). However, the complexity comes from causality. Consistency is
trivial. The only thing needed for consistency is to tag each message
by its sender and use this to resolve conflicts in the ordering. The
alternative would be to just let these entries order arbitrarily
instead, but conceptually it would not be simpler and it would only
save us a few lines of code.

>> Stability in multicasts without causal order is not necessarily a crucial
>> feature. However, note that if this ordering is given, it allows reducing
>> the number of round-trips in dependent systems. Imagine a daemon
>> reacting to a set of events from different sources. If the actions of that
>> daemon are solely defined by incoming events, someone else can
>> deduce the actions the daemon took without requiring the daemon to
>> send out events by itself. That is, you can just watch the events on the
>> system, and validly deduce the state of such daemon.
>>
>> Example: There is a configuration daemon that sends events when
>> configuration is changed. And there is a hotplug daemon that sends
>> events when devices are hotplugged. You get an event that the "default
>> mute-state" for audio devices was changed, after it you get a
>> hotplugged audio device. You can now rely on the audio daemon to get
>> the events in the same order, and hence apply the new "default
>> mute-state" to the new device. No need to query the audio daemon
>> whether the new device is muted.
>
> Which is cute; but is it worth the pain?
>
>> But as I said, the causal ordering is what we really want.
>> Multicast-stability is just a nice side-effect.
>
> I'm saying they're not the same thing and multi-cast stability isn't at
> all implied.

Yeah, we agree. These are orthogonal concepts. What I meant is that
once we have causality, getting consistency as a side-effect is
virtually free.


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Peter Zijlstra
On Fri, Oct 28, 2016 at 01:33:25PM +0200, Tom Gundersen wrote:
> On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra  wrote:
> > On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
> >
> >> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
> >> the time they were sent. Moreover, atomic delivery of messages to
> >> multiple queues are supported, without any global synchronization, i.e.,
> >> the order of message delivery is consistent across queues.
> >>
> >> Messages can be destined for multiple queues, hence, we need to be
> >> careful that all queues get a consistent order of incoming messages.
> >
> > So I read that to mean that if A and B both send a multi-cast message to
> > C and D, the messages will appear in the same order for both C and D.
> 
> That is one of the ordering guarantees, yes.
> 
> > Why is this important? It seem that this multi-cast ordering generates
> > much of the complexity of this patch while this Changelog fails to
> > explain why this is a desired property.
> 
> I don't think this is the case. The most important guarantee we give
> is causal ordering.

C and D not observing the message in the same order is consistent with
causality (and actual physics). The cause is A sending something the
effect is C receiving something. These two events must be ordered (which
yields the partial order). But there is no guarantee that different
observers would observe the same order. Esp. since A and B do not share
a clock and these events are not in fact ordered themselves.

When we go back to the example of special relativity, as per the paper,
this is trivially observable if we put A and C together in a frame of
reference and B and D in a different frame and have the two frames move
(at a significant fraction of the speed of light) relative to one
another. The signal, being an emission of light, would not arrive at
both observers in the same order (if the signal was given sufficiently
'simultaneous')

> To make this work with multicast, we must stage messages first, then
> commit on a second round. That is, we must find some way to iterate
> over all clocks before committing, but at the same time preventing any
> races. The multicast-stability as you just described we get for free
> by introducing the second-level ordering via sender-address.

And this, precisely, is what generates all the complexity found in this
patch.  You want to strictly provide more than causality, which does
not, as per the argument above, provide this at all.

You're providing a semi-global ordering of things that are themselves
not actually ordered.

> Stability in multicasts without causal order is not necessarily a crucial
> feature. However, note that if this ordering is given, it allows reducing
> the number of round-trips in dependent systems. Imagine a daemon
> reacting to a set of events from different sources. If the actions of that
> daemon are solely defined by incoming events, someone else can
> deduce the actions the daemon took without requiring the daemon to
> send out events by itself. That is, you can just watch the events on the
> system, and validly deduce the state of such daemon.
> 
> Example: There is a configuration daemon that sends events when
> configuration is changed. And there is a hotplug daemon that sends
> events when devices are hotplugged. You get an event that the "default
> mute-state" for audio devices was changed, after it you get a
> hotplugged audio device. You can now rely on the audio daemon to get
> the events in the same order, and hence apply the new "default
> mute-state" to the new device. No need to query the audio daemon
> whether the new device is muted.

Which is cute; but is it worth the pain?

> But as I said, the causal ordering is what we really want.
> Multicast-stability is just a nice side-effect.

I'm saying they're not the same thing and multi-cast stability isn't at
all implied.


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-28 Thread Tom Gundersen
On Thu, Oct 27, 2016 at 6:43 PM, Peter Zijlstra  wrote:
> On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
>
>> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
>> the time they were sent. Moreover, atomic delivery of messages to
>> multiple queues are supported, without any global synchronization, i.e.,
>> the order of message delivery is consistent across queues.
>>
>> Messages can be destined for multiple queues, hence, we need to be
>> careful that all queues get a consistent order of incoming messages.
>
> So I read that to mean that if A and B both send a multi-cast message to
> C and D, the messages will appear in the same order for both C and D.

That is one of the ordering guarantees, yes.

> Why is this important? It seem that this multi-cast ordering generates
> much of the complexity of this patch while this Changelog fails to
> explain why this is a desired property.

I don't think this is the case. The most important guarantee we give
is causal ordering. To make this work with multicast, we must stage
messages first, then commit on a second round. That is, we must find
some way to iterate over all clocks before committing, but at the same
time preventing any races. The multicast-stability as you just described
we get for free by introducing the second-level ordering via
sender-address.

Stability in multicasts without causal order is not necessarily a crucial
feature. However, note that if this ordering is given, it allows reducing
the number of round-trips in dependent systems. Imagine a daemon
reacting to a set of events from different sources. If the actions of that
daemon are solely defined by incoming events, someone else can
deduce the actions the daemon took without requiring the daemon to
send out events by itself. That is, you can just watch the events on the
system, and validly deduce the state of such daemon.

Example: There is a configuration daemon that sends events when
configuration is changed. And there is a hotplug daemon that sends
events when devices are hotplugged. You get an event that the "default
mute-state" for audio devices was changed, after it you get a
hotplugged audio device. You can now rely on the audio daemon to get
the events in the same order, and hence apply the new "default
mute-state" to the new device. No need to query the audio daemon
whether the new device is muted.

But as I said, the causal ordering is what we really want.
Multicast-stability is just a nice side-effect.

It might also be note mentioning: Both Android Binder and Chromium
Mojo make sure they provide causal ordering, since they run into real
issues. Binder allows placing multiple messages under the same
binder-lock, and Mojo provides Associated Interfaces [1]. DBus makes
sure to provide those ordering guarantees as well.

>> We
>> define the concept of `global order' to provide a basic set of
>> guarantees. This global order is a partial order on the set of all
>> messages. The order is defined as:
>>
>> 1) If a message B was queued *after* a message A, then: A < B
>>
>> 2) If a message B was queued *after* a message A was dequeued,
>>then: A < B
>>
>> 3) If a message B was dequeued *after* a message A on the same queue,
>>then: A < B
>>
>> (Note: Causality is honored. `after' and `before' do not refer to
>>  the same task, nor the same queue, but rather any kind of
>>  synchronization between the two operations.)
>>
>> The queue object implements this global order in a lockless fashion. It
>> solely relies on a distributed clock on each queue. Each message to be
>> sent causes a clock tick on the local clock and on all destination
>> clocks. Furthermore, all clocks are synchronized, meaning they're
>> fast-forwarded in case they're behind the highest of all participating
>> peers. No global state tracking is involved.
>
> Yet the code does compares on more than just timestamps. Why are these
> secondary (and even tertiary) ordering required?

Lamport Timestamps are guaranteed to be unique per-sender, but a receiving
queue can still contain messages with the same timestamp (from different
senders). That is, if two multicasts overlap, they might end up with the same
timestamp, if, and only if, they can have no causal relationship
(i.e., the ioctls
are called concurrently). We want to extend this partial order, though. We
want to provide a stable order in those cases (as described above), so we
need a secondary order (we simply pick the memory address of the sender).
This guarantees that all receivers get the same order of all messages (even
if they have equal timestamps).

Note that equal timestamps only happen if entries block each other.
Hence, we can use the memory address as secondary order, since we know
it is unique in those cases (and cannot be re-used).

Cheers,

Tom

[1] 
https://docs.google.com/document/d/1ENDDzACX4hplfQ8cCHGo_rXd3IHTu5H4hEZ44Cu8KVs


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-27 Thread Peter Zijlstra
On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:

> A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
> the time they were sent. Moreover, atomic delivery of messages to
> multiple queues are supported, without any global synchronization, i.e.,
> the order of message delivery is consistent across queues.
> 
> Messages can be destined for multiple queues, hence, we need to be
> careful that all queues get a consistent order of incoming messages.

So I read that to mean that if A and B both send a multi-cast message to
C and D, the messages will appear in the same order for both C and D.

Why is this important? It seem that this multi-cast ordering generates
much of the complexity of this patch while this Changelog fails to
explain why this is a desired property.


> We
> define the concept of `global order' to provide a basic set of
> guarantees. This global order is a partial order on the set of all
> messages. The order is defined as:
> 
> 1) If a message B was queued *after* a message A, then: A < B
> 
> 2) If a message B was queued *after* a message A was dequeued,
>then: A < B
> 
> 3) If a message B was dequeued *after* a message A on the same queue,
>then: A < B
> 
> (Note: Causality is honored. `after' and `before' do not refer to
>  the same task, nor the same queue, but rather any kind of
>  synchronization between the two operations.)
> 
> The queue object implements this global order in a lockless fashion. It
> solely relies on a distributed clock on each queue. Each message to be
> sent causes a clock tick on the local clock and on all destination
> clocks. Furthermore, all clocks are synchronized, meaning they're
> fast-forwarded in case they're behind the highest of all participating
> peers. No global state tracking is involved.

Yet the code does compares on more than just timestamps. Why are these
secondary (and even tertiary) ordering required?


Re: [RFC v1 06/14] bus1: util - queue utility library

2016-10-27 Thread Peter Zijlstra
On Wed, Oct 26, 2016 at 09:18:02PM +0200, David Herrmann wrote:
> Messages can be destined for multiple queues, hence, we need to be
> careful that all queues get a consistent order of incoming messages. We
> define the concept of `global order' to provide a basic set of
> guarantees. This global order is a partial order on the set of all
> messages. The order is defined as:

Ah, ok. So it _is_ a partial order only. I got confused by earlier
reports, and the term 'global order' in general, to think you did a
total order. And then I wondered wth you needed total ordering for.

Maybe best to scrub 'global order' from the entire text. It's not
helpful.


[RFC v1 06/14] bus1: util - queue utility library

2016-10-26 Thread David Herrmann
From: Tom Gundersen 

(Please refer to 'Lamport Timestamps', the concept of
 'happened-before', and 'causal ordering'. The queue implementation
 has its roots in Lamport Timestamps, treating a set of local CPUs
 as a distributed system to avoid any global synchronization.)

A bus1 message queue is a FIFO, i.e., messages are linearly ordered by
the time they were sent. Moreover, atomic delivery of messages to
multiple queues are supported, without any global synchronization, i.e.,
the order of message delivery is consistent across queues.

Messages can be destined for multiple queues, hence, we need to be
careful that all queues get a consistent order of incoming messages. We
define the concept of `global order' to provide a basic set of
guarantees. This global order is a partial order on the set of all
messages. The order is defined as:

1) If a message B was queued *after* a message A, then: A < B

2) If a message B was queued *after* a message A was dequeued,
   then: A < B

3) If a message B was dequeued *after* a message A on the same queue,
   then: A < B

(Note: Causality is honored. `after' and `before' do not refer to
 the same task, nor the same queue, but rather any kind of
 synchronization between the two operations.)

The queue object implements this global order in a lockless fashion. It
solely relies on a distributed clock on each queue. Each message to be
sent causes a clock tick on the local clock and on all destination
clocks. Furthermore, all clocks are synchronized, meaning they're
fast-forwarded in case they're behind the highest of all participating
peers. No global state tracking is involved.

During a message transaction, we first queue a message as 'staging'
entry in each destination with a preliminary timestamp. This timestamp
is explicitly odd numbered. Any odd numbered timestamp is considered
'staging' and causes *any* message ordered after it to be blocked until
it is no longer staging. This allows us to queue the message in parallel
with any racing multicast, and be guaranteed that all possible conflicts
are blocked until we eventually commit a transaction. To commit a
transaction (after all staging entries are queued), we choose the
highest timestamp we have seen across all destinations and re-queue all
our entries on each peer using that timestamp. Here we use a commit
timestamp (even numbered).

With this in mind, we define that a client can only dequeue messages
from its queue that have an even timestamp. Furthermore, if there is a
message queued with an odd timestamp that is lower than the even
timestamp of another message, then neither message can be dequeued.
They're considered to be in-flight conflicts. This guarantees that two
concurrent multicast messages can be queued without any *global* locks,
but either can only be dequeued by a peer if their ordering has been
established (via commit timestamps).

NOTE: A fully committed message is not guaranteed to be ready to be
  dequeued as it may be blocked by a staging entry. This means
  that there is an arbitrary (though bounded) time from a
  message transaction completing when the queue may still appear
  to be empty. In other words, message transmission is not
  instantaneous. It would be possible to change this at the
  cost of shortly blocking each message transaction on all other
  conflicting tasks.

The queue implementation uses an rb-tree (ordered by timestamps and
sender), with a cached pointer to the front of the queue. It will be
embedded in every peer participating on the bus1 kernel message bus1.

Signed-off-by: Tom Gundersen 
Signed-off-by: David Herrmann 
---
 ipc/bus1/Makefile |   3 +-
 ipc/bus1/util/queue.c | 445 ++
 ipc/bus1/util/queue.h | 351 +++
 3 files changed, 798 insertions(+), 1 deletion(-)
 create mode 100644 ipc/bus1/util/queue.c
 create mode 100644 ipc/bus1/util/queue.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index ca8e19d..3c90657 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -2,7 +2,8 @@ bus1-y :=   \
main.o  \
util/active.o   \
util/flist.o\
-   util/pool.o
+   util/pool.o \
+   util/queue.o
 
 obj-$(CONFIG_BUS1) += bus1.o
 
diff --git a/ipc/bus1/util/queue.c b/ipc/bus1/util/queue.c
new file mode 100644
index 000..38d7b98
--- /dev/null
+++ b/ipc/bus1/util/queue.c
@@ -0,0 +1,445 @@
+/*
+ * Copyright (C) 2013-2016 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU Lesser General Public License as published by the
+ * Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include 
+#include 
+#include 
+#include 
+#inc