Re: [HACKERS] condition variables

2016-11-22 Thread Robert Haas
On Mon, Nov 21, 2016 at 11:48 PM, Thomas Munro
 wrote:
> Here's a version that works that way, though it allows you to call
> ConditionVariablePrepareToSleep *optionally* before you enter your
> loop, in case you expect to have to wait and would rather avoid the
> extra loop.  Maybe there isn't much point in exposing that though,
> since your condition test should be fast and waiting is the slow path,
> but we don't really know what your condition test is.  I
> thought about that because my use case (barrier.c) does in fact expect
> to hit the wait case more often than not.  If that seems pointless
> then perhaps ConditionVariablePrepareToSleep should become static and
> implicit.  This version does attempt to suppress spurious returns, a
> bit, using proclist_contains.  No more cvSleeping.

This version looks good to me and I have committed it after doing a
bit more work on the comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-11-22 Thread Kyotaro HORIGUCHI
At Tue, 22 Nov 2016 17:48:07 +1300, Thomas Munro 
 wrote in 

Re: [HACKERS] condition variables

2016-11-21 Thread Thomas Munro
On Tue, Nov 22, 2016 at 3:05 PM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Mon, 21 Nov 2016 15:57:47 -0500, Robert Haas  wrote 
> in 
>> On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas  wrote:
>> > So, in my
>> > implementation, a condition variable wait loop looks like this:
>> >
>> > for (;;)
>> > {
>> > ConditionVariablePrepareToSleep(cv);
>> > if (condition for which we are waiting is satisfied)
>> > break;
>> > ConditionVariableSleep();
>> > }
>> > ConditionVariableCancelSleep();
>>
>> I have what I think is a better idea.  Let's get rid of
>> ConditionVariablePrepareToSleep(cv) and instead tell users of this
>> facility to write the loop this way:
>>
>> for (;;)
>> {
>> if (condition for which we are waiting is satisfied)
>> break;
>> ConditionVariableSleep(cv);
>> }
>> ConditionVariableCancelSleep();
>
> It seems rather a common way to wait on a condition variable, in
> shorter,
>
> | while (condition for which we are waiting is *not* satisfied)
> | ConditionVariableSleep(cv);
> | ConditionVariableCancelSleep();

Ok, let's show it that way.

>> ConditionVariableSleep(cv) will check whether the current process is
>> already on the condition variable's waitlist.  If so, it will sleep;
>> if not, it will add the process and return without sleeping.
>>
>> It may seem odd that ConditionVariableSleep(cv) doesn't necessarily
>> sleep, but this design has a significant advantage: we avoid
>> manipulating the wait-list altogether in the case where the condition
>> is already satisfied when we enter the loop.  That's more like what we
>
> The condition check is done far faster than maintaining the
> wait-list for most cases, I believe.
>
>> already do in lwlock.c: we try to grab the lock first; if we can't, we
>> add ourselves to the wait-list and retry; if we then get the lock
>> after all we have to recheck whether we can get the lock and remove
>> ourselves from the wait-list if so.  Of course, there is some cost: if
>> we do have to wait, we'll end up checking the condition twice before
>> actually going to sleep.  However, it's probably smart to bet that
>> actually needing to sleep is fairly infrequent, just as in lwlock.c.
>>
>> Thoughts?
>
> FWIW, I agree to the assumption.

Here's a version that works that way, though it allows you to call
ConditionVariablePrepareToSleep *optionally* before you enter your
loop, in case you expect to have to wait and would rather avoid the
extra loop.  Maybe there isn't much point in exposing that though,
since your condition test should be fast and waiting is the slow path,
but we don't really know what your condition test is.  I
thought about that because my use case (barrier.c) does in fact expect
to hit the wait case more often than not.  If that seems pointless
then perhaps ConditionVariablePrepareToSleep should become static and
implicit.  This version does attempt to suppress spurious returns, a
bit, using proclist_contains.  No more cvSleeping.
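
For illustration, the two usage patterns that fall out of this would be
roughly as follows (a sketch: "condition" stands for whatever shared-memory
test the caller performs, and the wait_event_info argument is omitted):

/* Common case: test first; the first Sleep call merely registers us. */
while (!condition)
    ConditionVariableSleep(cv);
ConditionVariableCancelSleep();

/* If waiting is the expected outcome, register up front to save a loop. */
ConditionVariablePrepareToSleep(cv);
while (!condition)
    ConditionVariableSleep(cv);
ConditionVariableCancelSleep();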

It's possible that future users will want a version with a timeout, or
multiplexed with IO, in which case there would be some interesting
questions about how this should interact with WaitEventSet.  It also
seems like someone might eventually want to handle postmaster death.
Perhaps there should eventually be a way to tell WaitEventSet that
you're waiting for a CV so these things can be multiplexed without
exposing the fact that it's done internally with latches.

-- 
Thomas Munro
http://www.enterprisedb.com


condition-variable-v4.patch
Description: Binary data



Re: [HACKERS] condition variables

2016-11-21 Thread Kyotaro HORIGUCHI
Hello,

At Mon, 21 Nov 2016 15:57:47 -0500, Robert Haas  wrote 
in 
> On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas  wrote:
> > So, in my
> > implementation, a condition variable wait loop looks like this:
> >
> > for (;;)
> > {
> > ConditionVariablePrepareToSleep(cv);
> > if (condition for which we are waiting is satisfied)
> > break;
> > ConditionVariableSleep();
> > }
> > ConditionVariableCancelSleep();
> 
> I have what I think is a better idea.  Let's get rid of
> ConditionVariablePrepareToSleep(cv) and instead tell users of this
> facility to write the loop this way:
> 
> for (;;)
> {
> if (condition for which we are waiting is satisfied)
> break;
> ConditionVariableSleep(cv);
> }
> ConditionVariableCancelSleep();

It seems rather a common way to wait on a condition variable, in
shorter,

| while (condition for which we are waiting is *not* satisfied)
| ConditionVariableSleep(cv);
| ConditionVariableCancelSleep();

> ConditionVariableSleep(cv) will check whether the current process is
> already on the condition variable's waitlist.  If so, it will sleep;
> if not, it will add the process and return without sleeping.
> 
> It may seem odd that ConditionVariableSleep(cv) doesn't necessarily
> sleep, but this design has a significant advantage: we avoid
> manipulating the wait-list altogether in the case where the condition
> is already satisfied when we enter the loop.  That's more like what we

The condition check is done far faster than maintaining the
wait-list for most cases, I believe.

> already do in lwlock.c: we try to grab the lock first; if we can't, we
> add ourselves to the wait-list and retry; if we then get the lock
> after all we have to recheck whether we can get the lock and remove
> ourselves from the wait-list if so.  Of course, there is some cost: if
> we do have to wait, we'll end up checking the condition twice before
> actually going to sleep.  However, it's probably smart to bet that
> actually needing to sleep is fairly infrequent, just as in lwlock.c.
> 
> Thoughts?

FWIW, I agree to the assumption.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] condition variables

2016-11-21 Thread Robert Haas
On Thu, Aug 11, 2016 at 5:47 PM, Robert Haas  wrote:
> So, in my
> implementation, a condition variable wait loop looks like this:
>
> for (;;)
> {
> ConditionVariablePrepareToSleep(cv);
> if (condition for which we are waiting is satisfied)
> break;
> ConditionVariableSleep();
> }
> ConditionVariableCancelSleep();

I have what I think is a better idea.  Let's get rid of
ConditionVariablePrepareToSleep(cv) and instead tell users of this
facility to write the loop this way:

for (;;)
{
if (condition for which we are waiting is satisfied)
break;
ConditionVariableSleep(cv);
}
ConditionVariableCancelSleep();

ConditionVariableSleep(cv) will check whether the current process is
already on the condition variable's waitlist.  If so, it will sleep;
if not, it will add the process and return without sleeping.
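
A minimal sketch of that behaviour, assuming the condition variable holds a
spinlock and a proclist of waiting processes (latch handling simplified):

void
ConditionVariableSleep(ConditionVariable *cv)
{
    bool        must_wait;

    SpinLockAcquire(&cv->mutex);
    must_wait = proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    if (!must_wait)
        proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    SpinLockRelease(&cv->mutex);

    /*
     * If we only just joined the wait list, return so that the caller
     * re-tests its exit condition before really sleeping.
     */
    if (!must_wait)
        return;

    /* Otherwise wait for some other backend to set our latch. */
    WaitLatch(MyLatch, WL_LATCH_SET, 0);
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}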

It may seem odd that ConditionVariableSleep(cv) doesn't necessarily
sleep, but this design has a significant advantage: we avoid
manipulating the wait-list altogether in the case where the condition
is already satisfied when we enter the loop.  That's more like what we
already do in lwlock.c: we try to grab the lock first; if we can't, we
add ourselves to the wait-list and retry; if we then get the lock
after all we have to recheck whether we can get the lock and remove
ourselves from the wait-list if so.  Of course, there is some cost: if
we do have to wait, we'll end up checking the condition twice before
actually going to sleep.  However, it's probably smart to bet that
actually needing to sleep is fairly infrequent, just as in lwlock.c.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-11-21 Thread Robert Haas
On Mon, Nov 21, 2016 at 7:10 AM, Thomas Munro
 wrote:
>> I wonder if we shouldn't try to create the invariant that when the CV
>> mutex is not held, the state of cvSleeping has to be true if we're in
>> the proclist and false if we're not.  So ConditionVariableSignal()
>> would clear the flag before releasing the spinlock, for example.  Then
>> I think we wouldn't need proclist_contains().
>
> Yeah, that'd work.  ConditionVariableSleep would need to hold the
> spinlock while checking cvSleeping (otherwise there is a race where
> another process sets it to false but this process doesn't see that yet
> and then waits for a latch-set which never arrives).  It's not the end
> of the world but it seems unfortunate to have to acquire and release
> the spinlock in ConditionVariablePrepareToSleep and then immediately
> again in ConditionVariableSleep.
>
> I was going to code that up but I read this and had another idea:
>
> http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups
>
> I realise that you didn't actually say you wanted to guarantee no
> spurious wakeups.  But since the client code already has to check its
> own exit condition, why bother adding another layer of looping?
> Spurious latch sets must be super rare, but even if they aren't, what
> do you save by filtering them here?  In this situation you already got
> woken from a deep slumber and scheduled back onto the CPU, so it
> hardly matters whether you go around again in that loop or the
> client's loop.  We could make things really simple instead: get rid of
> cvSleeping, have ConditionVariablePrepareToSleep reset the latch, then
> have ConditionVariableSleep wait for the latch to be set just once (no
> loop).  Then we'd document that spurious wakeups are possible so the
> caller must write a robust predicate loop, exactly as you already
> showed in your first message.  We'd need to keep that
> proclist_contains stuff to avoid corrupting the list.
> proclist_contains would be the one source of truth for whether you're
> in the waitlist, and the client code's predicate loop would contain
> the one source of truth for whether the condition it is waiting for
> has been reached.

I don't think we can rely on spurious latch set events being rare.
There are an increasing number of things that set the process latch,
and there will very likely be more in the future.  For instance, the
arrival of tuples from a parallel worker associated with our session
will set the process latch.  Workers starting or dying will set the
process latch.  So my inclination was to try to guarantee no spurious
wakeups at all, but possibly a softer guarantee that makes them
unlikely would be sufficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-11-21 Thread Thomas Munro
On Fri, Oct 28, 2016 at 9:38 AM, Robert Haas  wrote:
> On Tue, Oct 4, 2016 at 3:12 PM, Thomas Munro
>  wrote:
>> On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro
>>  wrote:
>>> Please find attached a -v2 of your patch which includes suggestions
>>> 1-3 above.
>>
>> Here's a rebased patch.  ConditionVariableSleep now takes
>> wait_event_info.  Anyone using this in patches for core probably needs
>> to add enumerators to the WaitEventXXX enums in pgstat.h to describe
>> their wait points.
>
> So, in my original patch, it was intended that cvSleeping was the
> definitive source of truth as to whether a particular backend is
> committed to sleeping or not.  That had some bugs, I guess, but in
> your version, there are now two sources of truth.  On the one hand,
> there's still cvSleeping.  On the other hand, there's now also whether
> proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink) returns
> true.
>
> I'm not sure whether that causes any serious problem.  It seems
> possible, for example, that ConditionVariableSignal() could clear the
> cvSleeping flag for a backend that's now waiting for some OTHER
> condition variable, because once it pops the victim off the list and
> releases the spinlock, the other backend could meanwhile
> ConditionVariableCancelSleep() and then do anything it likes,
> including sleep on some other condition variable.  Now I think that
> the worst thing that will happen is that the other backend will
> receive a spurious wakeup, but I'm not quite sure.  The whole point of
> having cvSleeping in the first place is so that we don't get spurious
> wakeups just because somebody happens to set your process latch, so it
> seems a bit unfortunate that it doesn't actually prevent that from
> happening.
>
> I wonder if we shouldn't try to create the invariant that when the CV
> mutex is not held, the state of cvSleeping has to be true if we're in
> the proclist and false if we're not.  So ConditionVariableSignal()
> would clear the flag before releasing the spinlock, for example.  Then
> I think we wouldn't need proclist_contains().

Yeah, that'd work.  ConditionVariableSleep would need to hold the
spinlock while checking cvSleeping (otherwise there is a race where
another process sets it to false but this process doesn't see that yet
and then waits for a latch-set which never arrives).  It's not the end
of the world but it seems unfortunate to have to acquire and release
the spinlock in ConditionVariablePrepareToSleep and then immediately
again in ConditionVariableSleep.

I was going to code that up but I read this and had another idea:

http://stackoverflow.com/questions/8594591/why-does-pthread-cond-wait-have-spurious-wakeups

I realise that you didn't actually say you wanted to guarantee no
spurious wakeups.  But since the client code already has to check its
own exit condition, why bother adding another layer of looping?
Spurious latch sets must be super rare, but even if they aren't, what
do you save by filtering them here?  In this situation you already got
woken from a deep slumber and scheduled back onto the CPU, so it
hardly matters whether you go around again in that loop or the
client's loop.  We could make things really simple instead: get rid of
cvSleeping, have ConditionVariablePrepareToSleep reset the latch, then
have ConditionVariableSleep wait for the latch to be set just once (no
loop).  Then we'd document that spurious wakeups are possible so the
caller must write a robust predicate loop, exactly as you already
showed in your first message.  We'd need to keep that
proclist_contains stuff to avoid corrupting the list.
proclist_contains would be the one source of truth for whether you're
in the waitlist, and the client code's predicate loop would contain
the one source of truth for whether the condition it is waiting for
has been reached.
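
In code form that simplification might look something like this (a sketch
only, ignoring interrupts and the wait_event_info argument):

void
ConditionVariablePrepareToSleep(ConditionVariable *cv)
{
    /* Clear any pending latch set before (re)joining the wait list. */
    ResetLatch(MyLatch);

    SpinLockAcquire(&cv->mutex);
    if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
        proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    SpinLockRelease(&cv->mutex);
}

void
ConditionVariableSleep(ConditionVariable *cv)
{
    /*
     * Wait for the latch just once; spurious returns are fine because the
     * caller's predicate loop re-tests its condition anyway.
     */
    WaitLatch(MyLatch, WL_LATCH_SET, 0);
}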

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] condition variables

2016-10-27 Thread Robert Haas
On Tue, Oct 4, 2016 at 3:12 PM, Thomas Munro
 wrote:
> On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro
>  wrote:
>> Please find attached a -v2 of your patch which includes suggestions
>> 1-3 above.
>
> Here's a rebased patch.  ConditionVariableSleep now takes
> wait_event_info.  Anyone using this in patches for core probably needs
> to add enumerators to the WaitEventXXX enums in pgstat.h to describe
> their wait points.

So, in my original patch, it was intended that cvSleeping was the
definitive source of truth as to whether a particular backend is
committed to sleeping or not.  That had some bugs, I guess, but in
your version, there are now two sources of truth.  On the one hand,
there's still cvSleeping.  On the other hand, there's now also whether
proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink) returns
true.

I'm not sure whether that causes any serious problem.  It seems
possible, for example, that ConditionVariableSignal() could clear the
cvSleeping flag for a backend that's now waiting for some OTHER
condition variable, because once it pops the victim off the list and
releases the spinlock, the other backend could meanwhile
ConditionVariableCancelSleep() and then do anything it likes,
including sleep on some other condition variable.  Now I think that
the worst thing that will happen is that the other backend will
receive a spurious wakeup, but I'm not quite sure.  The whole point of
having cvSleeping in the first place is so that we don't get spurious
wakeups just because somebody happens to set your process latch, so it
seems a bit unfortunate that it doesn't actually prevent that from
happening.

I wonder if we shouldn't try to create the invariant that when the CV
mutex is not held, the state of cvSleeping has to be true if we're in
the proclist and false if we're not.  So ConditionVariableSignal()
would clear the flag before releasing the spinlock, for example.  Then
I think we wouldn't need proclist_contains().
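
Under that invariant, ConditionVariableSignal() might look roughly like this
(a sketch, not a patch):

void
ConditionVariableSignal(ConditionVariable *cv)
{
    PGPROC     *proc = NULL;

    SpinLockAcquire(&cv->mutex);
    if (!proclist_is_empty(&cv->wakeup))
    {
        proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
        /* Keep the invariant: off the list implies not cvSleeping. */
        proc->cvSleeping = false;
    }
    SpinLockRelease(&cv->mutex);

    if (proc != NULL)
        SetLatch(&proc->procLatch);
}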

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-10-23 Thread Peter Geoghegan
On Tue, Oct 4, 2016 at 12:12 PM, Thomas Munro
 wrote:
> Here's a rebased patch.  ConditionVariableSleep now takes
> wait_event_info.  Anyone using this in patches for core probably needs
> to add enumerators to the WaitEventXXX enums in pgstat.h to describe
> their wait points.

I think that there are at least 2 patches from EDB people that are
already dependent on this one. I myself could easily adopt parallel
CREATE INDEX to use it, too. Why the continued delay in committing it?
I think it's fairly clear that this mechanism is widely useful.


-- 
Peter Geoghegan




Re: [HACKERS] condition variables

2016-10-04 Thread Thomas Munro
On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro
 wrote:
> Please find attached a -v2 of your patch which includes suggestions
> 1-3 above.

Here's a rebased patch.  ConditionVariableSleep now takes
wait_event_info.  Anyone using this in patches for core probably needs
to add enumerators to the WaitEventXXX enums in pgstat.h to describe
their wait points.
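
For example, a caller's wait loop might end up looking like this, where
shared->cv, work_is_done() and WAIT_EVENT_MY_FEATURE are all placeholders
(the last standing in for whatever enumerator gets added to pgstat.h):

for (;;)
{
    ConditionVariablePrepareToSleep(&shared->cv);
    if (work_is_done(shared))
        break;
    ConditionVariableSleep(&shared->cv, WAIT_EVENT_MY_FEATURE);
}
ConditionVariableCancelSleep();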

-- 
Thomas Munro
http://www.enterprisedb.com


condition-variable-v3.patch
Description: Binary data



Re: [HACKERS] condition variables

2016-09-15 Thread Peter Geoghegan
Maybe I can leave it up to you to determine if that applies in the context
of my parallel create index patch. You are the obvious candidate to review
that patch anyway, of course.

--
Peter Geoghegan


Re: [HACKERS] condition variables

2016-09-14 Thread Robert Haas
On Tue, Sep 13, 2016 at 10:55 PM, Peter Geoghegan  wrote:
> On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas  wrote:
>> Another approach to the problem is to use a latch wait loop.  That
>> almost works.  Interrupts can be serviced, and you can recheck shared
>> memory to see whether the condition for proceeding is satisfied after
>> each iteration of the loop.  There's only one problem: when you do
>> something that might cause the condition to be satisfied for other
>> waiting backends, you need to set their latch - but you don't have an
>> easy way to know exactly which processes are waiting, so how do you
>> call SetLatch?  I originally thought of adding a function like
>> SetAllLatches(ParallelContext *) and maybe that can work, but then I
>> had what I think is a better idea, which is to introduce a notion of
>> condition variables.
>
> I don't see a CF entry for this. Are you planning to work on this
> again soon, Robert?
>
> I have an eye on this patch due to my work on parallel CREATE INDEX.
> It would be nice to have some rough idea of when you intend to commit
> this.

I basically figured I would commit it when and if it became clear that
it'd get good use in some other patch which was on the road to being
committed.  I don't think it needs much work, just the assurance that
it will get some use.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-09-13 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas  wrote:
> Another approach to the problem is to use a latch wait loop.  That
> almost works.  Interrupts can be serviced, and you can recheck shared
> memory to see whether the condition for proceeding is satisfied after
> each iteration of the loop.  There's only one problem: when you do
> something that might cause the condition to be satisfied for other
> waiting backends, you need to set their latch - but you don't have an
> easy way to know exactly which processes are waiting, so how do you
> call SetLatch?  I originally thought of adding a function like
> SetAllLatches(ParallelContext *) and maybe that can work, but then I
> had what I think is a better idea, which is to introduce a notion of
> condition variables.

I don't see a CF entry for this. Are you planning to work on this
again soon, Robert?

I have an eye on this patch due to my work on parallel CREATE INDEX.
It would be nice to have some rough idea of when you intend to commit
this.

-- 
Peter Geoghegan




Re: [HACKERS] condition variables

2016-09-06 Thread Amit Kapila
On Tue, Sep 6, 2016 at 5:29 PM, Robert Haas  wrote:
> On Mon, Sep 5, 2016 at 3:17 PM, Amit Kapila  wrote:
>> On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas  wrote:
>>>>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?

>>>> I poked at this a bit... OK, a lot... and have some feedback:
>>>>
>>>> 1.  As above, we need to clear cvSleeping before setting the latch.
>>>
>>> Right, OK.
>>>
>>
>> I have independently faced this problem while using your patch and for
>> now I have updated my local copy.  If possible, please send an updated
>> patch as this patch could be used for development of various
>> parallelism projects.
>
> Thomas already posted an updated patch in the same message where he
> reported the problem.
>

Oops, I missed that, will use the same.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] condition variables

2016-09-06 Thread Robert Haas
On Mon, Sep 5, 2016 at 3:17 PM, Amit Kapila  wrote:
> On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas  wrote:
>>>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?
>>>
>>> I poked at this a bit... OK, a lot... and have some feedback:
>>>
>>> 1.  As above, we need to clear cvSleeping before setting the latch.
>>
>> Right, OK.
>>
>
> I have independently faced this problem while using your patch and for
> now I have updated my local copy.  If possible, please send an updated
> patch as this patch could be used for development of various
> parallelism projects.

Thomas already posted an updated patch in the same message where he
reported the problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-09-05 Thread Amit Kapila
On Mon, Aug 15, 2016 at 10:35 PM, Robert Haas  wrote:
>>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?
>>
>> I poked at this a bit... OK, a lot... and have some feedback:
>>
>> 1.  As above, we need to clear cvSleeping before setting the latch.
>
> Right, OK.
>

I have independently faced this problem while using your patch and for
now I have updated my local copy.  If possible, please send an updated
patch as this patch could be used for development of various
parallelism projects.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] condition variables

2016-08-16 Thread Andres Freund
On 2016-08-11 21:27:45 -0400, Robert Haas wrote:
> On Thu, Aug 11, 2016 at 6:37 PM, Peter Geoghegan  wrote:
> > I notice that you acquire a spinlock within the implementation of
> > condition variables. Is it worth any effort to consolidate the number
> > of spinlock acquisitions? In other words, maybe the most common idioms
> > should be baked into the ConditionVariable interface, which could save
> > callers from having to use their own mutex variable.
> 
> One thing to keep in mind is that spinlocks are extremely fast as long
> as you don't have too many processes contending for them.

That's one of the conditions. The other is that the system as a whole is
not overcommitted, because then the chance of processes being put to
sleep while holding a spinlock increases.

> With
> parallel groups (or I/O-in-progress wait queues) of single digit
> number of processes, I doubt that consolidating the spinlock
> acquisitions will produce any measurable benefit.  If we get to the
> point of having parallel groups containing scores of processes, that
> could change.

And since we have no measures yet to manage system-wide load with
parallelism, I think the issue is a bit more general than the quoted
paragraph.


But I also think we shouldn't yet worry about it. It seems likely that
the actual critical bottleneck is elsewhere for now.

Andres Freund




Re: [HACKERS] condition variables

2016-08-15 Thread Robert Haas
On Mon, Aug 15, 2016 at 1:58 AM, Thomas Munro
 wrote:
> On Sun, Aug 14, 2016 at 9:04 AM, Thomas Munro
>  wrote:
>> On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas  wrote:
>>> [condition-variable-v1.patch]
>>
>> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?
>
> I poked at this a bit... OK, a lot... and have some feedback:
>
> 1.  As above, we need to clear cvSleeping before setting the latch.

Right, OK.

> 2.  The following schedule corrupts the waitlist by trying to delete
> something from it that isn't in it:
>
> P1: ConditionVariablePrepareToSleep: push self onto waitlist
> P2:   ConditionVariableSignal: pop P1 from waitlist
> P1: <exits its wait loop without sleeping>
> P3: ConditionVariablePrepareToSleep: push self onto waitlist
> P1: ConditionVariableCancelSleep: delete self from waitlist (!)
>
> Without P3 coming along you probably wouldn't notice because the
> waitlist will be happily cleared and look valid, but P3's entry gets
> lost and then it hangs forever.
>
> One solution is to teach ConditionVariableCancelSleep to check if
> we're still actually there first.  That can be done in O(1) time by
> clearing nodes' next and prev pointers when deleting, so that we can
> rely on that in a new proclist_contains() function.  See attached.

How about instead using cvSleeping to test this?  Suppose we make a
rule that cvSleeping can be changed from false to true only by the
process whose PGPROC it is, and thus no lock is needed, but changing
it from true to false always requires the spinlock.

> 3.  The following schedule corrupts the waitlist by trying to insert
> something into it that is already in it:
>
> P1: ConditionVariablePrepareToSleep: push self onto waitlist
> P1: <finds its condition not yet satisfied>
> P1: ConditionVariableSleep
> P1: ConditionVariablePrepareToSleep: push self onto waitlist (!)
>
> Nodes before and after self's pre-existing position can be forgotten
> when self's node is pushed to the front of the list.  That can be
> fixed by making ConditionVariablePrepareToSleep also check if we're
> already in the list.

OK.

> 4.  The waitlist is handled LIFO-ly.  Would it be better for the first
> guy to start waiting to be woken up first, like we do for locks?  The
> Pthreads equivalent says that it depends on "scheduling policy".  I
> don't know if it matters much, just an observation.

I don't know whether this matters.  It's possible that FIFO is a
better policy; I don't really care.

> 5.  The new proclist function you added is the first to work in terms
> of PGPROC* rather than procno.  Maybe the whole interface should work
> with either PGPROC pointers or procnos?  No strong view.

Hmm, maybe so.  But wouldn't any caller translate to a PGPROC * straight off?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-08-15 Thread Thomas Munro
On Mon, Aug 15, 2016 at 5:58 PM, Thomas Munro
 wrote:
> Also, I have attached a v2->v3 diff ...
Ugh.  I meant a v1->v2 diff.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] condition variables

2016-08-14 Thread Thomas Munro
On Sun, Aug 14, 2016 at 9:04 AM, Thomas Munro
 wrote:
> On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas  wrote:
>> [condition-variable-v1.patch]
>
> Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?

I poked at this a bit... OK, a lot... and have some feedback:

1.  As above, we need to clear cvSleeping before setting the latch.

2.  The following schedule corrupts the waitlist by trying to delete
something from it that isn't in it:

P1: ConditionVariablePrepareToSleep: push self onto waitlist
P2:   ConditionVariableSignal: pop P1 from waitlist
P1: <exits its wait loop without sleeping>
P3: ConditionVariablePrepareToSleep: push self onto waitlist
P1: ConditionVariableCancelSleep: delete self from waitlist (!)

Without P3 coming along you probably wouldn't notice because the
waitlist will be happily cleared and look valid, but P3's entry gets
lost and then it hangs forever.

One solution is to teach ConditionVariableCancelSleep to check if
we're still actually there first.  That can be done in O(1) time by
clearing nodes' next and prev pointers when deleting, so that we can
rely on that in a new proclist_contains() function.  See attached.

3.  The following schedule corrupts the waitlist by trying to insert
something into it that is already in it:

P1: ConditionVariablePrepareToSleep: push self onto waitlist
P1: <finds its condition not yet satisfied>
P1: ConditionVariableSleep
P1: ConditionVariablePrepareToSleep: push self onto waitlist (!)

Nodes before and after self's pre-existing position can be forgotten
when self's node is pushed to the front of the list.  That can be
fixed by making ConditionVariablePrepareToSleep also check if we're
already in the list.

4.  The waitlist is handled LIFO-ly.  Would it be better for the first
guy to start waiting to be woken up first, like we do for locks?  The
Pthreads equivalent says that it depends on "scheduling policy".  I
don't know if it matters much, just an observation.

5.  The new proclist function you added is the first to work in terms
of PGPROC* rather than procno.  Maybe the whole interface should work
with either PGPROC pointers or procnos?  No strong view.
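
To illustrate point 2, the CancelSleep half of that fix amounts to something
like the sketch below, where cv_sleep_target is a stand-in for however the
CV passed to ConditionVariablePrepareToSleep is remembered:

void
ConditionVariableCancelSleep(void)
{
    ConditionVariable *cv = cv_sleep_target;

    if (cv == NULL)
        return;

    SpinLockAcquire(&cv->mutex);
    /*
     * Only unlink ourselves if we are still on the wait list;
     * ConditionVariableSignal() may already have popped us off.
     */
    if (proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
        proclist_delete(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
    SpinLockRelease(&cv->mutex);

    cv_sleep_target = NULL;
}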

Please find attached a -v2 of your patch which includes suggestions
1-3 above.  Like the -v1, it applies on top of
lwlocks-in-dsm-v3.patch.  Also, I have attached a v2->v3 diff to show
just my proposed changes.

-- 
Thomas Munro
http://www.enterprisedb.com


diff.patch
Description: Binary data


condition-variable-v2.patch
Description: Binary data



Re: [HACKERS] condition variables

2016-08-13 Thread Thomas Munro
On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas  wrote:
> [condition-variable-v1.patch]

Don't you need to set proc->cvSleeping = false in ConditionVariableSignal?

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] condition variables

2016-08-11 Thread Robert Haas
On Thu, Aug 11, 2016 at 8:44 PM, Thomas Munro
 wrote:
> In contrast, this proposal leaves it up to client code to get that
> right, similarly to the way you need to do things in a certain order
> when waiting for state changes with latches.  You could say that it's
> more error prone: I think there have been a few cases of incorrectly
> coded latch/state-change wait loops in the past.  On the other hand,
> it places no requirements on the synchronisation mechanism the client
> code uses for the related shared state.  pthread_cond_wait requires
> you to pass in a pointer to the related pthread_mutex_t, whereas with
> this proposal client code is free to use atomic ops, lwlocks,
> spinlocks or any other mutual exclusion mechanism to coordinate state
> changes and deal with cache coherency.

I think you have accurately stated the pros and cons of this approach.
On the whole I think it's a good trade-off.  In particular, not being
locked into a specific synchronization method seems like a good idea
from here.  If you had to pick just one, it would be hard to decide
between spinlocks and LWLocks, and "atomics" isn't a sufficiently
specific thing to code to.

> Then there is the question of what happens when the backend that is
> supposed to be doing the signalling dies or aborts, which Tom Lane
> referred to in his reply.  In those other libraries there is no such
> concern: it's understood that these are low level thread
> synchronisation primitives and if you're waiting for something that
> never happens, you'll be waiting forever.  I don't know what the
> answer is in general for Postgres condition variables, but...

As I noted in my reply to Tom, for parallel query, we're going to kill
all the workers at the same time, so the problem doesn't arise.  When
we use this mechanism outside that context, we just have to do it
correctly.  I don't think that's especially hard, but could somebody
mess it up?  Sure.

> The thing that I personally am working on currently that is very
> closely related and could use this has a more specific set of
> circumstances:  I want "join points" AKA barriers.  Something like
> pthread_barrier_t.  (I'm saying "join point" rather than "barrier" to
> avoid confusion with compiler and memory barriers, barrier.h etc.)
> Join points let you wait for all workers in a known set to reach a
> given point, possibly with a phase number or at least sense (one bit
> phase counter) to detect synchronisation bugs.  They also select one
> worker arbitrarily to receive a different return value when releasing
> workers from a join point, for cases where a particular phase of
> parallel work needs to be done by exactly one worker while the others
> sit on the bench: for example initialisation, cleanup or merging (CF
> PTHREAD_BARRIER_SERIAL_THREAD).  Clearly a join point could be not
> much more than a condition variable and some state tracking arrivals
> and departures, but I think that this higher level synchronisation
> primitive might have an advantage over raw condition variables in the
> abort case: it can know the total set of workers that it's waiting for,
> if they are somehow registered with it first, and registration can
> include arranging for cleanup hooks to do the right thing.  It's
> already a requirement for a join point to know which workers exist (or
> at least how many).  Then the deal would be that when you call
> joinpoint_join(_joinpoint, phase), it will return only when all
> peers have joined or detached, where the latter happens automatically
> if they abort or die.  Not at all sure of the details yet...  but I
> suspect join points are useful for a bunch of things like parallel
> sort, parallel hash join (my project), and anything else involving
> phases or some form of "fork/join" parallelism.

If I'm right that the abort/die case doesn't really need any special
handling here, then I think this gets a lot simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-08-11 Thread Robert Haas
On Thu, Aug 11, 2016 at 6:37 PM, Peter Geoghegan  wrote:
> I notice that you acquire a spinlock within the implementation of
> condition variables. Is it worth any effort to consolidate the number
> of spinlock acquisitions? In other words, maybe the most common idioms
> should be baked into the ConditionVariable interface, which could save
> callers from having to use their own mutex variable.

One thing to keep in mind is that spinlocks are extremely fast as long
as you don't have too many processes contending for them.  With
parallel groups (or I/O-in-progress wait queues) of single digit
number of processes, I doubt that consolidating the spinlock
acquisitions will produce any measurable benefit.  If we get to the
point of having parallel groups containing scores of processes, that
could change.

Also, right now we don't have enough users of the CV interface to know
what the most common idioms will be.  We could speculate, but if we
do, I'll start off the speculation by guessing that there will be a lot
of diversity, and not too much that keeps getting repeated.  If that
proves to be wrong, of course, we can always go back and change it
later.  We're not writing this on stone tablets.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-08-11 Thread Robert Haas
On Thu, Aug 11, 2016 at 6:12 PM, Tom Lane  wrote:
>> But if we replace the io_in_progress locks with
>> condition variables, then that doesn't happen any more.  Nobody is
>> "holding" the condition variable, so it doesn't get "released" when
>> the process doing I/O aborts.  Instead, they just keep sleeping until
>> the aborting process reaches AbortBufferIO, and then it broadcasts on
>> the condition variable and wakes everybody up, which seems a good deal
>> nicer.
>
> Hmm.  I fear the only reason you see an advantage there is that you don't
> (yet) have any general-purpose mechanism for an aborting transaction to
> satisfy its responsibilities vis-a-vis waiters on condition variables.

In the specific case of parallel query, this problem doesn't really
arise: if an error happens, it will be propagated back to the leader
(unless it occurs in the leader as an initial matter) and the leader
will then kill of all of the other workers.  Since waiting for a
condition variable doesn't block interrupts, the workers will all stop
waiting and die like the obedient lemmings they are.

> Instead, this wins specifically because you stuck some bespoke logic into
> AbortBufferIO.  OK ... but that sounds like we're going to end up with
> every single condition variable that ever exists in the system needing to
> be catered for separately and explicitly during transaction abort cleanup.
> Which does not sound promising from a reliability standpoint.  On the
> other hand, I don't know what the equivalent rule to "release all LWLocks
> during abort" might look like for condition variables, so I don't know
> if it's even possible to avoid that.

I don't think it is possible to avoid that.  Certainly the fact that
LWLocks are automatically released is in most scenarios a big
advantage, but for content locks it isn't.  I note that I didn't so
much insert bespoke logic there as replace the existing bespoke logic
with less-ugly bespoke logic.

One way to conceptualize a condition variable is that it is the wait
queue for an LWLock without the lock itself.  The "lock modes" are
defined by the interaction between the condition checked by waiters
before sleeping and the shared memory updates performed before
signalling or broadcasting.  This separation of concerns allows for
enormous flexibility - you could use CVs to implement an LWLock-like
construct with an arbitrary set of lock modes and an arbitrary
conflict matrix, for example.  But I think that in most cases it will
be best to leave mutual exclusion to LWLocks, and use CVs when what we
need is not mutual exclusion but rather waiting for an operation
performed by some other backend to complete (successfully or
otherwise).  The IO-in-progress example is just such a case: the
current spinning behavior arises from the fact that the buffer's
I/O-in-progress flag does not get cleared until AFTER processes has
been dropped off the wait queue.  The normal pattern with a CV will be
(1) perform a shared memory update that will, if seen by other
processes, confirm that the condition has been met and then (2)
broadcast on the CV.  There's no way for transaction abort to know
what (1) looks like, even if it somehow were made aware of the CV so that
it could do (2).
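
In code, the waker's half of that pattern is just this (a sketch; "state"
and the lock protecting it are placeholders for whatever the caller uses):

/* (1) Publish the state change that waiters are testing for... */
SpinLockAcquire(&state->mutex);
state->operation_done = true;
SpinLockRelease(&state->mutex);

/* (2) ...and only then wake everyone who might be waiting on it. */
ConditionVariableBroadcast(&state->cv);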

> I encourage you to pursue this, because indeed LWLocks aren't always
> an ideal solution, but I think it requires some careful thought about
> what transaction aborts will do with condition variables.

Thanks.  As stated above, I believe the right answer is in fact
"nothing", because this facility is too low-level to permit a
categorical answer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] condition variables

2016-08-11 Thread Thomas Munro
On Fri, Aug 12, 2016 at 9:47 AM, Robert Haas  wrote:
> https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2
>
> Basically, a condition variable has three operations: you can wait for
> the condition variable; you can signal the condition variable to wake
> up one waiter; or you can broadcast on the condition variable to wake
> up all waiters.  Atomically with entering the wait, you must be able
> to check whether the condition is satisfied.  So, in my
> implementation, a condition variable wait loop looks like this:
>
> for (;;)
> {
> ConditionVariablePrepareToSleep(cv);
> if (condition for which we are waiting is satisfied)
> break;
> ConditionVariableSleep();
> }
> ConditionVariableCancelSleep();
>
> To wake up one waiter, another backend can call
> ConditionVariableSignal(cv); to wake up all waiters,
> ConditionVariableBroadcast(cv).

It is interesting to compare this interface with Wikipedia's
description, POSIX's pthread_cond_t and C++'s std::condition_variable.

In those interfaces, the wait operation takes a mutex which must
already be held by the caller.  It unlocks the mutex and begins
waiting atomically.  Then when it returns, the mutex is automatically
reacquired.  This approach avoids race conditions as long as the
shared state change you are awaiting is protected by that mutex.  If
you check that state before waiting and while still holding the lock,
you can be sure not to miss any change signals, and then when it
returns you can check the state again and be sure that no one can be
concurrently changing it.
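
For reference, the POSIX idiom being described looks like this:

#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int ready = 0;           /* the shared state, protected by mtx */

/* Waiter: pthread_cond_wait atomically releases mtx while sleeping and
 * reacquires it before returning, so no change to 'ready' can be missed. */
pthread_mutex_lock(&mtx);
while (!ready)
    pthread_cond_wait(&cond, &mtx);
pthread_mutex_unlock(&mtx);

/* Signaller: update the state under the same mutex, then wake waiters. */
pthread_mutex_lock(&mtx);
ready = 1;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&mtx);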

In contrast, this proposal leaves it up to client code to get that
right, similarly to the way you need to do things in a certain order
when waiting for state changes with latches.  You could say that it's
more error prone: I think there have been a few cases of incorrectly
coded latch/state-change wait loops in the past.  On the other hand,
it places no requirements on the synchronisation mechanism the client
code uses for the related shared state.  pthread_cond_wait requires
you to pass in a pointer to the related pthread_mutex_t, whereas with
this proposal client code is free to use atomic ops, lwlocks,
spinlocks or any other mutual exclusion mechanism to coordinate state
changes and deal with cache coherency.

Then there is the question of what happens when the backend that is
supposed to be doing the signalling dies or aborts, which Tom Lane
referred to in his reply.  In those other libraries there is no such
concern: it's understood that these are low level thread
synchronisation primitives and if you're waiting for something that
never happens, you'll be waiting forever.  I don't know what the
answer is in general for Postgres condition variables, but...

The thing that I personally am working on currently that is very
closely related and could use this has a more specific set of
circumstances:  I want "join points" AKA barriers.  Something like
pthread_barrier_t.  (I'm saying "join point" rather than "barrier" to
avoid confusion with compiler and memory barriers, barrier.h etc.)
Join points let you wait for all workers in a known set to reach a
given point, possibly with a phase number or at least sense (one bit
phase counter) to detect synchronisation bugs.  They also select one
worker arbitrarily to receive a different return value when releasing
workers from a join point, for cases where a particular phase of
parallel work needs to be done by exactly one worker while the others
sit on the bench: for example initialisation, cleanup or merging (CF
PTHREAD_BARRIER_SERIAL_THREAD).  Clearly a join point could be not
much more than a condition variable and some state tracking arrivals
and departures, but I think that this higher level synchronisation
primitive might have an advantage over raw condition variables in the
abort case: it can know the total set of workers that it's waiting for,
if they are somehow registered with it first, and registration can
include arranging for cleanup hooks to do the right thing.  It's
already a requirement for a join point to know which workers exist (or
at least how many).  Then the deal would be that when you call
joinpoint_join(_joinpoint, phase), it will return only when all
peers have joined or detached, where the latter happens automatically
if they abort or die.  Not at all sure of the details yet...  but I
suspect join points are useful for a bunch of things like parallel
sort, parallel hash join (my project), and anything else involving
phases or some form of "fork/join" parallelism.
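
Something like this, perhaps (all names and details provisional):

typedef struct JoinPoint JoinPoint;

/* Set up a join point for a known number of participants. */
extern void joinpoint_init(JoinPoint *jp, int nparticipants);

/*
 * Block until every registered participant has either joined or detached.
 * Returns true in exactly one of the released participants, which can then
 * do any serial work for the phase (cf. PTHREAD_BARRIER_SERIAL_THREAD).
 */
extern bool joinpoint_join(JoinPoint *jp, int phase);

/* Leave early, e.g. from an abort or shmem-exit callback. */
extern void joinpoint_detach(JoinPoint *jp);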

Or perhaps that type of thinking about error handling should be pushed
down to the condition variable.  How would that look: all potential
signallers would have to register to deliver a goodbye signal in their
abort and shmem exit paths?  Then what happens if you die before
registering?  I think even if you find a way to do that I'd still need
to do similar extra work on top for my join points.

Re: [HACKERS] condition variables

2016-08-11 Thread Peter Geoghegan
On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas  wrote:
> Another approach to the problem is to use a latch wait loop.  That
> almost works.  Interrupts can be serviced, and you can recheck shared
> memory to see whether the condition for proceeding is satisfied after
> each iteration of the loop.  There's only one problem: when you do
> something that might cause the condition to be satisfied for other
> waiting backends, you need to set their latch - but you don't have an
> easy way to know exactly which processes are waiting, so how do you
> call SetLatch?

That's what I ended up doing with parallel CREATE INDEX. It worked,
but it would be nice to have a general purpose facility for waiting
for conditions to change.

> https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2
>
> Basically, a condition variable has three operations: you can wait for
> the condition variable; you can signal the condition variable to wake
> up one waiter; or you can broadcast on the condition variable to wake
> up all waiters.  Atomically with entering the wait, you must be able
> to check whether the condition is satisfied.  So, in my
> implementation, a condition variable wait loop looks like this:
>
> for (;;)
> {
> ConditionVariablePrepareToSleep(cv);
> if (condition for which we are waiting is satisfied)
> break;
> ConditionVariableSleep();
> }
> ConditionVariableCancelSleep();
>
> To wake up one waiter, another backend can call
> ConditionVariableSignal(cv); to wake up all waiters,
> ConditionVariableBroadcast(cv).

This seems convenient.

I notice that you acquire a spinlock within the implementation of
condition variables. Is it worth any effort to consolidate the number
of spinlock acquisitions? In other words, maybe the most common idioms
should be baked into the ConditionVariable interface, which could save
callers from having to use their own mutex variable.

-- 
Peter Geoghegan




Re: [HACKERS] condition variables

2016-08-11 Thread Tom Lane
Robert Haas  writes:
> I had what I think is a better idea, which is to introduce a notion of
> condition variables.

Interesting proposal.

> ... Using condition variables here seems to
> have a couple of advantages.  First, it means that a backend waiting
> for buffer I/O to complete is interruptible.  Second, it fixes a
> long-running bit of nastiness in AbortBufferIO: right now, if a
> backend that is doing buffer I/O aborts, the abort causes it to
> release all of its LWLocks, including the buffer I/O lock. Everyone
> waiting for that buffer busy-loops until the aborting process gets
> around to reacquiring the lock and updating the buffer state in
> AbortBufferIO.  But if we replace the io_in_progress locks with
> condition variables, then that doesn't happen any more.  Nobody is
> "holding" the condition variable, so it doesn't get "released" when
> the process doing I/O aborts.  Instead, they just keep sleeping until
> the aborting process reaches AbortBufferIO, and then it broadcasts on
> the condition variable and wakes everybody up, which seems a good deal
> nicer.

Hmm.  I fear the only reason you see an advantage there is that you don't
(yet) have any general-purpose mechanism for an aborting transaction to
satisfy its responsibilities vis-a-vis waiters on condition variables.
Instead, this wins specifically because you stuck some bespoke logic into
AbortBufferIO.  OK ... but that sounds like we're going to end up with
every single condition variable that ever exists in the system needing to
be catered for separately and explicitly during transaction abort cleanup.
Which does not sound promising from a reliability standpoint.  On the
other hand, I don't know what the equivalent rule to "release all LWLocks
during abort" might look like for condition variables, so I don't know
if it's even possible to avoid that.

I encourage you to pursue this, because indeed LWLocks aren't always
an ideal solution, but I think it requires some careful thought about
what transaction aborts will do with condition variables.

regards, tom lane




[HACKERS] condition variables

2016-08-11 Thread Robert Haas
Hi,

Some of my EnterpriseDB colleagues and I have been working on various
parallel query projects, all of which have been previously disclosed
here:

https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap

One issue we've encountered is that it's not very easy for one process
in a group of cooperating parallel processes to wait for another
process in that same group.  One idea is to have one process grab an
LWLock and other processes try to acquire it, but that actually
doesn't work very well.  A pretty obvious problem is that it holds off
interrupts for the entire time that you are holding the lock, which is
pretty undesirable.  A more subtle problem is that it's easy to
conceive of situations where the LWLock paradigm is just a very poor
fit for what you actually want to do.  For example, suppose you have a
computation which proceeds in two phases: each backend that finishes
phase 1 must wait until all backends finish phase 1, and once all have
finished, all can begin phase 2.  You could handle this case by having
an LWLock which everyone holds during phase 1 in shared mode, and then
everyone must briefly acquire it in exclusive mode before starting
phase 2, but that's an awful hack.  It also has race conditions: what
if someone finishes phase 1 before everyone has started phase 1?  And
what if there are 10 phases instead of 2?

Another approach to the problem is to use a latch wait loop.  That
almost works.  Interrupts can be serviced, and you can recheck shared
memory to see whether the condition for proceeding is satisfied after
each iteration of the loop.  There's only one problem: when you do
something that might cause the condition to be satisfied for other
waiting backends, you need to set their latch - but you don't have an
easy way to know exactly which processes are waiting, so how do you
call SetLatch?  I originally thought of adding a function like
SetAllLatches(ParallelContext *) and maybe that can work, but then I
had what I think is a better idea, which is to introduce a notion of
condition variables.  Condition variables, of course, are a standard
synchronization primitive:

https://en.wikipedia.org/wiki/Monitor_(synchronization)#Condition_variables_2

Basically, a condition variable has three operations: you can wait for
the condition variable; you can signal the condition variable to wake
up one waiter; or you can broadcast on the condition variable to wake
up all waiters.  Atomically with entering the wait, you must be able
to check whether the condition is satisfied.  So, in my
implementation, a condition variable wait loop looks like this:

for (;;)
{
ConditionVariablePrepareToSleep(cv);
if (condition for which we are waiting is satisfied)
break;
ConditionVariableSleep();
}
ConditionVariableCancelSleep();

To wake up one waiter, another backend can call
ConditionVariableSignal(cv); to wake up all waiters,
ConditionVariableBroadcast(cv).
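
Applied to the two-phase example above, each backend finishing phase 1 could
do something like this (a sketch; the counter, its spinlock and the CV are
assumed to live in shared memory set up by the caller):

/* Report that we have finished phase 1... */
SpinLockAcquire(&shared->mutex);
shared->nfinished++;
SpinLockRelease(&shared->mutex);
ConditionVariableBroadcast(&shared->cv);

/* ...then wait until every other worker has finished phase 1 too. */
for (;;)
{
    bool        all_done;

    ConditionVariablePrepareToSleep(&shared->cv);
    SpinLockAcquire(&shared->mutex);
    all_done = (shared->nfinished == shared->nworkers);
    SpinLockRelease(&shared->mutex);
    if (all_done)
        break;
    ConditionVariableSleep();
}
ConditionVariableCancelSleep();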

I am cautiously optimistic that this design will serve a wide variety
of needs for parallel query development - basically anything that
needs to wait for another process to reach a certain point in the
computation that can be detected through changes in shared memory
state.  The attached patch condition-variable-v1.patch implements this
API.  I originally open-coded the wait queue for this, but I've just
finished rebasing it on top of Thomas Munro's proclist stuff, so
before applying this patch you need the one from here:

https://www.postgresql.org/message-id/CAEepm=0vvr9zgwht67rwutfwmeby1gigptbk3xfpdbbgetz...@mail.gmail.com

At some point while hacking on this I realized that we could actually
replace the io_in_progress locks with condition variables; the
attached patch buffer-io-cv-v1.patch does this (it must be applied on
top of the proclist patch from the above email and also on top of
condition-variable-v1.patch).  Using condition variables here seems to
have a couple of advantages.  First, it means that a backend waiting
for buffer I/O to complete is interruptible.  Second, it fixes a
long-running bit of nastiness in AbortBufferIO: right now, if a
backend that is doing buffer I/O aborts, the abort causes it to
release all of its LWLocks, including the buffer I/O lock. Everyone
waiting for that buffer busy-loops until the aborting process gets
around to reacquiring the lock and updating the buffer state in
AbortBufferIO.  But if we replace the io_in_progress locks with
condition variables, then that doesn't happen any more.  Nobody is
"holding" the condition variable, so it doesn't get "released" when
the process doing I/O aborts.  Instead, they just keep sleeping until
the aborting process reaches AbortBufferIO, and then it broadcasts on
the condition variable and wakes everybody up, which seems a good deal
nicer.
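
For instance, the buffer I/O wait could end up shaped roughly like this (a
sketch rather than the attached buffer-io-cv-v1.patch; io_cv stands for
wherever the per-buffer condition variable ends up living):

static void
WaitIO(BufferDesc *buf)
{
    for (;;)
    {
        uint32      buf_state;

        ConditionVariablePrepareToSleep(&buf->io_cv);

        /* Re-check the flags; the I/O may have finished or been aborted. */
        buf_state = LockBufHdr(buf);
        UnlockBufHdr(buf, buf_state);
        if (!(buf_state & BM_IO_IN_PROGRESS))
            break;

        ConditionVariableSleep();
    }
    ConditionVariableCancelSleep();
}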

I'm very curious to know whether other people like this abstraction
and whether they think it will be useful for things they want to do
with parallel query (or otherwise).  Comments welcome.