Re: [HACKERS] WIP: Barriers

2017-01-06 Thread Thomas Munro
On Wed, Nov 23, 2016 at 2:28 PM, Robert Haas  wrote:
> The code here looks OK.  A few thoughts:
>
> - I'm a little unsure whether it's a good idea to remove the existing
> barrier.h and immediately add a new barrier.h that does something
> totally different.  It's tempting to try to find another name for
> these things just to avoid that.  Then again, I bet there's not much
> non-core code that is relying on the existing barrier.h, so maybe it's
> OK.  On the third hand, the risk of confusing developers may be
> non-nil even if we don't confuse any code.  I think I'll go commit
> remove-useless-barrier-v4.patch now and see if anyone shows up to
> complain...

Here is a new version with updated copyright messages and some dtrace
probes which provide an easy way to measure the time spent waiting at
barriers.

-- 
Thomas Munro
http://www.enterprisedb.com


barrier-v5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 4:42 PM, Thomas Munro
 wrote:
> On Tue, Nov 1, 2016 at 5:03 PM, Thomas Munro
>> Here's a new version which is rebased and adds support for passing
>> wait_event through to pg_stat_activity.
>
> Here's a version updated for the new condition variables interface
> which has just been committed as e8ac886c.  Some comments improved.

The code here looks OK.  A few thoughts:

- I'm a little unsure whether it's a good idea to remove the existing
barrier.h and immediately add a new barrier.h that does something
totally different.  It's tempting to try to find another name for
these things just to avoid that.  Then again, I bet there's not much
non-core code that is relying on the existing barrier.h, so maybe it's
OK.  On the third hand, the risk of confusing developers may be
non-nil even if we don't confuse any code.  I think I'll go commit
remove-useless-barrier-v4.patch now and see if anyone shows up to
complain...

- Should parallel bitmap heap scan be using this to decide who runs
the subplan?  It looks like it would work for what is needed in that
case, and it looks like it would be simpler (and probably more
correct) than what's there now.  PBMState could go away completely, I
think.
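
For illustration, the election idiom there might look roughly like this (a
sketch only: pstate, build_shared_bitmap and the wait-event constant are
hypothetical names, and the exact BarrierWait signature is whatever the
patch finally settles on):

    /* Sketch only -- hypothetical names, not code from any posted patch. */
    if (BarrierWait(&pstate->barrier, WAIT_EVENT_BITMAP_SUBPLAN))
    {
        /* Exactly one worker sees "true" and runs the subplan serially. */
        build_shared_bitmap(pstate);
    }

    /* Everyone waits again so nobody probes an unfinished bitmap. */
    BarrierWait(&pstate->barrier, WAIT_EVENT_BITMAP_SUBPLAN);

    /* All workers can now scan the completed shared bitmap in parallel. */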

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-11-22 Thread Thomas Munro
On Tue, Nov 1, 2016 at 5:03 PM, Thomas Munro
 wrote:
> On Thu, Aug 18, 2016 at 1:55 PM, Thomas Munro
>  wrote:
>> On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas  wrote:
>>> If we're going to remove barrier.h, I think that should be a separate
>>> commit from creating a new barrier.h.
>>
>> OK.  Here's a patch to remove the old header, and the v2 barrier patch
>> which adds phases and attach/detach.  As before, it depends on
>> condition-variable-v2.patch.
>
> Here's a new version which is rebased and adds support for passing
> wait_event through to pg_stat_activity.

Here's a version updated for the new condition variables interface
which has just been committed as e8ac886c.  Some comments improved.

-- 
Thomas Munro
http://www.enterprisedb.com


remove-useless-barrier-header-v4.patch
Description: Binary data


barrier-v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-10-31 Thread Thomas Munro
On Thu, Aug 18, 2016 at 1:55 PM, Thomas Munro
 wrote:
> On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas  wrote:
>> If we're going to remove barrier.h, I think that should be a separate
>> commit from creating a new barrier.h.
>
> OK.  Here's a patch to remove the old header, and the v2 barrier patch
> which adds phases and attach/detach.  As before, it depends on
> condition-variable-v2.patch.

Here's a new version which is rebased and adds support for passing
wait_event through to pg_stat_activity.

-- 
Thomas Munro
http://www.enterprisedb.com


remove-useless-barrier-header-v2.patch
Description: Binary data


barrier-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-17 Thread Thomas Munro
On Tue, Aug 16, 2016 at 1:55 AM, Robert Haas  wrote:
> On Sat, Aug 13, 2016 at 7:18 PM, Thomas Munro
>  wrote:
>> My use case for this is coordinating the phases of parallel hash
>> joins, but I strongly suspect there are other cases.  Parallel sort
>> springs to mind, which is why I wanted to post this separately and
>> earlier than my larger patch series, to get feedback from people
>> working on other parallel features.
>
> I was thinking about this over the weekend and I started to wonder
> whether this is really going to work well for hash joins.  For
> example, suppose that 6GB of work_mem is available and the projected
> size of the hash table is 8GB.  Clearly, we're going to need 2
> batches, but, if our estimates are accurate and the tuples split
> evenly between batches, each batch will be only 4GB!  That means that
> we can build up to 2GB of the hash table for the next batch before
> we've finished with the hash table for the previous batch.  It seems
> like a really good idea to try to take advantage of that as much as
> possible.

Right.  Thanks for that example.  A naive approach with barriers
between building and probing does indeed introduce a kind of pipeline
stall where workers twiddle their thumbs, despite there being unused
work_mem and useful things that could be done concurrently to fill it.

> The simple version of this is that when a worker gets done with its
> own probe phase for batch X, it can immediately start building the
> hash table for phase X+1, stopping if it fills up the unused portion
> of work_mem before the old hash table goes away.  Of course, there are
> some tricky issues with reading tapes that were originally created by
> other backends, but if I understand correctly, Peter Geoghegan has
> already done some work on that problem, and it seems like something we
> can eventually solve, even if not in the first version.
>
> The more complicated version of this is that we might want to delegate
> one or more workers to start building as much of the next-batch hash
> table as will fit instead of assisting with the current probe phase.
> Once work_mem is full, they join the probe phase and continue until
> it's done.  Again, tape management is an issue.  But you can see that
> if you can make this work, in this example, you can reduce the
> enforced pause between batches by about 50%; half the work is already
> done by the time the old hash table goes away.  I bet that has a
> chance of being fairly significant, especially for hash joins that
> have tons of batches.  I once saw a 64-batch hash join outperform a
> nested loop with inner index scan!
>
> Anyway, my point here is that I'm not sure whether the barrier
> mechanic is going to work well for computations with overlapping
> phases, and I suspect that overlapping phases is going to be an
> important technique, so we should make sure not to settle into a
> synchronization model that makes it hard.

I have a draft scheme worked out to do what you called the "simple
version", though not in the first version I will post (the "naive
version").  I don't really want to go into all the details in this
thread, not least because I'm still working on it, but I do want to
point out that barriers as a synchronisation primitive are not at all
incompatible with overlapping work: on the contrary, they can help
coordinate it while keeping the code tractable.

Rough sketch*:  There will be a secondary hash table, used for
preloading the next batch.  As in the naive version, there is a
barrier after the hash table is loaded: you can't start probing an
incomplete hash table.  But after probing the primary table for batch
n, workers will immediately begin preloading the secondary table with
tuples for batch n + 1, until either work_mem is exhausted or batch n
+ 1 is entirely loaded.  Then they will synchronise, elect one worker
to promote the secondary table to primary, synchronise again, finish
loading the newly promoted primary table, and then rinse and repeat:
synchronise to coordinate the beginning of probing for batch n + 1...

Of course there are also several other phases relating to
initialization, hashing, resizing and special cases for first and last
batches, as I will describe in a future thread with code.

*Actual patch may not resemble illustration.
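
To make the shape of that loop concrete, a schematic version might look like
this (every helper and wait-event name below is made up; only the barrier
calls reflect the proposed API):

    /* Schematic only -- all helpers are hypothetical. */
    for (batchno = 0; batchno < nbatch; batchno++)
    {
        /* Can't probe a half-built hash table. */
        finish_loading(primary_table, batchno);
        BarrierWait(&barrier, WAIT_EVENT_HASH_BUILD);    /* build -> probe */

        probe(primary_table, batchno);

        /*
         * Overlap: start filling the secondary table with tuples for the
         * next batch, until work_mem is exhausted or the batch is loaded.
         */
        preload(secondary_table, batchno + 1);

        /* Elect one worker to swap the tables, then make that visible. */
        if (BarrierWait(&barrier, WAIT_EVENT_HASH_PROMOTE))
            promote_secondary_to_primary();
        BarrierWait(&barrier, WAIT_EVENT_HASH_PROMOTE);
    }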

>> A problem that I'm still grappling with is how to deal with workers
>> that fail to launch.  What I'm proposing so far is based on static
>> worker sets, where you can only give the number of workers at
>> initialisation time, just like pthread_barrier_init.  Some other
>> libraries allow for adjustable worker sets, and I'm wondering if a
>> parallel leader might need to be able to adjust the barrier when it
>> hears of a worker not starting.  More on that soon.
>
> I think tying this into the number of workers for the parallel context
> in general is going to get you into trouble.  For example, suppose
> that we have 

Re: [HACKERS] WIP: Barriers

2016-08-16 Thread Peter Geoghegan
On Sat, Aug 13, 2016 at 4:18 PM, Thomas Munro
 wrote:
> First, you initialise a Barrier object somewhere in shared memory,
> most likely in the DSM segment used by parallel query, by calling
> BarrierInit(&barrier, nworkers).  Then workers can call
> BarrierWait(&barrier) when they want to block until all workers arrive
> at the barrier.  When the final worker arrives, BarrierWait returns in
> all workers, releasing them to continue their work.  One arbitrary
> worker receives a different return value as a way of "electing" it to
> perform serial phases of computation.  For parallel phases of
> computation, the return value can be ignored.  For example, there may
> be preparation, merging, or post-processing phases which must be done
> by just one worker, interspersed with phases where all workers do
> something.

I think that this mechanism could be quite useful for sorting with
partitioning, which doesn't exist yet. What does exist is unlikely to
benefit from this over and above what Robert's "condition variables"
offer, because as it happens there is no need to "elect" a single
worker at all. The ordering dependencies happen to fall quite naturally
across one leader process and one or more worker processes.

I do see value in this, though.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-16 Thread Peter Geoghegan
On Mon, Aug 15, 2016 at 6:55 AM, Robert Haas  wrote:
> A sort of dumb way of handling all this is to assume that once a
> worker joins the hash join, it won't go off and do anything else until
> the hash join is done.  Under that assumption, you just need some sort
> of BarrierAttach() operation; workers that have never attached the
> barrier aren't participating in the hash join at all and so they are
> irrelevant - and now you know how many workers you need to await,
> because you can keep a count of how many have attached.  Perhaps you
> simply turn away any workers that arrive after batch 0 is complete.

Is that really so bad? In general, I don't tend to think of workers as
the cost to worry about. Rather, we should be concerned about the
active use of CPU cores as our major cost.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-16 Thread Peter Geoghegan
On Mon, Aug 15, 2016 at 6:55 AM, Robert Haas  wrote:
> The simple version of this is that when a worker gets done with its
> own probe phase for batch X, it can immediately start building the
> hash table for phase X+1, stopping if it fills up the unused portion
> of work_mem before the old hash table goes away.  Of course, there are
> some tricky issues with reading tapes that were originally created by
> other backends, but if I understand correctly, Peter Geoghegan has
> already done some work on that problem, and it seems like something we
> can eventually solve, even if not in the first version.

The tape vs. BufFile vs. fd.c file handle distinctions get
*confusing*. Thomas and I have hashed this out (pun intended), but I
should summarize.

Currently, and without bringing parallelism into it, Hash joins have
multiple BufFiles (two per batch -- innerBatchFile and
outerBatchFile), which are accessed as needed. External sorts have
only one BufFile, with multiple "logical tapes" within a single
"tapeset" effectively owning space within the BufFile -- that space
doesn't have to be contiguous, and can be reused *eagerly* within and
across logical tapes in tuplesort.c's tapeset. logtape.c is a kind of
block-orientated rudimentary filesystem built on top of one BufFile.
The only real advantage of having the logtape.c abstraction is that
moving stuff around (to sort it, when multiple passes are required)
can be accomplished with minimal wasted disk space (it's eagerly
reclaimed). This is less important today than it would have been in
the past.

Clearly, it doesn't make much sense to talk about logtape.c in
connection with anything that isn't sorting, because it is very clearly
written with
that purpose alone in mind. To avoid confusion, please only talk about
tapes when talking about sorting.

So:

* tuplesort.c always talks to logtape.c, which talks to buffile.c
(which talks to fd.c).

* Hash joins use buffile.c directly, though (and have multiple
buffiles, as already noted).

Now, I might still have something that Thomas can reuse, because
buffile.c was made to support "unification" of worker BufFiles in
general. Thomas would be using that interface, if any. I haven't
studied parallel hash join at all, but presumably the difference would
be that *multiple* BufFiles would be unified, such that a
concatenated/unified BufFile would be addressable within each worker,
one per batch. All of this assumes that there is a natural way of
unifying the various batches involved across all workers, of course.

This aspect would present some complexity for Thomas, I think
(comments from hashjoin.h):

 * It is possible to increase nbatch on the fly if the in-memory hash table
 * gets too big.  The hash-value-to-batch computation is arranged so that this
 * can only cause a tuple to go into a later batch than previously thought,
 * never into an earlier batch.  When we increase nbatch, we rescan the hash
 * table and dump out any tuples that are now of a later batch to the correct
 * inner batch file.  Subsequently, while reading either inner or outer batch
 * files, we might find tuples that no longer belong to the current batch;
 * if so, we just dump them out to the correct batch file.
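
The "later batch only" property in that comment is just bit arithmetic: the
batch number is the hash value masked down to nbatch - 1 (nbatch being a
power of two), so doubling nbatch exposes one more hash bit and a tuple
either stays put or moves forward by the old nbatch.  A tiny standalone
illustration (simplified -- nodeHash.c actually uses a different slice of
the hash bits, but the masking argument is identical):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified stand-in for the hash-value-to-batch computation. */
    static uint32_t batch_of(uint32_t hashvalue, uint32_t nbatch)
    {
        return hashvalue & (nbatch - 1);    /* nbatch is a power of two */
    }

    int main(void)
    {
        uint32_t hash = 0xDEADBEEF;
        uint32_t oldbatch = batch_of(hash, 8);   /* 8 batches: 3 hash bits */
        uint32_t newbatch = batch_of(hash, 16);  /* 16 batches: 4 hash bits */

        /* The new batch is either the old one or old + 8 -- never earlier. */
        assert(newbatch == oldbatch || newbatch == oldbatch + 8);
        printf("old batch = %u, new batch = %u\n", oldbatch, newbatch);
        return 0;
    }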

I'd be concerned about managing which backend was entitled to move
tuples across batches, and so on. One thing that I haven't had to
contend with is which backend "owns" which BufFile (or underlying fd.c
file handles). There is no ambiguity about that for me. Owners delete
the temp files on Xact end, and are the only ones entitled to write to
files, and only before unification. These latter restrictions might be
lifted if there was a good reason to do so.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-15 Thread Konstantin Knizhnik


On 15.08.2016 15:42, Thomas Munro wrote:
This implementation is using a spinlock for the arrival counter, and 
signals (via Robert's condition variables and latches) for waking up 
peer processes when the counter reaches the target. I realise that 
using signals for this sort of thing is a bit unusual outside the 
Postgres universe, but won't a semaphore-based implementation require 
just as many system calls, context switches and scheduling operations? 


Yes, you are right.
I never expected that this combination of signal + local socket + select
could provide performance comparable to pthread_cond_t.
I have implemented a simple test in which two background workers emulate a
request-response round-trip using latches and using pthread primitives.
The result (average round-trip time) was 7.49 microseconds for Postgres
latches vs. 4.59 microseconds for pthread_cond_timedwait.





#define N_ROUNDTRIPS 100
#define WAIT_LATCH_TIMEOUT 6

static void PongLatch(Datum arg)
{
    int i;
    timestamp_t start;
    int result;

    BackgroundWorkerUnblockSignals();

    /* Advertise ourselves and wait for the peer to register. */
    Mtm->pong = MyProc->pgprocno;
    ResetLatch(&MyProc->procLatch);
    MtmSleep(100);
    Assert(Mtm->ping);

    for (i = 0; i <= N_ROUNDTRIPS; i++) {
        result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET|WL_TIMEOUT,
                           WAIT_LATCH_TIMEOUT);
        Assert(result & WL_LATCH_SET);
        ResetLatch(&MyProc->procLatch);
        /* Wake the peer identified by its pgprocno. */
        SetLatch(&ProcGlobal->allProcs[Mtm->ping].procLatch);
        if (i == 0) {
            start = MtmGetSystemTime();
        }
    }
    fprintf(stderr, "Average roundtrip time: %f microseconds\n",
            (double)(MtmGetSystemTime() - start) / N_ROUNDTRIPS);
}

static void PingLatch(Datum arg)
{
    int i;
    timestamp_t start;
    int result;

    BackgroundWorkerUnblockSignals();

    Mtm->ping = MyProc->pgprocno;
    ResetLatch(&MyProc->procLatch);
    MtmSleep(100);
    Assert(Mtm->pong);

    for (i = 0; i <= N_ROUNDTRIPS; i++) {
        /* Kick the peer, then wait for it to kick us back. */
        SetLatch(&ProcGlobal->allProcs[Mtm->pong].procLatch);
        result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET|WL_TIMEOUT,
                           WAIT_LATCH_TIMEOUT);
        Assert(result & WL_LATCH_SET);
        ResetLatch(&MyProc->procLatch);
        if (i == 0) {
            start = MtmGetSystemTime();
        }
    }
    fprintf(stderr, "Average roundtrip time: %f microseconds\n",
            (double)(MtmGetSystemTime() - start) / N_ROUNDTRIPS);
}


static BackgroundWorker Pinger = {
    "ping",
    BGWORKER_SHMEM_ACCESS, // | BGWORKER_BACKEND_DATABASE_CONNECTION,
    BgWorkerStart_ConsistentState,
    BGW_NEVER_RESTART,
    PingLatch
};

static BackgroundWorker Ponger = {
    "pong",
    BGWORKER_SHMEM_ACCESS, // | BGWORKER_BACKEND_DATABASE_CONNECTION,
    BgWorkerStart_ConsistentState,
    BGW_NEVER_RESTART,
    PongLatch
};

static void PingPong()
{
    RegisterBackgroundWorker(&Pinger);
    RegisterBackgroundWorker(&Ponger);
}




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-15 Thread Robert Haas
On Sat, Aug 13, 2016 at 7:18 PM, Thomas Munro
 wrote:
> I would like to propose "barriers" for Postgres processes.  A barrier
> is a very simple mechanism for coordinating parallel computation, as
> found in many threading libraries.
>
> First, you initialise a Barrier object somewhere in shared memory,
> most likely in the DSM segment used by parallel query, by calling
> BarrierInit(&barrier, nworkers).  Then workers can call
> BarrierWait(&barrier) when they want to block until all workers arrive
> at the barrier.  When the final worker arrives, BarrierWait returns in
> all workers, releasing them to continue their work.  One arbitrary
> worker receives a different return value as a way of "electing" it to
> perform serial phases of computation.  For parallel phases of
> computation, the return value can be ignored.  For example, there may
> be preparation, merging, or post-processing phases which must be done
> by just one worker, interspersed with phases where all workers do
> something.
>
> My use case for this is coordinating the phases of parallel hash
> joins, but I strongly suspect there are other cases.  Parallel sort
> springs to mind, which is why I wanted to post this separately and
> earlier than my larger patch series, to get feedback from people
> working on other parallel features.

I was thinking about this over the weekend and I started to wonder
whether this is really going to work well for hash joins.  For
example, suppose that 6GB of work_mem is available and the projected
size of the hash table is 8GB.  Clearly, we're going to need 2
batches, but, if our estimates are accurate and the tuples split
evenly between batches, each batch will be only 4GB!  That means that
we can build up to 2GB of the hash table for the next batch before
we've finished with the hash table for the previous batch.  It seems
like a really good idea to try to take advantage of that as much as
possible.

The simple version of this is that when a worker gets done with its
own probe phase for batch X, it can immediately start building the
hash table for phase X+1, stopping if it fills up the unused portion
of work_mem before the old hash table goes away.  Of course, there are
some tricky issues with reading tapes that were originally created by
other backends, but if I understand correctly, Peter Geoghegan has
already done some work on that problem, and it seems like something we
can eventually solve, even if not in the first version.

The more complicated version of this is that we might want to delegate
one or more workers to start building as much of the next-batch hash
table as will fit instead of assisting with the current probe phase.
Once work_mem is full, they join the probe phase and continue until
it's done.  Again, tape management is an issue.  But you can see that
if you can make this work, in this example, you can reduce the
enforced pause between batches by about 50%; half the work is already
done by the time the old hash table goes away.  I bet that has a
chance of being fairly significant, especially for hash joins that
have tons of batches.  I once saw a 64-batch hash join outperform a
nested loop with inner index scan!

Anyway, my point here is that I'm not sure whether the barrier
mechanic is going to work well for computations with overlapping
phases, and I suspect that overlapping phases is going to be an
important technique, so we should make sure not to settle into a
synchronization model that makes it hard.

> A problem that I'm still grappling with is how to deal with workers
> that fail to launch.  What I'm proposing so far is based on static
> worker sets, where you can only give the number of workers at
> initialisation time, just like pthread_barrier_init.  Some other
> libraries allow for adjustable worker sets, and I'm wondering if a
> parallel leader might need to be able to adjust the barrier when it
> hears of a worker not starting.  More on that soon.

I think tying this into the number of workers for the parallel context
in general is going to get you into trouble.  For example, suppose
that we have an Append plan and beneath that we have two children of
each of which is a Hash Join.  Right now, Append is dumb, so it will
blindly throw all of the workers at the first Hash Join and then, as
they emerge, it will throw them at the second one.  However, we've had
previous discussions which I'm too lazy to look up right now about
making a Parallel Append that would split the workers between the two
hash joins; if one of them finished, then those workers could go join
the other hash join in medias res.

Now, in this world, it's clearly very bad if each hash join waits for
"all of the workers" to finish a given phase before beginning the next
phase.  In fact, it's probably going to result in both hash joins
hanging, followed by everybody waiting for each other forever.  The
actual condition that must be verified is that there are no workers
which are going to keep 

Re: [HACKERS] WIP: Barriers

2016-08-15 Thread Thomas Munro
On Sun, Aug 14, 2016 at 6:54 PM, konstantin knizhnik
 wrote:
> Barriers are a really simple and convenient mechanism for process
> synchronization.
> But they are actually a special case of semaphores: given a semaphore
> primitive, it is trivial to implement a barrier.
> We have semaphores in Postgres, but ... they cannot be used by extensions:
> there is a fixed number of semaphores, allocated based on the maximal number
> of connections, and there is no mechanism for requesting additional semaphores.

Probably because they are kernel objects requiring extra resource
management.  I'm hoping for something that can be created dynamically
in DSM segments with no cleanup, and that aspect of semaphores is
problematic.

> [...]  I wonder if somebody has measured how much slower latches
> (signal+socket) are than POSIX semaphores or condition variables?

I'm not sure if it makes sense for us to use POSIX condition
variables: they require using POSIX mutexes, and we're pretty heavily
invested in the use of lwlocks that are hooked into our error handling
system, and spinlocks.

I'd be curious to know if you can make a better barrier with
semaphores.  I've attached a little test module which can be used to
measure the time for N processes to synchronise at M barrier wait
points.  You can run it with SELECT barrier_test(<nworkers>, <npoints>,
<implementation>), where implementation 0 uses the barrier patch I
posted and you can add another implementation as 1.  This patch
requires lwlocks-in-dsm-v3.patch, condition-variable-v2.patch,
barrier-v1.patch, in that order.

This implementation is using a spinlock for the arrival counter, and
signals (via Robert's condition variables and latches) for waking up
peer processes when the counter reaches the target.  I realise that
using signals for this sort of thing is a bit unusual outside the
Postgres universe,  but won't a semaphore-based implementation require
just as many system calls, context switches and scheduling operations?
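
For readers following along, here is a minimal sketch of a barrier built that
way (this is not the patch: the field names, the exact BarrierWait signature
and the wait-event plumbing are illustrative, and the attach/detach support
discussed elsewhere in the thread is omitted):

    #include "postgres.h"
    #include "storage/condition_variable.h"
    #include "storage/spin.h"

    typedef struct Barrier
    {
        slock_t           mutex;
        int               participants;  /* fixed at init time */
        int               arrived;       /* how many have reached the barrier */
        int               phase;         /* bumped once per completed barrier */
        ConditionVariable cv;
    } Barrier;

    void
    BarrierInit(Barrier *barrier, int participants)
    {
        SpinLockInit(&barrier->mutex);
        barrier->participants = participants;
        barrier->arrived = 0;
        barrier->phase = 0;
        ConditionVariableInit(&barrier->cv);
    }

    /* Returns true in exactly one (arbitrary) participant: the "elected" one. */
    bool
    BarrierWait(Barrier *barrier, uint32 wait_event_info)
    {
        bool    elected = false;
        int     start_phase;

        SpinLockAcquire(&barrier->mutex);
        start_phase = barrier->phase;
        if (++barrier->arrived == barrier->participants)
        {
            /* Last to arrive: reset for reuse, advance the phase, get elected. */
            barrier->arrived = 0;
            barrier->phase++;
            elected = true;
        }
        SpinLockRelease(&barrier->mutex);

        if (elected)
            ConditionVariableBroadcast(&barrier->cv);   /* wake everyone else */
        else
        {
            /* Sleep until the phase advances, i.e. the last worker arrives. */
            ConditionVariablePrepareToSleep(&barrier->cv);
            for (;;)
            {
                bool    advanced;

                SpinLockAcquire(&barrier->mutex);
                advanced = (barrier->phase != start_phase);
                SpinLockRelease(&barrier->mutex);
                if (advanced)
                    break;
                ConditionVariableSleep(&barrier->cv, wait_event_info);
            }
            ConditionVariableCancelSleep();
        }
        return elected;
    }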

-- 
Thomas Munro
http://www.enterprisedb.com


barrier-test.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-14 Thread konstantin knizhnik
Hi Thomas,

Barriers are a really simple and convenient mechanism for process
synchronization.
But they are actually a special case of semaphores: given a semaphore
primitive, it is trivial to implement a barrier.
We have semaphores in Postgres, but ... they cannot be used by extensions:
there is a fixed number of semaphores, allocated based on the maximal number
of connections, and there is no mechanism for requesting additional semaphores.
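
As an illustration of that claim, here is a sketch of the classic
counter-plus-semaphore barrier using plain POSIX semaphores (single-use, for
brevity; the struct would have to live in shared memory to work across
processes, and this is not code from the Postgres tree):

    #include <semaphore.h>

    typedef struct
    {
        sem_t mutex;         /* protects 'arrived' */
        sem_t gate;          /* released once everyone has arrived */
        int   participants;
        int   arrived;
    } SemBarrier;

    static void
    sem_barrier_init(SemBarrier *b, int participants)
    {
        sem_init(&b->mutex, 1, 1);   /* pshared = 1: usable across processes */
        sem_init(&b->gate, 1, 0);
        b->participants = participants;
        b->arrived = 0;
    }

    static void
    sem_barrier_wait(SemBarrier *b)
    {
        int i;

        sem_wait(&b->mutex);
        if (++b->arrived == b->participants)
        {
            /* Last arrival opens the gate for everybody else. */
            for (i = 0; i < b->participants - 1; i++)
                sem_post(&b->gate);
            sem_post(&b->mutex);
            return;
        }
        sem_post(&b->mutex);
        sem_wait(&b->gate);          /* block until the last participant arrives */
    }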


Robert has recently proposed condition variables, which are also very useful.
Right now we have spinlocks, LW-locks and latches.
From my point of view that is not enough. While creating various extensions for
Postgres I have always felt the lack of such synchronization primitives as
events (condition variables) and semaphores. Events and semaphores are similar,
and it is possible to implement either of them on top of the other. But from
the user's point of view they have different semantics and use cases, so it is
better to provide both of them.

I wonder if we should provide a system abstraction layer (SAL) for Postgres,
where we can isolate all system-dependent code and which will be available for
core developers as well as for developers of extensions.  The obvious
candidates for the SAL are:
1. Synchronization primitives (locks, events, semaphores, mutexes, spinlocks, 
latches, barriers)
2. Shared memory
3. File access
4. Network sockets
5. Process control (fork, ...)

Certainly it requires a lot of refactoring, but it will make the Postgres code
much more elegant and easier to read and maintain.
Also, it is not necessary to do all these changes in one step: here we do not
need atomic transactions :)
We can start, for example, with the synchronization primitives, since a lot of
changes are being proposed in that area anyway.

Parallel execution is one of the most promising approaches to improving
Postgres performance, and I do not mean just parallel execution of a single
query: parallel vacuum, parallel index creation, parallel sort, ...  To
implement all of this we definitely need convenient and efficient
synchronization primitives.
The exact set of such primitives can be discussed. IMHO it should include
RW-locks (the current LW-locks), mutexes (spinlock + some mechanism to wait),
events (condition variables), semaphores and barriers (based on semaphores).
Latches can be kept for backward compatibility or be replaced with events.
I wonder if somebody has measured how much slower latches (signal+socket) are
than POSIX semaphores or condition variables?



On Aug 14, 2016, at 2:18 AM, Thomas Munro wrote:

> Hi hackers,
> 
> I would like to propose "barriers" for Postgres processes.  A barrier
> is a very simple mechanism for coordinating parallel computation, as
> found in many threading libraries.
> 
> First, you initialise a Barrier object somewhere in shared memory,
> most likely in the DSM segment used by parallel query, by calling
> BarrierInit(&barrier, nworkers).  Then workers can call
> BarrierWait(&barrier) when they want to block until all workers arrive
> at the barrier.  When the final worker arrives, BarrierWait returns in
> all workers, releasing them to continue their work.  One arbitrary
> worker receives a different return value as a way of "electing" it to
> perform serial phases of computation.  For parallel phases of
> computation, the return value can be ignored.  For example, there may
> be preparation, merging, or post-processing phases which must be done
> by just one worker, interspersed with phases where all workers do
> something.
> 
> My use case for this is coordinating the phases of parallel hash
> joins, but I strongly suspect there are other cases.  Parallel sort
> springs to mind, which is why I wanted to post this separately and
> earlier than my larger patch series, to get feedback from people
> working on other parallel features.
> 
> A problem that I'm still grappling with is how to deal with workers
> that fail to launch.  What I'm proposing so far is based on static
> worker sets, where you can only give the number of workers at
> initialisation time, just like pthread_barrier_init.  Some other
> libraries allow for adjustable worker sets, and I'm wondering if a
> parallel leader might need to be able to adjust the barrier when it
> hears of a worker not starting.  More on that soon.
> 
> Please see the attached WIP patch.  I had an earlier version with its
> own waitlists and signalling machinery etc, but I've now rebased it to
> depend on Robert Haas's proposed condition variables, making this code
> much shorter and sweeter.  So it depends on his
> condition-variable-vX.patch[1], which in turn depends on my
> lwlocks-in-dsm-vX.patch[2] (for proclist).
> 
> When Michaël Paquier's work on naming wait points[3] lands, I plan to
> include event IDs as an extra argument to BarrierWait which will be
> passed through so as to show up in pg_stat_activity.  Then you'll be
> able to see where workers are waiting for each other!  For now I
> didn't want to tangle this up with yet