Re: [HACKERS] asynchronous and vectorized execution

2016-12-01 Thread Haribabu Kommi
On Mon, Oct 3, 2016 at 3:25 PM, Kyotaro HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote:

> At Mon, 3 Oct 2016 13:14:23 +0900, Michael Paquier
> <michael.paqu...@gmail.com> wrote in <xn_g7uwnpqun_a_sewb...@mail.gmail.com>
> > On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
> >  wrote:
> > > Sorry for no response, but the answer is yes. We should be able
> > > to avoid the problem by managing execution state for every
> > > node. But that needs an additional flag in *State structs and
> > > manipulation while shuttling up and down the execution tree.
> >
> > Moved to next CF.
>
> Thank you.



Closed in the 2016-11 commitfest with "returned with feedback" status.
This is as per my understanding of the recent mails on the thread.
Please feel free to update the status if it doesn't reflect the
actual state of the patch.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] asynchronous and vectorized execution

2016-10-02 Thread Kyotaro HORIGUCHI
At Mon, 3 Oct 2016 13:14:23 +0900, Michael Paquier wrote in 
> On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
>  wrote:
> > Sorry for no response, but the answer is yes. We should be able
> > to avoid the problem by managing execution state for every
> > node. But that needs an additional flag in *State structs and
> > manipulation while shuttling up and down the execution tree.
> 
> Moved to next CF.

Thank you.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-10-02 Thread Michael Paquier
On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
 wrote:
> Sorry for no response, but the answer is yes. We should be able
> to avoid the problem by managing execution state for every
> node. But that needs an additional flag in *State structs and
> manipulation while shuttling up and down the execution tree.

Moved to next CF.
-- 
Michael




Re: [HACKERS] asynchronous and vectorized execution

2016-09-29 Thread Kyotaro HORIGUCHI
Hello, thank you for the comment.

At Fri, 23 Sep 2016 18:15:40 +0530, Amit Khandekar wrote in 

Re: [HACKERS] asynchronous and vectorized execution

2016-09-23 Thread Robert Haas
On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar  wrote:
> For e.g., in the above plan which you specified, suppose:
> 1. Hash Join has called ExecProcNode() for the child foreign scan b, and
> so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
> 2. The event wait list already has foreign scan on a, which is on a
> different subtree.
> 3. This foreign scan a happens to be ready, so in
> ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
> which returns with result_ready.
> 4. Since it returns result_ready, its parent node is now inserted in the
> callbacks array, and so its parent (Append) is executed.
> 5. But this Append planstate is already in the middle of its own
> execution: it has called the Hash Join and is waiting for it.

Ah, yeah, something like that could happen.  I've spent much of this
week working on a new design for this feature which I think will avoid
this problem.  It doesn't work yet - in fact I can't even really test
it yet.  But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-09-23 Thread Amit Khandekar
On 13 September 2016 at 20:20, Robert Haas  wrote:

> On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
>  wrote:
> > [ new patches ]
>
> +/*
> + * We assume that few nodes are async-aware and async-unaware
> + * nodes cannot be reverse-dispatched from lower nodes that are
> + * async-aware. Firing of an async node that is not a descendant
> + * of the planstate will cause such reverse-dispatching to
> + * async-aware nodes, which is unexpected behavior for them.
> + *
> + * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
> + * where the OUTER is running asynchronously but the Hashjoin is
> + * waiting on the async INNER during inner-hash creation. If the
> + * OUTER fires in this case, since no one is waiting on it,
> + * ExecAsyncWaitForNode finally dispatches to the Hashjoin, which
> + * is now in the middle of doing its work.
> + */
> +if (!IsParent(planstate, node))
> +continue;
>
> I'm not entirely sure that I understand this comment, but I don't
> think it's going in the right direction.  Let's start with the example
> in the second paragraph. If the hash join is async-unaware, then it
> isn't possible for the hash join to be both running the outer side of
> the join asynchronously and at the same time waiting on the inner
> side.  Once it tries to pull the first tuple from the outer side, it's
> waiting for that to finish and can't do anything else.  So, the inner
> side can't possibly get touched in any way until the outer side
> finishes.  For anything else to happen, the hash join would have to be
> async-aware.  Even if we did that, I don't think it would be right to
> kick off both sides of the hash join at the same time.  Right now, if
> the outer side turns out to be empty, we never need to build the hash
> table, and that's good.
>

I feel the !IsParent() condition is actually there to prevent the infinite
wait caused by a re-entrancy issue in ExecAsyncWaitForNode() that Kyotaro
mentioned earlier. But yes, the comments don't explain exactly how the
hash join can cause the re-entrancy issue.

I attempted to come up with some testcase which might reproduce the
infinite waiting in ExecAsyncWaitForNode() after removing the !IsParent()
check, so that the other subtree nodes are also included, but I couldn't
reproduce it. Kyotaro, is it possible for you to give a testcase that
consistently hangs if we revert the !IsParent() check?

I was also thinking about another possibility where the same plan state
node is re-entered, as explained below.

>
> I don't think it's a good idea to wait for only nodes that are in the
> current subtree.  For example, consider a plan like this:
>
> Append
> -> Foreign Scan on a
> -> Hash Join
>   -> Foreign Scan on b
>   -> Hash
> -> Seq Scan on x
>
> Suppose Append and Foreign Scan are async-aware but the other nodes
> are not.  Append kicks off the Foreign Scan on a and then waits for
> the hash join to produce a tuple; the hash join kicks off the Foreign
> Scan on b and waits for it to return a tuple.  If, while we're waiting
> for the foreign scan on b, the foreign scan on a needs some attention
> - either to produce tuples, or maybe just to call PQconsumeInput() so
> that more data can be sent from the other side, I think we need to be
> able to do that.  There's no real problem here; even if the Append
> becomes result-ready before the hash join returns, that is fine.


Yes, I agree: we should be able to do this. Since we have all the waiting
events in a common estate, there's no harm if we start executing nodes of
another sub-tree when we get an event from there.

But I am thinking about what would happen when this node from the other
sub-tree returns result_ready, and then its parents are called, and the
result gets bubbled up to the node which had already caused us to call
ExecAsyncWaitForNode() in the first place.

For e.g., in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and
so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is on a
different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of its own
execution: it has called the Hash Join and is waiting for it.

Is it safe to execute the same plan state when it is already in the
middle of its own execution? In other words, is the plan state
re-entrant? I suspect the new execution may even corrupt the structures
with which it was already working.

Re: [HACKERS] asynchronous and vectorized execution

2016-09-13 Thread Robert Haas
On Tue, Aug 2, 2016 at 3:41 AM, Kyotaro HORIGUCHI
 wrote:
> Thank you for the comment.
>
> At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar wrote in
>> On 21 July 2016 at 15:20, Kyotaro HORIGUCHI wrote:
>>
>> >
>> > After some consideration, I found that ExecAsyncWaitForNode
>> > cannot be reentrant, because that would mean control goes into
>> > async-unaware nodes while there are not-ready nodes, which is an
>> > inconsistent state. To inhibit such reentering, I allocated node
>> > identifiers in depth-first order so that the ascendant-descendant
>> > relationship can be checked in a simple way (the nested-set model),
>> > and call ExecAsyncConfigureWait only for the descendant nodes of
>> > the parameter planstate.
>> >
>> >
>> We have estate->waiting_nodes containing a mix of async-aware and
>> non-async-aware nodes. I was thinking, an asynchrony tree would have only
>> async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
>> Somehow, if we restrict the bubbling up of events only up to the root of
>> the asynchrony subtree, do you think we can simplify some of the
>> complexities?
>
> The current code prohibits registration of nodes outside the
> current subtree to avoid the reentering disaster.
>
> Indeed, leaving a "waiting node" mark or something similar on every
> root node at the first visit would enable the propagation to stop
> at the root of any async-subtree. Nevertheless, when an
> async-child under an inactive async-root fires, the new tuple is
> loaded but not consumed, and a succeeding firing on the same
> child then leads to a deadlock (without result queueing). However,
> that can be avoided if ExecAsyncConfigureWait doesn't register
> nodes in the ready state.

Why would a node call ExecAsyncConfigureWait in the first place if it
already had a result ready?  I think it shouldn't do that.

> On the other hand, any two or more asynchronous nodes can share a
> synchronization object. For instance, multiple postgres_fdw scan
> nodes can share one server connection, and only one of them can get
> into a waitable state at once. If no async-child in the current
> async subtree is waitable, it must be stuck. So I think it is
> crucial for ExecAsyncWaitForNode to force at least one child *in
> the current async subtree* to get into a waiting state in such a
> situation. The ascendant-descendant relationship is necessary to
> do that anyway.

This is another example of a situation where waiting only for nodes
within a subtree causes problems.

Suppose there are two Foreign Scans in completely different parts of
the plan tree that are going to use, in alternation, the same
connection to the same remote server.  When we encounter the first
one, it kicks off the query, uses ExecAsyncConfigureWait to register
itself as waiting, and returns without becoming ready.  When we
encounter the second one, it can't kick off the query and therefore
has no chance of becoming ready until after the first one has finished
with the connection.  Suppose we then wait for the second Foreign
Scan.  Well, we had better wait for the first one, too!  If we don't,
it will never finish with the connection, so the second node will
never get to use it, and now we're in trouble.

I think what we need is for the ConnCacheEntry to have a place to note
the ForeignScanState that is using the connection and any other
PlanState objects that would like to use it.  When one
ForeignScanState is done with the ConnCacheEntry, it activates the
next one, which then takes over.  That seems simple enough, but
there's a problem here for suspended queries: if we stop executing a
plan while some scan within that plan is holding onto a
ConnCacheEntry, and then we run some other query that wants to use the
same one, we've got a problem.  Maybe we can get by with letting the
other query finish running and then executing our own query, but that
might be messy to implement.  Another idea is to somehow let any
in-progress query finish running before allowing the first query to be
suspended; that would need some new infrastructure.
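
A rough sketch of the shape I have in mind, with invented field and
function names (the real ConnCacheEntry in postgres_fdw's connection.c
has different contents; this is only an illustration):

/* Hypothetical: track the scan using the connection, plus waiters. */
typedef struct ConnCacheEntry
{
	PGconn	   *conn;			/* the shared libpq connection */
	ForeignScanState *current_owner;	/* scan now using the connection */
	List	   *waiters;		/* ForeignScanStates queued behind it */
} ConnCacheEntry;

/* Hypothetical: called when the current owner is done with the connection. */
static void
ConnCacheEntryReleaseOwner(ConnCacheEntry *entry)
{
	entry->current_owner = NULL;
	if (entry->waiters != NIL)
	{
		/* hand the connection to the next waiter, which activates it */
		entry->current_owner = (ForeignScanState *) linitial(entry->waiters);
		entry->waiters = list_delete_first(entry->waiters);
		/* the new owner would kick off its remote query here */
	}
}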

My main point here is that I think waiting for only a subtree is an
idea that cannot work out well.  Whatever problems are pushing you
into that design, we need to confront those problems directly and fix
them.  There shouldn't be any unsolvable problems in waiting for
everything in the whole query, and I'm pretty sure that's going to be
a more elegant and better-performing design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-09-13 Thread Robert Haas
On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
 wrote:
> [ new patches ]

+/*
+ * We assume that few nodes are async-aware and async-unaware
+ * nodes cannot be reverse-dispatched from lower nodes that are
+ * async-aware. Firing of an async node that is not a descendant
+ * of the planstate will cause such reverse-dispatching to
+ * async-aware nodes, which is unexpected behavior for them.
+ *
+ * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+ * where the OUTER is running asynchronously but the Hashjoin is
+ * waiting on the async INNER during inner-hash creation. If the
+ * OUTER fires in this case, since no one is waiting on it,
+ * ExecAsyncWaitForNode finally dispatches to the Hashjoin, which
+ * is now in the middle of doing its work.
+ */
+if (!IsParent(planstate, node))
+continue;

I'm not entirely sure that I understand this comment, but I don't
think it's going in the right direction.  Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side.  Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else.  So, the inner
side can't possibly get touched in any way until the outer side
finishes.  For anything else to happen, the hash join would have to be
async-aware.  Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time.  Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I don't think it's a good idea to wait for only nodes that are in the
current subtree.  For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
  -> Foreign Scan on b
  -> Hash
-> Seq Scan on x

Suppose Append and Foreign Scan are async-aware but the other nodes
are not.  Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple.  If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that.  There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine.  We
will not actually be able to return from the append until the hash
join returns because of what's on the call stack, but that doesn't
mean that the Append can't be marked result-ready sooner than that.
The situation can be improved by making the hash join node
async-aware, but if we don't do that it's still not broken.

I think the reason that you originally got backed into this design was
problems with reentrancy.  I don't think I really
understand in any detail exactly what problem you hit there, but it
seems to me that the following problem could occur:
ExecAsyncWaitForNode finds two events and schedules two callbacks.  It
calls the first of those two callbacks.  Before that callback returns,
it again calls ExecAsyncWaitForNode.  But the new invocation of
ExecAsyncWaitForNode doesn't know that there is a second callback
pending, so it somehow gets confused.  However, I think this problem
can be fixed using a different method.  The occurred_event and callbacks
arrays defined by ExecAsyncWaitForNode can be made part of the EState
rather than being local variables.  When ExecAsyncWaitForNode is
called, it checks whether there are any pending callbacks; if so, it
removes and calls the first one.  Only if there are no pending
callbacks does it actually wait; when a wait event fires, one or more
new callbacks are generated.  This is very similar to the reason why
ReceiveSharedInvalidMessages uses a static messages array rather than
a normal local variable.  That function is solving a problem which I
suspect is very similar to the one we have here.  However, it would be
helpful if you could provide some more details on what you think the
reentrancy problems are, because I'm not able to figure them out from
your messages so far.
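
In sketch form, with invented names (es_pending_callbacks and
ExecAsyncEventWait don't exist; ExecDispatchNode and result_ready are
from your posted patches):

void
ExecAsyncWaitForNode(PlanState *node)
{
	EState	   *estate = node->state;

	while (!node->result_ready)
	{
		if (estate->es_pending_callbacks != NIL)	/* invented field */
		{
			/* Drain a callback queued by this or an outer invocation. */
			PlanState  *ready = (PlanState *)
				linitial(estate->es_pending_callbacks);

			estate->es_pending_callbacks =
				list_delete_first(estate->es_pending_callbacks);
			ExecDispatchNode(ready);
		}
		else
		{
			/*
			 * Nothing pending, so really wait; firing events append
			 * entries to estate->es_pending_callbacks.
			 */
			ExecAsyncEventWait(estate);		/* invented helper */
		}
	}
}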

The most mysterious part of this hunk to me is the comment that
"Firing of an async node that is not a descendant of the planstate
will cause such reverse-dispatching to async-aware nodes, which is
unexpected behavior for them."  It is the async-unaware nodes which
might have a problem.  The nodes that have been taught about the new
system should know what to expect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] asynchronous and vectorized execution

2016-09-12 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 01 Sep 2016 16:12:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160901.161231.110068639.horiguchi.kyot...@lab.ntt.co.jp>
> There's performance degradation for non-asynchronous nodes, as
> shown as 't0' below.
> 
> The patch adds two "if-then"s and one additional function call as
> asynchronous stuff into ExecProcNode, which is called very heavily
> and formerly consisted of only five meaningful lines. The stuff slows
> performance by about 1% for the simple seqscan case. The following is
> the performance numbers previously shown upthread.  (Or the
> difference might be too small to be a meaningful performance
> difference..)

I tried __builtin_expect before moving the stuff out of
execProcNode (attached patch). I found a conversation about the
builtin in a past discussion.

https://www.postgresql.org/message-id/CA+TgmoYknejCgWMb8Tg63qA67JoUG2uCc0DZc5mm9=e18am...@mail.gmail.com

> If we can show cases where it reliably produces a significant
> speedup, then I would think it would be worthwhile

I got a result as the followings.

master(67e1e2a)-O2
  time(ms)  stddev(ms)
  t0: 3928.22 (  0.40)   # Simple SeqScan only
  pl: 1665.14 (  0.53)   # Append(SeqScan)

Patched-O2 / NOT Use __builtin_expect
  t0: 4042.69 (  0.92)degradation to master is 2.9%
  pl: 1698.46 (  0.44)degradation to master is 2.0%

Patched-O2 / Use __builtin_expect
  t0: 3886.69 (  1.93)*gain* to master is 1.06%
  pl: 1671.66 (  0.67)degradation to master is 0.39%

I haven't directly confirmed the builtin's implications for
optimization of the surrounding code, but I suspect there are
some. I also tried the builtin on ExecAppend but saw no
difference. The numbers fluctuate easily with any change in
the machine's state, so the lower digits aren't trustworthy, but
several successive repetitions showed fluctuations of up to a few
milliseconds.

execProcNode can be left as it is if __builtin_expect
is usable, but ExecAppend still needs an improvement.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f1f02557f7f4d694f0e3b4d62a6bdfad8e746b03 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 12 Sep 2016 16:36:37 +0900
Subject: [PATCH] Use __builtin_expect to optimize branches

__builtin_expect can minimize the cost of branch misprediction
in certain cases. It seems to work well here.
---
 src/backend/executor/execProcnode.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index cef262b..c247fa3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -585,13 +585,22 @@ ExecExecuteNode(PlanState *node)
  *  execution subtree and every subtree should have an individual context.
  *  
  */
+#define MY_USE_LIKELY
+#if defined MY_USE_LIKELY
+#define my_likely(x) __builtin_expect((x),1)
+#define my_unlikely(x) __builtin_expect((x),0)
+#else
+#define my_likely(x) (x)
+#define my_unlikely(x) (x)
+#endif
+
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
 	CHECK_FOR_INTERRUPTS();
 
 	/* Return unconsumed result if any */
-	if (node->result_ready)
+	if (my_unlikely(node->result_ready))
 		return ExecConsumeResult(node);
 
 	if (node->chgParam != NULL) /* something changed */
@@ -599,7 +608,7 @@ ExecProcNode(PlanState *node)
 
 	ExecDispatchNode(node);
 
-	if (!node->result_ready)
+	if (my_unlikely(!node->result_ready))
 		ExecAsyncWaitForNode(node);
 
 	return ExecConsumeResult(node);
-- 
2.9.2




Re: [HACKERS] asynchronous and vectorized execution

2016-09-01 Thread Kyotaro HORIGUCHI
These are random thoughts on this patch.

At Tue, 30 Aug 2016 12:17:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160830.121752.100817694.horiguchi.kyot...@lab.ntt.co.jp>
> > As the result, the attached patchset is functionally the same
> > with the last version but replace misused Assert with
> > AssertMacro.

There's performance degradation for non-asynchronous nodes, as
shown as 't0' below.

The patch adds two "if-then"s and one additional function call as
asynchronous stuff into ExecProcNode, which is called very heavily
and formerly consisted of only five meaningful lines. The stuff slows
performance by about 1% for the simple seqscan case. The following is
the performance numbers previously shown upthread.  (Or the
difference might be too small to be a meaningful performance
difference..)

===
t0- (SeqScan()) (2 parallel)
pl- (Append(4 * SeqScan()))
pf0 (Append(4 * ForeignScan())) all ForeignScans are on the same connection.
pf1 (Append(4 * ForeignScan())) all ForeignScans have their own connections.

 
patched -O2  time(ms)  stddev(ms)  gain from unpatched (%)
t0           4121.27      1.10      -1.44
pl           1757.41      0.94      -1.73
pf0          6458.99    192.40      20.26
pf1          1747.40     24.81      78.39

unpatched -O2
t0           4062.60      1.95
pl           1727.45      9.41
pf0          8100.47     24.51
pf1          8086.52     33.53
===

So, finally, it seems that the infrastructure should not live in
ExecProcNode, or else the executor needs a redesign.  I tried a
jump table to dispatch nodes, but in vain. Having a flag in
EState may make it possible to keep the async stuff off the
non-async route (similar to, but maybe different from, my first
patch). JIT compilation seems promising but it is a different thing.
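
The EState-flag idea would look roughly like this; only a sketch of
the shape, where es_async_active and ExecProcNodeSync are invented
names, not anything in the patchset:

TupleTableSlot *
ExecProcNode(PlanState *node)
{
	CHECK_FOR_INTERRUPTS();

	/* Purely synchronous plans skip all async bookkeeping. */
	if (!node->state->es_async_active)	/* invented flag */
		return ExecProcNodeSync(node);	/* the old five-line body */

	/* the async-aware path, as in the current patch */
	if (node->result_ready)
		return ExecConsumeResult(node);

	if (node->chgParam != NULL)	/* something changed */
		ExecReScan(node);

	ExecDispatchNode(node);
	if (!node->result_ready)
		ExecAsyncWaitForNode(node);

	return ExecConsumeResult(node);
}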


As for nodeGather, it expects the leader process to be one of the
workers; the leader should be freed from that so it can behave as an
async node. But still, the expected number of workers seems to
be too small to get a meaningful benefit from async execution.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-08-29 Thread Kyotaro HORIGUCHI
No, it was wrong.

At Mon, 29 Aug 2016 17:08:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160829.170836.161449399.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> I considered applying the async infrastructure to nodeGather,
> but since parallel workers hardly make Gather (or the leader)
> wait, it's really useless at least for simple cases. Furthermore,
> as several people may have said before, unlike
> foreign scans, Gather (or other kinds of parallel) nodes usually
> have several workers, and at most a two-digit number of them
> even on so-called many-core boxes. I finally gave up
> applying this to nodeGather.

I overlooked that a local scan takes place instead of waiting for
workers to be ready. I will reconsider, taking that into account.

> As the result, the attached patchset is functionally the same
> with the last version but replace misused Assert with
> AssertMacro.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-08-02 Thread Kyotaro HORIGUCHI
Thank you for the comment.

At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar wrote in
> On 21 July 2016 at 15:20, Kyotaro HORIGUCHI wrote:
> 
> >
> > After some consideration, I found that ExecAsyncWaitForNode
> > cannot be reentrant, because that would mean control goes into
> > async-unaware nodes while there are not-ready nodes, which is an
> > inconsistent state. To inhibit such reentering, I allocated node
> > identifiers in depth-first order so that the ascendant-descendant
> > relationship can be checked in a simple way (the nested-set model),
> > and call ExecAsyncConfigureWait only for the descendant nodes of
> > the parameter planstate.
> >
> >
> We have estate->waiting_nodes containing a mix of async-aware and
> non-async-aware nodes. I was thinking, an asynchrony tree would have only
> async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
> Somehow, if we restrict the bubbling up of events only up to the root of
> the asynchrony subtree, do you think we can simplify some of the
> complexities?

The current code prohibits registration of nodes outside the
current subtree to avoid the reentering disaster.

Indeed, leaving a "waiting node" mark or something similar on every
root node at the first visit would enable the propagation to stop
at the root of any async-subtree. Nevertheless, when an
async-child under an inactive async-root fires, the new tuple is
loaded but not consumed, and a succeeding firing on the same
child then leads to a deadlock (without result queueing). However,
that can be avoided if ExecAsyncConfigureWait doesn't register
nodes in the ready state.

On the other hand, any two or more asynchronous nodes can share a
synchronization object. For instance, multiple postgres_fdw scan
nodes can share one server connection, and only one of them can get
into a waitable state at once. If no async-child in the current
async subtree is waitable, it must be stuck. So I think it is
crucial for ExecAsyncWaitForNode to force at least one child *in
the current async subtree* to get into a waiting state in such a
situation. The ascendant-descendant relationship is necessary to
do that anyway.

Since we need the node-id to detect the ascendant-descendant
relationship anyway, and finally have to restrict async-nodes with
it, activating only descendant nodes from the start makes things
simpler than avoiding the possible deadlock later, as described
above.

# It was implemented as a per-subtree waiting-node list, but that
# was fragile and too ugly..


> For e.g. ExecAsyncWaitForNode() has become a bit complex seemingly because
> it has to handle non-async-nodes also, and that's the reason I believe you
> have introduced modes such as ASYNCCONF_FORCE_ADD.

As explained above, ASYNCCONF_FORCE_ADD is not for
non-async-nodes, but for sets of async nodes that share a
synchronization object. We could let ExecAsyncConfigureWait
acquire the async-object forcibly from the start, but that in
turn causes possibly unnecessary transfers of the sync-object
among the nodes sharing it.



I hope the above sentences are readable enough, but any questions
are welcome, even about the meaning of a sentence.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-07-31 Thread Amit Khandekar
On 21 July 2016 at 15:20, Kyotaro HORIGUCHI  wrote:

>
> After some consideration, I found that ExecAsyncWaitForNode
> cannot be reentrant, because that would mean control goes into
> async-unaware nodes while there are not-ready nodes, which is an
> inconsistent state. To inhibit such reentering, I allocated node
> identifiers in depth-first order so that the ascendant-descendant
> relationship can be checked in a simple way (the nested-set model),
> and call ExecAsyncConfigureWait only for the descendant nodes of
> the parameter planstate.
>
>
We have estate->waiting_nodes containing a mix of async-aware and
non-async-aware nodes. I was thinking, an asynchrony tree would have only
async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
Somehow, if we restrict the bubbling up of events only up to the root of
the asynchrony subtree, do you think we can simplify some of the
complexities? For e.g. ExecAsyncWaitForNode() has become a bit complex
seemingly because it has to handle non-async-nodes also, and that's the
reason I believe you have introduced modes such as ASYNCCONF_FORCE_ADD.


> regards,
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


Re: [HACKERS] asynchronous and vectorized execution

2016-07-22 Thread Kyotaro HORIGUCHI
The previous patch set doesn't build with --enable-cassert. The
attached additional patch fixes that. It theoretically shouldn't
cause degradation, but I'll measure the performance change.

At Thu, 21 Jul 2016 18:50:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160721.185007.268388411.horiguchi.kyot...@lab.ntt.co.jp>
> Hello,
> 
> At Tue, 12 Jul 2016 11:42:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20160712.114255.156540680.horiguchi.kyot...@lab.ntt.co.jp>
> After some refactoring, the degradation for a simple seqscan is
> reduced to 1.4% and that of "Append(SeqScan())" is reduced to
> 1.7%. The gains are the same as in the previous measurement. The
> scale has been changed from the previous measurement in some test cases.
> 
> t0- (SeqScan()) (2 parallel)
> pl- (Append(4 * SeqScan()))
> pf0 (Append(4 * ForeignScan())) all ForeignScans are on the same connection.
> pf1 (Append(4 * ForeignScan())) all ForeignScans have their own connections.
> 
> patched -O2  time(ms)  stddev(ms)  gain from unpatched (%)
> t0           4121.27      1.10      -1.44
> pl           1757.41      0.94      -1.73
> pf0          6458.99    192.40      20.26
> pf1          1747.40     24.81      78.39
> 
> unpatched -O2
> t0           4062.60      1.95
> pl           1727.45      9.41
> pf0          8100.47     24.51
> pf1          8086.52     33.53
> 
> > > In addition to the above, I will try a reentrant ExecAsyncWaitForNode
> > > or something.
> 
> After some consideration, I found that ExecAsyncWaitForNode
> cannot be reentrant, because that would mean control goes into
> async-unaware nodes while there are not-ready nodes, which is an
> inconsistent state. To inhibit such reentering, I allocated node
> identifiers in depth-first order so that the ascendant-descendant
> relationship can be checked in a simple way (the nested-set model),
> and call ExecAsyncConfigureWait only for the descendant nodes of
> the parameter planstate.
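
The descendant check itself is cheap with that numbering; roughly
like this (a sketch only, with illustrative field names rather than
the ones actually used in the patch):

/*
 * ps_asyncid is assigned in depth-first (preorder) order at executor
 * startup; ps_subtree_last is the largest id within the node's
 * subtree. Both names are invented for this sketch.
 */
static inline bool
IsParent(PlanState *parent, PlanState *node)
{
	return node->ps_asyncid >= parent->ps_asyncid &&
		node->ps_asyncid <= parent->ps_subtree_last;
}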

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 2a6a95a038948a7a4384f44ef99a9a454175a47c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 22 Jul 2016 17:07:34 +0900
Subject: [PATCH 8/8] Change two macros into inline functions.

ExecConsumeResult cannot have assertions in macro form, so this
patch alters it into a function. ExecReturnTuple is also changed
for uniformity. This might reduce performance (theoretically it
shouldn't happen, but I believe I saw it..)
---
 src/include/executor/executor.h | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c1ef2ab..8e55b54 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -231,20 +231,25 @@ extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
 /* Convenience function to set a node's result to a TupleTableSlot. */
-#define ExecReturnTuple(node, slot) \
-{ \
-	Assert(!(node)->result_ready);	\
-	(node)->result = (Node *) (slot);	\
-	(node)->result_ready = true; \
+static inline void ExecReturnTuple(PlanState *node, TupleTableSlot *slot);
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!(node)->result_ready);
+	(node)->result = (Node *) (slot);
+	(node)->result_ready = true;
 }
 
 /* Convenience function to retrieve a node's result. */
-#define ExecConsumeResult(node) \
-( \
-Assert((node)->result_ready), \
-Assert((node)->result == NULL || IsA((node)->result, TupleTableSlot)), \
-(node)->result_ready = false, \
-	(TupleTableSlot *) node->result)
+static inline TupleTableSlot *ExecConsumeResult(PlanState *node);
+static inline TupleTableSlot *
+ExecConsumeResult(PlanState *node)
+{
+	Assert(node->result_ready);
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+	node->result_ready = false;
+	return (TupleTableSlot *) node->result;
+}
 
 
 /*
-- 
1.8.3.1




Re: [HACKERS] asynchronous and vectorized execution

2016-07-11 Thread Kyotaro HORIGUCHI
I forgot to mention.

At Tue, 12 Jul 2016 11:04:17 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160712.110417.145469826.horiguchi.kyot...@lab.ntt.co.jp>
> Cooled down then measured performance again.
> 
> I show you the true result briefly for now.
> 
> At Mon, 11 Jul 2016 19:07:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20160711.190722.145849861.horiguchi.kyot...@lab.ntt.co.jp>
> > Anyway I need some time to cool down..
> 
> I recalled that I had put in a Makefile.custom containing
> CFLAGS="-O0". Removing that gave me a saner result.

Unlike the previous measurements, the remote side in
these measurements is an unpatched -O2 postgres, so the differences
come only from the local-side changes.

> patched -O2
> 
> table  10-average(ms)  stddev(ms)  runtime-diff from unpatched(%)
> t0            441.78      0.32        3.4
> pl            201.77      0.32       13.6
> pf0          6619.22     18.99      -19.7
> pf1          1800.72     32.72      -78.0
> ---
> unpatched -O2
> 
> t0            427.21      0.42
> pl            177.54      0.25
> pf0          8250.42     23.29
> pf1          8206.02     12.91
> 
> ==
> 
>   3% slower for local 1*seqscan (2-parallel)
>  14% slower for append-4*seqscan (no-parallel)
>  19% faster for append-4*foreignscan (all scans on one connection)
>  78% faster for append-4*foreignscan (scans have dedicated connections)
> 
> ExecProcNode might be able to be optimized a bit.
> ExecAppend seems to need some fix.
> 
> In addition to the above, I will try a reentrant ExecAsyncWaitForNode
> or something.

regards,
-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-07-11 Thread Kyotaro HORIGUCHI
Cooled down then measured performance again.

I show you the true result briefly for now.

At Mon, 11 Jul 2016 19:07:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160711.190722.145849861.horiguchi.kyot...@lab.ntt.co.jp>
> Anyway I need some time to cool down..

I recalled that I had put in a Makefile.custom containing
CFLAGS="-O0". Removing that gave me a saner result.


patched -O2

table  10-average(ms)  stddev(ms)  runtime-diff from unpatched(%)
t0            441.78      0.32        3.4
pl            201.77      0.32       13.6
pf0          6619.22     18.99      -19.7
pf1          1800.72     32.72      -78.0
---
unpatched -O2

t0            427.21      0.42
pl            177.54      0.25
pf0          8250.42     23.29
pf1          8206.02     12.91

==

  3% slower for local 1*seqscan (2-parallel)
 14% slower for append-4*seqscan (no-parallel)
 19% faster for append-4*foreignscan (all scans on one connection)
 78% faster for append-4*foreignscan (scans have dedicated connections)

ExecProcNode might be able to be optimized a bit.
ExecAppend seems to need some fix.

In addition to the above, I will try a reentrant ExecAsyncWaitForNode
or something.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-07-11 Thread Kyotaro HORIGUCHI
At Mon, 11 Jul 2016 17:10:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20160711.171011.133133724.horiguchi.kyot...@lab.ntt.co.jp>
> > Two things:
> > 
> > 1. That's not the scenario I'm talking about.  I'm concerned about
> > making sure that query plans that don't use asynchronous execution
> > don't get slower.
> 
> The first one (select on t0) is exactly that. It has no
> relation to the asynchronous stuff except the result_ready flag, a
> branch caused by it, and the call to ExecDispatchNode. The difference
> from the original is that ExecProcNode uses ExecDispatchNode. Even
> ExecAsyncWaitForNode is not called.
> 
> > 2. I have to believe that's a defect in your implementation rather
> > than something intrinsic, or maybe your test scenario is bad.  It's
> > very hard - really impossible -  to believe that all queries involving
> > FDW pushdown are locally CPU-bound.
> 
> Sorry for the hard-to-read results, but the problem is not in a query
> involving FDW; it is in a query on a local table (which runs a parallel
> seqscan).  The reason for the difference in the tests involving
> FDW should be the local scans on the remote side.
> 
> Just reverting ExecProcNode and other related parts doesn't change
> the situation. I am proceeding with the confirmation, reverting
> part by part.

Uggg. I saw no difference even after finally getting back to master.
What is stranger, a binary built from what should be the same
"master" but extracted via "git archive | tar" finishes the query
(select sum(a) from t0) in half the time of the master in my git
repository with -O2. In short, the patch doesn't seem to be the
cause of the difference.

I should investigate the difference between them, or begin again
with a clean environment..

Anyway I need some time to cool down..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-07-11 Thread Kyotaro HORIGUCHI
Hello,

At Thu, 7 Jul 2016 13:59:54 -0400, Robert Haas  wrote in 

> On Wed, Jul 6, 2016 at 3:29 AM, Kyotaro HORIGUCHI
>  wrote:
> > This seems to be a good opportunity to show this patch. The
> > attached patch set does async execution of foreignscan
> > (postgres_fdw) on Robert's first infrastructure, with some
> > modifications.
> 
> Cool.

Thank you.

> > ExecAsyncWaitForNode can get into infinite waiting through recursive
> > calls of ExecAsyncWaitForNode, caused by ExecProcNode being called from
> > async-unaware nodes. Such recursive calls cause a wait on
> > already-ready nodes.
> 
> Hmm, that's annoying.
> 
> > I solved that in the patch set by allocating a separate
> > async-execution context for every async-execution subtree, which
> > is created by ExecProcNode, instead of one async-exec context for
> > the whole execution tree. This works fine but the way of switching
> > contexts seems ugly.  This may also be solved by making
> > ExecAsyncWaitForNode return when there is no node to wait for, even
> > if the waiting node is not ready. This might keep the async-exec
> > context (state) simpler so I'll try this.
> 
> I think you should instead try to make ExecAsyncWaitForNode properly 
> reentrant.

I feel the same way. Will try to do that.

> > Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore
> > or something like that instead of latches?
> 
> Why would it do that?

I may have misunderstood the original sentence, but the reason for my
question there is that I didn't see the connection between "When
an executor node does something that might unblock other workers,
it calls ParallelWorkerSetLatchesForGroup()" and "and the async
stuff then tries calling all of the nodes in this array". I
supposed that the former takes place on each worker and the
latter on the leader. So I asked about the means of signalling
the leader to do the latter. I may well be wrong, because I feel
uneasy (or confused) about this statement..


> >> BTW, we also need to benchmark those changes to add the parent
> >> pointers and change the return conventions and see if they have any
> >> measurable impact on performance.
> >
> > I have to bring you bad news.
> >
> > With the attached patch, an append over four foreign scans on one
> > server (local) performs about 10% faster, and twice as fast for
> > three or four foreign scans on separate foreign servers
> > (connections), but only when compiled with -O0. I
> > found that it takes hopelessly little advantage of
> > compiler optimization, while the unpatched version gets faster.
> 
> Two things:
> 
> 1. That's not the scenario I'm talking about.  I'm concerned about
> making sure that query plans that don't use asynchronous execution
> don't get slower.

The first one (select on t0) is exactly that. It has no
relation to the asynchronous stuff except the result_ready flag, a
branch caused by it, and the call to ExecDispatchNode. The difference
from the original is that ExecProcNode uses ExecDispatchNode. Even
ExecAsyncWaitForNode is not called.

> 2. I have to believe that's a defect in your implementation rather
> than something intrinsic, or maybe your test scenario is bad.  It's
> very hard - really impossible -  to believe that all queries involving
> FDW pushdown are locally CPU-bound.

Sorry for the hard-to-read results, but the problem is not in a query
involving FDW; it is in a query on a local table (which runs a parallel
seqscan).  The reason for the difference in the tests involving
FDW should be the local scans on the remote side.

Just reverting ExecProcNode and other related parts doesn't change
the situation. I am proceeding with the confirmation, reverting
part by part.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center






Re: [HACKERS] asynchronous and vectorized execution

2016-07-07 Thread Robert Haas
On Wed, Jul 6, 2016 at 3:29 AM, Kyotaro HORIGUCHI
 wrote:
> This seems to be a good opportunity to show this patch. The
> attached patch set does async execution of foreignscan
> (postgres_fdw) on Robert's first infrastructure, with some
> modifications.

Cool.

> ExecAsyncWaitForNode can get into infinite waiting through recursive
> calls of ExecAsyncWaitForNode, caused by ExecProcNode being called from
> async-unaware nodes. Such recursive calls cause a wait on
> already-ready nodes.

Hmm, that's annoying.

> I solved that in the patch set by allocating a separate
> async-execution context for every async-execution subtree, which
> is created by ExecProcNode, instead of one async-exec context for
> the whole execution tree. This works fine but the way of switching
> contexts seems ugly.  This may also be solved by making
> ExecAsyncWaitForNode return when there is no node to wait for, even
> if the waiting node is not ready. This might keep the async-exec
> context (state) simpler so I'll try this.

I think you should instead try to make ExecAsyncWaitForNode properly reentrant.

> Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore
> or something like that instead of latches?

Why would it do that?

>> BTW, we also need to benchmark those changes to add the parent
>> pointers and change the return conventions and see if they have any
>> measurable impact on performance.
>
> I have to bring you bad news.
>
> With the attached patch, an append over four foreign scans on one
> server (local) performs about 10% faster, and twice as fast for
> three or four foreign scans on separate foreign servers
> (connections), but only when compiled with -O0. I
> found that it takes hopelessly little advantage of
> compiler optimization, while the unpatched version gets faster.

Two things:

1. That's not the scenario I'm talking about.  I'm concerned about
making sure that query plans that don't use asynchronous execution
don't get slower.

2. I have to believe that's a defect in your implementation rather
than something intrinsic, or maybe your test scenario is bad.  It's
very hard - really impossible -  to believe that all queries involving
FDW pushdown are locally CPU-bound.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-07-05 Thread Robert Haas
On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar  wrote:
> We may also want to consider handling abstract events such as
> "tuples-are-available-at-plan-node-X".
>
> One benefit is: we can combine this with batch processing. For e.g. in case
> of an Append node containing foreign scans, its parent node may not want to
> process the Append node result until Append is ready with at least 1000
> rows. In that case, Append node needs to raise an event
> "n-tuples-are-ready"; we cannot just rely on fd-ready events. fd-ready event
> will wake up the foreign scan, but it may not eventually cause its parent
> Append node to in turn wake up its parent.

Right, I agree.  I think this case only arises in parallel query.  In
serial execution, there's not really any way for a plan node to just
become ready other than an FD or latch event or the parent becoming
ready.  But in parallel query it can happen, of course, because some
other backend can do work that makes that node ready to produce
tuples.

It's not necessarily the case that we have to deal with this in the
initial patches for this feature, because the most obvious win for
this sort of thing is when we have an Append of ForeignScan plans.
Sure, parallel query has interesting cases, too, but a prototype that
just handles Append over a bunch of postgres_fdw ForeignScans would be
pretty cool.  I suggest that we make that the initial goal here.

> How we can do this event abstraction is the other question. We can have one
> latch for each event, and each node would raise its own event by
> setting the corresponding latch. But I am not sure about latches within a
> single process as against one process waking up another process. Or else,
> some internal event structures need to be present (in estate?), which then
> ExecProcNode would use when it does the event looping to wake up (i.e.
> execute) required nodes.

I think adding more latches would be a bad idea.  What I think we
should do instead is add two additional data structures to dynamic
shared memory:

1. An array of PGPROC * pointers for all of the workers.  Processes
add their PGPROC * to this array as they start up.  Then, parallel.h
can expose new API ParallelWorkerSetLatchesForGroup(void).  In the
leader, this sets the latch for every worker process for every
parallel context with which the leader is associated; in a worker, it
sets the latch for other processes in the parallel group, and the
leader also.

2. An array of executor nodes where one process might do something
that allows other processes to make progress on that node.  This would
be set up somehow by execParallel.c, which would need to somehow
figure out which plan nodes want to be included in the list.  When an
executor node does something that might unblock other workers, it
calls ParallelWorkerSetLatchesForGroup() and the async stuff then
tries calling all of the nodes in this array again to see if any of
them now think that they've got tuples to return (or just to let them
do additional work without returning tuples).
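
In sketch form (nothing below exists; the names are placeholders, and
the real ParallelWorkerSetLatchesForGroup(void) would presumably find
the group from global state rather than take an argument):

/* Hypothetical: array of group members' PGPROCs in dynamic shared memory. */
typedef struct ParallelGroupProcs
{
	int			nprocs;			/* workers plus the leader */
	PGPROC	   *procs[FLEXIBLE_ARRAY_MEMBER];
} ParallelGroupProcs;

/*
 * Hypothetical: wake the whole group so each process re-polls the
 * registered executor nodes for new work.
 */
void
ParallelWorkerSetLatchesForGroup(ParallelGroupProcs *group)
{
	int			i;

	for (i = 0; i < group->nprocs; i++)
		SetLatch(&group->procs[i]->procLatch);
}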

> Also, the WaitEvent.user_data field can have some more info besides the plan
> state. It can have the parent PlanState stored, so that we don't have to
> have a parent field in the plan state. It can also have some more data such
> as "n-tuples-available".

I don't think that works, because execution may need to flow
arbitrarily far up the tree.  Just knowing the immediate parent isn't
good enough.  If it generates a tuple, then you have to in turn call
its parent, and if that one then produces a tuple, you have to continue
on even further up the tree.  I think it's going to be very awkward to
make this work without those parent pointers.
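
In rough sketch form, using the parent pointers and the prototype's
ExecDispatchNode/result_ready (the function itself is invented):

static void
ExecAsyncBubbleUp(PlanState *node)
{
	/* Propagate readiness upward as far as it will go. */
	while (node->result_ready && node->parent != NULL)
	{
		node = node->parent;
		ExecDispatchNode(node);	/* may or may not make this node ready */
	}
}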

BTW, we also need to benchmark those changes to add the parent
pointers and change the return conventions and see if they have any
measurable impact on performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] asynchronous and vectorized execution

2016-06-29 Thread Amit Khandekar
We may also want to consider handling abstract events such as
"tuples-are-available-at-plan-node-X".

One benefit is: we can combine this with batch processing. For e.g. in
case of an Append node containing foreign scans, its parent node may not
want to process the Append node result until Append is ready with at least
1000 rows. In that case, Append node needs to raise an event
"n-tuples-are-ready"; we cannot just rely on fd-ready events. fd-ready
event will wake up the foreign scan, but it may not eventually cause its
parent Append node to in turn wake up its parent.

The other benefit (which I am not sure how significant it is) is this part
of the event: "at-plan-node-X". For e.g., for an Append node having 10
foreign scans, when a foreign scan wakes up and becomes ready with
tuple(s), its parent node (i.e. Append) will be executed. But it would
speed up things if it knows which of its foreign scan nodes are ready.
From Robert's prototype, I can see that it can find that out by checking
the result_ready field of each foreign scan plan state. But if it knows
from the event that node-X is the one that is ready, it can directly
take tuples from there. Another thing is, we may want to give the Append
node a chance to know all those nodes that are ready, instead of just one
node.
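
For instance, something like this inside Append (a sketch only:
as_nplans and appendplans are the real AppendState fields, while
result_ready and ExecConsumeResult are from Robert's prototype; the
helper itself is invented):

static TupleTableSlot *
ExecAppendPickReady(AppendState *appendstate)
{
	int			i;

	for (i = 0; i < appendstate->as_nplans; i++)
	{
		PlanState  *child = appendstate->appendplans[i];

		if (child->result_ready)	/* flag from the prototype */
			return ExecConsumeResult(child);
	}
	return NULL;				/* no child ready yet */
}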

How we can do this event abstraction is the other question. We can have one
latch for each event, and each node would raise its own event by
setting the corresponding latch. But I am not sure about latches within a
single process as against one process waking up another process. Or else,
some internal event structures need to be present (in estate?), which
ExecProcNode would then use when it does the event looping to wake up (i.e.
execute) required nodes.

Also, the WaitEvent.user_data field can have some more info besides the
plan state. It can have the parent PlanState stored, so that we don't have
to have a parent field in the plan state. It can also have some more data
such as "n-tuples-available".


On 9 May 2016 at 23:03, Robert Haas  wrote:

> Hi,
>
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
>
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Wed, May 11, 2016 at 12:30 PM, Andres Freund  wrote:
> On 2016-05-11 12:27:55 -0400, Robert Haas wrote:
>> On Wed, May 11, 2016 at 11:49 AM, Andres Freund  wrote:
>> > On 2016-05-11 10:12:26 -0400, Robert Haas wrote:
>> >> > Hm. Do we really have to keep the page locked in the page-at-a-time
>> >> > mode? Shouldn't the pin suffice?
>> >>
>> >> I think we need a lock to examine MVCC visibility information.  A pin
>> >> is enough to prevent a tuple from being removed, but not from having
>> >> its xmax and cmax overwritten at almost but not quite exactly the same
>> >> time.
>> >
>> > We already batch visibility lookups in page-at-a-time
>> > mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
>> > after releasing the lock, but before the pin is released, without that
>> > much effort.  IIRC that isn't used for index lookups, but that's
>> > probably a good idea.
>>
>> The trouble with that is that if you fail the qual, you have to relock
>> the page.  Which kinda sucks, if the qual is really simple.
>
> Hm? I'm missing something here? We currently do the visibility checks in
> bulk for the whole page. After that we release the page lock. What
> prevents us from executing the quals directly after that? And why would
> you need to relock the page?

Oh, yeah, in page-at-a-time mode we can release the lock first.  I was
thinking about what to do in tuple-at-a-time mode (i.e. when the page is
not all-visible).
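
To make this concrete, here is a much-simplified sketch of the pattern
under discussion; the first half is roughly what heapgetpage() already
does, and the qual step after the unlock is the hypothetical addition
(setup of loctup and friends elided):

    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    for (lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
    {
        /* ... point loctup at the tuple for this line pointer ... */
        if (HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer))
            scan->rs_vistuples[ntup++] = lineoff;  /* remember visible tuples */
    }
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /*
     * The buffer is still pinned here, so the visible tuples cannot be
     * pruned away; cheap quals could be evaluated at this point without
     * re-taking the content lock.
     */
    for (i = 0; i < ntup; i++)
    {
        /* evaluate simple quals against the tuple at rs_vistuples[i] */
    }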

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Andres Freund
On 2016-05-11 12:27:55 -0400, Robert Haas wrote:
> On Wed, May 11, 2016 at 11:49 AM, Andres Freund  wrote:
> > On 2016-05-11 10:12:26 -0400, Robert Haas wrote:
> >> > Hm. Do we really have to keep the page locked in the page-at-a-time
> >> > mode? Shouldn't the pin suffice?
> >>
> >> I think we need a lock to examine MVCC visibility information.  A pin
> >> is enough to prevent a tuple from being removed, but not from having
> >> its xmax and cmax overwritten at almost but not quite exactly the same
> >> time.
> >
> > We already batch visibility lookups in page-at-a-time
> > mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
> > after releasing the lock, but before the pin is released, without that
> > much effort.  IIRC that isn't used for index lookups, but that's
> > probably a good idea.
> 
> The trouble with that is that if you fail the qual, you have to relock
> the page.  Which kinda sucks, if the qual is really simple.

Hm? I'm missing something here? We currently do the visibility checks in
bulk for the whole page. After that we release the page lock. What
prevents us from executing the quals directly after that? And why would
you need to relock the page?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Wed, May 11, 2016 at 11:49 AM, Andres Freund  wrote:
> On 2016-05-11 10:12:26 -0400, Robert Haas wrote:
>> > I have to admit I'm not that convinced about the speedups in the !fdw
>> > case. There seem to be a lot of easier avenues for performance
>> > improvements.
>>
>> What I'm talking about is a query like this:
>>
>> SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;
>
> Note that I said "!fdw case".

Oh, wow, I totally missed that exclamation point.

>> > FWIW, I've even hacked something up for a bunch of simple queries, and
>> > the performance improvements were significant.  Besides it only being a
>> > weekend hack project, the big thing I got stuck on was considering how
>> > to exactly determine when to batch and not to batch.
>>
>> Yeah.  I think we need a system for signalling nodes as to when they
>> will be run to completion.  But a Boolean is somehow unsatisfying;
>> LIMIT 100 is more like no LIMIT than it is like LIMIT 1.  I'm
>> tempted to add a numTuples field to every ExecutorState and give upper
>> nodes some way to set it, as a hint.
>
> I was wondering whether we should hand down TupleVectorStates to lower
> nodes, and their size determines the max batch size...

There's some appeal to that, but it seems complicated to make work.

>> >> Some care is required here because any
>> >> functions we execute as scan keys are run with the buffer locked, so
>> >> we had better not run anything very complicated.  But doing this for
>> >> simple things like integer equality operators seems like it could save
>> >> quite a few buffer lock/unlock cycles and some other executor overhead
>> >> as well.
>> >
>> > Hm. Do we really have to keep the page locked in the page-at-a-time
>> > mode? Shouldn't the pin suffice?
>>
>> I think we need a lock to examine MVCC visibility information.  A pin
>> is enough to prevent a tuple from being removed, but not from having
>> its xmax and cmax overwritten at almost but not quite exactly the same
>> time.
>
> We already batch visibility lookups in page-at-a-time
> mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
> after releasing the lock, but before the pin is released, without that
> much effort.  IIRC that isn't used for index lookups, but that's
> probably a good idea.

The trouble with that is that if you fail the qual, you have to relock
the page.  Which kinda sucks, if the qual is really simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Andres Freund
On 2016-05-11 10:32:20 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 8:50 PM, Andres Freund  wrote:
> > That seems to suggest that we need to restructure how we get to calling
> > fmgr functions, before worrying about the actual fmgr call.
> 
> Any ideas on how to do that?  ExecMakeFunctionResultNoSets() isn't
> really doing a heck of a lot.  Changing FuncExprState to use an array
> rather than a linked list to store its arguments might help some.   We
> could also consider having an optimized path that skips the fn_strict
> stuff if we can somehow deduce that no NULLs can occur in this
> context, but that's a lot of work and new infrastructure.  I feel like
> maybe there's something higher-level we could do that would help more,
> but I don't know what it is.

I think it's not just ExecMakeFunctionResultNoSets, it's the whole
call-stack which needs to be optimized together.

E.g. look at a few performance metrics for a simple seqscan query with a
bunch of ORed equality constraints:
SELECT count(*) FROM pgbench_accounts WHERE abalance = -1 OR abalance = -2 OR 
abalance = -3 OR abalance = -4 OR abalance = -5 OR abalance = -6 OR abalance = 
-7 OR abalance = -8 OR abalance = -9 OR abalance = -10;

perf record -g -p 27286 -F 5000 -e 
cycles:ppp,branch-misses,L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses,LLC-load-misses
 sleep 3
6K cycles:ppp
6K branch-misses
1K L1-icache-load-misses
472 iTLB-load-misses
5K L1-dcache-load-misses
6K dTLB-load-misses
6K LLC-load-misses

You can see that a number of events sample at a high rate, especially
when you take the cycle samples into account.

cycles:
+   32.35%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
+   14.51%  postgres  postgres   [.] slot_getattr
+    5.50%  postgres  postgres   [.] ExecEvalOr
+    5.22%  postgres  postgres   [.] check_stack_depth

branch-misses:
+   73.77%  postgres  postgres   [.] ExecQual
+   17.83%  postgres  postgres   [.] ExecEvalOr
+    1.49%  postgres  postgres   [.] heap_getnext

L1-icache-load-misses:
+    4.71%  postgres  [kernel.kallsyms]  [k] update_curr
+    4.37%  postgres  postgres   [.] hash_search_with_hash_value
+    3.91%  postgres  postgres   [.] heap_getnext
+    3.81%  postgres  [kernel.kallsyms]  [k] task_tick_fair

iTLB-load-misses:
+   27.57%  postgres  postgres   [.] LWLockAcquire
+   18.32%  postgres  postgres   [.] hash_search_with_hash_value
+    7.09%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
+    3.06%  postgres  postgres   [.] ExecEvalConst

L1-dcache-load-misses:
+   20.35%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
+   12.31%  postgres  postgres   [.] check_stack_depth
+    8.84%  postgres  postgres   [.] heap_getnext
+    8.00%  postgres  postgres   [.] slot_deform_tuple
+    7.15%  postgres  postgres   [.] HeapTupleSatisfiesMVCC

dTLB-load-misses:
+   50.13%  postgres  postgres   [.] ExecQual
+   41.36%  postgres  postgres   [.] ExecEvalOr
+    2.96%  postgres  postgres   [.] hash_search_with_hash_value
+    1.30%  postgres  postgres   [.] PinBuffer.isra.3
+    1.19%  postgres  postgres   [.] heap_page_prune_opt

LLC-load-misses:
+   24.25%  postgres  postgres   [.] slot_deform_tuple
+   17.45%  postgres  postgres   [.] CheckForSerializableConflictOut
+   10.52%  postgres  postgres   [.] heapgetpage
+    9.55%  postgres  postgres   [.] HeapTupleSatisfiesMVCC
+    7.52%  postgres  postgres   [.] ExecMakeFunctionResultNoSets


For this workload, we expect a lot of LLC-load-misses as the workload is
a lot bigger than memory, and it makes sense that they're in
slot_deform_tuple(), heapgetpage(), HeapTupleSatisfiesMVCC() (but uh,
CheckForSerializableConflictOut?).  One avenue to optimize is to make
those accesses easier to predict/prefetch, which they're atm likely not.

But leaving that aside, we can see that a lot of the cost is distributed
over ExecQual, ExecEvalOr, ExecMakeFunctionResultNoSets - all of which
liberally use linked lists.  I suspect that by simplifying these
functions / datastructures *AND* by calling them over a batch of tuples,
instead of one-by-one, we'd limit the time spent in them considerably.
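
As a toy illustration of what batching could buy here (purely
illustrative, self-contained C; it assumes the abalance column has
already been deformed into flat arrays, which is not how the executor
stores tuples today):

    #include <stdbool.h>
    #include <stdint.h>

    /* Batched qual for the ORed-equality query above: the ten OR arms
     * collapse to one range test, and the loop runs with no per-tuple
     * function dispatch. */
    static int
    qual_abalance_batch(const int32_t *abalance, const bool *isnull,
                        int n, bool *qualpassed)
    {
        int nmatch = 0;

        for (int i = 0; i < n; i++)
        {
            bool match = !isnull[i] &&
                abalance[i] >= -10 && abalance[i] <= -1;

            qualpassed[i] = match;
            nmatch += match;
        }
        return nmatch;
    }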

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Andres Freund
On 2016-05-11 10:12:26 -0400, Robert Haas wrote:
> > I have to admit I'm not that convinced about the speedups in the !fdw
> > case. There seem to be a lot of easier avenues for performance
> > improvements.
> 
> What I'm talking about is a query like this:
> 
> SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

Note that I said "!fdw case".


> > FWIW, I've even hacked something up for a bunch of simple queries, and
> > the performance improvements were significant.  Besides it only being a
> > weekend hack project, the big thing I got stuck on was considering how
> > to exactly determine when to batch and not to batch.
> 
> Yeah.  I think we need a system for signalling nodes as to when they
> will be run to completion.  But a Boolean is somehow unsatisfying;
> LIMIT 100 is more like no LIMIT than it is like LIMIT 1.  I'm
> tempted to add a numTuples field to every ExecutorState and give upper
> nodes some way to set it, as a hint.

I was wondering whether we should hand down TupleVectorStates to lower
nodes, and their size determines the max batch size...

> >> Some care is required here because any
> >> functions we execute as scan keys are run with the buffer locked, so
> >> we had better not run anything very complicated.  But doing this for
> >> simple things like integer equality operators seems like it could save
> >> quite a few buffer lock/unlock cycles and some other executor overhead
> >> as well.
> >
> > Hm. Do we really have to keep the page locked in the page-at-a-time
> > mode? Shouldn't the pin suffice?
> 
> I think we need a lock to examine MVCC visibility information.  A pin
> is enough to prevent a tuple from being removed, but not from having
> its xmax and cmax overwritten at almost but not quite exactly the same
> time.

We already batch visibility lookups in page-at-a-time
mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
after releasing the lock, but before the pin is released, without that
much effort.  IIRC that isn't used for index lookups, but that's
probably a good idea.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Tue, May 10, 2016 at 8:50 PM, Andres Freund  wrote:
> That seems to suggest that we need to restructure how we get to calling
> fmgr functions, before worrying about the actual fmgr call.

Any ideas on how to do that?  ExecMakeFunctionResultNoSets() isn't
really doing a heck of a lot.  Changing FuncExprState to use an array
rather than a linked list to store its arguments might help some.   We
could also consider having an optimized path that skips the fn_strict
stuff if we can somehow deduce that no NULLs can occur in this
context, but that's a lot of work and new infrastructure.  I feel like
maybe there's something higher-level we could do that would help more,
but I don't know what it is.
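
For concreteness, the array variant might look something like this
(hypothetical sketch only, not a patch; the struct and its fields are
invented):

    typedef struct FuncExprStateArr     /* hypothetical */
    {
        ExprState    xprstate;
        int          nargs;             /* argument count, fixed at init */
        ExprState  **args;              /* flat array instead of a List */
        /* ... function lookup info as in the existing FuncExprState ... */
    } FuncExprStateArr;

    /* argument evaluation becomes a counted, cache-friendly loop */
    for (int i = 0; i < fcache->nargs; i++)
        fcinfo->arg[i] = ExecEvalExpr(fcache->args[i], econtext,
                                      &fcinfo->argnull[i], NULL);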

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Konstantin Knizhnik



On 10.05.2016 20:26, Robert Haas wrote:

At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x speed increase on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generation of native
code for aggregates, because ExecQual itself takes only 6% of the time for
this query).
Actually, these 25% for Q1 were achieved not by using dynamic code
generation, but by switching from a PULL to a PUSH model in the executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the result of their work as
open source...
Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.

Latest information from the ISP RAS guys: they have made good progress
since February. They have rewritten most of the methods of Scan, Aggregate
and Join to the LLVM API, and also implemented automatic translation of
PostgreSQL backend functions to the LLVM API.

As a result, the time of the TPC-H Q1 query is reduced four times.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Wed, May 11, 2016 at 10:17 AM, Konstantin Knizhnik
 wrote:
> Yes, I agree with you that a complete rewrite of the optimizer is a huge
> project with unpredictable influence on the performance of some queries.
> Changing things incrementally is a good approach, but only if we are
> moving in the right direction.
> I am still not sure that the introduction of async operations is a step in
> the right direction. Async ops tend to significantly complicate code (since
> you have to maintain state yourself). It will be bad if the implementation
> of each node has to deal with async state itself in its own manner.

I don't really think so.  The design I've proposed makes adding
asynchronous capability to a node pretty easy, with only minor
changes.

> My suggestion is to try to provide some generic mechanism for managing
> state transitions and to have some scheduler which controls this process.
> It should not be the responsibility of the node implementation to organize
> asynchronous/parallel execution. Instead, it should just produce a set of
> jobs whose execution is controlled by the scheduler. A first
> implementation of the scheduler can be quite simple. But later it can
> become more clever: try to bind data to processors and do many other
> optimizations.

Whereas this would require a massive rewrite.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Tue, May 10, 2016 at 8:23 PM, Andres Freund  wrote:
>> c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
>> process a batch TupleTableSlot. This will require some tight loop to
>> aggregate the entire TupleTableSlot at once before returning.
>> d. Add function in execAmi.c which returns true or false depending on
>> if the node supports batch TupleTableSlots or not.
>> e. At executor startup determine if the entire plan tree supports
>> batch TupleTableSlots, if so enable batch scan mode.
>
> It doesn't really need to be the entire tree. Even if you have a subtree
> (say a parametrized index nested loop join) which doesn't support batch
> mode, you'll likely still see performance benefits by building a batch
> one layer above the non-batch-supporting node.

+1.

I've also wondered about building a new executor node that is sort of
a combination of Nested Loop and Hash Join, but capable of performing
multiple joins in a single operation. (Merge Join is different,
because it's actually matching up the two sides, not just doing
probing once per outer tuple.) So the plan tree would look something
like this:

Multiway Join
  -> Seq Scan on driving_table
  -> Index Scan on something
  -> Index Scan on something_else
  -> Hash
        -> Seq Scan on other_thing
  -> Hash
        -> Seq Scan on other_thing_2
  -> Index Scan on another_one

With the current structure, every level of the plan tree has its own
TupleTableSlot and we have to project into each new slot.  Every level
has to go through ExecProcNode.  So it seems to me that this sort of
structure might save quite a few cycles on deep join nests.  I haven't
tried it, though.

With batching, things get even better for this sort of thing.
Assuming the joins are all basically semi-joins, either because they
were written that way or because they are probing unique indexes or
whatever, you can fetch a batch of tuples from the driving table, do
the first join for each tuple to create a matching batch of tuples,
and repeat for each join step.  Then at the end you project.
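
In rough pseudocode (every name below is invented), the batched version
would be:

    TupleBatch *batch;

    while ((batch = FetchBatch(driving_scan, 100)) != NULL)
    {
        /* semi-join-style probes: each row yields at most one match */
        for (int step = 0; step < njoins && !BatchIsEmpty(batch); step++)
            batch = ProbeStep(join[step], batch);

        if (!BatchIsEmpty(batch))
            ProjectBatch(batch);        /* project once, at the very end */
    }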

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Konstantin Knizhnik



On 11.05.2016 17:00, Robert Haas wrote:

On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik
 wrote:

Doesn't this actually mean that we need to have a normal job scheduler which
is given a queue of jobs and, having some pool of threads, will be able to
organize efficient execution of queries? The optimizer can build a pipeline
(graph) of tasks which corresponds to the execution plan nodes, i.e. SeqScan,
Sort, ... Each task is split into several jobs which can be concurrently
scheduled by a task dispatcher.  So you will not have a blocked worker
waiting for something, and all system resources will be utilized. Such an
approach with a dispatcher allows implementing quotas, priorities, ... Also
the dispatcher can care about NUMA and cache optimizations, which is
especially critical on modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and
executor to use some new kind of plan.  That's not a project I'm
willing to entertain; it is hard to imagine we could do it in a
reasonable period of time without introducing bugs and performance
regressions.  I think there is a great deal of performance benefit
that we can get by changing things incrementally.

Yes, I agree with you that a complete rewrite of the optimizer is a huge
project with unpredictable influence on the performance of some queries.
Changing things incrementally is a good approach, but only if we are
moving in the right direction.
I am still not sure that the introduction of async operations is a step in
the right direction. Async ops tend to significantly complicate code (since
you have to maintain state yourself). It will be bad if the implementation
of each node has to deal with async state itself in its own manner.


My suggestion is to try to provide some generic mechanism for managing
state transitions and to have some scheduler which controls this process.
It should not be the responsibility of the node implementation to organize
asynchronous/parallel execution. Instead, it should just produce a set of
jobs whose execution is controlled by the scheduler. A first implementation
of the scheduler can be quite simple. But later it can become more clever:
try to bind data to processors and do many other optimizations.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Tue, May 10, 2016 at 7:57 PM, Andres Freund  wrote:
>> 1. asynchronous execution, by which I mean the ability of a node to
>> somehow say that it will generate a tuple eventually, but is not yet
>> ready, so that the executor can go run some other part of the plan
>> tree while it waits.  [...].  It is also a problem
>> for parallel query: in a parallel sequential scan, the next worker can
>> begin reading the next block even if the current block hasn't yet been
>> received from the OS.  Whether or not this will be efficient is a
>> research question, but it can be done.  However, imagine a parallel
>> scan of a btree index: we don't know what page to scan next until we
>> read the previous page and examine the next-pointer.  In the meantime,
>> any worker that arrives at that scan node has no choice but to block.
>> It would be better if the scan node could instead say "hey, thanks for
>> coming but I'm really not ready to be on-CPU just at the moment" and
>> potentially allow the worker to go work in some other part of the
>> query tree.  For that worker to actually find useful work to do
>> elsewhere, we'll probably need it to be the case either that the table
>> is partitioned or the original query will need to involve UNION ALL,
>> but those are not silly cases to worry about, particularly if we get
>> native partitioning in 9.7.
>
> I have to admit I'm not that convinced about the speedups in the !fdw
> case. There seem to be a lot of easier avenues for performance
> improvements.

What I'm talking about is a query like this:

SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

What we do today is run the remote query on the first child table to
completion, then start it on the second child table, and so on.
Sending all the queries at once can bring a speed-up of a factor of N
to a query with N children, and it's completely independent of every
other speed-up that we might attempt.  This has been under discussion
for years on FDW-related threads as a huge problem that we need to fix
someday, and I really don't see how it's sane not to try.  The shape
of what that looks like is of course arguable, but saying the
optimization isn't valuable blows my mind.
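
For reference, the "send them all at once" pattern is already
expressible with libpq's asynchronous API; a schematic sketch (conns[],
childsql[] and the wait helper are assumptions, and error handling is
elided):

    bool done[MAX_CHILDREN] = {false};
    int  remaining = nchildren;

    for (int i = 0; i < nchildren; i++)
        PQsendQuery(conns[i], childsql[i]);   /* fire off every remote scan */

    while (remaining > 0)
    {
        /* e.g. select()/poll() on the PQsocket() of each pending conn */
        wait_for_readable_sockets(conns, done, nchildren);

        for (int i = 0; i < nchildren; i++)
        {
            if (done[i] || !PQconsumeInput(conns[i]) || PQisBusy(conns[i]))
                continue;                     /* no complete result yet */

            PGresult *res;

            while ((res = PQgetResult(conns[i])) != NULL)
                PQclear(res);                 /* a real FDW would emit rows */
            done[i] = true;
            remaining--;
        }
    }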

Whether you care about this case or not, this is also important for
parallel query.

> FWIW, I've even hacked something up for a bunch of simple queries, and
> the performance improvements were significant.  Besides it only being a
> weekend hack project, the big thing I got stuck on was considering how
> to exactly determine when to batch and not to batch.

Yeah.  I think we need a system for signalling nodes as to when they
will be run to completion.  But a Boolean is somehow unsatisfying;
LIMIT 100 is more like no LIMIT than it is like LIMIT 1.  I'm
tempted to add a numTuples field to every ExecutorState and give upper
nodes some way to set it, as a hint.
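
A minimal sketch of that hint, with every name assumed (no such field
exists today):

    /* hypothetical addition to EState (a.k.a. the ExecutorState) */
    uint64  es_numTuplesHint;   /* 0 means unknown / run to completion */

    /* an upper node such as Limit could set it before pulling tuples... */
    estate->es_numTuplesHint = limit_count;

    /* ...and a lower node could size its batches accordingly */
    batch_size = estate->es_numTuplesHint > 0
        ? Min(estate->es_numTuplesHint, MAX_BATCH_SIZE)
        : MAX_BATCH_SIZE;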

>> For asynchronous execution, I have gone so far as to mock up a bit of
>> what this might look like.  This shouldn't be taken very seriously at
>> this point, but I'm attaching a few very-much-WIP patches to show the
>> direction of my line of thinking.  Basically, I propose to have
>> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
>> by putting them into a new PlanState member called "result", which is
>> just a Node * so that we can support multiple types of results,
>> instead of returning them.
>
> What different types of results are you envisioning?

TupleTableSlots and TupleTableVectors, mostly.  I think the stuff that
is currently going through MultiExecProcNode() could probably be
folded in as just another type of result.

>> Some care is required here because any
>> functions we execute as scan keys are run with the buffer locked, so
>> we had better not run anything very complicated.  But doing this for
>> simple things like integer equality operators seems like it could save
>> quite a few buffer lock/unlock cycles and some other executor overhead
>> as well.
>
> Hm. Do we really have to keep the page locked in the page-at-a-time
> mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information.  A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik
 wrote:
> Doesn't this actually mean that we need to have a normal job scheduler
> which is given a queue of jobs and, having some pool of threads, will be
> able to organize efficient execution of queries? The optimizer can build a
> pipeline (graph) of tasks which corresponds to the execution plan nodes,
> i.e. SeqScan, Sort, ... Each task is split into several jobs which can be
> concurrently scheduled by a task dispatcher.  So you will not have a
> blocked worker waiting for something, and all system resources will be
> utilized. Such an approach with a dispatcher allows implementing quotas,
> priorities, ... Also the dispatcher can care about NUMA and cache
> optimizations, which is especially critical on modern architectures. One
> more reference: http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and
executor to use some new kind of plan.  That's not a project I'm
willing to entertain; it is hard to imagine we could do it in a
reasonable period of time without introducing bugs and performance
regressions.  I think there is a great deal of performance benefit
that we can get by changing things incrementally.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Robert Haas
On Tue, May 10, 2016 at 4:57 PM, Jim Nasby  wrote:
> Even so, I would think that the simplification in the executor would be
> worth it. If you need to add a new node there are dozens of places where
> you might have to mess with these giant case statements.

Dozens? I think the number is in the single digits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Ants Aasma
On Wed, May 11, 2016 at 3:52 AM, Andres Freund  wrote:
> On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:
>> On Tue, May 10, 2016 at 7:56 PM, Robert Haas  wrote:
>> > On Mon, May 9, 2016 at 8:34 PM, David Rowley
>> >  wrote:
>> > I don't have any at the moment, but I'm not keen on hundreds of new
>> > vector functions that can all have bugs or behavior differences versus
>> > the unvectorized versions of the same code.  That's a substantial tax
>> > on future development.  I think it's important to understand what
>> > sorts of queries we are targeting here.  KaiGai's GPU-acceleration
>> > stuff does great on queries with complex WHERE clauses, but most
>> > people don't care not only because it's out-of-core but because who
>> > actually looks for the records where (a + b) % c > (d + e) * f / g?
>> > This seems like it has the same issue.  If we can speed up common
>> > queries people are actually likely to run, OK, that's interesting.
>>
>> I have seen pretty complex expressions in the projection and
>> aggregation. A couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
>> END) type expressions. In critical places I had to replace them with
>> a C-coded function that processed a row at a time to avoid the
>> executor dispatch overhead.
>
> I've seen that as well, but was it the actual fmgr indirection causing
> the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

I don't remember what the exact profile looked like, but IIRC it was
mostly Exec* stuff with advance_aggregates also up there.

Regards,
Ants Aasma


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:
> On Tue, May 10, 2016 at 7:56 PM, Robert Haas  wrote:
> > On Mon, May 9, 2016 at 8:34 PM, David Rowley
> >  wrote:
> > I don't have any at the moment, but I'm not keen on hundreds of new
> > vector functions that can all have bugs or behavior differences versus
> > the unvectorized versions of the same code.  That's a substantial tax
> > on future development.  I think it's important to understand what
> > sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> > stuff does great on queries with complex WHERE clauses, but most
> > people don't care not only because it's out-of-core but because who
> > actually looks for the records where (a + b) % c > (d + e) * f / g?
> > This seems like it has the same issue.  If we can speed up common
> > queries people are actually likely to run, OK, that's interesting.
> 
> I have seen pretty complex expressions in the projection and
> aggregation. A couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
> END) type expressions. In critical places I had to replace them with
> a C-coded function that processed a row at a time to avoid the
> executor dispatch overhead.

I've seen that as well, but was it the actual fmgr indirection causing
the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-10 12:56:17 -0400, Robert Haas wrote:
> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

I think fmgr overhead is an issue, but in most profiles of execution
heavy loads I've seen the bottlenecks are elsewhere. They often seem to
roughly look like
+   15.47%  postgres  postgres   [.] slot_deform_tuple
+   12.99%  postgres  postgres   [.] slot_getattr
+   10.36%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
+    9.76%  postgres  postgres   [.] heap_getnext
+    6.34%  postgres  postgres   [.] HeapTupleSatisfiesMVCC
+    5.09%  postgres  postgres   [.] heapgetpage
+    4.59%  postgres  postgres   [.] hash_search_with_hash_value
+    4.36%  postgres  postgres   [.] ExecQual
+    3.30%  postgres  postgres   [.] ExecStoreTuple
+    3.29%  postgres  postgres   [.] ExecScan
or

-   33.67%  postgres  postgres   [.] ExecMakeFunctionResultNoSets
   - ExecMakeFunctionResultNoSets
      + 99.11% ExecEvalOr
      +  0.89% ExecQual
+   14.32%  postgres  postgres   [.] slot_getattr
+    5.66%  postgres  postgres   [.] ExecEvalOr
+    5.06%  postgres  postgres   [.] check_stack_depth
+    5.02%  postgres  postgres   [.] slot_deform_tuple
+    4.05%  postgres  postgres   [.] pgstat_end_function_usage
+    3.69%  postgres  postgres   [.] heap_getnext
+    3.41%  postgres  postgres   [.] ExecEvalScalarVarFast
+    3.36%  postgres  postgres   [.] ExecEvalConst


with a healthy dose of _bt_compare, heap_hot_search_buffer in more index
heavy workloads.

(yes, I just pulled these example profiles from somewhere, but I've more
often seen them look like this, than very fmgr heavy).


That seems to suggest that we need to restructure how we get to calling
fmgr functions, before worrying about the actual fmgr call.


Tomas, Mark, IIRC you'd both generated perf profiles for TPC-H (IIRC?)
queries at some point. Any chance the results are online somewhere?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
On 2016-05-10 12:34:19 +1200, David Rowley wrote:
> a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
> b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
> indicate if the struct contains a single or a multiple tuples.
> Multiple tuples may need to be deformed in a non-lazy fashion in order
> to prevent too many buffers from having to be pinned at once. Tuples
> will be deformed into arrays of each column rather than arrays for
> each tuple (this part is important to support the next sub-project)

FWIW, I don't think that's necessarily required, and it has the
potential to slow down some operations (like target list
processing/projections) considerably. By the time vectorized execution for
postgres is ready, gather instructions should be common and fast enough
(IIRC they started to be OK with Broadwell, and are better in Skylake;
other archs have had them for longer).


> c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
> process a batch TupleTableSlot. This will require some tight loop to
> aggregate the entire TupleTableSlot at once before returning.
> d. Add function in execAmi.c which returns true or false depending on
> if the node supports batch TupleTableSlots or not.
> e. At executor startup determine if the entire plan tree supports
> batch TupleTableSlots, if so enable batch scan mode.

It doesn't really need to be the entire tree. Even if you have a subtree
(say a parametrized index nested loop join) which doesn't support batch
mode, you'll likely still see performance benefits by building a batch
one layer above the non-batch-supporting node.
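
As a sketch of the column-array layout from point (b) of the quoted
list (struct and field names invented):

    typedef struct TupleBatch
    {
        int      ntuples;   /* number of tuples currently in the batch */
        int      natts;     /* number of columns */
        Datum  **values;    /* values[att][tup]: one array per column */
        bool   **isnull;    /* isnull[att][tup] */
    } TupleBatch;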


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Ants Aasma
On Tue, May 10, 2016 at 7:56 PM, Robert Haas  wrote:
> On Mon, May 9, 2016 at 8:34 PM, David Rowley
>  wrote:
> I don't have any at the moment, but I'm not keen on hundreds of new
> vector functions that can all have bugs or behavior differences versus
> the unvectorized versions of the same code.  That's a substantial tax
> on future development.  I think it's important to understand what
> sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> stuff does great on queries with complex WHERE clauses, but most
> people don't care not only because it's out-of-core but because who
> actually looks for the records where (a + b) % c > (d + e) * f / g?
> This seems like it has the same issue.  If we can speed up common
> queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in the projection and
aggregation. A couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type expressions. In critical places I had to replace them with
a C-coded function that processed a row at a time to avoid the
executor dispatch overhead.

> By the way, I think KaiGai's GPU-acceleration stuff points to another
> pitfall here.  There's other stuff somebody might legitimately want to
> do that requires another copy of each function. For example, run-time
> code generation likely needs that (a function to tell the code
> generator what to generate for each of our functions), and
> GPU-acceleration probably does, too.  If fixing a bug in numeric_lt
> requires changing not only the regular version and the vectorized
> version but also the GPU-accelerated version and the codegen version,
> Tom and Dean are going to kill us.  And justifiably so!  Granted,
> nobody is proposing those other features in core right now, but
> they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do
the heavy lifting. LLVM/clang could compile existing C functions to IR
and bundle those with the DB. At query planning time or maybe even
during execution the functions can be inlined into the compiled query
plan, LLVM can then be coaxed to copy propagate, constant fold and
dead code eliminate the bejeezus out of the expression tree. This way
duplication of the specialized code can be kept to a minimum while at
least the common cases can completely avoid the fmgr overhead.

This approach would also mesh together fine with batching. Given
suitably regular data structures and simple functions, LLVM will be
able to vectorize the code. If not it will still end up with a nice
tight loop that is an order of magnitude or two faster than the
current executor.
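
To make the "tight loop" concrete, a simplified variant of the
SUM(CASE ...) example above could, after inlining and folding, boil down
to something like this hand-written loop (illustrative only, not
generated output):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* what SUM(CASE WHEN a THEN b*c ELSE d END) might compile down to
     * once the expression tree is inlined into a single loop */
    static int64_t
    sum_case_batch(const bool *a, const int32_t *b, const int32_t *c,
                   const int32_t *d, size_t n)
    {
        int64_t sum = 0;

        for (size_t i = 0; i < n; i++)
            sum += a[i] ? (int64_t) b[i] * c[i] : d[i];
        return sum;
    }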

The first cut could take care of ExecQual, ExecTargetList and friends.
Later improvements could let execution nodes provide basic blocks that
would then be threaded together into the main execution loop. If some
node does not implement the basic block interface a default
implementation is used that calls the current interface. It gets a bit
handwavy at this point, but the main idea would be to enable data
marshaling so that values can be routed directly to the code that
needs them without being written to intermediate storage.

> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

My feeling is the same: fmgr overhead, data marshaling, and dynamic
dispatch through the executor are the big issue. This is corroborated
by what I have seen in other VM implementations. Once you get
the data into a uniform format where vectorized execution can be
used, the CPU execution resources are no longer the bottleneck. Memory
bandwidth gets in the way, unless each input value is used in multiple
calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Andres Freund
Hi,

On 2016-05-09 13:33:55 -0400, Robert Haas wrote:
> I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:


3) We use a lot of very cache-inefficient datastructures.

Especially the pervasive use of linked lists in the executor is pretty
bad for performance. Every element is likely to incur cache misses,
every list element pretty much has its own cacheline (thereby reducing
the overall cache hit ratio), and they have a horrible allocation
overhead (both space and palloc runtime).


> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  [...].  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be a lot of easier avenues for performance
improvements.


> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant.  Besides it only being a
weekend hack project, the big thing I got stuck on was considering how
to exactly determine when to batch and not to batch.


I'd personally say that the CPU pipeline defeating aspect is worse than
the effect of the cache/branch misses themselves. Today's CPUs are
heavily superscalar, and our instruction-per-cycle throughput shows
pretty clearly that we're not good at employing (micro-)instruction
parallelism. We're quite frequently at well below one instruction/cycle.





> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper.  For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple.  Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once.  ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted.  In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today.

I agree that that generally is a reasonable way forward.
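
Sketched with the names the prototype uses elsewhere in this thread
(ExecDispatchNode, ExecAsyncWaitForNode, the PlanState result field);
the details here are guesses, not the actual patch:

    TupleTableSlot *
    ExecProcNode(PlanState *node)
    {
        for (;;)
        {
            ExecDispatchNode(node);   /* run the node-type-specific code */

            if (node->result_ready)
                return (TupleTableSlot *) node->result;  /* NULL = done */

            /* not ready: old-style callers expect us to block until it is */
            ExecAsyncWaitForNode(node);
        }
    }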


> Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities.  They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it.  If not, ExecProcNode will paper over the problem.  This
> seems to me to be a good way to keep the code simple.

Maybe not permission, but for some cases it seems to be important to
hint to *not* prefetch a lot of rows. E.g. anti joins come to mind. Just
using batching with force seems likely to regress some queries quite
badly (e.g. an expensive join inside an EXISTS() which returns many
tuples).


> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like.  This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Bert
Hmm, the morsels paper looks really interesting at first sight.
Let's see if we can get a PoC working in PostgreSQL :-)

On Tue, May 10, 2016 at 9:42 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> On 05/10/2016 08:26 PM, Robert Haas wrote:
>
>> On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
>>  wrote:
>>
>>> What's wrong with a worker being blocked? You can just have more
>>> workers (more than CPU cores) to let others continue to do useful
>>> work.
>>>
>> Not really.  The workers are all running the same plan, so they'll all
>> make the same decision about which node needs to be executed next.  If
>> that node can't accommodate multiple processes trying to execute it at
>> the same time, it will have to block all of them but the first one.
>> Adding more processes just increases the number of processes sitting
>> around doing nothing.
>>
>
> Doesn't this actually mean that we need to have a normal job scheduler
> which is given a queue of jobs and, having some pool of threads, will be
> able to organize efficient execution of queries? The optimizer can build a
> pipeline (graph) of tasks which corresponds to the execution plan nodes,
> i.e. SeqScan, Sort, ... Each task is split into several jobs which can be
> concurrently scheduled by a task dispatcher.  So you will not have a
> blocked worker waiting for something and all system resources will be
> utilized. Such an approach with a dispatcher allows implementing quotas,
> priorities, ... Also the dispatcher can care about NUMA and cache
> optimizations, which is especially critical on modern architectures. One
> more reference: http://db.in.tum.de/~leis/papers/morsels.pdf
>
> Sorry, maybe I am wrong, but I still think that async ops are
> "multitasking for the poor" :)
> Yes, maintaining threads and especially separate processes adds
> significant overhead. It leads to extra resource consumption, and context
> switches are quite expensive. And I know from my own experience that
> replacing several concurrent processes performing some IO (for example
> with sockets) with just one process using multiplexing allows increasing
> performance. But still, async ops are just a way to make the programmer
> responsible for managing the state machine instead of relying on the OS
> to make context switches. A manual transmission is still more efficient
> than an automatic transmission, but most drivers prefer the latter ;)
>
> Seriously, I carefully read your response to Kouhei, but I am still not
> convinced that async ops are what we need.  Or maybe we just understand
> different things by this notion.
>
>
>
>
>>> But there is some research, for example:
>>>
>>> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>
>>> showing that the same or even better effect can be achieved by
>>> generating native code for the query execution plan (which is not so
>>> difficult now, thanks to LLVM).
>>> It eliminates interpretation overhead and increases cache locality.
>>> I got similar results with my own experiments accelerating SparkSQL.
>>> Instead of native code generation I used conversion of query plans to C
>>> code and experimented with different data representations. A
>>> "Horizontal model" with loading of columns on demand shows better
>>> performance than a columnar store.
>>>
>> Yes, I think this approach should also be considered.
>>
>>> At this moment (February) they have implemented translation of only a
>>> few PostgreSQL operators used by ExecQual and do not support
>>> aggregates.
>>> They get about a 2x speed increase on synthetic queries and a 25%
>>> increase on TPC-H Q1 (for Q1 the most critical part is generation of
>>> native code for aggregates, because ExecQual itself takes only 6% of
>>> the time for this query).
>>> Actually, these 25% for Q1 were achieved not by using dynamic code
>>> generation, but by switching from a PULL to a PUSH model in the
>>> executor.
>>> It seems to be yet another interesting PostgreSQL executor
>>> transformation.
>>> As far as I know, they are going to publish the result of their work as
>>> open source...
>>>
>> Interesting.  You may notice that in "asynchronous mode" my prototype
>> works using a push model of sorts.  Maybe that should be taken
>> further.
>>
>>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>



-- 
Bert Desmet
0477/305361


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Jim Nasby

On 5/10/16 12:47 AM, Kouhei Kaigai wrote:

> On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:

> > My concern about ExecProcNode is that it is constructed with a large
> > switch ... case statement. It involves tons of comparison operations at
> > run-time. If we replace this switch ... case with a function pointer,
> > it would probably improve performance, especially for OLAP workloads
> > that process a large number of rows.

>
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>

Ah, indeed, you are right. Please forget the above part.


Even so, I would think that the simplification in the executor would be 
worth it. If you need to add a new node there are dozens of places where 
you might have to mess with these giant case statements.


In Python (for example), types (equivalent to nodes in this case) have 
data structures that define function pointers for a core set of 
operations (such as doing garbage collection, or generating a string 
representation). That means that to add a new type at the C level you 
only need to define a C structure that has the expected members, and an 
initializer function that will properly set everything up when you 
create a new instance. There's no messing around with the rest of the 
guts of Python.


*Even more important, everything you need to know about the type is 
contained in one place, not spread throughout the code.*
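
Translated to the executor, that idea might look like the following
(entirely hypothetical names; PostgreSQL does not have this structure
today):

    typedef struct NodeMethods
    {
        TupleTableSlot *(*exec)(PlanState *node);    /* produce next tuple */
        void            (*rescan)(PlanState *node);  /* reset for re-execution */
        void            (*end)(PlanState *node);     /* release resources */
    } NodeMethods;

    /* each node type supplies one static table at init time... */
    static const NodeMethods seqscan_methods = {
        seqscan_exec, seqscan_rescan, seqscan_end
    };

    /* ...and ExecProcNode shrinks to a single indirect call */
    return node->methods->exec(node);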

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Konstantin Knizhnik

On 05/10/2016 08:26 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
 wrote:

What's wrong with a worker being blocked? You can just have more workers
(more than CPU cores) to let others continue to do useful work.

Not really.  The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next.  If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.


Doesn't this actually mean that we need to have a normal job scheduler which
is given a queue of jobs and, having some pool of threads, will be able to
organize efficient execution of queries? The optimizer can build a pipeline
(graph) of tasks which corresponds to the execution plan nodes, i.e. SeqScan,
Sort, ... Each task is split into several jobs which can be concurrently
scheduled by a task dispatcher.  So you will not have a blocked worker
waiting for something, and all system resources will be utilized. Such an
approach with a dispatcher allows implementing quotas, priorities, ... Also
the dispatcher can care about NUMA and cache optimizations, which is
especially critical on modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf


Sorry, maybe I am wrong, but I still think that async ops are "multitasking
for the poor" :)
Yes, maintaining threads and especially separate processes adds significant
overhead. It leads to extra resource consumption, and context switches are
quite expensive. And I know from my own experience that replacing several
concurrent processes performing some IO (for example with sockets) with
just one process using multiplexing allows increasing performance. But
still, async ops are just a way to make the programmer responsible for
managing the state machine instead of relying on the OS to make context
switches. A manual transmission is still more efficient than an automatic
transmission, but most drivers prefer the latter ;)


Seriously, I carefully read your response to Kouhei, but I am still not
convinced that async ops are what we need.  Or maybe we just understand
different things by this notion.





But there is some research, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or even better effect can be achieved by generating
native code for the query execution plan (which is not so difficult now,
thanks to LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results with my own experiments accelerating SparkSQL.
Instead of native code generation I used conversion of query plans to C code
and experimented with different data representations. A "Horizontal model"
with loading of columns on demand shows better performance than a columnar
store.

Yes, I think this approach should also be considered.


At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x speed increase on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generation of native
code for aggregates, because ExecQual itself takes only 6% of the time for
this query).
Actually, these 25% for Q1 were achieved not by using dynamic code
generation, but by switching from a PULL to a PUSH model in the executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the result of their work as
open source...

Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
 wrote:
> What's wrong with a worker being blocked? You can just have more workers
> (more than CPU cores) to let others continue to do useful work.

Not really.  The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next.  If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.

> But there is some research, for example:
>
> http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>
> showing that the same or even better effect can be achieved by generating
> native code for the query execution plan (which is not so difficult now,
> thanks to LLVM).
> It eliminates interpretation overhead and increases cache locality.
> I got similar results with my own experiments accelerating SparkSQL.
> Instead of native code generation I used conversion of query plans to C
> code and experimented with different data representations. A "Horizontal
> model" with loading of columns on demand shows better performance than a
> columnar store.

Yes, I think this approach should also be considered.

> At this moment (February) they have implemented translation of only a few
> PostgreSQL operators used by ExecQual and do not support aggregates.
> They get about a 2x speed increase on synthetic queries and a 25%
> increase on TPC-H Q1 (for Q1 the most critical part is generation of
> native code for aggregates, because ExecQual itself takes only 6% of the
> time for this query).
> Actually, these 25% for Q1 were achieved not by using dynamic code
> generation, but by switching from a PULL to a PUSH model in the executor.
> It seems to be yet another interesting PostgreSQL executor transformation.
> As far as I know, they are going to publish the result of their work as
> open source...

Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 9:38 PM, Kouhei Kaigai  wrote:
> Is the parallel-aware Append node sufficient to run multiple nodes
> asynchronously? (Sorry, I couldn't find enough time to code the feature
> even though we had a discussion before.)

It's tempting to think that parallel query and asynchronous query are
the same thing, but I think that they are actually quite different.
Parallel query involves using multiple processes to service a query.
Asynchronous query involves using each individual process as
efficiently as possible by not having it block any more than
necessary.  You can want these things together or separately.  For
example, consider this query plan:

Append
  -> ForeignScan
  -> ForeignScan

Here, you do not want parallel query; the queries must both be
launched by the user backend, not some worker process, else you will
not get the right transaction semantics.  The parallel-aware Append
node we talked about before does not help here.  On the other hand,
consider this:

Append
  -> Seq Scan
       Filter: lots_of_cpu()
  -> Seq Scan
       Filter: lots_of_cpu()

Here, asynchronous query is of no help, but parallel query is great.
Now consider this hypothetical plan:

Gather
  -> Hash Join
       -> Parallel Bitmap Heap Scan
            -> Bitmap Index Scan
       -> Parallel Hash
            -> Parallel Seq Scan

Let's assume that the bitmap *heap* scan can be performed in parallel
but the bitmap *index* scan can't.  That's pretty reasonable for a
first cut, actually, because the index accesses are probably touching
much less data, and we're doing little CPU work for each index page -
any delay here is likely to be I/O.

So in that world what you want is for the first worker to begin
performing the bitmap index scan and building a shared TIDBitmap
that the workers can then use to scan the heap.  The other workers,
meanwhile, could begin building the shared hash table (which is what I
intend to denote by saying that it's a *Parallel* Hash).  If the
process building the bitmap finishes before the hash table is built,
it can help build the hash table as well.  Once both operations are
done, each process can scan a subset of the rows from the bitmap heap
scan and do the hash table probes for just those rows.  To make all of
this work, you need both *parallel* query, so that you have workers,
and also *asynchronous* query, so that workers which see that the
bitmap index scan is in progress don't get stuck waiting for it but
can look around for other work.

> In the above example, the scan on the foreign table takes a longer lead
> time than the local scan. If Append can map every child node onto an
> individual worker, the local scan worker begins to return tuples first;
> then, mixed tuples shall be returned eventually.

This is getting at an issue I'm somewhat worried about, which is
scheduling.  In my prototype, we only check for events if there are no
nodes ready for the CPU now, but sometimes that might be a loser.  One
probably needs to check for events periodically even when there are
still nodes waiting for the CPU, but "how often?" seems like an
unanswerable question.

> However, process-internal asynchronous execution may also be beneficial
> in cases where the cost of shm_mq is not negligible (e.g., no scan
> qualifiers are given to the worker process). I think it allows
> implementing pre-fetching very naturally.

Yes.

>> My proposal for how to do this is to make ExecProcNode function as a
>> backward-compatibility wrapper.  For asynchronous execution, a node
>> might return a not-ready-yet indication, but if that node is called
>> via ExecProcNode, it means the caller isn't prepared to receive such
>> an indication, so ExecProcNode will just wait for the node to become
>> ready and then return the tuple.
>>
> Backward compatibility is good. In addition, a child node may want to
> know the context in which it is called. It may want to switch its behavior
> according to the caller's expectation. For example, it may be beneficial
> if SeqScan did more aggressive prefetching during asynchronous execution.

Maybe, but I'm a bit doubtful.  I'm not seeing a lot of advantage in
that sort of thing, and it will make the code a lot more complicated.
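
For reference, the backward-compatibility wrapper proposed above might
look roughly like this.  This is a minimal sketch only: ExecAsyncProcNode()
and its ready flag are illustrative names, and ExecAsyncWaitForNode() is
whatever primitive ends up doing the waiting.

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    TupleTableSlot *slot;

    for (;;)
    {
        bool    ready;

        slot = ExecAsyncProcNode(node, &ready);  /* hypothetical */
        if (ready)
            return slot;    /* a tuple, or NULL at end of scan */
        ExecAsyncWaitForNode(node);  /* block until node can progress */
    }
}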

> Also, can we consider which data format will be returned from the child
> node during the planning stage? It affects the cost of inter-node
> data exchange. If a pair of parent and child nodes supports a
> special data format (as the existing HashJoin and Hash nodes do), it
> shall be a discount factor in cost estimation.

I'm not sure.  The costing aspects of this need a lot more thought
than I have given them so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Robert Haas
On Mon, May 9, 2016 at 8:34 PM, David Rowley
 wrote:
> It's interesting that you mention this. We identified this as a pain
> point during our work on column stores last year. Simply passing
> single tuples around the executor is really unfriendly towards L1
> instruction cache, plus also the points you mention about L3 cache and
> hash tables and tuple stores. I really think that we're likely to see
> significant gains by processing >1 tuple at a time, so this topic very
> much interests me.

Cool.  I hope we can work together on it, and with anyone else who is
interested.

> When we start multiplying those increases with the increases with
> something like parallel query then we're starting to see very nice
> gains in performance.

Check.

> Alvaro, Tomas and I had been discussing this and late last year I did
> look into what would be required to allow this to happen in Postgres.
> Basically there's 2 sub-projects, I'll describe what I've managed to
> learn so far about each, and the rough plan that I have to implement
> them:
>
> 1. Batch Execution:
>
> a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
> b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
> indicate if the struct contains a single or a multiple tuples.
> Multiple tuples may need to be deformed in a non-lazy fashion in order
> to prevent too many buffers from having to be pinned at once. Tuples
> will be deformed into arrays of each column rather than arrays for
> each tuple (this part is important to support the next sub-project)
> c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
> process a batch TupleTableSlot. This will require some tight loop to
> aggregate the entire TupleTableSlot at once before returning.
> d. Add function in execAmi.c which returns true or false depending on
> if the node supports batch TupleTableSlots or not.
> e. At executor startup determine if the entire plan tree supports
> batch TupleTableSlots, if so enable batch scan mode.

I'm wondering if we should instead have a whole new kind of node, a
TupleTableVector, say.  Things that want to work one tuple at a time
can continue to do so with no additional overhead.  Things that want
to return batches can do it via this new result type.
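
For concreteness, such a batch result type might look something like
the sketch below; the name comes from the paragraph above, but every
field is just a guess at a plausible shape, stored column-major so that
scanning one attribute across the batch touches consecutive memory.

typedef struct TupleTableVector
{
    NodeTag     type;
    TupleDesc   tupdesc;    /* descriptor shared by all tuples */
    int         ntuples;    /* tuples actually stored in the batch */
    Datum     **values;     /* values[attno][tupno] */
    bool      **isnull;     /* isnull[attno][tupno] */
} TupleTableVector;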

> node types, which *may* not be all that difficult. I'm also assuming
> that batch mode (in all cases apart from queries with LIMIT or
> cursors) will always be faster than tuple-at-a-time, so requires no
> costings from the planner.

I definitely agree that we need to consider cases with and without
LIMIT separately, but there's more than one way to get a LIMIT.  For
example, a subquery returns only one row; a semijoin returns only one
row.  I don't think you are going to escape planner considerations.

Nested Loop Semi Join
  -> Seq Scan
  -> Index Scan on dont_batch_here

> 2. Vector processing
>
> (I admit that I've given this part much less thought so far, but
> here's what I have in mind)
>
> This depends on batch execution, and is intended to allow the executor
> to perform function calls to an entire batch at once, rather than
> tuple-at-a-time. For example, let's take the following example;
>
> SELECT a+b FROM t;
>
> here (as of now) we'd scan "t" one row at a time and perform a+b after
> having deformed enough of the tuple to do that. We'd then go and get
> another Tuple from the scan node and repeat until the scan gave us no
> more Tuples.
>
> With batch execution we'd fetch multiple Tuples from the scan and we'd
> then perform the call to say int4_pl() multiple times, which still
> kinda sucks as it means calling int4_pl() possibly millions of times
> (once per tuple). The vector mode here would require that we modify
> pg_operator to add a vector function for each operator so that we can
> call the function passing in an array of Datums and a length to have
> SIMD operations perform the addition, so we'd call something like
> int4_pl_vector() only once per batch of tuples allowing the CPU to
> perform SIMD operations on those datum arrays. This could be done in
> an incremental way as the code could just callback on the standard
> function in cases where a vectorised version of it is not available.
> Thought is needed here about when exactly this decision is made as the
> user may not have permissions to execute the vector function, so it
> can't simply be a run time check.  These functions would simply return
> another vector of the results.  Aggregates could be given a vector
> transition function, where something like COUNT(*)'s vector_transfn
> would simply just current_count += vector_length;
>
> This project does appear to require that we bloat the code with 100's
> of vector versions of each function. I'm not quite sure if there's a
> better way to handle this. The problem is that the fmgr is pretty much
> a barrier to SIMD operations, and this was the only idea that I've had
> so far about breaking through that 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Rajeev rastogi
On 09 May 2016 23:04, Robert Haas wrote:

>2. vectorized execution, by which I mean the ability of a node to return
>tuples in batches rather than one by one.  Andres has opined more than
>once that repeated trips through ExecProcNode defeat the ability of the
>CPU to do branch prediction correctly, slowing the whole system down,
>and that they also result in poor CPU cache behavior, since we jump all
>over the place executing a little bit of code from each node before
>moving onto the next rather than running one bit of code first, and then
>another later.  I think that's
>probably right.   For example, consider a 5-table join where all of
>the joins are implemented as hash tables.  If this query plan is going
>to be run to completion, it would make much more sense to fetch, say,
>100 tuples from the driving scan and then probe for all of those in the
>first hash table, and then probe for all of those in the second hash
>table, and so on.  What we do instead is fetch one tuple and probe for
>it in all 5 hash tables, and then repeat.  If one of those hash tables
>would fit in the CPU cache but all five together will not,
>that seems likely to be a lot worse.   But even just ignoring the CPU
>cache aspect of it for a minute, suppose you want to write a loop to
>perform a hash join.  The inner loop fetches the next tuple from the
>probe table and does a hash lookup.  Right now, fetching the next tuple
>from the probe table means calling a function which in turn calls
>another function which probably calls another function which probably
>calls another function and now about 4 layers down we actually get the
>next tuple.  If the scan returned a batch of tuples to the hash join,
>fetching the next tuple from the batch would probably be 0 or 1 function
>calls rather than ... more.  Admittedly, you've got to consider the cost
>of marshaling the batches but I'm optimistic that there are cycles to be
>squeezed out here.  We might also want to consider storing batches of
>tuples in a column-optimized rather than row-optimized format so that
>iterating through one or two attributes across every tuple in the batch
>touches the minimal number of cache lines.


This sounds like a really great idea in the direction of performance improvement.
I would like to share my thoughts based on our research work in a similar area
(mostly it may be as you have mentioned).
Our goals with this work were to:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from its immediate child operator, let an
operator push the tuple to its parent. Pushing can continue until it reaches
an operator that cannot proceed without results from another operator.
3. Use the above two points to achieve better data locality.

e.g., we did some quick prototyping (take it just as a thought provoker) as
mentioned below:
Query: select * from tbl1, tbl2, tbl3 where tbl1.a=tbl2.b and tbl2.b=tbl3.c;
For hash join:
    For each tuple t2 of tbl2:
        Materialize into a hash on tbl1.a = tbl2.b

    For each tuple t3 of tbl3:
        Materialize into a hash on tbl2.b = tbl3.c

    For each tuple t1 of tbl1:
        Search the hash on tbl1.a = tbl2.b
        Search the hash on tbl2.b = tbl3.c
        Output t1*t2*t3

Of course, at each level, if there is any additional qual for the table, the
same can be applied.

Similarly, for Nested Loop Join, the plan tree can be processed in the form of
a post-order traversal of the tree:
    Scan the first relation (the leftmost relation), store all tuples --> Outer
    Loop through all remaining scan-node relations (or some part of the total
    tuples), starting from the second one:
        Scan the current relation
        For each tuple, match against all tuples of the outer result and
        build the combined tuple
        Save all satisfied combined tuples --> Outer

The result we got was really impressive.

There is a paper by Thomas Neumann on this idea: 
http://www.vldb.org/pvldb/vol4/p539-neumann.pdf 

Note: VitesseDB has also implemented this approach for Hash Join along with
compilation, and their results are really impressive.

Thanks and Regards,
Kumar Rajeev Rastogi.
http://rajeevrastogi.blogspot.in/   

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 9 May 2016 13:33:55 -0400, Robert Haas  wrote in 

> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:

> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.

This is my main concern and what I wanted to solve.

> It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.

Especially for foreign tables, there must be gaps between sending
FETCH and getting the result. Visiting other tables is a very
effective way to fill those gaps. Using file descriptors greatly
helps with this, thanks to the new WaitEventSet API. The attached
is a WIP proof of concept of that (sorry for including some debug
code and irrelevant code). Its Exec* APIs differ a bit from those
in the 0002 patch, but it works, though only for postgres_fdw and
Append. It embeds the waiting code into ExecAppend, but that is
easily replaceable with the framework in Robert's 0003 patch.

Apart from the core part: in postgres_fdw, several scans may reside
together on one connection. These scans share the same FD, but
there is no way to identify for which scan node the FD was
signalled. To handle that situation, we might need a 'seemed to be
ready but really not' route.
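
For readers new to the API, the waiting side could look roughly like
this (a sketch only; scan_fds[], scan_states[], nremote and the
resume_async_node() call are placeholders for whatever the patch
actually does, not real functions):

/* Build a wait-event set covering every in-flight remote scan. */
WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext, nremote + 1);
WaitEvent     event;
int           i;

AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL, NULL);
for (i = 0; i < nremote; i++)
    AddWaitEventToSet(wes, WL_SOCKET_READABLE, scan_fds[i],
                      NULL, scan_states[i]);  /* remember whose fd it is */

/* Sleep until some socket is readable, then resume that subtree. */
if (WaitEventSetWait(wes, -1 /* no timeout */, &event, 1) > 0 &&
    (event.events & WL_SOCKET_READABLE))
    resume_async_node((PlanState *) event.user_data);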

> For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

One annoyance of this method is draining latch-like data through a
single FD. Since we would have to provide FDs for such nodes, Gather
may need another data-passing channel on the FDs.

And I want to realize early execution of async nodes. This might
require that every node type return 'not-ready' for the first call,
even if it is async-capable.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.   But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join.  The inner loop fetches the next tuple from the
> probe table and does a hash lookup.  Right now, fetching the next
> tuple from the probe table means 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread konstantin knizhnik
Hi,

> 1. asynchronous execution,


It seems to me that asynchronous execution can be considered an alternative to
the multithreading model (in the case of PostgreSQL, the roles of threads are
played by workers).
Async operations are used to achieve smaller overhead but have scalability
problems (because most implementations of cooperative multitasking have just
one processing thread and so cannot consume more than one core).

So I wonder whether asynchronous execution is trying to achieve the same goal
as parallel query execution but using a slightly different mechanism.
You wrote: 
> in the meantime, any worker that arrives at that scan node has no choice but 
> to block.

What's wrong with a worker being blocked? You can just have more workers
(more than CPU cores) to let the others continue to do useful work.
But I agree that 
> Whether or not this will be efficient is a research question




> 2. vectorized execution

Vector I/O is very important for a columnar store. In the IMCS extension
(in-memory columnar store), using vector operations increases speed 10-100
times depending on the size of the data set and the query. Obviously the best
results are for grand aggregates.

But there is some research, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by generating
native code for the query execution plan (which is not so difficult now,
thanks to LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results in my own experiments on accelerating SparkSQL. Instead
of native code generation I used conversion of query plans to C code and
experimented with different data representations. A "horizontal model" with
loading of columns on demand shows better performance than a columnar store.

As far as I know, a native code generator is currently being developed for
PostgreSQL by ISP RAN.
Sorry, slides in Russian:
https://pgconf.ru/media/2016/02/19/6%20Мельник%20Дмитрий%20Михайлович,%2005-02-2016.pdf

At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x speed increase on synthetic queries and a 25% increase on
TPC-H Q1 (for Q1 the most critical part is generating native code for the
aggregates, because ExecQual itself takes only 6% of the time for this query).
Actually these 25% for Q1 were achieved not by using dynamic code generation,
but by switching from a PULL to a PUSH model in the executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the result of their work as open
source...



On May 9, 2016, at 8:33 PM, Robert Haas wrote:

> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
> 
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Pavel Stehule
2016-05-10 8:05 GMT+02:00 David Rowley :

> On 10 May 2016 at 16:34, Greg Stark  wrote:
> >
> > On 9 May 2016 8:34 pm, "David Rowley" 
> wrote:
> >>
> >> This project does appear to require that we bloat the code with 100's
> >> of vector versions of each function. I'm not quite sure if there's a
> >> better way to handle this. The problem is that the fmgr is pretty much
> >> a barrier to SIMD operations, and this was the only idea that I've had
> >> so far about breaking through that barrier. So further ideas here are
> >> very welcome.
> >
> > Well yes and no. In practice I think you only need to worry about
> vectorised
> > versions of integer and possibly float. For other data types there either
> > aren't vectorised operators or there's little using them.
> >
> > And I'll make a bold claim here that the only operators I think really
> > matter are =
> >
> > The reason is that using SIMD instructions is a minor win if you have any
> > further work to do per tuple. The only time it's a big win is if you're
> > eliminating entire tuples from consideration efficiently. = is going to
> do
> > that often, other btree operator classes might be somewhat useful, but
> > things like + really only would come up in odd examples.
> >
> > But even that understates things. If you have column oriented storage
> then =
> > becomes even more important since every scan has a series of implied
> > equijoins to reconstruct the tuple. And the coup de grace is that in a
> > column oriented storage you try to store variable length data as integer
> > indexes into a dictionary of common values so *everything* is an integer
> =
> > operation.
> >
> > How to do this without punching right through the executor as an
> abstraction
> > and still supporting extensible data types and operators was puzzling me
> > already. I do think it involves having these vector operators in the
> > catalogue and also some kind of compression mapping to integer indexes.
> But
> > I'm not sure that's all that would be needed.
>
> Perhaps the first move to make on this front will be for aggregate
> functions. Experimentation might be quite simple to realise which
> functions will bring enough benefit. I imagined that even Datums where
> the type is not processor native might yield a small speedup, not from
> SIMD, but just from less calls through fmgr. Perhaps we'll realise
> that those are not worth the trouble, I've no idea at this stage.
>

It can be reduced to sum and count in a first iteration. On the other hand, a
lot of OLAP reports are based on pretty complex expressions - and there,
compilation is probably the better way.

Regards

Pavel


>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread David Rowley
On 10 May 2016 at 16:34, Greg Stark  wrote:
>
> On 9 May 2016 8:34 pm, "David Rowley"  wrote:
>>
>> This project does appear to require that we bloat the code with 100's
>> of vector versions of each function. I'm not quite sure if there's a
>> better way to handle this. The problem is that the fmgr is pretty much
>> a barrier to SIMD operations, and this was the only idea that I've had
>> so far about breaking through that barrier. So further ideas here are
>> very welcome.
>
> Well yes and no. In practice I think you only need to worry about vectorised
> versions of integer and possibly float. For other data types there either
> aren't vectorised operators or there's little using them.
>
> And I'll make a bold claim here that the only operators I think really
> matter are =
>
> The reason is that using SIMD instructions is a minor win if you have any
> further work to do per tuple. The only time it's a big win is if you're
> eliminating entire tuples from consideration efficiently. = is going to do
> that often, other btree operator classes might be somewhat useful, but
> things like + really only would come up in odd examples.
>
> But even that understates things. If you have column oriented storage then =
> becomes even more important since every scan has a series of implied
> equijoins to reconstruct the tuple. And the coup de grace is that in a
> column oriented storage you try to store variable length data as integer
> indexes into a dictionary of common values so *everything* is an integer =
> operation.
>
> How to do this without punching right through the executor as an abstraction
> and still supporting extensible data types and operators was puzzling me
> already. I do think it involves having these vector operators in the
> catalogue and also some kind of compression mapping to integer indexes. But
> I'm not sure that's all that would be needed.

Perhaps the first move to make on this front will be for aggregate
functions. Experimentation might be quite simple to realise which
functions will bring enough benefit. I imagined that even Datums where
the type is not processor native might yield a small speedup, not from
SIMD, but just from less calls through fmgr. Perhaps we'll realise
that those are not worth the trouble, I've no idea at this stage.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of David Rowley
> Sent: Tuesday, May 10, 2016 2:01 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] asynchronous and vectorized execution
> 
> On 10 May 2016 at 13:38, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
> > My concern about ExecProcNode is that it is constructed with a large switch
> > ... case statement. It involves tons of comparison operations at run time.
> > If we replaced this switch ... case with a function pointer, it would
> > probably improve performance, especially for OLAP workloads that process
> > large amounts of rows.
> 
> I imagined that any decent compiler would have built the code to use
> jump tables for this. I have to say that I've never checked to make
> sure though.
>
Ah, indeed, you are right. Please forget the above part.

In GCC 4.8.5, the case labels between T_ResultState and T_LimitState were
handled using a jump table.

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    :
switch (nodeTag(node))
  5ad361:   8b 03   mov(%rbx),%eax
  5ad363:   2d c9 00 00 00  sub$0xc9,%eax
  5ad368:   83 f8 24cmp$0x24,%eax
  5ad36b:   0f 87 4f 02 00 00   ja 5ad5c0 <ExecProcNode+0x290>
  5ad371:   ff 24 c5 68 48 8b 00jmpq   *0x8b4868(,%rax,8)
  5ad378:   0f 1f 84 00 00 00 00nopl   0x0(%rax,%rax,1)
  5ad37f:   00

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread David Rowley
On 10 May 2016 at 13:38, Kouhei Kaigai  wrote:
> My concern about ExecProcNode is that it is constructed with a large switch
> ... case statement. It involves tons of comparison operations at run time.
> If we replaced this switch ... case with a function pointer, it would
> probably improve performance, especially for OLAP workloads that process
> large amounts of rows.

I imagined that any decent compiler would have built the code to use
jump tables for this. I have to say that I've never checked to make
sure though.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Greg Stark
On 9 May 2016 8:34 pm, "David Rowley"  wrote:
>
> This project does appear to require that we bloat the code with 100's
> of vector versions of each function. I'm not quite sure if there's a
> better way to handle this. The problem is that the fmgr is pretty much
> a barrier to SIMD operations, and this was the only idea that I've had
> so far about breaking through that barrier. So further ideas here are
> very welcome.

Well yes and no. In practice I think you only need to worry about
vectorised versions of integer and possibly float. For other data types
there either aren't vectorised operators or there's little using them.

And I'll make a bold claim here that the only operators I think really
matter are =

The reason is that using SIMD instructions is a minor win if you have any
further work to do per tuple. The only time it's a big win is if you're
eliminating entire tuples from consideration efficiently. = is going to do
that often, other btree operator classes might be somewhat useful, but
things like + really only would come up in odd examples.

But even that understates things. If you have column oriented storage then
= becomes even more important since every scan has a series of implied
equijoins to reconstruct the tuple. And the coup de grace is that in a
column oriented storage you try to store variable length data as integer
indexes into a dictionary of common values so *everything* is an integer =
operation.

How to do this without punching right through the executor as an
abstraction and still supporting extensible data types and operators was
puzzling me already. I do think it involves having these vector operators
in the catalogue and also some kind of compression mapping to integer
indexes. But I'm not sure that's all that would be needed.
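
To make the = case concrete: over a column of integer dictionary codes
it reduces to a loop like the one below, producing a selection vector
of surviving positions. This is illustrative C only, not tied to any
PostgreSQL structure; a branch-free body of this shape is exactly what
compilers turn into SIMD compare instructions.

/* Return in sel[] the positions where codes[i] == key. */
static int
select_eq_int32(const int32 *codes, int n, int32 key, int *sel)
{
    int     nsel = 0;
    int     i;

    for (i = 0; i < n; i++)
    {
        sel[nsel] = i;              /* tentatively keep this position */
        nsel += (codes[i] == key);  /* advance only on a match */
    }
    return nsel;                    /* number of surviving tuples */
}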


Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Kouhei Kaigai
> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Robert Haas
> Sent: Tuesday, May 10, 2016 2:34 AM
> To: pgsql-hackers@postgresql.org
> Subject: [HACKERS] asynchronous and vectorized execution
> 
> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
>
Is the parallel-aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I couldn't find enough time to code the feature
even though we had a discussion before.)
If some of the child nodes are blocked by I/O or other heavy stuff, they
cannot enqueue results into the shm_mq; thus, the Gather node naturally
skips nodes that are not ready.
In the above example, the scan on the foreign table takes a longer lead
time than the local scan. If Append can map every child node onto an
individual worker, the local scan worker begins to return tuples first;
then, mixed tuples shall be returned eventually.

However, process-internal asynchronous execution may also be beneficial
in cases where the cost of shm_mq is not negligible (e.g., no scan
qualifiers are given to the worker process). I think it allows
implementing pre-fetching very naturally.

> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior,

My concern about ExecProcNode is that it is constructed with a large switch
... case statement. It involves tons of comparison operations at run time.
If we replaced this switch ... case with a function pointer, it would
probably improve performance, especially for OLAP workloads that process
large amounts of rows.
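
Something along these lines, say (the ExecProcNodeMtd typedef and the
exec_proc field are illustrative names only; nothing like this exists
in the tree today):

/* One "produce next tuple" routine per node, chosen once at init time. */
typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *node);

/* In PlanState:  ExecProcNodeMtd exec_proc;  assigned in ExecInitNode(),
 * e.g. node->exec_proc = ExecSeqScan for a SeqScanState. */

static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    return node->exec_proc(node);   /* indirect call replaces the switch */
}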

> since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.
>
I can agree with the above concern from my experien

Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread David Rowley
On 10 May 2016 at 05:33, Robert Haas  wrote:
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.  I think that's
> probably right.   For example, consider a 5-table join where all of
> the joins are implemented as hash tables.  If this query plan is going
> to be run to completion, it would make much more sense to fetch, say,
> 100 tuples from the driving scan and then probe for all of those in
> the first hash table, and then probe for all of those in the second
> hash table, and so on.  What we do instead is fetch one tuple and
> probe for it in all 5 hash tables, and then repeat.  If one of those
> hash tables would fit in the CPU cache but all five together will not,
> that seems likely to be a lot worse.   But even just ignoring the CPU
> cache aspect of it for a minute, suppose you want to write a loop to
> perform a hash join.  The inner loop fetches the next tuple from the
> probe table and does a hash lookup.  Right now, fetching the next
> tuple from the probe table means calling a function which in turn
> calls another function which probably calls another function which
> probably calls another function and now about 4 layers down we
> actually get the next tuple.  If the scan returned a batch of tuples
> to the hash join, fetching the next tuple from the batch would
> probably be 0 or 1 function calls rather than ... more.  Admittedly,
> you've got to consider the cost of marshaling the batches but I'm
> optimistic that there are cycles to be squeezed out here.  We might
> also want to consider storing batches of tuples in a column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.

It's interesting that you mention this. We identified this as a pain
point during our work on column stores last year. Simply passing
single tuples around the executor is really unfriendly towards L1
instruction cache, plus also the points you mention about L3 cache and
hash tables and tuple stores. I really think that we're likely to see
significant gains by processing >1 tuple at a time, so this topic very
much interests me.

On researching this we've found that other people's research does
indicate that there are gains to be had:
http://www.openlinksw.com/weblog/oerling/

In that blog there's a table that indicates that this row-store
database saw a 4.4x performance improvement from changing from a
tuple-at-a-time executor to a batch tuple executor.

Batch Size 1 tuple = 122 seconds
Batch Size 10k tuples = 27.7 seconds

When we start multiplying those increases with the increases with
something like parallel query then we're starting to see very nice
gains in performance.

Alvaro, Tomas and I had been discussing this and late last year I did
look into what would be required to allow this to happen in Postgres.
Basically there's 2 sub-projects, I'll describe what I've managed to
learn so far about each, and the rough plan that I have to implement
them:


1. Batch Execution:

a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
indicate if the struct contains a single or a multiple tuples.
Multiple tuples may need to be deformed in a non-lazy fashion in order
to prevent too many buffers from having to be pinned at once. Tuples
will be deformed into arrays of each column rather than arrays for
each tuple (this part is important to support the next sub-project)
c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning (see the
sketch after this list).
d. Add function in execAmi.c which returns true or false depending on
if the node supports batch TupleTableSlots or not.
e. At executor startup determine if the entire plan tree supports
batch TupleTableSlots, if so enable batch scan mode.
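
To make (c) concrete, a batch-aware transition step for something like
sum(int4) could collapse millions of per-tuple fmgr calls into one tight
loop per batch. A sketch only, assuming the batch has already been
deformed into the per-column values/isnull arrays described in (b):

/* Aggregate one whole batch per call; arrays are column-major as in (b). */
static int64
sum_int4_batch(const Datum *values, const bool *isnull, int ntuples)
{
    int64   sum = 0;
    int     i;

    for (i = 0; i < ntuples; i++)
    {
        if (!isnull[i])
            sum += DatumGetInt32(values[i]);
    }
    return sum;
}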

Those at least are my ideas for stage 1. There's still more to work out,
e.g. should batch mode occur when the query has a LIMIT? We might not
want to waste time gathering up extra tuples when we're just going to
stop after the first one. So perhaps 'e' above should be up to the
planner instead. Further development work here might add a node type
that de-batches a TupleTableSlot to allow nodes which don't support
batching to be in the plan, i.e. "mixed execution mode". I'm less
excited about this as it may be difficult to cost that 

Re: [HACKERS] asynchronous and vectorized execution

2016-05-09 Thread Simon Riggs
On 9 May 2016 at 19:33, Robert Haas  wrote:


> I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).



> 1. asynchronous execution
>

Not looking at that.


> 2. vectorized execution...



> We might also want to consider storing batches of tuples in a
> column-optimized
> rather than row-optimized format so that iterating through one or two
> attributes across every tuple in the batch touches the minimal number
> of cache lines.
>

The team is about 2 years into research and prototype coding on those topics
at this point, with agreed time for further work over the next 2 years.

I'll let my colleagues chime in with details since I'm not involved at that
level any more.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services