Re: [HACKERS] Parallel Append implementation

2017-11-12 Thread Amit Khandekar
Thanks a lot, Robert, for the patch. I will have a look. I quickly tried
to test some aggregate queries with a partitioned pgbench_accounts
table, and it crashes. I will get back with the fix, and any other
review comments.

Thanks
-Amit Khandekar

On 9 November 2017 at 23:44, Robert Haas  wrote:
> On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas  wrote:
>> No, because the Append node is *NOT* getting copied into shared
>> memory.  I have pushed a comment update to the existing functions; you
>> can use the same comment for this patch.
>
> I spent the last several days working on this patch, which had a
> number of problems both cosmetic and functional.  I think the attached
> is in better shape now, but it could certainly use some more review
> and testing since I only just finished modifying it, and I modified it
> pretty heavily.  Changes:
>
> - I fixed the "morerows" entries in the documentation.  If you had
> built the documentation as you had it and loaded it up in a web
> browser, you would have seen that it was not correct.
>
> - I moved T_AppendState to a different position in the switch inside
> ExecParallelReInitializeDSM, so as to keep that switch in the same
> order as all of the other switch statements in that file.
>
> - I rewrote the comment for pa_finished.  It previously began with
> "workers currently executing the subplan", which is not an accurate
> description. I suspect this was a holdover from a previous version of
> the patch in which this was an array of integers rather than an array
> of type bool.  I also fixed the comment in ExecAppendEstimate and
> added, removed, or rewrote various other comments as well.
>
> - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
> more clear and allows for the possibility that this sentinel value
> might someday be used for non-parallel-aware Append plans.
>
> - I largely rewrote the code for picking the next subplan.  A
> superficial problem with the way that you had it is that you had
> renamed exec_append_initialize_next to exec_append_seq_next but not
> updated the header comment to match.  Also, the logic was spread out
> all over the file.  There are three cases: not parallel aware, leader,
> worker.  You had the code for the first case at the top of the file
> and the other two cases at the bottom of the file and used multiple
> "if" statements to pick the right one in each case.  I replaced all
> that with a function pointer stored in the AppendState, moved the code
> so it's all together, and rewrote it in a way that I find easier to
> understand.  I also changed the naming convention.
>
> - I renamed pappend_len to pstate_len and ParallelAppendDescData to
> ParallelAppendState.  I think the use of the word "descriptor" is a
> carryover from the concept of a scan descriptor.  There's nothing
> really wrong with inventing the concept of an "append descriptor", but
> it seems more clear to just refer to shared state.
>
> - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
> Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
> instead, local state should be reset in ExecReScanAppend.  I installed
> what I believe to be the correct logic in that function instead.
>
> - I fixed list_qsort() so that it copies the type of the old list into
> the new list.  Otherwise, sorting a list of type T_IntList or
> T_OidList would turn it into just plain T_List, which is wrong.
>
> - I removed get_append_num_workers and integrated the logic into the
> callers.  This function was coded quite strangely: it assigned the
> return value of fls() to a double and then eventually rounded the
> result back to an integer.  But fls() returns an integer, so this
> doesn't make much sense.  On a related note, I made it use fls(# of
> subpaths) instead of fls(# of subpaths)+1.  Adding 1 doesn't make
> sense to me here because it leads to a decision to use 2 workers for a
> single, non-partial subpath.  I suspect both of these mistakes stem
> from thinking that fls() returns the base-2 logarithm, but in fact it
> doesn't, quite: log2(1) = 0.0 but fls(1) = 1.
>
> - In the process of making the changes described in the previous
> point, I added a couple of assertions, one of which promptly failed.
> It turns out the reason is that your patch didn't update
> accumulate_append_subpaths(), which can result in flattening
> non-partial paths from a Parallel Append into a parent Append's list
> of partial paths, which is bad.  The easiest way to fix that would be
> to just teach accumulate_append_subpaths() not to flatten a Parallel
> Append into a parent Append or MergeAppend node, but it seemed to me
> that there was a fair amount of duplication between
> accumulate_partialappend_subpath() and accumulate_append_subpaths(), so
> what I did instead is to fold all of the necessary logic directly
> into accumulate_append_subpaths().  This approach also avoids some
> 

Re: [HACKERS] Parallel Append implementation

2017-11-09 Thread Robert Haas
On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas  wrote:
> No, because the Append node is *NOT* getting copied into shared
> memory.  I have pushed a comment update to the existing functions; you
> can use the same comment for this patch.

I spent the last several days working on this patch, which had a
number of problems both cosmetic and functional.  I think the attached
is in better shape now, but it could certainly use some more review
and testing since I only just finished modifying it, and I modified it
pretty heavily.  Changes:

- I fixed the "morerows" entries in the documentation.  If you had
built the documentation as you had it and loaded it up in a web
browser, you would have seen that it was not correct.

- I moved T_AppendState to a different position in the switch inside
ExecParallelReInitializeDSM, so as to keep that switch in the same
order as all of the other switch statements in that file.

- I rewrote the comment for pa_finished.  It previously began with
"workers currently executing the subplan", which is not an accurate
description. I suspect this was a holdover from a previous version of
the patch in which this was an array of integers rather than an array
of type bool.  I also fixed the comment in ExecAppendEstimate and
added, removed, or rewrote various other comments as well.

- I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
more clear and allows for the possibility that this sentinel value
might someday be used for non-parallel-aware Append plans.

- I largely rewrote the code for picking the next subplan.  A
superficial problem with the way that you had it is that you had
renamed exec_append_initialize_next to exec_append_seq_next but not
updated the header comment to match.  Also, the logic was spread out
all over the file.  There are three cases: not parallel aware, leader,
worker.  You had the code for the first case at the top of the file
and the other two cases at the bottom of the file and used multiple
"if" statements to pick the right one in each case.  I replaced all
that with a function pointer stored in the AppendState, moved the code
so it's all together, and rewrote it in a way that I find easier to
understand.  I also changed the naming convention.
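
(For illustration, here is a minimal, self-contained sketch of that
function-pointer arrangement.  The field and function names below are
assumptions made for this example, not necessarily what the patch uses.)

#include <stdbool.h>

typedef struct AppendState AppendState;

struct AppendState
{
    int     as_nplans;          /* number of subplans */
    int     as_whichplan;       /* subplan currently being executed */

    /*
     * Chosen once at initialization time from one of three
     * implementations: not parallel-aware, parallel leader, or
     * parallel worker.
     */
    bool    (*choose_next_subplan) (AppendState *node);
};

/* Example implementation for the non-parallel-aware case. */
static bool
choose_next_subplan_sequentially(AppendState *node)
{
    if (node->as_whichplan + 1 >= node->as_nplans)
        return false;           /* no subplans left to run */
    node->as_whichplan++;
    return true;
}

The main executor loop then only calls node->choose_next_subplan(node)
whenever the current subplan is exhausted, with no leader/worker
if-tests on the hot path.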

- I renamed pappend_len to pstate_len and ParallelAppendDescData to
ParallelAppendState.  I think the use of the word "descriptor" is a
carryover from the concept of a scan descriptor.  There's nothing
really wrong with inventing the concept of an "append descriptor", but
it seems more clear to just refer to shared state.

- I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
instead, local state should be reset in ExecReScanAppend.  I installed
what I believe to be the correct logic in that function instead.

- I fixed list_qsort() so that it copies the type of the old list into
the new list.  Otherwise, sorting a list of type T_IntList or
T_OidList would turn it into just plain T_List, which is wrong.
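
(To make the fix concrete, here is a simplified, self-contained stand-in
-- not the real backend list code -- showing the essential point: the
sorted copy must inherit the original list's node tag instead of always
becoming T_List.)

#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the list header; illustration only. */
typedef enum { T_List, T_IntList, T_OidList } NodeTag;

typedef struct List
{
    NodeTag     type;       /* T_List, T_IntList, or T_OidList */
    int         length;
    int        *elements;   /* flattened cells, for this sketch only */
} List;

/* Return a sorted copy of 'list'; the copy keeps the original node tag. */
static List *
list_qsort_sketch(const List *list, int (*cmp) (const void *, const void *))
{
    List   *result = malloc(sizeof(List));

    result->type = list->type;      /* the fix: don't force T_List here */
    result->length = list->length;
    result->elements = malloc(sizeof(int) * list->length);
    memcpy(result->elements, list->elements, sizeof(int) * list->length);
    qsort(result->elements, result->length, sizeof(int), cmp);
    return result;
}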

- I removed get_append_num_workers and integrated the logic into the
callers.  This function was coded quite strangely: it assigned the
return value of fls() to a double and then eventually rounded the
result back to an integer.  But fls() returns an integer, so this
doesn't make much sense.  On a related note, I made it use fls(# of
subpaths) instead of fls(# of subpaths)+1.  Adding 1 doesn't make
sense to me here because it leads to a decision to use 2 workers for a
single, non-partial subpath.  I suspect both of these mistakes stem
from thinking that fls() returns the base-2 logarithm, but in fact it
doesn't, quite: log2(1) = 0.0 but fls(1) = 1.
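
(To illustrate the distinction, a small standalone program; my_fls() is a
portable stand-in here, while PostgreSQL normally gets fls() from the
platform or from its own src/port fallback.)

#include <stdio.h>

/* fls(x): position of the most-significant set bit, counting from 1. */
static int
my_fls(int mask)
{
    int     bit = 0;

    while (mask != 0)
    {
        mask >>= 1;
        bit++;
    }
    return bit;
}

int
main(void)
{
    /* fls(1) = 1, fls(2) = fls(3) = 2, fls(4) = 3, ... i.e. floor(log2(x)) + 1 */
    for (int n = 1; n <= 8; n++)
        printf("subpaths = %d -> fls = %d workers\n", n, my_fls(n));

    /*
     * Using fls(n) + 1 instead would ask for two workers even when there
     * is just a single, non-partial subpath.
     */
    return 0;
}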

- In the process of making the changes described in the previous
point, I added a couple of assertions, one of which promptly failed.
It turns out the reason is that your patch didn't update
accumulate_append_subpaths(), which can result in flattening
non-partial paths from a Parallel Append into a parent Append's list
of partial paths, which is bad.  The easiest way to fix that would be
to just teach accumulate_append_subpaths() not to flatten a Parallel
Append into a parent Append or MergeAppend node, but it seemed to me
that there was a fair amount of duplication between
accumulate_partialappend_subpath() and accumulate_append_subpaths(), so
what I did instead is to fold all of the necessary logic directly
into accumulate_append_subpaths().  This approach also avoids some
assumptions that your code made, such as the assumption that we will
never have a partial MergeAppend path.

- I changed create_append_path() so that it uses the parallel_aware
argument as the only determinant of whether the output path is flagged
as parallel-aware. Your version also considered whether
parallel_workers > 0, but I think that's not a good idea; the caller
should pass the correct value for parallel_aware rather than relying
on this function to fix it.  

Re: [HACKERS] Parallel Append implementation

2017-10-28 Thread Robert Haas
On Thu, Oct 19, 2017 at 9:05 AM, Amit Khandekar  wrote:
>> + *ExecAppendEstimate
>> + *
>> + *estimates the space required to serialize Append node.
>>
>> Ugh, this is wrong, but I notice it follows various other
>> equally-wrong comments for other parallel-aware node types. I guess
>> I'll go fix that.  We are not serializing the Append node.
>
> I didn't clearly get this. Do you think it should be "space required to
> copy the Append node into the shared memory"?

No, because the Append node is *NOT* getting copied into shared
memory.  I have pushed a comment update to the existing functions; you
can use the same comment for this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-10-19 Thread Amit Khandekar
On 13 October 2017 at 00:29, Robert Haas  wrote:
> On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar  
> wrote:
>> [ new patch ]
>
> + parallel_append
> + Waiting to choose the next subplan during Parallel Append 
> plan
> + execution.
> +
> +
>
> Probably needs to update the morerows value of some earlier entry.

From what I observed in the other places, the morerows value is one
less than the number of following entries. I have changed it to 10
since it has 11 entries.

>
> +   enable_parallelappend configuration
> parameter
>
> How about enable_parallel_append?

Done.

>
> + * pa_finished : workers currently executing the subplan. A worker which
>
> The way the colon is used here is not a standard comment style for PostgreSQL.

Changed it to "pa_finished:".

>
> + * Go on to the "next" subplan. If no more subplans, return the empty
> + * slot set up for us by ExecInitAppend.
> + * Note: Parallel-aware Append follows different logic for choosing 
> the
> + * next subplan.
>
> Formatting looks wrong, and moreover I don't think this is the right
> way of handling this comment anyway.  Move the existing comment inside
> the if (!node->padesc) block and leave it unchanged; the else block
> explains the differences for parallel append.

I think the first couple of lines do apply to both parallel-append and
sequential append plans. I have moved the remaining couple of lines
inside the else block.

>
> + *ExecAppendEstimate
> + *
> + *estimates the space required to serialize Append node.
>
> Ugh, this is wrong, but I notice it follows various other
> equally-wrong comments for other parallel-aware node types. I guess
> I'll go fix that.  We are not serializing the Append node.

I didn't clearly get this. Do you think it should be "space required to
copy the Append node into the shared memory"?

>
> I do not think that it's a good idea to call
> exec_append_parallel_next() from ExecAppendInitializeDSM,
> ExecAppendReInitializeDSM, and ExecAppendInitializeWorker.  We want to
> postpone selecting which plan to run until we're actually ready to run
> that plan.  Otherwise, for example, the leader might seize a
> non-partial plan (if only such plans are included in the Parallel
> Append) when it isn't really necessary for it to do so.  If the
> workers would've reached the plans and started returning tuples to the
> leader before it grabbed a plan, oh well, too bad.  The leader's still
> claimed that plan and must now run it.
>
> I concede that's not a high-probability scenario, but I still maintain
> that it is better for processes not to claim a subplan until the last
> possible moment.  I think we need to initialize as_whichplan as
> PA_INVALID plan and then fix it when ExecProcNode() is called for the
> first time.

Done. Set as_whichplan to PA_INVALID_PLAN in
ExecAppendInitializeDSM(), ExecAppendReInitializeDSM() and
ExecAppendInitializeWorker(). Then, when ExecAppend() is called for the
first time, we notice that as_whichplan is PA_INVALID_PLAN, which means
we need to choose a plan.
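
(A minimal, self-contained sketch of that flow, with simplified types
and a plain counter standing in for the real shared-memory state, just
to show where the claim now happens:)

#include <stdbool.h>
#include <stdio.h>

#define PA_INVALID_PLAN (-1)

typedef struct
{
    int     as_whichplan;   /* PA_INVALID_PLAN until the first ExecAppend() */
    int     as_nplans;
    int     next_unclaimed; /* stand-in for the shared Parallel Append state */
} AppendStateSketch;

/* The DSM/worker initialization functions only reset the marker. */
static void
init_sketch(AppendStateSketch *st, int nplans)
{
    st->as_whichplan = PA_INVALID_PLAN;
    st->as_nplans = nplans;
    st->next_unclaimed = 0;
}

/* The first call notices the marker and only then claims a subplan. */
static bool
exec_append_sketch(AppendStateSketch *st)
{
    if (st->as_whichplan == PA_INVALID_PLAN)
    {
        if (st->next_unclaimed >= st->as_nplans)
            return false;                   /* nothing left to run */
        st->as_whichplan = st->next_unclaimed++;
        printf("claimed subplan %d\n", st->as_whichplan);
    }
    /* ... execute subplan st->as_whichplan here ... */
    return true;
}

int
main(void)
{
    AppendStateSketch st;

    init_sketch(&st, 3);
    exec_append_sketch(&st);    /* claims subplan 0 on the first call */
    return 0;
}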

>
> +if (!IsParallelWorker())
>
> This is not a great test, because it would do the wrong thing if we
> ever allowed an SQL function called from a parallel worker to run a
> parallel query of its own.  Currently that's not allowed but we might
> want to allow it someday.  What we really want to test is whether
> we're the leader for *this* query.  Maybe use a flag in the
> AppendState for that, and set it correctly in
> ExecAppendInitializeWorker.

Done. Set a new AppendState->is_parallel_worker field to true in
ExecAppendInitializeWorker().

>
> I think maybe the loop in exec_append_parallel_next should look more like 
> this:
>
> /* Pick the next plan. */
> state->as_whichplan = padesc->pa_next_plan;
> if (state->as_whichplan != PA_INVALID_PLAN)
> {
>     int     nextplan = state->as_whichplan;
>
>     /*
>      * Mark non-partial plans done immediately so that they can't be
>      * picked again.
>      */
>     if (nextplan < first_partial_plan)
>         padesc->pa_finished[nextplan] = true;
>
>     /* Figure out what plan the next worker should pick. */
>     do
>     {
>         /*
>          * If we've run through all the plans, loop back through
>          * partial plans only.
>          */
>         if (++nextplan >= state->as_nplans)
>             nextplan = first_partial_plan;
>
>         /* No plans remaining or tried them all?  Then give up. */
>         if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
>         {
>             nextplan = PA_INVALID_PLAN;
>             break;
>         }
>     } while (padesc->pa_finished[nextplan]);
>
>     /* Store the calculated next plan back into shared memory. */
>     padesc->pa_next_plan = nextplan;
> }
>
> This might not be exactly right and the comments may need work, but
> here are a couple of points:
>
> - As you have it coded, the loop exit condition is whichplan !=

Re: [HACKERS] Parallel Append implementation

2017-10-12 Thread Robert Haas
On Wed, Oct 11, 2017 at 8:51 AM, Amit Khandekar  wrote:
> [ new patch ]

+ parallel_append
+ Waiting to choose the next subplan during Parallel Append plan
+ execution.
+
+

Probably needs to update the morerows value of some earlier entry.

+   enable_parallelappend configuration
parameter

How about enable_parallel_append?

+ * pa_finished : workers currently executing the subplan. A worker which

The way the colon is used here is not a standard comment style for PostgreSQL.

+ * Go on to the "next" subplan. If no more subplans, return the empty
+ * slot set up for us by ExecInitAppend.
+ * Note: Parallel-aware Append follows different logic for choosing the
+ * next subplan.

Formatting looks wrong, and moreover I don't think this is the right
way of handling this comment anyway.  Move the existing comment inside
the if (!node->padesc) block and leave it unchanged; the else block
explains the differences for parallel append.

+ *ExecAppendEstimate
+ *
+ *estimates the space required to serialize Append node.

Ugh, this is wrong, but I notice it follows various other
equally-wrong comments for other parallel-aware node types. I guess
I'll go fix that.  We are not serializing the Append node.

I do not think that it's a good idea to call
exec_append_parallel_next() from ExecAppendInitializeDSM,
ExecAppendReInitializeDSM, and ExecAppendInitializeWorker.  We want to
postpone selecting which plan to run until we're actually ready to run
that plan.  Otherwise, for example, the leader might seize a
non-partial plan (if only such plans are included in the Parallel
Append) when it isn't really necessary for it to do so.  If the
workers would've reached the plans and started returning tuples to the
leader before it grabbed a plan, oh well, too bad.  The leader's still
claimed that plan and must now run it.

I concede that's not a high-probability scenario, but I still maintain
that it is better for processes not to claim a subplan until the last
possible moment.  I think we need to initialize as_whichplan as
PA_INVALID_PLAN and then fix it when ExecProcNode() is called for the
first time.

+if (!IsParallelWorker())

This is not a great test, because it would do the wrong thing if we
ever allowed an SQL function called from a parallel worker to run a
parallel query of its own.  Currently that's not allowed but we might
want to allow it someday.  What we really want to test is whether
we're the leader for *this* query.  Maybe use a flag in the
AppendState for that, and set it correctly in
ExecAppendInitializeWorker.

I think maybe the loop in exec_append_parallel_next should look more like this:

/* Pick the next plan. */
state->as_whichplan = padesc->pa_next_plan;
if (state->as_whichplan != PA_INVALID_PLAN)
{
    int     nextplan = state->as_whichplan;

    /*
     * Mark non-partial plans done immediately so that they can't be
     * picked again.
     */
    if (nextplan < first_partial_plan)
        padesc->pa_finished[nextplan] = true;

    /* Figure out what plan the next worker should pick. */
    do
    {
        /*
         * If we've run through all the plans, loop back through
         * partial plans only.
         */
        if (++nextplan >= state->as_nplans)
            nextplan = first_partial_plan;

        /* No plans remaining or tried them all?  Then give up. */
        if (nextplan == state->as_whichplan || nextplan >= state->as_nplans)
        {
            nextplan = PA_INVALID_PLAN;
            break;
        }
    } while (padesc->pa_finished[nextplan]);

    /* Store the calculated next plan back into shared memory. */
    padesc->pa_next_plan = nextplan;
}

This might not be exactly right and the comments may need work, but
here are a couple of points:

- As you have it coded, the loop exit condition is whichplan !=
PA_INVALID_PLAN, but that's probably an uncommon case and you have two
other ways out of the loop.  It's easier to understand the code if the
loop condition corresponds to the most common way of exiting the loop,
and any break is for some corner case.

- Don't need a separate call to exec_append_get_next_plan; it's all
handled here (and, I think, pretty compactly).

- No need for pa_first_plan any more.  Looping back to
first_partial_plan is a fine substitute, because by the time anybody
loops around, pa_first_plan would equal first_partial_plan anyway
(unless there's a bug).

- In your code, the value in shared memory is the point at which to
start the search for the next plan.  Here, I made it the value that
the next worker should adopt without question.  Another option would
be to make it the value that the last worker adopted.  I think that
either that option or what I did above are slightly better than what
you have, because as you have it, you've got to use the
increment-with-looping logic in two different places whereas either of
those options needs it in only one place, which makes this a little

Re: [HACKERS] Parallel Append implementation

2017-10-11 Thread Amit Khandekar
On 9 October 2017 at 16:03, Amit Kapila  wrote:
> On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar  
> wrote:
>> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>>
>>> Okay, but why not cheapest partial path?
>>
>> I gave some thought on this point. Overall I feel it does not matter
>> which partial path it should pick up. Right now the partial paths are
>> not ordered. But for non-partial paths sake, we are just choosing the
>> very last path. So in case of mixed paths, leader will get a partial
>> path, but that partial path would not be the cheapest path. But if we
>> also order the partial paths, the same logic would then pick up
>> cheapest partial path. The question is, should we also order the
>> partial paths for the leader ?
>>
>> The only scenario I see where leader choosing cheapest partial path
>> *might* show some benefit, is if there are some partial paths that
>> need to do some startup work using only one worker. I think currently,
>> parallel hash join is one case where it builds the hash table, but I
>> guess here also, we support parallel hash build, but not sure about
>> the status.
>>
>
> You also need to consider how merge join currently works in parallel
> (each worker needs to perform the whole of the work of the right side).

Yes, here if the leader happens to take the right side, it may slow
down the overall merge join. But this seems to be a different case
than the case of high startup costs.

>  I think there could be more scenarios where the startup cost is high
> and a parallel worker needs to do that work independently.

True.

>
>  For such plan, if leader starts it, it would be slow, and
>> no other worker would be able to help it, so its actual startup cost
>> would be drastically increased. (Another path is parallel bitmap heap
>> scan where the leader has to do something and the other workers wait.
>> But here, I think it's not much work for the leader to do). So
>> overall, to handle such cases, it's better for leader to choose a
>> cheapest path, or may be, a path with cheapest startup cost. We can
>> also consider sorting partial paths with decreasing startup cost.
>>
>
> Yeah, that sounds reasonable.

The attached patch sorts the partial paths by descending startup cost.
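
(For illustration, a self-contained sketch of that ordering using plain
qsort(); the struct and names are made up for this example, whereas the
patch itself presumably sorts the Path list with list_qsort().)

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    const char *name;
    double      startup_cost;
    double      total_cost;
} PathCostSketch;

/* Descending startup cost: expensive-to-start subplans sort first. */
static int
startup_cost_desc(const void *a, const void *b)
{
    const PathCostSketch *pa = a;
    const PathCostSketch *pb = b;

    if (pa->startup_cost > pb->startup_cost)
        return -1;
    if (pa->startup_cost < pb->startup_cost)
        return 1;
    return 0;
}

int
main(void)
{
    PathCostSketch partial[] = {
        {"child1 seqscan", 0.0, 120.0},
        {"child2 hashjoin", 80.0, 200.0},
        {"child3 sort", 40.0, 90.0},
    };

    qsort(partial, 3, sizeof(PathCostSketch), startup_cost_desc);

    /*
     * Under the scheme discussed above, the subplan at the tail of the
     * list -- the one the leader ends up picking -- is then the one that
     * is cheapest to start.
     */
    for (int i = 0; i < 3; i++)
        printf("%s (startup %.0f)\n", partial[i].name, partial[i].startup_cost);
    return 0;
}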


On 6 October 2017 at 08:49, Amit Kapila  wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar  wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.

Did this. Here is the new condition I used, along with the comments
explaining it:

+* If parallel append has not been added above, or the added one has a mix
+* of partial and non-partial subpaths, then consider another Parallel
+* Append path which will have *all* partial subpaths. We can add such a
+* path only if all childrels have partial paths in the first place. This
+* new path will be parallel-aware unless enable_parallelappend is off.
 */
-   if (partial_subpaths_valid && !pa_all_partial_subpaths)
+   if (partial_subpaths_valid &&
+   (!pa_subpaths_valid || pa_nonpartial_subpaths != NIL))

Also added some test scenarios.

On 6 October 2017 at 12:03, Amit Khandekar  wrote:
> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>
>> One minor point:
>>
>> + if (!node->as_padesc)
>> + {
>> + /*
>> + */
>> + if (!exec_append_seq_next(node))
>> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>> + }
>>
>> It seems either you want to add a comment in above part of patch or
>> you just left /**/ mistakenly.
>
> Oops. Yeah, the comment wrapper remained there when I moved its
> content "This is Parallel-aware append. Follow it's own logic ..." out
> of the if block.

Removed the comment wrapper.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v17.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-10-09 Thread Amit Kapila
On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar  wrote:
> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>
>> Okay, but why not cheapest partial path?
>
> I gave some thought on this point. Overall I feel it does not matter
> which partial path it should pick up. Right now the partial paths are
> not ordered. But for non-partial paths sake, we are just choosing the
> very last path. So in case of mixed paths, leader will get a partial
> path, but that partial path would not be the cheapest path. But if we
> also order the partial paths, the same logic would then pick up
> cheapest partial path. The question is, should we also order the
> partial paths for the leader ?
>
> The only scenario I see where leader choosing cheapest partial path
> *might* show some benefit, is if there are some partial paths that
> need to do some startup work using only one worker. I think currently,
> parallel hash join is one case where it builds the hash table, but I
> guess here also, we support parallel hash build, but not sure about
> the status.
>

You also need to consider how merge join currently works in parallel
(each worker needs to perform the whole of the work of the right side).  I
think there could be more scenarios where the startup cost is high
and a parallel worker needs to do that work independently.

 For such plan, if leader starts it, it would be slow, and
> no other worker would be able to help it, so its actual startup cost
> would be drastically increased. (Another path is parallel bitmap heap
> scan where the leader has to do something and the other workers wait.
> But here, I think it's not much work for the leader to do). So
> overall, to handle such cases, it's better for leader to choose a
> cheapest path, or may be, a path with cheapest startup cost. We can
> also consider sorting partial paths with decreasing startup cost.
>

Yeah, that sounds reasonable.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-10-06 Thread Amit Khandekar
On 6 October 2017 at 08:49, Amit Kapila  wrote:
> On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar  wrote:
>>
>> Ok. How about removing pa_all_partial_subpaths altogether , and
>> instead of the below condition :
>>
>> /*
>> * If all the child rels have partial paths, and if the above Parallel
>> * Append path has a mix of partial and non-partial subpaths, then consider
>> * another Parallel Append path which will have *all* partial subpaths.
>> * If enable_parallelappend is off, make this one non-parallel-aware.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> ..
>>
>> Use this condition :
>> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
>> ..
>>
>
> Sounds good to me.
>
> One minor point:
>
> + if (!node->as_padesc)
> + {
> + /*
> + */
> + if (!exec_append_seq_next(node))
> + return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> + }
>
> It seems either you want to add a comment in above part of patch or
> you just left /**/ mistakenly.

Oops. Yeah, the comment wrapper remained there when I moved its
content "This is Parallel-aware append. Follow it's own logic ..." out
of the if block. Since this is too small a change for an updated
patch, I will do this along with any other changes that would be
required as the review progresses.

>
>> 
>>
>>
>> Regarding a mix of partial and non-partial paths, I feel it always
>> makes sense for the leader to choose the partial path.
>>
>
> Okay, but why not cheapest partial path?

I gave some thought to this point. Overall I feel it does not matter
which partial path the leader picks up. Right now the partial paths are
not ordered. But for the non-partial paths' sake, we are just choosing
the very last path. So in the case of mixed paths, the leader will get a
partial path, but that partial path would not be the cheapest one. But
if we also order the partial paths, the same logic would then pick up
the cheapest partial path. The question is, should we also order the
partial paths for the leader?

The only scenario I see where the leader choosing the cheapest partial
path *might* show some benefit is if there are some partial paths that
need to do some startup work using only one worker. I think parallel
hash join is currently one such case, where it builds the hash table; I
guess we also support parallel hash build, but I am not sure about its
status. For such a plan, if the leader starts it, it would be slow, and
no other worker would be able to help it, so its actual startup cost
would be drastically increased. (Another example is parallel bitmap heap
scan, where the leader has to do something and the other workers wait;
but there, I think it is not much work for the leader to do.) So
overall, to handle such cases, it is better for the leader to choose the
cheapest path, or maybe a path with the cheapest startup cost. We can
also consider sorting the partial paths by decreasing startup cost.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Kapila
On Thu, Oct 5, 2017 at 4:11 PM, Amit Khandekar  wrote:
>
> Ok. How about removing pa_all_partial_subpaths altogether , and
> instead of the below condition :
>
> /*
> * If all the child rels have partial paths, and if the above Parallel
> * Append path has a mix of partial and non-partial subpaths, then consider
> * another Parallel Append path which will have *all* partial subpaths.
> * If enable_parallelappend is off, make this one non-parallel-aware.
> */
> if (partial_subpaths_valid && !pa_all_partial_subpaths)
> ..
>
> Use this condition :
> if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
> ..
>

Sounds good to me.

One minor point:

+ if (!node->as_padesc)
+ {
+ /*
+ */
+ if (!exec_append_seq_next(node))
+ return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+ }

It seems either you want to add a comment in above part of patch or
you just left /**/ mistakenly.

> 
>
>
> Regarding a mix of partial and non-partial paths, I feel it always
> makes sense for the leader to choose the partial path.
>

Okay, but why not cheapest partial path?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Kapila
On Thu, Oct 5, 2017 at 6:14 PM, Robert Haas  wrote:
> On Thu, Oct 5, 2017 at 6:29 AM, Amit Kapila  wrote:
>> Okay, but can't we try to pick the cheapest partial path for master
>> backend and maybe master backend can try to work on a partial path
>> which is already picked up by some worker.
>
> Well, the master backend is typically going to be the first process to
> arrive at the Parallel Append because it's already running, whereas
> the workers have to start.
>

Sure, the leader can pick the cheapest partial path to start with.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Robert Haas
On Thu, Oct 5, 2017 at 6:29 AM, Amit Kapila  wrote:
> Okay, but can't we try to pick the cheapest partial path for master
> backend and maybe master backend can try to work on a partial path
> which is already picked up by some worker.

Well, the master backend is typically going to be the first process to
arrive at the Parallel Append because it's already running, whereas
the workers have to start.  So in the case that really matters, no
paths will have been picked yet.  Later on, we could have the master
try to choose a path on which some other worker is already working,
but I really doubt that's optimal.  Either the workers are generating
a lot of tuples (in which case the leader probably won't do much work
on any path because it will be busy reading tuples) or they are
generating only a few tuples (in which case the leader is probably
better off working on a path not yet chosen, to reduce contention).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Khandekar
On 30 September 2017 at 19:21, Amit Kapila  wrote:
> On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar  
> wrote:
>> On 16 September 2017 at 10:42, Amit Kapila  wrote:
>>>
>>> At a broader level, the idea is good, but I think it won't turn out
>>> exactly like that considering your below paragraph which indicates
>>> that it is okay if leader picks a partial path that is costly among
>>> other partial paths as a leader won't be locked with that.
>>> Considering this is a good design for parallel append, the question is
>>> do we really need worker and leader to follow separate strategy for
>>> choosing next path.  I think the patch will be simpler if we can come
>>> up with a way for the worker and leader to use the same strategy to
>>> pick next path to process.  How about we arrange the list of paths
>>> such that first, all partial paths will be there and then non-partial
>>> paths and probably both in decreasing order of cost.  Now, both leader
>>> and worker can start from the beginning of the list. In most cases,
>>> the leader will start at the first partial path and will only ever
>>> need to scan non-partial path if there is no other partial path left.
>>> This is not bulletproof as it is possible that some worker starts
>>> before leader in which case leader might scan non-partial path before
>>> all partial paths are finished, but I think we can avoid that as well
>>> if we are too worried about such cases.
>>
>> If there are no partial subpaths, then again the leader is likely to
>> take up the expensive subpath. And this scenario would not be
>> uncommon.
>>
>
> While thinking about how common the case of no partial subpaths would
> be, it occurred to me that as of now we always create a partial path
> for the inheritance child if it is parallel-safe and the user has not
> explicitly set the value of parallel_workers to zero (refer
> compute_parallel_worker).  So, unless you are planning to change that,
> I think it will be quite uncommon to have no partial subpaths.

There are still cases where the non-partial paths are cheaper than
the partial paths. Also, there can be UNION ALL queries which could
have non-partial subpaths. I guess this has already been discussed in
the other replies.

>
> Few nitpicks in your latest patch:
> 1.
> @@ -298,6 +366,292 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
>
> Looks like a spurious line.
>
> 2.
> @@ -1285,7 +1291,11 @@ add_paths_to_append_rel(PlannerInfo *root,
> RelOptInfo *rel,
> ..
> + if (chosen_path && chosen_path != cheapest_partial_path)
> + pa_all_partial_subpaths = false;
>
> It will keep on setting pa_all_partial_subpaths to false for
> non-partial paths, which doesn't seem to be the purpose of this variable.
> I think you want it to be set even when there is one non-partial path,
> so isn't it better to write as below or something similar:
> if (pa_nonpartial_subpaths && pa_all_partial_subpaths)
> pa_all_partial_subpaths = false;

Ok. How about removing pa_all_partial_subpaths altogether , and
instead of the below condition :

/*
* If all the child rels have partial paths, and if the above Parallel
* Append path has a mix of partial and non-partial subpaths, then consider
* another Parallel Append path which will have *all* partial subpaths.
* If enable_parallelappend is off, make this one non-parallel-aware.
*/
if (partial_subpaths_valid && !pa_all_partial_subpaths)
..

Use this condition :
if (partial_subpaths_valid && pa_nonpartial_subpaths != NIL)
..




Regarding a mix of partial and non-partial paths, I feel it always
makes sense for the leader to choose the partial path. If it chooses a
non-partial path, no other worker would be able to help finish that
path. Among the partial paths, whether it chooses the cheapest one or
expensive one does not matter, I think. We have the partial paths
unordered. So whether it starts from the last partial path or the
first partial path should not matter.

Regarding the scenario where all paths are non-partial, here is an example:
Suppose we have 4 child paths with costs 10, 5, 5, 3, and 2 workers plus
one leader. And suppose the leader additionally takes 1/4th of these
costs to process the returned tuples.

If the leader takes the least expensive one (3):
The 2 workers will finish 10, 5, 5 in 10 units, and the leader
simultaneously runs the plan with cost 3, so it takes
3 + (1/4)(10 + 5 + 5 + 3) = about 9 units.
So the total time taken by Append is 10.

Whereas if the leader takes the most expensive one (10):
The 2 workers will finish the 2nd, 3rd and 4th plans (5, 5, 3) in 8 units,
and the leader simultaneously finishes the 1st plan (10) in 10 units, plus
the tuple processing cost, i.e. 10 + (1/4)(10 + 5 + 5 + 3) = about 16 units.
So the total time taken by Append is about 16.
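
(The same arithmetic as a small standalone program, using the made-up
cost units above:)

#include <stdio.h>

int
main(void)
{
    double  total = 10 + 5 + 5 + 3;
    double  leader_overhead = 0.25 * total;     /* tuple-processing share */
    double  workers, leader;

    /* Leader takes the cheapest plan (3); the 2 workers split 10, 5, 5. */
    workers = 10;                               /* one worker: 10, the other: 5 + 5 */
    leader = 3 + leader_overhead;
    printf("leader takes cheapest:  %.2f units\n", workers > leader ? workers : leader);

    /* Leader takes the costliest plan (10); the 2 workers split 5, 5, 3. */
    workers = 5 + 3;                            /* one worker: 5 + 3, the other: 5 */
    leader = 10 + leader_overhead;
    printf("leader takes costliest: %.2f units\n", workers > leader ? workers : leader);
    return 0;
}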


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Parallel Append implementation

2017-10-05 Thread Amit Kapila
On Mon, Oct 2, 2017 at 6:21 PM, Robert Haas  wrote:
> On Sun, Oct 1, 2017 at 9:55 AM, Amit Kapila  wrote:
>> Isn't it for both?  I mean it is about comparing the non-partial paths
>> for child relations of the same relation and also when there are
>> different relations involved as in Union All kind of query.  In any
>> case, the point I was trying to say is that generally non-partial
>> relations will have relatively smaller scan size, so probably should
>> take lesser time to complete.
>
> I don't think that's a valid inference.  It's true that a relation
> could fail to have a partial path because it's small, but that's only
> one reason among very many.  The optimal index type could be one that
> doesn't support parallel index scans, for example.
>

Valid point.

>> Sure, I think it is quite good if we can achieve that but it seems to
>> me that we will not be able to achieve that in all scenarios with the
>> patch and rather I think in some situations it can result in leader
>> ended up picking the costly plan (in case when there are all partial
>> plans or mix of partial and non-partial plans).  Now, we are ignoring
>> such cases based on the assumption that other workers might help to
>> complete master backend.  I think it is quite possible that the worker
>> backends picks up some plans which emit rows greater than tuple queue
>> size and they instead wait on the master backend which itself is busy
>> in completing its plan.  So master backend will end up taking too much
>> time.  If we want to go with a strategy of master (leader) backend and
>> workers taking a different strategy to pick paths to work on, then it
>> might be better if we should try to ensure that master backend always
>> starts from the place which has cheapest plans irrespective of whether
>> the path is partial or non-partial.
>
> I agree that it's complicated to get this right in all cases; I'm
> realizing that's probably an unattainable ideal.
>
> However, I don't think ignoring the distinction between partial and
> non-partial plans is an improvement, because the argument that other
> workers may be able to help the leader is a correct one.  You are
> correct in saying that the workers might fill up their tuple queues
> while the leader is working, but once the leader returns one tuple it
> will switch to reading from the queues.  Every other tuple can be
> returned by some worker.  On the other hand, if the leader picks a
> non-partial plan, it must run that plan through to completion.
>
> Imagine (a) a non-partial path with a cost of 1000 returning 100
> tuples and (b) a partial path with a cost of 10,000 returning 1,000
> tuples.  No matter which path the leader picks, it has about 10 units
> of work to do to return 1 tuple.  However, if it picks the first path,
> it is committed to doing 990 more units of work later, regardless of
> whether the workers have filled the tuple queues or not.  If it picks
> the second path, it again has about 10 units of work to do to return 1
> tuple, but it hasn't committed to doing all the rest of the work of
> that path.  So that's better.
>
> Now it's hard to get all of the cases right.  If the partial path in
> the previous example had a cost of 1 crore then even returning 1 tuple
> is more costly than running the whole non-partial path.  When you
> introduce partition-wise join and parallel hash, there are even more
> problems.  Now the partial path might have a large startup cost, so
> returning the first tuple may be very expensive, and that work may
> help other workers (if this is a parallel hash) or it may not (if this
> is a non-parallel hash).
>

Yeah, these were the type of cases I am also worried.

>  I don't know that we can get all of these
> cases right, or that we should try.  However, I still think that
> picking partial paths preferentially is sensible -- we don't know all
> the details, but we do know that they're typically going to at least
> try to divide up the work in a fine-grained fashion, while a
> non-partial path, once started, the leader must run it to completion.
>

Okay, but can't we try to pick the cheapest partial path for master
backend and maybe master backend can try to work on a partial path
which is already picked up by some worker.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-10-02 Thread Robert Haas
On Sun, Oct 1, 2017 at 9:55 AM, Amit Kapila  wrote:
> Isn't it for both?  I mean it is about comparing the non-partial paths
> for child relations of the same relation and also when there are
> different relations involved as in Union All kind of query.  In any
> case, the point I was trying to say is that generally non-partial
> relations will have relatively smaller scan size, so probably should
> take lesser time to complete.

I don't think that's a valid inference.  It's true that a relation
could fail to have a partial path because it's small, but that's only
one reason among very many.  The optimal index type could be one that
doesn't support parallel index scans, for example.

> Sure, I think it is quite good if we can achieve that but it seems to
> me that we will not be able to achieve that in all scenarios with the
> patch and rather I think in some situations it can result in leader
> ended up picking the costly plan (in case when there are all partial
> plans or mix of partial and non-partial plans).  Now, we are ignoring
> such cases based on the assumption that other workers might help to
> complete master backend.  I think it is quite possible that the worker
> backends picks up some plans which emit rows greater than tuple queue
> size and they instead wait on the master backend which itself is busy
> in completing its plan.  So master backend will end up taking too much
> time.  If we want to go with a strategy of master (leader) backend and
> workers taking a different strategy to pick paths to work on, then it
> might be better if we should try to ensure that master backend always
> starts from the place which has cheapest plans irrespective of whether
> the path is partial or non-partial.

I agree that it's complicated to get this right in all cases; I'm
realizing that's probably an unattainable ideal.

However, I don't think ignoring the distinction between partial and
non-partial plans is an improvement, because the argument that other
workers may be able to help the leader is a correct one.  You are
correct in saying that the workers might fill up their tuple queues
while the leader is working, but once the leader returns one tuple it
will switch to reading from the queues.  Every other tuple can be
returned by some worker.  On the other hand, if the leader picks a
non-partial plan, it must run that plan through to completion.

Imagine (a) a non-partial path with a cost of 1000 returning 100
tuples and (b) a partial path with a cost of 10,000 returning 1,000
tuples.  No matter which path the leader picks, it has about 10 units
of work to do to return 1 tuple.  However, if it picks the first path,
it is committed to doing 990 more units of work later, regardless of
whether the workers have filled the tuple queues or not.  If it picks
the second path, it again has about 10 units of work to do to return 1
tuple, but it hasn't committed to doing all the rest of the work of
that path.  So that's better.
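
(Working those numbers through in a small standalone program, with the
assumed cost units from the example above:)

#include <stdio.h>

int
main(void)
{
    double  nonpartial_cost = 1000.0,  nonpartial_rows = 100.0;
    double  partial_cost    = 10000.0, partial_rows    = 1000.0;

    /* Work needed before the leader can hand back its first tuple. */
    printf("per-tuple work, non-partial path: %.1f\n",
           nonpartial_cost / nonpartial_rows);
    printf("per-tuple work, partial path:     %.1f\n",
           partial_cost / partial_rows);

    /* But the non-partial choice commits the leader to the rest of the plan. */
    printf("work still owed after 1 tuple, non-partial: %.1f\n",
           nonpartial_cost - nonpartial_cost / nonpartial_rows);
    return 0;
}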

Now it's hard to get all of the cases right.  If the partial path in
the previous example had a cost of 1 crore (ten million), then even returning 1 tuple
is more costly than running the whole non-partial path.  When you
introduce partition-wise join and parallel hash, there are even more
problems.  Now the partial path might have a large startup cost, so
returning the first tuple may be very expensive, and that work may
help other workers (if this is a parallel hash) or it may not (if this
is a non-parallel hash).  I don't know that we can get all of these
cases right, or that we should try.  However, I still think that
picking partial paths preferentially is sensible -- we don't know all
the details, but we do know that they're typically going to at least
try to divide up the work in a fine-grained fashion, while a
non-partial path, once started, the leader must run it to completion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-10-01 Thread Amit Kapila
On Sat, Sep 30, 2017 at 9:25 PM, Robert Haas  wrote:
> On Sat, Sep 30, 2017 at 12:20 AM, Amit Kapila  wrote:
>> Okay, but the point is whether it will make any difference
>> practically.  Let us try to see with an example, consider there are
>> two children (just taking two for simplicity, we can extend it to
>> many) and first having 1000 pages to scan and second having 900 pages
>> to scan, then it might not make much difference which child plan
>> leader chooses.  Now, it might matter if the first child relation has
>> 1000 pages to scan and second has just 1 page to scan, but not sure
>> how much difference will it be in practice considering that is almost
>> the maximum possible theoretical difference between two non-partial
>> paths (if we have pages greater than 1024 pages
>> (min_parallel_table_scan_size) then it will have a partial path).
>
> But that's comparing two non-partial paths for the same relation --
> the point here is to compare across relations.

Isn't it for both?  I mean it is about comparing the non-partial paths
for child relations of the same relation and also when there are
different relations involved, as in a UNION ALL kind of query.  In any
case, the point I was trying to say is that generally non-partial
relations will have a relatively smaller scan size, so they should
probably take less time to complete.

>  Also keep in mind
> scenarios like this:
>
> SELECT ... FROM relation UNION ALL SELECT ... FROM generate_series(...);
>

I think for the FunctionScan case, non-partial paths can be quite costly.

>>> It's a lot fuzzier what is best when there are only partial plans.
>>>
>>
>> The point that bothers me a bit is whether it is a clear win if we
>> allow the leader to choose a different strategy to pick the paths or
>> is this just our theoretical assumption.  Basically, I think the patch
>> will become simpler if we pick some simple strategy to choose paths.
>
> Well, that's true, but is it really that much complexity?
>
> And I actually don't see how this is very debatable.  If the only
> paths that are reasonably cheap are GIN index scans, then the only
> strategy is to dole them out across the processes you've got.  Giving
> the leader the cheapest one seems to me to be clearly smarter than any
> other strategy.
>

Sure, I think it is quite good if we can achieve that, but it seems to
me that we will not be able to achieve that in all scenarios with the
patch, and rather I think in some situations it can result in the leader
ending up picking a costly plan (in the case where there are all partial
plans or a mix of partial and non-partial plans).  Now, we are ignoring
such cases based on the assumption that other workers might help the
master backend to complete.  I think it is quite possible that the worker
backends pick up some plans which emit more rows than the tuple queue
size and then wait on the master backend, which is itself busy
completing its plan.  So the master backend will end up taking too much
time.  If we want to go with a strategy where the master (leader) backend
and the workers take different strategies to pick paths to work on, then
it might be better to try to ensure that the master backend always
starts from the place which has the cheapest plans, irrespective of
whether the path is partial or non-partial.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-30 Thread Robert Haas
On Sat, Sep 30, 2017 at 12:20 AM, Amit Kapila  wrote:
> Okay, but the point is whether it will make any difference
> practically.  Let us try to see with an example, consider there are
> two children (just taking two for simplicity, we can extend it to
> many) and first having 1000 pages to scan and second having 900 pages
> to scan, then it might not make much difference which child plan
> leader chooses.  Now, it might matter if the first child relation has
> 1000 pages to scan and second has just 1 page to scan, but not sure
> how much difference will it be in practice considering that is almost
> the maximum possible theoretical difference between two non-partial
> paths (if we have pages greater than 1024 pages
> (min_parallel_table_scan_size) then it will have a partial path).

But that's comparing two non-partial paths for the same relation --
the point here is to compare across relations.  Also keep in mind
scenarios like this:

SELECT ... FROM relation UNION ALL SELECT ... FROM generate_series(...);

>> It's a lot fuzzier what is best when there are only partial plans.
>>
>
> The point that bothers me a bit is whether it is a clear win if we
> allow the leader to choose a different strategy to pick the paths or
> is this just our theoretical assumption.  Basically, I think the patch
> will become simpler if we pick some simple strategy to choose paths.

Well, that's true, but is it really that much complexity?

And I actually don't see how this is very debatable.  If the only
paths that are reasonably cheap are GIN index scans, then the only
strategy is to dole them out across the processes you've got.  Giving
the leader the cheapest one seems to me to be clearly smarter than any
other strategy.  Am I missing something?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-09-30 Thread Amit Kapila
On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar  wrote:
> On 16 September 2017 at 10:42, Amit Kapila  wrote:
>>
>> At a broader level, the idea is good, but I think it won't turn out
>> exactly like that considering your below paragraph which indicates
>> that it is okay if leader picks a partial path that is costly among
>> other partial paths as a leader won't be locked with that.
>> Considering this is a good design for parallel append, the question is
>> do we really need worker and leader to follow separate strategy for
>> choosing next path.  I think the patch will be simpler if we can come
>> up with a way for the worker and leader to use the same strategy to
>> pick next path to process.  How about we arrange the list of paths
>> such that first, all partial paths will be there and then non-partial
>> paths and probably both in decreasing order of cost.  Now, both leader
>> and worker can start from the beginning of the list. In most cases,
>> the leader will start at the first partial path and will only ever
>> need to scan non-partial path if there is no other partial path left.
>> This is not bulletproof as it is possible that some worker starts
>> before leader in which case leader might scan non-partial path before
>> all partial paths are finished, but I think we can avoid that as well
>> if we are too worried about such cases.
>
> If there are no partial subpaths, then again the leader is likely to
> take up the expensive subpath. And this scenario would not be
> uncommon.
>

While thinking about how common the case of no partial subpaths would
be, it occurred to me that as of now we always create a partial path
for the inheritance child if it is parallel-safe and the user has not
explicitly set the value of parallel_workers to zero (refer
compute_parallel_worker).  So, unless you are planning to change that,
I think it will be quite uncommon to have no partial subpaths.

Few nitpicks in your latest patch:
1.
@@ -298,6 +366,292 @@ ExecReScanAppend(AppendState *node)
  if (subnode->chgParam == NULL)
  ExecReScan(subnode);
  }
+

Looks like a spurious line.

2.
@@ -1285,7 +1291,11 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
..
+ if (chosen_path && chosen_path != cheapest_partial_path)
+ pa_all_partial_subpaths = false;

It will keep on setting pa_all_partial_subpaths to false for
non-partial paths, which doesn't seem to be the purpose of this variable.
I think you want it to be set even when there is one non-partial path,
so isn't it better to write as below or something similar:
if (pa_nonpartial_subpaths && pa_all_partial_subpaths)
pa_all_partial_subpaths = false;


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-29 Thread Amit Kapila
On Sat, Sep 30, 2017 at 4:02 AM, Robert Haas  wrote:
> On Fri, Sep 29, 2017 at 7:48 AM, Amit Kapila  wrote:
> I think in general the non-partial paths should be cheaper compared
> to partial paths, as that is the reason the planner chose not to make a
> partial plan in the first place. I think the idea the patch is using will
> help because the leader will choose to execute a partial path in most
> cases (when there is a mix of partial and non-partial paths), and in that
> case the leader is not bound to complete the execution of that path.
> However, if all the paths are non-partial, then I am not sure how much
> difference it would make to choose one path over the other.
>
> The case where all plans are non-partial is the case where it matters
> most!  If the leader is going to take a share of the work, we want it
> to take the smallest share possible.
>

Okay, but the point is whether it will make any difference in
practice.  Let us try to see with an example: consider there are
two children (just taking two for simplicity; we can extend it to
many), the first having 1000 pages to scan and the second having 900
pages to scan.  Then it might not make much difference which child plan
the leader chooses.  Now, it might matter if the first child relation has
1000 pages to scan and the second has just 1 page to scan, but I am not
sure how much difference that would make in practice, considering that is
almost the maximum possible theoretical difference between two non-partial
paths (if a relation has more than 1024 pages
(min_parallel_table_scan_size), then it will have a partial path).

> It's a lot fuzzier what is best when there are only partial plans.
>

The point that bothers me a bit is whether it is a clear win if we
allow the leader to choose a different strategy to pick the paths, or
whether this is just our theoretical assumption.  Basically, I think the
patch will become simpler if we pick some simple strategy to choose paths.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-29 Thread Robert Haas
On Fri, Sep 29, 2017 at 7:48 AM, Amit Kapila  wrote:
> I think in general the non-partial paths should be cheaper as compared
> to partial paths as that is the reason planner choose not to make a
> partial plan at first place. I think the idea patch is using will help
> because the leader will choose to execute partial path in most cases
> (when there is a mix of partial and non-partial paths) and for that
> case, the leader is not bound to complete the execution of that path.
> However, if all the paths are non-partial, then I am not sure much
> difference it would be to choose one path over other.

The case where all plans are non-partial is the case where it matters
most!  If the leader is going to take a share of the work, we want it
to take the smallest share possible.

It's a lot fuzzier what is best when there are only partial plans.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-09-29 Thread Robert Haas
On Sat, Sep 16, 2017 at 2:15 AM, Amit Kapila  wrote:
> Yes, we can do that and that is what I think is probably better.  So,
> the question remains that in which case non-parallel-aware partial
> append will be required?  Basically, it is not clear to me why after
> having parallel-aware partial append we need non-parallel-aware
> version?  Are you keeping it for the sake of backward-compatibility or
> something like for cases if someone has disabled parallel append with
> the help of new guc in this patch?

We can't use parallel append if there are pathkeys, because parallel
append will disturb the output ordering.  Right now, that never
happens anyway, but that will change when
https://commitfest.postgresql.org/14/1093/ is committed.

Parallel append is also not safe for a parameterized path, and
set_append_rel_pathlist() already creates those.  I guess it could be
safe if the parameter is passed down from above the Gather, if we
allowed that, but it's sure not safe in a case like this:

Gather
-> Nested Loop
  -> Parallel Seq Scan
  -> Append
 -> whatever

If it's not clear why that's a disaster, please ask for a more
detailed explanation...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-09-29 Thread Amit Kapila
On Wed, Sep 20, 2017 at 10:59 AM, Amit Khandekar  wrote:
> On 16 September 2017 at 10:42, Amit Kapila  wrote:
>> On Thu, Sep 14, 2017 at 9:41 PM, Robert Haas  wrote:
>>> On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila  
>>> wrote:
 I think the patch stores only non-partial paths in decreasing order,
 what if partial paths having more costs follows those paths?
>>>
>>> The general picture here is that we don't want the leader to get stuck
>>> inside some long-running operation because then it won't be available
>>> to read tuples from the workers.  On the other hand, we don't want to
>>> just have the leader do no work because that might be slow.  And in
>>> most cases, the leader will be the first participant to arrive at
>>> the Append node, because of the worker startup time.  So the idea is
>>> that the workers should pick expensive things first, and the leader
>>> should pick cheap things first.
>>>
>>
>> At a broader level, the idea is good, but I think it won't turn out
>> exactly like that considering your below paragraph which indicates
>> that it is okay if leader picks a partial path that is costly among
>> other partial paths as a leader won't be locked with that.
>> Considering this is a good design for parallel append, the question is
>> do we really need worker and leader to follow separate strategy for
>> choosing next path.  I think the patch will be simpler if we can come
>> up with a way for the worker and leader to use the same strategy to
>> pick next path to process.  How about we arrange the list of paths
>> such that first, all partial paths will be there and then non-partial
>> paths and probably both in decreasing order of cost.  Now, both leader
>> and worker can start from the beginning of the list. In most cases,
>> the leader will start at the first partial path and will only ever
>> need to scan non-partial path if there is no other partial path left.
>> This is not bulletproof as it is possible that some worker starts
>> before leader in which case leader might scan non-partial path before
>> all partial paths are finished, but I think we can avoid that as well
>> if we are too worried about such cases.
>
> If there are no partial subpaths, then again the leader is likely to
> take up the expensive subpath.
>

I think in general the non-partial paths should be cheaper than the
partial paths, as that is the reason the planner chose not to make a
partial plan in the first place. I think the idea the patch is using
will help because the leader will choose to execute a partial path in
most cases (when there is a mix of partial and non-partial paths), and
in that case the leader is not bound to complete the execution of that
path. However, if all the paths are non-partial, then I am not sure how
much difference it would make to choose one path over another.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-28 Thread Amit Khandekar
On 20 September 2017 at 11:32, Amit Khandekar  wrote:
> There is still the open point being
> discussed : whether to have non-parallel-aware partial Append path, or
> always have parallel-aware Append paths.

Attached is the revised patch v16. In previous versions, we used to
add a non-parallel-aware partial Append path having all partial
subpaths if the already-added Parallel Append path did not contain
all-partial subpaths. Now in the patch, when we add such an Append path
containing all partial subpaths, we make it parallel-aware (unless
enable_parallelappend is false). So in this case, there will be a
parallel-aware Append path containing one or more non-partial
subpaths, and there will be another parallel-aware Append path
containing all-partial subpaths.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v16.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-09-20 Thread Amit Khandekar
On 11 September 2017 at 18:55, Amit Kapila  wrote:
>>> 1.
>>> + else if (IsA(subpath, MergeAppendPath))
>>> + {
>>> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
>>> +
>>> + /*
>>> + * If at all MergeAppend is partial, all its child plans have to be
>>> + * partial : we don't currently support a mix of partial and
>>> + * non-partial MergeAppend subpaths.
>>> + */
>>> + if (is_partial)
>>> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>>>
>>> In which situation partial MergeAppendPath is generated?  Can you
>>> provide one example of such path?
>>
>> Actually currently we don't support partial paths for MergeAppendPath.
>> That code just has that if condition (is_partial) but currently that
>> condition won't be true for MergeAppendPath.
>>
>
> I think then it is better to have Assert for MergeAppend.  It is
> generally not a good idea to add code which we can never hit.

Done.

> One more thing, I think you might want to update comment atop
> add_paths_to_append_rel to reflect the new type of parallel-aware
> append path being generated. In particular, I am referring to below
> part of the comment:
>
>  * Similarly it collects partial paths from
>  * non-dummy children to create partial append paths.
>  */
> static void
> add_paths_to_append_rel()
>

Added comments.

Attached revised patch v15. There is still the open point being
discussed : whether to have non-parallel-aware partial Append path, or
always have parallel-aware Append paths.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v15.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-09-19 Thread Amit Khandekar
On 16 September 2017 at 10:42, Amit Kapila  wrote:
> On Thu, Sep 14, 2017 at 9:41 PM, Robert Haas  wrote:
>> On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila  wrote:
>>> I think the patch stores only non-partial paths in decreasing order,
>>> what if partial paths having more costs follows those paths?
>>
>> The general picture here is that we don't want the leader to get stuck
>> inside some long-running operation because then it won't be available
>> to read tuples from the workers.  On the other hand, we don't want to
>> just have the leader do no work because that might be slow.  And in
>> most cases, the leader will be the first participant to arrive at
>> the Append node, because of the worker startup time.  So the idea is
>> that the workers should pick expensive things first, and the leader
>> should pick cheap things first.
>>
>
> At a broader level, the idea is good, but I think it won't turn out
> exactly like that considering your below paragraph which indicates
> that it is okay if leader picks a partial path that is costly among
> other partial paths as a leader won't be locked with that.
> Considering this is a good design for parallel append, the question is
> do we really need worker and leader to follow separate strategy for
> choosing next path.  I think the patch will be simpler if we can come
> up with a way for the worker and leader to use the same strategy to
> pick next path to process.  How about we arrange the list of paths
> such that first, all partial paths will be there and then non-partial
> paths and probably both in decreasing order of cost.  Now, both leader
> and worker can start from the beginning of the list. In most cases,
> the leader will start at the first partial path and will only ever
> need to scan non-partial path if there is no other partial path left.
> This is not bulletproof as it is possible that some worker starts
> before leader in which case leader might scan non-partial path before
> all partial paths are finished, but I think we can avoid that as well
> if we are too worried about such cases.

If there are no partial subpaths, then again the leader is likely to
take up the expensive subpath. And this scenario would not be
uncommon. For this scenario at least, we anyway have to make the leader
start from the cheapest one, so we will have to maintain code for that
logic. Now, since we anyway have to maintain that logic, why not use the
same logic for the leader in all cases? Otherwise, if we try to come up
with a common logic that conditionally chooses a different next plan for
leader or worker, then that logic would most probably get more
complicated than the current state. Also, I don't see any performance
issue if the leader is running backwards while the others are going
forwards.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-09-17 Thread Amit Khandekar
On 16 September 2017 at 11:45, Amit Kapila  wrote:
> On Thu, Sep 14, 2017 at 8:30 PM, Amit Khandekar  
> wrote:
>> On 11 September 2017 at 18:55, Amit Kapila  wrote:

>>>
>>> How?  See, if you have four partial subpaths and two non-partial
>>> subpaths, then for parallel-aware append it considers all six paths in
>>> parallel path whereas for non-parallel-aware append it will consider
>>> just four paths and that too with sub-optimal strategy.  Can you
>>> please try to give me some example so that it will be clear.
>>
>> Suppose 4 appendrel children have costs for their cheapest partial (p)
>> and non-partial paths (np)  as shown below :
>>
>> p1=5000  np1=100
>> p2=200   np2=1000
>> p3=80   np3=2000
>> p4=3000  np4=50
>>
>> Here, following two Append paths will be generated :
>>
>> 1. a parallel-aware Append path with subpaths :
>> np1, p2, p3, np4
>>
>> 2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths:
>> p1,p2,p3,p4
>>
>> Now, one thing we can do above is : Make the path#2 parallel-aware as
>> well; so both Append paths would be parallel-aware.
>>
>
> Yes, we can do that and that is what I think is probably better.  So,
> the question remains that in which case non-parallel-aware partial
> append will be required?  Basically, it is not clear to me why after
> having parallel-aware partial append we need non-parallel-aware
> version?  Are you keeping it for the sake of backward-compatibility or
> something like for cases if someone has disabled parallel append with
> the help of new guc in this patch?

Yes, one case is the enable_parallelappend GUC. If a user disables
it, we do want to add the usual non-parallel-aware partial Append
path.

About backward compatibility, the concern we discussed in [1] was that
we had better continue to have the usual non-parallel-aware partial
Append path, plus an additional parallel-aware Append path containing a
mix of partial and non-partial subpaths.

But thinking again about the example above, Amit, I tend to agree
that we don't have to worry about the existing behaviour, and so we
can make path #2 parallel-aware as well.

Robert, can you please give your opinion on the paths that are chosen
in the above example?

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoaLRtaWdJVHfhHej2s7w1spbr6gZiZXJrM5bsz1KQ54Rw%40mail.gmail.com

>
-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-09-16 Thread Amit Kapila
On Thu, Sep 14, 2017 at 8:30 PM, Amit Khandekar  wrote:
> On 11 September 2017 at 18:55, Amit Kapila  wrote:
>>>
>>
>> How?  See, if you have four partial subpaths and two non-partial
>> subpaths, then for parallel-aware append it considers all six paths in
>> parallel path whereas for non-parallel-aware append it will consider
>> just four paths and that too with sub-optimal strategy.  Can you
>> please try to give me some example so that it will be clear.
>
> Suppose 4 appendrel children have costs for their cheapest partial (p)
> and non-partial paths (np)  as shown below :
>
> p1=5000  np1=100
> p2=200   np2=1000
> p3=80   np3=2000
> p4=3000  np4=50
>
> Here, following two Append paths will be generated :
>
> 1. a parallel-aware Append path with subpaths :
> np1, p2, p3, np4
>
> 2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths:
> p1,p2,p3,p4
>
> Now, one thing we can do above is : Make the path#2 parallel-aware as
> well; so both Append paths would be parallel-aware.
>

Yes, we can do that and that is what I think is probably better.  So,
the question remains that in which case non-parallel-aware partial
append will be required?  Basically, it is not clear to me why after
having parallel-aware partial append we need non-parallel-aware
version?  Are you keeping it for the sake of backward-compatibility or
something like for cases if someone has disabled parallel append with
the help of new guc in this patch?

>
> So above, what I am saying is, we can't tell which of the paths #1 and
> #2 are cheaper until we calculate total cost. I didn't understand what
> did you mean by "non-parallel-aware append will consider only the
> partial subpaths and that too with sub-optimal strategy" in the above
> example. I guess, you were considering a different scenario than the
> above one.
>

Yes, something different, but I think you can ignore that as we can
discuss the guts of my point based on the example given by you above.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-15 Thread Amit Kapila
On Thu, Sep 14, 2017 at 9:41 PM, Robert Haas  wrote:
> On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila  wrote:
>> I think the patch stores only non-partial paths in decreasing order,
>> what if partial paths having more costs follows those paths?
>
> The general picture here is that we don't want the leader to get stuck
> inside some long-running operation because then it won't be available
> to read tuples from the workers.  On the other hand, we don't want to
> just have the leader do no work because that might be slow.  And in
> most cases, the leader will be the first participant to arrive at
> the Append node, because of the worker startup time.  So the idea is
> that the workers should pick expensive things first, and the leader
> should pick cheap things first.
>

At a broader level, the idea is good, but I think it won't turn out
exactly like that considering your below paragraph which indicates
that it is okay if leader picks a partial path that is costly among
other partial paths as a leader won't be locked with that.
Considering this is a good design for parallel append, the question is
do we really need worker and leader to follow separate strategy for
choosing next path.  I think the patch will be simpler if we can come
up with a way for the worker and leader to use the same strategy to
pick next path to process.  How about we arrange the list of paths
such that first, all partial paths will be there and then non-partial
paths and probably both in decreasing order of cost.  Now, both leader
and worker can start from the beginning of the list. In most cases,
the leader will start at the first partial path and will only ever
need to scan non-partial path if there is no other partial path left.
This is not bulletproof as it is possible that some worker starts
before leader in which case leader might scan non-partial path before
all partial paths are finished, but I think we can avoid that as well
if we are too worried about such cases.
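
As an illustration of the ordering being proposed here (just a sketch,
not code from the patch), the sort rule would be: partial subpaths
first, then non-partial subpaths, each group in decreasing order of
total cost, so that every participant can simply scan the list from the
beginning:

/*
 * Illustrative comparator only; SubpathInfo is a stand-in for whatever
 * per-subpath information the real code would sort on.
 */
#include <stdbool.h>
#include <stdlib.h>

typedef struct SubpathInfo
{
    bool    is_partial;
    double  total_cost;
} SubpathInfo;

static int
subpath_cmp(const void *a, const void *b)
{
    const SubpathInfo *sa = (const SubpathInfo *) a;
    const SubpathInfo *sb = (const SubpathInfo *) b;

    /* partial subpaths sort before non-partial ones */
    if (sa->is_partial != sb->is_partial)
        return sa->is_partial ? -1 : 1;

    /* within each group, more expensive subpaths come first */
    if (sa->total_cost != sb->total_cost)
        return (sa->total_cost > sb->total_cost) ? -1 : 1;
    return 0;
}

/* usage: qsort(subpaths, nsubpaths, sizeof(SubpathInfo), subpath_cmp); */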


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-14 Thread Robert Haas
On Mon, Sep 11, 2017 at 9:25 AM, Amit Kapila  wrote:
> I think the patch stores only non-partial paths in decreasing order,
> what if partial paths having more costs follows those paths?

The general picture here is that we don't want the leader to get stuck
inside some long-running operation because then it won't be available
to read tuples from the workers.  On the other hand, we don't want to
just have the leader do no work because that might be slow.  And in
most cases, the leader will be the first participant to arrive at
the Append node, because of the worker startup time.  So the idea is
that the workers should pick expensive things first, and the leader
should pick cheap things first.  This may not always work out
perfectly and certainly the details of the algorithm may need some
refinement, but I think the basic concept is good.  Of course, that
may be because I proposed it...

Note that there's a big difference between the leader picking a
partial path and the leader picking a non-partial path.  If the leader
picks a partial path, it isn't committed to executing that path to
completion.  Other workers can help.  If the leader picks a
non-partial path, though, the workers are locked out of that path,
because a single process must run it all the way through.  So the
leader should avoid choosing a non-partial path unless there are no
partial paths remaining.
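
To make the intended division of labour concrete, here is a minimal
standalone sketch of the idea (not the actual patch code), assuming the
subplan list is ordered with the non-partial plans first in decreasing
cost order, followed by the partial plans:

/*
 * Simplified illustration only: real code must synchronize access to
 * the shared state (e.g. with a lock) and must also ensure that a
 * non-partial subplan is claimed by exactly one participant.
 */
#include <stdbool.h>

#define INVALID_SUBPLAN_INDEX  (-1)

typedef struct SharedAppendState
{
    int     nplans;
    bool    finished[64];   /* true once a subplan is exhausted/claimed */
} SharedAppendState;

/* Workers scan forward, grabbing the expensive plans first. */
static int
worker_next_subplan(SharedAppendState *s)
{
    for (int i = 0; i < s->nplans; i++)
        if (!s->finished[i])
            return i;
    return INVALID_SUBPLAN_INDEX;
}

/* The leader scans backward, grabbing the cheap (preferably partial) plans. */
static int
leader_next_subplan(SharedAppendState *s)
{
    for (int i = s->nplans - 1; i >= 0; i--)
        if (!s->finished[i])
            return i;
    return INVALID_SUBPLAN_INDEX;
}

With that ordering, the leader only falls back to a non-partial plan
when nothing else is left, which is the behaviour described above.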

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-09-14 Thread Amit Khandekar
On 11 September 2017 at 18:55, Amit Kapila  wrote:
>>> Do you think non-parallel-aware Append
>>> will be better in any case when there is a parallel-aware append?  I
>>> mean to say let's try to create non-parallel-aware append only when
>>> parallel-aware append is not possible.
>>
>> By non-parallel-aware append, I am assuming you meant  partial
>> non-parallel-aware Append. Yes, if the parallel-aware Append path has
>> *all* partial subpaths chosen, then we do omit a partial non-parallel
>> Append path, as seen in this code in the patch :
>>
>> /*
>> * Consider non-parallel partial append path. But if the parallel append
>> * path is made out of all partial subpaths, don't create another partial
>> * path; we will keep only the parallel append path in that case.
>> */
>> if (partial_subpaths_valid && !pa_all_partial_subpaths)
>> {
>> ..
>> }
>>
>> But if the parallel-Append path has a mix of partial and non-partial
>> subpaths, then we can't really tell which of the two could be cheapest
>> until we calculate the cost. It can be that the non-parallel-aware
>> partial Append can be cheaper as well.
>>
>
> How?  See, if you have four partial subpaths and two non-partial
> subpaths, then for parallel-aware append it considers all six paths in
> parallel path whereas for non-parallel-aware append it will consider
> just four paths and that too with sub-optimal strategy.  Can you
> please try to give me some example so that it will be clear.

Suppose 4 appendrel children have costs for their cheapest partial (p)
and non-partial paths (np)  as shown below :

p1=5000  np1=100
p2=200   np2=1000
p3=80   np3=2000
p4=3000  np4=50

Here, following two Append paths will be generated :

1. a parallel-aware Append path with subpaths :
np1, p2, p3, np4

2. Partial (i.e. non-parallel-aware) Append path with all partial subpaths:
p1,p2,p3,p4

Now, one thing we can do above is: make path #2 parallel-aware as
well, so both Append paths would be parallel-aware. Are you suggesting
exactly this?

So above, what I am saying is, we can't tell which of the paths #1 and
#2 is cheaper until we calculate the total cost. I didn't understand
what you meant by "non-parallel-aware append will consider only the
partial subpaths and that too with sub-optimal strategy" in the above
example. I guess you were considering a different scenario than the
above one.

Whereas, if one or more of the Append's children do not have a partial
path in the first place, then a non-parallel-aware partial Append is
out of the question, which we both agree on.
And the other case where we skip the non-parallel-aware partial Append
is when all the cheapest subpaths of the parallel-aware Append path are
partial paths: we do not want parallel-aware and non-parallel-aware
Append paths both having exactly the same partial subpaths.
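
For illustration only (this is not the planner code), the way the two
candidate Append paths in the above example get their subpaths can be
sketched like this; which of the two is cheaper overall is known only
after costing the complete Append paths:

/*
 * Sketch under stated assumptions: path #1 takes, for each child, the
 * cheaper of its cheapest partial and cheapest non-partial path;
 * path #2 takes every child's cheapest partial path.  Child names are
 * hypothetical placeholders for the appendrel children above.
 */
#include <stdio.h>

typedef struct ChildCosts
{
    const char *name;
    double      partial;        /* cheapest partial path cost */
    double      nonpartial;     /* cheapest non-partial path cost */
} ChildCosts;

int
main(void)
{
    ChildCosts child[] = {
        {"c1", 5000, 100},
        {"c2", 200, 1000},
        {"c3", 80, 2000},
        {"c4", 3000, 50},
    };

    printf("path #1 (parallel-aware, mixed subpaths):\n");
    for (int i = 0; i < 4; i++)
        printf("  %s: %s\n", child[i].name,
               child[i].nonpartial < child[i].partial ?
               "non-partial" : "partial");

    /* prints non-partial, partial, partial, non-partial: np1, p2, p3, np4 */
    printf("path #2 (all-partial subpaths): p1, p2, p3, p4\n");
    return 0;
}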

-

I will be addressing your other comments separately.

Thanks
-Amit Khandekar




Re: [HACKERS] Parallel Append implementation

2017-09-11 Thread Amit Kapila
On Mon, Sep 11, 2017 at 4:49 PM, Amit Khandekar  wrote:
> On 8 September 2017 at 19:17, Amit Kapila  wrote:
>>
>> In that case, why can't we keep the workers also process in same
>> order, what is the harm in that?
>
> Because of the way the logic of queuing works, the workers finish
> earlier if they start with expensive plans first. For e.g. : with 3
> plans with costs 8, 4, 4 and with 2 workers w1 and w2, they will
> finish in 8 time units (w1 will finish plan 1 in 8, while in parallel
> w2 will finish the remaining 2 plans in 8 units.  Whereas if the plans
> are ordered like : 4, 4, 8, then the workers will finish in 12 time
> units (w1 and w2 will finish each of the 1st two plans in 4 units, and
> then w1 or w2 will take up plan 3 and finish in 8 units, while the
> other worker remains idle).
>

I think the patch stores only non-partial paths in decreasing order,
what if partial paths having more costs follows those paths?

>
>>
>> Few more comments:
>>
>> 1.
>> + else if (IsA(subpath, MergeAppendPath))
>> + {
>> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
>> +
>> + /*
>> + * If at all MergeAppend is partial, all its child plans have to be
>> + * partial : we don't currently support a mix of partial and
>> + * non-partial MergeAppend subpaths.
>> + */
>> + if (is_partial)
>> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>>
>> In which situation partial MergeAppendPath is generated?  Can you
>> provide one example of such path?
>
> Actually currently we don't support partial paths for MergeAppendPath.
> That code just has that if condition (is_partial) but currently that
> condition won't be true for MergeAppendPath.
>

I think then it is better to have Assert for MergeAppend.  It is
generally not a good idea to add code which we can never hit.

>>
>> 2.
>> add_paths_to_append_rel()
..
>>
>> I think it might be better to add a sentence why we choose a different
>> way to decide a number of workers in the second case
>> (non-parallel-aware append).
>
> Yes, I agree. Will do that after we conclude with your next point below ...
>
>> Do you think non-parallel-aware Append
>> will be better in any case when there is a parallel-aware append?  I
>> mean to say let's try to create non-parallel-aware append only when
>> parallel-aware append is not possible.
>
> By non-parallel-aware append, I am assuming you meant  partial
> non-parallel-aware Append. Yes, if the parallel-aware Append path has
> *all* partial subpaths chosen, then we do omit a partial non-parallel
> Append path, as seen in this code in the patch :
>
> /*
> * Consider non-parallel partial append path. But if the parallel append
> * path is made out of all partial subpaths, don't create another partial
> * path; we will keep only the parallel append path in that case.
> */
> if (partial_subpaths_valid && !pa_all_partial_subpaths)
> {
> ..
> }
>
> But if the parallel-Append path has a mix of partial and non-partial
> subpaths, then we can't really tell which of the two could be cheapest
> until we calculate the cost. It can be that the non-parallel-aware
> partial Append can be cheaper as well.
>

How?  See, if you have four partial subpaths and two non-partial
subpaths, then for parallel-aware append it considers all six paths in
parallel path whereas for non-parallel-aware append it will consider
just four paths and that too with sub-optimal strategy.  Can you
please try to give me some example so that it will be clear.

>>
>> 4.
>> -  select count(*) from a_star;
>> -select count(*) from a_star;
>> +  select round(avg(aa)), sum(aa) from a_star;
>> +select round(avg(aa)), sum(aa) from a_star;
>>
>> Why you have changed the existing test. It seems count(*) will also
>> give what you are expecting.
>
> Needed to do cover some data testing with Parallel Append execution.
>

Okay.

One more thing, I think you might want to update comment atop
add_paths_to_append_rel to reflect the new type of parallel-aware
append path being generated. In particular, I am referring to below
part of the comment:

 * Similarly it collects partial paths from
 * non-dummy children to create partial append paths.
 */
static void
add_paths_to_append_rel()


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-11 Thread Amit Khandekar
On 8 September 2017 at 19:17, Amit Kapila  wrote:
> On Fri, Sep 8, 2017 at 3:59 PM, Amit Khandekar  wrote:
>> On 7 September 2017 at 11:05, Amit Kapila  wrote:
>>> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar  
>>> wrote:
>>> 3.
>>> +/* 
>>> + * exec_append_leader_next
>>> + *
>>> + * To be used only if it's a parallel leader. The backend should scan
>>> + * backwards from the last plan. This is to prevent it from taking up
>>> + * the most expensive non-partial plan, i.e. the first subplan.
>>> + * 
>>> + */
>>> +static bool
>>> +exec_append_leader_next(AppendState *state)
>>>
>>> From above explanation, it is clear that you don't want backend to
>>> pick an expensive plan for a leader, but the reason for this different
>>> treatment is not clear.
>>
>> Explained it, saying that for more workers, a leader spends more work
>> in processing the worker tuples , and less work contributing to
>> parallel processing. So it should not take expensive plans, otherwise
>> it will affect the total time to finish Append plan.
>>
>
> In that case, why can't we keep the workers also process in same
> order, what is the harm in that?

Because of the way the queuing logic works, the workers finish
earlier if they start with expensive plans first. For example, with 3
plans with costs 8, 4, 4 and with 2 workers w1 and w2, they will
finish in 8 time units (w1 will finish plan 1 in 8 units, while in
parallel w2 will finish the remaining 2 plans in 8 units). Whereas if
the plans are ordered like 4, 4, 8, then the workers will finish in 12
time units (w1 and w2 will finish each of the first two plans in 4
units, and then w1 or w2 will take up plan 3 and finish it in 8 units,
while the other worker remains idle).
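
Just to spell out the arithmetic, here is a tiny standalone simulation
(purely illustrative, not part of the patch) of handing whole subplans
to two workers in list order; it reproduces the 8-versus-12 time units
above:

/*
 * Greedy assignment of whole (non-partial) subplans to workers: each
 * plan in list order goes to whichever worker frees up first, and the
 * makespan is when the last worker finishes.
 */
#include <stdio.h>

static double
makespan(const double *costs, int nplans, int nworkers)
{
    double  busy_until[8] = {0};    /* finish time of each worker */
    double  max = 0;

    for (int i = 0; i < nplans; i++)
    {
        int     w = 0;

        /* the next plan goes to whichever worker frees up first */
        for (int j = 1; j < nworkers; j++)
            if (busy_until[j] < busy_until[w])
                w = j;
        busy_until[w] += costs[i];
    }

    for (int j = 0; j < nworkers; j++)
        if (busy_until[j] > max)
            max = busy_until[j];
    return max;
}

int
main(void)
{
    double  desc[] = {8, 4, 4};     /* expensive plan first */
    double  asc[] = {4, 4, 8};      /* expensive plan last */

    printf("descending: %.0f units\n", makespan(desc, 3, 2));   /* 8 */
    printf("ascending:  %.0f units\n", makespan(asc, 3, 2));    /* 12 */
    return 0;
}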

> Also, the leader will always scan
> the subplans from last subplan even if all the subplans are partial
> plans.

Since we already need to have two different code paths, I think we can
use the same code paths for any subplans.

> I think this will be the unnecessary difference in the
> strategy of leader and worker especially when all paths are partial.
> I think the selection of next subplan might become simpler if we use
> the same strategy for worker and leader.

Yeah, if we had a common method for both, it would have been better.
But anyway we have different logic to maintain.

>
> Few more comments:
>
> 1.
> + else if (IsA(subpath, MergeAppendPath))
> + {
> + MergeAppendPath *mpath = (MergeAppendPath *) subpath;
> +
> + /*
> + * If at all MergeAppend is partial, all its child plans have to be
> + * partial : we don't currently support a mix of partial and
> + * non-partial MergeAppend subpaths.
> + */
> + if (is_partial)
> + return list_concat(partial_subpaths, list_copy(mpath->subpaths));
>
> In which situation partial MergeAppendPath is generated?  Can you
> provide one example of such path?

Actually currently we don't support partial paths for MergeAppendPath.
That code just has that if condition (is_partial) but currently that
condition won't be true for MergeAppendPath.

>
> 2.
> add_paths_to_append_rel()
> {
> ..
> + /* Consider parallel append path. */
> + if (pa_subpaths_valid)
> + {
> + AppendPath *appendpath;
> + int parallel_workers;
> +
> + parallel_workers = get_append_num_workers(pa_partial_subpaths,
> +  pa_nonpartial_subpaths);
> + appendpath = create_append_path(rel, pa_nonpartial_subpaths,
> + pa_partial_subpaths,
> + NULL, parallel_workers, true,
> + partitioned_rels);
> + add_partial_path(rel, (Path *) appendpath);
> + }
> +
>   /*
> - * Consider an append of partial unordered, unparameterized partial paths.
> + * Consider non-parallel partial append path. But if the parallel append
> + * path is made out of all partial subpaths, don't create another partial
> + * path; we will keep only the parallel append path in that case.
>   */
> - if (partial_subpaths_valid)
> + if (partial_subpaths_valid && !pa_all_partial_subpaths)
>   {
>   AppendPath *appendpath;
>   ListCell   *lc;
>   int parallel_workers = 0;
>
>   /*
> - * Decide on the number of workers to request for this append path.
> - * For now, we just use the maximum value from among the members.  It
> - * might be useful to use a higher number if the Append node were
> - * smart enough to spread out the workers, but it currently isn't.
> + * To decide the number of workers, just use the maximum value from
> + * among the children.
>   */
>   foreach(lc, partial_subpaths)
>   {
> @@ -1421,9 +1502,9 @@ add_paths_to_append_rel(PlannerInfo *root,
> RelOptInfo *rel,
>   }
>   Assert(parallel_workers > 0);
>
> - /* Generate a partial append path. */
> - appendpath = create_append_path(rel, partial_subpaths, NULL,
> - parallel_workers, partitioned_rels);
> + appendpath = create_append_path(rel, NIL, partial_subpaths,
> + 

Re: [HACKERS] Parallel Append implementation

2017-09-08 Thread Amit Kapila
On Fri, Sep 8, 2017 at 3:59 PM, Amit Khandekar  wrote:
> On 7 September 2017 at 11:05, Amit Kapila  wrote:
>> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar  
>> wrote:
>> 3.
>> +/* 
>> + * exec_append_leader_next
>> + *
>> + * To be used only if it's a parallel leader. The backend should scan
>> + * backwards from the last plan. This is to prevent it from taking up
>> + * the most expensive non-partial plan, i.e. the first subplan.
>> + * 
>> + */
>> +static bool
>> +exec_append_leader_next(AppendState *state)
>>
>> From above explanation, it is clear that you don't want backend to
>> pick an expensive plan for a leader, but the reason for this different
>> treatment is not clear.
>
> Explained it, saying that for more workers, a leader spends more work
> in processing the worker tuples , and less work contributing to
> parallel processing. So it should not take expensive plans, otherwise
> it will affect the total time to finish Append plan.
>

In that case, why can't we have the workers also process in the same
order; what is the harm in that?  Also, the leader will always scan
the subplans from the last subplan even if all the subplans are partial
plans.  I think this will be an unnecessary difference in the
strategy of leader and worker, especially when all paths are partial.
I think the selection of the next subplan might become simpler if we use
the same strategy for worker and leader.

Few more comments:

1.
+ else if (IsA(subpath, MergeAppendPath))
+ {
+ MergeAppendPath *mpath = (MergeAppendPath *) subpath;
+
+ /*
+ * If at all MergeAppend is partial, all its child plans have to be
+ * partial : we don't currently support a mix of partial and
+ * non-partial MergeAppend subpaths.
+ */
+ if (is_partial)
+ return list_concat(partial_subpaths, list_copy(mpath->subpaths));

In which situation partial MergeAppendPath is generated?  Can you
provide one example of such path?

2.
add_paths_to_append_rel()
{
..
+ /* Consider parallel append path. */
+ if (pa_subpaths_valid)
+ {
+ AppendPath *appendpath;
+ int parallel_workers;
+
+ parallel_workers = get_append_num_workers(pa_partial_subpaths,
+  pa_nonpartial_subpaths);
+ appendpath = create_append_path(rel, pa_nonpartial_subpaths,
+ pa_partial_subpaths,
+ NULL, parallel_workers, true,
+ partitioned_rels);
+ add_partial_path(rel, (Path *) appendpath);
+ }
+
  /*
- * Consider an append of partial unordered, unparameterized partial paths.
+ * Consider non-parallel partial append path. But if the parallel append
+ * path is made out of all partial subpaths, don't create another partial
+ * path; we will keep only the parallel append path in that case.
  */
- if (partial_subpaths_valid)
+ if (partial_subpaths_valid && !pa_all_partial_subpaths)
  {
  AppendPath *appendpath;
  ListCell   *lc;
  int parallel_workers = 0;

  /*
- * Decide on the number of workers to request for this append path.
- * For now, we just use the maximum value from among the members.  It
- * might be useful to use a higher number if the Append node were
- * smart enough to spread out the workers, but it currently isn't.
+ * To decide the number of workers, just use the maximum value from
+ * among the children.
  */
  foreach(lc, partial_subpaths)
  {
@@ -1421,9 +1502,9 @@ add_paths_to_append_rel(PlannerInfo *root,
RelOptInfo *rel,
  }
  Assert(parallel_workers > 0);

- /* Generate a partial append path. */
- appendpath = create_append_path(rel, partial_subpaths, NULL,
- parallel_workers, partitioned_rels);
+ appendpath = create_append_path(rel, NIL, partial_subpaths,
+ NULL, parallel_workers, false,
+ partitioned_rels);
  add_partial_path(rel, (Path *) appendpath);
  }
..
}

I think it might be better to add a sentence about why we choose a
different way to decide the number of workers in the second case
(non-parallel-aware append).  Do you think a non-parallel-aware Append
will be better in any case when there is a parallel-aware Append?  I
mean to say, let's try to create a non-parallel-aware Append only when
a parallel-aware Append is not possible.

3.
+ * evaluates to a value just a bit greater than max(w1,w2, w3). So, we

The spacing between w1, w2, w3 is not the same.

4.
-  select count(*) from a_star;
-select count(*) from a_star;
+  select round(avg(aa)), sum(aa) from a_star;
+select round(avg(aa)), sum(aa) from a_star;

Why have you changed the existing test? It seems count(*) will also
give what you are expecting.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-08 Thread Rafia Sabih
On Wed, Aug 30, 2017 at 5:32 PM, Amit Khandekar  wrote:
> Hi Rafia,
>
> On 17 August 2017 at 14:12, Amit Khandekar  wrote:
>> But for all of the cases here, partial
>> subplans seem possible, and so even on HEAD it executed Partial
>> Append. So between a Parallel Append having partial subplans and a
>> Partial Append having partial subplans , the cost difference would not
>> be significant. Even if we assume that Parallel Append was chosen
>> because its cost turned out to be a bit cheaper, the actual
>> performance gain seems quite large as compared to the expected cost
>> difference. So it might be even possible that the performance gain
>> might be due to some other reasons. I will investigate this, and the
>> other queries.
>>
>
> I ran all the queries that were showing performance benefits in your
> run. But for me, the ParallelAppend benefits are shown only for plans
> that use Partition-Wise-Join.
>
> For all the queries that use only PA plans but not PWJ plans, I got
> the exact same plan for HEAD as for PA+PWJ patch, except that for the
> later, the Append is a ParallelAppend. Whereas, for you, the plans
> have join-order changed.
>
>> Regarding actual costs; consequently, for me the actual-cost are more
> or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
> quite different costs naturally because the plans themselves are
> different on head versus PA+PWJ.
>
> My PA+PWJ plan outputs (and actual costs) match exactly what you get
> with PA+PWJ patch. But like I said, I get the same join order and same
> plans (and actual costs) for HEAD as well (except
> ParallelAppend=>Append).
>
> May be, if you have the latest HEAD code with your setup, you can
> yourself check some of the queries again to see if they are still
> seeing higher costs as compared to PA ? I suspect that some changes in
> latest code might be causing this discrepancy; because when I tested
> some of the explains with a HEAD-branch server running with your
> database, I got results matching PA figures.
>
> Attached is my explain-analyze outputs.
>

Now, when I compare your results with the ones I posted I could see
one major difference between them -- selectivity estimation errors.
In the results I posted, e.g. Q3, on head it gives following

->  Finalize GroupAggregate  (cost=41131358.89..101076015.45
rows=455492628 width=44) (actual time=126436.642..129247.972
rows=226765 loops=1)
   Group Key: lineitem_001.l_orderkey,
orders_001.o_orderdate, orders_001.o_shippriority
   ->  Gather Merge  (cost=41131358.89..90637642.73
rows=379577190 width=44) (actual time=126436.602..127791.768
rows=235461 loops=1)
 Workers Planned: 2
 Workers Launched: 2

and in your results it is,
->  Finalize GroupAggregate  (cost=4940619.86..6652725.07
rows=13009521 width=44) (actual time=89573.830..91956.956 rows=226460
loops=1)
   Group Key: lineitem_001.l_orderkey,
orders_001.o_orderdate, orders_001.o_shippriority
   ->  Gather Merge  (cost=4940619.86..6354590.21
rows=10841268 width=44) (actual time=89573.752..90747.393 rows=235465
loops=1)
 Workers Planned: 2
 Workers Launched: 2

However, for the results with the patch/es this is not the case,

in my results, with patch,

 ->  Finalize GroupAggregate  (cost=4933450.21..663.01
rows=12899766 width=44) (actual time=87250.039..90593.716 rows=226765
loops=1)
   Group Key: lineitem_001.l_orderkey,
orders_001.o_orderdate, orders_001.o_shippriority
   ->  Gather Merge  (cost=4933450.21..6335491.38
rows=10749804 width=44) (actual time=87250.020..89125.279 rows=227291
loops=1)
 Workers Planned: 2
 Workers Launched: 2

I think this explains the reason for the drastic difference in the plan
choices, and thus the performance, in both the cases.

Since I was using the same database for both cases, I don't have much
reason to expect such a difference in selectivity estimation for these
queries. The only thing might be a missing VACUUM ANALYZE, but since I
checked it a couple of times I am not sure if even that could be the
reason. Additionally, it is not the case for all the queries; in
Q10 and Q21, for example, the estimates are similar.

However, on a fresh database the selectivity estimates and plans, as
reported by you and with the patched version I posted, seem to be the
correct ones. I'll see if I can check the performance of these queries
once again to verify this.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/




Re: [HACKERS] Parallel Append implementation

2017-09-08 Thread Amit Khandekar
On 7 September 2017 at 11:05, Amit Kapila  wrote:
> On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar  
> wrote:
>> The last updated patch needs a rebase. Attached is the rebased version.
>>
>
> Few comments on the first read of the patch:

Thanks !

>
> 1.
> @@ -279,6 +347,7 @@ void
>  ExecReScanAppend(AppendState *node)
>  {
>   int i;
> + ParallelAppendDesc padesc = node->as_padesc;
>
>   for (i = 0; i < node->as_nplans; i++)
>   {
> @@ -298,6 +367,276 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
> + if (padesc)
> + {
> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> + }
> +
>
> For rescan purpose, the parallel state should be reinitialized via
> ExecParallelReInitializeDSM.  We need to do that way mainly to avoid
> cases where sometimes in rescan machinery we don't perform rescan of
> all the nodes.  See commit 41b0dd987d44089dc48e9c70024277e253b396b7.

Right. I didn't notice this while I rebased my patch over that commit.
Fixed it. Also added an exec_append_parallel_next() call in
ExecAppendReInitializeDSM(), otherwise the next ExecAppend() in leader
will get an uninitialized as_whichplan.

>
> 2.
> + * shared next_subplan counter index to start looking for unfinished plan,

Done.

>
> Here using "counter index" seems slightly confusing. I think using one
> of those will be better.

Re-worded it a bit. See whether that's what you wanted.

>
> 3.
> +/* 
> + * exec_append_leader_next
> + *
> + * To be used only if it's a parallel leader. The backend should scan
> + * backwards from the last plan. This is to prevent it from taking up
> + * the most expensive non-partial plan, i.e. the first subplan.
> + * 
> + */
> +static bool
> +exec_append_leader_next(AppendState *state)
>
> From above explanation, it is clear that you don't want backend to
> pick an expensive plan for a leader, but the reason for this different
> treatment is not clear.

Explained it, saying that with more workers, the leader spends more of
its time processing the worker tuples, and less time contributing to
the parallel processing itself. So it should not take up expensive
plans; otherwise it will affect the total time to finish the Append plan.

>
> 4.
> accumulate_partialappend_subpath()
> {
> ..
> + /* Add partial subpaths, if any. */
> + return list_concat(partial_subpaths, apath_partial_paths);
> ..
> + return partial_subpaths;
> ..
> + if (is_partial)
> + return lappend(partial_subpaths, subpath);
> ..
> }
>
> In this function, instead of returning from multiple places
> partial_subpaths list, you can just return it at the end and in all
> other places just append to it if required.  That way code will look
> more clear and simpler.

Agreed. Did it that way.

>
> 5.
>  * is created to represent the case that a relation is provably empty.
> + *
>   */
>  typedef struct AppendPath
>
> Spurious line addition.
Removed.

>
> 6.
> if (!node->as_padesc)
> {
> /*
> * This is Parallel-aware append. Follow it's own logic of choosing
> * the next subplan.
> */
> if (!exec_append_seq_next(node))
>
> I think this is the case of non-parallel-aware appends, but the
> comments are indicating the opposite.

Yeah, this comment got left over there when the relevant code got
changed. Shifted that comment upwards where it is appropriate.

Attached is the updated patch v14.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v14.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-09-07 Thread Amit Khandekar
On 7 September 2017 at 13:40, Rafia Sabih  wrote:
> On Wed, Aug 30, 2017 at 5:32 PM, Amit Khandekar  
> wrote:
>> Hi Rafia,
>>
>> On 17 August 2017 at 14:12, Amit Khandekar  wrote:
>>> But for all of the cases here, partial
>>> subplans seem possible, and so even on HEAD it executed Partial
>>> Append. So between a Parallel Append having partial subplans and a
>>> Partial Append having partial subplans , the cost difference would not
>>> be significant. Even if we assume that Parallel Append was chosen
>>> because its cost turned out to be a bit cheaper, the actual
>>> performance gain seems quite large as compared to the expected cost
>>> difference. So it might be even possible that the performance gain
>>> might be due to some other reasons. I will investigate this, and the
>>> other queries.
>>>
>>
>> I ran all the queries that were showing performance benefits in your
>> run. But for me, the ParallelAppend benefits are shown only for plans
>> that use Partition-Wise-Join.
>>
>> For all the queries that use only PA plans but not PWJ plans, I got
>> the exact same plan for HEAD as for PA+PWJ patch, except that for the
>> later, the Append is a ParallelAppend. Whereas, for you, the plans
>> have join-order changed.
>>
>> Regarding actual costs; consequently, for me the actual-cost are more
>> or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
>> quite different costs naturally because the plans themselves are
>> different on head versus PA+PWJ.
>>
>> My PA+PWJ plan outputs (and actual costs) match exactly what you get
>> with PA+PWJ patch. But like I said, I get the same join order and same
>> plans (and actual costs) for HEAD as well (except
>> ParallelAppend=>Append).
>>
>> May be, if you have the latest HEAD code with your setup, you can
>> yourself check some of the queries again to see if they are still
>> seeing higher costs as compared to PA ? I suspect that some changes in
>> latest code might be causing this discrepancy; because when I tested
>> some of the explains with a HEAD-branch server running with your
>> database, I got results matching PA figures.
>>
>> Attached is my explain-analyze outputs.
>>
>
> Strange. Please let me know what was the commit-id you were
> experimenting on. I think we need to investigate this a little
> further.

Sure. I think the commit was b5c75fec. It was sometime around Aug 30
when I ran the tests. But you may try on the latest head.

> Additionally I want to point that I also applied patch [1],
> which I forgot to mention before.

Yes , I also had applied that patch over PA+PWJ.

>
> [1]  
> https://www.postgresql.org/message-id/CAEepm%3D3%3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-09-07 Thread Rafia Sabih
On Wed, Aug 30, 2017 at 5:32 PM, Amit Khandekar  wrote:
> Hi Rafia,
>
> On 17 August 2017 at 14:12, Amit Khandekar  wrote:
>> But for all of the cases here, partial
>> subplans seem possible, and so even on HEAD it executed Partial
>> Append. So between a Parallel Append having partial subplans and a
>> Partial Append having partial subplans , the cost difference would not
>> be significant. Even if we assume that Parallel Append was chosen
>> because its cost turned out to be a bit cheaper, the actual
>> performance gain seems quite large as compared to the expected cost
>> difference. So it might be even possible that the performance gain
>> might be due to some other reasons. I will investigate this, and the
>> other queries.
>>
>
> I ran all the queries that were showing performance benefits in your
> run. But for me, the ParallelAppend benefits are shown only for plans
> that use Partition-Wise-Join.
>
> For all the queries that use only PA plans but not PWJ plans, I got
> the exact same plan for HEAD as for PA+PWJ patch, except that for the
> later, the Append is a ParallelAppend. Whereas, for you, the plans
> have join-order changed.
>
> Regarding actual costs; consequently, for me the actual-cost are more
> or less the same for HEAD and PA+PWJ. Whereas, for your runs, you have
> quite different costs naturally because the plans themselves are
> different on head versus PA+PWJ.
>
> My PA+PWJ plan outputs (and actual costs) match exactly what you get
> with PA+PWJ patch. But like I said, I get the same join order and same
> plans (and actual costs) for HEAD as well (except
> ParallelAppend=>Append).
>
> May be, if you have the latest HEAD code with your setup, you can
> yourself check some of the queries again to see if they are still
> seeing higher costs as compared to PA ? I suspect that some changes in
> latest code might be causing this discrepancy; because when I tested
> some of the explains with a HEAD-branch server running with your
> database, I got results matching PA figures.
>
> Attached is my explain-analyze outputs.
>

Strange. Please let me know the commit-id you were experimenting on.
I think we need to investigate this a little further. Additionally, I
want to point out that I also applied patch [1], which I forgot to
mention before.

[1]  
https://www.postgresql.org/message-id/CAEepm%3D3%3DNHHko3oOzpik%2BggLy17AO%2Bpx3rGYrg3x_x05%2BBr9-A%40mail.gmail.com

> On 16 August 2017 at 18:34, Robert Haas  wrote:
>> Thanks for the benchmarking results!
>>
>> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>>  wrote:
>>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>>
>> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
>> Parallel Append with a bunch of non-partial plans when we could've
>> just as easily picked partial plans, or so it seems to me.  To put
>> that another way, why did we end up with a bunch of Bitmap Heap Scans
>> here instead of Parallel Bitmap Heap Scans?
>
> Actually, the cost difference would be quite low for Parallel Append
> with partial plans and Parallel Append with non-partial plans with 2
> workers. But yes, I should take a look at why it is consistently
> taking non-partial Bitmap Heap Scan.
>
> 
>
>> Q6 | 29 | 12 | PA only
>
> This one needs to be analysed, because here, the plan cost is the
> same, but actual cost for PA is almost half the cost for HEAD. This is
> the same observation for my run also.
>
> Thanks
> -Amit



-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/




Re: [HACKERS] Parallel Append implementation

2017-09-06 Thread Amit Kapila
On Thu, Aug 31, 2017 at 12:47 PM, Amit Khandekar  wrote:
> The last updated patch needs a rebase. Attached is the rebased version.
>

Few comments on the first read of the patch:

1.
@@ -279,6 +347,7 @@ void
 ExecReScanAppend(AppendState *node)
 {
  int i;
+ ParallelAppendDesc padesc = node->as_padesc;

  for (i = 0; i < node->as_nplans; i++)
  {
@@ -298,6 +367,276 @@ ExecReScanAppend(AppendState *node)
  if (subnode->chgParam == NULL)
  ExecReScan(subnode);
  }
+
+ if (padesc)
+ {
+ padesc->pa_first_plan = padesc->pa_next_plan = 0;
+ memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
+ }
+

For rescan purpose, the parallel state should be reinitialized via
ExecParallelReInitializeDSM.  We need to do that way mainly to avoid
cases where sometimes in rescan machinery we don't perform rescan of
all the nodes.  See commit 41b0dd987d44089dc48e9c70024277e253b396b7.

2.
+ * shared next_subplan counter index to start looking for unfinished plan,

Here, using "counter index" seems slightly confusing. I think using
just one of those terms will be better.

3.
+/* 
+ * exec_append_leader_next
+ *
+ * To be used only if it's a parallel leader. The backend should scan
+ * backwards from the last plan. This is to prevent it from taking up
+ * the most expensive non-partial plan, i.e. the first subplan.
+ * 
+ */
+static bool
+exec_append_leader_next(AppendState *state)

From the above explanation, it is clear that you don't want the backend
to pick an expensive plan for the leader, but the reason for this
different treatment is not clear.

4.
accumulate_partialappend_subpath()
{
..
+ /* Add partial subpaths, if any. */
+ return list_concat(partial_subpaths, apath_partial_paths);
..
+ return partial_subpaths;
..
+ if (is_partial)
+ return lappend(partial_subpaths, subpath);
..
}

In this function, instead of returning the partial_subpaths list from
multiple places, you can just return it at the end and, in all the other
places, just append to it if required.  That way the code will look
clearer and simpler.

5.
 * is created to represent the case that a relation is provably empty.
+ *
  */
 typedef struct AppendPath

Spurious line addition.

6.
if (!node->as_padesc)
{
/*
* This is Parallel-aware append. Follow it's own logic of choosing
* the next subplan.
*/
if (!exec_append_seq_next(node))

I think this is the case of non-parallel-aware appends, but the
comments are indicating the opposite.
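
Going back to comment 1, here is a minimal sketch of what that
reinitialization could look like.  The field names (as_padesc,
pa_first_plan, pa_next_plan, pa_finished) are taken from the posted
patch; the callback name and signature are only assumed to follow the
other Exec*ReInitializeDSM functions:

void
ExecAppendReInitializeDSM(AppendState *node, ParallelContext *pcxt)
{
    ParallelAppendDesc padesc = node->as_padesc;

    /* Reset the shared scan state before the next rescan cycle begins. */
    if (padesc)
    {
        padesc->pa_first_plan = padesc->pa_next_plan = 0;
        memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
    }
}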

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Parallel Append implementation

2017-09-05 Thread Amit Khandekar
On 30 August 2017 at 17:32, Amit Khandekar  wrote:
> On 16 August 2017 at 18:34, Robert Haas  wrote:
>> Thanks for the benchmarking results!
>>
>> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>>  wrote:
>>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>>
>> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
>> Parallel Append with a bunch of non-partial plans when we could've
>> just as easily picked partial plans, or so it seems to me.  To put
>> that another way, why did we end up with a bunch of Bitmap Heap Scans
>> here instead of Parallel Bitmap Heap Scans?
>
> Actually, the cost difference would be quite low for Parallel Append
> with partial plans and Parallel Append with non-partial plans with 2
> workers. But yes, I should take a look at why it is consistently
> taking non-partial Bitmap Heap Scan.

Here, I checked that Partial Bitmap Heap Scan Path is not getting
created in the first place; but I think it should.

As you can see from the below plan snippet, the inner path of the join
is a parameterized Index Scan :

->  Parallel Append
 ->  Nested Loop Semi Join
   ->  Bitmap Heap Scan on orders_004
   Recheck Cond: ((o_orderdate >= '1994-01-01'::date) AND
(o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
   ->  Bitmap Index Scan on idx_orders_orderdate_004
Index Cond: ((o_orderdate >= '1994-01-01'::date) AND
(o_orderdate < '1994-04-01 00:00:00'::timestamp without time zone))
   ->  Index Scan using idx_lineitem_orderkey_004 on lineitem_004
   Index Cond: (l_orderkey = orders_004.o_orderkey)
   Filter: (l_commitdate < l_receiptdate)

In the index condition of the inner IndexScan path, it is referencing
partition orders_004, which is used by the outer path. So this should
satisfy the partial join path restriction concerning a parameterized
inner path: "the inner path should not refer to relations *outside* the
join path". Here, it is referring to relations *inside* the join path.
But this join path still gets rejected by try_partial_nestloop_path(),
here:

if (inner_path->param_info != NULL)
{
   Relids inner_paramrels = inner_path->param_info->ppi_req_outer;
   if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
  return;
}

Actually, bms_is_subset() above should return true, because
inner_paramrels and the outer_path relids should both contain
orders_004. But that's not happening: inner_paramrels is referring to
orders, not orders_004, and hence bms_is_subset() returns false (thereby
rejecting the partial nestloop path). I suspect this is because the
inner path is not getting reparameterized so as to refer to the child
relations. In the PWJ patch, I saw that reparameterize_path_by_child()
is called by try_nestloop_path(), but not by try_partial_nestloop_path().

Now, for Parallel Append, if this partial nestloop subpath gets
created, it may or may not get chosen, depending upon the number of
workers. For example, if the number of workers is 6, and ParallelAppend+PWJ
runs with only 2 partitions, then the partial nested loop join would
definitely win, because we can put all 6 workers to work, whereas for a
ParallelAppend with all non-partial nested loop join subpaths, at
most only 2 workers could be allotted, one for each child. But if there
are more partitions and fewer available workers, then I think the
cost difference between the partial and non-partial join paths would
not be significant.

But here the issue is that partial nested loop subpaths don't get created
in the first place. Looking at the above analysis, this issue should be
worked on in a different thread, not in this one.
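
Just to illustrate the kind of change being talked about (an untested
sketch, not something from the posted patches; the exact behaviour of
reparameterize_path_by_child() here is an assumption based on how the
PWJ patch uses it in try_nestloop_path()), try_partial_nestloop_path()
could attempt to translate the inner path to the child before giving up:

    if (inner_path->param_info != NULL)
    {
        Relids  inner_paramrels = inner_path->param_info->ppi_req_outer;

        if (!bms_is_subset(inner_paramrels, outer_path->parent->relids))
        {
            /*
             * The subset test may be failing only because the inner path is
             * still parameterized by the parent rel (orders) rather than the
             * child (orders_004).  Sketch: re-express it in terms of the
             * child, then re-check.
             */
            inner_path = reparameterize_path_by_child(root, inner_path,
                                                      outer_path->parent);
            if (inner_path == NULL ||
                !bms_is_subset(PATH_REQ_OUTER(inner_path),
                               outer_path->parent->relids))
                return;
        }
    }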

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-08-31 Thread Amit Khandekar
The last updated patch needs a rebase. Attached is the rebased version.

Thanks
-Amit Khandekar


ParallelAppend_v13_rebased_3.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-08-30 Thread Amit Khandekar
Hi Rafia,

On 17 August 2017 at 14:12, Amit Khandekar  wrote:
> But for all of the cases here, partial
> subplans seem possible, and so even on HEAD it executed Partial
> Append. So between a Parallel Append having partial subplans and a
> Partial Append having partial subplans , the cost difference would not
> be significant. Even if we assume that Parallel Append was chosen
> because its cost turned out to be a bit cheaper, the actual
> performance gain seems quite large as compared to the expected cost
> difference. So it might be even possible that the performance gain
> might be due to some other reasons. I will investigate this, and the
> other queries.
>

I ran all the queries that were showing performance benefits in your
run. But for me, the ParallelAppend benefits are shown only for plans
that use Partition-Wise-Join.

For all the queries that use only PA plans but not PWJ plans, I got
exactly the same plan on HEAD as with the PA+PWJ patch, except that for
the latter, the Append is a ParallelAppend. Whereas for you, the plans
have a changed join order.

Regarding actual costs: consequently, for me the actual costs are more
or less the same for HEAD and PA+PWJ. Whereas for your runs, you have
quite different costs, naturally, because the plans themselves are
different on HEAD versus PA+PWJ.

My PA+PWJ plan outputs (and actual costs) match exactly what you get
with the PA+PWJ patch. But like I said, I get the same join order and the
same plans (and actual costs) for HEAD as well (except
ParallelAppend => Append).

Maybe, if you have the latest HEAD code with your setup, you could
check some of the queries again yourself to see if they are still
showing higher costs as compared to PA? I suspect that some changes in
the latest code might be causing this discrepancy, because when I tested
some of the explains with a HEAD-branch server running against your
database, I got results matching the PA figures.

Attached are my explain-analyze outputs.

On 16 August 2017 at 18:34, Robert Haas  wrote:
> Thanks for the benchmarking results!
>
> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>  wrote:
>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>
> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
> Parallel Append with a bunch of non-partial plans when we could've
> just as easily picked partial plans, or so it seems to me.  To put
> that another way, why did we end up with a bunch of Bitmap Heap Scans
> here instead of Parallel Bitmap Heap Scans?

Actually, the cost difference would be quite low for Parallel Append
with partial plans and Parallel Append with non-partial plans with 2
workers. But yes, I should take a look at why it is consistently
taking non-partial Bitmap Heap Scan.



> Q6 | 29 | 12 | PA only

This one needs to be analysed, because here the plan cost is the
same, but the actual cost for PA is almost half the cost for HEAD. This
is the same observation for my run as well.

Thanks
-Amit


PA-test-AmitKh.tar.gz
Description: GNU Zip compressed data



Re: [HACKERS] Parallel Append implementation

2017-08-17 Thread Amit Khandekar
On 16 August 2017 at 18:34, Robert Haas  wrote:
> Thanks for the benchmarking results!
>
> On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
>  wrote:
>> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41
>
> 12 seconds instead of 244?  Whoa.  I find it curious that we picked a
> Parallel Append with a bunch of non-partial plans when we could've
> just as easily picked partial plans, or so it seems to me.  To put
> that another way, why did we end up with a bunch of Bitmap Heap Scans
> here instead of Parallel Bitmap Heap Scans?
>
>> Q7 | 134 | 88 | PA only
>> Q18 | 508 | 489 | PA only
>
> What's interesting in these results is that the join order changes
> quite a lot when PA is in the mix, and I don't really see why that
> should happen.

Yes, it seems hard to determine why exactly the join order changes.
Parallel Append is expected to give a benefit especially if there
are no partial subplans. But for all of the cases here, partial
subplans seem possible, and so even on HEAD it executed a Partial
Append. So between a Parallel Append having partial subplans and a
Partial Append having partial subplans, the cost difference would not
be significant. Even if we assume that Parallel Append was chosen
because its cost turned out to be a bit cheaper, the actual
performance gain seems quite large compared to the expected cost
difference. So it is even possible that the performance gain is due
to some other reason. I will investigate this, and the other queries.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-08-16 Thread Robert Haas
Thanks for the benchmarking results!

On Tue, Aug 15, 2017 at 11:35 PM, Rafia Sabih
 wrote:
> Q4 | 244 | 12 | PA and PWJ, time by only PWJ - 41

12 seconds instead of 244?  Whoa.  I find it curious that we picked a
Parallel Append with a bunch of non-partial plans when we could've
just as easily picked partial plans, or so it seems to me.  To put
that another way, why did we end up with a bunch of Bitmap Heap Scans
here instead of Parallel Bitmap Heap Scans?

> Q7 | 134 | 88 | PA only
> Q18 | 508 | 489 | PA only

What's interesting in these results is that the join order changes
quite a lot when PA is in the mix, and I don't really see why that
should happen.  I haven't thought about how we're doing the PA costing
in a while, so that might just be my ignorance.  But I think we need
to try to separate the effect of the plan changes from the
execution-time effect of PA itself, so that we can (1) be sure that
the plan changes are legitimate and justifiable rather than the result
of some bug and (2) make sure that replacing an Append with a Parallel
Append with no other changes to the plan produces an execution-time
benefit as we're hoping.

> Q21 | 649 | 163 | PA only

This is a particularly interesting case because in both the patched
and unpatched plans, the driving scan is on the lineitem table and in
both cases a Parallel Seq Scan is used.  The join order is more
similar than in some of the other plans, but not the same: in the
unpatched case, it's l1-(nation-supplier)-l2-orders-l3; in the patched
case, it's l1-(nation-supplier)-l3-l2-orders.  The Parallel Append
node actually runs slower than the plain Append node (42.4 s vs. 39.0
s) but that plan ends up being faster overall.  I suspect that's
partly because the patched plan pulls 265680 rows through the Gather
node while the unpatched plan pulls 2888728 rows through the Gather
node, more than 10x more.  That's a very strange choice for the
planner to make, seemingly, and what's even stranger is that if it did
ALL of the joins below the Gather node it would only need to pull
78214 rows through the Gather node; why not do that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-08-09 Thread Amit Khandekar
On 9 August 2017 at 19:05, Robert Haas  wrote:
> On Wed, Jul 5, 2017 at 7:53 AM, Amit Khandekar  wrote:
>>> This is not applicable on the latest head i.e. commit --
>>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.
>>
>> Thanks for notifying. Attached is the rebased version of the patch.
>
> This again needs a rebase.

Attached rebased version of the patch. Thanks.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased_2.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-08-09 Thread Robert Haas
On Wed, Jul 5, 2017 at 7:53 AM, Amit Khandekar  wrote:
>> This is not applicable on the latest head i.e. commit --
>> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.
>
> Thanks for notifying. Attached is the rebased version of the patch.

This again needs a rebase.

(And, hey everybody, it also needs some review!)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-07-05 Thread Amit Khandekar
On 30 June 2017 at 15:10, Rafia Sabih  wrote:
>
> On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar 
> wrote:
>>
>> Attached is an updated patch v13 that has some comments changed as per
>> your review, and also rebased on latest master.
>
>
> This is not applicable on the latest head i.e. commit --
> 08aed6604de2e6a9f4d499818d7c641cbf5eb9f7, looks like need a rebasing.

Thanks for notifying. Attached is the rebased version of the patch.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v13_rebased.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-06-30 Thread Rafia Sabih
On Tue, Apr 4, 2017 at 12:37 PM, Amit Khandekar 
wrote:

> Attached is an updated patch v13 that has some comments changed as per
> your review, and also rebased on latest master.
>

This is not applicable on the latest head, i.e. commit
08aed6604de2e6a9f4d499818d7c641cbf5eb9f7; it looks like it needs a rebase.

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


Re: [HACKERS] Parallel Append implementation

2017-04-18 Thread Amit Khandekar
On 7 April 2017 at 20:35, Andres Freund  wrote:
>> But for costs such as (4, 4, 4,   20 times), the logic would give
>> us 20 workers because we want to finish the Append in 4 time units;
>> and this what we want to avoid when we go with
>> don't-allocate-too-many-workers approach.
>
> I guess, my problem is that I don't agree with that as a goal in an of
> itself.  If you actually want to run your query quickly, you *want* 20
> workers here.  The issues of backend startup overhead (already via
> parallel_setup_cost), concurrency and such cost should be modelled, not
> burried in a formula the user can't change.  If we want to make it less
> and less likely to start more workers we should make that configurable,
> not the default.
> I think there's some precedent taken from the parallel seqscan case,
> that's not actually applicable here.  Parallel seqscans have a good
> amount of shared state, both on the kernel and pg side, and that shared
> state reduces gains of increasing the number of workers.  But with
> non-partial scans such shared state largely doesn't exist.

After searching through earlier mails about parallel scan, I am not
sure whether the shared state was considered a potential factor
that might reduce parallel query gains when deciding how to calculate
the number of workers for a parallel seq scan. I mean, even today, if we
allocate 10 workers as against a calculated worker count of 4 for a
parallel seq scan, they might help. I think it's just that we don't
know whether they would *always* help or whether it would sometimes regress.




Re: [HACKERS] Parallel Append implementation

2017-04-07 Thread Andres Freund
Hi,

On 2017-04-07 11:44:39 +0530, Amit Khandekar wrote:
> On 6 April 2017 at 07:33, Andres Freund  wrote:
> > On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
> >> This is what the earlier versions of my patch had done : just add up
> >> per-subplan parallel_workers (1 for non-partial subplan and
> >> subpath->parallel_workers for partial subplans) and set this total as
> >> the Append parallel_workers.
> >
> > I don't think that's great, consider e.g. the case that you have one
> > very expensive query, and a bunch of cheaper ones. Most of those workers
> > wouldn't do much while waiting for the the expensive query.  What I'm
> > basically thinking we should do is something like the following
> > pythonesque pseudocode:
> >
> > best_nonpartial_cost = -1
> > best_nonpartial_nworkers = -1
> >
> > for numworkers in 1...#max workers:
> >worker_work = [0 for x in range(0, numworkers)]
> >
> >nonpartial_cost += startup_cost * numworkers
> >
> ># distribute all nonpartial tasks over workers.  Assign tasks to the
> ># worker with the least amount of work already performed.
> >for task in all_nonpartial_subqueries:
> >least_busy_worker = worker_work.smallest()
> >least_busy_worker += task.total_nonpartial_cost
> >
> ># the nonpartial cost here is the largest amount any single worker
> ># has to perform.
> >nonpartial_cost += worker_work.largest()
> >
> >total_partial_cost = 0
> >for task in all_partial_subqueries:
> >total_partial_cost += task.total_nonpartial_cost
> >
> ># Compute resources needed by partial tasks. First compute how much
> ># cost we can distribute to workers that take shorter than the
> ># "busiest" worker doing non-partial tasks.
> >remaining_avail_work = 0
> >for i in range(0, numworkers):
> >remaining_avail_work += worker_work.largest() - worker_work[i]
> >
> ># Equally divide up remaining work over all workers
> >if remaining_avail_work < total_partial_cost:
> >   nonpartial_cost += (worker_work.largest - remaining_avail_work) / 
> > numworkers
> >
> ># check if this is the best number of workers
> >if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
> >   best_nonpartial_cost = worker_work.largest
> >   best_nonpartial_nworkers = nworkers
> >
> > Does that make sense?
> 
> Yeah, I gather what you are trying to achieve is : allocate number of
> workers such that the total cost does not exceed the cost of the first
> non-partial plan (i.e. the costliest one, because the plans are sorted
> by descending cost).
> 
> So for non-partial costs such as (20, 10, 5, 2) allocate only 2
> workers because the 2nd worker will execute (10, 5, 2) while 1st
> worker executes (20).
> 
> But for costs such as (4, 4, 4,   20 times), the logic would give
> us 20 workers because we want to finish the Append in 4 time units;
> and this what we want to avoid when we go with
> don't-allocate-too-many-workers approach.

I guess my problem is that I don't agree with that as a goal in and of
itself.  If you actually want to run your query quickly, you *want* 20
workers here.  The issues of backend startup overhead (already via
parallel_setup_cost), concurrency and such costs should be modelled, not
buried in a formula the user can't change.  If we want to make it less
and less likely to start more workers, we should make that configurable,
not the default.
I think there's some precedent taken from the parallel seqscan case,
that's not actually applicable here.  Parallel seqscans have a good
amount of shared state, both on the kernel and pg side, and that shared
state reduces gains of increasing the number of workers.  But with
non-partial scans such shared state largely doesn't exist.


> > especially if we get partitionwise joins.
> 
> About that I am not sure, because we already have support for parallel
> joins, so wouldn't the join subpaths corresponding to all of the
> partitions be partial paths ? I may be wrong about that.

We'll probably generate both, and then choose the cheaper one.  The
startup cost for partitionwise joins should usually be a lot cheaper
(because e.g. for hashtables we'll generate smaller hashtables), and we
should have less cost of concurrency.


> But if the subplans are foreign scans, then yes all would be
> non-partial plans. This may provoke  off-topic discussion, but here
> instead of assigning so many workers to all these foreign plans and
> all those workers waiting for the results, a single asynchronous
> execution node (which is still in the making) would be desirable
> because it would do the job of all these workers.

That's something that probably shouldn't be modelled through a parallel
append, I agree - it shouldn't be too hard to develop a costing model
for that, however.

Greetings,

Andres Freund



Re: [HACKERS] Parallel Append implementation

2017-04-07 Thread Amit Khandekar
On 6 April 2017 at 07:33, Andres Freund  wrote:
> On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
>> This is what the earlier versions of my patch had done : just add up
>> per-subplan parallel_workers (1 for non-partial subplan and
>> subpath->parallel_workers for partial subplans) and set this total as
>> the Append parallel_workers.
>
> I don't think that's great, consider e.g. the case that you have one
> very expensive query, and a bunch of cheaper ones. Most of those workers
> wouldn't do much while waiting for the the expensive query.  What I'm
> basically thinking we should do is something like the following
> pythonesque pseudocode:
>
> best_nonpartial_cost = -1
> best_nonpartial_nworkers = -1
>
> for numworkers in 1...#max workers:
>worker_work = [0 for x in range(0, numworkers)]
>
>nonpartial_cost += startup_cost * numworkers
>
># distribute all nonpartial tasks over workers.  Assign tasks to the
># worker with the least amount of work already performed.
>for task in all_nonpartial_subqueries:
>least_busy_worker = worker_work.smallest()
>least_busy_worker += task.total_nonpartial_cost
>
># the nonpartial cost here is the largest amount any single worker
># has to perform.
>nonpartial_cost += worker_work.largest()
>
>total_partial_cost = 0
>for task in all_partial_subqueries:
>total_partial_cost += task.total_nonpartial_cost
>
># Compute resources needed by partial tasks. First compute how much
># cost we can distribute to workers that take shorter than the
># "busiest" worker doing non-partial tasks.
>remaining_avail_work = 0
>for i in range(0, numworkers):
>remaining_avail_work += worker_work.largest() - worker_work[i]
>
># Equally divide up remaining work over all workers
>if remaining_avail_work < total_partial_cost:
>   nonpartial_cost += (worker_work.largest - remaining_avail_work) / 
> numworkers
>
># check if this is the best number of workers
>if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
>   best_nonpartial_cost = worker_work.largest
>   best_nonpartial_nworkers = nworkers
>
> Does that make sense?

Yeah, I gather what you are trying to achieve is: allocate a number of
workers such that the total cost does not exceed the cost of the first
non-partial plan (i.e. the costliest one, because the plans are sorted
by descending cost).

So for non-partial costs such as (20, 10, 5, 2) allocate only 2
workers because the 2nd worker will execute (10, 5, 2) while 1st
worker executes (20).

But for costs such as (4, 4, 4, ... 20 times), the logic would give
us 20 workers, because we want to finish the Append in 4 time units;
and this is what we want to avoid when we go with the
don't-allocate-too-many-workers approach.

>
>
>> BTW all of the above points apply only for non-partial plans.
>
> Indeed. But I think that's going to be a pretty common type of plan,

Yes it is.

> especially if we get partitionwise joins.

About that I am not sure: since we already have support for parallel
joins, wouldn't the join subpaths corresponding to all of the
partitions be partial paths?  I may be wrong about that.

But if the subplans are foreign scans, then yes, all would be
non-partial plans. This may provoke an off-topic discussion, but here,
instead of assigning so many workers to all these foreign plans and
having all those workers wait for the results, a single asynchronous
execution node (which is still in the making) would be desirable,
because it would do the job of all these workers.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-05 Thread Andres Freund
On 2017-04-05 14:52:38 +0530, Amit Khandekar wrote:
> This is what the earlier versions of my patch had done : just add up
> per-subplan parallel_workers (1 for non-partial subplan and
> subpath->parallel_workers for partial subplans) and set this total as
> the Append parallel_workers.

I don't think that's great; consider e.g. the case where you have one
very expensive query and a bunch of cheaper ones. Most of those workers
wouldn't do much while waiting for the expensive query.  What I'm
basically thinking we should do is something like the following
pythonesque pseudocode:

best_nonpartial_cost = -1
best_nonpartial_nworkers = -1

for numworkers in 1...#max workers:
   worker_work = [0 for x in range(0, numworkers)]

   nonpartial_cost += startup_cost * numworkers

   # distribute all nonpartial tasks over workers.  Assign tasks to the
   # worker with the least amount of work already performed.
   for task in all_nonpartial_subqueries:
   least_busy_worker = worker_work.smallest()
   least_busy_worker += task.total_nonpartial_cost

   # the nonpartial cost here is the largest amount any single worker
   # has to perform.
   nonpartial_cost += worker_work.largest()

   total_partial_cost = 0
   for task in all_partial_subqueries:
   total_partial_cost += task.total_nonpartial_cost

   # Compute resources needed by partial tasks. First compute how much
   # cost we can distribute to workers that take shorter than the
   # "busiest" worker doing non-partial tasks.
   remaining_avail_work = 0
   for i in range(0, numworkers):
   remaining_avail_work += worker_work.largest() - worker_work[i]

   # Equally divide up remaining work over all workers
   if remaining_avail_work < total_partial_cost:
  nonpartial_cost += (worker_work.largest - remaining_avail_work) / 
numworkers

   # check if this is the best number of workers
   if best_nonpartial_cost == -1 or best_nonpartial_cost > nonpartial_cost:
  best_nonpartial_cost = worker_work.largest
  best_nonpartial_nworkers = nworkers

Does that make sense?
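
To make the shape of that search concrete, here is a small standalone C
sketch of the same loop (this is not PostgreSQL code; the startup cost
and the interpretation of the partial-work overflow term are assumptions
where the pseudocode leaves room for interpretation, and the non-partial
costs reuse the (20, 10, 5, 2) example discussed elsewhere in the thread):

#include <stdio.h>

#define MAX_WORKERS 8

static double
estimate_completion(const double *nonpartial, int nnonpartial,
                    double total_partial, double startup_cost, int numworkers)
{
    double  work[MAX_WORKERS] = {0};
    double  largest = 0.0;
    double  slack = 0.0;
    double  total;
    int     i;

    /* Greedily assign each non-partial subplan to the least-loaded worker. */
    for (i = 0; i < nnonpartial; i++)
    {
        int     least = 0;
        int     w;

        for (w = 1; w < numworkers; w++)
            if (work[w] < work[least])
                least = w;
        work[least] += nonpartial[i];
    }

    for (i = 0; i < numworkers; i++)
        if (work[i] > largest)
            largest = work[i];

    /* Slack the other workers have before the busiest one finishes. */
    for (i = 0; i < numworkers; i++)
        slack += largest - work[i];

    total = startup_cost * numworkers + largest;

    /* Partial work that doesn't fit into the slack is spread over all workers. */
    if (total_partial > slack)
        total += (total_partial - slack) / numworkers;

    return total;
}

int
main(void)
{
    /* Non-partial subplan costs, sorted by descending cost; no partial work. */
    double  nonpartial[] = {20.0, 10.0, 5.0, 2.0};
    double  total_partial = 0.0;
    double  startup_cost = 0.5;
    double  best_cost = -1.0;
    int     best_workers = 1;
    int     n;

    for (n = 1; n <= MAX_WORKERS; n++)
    {
        double  cost = estimate_completion(nonpartial, 4, total_partial,
                                           startup_cost, n);

        printf("workers=%d  estimated completion=%.2f\n", n, cost);
        if (best_cost < 0 || cost < best_cost)
        {
            best_cost = cost;
            best_workers = n;
        }
    }
    printf("pick %d workers\n", best_workers);
    return 0;
}

As printed, it picks 2 workers for those costs, which matches the hand
calculation done elsewhere in the thread for (20, 10, 5, 2).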


> BTW all of the above points apply only for non-partial plans.

Indeed. But I think that's going to be a pretty common type of plan,
especially if we get partitionwise joins.


Greetings,

Andres Freund




Re: [HACKERS] Parallel Append implementation

2017-04-05 Thread Robert Haas
On Tue, Apr 4, 2017 at 4:13 PM, Andres Freund  wrote:
> I'm quite unconvinced that just throwing a log() in there is the best
> way to combat that.  Modeling the issue of starting more workers through
> tuple transfer, locking, startup overhead costing seems a better to me.

Knock yourself out.  There's no doubt that the way the number of
parallel workers is computed is pretty stupid right now, and it
obviously needs to get a lot smarter before we can consider doing
things like throwing 40 workers at a query.  If you throw 2 or 4
workers at a query and it turns out that it doesn't help much, that's
sad, but if you throw 40 workers at a query and it turns out that it
doesn't help much, or even regresses, that's a lot sadder.  The
existing system does try to model startup and tuple transfer overhead
during costing, but only as a way of comparing parallel plans to each
other or to non-parallel plans, not to work out the right number of
workers.  It also does not model contention, which it absolutely needs
to do.  I was kind of hoping that once the first version of parallel
query was committed, other developers who care about the query planner
would be motivated to improve some of this stuff, but so far that
hasn't really happened.  This release adds a decent number of new
execution capabilities, and there is a lot more work to be done there,
but without some serious work on the planner end of things I fear
we're never going to be able to get more than ~4x speedup out of
parallel query, because we're just too dumb to know how many workers
we really ought to be using.

That having been said, I completely and emphatically disagree that
this patch ought to be required to be an order-of-magnitude smarter
than the existing logic in order to get committed.  There are four
main things that this patch can hope to accomplish:

1. If we've got an Append node with children that have a non-zero
startup cost, it is currently pretty much guaranteed that every worker
will pay the startup cost for every child.  With Parallel Append, we
can spread out the workers across the plans, and once a plan has been
finished by however many workers it got, other workers can ignore it,
which means that its startup cost need not be paid by those workers.
This case will arise a lot more frequently once we have partition-wise
join.

2. When the Append node's children are partial plans, spreading out
the workers reduces contention for whatever locks those workers use to
coordinate access to shared data.

3. If the Append node represents a scan of a partitioned table, and
the partitions are on different tablespaces (or there's just enough
I/O bandwidth available in a single tablespace to read more than one
of them at once without slowing things down), then spreading out the
work gives us I/O parallelism.  This is an area where some
experimentation and benchmarking is needed, because there is a
possibility of regressions if we run several sequential scans on the
same spindle in parallel instead of consecutively.  We might need to
add some logic to try to avoid this, but it's not clear how that logic
should work.

4. If the Append node is derived from a UNION ALL query, we can run
different branches in different processes even if the plans are not
themselves able to be parallelized.  This was proposed by Stephen
among others as an "easy" case for parallelism, which was maybe a tad
optimistic, but it's sad that we're going to release v10 without
having done anything about it.

All of those things (except possibly #3) are wins over the status quo
even if the way we choose the number of workers is still pretty dumb.
It shouldn't get away with being dumber than what we've already got,
but it shouldn't be radically smarter - or even just radically
different because, if it is, then the results you get when you query a
partitioned table will be very different from what you get when you
query an unpartitioned table, which is not sensible.  I very much agree
that doing something smarter than log-scaling on the number of workers
is a good project for somebody to do, but it's not *this* project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-04-05 Thread Amit Khandekar
On 5 April 2017 at 01:43, Andres Freund  wrote:
> On 2017-04-04 08:01:32 -0400, Robert Haas wrote:
>> On Tue, Apr 4, 2017 at 12:47 AM, Andres Freund  wrote:
>> > I don't think the parallel seqscan is comparable in complexity with the
>> > parallel append case.  Each worker there does the same kind of work, and
>> > if one of them is behind, it'll just do less.  But correct sizing will
>> > be more important with parallel-append, because with non-partial
>> > subplans the work is absolutely *not* uniform.
>>
>> Sure, that's a problem, but I think it's still absolutely necessary to
>> ramp up the maximum "effort" (in terms of number of workers)
>> logarithmically.  If you just do it by costing, the winning number of
>> workers will always be the largest number that we think we'll be able
>> to put to use - e.g. with 100 branches of relatively equal cost we'll
>> pick 100 workers.  That's not remotely sane.
>
> I'm quite unconvinced that just throwing a log() in there is the best
> way to combat that.  Modeling the issue of starting more workers through
> tuple transfer, locking, startup overhead costing seems a better to me.
>
> If the goal is to compute the results of the query as fast as possible,
> and to not use more than max_parallel_per_XXX, and it's actually
> beneficial to use more workers, then we should.  Because otherwise you
> really can't use the resources available.
>
> - Andres

This is what the earlier versions of my patch had done : just add up
per-subplan parallel_workers (1 for non-partial subplan and
subpath->parallel_workers for partial subplans) and set this total as
the Append parallel_workers.

Robert had a valid point that this would be inconsistent with the
worker count that we would come up with if it were a single table with
a cost as big as the total cost of all the Append subplans. We were
actually discussing a partitioned table versus the same table
unpartitioned, but I think the same argument goes for a union query
with non-partial plans: if we want to clamp down the number of
workers for a single table for a good reason, we should also
follow that policy and prevent assigning too many workers even for an
Append.

Now, I am not sure of the reason why, for a single-table parallel scan,
we increase the number of workers logarithmically; but I think there might
have been an observation that after a certain number of workers, adding
more workers does not make a significant difference. But this is just
my guess.

If we try to calculate the workers based on each of the subplan costs
rather than just the number of workers, I still think the total worker
count should be a *log* of the total cost, so as to be consistent with
what we did for other scans. Now, log(total_cost) does not increase
significantly with cost: for a cost of 1000 units, log3(cost) will be
about 6, and for a cost of 10,000 units it is about 8, i.e. just 2 more
workers. So, since it's a logarithmic value, we might as well just drop
the cost factor and consider only the number of workers.

But again, if in the future we drop the log() method, then the
above is not valid. But till then, I think we should follow the common
strategy we have been following.

BTW, all of the above points apply only to non-partial plans. For
partial plans, what we have done in the patch is: take the highest of
the per-subplan parallel_workers, and make sure that the Append worker
count is at least as high as this value.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Ashutosh Bapat
On Wed, Apr 5, 2017 at 1:43 AM, Andres Freund  wrote:

> On 2017-04-04 08:01:32 -0400, Robert Haas wrote:
> > On Tue, Apr 4, 2017 at 12:47 AM, Andres Freund 
> wrote:
> > > I don't think the parallel seqscan is comparable in complexity with the
> > > parallel append case.  Each worker there does the same kind of work,
> and
> > > if one of them is behind, it'll just do less.  But correct sizing will
> > > be more important with parallel-append, because with non-partial
> > > subplans the work is absolutely *not* uniform.
> >
> > Sure, that's a problem, but I think it's still absolutely necessary to
> > ramp up the maximum "effort" (in terms of number of workers)
> > logarithmically.  If you just do it by costing, the winning number of
> > workers will always be the largest number that we think we'll be able
> > to put to use - e.g. with 100 branches of relatively equal cost we'll
> > pick 100 workers.  That's not remotely sane.
>
> I'm quite unconvinced that just throwing a log() in there is the best
> way to combat that.  Modeling the issue of starting more workers through
> tuple transfer, locking, startup overhead costing seems a better to me.
>
> If the goal is to compute the results of the query as fast as possible,
> and to not use more than max_parallel_per_XXX, and it's actually
> beneficial to use more workers, then we should.  Because otherwise you
> really can't use the resources available.
>

+1. I had expressed a similar opinion earlier, but yours is better
articulated. Thanks.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Andres Freund
On 2017-04-04 08:01:32 -0400, Robert Haas wrote:
> On Tue, Apr 4, 2017 at 12:47 AM, Andres Freund  wrote:
> > I don't think the parallel seqscan is comparable in complexity with the
> > parallel append case.  Each worker there does the same kind of work, and
> > if one of them is behind, it'll just do less.  But correct sizing will
> > be more important with parallel-append, because with non-partial
> > subplans the work is absolutely *not* uniform.
>
> Sure, that's a problem, but I think it's still absolutely necessary to
> ramp up the maximum "effort" (in terms of number of workers)
> logarithmically.  If you just do it by costing, the winning number of
> workers will always be the largest number that we think we'll be able
> to put to use - e.g. with 100 branches of relatively equal cost we'll
> pick 100 workers.  That's not remotely sane.

I'm quite unconvinced that just throwing a log() in there is the best
way to combat that.  Modeling the issue of starting more workers through
tuple transfer, locking, and startup overhead costing seems better to me.

If the goal is to compute the results of the query as fast as possible,
and to not use more than max_parallel_per_XXX, and it's actually
beneficial to use more workers, then we should.  Because otherwise you
really can't use the resources available.

- Andres




Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Robert Haas
On Mon, Apr 3, 2017 at 4:17 PM, Andres Freund  wrote:
> I'm afraid this is too late for v10 - do you agree?

Yeah, I think so.  The benefit of this will be a lot more once we get
partitionwise join and partitionwise aggregate working, but that
probably won't happen for this release, or at best in limited cases.
And while we might not agree on exactly what work this patch still
needs, I think it does still need some work.  I've moved this to the
next CommitFest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Robert Haas
On Tue, Apr 4, 2017 at 12:47 AM, Andres Freund  wrote:
> I don't think the parallel seqscan is comparable in complexity with the
> parallel append case.  Each worker there does the same kind of work, and
> if one of them is behind, it'll just do less.  But correct sizing will
> be more important with parallel-append, because with non-partial
> subplans the work is absolutely *not* uniform.

Sure, that's a problem, but I think it's still absolutely necessary to
ramp up the maximum "effort" (in terms of number of workers)
logarithmically.  If you just do it by costing, the winning number of
workers will always be the largest number that we think we'll be able
to put to use - e.g. with 100 branches of relatively equal cost we'll
pick 100 workers.  That's not remotely sane.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-04-04 Thread Amit Khandekar
On 4 April 2017 at 01:47, Andres Freund  wrote:
>> +typedef struct ParallelAppendDescData
>> +{
>> + LWLock  pa_lock;/* mutual exclusion to choose 
>> next subplan */
>> + int pa_first_plan;  /* plan to choose while 
>> wrapping around plans */
>> + int pa_next_plan;   /* next plan to choose by any 
>> worker */
>> +
>> + /*
>> +  * pa_finished : workers currently executing the subplan. A worker 
>> which
>> +  * finishes a subplan should set pa_finished to true, so that no new
>> +  * worker picks this subplan. For non-partial subplan, a worker which 
>> picks
>> +  * up that subplan should immediately set to true, so as to make sure
>> +  * there are no more than 1 worker assigned to this subplan.
>> +  */
>> + boolpa_finished[FLEXIBLE_ARRAY_MEMBER];
>> +} ParallelAppendDescData;
>
>
>> +typedef ParallelAppendDescData *ParallelAppendDesc;
>
> Pointer hiding typedefs make this Andres sad.

Yeah ... I was trying to be consistent with other parts of the code where
we have typedefs for both the structure and a pointer to that structure.

>
>
>
>> @@ -291,6 +362,276 @@ ExecReScanAppend(AppendState *node)
>>   if (subnode->chgParam == NULL)
>>   ExecReScan(subnode);
>>   }
>> +
>> + if (padesc)
>> + {
>> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
>> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
>> + }
>> +
>
> Is it actually guaranteed that none of the parallel workers are doing
> something at that point?

ExecReScanAppend() would be called by ExecReScanGather().
ExecReScanGather() shuts down all the parallel workers before calling
its child node (i.e. ExecReScanAppend).


>> +static bool
>> +exec_append_parallel_next(AppendState *state)
>> +{
>> + ParallelAppendDesc padesc = state->as_padesc;
>> + int whichplan;
>> + int initial_plan;
>> + int first_partial_plan = ((Append 
>> *)state->ps.plan)->first_partial_plan;
>> + boolfound;
>> +
>> + Assert(padesc != NULL);
>> +
>> + /* Backward scan is not supported by parallel-aware plans */
>> + Assert(ScanDirectionIsForward(state->ps.state->es_direction));
>> +
>> + /* The parallel leader chooses its next subplan differently */
>> + if (!IsParallelWorker())
>> + return exec_append_leader_next(state);
>
> It's a bit weird that the leader's case does is so separate, and does
> it's own lock acquisition.

Since we wanted to prevent it from taking the most expensive
non-partial plans first, I thought it would be better to keep its logic
simple and separate, so I could not merge it into the main logic.

>
>
>> + found = false;
>> + for (whichplan = initial_plan; whichplan != PA_INVALID_PLAN;)
>> + {
>> + /*
>> +  * Ignore plans that are already done processing. These also 
>> include
>> +  * non-partial subplans which have already been taken by a 
>> worker.
>> +  */
>> + if (!padesc->pa_finished[whichplan])
>> + {
>> + found = true;
>> + break;
>> + }
>> +
>> + /*
>> +  * Note: There is a chance that just after the child plan node 
>> is
>> +  * chosen above, some other worker finishes this node and sets
>> +  * pa_finished to true. In that case, this worker will go 
>> ahead and
>> +  * call ExecProcNode(child_node), which will return NULL tuple 
>> since it
>> +  * is already finished, and then once again this worker will 
>> try to
>> +  * choose next subplan; but this is ok : it's just an extra
>> +  * "choose_next_subplan" operation.
>> +  */
>
> IIRC not all node types are safe against being executed again when
> they've previously returned NULL.  That's why e.g. nodeMaterial.c
> contains the following blurb:
> /*
>  * If necessary, try to fetch another row from the subplan.
>  *
>  * Note: the eof_underlying state variable exists to short-circuit 
> further
>  * subplan calls.  It's not optional, unfortunately, because some plan
>  * node types are not robust about being called again when they've 
> already
>  * returned NULL.
>  */

This scenario is different from the parallel append scenario described
by my comment. A worker sets pa_finished to true only when it itself
gets a NULL tuple for a given subplan. So in
exec_append_parallel_next(), suppose a worker W1 finds a subplan with
pa_finished=false. So it chooses it. Now a different worker W2 sets
this subplan's pa_finished=true because W2 has got a NULL tuple. But
W1 hasn't yet got a NULL tuple. If it had got a NULL tuple earlier, it
would have itself set pa_finished to true, and then it 

Re: [HACKERS] Parallel Append implementation

2017-04-03 Thread Amit Khandekar
Thanks, Andres, for your review comments. I will get back with the other
comments, but meanwhile here are some queries about the particular
comment below ...

On 4 April 2017 at 10:17, Andres Freund  wrote:
> On 2017-04-03 22:13:18 -0400, Robert Haas wrote:
>> On Mon, Apr 3, 2017 at 4:17 PM, Andres Freund  wrote:
>> > Hm.  I'm not really convinced by the logic here.  Wouldn't it be better
>> > to try to compute the minimum total cost across all workers for
>> > 1..#max_workers for the plans in an iterative manner?  I.e. try to map
>> > each of the subplans to 1 (if non-partial) or N workers (partial) using
>> > some fitting algorith (e.g. always choosing the worker(s) that currently
>> > have the least work assigned).  I think the current algorithm doesn't
>> > lead to useful #workers for e.g. cases with a lot of non-partial,
>> > high-startup plans - imo a quite reasonable scenario.

I think I might not have understood this part exactly. Are you saying
we need to consider per-subplan parallel_workers to calculate the total
number of workers for the Append? I also didn't get the point about
non-partial subplans. Can you please explain how many workers you think
should be expected with, say, 7 subplans out of which 3 are non-partial
subplans?

>>
>> Well, that'd be totally unlike what we do in any other case.  We only
>> generate a Parallel Seq Scan plan for a given table with one # of
>> workers, and we cost it based on that.  We have no way to re-cost it
>> if we changed our mind later about how many workers to use.
>> Eventually, we should probably have something like what you're
>> describing here, but in general, not just for this specific case.  One
>> problem, of course, is to avoid having a larger number of workers
>> always look better than a smaller number, which with the current
>> costing model would probably happen a lot.
>
> I don't think the parallel seqscan is comparable in complexity with the
> parallel append case.  Each worker there does the same kind of work, and
> if one of them is behind, it'll just do less.  But correct sizing will
> be more important with parallel-append, because with non-partial
> subplans the work is absolutely *not* uniform.
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-04-03 Thread Andres Freund
On 2017-04-03 22:13:18 -0400, Robert Haas wrote:
> On Mon, Apr 3, 2017 at 4:17 PM, Andres Freund  wrote:
> > Hm.  I'm not really convinced by the logic here.  Wouldn't it be better
> > to try to compute the minimum total cost across all workers for
> > 1..#max_workers for the plans in an iterative manner?  I.e. try to map
> > each of the subplans to 1 (if non-partial) or N workers (partial) using
> > some fitting algorith (e.g. always choosing the worker(s) that currently
> > have the least work assigned).  I think the current algorithm doesn't
> > lead to useful #workers for e.g. cases with a lot of non-partial,
> > high-startup plans - imo a quite reasonable scenario.
> 
> Well, that'd be totally unlike what we do in any other case.  We only
> generate a Parallel Seq Scan plan for a given table with one # of
> workers, and we cost it based on that.  We have no way to re-cost it
> if we changed our mind later about how many workers to use.
> Eventually, we should probably have something like what you're
> describing here, but in general, not just for this specific case.  One
> problem, of course, is to avoid having a larger number of workers
> always look better than a smaller number, which with the current
> costing model would probably happen a lot.

I don't think the parallel seqscan is comparable in complexity with the
parallel append case.  Each worker there does the same kind of work, and
if one of them is behind, it'll just do less.  But correct sizing will
be more important with parallel-append, because with non-partial
subplans the work is absolutely *not* uniform.

Greetings,

Andres Freund




Re: [HACKERS] Parallel Append implementation

2017-04-03 Thread Robert Haas
On Mon, Apr 3, 2017 at 4:17 PM, Andres Freund  wrote:
> Hm.  I'm not really convinced by the logic here.  Wouldn't it be better
> to try to compute the minimum total cost across all workers for
> 1..#max_workers for the plans in an iterative manner?  I.e. try to map
> each of the subplans to 1 (if non-partial) or N workers (partial) using
> some fitting algorith (e.g. always choosing the worker(s) that currently
> have the least work assigned).  I think the current algorithm doesn't
> lead to useful #workers for e.g. cases with a lot of non-partial,
> high-startup plans - imo a quite reasonable scenario.

Well, that'd be totally unlike what we do in any other case.  We only
generate a Parallel Seq Scan plan for a given table with one # of
workers, and we cost it based on that.  We have no way to re-cost it
if we changed our mind later about how many workers to use.
Eventually, we should probably have something like what you're
describing here, but in general, not just for this specific case.  One
problem, of course, is to avoid having a larger number of workers
always look better than a smaller number, which with the current
costing model would probably happen a lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-04-03 Thread Andres Freund
Hi,


On 2017-03-24 21:32:57 +0530, Amit Khandekar wrote:
> diff --git a/src/backend/executor/nodeAppend.c 
> b/src/backend/executor/nodeAppend.c
> index a107545..e9e8676 100644
> --- a/src/backend/executor/nodeAppend.c
> +++ b/src/backend/executor/nodeAppend.c
> @@ -59,9 +59,47 @@
>  
>  #include "executor/execdebug.h"
>  #include "executor/nodeAppend.h"
> +#include "miscadmin.h"
> +#include "optimizer/cost.h"
> +#include "storage/spin.h"
> +
> +/*
> + * Shared state for Parallel Append.
> + *
> + * Each backend participating in a Parallel Append has its own
> + * descriptor in backend-private memory, and those objects all contain
> + * a pointer to this structure.
> + */
> +typedef struct ParallelAppendDescData
> +{
> + LWLock  pa_lock;/* mutual exclusion to choose 
> next subplan */
> + int pa_first_plan;  /* plan to choose while 
> wrapping around plans */
> + int pa_next_plan;   /* next plan to choose by any 
> worker */
> +
> + /*
> +  * pa_finished : workers currently executing the subplan. A worker which
> +  * finishes a subplan should set pa_finished to true, so that no new
> +  * worker picks this subplan. For non-partial subplan, a worker which 
> picks
> +  * up that subplan should immediately set to true, so as to make sure
> +  * there are no more than 1 worker assigned to this subplan.
> +  */
> + boolpa_finished[FLEXIBLE_ARRAY_MEMBER];
> +} ParallelAppendDescData;


> +typedef ParallelAppendDescData *ParallelAppendDesc;

Pointer hiding typedefs make this Andres sad.



> @@ -291,6 +362,276 @@ ExecReScanAppend(AppendState *node)
>   if (subnode->chgParam == NULL)
>   ExecReScan(subnode);
>   }
> +
> + if (padesc)
> + {
> + padesc->pa_first_plan = padesc->pa_next_plan = 0;
> + memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
> + }
> +

Is it actually guaranteed that none of the parallel workers are doing
something at that point?


> +/* 
> + *   exec_append_parallel_next
> + *
> + *   Determine the next subplan that should be executed. Each worker 
> uses a
> + *   shared next_subplan counter index to start looking for 
> unfinished plan,
> + *   executes the subplan, then shifts ahead this counter to the next
> + *   subplan, so that other workers know which next plan to choose. 
> This
> + *   way, workers choose the subplans in round robin order, and thus 
> they
> + *   get evenly distributed among the subplans.
> + *
> + *   Returns false if and only if all subplans are already finished
> + *   processing.
> + * 
> + */
> +static bool
> +exec_append_parallel_next(AppendState *state)
> +{
> + ParallelAppendDesc padesc = state->as_padesc;
> + int whichplan;
> + int initial_plan;
> + int first_partial_plan = ((Append 
> *)state->ps.plan)->first_partial_plan;
> + boolfound;
> +
> + Assert(padesc != NULL);
> +
> + /* Backward scan is not supported by parallel-aware plans */
> + Assert(ScanDirectionIsForward(state->ps.state->es_direction));
> +
> + /* The parallel leader chooses its next subplan differently */
> + if (!IsParallelWorker())
> + return exec_append_leader_next(state);

It's a bit weird that the leader's case is so separate, and does
its own lock acquisition.


> + found = false;
> + for (whichplan = initial_plan; whichplan != PA_INVALID_PLAN;)
> + {
> + /*
> +  * Ignore plans that are already done processing. These also 
> include
> +  * non-partial subplans which have already been taken by a 
> worker.
> +  */
> + if (!padesc->pa_finished[whichplan])
> + {
> + found = true;
> + break;
> + }
> +
> + /*
> +  * Note: There is a chance that just after the child plan node 
> is
> +  * chosen above, some other worker finishes this node and sets
> +  * pa_finished to true. In that case, this worker will go ahead 
> and
> +  * call ExecProcNode(child_node), which will return NULL tuple 
> since it
> +  * is already finished, and then once again this worker will 
> try to
> +  * choose next subplan; but this is ok : it's just an extra
> +  * "choose_next_subplan" operation.
> +  */

IIRC not all node types are safe against being executed again when
they've previously returned NULL.  That's why e.g. nodeMaterial.c
contains the following blurb:
/*
 * If necessary, try to fetch another row from the subplan.
 *
 * Note: the 

Re: [HACKERS] Parallel Append implementation

2017-03-24 Thread Amit Khandekar
On 24 March 2017 at 00:38, Amit Khandekar  wrote:
> On 23 March 2017 at 16:26, Amit Khandekar  wrote:
>> On 23 March 2017 at 05:55, Robert Haas  wrote:
>>>
>>> So in your example we do this:
>>>
>>> C[0] += 20;
>>> C[1] += 16;
>>> C[2] += 10;
>>> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
>>> path to C[2] */
>>> C[2] += 8;
>>> /* after the previous line, C[1] is now the smallest, so add to that
>>> entry next */
>>> C[1] += 3;
>>> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
>>> C[2] += 1;
>>> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>>>
>>> Now we just take the highest entry that appears in any array, which in
>>> this case is C[0], as the total cost.
>>
>> Wow. The way your final result exactly tallies with my algorithm
>> result is very interesting. This looks like some maths or computer
>> science theory that I am not aware.
>>
>> I am currently coding the algorithm using your method.
>

> While I was coding this, I was considering if Path->rows also should
> be calculated similar to total cost for non-partial subpath and total
> cost for partial subpaths. I think for rows, we can just take
> total_rows divided by workers for non-partial paths, and this
> approximation should suffice. It looks odd that it be treated with the
> same algorithm we chose for total cost for non-partial paths.

Attached is patch v12, where the Path->rows calculation for non-partial
paths is kept separate from the way the total cost is computed for
non-partial paths. rows for non-partial paths is calculated as total_rows
divided by the number of workers, as an approximation, and then the rows
for partial paths are added one by one.
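As a rough standalone sketch of that rows approximation (illustrative
only; the helper name and example numbers are made up, this is not the
patch's code):

#include <stdio.h>

/* Estimated output rows of a parallel Append, per the approximation
 * described above: non-partial subpath rows are divided by the number of
 * workers, partial subpath rows (already per-worker) are added as they are. */
static double
append_rows_estimate(const double *nonpartial_rows, int n_nonpartial,
                     const double *partial_rows, int n_partial,
                     int nworkers)
{
    double rows = 0;

    for (int i = 0; i < n_nonpartial; i++)
        rows += nonpartial_rows[i] / nworkers;
    for (int i = 0; i < n_partial; i++)
        rows += partial_rows[i];
    return rows;
}

int
main(void)
{
    double nonpartial_rows[] = {10000, 4000};
    double partial_rows[] = {2500, 2500};

    /* (10000 + 4000) / 2 + 2500 + 2500 = 12000 */
    printf("%g\n", append_rows_estimate(nonpartial_rows, 2,
                                        partial_rows, 2, 2));
    return 0;
}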

>
> Meanwhile, attached is a WIP patch v10. The only change in this patch
> w.r.t. the last patch (v9) is that this one has a new function defined
> append_nonpartial_cost(). Just sending this to show how the algorithm
> looks like; haven't yet called it.

Now append_nonpartial_cost() is used, and it is tested.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


ParallelAppend_v12.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-24 Thread Amit Khandekar
On 24 March 2017 at 13:11, Rajkumar Raghuwanshi
 wrote:
> I have given patch on latest pg sources (on commit
> 457a4448732881b5008f7a3bcca76fc299075ac3). configure and make all
> install ran successfully, but initdb failed with below error.

> FailedAssertion("!(LWLockTranchesAllocated >=
> LWTRANCHE_FIRST_USER_DEFINED)", File: "lwlock.c", Line: 501)

Thanks for reporting, Rajkumar.

With the new PARALLEL_APPEND tranche ID, the LWTRANCHE_FIRST_USER_DEFINED
value has crossed 64, so we need to increase the initial size of
LWLockTrancheArray from 64 to 128. Attached is the updated patch.


ParallelAppend_v11.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-24 Thread Rajkumar Raghuwanshi
On Fri, Mar 24, 2017 at 12:38 AM, Amit Khandekar  wrote:
> Meanwhile, attached is a WIP patch v10. The only change in this patch
> w.r.t. the last patch (v9) is that this one has a new function defined
> append_nonpartial_cost(). Just sending this to show how the algorithm
> looks like; haven't yet called it.
>

Hi,

I have given patch on latest pg sources (on commit
457a4448732881b5008f7a3bcca76fc299075ac3). configure and make all
install ran successfully, but initdb failed with below error.

[edb@localhost bin]$ ./initdb -D data
The files belonging to this database system will be owned by user "edb".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory data ... ok
creating subdirectories ... ok
selecting default max_connections ... sh: line 1:  3106 Aborted
 (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=100 -c shared_buffers=1000 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1:  3112 Aborted (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=50 -c shared_buffers=500 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1:  3115 Aborted (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=40 -c shared_buffers=400 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1:  3118 Aborted (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=30 -c shared_buffers=300 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1:  3121 Aborted (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=20 -c shared_buffers=200 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
sh: line 1:  3124 Aborted (core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=10 -c shared_buffers=100 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
10
selecting default shared_buffers ... sh: line 1:  3127 Aborted
(core dumped)
"/home/edb/WORKDB/PG3/postgresql/inst/bin/postgres" --boot -x0 -F -c
max_connections=10 -c shared_buffers=16384 -c
dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
400kB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... TRAP:
FailedAssertion("!(LWLockTranchesAllocated >=
LWTRANCHE_FIRST_USER_DEFINED)", File: "lwlock.c", Line: 501)
child process was terminated by signal 6: Aborted
initdb: removing data directory "data"

[edb@localhost bin]$

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation




Re: [HACKERS] Parallel Append implementation

2017-03-23 Thread Amit Khandekar
On 23 March 2017 at 16:26, Amit Khandekar  wrote:
> On 23 March 2017 at 05:55, Robert Haas  wrote:
>> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar 
wrote:
>>> Attached is the updated patch that handles the changes for all the
>>> comments except the cost changes part. Details about the specific
>>> changes are after the cost-related points discussed below.
>>>
>>> For non-partial paths, I was checking following 3 options :
>>>
>>> Option 1. Just take the sum of total non-partial child costs and
>>> divide it by number of workers. It seems to be getting close to the
>>> actual cost.
>>
>> If the costs for all children are about equal, then that works fine.
>> But when they are very unequal, then it's highly misleading.
>>
>>> Option 2. Calculate exact cost by an algorithm which I mentioned
>>> before, which is pasted below for reference :
>>> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
>>> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
>>> times remaining are :
>>> 10  6  0 8 3 1
>>> After 6 units (minimum of 10, 06, 08), the times remaining are :
>>> 4  0  0 2 3 1
>>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>>>  2  0  0 0 1 1
>>> After 1 units (minimum of 2, 1, 1), the times remaining are :
>>>  1  0  0 0 0 0
>>> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>>>  0  0  0 0 0 0
>>> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
>>
>
>> This gives the same answer as what I was proposing
>
> Ah I see.
>
>> but I believe it's more complicated to compute.
> Yes a bit, particularly because in my algorithm, I would have to do
> 'n' subtractions each time, in case of 'n' workers. But it looked more
> natural because it follows exactly the way we manually calculate.
>
>> The way my proposal would work in this
>> case is that we would start with an array C[3] (since there are three
>> workers], with all entries 0.  Logically C[i] represents the amount of
>> work to be performed by worker i.  We add each path in turn to the
>> worker whose array entry is currently smallest; in the case of a tie,
>> just pick the first such entry.
>>
>> So in your example we do this:
>>
>> C[0] += 20;
>> C[1] += 16;
>> C[2] += 10;
>> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
>> path to C[2] */
>> C[2] += 8;
>> /* after the previous line, C[1] is now the smallest, so add to that
>> entry next */
>> C[1] += 3;
>> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
>> C[2] += 1;
>> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>>
>> Now we just take the highest entry that appears in any array, which in
>> this case is C[0], as the total cost.
>
> Wow. The way your final result exactly tallies with my algorithm
> result is very interesting. This looks like some maths or computer
> science theory that I am not aware.
>
> I am currently coding the algorithm using your method.

While I was coding this, I was considering whether Path->rows should also
be calculated the way the total cost is, separately for non-partial and
partial subpaths. I think for rows we can just take total_rows divided by
the number of workers for non-partial paths, and this approximation should
suffice. It seems odd to treat it with the same algorithm we chose for the
total cost of non-partial paths.

Meanwhile, attached is a WIP patch v10. The only change in this patch
w.r.t. the last patch (v9) is that this one has a new function defined
append_nonpartial_cost(). Just sending this to show how the algorithm
looks like; haven't yet called it.


ParallelAppend_v10_WIP.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-23 Thread Amit Khandekar
On 23 March 2017 at 05:55, Robert Haas  wrote:
> On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar  
> wrote:
>> Attached is the updated patch that handles the changes for all the
>> comments except the cost changes part. Details about the specific
>> changes are after the cost-related points discussed below.
>>
>> For non-partial paths, I was checking following 3 options :
>>
>> Option 1. Just take the sum of total non-partial child costs and
>> divide it by number of workers. It seems to be getting close to the
>> actual cost.
>
> If the costs for all children are about equal, then that works fine.
> But when they are very unequal, then it's highly misleading.
>
>> Option 2. Calculate exact cost by an algorithm which I mentioned
>> before, which is pasted below for reference :
>> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
>> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
>> times remaining are :
>> 10  6  0 8 3 1
>> After 6 units (minimum of 10, 06, 08), the times remaining are :
>> 4  0  0 2 3 1
>> After 2 units (minimum of 4, 2, 3), the times remaining are :
>>  2  0  0 0 1 1
>> After 1 units (minimum of 2, 1, 1), the times remaining are :
>>  1  0  0 0 0 0
>> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>>  0  0  0 0 0 0
>> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
>

> This gives the same answer as what I was proposing

Ah I see.

> but I believe it's more complicated to compute.
Yes, a bit, particularly because in my algorithm I would have to do
'n' subtractions in each step for 'n' workers. But it looked more
natural because it follows exactly the way we would calculate it manually.

> The way my proposal would work in this
> case is that we would start with an array C[3] (since there are three
> workers], with all entries 0.  Logically C[i] represents the amount of
> work to be performed by worker i.  We add each path in turn to the
> worker whose array entry is currently smallest; in the case of a tie,
> just pick the first such entry.
>
> So in your example we do this:
>
> C[0] += 20;
> C[1] += 16;
> C[2] += 10;
> /* C[2] is smaller than C[0] or C[1] at this point, so we add the next
> path to C[2] */
> C[2] += 8;
> /* after the previous line, C[1] is now the smallest, so add to that
> entry next */
> C[1] += 3;
> /* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
> C[2] += 1;
> /* final result: C[0] = 20, C[1] = 19, C[2] = 19 */
>
> Now we just take the highest entry that appears in any array, which in
> this case is C[0], as the total cost.

Wow. The way your final result exactly tallies with my algorithm
result is very interesting. This looks like some maths or computer
science theory that I am not aware of.

I am currently coding the algorithm using your method. Meanwhile
attached is a patch that takes care of your other comments, details of
which are below...

>
> In my previous review, I said that you should "define and document a
> new builtin tranche ID"; you did the first but not the second.  See
> the table in monitoring.sgml.

Yeah, I tried to search for how TBM did it in the source, but I guess I
didn't run the "git grep" commands correctly, so the results did not
include monitoring.sgml, and I thought maybe you meant something else by
"document".

Added changes in monitoring.sgml now.

>
> Definition of exec_append_goto_next_plan should have a line break
> after the return type, per usual PostgreSQL style rules.

Oops. Done.

>
> - * initialize to scan first subplan
> + * In case it's a sequential Append, initialize to scan first subplan.
>
> This comment is confusing because the code is executed whether it's
> parallel or not.  I think it might be better to write something like
> "initialize to scan first subplan (but note that we'll override this
> later in the case of a parallel append)"
Done.

>
>  /*
> + * Check if we are already finished plans from parallel append. This
> + * can happen if all the subplans are finished when this worker
> + * has not even started returning tuples.
> + */
> +if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> +return ExecClearTuple(node->ps.ps_ResultTupleSlot);
>
> There seems to be no reason why this couldn't be hoisted out of the
> loop.  Actually, I think Ashutosh pointed this out before, but I
> didn't understand at that time what his point was.  Looking back, I
> see that he also pointed out that the as_padesc test isn't necessary,
> which is also true.

I am assuming that both your concern and Ashutosh's is that this check
will be executed for *each* tuple returned, which needs to be avoided.
Actually, just moving it out of the loop is not going to solve the
runs-for-each-tuple issue; it will still execute for each tuple. But on
second thought, I now agree this can be taken out of the loop anyway,
not to solve the per-tuple issue, but because it need not be

Re: [HACKERS] Parallel Append implementation

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar  wrote:
> Attached is the updated patch that handles the changes for all the
> comments except the cost changes part. Details about the specific
> changes are after the cost-related points discussed below.
>
> For non-partial paths, I was checking following 3 options :
>
> Option 1. Just take the sum of total non-partial child costs and
> divide it by number of workers. It seems to be getting close to the
> actual cost.

If the costs for all children are about equal, then that works fine.
But when they are very unequal, then it's highly misleading.

> Option 2. Calculate exact cost by an algorithm which I mentioned
> before, which is pasted below for reference :
> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
> times remaining are :
> 10  6  0 8 3 1
> After 6 units (minimum of 10, 06, 08), the times remaining are :
> 4  0  0 2 3 1
> After 2 units (minimum of 4, 2, 3), the times remaining are :
>  2  0  0 0 1 1
> After 1 units (minimum of 2, 1, 1), the times remaining are :
>  1  0  0 0 0 0
> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>  0  0  0 0 0 0
> Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

This gives the same answer as what I was proposing, but I believe it's
more complicated to compute.  The way my proposal would work in this
case is that we would start with an array C[3] (since there are three
workers], with all entries 0.  Logically C[i] represents the amount of
work to be performed by worker i.  We add each path in turn to the
worker whose array entry is currently smallest; in the case of a tie,
just pick the first such entry.

So in your example we do this:

C[0] += 20;
C[1] += 16;
C[2] += 10;
/* C[2] is smaller than C[0] or C[1] at this point, so we add the next
path to C[2] */
C[2] += 8;
/* after the previous line, C[1] is now the smallest, so add to that
entry next */
C[1] += 3;
/* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
C[2] += 1;
/* final result: C[0] = 20, C[1] = 19, C[2] = 19 */

Now we just take the highest entry that appears in any array, which in
this case is C[0], as the total cost.
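Spelled out as code, that greedy assignment is short; the following
standalone sketch (illustrative only, not the patch or any planner code)
reproduces the worked example above and prints 20:

#include <stdio.h>

/* Greedy estimate of the non-partial Append cost: add each subpath cost to
 * the currently least-loaded worker, then take the maximum worker total. */
static double
nonpartial_append_cost(const double *costs, int npaths, int nworkers)
{
    double load[64] = {0};          /* demo assumption: nworkers <= 64 */
    double max = 0;

    for (int i = 0; i < npaths; i++)
    {
        int min_w = 0;

        for (int w = 1; w < nworkers; w++)
            if (load[w] < load[min_w])
                min_w = w;
        load[min_w] += costs[i];
    }
    for (int w = 0; w < nworkers; w++)
        if (load[w] > max)
            max = load[w];
    return max;
}

int
main(void)
{
    double costs[] = {20, 16, 10, 8, 3, 1};

    /* Ends with C[0] = 20, C[1] = 19, C[2] = 19, so this prints 20. */
    printf("%g\n", nonpartial_append_cost(costs, 6, 3));
    return 0;
}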

Comments on this latest version:

In my previous review, I said that you should "define and document a
new builtin tranche ID"; you did the first but not the second.  See
the table in monitoring.sgml.

Definition of exec_append_goto_next_plan should have a line break
after the return type, per usual PostgreSQL style rules.

- * initialize to scan first subplan
+ * In case it's a sequential Append, initialize to scan first subplan.

This comment is confusing because the code is executed whether it's
parallel or not.  I think it might be better to write something like
"initialize to scan first subplan (but note that we'll override this
later in the case of a parallel append)"

 /*
+ * Check if we are already finished plans from parallel append. This
+ * can happen if all the subplans are finished when this worker
+ * has not even started returning tuples.
+ */
+if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
+return ExecClearTuple(node->ps.ps_ResultTupleSlot);

There seems to be no reason why this couldn't be hoisted out of the
loop.  Actually, I think Ashutosh pointed this out before, but I
didn't understand at that time what his point was.  Looking back, I
see that he also pointed out that the as_padesc test isn't necessary,
which is also true.

+if (node->as_padesc)
+node->as_padesc->pa_finished[node->as_whichplan] = true;

I think you should move this logic inside exec_append_parallel_next.
That would avoid testing node->pa_desc an extra time for non-parallel
append.  I note that the comment doesn't explain why it's safe to do
this without taking the lock.  I think we could consider doing it with
the lock held, but it probably is safe, because we're only setting it
from false to true.  If someone else does the same thing, that won't
hurt anything, and if someone else fails to see our update, then the
worst-case scenario is that they'll try to execute a plan that has no
chance of returning any more rows.  That's not so bad.  Actually,
looking further, you do have a comment explaining that, but it's in
exec_append_parallel_next() where the value is used, rather than here.

+memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
+
+shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
+node->as_padesc = padesc;

Putting the shm_toc_insert call after we fully initialize the
structure seems better than putting it after we've done some of the
initialization and before we've done the rest.

+/* Choose the optimal subplan to be executed. */

I think the word "first" would be more accurate than "optimal".  We
can only hope to pick the optimal one, but whichever one we pick 

Re: [HACKERS] Parallel Append implementation

2017-03-22 Thread Amit Khandekar
Attached is the updated patch that handles the changes for all the
comments except the cost changes part. Details about the specific
changes are after the cost-related points discussed below.

>> I wanted to take into account per-subpath parallel_workers for total
>> cost of Append. Suppose the partial subpaths have per worker total
>> costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
>> Append workers available. So according to what you say, the total cost
>> is 9. With per-subplan parallel_workers taken into account, total cost
>> = (3*2 + 3*8 + 3*4)/2 = 21.
> But that case never happens, because the parallel workers for the
> append is always at least as large as the number of workers for any
> single child.

Yeah, that's right. I will use this approach for partial paths.


For non-partial paths, I was checking following 3 options :

Option 1. Just take the sum of total non-partial child costs and
divide it by number of workers. It seems to be getting close to the
actual cost.

Option 2. Calculate exact cost by an algorithm which I mentioned
before, which is pasted below for reference :
Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
times remaining are :
10  6  0 8 3 1
After 6 units (minimum of 10, 06, 08), the times remaining are :
4  0  0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
 2  0  0 0 1 1
After 1 units (minimum of 2, 1, 1), the times remaining are :
 1  0  0 0 0 0
After 1 units (minimum of 1, 0 , 0), the times remaining are :
 0  0  0 0 0 0
Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20

Option 3. Get some approximation formula like you suggested. I am also
looking for such formula, just that some things are not clear to me.
The discussion of the same is below ...
>>> 2. Next, estimate the cost of the non-partial paths.  To do this, make
>>> an array of Cost of that length and initialize all the elements to
>>> zero, then add the total cost of each non-partial plan in turn to the
>>> element of the array with the smallest cost, and then take the maximum
>>> of the array elements as the total cost of the non-partial plans.  Add
>>> this to the result from step 1 to get the total cost.
>>
>> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
>> 15) , and so the max is 15 ? I surely am misinterpreting this.
> No.  If you have costs 8, 5, and 2 and only one process, cost is 15.
> If you have two processes then for costing purposes you assume worker
> 1 will execute the first path (cost 8) and worker 2 will execute the
> other two (cost 5 + 2 = 7), so the total cost is 8.  If you have three
> workers, the cost will still be 8, because there's no way to finish
> the cost-8 path in less than 8 units of work.

So the part that you suggested about adding up total cost in turn to
the smallest cost; this suggestion applies to only 1 worker right ?
For more than worker, are you suggesting to use some algorithm similar
to the one I suggested in option 2 above ? If yes, it would be great
if you again describe how that works for multiple workers. Or is it
that you were suggesting some simple approximate arithmetic that
applies to multiple workers ?
Like I mentioned, I will be happy to get such simple approximation
arithmetic that can be applied for multiple worker case. The one logic
I suggested in option 2 is something we can keep as the last option.
And option 1 is also an approximation but we would like to have a
better approximation. So wanted to clear my queries regarding option
3.

--

Details about all the remaining changes in updated patch are below ...

On 20 March 2017 at 17:29, Robert Haas  wrote:
> On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar  
> wrote:
>>> - The substantive changes in add_paths_to_append_rel don't look right
>>> either.  It's not clear why accumulate_partialappend_subpath is
>>> getting called even in the non-enable_parallelappend case.  I don't
>>> think the logic for the case where we're not generating a parallel
>>> append path needs to change at all.
>>
>> When accumulate_partialappend_subpath() is called for a childrel with
>> a partial path, it works just like accumulate_append_subpath() when
>> enable_parallelappend is false. That's why, for partial child path,
>> the same function is called irrespective of parallel-append or
>> non-parallel-append case. May be mentioning this in comments should
>> suffice here ?
>
> I don't get it.  If you can get the same effect by changing something
> or not changing it, presumably it'd be better to not change it.   We
> try not to change things just because we can; the change should be an
> improvement in some way.
>
>>> - When parallel append is enabled, I think add_paths_to_append_rel
>>> should still consider all the same paths that it does today, plus one
>>> extra.  The new path is a parallel append path where each subpath is
>>> the 

Re: [HACKERS] Parallel Append implementation

2017-03-20 Thread Robert Haas
On Fri, Mar 17, 2017 at 1:12 PM, Amit Khandekar  wrote:
>> - The substantive changes in add_paths_to_append_rel don't look right
>> either.  It's not clear why accumulate_partialappend_subpath is
>> getting called even in the non-enable_parallelappend case.  I don't
>> think the logic for the case where we're not generating a parallel
>> append path needs to change at all.
>
> When accumulate_partialappend_subpath() is called for a childrel with
> a partial path, it works just like accumulate_append_subpath() when
> enable_parallelappend is false. That's why, for partial child path,
> the same function is called irrespective of parallel-append or
> non-parallel-append case. May be mentioning this in comments should
> suffice here ?

I don't get it.  If you can get the same effect by changing something
or not changing it, presumably it'd be better to not change it.   We
try not to change things just because we can; the change should be an
improvement in some way.

>> - When parallel append is enabled, I think add_paths_to_append_rel
>> should still consider all the same paths that it does today, plus one
>> extra.  The new path is a parallel append path where each subpath is
>> the cheapest subpath for that childrel, whether partial or
>> non-partial.  If !enable_parallelappend, or if all of the cheapest
>> subpaths are partial, then skip this.  (If all the cheapest subpaths
>> are non-partial, it's still potentially useful.)
>
> In case of all-partial childrels, the paths are *exactly* same as
> those that would have been created for enable_parallelappend=off. The
> extra path is there for enable_parallelappend=on only when one or more
> of the child rels do not have partial paths. Does this make sense ?

No, I don't think so.  Imagine that we have three children, A, B, and
C.  The cheapest partial paths have costs of 10,000 each.  A, however,
has a non-partial path with a cost of 1,000.  Even though A has a
partial path, we still want to consider a parallel append using the
non-partial path because it figures to be hugely faster.

> The
> Path->total_cost for a partial path is *always* per-worker cost, right
> ? Just want to confirm this assumption of mine.

Yes.

>> Also, it
>> could be smarter about what happens with the costs of non-partial
>> paths. I suggest the following algorithm instead.
>>
>> 1. Add up all the costs of the partial paths.  Those contribute
>> directly to the final cost of the Append.  This ignores the fact that
>> the Append may escalate the parallel degree, but I think we should
>> just ignore that problem for now, because we have no real way of
>> knowing what the impact of that is going to be.
>
> I wanted to take into account per-subpath parallel_workers for total
> cost of Append. Suppose the partial subpaths have per worker total
> costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
> Append workers available. So according to what you say, the total cost
> is 9. With per-subplan parallel_workers taken into account, total cost
> = (3*2 + 3*8 + 3*4)/2 = 21.

But that case never happens, because the parallel workers for the
append is always at least as large as the number of workers for any
single child.

> May be I didn't follow exactly what you suggested. Your logic is not
> taking into account number of workers ? I am assuming you are
> calculating per-worker total cost here.
>>
>> 2. Next, estimate the cost of the non-partial paths.  To do this, make
>> an array of Cost of that length and initialize all the elements to
>> zero, then add the total cost of each non-partial plan in turn to the
>> element of the array with the smallest cost, and then take the maximum
>> of the array elements as the total cost of the non-partial plans.  Add
>> this to the result from step 1 to get the total cost.
>
> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
> 15) , and so the max is 15 ? I surely am misinterpreting this.

No.  If you have costs 8, 5, and 2 and only one process, cost is 15.
If you have two processes then for costing purposes you assume worker
1 will execute the first path (cost 8) and worker 2 will execute the
other two (cost 5 + 2 = 7), so the total cost is 8.  If you have three
workers, the cost will still be 8, because there's no way to finish
the cost-8 path in less than 8 units of work.

>> - In get_append_num_workers, instead of the complicated formula with
>> log() and 0.693, just add the list lengths and call fls() on the
>> result.  Integer arithmetic FTW!
>
> Yeah fls() could be used. BTW I just found that costsize.c already has
> this defined in the same way I did:
> #define LOG2(x)  (log(x) / 0.693147180559945)
> May be we need to shift this to some common header file.

LOG2() would make sense if you're working with a value represented as
a double, but if you have an integer input, I think fls() is better.
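As a rough illustration of the fls() idea (a standalone sketch with made-up
subpath counts; PostgreSQL carries its own fls() in src/port, so a portable
stand-in is used here):

#include <stdio.h>

/* Portable stand-in for fls(): 1-based position of the most significant
 * set bit, 0 for an input of 0. */
static int
my_fls(int x)
{
    int pos = 0;

    while (x > 0)
    {
        pos++;
        x >>= 1;
    }
    return pos;
}

int
main(void)
{
    int npartial = 3, nnonpartial = 5;

    /* Worker count grows roughly logarithmically with the subpath count:
     * fls(8) = 4. */
    printf("%d workers for %d subpaths\n",
           my_fls(npartial + nnonpartial), npartial + nnonpartial);
    return 0;
}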

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel Append implementation

2017-03-20 Thread Amit Khandekar
>> 2. Next, estimate the cost of the non-partial paths.  To do this, make
>> an array of Cost of that length and initialize all the elements to
>> zero, then add the total cost of each non-partial plan in turn to the
>> element of the array with the smallest cost, and then take the maximum
>> of the array elements as the total cost of the non-partial plans.  Add
>> this to the result from step 1 to get the total cost.
>
> So with costs (8, 5, 2), add 8 and 5 to 2 so that it becomes (8, 5,
> 15) , and so the max is 15 ? I surely am misinterpreting this.
>
> Actually, I couldn't come up with a general formula to find the
> non-partial paths total cost, given the per-subplan cost and number of
> workers. I mean, we can manually find out the total cost, but turning
> it into a formula seems quite involved. We can even do a dry-run of
> workers consuming each of the subplan slots and find the total time
> units taken, but finding some approximation seemed ok.
>
> For e.g. we can manually find total time units taken for following :
> costs (8, 2, 2, 2) with 2 workers : 8
> costs (6, 6, 4, 1) with 2 workers : 10.
> costs (6, 6, 4, 1) with 3 workers : 6.
>
> But coming up with an algorithm or a formula didn't look worth it. So I
> just did the total cost and divided it by workers. And besides that,
> took the maximum of the 1st plan cost (since it is the highest) and
> the average of total. I understand it would be too much approximation
> for some cases, but another thing is, we don't know how to take into
> account some of the workers shifting to partial workers. So the shift
> may be quite fuzzy since all workers may not shift to partial plans
> together.


For non-partial paths, I did some comparison between the actual cost
and the cost obtained by adding up the per-subpath figures and dividing
by the number of workers. In the cases below, they do not differ
significantly. Here are the figures:

Case 1 :
Cost units of subpaths : 20 16 10 8 3 1.
Workers : 3
Actual total time to finish all workers : 20.
total/workers: 16.

Case 2 :
Cost units of subpaths : 20 16 10 8 3 1.
Workers : 2
Actual total time to finish all workers : 34.
total/workers: 32.

Case 3 :
Cost units of subpaths : 5 3 3 3 3
Workers : 3
Actual total time to finish all workers : 6
total/workers: 5.6

One more thing observed is that, in all of the above cases, all the
workers finish at more or less the same time.

So this method seems to compare well with the actual cost. The average
comes out a little less than the actual. But I think what I need to
correct in the patch is to calculate separate per-worker costs for the
non-partial and the partial paths, and add them. This will give us the
per-worker total cost, which is what the partial Append cost will be.
Right now I just added all the costs together.

There can be some extreme cases, such as (5, 1, 1, 1, 1, 1) with 6
workers, where it will take at least 5 units but the average is around 2.
For that, we can clamp the cost up to the first path's cost, so that, for
example, it does not go below 5 in this case.

Actually, I have devised an algorithm to calculate the exact time at
which all the workers finish the non-partial subpaths. But I think it
does not make sense to apply it, because it may be too much calculation
cost for hundreds of paths.

But anyways, for archival purpose, here is the algorithm :

Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
After 10 units (this is minimum of 20, 16, 10), the times remaining are :
10  6  0 8 3 1
After 6 units (minimum of 10, 06, 08), the times remaining are :
4  0  0 2 3 1
After 2 units (minimum of 4, 2, 3), the times remaining are :
 2  0  0 0 1 1
After 1 units (minimum of 2, 1, 1), the times remaining are :
 1  0  0 0 0 0
After 1 units (minimum of 1, 0 , 0), the times remaining are :
 0  0  0 0 0 0
Now add up above time chunks : 10 + 6 + 2 + 1 + 1 = 20
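For completeness, that archived algorithm can also be run directly; this
standalone sketch (illustrative only, not patch code) reproduces the 20
for the example above:

#include <stdio.h>

/* In each round the active subpaths are the first 'nworkers' unfinished
 * ones in list order; everyone runs for the smallest remaining amount,
 * finished entries drop out, and the elapsed chunks are summed. */
static double
workers_finish_time(double *rem, int npaths, int nworkers)
{
    double elapsed = 0;

    while (npaths > 0)
    {
        int    nactive = npaths < nworkers ? npaths : nworkers;
        double delta = rem[0];
        int    i, j;

        for (i = 1; i < nactive; i++)
            if (rem[i] < delta)
                delta = rem[i];
        elapsed += delta;

        /* Subtract the chunk and compact out finished subpaths. */
        for (i = 0; i < nactive; i++)
            rem[i] -= delta;
        for (i = j = 0; i < npaths; i++)
            if (rem[i] > 0)
                rem[j++] = rem[i];
        npaths = j;
    }
    return elapsed;
}

int
main(void)
{
    double rem[] = {20, 16, 10, 8, 3, 1};

    printf("%g\n", workers_finish_time(rem, 6, 3));     /* prints 20 */
    return 0;
}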

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-17 Thread Peter Geoghegan
On Fri, Mar 17, 2017 at 10:12 AM, Amit Khandekar  wrote:
> Yeah, I was in double minds as to whether to do the
> copy-to-array-and-qsort thing, or should just write the same number of
> lines of code to manually do an insertion sort. Actually I was
> searching if we already have a linked list sort, but it seems we don't
> have. Will do the qsort now since it would be faster.

relcache.c does an insertion sort with a list of OIDs. See insert_ordered_oid().


-- 
Peter Geoghegan




Re: [HACKERS] Parallel Append implementation

2017-03-17 Thread Amit Khandekar
On 16 March 2017 at 18:18, Ashutosh Bapat
 wrote:
> + * Check if we are already finished plans from parallel append. This
> + * can happen if all the subplans are finished when this worker
> + * has not even started returning tuples.
> + */
> +if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
> +return ExecClearTuple(node->ps.ps_ResultTupleSlot);
> From the comment, it looks like this condition will be encountered before the
> backend returns any tuple. But this code is part of the loop which returns the
> tuples. Shouldn't this be outside the loop? Why do we want to check a 
> condition
> for every row returned when the condition can happen only once and that too
> before returning any tuple?

Given the way ExecProcNode() gets called, there is no separate special
code path that gets invoked instead of ExecProcNode() when a tuple is
fetched for the first time. I mean, we cannot prevent ExecProcNode()
from getting called when as_whichplan is invalid right from the
beginning.

One thing we could do is have a special slot in AppendState->as_plan[]
containing a dummy execution node that just returns a NULL tuple, and
initially make as_whichplan point to this slot. But I think it is not
worth doing this.

We can instead reduce the if condition to:

    if (node->as_whichplan == PA_INVALID_PLAN)
    {
        Assert(node->as_padesc != NULL);
        return ExecClearTuple(node->ps.ps_ResultTupleSlot);
    }

BTW, regarding the loop you mentioned that returns tuples: the loop is
not for returning tuples; it is for iterating to the next subplan. Even
if we take the condition out and put it at the beginning of ExecAppend,
the issue will remain.

>
> Why do we need following code in both ExecAppendInitializeWorker() and
> ExecAppendInitializeDSM()? Both of those things happen before starting the
> actual execution, so one of those should suffice?
> +/* Choose the optimal subplan to be executed. */
> +(void) parallel_append_next(node);

ExecAppendInitializeWorker() is for the worker to attach to (and then
initialize its own local data for) the DSM area created and shared by
ExecAppendInitializeDSM() in the backend. But both the worker and the
backend need to initialize their own as_whichplan to the next subplan.

>
> There is no pa_num_worker now, so probably this should get updated. Per 
> comment
> we should also get rid of SpinLockAcquire() and SpinLockRelease()?
> + *purpose. The spinlock is used so that it does not change the
> + *pa_num_workers field while workers are choosing the next node.
Will do this.

>
> BTW, sa_finished seems to be a misnomor. The plan is not finished yet, but it
> wants no more workers. So, should it be renamed as sa_no_new_workers or
> something like that?

Actually in this context, "finished" means "we are done with this subplan".

>
> In parallel_append_next() we shouldn't need to call goto_next_plan() twice. If
> the plan indicated by pa_next_plan is finished, all the plans must have
> finished. This should be true if we set pa_next_plan to 0 at the time of
> initialization. Any worker picking up pa_next_plan will set it to the next
> valid plan. So the next worker asking for plan should pick pa_next_plan and
> set it to the next one and so on.

The current patch does not call it twice, but I might have overlooked
something. Let me know if I have.

>
> I am wondering whether goto_next_plan() can be simplified as some modulo
> arithmetic e.g. (whichplan - first_plan)++ % (last_plan - first_plan)
> + first_plan.

Hmm. IMHO that seems like too much calculation just for shifting to the
next array element.
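For reference, the modulo idea being discussed would look roughly like the
standalone sketch below (hypothetical helper name, assuming an inclusive
[first_plan, last_plan] range; not patch code):

#include <stdio.h>

/* Advance whichplan within [first_plan, last_plan], wrapping back to
 * first_plan after the last plan. */
static int
advance_plan(int whichplan, int first_plan, int last_plan)
{
    int nplans = last_plan - first_plan + 1;

    return first_plan + (whichplan - first_plan + 1) % nplans;
}

int
main(void)
{
    int p = 2;

    /* With plans 2..5, starting from 2 this prints 3 4 5 2 3. */
    for (int i = 0; i < 5; i++)
    {
        p = advance_plan(p, 2, 5);
        printf("%d ", p);
    }
    printf("\n");
    return 0;
}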




Re: [HACKERS] Parallel Append implementation

2017-03-17 Thread Amit Khandekar
On 17 March 2017 at 01:37, Robert Haas  wrote:
> - You've added a GUC (which is good) but not documented it (which is
> bad) or added it to postgresql.conf.sample (also bad).
>
> - You've used a loop inside a spinlock-protected critical section,
> which is against project policy.  Use an LWLock; define and document a
> new builtin tranche ID.
>
> - The comment for pa_finished claims that it is the number of workers
> executing the subplan, but it's a bool, not a count; I think this
> comment is just out of date.

Yes, agreed. Will fix the above.

>
> - paths_insert_sorted_by_cost() is a hand-coded insertion sort.  Can't
> we find a way to use qsort() for this instead of hand-coding a slower
> algorithm?  I think we could just create an array of the right length,
> stick each path into it from add_paths_to_append_rel, and then qsort()
> the array based on .  Then the result can be
> turned into a list.

Yeah, I was in two minds as to whether to do the
copy-to-array-and-qsort thing, or just write the same number of lines
of code to do a manual insertion sort. Actually, I was searching for an
existing linked-list sort, but it seems we don't have one. Will do the
qsort now since it would be faster.

>
> - Maybe the new helper functions in nodeAppend.c could get names
> starting with exec_append_, to match the style of
> exec_append_initialize_next().
>
> - There's a superfluous whitespace change in add_paths_to_append_rel.

Will fix this.

>
> - The substantive changes in add_paths_to_append_rel don't look right
> either.  It's not clear why accumulate_partialappend_subpath is
> getting called even in the non-enable_parallelappend case.  I don't
> think the logic for the case where we're not generating a parallel
> append path needs to change at all.

When accumulate_partialappend_subpath() is called for a childrel with
a partial path, it works just like accumulate_append_subpath() when
enable_parallelappend is false. That's why, for a partial child path,
the same function is called irrespective of whether it is the
parallel-append or non-parallel-append case. Maybe mentioning this in
the comments would suffice here?

>
> - When parallel append is enabled, I think add_paths_to_append_rel
> should still consider all the same paths that it does today, plus one
> extra.  The new path is a parallel append path where each subpath is
> the cheapest subpath for that childrel, whether partial or
> non-partial.  If !enable_parallelappend, or if all of the cheapest
> subpaths are partial, then skip this.  (If all the cheapest subpaths
> are non-partial, it's still potentially useful.)

In the case of all-partial childrels, the paths are *exactly* the same as
those that would have been created for enable_parallelappend=off. The
extra path is there for enable_parallelappend=on only when one or more
of the child rels do not have partial paths. Does this make sense?

> In other words,
> don't skip consideration of parallel append just because you have a
> partial path available for every child rel; it could be

I didn't get this. Are you saying that in the patch it gets skipped if
enable_parallelappend = off? I don't think so. For all-partial child
rels, a partial Append path is always created. The only thing is that,
in the case of enable_parallelappend=off, the Append path is not
parallel_aware, so it executes just as it does today under Gather,
without being parallel-aware.

>
> - I think the way cost_append() works is not right.  What you've got
> assumes that you can just multiply the cost of a partial plan by the
> parallel divisor to recover the total cost, which is not true because
> we don't divide all elements of the plan cost by the parallel divisor
> -- only the ones that seem like they should be divided.

Yes, that was an approximation. For those subpaths for which there is
no parallel_divisor, we cannot calculate the total cost taking into
account the number of workers for the subpath. I feel we should
consider the per-subpath parallel_workers somehow. The
Path->total_cost for a partial path is *always* a per-worker cost,
right? Just want to confirm this assumption of mine.

> Also, it
> could be smarter about what happens with the costs of non-partial
> paths. I suggest the following algorithm instead.
>
> 1. Add up all the costs of the partial paths.  Those contribute
> directly to the final cost of the Append.  This ignores the fact that
> the Append may escalate the parallel degree, but I think we should
> just ignore that problem for now, because we have no real way of
> knowing what the impact of that is going to be.

I wanted to take into account per-subpath parallel_workers for total
cost of Append. Suppose the partial subpaths have per worker total
costs (3, 3, 3) and their parallel_workers are (2, 8, 4), with 2
Append workers available. So according to what you say, the total cost
is 9. With per-subplan parallel_workers taken into account, total cost
= (3*2 + 3*8 + 3*4)/2 = 21.


Re: [HACKERS] Parallel Append implementation

2017-03-16 Thread Robert Haas
On Thu, Mar 16, 2017 at 6:27 AM, Amit Khandekar  wrote:
> Attached is an updated patch v7, which does the above.

Some comments:

- You've added a GUC (which is good) but not documented it (which is
bad) or added it to postgresql.conf.sample (also bad).

- You've used a loop inside a spinlock-protected critical section,
which is against project policy.  Use an LWLock; define and document a
new builtin tranche ID.

- The comment for pa_finished claims that it is the number of workers
executing the subplan, but it's a bool, not a count; I think this
comment is just out of date.

- paths_insert_sorted_by_cost() is a hand-coded insertion sort.  Can't
we find a way to use qsort() for this instead of hand-coding a slower
algorithm?  I think we could just create an array of the right length,
stick each path into it from add_paths_to_append_rel, and then qsort()
the array based on .  Then the result can be
turned into a list.

- Maybe the new helper functions in nodeAppend.c could get names
starting with exec_append_, to match the style of
exec_append_initialize_next().

- There's a superfluous whitespace change in add_paths_to_append_rel.

- The substantive changes in add_paths_to_append_rel don't look right
either.  It's not clear why accumulate_partialappend_subpath is
getting called even in the non-enable_parallelappend case.  I don't
think the logic for the case where we're not generating a parallel
append path needs to change at all.

- When parallel append is enabled, I think add_paths_to_append_rel
should still consider all the same paths that it does today, plus one
extra.  The new path is a parallel append path where each subpath is
the cheapest subpath for that childrel, whether partial or
non-partial.  If !enable_parallelappend, or if all of the cheapest
subpaths are partial, then skip this.  (If all the cheapest subpaths
are non-partial, it's still potentially useful.)  In other words,
don't skip consideration of parallel append just because you have a
partial path available for every child rel; it could be

- I think the way cost_append() works is not right.  What you've got
assumes that you can just multiply the cost of a partial plan by the
parallel divisor to recover the total cost, which is not true because
we don't divide all elements of the plan cost by the parallel divisor
-- only the ones that seem like they should be divided.  Also, it
could be smarter about what happens with the costs of non-partial
paths. I suggest the following algorithm instead.

1. Add up all the costs of the partial paths.  Those contribute
directly to the final cost of the Append.  This ignores the fact that
the Append may escalate the parallel degree, but I think we should
just ignore that problem for now, because we have no real way of
knowing what the impact of that is going to be.

2. Next, estimate the cost of the non-partial paths.  To do this, make
an array of Cost of that length and initialize all the elements to
zero, then add the total cost of each non-partial plan in turn to the
element of the array with the smallest cost, and then take the maximum
of the array elements as the total cost of the non-partial plans.  Add
this to the result from step 1 to get the total cost.
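Taken together, the two steps could be sketched as below (a standalone
illustration only, not cost_append() or any actual planner code; the worker
cap and example numbers are made up):

#include <stdio.h>

/* Step 1: partial-path costs (already per-worker) are summed directly.
 * Step 2: each non-partial cost goes to the currently least-loaded worker,
 * and the maximum per-worker load is added on top. */
static double
append_cost_estimate(const double *partial, int npartial,
                     const double *nonpartial, int nnonpartial,
                     int nworkers)
{
    double load[64] = {0};          /* demo assumption: nworkers <= 64 */
    double cost = 0;
    double max_nonpartial = 0;

    for (int i = 0; i < npartial; i++)
        cost += partial[i];

    for (int i = 0; i < nnonpartial; i++)
    {
        int min_w = 0;

        for (int w = 1; w < nworkers; w++)
            if (load[w] < load[min_w])
                min_w = w;
        load[min_w] += nonpartial[i];
    }
    for (int w = 0; w < nworkers; w++)
        if (load[w] > max_nonpartial)
            max_nonpartial = load[w];

    return cost + max_nonpartial;
}

int
main(void)
{
    double partial[] = {3, 3, 3};
    double nonpartial[] = {8, 5, 2};

    /* With 2 workers: non-partial part costs 8 (8 vs. 5+2), plus 9 -> 17. */
    printf("%g\n", append_cost_estimate(partial, 3, nonpartial, 3, 2));
    return 0;
}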

- In get_append_num_workers, instead of the complicated formula with
log() and 0.693, just add the list lengths and call fls() on the
result.  Integer arithmetic FTW!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-16 Thread Robert Haas
On Thu, Mar 16, 2017 at 8:48 AM, Ashutosh Bapat
 wrote:
> Why do we need following code in both ExecAppendInitializeWorker() and
> ExecAppendInitializeDSM()? Both of those things happen before starting the
> actual execution, so one of those should suffice?
> +/* Choose the optimal subplan to be executed. */
> +(void) parallel_append_next(node);

ExecAppendInitializeWorker runs only in workers, but
ExecAppendInitializeDSM runs only in the leader.

> BTW, sa_finished seems to be a misnomor. The plan is not finished yet, but it
> wants no more workers. So, should it be renamed as sa_no_new_workers or
> something like that?

I think that's not going to improve clarity.  The comments can clarify
the exact semantics.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-16 Thread Ashutosh Bapat
On Thu, Mar 16, 2017 at 3:57 PM, Amit Khandekar  wrote:
> On 12 March 2017 at 08:50, Robert Haas  wrote:
>>> However, Ashutosh's response made me think of something: one thing is
>>> that we probably do want to group all of the non-partial plans at the
>>> beginning of the Append so that they get workers first, and put the
>>> partial plans afterward.  That's because the partial plans can always
>>> be accelerated by adding more workers as they become available, but
>>> the non-partial plans are just going to take as long as they take - so
>>> we want to start them as soon as possible.  In fact, what we might
>>> want to do is actually sort the non-partial paths in order of
>>> decreasing cost, putting the most expensive one first and the others
>>> in decreasing order after that - and then similarly afterward with the
>>> partial paths.  If we did that, we wouldn't need to store a bitmapset
>>> OR two separate lists.  We could just store the index of the first
>>> partial plan in the list.  Then you can test whether a path is partial
>>> by checking whether this_index >= first_partial_index.
>
> Attached is an updated patch v7, which does the above. Now,
> AppendState->subplans has all non-partial subplans followed by all
> partial subplans, with the non-partial subplans in the order of
> descending total cost. Also, for convenience, the AppendPath also now
> has similar ordering in its AppendPath->subpaths. So there is a new
> field both in Append and AppendPath : first_partial_path/plan, which
> has value 0 if there are no non-partial subpaths.
>
> Also the backend now scans reverse, so that it does not take up the
> most expensive path.
>
> There are also some changes in the costing done. Now that we know that
> the very first path is the costliest non-partial path, we can use its
> total cost as the total cost of Append in case all the partial path
> costs are lesser.
>
> Modified/enhanced an existing test scenario in
> src/test/regress/select_parallel.sql so that Parallel Append is
> covered.
>
> As suggested by Robert, since pa_info->pa_finished was the only field
> in pa_info, removed the ParallelAppendDescData.pa_info structure, and
> instead brought pa_info->pa_finished into ParallelAppendDescData.
>
 +static inline void
 +exec_append_scan_first(AppendState *appendstate)
 +{
 +appendstate->as_whichplan = 0;
 +}

 I don't think this is buying you anything, and suggest backing it out.
>>>
>>> This is required for sequential Append, so that we can start executing
>>> from the first subplan.
>>
>> My point is that there's really no point in defining a static inline
>> function containing one line of code.  You could just put that line of
>> code in whatever places need it, which would probably be more clear.
>
> Did the same.

Some comments
+ * Check if we are already finished plans from parallel append. This
+ * can happen if all the subplans are finished when this worker
+ * has not even started returning tuples.
+ */
+if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
+return ExecClearTuple(node->ps.ps_ResultTupleSlot);
From the comment, it looks like this condition will be encountered before the
backend returns any tuple. But this code is part of the loop which returns the
tuples. Shouldn't this be outside the loop? Why do we want to check a condition
for every row returned when the condition can happen only once and that too
before returning any tuple?

Why do we need following code in both ExecAppendInitializeWorker() and
ExecAppendInitializeDSM()? Both of those things happen before starting the
actual execution, so one of those should suffice?
+/* Choose the optimal subplan to be executed. */
+(void) parallel_append_next(node);

There is no pa_num_worker now, so probably this should get updated. Per comment
we should also get rid of SpinLockAcquire() and SpinLockRelease()?
+ *purpose. The spinlock is used so that it does not change the
+ *pa_num_workers field while workers are choosing the next node.

BTW, sa_finished seems to be a misnomer. The plan is not finished yet, but it
wants no more workers. So, should it be renamed as sa_no_new_workers or
something like that?

In parallel_append_next() we shouldn't need to call goto_next_plan() twice. If
the plan indicated by pa_next_plan is finished, all the plans must have
finished. This should be true if we set pa_next_plan to 0 at the time of
initialization. Any worker picking up pa_next_plan will set it to the next
valid plan. So the next worker asking for plan should pick pa_next_plan and
set it to the next one and so on.

I am wondering whether goto_next_plan() can be simplified as some modulo
arithmetic e.g. (whichplan - first_plan)++ % (last_plan - first_plan)
+ first_plan.

I am still reviewing the patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres 

Re: [HACKERS] Parallel Append implementation

2017-03-16 Thread Amit Khandekar
On 12 March 2017 at 08:50, Robert Haas  wrote:
>> However, Ashutosh's response made me think of something: one thing is
>> that we probably do want to group all of the non-partial plans at the
>> beginning of the Append so that they get workers first, and put the
>> partial plans afterward.  That's because the partial plans can always
>> be accelerated by adding more workers as they become available, but
>> the non-partial plans are just going to take as long as they take - so
>> we want to start them as soon as possible.  In fact, what we might
>> want to do is actually sort the non-partial paths in order of
>> decreasing cost, putting the most expensive one first and the others
>> in decreasing order after that - and then similarly afterward with the
>> partial paths.  If we did that, we wouldn't need to store a bitmapset
>> OR two separate lists.  We could just store the index of the first
>> partial plan in the list.  Then you can test whether a path is partial
>> by checking whether this_index >= first_partial_index.

Attached is an updated patch v7, which does the above. Now,
AppendState->subplans has all non-partial subplans followed by all
partial subplans, with the non-partial subplans in the order of
descending total cost. Also, for convenience, the AppendPath also now
has similar ordering in its AppendPath->subpaths. So there is a new
field both in Append and AppendPath : first_partial_path/plan, which
has value 0 if there are no non-partial subpaths.

Also, the backend now scans in reverse, so that it does not take up the
most expensive path.

There are also some changes in the costing. Now that we know that the
very first path is the costliest non-partial path, we can use its total
cost as the total cost of the Append in case all the partial path costs
are smaller.

Modified/enhanced an existing test scenario in
src/test/regress/select_parallel.sql so that Parallel Append is
covered.

As suggested by Robert, since pa_info->pa_finished was the only field
in pa_info, removed the ParallelAppendDescData.pa_info structure, and
instead brought pa_info->pa_finished into ParallelAppendDescData.

>>> +static inline void
>>> +exec_append_scan_first(AppendState *appendstate)
>>> +{
>>> +appendstate->as_whichplan = 0;
>>> +}
>>>
>>> I don't think this is buying you anything, and suggest backing it out.
>>
>> This is required for sequential Append, so that we can start executing
>> from the first subplan.
>
> My point is that there's really no point in defining a static inline
> function containing one line of code.  You could just put that line of
> code in whatever places need it, which would probably be more clear.

Did the same.


ParallelAppend_v7.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Robert Haas
On Mon, Mar 13, 2017 at 7:46 AM, Robert Haas  wrote:
> On Mon, Mar 13, 2017 at 4:59 AM, Amit Khandekar  
> wrote:
>> I agree that we should preferably have the non-partial plans started
>> first. But I am not sure if it is really worth ordering the partial
>> plans by cost. The reason we ended up not keeping track of the
>> per-subplan parallel_worker, is because it would not matter  much ,
>> and we would just equally distribute the workers among all regardless
>> of how big the subplans are. Even if smaller plans get more worker,
>> they will finish faster, and workers would be available to larger
>> subplans sooner.
>
> Imagine that the plan costs are 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, and 10
> and you have 2 workers.
>
> If you move that 10 to the front, this will finish in 10 time units.
> If you leave it at the end, it will take 15 time units.

Oh, never mind.  You were only asking whether we should sort partial
plans.  That's a lot less important, and maybe not important at all.
The only consideration there is whether we might try to avoid having
the leader start in on a plan with a large startup cost.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Robert Haas
On Mon, Mar 13, 2017 at 4:59 AM, Amit Khandekar  wrote:
> I agree that we should preferably have the non-partial plans started
> first. But I am not sure if it is really worth ordering the partial
> plans by cost. The reason we ended up not keeping track of the
> per-subplan parallel_worker, is because it would not matter  much ,
> and we would just equally distribute the workers among all regardless
> of how big the subplans are. Even if smaller plans get more worker,
> they will finish faster, and workers would be available to larger
> subplans sooner.

Imagine that the plan costs are 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, and 10
and you have 2 workers.

If you move that 10 to the front, this will finish in 10 time units.
If you leave it at the end, it will take 15 time units.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Amit Khandekar
On 12 March 2017 at 19:31, Tels  wrote:
> Moin,
>
> On Sat, March 11, 2017 11:29 pm, Robert Haas wrote:
>> On Fri, Mar 10, 2017 at 6:01 AM, Tels 
>> wrote:
>>> Just a question for me to understand the implementation details vs. the
>>> strategy:
>>>
>>> Have you considered how the scheduling decision might impact performance
>>> due to "inter-plan parallelism vs. in-plan parallelism"?
>>>
>>> So what would be the scheduling strategy? And should there be a fixed
>>> one
>>> or user-influencable? And what could be good ones?
>>>
>>> A simple example:
>>>
>>> E.g. if we have 5 subplans, and each can have at most 5 workers and we
>>> have 5 workers overall.
>>>
>>> So, do we:
>>>
>>>   Assign 5 workers to plan 1. Let it finish.
>>>   Then assign 5 workers to plan 2. Let it finish.
>>>   and so on
>>>
>>> or:
>>>
>>>   Assign 1 workers to each plan until no workers are left?
>>
>> Currently, we do the first of those, but I'm pretty sure the second is
>> way better.  For example, suppose each subplan has a startup cost.  If
>> you have all the workers pile on each plan in turn, every worker pays
>> the startup cost for every subplan.  If you spread them out, then
>> subplans can get finished without being visited by all workers, and
>> then the other workers never pay those costs.  Moreover, you reduce
>> contention for spinlocks, condition variables, etc.  It's not
>> impossible to imagine a scenario where having all workers pile on one
>> subplan at a time works out better: for example, suppose you have a
>> table with lots of partitions all of which are on the same disk, and
>> it's actually one physical spinning disk, not an SSD or a disk array
>> or anything, and the query is completely I/O-bound.  Well, it could
>> be, in that scenario, that spreading out the workers is going to turn
>> sequential I/O into random I/O and that might be terrible.  In most
>> cases, though, I think you're going to be better off.  If the
>> partitions are on different spindles or if there's some slack I/O
>> capacity for prefetching, you're going to come out ahead, maybe way
>> ahead.  If you come out behind, then you're evidently totally I/O
>> bound and have no capacity for I/O parallelism; in that scenario, you
>> should probably just turn parallel query off altogether, because
>> you're not going to benefit from it.
>
> I agree with the proposition that both strategies can work well, or not,
> depending on system-setup, the tables and data layout. I'd be a bit more
> worried about turning it into the "random-io-case", but that's still just
> a feeling and guesswork.
>
> So which one will be better seems speculative, hence the question for
> benchmarking different strategies.
>
> So, I'd like to see the scheduler be out in a single place, maybe a
> function that get's called with the number of currently running workers,
> the max. number of workers to be expected, the new worker, the list of
> plans still todo, and then schedules that single worker to one of these
> plans by strategy X.
>
> That would make it easier to swap out X for Y and see how it fares,
> wouldn't it?

Yes, pretty much all of the scheduler logic is already in one single
function, parallel_append_next().

>
>
> However, I don't think the patch needs to select the optimal strategy
> right from the beginning (if that even exists, maybe it's a mixed
> strategy), even "not so optimal" parallelism will be better than doing all
> things sequentially.
>
> Best regards,
>
> Tels



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-13 Thread Amit Khandekar
On 10 March 2017 at 22:08, Robert Haas  wrote:
> On Fri, Mar 10, 2017 at 12:17 AM, Amit Khandekar  
> wrote:
>> I agree that the two-lists approach will consume less memory than
>> bitmapset. Keeping two lists will effectively have an extra pointer
>> field which will add up to the AppendPath size, but this size will not
>> grow with the number of subpaths, whereas the Bitmapset will grow.
>
> Sure.  You'll use about one BIT of memory per subpath.  I'm kind of
> baffled as to why we're treating this as an issue worth serious
> discussion; the amount of memory involved is clearly very small.  Even
> for an appendrel with 1000 children, that's 125 bytes of memory.
> Considering the amount of memory we're going to spend planning that
> appendrel overall, that's not significant.

Yes, I agree that we should rather consider other things, like code
simplicity, to determine which data structure we should use in
AppendPath.

>
> However, Ashutosh's response made me think of something: one thing is
> that we probably do want to group all of the non-partial plans at the
> beginning of the Append so that they get workers first, and put the
> partial plans afterward.  That's because the partial plans can always
> be accelerated by adding more workers as they become available, but
> the non-partial plans are just going to take as long as they take - so
> we want to start them as soon as possible.  In fact, what we might
> want to do is actually sort the non-partial paths in order of
> decreasing cost, putting the most expensive one first and the others
> in decreasing order after that - and then similarly afterward with the
> partial paths.  If we did that, we wouldn't need to store a bitmapset
> OR two separate lists.  We could just store the index of the first
> partial plan in the list.  Then you can test whether a path is partial
> by checking whether this_index >= first_partial_index.

I agree that we should preferably have the non-partial plans started
first. But I am not sure if it is really worth ordering the partial
plans by cost. The reason we ended up not keeping track of the
per-subplan parallel_worker, is because it would not matter  much ,
and we would just equally distribute the workers among all regardless
of how big the subplans are. Even if smaller plans get more worker,
they will finish faster, and workers would be available to larger
subplans sooner.

Anyway, I have given some thought to the logic of choosing the next
plan, and that is irrespective of whether the list is sorted. I have
included Ashutosh's proposal of scanning the array round-robin, as
against finding the minimum, since that method will automatically
distribute the workers evenly. Also, the logic uses a single array and
keeps track of the first partial plan: the first section of the array
is non-partial, followed by the partial plans. Below is the algorithm.
There might be corner cases which I haven't yet taken into account,
but first I wanted to get agreement on whether this looks ok to go
ahead with. Since it does not find a minimum worker count, it no longer
uses pa_num_workers; instead it has a boolean field, painfo->pa_finished.

parallel_append_next(AppendState *state)
{
    /* Make a note of which subplan we have started with */
    initial_plan = padesc->next_plan;

    /*
     * Keep going to the next plan until we find an unfinished one.  In the
     * process, also keep track of the first unfinished subplan.  As the
     * non-partial subplans are taken one by one, the first unfinished
     * subplan will shift ahead, so that we don't have to scan these anymore.
     */
    whichplan = initial_plan;
    for (;;)
    {
        ParallelAppendInfo *painfo = &padesc->pa_info[whichplan];

        /*
         * Ignore plans that are already done processing.  These also include
         * non-partial subplans which have already been taken by a worker.
         */
        if (!painfo->pa_finished)
        {
            /*
             * If this is a non-partial plan, immediately mark it finished,
             * and shift ahead first_plan.
             */
            if (whichplan < padesc->first_partial_plan)
            {
                padesc->pa_info[whichplan].pa_finished = true;
                padesc->first_plan++;
            }

            break;
        }

        /*
         * Either go to the next index, or wrap around to the first
         * unfinished one.
         */
        whichplan = goto_next_plan(whichplan, padesc->first_plan,
                                   padesc->as_nplans - 1);

        /* Have we scanned all subplans?  If yes, we are done. */
        if (whichplan == initial_plan)
            break;
    }

    /* If we didn't find any plan to execute, stop executing. */
    if (whichplan == initial_plan || whichplan == PA_INVALID_PLAN)
        return false;
    else
    {
        /*
         * Set the chosen plan, and also the next plan to be picked by
         * other workers.
         */
        state->as_whichplan = whichplan;
        padesc->next_plan = goto_next_plan(whichplan, padesc->first_plan,
                                           padesc->as_nplans - 1);
        return true;
    }
}

/* Either go 
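
For readability, here is a minimal sketch of the goto_next_plan()
helper the pseudocode relies on; its behavior (advance to the next
index, wrapping back to first_plan after the last one) is an assumption
based only on how it is called above, not the author's actual
definition:

static int
goto_next_plan(int curplan, int first_plan, int last_plan)
{
    /* advance by one, wrapping around to the first unfinished plan */
    return (curplan < last_plan) ? curplan + 1 : first_plan;
}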

Re: [HACKERS] Parallel Append implementation

2017-03-12 Thread Tels
Moin,

On Sat, March 11, 2017 11:29 pm, Robert Haas wrote:
> On Fri, Mar 10, 2017 at 6:01 AM, Tels 
> wrote:
>> Just a question for me to understand the implementation details vs. the
>> strategy:
>>
>> Have you considered how the scheduling decision might impact performance
>> due to "inter-plan parallelism vs. in-plan parallelism"?
>>
>> So what would be the scheduling strategy? And should there be a fixed
>> one
>> or user-influencable? And what could be good ones?
>>
>> A simple example:
>>
>> E.g. if we have 5 subplans, and each can have at most 5 workers and we
>> have 5 workers overall.
>>
>> So, do we:
>>
>>   Assign 5 workers to plan 1. Let it finish.
>>   Then assign 5 workers to plan 2. Let it finish.
>>   and so on
>>
>> or:
>>
>>   Assign 1 workers to each plan until no workers are left?
>
> Currently, we do the first of those, but I'm pretty sure the second is
> way better.  For example, suppose each subplan has a startup cost.  If
> you have all the workers pile on each plan in turn, every worker pays
> the startup cost for every subplan.  If you spread them out, then
> subplans can get finished without being visited by all workers, and
> then the other workers never pay those costs.  Moreover, you reduce
> contention for spinlocks, condition variables, etc.  It's not
> impossible to imagine a scenario where having all workers pile on one
> subplan at a time works out better: for example, suppose you have a
> table with lots of partitions all of which are on the same disk, and
> it's actually one physical spinning disk, not an SSD or a disk array
> or anything, and the query is completely I/O-bound.  Well, it could
> be, in that scenario, that spreading out the workers is going to turn
> sequential I/O into random I/O and that might be terrible.  In most
> cases, though, I think you're going to be better off.  If the
> partitions are on different spindles or if there's some slack I/O
> capacity for prefetching, you're going to come out ahead, maybe way
> ahead.  If you come out behind, then you're evidently totally I/O
> bound and have no capacity for I/O parallelism; in that scenario, you
> should probably just turn parallel query off altogether, because
> you're not going to benefit from it.

I agree with the proposition that both strategies can work well, or not,
depending on system-setup, the tables and data layout. I'd be a bit more
worried about turning it into the "random-io-case", but that's still just
a feeling and guesswork.

So which one will be better seems speculative, hence the question for
benchmarking different strategies.

So, I'd like to see the scheduler live in a single place, maybe a
function that gets called with the number of currently running workers,
the max number of workers to be expected, the new worker, and the list
of plans still to do, and then schedules that single worker to one of
these plans by strategy X.

That would make it easier to swap out X for Y and see how it fares,
wouldn't it?


However, I don't think the patch needs to select the optimal strategy
right from the beginning (if that even exists, maybe it's a mixed
strategy), even "not so optimal" parallelism will be better than doing all
things sequentially.

Best regards,

Tels




Re: [HACKERS] Parallel Append implementation

2017-03-11 Thread Robert Haas
On Fri, Mar 10, 2017 at 6:01 AM, Tels  wrote:
> Just a question for me to understand the implementation details vs. the
> strategy:
>
> Have you considered how the scheduling decision might impact performance
> due to "inter-plan parallelism vs. in-plan parallelism"?
>
> So what would be the scheduling strategy? And should there be a fixed one
> or user-influencable? And what could be good ones?
>
> A simple example:
>
> E.g. if we have 5 subplans, and each can have at most 5 workers and we
> have 5 workers overall.
>
> So, do we:
>
>   Assign 5 workers to plan 1. Let it finish.
>   Then assign 5 workers to plan 2. Let it finish.
>   and so on
>
> or:
>
>   Assign 1 workers to each plan until no workers are left?

Currently, we do the first of those, but I'm pretty sure the second is
way better.  For example, suppose each subplan has a startup cost.  If
you have all the workers pile on each plan in turn, every worker pays
the startup cost for every subplan.  If you spread them out, then
subplans can get finished without being visited by all workers, and
then the other workers never pay those costs.  Moreover, you reduce
contention for spinlocks, condition variables, etc.  It's not
impossible to imagine a scenario where having all workers pile on one
subplan at a time works out better: for example, suppose you have a
table with lots of partitions all of which are on the same disk, and
it's actually one physical spinning disk, not an SSD or a disk array
or anything, and the query is completely I/O-bound.  Well, it could
be, in that scenario, that spreading out the workers is going to turn
sequential I/O into random I/O and that might be terrible.  In most
cases, though, I think you're going to be better off.  If the
partitions are on different spindles or if there's some slack I/O
capacity for prefetching, you're going to come out ahead, maybe way
ahead.  If you come out behind, then you're evidently totally I/O
bound and have no capacity for I/O parallelism; in that scenario, you
should probably just turn parallel query off altogether, because
you're not going to benefit from it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-11 Thread Robert Haas
On Fri, Mar 10, 2017 at 8:12 AM, Amit Khandekar  wrote:
>> +static inline void
>> +exec_append_scan_first(AppendState *appendstate)
>> +{
>> +appendstate->as_whichplan = 0;
>> +}
>>
>> I don't think this is buying you anything, and suggest backing it out.
>
> This is required for sequential Append, so that we can start executing
> from the first subplan.

My point is that there's really no point in defining a static inline
function containing one line of code.  You could just put that line of
code in whatever places need it, which would probably be more clear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Robert Haas
On Fri, Mar 10, 2017 at 12:17 AM, Amit Khandekar  wrote:
> I agree that the two-lists approach will consume less memory than
> bitmapset. Keeping two lists will effectively have an extra pointer
> field which will add up to the AppendPath size, but this size will not
> grow with the number of subpaths, whereas the Bitmapset will grow.

Sure.  You'll use about one BIT of memory per subpath.  I'm kind of
baffled as to why we're treating this as an issue worth serious
discussion; the amount of memory involved is clearly very small.  Even
for an appendrel with 1000 children, that's 125 bytes of memory.
Considering the amount of memory we're going to spend planning that
appendrel overall, that's not significant.

However, Ashutosh's response made me think of something: one thing is
that we probably do want to group all of the non-partial plans at the
beginning of the Append so that they get workers first, and put the
partial plans afterward.  That's because the partial plans can always
be accelerated by adding more workers as they become available, but
the non-partial plans are just going to take as long as they take - so
we want to start them as soon as possible.  In fact, what we might
want to do is actually sort the non-partial paths in order of
decreasing cost, putting the most expensive one first and the others
in decreasing order after that - and then similarly afterward with the
partial paths.  If we did that, we wouldn't need to store a bitmapset
OR two separate lists.  We could just store the index of the first
partial plan in the list.  Then you can test whether a path is partial
by checking whether this_index >= first_partial_index.

One problem with that is that, since the leader has about a 4ms head
start on the other workers, it would tend to pick the most expensive
path to run locally before any other worker had a chance to make a
selection, and that's probably not what we want.  To fix that, let's
have the leader start at the end of the list of plans and work
backwards towards the beginning, so that it prefers cheaper and
partial plans over decisions that would force it to undertake a large
amount of work itself.
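
As an illustration of that idea (a sketch, not the eventual
implementation), the starting point could simply differ between the
leader and the workers:

/*
 * Sketch only: with subplans ordered most-expensive-non-partial first,
 * workers start scanning from the head of the array while the leader
 * starts from the tail, so the leader's head start does not make it
 * grab the most expensive non-partial subplan.
 */
static int
initial_subplan_index(bool am_leader, int nplans)
{
    return am_leader ? nplans - 1 : 0;
}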

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
After giving more thought to our discussions, I have used the
Bitmapset structure in AppendPath, as against having two lists, one for
partial and the other for non-partial paths. Attached is patch v6,
which has the required changes. So accumulate_append_subpath() now also
prepares the bitmapset containing the information about which paths are
partial paths. This is what I had done in the first version.

At this point I have not had sufficient time to think about Ashutosh's
proposal of just keeping track of the next_subplan, where we keep
assigning workers to a circle of subplans in round-robin style. But as
of now the approach of choosing the subplan with the minimum number of
workers looks pretty simple, so patch v6 is in working condition using
the minimum-worker approach.

On 9 March 2017 at 07:22, Robert Haas  wrote:

> Some review:
>
> +typedef struct ParallelAppendDescData
> +{
> +    slock_t        pa_mutex;    /* mutual exclusion to choose next subplan */
> +ParallelAppendInfo pa_info[FLEXIBLE_ARRAY_MEMBER];
> +} ParallelAppendDescData;
>
> Instead of having ParallelAppendInfo, how about just int
> pa_workers[FLEXIBLE_ARRAY_MEMBER]?  The second structure seems like
> overkill, at least for now.

I have, for now, kept the structure there, just in case we may add
something after further discussion.

>
> +static inline void
> +exec_append_scan_first(AppendState *appendstate)
> +{
> +appendstate->as_whichplan = 0;
> +}
>
> I don't think this is buying you anything, and suggest backing it out.

This is required for sequential Append, so that we can start executing
from the first subplan.

>
> +/* Backward scan is not supported by parallel-aware plans */
> +
> Assert(!ScanDirectionIsBackward(appendstate->ps.state->es_direction));
>
> I think you could assert ScanDirectionIsForward, couldn't you?
> NoMovement, I assume, is right out.

Right. Changed.

>
> +elog(DEBUG2, "ParallelAppend : pid %d : all plans already
> finished",
> + MyProcPid);
>
> Please remove (and all similar cases also).

Removed at multiple places.

>
> + sizeof(*node->as_padesc->pa_info) * node->as_nplans);
>
> I'd use the type name instead.

Done.

>
> +for (i = 0; i < node->as_nplans; i++)
> +{
> +/*
> + * Just setting all the number of workers to 0 is enough. The logic
> + * of choosing the next plan in workers will take care of everything
> + * else.
> + */
> +padesc->pa_info[i].pa_num_workers = 0;
> +}
>
> Here I'd use memset.

Done.

>
> +return (min_whichplan == PA_INVALID_PLAN ? false : true);
>
> Maybe just return (min_whichplan != PA_INVALID_PLAN);

Done.

>
> -  childrel->cheapest_total_path);
> +
> childrel->cheapest_total_path);
>
> Unnecessary.

This call now has more params, so I kept the change.
>
> +{
>  partial_subpaths = accumulate_append_subpath(partial_subpaths,
> linitial(childrel->partial_pathlist));
> +}
>
> Don't need to add braces.

Removed them.

>
> +/*
> + * Extract the first unparameterized, parallel-safe one among the
> + * child paths.
> + */
>
> Can we use get_cheapest_parallel_safe_total_inner for this, from
> a71f10189dc10a2fe422158a2c9409e0f77c6b9e?

Yes, Fixed.

>
> +if (rel->partial_pathlist != NIL &&
> +(Path *) linitial(rel->partial_pathlist) == subpath)
> +partial_subplans_set = bms_add_member(partial_subplans_set, i);
>
> This seems like a scary way to figure this out.  What if we wanted to
> build a parallel append subpath with some path other than the
> cheapest, for some reason?  I think you ought to record the decision
> that set_append_rel_pathlist makes about whether to use a partial path
> or a parallel-safe path, and then just copy it over here.

As mentioned above, used Bitmapset in AppendPath.

>
> -create_append_path(grouped_rel,
> -   paths,
> -   NULL,
> -   0);
> +create_append_path(grouped_rel, paths, NULL, 0);
>
> Unnecessary.

Since there was anyway a change in the number of params, I kept the
single-line call.

Please refer to attached patch version v6 for all of the above changes.


ParallelAppend_v6.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Tels
Moin,

On Fri, March 10, 2017 3:24 am, Amit Khandekar wrote:
> On 10 March 2017 at 12:33, Ashutosh Bapat
>  wrote:
>> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>>  wrote:

 But as far as code is concerned, I think the two-list approach will
 turn out to be less simple if we derive corresponding two different
 arrays in AppendState node. Handling two different arrays during
 execution does not look clean. Whereas, the bitmapset that I have used
 in Append has turned out to be very simple. I just had to do the below
 check (and that is the only location) to see if it's a partial or
 non-partial subplan. There is nowhere else any special handling for
 non-partial subpath.

 /*
 * Increment worker count for the chosen node, if at all we found one.
 * For non-partial plans, set it to -1 instead, so that no other
 workers
 * run it.
 */
 if (min_whichplan != PA_INVALID_PLAN)
 {
if (bms_is_member(min_whichplan,
 ((Append*)state->ps.plan)->partial_subplans_set))
padesc->pa_info[min_whichplan].pa_num_workers++;
else
padesc->pa_info[min_whichplan].pa_num_workers = -1;
 }

 Now, since Bitmapset field is used during execution with such
 simplicity, why not have this same data structure in AppendPath, and
 re-use bitmapset field in Append plan node without making a copy of
 it. Otherwise, if we have two lists in AppendPath, and a bitmap in
 Append, again there is going to be code for data structure conversion.

>>>
>>> I think there is some merit in separating out non-parallel and
>>> parallel plans within the same array or outside it. The current logic
>>> to assign plan to a worker looks at all the plans, unnecessarily
>>> hopping over the un-parallel ones after they are given to a worker. If
>>> we separate those two, we can keep assigning new workers to the
>>> non-parallel plans first and then iterate over the parallel ones when
>>> a worker needs a plan to execute. We might eliminate the need for
>>> special value -1 for num workers. You may separate those two kinds in
>>> two different arrays or within the same array and remember the
>>> smallest index of a parallel plan.
>
> Do you think we might get performance benefit with this ? I am looking
> more towards logic simplicity. non-parallel plans would be mostly
> likely be there only in case of UNION ALL queries, and not partitioned
> tables. And UNION ALL queries probably would have far lesser number of
> subplans, there won't be too many unnecessary hops. The need for
> num_workers=-1 will still be there for partial plans, because we need
> to set it to -1 once a worker finishes a plan.
>
>>
>> Further to that, with this scheme and the scheme to distribute workers
>> equally irrespective of the maximum workers per plan, you don't need
>> to "scan" the subplans to find the one with minimum workers. If you
>> treat the array of parallel plans as a circular queue, the plan to be
>> assigned next to a worker will always be the plan next to the one
>> which got assigned to the given worker.
>
>> Once you have assigned workers
>> to non-parallel plans, intialize a shared variable next_plan to point
>> to the first parallel plan. When a worker comes asking for a plan,
>> assign the plan pointed by next_plan and update it to the next plan in
>> the circular queue.
>
> At some point of time, this logic may stop working. Imagine plans are
> running with (1, 1, 1). Next worker goes to plan 1, so they run with
> (2, 1, 1). So now the next_plan points to plan 2. Now suppose worker
> on plan 2 finishes. It should not again take plan 2, even though
> next_plan points to 2. It should take plan 3, or whichever is not
> finished. May be a worker that finishes a plan should do this check
> before directly going to the next_plan. But if this is turning out as
> simple as the finding-min-worker-plan, we can use this logic. But will
> have to check. We can anyway consider this even when we have a single
> list.

Just a question for me to understand the implementation details vs. the
strategy:

Have you considered how the scheduling decision might impact performance
due to "inter-plan parallelism vs. in-plan parallelism"?

So what would be the scheduling strategy? And should there be a fixed one
or user-influencable? And what could be good ones?

A simple example:

E.g. if we have 5 subplans, and each can have at most 5 workers and we
have 5 workers overall.

So, do we:

  Assign 5 workers to plan 1. Let it finish.
  Then assign 5 workers to plan 2. Let it finish.
  and so on

or:

  Assign 1 workers to each plan until no workers are left?

In the second case you would have 5 plans running in a quasi-sequential
manner, which might be slower than the other way. Or not, that probably
needs some benchmarks?

Likewise, if you have a mix of plans with max 

Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
On 10 March 2017 at 14:05, Ashutosh Bapat
 wrote:
>> The need for
>> num_workers=-1 will still be there for partial plans, because we need
>> to set it to -1 once a worker finishes a plan.
>>
>
> IIRC, we do that so that no other workers are assigned to it when
> scanning the array of plans. But with the new scheme we don't need to
> scan the non-parallel plans for when assigning plan to workers so -1
> may not be needed. I may be wrong though.
>

Still, when a worker finishes a partial subplan, it marks it as -1, so
that no new workers pick it up, even if there are other workers already
executing it.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Ashutosh Bapat

>>>
>>> I think there is some merit in separating out non-parallel and
>>> parallel plans within the same array or outside it. The current logic
>>> to assign plan to a worker looks at all the plans, unnecessarily
>>> hopping over the un-parallel ones after they are given to a worker. If
>>> we separate those two, we can keep assigning new workers to the
>>> non-parallel plans first and then iterate over the parallel ones when
>>> a worker needs a plan to execute. We might eliminate the need for
>>> special value -1 for num workers. You may separate those two kinds in
>>> two different arrays or within the same array and remember the
>>> smallest index of a parallel plan.
>
> Do you think we might get performance benefit with this ? I am looking
> more towards logic simplicity. non-parallel plans would be mostly
> likely be there only in case of UNION ALL queries, and not partitioned
> tables. And UNION ALL queries probably would have far lesser number of
> subplans, there won't be too many unnecessary hops.

A partitioned table which has foreign and local partitions would have
both non-parallel and parallel plans if the foreign plans cannot be
parallelized, as is the case with postgres_fdw.

> The need for
> num_workers=-1 will still be there for partial plans, because we need
> to set it to -1 once a worker finishes a plan.
>

IIRC, we do that so that no other workers are assigned to it when
scanning the array of plans. But with the new scheme we don't need to
scan the non-parallel plans when assigning plans to workers, so -1 may
not be needed. I may be wrong though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-10 Thread Amit Khandekar
On 10 March 2017 at 12:33, Ashutosh Bapat
 wrote:
> On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
>  wrote:
>>>
>>> But as far as code is concerned, I think the two-list approach will
>>> turn out to be less simple if we derive corresponding two different
>>> arrays in AppendState node. Handling two different arrays during
>>> execution does not look clean. Whereas, the bitmapset that I have used
>>> in Append has turned out to be very simple. I just had to do the below
>>> check (and that is the only location) to see if it's a partial or
>>> non-partial subplan. There is nowhere else any special handling for
>>> non-partial subpath.
>>>
>>> /*
>>> * Increment worker count for the chosen node, if at all we found one.
>>> * For non-partial plans, set it to -1 instead, so that no other workers
>>> * run it.
>>> */
>>> if (min_whichplan != PA_INVALID_PLAN)
>>> {
>>>if (bms_is_member(min_whichplan,
>>> ((Append*)state->ps.plan)->partial_subplans_set))
>>>padesc->pa_info[min_whichplan].pa_num_workers++;
>>>else
>>>padesc->pa_info[min_whichplan].pa_num_workers = -1;
>>> }
>>>
>>> Now, since Bitmapset field is used during execution with such
>>> simplicity, why not have this same data structure in AppendPath, and
>>> re-use bitmapset field in Append plan node without making a copy of
>>> it. Otherwise, if we have two lists in AppendPath, and a bitmap in
>>> Append, again there is going to be code for data structure conversion.
>>>
>>
>> I think there is some merit in separating out non-parallel and
>> parallel plans within the same array or outside it. The current logic
>> to assign plan to a worker looks at all the plans, unnecessarily
>> hopping over the un-parallel ones after they are given to a worker. If
>> we separate those two, we can keep assigning new workers to the
>> non-parallel plans first and then iterate over the parallel ones when
>> a worker needs a plan to execute. We might eliminate the need for
>> special value -1 for num workers. You may separate those two kinds in
>> two different arrays or within the same array and remember the
>> smallest index of a parallel plan.

Do you think we might get a performance benefit with this? I am
looking more towards logic simplicity. Non-parallel plans would most
likely be there only in case of UNION ALL queries, and not partitioned
tables. And UNION ALL queries probably would have far fewer subplans,
so there won't be too many unnecessary hops. The need for
num_workers=-1 will still be there for partial plans, because we need
to set it to -1 once a worker finishes a plan.

>
> Further to that, with this scheme and the scheme to distribute workers
> equally irrespective of the maximum workers per plan, you don't need
> to "scan" the subplans to find the one with minimum workers. If you
> treat the array of parallel plans as a circular queue, the plan to be
> assigned next to a worker will always be the plan next to the one
> which got assigned to the given worker.

> Once you have assigned workers
> to non-parallel plans, intialize a shared variable next_plan to point
> to the first parallel plan. When a worker comes asking for a plan,
> assign the plan pointed by next_plan and update it to the next plan in
> the circular queue.

At some point this logic may stop working. Imagine plans are running
with (1, 1, 1). The next worker goes to plan 1, so they run with
(2, 1, 1), and now next_plan points to plan 2. Now suppose the worker
on plan 2 finishes. It should not take plan 2 again, even though
next_plan points to 2; it should take plan 3, or whichever one is not
finished. Maybe a worker that finishes a plan should do this check
before going directly to next_plan. But if this turns out to be as
simple as finding the minimum-worker plan, we can use this logic; will
have to check. We can anyway consider this even when we have a single
list.

>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Ashutosh Bapat
On Fri, Mar 10, 2017 at 11:33 AM, Ashutosh Bapat
 wrote:
>>
>> But as far as code is concerned, I think the two-list approach will
>> turn out to be less simple if we derive corresponding two different
>> arrays in AppendState node. Handling two different arrays during
>> execution does not look clean. Whereas, the bitmapset that I have used
>> in Append has turned out to be very simple. I just had to do the below
>> check (and that is the only location) to see if it's a partial or
>> non-partial subplan. There is nowhere else any special handling for
>> non-partial subpath.
>>
>> /*
>> * Increment worker count for the chosen node, if at all we found one.
>> * For non-partial plans, set it to -1 instead, so that no other workers
>> * run it.
>> */
>> if (min_whichplan != PA_INVALID_PLAN)
>> {
>>if (bms_is_member(min_whichplan,
>> ((Append*)state->ps.plan)->partial_subplans_set))
>>padesc->pa_info[min_whichplan].pa_num_workers++;
>>else
>>padesc->pa_info[min_whichplan].pa_num_workers = -1;
>> }
>>
>> Now, since Bitmapset field is used during execution with such
>> simplicity, why not have this same data structure in AppendPath, and
>> re-use bitmapset field in Append plan node without making a copy of
>> it. Otherwise, if we have two lists in AppendPath, and a bitmap in
>> Append, again there is going to be code for data structure conversion.
>>
>
> I think there is some merit in separating out non-parallel and
> parallel plans within the same array or outside it. The current logic
> to assign plan to a worker looks at all the plans, unnecessarily
> hopping over the un-parallel ones after they are given to a worker. If
> we separate those two, we can keep assigning new workers to the
> non-parallel plans first and then iterate over the parallel ones when
> a worker needs a plan to execute. We might eliminate the need for
> special value -1 for num workers. You may separate those two kinds in
> two different arrays or within the same array and remember the
> smallest index of a parallel plan.

Further to that, with this scheme and the scheme to distribute workers
equally irrespective of the maximum workers per plan, you don't need
to "scan" the subplans to find the one with minimum workers. If you
treat the array of parallel plans as a circular queue, the plan to be
assigned next to a worker will always be the plan next to the one
which got assigned to the given worker. Once you have assigned workers
to non-parallel plans, initialize a shared variable next_plan to point
to the first parallel plan. When a worker comes asking for a plan,
assign the plan pointed by next_plan and update it to the next plan in
the circular queue.
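
A minimal sketch of that round-robin hand-out might look like the
following; pa_next_plan is an assumed field name, and the non-parallel
plans are presumed to have been handed out already, so only indexes at
or beyond first_partial_plan circulate here:

#include "storage/spin.h"   /* SpinLockAcquire / SpinLockRelease */

static int
assign_next_parallel_plan(ParallelAppendDescData *padesc,
                          int first_partial_plan, int nplans)
{
    int     result;

    SpinLockAcquire(&padesc->pa_mutex);
    result = padesc->pa_next_plan;
    if (++padesc->pa_next_plan >= nplans)
        padesc->pa_next_plan = first_partial_plan;  /* wrap around */
    SpinLockRelease(&padesc->pa_mutex);

    return result;
}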

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Ashutosh Bapat
>
> But as far as code is concerned, I think the two-list approach will
> turn out to be less simple if we derive corresponding two different
> arrays in AppendState node. Handling two different arrays during
> execution does not look clean. Whereas, the bitmapset that I have used
> in Append has turned out to be very simple. I just had to do the below
> check (and that is the only location) to see if it's a partial or
> non-partial subplan. There is nowhere else any special handling for
> non-partial subpath.
>
> /*
> * Increment worker count for the chosen node, if at all we found one.
> * For non-partial plans, set it to -1 instead, so that no other workers
> * run it.
> */
> if (min_whichplan != PA_INVALID_PLAN)
> {
>if (bms_is_member(min_whichplan,
> ((Append*)state->ps.plan)->partial_subplans_set))
>padesc->pa_info[min_whichplan].pa_num_workers++;
>else
>padesc->pa_info[min_whichplan].pa_num_workers = -1;
> }
>
> Now, since Bitmapset field is used during execution with such
> simplicity, why not have this same data structure in AppendPath, and
> re-use bitmapset field in Append plan node without making a copy of
> it. Otherwise, if we have two lists in AppendPath, and a bitmap in
> Append, again there is going to be code for data structure conversion.
>

I think there is some merit in separating out non-parallel and
parallel plans within the same array or outside it. The current logic
to assign plan to a worker looks at all the plans, unnecessarily
hopping over the un-parallel ones after they are given to a worker. If
we separate those two, we can keep assigning new workers to the
non-parallel plans first and then iterate over the parallel ones when
a worker needs a plan to execute. We might eliminate the need for
special value -1 for num workers. You may separate those two kinds in
two different arrays or within the same array and remember the
smallest index of a parallel plan.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Amit Khandekar
On 10 March 2017 at 10:13, Ashutosh Bapat
 wrote:
> On Thu, Mar 9, 2017 at 6:28 PM, Robert Haas  wrote:
>> On Thu, Mar 9, 2017 at 7:42 AM, Ashutosh Bapat
>>  wrote:

 +if (rel->partial_pathlist != NIL &&
 +(Path *) linitial(rel->partial_pathlist) == subpath)
 +partial_subplans_set = bms_add_member(partial_subplans_set, 
 i);

 This seems like a scary way to figure this out.  What if we wanted to
 build a parallel append subpath with some path other than the
 cheapest, for some reason?

Yes, there was an assumption that an append subpath will be either the
cheapest non-partial path or the cheapest (i.e. first in the list)
partial path, although the patch has no Asserts to make sure that a
common rule has been followed at both these places.

 I think you ought to record the decision
 that set_append_rel_pathlist makes about whether to use a partial path
 or a parallel-safe path, and then just copy it over here.
>>>
>>> I agree that assuming that a subpath is non-partial path if it's not
>>> cheapest of the partial paths is risky. In fact, we can not assume
>>> that even when it's not one of the partial_paths since it could have
>>> been kicked out or was never added to the partial path list like
>>> reparameterized path. But if we have to save the information about
>>> which of the subpaths are partial paths and which are not in
>>> AppendPath, it would take some memory, noticeable for thousands of
>>> partitions, which will leak if the path doesn't make into the
>>> rel->pathlist.
>>
>> True, but that's no different from the situation for any other Path
>> node that has substructure.  For example, an IndexPath has no fewer
>> than 5 list pointers in it.  Generally we assume that the number of
>> paths won't be large enough for the memory used to really matter, and
>> I think that will also be true here.  And an AppendPath has a list of
>> subpaths, and if I'm not mistaken, those list nodes consume more
>> memory than the tracking information we're thinking about here will.
>>
>
> What I have observed is that we try to keep the memory usage to a
> minimum, trying to avoid memory consumption as much as possible. Most
> of that substructure gets absorbed by the planner or is shared across
> paths. Append path lists are an exception to that, but we need
> something to hold all subpaths together and list is PostgreSQL's way
> of doing it. So, that's kind of unavoidable. And may be we will find
> some reason for almost every substructure in paths.
>
>> I think you're thinking about this issue because you've been working
>> on partitionwise join where memory consumption is a big issue, but
>> there are a lot of cases where that isn't really a big deal.
>
> :).
>
>>
>>> The purpose of that information is to make sure that we
>>> allocate only one worker to that plan. I suggested that we use
>>> path->parallel_workers for the same, but it seems that's not
>>> guaranteed to be reliable. The reasons were discussed upthread. Is
>>> there any way to infer whether we can allocate more than one workers
>>> to a plan by looking at the corresponding path?
>>
>> I think it would be smarter to track it some other way.  Either keep
>> two lists of paths, one of which is the partial paths and the other of
>> which is the parallel-safe paths, or keep a bitmapset indicating which
>> paths fall into which category.
>
> I like two lists: it consumes almost no memory (two list headers
> instead of one) compared to non-parallel-append when there are
> non-partial paths and what more, it consumes no extra memory when all
> paths are partial.

I agree that the two-lists approach will consume less memory than a
bitmapset. Keeping two lists effectively adds an extra pointer field to
the AppendPath size, but that size will not grow with the number of
subpaths, whereas the Bitmapset will.

But as far as code is concerned, I think the two-list approach will
turn out to be less simple if we derive corresponding two different
arrays in AppendState node. Handling two different arrays during
execution does not look clean. Whereas, the bitmapset that I have used
in Append has turned out to be very simple. I just had to do the below
check (and that is the only location) to see if it's a partial or
non-partial subplan. There is nowhere else any special handling for
non-partial subpath.

/*
 * Increment worker count for the chosen node, if at all we found one.
 * For non-partial plans, set it to -1 instead, so that no other workers
 * run it.
 */
if (min_whichplan != PA_INVALID_PLAN)
{
    if (bms_is_member(min_whichplan,
                      ((Append *) state->ps.plan)->partial_subplans_set))
        padesc->pa_info[min_whichplan].pa_num_workers++;
    else
        padesc->pa_info[min_whichplan].pa_num_workers = -1;
}

Now, since Bitmapset field is used during execution 

Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Ashutosh Bapat
On Thu, Mar 9, 2017 at 6:28 PM, Robert Haas  wrote:
> On Thu, Mar 9, 2017 at 7:42 AM, Ashutosh Bapat
>  wrote:
>>>
>>> +if (rel->partial_pathlist != NIL &&
>>> +(Path *) linitial(rel->partial_pathlist) == subpath)
>>> +partial_subplans_set = bms_add_member(partial_subplans_set, i);
>>>
>>> This seems like a scary way to figure this out.  What if we wanted to
>>> build a parallel append subpath with some path other than the
>>> cheapest, for some reason?  I think you ought to record the decision
>>> that set_append_rel_pathlist makes about whether to use a partial path
>>> or a parallel-safe path, and then just copy it over here.
>>
>> I agree that assuming that a subpath is non-partial path if it's not
>> cheapest of the partial paths is risky. In fact, we can not assume
>> that even when it's not one of the partial_paths since it could have
>> been kicked out or was never added to the partial path list like
>> reparameterized path. But if we have to save the information about
>> which of the subpaths are partial paths and which are not in
>> AppendPath, it would take some memory, noticeable for thousands of
>> partitions, which will leak if the path doesn't make into the
>> rel->pathlist.
>
> True, but that's no different from the situation for any other Path
> node that has substructure.  For example, an IndexPath has no fewer
> than 5 list pointers in it.  Generally we assume that the number of
> paths won't be large enough for the memory used to really matter, and
> I think that will also be true here.  And an AppendPath has a list of
> subpaths, and if I'm not mistaken, those list nodes consume more
> memory than the tracking information we're thinking about here will.
>

What I have observed is that we try to keep the memory usage to a
minimum, trying to avoid memory consumption as much as possible. Most
of that substructure gets absorbed by the planner or is shared across
paths. Append path lists are an exception to that, but we need
something to hold all subpaths together and list is PostgreSQL's way
of doing it. So, that's kind of unavoidable. And maybe we will find
some reason for almost every substructure in paths.

> I think you're thinking about this issue because you've been working
> on partitionwise join where memory consumption is a big issue, but
> there are a lot of cases where that isn't really a big deal.

:).

>
>> The purpose of that information is to make sure that we
>> allocate only one worker to that plan. I suggested that we use
>> path->parallel_workers for the same, but it seems that's not
>> guaranteed to be reliable. The reasons were discussed upthread. Is
>> there any way to infer whether we can allocate more than one workers
>> to a plan by looking at the corresponding path?
>
> I think it would be smarter to track it some other way.  Either keep
> two lists of paths, one of which is the partial paths and the other of
> which is the parallel-safe paths, or keep a bitmapset indicating which
> paths fall into which category.

I like two lists: it consumes almost no memory (two list headers
instead of one) compared to non-parallel-append when there are
non-partial paths and what more, it consumes no extra memory when all
paths are partial.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Robert Haas
On Thu, Mar 9, 2017 at 7:42 AM, Ashutosh Bapat
 wrote:
>>
>> +if (rel->partial_pathlist != NIL &&
>> +(Path *) linitial(rel->partial_pathlist) == subpath)
>> +partial_subplans_set = bms_add_member(partial_subplans_set, i);
>>
>> This seems like a scary way to figure this out.  What if we wanted to
>> build a parallel append subpath with some path other than the
>> cheapest, for some reason?  I think you ought to record the decision
>> that set_append_rel_pathlist makes about whether to use a partial path
>> or a parallel-safe path, and then just copy it over here.
>
> I agree that assuming that a subpath is non-partial path if it's not
> cheapest of the partial paths is risky. In fact, we can not assume
> that even when it's not one of the partial_paths since it could have
> been kicked out or was never added to the partial path list like
> reparameterized path. But if we have to save the information about
> which of the subpaths are partial paths and which are not in
> AppendPath, it would take some memory, noticeable for thousands of
> partitions, which will leak if the path doesn't make into the
> rel->pathlist.

True, but that's no different from the situation for any other Path
node that has substructure.  For example, an IndexPath has no fewer
than 5 list pointers in it.  Generally we assume that the number of
paths won't be large enough for the memory used to really matter, and
I think that will also be true here.  And an AppendPath has a list of
subpaths, and if I'm not mistaken, those list nodes consume more
memory than the tracking information we're thinking about here will.

I think you're thinking about this issue because you've been working
on partitionwise join where memory consumption is a big issue, but
there are a lot of cases where that isn't really a big deal.

> The purpose of that information is to make sure that we
> allocate only one worker to that plan. I suggested that we use
> path->parallel_workers for the same, but it seems that's not
> guaranteed to be reliable. The reasons were discussed upthread. Is
> there any way to infer whether we can allocate more than one workers
> to a plan by looking at the corresponding path?

I think it would be smarter to track it some other way.  Either keep
two lists of paths, one of which is the partial paths and the other of
which is the parallel-safe paths, or keep a bitmapset indicating which
paths fall into which category.  I am not going to say there's no way
we could make it work without either of those things -- looking at the
parallel_workers flag might be made to work, for example -- but the
design idea I had in mind when I put this stuff into place was that
you keep them separate in other ways, not by the data they store
inside them.  I think it will be more robust if we keep to that
principle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-09 Thread Ashutosh Bapat
>
> +if (rel->partial_pathlist != NIL &&
> +(Path *) linitial(rel->partial_pathlist) == subpath)
> +partial_subplans_set = bms_add_member(partial_subplans_set, i);
>
> This seems like a scary way to figure this out.  What if we wanted to
> build a parallel append subpath with some path other than the
> cheapest, for some reason?  I think you ought to record the decision
> that set_append_rel_pathlist makes about whether to use a partial path
> or a parallel-safe path, and then just copy it over here.
>

I agree that assuming that a subpath is a non-partial path if it's not
the cheapest of the partial paths is risky. In fact, we cannot assume
that even when it's not one of the partial_paths since it could have
been kicked out or was never added to the partial path list like
reparameterized path. But if we have to save the information about
which of the subpaths are partial paths and which are not in
AppendPath, it would take some memory, noticeable for thousands of
partitions, which will leak if the path doesn't make into the
rel->pathlist. The purpose of that information is to make sure that we
allocate only one worker to that plan. I suggested that we use
path->parallel_workers for the same, but it seems that's not
guaranteed to be reliable. The reasons were discussed upthread. Is
there any way to infer whether we can allocate more than one workers
to a plan by looking at the corresponding path?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-03-08 Thread Robert Haas
On Wed, Mar 8, 2017 at 2:00 AM, Amit Khandekar  wrote:
> Yeah, that seems to be right in most of the cases. The only cases
> where your formula seems to give too few workers is for something like
> : (2, 8, 8). For such subplans, we should at least allocate 8 workers.
> It turns out that in most of the cases in my formula, the Append
> workers allocated is just 1 worker more than the max per-subplan
> worker count. So in (2, 1, 1, 8), it will be a fraction more than 8.
> So in the patch, in addition to the log2() formula you proposed, I
> have made sure that it allocates at least equal to max(per-subplan
> parallel_workers values).

Yeah, I agree with that.
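
To make that concrete, here is a sketch of the heuristic being agreed
on (the function shape and the rounding are assumptions, not the
patch's actual code): log2(number of children) + 1, clamped to at least
the largest per-subplan parallel_workers value.

#include <math.h>

static int
append_parallel_workers(const int *subplan_workers, int nchildren)
{
    int     max_per_subplan = 0;
    int     nworkers;
    int     i;

    /* find the largest per-subplan worker count */
    for (i = 0; i < nchildren; i++)
    {
        if (subplan_workers[i] > max_per_subplan)
            max_per_subplan = subplan_workers[i];
    }

    /* the log2(num_children) + 1 formula */
    nworkers = (int) floor(log2((double) nchildren)) + 1;

    /* never allocate fewer than the largest per-subplan count */
    return (nworkers > max_per_subplan) ? nworkers : max_per_subplan;
}

With per-subplan workers of (2, 8, 8) this yields 8, matching the
example quoted above.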

Some review:

+typedef struct ParallelAppendDescData
+{
+    slock_t        pa_mutex;    /* mutual exclusion to choose next subplan */
+ParallelAppendInfo pa_info[FLEXIBLE_ARRAY_MEMBER];
+} ParallelAppendDescData;

Instead of having ParallelAppendInfo, how about just int
pa_workers[FLEXIBLE_ARRAY_MEMBER]?  The second structure seems like
overkill, at least for now.

+static inline void
+exec_append_scan_first(AppendState *appendstate)
+{
+appendstate->as_whichplan = 0;
+}

I don't think this is buying you anything, and suggest backing it out.

+/* Backward scan is not supported by parallel-aware plans */
+Assert(!ScanDirectionIsBackward(appendstate->ps.state->es_direction));

I think you could assert ScanDirectionIsForward, couldn't you?
NoMovement, I assume, is right out.

+elog(DEBUG2, "ParallelAppend : pid %d : all plans already
finished",
+ MyProcPid);

Please remove (and all similar cases also).

+ sizeof(*node->as_padesc->pa_info) * node->as_nplans);

I'd use the type name instead.

+for (i = 0; i < node->as_nplans; i++)
+{
+/*
+ * Just setting all the number of workers to 0 is enough. The logic
+ * of choosing the next plan in workers will take care of everything
+ * else.
+ */
+padesc->pa_info[i].pa_num_workers = 0;
+}

Here I'd use memset.
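
For illustration, the suggested memset could be a single call over the
flexible array, using the names from the quoted hunk (a sketch, not the
committed change):

    /* zero pa_num_workers for every subplan in one go */
    memset(padesc->pa_info, 0,
           sizeof(ParallelAppendInfo) * node->as_nplans);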

+return (min_whichplan == PA_INVALID_PLAN ? false : true);

Maybe just return (min_whichplan != PA_INVALID_PLAN);

-  childrel->cheapest_total_path);
+
childrel->cheapest_total_path);

Unnecessary.

+{
     partial_subpaths = accumulate_append_subpath(partial_subpaths,
                                                  linitial(childrel->partial_pathlist));
+}

Don't need to add braces.

+/*
+ * Extract the first unparameterized, parallel-safe one among the
+ * child paths.
+ */

Can we use get_cheapest_parallel_safe_total_inner for this, from
a71f10189dc10a2fe422158a2c9409e0f77c6b9e?

+    if (rel->partial_pathlist != NIL &&
+        (Path *) linitial(rel->partial_pathlist) == subpath)
+        partial_subplans_set = bms_add_member(partial_subplans_set, i);

This seems like a scary way to figure this out.  What if we wanted to
build a parallel append subpath with some path other than the
cheapest, for some reason?  I think you ought to record the decision
that set_append_rel_pathlist makes about whether to use a partial path
or a parallel-safe path, and then just copy it over here.
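
For concreteness, the recording could look something like this (a rough
sketch only; partial_subpaths_set and the use of lappend() rather than
the flattening accumulate_append_subpath() are illustrative, not a
demand for this exact shape):

List       *subpaths = NIL;
Bitmapset  *partial_subpaths_set = NULL;
ListCell   *lc;
int         i = 0;

foreach(lc, live_childrels)
{
    RelOptInfo *childrel = (RelOptInfo *) lfirst(lc);

    if (childrel->partial_pathlist != NIL)
    {
        /* Remember that subplan i was chosen as a partial path. */
        partial_subpaths_set = bms_add_member(partial_subpaths_set, i);
        subpaths = lappend(subpaths,
                           linitial(childrel->partial_pathlist));
    }
    else
        subpaths = lappend(subpaths, childrel->cheapest_total_path);
    i++;
}
/* ... and pass partial_subpaths_set along when building the AppendPath. */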

-create_append_path(grouped_rel,
-   paths,
-   NULL,
-   0);
+create_append_path(grouped_rel, paths, NULL, 0);

Unnecessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-03-07 Thread Amit Khandekar
On 19 February 2017 at 14:59, Robert Haas  wrote:
> On Fri, Feb 17, 2017 at 2:56 PM, Amit Khandekar  
> wrote:
>> The log2(num_children)+1 formula which you proposed does not take into
>> account the number of workers for each of the subplans, that's why I
>> am a bit more inclined to look for some other logic. Maybe treat the
>> children as if they belong to partitions, and accordingly calculate
>> the final number of workers. So for 2 children with 4 and 5 workers
>> respectively, Append parallel_workers would be : log3(3^4 + 3^5) .
>
> In general this will give an answer not different by more than 1 or 2
> from my answer, and often exactly the same.  In the case you mention,
> whether we get the same answer depends on which way you round:
> log3(3^4+3^5) is 5 if you round down, 6 if you round up.
>
> My formula is more aggressive when there are many subplans that are
> not parallel or take only 1 worker, because I'll always use at least 5
> workers for an append that has 9-16 children, whereas you might use
> only 2 if you do log3(3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0+3^0).  In that
> case I like my formula better. With lots of separate children, the
> chances of being able to use as many as 5 workers seem good.  (Note
> that using 9 workers as Ashutosh seems to be proposing would be a
> waste if the different children have very unequal execution times,
> because the workers that run children with short execution times can
> be reused to run additional subplans while the long ones are still
> running.  Running a separate worker for each child only works out if
> the shortest runtime is more than 50% of the longest runtime, which
> may sometimes be true but doesn't seem like a good bet in general.)
>
> Your formula is more aggressive when you have 3 children that all use
> the same number of workers; it'll always decide on (workers per child) + 1, whereas mine won't add the extra worker in that case.
> Possibly your formula is better than mine in that case, but I'm not
> sure.  If you have as many as 9 children that all want N workers, your
> formula will decide on N+2 workers, but since my formula guarantees a
> minimum of 5 workers in such cases, I'll probably be within 1 of
> whatever answer you were getting.
>

Yeah, that seems to be right in most cases. The only case where your
formula seems to give too few workers is something like (2, 8, 8); for
such subplans we should allocate at least 8 workers. It turns out that
in most cases my formula allocates just one worker more than the
maximum per-subplan worker count, so for (2, 1, 1, 8) it comes out a
fraction more than 8. So in the patch, in addition to the log2()
formula you proposed, I have made sure that it allocates at least
max(per-subplan parallel_workers values).
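
For concreteness, the resulting policy is roughly this (a standalone
sketch, not the patch code; the function name and parameters are made
up):

#include <math.h>

static int
append_num_workers(int nchildren, const int *subplan_workers,
                   int max_parallel_workers_per_gather)
{
    /* log2(number of children) + 1, as proposed ... */
    int         parallel_workers = (int) log2((double) nchildren) + 1;
    int         i;

    /* ... but never fewer than the largest per-subplan worker count. */
    for (i = 0; i < nchildren; i++)
        if (subplan_workers[i] > parallel_workers)
            parallel_workers = subplan_workers[i];

    /* Respect the usual per-gather upper limit. */
    if (parallel_workers > max_parallel_workers_per_gather)
        parallel_workers = max_parallel_workers_per_gather;

    return parallel_workers;
}

With that combination, (2, 8, 8) gets 8 workers rather than the 2 or 3
that the log2() term alone would give.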

>
>> BTW, there is going to be some logic change in the choose-next-subplan
>> algorithm if we consider giving extra workers to subplans.
>
> I'm not sure that it's going to be useful to make this logic very
> complicated.  I think the most important thing is to give 1 worker to
> each plan before we give a second worker to any plan.  In general I
> think it's sufficient to assign a worker that becomes available to the
> subplan with the fewest number of workers (or one of them, if there's
> a tie) without worrying too much about the target number of workers
> for that subplan.

In the attached v5 patch, the logic of distributing the workers is now
kept simple: it just distributes the workers equally, without
considering the per-subplan parallel_workers value. I have retained the
earlier logic of choosing the plan with the minimum number of current
workers. But now that pa_max_workers is no longer needed, I removed it
and instead added a partial_plans bitmapset to the Append node. Once a
worker picks up a non-partial subplan, it immediately changes that
subplan's pa_num_workers to -1, whereas for a partial subplan the
worker sets it to -1 only after it finishes executing it.

Effectively, in parallel_append_next(), the check for whether a subplan
is already executing with its maximum parallel_workers is gone, along
with all the code that used pa_max_workers.
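
To make that concrete, the selection step now looks roughly like this
(a simplified sketch using the patch's field names; the sequential
fallback and the marking of finished partial plans happen elsewhere,
and the location of the partial_plans bitmapset is assumed):

static bool
parallel_append_next(AppendState *node)
{
    ParallelAppendDescData *padesc = node->as_padesc;
    /* Hypothetical location of the new bitmapset on the plan node. */
    Bitmapset  *partial_plans = ((Append *) node->ps.plan)->partial_plans;
    int         min_whichplan = PA_INVALID_PLAN;
    int         min_workers = INT_MAX;
    int         i;

    SpinLockAcquire(&padesc->pa_mutex);

    /* Pick the unfinished, unclaimed subplan with the fewest workers. */
    for (i = 0; i < node->as_nplans; i++)
    {
        int         nworkers = padesc->pa_info[i].pa_num_workers;

        /* -1 means finished, or a non-partial plan already claimed. */
        if (nworkers >= 0 && nworkers < min_workers)
        {
            min_workers = nworkers;
            min_whichplan = i;
        }
    }

    if (min_whichplan != PA_INVALID_PLAN)
    {
        if (bms_is_member(min_whichplan, partial_plans))
            padesc->pa_info[min_whichplan].pa_num_workers++;    /* shareable */
        else
            padesc->pa_info[min_whichplan].pa_num_workers = -1; /* claimed */
    }

    SpinLockRelease(&padesc->pa_mutex);

    node->as_whichplan = min_whichplan;
    return (min_whichplan != PA_INVALID_PLAN);
}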


Ashutosh Bapat  wrote:
> 10. We should probably move the parallel_safe calculation out of 
> cost_append().
> +path->parallel_safe = path->parallel_safe &&
> +  subpath->parallel_safe;
>
> 11. This check shouldn't be part of cost_append().
> +/* All child paths must have same parameterization */
> +Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
>

Moved these two statements out of cost_append() and into
create_append_path().
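
In create_append_path(), the loop over the subpaths now carries those
two, roughly as follows (a sketch, not the exact hunk):

foreach(l, subpaths)
{
    Path       *subpath = (Path *) lfirst(l);

    pathnode->path.parallel_safe = pathnode->path.parallel_safe &&
        subpath->parallel_safe;

    /* All child paths must have the same parameterization. */
    Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
}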


Also, I have removed some elog() statements that were executed while
holding the spinlock in parallel_append_next().


On 17 January 2017 at 11:10, Amit Langote  wrote:
> I was looking at the 

Re: [HACKERS] Parallel Append implementation

2017-02-26 Thread Robert Haas
On Mon, Feb 20, 2017 at 10:54 AM, Ashutosh Bapat
 wrote:
> On Sun, Feb 19, 2017 at 2:33 PM, Robert Haas  wrote:
>> On Fri, Feb 17, 2017 at 11:44 AM, Ashutosh Bapat
>>  wrote:
>>> That's true for a partitioned table, but not necessarily for every
>>> append relation. Amit's patch is generic for all append relations. If
>>> the child plans are joins or subquery segments of set operations, I
>>> doubt if the same logic works. It may be better if we throw as many
>>> workers (or some function "summing" those up) as specified by those
>>> subplans. I guess, we have to use different logic for append relations
>>> which are base relations and append relations which are not base
>>> relations.
>>
>> Well, I for one do not believe that if somebody writes a UNION ALL
>> with 100 branches, they should get 100 (or 99) workers.  Generally
>> speaking, the sweet spot for parallel workers on queries we've tested
>> so far has been between 1 and 4.  It's straining credulity to believe
>> that the number that's correct for parallel append is more than an
>> order of magnitude larger.  Since increasing resource commitment by
>> the logarithm of the problem size has worked reasonably well for table
>> scans, I believe we should pursue a similar approach here.
>
> Thanks for that explanation. It makes sense. So, something like this
> would work: total number of workers = some function of log(sum of
> sizes of relations). The number of workers allotted to each segment
> is restricted to the number of workers chosen by the planner while
> planning that segment. The patch takes care of that limit right now;
> it still needs to incorporate the calculation of the total number of
> workers for the Append.

log(sum of sizes of relations) isn't well-defined for a UNION ALL query.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Parallel Append implementation

2017-02-19 Thread Ashutosh Bapat
On Sun, Feb 19, 2017 at 2:33 PM, Robert Haas  wrote:
> On Fri, Feb 17, 2017 at 11:44 AM, Ashutosh Bapat
>  wrote:
>> That's true for a partitioned table, but not necessarily for every
>> append relation. Amit's patch is generic for all append relations. If
>> the child plans are joins or subquery segments of set operations, I
>> doubt if the same logic works. It may be better if we throw as many
>> workers (or some function "summing" those up) as specified by those
>> subplans. I guess, we have to use different logic for append relations
>> which are base relations and append relations which are not base
>> relations.
>
> Well, I for one do not believe that if somebody writes a UNION ALL
> with 100 branches, they should get 100 (or 99) workers.  Generally
> speaking, the sweet spot for parallel workers on queries we've tested
> so far has been between 1 and 4.  It's straining credulity to believe
> that the number that's correct for parallel append is more than an
> order of magnitude larger.  Since increasing resource commitment by
> the logarithm of the problem size has worked reasonably well for table
> scans, I believe we should pursue a similar approach here.

Thanks for that explanation. It makes sense. So, something like this
would work: total number of workers = some function of log(sum of
sizes of relations). The number of workers allotted to each segment
is restricted to the number of workers chosen by the planner while
planning that segment. The patch takes care of that limit right now;
it still needs to incorporate the calculation of the total number of
workers for the Append.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



