Re: [HACKERS] parallelize queries containing subplans
On Wed, Feb 15, 2017 at 4:38 AM, Robert Haaswrote: > On Tue, Feb 14, 2017 at 4:24 AM, Amit Kapila wrote: >> On further evaluation, it seems this patch has one big problem which >> is that it will allow forming parallel plans which can't be supported >> with current infrastructure. For ex. marking immediate level params >> as parallel safe can generate below type of plan: >> >> Seq Scan on t1 >>Filter: (SubPlan 1) >>SubPlan 1 >> -> Gather >>Workers Planned: 1 >>-> Result >> One-Time Filter: (t1.k = 0) >> -> Parallel Seq Scan on t2 >> >> >> In this plan, we can't evaluate one-time filter (that contains >> correlated param) unless we have the capability to pass all kind of >> PARAM_EXEC param to workers. I don't want to invest too much time in >> this patch unless somebody can see some way using current parallel >> infrastructure to implement correlated subplans. > > I don't think this approach has much chance of working; it just seems > too simplistic. I'm not entirely sure what the right approach is. > Unfortunately, the current query planner code seems to compute the > sets of parameters that are set and used quite late, and really only > on a per-subquery level. > Now just for the sake of discussion consider we have list of allParams at the path level, then also I think it might not be easy to make it work. > Here we need to know whether there is > anything that's set below the Gather node and used above it, or the > other way around, and we need to know it much earlier, while we're > still doing path generation. > Yes, that is exactly the challenge. I am not sure if currently there is a way by which we can identify if a Param on a particular node refers to node below it or above it. > (There's also possible a couple of other cases, like an initPlan that > needs to get executed only once, and also maybe a case where a > parameter is set below the Gather and later used above the Gather. > Not sure if that latter one happen, or how to deal with it.) > I think the case for initPlan is slightly different because we can always evaluate it at Gather (considering it is an uncorrelated initplan) and then pass it to Workers. We generally have a list of all the params at each plan node, so we can identify which of these are initPlan params and then evaluate them. Now, it can be used irrespective of whether it is used above or below the Gather node. For the cases, where it can be used above Gather node, it will work as we always store the computed value of such params in estate/econtext and for the cases when it has to be used below Gather, we need to pass the computed value to workers. Now, there is some exceptions like for few cases not all the params are available at a particular node, but I feel those can be handled easily by either traversing the planstate tree or by actually storing them at Gather node. Actually, in short, this is what is done in the patch proposed for parallizing initplans [1]. [1] - https://commitfest.postgresql.org/13/997/ -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Feb 15, 2017 at 4:46 AM, Robert Haaswrote: > On Mon, Feb 13, 2017 at 11:07 PM, Amit Kapila wrote: >> I have removed the check of AlternativeSubPlan so that it can be >> handled by expression_tree_walker. > ... >> Attached patch fixes all the comments raised till now. > > Committed after removing the reference to AlternativeSubPlan from the > comment. I didn't see any need to mention that. > Okay, I was also of two minds while adding that line in the comment. I had kept it because there are few places in the code where both SubPlan and AlternativeSubPlan are handled together, so I thought the person looking at this code should not wonder why we have not handled AlternativeSubPlan. However, I think it is okay to remove it from comment as there exist other places where only Subplan is handled and we rely on expression tree walker for AlternativeSubPlans. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Mon, Feb 13, 2017 at 11:07 PM, Amit Kapilawrote: > I have removed the check of AlternativeSubPlan so that it can be > handled by expression_tree_walker. ... > Attached patch fixes all the comments raised till now. Committed after removing the reference to AlternativeSubPlan from the comment. I didn't see any need to mention that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Feb 14, 2017 at 4:24 AM, Amit Kapilawrote: > On further evaluation, it seems this patch has one big problem which > is that it will allow forming parallel plans which can't be supported > with current infrastructure. For ex. marking immediate level params > as parallel safe can generate below type of plan: > > Seq Scan on t1 >Filter: (SubPlan 1) >SubPlan 1 > -> Gather >Workers Planned: 1 >-> Result > One-Time Filter: (t1.k = 0) > -> Parallel Seq Scan on t2 > > > In this plan, we can't evaluate one-time filter (that contains > correlated param) unless we have the capability to pass all kind of > PARAM_EXEC param to workers. I don't want to invest too much time in > this patch unless somebody can see some way using current parallel > infrastructure to implement correlated subplans. I don't think this approach has much chance of working; it just seems too simplistic. I'm not entirely sure what the right approach is. Unfortunately, the current query planner code seems to compute the sets of parameters that are set and used quite late, and really only on a per-subquery level. Here we need to know whether there is anything that's set below the Gather node and used above it, or the other way around, and we need to know it much earlier, while we're still doing path generation. There doesn't seem to be any simple way of getting that information, but I think you need it. What's more, I think you would still need it even if you had the ability to pass parameter values between processes. For example, consider: Gather -> Parallel Seq Scan Filter: (Correlated Subplan Reference Goes Here) Of course, the Param in the filter condition *can't* be a shared Param across all processes. It needs to be private to each process participating in the parallel sequential scan -- and the params passing data down from the Parallel Seq Scan to the correlated subplan also need to be private. On the other hand, in your example quoted above, you do need to share across processes: the value for t1.k needs to get passed down. So it seems to me that we somehow need to identify, for each parameter that gets used, whether it's provided by something beneath the Gather node (in which case it should be private to the worker) or whether it's provided from higher up (in which case it should be passed down to the worker, or if we can't do that, then don't use parallelism there). (There's also possible a couple of other cases, like an initPlan that needs to get executed only once, and also maybe a case where a parameter is set below the Gather and later used above the Gather. Not sure if that latter one happen, or how to deal with it.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 3, 2017 at 4:19 PM, Amit Kapilawrote: > On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila wrote: >> >> Now, we can further extend this to parallelize queries containing >> correlated subplans like below: >> >> explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); >> QUERY PLAN >> - >> Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12) >>Filter: (SubPlan 1) >>SubPlan 1 >> -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) >>Filter: (i = t1.i) >> (5 rows) >> >> In the above query, Filter on t2 (i = t1.i) generates Param node which >> is a parallel-restricted node, so such queries won't be able to use >> parallelism even with the patch. I think we can mark Params which >> refer to same level as parallel-safe and I think we have this >> information (node-> varlevelsup/ phlevelsup/ agglevelsup) available >> when we replace correlation vars (SS_replace_correlation_vars). >> > > I have implemented the above idea which will allow same or immediate > outer level PARAMS as parallel_safe. The results of above query after > patch: > > postgres=# explain select * from t1 where t1.i in (select t2.i from t2 > where t2.i=t1.i); > QUERY PLAN > -- > Gather (cost=0.00..49.88 rows=8395 width=12) >Workers Planned: 1 >-> Parallel Seq Scan on t1 (cost=0.00..49.88 rows=4938 width=12) > Filter: (SubPlan 1) > SubPlan 1 >-> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) > Filter: (i = t1.i) > (7 rows) > On further evaluation, it seems this patch has one big problem which is that it will allow forming parallel plans which can't be supported with current infrastructure. For ex. marking immediate level params as parallel safe can generate below type of plan: Seq Scan on t1 Filter: (SubPlan 1) SubPlan 1 -> Gather Workers Planned: 1 -> Result One-Time Filter: (t1.k = 0) -> Parallel Seq Scan on t2 In this plan, we can't evaluate one-time filter (that contains correlated param) unless we have the capability to pass all kind of PARAM_EXEC param to workers. I don't want to invest too much time in this patch unless somebody can see some way using current parallel infrastructure to implement correlated subplans. Note that still, the other patch [1] in this thread which implements parallelism for uncorrelated subplan holds good. [1] - https://www.postgresql.org/message-id/CAA4eK1J9mDZLcp-OskkdzAf_yT8W4dBSGL9E%3DkoEiJkdpVZsEA%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Feb 2, 2017 at 9:23 AM, Amit Kapilawrote: > On Wed, Feb 1, 2017 at 10:04 PM, Robert Haas wrote: >> On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapila wrote: >>> Moved this patch to next CF. >> >> So here is what seems to be the key hunk from this patch: >> >> /* >> - * Since we don't have the ability to push subplans down to workers at >> - * present, we treat subplan references as parallel-restricted. We need >> - * not worry about examining their contents; if they are unsafe, we >> would >> - * have found that out while examining the whole tree before reduction >> of >> - * sublinks to subplans. (Really we should not see SubLink during a >> - * max_interesting == restricted scan, but if we do, return true.) >> + * We can push the subplans only if they don't contain any >> parallel-aware >> + * node as we don't support multi-level parallelism (parallel workers >> + * invoking another set of parallel workers). >> */ >> -else if (IsA(node, SubLink) || >> - IsA(node, SubPlan) || >> - IsA(node, AlternativeSubPlan)) >> +else if (IsA(node, SubPlan)) >> +return !((SubPlan *) node)->parallel_safe; >> +else if (IsA(node, AlternativeSubPlan)) >> { >> -if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context)) >> -return true; >> +AlternativeSubPlan *asplan = (AlternativeSubPlan *) node; >> +ListCell *lc; >> + >> +foreach(lc, asplan->subplans) >> +{ >> +SubPlan*splan = (SubPlan *) lfirst(lc); >> + >> +Assert(IsA(splan, SubPlan)); >> + >> +if (max_parallel_hazard_walker((Node *) splan, context)) >> +return true; >> +} >> + >> +return false; >> } >> >> I don't see the reason for having this new code that makes >> AlternativeSubPlan walk over the subplans; expression_tree_walker >> already does that. >> I have removed the check of AlternativeSubPlan so that it can be handled by expression_tree_walker. > > It is done this way mainly to avoid recursion/nested calls, but I > think you prefer to handle it via expression_tree_walker so that code > looks clean. Another probable reason could be that there is always a > chance that we miss handling of some expression of a particular node > (in this case AlternativeSubPlan), but if that is the case then there > are other places in the code which do operate on individual subplan/'s > in AlternativeSubPlan list. > >> On the flip side I don't see the reason for >> removing the max_parallel_hazard_test() call for AlternativeSubPlan or >> for SubLink. > > If we don't remove the current test of max_parallel_hazard_test() for > AlternativeSubPlan, then AlternativeSubPlan node will be considered > parallel restricted, so why do we want that after this patch. For > Sublinks, I think they would have converted to Subplans by the time we > reach this function for the parallel restricted check. Can you > elaborate what you have in mind for the handling of AlternativeSubPlan > and SubLink? > I have removed the changes for SubLink node. >> +* CTE scans are not consider for parallelism (cf >> >> >> considered >> Fixed. >> + select count(*)from tenk1 where (two, four) not in >> >> whitespace Fixed. Attached patch fixes all the comments raised till now. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_subplan_v5.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Feb 1, 2017 at 10:04 PM, Robert Haaswrote: > On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapila wrote: >> Moved this patch to next CF. > > So here is what seems to be the key hunk from this patch: > > /* > - * Since we don't have the ability to push subplans down to workers at > - * present, we treat subplan references as parallel-restricted. We need > - * not worry about examining their contents; if they are unsafe, we would > - * have found that out while examining the whole tree before reduction of > - * sublinks to subplans. (Really we should not see SubLink during a > - * max_interesting == restricted scan, but if we do, return true.) > + * We can push the subplans only if they don't contain any parallel-aware > + * node as we don't support multi-level parallelism (parallel workers > + * invoking another set of parallel workers). > */ > -else if (IsA(node, SubLink) || > - IsA(node, SubPlan) || > - IsA(node, AlternativeSubPlan)) > +else if (IsA(node, SubPlan)) > +return !((SubPlan *) node)->parallel_safe; > +else if (IsA(node, AlternativeSubPlan)) > { > -if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context)) > -return true; > +AlternativeSubPlan *asplan = (AlternativeSubPlan *) node; > +ListCell *lc; > + > +foreach(lc, asplan->subplans) > +{ > +SubPlan*splan = (SubPlan *) lfirst(lc); > + > +Assert(IsA(splan, SubPlan)); > + > +if (max_parallel_hazard_walker((Node *) splan, context)) > +return true; > +} > + > +return false; > } > > I don't see the reason for having this new code that makes > AlternativeSubPlan walk over the subplans; expression_tree_walker > already does that. > It is done this way mainly to avoid recursion/nested calls, but I think you prefer to handle it via expression_tree_walker so that code looks clean. Another probable reason could be that there is always a chance that we miss handling of some expression of a particular node (in this case AlternativeSubPlan), but if that is the case then there are other places in the code which do operate on individual subplan/'s in AlternativeSubPlan list. > On the flip side I don't see the reason for > removing the max_parallel_hazard_test() call for AlternativeSubPlan or > for SubLink. If we don't remove the current test of max_parallel_hazard_test() for AlternativeSubPlan, then AlternativeSubPlan node will be considered parallel restricted, so why do we want that after this patch. For Sublinks, I think they would have converted to Subplans by the time we reach this function for the parallel restricted check. Can you elaborate what you have in mind for the handling of AlternativeSubPlan and SubLink? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 31, 2017 at 6:01 AM, Amit Kapilawrote: > Moved this patch to next CF. So here is what seems to be the key hunk from this patch: /* - * Since we don't have the ability to push subplans down to workers at - * present, we treat subplan references as parallel-restricted. We need - * not worry about examining their contents; if they are unsafe, we would - * have found that out while examining the whole tree before reduction of - * sublinks to subplans. (Really we should not see SubLink during a - * max_interesting == restricted scan, but if we do, return true.) + * We can push the subplans only if they don't contain any parallel-aware + * node as we don't support multi-level parallelism (parallel workers + * invoking another set of parallel workers). */ -else if (IsA(node, SubLink) || - IsA(node, SubPlan) || - IsA(node, AlternativeSubPlan)) +else if (IsA(node, SubPlan)) +return !((SubPlan *) node)->parallel_safe; +else if (IsA(node, AlternativeSubPlan)) { -if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context)) -return true; +AlternativeSubPlan *asplan = (AlternativeSubPlan *) node; +ListCell *lc; + +foreach(lc, asplan->subplans) +{ +SubPlan*splan = (SubPlan *) lfirst(lc); + +Assert(IsA(splan, SubPlan)); + +if (max_parallel_hazard_walker((Node *) splan, context)) +return true; +} + +return false; } I don't see the reason for having this new code that makes AlternativeSubPlan walk over the subplans; expression_tree_walker already does that. On the flip side I don't see the reason for removing the max_parallel_hazard_test() call for AlternativeSubPlan or for SubLink. I think that the core of this change is to change the handling for SubPlan nodes to just return the flag from the subplan, and that seems fine, but the rest of this I'm skeptical about. +* CTE scans are not consider for parallelism (cf considered + select count(*)from tenk1 where (two, four) not in whitespace -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 24, 2017 at 10:59 AM, Amit Kapilawrote: > On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar wrote: >> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar wrote: >>> During debugging I found that subplan created for below part of the >>> query is parallel_unsafe, Is it a problem or there is some explanation >>> of why it's not parallel_safe, >> >> Okay, so basically we don't have any mechanism to perform parallel >> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes) >> must come along with CTE scan. So I think we can avoid setting below >> code because we will never be able to test its side effect, another >> argument can be that if we don't consider the final effect, and just >> see this subplan then by logic it should be marked parallel-safe or >> unsafe as per it's path and it will not have any side effect, as it >> will finally become parallel-unsafe. So it's your call to keep it >> either way. >> > > Yeah, actually setting parallel_safety information for subplan from > corresponding is okay. However, in this particular case as we know > that it might not be of any use till we enable parallelism for CTE > Scan (and doing that is certainly not essential for this project). > So, I have set parallel_safe as false for CTE subplans. > Moved this patch to next CF. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 24, 2017 at 10:59 AM, Amit Kapilawrote: > On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumar wrote: >> On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar wrote: >>> During debugging I found that subplan created for below part of the >>> query is parallel_unsafe, Is it a problem or there is some explanation >>> of why it's not parallel_safe, >> >> Okay, so basically we don't have any mechanism to perform parallel >> scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes) >> must come along with CTE scan. So I think we can avoid setting below >> code because we will never be able to test its side effect, another >> argument can be that if we don't consider the final effect, and just >> see this subplan then by logic it should be marked parallel-safe or >> unsafe as per it's path and it will not have any side effect, as it >> will finally become parallel-unsafe. So it's your call to keep it >> either way. >> > > Yeah, actually setting parallel_safety information for subplan from > corresponding is okay. > missed the word *path* in above sentence. /corresponding/corresponding path -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumarwrote: > On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar wrote: >> During debugging I found that subplan created for below part of the >> query is parallel_unsafe, Is it a problem or there is some explanation >> of why it's not parallel_safe, > > Okay, so basically we don't have any mechanism to perform parallel > scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes) > must come along with CTE scan. So I think we can avoid setting below > code because we will never be able to test its side effect, another > argument can be that if we don't consider the final effect, and just > see this subplan then by logic it should be marked parallel-safe or > unsafe as per it's path and it will not have any side effect, as it > will finally become parallel-unsafe. So it's your call to keep it > either way. > Yeah, actually setting parallel_safety information for subplan from corresponding is okay. However, in this particular case as we know that it might not be of any use till we enable parallelism for CTE Scan (and doing that is certainly not essential for this project). So, I have set parallel_safe as false for CTE subplans. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_subplan_v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 19, 2017 at 4:51 PM, Dilip Kumarwrote: > On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumar wrote: >> During debugging I found that subplan created for below part of the >> query is parallel_unsafe, Is it a problem or there is some explanation >> of why it's not parallel_safe, > > Okay, so basically we don't have any mechanism to perform parallel > scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes) > must come along with CTE scan. So I think we can avoid setting below > code because we will never be able to test its side effect, another > argument can be that if we don't consider the final effect, and just > see this subplan then by logic it should be marked parallel-safe or > unsafe as per it's path and it will not have any side effect, as it > will finally become parallel-unsafe. So it's your call to keep it > either way. Oops, you're right. We don't consider parallelism for RTE_CTE type. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumarwrote: > During debugging I found that subplan created for below part of the > query is parallel_unsafe, Is it a problem or there is some explanation > of why it's not parallel_safe, Okay, so basically we don't have any mechanism to perform parallel scan on CTE. And, IMHO subplan built for CTE (using SS_process_ctes) must come along with CTE scan. So I think we can avoid setting below code because we will never be able to test its side effect, another argument can be that if we don't consider the final effect, and just see this subplan then by logic it should be marked parallel-safe or unsafe as per it's path and it will not have any side effect, as it will finally become parallel-unsafe. So it's your call to keep it either way. @@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root) >firstColCollation); splan->useHashTable = false; splan->unknownEqFalse = false; + splan->parallel_safe = best_path->parallel_safe; -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 19, 2017 at 3:19 PM, Kuntal Ghoshwrote: > Since Param is not parallel_safe till now, the SubPlan is also not > parallel_safe. This is why CTE subplans will not be pushed under > Gather. Specifically, Params which are generated using generate_new_param() are not parallel_safe. In the patch, it is marked as parallel-unsafe. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 19, 2017 at 3:05 PM, Dilip Kumarwrote: > @@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root) > >firstColCollation); > splan->useHashTable = false; > splan->unknownEqFalse = false; > + splan->parallel_safe = best_path->parallel_safe; > > I noticed that if path for CTE is parallel safe then we are marking > CTE subplan as parallel safe, In particular, I don't have any problem > with that, but can you add some test case which can cover this path, I > mean to say where CTE subplan are pushed. > > > I have tried to test the subplan with CTE below is my test. > create table t1(a int , b varchar); > create table t (n int, b varchar); > > Query: > explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS ( > VALUES (1) > UNION ALL > SELECT a+1 FROM t1 WHERE a < 100 > ) > SELECT sum(n) FROM t); > > During debugging I found that subplan created for below part of the > query is parallel_unsafe, Is it a problem or there is some explanation > of why it's not parallel_safe, > > (WITH RECURSIVE t(n) AS ( > VALUES (1) > UNION ALL > SELECT a+1 FROM t1 WHERE a < 100 > ) > SELECT sum(n) FROM t); > -- The corresponding plan for the query you have specified is: QUERY PLAN -- Seq Scan on public.t (cost=40.73..20894.74 rows=500480 width=35) Output: t.n, t.b Filter: (NOT (hashed SubPlan 2)) SubPlan 2 -> Aggregate (cost=40.72..40.73 rows=1 width=8) Output: sum(t_1.n) CTE t -> Append (cost=0.00..31.18 rows=424 width=4) -> Result (cost=0.00..0.01 rows=1 width=4) Output: 1 -> Seq Scan on public.t1 (cost=0.00..26.93 rows=423 width=4) Output: (t1.a + 1) Filter: (t1.a < 100) -> CTE Scan on t t_1 (cost=0.00..8.48 rows=424 width=4) Output: t_1.n (15 rows) Now, the plan for CTE is parallel_safe. But, a CTE plan is converted to an InitPlan and returns a Param which is used in the CTE Scan. Since Param is not parallel_safe till now, the SubPlan is also not parallel_safe. This is why CTE subplans will not be pushed under Gather. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Mon, Jan 16, 2017 at 9:13 PM, Amit Kapilawrote: > > After commit-ab1f0c8, this patch needs a rebase. Attached find > rebased version of the patch. > > Thanks to Kuntal for informing me offlist that this patch needs rebase. In this patch, I have observed some changes while creating subplan for CTE. So I have reviewed this from that perspective and also tried to perform some test. Here is my finding/ comments. @@ -1213,6 +1216,7 @@ SS_process_ctes(PlannerInfo *root) >firstColCollation); splan->useHashTable = false; splan->unknownEqFalse = false; + splan->parallel_safe = best_path->parallel_safe; I noticed that if path for CTE is parallel safe then we are marking CTE subplan as parallel safe, In particular, I don't have any problem with that, but can you add some test case which can cover this path, I mean to say where CTE subplan are pushed. I have tried to test the subplan with CTE below is my test. create table t1(a int , b varchar); create table t (n int, b varchar); Query: explain verbose select * from t where t.n not in (WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL SELECT a+1 FROM t1 WHERE a < 100 ) SELECT sum(n) FROM t); During debugging I found that subplan created for below part of the query is parallel_unsafe, Is it a problem or there is some explanation of why it's not parallel_safe, (WITH RECURSIVE t(n) AS ( VALUES (1) UNION ALL SELECT a+1 FROM t1 WHERE a < 100 ) SELECT sum(n) FROM t); -- -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapilawrote: > On Thu, Jan 12, 2017 at 8:51 AM, Robert Haas wrote: >> On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila wrote: >>> The other alternative is to remember this information in SubPlan. We >>> can retrieve parallel_safe information from best_path and can use it >>> while generating SubPlan. The main reason for storing it in the plan >>> was to avoid explicitly passing parallel_safe information while >>> generating SubPlan as plan was already available at that time. >>> However, it seems there are only two places in code (refer >>> build_subplan) where this information needs to be propagated. Let me >>> know if you prefer to remember the parallel_safe information in >>> SubPlan instead of in Plan or if you have something else in mind? >> >> I think we should try doing it in the SubPlan, at least, and see if >> that comes out more elegant than what you have at the moment. >> > > Okay, done that way. I have fixed the review comments raised by Dilip > as well and added the test case in attached patch. > After commit-ab1f0c8, this patch needs a rebase. Attached find rebased version of the patch. Thanks to Kuntal for informing me offlist that this patch needs rebase. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_subplan_v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Fri, Jan 13, 2017 at 7:13 PM, Tom Lanewrote: > Dilip Kumar writes: >> ERROR: did not find '}' at end of input node at character 762 > I could reproduce this error with simple query like: SELECT * FROM information_schema.role_usage_grants WHERE object_type LIKE 'FOREIGN%' AND object_name IN ('s6', 'foo') ORDER BY 1, 2, 3, 4, 5; The reason is that the stored rules have the old structure of Param. See below: {RELABELTYPE :arg {PARAM :paramkind 2 :paramid 1 :paramtype 11975 :paramtypmod -1 :paramcollid 100 :location -1} The new variable parallel_safe is not present in above node. If you recreate the fresh database you won't see the above problem. I have observed a problem in equal function which I have fixed in the attached patch. I have also added a test, so that we can catch any problem similar to what you have reported. > I've not looked at the patches, but just seeing this error message, > this looks like somebody's fat-fingered the correspondence between > outfuncs.c and readfuncs.c processing. > I think what we need is catversion bump as Param is part of stored rules. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_correl_subplan_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
Dilip Kumarwrites: > ERROR: did not find '}' at end of input node at character 762 I've not looked at the patches, but just seeing this error message, this looks like somebody's fat-fingered the correspondence between outfuncs.c and readfuncs.c processing. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 12, 2017 at 7:56 PM, Amit Kapilawrote: > Okay, done that way. I have fixed the review comments raised by Dilip > as well and added the test case in attached patch. I have tested with pq_pushdown_subplan_v2.patch + pq_pushdown_correl_subplan_v1.patch sqlsmith. is reporting below error, I have attached the query to reproduce the issue. This issue is only coming when both the patch is applied, only with pq_pushdown_subplan_v2.patch there is no problem. So I think the problem is because of pq_pushdown_correl_subplan_v1.patch. ERROR: did not find '}' at end of input node at character 762 -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com select subq_2.c1 as c0, 25 as c1 from (select pg_catalog.random() as c0, 31 as c1, ref_3.prsend as c2, subq_1.c2 as c3 from (select (select page from pg_catalog.pg_locks limit 1 offset 10) as c0, 9 as c1, sample_1.opfname as c2, sample_1.opfnamespace as c3 from pg_catalog.pg_opfamily as sample_1 tablesample bernoulli (5.4) , lateral (select 57 as c0, 7 as c1 from pg_catalog.pg_amop as sample_2 tablesample system (8.7) where 54 > (select character_maximum_length from information_schema.element_types limit 1 offset 28) limit 8) as subq_0 where (((pg_catalog.pg_postmaster_start_time() < pg_catalog.pg_stat_get_snapshot_timestamp()) and (subq_0.c0 is not NULL)) and (pg_catalog.pg_backend_pid() <> 30)) or (subq_0.c0 >= (select active_pid from pg_catalog.pg_replication_slots limit 1 offset 9) ) limit 167) as subq_1 inner join pg_catalog.pg_ts_parser as ref_3 on (subq_1.c2 = ref_3.prsname ) where subq_1.c1 <= subq_1.c0 limit 56) as subq_2, lateral (select subq_2.c0 as c0 from pg_catalog.pg_statistic as sample_3 tablesample bernoulli (2.3) where (EXISTS ( select pg_catalog.pg_trigger_depth() as c0, sample_4.comments as c1 from information_schema.sql_sizing_profiles as sample_4 tablesample bernoulli (9.9) where (pg_catalog.statement_timestamp() is not NULL) and (((select character_maximum_length from information_schema.domains limit 1 offset 40) < 7) or ((select ordinal_position from information_schema.parameters limit 1 offset 21) = sample_4.sizing_id) or (sample_4.sizing_id <> sample_4.required_value)) or (sample_4.sizing_id <= pg_catalog.pg_stat_get_buf_alloc())) and ((pg_catalog.pg_stat_get_bgwriter_buf_written_checkpoints() <> pg_catalog.gin_cmp_tslexeme( cast(pg_catalog.timeofday() as text), cast(null as text))) and (sample_4.profile_id is NULL))) or ((select ordinal_position from information_schema.attributes limit 1 offset 31) < sample_4.sizing_id))) and (sample_4.sizing_id <= sample_4.sizing_id)) or (sample_4.sizing_id <> 40)) and ((sample_4.required_value <> sample_4.sizing_id) and (false))) or (((pg_catalog.int8range_canonical( cast(null as int8range)) is not NULL) and (sample_4.required_value >= 58)) and (((select objsubid from pg_catalog.pg_seclabel limit 1 offset 1) is NULL) and (((select typtypmod from pg_catalog.pg_type limit 1 offset 32) <= (select character_maximum_length from information_schema.element_types limit 1 offset 33) ) or (13 <= 9) limit 67)) and ((27 <= (select sourceline from pg_catalog.pg_settings limit 1 offset 6) ) or (sample_3.staattnum <= subq_2.c1)) limit 52) as subq_3 where true; -- Sent via pgsql-hackers mailing list
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 12, 2017 at 8:51 AM, Robert Haaswrote: > On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapila wrote: >> The other alternative is to remember this information in SubPlan. We >> can retrieve parallel_safe information from best_path and can use it >> while generating SubPlan. The main reason for storing it in the plan >> was to avoid explicitly passing parallel_safe information while >> generating SubPlan as plan was already available at that time. >> However, it seems there are only two places in code (refer >> build_subplan) where this information needs to be propagated. Let me >> know if you prefer to remember the parallel_safe information in >> SubPlan instead of in Plan or if you have something else in mind? > > I think we should try doing it in the SubPlan, at least, and see if > that comes out more elegant than what you have at the moment. > Okay, done that way. I have fixed the review comments raised by Dilip as well and added the test case in attached patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_subplan_v2.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Thu, Jan 12, 2017 at 9:22 AM, Amit Kapilawrote: > Valid point, but I think we can avoid that by returning false after > foreach(..) loop. I think one improvement could be that instead of > manually checking the parallel safety of each subplan, we can > recursively call max_parallel_hazard_walker for each subplan. I agree that this way we can avoid the problem what I mentioned. > > >> But, more than that it will be cleaner to not >> handle AlternativeSubPlan here unless there is some strong reason? >> > > Yeah, the reason is to avoid unnecessary recursion. Also, in all > other places in the code, we do handle both of them separately, refer > cost_qual_eval_walker for somewhat similar usage. ok. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 10, 2017 at 11:55 AM, Dilip Kumarwrote: > On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapila wrote: >> Attached patch implements the above idea. This will enable >> parallelism for queries containing un-correlated subplans, an example >> of which is as follows: > > I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch > looks clean to me. > I have also done some basic functional testing and it's working fine. > > I have one comment. > > + else if (IsA(node, AlternativeSubPlan)) > + { > + AlternativeSubPlan *asplan = (AlternativeSubPlan *) node; > + ListCell *lc; > + > + Assert(context->root); > + > + foreach(lc, asplan->subplans) > + { > + SubPlan*splan = (SubPlan *) lfirst(lc); > + Plan *plan; > + > + Assert(IsA(splan, SubPlan)); > + > + plan = planner_subplan_get_plan(context->root, splan); > + if (!plan->parallel_safe) > + return true; > + } > } > > I see no reason why we need to process the subplan list of > AlternativeSubPlan here, Anyway expression_tree_walker will take care > of processing each subplan of AlternativeSubPlan. Now in the case > where all the subplans in AlternativeSubPlan are parallel_safe we will > process this list twice. > Valid point, but I think we can avoid that by returning false after foreach(..) loop. I think one improvement could be that instead of manually checking the parallel safety of each subplan, we can recursively call max_parallel_hazard_walker for each subplan. > But, more than that it will be cleaner to not > handle AlternativeSubPlan here unless there is some strong reason? > Yeah, the reason is to avoid unnecessary recursion. Also, in all other places in the code, we do handle both of them separately, refer cost_qual_eval_walker for somewhat similar usage. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Jan 11, 2017 at 9:58 PM, Amit Kapilawrote: > The other alternative is to remember this information in SubPlan. We > can retrieve parallel_safe information from best_path and can use it > while generating SubPlan. The main reason for storing it in the plan > was to avoid explicitly passing parallel_safe information while > generating SubPlan as plan was already available at that time. > However, it seems there are only two places in code (refer > build_subplan) where this information needs to be propagated. Let me > know if you prefer to remember the parallel_safe information in > SubPlan instead of in Plan or if you have something else in mind? I think we should try doing it in the SubPlan, at least, and see if that comes out more elegant than what you have at the moment. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Tue, Jan 10, 2017 at 10:55 PM, Robert Haaswrote: > On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapila wrote: >> Currently, queries that have references to SubPlans or >> AlternativeSubPlans are considered parallel-restricted. I think we >> can lift this restriction in many cases especially when SubPlans are >> parallel-safe. To make this work, we need to propagate the >> parallel-safety information from path node to plan node and the same >> could be easily done while creating a plan. Another option could be >> that instead of propagating parallel-safety information from path to >> plan, we can find out from the plan if it is parallel-safe (doesn't >> contain any parallel-aware node) by traversing whole plan tree, but I >> think it is a waste of cycles. Once we have parallel-safety >> information in the plan, we can use that for detection of >> parallel-safe expressions in max_parallel_hazard_walker(). Finally, >> we can pass all the subplans to workers during plan serialization in >> ExecSerializePlan(). This will enable workers to execute subplans >> that are referred in parallel part of the plan. Now, we might be able >> to optimize it such that we pass only subplans that are referred in >> parallel portion of plan, but I am not sure if it is worth the trouble >> because it is one-time cost and much lesser than other things we do >> (like creating >> dsm, launching workers). > > It seems unfortunate to have to add a parallel_safe flag to the > finished plan; the whole reason we have the Path-Plan distinction is > so that we can throw away information that won't be needed at > execution time. The parallel_safe flag is, in fact, not needed at > execution time, but just for further planning. Isn't there some way > that we can remember, at the time when a sublink is converted to a > subplan, whether or not the subplan was created from a parallel-safe > path? > The other alternative is to remember this information in SubPlan. We can retrieve parallel_safe information from best_path and can use it while generating SubPlan. The main reason for storing it in the plan was to avoid explicitly passing parallel_safe information while generating SubPlan as plan was already available at that time. However, it seems there are only two places in code (refer build_subplan) where this information needs to be propagated. Let me know if you prefer to remember the parallel_safe information in SubPlan instead of in Plan or if you have something else in mind? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Dec 28, 2016 at 1:17 AM, Amit Kapilawrote: > Currently, queries that have references to SubPlans or > AlternativeSubPlans are considered parallel-restricted. I think we > can lift this restriction in many cases especially when SubPlans are > parallel-safe. To make this work, we need to propagate the > parallel-safety information from path node to plan node and the same > could be easily done while creating a plan. Another option could be > that instead of propagating parallel-safety information from path to > plan, we can find out from the plan if it is parallel-safe (doesn't > contain any parallel-aware node) by traversing whole plan tree, but I > think it is a waste of cycles. Once we have parallel-safety > information in the plan, we can use that for detection of > parallel-safe expressions in max_parallel_hazard_walker(). Finally, > we can pass all the subplans to workers during plan serialization in > ExecSerializePlan(). This will enable workers to execute subplans > that are referred in parallel part of the plan. Now, we might be able > to optimize it such that we pass only subplans that are referred in > parallel portion of plan, but I am not sure if it is worth the trouble > because it is one-time cost and much lesser than other things we do > (like creating > dsm, launching workers). It seems unfortunate to have to add a parallel_safe flag to the finished plan; the whole reason we have the Path-Plan distinction is so that we can throw away information that won't be needed at execution time. The parallel_safe flag is, in fact, not needed at execution time, but just for further planning. Isn't there some way that we can remember, at the time when a sublink is converted to a subplan, whether or not the subplan was created from a parallel-safe path? That seems like it would be cleaner than maintaining this flag for all plans. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapilawrote: > Attached patch implements the above idea. This will enable > parallelism for queries containing un-correlated subplans, an example > of which is as follows: I have reviewed the patch (pq_pushdown_subplan_v1.patch), Mostly patch looks clean to me. I have also done some basic functional testing and it's working fine. I have one comment. + else if (IsA(node, AlternativeSubPlan)) + { + AlternativeSubPlan *asplan = (AlternativeSubPlan *) node; + ListCell *lc; + + Assert(context->root); + + foreach(lc, asplan->subplans) + { + SubPlan*splan = (SubPlan *) lfirst(lc); + Plan *plan; + + Assert(IsA(splan, SubPlan)); + + plan = planner_subplan_get_plan(context->root, splan); + if (!plan->parallel_safe) + return true; + } } I see no reason why we need to process the subplan list of AlternativeSubPlan here, Anyway expression_tree_walker will take care of processing each subplan of AlternativeSubPlan. Now in the case where all the subplans in AlternativeSubPlan are parallel_safe we will process this list twice. But, more than that it will be cleaner to not handle AlternativeSubPlan here unless there is some strong reason? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapilawrote: > > Now, we can further extend this to parallelize queries containing > correlated subplans like below: > > explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); > QUERY PLAN > - > Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12) >Filter: (SubPlan 1) >SubPlan 1 > -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) >Filter: (i = t1.i) > (5 rows) > > In the above query, Filter on t2 (i = t1.i) generates Param node which > is a parallel-restricted node, so such queries won't be able to use > parallelism even with the patch. I think we can mark Params which > refer to same level as parallel-safe and I think we have this > information (node-> varlevelsup/ phlevelsup/ agglevelsup) available > when we replace correlation vars (SS_replace_correlation_vars). > I have implemented the above idea which will allow same or immediate outer level PARAMS as parallel_safe. The results of above query after patch: postgres=# explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); QUERY PLAN -- Gather (cost=0.00..49.88 rows=8395 width=12) Workers Planned: 1 -> Parallel Seq Scan on t1 (cost=0.00..49.88 rows=4938 width=12) Filter: (SubPlan 1) SubPlan 1 -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) Filter: (i = t1.i) (7 rows) Note - This patch can be applied on top of pq_pushdown_subplan_v1.patch posted upthread [1] [1] - https://www.postgresql.org/message-id/CAA4eK1%2Be8Z45D2n%2BrnDMDYsVEb5iW7jqaCH_tvPMYau%3D1Rru9w%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com pq_pushdown_correl_subplan_v1.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] parallelize queries containing subplans
On Wed, Dec 28, 2016 at 11:47 AM, Amit Kapilawrote: > > Currently, queries that have references to SubPlans or > AlternativeSubPlans are considered parallel-restricted. I think we > can lift this restriction in many cases especially when SubPlans are > parallel-safe. To make this work, we need to propagate the > parallel-safety information from path node to plan node and the same > could be easily done while creating a plan. Another option could be > that instead of propagating parallel-safety information from path to > plan, we can find out from the plan if it is parallel-safe (doesn't > contain any parallel-aware node) by traversing whole plan tree, but I > think it is a waste of cycles. Once we have parallel-safety > information in the plan, we can use that for detection of > parallel-safe expressions in max_parallel_hazard_walker(). Finally, > we can pass all the subplans to workers during plan serialization in > ExecSerializePlan(). This will enable workers to execute subplans > that are referred in parallel part of the plan. Now, we might be able > to optimize it such that we pass only subplans that are referred in > parallel portion of plan, but I am not sure if it is worth the trouble > because it is one-time cost and much lesser than other things we do > (like creating > dsm, launching workers). > > Attached patch implements the above idea. This will enable > parallelism for queries containing un-correlated subplans, an example > of which is as follows: > > set parallel_tuple_cost=0; > set parallel_setup_cost=0; > set min_parallel_relation_size=50; > > create table t1 (i int, j int, k int); > create table t2 (i int, j int, k int); > > insert into t1 values (generate_series(1,10)*random(), > generate_series(5,50)*random(), > generate_series(8,80)*random()); > insert into t2 values (generate_series(4,10)*random(), > generate_series(5,90)*random(), > generate_series(2,10)*random()); > > > Plan without Patch > - > postgres=# explain select * from t1 where t1.i not in (select t2.i > from t2 where t2.i in (1,2,3)); > QUERY PLAN > --- > Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12) >Filter: (NOT (hashed SubPlan 1)) >SubPlan 1 > -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) >Filter: (i = ANY ('{1,2,3}'::integer[])) > (5 rows) > > Plan with Patch > > postgres=# explain select * from t1 where t1.i not in (select t2.i > from t2 where t2.i in (1,2,3)); >QUERY PLAN > - > Gather (cost=110.84..325.30 rows=8395 width=12) >Workers Planned: 1 >-> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12) > Filter: (NOT (hashed SubPlan 1)) > SubPlan 1 >-> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) > Filter: (i = ANY ('{1,2,3}'::integer[])) > (7 rows) > > We have observed that Q-16 in TPC-H have been improved with the patch > and the analysis of same will be shared by my colleague Rafia. > To study the effect of uncorrelated sub-plan pushdown on TPC-H and TPC-DS benchmark queries we performed some experiments and the execution time results for same are summarised as follows, Query| HEAD | Patches | scale-factor ---+-+---+- DS-Q45 | 35 | 19 | scale-factor: 100 H-Q16 | 812 | 645 | scale-factor: 300 H-Q16 | 49 | 37 | scale-factor: 20 Execution time given in above table is in seconds. Detailed analysis of this experimentation is as follows, Additional patches applied in this analysis are, Parallel index scan [1] Parallel index-only scan [2] Parallel merge-join [3] The system setup used for this experiment is, Server parameter settings: work_mem = 500 MB, max_parallel_workers_per_gather = 4, random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost, shared_buffers = 1 GB Machine used: IBM Power, 4 socket machine, 512 GB RAM TPC-DS scale factor = 100 (approx size of database is 150 GB) Query 45 which takes around 35 seconds on head, completes in 19 seconds with these patches. The point to note here is that without this patch of pushing uncorrelated sublans, hash join which is using subplan in join filter could not be pushed to workers and hence query was unable to use the parallelism enough, with this patch parallelism is available till the topmost join node. The output of explain analyse statement of this query on both head and with patches is given in attached file ds_q45.txt. On further evaluating these patches on TPC-H queries on different scale factors we came across query 16, for TPC-H scale factor 20 and aforementioned parameter settings with the change of max_parallel_workers_per gather = 2, it took 37 seconds with the
[HACKERS] parallelize queries containing subplans
Currently, queries that have references to SubPlans or AlternativeSubPlans are considered parallel-restricted. I think we can lift this restriction in many cases especially when SubPlans are parallel-safe. To make this work, we need to propagate the parallel-safety information from path node to plan node and the same could be easily done while creating a plan. Another option could be that instead of propagating parallel-safety information from path to plan, we can find out from the plan if it is parallel-safe (doesn't contain any parallel-aware node) by traversing whole plan tree, but I think it is a waste of cycles. Once we have parallel-safety information in the plan, we can use that for detection of parallel-safe expressions in max_parallel_hazard_walker(). Finally, we can pass all the subplans to workers during plan serialization in ExecSerializePlan(). This will enable workers to execute subplans that are referred in parallel part of the plan. Now, we might be able to optimize it such that we pass only subplans that are referred in parallel portion of plan, but I am not sure if it is worth the trouble because it is one-time cost and much lesser than other things we do (like creating dsm, launching workers). Attached patch implements the above idea. This will enable parallelism for queries containing un-correlated subplans, an example of which is as follows: set parallel_tuple_cost=0; set parallel_setup_cost=0; set min_parallel_relation_size=50; create table t1 (i int, j int, k int); create table t2 (i int, j int, k int); insert into t1 values (generate_series(1,10)*random(), generate_series(5,50)*random(), generate_series(8,80)*random()); insert into t2 values (generate_series(4,10)*random(), generate_series(5,90)*random(), generate_series(2,10)*random()); Plan without Patch - postgres=# explain select * from t1 where t1.i not in (select t2.i from t2 where t2.i in (1,2,3)); QUERY PLAN --- Seq Scan on t1 (cost=110.84..411.72 rows=8395 width=12) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) Filter: (i = ANY ('{1,2,3}'::integer[])) (5 rows) Plan with Patch postgres=# explain select * from t1 where t1.i not in (select t2.i from t2 where t2.i in (1,2,3)); QUERY PLAN - Gather (cost=110.84..325.30 rows=8395 width=12) Workers Planned: 1 -> Parallel Seq Scan on t1 (cost=110.84..325.30 rows=4938 width=12) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on t2 (cost=0.00..104.50 rows=2537 width=4) Filter: (i = ANY ('{1,2,3}'::integer[])) (7 rows) We have observed that Q-16 in TPC-H have been improved with the patch and the analysis of same will be shared by my colleague Rafia. Now, we can further extend this to parallelize queries containing correlated subplans like below: explain select * from t1 where t1.i in (select t2.i from t2 where t2.i=t1.i); QUERY PLAN - Seq Scan on t1 (cost=0.00..831049.09 rows=8395 width=12) Filter: (SubPlan 1) SubPlan 1 -> Seq Scan on t2 (cost=0.00..97.73 rows=493 width=4) Filter: (i = t1.i) (5 rows) In the above query, Filter on t2 (i = t1.i) generates Param node which is a parallel-restricted node, so such queries won't be able to use parallelism even with the patch. I think we can mark Params which refer to same level as parallel-safe and I think we have this information (node-> varlevelsup/ phlevelsup/ agglevelsup) available when we replace correlation vars (SS_replace_correlation_vars). The reason why it is not advisable to mark Params that don't refer to same query level as parallel-safe is that can lead to plans like below: Foo -> Gather -> Bar SubPlan 1 (SubPlan refers to Foo) Now, this problem is tricky because we need to pass all such Params each time we invoke Gather. That also is doable with much more effort, but I am not sure if it is worth because all of the use cases I have seen in TPC-H (Q-2) or TPC-DS (Q-6) always uses SubPlans that refer to same query level. Yet, another useful enhancement in this area could be to consider both parallel and non-parallel paths for subplans. As of now, we consider the cheapest/best path and form subplan from it, but it is quite possible that instead of choosing parallel path (in case it is cheapest) at subplan level, the non-parallel path at subplan level could be beneficial when upper plan can use parallelism. I think this will be a separate project in itself if we want to do this and based on my study of TPC-H and TPC-DS queries, I am confident that this will be helpful in certain queries at higher scale