Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Tomas" == Tomas Vondra writes: Tomas> If we can get rid of the excessive ChainAggregate, that's Tomas> certainly enough for now. I found an algorithm that should provably give the minimal number of sorts (I was afraid that problem would turn out to be NP-hard, but not so - it's solvable in P by reducing it to a problem of maximal matching in bipartite graphs). Updated patch should be forthcoming in a day or two. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On 7.9.2014 18:52, Andrew Gierth wrote: >> "Tomas" == Tomas Vondra writes: > > Tomas> Maybe preventing this completely (i.e. raising an ERROR with > Tomas> "duplicate columns in CUBE/ROLLUP/... clauses") would be > Tomas> appropriate. Does the standard says anything about this? > > The spec does not say anything explicitly about duplicates, so they > are allowed (and duplicate grouping _sets_ can't be removed, only > duplicate columns within a single GROUP BY clause after the grouping > sets have been eliminated by transformation). I have checked my > reading of the spec against oracle 11 and MSSQL using sqlfiddle. > > The way the spec handles grouping sets is to define a sequence of > syntactic transforms that result in a query which is a UNION ALL of > ordinary GROUP BY queries. (We haven't tried to implement the > additional optional feature of GROUP BY DISTINCT.) Since it's UNION > ALL, any duplicates must be preserved, so a query with GROUPING SETS > ((a),(a)) reduces to: > > SELECT ... GROUP BY a UNION ALL SELECT ... GROUP BY a; > > and therefore has duplicates of all its result rows. > > I'm quite prepared to concede that I may have read the spec wrong > (wouldn't be the first time), but in this case I require any such > claim to be backed up by an example from some other db showing an > actual difference in behavior. I think you read the spec right. Apparently duplicate grouping sets are allowed, and it's supposed to output that grouping set twice. The section on ROLLUP/CUBE do not mention duplicates at all, it only explains how to generate all the possible grouping sets, so if you have duplicate columns there, you'll get duplicate sets (which is allowed). If we can get rid of the excessive ChainAggregate, that's certainly enough for now. Optimizing it could be simple, though - you don't need to keep the duplicate groups, you only need to keep a counter "how many times to output this group". But the more I think about it, the more I think we can ignore that. There are far more important pieces to implement, and if you write bad SQL there's no help anyway. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Tomas" == Tomas Vondra writes: >> As for computing it all twice, there's currently no attempt to >> optimize multiple identical grouping sets into multiple >> projections of a single grouping set result. CUBE(a,b,c,a) has >> twice as many grouping sets as CUBE(a,b,c) does, even though all >> the extra ones are duplicates. Tomas> Shouldn't this be solved by eliminating the excessive Tomas> ChainAggregate? Although it probably changes GROUPING(...), Tomas> so it's not just about removing the duplicate column(s) from Tomas> the CUBE. Eliminating the excess ChainAggregate would not change the number of grouping sets, only where they are computed. Tomas> Maybe preventing this completely (i.e. raising an ERROR with Tomas> "duplicate columns in CUBE/ROLLUP/... clauses") would be Tomas> appropriate. Does the standard says anything about this? The spec does not say anything explicitly about duplicates, so they are allowed (and duplicate grouping _sets_ can't be removed, only duplicate columns within a single GROUP BY clause after the grouping sets have been eliminated by transformation). I have checked my reading of the spec against oracle 11 and MSSQL using sqlfiddle. The way the spec handles grouping sets is to define a sequence of syntactic transforms that result in a query which is a UNION ALL of ordinary GROUP BY queries. (We haven't tried to implement the additional optional feature of GROUP BY DISTINCT.) Since it's UNION ALL, any duplicates must be preserved, so a query with GROUPING SETS ((a),(a)) reduces to: SELECT ... GROUP BY a UNION ALL SELECT ... GROUP BY a; and therefore has duplicates of all its result rows. I'm quite prepared to concede that I may have read the spec wrong (wouldn't be the first time), but in this case I require any such claim to be backed up by an example from some other db showing an actual difference in behavior. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On 7.9.2014 15:11, Andrew Gierth wrote: >> "Tomas" == Tomas Vondra writes: > > >> It's not one sort per grouping set, it's the minimal number of > >> sorts needed to express the result as a union of ROLLUP > >> clauses. The planner code will (I believe) always find the > >> smallest number of sorts needed. > > Tomas> You're probably right. Although when doing GROUP BY CUBE > Tomas> (a,b,c,a) I get one more ChainAggregate than with > Tomas> CUBE(a,b,c). and we seem to compute all the aggregates > Tomas> twice. Not sure if we need to address this though, because > Tomas> it's mostly user's fault. > > Hm. Yeah, you're right that the number of sorts is not optimal > there. We can look into that. I don't think it's very critical, though. I was worried about it because of the sorts, but if that gets tackled in patches following this commitfest it seems OK. > As for computing it all twice, there's currently no attempt to > optimize multiple identical grouping sets into multiple projections > of a single grouping set result. CUBE(a,b,c,a) has twice as many > grouping sets as CUBE(a,b,c) does, even though all the extra ones are > duplicates. Shouldn't this be solved by eliminating the excessive ChainAggregate? Although it probably changes GROUPING(...), so it's not just about removing the duplicate column(s) from the CUBE. Maybe preventing this completely (i.e. raising an ERROR with "duplicate columns in CUBE/ROLLUP/... clauses") would be appropriate. Does the standard says anything about this? But arguably this is a minor issue, happening only when the user uses the same column/expression twice. Hopefully the users don't do that too often. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Tomas" == Tomas Vondra writes: >> It's not one sort per grouping set, it's the minimal number of >> sorts needed to express the result as a union of ROLLUP >> clauses. The planner code will (I believe) always find the >> smallest number of sorts needed. Tomas> You're probably right. Although when doing GROUP BY CUBE Tomas> (a,b,c,a) I get one more ChainAggregate than with Tomas> CUBE(a,b,c). and we seem to compute all the aggregates Tomas> twice. Not sure if we need to address this though, because Tomas> it's mostly user's fault. Hm. Yeah, you're right that the number of sorts is not optimal there. We can look into that. As for computing it all twice, there's currently no attempt to optimize multiple identical grouping sets into multiple projections of a single grouping set result. CUBE(a,b,c,a) has twice as many grouping sets as CUBE(a,b,c) does, even though all the extra ones are duplicates. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On 6.9.2014 23:34, Andrew Gierth wrote: >> "Tomas" == Tomas Vondra writes: > > Tomas> I have significant doubts about the whole design, > Tomas> though. Especially the decision not to use HashAggregate, > > There is no "decision not to use HashAggregate". There is simply no > support for HashAggregate yet. > > Having it be able to work with GroupAggregate is essential, because > there are always cases where HashAggregate is simply not permitted > (e.g. when using distinct or sorted aggs; or unhashable types; or with > the current code, when the estimated memory usage exceeds work_mem). > HashAggregate may be a performance improvement, but it's something > that can be added afterwards rather than an essential part of the > feature. Ah, OK. I got confused by the "final patch" subject, and so the possibility of additional optimization somehow didn't occur to me. > Tomas> Now, the chaining only makes this worse, because it > Tomas> effectively forces a separate sort of the whole table for each > Tomas> grouping set. > > It's not one sort per grouping set, it's the minimal number of sorts > needed to express the result as a union of ROLLUP clauses. The planner > code will (I believe) always find the smallest number of sorts needed. You're probably right. Although when doing GROUP BY CUBE (a,b,c,a) I get one more ChainAggregate than with CUBE(a,b,c). and we seem to compute all the aggregates twice. Not sure if we need to address this though, because it's mostly user's fault. > Each aggregate node can process any number of grouping sets as long as > they represent a single rollup list (and therefore share a single sort > order). > > Yes, this is slower than using one hashagg. But it solves the general > problem in a way that does not interfere with future optimization. > > (HashAggregate can be added to the current implementation by first > adding executor support for hashagg with multiple grouping sets, then > in the planner, extracting as many hashable grouping sets as possible > from the list before looking for rollup lists. The chained aggregate > code can work just fine with a HashAggregate as the chain head. > > We have not actually tackled this, since I'm not going to waste any > time adding optimizations before the basic idea is accepted.) OK, understood. > > Tomas> What I envisioned when considering hacking on this a few > Tomas> months back, was extending the aggregate API with "merge > Tomas> state" function, > > That's not really on the cards for arbitrary non-trivial aggregate > functions. > > Yes, it can be done for simple ones, and if you want to use that as a > basis for adding optimizations that's fine. But a solution that ONLY > works in simple cases isn't sufficient, IMO. I believe it can be done for most aggregates, assuming you have access to the internal state somehow (not just the). Adding it for in-core aggregates would not be difficult, in most cases. But you're right we don't have this now at all. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Tomas" == Tomas Vondra writes: Tomas> I have significant doubts about the whole design, Tomas> though. Especially the decision not to use HashAggregate, There is no "decision not to use HashAggregate". There is simply no support for HashAggregate yet. Having it be able to work with GroupAggregate is essential, because there are always cases where HashAggregate is simply not permitted (e.g. when using distinct or sorted aggs; or unhashable types; or with the current code, when the estimated memory usage exceeds work_mem). HashAggregate may be a performance improvement, but it's something that can be added afterwards rather than an essential part of the feature. Tomas> Now, the chaining only makes this worse, because it Tomas> effectively forces a separate sort of the whole table for each Tomas> grouping set. It's not one sort per grouping set, it's the minimal number of sorts needed to express the result as a union of ROLLUP clauses. The planner code will (I believe) always find the smallest number of sorts needed. Each aggregate node can process any number of grouping sets as long as they represent a single rollup list (and therefore share a single sort order). Yes, this is slower than using one hashagg. But it solves the general problem in a way that does not interfere with future optimization. (HashAggregate can be added to the current implementation by first adding executor support for hashagg with multiple grouping sets, then in the planner, extracting as many hashable grouping sets as possible from the list before looking for rollup lists. The chained aggregate code can work just fine with a HashAggregate as the chain head. We have not actually tackled this, since I'm not going to waste any time adding optimizations before the basic idea is accepted.) Tomas> What I envisioned when considering hacking on this a few Tomas> months back, was extending the aggregate API with "merge Tomas> state" function, That's not really on the cards for arbitrary non-trivial aggregate functions. Yes, it can be done for simple ones, and if you want to use that as a basis for adding optimizations that's fine. But a solution that ONLY works in simple cases isn't sufficient, IMO. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On 31.8.2014 22:52, Andrew Gierth wrote: > Recut patches: > > gsp1.patch - phase 1 code patch (full syntax, limited functionality) > gsp2.patch - phase 2 code patch (adds full functionality using the > new chained aggregate mechanism) > gsp-doc.patch - docs > gsp-contrib.patch - quote "cube" in contrib/cube and contrib/earthdistance, > intended primarily for testing pending a decision on > renaming contrib/cube or unreserving keywords > gsp-u.patch- proposed method to unreserve CUBE and ROLLUP > > (the contrib patch is not necessary if the -u patch is used; the > contrib/pg_stat_statements fixes are in the phase1 patch) Hi, I looked at the patch today. The good news is it seems to apply cleanly on HEAD (with some small offsets, but no conflicts). The code generally seems OK to me, although the patch is quite massive. I've also did a considerable amount of testing and I've been unable to cause failures. I have significant doubts about the whole design, though. Especially the decision not to use HashAggregate, and the whole chaining idea. I haven't noticed any discussion about this (at least in this thread), and the chaining idea was not mentioned until 21/8, so I'd appreciate some reasoning behind this choice. I assume the "no HashAggregate" decision was done because of fear of underestimates, and the related OOM issues. I don't see how this is different from the general HashAggregate, though. Or is there another reason for this? Now, the chaining only makes this worse, because it effectively forces a separate sort of the whole table for each grouping set. We're doing a lot of analytics on large tables, where large means tens of GBs and hundreds of millions of rows. What we do now at the moment is basically the usual ROLAP approach - create a cube with aggregated data, which is usually much smaller than the source table, and then compute the rollups for the interesting slices in a second step. I was hoping that maybe we could eventually replace this with the GROUP BY CUBE functionality provided by this patch, but these design decisions make it pretty much impossible. I believe most other users processing non-trivial amounts of data (pretty much everyone with just a few million rows) will be in similar position :-( What I envisioned when considering hacking on this a few months back, was extending the aggregate API with "merge state" function, doing the aggregation just like today and merging the groups (for each cell) at the end. Yeah, we don't have this infrastructure, but maybe it'd be a better way than the current chaining approach. And it was repeatedly mentioned as necessary for parallel aggregation (and even mentioned in the memory-bounded hashagg batching discussion). I'm ready to spend some time on this, if it makes the grouping sets useful for us. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Sunday, August 31, 2014, Andres Freund wrote: > On 2014-08-31 21:09:59 +0530, Atri Sharma wrote: > > On Sun, Aug 31, 2014 at 9:07 PM, Erik Rijkers > wrote: > > > I have found that the "unrecognized node type" error is caused by: > > It's a warning, not an error, right? > > > > shared_preload_libraries = pg_stat_statements > > > > > > in postgresql.conf (as my default compile script was doing). > > > > > > If I disable that line the error goes away. > > > > > > > > I think thats more of a library linking problem rather than a problem > with > > the patch. I couldnt reproduce it,though. > > I think it's vastly more likely that the patch simply didn't add the new > expression types to pg_stat_statements.c:JumbleExpr(). > > > Must have run the above diagnosis in a wrong manner then, I will check.Thanks for the heads up! Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On 2014-08-31 21:09:59 +0530, Atri Sharma wrote: > On Sun, Aug 31, 2014 at 9:07 PM, Erik Rijkers wrote: > > I have found that the "unrecognized node type" error is caused by: It's a warning, not an error, right? > > shared_preload_libraries = pg_stat_statements > > > > in postgresql.conf (as my default compile script was doing). > > > > If I disable that line the error goes away. > > > > > I think thats more of a library linking problem rather than a problem with > the patch. I couldnt reproduce it,though. I think it's vastly more likely that the patch simply didn't add the new expression types to pg_stat_statements.c:JumbleExpr(). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Sun, Aug 31, 2014 at 9:07 PM, Erik Rijkers wrote: > On Tue, August 26, 2014 14:24, Andrew Gierth wrote: > >> "Erik" == Erik Rijkers writes: > > > > >> They apply cleanly for me at 2bde297 whether with git apply or > > >> patch, except for the contrib one (which you don't need unless you > > >> want to run the contrib regression tests without applying the > > >> gsp-u patch). > > > > Erik> Ah, I had not realised that. Excluding that contrib-patch and > > Erik> only applying these three: > > > > Erik> gsp1.patch > > Erik> gsp2.patch > > Erik> gsp-doc.patch > > > > Erik> does indeed work (applies, compiles). > > > > I put up a rebased contrib patch anyway (linked off the CF). > > > > Did the "unrecognized node type" error go away, or do we still need to > > look into that? > > > > I have found that the "unrecognized node type" error is caused by: > > shared_preload_libraries = pg_stat_statements > > in postgresql.conf (as my default compile script was doing). > > If I disable that line the error goes away. > > I think thats more of a library linking problem rather than a problem with the patch. I couldnt reproduce it,though. Regards, Atri -- Regards, Atri *l'apprenant*
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Tue, August 26, 2014 14:24, Andrew Gierth wrote: >> "Erik" == Erik Rijkers writes: > > >> They apply cleanly for me at 2bde297 whether with git apply or > >> patch, except for the contrib one (which you don't need unless you > >> want to run the contrib regression tests without applying the > >> gsp-u patch). > > Erik> Ah, I had not realised that. Excluding that contrib-patch and > Erik> only applying these three: > > Erik> gsp1.patch > Erik> gsp2.patch > Erik> gsp-doc.patch > > Erik> does indeed work (applies, compiles). > > I put up a rebased contrib patch anyway (linked off the CF). > > Did the "unrecognized node type" error go away, or do we still need to > look into that? > I have found that the "unrecognized node type" error is caused by: shared_preload_libraries = pg_stat_statements in postgresql.conf (as my default compile script was doing). If I disable that line the error goes away. I don't know exactly what that means for the groping sets patches but I thought I'd mention it here. Otherwise I've not run into any problems with GROUPING SETS. Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Tue, August 26, 2014 14:24, Andrew Gierth wrote: >> "Erik" == Erik Rijkers writes: > > >> They apply cleanly for me at 2bde297 whether with git apply or > >> patch, except for the contrib one (which you don't need unless you > >> want to run the contrib regression tests without applying the > >> gsp-u patch). > > Erik> Ah, I had not realised that. Excluding that contrib-patch and > Erik> only applying these three: > > Erik> gsp1.patch > Erik> gsp2.patch > Erik> gsp-doc.patch > > Erik> does indeed work (applies, compiles). > > I put up a rebased contrib patch anyway (linked off the CF). > > Did the "unrecognized node type" error go away, or do we still need to > look into that? > Yes, it did go away; looks fine now: select brand , size , grouping(brand, size) , sum(sales) from items_sold group by rollup(brand, size) ; brand | size | grouping | sum ---+--+--+- Bar | L|0 | 5 Bar | M|0 | 15 Bar | |1 | 20 Foo | L|0 | 10 Foo | M|0 | 20 Foo | |1 | 30 | |3 | 50 (7 rows) I'm a bit unclear why the bottom-row 'grouping' value is 3. Shouldn't that be 2? But I'm still reading the documentation so it's perhaps too early to ask... Thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Erik" == Erik Rijkers writes: >> They apply cleanly for me at 2bde297 whether with git apply or >> patch, except for the contrib one (which you don't need unless you >> want to run the contrib regression tests without applying the >> gsp-u patch). Erik> Ah, I had not realised that. Excluding that contrib-patch and Erik> only applying these three: Erik> gsp1.patch Erik> gsp2.patch Erik> gsp-doc.patch Erik> does indeed work (applies, compiles). I put up a rebased contrib patch anyway (linked off the CF). Did the "unrecognized node type" error go away, or do we still need to look into that? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Tue, August 26, 2014 11:13, Andrew Gierth wrote: >> "Andrew" == Andrew Gierth writes: > >> "Erik" == Erik Rijkers writes: > > Erik> The patches did not apply anymore so I applied at 73eba19aebe0. > Erik> There they applied OK, and make && make check was OK. > > Andrew> I'll look and rebase if need be. > > They apply cleanly for me at 2bde297 whether with git apply or patch, > except for the contrib one (which you don't need unless you want to > run the contrib regression tests without applying the gsp-u patch). > Ah, I had not realised that. Excluding that contrib-patch and only applying these three: gsp1.patch gsp2.patch gsp-doc.patch does indeed work (applies, compiles). Thank you, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Andrew" == Andrew Gierth writes: > "Erik" == Erik Rijkers writes: Erik> The patches did not apply anymore so I applied at 73eba19aebe0. Erik> There they applied OK, and make && make check was OK. Andrew> I'll look and rebase if need be. They apply cleanly for me at 2bde297 whether with git apply or patch, except for the contrib one (which you don't need unless you want to run the contrib regression tests without applying the gsp-u patch). -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
> "Erik" == Erik Rijkers writes: Erik> The patches did not apply anymore so I applied at 73eba19aebe0. Erik> There they applied OK, and make && make check was OK. I'll look and rebase if need be. --> WARNING: unrecognized node type: 347 Can't reproduce this - are you sure it's not a mis-build? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Final Patch for GROUPING SETS - unrecognized node type: 347
On Mon, August 25, 2014 07:21, Andrew Gierth wrote: > Here is the new version of our grouping sets patch. This version > supersedes the previous post. The patches did not apply anymore so I applied at 73eba19aebe0. There they applied OK, and make && make check was OK. drop table if exists items_sold; create table items_sold as select * from ( values ('Foo', 'L', 10) , ('Foo', 'M', 20) , ('Bar', 'M', 15) , ('Bar', 'L', 5) ) as f(brand, size, sales) ; select brand, size, grouping(brand, size), sum(sales) from items_sold group by rollup(brand, size); --> WARNING: unrecognized node type: 347 I suppose that's not correct. thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers