Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Noah Misch said: Other aggregates based on this syntax might not desire such type unification. Then there would have to be some way to distinguish that. Maybe those could have -1 and the standard hypothetical set functions -2, with some flag in CREATE AGGREGATE to sort it out. Having parse analysis do that distorts the character of an any argument. I think the proper place for such processing is the first call to a transition function. Except there isn't one. But let's not make the parser presume that an aggordnargs=-1 aggregate always wants its any arguments handled in the manner of the standard hypothetical set functions. This has to happen in the parser because these are errors that should be caught before execution: rank(foo) within group (order by bar,baz) rank(integercol) within group (order by textcol) And collations have to be resolved (pairwise) before sorting can happen: rank(textval COLLATE C) within group (order by foo) -- sorts in C rank(textval COLLATE C) within group (order by bar COLLATE en_US) -- error (basically, in rank(x) within group (order by y) where x and y are collatable, the collation rules apply exactly as though you were doing (x y), with all the implicit vs. explicit stuff included) And this: rank(1.1) within group (order by intcol) should become rank(1.1) within group (order by intcol::numeric) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Ok, since Atri posted our work-so-far and there's not been much comment, I'll outline here my proposed plan of attack. Rather than, as in the WIP patch, using the agg finalfn to validate the split between normal args and ORDER BY args, I propose this: Firstly, as in the WIP patch, func(a) within group (order by b) is looked up as though it were func(a,b). The result must be marked as an ordered set function. A new pg_aggregate integer column, aggordnargs (?), must be equal to the number of normal args (i.e. (a) in this case). Note that this may be 0; one can see a legitimate use case for mode() within group (order by anyelement) for example. The finalfn must be defined to have the same signature, so its args are processed as if it were func_final(a,b) - but only a dummy arg is passed for b. (Similar to the case for window functions.) Resolution of polymorphic parameters and result types therefore works as normal. For hypothetical set functions we add a special case, aggordnargs=-1, for which both the aggregate and the finalfn must be defined as (variadic any) and parse analysis detects this case and unifies the types of the normal args with those of the ORDER BY args. I propose this new syntax: create aggregate func(argtypes...) within group (argtypes...) ( [ STYPE = ... , ] [ SORTOP = ... , ] [ INITCOND = ... , ] FINALFUNC = func_final ); Ordered set functions will typically not need STYPE etc., but hypothetical set functions will be declared as, e.g.: create aggregate rank(variadic any) within group (variadic any) ( STYPE = boolean, INITCOND = 'f', SORTOP = , FINALFUNC = rank_hypothetical_final ); (I'm open to comment as to whether to simply overload the aggsortop column in pg_aggregate or add a new one. I'm inclined to do the latter.) The idea here is that a column of type STYPE will be appended to the list of columns to be sorted, using SORTOP as sort operator, and all input rows will have this column initialized to the INITCOND value. This is to make it easy to implement rank() and friends by simply inserting the hypothetical row into the sort, with a true value to flag it, and finding its position in the sort result. (Better that than comparing the hypothetical row against the whole group.) (Security caveat: it will be necessary for the finalfn in such cases to validate that the additional column exists with the right type. Producing the wrong result is acceptable if the values in it are unexpected; crashing is not.) Any comment before we get back to coding? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
On 07/17/2013 08:15 PM, Andrew Gierth wrote: The spec defines two types of aggregate function classed as ordered set function, as follows: 1. An inverse distribution function taking one argument (which must be a grouped column or otherwise constant within groups) plus a sorted group with exactly one column: =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ... The motivating example for this (and the only ones in the spec) are percentile_cont and percentile_disc, to return a percentile result from a continuous or discrete distribution. (Thus percentile_cont(0.5) within group (order by x) is the spec's version of a median(x) function.) One question is how this relates to the existing SELECT agg_func(x order by y) ... syntax. Clearly there's some extra functionality here, but the two are very similar conceptually. 2. A hypothetical set function taking N arguments of arbitrary types (a la VARIADIC any, rather than a fixed list) plus a sorted group with N columns of matching types: =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ... (where typeof(p1)==typeof(q1) and so on, at least up to trivial conversions) The motivating example here is to be able to do rank(p1,p2,...) to return the rank that the specified values would have had if they were added to the group. Wow, I can't possibly grasp the purpose of this. Maybe a practical example? We've also had an expression of interest in extending this to allow percentile_disc(float8[]) and percentile_cont(float8[]) returning arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an array containing the bounds, median and quartiles in one go. This is an extension to the spec but it seems sufficiently obviously useful to be worth supporting. To be specific, I asked for this because it's already something I do using PL/R, although in PL/R it's pretty much limited to floats. Anyway, for anyone who isn't following why we want this: statitical summary reports. For example, I'd love to be able to do a quartile distribution of query execution times without resorting to R. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Josh Berkus wrote: On 07/17/2013 08:15 PM, Andrew Gierth wrote: The spec defines two types of aggregate function classed as ordered set function, as follows: 1. An inverse distribution function taking one argument (which must be a grouped column or otherwise constant within groups) plus a sorted group with exactly one column: =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ... The motivating example for this (and the only ones in the spec) are percentile_cont and percentile_disc, to return a percentile result from a continuous or discrete distribution. (Thus percentile_cont(0.5) within group (order by x) is the spec's version of a median(x) function.) One question is how this relates to the existing SELECT agg_func(x order by y) ... syntax. Clearly there's some extra functionality here, but the two are very similar conceptually. Well, as you probably know, the spec is a whole pile of random special-case syntax and any similarities are probably more accidental than anything else. A major difference is that in agg(x order by y), the values of y are not passed to the aggregate function - they serve no purpose other than controlling the order of the x values. Whereas in WITHIN GROUP, the values in the ORDER BY ... clause are in some sense the primary input to the aggregate, and the p argument is secondary and can't vary between rows of the group. Our implementation does heavily reuse the existing executor mechanics for ORDER BY in aggregates, and it also reuses a fair chunk of the parser code for it, but there are significant differences. [of hypothetical set functions] Wow, I can't possibly grasp the purpose of this. Maybe a practical example? =# select rank(123) within group (order by x) from (values (10),(50),(100),(200),(500)) v(x); would return 1 row containing the value 4, because if you added the value 123 to the grouped values, it would have been ranked 4th. Any time you want to calculate what the rank, dense_rank or cume_dist would be of a specific row within a group without actually adding the row to the group, this is how it's done. I don't have any practical examples to hand, but this beast seems to be implemented in at least Oracle and MSSQL so I guess it has uses. [on supporting arrays of percentiles] To be specific, I asked for this because it's already something I do using PL/R, although in PL/R it's pretty much limited to floats. percentile_cont is limited to floats and intervals in the spec; to be precise, it's limited to taking args of either interval or any numeric type, and returns interval for interval args, and approximate numeric with implementation-defined precision, i.e. some form of float, for numeric-type args. The definition requires interpolation between values, so it's not clear that there's any point in trying to allow other types. percentile_disc is also limited to floats and intervals in the spec, but I see absolutely no reason whatsoever for this, since the definition given is valid for any type with ordering operators; there is no reason not to make it fully polymorphic. (The requirement for ordering will be enforced in parse-analysis anyway, by the ORDER BY transformations, and the function simply returns one of the input values unaltered.) -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
Josh Berkus wrote: Hah, I didn't realize that our ordered aggregate syntax even *was* spec. The spec defines agg(x order by y) only for array_agg and xmlagg; the generalization to arbitrary other aggregates is our extension. (But kind of obvious really.) Our implementation does heavily reuse the existing executor mechanics for ORDER BY in aggregates, and it also reuses a fair chunk of the parser code for it, but there are significant differences. Well, seems like it would work the same as agg_func(constx,coly,colz ORDER BY coly, colz) ... which means you could reuse a LOT of the internal plumbing. Or am I missing something? You missed the part above which said ...does heavily reuse... :-) i.e. we are in fact reusing a lot of the internal plumbing. Also, what would a CREATE AGGREGATE and state function definition for custom WITHIN GROUP aggregates look like? Now this is exactly the part we haven't nailed down yet and want ideas for. The problem is, given that the parser is looking at: foo(p1,p2,...) within group (order by q1,q2,...) how do we best represent the possible matching functions in pg_proc and pg_aggregate? Our partial solution so far does not allow polymorphism to work properly, so we need a better way; I'm hoping for some independent suggestions before I post my own ideas. -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)
The spec defines two types of aggregate function classed as ordered set function, as follows: 1. An inverse distribution function taking one argument (which must be a grouped column or otherwise constant within groups) plus a sorted group with exactly one column: =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ... The motivating example for this (and the only ones in the spec) are percentile_cont and percentile_disc, to return a percentile result from a continuous or discrete distribution. (Thus percentile_cont(0.5) within group (order by x) is the spec's version of a median(x) function.) 2. A hypothetical set function taking N arguments of arbitrary types (a la VARIADIC any, rather than a fixed list) plus a sorted group with N columns of matching types: =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ... (where typeof(p1)==typeof(q1) and so on, at least up to trivial conversions) The motivating example here is to be able to do rank(p1,p2,...) to return the rank that the specified values would have had if they were added to the group. As usual, we do not want to constrain ourselves to supporting only the specific cases in the spec, but would prefer a general solution. We (meaning myself and Atri) have an implementation that basically works, though it is not yet complete, but before taking it any further we need to resolve the design question of how to represent these two types of function in the system catalogs. The fact that there are in effect two parts to the parameter list, which are either independent (for inverse distribution funcs) or closely related (for hypothetical set functions), doesn't seem to point to an obvious way to represent this in pg_proc/pg_aggregate. I'm not yet satisfied with the method used in our implementation, so we're throwing this open for suggestions. We will post the work-in-progress patch along with a description of its current implementation shortly. One of the major complications is that we ideally want to be able to do polymorphism based on the type of the sorted group, specifically in order to be able to do percentile_disc(float8) within group (order by anyelement) returning anyelement. (i.e. we should be able to get a discrete percentile from any type that is orderable.) The question here is how to resolve the return type both of the aggregate itself and of the finalfn. We've also had an expression of interest in extending this to allow percentile_disc(float8[]) and percentile_cont(float8[]) returning arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an array containing the bounds, median and quartiles in one go. This is an extension to the spec but it seems sufficiently obviously useful to be worth supporting. Comments? -- Andrew (irc:RhodiumToad) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers