Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-23 Thread Andrew Gierth
Noah Misch said:
 Other aggregates based on this syntax might not desire such type unification.

Then there would have to be some way to distinguish that. Maybe those could
have -1 and the standard hypothetical set functions -2, with some flag in
CREATE AGGREGATE to sort it out.

 Having parse analysis do that distorts the character of an any argument.  I
 think the proper place for such processing is the first call to a transition
 function.

Except there isn't one.

 But let's not make the
 parser presume that an aggordnargs=-1 aggregate always wants its any
 arguments handled in the manner of the standard hypothetical set functions.

This has to happen in the parser because these are errors that should be
caught before execution:

rank(foo) within group (order by bar,baz)
rank(integercol) within group (order by textcol)

And collations have to be resolved (pairwise) before sorting can happen:

rank(textval COLLATE C) within group (order by foo)  -- sorts in C
rank(textval COLLATE C) within group (order by bar COLLATE en_US)  -- error

(basically, in rank(x) within group (order by y) where x and y are
collatable, the collation rules apply exactly as though you were doing
(x  y), with all the implicit vs. explicit stuff included)

And this:

rank(1.1) within group (order by intcol)

should become  rank(1.1) within group (order by intcol::numeric)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-22 Thread Andrew Gierth
Ok, since Atri posted our work-so-far and there's not been much
comment, I'll outline here my proposed plan of attack.

Rather than, as in the WIP patch, using the agg finalfn to validate
the split between normal args and ORDER BY args, I propose this:

Firstly, as in the WIP patch,

  func(a) within group (order by b)

is looked up as though it were func(a,b). The result must be marked as
an ordered set function. A new pg_aggregate integer column,
aggordnargs (?), must be equal to the number of normal args (i.e. (a)
in this case). Note that this may be 0; one can see a legitimate use
case for mode() within group (order by anyelement) for example.

The finalfn must be defined to have the same signature, so its args
are processed as if it were func_final(a,b) - but only a dummy arg is
passed for b. (Similar to the case for window functions.) Resolution
of polymorphic parameters and result types therefore works as normal.

For hypothetical set functions we add a special case, aggordnargs=-1,
for which both the aggregate and the finalfn must be defined as
(variadic any) and parse analysis detects this case and unifies the
types of the normal args with those of the ORDER BY args.

I propose this new syntax:

create aggregate func(argtypes...) within group (argtypes...) (
  [ STYPE = ... , ]
  [ SORTOP = ... , ]
  [ INITCOND = ... , ]
  FINALFUNC = func_final
);

Ordered set functions will typically not need STYPE etc., but
hypothetical set functions will be declared as, e.g.:

create aggregate rank(variadic any) within group (variadic any) (
  STYPE = boolean,
  INITCOND = 'f',
  SORTOP = ,
  FINALFUNC = rank_hypothetical_final
);

(I'm open to comment as to whether to simply overload the aggsortop
column in pg_aggregate or add a new one. I'm inclined to do the latter.)

The idea here is that a column of type STYPE will be appended to the
list of columns to be sorted, using SORTOP as sort operator, and all
input rows will have this column initialized to the INITCOND value.
This is to make it easy to implement rank() and friends by simply
inserting the hypothetical row into the sort, with a true value to
flag it, and finding its position in the sort result. (Better that
than comparing the hypothetical row against the whole group.)

(Security caveat: it will be necessary for the finalfn in such cases
to validate that the additional column exists with the right
type. Producing the wrong result is acceptable if the values in it are
unexpected; crashing is not.)

Any comment before we get back to coding?

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-18 Thread Josh Berkus
On 07/17/2013 08:15 PM, Andrew Gierth wrote:
 The spec defines two types of aggregate function classed as ordered set
 function, as follows:
  
 1. An inverse distribution function taking one argument (which must be
a grouped column or otherwise constant within groups) plus a sorted
group with exactly one column:
  
=# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
  
The motivating example for this (and the only ones in the spec) are
percentile_cont and percentile_disc, to return a percentile result
from a continuous or discrete distribution. (Thus
percentile_cont(0.5) within group (order by x) is the spec's version
of a median(x) function.)

One question is how this relates to the existing

   SELECT agg_func(x order by y)

... syntax.  Clearly there's some extra functionality here, but the two
are very similar conceptually.

 2. A hypothetical set function taking N arguments of arbitrary types
(a la VARIADIC any, rather than a fixed list) plus a sorted group
with N columns of matching types:
  
=# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
  
(where typeof(p1)==typeof(q1) and so on, at least up to trivial
conversions)
  
The motivating example here is to be able to do rank(p1,p2,...) to
return the rank that the specified values would have had if they were
added to the group.

Wow, I can't possibly grasp the purpose of this.  Maybe a practical example?

 We've also had an expression of interest in extending this to allow
 percentile_disc(float8[]) and percentile_cont(float8[]) returning
 arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
 array containing the bounds, median and quartiles in one go. This is an
 extension to the spec but it seems sufficiently obviously useful to be
 worth supporting.

To be specific, I asked for this because it's already something I do
using PL/R, although in PL/R it's pretty much limited to floats.

Anyway, for anyone who isn't following why we want this: statitical
summary reports.  For example, I'd love to be able to do a quartile
distribution of query execution times without resorting to R.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-18 Thread Andrew Gierth
Josh Berkus wrote:
 On 07/17/2013 08:15 PM, Andrew Gierth wrote:
  The spec defines two types of aggregate function classed as ordered set
  function, as follows:
   
  1. An inverse distribution function taking one argument (which must be
 a grouped column or otherwise constant within groups) plus a sorted
 group with exactly one column:
   
 =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
   
 The motivating example for this (and the only ones in the spec) are
 percentile_cont and percentile_disc, to return a percentile result
 from a continuous or discrete distribution. (Thus
 percentile_cont(0.5) within group (order by x) is the spec's version
 of a median(x) function.)
 
 One question is how this relates to the existing
 
SELECT agg_func(x order by y)
 
 ... syntax.  Clearly there's some extra functionality here, but the two
 are very similar conceptually.

Well, as you probably know, the spec is a whole pile of random
special-case syntax and any similarities are probably more accidental
than anything else.

A major difference is that in agg(x order by y), the values of y are
not passed to the aggregate function - they serve no purpose other
than controlling the order of the x values. Whereas in WITHIN GROUP,
the values in the ORDER BY ... clause are in some sense the primary
input to the aggregate, and the p argument is secondary and can't
vary between rows of the group.

Our implementation does heavily reuse the existing executor mechanics
for ORDER BY in aggregates, and it also reuses a fair chunk of the
parser code for it, but there are significant differences.

[of hypothetical set functions]
 Wow, I can't possibly grasp the purpose of this.  Maybe a practical
 example?

=# select rank(123) within group (order by x)
 from (values (10),(50),(100),(200),(500)) v(x);

would return 1 row containing the value 4, because if you added the
value 123 to the grouped values, it would have been ranked 4th.

Any time you want to calculate what the rank, dense_rank or cume_dist
would be of a specific row within a group without actually adding the
row to the group, this is how it's done.

I don't have any practical examples to hand, but this beast seems to
be implemented in at least Oracle and MSSQL so I guess it has uses.

[on supporting arrays of percentiles]
 To be specific, I asked for this because it's already something I do
 using PL/R, although in PL/R it's pretty much limited to floats.

percentile_cont is limited to floats and intervals in the spec; to be
precise, it's limited to taking args of either interval or any numeric
type, and returns interval for interval args, and approximate numeric
with implementation-defined precision, i.e. some form of float, for
numeric-type args. The definition requires interpolation between values,
so it's not clear that there's any point in trying to allow other types.

percentile_disc is also limited to floats and intervals in the spec, but
I see absolutely no reason whatsoever for this, since the definition
given is valid for any type with ordering operators; there is no reason
not to make it fully polymorphic. (The requirement for ordering will be
enforced in parse-analysis anyway, by the ORDER BY transformations, and
the function simply returns one of the input values unaltered.)

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-18 Thread Andrew Gierth
Josh Berkus wrote:
 Hah, I didn't realize that our ordered aggregate syntax even *was* spec.

The spec defines agg(x order by y) only for array_agg and xmlagg; the
generalization to arbitrary other aggregates is our extension. (But
kind of obvious really.)

  Our implementation does heavily reuse the existing executor mechanics
  for ORDER BY in aggregates, and it also reuses a fair chunk of the
  parser code for it, but there are significant differences.

 Well, seems like it would work the same as

  agg_func(constx,coly,colz ORDER BY coly, colz)

 ... which means you could reuse a LOT of the internal plumbing.  Or am I
 missing something?

You missed the part above which said ...does heavily reuse... :-)

i.e. we are in fact reusing a lot of the internal plumbing.

 Also, what would a CREATE AGGREGATE and state function definition for
 custom WITHIN GROUP aggregates look like?

Now this is exactly the part we haven't nailed down yet and want ideas
for.

The problem is, given that the parser is looking at:

  foo(p1,p2,...) within group (order by q1,q2,...)

how do we best represent the possible matching functions in pg_proc
and pg_aggregate? Our partial solution so far does not allow
polymorphism to work properly, so we need a better way; I'm hoping for
some independent suggestions before I post my own ideas.

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-17 Thread Andrew Gierth
The spec defines two types of aggregate function classed as ordered set
function, as follows:
 
1. An inverse distribution function taking one argument (which must be
   a grouped column or otherwise constant within groups) plus a sorted
   group with exactly one column:
 
   =# SELECT (func(p) WITHIN GROUP (ORDER BY q)) from ...
 
   The motivating example for this (and the only ones in the spec) are
   percentile_cont and percentile_disc, to return a percentile result
   from a continuous or discrete distribution. (Thus
   percentile_cont(0.5) within group (order by x) is the spec's version
   of a median(x) function.)
 
2. A hypothetical set function taking N arguments of arbitrary types
   (a la VARIADIC any, rather than a fixed list) plus a sorted group
   with N columns of matching types:
 
   =# SELECT (func(p1,p2,...) WITHIN GROUP (ORDER BY q1,q2,...)) from ...
 
   (where typeof(p1)==typeof(q1) and so on, at least up to trivial
   conversions)
 
   The motivating example here is to be able to do rank(p1,p2,...) to
   return the rank that the specified values would have had if they were
   added to the group.
 
As usual, we do not want to constrain ourselves to supporting only the
specific cases in the spec, but would prefer a general solution.
 
We (meaning myself and Atri) have an implementation that basically
works, though it is not yet complete, but before taking it any further
we need to resolve the design question of how to represent these two
types of function in the system catalogs. The fact that there are in
effect two parts to the parameter list, which are either independent
(for inverse distribution funcs) or closely related (for hypothetical
set functions), doesn't seem to point to an obvious way to represent
this in pg_proc/pg_aggregate.
 
I'm not yet satisfied with the method used in our implementation, so
we're throwing this open for suggestions. We will post the
work-in-progress patch along with a description of its current
implementation shortly.
 
One of the major complications is that we ideally want to be able to do
polymorphism based on the type of the sorted group, specifically in
order to be able to do
 
percentile_disc(float8) within group (order by anyelement)
 
returning anyelement. (i.e. we should be able to get a discrete
percentile from any type that is orderable.) The question here is how to
resolve the return type both of the aggregate itself and of the finalfn.
 
We've also had an expression of interest in extending this to allow
percentile_disc(float8[]) and percentile_cont(float8[]) returning
arrays; e.g. percentile_cont(array[0, 0.25, 0.5, 0.75, 1]) to return an
array containing the bounds, median and quartiles in one go. This is an
extension to the spec but it seems sufficiently obviously useful to be
worth supporting.

Comments?

-- 
Andrew (irc:RhodiumToad)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers