Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-18 Thread Tomas Vondra

Hi,

On 6/7/17 5:52 AM, Regina Obe wrote:

On 6/6/17 13:52, Regina Obe wrote:

It seems CREATE  AGGREGATE was expanded in 9.6 to support
parallelization of aggregate functions using transitions, with the
addition of serialfunc and deserialfunc to the aggregate definitions.

https://www.postgresql.org/docs/10/static/sql-createaggregate.html

I was looking at the PostgreSQL 10 source code for some example usages
of this and was hoping that array_agg and string_agg would support the feature.



I'm not sure how you would parallelize these, since in most uses
you want to have a deterministic output order.



--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Good point.  If that's the reason it wasn't done, that's good just wasn't sure.

But if you didn't have an ORDER BY in your aggregate usage, and you
did have those transition functions, it shouldn't be any different from
any other use case right?
I imagine you are right that most folks who use array_agg and
string_agg usually combine it with array_agg(... ORDER BY ..)



I think that TL had in mind is something like

SELECT array_agg(x) FROM (
   SELECT x FROM bar ORDER BY y
) foo;

i.e. a subquery producing the data in predictable order.

>

My main reason for asking is that most of the PostGIS geometry and
raster aggregate functions use transitions and were patterned after
array agg.




In the case of PostGIS the sorting is done internally and really
only to expedite take advantage of things like cascaded union
algorithms.
That is always done though (so even if each worker does it on just it's
batch that's still better than having only one worker).
So I think it's still very beneficial to break into separate jobs
since in the end the gather, will have say 2 biggish geometries or 2
biggish rasters to union if you have 2 workers which is still better
than having a million smallish geometries/rasters to union
I'm not sure I got your point correctly, but if you can (for example) 
sort the per-worker results as part of the "serialize" function, and 
benefit from that while combining that in the gather, then sure, that 
should be a huge win.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-08 Thread Ashutosh Bapat
On Wed, Jun 7, 2017 at 10:55 PM, Robert Haas  wrote:
> On Tue, Jun 6, 2017 at 3:23 PM, David Fetter  wrote:
>> I'd bet on lack of tuits.
>
> I expect that was part of it.  Another thing to consider is that, for
> numeric aggregates, the transition values don't generally get larger
> as you aggregate, but for something like string_agg(), they will.
> It's not clear how much work we'll really save by parallelizing that
> sort of thing.  Maybe it will be great?

+1, I was thinking about the same. There might be some cases when the
output of array_agg/string_agg is not a lot wider but the underlying
scans are large e.g. having clause containing another aggregate and
very small group sizes. I am not sure how frequent are such usecases.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-07 Thread Robert Haas
On Tue, Jun 6, 2017 at 3:23 PM, David Fetter  wrote:
> I'd bet on lack of tuits.

I expect that was part of it.  Another thing to consider is that, for
numeric aggregates, the transition values don't generally get larger
as you aggregate, but for something like string_agg(), they will.
It's not clear how much work we'll really save by parallelizing that
sort of thing.  Maybe it will be great?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Andres Freund
On 2017-06-07 00:03:15 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
> >> I'm not sure how you would parallelize these, since in most uses you
> >> want to have a deterministic output order.
> 
> > Unless you specify ORDER BY you don't really have that anyway, consider
> > hash-aggregation.  If you want deterministic order, you really need an
> > ORDER BY inside the aggregate.
> 
> Hash aggregation does not destroy the property that array_agg/string_agg
> will produce results whose components appear in the order that the
> subquery emitted them in.  It only causes the various aggregate results
> in a GROUP BY query to themselves appear in random order.

Whoa, I obviously should stop working tonight.  I think it's still a
hugely useful to parallelize such aggregates - it might be worthwhile to
have two versions of array_agg, one with a serial/combinefunc and one
without...

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Tom Lane
Andres Freund  writes:
> On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
>> I'm not sure how you would parallelize these, since in most uses you
>> want to have a deterministic output order.

> Unless you specify ORDER BY you don't really have that anyway, consider
> hash-aggregation.  If you want deterministic order, you really need an
> ORDER BY inside the aggregate.

Hash aggregation does not destroy the property that array_agg/string_agg
will produce results whose components appear in the order that the
subquery emitted them in.  It only causes the various aggregate results
in a GROUP BY query to themselves appear in random order.

Now you could argue that the subquery might've gotten parallelized and
emitted its outputs in some random order, so doing the same thing one
level further up changes nothing.  But you can't defend this on this
basis that it was historically unpredictable, because it wasn't.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Regina Obe
> On 6/6/17 13:52, Regina Obe wrote:
>> It seems CREATE  AGGREGATE was expanded in 9.6 to support 
>> parallelization of aggregate functions using transitions, with the 
>> addition of serialfunc and deserialfunc to the aggregate definitions.
>> 
>> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
>> 
>> I was looking at the PostgreSQL 10 source code for some example usages 
>> of this and was hoping that array_agg and string_agg would support the 
>> feature.

> I'm not sure how you would parallelize these, since in most uses you want to 
> have a deterministic output order.

>-- 
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Good point.  If that's the reason it wasn't done, that's good just wasn't sure.

But if you didn't have an ORDER BY in your aggregate usage, and you did have 
those transition functions, it shouldn't be any different from any other use 
case right?
I imagine you are right that most folks who use array_agg and string_agg 
usually combine it with array_agg(... ORDER BY ..)

My main reason for asking is that most of the PostGIS geometry and raster 
aggregate functions use transitions and were patterned after array agg.

In the case of PostGIS the sorting is done internally and really only to 
expedite take advantage of things like cascaded union algorithms.  That is 
always done though (so even if each worker does it on just it's batch that's 
still better than having only one worker).
So I think it's still very beneficial to break into separate jobs since in the 
end the gather, will have  say 2 biggish geometries or 2 biggish rasters to 
union if you have 2 workers which is still better than having a million 
smallish geometries/rasters to union

Split Union 

Worker 1:

Parallel agg (internal sort geoms by box)  - Union

Worker 2: 
Parallel Agg (internal sort geoms )  - Union


Gather  Union(union, union) internal sort.


Thanks,
Regina






-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Andres Freund
On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
> On 6/6/17 13:52, Regina Obe wrote:
> > It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> > aggregate functions using transitions, with the addition of serialfunc and
> > deserialfunc to the aggregate definitions.
> > 
> > https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> > 
> > I was looking at the PostgreSQL 10 source code for some example usages of
> > this and was hoping that array_agg and string_agg would support the feature.
> 
> I'm not sure how you would parallelize these, since in most uses you
> want to have a deterministic output order.

Unless you specify ORDER BY you don't really have that anyway, consider
hash-aggregation.  If you want deterministic order, you really need an
ORDER BY inside the aggregate.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Peter Eisentraut
On 6/6/17 13:52, Regina Obe wrote:
> It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> aggregate functions using transitions, with the addition of serialfunc and
> deserialfunc to the aggregate definitions.
> 
> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> 
> I was looking at the PostgreSQL 10 source code for some example usages of
> this and was hoping that array_agg and string_agg would support the feature.

I'm not sure how you would parallelize these, since in most uses you
want to have a deterministic output order.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Andres Freund
On 2017-06-06 12:23:49 -0700, David Fetter wrote:
> On Tue, Jun 06, 2017 at 01:52:45PM -0400, Regina Obe wrote:
> > It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> > aggregate functions using transitions, with the addition of serialfunc and
> > deserialfunc to the aggregate definitions.
> > 
> > https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> > 
> > I was looking at the PostgreSQL 10 source code for some example usages of
> > this and was hoping that array_agg and string_agg would support the feature.
> > At a cursory glance, it seems they do not use this.
> > Examples I see that do support it are the average and standard deviation
> > functions.
> > 
> > Is there a reason for this or it just wasn't gotten to?

I'd suggest trying to write a parallel version of them ;).  Shouldn't be
too hard.


> I'd bet on lack of tuits.  Anything with text has to deal with
> collation issues, etc., that may make this trickier than it first
> appears.

I don't see how collations makes things more complicated here.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread David Fetter
On Tue, Jun 06, 2017 at 01:52:45PM -0400, Regina Obe wrote:
> It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> aggregate functions using transitions, with the addition of serialfunc and
> deserialfunc to the aggregate definitions.
> 
> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> 
> I was looking at the PostgreSQL 10 source code for some example usages of
> this and was hoping that array_agg and string_agg would support the feature.
> At a cursory glance, it seems they do not use this.
> Examples I see that do support it are the average and standard deviation
> functions.
> 
> Is there a reason for this or it just wasn't gotten to?

I'd bet on lack of tuits.  Anything with text has to deal with
collation issues, etc., that may make this trickier than it first
appears.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

2017-06-06 Thread Regina Obe
It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
aggregate functions using transitions, with the addition of serialfunc and
deserialfunc to the aggregate definitions.

https://www.postgresql.org/docs/10/static/sql-createaggregate.html

I was looking at the PostgreSQL 10 source code for some example usages of
this and was hoping that array_agg and string_agg would support the feature.
At a cursory glance, it seems they do not use this.
Examples I see that do support it are the average and standard deviation
functions.

Is there a reason for this or it just wasn't gotten to?


Thanks,
Regina





-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers