Re: Partial aggregates pushdown

2024-08-22 Thread Tomas Vondra


On 8/22/24 22:07, Bruce Momjian wrote:
> On Thu, Aug 22, 2024 at 09:54:02PM +0200, Tomas Vondra wrote:
>> On 8/22/24 20:56, Bruce Momjian wrote:
>>> You make a very good point above.  Would there ever be cases where a
>>> targetlist would have multiple aggregates, and some can be pushed down,
>>> and some have to return all matching rows so the sender can compute the
>>> aggregate?  If so, how would we handle that?  How does parallelism
>>> handle that now?
>>
>> I think the only sane way to handle this is to disable partial pushdown
>> for that query, fetch all rows and do the aggregate locally.
> 
> Okay.
> 
>>> It is more than different versions.  Different CPU architectures can
>>> store data types differently in binary.  It would be kind of interesting
>>> to have a serial/deserial that was in text format.  I guess that is
>>> where my RECORD idea came from that I just emailed to the list.
>>>
>>
>> I believe everything in this thread assumes "binary" in the same sense
>> as the protocol, with the same rules for sending data in text/binary
>> formats, etc. With binary for integers meaning "network order" and that
>> sort of stuff. Which is mostly independent of PG version.
> 
> If the binary is transferable between servers of the same PG version but
> perhaps different CPU architectures, you are saying we are only worried
> about not using the exact same aggregate passing method we use for
> parallelism, except for PG version differences.  Seems we should just
> assume the same version for this optimization and call it done.

IMHO restricting this to the exact same version (it'd really have to be
the same minor version, not just major) would be pretty annoying and I'm
afraid it would seriously restrict how often we'd be able to enable this
optimization.

Consider a big sharded cluster using postgres_fdw to run queries. AFAIK
it's not uncommon to do minor version upgrades node by node, possibly
even with failover to make it as smooth as possible. For the duration of
this upgrade the pushdown would be impossible, because at least one node
has a different minor version. That doesn't seem great.

> Except
> we don't always check for the same PG version and that could lead to
> crashes or security problems?
> 

Yes, we'd need to check this comprehensively. I'm not sure if this might
cause crashes - presumably the input functions should be careful about
this (because otherwise it'd be a problem even without this feature).
Also not sure about security issues.

>>> What I would really like is something similar to pg_proc.proargtypes
>>> (which is data type oidvector) for pg_aggregate where we can supply the
>>> oids of the pg_aggregate.aggtranstype and construct a record on the fly
>>> to return to the FDW caller, rather than having to create specialized
>>> functions for every aggregate that needs to return several different
>>> data types, e.g., AVG(interval), #4 in my previous email.  I realize
>>> this would require the creation of data types int128 and int128 array.
>>>
>>
>> Isn't that really just a definition of a composite type?
> 
> Probably.  I am trying to leverage what we already have.
> 
>> I still don't understand which part of the code constructs the record.
>> If the aggregate has aggstate internal, I don't see how it could be
>> anything else than the aggregate itself.
> 
> Yes, but I was unclear if we needed a special function for every
> construction.
> 

OK


-- 
Tomas Vondra




Re: Partial aggregates pushdown

2024-08-22 Thread Bruce Momjian
On Thu, Aug 22, 2024 at 09:54:02PM +0200, Tomas Vondra wrote:
> On 8/22/24 20:56, Bruce Momjian wrote:
> > You make a very good point above.  Would there ever be cases where a
> > targetlist would have multiple aggregates, and some can be pushed down,
> > and some have to return all matching rows so the sender can compute the
> > aggregate?  If so, how would we handle that?  How does parallelism
> > handle that now?
> 
> I think the only sane way to handle this is to disable partial pushdown
> for that query, fetch all rows and do the aggregate locally.

Okay.

> > It is more than different versions.  Different CPU architectures can
> > store data types differently in binary.  It would be kind of interesting
> > to have a serial/deserial that was in text format.  I guess that is
> > where my RECORD idea came from that I just emailed to the list.
> >
> 
> I believe everything in this thread assumes "binary" in the same sense
> as the protocol, with the same rules for sending data in text/binary
> formats, etc. With binary for integers meaning "network order" and that
> sort of stuff. Which is mostly independent of PG version.

If the binary is transferable between servers of the same PG version but
perhaps different CPU architectures, you are saying we are only worried
about not using the exact same aggregate passing method we use for
parallelism, except for PG version differences.  Seems we should just
assume the same version for this optimization and call it done.  Except
we don't always check for the same PG version and that could lead to
crashes or security problems?

> > What I would really like is something similar to pg_proc.proargtypes
> > (which is data type oidvector) for pg_aggregate where we can supply the
> > oids of the pg_aggregate.aggtranstype and construct a record on the fly
> > to return to the FDW caller, rather than having to create specialized
> > functions for every aggregate that needs to return several different
> > data types, e.g., AVG(interval), #4 in my previous email.  I realize
> > this would require the creation of data types int128 and int128 array.
> > 
> 
> Isn't that really just a definition of a composite type?

Probably.  I am trying to leverage what we already have.

> I still don't understand which part of the code constructs the record.
> If the aggregate has aggstate internal, I don't see how it could be
> anything else than the aggregate itself.

Yes, but I was unclear if we needed a special function for every
construction.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-22 Thread Tomas Vondra
On 8/22/24 20:56, Bruce Momjian wrote:
> On Wed, Aug 21, 2024 at 05:41:02PM +0200, Tomas Vondra wrote:
>> On 8/8/24 13:48, Jelte Fennema-Nio wrote:
>>> SUMMARY OF THREAD
>>>
>>> The design of patch 0001 is agreed upon by everyone on the thread (so
>>> far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
>>> cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
>>> for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
>>> is only supported for aggregates with a non-internal/pseudo type as
>>> the stype.
>>
>> I don't have a strong opinion on this, but I wonder if someone might
>> object that this essentially extends the syntax with something that is not
>> (and never will be) in the SQL standard. I wonder if there's some
>> precedent for encoding such explicit execution instructions into the
>> query itself?
>>
>> That reminds me - the PARTIAL_AGGREGATE label is per aggregate, but I
>> don't think it makes much sense to do this only for some aggregates,
>> right? Do we have a way to make sure the query is "consistent"? I'm not
>> sure if doing this on the source (before pushdown) is enough. Could
>> there be a difference in what the remote instance supports?
>>
>> The only alternative that I can think of (and that I believe was already
>> mentioned in this thread) is to set some GUC that forces the top-most
>> query level to do this (all aggregates at that level). That'd have the
>> benefit of always affecting all aggregates.
> 
> You make a very good point above.  Would there ever be cases where a
> targetlist would have multiple aggregates, and some can be pushed down,
> and some have to return all matching rows so the sender can compute the
> aggregate?  If so, how would we handle that?  How does parallelism
> handle that now?

I think the only sane way to handle this is to disable partial pushdown
for that query, fetch all rows and do the aggregate locally.

>>> The design for patch 0002 is still under debate. This would expand on
>>> the functionality added by adding support for PARTIAL_AGGREGATE for
>>> aggregates with an internal stype. This is done by returning a byte
>>> array containing the bytes that the serialfunc of the aggregate
>>> returns.
>>>
>>> A competing proposal for 0002 is to instead change aggregates to not
>>> use an internal stype anymore, and create dedicated types. The main
>>> downside here is that infunc and outfunc would need to be added for
>>> text serialization, in addition to the binary serialization. An open
>>> question is: Can we change the requirements for CREATE TYPE, so that
>>> types can be created without infunc and outfunc.
>>>
>>
>> I think it's +0.5 for the new dedicated data types from me.
>>
>> I admit I'm too lazy to read the whole thread from scratch, but I
>> believe we did discuss the possibility to reuse the serial/deserial
>> functions we already have, but the reason against that was the missing
>> cross-version stability. Parallel queries always run within a single
>> instance, hence there are no concerns about other versions. But this is
>> meant to support the remote node having a wildly different version.
> 
> It is more than different versions.  Different CPU architectures can
> store data types differently in binary.  It would be kind of interesting
> to have a serial/deserial that was in text format.  I guess that is
> where my RECORD idea came from that I just emailed to the list.
>

I believe everything in this thread assumes "binary" in the same sense
as the protocol, with the same rules for sending data in text/binary
formats, etc. With binary for integers meaning "network order" and that
sort of stuff. Which is mostly independent of PG version.
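[Editor's note: as a minimal sketch of what "binary in the same sense as the protocol" means, here is network-order encoding in Python. This is not PostgreSQL's actual wire code, and the function names are invented for illustration.]

```python
import struct

# Encode a bigint (int8) the way a binary wire format does:
# big-endian ("network order"), 8 bytes, independent of host CPU.
def encode_int8(value: int) -> bytes:
    return struct.pack("!q", value)

def decode_int8(data: bytes) -> int:
    return struct.unpack("!q", data)[0]

# The byte sequence is identical on little- and big-endian hosts,
# which is why this representation is architecture-independent.
payload = encode_int8(1186)
assert decode_int8(payload) == 1186
```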

When I say version dependency, I mean the structure of the aggregate
state itself. That is, in one version the state may store (A,B), while
in the other version it may be (B,C). For example, we could store
different values in the array, or something like that.
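[Editor's note: the "(A,B) vs (B,C)" hazard above can be sketched in Python. The layouts here are hypothetical, not actual PostgreSQL aggregate states; the point is that a layout change between versions can deserialize "successfully" yet be silently wrong.]

```python
import struct

# Hypothetical: version A stores an avg state as (sum, count), while
# version B stores it as (count, sum). Both are two int64s, so the
# bytes deserialize without error on either side -- but the result is
# silently wrong if the layouts are mixed.
def serialize_v_a(total: int, count: int) -> bytes:
    return struct.pack("!qq", total, count)

def deserialize_v_b(data: bytes):
    count, total = struct.unpack("!qq", data)
    return total, count

state = serialize_v_a(100, 4)        # sum=100 over 4 rows, avg 25
total, count = deserialize_v_b(state)
assert (total, count) == (4, 100)    # fields swapped: avg would be 0.04
```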


> What I would really like is something similar to pg_proc.proargtypes
> (which is data type oidvector) for pg_aggregate where we can supply the
> oids of the pg_aggregate.aggtranstype and construct a record on the fly
> to return to the FDW caller, rather than having to create specialized
> functions for every aggregate that needs to return several different
> data types, e.g., AVG(interval), #4 in my previous email.  I realize
> this would require the creation of data types int128 and int128 array.
> 

Isn't that really just a definition of a composite type?

I still don't understand which part of the code constructs the record.
If the aggregate has aggstate internal, I don't see how it could be
anything else than the aggregate itself.

>> I guess we might introduce another pair of serial/deserial functions,
>> with this guarantee. I know we (me and Jelte) discussed that in person
>> at some point, and there were arguments for doing the data types. But I
>> managed to forget the details :-(
>>
>>

Re: Partial aggregates pushdown

2024-08-22 Thread Bruce Momjian
On Thu, Aug 22, 2024 at 08:31:11PM +0200, Tomas Vondra wrote:
> > My question is related to #3 and #4.  For #3, if we are going to be
> > building infrastructure to handle passing int128 for AVG, wouldn't it be
> > wiser to create an int128 type and an int128 array type, and then use
> > method #2 to handle those, rather than creating custom code just to
> > read/write int128 values for FDWs aggregate passing alone.
> > 
> 
> Yep, adding int128 as a data type would extend this to aggregates that
> store state as int128 (or array of int128).

Great, I am not too far off then.

> > For #4, can we use or improve the RECORD data type to handle #4 --- that
> > seems preferable to creating custom FDWs aggregate passing code.
> > 
> > I know the open question was whether we should create custom FDWs
> > aggregate passing functions or custom data types for FDWs aggregate
> > passing, but I am asking if we can improve existing facilities, like
> > int128 or record passing, to reduce the need for some of these.
> > 
> 
> But which code would produce the record? AFAIK it can't happen in some
> generic executor code, because that only sees "internal" for each
> aggregate. The exact structure of the aggstate is private within the
> code of each aggregate - the record would have to be built there, no?
> 
> I imagine we'd add this for each aggregate as a new pair of functions to
> build/parse the record, but that's kinda the serial/deserial way we
> discussed earlier.
> 
> Or are you suggesting we'd actually say:
> 
>   CREATE AGGREGATE xyz(...) (
> STYPE = record,
> ...
>   )

So my idea from the email I just sent is to create a
pg_proc.proargtypes-like column (data type oidvector) for pg_aggregate
which stores the oids of the values we want to return, so AVG(interval)
would have an array of the oids for interval and int8, e.g.:

SELECT oid FROM pg_type WHERE typname = 'interval';
 oid
--
 1186

SELECT oid FROM pg_type WHERE typname = 'int8';
 oid
-
  20

SELECT '1186 20'::oidvector;
 oidvector
---
 1186 20

It seems all four methods could use this, again assuming we create
int128/int16 and whatever other types we need.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-22 Thread Bruce Momjian
On Wed, Aug 21, 2024 at 05:41:02PM +0200, Tomas Vondra wrote:
> On 8/8/24 13:48, Jelte Fennema-Nio wrote:
> > SUMMARY OF THREAD
> > 
> > The design of patch 0001 is agreed upon by everyone on the thread (so
> > far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
> > cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
> > for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
> > is only supported for aggregates with a non-internal/pseudo type as
> > the stype.
> 
> I don't have a strong opinion on this, but I wonder if someone might
> object that this essentially extends the syntax with something that is not
> (and never will be) in the SQL standard. I wonder if there's some
> precedent for encoding such explicit execution instructions into the
> query itself?
> 
> That reminds me - the PARTIAL_AGGREGATE label is per aggregate, but I
> don't think it makes much sense to do this only for some aggregates,
> right? Do we have a way to make sure the query is "consistent"? I'm not
> sure if doing this on the source (before pushdown) is enough. Could
> there be a difference in what the remote instance supports?
> 
> The only alternative that I can think of (and that I believe was already
> mentioned in this thread) is to set some GUC that forces the top-most
> query level to do this (all aggregates at that level). That'd have the
> benefit of always affecting all aggregates.

You make a very good point above.  Would there ever be cases where a
targetlist would have multiple aggregates, and some can be pushed down,
and some have to return all matching rows so the sender can compute the
aggregate?  If so, how would we handle that?  How does parallelism
handle that now?

> > The design for patch 0002 is still under debate. This would expand on
> > the functionality added by adding support for PARTIAL_AGGREGATE for
> > aggregates with an internal stype. This is done by returning a byte
> > array containing the bytes that the serialfunc of the aggregate
> > returns.
> > 
> > A competing proposal for 0002 is to instead change aggregates to not
> > use an internal stype anymore, and create dedicated types. The main
> > downside here is that infunc and outfunc would need to be added for
> > text serialization, in addition to the binary serialization. An open
> > question is: Can we change the requirements for CREATE TYPE, so that
> > types can be created without infunc and outfunc.
> > 
> 
> I think it's +0.5 for the new dedicated data types from me.
> 
> I admit I'm too lazy to read the whole thread from scratch, but I
> believe we did discuss the possibility to reuse the serial/deserial
> functions we already have, but the reason against that was the missing
> cross-version stability. Parallel queries always run within a single
> instance, hence there are no concerns about other versions. But this is
> meant to support the remote node having a wildly different version.

It is more than different versions.  Different CPU architectures can
store data types differently in binary.  It would be kind of interesting
to have a serial/deserial that was in text format.  I guess that is
where my RECORD idea came from that I just emailed to the list.

What I would really like is something similar to pg_proc.proargtypes
(which is data type oidvector) for pg_aggregate where we can supply the
oids of the pg_aggregate.aggtranstype and construct a record on the fly
to return to the FDW caller, rather than having to create specialized
functions for every aggregate that needs to return several different
data types, e.g., AVG(interval), #4 in my previous email.  I realize
this would require the creation of data types int128 and int128 array.

> I guess we might introduce another pair of serial/deserial functions,
> with this guarantee. I know we (me and Jelte) discussed that in person
> at some point, and there were arguments for doing the data types. But I
> managed to forget the details :-(
> 
> > WHAT IS NEEDED?
> > 
> > The things needed for this patch are that docs need to be added, and
> > detailed codereview needs to be done.
> 
> Yeah, I think the docs are must-have for a proper review.

I think the docs are on hold until we decide on a transfer method.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-22 Thread Tomas Vondra
On 8/22/24 19:22, Bruce Momjian wrote:
> On Wed, Aug 21, 2024 at 04:59:12PM +0200, Tomas Vondra wrote:
>> On 8/20/24 20:41, Bruce Momjian wrote:
>>> SELECT (oid, relname) FROM pg_class LIMIT 1;
>>>  row
>>> -
>>>  (2619,pg_statistic)
>>>
>>> SELECT pg_typeof((oid, relname)) FROM pg_class LIMIT 1;
>>>  pg_typeof
>>> ---
>>>  record
>>>
>>> SELECT pg_typeof(oid) FROM pg_class LIMIT 1;
>>>  pg_typeof
>>> ---
>>>  oid
>>> 
>>> SELECT pg_typeof(relname) FROM pg_class LIMIT 1;
>>>  pg_typeof
>>> ---
>>>  name
>>>
>>
>> How would this help with the AVG(bigint) case? We don't have int128 as
>> SQL type, so what would be part of the record? Also, which part of the
>> code would produce the record? If the internal state is "internal", that
>> would probably need to be something aggregate specific, and that's kinda
>> what this patch series is adding, no?
>>
>> Or am I missing some cases where the record would make it work?
> 
> Right now, my goal in this thread is to try to concretely explain what
> is being proposed.  Therefore, I think I need to go back and make four
> categories instead of two:
> 
> 1.  cases like MAX(int), where we return only one value, and the FDW
> return value is an existing data type, e.g., int
> 
> 2.  cases like AVG(int) where we return multiple FDW values of the same
> type and can use an array, e.g.,  bigint array
> 
> 3.  cases like AVG(bigint) where we return multiple FDW values of the
> same type (or can), but one of the FDW return values is not an
> existing data type, e.g.  int128
> 
> 4.  cases like AVG(interval) where we return multiple FDW values of
> different types, e.g. interval and an integral count
> 
> For #1, all MAX cases have aggregate input parameters the same as the
> FDW return types (aggtranstype):
> 
>   SELECT proargtypes[0]::regtype, aggtranstype::regtype
>   FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
>   WHERE proname = 'max' AND proargtypes[0] != aggtranstype;
> 
>proargtypes | aggtranstype
>   -+--
> 
> For #2-4, we have for AVG:
> 
>   SELECT proargtypes[0]::regtype, aggtranstype::regtype
>   FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
>   WHERE proname = 'avg';
>   
>  proargtypes|aggtranstype
>   --+
> 3->bigint   | internal
> 2->integer  | bigint[]
> 2->smallint | bigint[]
> 3->numeric  | internal
> 2->real | double precision[]
> 2->double precision | double precision[]
> 4->interval | internal
> 
> You can see which AVG items fall into which categories.  It seems we
> have #1 and #2 handled cleanly in the patch.
> 

Agreed.

> My question is related to #3 and #4.  For #3, if we are going to be
> building infrastructure to handle passing int128 for AVG, wouldn't it be
> wiser to create an int128 type and an int128 array type, and then use
> method #2 to handle those, rather than creating custom code just to
> read/write int128 values for FDWs aggregate passing alone.
> 

Yep, adding int128 as a data type would extend this to aggregates that
store state as int128 (or array of int128).
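[Editor's note: a quick illustration of why a 128-bit transition sum matters for the bigint case. Python integers are arbitrary precision, so the widths can be checked directly.]

```python
# Summing even a few values near the int64 maximum overflows 64 bits,
# which is why AVG(bigint) keeps its running sum in an int128.
INT64_MAX = 2 ** 63 - 1

total = sum([INT64_MAX, INT64_MAX, INT64_MAX])
assert total > INT64_MAX          # no longer fits in int64
assert total.bit_length() <= 127  # but fits easily in int128
```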

> For #4, can we use or improve the RECORD data type to handle #4 --- that
> seems preferable to creating custom FDWs aggregate passing code.
> 
> I know the open question was whether we should create custom FDWs
> aggregate passing functions or custom data types for FDWs aggregate
> passing, but I am asking if we can improve existing facilities, like
> int128 or record passing, to reduce the need for some of these.
> 

But which code would produce the record? AFAIK it can't happen in some
generic executor code, because that only sees "internal" for each
aggregate. The exact structure of the aggstate is private within the
code of each aggregate - the record would have to be built there, no?

I imagine we'd add this for each aggregate as a new pair of functions to
build/parse the record, but that's kinda the serial/deserial way we
discussed earlier.

Or are you suggesting we'd actually say:

  CREATE AGGREGATE xyz(...) (
STYPE = record,
...
  )

or something like that? I have no idea if that would work, maybe it
would. In a way it'd not be that different from the "custom data type"
except that it doesn't actually need a custom data type (assuming we
know how to serialize such "generic" records - I'm not familiar with
this code enough to answer that).
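[Editor's note: the appeal of a dedicated state type is that its text in/out functions fix a representation that stays stable across versions and architectures. A minimal hypothetical sketch, with invented names, of such a pair for a (sum, count) state:]

```python
# Hypothetical text in/out functions for a dedicated "avg state" type.
# A text representation is trivially stable across versions and CPU
# architectures -- the guarantee binary serial/deserial does not make.
def state_out(total: int, count: int) -> str:
    return f"({total},{count})"

def state_in(text: str):
    total, count = text.strip("()").split(",")
    return int(total), int(count)

# Round-trips even values wider than 64 bits, since the text form
# carries no fixed-width assumptions.
assert state_in(state_out(2 ** 100, 7)) == (2 ** 100, 7)
```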

The reason why I found the "custom data type" approach interesting is
that it sets clear guarantees / expectations about the stability of the
output between versions. For serial/deserial we make no guarantees; it
can change even in a minor release. But here we need something stable
even for major releases, to allow pushdown when querying an older server.
We have that strong expect

Re: Partial aggregates pushdown

2024-08-22 Thread Bruce Momjian
On Wed, Aug 21, 2024 at 04:59:12PM +0200, Tomas Vondra wrote:
> On 8/20/24 20:41, Bruce Momjian wrote:
> > SELECT (oid, relname) FROM pg_class LIMIT 1;
> >  row
> > -
> >  (2619,pg_statistic)
> > 
> > SELECT pg_typeof((oid, relname)) FROM pg_class LIMIT 1;
> >  pg_typeof
> > ---
> >  record
> > 
> > SELECT pg_typeof(oid) FROM pg_class LIMIT 1;
> >  pg_typeof
> > ---
> >  oid
> > 
> > SELECT pg_typeof(relname) FROM pg_class LIMIT 1;
> >  pg_typeof
> > ---
> >  name
> > 
> 
> How would this help with the AVG(bigint) case? We don't have int128 as
> SQL type, so what would be part of the record? Also, which part of the
> code would produce the record? If the internal state is "internal", that
> would probably need to be something aggregate specific, and that's kinda
> what this patch series is adding, no?
> 
> Or am I missing some cases where the record would make it work?

Right now, my goal in this thread is to try to concretely explain what
is being proposed.  Therefore, I think I need to go back and make four
categories instead of two:

1.  cases like MAX(int), where we return only one value, and the FDW
return value is an existing data type, e.g., int

2.  cases like AVG(int) where we return multiple FDW values of the same
type and can use an array, e.g.,  bigint array

3.  cases like AVG(bigint) where we return multiple FDW values of the
same type (or can), but one of the FDW return values is not an
existing data type, e.g.  int128

4.  cases like AVG(interval) where we return multiple FDW values of
different types, e.g. interval and an integral count

For #1, all MAX cases have aggregate input parameters the same as the
FDW return types (aggtranstype):

SELECT proargtypes[0]::regtype, aggtranstype::regtype
FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
WHERE proname = 'max' AND proargtypes[0] != aggtranstype;

 proargtypes | aggtranstype
-+--

For #2-4, we have for AVG:

SELECT proargtypes[0]::regtype, aggtranstype::regtype
FROM pg_aggregate a JOIN pg_proc p ON a.aggfnoid = p.oid
WHERE proname = 'avg';

   proargtypes|aggtranstype
--+
3->  bigint   | internal
2->  integer  | bigint[]
2->  smallint | bigint[]
3->  numeric  | internal
2->  real | double precision[]
2->  double precision | double precision[]
4->  interval | internal

You can see which AVG items fall into which categories.  It seems we
have #1 and #2 handled cleanly in the patch.
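[Editor's note: a hypothetical end-to-end sketch of category #2. The function names are invented; in PostgreSQL this role is played by the aggregate's combinefunc and finalfunc.]

```python
# Category #2: AVG(integer) keeps its partial state as a two-element
# bigint array [sum, count]. A coordinator can combine such states
# pulled from several shards and finalize the average locally.
def combine(state_a, state_b):
    return [state_a[0] + state_b[0], state_a[1] + state_b[1]]

def finalize(state):
    return state[0] / state[1]

shard_states = [[10, 2], [30, 3]]    # per-shard [sum, count]
merged = [0, 0]
for s in shard_states:
    merged = combine(merged, s)
assert finalize(merged) == 8.0       # 40 / 5
```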

My question is related to #3 and #4.  For #3, if we are going to be
building infrastructure to handle passing int128 for AVG, wouldn't it be
wiser to create an int128 type and an int128 array type, and then use
method #2 to handle those, rather than creating custom code just to
read/write int128 values for FDWs aggregate passing alone.

For #4, can we use or improve the RECORD data type to handle #4 --- that
seems preferable to creating custom FDWs aggregate passing code.

I know the open question was whether we should create custom FDWs
aggregate passing functions or custom data types for FDWs aggregate
passing, but I am asking if we can improve existing facilities, like
int128 or record passing, to reduce the need for some of these.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-22 Thread Ashutosh Bapat
On Wed, Aug 21, 2024 at 9:11 PM Tomas Vondra  wrote:
>
>
>
> On 8/8/24 13:48, Jelte Fennema-Nio wrote:
> > SUMMARY OF THREAD
> >
> > The design of patch 0001 is agreed upon by everyone on the thread (so
> > far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
> > cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
> > for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
> > is only supported for aggregates with a non-internal/pseudo type as
> > the stype.
> >
>
> I don't have a strong opinion on this, but I wonder if someone might
> > object that this essentially extends the syntax with something that is not
> (and never will be) in the SQL standard. I wonder if there's some
> precedent for encoding such explicit execution instructions into the
> query itself?

This could be a useful feature for running aggregation in a
federated database across many source databases, so people in the
community who participate in the SQL standard may add it there. While
implementing the feature we might think of it as influencing the
execution, but I don't see it that way. It's a feature allowing users
to access the pre-finalization state of an aggregate, which they can
combine with such states from other data sources. There may be
other uses as well. But adding it as a SQL feature means standardizing
what the partial aggregate is for each aggregate function - the notions
are intuitive, but they still need to be spelled out. Of course, going
this way means it will take longer for the feature to become available,
but at least it won't look like a kludge.

-- 
Best Wishes,
Ashutosh Bapat




Re: Partial aggregates pushdown

2024-08-21 Thread Tomas Vondra



On 8/8/24 13:48, Jelte Fennema-Nio wrote:
> SUMMARY OF THREAD
> 
> The design of patch 0001 is agreed upon by everyone on the thread (so
> far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
> cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
> for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
> is only supported for aggregates with a non-internal/pseudo type as
> the stype.
> 

I don't have a strong opinion on this, but I wonder if someone might
object that this essentially extends the syntax with something that is not
(and never will be) in the SQL standard. I wonder if there's some
precedent for encoding such explicit execution instructions into the
query itself?

That reminds me - the PARTIAL_AGGREGATE label is per aggregate, but I
don't think it makes much sense to do this only for some aggregates,
right? Do we have a way to make sure the query is "consistent"? I'm not
sure if doing this on the source (before pushdown) is enough. Could
there be a difference in what the remote instance supports?

The only alternative that I can think of (and that I believe was already
mentioned in this thread) is to set some GUC that forces the top-most
query level to do this (all aggregates at that level). That'd have the
benefit of always affecting all aggregates.

> The design for patch 0002 is still under debate. This would expand on
> the functionality added by adding support for PARTIAL_AGGREGATE for
> aggregates with an internal stype. This is done by returning a byte
> array containing the bytes that the serialfunc of the aggregate
> returns.
> 
> A competing proposal for 0002 is to instead change aggregates to not
> use an internal stype anymore, and create dedicated types. The main
> downside here is that infunc and outfunc would need to be added for
> text serialization, in addition to the binary serialization. An open
> question is: Can we change the requirements for CREATE TYPE, so that
> types can be created without infunc and outfunc.
> 

I think it's +0.5 for the new dedicated data types from me.

I admit I'm too lazy to read the whole thread from scratch, but I
believe we did discuss the possibility to reuse the serial/deserial
functions we already have, but the reason against that was the missing
cross-version stability. Parallel queries always run within a single
instance, hence there are no concerns about other versions. But this is
meant to support the remote node having a wildly different version.

I guess we might introduce another pair of serial/deserial functions,
with this guarantee. I know we (me and Jelte) discussed that in person
at some point, and there were arguments for doing the data types. But I
managed to forget the details :-(

> WHAT IS NEEDED?
> 
> The things needed for this patch are that docs need to be added, and
> detailed codereview needs to be done.

Yeah, I think the docs are must-have for a proper review.

> Feedback from more people on the two competing proposals for 0002
> would be very helpful in making a decision.
> 



regards

-- 
Tomas Vondra




Re: Partial aggregates pushdown

2024-08-21 Thread Tomas Vondra



On 8/20/24 20:41, Bruce Momjian wrote:
> On Tue, Aug 20, 2024 at 07:03:56PM +0200, Jelte Fennema-Nio wrote:
>> On Tue, 20 Aug 2024 at 18:50, Bruce Momjian  wrote:
>>> Okay, so we can do MAX easily, and AVG if the count can be represented
>>> as the same data type as the sum?  Is that correct?  Our only problem is
>>> that something like AVG(interval) can't use an array because arrays have
>>> to have the same data type for all array elements, and an interval can't
>>> represent a count?
>>
>> Close, but still not completely correct. AVG(bigint) can also not be
>> supported by patch 1, because the sum and the count for that are both
>> stored using an int128. So we'd need an array of int128, and there's
>> currently no int128 SQL type.
> 
> Okay.  Have we considered having the FDW return a record:
> 
>   SELECT (oid, relname) FROM pg_class LIMIT 1;
>row
>   -
>(2619,pg_statistic)
> 
>   SELECT pg_typeof((oid, relname)) FROM pg_class LIMIT 1;
>pg_typeof
>   ---
>record
> 
>   SELECT pg_typeof(oid) FROM pg_class LIMIT 1;
>pg_typeof
>   ---
>oid
>   
>   SELECT pg_typeof(relname) FROM pg_class LIMIT 1;
>pg_typeof
>   ---
>name
> 

How would this help with the AVG(bigint) case? We don't have int128 as an
SQL type, so what would be part of the record? Also, which part of the
code would produce the record? If the internal state is "internal", that
would probably need to be something aggregate-specific, and that's kind
of what this patch series is adding, no?

Or am I missing some cases where the record would make it work?
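For illustration, the catalog shows why there is nothing to put in a
record for this case (checked against a recent release; output could
differ on other versions):

```sql
-- avg(bigint) accumulates into an "internal" state (an int128 sum plus a
-- count), so there is no SQL type that could carry it as a record field.
SELECT aggtransfn, aggtranstype::regtype
FROM pg_aggregate
WHERE aggfnoid = 'avg(bigint)'::regprocedure;
--    aggtransfn   | aggtranstype
-- ----------------+--------------
--  int8_avg_accum | internal
```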


regards

-- 
Tomas Vondra




Re: Partial aggregates pushdown

2024-08-20 Thread Bruce Momjian
On Tue, Aug 20, 2024 at 07:03:56PM +0200, Jelte Fennema-Nio wrote:
> On Tue, 20 Aug 2024 at 18:50, Bruce Momjian  wrote:
> > Okay, so we can do MAX easily, and AVG if the count can be represented
> > as the same data type as the sum?  Is that correct?  Our only problem is
> > that something like AVG(interval) can't use an array because arrays have
> > to have the same data type for all array elements, and an interval can't
> > represent a count?
> 
> Close, but still not completely correct. AVG(bigint) can also not be
> supported by patch 1, because the sum and the count for it are both
> stored using an int128. So we'd need an array of int128, and there's
> currently no int128 SQL type.

Okay.  Have we considered having the FDW return a record:

SELECT (oid, relname) FROM pg_class LIMIT 1;
 row
-
 (2619,pg_statistic)

SELECT pg_typeof((oid, relname)) FROM pg_class LIMIT 1;
 pg_typeof
---
 record

SELECT pg_typeof(oid) FROM pg_class LIMIT 1;
 pg_typeof
---
 oid

SELECT pg_typeof(relname) FROM pg_class LIMIT 1;
 pg_typeof
---
 name

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-20 Thread Jelte Fennema-Nio
On Tue, 20 Aug 2024 at 18:50, Bruce Momjian  wrote:
> Okay, so we can do MAX easily, and AVG if the count can be represented
> as the same data type as the sum?  Is that correct?  Our only problem is
> that something like AVG(interval) can't use an array because arrays have
> to have the same data type for all array elements, and an interval can't
> represent a count?

Close, but still not completely correct. AVG(bigint) can also not be
supported by patch 1, because the sum and the count for it are both
stored using an int128. So we'd need an array of int128, and there's
currently no int128 SQL type.




Re: Partial aggregates pushdown

2024-08-20 Thread Bruce Momjian
On Tue, Aug 20, 2024 at 10:07:32AM +0200, Jelte Fennema-Nio wrote:
> On Thu, 15 Aug 2024 at 23:12, Bruce Momjian  wrote:
> > Third, I would like to show a more specific example to clarify what is
> > being considered above.  If we look at MAX(), we can have FDWs return
> > the max for each FDW, and the coordinator can choose the highest value.
> > This is the patch 1 listed above.  These can return the
> > pg_aggregate.aggtranstype data type using the pg_type.typoutput text
> > output.
> >
> > The second case is for something like AVG(), which must return the SUM()
> > and COUNT(), and we currently have no way to return multiple text values
> > on the wire.  For patch 0002, we have the option of creating functions
> > that can do this and record them in new pg_attribute columns, or we can
> > create a data type with these functions, and assign the data type to
> > pg_aggregate.aggtranstype.
> >
> > Is that accurate?
> 
> It's close to accurate, but not entirely. Patch 1 actually solves
> some AVG cases too, because some AVG implementations use an SQL
> array type to store the transtype instead of an internal type. And by
> using an SQL array type we *can* send multiple text values on the
> wire. See below for a list of those aggregates:

Okay, so we can do MAX easily, and AVG if the count can be represented
as the same data type as the sum?  Is that correct?  Our only problem is
that something like AVG(interval) can't use an array because arrays have
to have the same data type for all array elements, and an interval can't
represent a count?
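To make the contrast concrete, avg(integer) is the case that does fit the
array scheme (catalog query against a recent release; output could differ
on other versions):

```sql
-- avg(integer) keeps {count, sum} in a plain bigint[] transition value,
-- which the existing type machinery can send as text or binary.
-- avg(interval) has no such option, since an interval array cannot hold
-- the count.
SELECT aggtransfn, aggtranstype::regtype
FROM pg_aggregate
WHERE aggfnoid = 'avg(integer)'::regprocedure;
--    aggtransfn   | aggtranstype
-- ----------------+--------------
--  int4_avg_accum | bigint[]
```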

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-20 Thread Jelte Fennema-Nio
On Thu, 15 Aug 2024 at 23:12, Bruce Momjian  wrote:
> Third, I would like to show a more specific example to clarify what is
> being considered above.  If we look at MAX(), we can have FDWs return
> the max for each FDW, and the coordinator can choose the highest value.
> This is the patch 1 listed above.  These can return the
> pg_aggregate.aggtranstype data type using the pg_type.typoutput text
> output.
>
> The second case is for something like AVG(), which must return the SUM()
> and COUNT(), and we currently have no way to return multiple text values
> on the wire.  For patch 0002, we have the option of creating functions
> that can do this and record them in new pg_attribute columns, or we can
> create a data type with these functions, and assign the data type to
> pg_aggregate.aggtranstype.
>
> Is that accurate?

It's close to accurate, but not entirely. Patch 1 actually solves
some AVG cases too, because some AVG implementations use an SQL
array type to store the transtype instead of an internal type. And by
using an SQL array type we *can* send multiple text values on the
wire. See below for a list of those aggregates:

select p.oid::regprocedure
from pg_aggregate a join pg_proc p on a.aggfnoid = p.oid
where aggfinalfn != 0 and aggtranstype::regtype not in ('internal',
'anyenum', 'anyelement', 'anyrange', 'anyarray', 'anymultirange');
oid
───
 avg(integer)
 avg(smallint)
 avg(real)
 avg(double precision)
 avg(interval)
 var_pop(real)
 var_pop(double precision)
 var_samp(real)
 var_samp(double precision)
 variance(real)
 variance(double precision)
 stddev_pop(real)
 stddev_pop(double precision)
 stddev_samp(real)
 stddev_samp(double precision)
 stddev(real)
 stddev(double precision)
 regr_sxx(double precision,double precision)
 regr_syy(double precision,double precision)
 regr_sxy(double precision,double precision)
 regr_avgx(double precision,double precision)
 regr_avgy(double precision,double precision)
 regr_r2(double precision,double precision)
 regr_slope(double precision,double precision)
 regr_intercept(double precision,double precision)
 covar_pop(double precision,double precision)
 covar_samp(double precision,double precision)
 corr(double precision,double precision)
(28 rows)

And to be clear, these are in addition to the MAX type of aggregates
you were describing:
select p.oid::regprocedure
from pg_aggregate a join pg_proc p on a.aggfnoid = p.oid
where aggfinalfn = 0 and aggtranstype::regtype not in ('internal',
'anyenum', 'anyelement', 'anyrange', 'anyarray', 'anymultirange');
  oid
───
 sum(integer)
 sum(smallint)
 sum(real)
 sum(double precision)
 sum(money)
 sum(interval)
 max(bigint)
 max(integer)
 max(smallint)
 max(oid)
 max(real)
 max(double precision)
 max(date)
 max(time without time zone)
 max(time with time zone)
 max(money)
 max(timestamp without time zone)
 max(timestamp with time zone)
 max(interval)
 max(text)
 max(numeric)
 max(character)
 max(tid)
 max(inet)
 max(pg_lsn)
 max(xid8)
 min(bigint)
 min(integer)
 min(smallint)
 min(oid)
 min(real)
 min(double precision)
 min(date)
 min(time without time zone)
 min(time with time zone)
 min(money)
 min(timestamp without time zone)
 min(timestamp with time zone)
 min(interval)
 min(text)
 min(numeric)
 min(character)
 min(tid)
 min(inet)
 min(pg_lsn)
 min(xid8)
 count("any")
 count()
 regr_count(double precision,double precision)
 bool_and(boolean)
 bool_or(boolean)
 every(boolean)
 bit_and(smallint)
 bit_or(smallint)
 bit_xor(smallint)
 bit_and(integer)
 bit_or(integer)
 bit_xor(integer)
 bit_and(bigint)
 bit_or(bigint)
 bit_xor(bigint)
 bit_and(bit)
 bit_or(bit)
 bit_xor(bit)
 xmlagg(xml)
(65 rows)




Re: Partial aggregates pushdown

2024-08-15 Thread Bruce Momjian
On Thu, Aug  8, 2024 at 01:48:49PM +0200, Jelte Fennema-Nio wrote:
> SUMMARY OF THREAD
> 
> The design of patch 0001 is agreed upon by everyone on the thread (so
> far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
> cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
> for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
> is only supported for aggregates with a non-internal/pseudo type as
> the stype.
> 
> The design for patch 0002 is still under debate. This would expand on
> the functionality added by adding support for PARTIAL_AGGREGATE for
> aggregates with an internal stype. This is done by returning a byte
> array containing the bytes that the serialfunc of the aggregate
> returns.
> 
> A competing proposal for 0002 is to instead change aggregates to not
> use an internal stype anymore, and create dedicated types. The main
> downside here is that infunc and outfunc would need to be added for
> text serialization, in addition to the binary serialization. An open
> question is: Can we change the requirements for CREATE TYPE, so that
> types can be created without infunc and outfunc.
> 
> WHAT IS NEEDED?
> 
> The things needed for this patch are that docs need to be added, and
> a detailed code review needs to be done.
> 
> Feedback from more people on the two competing proposals for 0002
> would be very helpful in making a decision.

First, I am sorry to be replying so late --- I have been traveling for
the past four weeks.  Second, I consider this feature a big part of
sharding, and I think sharding is Postgres's biggest missing feature. I
talk about this patch often when asked about what Postgres is working
on.

Third, I would like to show a more specific example to clarify what is
being considered above.  If we look at MAX(), we can have FDWs return
the max for each FDW, and the coordinator can choose the highest value. 
This is the patch 1 listed above.  These can return the
pg_aggregate.aggtranstype data type using the pg_type.typoutput text
output.

The second case is for something like AVG(), which must return the SUM()
and COUNT(), and we currently have no way to return multiple text values
on the wire.  For patch 0002, we have the option of creating functions
that can do this and record them in new pg_attribute columns, or we can
create a data type with these functions, and assign the data type to
pg_aggregate.aggtranstype.

Is that accurate?
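To make the second option concrete, it could look roughly like this
(every name below is invented; this is a sketch of the idea, not of the
patch):

```sql
-- Option 2 sketch: a dedicated SQL type standing in for avg(numeric)'s
-- internal state, which would then be assigned to
-- pg_aggregate.aggtranstype. All four support functions are hypothetical.
CREATE TYPE numeric_avg_state (
    input          = numeric_avg_state_in,    -- text -> state
    output         = numeric_avg_state_out,   -- state -> text
    receive        = numeric_avg_state_recv,  -- binary, reusing deserial logic
    send           = numeric_avg_state_send,  -- binary, reusing serial logic
    internallength = variable
);
```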

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-15 Thread Bruce Momjian
On Sun, Jul  7, 2024 at 09:52:27PM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Bruce.
> 
> > From: Bruce Momjian 
> > Is there a reason the documentation is no longer a part of this patch?
> > Can I help you keep it current?
> 
> Here are the reasons:
>   Reason1. The approach differs significantly from the previous patch that 
> included documentation, the below.
> 
> https://www.postgresql.org/message-id/attachment/152086/0001-Partial-aggregates-push-down-v34.patch
>   Reason2. I have two options for transmitting the state value and I'm 
> evaluating which one is optimal.
>One is what I presented to you at PGConf.dev 2024. The other is Jelte's.
>He listened to my talk at the conference and gave me some useful comments 
> on hackers. I'm very grateful for that.
>   Reason3. The implementation and tests have not been finished yet.
> Regarding Reason 2, I provided my conclusion in the previous message.
> 
> My plan for advancing the patch involves the following steps:
>   Step1. Decide the approach on transmitting state value.
>   Step2. Implement code (including comments) and tests to support a subset of 
> aggregate functions.
> Specifically, I plan to support avg, sum, and other aggregate functions 
> like min and max which don't need export/import functions.
>   Step3. Add documentations.
> 
> To clarify my approach, should I proceed with Step 3 before Step2?
> I would appreciate your feedback on this.

Thanks, I now understand why the docs were removed, and I agree.  I will
post about the options in a new email now.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-08-08 Thread Jelte Fennema-Nio
SUMMARY OF THREAD

The design of patch 0001 is agreed upon by everyone on the thread (so
far). This adds the PARTIAL_AGGREGATE label for aggregates, which will
cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE
for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE
is only supported for aggregates with a non-internal/pseudo type as
the stype.

The design for patch 0002 is still under debate. This would expand on
the functionality added by adding support for PARTIAL_AGGREGATE for
aggregates with an internal stype. This is done by returning a byte
array containing the bytes that the serialfunc of the aggregate
returns.
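As a concrete illustration of that 0002 behaviour (PARTIAL_AGGREGATE is
the syntax added by these patches, not something in core today, and the
table name is invented):

```sql
-- With an internal stype, the remote side would return the serialfunc's
-- bytea instead of a finalized value; the local node would feed it
-- through the deserialfunc and combinefunc before finalizing.
SELECT avg(price) PARTIAL_AGGREGATE FROM items;
```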

A competing proposal for 0002 is to instead change aggregates to not
use an internal stype anymore, and create dedicated types. The main
downside here is that infunc and outfunc would need to be added for
text serialization, in addition to the binary serialization. An open
question is: Can we change the requirements for CREATE TYPE, so that
types can be created without infunc and outfunc.

WHAT IS NEEDED?

The things needed for this patch are that docs need to be added, and
a detailed code review needs to be done.

Feedback from more people on the two competing proposals for 0002
would be very helpful in making a decision.




Re: Partial aggregates pushdown

2024-07-08 Thread Jelte Fennema-Nio
On Mon, 8 Jul 2024 at 14:12, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> The best thing about Approach2 is that it lets us send state values using the 
> existing data type system.
> I'm worried that if we choose Approach2, we might face some limits because we 
> have to create new types.
> But, we might be able to fix these limits if we look into it more.
>
> Approach1 doesn't make new types, so we can avoid these limits.
> But, it means we have to make export/import functions that are similar to the 
> typsend/typreceive functions.
> So, we need to make sure if we really need this method.
>
> Is this the right understanding?

Yeah, correct. To clarify my reasoning a bit more: IMHO, the main
downside of implementing Approach1 is that we then end up with two
different mechanisms to "take data from memory and serialize it in a
way in which it can be sent over the network". I'd very much prefer if
we could have a single system responsible for that task. So if there's
issues with the current system (e.g. having to implement
typinput/typoutput), then I'd rather address these problems in the
existing system. Only if that turns out to be impossible for some
reason, then I think I'd prefer Approach1.

Personally, even if the Approach2 requires a bit more code, then I'd
still prefer a single serialization system over having two
serializiation systems.




RE: Partial aggregates pushdown

2024-07-08 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Jelte,

Thank you for comments and advises.

> From: Jelte Fennema-Nio 
> Sent: Monday, July 8, 2024 5:31 PM
> On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > In my mind, Approach1 is superior. Therefore, if there are no objections 
> > this week, I plan to resume implementing
> Approach1 next week. I would appreciate it if anyone could discuss the topic 
> with me or ask questions.
> 
> Honestly, the more I think about this, the more I like Approach2. Not because 
> I disagree with you about some of the
> limitations of Approach2, but because I'd rather see those limitations fixed 
> in CREATE TYPE, instead of working around
> these limitations in CREATE AGGREGATE. That way more usages can benefit. 
> Detailed explanation and arguments below.
Firstly, I may have jumped to conclusions too quickly. I apologize for that.
I would appreciate it if we could define Approach 1 and Approach 2 more 
precisely so that we can proceed with the discussion.

Before we get into the details, let me break down the main differences between 
Approach 1 and Approach 2.

The best thing about Approach2 is that it lets us send state values using the 
existing data type system.
I'm worried that if we choose Approach2, we might face some limits because we 
have to create new types.
But, we might be able to fix these limits if we look into it more. 

Approach1 doesn't make new types, so we can avoid these limits.
But, it means we have to make export/import functions that are similar to the 
typsend/typreceive functions.
So, we need to make sure if we really need this method.

Is this the right understanding?

> From: Jelte Fennema-Nio 
> Sent: Monday, July 8, 2024 5:59 PM
> On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it 
> > after the FILTER clause, like avg(c1) FILTER
> (WHERE c2 > 0) PARTIAL_AGGREGATE, and by marking it as an ASLABEL word like 
> FILTER.
> > I attached the patch of the method.
> > If there are no objections, I would like to proceed with the method 
> > described above.
> > I'd appreciate it if anyone comment the method.
> 
> I like this approach of using PARTIAL_AGGREGATE in the same place as the 
> FILTER clause.
Thank you for comment.

> On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > My plan for advancing the patch involves the following steps:
> >   Step1. Decide the approach on transmitting state value.
> >   Step2. Implement code (including comments) and tests to support a subset 
> > of aggregate functions.
> > Specifically, I plan to support avg, sum, and other aggregate functions 
> > like min and max which don't need
> export/import functions.
> >   Step3. Add documentations.
> >
> > To clarify my approach, should I proceed with Step 3 before Step2?
> 
> (my opinion, Bruce might have a different one)
> 
> I think it's good that you split the original patch in two:
> 0001: non-internal partial aggregates
> 0002: internal partial aggregates
> 
> I think we're aligned on the general design of 0001. So I think now is 
> definitely the time to include documentation there, so
> we can discuss this patch in more detail, and move it forward.
> 
> I think generally for 0002 it would also be useful to have documentation, I 
> personally like reading it to understand the
> general design and then comparing that to the code. But I also understand 
> that the language differences between Japanese
> and English, makes writing such docs a significant effort for you. So I think 
> it would be fine to skip docs for 0002 for now
> until we decide on the approach we want to take for internal partial 
> aggregates.
At least for 0001, it seems like it would be a good idea to attach 
documentation at this stage.

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2024-07-08 Thread Jelte Fennema-Nio
On Sun, 7 Jul 2024 at 23:52, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> My plan for advancing the patch involves the following steps:
>   Step1. Decide the approach on transmitting state value.
>   Step2. Implement code (including comments) and tests to support a subset of 
> aggregate functions.
> Specifically, I plan to support avg, sum, and other aggregate functions 
> like min and max which don't need export/import functions.
>   Step3. Add documentations.
>
> To clarify my approach, should I proceed with Step 3 before Step2?

(my opinion, Bruce might have a different one)

I think it's good that you split the original patch in two:
0001: non-internal partial aggregates
0002: internal partial aggregates

I think we're aligned on the general design of 0001. So I think now is
definitely the time to include documentation there, so we can discuss
this patch in more detail, and move it forward.

I think generally for 0002 it would also be useful to have
documentation, I personally like reading it to understand the general
design and then comparing that to the code. But I also understand that
the language differences between Japanese and English, makes writing
such docs a significant effort for you. So I think it would be fine to
skip docs for 0002 for now until we decide on the approach we want to
take for internal partial aggregates.




Re: Partial aggregates pushdown

2024-07-08 Thread Jelte Fennema-Nio
On Sun, 30 Jun 2024 at 23:42, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> Instead, I can make PARTIAL_AGGREGATE an unreserved word by placing it after 
> the FILTER clause, like avg(c1) FILTER (WHERE c2 > 0) PARTIAL_AGGREGATE, and 
> by marking it as an ASLABEL word like FILTER.
> I attached the patch of the method.
> If there are no objections, I would like to proceed with the method described 
> above.
> I'd appreciate it if anyone comment the method.

I like this approach of using PARTIAL_AGGREGATE in the same place as
the FILTER clause.
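In other words, the placement mirrors FILTER exactly, e.g. (syntax from
the attached patch, not in core):

```sql
-- PARTIAL_AGGREGATE occupies the same syntactic slot as FILTER, which is
-- what lets it stay an unreserved (as-label) keyword.
SELECT avg(c1) FILTER (WHERE c2 > 0) PARTIAL_AGGREGATE FROM t1;
```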




Re: Partial aggregates pushdown

2024-07-08 Thread Jelte Fennema-Nio
On Sun, 7 Jul 2024 at 23:46, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> In my mind, Approach1 is superior. Therefore, if there are no objections this 
> week, I plan to resume implementing Approach1 next week. I would appreciate 
> it if anyone could discuss the topic with me or ask questions.

Honestly, the more I think about this, the more I like Approach2. Not
because I disagree with you about some of the limitations of
Approach2, but because I'd rather see those limitations fixed in
CREATE TYPE, instead of working around these limitations in CREATE
AGGREGATE. That way more usages can benefit. Detailed explanation and
arguments below.

> 1. Extendability
> I believe it is crucial to support scenarios where the local and remote major 
> versions may differ in the future (see the below).
>
> https://www.postgresql.org/message-id/4012625.1701120204%40sss.pgh.pa.us

From my reading, Tom's concern is that different server versions
cannot talk to each other anymore. So as long as this perf
optimization is only enabled when server versions are the same, I
don't think there is a huge problem if we never implement this.
Honestly, I even think making this optimization opt-in at the FDW
server creation level would already solve Tom's concern. I do agree
that it would be good if we could have cross-version partial
aggregates though, so it's definitely something to consider.

> Regarding this aspect, I consider Approach1 superior to Approach2. The reason 
> is that:
> ・The data type of an aggregate function's state value may change with each 
> major version increment.
> ・In Approach1, by extending the export/import functionalities to include the 
> major version in which the state value was created (refer to p.16 and p.17 of 
> [1]), I can handle such situations.
> ・On the other hand, it appears that Approach2 fundamentally lacks the 
> capability to support these scenarios.

Approach 2 definitely has some cross-version capabilities, e.g.
jsonb_send includes a version. Such an approach can be used to solve a
newer coordinator talking to an older worker, if the transtypes are
the same.
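For reference, the version byte in jsonb's binary format is easy to see
from SQL:

```sql
-- The first byte of jsonb_send's output is JSONB_FORMAT_VERSION (1),
-- which is what lets a newer reader recognize an older writer's format.
SELECT substring(jsonb_send('{"a": 1}'::jsonb) from 1 for 1);
--  substring
-- -----------
--  \x01
```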

I personally don't think it's worth supporting this optimization for
an older coordinator talking to a newer worker. Using the binary protocol
to talk from an older server to a newer server doesn't work either.

Finally, based on p.16 & p.17 it's unclear to me how cross-version
with different transtypes would work. That situation seems inherently
incompatible to me.

> 2. Amount of codes
> Regarding this aspect, I find Approach1 to be better than Approach2.
> In Approach1, developers only need to export/import functions and can use a 
> standardized format for transmitting state values.
> In Approach2, developers have two options:
>   Option1: Adding typinput/typoutput and typsend/typreceive.
>   Option2: Adding typinput/typoutput only.
> Option1 requires more lines of code, which may be seen as cumbersome by some 
> developers.
> Option2 restricts developers to using only text representation for 
> transmitting state values, which I consider limiting.

In my opinion this is your strongest argument for Approach1. But you
haven't answered my previous two questions yet:

On Tue, 25 Jun 2024 at 11:28, Jelte Fennema-Nio  wrote:
> So that leaves me two questions:
> 1. Maybe CREATE TYPE should allow types without input/output functions
> as long as send/receive are defined. For these types text
> representation could fall back to the hex representation of bytea.
> 2. If for some reason 1 is undesired, then why are transtypes so
> special. Why is it fine for them to only have send/receive functions
> and not for other types?

Basically: I agree with this argument, but I feel like this being a
problem for this use case is probably a sign that we should take the
solution a step further and solve this at the CREATE TYPE level
instead of letting people hack around CREATE TYPE's limitations
just for these partial aggregates.
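Spelled out, the relaxation from question 1 might look like this (not
valid CREATE TYPE today; all names invented):

```sql
-- Hypothetical: a type declared with only binary I/O. Its text output
-- would fall back to the hex representation of the bytea produced by
-- the send function.
CREATE TYPE interval_avg_state (
    send           = interval_avg_state_send,
    receive        = interval_avg_state_recv,
    internallength = variable
);
```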

> 3. Catalog size
> Regarding this point, I believe Approach1 is better than Approach2.
> In Approach1, theoretically, it is necessary to add export/import functions 
> to pg_proc for each aggregate.
> In Approach2, theoretically, it is necessary to add typoutput/typinput 
> functions (and typsend/typreceive if necessary) to pg_proc and add a native 
> type to pg_type for each aggregate.
> I would like to emphasize that we should consider user-defined functions in 
> addition to built-in aggregate functions.
> I think most developers prefer to avoid bloating catalogs, even if they may 
> not be able to specify exact reasons.
> In fact, in Robert's previous review, he expressed a similar concern (see 
> below).
>
> https://www.postgresql.org/message-id/CA%2BTgmobvja%2Bjytj5zcEcYgqzOaeJiqrrJxgqDf1q%3D3k8FepuWQ%40mail.gmail.com

So, to summarize the difference (assuming we change CREATE TYPE to
allow only typsend/typreceive): "Approach 2 adds an additional pg_type
entry per aggregate"

IMHO this is fine, especially sin

RE: Partial aggregates pushdown

2024-07-07 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Bruce.

> From: Bruce Momjian 
> Is there a reason the documentation is no longer a part of this patch?
> Can I help you keep it current?

Here are the reasons:
  Reason1. The approach differs significantly from the previous patch that 
included documentation, the below.

https://www.postgresql.org/message-id/attachment/152086/0001-Partial-aggregates-push-down-v34.patch
  Reason2. I have two options for transmitting the state value and I'm 
evaluating which one is optimal.
   One is what I presented to you at PGConf.dev 2024. The other is Jelte's.
   He listened to my talk at the conference and gave me some useful comments on 
hackers. I'm very grateful for that.
  Reason3. The implementation and tests have not been finished yet.
Regarding Reason 2, I provided my conclusion in the previous message.

My plan for advancing the patch involves the following steps:
  Step1. Decide the approach on transmitting state value.
  Step2. Implement code (including comments) and tests to support a subset of 
aggregate functions.
Specifically, I plan to support avg, sum, and other aggregate functions 
like min and max which don't need export/import functions.
  Step3. Add documentations.

To clarify my approach, should I proceed with Step 3 before Step2?
I would appreciate your feedback on this.

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation




RE: Partial aggregates pushdown

2024-07-07 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Jelte and hackers,

I've reconsidered which of the following two approaches is best.
  Approach1: Adding export/import functions to transmit state values.
  Approach2: Adding native types that correspond to the state values.

In my mind, Approach1 is superior. Therefore, if there are no objections this 
week, I plan to resume implementing Approach1 next week. I would appreciate it 
if anyone could discuss the topic with me or ask questions.

I believe that while Approach1 has the extendability to support situations 
where local and remote major versions differ, Approach2 lacks this 
extendability. Additionally, it seems that Approach1 requires fewer additional 
lines of code compared to Approach2. I'm also concerned that Approach2 may 
cause the catalog pg_type to bloat.

Although Approach2 offers the benefit of avoiding the addition of columns to 
pg_aggregate, I think this benefit is smaller than the advantages of Approach1 
mentioned above.

Next, I will present my complete comparison. The comparison points are as 
follows:
  1. Extendability
  2. Amount of codes
  3. Catalog size
  4. Developer burden
  5. Additional columns to catalogs

1. Extendability
I believe it is crucial to support scenarios where the local and remote major 
versions may differ in the future (see the below).

https://www.postgresql.org/message-id/4012625.1701120204%40sss.pgh.pa.us

Regarding this aspect, I consider Approach1 superior to Approach2. The reason 
is that:
・The data type of an aggregate function's state value may change with each 
major version increment.
・In Approach1, by extending the export/import functionalities to include the 
major version in which the state value was created (refer to p.16 and p.17 of 
[1]), I can handle such situations.
・On the other hand, it appears that Approach2 fundamentally lacks the 
capability to support these scenarios.

2. Amount of codes
Regarding this aspect, I find Approach1 to be better than Approach2.
In Approach1, developers only need to export/import functions and can use a 
standardized format for transmitting state values.
In Approach2, developers have two options:
  Option1: Adding typinput/typoutput and typsend/typreceive.
  Option2: Adding typinput/typoutput only.
Option1 requires more lines of code, which may be seen as cumbersome by some 
developers.
Option2 restricts developers to using only text representation for transmitting 
state values, which I consider limiting.

3. Catalog size
Regarding this point, I believe Approach1 is better than Approach2.
In Approach1, theoretically, it is necessary to add export/import functions to 
pg_proc for each aggregate.
In Approach2, theoretically, it is necessary to add typoutput/typinput 
functions (and typsend/typreceive if necessary) to pg_proc and add a native 
type to pg_type for each aggregate.
I would like to emphasize that we should consider user-defined functions in 
addition to built-in aggregate functions.
I think most developers prefer to avoid bloating catalogs, even if they may not 
be able to specify exact reasons.
In fact, in Robert's previous review, he expressed a similar concern (see 
below).

https://www.postgresql.org/message-id/CA%2BTgmobvja%2Bjytj5zcEcYgqzOaeJiqrrJxgqDf1q%3D3k8FepuWQ%40mail.gmail.com

4. Developer burden.
Regarding this aspect, I believe Approach1 is better than Approach2.
In Approach1, developers have the following additional tasks:
  Task1-1: Create and define export/import functions.

In Approach2, developers have the following additional tasks:
  Task2-1: Create and define typinput/typoutput functions (and typsend/typreceive 
functions if necessary).
  Task2-2: Define a native type.

Approach1 requires fewer additional tasks, although the difference may not be 
substantial.

5. Additional columns to catalogs.
Regarding this aspect, Approach2 is better than Approach1.
Approach1 requires additional three columns in pg_aggregate, specifically the 
aggpartialpushdownsafe flag, export function reference, and import function 
reference.
Approach2 does not require any additional columns in catalogs.
However, over the past four years of discussions, no one has expressed concerns 
about additional columns in catalogs.

[1] 
https://www.postgresql.org/message-id/attachment/160659/PGConfDev2024_Presentation_Aggregation_Scaleout_FDW_Sharding_20240531.pdf

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2024-07-05 Thread Bruce Momjian
On Sun, Jun 30, 2024 at 09:42:19PM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> On Mon, Jun 24, 2024 at 6:09 PM Jelte Fennema-Nio  wrote:
> > 4. Related to 3, I think it would be good to have some tests of
> > PARTIAL_AGGREGATE that don't involve postgres_fdw at all. I also
> > spotted some comments too that mention FDW, even though they apply to
> > the "pure" PARTIAL_AGGREGATE code.
> > 5. This comment now seems incorrect:
> > -* Apply the agg's finalfn if one is provided, else return transValue.
> > +* If the agg's finalfn is provided and PARTIAL_AGGREGATE keyword is
> > +* not specified, apply the agg's finalfn.
> > +* If PARTIAL_AGGREGATE keyword is specified and the transValue type
> > +* is internal, apply the agg's serialfn. In this case the agg's
> > +* serialfn must not be invalid. Otherwise return transValue.
> >
> > 6. These errors are not on purpose afaict (if they are a comment in
> > the test would be good to explain why)
> >
> > +SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1;
> > +ERROR:  could not connect to server "loopback"
> > +DETAIL:  invalid connection option "partial_aggregate_support"
> Fixed.

Is there a reason the documentation is no longer a part of this patch? 
Can I help you keep it current?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2024-06-30 Thread fujii.y...@df.mitsubishielectric.co.jp
> From: Fujii Yuki 
> Sent: Monday, July 1, 2024 6:42 AM
> Hi hackers.
> 
> On Wed, Jun 5, 2024 at 9:15 AM fujii.y...@df.mitsubishielectric.co.jp 
>  wrote:
> >   Requirement2. Consider appropriate position of the new keyword 
> > "PARTIAL_AGGREGATE". (with Robert)
> > Existing patch: Before the target expression. Ex. avg(PARTIAL_AGGREGATE 
> > c1)
> > Ideal: Before the aggregate function. Ex. PARTIAL_AGGREGATE avg(c1)
> >   Requirement3. Consider to avoid to make the new keyword 
> > "PARTIAL_AGGREGATE" become a reserved word. (with
> Robert)
> > In the existing patch, "PARTIAL_AGGREGATE" is a reserved word.
> I considered the above two requirement.
> Based on my research, there is no way to use PARTIAL_AGGREGATE in front of a 
> function name without making it a
> reserved word.
With that approach, I could not resolve the shift/reduce conflicts.


Re: Partial aggregates pushdown

2024-06-25 Thread Jelte Fennema-Nio
On Tue, 25 Jun 2024 at 08:33, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> Actually, I have other tasks about the "PARTIAL_AGGREGATE" keyword
> to respond Requirement1 and Requirement2 in the following mail.
> https://www.postgresql.org/message-id/TYAPR01MB3088755F2281D41F5EEF06D495F92%40TYAPR01MB3088.jpnprd01.prod.outlook.com

No problem. I totally think it makes sense to focus on basic
PARTIAL_AGGREGATE first. Which is also why I suggested splitting the
patchset up in multiple patches. That way it's easier to get everyone
aligned on PARTIAL_AGGREGATE behaviour for non-internal transtypes,
which would already be a huge improvement over the current situation
in my opinion.

> After those tasks, I plan to compare your proposal with mine seriously, with 
> additional POC patches if necessary.

Sounds great! To be clear, I'm not sure which proposal is best. I
mainly thought mine seemed interesting because it doesn't require
additional columns. But maybe the benefits that the extra columns in
your proposal brings are worth adding those extra columns.

> I see. It seems that adding new native types might make it easier to transmit
> the state values between local and remote servers that have different major
> versions.
> However, in my opinion, we should be careful about supporting the case in
> which local and remote have different major versions, because the transtype
> of an aggregate function may change in a future major version due to
> implementation details.
> Actually, something like that occurred recently; see
> https://github.com/postgres/postgres/commit/519fc1bd9e9d7b408903e44f55f83f6db30742b7
> I think the transtype of an aggregate function is much more changeable than
> its rettype.
> Consequently, so far, I want to support only the case in which local and
> remote have the same major version.
> Resolving that limitation seems to require considerably more code.

Hmm, that's a very good point. Indeed cross-major-version partial
aggregates pushdown would not be fully solved with this yet.

> And I'm afraid that requiring typinput/typoutput burdens developers:
> they would have to create new native types in addition to their new
> aggregate functions.
> I wonder if this concern might outweigh the benefits for debugging.
> And if they skip send/receive, they can use only the text representation to
> transmit the state value, which I find restrictive.

I kinda agree with this argument. But really this same argument
applies just as well for regular CREATE TYPE. Developers are forced to
implement typinput/typoutput, even though send/receive might really be
enough for their usecase. So in a sense with your proposal, you give
transtypes a special status over regular types: i.e. transtypes are
the only types where only send/receive is necessary.

So that leaves me two questions:
1. Maybe CREATE TYPE should allow types without input/output functions
as long as send/receive are defined. For these types text
representation could fall back to the hex representation of bytea.
2. If for some reason 1 is undesired, then why are transtypes so
special. Why is it fine for them to only have send/receive functions
and not for other types?
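Question 1 above can be illustrated with a small sketch. This is hypothetical
Python (not PostgreSQL source, which is C), showing what a bytea-style hex
fallback for the text representation could look like for a type that defines
only send/receive; the function names are invented for illustration.

```python
# Hypothetical sketch: if CREATE TYPE accepted types that define only
# send/receive, their text representation could fall back to the
# bytea-style "\x..." hex encoding of the binary (send) form.

def fallback_text_output(state_bytes: bytes) -> str:
    """Render a binary (send-format) state value as bytea-style hex text."""
    return "\\x" + state_bytes.hex()

def fallback_text_input(text: str) -> bytes:
    """Parse the hex fallback text back into the binary (receive) form."""
    if not text.startswith("\\x"):
        raise ValueError("expected bytea-style hex input")
    return bytes.fromhex(text[2:])
```

The round trip is lossless by construction, which is what would make such a
fallback usable for transmitting state values even without bespoke text I/O.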




RE: Partial aggregates pushdown

2024-06-24 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Jelte, hackers.

Thank you for explanations.

Actually, I have other tasks about the "PARTIAL_AGGREGATE" keyword
to respond Requirement1 and Requirement2 in the following mail.
https://www.postgresql.org/message-id/TYAPR01MB3088755F2281D41F5EEF06D495F92%40TYAPR01MB3088.jpnprd01.prod.outlook.com

After those tasks, I plan to compare your proposal with mine seriously, with 
additional POC patches if necessary.

Your proposal might be the more fundamental solution.
However, to be honest, I do not yet fully understand the benefits and impacts
of replacing internal types with native types, especially regarding the
memory-context handling of the existing deserialization functions and the
amount of code that would need to be modified or added.
The following are my answers based on the knowledge I have right now.

> From: Jelte Fennema-Nio 
> Sent: Tuesday, June 25, 2024 5:49 AM
> > However, I still have the following two questions.
> >
> > 1. Unnecessary components of new native types. Refer to pg_type.dat:
> > typinput and typoutput are required.
> > I think that in your proposal they are unnecessary, and therefore
> > wasted. I think that is not acceptable.
> > How can I resolve the problem?
> 
> I think requiring typinput/typoutput is a benefit personally, because that 
> makes it possible to do PARTIAL_AGGREGATE
> pushdown to a different PG major version. Also it makes it easier to debug 
> the partial aggregate values when using
> psql/pgregress. So yes, it requires implementing both binary (send/receive) 
> and text (input/output) serialization, but it also
> has some benefits. And in theory you might be able to skip implementing the 
> binary serialization, and rely purely on the text
> serialization to send partial aggregates between servers.
I see. It seems that adding new native types might make it easier to transmit
the state values between local and remote servers that have different major
versions.
However, in my opinion, we should be careful about supporting the case in
which local and remote have different major versions, because the transtype
of an aggregate function may change in a future major version due to
implementation details.
Actually, something like that occurred recently; see
https://github.com/postgres/postgres/commit/519fc1bd9e9d7b408903e44f55f83f6db30742b7
I think the transtype of an aggregate function is much more changeable than
its rettype.
Consequently, so far, I want to support only the case in which local and
remote have the same major version.
Resolving that limitation seems to require considerably more code.

And I'm afraid that requiring typinput/typoutput burdens developers:
they would have to create new native types in addition to their new
aggregate functions.
I wonder if this concern might outweigh the benefits for debugging.
And if they skip send/receive, they can use only the text representation to
transmit the state value, which I find restrictive.

> > 2. Many new native types
> > I think that, basically, each aggregate function does need a new native 
> > type.
> > For example,
> > avg(int8), avg(numeric), and var_pop(int4) have the same transtype, 
> > PolyNumAggState.
> > You said that it is enough to add only one native type like
> > state_poly_num_agg for supporting them, right?
> 
> Yes, correct. That's what I had in mind.
> 
> > But the combine functions of them might have quite different
> > expectation on the data items of PolyNumAggState like the range of
> > N(means count) and the true/false of calcSumX2 (the flag of
> > calculating sum of squares).
> > The final functions of them have the similar expectation.
> > So, I think that, in response to your proposal, each of them needs a
> > native data type like state_int8_avg or state_numeric_avg, for safety.
> >
> > And, we do need a native type for an aggregate function whose
> > transtype is not internal and not pseudo.
> > For avg(int4), the current transtype is _int8.
> > However, I do need a validation check on the length of the array and
> > the positiveness of the count (the first element of the array).
> > In response to your proposal,
> > I would need a new native type like state_int4_avg.
> 
> To help avoid confusion let me try to restate what I think you mean
> here: You're worried about someone passing in a bogus native type into the 
> final/combine functions and then getting
> crashes and/or wrong results. With internal type people cannot do this 
> because they cannot manually call the
> combinefunc/finalfunc because the argument type is internal. To solve this 
> problem your suggestion is to make the type
> specific to the specific aggregate such that send/receive or input/output can 
> validate the input as reasonable. But this
> would then mean that we have many native types (and also many 
> deserialize/serialize functions).
Yes, right.

> Assuming that's indeed what you meant, that's an interesting thought, I 
> didn't consider that much indeed. My thinking was
> that we only

Re: Partial aggregates pushdown

2024-06-24 Thread Jelte Fennema-Nio
On Mon, 24 Jun 2024 at 15:03, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I see. I think I now understand your proposal.
> Referring to your proposal, for avg(int8),
> I create a new native type like state_int8_avg
> with new typsend/typreceive functions
> and use them to transmit the state value, right?

Yes, that's roughly what I had in mind indeed.

> That might be a more fundamental solution,
> because it avoids the export/import functions of my proposal,
> which are new components of an aggregate function.
> I had not considered that solution.
> I appreciate your proposal.

Thank you :)

> However, I still have the following two questions.
>
> 1. Unnecessary components of new native types
> Refer to pg_type.dat: typinput and typoutput are required.
> I think that in your proposal they are unnecessary,
> and therefore wasted. I think that is not acceptable.
> How can I resolve the problem?

I think requiring typinput/typoutput is a benefit personally, because
that makes it possible to do PARTIAL_AGGREGATE pushdown to a different
PG major version. Also it makes it easier to debug the partial
aggregate values when using psql/pgregress. So yes, it requires
implementing both binary (send/receive) and text (input/output)
serialization, but it also has some benefits. And in theory you might
be able to skip implementing the binary serialization, and rely purely
on the text serialization to send partial aggregates between servers.

> 2. Many new native types
> I think that, basically, each aggregate function does need a new native type.
> For example,
> avg(int8), avg(numeric), and var_pop(int4) have the same transtype, 
> PolyNumAggState.
> You said that it is enough to add only one native type like state_poly_num_agg
> for supporting them, right?

Yes, correct. That's what I had in mind.

> But the combine functions of them might have quite different expectation
> on the data items of PolyNumAggState like
> the range of N(means count) and the true/false of calcSumX2
> (the flag of calculating sum of squares).
> The final functions of them have the similar expectation.
> So, I think that, in response to your proposal,
> each of them needs a native data type
> like state_int8_avg or state_numeric_avg, for safety.
>
> And, we do need a native type for an aggregate function
> whose transtype is not internal and not pseudo.
> For avg(int4), the current transtype is _int8.
> However, I do need a validation check on the length of the array
> and the positiveness of the count (the first element of the array).
> In response to your proposal,
> I would need a new native type like state_int4_avg.

To help avoid confusion let me try to restate what I think you mean
here: You're worried about someone passing in a bogus native type into
the final/combine functions and then getting crashes and/or wrong
results. With internal type people cannot do this because they cannot
manually call the combinefunc/finalfunc because the argument type is
internal. To solve this problem your suggestion is to make the type
specific to the specific aggregate such that send/receive or
input/output can validate the input as reasonable. But this would then
mean that we have many native types (and also many
deserialize/serialize functions).

Assuming that's indeed what you meant, that's an interesting thought,
I didn't consider that much indeed. My thinking was that we only need
to implement send/receive & input/output functions for these types,
and even though their meaning is very different we can serialize them
in the same way.

As you say though, something like that is already true for avg(int4)
today. The way avg(int4) handles this issue is by doing some input
validation for every call to its trans/final/combinefunc (see
int4_avg_accum, int4_avg_combine, and int8_avg). It checks the length
of the array there, but it doesn't check the positiveness of the
count. I think that makes sense. IMHO these functions only need to
protect against crashes (e.g. null pointer dereferences). But I don't
think there is a good reason for them to protect the user against
passing in weird data. These functions aren't really meant to be
called manually in the first place anyway, so if the user does that
and they pass in weird data then I'm fine with them getting a weird
result back, even errors are fine (only crashes are not).

So as long as our input/output & send/receive functions for
state_poly_num_agg handle all the inconsistencies that could cause
crashes later on (which I think is pretty simple to do for
PolyNumAggState), then I don't think we need state_int8_avg,
state_numeric_avg, etc.
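The crash-safety-only validation policy described above can be sketched as
follows. This is a hypothetical Python model (PostgreSQL's real functions are
C) of a text-input function for a PolyNumAggState-like value, assuming an
invented "(N,sumX,sumX2)" text format: it rejects only structurally impossible
input, while well-formed but "weird" values, such as a negative N, pass through.

```python
def state_poly_num_input(text: str) -> dict:
    """Hypothetical typinput for a PolyNumAggState-like value.

    Only structural problems are rejected (they could crash later code);
    semantically odd but well-formed values are deliberately accepted,
    matching the policy that these functions prevent crashes, not
    weird results.
    """
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("malformed state value")
    fields = text[1:-1].split(",")
    if len(fields) != 3:  # wrong arity could crash a later finalfn
        raise ValueError("state value must have exactly 3 fields")
    n, sum_x, sum_x2 = (int(f) for f in fields)
    return {"N": n, "sumX": sum_x, "sumX2": sum_x2}
```

A caller passing in a bogus-but-parsable state simply gets a weird aggregate
result back, which is the outcome argued for above.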

> > Not a full review but some initial notes:
> Thank you. I don't have time today, so I'll answer after tomorrow.

Sure, no rush.




RE: Partial aggregates pushdown

2024-06-24 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi. Jelte, hackers.

Thank you for your proposal and comments.

> From: Jelte Fennema-Nio 
> Sent: Monday, June 24, 2024 6:09 PM
> > 1. Generality
> > I believe we should develop a method that can theoretically apply to any
> aggregate function, even if we cannot implement it immediately. However, I 
> find
> it exceptionally challenging to demonstrate that any internal transtype can be
> universally converted to a native data type for aggregate functions that are
> parallel-safe under the current parallel query feature. Specifically, proving 
> this
> for user-defined aggregate functions presents a significant difficulty in my
> view.
> > On the other hand, I think that the usage of export and import functions can
> theoretically apply to any aggregate functions.
> 
> The only thing required when doing CREATE TYPE is having an INPUT and
> OUTPUT function for the type, which (de)serialize the type to text format. As
> far as I can tell by definition that requirement should be fine for any 
> aggregates
> that we can do partial aggregation pushdown for. To clarify I'm not suggesting
> we should change any of the internal representation of the type for the 
> current
> internal aggregates. I'm suggesting we create new native types (i.e. do CREATE
> TYPE) for those internal representations and then use the name of that type
> instead of internal.
I see. I think I now understand your proposal.
Referring to your proposal, for avg(int8),
I create a new native type like state_int8_avg
with new typsend/typreceive functions
and use them to transmit the state value, right?

That might be a more fundamental solution,
because it avoids the export/import functions of my proposal,
which are new components of an aggregate function.
I had not considered that solution.
I appreciate your proposal.

However, I still have the following two questions.

1. Unnecessary components of new native types
Refer to pg_type.dat: typinput and typoutput are required.
I think that in your proposal they are unnecessary,
and therefore wasted. I think that is not acceptable.
How can I resolve the problem?

2. Many new native types
I think that, basically, each aggregate function does need a new native type.
For example,
avg(int8), avg(numeric), and var_pop(int4) have the same transtype, 
PolyNumAggState.
You said that it is enough to add only one native type like state_poly_num_agg
for supporting them, right?

But the combine functions of them might have quite different expectation
on the data items of PolyNumAggState like
the range of N(means count) and the true/false of calcSumX2
(the flag of calculating sum of squares).
The final functions of them have the similar expectation.
So, I think that, in response to your proposal,
each of them needs a native data type
like state_int8_avg or state_numeric_avg, for safety.

And, we do need a native type for an aggregate function
whose transtype is not internal and not pseudo.
For avg(int4), the current transtype is _int8.
However, I do need a validation check on the length of the array
and the positiveness of the count (the first element of the array).
In response to your proposal,
I would need a new native type like state_int4_avg.

Consequently, I think that, in response to your proposal, ultimately
each aggregate function would need a new native type
with typinput and typoutput.
That seems to need the same amount of code and more catalog data,
right?
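For comparison, the avg(int4) behaviour discussed in this thread can be
modelled roughly as follows. This is a simplified Python sketch of the C
functions int4_avg_combine and int8_avg, not their actual code: the transvalue
is a two-element {count, sum} array, and only the array's length is validated,
not the sign of the count.

```python
def int4_avg_combine(state1, state2):
    """Model of int4_avg_combine: merge two {count, sum} partial states."""
    for s in (state1, state2):
        if len(s) != 2:  # the structural (length) check the C code performs
            raise ValueError("expected a 2-element {count, sum} state")
    return [state1[0] + state2[0], state1[1] + state2[1]]

def int4_avg_final(state):
    """Model of int8_avg: NULL (None) for an empty state, else sum/count."""
    count, total = state
    return None if count == 0 else total / count
```

Combining the partial states [2, 10] and [3, 20] yields [5, 30], from which
the final function produces 30 / 5 = 6.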

> > 2. Amount of codes.
> > It could need more codes.
> 
> I think it would be about the same as your proposal. Instead of serializing 
> to an
> intermediary existing type you serialize to string straight away. I think it 
> would
> probably be slightly less code actually, since you don't have to add code to
> handle the new aggpartialexportfn and aggpartialimportfn columns.
> 
> > 3. Concern about performance
> > I'm concerned that altering the current internal data types could impact
> performance.
> 
> As explained above in my proposal all the aggregation code would remain
> unchanged, only new native types will be added. Thus performance won't be
> affected, because all aggregation code will be the same. The only thing that's
> changed is that the internal type now has a name and an INPUT and OUTPUT
> function.
I got it. Thank you.

> Not a full review but some initial notes:
Thank you. I don't have time today, so I'll answer after tomorrow.

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2024-06-24 Thread Jelte Fennema-Nio
On Sun, 23 Jun 2024 at 10:24, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I attached the POC patch (the second one), which supports avg(int8), whose 
> standard format is the _numeric type.

Okay, very interesting. So instead of defining the
serialization/deserialization functions to text/binary, you serialize
the internal type to an existing non-internal type, which then in turn
gets serialized to text. In the specific case of avg(int8) this is
done to an array of numeric (with length 2).
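That export/import shape can be sketched in Python, under the simplifying
assumption (for illustration only) that avg(int8)'s internal state is a
{count, sum} pair; the real PolyNumAggState carries more fields. Export
produces the two-element numeric array, modelled here as decimal strings,
and import rebuilds the internal state on the receiving side. The function
names are hypothetical.

```python
def avg_int8_export(state: dict) -> list:
    """Hypothetical export: internal state -> 2-element numeric array."""
    return [str(state["N"]), str(state["sumX"])]

def avg_int8_import(numeric_array: list) -> dict:
    """Hypothetical import: 2-element numeric array -> internal state."""
    if len(numeric_array) != 2:
        raise ValueError("expected a 2-element numeric array")
    return {"N": int(numeric_array[0]), "sumX": int(numeric_array[1])}
```

Because the wire format is an ordinary array of numerics rather than a raw
serialfn blob, it stays readable and independent of the internal struct layout.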

> However, I do not agree with changing the internal transtype to a native
> data type. The reasons are the following three.
> 1. Generality
> I believe we should develop a method that can theoretically apply to any 
> aggregate function, even if we cannot implement it immediately. However, I 
> find it exceptionally challenging to demonstrate that any internal transtype 
> can be universally converted to a native data type for aggregate functions 
> that are parallel-safe under the current parallel query feature. 
> Specifically, proving this for user-defined aggregate functions presents a 
> significant difficulty in my view.
> On the other hand, I think that the usage of export and import functions can 
> theoretically apply to any aggregate functions.

The only thing required when doing CREATE TYPE is having an INPUT and
OUTPUT function for the type, which (de)serialize the type to text
format. As far as I can tell by definition that requirement should be
fine for any aggregates that we can do partial aggregation pushdown
for. To clarify I'm not suggesting we should change any of the
internal representation of the type for the current internal
aggregates. I'm suggesting we create new native types (i.e. do CREATE
TYPE) for those internal representations and then use the name of that
type instead of internal.

> 2. Amount of codes.
> It could need more codes.

I think it would be about the same as your proposal. Instead of
serializing to an intermediary existing type you serialize to string
straight away. I think it would probably be slightly less code
actually, since you don't have to add code to handle the new
aggpartialexportfn and aggpartialimportfn columns.

> 3. Concern about performance
> I'm concerned that altering the current internal data types could impact 
> performance.

As explained above in my proposal all the aggregation code would
remain unchanged, only new native types will be added. Thus
performance won't be affected, because all aggregation code will be
the same. The only thing that's changed is that the internal type now
has a name and an INPUT and OUTPUT function.

> I know. In my proposal, the standard format is not the data serialized by 
> serialfn; instead, it is text or another native data type.
> Just to clarify, I'm writing this to avoid any potential misunderstanding.

Ah alright, that definitely clarifies the proposal. I was looking at
the latest patch file on the thread and that one was still using
serialfn. Your new one indeed doesn't, so this is fine.

> Basically responding to your advice,
> for now, I prepare two POC patches.

Great! I definitely think this makes the review/discussion easier.

> The first supports case a, currently covering only avg(int4) and other 
> aggregate functions that do not require import or export functions, such as 
> min, max, and count.

Not a full review but some initial notes:

1. Why does this patch introduce aggpartialpushdownsafe? I'd have
expected that any type with a non-pseudo/internal type as aggtranstype
would be safe to partially push down.
2. It seems the int4_avg_import function shouldn't be part of this
patch (but maybe of a future one).
3. I think splitting this patch in two pieces would make it even
easier to review: First adding support for the new PARTIAL_AGGREGATE
keyword (adds the new feature). Second, using PARTIAL_AGGREGATE in
postgres_fdw (starts using the new feature). Citus would only need the
first patch not the second one, so I think the PARTIAL_AGGREGATE
feature has merit to be added on its own, even without the
postgres_fdw usage.
4. Related to 3, I think it would be good to have some tests of
PARTIAL_AGGREGATE that don't involve postgres_fdw at all. I also
spotted some comments too that mention FDW, even though they apply to
the "pure" PARTIAL_AGGREGATE code.
5. This comment now seems incorrect:
-* Apply the agg's finalfn if one is provided, else return transValue.
+* If the agg's finalfn is provided and PARTIAL_AGGREGATE keyword is
+* not specified, apply the agg's finalfn.
+* If PARTIAL_AGGREGATE keyword is specified and the transValue type
+* is internal, apply the agg's serialfn. In this case the agg's
+* serialfn must not be invalid. Otherwise return transValue.

6. These errors are not on purpose afaict (if they are a comment in
the test would be good to explain why)

+SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1;
+ERROR:  could not connect to server "loopback"
+DETAIL:  inval

RE: Partial aggregates pushdown

2024-06-23 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Alexander, hackers.

> From: Alexander Pyhalov 
> Sent: Tuesday, May 28, 2024 2:45 PM
> The fix was to set child_agg->agg_partial to orig_agg->agg_partial in
> convert_combining_aggrefs(), it's already in the patch, as well as the 
> example -
> without this fix
I've just realized that you've added the necessary tests. I forgot to respond, 
my apologies.

> From: Alexander Pyhalov 
> Sent: Tuesday, May 28, 2024 2:58 PM
> BTW, I have an issue with test results in the last version of the 
> patch.
> Attaching regression diffs.
> I have partial sum over c_interval instead of sum(c_interval).
I think the difference stems from the commit [1], which added a serialization 
function to sum(interval). I will fix it.

Best regards, Yuki Fujii
--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation

[1] 
https://github.com/postgres/postgres/commit/519fc1bd9e9d7b408903e44f55f83f6db30742b7






Re: Partial aggregates pushdown

2024-06-12 Thread Jelte Fennema-Nio
On Wed, 12 Jun 2024 at 07:27, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> Could you please clarify what you mean?
> Are you referring to:
>   Option 1: Modifying existing aggregate functions to minimize the use of 
> internal state values.
>   Option 2: Not supporting the push down of partial aggregates for functions 
> with internal state values.

Basically I mean both Option 1 and Option 2 together. i.e. once we do
option 1, supporting partial aggregate pushdown for all important
aggregates with internal state values, then supporting pushdown of
internal state values becomes unnecessary.

> There are many aggregate functions with internal state values, so if we go 
> with Option 1,
> we might need to change a lot of existing code, like transition functions and 
> finalize functions.
> Also, I'm not sure how many existing aggregate functions can be modified this 
> way.

There are indeed 57 aggregate functions with internal state values:

> SELECT count(*) from pg_aggregate where aggtranstype = 'internal'::regtype;
 count
───
57
(1 row)

But there are only 26 different aggtransfns. And the most used 8 of
those cover 39 of those 57 aggregates.

> SELECT
sum (count) OVER(ORDER BY count desc, aggtransfn ROWS BETWEEN
unbounded preceding and current row) AS cumulative_count
, *
FROM (
SELECT
count(*),
aggtransfn
from pg_aggregate
where aggtranstype = 'internal'::regtype
group by aggtransfn
order by count(*) desc, aggtransfn
);
 cumulative_count │ count │               aggtransfn
──────────────────┼───────┼────────────────────────────────────────
                7 │     7 │ ordered_set_transition
               13 │     6 │ numeric_accum
               19 │     6 │ int2_accum
               25 │     6 │ int4_accum
               31 │     6 │ int8_accum
               35 │     4 │ ordered_set_transition_multi
               37 │     2 │ int8_avg_accum
               39 │     2 │ numeric_avg_accum
               40 │     1 │ array_agg_transfn
               41 │     1 │ json_agg_transfn
               42 │     1 │ json_object_agg_transfn
               43 │     1 │ jsonb_agg_transfn
               44 │     1 │ jsonb_object_agg_transfn
               45 │     1 │ string_agg_transfn
               46 │     1 │ bytea_string_agg_transfn
               47 │     1 │ array_agg_array_transfn
               48 │     1 │ range_agg_transfn
               49 │     1 │ multirange_agg_transfn
               50 │     1 │ json_agg_strict_transfn
               51 │     1 │ json_object_agg_strict_transfn
               52 │     1 │ json_object_agg_unique_transfn
               53 │     1 │ json_object_agg_unique_strict_transfn
               54 │     1 │ jsonb_agg_strict_transfn
               55 │     1 │ jsonb_object_agg_strict_transfn
               56 │     1 │ jsonb_object_agg_unique_transfn
               57 │     1 │ jsonb_object_agg_unique_strict_transfn
(26 rows)


And actually most of those don't have a serialfn, so they wouldn't be
supported by your suggested approach either. Looking at the
distribution of aggserialfns instead we see the following:

> SELECT
sum (count) OVER(ORDER BY count desc, aggserialfn ROWS BETWEEN
unbounded preceding and current row) AS cumulative_count
, *
FROM (
SELECT
count(*),
aggserialfn
from pg_aggregate
where
aggtranstype = 'internal'::regtype
AND aggserialfn != 0
group by aggserialfn
order by count(*) desc, aggserialfn
);
 cumulative_count │ count │        aggserialfn
──────────────────┼───────┼───────────────────────────
               12 │    12 │ numeric_serialize
               24 │    12 │ numeric_poly_serialize
               26 │     2 │ numeric_avg_serialize
               28 │     2 │ int8_avg_serialize
               30 │     2 │ string_agg_serialize
               31 │     1 │ array_agg_serialize
               32 │     1 │ array_agg_array_serialize
(7 rows)

So there are only 7 aggserialfns, and thus at most 7 new postgres
types that you would need to create to support the same aggregates as
in your current proposal. But looking at the implementations of these
serialize functions even that is an over-estimation: numeric_serialize
and numeric_avg_serialize both serialize a NumericAggState, and
numeric_poly_serialize and int8_avg_serialize both serialize a
PolyNumAggState. So probably we could even make do with only 5 types. And
to be clear: only converting PolyNumAggState and NumericAggState to
actual postgres types would already cover 28 out of the 32 aggregates.
That seems quite feasible to do.

So I agree it's probably more code than your current approach. At the
very least because you would need to implement in/out text
serialization functions for these internal types that currently don't
have them. But I do think it would be quite a feasible amount. And to
clarify, I see a few benefits of using the approach that I'm
proposing:

1. So far aggserialfn and aggdese

RE: Partial aggregates pushdown

2024-06-11 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Jelte,

Thank you for your comment!
> From: Jelte Fennema-Nio 
> Sent: Tuesday, June 11, 2024 11:00 PM
> How about instead of trying to serialize the output of serialfn/deserialfn, 
> instead we don't use the "internal" type and
> create actual types in pg_type for these transtypes? Then we can simply use 
> the in/out and recv/send functions of those
> types to serialize the values of the partial aggregate over the network.
> Instead of having to rely on serialfn/deserialfn to be network-safe (which 
> they probably aren't).
> 
> That indeed still leaves the pseudo types. Since non of those pseudotypes 
> have a working in/recv function (they always
> error by definition), I agree that we can simply not support those.
> 
> Basically that would mean that any aggregate with a non-internal and 
> non-pseudotype as a transtype could be used in this
> multi-node partial aggregate pushdown.
Could you please clarify what you mean?
Are you referring to:
  Option 1: Modifying existing aggregate functions to minimize the use of 
internal state values.
  Option 2: Not supporting the push down of partial aggregates for functions 
with internal state values.
  Option 3: Something other than Option 1 and Option 2.

There are many aggregate functions with internal state values, so if we go with 
Option 1,
we might need to change a lot of existing code, like transition functions and 
finalize functions.
Also, I'm not sure how many existing aggregate functions can be modified this 
way.

There are also many popular functions with internal state values,
like sum(int8) and avg(int8) (see [1]), so I don't think Option 2 would be 
acceptable.

Best regards, Yuki Fujii

--
Yuki Fujii
Information Technology R&D Center, Mitsubishi Electric Corporation

[1] 
https://github.com/postgres/postgres/blob/REL_16_STABLE/src/include/catalog/pg_aggregate.dat


Re: Partial aggregates pushdown

2024-06-11 Thread Jelte Fennema-Nio
On Tue, 11 Jun 2024 at 13:40, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> > From: Bruce Momjian 
> > So, we need import/export text representation for the partial aggregate 
> > mode for these eight, and call the base data type
> > text import/export functions for the zero ones when in this mode?
>
> I think that you are basically right.
> But, I think, in a perfect world we should also add an import/export function 
> for the following
> two category.
>
>   Category 1. A validation check is needed for safety.
> For example, I think a validation check is needed for avg(float4),
>  whose transition type is not internal. (See p.18 in [1])
> I plan to add import functions of avg, count (See p.18, p.19 in [1]).
>   Category 2. The transition type is a pseudo data type.
> Aggregate functions of this category need to accept many actual data
> types, including user-defined types. So I think that it is hard to
> implement import/export functions.
> Consequently, I do not plan to support this category. (See p.19 in [1])

How about instead of trying to serialize the output of
serialfn/deserialfn, instead we don't use the "internal" type and
create actual types in pg_type for these transtypes? Then we can
simply use the in/out and recv/send functions of those types to
serialize the values of the partial aggregate over the network.
Instead of having to rely on serialfn/deserialfn to be network-safe
(which they probably aren't).

That indeed still leaves the pseudo types. Since none of those
pseudotypes have a working in/recv function (they always error by
definition), I agree that we can simply not support those.

Basically that would mean that any aggregate with a non-internal and
non-pseudotype as a transtype could be used in this multi-node partial
aggregate pushdown.
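To illustrate the mechanics under discussion - only the transition state, never the raw rows, crosses the wire, and the finalfn runs once on the coordinator - here is a minimal standalone sketch for avg over bigint. The struct and function names are invented for illustration; this is not PostgreSQL's actual implementation (the real int8 avg state uses numeric to avoid overflow).

```c
/* Toy transition state for avg(bigint): a running sum and row count. */
typedef struct AvgState
{
    long long   sum;
    long long   count;
} AvgState;

/* sfunc: fold one row into the local (per-shard) state. */
static void
avg_accum(AvgState *st, long long val)
{
    st->sum += val;
    st->count += 1;
}

/* combinefunc: merge a partial state received from a remote node. */
static void
avg_combine(AvgState *st, const AvgState *remote)
{
    st->sum += remote->sum;
    st->count += remote->count;
}

/* finalfunc: applied exactly once, on the coordinating server. */
static double
avg_final(const AvgState *st)
{
    return st->count ? (double) st->sum / (double) st->count : 0.0;
}
```

Each shard runs avg_accum over its own rows and ships only its (sum, count) pair; the coordinator runs avg_combine once per shard and avg_final once, so the final division stays local.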




RE: Partial aggregates pushdown

2024-06-11 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi, Bruce.

Sorry for the late reply. Thank you for the comment.

> From: Bruce Momjian 
> Sent: Wednesday, June 5, 2024 11:04 PM
> > > *  Passes unsafe binary data from the foreign server.
> > >
> > > Can someone show me where that last item is in the patch, and why
> > > can't we just pass back values cast to text?
> >
> > In finalize_aggregate() when we see partial aggregate with
> > peragg->aggref->aggtranstype = INTERNALOID
> > we call aggregate's serialization function and return it as bytea.
> >
> > The issue is that this internal representation isn't guaranteed to be
> > compatible between servers of different versions (or architectures?).
> > So, likely, we instead should have called some export function for
> > aggregate and later - some import function on postgres_fdw side. It
> > doesn't matter much, what this export function generates - text, json
> > or some portable binary format,
> > 1) export/import functions should just "understand" it,
> > 2) it should be a stable representation.
> 
> Okay, so looking at the serialization output functions already defined, I see 
> many zeros, which I assume means just the base
> data type, and eight
> more:
> 
>   SELECT DISTINCT aggserialfn from pg_aggregate WHERE aggserialfn::oid != 
> 0;
>   aggserialfn
>   ---
>numeric_avg_serialize
>string_agg_serialize
>array_agg_array_serialize
>numeric_serialize
>int8_avg_serialize
>array_agg_serialize
>interval_avg_serialize
>numeric_poly_serialize
> 
> I realize we need to return the sum and count for average, so that makes 
> sense.
> 
> So, we need import/export text representation for the partial aggregate mode 
> for these eight, and call the base data type
> text import/export functions for the zero ones when in this mode?

I think that you are basically right.
But, I think, in a perfect world we should also add an import/export function 
for the following
two category.

  Category 1. A validation check is needed for safety.
For example, I think a validation check is needed for avg(float4),
 whose transition type is not internal. (See p.18 in [1])
I plan to add import functions of avg, count (See p.18, p.19 in [1]).
  Category 2. The transition type is a pseudo data type.
Aggregate functions of this category need to accept many actual data
types, including user-defined types. So I think that it is hard to
implement import/export functions.
Consequently, I do not plan to support this category. (See p.19 in [1])

Sincerely yours,
Yuki Fujii

--
Yuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation

[1] 
https://www.postgresql.org/message-id/attachment/160659/PGConfDev2024_Presentation_Aggregation_Scaleout_FDW_Sharding_20240531.pdf




Re: Partial aggregates pushdown

2024-06-05 Thread Bruce Momjian
On Wed, Jun  5, 2024 at 08:19:04AM +0300, Alexander Pyhalov wrote:
> > *  Passes unsafe binary data from the foreign server.
> > 
> > Can someone show me where that last item is in the patch, and why can't
> > we just pass back values cast to text?
> 
> In finalize_aggregate() when we see partial aggregate with
> peragg->aggref->aggtranstype = INTERNALOID
> we call aggregate's serialization function and return it as bytea.
> 
> The issue is that this internal representation isn't guaranteed to be
> compatible between servers
> of different versions (or architectures?). So, likely, we instead should
> have called some export function for aggregate
> and later - some import function on postgres_fdw side. It doesn't matter
> much, what this export function
> generates - text, json or some portable binary format,
> 1) export/import functions should just "understand" it,
> 2) it should be a stable representation.

Okay, so looking at the serialization output functions already defined, I
see many zeros, which I assume means just the base data type, and eight
more:

SELECT DISTINCT aggserialfn from pg_aggregate WHERE aggserialfn::oid != 
0;
aggserialfn
---
 numeric_avg_serialize
 string_agg_serialize
 array_agg_array_serialize
 numeric_serialize
 int8_avg_serialize
 array_agg_serialize
 interval_avg_serialize
 numeric_poly_serialize

I realize we need to return the sum and count for average, so that makes
sense.

So, we need import/export text representation for the partial aggregate
mode for these eight, and call the base data type text import/export
functions for the zero ones when in this mode?
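As a toy illustration of what such a text-format export/import pair could look like for a (sum, count)-style state - the names and the "(sum,count)" format are invented here, and a real implementation would need numeric-aware formatting plus stricter validation:

```c
#include <stdio.h>
#include <string.h>

typedef struct AvgState
{
    long long   sum;
    long long   count;
} AvgState;

/* Export the state as text: architecture- and endianness-independent. */
static void
avg_export(const AvgState *st, char *buf, size_t len)
{
    snprintf(buf, len, "(%lld,%lld)", st->sum, st->count);
}

/* Import with validation: reject input that does not parse completely,
 * since the receiving server cannot trust the sender's bytes. */
static int
avg_import(const char *buf, AvgState *st)
{
    int consumed = 0;

    if (sscanf(buf, "(%lld,%lld)%n", &st->sum, &st->count, &consumed) != 2 ||
        buf[consumed] != '\0')
        return 0;               /* malformed */
    return 1;
}
```

The round trip is stable across machines because no raw memory layout is shipped, which is the property the serialfn/deserialfn bytea format does not guarantee.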

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-06-04 Thread Alexander Pyhalov

Bruce Momjian wrote on 2024-06-04 20:12:
On Mon, May 27, 2024 at 09:30:59PM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:

Hi Mr. Pyhalov.

Sorry for the late reply.
Thank you for your modification and detailed review.
I attach a fixed patch; it has not yet been rebased.


I know this patch was discussed at the Vancouver conference.  What are
the open issues?  I know of several:

*  add tests that were requested by Fujii-san and now posted by
   Alexander Pyhalov
*  Where is the documentation?  I know the original patch had some, and
   I improved it, but it seems to be missing.
*  Passes unsafe binary data from the foreign server.

Can someone show me where that last item is in the patch, and why can't
we just pass back values cast to text?


Hi.

In finalize_aggregate(), when we see a partial aggregate with
peragg->aggref->aggtranstype = INTERNALOID,
we call the aggregate's serialization function and return it as bytea.

The issue is that this internal representation isn't guaranteed to be
compatible between servers of different versions (or architectures?).
So, likely, we instead should have called some export function for the
aggregate and later some import function on the postgres_fdw side. It
doesn't matter much what this export function generates - text, JSON or
some portable binary format - as long as:
1) export/import functions just "understand" it, and
2) it is a stable representation.
--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2024-06-04 Thread Bruce Momjian
On Wed, Jun  5, 2024 at 12:14:45AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> I will add sufficient document and comments to the next patch as Bruce and 
> Robert said.

Great, I am available to help improve the documentation.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-06-04 Thread Bruce Momjian
On Mon, May 27, 2024 at 09:30:59PM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr. Pyhalov.
> 
> Sorry for the late reply.
> Thank you for your modification and detailed review.
> I attach a fixed patch; it has not yet been rebased.

I know this patch was discussed at the Vancouver conference.  What are
the open issues?  I know of several:

*  add tests that were requested by Fujii-san and now posted by
   Alexander Pyhalov
*  Where is the documentation?  I know the original patch had some, and
   I improved it, but it seems to be missing.
*  Passes unsafe binary data from the foreign server.

Can someone show me where that last item is in the patch, and why can't
we just pass back values cast to text?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-05-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2024-05-28 00:30:

Hi Mr. Pyhalov.

Sorry for the late reply.
Thank you for your modification and detailed review.
I attach a fixed patch; it has not yet been rebased.

Monday, 25 March 2024 16:01 Alexander Pyhalov 
:.

Comment in nodeAgg.c seems to be strange:

1079 /*
1080 * If the agg's finalfn is provided and PARTIAL_AGGREGATE
keyword is
1081 * not specified, apply the agg's finalfn.
1082 * If PARTIAL_AGGREGATE keyword is specified and the
transValue type
1083 * is internal, apply the agg's serialfn. In this case, if
the agg's
1084 * serialfn must not be invalid. Otherwise return
transValue.
1085 */

Likely, you mean:

... In this case the agg's serialfn must not be invalid...

Fixed.


Lower, in the same file, please, correct error message:

1136 if(!OidIsValid(peragg->serialfn_oid))
1137  elog(ERROR, "serialfunc is note provided
for partial aggregate");

it should be "serialfunc is not provided for partial aggregate"

Fixed.


Also something is wrong with the following test :

SELECT /* aggregate <> partial aggregate */
 array_agg(c_int4array), array_agg(b),
 avg(b::int2), avg(b::int4), avg(b::int8), avg(c_interval),
 avg(b::float4), avg(b::float8),
 corr(b::float8, (b * b)::float8),
 covar_pop(b::float8, (b * b)::float8),
 covar_samp(b::float8, (b * b)::float8),
 regr_avgx((2 * b)::float8, b::float8),
.

Its results have changed since last patch. Do they depend on daylight
saving time?

You are right. In my environment, TimeZone is set to 'PST8PDT',
with which timetz values depend on daylight saving time.
I changed TimeZone to 'UTC' in this test.

You can see that the filter is applied before the append. The result is
correct only by chance, as the sum in every partition is actually < 700.
If you lower this bound, let's say, to 200, you'll start getting wrong
results as data is filtered prior to aggregation.

It seems, however, that in the partial case you should just avoid pulling
conditions from the having qual at all; all filters will be applied at the
upper level. Something like
level. Something like

Thank you for your modification.

Found one more problem. You can fire a partial aggregate over a
partitioned table, but convert_combining_aggrefs() will make a
non-partial copy, which leads to a
'variable not found in subplan target list' error.

Thanks for the correction as well.
As you pointed out,
the original patch certainly had the potential to cause problems.
However, I could not actually reproduce the problem in cases such as 
the following.


  Settings:
t(c1, c2) is a partitioned table whose partition key is c1.
t1, t2 are partitions of t and are partitioned tables.
t11, t12: partitions of t1 and foreign tables of postgres_fdw.
t21, t22: partitions of t2 and foreign tables of postgres_fdw.
  Query:
select c2 / 2, sum(c1) from t group by c2 / 2 order by 1

If you have a reproducible example, I would like to add it to
the regression test.
Do you have a reproducible example?


Also denied partial aggregates pushdown on server version mismatch.
Should check_partial_aggregate_support be 'true' by default?

Could we discuss this point after we determine how to transfer state
values? Once we determine this point, we can easily determine whether
check_partial_aggregate_support should be 'true' by default.



I'm not sure what to do with the current grammar - it precludes partial
distinct aggregates. I understand that it's currently impossible to have
partial aggregation for distinct aggregates - but is it worth having
such a restriction at the grammar level?

If partial aggregation for distinct aggregates becomes possible in the
future, I see no problem with the policy of accepting new SQL keywords,
such as "PARTIAL_AGGREGATE DISTINCT".


BTW, I have an issue with test results in the last version of
the patch. Attaching regression diffs.

I have partial sum over c_interval instead of sum(c_interval).


--
Best regards,
Alexander Pyhalov,
Postgres Professional

diff -U3 /home/leoric/srcs/shardman-backup/shardman/contrib/postgres_fdw/expected/postgres_fdw.out /home/leoric/srcs/shardman-backup/shardman/contrib/postgres_fdw/results/postgres_fdw.out
--- /home/leoric/srcs/shardman-backup/shardman/contrib/postgres_fdw/expected/postgres_fdw.out	2024-05-28 08:48:27.236520098 +0300
+++ /home/leoric/srcs/shardman-backup/shardman/contrib/postgres_fdw/results/postgres_fdw.out	2024-05-28 08:51:24.395704846 +0300
@@ -10438,15 +10438,15 @@
  ->  Foreign Scan
Output: (PARTIAL array_agg(pagg_tab.c_int4array)), (PARTIAL array_agg(pagg_tab.b)), (PARTIAL avg((pagg_tab.b)::smallint)), (PARTIAL avg(pagg_tab.b)), (PARTIAL avg((pagg_tab.b)::bigint)), (PARTIAL avg(pagg_tab.c_interval)), (PARTIAL avg((pagg_tab.b)::real)), (PARTIAL avg((pagg_tab.b)::double precision)), (PARTIAL avg((pagg_tab.b)::numeric)), (PARTIAL corr((pagg_tab.b)::double precision, ((pagg_tab.b * pagg_tab.b))::double precision)), (PARTIAL covar_pop((pagg_tab.b)::double precision, ((pagg_tab.b * pagg_tab.b))::double 

Re: Partial aggregates pushdown

2024-05-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2024-05-28 00:30:

Hi Mr. Pyhalov.


Hi.

Found one more problem. You can fire a partial aggregate over a
partitioned table, but convert_combining_aggrefs() will make a
non-partial copy, which leads to a
'variable not found in subplan target list' error.

Thanks for the correction as well.
As you pointed out,
the original patch certainly had the potential to cause problems.
However, I could not actually reproduce the problem in cases such as 
the following.


  Settings:
t(c1, c2) is a partitioned table whose partition key is c1.
t1, t2 are partitions of t and are partitioned tables.
t11, t12: partitions of t1 and foreign tables of postgres_fdw.
t21, t22: partitions of t2 and foreign tables of postgres_fdw.
  Query:
select c2 / 2, sum(c1) from t group by c2 / 2 order by 1

If you have a reproducible example, I would like to add it to
the regression test.
Do you have a reproducible example?



The fix was to set child_agg->agg_partial to orig_agg->agg_partial in
convert_combining_aggrefs(); it's already in the patch, as well as the
example. Without this fix,

-- Check partial aggregate over partitioned table
EXPLAIN (VERBOSE, COSTS OFF)
SELECT avg(PARTIAL_AGGREGATE a), avg(a) FROM pagg_tab;

fails with

ERROR:  variable not found in subplan target list
--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2024-03-25 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2024-03-16 05:28:

Hi. Mr.Pyhalov.

>>
>> I don't see it that way. What we would push to the foreign server
>> would be something like SELECT count(a) FROM t. Then, after we get
>> the results back and combine the various partial counts locally, we
>> would locally evaluate the HAVING clause afterward. That is, partial
>> aggregation is a barrier to pushing down HAVING clause itself, but it
>> doesn't preclude pushing down the aggregation.
> I have made modifications in the attached patch to ensure that when
> the HAVING clause is present, the HAVING clause is executed locally
> while the partial aggregations are pushed down.
>
>

Sorry, I don't see how it works. When we have partial aggregates and 
having clause, foreign_grouping_ok() returns false and

add_foreign_grouping_paths() adds no paths.
I'm not saying it's necessary to fix this in the first patch version.

Our sincere apologies. I had attached an older version before this
modification.




Hi.

In foreign_grouping_ok() having qual is added to local conds here:

6635 if (is_foreign_expr(root, grouped_rel, 
expr) && !partial)
6636 fpinfo->remote_conds = 
lappend(fpinfo->remote_conds, rinfo);

6637 else
6638 fpinfo->local_conds = 
lappend(fpinfo->local_conds, rinfo);

6639 }
6640 }


This is incorrect. If you look at the plan for the query in postgres_fdw.sql


-- Partial aggregates are safe to push down when there is a HAVING 
clause

EXPLAIN (VERBOSE, COSTS OFF)
SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING 
sum(a) < 700 ORDER BY 1;
 
QUERY PLAN

-
 Finalize GroupAggregate
   Output: pagg_tab.b, avg(pagg_tab.a), max(pagg_tab.a), count(*)
   Group Key: pagg_tab.b
   Filter: (sum(pagg_tab.a) < 700)
   ->  Sort
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL 
max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))

 Sort Key: pagg_tab.b
 ->  Append
   ->  Foreign Scan
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), 
(PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))

 Filter: ((sum(pagg_tab.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p1 
pagg_tab)
 Remote SQL: SELECT b, avg(PARTIAL_AGGREGATE a), 
max(a), count(*), sum(a), sum(a) FROM public.pagg_tab_p1 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_1.b, (PARTIAL avg(pagg_tab_1.a)), 
(PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_1.a))

 Filter: ((sum(pagg_tab_1.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p2 
pagg_tab_1)
 Remote SQL: SELECT b, avg(PARTIAL_AGGREGATE a), 
max(a), count(*), sum(a), sum(a) FROM public.pagg_tab_p2 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_2.b, (PARTIAL avg(pagg_tab_2.a)), 
(PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_2.a))

 Filter: ((sum(pagg_tab_2.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p3 
pagg_tab_2)
 Remote SQL: SELECT b, avg(PARTIAL_AGGREGATE a), 
max(a), count(*), sum(a), sum(a) FROM public.pagg_tab_p3 GROUP BY 1



You can see that the filter is applied before the append. The result is
correct only by chance, as the sum in every partition is actually < 700.
If you lower this bound, let's say, to 200, you'll start getting wrong
results as data is filtered prior to aggregation.


It seems, however, that in the partial case you should just avoid pulling
conditions from the having qual at all; all filters will be applied at the
upper level. Something like


diff --git a/contrib/postgres_fdw/postgres_fdw.c 
b/contrib/postgres_fdw/postgres_fdw.c

index 42eb17ae7c0..54918b9f1a4 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -6610,7 +6610,7 @@ foreign_grouping_ok(PlannerInfo *root, RelOptInfo 
*grouped_rel,
 * Classify the pushable and non-pushable HAVING clauses and 
save them in

 * remote_conds and local_conds of the grouped rel's fpinfo.
 */
-   if (extra->havingQual)
+   if (extra->havingQual && !partial)
{
foreach(lc, (List *) extra->havingQual)
{
@@ -6632,7 +6632,7 @@ foreign_grouping_ok(PlannerInfo *root, RelOptInfo 
*grouped_rel,
 
 grouped_rel->relids,
 
 NULL,
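The wrong-results hazard described above (HAVING conditions pushed into per-partition Foreign Scan filters) can be reproduced outside the planner with a toy model. The code below is illustrative C, not PostgreSQL's: it compares filtering each partition's partial sum against the HAVING bound before combining versus only after.

```c
/*
 * Toy model: per-partition partial sums for one group, with a HAVING
 * bound "sum < bound".  filter_early mimics the incorrect plan that
 * applies the bound to each partition's partial sum before the Append.
 */
static long long
combined_sum(const long long *partial, int n, long long bound,
             int filter_early)
{
    long long   total = 0;

    for (int i = 0; i < n; i++)
    {
        /* Wrong: per-partition "sum < bound" silently drops partitions. */
        if (filter_early && !(partial[i] < bound))
            continue;
        total += partial[i];
    }
    return total;
}
```

With partition sums 250, 60 and 80 and HAVING sum < 200, the correct combined sum is 390 (so the group fails HAVING), but early filtering drops the 250 partition and yields 140, which then wrongly passes HAVING.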

Re: Partial aggregates pushdown

2024-03-21 Thread Bruce Momjian
On Thu, Mar 21, 2024 at 11:37:50AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi. Mr.Momjian, Mr.Lane, Mr.Haas, hackers.
> 
> I apologize for any misunderstanding regarding the context of the attached 
> patch and
> the points on which I requested a review. Could you please allow me to 
> clarify?
> 
> In the review around early December 2023, I received the following three 
> issues pointed out by Mr.Haas[1].
> 1. Transmitting state value safely between machines
> 2. Making the patch clearer by adding SQL keywords
> 3. Fixing the behavior when the HAVING clause is present
> 
> In the email sent on February 22, 2024[2], I provided an update on the 
> progress made in addressing these issues.
> Regarding issue 1, I have only provided a proposed solution in the email and 
> have not started the programming. 
> Therefore, the latest patch is not in a commit-ready state. As mentioned 
> later, we have also temporarily reverted the changes made to the 
> documentation.
> Before proceeding with the programming, I would like to discuss the proposed 
> solution with the community and seek consensus.
> If it is necessary to have source code in order to discuss, I can create a 
> simple prototype so that I can receive your feedback.
> Would you be able to provide your opinions on it?
> 
> Regarding issue 2., I have confirmed that creating a prototype allows us to 
> address the issue and clear the patch.
> In this prototype creation, the main purpose was to verify if the patch can 
> be cleared and significant revisions were made to the previous version.
> Therefore, I have removed all the document differences.
> I have submitted a patch [3] that includes the fixes for issue 3. to the 
> patch that was posted in [2].
> Regarding the proposed solution for issue 1, unlike the patch posted in [3], 
> we have a policy of not performing partial aggregation pushdown if we cannot 
> guarantee compatibility and safety.
> The latest patch in [3] is a POC patch. The patch that Mr. Momjian reviewed 
> is this.
> If user-facing documentation is needed for this POC patch, it can be added.
> 
> I apologize for the lack of explanation regarding this positioning, which may 
> have caused misunderstandings regarding the patch posted in [3].

That makes sense.  Let's get you answers to those questions first before
you continue.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2024-03-21 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi. Mr.Momjian, Mr.Lane, Mr.Haas, hackers.

I apologize for any misunderstanding regarding the context of the attached 
patch and
the points on which I requested a review. Could you please allow me to clarify?

In the review around early December 2023, I received the following three issues 
pointed out by Mr.Haas[1].
1. Transmitting state value safely between machines
2. Making the patch clearer by adding SQL keywords
3. Fixing the behavior when the HAVING clause is present

In the email sent on February 22, 2024[2], I provided an update on the progress 
made in addressing these issues.
Regarding issue 1, I have only provided a proposed solution in the email and
have not started programming.
Therefore, the latest patch is not in a commit-ready state. As mentioned later, 
we have also temporarily reverted the changes made to the documentation.
Before proceeding with the programming, I would like to discuss the proposed 
solution with the community and seek consensus.
If it is necessary to have source code in order to discuss, I can create a 
simple prototype so that I can receive your feedback.
Would you be able to provide your opinions on it?

Regarding issue 2., I have confirmed that creating a prototype allows us to 
address the issue and clear the patch.
In this prototype creation, the main purpose was to verify if the patch can be 
cleared and significant revisions were made to the previous version.
Therefore, I have removed all the document differences.
I have submitted a patch [3] that includes the fixes for issue 3. to the patch 
that was posted in [2].
Regarding the proposed solution for issue 1, unlike the patch posted in [3], 
we have a policy of not performing partial aggregation pushdown if we cannot 
guarantee compatibility and safety.
The latest patch in [3] is a POC patch. The patch that Mr. Momjian reviewed is 
this.
If user-facing documentation is needed for this POC patch, it can be added.

I apologize for the lack of explanation regarding this positioning, which may 
have caused misunderstandings regarding the patch posted in [3].

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoYCrtOvk2f32qQKZV%3DjNL35tandf2A2Dp_2F5ASuiG1BA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/TYAPR01MB5514F0CBD9CD4F84A261198195562%40TYAPR01MB5514.jpnprd01.prod.outlook.com
[3] 
https://www.postgresql.org/message-id/TYAPR01MB55141D18188AC86ADCE35FCB952F2%40TYAPR01MB5514.jpnprd01.prod.outlook.com

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2024-03-19 Thread Bruce Momjian
On Tue, Mar 19, 2024 at 05:29:07PM -0400, Tom Lane wrote:
> I'd like to vociferously protest both of those decisions.
> 
> "No version check by default" means "unsafe by default", which is not
> project style in general and is especially not so for postgres_fdw.
> We have tried very hard for years to ensure that postgres_fdw will
> work with a wide range of remote server versions, and generally been
> extremely conservative about what we think will work (example:
> collations); but this patch seems ready to throw that principle away.
> 
> Also, surely "remoteversion < PG_VERSION_NUM" is backwards.  What
> this would mean is that nobody can ever change a partial aggregate's
> implementation, because that would break queries issued from older
> servers (that couldn't know about the change) to newer ones.

Well it is the origin server that is issuing the PUSHDOWN syntax, so an
older origin server should be able to push to a newer remote server.

> Realistically, I think it's fairly unsafe to try aggregate pushdown
> to anything but the same PG major version; otherwise, you're buying
> into knowing which aggregates have partial support in which versions,
> as well as predicting the future about incompatible state changes.

Yes, incompatible state changes would be a problem with an older origin
server with a newer remote server setup.

If we require matching versions, we must accept that upgrades will
require more downtime.

> Even that isn't bulletproof --- e.g., maybe somebody wasn't careful
> about endianness-independence of the serialized partial state, making
> it unsafe to ship --- so there had better be a switch whereby the user
> can disable it.

Makes sense.  I was also wondering how a user would know whether the
pushdown is happening, or not.

> Maybe we could define a three-way setting:
> 
> * default: push down partial aggs only to same major PG version
> * disable: don't push down, period
> * force: push down regardless of remote version

What would be the default?  If it is the first one, it requires a
remote version check on first use in the session.

> With the "force" setting, it's the user's responsibility not to
> issue any remote-able aggregation that would be unsafe to push
> down.  This is still a pretty crude tool: I can foresee people
> wanting to have per-aggregate control over things, especially
> extension-supplied aggregates.  But it'd do for starters.

We have the postgres_fdw extensions option to control function pushdown
to extensions.

> I'm not super thrilled by the fact that the patch contains zero
> user-facing documentation, even though it's created new SQL syntax,
> not to mention a new postgres_fdw option.  I assume this means that
> nobody thinks it's anywhere near ready to commit.

Previous versions of the patch had docs since I know I worked on
improving them.  I am not sure what happened to them.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-03-19 Thread Tom Lane
Bruce Momjian  writes:
> The current patch has:

>   if ((OidIsValid(aggform->aggfinalfn) ||
>   (aggform->aggtranstype == INTERNALOID)) &&
>   fpinfo->check_partial_aggregate_support)
>   {
>   if (fpinfo->remoteversion == 0)
>   {
>   PGconn *conn = GetConnection(fpinfo->user, false, 
> NULL);

>   fpinfo->remoteversion = PQserverVersion(conn);
>   }

>   if (fpinfo->remoteversion < PG_VERSION_NUM)
>   partial_agg_ok = false;
>   }

> It uses check_partial_aggregate_support, which defaults to false,
> meaning partial aggregates will be pushed down with no version check by
> default.  If set to true, pushdown will happen if the remote server is
> the same version or newer, which seems acceptable to me.

I'd like to vociferously protest both of those decisions.

"No version check by default" means "unsafe by default", which is not
project style in general and is especially not so for postgres_fdw.
We have tried very hard for years to ensure that postgres_fdw will
work with a wide range of remote server versions, and generally been
extremely conservative about what we think will work (example:
collations); but this patch seems ready to throw that principle away.

Also, surely "remoteversion < PG_VERSION_NUM" is backwards.  What
this would mean is that nobody can ever change a partial aggregate's
implementation, because that would break queries issued from older
servers (that couldn't know about the change) to newer ones.

Realistically, I think it's fairly unsafe to try aggregate pushdown
to anything but the same PG major version; otherwise, you're buying
into knowing which aggregates have partial support in which versions,
as well as predicting the future about incompatible state changes.
Even that isn't bulletproof --- e.g., maybe somebody wasn't careful
about endianness-independence of the serialized partial state, making
it unsafe to ship --- so there had better be a switch whereby the user
can disable it.

Maybe we could define a three-way setting:

* default: push down partial aggs only to same major PG version
* disable: don't push down, period
* force: push down regardless of remote version

With the "force" setting, it's the user's responsibility not to
issue any remote-able aggregation that would be unsafe to push
down.  This is still a pretty crude tool: I can foresee people
wanting to have per-aggregate control over things, especially
extension-supplied aggregates.  But it'd do for starters.
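The proposed three-way setting boils down to a small gating decision, sketched below. The enum and function names are invented for illustration; version numbers follow the libpq PQserverVersion() convention (major * 10000 + minor, e.g. 160002 for 16.2), so the major version is version / 10000.

```c
#include <stdbool.h>

/* Hypothetical three-way setting, per the proposal above. */
typedef enum
{
    PDA_DEFAULT,                /* push down only to same major version */
    PDA_DISABLE,                /* never push down */
    PDA_FORCE                   /* push down regardless of remote version */
} PartialAggPushdown;

static bool
partial_agg_pushdown_ok(PartialAggPushdown setting,
                        int localversion, int remoteversion)
{
    switch (setting)
    {
        case PDA_DISABLE:
            return false;
        case PDA_FORCE:
            return true;
        case PDA_DEFAULT:
            /* PQserverVersion-style ints: major is version / 10000. */
            return localversion / 10000 == remoteversion / 10000;
    }
    return false;
}
```

Note the symmetry of the default check: unlike a "remoteversion < PG_VERSION_NUM" test, it does not privilege either side, matching the objection that older origins must still be able to talk to newer remotes and vice versa.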

I'm not super thrilled by the fact that the patch contains zero
user-facing documentation, even though it's created new SQL syntax,
not to mention a new postgres_fdw option.  I assume this means that
nobody thinks it's anywhere near ready to commit.

regards, tom lane




Re: Partial aggregates pushdown

2024-03-19 Thread Bruce Momjian
On Sat, Mar 16, 2024 at 02:28:50AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi. Mr.Pyhalov.
>
> > From: Alexander Pyhalov  Sent: Wednesday,
> > February 28, 2024 10:43 PM
> > > 1. Transmitting state value safely between machines
> > >> From: Robert Haas  Sent: Wednesday,
> > >> December 6, 2023 10:25 PM the problems around transmitting
> > >> serialized bytea blobs between machines that can't be assumed to
> > >> fully trust each other will need to be addressed in some way,
> > >> which seems like it will require a good deal of design work,
> > >> forming some kind of consensus, and then implementation work to
> > >> follow.
> > > I have considered methods for safely transmitting state values
> > > between different machines.  I have taken into account the version
> > > policy of PostgreSQL (5 years of support) and the major version
> > > release cycle over the past 10 years (1 year), and as a result, I
> > > have made the assumption that transmission is allowed only when
> > > the difference between the local version and the remote version
> > > is 5 or less.  I believe that by adding new components, "export
> > > function" and "import function", to the aggregate functions, and
> > > further introducing a new SQL keyword to the query syntax of
> > > aggregate expressions, we can address this issue.
> >
> > I honestly think that achieving cross-version compatibility in
> > this way puts a significant burden on developers. Can we instead
> > always use the more or less universal export and import function
> > to fix possible issues with binary representations on different
> > architectures and just refuse to push down partial aggregates on
> > server version mismatch? At least at the first step?
>
> Thank you for your comment. I agree with your point that the proposed
> method would impose a significant burden on developers. In order
> to ensure cross-version compatibility, it is necessary to impose
> constraints on the format of the state values exchanged between
> servers, which would indeed burden developers. As you mentioned, I
> think that it is realistic to allow partial aggregation pushdown only
> when coordinating between the same versions in the first step.

The current patch has:

	if ((OidIsValid(aggform->aggfinalfn) ||
		 (aggform->aggtranstype == INTERNALOID)) &&
		fpinfo->check_partial_aggregate_support)
	{
		if (fpinfo->remoteversion == 0)
		{
			PGconn	   *conn = GetConnection(fpinfo->user, false, NULL);

			fpinfo->remoteversion = PQserverVersion(conn);
		}

		if (fpinfo->remoteversion < PG_VERSION_NUM)
			partial_agg_ok = false;
	}

It uses check_partial_aggregate_support, which defaults to false,
meaning partial aggregates will be pushed down with no version check by
default.  If set to true, pushdown will happen if the remote server is
the same version or newer, which seems acceptable to me.
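
Restated as a sketch, the decision the quoted patch code makes (a hypothetical Python stand-in for the C logic, not the actual function):

```python
def partial_agg_ok(has_finalfn_or_internal: bool,
                   check_partial_aggregate_support: bool,
                   remote_version: int,
                   local_version: int) -> bool:
    """Sketch of the quoted postgres_fdw check: pushdown of aggregates
    with a finalfn or internal transtype is rejected only when the
    option is on and the remote server is older than the local one."""
    if has_finalfn_or_internal and check_partial_aggregate_support:
        if remote_version < local_version:
            return False
    return True

# Default (option off): pushed down with no version check at all.
print(partial_agg_ok(True, False, 150000, 170000))  # -> True
# Option on: same-or-newer remote is accepted, older is not.
print(partial_agg_ok(True, True, 170000, 170000))   # -> True
print(partial_agg_ok(True, True, 160000, 170000))   # -> False
```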

FYI, the patch is much smaller now.  :-)

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2024-02-28 Thread Alexander Pyhalov

Hi.

fujii.y...@df.mitsubishielectric.co.jp wrote 2024-02-22 10:20:

Hi. Mr.Haas, hackers.

I apologize for the significant delay since my last post.
I have conducted investigations and considerations regarding the 
remaining tasks as follows.

Would it be possible for you to review them?
In particular, could you please confirm if the approach mentioned in 1. 
is acceptable?
If there are no issues with the direction outlined in 1., I plan to 
make a simple prototype based on this approach.


1. Transmitting state value safely between machines

From: Robert Haas 
Sent: Wednesday, December 6, 2023 10:25 PM
the problems around transmitting
serialized bytea blobs between machines that can't be assumed to fully 
trust each other will need to be addressed in some
way, which seems like it will require a good deal of design work, 
forming some kind of consensus, and then implementation

work to follow.
I have considered methods for safely transmitting state values between 
different machines.
I have taken into account the version policy of PostgreSQL (5 years of 
support) and the major version release cycle over the past 10 years (1 
year), and as a result, I have made the assumption that transmission is 
allowed only when the difference between the local version and the 
remote version is 5 or less.
I believe that by adding new components, "export function" and "import 
function", to the aggregate functions, and further introducing a new 
SQL keyword to the query syntax of aggregate expressions, we can 
address this issue.
If the version of the local server is higher than or equal to the 
version of the remote server, the proposed method can be simplified. 
The export version mentioned later in (1) would not be necessary. 
Furthermore, if the version of the local server matches the version of 
the remote server, the proposed method can be further simplified.
I would appreciate your input on reasonable assumptions regarding the 
differences in versions between the local server and the remote server.
I will explain the specifications of the export function, import 
function, the new SQL keyword for aggregate expressions, and the 
behavior of query processing for partial aggregation separately.

(1) Export Function Specification
This function is another final function for partial aggregate.
This function converts the state value that represents the result of 
partial aggregation into a format that can be read by the local server.
This function is called instead of the existing finalfunc during the 
final stage of aggregation when performing partial aggregation.

The conversion process described above will be referred to as "export".
The argument of an export function is the version of the server that 
will receive the return value.

Hereafter, this version will be referred to as the export version.
The concept of an export version is necessary to handle cases where the 
version of the local server is smaller than the version of the remote 
server.
The return value of the export function is the transformed state value, 
and its data type is bytea.
For backward compatibility, the developer of the export function must 
ensure that the export can be performed for major versions up to five 
versions prior to the major version of PostgreSQL that the export 
function is being developed for.
For built-in functions, I believe it is necessary to allow for the 
possibility of not developing the export functionality for specific 
versions in the future (due to reasons such as development burden) 
after the export function is developed for a certain version.
To achieve this, for built-in functions, we will add a column to the 
pg_aggregate catalog that indicates the presence or absence of export 
functionality for each major version, including the major version being 
developed and the previous five major versions. This column will be 
named safety_export_versions and will have a data type of boolean[6].
For user-defined functions, we will refer to the extensions option and 
add an external server option called safety_export_extensions, which 
will maintain a list of extensions that include only the aggregate 
functions that can be exported to the local server version.
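
A sketch of the proposed export/import pair under these rules (the wire layout, names, and version arithmetic are all hypothetical; the proposal itself defines no concrete format):

```python
import struct

SUPPORTED_EXPORT_MAJORS = 5   # must export to up to five prior majors

def export_avg_state(count: int, total: int, export_version: int,
                     local_version: int) -> bytes:
    """Export function: convert the partial-aggregate state into a
    portable bytea, tagged with the format the receiver understands."""
    if local_version - export_version > SUPPORTED_EXPORT_MAJORS:
        raise ValueError("receiver too old: no export format available")
    # network byte order: version tag, then the state fields
    return struct.pack("!iqq", export_version, count, total)

def import_avg_state(blob: bytes, local_version: int) -> tuple[int, int]:
    """Import function: validate and reconstruct the state value."""
    version, count, total = struct.unpack("!iqq", blob)
    if version > local_version:
        raise ValueError("state exported for a newer server")
    if count < 0:
        raise ValueError("corrupt state: negative row count")
    return count, total

blob = export_avg_state(3, 600, export_version=16, local_version=17)
print(import_avg_state(blob, local_version=17))  # -> (3, 600)
```

The version tag is what lets the receiving side refuse a state it does not know how to interpret, rather than misreading it.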

...


I honestly think that achieving cross-version compatibility in this way 
puts a significant burden on developers. Can we instead always use the 
more or less universal export and import function to fix possible issues 
with binary representations on different architectures and just refuse 
to push down partial aggregates on server version mismatch? At least at 
the first step?




3. Fixing the behavior when the HAVING clause is present

From: Robert Haas 
Sent: Tuesday, November 28, 2023 4:08 AM

On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov 
 wrote:

> Hi. HAVING is also a problem. Consider the following query
>
> SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to
> foreign server as HAVING needs full aggregate result, but foreign
> server don't know it.

Re: [CAUTION!! freemail] Re: Partial aggregates pushdown

2024-01-26 Thread vignesh C
On Thu, 7 Dec 2023 at 05:41, fujii.y...@df.mitsubishielectric.co.jp
 wrote:
>
> Hi Mr.Haas.
>
> > -Original Message-
> > From: Robert Haas 
> > Sent: Wednesday, December 6, 2023 10:25 PM
> > On Wed, Dec 6, 2023 at 3:41 AM fujii.y...@df.mitsubishielectric.co.jp
> >  wrote:
> > > Are you concerned about the hassle and potential human errors of
> > > manually adding new partial aggregation functions, rather than the 
> > > catalog becoming bloated?
> >
> > I'm concerned about both.
> Understood. Thank you for your response.
>
> > > The process of creating partial aggregation functions from aggregation
> > > functions can be automated, so I believe this issue can be resolved.
> > > However, automating it may increase the size of the patch even more, so 
> > > overall, approach#2 might be better.
> > > To implement approach #2, it would be necessary to investigate how much 
> > > additional code is required.
> >
> > Yes. Unfortunately I fear that there is quite a lot of work left to do here 
> > in order to produce a committable feature. To me it
> > seems necessary to conduct an investigation of approach #2. If the result 
> > of that investigation is that nothing major
> > stands in the way of approach #2, then I think we should adopt it, which is 
> > more work. In addition, the problems around
> > transmitting serialized bytea blobs between machines that can't be assumed 
> > to fully trust each other will need to be
> > addressed in some way, which seems like it will require a good deal of 
> > design work, forming some kind of consensus, and
> > then implementation work to follow. In addition to that there may be some 
> > small problems that need to be solved at a
> > detail level, such as the HAVING issue. I think the last category won't be 
> > too hard to sort out, but that still leaves two really
> > major areas to address.
> Yes, I agree with you. It is clear that further investigation and discussion 
> are still needed.
> I would be grateful if we can resolve this issue gradually. I would also like 
> to continue the discussion if possible in the future.

Thanks for all the efforts on this patch. I have changed the status of
the commitfest entry to "Returned with Feedback" as there is still
some work to get this patch out. Feel free to continue the discussion
and add a new entry when the patch is in a reviewable shape.

Regards,
Vignesh




Re: [CAUTION!! freemail] Re: Partial aggregates pushdown

2023-12-11 Thread Robert Haas
On Thu, Dec 7, 2023 at 4:12 PM Bruce Momjian  wrote:
> Second, the patch already has a mechanism to check the remote server
> version to see if it is the same or newer.   Here is the version check
> documentation patch:

Right. This feature can certainly be implemented in a
backward-compatible way. I'm not sure that we have as much control
over what does and does not get pushed down as we really want here,
but it's completely possible to do this in a way that doesn't break
other use cases.

> However, functions don't have pre-created records, and internal
> > functions don't seem to have an SQL-defined structure, but as I remember
> the internal aggregate functions all take the same internal structure,
> so I guess we only need one fixed input and one output that would
> output/input such records.  Performance might be an issue, but at this
> point let's just implement this and measure the overhead since there are
> few/any(?) other viable options.

IMHO records will be the easiest approach, but it will be some work to try it.

> Fourth, going with #2 where we do the pushdown using an SQL keyword also
> allows extensions to automatically work, while requiring partial
> aggregate functions for every non-partial aggregate will require work
> for extensions, and potentially lead to more version mismatch issues.

Yeah.

> Finally, I am now concerned that this will not be able to be in PG 17,
> which I was hoping for.

Getting it ready to ship by March seems very difficult. I'm not saying
it couldn't happen, but I think more likely it won't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [CAUTION!! freemail] Re: Partial aggregates pushdown

2023-12-07 Thread Bruce Momjian
On Thu, Dec  7, 2023 at 09:56:08AM -0500, Robert Haas wrote:
> On Wed, Dec 6, 2023 at 7:11 PM fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > I would be grateful if we can resolve this issue gradually. I would also 
> > like to continue the discussion if possible in the future.
> 
> I think that would be good. Thanks for your work on this. It is a hard 
> problem.

Agreed.  First, Robert is right that this feature is long overdue.  It
might not help many of our existing workloads, but it opens us up to
handling new, larger workloads.

Second, the patch already has a mechanism to check the remote server
version to see if it is the same or newer.   Here is the version check
documentation patch:

check_partial_aggregate_support (boolean)

If this option is false, postgres_fdw always
uses partial aggregate pushdown by assuming that each built-in
aggregate function has a partial aggregate function defined on
the remote server.  If this option is true, local aggregates
whose partial computation function references itself are assumed
to exist on the remote server.  If not, during query planning,
postgres_fdw will connect to the remote
server and retrieve the remote server version.  If the remote
version is the same or newer, partial aggregate functions will be
assumed to exist.  If older, postgres_fdw
checks that the remote server has a matching partial aggregate
function before performing partial aggregate pushdown.  The default
is false.

There is also an extension list that specifies which extension-owned
functions can be pushed down;  from the doc patch:

To reduce the risk of misexecution of queries, WHERE clauses and
aggregate expressions are not sent to the remote server unless they
only use data types, operators, and functions that are built-in
or belong to an extension that is listed in the foreign server's
extensions option.

Third, we already have a way of creating records for tables:

SELECT pg_language FROM pg_language;
                pg_language
--------------------------------------------
 (12,internal,10,f,f,0,0,2246,)
 (13,c,10,f,f,0,0,2247,)
 (14,sql,10,f,t,0,0,2248,)
 (13576,plpgsql,10,t,t,13573,13574,13575,)

And we do have record input functionality:

CREATE TABLE test (x int, language pg_language);

INSERT INTO test SELECT 0, pg_language FROM pg_language;

SELECT * FROM test;
 x |                  language
---+--------------------------------------------
 0 | (12,internal,10,f,f,0,0,2246,)
 0 | (13,c,10,f,f,0,0,2247,)
 0 | (14,sql,10,f,t,0,0,2248,)
 0 | (13576,plpgsql,10,t,t,13573,13574,13575,)
(4 rows)

However, functions don't have pre-created records, and internal
functions don't seem to have an SQL-defined structure, but as I remember
the internal aggregate functions all take the same internal structure,
so I guess we only need one fixed input and one output that would
output/input such records.  Performance might be an issue, but at this
point let's just implement this and measure the overhead since there are
few/any(?) other viable options.
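
A sketch of that record-text idea (the format is invented for illustration; PostgreSQL's real record I/O also handles quoting, NULLs, and arbitrary composite types):

```python
def state_to_record_text(count: int, total: int) -> str:
    """Serialize an avg-style (count, sum) state as record text,
    the way '(12,internal,10,...)' above renders a pg_language row."""
    return f"({count},{total})"

def record_text_to_state(text: str) -> tuple[int, int]:
    """Parse the record text back, validating each field on input --
    the point of the record approach: reuse per-field input checks."""
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("malformed record")
    fields = text[1:-1].split(",")
    if len(fields) != 2:
        raise ValueError("wrong number of fields")
    count, total = int(fields[0]), int(fields[1])  # int() rejects junk
    if count < 0:
        raise ValueError("negative row count")
    return count, total

print(record_text_to_state(state_to_record_text(3, 600)))  # -> (3, 600)
```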

Fourth, going with #2 where we do the pushdown using an SQL keyword also
allows extensions to automatically work, while requiring partial
aggregate functions for every non-partial aggregate will require work
for extensions, and potentially lead to more version mismatch issues.

Finally, I am now concerned that this will not be able to be in PG 17,
which I was hoping for.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: [CAUTION!! freemail] Re: Partial aggregates pushdown

2023-12-07 Thread Robert Haas
On Wed, Dec 6, 2023 at 7:11 PM fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I would be grateful if we can resolve this issue gradually. I would also like 
> to continue the discussion if possible in the future.

I think that would be good. Thanks for your work on this. It is a hard problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: [CAUTION!! freemail] Re: Partial aggregates pushdown

2023-12-06 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Haas.

> -Original Message-
> From: Robert Haas 
> Sent: Wednesday, December 6, 2023 10:25 PM
> On Wed, Dec 6, 2023 at 3:41 AM fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > Are you concerned about the hassle and potential human errors of
> > manually adding new partial aggregation functions, rather than the catalog 
> > becoming bloated?
> 
> I'm concerned about both.
Understood. Thank you for your response.

> > The process of creating partial aggregation functions from aggregation
> > functions can be automated, so I believe this issue can be resolved.
> > However, automating it may increase the size of the patch even more, so 
> > overall, approach#2 might be better.
> > To implement approach #2, it would be necessary to investigate how much 
> > additional code is required.
> 
> Yes. Unfortunately I fear that there is quite a lot of work left to do here 
> in order to produce a committable feature. To me it
> seems necessary to conduct an investigation of approach #2. If the result of 
> that investigation is that nothing major
> stands in the way of approach #2, then I think we should adopt it, which is 
> more work. In addition, the problems around
> transmitting serialized bytea blobs between machines that can't be assumed to 
> fully trust each other will need to be
> addressed in some way, which seems like it will require a good deal of design 
> work, forming some kind of consensus, and
> then implementation work to follow. In addition to that there may be some 
> small problems that need to be solved at a
> detail level, such as the HAVING issue. I think the last category won't be 
> too hard to sort out, but that still leaves two really
> major areas to address.
Yes, I agree with you. It is clear that further investigation and discussion 
are still needed. 
I would be grateful if we can resolve this issue gradually. I would also like 
to continue the discussion if possible in the future.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2023-12-06 Thread Robert Haas
On Wed, Dec 6, 2023 at 3:41 AM fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> Are you concerned about the hassle and potential human errors of manually 
> adding new partial
> aggregation functions, rather than the catalog becoming bloated?

I'm concerned about both.

> The process of creating partial aggregation functions from aggregation 
> functions can be automated,
> so I believe this issue can be resolved. However, automating it may increase 
> the size of the patch
> even more, so overall, approach#2 might be better.
> To implement approach #2, it would be necessary to investigate how much 
> additional code is required.

Yes. Unfortunately I fear that there is quite a lot of work left to do
here in order to produce a committable feature. To me it seems
necessary to conduct an investigation of approach #2. If the result of
that investigation is that nothing major stands in the way of approach
#2, then I think we should adopt it, which is more work. In addition,
the problems around transmitting serialized bytea blobs between
machines that can't be assumed to fully trust each other will need to
be addressed in some way, which seems like it will require a good deal
of design work, forming some kind of consensus, and then
implementation work to follow. In addition to that there may be some
small problems that need to be solved at a detail level, such as the
HAVING issue. I think the last category won't be too hard to sort out,
but that still leaves two really major areas to address.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Partial aggregates pushdown

2023-12-06 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Haas, hackers.

> From: Robert Haas 
> Sent: Tuesday, November 28, 2023 5:03 AM
> Also, I want to make one other point here about security and reliability. 
> Right now, there is no way for a user to feed
> arbitrary data to a deserialization function. Since serialization and 
> deserialization functions are only used in the context of
> parallel query, we always know that the data fed to the deserialization 
> function must have come from the serialization
> function on the same machine. Nor can users call the deserialization function 
> directly with arbitrary data of their own
> choosing, because users cannot call functions that take or return internal. 
> But with this system, it becomes possible to
> feed arbitrary data to a deserialization function.
> The user could redefine the function on the remote side so that it produces 
> arbitrary data of their choosing, and the local
> deserialization function will ingest it.
> 
> That's potentially quite a significant problem. Consider for example that 
> numericvar_deserialize() does no validity
> checking on any of the weight, sign, or dscale, but not all values for those 
> fields are legal. Right now that doesn't matter,
> but if you can feed arbitrary data to that function, then it is. I don't know 
> exactly what the consequences are if you can get
> that function to spit out a NumericVar with values outside the normal legal 
> range. What can you do then?
> Store a bogus numeric on disk? Crash the server? Worst case, some problem 
> like this could be a security issue allowing for
> escalation to superuser; more likely, it would be a crash bug, corrupt your 
> database, or lead to unexpected and strange
> error messages.
> 
> Unfortunately, I have the unpleasant suspicion that most internal-type 
> aggregates will be affected by this problem.
> Consider, for example, string_agg_deserialize(). Generally, strings are one 
> of the least-constrained data types, so you
> might hope that this function would be OK. But it doesn't look very 
> promising. The first int4 in the serialized representation
> is the cursor, which would have to be bounds-checked, lest someone provide a 
> cursor that falls outside the bounds of the
> StringInfo and, maybe, cause a reference to an arbitrary memory location. 
> Then the rest of the representation is the actual
> data, which could be anything. This function is used for both bytea and
> text, and for bytea, letting the payload be anything is OK.
> anything is OK.
> But for text, the supplied data shouldn't contain an embedded zero byte, or 
> otherwise be invalid in the server encoding. If
> it were, that would provide a vector to inject invalidly encoded data into 
> the database. This feature can't be allowed to do
> that.
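
For illustration, the kind of bounds checking being asked for might look like this (the layout — an int32 cursor followed by the accumulated bytes — only approximates the real serialized string_agg state):

```python
import struct

def validate_string_agg_state(blob: bytes, is_text: bool) -> None:
    """Sketch of sanity checks a deserialization function would need
    before trusting a bytea blob from an untrusted remote server."""
    if len(blob) < 4:
        raise ValueError("truncated state")
    (cursor,) = struct.unpack_from("!i", blob, 0)
    payload = blob[4:]
    # bounds-check the cursor so it cannot point outside the buffer
    if cursor < 0 or cursor > len(payload):
        raise ValueError("cursor out of bounds")
    # for text (unlike bytea), reject embedded zero bytes
    if is_text and b"\x00" in payload:
        raise ValueError("invalid byte in text payload")

validate_string_agg_state(struct.pack("!i", 3) + b"abc", is_text=True)  # ok
try:
    validate_string_agg_state(struct.pack("!i", 99) + b"abc", is_text=True)
except ValueError as e:
    print(e)  # -> cursor out of bounds
```

(Encoding validation for text would need the server-encoding machinery as well; the zero-byte check is only the simplest case.)
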
I completely overlooked this issue. I should have considered the risks of 
sending raw state values or serialized state
data directly from remote to local. I apologize.

> What could be a solution to this class of problems? One option is to just 
> give up on supporting this feature for internal-type
> aggregates for now. That's easy enough to do, and just means we have less 
> functionality, but it's sad because that's
> functionality we'd like to have. Another approach is to add necessary sanity 
> checks to the relevant deserialization
> functions, but that seems a little hard to get right, and it would slow down 
> parallel query cases which are probably going to
> be more common than the use of this feature. I think the slowdown might be 
> significant, too. A third option is to change
> those aggregates in some way, like giving them a transition function that 
> operates on some data type other than internal,
> but there again we have to be careful of slowdowns. A final option is to 
> rethink the infrastructure in some way, like having
> a way to serialize to something other than bytea, for which we already have 
> input functions with adequate checks. For
> instance, if string_agg_serialize() produced a record containing an integer 
> column and a text or bytea column, we could
> attempt to ingest that record on the other side and presumably the right 
> things would happen in the case of any invalid
> data. But I'm not quite sure what infrastructure would be required to make 
> this kind of idea work.
Thank you very much for providing a direction towards resolving this issue.
As you have suggested as the last option, it seems that expanding the current 
mechanism of the aggregation
function is the only choice. It may take some time, but I will consider 
specific solutions.

> From: Robert Haas 
> Sent: Tuesday, November 28, 2023 4:08 AM
> On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov  
> wrote:
> > Hi. HAVING is also a problem. Consider the following query
> >
> > SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to
> > foreign server as HAVING needs full aggregate result, but foreign
> > server don't know it.
> 
> I don't see it that way. What we would push to the foreig

Re: Partial aggregates pushdown

2023-11-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Nov 27, 2023 at 4:23 PM Tom Lane  wrote:
> > Well, one of the founding principles of postgres_fdw was to be able
> > to talk to PG servers that are not of the same version as yours.
> > If we break that in the name of performance, we are going to have
> > a lot of unhappy users.  Even the ones who do get the benefit of
> > the speedup are going to be unhappy when it breaks because they
> > didn't upgrade local and remote at exactly the same time.
> 
> I agree with this.

+1.  We do want to continue to make this work- to the extent possible.
I don't think there's any problem with saying that when talking to an
older server, you don't get the same capabilities as you do when talking
to a newer server.

> > Just because we'd like to have it doesn't make the patch workable
> > in the real world.
> 
> And also with this in concept - I'd like to plan arbitrarily
> complicated queries perfectly and near-instantly, and then execute
> them at faster-than-light speed, but we can't. However, I don't
> understand the fatalism with respect to the feature at hand. As I said
> before, it's not like no other product has made this work. Sure, some
> of those products may not have the extensible system of data types
> that we do, or may not care about cross-version communication, but
> those don't seem like good enough reasons to just immediately give up.

Certainly there are other projects out there which are based on PG that
have managed to make this work and work really quite well.

> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

Yes, Citus[1] and Greenplum[2], to just name two.

I certainly understand the concern around the security of this and would
have thought the approach we'd use would be to not just take internal
state and pass it along but rather to provide a way for aggregates to
opt-in to supporting this and have them serialize/deserialize with
new dedicated functions that have appropriate checks to avoid bad things
happening.  That could also be versioned, perhaps, if we feel that's
necessary (I'm a bit skeptical, but it would hopefully address the
concern about different versions having different data that they want to
pass along).

> One of the things that I think is a problem in this area is that the
> ways we have to configure FDW connections are just not very rich.

Agreed.

> We're trying to cram everything into a set of strings that can be
> attached to the foreign server or the user mapping, but that's not a
> very good fit for something like how all the local SQL functions that
> might exist map onto all of the remote SQL functions that might exist.
> Now you might well say that we don't want the act of configuring a
> foreign data wrapper to be insanely complicated, and I would agree
> with that. But, on the other hand, as Larry Wall once said, a good
> programming language makes simple things simple and complicated things
> possible. I think our current configuration system is only
> accomplishing the first of those goals.

We've already got issues in this area with extensions- there's no way
for a user to say what version of an extension exists on the remote side
and no way for an extension to do anything different based on that
information.  Perhaps we could work on a solution to both of these
issues, but at the least I don't see holding back on this effort for a
problem that already exists but which we've happily accepted because of
the benefit it provides, like being able to push-down postgis bounding
box conditionals to allow for indexed lookups.

Thanks,

Stephen

[1]: https://docs.citusdata.com/en/v11.1/develop/reference_sql.html
[2]: 
https://postgresconf.org/conferences/Beijing/program/proposals/implementation-of-distributed-aggregation-in-greenplum




Re: Partial aggregates pushdown

2023-11-28 Thread Robert Haas
On Tue, Nov 28, 2023 at 5:24 AM Ashutosh Bapat
 wrote:
> If my memory serves me right, PGXC implemented partial aggregation
> only when the output of partial aggregate was a SQL data type
> (non-Internal, non-Unknown). But I may be wrong. But at that time,
> JSONB wasn't there or wasn't that widespread.
>
> Problem with Internal is that it's just a binary string whose content
> can change across versions and can be interpreted differently by
> different versions. There is no metadata in it to say how to
> interpret it. We can add that metadata with JSONB. The result of a
> partial aggregate can be sent as JSONB. If the local server finds the
> JSONB familiar, it will construct the right partial aggregate value;
> otherwise it will throw an error. If there's a way to avoid even that
> error (by looking at the server version etc.), it can be avoided too.
> But JSONB leaves very little chance that the value will be
> interpreted wrongly. The downside is that we are tying PARTIAL's
> output to JSONB, thus tying SQL syntax to a data type.
>
> Does that look acceptable?

If somebody had gone to the trouble of making this work, and had done
a good job, I wouldn't vote against it, but in a vacuum, I'm not sure
it's the best design. The problem in my view is that working with JSON
is not actually very pleasant. It's super-easy to generate, and
super-easy for humans to read. But parsing and validating it is a
pain. You basically have to have two parsers, one to do syntactical
validation and then a second one to ensure that the structure of the
document and the contents of each item are as expected. See
parse_manifest.c for an example of what I mean by that. Now, if we add
new code, it can reuse the JSON parser we've already got, so it's not
that you need to write a new JSON parser for every new application of
JSON, but the semantic validator (a la parse_manifest.c) isn't
necessarily any less code than a whole new parser for a bespoke
format.
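
To make the two-pass point concrete (the state layout is invented; a generic JSON parser only gets you through pass 1):

```python
import json

def load_partial_state(payload: str) -> tuple[int, int]:
    """Two-stage validation in the parse_manifest.c spirit."""
    # Pass 1: syntactic validation -- is it well-formed JSON at all?
    doc = json.loads(payload)
    # Pass 2: semantic validation -- is it the structure we expect?
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object")
    if set(doc) != {"count", "sum"}:
        raise ValueError("unexpected fields")
    if not isinstance(doc["count"], int) or doc["count"] < 0:
        raise ValueError("bad count")
    if not isinstance(doc["sum"], int):
        raise ValueError("bad sum")
    return doc["count"], doc["sum"]

print(load_partial_state('{"count": 3, "sum": 600}'))  # -> (3, 600)
```

Pass 2 here is the semantic validator that has to be written per format, whether the envelope is JSON or a bespoke bytea layout.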

To make that a bit more concrete, for something like string_agg(), is
it easier to write a validator for the existing deserialization
function that accepts a bytea blob, or to write a validator for a JSON
blob that we could be passing instead? My suspicion is that the former
is less work and easier to verify, but it's possible I'm wrong about
that and they're more or less equal. I don't really see any way that
the JSON thing is straight-up better; at best it's a toss-up in terms
of amount of code. Now somebody could still make an argument that they
would like JSON better because it would be more useful for some
purpose other than this feature, and that is fine, but here I'm just
thinking about this feature in particular.

My personal suspicion is that the easiest way to support internal-type
aggregates here is to convert them to use an array or record type as a
transition state instead, or maybe serialize the internal state to one
of those things instead of to bytea. I suspect that would allow us to
leverage more of our existing validation infrastructure than using
JSON or sticking with bytea. But I'm certainly amenable to other
points of view. I'm not trying to pretend that my gut feeling is
necessarily correct; I'm just explaining what I currently think.
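As a sketch of the array/record-transition-state idea (a Python model rather than C, with invented function names): an avg whose transition state is the pair (count, sum) — representable as an int8 array or a record rather than an opaque internal struct — can be shipped, combined, and finalized as plain SQL-typed data, which is what makes existing validation machinery applicable.

```python
def avg_trans(state, value):
    # transition step: fold one input value into the (count, sum) state
    count, total = state
    return (count + 1, total + value)

def avg_combine(a, b):
    # combine two partial states; this pair is all a remote node must ship
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    # final step: runs only once, after all partial states are combined
    count, total = state
    return total / count if count else None

shard1 = (0, 0)
for v in (1, 2, 3):
    shard1 = avg_trans(shard1, v)
shard2 = (0, 0)
for v in (4, 5):
    shard2 = avg_trans(shard2, v)
print(avg_final(avg_combine(shard1, shard2)))  # 3.0
```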

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Partial aggregates pushdown

2023-11-28 Thread Ashutosh Bapat
On Tue, Nov 28, 2023 at 5:21 AM Robert Haas  wrote:
>
> TBH, I suspect even some PG forks have made this work, like maybe PGXC
> or PGXL, although I don't know for certain. We might not like the
> trade-offs they made to get there, but we haven't even talked through
> possible design ideas yet, so it seems way too early to give up.

If my memory serves me right, PGXC implemented partial aggregation
only when the output of the partial aggregate was a SQL data type
(non-Internal, non-Unknown), though I may be wrong. In any case, at
that time JSONB either didn't exist or wasn't that widespread.

The problem with Internal is that it's just a binary string whose content
can change across versions and can be interpreted differently by
different versions. There is no metadata in it telling the receiver how to
interpret it. We can add that metadata to JSONB. The result of a partial
aggregate can be sent as a JSONB. If the local server finds the JSONB
familiar it will construct the right partial aggregate value otherwise
it will throw an error. If the mismatch can be detected up front (by
looking at the server version, etc.), even that error can be avoided. But
JSONB leaves very little chance that the value will be interpreted
wrongly. The downside is that we are tying PARTIAL's output to be JSONB, thus
tying SQL syntax to a data type.

Does that look acceptable?
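A minimal Python sketch of that negotiation (the envelope fields "agg", "fmt", and "state" are invented for illustration): the receiver reconstructs the partial aggregate value only if it recognizes the embedded metadata, and errors out otherwise.

```python
import json

# (aggregate, state-format version) pairs this server knows how to read
KNOWN_FORMATS = {("avg(int8)", 1)}

def decode_partial(blob: str):
    """Reconstruct a partial state only if its metadata is familiar."""
    env = json.loads(blob)
    key = (env["agg"], env["fmt"])
    if key not in KNOWN_FORMATS:
        raise ValueError(f"unfamiliar partial aggregate state: {key}")
    return env["state"]

blob = json.dumps({"agg": "avg(int8)", "fmt": 1, "state": [3, 6]})
print(decode_partial(blob))  # [3, 6]
```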

-- 
Best Wishes,
Ashutosh Bapat




Re: Partial aggregates pushdown

2023-11-27 Thread Robert Haas
First of all, that last email of mine was snippy, and I apologize for it. Sorry.

That said:

On Mon, Nov 27, 2023 at 4:23 PM Tom Lane  wrote:
> Well, one of the founding principles of postgres_fdw was to be able
> to talk to PG servers that are not of the same version as yours.
> If we break that in the name of performance, we are going to have
> a lot of unhappy users.  Even the ones who do get the benefit of
> the speedup are going to be unhappy when it breaks because they
> didn't upgrade local and remote at exactly the same time.

I agree with this.

> Just because we'd like to have it doesn't make the patch workable
> in the real world.

And also with this in concept - I'd like to plan arbitrarily
complicated queries perfectly and near-instantly, and then execute
them at faster-than-light speed, but we can't. However, I don't
understand the fatalism with respect to the feature at hand. As I said
before, it's not like no other product has made this work. Sure, some
of those products may not have the extensible system of data types
that we do, or may not care about cross-version communication, but
those don't seem like good enough reasons to just immediately give up.

TBH, I suspect even some PG forks have made this work, like maybe PGXC
or PGXL, although I don't know for certain. We might not like the
trade-offs they made to get there, but we haven't even talked through
possible design ideas yet, so it seems way too early to give up.

One of the things that I think is a problem in this area is that the
ways we have to configure FDW connections are just not very rich.
We're trying to cram everything into a set of strings that can be
attached to the foreign server or the user mapping, but that's not a
very good fit for something like how all the local SQL functions that
might exist map onto all of the remote SQL functions that might exist.
Now you might well say that we don't want the act of configuring a
foreign data wrapper to be insanely complicated, and I would agree
with that. But, on the other hand, as Larry Wall once said, a good
programming language makes simple things simple and complicated things
possible. I think our current configuration system is only
accomplishing the first of those goals.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Partial aggregates pushdown

2023-11-27 Thread Tom Lane
Robert Haas  writes:
> On Mon, Nov 27, 2023 at 3:59 PM Tom Lane  wrote:
>> TBH, I think this entire proposal is dead in the water.  Which is
>> sad from a performance standpoint, but I can't see any way that
>> we would not regret shipping a feature that makes such assumptions.

> I think it's ridiculous to just hold our breath and pretend like this
> feature isn't needed -- it's at least half a decade overdue. We engage
> in endless hand-wringing over local-remote symmetry in cases where
> other systems seem to effortlessly make that assumption and then get
> on with building new features.

Well, one of the founding principles of postgres_fdw was to be able
to talk to PG servers that are not of the same version as yours.
If we break that in the name of performance, we are going to have
a lot of unhappy users.  Even the ones who do get the benefit of
the speedup are going to be unhappy when it breaks because they
didn't upgrade local and remote at exactly the same time.

Just because we'd like to have it doesn't make the patch workable
in the real world.

regards, tom lane




Re: Partial aggregates pushdown

2023-11-27 Thread Robert Haas
On Mon, Nov 27, 2023 at 3:59 PM Tom Lane  wrote:
> Even if the partial-aggregate serialization value isn't "internal"
> but some more-narrowly-defined type, it is still an internal
> implementation detail of the aggregate.  You have no right to assume
> that the remote server implements the aggregate the same way the
> local one does.  If we start making such an assumption then we'll
> be unable to revise the implementation of an aggregate ever again.
>
> TBH, I think this entire proposal is dead in the water.  Which is
> sad from a performance standpoint, but I can't see any way that
> we would not regret shipping a feature that makes such assumptions.

I think it's ridiculous to just hold our breath and pretend like this
feature isn't needed -- it's at least half a decade overdue. We engage
in endless hand-wringing over local-remote symmetry in cases where
other systems seem to effortlessly make that assumption and then get
on with building new features. It's not that I disagree with the
concern; we're *already* doing stuff that is unprincipled in a bunch
of different areas and that could and occasionally does cause queries
that push things to the remote side to return wrong answers, and I
hate that. But the response to that can't be to refuse to add new
features and maybe rip out the features we already have. Users don't
like it when pushdown causes queries to return wrong answers, but they
like it even less when the pushdown doesn't happen in the first place
and the query runs until the heat death of the universe. I'm not
entirely sure what the right design ideas are here, but giving up and
refusing to add features ought to be completely off the table.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Partial aggregates pushdown

2023-11-27 Thread Tom Lane
Robert Haas  writes:
> Also, I want to make one other point here about security and
> reliability. Right now, there is no way for a user to feed arbitrary
> data to a deserialization function. Since serialization and
> deserialization functions are only used in the context of parallel
> query, we always know that the data fed to the deserialization
> function must have come from the serialization function on the same
> machine. Nor can users call the deserialization function directly with
> arbitrary data of their own choosing, because users cannot call
> functions that take or return internal. But with this system, it
> becomes possible to feed arbitrary data to a deserialization function.

Ouch.  That is absolutely horrid --- we have a lot of stuff that
depends on users not being able to get at "internal" values, and
it sounds like the current proposal breaks all of that.

Quite aside from security concerns, there is no justification for
assuming that the "internal" values used on one platform/PG version
are identical to those used on another.  So if the idea is to
ship back "internal" values from the remote server to the local one,
I think it's basically impossible to make that work.

Even if the partial-aggregate serialization value isn't "internal"
but some more-narrowly-defined type, it is still an internal
implementation detail of the aggregate.  You have no right to assume
that the remote server implements the aggregate the same way the
local one does.  If we start making such an assumption then we'll
be unable to revise the implementation of an aggregate ever again.
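To make that hazard concrete, here is a contrived Python sketch (both state formats are invented): if one server version serializes an avg state as (count, sum) and another expects (sum, count), the bytes round-trip without error but the answer is silently wrong.

```python
def serialize_v1(count, total):
    # hypothetical remote server: state laid out as (count, sum)
    return (count, total)

def finalize_v2(state):
    # hypothetical local server: expects the state as (sum, count)
    total, count = state
    return total / count

state = serialize_v1(2, 10)   # two rows summing to 10; true mean is 5.0
print(finalize_v2(state))     # 0.2 -- silently wrong, no error raised
```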

TBH, I think this entire proposal is dead in the water.  Which is
sad from a performance standpoint, but I can't see any way that
we would not regret shipping a feature that makes such assumptions.

regards, tom lane




Re: Partial aggregates pushdown

2023-11-27 Thread Robert Haas
On Wed, Nov 22, 2023 at 5:16 AM fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> I did not choose Approach2 because I was not confident that the disadvantage 
> mentioned in 2.(2)(a)
> would be accepted by the PostgreSQL development community.
> If it is accepted, I think Approach 2 is smarter.
> Could you please provide your opinion on which
> approach is preferable after comparing these two approaches?
> If we cannot say anything without comparing the amount of source code, as 
> Mr.Momjian mentioned,
> we need to estimate the amount of source code required to implement Approach2.

I've had the same concern, that approach #2 would draw objections, so
I think you were right to be cautious about it. I don't think it is a
wonderful approach in all ways, but I do think that it is superior to
approach #1. If we add dedicated support to the grammar, it is mostly
a one-time effort, and after that, there should not be much need for
anyone to be concerned about it. If we instead add extra aggregates,
then that generates extra work every time someone writes a patch that
adds a new aggregate to core. I have a difficult time believing that
anyone will prefer an approach that involves an ongoing maintenance
effort of that type over one that doesn't.

One point that seems to me to be of particular importance is that if
we add new aggregates, there is a risk that some future aggregate
might do that incorrectly, so that the main aggregate works, but the
secondary aggregate created for this feature does not work. That seems
like it would be very frustrating for future code authors so I'd like
to avoid the risk as much as we can.

Also, I want to make one other point here about security and
reliability. Right now, there is no way for a user to feed arbitrary
data to a deserialization function. Since serialization and
deserialization functions are only used in the context of parallel
query, we always know that the data fed to the deserialization
function must have come from the serialization function on the same
machine. Nor can users call the deserialization function directly with
arbitrary data of their own choosing, because users cannot call
functions that take or return internal. But with this system, it
becomes possible to feed arbitrary data to a deserialization function.
The user could redefine the function on the remote side so that it
produces arbitrary data of their choosing, and the local
deserialization function will ingest it.

That's potentially quite a significant problem. Consider for example
that numericvar_deserialize() does no validity checking on any of the
weight, sign, or dscale, but not all values for those fields are
legal. Right now that doesn't matter, but if you can feed arbitrary
data to that function, then it does. I don't know exactly what the
consequences are if you can get that function to spit out a NumericVar
with values outside the normal legal range. What can you do then?
Store a bogus numeric on disk? Crash the server? Worst case, some
problem like this could be a security issue allowing for escalation to
superuser; more likely, it would be a crash bug, corrupt your
database, or lead to unexpected and strange error messages.

Unfortunately, I have the unpleasant suspicion that most internal-type
aggregates will be affected by this problem. Consider, for example,
string_agg_deserialize(). Generally, strings are one of the
least-constrained data types, so you might hope that this function
would be OK. But it doesn't look very promising. The first int4 in the
serialized representation is the cursor, which would have to be
bounds-checked, lest someone provide a cursor that falls outside the
bounds of the StringInfo and, maybe, cause a reference to an arbitrary
memory location. Then the rest of the representation is the actual
data, which could be anything. This function is used for both bytea
and for text, and for bytea, letting the payload be anything is OK.
But for text, the supplied data shouldn't contain an embedded zero
byte, or otherwise be invalid in the server encoding. If it were, that
would provide a vector to inject invalidly encoded data into the
database. This feature can't be allowed to do that.
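The checks described above can be sketched in Python (the layout is modeled loosely on the int4-cursor-plus-payload shape described in the paragraph, not on actual PostgreSQL code, and UTF-8 stands in for "the server encoding"):

```python
import struct

def validate_string_agg_state(blob: bytes, is_text: bool) -> None:
    """Sanity checks a deserializer would need before trusting a
    remote-supplied string_agg-style state: a big-endian int4 cursor
    followed by the accumulated payload."""
    if len(blob) < 4:
        raise ValueError("truncated state")
    (cursor,) = struct.unpack_from("!i", blob, 0)
    payload = blob[4:]
    if not (0 <= cursor <= len(payload)):   # bounds-check the cursor
        raise ValueError("cursor out of bounds")
    if is_text:
        if b"\x00" in payload:              # text must not embed a zero byte
            raise ValueError("embedded zero byte in text state")
        payload.decode("utf-8")             # must be valid in the encoding

validate_string_agg_state(struct.pack("!i", 0) + b"ab", is_text=True)  # ok
```

For bytea the payload checks vanish, which matches the observation above that bytea is the easy case and text is where invalid data could sneak in.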

What could be a solution to this class of problems? One option is to
just give up on supporting this feature for internal-type aggregates
for now. That's easy enough to do, and just means we have less
functionality, but it's sad because that's functionality we'd like to
have. Another approach is to add necessary sanity checks to the
relevant deserialization functions, but that seems a little hard to
get right, and it would slow down parallel query cases which are
probably going to be more common than the use of this feature. I think
the slowdown might be significant, too. A third option is to change
those aggregates in some way, like giving them a transition function
that operates on some data type other than internal, but there again
we have to be careful of slowdowns. A final option i

Re: Partial aggregates pushdown

2023-11-27 Thread Robert Haas
On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov
 wrote:
> Hi. HAVING is also a problem. Consider the following query
>
> SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to the
> foreign server, as HAVING needs the full aggregate result, but the
> foreign server doesn't know that.

I don't see it that way. What we would push to the foreign server
would be something like SELECT count(a) FROM t. Then, after we get the
results back and combine the various partial counts locally, we would
locally evaluate the HAVING clause afterward. That is, partial
aggregation is a barrier to pushing down HAVING clause itself, but it
doesn't preclude pushing down the aggregation.
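The split described above can be sketched in a few lines of Python (the shard counts are made up): each remote server answers only the partial count, and the HAVING predicate runs locally on the combined total.

```python
# Each shard answers only "SELECT count(a) FROM t" for its partition.
partial_counts = [4, 5, 3]            # pushed-down partial aggregates

# Local combine step, then local evaluation of HAVING count(a) > 10.
total = sum(partial_counts)
result = [total] if total > 10 else []
print(result)  # [12]
```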

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: Partial aggregates pushdown

2023-11-26 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Momjian, Mr.Haas, hackers.

> From: Bruce Momjian 
> Sent: Thursday, November 23, 2023 6:16 AM
> On Wed, Nov 22, 2023 at 10:16:16AM +, 
> fujii.y...@df.mitsubishielectric.co.jp wrote:
> > 2. Approach 2
> > (1) Advantages
> > (a) No need to add partial aggregate functions to the catalogs for
> > each aggregation
> > (2) Disadvantages
> > (a) Need to add non-standard keywords to the SQL syntax.
> >
> > I did not choose Approach2 because I was not confident that the
> > disadvantage mentioned in 2.(2)(a) would be accepted by the PostgreSQL 
> > development community.
> > If it is accepted, I think Approach 2 is smarter.
> > Could you please provide your opinion on which approach is preferable
> > after comparing these two approaches?
> 
> I didn't know #2 was possible, but given the great number of catalog entries, 
> doing it in the SQL grammar seems cleaner
> to me.
Thank you for comments. Yes, I understand.

> From: Bruce Momjian 
> Sent: Wednesday, November 22, 2023 5:34 AM
> On Tue, Nov 21, 2023 at 12:16:41PM -0500, Robert Haas wrote:
> > On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian  wrote:
> > > > I do have a concern about this, though. It adds a lot of bloat. It
> > > > adds a whole lot of additional entries to pg_aggregate, and every
> > > > new aggregate we add in the future will require a bonus entry for
> > > > this, and it needs a bunch of new pg_proc entries as well. One
> > > > idea that I've had in the past is to instead introduce syntax that
> > > > just does this, without requiring a separate aggregate definition in 
> > > > each case.
> > > > For example, maybe instead of changing string_agg(whatever) to
> > > > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> > > > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> > > > something. Then all aggregates could be treated in a generic way.
> > > > I'm not completely sure that's better, but I think it's worth 
> > > > considering.
> > >
> > > So use an SQL keyword to indicate a pushdown call?  We could then
> > > automate the behavior rather than requiring special catalog functions?
> >
> > Right. It would require more infrastructure in the parser, planner,
> > and executor, but it would be infinitely reusable instead of needing a
> > new thing for every aggregate. I think that might be better, but to be
> > honest I'm not totally sure.
> 
> It would make it automatic.  I guess we need to look at how big the patch is 
> to do it.
I will investigate specifically which parts of the PostgreSQL source code need 
to be modified and how big the patch will be if you take this approach.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2023-11-22 Thread Bruce Momjian
On Wed, Nov 22, 2023 at 10:16:16AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> 2. Approach 2
> (1) Advantages
> (a) No need to add partial aggregate functions to the catalogs for each 
> aggregation
> (2) Disadvantages
> (a) Need to add non-standard keywords to the SQL syntax.
> 
> I did not choose Approach2 because I was not confident that the disadvantage 
> mentioned in 2.(2)(a)
> would be accepted by the PostgreSQL development community.
> If it is accepted, I think Approach 2 is smarter.
> Could you please provide your opinion on which
> approach is preferable after comparing these two approaches?

I didn't know #2 was possible, but given the great number of catalog
entries, doing it in the SQL grammar seems cleaner to me.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2023-11-22 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr. Haas, hackers.

Thank you for your thoughtful comments.

> From: Robert Haas 
> Sent: Tuesday, November 21, 2023 5:52 AM
> I do have a concern about this, though. It adds a lot of bloat. It adds a 
> whole lot of additional entries to pg_aggregate, and
> every new aggregate we add in the future will require a bonus entry for this, 
> and it needs a bunch of new pg_proc entries
> as well. One idea that I've had in the past is to instead introduce syntax 
> that just does this, without requiring a separate
> aggregate definition in each case.
> For example, maybe instead of changing string_agg(whatever) to 
> string_agg_p_text_text(whatever), you can say
> PARTIAL_AGGREGATE
> string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or something. 
> Then all aggregates could be treated
> in a generic way. I'm not completely sure that's better, but I think it's 
> worth considering.
I believe this comment addresses a fundamental aspect of the approach.
So, firstly, could we discuss whether we should fundamentally reconsider the 
approach?

The approach adopted in this patch is as follows.
Approach 1: Adding partial aggregation functions to the catalogs(pg_aggregate, 
pg_proc)

The approach proposed by Mr.Haas is as follows.
Approach 2: Adding a keyword to the SQL syntax to indicate partial aggregation 
requests

The amount of code required to implement Approach 2 has not been investigated,
but comparing Approach 1 and Approach 2 in other aspects, 
I believe they each have the following advantages and disadvantages. 

1. Approach 1
(1) Advantages
(a) No need to change the SQL syntax
(2) Disadvantages
(a) Catalog bloat
As Mr.Haas pointed out, the catalog will bloat by adding partial aggregation 
functions (e.g. avg_p_int8(int8)) 
for each individual aggregate function (e.g. avg(int8)) in pg_aggregate and 
pg_proc (theoretically doubling the size).
Some PostgreSQL developers and users may find this uncomfortable.
(b) Increase in manual procedures
Developers of new aggregate functions (both built-in and user-defined) need to 
manually add the partial aggregation
functions when defining the aggregate functions.
However, the procedure for adding partial aggregation functions for a certain 
aggregate function can be automated,
so this problem can be resolved by improving the patch.
For user-defined functions, the automation would go into the core code that
executes the CREATE AGGREGATE command (AggregateCreate() and related functions).
For built-in functions, it would go into generating the initial data for the
pg_aggregate and pg_proc catalogs from pg_aggregate.dat and pg_proc.dat
(using the genbki.pl script and related scripts).

2. Approach 2
(1) Advantages
(a) No need to add partial aggregate functions to the catalogs for each 
aggregation
(2) Disadvantages
(a) Need to add non-standard keywords to the SQL syntax.

I did not choose Approach2 because I was not confident that the disadvantage 
mentioned in 2.(2)(a)
would be accepted by the PostgreSQL development community.
If it is accepted, I think Approach 2 is smarter.
Could you please provide your opinion on which
approach is preferable after comparing these two approaches?
If we cannot say anything without comparing the amount of source code, as 
Mr.Momjian mentioned,
we need to estimate the amount of source code required to implement Approach2.

Sincerely yours,
Yuuki Fujii
 
--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2023-11-21 Thread Alexander Pyhalov

Robert Haas wrote 2023-11-21 20:16:


> I don't think the patch does a good job explaining why HAVING,
> DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> shouldn't really be a problem, because HAVING is basically a WHERE
> clause that occurs after aggregation is complete, and whether or not
> the aggregation is safe shouldn't depend on what we're going to do
> with the value afterward. The HAVING clause can't necessarily be
> pushed to the remote side, but I don't see how or why it could make
> the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> little trickier: if we pushed down DISTINCT, we'd still have to
> re-DISTINCT-ify when combining locally, and if we pushed down ORDER
> BY, we'd have to do a merge pass to combine the returned values unless
> we could prove that the partitions were non-overlapping ranges that
> would be visited in the correct order. Although that all sounds
> doable, I think it's probably a good thing that the current patch
> doesn't try to handle it -- this is complicated already. But it should
> explain why it's not handling it and maybe even a bit about how it
> could be handling in the future, rather than just saying "well, this
> kind of thing is not safe." The trouble with that explanation is that
> it does nothing to help the reader understand whether the thing in
> question is *fundamentally* unsafe or whether we just don't have the
> right code to make it work.

Makes sense.


Actually, I think I was wrong about this. We can't handle ORDER BY or
DISTINCT because we can't distinct-ify or order after we've already
partially aggregated. At least not in general, and not without
additional aggregate support functions. So what I said above was wrong
with respect to those. Or so I believe, anyway. But I still don't see
why HAVING should be a problem.


Hi. HAVING is also a problem. Consider the following query

SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to the
foreign server, as HAVING needs the full aggregate result, but the
foreign server doesn't know that.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2023-11-21 Thread Bruce Momjian
On Tue, Nov 21, 2023 at 12:16:41PM -0500, Robert Haas wrote:
> On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian  wrote:
> > > I do have a concern about this, though. It adds a lot of bloat. It
> > > adds a whole lot of additional entries to pg_aggregate, and every new
> > > aggregate we add in the future will require a bonus entry for this,
> > > and it needs a bunch of new pg_proc entries as well. One idea that
> > > I've had in the past is to instead introduce syntax that just does
> > > this, without requiring a separate aggregate definition in each case.
> > > For example, maybe instead of changing string_agg(whatever) to
> > > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> > > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> > > something. Then all aggregates could be treated in a generic way. I'm
> > > not completely sure that's better, but I think it's worth considering.
> >
> > So use an SQL keyword to indicate a pushdown call?  We could then
> > automate the behavior rather than requiring special catalog functions?
> 
> Right. It would require more infrastructure in the parser, planner,
> and executor, but it would be infinitely reusable instead of needing a
> new thing for every aggregate. I think that might be better, but to be
> honest I'm not totally sure.

It would make it automatic.  I guess we need to look at how big the
patch is to do it.

> > > I don't think the patch does a good job explaining why HAVING,
> > > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> > > shouldn't really be a problem, because HAVING is basically a WHERE
> > > clause that occurs after aggregation is complete, and whether or not
> > > the aggregation is safe shouldn't depend on what we're going to do
> > > with the value afterward. The HAVING clause can't necessarily be
> > > pushed to the remote side, but I don't see how or why it could make
> > > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> > > little trickier: if we pushed down DISTINCT, we'd still have to
> > > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
> > > BY, we'd have to do a merge pass to combine the returned values unless
> > > we could prove that the partitions were non-overlapping ranges that
> > > would be visited in the correct order. Although that all sounds
> > > doable, I think it's probably a good thing that the current patch
> > > doesn't try to handle it -- this is complicated already. But it should
> > > explain why it's not handling it and maybe even a bit about how it
> > > could be handling in the future, rather than just saying "well, this
> > > kind of thing is not safe." The trouble with that explanation is that
> > > it does nothing to help the reader understand whether the thing in
> > > question is *fundamentally* unsafe or whether we just don't have the
> > > right code to make it work.
> >
> > Makes sense.
> 
> Actually, I think I was wrong about this. We can't handle ORDER BY or
> DISTINCT because we can't distinct-ify or order after we've already
> partially aggregated. At least not in general, and not without
> additional aggregate support functions. So what I said above was wrong
> with respect to those. Or so I believe, anyway. But I still don't see
> why HAVING should be a problem.

This should probably be documented in the patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-11-21 Thread Robert Haas
On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian  wrote:
> > I do have a concern about this, though. It adds a lot of bloat. It
> > adds a whole lot of additional entries to pg_aggregate, and every new
> > aggregate we add in the future will require a bonus entry for this,
> > and it needs a bunch of new pg_proc entries as well. One idea that
> > I've had in the past is to instead introduce syntax that just does
> > this, without requiring a separate aggregate definition in each case.
> > For example, maybe instead of changing string_agg(whatever) to
> > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> > something. Then all aggregates could be treated in a generic way. I'm
> > not completely sure that's better, but I think it's worth considering.
>
> So use an SQL keyword to indicate a pushdown call?  We could then
> automate the behavior rather than requiring special catalog functions?

Right. It would require more infrastructure in the parser, planner,
and executor, but it would be infinitely reusable instead of needing a
new thing for every aggregate. I think that might be better, but to be
honest I'm not totally sure.

> > I don't think the patch does a good job explaining why HAVING,
> > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> > shouldn't really be a problem, because HAVING is basically a WHERE
> > clause that occurs after aggregation is complete, and whether or not
> > the aggregation is safe shouldn't depend on what we're going to do
> > with the value afterward. The HAVING clause can't necessarily be
> > pushed to the remote side, but I don't see how or why it could make
> > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> > little trickier: if we pushed down DISTINCT, we'd still have to
> > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
> > BY, we'd have to do a merge pass to combine the returned values unless
> > we could prove that the partitions were non-overlapping ranges that
> > would be visited in the correct order. Although that all sounds
> > doable, I think it's probably a good thing that the current patch
> > doesn't try to handle it -- this is complicated already. But it should
> > explain why it's not handling it and maybe even a bit about how it
> > could be handling in the future, rather than just saying "well, this
> > kind of thing is not safe." The trouble with that explanation is that
> > it does nothing to help the reader understand whether the thing in
> > question is *fundamentally* unsafe or whether we just don't have the
> > right code to make it work.
>
> Makes sense.

Actually, I think I was wrong about this. We can't handle ORDER BY or
DISTINCT because we can't distinct-ify or order after we've already
partially aggregated. At least not in general, and not without
additional aggregate support functions. So what I said above was wrong
with respect to those. Or so I believe, anyway. But I still don't see
why HAVING should be a problem.
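The DISTINCT part of that can be shown in a few lines of Python (shard contents are made up): per-shard distinct counts cannot be combined, because a value appearing on two shards gets counted twice.

```python
shard1 = [1, 2, 3]
shard2 = [3, 4]

# Wrong: summing per-shard distinct counts double-counts the value 3.
naive = len(set(shard1)) + len(set(shard2))   # 5

# Right: DISTINCT needs the raw (or merged) values, not partial counts.
correct = len(set(shard1) | set(shard2))      # 4

print(naive, correct)  # 5 4
```

This is why partial aggregation is off the table for DISTINCT without extra support, whereas a plain count(a) combines with a simple sum.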

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Partial aggregates pushdown

2023-11-20 Thread Bruce Momjian
On Mon, Nov 20, 2023 at 03:51:33PM -0500, Robert Haas wrote:
> On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp
>  wrote:
> > In postgres_fdw.sql, I have corrected the output format for floating point 
> > numbers
> > by extra_float_digits.
> 
> Looking at this, I find that it's not at all clear to me how the
> partial aggregate function is defined. Let's look at what we have for
> documentation:
> 
> +  
> +   Parameter AGGPARTIALFUNC optionally defines a
> +   partial aggregate function used for partial aggregate pushdown; see
> +for details.
> +  
> 
> +   Partial aggregate function (zero if none).
> +   See  for the definition
> +   of partial aggregate function.
> 
> +   Partial aggregate pushdown is an optimization for queries that contains
> +   aggregate expressions for a partitioned table across one or more remote
> +   servers. If multiple conditions are met, partial aggregate function
> 
> +   When partial aggregate pushdown is used for aggregate expressions,
> +   remote queries replace aggregate function calls with partial
> +   aggregate function calls.  If the data type of the state value is not
> 
> But there's no definition of what the behavior of the function is
> anywhere that I can see, not even in <sect2 id="partial-aggregate-pushdown">. Everywhere it only describes how the
> partial aggregate function is used, not what it is supposed to do.

Yes, I had to figure that out myself, and I was wondering how much
detail to have in our docs vs README files vs. C comments.  I think we
should put more details somewhere.

> Looking at the changes in pg_aggregate.dat, it seems like the partial
> aggregate function is a second aggregate defined in a way that mostly
> matches the original, except that (1) if the original final function
> would have returned a data type other than internal, then the final
> function is removed; and (2) if the original final function would have
> returned a value of internal type, then the final function is the
> serialization function of the original aggregate. I think that's a
> reasonable definition, but the documentation and code comments need to
> be a lot clearer.

Agreed.  I wasn't sure enough about this to add it when I was reviewing
the patch.

> I do have a concern about this, though. It adds a lot of bloat. It
> adds a whole lot of additional entries to pg_aggregate, and every new
> aggregate we add in the future will require a bonus entry for this,
> and it needs a bunch of new pg_proc entries as well. One idea that
> I've had in the past is to instead introduce syntax that just does
> this, without requiring a separate aggregate definition in each case.
> For example, maybe instead of changing string_agg(whatever) to
> string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> something. Then all aggregates could be treated in a generic way. I'm
> not completely sure that's better, but I think it's worth considering.

So use an SQL keyword to indicates a pushdown call?  We could then
automate the behavior rather than requiring special catalog functions?

> I think that the control mechanism needs some thought. Right now,
> there are two possible behaviors: either we assume that the local and
> remote sides are the same unconditionally, or we assume that they're
> the same if the remote side is a new enough version. I do like having
> those behaviors available, but I wonder if we need to do something
> better or different. What if somebody wants to push down a
> non-built-in aggregate, for example? I realize that we don't have

It does allow specification of extensions that can be pushed down.

> great solutions to the problem of knowing which functions are
> push-downable in general, and I don't know that partial aggregation
> needs to be any better than anything else, but it's probably worth
> comparing and contrasting the approach we take here with the
> approaches we've taken in other, similar cases. From that point of
> view, I think check_partial_aggregate_support is a novelty: we don't
> do those kinds of checks in other cases, AFAIK. But on the other hand,
> there is the 'extensions' argument to postgres_fdw.

Right.  I am not sure how to improve what the patch does.

> I don't think the patch does a good job explaining why HAVING,
> DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> shouldn't really be a problem, because HAVING is basically a WHERE
> clause that occurs after aggregation is complete, and whether or not
> the aggregation is safe shouldn't depend on what we're going to do
> with the value afterward. The HAVING clause can't necessarily be
> pushed to the remote side, but I don't see how or why it could make
> the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> little trickier: if we pushed down DISTINCT, we'd still have to
> re-DISTINCT-ify when combining locally, and if we pushed down ORDER

Re: Partial aggregates pushdown

2023-11-20 Thread Robert Haas
On Mon, Nov 13, 2023 at 3:26 AM fujii.y...@df.mitsubishielectric.co.jp
 wrote:
> In postgres_fdw.sql, I have corrected the output format for floating point 
> numbers
> by extra_float_digits.

Looking at this, I find that it's not at all clear to me how the
partial aggregate function is defined. Let's look at what we have for
documentation:

+  
+   Paraemter AGGPARTIALFUNC optionally defines a
+   partial aggregate function used for partial aggregate pushdown; see
+for details.
+  

+   Partial aggregate function (zero if none).
+   See  for the definition
+   of partial aggregate function.

+   Partial aggregate pushdown is an optimization for queries that contains
+   aggregate expressions for a partitioned table across one or more remote
+   servers. If multiple conditions are met, partial aggregate function

+   When partial aggregate pushdown is used for aggregate expressions,
+   remote queries replace aggregate function calls with partial
+   aggregate function calls.  If the data type of the state value is not

But there's no definition of what the behavior of the function is
anywhere that I can see, not even in <sect2 id="partial-aggregate-pushdown">. Everywhere it only describes how the
partial aggregate function is used, not what it is supposed to do.

Looking at the changes in pg_aggregate.dat, it seems like the partial
aggregate function is a second aggregate defined in a way that mostly
matches the original, except that (1) if the original final function
would have returned a data type other than internal, then the final
function is removed; and (2) if the original final function would have
returned a value of internal type, then the final function is the
serialization function of the original aggregate. I think that's a
reasonable definition, but the documentation and code comments need to
be a lot clearer.
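In plain Python terms, that inferred definition can be sketched as follows (illustrative names only; the real machinery is C transition/combine/final functions, plus serial/deserial functions for internal-typed states): the remote side runs the transition but skips the final function, and the local side combines states before finalizing once.

```python
# Illustrative sketch of a partial aggregate for avg(int): the remote
# side returns the transition state (sum, count) instead of the
# finalized value. avg_partial/avg_combine/avg_final are hypothetical
# names, not actual catalog entries.

def avg_partial(rows):
    # what the remote "partial aggregate function" would return
    return (sum(rows), len(rows))

def avg_combine(a, b):
    # what the local combine step does with two partial states
    return (a[0] + b[0], a[1] + b[1])

def avg_final(state):
    # the final function, applied only once, locally
    s, n = state
    return s / n if n else None

shard1 = avg_partial([1, 2, 3])   # (6, 3)
shard2 = avg_partial([4, 5])      # (9, 2)
print(avg_final(avg_combine(shard1, shard2)))  # 3.0
```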

I do have a concern about this, though. It adds a lot of bloat. It
adds a whole lot of additional entries to pg_aggregate, and every new
aggregate we add in the future will require a bonus entry for this,
and it needs a bunch of new pg_proc entries as well. One idea that
I've had in the past is to instead introduce syntax that just does
this, without requiring a separate aggregate definition in each case.
For example, maybe instead of changing string_agg(whatever) to
string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
something. Then all aggregates could be treated in a generic way. I'm
not completely sure that's better, but I think it's worth considering.

I think that the control mechanism needs some thought. Right now,
there are two possible behaviors: either we assume that the local and
remote sides are the same unconditionally, or we assume that they're
the same if the remote side is a new enough version. I do like having
those behaviors available, but I wonder if we need to do something
better or different. What if somebody wants to push down a
non-built-in aggregate, for example? I realize that we don't have
great solutions to the problem of knowing which functions are
push-downable in general, and I don't know that partial aggregation
needs to be any better than anything else, but it's probably worth
comparing and contrasting the approach we take here with the
approaches we've taken in other, similar cases. From that point of
view, I think check_partial_aggregate_support is a novelty: we don't
do those kinds of checks in other cases, AFAIK. But on the other hand,
there is the 'extensions' argument to postgres_fdw.

I don't think the patch does a good job explaining why HAVING,
DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
shouldn't really be a problem, because HAVING is basically a WHERE
clause that occurs after aggregation is complete, and whether or not
the aggregation is safe shouldn't depend on what we're going to do
with the value afterward. The HAVING clause can't necessarily be
pushed to the remote side, but I don't see how or why it could make
the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
little trickier: if we pushed down DISTINCT, we'd still have to
re-DISTINCT-ify when combining locally, and if we pushed down ORDER
BY, we'd have to do a merge pass to combine the returned values unless
we could prove that the partitions were non-overlapping ranges that
would be visited in the correct order. Although that all sounds
doable, I think it's probably a good thing that the current patch
doesn't try to handle it -- this is complicated already. But it should
explain why it's not handling it and maybe even a bit about how it
could be handling in the future, rather than just saying "well, this
kind of thing is not safe." The trouble with that explanation is that
it does nothing to help the reader understand whether the thing in
question is *fundamentally* unsafe or whether we just don't have the
right code to make it work.
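The merge pass mentioned for ORDER BY can be sketched with sorted streams (plain Python, illustrative only): each partition returns rows already sorted, and the local side merges the sorted streams instead of re-sorting everything.

```python
# Sketch of the local "merge pass" over per-partition ORDER BY output:
# each shard's rows arrive sorted, and heapq.merge combines the sorted
# streams into one globally sorted result.
import heapq

partition1 = [1, 4, 7]   # remote ORDER BY output, shard 1
partition2 = [2, 4, 9]   # remote ORDER BY output, shard 2

merged = list(heapq.merge(partition1, partition2))
print(merged)  # [1, 2, 4, 4, 7, 9]
```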

Typo: Paraemter

I'm so sorry to keep complaining about comm

Re: Partial aggregates pushdown

2023-10-26 Thread Bruce Momjian
On Fri, Oct 27, 2023 at 02:44:42AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Momjian.
> 
> Thank you for your improvement.
> As a matter of detail, I think that the areas marked below are erroneous.
> 
> --
> +   Pushdown causes aggregate function cals to send partial aggregate
>  ^
> +   function calls to the remote server. If the partial aggregate
> +   function doesn't doesn't exist on the remote server, it causes
>  ^^^
> --

Agreed.  Do you want to fix that on your version?  I don't have any more
improvements to make.

> > What else needs to be done before committers start to review
> > this?
> There are no others. May I make a new version of v31 with your
> suggested improvements for the committer's review?

Yes, please.  I think the updated docs will help people understand how
the patch works.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2023-10-26 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Momjian.

Thank you for your improvement.
As a matter of detail, I think that the areas marked below are erroneous.

--
+   Pushdown causes aggregate function cals to send partial aggregate
 ^
+   function calls to the remote server. If the partial aggregate
+   function doesn't doesn't exist on the remote server, it causes
 ^^^
--

> What else needs to be done before committers start to review
> this?
There are no others. May I make a new version of v31 with your
suggested improvements for the committer's review?

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2023-10-26 Thread Bruce Momjian
On Thu, Oct 26, 2023 at 11:11:09AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> >and checks if the remote server version is older than the local 
> > server version.
> >If so,
> >postgres_fdw
> > -->assumes that for each built-in aggregate function, the partial 
> > aggregate function is not defined
> > -->on the remote server unless the partial aggregate function and the 
> > aggregate
> > -->function match.
> >Otherwise postgres_fdw assumes that for each 
> > built-in aggregate function,
> >the partial aggregate function is defined on the remote server.
> >The default is false.
> >   
> >  
> > 
> > 
> > What does that marked sentence mean?  What is match?  Are one or both of 
> > these remote?  It sounds like you are
> > checking the local aggregate against the remote partial aggregate, but I 
> > don't see any code that does this in the patch.
> This sentence means that
> "If the partial aggregate function has the same OID as the aggregate function,
> then postgres_fdw assumes that for each built-in aggregate function, the 
> partial aggregate function is not defined
>  on the remote server."
> "Match" means that the partial aggregate function has the same OID as the 
> aggregate function in local server.
> But, in v30, there is no code which checks the partial aggregate function has 
> the same OID as the aggregate function in local server.
> So I modified the code of is_builtin_aggpartialfunc_shippable().
> Also, I modified wording postgres-fdw.sgml.

Yes, that is what I needed.  Attached is a modification of your v31
patch (the most recent) that mostly improves the documentation and
comments.  What else needs to be done before committers start to review
this?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.


fdw_partial.diff.gz
Description: application/gzip


Re: Partial aggregates pushdown

2023-10-25 Thread Bruce Momjian
On Tue, Oct 24, 2023 at 12:12:41AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr.Momjian.
> 
> > Fujii-san, to get this patch closer to finished, can I modify this version 
> > of the patch to improve some wording and post an
> > updated version you can use for future changes?
> Yes, I greatly appreciate your offer.
> I would very much appreciate your modifications.

I am almost done updating the patch, but I got stuck on how the feature
is supposed to work.  This documentation sentence is where I got
confused:


 check_partial_aggregate_support 
(boolean)
 
  
   If this option is false, postgres_fdw assumes
   that for each built-in aggregate function,
   the partial aggregate function is defined on the remote server
   without checking the remote server version.
   If this option is true, during query planning,
   postgres_fdw connects to the remote server
   and checks if the remote server version is older than the local 
server version.
   If so,
   postgres_fdw
-->assumes that for each built-in aggregate function, the partial 
aggregate function is not defined
-->on the remote server unless the partial aggregate function and the 
aggregate
-->function match.
   Otherwise postgres_fdw assumes that for each 
built-in aggregate function,
   the partial aggregate function is defined on the remote server.
   The default is false.
  
 


What does that marked sentence mean?  What is match?  Are one or both of
these remote?  It sounds like you are checking the local aggregate
against the remote partial aggregate, but I don't see any code that does
this in the patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2023-10-23 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Momjian.

> Fujii-san, to get this patch closer to finished, can I modify this version of 
> the patch to improve some wording and post an
> updated version you can use for future changes?
Yes, I greatly appreciate your offer.
I would very much appreciate your modifications.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2023-10-23 Thread Bruce Momjian
On Wed, Oct 18, 2023 at 05:22:34AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi hackers.
> 
> Because there is a degrade in pg_dump.c, I fixed it.

Fujii-san, to get this patch closer to finished, can I modify this
version of the patch to improve some wording and post an updated version
you can use for future changes?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-09-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote 2023-09-28 07:40:


I'm not sure that I like these mechanics of adding sort group clauses -
it seems we do additional work in core which is of use only for
one extension, but at least it seems to be working.

We cannot deparse the original sort group clauses and pathtarget
when performing partial aggregate pushdown by any FDWs.
So I think the additional sort group clauses and pathtarget are
needed by any FDWs, not only postgres_fdw.



Hi.
It seems to me that *fdw postfixes don't clarify things, but just make 
the naming uglier.


+ * Adding these Vars and PlaceHolderVars to PathTarget,
+ * FDW cannot deparse this by the original List of SortGroupClauses.
+ * So, before this adding process,
+ * setGroupClausePartial generates another Pathtarget and another
+ * List of SortGroupClauses for FDW.

It seems that something like:

/*
 * Modified PathTarget cannot be used by FDW as-is to deparse this 
statement.

 * So, before modifying PathTarget, setGroupClausePartial generates
 * another PathTarget and another List of SortGroupClauses
 * to make deparsing possible.
 */

sounds better.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2023-09-27 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote 2023-09-27 01:35:

Hi Mr.Momjian, Mr.Pyhalov.

Tuesday, 26 September 2023 22:15 Alexander Pyhalov 
:

Do you mean that extra->partial_target->sortgrouprefs is not replaced,
and so we preserve tlesortgroupref numbers?

Yes, that is correct.


I'm suspicious about rewriting extra->partial_target->exprs with
partial_target->exprs - I'm still not sure why we
don't lose information added by add_column_to_pathtarget() to
extra->partial_target->exprs?



Hi.

In postgres_fdw.sql

"Partial aggregates are unsafe to push down having clause when there are 
partial aggregates" - this comment likely should be fixed.


Some comments should be added to setGroupClausePartial() and to 
make_partial_grouping_target() - especially why setGroupClausePartial()

is called prior to add_new_columns_to_pathtarget().

I'm not sure that I like these mechanics of adding sort group clauses - 
it seems we do additional work in core which is of use only for
one extension, but at least it seems to be working.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2023-09-26 Thread Bruce Momjian
On Tue, Sep 26, 2023 at 06:26:25AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr.Bruce.
> > I think this needs to be explained in the docs.  I am ready to adjust the 
> > patch to improve the wording whenever you are
> > ready.  Should I do it now and post an updated version for you to use?
> The following explanation was omitted from the documentation, so I added it.
> > *  false - no check
> > *  true, remove version < sender - check
> I have responded to your comment, but if there is a problem with the wording, 
> could you please suggest a correction?

I like your new wording, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-09-26 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote 2023-09-25 06:18:

Hi Mr.Bruce, Mr.Pyhalov, Mr.Finnerty, hackers.

Thank you for your valuable comments. I sincerely apologize for the
very late reply.
Here is a response to your comments or a fix to the patch.

Tuesday, August 8, 2023 at 3:31 Bruce Momjian

> I have modified the program except for the point "if the version of the remote 
server is less than PG17".
> Instead, we have addressed the following.
> "If check_partial_aggregate_support is true and the remote server version is 
older than the local server
> version, postgres_fdw does not assume that the partial aggregate function is 
on the remote server unless
> the partial aggregate function and the aggregate function match."
> The reason for this is to maintain compatibility with any aggregate function 
that does not support partial
> aggregate in one version of V1 (V1 is PG17 or higher), even if the next 
version supports partial aggregate.
> For example, string_agg does not support partial aggregation in PG15, but it 
will support partial aggregation
> in PG16.

Just to clarify, I think you are saying:

If check_partial_aggregate_support is true and the remote server
version is older than the local server version, postgres_fdw
checks if the partial aggregate function exists on the remote
server during planning and only uses it if it does.

I tried to phrase it in a positive way, and mentioned the plan time
distinction.  Also, I am sorry I was away for most of July and am just
getting to this.

Thanks for your comment. In the documentation, the description of
check_partial_aggregate_support is as follows
(please see postgres-fdw.sgml).
--
check_partial_aggregate_support (boolean)
Only if this option is true, during query planning, postgres_fdw
connects to the remote server and checks if the remote server version
is older than the local server version. If so, postgres_fdw assumes
that for each built-in aggregate function, the partial aggregate
function is not defined on the remote server unless the partial
aggregate function and the aggregate function match. The default is
false.
--

Thursday, 20 July 2023 19:23 Alexander Pyhalov 
:

fujii.y...@df.mitsubishielectric.co.jp wrote 2023-07-19 03:43:
> Hi Mr.Pyhalov, hackers.

> 3)
> I modified the patch to safely do a partial aggregate pushdown for
> queries which contain having clauses.
>

Hi.
Sorry, but I don't see how it could work.

We apologize for any inconvenience caused.
Thanks to Pyhalov's and Jim's comments, I have realized that I made
a fundamental mistake regarding pushdown of the HAVING clause and the
difficulty of achieving it when performing partial aggregate pushdown.
So I removed the code for pushdown of the HAVING clause during partial
aggregate pushdown.

Thursday, 20 July 2023 19:23 Alexander Pyhalov 
:

As for changes in planner.c (setGroupClausePartial()) I have several
questions.

1) Why don't we add non_group_exprs to pathtarget->exprs when
partial_target->exprs is not set?

2) We replace extra->partial_target->exprs with partial_target->exprs
after processing. Why are we sure that after this tleSortGroupRef is
correct?

Response to 1)
The code you pointed out was unnecessary. I have removed this code.
Also, the process of adding PlaceHolderVar's expr to partial_target was 
missing.

So I fixed this.

Response to 2)
The procedure for making extra->groupClausePartial and extra->partial_target
in make_partial_grouping_target in this patch is as follows.
STEP1. From grouping_target->exprs, extract Aggref, Var and
Placeholdervar that are not included in Aggref.
STEP2. setGroupClausePartial sets the copy of original groupClause to
extra->groupClausePartial
and sets the copy of original partial_target to extra->partial_target.
STEP3. setGroupClausePartial adds Var and Placeholdervar in STEP1 to
partial_target.
The sortgroupref set for each entry added to partial_target->sortgrouprefs is
(the maximum value of the existing sortgrouprefs) + 1.
setGroupClausePartial then adds a SortGroupClause sgc whose
sgc->tleSortGroupRef matches that sortgroupref to the GroupClause list.
STEP4. add_new_columns_to_pathtarget adds STEP1's Aggref to 
partial_target.


Due to STEP2, the list of tlesortgrouprefs set in
extra->groupClausePartial is not duplicated.
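STEP3's ref assignment can be sketched in a few lines (illustrative only; the real code manipulates PathTarget sortgrouprefs arrays in C):

```python
# Hypothetical sketch of STEP3: each expression added to the partial
# target gets a fresh sortgroupref of (max existing) + 1, so the refs
# copied in STEP2 are never duplicated.

def add_with_fresh_refs(sortgrouprefs, new_exprs):
    next_ref = max(sortgrouprefs, default=0) + 1
    for _ in new_exprs:
        sortgrouprefs.append(next_ref)
        next_ref += 1
    return sortgrouprefs

print(add_with_fresh_refs([1, 2, 3], ["b/2", "placeholder"]))  # [1, 2, 3, 4, 5]
```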


Do you mean that extra->partial_target->sortgrouprefs is not replaced, 
and so we preserve tlesortgroupref numbers?
I'm suspicious about rewriting extra->partial_target->exprs with 
partial_target->exprs - I'm still not sure why we
don't lose information added by add_column_to_pathtarget() to 
extra->partial_target->exprs?


Also look at the following example.

EXPLAIN VERBOSE SELECT  count(*) , (b/2)::numeric FROM pagg_tab GROUP BY 
b/2 ORDER BY 1;

QUERY PLAN
---
 Sort  (cost=511.35..511.47 rows=50 width=44)

Re: Partial aggregates pushdown

2023-09-25 Thread Bruce Momjian
On Mon, Sep 25, 2023 at 03:18:13AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr.Bruce, Mr.Pyhalov, Mr.Finnerty, hackers.
> 
> Thank you for your valuable comments. I sincerely apologize for the very late 
> reply.
> Here is a response to your comments or a fix to the patch.
> 
> Tuesday, August 8, 2023 at 3:31 Bruce Momjian
> > > I have modified the program except for the point "if the version of the 
> > > remote server is less than PG17".
> > > Instead, we have addressed the following.
> > > "If check_partial_aggregate_support is true and the remote server version 
> > > is older than the local server
> > > version, postgres_fdw does not assume that the partial aggregate function 
> > > is on the remote server unless
> > > the partial aggregate function and the aggregate function match."
> > > The reason for this is to maintain compatibility with any aggregate 
> > > function that does not support partial
> > > aggregate in one version of V1 (V1 is PG17 or higher), even if the next 
> > > version supports partial aggregate.
> > > For example, string_agg does not support partial aggregation in PG15, but 
> > > it will support partial aggregation
> > > in PG16.
> >
> > Just to clarify, I think you are saying:
> >
> > If check_partial_aggregate_support is true and the remote server
> > version is older than the local server version, postgres_fdw
> > checks if the partial aggregate function exists on the remote
> > server during planning and only uses it if it does.
> >
> > I tried to phrase it in a positive way, and mentioned the plan time
> > distinction.  Also, I am sorry I was away for most of July and am just
> > getting to this.
> Thanks for your comment. In the documentation, the description of 
> check_partial_aggregate_support is as follows
> (please see postgres-fdw.sgml).
> --
> check_partial_aggregate_support (boolean)
> Only if this option is true, during query planning, postgres_fdw connects to 
> the remote server and checks if the remote server version is older than the 
> local server version. If so, postgres_fdw assumes that for each built-in 
> aggregate function, the partial aggregate function is not defined on the 
> remote server unless the partial aggregate function and the aggregate 
> function match. The default is false.
> --

My point is that there are three behaviors:

*  false - no check
*  true, remote version >= sender - no check
*  true, remove version < sender - check

Here is your code:

+ * Check that a built-in aggpartialfunc exists on the remote server. If
+ * check_partial_aggregate_support is false, we assume the partial aggregate
+ * function exists on the remote server. Otherwise we assume the partial
+ * aggregate function exists on the remote server only if the remote server
+ * version is not less than the local server version.
+ */
+static bool
+is_builtin_aggpartialfunc_shippable(Oid aggpartialfn, PgFdwRelationInfo *fpinfo)
+{
+	bool		shippable = true;
+
+	if (fpinfo->check_partial_aggregate_support)
+	{
+		if (fpinfo->remoteversion == 0)
+		{
+			PGconn	   *conn = GetConnection(fpinfo->user, false, NULL);
+
+			fpinfo->remoteversion = PQserverVersion(conn);
+		}
+		if (fpinfo->remoteversion < PG_VERSION_NUM)
+			shippable = false;
+	}
+	return shippable;
+}
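Stripped of connection handling, the decision this function implements reduces to the following (a Python paraphrase of the quoted C, not the actual code path; PG_VERSION_NUM here is an assumed local version, with PostgreSQL encoding 17.0 as 170000):

```python
# Python paraphrase of is_builtin_aggpartialfunc_shippable: the three
# behaviors listed above (false: no check; true with newer-or-equal
# remote: shippable; true with older remote: not shippable).

PG_VERSION_NUM = 170000  # assumed local server version

def partial_agg_shippable(check_option: bool, remote_version: int) -> bool:
    if not check_option:
        return True                          # no check: assume shippable
    return remote_version >= PG_VERSION_NUM  # ship only if remote >= local

print(partial_agg_shippable(False, 150000))  # True
print(partial_agg_shippable(True, 180000))   # True
print(partial_agg_shippable(True, 150000))   # False
```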

I think this needs to be explained in the docs.  I am ready to adjust
the patch to improve the wording whenever you are ready.  Should I do it
now and post an updated version for you to use?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-08-07 Thread Bruce Momjian
On Mon, Jul 10, 2023 at 07:35:27AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> > > I will add a postgres_fdw option "check_partial_aggregate_support".
> > > This option is false, default.
> > > Only if this option is true, postgres_fdw connect to the remote server 
> > > and get the version of the remote server.
> > > And if the version of the remote server is less than PG17, then partial 
> > > aggregate push down to the remote server is
> > disable.
> > 
> > Great!
> I have modified the program except for the point "if the version of the 
> remote server is less than PG17".
> Instead, we have addressed the following.
> "If check_partial_aggregate_support is true and the remote server version is 
> older than the local server
> version, postgres_fdw does not assume that the partial aggregate function is 
> on the remote server unless
> the partial aggregate function and the aggregate function match."
> The reason for this is to maintain compatibility with any aggregate function 
> that does not support partial
> aggregate in one version of V1 (V1 is PG17 or higher), even if the next 
> version supports partial aggregate.
> For example, string_agg does not support partial aggregation in PG15, but it 
> will support partial aggregation
> in PG16.

Just to clarify, I think you are saying:

If check_partial_aggregate_support is true and the remote server
version is older than the local server version, postgres_fdw
checks if the partial aggregate function exists on the remote
server during planning and only uses it if it does.

I tried to phrase it in a positive way, and mentioned the plan time
distinction.  Also, I am sorry I was away for most of July and am just
getting to this.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-08-01 Thread Finnerty, Jim
When it is valid to filter based on a HAVING clause predicate, it should 
already have been converted into a WHERE clause predicate, except in the 
special case of a LIMIT k ... ORDER BY query, where the HAVING clause 
predicate can be determined approximately after having found k fully 
qualified tuples, and then that predicate is successively tightened as more 
qualified records are found.

*that*, by the way, is a very powerful optimization.

/Jim F

On 7/20/23, 6:24 AM, "Alexander Pyhalov" <a.pyha...@postgrespro.ru> wrote:



fujii.y...@df.mitsubishielectric.co.jp wrote 2023-07-19 03:43:
> Hi Mr.Pyhalov, hackers.


> 3)
> I modified the patch to safely do a partial aggregate pushdown for
> queries which contain having clauses.
>


Hi.
Sorry, but I don't see how it could work.
For example, the attached test returns wrong result:


CREATE FUNCTION f() RETURNS INT AS $$
begin
return 10;
end
$$ LANGUAGE PLPGSQL;


SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY
1;
  b | sum
----+-----
  0 |   0
 10 |   0
 20 |   0
 30 |   0
 40 |   0
(5 rows)


In fact the above query should have returned 0 rows, as


SELECT b, sum(a) FROM pagg_tab GROUP BY b ORDER BY 1;
  b | sum
----+------
  0 |  600
  1 |  660
  2 |  720
  3 |  780
  4 |  840
  5 |  900
  6 |  960
  7 | 1020
  8 | 1080
  9 | 1140
 10 |  600
 11 |  660
 12 |  720
shows no such rows.


Or, on the same data


SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) > 660 ORDER BY
1;


You'll get 0 rows.


But
SELECT b, sum(a) FROM pagg_tab GROUP BY b;
  b | sum
----+------
 42 |  720
 29 | 1140
  4 |  840
 34 |  840
 41 |  660
  0 |  600
 40 |  600


The issue is that you can't calculate a "partial" HAVING. You should
compare the full aggregate in the filter, but that's not possible at the
level of one partition.
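The failure mode can be reproduced in a few lines of plain Python (a sketch, not the planner code): filtering each partition's partial sum drops a group whose per-partition sums are all small even though its full sum qualifies.

```python
# Sketch of the bug: applying HAVING sum(a) > 660 to each partition's
# partial sum drops group "b0" on every shard, even though the combined
# sum (800) passes the filter.

partition1 = {"b0": 400}   # partial sum for group b0 on shard 1
partition2 = {"b0": 400}   # partial sum for group b0 on shard 2

# Wrong: filter the partial states, as the pushed-down plan did
survivors = {k: v for k, v in partition1.items() if v > 660}
# b0 is dropped on every shard, so the final result is empty ...

# Right: combine the partial states first, then filter
combined = {k: partition1.get(k, 0) + partition2.get(k, 0)
            for k in set(partition1) | set(partition2)}
correct = {k: v for k, v in combined.items() if v > 660}
print(survivors)  # {}
print(correct)    # {'b0': 800}
```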
And you have this in plans


Finalize GroupAggregate
Output: pagg_tab.b, avg(pagg_tab.a), max(pagg_tab.a), count(*)
Group Key: pagg_tab.b
Filter: (sum(pagg_tab.a) < 700)
-> Sort
Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL
max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
Sort Key: pagg_tab.b
-> Append
-> Foreign Scan
Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)),
(PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
Filter: ((PARTIAL sum(pagg_tab.a)) < 700) 
<--- here we can't compare anything yet, sum is incomplete.
Relations: Aggregate on (public.fpagg_tab_p1
pagg_tab)
Remote SQL: SELECT b, avg_p_int4(a), max(a),
count(*), sum(a) FROM public.pagg_tab_p1 GROUP BY 1
-> Foreign Scan
Output: pagg_tab_1.b, (PARTIAL avg(pagg_tab_1.a)),
(PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*)), (PARTIAL
sum(pagg_tab_1.a))
Filter: ((PARTIAL sum(pagg_tab_1.a)) < 700)
Relations: Aggregate on (public.fpagg_tab_p2
pagg_tab_1)
Remote SQL: SELECT b, avg_p_int4(a), max(a),
count(*), sum(a) FROM public.pagg_tab_p2 GROUP BY 1
-> Foreign Scan
Output: pagg_tab_2.b, (PARTIAL avg(pagg_tab_2.a)),
(PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*)), (PARTIAL
sum(pagg_tab_2.a))
Filter: ((PARTIAL sum(pagg_tab_2.a)) < 700)
Relations: Aggregate on (public.fpagg_tab_p3
pagg_tab_2)
Remote SQL: SELECT b, avg_p_int4(a), max(a),
count(*), sum(a) FROM public.pagg_tab_p3 GROUP BY 1


In foreign_grouping_ok()
6586 if (IsA(expr, Aggref))
6587 {
6588 if (partial)
6589 {
6590 mark_partial_aggref((Aggref
*) expr, AGGSPLIT_INITIAL_SERIAL);
6591 continue;
6592 }
6593 else if (!is_foreign_expr(root,
grouped_rel, expr))
6594 return false;
6595
6596 tlist = add_to_flat_tlist(tlist,
list_make1(expr));
6597 }


at least you shouldn't do anything with expr, if is_foreign_expr()
returned false. If we restrict pushing down queries with havingQuals,
I'm not quite sure how Aggref can appear in local_conds.


As for changes in planner.c (setGroupClausePartial()) I have several
questions.


1) Why don't we add non_group_exprs to pathtarget->exprs when
partial_target->exprs is not set?


2) We replace extra->partial_target->exprs with partial_target->exprs
after processing. Why are we sure that after this tleSortGroupRef is
correct?


--
Best regards,
Alexander Pyhalov,
Postgres Professional





Re: Partial aggregates pushdown

2023-07-20 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-07-19 03:43:

Hi Mr.Pyhalov, hackers.



3)
I modified the patch to safely do a partial aggregate pushdown for
queries which contain having clauses.



Hi.
Sorry, but I don't see how it could work.
For example, the attached test returns a wrong result:

CREATE FUNCTION f() RETURNS INT AS $$
begin
  return 10;
end
$$ LANGUAGE PLPGSQL;

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 
1;

 b  | sum
----+-----
  0 |   0
 10 |   0
 20 |   0
 30 |   0
 40 |   0
(5 rows)

In fact the above query should have returned 0 rows, as

SELECT b, sum(a) FROM pagg_tab GROUP BY b ORDER BY 1;
 b  | sum
----+------
  0 |  600
  1 |  660
  2 |  720
  3 |  780
  4 |  840
  5 |  900
  6 |  960
  7 | 1020
  8 | 1080
  9 | 1140
 10 |  600
 11 |  660
 12 |  720

shows no such rows.

Or, on the same data

SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) > 660 ORDER BY 
1;


You'll get 0 rows.

But
SELECT b, sum(a) FROM pagg_tab GROUP BY b;
 b  | sum
----+------
 42 |  720
 29 | 1140
  4 |  840
 34 |  840
 41 |  660
  0 |  600
 40 |  600
gives.

The issue is that you can't calculate a "partial" HAVING. The filter must 
compare the full aggregate value, but that's not possible at the level of 
one partition.
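A tiny sketch of this point (hypothetical data, not PostgreSQL internals): filtering per-partition partial states lets groups survive whose full aggregate would fail the HAVING condition.

```python
# Two partitions of the same table, rows as (b, a) pairs.
partition1 = [(0, 300), (1, 400)]   # rows on the first remote server
partition2 = [(0, 350), (1, 400)]   # rows on the second remote server

def partial_sums(rows):
    """Per-group partial sum(a), as each remote server would compute it."""
    sums = {}
    for b, a in rows:
        sums[b] = sums.get(b, 0) + a
    return sums

p1, p2 = partial_sums(partition1), partial_sums(partition2)

# Correct: combine the partial states first, then apply HAVING sum(a) < 700.
total = {}
for part in (p1, p2):
    for b, s in part.items():
        total[b] = total.get(b, 0) + s
correct = {b for b, s in total.items() if s < 700}

# Broken: apply the filter to each incomplete partial state.  Every partial
# sum is below 700, so no group is eliminated on any shard.
broken = {b for part in (p1, p2) for b, s in part.items() if s < 700}

print(correct)  # group 1 totals 800 and is filtered out
print(broken)   # group 1 wrongly survives the per-partition filter
```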

And you have this in plans

 Finalize GroupAggregate
   Output: pagg_tab.b, avg(pagg_tab.a), max(pagg_tab.a), count(*)
   Group Key: pagg_tab.b
   Filter: (sum(pagg_tab.a) < 700)
   ->  Sort
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), (PARTIAL 
max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))

 Sort Key: pagg_tab.b
 ->  Append
   ->  Foreign Scan
 Output: pagg_tab.b, (PARTIAL avg(pagg_tab.a)), 
(PARTIAL max(pagg_tab.a)), (PARTIAL count(*)), (PARTIAL sum(pagg_tab.a))
 Filter: ((PARTIAL sum(pagg_tab.a)) < 700)    
<--- here we can't compare anything yet, sum is incomplete.
 Relations: Aggregate on (public.fpagg_tab_p1 
pagg_tab)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p1 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_1.b, (PARTIAL avg(pagg_tab_1.a)), 
(PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_1.a))

 Filter: ((PARTIAL sum(pagg_tab_1.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p2 
pagg_tab_1)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p2 GROUP BY 1

   ->  Foreign Scan
 Output: pagg_tab_2.b, (PARTIAL avg(pagg_tab_2.a)), 
(PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*)), (PARTIAL 
sum(pagg_tab_2.a))

 Filter: ((PARTIAL sum(pagg_tab_2.a)) < 700)
 Relations: Aggregate on (public.fpagg_tab_p3 
pagg_tab_2)
 Remote SQL: SELECT b, avg_p_int4(a), max(a), 
count(*), sum(a) FROM public.pagg_tab_p3 GROUP BY 1
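The PARTIAL entries in the plan above ship transition states, not finished values. A rough model of what the remote avg_p_int4(a) call and the local Finalize step exchange (a sketch only; the actual C implementation and the avg_p_int4 name belong to the patch, and avg's real state also tracks more than this):

```python
def avg_partial(values):
    """What the remote server ships back: the avg transition state (sum, count)."""
    return (sum(values), len(values))

def avg_combine(state1, state2):
    """What the local Finalize GroupAggregate does with two shards' states."""
    return (state1[0] + state2[0], state1[1] + state2[1])

def avg_final(state):
    """Finalization: only now can the true average be computed."""
    s, n = state
    return s / n

shard1 = avg_partial([10, 20, 30])   # (60, 3)
shard2 = avg_partial([40])           # (40, 1)
print(avg_final(avg_combine(shard1, shard2)))  # 25.0
```

Note that averaging the per-shard averages (20.0 and 40.0) would give 30.0, not 25.0, which is why the state must cross the wire instead of the finished value.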


In foreign_grouping_ok()
6586 if (IsA(expr, Aggref))
6587 {
6588 if (partial)
6589 {
6590 mark_partial_aggref((Aggref 
*) expr, AGGSPLIT_INITIAL_SERIAL);

6591 continue;
6592 }
6593 else if (!is_foreign_expr(root, 
grouped_rel, expr))

6594 return false;
6595
6596 tlist = add_to_flat_tlist(tlist, 
list_make1(expr));

6597 }

At least you shouldn't do anything with expr if is_foreign_expr() 
returned false. If we restrict pushing down queries with havingQuals, 
I'm not quite sure how an Aggref can appear in local_conds.


As for changes in planner.c (setGroupClausePartial()) I have several 
questions.


1) Why don't we add non_group_exprs to pathtarget->exprs when 
partial_target->exprs is not set?


2) We replace extra->partial_target->exprs with partial_target->exprs 
after processing. Why are we sure that after this tleSortGroupRef is 
correct?


--
Best regards,
Alexander Pyhalov,
Postgres Professional

diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 1824cb67fe9..c6f613019c3 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -3025,6 +3025,22 @@ EXPLAIN (VERBOSE, COSTS OFF)
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
 SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b HAVING sum(a) < 700 ORDER BY 1;
 
+CREATE FUNCTION f() RETURNS INT AS $$
+begin
+  return 10;
+end 
+$$ LANGUAGE PLPGSQL;
+
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b, sum(a) FROM pagg_tab GROUP BY b HAVING sum(a) < f() ORDER BY 1;
+SELECT b, sum(a

RE: Partial aggregates pushdown

2023-07-17 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Pyhalov.

> From: Alexander Pyhalov 
> Sent: Friday, July 14, 2023 10:40 PM
> 1) In foreign_join_ok() should we set fpinfo->user if
> fpinfo->check_partial_aggregate_support is set like it's done for 
> fpinfo->use_remote_estimate? It seems we can end up with fpinfo->user =
> NULL if use_remote_estimate is not set.
You are right. I will modify this patch according to your advice.
Thank you for advice.

> 2) It seems we found an additional issue with the original patch, which 
> is present in the current one. I'm attaching a patch which seems to fix 
> it, but I'm not quite sure about it.
Thank you for pointing out the issue.
If a query's GROUP BY clause contains an expression based on a variable (not a
bare variable) and the query's SELECT clause contains another expression,
the partial aggregate could be unsafe to push down.

An example of such queries:
SELECT (b/2)::numeric, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b/2

Your patch disables partial aggregate pushdown for such queries.
I'll see if we can modify the patch to safely do a partial aggregate pushdown 
for such queries as well.
Such a query expects the variable in the SELECT clause expression to be 
included in the target of the grouped rel
(see make_partial_grouping_target), 
but the original GROUP BY clause has no reference to this variable;
this seems to be the direct cause (see foreign_grouping_ok). 
I will examine whether a safe pushdown can be achieved by matching the
GROUP BY clause information referenced by foreign_grouping_ok against the 
grouped rel's target information.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation


Re: Partial aggregates pushdown

2023-07-14 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote on 2023-07-10 10:35:

I have modified the program except for the point "if the version of
the remote server is less than PG17".
Instead, we have addressed the following.
"If check_partial_aggregate_support is true and the remote server
version is older than the local server
version, postgres_fdw does not assume that the partial aggregate
function is on the remote server unless
the partial aggregate function and the aggregate function match."
The reason for this is to maintain compatibility with any aggregate
function that does not support partial
aggregate in one version of V1 (V1 is PG17 or higher), even if the
next version supports partial aggregate.
For example, string_agg does not support partial aggregation in PG15,
but it will support partial aggregation
in PG16.



Hi.

1) In foreign_join_ok() should we set fpinfo->user if 
fpinfo->check_partial_aggregate_support is set like it's done for 
fpinfo->use_remote_estimate? It seems we can end up with fpinfo->user = 
NULL if use_remote_estimate is not set.


2) It seems we found an additional issue with the original patch, which is 
present in the current one. I'm attaching a patch which seems to fix it, but 
I'm not quite sure about it.




We have not been able to add a test for the case where the remote
server version is older than the
local server version to the regression test. Is there any way to add
such tests to the existing regression
tests?



--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 187d15185200aabc22c5219bbe636bc950670a78 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Fri, 14 Jul 2023 16:34:02 +0300
Subject: [PATCH] For partial aggregation we can't rely on the fact that every
 var is a part of some GROUP BY expression

---
 .../postgres_fdw/expected/postgres_fdw.out| 120 ++
 contrib/postgres_fdw/postgres_fdw.c   |  31 +++--
 contrib/postgres_fdw/sql/postgres_fdw.sql |   9 ++
 3 files changed, 152 insertions(+), 8 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 8d80ba0a6be..31ee461045c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -9955,6 +9955,126 @@ SELECT b, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b ORDER BY 1;
  49 | 19. |  29 |60
 (50 rows)
 
+EXPLAIN (VERBOSE, COSTS OFF)
+SELECT b/2, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b/2 ORDER BY 1;
+   QUERY PLAN   
+
+ Sort
+   Output: ((pagg_tab.b / 2)), (avg(pagg_tab.a)), (max(pagg_tab.a)), (count(*))
+   Sort Key: ((pagg_tab.b / 2))
+   ->  Finalize HashAggregate
+ Output: ((pagg_tab.b / 2)), avg(pagg_tab.a), max(pagg_tab.a), count(*)
+ Group Key: ((pagg_tab.b / 2))
+ ->  Append
+   ->  Foreign Scan
+ Output: ((pagg_tab.b / 2)), (PARTIAL avg(pagg_tab.a)), (PARTIAL max(pagg_tab.a)), (PARTIAL count(*))
+ Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
+ Remote SQL: SELECT (b / 2), avg_p_int4(a), max(a), count(*) FROM public.pagg_tab_p1 GROUP BY 1
+   ->  Foreign Scan
+ Output: ((pagg_tab_1.b / 2)), (PARTIAL avg(pagg_tab_1.a)), (PARTIAL max(pagg_tab_1.a)), (PARTIAL count(*))
+ Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab_1)
+ Remote SQL: SELECT (b / 2), avg_p_int4(a), max(a), count(*) FROM public.pagg_tab_p2 GROUP BY 1
+   ->  Foreign Scan
+ Output: ((pagg_tab_2.b / 2)), (PARTIAL avg(pagg_tab_2.a)), (PARTIAL max(pagg_tab_2.a)), (PARTIAL count(*))
+ Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab_2)
+ Remote SQL: SELECT (b / 2), avg_p_int4(a), max(a), count(*) FROM public.pagg_tab_p3 GROUP BY 1
+(19 rows)
+
+SELECT b/2, avg(a), max(a), count(*) FROM pagg_tab GROUP BY b/2 ORDER BY 1;
+ ?column? | avg | max | count 
+--+-+-+---
+0 | 10.5000 |  21 |   120
+1 | 12.5000 |  23 |   120
+2 | 14.5000 |  25 |   120
+3 | 16.5000 |  27 |   120
+4 | 18.5000 |  29 |   120
+5 | 10.5000 |  21 |   120
+6 | 12.5000 |  23 |   120
+7 | 14.5000 |  25 |   120
+8 | 16.5000 |  27 |   120
+9 | 18.5000 |  29 |   120
+   10 | 10.5000 |  21 |   120
+   11 | 12.5000 |  23 |   120
+   12 | 14.5000 |  25 |   120
+   13 | 16.5000 |  27 |   120
+  

Re: Partial aggregates pushdown

2023-06-22 Thread Bruce Momjian
On Thu, Jun 22, 2023 at 05:23:33AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Approach 1-3:
> I will add a postgres_fdw option "check_partial_aggregate_support".
> This option is false by default.
> Only if this option is true does postgres_fdw connect to the remote server and 
> get the version of the remote server.
> And if the version of the remote server is less than PG17, then partial 
> aggregate pushdown to the remote server is disabled.

Great!

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2023-06-21 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Momjian, Mr.Pyhalov, hackers.

> From: Bruce Momjian 
> Sent: Thursday, June 22, 2023 12:44 AM
> On Tue, Jun 20, 2023 at 09:59:11AM +0300, Alexander Pyhalov wrote:
> > > Therefore, it seems like it would be near-zero cost to just call
> > > conn =
> > > GetConnection() and then PQserverVersion(conn), and ReleaseConnection().
> > > You can then use the return value of PQserverVersion() to determine
> > > if you can push down partial aggregates.
> >
> > Hi.
> > Currently we don't get remote connection while planning if
> > use_remote_estimate is not set.
> > Such change would require to get remote connection in planner, not in
> > executor.
> > This can lead to change of behavior (like errors in explain when user
> > mapping is wrong - e.g. bad password is specified).
> > Also this potentially can lead to establishing connections even when
> > plan node is not actually used (like extreme example - select
> > sum(score) from t limit 0).
> > I'm not saying we shouldn't do it - just hint at possible consequences.
> 
> Agreed.  I noticed it was doing FDW connections during optimization, but 
> didn't see the postgres_fdw option that would
> turn it off.
> Interestingly, it is disabled by default.
> 
> After considering the options, I think we should have a postgres_fdw option 
> called "planner_version_check", and default it to false.  When false, a remote 
> server version check will not be performed during planning and partial 
> aggregates will always be considered.  When true, a version check will be 
> performed during planning and partial aggregate pushdown will be disabled for 
> pre-PG 17 foreign servers during the query.
> 
> If we want to be more specific, we can call it 
> "check_partial_aggregate_support".
Thank you both for your advice.
We will address the compatibility issues as follows.

Approach 1-3:
I will add a postgres_fdw option "check_partial_aggregate_support".
This option is false by default.
Only if this option is true does postgres_fdw connect to the remote server and get 
the version of the remote server.
And if the version of the remote server is less than PG17, then partial 
aggregate pushdown to the remote server is disabled.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation





Re: Partial aggregates pushdown

2023-06-21 Thread Bruce Momjian
On Tue, Jun 20, 2023 at 09:59:11AM +0300, Alexander Pyhalov wrote:
> > Therefore, it seems like it would be near-zero cost to just call conn =
> > GetConnection() and then PQserverVersion(conn), and ReleaseConnection().
> > You can then use the return value of PQserverVersion() to determine if
> > you can push down partial aggregates.
> 
> Hi.
> Currently we don't get remote connection while planning if
> use_remote_estimate is not set.
> Such change would require to get remote connection in planner, not in
> executor.
> This can lead to change of behavior (like errors in explain when user
> mapping is wrong - e.g. bad password is specified).
> Also this potentially can lead to establishing connections even when plan
> node is not actually used
> (like extreme example - select sum(score) from t limit 0).
> I'm not saying we shouldn't do it - just hint at possible consequences.

Agreed.  I noticed it was doing FDW connections during optimization, but
didn't see the postgres_fdw option that would turn it off. 
Interestingly, it is disabled by default.

After considering the options, I think we should have a postgres_fdw
option called "planner_version_check", and default it to false.  When
false, a remote server version check will not be performed during
planning and partial aggregates will always be considered.  When
true, a version check will be performed during planning and partial
aggregate pushdown will be disabled for pre-PG 17 foreign servers during
the query.

If we want to be more specific, we can call it
"check_partial_aggregate_support".

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-06-19 Thread Alexander Pyhalov

Bruce Momjian wrote on 2023-06-20 03:42:

Apologies for the delay in my reply to this email.  I looked into the
existing code and I found three things:

1)  PQserverVersion() just pulls the conn->sversion value from the
existing connection because pqSaveParameterStatus() pulls the
server_version sent by the backend;  no need to issue SELECT version().

2)  postgres_fdw already has nine calls to GetConnection(), and only
opens a connection if it already doesn't have one.  Here is an example:

/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);

Therefore, it seems like it would be near-zero cost to just call conn =
GetConnection() and then PQserverVersion(conn), and 
ReleaseConnection().

You can then use the return value of PQserverVersion() to determine if
you can push down partial aggregates.



Hi.
Currently we don't get remote connection while planning if 
use_remote_estimate is not set.
Such change would require to get remote connection in planner, not in 
executor.
This can lead to change of behavior (like errors in explain when user 
mapping is wrong - e.g. bad password is specified).
Also this potentially can lead to establishing connections even when 
plan node is not actually used

(like extreme example - select sum(score) from t limit 0).
I'm not saying we shouldn't do it - just hint at possible consequences.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2023-06-19 Thread Bruce Momjian
On Tue, Jun 13, 2023 at 02:18:15AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr.Momjian.
> 
> Thank you for advises.
> 
> > From: Bruce Momjian 
> > Sent: Monday, June 12, 2023 10:38 PM
> > > I understand what the problem is. I will put a mechanism maintaining 
> > > compatibility into the patch.
> > > I believe there are three approaches.
> > > Approach 1-1 is preferable because it does not require additional options 
> > > for postgres_fdw.
> > > I will revise the patch according to Approach 1-1, unless otherwise 
> > > commented.
> > >
> > > Approach1:
> > > I ensure that postgres_fdw retrieves the version of each remote server
> > > and does not push down partial aggregates if the server version is less 
> > > than 17.
> > > There are two approaches to obtaining remote server versions.
> > > Approach1-1: postgres_fdw connects a remote server and use 
> > > PQserverVersion().
> > > Approach1-2: Adding a postgres_fdw option about a remote server version 
> > > (like "server_version").
> > >
> > > Approach2:
> > > Adding a postgres_fdw option for partial aggregate pushdown is enable
> > > or not (like enable_partial_aggregate_pushdown).
> > 
> > These are good questions.  Adding a postgres_fdw option called 
> > enable_partial_aggregate_pushdown helps make the
> > purpose of the option clear, but remote_version can be used for future 
> > breakage as well.
> > 
> > I think remote_version is the best idea, and in the documentation for the 
> > option, let's explicitly say it is useful to disable
> > partial aggregates pushdown on pre-PG 17 servers.  If we need to use the 
> > option for other cases, we can just update the
> > documentation.  When the option is blank, the default, everything is pushed 
> > down.
> > 
> > I see remote_version a logical addition to match our "extensions" option 
> > that controls what extension functions can be
> > pushed down.
> 
> Thank you for your perspective.
> So, of the approaches I have presented, you think that Approach 1-2 is
> preferable and that the option name remote_version is preferable?
> Indeed, the option of a remote version may have other uses.
> However, since this information can be obtained by connecting to the remote 
> server, I'm concerned that some people may find this option redundant.
> 
> Is the problem with approach 1-1 because the user cannot decide whether to 
> include the compatibility check in the decision to do partial aggregate 
> pushdown or not?
> # If Approach 1-1 is taken, the problem is that this feature cannot be used 
> # for all built-in aggregate functions when the remote server is older than PG17.
> If so, Approach 1-3 below seems more desirable.
> Would it be possible to hear your thoughts?
> 
> Approach 1-3: We add a postgres_fdw option for a compatibility check for 
> partial aggregate pushdown
> (like "enable_aggpartialfunc_compatibility_check"). This option is false by 
> default.
> When this option is true, postgres_fdw obtains the remote server version by 
> connecting to the remote server and using PQserverVersion(). 
> And if the remote server version is older than PG17, then the partial 
> aggregate pushdown feature is disabled for all built-in aggregate functions.
> Otherwise the partial aggregate pushdown feature is enabled for all built-in 
> aggregate functions.

Apologies for the delay in my reply to this email.  I looked into the
existing code and I found three things:

1)  PQserverVersion() just pulls the conn->sversion value from the
existing connection because pqSaveParameterStatus() pulls the
server_version sent by the backend;  no need to issue SELECT version().

2)  postgres_fdw already has nine calls to GetConnection(), and only
opens a connection if it already doesn't have one.  Here is an example:

/* Get the remote estimate */
conn = GetConnection(fpinfo->user, false, NULL);
get_remote_estimate(sql.data, conn, &rows, &width,
&startup_cost, &total_cost);
ReleaseConnection(conn);

Therefore, it seems like it would be near-zero cost to just call conn =
GetConnection() and then PQserverVersion(conn), and ReleaseConnection().
You can then use the return value of PQserverVersion() to determine if
you can push down partial aggregates.
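A sketch of that gate. PQserverVersion() returns the version packed as an integer (major * 10000 + minor for PG 10 and later, e.g. 170002 for 17.2); the 170000 threshold below assumes the feature ships in PG 17:

```python
def decode_version(v: int) -> tuple:
    """Unpack a PQserverVersion()-style integer into (major, minor)."""
    return (v // 10000, v % 10000)

def can_push_down_partial_aggregates(v: int) -> bool:
    """Gate partial aggregate pushdown on the remote major version."""
    major, _minor = decode_version(v)
    return major >= 17

print(decode_version(170002))                     # (17, 2)
print(can_push_down_partial_aggregates(160004))   # False: PG 16.4 remote
```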

3)  Looking at postgresAcquireSampleRowsFunc(), I see this exact method
used:

conn = GetConnection(user, false, NULL);

/* We'll need server version, so fetch it now. */
server_version_num = PQserverVersion(conn);

...

if ((server_version_num < 95000) &&
(method == ANALYZE_SAMPLE_SYSTEM ||
 method == ANALYZE_SAMPLE_BERNOULLI))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("remote server does not support TABLESAMPLE feature")));

I am sorry if you already knew all this, but I didn't.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.

RE: Partial aggregates pushdown

2023-06-12 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Momjian.

Thank you for advises.

> From: Bruce Momjian 
> Sent: Monday, June 12, 2023 10:38 PM
> > I understand what the problem is. I will put a mechanism maintaining 
> > compatibility into the patch.
> > I believe there are three approaches.
> > Approach 1-1 is preferable because it does not require additional options 
> > for postgres_fdw.
> > I will revise the patch according to Approach 1-1, unless otherwise 
> > commented.
> >
> > Approach1:
> > I ensure that postgres_fdw retrieves the version of each remote server
> > and does not push down partial aggregates if the server version is less 
> > than 17.
> > There are two approaches to obtaining remote server versions.
> > Approach1-1: postgres_fdw connects a remote server and use 
> > PQserverVersion().
> > Approach1-2: Adding a postgres_fdw option about a remote server version 
> > (like "server_version").
> >
> > Approach2:
> > Adding a postgres_fdw option for partial aggregate pushdown is enable
> > or not (like enable_partial_aggregate_pushdown).
> 
> These are good questions.  Adding a postgres_fdw option called 
> enable_partial_aggregate_pushdown helps make the
> purpose of the option clear, but remote_version can be used for future 
> breakage as well.
> 
> I think remote_version is the best idea, and in the documentation for the 
> option, let's explicitly say it is useful to disable
> partial aggregates pushdown on pre-PG 17 servers.  If we need to use the 
> option for other cases, we can just update the
> documentation.  When the option is blank, the default, everything is pushed 
> down.
> 
> I see remote_version a logical addition to match our "extensions" option that 
> controls what extension functions can be
> pushed down.

Thank you for your perspective.
So, of the approaches I have presented, you think that Approach 1-2 is
preferable and that the option name remote_version is preferable?
Indeed, the option of a remote version may have other uses.
However, since this information can be obtained by connecting to the remote 
server, I'm concerned that some people may find this option redundant.

Is the problem with approach 1-1 because the user cannot decide whether to 
include the compatibility check in the decision to do partial aggregate 
pushdown or not?
# If Approach 1-1 is taken, the problem is that this feature cannot be used for 
# all built-in aggregate functions when the remote server is older than PG17.
If so, Approach 1-3 below seems more desirable.
Would it be possible to hear your thoughts?

Approach 1-3: We add a postgres_fdw option for a compatibility check for 
partial aggregate pushdown
(like "enable_aggpartialfunc_compatibility_check"). This option is false by 
default.
When this option is true, postgres_fdw obtains the remote server version by 
connecting to the remote server and using PQserverVersion(). 
And if the remote server version is older than PG17, then the partial aggregate 
pushdown feature is disabled for all built-in aggregate functions.
Otherwise the partial aggregate pushdown feature is enabled for all built-in 
aggregate functions.
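A sketch of that decision logic (names are the ones proposed in this thread; the PG 17 threshold is an assumption, and I read the check as: with the option enabled, pushdown is disabled for pre-PG17 remotes):

```python
def partial_pushdown_enabled(check_option: bool, server_version_num: int) -> bool:
    """check_option models enable_aggpartialfunc_compatibility_check;
    server_version_num is a PQserverVersion()-style integer."""
    if not check_option:
        return True   # no compatibility check: pushdown is always considered
    return server_version_num >= 170000

print(partial_pushdown_enabled(False, 150000))  # True: no check performed
print(partial_pushdown_enabled(True, 150000))   # False: remote too old
```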

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2023-06-12 Thread Bruce Momjian
On Mon, Jun 12, 2023 at 08:51:30AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> Hi Mr.Bruce, Mr.Pyhalov, hackers.
> 
> Thank you for comments. I will try to respond to both of your comments as 
> follows.
> I plan to start revising the patch next week. If you have any comments on the 
> following
> respondences, I would appreciate it if you could give them to me this week.
> 
> > From: Bruce Momjian 
> > Sent: Saturday, June 10, 2023 1:44 AM
> > I agree that this feature is designed for built-in sharding, but it is 
> > possible people could be using aggregates on partitions
> > backed by foreign tables without sharding.  Adding a requirement for 
> > non-sharding setups to need PG 17+ servers might
> > be unreasonable.
Indeed, it is possible to use the partial aggregate pushdown feature for purposes 
other than sharding.
> The description of the section "F.38.6. Built-in sharding in PostgreSQL" 
> assumes the use of
> Built-in sharding and will be modified to eliminate this assumption.
> The title of this section should be changed to something like "Aggregate on 
> partitioned table".

Sounds good.

> > From: Bruce Momjian 
> > Sent: Saturday, June 10, 2023 1:44 AM
> > Looking at previous release note incompatibilities, we don't normally 
> > change non-administrative functions in a way that
> > causes errors if run on older servers.  Based on Alexander's observations, 
> > I wonder if we need to re-add the postgres_fdw
> > option to control partial aggregate pushdown, and default it to enabled.
> > 
> > If we ever add more function breakage we might need more postgres_fdw 
> > options.  Fortunately, such changes are rare.
> 
> I understand what the problem is. I will put a mechanism maintaining 
> compatibility into the patch.
> I believe there are three approaches.
> Approach 1-1 is preferable because it does not require additional options for 
> postgres_fdw.
> I will revise the patch according to Approach 1-1, unless otherwise commented.
> 
> Approach1:
> I ensure that postgres_fdw retrieves the version of each remote server
> and does not push down partial aggregates if the server version is less than 
> 17.
> There are two approaches to obtaining remote server versions.
> Approach1-1: postgres_fdw connects a remote server and use PQserverVersion().
> Approach1-2: Adding a postgres_fdw option about a remote server version (like 
> "server_version").
> 
> Approach2:
> Adding a postgres_fdw option for partial aggregate pushdown is enable or not
> (like enable_partial_aggregate_pushdown).

These are good questions.  Adding a postgres_fdw option called
enable_partial_aggregate_pushdown helps make the purpose of the option
clear, but remote_version can be used for future breakage as well.

I think remote_version is the best idea, and in the documentation for the
option, let's explicitly say it is useful to disable partial aggregates
pushdown on pre-PG 17 servers.  If we need to use the option for other
cases, we can just update the documentation.  When the option is blank,
the default, everything is pushed down.

I see remote_version a logical addition to match our "extensions" option
that controls what extension functions can be pushed down.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




RE: Partial aggregates pushdown

2023-06-12 Thread fujii.y...@df.mitsubishielectric.co.jp
Hi Mr.Bruce, Mr.Pyhalov, hackers.

Thank you for your comments. I respond to both of them below.
I plan to start revising the patch next week. If you have any comments on
the following responses, I would appreciate receiving them this week.

> From: Bruce Momjian 
> Sent: Saturday, June 10, 2023 1:44 AM
> I agree that this feature is designed for built-in sharding, but it is 
> possible people could be using aggregates on partitions
> backed by foreign tables without sharding.  Adding a requirement for 
> non-sharding setups to need PG 17+ servers might
> be unreasonable.
Indeed, it is possible to use the partial aggregate pushdown feature for
purposes other than sharding.
The description in the section "F.38.6. Built-in sharding in PostgreSQL"
assumes the use of built-in sharding; I will modify it to eliminate this
assumption.
The title of this section should be changed to something like "Aggregate
on partitioned table".

> From: Bruce Momjian 
> Sent: Saturday, June 10, 2023 1:44 AM
> Looking at previous release note incompatibilities, we don't normally change 
> non-administrative functions in a way that
> causes errors if run on older servers.  Based on Alexander's observations, I 
> wonder if we need to re-add the postgres_fdw
> option to control partial aggregate pushdown, and default it to enabled.
> 
> If we ever add more function breakage we might need more postgres_fdw 
> options.  Fortunately, such changes are rare.

I understand the problem. I will put a mechanism for maintaining
compatibility into the patch.
I believe there are three approaches.
Approach 1-1 is preferable because it does not require additional options
for postgres_fdw.
I will revise the patch according to Approach 1-1, unless otherwise commented.

Approach 1:
I ensure that postgres_fdw retrieves the version of each remote server
and does not push down partial aggregates if the server version is less
than 17. There are two ways to obtain the remote server version.
Approach 1-1: postgres_fdw connects to the remote server and uses
PQserverVersion().
Approach 1-2: Add a postgres_fdw option for the remote server version
(like "server_version").

Approach 2:
Add a postgres_fdw option controlling whether partial aggregate pushdown
is enabled (like enable_partial_aggregate_pushdown).

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation




Re: Partial aggregates pushdown

2023-06-09 Thread Bruce Momjian
On Tue, Jun  6, 2023 at 03:08:50AM +, 
fujii.y...@df.mitsubishielectric.co.jp wrote:
> > I've seen this message. But introduction of new built-in function will break
> > requests to old servers only if this new function is used in the request 
> > (this
> > means that query changes). However, this patch changes the behavior of old
> > queries, which worked prior to update. This seems to be different to me.
> 
> You are right.
> However, for now, partial aggregates pushdown is mainly used when using 
> built-in sharding in PostgreSQL.
> I believe when using built-in sharding in PostgreSQL, the version of the 
> primary node server and
> the version of the remote server will usually be the same.
> So I think it would be sufficient to include in the documentation a note
> about this problem and a workaround for it.

I agree that this feature is designed for built-in sharding, but it is
possible people could be using aggregates on partitions backed by
foreign tables without sharding.  Adding a requirement for non-sharding
setups to need PG 17+ servers might be unreasonable.

Looking at previous release note incompatibilities, we don't normally
change non-administrative functions in a way that causes errors if run
on older servers.  Based on Alexander's observations, I wonder if we
need to re-add the postgres_fdw option to control partial aggregate
pushdown, and default it to enabled.

If we ever add more function breakage we might need more postgres_fdw
options.  Fortunately, such changes are rare.

Yuki, thank you for writing and updating this patch, and Alexander,
thank you for helping with this patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-06-09 Thread Alexander Pyhalov

Hi.

+   An aggregate function, called the partial aggregate function for 
partial aggregate
+   that corresponding to the aggregate function, is defined on the 
primary node and

+   the postgres_fdw node.

Something is clearly wrong here.

+   When using built-in sharding feature in PostgreSQL is used,

And here.

Overall the code looks good to me, but I suppose that documentation 
needs further review from some native speaker.



--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: Partial aggregates pushdown

2023-06-08 Thread Alexander Pyhalov

fujii.y...@df.mitsubishielectric.co.jp wrote 2023-06-08 02:08:

From: Alexander Pyhalov 
Sent: Wednesday, June 7, 2023 6:47 PM
This seems to be more robust, but the interface became stranger.
I'm not sure what to do with it. Here are some ideas I had to avoid
introducing this parameter; I'm not sure I like any of them.

1) You can use QualifiedNameGetCreationNamespace() for aggpartialfnName
and still compare its namespace and function name against aggName and
aggNamespace.
Seems not ideal, but avoids introducing new parameters.

2) You can look up the partial aggregate function after ProcedureCreate()
in AggregateCreate(), if it wasn't found at earlier stages. If it is the
aggregate itself - check it. If it's still not found, error out. Also
seems a bit ugly - you leave uncommitted garbage for vacuum in the
catalogue.

Thank you for suggesting alternatives.
The disadvantages of alternative 2) appear undesirable, so I have
modified the patch according to alternative 1).

Another issue - the patch misses recording dependency between 
aggpartialfn

and aggregate procedure.

I added code to record dependencies between aggpartialfn and the
aggregate procedure, similar to the code for functions such as
combinefunc.




Hi.

Looks better. The only question I have is whether we should record the
dependency between procOid and aggpartialfn if aggpartialfn == procOid.


Also it seems new code likely should be run through pgindent.

doc/src/sgml/postgres-fdw.sgml:

+   For WHERE clauses,
+   JOIN clauses, this sending is active if
+   conditions in <xref linkend="postgres-fdw-remote-query-optimization"/>
+   hold and enable_partitionwise_join is true(this 
condition

+   is need for only JOIN clauses).
+   For aggregate expressions, this sending is active if conditions in

No space between "true" and "(" in "is true(this condition".

Some sentences in documentation, like one starting with
"For aggregate expressions, this sending is active if conditions in..."
seem to be too long, but I'm not the best man to read out documentation.

In "Built-in sharding in PostgreSQL" term "shard" doesn't have a 
definition.


By the way, I'm not sure that "sharding" documentation belongs to this 
patch,

at least it needs a review from native speaker.

--
Best regards,
Alexander Pyhalov,
Postgres Professional



