Re: More new SQL/JSON item methods

2024-02-27 Thread Jeevan Chalke
On Tue, Feb 27, 2024 at 12:40 PM Andrew Dunstan  wrote:

>
> On 2024-02-02 Fr 00:31, Jeevan Chalke wrote:
>
>
>
> On Thu, Feb 1, 2024 at 11:25 AM Kyotaro Horiguchi 
> wrote:
>
>> At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke <
>> jeevan.cha...@enterprisedb.com> wrote in
>> > On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi <
>> horikyota@gmail.com>
>> > wrote:
>> >
>> > > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
>> > > horikyota@gmail.com> wrote in
>> > > > By the way, while playing with this feature, I noticed the following
>> > > > error message:
>> > > >
>> > > > > select jsonb_path_query('1.1' , '$.boolean()');
>> > > > > ERROR:  numeric argument of jsonpath item method .boolean() is
>> out of
>> > > range for type boolean
>> > > >
>> > > > The error message seems a bit off to me. For example, "argument
>> '1.1'
>> > > > is invalid for type [bB]oolean" seems more appropriate for this
>> > > > specific issue. (I'm not certain about our policy on the spelling of
>> > > > Boolean..)
>> > >
>> > > Or, following our general convention, it would be spelled as:
>> > >
>> > > 'invalid argument for type Boolean: "1.1"'
>> > >
>> >
>> > jsonpath way:
>>
>> Hmm. I see.
>>
>> > ERROR:  argument of jsonpath item method .boolean() is invalid for type
>> > boolean
>> >
>> > or, if we add input value, then
>> >
>> > ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
>> > type boolean
>> >
>> > And this should work for all the error types, like out of range, not
>> valid,
>> > invalid input, etc, etc. Also, we don't need separate error messages for
>> > string input as well, which currently has the following form:
>> >
>> > "string argument of jsonpath item method .%s() is not a valid
>> > representation.."
>>
>> Agreed.
>>
>
> Attached are patches based on the discussion.
>
>
>
>
> Thanks, I combined these and pushed the result.
>

Thank you, Andrew.


>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: recently added jsonpath method change jsonb_path_query, jsonb_path_query_first immutability

2024-02-13 Thread Jeevan Chalke
On Sat, Feb 10, 2024 at 10:55 PM Andrew Dunstan  wrote:

>
> On 2024-02-08 Th 21:02, Jeevan Chalke wrote:
>
>
>
> On Thu, Feb 8, 2024 at 2:22 PM jian he 
> wrote:
>
>> On Thu, Feb 8, 2024 at 1:27 PM Jeevan Chalke
>>  wrote:
>> >
>> >
>> >
>> > On Wed, Feb 7, 2024 at 9:13 PM jian he 
>> wrote:
>> >>
>> >> On Wed, Feb 7, 2024 at 7:36 PM Jeevan Chalke
>> >>  wrote:
>> >> > Added checkTimezoneIsUsedForCast() check wherever we are casting
>> timezoned to non-timezoned types and vice-versa.
>> >>
>> >> https://www.postgresql.org/docs/devel/functions-json.html
>> >> above Table 9.51. jsonpath Filter Expression Elements, the Note
>> >> section, do we also need to rephrase it?
>> >
>> >
>> > OK. Added a line for the same.
>> >
>>
>> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
>> index 6788ba8..37ae2d1 100644
>> --- a/doc/src/sgml/func.sgml
>> +++ b/doc/src/sgml/func.sgml
>> @@ -18240,7 +18240,11 @@ ERROR:  jsonpath member accessor can only be
>> applied to an object
>>timestamptz, and time to
>> timetz.
>>However, all but the first of these conversions depend on the
>> current
>> setting, and thus can only be
>> performed
>> -  within timezone-aware jsonpath functions.
>> +  within timezone-aware jsonpath functions.  Similarly,
>> other
>> +  date/time-related methods that convert string to the date/time
>> types
>> +  also do the casting and may involve the current
>> +  .  To preserve the immutability,
>> those can
>> +  only be performed within timezone-aware jsonpath
>> functions.
>>   
>>  
>>
>> my proposed minor changes:
>> -  within timezone-aware jsonpath functions.
>> +  within timezone-aware jsonpath functions. Similarly,
>> other
>> +  date/time-related methods that convert string to the date/time
>> types
>> +  also do the casting and may involve the current
>> +   setting. Those conversions can
>> +  only be performed within timezone-aware jsonpath
>> functions.
>> I don't have a strong opinion, though.
>>
>
> That seems fine as well. Let's leave that to the committer.
>
> I edited slightly to my taste, and committed the patch. Thanks.
>

Thank you, Andrew and Jian.


>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: recently added jsonpath method change jsonb_path_query, jsonb_path_query_first immutability

2024-02-08 Thread Jeevan Chalke
On Thu, Feb 8, 2024 at 2:22 PM jian he  wrote:

> On Thu, Feb 8, 2024 at 1:27 PM Jeevan Chalke
>  wrote:
> >
> >
> >
> > On Wed, Feb 7, 2024 at 9:13 PM jian he 
> wrote:
> >>
> >> On Wed, Feb 7, 2024 at 7:36 PM Jeevan Chalke
> >>  wrote:
> >> > Added checkTimezoneIsUsedForCast() check wherever we are casting
> timezoned to non-timezoned types and vice-versa.
> >>
> >> https://www.postgresql.org/docs/devel/functions-json.html
> >> above Table 9.51. jsonpath Filter Expression Elements, the Note
> >> section, do we also need to rephrase it?
> >
> >
> > OK. Added a line for the same.
> >
>
> diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
> index 6788ba8..37ae2d1 100644
> --- a/doc/src/sgml/func.sgml
> +++ b/doc/src/sgml/func.sgml
> @@ -18240,7 +18240,11 @@ ERROR:  jsonpath member accessor can only be
> applied to an object
>timestamptz, and time to
> timetz.
>However, all but the first of these conversions depend on the
> current
> setting, and thus can only be
> performed
> -  within timezone-aware jsonpath functions.
> +  within timezone-aware jsonpath functions.  Similarly,
> other
> +  date/time-related methods that convert string to the date/time types
> +  also do the casting and may involve the current
> +  .  To preserve the immutability,
> those can
> +  only be performed within timezone-aware jsonpath
> functions.
>   
>  
>
> my proposed minor changes:
> -  within timezone-aware jsonpath functions.
> +  within timezone-aware jsonpath functions. Similarly,
> other
> +  date/time-related methods that convert string to the date/time types
> +  also do the casting and may involve the current
> +   setting. Those conversions can
> +  only be performed within timezone-aware jsonpath
> functions.
> I don't have a strong opinion, though.
>

That seems fine as well. Let's leave that to the committer.


Thanks
-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: recently added jsonpath method change jsonb_path_query, jsonb_path_query_first immutability

2024-02-07 Thread Jeevan Chalke
On Wed, Feb 7, 2024 at 9:13 PM jian he  wrote:

> On Wed, Feb 7, 2024 at 7:36 PM Jeevan Chalke
>  wrote:
> > Added checkTimezoneIsUsedForCast() check wherever we are casting
> timezoned to non-timezoned types and vice-versa.
>
> https://www.postgresql.org/docs/devel/functions-json.html
> above Table 9.51. jsonpath Filter Expression Elements, the Note
> section, do we also need to rephrase it?
>

OK. Added a line for the same.

Thanks

-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


v2-preserve-immutability.patch
Description: Binary data


Re: recently added jsonpath method change jsonb_path_query, jsonb_path_query_first immutability

2024-02-07 Thread Jeevan Chalke
On Tue, Feb 6, 2024 at 5:25 PM Andrew Dunstan  wrote:

>
> On 2024-02-05 Mo 22:06, jian he wrote:
>
>
> Hi.
> this commit [0] changes the immutability of jsonb_path_query and
> jsonb_path_query_first?
> If so, it may change other functions also.
>
>
Thanks for reporting, Jian.

Added checkTimezoneIsUsedForCast() check wherever we are casting
timezoned to non-timezoned types and vice-versa.
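For illustration, this is the behavior the check is meant to enforce — a
sketch only; the exact error and hint text here are assumptions based on the
existing .datetime() handling:

```sql
BEGIN;
SET LOCAL TIME ZONE 'UTC';

-- Casting a timezone-aware value to a non-timezone-aware type depends on
-- the TimeZone setting, so the immutable function should now reject it:
SELECT jsonb_path_query('"2023-08-15 12:34:56 +05:30"', '$.timestamp()');
-- expected: ERROR, with a hint to use a *_tz() function for time zone support

-- ...while the timezone-aware (stable) variant still performs the cast:
SELECT jsonb_path_query_tz('"2023-08-15 12:34:56 +05:30"', '$.timestamp()');

COMMIT;
```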

Thanks


>
> demo:
>
> begin;
> SET LOCAL TIME ZONE 10.5;
> with cte(s) as (select jsonb '"2023-08-15 12:34:56 +05:30"')
>
> select jsonb_path_query(s,
> '$.timestamp_tz()')::text,'+10.5'::text,'timestamp_tz'::text from cte
> union all
> select jsonb_path_query(s, '$.time()')::text,'+10.5'::text, 'time'::text
> from cte
> union all
> select jsonb_path_query(s,
> '$.timestamp()')::text,'+10.5'::text,'timestamp'::text from cte
> union all
> select jsonb_path_query(s, '$.date()')::text,'+10.5'::text, 'date'::text
> from cte
> union all
> select jsonb_path_query(s, '$.time_tz()')::text,'+10.5'::text,
> 'time_tz'::text from cte;
>
> SET LOCAL TIME ZONE -8;
> with cte(s) as (select jsonb '"2023-08-15 12:34:56 +05:30"')
> select jsonb_path_query(s,
> '$.timestamp_tz()')::text,'-8'::text,'timestamp_tz'::text from cte
> union all
> select jsonb_path_query(s, '$.time()')::text,'-8'::text, 'time'::text from
> cte
> union all
> select jsonb_path_query(s,
> '$.timestamp()')::text,'-8'::text,'timestamp'::text from cte
> union all
> select jsonb_path_query(s, '$.date()')::text,'-8'::text, 'date'::text from
> cte
> union all
> select jsonb_path_query(s, '$.time_tz()')::text,'-8'::text,
> 'time_tz'::text from cte;
> commit;
>
>
> [0]
> https://git.postgresql.org/cgit/postgresql.git/commit/?id=66ea94e8e606529bb334515f388c62314956739e
>
>
> ouch. Good catch. Clearly we need to filter these like we do for the
> .datetime() method.
>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


preserve-immutability.patch
Description: Binary data


Re: More new SQL/JSON item methods

2024-02-01 Thread Jeevan Chalke
On Thu, Feb 1, 2024 at 11:25 AM Kyotaro Horiguchi 
wrote:

> At Thu, 1 Feb 2024 09:22:22 +0530, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote in
> > On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi <
> horikyota@gmail.com>
> > wrote:
> >
> > > At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> > > horikyota@gmail.com> wrote in
> > > > By the way, while playing with this feature, I noticed the following
> > > > error message:
> > > >
> > > > > select jsonb_path_query('1.1' , '$.boolean()');
> > > > > ERROR:  numeric argument of jsonpath item method .boolean() is out
> of
> > > range for type boolean
> > > >
> > > > The error message seems a bit off to me. For example, "argument '1.1'
> > > > is invalid for type [bB]oolean" seems more appropriate for this
> > > > specific issue. (I'm not certain about our policy on the spelling of
> > > > Boolean..)
> > >
> > > Or, following our general convention, it would be spelled as:
> > >
> > > 'invalid argument for type Boolean: "1.1"'
> > >
> >
> > jsonpath way:
>
> Hmm. I see.
>
> > ERROR:  argument of jsonpath item method .boolean() is invalid for type
> > boolean
> >
> > or, if we add input value, then
> >
> > ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
> > type boolean
> >
> > And this should work for all the error types, like out of range, not
> valid,
> > invalid input, etc, etc. Also, we don't need separate error messages for
> > string input as well, which currently has the following form:
> >
> > "string argument of jsonpath item method .%s() is not a valid
> > representation.."
>
> Agreed.
>

Attached are patches based on the discussion.


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


v1-0002-Improve-error-message-for-NaN-or-Infinity-inputs-.patch
Description: Binary data


v1-0001-Merge-error-messages-in-the-same-pattern-in-jsonp.patch
Description: Binary data


v1-0003-Unify-error-messages-for-various-jsonpath-item-me.patch
Description: Binary data


v1-0004-Show-input-value-in-the-error-message-of-various-.patch
Description: Binary data


Re: More new SQL/JSON item methods

2024-01-31 Thread Jeevan Chalke
On Thu, Feb 1, 2024 at 7:24 AM Kyotaro Horiguchi 
wrote:

> At Thu, 01 Feb 2024 10:49:57 +0900 (JST), Kyotaro Horiguchi <
> horikyota@gmail.com> wrote in
> > By the way, while playing with this feature, I noticed the following
> > error message:
> >
> > > select jsonb_path_query('1.1' , '$.boolean()');
> > > ERROR:  numeric argument of jsonpath item method .boolean() is out of
> range for type boolean
> >
> > The error message seems a bit off to me. For example, "argument '1.1'
> > is invalid for type [bB]oolean" seems more appropriate for this
> > specific issue. (I'm not certain about our policy on the spelling of
> > Boolean..)
>
> Or, following our general convention, it would be spelled as:
>
> 'invalid argument for type Boolean: "1.1"'
>

jsonpath way:

ERROR:  argument of jsonpath item method .boolean() is invalid for type
boolean

or, if we add input value, then

ERROR:  argument "1.1" of jsonpath item method .boolean() is invalid for
type boolean

And this should work for all the error types, like out of range, not valid,
invalid input, etc, etc. Also, we don't need separate error messages for
string input as well, which currently has the following form:

"string argument of jsonpath item method .%s() is not a valid
representation.."
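A hypothetical session showing how the proposed single template would read
across the different failure kinds (wording per the proposal above, not
necessarily the final committed text):

```sql
-- numeric input that is not a valid boolean:
SELECT jsonb_path_query('1.1', '$.boolean()');
-- proposed: ERROR:  argument "1.1" of jsonpath item method .boolean() is
--           invalid for type boolean

-- string input that is not a valid double precision representation:
SELECT jsonb_path_query('"not-a-number"', '$.double()');
-- proposed: ERROR:  argument "not-a-number" of jsonpath item method
--           .double() is invalid for type double precision
```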


Thanks


> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>


-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2024-01-31 Thread Jeevan Chalke
On Thu, Feb 1, 2024 at 7:20 AM Kyotaro Horiguchi 
wrote:

> Thank you for the fix!
>
> At Tue, 30 Jan 2024 13:46:17 +0530, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote in
> > On Mon, Jan 29, 2024 at 11:03 AM Tom Lane  wrote:
> >
> > > Kyotaro Horiguchi  writes:
> > > > Having said that, I'm a bit concerned about the case where an overly
> > > > long string is given there. However, considering that we already have
> > > > "invalid input syntax for type xxx: %x" messages (including for the
> > > > numeric), this concern might be unnecessary.
> > >
> > > Yeah, we have not worried about that in the past.
> > >
> > > > Another concern is that the input value is already a numeric, not a
> > > string. This situation occurs when the input is NaN or +-Inf.
> Although
> > > > numeric_out could be used, it might cause another error. Therefore,
> > > > simply writing out as "argument NaN and Infinity.." would be better.
> > >
> > > Oh!  I'd assumed that we were discussing a string that we'd failed to
> > > convert to numeric.  If the input is already numeric, then either
> > > the error is unreachable or what we're really doing is rejecting
> > > special values such as NaN on policy grounds.  I would ask first
> > > if that policy is sane at all.  (I'd lean to "not" --- if we allow
> > > it in the input JSON, why not in the output?)  If it is sane, the
> > > error message needs to be far more specific.
> > >
> > > regards, tom lane
> > >
> >
> > *Consistent error message related to type:*
> ...
> > What if we use phrases like "for type double precision" at both places,
> > like:
> > numeric argument of jsonpath item method .%s() is out of range for type
> > double precision
> > string argument of jsonpath item method .%s() is not a valid
> representation
> > for type double precision
> >
> > With this, the rest will be like:
> > for type numeric
> > for type bigint
> > for type integer
> >
> > Suggestions?
>
> FWIW, I prefer consistently using "for type hoge".
>

OK.


>
> > ---
> >
> > *Showing input string in the error message:*
> >
> > argument "...input string here..." of jsonpath item method .%s() is out
> of
> > range for type numeric
> >
> > If we add the input string in the error, then for some errors, it will be
> > too big, for example:
> >
> > -ERROR:  numeric argument of jsonpath item method .double() is out of
> range
> > for type double precision
> > +ERROR:  argument
> > "10"
> > of jsonpath item method .double() is out of range for type double
> precision
>
> As Tom suggested, given that similar situations have already been
> disregarded elsewhere, worrying about excessively long input strings
> in this specific instance won't notably improve safety in total.
>
> > Also, for non-string input, we need to convert numeric to string just for
> > the error message, which seems overkill.
>
> As I suggested and you seem to agree, using literally "NaN or
> Infinity" would be sufficient.
>

I am more concerned about .bigint() and .integer(). We can have errors when
the numeric input is out of range, but not NaN or Infinity. At those
places, we need to convert numeric to string to put that value into the
error.
Do you mean we should still put "NaN or Infinity" there?

This is the case:
 select jsonb_path_query('12345678901', '$.integer()');
 ERROR:  numeric argument of jsonpath item method .integer() is out of
range for type integer


>
> > On another note, irrespective of these changes, is it good to show the
> > given input in the error messages? Error messages are logged and may leak
> > some details.
> >
> > I think the existing way seems ok.
>
> In my opinion, it is quite common to include the error-causing value
> in error messages. Also, we already have many functions that imply
> the possibility of revealing input values when converting text
> representation into internal format, such as with int4in. However, I
> don't stick to that way.
>
> > ---
> >
> > *NaN and Infinity restrictions:*
> >
> > I am not sure why NaN and Infinity are not allowed in conversion to
> double
> > precision (.double() method). I have used the same restriction for
> > .decimal() and .number(). However, as you said, we should have error
> > messages more specific. I tried that in the at
> > messages more specific. I tried that in the attached patch.
Re: More new SQL/JSON item methods

2024-01-30 Thread Jeevan Chalke
On Mon, Jan 29, 2024 at 11:03 AM Tom Lane  wrote:

> Kyotaro Horiguchi  writes:
> > Having said that, I'm a bit concerned about the case where an overly
> > long string is given there. However, considering that we already have
> > "invalid input syntax for type xxx: %x" messages (including for the
> > numeric), this concern might be unnecessary.
>
> Yeah, we have not worried about that in the past.
>
> > Another concern is that the input value is already a numeric, not a
> > string. This situation occurs when the input is NaN or +-Inf. Although
> > numeric_out could be used, it might cause another error. Therefore,
> > simply writing out as "argument NaN and Infinity.." would be better.
>
> Oh!  I'd assumed that we were discussing a string that we'd failed to
> convert to numeric.  If the input is already numeric, then either
> the error is unreachable or what we're really doing is rejecting
> special values such as NaN on policy grounds.  I would ask first
> if that policy is sane at all.  (I'd lean to "not" --- if we allow
> it in the input JSON, why not in the output?)  If it is sane, the
> error message needs to be far more specific.
>
> regards, tom lane
>

*Consistent error message related to type:*

Agree that the number is not a PostgreSQL type and needs a change. As Tom
suggested, we can say "type numeric" here. However, I have seen two
variants of error messages here: (1) When the input is numeric and (2) when
the input is string. For first, we have error messages like:
numeric argument of jsonpath item method .%s() is out of range for type
double precision

and for the second, it is like:
string argument of jsonpath item method .%s() is not a valid representation
of a double precision number

The second form says "double precision number". So, in the decimal/number
case, should we use "numeric number" and then similarly "big integer
number"?

What if we use phrases like "for type double precision" at both places,
like:
numeric argument of jsonpath item method .%s() is out of range for type
double precision
string argument of jsonpath item method .%s() is not a valid representation
for type double precision

With this, the rest will be like:
for type numeric
for type bigint
for type integer

Suggestions?

---

*Showing input string in the error message:*

argument "...input string here..." of jsonpath item method .%s() is out of
range for type numeric

If we add the input string in the error, then for some errors, it will be
too big, for example:

-ERROR:  numeric argument of jsonpath item method .double() is out of range
for type double precision
+ERROR:  argument
"1"
of jsonpath item method .double() is out of range for type double precision

Also, for non-string input, we need to convert numeric to string just for
the error message, which seems overkill.

On another note, irrespective of these changes, is it good to show the
given input in the error messages? Error messages are logged and may leak
some details.

I think the existing way seems ok.

---

*NaN and Infinity restrictions:*

I am not sure why NaN and Infinity are not allowed in conversion to double
precision (.double() method). I have used the same restriction for
.decimal() and .number(). However, as you said, we should have error
messages more specific. I tried that in the attached patch; please have
your views. I have the following wordings for that error message:
"NaN or Infinity is not allowed for jsonpath item method .%s()"

Suggestions...
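For example, with the wording proposed above (illustrative only; the final
message text is still under discussion):

```sql
SELECT jsonb_path_query('"NaN"', '$.double()');
-- proposed: ERROR:  NaN or Infinity is not allowed for jsonpath item
--           method .double()
```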


Thanks
-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


improve_error_for_Nan_Infinity.patch.no
Description: Binary data


Re: More new SQL/JSON item methods

2024-01-26 Thread Jeevan Chalke
On Fri, Jan 26, 2024 at 2:57 AM Andrew Dunstan  wrote:

>
> On 2024-01-25 Th 15:58, Tom Lane wrote:
> > I wrote:
> >> There's something else going on, because I'm still getting the
> >> assertion failure on my Mac with this fix in place.  Annoyingly,
> >> it goes away if I compile with -O0, so it's kind of hard to
> >> identify what's going wrong.
> > No, belay that: I must've got confused about which version I was
> > testing.  It's very unclear to me why the undefined reference
> > causes the preceding Assert to misbehave, but that is clearly
> > what's happening.  Compiler bug maybe?  My Mac has clang 15.0.0,
> > and the unhappy buildfarm members are also late-model clang.
> >
> > Anyway, I did note that the preceding line
> >
> >   res = jperOk;
> >
> > is dead code and might as well get removed while you're at it.
> >
> >
>
>
> OK, pushed those. Thanks.
>

Thank you all.


>
>
> cheers
>
>
> andrew
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2024-01-17 Thread Jeevan Chalke
On Mon, Jan 15, 2024 at 7:41 PM Peter Eisentraut 
wrote:

> Attached are two small fixup patches for your patch set.
>

Thanks, Peter.


>
> In the first one, I simplified the grammar for the .decimal() method.
> It seemed a bit overkill to build a whole list structure when all we
> need are 0, 1, or 2 arguments.
>

Agree.
I added unary '+' and '-' support as well and thus thought of having
separate rules altogether rather than folding those in.


> Per SQL standard, the precision and scale arguments are unsigned
> integers, so unary plus and minus signs are not supported.  So my patch
> removes that support, but I didn't adjust the regression tests for that.
>

However, PostgreSQL numeric casting does support a negative scale. Here is
an example:

# select '12345'::numeric(4,-2);
 numeric
---------
   12300
(1 row)

And thus thought of supporting those.
Do we want this JSON item method to behave differently here?


>
> Also note that in your 0002 patch, the datetime precision is similarly
> unsigned, so that's consistent.
>
> By the way, in your 0002 patch, don't see the need for the separate
> datetime_method grammar rule.  You can fold that into accessor_op.
>

Sure.


>
> Overall, I think it would be better if you combined all three of these
> patches into one.  Right now, you have arranged these as incremental
> features, and as a result of that, the additions to the JsonPathItemType
> enum and the grammar keywords etc. are ordered in the way you worked on
> these features, I guess.  It would be good to maintain a bit of sanity
> to put all of this together and order all the enums and everything else
> for example in the order they are in the sql_features.txt file (which is
> alphabetical, I suppose).  At this point I suspect we'll end up
> committing this whole feature set together anyway, so we might as well
> organize it that way.
>

OK.
I will merge them all into one and will try to keep them in the order
specified in sql_features.txt.
However, for documentation, it makes more sense to keep them in logical
order than the alphabetical one. What are your views on this?


Thanks
-- 
Jeevan Chalke

*Principal, Manager*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2024-01-07 Thread Jeevan Chalke
On Thu, Jan 4, 2024 at 2:34 AM Peter Eisentraut 
wrote:

> On 03.01.24 13:01, Peter Eisentraut wrote:
> > On 07.12.23 14:24, Jeevan Chalke wrote:
> >> We have the same issue with integer conversion and need a fix.
> >>
> >> Unfortunately, I was using int8in() for the conversion of numeric
> >> values. We should be using numeric_int8() instead. However, there is
> >> no opt_error version of the same.
> >>
> >> So, I have introduced a numeric_int8_opt_error() version just like we
> >> have one for int4, i.e. numeric_int4_opt_error(), to suppress the
> >> error. These changes are in the 0001 patch. (All other patch numbers
> >> are now increased by 1)
> >>
> >> I have used this new function to fix this reported issue and used
> >> numeric_int4_opt_error() for integer conversion.
> >
> > I have committed the 0001 and 0002 patches for now.
> >
> > The remaining patches look reasonable to me, but I haven't reviewed them
> > in detail.
>
> The 0002 patch had to be reverted, because we can't change the order of
> the enum values in JsonPathItemType.  I have instead committed a
> different patch that adjusts the various switch cases to observe the
> current order of the enum.  That also means that the remaining patches
> that add new item methods need to add the new enum values at the end and
> adjust the rest of their code accordingly.
>

Thanks, Peter.

I will work on rebasing and reorganizing the remaining patches.

Thanks

-- 
Jeevan Chalke

*Principal*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2023-11-01 Thread Jeevan Chalke
Hello,

On Tue, Oct 24, 2023 at 6:41 PM Andrew Dunstan  wrote:

>
> On 2023-10-19 Th 02:06, Jeevan Chalke wrote:
>
> Thanks, Peter for the comments.
>
> On Fri, Oct 6, 2023 at 5:13 PM Peter Eisentraut 
> wrote:
>
>> On 29.08.23 09:05, Jeevan Chalke wrote:
>> > v1-0001-Implement-jsonpath-.bigint-.integer-and-.number-m.patch
>> >
>> > This commit implements jsonpath .bigint(), .integer(), and .number()
>> > methods.  The JSON string or a numeric value is converted to the
>> > bigint, int4, and numeric type representation.
>>
>> A comment that applies to all of these: These add various keywords,
>> switch cases, documentation entries in some order.  Are we happy with
>> that?  Should we try to reorder all of that for better maintainability
>> or readability?
>>
>
> Yeah, that's the better suggestion. While implementing these methods, I
> was confused about where to put them exactly and tried keeping them in some
> logical place.
> I think once these methods get in, we can have a follow-up patch
> reorganizing all of these.
>
>
> I think it would be better to organize things how we want them before
> adding in more stuff.
>

I have tried reordering all the jsonpath Operators and Methods
consistently. With this patch, they all appear in the same order when
together in the group.

In some switch cases, they are still divided, like in
flattenJsonPathParseItem(), where 2-arg, 1-arg, and no-arg cases are
clubbed together. But I have tried to keep them in order in those subgroups.

I will rebase my patches for this task on this patch, but before doing so,
I  would like to get your views on this reordering.

Thanks


>
> cheers
>
>
> andrew
>
>
> --
> Andrew Dunstan
> EDB: https://www.enterprisedb.com
>
>

-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


reorder-jsonpath-Operators-Methods.patch
Description: Binary data


Re: More new SQL/JSON item methods

2023-10-19 Thread Jeevan Chalke
On Wed, Oct 18, 2023 at 4:50 PM jian he  wrote:

> On Fri, Oct 6, 2023 at 7:47 PM Peter Eisentraut 
> wrote:
> >
> > On 29.08.23 09:05, Jeevan Chalke wrote:
> > > v1-0001-Implement-jsonpath-.bigint-.integer-and-.number-m.patch
> > >
> > > This commit implements jsonpath .bigint(), .integer(), and .number()
> > > methods.  The JSON string or a numeric value is converted to the
> > > bigint, int4, and numeric type representation.
> >
> > A comment that applies to all of these: These add various keywords,
> > switch cases, documentation entries in some order.  Are we happy with
> > that?  Should we try to reorder all of that for better maintainability
> > or readability?
> >
> > > v1-0002-Implement-.date-.time-.time_tz-.timestamp-and-.ti.patch
> > >
> > > This commit implements jsonpath .date(), .time(), .time_tz(),
> > > .timestamp(), .timestamp_tz() methods.  The JSON string representing
> > > a valid date/time is converted to the specific date or time type
> > > representation.
> > >
> > > The changes use the infrastructure of the .datetime() method and
> > > perform the datatype conversion as appropriate.  All these methods
> > > accept no argument and use ISO datetime formats.
> >
> > These should accept an optional precision argument.  Did you plan to add
> > that?
>
> compiler warnings issue resolved.
>

Thanks for pitching in, Jian.
I was slightly busy with other stuff and thus could not spend time on this.

I will start looking into it and expect a patch in a couple of days.


> I figured out how to use the precision argument.
> But I don't know how to get the precision argument in the parse stage.
>
> attached is my attempt to implement: select
> jsonb_path_query('"2017-03-10 11:11:01.123"', '$.timestamp(2)');
> not that familiar with src/backend/utils/adt/jsonpath_gram.y. imitate
> decimal method failed. decimal has precision and scale two arguments.
> here only one argument.
>
> looking for hints.
>

You may refer to how .datetime() is implemented.
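For reference, a sketch of the result that a working precision argument
would produce — assuming it reuses the .datetime() template machinery and
rounds fractional seconds to the given precision (output shape is an
assumption, not an observed result):

```sql
SELECT jsonb_path_query('"2017-03-10 11:11:01.123"', '$.timestamp(2)');
-- expected roughly: "2017-03-10T11:11:01.12"
```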

Thanks


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2023-10-19 Thread Jeevan Chalke
Thanks, Peter for the comments.

On Fri, Oct 6, 2023 at 5:13 PM Peter Eisentraut 
wrote:

> On 29.08.23 09:05, Jeevan Chalke wrote:
> > v1-0001-Implement-jsonpath-.bigint-.integer-and-.number-m.patch
> >
> > This commit implements jsonpath .bigint(), .integer(), and .number()
> > methods.  The JSON string or a numeric value is converted to the
> > bigint, int4, and numeric type representation.
>
> A comment that applies to all of these: These add various keywords,
> switch cases, documentation entries in some order.  Are we happy with
> that?  Should we try to reorder all of that for better maintainability
> or readability?
>

Yeah, that's the better suggestion. While implementing these methods, I was
confused about where to put them exactly and tried keeping them in some
logical place.
I think once these methods get in, we can have a follow-up patch
reorganizing all of these.


>
> > v1-0002-Implement-.date-.time-.time_tz-.timestamp-and-.ti.patch
> >
> > This commit implements jsonpath .date(), .time(), .time_tz(),
> > .timestamp(), .timestamp_tz() methods.  The JSON string representing
> > a valid date/time is converted to the specific date or time type
> > representation.
> >
> > The changes use the infrastructure of the .datetime() method and
> > perform the datatype conversion as appropriate.  All these methods
> > accept no argument and use ISO datetime formats.
>
> These should accept an optional precision argument.  Did you plan to add
> that?
>

Yeah, will add that.


>
> > v1-0003-Implement-jsonpath-.boolean-and-.string-methods.patch
> >
> > This commit implements jsonpath .boolean() and .string() methods.
>
> This contains a compiler warning:
>
> ../src/backend/utils/adt/jsonpath_exec.c: In function
> 'executeItemOptUnwrapTarget':
> ../src/backend/utils/adt/jsonpath_exec.c:1162:86: error: 'tmp' may be
> used uninitialized [-Werror=maybe-uninitialized]
>
> > v1-0004-Implement-jasonpath-.decimal-precision-scale-meth.patch
> >
> > This commit implements jsonpath .decimal() method with optional
> > precision and scale.  If precision and scale are provided, then
> > it is converted to the equivalent numerictypmod and applied to the
> > numeric number.
>
> This also contains compiler warnings:
>

Thanks for reporting these warnings. I don't get those on my machine and thus
missed them. Will fix them.


>
> ../src/backend/utils/adt/jsonpath_exec.c: In function
> 'executeItemOptUnwrapTarget':
> ../src/backend/utils/adt/jsonpath_exec.c:1403:53: error: declaration of
> 'numstr' shadows a previous local [-Werror=shadow=compatible-local]
> ../src/backend/utils/adt/jsonpath_exec.c:1442:54: error: declaration of
> 'elem' shadows a previous local [-Werror=shadow=compatible-local]
>
> There is a typo in the commit message: "Implement jasonpath"
>

Will fix.


>
> Any reason this patch is separate from 0002?  Isn't number() and
> decimal() pretty similar?
>

Since DECIMAL has precision and scale arguments, I implemented it last. I
tried merging it with 0001, but the other patches then ended up with
conflicts, so I kept it as a separate patch. Logically, though, it belongs
with the 0001 group; my bad for not putting in that extra effort. Will do
that in the next version.


>
> You could also update src/backend/catalog/sql_features.txt in each patch
> (features T865 through T878).
>

OK.

Thanks

-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


Re: More new SQL/JSON item methods

2023-09-04 Thread Jeevan Chalke
>
> Looking at the SQL standard itself, in the 2023 edition section 9.46
> "SQL/JSON path language: syntax and semantics", it shows this:
>
>  ::=
> type  
> | size  
> | double  
> | ceiling  
> | floor  
> | abs  
> | datetime  [  ] 
> | keyvalue  
> | bigint  
> | boolean  
> | date  
> | decimal  [  [   ] ] 
> | integer  
> | number  
> | string  
> | time  [  ] 
> | time_tz  [  ] 
> | timestamp  [  ] 
> | timestamp_tz  [  ] 
>
> and then details, for each of those, rules like
>
> III) If JM specifies , then:
>  1) For all j, 1 (one) ≤ j ≤ n,
> Case:
> a) If I_j is not a number or character string, then let ST be data
>exception — non-numeric SQL/JSON item (22036).
> b) Otherwise, let X be an SQL variable whose value is I_j.
>Let V_j be the result of
>CAST (X AS DOUBLE PRECISION)
>If this conversion results in an exception condition, then
>let ST be that exception condition.
>  2) Case:
> a) If ST is not successful completion, then the result of JAE
>is ST.
> b) Otherwise, the result of JAE is the SQL/JSON sequence V_1,
>..., V_n.
>
> so at least superficially our implementation is constrained by what the
> SQL standard says to do, and we should verify that this implementation
> matches those rules.  We don't necessarily need to watch what do other
> specs such as jsonpath itself.
>

I believe our current implementation of the .double() method is in line with
this, and these new methods follow the same approach.
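To restate the standard's rule III in executable form: each item must be a number or a string, and is then cast to double precision, with any conversion error propagated. A minimal Python model of that per-item loop (illustrative only; the function name is mine):

```python
def jsonpath_double(items):
    # Model of the SQL-standard .double() rule: reject non-number,
    # non-string items (error 22036), otherwise cast to double and
    # let conversion errors propagate.
    result = []
    for item in items:
        if isinstance(item, bool) or not isinstance(item, (int, float, str)):
            raise TypeError("non-numeric SQL/JSON item (22036)")
        result.append(float(item))   # ValueError on a malformed string
    return result

print(jsonpath_double([1, "2.5"]))   # [1.0, 2.5]
```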



> > - surely there's a more direct way to make boolean from numeric
> >   than to serialize the numeric and parse an int?
>

Yeah, we can directly treat the value as false when it is 0 and true
otherwise. But PostgreSQL's conversion to bool doesn't allow floating-point
values to be converted to boolean and only accepts int4. That's why I did
the int4 conversion.
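Concretely, the int4 route means only integral numeric values convert, which is exactly why '1.1' raises the error discussed upthread. A small Python model of that behavior (my own sketch, not the C implementation):

```python
def jsonpath_boolean(item):
    # Numerics go through an int4-style conversion, so non-integral
    # values are rejected; strings use the usual boolean literals.
    # (Sketch of the behavior only, not the C implementation.)
    if isinstance(item, bool):
        return item
    if isinstance(item, (int, float)):
        if item != int(item):
            raise ValueError("argument is invalid for type boolean")
        return int(item) != 0
    if isinstance(item, str):
        s = item.strip().lower()
        if s in ("t", "true", "yes", "on", "1"):
            return True
        if s in ("f", "false", "no", "off", "0"):
            return False
    raise ValueError("argument is invalid for type boolean")

print(jsonpath_boolean(0))       # False
print(jsonpath_boolean("true"))  # True
```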

Thanks

-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
Hi Tom,

On Tue, Jul 11, 2023 at 4:30 PM Tom Lane  wrote:

> Jeevan Chalke  writes:
> > Attached patch.
>
> I would be astonished if this fixes anything.  The code still doesn't
> know which paths are referenced by which other ones, and so the place
> where we free a previously-added path can't know what to do.
>
> I've speculated about adding some form of reference counting to paths
> (maybe just a "pin" flag rather than a full refcount) so that we could
> be smarter about this.  The existing kluge for "don't free IndexPaths"
> could be replaced by setting the pin mark on any IndexPath that we
> make a bitmap path from.  Up to now it hasn't seemed necessary to
> generalize that hack, but maybe it's time.  Can you show a concrete
> case where we are freeing a still-referenced path?
>

As mentioned earlier, while debugging some issues, we added an elog() that
displays the foreignrel contents using nodeToString(), like below:

@@ -1238,6 +1238,8 @@ postgresGetForeignPlan(PlannerInfo *root,
    bool        has_limit = false;
    ListCell   *lc;

+   elog(INFO, "foreignrel: %s", nodeToString(foreignrel));
+

We then ran the postgres_fdw regression tests and observed many warnings
saying "could not dump unrecognized node type". Here are the queries,
retrieved and adjusted from postgres_fdw.sql:

CREATE EXTENSION postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw OPTIONS (dbname
'postgres', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE TABLE t1 (c1 int NOT NULL, c2 int NOT NULL, CONSTRAINT t1_pkey
PRIMARY KEY (c1));
INSERT INTO t1 SELECT id, id % 10 FROM generate_series(1, 1000) id;
ANALYZE t1;
CREATE FOREIGN TABLE ft2 (c1 int NOT NULL, c2 int NOT NULL) SERVER loopback
OPTIONS (schema_name 'public', table_name 't1');

explain (verbose, costs off)
select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) <
49800 order by c2;

With the above elog() in place, we can see the warning, and the pathlist's
second path shows up as empty ({}): it was freed but is still referenced by
this foreignrel.

Thanks


>
> regards, tom lane
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
On Tue, Jul 11, 2023 at 2:58 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Tue, Jul 11, 2023 at 1:19 PM Alvaro Herrera 
> wrote:
>
>> On 2023-Jul-11, Jeevan Chalke wrote:
>>
>> > 4. However, 2nd path was already sorted and passed as is to the
>> add_path().
>> > 5. add_path() decides to reject this new path on some metrics. However,
>> in
>> > the end, it pfree() this passed in path. It seems wrong as its
>> references
>> > do present elsewhere. For example, in the first path's parent rels path
>> > list.
>> > 6. So, while displaying the parent's path, we end up with these
>> warnings.
>>
>> In other words, this is use-after-free, with add_path freeing the
>> passed-in Path pointer, but one particular case in which this Path is
>> still used afterwards.
>>
>> > I tried to get a fix for this but no luck so far.
>>
>> I proposed to add an add_path_extended() function that adds 'bool
>> free_input_path' argument, and pass it false in that one place in
>> create_ordered_paths.
>>
>
> Yeah, this can be a way.
>
> However, I am thinking the other way around now. What if we first added
> the unmodified input path as it is to the ordered_rel first?
>
> If we do so, then while adding the next path, add_path() may decide to
> remove the older one as the newer path is the best one. The remove_old
> logic in add_path() will free the path (unsorted one), and we end up with
> the same error.
>
> And if we conditionally remove that path (remove_old logic one), then we
> need to pass false in every add_path() call in create_ordered_paths().
>

Attached patch.


>
> Am I missing something?
>
> Thanks
>
>
>>
>> --
>> Álvaro Herrera   48°01'N 7°57'E  —
>> https://www.EnterpriseDB.com/
>>
>
>
> --
> Jeevan Chalke
>
> *Senior Staff SDE, Database Architect, and Manager*
> *Product Development*
>
>
>
> edbpostgres.com
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


fix_add_path.patch
Description: Binary data


Re: unrecognized node type while displaying a Path due to dangling pointer

2023-07-11 Thread Jeevan Chalke
On Tue, Jul 11, 2023 at 1:19 PM Alvaro Herrera 
wrote:

> On 2023-Jul-11, Jeevan Chalke wrote:
>
> > 4. However, 2nd path was already sorted and passed as is to the
> add_path().
> > 5. add_path() decides to reject this new path on some metrics. However,
> in
> > the end, it pfree() this passed in path. It seems wrong as its references
> > do present elsewhere. For example, in the first path's parent rels path
> > list.
> > 6. So, while displaying the parent's path, we end up with these warnings.
>
> In other words, this is use-after-free, with add_path freeing the
> passed-in Path pointer, but one particular case in which this Path is
> still used afterwards.
>
> > I tried to get a fix for this but no luck so far.
>
> I proposed to add an add_path_extended() function that adds 'bool
> free_input_path' argument, and pass it false in that one place in
> create_ordered_paths.
>

Yeah, this can be a way.

However, I am thinking the other way around now. What if we first added the
unmodified input path as is to the ordered_rel?

If we do so, then while adding the next path, add_path() may decide to
remove the older one as the newer path is the best one. The remove_old
logic in add_path() will free the path (unsorted one), and we end up with
the same error.

And if we conditionally remove that path (remove_old logic one), then we
need to pass false in every add_path() call in create_ordered_paths().

Am I missing something?

Thanks


>
> --
> Álvaro Herrera   48°01'N 7°57'E  —
> https://www.EnterpriseDB.com/
>


-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


unrecognized node type while displaying a Path due to dangling pointer

2023-07-10 Thread Jeevan Chalke
Hello,

While debugging one issue, we have added the below line in
postgresGetForeignPlan() to see the foreignrel details.

@@ -1238,6 +1238,8 @@ postgresGetForeignPlan(PlannerInfo *root,
    bool        has_limit = false;
    ListCell   *lc;

+   elog(INFO, "foreignrel: %s", nodeToString(foreignrel));
+

And with this change, we ran the postgres_fdw regression (executing
sql/postgres_fdw.sql) suite. We observed the below warnings that seem
strange.

+WARNING:  could not dump unrecognized node type: 0
+WARNING:  could not dump unrecognized node type: 26072088
+WARNING:  could not dump unrecognized node type: 26438448
+WARNING:  could not dump unrecognized node type: 368

Of course, with multiple runs, we see different random node types in the list
above. Thanks to my colleague Suraj Kharage for this and for working in
parallel with me.

Does anybody have any idea about these?

After debugging one random query from the above-failed case, what we have
observed is (we might be wrong, but worth noting here):

1. This warning ended up while displaying RelOptInfo->pathlist.
2. In create_ordered_paths(), input_rel has two paths, and it loops over
both paths to get the best-sorted path.
3. The first path was unsorted, so we add a sort node on top of it and add
that to the ordered_rel.
4. However, the 2nd path was already sorted and passed as is to add_path().
5. add_path() decides to reject this new path on some metric. However, in
the end, it pfree()s the passed-in path. That seems wrong, as references to
it still exist elsewhere, for example in the first path's parent rel's
pathlist.
6. So, while displaying the parent's paths, we end up with these warnings.

I tried to get a fix for this but no luck so far.
One approach was to copy the path before passing it to the add_path().
However, there is no easy way to copy a path due to its circular references.

To see whether this warning goes away, I commented out the code in add_path()
that does pfree() on the new_path, and with that I don't see any warnings.
But removing that code doesn't seem to be the correct fix.
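The hazard can be modeled in a few lines: add_path() frees a rejected path even though another structure may still reference it. Here is a toy Python sketch of the problem and of a "pin" flag of the kind Tom Lane speculates about elsewhere in this thread (purely illustrative, not planner code):

```python
class Path:
    def __init__(self, name, pinned=False):
        self.name = name
        self.pinned = pinned   # another structure still references this path
        self.freed = False

def add_path(pathlist, new_path, accept):
    # Toy model of add_path(): a rejected path is normally freed,
    # but a pinned one survives because something else points at it.
    if accept:
        pathlist.append(new_path)
    elif not new_path.pinned:
        new_path.freed = True  # stands in for pfree()

ordered = []
presorted = Path("presorted input path", pinned=True)  # also in input_rel's pathlist
add_path(ordered, presorted, accept=False)
print(presorted.freed)   # False: the pin prevents the use-after-free
```

Without the pin, the rejected-but-still-referenced path is exactly the freed ({}) entry showing up in the parent rel's pathlist.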

Suggestions?

Thanks

-- 
Jeevan Chalke

*Senior Staff SDE, Database Architect, and Manager*
*Product Development*



edbpostgres.com


Re: WIP/PoC for parallel backup

2020-04-07 Thread Jeevan Chalke
On Tue, Apr 7, 2020 at 10:14 PM Asif Rehman  wrote:

> Hi,
>
> Thanks, Kashif and Rajkumar. I have fixed the reported issues.
>
> I have added the shared state as previously described. The new grammar
> changes
> are as follows:
>
> START_BACKUP [LABEL ''] [FAST] [MAX_RATE %d]
> - This will generate a unique backupid using pg_strong_random(16) and
> hex-encoded
>   it. which is then returned as the result set.
> - It will also create a shared state and add it to the hashtable. The
> hash table size is set
>   to BACKUP_HASH_SIZE=10, but since hashtable can expand dynamically,
> I think it's
>   sufficient initial size. max_wal_senders is not used, because it can
> be set to quite a
>   large values.
>
> JOIN_BACKUP 'backup_id'
> - finds 'backup_id' in hashtable and attaches it to server process.
>
>
> SEND_FILE '(' 'FILE' ')' [NOVERIFY_CHECKSUMS]
> - renamed SEND_FILES to SEND_FILE
> - removed START_WAL_LOCATION from this because 'startptr' is now
> accessible through
>   shared state.
>
> There is no change in other commands:
> STOP_BACKUP [NOWAIT]
> LIST_TABLESPACES [PROGRESS]
> LIST_FILES [TABLESPACE]
> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
>
> The current patches (v11) have been rebased to the latest master. The
> backup manifest is enabled
> by default, so I have disabled it for parallel backup mode and have
> generated a warning so that
> user is aware of it and not expect it in the backup.
>

So, are you working on making that work? I don't think a parallel backup
feature should create a backup with no manifest.


>
>
>
> --
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: WIP/PoC for parallel backup

2020-04-06 Thread Jeevan Chalke
Asif,

After the recent backup manifest addition, the patches need a rebase and
reconsideration of a few things, like making sure that a parallel backup
creates the manifest file correctly.

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: WIP/PoC for parallel backup

2020-03-16 Thread Jeevan Chalke
Hi Asif,


> Thanks Rajkumar. I have fixed the above issues and have rebased the patch
> to the latest master (b7f64c64).
> (V9 of the patches are attached).
>

I had a further review of the patches and here are a few observations:

1.
+/*
+ * stop_backup() - ends an online backup
+ *
+ * The function is called at the end of an online backup. It sends out
+ * pg_control file, optionally WAL segments and ending WAL location.
+ */

These comments seem outdated.

2. With parallel jobs, maxrate is now not supported. Since we are now
fetching data in multiple threads, throttling seems important here. Can you
please explain why you have disabled it?

3. As we are always fetching a single file, and as Robert suggested, let's
rename SEND_FILES to SEND_FILE.

4. Does this work on Windows? I mean does pthread_create() work on Windows?
I asked this as I see that pgbench has its own implementation for
pthread_create() for WIN32 but this patch doesn't.

5. Typos:
tablspace => tablespace
safly => safely

6. parallel_backup_run() needs some comments explaining the states it goes
through PB_* states.

7.
+            case PB_FETCH_REL_FILES:    /* fetch files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_STOP_BACKUP;
+                    free_filelist(backupinfo);
+                }
+                break;
+            case PB_FETCH_WAL_FILES:    /* fetch WAL files from server */
+                if (backupinfo->activeworkers == 0)
+                {
+                    backupinfo->backupstate = PB_BACKUP_COMPLETE;
+                }
+                break;

Why is free_filelist() not called in the PB_FETCH_WAL_FILES case?
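On point 2, if throttling moves to the client side, one simple scheme is to split maxrate evenly and give each worker its own rate limiter. A deterministic Python sketch with an injected clock (all names are hypothetical; this is not the server's throttle()):

```python
class WorkerThrottle:
    # Per-worker limiter: maxrate is split evenly across nworkers, and
    # each worker asks how long to sleep before its next send.
    # (Hypothetical client-side scheme, sketched for discussion.)
    def __init__(self, maxrate_bytes_per_sec, nworkers):
        self.rate = maxrate_bytes_per_sec / nworkers
        self.next_free = 0.0   # earliest time the next send may start

    def delay_before_send(self, nbytes, now):
        wait = max(0.0, self.next_free - now)
        self.next_free = max(self.next_free, now) + nbytes / self.rate
        return wait

t = WorkerThrottle(1_000_000, 2)              # 1 MB/s total, 2 workers
print(t.delay_before_send(500_000, now=0.0))  # 0.0: nothing queued yet
print(t.delay_before_send(500_000, now=0.0))  # 1.0: must wait one second
```

Re-balancing the per-worker rate when a worker finishes (the second option discussed later in this thread) would just mean recomputing self.rate under a mutex.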

Thanks
-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

Phone: +91 20 66449694

Website: www.enterprisedb.com
EnterpriseDB Blog: http://blogs.enterprisedb.com/
Follow us on Twitter: http://www.twitter.com/enterprisedb



Re: WIP/PoC for parallel backup

2020-02-10 Thread Jeevan Chalke
Hi Asif,

On Thu, Jan 30, 2020 at 7:10 PM Asif Rehman  wrote:

>
> Here are the the updated patches, taking care of the issues pointed
> earlier. This patch adds the following commands (with specified option):
>
> START_BACKUP [LABEL ''] [FAST]
> STOP_BACKUP [NOWAIT]
> LIST_TABLESPACES [PROGRESS]
> LIST_FILES [TABLESPACE]
> LIST_WAL_FILES [START_WAL_LOCATION 'X/X'] [END_WAL_LOCATION 'X/X']
> SEND_FILES '(' FILE, FILE... ')' [START_WAL_LOCATION 'X/X']
> [NOVERIFY_CHECKSUMS]
>
>
> Parallel backup is not making any use of tablespace map, so I have
> removed that option from the above commands. There is a patch pending
> to remove the exclusive backup; we can further refactor the
> do_pg_start_backup
> function at that time, to remove the tablespace information and move the
> creation of tablespace_map file to the client.
>
>
> I have disabled the maxrate option for parallel backup. I intend to send
> out a separate patch for it. Robert previously suggested to implement
> throttling on the client-side. I found the original email thread [1]
> where throttling was proposed and added to the server. In that thread,
> it was originally implemented on the client-side, but per many suggestions,
> it was moved to server-side.
>
> So, I have a few suggestions on how we can implement this:
>
> 1- have another option for pg_basebackup (i.e. per-worker-maxrate) where
> the user could choose the bandwidth allocation for each worker. This
> approach
> can be implemented on the client-side as well as on the server-side.
>
> 2- have the maxrate, be divided among workers equally at first. and the
> let the main thread keep adjusting it whenever one of the workers finishes.
> I believe this would only be possible if we handle throttling on the
> client.
> Also, as I understand it, implementing this will introduce additional mutex
> for handling of bandwidth consumption data so that rate may be adjusted
> according to data received by threads.
>
> [1]
> https://www.postgresql.org/message-id/flat/521B4B29.20009%402ndquadrant.com#189bf840c87de5908c0b4467d31b50af
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>

The latest changes look good to me. However, the patch set is missing the
documentation. Please add it.

Thanks

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: backup manifests

2019-12-10 Thread Jeevan Chalke
On Tue, Dec 10, 2019 at 3:29 PM Rushabh Lathia 
wrote:

>
> Attaching another version of 0002 patch, as my collogue Jeevan Chalke
> pointed
> few indentation problem in 0002 patch which I sent earlier.  Fixed the
> same in
> the latest patch.
>

I had a look over the new patch and see no issues. Looks good to me.
Thanks for quickly fixing the review comments posted earlier.

However, here are the minor comments:

1.
@@ -122,6 +133,7 @@ static long long int total_checksum_failures;
 /* Do not verify checksums. */
 static bool noverify_checksums = false;

+
 /*
  * The contents of these directories are removed or recreated during server
  * start so they are not included in backups.  The directories themselves
are


Please remove this unnecessary change.

Need to run the indentation.

Thanks
-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: backup manifests

2019-12-08 Thread Jeevan Chalke
On Fri, Dec 6, 2019 at 12:05 PM Rushabh Lathia 
wrote:

>
>
> On Fri, Dec 6, 2019 at 1:44 AM Robert Haas  wrote:
>
>> On Thu, Dec 5, 2019 at 11:22 AM Rushabh Lathia 
>> wrote:
>> > Here is the whole stack of patches.
>>
>> I committed 0001, as that's just refactoring and I think (hope) it's
>> uncontroversial. I think 0002-0005 need to be squashed together
>> (crediting all authors properly and in the appropriate order) as it's
>> quite hard to understand right now,
>
>
> Please find attached single patch and I tried to add the credit to all
> the authors.
>

I had a look over the patch and here are a few review comments:

1.
+    if (pg_strcasecmp(manifest_checksum_algo, "SHA256") == 0)
+        manifest_checksums = MC_SHA256;
+    else if (pg_strcasecmp(manifest_checksum_algo, "CRC32C") == 0)
+        manifest_checksums = MC_CRC32C;
+    else if (pg_strcasecmp(manifest_checksum_algo, "NONE") == 0)
+        manifest_checksums = MC_NONE;
+    else
+        ereport(ERROR,

Is NONE a valid input? I think the default is "NONE" anyway, and thus there
is no need to accept it as an input. It would be better to simply error out
if the input is neither "SHA256" nor "CRC32C".

I believe you have done it this way because pg_basebackup always passes the
MANIFEST_CHECKSUMS '%s' string, which will contain "NONE" if no user input is
given. But I think passing it conditionally would be better, like we do with
maxrate_clause, for example.

Well, this is just what I think; feel free to ignore it, as I don't see any
correctness issue here.


2.
+    if (manifest_checksums != MC_NONE)
+    {
+        checksumbuflen = finalize_manifest_checksum(&cCtx, checksumbuf);
+        switch (manifest_checksums)
+        {
+            case MC_NONE:
+                break;
+        }

Since switch case is within "if (manifest_checksums != MC_NONE)" condition,
I don't think we need a case for MC_NONE here. Rather we can use a default
case to error out.


3.
+    if (manifest_checksums != MC_NONE)
+    {
+        initialize_manifest_checksum(&cCtx);
+        update_manifest_checksum(&cCtx, content, len);
+    }

@@ -1384,6 +1641,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
     int         segmentno = 0;
     char       *segmentpath;
     bool        verify_checksum = false;
+    ChecksumCtx cCtx;
+
+    initialize_manifest_checksum(&cCtx);


I see that in a few cases you call initialize/update_manifest_checksum()
conditionally, and at some other places unconditionally. Calling them
unconditionally should not cause any issues, as the switch cases inside them
do nothing when manifest_checksums is MC_NONE.


4.
initialize/update/finalize_manifest_checksum() functions may be needed by
the
validation patch as well. And thus I think these functions should not depend
on a global variable as such. Also, it will be good if we keep them in a
file
that is accessible to frontend-only code. Well, you can ignore these
comments
with the argument saying that this refactoring can be done by the patch
adding
validation support. I have no issues. Since both the patches are dependent
and
posted on the same email chain, thought of putting that observation.


5.
+    switch (manifest_checksums)
+    {
+        case MC_SHA256:
+            checksumlabel = "SHA256:";
+            break;
+        case MC_CRC32C:
+            checksumlabel = "CRC32C:";
+            break;
+        case MC_NONE:
+            break;
+    }

This code in AddFileToManifest() is executed for every file for which we add
an entry. However, checksumlabel will remain the same throughout. Can it be
set just once and then used as is?


6.
Can we avoid declaring manifest_checksums as a global variable? I think for
that we would need to pass it to every function and thus change several
function signatures. Currently, we pass "StringInfo manifest" to all the
required functions; would it be better to pass a struct variable instead?
The struct could have members like "StringInfo manifest", the checksum type
(manifest_checksums), the checksum label, etc.
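On comment 3 above: making the initialize/update/finalize calls unconditional just means the no-checksum case degenerates to a no-op, along these lines (a Python sketch with hashlib, not the server code; class and method names are mine):

```python
import hashlib

class ManifestChecksum:
    # Incremental checksum context that is a no-op for "none", so
    # callers never need to test the algorithm first.  (Sketch only,
    # not the patch's ChecksumCtx.)
    def __init__(self, algo):
        self.ctx = hashlib.new(algo) if algo != "none" else None

    def update(self, data):
        if self.ctx is not None:
            self.ctx.update(data)

    def finalize(self):
        return self.ctx.hexdigest() if self.ctx is not None else ""

c = ManifestChecksum("sha256")
c.update(b"backup_manifest line\n")
print(len(c.finalize()))                       # 64 hex digits
print(repr(ManifestChecksum("none").finalize()))  # ''
```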


Thanks
-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: WIP/PoC for parallel backup

2019-11-27 Thread Jeevan Chalke
On Wed, Nov 13, 2019 at 7:04 PM Asif Rehman  wrote:

>
> Sorry, I sent the wrong patches. Please see the correct version of the
> patches (_v6).
>

Review comments on these patches:

1.
+    XLogRecPtr    wal_location;

Looking at the other field names in the basebackup_options structure, let's
use wallocation instead, or better, startwallocation, to be precise.

2.
+    int32        size;

Should we use size_t here?

3.
I am still not sure why we need SEND_BACKUP_FILELIST as a separate command.
Can't we return the file list with START_BACKUP itself?

4.
+    else if (
+#ifndef WIN32
+             S_ISLNK(statbuf.st_mode)
+#else
+             pgwin32_is_junction(pathbuf)
+#endif
+        )
+    {
+        /*
+         * If symlink, write it as a directory. file symlinks only allowed
+         * in pg_tblspc
+         */
+        statbuf.st_mode = S_IFDIR | pg_dir_create_mode;
+        _tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf, false);
+    }
+}

In normal backup mode, we skip any special file that is not a regular file,
a directory, or a symlink inside pg_tblspc. But in your patch, the above
code treats it as a directory. Should parallel backup too skip such special
files?

5.
Please keep header file inclusions in alphabetical order in basebackup.c and
pg_basebackup.c

6.
+/*
+ * build query in form of: SEND_BACKUP_FILES ('base/1/1245/32683',
+ * 'base/1/1245/32683', ...) [options]
+ */

Please update these comments as we fetch one file at a time.

7.
+backup_file:
+SCONST{ $$ = (Node *)
makeString($1); }
+;
+

Instead of having this rule with only one constant terminal, we can use
SCONST directly in backup_files_list. However, I don't see any issue with
this approach either, just trying to reduce the rules.

8.
Please indent code within 80 char limit at all applicable places.

9.
Please fix following typos:

identifing => identifying
optionaly => optionally
structre => structure
progrsss => progress
Retrive => Retrieve
direcotries => directories
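Back to point 4: the existing (non-parallel) walker effectively classifies each entry by its stat mode and skips anything that is neither a regular file, a directory, nor a symlink. A Python sketch of the same mode test (the function name is mine):

```python
import stat

def classify_entry(mode):
    # Classify a stat mode the way basebackup's directory walker does:
    # regular files and directories are sent, symlinks are handled
    # specially, and anything else (fifo, socket, device) is skipped.
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISLNK(mode):
        return "symlink"
    return "skip"

print(classify_entry(stat.S_IFREG | 0o600))  # file
print(classify_entry(stat.S_IFIFO | 0o600))  # skip
```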


=

The other mail thread, related to backup manifests [1], creates a
backup_manifest file and sends it to the client; it contains an optional
checksum and other details including filename, file size, mtime, etc.
There is a patch on the same thread that then validates the backup too.

Since this patch also gets a file list from the server with similar details
(except the checksum), can parallel backup somehow use the backup-manifest
infrastructure from that patch?

When parallel backup is in use, will a backup_manifest file be created too?
I am just trying to visualize the scenario when both these features are
committed.

[1]
https://www.postgresql.org/message-id/CA+TgmoZV8dw1H2bzZ9xkKwdrk8+XYa+DC9H=f7heo2zna5t...@mail.gmail.com


> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>
Thanks
-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: backup manifests

2019-11-21 Thread Jeevan Chalke
On Wed, Nov 20, 2019 at 11:05 AM Suraj Kharage <
suraj.khar...@enterprisedb.com> wrote:

> Hi,
>
> Since now we are generating the backup manifest file with each backup, it
> provides us an option to validate the given backup.
> Let's say, we have taken a backup and after a few days, we want to check
> whether that backup is validated or corruption-free without restarting the
> server.
>
> Please find attached POC patch for same which will be based on the latest
> backup manifest patch from Rushabh. With this functionality, we add new
> option to pg_basebackup, something like --verify-backup.
> So, the syntax would be:
> ./bin/pg_basebackup --verify-backup -D 
>
> Basically, we read the backup_manifest file line by line from the given
> directory path and build the hash table, then scan the directory and
> compare each file with the hash entry.
>
> Thoughts/suggestions?
>


I like the idea of verifying the backup once we have a backup_manifest with
us. Periodically verifying an already-taken backup becomes easy with this
simple tool.
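The proposed flow (read the manifest line by line into a hash table, then scan the directory and compare each file) can be sketched in a few lines of Python. The line format here, name/size/sha256 separated by tabs, is a simplified stand-in of mine, not the real backup_manifest format:

```python
import hashlib
import os

def load_manifest(path):
    # Build the hash table: one entry per "name<TAB>size<TAB>sha256" line.
    # (Simplified line format, not the real backup_manifest format.)
    entries = {}
    with open(path) as f:
        for line in f:
            name, size, digest = line.rstrip("\n").split("\t")
            entries[name] = (int(size), digest)
    return entries

def verify_backup(backup_dir, entries):
    # Compare every manifest entry against the file on disk.
    problems = []
    for name, (size, digest) in entries.items():
        full = os.path.join(backup_dir, name)
        if not os.path.isfile(full):
            problems.append(name + ": missing")
            continue
        with open(full, "rb") as f:
            data = f.read()
        if len(data) != size:
            problems.append(name + ": size mismatch")
        elif hashlib.sha256(data).hexdigest() != digest:
            problems.append(name + ": checksum mismatch")
    return problems
```

A real tool would stream large files through the hash instead of reading them whole, but the shape of the comparison is the same.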

I have reviewed this patch and here are my comments:

1.
@@ -30,7 +30,9 @@
 #include "common/file_perm.h"
 #include "common/file_utils.h"
 #include "common/logging.h"
+#include "common/sha2.h"
 #include "common/string.h"
+#include "fe_utils/simple_list.h"
 #include "fe_utils/recovery_gen.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
@@ -38,12 +40,19 @@
 #include "pgtar.h"
 #include "pgtime.h"
 #include "pqexpbuffer.h"
+#include "pgrhash.h"
 #include "receivelog.h"
 #include "replication/basebackup.h"
 #include "streamutil.h"

Please add new files in order.

2.
Can hash related file names be renamed to backuphash.c and backuphash.h?

3.
Need indentation adjustments at various places.

4.
+    char        buf[100];  // 1MB chunk

It would be good to use a multiple of the block/page size (or at least a
power of 2).

5.
+typedef struct pgrhash_entry
+{
+    struct pgrhash_entry *next; /* link to next entry in same bucket */
+    DataDirectoryFileInfo *record;
+} pgrhash_entry;
+
+struct pgrhash
+{
+    unsigned    nbuckets;       /* number of buckets */
+    pgrhash_entry **bucket;     /* pointer to hash entries */
+};
+
+typedef struct pgrhash pgrhash;

These two can be moved to .h file instead of redefining over there.

6.
+/*
+ * TODO: this function is not necessary, can be removed.
+ * Test whether the given row number is match for the supplied keys.
+ */
+static bool
+pgrhash_compare(char *bt_filename, char *filename)

Yeah, it can be removed by doing strcmp() at the required places rather than
doing it in a separate function.

7.
mdate is not compared anywhere. I understand that the mtime of a file in the
backup directory can't be compared with its manifest entry, since the
manifest records the mtime of the server's file while the copy in the backup
will have a different mtime. But adding a few comments there would be good.

8.
+    char        mdate[24];

should be mtime instead?


Thanks

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: backup manifests

2019-11-21 Thread Jeevan Chalke
On Tue, Nov 19, 2019 at 3:30 PM Rushabh Lathia 
wrote:

>
>
> My colleague Suraj did testing and noticed a performance impact with the
> checksums. On further testing, he found that SHA-256 in particular has a
> larger performance impact.
>
> Please find below statistics:
>
> Timings for 10/20/50/100 tables of 100 MB each, and 100 tables of 1 GB
> each, with the % performance overhead relative to no checksum:
>
> 10 tables (100 MB each):
>   without checksum:  real 0m10.957s   user 0m0.367s   sys 0m2.275s
>   SHA-256:           real 0m16.816s   user 0m0.210s   sys 0m2.067s   (53%)
>   MD5:               real 0m11.895s   user 0m0.174s   sys 0m1.725s   (8%)
>   CRC:               real 0m11.136s   user 0m0.365s   sys 0m2.298s   (2%)
>
> 20 tables (100 MB each):
>   without checksum:  real 0m20.610s   user 0m0.484s   sys 0m3.198s
>   SHA-256:           real 0m31.745s   user 0m0.569s   sys 0m4.089s   (54%)
>   MD5:               real 0m22.717s   user 0m0.638s   sys 0m4.026s   (10%)
>   CRC:               real 0m21.075s   user 0m0.538s   sys 0m3.417s   (2%)
>
> 50 tables (100 MB each):
>   without checksum:  real 0m49.143s   user 0m1.646s   sys 0m8.499s
>   SHA-256:           real 1m13.683s   user 0m1.305s   sys 0m10.541s  (50%)
>   MD5:               real 0m51.856s   user 0m0.932s   sys 0m7.702s   (6%)
>   CRC:               real 0m49.689s   user 0m1.028s   sys 0m6.921s   (1%)
>
> 100 tables (100 MB each):
>   without checksum:  real 1m34.308s   user 0m2.265s   sys 0m14.717s
>   SHA-256:           real 2m22.403s   user 0m2.613s   sys 0m20.776s  (51%)
>   MD5:               real 1m41.524s   user 0m2.158s   sys 0m15.949s  (8%)
>   CRC:               real 1m35.045s   user 0m2.061s   sys 0m16.308s  (1%)
>
> 100 tables (1 GB each):
>   without checksum:  real 17m18.336s  user 0m20.222s  sys 3m12.960s
>   SHA-256:           real 24m45.942s  user 0m26.911s  sys 3m33.501s  (43%)
>   MD5:               real 17m41.670s  user 0m26.506s  sys 3m18.402s  (2%)
>   CRC:               real 17m22.296s  user 0m26.811s  sys 3m56.653s
>                      (approx. 0.5%; sometimes this test completes in the
>                      same time as without checksum)
>
> Considering the above results, I modified Robert's earlier patch and added
> a "manifest_with_checksums" option to pg_basebackup.  With the new patch,
> checksums are disabled by default and are enabled only when the
> "manifest_with_checksums" option is provided.  Also rebased the whole
> patch set.
>

Review comments on 0004:

1.
I don't think we need the o_manifest_with_checksums variable;
manifest_with_checksums can be used instead.

2.
We need to document this new option for pg_basebackup and basebackup.

3.
Also, instead of keeping manifest_with_checksums as a global variable, we
should pass it to the required functions. Patch 0002 already modifies the
signatures of all relevant functions anyway, so we just need to add one more
bool parameter there.

4.
Why do we need a "File" marker at the start of each entry when we are adding
files only? I wonder whether we also need tablespace and directory markers,
so that entries can start with "Tablespace" and "Dir" as well.

5.
Even if I don't provide the manifest-with-checksums option, I see that a
checksum is still calculated for the backup_manifest file itself. Is that
intentional, or was it missed?
I think we should omit that too when the option is not provided.

6.
Is it possible to get only the backup manifest from the server? A client like
pg_basebackup could then fetch the files by reading it.

Thanks


>
>
>
> Regards,
>
> --
> Rushabh Lathia
> www.EnterpriseDB.com
>
> On Tue, Oct 1, 2019 at 5:43 PM Robert Haas  wrote:
>
>> On Mon, Sep 30, 2019 at 5:31 AM Jeevan Chalke
>>  wrote:
>> > Entry for directory is not added in manifest. So it might be difficult
>> > at client to get to know about the directories. Will it be good to add
>> > an entry for each directory too? May be like:
>> > Dir 
>>
>> Well, what kind of corruption would this allow us to detect that we
>> can't detect as things stand? I think the only case is an empty
>> directory. If it's not empty, we'd have some entries for the files in
>> that directory, and those files won't be able to exist unless the
>> directory does. But, how would we end up backing up an empty
>> directory, anyway?
>>
>> I don't really *mind* adding directories into the manifest, but I'm
>> not sure how much it helps.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>>
>>
>
> --
> Rushabh Lathia
>


-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: WIP/PoC for parallel backup

2019-10-18 Thread Jeevan Chalke
On Thu, Oct 17, 2019 at 10:51 AM Asif Rehman  wrote:

>
> Attached are the updated patches.
>

I had a quick look over these changes and they look good overall.
However, here are a few review comments I caught while glancing through
patches 0002 and 0003.


--- 0002 patch

1.
Can the lsn option be renamed to start-wal-location? That would also be
clearer.

2.
+typedef struct
+{
+	char		name[MAXPGPATH];
+	char		type;
+	int32		size;
+	time_t		mtime;
+} BackupFile;

I think it will be good if we keep this structure in a common place so that
the client can also use it.

3.
+SEND_FILE_LIST,
+SEND_FILES_CONTENT,
Can the above two commands be renamed to SEND_BACKUP_MANIFEST and
SEND_BACKUP_FILE respectively?
The reason behind the first renaming is that we are not getting only a file
list here; we are getting a few more details along with it. The second keeps
it in line with START_BACKUP/STOP_BACKUP/SEND_BACKUP_MANIFEST.

4.
Typos:
non-exlusive => non-exclusive
retured => returned
optionaly => optionally
nessery => necessary
totoal => total


--- 0003 patch

1.
+static int
+simple_list_length(SimpleStringList *list)
+{
+	int			len = 0;
+	SimpleStringListCell *cell;
+
+	for (cell = list->head; cell; cell = cell->next, len++)
+		;
+
+	return len;
+}

I think it would be good if this goes into simple_list.c; that would help
other usages as well.

2.
Please revert these unnecessary changes:

@@ -1475,6 +1575,7 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res,
int rownum)
  */
 snprintf(filename, sizeof(filename), "%s/%s", current_path,
  copybuf);
+
 if (filename[strlen(filename) - 1] == '/')
 {
 /*

@@ -1528,8 +1622,8 @@ ReceiveAndUnpackTarFile(PGconn *conn, PGresult *res,
int rownum)
  * can map them too.)
  */
 filename[strlen(filename) - 1] = '\0';/* Remove
trailing slash */
-
 mapped_tblspc_path =
> get_tablespace_mapping(&copybuf[157]);
+
 if (symlink(mapped_tblspc_path, filename) != 0)
 {
 pg_log_error("could not create symbolic link from
\"%s\" to \"%s\": %m",

3.
Typos:
retrive => retrieve
takecare => take care
tablespae => tablespace

4.
The ParallelBackupEnd() function does not do anything for parallelism. Would
it be better to just rename it EndBackup()?

5.
To pass a tablespace path to the server in SEND_FILES_CONTENT, you are
reusing the LABEL option, which seems odd. How about adding a new option for
that?

6.
It would be good to have a comment in each function's prologue explaining
what the function actually does. For functions like:
GetBackupFilesList()
ReceiveFiles()
create_workers_and_fetch()


Thanks


>
> Thanks,
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>
>

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: WIP/PoC for parallel backup

2019-10-03 Thread Jeevan Chalke
ning 2 files.
>>
>> This is one of several possible approaches. If we're doing a
>> plain-format backup in parallel, we can just write each file where it
>> needs to go and call it good. But, with a tar-format backup, what
>> should we do? I can see three options:
>>
>> 1. Error! Tar format parallel backups are not supported.
>>
>> 2. Write multiple tar files. The user might reasonably expect that
>> they're going to end up with the same files at the end of the backup
>> regardless of whether they do it in parallel. A user with this
>> expectation will be disappointed.
>>
>> 3. Write one tar file. In this design, the workers have to take turns
>> writing to the tar file, so you need some synchronization around that.
>> Perhaps you'd have N threads that read and buffer a file, and N+1
>> buffers.  Then you have one additional thread that reads the complete
>> files from the buffers and writes them to the tar file. There's
>> obviously some possibility that the writer won't be able to keep up
>> and writing the backup will therefore be slower than it would be with
>> approach (2).
>>
>> There's probably also a possibility that approach (2) would thrash the
>> disk head back and forth between multiple files that are all being
>> written at the same time, and approach (3) will therefore win by not
>> thrashing the disk head. But, since spinning media are becoming less
>> and less popular and are likely to have multiple disk heads under the
>> hood when they are used, this is probably not too likely.
>>
>> I think your choice to go with approach (2) is probably reasonable,
>> but I'm not sure whether everyone will agree.
>>
>
> Yes for the tar format support, approach (2) is what I had in
> mind. Currently I'm working on the implementation and will share the patch
> in a couple of days.
>
>
> --
> Asif Rehman
> Highgo Software (Canada/China/Pakistan)
> URL : www.highgo.ca
>


-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
commit 0d7433e44123b486b48f2071b24f1eaef46f4849
Author: Jeevan Chalke 
Date:   Thu Oct 3 12:58:55 2019 +0530

Fix errors and warnings.

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ef55bd0..32ed160 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -346,6 +346,7 @@ perform_base_backup(basebackup_options *opt)
 		{
 			/* save backup label into temp file for now. So stop backup can send it to pg_basebackup later on. */
 			FILE   *fp = AllocateFile(TMP_BACKUP_LABEL_FILE, "w");
+
 			if (!fp)
 ereport(ERROR,
 		(errcode_for_file_access(),
@@ -1627,8 +1628,8 @@ StopBackup(basebackup_options *opt)
 	XLogRecPtr	endptr;
 	char	   *labelfile;
 	struct stat statbuf;
-	int			r;
 	StringInfoData buf;
+	FILE	   *lfp;
 
 	/* Disable throttling. */
 	throttling_counter = -1;
@@ -1648,7 +1649,7 @@ StopBackup(basebackup_options *opt)
 	sendFile(TMP_BACKUP_LABEL_FILE, BACKUP_LABEL_FILE, &statbuf, false, InvalidOid);
 
 	/* read backup_label file into buffer, we need it for do_pg_stop_backup */
-	FILE *lfp = AllocateFile(TMP_BACKUP_LABEL_FILE, "r");
+	lfp = AllocateFile(TMP_BACKUP_LABEL_FILE, "r");
 	if (!lfp)
 	{
 		ereport(ERROR,
@@ -1658,7 +1659,7 @@ StopBackup(basebackup_options *opt)
 	}
 
 	labelfile = palloc(statbuf.st_size + 1);
-	r = fread(labelfile, statbuf.st_size, 1, lfp);
+	fread(labelfile, statbuf.st_size, 1, lfp);
 	labelfile[statbuf.st_size] = '\0';
 
 
@@ -1692,6 +1693,7 @@ SendBackupFileList(List *tablespaces)
 {
 	StringInfoData buf;
 	ListCell   *lc;
+	pathinfo   *pi;
 
 	List *files = NIL;
 	foreach(lc, tablespaces)
@@ -1704,7 +1706,6 @@ SendBackupFileList(List *tablespaces)
 	}
 
 	// add backup label file
-	pathinfo *pi;
 	MAKE_PATHINFO(false, TMP_BACKUP_LABEL_FILE);
 	files = lcons(pi, files);
 
@@ -1736,15 +1737,15 @@ SendBackupFileList(List *tablespaces)
 	{
 		pathinfo *pi = (pathinfo *) lfirst(lc);
 		char   *path = pi->path;
+		Size		len;
 
 		/* Send one datarow message */
 		pq_beginmessage(&buf, 'D');
 		pq_sendint16(&buf, 2);	/* number of columns */
 
-		int32 isdir = pi->isdir ? 1 : 0;
-		send_int8_string(&buf, isdir);
+		send_int8_string(&buf, (pi->isdir ? 1 : 0));
 
-		Size len = strlen(path);
+		len = strlen(path);
 		pq_sendint32(&buf, len);
 		pq_sendbytes(&buf, path, len);
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 1637735..a316cc6 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1864,6 +1864,7 @@ BaseBackup(void)
 	{
 		SimpleStringList files = {NULL, NULL};
 		SimpleStringList **worker_files;
+		int			num_files;
 
 		/*
 		 * Get

Re: backup manifests

2019-09-30 Thread Jeevan Chalke
Entry for directory is not added in manifest. So it might be difficult
at client to get to know about the directories. Will it be good to add
an entry for each directory too? May be like:
Dir 

Also, on latest HEAD patches does not apply.

On Wed, Sep 25, 2019 at 6:17 PM vignesh C  wrote:

> On Sat, Sep 21, 2019 at 12:25 AM Robert Haas 
> wrote:
> >
> Some comments:
>
> Manifest file will be in plain text format even if compression is
> specified, should we compress it?
> May be this is intended, just raised the point to make sure that it is
> intended.
> +static void
> +ReceiveBackupManifestChunk(size_t r, char *copybuf, void *callback_data)
> +{
> + WriteManifestState *state = callback_data;
> +
> + if (fwrite(copybuf, r, 1, state->file) != 1)
> + {
> + pg_log_error("could not write to file \"%s\": %m", state->filename);
> + exit(1);
> + }
> +}
>
> WALfile.done file gets added but wal file information is not included
> in the manifest file, should we include WAL file also?
> @@ -599,16 +618,20 @@ perform_base_backup(basebackup_options *opt)
>   (errcode_for_file_access(),
>   errmsg("could not stat file \"%s\": %m", pathbuf)));
>
> - sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid);
> + sendFile(pathbuf, pathbuf, &statbuf, false, InvalidOid, manifest,
> + NULL);
>
>   /* unconditionally mark file as archived */
>   StatusFilePath(pathbuf, fname, ".done");
> - sendFileWithContent(pathbuf, "");
> + sendFileWithContent(pathbuf, "", manifest);
>
> Should we add an option to make manifest generation configurable to
> reduce overhead during backup?
>
> Manifest file does not include directory information, should we include it?
>
> There is one warning:
> In file included from ../../../src/include/fe_utils/string_utils.h:20:0,
>  from pg_basebackup.c:34:
> pg_basebackup.c: In function ‘ReceiveTarFile’:
> ../../../src/interfaces/libpq/pqexpbuffer.h:60:9: warning: the
> comparison will always evaluate as ‘false’ for the address of ‘buf’
> will never be NULL [-Waddress]
>   ((str) == NULL || (str)->maxlen == 0)
>  ^
> pg_basebackup.c:1203:7: note: in expansion of macro ‘PQExpBufferBroken’
>    if (PQExpBufferBroken(&buf))
>
>
Yes, I too observed this warning.


> pg_gmtime can fail in case of malloc failure:
> + /*
> + * Convert time to a string. Since it's not clear what time zone to use
> + * and since time zone definitions can change, possibly causing confusion,
> + * use GMT always.
> + */
> + pg_strftime(timebuf, sizeof(timebuf), "%Y-%m-%d %H:%M:%S %Z",
> + pg_gmtime());
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>
>
>

-- 
Jeevan Chalke
Associate Database Architect & Team Lead, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-12 Thread Jeevan Chalke
Hi,

One of my colleagues at EDB, Rajkumar Raghuwanshi, while testing this
feature, reported an issue: if a full base backup is taken, a database is
then created, and an incremental backup is then taken, combining the full
backup with the incremental backup fails.

I had a look at this issue and observed that when the new database is
created, the catalog files are copied as-is into the new directory
corresponding to the newly created database. And as they are just copied,
the LSNs on those pages are not changed. Because of this, the incremental
backup treats them as existing files and thus does not copy the blocks from
these new files, leading to the failure.

I was surprised to learn that even though we are creating new files from
old files, we keep the LSN unmodified. I didn't see any other parameter in
basebackup indicating that this is a new file since the last LSN, or
anything similar.

I tried looking for any other DDL doing something similar, i.e., creating a
new page with an existing LSN, but I could not find any commands other than
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE.

Suggestions/thoughts?

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
On Fri, Aug 30, 2019 at 6:52 PM Jeevan Ladhe 
wrote:

> Here are some comments:
> Or maybe we can just say:
> "cannot verify checksum in file \"%s\"" if checksum requested, disable the
> checksum and leave it to the following message:
>
> +   ereport(WARNING,
> +   (errmsg("file size (%d) not in multiple of page size
> (%d), sending whole file",
> +   (int) cnt, BLCKSZ)));
>
>
Opted for the above suggestion.


>
> I think we should give the user hint from where he should be reading the
> input
> lsn for incremental backup in the --help option as well as documentation?
> Something like - "To take an incremental backup, please provide value of
> "--lsn"
> as the "START WAL LOCATION" of previously taken full backup or incremental
> backup from the backup_label file.
>

Added this in the documentation. In --help, it would be too crowded.


> pg_combinebackup:
>
> +static bool made_new_outputdata = false;
> +static bool found_existing_outputdata = false;
>
> Both of these are global, I understand that we need them global so that
> they are
> accessible in cleanup_directories_atexit(). But they are passed to
> verify_dir_is_empty_or_create() as parameters, which I think is not needed.
> Instead verify_dir_is_empty_or_create() can directly change the globals.
>

After adding support for tablespaces, these two take different values
depending upon the context.


> The current logic assumes the incremental backup directories are to be
> provided
> as input in the serial order the backups were taken. This is bit confusing
> unless clarified in pg_combinebackup help menu or documentation. I think we
> should clarify it at both the places.
>

Added in doc.


>
> I think scan_directory() should be rather renamed as do_combinebackup().
>

I am not sure about this renaming. scan_directory() is called recursively
to scan each subdirectory too; if we renamed it, the name would suggest it
recursively does a combine backup, whereas combining backups is a single
whole process.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
On Tue, Sep 3, 2019 at 12:11 PM Dilip Kumar  wrote:

> On Fri, Aug 16, 2019 at 3:54 PM Jeevan Chalke
>  wrote:
> >
> 0003:
> +/*
> + * When to send the whole file, % blocks modified (90%)
> + */
> +#define WHOLE_FILE_THRESHOLD 0.9
>
> How this threshold is selected.  Is it by some test?
>

Currently, it is set arbitrarily. If required, we will make it a GUC.


>
> - magic number, currently 0 (4 bytes)
> I think in the patch we are using  (#define INCREMENTAL_BACKUP_MAGIC
> 0x494E4352) as a magic number, not 0
>

Yes. Robert too reported this. Updated the commit message.


>
> Can we breakdown this function in 2-3 functions.  At least creating a
> file map can directly go to a separate function.
>

Separated out the filemap changes into a separate function. The rest is kept
as-is to allow an easy follow-up.


>
> I have read 0003 and 0004 patch and there are few cosmetic comments.
>

Can you please post those too?

Other comments are fixed.


>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
On Wed, Sep 4, 2019 at 5:21 PM Dilip Kumar  wrote:

>
>  I have not yet completed the review for 0004, but I have few more
> comments.  Tomorrow I will try to complete the review and some testing
> as well.
>
> 1. It seems that the output full backup generated with
> pg_combinebackup also contains the "INCREMENTAL BACKUP REFERENCE WAL
> LOCATION".  It seems confusing
> because now this is a full backup, not the incremental backup.
>

Yes, that was remaining and was in my TODO.
Done in the new patchset. Also, taking --label as an input like
pg_basebackup.


>
> 2.
> + memset(outblocks, 0, sizeof(FileOffset) * RELSEG_SIZE);
>
> I don't think you need to memset this explicitly as you can initialize
> the array itself no?
> FileOffset outblocks[RELSEG_SIZE] = {{0}}
>

I didn't see any issue with memset either but changed this per your
suggestion.


>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
On Tue, Aug 27, 2019 at 11:59 PM Robert Haas  wrote:

> On Fri, Aug 16, 2019 at 6:23 AM Jeevan Chalke
>  wrote:
> > [ patches ]
>
> Reviewing 0002 and 0003:
>
> - Commit message for 0003 claims magic number and checksum are 0, but
> that (fortunately) doesn't seem to be the case.
>

Oops, updated commit message.


>
> - looks_like_rel_name actually checks whether it looks like a
> *non-temporary* relation name; suggest adjusting the function name.
>
> - The names do_full_backup and do_incremental_backup are quite
> confusing because you're really talking about what to do with one
> file.  I suggest sendCompleteFile() and sendPartialFile().
>

Changed function names.


>
> - Is there any good reason to have 'refptr' as a global variable, or
> could we just pass the LSN around via function arguments?  I know it's
> just mimicking startptr, but storing startptr in a global variable
> doesn't seem like a great idea either, so if it's not too annoying,
> let's pass it down via function arguments instead.  Also, refptr is a
> crappy name (even worse than startptr); whether we end up with a
> global variable or a bunch of local variables, let's make the name(s)
> clear and unambiguous, like incremental_reference_lsn.  Yeah, I know
> that's long, but I still think it's better than being unclear.
>

Renamed the variable.
However, I have kept it global, as passing it down would require many
functions to change their signatures, like sendFile(), sendDir(),
sendTablespace(), etc.


> - do_incremental_backup looks like it can never report an error from
> fread(), which is bad.  But I see that this is just copied from the
> existing code which has the same problem, so I started a separate
> thread about that.
>
> - I think that passing cnt and blkindex to verify_page_checksum()
> doesn't look very good from an abstraction point of view.  Granted,
> the existing code isn't great either, but I think this makes the
> problem worse.  I suggest passing "int backup_distance" to this
> function, computed as cnt - BLCKSZ * blkindex.  Then, you can
> fseek(-backup_distance), fread(BLCKSZ), and then fseek(backup_distance
> - BLCKSZ).
>

Yep. Done these changes in the refactoring patch.


>
> - While I generally support the use of while and for loops rather than
> goto for flow control, a while (1) loop that ends with a break is
> functionally a goto anyway.  I think there are several ways this could
> be revised.  The most obvious one is probably to use goto, but I vote
> for inverting the sense of the test: if (PageIsNew(page) ||
> PageGetLSN(page) >= startptr) break; This approach also saves a level
> of indentation for more than half of the function.
>

I have used this new inverted condition, but we still need a while(1) loop.


> - I am not sure that it's a good idea for sendwholefile = true to
> result in dumping the entire file onto the wire in a single CopyData
> message.  I don't know of a concrete problem in typical
> configurations, but someone who increases RELSEG_SIZE might be able to
> overflow CopyData's length word.  At 2GB the length word would be
> negative, which might break, and at 4GB it would wrap around, which
> would certainly break.  See CopyData in
> https://www.postgresql.org/docs/12/protocol-message-formats.html  To
> avoid this issue, and maybe some others, I suggest defining a
> reasonably large chunk size, say 1MB as a constant in this file
> someplace, and sending the data as a series of chunks of that size.
>

OK. Done as per the suggestions.


>
> - I don't think that the way concurrent truncation is handled is
> correct for partial files.  Right now it just falls through to code
> which appends blocks of zeroes in either the complete-file or
> partial-file case.  I think that logic should be moved into the
> function that handles the complete-file case.  In the partial-file
> case, the blocks that we actually send need to match the list of block
> numbers we promised to send.  We can't just send the promised blocks
> and then tack a bunch of zero-filled blocks onto the end that the file
> header doesn't know about.
>

Well, in the partial-file case we won't end up inside that block, so we
never send zeroes at the end for a partial file.


> - For reviewer convenience, please use the -v option to git
> format-patch when posting and reposting a patch series.  Using -v2,
> -v3, etc. on successive versions really helps.
>

Sure. Thanks for letting me know about this option.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
On Tue, Aug 27, 2019 at 4:46 PM vignesh C  wrote:

> Few comments:
> Comment:
> + buf = (char *) malloc(statbuf->st_size);
> + if (buf == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_OUT_OF_MEMORY),
> + errmsg("out of memory")));
> +
> + if ((cnt = fread(buf, 1, statbuf->st_size, fp)) > 0)
> + {
> + Bitmapset  *mod_blocks = NULL;
> + int nmodblocks = 0;
> +
> + if (cnt % BLCKSZ != 0)
> + {
>
> We can use same size as full page size.
> After pg start backup full page write will be enabled.
> We can use the same file size to maintain data consistency.
>

Can you please explain which size?
The aim here is to read the entire file in memory, and thus
statbuf->st_size is used.

Comment:
> Should we check if it is same timeline as the system's timeline.
>

At the time of taking the incremental backup, we can't check that.
However, while combining, I made sure that the timeline is the same for all
backups.


>
> Comment:
>
> Should we support compression formats supported by pg_basebackup.
> This can be an enhancement after the functionality is completed.
>

For the incremental backup, it just works out of the box.
For combining backups, as discussed up-thread, the user has to uncompress
first, combine them, and compress again if required.


> Comment:
> We should provide some mechanism to validate the backup. To identify
> if some backup is corrupt or some file is missing(deleted) in a
> backup.
>

Maybe, but not for the first version.


> Comment:
> +/*
> + * When to send the whole file, % blocks modified (90%)
> + */
> +#define WHOLE_FILE_THRESHOLD 0.9
> +
> This can be user configured value.
> This can be an enhancement after the functionality is completed.
>

Yes.


> Comment:
> We can add a readme file with all the details regarding incremental
> backup and combine backup.
>

Will have a look.


>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-09-09 Thread Jeevan Chalke
Hi,

Attached is a new set of patches adding support for tablespace handling.

This patchset also fixes the issues reported by Vignesh, Robert, Jeevan
Ladhe, and Dilip Kumar.

Please have a look and let me know if I missed any comments to address.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 7cc2fb70188676406ca883055467aff602438fcd Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Mon, 9 Sep 2019 10:38:01 +0530
Subject: [PATCH v2 1/4] Add support for command line option to pass LSN.

This adds [ LSN 'lsn' ] to BASE_BACKUP command and --lsn=LSN to
the pg_basebackup binary.

Also, add small tests.
---
 src/backend/replication/basebackup.c | 20 
 src/backend/replication/repl_gram.y  |  6 ++
 src/backend/replication/repl_scanner.l   |  1 +
 src/bin/pg_basebackup/pg_basebackup.c| 15 +--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++-
 5 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 6aab8d7..e72bf8e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -38,6 +38,7 @@
 #include "storage/ipc.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
@@ -52,6 +53,7 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr 	lsn;
 } basebackup_options;
 
 
@@ -652,6 +654,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	bool		o_maxrate = false;
 	bool		o_tablespace_map = false;
 	bool		o_noverify_checksums = false;
+	bool		o_lsn = false;
 
 	MemSet(opt, 0, sizeof(*opt));
 	foreach(lopt, options)
@@ -740,6 +743,23 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 			noverify_checksums = true;
 			o_noverify_checksums = true;
 		}
+		else if (strcmp(defel->defname, "lsn") == 0)
+		{
+			bool		have_error = false;
+
+			if (o_lsn)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("duplicate option \"%s\"", defel->defname)));
+			o_lsn = true;
+
+			/* Validate given LSN and convert it into XLogRecPtr. */
+			opt->lsn = pg_lsn_in_internal(strVal(defel->arg), &have_error);
+			if (XLogRecPtrIsInvalid(opt->lsn))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("invalid value for LSN")));
+		}
 		else
 			elog(ERROR, "option \"%s\" not recognized",
  defel->defname);
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index c4e11cc..c24d319 100644
--- a/src/backend/replication/repl_gram.y
+++ b/src/backend/replication/repl_gram.y
@@ -87,6 +87,7 @@ static SQLCmd *make_sqlcmd(void);
 %token K_EXPORT_SNAPSHOT
 %token K_NOEXPORT_SNAPSHOT
 %token K_USE_SNAPSHOT
+%token K_LSN
 
%type <node>	command
%type <node>	base_backup start_replication start_logical_replication
@@ -214,6 +215,11 @@ base_backup_opt:
   $$ = makeDefElem("noverify_checksums",
    (Node *)makeInteger(true), -1);
 }
+			| K_LSN SCONST
+{
+  $$ = makeDefElem("lsn",
+   (Node *)makeString($2), -1);
+}
 			;
 
 create_replication_slot:
diff --git a/src/backend/replication/repl_scanner.l b/src/backend/replication/repl_scanner.l
index 380faeb..77b5af4 100644
--- a/src/backend/replication/repl_scanner.l
+++ b/src/backend/replication/repl_scanner.l
@@ -107,6 +107,7 @@ EXPORT_SNAPSHOT		{ return K_EXPORT_SNAPSHOT; }
 NOEXPORT_SNAPSHOT	{ return K_NOEXPORT_SNAPSHOT; }
 USE_SNAPSHOT		{ return K_USE_SNAPSHOT; }
 WAIT				{ return K_WAIT; }
+LSN					{ return K_LSN; }
 
 ","					{ return ','; }
 ";"					{ return ';'; }
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 7986872..1791853 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -105,6 +105,7 @@ static bool temp_replication_slot = true;
 static bool create_slot = false;
 static bool no_slot = false;
 static bool verify_checksums = true;
+static char *lsn = NULL;
 
 static bool success = false;
 static bool made_new_pgdata = false;
@@ -341,6 +342,7 @@ usage(void)
 			 " include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip compress tar output\n"));
 	printf(_("  -Z, --compress=0-9 compress tar output with given compression level\n"));
+	printf(_("  --lsn=LSN  incremental backup, using LSN as threshold\n"));
 	printf(_("\nGeneral options:\n"));
 	printf(_("  -c, --checkpoint=fast|spread\n"

Re: basebackup.c's sendFile() ignores read errors

2019-09-06 Thread Jeevan Chalke
On Thu, Sep 5, 2019 at 11:40 PM Robert Haas  wrote:

> On Fri, Aug 30, 2019 at 7:05 AM Jeevan Ladhe
>  wrote:
> >> Fixed both comments in the attached patch.
> >
> > Thanks, the patch looks good to me.
>
> Here's a version of the patch with a further change to the wording of
> the comment.  I hope this is clearer.
>

Yep.


>
> I think this needs to be back-patched all the way back to 9.4, and it
> doesn't seem to apply cleanly before v11.  Any chance you could
> prepare a version for the older branches?
>

Attached is a patch for v10 and earlier; the same v10 patch applies cleanly
to those branches.

The changes related to page checksum verification are not present on v10
and earlier, and thus are not applicable, so I removed them.  Also,
wal_segment_size is called XLogSegSize there; adjusted that.


>
> Thanks,
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index ba4937b..dfb0f47 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -85,6 +85,18 @@ static char *statrelpath = NULL;
  */
 #define THROTTLING_FREQUENCY	8
 
+/*
+ * Checks whether we encountered any error in fread().  fread() doesn't give
+ * any clue what has happened, so we check with ferror().  Also, neither
+ * fread() nor ferror() set errno, so we just throw a generic error.
+ */
+#define CHECK_FREAD_ERROR(fp, filename) \
+do { \
+	if (ferror(fp)) \
+		ereport(ERROR, \
+(errmsg("could not read from file \"%s\"", filename))); \
+} while (0)
+
 /* The actual number of bytes, transfer of which may cause sleep. */
 static uint64 throttling_sample;
 
@@ -509,6 +521,8 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	break;
 			}
 
+			CHECK_FREAD_ERROR(fp, pathbuf);
+
 			if (len != XLogSegSize)
 			{
 CheckXLogRemoved(segno, tli);
@@ -1245,6 +1259,8 @@ sendFile(char *readfilename, char *tarfilename, struct stat *statbuf,
 		}
 	}
 
+	CHECK_FREAD_ERROR(fp, readfilename);
+
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	if (len < statbuf->st_size)
 	{


Re: basebackup.c's sendFile() ignores read errors

2019-08-30 Thread Jeevan Chalke
On Thu, Aug 29, 2019 at 3:17 PM Jeevan Ladhe 
wrote:

> Hi Jeevan
>
> I had a look at the patch and this seems correct to me.
>

Thanks, Jeevan Ladhe.


>
> Few minor comments:
>
> + /* Check fread() error. */
> + CHECK_FREAD_ERROR(fp, pathbuf);
> +
>
> The comments above the macro call at both the places are not necessary as
> your macro name itself is self-explanatory.
>
> --
> + /*
> + * If file is truncated, then we will hit
> + * end-of-file error in which case we don't
> + * want to error out, instead just pad it with
> + * zeros.
> + */
> + if (feof(fp))
>
> The if block does not do the truncation right away, so I think the comment
> above can be reworded to explain why we reset cnt?
>

Fixed both comments in the attached patch.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c91f66d..3aea470 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -90,6 +90,18 @@ static char *statrelpath = NULL;
  */
 #define THROTTLING_FREQUENCY	8
 
+/*
+ * Checks whether we encountered any error in fread().  fread() doesn't give
+ * any clue what has happened, so we check with ferror().  Also, neither
+ * fread() nor ferror() set errno, so we just throw a generic error.
+ */
+#define CHECK_FREAD_ERROR(fp, filename) \
+do { \
+	if (ferror(fp)) \
+		ereport(ERROR, \
+(errmsg("could not read from file \"%s\"", filename))); \
+} while (0)
+
 /* The actual number of bytes, transfer of which may cause sleep. */
 static uint64 throttling_sample;
 
@@ -543,6 +555,8 @@ perform_base_backup(basebackup_options *opt)
 	break;
 			}
 
+			CHECK_FREAD_ERROR(fp, pathbuf);
+
 			if (len != wal_segment_size)
 			{
 CheckXLogRemoved(segno, tli);
@@ -1478,6 +1492,20 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 
 			if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
 			{
+/*
+ * If the file is truncated, then we will hit
+ * an end-of-file error in which case we don't
+ * want to error out, instead just pad the rest
+ * of the file with zeros. However, we want to
+ * send all the bytes read so far, and thus set
+ * the cnt accordingly.
+ */
+if (feof(fp))
+{
+	cnt = BLCKSZ * i;
+	break;
+}
+
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not reread block %d of file \"%s\": %m",
@@ -1529,7 +1557,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		len += cnt;
 		throttle(cnt);
 
-		if (len >= statbuf->st_size)
+		if (feof(fp) || len >= statbuf->st_size)
 		{
 			/*
 			 * Reached end of file. The file could be longer, if it was
@@ -1540,6 +1568,8 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		}
 	}
 
+	CHECK_FREAD_ERROR(fp, readfilename);
+
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	if (len < statbuf->st_size)
 	{


Re: basebackup.c's sendFile() ignores read errors

2019-08-28 Thread Jeevan Chalke
On Tue, Aug 27, 2019 at 10:33 PM Robert Haas  wrote:

> While reviewing a proposed patch to basebackup.c this morning, I found
> myself a bit underwhelmed by the quality of the code and comments in
> basebackup.c's sendFile(). I believe it's already been pointed out
> that the the retry logic here is not particularly correct, and the
> comments demonstrate a pretty clear lack of understanding of how the
> actual race conditions that are possible here might operate.
>
> However, I then noticed another problem which I think is significantly
> more serious: this code totally ignores the possibility of a failure
> in fread().  It just assumes that if fread() fails to return a
> positive value, it's reached the end of the file, and if that happens,
> it just pads out the rest of the file with zeroes.  So it looks to me
> like if fread() encountered, say, an I/O error while trying to read a
> data file, and if that error were on the first data block, then the
> entire contents of that file would be replaced with zero bytes in the
> backup, and no error or warning of any kind would be given to the
> user.  If it hit the error later, everything from the point of the
> error onward would just get replaced with zero bytes.  To be clear, I
> think it is fine for basebackup.c to fill out the rest of the expected
> length with zeroes if in fact the file has been truncated during the
> backup; recovery should fix it.  But recovery is not going to fix data
> that got eaten because EIO was encountered while copying a file.
>

Per the fread(3) man page, we cannot distinguish between an error and EOF.
So to check for an error we must use ferror() after fread(). Attached is a
patch that tests ferror() and throws an error if it returns true.  However, I
think neither fread() nor ferror() sets errno (per some web references), and
thus a generic error is thrown here. I have tested it manually.


> The logic that rereads a block appears to have the opposite problem.
> Here, it will detect an error, but it will also fail in the face of a
> concurrent truncation, which it shouldn't.
>

For this, I have checked feof(), assuming that when the file gets truncated
we reach EOF. If so, we get out of the loop instead of throwing an error.
I may be wrong, as I couldn't reproduce this issue.

Please have a look over the patch and let me know your views.
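The fread()/ferror()/feof() behavior discussed above can be demonstrated with a small standalone sketch (this is illustrative code, not PostgreSQL source; BLCKSZ is shrunk for the demo): a short fread() with ferror() false must mean EOF, and the truncated tail can then be padded with zeros, as the patch does.

```c
/* Standalone sketch of the discussion above: after a short fread(),
 * ferror() distinguishes a real I/O error from EOF, and an EOF-truncated
 * block is padded with zeros.  BLCKSZ is a demo value here. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 16               /* demo value; PostgreSQL uses 8192 */

/* Read one block; pad the remainder with zeros on EOF.  Returns the
 * number of real bytes read, or -1 on a genuine read error. */
static int
read_block_padded(FILE *fp, char *buf)
{
	size_t		n = fread(buf, 1, BLCKSZ, fp);

	if (n < BLCKSZ)
	{
		if (ferror(fp))
			return -1;					/* real error; errno is not set */
		assert(feof(fp));				/* short read without error == EOF */
		memset(buf + n, 0, BLCKSZ - n);	/* pad, as for a truncated file */
	}
	return (int) n;
}
```

With a 20-byte file, the first call returns a full 16-byte block and the second returns 4, with the remaining 12 bytes zeroed.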


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c91f66d..5e64acf 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -90,6 +90,18 @@ static char *statrelpath = NULL;
  */
 #define THROTTLING_FREQUENCY	8
 
+/*
+ * Checks whether we encountered any error in fread().  fread() doesn't give
+ * any clue what has happened, so we check with ferror().  Also, neither
+ * fread() nor ferror() set errno, so we just throw a generic error.
+ */
+#define CHECK_FREAD_ERROR(fp, filename) \
+do { \
+	if (ferror(fp)) \
+		ereport(ERROR, \
+(errmsg("could not read from file \"%s\"", filename))); \
+} while (0)
+
 /* The actual number of bytes, transfer of which may cause sleep. */
 static uint64 throttling_sample;
 
@@ -543,6 +555,9 @@ perform_base_backup(basebackup_options *opt)
 	break;
 			}
 
+			/* Check fread() error. */
+			CHECK_FREAD_ERROR(fp, pathbuf);
+
 			if (len != wal_segment_size)
 			{
 CheckXLogRemoved(segno, tli);
@@ -1478,6 +1493,18 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 
 			if (fread(buf + BLCKSZ * i, 1, BLCKSZ, fp) != BLCKSZ)
 			{
+/*
+ * If file is truncated, then we will hit
+ * end-of-file error in which case we don't
+ * want to error out, instead just pad it with
+ * zeros.
+ */
+if (feof(fp))
+{
+	cnt = BLCKSZ * i;
+	break;
+}
+
 ereport(ERROR,
 		(errcode_for_file_access(),
 		 errmsg("could not reread block %d of file \"%s\": %m",
@@ -1529,7 +1556,7 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		len += cnt;
 		throttle(cnt);
 
-		if (len >= statbuf->st_size)
+		if (feof(fp) || len >= statbuf->st_size)
 		{
 			/*
 			 * Reached end of file. The file could be longer, if it was
@@ -1540,6 +1567,9 @@ sendFile(const char *readfilename, const char *tarfilename, struct stat *statbuf
 		}
 	}
 
+	/* Check fread() error. */
+	CHECK_FREAD_ERROR(fp, readfilename);
+
 	/* If the file was truncated while we were sending it, pad it with zeros */
 	if (len < statbuf->st_size)
 	{


Re: block-level incremental backup

2019-08-16 Thread Jeevan Chalke
On Fri, Aug 2, 2019 at 6:43 PM vignesh C  wrote:

> Some comments:
> 1) There will be some link files created for tablespace, we might
> require some special handling for it
>

Yep. I have that in my ToDo.
Will start working on that soon.


> 2)
> Retry functionality is handled only for copying of full files; should
> we handle retry for copying of partial files?
> 3)
> we can have some range for maxretries similar to sleeptime
>

I took help from the pg_standby code related to maxretries and sleeptime.

However, as we don't want to use a system() call now, I have removed all this
kludge and just used fread()/fwrite() as discussed.


> 4)
> Should we check for malloc failure
>

Used pg_malloc() instead. Same is also suggested by Ibrar.


>
> 5) Should we add display of progress as backup may take some time,
> this can be added as enhancement. We can get other's opinion on this.
>

Can be done afterward once we have the functionality in place.


>
> 6)
> If the backup count increases providing the input may be difficult,
> Shall user provide all the incremental backups from a parent folder
> and can we handle the ordering of incremental backup internally
>

I am not sure of this yet. We need to provide the tablespace mapping too.
But thanks for putting a point here. Will keep that in mind when I revisit
this.


>
> 7)
> Add verbose for copying whole file
>
Done


>
> 8) We can also check if approximate space is available in disk before
> starting combine backup, this can be added as enhancement. We can get
> other's opinion on this.
>

Hmm... will leave it for now. The user will get an error anyway.


>
> 9)
> Combine backup into directory can be combine backup directory
>
Done


>
> 10)
> MAX_INCR_BK_COUNT can be increased little, some applications use 1
> full backup at the beginning of the month and use 30 incremental
> backups rest of the days in the month
>

Yeah, agreed. But using any particular number here is debatable.
Let's see others' opinions too.


> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
>


Attached is a new set of patches with the refactoring done separately.
The incremental backup patch is small now and hopefully more readable
than the first version.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 93041c12a7d07bf17073c9cf4571bd3b5a8acc81 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Fri, 16 Aug 2019 14:08:34 +0530
Subject: [PATCH 1/4] Add support for command line option to pass LSN.

This adds [ LSN 'lsn' ] to BASE_BACKUP command and --lsn=LSN to
the pg_basebackup binary.

Also, add small tests.
---
 src/backend/replication/basebackup.c | 20 
 src/backend/replication/repl_gram.y  |  6 ++
 src/backend/replication/repl_scanner.l   |  1 +
 src/bin/pg_basebackup/pg_basebackup.c| 15 +--
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 12 +++-
 5 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index c91f66d..74c954b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -38,6 +38,7 @@
 #include "storage/ipc.h"
 #include "storage/reinit.h"
 #include "utils/builtins.h"
+#include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/relcache.h"
 #include "utils/timestamp.h"
@@ -52,6 +53,7 @@ typedef struct
 	bool		includewal;
 	uint32		maxrate;
 	bool		sendtblspcmapfile;
+	XLogRecPtr 	lsn;
 } basebackup_options;
 
 
@@ -638,6 +640,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	bool		o_maxrate = false;
 	bool		o_tablespace_map = false;
 	bool		o_noverify_checksums = false;
+	bool		o_lsn = false;
 
 	MemSet(opt, 0, sizeof(*opt));
 	foreach(lopt, options)
@@ -726,6 +729,23 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 			noverify_checksums = true;
 			o_noverify_checksums = true;
 		}
+		else if (strcmp(defel->defname, "lsn") == 0)
+		{
+			bool		have_error = false;
+
+			if (o_lsn)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("duplicate option \"%s\"", defel->defname)));
+			o_lsn = true;
+
+			/* Validate given LSN and convert it into XLogRecPtr. */
+			opt->lsn = pg_lsn_in_internal(strVal(defel->arg), _error);
+			if (XLogRecPtrIsInvalid(opt->lsn))
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
+		 errmsg("invalid value for LSN")));
+		}
 		else
 			elog(ERROR, "option \"%s\" not recognized",
  defel->defname);
diff --git a/src/backend/replication/repl_gram.y b/src/backend/replication/repl_gram.y
index c4e11cc..c24d319 100

Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Mon, Aug 12, 2019 at 5:29 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe <
> jeevan.la...@enterprisedb.com> wrote:
>
>> Hi Robert,
>>
>> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:
>>
>>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>>>  wrote:
>>> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
>>> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
>>> %X/%X\n",
>>> > +(uint32) (previous_lsn >> 32), (uint32)
>>> previous_lsn);
>>> >
>>> > May be we should rename to something like:
>>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
>>> START LOCATION"
>>> > to make it more intuitive?
>>>
>>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
>>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
>>> LOCATION is definitely not clear.  This backup is an incremental
>>> backup, and it has a start WAL location, so you'd end up with START
>>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
>>> like they ought to both be the same thing, but they're not.  Perhaps
>>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
>>> INCREMENTAL BACKUP would be clearer.
>>>
>>
>> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
>>
>
> +1 for INCREMENTAL BACKUP REFERENCE WA.
>

Sorry for the typo:
+1 for the INCREMENTAL BACKUP REFERENCE WAL LOCATION.


>
>>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 11:56 PM Jeevan Ladhe 
wrote:

> Hi Robert,
>
> On Fri, Aug 9, 2019 at 6:40 PM Robert Haas  wrote:
>
>> On Thu, Aug 8, 2019 at 8:37 PM Jeevan Ladhe
>>  wrote:
>> > +   if (!XLogRecPtrIsInvalid(previous_lsn))
>> > +   appendStringInfo(labelfile, "PREVIOUS WAL LOCATION:
>> %X/%X\n",
>> > +(uint32) (previous_lsn >> 32), (uint32)
>> previous_lsn);
>> >
>> > May be we should rename to something like:
>> > "INCREMENTAL BACKUP START WAL LOCATION" or simply "INCREMENTAL BACKUP
>> START LOCATION"
>> > to make it more intuitive?
>>
>> So, I think that you are right that PREVIOUS WAL LOCATION might not be
>> entirely clear, but at least in my view, INCREMENTAL BACKUP START WAL
>> LOCATION is definitely not clear.  This backup is an incremental
>> backup, and it has a start WAL location, so you'd end up with START
>> WAL LOCATION and INCREMENTAL BACKUP START WAL LOCATION and those sound
>> like they ought to both be the same thing, but they're not.  Perhaps
>> something like REFERENCE WAL LOCATION or REFERENCE WAL LOCATION FOR
>> INCREMENTAL BACKUP would be clearer.
>>
>
> Agree, how about INCREMENTAL BACKUP REFERENCE WAL LOCATION ?
>

+1 for INCREMENTAL BACKUP REFERENCE WA.


>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-12 Thread Jeevan Chalke
On Fri, Aug 9, 2019 at 6:36 PM Robert Haas  wrote:

> On Wed, Aug 7, 2019 at 5:46 AM Jeevan Chalke
>  wrote:
> > So, do you mean we should just do fread() and fwrite() for the whole
> file?
> >
> > I thought it is better if it was done by the OS itself instead of
> reading 1GB
> > into the memory and writing the same to the file.
>
> Well, 'cp' is just a C program.  If they can write code to copy a
> file, so can we, and then we're not dependent on 'cp' being installed,
> working properly, being in the user's path or at the hard-coded
> pathname we expect, etc.  There's an existing copy_file() function in
> src/backed/storage/file/copydir.c which I'd probably look into
> adapting for frontend use.  I'm not sure whether it would be important
> to adapt the data-flushing code that's present in that routine or
> whether we could get by with just the loop to read() and write() data.
>

Agreed that we can certainly use open(), read(), write(), and close() here,
but given that pg_basebackup.c and basebackup.c are using stdio file
operations, I think using fopen(), fread(), fwrite(), and fclose() will be
better here, at least for consistency.

Let me know if we still want to go with native OS calls.
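A frontend copy loop along the lines Robert suggested (adapting copy_file() for frontend use) could be sketched as below. This is an illustrative stdio version under the consistency argument above, not the actual patch; note the ferror() check, which matters for the read-error issue discussed elsewhere in this thread.

```c
/* Minimal sketch (not the actual patch) of a frontend file copy using
 * stdio, as discussed above, instead of invoking system("cp"). */
#include <stdio.h>

#define COPY_BUFSZ 8192

/* Copy src to dst; returns 0 on success, -1 on any read/write error. */
static int
copy_file_stdio(FILE *src, FILE *dst)
{
	char		buf[COPY_BUFSZ];
	size_t		n;

	while ((n = fread(buf, 1, COPY_BUFSZ, src)) > 0)
	{
		if (fwrite(buf, 1, n, dst) != n)
			return -1;			/* write failed */
	}
	/* A short fread() alone doesn't distinguish EOF from an error. */
	return ferror(src) ? -1 : 0;
}
```

A real implementation would also need to handle fsync/flush behavior, which the sketch omits.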


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-07 Thread Jeevan Chalke
On Mon, Aug 5, 2019 at 7:13 PM Robert Haas  wrote:

> On Fri, Aug 2, 2019 at 9:13 AM vignesh C  wrote:
> > + rc = system(copycmd);
>
> I don't think this patch should be calling system() in the first place.
>

So, do you mean we should just do fread() and fwrite() for the whole file?

I thought it was better if it were done by the OS itself instead of reading
1GB into memory and writing the same to a file.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-08-01 Thread Jeevan Chalke
On Tue, Jul 30, 2019 at 9:39 AM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
>
> I am almost done writing the patch for pg_combinebackup and will post soon.
>

Attached is a patch that implements the pg_combinebackup utility, used to
combine a full base backup with one or more incremental backups.

I have tested it manually, and it works in all basic cases.

Let me know if you have any inputs, suggestions, or review comments.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 8d91f35..f3e90b6 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -200,6 +200,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 00782e0..92d9d13 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -415,7 +415,7 @@ PostgreSQL documentation
 which are modified after this given LSN will be backed up. The file
 which has these partial blocks has .partial as an extension. Backup
 taken in this manner has to be combined with the full backup with the
-pg_combinebackup utility.
+ utility.

   
  
diff --git a/doc/src/sgml/ref/pg_combinebackup.sgml b/doc/src/sgml/ref/pg_combinebackup.sgml
new file mode 100644
index 000..ed87931
--- /dev/null
+++ b/doc/src/sgml/ref/pg_combinebackup.sgml
@@ -0,0 +1,202 @@
+
+
+
+ 
+  pg_combinebackup
+ 
+
+ 
+  pg_combinebackup
+  1
+  Application
+ 
+
+ 
+  pg_combinebackup
+  create a synthetic backup from a full backup and one or more incremental backups
+ 
+
+ 
+  
+   pg_combinebackup
+   option
+  
+ 
+
+ 
+  Description
+  
+   pg_combinebackup combines one or more incremental
+   backups with the full base-backup to generate a synthetic backup.
+  
+ 
+
+ 
+  Options
+
+   
+The following command-line options are available:
+
+
+ 
+  -f directory
+  --full-backup=directory
+  
+   
+Specifies the directory where the full backup is stored.
+   
+  
+ 
+
+ 
+  -i directory
+  --incr-backup=directory
+  
+   
+Specifies the directory where the incremental backup is stored.
+   
+  
+ 
+
+ 
+  -o directory
+  --output-backup=directory
+  
+   
+Specifies the output directory where the combined full synthetic backup
+is to be stored.
+   
+  
+ 
+
+ 
+  -n
+  --no-clean
+  
+   
+By default, when pg_combinebackup aborts with an
+error, it removes the output data directories it might have created
+before discovering that it cannot finish the job. This option inhibits
+tidying-up and is thus useful for debugging.
+   
+  
+ 
+
+ 
+  -r max-retries
+  
+   
+Max number of retries on copy command, with progressive wait.
+Default is 3.
+   
+  
+ 
+
+ 
+  -s interval
+  
+   
+Sleep interval before retry (in seconds).
+Should be in between 1 and 60, default is 5.
+   
+  
+ 
+
+ 
+  -v
+  --verbose
+  
+   
+Enable verbose output. Lists all partial files processed and their
+checksum status.
+   
+  
+ 
+
+ 
+   -V
+   --version
+   
+   
+Print the pg_combinebackup version and exit.
+   
+   
+ 
+
+ 
+  -?
+  --help
+   
+
+ Show help about pg_combinebackup command line
+ arguments, and exit.
+
+   
+  
+
+   
+ 
+
+ 
+  Environment
+  
+   
+PG_COLOR
+
+ 
+  Specifies whether to use color in diagnostics messages.  Possible values
+  are always, auto,
+  never.
+ 
+
+   
+  
+ 
+
+ 
+  Notes
+  
+   Output directory, full backup directory, and at least one incremental backup
+   directory must be specified.
+  
+
+  
+   PREVIOUS WAL LOCATION of the incremental backup must
+   match with the START WAL LOCATION of the previous full
+   or incremental backup in a given sequence.
+  
+ 
+
+ 
+  Examples
+
+  
+   To combine a full backup with two incremental backups and store it in the
+   output directory:
+
+$ pg_combinebackup -f /data/full/data -i /data/incr/data1 -i /data/incr/data2 -o /data/full/fulldata
+
+  
+
+  
+   To combine a full backup with an incremental backup and store it in the
+   output directory along with various options like verbose, no-clean, and
+   maximum 3 retries:
+
+$ pg_combinebackup -v --no-clean -r 3 -f /data/full/data -i /data/incr/data1 -o /data/full/fulldata
+
+  
+ 
+
+ 
+  See Also
+
+  
+   
+  
+ 
+
+
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index ce

Re: block-level incremental backup

2019-07-29 Thread Jeevan Chalke
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas  wrote:

>
> I haven't had a chance to look at Jeevan's patch at all, or yours in
> any detail, as yet, so these are just some very preliminary comments.
> It will be good, however, if we can agree on who is going to do what
> part of this as we try to drive this forward together.  I'm sorry that
> I didn't communicate EDB's plans to work on this more clearly;
> duplicated effort serves nobody well.
>

I had a look over Anastasia's PoC patch to understand the approach she has
taken and here are my observations.

1.
The patch first creates a .blockmap file for each relation file, containing
an array of all modified block numbers. This is done by reading all blocks
from a file (in chunks of 4 blocks, 32kB in total, in a loop) and comparing
each page's LSN with the given LSN. Later, to create the .partial file, the
relation file is opened again and all blocks are read in chunks of 4 in a
loop. If a block is found modified, it is copied into separate memory, and
after scanning all 4 blocks, the copied blocks are sent to the .partial file.

In this approach, each file is opened and read twice, which looks more
expensive to me, whereas in my patch I do that just once. However, I read the
entire file into memory to check which blocks are modified, while in
Anastasia's design at most TAR_SEND_SIZE (32kB) is read at a time, in a loop.
I need to read the whole file because we want to know how heavily it was
modified, so that we can send the entire file if it was modified beyond the
threshold (currently 90%).

2.
Also, while sending modified blocks, they are copied into another buffer;
instead, they could be sent directly from the read file contents (in
BLCKSZ-sized blocks). Here, the .blockmap created earlier is not used. In my
implementation, we send just a .partial file with a header containing all the
required details, like the number of changed blocks along with the block
numbers and a CRC, followed by the blocks themselves.

3.
I tried compiling Anastasia's patch, but got an error, so I could not see or
test how it behaves. Also, like a normal backup, the incremental backup
option needs to verify checksums if requested.

4.
While combining full and incremental backups, files from the incremental
backup are just copied into the full backup directory. In the design I posted
earlier, we go the other way round to avoid overwriting and the other issues
I explained earlier.

I am almost done writing the patch for pg_combinebackup and will post soon.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: block-level incremental backup

2019-07-17 Thread Jeevan Chalke
On Wed, Jul 17, 2019 at 7:38 PM Ibrar Ahmed  wrote:

>
>
> On Wed, Jul 17, 2019 at 6:43 PM Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>> On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed 
>> wrote:
>>
>>>
>>> At what stage you will apply the WAL generated in between the START/STOP
>>> backup.
>>>
>>
>> In this design, we are not touching any WAL-related code. The WAL files
>> will get copied with each backup, either full or incremental. Thus, the
>> last incremental backup will have the final WAL files, which will be
>> copied as-is into the combined full backup, and they will get applied
>> automatically when that data directory is used to start the server.
>>
>
> Ok, so you keep all the WAL files since the first backup, right?
>

The WAL files will anyway be copied while taking a backup (full or
incremental), but only the last incremental backup's WAL files are copied to
the combined synthetic full backup.


>>
>>> --
>>> Ibrar Ahmed
>>>
>>
>> --
>> Jeevan Chalke
>> Technical Architect, Product Development
>> EnterpriseDB Corporation
>>
>>
>
> --
> Ibrar Ahmed
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


Re: block-level incremental backup

2019-07-17 Thread Jeevan Chalke
On Wed, Jul 17, 2019 at 2:15 PM Ibrar Ahmed  wrote:

>
> At what stage you will apply the WAL generated in between the START/STOP
> backup.
>

In this design, we are not touching any WAL-related code. The WAL files will
get copied with each backup, either full or incremental. Thus, the last
incremental backup will have the final WAL files, which will be copied as-is
into the combined full backup, and they will get applied automatically when
that data directory is used to start the server.


> --
> Ibrar Ahmed
>

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


Re: block-level incremental backup

2019-07-16 Thread Jeevan Chalke
On Thu, Jul 11, 2019 at 5:00 PM Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi Anastasia,
>
> On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 23.04.2019 14:08, Anastasia Lubennikova wrote:
>> > I'm volunteering to write a draft patch or, more likely, set of
>> > patches, which
>> > will allow us to discuss the subject in more detail.
>> > And to do that I wish we agree on the API and data format (at least
>> > broadly).
>> > Looking forward to hearing your thoughts.
>>
>> Though the previous discussion stalled,
>> I still hope that we could agree on basic points such as a map file
>> format and protocol extension,
>> which is necessary to start implementing the feature.
>>
>
> It's great that you too came up with a PoC patch. I didn't look at your
> changes in much detail, but we at EnterpriseDB are also working on this
> feature and have started implementing it.
>
> Attached is the series of patches I have so far (which needs further
> optimization and adjustment, though)
>
> Here is the overall design (as proposed by Robert) we are trying to
> implement:
>
> 1. Extend the BASE_BACKUP command that can be used with replication
> connections. Add a new [ LSN 'lsn' ] option.
>
> 2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
> the option added to the server in #1.
>
> Here are the implementation details when we have a valid LSN
>
> sendFile() in basebackup.c is the function which mostly does the thing for
> us. If the filename looks like a relation file, then we'll need to consider
> sending only a partial file. The way to do that is probably:
>
> A. Read the whole file into memory.
>
> B. Check the LSN of each block. Build a bitmap indicating which blocks
> have an LSN greater than or equal to the threshold LSN.
>
> C. If more than 90% of the bits in the bitmap are set, send the whole file
> just as if this were a full backup. This 90% is a constant now; we might
> make it a GUC later.
>
> D. Otherwise, send a file with .partial added to the name. The .partial
> file contains an indication of which blocks were changed at the beginning,
> followed by the data blocks. It also includes a checksum/CRC.
> Currently, a .partial file format looks like:
>  - start with a 4-byte magic number
>  - then store a 4-byte CRC covering the header
>  - then a 4-byte count of the number of blocks included in the file
>  - then the block numbers, each as a 4-byte quantity
>  - then the data blocks
>
>
> We are also working on combining these incremental back-ups with the full
> backup and for that, we are planning to add a new utility called
> pg_combinebackup. Will post the details on that later once we have on the
> same page for taking backup.
>

For combining a full backup with one or more incremental backups, we are
adding a new utility called pg_combinebackup in src/bin.

Here is the overall design as proposed by Robert.

pg_combinebackup starts from the LAST backup specified and works backward. It
must NOT start with the full backup and work forward. This is important both
for reasons of efficiency and of correctness. For example, if you start by
copying over the full backup and then later apply the incremental backups on
top of it then you'll copy data and later end up overwriting it or removing
it. Any files that are leftover at the end that aren't in the final
incremental backup even as .partial files need to be removed, or the result
is
wrong. We should aim for a system where every block in the output directory
is
written exactly once and nothing ever has to be created and then removed.

To make that work, we should start by examining the final incremental
backup.
We should proceed with one file at a time. For each file:

1. If the complete file is present in the incremental backup, then just
copy it
to the output directory - and move on to the next file.

2. Otherwise, we have a .partial file. Work backward through the backup chain
until we find a complete version of the file. That might happen when we get
back to the full backup at the start of the chain, but it might also happen
sooner, at which point we do not need to (and should not) look at earlier
backups for that file. During this phase, we should read only the HEADER of
each .partial file, building a map of which blocks we're ultimately going to
need to read from each backup. We can also compute, at this stage, the offset
within each file where each block is stored, again using the header
information.

3. Now, we can write the output file - reading each block in turn from the
correct backup and writing it to the output file, using the map we
constructed in the previous step. We should probably keep all of t
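The per-file reconstruction described above can be sketched roughly as
follows. This is a hypothetical Python sketch, not the actual
pg_combinebackup implementation; the names and data shapes are invented for
illustration only.

```python
def build_block_map(chain, nblocks):
    """Walk the backup chain newest-first and record, for each block,
    which backup supplies its newest copy (steps 1-2 above).

    chain: list of (backup_name, kind, changed_blocks) tuples, newest
    first; kind is "full" for a complete file, or "partial" for a
    .partial file whose header lists changed_blocks.
    """
    block_source = {}
    for name, kind, changed in chain:
        if kind == "full":
            # A complete copy of the file: every block not already
            # mapped to a newer backup comes from here.
            for blk in range(nblocks):
                block_source.setdefault(blk, name)
            break  # no need to look at older backups for this file
        for blk in changed:
            # Only the newest occurrence of each block wins.
            block_source.setdefault(blk, name)
    return block_source
```

Step 3 then iterates over blocks 0..nblocks-1, reading each block from its
entry in the map and writing it exactly once to the output file.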

Re: block-level incremental backup

2019-07-11 Thread Jeevan Chalke
Hi Anastasia,

On Wed, Jul 10, 2019 at 11:47 PM Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:

> 23.04.2019 14:08, Anastasia Lubennikova wrote:
> > I'm volunteering to write a draft patch or, more likely, set of
> > patches, which
> > will allow us to discuss the subject in more detail.
> > And to do that I wish we agree on the API and data format (at least
> > broadly).
> > Looking forward to hearing your thoughts.
>
> Though the previous discussion stalled,
> I still hope that we could agree on basic points such as a map file
> format and protocol extension,
> which is necessary to start implementing the feature.
>

It's great that you too have come up with a PoC patch. I didn't look at your
changes in much detail, but we at EnterpriseDB are also working on this
feature and have started implementing it.

Attached is the series of patches I have so far (they still need further
optimization and adjustment, though).

Here is the overall design (as proposed by Robert) we are trying to
implement:

1. Extend the BASE_BACKUP command that can be used with replication
connections. Add a new [ LSN 'lsn' ] option.

2. Extend pg_basebackup with a new --lsn=LSN option that causes it to send
the option added to the server in #1.

Here are the implementation details for when we have a valid LSN:

sendFile() in basebackup.c is the function that mostly does the work for
us. If the filename looks like a relation file, then we'll need to consider
sending only a partial file. The way to do that is probably:

A. Read the whole file into memory.

B. Check the LSN of each block. Build a bitmap indicating which blocks have
an LSN greater than or equal to the threshold LSN.

C. If more than 90% of the bits in the bitmap are set, send the whole file
just as if this were a full backup. This 90% is a constant now; we might
make it a GUC later.
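Steps B and C amount to a simple per-file decision. Here is a minimal sketch
(hypothetical Python with an invented helper name; the real server code
operates on raw 8kB blocks and their page LSNs):

```python
def plan_relation_file(block_lsns, threshold_lsn, full_send_ratio=0.9):
    """Decide how to send one relation file, per steps B-D above.

    block_lsns: the LSN of each block in the file (step A has already
    read the whole file into memory). Returns ("full", None) to send
    the whole file, or ("partial", changed_blocks) for a .partial file.
    """
    # Step B: bitmap of blocks changed at or after the threshold LSN.
    changed = [i for i, lsn in enumerate(block_lsns) if lsn >= threshold_lsn]
    # Step C: if more than 90% of blocks changed, send the whole file.
    if block_lsns and len(changed) > full_send_ratio * len(block_lsns):
        return ("full", None)
    # Step D: otherwise send only the changed blocks.
    return ("partial", changed)
```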

D. Otherwise, send a file with .partial added to the name. The .partial
file contains an indication of which blocks were changed at the beginning,
followed by the data blocks. It also includes a checksum/CRC.
Currently, a .partial file format looks like:
 - start with a 4-byte magic number
 - then store a 4-byte CRC covering the header
 - then a 4-byte count of the number of blocks included in the file
 - then the block numbers, each as a 4-byte quantity
 - then the data blocks


We are also working on combining these incremental backups with the full
backup, and for that we are planning to add a new utility called
pg_combinebackup. We will post the details on that later, once we are on the
same page about taking the backup.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation


IncrBackup.tar.gz
Description: application/gzip


Re: Statistical aggregate functions are not working with partitionwise aggregate

2019-05-03 Thread Jeevan Chalke
On Fri, May 3, 2019 at 2:56 PM Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Hi,
>
> On PG-head, Some of statistical aggregate function are not giving correct
> output when enable partitionwise aggregate while same is working on v11.
>

I had a quick look at this and observed that something is broken with
PARTIAL aggregation.

I can reproduce the same issue with a larger dataset that results in a
parallel scan.

CREATE TABLE tbl1(a int2,b float4) partition by range(a);
create table tbl1_p1 partition of tbl1 for values from (minvalue) to (0);
create table tbl1_p2 partition of tbl1 for values from (0) to (maxvalue);
insert into tbl1 select i%2, i from generate_series(1, 100) i;

# SELECT regr_count(b, a) FROM tbl1;
 regr_count
------------
          0
(1 row)

postgres:5432 [120536]=# explain SELECT regr_count(b, a) FROM tbl1;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=15418.08..15418.09 rows=1 width=8)
   ->  Gather  (cost=15417.87..15418.08 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=14417.87..14417.88 rows=1 width=8)
               ->  Parallel Append  (cost=0.00..11091.62 rows=443500 width=6)
                     ->  Parallel Seq Scan on tbl1_p2  (cost=0.00..8850.00 rows=442500 width=6)
                     ->  Parallel Seq Scan on tbl1_p1  (cost=0.00..24.12 rows=1412 width=6)
(7 rows)

postgres:5432 [120536]=# set max_parallel_workers_per_gather to 0;
SET
postgres:5432 [120536]=# SELECT regr_count(b, a) FROM tbl1;
 regr_count
------------
        100
(1 row)

After looking further, it seems that this was broken by the following
commit:
commit a9c35cf85ca1ff72f16f0f10d7ddee6e582b62b8
Author: Andres Freund 
Date:   Sat Jan 26 14:17:52 2019 -0800

Change function call information to be variable length.


This commit is too big to digest quickly, and thus I could not get to the
exact cause.

Thanks


> below are some of examples.
>
> CREATE TABLE tbl(a int2,b float4) partition by range(a);
> create table tbl_p1 partition of tbl for values from (minvalue) to (0);
> create table tbl_p2 partition of tbl for values from (0) to (maxvalue);
> insert into tbl values (-1,-1),(0,0),(1,1),(2,2);
>
> --when partitionwise aggregate is off
> postgres=# SELECT regr_count(b, a) FROM tbl;
>  regr_count
> 
>   4
> (1 row)
> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>  regr_avgx | regr_avgy
> ---+---
>0.5 |   0.5
> (1 row)
> postgres=# SELECT corr(b, a) FROM tbl;
>  corr
> --
> 1
> (1 row)
>
> --when partitionwise aggregate is on
> postgres=# SET enable_partitionwise_aggregate = true;
> SET
> postgres=# SELECT regr_count(b, a) FROM tbl;
>  regr_count
> 
>   0
> (1 row)
> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>  regr_avgx | regr_avgy
> ---+---
>|
> (1 row)
> postgres=# SELECT corr(b, a) FROM tbl;
>  corr
> --
>
> (1 row)
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: cache lookup failed for collation 0

2019-04-14 Thread Jeevan Chalke
On Fri, Apr 12, 2019 at 1:26 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-04-11 17:04, Jeevan Chalke wrote:
> > The error is coming from get_collation_isdeterministic() when colloid
> > passed is 0. I think like we do in get_collation_name(), we should
> > return false here when such collation oid does not exist.
>
> I'm not in favor of doing that.  It would risk papering over errors of
> omission at other call sites.
>
> The root cause is that the same code match_pattern_prefix() is being
> used for text and bytea, but bytea does not use collations, so having
> the collation 0 is expected, and we shouldn't call
> get_collation_isdeterministic() in that case.
>
> Proposed patch attached.
>

Looks fine to me.


>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: cache lookup failed for collation 0

2019-04-12 Thread Jeevan Chalke
On Thu, Apr 11, 2019 at 10:50 PM Tom Lane  wrote:

> Jeevan Chalke  writes:
> > Do you mean, the code in get_collation_isdeterministic() should look like
> > something like below?
>
> > If colloid = InvalidOid then
> >   return TRUE
> > ELSE IF tuple is valid then
> >   return collisdeterministic from the tuple
> > ELSE
> >  return FALSE
>
> I think it's appropriate to fail if we don't find a tuple, for any
> collation oid other than zero.  Again, if you trace through the
> behavior of the longstanding collation check functions like
> lc_ctype_is_c(), you'll see that that's what happens (except for
> some hardwired OIDs that they have fast paths for).
>

OK.

Attached is a patch which treats "collation 0" as deterministic in
get_collation_isdeterministic() and returns true, keeping the rest of the
code as is.


> regards, tom lane
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..79e6484 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -945,6 +945,10 @@ get_collation_isdeterministic(Oid colloid)
 	Form_pg_collation colltup;
 	bool		result;
 
+	/* Treat "collation 0" as deterministic. */
+	if (!OidIsValid(colloid))
+		return true;
+
 	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for collation %u", colloid);
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0dee7d7..a8e1f6e 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -676,13 +676,19 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
  "C"
 (1 row)
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+ a | b 
+---+---
+(0 rows)
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we
 -- must get rid of them.
 --
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+NOTICE:  drop cascades to 18 other objects
 DETAIL:  drop cascades to table collate_test1
 drop cascades to table collate_test_like
 drop cascades to table collate_test2
@@ -700,3 +706,4 @@ drop cascades to table collate_test21
 drop cascades to table collate_test22
 drop cascades to collation mycoll2
 drop cascades to table collate_test23
+drop cascades to table byteatable
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index 89de26a..124daf6 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -259,6 +259,9 @@ SELECT collation for ((SELECT a FROM collate_test1 LIMIT 1)); -- non-collatable
 SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we


Re: cache lookup failed for collation 0

2019-04-11 Thread Jeevan Chalke
On Thu, Apr 11, 2019 at 9:07 PM Tom Lane  wrote:

> Jeevan Chalke  writes:
> > Following test-sequence causing an error "cache lookup failed for
> collation 0";
> > postgres:5432 [42106]=# create table foobar(a bytea primary key, b int);
> > CREATE TABLE
> > postgres:5432 [42106]=# insert into foobar
> > values('\x4c835521685c46ee827ab83d376cf028', 1);
> > INSERT 0 1
> > postgres:5432 [42106]=# select * from foobar where a like '%1%';
> > ERROR:  cache lookup failed for collation 0
>
> Good catch!
>
> > The error is coming from get_collation_isdeterministic() when colloid
> > passed is 0. I think like we do in get_collation_name(), we should return
> > false here when such collation oid does not exist.
>
> Considering that e.g. lc_ctype_is_c() doesn't fail for InvalidOid, I agree
> that it's probably a bad idea for get_collation_isdeterministic to fail.
> There's a lot of code that thinks it can check for InvalidOid only in slow
> paths.  However, I'd kind of expect the default result to be "true" not
> "false".  Doing what you suggest would make match_pattern_prefix fail
> entirely, unless we also put a special case there.
>

Do you mean the code in get_collation_isdeterministic() should look
something like the below?

If colloid = InvalidOid then
  return TRUE
ELSE IF tuple is valid then
  return collisdeterministic from the tuple
ELSE
 return FALSE

I think that for a non-zero colloid which is not valid we should return
false, but I may be missing your point here.


>
> regards, tom lane
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


cache lookup failed for collation 0

2019-04-11 Thread Jeevan Chalke
Hello hackers,

The following test sequence causes the error "cache lookup failed for
collation 0":

postgres:5432 [42106]=# create table foobar(a bytea primary key, b int);
CREATE TABLE
postgres:5432 [42106]=# insert into foobar
values('\x4c835521685c46ee827ab83d376cf028', 1);
INSERT 0 1
postgres:5432 [42106]=# \d+ foobar
                                           Table "public.foobar"
 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
 a      | bytea   |           | not null |         | extended |              |
 b      | integer |           |          |         | plain    |              |
Indexes:
    "foobar_pkey" PRIMARY KEY, btree (a)
Access method: heap

postgres:5432 [42106]=# select * from foobar where a like '%1%';
ERROR:  cache lookup failed for collation 0

---

After debugging it, I observed that the code in question was added by
commit 5e1963fb764e9cc092e0f7b58b28985c311431d9, which added support for
collations with nondeterministic comparison.

The error is coming from get_collation_isdeterministic() when the colloid
passed is 0. I think that, as we do in get_collation_name(), we should
return false here when no such collation oid exists.

The attached patch makes that change and rearranges the code to look similar
to get_collation_name(). It also adds a small testcase.

---

However, I have not fully understood the code changes made by the said
commit, and thus the current behavior, i.e. the cache lookup error, might be
the expected one. If that's the case, I kindly request an explanation of why
it is expected.

Thanks

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index b4f2d0f..69e5d88 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -942,16 +942,19 @@ bool
 get_collation_isdeterministic(Oid colloid)
 {
 	HeapTuple	tp;
-	Form_pg_collation colltup;
-	bool		result;
 
 	tp = SearchSysCache1(COLLOID, ObjectIdGetDatum(colloid));
-	if (!HeapTupleIsValid(tp))
-		elog(ERROR, "cache lookup failed for collation %u", colloid);
-	colltup = (Form_pg_collation) GETSTRUCT(tp);
-	result = colltup->collisdeterministic;
-	ReleaseSysCache(tp);
-	return result;
+	if (HeapTupleIsValid(tp))
+	{
+		Form_pg_collation colltup = (Form_pg_collation) GETSTRUCT(tp);
+		bool		result;
+
+		result = colltup->collisdeterministic;
+		ReleaseSysCache(tp);
+		return result;
+	}
+	else
+		return false;
 }
 
 /*-- CONSTRAINT CACHE --	 */
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index 0dee7d7..a8e1f6e 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -676,13 +676,19 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
  "C"
 (1 row)
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+ a | b 
+---+---
+(0 rows)
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we
 -- must get rid of them.
 --
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+NOTICE:  drop cascades to 18 other objects
 DETAIL:  drop cascades to table collate_test1
 drop cascades to table collate_test_like
 drop cascades to table collate_test2
@@ -700,3 +706,4 @@ drop cascades to table collate_test21
 drop cascades to table collate_test22
 drop cascades to collation mycoll2
 drop cascades to table collate_test23
+drop cascades to table byteatable
diff --git a/src/test/regress/sql/collate.sql b/src/test/regress/sql/collate.sql
index 89de26a..124daf6 100644
--- a/src/test/regress/sql/collate.sql
+++ b/src/test/regress/sql/collate.sql
@@ -259,6 +259,9 @@ SELECT collation for ((SELECT a FROM collate_test1 LIMIT 1)); -- non-collatable
 SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 
 
+CREATE TABLE byteatable(a bytea primary key, b int);
+SELECT * FROM byteatable WHERE a LIKE '%1%';
+
 --
 -- Clean up.  Many of these table names will be re-used if the user is
 -- trying to run any platform-specific collation tests later, so we


Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-27 Thread Jeevan Chalke
On Fri, Jun 22, 2018 at 6:58 PM, Robert Haas  wrote:

> On Fri, Jun 22, 2018 at 2:26 AM, Rajkumar Raghuwanshi
>  wrote:
> > I have applied patch and checked reported issue. Patch applied cleanly
> and
> > issues not reproducible any more.
>
> Committed, with a few changes:
>

Thanks Robert.


>
> - Renamed found_partially_grouped_child to partial_grouping_valid.
> The old name seemed to me to be a poor choice, because it sounds from
> the name like it gets set to true whenever we've found at least one
> partially grouped child, whereas really it gets set to false whenever
> we've failed to find at least one partially grouped child.  The new
> name is also more like the names in add_paths_to_append_rel.
>
> - Modified the wording of the comment.
>
> - Omitted the new assertion in add_paths_to_append_rel.  I doubt
> whether that's correct.  I don't see any obvious reason why
> live_childrels can't be NIL there, and further down I see this:
>
> /*
>  * If we found unparameterized paths for all children, build
> an unordered,
>  * unparameterized Append path for the rel.  (Note: this is
> correct even
>  * if we have zero or one live subpath due to constraint
> exclusion.)
>  */
>
> If it's not possible to have no live_childrels, then that comment is
> highly suspect.
>
>
OK, do these comments also hold true for partial_subpaths?

If we have NIL live_childrels, then the Assert() present inside the "if
(partial_subpaths_valid)" block will be hit, which I think is wrong.

We either need to remove these Asserts altogether, or we should enter this
block only when we have non-NIL partial_subpaths. Also, if we don't have any
partial_subpaths, then the variable partial_subpaths_valid should be false.
Currently it defaults to true, due to which we end up inside that if block
and then hit an Assert.


> Also, even if this assertion *is* correct, I think it needs a better
> comment explaining why it's correct, because there isn't anything
> obvious in set_append_rel_pathlist that keeps IS_DUMMY_REL() from
> being true for every child.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-21 Thread Jeevan Chalke
Hi,

Off-list, Ashutosh Bapat suggested using a flag instead of counting the
number of dummy rels and then manipulating it. That will be simpler and
smoother.

I agree with his suggestion and have updated my patch accordingly.

Thanks

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 41c885a31eb61538dc762c74f4d9782da1db9bc8 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Fri, 22 Jun 2018 10:59:24 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

It is very well possible that we have no partially_grouped_rel for some
(or all) of the children, for which partial aggregation is not feasible.
In that case, we should not try to create any append path by calling
add_paths_to_append_rel().  Calling it with no children to append will
result in a server crash, and appending only a few children will result
in data loss.  Thus, don't try to create an append path in the first
place.

In passing, add an Assert() in add_paths_to_append_rel() to check
that it receives a valid live children list.

Jeevan Chalke, reviewed by Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |   3 +
 src/backend/optimizer/plan/planner.c  |  24 +++--
 src/test/regress/expected/partition_aggregate.out | 105 ++
 src/test/regress/sql/partition_aggregate.sql  |  24 +
 4 files changed, 148 insertions(+), 8 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3ada379..9f3d725 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end-up here only when we have at least one child. */
+	Assert(live_childrels != NIL);
+
 	/* If appropriate, consider parallel append */
 	pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..46128b4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7012,12 +7012,14 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
+	bool		found_partially_grouped_child;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
+	found_partially_grouped_child = true;
 	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
 	{
 		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
@@ -7091,6 +7093,8 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 lappend(partially_grouped_live_children,
 		child_partially_grouped_rel);
 		}
+		else if (found_partially_grouped_child)
+			found_partially_grouped_child = false;
 
 		if (patype == PARTITIONWISE_AGGREGATE_FULL)
 		{
@@ -7103,20 +7107,20 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	}
 
 	/*
-	 * All children can't be dummy at this point. If they are, then the parent
-	 * too marked as dummy.
-	 */
-	Assert(grouped_live_children != NIL ||
-		   partially_grouped_live_children != NIL);
-
-	/*
 	 * Try to create append paths for partially grouped children. For full
 	 * partitionwise aggregation, we might have paths in the partial_pathlist
 	 * if parallel aggregation is possible.  For partial partitionwise
 	 * aggregation, we may have paths in both pathlist and partial_pathlist.
+	 *
+	 * However, it is very well possible that we could not have a
+	 * partially_grouped_rel for some of (or all) the children for which
+	 * aggregation in partial is not feasible.  In that case, we cannot create
+	 * any append path.
 	 */
-	if (partially_grouped_rel)
+	if (partially_grouped_rel && found_partially_grouped_child)
 	{
+		Assert(partially_grouped_live_children != NIL);
+
 		add_paths_to_append_rel(root, partially_grouped_rel,
 partially_grouped_live_children);
 
@@ -7130,7 +7134,11 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 
 	/* If possible, create append paths for fully grouped children. */
 	if (patype == PARTITIONWISE_AGGREGATE_FULL)
+	{
+		Assert(grouped_live_children != NIL);
+
 		add_paths_to_append_rel(root, grouped_rel, grouped_live_children);
+	}
 }
 
 /*
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index 76a8209..d286050 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -1394,3 +1394,108 @@ SELECT y, sum(x), avg(x), count(*) FROM pagg_tab_para GROUP BY y 

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-21 Thread Jeevan Chalke
On Wed, Jun 20, 2018 at 7:11 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Jun 19, 2018 at 2:13 PM, Jeevan Chalke
>  wrote:
> >
> >
> > In the reported testcase, parallel_workers is set to 0 for all partition
> > (child) relations. Which means partial parallel paths are not possible
> for
> > child rels. However, the parent can have partial parallel paths. Thus,
> when
> > we have a full partitionwise aggregate possible (as the group by clause
> > matches with the partition key), we end-up in a situation where we do
> create
> > a partially_grouped_rel for the parent but there won't be any
> > partially_grouped_live_children.
> >
> > The current code was calling add_paths_to_append_rel() without making
> sure
> > of any live children present or not (sorry, it was my fault). This
> causes an
> > Assertion failure in add_paths_to_append_rel() as this function assumes
> that
> > it will have non-NIL live_childrels list.
> >
>
> Thanks for the detailed explanation.
>
> > I have fixed partitionwise aggregate code which is calling
> > add_paths_to_append_rel() by checking the live children list correctly.
> And
> > for robustness, I have also added an Assert() in
> add_paths_to_append_rel().
> >
> > I have verified the callers of add_paths_to_append_rel() and except one,
> all
> > are calling it by making sure that they have a non-NIL live children
> list.
>
> I actually thought about moving if(live_childrel != NIL) test inside
> this function, but then every caller is doing something different when
> that condition occurs. So doesn't help much.
>
> > The one which is calling add_paths_to_append_rel() directly is
> > set_append_rel_pathlist(). And I think, at that place, we will never
> have an
> > empty live children list,
>
> Yes, I think so too. That's because set_append_rel_size() should have
> marked such a parent as dummy and subsequent set_rel_pathlist() would
> not create any paths for it.
>
> Here are some review comments.
>
> +/* We should end-up here only when we have few live child rels. */
>
> I think the right wording is " ... we have at least one child." I was
> actually
> thinking if we should just return from here when live_children == NIL. But
> then
> we will loose an opportunity to catch a bug earlier than set_cheapest().
>

Done.


>
> + * However, if there are no live children, then we cannot create any
> append
> + * path.
>
> I think "no live children" is kind of misleading here. We usually use that
> term
> to mean at least one non-dummy child. But in this case there is no child at
> all, so we might want to better describe this situation. May be also
> explain
> when this condition can happen.
>

Done. Tried re-phrasing it. Please check.


> +if (patype == PARTITIONWISE_AGGREGATE_FULL && grouped_live_children
> != NIL)
>
> I think for this to happen, the parent grouped relation and the underlying
> scan
> itself should be dummy. So, would an Assert be better? I think we have
> discussed this earlier, but I can not spot the mail.
>

Yep, Assert() will be better. Done.


>
> +-- Test when partition tables has parallel_workers = 0 but not the parent
>
> Better be worded as "Test when parent can produce parallel paths but not
> any of
> its children". parallel_worker = 0 is just a means to test that. Although
> the
> EXPLAIN output below doesn't really reflect that parent can have parallel
> plans. Is it possible to create a scenario to show that.
>

Added test.


>
> I see that you have posted some newer versions of this patch, but I
> think those still need to address these comments.
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>


Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From dd9951b3fd92a61d1b9fb02019a3af5b4d8e3b93 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Wed, 20 Jun 2018 18:47:12 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() on empty live children which causing
an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is very well possible that we could not have a
partially_grouped_rel for some of the children for which aggregation
in partial is not feasible. In such cases, we will end up having
fe

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-20 Thread Jeevan Chalke
On Tue, Jun 19, 2018 at 7:14 PM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
> In the reported testcase, parallel_workers is set to 0 for all partition
>> (child) relations. Which means partial parallel paths are not possible for
>> child rels. However, the parent can have partial parallel paths. Thus, when
>> we have a full partitionwise aggregate possible (as the group by clause
>> matches with the partition key), we end-up in a situation where we do
>> create a partially_grouped_rel for the parent but there won't be any
>> partially_grouped_live_children.
>>
>> The current code was calling add_paths_to_append_rel() without making
>> sure of any live children present or not (sorry, it was my fault). This
>> causes an Assertion failure in add_paths_to_append_rel() as this function
>> assumes that it will have non-NIL live_childrels list.
>>
>> I have fixed partitionwise aggregate code which is calling
>> add_paths_to_append_rel() by checking the live children list correctly. And
>> for robustness, I have also added an Assert() in add_paths_to_append_rel().
>>
>> I have verified the callers of add_paths_to_append_rel() and except one,
>> all are calling it by making sure that they have a non-NIL live children
>> list. The one which is calling add_paths_to_append_rel() directly is
>> set_append_rel_pathlist(). And I think, at that place, we will never have
>> an empty live children list, I may be wrong though. And if that's the case
>> then newly added Assert() in add_paths_to_append_rel() will fail anyway to
>> catch any programming error in that code path.
>>
>> Attached patch fixing the crash and also added a simple test-coverage for
>> that.
>>
>> Let me know if I missed any.
>>
>
> Rajkumar offlist reported another issue related to data-loss. If few of
> the partitions has parallel_workers = 0, not all, then PWA plan ended up
> with a plan having children which has parallel_workers != 0. So the
> partitions with parallel_workers = 0; were not scanned.
>
> Fixed this in attached version of the patch.
>

There were a few commits in this area due to which the patch no longer
applies cleanly.

Attached is a rebased patch.

Thanks


>
>>
>> --
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 30aa2d1c24ed5222784935b68dd15ecbc5dc39be Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Wed, 20 Jun 2018 18:47:12 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() with an empty live-children list, which
caused an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is quite possible that some children have no
partially_grouped_rel because partial aggregation is not feasible
for them. In such cases, we will end up with fewer children in the
partially_grouped_live_children list. Don't create the append rel
in that case.

Jeevan Chalke
---
 src/backend/optimizer/path/allpaths.c |  3 +
 src/backend/optimizer/plan/planner.c  | 25 +++-
 src/test/regress/expected/partition_aggregate.out | 73 +++
 src/test/regress/sql/partition_aggregate.sql  | 20 +++
 4 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3ada379..6f5ec5a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end up here only when we have at least one live child rel. */
+	Assert(live_childrels != NIL);
+
 	/* If appropriate, consider parallel append */
 	pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..ab27ad0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7011,12 +7011,14 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
+	int			dummy_children_cnt;
 	PathTarget *target = grouped_rel->reltarget;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
+	dummy_children_cnt = 0;
 	/* Add paths for partitionwise aggregation/grou

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-19 Thread Jeevan Chalke
On Tue, Jun 19, 2018 at 2:13 PM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

>
>
> On Mon, Jun 18, 2018 at 9:27 PM, Andres Freund  wrote:
>
>> On 2018-06-18 17:10:12 +0530, Jeevan Chalke wrote:
>> > On Mon, Jun 18, 2018 at 5:02 PM, Rajkumar Raghuwanshi <
>> > rajkumar.raghuwan...@enterprisedb.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > Below test case crashed, when set enable_partitionwise_aggregate to
>> true.
>> > >
>> >
>> > I will have a look over this.
>>
>
> In the reported testcase, parallel_workers is set to 0 for all partition
> (child) relations, which means partial parallel paths are not possible
> for the child rels. However, the parent can have partial parallel paths.
> Thus, when a full partitionwise aggregate is possible (as the GROUP BY
> clause matches the partition key), we end up in a situation where we do
> create a partially_grouped_rel for the parent but there won't be any
> partially_grouped_live_children.
>
> The current code was calling add_paths_to_append_rel() without making
> sure whether any live children are present (sorry, it was my fault). This
> causes an Assertion failure in add_paths_to_append_rel(), as this
> function assumes that it will receive a non-NIL live_childrels list.
>
> I have fixed the partitionwise aggregate code that calls
> add_paths_to_append_rel() so that it checks the live children list
> correctly. And for robustness, I have also added an Assert() in
> add_paths_to_append_rel().
>
> I have verified the callers of add_paths_to_append_rel() and, except one,
> all call it after making sure that they have a non-NIL live children
> list. The one which calls add_paths_to_append_rel() directly is
> set_append_rel_pathlist(). I think we will never have an empty live
> children list at that place, though I may be wrong. And if that's the
> case, the newly added Assert() in add_paths_to_append_rel() will fail
> anyway, catching any programming error in that code path.
>
> Attached is a patch fixing the crash; it also adds simple test coverage
> for that.
>
> Let me know if I missed anything.
>

Rajkumar reported offlist another issue, related to data loss. If some of
the partitions (not all) have parallel_workers = 0, the PWA plan ended up
containing only the children with parallel_workers != 0, so the partitions
with parallel_workers = 0 were not scanned.

Fixed this in the attached version of the patch.


>
> Thanks
>
>
>> >
>> > Thanks for reporting.
>>
>> I've added an v11 open-items entry.
>>
>> - Andres
>>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 9eea7f4a83b3c295847774da9c1164fc8d3a5587 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke 
Date: Tue, 19 Jun 2018 10:44:33 +0530
Subject: [PATCH] Make sure that we have live children before we append them.

In create_partitionwise_grouping_paths(), we were calling
add_paths_to_append_rel() with an empty live-children list, which
caused an Assertion failure inside it. Thus, prevent calling
add_paths_to_append_rel() when there are no live children.

In passing, add an Assert() in add_paths_to_append_rel() to
check that it receives a valid live children list.

Also, it is quite possible that some children have no
partially_grouped_rel because partial aggregation is not feasible
for them. In such cases, we will end up with fewer children in the
partially_grouped_live_children list. Don't create the append rel
in that case.

Jeevan Chalke
---
 src/backend/optimizer/path/allpaths.c |  3 +
 src/backend/optimizer/plan/planner.c  | 25 +++-
 src/test/regress/expected/partition_aggregate.out | 73 +++
 src/test/regress/sql/partition_aggregate.sql  | 20 +++
 4 files changed, 118 insertions(+), 3 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1462988..b9dffef 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1391,6 +1391,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 	bool		build_partitioned_rels = false;
 	double		partial_rows = -1;
 
+	/* We should end up here only when we have at least one live child rel. */
+	Assert(live_childrels != NIL);
+
 	/*
 	 * AppendPath generated for partitioned tables must record the RT indexes
 	 * of partitioned tables that are direct or indirect children of this
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 67a2c7a..ab27ad0 100644
--

Re: Server crashed with TRAP: FailedAssertion("!(parallel_workers > 0)" when partitionwise_aggregate true.

2018-06-18 Thread Jeevan Chalke
paths.c:1630
> #4  0x007d37e1 in create_partitionwise_grouping_paths
> (root=0x1d6ff08, input_rel=0x1da5380, grouped_rel=0x1d43520,
> partially_grouped_rel=0x1d45d80,
> agg_costs=0x7ffceb18dd20, gd=0x0, patype=PARTITIONWISE_AGGREGATE_FULL,
> extra=0x7ffceb18dbe0) at planner.c:7120
> #5  0x007ce58d in create_ordinary_grouping_paths (root=0x1d6ff08,
> input_rel=0x1da5380, grouped_rel=0x1d43520, agg_costs=0x7ffceb18dd20,
> gd=0x0, extra=0x7ffceb18dbe0,
> partially_grouped_rel_p=0x7ffceb18dc70) at planner.c:4011
> #6  0x007ce14b in create_grouping_paths (root=0x1d6ff08,
> input_rel=0x1da5380, target=0x1d446d0, target_parallel_safe=true,
> agg_costs=0x7ffceb18dd20, gd=0x0)
> at planner.c:3783
> #7  0x007cb344 in grouping_planner (root=0x1d6ff08,
> inheritance_update=false, tuple_fraction=0) at planner.c:2037
> #8  0x007c94e6 in subquery_planner (glob=0x1d6fe70,
> parse=0x1d2a658, parent_root=0x0, hasRecursion=false, tuple_fraction=0) at
> planner.c:966
> #9  0x007c80a3 in standard_planner (parse=0x1d2a658,
> cursorOptions=256, boundParams=0x0) at planner.c:405
> #10 0x007c7dcb in planner (parse=0x1d2a658, cursorOptions=256,
> boundParams=0x0) at planner.c:263
> #11 0x008c4576 in pg_plan_query (querytree=0x1d2a658,
> cursorOptions=256, boundParams=0x0) at postgres.c:809
> #12 0x0064a1d0 in ExplainOneQuery (query=0x1d2a658,
> cursorOptions=256, into=0x0, es=0x1d24460,
> queryString=0x1c68798 "EXPLAIN (COSTS OFF)\nSELECT
> AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1)
> GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;",
> params=0x0, queryEnv=0x0) at explain.c:365
> #13 0x00649ed2 in ExplainQuery (pstate=0x1c8be28, stmt=0x1d34b08,
> queryString=0x1c68798 "EXPLAIN (COSTS OFF)\nSELECT
> AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1)
> GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;",
> params=0x0, queryEnv=0x0, dest=0x1c8bd90) at explain.c:254
> #14 0x008ccd99 in standard_ProcessUtility (pstmt=0x1d34bd8,
> queryString=0x1c68798 "EXPLAIN (COSTS OFF)\nSELECT
> AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1)
> GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;",
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
> dest=0x1c8bd90, completionTag=0x7ffceb18e450 "") at utility.c:672
> #15 0x008cc520 in ProcessUtility (pstmt=0x1d34bd8,
> queryString=0x1c68798 "EXPLAIN (COSTS OFF)\nSELECT
> AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1)
> GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;",
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
> dest=0x1c8bd90, completionTag=0x7ffceb18e450 "") at utility.c:360
> #16 0x008cb4ce in PortalRunUtility (portal=0x1ccdc28,
> pstmt=0x1d34bd8, isTopLevel=true, setHoldSnapshot=true, dest=0x1c8bd90,
> completionTag=0x7ffceb18e450 "")
> at pquery.c:1178
> #17 0x008cb1c5 in FillPortalStore (portal=0x1ccdc28,
> isTopLevel=true) at pquery.c:1038
> #18 0x008caaf6 in PortalRun (portal=0x1ccdc28,
> count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x1d6d9e8,
> altdest=0x1d6d9e8,
> completionTag=0x7ffceb18e650 "") at pquery.c:768
> #19 0x008c4aef in exec_simple_query (
> query_string=0x1c68798 "EXPLAIN (COSTS OFF)\nSELECT
> AVG(t2.c1),SUM(t1.c1) FROM part t1 INNER JOIN part t2 ON (t1.c1 = t2.c1)
> GROUP BY t1.c1, t2.c1 HAVING SUM(t1.c1) % 125 = 0 ORDER BY 1,2;") at
> postgres.c:1122
> #20 0x008c8dbf in PostgresMain (argc=1, argv=0x1c922a0,
> dbname=0x1c92100 "postgres", username=0x1c65298 "edb") at postgres.c:4153
> #21 0x00826703 in BackendRun (port=0x1c8a060) at postmaster.c:4361
> #22 0x00825e71 in BackendStartup (port=0x1c8a060) at
> postmaster.c:4033
> #23 0x00822253 in ServerLoop () at postmaster.c:1706
> #24 0x00821b85 in PostmasterMain (argc=3, argv=0x1c631f0) at
> postmaster.c:1379
> #25 0x00748d64 in main (argc=3, argv=0x1c631f0) at main.c:228
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
> QMG, EnterpriseDB Corporation
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-11 Thread Jeevan Chalke
On Tue, Apr 10, 2018 at 11:14 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 2:59 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > I actually wanted to have rel->consider_parallel in the condition (yes,
> > for additional safety) as we are adding a partial path into rel. But
> > then I observed that it is the same as final_rel->consider_parallel and
> > thus used it along with the other condition.
> >
> > I have observed in many places that we check the consider_parallel flag
> > before adding a partial path. Thus, for consistency, I added it here
> > too; but yes, it just adds additional safety here.
>
> Thanks to Andreas for reporting this issue and to Jeevan for fixing
> it.  My commit 0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52 seems clearly
> to blame.
>
> The change to set_subquery_pathlist() looks correct, but can we add a
> simple test case?


I have tried adding a simple test case in this version of the patch. This
test hits the Assertion added in add_partial_path(), like the one you tried.


>   Right now if I keep the new Assert() in
> add_partial_path() and leave out the rest of the changes, the
> regression tests still pass.  That's not so good.  Also, I think I
> would be inclined to wrap the if-statement around the rest of the
> function instead of returning early.
>

OK.
Wrapped the rest of the function in an if block instead of returning early.


>
> The new Assert() in add_partial_path() is an excellent idea.  I had
> the same thought before, but I didn't do anything about it.  That was
> a bad idea; your plan is better. In fact, I suggest an additional
> Assert() that any relation to which we're adding a partial path is
> marked consider_parallel, like this:
>
> +/* Path to be added must be parallel safe. */
> +Assert(new_path->parallel_safe);
> +
> +/* Relation should be OK for parallelism, too. */
> +Assert(parent_rel->consider_parallel);
>

Yep.
Added this new one too.


>
> Regarding recurse_set_operations, since the rel->consider_parallel is
> always the same as final_rel->consider_parallel there's really no
> value in testing it.  If it were possible for rel->consider_parallel
> to be false even when final_rel->consider_parallel were true then the
> test would be necessary for correctness.  That might or might not
> happen in the future, so I guess it wouldn't hurt to include this for
> future-proofing,


In that case, we should have rel in the condition rather than final_rel, as
we are adding a path into rel. For future-proofing, I have added a check on
lateral_relids too.


> but it's not actually a bug fix, so it also wouldn't
> hurt to leave it out.  I could go either way, I guess.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 708026f34f6a5d087660930f9cb05aa3e08c130b Mon Sep 17 00:00:00 2001
From: Jeevan Chalke <jeevan.cha...@enterprisedb.com>
Date: Wed, 11 Apr 2018 14:08:13 +0530
Subject: [PATCH] Add partial path only when rel's consider_parallel is true.

Otherwise, it will result in an assertion failure later in the
create_gather_path().
---
 src/backend/optimizer/path/allpaths.c | 41 +++
 src/backend/optimizer/prep/prepunion.c|  3 +-
 src/backend/optimizer/util/pathnode.c |  6 
 src/test/regress/expected/select_parallel.out | 19 +
 src/test/regress/sql/select_parallel.sql  |  6 
 5 files changed, 56 insertions(+), 19 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 3ba3f87..c3d0634 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2242,26 +2242,31 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		  pathkeys, required_outer));
 	}
 
-	/* If consider_parallel is false, there should be no partial paths. */
-	Assert(sub_final_rel->consider_parallel ||
-		   sub_final_rel->partial_pathlist == NIL);
-
-	/* Same for partial paths. */
-	foreach(lc, sub_final_rel->partial_pathlist)
+	/* If outer rel allows parallelism, do same for partial paths. */
+	if (rel->consider_parallel && bms_is_empty(required_outer))
 	{
-		Path	   *subpath = (Path *) lfirst(lc);
-		List	   *pathkeys;
-
-		/* Convert subpath's pathkeys to outer representation */
-		pathkeys = convert_subquery_pathkeys(root,
-			 rel,
-			 subpath->pathkeys,
-			 make_tlist_from_pathtarget(subpath->pathtarget));
+		/* If consider_parallel is false, there should be no partial paths. */
+		Assert(sub_final_rel->consider_parallel ||
+			   sub_final_

Re: [HACKERS] Partition-wise aggregation/grouping

2018-04-10 Thread Jeevan Chalke
On Tue, Apr 10, 2018 at 7:30 PM, David Steele <da...@pgmasters.net> wrote:

> Hi Jeevan,
>
> On 4/2/18 10:57 AM, Robert Haas wrote:
> > On Thu, Mar 29, 2018 at 9:02 AM, Jeevan Chalke
> > <jeevan.cha...@enterprisedb.com> wrote:
> >> Yep, I see the risk.
> >
> > Committed 0001 last week and 0002 just now.  I don't really see 0003 a
> > a critical need.  If somebody demonstrates that this saves a
> > meaningful amount of planning time, we can consider that part for a
> > future release.
>
> The bulk of this patch was committed so I have marked it that way.
>

Thanks, David.


>
> If you would like to pursue patch 03 I think it would be best to start a
> new thread and demonstrate how the patch will improve performance.
>

Sure.


>
> Regards,
> --
> -David
> da...@pgmasters.net
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-10 Thread Jeevan Chalke
On Mon, Apr 9, 2018 at 5:52 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:

> On Sun, Apr 8, 2018 at 1:04 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > Hi,
> >
> > In some places, I have observed that we are adding a partial path even
> > when the rel's consider_parallel is false. Due to this, the partial
> > path added has parallel_safe set to false, and the assertion in
> > create_gather_path() later fails.
> >
>
> Few Comments:
> 1.
> @@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root,
> RelOptInfo *rel,
> pathkeys, required_outer));
>   }
>
> + /* If parallelism is not possible, return. */
> + if (!rel->consider_parallel || !bms_is_empty(required_outer))
> + return;
>
> In this case shouldn't we set the rel's consider_parallel flag
> correctly rather than avoiding adding the path to it as we do in
> recurse_set_operations?
>

In recurse_set_operations() we are building a new rel and setting its
properties from the final_rel; consider_parallel there is just copied from
the final_rel.
However, in set_subquery_pathlist(), rel is the input parameter, and we
were trying to add a partial path to it without looking at its
consider_parallel field. This patch fixes that.

And if we want to set consider_parallel for the rel, it should have been
done before reaching this function. I am not sure where exactly, but not
in this function, I guess.


> 2.
> @@ -331,7 +331,7 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
>   * to build a partial path for this relation.  But there's no point in
>   * considering any path but the cheapest.
>   */
> - if (final_rel->partial_pathlist != NIL)
> + if (final_rel->consider_parallel && final_rel->partial_pathlist != NIL)
>
> What problem did you see here or is it just for additional safety?
> Ideally, if the consider_parallel is false for a rel, it's partial
> path list should also be NULL.
>

I actually wanted to have rel->consider_parallel in the condition (yes, for
additional safety) as we are adding a partial path into rel. But then I
observed that it is the same as final_rel->consider_parallel and thus used
it along with the other condition.

I have observed in many places that we check the consider_parallel flag
before adding a partial path. Thus, for consistency, I added it here too;
but yes, it just adds additional safety here.


> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-08 Thread Jeevan Chalke
Hi,

In some places, I have observed that we add a partial path even when the
rel's consider_parallel is false. Due to this, the partial path added has
parallel_safe set to false, and the assertion in create_gather_path() later
fails.

Attached patch to fix that.


On Sun, Apr 8, 2018 at 12:26 AM, Andreas Seltenreich <seltenre...@gmx.de>
wrote:

> Tom Lane writes:
> > Alvaro Herrera <alvhe...@alvh.no-ip.org> writes:
> >> Andreas Seltenreich wrote:
> >>> as of 039eb6e92f:
> >>> TRAP: FailedAssertion("!(subpath->parallel_safe)", File:
> "pathnode.c", Line: 1813)
> >
> >> Uh, that's pretty strange -- that patch did not touch the planner at
> >> all, and the relcache.c changes are pretty trivial.
> >
> > I don't think Andreas meant that the bug is necessarily new in
> 039eb6e92f,
> > only that that's the version he happened to be testing.
>
> Right.  I'm not testing often enough yet to be able to report on commit
> granularity :-).  I'll try for a less ambiguos wording in future
> reports.
>
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 86f430e35583c6bfdd43c4f31e06ab83e8404e9a Mon Sep 17 00:00:00 2001
From: Jeevan Chalke <jeevan.cha...@enterprisedb.com>
Date: Sun, 8 Apr 2018 12:52:13 +0530
Subject: [PATCH] Add partial path only when rel's consider_parallel is true.

Otherwise, it will result in an assertion failure later in the
create_gather_path().
---
 src/backend/optimizer/path/allpaths.c  | 4 
 src/backend/optimizer/prep/prepunion.c | 2 +-
 src/backend/optimizer/util/pathnode.c  | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c4e4db1..aa24345 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2196,6 +2196,10 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		  pathkeys, required_outer));
 	}
 
+	/* If parallelism is not possible, return. */
+	if (!rel->consider_parallel || !bms_is_empty(required_outer))
+		return;
+
 	/* If consider_parallel is false, there should be no partial paths. */
 	Assert(sub_final_rel->consider_parallel ||
 		   sub_final_rel->partial_pathlist == NIL);
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 5236ab3..f54807d 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -331,7 +331,7 @@ recurse_set_operations(Node *setOp, PlannerInfo *root,
 		 * to build a partial path for this relation.  But there's no point in
 		 * considering any path but the cheapest.
 		 */
-		if (final_rel->partial_pathlist != NIL)
+		if (final_rel->consider_parallel && final_rel->partial_pathlist != NIL)
 		{
 			Path	   *partial_subpath;
 			Path	   *partial_path;
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 416b3f9..fa332de 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -770,6 +770,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	/* Check for query cancel. */
 	CHECK_FOR_INTERRUPTS();
 
+	/* Path to be added must be parallel safe. */
+	Assert(new_path->parallel_safe);
+
 	/*
 	 * As in add_path, throw out any paths which are dominated by the new
 	 * path, but throw out the new path if some existing path dominates it.
-- 
1.8.3.1



Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-29 Thread Jeevan Chalke
On Thu, Mar 29, 2018 at 4:13 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
> >
> > Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> > are basically the conditions on the relation being pushed down.
> > havingQuals are conditions on a grouped relation so treating them like
> > baserestrictinfo or join conditions looks more straight forward,
> > rather than having a separate member in PgFdwRelationInfo. So, remote
> > havingQuals go into remote_conds and local havingQuals go to
> > local_conds.
>
> Looks like we already do that. Then we have remote_conds, local_conds
> which together should be equivalent to havingQual. Storing all those
> three doesn't make sense. In future someone may use havingQual instead
> of remote_conds/local_conds just because its available and then there
> is risk of these three lists going out of sync.
>

Yep, I see the risk.


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-29 Thread Jeevan Chalke
On Wed, Mar 28, 2018 at 7:21 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Tue, Mar 27, 2018 at 2:43 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
>
> > I am not sure what we should Assert here. Note that we end up here only
> > when doing grouping, and thus I don't think we need any Assert here.
> > Let me know if I missed anything.
>
> Since we are just checking whether it's an upper relation and directly
> using root->upper_targets[UPPERREL_GROUP_AGG], I thought we could add
> an assert to verify that it's really the grouping rel we are dealing
> with. But I guess, we can't really check that from given relation. But
> then for a grouped rel we can get its target from RelOptInfo. So, we
> shouldn't need to root->upper_targets[UPPERREL_GROUP_AGG]. Am I
> missing something? For other upper relations we do not set the target
> yet but then we could assert that there exists one in the grouped
> relation.
>

Yes. We fetch the target from the grouped_rel itself.
Added the Assert() per our off-list discussion.


>
> >
> >>
> >>
> >> -get_agg_clause_costs(root, (Node *)
> >> root->parse->havingQual,
> >> +get_agg_clause_costs(root, fpinfo->havingQual,
> >>   AGGSPLIT_SIMPLE, );
> >>  }
> >> Should we pass agg costs as well through GroupPathExtraData to avoid
> >> calculating it again in this function?
> >
> >
> > Adding an extra member in GroupPathExtraData just for FDW does not look
> > good to me.
> > But yes, if we do that, then we can save this calculation.
> > Let me know if it's OK to have an extra member just for FDW use; I will
> > prepare a separate patch for that.
>
> I think that should be fine. A separate patch would be good, so that a
> committer can decide whether or not to include it.
>

Attached patch 0003 for this.


>
>
> >>
> >> +Node   *havingQual;
> >> I am wondering whether we could use remote_conds member for storing
> this.
> >
> >
> > This havingQual is later checked for shippability and classified into
> > pushable and non-pushable quals and stored in remote_conds and
> local_conds
> > respectively.
> > Storing it directly in remote_conds and then splitting it does not look
> good
> > to me.
> > Also, remote_conds is list of RestrictInfo nodes whereas havingQual is
> not.
> > So using that for storing havingQual does not make sense. So better to
> have
> > a separate member in PgFdwRelationInfo.
>
> Ah sorry, I was wrong about remote_conds. remote_conds and local_conds
> are basically the conditions on the relation being pushed down.
> havingQuals are conditions on a grouped relation so treating them like
> baserestrictinfo or join conditions looks more straight forward,
> rather than having a separate member in PgFdwRelationInfo. So, remote
> havingQuals go into remote_conds and local havingQuals go to
> local_conds.
>

OK, agreed.
In this version, I have not added anything to PgFdwRelationInfo.
The HAVING qual is needed in two places: (1) in foreign_grouping_ok(), to
check shippability, so I passed the translated HAVING qual as a parameter
to it; and (2) when estimating aggregate costs in
estimate_path_cost_size(), where we can use havingQual from root itself,
as the costs won't change between parent and child.
Thus there is no need to store a havingQual in PgFdwRelationInfo.

Thanks


--
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 761d46b1b255a7521529bbe7d1b128c9d79d6b4e Mon Sep 17 00:00:00 2001
From: Jeevan Chalke <jeevan.cha...@enterprisedb.com>
Date: Tue, 27 Mar 2018 14:23:00 +0530
Subject: [PATCH 1/3] Remove target from GroupPathExtraData, instead fetch it
 from grouped_rel.

---
 src/backend/optimizer/plan/planner.c | 4 +---
 src/include/nodes/relation.h | 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a19f5d0..d4acde6 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3734,7 +3734,6 @@ create_grouping_paths(PlannerInfo *root,
 			flags |= GROUPING_CAN_PARTIAL_AGG;
 
 		extra.flags = flags;
-		extra.target = target;
 		extra.target_parallel_safe = target_parallel_safe;
 		extra.havingQual = parse->havingQual;
 		extra.targetList = parse->targetList;
@@ -6909,7 +6908,7 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 	int			cnt_parts;
 	List	   *group

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-27 Thread Jeevan Chalke
On Mon, Mar 26, 2018 at 5:24 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Fri, Mar 23, 2018 at 4:35 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> >
> > Changes related to postgres_fdw which allows pushing aggregate on the
> > foreign server is not yet committed. Due to this, we will end up getting
> an
> > error when we have foreign partitions + aggregation.
> >
> > Attached 0001 patch here (0006 from my earlier patch-set) which adds
> support
> > for this and thus will not have any error.
>

I have observed that the target member in GroupPathExtraData is not needed,
as we store the target in grouped_rel itself.
Attached 0001 patch removes it.


>
>
>  else if (IS_UPPER_REL(foreignrel))
>  {
>  PgFdwRelationInfo *ofpinfo;
> -PathTarget *ptarget = root->upper_targets[UPPERREL_
> GROUP_AGG];
> +PathTarget *ptarget = fpinfo->grouped_target;
>
> I think we need an assert there to make sure that the upper relation is a
> grouping relation. That way any future push down will notice it.
>

I am not sure what we should Assert here. Note that we end up here only
when doing grouping, and thus I don't think we need any Assert here.
Let me know if I missed anything.


>
> -get_agg_clause_costs(root, (Node *)
> root->parse->havingQual,
> +get_agg_clause_costs(root, fpinfo->havingQual,
>   AGGSPLIT_SIMPLE, );
>  }
> Should we pass agg costs as well through GroupPathExtraData to avoid
> calculating it again in this function?
>

Adding an extra member in GroupPathExtraData just for FDW does not look
good to me.
But yes, if we do that, then we can save this calculation.
Let me know if it's OK to have an extra member just for FDW use; I will
prepare a separate patch for that.


>
>  /*
> +/*
> + * Store passed-in target and havingQual in fpinfo. If its a foreign
> + * partition, then path target and HAVING quals fetched from the root
> are
> + * not correct as Vars within it won't match with this child relation.
> + * However, server passed them through extra and thus fetch from it.
> + */
> +if (extra)
> +{
> +/* Partial aggregates are not supported. */
> +Assert(extra->patype != PARTITIONWISE_AGGREGATE_PARTIAL);
> +
> +fpinfo->grouped_target = extra->target;
> +fpinfo->havingQual = extra->havingQual;
> +}
> +else
> +{
> +fpinfo->grouped_target = root->upper_targets[UPPERREL_GROUP_AGG];
> +fpinfo->havingQual = parse->havingQual;
> +}
> I think both these cases, extra should always be present whether a child
> relation or a parent relation. Just pick from extra always.
>

Yes.
Done.


>
>  /* Grouping information */
>  List   *grouped_tlist;
> +PathTarget *grouped_target;
>
> We should use the target stored in the grouped rel directly.
>

Yep.


>
> +Node   *havingQual;
> I am wondering whether we could use remote_conds member for storing this.
>

This havingQual is later checked for shippability and classified into
pushable and non-pushable quals, which are stored in remote_conds and
local_conds respectively.
Storing it directly in remote_conds and then splitting it does not look
good to me.
Also, remote_conds is a list of RestrictInfo nodes whereas havingQual is
not, so using it to store havingQual does not make sense. It is better to
have a separate member in PgFdwRelationInfo.


>  /*
>   * Likewise, copy the relids that are represented by this foreign
> scan. An
> - * upper rel doesn't have relids set, but it covers all the base
> relations
> - * participating in the underlying scan, so use root's all_baserels.
> + * upper rel (but not the other upper rel) doesn't have relids set,
> but it
> + * covers all the base relations participating in the underlying
> scan, so
> + * use root's all_baserels.
>   */
>
> This is correct only for "other" grouping relations. We are yet to
> decide what to do
> for the other upper relations.
>

I have removed this comment change as the existing comments look good after
making the following change:


>
> -if (IS_UPPER_REL(rel))
> +if (IS_UPPER_REL(rel) && !IS_OTHER_REL(rel))
> I guess, this condition boils down to rel->reloptkind == RELOPT_UPPER_REL.
> Use
> it that way?
>

Done.

Attached 0002 for this.

Thanks


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
En

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-27 Thread Jeevan Chalke
On Tue, Mar 27, 2018 at 3:33 AM, Andres Freund <and...@anarazel.de> wrote:

> Hi,
>
> On 2018-03-23 17:01:54 +0530, Jeevan Chalke wrote:
> > Attached patch which fixes that.
>
> Thanks, will push. For the future, I'd be more likely to notice if you
> CC me ;)
>

Sure. Thanks.


>
>
> > However, I am not sure whether it is expected to have stable regression
> run
> > with installcheck having local settings.
> > For example, if I have enable_hashagg = false locally, I will definitely
> > see failures.
> >
> > ISTM, that I am missing Andres point here.
>
> I don't think there's a hard and fast rule here. I personally often
> during development disable parallelism because it makes some things
> harder (you can't easily debug crashes with gdb, benchmarks show larger
> variance, ...).


Yep.


>   There doesn't seem to be an equivalent benefit to
> support running e.g. with enable_hashagg = false.
>

OK.
Noted.

Thanks for the explanation.


>
> - Andres
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-23 Thread Jeevan Chalke
Hi Robert,

On pgsql-committers Andres reported one concern about test case failure
with installcheck with local settings.
(Sorry, I have not subscribed to that mailing list and thus not able to
reply there).

Attached patch which fixes that.

However, I am not sure whether it is expected to have a stable regression
run with installcheck having local settings.
For example, if I have enable_hashagg = false locally, I will definitely
see failures.

ISTM that I am missing Andres' point here.

Thanks

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index bf8272e..76a8209 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -6,6 +6,8 @@
 SET enable_partitionwise_aggregate TO true;
 -- Enable partitionwise join, which by default is disabled.
 SET enable_partitionwise_join TO true;
+-- Disable parallel plans.
+SET max_parallel_workers_per_gather TO 0;
 --
 -- Tests for list partitioned tables.
 --
@@ -921,6 +923,8 @@ ALTER TABLE pagg_tab_ml_p3 ATTACH PARTITION pagg_tab_ml_p3_s1 FOR VALUES FROM (0
 ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO (30);
 INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM') FROM generate_series(0, 2) i;
 ANALYZE pagg_tab_ml;
+-- For Parallel Append
+SET max_parallel_workers_per_gather TO 2;
 -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
 -- for level 1 only. For subpartitions, GROUP BY clause does not match with
 -- PARTITION KEY, but still we do not see a partial aggregation as array_agg()
@@ -1146,7 +1150,6 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
 (12 rows)
 
 -- Parallelism within partitionwise aggregates
-SET max_parallel_workers_per_gather TO 2;
 SET min_parallel_table_scan_size TO '8kB';
 SET parallel_setup_cost TO 0;
 -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
diff --git a/src/test/regress/sql/partition_aggregate.sql b/src/test/regress/sql/partition_aggregate.sql
index f7b5f5a..c60d7d2 100644
--- a/src/test/regress/sql/partition_aggregate.sql
+++ b/src/test/regress/sql/partition_aggregate.sql
@@ -7,6 +7,8 @@
 SET enable_partitionwise_aggregate TO true;
 -- Enable partitionwise join, which by default is disabled.
 SET enable_partitionwise_join TO true;
+-- Disable parallel plans.
+SET max_parallel_workers_per_gather TO 0;
 
 --
 -- Tests for list partitioned tables.
@@ -206,6 +208,9 @@ ALTER TABLE pagg_tab_ml ATTACH PARTITION pagg_tab_ml_p3 FOR VALUES FROM (20) TO
 INSERT INTO pagg_tab_ml SELECT i % 30, i % 10, to_char(i % 4, 'FM') FROM generate_series(0, 2) i;
 ANALYZE pagg_tab_ml;
 
+-- For Parallel Append
+SET max_parallel_workers_per_gather TO 2;
+
 -- Full aggregation at level 1 as GROUP BY clause matches with PARTITION KEY
 -- for level 1 only. For subpartitions, GROUP BY clause does not match with
 -- PARTITION KEY, but still we do not see a partial aggregation as array_agg()
@@ -238,7 +243,6 @@ SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a, b, c HAVING avg(b) > 7 O
 
 -- Parallelism within partitionwise aggregates
 
-SET max_parallel_workers_per_gather TO 2;
 SET min_parallel_table_scan_size TO '8kB';
 SET parallel_setup_cost TO 0;
 


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-23 Thread Jeevan Chalke
On Thu, Mar 22, 2018 at 10:28 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 22, 2018 at 6:15 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > Looks cleaner now. Thanks for refactoring it.
> >
> > I have merged these changes and flattened all previous changes into the
> > main patch.
>
> Committed 0001-0005.


Thanks Robert.


>   I made a few further modifications.  These were
> mostly cosmetic, but with two exceptions:
>
> 1. I moved one set_cheapest() call to avoid doing that twice for the
> top-level grouped_rel.
>
> 2. I removed the logic to set partition properties for grouped_rels.
> As far as I can see, there's nothing that needs this.  It would be
> important if we wanted subsequent planning stages to be able to do
> partition-wise stuff, e.g. when doing window functions or setops, or
> at higher query levels.  Maybe we'll have that someday; until then, I
> think this is just a waste of cycles.
>

OK.

Changes related to postgres_fdw which allow pushing aggregates to the
foreign server are not yet committed. Due to this, we will end up getting an
error when we have foreign partitions + aggregation.

Attached here is patch 0001 (0006 from my earlier patch-set) which adds
support for this and thus avoids that error.

Thanks


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
From 1173c609f6c6b2b45c0d1ce05533b840f7885717 Mon Sep 17 00:00:00 2001
From: Jeevan Chalke <jeevan.cha...@enterprisedb.com>
Date: Fri, 23 Mar 2018 14:23:53 +0530
Subject: [PATCH] Teach postgres_fdw to push aggregates for child relations
 too.

GetForeignUpperPaths() now takes an extra void parameter which will
be used to pass any additional details required to create an upper
path at the remote server. However, we support only grouping on the
remote server today, and thus it passes grouping-specific details,
i.e. GroupPathExtraData, or NULL otherwise.

Since we don't know how to get a partially aggregated result from a
remote server, only full aggregation is pushed to the remote server.
---
 contrib/postgres_fdw/expected/postgres_fdw.out | 132 +
 contrib/postgres_fdw/postgres_fdw.c|  48 ++---
 contrib/postgres_fdw/postgres_fdw.h|   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql  |  51 ++
 doc/src/sgml/fdwhandler.sgml   |   8 +-
 src/backend/optimizer/plan/createplan.c|   7 +-
 src/backend/optimizer/plan/planner.c   |  29 +++---
 src/backend/optimizer/prep/prepunion.c |   2 +-
 src/include/foreign/fdwapi.h   |   3 +-
 src/include/optimizer/planner.h|   3 +-
 10 files changed, 253 insertions(+), 32 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2d6e387..a211aa9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -7852,3 +7852,135 @@ SELECT t1.a, t1.phv, t2.b, t2.phv FROM (SELECT 't1_phv' phv, * FROM fprt1 WHERE
 (14 rows)
 
 RESET enable_partitionwise_join;
+-- ===
+-- test partitionwise aggregates
+-- ===
+CREATE TABLE pagg_tab (a int, b int, c text) PARTITION BY RANGE(a);
+CREATE TABLE pagg_tab_p1 (LIKE pagg_tab);
+CREATE TABLE pagg_tab_p2 (LIKE pagg_tab);
+CREATE TABLE pagg_tab_p3 (LIKE pagg_tab);
+INSERT INTO pagg_tab_p1 SELECT i % 30, i % 50, to_char(i/30, 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 10;
+INSERT INTO pagg_tab_p2 SELECT i % 30, i % 50, to_char(i/30, 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 20 and (i % 30) >= 10;
+INSERT INTO pagg_tab_p3 SELECT i % 30, i % 50, to_char(i/30, 'FM') FROM generate_series(1, 3000) i WHERE (i % 30) < 30 and (i % 30) >= 20;
+-- Create foreign partitions
+CREATE FOREIGN TABLE fpagg_tab_p1 PARTITION OF pagg_tab FOR VALUES FROM (0) TO (10) SERVER loopback OPTIONS (table_name 'pagg_tab_p1');
+CREATE FOREIGN TABLE fpagg_tab_p2 PARTITION OF pagg_tab FOR VALUES FROM (10) TO (20) SERVER loopback OPTIONS (table_name 'pagg_tab_p2');;
+CREATE FOREIGN TABLE fpagg_tab_p3 PARTITION OF pagg_tab FOR VALUES FROM (20) TO (30) SERVER loopback OPTIONS (table_name 'pagg_tab_p3');;
+ANALYZE pagg_tab;
+ANALYZE fpagg_tab_p1;
+ANALYZE fpagg_tab_p2;
+ANALYZE fpagg_tab_p3;
+-- When GROUP BY clause matches with PARTITION KEY.
+-- Plan with partitionwise aggregates is disabled
+SET enable_partitionwise_aggregate TO false;
+EXPLAIN (COSTS OFF)
+SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
+ 

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-22 Thread Jeevan Chalke
On Thu, Mar 22, 2018 at 10:06 AM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Thu, Mar 22, 2018 at 3:26 AM, Robert Haas <robertmh...@gmail.com>
> wrote:
> >
> > Is there a good reason not to use input_rel->relids as the input to
> > fetch_upper_rel() in all cases, rather than just at subordinate
> > levels?
> >
>
> That would simplify some code in these patches. We have set
> upper_rel->relids to NULL for non-other upper relation since Tom
> expected to use relids to mean something other than scan/join relids.
> With these patch-sets for grouping rels we are using upper_rel->relids
> to the relids of underlying scan/join relation. So it does make sense
> to set relids to input_rel->relids for all the grouping rels whether
> "other" or non-"other" grouping rels.
>
> But with this change, we have to change all the existing code to pass
> input_rel->relids to fetch_upper_rel(). If we don't do that or in
> future somebody calls that function with relids = NULL we will produce
> two relations which are supposed to do the same thing but have
> different relids set. That's because fetch_upper_rel() creates a
> relation if one does not exist whether or not the caller intends to
> create one. We should probably create two functions 1. to build an
> upper relation and 2. to search for it similar to what we have done
> for join relations and base relation. The other possibility is to pass
> a flag to fetch_upper_rel() indicating whether a caller intends to
> create a new relation when one doesn't exist. With this design a
> caller can be sure that an upper relation will not be created when it
> wants to just fetch an existing relation (and error out/assert if it
> doesn't find one.).
>

Like Ashutosh said, splitting fetch_upper_rel() into two functions,
build_upper_rel() and find_upper_rel(), looks better.

However, I am not sure whether setting relids in the top-most grouped rel is
a good idea or not. I remember we needed this while working on Aggregate
PushDown, and in [1] Tom Lane opposed the idea of setting the relids in
grouped_rel.

If we want to go with this, then I think it should be done as a separate
stand-alone patch.

[1]
https://www.postgresql.org/message-id/CAFjFpRdUz6h6cmFZFYAngmQAX8Zvo+MZsPXidZ077h=gp9b...@mail.gmail.com


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-22 Thread Jeevan Chalke
On Thu, Mar 22, 2018 at 3:26 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Wed, Mar 21, 2018 at 11:33 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > Let me try to explain this:
> > 1. GROUPING_CAN_PARTITIONWISE_AGG
> > 2. extra->is_partial_aggregation
> > 3. extra->perform_partial_partitionwise_aggregation
>
> Please find attached an incremental patch that attempts to refactor
> this logic into a simpler form.  What I've done is merged all three of
> the above Booleans into a single state variable called 'patype', which
> can be one of PARTITIONWISE_AGGREGATE_NONE,
> PARTITIONWISE_AGGREGATE_FULL, and PARTITIONWISE_AGGREGATE_PARTIAL.
> When create_ordinary_grouping_paths() is called, extra.patype is the
> value for the parent relation; that function computes a new value and
> passes it down to create_partitionwise_grouping_paths(), which inserts
> into the new 'extra' structure for the child.
>
> Essentially, in your system, extra->is_partial_aggregation and
> extra->perform_partial_partitionwise_aggregation both corresponded to
> whether or not patype == PARTITIONWISE_AGGREGATE_PARTIAL, but the
> former indicated whether the *parent* was doing partition-wise
> aggregation (and thus we needed to generate only partial paths) while
> the latter indicated whether the *current* relation was doing
> partition-wise aggregation (and thus we needed to force creation of
> partially_grouped_rel).  This took me a long time to understand
> because of the way the fields were named; they didn't indicate that
> one was for the parent and one for the current relation.  Meanwhile,
> GROUPING_CAN_PARTITIONWISE_AGG indicated whether partition-wise
> aggregate should be tried at all for the current relation; there was
> no analogous indicator for the parent relation because we can't be
> processing a child at all if the parent didn't decide to do
> partition-wise aggregation.  So to know what was happening for the
> current relation you had to look at GROUPING_CAN_PARTITIONWISE_AGG +
> extra->perform_partial_partitionwise_aggregation, and to know what was
> happening for the parent relation you just looked at
> extra->is_partial_aggregation.  With this proposed refactoring patch,
> there's just one patype value at each level, which at least to me
> seems simpler.  I tried to improve the comments somewhat, too.
>

Looks cleaner now. Thanks for refactoring it.

I have merged these changes and flattened all previous changes into the main
patch.


>
> You have some long lines that it would be good to break, like this:
>
>  child_extra.targetList = (List *) adjust_appendrel_attrs(root,
>
> (Node *) extra->targetList,
>
> nappinfos,
>
> appinfos);
>
> If you put a newline after (List *), the formatting will come out
> nicer -- it will fit within 80 columns.  Please go through the patches
> and make these kinds of changes for lines over 80 columns where
> possible.
>

OK. Done.
I was under the impression that it is not good to split the first line. What
if, in the split version, something gets inserted between the lines while
applying a new patch or during a merge? That would result in something
unexpected. But I am not sure if that's ever a possibility.


> I guess we'd better move the GROUPING_CAN_* constants to a header
> file, if they're going to be exposed through GroupPathExtraData.  That
> can go in some refactoring patch.
>

Yep. Moved that into 0003 where we create GroupPathExtraData.


> Is there a good reason not to use input_rel->relids as the input to
> fetch_upper_rel() in all cases, rather than just at subordinate
> levels?
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v23.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-21 Thread Jeevan Chalke
On Wed, Mar 21, 2018 at 7:46 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Wed, Mar 21, 2018 at 8:01 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> >> In the patch as proposed, create_partial_grouping_paths() can get
> >> called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
> >> wrong.
> >
> > I don't think so. For parallel case, we do check that. And for
> partitionwise
> > aggregation check, it was checked inside can_partitionwise_grouping()
> > function and flags were set accordingly. Am I missing something?
>
> Well, one of us is missing something somewhere.  If
> GROUPING_CAN_PARTIAL_AGG means that we're allowed to do partial
> grouping, and if create_partial_grouping_paths() is where partial
> grouping happens, then we should only be calling the latter if the
> former is set.  I mean, how can it make sense to create
> partially-grouped paths if we're not allowed to do partial grouping?
>

Yes, that's true. If we are not allowed to do partial grouping, partially
grouped paths should not be created.

However, what I meant is that if the added partitionwise-related checks
evaluate to true, that implies GROUPING_CAN_PARTIAL_AGG is also set, as it
was checked earlier, and thus an explicit check is not needed again.

Anyway, after your refactoring, it is more readable now.


>
> > I have tweaked these conditions and posted in a separate patch (0006).
> > However, I have merged all your three patches in one (0005).
>
> OK, thanks.  I wasn't sure I had understood what was going on, so
> thanks for checking it.
>
> Thanks also for keeping 0004-0006 separate here, but I think you can
> flatten them into one patch in the next version.
>

OK. Sure.


>
> >> I think that if the last test in can_partitionwise_grouping were moved
> >> before the previous test, it could be simplified to test only
> >> (extra->flags & GROUPING_CAN_PARTIAL_AGG) == 0 and not
> >> *perform_partial_partitionwise_aggregation.
> >
> > I think we can't do this way. If *perform_partial_
> partitionwise_aggregation
> > found to be true then only we need to check whether partial aggregation
> > itself is possible or not. If we are going to perform a full
> partitionwise
> > aggregation then test for can_partial_agg is not needed. Have I misread
> your
> > comments?
>
> It seems you're correct, because when I change it the tests fail.  I
> don't yet understand why.
>
> Basically, the main patch seems to use three Boolean signaling mechanisms:
>
> 1. GROUPING_CAN_PARTITIONWISE_AGG
> 2. is_partial_aggregation
> 3. perform_partial_partitionwise_aggregation
>
> Stuff I don't understand:
>
> - Why is one of them a Boolean shoved into "flags", even though it's
> not static across the whole hierarchy like the other flags, and the
> other two are separate Booleans?
> - What do they all do, anyway?
>

Let me try to explain this:

1. GROUPING_CAN_PARTITIONWISE_AGG
Tells us whether or not partitionwise grouping and/or aggregation is ever
possible. If it is FALSE, the other two have no meaning and are useless.
Only if it is TRUE do we attempt to create paths partitionwise.
I have kept it in "flags" as it looks similar in behavior to other flag
members like can_sort, can_hash, etc., and for a given grouped relation,
whether parent or child, they all work similarly. But yes, for a child
relation we inherit can_sort/can_hash from the parent as they won't change,
whereas this one needs to be evaluated for every child.
If required, I can move it to the GroupPathExtraData struct.

2. extra->is_partial_aggregation
This boolean is used to identify, at any given time, whether we are
computing a full aggregation or a partial aggregation. It is necessary when
doing partial aggregation to skip finalization, and it also tells us to use
partially_grouped_rel when true.

3. extra->perform_partial_partitionwise_aggregation
This boolean instructs the child to create partially aggregated paths when
TRUE. It is then transferred to child_extra->is_partial_aggregation in
create_partitionwise_grouping_paths().

Basically, (3) is required because we want to create the
partially_grouped_rel upfront, so that if a child is going to create
partially aggregated paths, they can be appended into the parent's partially
grouped rel; thus we need to create that rel before we even enter the child
path creation.
Since (3) is only valid if (1) is true, we need to compute (1) upfront too.


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-21 Thread Jeevan Chalke
On Wed, Mar 21, 2018 at 2:04 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Mar 20, 2018 at 10:46 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > I have added all these three patches in the attached patch-set and
> rebased
> > my changes over it.
> >
> > However, I have not yet made this patch-set dependent on UPPERREL_TLIST
> > changes you have proposed on another mail-thread and thus it has 0001
> patch
> > refactoring the scanjoin issue.
> > 0002, 0003 and 0004 are your patches added in this patchset.
> > 0005 and 0006 are further refactoring patches. 0006 adds a
> > GroupPathExtraData which stores mostly child variant data and costs.
> > 0007 is main partitionwise aggregation patch which is then rebased
> > accordingly.
> > 0008 contains testcase and 0009 contains FDW changes.
>
> Committed my refactoring patches (your 0002-0004).
>

Thanks Robert.


>
> Regarding apply_scanjoin_target_to_paths in 0001 and 0007, it seems
> like what happens is: we first build an Append path for the topmost
> scan/join rel.  That uses paths from the individual relations that
> don't necessarily produce the final scan/join target.  Then we mutate
> those relations in place during partition-wise aggregate so that they
> now do produce the final scan/join target and generate some more paths
> using the results.  So there's an ordering dependency, and the same
> pathlist represents different things at different times.  That is, I
> suppose, not technically any worse than what we're doing for the
> scan/join rel's pathlist in general, but here there's the additional
> complexity that the paths get used both before and after being
> mutated.  The UPPERREL_TLIST proposal would clean this up, although I
> realize that has unresolved issues.
>
> In create_partial_grouping_paths, the loop that does "for (i = 0; i <
> 2; i++)" is not exactly what I had in mind when I said that we should
> use two loops.  I did not mean a loop with two iterations.  I meant
> adding a loop like foreach(lc, input_rel->pathlist) in each place
> where we currently have a loop like
> foreach(input_rel->partial_pathlist).


The path creation logic for partial_pathlist and pathlist was identical, and
thus I thought of just looping over it twice, switching the pathlist, so
that we have minimal code to maintain. But yes, I agree that it adds
additional complexity.

> See 0001, attached.
>

Looks great. Thanks.


>
> Don't write if (a) Assert(b) but rather Assert(!a || b).  See 0002,
> attached.
>

OK. Noted.


> In the patch as proposed, create_partial_grouping_paths() can get
> called even if GROUPING_CAN_PARTIAL_AGG is not set.  I think that's
> wrong.


I don't think so. For the parallel case, we do check that. And the
partitionwise aggregation check was done inside the
can_partitionwise_grouping() function, with flags set accordingly. Am I
missing something?


>   If can_partial_agg() isn't accurately determining whether
> partial aggregation is possible,


I think it does determine it accurately.
The check (!parse->hasAggs && parse->groupClause == NIL) is only valid for
DISTINCT queries, which we are anyway not handling here; for partitionwise
aggregation it won't be true, as otherwise it would be a degenerate grouping
case.


> and as Ashutosh and I have been
> discussing, there's room for improvement in that area, then that's a
> topic for some other set of patches.  Also, the test in
> create_ordinary_grouping_paths for whether or not to call
> create_partial_grouping_paths() is super-complicated and uncommented.
> I think a simpler approach is to allow create_partial_grouping_paths()
> the option of returning NULL.  See 0003, attached.
>

Thanks for simplifying it.

However, after this simplification, we were unnecessarily creating
non-parallel partial aggregation paths for the root input rel.
Consider a case where we need a partial aggregation from a child: here,
extra->is_partial_aggregation = 0 at the root-level entry, as the parent is
still doing full aggregation, but
perform_partial_partitionwise_aggregation is true, which tells the child to
perform partial partitionwise aggregation. In this case,
cheapest_total_path will be set, and thus we would go ahead and create
partial aggregate paths for the parent rel, which is not needed.

I have tweaked these conditions and posted them in a separate patch (0006).
However, I have merged all three of your patches into one (0005).


>
> make_grouping_rel() claims that "For now, all aggregated paths are
> added to the (GROUP_AGG, NULL) upperrel", but this is false: we no
> longer have only one grouped upper rel.
>

Done.


>
> I'm having a heck of a time understanding what is_partial_aggregation
> an

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-20 Thread Jeevan Chalke
Hi,

On Mon, Mar 19, 2018 at 10:56 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Mar 16, 2018 at 1:50 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
> > Ok. That looks good.
>
> Here's an updated version.  In this version, based on a voice
> discussion with Ashutosh and Jeevan, I adjusted 0001 to combine it
> with an earlier idea of splitting Gather/Gather Merge path generation
> out of the function that creates partially aggregated paths.  The idea
> here is that create_ordinary_gather_paths() could first call
> create_partial_grouping_paths(), then add additional paths which might
> be partial or non-partial by invoking the partition-wise aggregate
> logic, then call gather_grouping_paths() and set_cheapest() to
> finalize the partially grouped rel.  Also, I added draft commit
> messages.
>

I have added all these three patches in the attached patch-set and rebased
my changes over it.

However, I have not yet made this patch-set dependent on the UPPERREL_TLIST
changes you have proposed on another mail-thread, and thus it has the 0001
patch refactoring the scanjoin issue.
0002, 0003 and 0004 are your patches added in this patch-set.
0005 and 0006 are further refactoring patches; 0006 adds
GroupPathExtraData, which mostly stores child-variant data and costs.
0007 is the main partitionwise aggregation patch, rebased accordingly.
0008 contains test cases and 0009 contains the FDW changes.

Let me know if I missed any point to consider while rebasing.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v21.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-15 Thread Jeevan Chalke
On Thu, Mar 15, 2018 at 3:38 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

>
> The patchset needs rebase.  I have rebased those on the latest head
> and made following changes.
>
> Argument force_partial_agg is added after output arguments to
> make_grouping_rels(). Moved it before output arguemnts to keep input and
> output
> arguments separate.
>
> Also moved the comment about creating partial aggregate paths before full
> aggregate paths in make_grouping_rels() moved to
> populate_grouping_rels_with_paths().
>
> In create_grouping_paths() moved call to try_degenerate_grouping_paths()
> before
> computing extra grouping information.
>

Thanks for these changes.


>
> Some more comment changes in the attached patch set.
>
> + *
> + * With partitionwise aggregates, we may have childrel's pathlist
> empty
> + * if we are doing partial aggregation. Thus do this only if
> childrel's
> + * pathlist is not NIL.
>   */
> -if (childrel->cheapest_total_path->param_info == NULL)
> +if (childrel->pathlist != NIL &&
> +childrel->cheapest_total_path->param_info == NULL)
>  accumulate_append_subpath(childrel->cheapest_total_path,
>, NULL);
> I thought we got rid of this code. Why has it come back? May be the comment
> should point to a function where this case happen.
>

Oops, my mistake. I had added it back while working on your comments, and
at that time we were creating partially_grouped_rel unconditionally, so I
was getting a segfault here.

But now we create partially_grouped_rel only when needed, thanks to your
refactoring work, so this guard is no longer necessary. Removed in the
attached patch set.


> In populate_grouping_rels_with_paths(), we need to pass extra to
> extension hook create_upper_paths_hook similar to what
> add_paths_to_joinrel() does.
>

Hmm.. you are right. Done.


>
> Also, we aren't passing is_partial_agg to FDW hook, so it won't know
> whether to compute partial or full aggregates. I think the check
> 5296 /* Partial aggregates are not supported. */
> 5297 if (extra->partial_partitionwise_grouping)
> 5298 return;
> in add_foreign_grouping_paths() is wrong. It's checking whether the
> children of the given relation will produce partial aggregates or not.
> But it is supposed to check whether the given relation should produce
> partial aggregates or not. I think we need to include is_partial_agg
> in GroupPathExtraData so that it gets passed to FDWs. If we do so, we
> need to add a comment in the prologue of GroupPathExtraData to
> disambiguate usage of is_partial_agg and
> partial_partitionwise_grouping.
>

Sorry, I misinterpreted this.
Reworked and added is_partial_aggregation to GroupPathExtraData.


> In current create_grouping_paths() (without any of your patches
> applied) we first create partial paths in partially grouped rel and
> then add parallel path to grouped rel using those partial paths. Then
> we hand over this to FDW and extension hooks, which may add partial
> paths, which might throw away a partial path used to create a parallel
> path in grouped rel causing a segfault. I think this bug exists since
> we introduced parallel aggregation or upper relation refactoring
> whichever happened later. Introduction of partially grouped rel has
> just made it visible.
>

Yes. That's why I needed to create partitionwise aggregation paths before
we create any normal paths.
But if this issue needs to be taken care of, it should be a separate
patch.

However, as Ashutosh Bapat noticed, after the refactoring we no longer
merely try to create partitionwise paths; rather, we create them whenever
possible. So, in line with create_grouping_paths(), I have renamed it to
create_partitionwise_grouping_paths() in the attached patch-set.

Thanks


> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>


-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v20.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-14 Thread Jeevan Chalke
On Tue, Mar 13, 2018 at 7:43 PM, Jeevan Chalke <
jeevan.cha...@enterprisedb.com> wrote:

> Hi,
>
> The patch-set is complete now, but there is still scope for some comment
> improvements due to all these refactorings; I will work on that. I also
> need to update a few places in the documentation, fix indentation, etc. I
> will post those changes in the next patch-set, but meanwhile this
> patch-set is ready for review.
>

Attached complete patch-set.

Fixed all remaining documentation, indentation, and comment issues.
Also incorporated a few comment improvements provided off-list by Ashutosh
Bapat.

Thanks
-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v18.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-13 Thread Jeevan Chalke
Hi,

I have resolved all the reported comments/issues in this new patch-set.

The changes by Ashutosh Bapat that split create_grouping_paths() into
separate functions, so that the partitionwise aggregate code can use them,
were based on my partitionwise aggregate changes. Since they are
essentially refactoring changes, I have pulled them out separately, ahead
of any partitionwise changes (see
0005-Split-create_grouping_paths-and-Add-GroupPathExtraDa.patch), and then
rebased all partitionwise changes over them, including all fixes.

The patch-set is complete now, but there is still scope for some comment
improvements due to all these refactorings; I will work on that. I also
need to update a few places in the documentation, fix indentation, etc. I
will post those changes in the next patch-set, but meanwhile this
patch-set is ready for review.


On Tue, Mar 13, 2018 at 9:12 AM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Mon, Mar 12, 2018 at 7:49 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> >
> >
> > On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat
> > <ashutosh.ba...@enterprisedb.com> wrote:
> >>
> >> On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
> >>
> We don't need UpperRelationKind member in that structure. That will be
> provided by the RelOptInfo passed.
>
> The problem here is the extra information required for grouping is not
> going to be the same for that needed for window aggregate and
> certainly not for ordering. If we try to jam everything in the same
> structure, it will become large with many members useless for a given
> operation. A reader will not have an idea about which of them are
> useful and which of them are not. So, instead we should try some
> polymorphism. I think we can pass a void * to GetForeignUpperPaths()
> and corresponding FDW hook knows what to cast it to based on the
> UpperRelationKind passed.
>

Yep. Done this way.


>
> BTW, the patch has added an argument to GetForeignUpperPaths() but has
> not documented the change in API. If we go the route of polymorphism,
> we will need to document the mapping between UpperRelationKind and the
> type of structure passed in.
>

Oops. I will do that in the next patchset.

Thanks for pointing it out; I missed it in the first place.

-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v17.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-12 Thread Jeevan Chalke
On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
>
> GroupPathExtraData now completely absorbs members from and replaces
> OtherUpperPathExtraData. This means that we have to consider a way to
> pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
> haven't tried it in this patch.
>

Initially, I was passing OtherUpperPathExtraData to the FDW, but now we
need to pass GroupPathExtraData.

However, since GetForeignUpperPaths() is a generic function for all upper
relations, we might think of renaming this struct to UpperPathExtraData
and adding an UpperRelationKind member to it, which would be used to
distinguish the passed-in extra data. But since we currently have extra
data only for grouping, I chose not to do that here; someone may choose
this approach later, when needed.


> With this patch there's a failure in partition_aggregation where the
> patch is creating paths with MergeAppend with GatherMerge underneath.
> I think this is related to the call
> add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
> true. But I didn't investigate it further.
>

I fixed it. We need to pass is_partial_agg instead of
extra->partial_partitionwise_grouping when calling
add_paths_to_partial_grouping_rel() in the parallelism case.


> With those two things remaining I am posting this patch, so that
> Jeevan Chalke can merge this patch into his patches and also merge
> some of his changes related to mine and Robert's changes. Let me know
> if this refactoring looks good.
>

Will rebase my changes tomorrow.


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 87ea18c..ed16f7d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -353,7 +353,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 			 UpperRelationKind stage,
 			 RelOptInfo *input_rel,
 			 RelOptInfo *output_rel,
-			 OtherUpperPathExtraData *extra);
+			 GroupPathExtraData *group_extra);
 
 /*
  * Helper functions
@@ -429,7 +429,7 @@ static void add_paths_with_pathkeys_for_rel(PlannerInfo *root, RelOptInfo *rel,
 static void add_foreign_grouping_paths(PlannerInfo *root,
 		   RelOptInfo *input_rel,
 		   RelOptInfo *grouped_rel,
-		   OtherUpperPathExtraData *extra);
+		   GroupPathExtraData *group_extra);
 static void apply_server_options(PgFdwRelationInfo *fpinfo);
 static void apply_table_options(PgFdwRelationInfo *fpinfo);
 static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
@@ -5235,7 +5235,7 @@ foreign_grouping_ok(PlannerInfo *root, RelOptInfo *grouped_rel)
 static void
 postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage,
 			 RelOptInfo *input_rel, RelOptInfo *output_rel,
-			 OtherUpperPathExtraData *extra)
+			 GroupPathExtraData *group_extra)
 {
 	PgFdwRelationInfo *fpinfo;
 
@@ -5255,7 +5255,7 @@ postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage,
 	fpinfo->pushdown_safe = false;
 	output_rel->fdw_private = fpinfo;
 
-	add_foreign_grouping_paths(root, input_rel, output_rel, extra);
+	add_foreign_grouping_paths(root, input_rel, output_rel, group_extra);
 }
 
 /*
@@ -5268,7 +5268,7 @@ postgresGetForeignUpperPaths(PlannerInfo *root, UpperRelationKind stage,
 static void
 add_foreign_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 		   RelOptInfo *grouped_rel,
-		   OtherUpperPathExtraData *extra)
+		   GroupPathExtraData *group_extra)
 {
 	Query	   *parse = root->parse;
 	PgFdwRelationInfo *ifpinfo = input_rel->fdw_private;
@@ -5288,16 +5288,16 @@ add_foreign_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
 	 * Store passed-in target and havingQual in fpinfo. If its a foreign
 	 * partition, then path target and HAVING quals fetched from the root are
 	 * not correct as Vars within it won't match with this child relation.
-	 * However, server passed them through extra and thus fetch from it.
+	 * However, server passed them through group_extra and thus fetch from it.
 	 */
-	if (extra)
+	if (group_extra)
 	{
 		/* Partial aggregates are not supported. */
-		if (extra->isPartialAgg)
+		if (group_extra->partial_partitionwise_grouping)
 			return;
 
-		fpinfo->grouped_target = extra->relTarget;
-		fpinfo->havingQual = extra->havingQual;
+		fpinfo->grouped_target = group_extra->target;
+		fpinfo->havingQual = group_extra->havingQual;
 	}
 	else
 	{
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1611975..9a34bf9 100644
--- a/src/backend/optimizer

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-12 Thread Jeevan Chalke
On Mon, Mar 12, 2018 at 6:07 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Fri, Mar 9, 2018 at 4:21 PM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
> > On Thu, Mar 8, 2018 at 7:31 PM, Robert Haas <robertmh...@gmail.com>
> wrote:
> >>
> >> This kind of goes along with the suggestion I made yesterday to
> >> introduce a new function, which at the time I proposed calling
> >> initialize_grouping_rel(), to set up new grouped or partially grouped
> >> relations.  By doing that it would be easier to ensure the
> >> initialization is always done in a consistent way but only for the
> >> relations we actually need.  But maybe we should call it
> >> fetch_grouping_rel() instead.  The idea would be that instead of
> >> calling fetch_upper_rel() we would call fetch_grouping_rel() when it
> >> is a question of the grouped or partially grouped relation.  It would
> >> either return the existing relation or initialize a new one for us.  I
> >> think that would make it fairly easy to initialize only the ones we're
> >> going to need.
> >
> > Hmm. I am working on refactoring the code to do something like this.
> >
>
> Here's patch doing the same. I have split create_grouping_paths() into
> three functions 1. to handle degenerate grouping paths
> (try_degenerate_grouping_paths()) 2. to create the grouping rels,
> partial grouped rel and grouped rel (make_grouping_rels()), which also
> sets some properties in GroupPathExtraData. 3. populate grouping rels
> with paths (populate_grouping_rels_with_paths()). With those changes,
> I have been able to get rid of partially grouped rels when they are
> not necessary. But I haven't tried to get rid of grouped_rels when
> they are not needed.
>
> GroupPathExtraData now completely absorbs members from and replaces
> OtherUpperPathExtraData. This means that we have to consider a way to
> pass GroupPathExtraData to FDWs through GetForeignUpperPaths(). I
> haven't tried it in this patch.
>
> With this patch there's a failure in partition_aggregation where the
> patch is creating paths with MergeAppend with GatherMerge underneath.
> I think this is related to the call
> add_paths_to_partial_grouping_rel() when try_parallel_aggregation is
> true. But I didn't investigate it further.
>
> With those two things remaining I am posting this patch, so that
> Jeevan Chalke can merge this patch into his patches and also merge
> some of his changes related to mine and Robert's changes. Let me know
> if this refactoring looks good.
>

Thanks, Ashutosh, for the refactoring patch.
I will rebase my changes and also resolve the two issues you reported
above.


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Jeevan Chalke
On Thu, Mar 8, 2018 at 7:49 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 8, 2018 at 9:15 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > I am not sure why we don't set reltarget into the grouped_rel too.
> >
> > But if we do so like we did in partially_grouped_rel, then it will be lot
> > easier for partitionwise aggregate as then we don't have to pass target
> to
> > functions creating paths like create_append_path. We now need to update
> > generate_gather_paths() to take target too as it is now being called on
> > grouped_rel in which reltarget is not set.
> >
> > But yes, if there is any specific reason we can't do so, then I think the
> > same as Ashutosh said. I am not aware of any such reason, though.
>
> I see no problem with setting reltarget for the grouped_rel.  Before
> we added partially_grouped_rel, that rel computed paths with two
> different targets: partial paths had the partial grouping target, and
> non-partial paths had the ordinary grouping target.  But that's fixed
> now.
>

OK.
I will update my changes accordingly.
If we set reltarget on the grouped_rel now, then I don't need one of the
refactoring patches, which passes the target to the path-creation functions.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-08 Thread Jeevan Chalke
On Thu, Mar 8, 2018 at 1:15 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> On Wed, Mar 7, 2018 at 8:07 PM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> Here are some more review comments esp. on
> try_partitionwise_grouping() function. BTW name of that function
> doesn't go in sync with enable_partitionwise_aggregation (which btw is
> in sync with enable_fooagg GUCs). But it goes in sync with
> create_grouping_paths(). It looks like we have already confused
> aggregation and grouping e.g. enable_hashagg may affect path creation
> when there is no aggregation involved i.e. only grouping is involved
> but create_grouping_paths will create paths when there is no grouping
> involved. Generally it looks like the function names use grouping to
> mean both aggregation and grouping but GUCs use aggregation to mean
> both of those. So, the naming in this patch looks consistent with the
> current naming conventions.
>
> +grouped_rel->part_scheme = input_rel->part_scheme;
> +grouped_rel->nparts = nparts;
> +grouped_rel->boundinfo = input_rel->boundinfo;
> +grouped_rel->part_rels = part_rels;
>
> You need to set the part_exprs which will provide partition keys for this
> partitioned relation. I think, we should include all the part_exprs of
> input_rel which are part of GROUP BY clause. Since any other expressions in
> part_exprs are not part of GROUP BY clause, they can not appear in the
> targetlist without an aggregate on top. So they can't be part of the
> partition
> keys of the grouped relation.
>
> In create_grouping_paths() we fetch both partial as well as fully grouped
> rel
> for given input relation. But in case of partial aggregation, we don't need
> fully grouped rel since we are not computing full aggregates for the
> children.
> Since fetch_upper_rel() creates a relation when one doesn't exist, we are
> unnecessarily creating fully grouped rels in this case. For thousands of
> partitions that's a lot of memory wasted.
>
> I see a similar issue with create_grouping_paths() when we are computing
> only
> full aggregates (either because partial aggregation is not possible or
> because
> parallelism is not possible). In that case, we unconditionally create
> partially
> grouped rels. That too would waste a lot of memory.
>
> I think that partially_grouped_rel, when required, is partitioned
> irrespective
> of whether we do full aggregation per partition or not. So, if we have its
> part_rels and other partitioning properties set. I agree that right now we
> won't use this information anywhere. It may be useful, in case we device a
> way
> to use partially_grouped_rel directly without using grouped_rel for
> planning
> beyond grouping, which seems unlikely.
>
> +
> +/*
> + * Parallel aggregation requires partial target, so compute it
> here
> + * and translate all vars. For partial aggregation, we need it
> + * anyways.
> + */
> +partial_target = make_partial_grouping_target(root, target,
> +  havingQual);
>
> Don't we have this available in partially_grouped_rel?
>
> That shows one asymmetry that Robert's refactoring has introduced. We
> don't set
> reltarget of grouped_rel but set reltarget of partially_grouped_rel. If
> reltarget of grouped_rel changes across paths (the reason why we have not
> set
> it in grouped_rel), shouldn't reltarget of partially grouped paths change
> accordingly?
>

I am not sure why we don't set reltarget on the grouped_rel too.

But if we do so, as we did for partially_grouped_rel, it will be a lot
easier for partitionwise aggregation, since we then don't have to pass the
target to path-creation functions like create_append_path(). We now need
to update generate_gather_paths() to take a target too, as it is now being
called on the grouped_rel, in which reltarget is not set.

But yes, if there is any specific reason we can't do so, then I think the
same as Ashutosh said. I am not aware of any such reason, though.


> +
> +/*
> + * group_by_has_partkey
> + *
> + * Returns true, if all the partition keys of the given relation are part
> of
> + * the GROUP BY clauses, false otherwise.
> + */
> +static bool
> +group_by_has_partkey(RelOptInfo *input_rel, PathTarget *target,
> + List *groupClause)
>
> We could modify this function to return the list of part_exprs which are
> part
> of group clause and use that as the partition keys of the grouped_rel if
> required. If group by doesn't have all the partition keys, the function
> would
> return a NULL list.
>
> Right now, in case of full aggregation, partially_group

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-07 Thread Jeevan Chalke
On Wed, Mar 7, 2018 at 1:45 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Tue, Mar 6, 2018 at 5:31 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > This is in line with the enable_hashagg GUC. Do you think
> > enable_partitionwise_aggregate seems better? But then it will not be
> > consistent with other GUC names like enable_hashagg.
>
> Well, if I had my way, enable_hashagg would be spelled
> enable_hash_aggregate, too, but I wasn't involved in the project that
> long ago.  100% consistency is hard to achieve here; the perfect
> parallel of enable_hashagg would be enable_partitionwiseagg, but then
> it would be inconsistent with enable_partitionwise_join unless we
> renamed it to enable_partitionwisejoin, which I would rather not do.
> I think the way the enable_blahfoo names were done was kinda
> shortsighted -- it works OK as long as blahfoo is pretty short, like
> mergejoin or hashagg or whatever, but if you have more or longer words
> then I think it's hard to see where the word boundaries are without
> any punctuation.  And if you start abbreviating then you end up with
> things like enable_pwagg which are not very easy to understand.  So I
> favor spelling everything out and punctuating it.
>

Understood; that makes sense.
Updated.


>
> > So the code for doing partially aggregated partial paths and partially
> > aggregated non-partial path is same except partial paths goes into
> > partial_pathlist where as non-partial goes into pathlist of
> > partially_grouped_rel. Thus, calling add_paths_to_partial_grouping_rel()
> > when isPartialAgg = true seems correct. Also as we have decided, this
> > function is responsible to create all partially aggregated paths
> including
> > both partial and non-partial.
> >
> > Am I missing something?
>
> Hmm.  I guess not.  I think I didn't read this code well enough
> previously.  Please find attached proposed incremental patches (0001
> and 0002) which hopefully make the code in this area a bit clearer.
>

Yep. Thanks for these patches.
I have merged these changes into my main (0007) patch now.


>
> >> +   /*
> >> +* If there are any fully aggregated partial paths present,
> >> may be because
> >> +* of parallel Append over partitionwise aggregates, we must
> stick
> >> a
> >> +* Gather or Gather Merge path atop the cheapest partial path.
> >> +*/
> >> +   if (grouped_rel->partial_pathlist)
> >>
> >> This comment is copied from someplace where the code does what the
> >> comment says, but here it doesn't do any such thing.
> >
> > Well, these comments are not present anywhere other than this place.
> > With Parallel Append and partitionwise aggregates, it is now possible to
> > have fully aggregated partial paths. Thus we need to stick a Gather
> > and/or Gather Merge atop the cheapest partial path, and I believe the
> > code does the same. Am I missing something?
>
> I misread the code.  Sigh.  I should have waited until today to send
> that email and taken time to study it more carefully.  But I still
> don't think it's completely correct.  It will not consider using a
> pre-sorted path; the only strategies it can consider are cheapest path
> + Gather and cheapest path + explicit Sort (even if the cheapest path
> is already correctly sorted!) + Gather Merge.  It should really do
> something similar to what add_paths_to_partial_grouping_rel() already
> does: first call generate_gather_paths() and then, if the cheapest
> partial path is not already correctly sorted, also try an explicit
> Sort + Gather Merge.  In fact, it looks like we can actually reuse
> that logic exactly.  See attached 0003 incremental patch; this changes
> the outputs of one of your regression tests, but the new plan looks
> better.
>

This seems like a refactoring change and is thus added as a separate patch
(0005) in the patch-set.
The related changes in the PWA patch are merged accordingly too.

Attached is a new patch-set with these changes merged, fixing the review
comments from Ashutosh Bapat, along with his GroupPathExtraData changes
patch.


> Some other notes:
>
> There's a difference between performing partial aggregation in the
> same process and performing it in a different process.  hasNonPartial
> tells us that we can't perform partial aggregation *at all*;
> hasNonSerial only tells us that partial and final aggregation must
> happen in the same process.  This patch could possibly take advantage
> of partial aggregation even when hasNonSerial is set.  Finalize
> Aggregate -> Append -> N copies of { Partial Aggregate -> Whatever }
> is OK with hasNonSeria

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-07 Thread Jeevan Chalke
 for partially_grouped_rel; that way, later code can
>   * easily consider both parallel and non-parallel approaches to
> grouping.
>   */
> -if (try_parallel_aggregation)
> +if (!child_data && !(agg_costs->hasNonPartial ||
> agg_costs->hasNonSerial))
>  {
> -PathTarget *partial_grouping_target;
> -
> [... clipped ...]
> +get_agg_clause_costs(root, havingQual,
>   AGGSPLIT_FINAL_DESERIAL,
> - &agg_final_costs);
> + agg_final_costs);
>  }
> +}
>
> With this change, we are computing partial aggregation costs even in
> the cases when
> those will not be used e.g. when there are no children and parallel paths
> can
> not be created. In the attached patch, I have refactored the code such that
> they are computed when they are needed the first time and re-used later.
>
> +if (child_data)
> +{
> +partial_grouping_target = child_data->partialRelTarget;
> +partially_grouped_rel->reltarget = partial_grouping_target;
> +agg_partial_costs = child_data->agg_partial_costs;
> +agg_final_costs = child_data->agg_final_costs;
> +}
>
> I think, with the refactoring, we can get rid of the last two lines here. I
> think we can get rid of this block entirely, but I have not reviewed the
> entire
> code to confirm that.
>

I have added your patch as one of the refactoring patches and rebased my
changes over it.
Yes, it removed this block and a few other conditions too.


>
>  static PathTarget *
> -make_partial_grouping_target(PlannerInfo *root, PathTarget
> *grouping_target)
> +make_partial_grouping_target(PlannerInfo *root,
> + PathTarget *grouping_target,
> + Node *havingQual)
> This looks like a refactoring change. Should go to one of the refactoring
> patches or in a patch of its own.
>

OK. Refactored it into a separate patch.


I will post a new patchset with these changes included.


>
> This isn't full review. I will continue reviewing this further.
>

Sure.


>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-06 Thread Jeevan Chalke
 * easily consider both parallel and non-parallel approaches to
> grouping.
>   */
> -if (try_parallel_aggregation)
> +if (!child_data && !(agg_costs->hasNonPartial ||
> agg_costs->hasNonSerial))
>  {
> -PathTarget *partial_grouping_target;
> -
> [... clipped ...]
> +get_agg_clause_costs(root, havingQual,
>   AGGSPLIT_FINAL_DESERIAL,
> - &agg_final_costs);
> + agg_final_costs);
>  }
> +}
>
> With this change, we are computing partial aggregation costs even in
> the cases when
> those will not be used e.g. when there are no children and parallel paths
> can
> not be created. In the attached patch, I have refactored the code such that
> they are computed when they are needed the first time and re-used later.
>
> +if (child_data)
> +{
> +partial_grouping_target = child_data->partialRelTarget;
> +partially_grouped_rel->reltarget = partial_grouping_target;
> +agg_partial_costs = child_data->agg_partial_costs;
> +agg_final_costs = child_data->agg_final_costs;
> +}
>
> I think, with the refactoring, we can get rid of the last two lines here. I
> think we can get rid of this block entirely, but I have not reviewed the
> entire
> code to confirm that.
>
>  static PathTarget *
> -make_partial_grouping_target(PlannerInfo *root, PathTarget
> *grouping_target)
> +make_partial_grouping_target(PlannerInfo *root,
> + PathTarget *grouping_target,
> + Node *havingQual)
> This looks like a refactoring change. Should go to one of the refactoring
> patches or in a patch of its own.
>
> This isn't full review. I will continue reviewing this further.
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-06 Thread Jeevan Chalke
On Tue, Mar 6, 2018 at 2:29 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Mon, Mar 5, 2018 at 3:56 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > However, to perform Gather or Gather Merge once we have all partial paths
> > ready, and to avoid too many existing code rearrangement, I am calling
> > try_partitionwise_grouping() before we do any aggregation/grouping on
> whole
> > relation. By doing this, we will be having all partial paths in
> > partially_grouped_rel and then existing code will do required
> finalization
> > along with any Gather or Gather Merge, if required.
> >
> > Please have a look over attached patch-set and let me know if it needs
> > further changes.
>
> This does look better.
>

Thank you, Robert.


>
> +  enable_partitionwise_agg
> (boolean)
>
> Please don't abbreviate "aggregate" to "agg".
>

This is in line with the enable_hashagg GUC. Do you think
enable_partitionwise_aggregate seems better? But then it will not be
consistent with other GUC names like enable_hashagg.


>
> -   /* Build final grouping paths */
> -   add_paths_to_grouping_rel(root, input_rel, grouped_rel, target,
> -
> partially_grouped_rel, agg_costs,
> -
> &agg_final_costs, gd, can_sort, can_hash,
> - dNumGroups,
> (List *) parse->havingQual);
> +   if (isPartialAgg)
> +   {
> +   Assert(agg_partial_costs && agg_final_costs);
> +   add_paths_to_partial_grouping_rel(root, input_rel,
> +
>partially_grouped_rel,
> +
>agg_partial_costs,
> +
>gd, can_sort, can_hash,
> +
>false, true);
> +   }
> +   else
> +   {
> +   double  dNumGroups;
> +
> +   /* Estimate number of groups. */
> +   dNumGroups = get_number_of_groups(root,
> +
>cheapest_path->rows,
> +
>gd,
> +
>child_data ? make_tlist_from_pathtarget(target) :
> parse->targetList);
> +
> +   /* Build final grouping paths */
> +   add_paths_to_grouping_rel(root, input_rel, grouped_rel,
> target,
> +
> partially_grouped_rel, agg_costs,
> +
> agg_final_costs, gd, can_sort, can_hash,
> +
> dNumGroups, (List *) havingQual);
> +   }
>
> This looks strange.  Why do we go down two completely different code
> paths here?


It is because when isPartialAgg = true we need to create partially
aggregated non-partial paths, which should be added to
partially_grouped_rel->pathlist. And when isPartialAgg = false, we are
creating fully aggregated paths, which go into grouped_rel->pathlist.


>   It seems to me that the set of paths we add to the
> partial_pathlist shouldn't depend at all on isPartialAgg.  I might be
> confused, but it seems to me that any aggregate path we construct that
> is going to run in parallel must necessarily be partial, because even
> if each group will occur only in one partition, it might still occur
> in multiple workers for that partition, so finalization would be
> needed.


That's true. We are creating partially aggregated partial paths for this
and keeping them in partially_grouped_rel->partial_pathlist.


>   On the other hand, for non-partial paths, we can add then to
> partially_grouped_rel when isPartialAgg = true and to grouped_rel when
> isPartialAgg = false, with the only difference being AGGSPLIT_SIMPLE
> vs. AGGSPLIT_INITIAL_SERIAL.


Yes. As explained above, they go into the pathlist of the respective rel.
However, the PathTarget is different too: we need the partial path target
when isPartialAgg = true, and we also need agg_partial_costs.


> But that doesn't appear to be what this
> is doing.
>

So the code for creating partially aggregated partial paths and partially
aggregated non-partial paths is the same, except that partial paths go
into the partial_pathlist whereas non-partial paths go into the pathlist
of partially_grouped_rel. Thus, calling add_paths_to_partial_grouping_rel()
when isPartialAgg = true seems correct. Also, as we have decided, this
function is responsible for creating all partially aggregated paths,
including both partial and non-partial.

Am I missing something?


>
> +   /*
> +* If there are any fully aggregated partial paths present,
> may be because
> +* of parallel Append over partitionwise aggregates, we must stick
> a
> +* Gather or Gather Merge path atop the cheapest partial path.
> +*/
> +   if (grouped_rel->partial_pathlist)
>
> This comment is copied from someplace where the code does what the
> comment says, but here it doesn't do any such thing.
>

Well, the

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-05 Thread Jeevan Chalke
On Sat, Mar 3, 2018 at 12:12 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 1, 2018 at 4:52 PM, Robert Haas <robertmh...@gmail.com> wrote:
> > This is not a full review, but I'm out of time for right now.
>
> Another thing I see here now is that create_grouping_paths() and
> create_child_grouping_paths() are extremely similar.  Isn't there some
> way we can refactor things so that we can reuse the code instead of
> duplicating it?
>

Yes, I too observed the same after our redesign.

To avoid code duplication, I am now calling create_grouping_paths() for the
child relations too.

However, to perform the Gather or Gather Merge step once we have all partial
paths ready, and to avoid rearranging too much of the existing code, I am
calling try_partitionwise_grouping() before we do any aggregation/grouping
on the whole relation. By doing this, we will have all partial paths in
partially_grouped_rel, and then the existing code will do the required
finalization, along with any Gather or Gather Merge, if required.

Please have a look at the attached patch set and let me know if it needs
further changes.

Thanks


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>



-- 
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


partition-wise-agg-v15.tar.gz
Description: application/gzip


Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-05 Thread Jeevan Chalke
On Fri, Mar 2, 2018 at 3:22 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Mar 1, 2018 at 5:34 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > Attached new patchset after rebasing my changes over these changes and on
> > latest HEAD.
>
> +* We have already created a Gather or Gather Merge path atop
> cheapest
> +* partial path. Thus the partial path referenced by the
> Gather node needs
> +* to be preserved as adding new partial paths in same rel may
> delete this
> +* referenced path. To do this we need to clear the
> partial_pathlist from
> +* the partially_grouped_rel as we may add partial paths again
> while doing
> +* partitionwise aggregation. Keeping older partial path intact
> seems
> +* reasonable too as it might possible that the final path
> chosen which is
> +* using it wins, but the underneath partial path is not the
> cheapest one.
>
> This isn't a good design.  You shouldn't create a Gather or Gather
> Merge node until all partial paths have been added.  I mean, the point
> is to put a Gather node on top of the cheapest path, not the path that
> is currently the cheapest but might not actually be the cheapest once
> we've added them all.
>

To be honest, I didn't know that we should not generate Gather or Gather
Merge paths until all possible partial paths are in place. I realized it
recently while debugging an issue reported off-list by Rajkumar. While
working on that fix, what I observed is:
- I have the cheapest partial path with cost, say, 10; a Gather on it
increases the total cost to 11.
- Later, when I add a partial path with cost, say, 9, a Gather on it
results in a total cost of 12.
This means the first Gather path is the cheapest one, but its underlying
partial path no longer is; unfortunately, that underlying path got removed
when my new partial path was added to the partial_pathlist.

Due to this, I thought it better to keep both paths valid, and to avoid
deleting the earlier cheapest partial path, I chose to reset
partially_grouped_rel->partial_pathlist.

But yes, per the comment in generate_gather_paths() and as you said, we
should add Gather or Gather Merge only after we are done with all partial
path creation. Sorry for not knowing this before.


>
> +add_gather_or_gather_merge(PlannerInfo *root,
>
> Please stop picking generic function names for functions that have
> very specific purposes.  I don't really think that you need this to be
> a separate function at all, but it it is certainly NOT a
> general-purpose function for adding a Gather or Gather Merge node.
>

OK. Got it now.


>
> +/*
> + * Collect statistics about aggregates for estimating costs of
> + * performing aggregation in parallel or in partial, if not
> already
> + * done. We use same cost for all the children as they will be
> same
> + * anyways.
> + */
>
> If it only needs to be done once, do we really have to have it inside
> the loop?  I see that you're using the varno-translated
> partial_target->exprs and target->exprs, but if the specific varnos
> don't matter, why not just use the untranslated version of the targets
> before entering the loop?  And if the specific varnos do matter, then
> presumably you need to do it every time.
>

Yes. It can be pulled outside the loop.


>
> This is not a full review, but I'm out of time for right now.
>





Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-01 Thread Jeevan Chalke
On Mon, Feb 26, 2018 at 8:03 PM, Robert Haas <robertmh...@gmail.com> wrote:

>
> Committed after incorporating your other fixes and updating the
> optimizer README.
>

Attached new patchset after rebasing my changes over these changes and on
latest HEAD.


>


Thanks



partition-wise-agg-v14.tar.gz
Description: GNU Zip compressed data


Re: ERROR: Aggref found in non-Agg plan node (introduced in commit 3bf05e096b9f8375e640c5d7996aa57efd7f240c)

2018-02-27 Thread Jeevan Chalke
On Tue, Feb 27, 2018 at 4:55 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> On Tuesday, 27 February 2018 at 12:16:42, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com>:
>
> Hi,
>
> On Tue, Feb 27, 2018 at 3:46 PM, Andreas Joseph Krogh <andr...@visena.com>
> wrote:
>>
>> $subject
>>
>> The query I'm using is 14K and I have not produced a test-case, so this
>> is just a heads-up.
>>
>
> Rajkumar shared the same issue with me off-list, and I have posted a fix for
> the same (https://www.postgresql.org/message-id/CAM2+6=X9kxQoL2ZqZ00E
> 6asbt9z+rfywbomhxj0+8fpaymz...@mail.gmail.com)
>
> Can you please check if that fixes your test-case or not?
>
>
>
> Applying fix_aggref_in_non-agg_error.patch (only) fixes the issue
>

Thank you Andreas for confirming.


>
> --
> Andreas Joseph Krogh
> ​
>





Re: ERROR: Aggref found in non-Agg plan node (introduced in commit 3bf05e096b9f8375e640c5d7996aa57efd7f240c)

2018-02-27 Thread Jeevan Chalke
Hi,

On Tue, Feb 27, 2018 at 3:46 PM, Andreas Joseph Krogh <andr...@visena.com>
wrote:

> $subject
>
> The query I'm using is 14K and I have not produced a test-case, so this is
> just a heads-up.
>

Rajkumar shared the same issue with me off-list, and I have posted a fix
for it (
https://www.postgresql.org/message-id/CAM2+6=x9kxqol2zqz00e6asbt9z+rfywbomhxj0+8fpaymz...@mail.gmail.com
)

Can you please check whether that fixes your test case?

Thanks



>





Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-26 Thread Jeevan Chalke
Hi Robert,

On Fri, Feb 23, 2018 at 2:53 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > In this attached version, I have rebased my changes over new design of
> > partially_grouped_rel. The preparatory changes of adding
> > partially_grouped_rel are in 0001.
>
> I spent today hacking in 0001; results attached.  The big change from
> your version is that this now uses generate_gather_paths() to add
> Gather/Gather Merge nodes (except in the case where we sort by group
> pathkeys and then Gather Merge) rather than keeping all of the bespoke
> code.  That turned out to be a bit less elegant than I would have liked
> -- I had to add an override_rows argument to generate_gather_paths to make
> it work.  But overall I think this is still a big improvement, since
> it lets us share code instead of duplicating it.  Also, it potentially
> lets us add partially-aggregated but non-parallel paths into
> partially_grouped_rel->pathlist and that should Just Work; they will
> get the Finalize Aggregate step but not the Gather.  With your
> arrangement that wouldn't work.
>
> Please review/test.
>

I have reviewed and tested the patch, and here are a couple of points:

 /*
- * If the input rel belongs to a single FDW, so does the grouped rel.
+ * If the input rel belongs to a single FDW, so does the grouped rel.
Same
+ * for the partially_grouped_rel.
  */
 grouped_rel->serverid = input_rel->serverid;
 grouped_rel->userid = input_rel->userid;
 grouped_rel->useridiscurrent = input_rel->useridiscurrent;
 grouped_rel->fdwroutine = input_rel->fdwroutine;
+partially_grouped_rel->serverid = input_rel->serverid;
+partially_grouped_rel->userid = input_rel->userid;
+partially_grouped_rel->useridiscurrent = input_rel->useridiscurrent;
+partially_grouped_rel->fdwroutine = input_rel->fdwroutine;

In my earlier mail, where I posted a patch for these partially grouped rel
changes, I forgot to raise my question about this.
I was unclear about the above changes and thus passed grouped_rel whenever
we wanted to work on partially_grouped_rel, to fetch the relevant details.

One idea I thought about is to memcpy the struct once we have set all the
required fields for grouped_rel, so that we don't have to do the same thing
again for partially_grouped_rel.

---

+ * Insert a Sort node, if required.  But there's no point in
+ * sorting anything but the cheapest path.
  */
-if (root->group_pathkeys)
+if (!pathkeys_contained_in(root->group_pathkeys,
path->pathkeys))
+{
+if (path != linitial(partially_grouped_rel->pathlist))
+continue;

Paths in the pathlist are added by add_path(). Though the pathlist is kept
sorted with the cheapest total path first, we generally use
RelOptInfo->cheapest_total_path instead of the first entry, unlike for
partial paths. But here you use the first entry, as in the partial-paths
case. Would it be better to use the cheapest total path from
partially_grouped_rel? This would require calling set_cheapest() on
partially_grouped_rel before we call this function.

Attached is a top-up patch doing this, along with a few indentation fixes.

Rest of the changes look good to me.

Once this gets in, I will re-base my other patches accordingly.

And, thanks for committing 0006.

Thanks


>


diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 1c792a0..f6b0208 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2453,7 +2453,8 @@ set_worktable_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  * we must do something.)
  */
 void
-generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
+generate_gather_paths(PlannerInfo *root, RelOptInfo *rel,
+	  bool override_rows)
 {
 	Path	   *cheapest_partial_path;
 	Path	   *simple_gather_path;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index e4f9bd4..e8f6cc5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6101,7 +6101,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
 			 */
 			if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 			{
-if (path != linitial(partially_grouped_rel->pathlist))
+if (path != partially_grouped_rel->cheapest_total_path)
 	continue;
 path = (Path *) create_sort_path(root,
  grouped_rel,

Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-14 Thread Jeevan Chalke
On Wed, Feb 14, 2018 at 12:17 PM, Rafia Sabih <rafia.sa...@enterprisedb.com>
wrote:

> On Tue, Feb 13, 2018 at 6:21 PM, Jeevan Chalke <
> jeevan.cha...@enterprisedb.com> wrote:
>
>>
>> I see that partition-wise aggregate plan too uses parallel index, am I
>> missing something?
>>
>>
> You're right, I missed that, oops.
>
>>
>>> Q18 takes some 390 secs with patch and some 147 secs without it.
>>>
>>
>> This looks strange. This patch set does not touch parallel or sequential
>> scans as such, so I am not sure why this is happening. The explain plans
>> of all three queries show much higher execution times for the
>> parallel/seq scans.
>>
>> Yeah strange it is.
>

Off-list I have asked Rafia to give me access to the perf machine where she
is doing this benchmarking, so I could see what's going wrong.
Thanks, Rafia, for the details.

What I have observed is that there are two source trees, one with HEAD and
the other with HEAD+PWA; however, the configuration switches were different.
The HEAD+PWA tree was configured with CFLAGS="-ggdb3 -O0" CXXFLAGS="-ggdb3
-O0" in addition to the switches used for the other tree, i.e. HEAD+PWA was
built with debugging enabled and optimization disabled, which accounts for
the slowness.

I have run EXPLAIN for these three queries on both source trees configured
with exactly the same switches, and I see no slowness with the PWA patch
set.

Thus, it would be good if you re-ran the benchmark with the same
configuration switches on both source trees and shared the results.

Thanks



> However, do you see similar behaviour with patches applied,
>> "enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?
>>
>
> I tried that for query 18, with patch and  enable_partition_wise_agg =
> off, query completes in some 270 secs. You may find the explain analyse
> output for it in the attached file. I noticed that on head the query plan
> had parallel hash join however with patch and no partition-wise agg it is
> using nested loop joins. This might be the issue.
>
>>
>> Also, does rest of the queries perform better with partition-wise
>> aggregates?
>>
>>
> As far as this setting goes, there wasn't any other query using
> partition-wise-agg, so, no.
>
> BTW, just an FYI, this experiment is on scale factor 20.
>
> --
> Regards,
> Rafia Sabih
> EnterpriseDB: http://www.enterprisedb.com/
>




Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-13 Thread Jeevan Chalke
On Tue, Feb 13, 2018 at 12:37 PM, Rafia Sabih <rafia.sa...@enterprisedb.com>
wrote:

>
> I was testing this patch for TPC-H benchmarking and came across following
> results,
>

Thanks Rafia for testing this with TPC-H benchmarking.


>
> Q1 completes in 229 secs with patch and in 66 secs without it. It looks
> like with this patch the time of parallel seq scan itself is elevated for
> some of the partitions. Notice for partitions, lineitem_3, lineitem_7,
> lineitem_10, and linietem_5 it is some 13 secs which was somewhere around 5
> secs on head.
>

> Q6 completes in some 7 secs with patch and it takes 4 secs without it.
> This is mainly caused because with the new parallel append, the parallel
> operator below it (parallel index scan in this case) is not used, however,
> on head it was the append of all the parallel index scans, which was saving
> quite some time.
>

I see that partition-wise aggregate plan too uses parallel index, am I
missing something?


>
> Q18 takes some 390 secs with patch and some 147 secs without it.
>

This looks strange. This patch set does not touch parallel or sequential
scans as such, so I am not sure why this is happening. The explain plans of
all three queries show much higher execution times for the parallel/seq
scans.

However, do you see similar behaviour with patches applied,
"enable_partition_wise_agg = on" and "enable_partition_wise_agg = off" ?

Also, do the rest of the queries perform better with partition-wise
aggregates?


>
> The experimental setup for these tests is as follows,
> work_mem = 500MB
> shared_buffers = 10GB
> effective_cache_size = 4GB
> seq_page_cost = random+page_cost = 0.01
> enable_partition_wise_join = off
>
> Partitioning info:
> Total 10 partitions on tables - lineitem and orders each with partitioning
> key being l_orderkey and o_orderkey respectively.
>
> Please find the attached file for explain analyse outputs of each of the
> reported query.





Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-08 Thread Jeevan Chalke
Hi,

In this attached version, I have rebased my changes over the new design of
partially_grouped_rel. The preparatory changes adding partially_grouped_rel
are in 0001.

Also, to minimize duplication of the finalization code, I have refactored it
into two separate functions, finalize_sorted_partial_agg_path() and
finalize_hashed_partial_agg_path(). I needed two separate functions because
the current path creation order is:
Sort Agg Path
Sort Agg Path - Parallel Aware (finalization needed here)
Hash Agg Path
Hash Agg Path - Parallel Aware (finalization needed here)
If we club those finalizations together, the path creation order will
change, and that may result in changes to existing plans.
Let me know if that's OK; I can merge them together, although they are
distinct anyway. These changes are part of 0002.

0003 - 0006 are refactoring patches, as before.

0007 is the main patch per the new design. I have removed
create_partition_agg_paths() altogether, as the finalization code is now
reused. Also, I renamed preferFullAgg to forcePartialAgg, as we forcibly
need a partial path from the nested level if the parent is doing partial
aggregation. add_single_path_to_append_rel() no longer exists, and there is
also no need to pass OtherUpperPathExtraData to add_paths_to_append_rel().

0008 - 0009 contain the test case and postgres_fdw changes.

Please have a look at the new changes and let me know if I missed anything.

Thanks


On Fri, Feb 2, 2018 at 7:29 PM, Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Feb 2, 2018 at 8:25 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> >> The problem is that create_partition_agg_paths() is doing *exactly*
> >> same thing that add_paths_to_grouping_rel() is already doing inside
> >> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
> >> two copies of that code.  Both of those places expect to take a
> >> partial path that has been partially aggregated and produce a
> >> non-partial path that is fully aggregated.  We do not need or want two
> >> copies of that code.
> >
> > OK. Got it.
> >
> > Will try to find a common place for them and will also check how it goes
> > with your suggested design change.
> >
> >> Here's another way to look at it.  We have four kinds of things.
> >>
> >> 1. Partially aggregated partial paths
> >> 2. Partially aggregated non-partial paths
> >> 3. Fully aggregated partial paths
> >> 4. Fully aggregated non-partial paths
>
> So in the new scheme I'm proposing, you've got a partially_grouped_rel
> and a grouped_rel.  So all paths of type #1 go into
> partially_grouped_rel->partial_pathlist, paths of type #2 go into
> partially_grouped_rel->pathlist, type #3 (if we have any) goes into
> grouped_rel->partial_pathlist, and type #4 goes into
> grouped_rel->pathlist.
>
> > add_paths_to_append_rel() -> generate_mergeappend_paths() does not
> consider
> > partial_pathlist. Thus we will never see MergeAppend over parallel scan
> > given by partial_pathlist. And thus plan like:
> > -> Gather Merge
> >   -> MergeAppend
> > is not possible with current HEAD.
> >
> > Are you suggesting we should implement that here? I think that itself is
> a
> > separate task.
>
> Oh, I didn't realize that wasn't working already.  I agree that it's a
> separate task from this patch, but it's really too bad that it doesn't
> already work.
>





partition-wise-agg-v13.tar.gz
Description: GNU Zip compressed data


Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-02 Thread Jeevan Chalke
On Fri, Feb 2, 2018 at 1:41 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Thu, Feb 1, 2018 at 8:59 AM, Jeevan Chalke
> <jeevan.cha...@enterprisedb.com> wrote:
> > I wrote a patch for this (on current HEAD) and attached separately here.
> > Please have a look.
>
> Yes, this is approximately what I had in mind, though it needs more
> work (e.g. it doesn't remove the clearing of the grouped_rel's
> partial_pathlist, which should no longer be necessary; also, it needs
> substantial comment updates).
>

That was just a quick patch to check that this is what you meant.
Yes, it needs some more work as you suggested, plus comment updates.


>
> >> 1. Parallel partition-wise aggregate, grouping key doesn't contain
> >> partition key.  This should just be a matter of adding additional
> >> Append paths to partially_grouped_rel->partial_pathlist.  The existing
> >> code already knows how to stick a Gather and FinalizeAggregate step on
> >> top of that, and I don't see why that logic would need any
> >> modification or addition.  An Append of child partial-grouping paths
> >> should be producing the same output as a partial grouping of an
> >> Append, except that the former case might produce more separate groups
> >> that need to be merged; but that should be OK: we can just throw all
> >> the paths into the same path list and let the cheapest one win.
> >
> > For any partial aggregation we need to add a finalization step after we
> > are done with the Append, i.e. post add_paths_to_append_rel(). Given
> > that, we need to replicate the logic of sticking the Gather and
> > FinalizeAggregate steps at a later stage. This is exactly what is done
> > in create_partition_agg_paths(). Am I missing something here?
>
> The problem is that create_partition_agg_paths() is doing *exactly*
> same thing that add_paths_to_grouping_rel() is already doing inside
> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
> two copies of that code.  Both of those places expect to take a
> partial path that has been partially aggregated and produce a
> non-partial path that is fully aggregated.  We do not need or want two
> copies of that code.
>

OK. Got it.

Will try to find a common place for them and will also check how it goes
with your suggested design change.


>
> Here's another way to look at it.  We have four kinds of things.
>
> 1. Partially aggregated partial paths
> 2. Partially aggregated non-partial paths
> 3. Fully aggregated partial paths
> 4. Fully aggregated non-partial paths
>
> The current code only ever generates paths of type #1 and #4; this
> patch will add paths of type #2 as well, and maybe also type #3.  But
> the way you've got it, the existing paths of type #1 go into the
> grouping_rel's partial_pathlist, and the new paths of type #1 go into
> the OtherUpperPathExtraData's partial_paths list.  Maybe there's a
> good reason why we should keep them separate, but I'm inclined to
> think they should all be going into the same list.
>

The new paths are specific to partition-wise aggregates, and I thought it
better to keep them separate, without interfering with grouped_rel's
pathlist/partial_pathlist. And, as you said, I didn't find a better place
than its own structure.


> >> Overall, what I'm trying to argue for here is making this feature look
> >> less like its own separate thing and more like part of the general
> >> mechanism we've already got: partial paths would turn into regular
> >> paths via generate_gather_paths(), and partially aggregated paths
> >> would turn into fully aggregated paths by adding FinalizeAggregate.
> >> The existing special case that allows us to build a non-partial, fully
> >> aggregated path from a partial, partially-aggregated path would be
> >> preserved.
> >>
> >> I think this would probably eliminate some other problems I see in the
> >> existing design as well.  For example, create_partition_agg_paths
> >> doesn't consider using Gather Merge, but that might be a win.
> >
> > An Append path is always non-sorted and has no pathkeys. Thus a Gather
> > Merge over an Append path seems infeasible, doesn't it?
>
> We currently never generate an Append path with pathkeys, but we do
> generate MergeAppend paths with pathkeys, as in the following example:
>
> rhaas=# create table foo (a int, b text) partition by range (a);
> CREATE TABLE
> rhaas=# create index on foo (a);
> CREATE INDEX
> rhaas=# create table foo1 partition of foo for values from (0) to
> (100);
> CREATE TABLE
> rhaas=# create table foo2 partition of foo for values from (100)
> to (2
