Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Tue, Oct 13, 2020 at 11:05 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> >> It is possible that the MAXALIGN stuff and/or the background
> >> transaction stuff is playing a role here. I think if we go with the idea
> >> of testing that spill_txns and spill_count are positive then the results
> >> will be stable. I'll write a patch for that.
>
> Here's our first failure on a MAXALIGN-8 machine:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-10-13%2005%3A00%3A08
>
> So this is just plain not stable.  It is odd though.  I can
> easily think of mechanisms that would cause the WAL volume
> to occasionally be *more* than the "typical" case.  What
> would cause it to be *less*, if MAXALIGN is ruled out?
>

The original theory I gave above [1] is an interleaved autovacuum
transaction. Let me try to explain in a bit more detail.
Say that while transaction T-1 is performing the Insert ('INSERT INTO stats_test
SELECT 'serialize-topbig--1:'||g.i FROM generate_series(1, 5000)
g(i);'), a parallel autovacuum transaction occurs. The problem as seen
in the buildfarm will happen when the autovacuum transaction happens
after 80% or more of the Insert is done.

In such a situation we will start decoding the 'Insert' first and will
need to spill multiple times due to the amount of changes (more than the
logical_decoding_work_mem threshold). Then, before we encounter the
Commit of the transaction that performed the Insert (and probably some
more changes from that transaction), we will encounter a small
transaction (the autovacuum transaction).  Decoding that small
transaction will send the stats collected so far, which leads to the
problem shown in the buildfarm.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1Jo0U1oSJyxrdA7i-bOOTh0Hue-NQqdG-CEqwGtDZPjyw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Tom Lane
Amit Kapila  writes:
>> It is possible that the MAXALIGN stuff and/or the background
>> transaction stuff is playing a role here. I think if we go with the idea
>> of testing that spill_txns and spill_count are positive then the results
>> will be stable. I'll write a patch for that.

Here's our first failure on a MAXALIGN-8 machine:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grison&dt=2020-10-13%2005%3A00%3A08

So this is just plain not stable.  It is odd though.  I can
easily think of mechanisms that would cause the WAL volume
to occasionally be *more* than the "typical" case.  What
would cause it to be *less*, if MAXALIGN is ruled out?

regards, tom lane




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Tue, Oct 13, 2020 at 10:33 AM Amit Kapila  wrote:
>
> On Tue, Oct 13, 2020 at 10:21 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > On Tue, Oct 13, 2020 at 9:25 AM Tom Lane  wrote:
> > >> It's not very clear what spill_count actually counts (and the
> > >> documentation sure does nothing to clarify that), but if it has anything
> > >> to do with WAL volume, the explanation might be that florican is 32-bit.
> > >> All the animals that have passed that test so far are 64-bit.
> >
> > prairiedog just failed in not-quite-the-same way, which reinforces the
> > idea that this test is dependent on MAXALIGN, which determines physical
> > tuple size.  (I just checked the buildfarm, and the four active members
> > that report MAXALIGN 4 during configure are florican, lapwing, locust,
> > and prairiedog.  Not sure about the MSVC critters though.)  The
> > spill_count number is different though, so it seems that that may not
> > be the whole story.
> >
>
> It is possible that the MAXALIGN stuff and/or the background
> transaction stuff is playing a role here. I think if we go with the idea
> of testing that spill_txns and spill_count are positive then the results
> will be stable. I'll write a patch for that.
>

Please find the attached patch for the same. Additionally, I have
skipped empty xacts during decoding as background autovacuum
transactions can impact that count as well. I have done some minimal
testing with this. I'll do some more.
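
For context, test_decoding already has a 'skip-empty-xacts' option, so the
test can simply pass it when consuming changes; roughly along these lines (a
sketch, not the attached patch):

-- Decode via test_decoding while ignoring empty transactions, such as the
-- ones produced by a concurrent autovacuum ('regression_slot' is the slot
-- used elsewhere in this test).
SELECT count(*) > 0
FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL,
                                  'include-xids', '0', 'skip-empty-xacts', '1');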

-- 
With Regards,
Amit Kapila.


fix_stats_test_1.patch
Description: Binary data


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Tom Lane
I wrote:
> prairiedog just failed in not-quite-the-same way, which reinforces the
> idea that this test is dependent on MAXALIGN, which determines physical
> tuple size.  (I just checked the buildfarm, and the four active members
> that report MAXALIGN 4 during configure are florican, lapwing, locust,
> and prairiedog.  Not sure about the MSVC critters though.)  The
> spill_count number is different though, so it seems that that may not
> be the whole story.

Oh, and here comes lapwing:

- regression_slot |  1 |  12
+ regression_slot |  1 |  10

So if it weren't that prairiedog showed 11 not 10, we'd have a nice
neat it-depends-on-MAXALIGN theory.  As is, I'm not sure what all
is affecting it, though MAXALIGN sure seems to be a component.

(locust seems to be AWOL at the moment, so I'm not holding my breath
for that one to report in.)

regards, tom lane




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Tue, Oct 13, 2020 at 10:21 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > On Tue, Oct 13, 2020 at 9:25 AM Tom Lane  wrote:
> >> It's not very clear what spill_count actually counts (and the
> >> documentation sure does nothing to clarify that), but if it has anything
> >> to do with WAL volume, the explanation might be that florican is 32-bit.
> >> All the animals that have passed that test so far are 64-bit.
>
> prairiedog just failed in not-quite-the-same way, which reinforces the
> idea that this test is dependent on MAXALIGN, which determines physical
> tuple size.  (I just checked the buildfarm, and the four active members
> that report MAXALIGN 4 during configure are florican, lapwing, locust,
> and prairiedog.  Not sure about the MSVC critters though.)  The
> spill_count number is different though, so it seems that that may not
> be the whole story.
>

It is possible that the MAXALIGN stuff and/or the background transaction
stuff is playing a role here. I think if we go with the idea of
testing that spill_txns and spill_count are positive then the results
will be stable. I'll write a patch for that.

> > It is based on the size of the change. In this case, it is the size of
> > the tuples inserted. See ReorderBufferChangeSize() to see how we compute
> > the size of each change.
>
> I know I can go read the source code, but most users will not want to.
> Is the documentation in monitoring.sgml really sufficient?  If we can't
> explain this with more precision, is it really a number we want to expose
> at all?
>

This counter is important to give users an idea about the amount of
I/O we incur during decoding and to tune the logical_decoding_work_mem
GUC. So, I would prefer to improve the documentation for this
variable.
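
For example, a user could watch these counters while adjusting the GUC,
along these lines (just an illustration using the view columns shown in the
test above, not a documented recipe):

SHOW logical_decoding_work_mem;

-- A slot that keeps spilling suggests logical_decoding_work_mem is too small
-- for the workload; raising it trades memory for less decoding I/O.
SELECT name, spill_txns, spill_count
FROM pg_stat_replication_slots
WHERE spill_count > 0;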

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] Custom compression methods

2020-10-12 Thread Dilip Kumar
On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
 wrote:
>
> On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> >
> >> ...
> >
> >I have worked on this patch, so as discussed now I am maintaining the
> >preserved compression methods using dependency.  Still PRESERVE ALL
> >syntax is not supported, I will work on that part.
> >
>
> Cool, I'll take a look. What's your opinion on doing it this way? Do you
> think it's cleaner / more elegant, or is it something contrary to what
> the dependencies are meant to do?

I think this looks much cleaner.  Moreover, I feel that once we start
supporting the custom compression methods then we anyway have to
maintain the dependency so using that for finding the preserved
compression method is good option.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Tom Lane
Amit Kapila  writes:
> On Tue, Oct 13, 2020 at 9:25 AM Tom Lane  wrote:
>> It's not very clear what spill_count actually counts (and the
>> documentation sure does nothing to clarify that), but if it has anything
>> to do with WAL volume, the explanation might be that florican is 32-bit.
>> All the animals that have passed that test so far are 64-bit.

prairiedog just failed in not-quite-the-same way, which reinforces the
idea that this test is dependent on MAXALIGN, which determines physical
tuple size.  (I just checked the buildfarm, and the four active members
that report MAXALIGN 4 during configure are florican, lapwing, locust,
and prairiedog.  Not sure about the MSVC critters though.)  The
spill_count number is different though, so it seems that that may not
be the whole story.

> It is based on the size of the change. In this case, it is the size of
> the tuples inserted. See ReorderBufferChangeSize() to see how we compute
> the size of each change.

I know I can go read the source code, but most users will not want to.
Is the documentation in monitoring.sgml really sufficient?  If we can't
explain this with more precision, is it really a number we want to expose
at all?

regards, tom lane




Re: partition routing layering in nodeModifyTable.c

2020-10-12 Thread Amit Langote
On Tue, Oct 13, 2020 at 1:57 AM Heikki Linnakangas  wrote:
> On 12/10/2020 16:47, Amit Langote wrote:
> > On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas  wrote:
> >> 1. We have many different cleanup/close routines now:
> >> ExecCloseResultRelations, ExecCloseRangeTableRelations, and
> >> ExecCleanUpTriggerState. Do we need them all? It seems to me that we
> >> could merge ExecCloseRangeTableRelations() and
> >> ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
> >> relations that were opened for ResultRelInfos. They are always called
> >> together, except in afterTriggerInvokeEvents(). And in
> >> afterTriggerInvokeEvents() too, there would be no harm in doing both,
> >> even though we know there aren't any entries in the es_result_relations
> >> array at that point.
> >
> > Hmm, I find trigger result relations to behave differently enough to
> > deserve a separate function.  For example, unlike plan-specified
> > result relations, they don't point to range table relations and don't
> > open indices.  Maybe the name could be revisited, say,
> > ExecCloseTriggerResultRelations().
>
> Matter of perception I guess. I still prefer to club them together into
> one Close call. It's true that they're slightly different, but they're
> also pretty similar. And IMHO they're more similar than different.

Okay, fine with me.

> > Also, maybe call the other functions:
> >
> > ExecInitPlanResultRelationsArray()
> > ExecInitPlanResultRelation()
> > ExecClosePlanResultRelations()
> >
> > Thoughts?
>
> Hmm. How about initializing the array lazily, on the first
> ExecInitPlanResultRelation() call? It's not performance critical, and
> that way there's one fewer initialization function that you need to
> remember to call.

Agree that's better.

> It occurred to me that if we do that (initialize the array lazily),
> there's very little need for the PlannedStmt->resultRelations list
> anymore. It's only used in ExecRelationIsTargetRelation(), but if we
> assume that ExecRelationIsTargetRelation() is only called after InitPlan
> has initialized the result relation for the relation, it can easily
> check es_result_relations instead. I think that's a safe assumption.
> ExecRelationIsTargetRelation() is only used in FDWs, and I believe the
> FDWs initialization routine can only be called after ExecInitModifyTable
> has been called on the relation.
>
> The PlannedStmt->rootResultRelations field is even more useless.

I am very much tempted to remove those fields from PlannedStmt,
although I am concerned that the following now assumes that *all*
result relations are initialized in the executor initialization phase:

bool
ExecRelationIsTargetRelation(EState *estate, Index scanrelid)
{
    if (!estate->es_result_relations)
        return false;

    return estate->es_result_relations[scanrelid - 1] != NULL;
}

In the other thread [1], I am proposing that we initialize result
relations lazily, but the above will be a blocker to that.

> > Actually, maybe we don't need to be so paranoid about setting up
> > es_result_relations in worker.c, because none of the downstream
> > functionality invoked seems to rely on it, that is, no need to call
> > ExecInitResultRelationsArray() and ExecInitResultRelation().
> > ExecSimpleRelation* and downstream functionality assume a
> > single-relation operation and the ResultRelInfo is explicitly passed.
>
> Hmm, yeah, I like that. Similarly in ExecuteTruncateGuts(), there isn't
> actually any need to put the ResultRelInfos in the es_result_relations
> array.
>
> Putting all this together, I ended up with the attached. It doesn't
> include the subsequent commits in this patch set yet, for removal of
> es_result_relation_info et al.

Thanks.

+* We put the ResultRelInfos in the es_opened_result_relations list, even
+* though we don't have a range table and don't populate the
+* es_result_relations array.  That's a bit bogus, but it's enough to make
+* ExecGetTriggerResultRel() find them.
 */
estate = CreateExecutorState();
resultRelInfos = (ResultRelInfo *)
palloc(list_length(rels) * sizeof(ResultRelInfo));
resultRelInfo = resultRelInfos;
+   estate->es_result_relations = (ResultRelInfo **)
+   palloc(list_length(rels) * sizeof(ResultRelInfo *));

Maybe don't allocate es_result_relations here?

+/*
+ * Close all relations opened by ExecGetRangeTableRelation()
+ */
+void
+ExecCloseRangeTableRelations(EState *estate)
+{
+   int i;
+
+   for (i = 0; i < estate->es_range_table_size; i++)
    {
        if (estate->es_relations[i])
            table_close(estate->es_relations[i], NoLock);
    }

I think we have an optimization opportunity here (maybe as a separate
patch).  Why don't we introduce es_opened_relations?  That way, if
only a single or few of potentially 1000s relations in the range table
is/are opened, we don't needlessly loop over *all* relations here.
That can happen, for example, with a query where no partitions 

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Tue, Oct 13, 2020 at 9:25 AM Tom Lane  wrote:
>
> Amit Kapila  writes:
> > I have pushed this but it failed in one of the BF. See
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2020-10-13%2003%3A07%3A25
> > The failure is shown below and I am analyzing it. See, if you can
> > provide any insights.
>
> It's not very clear what spill_count actually counts (and the
> documentation sure does nothing to clarify that), but if it has anything
> to do with WAL volume, the explanation might be that florican is 32-bit.
> All the animals that have passed that test so far are 64-bit.
>

It is based on the size of the change. In this case, it is the size of
the tuples inserted. See ReorderBufferChangeSize() to see how we compute
the size of each change. Once the total_size of the changes crosses
logical_decoding_work_mem (64kB in this case), we will spill. So
'spill_count' is the number of times the size of the changes in that
transaction crossed the threshold, which led to a spill of the
corresponding changes.
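
To make that concrete, the test essentially forces this with one oversized
transaction (table and slot names as in the regression test quoted earlier):

-- ~5000 rows comfortably exceed the 64kB logical_decoding_work_mem, so
-- decoding this single transaction serializes to disk several times and
-- spill_count advances once per serialization.
INSERT INTO stats_test
SELECT 'serialize-topbig--1:' || g.i FROM generate_series(1, 5000) g(i);
SELECT count(*) FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL,
                                                  'skip-empty-xacts', '1');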


> > The reason for this problem could be that there is some transaction
> > (say by autovacuum) which happened interleaved with this transaction
> > and committed before this one.
>
> I can believe that idea too, but would it not have resulted in a
> diff in spill_txns as well?
>

We count 'spill_txns' once for a transaction that is ever spilled. I
think 'spill_txns' wouldn't vary for this particular test even if the
autovacuum transaction happens before the main transaction of the test,
because in that case wait_for_decode_stats won't finish until it sees
the main transaction ('spill_txns' won't be positive until then).

> In short, I'm not real convinced that a stable result is possible in this
> test.  Maybe you should just test for spill_txns and spill_count being
> positive.
>

Yeah, that seems like the best we can do here.
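
In the test that would mean asserting only the sign of the counters, e.g. (a
sketch of the direction, not the exact committed test):

-- Stable regardless of MAXALIGN or interleaved background transactions:
-- only check that spilling happened at all.
SELECT name, spill_txns > 0 AS spill_txns, spill_count > 0 AS spill_count
FROM pg_stat_replication_slots;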

-- 
With Regards,
Amit Kapila.




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Tom Lane
Amit Kapila  writes:
> I have pushed this but it failed in one of the BF. See
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2020-10-13%2003%3A07%3A25
> The failure is shown below and I am analyzing it. See, if you can
> provide any insights.

It's not very clear what spill_count actually counts (and the
documentation sure does nothing to clarify that), but if it has anything
to do with WAL volume, the explanation might be that florican is 32-bit.
All the animals that have passed that test so far are 64-bit.

> The reason for this problem could be that there is some transaction
> (say by autovacuum) which happened interleaved with this transaction
> and committed before this one.

I can believe that idea too, but would it not have resulted in a
diff in spill_txns as well?

In short, I'm not real convinced that a stable result is possible in this
test.  Maybe you should just test for spill_txns and spill_count being
positive.

regards, tom lane




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Tue, Oct 13, 2020 at 9:11 AM Amit Kapila  wrote:
>
> On Tue, Oct 13, 2020 at 9:02 AM Amit Kapila  wrote:
> >
> > On Tue, Oct 13, 2020 at 4:54 AM Masahiko Sawada
> >  wrote:
> > >
> > > Attached the updated version patch. Please review it.
> > >
> >
> > I have pushed this but it failed in one of the BF. See
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2020-10-13%2003%3A07%3A25
> >
> > The failure is shown below and I am analyzing it. See, if you can
> > provide any insights.
> >
> > @@ -58,7 +58,7 @@
> >  SELECT name, spill_txns, spill_count FROM pg_stat_replication_slots;
> >name   | spill_txns | spill_count
> >  -++-
> > - regression_slot |  1 |  12
> > + regression_slot |  1 |  10
> >  (1 row)
> >
> >  -- reset the slot stats, and wait for stats collector to reset
> > @@ -96,7 +96,7 @@
> >  SELECT name, spill_txns, spill_count FROM pg_stat_replication_slots;
> >name   | spill_txns | spill_count
> >  -++-
> > - regression_slot |  1 |  12
> > + regression_slot |  1 |  10
> >  (1 row)
> >
>
> The reason for this problem could be that there is some transaction
> (say by autovacuum) which happened interleaved with this transaction
> and committed before this one. Now during DecodeCommit of this
> background transaction, we will send the stats accumulated by that
> time which could lead to such a problem.
>

If this theory is correct then I think we can't rely on the
'spill_count' value. What do you think?

-- 
With Regards,
Amit Kapila.




Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-12 Thread Masahiko Sawada
On Tue, 13 Oct 2020 at 10:00, Kyotaro Horiguchi  wrote:
>
> At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada 
>  wrote in
> > On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi  
> > wrote:
> > >
> > > At Fri, 9 Oct 2020 02:33:37 +, "tsunakawa.ta...@fujitsu.com" 
> > >  wrote in
> > > > From: Masahiko Sawada 
> > > > > What about temporary network failures? I think there are users who
> > > > > don't want to give up resolving foreign transactions failed due to a
> > > > > temporary network failure. Or even they might want to wait for
> > > > > transaction completion until they send a cancel request. If we want to
> > > > > call the commit routine only once and therefore want FDW to retry
> > > > > connecting the foreign server within the call, it means we require all
> > > > > FDW implementors to write a retry loop code that is interruptible and
> > > > > ensures not to raise an error, which increases difficulty.
> > > > >
> > > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > > an unreliable network environment, the user might end up requiring
> > > > > every transaction to check the status of foreign transactions of the
> > > > > previous distributed transaction before starts. If we allow to do
> > > > > retry, I guess we ease that somewhat.
> > > >
> > > > OK.  As I said, I'm not against trying to cope with temporary network 
> > > > failure.  I just don't think it's mandatory.  If the network failure is 
> > > > really temporary and thus recovers soon, then the resolver will be able 
> > > > to commit the transaction soon, too.
> > >
> > > I may be missing something, though...
> > >
> > > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > > much. I think remote-commits should be performed before local commit
> > > passes the point-of-no-return and the v26-0002 actually places
> > > AtEOXact_FdwXact() before the critical section.
> > >
> >
> > So you're thinking the following sequence?
> >
> > 1. Prepare all foreign transactions.
> > 2. Commit the all prepared foreign transactions.
> > 3. Commit the local transaction.
> >
> > Suppose we have the backend process call the commit routine, what if
> > one of FDW raises an ERROR during committing the foreign transaction
> > after committing other foreign transactions? The transaction will end
> > up with an abort but some foreign transactions are already committed.
>
> Ok, I understand what you are aiming at.
>
> It is apparently outside the scope of the two-phase commit
> protocol. Each FDW server can try to keep the contract as far as its
> ability reaches, but in the end that kind of failure is
> inevitable. Even if we require FDW developers not to respond until a
> 2pc-commit succeeds, that just leads the whole FDW cluster to freeze,
> even in cases that are not extremely bad.
>
> We have no choice other than shutting the server down (the subsequent
> server start then removes the garbage commits) or continuing to work
> while leaving some information in system storage (or reverting the
> garbage commits). What we can do in that case is to provide an
> automated way to resolve the inconsistency.
>
> > Also, what if the backend process failed to commit the local
> > transaction? Since it already committed all foreign transactions it
> > cannot ensure the global atomicity in this case too. Therefore, I
> > think we should commit the distributed transactions in the following
> > sequence:
>
> Ditto. It's outside the range of 2pc. Using 2pc for the local
> transaction could reduce that kind of failure, but I'm not sure. 3pc,
> 4pc, ... npc could reduce the probability but can't eliminate failure cases.

IMO the problems I mentioned arise from the fact that the above
sequence doesn't really follow the 2pc protocol in the first place.

We can think of the fact that we commit the local transaction without
preparation while preparing foreign transactions as using 2pc with the
last resource transaction optimization (or last agent optimization) [1].
That is, we prepare all foreign transactions first and the local node
is always the last resource to process. At this point, the outcome of
the distributed transaction completely depends on the fate of the last
resource (i.e., the local transaction). If it fails, the distributed
transaction must be aborted by rolling back the prepared foreign
transactions. OTOH, if it succeeds, all prepared foreign transactions
must be committed. Therefore, we don't need to prepare the last
resource and can simply commit it. In this way, if we want to commit
the local transaction without preparation, the local transaction must
be committed last. But since the above sequence doesn't follow this
protocol, we will have such problems. I think if we follow 2pc
properly, such basic failures don't happen.

>
> > 1. Prepare all foreign transactions.
> > 2. Commit the local transaction.
> > 3. Commit the all prepared foreign transactions.
> >
> > But this is still not a perfect solution. If we have the backend

Re: Add a description to the documentation that toast_tuple_target affects "Main"

2020-10-12 Thread Kasahara Tatsuhito
Hi,

On Fri, Oct 9, 2020 at 5:44 PM Shinya Okano  wrote:
>
> Hi,
>
> Regarding the toast_tuple_target parameter of CREATE TABLE, the
> documentation says that it only affects External or Extended, but it
> actually affects the compression of Main as well.
>
> The attached patch modifies the document to match the actual behavior.
+1

Best regards,

-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-12 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 


(1)
> Alright. I also removed nTotalBlocks in v24-0003 patch.
> 
> for (i = 0; i < nforks; i++)
> {
> if (nForkBlocks[i] != InvalidBlockNumber &&
> nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
> {
> Optimization loop
> }
> else
> break;
> }
> if (i >= nforks)
> return;
> { usual buffer invalidation process }

Why do you do it this way?  I think the previous patch was more correct (while 
I agree with Horiguchi-san that nTotalBlocks may be unnecessary).  What you 
want to do is "if the size of any fork could be inaccurate, do the traditional 
full buffer scan without performing any optimization for any fork," right?  But 
the above code performs the optimization for forks until it finds a fork with 
an inaccurate size.

(2)
+* Get the total number of cached blocks and to-be-invalidated blocks
+* of the relation.  The cached value returned by smgrnblocks could be
+* smaller than the actual number of existing buffers of the file.

As you changed the meaning of the smgrnblocks() argument from cached to 
accurate, and you no longer calculate the total blocks, the comment should 
reflect that.


(3)
In smgrnblocks(), accurate is not set to false when mdnblocks() is called.  The 
caller doesn't initialize the value either, so it can see a garbage value.


(4)
+   if (nForkBlocks[i] != InvalidBlockNumber &&
+   nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
+   {
...
+   }
+   else
+   break;
+   }

In cases like this, it's better to reverse the if and else.  Thus, you can 
reduce the nest depth.


 Regards
Takayuki Tsunakawa





Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-12 Thread Kyotaro Horiguchi
At Fri, 9 Oct 2020 21:45:57 +0900, Masahiko Sawada 
 wrote in 
> On Fri, 9 Oct 2020 at 14:55, Kyotaro Horiguchi  
> wrote:
> >
> > At Fri, 9 Oct 2020 02:33:37 +, "tsunakawa.ta...@fujitsu.com" 
> >  wrote in
> > > From: Masahiko Sawada 
> > > > What about temporary network failures? I think there are users who
> > > > don't want to give up resolving foreign transactions failed due to a
> > > > temporary network failure. Or even they might want to wait for
> > > > transaction completion until they send a cancel request. If we want to
> > > > call the commit routine only once and therefore want FDW to retry
> > > > connecting the foreign server within the call, it means we require all
> > > > FDW implementors to write a retry loop code that is interruptible and
> > > > ensures not to raise an error, which increases difficulty.
> > > >
> > > > Yes, but if we don’t retry to resolve foreign transactions at all on
> > > > an unreliable network environment, the user might end up requiring
> > > > every transaction to check the status of foreign transactions of the
> > > > previous distributed transaction before starts. If we allow to do
> > > > retry, I guess we ease that somewhat.
> > >
> > > OK.  As I said, I'm not against trying to cope with temporary network 
> > > failure.  I just don't think it's mandatory.  If the network failure is 
> > > really temporary and thus recovers soon, then the resolver will be able 
> > > to commit the transaction soon, too.
> >
> > I may be missing something, though...
> >
> > I don't understand why we hate ERRORs from fdw-2pc-commit routine so
> > much. I think remote-commits should be performed before local commit
> > passes the point-of-no-return and the v26-0002 actually places
> > AtEOXact_FdwXact() before the critical section.
> >
> 
> So you're thinking the following sequence?
> 
> 1. Prepare all foreign transactions.
> 2. Commit the all prepared foreign transactions.
> 3. Commit the local transaction.
> 
> Suppose we have the backend process call the commit routine, what if
> one of FDW raises an ERROR during committing the foreign transaction
> after committing other foreign transactions? The transaction will end
> up with an abort but some foreign transactions are already committed.

Ok, I understand what you are aiming at.

It is apparently outside the scope of the two-phase commit
protocol. Each FDW server can try to keep the contract as far as its
ability reaches, but in the end that kind of failure is
inevitable. Even if we require FDW developers not to respond until a
2pc-commit succeeds, that just leads the whole FDW cluster to freeze,
even in cases that are not extremely bad.

We have no choice other than shutting the server down (the subsequent
server start then removes the garbage commits) or continuing to work
while leaving some information in system storage (or reverting the
garbage commits). What we can do in that case is to provide an
automated way to resolve the inconsistency.

> Also, what if the backend process failed to commit the local
> transaction? Since it already committed all foreign transactions it
> cannot ensure the global atomicity in this case too. Therefore, I
> think we should commit the distributed transactions in the following
> sequence:

Ditto. It's outside the range of 2pc. Using 2pc for the local
transaction could reduce that kind of failure, but I'm not sure. 3pc,
4pc, ... npc could reduce the probability but can't eliminate failure cases.

> 1. Prepare all foreign transactions.
> 2. Commit the local transaction.
> 3. Commit the all prepared foreign transactions.
> 
> But this is still not a perfect solution. If we have the backend

2pc is not a perfect solution in the first place. Attaching a similar
phase to it cannot make it "perfect".

> process call the commit routine and an error happens during executing
> the commit routine of an FDW (i.g., at step 3) it's too late to report
> an error to the client because we already committed the local
> transaction. So the current solution is to have a background process
> commit the foreign transactions so that the backend can just wait
> without the possibility of errors.

Whichever process tries to complete a transaction, the client must wait
for the transaction to end, and anyway that's just a freeze from the
client's view, unless you intend to respond at local commit before
all participants complete.

I don't think most client applications would wait for a frozen
server forever.  We have the same issue when the client decides
to give up the transaction, or when the leader session is killed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: BUG #15858: could not stat file - over 4GB

2020-10-12 Thread Michael Paquier
On Mon, Oct 12, 2020 at 11:13:38AM -0400, Tom Lane wrote:
> Juan José Santamaría Flecha wrote:
>> Uhm, a good question indeed, forcing errno serves no purpose there.
> 
> OK, changed.

Thanks!
--
Michael


signature.asc
Description: PGP signature


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Masahiko Sawada
(Please avoid top-posting)

On Mon, 12 Oct 2020 at 23:45, Shinoda, Noriyoshi (PN Japan A
Delivery)  wrote:
>
> Sawada-san, Thank you your comments.
>
> The attached patch reflects the comment.
> I also made a fix for the regression test.
>
> Regards,
> Noriyoshi Shinoda
>
> -Original Message-
> From: Masahiko Sawada [mailto:masahiko.saw...@2ndquadrant.com]
> Sent: Monday, October 12, 2020 8:12 PM
> To: Shinoda, Noriyoshi (PN Japan A Delivery) 
> Cc: Amit Kapila ; Dilip Kumar 
> ; Magnus Hagander ; Tomas Vondra 
> ; PostgreSQL Hackers 
> ; Ajin Cherian 
> Subject: Re: Resetting spilled txn statistics in pg_stat_replication
>
> On Mon, 12 Oct 2020 at 18:29, Shinoda, Noriyoshi (PN Japan A
> Delivery)  wrote:
> >
> > Hi, thank you for the awesome feature.
> >
>
> Thank you for reporting!
>
> > As it may have been discussed, I think the 'name' column in 
> > pg_stat_replication_slots is more consistent with the column name and data 
> > type matched to the pg_replication_slots catalog.
> > The attached patch changes the name and data type of the 'name' column to 
> > slot_name and 'name' type, respectively.
>
> It seems a good idea to me. In other system views, we use the name data type 
> for object name. When I wrote the first patch, I borrowed the code for 
> pg_stat_slru which uses text data for the name but I think it's an oversight.

Hmm, my above observation is wrong. All other statistics views use the
text data type and internally use char[NAMEDATALEN]. So I think renaming
to 'slot_name' would be a good idea, but we probably don't need to change
the internally used data type. As for the data type of slot_name in the
pg_stat_replication_slots view, given that the docs say the
following [1], I think we can keep it as text too, since this view is not
a system catalog. What do you think?

8.3. Character Types:
The name type exists only for the storage of identifiers in the
internal system catalogs

[1] https://www.postgresql.org/docs/devel/datatype-character.html

Regards,

--
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Masahiko Sawada
On Mon, 12 Oct 2020 at 19:24, Amit Kapila  wrote:
>
> On Mon, Oct 12, 2020 at 10:59 AM Masahiko Sawada
>  wrote:
> >
> > On Thu, 8 Oct 2020 at 17:59, Amit Kapila  wrote:
> > >
> > > On Thu, Oct 8, 2020 at 1:55 PM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Thu, 8 Oct 2020 at 14:10, Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > We can write if we want but there are few things we need to do for
> > > > > that like maybe a new function like wait_for_spill_stats which will
> > > > > check if the counters have become zero. Then probably call a reset
> > > > > function, call a new wait function, and then again check stats to
> > > > > ensure they are reset to 0.
> > > >
> > > > Yes.
> > > >
> > >
> > > I am not sure if it is worth but probably it is not a bad idea
> > > especially if we extend the existing tests based on your below idea?
> > >
> > > > > We can't write any advanced test which means reset the existing stats
> > > > > perform some tests and again check stats because *slot_get_changes()
> > > > > function can start from the previous WAL for which we have covered the
> > > > > stats. We might write that if we can somehow track the WAL positions
> > > > > from the previous test. I am not sure if we want to go there.
> > > >
> > > > Can we use pg_logical_slot_peek_changes() instead to decode the same
> > > > transactions multiple times?
> > > >
> > >
> > > I think this will do the trick. If we want to go there then I suggest
> > > we can have a separate regression test file in test_decoding with name
> > > as decoding_stats, stats, or something like that. We can later add the
> > > tests related to streaming stats in that file as well.
> > >
> >
> > Agreed.
> >
> > I've updated the patch. Please review it.
> >
>
> Few comments:
> =

Thank you for your review.

> 1.
> +-- function to wait for counters to advance
> +CREATE FUNCTION wait_for_spill_stats(check_reset bool) RETURNS void AS $$
>
> Can we rename this function to wait_for_decode_stats? I am thinking we
> can later reuse this function for streaming stats as well by passing
> the additional parameter 'stream bool'.

+1. Fixed.

>
> 2. let's drop the table added by this test and regression_slot at the
> end of the test.

Fixed.

Attached the updated version patch. Please review it.
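
For reference, the renamed wait function is roughly of this shape (a sketch
using the names discussed above; the details differ from the attached patch):

-- Poll the stats view until the collector has caught up (or, with
-- check_reset, until the counters have been reset to zero).
CREATE FUNCTION wait_for_decode_stats(check_reset bool) RETURNS void AS $$
DECLARE
    updated bool;
BEGIN
    FOR i IN 1 .. 300 LOOP
        SELECT CASE WHEN check_reset THEN spill_txns = 0 AND spill_count = 0
                    ELSE spill_txns > 0 AND spill_count > 0 END
          INTO updated
          FROM pg_stat_replication_slots
         WHERE name = 'regression_slot';

        EXIT WHEN updated;
        PERFORM pg_sleep(0.1);
    END LOOP;
END;
$$ LANGUAGE plpgsql;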

Regards,

--
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v11-0001-Test-stats.patch
Description: Binary data


Re: Assertion failure with barriers in parallel hash join

2020-10-12 Thread Thomas Munro
On Tue, Oct 13, 2020 at 12:15 PM Melanie Plageman
 wrote:
> On Thu, Oct 1, 2020 at 8:08 PM Thomas Munro  wrote:
>> On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro  wrote:
>> Here's a throw-away patch to add some sleeps that trigger the problem,
>> and a first draft fix.  I'll do some more testing of this next week
>> and see if I can simplify it.
>
> I was just taking a look at the patch and noticed the commit message
> says:
>
> > With unlucky timing and parallel_leader_participation off...
>
> Is parallel_leader_participation being off required to reproduce the
> issue?

Yeah, because otherwise the leader detaches last so the problem doesn't arise.




Re: Assertion failure with barriers in parallel hash join

2020-10-12 Thread Melanie Plageman
On Thu, Oct 1, 2020 at 8:08 PM Thomas Munro  wrote:

> On Tue, Sep 29, 2020 at 9:12 PM Thomas Munro 
> wrote:
> > On Tue, Sep 29, 2020 at 7:11 PM Michael Paquier 
> wrote:
> > > #2  0x009027d2 in ExceptionalCondition
> > > (conditionName=conditionName@entry=0xa80846 "!barrier->static_party",
> >
> > > #4  0x00682ebf in ExecParallelHashJoinNewBatch
> >
> > Thanks.  Ohhh.  I think I see how that condition was reached and what
> > to do about it, but I'll need to look more closely.  I'm away on
> > vacation right now, and will update in a couple of days when I'm back
> > at a real computer.
>
> Here's a throw-away patch to add some sleeps that trigger the problem,
> and a first draft fix.  I'll do some more testing of this next week
> and see if I can simplify it.
>

I was just taking a look at the patch and noticed the commit message
says:

> With unlucky timing and parallel_leader_participation off...

Is parallel_leader_participation being off required to reproduce the
issue?


[patch] [doc] Clarify that signal functions have no feedback

2020-10-12 Thread David G. Johnston
Hackers,

Over in Bug# 16652 [1] Christoph failed to recognize that signal-sending
functions are inherently one-way, just as signals are.  It seems
worth heading off this situation in the future by making it clear how
signals behave and, in the specific case of pg_reload_conf, that the
important feedback one would hope to get out of a success/failure response
from the function call must instead be found in other locations.

Please see the attached patch, included inline as well.

David J.

[1]
https://www.postgresql.org/message-id/flat/16652-58dd6028047058a6%40postgresql.org

commit 6f0ba7c8fd131c906669882e4402930e548e4522
Author: David G. Johnston 
Date:   Mon Oct 12 22:35:38 2020 +

Clarify that signal functions have no feedback

Bug# 16652 complains that the definition of success for pg_reload_conf
doesn't include the outcome of actually reloading the configuration
files.  While this is a fairly easy gap to cross given knowledge of
signals, being more explicit here doesn't hurt.

Additionally, because of the special nature of pg_reload_conf, add
links to the various locations where information related to the
success or failure of a reload can be found.  Lacking an existing
holistic location in the documentation to point the reader just
list the three resources explicitly.

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index e7cff980dd..75ff8acc93 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -23927,7 +23927,8 @@ SELECT collation for ('foo' COLLATE "de_DE");


 The functions shown in  send control signals to
+linkend="functions-admin-signal-table"/> send uni-directional
+control signals to
 other server processes.  Use of these functions is restricted to
 superusers by default but access may be granted to others using
 GRANT, with noted exceptions.
@@ -23935,7 +23936,8 @@ SELECT collation for ('foo' COLLATE "de_DE");


 Each of these functions returns true if
-successful and false otherwise.
+the signal was successfully sent and false
+if the sending of the signal failed.



@@ -23983,7 +23985,14 @@ SELECT collation for ('foo' COLLATE "de_DE");
 server to reload their configuration files.  (This is initiated by
 sending a SIGHUP signal to the postmaster
 process, which in turn sends SIGHUP to
each
-of its children.)
+of its children.)  Inspection of the
+log file,
+pg_file_settings view,
+and the
+pg_settings view,
+is recommended before and/or after executing
+this function to detect whether there are any issues in the configuration
+files preventing some or all of their setting changes from taking effect.

   


v1-doc-pg-reload-conf-signaling.patch
Description: Binary data
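
As a usage illustration of the places the patch points readers to (these are
existing views and functions; nothing below is added by the patch):

-- pg_reload_conf() only reports that SIGHUP was sent; whether the new values
-- actually took effect has to be checked separately.
SELECT pg_reload_conf();

-- Parse/validation problems in the configuration files show up here.
SELECT name, sourcefile, sourceline, error
FROM pg_file_settings
WHERE error IS NOT NULL;

-- And pg_settings shows whether a changed value is still pending a restart.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE pending_restart;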


[patch] [doc] Further note required activity aspect of automatic checkpoint and archving

2020-10-12 Thread David G. Johnston
Hackers,

Over on -general [1] Robert Inder griped about the not-so-recent change to
our automatic checkpointing, and thus archiving, behavior, where
inactivity results in nothing happening.  In looking over the
documentation I felt a few changes could be made to increase the chance
that a reader learns this key dynamic.  Attached is a patch with those
changes, copied inline for ease of review.

commit 8af7f653907688252d8663a80e945f6f5782b0de
Author: David G. Johnston 
Date:   Mon Oct 12 21:32:32 2020 +

Further note required activity aspect of automatic checkpoint and
archiving

A few spots in the documentation could use a reminder that checkpoints
and archiving require that actual WAL records be written in order to
happen automatically.

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 42a8ed328d..c312fc9387 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -722,6 +722,8 @@ test ! -f
/mnt/server/archivedir/000100A90065  cp pg_wal/0
 short archive_timeout  it will bloat your
archive
 storage.  archive_timeout settings of a minute or
so are
 usually reasonable.
+This is mitigated by the fact that empty WAL segments will not be
archived
+even if the archive_timeout period has elapsed.



diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ee914740cc..306f78765c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3131,6 +3131,8 @@ include_dir 'conf.d'
   

 Maximum time between automatic WAL checkpoints.
+The automatic checkpoint will do nothing if no new WAL has been
+written since the last recorded checkpoint.
 If this value is specified without units, it is taken as seconds.
 The valid range is between 30 seconds and one day.
 The default is five minutes (5min).
@@ -3337,18 +3339,17 @@ include_dir 'conf.d'
   
   

+Force the completion of the current, non-empty, WAL segment when
+this amount of time (if non-zero) has elapsed since the last
+segment file switch.
 The  is only invoked for
 completed WAL segments. Hence, if your server generates little WAL
 traffic (or has slack periods where it does so), there could be a
 long delay between the completion of a transaction and its safe
 recording in archive storage.  To limit how old unarchived
 data can be, you can set archive_timeout to
force the
-server to switch to a new WAL segment file periodically.  When this
-parameter is greater than zero, the server will switch to a new
-segment file whenever this amount of time has elapsed since the
last
-segment file switch, and there has been any database activity,
-including a single checkpoint (checkpoints are skipped if there is
-no database activity).  Note that archived files that are closed
+server to switch to a new WAL segment file periodically.
+Note that archived files that are closed
 early due to a forced switch are still the same length as
completely
 full files.  Therefore, it is unwise to use a very short
 archive_timeout  it will bloat your
archive

David J.

[1]
https://www.postgresql.org/message-id/flat/CAKqjJm83gnw2u0ugpkgc4bq58L%3DcLwbvmh69TwKKo83Y1CnANw%40mail.gmail.com


v1-doc-automatic-checkpoint-and-archive-skips.patch
Description: Binary data
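
To illustrate the behavior being documented (existing functions and views
only; not part of the patch), archive_timeout can only force a switch when WAL
has actually been written, which one can observe like this:

-- Compare the insert position over time: if it does not advance, the cluster
-- is idle and no segment switch or archive call will be forced.
SELECT pg_current_wal_insert_lsn();

-- What has actually been handed to the archiver so far.
SELECT archived_count, last_archived_wal, last_archived_time
FROM pg_stat_archiver;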


Re: Deleting older versions in unique indexes to avoid page splits

2020-10-12 Thread Peter Geoghegan
On Mon, Oct 12, 2020 at 3:47 AM Andrey Borodin  wrote:
> The idea looks very interesting.
> It resembles GiST microvacuum: GiST tries to vacuum single page before split.

AFAICT the GiST microvacuum mechanism is based on the one in nbtree,
which is based on setting LP_DEAD bits when index scans find that the
TIDs are dead-to-all. That's easy to justify, because it is easy and
cheap to save future index scans the trouble of following the TIDs
just to discover the same thing for themselves.

The difference here is that we're simply making an intelligent guess
-- there have been no index scans, but we're going to do a kind of
special index scan at the last minute to see if we can avoid a page
split. I think that this is okay because in practice we can do it in a
reasonably targeted way. We will never do it with INSERTs, for example
(except in unique indexes, though only when we know that there are
duplicates because we saw them in _bt_check_unique()). In the worst
case we make non-HOT updates a bit slower...and I'm not even sure that
that's a bad thing when we don't manage to delete anything. After all,
non-HOT updates impose a huge cost on the system. They have "negative
externalities".

Another way to look at it is that the mechanism I propose to add takes
advantage of "convexity" [1]. (Actually, several of my indexing ideas
are based on similar principles -- like the nbtsplitloc.c stuff.)

Attached is v2. It controls the cost of visiting the heap by finding
the heap page that has the most TIDs that we might be able to kill
(among those that are duplicates on a leaf page). It also adds a
hinting mechanism to the executor to avoid uselessly applying the
optimization with INSERTs.

> I'm curious how the cost of page deduplication compares to the cost of a page split?
> Should we do deduplication if the page will still remain 99% full?

It depends on how you measure it, but in general I would say that the
cost of traditional Postgres 13 deduplication is much lower.
Especially as measured in WAL volume. I also believe that the same is
true for this new delete deduplication mechanism. The way we determine
which heap pages to visit maximizes the chances that we'll get lucky
while minimizing the cost (currently we don't visit more than 2 heap
pages unless we get at least one kill -- I think I could be more
conservative here without losing much). A page split also has to
exclusive lock two other pages (the right sibling page and parent
page), so even the locking is perhaps better.

The attached patch can completely or almost completely avoid index
bloat in extreme cases with non-HOT updates. This can easily increase
throughput by 2x or more, depending on how extreme you make it (i.e.
how many indexes you have). It seems like the main problem caused by
non-HOT updates is in fact page splits themselves. It is not so much a
problem with dirtying of pages.
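
To make the non-HOT update scenario concrete (made-up table and index names,
not the WARM benchmark referenced below):

-- Updating any indexed column defeats HOT, so each UPDATE inserts a new
-- index tuple into *all* indexes; with enough churn the leaf pages fill up
-- and, without something like this delete mechanism, split.
CREATE TABLE t (id int PRIMARY KEY, a int, b int);
CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);
INSERT INTO t SELECT g, g, g FROM generate_series(1, 100000) g;
UPDATE t SET a = a + 1;   -- non-HOT: an indexed column changes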

You can test this with a benchmark like the one that was used for WARM
back in 2017:

https://www.postgresql.org/message-id/flat/CABOikdMNy6yowA%2BwTGK9RVd8iw%2BCzqHeQSGpW7Yka_4RSZ_LOQ%40mail.gmail.com

I suspect that it's maybe even more effective than WARM was with this benchmark.

[1] https://fooledbyrandomness.com/ConvexityScience.pdf
-- 
Peter Geoghegan


v2-0001-Add-delete-deduplication-to-nbtree.patch
Description: Binary data


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Bruce Momjian
On Mon, Oct 12, 2020 at 04:07:30PM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Oct 12, 2020 at 02:26:15PM -0400, Tom Lane wrote:
> >> Yeah, I agree --- a version number is the wrong way to think about this.
> 
> > The version number was to invalidate _all_ query hashes if the
> > algorithm is slightly modified, rather than invalidating just some of
> > them, which could lead to confusion.
> 
> Color me skeptical as to the use-case for that.  From users' standpoints,
> the hash is mainly going to change when we change the set of parse node
> fields that get hashed.  Which is going to happen at every major release
> and no (or at least epsilon) minor releases.  So I do not see a point in
> tracking an algorithm version number as such.  Seems like make-work.

OK, I came up with the hash idea only to address one of your concerns
about mismatched hashes for algorithm improvements/changes.  Seems we
might as well just document that cross-version hashes are different.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: SEARCH and CYCLE clauses

2020-10-12 Thread Robert Haas
On Fri, May 22, 2020 at 5:25 AM Peter Eisentraut
 wrote:
> This is something we need to think about.  If we want to check at parse
> time whether the two values are not the same (and perhaps not null),
> then we either need to restrict the generality of what we can specify,
> or we need to be prepared to do full expression evaluation in the
> parser.  A simple and practical way might be to only allow string and
> boolean literal.  I don't have a strong opinion here.

I don't have an opinion on this feature, but I think doing expression
evaluation in the raw parser would be a pretty bad idea. I think we're
not supposed to do anything in the parser that involves catalog access
or even references to GUC values. It might be OK if it happens in
parse analysis rather than gram.y, but even that sounds a bit sketchy
to me. We'd need to think carefully about what effects such things
woul have on the plan cache, and whether they introduce any security
holes, and maybe some other things.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Tom Lane
Bruce Momjian  writes:
> On Mon, Oct 12, 2020 at 02:26:15PM -0400, Tom Lane wrote:
>> Yeah, I agree --- a version number is the wrong way to think about this.

> The version number was to invalidate _all_ query hashes if the
> algorithm is slightly modified, rather than invalidating just some of
> them, which could lead to confusion.

Color me skeptical as to the use-case for that.  From users' standpoints,
the hash is mainly going to change when we change the set of parse node
fields that get hashed.  Which is going to happen at every major release
and no (or at least epsilon) minor releases.  So I do not see a point in
tracking an algorithm version number as such.  Seems like make-work.

regards, tom lane




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Bruce Momjian
On Mon, Oct 12, 2020 at 02:26:15PM -0400, Tom Lane wrote:
> Robert Haas  writes:
> > I don't really understand how a version number helps. It's not like
> > there is going to be a v2 that is in all ways better than v1. If there
> > are different algorithms here, they are going to be customized for
> > different needs.
> 
> Yeah, I agree --- a version number is the wrong way to think about this.
> It's gonna be more like algorithm foo versus algorithm bar versus
> algorithm baz, where each one is better for a specific set of use-cases.
> Julien already noted the point about hashing object OIDs versus object
> names; one can easily imagine disagreeing with pg_stat_statement's
> choices about ignoring values of constants; other properties of statements
> might be irrelevant for some use-cases; and so on.

The version number was to invalidate _all_ query hashes if the
algorithm is slightly modified, rather than invalidating just some of
them, which could lead to confusion.  The idea of selectable hash
algorithms is nice if people feel there is sufficient need for that.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Tom Lane
Robert Haas  writes:
> I don't really understand how a version number helps. It's not like
> there is going to be a v2 that is in all ways better than v1. If there
> are different algorithms here, they are going to be customized for
> different needs.

Yeah, I agree --- a version number is the wrong way to think about this.
It's gonna be more like algorithm foo versus algorithm bar versus
algorithm baz, where each one is better for a specific set of use-cases.
Julien already noted the point about hashing object OIDs versus object
names; one can easily imagine disagreeing with pg_stat_statement's
choices about ignoring values of constants; other properties of statements
might be irrelevant for some use-cases; and so on.

I'm okay with moving pg_stat_statement's existing algorithm into core as
long as there's a way for extensions to override it.  With proper design,
that would allow extensions that do override it to coexist with
pg_stat_statements (thereby redefining the latter's idea of which
statements are "the same"), which is something that doesn't really work
nicely today.

regards, tom lane




Re: [PATCH] Add `truncate` option to subscription commands

2020-10-12 Thread David Christensen
> On Oct 11, 2020, at 10:00 PM, Amit Kapila  wrote:
> 
> On Mon, Oct 12, 2020 at 3:44 AM David Christensen  wrote:
>> 
>> 
>> On Oct 11, 2020, at 1:14 PM, Euler Taveira  
>> wrote:
>> 
>> 
>> On Fri, 9 Oct 2020 at 15:54, David Christensen  wrote:
>>> 
>>> 
>>> Enclosed find a patch to add a “truncate” option to subscription commands.
>>> 
>>> When adding new tables to a subscription (either via `CREATE SUBSCRIPTION` 
>>> or `REFRESH PUBLICATION`), tables on the target which are being newly 
>>> subscribed will be truncated before the data copy step.  This saves 
>>> explicit coordination of a manual `TRUNCATE` on the target tables and 
>>> allows the results of the initial data sync to be the same as on the 
>>> publisher at the time of sync.
>>> 
>>> To preserve compatibility with existing behavior, the default value for 
>>> this parameter is `false`.
>>> 
>> 
>> Truncate will fail for tables whose foreign keys refer to it. If such a 
>> feature cannot handle foreign keys, the usefulness will be restricted.
>> 
>> 
>> This is true for existing “truncate” with FKs, so doesn’t seem to be any 
>> different to me.
>> 
> 
> What would happen if there are multiple tables and truncate on only
> one of them failed due to FK check? Does it give an error in such a
> case, if so will the other tables be truncated?

Currently each SyncRep relation is sync’d separately in its own worker process; 
we are doing the truncate at the initialization step of this, so it’s 
inherently in its own transaction. I think if we are going to do any sort of 
validation on this, it would have to be at the point of the CREATE 
SUBSCRIPTION/REFRESH PUBLICATION where we have the relation list and can do 
sanity-checking there.

Obviously if someone changes the schema at some point between when it does this 
and when relation syncs start there is a race condition, but the same issue 
would affect other data sync things, so I don’t care to solve that as part of 
this patch.

>> Hypothetically if you checked all new tables and could verify if there were 
>> FK cycles only already in the new tables being added then “truncate cascade” 
>> would be fine. Arguably if they had existing tables that were part of an FK 
>> that wasn’t fully replicated they were already operating brokenly.
>> 
> 
> I think if both PK_table and FK_table are part of the same
> subscription then there should be a problem as both them first get
> truncated? If they are part of a different subscription (or if they
> are not subscribed due to whatever reason) then probably we need to
> deal such cases carefully.

You mean “should not be a problem” here?  If so, I agree with that.  Obviously 
if we determine this features is only useful with this support we’d have to 
chase the entire dependency graph and make sure that is all contained in the 
set of newly-subscribed tables (or at least FK referents).

I have not considered tables that are part of more than one subscription (is 
that possible?); we presumably should error out if any table exists already in 
a separate subscription, as we’d want to avoid truncating tables already part 
of an existing subscription.

While I’m happy to take a stab at fixing some of the FK/PK issues, it seems 
easy to go down a rabbit hole.  I’m not convinced that we couldn’t just detect 
FK issues and choose to not handle this case without decreasing the utility for 
at least some cases.  (Perhaps we could give a hint as to the issues detected 
to point someone in the right direction.)  Anyway, glad to keep discussing 
potential implications, etc.
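
For reference, the way plain TRUNCATE treats foreign keys today can be sketched as 
follows (hypothetical table names; any truncation done by the initial copy would 
presumably run into the same rules):

CREATE TABLE pk_tbl (id int PRIMARY KEY);
CREATE TABLE fk_tbl (pk_id int REFERENCES pk_tbl (id));

TRUNCATE pk_tbl;          -- ERROR: cannot truncate a table referenced in a foreign key constraint
TRUNCATE pk_tbl, fk_tbl;  -- ok: both sides truncated in the same command
TRUNCATE pk_tbl CASCADE;  -- ok: referencing tables are truncated as well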

Best,

David
--
David Christensen
Senior Software and Database Engineer
End Point Corporation
da...@endpoint.com
785-727-1171








Re: libpq debug log

2020-10-12 Thread Robert Haas
On Wed, Sep 16, 2020 at 4:41 PM Alvaro Herrera  wrote:
> 2020-09-16 13:27:29.072 -03  < ErrorResponse 117 S "ERROR" V "ERROR" C 
> "42704" M "no existe el slot de replicación «foobar»" F "slot.c" L "408" R 
> "ReplicationSlotAcquireInternal" ^@

Ooh, that looks really nice. The ^@ at the end seems odd, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-10-12 Thread Justin Pryzby
On Fri, Oct 09, 2020 at 07:42:51PM -0400, Bruce Momjian wrote:
> On Fri, Oct  9, 2020 at 02:23:10PM -0500, Justin Pryzby wrote:
> > In my local branch, I had revised this comment to say:
> > 
> > + * Note, v8.4 has no tablespace_suffix, which is fine so long as the 
> > version we
> > + * being upgraded *to* has a suffix, since it's not allowed to pg_upgrade 
> > from
> > + * a version to the same version if tablespaces are in use.
> 
> OK, updated patch attached.  Also, from your original patch, I didn't
> need to call canonicalize_path() since we are not comparing paths, and I
> didn't need to include common/relpath.h.  I also renamed a variable.

Since I just hit it again, I'll take the opportunity to give an example of how
this can happen.

Here, I've pg_upgraded from a pg12 to pg13, but the pg12 cluster has
postgis-3.0, and pg13 has postgis-3.1.  Maybe that's not guaranteed/intended to
work, but that's what's been working well so far this year, except that there's
two gis functions which no longer exist.  So we fail while "Restoring database
schemas in the new cluster", leaving behind tablespace dirs, which then cause
future pg_upgrades to fail.

pg_restore: from TOC entry 13543; 1255 17106 FUNCTION 
pgis_geometry_union_transfn("internal", "public"."geometry") postgres
pg_restore: error: could not execute query: ERROR:  could not find function 
"pgis_geometry_union_transfn" in file "/usr/pgsql-13/lib/postgis-3.so"
Command was: CREATE FUNCTION "public"."pgis_geometry_union_transfn"("internal", 
"public"."geometry") RETURNS "internal"
LANGUAGE "c" PARALLEL SAFE
AS '$libdir/postgis-3', 'pgis_geometry_union_transfn';

I imagine in previous years I hit that some other way, like maybe I installed
postgres RC1 on a customer server, imported their schema (and maybe data),
which then caused upgrade failure 1-2 months later when trying to upgrade their
production instance.
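
To spot such leftovers before attempting another run, the old cluster's tablespace 
locations can be listed and their per-version subdirectories inspected (a sketch; 
the version-suffix directories are what pg_upgrade complains about):

SELECT spcname, pg_tablespace_location(oid) AS location
  FROM pg_tablespace
 WHERE spcname NOT IN ('pg_default', 'pg_global');
-- then look for a PG_13_* subdirectory under each location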

-- 
Justin




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Robert Haas
On Mon, Oct 12, 2020 at 10:14 AM Bruce Momjian  wrote:
> On Mon, Oct 12, 2020 at 04:20:05PM +0800, Julien Rouhaud wrote:
> > But there are many people that aren't happy with the current hashing
> > approach.  If we're going to move the computation in core, shouldn't
> > we listen to their complaints and let them pay some probably quite
> > high overhead to base the hash on name and/or fully qualified name
> > rather than OID?
> > For instance people using logical replication to upgrade to a newer
> > version may want to easily compare query performance on the new
> > version, or people with multi-tenant databases may want to ignore the
> > schema name to keep a low number of different queryid.
>
> Well, we have to consider how complex the user interface has to be to
> allow more flexibility.  We don't need to allow every option a user will
> want.
>
> With a version number, we have the ability to improve the algorithm or
> add customization, but for the first use, we are probably better off
> keeping it simple.

I thought your earlier idea of allowing this to be controlled by a GUC
was good. There could be a default method built into core, matching
what pg_stat_statements does, so you could select no hashing or that
method no matter what. Then extensions could provide other methods
which could be selected via the GUC.

I don't really understand how a version number helps. It's not like
there is going to be a v2 that is in all ways better than v1. If there
are different algorithms here, they are going to be customized for
different needs.
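
Purely as a hypothetical sketch of that interface (the GUC name and method names 
below do not exist anywhere; they only illustrate the idea):

SET query_id_provider = 'core';          -- built-in hash, as pg_stat_statements computes today
SET query_id_provider = 'none';          -- no query ID computed at all
SET query_id_provider = 'my_extension';  -- algorithm supplied by an extension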

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: partition routing layering in nodeModifyTable.c

2020-10-12 Thread Heikki Linnakangas

On 12/10/2020 16:47, Amit Langote wrote:

On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas  wrote:

1. We have many different cleanup/close routines now:
ExecCloseResultRelations, ExecCloseRangeTableRelations, and
ExecCleanUpTriggerState. Do we need them all? It seems to me that we
could merge ExecCloseRangeTableRelations() and
ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
relations that were opened for ResultRelInfos. They are always called
together, except in afterTriggerInvokeEvents(). And in
afterTriggerInvokeEvents() too, there would be no harm in doing both,
even though we know there aren't any entries in the es_result_relations
array at that point.


Hmm, I find trigger result relations to behave differently enough to
deserve a separate function.  For example, unlike plan-specified
result relations, they don't point to range table relations and don't
open indices.  Maybe the name could be revisited, say,
ExecCloseTriggerResultRelations().


Matter of perception I guess. I still prefer to club them together into 
one Close call. It's true that they're slightly different, but they're 
also pretty similar. And IMHO they're more similar than different.



Also, maybe call the other functions:

ExecInitPlanResultRelationsArray()
ExecInitPlanResultRelation()
ExecClosePlanResultRelations()

Thoughts?


Hmm. How about initializing the array lazily, on the first 
ExecInitPlanResultRelation() call? It's not performance critical, and 
that way there's one fewer initialization function that you need to 
remember to call.


It occurred to me that if we do that (initialize the array lazily), 
there's very little need for the PlannedStmt->resultRelations list 
anymore. It's only used in ExecRelationIsTargetRelation(), but if we 
assume that ExecRelationIsTargetRelation() is only called after InitPlan 
has initialized the result relation for the relation, it can easily 
check es_result_relations instead. I think that's a safe assumption. 
ExecRelationIsTargetRelation() is only used in FDWs, and I believe the 
FDWs initialization routine can only be called after ExecInitModifyTable 
has been called on the relation.


The PlannedStmt->rootResultRelations field is even more useless.


Actually, maybe we don't need to be so paranoid about setting up
es_result_relations in worker.c, because none of the downstream
functionality invoked seems to rely on it, that is, no need to call
ExecInitResultRelationsArray() and ExecInitResultRelation().
ExecSimpleRelation* and downstream functionality assume a
single-relation operation and the ResultRelInfo is explicitly passed.


Hmm, yeah, I like that. Similarly in ExecuteTruncateGuts(), there isn't 
actually any need to put the ResultRelInfos in the es_result_relations 
array.


Putting all this together, I ended up with the attached. It doesn't 
include the subsequent commits in this patch set yet, for removal of 
es_result_relation_info et al.


- Heikki
>From c3f24ca219aa3d535df48c7add376ee3f8f4bc1e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 12 Oct 2020 19:43:55 +0300
Subject: [PATCH v15 1/1] Make es_result_relations array indexable by RT index.

Create each ResultRelInfo on demand in ExecInitModifyTable, instead of
initializing the whole array at once in InitPlan. And instead of having
a separate concept of a "result rel index", access the array using the
range table indexes. This allows several simplifications. The arrays of
regular result rels and root result are merged into one array. And we
don't need the global resultRelations list in PlannedStmt anymore.

Author: Amit Langote
Discussion: https://www.postgresql.org/message-id/CA%2BHiwqGEmiib8FLiHMhKB%2BCH5dRgHSLc5N5wnvc4kym%2BZYpQEQ%40mail.gmail.com
---
 src/backend/commands/copy.c  |  24 +--
 src/backend/commands/explain.c   |  17 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/commands/trigger.c   |   2 +-
 src/backend/executor/execMain.c  | 256 +++
 src/backend/executor/execParallel.c  |   2 -
 src/backend/executor/execUtils.c |  58 +++--
 src/backend/executor/nodeModifyTable.c   |  24 ++-
 src/backend/nodes/copyfuncs.c|   4 -
 src/backend/nodes/outfuncs.c |   6 -
 src/backend/nodes/readfuncs.c|   4 -
 src/backend/optimizer/plan/createplan.c  |   2 -
 src/backend/optimizer/plan/planner.c |   6 -
 src/backend/optimizer/plan/setrefs.c |  30 +--
 src/backend/replication/logical/worker.c |  32 +--
 src/include/executor/executor.h  |   5 +-
 src/include/nodes/execnodes.h|  19 +-
 src/include/nodes/pathnodes.h|   4 -
 src/include/nodes/plannodes.h|  11 -
 19 files changed, 188 insertions(+), 329 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3c7dbad27a..ca05702139 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2727,6 +2727,7 @@ 

Bizarre coding in recovery target GUC management

2020-10-12 Thread Tom Lane
Would someone explain to me why assign_recovery_target_lsn and related GUC
assign hooks throw errors, rather than doing so in the associated check
hooks?  An assign hook is not supposed to throw an error.  Full stop, no
exceptions.  We wouldn't bother to separate those hooks otherwise.

regards, tom lane




Re: BUG #15858: could not stat file - over 4GB

2020-10-12 Thread Tom Lane
Juan José Santamaría Flecha  writes:
> On Mon, Oct 12, 2020 at 5:27 AM Tom Lane  wrote:
>> Michael Paquier  writes:
>>> Why are we forcing errno=ENOENT here?  Wouldn't it be correct to use
>>> _dosmaperr(GetLastError()) to get the correct errno?

>> Fair question.  Juan, was there some good reason not to look at
>> GetLastError() in this step?

> Uhm, a good question indeed, forcing errno serves no purpose there.

OK, changed.

regards, tom lane




RE: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Shinoda, Noriyoshi (PN Japan A Delivery)
Sawada-san, thank you for your comments.

The attached patch reflects the comment.
I also made a fix for the regression test.

Regards,
Noriyoshi Shinoda

-Original Message-
From: Masahiko Sawada [mailto:masahiko.saw...@2ndquadrant.com] 
Sent: Monday, October 12, 2020 8:12 PM
To: Shinoda, Noriyoshi (PN Japan A Delivery) 
Cc: Amit Kapila ; Dilip Kumar ; 
Magnus Hagander ; Tomas Vondra 
; PostgreSQL Hackers 
; Ajin Cherian 
Subject: Re: Resetting spilled txn statistics in pg_stat_replication

On Mon, 12 Oct 2020 at 18:29, Shinoda, Noriyoshi (PN Japan A
Delivery)  wrote:
>
> Hi, thank you for the awesome feature.
>

Thank you for reporting!

> As it may have been discussed, I think the 'name' column in 
> pg_stat_replication_slots is more consistent with the column name and data 
> type matched to the pg_replication_slots catalog.
> The attached patch changes the name and data type of the 'name' column to 
> slot_name and 'name' type, respectively.

It seems a good idea to me. In other system views, we use the name data type 
for object name. When I wrote the first patch, I borrowed the code for 
pg_stat_slru which uses text data for the name but I think it's an oversight.

> Also, the macro name PG_STAT_GET_..._CLOS has been changed to 
> PG_STAT_GET_..._COLS.

Good catch!

Here is my comments on the patch:

--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -798,7 +798,7 @@ CREATE VIEW pg_stat_replication AS

 CREATE VIEW pg_stat_replication_slots AS
 SELECT
-s.name,
+s.name AS slot_name,
 s.spill_txns,
 s.spill_count,
 s.spill_bytes,

I think we should modify 'proargnames' of
pg_stat_get_replication_slots() in pg_proc.dat instead.

---
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -7094,7 +7094,7 @@ pgstat_replslot_index(const char *name, bool create_it)
Assert(nReplSlotStats <= max_replication_slots);
for (i = 0; i < nReplSlotStats; i++)
{
-   if (strcmp(replSlotStats[i].slotname, name) == 0)
+   if (strcmp(replSlotStats[i].slotname.data, name) == 0)
return i;   /* found */
}

@@ -7107,7 +7107,7 @@ pgstat_replslot_index(const char *name, bool create_it)

/* Register new slot */
 memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-   memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
+   memcpy(&replSlotStats[nReplSlotStats].slotname.data, name,
+ NAMEDATALEN);

return nReplSlotStats++;
 }

We can use NameStr() instead.

---
Perhaps we need to update the regression test as well.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_stat_replication_slots_v2.diff
Description: pg_stat_replication_slots_v2.diff


Re: broken logic of simple_eval_resowner after CALL and COMMIT inside procedure

2020-10-12 Thread Pavel Stehule
On Mon, 12 Oct 2020 at 16:15, Pavel Stehule wrote:

> Hi
>
> Inline handler creates simple_eval_resowner (without parent).
>
> Inside plpgsql_estate_setup this value is assigned to
> estate->simple_eval_resowner
>
>     if (simple_eval_resowner)
>         estate->simple_eval_resowner = simple_eval_resowner;
>     else
>         estate->simple_eval_resowner = shared_simple_eval_resowner;
>
> When we call procedure with inner COMMIT, then when "before_lxid !=
> after_lxid" following code is
> executed.
>
>         estate->simple_eval_estate = NULL;
>         estate->simple_eval_resowner = NULL;
>         plpgsql_create_econtext(estate);
>
> and
>
> fragment from plpgsql_create_econtext
>
>     /*
>      * Likewise for the simple-expression resource owner.
>      */
>     if (estate->simple_eval_resowner == NULL)
>     {
>         if (shared_simple_eval_resowner == NULL)
>             shared_simple_eval_resowner =
>                 ResourceOwnerCreate(TopTransactionResourceOwner,
>                                     "PL/pgSQL simple expressions");
>         estate->simple_eval_resowner = shared_simple_eval_resowner;
>     }
>
> In this case simple_eval_resowner from inline handler is overwritten and
> only shared_simple_eval_resowner will be used.
>
> So is it "estate->simple_eval_resowner = NULL;" error (without other
> conditions)?
>

It is probably described by this comment:

 *
 * (However, if a DO block executes COMMIT or ROLLBACK, then exec_stmt_commit
 * or exec_stmt_rollback will unlink it from the DO's simple-expression EState
 * and create a new shared EState that will be used thenceforth.  The original
 * EState will be cleaned up when we get back to plpgsql_inline_handler.  This
 * is a bit ugly, but it isn't worth doing better, since scenarios like this
 * can't result in indefinite accumulation of state trees.)

Pavel

>
> Regards
>
> Pavel
>
>
>


broken logic of simple_eval_resowner after CALL and COMMIT inside procedure

2020-10-12 Thread Pavel Stehule
Hi

Inline handler creates simple_eval_resowner (without parent).

Inside plpgsql_estate_setup this value is assigned to
estate->simple_eval_resowner

    if (simple_eval_resowner)
        estate->simple_eval_resowner = simple_eval_resowner;
    else
        estate->simple_eval_resowner = shared_simple_eval_resowner;

When we call procedure with inner COMMIT, then when "before_lxid !=
after_lxid" following code is
executed.

        estate->simple_eval_estate = NULL;
        estate->simple_eval_resowner = NULL;
        plpgsql_create_econtext(estate);

and

fragment from plpgsql_create_econtext

    /*
     * Likewise for the simple-expression resource owner.
     */
    if (estate->simple_eval_resowner == NULL)
    {
        if (shared_simple_eval_resowner == NULL)
            shared_simple_eval_resowner =
                ResourceOwnerCreate(TopTransactionResourceOwner,
                                    "PL/pgSQL simple expressions");
        estate->simple_eval_resowner = shared_simple_eval_resowner;
    }

In this case simple_eval_resowner from inline handler is overwritten and
only shared_simple_eval_resowner will be used.

So is it "estate->simple_eval_resowner = NULL;" error (without other
conditions)?

Regards

Pavel


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Bruce Momjian
On Mon, Oct 12, 2020 at 04:20:05PM +0800, Julien Rouhaud wrote:
> But there are many people that aren't happy with the current hashing
> approach.  If we're going to move the computation in core, shouldn't
> we listen to their complaints and let them pay some probably quite
> high overhead to base the hash on name and/or fully qualified name
> rather than OID?
> For instance people using logical replication to upgrade to a newer
> version may want to easily compare query performance on the new
> version, or people with multi-tenant databases may want to ignore the
> schema name to keep a low number of different queryid.

Well, we have to consider how complex the user interface has to be to
allow more flexibility.  We don't need to allow every option a user will
want.

With a version number, we have the ability to improve the algorithm or
add customization, but for the first use, we are probably better off
keeping it simple.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: [HACKERS] Custom compression methods

2020-10-12 Thread Tomas Vondra

On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:



...


I have worked on this patch; as discussed, I am now maintaining the
preserved compression methods using dependencies.  The PRESERVE ALL
syntax is still not supported; I will work on that part.



Cool, I'll take a look. What's your opinion on doing it this way? Do you
think it's cleaner / more elegant, or is it something contrary to what
the dependencies are meant to do?
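
For readers skimming the thread, a hypothetical sketch of the syntax under 
discussion (the exact grammar is defined by the patch and may differ):

ALTER TABLE tbl ALTER COLUMN col SET COMPRESSION lz4 PRESERVE (pglz);  -- keep old pglz values readable
ALTER TABLE tbl ALTER COLUMN col SET COMPRESSION lz4 PRESERVE ALL;     -- the not-yet-supported variant mentioned above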


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: partition routing layering in nodeModifyTable.c

2020-10-12 Thread Amit Langote
On Mon, Oct 12, 2020 at 8:12 PM Heikki Linnakangas  wrote:
> On 09/10/2020 11:01, Amit Langote wrote:
> > On Thu, Oct 8, 2020 at 9:35 PM Amit Langote  wrote:
> >> On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas  wrote:
> >>> On 07/10/2020 12:50, Amit Langote wrote:
>  I have thought about something like this before.  An idea I had is to
>  make es_result_relations array indexable by plain RT indexes, then we
>  don't need to maintain separate indexes that we do today for result
>  relations.
> >>>
> >>> That sounds like a good idea. es_result_relations is currently an array
> >>> of ResultRelInfos, so that would leave a lot of unfilled structs in the
> >>> array. But in on of your other threads, you proposed turning
> >>> es_result_relations into an array of pointers anyway
> >>> (https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=kphrdtfs...@mail.gmail.com).
> >>
> >> Okay, I am reorganizing the patches around that idea and will post an
> >> update soon.
> >
> > Attached updated patches.
> >
> > 0001 makes es_result_relations an RTI-indexable array, which allows to
> > get rid of all "result relation index" fields across the code.
>
> Thanks! A couple small things I wanted to check with you before committing:

Thanks for checking.

> 1. We have many different cleanup/close routines now:
> ExecCloseResultRelations, ExecCloseRangeTableRelations, and
> ExecCleanUpTriggerState. Do we need them all? It seems to me that we
> could merge ExecCloseRangeTableRelations() and
> ExecCleanUpTriggerState(), they seem to do roughly the same thing: close
> relations that were opened for ResultRelInfos. They are always called
> together, except in afterTriggerInvokeEvents(). And in
> afterTriggerInvokeEvents() too, there would be no harm in doing both,
> even though we know there aren't any entries in the es_result_relations
> array at that point.

Hmm, I find trigger result relations to behave differently enough to
deserve a separate function.  For example, unlike plan-specified
result relations, they don't point to range table relations and don't
open indices.  Maybe the name could be revisited, say,
ExecCloseTriggerResultRelations().  Also, maybe call the other
functions:

ExecInitPlanResultRelationsArray()
ExecInitPlanResultRelation()
ExecClosePlanResultRelations()

Thoughts?

> 2. The way this is handled in worker.c is a bit funny. In
> create_estate_for_relation(), you create a ResultRelInfo, but you
> *don't* put it in the es_opened_result_relations list. That's
> surprising, but I'm also surprised there are no
> ExecCloseResultRelations() calls before the FreeExecutorState() calls in
> worker.c. It's not needed because the
> apply_handle_insert/update/delete_internal() functions call
> ExecCloseIndices() directly, so they don't rely on the
> ExecCloseResultRelations() function for cleanup. That works too, but
> it's a bit surprising because it's different from how it's done in
> copy.c and nodeModifyTable.c. It would feel natural to rely on
> ExecCloseResultRelations() in worker.c as well, but on the other hand,
> it also calls ExecOpenIndices() in a more lazy fashion, and it makes
> sense to call ExecCloseIndices() in the same functions that
> ExecOpenIndices() is called. So I'm not sure if changing that would be
> an improvement overall. What do you think? Did you consider doing that?

Yeah, that did bother me too a bit.  I'm okay either way but it does
look a bit inconsistent.

Actually, maybe we don't need to be so paranoid about setting up
es_result_relations in worker.c, because none of the downstream
functionality invoked seems to rely on it, that is, no need to call
ExecInitResultRelationsArray() and ExecInitResultRelation().
ExecSimpleRelation* and downstream functionality assume a
single-relation operation and the ResultRelInfo is explicitly passed.

> Attached is your original patch v13, and a patch on top of it that
> merges ExecCloseResultRelations() and ExecCleanUpTriggerState(), and
> makes some minor comment changes. I didn't do anything about the
> worker.c business, aside from adding a comment about it.

Thanks for the cleanup.

I had noticed there was some funny capitalization in my patch:

+   ResultRelInfo **es_result_relations;/* Array of Per-range-table-entry

s/Per-/per-

Also, I think a comma may be needed in the parenthetical below:

+ * can index it by the RT index (minus 1 to be accurate).

...(minus 1, to be accurate)

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: BUG #15858: could not stat file - over 4GB

2020-10-12 Thread Juan José Santamaría Flecha
On Mon, Oct 12, 2020 at 5:27 AM Tom Lane  wrote:

> Michael Paquier  writes:
> > Why are we forcing errno=ENOENT here?  Wouldn't it be correct to use
> > _dosmaperr(GetLastError()) to get the correct errno?
>
> Fair question.  Juan, was there some good reason not to look at
> GetLastError() in this step?
>

Uhm, a good question indeed, forcing errno serves no purpose there.

Regards,

Juan José Santamaría Flecha


Re: [HACKERS] Runtime Partition Pruning

2020-10-12 Thread Andy Fan
On Mon, Oct 12, 2020 at 5:48 PM Ashutosh Bapat 
wrote:

> On Mon, Oct 12, 2020 at 7:59 AM Andy Fan  wrote:
>
> >
> > Sorry for the late reply!  Suppose we have partition defined like this:
> > p
> > - p1
> > - p2
> >
> > When you talk about "the statistics from the partitioned table", do you
> > mean the statistics from p or p1/p2?   I just confirmed there is no
> statistics
> > for p (at least pg_class.reltuples = -1),  so I think you are talking
> about
> > p1/p2.
>
> I am talking about p when I say statistics from the partitioned table.
> I see that pg_statistic row from p is well populated.
> pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
> set_rel_size() sets the number of rows in the partitioned table based
> on the rows in individual unpruned partitions.
>
>
Glad to know that, Thanks for this info!


> >
> > Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> > so even the value can hit 1 partition only,  but since we don't know
> > the exact value, so we treat all the partition equally.  so looks
> > nothing wrong with partition level estimation.  However when we cost
> > the Append path, we need know just one of them can be hit,  then
> > we need do something there.  Both AppendPath->rows/total_cost
> > should be adjusted (That doesn't happen now).
>
> I think in this case we can safely assume that only one partition will
> remain so normalize costs considering that only one partition will
> survive.
>

Exactly.  What I am trying to do is fix this at create_append_path,
do you have different suggestions? about the pkey > $1 case,  I think
even if we use the statistics from partition level,  it would be
hard-code as well since we don't know what value $1 is.

I have gone through the main part of the RunTime partition prune, hope
I can update a runnable patch soon.  The main idea is to fix the rows/
costs at the create_append_path stage.  So any suggestion in a different
direction will be very useful.
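
A minimal way to reproduce the case being discussed (table names are made up to 
match the p/p1/p2 sketch above):

CREATE TABLE p (partkey int) PARTITION BY LIST (partkey);
CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1);
CREATE TABLE p2 PARTITION OF p FOR VALUES IN (2);

SET plan_cache_mode = force_generic_plan;
PREPARE q (int) AS SELECT * FROM p WHERE partkey = $1;
EXPLAIN (ANALYZE) EXECUTE q(1);
-- The Append node reports "Subplans Removed: 1", but its row and cost
-- estimates were built as if both partitions could be scanned.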

-- 
Best Regards
Andy Fan


Re: [PATCH] Add features to pg_stat_statements

2020-10-12 Thread Yuki Seino
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The patch applies cleanly and looks fine to me. These are small details, but
wouldn't it be better if the following changes were made?

1. There are unnecessary changed lines at 707 and 863 in "pg_stat_statements.c".
2. There is no comment on "slock_t mutex" in "struct pgssCtlCounter" in
"pg_stat_statements.c".
3. "update_ctl" and "reset_ctl" are generic, hard-to-read names. You might want to
rename them to something like "pgss_ctl_update".
4. To improve the readability of the source, why not change the function
declaration option for "pg_stat_statements_ctl" from
"AS 'MODULE_PATHNAME'" to "AS 'MODULE_PATHNAME', 'pg_stat_statements_ctl'" in
"pg_stat_statements--1.8--1.9.sql"?

The new status of this patch is: Waiting on Author


Re: archive status ".ready" files may be created too early

2020-10-12 Thread Heikki Linnakangas

On 07/07/2020 12:02, matsumura@fujitsu.com wrote:

At Monday, July 6, 2020 05:13:40, Kyotaro Horiguchi wrote:

after WAL buffer is filled up to the requested position. So when it
crosses segment boundary we know the all past corss segment-boundary
records are stable. That means all we need to remember is only the
position of the latest corss-boundary record.


I could not agree. In the following case, it may not work well.
- record-A and record-B (record-B is a newer one) is copied, and
- lastSegContRecStart/End points to record-B's, and
- FlushPtr is proceeded to in the middle of record-A.


IIUC, that means record-B is a cross segment-border record and we have
flushed beyond record-B. In that case crash recovery afterwards
can read the complete record-B and will finish recovery *after* the
record-B. That's what we need here.


I'm sorry I didn't explain enough.

Record-A and Record-B are cross segment-border records.
Record-A spans segment X and X+1
Record-B spans segment X+2 and X+3.
If both records have been inserted to WAL buffer, lastSegContRecStart/End 
points to Record-B.
If a writer flushes upto the middle of segment-X+1, NotifyStableSegments() 
allows the writer to notify segment-X.
Is my understanding correct?


I think this little ASCII drawing illustrates the above scenario:

         A                F                   B
|--------------------|--------------------|--------------------|
        seg X               seg X+1               seg X+2

A and B are Record-A and Record-B. F is the current flush pointer.

In this case, it would be OK to notify segment X, as long as F is 
greater than the end of record A. And if I'm reading Kyotaro's patch 
correctly, that's what would happen with the patch.


The patch seems correct to me. I'm a bit sad that we have to track yet 
another WAL position (two, actually) to fix this, but I don't see a 
better way.


I wonder if we should arrange things so that XLogwrtResult.Flush never 
points in the middle of a record? I'm not totally convinced that all the 
current callers of GetFlushRecPtr() are OK with a middle-of-WAL record 
value. Could we get into similar trouble if a standby replicates half of 
a cross-segment record to a cascaded standby, and the cascaded standby 
has WAL archiving enabled?


- Heikki




Re: Batching page logging during B-tree build

2020-10-12 Thread Dmitry Dolgov
> On Fri, Oct 09, 2020 at 11:08:42PM +0500, Andrey Borodin wrote:
>
> > On 23 Sep 2020, at 23:19, Peter Geoghegan wrote:
> >
> > On Fri, Sep 18, 2020 at 8:39 AM Andrey M. Borodin  
> > wrote:
> >> Here is PoC with porting that same routine to B-tree. It allows to build 
> >> B-trees ~10% faster on my machine.
> >
> > It doesn't seem to make any difference on my machine, which has an
> > NVME SSD (a Samsung 970 Pro). This is quite a fast SSD, though the
> > sync time isn't exceptional. My test case is "reindex index
> > pgbench_accounts_pkey", with pgbench scale 500. I thought that this
> > would be a sympathetic case, since it's bottlenecked on writing the
> > index, with relatively little time spent scanning and sorting in
> > parallel workers.
> > Can you provide a test case that is sympathetic towards the patch?
> Thanks for looking into this!
>
> I've tried this test on my machine (2019 macbook) on scale 10 for 20 seconds.
> With patch I get consistently ~ tps = 2.403440, without patch ~ tps = 
> 1.951975.
> On scale 500 with patch
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 21577,640 ms (00:21,578)
> without patch
> postgres=# reindex index pgbench_accounts_pkey;
> REINDEX
> Time: 26139,175 ms (00:26,139)
>
> I think it's hardware dependent, I will try on servers.

Hi,

Thanks for the patch! Out of curiosity I've tried to experiment with it,
and in the reindex scenario Peter mentioned above I also can't see any
visible difference. Interesting enough, _bt_blwritepage tracing (IIUC
after reading the patch it's the main target) shows the influence of
batching, but apparently in my case it hits some other bottleneck (e.g.
profiling shows increase in LockBuffer and LogicalTapeRead).




RE: Remove some unnecessary if-condition

2020-10-12 Thread Hou, Zhijie
> To me it looks good to clean up the conditions as you have done in the patch.
> Please add this to commitfest so that it's not forgotten. I have verified
> the code and indeed the conditions you are removing are unnecessary. So
> the patch can be marked as CFP right away.

Thank you for reviewing! added it to commitfest
https://commitfest.postgresql.org/30/2760/


Best regards,
houzj




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-12 Thread Thomas Munro
On Mon, Oct 12, 2020 at 6:35 PM Greg Nancarrow  wrote:
> On Mon, Oct 12, 2020 at 2:11 PM Thomas Munro  wrote:
> > Yeah, I think we should go ahead and fix that up front.  Here's a
> > version with a commit message.
>
> I've checked it and tested it, and it looks fine to me.
> Also, it seems to align with the gripe in the old comment about width
> ("XXX this is totally wrong: we should report zero if no RETURNING
> ...").
> I'm happy for you to commit it.

Pushed, though I left most of that comment there because the width
estimate still needs work when you do use RETURNING.  At least we now
have rows=0 for queries without RETURNING, which was a bigger problem
for your patch.
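
A quick way to see the effect of the committed change (hypothetical table; exact 
cost numbers omitted):

CREATE TABLE t (a int);
EXPLAIN INSERT INTO t SELECT generate_series(1, 1000);
--   Insert on t  (cost=... rows=0 width=...)      <- zero rows without RETURNING
EXPLAIN INSERT INTO t SELECT generate_series(1, 1000) RETURNING a;
--   Insert on t  (cost=... rows=1000 width=...)   <- width still needs work, as noted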




Re: speed up unicode normalization quick check

2020-10-12 Thread Michael Paquier
On Mon, Oct 12, 2020 at 03:39:51PM +0900, Masahiko Sawada wrote:
> Yes, this patch resolves the problem.

Okay, applied then.
--
Michael




Re: speed up unicode normalization quick check

2020-10-12 Thread Michael Paquier
On Mon, Oct 12, 2020 at 05:46:16AM -0400, John Naylor wrote:
> Hmm, I hadn't actually, but now that you mention it, that looks worth
> optimizing as well, since there are multiple callers that search that
> table -- thanks for the reminder. The attached patch was easy to whip up,
> being similar to the quick check (doesn't include the pg_hton32 fix). I'll
> do some performance testing soon. Note that a 25kB increase in size could
> be present in frontend binaries as well in this case. While looking at
> decomposition, I noticed that recomposition does a linear search through
> all 6600+ entries, although it seems only about 800 are valid for that.
> That could be optimized as well now, since with hashing we have more
> flexibility in the ordering and can put the recomp-valid entries in front.
> I'm not yet sure if it's worth the additional complexity. I'll take a look
> and start a new thread.

Starting a new thread for this topic sounds like a good idea to me,
with a separate performance analysis.  Thanks!
--
Michael




Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Mon, Oct 12, 2020 at 11:52 AM Masahiko Sawada
 wrote:
>
> On Thu, 8 Oct 2020 at 22:57, Amit Kapila  wrote:
> >
> > On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada
> >  wrote:
> >
> > I have rebased the stream stats patch and made minor modifications. I
> > haven't done a detailed review but one thing that I think is not
> > correct is:
> > @@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
> > ReorderBufferTXN *txn)
> >   txn->snapshot_now = NULL;
> >   }
> >
> > +
> > + rb->streamCount += 1;
> > + rb->streamBytes += txn->total_size;
> > +
> > + /* Don't consider already streamed transaction. */
> > + rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
> > +
> >   /* Process and send the changes to output plugin. */
> >   ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
> >   command_id, true);
> >
> > I think we should update the stream stats after
> > ReorderBufferProcessTXN rather than before because any error in
> > ReorderBufferProcessTXN can lead to an unnecessary update of stats.
> > But OTOH, the txn flags, and other data can be changed after
> > ReorderBufferProcessTXN so we need to save them in a temporary
> > variable before calling the function.
>
> Thank you for updating the patch!
>
> I've not looked at the patch in-depth yet but RBTXN_IS_STREAMED could
> be cleared after ReorderBufferProcessTXN()?
>

I think you mean to say RBTXN_IS_STREAMED could be *set* after
ReorderBufferProcessTXN(). We need to set it for txn and subtxns and
currently, it is being done with other things in
ReorderBufferTruncateTXN so not sure if it is a good idea to do this
separately.

> BTW maybe it's better to start a new thread for this patch as the
> title is no longer relevant.
>

Yeah, that makes sense. I'll do that while posting a new version of the patch.


-- 
With Regards,
Amit Kapila.




Re: partition routing layering in nodeModifyTable.c

2020-10-12 Thread Heikki Linnakangas

On 09/10/2020 11:01, Amit Langote wrote:

On Thu, Oct 8, 2020 at 9:35 PM Amit Langote  wrote:

On Wed, Oct 7, 2020 at 9:07 PM Heikki Linnakangas  wrote:

On 07/10/2020 12:50, Amit Langote wrote:

I have thought about something like this before.  An idea I had is to
make es_result_relations array indexable by plain RT indexes, then we
don't need to maintain separate indexes that we do today for result
relations.


That sounds like a good idea. es_result_relations is currently an array
of ResultRelInfos, so that would leave a lot of unfilled structs in the
array. But in one of your other threads, you proposed turning
es_result_relations into an array of pointers anyway
(https://www.postgresql.org/message-id/CA+HiwqE4k1Q2TLmCAvekw+8_NXepbnfUOamOeX=kphrdtfs...@mail.gmail.com).


Okay, I am reorganizing the patches around that idea and will post an
update soon.


Attached updated patches.

0001 makes es_result_relations an RTI-indexable array, which allows to
get rid of all "result relation index" fields across the code.


Thanks! A couple small things I wanted to check with you before committing:

1. We have many different cleanup/close routines now: 
ExecCloseResultRelations, ExecCloseRangeTableRelations, and 
ExecCleanUpTriggerState. Do we need them all? It seems to me that we 
could merge ExecCloseRangeTableRelations() and 
ExecCleanUpTriggerState(), they seem to do roughly the same thing: close 
relations that were opened for ResultRelInfos. They are always called 
together, except in afterTriggerInvokeEvents(). And in 
afterTriggerInvokeEvents() too, there would be no harm in doing both, 
even though we know there aren't any entries in the es_result_relations 
array at that point.


2. The way this is handled in worker.c is a bit funny. In 
create_estate_for_relation(), you create a ResultRelInfo, but you 
*don't* put it in the es_opened_result_relations list. That's 
surprising, but I'm also surprised there are no 
ExecCloseResultRelations() calls before the FreeExecutorState() calls in 
worker.c. It's not needed because the 
apply_handle_insert/update/delete_internal() functions call 
ExecCloseIndices() directly, so they don't rely on the 
ExecCloseResultRelations() function for cleanup. That works too, but 
it's a bit surprising because it's different from how it's done in 
copy.c and nodeModifyTable.c. It would feel natural to rely on 
ExecCloseResultRelations() in worker.c as well, but on the other hand, 
it also calls ExecOpenIndices() in a more lazy fashion, and it makes 
sense to call ExecCloseIndices() in the same functions that 
ExecOpenIndices() is called. So I'm not sure if changing that would be 
an improvement overall. What do you think? Did you consider doing that?


Attached is your original patch v13, and a patch on top of it that 
merges ExecCloseResultRelations() and ExecCleanUpTriggerState(), and 
makes some minor comment changes. I didn't do anything about the 
worker.c business, aside from adding a comment about it.


- Heikki
>From 17e22c13509941c561346f386d53fd4f4ff16ef8 Mon Sep 17 00:00:00 2001
From: amitlan 
Date: Fri, 9 Oct 2020 13:52:29 +0900
Subject: [PATCH 1/2] Make es_result_relations array indexable by RT index

This allows us to get rid of the need to keep around a separate set
of indexes for each result relation.
---
 src/backend/commands/copy.c  |  20 +--
 src/backend/commands/explain.c   |  17 +-
 src/backend/commands/tablecmds.c |  12 +-
 src/backend/executor/execMain.c  | 189 ++-
 src/backend/executor/execUtils.c |  73 ++---
 src/backend/executor/nodeModifyTable.c   |  24 ++-
 src/backend/nodes/copyfuncs.c|   2 -
 src/backend/nodes/outfuncs.c |   2 -
 src/backend/nodes/readfuncs.c|   2 -
 src/backend/optimizer/plan/createplan.c  |   2 -
 src/backend/optimizer/plan/setrefs.c |  10 +-
 src/backend/replication/logical/worker.c |   4 +-
 src/include/executor/executor.h  |   5 +
 src/include/nodes/execnodes.h|  21 ++-
 src/include/nodes/plannodes.h|   2 -
 15 files changed, 162 insertions(+), 223 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3c7dbad27a..6948214334 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2829,25 +2829,18 @@ CopyFrom(CopyState cstate)
 	 * index-entry-making machinery.  (There used to be a huge amount of code
 	 * here that basically duplicated execUtils.c ...)
 	 */
-	resultRelInfo = makeNode(ResultRelInfo);
-	InitResultRelInfo(resultRelInfo,
-	  cstate->rel,
-	  1,		/* must match rel's position in range_table */
-	  NULL,
-	  0);
-	target_resultRelInfo = resultRelInfo;
+	ExecInitRangeTable(estate, cstate->range_table);
+	ExecInitResultRelationsArray(estate);
+	resultRelInfo = target_resultRelInfo = makeNode(ResultRelInfo);
+	ExecInitResultRelation(estate, resultRelInfo, 1);
 
 	/* Verify the named relation is a valid 

Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Masahiko Sawada
On Mon, 12 Oct 2020 at 18:29, Shinoda, Noriyoshi (PN Japan A
Delivery)  wrote:
>
> Hi, thank you for the awesome feature.
>

Thank you for reporting!

> As it may have been discussed, I think the 'name' column in 
> pg_stat_replication_slots is more consistent with the column name and data 
> type matched to the pg_replication_slots catalog.
> The attached patch changes the name and data type of the 'name' column to 
> slot_name and 'name' type, respectively.

It seems a good idea to me. In other system views, we use the name
data type for object name. When I wrote the first patch, I borrowed
the code for pg_stat_slru which uses text data for the name but I
think it's an oversight.
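
For comparison, the existing catalog whose naming the rename follows, and the view 
as it would look afterwards (a sketch; the final column list is whatever the patch 
ends up with):

SELECT slot_name FROM pg_replication_slots;            -- slot_name is of type name
SELECT slot_name, spill_txns, spill_count, spill_bytes
  FROM pg_stat_replication_slots;                      -- after the proposed rename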

> Also, the macro name PG_STAT_GET_..._CLOS has been changed to 
> PG_STAT_GET_..._COLS.

Good catch!

Here is my comments on the patch:

--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -798,7 +798,7 @@ CREATE VIEW pg_stat_replication AS

 CREATE VIEW pg_stat_replication_slots AS
 SELECT
-s.name,
+s.name AS slot_name,
 s.spill_txns,
 s.spill_count,
 s.spill_bytes,

I think we should modify 'proargnames' of
pg_stat_get_replication_slots() in pg_proc.dat instead.

---
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -7094,7 +7094,7 @@ pgstat_replslot_index(const char *name, bool create_it)
Assert(nReplSlotStats <= max_replication_slots);
for (i = 0; i < nReplSlotStats; i++)
{
-   if (strcmp(replSlotStats[i].slotname, name) == 0)
+   if (strcmp(replSlotStats[i].slotname.data, name) == 0)
return i;   /* found */
}

@@ -7107,7 +7107,7 @@ pgstat_replslot_index(const char *name, bool create_it)

/* Register new slot */
 memset(&replSlotStats[nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
-   memcpy(&replSlotStats[nReplSlotStats].slotname, name, NAMEDATALEN);
+   memcpy(&replSlotStats[nReplSlotStats].slotname.data, name, NAMEDATALEN);

return nReplSlotStats++;
 }

We can use NameStr() instead.

---
Perhaps we need to update the regression test as well.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-12 Thread Amit Kapila
On Mon, Oct 12, 2020 at 3:08 PM k.jami...@fujitsu.com
 wrote:
>
> Hmm. When I repeated the performance measurement for non-recovery,
> I got almost similar execution results for both master and patched.
>
> Execution Time (in seconds)
> | s_b   | master | patched | %reg   |
> |---||-||
> | 128MB | 15.265 | 14.769  | -3.36% |
> | 1GB   | 14.808 | 14.618  | -1.30% |
> | 20GB  | 24.673 | 24.425  | -1.02% |
> | 100GB | 74.298 | 74.813  | 0.69%  |
>
> That is considering that I removed the recovery-related checks in the patch 
> and just
> executed the commands on a standalone server.
> -   if (InRecovery && reln->smgr_cached_nblocks[forknum] != 
> InvalidBlockNumber)
> +   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
>

Why so? Have you tried to investigate? Check if it takes an optimized
path for the non-recovery case?

-- 
With Regards,
Amit Kapila.




Re: Deleting older versions in unique indexes to avoid page splits

2020-10-12 Thread Andrey Borodin



> On 8 Oct 2020, at 04:48, Peter Geoghegan wrote:
> 
> On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan  wrote:
>> Attached is a POC patch that teaches nbtree to delete old duplicate
>> versions from unique indexes. The optimization targets non-HOT
>> duplicate version bloat. Although the patch is rather rough, it
>> nevertheless manages to more or less eliminate a whole class of index
>> bloat: Unique index bloat from non-HOT updates in workloads where no
>> transaction lasts for more than a few seconds.
> 
> I'm slightly surprised that this thread didn't generate more interest
> back in June.

The idea looks very interesting.
It resembles GiST microvacuum: GiST tries to vacuum a single page before splitting it.
I'm curious how the cost of page deduplication compares to the cost of a page split.
Should we still do deduplication if the page will remain 99% full?

Best regards, Andrey Borodin.





Re: Improved Cost Calculation for IndexOnlyScan

2020-10-12 Thread Hamid Akhtar
On Tue, Sep 29, 2020 at 2:57 PM Heikki Linnakangas  wrote:

> On 29/09/2020 11:49, Hamid Akhtar wrote:
> > So, not actually random replacement here, rather a change with
> > baserel->allvisfrac taken into consideration (as given below):
> > 
> > index_random_page_cost = Min(spc_seq_page_cost + spc_random_page_cost *
> > (1.0 - baserel->allvisfrac), spc_random_page_cost);
> > 
> >
> > Does this make sense?
>
> No. genericcostestimate() is only concerned with accesses to the index,
> not the the heap accesses that are needed with Index Scans. 'allvisfrac'
> should not affect the number of *index* pages fetched in any way.
>
> - Heikki
>

Currently, the costing for indexonlyscan only differs based on
'allvisfrac'. IIUC, the current implementation changes the number of pages
being fetched based on 'allvisfrac'.

This patch actually makes indexonlyscan specific changes
to genericcostestimate function. Currently, regardless of the value of
'allvisfrac', it is being assumed that the cost of fetching index pages is
random page cost. That is not aligned with the current cost calculation for
indexonlyscan. Therefore, I'm suggesting to reduce the random page cost in a
similar fashion for index-only scans.
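
Plugging the default cost settings (seq_page_cost = 1.0, random_page_cost = 4.0) 
into the formula quoted earlier in the thread gives a feel for its behaviour; this 
is just arithmetic on the proposal, not a claim about the patch itself:

SELECT least(1.0 + 4.0 * (1.0 - 0.9), 4.0);  -- allvisfrac = 0.9  =>  1.4
SELECT least(1.0 + 4.0 * (1.0 - 0.0), 4.0);  -- allvisfrac = 0.0  =>  4.0 (plain random_page_cost)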

I'm adding this to the commitfest.

-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Amit Kapila
On Mon, Oct 12, 2020 at 10:59 AM Masahiko Sawada
 wrote:
>
> On Thu, 8 Oct 2020 at 17:59, Amit Kapila  wrote:
> >
> > On Thu, Oct 8, 2020 at 1:55 PM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 8 Oct 2020 at 14:10, Amit Kapila  wrote:
> > > >
> > > >
> > > > We can write if we want but there are few things we need to do for
> > > > that like maybe a new function like wait_for_spill_stats which will
> > > > check if the counters have become zero. Then probably call a reset
> > > > function, call a new wait function, and then again check stats to
> > > > ensure they are reset to 0.
> > >
> > > Yes.
> > >
> >
> > I am not sure if it is worth but probably it is not a bad idea
> > especially if we extend the existing tests based on your below idea?
> >
> > > > We can't write any advanced test which means reset the existing stats
> > > > perform some tests and again check stats because *slot_get_changes()
> > > > function can start from the previous WAL for which we have covered the
> > > > stats. We might write that if we can somehow track the WAL positions
> > > > from the previous test. I am not sure if we want to go there.
> > >
> > > Can we use pg_logical_slot_peek_changes() instead to decode the same
> > > transactions multiple times?
> > >
> >
> > I think this will do the trick. If we want to go there then I suggest
> > we can have a separate regression test file in test_decoding with name
> > as decoding_stats, stats, or something like that. We can later add the
> > tests related to streaming stats in that file as well.
> >
>
> Agreed.
>
> I've updated the patch. Please review it.
>

Few comments:
=
1.
+-- function to wait for counters to advance
+CREATE FUNCTION wait_for_spill_stats(check_reset bool) RETURNS void AS $$

Can we rename this function to wait_for_decode_stats? I am thinking we
can later reuse this function for streaming stats as well by passing
the additional parameter 'stream bool'.

2. let's drop the table added by this test and regression_slot at the
end of the test.
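
A rough sketch of what such a helper could look like, just to make the shape of the 
test concrete (the function body, slot name, and exact conditions are illustrative, 
not the final regression test):

CREATE FUNCTION wait_for_decode_stats(check_reset bool) RETURNS void AS $$
DECLARE
    updated bool;
BEGIN
    FOR i IN 1 .. 300 LOOP
        SELECT CASE WHEN check_reset THEN spill_txns = 0
                    ELSE spill_txns > 0 END
          INTO updated
          FROM pg_stat_replication_slots
         WHERE slot_name = 'regression_slot';
        EXIT WHEN updated;
        PERFORM pg_sleep(0.1);
    END LOOP;
END
$$ LANGUAGE plpgsql;

-- pg_logical_slot_peek_changes() decodes without consuming, so the same
-- spilled transactions can be decoded (and their stats re-counted) repeatedly:
SELECT count(*) > 0
  FROM pg_logical_slot_peek_changes('regression_slot', NULL, NULL,
                                    'skip-empty-xacts', '1');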

-- 
With Regards,
Amit Kapila.




Re: Add primary keys to system catalogs

2020-10-12 Thread John Naylor
On Tue, Oct 6, 2020 at 4:16 PM Andres Freund  wrote:

> True, we don't create new ones that often. Still think that distributing
> such setup over fewer places is good. And it's not like there's only a
> handful of pkeys to start with. To me it makes more sense to add a
> DECLARE_PRIMARY_KEY in indexing.h, replacing the relevant
> DECLARE_UNIQUE_INDEX stanzas.


That seems logical.


> Or, possibly preferrable, do it as part of
> the CATALOG() directly - which'd avoid needing the index statement in
> the first place.
>

That line is already messy in places. AFAICT, you'd still need info from
the index statement (at least my shortened version below), leading to two
syntaxes and two places for the same thing. Maybe I'm missing something?


> Wonder whether it's not time to move the DECLARE bits from indexing.h to
> the individual catalogs, independent of what we're discussing here. With
> todays Catalog.pm there's really not much point in keeping them
> separate, imo.
>

That would look nicer, at least. A declaration could go from

DECLARE_UNIQUE_INDEX(pg_subscription_rel_srrelid_srsubid_index, 6117, on
pg_subscription_rel using btree(srrelid oid_ops, srsubid oid_ops));
#define SubscriptionRelSrrelidSrsubidIndexId 6117

to something like

DECLARE_UNIQUE_INDEX(btree(srrelid oid_ops, srsubid oid_ops), 6117,
SubscriptionRelSrrelidSrsubidIndexId));

-- 
John Naylor
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Logical replication CPU-bound with TRUNCATE/DROP/CREATE many tables

2020-10-12 Thread Dilip Kumar
On Fri, Oct 9, 2020 at 2:34 PM Amit Kapila  wrote:
>
> On Fri, Oct 9, 2020 at 2:19 PM Simon Riggs  wrote:
> >
> > On Fri, 9 Oct 2020 at 04:10, Amit Kapila  wrote:
> > >
> > > On Thu, Oct 8, 2020 at 2:34 PM Simon Riggs  wrote:
> > > >
> > > > On Thu, 8 Oct 2020 at 09:47, Dilip Kumar  wrote:
> > > >
> > > > > > This script will wait 10 seconds after INSERT exits
> > > > > > before executing TRUNCATE, please wait for it to run.
> > > >
> > > > Has this been tested with anything other than the one test case?
> > > >
> > > > It would be good to know how the patch handles a transaction that
> > > > contains many aborted subtransactions that contain invals.
> > > >
> > >
> > > Are you thinking from the angle of performance or functionality? I
> > > don't see how this patch can impact either of those. Basically, it
> > > will not execute any extra invalidations then it is executing without
> > > the patch for aborted subtransactions. Can you please explain in a bit
> > > more detail about your fear?
> > >
> > > Having said that, I think it would be a good idea to test the scenario
> > > you mentioned to ensure that we have not broken anything unknowingly.
> >
> > The test appears to only cover the case of many subtransactions, all
> > of which commit, and then top-level commit occurs.
> >
> > We should be testing cases where the top-level commit occurs, yet some
> > proportion of the subtransactions abort. "Normal" would be 10-50%
> > aborts.
> >
> > I presume we support this case already, but wish to ensure the
> > performance tweak is not just for the one special case.
> >
>
> Okay, I think this makes sense. I think we should see the performance
> benefit for this case as well but maybe to a bit lesser degree because
> we will exclude some of the subtransactions from processing.

I have tried with a combination of abort/commit subtransaction and I
could see a similar benefit with the patch.

I tested below transaction
BEGIN;
truncate table nsp_001.tbl_001;
savepoint s1;
truncate table nsp_001.tbl_001;
savepoint s2;
truncate table part_0001;
savepoint s3;
truncate table part_0002;
savepoint s5;
truncate table part_0003;
rollback to s3;
commit;
EOF


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Runtime Partition Pruning

2020-10-12 Thread Ashutosh Bapat
On Mon, Oct 12, 2020 at 7:59 AM Andy Fan  wrote:

>
> Sorry for the late reply!  Suppose we have partition defined like this:
> p
> - p1
> - p2
>
> When you talk about "the statistics from the partitioned table", do you
> mean the statistics from p or p1/p2?   I just confirmed there is no statistics
> for p (at least pg_class.reltuples = -1),  so I think you are talking about
> p1/p2.

I am talking about p when I say statistics from the partitioned table.
I see that pg_statistic row from p is well populated.
pg_class.reltuples = -1 indicates that the heap doesn't have any rows.
set_rel_size() sets the number of rows in the partitioned table based
on the rows in individual unpruned partitions.

>
> Here we are talking about partkey = $1 or  partkey = RunTimeValue.
> so even the value can hit 1 partition only,  but since we don't know
> the exact value, so we treat all the partition equally.  so looks
> nothing wrong with partition level estimation.  However when we cost
> the Append path, we need know just one of them can be hit,  then
> we need do something there.  Both AppendPath->rows/total_cost
> should be adjusted (That doesn't happen now).

I think in this case we can safely assume that only one partition will
remain so normalize costs considering that only one partition will
survive.

-- 
Best Wishes,
Ashutosh Bapat




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-12 Thread k.jami...@fujitsu.com
On Friday, October 9, 2020 11:12 AM, Horiguchi-san wrote:
> I have some comments on the latest patch.

Thank you for the feedback!
I've attached the latest patches.

>  visibilitymap_prepare_truncate(Relation rel, BlockNumber nheapblocks)  {
>   BlockNumber newnblocks;
> + boolcached;
> 
> All the variables added by 0002 are useless because none of the caller sites
> are interested in the value. smgrnblocks should accept NULL as isCached.
> (I agree with Tsunakawa-san that the camel-case name is not common
> there.)
> 
> + nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &isCached);
> +
> + if (!isCached)
> 
> "is cached" is not the property that code is interested in. No other callers 
> to
> smgrnblocks are interested in that property. The need for caching is purely
> internal of smgrnblocks().
> On the other hand, we are going to utilize the property of "accuracy"
> that is a biproduct of reducing fseek calls, and, again, not interested in 
> how it
> is achieved.
> So I suggest that the name should be "accurite" or something that is not
> suggest the mechanism used under the hood.

I changed the bool param to "accurate" per your suggestion.
And I also removed the additional variables "bool cached" from the modified 
functions.
Now NULL values are accepted for the new boolean parameter
 

> + if (nTotalBlocks != InvalidBlockNumber &&
> + nBlocksToInvalidate <
> BUF_DROP_FULL_SCAN_THRESHOLD)
> 
> I don't think nTotalBlocks is useful. What we need here is only total blocks 
> for
> every forks (nForkBlocks[]) and the total number of buffers to be invalidated
> for all forks (nBlocksToInvalidate).

Alright. I also removed nTotalBlocks in v24-0003 patch.

for (i = 0; i < nforks; i++)
{
    if (nForkBlocks[i] != InvalidBlockNumber &&
        nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD)
    {
        /* optimization loop */
    }
    else
        break;
}
if (i >= nforks)
    return;
/* usual buffer invalidation process */


> > > > The right side of >= should be cur_block.
> > > Fixed.
> > >= should be =, shouldn't it?
> 
> It's just from a paranoia. What we are going to invalidate is blocks blockNum
> of which >= curBlock. Although actually there's no chance of any other
> processes having replaced the buffer with another page (with lower blockid)
> of the same relation after BufTableLookup(), that condition makes it sure not
> to leave blocks to be invalidated left alone.
> Sorry. What we are going to invalidate is blocks whose blockNum >=
> firstDelBlock[i]. So what I wanted to suggest was the condition should be
> 
> + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
> +     bufHdr->tag.forkNum == forkNum[j] &&
> +     bufHdr->tag.blockNum >= firstDelBlock[j])

I used bufHdr->tag.blockNum >= firstDelBlock[i] in the latest patch.

> > Please measure and let us see just the recovery performance again because
> the critical part of the patch was modified.  If the performance is good as 
> the
> previous one, and there's no review interaction with others in progress, I'll
> mark the patch as ready for committer in a few days.
> 
> The performance is expected to be kept since smgrnblocks() is called in a
> non-hot code path and actually it is called at most four times per buffer
> drop in this patch. But it's better to make sure.

Hmm. When I repeated the performance measurement for non-recovery,
I got almost similar execution results for both master and patched.

Execution Time (in seconds)
| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 15.265 | 14.769  | -3.36% |
| 1GB   | 14.808 | 14.618  | -1.30% |
| 20GB  | 24.673 | 24.425  | -1.02% |
| 100GB | 74.298 | 74.813  | 0.69%  |

Note that for this measurement I removed the recovery-related check in the
patch and just executed the commands on a standalone server:
-   if (InRecovery && reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)
+   if (reln->smgr_cached_nblocks[forknum] != InvalidBlockNumber)

OTOH, I also measured the recovery performance by setting up a hot standby
and executing a failover. The results were good and very close to the
previously reported recovery performance.

Recovery Time (in seconds)
| s_b   | master | patched | %reg   |
|-------|--------|---------|--------|
| 128MB | 3.043  | 2.977   | -2.22% |
| 1GB   | 3.417  | 3.41    | -0.21% |
| 20GB  | 20.597 | 2.409   | -755%  |
| 100GB | 66.862 | 2.409   | -2676% |

For 20GB s_b, from 20.597 s (Master) to 2.409 s (Patched).
For 100GB s_b, from 66.862 s (Master) to 2.409 s (Patched).
This mainly benefits large shared_buffers settings, without compromising
performance when shared_buffers is set to the default or a lower value.

If you could take another look, I'd appreciate any additional feedback or
comments. Thank you for your time.

Regards,
Kirk Jamison




RE: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Shinoda, Noriyoshi (PN Japan A Delivery)
Hi, thank you for the awesome feature.

As may have been discussed already, I think the 'name' column in
pg_stat_replication_slots would be more consistent if its column name and
data type matched the pg_replication_slots catalog.
The attached patch changes the column name and data type to slot_name and
'name', respectively.
It also renames the macro PG_STAT_GET_..._CLOS to PG_STAT_GET_..._COLS.

Regards,
Noriyoshi Shinoda
-Original Message-
From: Masahiko Sawada [mailto:masahiko.saw...@2ndquadrant.com] 
Sent: Monday, October 12, 2020 3:22 PM
To: Amit Kapila 
Cc: Dilip Kumar ; Magnus Hagander ; 
Tomas Vondra ; PostgreSQL Hackers 
; Ajin Cherian 
Subject: Re: Resetting spilled txn statistics in pg_stat_replication

On Thu, 8 Oct 2020 at 22:57, Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada 
>  wrote:
> >
> > On Wed, 7 Oct 2020 at 17:52, Amit Kapila  wrote:
> > >
> >
> > > I think after we are done with this the next step would be to 
> > > finish the streaming stats work [1]. We probably need to review 
> > > and add the test case in that patch. If nobody else shows up I 
> > > will pick it up and complete it.
> >
> > +1
> > I can review that patch.
> >
>
> I have rebased the stream stats patch and made minor modifications. I 
> haven't done a detailed review but one thing that I think is not 
> correct is:
> @@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb, 
> ReorderBufferTXN *txn)
>   txn->snapshot_now = NULL;
>   }
>
> +
> + rb->streamCount += 1;
> + rb->streamBytes += txn->total_size;
> +
> + /* Don't consider already streamed transaction. */
> + rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
> +
>   /* Process and send the changes to output plugin. */
>   ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
>   command_id, true);
>
> I think we should update the stream stats after 
> ReorderBufferProcessTXN rather than before because any error in 
> ReorderBufferProcessTXN can lead to an unnecessary update of stats.
> But OTOH, the txn flags, and other data can be changed after 
> ReorderBufferProcessTXN so we need to save them in a temporary 
> variable before calling the function.

Thank you for updating the patch!

I've not looked at the patch in depth yet, but couldn't RBTXN_IS_STREAMED be
cleared after ReorderBufferProcessTXN()?

BTW maybe it's better to start a new thread for this patch as the title is no 
longer relevant.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_stat_replication_slots.diff
Description: pg_stat_replication_slots.diff


Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-12 Thread Amit Kapila
On Mon, Oct 12, 2020 at 12:38 PM Greg Nancarrow  wrote:
>
> On Mon, Oct 12, 2020 at 5:36 PM Amit Kapila  wrote:
> >
> > >
> > > I have not thought about this yet but I don't understand your
> > > proposal. How will you set it only for the scope of Gather (Merge)?
> > > The execution of the Gather node will be interleaved with the Insert
> > > node, basically, you fetch a tuple from Gather, and then you need to
> > > Insert it. Can you be a bit more specific on what you have in mind for
> > > this?
> > >
> >
> > One more thing I would like to add here is that we can't exit
> > parallel mode while the workers are still running (performing the scan or
> > whatever other operation they are assigned) and the shared memory has not
> > been destroyed. Otherwise, the leader could perform unsafe things like
> > assigning new command ids, or the workers might send some error for
> > which the leader should still be in parallel mode. So, considering
> > this, I think we need quite similar checks (similar to parallel
> > inserts) to make even the Select part of Inserts parallel. If we do
> > that then you won't face many of the problems you mentioned above, like
> > executing triggers that contain parallel-unsafe stuff. I still feel it
> > will be beneficial to do this, as it will cover cases like Insert with
> > GatherMerge underneath it, which would otherwise not be possible.
> >
>
> Yes, I see what you mean: exiting parallel mode can't be done safely
> where I had hoped it could. So it looks like, even for making just the
> Select part of Insert parallel, I need to add checks (along the same
> lines as for Parallel Insert) to avoid the parallel Select in certain
> potentially unsafe cases.
>

Right, after we take care of that, we can think about assigning the xid and
similar things before we enter parallel mode. Say we have a function like
PrepareParallelMode (or PrepareEnterParallelMode) where we can check whether
we need to perform a parallel-safe write operation (as of now, Insert) and
then do the required preparation, like assigning the xid, etc. This might not
be ideal, because it is possible that we don't fetch even a single row that
needs to be inserted (say due to a filter condition) and then we will have
wasted an xid, but such cases probably don't occur often enough to worry
about. A minimal sketch of this idea follows.
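
A minimal sketch, purely for illustration: PrepareParallelMode() and the
isParallelModifyLeader flag are assumed names (not existing APIs), while
GetCurrentTransactionId() and EnterParallelMode() are the existing
primitives.

/*
 * Hypothetical helper: perform any preparation that is disallowed once
 * parallel mode has been entered, then enter parallel mode.
 */
static void
PrepareParallelMode(bool isParallelModifyLeader)
{
    Assert(!IsInParallelMode());

    if (isParallelModifyLeader)
    {
        /*
         * Assign the transaction id up front, since doing so inside
         * parallel mode is not allowed.  This may waste an xid when no
         * rows end up being inserted, but that should be rare.
         */
        (void) GetCurrentTransactionId();
    }

    EnterParallelMode();
}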

-- 
With Regards,
Amit Kapila.




Re: Remove some unnecessary if-condition

2020-10-12 Thread Ashutosh Bapat
On Fri, Oct 9, 2020 at 6:29 AM Hou, Zhijie  wrote:
>
> Hi
>
> I found some likely unnecessary if-conditions in the code.
>
> 1. Some checks in else branches seem unnecessary.
>
> In (/src/backend/replication/logical/reorderbuffer.c)
> ① @@ -4068,7 +4068,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, 
> ReorderBufferTXN *txn,
> > bool   found;
> > if (!found)
> > {
> >...
> > }
> > else if (found && chunk_seq != ent->last_chunk_seq + 1)
> >...
>
> The check of "found" in else if branch seems unnecessary.
>
> ② (/src/backend/utils/init/postinit.c)
> @@ -924,11 +924,8 @@ InitPostgres(const char *in_dbname, Oid dboid, const 
> char *username,
>
> > bool  bootstrap = IsBootstrapProcessingMode();
> > if (bootstrap)
> > {
> >...
> > }
> > else if(...)
> > {...}
> > else
> > {
> >if (!bootstrap)
> >{
> >...
> >}
> > }
>
> The check of "bootstrap" in else branch seems unnecessary.
>
>
> 2.In (/src/interfaces/ecpg/compatlib/informix.c)
> @@ -944,7 +944,7 @@ rupshift(char *str)
>
> > for (len--; str[len] && str[len] == ' '; len--);
>
> The first "str[len]" seems unnecessary since  " str[len] == ' '" will check 
> it as well.
>
> Do you think we should remove these if-conditions as a code cleanup?

To me it looks good to clean up the conditions as you have done in the
patch; the simplified forms are sketched below. Please add this to the
commitfest so that it's not forgotten. I have verified the code and the
conditions you are removing are indeed unnecessary, so the patch can be
marked as CFP right away.
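
For reference, the simplified forms would look roughly like this (fragments
for illustration only):

/* reorderbuffer.c: within the else branch, "found" is already known true */
else if (chunk_seq != ent->last_chunk_seq + 1)
{
    /* ... existing error handling ... */
}

/* informix.c: str[len] == ' ' already implies str[len] != '\0' */
for (len--; str[len] == ' '; len--)
    ;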

-- 
Best Wishes,
Ashutosh Bapat




Re: Wired if-statement in gen_partprune_steps_internal

2020-10-12 Thread Amit Langote
Hi,

On Thu, Oct 8, 2020 at 6:56 PM Andy Fan  wrote:
>
> Hi:
>
>   I found the following code in gen_partprune_steps_internal, in which
> the if-statement looks to be always true since list_length(result) > 1;
> I added an Assert(step_ids != NIL) and all the test cases passed.
> If the if-statement is always true, shall we remove it to avoid confusion?
>
>
> gen_partprune_steps_internal(GeneratePruningStepsContext *context,
>
>
> if (list_length(result) > 1)
> {
> List   *step_ids = NIL;
>
> foreach(lc, result)
> {
> PartitionPruneStep *step = lfirst(lc);
>
> step_ids = lappend_int(step_ids, step->step_id);
> }
> Assert(step_ids != NIL);
> if (step_ids != NIL) // This should always be true.
> {
> PartitionPruneStep *step;
>
> step = gen_prune_step_combine(context, step_ids,
>   PARTPRUNE_COMBINE_INTERSECT);
> result = lappend(result, step);
> }
> }

That seems fine to me.

Looking at this piece of code, I remembered that exactly the same
piece of logic is also present in gen_prune_steps_from_opexps(), which
looks like this:

/* Lastly, add a combine step to mutually AND these op steps, if needed */
if (list_length(opsteps) > 1)
{
List   *opstep_ids = NIL;

foreach(lc, opsteps)
{
PartitionPruneStep *step = lfirst(lc);

opstep_ids = lappend_int(opstep_ids, step->step_id);
}

if (opstep_ids != NIL)
return gen_prune_step_combine(context, opstep_ids,
  PARTPRUNE_COMBINE_INTERSECT);
return NULL;
}
else if (opsteps != NIL)
return linitial(opsteps);

I think we should remove this duplicative logic and return the
generated steps in a list from this function, which the code in
gen_partprune_steps_internal() then "combines" using an INTERSECT
step.  See attached a patch to show what I mean.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


partprune-code-tweaks.patch
Description: Binary data


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-10-12 Thread Julien Rouhaud
On Wed, Oct 7, 2020 at 9:53 AM Bruce Momjian  wrote:
>
> On Wed, Oct  7, 2020 at 10:42:49AM +0900, Michael Paquier wrote:
> > On Tue, Oct 06, 2020 at 09:22:29AM -0400, Bruce Momjian wrote:
> > > I propose moving the pg_stat_statements queryid hash code into the
> > > server (with a version number), and also adding a postgresql.conf
> > > variable that lets you control how detailed the queryid hash is
> > > computed.  This addresses the problem of people wanting different hash
> > > methods.
> >
> > In terms of making this part expendable in the future, there could be
> > a point in having an enum here, but are we sure that we will have a
> > need for that in the future?  What I get from this discussion is that
> > we want a unique source of truth that users can consume, and that the
> > only source of truth proposed is the PGSS hashing.  We may change the
> > way we compute the query ID in the future, for example if it gets
> > expanded to some utility statements, etc.  But that would be
> > controlled by the version number in the hash, not the GUC itself.
>
> Oh, if that is true, then I agree let's just go with the version number.

But there are many people who aren't happy with the current hashing
approach.  If we're going to move the computation into core, shouldn't
we listen to their complaints and let them pay the (probably quite
high) overhead of basing the hash on the name and/or fully qualified
name rather than the OID?
For instance, people using logical replication to upgrade to a newer
version may want to easily compare query performance on the new
version, or people with multi-tenant databases may want to ignore the
schema name to keep the number of distinct queryids low.

It would probably still be possible to have custom queryid hashing by
disabling the core one and computing a new one in a custom extension,
but that seems a bit hackish.

Jumping back to Tom's point that there are judgment calls on what is
examined or not, after a quick look I see at least two possibly
problematic ignored clauses:
- WITH TIES clause
- OVERRIDING clause

I personally think that they shouldn't be ignored, but I don't know if
they were simply forgotten or ignored on purpose.
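
For instance (an illustrative pair of statements, not taken from this
thread), if WITH TIES does not participate in the jumbling, these two queries
would collapse into a single queryid even though they can return different
numbers of rows:

SELECT * FROM t ORDER BY a FETCH FIRST 10 ROWS ONLY;
SELECT * FROM t ORDER BY a FETCH FIRST 10 ROWS WITH TIES;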




RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-12 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> I was thinking to have a GUC timeout parameter like statement_timeout.
> The backend waits for the setting value when resolving foreign
> transactions.

Me too.


> But this idea seems different. FDW can set its timeout
> via a transaction timeout API, is that right?

I'm not completely sure how the TM (application server) works, but probably
not.  The TM has a configuration parameter for the transaction timeout, and
the TM calls XAResource.setTransactionTimeout() with that value or a smaller
one as the argument.


> But even if FDW can set
> the timeout using a transaction timeout API, the problem that client
> libraries for some DBMS don't support interruptible functions still
> remains. The user can set a short time to the timeout but it also
> leads to unnecessary timeouts. Thoughts?

Unfortunately, I'm afraid we can do nothing about it.  If the DBMS's client
library doesn't support cancellation (e.g. doesn't respond to Ctrl+C or
provide a function that cancels in-progress processing), then the Postgres
user simply finds that queries can't be canceled (just as we experienced with
odbc_fdw).


 Regards
Takayuki Tsunakawa



Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-12 Thread Masahiko Sawada
On Mon, 12 Oct 2020 at 11:08, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > I also doubt how useful the per-foreign-server timeout setting you
> > mentioned before. For example, suppose the transaction involves with
> > three foreign servers that have different timeout setting, what if the
> > backend failed to commit on the first one of the server due to
> > timeout? Does it attempt to commit on the other two servers? Or does
> > it give up and return the control to the client? In the former case,
> > what if the backend failed again on one of the other two servers due
> > to timeout? The backend might end up waiting for all timeouts and in
> > practice the user is not aware of how many servers are involved with
> > the transaction, for example in a sharding. So It seems to be hard to
> > predict the total timeout. In the latter case, the backend might
> > succeed to commit on the other two nodes. Also, the timeout setting of
> > the first foreign server virtually is used as the whole foreign
> > transaction resolution timeout. However, the user cannot control the
> > order of resolution. So again it seems to be hard for the user to
> > predict the timeout. So If we have a timeout mechanism, I think it's
> > better if the user can control the timeout for each transaction.
> > Probably the same is true for the retry.
>
> I agree that the user can control the timeout per transaction, not per FDW.  
> I was just not sure if the Postgres core can define the timeout parameter and 
> the FDWs can follow its setting.  However, JTA defines a transaction timeout 
> API (not commit timeout, though), and each RM can choose to implement them.  
> So I think we can define the parameter and/or routines for the timeout in 
> core likewise.

I was thinking to have a GUC timeout parameter like statement_timeout.
The backend waits for the setting value when resolving foreign
transactions. But this idea seems different. FDW can set its timeout
via a transaction timeout API, is that right? But even if FDW can set
the timeout using a transaction timeout API, the problem that client
libraries for some DBMS don't support interruptible functions still
remains. The user can set a short time to the timeout but it also
leads to unnecessary timeouts. Thoughts?
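
To illustrate the statement_timeout-like idea only: the GUC name,
FdwXactState, and ForeignTransactionResolved() below are assumptions for the
sketch, not part of any posted patch.

/* Illustrative GUC (milliseconds); 0 would disable the timeout. */
int         foreign_xact_resolution_timeout = 60000;

/*
 * Sketch only: wait until the given foreign transaction has been resolved,
 * or until the timeout elapses.  FdwXactState and
 * ForeignTransactionResolved() are hypothetical names.
 */
static bool
WaitForForeignTransactionResolution(FdwXactState *fdwxact)
{
    TimestampTz start = GetCurrentTimestamp();

    while (!ForeignTransactionResolved(fdwxact))
    {
        CHECK_FOR_INTERRUPTS();

        if (foreign_xact_resolution_timeout > 0 &&
            TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
                                       foreign_xact_resolution_timeout))
            return false;       /* caller reports a timeout error/warning */

        pg_usleep(10000L);      /* 10ms; a latch wait in real code */
    }

    return true;
}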

>
>
> --
> public interface javax.transaction.xa.XAResource
>
> int getTransactionTimeout() throws XAException
> This method returns the transaction timeout value set for this XAResource
> instance. If XAResource.setTransactionTimeout was not used prior to invoking
> this method, the return value is the default timeout set for the resource
> manager; otherwise, the value used in the previous setTransactionTimeout
> call is returned.
>
> Throws: XAException
> An error has occurred. Possible exception values are: XAER_RMERR, XAER_RMFAIL.
>
> Returns:
> The transaction timeout values in seconds.
>
> boolean setTransactionTimeout(int seconds) throws XAException
> This method sets the transaction timeout value for this XAResource instance.
> Once set, this timeout value is effective until setTransactionTimeout is
> invoked again with a different value. To reset the timeout value to the
> default value used by the resource manager, set the value to zero.
>
> If the timeout operation is performed successfully, the method returns true; 
> otherwise false. If a resource
> manager does not support transaction timeout value to be set explicitly, this 
> method returns false.
>
> Parameters:
>
>  seconds
> A positive integer specifying the timeout value in seconds. Zero resets the
> transaction timeout value to the default one used by the resource manager.
> A negative value results in an XAException being thrown with the XAER_INVAL
> error code.
>
> Returns:
> true if transaction timeout value is set successfully; otherwise false.
>
> Throws: XAException
> An error has occurred. Possible exception values are: XAER_RMERR, 
> XAER_RMFAIL, or
> XAER_INVAL.
> --
>
>
>
> > For example in postgres_fdw, it executes a SQL in asynchronous manner
> > using by PQsendQuery(), PQconsumeInput() and PQgetResult() and so on
> (see do_sql_command() and pgfdw_get_result()). Therefore if the user
> pressed Ctrl+C, the remote query would be canceled and raise an ERROR.
>
> Yeah, as I replied to Horiguchi-san, postgres_fdw can cancel queries.  But 
> postgres_fdw is not ready to cancel connection establishment, is it?  At 
> present, the user needs to set connect_timeout parameter on the foreign 
> server to a reasonable short time so that it can respond quickly to 
> cancellation requests.  Alternately, we can modify postgres_fdw to use 
> libpq's asynchronous connect functions.

Yes, I think using asynchronous connect functions seems a good idea.

> Another issue is that the Postgres manual does not stipulate anything about 
> cancellation of FDW processing.  That's why I said 

Re: Wrong statistics for size of XLOG_SWITCH during pg_waldump.

2020-10-12 Thread movead...@highgo.ca

Thanks for reply.

>If you wish to add more information about a XLOG_SWITCH record, I
>don't think that changing the signature of XLogDumpRecordLen() is
>adapted because the record length of this record is defined as
>Horiguchi-san mentioned upthread, and the meaning of junk_len is
>confusing here.  It seems to me that any extra information should be
>added to xlog_desc() where there should be an extra code path for
>(info == XLOG_SWITCH).  XLogReaderState should have all the
>information you are looking for.
We have two places that use the 'junk_len': one is when we show the
detailed record information, and the other is when we compute statistics on
the share of each WAL record kind (with --stat=record). The second place
does not run xlog_desc(), so that is not a good place to do it.

I still cannot understand why changing the signature of XLogDumpRecordLen()
is not suitable. Maybe we can add a new function to calculate the 'junk_len'
and rename 'junk_len' to 'skipped_size' or 'switched_size'?
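
For illustration, a sketch of how the skipped size for an XLOG_SWITCH record
could be computed (the function name is made up; WalSegSz and
XLogSegmentOffset() are the existing pg_waldump/xlog_internal facilities):

/*
 * Sketch only: the number of bytes skipped after an XLOG_SWITCH record,
 * i.e. the padding between the end of the record and the next WAL segment
 * boundary.
 */
static uint32
XLogDumpGetSwitchJunkLen(XLogReaderState *record)
{
    uint32      seg_offset = XLogSegmentOffset(record->EndRecPtr, WalSegSz);

    if (seg_offset == 0)
        return 0;               /* the record already ended on a boundary */

    return WalSegSz - seg_offset;
}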




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-12 Thread Greg Nancarrow
On Mon, Oct 12, 2020 at 5:36 PM Amit Kapila  wrote:
>
> >
> > I have not thought about this yet but I don't understand your
> > proposal. How will you set it only for the scope of Gather (Merge)?
> > The execution of the Gather node will be interleaved with the Insert
> > node, basically, you fetch a tuple from Gather, and then you need to
> > Insert it. Can you be a bit more specific on what you have in mind for
> > this?
> >
>
> One more thing I would like to add here is that we can't exit
> parallel mode while the workers are still running (performing the scan or
> whatever other operation they are assigned) and the shared memory has not
> been destroyed. Otherwise, the leader could perform unsafe things like
> assigning new command ids, or the workers might send some error for
> which the leader should still be in parallel mode. So, considering
> this, I think we need quite similar checks (similar to parallel
> inserts) to make even the Select part of Inserts parallel. If we do
> that then you won't face many of the problems you mentioned above, like
> executing triggers that contain parallel-unsafe stuff. I still feel it
> will be beneficial to do this, as it will cover cases like Insert with
> GatherMerge underneath it, which would otherwise not be possible.
>

Yes, I see what you mean: exiting parallel mode can't be done safely
where I had hoped it could. So it looks like, even for making just the
Select part of Insert parallel, I need to add checks (along the same
lines as for Parallel Insert) to avoid the parallel Select in certain
potentially unsafe cases.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: speed up unicode normalization quick check

2020-10-12 Thread Masahiko Sawada
On Mon, 12 Oct 2020 at 15:27, Michael Paquier  wrote:
>
> On Mon, Oct 12, 2020 at 02:43:06PM +0900, Masahiko Sawada wrote:
> > The following warning recently started to be shown in my
> > environment(FreeBSD clang 8.0.1). Maybe it is relevant with this
> > commit:
> >
> > unicode_norm.c:478:12: warning: implicit declaration of function
> > 'htonl' is invalid in C99 [-Wimplicit-function-declaration]
> > hashkey = htonl(ch);
> >   ^
>
> Thanks, it is of course relevant to this commit.  None of the
> BSD animals complain here.  So, while it would be tempting to have an
> extra include with arpa/inet.h, I think that it would be better to
> just use pg_hton32() in pg_bswap.h, as per the attached.  Does that
> take care of your problem?

Thank you for the patch!

Yes, this patch resolves the problem.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-12 Thread Amit Kapila
On Mon, Oct 12, 2020 at 9:01 AM Amit Kapila  wrote:
>
> On Mon, Oct 12, 2020 at 6:51 AM Greg Nancarrow  wrote:
> >
>
> > Let me know your thoughts on how to deal with these issues.
> > Can you see a problem with only having parallel-mode set for scope of
> > Gather/GatherMerge, or do you have some other idea?
> >
>
> I have not thought about this yet but I don't understand your
> proposal. How will you set it only for the scope of Gather (Merge)?
> The execution of the Gather node will be interleaved with the Insert
> node, basically, you fetch a tuple from Gather, and then you need to
> Insert it. Can you be a bit more specific on what you have in mind for
> this?
>

One more thing I would like to add here is that we can't exit
parallel mode while the workers are still running (performing the scan or
whatever other operation they are assigned) and the shared memory has not
been destroyed. Otherwise, the leader could perform unsafe things like
assigning new command ids, or the workers might send some error for
which the leader should still be in parallel mode. So, considering
this, I think we need quite similar checks (similar to parallel
inserts) to make even the Select part of Inserts parallel. If we do
that then you won't face many of the problems you mentioned above, like
executing triggers that contain parallel-unsafe stuff. I still feel it
will be beneficial to do this, as it will cover cases like Insert with
GatherMerge underneath it, which would otherwise not be possible.

-- 
With Regards,
Amit Kapila.




Re: speed up unicode normalization quick check

2020-10-12 Thread Michael Paquier
On Mon, Oct 12, 2020 at 02:43:06PM +0900, Masahiko Sawada wrote:
> The following warning recently started to be shown in my
> environment(FreeBSD clang 8.0.1). Maybe it is relevant with this
> commit:
> 
> unicode_norm.c:478:12: warning: implicit declaration of function
> 'htonl' is invalid in C99 [-Wimplicit-function-declaration]
> hashkey = htonl(ch);
>   ^

Thanks, it is of course relevant to this commit.  None of the
BSD animals complain here.  So, while it would be tempting to have an
extra include with arpa/inet.h, I think that it would be better to
just use pg_hton32() in pg_bswap.h, as per the attached.  Does that
take care of your problem?
--
Michael
diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index 626645ac87..4bb6a0f587 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -23,6 +23,7 @@
 #ifndef FRONTEND
 #include "common/unicode_normprops_table.h"
 #endif
+#include "port/pg_bswap.h"
 
 #ifndef FRONTEND
 #define ALLOC(size) palloc(size)
@@ -475,7 +476,7 @@ qc_hash_lookup(pg_wchar ch, const pg_unicode_norminfo *norminfo)
 	 * Compute the hash function. The hash key is the codepoint with the bytes
 	 * in network order.
 	 */
-	hashkey = htonl(ch);
+	hashkey = pg_hton32(ch);
	h = norminfo->hash(&hashkey);
 
 	/* An out-of-range result implies no match */


signature.asc
Description: PGP signature


Re: Resetting spilled txn statistics in pg_stat_replication

2020-10-12 Thread Masahiko Sawada
On Thu, 8 Oct 2020 at 22:57, Amit Kapila  wrote:
>
> On Thu, Oct 8, 2020 at 7:46 AM Masahiko Sawada
>  wrote:
> >
> > On Wed, 7 Oct 2020 at 17:52, Amit Kapila  wrote:
> > >
> >
> > > I think after we are done with this the next
> > > step would be to finish the streaming stats work [1]. We probably need
> > > to review and add the test case in that patch. If nobody else shows up
> > > I will pick it up and complete it.
> >
> > +1
> > I can review that patch.
> >
>
> I have rebased the stream stats patch and made minor modifications. I
> haven't done a detailed review but one thing that I think is not
> correct is:
> @@ -3496,10 +3499,18 @@ ReorderBufferStreamTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>   txn->snapshot_now = NULL;
>   }
>
> +
> + rb->streamCount += 1;
> + rb->streamBytes += txn->total_size;
> +
> + /* Don't consider already streamed transaction. */
> + rb->streamTxns += (rbtxn_is_streamed(txn)) ? 0 : 1;
> +
>   /* Process and send the changes to output plugin. */
>   ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
>   command_id, true);
>
> I think we should update the stream stats after
> ReorderBufferProcessTXN rather than before because any error in
> ReorderBufferProcessTXN can lead to an unnecessary update of stats.
> But OTOH, the txn flags, and other data can be changed after
> ReorderBufferProcessTXN so we need to save them in a temporary
> variable before calling the function.

Thank you for updating the patch!

I've not looked at the patch in depth yet, but couldn't RBTXN_IS_STREAMED be
cleared after ReorderBufferProcessTXN()?
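
To make the ordering discussed above concrete, here is a rough sketch
(illustrative only; field names follow the patch fragment quoted above):
capture the pre-existing streamed flag and size before
ReorderBufferProcessTXN(), and bump the counters only after it returns.

{
    bool        was_streamed = rbtxn_is_streamed(txn);
    Size        stream_bytes = txn->total_size;

    /* Process and send the changes to output plugin. */
    ReorderBufferProcessTXN(rb, txn, InvalidXLogRecPtr, snapshot_now,
                            command_id, true);

    /* Update stats only after a successful pass over the transaction. */
    rb->streamCount += 1;
    rb->streamBytes += stream_bytes;
    if (!was_streamed)
        rb->streamTxns += 1;
}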

BTW maybe it's better to start a new thread for this patch as the
title is no longer relevant.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services