Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2021-08-03 Thread Masahiko Sawada
Hi,

On Mon, Aug 2, 2021 at 10:52 PM houzj.f...@fujitsu.com
 wrote:
>
>
> Hi hackers,
>
> When testing some other logical replication related patches, I found two
> unexpected behaviours with ALTER SUBSCRIPTION ADD/DROP PUBLICATION.
>
> (1)
> When I execute the following SQL statements [1], the data of the tables
> registered to 'pub' wasn't copied to the subscriber side, which means the
> tablesync worker didn't start.
>
> -sub originally had 2 pub nodes(pub,pub2)
> ALTER SUBSCRIPTION sub drop PUBLICATION pub;
> ALTER SUBSCRIPTION sub add PUBLICATION pub;
> -
>
> (2)
> And when I execute the following statements, the data of the table
> registered to 'pub2' is synced again.
>
> -sub originally had 2 pub nodes(pub,pub2)
> ALTER SUBSCRIPTION sub drop PUBLICATION pub;
> ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
> -

I could reproduce this issue with the above steps.

>
> After looking into this problem, I think the reason is that
> ALTER SUBSCRIPTION ADD/DROP PUBLICATION misuses the function
> AlterSubscription_refresh().
>
> For DROP cases:
>
> Currently, the function AlterSubscription_refresh() first fetches the
> target tables from the target publication, and also fetches the tables on
> the subscriber side from pg_subscription_rel. Then it checks each table
> from the local pg_subscription_rel; if the table does not exist among the
> tables fetched from the target publication, it drops the table.
>
> The logic above only works for SET PUBLICATION. However, when we DROP
> PUBLICATION, the tables fetched from the target publication are actually
> the tables that need to be dropped. If we reuse the above logic, it will
> drop the wrong tables, which results in the unexpected behaviour seen in
> (1) and (2). (ADD PUBLICATION has a similar problem.)

Yes. For example, suppose that the publisher has two publications pub1
and pub2 for table1 and table2, respectively. And we create a
subscription subscribing to those two publications.

postgres(1:74454)=# \dRs sub
  List of subscriptions
 Name |  Owner   | Enabled | Publication
--+--+-+-
 sub  | masahiko | t   | {pub1,pub2}
(1 row)

postgres(1:74454)=# select srsubid, srrelid::regclass::text,
srsubstate, srsublsn from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
-+-++---
   16394 | table1  | r  | 0/170F388
   16394 | table2  | r  | 0/170F388
(2 rows)

After dropping pub1 from the subscription, 'table2' associated with
'pub2' is removed:

postgres(1:74454)=# alter subscription sub drop publication pub1;
ALTER SUBSCRIPTION
postgres(1:74454)=# select srsubid, srrelid::regclass::text,
srsubstate, srsublsn from pg_subscription_rel;
 srsubid | srrelid | srsubstate | srsublsn
-+-++---
   16394 | table1  | r  | 0/170F388
(1 row)

> So, I think we'd better fix this problem. I tried adding some additional
> checks in AlterSubscription_refresh() which can avoid the problem, like
> the attached patch. I'm not sure whether we need to refactor further.

I've not looked at the patch deeply yet but I think that the following
one-line change seems to fix the cases you reported, although I might
be missing something:

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 22ae982328..c74d6d51d6 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -1068,7 +1068,7 @@ AlterSubscription(ParseState *pstate, AlterSubscriptionStmt *stmt,
     PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION with refresh");

     /* Only refresh the added/dropped list of publications. */
-    sub->publications = stmt->publication;
+    sub->publications = publist;

     AlterSubscription_refresh(sub, opts.copy_data);
 }

Also, I think we need to add some regression tests for this.
Currently, subscription.sql has tests for ADD/DROP PUBLICATION but
they don't check pg_subscription_rel.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Autovacuum on partitioned table (autoanalyze)

2021-08-03 Thread Kyotaro Horiguchi
At Thu, 29 Jul 2021 18:03:55 -0700, Andres Freund  wrote in 
> And if one instead inverts the order of pgstat_report_analyze() and
> pgstat_report_anl_ancestors() one gets a slightly different problem: A manual
> ANALYZE of the partition root results in the partition root having a non-zero
> changes_since_analyze afterwards. expand_vacuum() causes child partitions to
> be added to the list of relations, which *first* causes the partition root to
> be analyzed, and *then* partitions. The partitions then report their
> changes_since_analyze upwards.

For the last behavior, as Andres suggested, the scan order needs to be
reversed (or to be in the same order as autovacuum). Since
find_all_inheritors scans breadth-first, just reversing its result
works. Breadth-first order is currently not part of the contract of the
function's interface. I suppose we can add such a contract?

Finally, I ended up with the attached.

 - reverse the relation order within a tree
 - reverse the order of pgstat_report_analyze and pgstat_report_anl_ancestors.

Inheritance expansion is performed on a per-tree basis, so it works fine
even if multiple relations are given to vacuum().
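
For illustration, the reversal itself is cheap with the pg_list API.
This is only a sketch (the function name is made up; the actual patch
does this per tree inside vacuum()):

#include "postgres.h"
#include "nodes/pg_list.h"

/*
 * Sketch: reverse a breadth-first list of relation OIDs (as returned by
 * find_all_inheritors) so that leaf partitions come before their
 * ancestors.  lcons_oid() prepends, so walking the input front-to-back
 * yields the reversed order.
 */
static List *
reverse_tree_order(List *tree_oids)
{
	List	   *reversed = NIL;
	ListCell   *lc;

	foreach(lc, tree_oids)
		reversed = lcons_oid(lfirst_oid(lc), reversed);

	return reversed;
}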


> I don't think the code as is is fit for v14. It looks like it was rewritten
> with a new approach just before the freeze ([1]), and as far as I can tell the
> concerns I quoted above weren't even discussed in the whole thread.  Alvaro,
> any comments?
> 
> Greetings,
> 
> Andres Freund
> 
> [1] 
> https://www.postgresql.org/message-id/20210408032235.GA6842%40alvherre.pgsql

FYI: this behaves as follows.

CREATE TABLE p (a int) PARTITION BY RANGE (a);
CREATE TABLE c1 PARTITION OF p FOR VALUES FROM (0) TO (200) PARTITION BY RANGE(a);
CREATE TABLE c11 PARTITION OF c1 FOR VALUES FROM (0) TO (100);
CREATE TABLE c12 PARTITION OF c1 FOR VALUES FROM (100) TO (200);
CREATE TABLE c2 PARTITION OF p FOR VALUES FROM (200) TO (400) PARTITION BY RANGE(a);
CREATE TABLE c21 PARTITION OF c2 FOR VALUES FROM (200) TO (300);
CREATE TABLE c22 PARTITION OF c2 FOR VALUES FROM (300) TO (400);
INSERT INTO p (SELECT a FROM generate_series(0, 400 - 1) a, generate_series(0, 10) b);

INSERT INTO p (SELECT 200 FROM generate_series(0, 99));

SELECT relid, relname, n_mod_since_analyze FROM pg_stat_user_tables ORDER BY relid;
 relid | relname | n_mod_since_analyze 
---+-+-
 16426 | p   |   0
 16429 | c1  |   0
 16432 | c11 |   0
 16435 | c12 |   0
 16438 | c2  |   0
 16441 | c21 | 100
 16444 | c22 |   0
 16447 | sa  |   0
(8 rows)

After "ANALYZE c21;"
 relid | relname | n_mod_since_analyze 
---+-+-
 16426 | p   | 100
 16429 | c1  |   0
 16432 | c11 |   0
 16435 | c12 |   0
 16438 | c2  | 100
 16441 | c21 |   0
 16444 | c22 |   0
 16447 | sa  |   0

After "ANALYZE c2;"
 relid | relname | n_mod_since_analyze 
---+-+-
 16426 | p   | 100
 16429 | c1  |   0
 16432 | c11 |   0
 16435 | c12 |   0
 16438 | c2  |   0
 16441 | c21 |   0
 16444 | c22 |   0
 16447 | sa  |   0

After "ANALYZE p;"
(all zero)


However, this gives a strange-looking side-effect, which affected
regression results.

=# VACUUM ANALYZE p(a, a);
ERROR:  column "a" of relation "c22" appears more than once

(Previously it complained about p.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 16f7602f1b7755f288c508f1e57e0eae3c305813 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 4 Aug 2021 13:40:59 +0900
Subject: [PATCH] Fix propagation of changes_since_analyze on manual ANALYZE
 of partitioned tables

The analyze-stats machinery assumed bottom-to-top-ordered relation
scans, but the order was actually the opposite in manual ANALYZE.
In addition to that, the current code tries to propagate
changes_since_analyze to parents after stats reporting, which resets
the number to be propagated.

As a result, when doing a manual ANALYZE on a partition,
changes_since_analyze vanishes instead of being propagated to the
parents. On the other hand, when doing that on a partitioned table,
the leaf relations end up with bogus stats values.

To fix this, reverse the order of relations when running a manual
ANALYZE and move stats reporting after stats propagation.
---
 src/backend/catalog/pg_inherits.c|  3 ++-
 src/backend/commands/analyze.c   | 26 +-
 src/backend/commands/vacuum.c| 25 +++--
 src/test/regress/expected/vacuum.out | 18 +-
 4 files changed, 39 insertions(+), 33 deletions(-)

RE: Skipping logical replication transactions on subscriber side

2021-08-03 Thread houzj.f...@fujitsu.com
On Tuesday, August 3, 2021 2:49 PM Masahiko Sawada  
wrote:
> 
> I've attached new patches that incorporate all comments I got so far.
> Please review them.

Hi,

I had a few comments for the 0003 patch.

1).
-  This clause alters parameters originally set by
-  .  See there for more
-  information.  The parameters that can be altered
-  are slot_name,
-  synchronous_commit,
-  binary, and
-  streaming.
+  This clause sets or resets a subscription option. The parameters that can be
+  set are the parameters originally set by :
+  slot_name, synchronous_commit,
+  binary, streaming.
+ 
+ 
+   The parameters that can be reset are: streaming,
+   binary, synchronous_commit.

Maybe the doc would look better like the following?

+  This clause alters parameters originally set by
+  .  See there for more
+  information.  The parameters that can be set
+  are slot_name,
+  synchronous_commit,
+  binary, and
+  streaming.
+ 
+ 
+   The parameters that can be reset are: streaming,
+   binary, synchronous_commit.

2).
-   opts->create_slot = defGetBoolean(defel);
+   if (!is_reset)
+   opts->create_slot = defGetBoolean(defel);
}

Since we only support RESET for streaming/binary/synchronous_commit, it
might be unnecessary to add the 'if (!is_reset)' check for the other
options.

3).
typedef struct AlterSubscriptionStmt
{
NodeTag type;
AlterSubscriptionType kind; /* ALTER_SUBSCRIPTION_OPTIONS, etc */

Since the patch removes the enum value 'ALTER_SUBSCRIPTION_OPTIONS',
it'd be better to change the comment here as well.

Best regards,
houzj


Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-03 Thread Greg Nancarrow
On Wed, Aug 4, 2021 at 3:21 AM Robert Haas  wrote:
>
> The idea I sort of had floating around in my mind is a little
> different than what Greg has implemented. I was thinking that we could
> just skip SerializeSnapshot and the corresponding shm_toc_allocate()
> if !IsolationUsesXactSnapshot(). Then on the restore side we could
> just call shm_toc_lookup() with noError = true and skip
> RestoreTransactionSnapshot/RestoreSnapshot if it returns NULL.

I've tried to follow your description and have attached a patch to
hopefully match it, but it doesn't pass "make check-world".
Perhaps I messed something up (apologies if so), or additional changes
are needed to match what you had in mind or correct additional issues
you didn't foresee?

t/001_pgbench_with_server.pl .. 10/?
#   Failed test 'pgbench scale 1 initialization status (got 1 vs expected 0)'
#   at t/001_pgbench_with_server.pl line 108.
...
# creating primary keys...
# pgbench: fatal: query failed: ERROR:  cannot take query snapshot
during a parallel operation
# CONTEXT:  parallel worker
# pgbench: query was: alter table pgbench_accounts add primary key (aid)


> I don't know why Greg's patch is changing anything related to the
> active snapshot (as opposed to the transaction snapshot). Maybe
> there's a reason why we need that change, but I don't know what it is.

I don't think my v2/v5 patch is changing anything related to the
active snapshot (is it?).
It's restoring the serialized active snapshot, installing it as the
transaction snapshot and setting it as the active snapshot.
The patch removes the additionally-acquired transaction snapshot in
InitializeParallelDSM (which seems to be a later snapshot than the one
actually being used in the execution state for the statement, back in
the leader, if I recall correctly). Basically, in the parallel
workers, I've tried to match the snapshot setup to that used in the
leader.
If the v2/v5 patch doesn't work correctly for some isolation level,
I've yet to find it (but can't absolutely rule out there's some case
not accounted for).

Regards,
Greg Nancarrow
Fujitsu Australia


v6-0001-RH-Fix-parallel-worker-failed-assertion-and-coredump.patch
Description: Binary data


Re: Use POPCNT on MSVC

2021-08-03 Thread David Rowley
On Tue, 3 Aug 2021 at 22:43, John Naylor  wrote:
> 1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a 
> bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would help 
> it look better. But then looking around, other platforms have intrinsics 
> coded, but for some reason they're put in pg_popcount64_slow(), where the 
> compiler will emit instructions we could easily write ourselves in C (and 
> without #ifdefs) since without the right CFLAGS these intrinsics won't emit a 
> popcnt instruction. Is MSVC different in that regard? If so, it might be 
> worth a comment.

Yeah, the names no longer fit so well after stuffing the MSVC
intrinsic into the asm function.  The reason I did it that way is down
to what I read in the docs. Namely:

"If you run code that uses these intrinsics on hardware that doesn't
support the popcnt instruction, the results are unpredictable."

So, it has to go somewhere where we're sure the CPU supports POPCNT
and that seemed like the correct place.

From what I understand of GCC and __builtin_popcountl(), the code
it'll output will depend on the -mpopcnt flag.  So having
__builtin_popcountl() does not mean the popcnt instruction will be
used. The Microsoft documentation indicates that's not the case for
__popcnt().

> 2. (defined(_MSC_VER) && defined(_WIN64)  lead to a runtime check for the 
> CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange 
> because the latter symbol comes from a specific configure test -- the two 
> don't seem equivalent, but maybe they are because of what MSVC does? That 
> would be nice to spell out here.

The problem there is that we define HAVE_X86_64_POPCNTQ based on the
outcome of configure so it does not get set for MSVC.  Maybe it's
worth coming up with some other constant that can be defined or we
could just do:

#if defined(_MSC_VER) && defined(_WIN64)
#define HAVE_X86_64_POPCNTQ
#endif
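
For what it's worth, the runtime check itself is simple enough on MSVC.
Here is an untested sketch only (CPUID leaf 1 reports POPCNT support in
ECX bit 23; the function names are made up):

#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#include <stdbool.h>

/*
 * Sketch: detect POPCNT support at runtime before ever calling
 * __popcnt64(), since the intrinsic's result is undefined on CPUs
 * that lack the instruction.
 */
static bool
popcnt_available(void)
{
	int			regs[4];

	__cpuid(regs, 1);
	return (regs[2] & (1 << 23)) != 0;	/* ECX bit 23 = POPCNT */
}

static int
popcount64_fast(unsigned __int64 word)
{
	return (int) __popcnt64(word);
}
#endif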

> I know the present state of affairs works a bit different than what you 
> proposed a couple years ago and that something had to change for portability 
> reasons I didn't quite understand, but I think if we change things here we 
> should at least try to have things fit together a bit more nicely.

I think the reason for the asm is that __builtin_popcountl does not
mean popcnt will be used. Maybe we could have done something like
compile pg_bitutils.c with -mpopcnt, and have kept the run-time check,
but the problem there is that means that the compiler might end up
using that instruction in some other function in that file that we
don't want it to. It looks like my patch in [1] did pass the -mpopcnt
flag which Tom fixed.

> (Side note, but sort of related to #1 above: non-x86 platforms have to 
> indirect through a function pointer even though they have no fast 
> implementation to make it worth their while. It would be better for them if 
> the "slow" implementation was called static inline or at least a direct 
> function call, but that's a separate thread.)

Hmm, yeah. I see there are a few more usages of pg_popcount() than when
I looked last.  It would be fairly easy to get rid of the
pg_popcount64/pg_popcount32 function call in that. A separate patch
though.

David

[1] 
https://www.postgresql.org/message-id/cakjs1f9wtagg1tpejnd18hiqw5gak59fq6wk-vfdakehyrg...@mail.gmail.com




Re: Failed transaction statistics to measure the logical replication progress

2021-08-03 Thread Amit Kapila
On Wed, Aug 4, 2021 at 6:19 AM Masahiko Sawada  wrote:
>
> On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila  wrote:
> >
> > I was trying to think based on similar counters in pg_stat_database
> > but if you think there is a value in showing errored and actual
> > rollbacked transactions separately then we can do that but how do you
> > think one can make use of it?
>
> I'm concerned that a value that includes both errored and actually
> rolled-back transactions doesn't make sense in practice, since the
> number of errored transactions can easily get increased once a
> conflict happens. IMO the errored transaction count isn't necessarily
> needed, since the number of (successive) errors that happened on the
> subscription is tracked by the pg_stat_subscription_errors view.
>

It sounds awkward to display two of the xact (xact_commit,
xact_rollback) counters in one view and then the other similar counter
(xact_error or something like that) in another view. Isn't it better
to display all of them together possibly in pg_stat_subscription? I
guess it might be a bit tricky to track counters for tablesync workers
separately but we can track them in the corresponding subscription.

> So it
> might be enough to have actually rolled-back transactions. If this value
> is high, it's likely that many rolled-back transactions are streamed,
> unnecessarily consuming network bandwidth. So the user might want to
> increase logical_decoding_work_mem to suppress transactions getting
> streamed.
>

Okay, we probably want to document such a use case.

-- 
With Regards,
Amit Kapila.




Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Peter Geoghegan
On Tue, Aug 3, 2021 at 12:27 PM Matthias van de Meent
 wrote:
> This change makes it easier and more worthwhile to implement a further
> optimization for the checkpointer and/or buffer manager to determine
> that 1.) this page is now empty, and that 2.) we can therefore write a
> specialized WAL record specifically tuned for empty pages instead of
> FPI records. No additional pages are changed, because each time the
> line pointer array is shrunk, we've already either marked dead tuples
> as unused (2nd phase vacuum) or removed HOT line pointers / truncated
> dead tuples to lp_dead line pointers (heap_page_prune).

We generate an FPI the first time a page is modified after a
checkpoint. The FPI consists of the page *after* it has been modified.
Presumably this optimization would need the heap page to be 100%
empty, so we're left with what seems to me to be a very narrow target
for optimization; something that is naturally rare.

A fully-empty page seems quite unlikely in the case of xl_heap_vacuum
records, and impossible in the case of xl_heap_prune records, even
with all the patches working together. Have I missed something?

-- 
Peter Geoghegan




Re: Commitfest overflow

2021-08-03 Thread Thomas Munro
On Wed, Aug 4, 2021 at 8:23 AM Greg Stark  wrote:
> On Tue, 3 Aug 2021 at 11:56, Bruce Momjian  wrote:
>
> > I wonder if our lack of in-person developer meetings is causing some of
> > our issues to not get closed.
>
> That's an interesting thought. Every year there are some especially
> contentious patches that don't get dealt with until the in-person
> meeting which pushes people to make a call. However it seems like
> that's only a handful and not really enough to explain the volume.

I think there might be a higher number of work-in-progress patches
these days, which represent ongoing collaborative efforts, and are not
expected to be committed soon, but are registered to attract the
attention of humans and robots.  Perhaps if there were a separate
status for that, it would be clearer.  Or perhaps they don't belong in
the "commit" fest.




Re: Slow standby snapshot

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 22:23:58 +0300, Michail Nikolaev wrote:
> > I'm not sure that we need to care as much about the cost of
> > KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes 
> > that
> > less frequent.
> 
> It is still about 5-7% of CPU for a typical workload, a considerable
> amount for me.

I'm not saying we shouldn't optimize things. Just that it's less pressing. And
the price we're willing to pay for such optimizations may have changed.


> And a lot of systems still work on 12 and 13.

I don't see us backporting performance improvements around this to 12 and 13,
so I don't think that matters much... We've done that a few times, but usually
when there's some accidentally quadratic behaviour or such.


> > But more importantly, it'd not be hard to maintain an
> > occasionally (or opportunistically) maintained denser version for
> > GetSnapshotData() - there's only a single writer, making the concurrency
> > issues a lot simpler.
> 
> Could you please explain it in more detail?
> Single writer and GetSnapshotData() already exclusively hold
> ProcArrayLock at the moment of KnownAssignedXidsRemove,
> KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.

GetSnapshotData() only locks ProcArrayLock in shared mode.

The problem is that we don't want to add a lot of work to
KnownAssignedXidsAdd/Remove, because very often nobody will build a snapshot
for that moment, and building a sorted, gap-free, linear array of xids isn't
cheap. In my experience it's more common to be bottlenecked by replay CPU
performance than by replica snapshot building.

During recovery, where there's only one writer to the procarray / known xids,
it might not be hard to opportunistically build a dense version of the known
xids whenever a snapshot is built.

Greetings,

Andres Freund




RE: [HACKERS] logical decoding of two-phase transactions

2021-08-03 Thread tanghy.f...@fujitsu.com
On Tuesday, August 3, 2021 6:03 PM vignesh C wrote:
> 
> On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila  wrote:
> >
> > On Tue, Aug 3, 2021 at 6:17 AM Peter Smith  wrote:
> > >
> > > Please find attached the latest patch set v102*
> > >
> >
> > I have made minor modifications in the comments and docs, please see
> > attached. Can you please check whether the names of contributors in
> > the commit message are correct or do we need to change it?
> 
The patch applies neatly, the tests pass, and the documentation builds
with the updates provided. I don't have any further comments. The patch
looks good to me.
> 

I did some stress tests on the patch and found no issues.
It also works well when using synchronous replication.
So the patch LGTM.

Regards
Tang


Re: Use POPCNT on MSVC

2021-08-03 Thread Thomas Munro
On Tue, Aug 3, 2021 at 10:43 PM John Naylor
 wrote:
> (Side note, but sort of related to #1 above: non-x86 platforms have to 
> indirect through a function pointer even though they have no fast 
> implementation to make it worth their while. It would be better for them if 
> the "slow" implementation was called static inline or at least a direct 
> function call, but that's a separate thread.)

+1

I haven't looked into whether we could benefit from it in real use
cases, but it seems like it'd also be nice if pg_popcount() were a
candidate for auto-vectorisation and inlining.  For example, NEON has
vector popcount, and for Intel/AMD there is a shuffle-based AVX2 trick
that at least Clang produces automatically[1].  We're obstructing that
by doing function dispatch at individual word level, and using inline
assembler instead of builtins.
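
For illustration only, a bulk interface is the kind of shape that
auto-vectorizes (a sketch assuming GCC/Clang builtins, not proposed
code):

#include <stddef.h>
#include <stdint.h>

/*
 * Sketch: popcount over a whole buffer.  Because the builtin is applied
 * in a simple counted loop, the compiler is free to auto-vectorize it
 * (e.g. with the AVX2 shuffle trick), which per-word dispatch through a
 * function pointer prevents.
 */
static uint64_t
popcount_buffer(const uint64_t *words, size_t nwords)
{
	uint64_t	total = 0;

	for (size_t i = 0; i < nwords; i++)
		total += (uint64_t) __builtin_popcountll(words[i]);

	return total;
}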

[1] https://arxiv.org/abs/1611.07612




Re: Use generation context to speed up tuplesorts

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 10:59:25 +1200, David Rowley wrote:
> On Sat, 31 Jul 2021 at 08:38, Andres Freund  wrote:
> > There is at least one path using tuplecontext that reaches code that
> > could end up freeing memory to a significant enough degree to care about
> > generation.c effectively not using that memory:
> > tuplesort_putdatum()->datumCopy()->EOH_flatten_into()
> > On a quick look I didn't find any expanded record user that frees
> > nontrivial amounts of memory, but I didn't look all that carefully.
> 
> I guess we could just use a normal context for datum sorts if we
> thought that might be a problem.

I think that's probably a cure worse than the disease. I suspect datum sorts
can benefit from the higher density quite a bit...


> I'm not too familiar with the expanded object code, but I'm struggling
> to imagine why anything would need to do a pfree in there. We just do
> EOH_get_flat_size() to determine how big to make the allocation then
> allocate some memory for EOH_flatten_into() to use to expand the
> object into.

I can see some scenarios with a bit more creative uses of expanded
objects. We've e.g. been talking about using EA to avoid repeated and partial
detoasting overhead and you might need to do some more toast fetches when
flattening. Toast fetches always allocate, and if the fetch were only for
later parts of the tuple, the fetched data would need to be freed.

It's probably fine to deal with this at later time, and just leave a comment
somewhere.

It could be addressed by having a bump-style allocator combined with
freelists. It's not like the tuplesort.c case is actually interested in the
generational behaviour of generation.c (which makes freelists uninteresting),
it's just that generation.c is the densest allocator that we have right now...

Greetings,

Andres Freund




Re: Failed transaction statistics to measure the logical replication progress

2021-08-03 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 6:11 PM Amit Kapila  wrote:
>
> On Tue, Aug 3, 2021 at 10:59 AM Masahiko Sawada  wrote:
> >
> > On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila  wrote:
> > >
> > > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
> > > >  wrote:
> > > > >
> > > > >
> > > > > Accordingly, I'm thinking to have unsuccessful and successful stats 
> > > > > on the sub side.
> > > > > Sawada-san is now implementing a new view in [1].
> > > > > Do you think that I should write a patch to introduce a new separate 
> > > > > view
> > > > > or write a patch to add more columns to the new view 
> > > > > "pg_stat_subscription_errors" that is added at [1] ?
> > > >
> > > > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > > > details of error happening during logical replication. So I think a
> > > > separate view or pg_stat_subscription view would be a more appropriate
> > > > place.
> > > >
> > >
> > > +1 for having these stats in pg_stat_subscription. Do we want to add
> > > two columns (xact_commit: number of transactions successfully applied
> > > in this subscription, xact_rollback: number of transactions that have
> > > been rolled back in this subscription)
> >
> > Sounds good. We might want to have separate counters for the number of
> > transactions failed due to an error and transactions rolled back by
> > stream_abort.
> >
>
> I was trying to think based on similar counters in pg_stat_database
> but if you think there is a value in showing errored and actual
> rollbacked transactions separately then we can do that but how do you
> think one can make use of it?

I'm concerned that a value that includes both errored and actually
rolled-back transactions doesn't make sense in practice, since the
number of errored transactions can easily get increased once a
conflict happens. IMO the errored transaction count isn't necessarily
needed, since the number of (successive) errors that happened on the
subscription is tracked by the pg_stat_subscription_errors view. So it
might be enough to have actually rolled-back transactions. If this value
is high, it's likely that many rolled-back transactions are streamed,
unnecessarily consuming network bandwidth. So the user might want to
increase logical_decoding_work_mem to suppress transactions getting
streamed.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Peter Geoghegan
On Mon, Aug 2, 2021 at 11:57 PM Simon Riggs
 wrote:
> 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.

Why do you say that?

> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

What I meant before was that this argument worked back when we limited
the technique to VACUUM's second heap pass. Doing line pointer array
truncation at that point alone meant that it only ever happened
outside of VACUUM proper. Prior to that point we literally did nothing
about LP_UNUSED items at the end of each line pointer array, so we
were going from doing nothing to doing something.

This time it's quite different: we're truncating the line pointer
array during pruning. Pruning often occurs opportunistically, during
regular query processing. In fact I'd say that it's far more common
than pruning by VACUUM. So the chances of line pointer array
truncation hurting rather than helping seems higher. Plus now we might
break things like DDL that would naturally not have been affected
before, because we were only doing line pointer truncation during
VACUUM proper.

Of course it might well be true that there is a significant benefit to
this patch. I don't think that we should assume that that's the case,
though. We have yet to see a test case showing any benefit. Maybe
that's an easy thing to produce, and maybe Matthias has assumed that I
must already know what to look at. But I don't. It's not obvious to me
how to assess this patch now.

I don't claim to have any insight about what we should or should not
do. At least not right now. When I committed the original (commit
3c3b8a4b), I did so because I thought that it was very likely to
improve certain cases and very unlikely to hurt any other cases.
Nothing more.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.
> The comment clearly says "it seems like a good idea to avoid leaving a
> PageIsEmpty()" page behind.
> So I would be inclined to remove that from the patch and consider that
> as a separate issue, or close this.

This was part of that earlier commit because of sheer paranoia;
nothing more. I actually think that it's highly unlikely to protect us
from bugs in practice. Though I am, in a certain sense, likely to be
wrong about "PageIsEmpty() defensiveness", it does not bother me in
the slightest. It seems like the right approach in the absence of new
information about a significant downside. If my paranoia really did
turn out to be justified, then I would expect that there'd be a
subtle, nasty bug. That possibility is what I was really thinking of.
And so it almost doesn't matter to me how unlikely we might think such
a bug is now, unless and until somebody can demonstrate a real
practical downside to my defensive approach.

-- 
Peter Geoghegan




Re: archive status ".ready" files may be created too early

2021-08-03 Thread Bossart, Nathan
On 8/2/21, 7:37 PM, "Kyotaro Horiguchi"  wrote:
> I'm afraid that using hash to store boundary info is too much. Isn't a
> ring buffer enough for this use?  In that case it is enough to
> remember only the end LSN of the segment spanning records.  It is easy
> to expand the buffer if needed.

I agree that the hash table requires a bit more memory than what is
probably necessary, but I'm not sure I agree that maintaining a custom
data structure to save a few kilobytes of memory is worth the effort.

> @@ -2602,7 +2855,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
>     LogwrtResult.Flush = LogwrtResult.Write;   /* end of page */
>
>     if (XLogArchivingActive())
> -       XLogArchiveNotifySeg(openLogSegNo);
> +       SetLastNotifiedSegmentIfInvalid(openLogSegNo - 1);
>
> Is it safe?  If the server wasn't notified of WAL files for the 3 most
> recently finished segments in its previous life, they need to be
> archived in this lifetime. But this seems to omit maybe all of the three.
> (I didn't confirm that behavior..)

I tested this scenario out earlier [0].  It looks like the call to
XLogArchiveCheckDone() in RemoveOldXlogFiles() will take care of
creating any .ready files we missed.

>> I believe my worry was that we'd miss notifying a segment as soon as
>> possible if the record was somehow flushed prior to registering the
>> record boundary in the map.  If that's actually impossible, then I
>> would agree that the extra call to NotifySegmentsReadyForArchive() is
>> unnecessary.
>
> I don't think that XLogWrite(up to LSN=X) can happen before
> XLogInsert(endpos = X) ends.

Is there anything preventing that from happening?  At the location
where we are registering the record boundary, we've already called
CopyXLogRecordToWAL(), and neither the WAL insertion lock nor the
WALWriteLock are held.  Even if we register the boundary before
updating the shared LogwrtRqst.Write, there's a chance that someone
else has already moved it ahead and called XLogWrite().  I think the
worst case scenario is that we hold off creating .ready files longer
than necessary, but IMO that's still a worthwhile thing to do.

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com



Re: Reduce lock level for ALTER TABLE ... ADD CHECK .. NOT VALID

2021-08-03 Thread Simon Riggs
On Thu, 15 Jul 2021 at 07:47, Simon Riggs  wrote:
>
> On Sat, Jul 10, 2021 at 2:50 PM John Naylor
>  wrote:
> > On Thu, Apr 22, 2021 at 8:01 AM Simon Riggs  
> > wrote:
> > >
> > > 897795240cfaaed724af2f53ed2c50c9862f951f forgot to reduce the lock
> > > level for CHECK constraints when allowing them to be NOT VALID.
> > >
> > > This is simple and safe, since check constraints are not used in
> > > planning until validated.
> >
> > The patch also reduces the lock level when NOT VALID is not specified, 
> > which didn't seem to be the intention.
>
> Thank you for reviewing. I agree that the behavior works as you indicated.
>
> My description of this was slightly muddled. The lock level for
> CONSTR_FOREIGN applies whether or not NOT VALID is used, but the test
> case covers only NOT VALID because it a) isn't tested and b) is more
> important. I just followed that earlier pattern and that led me to
> adding "NOT VALID" onto the title of the thread.
>
> What is true for CONSTR_FOREIGN is also true for CONSTR_CHECK - the
> lock level can be set down to ShareRowExclusiveLock in all cases
> because adding a new CHECK constraint does not affect the outcome of
> currently executing SELECT statements. (Note that this is not true for
> DROP CONSTRAINT, which has a different lock level, but we aren't
> changing that here.) Once the constraint is validated it may influence
> the optimization of later SELECTs.
>
> So the patch and included docs are completely correct. Notice that the
> name of the patch reflects this better than the title of the thread.

An additional patch covering other types of ALTER TABLE attached. Both
can be applied independently.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


alter_table_reduce_more_lock_levels.v1.patch
Description: Binary data


Re: .ready and .done files considered harmful

2021-08-03 Thread Bossart, Nathan
+   /*
+    * Perform a full directory scan to identify the next log segment. There
+    * may be one of the following scenarios which may require us to perform a
+    * full directory scan.
+    *
+    * 1. This is the first cycle since the archiver has started and there is
+    * no idea about the next anticipated log segment.
+    *
+    * 2. There is a timeline switch, i.e. the timeline ID tracked at the
+    * archiver does not match the current timeline ID. Archive the history
+    * file as part of this timeline switch.
+    *
+    * 3. The next anticipated log segment is not available.

One benefit of the current implementation of pgarch_readyXlog() is
that .ready files created out of order will be prioritized before
segments with greater LSNs.  IIUC, with this patch, as long as there
is a "next anticipated" segment available, the archiver won't go back
and archive segments it missed.  I don't think the archive status
files are regularly created out of order, but XLogArchiveCheckDone()
has handling for that case, and the work to avoid creating .ready
files too early [0] seems to make it more likely.  Perhaps we should
also force a directory scan when we detect that we are creating a
.ready file for a segment that is older than the "next anticipated"
segment.
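
Roughly what I have in mind, as a sketch only (both helpers below are
hypothetical names, not from the posted patch):

/*
 * Sketch: when creating a .ready file for a segment older than the
 * archiver's "next anticipated" segment, force a directory scan so the
 * out-of-order segment isn't skipped.  PgArchGetAnticipatedSegNo() and
 * PgArchForceDirScan() are hypothetical.
 */
static void
XLogArchiveNotifyWithCheck(XLogSegNo segno)
{
	XLogArchiveNotifySeg(segno);	/* creates the .ready file */

	if (segno < PgArchGetAnticipatedSegNo())	/* hypothetical */
		PgArchForceDirScan();					/* hypothetical */
}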

Nathan

[0] https://postgr.es/m/DA71434B-7340-4984-9B91-F085BC47A778%40amazon.com



Re: Commitfest overflow

2021-08-03 Thread Greg Stark
On Tue, 3 Aug 2021 at 11:56, Bruce Momjian  wrote:

> I wonder if our lack of in-person developer meetings is causing some of
> our issues to not get closed.

That's an interesting thought. Every year there are some especially
contentious patches that don't get dealt with until the in-person
meeting which pushes people to make a call. However it seems like
that's only a handful and not really enough to explain the volume.

Unless people just find it easier to focus on reviewing in the days
leading up to the conferences and meetings. Fwiw I will have some time
set aside to do some patch reviewing coming up -- though I don't know
how much I can really help if the patches are the problem.


-- 
greg




Re: Autovacuum on partitioned table (autoanalyze)

2021-08-03 Thread Andrew Dunstan


On 7/29/21 9:03 PM, Andres Freund wrote:
> Hi,
>
> CCing RMT because I think we need to do something about this for v14.



Thanks. We are now aware of it.


[...]

> I don't think the code as is is fit for v14. It looks like it was rewritten
> with a new approach just before the freeze ([1]), and as far as I can tell the
> concerns I quoted above weren't even discussed in the whole thread.  Alvaro,
> any comments?
>

I discussed this briefly with Alvaro late last night. He's now aware of
the issue, but I believe he's away for some days, and probably won't be
able to respond until his return.


Sorry I don't have more news, but I didn't want anyone thinking this was
being ignored.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Commitfest overflow

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 03:42:57PM -0400, Tom Lane wrote:
> Tomas Vondra  writes:
> > But it's not clear to me whether you're arguing for CFM to assess this, 
> > or whether someone else should make this decision?
> 
> > IMHO asking the CFM to do this would be a tremendous burden - properly 
> > assessing 50+ patches is a lot of work, and probably requires a fairly 
> > experienced hacker ...
> 
> Yeah.  I can recall that once or twice, Andres went through and triaged
> all the open patches, which was tremendously helpful and I'm sure took a
> serious investment of time.  That way doesn't scale.  But maybe if we
> split the work among half a dozen or so senior hackers, it'd be a
> tolerable amount of work per-person?

Yes, I think we are going to need to do something like this.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Commitfest overflow

2021-08-03 Thread Tom Lane
Tomas Vondra  writes:
> But it's not clear to me whether you're arguing for CFM to assess this, 
> or whether someone else should make this decision?

> IMHO asking the CFM to do this would be a tremendous burden - properly 
> assessing 50+ patches is a lot of work, and probably requires a fairly 
> experienced hacker ...

Yeah.  I can recall that once or twice, Andres went through and triaged
all the open patches, which was tremendously helpful and I'm sure took a
serious investment of time.  That way doesn't scale.  But maybe if we
split the work among half a dozen or so senior hackers, it'd be a
tolerable amount of work per-person?

regards, tom lane




Re: A varint implementation for PG?

2021-08-03 Thread Tom Lane
Robert Haas  writes:
> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value. Maybe there's a
> more efficient way, but the first thing that occurs to me is to just
> discard high order bytes that are all zeroes or all ones until the
> high order bit of the next byte doesn't match and plonk the remaining
> bytes into the varlena. To decompress, just sign-extend out to the
> target length. Really, this kind of representation can be extended to
> represent arbitrarily large integers, even bigger than what we can
> currently do with numeric, which is already crazy huge, and it seems
> to have some advantage in that every payload byte contains exactly 8
> data bits, so we don't need to shift or mask while encoding and
> decoding.

+1.  I think this, together with our existing rules for varlena headers,
would address the issue quite nicely.  Any sanely-sized integer would
require only a one-byte header, so the minimum on-disk size is 2 bytes
(with no alignment padding required).  I don't buy that there's enough
need to justify inventing a new typlen code, since even if you did it
wouldn't improve things all that much compared to this design.

(Oh ... actually the minimum on-disk size is one byte, since value zero
would require no payload bytes.)
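
To make that concrete, here is a standalone sketch of the encode/decode
steps (illustrative only, not datatype-ready; per the above, a real
varlena version could special-case zero to an empty payload):

#include <stdint.h>
#include <string.h>

/*
 * Sketch: emit an int64 big-endian, then drop leading 0x00/0xFF bytes
 * while they agree with the sign bit of the following byte.  Returns
 * the payload length (1..8 in this sketch).
 */
static int
encode_int64(int64_t value, uint8_t *buf)
{
	uint8_t		tmp[8];
	int			skip = 0;

	for (int i = 0; i < 8; i++)
		tmp[i] = (uint8_t) ((uint64_t) value >> (8 * (7 - i)));

	while (skip < 7 &&
		   ((tmp[skip] == 0x00 && (tmp[skip + 1] & 0x80) == 0) ||
			(tmp[skip] == 0xFF && (tmp[skip + 1] & 0x80) != 0)))
		skip++;

	memcpy(buf, tmp + skip, 8 - skip);
	return 8 - skip;
}

/* Decode: sign-extend from the high bit of the first payload byte. */
static int64_t
decode_int64(const uint8_t *buf, int nbytes)
{
	uint64_t	value = (buf[0] & 0x80) ? ~(uint64_t) 0 : 0;

	for (int i = 0; i < nbytes; i++)
		value = (value << 8) | buf[i];

	return (int64_t) value;
}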

regards, tom lane




Re: Commitfest overflow

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 09:36:41PM +0200, Tomas Vondra wrote:
> On 8/3/21 8:57 PM, Bruce Momjian wrote:
> > On Tue, Aug  3, 2021 at 08:51:57PM +0200, Tomas Vondra wrote:
> > > How would this be different from the CFM just rejecting patches? It does
> > > not matter if there's an explicit number of patches that we allow to be
> > > moved to the next CF - someone still needs to make the decision, and I
> > > agree with Tom it probably should not be CFM's job.
> > 
> > My experience with the query id patch is that it can't be rejected
> > because everyone wants it, but it needs work to get it in a state that
> > everyone approves of.  Sometimes it is impossible for the patch author
> > to figure that out, and I needed Álvaro Herrera's help on the query id
> > patch, so even I wasn't able to figure it out alone.
> > 
> 
> Yeah, and I'm sure this applies to various other patches too - we want the
> feature, but it requires more work, and it may not be clear how much and
> what's the path forward.
> 
> But it's not clear to me whether you're arguing for CFM to assess this, or
> whether someone else should make this decision?
> 
> IMHO asking the CFM to do this would be a tremendous burden - properly
> assessing 50+ patches is a lot of work, and probably requires a fairly
> experienced hacker ...

I don't think the CFM can do this --- I think it has to be a team
effort, as was the query id patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Commitfest overflow

2021-08-03 Thread Tomas Vondra

On 8/3/21 8:57 PM, Bruce Momjian wrote:
> On Tue, Aug  3, 2021 at 08:51:57PM +0200, Tomas Vondra wrote:
>> How would this be different from the CFM just rejecting patches? It does not
>> matter if there's an explicit number of patches that we allow to be moved to
>> the next CF - someone still needs to make the decision, and I agree with Tom
>> it probably should not be CFM's job.
>
> My experience with the query id patch is that it can't be rejected
> because everyone wants it, but it needs work to get it in a state that
> everyone approves of.  Sometimes it is impossible for the patch author
> to figure that out, and I needed Álvaro Herrera's help on the query id
> patch, so even I wasn't able to figure it out alone.



Yeah, and I'm sure this applies to various other patches too - we want 
the feature, but it requires more work, and it may not be clear how much 
and what's the path forward.


But it's not clear to me whether you're arguing for CFM to assess this, 
or whether someone else should make this decision?


IMHO asking the CFM to do this would be a tremendous burden - properly 
assessing 50+ patches is a lot of work, and probably requires a fairly 
experienced hacker ...



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: A varint implementation for PG?

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 14:26:16 -0400, Robert Haas wrote:
> [ resurrecting this 2-year-old thread ]
> On Fri, Dec 13, 2019 at 12:45 AM Andres Freund  wrote:
> > I don't think it's ever going to be sensible to transport 64bit quanta
> > of data. Also, uh, it'd be larger than the data a postgres instance
> > could really contain, given LSNs are 64 bit.
> 
> We already use 128-bit integers in some places in the source code, and
> it seems more likely than not that use of 128-bit integer data types
> will continue to expand over time. If we want to represent 64 bit
> integers in a variable number of bits today, it does seem pretty
> likely that some day we will want to do the same thing with 128 bit
> integers, and maybe eventually larger. And I think that all of that is
> true even though exhausting a 64-bit LSN space looks difficult today
> and will likely remain difficult for the foreseeable future.

Yea, my answer here was intended to be about areas where we "bake in", to
quote Craig, the varint format, like varlena tags or protocol messages. For
those a 64bit length tag seems entirely sufficient.

I think e.g. adding an SQL-level variable-width integer type would potentially
come to a different outcome. But even for things like that I suspect that more
often than not you'd be better off with a variable-length length value, rather
than encoding the whole "huge" value as a single variable-width integer.


> So to me, the right thing to do here probably depends a bit on the use
> case. If we are talking about using this for strictly internal
> purposes - e.g. cramming stuff into an internal data structure into
> which users have no direct visibility - then I think it's probably
> best to pick a representation that corresponds exactly to the C data
> type that we're trying to store. If we're starting out with an int64
> and we want to byte-squeeze it, the representation you've chosen here
> seems like just the thing. And it's easy to imagine why we might also
> want to have similar transformations for uint64, int32, and uint32,
> and maybe eventually int128 and uint128. And if the details of the
> format are a little different in each case that's fine; for these
> sorts of use cases we're always going to unpack into the same data
> type that we packed originally, so it's not important to have exact
> compatibility across differences in signedness or bit count.

Agreed.


> However, I suspect that the whole approach should be completely
> revised for a user-visible data type. On the one hand, there's no
> telling how large a value some user will want to represent, so
> limiting ourselves to 64 bits does seem shortsighted. And on the other
> hand, if we've got a varlena, we already know the length, so it seems
> like we shouldn't also encode the length in the value.

If we're talking varlenas, then I think using embedded varints only really
makes sense if they're "sub-components" of that varlena, not the entire
value...


> Now, we could think of introducing a new format for variable-length
> datums, essentially making this a new typlen rather than a new kind of
> varlena. That might be worth it, because if you are storing a lot of
> values that are small enough that this format could represent them in
> 3 bytes or less, which I think would be everything up to +/- 2^20, you
> could save a pretty significant amount of space even if your table was
> laid out to avoid alignment padding. However, there would be some
> distributed overhead to this, because all the places that have special
> handling for typlen = -1 and typlen = -2 would need to grow new cases
> for this. I'm not sure how much of a problem that is, really, but I
> don't think we can go nuts with adding new typlen values.

Yes - the branches for the different typlens already are quite visible in
profiles... It might be that we could compensate for that without too much
difficulty (e.g. by having a field in a tupdesc indicating what kinds of types
are in use in a tuple to be formed/deformed, and dispatching to different
form/deform routines based on that), but it's not obviously a win.



> > It'll be a bit problematic to deal with all the casting necessary, and
> > with the likely resulting overload resolution issues.  I'm wondering
> > whether it'd be worthwhile to have a ALTER TABLE ... STORAGE ... option
> > that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> > let it be a normal integer.
> 
> I don't see how a STORAGE option could work. We tend to treat those as
> hints, rather than critical data, which wouldn't work here.

I wasn't thinking that this would be something that could be changed without a
table rewrite - so maybe STORAGE would be the wrong place to put it, given how
it's currently used. OTOH, I don't think it'd be that big a problem to have a
rewrite for some, but not all STORAGE options...


> I think there are a number of problems that crop up, but one of them is the
> same thing we hit with the custom TOAST 

Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Matthias van de Meent
On Tue, 3 Aug 2021 at 20:37, Simon Riggs  wrote:
>
> On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
>  wrote:
>
> > and further future optimizations might include
> >
> > - Full-page WAL logging of empty pages produced in the checkpointer
> > could potentially be optimized to only log 'it's an empty page'
> > instead of writing out the full 8kb page, which would help in reducing
> > WAL volume. Previously this optimization would never be hit on
> > heapam-pages because pages could not become empty again, but right now
> > this has real potential for applying an optimization.
>
> So what you are saying is your small change will cause multiple
> additional FPIs in WAL. I don't call that a future optimization, I
> call that essential additional work.

I think you misunderstood my statement. The patch does not change more
pages than before. The patch only ensures that empty heapam pages are
truly empty according to the relevant PageIsEmpty() macro, which
hypothetically allows for optimizations in the checkpointer process
that currently (as in, since its inception) writes all changed pages
as full page writes (unless turned off).

This change makes it easier and more worthwhile to implement a further
optimization for the checkpointer and/or buffer manager to determine
that 1.) this page is now empty, and that 2.) we can therefore write a
specialized WAL record specifically tuned for empty pages instead of
FPI records. No additional pages are changed, because each time the
line pointer array is shrunk, we've already either marked dead tuples
as unused (2nd phase vacuum) or removed HOT line pointers / truncated
dead tuples to lp_dead line pointers (heap_page_prune).

If, instead, you are suggesting that this checkpointer FPI
optimization should be part of the patch just because the potential is
there, then I disagree. Please pardon me if this was not your intended
message, but "you suggested this might be possible after your patch,
thus you must implement this" seems like a great way to burn
contributor goodwill.

The scope of the checkpointer is effectively PostgreSQL's WAL, plus
the page formats of all access methods that use the Page-based storage
manager (not just table access methods, but also those of indexes).
I'm not yet comfortable with hacking in those (sub)systems, nor do I
expect to have the time/effort capacity soon to go through all the
logic of these access methods to validate the hypothesis that such
optimization can be both correctly implemented and worthwhile.


Patch 2, as I see it, just clears up some leftover stuff from the end
of the pg14 release cycle with new insights and research I didn't have
at that point in time. As it is a behaviour change for wal-logged
actions, it cannot realistically be backported; therefore it is
included as an improvement for pg15.

Kind regards,

Matthias van de Meent




Re: Slow standby snapshot

2021-08-03 Thread Michail Nikolaev
Hello, Andres.

Thanks for your feedback.

>> Maybe use a hashtable of running transactions? It will be slightly faster
>> when adding/removing single transactions. But much worse when doing
>> KnownAssignedXidsRemove().
> Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
> write KnownAssignedXidsGet[AndSetXmin]()?

It actually was a hashtable in 2010. It was replaced by Simon Riggs
in commit 2871b4618af1acc85665eec0912c48f8341504c4.

> I'm not sure that we need to care as much about the cost of
> KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
> less frequent.

It is still about 5-7% of CPU for a typical workload, a considerable
amount for me. And a lot of systems still work on 12 and 13.
The proposed approach eliminates KnownAssignedXidsGetAndSetXmin from
the top of the profile with a small changeset.

> But more importantly, it'd not be hard to maintain an
> occasionally (or opportunistically) maintained denser version for
> GetSnapshotData() - there's only a single writer, making the concurrency
> issues a lot simpler.

Could you please explain it in more detail?
Single writer and GetSnapshotData() already exclusively hold
ProcArrayLock at the moment of KnownAssignedXidsRemove,
KnownAssignedXidsGetAndSetXmin, and sometimes KnownAssignedXidsAdd.

BTW, the tiny thing we could also fix now is (comment from code):

> (We could dispense with the spinlock if we were to
>  * create suitable memory access barrier primitives and use those instead.)
>  * The spinlock must be taken to read or write the head/tail pointers unless
>  * the caller holds ProcArrayLock exclusively.
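
For reference, the barrier-based scheme that comment hints at could look
roughly like the following. This is a simplified sketch against the
existing head/tail fields in procarray.c; real code would also have to
deal with the KnownAssignedXidsValid flags and buffer wraparound:

/*
 * Single writer (the startup process): fill the slot first, then
 * publish the new head, so lock-free readers never observe an
 * uninitialized xid.
 */
static void
known_xids_append(ProcArrayStruct *pArray, TransactionId xid)
{
	int			head = pArray->headKnownAssignedXids;

	KnownAssignedXids[head] = xid;
	pg_write_barrier();			/* slot contents before head update */
	pArray->headKnownAssignedXids = head + 1;
}

/* Reader: load the published head, then only read slots below it. */
static int
known_xids_collect(ProcArrayStruct *pArray, TransactionId *dst)
{
	int			head = pArray->headKnownAssignedXids;
	int			n = 0;

	pg_read_barrier();			/* head before slot contents */
	for (int i = pArray->tailKnownAssignedXids; i < head; i++)
		dst[n++] = KnownAssignedXids[i];

	return n;
}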

Thanks,
Michail.




Re: Commitfest overflow

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 08:51:57PM +0200, Tomas Vondra wrote:
> How would this be different from the CFM just rejecting patches? It does not
> matter if there's an explicit number of patches that we allow to be moved to
> the next CF - someone still needs to make the decision, and I agree with Tom
> it probably should not be CFM's job.

My experience with the query id patch is that it can't be rejected
because everyone wants it, but it needs work to get it in a state that
everyone approves of.  Sometimes it is impossible for the patch author
to figure that out, and I needed Álvaro Herrera's help on the query id
patch, so even I wasn't able to figure it out alone.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Commitfest overflow

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 07:30:38PM +0100, Simon Riggs wrote:
> I would still ask for someone to spend a little time triaging things,
> so as to direct people who volunteer to be so directed. Many will not
> want to be directed, but I'm sure there must be 5-10 people who would
> do that? (Volunteers?)

I am usually not around long enough to handle complex patches but I
considered the query id patch important and needing guidance so I was
able to help on that for PG 14, and I needed Álvaro Herrera's help to
complete it.  My point is that some of these patches are really complex
to get them through, but I think a focused effort could make a big
impact, though it will take months.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Commitfest overflow

2021-08-03 Thread Tomas Vondra

On 8/3/21 8:30 PM, Simon Riggs wrote:
> On Tue, 3 Aug 2021 at 17:13, Tom Lane wrote:
>> Bruce Momjian writes:
>>> On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
>>>> There are 273 patches in the queue for the Sept Commitfest already, so
>>>> it seems clear the queue is not being cleared down each CF as it was
>>>> before. We've been trying hard, but it's overflowing.
>>
>>> I wonder if our lack of in-person developer meetings is causing some of
>>> our issues to not get closed.
>>
>> I think there are a couple of things happening here:
>>
>> 1. There wasn't that much getting done during this CF because it's
>> summer and many people are on vacation (in the northern hemisphere
>> anyway).
>>
>> 2. As a community, we don't really have the strength of will to
>> flat-out reject patches.  I think the dynamic is that individual
>> committers look at something, think "I don't like that, I'll go
>> work on some better-designed patch", and it just keeps slipping
>> to the next CF.  In the past we've had some CFMs who were assertive
>> enough and senior enough to kill off patches that didn't look like
>> they were going to go anywhere.  But that hasn't happened for
>> awhile, and I'm not sure it should be the CFM's job anyway.
>>
>> (I hasten to add that I'm not trying to imply that all the
>> long-lingering patches are hopeless.  But I think some of them are.)
>>
>> I don't think there's much to be done about the vacation effect;
>> we just have to accept that the summer CF is likely to be less
>> productive than others.  But I'd like to see some better-formalized
>> way of rejecting patches that aren't going anywhere.  Maybe there
>> should be a time limit on how many CFs a patch is allowed to just
>> automatically slide through?
>
> I guess the big number is 233 patches moved to next CF, out of 342, or
> 68% deferred.
>
> Perhaps we need a budget of how many patches can be moved to next CF,
> i.e. CF mgr is responsible for ensuring that no more than ?50% of
> patches can be moved forwards. Any in excess of that need to join the
> kill list.

How would this be different from the CFM just rejecting patches? It does
not matter if there's an explicit number of patches that we allow to be
moved to the next CF - someone still needs to make the decision, and I
agree with Tom it probably should not be CFM's job.

IMO what the CFM can do is make an assessment, share it on the mailing
list with a proposal to reject the patch, and leave the decision up to
the patch author. Either the author accepts the view it's futile, or
decides to keep working on it, solicits some reviews, etc.

> I would still ask for someone to spend a little time triaging things,
> so as to direct people who volunteer to be so directed. Many will not
> want to be directed, but I'm sure there must be 5-10 people who would
> do that? (Volunteers?)




regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Simon Riggs
On Tue, 3 Aug 2021 at 17:15, Matthias van de Meent
 wrote:

> and further future optimizations might include
>
> - Full-page WAL logging of empty pages produced in the checkpointer
> could potentially be optimized to only log 'it's an empty page'
> instead of writing out the full 8kb page, which would help in reducing
> WAL volume. Previously this optimization would never be hit on
> heapam-pages because pages could not become empty again, but right now
> this has real potential for applying an optimization.

So what you are saying is your small change will cause multiple
additional FPIs in WAL. I don't call that a future optimization, I
call that essential additional work.

+1 on committing the first part of the patch, -1 on the second. I
suggest you split the patch and investigate the second part further.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Commitfest overflow

2021-08-03 Thread Ibrar Ahmed
On Tue, Aug 3, 2021 at 11:31 PM Simon Riggs 
wrote:

> On Tue, 3 Aug 2021 at 17:13, Tom Lane  wrote:
> >
> > Bruce Momjian  writes:
> > > On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
> > >> There are 273 patches in the queue for the Sept Commitfest already, so
> > >> it seems clear the queue is not being cleared down each CF as it was
> > >> before. We've been trying hard, but it's overflowing.
> >
> > > I wonder if our lack of in-person developer meetings is causing some of
> > > our issues to not get closed.
> >
> > I think there are a couple of things happening here:
> >
> > 1. There wasn't that much getting done during this CF because it's
> > summer and many people are on vacation (in the northern hemisphere
> > anyway).
> >
> > 2. As a community, we don't really have the strength of will to
> > flat-out reject patches.  I think the dynamic is that individual
> > committers look at something, think "I don't like that, I'll go
> > work on some better-designed patch", and it just keeps slipping
> > to the next CF.  In the past we've had some CFMs who were assertive
> > enough and senior enough to kill off patches that didn't look like
> > they were going to go anywhere.  But that hasn't happened for
> > awhile, and I'm not sure it should be the CFM's job anyway.
> >
> > (I hasten to add that I'm not trying to imply that all the
> > long-lingering patches are hopeless.  But I think some of them are.)
> >
> > I don't think there's much to be done about the vacation effect;
> > we just have to accept that the summer CF is likely to be less
> > productive than others.  But I'd like to see some better-formalized
> > way of rejecting patches that aren't going anywhere.  Maybe there
> > should be a time limit on how many CFs a patch is allowed to just
> > automatically slide through?
>
> I guess the big number is 233 patches moved to next CF, out of 342, or
> 68% deferred.
>
> Perhaps we need a budget of how many patches can be moved to next CF,
> i.e. CF mgr is responsible for ensuring that no more than ?50% of
> patches can be moved forwards. Any in excess of that need to join the
> kill list.
>
> I would still ask for someone to spend a little time triaging things,
> so as to direct people who volunteer to be so directed. Many will not
> want to be directed, but I'm sure there must be 5-10 people who would
> do that? (Volunteers?)
>

Count me as a Volunteer.


> --
> Simon Riggs    http://www.EnterpriseDB.com/
>
>
>

-- 
Ibrar Ahmed


Re: Commitfest overflow

2021-08-03 Thread Ibrar Ahmed
On Tue, Aug 3, 2021 at 9:13 PM Tom Lane  wrote:

> Bruce Momjian  writes:
> > On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
> >> There are 273 patches in the queue for the Sept Commitfest already, so
> >> it seems clear the queue is not being cleared down each CF as it was
> >> before. We've been trying hard, but it's overflowing.
>
> > I wonder if our lack of in-person developer meetings is causing some of
> > our issues to not get closed.
>
> I think there are a couple of things happening here:
>
> 1. There wasn't that much getting done during this CF because it's
> summer and many people are on vacation (in the northern hemisphere
> anyway).
>
> 2. As a community, we don't really have the strength of will to
> flat-out reject patches.  I think the dynamic is that individual
> committers look at something, think "I don't like that, I'll go
> work on some better-designed patch", and it just keeps slipping
> to the next CF.  In the past we've had some CFMs who were assertive
> enough and senior enough to kill off patches that didn't look like
> they were going to go anywhere.  But that hasn't happened for
> awhile, and I'm not sure it should be the CFM's job anyway.
>
> (I hasten to add that I'm not trying to imply that all the
> long-lingering patches are hopeless.  But I think some of them are.)
>
> I don't think there's much to be done about the vacation effect;
> we just have to accept that the summer CF is likely to be less
> productive than others.  But I'd like to see some better-formalized
> way of rejecting patches that aren't going anywhere.  Maybe there
> should be a time limit on how many CFs a patch is allowed to just
> automatically slide through?
>

+1 for the idea of a limit on allowed CFs. Secondly, we can think about
the patches which have not had a response from the author for a long
time.


>
> regards, tom lane
>
>
>

-- 
Ibrar Ahmed


Re: Commitfest overflow

2021-08-03 Thread Simon Riggs
On Tue, 3 Aug 2021 at 17:13, Tom Lane  wrote:
>
> Bruce Momjian  writes:
> > On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
> >> There are 273 patches in the queue for the Sept Commitfest already, so
> >> it seems clear the queue is not being cleared down each CF as it was
> >> before. We've been trying hard, but it's overflowing.
>
> > I wonder if our lack of in-person developer meetings is causing some of
> > our issues to not get closed.
>
> I think there are a couple of things happening here:
>
> 1. There wasn't that much getting done during this CF because it's
> summer and many people are on vacation (in the northern hemisphere
> anyway).
>
> 2. As a community, we don't really have the strength of will to
> flat-out reject patches.  I think the dynamic is that individual
> committers look at something, think "I don't like that, I'll go
> work on some better-designed patch", and it just keeps slipping
> to the next CF.  In the past we've had some CFMs who were assertive
> enough and senior enough to kill off patches that didn't look like
> they were going to go anywhere.  But that hasn't happened for
> awhile, and I'm not sure it should be the CFM's job anyway.
>
> (I hasten to add that I'm not trying to imply that all the
> long-lingering patches are hopeless.  But I think some of them are.)
>
> I don't think there's much to be done about the vacation effect;
> we just have to accept that the summer CF is likely to be less
> productive than others.  But I'd like to see some better-formalized
> way of rejecting patches that aren't going anywhere.  Maybe there
> should be a time limit on how many CFs a patch is allowed to just
> automatically slide through?

I guess the big number is 233 patches moved to next CF, out of 342, or
68% deferred.

Perhaps we need a budget of how many patches can be moved to next CF,
i.e. CF mgr is responsible for ensuring that no more than ?50% of
patches can be moved forwards. Any in excess of that need to join the
kill list.

I would still ask for someone to spend a little time triaging things,
so as to direct people who volunteer to be so directed. Many will not
want to be directed, but I'm sure there must be 5-10 people who would
do that? (Volunteers?)

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: Remove unused 'len' from pg_stat_recv_* functions

2021-08-03 Thread Andres Freund
On 2021-08-03 16:56:12 +0900, Masahiko Sawada wrote:
> On Tue, Aug 3, 2021 at 3:17 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Tue, 3 Aug 2021 12:40:23 +0900, Masahiko Sawada  
> > wrote in
> > > Hi all,
> > >
> > > While working on a patch adding new stats, houzj pointed out that
> > > 'len' function arguments of all pgstat_recv_* functions are not used
> > > at all. Looking at the git history, pgstat_recv_* functions have
> > > had ‘len’ since the stats collector was introduced by commit
> > > 140ddb78fe 20 years ago, but it was not used at all even in the first
> > > commit. It seems that later additions to the stats collector gave
> > > their pgstat_recv_* functions ‘len’ only for consistency with the
> > > existing pgstat_recv_* functions. Is there any historical reason for
> > > having 'len' argument? Or can we remove them?
> > >
> > > I've attached the patch that removes 'len' from all pgstat_recv_* 
> > > functions.
> >
> > At first look I thought that taking "len" as a parameter is
> > reasonable for message-processing functions, but the given message
> > struct contains the same value and the functions can refer to the
> > message length without the parameter if they want.  So I'm +-0 on the
> > removal.
> >
> > It applies cleanly on the master and compiled without an error.
> >
> > That being said, I'm not sure it is worthwhile to change parameters of
> > going-to-be-removed functions (if shared-memory stats collector is
> > successfully introduced).
> 
> Good point.

Indeed. It's already kinda painful to maintain the shared memory stats patch,
no need to make it harder...


> I'm not going to insist on removing them, but I was a bit confused
> about whether or not I should add 'len' when writing a new
> pgstat_recv_* function.

I'd keep it in sync with what's there right now.
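
For reference, the redundancy being discussed looks roughly like this in
pgstat.c terms (a sketch, not the actual code):

static void
pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
{
    /* 'len' merely duplicates what the common message header carries */
    Assert(len == msg->m_hdr.m_size);

    /* ... the actual message handling only ever uses 'msg' ... */
}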

Greetings,

Andres Freund




Re: A varint implementation for PG?

2021-08-03 Thread Robert Haas
[ resurrecting this 2-year-old thread ]

On Fri, Dec 13, 2019 at 12:45 AM Andres Freund  wrote:
> > If baking a new variant integer format now, I think limiting it to 64 bits
> > is probably a mistake given how long-lived PostgreSQL is, and how hard it
> > can be to change things in the protocol, on disk, etc.
>
> I don't think it's ever going to be sensible to transport 64bit quanta
> of data. Also, uh, it'd be larger than the data a postgres instance
> could really contain, given LSNs are 64 bit.

We already use 128-bit integers in some places in the source code, and
it seems more likely than not that use of 128-bit integer data types
will continue to expand over time. If we want to represent 64 bit
integers in a variable number of bits today, it does seem pretty
likely that some day we will want to do the same thing with 128 bit
integers, and maybe eventually larger. And I think that all of that is
true even though exhausting a 64-bit LSN space looks difficult today
and will likely remain difficult for the foreseeable future.

So to me, the right thing to do here probably depends a bit on the use
case. If we are talking about using this for strictly internal
purposes - e.g. cramming stuff into an internal data structure into
which users have no direct visibility - then I think it's probably
best to pick a representation that corresponds exactly to the C data
type that we're trying to store. If we're starting out with an int64
and we want to byte-squeeze it, the representation you've chosen here
seems like just the thing. And it's easy to imagine why we might also
want to have similar transformations for uint64, int32, and uint32,
and maybe eventually int128 and uint128. And if the details of the
format are a little different in each case that's fine; for these
sorts of use cases we're always going to unpack into the same data
type that we packed originally, so it's not important to have exact
compatibility across differences in signedness or bit count.

However, I suspect that the whole approach should be completely
revised for a user-visible data type. On the one hand, there's no
telling how large a value some user will want to represent, so
limiting ourselves to 64 bits does seem shortsighted. And on the other
hand, if we've got a varlena, we already know the length, so it seems
like we shouldn't also encode the length in the value. Maybe there's a
more efficient way, but the first thing that occurs to me is to just
discard high order bytes that are all zeroes or all ones until the
high order bit of the next byte doesn't match and plonk the remaining
bytes into the varlena. To decompress, just sign-extend out to the
target length. Really, this kind of representation can be extended to
represent arbitrarily large integers, even bigger than what we can
currently do with numeric, which is already crazy huge, and it seems
to have some advantage in that every payload byte contains exactly 8
data bits, so we don't need to shift or mask while encoding and
decoding.
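
To make that concrete, here is a minimal standalone sketch of the
encode/decode steps for int64 (illustration only, not PostgreSQL code;
all names invented):

#include <stdint.h>
#include <string.h>

/* Encode: drop leading bytes that are pure sign extension. */
static int
encode_int64(int64_t value, uint8_t out[8])
{
    uint8_t buf[8];
    int     n = 8;
    int     i;

    for (i = 0; i < 8; i++)     /* big-endian byte order */
        buf[i] = (uint8_t) ((uint64_t) value >> (8 * (7 - i)));

    /*
     * A leading byte can be dropped while it is all zeroes or all ones
     * and the next byte's high bit matches, since sign extension will
     * restore it on decode.
     */
    while (n > 1 &&
           ((buf[8 - n] == 0x00 && (buf[8 - n + 1] & 0x80) == 0) ||
            (buf[8 - n] == 0xFF && (buf[8 - n + 1] & 0x80) != 0)))
        n--;

    memcpy(out, buf + (8 - n), n);
    return n;                   /* number of payload bytes to store */
}

/* Decode: sign-extend the stored bytes back out to 64 bits. */
static int64_t
decode_int64(const uint8_t *in, int n)
{
    uint64_t    v = (in[0] & 0x80) ? ~(uint64_t) 0 : 0;
    int         i;

    for (i = 0; i < n; i++)
        v = (v << 8) | in[i];
    return (int64_t) v;
}

With this scheme, +/- 2^20 indeed fits in 3 payload bytes, and nothing
in the loop structure is specific to 64 bits.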

Now, we could think of introducing a new format for variable-length
datums, essentially making this a new typlen rather than a new kind of
varlena. That might be worth it, because if you are storing a lot of
values that are small enough that this format could represent them in
3 bytes or less, which I think would be everything up to +/- 2^20, you
could save a pretty significant amount of space even if your table was
laid out to avoid alignment padding. However, there would be some
distributed overhead to this, because all the places that have special
handling for typlen = -1 and typlen = -2 would need to grow new cases
for this. I'm not sure how much of a problem that is, really, but I
don't think we can go nuts with adding new typlen values.

> It'll be a bit problematic to deal with all the casting necessary, and
> with the likely resulting overload resolution issues.  I'm wondering
> whether it'd be worthwhile to have a ALTER TABLE ... STORAGE ... option
> that encodes int2/4/8 as varints when inside a tuple, but otherwise just
> let it be a normal integer.

I don't see how a STORAGE option could work. We tend to treat those as
hints, rather than critical data, which wouldn't work here. I think
there are a number of problems that crop up, but one of them is the
same thing we hit with the custom TOAST compression stuff. If you need
to make a value of some row type out of a HeapTuple, you're not going
to know which settings were used to create that heap tuple, and you're
certainly not going to know that when you go to deform that tuple. The
only thing you're going to have is the tuple descriptor, which AFAICS
means that the representation needs to be a property of the type, not
where the value is stored. Maybe you have a clever idea I'm not
seeing.

As for the casting and overloading issues, that's tripped up quite a
few people now when trying to add new numeric data types - unsigned
integers being a case that has come up a few times now, I 

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-02 18:25:56 -0400, Melanie Plageman wrote:
> Thanks for the feedback!
> 
> I agree it makes sense to count strategy writes separately.
> 
> I thought about this some more, and I don't know if it makes sense to
> only count "avoidable" strategy writes.
> 
> This would mean that a backend writing out a buffer from the strategy
> ring when no clean shared buffers (as well as no clean strategy buffers)
> are available would not count that write as a strategy write (even
> though it is writing out a buffer from its strategy ring). But, it
> obviously doesn't make sense to count it as a regular buffer being
> written out. So, I plan to change this code.

What do you mean by "no clean shared buffers ... are available"?



> The most substantial missing piece of the patch right now is persisting
> the data across reboots.
> 
> The two places in the code I can see to persist the buffer action stats
> data are:
> 1) using the stats collector code (like in
> pgstat_read/write_statsfiles()
> 2) using a before_shmem_exit() hook which writes the data structure to a
> file and then read from it when making the shared memory array initially

I think it's pretty clear that we should go for 1. Having two mechanisms for
persisting stats data is a bad idea.


> Also, I'm unsure how writing the buffer action stats out in
> pgstat_write_statsfiles() will work, since I think that backends can
> update their buffer action stats after we would have already persisted
> the data from the BufferActionStatsArray -- causing us to lose those
> updates.

I was thinking it'd work differently. Whenever a connection ends, it reports
its data up to pgstats.c (otherwise we'd lose those stats). By the time
shutdown happens, they all need to have already reported their stats - so
we don't need to do anything to get the data to pgstats.c during shutdown
time.
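
Something along these lines, roughly (a sketch only; the function and
variable names here are invented, not taken from the patch):

/* Registered once per backend, e.g. during stats initialization. */
static void
buffer_action_stats_shutdown(int code, Datum arg)
{
    /* Flush this backend's counters up to pgstats before detaching. */
    pgstat_report_buffer_actions(&MyBufferActionStats);
}

...
before_shmem_exit(buffer_action_stats_shutdown, (Datum) 0);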


> And, I don't think I can use pgstat_read_statsfiles() since the
> BufferActionStatsArray should have the data from the file as soon as the
> view containing the buffer action stats can be queried. Thus, it seems
> like I would need to read the file while initializing the array in
> CreateBufferActionStatsCounters().

Why would backends need to read that data back?


> diff --git a/src/backend/catalog/system_views.sql 
> b/src/backend/catalog/system_views.sql
> index 55f6e3711d..96cac0a74e 100644
> --- a/src/backend/catalog/system_views.sql
> +++ b/src/backend/catalog/system_views.sql
> @@ -1067,9 +1067,6 @@ CREATE VIEW pg_stat_bgwriter AS
>  pg_stat_get_bgwriter_buf_written_checkpoints() AS buffers_checkpoint,
>  pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
>  pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
> -pg_stat_get_buf_written_backend() AS buffers_backend,
> -pg_stat_get_buf_fsync_backend() AS buffers_backend_fsync,
> -pg_stat_get_buf_alloc() AS buffers_alloc,
>  pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

Material for a separate patch, not this. But if we're going to break
monitoring queries anyway, I think we should consider also renaming
maxwritten_clean (and perhaps a few others), because nobody understands what
that is supposed to mean.



> @@ -1089,10 +1077,6 @@ ForwardSyncRequest(const FileTag *ftag, 
> SyncRequestType type)
>  
>   LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
>  
> - /* Count all backend writes regardless of if they fit in the queue */
> - if (!AmBackgroundWriterProcess())
> - CheckpointerShmem->num_backend_writes++;
> -
>   /*
>* If the checkpointer isn't running or the request queue is full, the
>* backend will have to perform its own fsync request.  But before 
> forcing
> @@ -1106,8 +1090,10 @@ ForwardSyncRequest(const FileTag *ftag, 
> SyncRequestType type)
>* Count the subset of writes where backends have to do their 
> own
>* fsync
>*/
> + /* TODO: should we count fsyncs for all types of procs? */
>   if (!AmBackgroundWriterProcess())
> - CheckpointerShmem->num_backend_fsync++;
> + pgstat_increment_buffer_action(BA_Fsync);
> +

Yes, I think that'd make sense. Now that we can disambiguate the different
types of syncs between procs, I don't see the point of having a process-type
filter here. We just lose data...



>   /* don't set checksum for all-zero page */
> @@ -1229,11 +1234,60 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, 
> ForkNumber forkNum,
>   if (XLogNeedsFlush(lsn) &&
>   StrategyRejectBuffer(strategy, 
> buf))
>   {
> + /*
> +  * Unset the strat write flag, 
> as we will not be writing
> +  * 

Re: Slow standby snapshot

2021-08-03 Thread Andres Freund
Hi,

On 2021-08-03 10:33:50 +0500, Andrey Borodin wrote:
> > 3 авг. 2021 г., в 03:01, Andres Freund  написал(а):
> > On 2021-08-03 00:07:23 +0300, Michail Nikolaev wrote:
> >> The main idea is simple optimistic optimization - store offset to next
> >> valid entry. So, in most cases, we could just skip all the gaps.
> >> Of course, it adds some additional impact for workloads without long
> >> (few seconds) transactions but it is almost not detectable (because of
> >> CPU caches).
> > 
> > I'm doubtful that that's really the right direction. For workloads that
> > are replay heavy we already often can see the cost of maintaining the
> > known xids datastructures show up significantly - not surprising, given
> > the datastructure. And for standby workloads with active primaries the
> > cost of searching through the array in all backends is noticeable as
> > well.  I think this needs a bigger data structure redesign.
> 
> KnownAssignedXids implements simple membership test idea. What kind of
> redesign would you suggest? Proposed optimisation makes it close to optimal,
> but needs eventual compression.

Binary searches are very inefficient on modern CPUs (unpredictable memory
accesses, unpredictable branches). We constantly need to do binary searches
during replay to remove xids from the array. I don't see how you can address
that with just the current datastructure.


> Maybe use a hashtable of running transactions? It will be slightly faster
> when adding/removing single transactions. But much worse when doing
> KnownAssignedXidsRemove().

Why would it be worse for KnownAssignedXidsRemove()? Were you intending to
write KnownAssignedXidsGet[AndSetXmin]()?


> Maybe use a tree? (AVL\RB or something like that) It will be slightly better, 
> because it does not need eventual compression like the existing array.

I'm not entirely sure what datastructure would work best. I can see something
like a radix tree working well, or a skiplist-style approach. Or a hashtable:

I'm not sure that we need to care as much about the cost of
KnownAssignedXidsGetAndSetXmin() - for one, the caching we now have makes that
less frequent. But more importantly, it'd not be hard to maintain an
occasionally (or opportunistically) maintained denser version for
GetSnapshotData() - there's only a single writer, making the concurrency
issues a lot simpler.
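
For illustration, such a denser copy might look roughly like this (all
names invented; a sketch of the idea, not a design):

typedef struct DenseKnownXids
{
    int             count;      /* valid entries in xids[] */
    TransactionId   xids[FLEXIBLE_ARRAY_MEMBER];
} DenseKnownXids;

/*
 * Called occasionally by the startup process (the single writer) while
 * holding ProcArrayLock exclusively: squeeze the gaps out once, so that
 * GetSnapshotData() can copy a contiguous array instead of skipping
 * invalid slots on every call.
 */
static void
KnownAssignedXidsRebuildDense(DenseKnownXids *dense)
{
    int     n = 0;
    int     i;

    for (i = pArray->tailKnownAssignedXids;
         i < pArray->headKnownAssignedXids; i++)
    {
        if (KnownAssignedXidsValid[i])
            dense->xids[n++] = KnownAssignedXids[i];
    }
    dense->count = n;
}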


Greetings,

Andres Freund




Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-03 Thread Robert Haas
On Tue, Aug 3, 2021 at 9:59 AM Pavel Borisov  wrote:
> I've looked at the v5 patch. It is identical to the v2 patch, which I've
> already tested using the workflow I've described in the comments above.
> Before the patch I get the errors quite soon; the patch corrects them.
> Furthermore, I've tested the same patch under REPEATABLE READ and SERIALIZABLE
> and detected no flaws. So, now that we've got Robert's explanation, it seems
> to me that the v2 (aka v5) patch can be committed (leaving possible REPEATABLE
> READ and SERIALIZABLE improvements for the future). I'm not really sure it is
> still possible in the 07/21 CF, but I'd change the status of the patch to
> ready-for-committer. Also, I'd like the bugfix to be backported to the
> previous PG versions.

I agree that the fix should be back-ported, but I'm not keen to commit
anything unless it works for all isolation levels.

The idea I sort of had floating around in my mind is a little
different than what Greg has implemented. I was thinking that we could
just skip SerializeSnapshot and the corresponding shm_toc_allocate()
if !IsolationUsesXactSnapshot(). Then on the restore side we could
just call shm_toc_lookup() with noError = true and skip
RestoreTransactionSnapshot/RestoreSnapshot if it returns NULL.
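
In rough outline, the idea would be something like this (a sketch with
details approximated from parallel.c, not a tested patch):

/* Leader, in InitializeParallelDSM(): only serialize a transaction
 * snapshot when the isolation level actually retains one. */
if (IsolationUsesXactSnapshot())
{
    Snapshot    xact_snapshot = GetTransactionSnapshot();
    Size        tsnaplen = EstimateSnapshotSpace(xact_snapshot);
    char       *tsnapspace = shm_toc_allocate(pcxt->toc, tsnaplen);

    SerializeSnapshot(xact_snapshot, tsnapspace);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, tsnapspace);
}

/* Worker, in ParallelWorkerMain(): tolerate the entry being absent. */
tsnapspace = shm_toc_lookup(toc, PARALLEL_KEY_TRANSACTION_SNAPSHOT, true);
if (tsnapspace != NULL)
    RestoreTransactionSnapshot(RestoreSnapshot(tsnapspace),
                               fps->parallel_leader_pgproc);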

I don't know why Greg's patch is changing anything related to the
active snapshot (as opposed to the transaction snapshot). Maybe
there's a reason why we need that change, but I don't know what it is.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Tom Lane
Gilles Darold  writes:
> Sorry I have missed that, but I'm fine with this implemenation so let's 
> keep the v6 version of the patch and drop this one.

Pushed, then.  There's still lots of time to tweak the behavior of course.

regards, tom lane




Re: .ready and .done files considered harmful

2021-08-03 Thread Robert Haas
On Mon, Aug 2, 2021 at 9:06 AM Dipesh Pandit  wrote:
> We can maintain the current timeline ID in archiver specific shared memory.
> If we switch to a new timeline then the backend process can update the new
> timeline ID in shared memory. Archiver can keep a track of current timeline ID
> and if it finds that there is a timeline switch then it can perform a full 
> directory
> scan to make sure that archiving history files takes precedence over WAL 
> files.
> Access to the shared memory area can be protected by adding a WALArchiverLock.
> If we take this approach then it doesn't require to use a dedicated signal to 
> notify
> a timeline switch.

Hi,

I don't really understand why you are storing something in shared
memory specifically for the archiver. Can't we use XLogCtl's
ThisTimeLineID instead of storing another copy of the information?

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 11:13:45AM -0500, Justin Pryzby wrote:
> On Tue, Aug 03, 2021 at 12:00:47PM -0400, Bruce Momjian wrote:
> > Patch applied through 9.6.
> 
> The comment seems to be a leftover from a copy pasto.
> 
> +   /* find hash indexes */
> +   res = executeQueryOrDie(conn,
> +   "SELECT name "
> +   "FROM 
> pg_available_extensions "
> +   "WHERE 
> installed_version != default_version"
> +   );

Thanks much, fixed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Have I found an interval arithmetic bug?

2021-08-03 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 03:03:13PM -0400, Bruce Momjian wrote:
> >   On output, the months field is shown as an appropriate number of
> >   years and months; the days field is shown as-is; the microseconds
> >   field is converted to hours, minutes, and possibly-fractional
> >   seconds.
> 
> Here is an updated patch that includes some of your ideas.

"Rounding" patch applied to master, and back branches got only the
adjusted "truncated" doc patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Matthias van de Meent
On Tue, 3 Aug 2021 at 08:57, Simon Riggs  wrote:
>
> On Tue, 18 May 2021 at 20:33, Peter Geoghegan  wrote:
> >
> > On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
> >  wrote:
> > > PFA the updated version of this patch. Apart from adding line pointer
> > > truncation in PageRepairFragmentation (as in the earlier patches), I
> > > also altered PageTruncateLinePointerArray to clean up all trailing
> > > line pointers, even if it was the last item on the page.
> >
> > Can you show a practical benefit to this patch, such as an improvement
> > in throughout or in efficiency for a given workload?
> >
> > It was easy to see that having something was better than having
> > nothing at all. But things are of course different now that we have
> > PageTruncateLinePointerArray().
>
> There does seem to be utility in Matthias' patch, which currently does
> two things:
> 1. Allow same thing as PageTruncateLinePointerArray() during HOT cleanup
> That is going to have a clear benefit for HOT workloads, which by
> their nature will use a lot of line pointers.
> Many applications are updated much more frequently than they are vacuumed.
> Peter - what is your concern about doing this more frequently? Why
> would we *not* do this?

One clear reason why we _do_ want this is that the current
shrinking only happens in the second phase of vacuum. Shrinking the
LP array in heap_page_prune decreases the chance that tuples which
could fit on the page (thanks to removed HOT chain items) fail to
fit only because vacuum hasn't run yet, whilst adding only a little
overhead. Additionally, heap_page_prune is also executed when more empty
space on the page is required for a new tuple that currently doesn't
fit, and in such cases I think clearing as much space as possible is
useful.

> 2. Reduce number of line pointers to 0 in some cases.
> Matthias - I don't think you've made a full case for doing this, nor
> looked at the implications.

I have looked at the implications (see upthread), and I haven't found
any implications other than those mentioned below.

> The comment clearly says "it seems like a good idea to avoid leaving a
> PageIsEmpty()" page behind.

Do note that that comment is based on (to the best of my knowledge)
unmeasured, but somewhat informed, guesswork ('it seems like a good
idea'), which I also commented on in the thread discussing the patch
that resulted in that commit [0].

If I recall correctly, the decision to keep at least 1 line pointer on
the page was because this feature was to be committed late in the
development cycle of pg14, and as such there would be little time to
check the impact of fully clearing pages. To go forward with the
feature in pg14 at that point, it was safer to not completely empty
pages, so that we'd not be changing the paths we were hitting during
e.g. vacuum too significantly, reducing the chances of significant
bugs that would require the patch to be reverted [1].


I agreed at that point that that was a safer bet, but right now it's
early in the pg15 development cycle, and I've had the time to get more
experience around the vacuum and line pointer machinery. That being
the case, I consider this a re-visit of the topic 'is it OK to
truncate the LP-array to 0', where previously the answer was 'we don't
know, and it's late in the release cycle', and after looking through
the code base now I argue that the answer is Yes.

One more point for going to 0 is that for 32-bit systems, a single
line pointer is enough to block a page from being 'empty' enough to
fit a MaxHeapTupleSize-sized tuple (when requesting pages through the
FSM).

Additionally, there are some other optimizations we can only apply to
empty pages:

- vacuum (with disable_page_skipping = on) will process these empty
pages faster, as it won't need to do any pruning on that page. With
page skipping enabled this won't matter because empty pages are
all_visible and therefore vacuum won't access that page.
- the pgstattuple contrib extension processes empty pages (slightly)
faster in pgstattuple_approx
- various loops won't need to check whether the remaining item is
unused, saving some cycles in those loops when the page is accessed.

and further future optimizations might include

- Full-page WAL logging of empty pages produced in the checkpointer
could potentially be optimized to only log 'it's an empty page'
instead of writing out the full 8kb page, which would help in reducing
WAL volume. Previously this optimization would never be hit on
heapam-pages because pages could not become empty again, but right now
this has real potential for applying an optimization.

Kind regards,

Matthias van de Meent

[0] 
https://www.postgresql.org/message-id/CAEze2Wh-nXjkp0bLN_vQwgHttC8CRH%3D1ewcrWk%2B7RX5B93YQPQ%40mail.gmail.com
[1] 
https://www.postgresql.org/message-id/CAH2-WznCxtWL4B995y2KJWj-%2BjrjahH4n6gD2R74SyQJo6Y63w%40mail.gmail.com




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-08-03 Thread Justin Pryzby
On Tue, Aug 03, 2021 at 12:00:47PM -0400, Bruce Momjian wrote:
> Patch applied through 9.6.

The comment seems to be a leftover from a copy pasto.

+   /* find hash indexes */
+   res = executeQueryOrDie(conn,
+   "SELECT name "
+   "FROM 
pg_available_extensions "
+   "WHERE 
installed_version != default_version"
+   );

-- 
Justin




Re: Commitfest overflow

2021-08-03 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
>> There are 273 patches in the queue for the Sept Commitfest already, so
>> it seems clear the queue is not being cleared down each CF as it was
>> before. We've been trying hard, but it's overflowing.

> I wonder if our lack of in-person developer meetings is causing some of
> our issues to not get closed.

I think there are a couple of things happening here:

1. There wasn't that much getting done during this CF because it's
summer and many people are on vacation (in the northern hemisphere
anyway).

2. As a community, we don't really have the strength of will to
flat-out reject patches.  I think the dynamic is that individual
committers look at something, think "I don't like that, I'll go
work on some better-designed patch", and it just keeps slipping
to the next CF.  In the past we've had some CFMs who were assertive
enough and senior enough to kill off patches that didn't look like
they were going to go anywhere.  But that hasn't happened for
awhile, and I'm not sure it should be the CFM's job anyway.

(I hasten to add that I'm not trying to imply that all the
long-lingering patches are hopeless.  But I think some of them are.)

I don't think there's much to be done about the vacation effect;
we just have to accept that the summer CF is likely to be less
productive than others.  But I'd like to see some better-formalized
way of rejecting patches that aren't going anywhere.  Maybe there
should be a time limit on how many CFs a patch is allowed to just
automatically slide through?

regards, tom lane




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-08-03 Thread Bruce Momjian
On Fri, Jul 30, 2021 at 12:40:06PM -0400, Bruce Momjian wrote:
> On Thu, Jul 29, 2021 at 06:19:56PM -0400, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > I think we need to first give clear instructions on how to find out if
> > > an extension update is available, and then how to update it.  I am
> > > thinking we should supply a query which reports all extensions that can
> > > be upgraded, at least for contrib.
> > 
> > I suggested awhile ago that pg_upgrade should look into
> > pg_available_extensions in the new cluster, and prepare
> > a script with ALTER EXTENSION UPDATE commands for
> > anything that's installed but is not the (new cluster's)
> > default version.
> 
> OK, done in this patch.  I am assuming that everything that shows an
> update in pg_available_extensions can use ALTER EXTENSION UPDATE.  I
> assume this would be backpatched to 9.6.

Patch applied through 9.6.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: [PATCH]Comment improvement in publication.sql

2021-08-03 Thread vignesh C
On Tue, Aug 3, 2021 at 8:36 AM tanghy.f...@fujitsu.com
 wrote:
>
> On Monday, August 2, 2021 11:56 PM vignesh C  wrote:
> >
> > Few minor suggestions:
> > 1) Should we change below to "fail - tables can't be added, dropped or
> > set to "FOR ALL TABLES" publications"
> > --- fail - can't add to for all tables publication
> > +-- fail - tables can't be added to or dropped from FOR ALL TABLES 
> > publications
>
> Thanks for your kindly comments.
>
> I'm not sure whether we should ignore that 'drop' should be followed by
> 'from', but I think there's no misunderstanding. So, modified as you
> described.
>
> Besides, in comments in other files (src/test/subscription/t/100_bugs.pl) I
> saw FOR ALL TABLES without quotes.
> So I don't think we need the quotes either.
>
> > 2) Should we change this to "--fail - already existing publication"
> > --- fail - already added
> > +-- fail - already exists
>
> Modified.
>
> A modified patch is attached.

Thanks for the updated patch, the changes look good to me.

Regards,
Vignesh




Re: Commitfest overflow

2021-08-03 Thread Bruce Momjian
On Tue, Aug  3, 2021 at 04:53:40PM +0100, Simon Riggs wrote:
> There are 273 patches in the queue for the Sept Commitfest already, so
> it seems clear the queue is not being cleared down each CF as it was
> before. We've been trying hard, but it's overflowing.
> 
> Of those, about 50 items have been waiting more than one year, and
> about 25 entries waiting for more than two years.
> 
> One problem is that many entries don't have official reviewers, though
> many people have commented on Hackers, making it difficult to track
> down items that actually need work. That wastes a lot of time for
> would-be reviewers (or at least, it has done for me).
> 
> Please can there be some additional coordination to actively clear
> down this problem? I won't attempt to come up with ideas to do this,
> but others may wish to suggest ways that avoid Committer burn-out?
> 
> I will be happy to volunteer to be part of an organized approach that
> spreads out the work.

I wonder if our lack of in-person developer meetings is causing some of
our issues to not get closed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Commitfest overflow

2021-08-03 Thread Simon Riggs
There are 273 patches in the queue for the Sept Commitfest already, so
it seems clear the queue is not being cleared down each CF as it was
before. We've been trying hard, but it's overflowing.

Of those, about 50 items have been waiting more than one year, and
about 25 entries waiting for more than two years.

One problem is that many entries don't have official reviewers, though
many people have commented on Hackers, making it difficult to track
down items that actually need work. That wastes a lot of time for
would-be reviewers (or at least, it has done for me).

Please can there be some additional coordination to actively clear
down this problem? I won't attempt to come up with ideas to do this,
but others may wish to suggest ways that avoid Committer burn-out?

I will be happy to volunteer to be part of an organized approach that
spreads out the work.

Thanks

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: pg_upgrade does not upgrade pg_stat_statements properly

2021-08-03 Thread Bruce Momjian
On Thu, Jul 29, 2021 at 03:06:45PM -0400, Bruce Momjian wrote:
> > I haven't seen it mentioned in the thread, but I think the final phrase
> > of this <step> should be a separate step,
> >
> > <step>
> >  <title>Copy custom full-text search files</title>
> >  <para>
> >   Copy any custom full text search file (dictionary, synonym, thesaurus,
> >   stop word list) to the new server.
> >  </para>
> > </step>
> > 
> > While this is closely related to extensions, it's completely different.
> 
> Agreed.  See attached patch.

Doc patch applied to all supported versions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: make MaxBackends available in _PG_init

2021-08-03 Thread Robert Haas
On Mon, Aug 2, 2021 at 4:37 PM Andres Freund  wrote:
> I'm not so sure it's a good idea. I've seen several shared_preload_library
> using extensions that adjust some GUCs (e.g. max_prepared_transactions)
> because they need some more resources internally - that's perhaps not a great
> idea, but there's also not an obviously better way.

Blech.

I'm fine with solving the problem some other way, but I say again, blech.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Make relfile tombstone files conditional on WAL level

2021-08-03 Thread Robert Haas
On Mon, Aug 2, 2021 at 6:38 PM Andres Freund  wrote:
> What I proposed in the past was to have a new shared table that tracks
> relfilenodes. I still think that's a decent solution for just the problem at
> hand.

It's not really clear to me what problem is at hand. The problems that
the tombstone system created for the async I/O stuff weren't really
explained properly, IMHO. And I don't think the current system is all
that ugly. it's not the most beautiful thing in the world but we have
lots of way worse hacks. And, it's easy to understand, requires very
little code, and has few moving parts that can fail. As hacks go it's
a quality hack, I would say.

> But it'd also potentially be the way to redesign relation forks and even
> slim down buffer tags:
>
> Right now a buffer tag is:
> - 4 byte tablespace oid
> - 4 byte database oid
> - 4 byte "relfilenode oid" (don't think we have a good name for this)
> - 4 byte fork number
> - 4 byte block number
>
> If we had such a shared table we could put at least tablespace, fork number
> into that table mapping them to an 8 byte "new relfilenode". That'd only make
> the "new relfilenode" unique within a database, but that'd be sufficient for
> our purposes.  It'd give use a buffertag consisting out of the following:
> - 4 byte database oid
> - 8 byte "relfilenode"
> - 4 byte block number

Yep. I think this is a good direction.
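
Roughly, the difference in layout would be (illustrative declarations
only, not the actual definitions):

/* Today: 20 bytes. */
typedef struct BufferTagCurrent
{
    Oid         spcOid;         /* tablespace */
    Oid         dbOid;          /* database */
    Oid         relNumber;      /* "relfilenode oid" */
    ForkNumber  forkNum;        /* fork number */
    BlockNumber blockNum;       /* block number */
} BufferTagCurrent;

/* With a shared table absorbing tablespace and fork: 16 bytes. */
typedef struct BufferTagProposed
{
    Oid         dbOid;          /* database */
    uint64      relId;          /* new 8-byte "relfilenode" */
    BlockNumber blockNum;       /* block number */
} BufferTagProposed;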

> Of course, it'd add some complexity too, because a buffertag alone wouldn't be
> sufficient to read data (as you'd need the tablespace oid from elsewhere). But
> that's probably ok, I think all relevant places would have that information.

I think the thing to look at would be the places that call
relpathperm() or relpathbackend(). I imagine this can be worked out,
but it might require some adjustment.

> It's probably possible to remove the database oid from the tag as well, but
> it'd make CREATE DATABASE tricker - we'd need to change the filenames of
> tables as we copy, to adjust them to the differing oid.

Yeah, I'm not really sure that works out to a win. I tend to think
that we should be trying to make databases within the same cluster
more rather than less independent of each other. If we switch to using
a radix tree for the buffer mapping table as you have previously
proposed, then presumably each backend can cache a pointer to the
second level, after the database OID has been resolved. Then you have
no need to compare database OIDs for every lookup. That might turn out
to be better for performance than shoving everything into the buffer
tag anyway, because then backends in different databases would be
accessing distinct parts of the buffer mapping data structure instead
of contending with one another.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




CLOBBER_CACHE_ALWAYS testing (was Re: Release 13 of the PostgreSQL BuildFarm client)

2021-08-03 Thread Tom Lane
Andrew Dunstan  writes:
> I have pushed Release 13 of the PostgreSQL BuildFarm client.
> ...
> the `use_discard_caches` setting reflects a change in the way postgres
> handles this - it's now a runtime setting rather than a compile time
> setting. On older branches it sets "-DCLOBBER_CACHE_ALWAYS". If you use
> this setting don't use that define.

I'd just like to take a moment to recommend that owners of
CLOBBER_CACHE_ALWAYS animals adopt this new way of doing things.

For PG 13 and earlier branches, this makes no difference --- it's
just an indirect way of adding "-DCLOBBER_CACHE_ALWAYS".  However,
for v14 and up, it does not do that but builds the binaries normally.
Then, cache-clobber testing is performed by adding "use_discard_caches=1"
as a GUC setting.  The reason this is useful is that we can skip running
individual tests in cache-clobber mode when we choose to.  As of
the new buildfarm client, this is exploited by skipping clobber mode
for initdb runs that are part of other tests.  (We still run initdb in
clobber mode in the initdb-LOCALE steps, so we aren't losing coverage.)
That makes a noticeable difference in the standard set of tests, but
where it really shines is if you enable TAP testing --- those tests
otherwise spend nearly half their time running initdb :-(.

The last I checked, no buildfarm animals were running with both
--enable-tap-tests and CLOBBER_CACHE_ALWAYS, which was reasonable
because it just took insanely long.  However, if you've got a fast
machine you may find that --enable-tap-tests plus use_discard_caches
is tolerable now ... at least in v14 and later branches.

I don't normally run my animal "sifaka" in cache-clobber mode, but
I did do a couple of runs that way while testing the new buildfarm
client.  Here's a run with buildfarm client v13, --enable-tap-tests,
and use_discard_caches:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-08-02%2022%3A09%3A03

Total time 9:21:52.  That compares well to this previous run with the
v12 client:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-03%2004%3A02%3A16

... total time 16:15:43.

Also of interest is that a month ago, it took twice as long (32:53:24):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-01%2018%3A06%3A09

I don't have comparable runs from sifaka before that, but extrapolating
from avocet's times, it would have taken ~ 58.5 hours a year ago.
Those reductions are from various other changes we've implemented to
reduce the cost of cache-clobber testing.  Hopefully, these fixes
make it practical to be more ambitious about how much testing can
be done by cache-clobber animals.

regards, tom lane




Re: when the startup process doesn't (logging startup delays)

2021-08-03 Thread Robert Haas
On Tue, Aug 3, 2021 at 2:48 AM Nitin Jadhav
 wrote:
> Implemented the above approach and verified the patch. Kindly have a
> look and share your thoughts.

+LogStartupProgressTimerExpired(bool force, long *secs, int *usecs)

The name of this function begins with "log" but it does not log
anything, so that's probably a sign that you should rethink the name
of the function. I suggested startup_progress_timer_expired()
upthread, but now I think maybe we should call it
startup_progress_timer_has_expired(), and then rename the Boolean
you've called logStartupProgressTimeout to
startup_progress_timer_expired. Also, the argument "bool force"
doesn't really make sense for this function, which is why I suggested
above calling it "bool done" instead.

+elapsed_ms = (seconds * 1000) + (useconds / 1000);
+interval_in_ms = log_startup_progress_interval * 1000;
+enable_timeout_after(LOG_STARTUP_PROGRESS_TIMEOUT,
+ (interval_in_ms - (elapsed_ms % interval_in_ms)));

This will work correctly only if elapsed_ms is equal to interval_in_ms
or slightly greater than interval_in_ms. But if elapsed_ms is greater
than two times interval_in_ms, then it will produce pretty much random
results. If elapsed_ms is negative because the system clock gets set
backward, a possibility I've already mentioned to you in a previous
review, then it will also misbehave. I actually don't think
enable_timeout_after() is very easy to use for this kind of thing. At
least for me, it's way easier to think about calculating the timestamp
at which I want the timer to expire. Maybe something along these
lines:

next_timeout = last_startup_progress_timeout + interval;
if (next_timeout < now)
{
// Either the timeout was processed so late that we missed an entire cycle,
// or the system clock was set backwards.
next_timeout = now + interval;
}
enable_timeout_at(next_timeout);

Also, I said before that I thought it was OK that you were logging a
line at the end of every operation as well as after every N
milliseconds. But, the more I think about it, the less I like it. We
already have a 'redo done' line that shows up at the end of redo,
which the patch wisely does not duplicate. But it's not quite clear
that any of these other things are important enough to bother
mentioning in the log unless they take a long time. After an immediate
shutdown of an empty cluster, with this patch applied, I get 3 extra
log messages:

2021-08-03 10:17:49.630 EDT [17567] LOG:  data directory sync (fsync)
complete after 0.13 s
2021-08-03 10:17:49.633 EDT [17567] LOG:  resetting unlogged relations
(cleanup) complete after 0.00 s
2021-08-03 10:17:49.635 EDT [17567] LOG:  resetting unlogged relations
(init) complete after 0.00 s

That doesn't seem like information anyone really needs. If it had
taken a long time, it would have been worth logging, but in the normal
case where it doesn't, it's just clutter. Another funny thing is that,
as you've coded it, those additional log lines only appear when
log_startup_progress_interval != 0. That's strange. It seems
particularly strange because of the existing precedent where 'redo
done' appears regardless of any setting, but also because when I set,
say, a 10s interval, I guess I expect something to happen every 10s.
Making something happen once at the end is different.

So I think we should take this out, which would permit simplifying a
bunch of code. The four places where you call
ereport_startup_progress(true, ...) would go away.
ereport_startup_progress() would no longer need a Boolean argument,
and neither would LogStartupProgressTimerExpired() /
startup_progress_timer_has_expired(). Note that there's no real need
to disable the timeout when we're done with it. It's fine if we do,
but if we don't, it's also not a big deal; all that happens if we
leave the timer scheduled and let it expire is that it will set a
Boolean flag that nobody will care about. So what I'm thinking about
is that we could just have, say, reset_startup_progress_timeout() and
startup_progress_timeout_has_expired().
reset_startup_progress_timeout() would just do exactly what I showed
above to reset the timeout, and you'd call it at the beginning of each
phase. And startup_progress_timeout_has_expired() would look roughly
like this:

if (!startup_progress_timer_expired)
return;
now = GetCurrentTimestamp();
// compute timestamp difference
last_startup_progress_timeout = now;
reset_startup_progress_timeout();

With these changes you'd have only 1 place in the code that needs to
care about log_startup_progress_interval, as opposed to 3 as you have
it currently, and only one place that enables the timeout, as opposed
to 2 as you have it currently. I think that would be tidier.
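
Put together, the two helpers might end up looking roughly like this (a
sketch following the naming suggested above; startup_phase_start_time
and the GUC being in milliseconds are assumptions on my part):

static TimestampTz last_startup_progress_timeout = 0;
static volatile sig_atomic_t startup_progress_timer_expired = false;

static void
reset_startup_progress_timeout(TimestampTz now)
{
    TimestampTz next_timeout;

    next_timeout = TimestampTzPlusMilliseconds(last_startup_progress_timeout,
                                               log_startup_progress_interval);
    if (next_timeout < now)
    {
        /* Missed an entire cycle, or the clock was set backwards. */
        next_timeout = TimestampTzPlusMilliseconds(now,
                                                   log_startup_progress_interval);
    }
    startup_progress_timer_expired = false;
    enable_timeout_at(LOG_STARTUP_PROGRESS_TIMEOUT, next_timeout);
}

static bool
startup_progress_timeout_has_expired(long *secs, int *usecs)
{
    TimestampTz now;

    if (!startup_progress_timer_expired)
        return false;

    now = GetCurrentTimestamp();
    TimestampDifference(startup_phase_start_time, now, secs, usecs);
    last_startup_progress_timeout = now;
    reset_startup_progress_timeout(now);
    return true;
}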

I think we also should consider where to put the new functions
introduced by this patch, and the GUC. You put them in xlog.c which is
reasonable since that is where StartupXLOG() lives. However, xlog.c is
also a gigantic file, and xlog.h is included 

Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Gilles Darold

Le 03/08/2021 à 15:39, Tom Lane a écrit :
> Erik Rijkers  writes:
>> On 8/3/21 1:26 PM, Gilles Darold wrote:
>>> Le 03/08/2021 à 11:45, Gilles Darold a écrit :
>>>> Actually I just found that the regexp_like() function doesn't support
>>>> the start parameter which is something we should support. I saw that
>>>> Oracle do not support it but DB2 does and I think we should also
>>>> support it. I will post a new version of the patch once it is done.
>>
>> +1
>>
>> I for one am in favor of this 'start'-argument addition.  Slightly
>> harder usage, but more precise manipulation.
>
> As I said upthread, I am *not* in favor of making those DB2 additions.
> We do not need to create ambiguities around those functions like the
> one we have for regexp_replace.  If Oracle doesn't have those options,
> why do we need them?

Sorry, I had missed that, but I'm fine with this implementation, so let's
keep the v6 version of the patch and drop this one.


--
Gilles Darold





Re: Parallel scan with SubTransGetTopmostTransaction assert coredump

2021-08-03 Thread Pavel Borisov
Fri, 23 Jul 2021 at 10:00, Greg Nancarrow :

> On Thu, Jul 22, 2021 at 2:15 AM Robert Haas  wrote:
> >
> > Thanks to Thomas Munro for drawing my attention to this thread. I
> > wasn't intentionally ignoring it, but there's a lot of email in the
> > world and only so much time.
> >
> > When I wrote this code originally, the idea that I had in mind was
> > simply this: whatever state we have in the leader ought to be
> > reproduced in each worker. So if there's an active snapshot in the
> > leader, we ought to make that active in all the workers, and if
> > there's a transaction snapshot in the leader, we ought to make that
> > the transaction snapshot in all of the workers.
> >
> > But I see now that my thinking was fuzzy, and I'm going to blame that
> > on the name GetTransactionSnapshot() being slightly misleading. If
> > IsolationUsesXactSnapshot() is true, then there's really such a thing
> > as a transaction snapshot and reproducing that in the worker is a
> > sensible thing to do. But when !IsolationUsesXactSnapshot(),
> > GetTransactionSnapshot() doesn't just "get the transaction snapshot",
> > because there isn't any such thing. It takes a whole new snapshot, on
> > the theory that you wouldn't be calling this function unless you had
> > finished up with the snapshot you got the last time you called this
> > function. And in the case of initiating parallel query, that is the
> > wrong thing.
> >
> > I think that, at least in the case where IsolationUsesXactSnapshot()
> > is true, we need to make sure that calling GetTransactionSnapshot() in
> > a worker produces the same result that it would have produced in the
> > leader. Say one of the workers calls an sql or plpgsql function and
> > that function runs a bunch of SQL statements. It seems to me that
> > there's probably a way for this to result in calls inside the worker
> > to GetTransactionSnapshot(), and if that doesn't return the same
> > snapshot as in the leader, then we've broken MVCC.
> >
> > What about when IsolationUsesXactSnapshot() is false? Perhaps it's OK
> > to just skip this altogether in that case. Certainly what we're doing
> > can't be right, because copying a snapshot that wouldn't have been
> > taken without parallel query can't ever be the right thing to do.
> > Perhaps we need to copy something else instead. I'm not really sure.
> >
> > So I think v2 is probably on the right track, but wrong when the
> > transaction isolation level is REPEATABLE READ or SERIALIZABLE, and v3
> > and v4 just seem like unprincipled hacks that try to avoid the
> > assertion failure by lying about whether there's a problem.
> >
>
> Many thanks for taking time to respond to this (and thanks to Thomas Munro
> too).
> It's much appreciated, as this is a complex area.
> For the time being, I'll reinstate the v2 patch (say as v5) as a
> partial solution, and then work on addressing the REPEATABLE READ and
> SERIALIZABLE transaction isolation levels, which you point out are not
> handled correctly by the patch.
>
I've looked at the v5 patch. It is identical to the v2 patch, which I've
already tested using the workflow I've described in the comments above.
Before the patch I get the errors quite soon; the patch corrects them.
Furthermore, I've tested the same patch under REPEATABLE READ and
SERIALIZABLE and detected no flaws. So, now that we've got Robert's
explanation, it seems to me that the v2 (aka v5) patch can be committed
(leaving possible REPEATABLE READ and SERIALIZABLE improvements for the
future). I'm not really sure it is still possible in the 07/21 CF, but I'd
change the status of the patch to ready-for-committer. Also, I'd like the
bugfix to be backported to the previous PG versions.

Further consideration on the patch and on the backporting is very much
welcome!

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Tom Lane
Erik Rijkers  writes:
> On 8/3/21 1:26 PM, Gilles Darold wrote:
>> On 03/08/2021 at 11:45, Gilles Darold wrote:
>>> Actually, I just found that the regexp_like() function doesn't support
>>> the start parameter, which is something we should support. I saw that
>>> Oracle does not support it but DB2 does, and I think we should also
>>> support it. I will post a new version of the patch once it is done.

> +1

> I for one am in favor of this 'start'-argument addition.  Slightly 
> harder usage, but more precise manipulation.

As I said upthread, I am *not* in favor of making those DB2 additions.
We do not need to create ambiguities around those functions like the
one we have for regexp_replace.  If Oracle doesn't have those options,
why do we need them?

regards, tom lane




Re: when the startup process doesn't (logging startup delays)

2021-08-03 Thread Justin Pryzby
Two issues that I saw:

The first syncfs message is not output, because it's before
InitStartupProgress() is called:

2021-08-03 07:53:02.176 CDT startup[9717] LOG:  database system was 
interrupted; last known up at 2021-08-03 07:52:15 CDT
2021-08-03 07:53:02.733 CDT startup[9717] LOG:  data directory sync (syncfs) 
complete after 0.55 s
2021-08-03 07:53:02.734 CDT startup[9717] LOG:  database system was not 
properly shut down; automatic recovery in progress

FP exception when the GUC is set to 0:

2021-08-03 07:53:02.877 CDT postmaster[9715] LOG:  startup process (PID 9717) 
was terminated by signal 8: Floating point exception

Probably due to a modulo-by-zero operation.
This prevents the process from starting.

+   enable_timeout_after(LOG_STARTUP_PROGRESS_TIMEOUT,
+                        (interval_in_ms - (elapsed_ms % interval_in_ms)));
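
A rough sketch of a possible guard (variable names reused from the snippet
above; treating 0 as "disabled" is my assumption, not the patch's stated
intent):

+   /* Hypothetical: skip arming the timer when the interval is 0,
+    * so that elapsed_ms % 0 is never evaluated. */
+   if (interval_in_ms <= 0)
+       return;
+   enable_timeout_after(LOG_STARTUP_PROGRESS_TIMEOUT,
+                        (interval_in_ms - (elapsed_ms % interval_in_ms)));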

-- 
Justin




Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Erik Rijkers

On 8/3/21 1:26 PM, Gilles Darold wrote:

On 03/08/2021 at 11:45, Gilles Darold wrote:
Actually, I just found that the regexp_like() function doesn't support
the start parameter, which is something we should support. I saw that
Oracle does not support it but DB2 does, and I think we should also
support it. I will post a new version of the patch once it is done.




+1

I for one am in favor of this 'start'-argument addition.  Slightly 
harder usage, but more precise manipulation.



Erik Rijkers






Re: Extra code in commit_ts.h

2021-08-03 Thread David Rowley
On Tue, 3 Aug 2021 at 23:12, Andrey Lepikhov  wrote:
>
> On 3/8/21 15:16, David Rowley wrote:
> > git diff 73c986adde5~1.. | grep check_track_commit_timestamp
> I think this is dead code. Maybe delete it?

Yeah, I think it can be safely removed.  I can do so unless someone
thinks otherwise.

David




[PATCH] Tab completion for ALTER TABLE … ADD …

2021-08-03 Thread Dagfinn Ilmari Mannsåker
Hi Hackers,

The other day I noticed that there's no tab completion after ALTER TABLE
… ADD, so here's a patch.  In addition to COLUMN and all the table
constraint types, it also completes with the list of unique indexes on
the table after ALTER TABLE … ADD … USING INDEX.
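
As an illustration, a session with the patch applied would offer
candidates like this (hypothetical table name; the candidate lists are
taken from the patch below):

postgres=# ALTER TABLE mytab ADD <TAB>
CHECK        COLUMN       CONSTRAINT   EXCLUDE
FOREIGN KEY  PRIMARY KEY  UNIQUE
postgres=# ALTER TABLE mytab ADD UNIQUE <TAB>
(            USING INDEX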

- ilmari


From 231cccd2e84ef6dc2bf41423979efc4e760c013e Mon Sep 17 00:00:00 2001
From: Dagfinn Ilmari Mannsåker 
Date: Tue, 3 Aug 2021 12:23:07 +0100
Subject: [PATCH] Add tab completion for ALTER TABLE … ADD …
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Complete COLUMN and table constraint types, and the list of indexes on the
table for ADD (UNIQUE|PRIMARY KEY) USING INDEX.
---
 src/bin/psql/tab-complete.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 064892bade..476a72908f 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -776,6 +776,10 @@ static const SchemaQuery Query_for_list_of_collations = {
 "   and pg_catalog.quote_ident(c1.relname)='%s'"\
 "   and pg_catalog.pg_table_is_visible(c2.oid)"
 
+#define Query_for_unique_index_of_table \
+Query_for_index_of_table \
+"   and i.indisunique"
+
 /* the silly-looking length condition is just to eat up the current word */
 #define Query_for_constraint_of_table \
 "SELECT pg_catalog.quote_ident(conname) "\
@@ -2023,6 +2027,46 @@ psql_completion(const char *text, int start, int end)
 	  "OWNER TO", "SET", "VALIDATE CONSTRAINT",
 	  "REPLICA IDENTITY", "ATTACH PARTITION",
 	  "DETACH PARTITION", "FORCE ROW LEVEL SECURITY");
+	/* ALTER TABLE xxx ADD */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD"))
+		COMPLETE_WITH("COLUMN", "CONSTRAINT", "CHECK", "UNIQUE", "PRIMARY KEY",
+	  "EXCLUDE", "FOREIGN KEY");
+	/* ALTER TABLE xxx ADD CONSTRAINT yyy */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny))
+		COMPLETE_WITH("CHECK", "UNIQUE", "PRIMARY KEY", "EXCLUDE", "FOREIGN KEY");
+	/* ALTER TABLE xxx ADD (CONSTRAINT yyy)? FOREIGN|PRIMARY */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "FOREIGN|PRIMARY") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny, "FOREIGN|PRIMARY"))
+		COMPLETE_WITH("KEY");
+	/* ALTER TABLE xxx ADD (CONSTRAINT yyy)? (PRIMARY KEY|UNIQUE) */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "PRIMARY", "KEY") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", "UNIQUE") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny, "PRIMARY", "KEY") ||
+			 Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny, "UNIQUE"))
+		COMPLETE_WITH("(", "USING INDEX");
+	/* ALTER TABLE xxx ADD (CONSTRAINT yyy)? ... USING INDEX */
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "PRIMARY", "KEY", "USING", "INDEX"))
+	{
+		completion_info_charp = prev6_wd;
+		COMPLETE_WITH_QUERY(Query_for_unique_index_of_table);
+	}
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "UNIQUE", "USING", "INDEX"))
+	{
+		completion_info_charp = prev5_wd;
+		COMPLETE_WITH_QUERY(Query_for_unique_index_of_table);
+	}
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny,
+	 "PRIMARY", "KEY", "USING", "INDEX"))
+	{
+		completion_info_charp = prev8_wd;
+		COMPLETE_WITH_QUERY(Query_for_unique_index_of_table);
+	}
+	else if (Matches("ALTER", "TABLE", MatchAny, "ADD", "CONSTRAINT", MatchAny,
+	 "UNIQUE", "USING", "INDEX"))
+	{
+		completion_info_charp = prev7_wd;
+		COMPLETE_WITH_QUERY(Query_for_unique_index_of_table);
+	}
 	/* ALTER TABLE xxx ENABLE */
 	else if (Matches("ALTER", "TABLE", MatchAny, "ENABLE"))
 		COMPLETE_WITH("ALWAYS", "REPLICA", "ROW LEVEL SECURITY", "RULE",
-- 
2.30.2



Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Gilles Darold

On 03/08/2021 at 11:45, Gilles Darold wrote:
Actually, I just found that the regexp_like() function doesn't support
the start parameter, which is something we should support. I saw that
Oracle does not support it but DB2 does, and I think we should also
support it. I will post a new version of the patch once it is done.



Here is a new version of the patch that adds the start parameter to the
regexp_like() function, but while adding support for this parameter it
became less obvious to me that we should implement it. However, feel
free not to use this version if you think that adding the start
parameter has no real interest.
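
For example, with the start and flags parameters the match is attempted
from the given character position (this is the example added in the
documentation hunk below):

SELECT regexp_like('Hello World', 'world$', 1, 'i');
 regexp_like
-------------
 t
(1 row)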



Best regards,

--
Gilles Darold

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a5b6adc4bb..2bc9060e47 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3108,6 +3108,80 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ regexp_count
+
+regexp_count ( string text, pattern text
+ [, start integer
+ [, flags text ] ] )
+integer
+   
+   
+Returns the number of times the POSIX regular
+expression pattern matches in
+the string; see
+.
+   
+   
+regexp_count('123456789012', '\d\d\d', 2)
+3
+   
+  
+
+  
+   
+
+ regexp_instr
+
+regexp_instr ( string text, pattern text
+ [, start integer
+ [, N integer
+ [, endoption integer
+ [, flags text
+ [, subexpr integer ] ] ] ] ] )
+integer
+   
+   
+Returns the position within string where
+the N'th match of the POSIX regular
+expression pattern occurs, or zero if there is
+no such match; see .
+   
+   
+regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i')
+3
+   
+   
+regexp_instr('ABCDEF', 'c(.)(..)', 1, 1, 0, 'i', 2)
+5
+   
+  
+
+  
+   
+
+ regexp_like
+
+regexp_like ( string text, pattern text
+ [, start integer
+ [, flags text ] ] )
+boolean
+   
+   
+Checks whether a match of the POSIX regular
+expression pattern occurs
+within string starting at
+the start'th character; see
+.
+   
+   
+regexp_like('Hello World', 'world$', 1, 'i')
+t
+   
+  
+
   

 
@@ -3117,8 +3191,9 @@ repeat('Pg', 4) PgPgPgPg
 text[]


-Returns captured substrings resulting from the first match of a POSIX
-regular expression to the string; see
+Returns captured substrings resulting from the first match of the
+POSIX regular expression pattern to
+the string; see
 .


@@ -3136,10 +3211,11 @@ repeat('Pg', 4) PgPgPgPg
 setof text[]


-Returns captured substrings resulting from the first match of a
-POSIX regular expression to the string,
-or multiple matches if the g flag is used;
-see .
+Returns captured substrings resulting from the first match of the
+POSIX regular expression pattern to
+the string, or all matches if
+the g flag is used; see
+.


 regexp_matches('foobarbequebaz', 'ba.', 'g')
@@ -3156,14 +3232,16 @@ repeat('Pg', 4) PgPgPgPg
 
  regexp_replace
 
-regexp_replace ( string text, pattern text, replacement text [, flags text ] )
+regexp_replace ( string text, pattern text, replacement text
+ [, start integer ]
+ [, flags text ] )
 text


-Replaces substrings resulting from the first match of a
-POSIX regular expression, or multiple substring matches
-if the g flag is used; see .
+Replaces the substring that is the first match to the POSIX
+regular expression pattern, or all matches
+if the g flag is used; see
+.


 regexp_replace('Thomas', '.[mN]a.', 'M')
@@ -3171,6 +3249,26 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+regexp_replace ( string text, pattern text, replacement text,
+ start integer,
+ N integer
+ [, flags text ] )
+text
+   
+   
+Replaces the substring that is the N'th
+match to the POSIX regular expression pattern,
+or all matches if N is zero; see
+.
+   
+   
+regexp_replace('Thomas', '.', 'X', 3, 2)
+ThoXas
+   
+  
+
   

 
@@ -3213,6 +3311,35 @@ repeat('Pg', 4) PgPgPgPg

   
 
+  
+   
+
+ regexp_substr
+
+regexp_substr ( string text, pattern text
+ [, start integer
+ [, N integer
+ [, flags text
+ [, subexpr 

Re: Extra code in commit_ts.h

2021-08-03 Thread Andrey Lepikhov

On 3/8/21 15:16, David Rowley wrote:

git diff 73c986adde5~1.. | grep check_track_commit_timestamp

I think this is dead code. Maybe delete it?

--
regards,
Andrey Lepikhov
Postgres Professional




Re: row filtering for logical replication

2021-08-03 Thread Amit Kapila
On Tue, Jul 27, 2021 at 9:56 AM Dilip Kumar  wrote:
>
> On Tue, Jul 27, 2021 at 6:21 AM houzj.f...@fujitsu.com
>  wrote:
>
> > 1) UPDATE a nonkey column in publisher.
> > 2) Use debugger to block the walsender process in function
> >pgoutput_row_filter_exec_expr().
> > 3) Open another psql to connect to the publisher, and drop the table
> > which was updated in 1).
> > 4) Unblock the debugger in 2), and then I can see the following error:
> > ---
> > ERROR:  could not read block 0 in file "base/13675/16391"
>
> Yeah, that's a big problem; it seems like the expression evaluation
> machinery is directly going and detoasting the externally stored data
> using some random snapshot.  Ideally, in walsender we should never
> attempt to detoast the data, because there is no guarantee that those
> data are preserved.  Somehow, before going to the expression evaluation
> machinery, I think we will have to deform that tuple and do something
> about the externally stored data; otherwise it will be very difficult
> to control that inside the expression evaluation.
>

True. I think it would be possible after we fix the issue reported in
another thread [1], where we will log the key values as part of
old_tuple_key for toast tuples even if they are not changed. We can
have a restriction that the WHERE clause may specify only key columns
for Updates, similar to Deletes. Then we have the data required for the
filter columns: if the toasted key values are changed, they will be
part of the old and new tuples anyway, and if they are not changed,
they will be part of the old tuple. I have not checked the
implementation side of it, but theoretically it seems possible. If my
understanding is correct, it becomes necessary to solve the other bug
[1] in order to solve this part of the problem for this patch. The
other possibility is to disallow columns (datatypes) that can lead to
toasted data (at least for Updates), which doesn't sound like a good
idea to me. Do you have any other ideas for this problem?
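
To make that concrete, here is a sketch using the publication
WHERE-clause syntax from this patch (table and filters invented for
illustration):

CREATE TABLE t (id int PRIMARY KEY, payload text);
-- Allowed under the proposed restriction: the filter references only
-- the key column, which is always available in the old tuple.
CREATE PUBLICATION pub FOR TABLE t WHERE (id > 100);
-- A filter such as WHERE (payload <> '') would be disallowed for
-- UPDATE/DELETE, since payload can be toasted and absent from the
-- old tuple.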

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB611342D0A92D4F4BF26C0F47FB229%40OS0PR01MB6113.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Skipping logical replication transactions on subscriber side

2021-08-03 Thread vignesh C
On Tue, Aug 3, 2021 at 12:20 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 2, 2021 at 12:21 PM Amit Kapila  wrote:
> >
> > On Mon, Aug 2, 2021 at 7:45 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jul 30, 2021 at 12:52 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Jul 29, 2021 at 11:18 AM Masahiko Sawada 
> > > >  wrote:
> > > >
> > > > Setting up logical rep error context in a generic function looks a bit
> > > > odd to me. Do we really need to set up error context here? I
> > > > understand we can't do this in caller but anyway I think we are not
> > > > sending this to logical replication view as well, so not sure we need
> > > > to do it here.
> > >
> > > Yeah, I'm not convinced of this part yet. I wanted to show relid also
> > > in truncate cases but I came up with only this idea.
> > >
> > > If an error happens during truncating the table (in
> > > ExecuteTruncateGuts()), relid set by
> > > set_logicalrep_error_context_rel() is actually sent to the view. If we
> > > don’t have it, the view always shows relid as NULL in truncate cases.
> > > On the other hand, it doesn’t cover all cases. For example, it doesn’t
> > > cover an error that the target table doesn’t exist on the subscriber,
> > > which happens when opening the target table. Anyway, in most cases,
> > > even if relid is NULL, the error message in the view helps users to
> > > know which relation the error happened on. What do you think?
> > >
> >
> > Yeah, I also think at this stage error message is sufficient in such cases.
>
> I've attached new patches that incorporate all comments I got so far.
> Please review them.

I had a look at the first patch; a couple of minor comments:
1) Should we include this in typedefs.list?
+/* Struct for saving and restoring apply information */
+typedef struct ApplyErrCallbackArg
+{
+   LogicalRepMsgType command;  /* 0 if invalid */
+
+   /* Local relation information */
+   char   *nspname;

2) We can keep the case statements in the same order as in the
LogicalRepMsgType enum; this will help in easily identifying whether any
enum value gets missed.
+   case LOGICAL_REP_MSG_RELATION:
+   return "RELATION";
+   case LOGICAL_REP_MSG_TYPE:
+   return "TYPE";
+   case LOGICAL_REP_MSG_ORIGIN:
+   return "ORIGIN";
+   case LOGICAL_REP_MSG_MESSAGE:
+   return "MESSAGE";
+   case LOGICAL_REP_MSG_STREAM_START:
+   return "STREAM START";

Regards,
Vignesh




Re: Use POPCNT on MSVC

2021-08-03 Thread John Naylor
On Tue, Aug 3, 2021 at 5:03 AM David Rowley  wrote:
>
> Going by [1], it looks like we can use the __popcnt and __popcnt64
> intrinsic functions on MSVC if the CPU supports POPCNT.  We already
> have code to check for that, we just need to enable it on MSVC.
>
> The attached patch seems to be all that's needed.

+1 for the concept, but the coding is a bit strange. Granted, the way we
handle popcnt is a bit strange, but this makes it stranger:

1. the __popcnt64() intrinsic is put inside pg_popcount64_asm(), which is a
bit of a misnomer since it's not assembly. Renaming s/_asm/_fast/ would
help it look better. But then looking around, other platforms have
intrinsics coded, but for some reason they're put in pg_popcount64_slow(),
where the compiler will emit instructions we could easily write ourselves
in C (and without #ifdefs) since without the right CFLAGS these intrinsics
won't emit a popcnt instruction. Is MSVC different in that regard? If so,
it might be worth a comment.

2. (defined(_MSC_VER) && defined(_WIN64)) leads to a runtime check for the
CPUID, which is fine, but now next to it HAVE_X86_64_POPCNTQ looks strange
because the latter symbol comes from a specific configure test -- the two
don't seem equivalent, but maybe they are because of what MSVC does? That
would be nice to spell out here.

I know the present state of affairs works a bit differently from what you
proposed a couple years ago and that something had to change for
portability reasons I didn't quite understand, but I think if we change
things here we should at least try to have things fit together a bit more
nicely.

(Side note, but sort of related to #1 above: non-x86 platforms have to
indirect through a function pointer even though they have no fast
implementation to make it worth their while. It would be better for them if
the "slow" implementation was called static inline or at least a direct
function call, but that's a separate thread.)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Extra code in commit_ts.h

2021-08-03 Thread David Rowley
On Tue, 3 Aug 2021 at 21:36, Andrey V. Lepikhov
 wrote:
> I found two extra code lines in commit_ts.h (see attachment).
> They confused me while exploring the code. If they are still needed,
> maybe add some comments?

Going by:

$ git config --global diff.renamelimit 1000
$ git diff 73c986adde5~1.. | grep check_track_commit_timestamp
+extern bool check_track_commit_timestamp(bool *newval, void **extra,

that function has never been defined.

David




RE: Implementing Incremental View Maintenance

2021-08-03 Thread r.takahash...@fujitsu.com
Hi Nagata-san,


I am interested in this patch since it is a good feature.

I ran some simple tests.
I found the following problems.


(1) 
Failed to "make world".
I think there are extra "" in 
doc/src/sgml/ref/create_materialized_view.sgml
(lines 110 and 117)


(2)
In the case of partitions, it seems that IVM does not work well.
I ran the following:

postgres=# create table parent (c int) partition by range (c);
CREATE TABLE
postgres=# create table child partition of parent for values from (1) to (100);
CREATE TABLE
postgres=# create incremental materialized view ivm_parent as select c from 
parent;
NOTICE:  could not create an index on materialized view "ivm_parent" 
automatically
HINT:  Create an index on the materialized view for efficient incremental 
maintenance.
SELECT 0
postgres=# create incremental materialized view ivm_child as select c from 
child;
NOTICE:  could not create an index on materialized view "ivm_child" 
automatically
HINT:  Create an index on the materialized view for efficient incremental 
maintenance.
SELECT 0
postgres=# insert into parent values (1);
INSERT 0 1
postgres=# insert into child values (2);
INSERT 0 1
postgres=# select * from parent;
 c
---
 1
 2
(2 rows)

postgres=# select * from child;
 c
---
 1
 2
(2 rows)

postgres=# select * from ivm_parent;
 c
---
 1
(1 row)

postgres=# select * from ivm_child;
 c
---
 2
(1 row)


I think ivm_parent and ivm_child should return 2 rows.


(3)
I think IVM does not support foreign tables, but I tried to create an IVM on one anyway.

postgres=# create incremental materialized view ivm_foreign as select c from 
foreign_table;
NOTICE:  could not create an index on materialized view "ivm_foreign" 
automatically
HINT:  Create an index on the materialized view for efficient incremental 
maintenance.
ERROR:  "foreign_table" is a foreign table
DETAIL:  Triggers on foreign tables cannot have transition tables.

It finally failed to create the IVM, but I think this should be checked earlier.
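
For instance, a hypothetical early check (placement and wording are my
assumption, not taken from the IVM patch) could reject the case up front:

/* Hypothetical: reject foreign tables before any trigger creation */
if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("incremental materialized views on foreign tables are not supported")));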


Regards,
Ryohei Takahashi




Re: [HACKERS] logical decoding of two-phase transactions

2021-08-03 Thread vignesh C
On Tue, Aug 3, 2021 at 12:32 PM Amit Kapila  wrote:
>
> On Tue, Aug 3, 2021 at 6:17 AM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v102*
> >
>
> I have made minor modifications in the comments and docs, please see
> attached. Can you please check whether the names of contributors in
> the commit message are correct or do we need to change it?

The patch applies neatly, the tests pass, and the documentation builds
with the updates provided. I have no further comments. The patch looks
good to me.

Regards,
Vignesh




Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace

2021-08-03 Thread Gilles Darold

On 02/08/2021 at 23:22, Gilles Darold wrote:

On 02/08/2021 at 01:21, Tom Lane wrote:

Gilles Darold  writes:

[ v5-0001-regexp-foo-functions.patch ]

I've gone through this whole patch now, and found quite a lot that I did
not like.  In no particular order:

* Wrapping parentheses around the user's regexp doesn't work.  It can
turn an invalid regexp into a valid one: for example 'a)(b' should draw
a syntax error.  With this patch, no error would be thrown, but the
"outer" parens wouldn't do what you expected.  Worse, it can turn a
valid regexp into an invalid one: the metasyntax options described in
9.7.3.4 only work at the start of the regexp.  So we have to handle
whole-regexp cases honestly rather than trying to turn them into an
instance of the parenthesized-subexpression case.

* You did a lot of things quite inefficiently, apparently to avoid
touching any existing code.  I think it's better to extend
setup_regexp_matches() and replace_text_regexp() a little bit so that
they can support the behaviors these new functions need.  In both of
them, it's absolutely trivial to allow a search start position to be
passed in; and it doesn't take much to teach replace_text_regexp()
to replace only the N'th match.

* Speaking of N'th, there is not much of anything that I like
about Oracle's terminology for the function arguments, and I don't
think we ought to adopt it.  If we're documenting the functions as
processing the "N'th match", it seems to me to be natural to call
the parameter "N" not "occurrence".  Speaking of the "occurrence'th
occurrence" is just silly, not to mention long and easy to misspell.
Likewise, "position" is a horribly vague term for the search start
position; it could be interpreted to mean several other things.
"start" seems much better.  "return_opt" is likewise awfully unclear.
I went with "endoption" below, though I could be talked into something
else.  The only one of Oracle's choices that I like is "subexpr" for
subexpression number ... but you went with DB2's rather vague "group"
instead.  I don't want to use their "capture group" terminology,
because that appears nowhere else in our documentation.  Our existing
terminology is "parenthesized subexpression", which seems fine to me
(and also agrees with Oracle's docs).

* I spent a lot of time on the docs too.  A lot of the syntax specs
were wrong (where you put the brackets matters), many of the examples
seemed confusingly overcomplicated, and the text explanations needed
copy-editing.

* Also, the regression tests seemed misguided.  This patch is not
responsible for testing the regexp engine as such; we have tests
elsewhere that do that.  So I don't think we need complex regexps
here.  We just need to verify that the parameters of these functions
act properly, and check their error cases.  That can be done much
more quickly and straightforwardly than what you had.


So here's a revised version that I like better.  I think this
is pretty nearly committable, aside from the question of whether
a too-large subexpression number should be an error or not.


Thanks a lot for the patch improvements and the guidance. I have read the
patch and I agree with your choices; I think I was trying too hard to
mimic the Oracle-isms. I don't think we should take care of the too-large
subexpression number: the regexp writer should always test their regular
expression, and this check would not prevent them from choosing the wrong
capture group number, only a non-existing one.



Actually, I just found that the regexp_like() function doesn't support
the start parameter, which is something we should support. I saw that
Oracle does not support it but DB2 does, and I think we should also
support it. I will post a new version of the patch once it is done.



Best regards,

--
Gilles Darold





Extra code in commit_ts.h

2021-08-03 Thread Andrey V. Lepikhov

Hi,

I found two extra code lines in commit_ts.h (see attachment).
They confused me while exploring the code. If they are still needed,
maybe add some comments?


--
regards,
Andrey Lepikhov
Postgres Professional
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index e045dd416f..a1538978c6 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -15,14 +15,10 @@
 #include "datatype/timestamp.h"
 #include "replication/origin.h"
 #include "storage/sync.h"
-#include "utils/guc.h"
 
 
 extern PGDLLIMPORT bool track_commit_timestamp;
 
-extern bool check_track_commit_timestamp(bool *newval, void **extra,
-		 GucSource source);
-
 extern void TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
 		   TransactionId *subxids, TimestampTz timestamp,
 		   RepOriginId nodeid);


RE: SI messages sent when excuting ROLLBACK PREPARED command

2021-08-03 Thread liuhuail...@fujitsu.com
Hi, Tom

> >Hmmm, yeah, I think you're right.  It probably doesn't make a big difference 
> >in
> the real world --- anyone who's dependent on the performance of 2PC rollbacks
> is Doing It Wrong.
> > But we'd have already done LocalExecuteInvalidationMessage when getting
> out of the prepared transaction, so no other SI invals should be needed.
> Yes, it does not cause any error.
> 
> But for a beginner trying to understand the code, it may be confusing.
> And a developer building on this code may end up adding unnecessary
> handling.
> 
> So, I think it is better to optimize the code.
> 
> Here is the patch.
There was a problem with the previous patch during testing,
so I am resubmitting it.

Regards, Liu Huailing



0001-Disallow-sending-SI-messages-when-excuting-ROLLBACK.patch
Description:  0001-Disallow-sending-SI-messages-when-excuting-ROLLBACK.patch


Re: Failed transaction statistics to measure the logical replication progress

2021-08-03 Thread Amit Kapila
On Tue, Aug 3, 2021 at 10:59 AM Masahiko Sawada  wrote:
>
> On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila  wrote:
> >
> > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > >
> > > > Accordingly, I'm thinking to have unsuccessful and successful stats on 
> > > > the sub side.
> > > > Sawada-san is now implementing a new view in [1].
> > > > Do you think that I should write a patch to introduce a new separate 
> > > > view
> > > > or write a patch to add more columns to the new view 
> > > > "pg_stat_subscription_errors" that is added at [1] ?
> > >
> > > pg_stat_subscriptions_errors view I'm proposing is a view showing the
> > > details of error happening during logical replication. So I think a
> > > separate view or pg_stat_subscription view would be a more appropriate
> > > place.
> > >
> >
> > +1 for having these stats in pg_stat_subscription. Do we want to add
> > two columns (xact_commit: number of transactions successfully applied
> > in this subscription, xact_rollback: number of transactions that have
> > been rolled back in this subscription)
>
> Sounds good. We might want to have separate counters for the number of
> transactions failed due to an error and transactions rolled back by
> stream_abort.
>

I was trying to think based on similar counters in pg_stat_database,
but if you think there is value in showing errored and actually
rolled-back transactions separately, then we can do that. How do you
think one could make use of it?
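
For illustration, with such columns the usage would be as simple as
(hypothetical column names, following the suggestion quoted above):

SELECT subname, xact_commit, xact_rollback
  FROM pg_stat_subscription;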

-- 
With Regards,
Amit Kapila.




Use POPCNT on MSVC

2021-08-03 Thread David Rowley
Going by [1], it looks like we can use the __popcnt and __popcnt64
intrinsic functions on MSVC if the CPU supports POPCNT.  We already
have code to check for that, we just need to enable it on MSVC.

The attached patch seems to be all that's needed.
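
For reference, the intrinsic usage is roughly the following (a sketch,
not the attached patch; it assumes 64-bit MSVC and that a CPUID check
has already confirmed POPCNT support, since calling it on a CPU without
POPCNT is invalid):

#include <intrin.h>

/* __popcnt64 makes MSVC emit the POPCNT instruction on x64 */
static int
pg_popcount64_msvc(unsigned __int64 word)
{
    return (int) __popcnt64(word);
}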

David

[1] 
https://docs.microsoft.com/en-us/cpp/intrinsics/popcnt16-popcnt-popcnt64?view=msvc-140


use_popcnt_on_msvc.patch
Description: Binary data


Re: [HACKERS] logical decoding of two-phase transactions

2021-08-03 Thread Peter Smith
On Tue, Aug 3, 2021 at 5:02 PM Amit Kapila  wrote:
>
> On Tue, Aug 3, 2021 at 6:17 AM Peter Smith  wrote:
> >
> > Please find attached the latest patch set v102*
> >
>
> I have made minor modifications in the comments and docs, please see
> attached. Can you please check whether the names of contributors in
> the commit message are correct or do we need to change it?
>

I checked the differences between v102 and v103 and have no review
comments about the latest changes.

The commit message looks ok.

I applied the v103 to the current HEAD; no errors.
The build is ok.
The make check is ok.
The TAP subscription tests are ok.

I also rebuilt the PG docs and verified rendering of the updated pages looks ok.

The patch v103 LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: Added schema level support for publication.

2021-08-03 Thread houzj.f...@fujitsu.com
On Tuesday, August 3, 2021 4:10 PM houzj.f...@fujitsu.com 
 wrote:
> On August 2, 2021 11:40 PM vignesh C  wrote:
> >
> > Thanks for the comments, attached v17 patches has the fixes for the same.
> 
> Thanks for the new patch, it looks good to me except one minor thing:
> It might be better to add the [CREATE PUBLICATION xxx FOR SCHEMA ] in
> tab-complete.c

Sorry, the patch already had the logic, please ignore this comment.

Best regards,
houzj


RE: Added schema level support for publication.

2021-08-03 Thread houzj.f...@fujitsu.com
On August 2, 2021 11:40 PM vignesh C  wrote:
> 
> Thanks for the comments, attached v17 patches has the fixes for the same.

Thanks for the new patch, it looks good to me except one minor thing:
It might be better to add the [CREATE PUBLICATION xxx FOR SCHEMA ] in 
tab-complete.c

Best regards,
houzj


Re: Remove unused 'len' from pg_stat_recv_* functions

2021-08-03 Thread Masahiko Sawada
On Tue, Aug 3, 2021 at 3:17 PM Kyotaro Horiguchi
 wrote:
>
> At Tue, 3 Aug 2021 12:40:23 +0900, Masahiko Sawada  
> wrote in
> > Hi all,
> >
> > While working on a patch adding new stats, houzj pointed out that
> > the 'len' function arguments of all pgstat_recv_* functions are not used
> > at all. Looking at git history, pgstat_recv_* functions have had
> > ‘len’ since the stats collector was introduced by commit
> > 140ddb78fe 20 years ago, but it was not used at all even in the first
> > commit. It seems like the improvements to the stats collector so far
> > gave pgstat_recv_* functions ‘len’ for consistency with the
> > existing ones. Is there any historical reason for
> > having the 'len' argument? Or can we remove them?
> >
> > I've attached the patch that removes 'len' from all pgstat_recv_* functions.
>
> At first look, I thought that giving "len" as a parameter is
> reasonable for message-processing functions, but the given message
> struct contains the same value, and the functions can refer to the
> message length without the parameter if they want.  So I'm +-0 on the
> removal.
>
> It applies cleanly on the master and compiled without an error.
>
> That being said, I'm not sure it is worthwhile to change parameters of
> going-to-be-removed functions (if shared-memory stats collector is
> successfully introduced).

Good point.

I'm not going to insist on removing them, but I was a bit confused about
whether or not I should add 'len' when writing a new pgstat_recv_*
function.
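
For illustration, the change amounts to dropping the unused parameter
from each declaration and definition, e.g. (a representative sketch,
not a hunk copied from the attached patch):

-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
+static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg);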

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: straightening out backend process startup

2021-08-03 Thread Kyotaro Horiguchi
At Mon, 2 Aug 2021 09:41:24 -0700, Andres Freund  wrote in 
> Hi,
> 
> I've previously complained ([1]) that process initialization has gotten
> very complicated. I hit this once more last week when trying to commit
> one of the shared memory stats pieces...
> 
> I think there's quite a few different issues around this - here I'm just
> trying to tackle a few of the most glaring (to me):
> 
> - AuxiliaryProcessMain() is used for two independent tasks: Start bootstrap /
>   checker mode and starting auxiliary processes. In HEAD there's maybe 5 lines
>   out 250 that are actually common to both uses.
> 
>   A related oddity is that we reserve shared memory resources for bootstrap &
>   checker aux processes, despite those never existing.
> 
>   This is addressed in patches 1-7
> 
> - The order of invocation of InitProcess()/InitAuxiliaryProcess() and
>   BaseInit() depends on EXEC_BACKEND. Due to that there often is no single
>   place initialization code can be put if it needs any locks.
> 
>   This is addressed in patches 8-9
> 
> - PostgresMain() has code for single user and multi user interleaved, making
>   it unnecessarily hard to understand what's going on.
> 
>   This is addressed in patches 10
> 
> 
> This isn't a patch series ready to commit, there's a bunch of polishing that
> needs to be done if there's agreement.
> 
> 
> Questions I have:
> 
> - What exactly to do with checker mode: Keep it as part of bootstrap, separate
>   it out completely? What commandline flags?

Checker mode tries to attach shared memory just to make sure it is
actually attachable with a given set of parameters.  It is similar to
bootstrap in that it is not run under postmaster, but similar to an
auxiliary process in that it needs to attach shared memory.  If we are
going to get rid of shared-memory access by bootstrap, or of bootstrap
itself, checker mode should be separated out from bootstrap.

> - I used a separate argv entry to pass the aux proc type - do we rather want
>   to go for the approach that e.g. bgworker went for? Adding code for string
>   splitting seems a bit unnecessary to me.

It seems to me a separate entry is cleaner and more robust.

> - PostgresMainSingle() should probably not be in postgres.c. We could put it
>   into postinit.c or ..?

PostgresMainSingle() looks like the single-process version of
PostgresMain(), so it is natural that they are placed together in
postgres.c.  If PostgresMainSingle() is constructed as initializing
standalone mode first and then calling PostgresMain(), it might be right
for PostgresMain() to call an initialization function residing in
postinit.c if !IsUnderPostmaster:

PostgresMain()
{
    if (!IsUnderPostmaster)
        InitSinglePostgres(argv[0]);
    ...
}

> - My first attempt at PostgresMainSingle() separated the single/multi user
>   cases a bit more than the code right now, by having a PostgresMainCommon()
>   which was called by PostgresMainSingle(), PostgresMain(). *Common only
>   started with the MessageContext allocation, which did have the advantage of
>   splitting out a few of the remaining conditional actions in PostgresMain()
>   (PostmasterContext, welcome banner, Log_disconnections). But lead to a bit
>   more duplication. I don't really have an opinion on what's better.

I'm not sure what it looked like, but isn't it reasonable that quickdie()
and log_disconnections() handle IsUnderPostmaster instead?  Or, for
log_disconnections(), should Log_disconnections be turned off at
standalone initialization?

> - I had to move the PgStartTime computation to a bit earlier for single user
>   mode. That seems to make sense to me anyway, given that postmaster does so
>   fairly early too.
> 
>   Any reason that'd be a bad idea?
> 
>   Arguably it should even be a tad earlier to be symmetric.

Why don't you move the code for multi-user mode as early as standalone mode does?

> There's one further issue that I think is big enough to be worth
> tackling in the near future: Too many things depend on BackendIds.
> 
> Aux processes need procsignal and backend status reporting, which use
> BackendId for indexing. But they don't use sinval, so they don't have a
> BackendId - so we have hacks to work around that in a few places. If we
> instead make those places use pgprocno for indexing the whole issue
> vanishes.
> 
> In fact, I think there's a good argument to be made that we should
> entirely remove the concept of BackendIds and just use pgprocnos. We
> have a fair number of places like SignalVirtualTransaction() that need
> to search the procarray just to find the proc to signal based on the
> BackendId. If we used pgprocno instead, that'd not be needed.
> 
> But perhaps that's a separate thread.
> 
> Greetings,
> 
> Andres Freund
> 
> [1] https://postgr.es/m/20210402002240.56cuz3uo3alnqwae%40alap3.anarazel.de

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-03 Thread Soumyadeep Chakraborty
Hi Kyotaro,

Thanks for the review!

On Mon, Aug 2, 2021 at 11:42 PM Kyotaro Horiguchi
 wrote:

> One comment from me.
>
> +$node_standby->safe_psql('postgres', "ALTER SYSTEM SET 
> recovery_min_apply_delay TO 0;");
>
> It might be better to do "SET reco.. TO DEFAULT" instead.
>

Sure.

> And how about adding the new test at the end of existing tests. We can
> get rid of the extra lines in the diff.

No problem. See attached v2. I didn't do that initially as the test file
looks as though it is split into two halves: one for testing
recovery_min_apply_delay and the other for testing recovery pause. So while
making this change, I added a header comment for the newly added test case.
Please take a look.

Regards,
Soumyadeep (VMware)
From 8b4ea51e1a22548fdd3f6921fe374d3a9d987a77 Mon Sep 17 00:00:00 2001
From: Soumyadeep Chakraborty 
Date: Mon, 2 Aug 2021 20:50:37 -0700
Subject: [PATCH v2 1/1] Refresh delayUntil in recovery delay loop

This ensures that the wait interval in the loop is correctly
recalculated with the updated value of recovery_min_apply_delay.

Now, if one changes the GUC while we are waiting, those changes will
take effect. Practical applications include instantly cancelling a long
wait time by setting recovery_min_apply_delay to 0. This can be useful
in tests.
---
 src/backend/access/transam/xlog.c   | 11 ++---
 src/test/recovery/t/005_replay_delay.pl | 31 +++--
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f84c0bb01eb..89dc759586c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6234,12 +6234,11 @@ recoveryApplyDelay(XLogReaderState *record)
	if (!getRecordTimestamp(record, &xtime))
 		return false;
 
-	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
-
 	/*
 	 * Exit without arming the latch if it's already past time to apply this
 	 * record
 	 */
+	delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
 	msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(), delayUntil);
 	if (msecs <= 0)
 		return false;
@@ -6255,7 +6254,13 @@ recoveryApplyDelay(XLogReaderState *record)
 			break;
 
 		/*
-		 * Wait for difference between GetCurrentTimestamp() and delayUntil
+		 * Recalculate delayUntil as recovery_min_apply_delay could have changed
+		 * while we were waiting in the loop.
+		 */
+		delayUntil = TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);
+
+		/*
+		 * Wait for difference between GetCurrentTimestamp() and delayUntil.
 		 */
 		msecs = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
 delayUntil);
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 0b56380e0a7..d55f1fc257f 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -7,7 +7,7 @@ use warnings;
 
 use PostgresNode;
 use TestLib;
-use Test::More tests => 3;
+use Test::More tests => 4;
 
 # Initialize primary node
 my $node_primary = PostgresNode->new('primary');
@@ -56,7 +56,6 @@ $node_standby->poll_query_until('postgres',
 ok(time() - $primary_insert_time >= $delay,
 	"standby applies WAL only after replication delay");
 
-
 # Check that recovery can be paused or resumed expectedly.
 my $node_standby2 = PostgresNode->new('standby2');
 $node_standby2->init_from_backup($node_primary, $backup_name,
@@ -110,3 +109,31 @@ $node_standby2->poll_query_until('postgres',
 $node_standby2->promote;
 $node_standby2->poll_query_until('postgres', "SELECT NOT pg_is_in_recovery()")
   or die "Timed out while waiting for promotion to finish";
+
+# Now test to see if updates to recovery_min_apply_delay apply when a standby is
+# waiting for a recovery delay to elapse.
+
+# First, set the delay for the next commit to some obscenely high value.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO '2h';");
+$node_standby->reload;
+
+$node_primary->safe_psql('postgres',
+	"INSERT INTO tab_int VALUES (generate_series(51, 60))");
+
+# We should not have replayed the LSN from the last insert on the standby yet,
+# even though it will be received and flushed eventually. In other words, we
+# should be stuck in recovery_min_apply_delay.
+my $last_insert_lsn =
+	$node_primary->safe_psql('postgres', "SELECT pg_current_wal_lsn();");
+$node_primary->wait_for_catchup('standby', 'flush', $last_insert_lsn);
+is( $node_standby->safe_psql('postgres',
+	"SELECT pg_last_wal_replay_lsn() < '$last_insert_lsn'::pg_lsn;"),
+	't', 'standby stuck in recovery_min_apply_delay');
+
+# Clear the recovery_min_apply_delay timeout so that the wait is immediately
+# cancelled and replay can proceed.
+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET recovery_min_apply_delay TO DEFAULT;");
+$node_standby->reload;
+
+# Now the standby should catch up.
+$node_primary->wait_for_catchup('standby');
-- 
2.25.1



Re: [HACKERS] logical decoding of two-phase transactions

2021-08-03 Thread Amit Kapila
On Tue, Aug 3, 2021 at 6:17 AM Peter Smith  wrote:
>
> Please find attached the latest patch set v102*
>

I have made minor modifications in the comments and docs, please see
attached. Can you please check whether the names of contributors in
the commit message are correct or do we need to change it?

-- 
With Regards,
Amit Kapila.


v103-0001-Add-prepare-API-support-for-streaming-transacti.patch
Description: Binary data


Re: Lowering the ever-growing heap->pd_lower

2021-08-03 Thread Simon Riggs
On Tue, 18 May 2021 at 20:33, Peter Geoghegan  wrote:
>
> On Tue, May 18, 2021 at 12:29 PM Matthias van de Meent
>  wrote:
> > PFA the updated version of this patch. Apart from adding line pointer
> > truncation in PageRepairFragmentation (as in the earlier patches), I
> > also altered PageTruncateLinePointerArray to clean up all trailing
> > line pointers, even if it was the last item on the page.
>
> Can you show a practical benefit to this patch, such as an improvement
> in throughout or in efficiency for a given workload?
>
> It was easy to see that having something was better than having
> nothing at all. But things are of course different now that we have
> PageTruncateLinePointerArray().

There does seem to be utility in Matthias' patch, which currently does
two things:
1. Allow the same thing as PageTruncateLinePointerArray() during HOT cleanup
That is going to have a clear benefit for HOT workloads, which by
their nature will use a lot of line pointers.
Many applications are updated much more frequently than they are vacuumed.
Peter - what is your concern about doing this more frequently? Why
would we *not* do this?

2. Reduce number of line pointers to 0 in some cases.
Matthias - I don't think you've made a full case for doing this, nor
looked at the implications.
The comment clearly says "it seems like a good idea to avoid leaving a
PageIsEmpty()" page behind.
So I would be inclined to remove that from the patch and consider that
as a separate issue, or close this.

-- 
Simon Riggs                http://www.EnterpriseDB.com/




Re: when the startup process doesn't (logging startup delays)

2021-08-03 Thread Nitin Jadhav
> Thanks. Can you please have a look at what I suggested down toward the
> bottom of 
> http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
?

Implemented the above approach and verified the patch. Kindly have a
look and share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Fri, Jul 30, 2021 at 10:40 AM Nitin Jadhav
 wrote:
>
> > Thanks. Can you please have a look at what I suggested down toward the
> > bottom of 
> > http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
> > ?
> >
> > If we're going to go the route of combining the functions, I agree
> > that a Boolean is the way to go; I think we have some existing
> > precedent for 'bool finished' rather than 'bool done'.
> >
> > But I kind of wonder if we should have an enum in the first place. It
> > feels like we've got code in a bunch of places that just exists to
> > decide which enum value to use, and then code someplace else that
> > turns around and decides which message to produce. That would be
> > sensible if we were using the same enum values in lots of places, but
> > that's not the case here. So suppose we just moved the messages to the
> > places where we're now calling LogStartupProgress() or
> > LogStartupProgressComplete()? So something like this:
>
> Sorry. I thought it was related to the discussion of deciding whether
> LogStartupProgress() and LogStartupProgressComplete() should be
> combined or not. I feel it's a really nice design. With this we avoid
> an "action at a distance" issue and it's easy to use. If we were
> reporting the same kind of messages in multiple places, then the current
> approach of using an enum would be more suitable since we don't have to
> worry about matching the log message string. But in the current scenario,
> we are not using the same kind of messages in multiple places (I feel
> such a scenario will not occur in the future either; even if there is a
> similar operation, it can be distinguished, as resetting unlogged
> relations is distinguished by init and clean. Kindly mention if you can
> foresee any such scenario), hence the approach you are suggesting is the
> best fit.
>
> Thanks & Regards,
> Nitin Jadhav
>
> On Thu, Jul 29, 2021 at 9:48 PM Robert Haas  wrote:
> >
> > On Thu, Jul 29, 2021 at 4:56 AM Nitin Jadhav
> >  wrote:
> > > Thanks for sharing the information. I have done the necessary changes
> > > to show the logs during the latter case (postgres --single) and
> > > verified the log messages.
> >
> > Thanks. Can you please have a look at what I suggested down toward the
> > bottom of 
> > http://postgr.es/m/ca+tgmoap2wefsktmcgwt9lxuz7y99hnduyshpk7qnfuqb98...@mail.gmail.com
> > ?
> >
> > --
> > Robert Haas
> > EDB: http://www.enterprisedb.com


v9-0001-startup-process-progress.patch
Description: Binary data


Re: Changes to recovery_min_apply_delay are ignored while waiting for delay

2021-08-03 Thread Kyotaro Horiguchi
At Mon, 2 Aug 2021 22:21:56 -0700, Soumyadeep Chakraborty 
 wrote in 
> Hello,
> 
> I came across this issue while I was tweaking a TAP test with Ashwin for
> this thread [1].
> 
> We noticed that while the startup process waits for a recovery delay, it does
> not respect changes to the recovery_min_apply_delay GUC. So:
> 
> // Standby is waiting on recoveryWakeupLatch with a large
> recovery_min_apply_delay value
> node_standby->safe_psql('postgres', 'ALTER SYSTEM SET
> recovery_min_apply_delay TO 0;');
> node_standby->reload;
> // Standby is still waiting, it does not proceed until the old timeout
> is elapsed.
> 
> I attached a small patch to fix this. It makes it so that
> recovery_min_apply_delay is reconsulted in the loop to redetermine the wait
> interval. This makes it similar to the loop in CheckpointerMain, where
> CheckPointTimeout is consulted after interrupts are handled:
> 
> if (elapsed_secs >= CheckPointTimeout)
> continue; /* no sleep for us ... */
> 
> I have also added a test to 005_replay_delay.pl.
> 
> Regards,
> Soumyadeep(VMware)
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CANXE4Tc3FNvZ_xAimempJWv_RH9pCvsZH7Yq93o1VuNLjUT-mQ%40mail.gmail.com

Sounds reasonable and the patch looks fine as a whole. Applied cleanly
and works as expected.  The added test properly catches the issue.

One comment from me.

+$node_standby->safe_psql('postgres', "ALTER SYSTEM SET 
recovery_min_apply_delay TO 0;");

It might be better to do "SET reco.. TO DEFAULT" instead.

And how about adding the new test at the end of existing tests. We can
get rid of the extra lines in the diff.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Added schema level support for publication.

2021-08-03 Thread tanghy.f...@fujitsu.com
On Monday, August 2, 2021 11:40 PM vignesh C wrote:
> 
> Thanks for the comments, attached v17 patches has the fixes for the same.

Thanks for your new patch.

I saw the following warning when compiling. It seems we don't need this 
variable any more.

publicationcmds.c: In function ‘AlterPublicationSchemas’:
publicationcmds.c:592:15: warning: unused variable ‘oldlc’ [-Wunused-variable]
   ListCell   *oldlc;
   ^

Regards
Tang


Re: Remove unused 'len' from pg_stat_recv_* functions

2021-08-03 Thread Kyotaro Horiguchi
At Tue, 3 Aug 2021 12:40:23 +0900, Masahiko Sawada  
wrote in 
> Hi all,
> 
> While working on a patch adding new stats, houzj pointed out that
> the 'len' function arguments of all pgstat_recv_* functions are not used
> at all. Looking at git history, pgstat_recv_* functions have had
> ‘len’ since the stats collector was introduced by commit
> 140ddb78fe 20 years ago, but it was not used at all even in the first
> commit. It seems like the improvements to the stats collector so far
> gave pgstat_recv_* functions ‘len’ for consistency with the
> existing ones. Is there any historical reason for
> having the 'len' argument? Or can we remove them?
> 
> I've attached the patch that removes 'len' from all pgstat_recv_* functions.

At first look, I thought that giving "len" as a parameter is
reasonable for message-processing functions, but the given message
struct contains the same value, and the functions can refer to the
message length without the parameter if they want.  So I'm +-0 on the
removal.

It applies cleanly on the master and compiled without an error.

That being said, I'm not sure it is worthwhile to change parameters of
going-to-be-removed functions (if shared-memory stats collector is
successfully introduced).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


RE: Failed transaction statistics to measure the logical replication progress

2021-08-03 Thread osumi.takami...@fujitsu.com
On Tuesday, August 3, 2021 2:29 PM  Masahiko Sawada  
wrote:
> On Tue, Aug 3, 2021 at 11:47 AM Amit Kapila 
> wrote:
> >
> > On Mon, Aug 2, 2021 at 1:13 PM Masahiko Sawada
>  wrote:
> > >
> > > On Mon, Aug 2, 2021 at 2:52 PM osumi.takami...@fujitsu.com
> > >  wrote:
> > > >
> > > >
> > > > Accordingly, I'm thinking to have unsuccessful and successful stats on 
> > > > the
> sub side.
> > > > Sawada-san is now implementing a new view in [1].
> > > > Do you think that I should write a patch to introduce a new
> > > > separate view or write a patch to add more columns to the new view
> "pg_stat_subscription_errors" that is added at [1] ?
> > >
> > > pg_stat_subscriptions_errors view I'm proposing is a view showing
> > > the details of error happening during logical replication. So I
> > > think a separate view or pg_stat_subscription view would be a more
> > > appropriate place.
> > >
> >
> > +1 for having these stats in pg_stat_subscription. Do we want to add
> > two columns (xact_commit: number of transactions successfully applied
> > in this subscription, xact_rollback: number of transactions that have
> > been rolled back in this subscription)
> 
> Sounds good. We might want to have separate counters for the number of
> transactions failed due to an error and transactions rolled back by 
> stream_abort.
Okay. I want to make those separate as well for this feature.


> pg_stat_subscription currently shows logical replication worker stats on the
> shared memory. I think xact_commit and xact_rollback should survive beyond
> the server restarts, so it would be better to be collected by the stats 
> collector.
> My skipping transaction patch adds a hash table of which entry represents a
> subscription stats. I guess we can use the hash table so that one subscription
> stats entry has both transaction stats and errors.
> 
> >  or do you guys have something else in mind?
> 
> Osumi-san was originally concerned that there is no way to grasp the exact
> number and size of successful and unsuccessful transactions. The above idea
> covers only the number of successful and unsuccessful transactions but not the
> size. What do you think, Osumi-san?
Yeah, I think tracking the sizes of failed and rolled-back transactions
is helpful for identifying the genuine network consumption,
as mentioned by Vignesh in another mail.
I'd like to include those as well and post a patch for this.

Best Regards,
Takamichi Osumi