Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread Thomas Munro
On Thu, Oct 22, 2020 at 5:52 PM Tom Lane  wrote:
> Per the referenced bug-reporting thread, it was ReiserFS and/or NFS on
> SLES 9.3; so, dubious storage choices on an ancient-even-then Linux
> kernel.

Oh.  I can reproduce that on a modern Linux box by forcing
writeback to a full NFS filesystem[1], approximately as the kernel
does asynchronously when it feels like it, causing the size reported
by SEEK_END to go down.

$ cat magic_shrinking_file.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main()
{
  int fd;
  char buffer[8192] = {0};

  fd = open("/mnt/test_loopback_remote/dir/file", O_RDWR | O_APPEND);
  if (fd < 0) {
perror("open");
return EXIT_FAILURE;
  }
  printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
  printf("write(...)  = %zd\n", write(fd, buffer, sizeof(buffer)));
  printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));
  printf("fsync(...) = %d\n", fsync(fd));
  printf("lseek(..., SEEK_END) = %jd\n", lseek(fd, 0, SEEK_END));

  return EXIT_SUCCESS;
}
$ cc magic_shrinking_file.c
$ ./a.out
lseek(..., SEEK_END) = 9670656
write(...)  = 8192
lseek(..., SEEK_END) = 9678848
fsync(...) = -1
lseek(..., SEEK_END) = 9670656

> I think the takeaway point is not so much that that particular bug
> might recur as that storage infrastructure does sometimes have bugs.
> If you're wanting to introduce new assumptions about what the filesystem
> will do, it's prudent to think about how badly will we break if the
> assumptions fail.

Yeah.  My point was just that the caching trick doesn't seem to
improve matters on this particular front, it can just cache a bogus
value.

[1] 
https://www.postgresql.org/message-id/CAEepm=1FGo=ACPKRmAxvb53mBwyVC=tdwte0dmzkwjdbayw...@mail.gmail.com




Re: Mop-up around psql's \connect behavior

2020-10-21 Thread Kyotaro Horiguchi
At Thu, 22 Oct 2020 00:34:20 -0400, Tom Lane  wrote in 
> Kyotaro Horiguchi  writes:
> > At Wed, 21 Oct 2020 18:59:04 -0400, Tom Lane  wrote in 
> > But once it got on my mind, it might be strange that just \c or \c
> > -reuse-previous=y doesn't reconnect a broken session. It might be
> > better we reuse the previous connection parameter even if the
> > connection has been lost, but this would be another issue.
> 
> I did actually look into saving the active connection's PQconninfo
> immediately at connection establishment and then referring to it in any
> subsequent \connect.  Then things could work the same even if the original
> connection had failed meanwhile.  But there are technical details that
> make that a lot harder than it seems on the surface --- mainly, that the
> way do_connect works now requires that it have a copy of the PQconninfo
> data that it can scribble on, and that won't do if we need the saved
> PQconninfo to persist when a \connect attempt fails.  That could be dealt
> with with enough new code, but I didn't think it was worth the trouble.

Agreed.

> (Note that we developers face the server-crashed scenario a whole lot more
> often than normal people ;-), so we probably overrate how useful it'd be
> to be able to reconnect in that case.)

Agreed^2.

> >> * The old-style-syntax code path understands that it should re-use
> >> the old password (if any) when the user, host, and port settings
> >> haven't changed.  The connstring code path was too lazy to make
> >> that work, but now that we're deconstructing the connstring there's
> >> very little excuse for not having it act the same way.
> 
> > +1 (I thought sslmode might affect but that is wrong since cert
> > authentication cannot be turned off from command line.)
> 
> Yeah.  That could affect whether the server asks for a password at
> all, but you have to really stretch to think of cases where it could
> result in needing a *different* password.  An example perhaps is
> where pg_hba.conf is configured to do, say, LDAP auth on SSL connections
> and normal password auth on non-SSL, and the LDAP server has a different
> password than what is in pg_authid.  But that seems like something nobody
> could want.  Also notice that unlike the previous code, with my patch
> do_connect will always (barring --no-password) give you an opportunity
> to interactively supply a password, even if we initially reused an
> old password and it didn't work.  So it seems like this will cope
> even if you do have a setup as wacko as that.

I thought of those scenarios too and reached the same conclusion. Sounds
reasonable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread Kyotaro Horiguchi
At Thu, 22 Oct 2020 16:35:27 +1300, Thomas Munro  wrote 
in 
> On Thu, Oct 22, 2020 at 3:07 PM k.jami...@fujitsu.com
>  wrote:
> +/*
> + * Get the total number of to-be-invalidated blocks of a relation as well
> + * as the total blocks for a given fork.  The cached value returned by
> + * smgrnblocks could be smaller than the actual number of existing 
> buffers
> + * of the file.  This is caused by buggy Linux kernels that might not 
> have
> + * accounted for the recent write.  Give up the optimization if the block
> + * count of any fork cannot be trusted.
> + */
> +for (i = 0; i < nforks; i++)
> +{
> +/* Get the number of blocks for a relation's fork */
> +nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &accurate);
> +
> +if (!accurate)
> +break;
> 
> Hmmm.  The Linux comment led me to commit ffae5cc and a 2006 thread[1]
> showing a buggy sequence of system calls.  AFAICS it was not even an
> SMP/race problem of the type you might half expect, it was a single
> process not seeing its own write.  I didn't find details on the
> version, filesystem etc.

Anyway, that comment is irrelevant to the added code. The point here is
that the returned value may not be reliable, not only due to
kernel bugs, but also because the file can be extended/truncated by other
processes. But I suppose that we may have a synchronized file-size cache
in the future?

> Searching for our message "This has been seen to occur with buggy
> kernels; consider updating your system" turns up recent-ish results
> too.  The reports I read involved GlusterFS, which I don't personally
> know anything about, but it claims full POSIX compliance, and POSIX is
> strict about that sort of thing, so I'd guess that is/was a fairly
> serious bug or misconfiguration.  Surely there must be other symptoms
> for PostgreSQL on such systems too, like sequential scans that don't
> see recently added pages.
> 
> But... does the proposed caching behaviour and "accurate" flag really
> help with any of that?  Cached values come from lseek() anyway.  If we

That "accurate" (good name wanted) flag suggest that it is guaranteed
that we don't have a buffer for blocks after that block number.

> just trusted unmodified smgrnblocks(), someone running on such a
> forgetful file system might eventually see nasty errors because we
> left buffers in the buffer pool that prevent a checkpoint from
> completing (and panic?), but they might also see other really strange
> errors, and that applies with or without that "accurate" flag, no?
> 
> [1] 
> https://www.postgresql.org/message-id/flat/26202.1159032931%40sss.pgh.pa.us

smgrtruncate and smgrextend modify that cache from their parameters,
not from lseek().  At the very beginning the value in the cache comes from
lseek(), but if nothing other than postgres has changed the file size,
I believe we can rely on the cache even with such buggy kernels, if they
still exist.

If there's no longer such a buggy kernel, we can rely on lseek() only
while InRecovery. If we had a synchronized file-size cache we could rely
on the cache even while !InRecovery.  (I'm not sure how vacuum
affects this, though.)
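
To make the idea concrete, here is a minimal self-contained sketch of the
caching scheme being discussed (names such as CachedFork and fork_nblocks
are illustrative only and are not the actual smgr API in the patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct CachedFork
{
	BlockNumber cached_nblocks;	/* InvalidBlockNumber if not cached */
} CachedFork;

/* extend/truncate update the cache from their own parameters, not lseek() */
static void
fork_extended_to(CachedFork *fork, BlockNumber new_nblocks)
{
	fork->cached_nblocks = new_nblocks;
}

static void
fork_truncated_to(CachedFork *fork, BlockNumber new_nblocks)
{
	fork->cached_nblocks = new_nblocks;
}

/*
 * Return the block count.  "accurate" is set only when the value comes
 * from our own cache, i.e. it is guaranteed that no buffer exists beyond
 * the returned block number.
 */
static BlockNumber
fork_nblocks(CachedFork *fork, BlockNumber lseek_result, bool *accurate)
{
	if (fork->cached_nblocks != InvalidBlockNumber)
	{
		*accurate = true;
		return fork->cached_nblocks;
	}
	*accurate = false;	/* fall back to the (possibly bogus) lseek() value */
	return lseek_result;
}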

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Custom compression methods

2020-10-21 Thread Dilip Kumar
On Thu, Oct 22, 2020 at 2:11 AM Tomas Vondra
 wrote:
>
> On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:
> >On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar  wrote:
> >>
> >> On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar  wrote:
> >> >
> >> > On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
> >> >  wrote:
> >> > >
> >> > > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> >> > > >
> >> > > >> ...
> >> > > >
> >> > > >I have worked on this patch, so as discussed now I am maintaining the
> >> > > >preserved compression methods using dependency.  Still PRESERVE ALL
> >> > > >syntax is not supported, I will work on that part.
> >> > > >
> >> > >
> >> > > Cool, I'll take a look. What's your opinion on doing it this way? Do 
> >> > > you
> >> > > think it's cleaner / more elegant, or is it something contrary to what
> >> > > the dependencies are meant to do?
> >> >
> >> > I think this looks much cleaner.  Moreover, I feel that once we start
> >> > supporting the custom compression methods then we anyway have to
> >> > maintain the dependency so using that for finding the preserved
> >> > compression method is good option.
> >>
> >> I have also implemented the next set of patches.
> >> 0004 -> Provide a way to create custom compression methods
> >> 0005 -> Extention to implement lz4 as a custom compression method.
> >
> >In the updated version I have worked on some of the listed items
> >> A pending list of items:
> >> 1. Provide support for handling the compression option
> >> - As discussed up thread I will store the compression option of the
> >> latest compression method in a new field in pg_atrribute table
> >> 2. As of now I have kept zlib as the second built-in option and lz4 as
> >> a custom compression extension.  In Offlist discussion with Robert, he
> >> suggested that we should keep lz4 as the built-in method and we can
> >> move zlib as an extension because lz4 is faster than zlib so better to
> >> keep that as the built-in method.  So in the next version, I will
> >> change that.  Any different opinion on this?
> >
> >Done
> >
> >> 3. Improve the documentation, especially for create_compression_method.
> >> 4. By default support table compression method for the index.
> >
> >Done
> >
> >> 5. Support the PRESERVE ALL option so that we can preserve all
> >> existing lists of compression methods without providing the whole
> >> list.
> >
> >1,3,5 points are still pending.
> >
>
> Thanks. I took a quick look at the patches and I think it seems fine. I
> have one question, though - toast_compress_datum contains this code:
>
>
>  /* Call the actual compression function */
>  tmp = cmroutine->cmcompress((const struct varlena *) value);
>  if (!tmp)
>  return PointerGetDatum(NULL);
>
>
> Shouldn't this really throw an error instead? I mean, if the compression
> library returns NULL, isn't that an error?

I don't think that we can throw an error here because pglz_compress
might return -1 if it finds that it cannot reduce the size of the
data; we consider such data "incompressible" and return
NULL.  In such a case the caller will try to compress another
attribute of the tuple.  I think we can handle such cases in the
specific handler functions.
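
For illustration only, a compression handler's cmcompress callback could
take roughly the following shape (the function name and details here are
assumptions, not the actual patch code); it returns NULL for
incompressible input instead of raising an error:

#include "postgres.h"
#include "common/pg_lzcompress.h"

static struct varlena *
pglz_cmcompress(const struct varlena *value)
{
	int32		valsize = VARSIZE_ANY_EXHDR(value);
	struct varlena *tmp;
	int32		len;

	/* allocate enough room for the worst case plus a varlena header */
	tmp = (struct varlena *) palloc(PGLZ_MAX_OUTPUT(valsize) + VARHDRSZ);

	len = pglz_compress(VARDATA_ANY(value), valsize,
						(char *) tmp + VARHDRSZ,
						PGLZ_strategy_default);
	if (len < 0)
	{
		/* data could not be shrunk: report "incompressible", not an error */
		pfree(tmp);
		return NULL;
	}

	SET_VARSIZE(tmp, len + VARHDRSZ);
	return tmp;
}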

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread Kyotaro Horiguchi
At Thu, 22 Oct 2020 10:44:53 +0900, Masahiro Ikeda  
wrote in 
> On 2020-10-21 18:03, Kyotaro Horiguchi wrote:
> > At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
> >  wrote in
> >> On 2020-10-20 12:46, Amit Kapila wrote:
> >> > I see that we also need to add extra code to capture these stats (some
> >> > of which is in performance-critical path especially in
> >> > XLogInsertRecord) which again makes me a bit uncomfortable. It might
> >> > be that it is all fine as it is very important to collect these stats
> >> > at cluster-level in spite that the same information can be gathered at
> >> > statement-level to help customers but I don't see a very strong case
> >> > for that in your proposal.
> > We should avoid that duplication as possible even if the both number
> > are important.
> > 
> >> Also about performance, I thought there are few impacts because it
> >> increments stats in memory. If I can implement to reuse pgWalUsage's
> >> value which already collects these stats, there is no impact in
> >> XLogInsertRecord.
> >> For example, how about pg_stat_wal() calculates the accumulated
> >> value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's
> >> value?
> > I don't think that works, but it would work that pgstat_send_wal()
> > takes the difference of that values between two successive calls.
> > WalUsage prevWalUsage;
> > ...
> > pgstat_send_wal()
> > {
> > ..
> >/* fill in some values using pgWalUsage */
> >WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
> >WalStats.m_wal_records = pgWalUsage.wal_records -
> >prevWalUsage.wal_records;
> >WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
> > ...
>pgstat_send(&WalStats, sizeof(WalStats));
> >/* remember the current numbers */
> >prevWalUsage = pgWalUsage;
> 
> Thanks for your advice. This code can avoid the performance impact of
> critical code.
> 
> By the way, what do you think to add these statistics to the
> pg_stat_wal view?
> I thought to remove the above statistics because there is advice that
> PG13's features,
> for example, pg_stat_statement view, vacuum log, and so on can cover
> use-cases.


At Thu, 22 Oct 2020 10:09:21 +0900, Masahiro Ikeda  
wrote in 
> >> I agreed that the statement-level stat is important and I understood
> >> that we can
> >> know the aggregated WAL stats of pg_stat_statement view and
> >> autovacuum's
> >> log.
> >> But now, WAL stats generated by autovacuum can be output to logs and
> >> it
> >> is not
> >> easy to aggregate them. Since WAL writes impacts for the entire
> >> cluster,
> >> I thought
> >> it's natural to provide accumulated value.
> >> 
> > I think it is other way i.e if we would have accumulated stats then it
> > makes sense to provide those at statement-level because one would like
> > to know the exact cause of more WAL activity. Say it is due to an
> > autovacuum or due to the particular set of statements then it would
> > easier for users to do something about it.
> 
> OK, I'll remove them.
> Do you have any comments for other statistics?

That discussion comes from the fact that the code adds duplicate code
in a hot path.  If that extra cost doesn't exist, we are free to
add the accumulated values to pg_stat_wal. I think they are useful for
stats-collecting tools as long as we can do that without such an extra
cost.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread Tom Lane
Thomas Munro  writes:
> Hmmm.  The Linux comment led me to commit ffae5cc and a 2006 thread[1]
> showing a buggy sequence of system calls.

Hah, blast from the past ...

> AFAICS it was not even an
> SMP/race problem of the type you might half expect, it was a single
> process not seeing its own write.  I didn't find details on the
> version, filesystem etc.

Per the referenced bug-reporting thread, it was ReiserFS and/or NFS on
SLES 9.3; so, dubious storage choices on an ancient-even-then Linux
kernel.

I think the takeaway point is not so much that that particular bug
might recur as that storage infrastructure does sometimes have bugs.
If you're wanting to introduce new assumptions about what the filesystem
will do, it's prudent to think about how badly will we break if the
assumptions fail.

regards, tom lane




User accounts on windows

2020-10-21 Thread Joel Mariadasan (jomariad)
Background:
We have an installer for our application, as part of which we are planning to
include the archive
postgresql-13.0-1-windows-x64-binaries.zip, which will be extracted along with
the installation of our application.

When the archive is extracted, the folders will be owned by the
current user who is installing our application (who will belong to the
administrators group).
This is contrary to the approach followed by the EnterpriseDB installer, where
a separate user "postgres" owns the folders and the processes created.
We want to simplify the installation approach because we often hit
permission issues while using the separately created "postgres" user.
That's why we want to make the current user own the PostgreSQL files and services.

Note: This is not about the "postgres" database user account.

Following are the queries with regards to the approach:

  1.  Will having an administrator user own the postgres installation and data
directory cause any side effects?
  2.  The postgres process will also be owned by a user in the administrators
group. Will it cause any issues from a security perspective?

Regards,
Joel


Re: Mop-up around psql's \connect behavior

2020-10-21 Thread Tom Lane
Kyotaro Horiguchi  writes:
> At Wed, 21 Oct 2020 18:59:04 -0400, Tom Lane  wrote in 
>> ...  What I propose
>> is to complain if we have no o_conn *and* we are asked to re-use
>> parameters from it.  Otherwise, it's fine.

> The reason I haven't complain about this is I don't reconnect by \c
> after involuntary disconnection. (That is, C-d then psql again:p)

Yeah, me too.

> But once it got on my mind, it might be strange that just \c or \c
> -reuse-previous=y doesn't reconnect a broken session. It might be
> better we reuse the previous connection parameter even if the
> connection has been lost, but this would be another issue.

I did actually look into saving the active connection's PQconninfo
immediately at connection establishment and then referring to it in any
subsequent \connect.  Then things could work the same even if the original
connection had failed meanwhile.  But there are technical details that
make that a lot harder than it seems on the surface --- mainly, that the
way do_connect works now requires that it have a copy of the PQconninfo
data that it can scribble on, and that won't do if we need the saved
PQconninfo to persist when a \connect attempt fails.  That could be dealt
with with enough new code, but I didn't think it was worth the trouble.
(Note that we developers face the server-crashed scenario a whole lot more
often than normal people ;-), so we probably overrate how useful it'd be
to be able to reconnect in that case.)

>> * The old-style-syntax code path understands that it should re-use
>> the old password (if any) when the user, host, and port settings
>> haven't changed.  The connstring code path was too lazy to make
>> that work, but now that we're deconstructing the connstring there's
>> very little excuse for not having it act the same way.

> +1 (I thought sslmode might affect but that is wrong since cert
> authentication cannot be turned off from command line.)

Yeah.  That could affect whether the server asks for a password at
all, but you have to really stretch to think of cases where it could
result in needing a *different* password.  An example perhaps is
where pg_hba.conf is configured to do, say, LDAP auth on SSL connections
and normal password auth on non-SSL, and the LDAP server has a different
password than what is in pg_authid.  But that seems like something nobody
could want.  Also notice that unlike the previous code, with my patch
do_connect will always (barring --no-password) give you an opportunity
to interactively supply a password, even if we initially reused an
old password and it didn't work.  So it seems like this will cope
even if you do have a setup as wacko as that.
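
(For the record, the "wacko" setup described above would look something
like this in pg_hba.conf, with made-up host and LDAP details:

# SSL connections authenticate against an LDAP server ...
hostssl    all  all  0.0.0.0/0  ldap ldapserver=ldap.example.com ldapprefix="uid=" ldapsuffix=",dc=example,dc=com"
# ... while non-SSL connections use the password stored in pg_authid
hostnossl  all  all  0.0.0.0/0  password

so a reconnect that flips sslmode could indeed end up needing a different
password.)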

regards, tom lane




Re: Parallel INSERT (INTO ... SELECT ...)

2020-10-21 Thread Greg Nancarrow
On Fri, Oct 16, 2020 at 9:26 PM Amit Kapila  wrote:
>
>
> Cool, let me try to explain my thoughts a bit more. The idea is first
> (in standard_planner) we check if there is any 'parallel_unsafe'
> function/expression (via max_parallel_hazard) in the query tree. If we
> don't find anything 'parallel_unsafe' then we mark parallelModeOk. At
> this stage, the patch is checking whether there is any
> 'parallel_unsafe' or 'parallel_restricted' expression/function in the
> target relation and if there is none then we mark parallelModeOK as
> true. So, if there is anything 'parallel_restricted' then we will mark
> parallelModeOK as false which doesn't seem right to me.
>
> Then later in the planner during set_rel_consider_parallel, we
> determine if a particular relation can be scanned from within a
> worker, then we consider that relation for parallelism. Here, we
> determine if certain things are parallel-restricted then we don't
> consider this for parallelism. Then we create partial paths for the
> relations that are considered for parallelism. I think we don't need
> to change anything for the current patch in these later stages because
> we anyway are not considering Insert to be pushed into workers.
> However, in the second patch where we are thinking to push Inserts in
> workers, we might need to do something to filter parallel-restricted
> cases during this stage of the planner.
>

Posting an update to the smaller patch (Parallel SELECT for INSERT
INTO...SELECT...).

Most of this patch feeds into the larger Parallel INSERT patch, for
which I'll also be posting an update soon.

Patch updates include:
- Removed explicit trigger-type checks (instead rely on declared
trigger parallel safety)
- Restored parallel-related XID checks that previous patch altered;
now assign XID prior to entering parallel-mode
- Now considers parallel-SELECT for parallel RESTRICTED cases (not
just parallel SAFE cases)
- Added parallel-safety checks for partition key expressions and
support functions
- Workaround added for test failure in "partition-concurrent-attach"
test; since ALTER TABLE operations may exclusively lock a relation
until end-of-transaction, now assume and return UNSAFE if can't
acquire a share-lock on the relation, rather than block until
potentially end of the other concurrent transaction in which the
exclusive lock is held.
Examples of when a relation is exclusively locked
(AccessExclusiveLock) until end-of-transaction include:
ALTER TABLE DROP COLUMN
ALTER TABLE ... RENAME
ALTER TABLE ... ATTACH PARTITION  (locks default partition)

Regards,
Greg Nancarrow
Fujitsu Australia


v3-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT.patch
Description: Binary data


Re: A new function to wait for the backend exit after termination

2020-10-21 Thread Bharath Rupireddy
On Thu, Oct 22, 2020 at 8:39 AM David G. Johnston
 wrote:
>
>> If the backend is terminated within the user specified timeout then
>> the function returns true, otherwise false.
>
> I’m suggesting an option for the second case to fail instead of returning 
> false.
>

That seems fine.

>
>> >
>> > I could imagine, in theory at least, wanting to wait for a backend to go 
>> > idle as well as for it disappearing.  Scope creep in terms of this patch's 
>> > goal but worth at least considering now.
>> >
>>
>> IIUC, do we need a new option, something like pg_wait_backend(pid,
>> timeout, waituntil) where "waituntil" if specified "idle" waits until
>> the given backend goes to idle mode, or "termination" waits until
>> termination?
>>
>> If my understanding is wrong, could you please explain more?
>
>
> Yes, this describes what i was thinking.
>

+1.

I will implement this functionality and post a new patch soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread Thomas Munro
On Thu, Oct 22, 2020 at 3:07 PM k.jami...@fujitsu.com
 wrote:
+/*
+ * Get the total number of to-be-invalidated blocks of a relation as well
+ * as the total blocks for a given fork.  The cached value returned by
+ * smgrnblocks could be smaller than the actual number of existing buffers
+ * of the file.  This is caused by buggy Linux kernels that might not have
+ * accounted for the recent write.  Give up the optimization if the block
+ * count of any fork cannot be trusted.
+ */
+for (i = 0; i < nforks; i++)
+{
+/* Get the number of blocks for a relation's fork */
+nForkBlocks[i] = smgrnblocks(smgr_reln, forkNum[i], &accurate);
+
+if (!accurate)
+break;

Hmmm.  The Linux comment led me to commit ffae5cc and a 2006 thread[1]
showing a buggy sequence of system calls.  AFAICS it was not even an
SMP/race problem of the type you might half expect, it was a single
process not seeing its own write.  I didn't find details on the
version, filesystem etc.

Searching for our message "This has been seen to occur with buggy
kernels; consider updating your system" turns up recent-ish results
too.  The reports I read involved GlusterFS, which I don't personally
know anything about, but it claims full POSIX compliance, and POSIX is
strict about that sort of thing, so I'd guess that is/was a fairly
serious bug or misconfiguration.  Surely there must be other symptoms
for PostgreSQL on such systems too, like sequential scans that don't
see recently added pages.

But... does the proposed caching behaviour and "accurate" flag really
help with any of that?  Cached values come from lseek() anyway.  If we
just trusted unmodified smgrnblocks(), someone running on such a
forgetful file system might eventually see nasty errors because we
left buffers in the buffer pool that prevent a checkpoint from
completing (and panic?), but they might also see other really strange
errors, and that applies with or without that "accurate" flag, no?

[1] https://www.postgresql.org/message-id/flat/26202.1159032931%40sss.pgh.pa.us




Re: Implementing Incremental View Maintenance

2020-10-21 Thread Yugo NAGATA
Hi Adam Brusselback,

On Mon, 31 Dec 2018 11:20:11 -0500
Adam Brusselback  wrote:

> Hi all, just wanted to say  I am very happy to see progress made on this,
> my codebase has multiple "materialized tables" which are maintained with
> statement triggers (transition tables) and custom functions. They are ugly
> and a pain to maintain, but they work because I have no other
> solution...for now at least.

We want to find suitable use cases for the IVM patch being discussed in this
thread, and I remembered your post saying that you use statement triggers and
custom functions. We hope the patch will help you.

The patch implements immediate, that is, eager IVM. Materialized
views are updated immediately when their base tables are modified. While the view
is always up-to-date, there is an overhead on base table modification.

We would appreciate it if you could tell us what your use cases for materialized
views are and whether our implementation suits your needs.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: A new function to wait for the backend exit after termination

2020-10-21 Thread David G. Johnston
On Wednesday, October 21, 2020, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Thanks for the feedback.
>
> On Wed, Oct 21, 2020 at 8:01 PM David G. Johnston
>  wrote:
> >
> > On Wed, Oct 21, 2020 at 6:13 AM Magnus Hagander 
> wrote:
> >>
> >> I think it would be nicer to have a pg_terminate_backend(pid,
> wait=false), so a function with a second parameter which defaults to the
> current behaviour of not waiting. And it might be a good idea to also give
> it a timeout parameter?
> >
> > Agreed on the overload, and the timeouts make sense too - with the
> caller deciding whether a timeout results in a failure or a false return
> value.
> >
>
> If the backend is terminated within the user specified timeout then
> the function returns true, otherwise false.


I’m suggesting an option for the second case to fail instead of returning
false.


> >
> >>> 2. pg_wait_backend() -- which waits for a given backend process. Note
> that this function has to be used carefully after pg_terminate_backend(),
> if used on a backend that's not terminated it simply keeps waiting in a loop.
> >>
> >> It seems this one also very much would need a timeout value.
> >
> > Is there a requirement for waiting to be superuser only?  You are not
> affecting any session but your own during the waiting period.
> >
>
> IIUC, in the same patch instead of returning an error in case of
> non-superusers, do we need to wait for user provided timeout
> milliseconds until the current user becomes superuser and then throw
> error if still non-superuser, and proceed further if superuser?
>
> Do we need to have a new function that waits until a current
> non-superuser in a session becomes superuser?
>
> Something else?


Not sure how that would even be possible mid-statement.  I was suggesting
removing the superuser check altogether and letting any user execute “wait”.


> >
> > I could imagine, in theory at least, wanting to wait for a backend to go
> idle as well as for it disappearing.  Scope creep in terms of this patch's
> goal but worth at least considering now.
> >
>
> IIUC, do we need a new option, something like pg_wait_backend(pid,
> timeout, waituntil) where "waituntil" if specified "idle" waits until
> the given backend goes to idle mode, or "termination" waits until
> termination?
>
> If my understanding is wrong, could you please explain more?
>

Yes, this describes what i was thinking.

David J.


Re: Mop-up around psql's \connect behavior

2020-10-21 Thread Kyotaro Horiguchi
At Wed, 21 Oct 2020 18:59:04 -0400, Tom Lane  wrote in 
> While working on commit 85c54287a, I noticed a few things I did not
> much care for in do_connect().  These don't quite seem to rise to
> the level of back-patchable bugs, but they're still not great:
> 
> * The initial stanza that complains about
> 
>   if (!o_conn && (!dbname || !user || !host || !port))
> 
> seems woefully obsolete.  In the first place, it's pretty silly
> to equate a "complete connection specification" with having just
> those four values; the whole point of 85c54287a and predecessors
> is that other settings such as sslmode may be just as important.
> In the second place, this fails to consider the possibility that
> we only have a connstring parameter --- which may nonetheless
> provide all the required settings.  And in the third place,
> this clearly wasn't revisited when we added explicit control of
> whether or not we're supposed to re-use parameters from the old
> connection.  It's very silly to insist on having an o_conn if we're
> going to ignore it anyway.

Sounds reasonable.

> I think the reason we've not had complaints about this is that the
> situation normally doesn't arise in interactive sessions (since we
> won't release the old connection voluntarily), while scripts are
> likely not designed to cope with connection losses anyway.  These
> facts militate against spending a whole lot of effort on a fix,
> but still we ought to reduce the silliness factor.  What I propose
> is to complain if we have no o_conn *and* we are asked to re-use
> parameters from it.  Otherwise, it's fine.

The reason I haven't complained about this is that I don't reconnect by \c
after an involuntary disconnection. (That is, C-d then psql again :p) But
once it got on my mind, it might be strange that just \c or \c
-reuse-previous=y doesn't reconnect a broken session. It might be
better to reuse the previous connection parameters even if the
connection has been lost, but this would be another issue.

> * I really don't like the bit about silently ignoring user, host,
> and port parameters if we see that the first parameter is a connstring.
> That's as user-unfriendly as can be.  It should be a syntax error
> to specify both; the documentation certainly implies that it is.

+1

> * The old-style-syntax code path understands that it should re-use
> the old password (if any) when the user, host, and port settings
> haven't changed.  The connstring code path was too lazy to make
> that work, but now that we're deconstructing the connstring there's
> very little excuse for not having it act the same way.

+1 (I thought sslmode might affect but that is wrong since cert
authentication cannot be turned off from command line.)

> The attached patch fixes these things and documents the password
> behavior, which for some reason went unmentioned before.  Along
> the way I simplified the mechanism for re-using a password a bit;
> there's no reason to treat it so much differently from re-using
> other parameters.

Looks fine.

> Any objections?

Nope from me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Amit Kapila
On Thu, Oct 22, 2020 at 4:58 AM Peter Smith  wrote:
>
> On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila  wrote:
> >
> > On Wed, Oct 21, 2020 at 1:38 PM Peter Smith  wrote:
> > >
> > > The PG docs for PREPARE TRANSACTION [1] don't say anything about an
> > > empty (zero length) transaction-id.
> > > e.g. PREPARE TRANSACTION '';
> > > [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html
> > >
> > > ~
> > >
> > > Meanwhile, during testing I found the 2PC prepare hangs when an empty
> > > id is used.
> > >
> >
> > Can you please take an example to explain what you are trying to say?
>
> I was referring to an empty (zero length) transaction ID, not an empty
> transaction.
>

oh, I got it confused with the system generated 32-bit TransactionId.
But now, I got what you were referring to.

> The example was already given as PREPARE TRANSACTION '';
>
>
> Is that something that should be made to work for 2PC pub/sub, or was
> Postgres PREPARE TRANSACTION statement wrong to allow the user to
> specify an empty transaction ID in the first place?
>

I don't see any problem with the empty transaction identifier used in
Prepare Transaction. This is just used as an identifier to uniquely
identify the transaction. If you try to use an empty string ('') more
than once for Prepare Transaction, it will give an error like below:
postgres=*# prepare transaction '';
ERROR:  transaction identifier "" is already in use

So, I think this should work for pub/sub as well. Did you find out the
reason for the hang?

-- 
With Regards,
Amit Kapila.




Re: A new function to wait for the backend exit after termination

2020-10-21 Thread Bharath Rupireddy
Thanks for the feedback.

On Wed, Oct 21, 2020 at 8:01 PM David G. Johnston
 wrote:
>
> On Wed, Oct 21, 2020 at 6:13 AM Magnus Hagander  wrote:
>>
>> I think it would be nicer to have a pg_terminate_backend(pid, wait=false), 
>> so a function with a second parameter which defaults to the current 
>> behaviour of not waiting. And it might be a good idea to also give it a 
>> timeout parameter?
>
> Agreed on the overload, and the timeouts make sense too - with the caller 
> deciding whether a timeout results in a failure or a false return value.
>

If the backend is terminated within the user specified timeout then
the function returns true, otherwise false.

>
>>> 2. pg_wait_backend() -- which waits for a given backend process. Note that 
>>> this function has to be used carefully after pg_terminate_backend(), if 
>>> used on a backend that's not terminated it simply keeps waiting in a loop.
>>
>> It seems this one also very much would need a timeout value.
>
> Is there a requirement for waiting to be superuser only?  You are not 
> affecting any session but your own during the waiting period.
>

IIUC, in the same patch instead of returning an error in case of
non-superusers, do we need to wait for user provided timeout
milliseconds until the current user becomes superuser and then throw
error if still non-superuser, and proceed further if superuser?

Do we need to have a new function that waits until a current
non-superuser in a session becomes superuser?

Something else?

>
> I could imagine, in theory at least, wanting to wait for a backend to go idle 
> as well as for it disappearing.  Scope creep in terms of this patch's goal 
> but worth at least considering now.
>

IIUC, do we need a new option, something like pg_wait_backend(pid,
timeout, waituntil), where "waituntil", if specified as "idle", waits until
the given backend goes idle, and if specified as "termination", waits until
it terminates?

If my understanding is wrong, could you please explain more?




Re: A new function to wait for the backend exit after termination

2020-10-21 Thread Andres Freund
On 2020-10-21 15:13:36 +0200, Magnus Hagander wrote:
> It seems this one also very much would need a timeout value.

I'm not really against that, but I wonder if we just end up
reimplementing statement timeout...




Re: A new function to wait for the backend exit after termination

2020-10-21 Thread Bharath Rupireddy
Thanks for the feedback.

On Wed, Oct 21, 2020 at 6:43 PM Magnus Hagander  wrote:
>
>> Currently pg_terminate_backend(), sends SIGTERM to the backend process but 
>> doesn't ensure it's exit. There are chances that backends still are 
>> running(even after pg_terminate_backend() is called) until the interrupts 
>> are processed(using ProcessInterrupts()). This could cause problems 
>> especially in testing, for instance in a sql file right after 
>> pg_terminate_backend(), if any test case depends on the backend's 
>> non-existence[1], but the backend is not terminated. As discussed in [1], we 
>> have wait_pid()(see regress.c and sql/dblink.sql), but it's not usable 
>> across the system. In [1], we thought it would be better to have functions 
>> ensuring the backend's exit on the similar lines of pg_terminate_backend().
>>
>> I propose to have two functions:
>>
>> 1. pg_terminate_backend_and_wait() -- which sends SIGTERM to the backend and 
>> wait's until it's exit.
>
> I think it would be nicer to have a pg_terminate_backend(pid, wait=false), so 
> a function with a second parameter which defaults to the current behaviour of 
> not waiting. And it might be a good idea to also give it a timeout parameter?
>

+1 to having pg_terminate_backend(pid, wait=false, timeout), with the timeout in
milliseconds valid only if wait = true.

>
>> 2. pg_wait_backend() -- which waits for a given backend process. Note that 
>> this function has to be used carefully after pg_terminate_backend(), if used 
>> on a backend that's not terminated it simply keeps waiting in a loop.
>
> It seems this one also very much would need a timeout value.
>
> And surely we should show some sort of wait event when it's waiting.
>

Yes, for this function too we can have a timeout value:
pg_wait_backend(pid, timeout), with the timeout in milliseconds.

I think we can use WaitLatch with the given timeout and a new
wait event type WAIT_EVENT_BACKEND_SHUTDOWN instead of pg_usleep to
achieve the timeout mechanism. With WaitLatch we would also
get the wait event in the stats. Thoughts?

rc = WaitLatch(MyLatch,
   WL_LATCH_SET | WL_POSTMASTER_DEATH, timeout,
   WAIT_EVENT_BACKEND_SHUTDOWN);
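
For what it's worth, the wait loop could be sketched roughly as below
(wait_for_backend_exit() and WAIT_EVENT_BACKEND_SHUTDOWN are assumptions
for illustration, not existing code):

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/procarray.h"

static bool
wait_for_backend_exit(int pid, long timeout_ms)
{
	while (BackendPidGetProc(pid) != NULL)	/* backend still around? */
	{
		int			rc;

		if (timeout_ms <= 0)
			return false;	/* timed out; caller decides false vs. error */

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   100L,	/* poll every 100ms */
					   WAIT_EVENT_BACKEND_SHUTDOWN);
		if (rc & WL_LATCH_SET)
		{
			ResetLatch(MyLatch);
			CHECK_FOR_INTERRUPTS();
		}
		timeout_ms -= 100;
	}
	return true;			/* the backend has exited */
}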

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread k.jami...@fujitsu.com
On Thursday, October 22, 2020 10:34 AM, Tsunakawa-san wrote:
> > I have confirmed that the above comment (commenting out the lines in
> > RelationTruncate) solves the issue for non-recovery case.
> > The attached 0004 patch is just for non-recovery testing and is not
> > included in the final set of patches to be committed for vacuum
> optimization.
> 
> I'm relieved to hear that.
> 
> As for 0004:
> When testing TRUNCATE, remove the change to storage.c because it was
> intended to troubleshoot the VACUUM test.
I've removed it now.

> What's the change in bufmgr.c for?  Is it to be included in 0001 or 0002?

Right. But that should be in 0003. Fixed.

I also fixed the feedback from the previous email:
>(1)
>+   * as the total nblocks for a given fork. The cached value returned by
>
>nblocks -> blocks


> > The table below shows the vacuum execution time for non-recovery case.
> > I've also subtracted the execution time when VACUUM (truncate off) is set.
> >
> > [NON-RECOVERY CASE - VACUUM execution Time in seconds]
> (snip)
> > | 100GB | 65.456 | 1.795   | -3546.57% |
> 
> So, the full shared buffer scan for 10,000 relations took about as long as 63
> seconds (= 6.3 ms per relation).  It's nice to shorten this long time.
> 
> I'll review the patch soon.

Thank you very much for the reviews. Attached are the latest set of patches.

Regards,
Kirk Jamison


v27-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v27-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v27-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v27-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v27-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v27-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v27-0004-For-non-recovery-performance-test-case-purposes-.patch
Description:  v27-0004-For-non-recovery-performance-test-case-purposes-.patch


Re: Allow some recovery parameters to be changed with reload

2020-10-21 Thread Kyotaro Horiguchi
At Thu, 22 Oct 2020 01:59:07 +0300, Anastasia Lubennikova 
 wrote in 
> On 10.08.2020 23:20, Robert Haas wrote:
> > On Sun, Aug 9, 2020 at 1:21 AM Michael Paquier 
> > wrote:
> >> Sorry for the late reply.  I have been looking at that stuff again,
> >> and restore_command can be called in the context of a WAL sender
> >> process within the page_read callback of logical decoding via
> >> XLogReadDetermineTimeline(), as readTimeLineHistory() could look for a
> >> timeline history file.  So restore_command is not used only in the
> >> startup process.
> > Hmm, interesting. But, does that make this change wrong, apart from
> > the comments? Like, in the case of primary_conninfo, maybe some
> > confusion could result if the startup process decided whether to ask
> > for a WAL receiver based on thinking primary_conninfo being set, while
> > that process thought that it wasn't actually set after all, as
> > previously discussed in
> > http://postgr.es/m/ca+tgmozvmjx1+qtww2tsnphrnkwkzxc3zsrynfb-fpzm1ox...@mail.gmail.com
> > ... but what's the corresponding hazard here, exactly? It doesn't seem
> > that there's any way in which the decision one process makes affects
> > the decision the other process makes. There's still a race condition:
> > it's possible for a walsender
> Did you mean walreceiver here?

It's a logical walsender. restore_command is used within
logical_read_xlog_page() via XLogReadDetermineTimeline().

> >   to use the old restore_command after the
> > startup process had already used the new one, or the other way around.
> > However, it doesn't seem like that should confuse anything inside the
> > server, and therefore I'm not sure we need to code around it.
> I came up with following scenario. Let's say we have xlog files 1,2,3
> in dir1 and files 4,5 in dir2. If startup process had only handled
> files 1 and 2, before we switched restore_command from reading dir1 to
> reading dir2, it will fail to find next file. IIUC, it will assume
> that recovery is done, start server and walreceiver. The walreceiver
> will fail as well. I don't know, how realistic is this case, though.

That operation is somewhat bogus, if the server is not in standby
mode.  In standby mode, startup waits for the next segment safely.
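
For concreteness, the switch in that scenario amounts to a postgresql.conf
change like the following (paths are made up for illustration):

# before the reload: segments 1-3 are available here
restore_command = 'cp /archive/dir1/%f %p'
# after editing and reloading: the remaining segments live here instead
restore_command = 'cp /archive/dir2/%f %p'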

> In general,. this feature looks useful and consistent with previous
> changes, so I am interested in pushing it forward.

Agreed. The feature seems to work fine as long as we don't make a
change of restore_command that moves to another history.  Otherwise
recovery doesn't work correctly regardless of whether it is PGC_SIGHUP
or not.

> Sergey, could you please attach this thread to the upcoming CF, if
> you're going to continue working on it.
> 
>  A few more questions:
> - RestoreArchivedFile() is also used by pg_rewind. I don't see any
> - particular problem with it, just want to remind that we should test it
> - too.
> - How will it interact with possible future optimizations of archive
> - restore? For example, WAL prefetch [1].
> 
>  [1]
> https://www.postgresql.org/message-id/flat/601ee1f5-0b78-47e1-9aae-c15f74a1c...@postgrespro.ru

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: PostgresNode::backup uses spread checkpoint?

2020-10-21 Thread Michael Paquier
On Wed, Oct 21, 2020 at 07:55:18AM +0900, Michael Paquier wrote:
> Some nits: I would recommend to use the long option name, and list
> the option name and its value as two separate arguments of the
> command.

For the archives: this got applied as of 831611b.
--
Michael




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
The patch looks good except for one minor thing:

(1)
+* as the total nblocks for a given fork. The cached value returned by

nblocks -> blocks


Regards
Takayuki Tsunakawa





Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread Masahiro Ikeda

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as possible even if the both number
are important.


Also about performance, I thought there are few impacts because it
increments stats in memory. If I can implement to reuse pgWalUsage's
value which already collects these stats, there is no impact in
XLogInsertRecord.
For example, how about pg_stat_wal() calculates the accumulated
value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's
value?


I don't think that works, but it would work that pgstat_send_wal()
takes the difference of that values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi     - prevWalUsage.wal_fpi;

...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks for your advice. This code can avoid the performance impact on
the critical code path.


By the way, what do you think about adding these statistics to the pg_stat_wal
view?
I thought about removing the above statistics because there is advice that
PG13's features,
for example, the pg_stat_statements view, the vacuum log, and so on, can cover
the use cases.


regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




Re: Online verification of checksums

2020-10-21 Thread Michael Paquier
On Wed, Oct 21, 2020 at 07:10:34PM +0900, Michael Paquier wrote:
> My guess is that we should be able to make use of that for base
> backups as well, but I also think that I'd rather let v13 go with more
> retries without depending on a new API layer, removing of the LSN
> check altogether.  Thinking of it, that's actually roughly what I
> posted here, but without the PageGetLSN() bit in the refactored code.
> So I see a pretty good argument to address the stable branches with
> that, and study for the future a better API to govern them all:
> https://www.postgresql.org/message-id/20201020062432.ga30...@paquier.xyz

So, I was sleeping on this one, and I could not find a reason why we
should not address both the zero case and the random data case at the
same time, as mentioned here:
https://www.postgresql.org/message-id/20201022012519.gf1...@paquier.xyz

We cannot trust the fields of the page header because these may
have been messed up by some random corruption, so what really
matters is that the checksums don't match, and we can just rely
on that.  The zero-only case of a page is different because these
don't have a checksum set, so I would finish with something like the
attached to make the detection more robust.  This does not make the
detection perfect as there is no locking insurance (we really need to
do that but v13 has been released already), but with a sufficient
number of retries this can make things much more reliable than what's
present.

Are there any comments?  Anybody?
--
Michael
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 51b8f994ac..1a28ee1130 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -404,26 +404,33 @@ do { \
  *		extern declarations
  * 
  */
+
+/* flags for PageAddItemExtended() */
 #define PAI_OVERWRITE			(1 << 0)
 #define PAI_IS_HEAP(1 << 1)
 
+/* flags for PageIsVerifiedExtended() */
+#define PIV_LOG_WARNING			(1 << 0)
+#define PIV_REPORT_STAT			(1 << 1)
+
 #define PageAddItem(page, item, size, offsetNumber, overwrite, is_heap) \
 	PageAddItemExtended(page, item, size, offsetNumber, \
 		((overwrite) ? PAI_OVERWRITE : 0) | \
 		((is_heap) ? PAI_IS_HEAP : 0))
 
 /*
- * Check that BLCKSZ is a multiple of sizeof(size_t).  In PageIsVerified(),
- * it is much faster to check if a page is full of zeroes using the native
- * word size.  Note that this assertion is kept within a header to make
- * sure that StaticAssertDecl() works across various combinations of
- * platforms and compilers.
+ * Check that BLCKSZ is a multiple of sizeof(size_t).  In
+ * PageIsVerifiedExtended(), it is much faster to check if a page is
+ * full of zeroes using the native word size.  Note that this assertion
+ * is kept within a header to make sure that StaticAssertDecl() works
+ * across various combinations of platforms and compilers.
  */
 StaticAssertDecl(BLCKSZ == ((BLCKSZ / sizeof(size_t)) * sizeof(size_t)),
  "BLCKSZ has to be a multiple of sizeof(size_t)");
 
 extern void PageInit(Page page, Size pageSize, Size specialSize);
 extern bool PageIsVerified(Page page, BlockNumber blkno);
+extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags);
 extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size,
 		OffsetNumber offsetNumber, int flags);
 extern Page PageGetTempPage(Page page);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..d538f25726 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
 		smgrread(src, forkNum, blkno, buf.data);
 
-		if (!PageIsVerified(page, blkno))
+		if (!PageIsVerifiedExtended(page, blkno,
+	PIV_LOG_WARNING | PIV_REPORT_STAT))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATA_CORRUPTED),
 	 errmsg("invalid page in block %u of relation %s",
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..6f717c8956 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1569,21 +1569,24 @@ sendFile(const char *readfilename, const char *tarfilename,
 {
 	int			fd;
 	BlockNumber blkno = 0;
-	bool		block_retry = false;
+	int			block_attempts = 0;
 	char		buf[TAR_SEND_SIZE];
-	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
-	PageHeader	phdr;
 	int			segmentno = 0;
 	char	   *segmentpath;
 	bool		verify_checksum = false;
 	pg_checksum_context checksum_ctx;
 
+	/* Maximum number of checksum failures reported for this file */
+#define CHECKSUM_REPORT_THRESHOLD		5
+	/* maximum number of retries done for a single page */
+#define PAGE_RETRY_THRESHOLD			5
+
 	pg_checksum_init(_ctx, manifest->checksum_type);
 
 	fd = OpenTransientFile(readfilename, O_RDONLY | 

Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-21 Thread Masahiko Sawada
On Wed, 21 Oct 2020 at 18:33, tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > So what's your opinion?
>
> My opinion is simple and has not changed.  Let's clarify and refine the 
> design first in the following areas (others may have pointed out something 
> else too, but I don't remember), before going deeper into the code review.
>
> * FDW interface
> New functions so that other FDWs can really implement.  Currently, XA seems 
> to be the only model we can rely on to validate the FDW interface.
> What FDW function would call what XA function(s)?  What should be the 
> arguments for the FEW functions?

I guess that, since the FDW interfaces may be affected by the feature's
architecture, we can discuss them later.

> * Performance
> Parallel prepare and commits on the client backend.  The current 
> implementation is untolerable and should not be the first release quality.  I 
> proposed the idea.
> (If you insist you don't want to anything about this, I have to think you're 
> just rushing for the patch commit.  I want to keep Postgres's reputation.)

What do you have in mind regarding the implementation of parallel prepare
and commit? Given that some FDW plugins don't support asynchronous
execution, I guess we need to use parallel workers or something similar. That
is, the backend process launches parallel workers to
prepare/commit/rollback foreign transactions in parallel. I don't deny
this approach, but it'll definitely make the feature more complex and need
more code.

My point is to start small and keep the first version simple. Even
if we need one or more years for this feature, I think that
introducing simple, minimal functionality to core as the first version
still has benefits. We will have the
opportunity to get real feedback from users and to fix bugs in the
main infrastructure before making it complex. In this sense, the patch
having the backend return without waiting for resolution after the local
commit would be a good start as the first version (i.e., up to
applying the v26-0006 patch). Anyway, the architecture should be
extensible enough for future improvements.

For performance improvements, we will be able to support
asynchronous and/or parallel prepare/commit/rollback. Moreover, having multiple
resolver processes on one database would also help get better
throughput. A user who needs much better throughput
can also choose not to wait for resolution after the local commit,
like synchronous_commit = 'local' in replication.

> As part of this, I'd like to see the 2PC's message flow and disk writes (via 
> email and/or on the following wiki.)  That helps evaluate the 2PC 
> performance, because it's hard to figure it out in the code of a large patch 
> set.  I'm simply imagining what is typically written in database textbooks 
> and research papers.  I'm asking this because I saw some discussion in this 
> thread that some new WAL records are added.  I was worried that transactions 
> have to write WAL records other than prepare and commit unlike textbook 
> implementations.
>
> Atomic Commit of Distributed Transactions
> https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

Understood. I'll add an explanation about the message flow and disk
writes to the wiki page.

We need to consider the point of error handling during resolving
foreign transactions too.

>
> > I don’t think we need to stipulate the query cancellation. Anyway I
> > guess the facts neither that we don’t stipulate anything about query
> > cancellation now nor that postgres_fdw might not be cancellable in
> > some situations now are not a reason for not supporting query
> > cancellation. If it's a desirable behavior and users want it, we need
> > to put an effort to support it as much as possible like we’ve done in
> > postgres_fdw.  Some FDWs unfortunately might not be able to support it
> > only by their functionality but it would be good if we can achieve
> > that by combination of PostgreSQL and FDW plugins.
>
> Let me comment on this a bit; this is a bit dangerous idea, I'm afraid.  We 
> need to pay attention to the FDW interface and its documentation so that FDW 
> developers can implement what we consider important -- query cancellation in 
> your discussion.  "postgres_fdw is OK, so the interface is good" can create 
> interfaces that other FDW developers can't use.  That's what Tomas Vondra 
> pointed out several years ago.

I suspect the story is somewhat different. libpq fortunately supports
asynchronous execution, but when it comes to canceling the foreign
transaction resolution I think basically all FDW plugins are in the
same situation at this time. We can choose whether to make it
cancellable or not. According to the discussion so far, it completely
depends on the architecture of this feature. So my point is whether
it's worth having this functionality for users and whether users want
it, not whether postgres_fdw is OK.
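
Just to illustrate what libpq already gives us here, a rough sketch of a
cancellable, asynchronous resolution call (this is not postgres_fdw's or
the patch's actual code; the helper and the cancel_requested flag are
hypothetical, and quoting/escaping of the GID is omitted):

#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include "libpq-fe.h"

/* Sketch only: resolve one foreign prepared transaction, cancellably. */
static bool
resolve_one_fdw_xact(PGconn *conn, const char *gid,
					 volatile sig_atomic_t *cancel_requested)
{
	char		sql[256];
	PGresult   *res;
	bool		ok = true;
	bool		sent_cancel = false;

	snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
	if (!PQsendQuery(conn, sql))	/* asynchronous submission */
		return false;

	while (PQisBusy(conn))
	{
		/* Real code would wait on the socket here instead of spinning. */
		if (*cancel_requested && !sent_cancel)
		{
			char		errbuf[256];
			PGcancel   *cancel = PQgetCancel(conn);

			if (cancel != NULL)
			{
				PQcancel(cancel, errbuf, sizeof(errbuf));
				PQfreeCancel(cancel);
			}
			sent_cancel = true;
		}
		if (!PQconsumeInput(conn))
		{
			ok = false;
			break;
		}
	}

	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			ok = false;			/* cancelled or failed */
		PQclear(res);
	}
	return ok;
}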

Regards,

-- 
Masahiko Sawada

Re: Use list_delete_xxxcell O(1) instead of list_delete_ptr O(N) in some places

2020-10-21 Thread David Rowley
On Fri, 16 Oct 2020 at 16:42, Hou, Zhijie  wrote:
> And after checking the code again and I found two more places which can be 
> improved.
>
> 1.
> --- a/src/backend/parser/parse_expr.c
> +++ b/src/backend/parser/parse_expr.c
> @@ -1702,7 +1702,7 @@ transformMultiAssignRef(ParseState *pstate, 
> MultiAssignRef *maref)
>  */
> if (maref->colno == maref->ncolumns)
> pstate->p_multiassign_exprs =
> -   list_delete_ptr(pstate->p_multiassign_exprs, 
> tle);
> +   list_delete_last(pstate->p_multiassign_exprs);
>
> Based on the logic above in function transformMultiAssignRef,
> I found 'tle' is always the last one in list ' pstate->p_multiassign_exprs ' ,
> So list_delete_last seems can be used here.


Yeah. After a bit of looking I agree.  There's a similar assumption
there already with:

/*
* Second or later column in a multiassignment.  Re-fetch the
* transformed SubLink or RowExpr, which we assume is still the last
* entry in p_multiassign_exprs.
*/
Assert(pstate->p_multiassign_exprs != NIL);
tle = (TargetEntry *) llast(pstate->p_multiassign_exprs);

> 2.
>
> +   nameEl_idx = foreach_current_index(option);
> }
> }
>
> @@ -405,7 +407,7 @@ generateSerialExtraStmts(CreateStmtContext *cxt, 
> ColumnDef *column,
> }
> sname = rv->relname;
> /* Remove the SEQUENCE NAME item from seqoptions */
> -   seqoptions = list_delete_ptr(seqoptions, nameEl);
> +   seqoptions = list_delete_nth_cell(seqoptions, nameEl_idx);
>
> Add a new var ' nameEl_idx ' to catch the index.

Yeah. That looks fine too.

Pushed.

David




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
> As for 0004:
> When testing TRUNCATE, remove the change to storage.c because it was
> intended to troubleshoot the VACUUM test.

I meant vacuum.c.  Sorry.


Regards
Takayuki Tsunakawa





Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread Masahiro Ikeda

On 2020-10-21 15:54, lchch1...@sina.cn wrote:

I think it's really a more convenient way to collect WAL usage
information; with it we can query whenever we want. Several points on
my side:


Thanks for your comments.



1. It will be nice if you provide a way to reset the information in
WalStats, so that we can reset it without restarting the database.


I agree.

Now, pg_stat_wal supports resetting all information in WalStats
using the pg_stat_reset_shared('wal') function.

Isn't it enough?



2. I think 'wal_write_backend' means the number of WAL writes that
happen in backends. The description in the document is not so clear;
I suggest rewriting it.


OK, I'll rewrite to "Total number of times backends write WAL data to 
the disk".




3. I do not think the description of 'wal_buffers_full' in the
document is correct.


Where should I rewrite the description? If my understanding is not 
correct, please let me know.




4. It is quite strange to collect xl_tot_len, m_wal_records, and
m_wal_fpi twice in XLogInsertRecord().


Yes, Amit-san pointed that out to me too.
I'll remove them from pg_stat_wal since pg_stat_statements and the
vacuum log already show the related statistics, and there is a comment
that this is enough.

Anyway, if you need the above statistics accumulated in pg_stat_wal,
please let me know.



5. I notice some code in the issue_xlog_fsync() function that collects
sync info; a standby may call issue_xlog_fsync() too, where we do not
want to collect this info. I think this needs some change to avoid
running on the standby side.


Thanks, I will fix it.

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> I have confirmed that the above comment (commenting out the lines in
> RelationTruncate) solves the issue for non-recovery case.
> The attached 0004 patch is just for non-recovery testing and is not included 
> in
> the final set of patches to be committed for vacuum optimization.

I'm relieved to hear that.

As for 0004:
When testing TRUNCATE, remove the change to storage.c because it was intended 
to troubleshoot the VACUUM test.
What's the change in bufmgr.c for?  Is it to be included in 0001 or 0002?


> The table below shows the vacuum execution time for non-recovery case.
> I've also subtracted the execution time when VACUUM (truncate off) is set.
> 
> [NON-RECOVERY CASE - VACUUM execution Time in seconds]
(snip)
> | 100GB | 65.456 | 1.795   | -3546.57% |

So, the full shared buffer scan for 10,000 relations took about as long as 63 
seconds (= 6.3 ms per relation).  It's nice to shorten this long time.

I'll review the patch soon.


Regards
Takayuki Tsunakawa






Re: Is Recovery actually paused?

2020-10-21 Thread Kyotaro Horiguchi
At Wed, 21 Oct 2020 11:14:24 -0400, Robert Haas  wrote 
in 
> On Wed, Oct 21, 2020 at 7:16 AM Dilip Kumar  wrote:
> > One idea could be, if the recovery process is waiting for WAL and a
> > recovery pause is requested then we can assume that the recovery is
> > paused because before processing the next wal it will always check
> > whether the recovery pause is requested or not.
..
> However, it might be better to implement this by having the system
> absorb the pause immediately when it's in this state, rather than
> trying to detect this state and treat it specially.

The paused state is shown in pg_stat_activity.wait_event and it is
strange that pg_is_wal_replay_paused() is inconsistent with the
column. To make them consistent, we need to call recoveryPausesHere()
at the end of WaitForWALToBecomeAvailable() and let
pg_wal_replay_pause() call WakeupRecovery().
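
To sketch the idea (illustration only, reusing the existing xlog.c names;
this is not a concrete patch):

	/* at the bottom of the wait loop in WaitForWALToBecomeAvailable() */
	if (((volatile XLogCtlData *) XLogCtl)->recoveryPause)
		recoveryPausesHere();

	/* and in pg_wal_replay_pause(), after setting the flag, poke the
	 * startup process so the request is absorbed promptly */
	SetRecoveryPause(true);
	WakeupRecovery();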

I think we don't need a separate function to find the state.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [patch] Fix checksum verification in base backups for zero page headers

2020-10-21 Thread Michael Paquier
On Thu, Oct 22, 2020 at 12:47:03AM +0300, Anastasia Lubennikova wrote:
> Thank you. I always forget about this. Do we have any checklist for such
> changes, that patch authors and reviewers can use?

Not really.  That's more a habit than anything else: any non-static
routine that we publish could be used by some out-of-core code, so
maintaining pure API compatibility on stable branches is essential.

> We can also read such pages via shared buffers to be 100% sure.

Yeah, but this has its limits as well.  One can use
ignore_checksum_failure, but this can actually be very dangerous, as
you can end up loading into shared buffers a page whose header looks
sane but whose contents are largely random garbage, spreading
corruption around and leading to more fancy logic failures in
Postgres, with more panic from customers.  Not using
ignore_checksum_failure is safer, but it makes analyzing the damage to
a given file harder, as a seqscan would stop at the first failure in
that file.  pg_prewarm can help here, but doing that is not really the
goal of that tool either.

We definitely need a different approach that guarantees that a page is
correctly locked with no concurrent I/O while checked on retry, and I
am looking at that for the SQL-level check.  That's not something I
would do in v13 though, but we can make the existing logic much more
reliable with a set of fixed retries and some sleeps in-between.  A
maximum of 5 retries with 100ms in between seems like a good balance
from here, but of course that's not an exact science, given hardware
latency differences.
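
To make the shape of that concrete, something along these lines (a sketch
only, reusing PageIsVerifiedExtended() from the attached patch; the
constants and the helper name are made up):

#define CHECKSUM_READ_RETRIES	5
#define CHECKSUM_RETRY_MS		100

static bool
read_and_verify_page(int fd, BlockNumber blkno, char *buf)
{
	for (int attempt = 0; attempt < CHECKSUM_READ_RETRIES; attempt++)
	{
		if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) == BLCKSZ &&
			PageIsVerifiedExtended((Page) buf, blkno, 0))
			return true;		/* page looks sane */

		/* give any concurrent write a chance to complete */
		pg_usleep(CHECKSUM_RETRY_MS * 1000L);
	}
	return false;				/* count it as a checksum failure */
}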

This has been a sensitive topic during the stability period of v13
with many committers commenting on the issue, so I'd rather be very
careful that there are no objections to what's published here, and in
consequence I am going to ping the other thread on the matter.  For
now I have the attached to address the zero-case and the random LSN
case.
--
Michael
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index 51b8f994ac..1a28ee1130 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -404,26 +404,33 @@ do { \
  *		extern declarations
  * 
  */
+
+/* flags for PageAddItemExtended() */
 #define PAI_OVERWRITE			(1 << 0)
 #define PAI_IS_HEAP				(1 << 1)
 
+/* flags for PageIsVerifiedExtended() */
+#define PIV_LOG_WARNING			(1 << 0)
+#define PIV_REPORT_STAT			(1 << 1)
+
 #define PageAddItem(page, item, size, offsetNumber, overwrite, is_heap) \
 	PageAddItemExtended(page, item, size, offsetNumber, \
 		((overwrite) ? PAI_OVERWRITE : 0) | \
 		((is_heap) ? PAI_IS_HEAP : 0))
 
 /*
- * Check that BLCKSZ is a multiple of sizeof(size_t).  In PageIsVerified(),
- * it is much faster to check if a page is full of zeroes using the native
- * word size.  Note that this assertion is kept within a header to make
- * sure that StaticAssertDecl() works across various combinations of
- * platforms and compilers.
+ * Check that BLCKSZ is a multiple of sizeof(size_t).  In
+ * PageIsVerifiedExtended(), it is much faster to check if a page is
+ * full of zeroes using the native word size.  Note that this assertion
+ * is kept within a header to make sure that StaticAssertDecl() works
+ * across various combinations of platforms and compilers.
  */
 StaticAssertDecl(BLCKSZ == ((BLCKSZ / sizeof(size_t)) * sizeof(size_t)),
  "BLCKSZ has to be a multiple of sizeof(size_t)");
 
 extern void PageInit(Page page, Size pageSize, Size specialSize);
 extern bool PageIsVerified(Page page, BlockNumber blkno);
+extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags);
 extern OffsetNumber PageAddItemExtended(Page page, Item item, Size size,
 		OffsetNumber offsetNumber, int flags);
 extern Page PageGetTempPage(Page page);
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index dbbd3aa31f..d538f25726 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -443,7 +443,8 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 
 		smgrread(src, forkNum, blkno, buf.data);
 
-		if (!PageIsVerified(page, blkno))
+		if (!PageIsVerifiedExtended(page, blkno,
+	PIV_LOG_WARNING | PIV_REPORT_STAT))
 			ereport(ERROR,
 	(errcode(ERRCODE_DATA_CORRUPTED),
 	 errmsg("invalid page in block %u of relation %s",
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index b89df01fa7..6f717c8956 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -1569,21 +1569,24 @@ sendFile(const char *readfilename, const char *tarfilename,
 {
 	int			fd;
 	BlockNumber blkno = 0;
-	bool		block_retry = false;
+	int			block_attempts = 0;
 	char		buf[TAR_SEND_SIZE];
-	uint16		checksum;
 	int			checksum_failures = 0;
 	off_t		cnt;
 	int			i;
 	pgoff_t		len = 0;
 	char	   *page;
 	size_t		pad;
-	

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread Masahiro Ikeda

On 2020-10-21 13:41, Amit Kapila wrote:

On Tue, Oct 20, 2020 at 12:41 PM Masahiro Ikeda
 wrote:


On 2020-10-20 12:46, Amit Kapila wrote:
> On Tue, Oct 20, 2020 at 8:01 AM Masahiro Ikeda
>> 1. Basic statistics of WAL activity
>>
>> - wal_records: Total number of WAL records generated
>> - wal_fpi: Total number of WAL full page images generated
>> - wal_bytes: Total amount of WAL bytes generated
>>
>> To understand DB's performance, first, we will check the performance
>> trends for the entire database instance.
>> For example, if the number of wal_fpi becomes higher, users may tune
>> "wal_compression", "checkpoint_timeout" and so on.
>>
>> Although users can check the above statistics via EXPLAIN,
>> auto_explain,
>> autovacuum and pg_stat_statements now,
>> if users want to see the performance trends  for the entire database,
>> they must recalculate the statistics.
>>
>
> Here, do you mean to say 'entire cluster' instead of 'entire database'
> because it seems these stats are getting collected for the entire
> cluster?

Thanks for your comments.
Yes, I wanted to say 'entire cluster'.

>> I think it is useful to add the sum of the basic statistics.
>>
>
> There is an argument that it is better to view these stats at the
> statement-level so that one can know which statements are causing most
> WAL and then try to rate-limit them if required in the application and
> anyway they can get the aggregate of all the WAL if they want. We have
> added these stats in PG-13, so do we have any evidence that the
> already added stats don't provide enough information? I understand
> that you are trying to display the accumulated stats here which if
> required users/DBA need to compute with the currently provided stats.
> OTOH, sometimes adding more ways to do some things causes difficulty
> for users to understand and learn.

I agreed that the statement-level stat is important and I understood
that we can know the aggregated WAL stats from the pg_stat_statements
view and autovacuum's log.
But now, WAL stats generated by autovacuum can be output to logs and
it is not easy to aggregate them. Since WAL writes impact the entire
cluster, I thought it's natural to provide an accumulated value.



I think it is the other way around, i.e., if we have accumulated stats
then it makes sense to provide those at the statement level, because
one would like to know the exact cause of more WAL activity. Say it is
due to an autovacuum or due to a particular set of statements; then it
would be easier for users to do something about it.


OK, I'll remove them.
Do you have any comments on the other statistics?

--
Masahiro Ikeda
NTT DATA CORPORATION




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread k.jami...@fujitsu.com
On Wednesday, October 21, 2020 4:37 PM, Tsunakawa-san wrote:
> RelationTruncate() invalidates the cached fork sizes as follows.  This causes
> smgrnblocks() to return accurate=false, resulting in the optimization not
> running.  Try commenting these lines out for the non-recovery case.
> 
> /*
>  * Make sure smgr_targblock etc aren't pointing somewhere past new
> end
>  */
> rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
> for (int i = 0; i <= MAX_FORKNUM; ++i)
> rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;

Hello, I have updated the set of patches, which incorporate all your feedback
from the previous email.
Thank you for also looking into it. The patch 0003 (the DropRelFileNodeBuffers
improvement) is indeed for the vacuum optimization and not for truncate.
I'll post a separate patch for the truncate optimization in the coming days.
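
(For context, the check that 0002/0003 rely on is roughly the following;
this is a sketch of the idea only, and the exact signature in the patches
may differ.)

	bool		accurate;
	BlockNumber nblocks;

	/* cached relation size; "accurate" says whether we may rely on it */
	nblocks = smgrnblocks(smgr_reln, forkNum, &accurate);

	if (accurate)
	{
		/* scan only the buffers of blocks 0 .. nblocks - 1 */
	}
	else
	{
		/* fall back to scanning the whole of shared buffers */
	}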

1. Vacuum Optimization
I have confirmed that the above comment (commenting out the lines in 
RelationTruncate)
solves the issue for non-recovery case.
The attached 0004 patch is just for non-recovery testing and is not included in 
the
final set of patches to be committed for vacuum optimization.

The table below shows the vacuum execution time for non-recovery case.
I've also subtracted the execution time when VACUUM (truncate off) is set.

[NON-RECOVERY CASE - VACUUM execution Time in seconds]

| s_b   | master | patched | %reg  | 
|---||-|---| 
| 128MB | 0.22   | 0.181   | -21.55%   | 
| 1GB   | 0.701  | 0.712   | 1.54% | 
| 20GB  | 15.027 | 1.920   | -682.66%  | 
| 100GB | 65.456 | 1.795   | -3546.57% |

[RECOVERY CASE, VACUUM execution + failover]
I made a mistake in my writing of the previous email [1].
The DELETE FROM was executed before pausing the WAL replay on the standby.
In short, the procedure and results were correct. But I repeated the
performance measurement just in case. The results are still great and
almost the same as the previous measurement.

| s_b   | master | patched | %reg   | 
|---||-|| 
| 128MB | 3.043  | 3.009   | -1.13% | 
| 1GB   | 3.417  | 3.410   | -0.21% | 
| 20GB  | 20.597 | 2.410   | -755%  | 
| 100GB | 65.734 | 2.409   | -2629% |

Based on the results above, with the patches applied,
the performance for both the recovery and non-recovery cases was relatively close.
For default and small shared_buffers settings (128MB, 1GB), the performance is
roughly the same as master. But we see the benefit when we have a large
shared_buffers setting.

I've tested using the same test case I indicated in the previous email,
including the following additional settings:
vacuum_cost_delay = 0
vacuum_cost_limit = 1

That's it for the vacuum optimization. Feedback and comments would be highly 
appreciated.

2. Truncate Optimization
I'll post a separate patch in the future for the truncate optimization, which
modifies DropRelFileNodesAllBuffers and related functions along the truncate
path.

Thank you.

Regards,
Kirk Jamison

[1] 
https://www.postgresql.org/message-id/OSBPR01MB2341672E9A95E5EC6D2E79B5EF020%40OSBPR01MB2341.jpnprd01.prod.outlook.com


v26-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v26-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v26-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch
Description:  v26-0002-Add-bool-param-in-smgrnblocks-for-cached-blocks.patch


v26-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch
Description:  v26-0003-Optimize-DropRelFileNodeBuffers-during-recovery.patch


v26-0004-For-non-recovery-performance-test-case-purposes-.patch
Description:  v26-0004-For-non-recovery-performance-test-case-purposes-.patch


Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Peter Smith
On Wed, Oct 21, 2020 at 7:42 PM Amit Kapila  wrote:
>
> On Wed, Oct 21, 2020 at 1:38 PM Peter Smith  wrote:
> >
> > The PG docs for PREPARE TRANSACTION [1] don't say anything about an
> > empty (zero length) transaction-id.
> > e.g. PREPARE TRANSACTION '';
> > [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html
> >
> > ~
> >
> > Meanwhile, during testing I found the 2PC prepare hangs when an empty
> > id is used.
> >
>
> Can you please take an example to explain what you are trying to say?

I was referring to an empty (zero length) transaction ID, not an empty
transaction.

The example was already given as PREPARE TRANSACTION '';

A longer example from my regress test is shown below. Using 2PC
pub/sub this will currently hang:


# 
# Test using empty GID
# 
# check that 2PC gets replicated to subscriber
$node_publisher->safe_psql('postgres',
"BEGIN;INSERT INTO tab_full VALUES (51);PREPARE TRANSACTION '';");
$node_publisher->poll_query_until('postgres', $caughtup_query)
  or die "Timed out while waiting for subscriber to catch up";
# check that transaction is in prepared state on subscriber
$result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts where gid = '';");
   is($result, qq(1), 'transaction is prepared on subscriber');
# ROLLBACK
$node_publisher->safe_psql('postgres',
"ROLLBACK PREPARED '';");
# check that 2PC gets aborted on subscriber
$node_publisher->poll_query_until('postgres', $caughtup_query)
  or die "Timed out while waiting for subscriber to catch up";
$result =
   $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM
pg_prepared_xacts where gid = '';");
   is($result, qq(0), 'transaction is aborted on subscriber');

~

Is that something that should be made to work for 2PC pub/sub, or was
the Postgres PREPARE TRANSACTION statement wrong to allow the user to
specify an empty transaction ID in the first place?

Kind Regards
Peter Smith.
Fujitsu Australia.




Re: Mop-up around psql's \connect behavior

2020-10-21 Thread Chapman Flack
On 10/21/20 18:59, Tom Lane wrote:

> I think the reason we've not had complaints about this is that the
> situation normally doesn't arise in interactive sessions (since we
> won't release the old connection voluntarily), while scripts are
> likely not designed to cope with connection losses anyway.  These
> facts militate against spending a whole lot of effort on a fix,
> but still we ought to reduce the silliness factor.  What I propose
> is to complain if we have no o_conn *and* we are asked to re-use
> parameters from it.  Otherwise, it's fine.

I've been getting around it just by saying

  \c "connstring" . . .

which works. It gives me a tiny thrill every time I do it, like I'm
getting away with something. Which is why I haven't been complaining.

I suppose I wouldn't complain if it were fixed, either.

Regards,
-Chap




Re: Allow some recovery parameters to be changed with reload

2020-10-21 Thread Anastasia Lubennikova

On 10.08.2020 23:20, Robert Haas wrote:

On Sun, Aug 9, 2020 at 1:21 AM Michael Paquier  wrote:

Sorry for the late reply.  I have been looking at that stuff again,
and restore_command can be called in the context of a WAL sender
process within the page_read callback of logical decoding via
XLogReadDetermineTimeline(), as readTimeLineHistory() could look for a
timeline history file.  So restore_command is not used only in the
startup process.

Hmm, interesting. But, does that make this change wrong, apart from
the comments? Like, in the case of primary_conninfo, maybe some
confusion could result if the startup process decided whether to ask
for a WAL receiver based on thinking primary_conninfo being set, while
that process thought that it wasn't actually set after all, as
previously discussed in
http://postgr.es/m/ca+tgmozvmjx1+qtww2tsnphrnkwkzxc3zsrynfb-fpzm1ox...@mail.gmail.com
... but what's the corresponding hazard here, exactly? It doesn't seem
that there's any way in which the decision one process makes affects
the decision the other process makes. There's still a race condition:
it's possible for a walsender

Did you mean walreceiver here?

  to use the old restore_command after the
startup process had already used the new one, or the other way around.
However, it doesn't seem like that should confuse anything inside the
server, and therefore I'm not sure we need to code around it.
I came up with the following scenario. Let's say we have xlog files 1, 2, 3 in
dir1 and files 4, 5 in dir2. If the startup process has only handled files 1
and 2 before we switch restore_command from reading dir1 to reading
dir2, it will fail to find the next file. IIUC, it will assume that recovery
is done and start the server and the walreceiver. The walreceiver will fail as
well. I don't know how realistic this case is, though.


In general, this feature looks useful and consistent with previous
changes, so I am interested in pushing it forward.
Sergey, could you please attach this thread to the upcoming CF if
you're going to continue working on it?


 A few more questions:
- RestoreArchivedFile() is also used by pg_rewind. I don't see any
particular problem with it; I just want to note that we should test it too.
- How will it interact with possible future optimizations of archive
restore? For example, WAL prefetch [1].


 [1] 
https://www.postgresql.org/message-id/flat/601ee1f5-0b78-47e1-9aae-c15f74a1c...@postgrespro.ru


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Mop-up around psql's \connect behavior

2020-10-21 Thread Tom Lane
While working on commit 85c54287a, I noticed a few things I did not
much care for in do_connect().  These don't quite seem to rise to
the level of back-patchable bugs, but they're still not great:

* The initial stanza that complains about

if (!o_conn && (!dbname || !user || !host || !port))

seems woefully obsolete.  In the first place, it's pretty silly
to equate a "complete connection specification" with having just
those four values; the whole point of 85c54287a and predecessors
is that other settings such as sslmode may be just as important.
In the second place, this fails to consider the possibility that
we only have a connstring parameter --- which may nonetheless
provide all the required settings.  And in the third place,
this clearly wasn't revisited when we added explicit control of
whether or not we're supposed to re-use parameters from the old
connection.  It's very silly to insist on having an o_conn if we're
going to ignore it anyway.

I think the reason we've not had complaints about this is that the
situation normally doesn't arise in interactive sessions (since we
won't release the old connection voluntarily), while scripts are
likely not designed to cope with connection losses anyway.  These
facts militate against spending a whole lot of effort on a fix,
but still we ought to reduce the silliness factor.  What I propose
is to complain if we have no o_conn *and* we are asked to re-use
parameters from it.  Otherwise, it's fine.

* I really don't like the bit about silently ignoring user, host,
and port parameters if we see that the first parameter is a connstring.
That's as user-unfriendly as can be.  It should be a syntax error
to specify both; the documentation certainly implies that it is.

* The old-style-syntax code path understands that it should re-use
the old password (if any) when the user, host, and port settings
haven't changed.  The connstring code path was too lazy to make
that work, but now that we're deconstructing the connstring there's
very little excuse for not having it act the same way.

The attached patch fixes these things and documents the password
behavior, which for some reason went unmentioned before.  Along
the way I simplified the mechanism for re-using a password a bit;
there's no reason to treat it so much differently from re-using
other parameters.

Any objections?

regards, tom lane

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 7d0d361657..f6f77dbac3 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -920,6 +920,8 @@ testdb=
 is changed from its previous value using the positional syntax,
 any hostaddr setting present in the
 existing connection's parameters is dropped.
+Also, any password used for the existing connection will be re-used
+only if the user, host, and port settings are not changed.
 When the command neither specifies nor reuses a particular parameter,
 the libpq default is used.
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index ba3ea39aaa..39a460d85c 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -3014,11 +3014,10 @@ param_is_newly_set(const char *old_val, const char *new_val)
 /*
  * do_connect -- handler for \connect
  *
- * Connects to a database with given parameters. Absent an established
- * connection, all parameters are required. Given -reuse-previous=off or a
- * connection string without -reuse-previous=on, NULL values will pass through
- * to PQconnectdbParams(), so the libpq defaults will be used. Otherwise, NULL
- * values will be replaced with the ones in the current connection.
+ * Connects to a database with given parameters.  If we are told to re-use
+ * parameters, parameters from the previous connection are used where the
+ * command's own options do not supply a value.  Otherwise, libpq defaults
+ * are used.
  *
  * In interactive mode, if connection fails with the given parameters,
  * the old connection will be kept.
@@ -3038,20 +3037,16 @@ do_connect(enum trivalue reuse_previous_specification,
 	bool		has_connection_string;
 	bool		reuse_previous;
 
-	if (!o_conn && (!dbname || !user || !host || !port))
+	has_connection_string = dbname ?
+		recognized_connection_string(dbname) : false;
+
+	/* Complain if we have additional arguments after a connection string. */
+	if (has_connection_string && (user || host || port))
 	{
-		/*
-		 * We don't know the supplied connection parameters and don't want to
-		 * connect to the wrong database by using defaults, so require all
-		 * parameters to be specified.
-		 */
-		pg_log_error("All connection parameters must be supplied because no "
-	 "database connection exists");
+		pg_log_error("Do not give user, host, or port separately when using a connection string");
 		return false;
 	}
 
-	has_connection_string = dbname ?
-		recognized_connection_string(dbname) : false;
 	

Re: [DOC] Document concurrent index builds waiting on each other

2020-10-21 Thread David G. Johnston
On Wed, Oct 21, 2020 at 3:25 PM David G. Johnston <
david.g.johns...@gmail.com> wrote:

>
> v3-0002 needs a rebase over the create_index.sgml page due to the change
> of the nearby xref to link.  Attached as v4-0002 along with the original
> v3-0001.
>
>
attached...

Reading the commit message on 0002: vacuum isn't a transaction-taking
command, so it wouldn't interfere with itself; create index does use
transactions, and thus it's not surprising that it interferes with vacuum,
which looks at transactions, not commands (as most of the internals would,
I'd presume).

David J.


v3-0001-Document-concurrent-indexes-waiting-on-each-other.patch
Description: Binary data


v4-0002-Document-vacuum-on-one-table-depending-on-concurr.patch
Description: Binary data


Re: [DOC] Document concurrent index builds waiting on each other

2020-10-21 Thread David G. Johnston
On Wed, Sep 30, 2020 at 2:10 AM Michael Paquier  wrote:

> On Tue, Sep 08, 2020 at 01:25:21PM -0400, James Coleman wrote:
> > Álvaro's patch confused the current state of this thread, so I'm
> > reattaching (rebased) v2 as v3.
>
> +  
> +   CREATE INDEX (including the
> CONCURRENTLY
> +   option) commands are included when VACUUM
> calculates what
> +   dead tuples are safe to remove even on tables other than the one being
> indexed.
> +  
> FWIW, this is true as well for REINDEX CONCURRENTLY because both use
> the same code paths for index builds and validation, with basically
> the same waiting phases.  But is CREATE INDEX the correct place for
> that?  Wouldn't it be better to tell about such things on the VACUUM
> doc?
>
> 0001 sounds fine to me.
>
>
v3-0002 needs a rebase over the create_index.sgml page due to the change of
the nearby xref to link.  Attached as v4-0002 along with the original
v3-0001.

I resisted the temptation to commit my word-smithing thoughts to the
affected paragraph.  The word "phase" appearing out of nowhere struck me as
a bit odd.  "Then finally the" feels like it is missing a couple of commas
- or just drop the finally.  "then two table scans occur in separate
transactions" reads better than "two more transactions" IMO.

For 0002 maybe focus on the fact that CREATE INDEX is a global concern even
though it only names a single table in any one invocation.  As a
consequence, while it is running, vacuum cannot bring the system's oldest
xid more current than the oldest xid on any index-in-progress table (I
don't know exactly how this works).  And, rehashing 0001, all concurrent
indexing will finish at the same time.

In short maybe focus less on procedure and specific waiting states and more
on the user-visible consequences.  0001 didn't really clear things up much
in that regard.  It reads like we are introducing a deadlock situation even
though that evidently is not the case.

I concur that vacuum's perspective on the create index global reach needs
to be addressed there if it is not already.



I'm a bit confused as to why/whether create index transactions are somehow
special in this regard, compared to other transactions.  I infer from the
existence of 0002 that they somehow are...

My conclusion thus far is that with respect to the original complaint:

On 2019-09-18 13:51:00 -0400, James Coleman wrote:
> In my experience it's not immediately obvious (even after reading the
> documentation) the implications of how concurrent index builds manage
> transactions with respect to multiple concurrent index builds in
> flight at the same time.

These two limited scope patches have not materially moved the needle in
understanding.  They are too technical when the underlying issue is
comprehension by non-technical people in terms of how they see their system
behave.

David J.


Re: [patch] Fix checksum verification in base backups for zero page headers

2020-10-21 Thread Anastasia Lubennikova

On 20.10.2020 09:24, Michael Paquier wrote:

On Fri, Oct 16, 2020 at 06:02:52PM +0300, Anastasia Lubennikova wrote:

In the current patch, PageIsVerified called from pg_basebackup simply doesn't
Should we change this too? I don't see any difference.

I considered that, but now that does not seem worth bothering here.


Done.

Thanks for the updated patch.  I have played with dd and some fake
files to check the validity of the zero-case (as long as pd_lsn does
not go wild).  Here are some comments, with an updated patch attached:
- Added a PageIsVerifiedExtended to make this patch back-patchable,
with the original routine keeping its original shape.
Thank you. I always forget about this. Do we have any checklist for such 
changes, that patch authors and reviewers can use?

- I did not like much to show the WARNING from PageIsVerified() for
the report, and we'd lose some context related to the base backups.
The important part is to know which blocks from which files are found
as problematic.
- Switched the checks to have PageIsVerified() placed first in the
hierarchy, so as we do all the basic validity checks before a look at
the LSN.  This is not really change things logically.
- The meaning of block_retry is also the opposite of what it should be
in the original code?  IMO, the flag should be set to true if we still
are fine to retry a block, and set it to false once the one-time retry
has failed.


Agree.


- The error strings are not really consistent with the project style
in this area.  These are usually not spanned across multiple lines to
ease searches with grep or such.
Same question as above. I don't see this info about formatting in the 
error message style guide in documentation. Is it mentioned somewhere else?

Anastasia, Michael B, does that look OK to you?

The final patch looks good to me.

NB: This patch fixes only one problem, the zero-page case, as it was
the intention of Michael B to split this part into its own thread.  We
still have, of course, a second problem when it comes to a random LSN
put into the page header which could mask an actual checksum failure
so this only takes care of half the issues.  Having a correct LSN
approximation is a tricky problem IMO, but we could improve things by
having some delta with an extra WAL page on top of GetInsertRecPtr().
And this function cannot be used for base backups taken from
standbys.


We can also read such pages via shared buffers to be 100% sure.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: language cleanups in code and docs

2020-10-21 Thread Thomas Munro
On Wed, Jun 17, 2020 at 10:32 PM Magnus Hagander  wrote:
> In looking at this I realize we also have exactly one thing referred to as 
> "blacklist" in our codebase, which is the "enum blacklist" (and then a small 
> internal variable in pgindent). AFAICT, it's not actually exposed to 
> userspace anywhere, so we could probably make the attached change to 
> blocklist at no "cost" (the only thing changed is the name of the hash table, 
> and we definitely change things like that in normal releases with no specific 
> thought on backwards compat).

+1

Hmm, can we find a more descriptive name for this mechanism?  What
about calling it the "uncommitted enum table"?  See attached.
From 39dfbc83691ee9f48ea844172e38231740674aed Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 22 Oct 2020 10:09:55 +1300
Subject: [PATCH] Rename "enum blacklist" to "uncommitted enum table".

Internal data structures relating to uncommitted enum values are better
described that way, and the earlier term is now unpopular in the
software industry for its (unintended) connotations.

Discussion: https://postgr.es/m/20200615182235.x7lch5n6kcjq4aue%40alap3.anarazel.de
---
 src/backend/access/transam/parallel.c | 32 ++---
 src/backend/catalog/pg_enum.c | 65 ++-
 src/backend/utils/adt/enum.c  |  4 +-
 src/include/catalog/pg_enum.h |  8 ++--
 4 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index b0426960c7..1f499b5131 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -75,7 +75,7 @@
 #define PARALLEL_KEY_PENDING_SYNCS			UINT64CONST(0x000B)
 #define PARALLEL_KEY_REINDEX_STATE			UINT64CONST(0x000C)
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0x000D)
-#define PARALLEL_KEY_ENUMBLACKLIST			UINT64CONST(0x000E)
+#define PARALLEL_KEY_UNCOMMITTEDENUMTABLE	UINT64CONST(0x000E)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -211,7 +211,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 	Size		pendingsyncslen = 0;
 	Size		reindexlen = 0;
 	Size		relmapperlen = 0;
-	Size		enumblacklistlen = 0;
+	Size		uncommittedenumtablelen = 0;
 	Size		segsize = 0;
 	int			i;
 	FixedParallelState *fps;
@@ -267,8 +267,8 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_estimate_chunk(&pcxt->estimator, reindexlen);
 		relmapperlen = EstimateRelationMapSpace();
 		shm_toc_estimate_chunk(&pcxt->estimator, relmapperlen);
-		enumblacklistlen = EstimateEnumBlacklistSpace();
-		shm_toc_estimate_chunk(&pcxt->estimator, enumblacklistlen);
+		uncommittedenumtablelen = EstimateUncommittedEnumTableSpace();
+		shm_toc_estimate_chunk(&pcxt->estimator, uncommittedenumtablelen);
 		/* If you add more chunks here, you probably need to add keys. */
 		shm_toc_estimate_keys(&pcxt->estimator, 11);
 
@@ -348,7 +348,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *error_queue_space;
 		char	   *session_dsm_handle_space;
 		char	   *entrypointstate;
-		char	   *enumblacklistspace;
+		char	   *uncommittedenumtablespace;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -404,11 +404,13 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_RELMAPPER_STATE,
 	   relmapperspace);
 
-		/* Serialize enum blacklist state. */
-		enumblacklistspace = shm_toc_allocate(pcxt->toc, enumblacklistlen);
-		SerializeEnumBlacklist(enumblacklistspace, enumblacklistlen);
-		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENUMBLACKLIST,
-	   enumblacklistspace);
+		/* Serialize uncommitted enum state. */
+		uncommittedenumtablespace = shm_toc_allocate(pcxt->toc,
+	 uncommittedenumtablelen);
+		SerializeUncommittedEnumTable(uncommittedenumtablespace,
+	  uncommittedenumtablelen);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_UNCOMMITTEDENUMTABLE,
+	   uncommittedenumtablespace);
 
 		/* Allocate space for worker information. */
 		pcxt->worker = palloc0(sizeof(ParallelWorkerInfo) * pcxt->nworkers);
@@ -1257,7 +1259,7 @@ ParallelWorkerMain(Datum main_arg)
 	char	   *pendingsyncsspace;
 	char	   *reindexspace;
 	char	   *relmapperspace;
-	char	   *enumblacklistspace;
+	char	   *uncommittedenumtablespace;
 	StringInfoData msgbuf;
 	char	   *session_dsm_handle_space;
 
@@ -1449,10 +1451,10 @@ ParallelWorkerMain(Datum main_arg)
 	relmapperspace = shm_toc_lookup(toc, PARALLEL_KEY_RELMAPPER_STATE, false);
 	RestoreRelationMap(relmapperspace);
 
-	/* Restore enum blacklist. */
-	enumblacklistspace = shm_toc_lookup(toc, PARALLEL_KEY_ENUMBLACKLIST,
-		false);
-	RestoreEnumBlacklist(enumblacklistspace);
+	/* Restore uncommitted enum table. */
+	uncommittedenumtablespace =
+		shm_toc_lookup(toc,PARALLEL_KEY_UNCOMMITTEDENUMTABLE, false);
+	RestoreUncommittedEnumTable(uncommittedenumtablespace);
 
 	/* Attach to the leader's serializable transaction, if 

Re: [PATCH] Add section headings to index types doc

2020-10-21 Thread David G. Johnston
On Wed, Sep 30, 2020 at 4:25 AM Dagfinn Ilmari Mannsåker 
wrote:

> Michael Paquier  writes:
>
> > On Mon, Aug 10, 2020 at 12:52:17PM +, Jürgen Purtz wrote:
> >> The new status of this patch is: Waiting on Author
> >
> > This has not been answered yet, so I have marked the patch as returned
> > with feedback.
>
> Updated patch attached, which reformats the operator lists as requested
> by Jürgen,


A couple of things:

One, I would place the equality operator for hash inside a standalone
synopsis just like all of the others.
Two, why is hash special in having its create index command provided here?
(I notice this isn't the fault of this patch but it stands out when
reviewing it)

I would suggest rewording hash to look more like the others and add a link
to the "CREATE INDEX" command from the chapter preamble.

and skips the reindentation as suggested by Tom.
>

Agreed
David J.


Re: Use of "long" in incremental sort code

2020-10-21 Thread Tomas Vondra

On Wed, Oct 21, 2020 at 06:06:52AM +, Tang, Haiying wrote:

Hi


Found one more place that needs to be changed (long -> int64).

Also changed the output for int64 data (debug mode on & EXEC_SORTDEBUG defined).

And, maybe there's a typo in " src\backend\executor\nodeIncrementalSort.c" as 
below.
Obviously, the ">=" is meaningless, right?


-   SO1_printf("Sorting presorted prefix tuplesort with >= %ld 
tuples\n", nTuples);
+   SO1_printf("Sorting presorted prefix tuplesort with %ld 
tuples\n", nTuples);

Please take a check at the attached patch file.


I have added it to commit fest.
https://commitfest.postgresql.org/30/2772/



Thanks, the changes seem fine to me. I'll do a bit more review and get
it pushed.


regards
Tomas




Re: new heapcheck contrib module

2020-10-21 Thread Mark Dilger



> On Oct 21, 2020, at 1:51 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> There is still something screwy here, though, as this compiles, links and 
>> runs fine for me on mac and linux, but not for Robert.
> 
> Are you using --enable-nls at all on your Mac build?  Because for sure it
> should not work there, given the failure to include -lintl in amcheck's
> link step.  Some platforms are forgiving of that, but not Mac.

Thanks, Tom!

No, that's the answer.  I had a typo/thinko in my configure options, --with-nls 
instead of --enable-nls, and the warning about it being an invalid flag went by 
so fast I didn't see it.  I had it spelled correctly on linux, but I guess 
that's one of the platforms that is more forgiving.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: new heapcheck contrib module

2020-10-21 Thread Tom Lane
Mark Dilger  writes:
> There is still something screwy here, though, as this compiles, links and 
> runs fine for me on mac and linux, but not for Robert.

Are you using --enable-nls at all on your Mac build?  Because for sure it
should not work there, given the failure to include -lintl in amcheck's
link step.  Some platforms are forgiving of that, but not Mac.

regards, tom lane




Re: new heapcheck contrib module

2020-10-21 Thread Tom Lane
Robert Haas  writes:
> I was about to commit 0001, after making some cosmetic changes, when I
> discovered that it won't link for me. I think there must be something
> wrong with the NLS stuff. My version of 0001 is attached. The error I
> got is:

Well, the short answer would be "you need to add

SHLIB_LINK += $(filter -lintl, $(LIBS))

to the Makefile".  However, I would vote against that, because in point
of fact amcheck has no translation support, just like all our other
contrib modules.  What should likely happen instead is to rip out
whatever code is overoptimistically expecting it needs to support
translation.

regards, tom lane




Re: new heapcheck contrib module

2020-10-21 Thread Mark Dilger



> On Oct 21, 2020, at 1:13 PM, Alvaro Herrera  wrote:
> 
> On 2020-Oct-21, Robert Haas wrote:
> 
>> On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger  
>> wrote:
>>> This next version, attached, has the acl checking and associated 
>>> documentation changes split out into patch 0005, making it easier to review 
>>> in isolation from the rest of the patch series.
>>> 
>>> Independently of acl considerations, this version also has some verbiage 
>>> changes in 0004, in response to Andrey's review upthread.
>> 
>> I was about to commit 0001, after making some cosmetic changes, when I
>> discovered that it won't link for me. I think there must be something
>> wrong with the NLS stuff. My version of 0001 is attached. The error I
>> got is:
> 
> Hmm ... I don't think we have translation support in contrib, do we?  I
> think you could solve that by adding a "#undef _, #define _(...) (...)"
> or similar at the top of the offending C files, assuming you don't want
> to rip out all use of _() there.

There is still something screwy here, though, as this compiles, links and runs 
fine for me on mac and linux, but not for Robert.

On mac, I'm using the toolchain from XCode, whereas Robert is using MacPorts.

Mine reports:

Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Robert's reports:

clang version 5.0.2 (tags/RELEASE_502/final)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /opt/local/libexec/llvm-5.0/bin

On linux, I'm using gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36) 

Searching around on the web, there are various reports of MacPorts' clang not 
linking libintl correctly, though I don't know if that is a real problem with 
MacPorts or just a few cases of user error.  Has anybody else following this 
thread had issues with MacPorts' version of clang vis-a-vis linking libintl's 
gettext?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: [HACKERS] Custom compression methods

2020-10-21 Thread Tomas Vondra

On Wed, Oct 21, 2020 at 01:59:50PM +0530, Dilip Kumar wrote:

On Sat, Oct 17, 2020 at 11:34 AM Dilip Kumar  wrote:


On Tue, Oct 13, 2020 at 10:30 AM Dilip Kumar  wrote:
>
> On Mon, Oct 12, 2020 at 7:32 PM Tomas Vondra
>  wrote:
> >
> > On Mon, Oct 12, 2020 at 02:28:43PM +0530, Dilip Kumar wrote:
> > >
> > >> ...
> > >
> > >I have worked on this patch, so as discussed now I am maintaining the
> > >preserved compression methods using dependency.  Still PRESERVE ALL
> > >syntax is not supported, I will work on that part.
> > >
> >
> > Cool, I'll take a look. What's your opinion on doing it this way? Do you
> > think it's cleaner / more elegant, or is it something contrary to what
> > the dependencies are meant to do?
>
> I think this looks much cleaner.  Moreover, I feel that once we start
> supporting the custom compression methods then we anyway have to
> maintain the dependency so using that for finding the preserved
> compression method is good option.

I have also implemented the next set of patches.
0004 -> Provide a way to create custom compression methods
0005 -> Extention to implement lz4 as a custom compression method.


In the updated version I have worked on some of the listed items

A pending list of items:
1. Provide support for handling the compression option
- As discussed upthread, I will store the compression option of the
latest compression method in a new field in the pg_attribute table
2. As of now I have kept zlib as the second built-in option and lz4 as
a custom compression extension.  In an off-list discussion with Robert, he
suggested that we should keep lz4 as the built-in method and move zlib
to an extension, because lz4 is faster than zlib, so it is better to
keep lz4 as the built-in method.  So in the next version, I will
change that.  Any different opinion on this?


Done


3. Improve the documentation, especially for create_compression_method.
4. By default support table compression method for the index.


Done


5. Support the PRESERVE ALL option so that we can preserve all
existing lists of compression methods without providing the whole
list.


1,3,5 points are still pending.



Thanks. I took a quick look at the patches and I think it seems fine. I
have one question, though - toast_compress_datum contains this code:


/* Call the actual compression function */
tmp = cmroutine->cmcompress((const struct varlena *) value);
if (!tmp)
return PointerGetDatum(NULL);


Shouldn't this really throw an error instead? I mean, if the compression
library returns NULL, isn't that an error?
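
For illustration, what I have in mind is roughly this (a sketch only, exact
error code and wording aside):

	tmp = cmroutine->cmcompress((const struct varlena *) value);
	if (tmp == NULL)
		elog(ERROR, "compression method failed to compress value");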


regards


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com











Re: new heapcheck contrib module

2020-10-21 Thread Alvaro Herrera
On 2020-Oct-21, Robert Haas wrote:

> On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger  
> wrote:
> > This next version, attached, has the acl checking and associated 
> > documentation changes split out into patch 0005, making it easier to review 
> > in isolation from the rest of the patch series.
> >
> > Independently of acl considerations, this version also has some verbiage 
> > changes in 0004, in response to Andrey's review upthread.
> 
> I was about to commit 0001, after making some cosmetic changes, when I
> discovered that it won't link for me. I think there must be something
> wrong with the NLS stuff. My version of 0001 is attached. The error I
> got is:

Hmm ... I don't think we have translation support in contrib, do we?  I
think you could solve that by adding a "#undef _, #define _(...) (...)"
or similar at the top of the offending C files, assuming you don't want
to rip out all use of _() there.
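
Concretely, something as small as this at the top of the offending files
should be enough (sketch):

/* contrib has no translation support, so make _() a no-op here */
#undef _
#define _(x) (x)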

TBH the usage of "translation:" comments in this patch seems
over-enthusiastic to me.





Re: new heapcheck contrib module

2020-10-21 Thread Robert Haas
On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger  wrote:
> This next version, attached, has the acl checking and associated 
> documentation changes split out into patch 0005, making it easier to review 
> in isolation from the rest of the patch series.
>
> Independently of acl considerations, this version also has some verbiage 
> changes in 0004, in response to Andrey's review upthread.

I was about to commit 0001, after making some cosmetic changes, when I
discovered that it won't link for me. I think there must be something
wrong with the NLS stuff. My version of 0001 is attached. The error I
got is:

ccache clang -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror
-fno-omit-frame-pointer  -bundle -multiply_defined suppress -o
amcheck.so  verify_heapam.o verify_nbtree.o -L../../src/port
-L../../src/common   -L/opt/local/lib -L/opt/local/lib
-L/opt/local/lib -L/opt/local/lib  -L/opt/local/lib
-Wl,-dead_strip_dylibs  -Wall -Werror -fno-omit-frame-pointer
-bundle_loader ../../src/backend/postgres
Undefined symbols for architecture x86_64:
  "_libintl_gettext", referenced from:
  _verify_heapam in verify_heapam.o
  _check_tuple in verify_heapam.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [amcheck.so] Error 1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


v19-0001-Extend-amcheck-to-check-heap-pages.patch
Description: Binary data


Re: Deleting older versions in unique indexes to avoid page splits

2020-10-21 Thread Peter Geoghegan
On Wed, Oct 21, 2020 at 8:25 AM Robert Haas  wrote:
> That certainly isn't great. I mean, it might not be too terrible,
> because a leaf index page isn't nearly as potentially hot as a VM
> page or a clog page, but it hurts interruptibility and risks hurting
> concurrency; but if it were possible to arrange to hold only a pin on
> the page during all this rather than a lock, it would be better. I'm
> not sure how realistic that is, though.

I don't think that it's realistic. Well, technically you could do
something like that, but you'd end up with some logically equivalent
mechanism which would probably be slower. As you know, in nbtree pins
are generally much less helpful than within heapam (you cannot read a
page without a shared buffer lock, no matter what). Holding a pin only
provides a very weak guarantee about VACUUM and TID recycling that
usually doesn't come up.

Bear in mind that we actually do practically the same thing all the
time with the current LP_DEAD setting stuff, where we need to call
compute_xid_horizon_for_tuples/heap_compute_xid_horizon_for_tuples
with a leaf buffer lock held in almost the same way. That's actually
potentially far worse if you look at it in isolation, because you
could potentially have hundreds of heap pages, whereas this is just 1
- 3. (BTW, next version will also do that work in passing, so you're
practically guaranteed to do less with a buffer lock held compared to
the typical case of nbtree LP_DEAD setting, even without counting how
the LP_DEAD bits get set in the first place.)

I could also point out that something very similar happens in
_bt_check_unique().

Also bear in mind that the alternative is pretty much a page split, which means:

* Locking the leaf page

* Then obtaining relation extension lock

* Locking to create new right sibling

* Releasing relation extension lock

* Locking original right sibling page

* Release original right sibling page

* Release new right sibling page

* Lock parent page

* Release original now-split page

* Release parent page

(I will refrain from going into all of the absurd and near-permanent
secondary costs that just giving up and splitting the page imposes for
now. I didn't even include all of the information about locking --
there is one thing that didn't seem worth mentioning.)

The key concept here is of course asymmetry. The asymmetry here is not
only favorable; it's just outrageous. The other key concept is it's
fundamentally impossible to pay more than a very small fixed cost
without getting a benefit.

That said, I accept that there is still some uncertainty that all
workloads that get a benefit will be happy with the trade-off. I am
still fine tuning how this works in cases with high contention. I
welcome any help with that part. But note that this doesn't
necessarily have much to do with the heap page accesses. It's not
always strictly better to never have any bloat at all (it's pretty
close to that, but not quite). We saw this with the Postgres 12 work,
where small TPC-C test cases had some queries go slower simply because
a small highly contended index did not get bloated due to a smarter
split algorithm. There is no reason to believe that it had anything to
do with the cost of making better decisions. It was the decisions
themselves.

I don't want to completely prevent "version driven page splits"
(though a person could reasonably imagine that that is in fact my
precise goal); rather, I want to make non-hot updates work to prove
that it's almost certainly necessary to split the page due to version
churn - then and only then should it be accepted. Currently we meekly
roll over and let non-hot updaters impose negative externalities on
the system as a whole. The patch usually clearly benefits even
workloads that consist entirely of non-hot updaters. Negative
externalities are only good for the individual trying to impose costs
on the collective when they can be a true freeloader. It's always bad
for the collective, but it's even bad for the bad actors once they're
more than a small minority.

Currently non-hot updaters are not merely selfish to the extent that
they impose a downside on the collective or the system as a whole that
is roughly proportionate to the upside benefit they get. Not cleaning
up their mess as they go creates a downside that is a huge multiple of
any possible upside for them. To me this seems incontrovertible.
Worrying about the precise extent to which this is true in each
situation doesn't seem particularly productive to me. Whatever the
actual extent of the imbalance is, the solution is that you don't let
them do that.

This patch is not really about overall throughput. It could be
justified on that basis, but that's not how I like to think of it.
Rather, it's about providing a stabilizing backstop mechanism, which
tends to bound the amount of index bloat and the number of versions in
each index for each *logical row* -- that's the most important benefit
of the patch. There are 

Re: dynamic result sets support in extended query protocol

2020-10-21 Thread Andres Freund
Hi,

On 2020-10-20 20:17:45 -0400, Dave Cramer wrote:
> You are correct that we are not talking about a whole new protocol, but why not?
> Seems to me we would have a lot more latitude to get it right if we didn't
> have this limitation.

A new protocol will face a much bigger adoption hurdle, and there's much
stuff that we'll want to do that we'll have a hard time ever getting off
the ground. Whereas opt-in extensions are much easier to get off the ground.

Greetings,

Andres Freund




Re: Add header support to text format and matching feature

2020-10-21 Thread Daniel Verite
Rémi Lapeyre wrote:

> It looks like this is not in the current commitfest and that the cfbot does not
> find it. I’m not yet accustomed to the PostgreSQL workflow; should I just
> create a new entry in the current commitfest?

Yes. Because in the last CommitFest it was marked
as "Returned with feedback"
https://commitfest.postgresql.org/29/2504/


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: Yet another fast GiST build

2020-10-21 Thread Pavel Borisov
>
>
> >> +return (ia->lower > ib->lower) ? 1 : -1;
> >> +}
> >
> > We're only dealing with leaf items during index build, so the 'upper'
> and 'lower' should always be equal here, right? Maybe add a comment and an
> assertion on that.
> >
> > (It's pretty sad that the on-disk representation is identical for leaf
> and internal items, because that wastes a lot of disk space, but that's way
> out of scope for this patch.)
>
> Thanks, I've added assert() where it was easy to test equality.
>
> PFA patch with all types.
>
> I had a plan to implement and test one type each day. I did not quite
> understand how rich our type system is.
>

Cool, thanks!

BTW there is a somewhat parallel discussion on this gist patch in
pgsql-bugs which you may miss
https://www.postgresql.org/message-id/flat/3dda6945-c147-5afc-7720-38ec6848dafa%40gmail.com#9be5b8a75c4921498748514cb77c6e68

The main point is that a buffering build is better enforced with buffering=on,
as otherwise the buffering build becomes hard to test in the presence of sort
support.
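
For illustration, forcing the buffering strategy on an individual index looks
like this (table and column names are made up; buffering is the existing GiST
storage parameter):

CREATE INDEX pts_gist_idx ON pts USING gist (p) WITH (buffering = on);
-- without an explicit buffering = on, the sorted build added by this patch
-- would be picked where sort support exists, and the buffering code path
-- would no longer be exercised by the tests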

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Add header support to text format and matching feature

2020-10-21 Thread Rémi Lapeyre
It looks like this is not in the current commitfest and that the cfbot does not
find it. I’m not yet accustomed to the PostgreSQL workflow; should I just
create a new entry in the current commitfest?

Regards,
Rémi
 

> Le 13 oct. 2020 à 14:49, Rémi Lapeyre  a écrit :
> 
> Thanks Michael for taking care of that!
> 
> Here’s the rebased patches with the last one dropped.
> 
> Regards,
> Rémi
> 
> 
> 
> 
>> Le 5 oct. 2020 à 03:05, Michael Paquier  a écrit :
>> 
>> On Sat, Oct 03, 2020 at 11:42:52PM +0200, Rémi Lapeyre wrote:
>>> Here’s a new version of the patches that report an error when the options 
>>> are set multiple time.
>> 
>> Please note that I have applied a fix for the redundant option
>> handling as of 10c5291, though I have missed that you sent a patch.
>> Sorry about that.  Looking at it, we have done the same thing
>> byte-by-byte except that I have added tests for all option
>> combinations.
>> --
>> Michael
> 





Re: Yet another fast GiST build

2020-10-21 Thread Andrey Borodin


> On 7 Oct 2020, at 17:38, Heikki Linnakangas wrote:
> 
> On 07/10/2020 15:27, Andrey Borodin wrote:
>> Here's draft patch with implementation of sortsupport for ints and floats.
> 
>> +static int
>> +gbt_int4_cmp(Datum a, Datum b, SortSupport ssup)
>> +{
>> +int32KEY   *ia = (int32KEY *) DatumGetPointer(a);
>> +int32KEY   *ib = (int32KEY *) DatumGetPointer(b);
>> +
>> +if (ia->lower == ib->lower)
>> +{
>> +if (ia->upper == ib->upper)
>> +return 0;
>> +
>> +return (ia->upper > ib->upper) ? 1 : -1;
>> +}
>> +
>> +return (ia->lower > ib->lower) ? 1 : -1;
>> +}
> 
> We're only dealing with leaf items during index build, so the 'upper' and 
> 'lower' should always be equal here, right? Maybe add a comment and an 
> assertion on that.
> 
> (It's pretty sad that the on-disk representation is identical for leaf and 
> internal items, because that wastes a lot of disk space, but that's way out 
> of scope for this patch.)

Thanks, I've added assert() where it was easy to test equality.

PFA patch with all types.

I had a plan to implement and test one type each day. I did not quite 
understand how rich our type system is.

Thanks!

Best regards, Andrey Borodin.


0001-Sortsupport-for-sorting-GiST-build-for-gist_btree-ty.patch
Description: Binary data


Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-21 Thread Alvaro Herrera
On 2020-Oct-21, Justin Pryzby wrote:

> I came up with this, which probably needs more than a little finesse.

Hmm, there are two important changes needed on this: 1) it must not emit
CREATE lines for the child triggers; only the ALTER TABLE ONLY
 lines to set disable state on the partition are needed.  2)
tgparentid does not exist prior to pg13, so you need some additional
trick to cover that case.

Also, I think the multipartition case is broken: if grandparent has
trigger enabled, parent has trigger disabled and child trigger set to
always, is that dumped correctly?  I think the right way to do this is
change only the partitions that differ from the topmost partitioned
table -- not their immediate parents; and use ONLY to ensure they don't
affect downstream children.
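
For illustration, the dump output being described would look roughly like this
(object names are made up); the child triggers are never re-created, only their
state is adjusted where it differs from the topmost partitioned table's trigger:

ALTER TABLE ONLY public.parted_2021 DISABLE TRIGGER audit_trg;
ALTER TABLE ONLY public.parted_2022 ENABLE ALWAYS TRIGGER audit_trg;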

Change 1 also means that the test with the "this shouldn't ever get
emitted" comment remains unchanged.

I'm not sure how to tackle change 2.  You need to search pg_depend for
entries with classid=pg_trigger and refclass=pg_trigger ... (commit
1fa846f1c9af might give some clue)
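
A rough sketch of that lookup, with the child trigger's OID left as a psql-style
placeholder (the exact deptype to filter on would need to be checked against
that commit):

SELECT d.refobjid AS parent_trigger
FROM pg_catalog.pg_depend d
WHERE d.classid = 'pg_catalog.pg_trigger'::regclass
  AND d.objid = :child_trigger_oid
  AND d.refclassid = 'pg_catalog.pg_trigger'::regclass;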




Re: [PATCH] Add features to pg_stat_statements

2020-10-21 Thread Fujii Masao




On 2020/10/12 21:18, Yuki Seino wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The patch applies cleanly and looks fine to me. It's a small detail; however,
wouldn't it be better if the following changes were made?

1. There are unnecessary diff lines at 707 and 863 in "pg_stat_statements.c".
2. There is no comment on "slock_t mutex" in "struct pgssCtlCounter" in
"pg_stat_statements.c".
3. "update_ctl" and "reset_ctl" are generic, unclear names. You might want to rename
them to something like "pgss_ctl_update".
4. To improve the readability of the source, why not change the function declaration
option of "pg_stat_statements_ctl" from "AS 'MODULE_PATHNAME'" to
"AS 'MODULE_PATHNAME', 'pg_stat_statements_ctl'" in "pg_stat_statements--1.8--1.9.sql"?

The new status of this patch is: Waiting on Author


Here are other comments from me.

-DATA = pg_stat_statements--1.4.sql \
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.8--1.9.sql\

One space character needs to be added just before the character "\".

+--- Define pg_stat_statements_ctl

ISTM that this is not good name.
What about pg_stat_statements_info, _stats, _activity or something?

+   OUT dealloc integer,

The type of "dealloc" should be bigint, instead?

+   OUT last_dealloc TIMESTAMP WITH TIME ZONE

Is this information really useful?
If there is no valid use case for this, I'd like to drop it.
Thought?

+LANGUAGE  C STRICT VOLATILE PARALLEL SAFE;

There are two space characters just after "LANGUAGE".
One space character should be removed from there.

+CREATE VIEW pg_stat_statements_ctl AS
+   SELECT * FROM pg_stat_statements_ctl();

If we rename the function, this view name also should be changed.

+GRANT SELECT ON pg_stat_statements TO PUBLIC;

"pg_stat_statements" should be "pg_stat_statement_xxx"?

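Putting those suggestions together, the declarations in
pg_stat_statements--1.8--1.9.sql could end up looking something like this (a
sketch only, using the _info naming suggested above; the actual patch may of
course differ):

CREATE FUNCTION pg_stat_statements_info(OUT dealloc bigint)
AS 'MODULE_PATHNAME', 'pg_stat_statements_info'
LANGUAGE C STRICT VOLATILE PARALLEL SAFE;

CREATE VIEW pg_stat_statements_info AS
  SELECT * FROM pg_stat_statements_info();

GRANT SELECT ON pg_stat_statements_info TO PUBLIC;
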
Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: refactoring basebackup.c

2020-10-21 Thread Mark Dilger



> On Jul 29, 2020, at 8:31 AM, Robert Haas  wrote:
> 
> On Fri, May 8, 2020 at 4:55 PM Robert Haas  wrote:
>> So it might be good if I'd remembered to attach the patches. Let's try
>> that again.
> 
> Here's an updated patch set.

Hi Robert,

v2-0001 through v2-0009 still apply cleanly, but v2-0010 no longer applies.  It 
seems to be conflicting with Heikki's work from August.  Could you rebase 
please?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Is Recovery actually paused?

2020-10-21 Thread Simon Riggs
On Wed, 21 Oct 2020 at 12:16, Dilip Kumar  wrote:
>
> On Tue, Oct 20, 2020 at 5:59 PM Dilip Kumar  wrote:
> >
> > On Tue, Oct 20, 2020 at 3:00 PM Simon Riggs  wrote:
> > >
> > > On Tue, 20 Oct 2020 at 09:50, Dilip Kumar  wrote:
> > >
> > > > > > Why would we want this? What problem are you trying to solve?
> > > > >
> > > > > The requirement is to know the last replayed WAL on the standby so
> > > > > unless we can guarantee that the recovery is actually paused we can
> > > > > never get the safe last_replay_lsn value.
> > > > >
> > > > > > If we do care, why not fix pg_is_wal_replay_paused() so it responds 
> > > > > > as you wish?
> > > > >
> > > > > Maybe we can also do that but pg_is_wal_replay_paused is an existing
> > > > > API and the behavior is to know whether the recovery paused is
> > > > > requested or not,  So I am not sure is it a good idea to change the
> > > > > behavior of the existing API?
> > > > >
> > > >
> > > > Attached is the POC patch to show what I have in mind.
> > >
> > > If you don't like it, I doubt anyone else cares for the exact current
> > > behavior either. Thanks for pointing those issues out.
> > >
> > > It would make sense to alter pg_wal_replay_pause() so that it blocks
> > > until paused.
> > >
> > > I suggest you add the 3-value state as you suggest, but make
> > > pg_is_wal_replay_paused() respond:
> > > if paused, true
> > > if requested, wait until paused, then return true
> > > else false
> > >
> > > That then solves your issues with a smoother interface.
> > >
> >
> > Make sense to me, I will change as per the suggestion.
>
> I have noticed one more issue: if the recovery process is currently not
> processing any WAL and is just waiting for WAL to become available, then
> pg_is_wal_replay_paused() will be stuck forever.  Having said that, the
> same problem exists even with a new interface that checks whether recovery
> is actually paused, because until the recovery process gets the next WAL
> record it will not check whether a recovery pause has been requested, so
> the actual recovery-paused flag will never be set.
>
> One idea could be: if the recovery process is waiting for WAL and a
> recovery pause is requested, then we can assume that recovery is paused,
> because before processing the next WAL record it will always check whether
> a recovery pause has been requested.

If ReadRecord() is waiting for WAL (at bottom of recovery loop), then
when it does return it will immediately move to pause (at top of next
loop). Which makes it easy to cover these cases.

It would be easy enough to create another variable that shows "waiting
for WAL", since that is in itself a useful and interesting thing to be
able to report.

pg_is_wal_replay_paused() and pg_wal_replay_pause() would then return
whenever it is either (fully paused || (waiting for WAL &&
pause_requested))

We can then create a new function called pg_wal_replay_status() that
returns multiple values: RECOVERING | WAITING_FOR_WAL | PAUSED
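
A rough sketch of how that could fit together; all of the names below are
illustrative only, not existing code:

typedef enum RecoveryPauseState
{
    RECOVERY_REPLAYING,         /* actively applying WAL records */
    RECOVERY_WAITING_FOR_WAL,   /* idle at the bottom of the recovery loop */
    RECOVERY_PAUSED             /* a pause request has taken effect */
} RecoveryPauseState;

/* what pg_is_wal_replay_paused() / pg_wal_replay_pause() would wait for */
static bool
RecoveryEffectivelyPaused(void)
{
    RecoveryPauseState state = GetRecoveryPauseState();    /* hypothetical */

    return state == RECOVERY_PAUSED ||
        (state == RECOVERY_WAITING_FOR_WAL && RecoveryPauseRequested());
}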

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: Timing of relcache inval at parallel worker init

2020-10-21 Thread Robert Haas
On Sat, Oct 17, 2020 at 7:53 AM Noah Misch  wrote:
> Let's move InvalidateSystemCaches() later.
> Invalidation should follow any worker initialization step that changes the
> results of relcache validation; otherwise, we'd need to ensure the
> InvalidateSystemCaches() will not validate any relcache entry.

The thinking behind the current placement was this: just before the
call to InvalidateSystemCaches(), we restore the active and
transaction snapshots. I think that we must now flush the caches
before anyone does any more lookups; otherwise, they might get wrong
answers. So, putting this code later makes the assumption that no such
lookups happen meanwhile. That feels a little risky to me; at the very
least, it should be clearly spelled out in the comments that no system
cache lookups can happen in the functions we call in the interim.
Would it be obvious to a future developer that e.g.
RestoreEnumBlacklist cannot perform any syscache lookups? It doesn't
seem so to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[patch] [doc] Clarify temporary table name shadowing in CREATE TABLE

2020-10-21 Thread David G. Johnston
Hackers,

Moving this over from -general [1]

David J.

[1]
https://www.postgresql.org/message-id/CAKFQuwaM1K%3DprJNwKnoaC2AyDFn-7OvtCpmQ23bcVe5Z%3DLKA3Q%40mail.gmail.com
diff --git a/doc/src/sgml/ref/create_table.sgml 
b/doc/src/sgml/ref/create_table.sgml
index 087cad184c..a400334092 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -171,8 +171,9 @@ WITH ( MODULUS numeric_literal, REM
   If specified, the table is created as a temporary table.
   Temporary tables are automatically dropped at the end of a
   session, or optionally at the end of the current transaction
-  (see ON COMMIT below).  Existing permanent
-  tables with the same name are not visible to the current session
+  (see ON COMMIT below).  The default
+  search_path includes the temporary schema first and so identically
+  named existing permanent tables are not chosen for new plans
   while the temporary table exists, unless they are referenced
   with schema-qualified names. Any indexes created on a temporary
   table are automatically temporary as well.
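
A quick SQL illustration of the behavior the new wording describes, using the
default search_path:

CREATE TABLE t (x int);          -- permanent table in schema public
CREATE TEMP TABLE t (y text);    -- same name in the session's temporary schema
SELECT * FROM t;                 -- resolves to the temporary table
SELECT * FROM public.t;          -- the permanent table is still reachable
DROP TABLE t;                    -- also resolves to the temporary table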


Re: Deleting older versions in unique indexes to avoid page splits

2020-10-21 Thread Robert Haas
On Wed, Oct 7, 2020 at 7:48 PM Peter Geoghegan  wrote:
> To be blunt: It may be controversial that we're accessing multiple
> heap pages while holding an exclusive lock on a leaf page, in the
> hopes that we can avoid a page split, but without any certainty that
> it'll work out.

That certainly isn't great. I mean, it might not be too terrible,
because a leaf index page isn't nearly as potentially hot as a VM
page or a clog page, but it hurts interruptibility and risks hurting
concurrency. But if it were possible to arrange to hold only a pin on
the page during all this rather than a lock, it would be better. I'm
not sure how realistic that is, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2020-10-21 Thread David G. Johnston
Hackers,

Bug # 16519 [1] is another report of confusion regarding trying to use
parameters in improper locations - specifically the SET ROLE command within
pl/pgsql.  I'm re-attaching the doc patch and am adding it to the
commitfest.

David J.

[1]
https://www.postgresql.org/message-id/16519-9ef04828d058a319%40postgresql.org


v1-doc-plpgsql-variable-usage-cleanup.patch
Description: Binary data


Re: [HACKERS] Custom compression methods

2020-10-21 Thread Robert Haas
On Thu, Oct 8, 2020 at 5:54 PM Tomas Vondra
 wrote:
> And is the oidvector actually needed? If we have the extra catalog,
> can't we track this simply using the regular dependencies? So we'd have
> the attcompression OID of the current compression method, and the
> preserved values would be tracked in pg_depend.

If we go that route, we have to be sure that no such dependencies can
exist for any other reason. Otherwise, there would be confusion about
whether the dependency was there because values of that type were
being preserved in the table, or whether it was for the hypothetical
other reason. Now, admittedly, I can't quite think how that would
happen. For example, if the attribute default expression somehow
embedded a reference to a compression AM, that wouldn't cause this
problem, because the dependency would be on the attribute default
rather than the attribute itself. So maybe it's fine.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[patch] [doc] Introduce view updating options more succinctly

2020-10-21 Thread David G. Johnston
Hackers,

Over in -docs [1], where I attached the wrong patch anyway, the poster
needed some clarity regarding view updating.  A minor documentation patch
is attached providing just that.

David J.

[1] https://www.postgresql.org/message-id/20200303174248.GB5019%40panix.com


v1-doc-rules-view-updating-touchup.patch
Description: Binary data


Re: Is Recovery actually paused?

2020-10-21 Thread Robert Haas
On Wed, Oct 21, 2020 at 7:16 AM Dilip Kumar  wrote:
> I have noticed one more issue: if the recovery process is currently not
> processing any WAL and is just waiting for WAL to become available, then
> pg_is_wal_replay_paused() will be stuck forever.  Having said that, the
> same problem exists even with a new interface that checks whether recovery
> is actually paused, because until the recovery process gets the next WAL
> record it will not check whether a recovery pause has been requested, so
> the actual recovery-paused flag will never be set.
>
> One idea could be: if the recovery process is waiting for WAL and a
> recovery pause is requested, then we can assume that recovery is paused,
> because before processing the next WAL record it will always check whether
> a recovery pause has been requested.

That seems fine, because the user's question is presumably whether the
pause has taken effect so that no more records will be replayed
barring an un-pause.

However, it might be better to implement this by having the system
absorb the pause immediately when it's in this state, rather than
trying to detect this state and treat it specially.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [patch] [doc] Add SELECT clause literals to queries section headers

2020-10-21 Thread David G. Johnston
>
> Hackers,
>
> Re-sending from -docs [1] with attachment in order to add to commitfest.
>
> David J.
>
> [1]
> https://www.postgresql.org/message-id/flat/159981394174.31338.7014519396749859167%40wrigleys.postgresql.org
>

edit: attaching the patch


v1-doc-add-select-literals-to-queries-section-headers.patch
Description: Binary data


[patch] [doc] Add SELECT clause literals to queries section headers

2020-10-21 Thread David G. Johnston
Hackers,

Re-sending from -docs [1] with attachment in order to add to commitfest.

David J.

[1]
https://www.postgresql.org/message-id/flat/159981394174.31338.7014519396749859167%40wrigleys.postgresql.org


Re: A new function to wait for the backend exit after termination

2020-10-21 Thread David G. Johnston
On Wed, Oct 21, 2020 at 6:13 AM Magnus Hagander  wrote:

> On Wed, Oct 21, 2020 at 3:02 PM Bharath Rupireddy <
> bharath.rupireddyforpostg...@gmail.com> wrote:
>
>> Hi,
>>
>> Currently pg_terminate_backend() sends SIGTERM to the backend process
>> but doesn't ensure its exit. There are chances that backends still are
>> running(even after pg_terminate_backend() is called) until the interrupts
>> are processed(using ProcessInterrupts()). This could cause problems
>> especially in testing, for instance in a sql file right after
>> pg_terminate_backend(), if any test case depends on the backend's
>> non-existence[1], but the backend is not terminated. As discussed in [1],
>> we have wait_pid()(see regress.c and sql/dblink.sql), but it's not usable
>> across the system. In [1], we thought it would be better to have functions
>> ensuring the backend's exit on the similar lines of pg_terminate_backend().
>>
>> I propose to have two functions:
>>
>> 1. pg_terminate_backend_and_wait() -- which sends SIGTERM to the backend
>> and waits until its exit.
>>
>
> I think it would be nicer to have a pg_terminate_backend(pid, wait=false),
> so a function with a second parameter which defaults to the current
> behaviour of not waiting. And it might be a good idea to also give it a
> timeout parameter?
>

Agreed on the overload, and the timeouts make sense too - with the caller
deciding whether a timeout results in a failure or a false return value.
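
For illustration, the overloaded signature being discussed might end up being
used like this (the signature and defaults are hypothetical at this point):

-- pg_terminate_backend(pid integer, wait boolean DEFAULT false,
--                      timeout integer DEFAULT 0)
SELECT pg_terminate_backend(12345);              -- current behavior, no waiting
SELECT pg_terminate_backend(12345, true, 5000);  -- wait up to 5s for the exit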


>
>> 2. pg_wait_backend() -- which waits for a given backend process. Note
>> that this function has to be used carefully after pg_terminate_backend(),
>> if used on a backend that's not ternmited it simply keeps waiting in a loop.
>>
>
> It seems this one also very much would need a timeout value.
>
>
Is there a requirement for waiting to be superuser only?  You are not
affecting any session but your own during the waiting period.

I could imagine, in theory at least, wanting to wait for a backend to go
idle as well as for it disappearing.  Scope creep in terms of this patch's
goal but worth at least considering now.

David J.


Re: Combination of geqo and enable_partitionwise_join leads to crashes in the regression tests

2020-10-21 Thread Onder Kalaci
Hi,

I think this is already discussed  here: 
https://www.postgresql.org/message-id/flat/CAExHW5tgiLsYC_CLcaKHFFc8H56C0s9mCu_0OpahGxn%3DhUi_Pg%40mail.gmail.com#db54218ab7bb9e1484cdcc52abf2d324

Sorry for missing that thread before sending the mail.


From: Onder Kalaci 
Date: Wednesday, 21 October 2020 12:49
To: pgsql-hack...@postgresql.org 
Subject: Combination of geqo and enable_partitionwise_join leads to crashes in 
the regression tests
Hi,

I was running “make installcheck” with the following settings:

SET geqo_threshold=2;
SET geqo_generations=1000;
SET geqo_pool_size=1000;
SET enable_partitionwise_join to true;

And, realized that “partition_join” test crashed. It is reproducible for both 
12.3 and 13.0  (I’ve not tested further).

Minimal steps to reproduce:

SET geqo_threshold=2;
SET geqo_generations=1000;
SET geqo_pool_size=1000;
SET enable_partitionwise_join to true;

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);

EXPLAIN (COSTS OFF)
SELECT t1.a,
   ss.t2a,
   ss.t2c
FROM prt1 t1
LEFT JOIN LATERAL
  (SELECT t2.a AS t2a,
  t3.a AS t3a,
  t2.b t2b,
  t2.c t2c,
  least(t1.a, t2.a, t3.b)
   FROM prt1 t2
   JOIN prt2 t3 ON (t2.a = t3.b)) ss ON t1.c = ss.t2c
WHERE (t1.b + coalesce(ss.t2b, 0)) = 0
ORDER BY t1.a;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 4.966 ms
@:-!>



Top of the backtrace on PG 13.0:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x170100)
  * frame #0: 0x000108b255c0 postgres`bms_is_subset(a=0x00170100, 
b=0x7fac37834db8) at bitmapset.c:327:13
frame #1: 0x000108b65b55 
postgres`generate_join_implied_equalities_normal(root=0x7fac37815640, 
ec=0x7fac3781d2c0, join_relids=0x7fac37834db8, 
outer_relids=0x7fac3781a9f8, inner_relids=0x7fac37087608) at 
equivclass.c:1324:8
frame #2: 0x000108b659a9 
postgres`generate_join_implied_equalities(root=0x7fac37815640, 
join_relids=0x7fac37834db8, outer_relids=0x7fac3781a9f8, 
inner_rel=0x7fac370873f0) at equivclass.c:1197:14
frame #3: 0x000108ba71a3 
postgres`build_joinrel_restrictlist(root=, 
joinrel=0x7fac37834ba0, outer_rel=0x7fac37802f10, 
inner_rel=0x7fac370873f0) at relnode.c:1079:8
frame #4: 0x000108ba6fe0 
postgres`build_join_rel(root=0x7fac37815640, joinrelids=0x7fac370873c8, 
outer_rel=0x7fac37802f10, inner_rel=0x7fac370873f0, 
sjinfo=0x7fac3781c540, restrictlist_ptr=0x7ffee72c9668) at 
relnode.c:709:17
frame #5: 0x000108b6e552 
postgres`make_join_rel(root=0x7fac37815640, rel1=0x7fac37802f10, 
rel2=0x7fac370873f0) at joinrels.c:746:12
frame #6: 0x000108b58d68 postgres`merge_clump(root=0x7fac37815640, 
clumps=0x7fac37087348, new_clump=0x7fac37087320, num_gene=3, 
force=) at geqo_eval.c:260:14
frame #7: 0x000108b58bee postgres`gimme_tree(root=, 
tour=0x7fac378248c8, num_gene=) at geqo_eval.c:199:12
frame #8: 0x000108b58ab9 postgres`geqo_eval(root=0x7fac37815640, 
tour=0x7fac378248c8, num_gene=3) at geqo_eval.c:102:12
frame #9: 0x000108b592b8 
postgres`random_init_pool(root=0x7fac37815640, pool=0x7fac37824828) at 
geqo_pool.c:109:25
frame #10: 0x000108b58fb7 postgres`geqo(root=0x7fac37815640, 
number_of_rels=, initial_rels=) at geqo_main.c:114:2
frame #11: 0x000108b5988f 
postgres`make_one_rel(root=0x7fac37815640, joinlist=0x7fac3781cf08) at 
allpaths.c:227:8
frame #12: 0x000108b7f187 
postgres`query_planner(root=0x7fac37815640, qp_callback=, 
qp_extra=0x7ffee7
….

Thanks,
Onder


Combination of geqo and enable_partitionwise_join leads to crashes in the regression tests

2020-10-21 Thread Onder Kalaci
Hi,

I was running “make installcheck” with the following settings:

SET geqo_threshold=2;
SET geqo_generations=1000;
SET geqo_pool_size=1000;
SET enable_partitionwise_join to true;

And, realized that “partition_join” test crashed. It is reproducible for both 
12.3 and 13.0  (I’ve not tested further).

Minimal steps to reproduce:

SET geqo_threshold=2;
SET geqo_generations=1000;
SET geqo_pool_size=1000;
SET enable_partitionwise_join to true;

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);

EXPLAIN (COSTS OFF)
SELECT t1.a,
   ss.t2a,
   ss.t2c
FROM prt1 t1
LEFT JOIN LATERAL
  (SELECT t2.a AS t2a,
  t3.a AS t3a,
  t2.b t2b,
  t2.c t2c,
  least(t1.a, t2.a, t3.b)
   FROM prt1 t2
   JOIN prt2 t3 ON (t2.a = t3.b)) ss ON t1.c = ss.t2c
WHERE (t1.b + coalesce(ss.t2b, 0)) = 0
ORDER BY t1.a;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 4.966 ms
@:-!>



Top of the backtrace on PG 13.0:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x170100)
  * frame #0: 0x000108b255c0 postgres`bms_is_subset(a=0x00170100, 
b=0x7fac37834db8) at bitmapset.c:327:13
frame #1: 0x000108b65b55 
postgres`generate_join_implied_equalities_normal(root=0x7fac37815640, 
ec=0x7fac3781d2c0, join_relids=0x7fac37834db8, 
outer_relids=0x7fac3781a9f8, inner_relids=0x7fac37087608) at 
equivclass.c:1324:8
frame #2: 0x000108b659a9 
postgres`generate_join_implied_equalities(root=0x7fac37815640, 
join_relids=0x7fac37834db8, outer_relids=0x7fac3781a9f8, 
inner_rel=0x7fac370873f0) at equivclass.c:1197:14
frame #3: 0x000108ba71a3 
postgres`build_joinrel_restrictlist(root=, 
joinrel=0x7fac37834ba0, outer_rel=0x7fac37802f10, 
inner_rel=0x7fac370873f0) at relnode.c:1079:8
frame #4: 0x000108ba6fe0 
postgres`build_join_rel(root=0x7fac37815640, joinrelids=0x7fac370873c8, 
outer_rel=0x7fac37802f10, inner_rel=0x7fac370873f0, 
sjinfo=0x7fac3781c540, restrictlist_ptr=0x7ffee72c9668) at 
relnode.c:709:17
frame #5: 0x000108b6e552 
postgres`make_join_rel(root=0x7fac37815640, rel1=0x7fac37802f10, 
rel2=0x7fac370873f0) at joinrels.c:746:12
frame #6: 0x000108b58d68 postgres`merge_clump(root=0x7fac37815640, 
clumps=0x7fac37087348, new_clump=0x7fac37087320, num_gene=3, 
force=) at geqo_eval.c:260:14
frame #7: 0x000108b58bee postgres`gimme_tree(root=, 
tour=0x7fac378248c8, num_gene=) at geqo_eval.c:199:12
frame #8: 0x000108b58ab9 postgres`geqo_eval(root=0x7fac37815640, 
tour=0x7fac378248c8, num_gene=3) at geqo_eval.c:102:12
frame #9: 0x000108b592b8 
postgres`random_init_pool(root=0x7fac37815640, pool=0x7fac37824828) at 
geqo_pool.c:109:25
frame #10: 0x000108b58fb7 postgres`geqo(root=0x7fac37815640, 
number_of_rels=, initial_rels=) at geqo_main.c:114:2
frame #11: 0x000108b5988f 
postgres`make_one_rel(root=0x7fac37815640, joinlist=0x7fac3781cf08) at 
allpaths.c:227:8
frame #12: 0x000108b7f187 
postgres`query_planner(root=0x7fac37815640, qp_callback=, 
qp_extra=0x7ffee7
….

Thanks,
Onder


Re: libpq compression

2020-10-21 Thread Konstantin Knizhnik


On 06.10.2020 9:34, Andrey M. Borodin wrote:



On 26 March 2019, at 19:46, Konstantin Knizhnik wrote:
написал(а):

Version of the patch correctly working when no compression algorithm are 
avaiable.

Thanks for this work, Konstantin.
PFA rebased version of this patch.

This compression seems very important to reduce network-induced replication lag.
Recently I've found out that small installations suffer from a huge latency spike 
when archive_timeout occurs.
How does this happen? archive_timeout forces a segment switch, which pads the current 
segment with zeroes. Now these zeroes need to be replicated to standbys, 
without compression. The transfer of zeroes causes long waits for synchronous 
replication (up to 1s with the network restricted to 16Mbps).

Also replication compression will reduce overall cross-AZ traffic of HA 
installations.

So I'm considering the following plan for 2020: implement this protocol in Odyssey 
and send patches to drivers to enable early access to this feature.

Best regards, Andrey Borodin.


Rebased version of the patch is attached.
I am going to resubmit it to the next CF.

diff --git a/configure b/configure
index 071d050..12ae2dd 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LD
 LDFLAGS_SL
 LDFLAGS_EX
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 XML2_LIBS
@@ -867,6 +868,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 '
@@ -8571,6 +8573,85 @@ fi
 
 
 
+#
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_zstd_ZSTD_compress" >&5
+$as_echo "$ac_cv_lib_zstd_ZSTD_compress" >&6; }
+if test "x$ac_cv_lib_zstd_ZSTD_compress" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBZSTD 1
+_ACEOF
+
+  LIBS="-lzstd $LIBS"
+
+else
+  as_fn_error $? "library 'zstd' is required for ZSTD support" "$LINENO" 5
+fi
+
+fi
+
+
 
 #
 # Zlib
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index de60281..4a1f49c 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -1225,6 +1225,20 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   
  
 
+ 
+  compression
+  
+  
+Request compression of libpq traffic. The client sends the server a list of compression algorithms supported by the client library.
+If the server supports one of these algorithms, it acknowledges use of this algorithm, and all libpq messages sent both from client to server and
+vice versa will then be compressed. If the server does not support any of the suggested algorithms, it replies with an 'n' (no compression)
+message, and it is up to the client whether to continue working without compression or report an error.
+Supported compression algorithms are chosen at configure time. Right now two libraries are supported: zlib (default) and zstd (if Postgres was
+configured with the --with-zstd option). In both cases streaming mode is used.
+  
+  
+ 
+
  
   client_encoding
   
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index a46cb72..6ff5777 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -92,6 +92,15 @@
such as COPY.
   
 
+  
+It is possible to compress protocol data to reduce traffic and speed up client-server interaction.
+Compression is especially useful for importing/exporting data to/from the database using the COPY command
+and for replication (both physical and logical). Compression can also reduce server response time
+for queries returning large amounts of data (for example JSON, BLOBs, text, ...).
+Right now two libraries are supported: zlib (default) and zstd (if Postgres was
+configured with 

Re: A new function to wait for the backend exit after termination

2020-10-21 Thread Magnus Hagander
On Wed, Oct 21, 2020 at 3:02 PM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> Hi,
>
> Currently pg_terminate_backend() sends SIGTERM to the backend process but
> doesn't ensure its exit. There are chances that backends still are
> running(even after pg_terminate_backend() is called) until the interrupts
> are processed(using ProcessInterrupts()). This could cause problems
> especially in testing, for instance in a sql file right after
> pg_terminate_backend(), if any test case depends on the backend's
> non-existence[1], but the backend is not terminated. As discussed in [1],
> we have wait_pid()(see regress.c and sql/dblink.sql), but it's not usable
> across the system. In [1], we thought it would be better to have functions
> ensuring the backend's exit on the similar lines of pg_terminate_backend().
>
> I propose to have two functions:
>
> 1. pg_terminate_backend_and_wait() -- which sends SIGTERM to the backend
> and waits until its exit.
>

I think it would be nicer to have a pg_terminate_backend(pid, wait=false),
so a function with a second parameter which defaults to the current
behaviour of not waiting. And it might be a good idea to also give it a
timeout parameter?


> 2. pg_wait_backend() -- which waits for a given backend process. Note that
> this function has to be used carefully after pg_terminate_backend(), if
> used on a backend that's not terminated, it simply keeps waiting in a loop.
>

It seems this one also very much would need a timeout value.

And surely we should show some sort of wait event when it's waiting.
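
For illustration, the sleep loop in the proposed pg_wait_backend() could surface
a wait event roughly like this; WAIT_EVENT_PG_SLEEP is only a stand-in here, a
dedicated wait event would presumably be added instead:

while (kill(pid, 0) == 0)
{
    CHECK_FOR_INTERRUPTS();
    pgstat_report_wait_start(WAIT_EVENT_PG_SLEEP);
    pg_usleep(10000L);          /* recheck liveness every 10ms */
    pgstat_report_wait_end();
}

if (errno != ESRCH)
    elog(ERROR, "could not check PID %d liveness: %m", pid);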

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


A new function to wait for the backend exit after termination

2020-10-21 Thread Bharath Rupireddy
Hi,

Currently pg_terminate_backend() sends SIGTERM to the backend process but
doesn't ensure its exit. There are chances that backends still are
running(even after pg_terminate_backend() is called) until the interrupts
are processed(using ProcessInterrupts()). This could cause problems
especially in testing, for instance in a sql file right after
pg_terminate_backend(), if any test case depends on the backend's
non-existence[1], but the backend is not terminated. As discussed in [1],
we have wait_pid()(see regress.c and sql/dblink.sql), but it's not usable
across the system. In [1], we thought it would be better to have functions
ensuring the backend's exit on the similar lines of pg_terminate_backend().

I propose to have two functions:

1. pg_terminate_backend_and_wait() -- which sends SIGTERM to the backend
and waits until its exit.
2. pg_wait_backend() -- which waits for a given backend process. Note that
this function has to be used carefully after pg_terminate_backend(), if
used on a backend that's not terminated, it simply keeps waiting in a loop.

Attaching a WIP patch herewith.

Thoughts?

[1]
https://www.postgresql.org/message-id/flat/f31cc4da-a7ea-677f-cf64-a2f9db854bf5%40oss.nttdata.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From b1714f9fd95007b9fb6c26b675f83fa73fecd562 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 21 Oct 2020 11:52:34 +0530
Subject: [PATCH v1] pg_terminate_backend_and_wait

---
 src/backend/storage/ipc/signalfuncs.c | 71 +++
 src/include/catalog/pg_proc.dat   |  7 +++
 2 files changed, 78 insertions(+)

diff --git a/src/backend/storage/ipc/signalfuncs.c b/src/backend/storage/ipc/signalfuncs.c
index d822e82cb9..002ac5f10d 100644
--- a/src/backend/storage/ipc/signalfuncs.c
+++ b/src/backend/storage/ipc/signalfuncs.c
@@ -149,6 +149,77 @@ pg_terminate_backend(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(r == SIGNAL_BACKEND_SUCCESS);
 }
 
+/*
+ * Signal to terminate a backend process and wait until no process has a given
+ * PID. This is allowed if you are a member of the role whose process is being
+ * terminated.
+ *
+ * Note that only superusers can signal superuser-owned processes.
+ */
+Datum
+pg_terminate_backend_and_wait(PG_FUNCTION_ARGS)
+{
+	int			pid = PG_GETARG_INT32(0);
+	bool		r = DatumGetBool(DirectFunctionCall1(pg_terminate_backend,
+	 PG_GETARG_DATUM(0)));
+	if (r)
+	{
+		while (kill(pid, 0) == 0)
+		{
+			/* Process interrupts in case of self termination. */
+			if (pid == MyProcPid)
+CHECK_FOR_INTERRUPTS();
+
+			pg_usleep(5);
+		}
+
+		if (errno != ESRCH)
+			elog(ERROR, "could not check PID %d liveness: %m", pid);
+	}
+
+	PG_RETURN_BOOL(r);
+}
+
+Datum
+pg_wait_backend(PG_FUNCTION_ARGS)
+{
+	int			pid = PG_GETARG_INT32(0);
+	PGPROC	   *proc = BackendPidGetProc(pid);
+
+	/*
+	 * BackendPidGetProc returns NULL if the pid isn't valid; but by the time
+	 * we reach kill(), a process for which we get a valid proc here might
+	 * have terminated on its own.  There's no way to acquire a lock on an
+	 * arbitrary process to prevent that. But since so far all the callers of
+	 * this mechanism involve some request for ending the process anyway, that
+	 * it might end on its own first is not a problem.
+	 */
+	if (proc == NULL)
+	{
+		/*
+		 * This is just a warning so a loop-through-resultset will not abort
+		 * if one backend terminated on its own during the run.
+		 */
+		ereport(WARNING,
+(errmsg("PID %d is not a PostgreSQL server process", pid)));
+		PG_RETURN_BOOL(false);
+	}
+
+	if (!superuser())
+		elog(ERROR, "must be superuser to check PID liveness");
+
+	while (kill(pid, 0) == 0)
+	{
+		//CHECK_FOR_INTERRUPTS();
+		pg_usleep(5);
+	}
+
+	if (errno != ESRCH)
+		elog(ERROR, "could not check PID %d liveness: %m", pid);
+
+	PG_RETURN_BOOL(true);
+}
+
 /*
  * Signal to reload the database configuration
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 22340baf1c..1c502d51e3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6085,6 +6085,13 @@
 { oid => '2096', descr => 'terminate a server process',
   proname => 'pg_terminate_backend', provolatile => 'v', prorettype => 'bool',
   proargtypes => 'int4', prosrc => 'pg_terminate_backend' },
+{ oid => '16386', descr => 'terminate a server process and wait for it\'s exit',
+  proname => 'pg_terminate_backend_and_wait', provolatile => 'v',
+  prorettype => 'bool', proargtypes => 'int4',
+  prosrc => 'pg_terminate_backend_and_wait' },
+{ oid => '16387', descr => 'waits for a backend\'s exit',
+  proname => 'pg_wait_backend', provolatile => 'v', prorettype => 'bool',
+  proargtypes => 'int4', prosrc => 'pg_wait_backend' },
 { oid => '2172', descr => 'prepare for taking an online backup',
   proname => 'pg_start_backup', provolatile => 'v', proparallel => 'r',
   prorettype => 'pg_lsn', proargtypes => 'text bool bool',
-- 
2.25.1



Re: CREATE TABLE .. PARTITION OF fails to preserve tgenabled for inherited row triggers

2020-10-21 Thread Justin Pryzby
On Tue, Oct 20, 2020 at 09:54:53PM -0300, Alvaro Herrera wrote:
> On 2020-Oct-20, Justin Pryzby wrote:
> > On Tue, Oct 20, 2020 at 03:56:30PM -0400, Tom Lane wrote:
> > > Alvaro Herrera  writes:
> > > > Hmm, next question: should we backpatch a fix for this?  (This applies
> > > > all the way back to 11.)  If we do, then we would change behavior of
> > > > partition creation.  It's hard to see that the current behavior is
> > > > desirable ... and I think anybody who would have come across this, would
> > > > wish it behaved the other way.  But still -- it would definitely be a
> > > > behavior change.
> > > 
> > > The behavior change seems like it'd be an improvement in a vacuum,
> > > but I wonder how it would interact with catalog contents left behind
> > > by the old misbehavior.  Also, would we expect pg_dump to try to do
> > > anything to clean up the mess?  If so, allowing a back branch to have
> > > had more than one behavior would complicate that greatly.
> > 
> > I don't think there's a problem with catalog content ?
> > I think it's fine if there's an enabled child trigger inheriting from a
> > disabled parent?  This changes the initial tgenabled for new partitions.
> 
> I don't think we'd need to do anything special here ... particularly
> considering the discovery that pg_dump does not preserve the disable
> status of trigger on partitions:
> 
> > However...it looks like pg_dump should ALTER the child trigger state if it
> > differ from its parent.  Or maybe it needs to CREATE child triggers with the
> > proper state before attaching the child table ?
> 
> I guess *something* needs to be done, but I'm not clear on what it is.
> Creating the trigger on partition beforehand does not work: an error is
> raised on attach that the trigger already exists.
> 
> The only way I see to do this, is to have pg_dump extract tgenabled for

I came up with this, which probably needs more than a little finesse.

-- 
Justin
>From 465fba070986774e8bf4ec911986cc97dd211b20 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 20 Oct 2020 23:19:07 -0500
Subject: [PATCH v1] pg_dump: output DISABLE/ENABLE for child triggers ..

..if their state does not match their parent
---
 src/bin/pg_dump/pg_dump.c| 36 +---
 src/bin/pg_dump/pg_dump.h|  1 +
 src/bin/pg_dump/t/002_pg_dump.pl | 26 +++
 3 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ff45e3fb8c..c4b7046cac 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -7792,77 +7792,79 @@ getRules(Archive *fout, int *numRules)
  * does get entered into the DumpableObject tables.
  */
 void
 getTriggers(Archive *fout, TableInfo tblinfo[], int numTables)
 {
 	int			i,
 j;
 	PQExpBuffer query = createPQExpBuffer();
 	PGresult   *res;
 	TriggerInfo *tginfo;
 	int			i_tableoid,
 i_oid,
 i_tgname,
 i_tgfname,
 i_tgtype,
 i_tgnargs,
 i_tgargs,
 i_tgisconstraint,
 i_tgconstrname,
 i_tgconstrrelid,
 i_tgconstrrelname,
 i_tgenabled,
+i_tgisinternal,
 i_tgdeferrable,
 i_tginitdeferred,
 i_tgdef;
 	int			ntups;
 
 	for (i = 0; i < numTables; i++)
 	{
		TableInfo  *tbinfo = &tblinfo[i];
 
 		if (!tbinfo->hastriggers ||
 			!(tbinfo->dobj.dump & DUMP_COMPONENT_DEFINITION))
 			continue;
 
 		pg_log_info("reading triggers for table \"%s.%s\"",
 	tbinfo->dobj.namespace->dobj.name,
 	tbinfo->dobj.name);
 
 		resetPQExpBuffer(query);
 		if (fout->remoteVersion >= 9)
 		{
 			/*
 			 * NB: think not to use pretty=true in pg_get_triggerdef.  It
 			 * could result in non-forward-compatible dumps of WHEN clauses
 			 * due to under-parenthesization.
 			 */
 			appendPQExpBuffer(query,
-			  "SELECT tgname, "
-			  "tgfoid::pg_catalog.regproc AS tgfname, "
-			  "pg_catalog.pg_get_triggerdef(oid, false) AS tgdef, "
-			  "tgenabled, tableoid, oid "
+			  "SELECT t.tgname, "
+			  "t.tgfoid::pg_catalog.regproc AS tgfname, "
+			  "pg_catalog.pg_get_triggerdef(t.oid, false) AS tgdef, "
+			  "t.tgenabled, t.tableoid, t.oid, t.tgisinternal "
 			  "FROM pg_catalog.pg_trigger t "
-			  "WHERE tgrelid = '%u'::pg_catalog.oid "
-			  "AND NOT tgisinternal",
+			  "LEFT JOIN pg_catalog.pg_trigger u ON u.oid = t.tgparentid "
+			  "WHERE t.tgrelid = '%u'::pg_catalog.oid "
+			  "AND (NOT t.tgisinternal OR t.tgenabled != u.tgenabled)",
 			  tbinfo->dobj.catId.oid);
 		}
 		else if (fout->remoteVersion >= 80300)
 		{
 			/*
 			 * We ignore triggers that are tied to a foreign-key constraint
 			 */
 			appendPQExpBuffer(query,
 			  "SELECT tgname, "
 			  "tgfoid::pg_catalog.regproc AS tgfname, "
 			  "tgtype, tgnargs, tgargs, tgenabled, "
 			  "tgisconstraint, tgconstrname, tgdeferrable, "
 			  "tgconstrrelid, tginitdeferred, tableoid, oid, "
 			  "tgconstrrelid::pg_catalog.regclass AS 

Re: pg_upgrade: fail early if a tablespace dir already exists for new cluster version

2020-10-21 Thread Justin Pryzby
On Tue, Oct 20, 2020 at 09:17:22PM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > I wonder if pg_upgrade should try to rmdir() the tablespace dirs before
> > restoring global objects, allowing it to succeed, rather than just "failing
> > early".
> 
> I'm a little confused about that.  If the directories aren't empty,
> that will fail,

You mean rmdir() will fail, returning -1, which my patch will ignore, and the
pg_upgrade will fail, same as it would have before.  The goal of the patch is
to improve the "empty" case, only.

> but if they are, shouldn't the upgrade just work?

It fails in "Restoring global objects", which runs "CREATE TABLESPACE".
| errmsg("directory \"%s\" already in use as a tablespace",

I considered the possibility of changing that, but it seems like this is
specific to pg_upgrade.  I wouldn't want to change the core server just for
that, and it has a good reason for failing in that case:
| * The creation of the version directory prevents more than one tablespace
| * in a single location.
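
Concretely, a sketch of the rmdir() idea, not the actual patch (the helper and
variable names are made up); a failed rmdir() on a non-empty directory is simply
ignored, so CREATE TABLESPACE still fails later exactly as it does today:

static void
remove_new_tablespace_dir_if_empty(const char *tblspc_path,
                                   const char *new_version_suffix)
{
    char        path[MAXPGPATH];

    snprintf(path, sizeof(path), "%s/%s", tblspc_path, new_version_suffix);
    if (rmdir(path) != 0 && errno != ENOENT)
        pg_log(PG_VERBOSE, "could not remove \"%s\": %s (ignored)\n",
               path, strerror(errno));
}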

-- 
Justin




Re: Is Recovery actually paused?

2020-10-21 Thread Dilip Kumar
On Tue, Oct 20, 2020 at 5:59 PM Dilip Kumar  wrote:
>
> On Tue, Oct 20, 2020 at 3:00 PM Simon Riggs  wrote:
> >
> > On Tue, 20 Oct 2020 at 09:50, Dilip Kumar  wrote:
> >
> > > > > Why would we want this? What problem are you trying to solve?
> > > >
> > > > The requirement is to know the last replayed WAL on the standby so
> > > > unless we can guarantee that the recovery is actually paused we can
> > > > never get the safe last_replay_lsn value.
> > > >
> > > > > If we do care, why not fix pg_is_wal_replay_paused() so it responds 
> > > > > as you wish?
> > > >
> > > > Maybe we can also do that but pg_is_wal_replay_paused is an existing
> > > > API and the behavior is to know whether the recovery paused is
> > > > requested or not,  So I am not sure is it a good idea to change the
> > > > behavior of the existing API?
> > > >
> > >
> > > Attached is the POC patch to show what I have in mind.
> >
> > If you don't like it, I doubt anyone else cares for the exact current
> > behavior either. Thanks for pointing those issues out.
> >
> > It would make sense to alter pg_wal_replay_pause() so that it blocks
> > until paused.
> >
> > I suggest you add the 3-value state as you suggest, but make
> > pg_is_wal_replay_paused() respond:
> > if paused, true
> > if requested, wait until paused, then return true
> > else false
> >
> > That then solves your issues with a smoother interface.
> >
>
> Make sense to me, I will change as per the suggestion.

I have noticed one more issue: if the recovery process is currently not
processing any WAL and is just waiting for WAL to become available, then
pg_is_wal_replay_paused() will be stuck forever.  Having said that, the
same problem exists even with a new interface that checks whether recovery
is actually paused, because until the recovery process gets the next WAL
record it will not check whether a recovery pause has been requested, so
the actual recovery-paused flag will never be set.

One idea could be: if the recovery process is waiting for WAL and a
recovery pause is requested, then we can assume that recovery is paused,
because before processing the next WAL record it will always check whether
a recovery pause has been requested.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy
 wrote:
>
> 17. Remove extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h
>

 I missed one comment:

 18. I think we need to treat the number of parallel workers as an
integer similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');  < - we
should not allow this.
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR:  parallel requires an integer value


With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel copy

2020-10-21 Thread Amit Kapila
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy
 wrote:
>
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls?
>

If we want to avoid different function calls then can't we just store
all these strings in a local structure and use it? That might improve
the other parts of code as well where we are using these as individual
parameters.
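
A sketch of that idea (the struct and field names are illustrative, not from the
patch):

/* all the cstate strings that must travel to the workers, bundled together */
typedef struct SerializedCopyStrings
{
    char       *null_print;
    char       *delim;
    char       *quote;
    char       *escape;
    char       *filename;
} SerializedCopyStrings;

A single helper could then copy the whole bundle into shared memory, and a
single counterpart could restore all of the fields in each worker.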

> 10. Similar to above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save overhead of 5 function
> calls.
>

Yeah, that makes sense.

-- 
With Regards,
Amit Kapila.




Re: Online verification of checksums

2020-10-21 Thread Michael Paquier
On Wed, Oct 21, 2020 at 12:00:23PM +0200, Michael Banck wrote:
> The check was ported (or the concept of it adapted) from pgBackRest if I
> remember correctly.

Okay, I did not know that.

>> As things stand, I'd like to think that it would be much more useful
>> to remove this check and to have one or two extra retries (the current
>> code only has one).  I don't like much the possibility of false
>> positives for such critical checks, but as we need to live with what
>> has been released, that looks like a good move for stable branches.
> 
> Sounds good to me. I think some were advocating for locking the page
> before re-reading. When I looked at it, the level of abstraction that
> pg_basebackup has (just a list of files chopped up into blocks, no
> notion of relations I think) made that non-trivial, but maybe still
> possible for v14 and beyond.

That's a API layer I was looking at here:
https://www.postgresql.org/message-id/flat/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5q0y+wulml5sr...@mail.gmail.com

My guess is that we should be able to make use of that for base
backups as well, but I also think that I'd rather let v13 go with more
retries without depending on a new API layer, removing of the LSN
check altogether.  Thinking of it, that's actually roughly what I
posted here, but without the PageGetLSN() bit in the refactored code.
So I see a pretty good argument to address the stable branches with
that, and study for the future a better API to govern them all:
https://www.postgresql.org/message-id/20201020062432.ga30...@paquier.xyz
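
A sketch of the retry idea (verify_page_checksum() and reread_page() are
stand-ins, not existing functions): on a mismatch the block is simply read again
a couple of times before being counted as a failure, which is enough to skip
over a concurrent partial write without any LSN-based exception:

#define CHECKSUM_READ_RETRIES 3

static bool
page_checksum_ok_with_retries(int fd, off_t offset, char *buf, size_t len)
{
    for (int attempt = 0; attempt < CHECKSUM_READ_RETRIES; attempt++)
    {
        if (verify_page_checksum(buf, len))     /* stand-in */
            return true;
        if (attempt + 1 < CHECKSUM_READ_RETRIES)
            reread_page(fd, offset, buf, len);  /* stand-in */
    }
    return false;
}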
--
Michael


signature.asc
Description: PGP signature


Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-21 Thread Amit Kapila
On Wed, Oct 21, 2020 at 3:03 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Masahiko Sawada 
> > So what's your opinion?
>
> * Global visibility
> This is what Amit-san suggested some times -- "design it before reviewing the 
> current patch."  I'm a bit optimistic about this and think this FDW 2PC can 
> be implemented separately as a pure enhancement of FDW.  But I also 
> understand his concern.  If your (our?) aim is to use this FDW 2PC for 
> sharding,
>

As far as I understand, that is the goal for which this is a
step. For example, see the wiki [1]. I understand that the wiki is not the
final thing, but I have seen other places as well where there is a
mention of FDW-based sharding, and I feel this is the reason why many
people are trying to improve this area. That is why I suggested having
an upfront design of global visibility and a deadlock detector along
with this work.


[1] - https://wiki.postgresql.org/wiki/WIP_PostgreSQL_Sharding

-- 
With Regards,
Amit Kapila.




Re: Online verification of checksums

2020-10-21 Thread Michael Banck
Hi,

On Tuesday, 2020-10-20 at 18:11 +0900, Michael Paquier wrote:
> On Mon, Apr 06, 2020 at 04:45:44PM -0400, Tom Lane wrote:
> > Actually, after thinking about that a bit more: why is there an LSN-based
> > special condition at all?  It seems like it'd be far more useful to
> > checksum everything, and on failure try to re-read and re-verify the page
> > once or twice, so as to handle the corner case where we examine a page
> > that's in process of being overwritten.
> 
> I was reviewing this area today, and that actually matches my
> impression.  Why do we need a LSN-based check at all?  As said
> upthread, that's of course weak with random data as we would miss most
> of the real checksum failures, with odds getting better depending on
> the current LSN of the cluster moving on.  However, it seems to me
> that we would have an extra advantage in removing this check
> all together: it would be possible to check for pages even if these
> are more recent than the start LSN of the backup, and that could be a
> lot of pages that could be checked on a large cluster.  So by keeping
> this check we also delay the detection of real problems.

The check was ported (or the concept of it adapted) from pgBackRest if I
remember correctly.

> As things stand, I'd like to think that it would be much more useful
> to remove this check and to have one or two extra retries (the current
> code only has one).  I don't like much the possibility of false
> positives for such critical checks, but as we need to live with what
> has been released, that looks like a good move for stable branches.

Sounds good to me. I think some were advocating for locking the page
before re-reading. When I looked at it, the level of abstraction that
pg_basebackup has (just a list of files chopped up into blocks, no
notion of relations I think) made that non-trivial, but maybe still
possible for v14 and beyond.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz





Re: Parallel copy

2020-10-21 Thread Bharath Rupireddy
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo()
or PopulateCopyStateInfo()? And also EstimateCstateSize() --
EstimateCStateSize(), PopulateCstateCatalogInfo() --
PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the
comments, can we represent them in terms of macros?
/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024
/*
 * It can hold upto 10240 record information for worker to process. RINGSIZE

3. How about
"
Each worker at once will pick the WORKER_CHUNK_COUNT records from the
DSM data blocks and store them in it's local memory.
This is to make workers not contend much while getting record
information from the DSM. Read RINGSIZE comments before
 changing this value.
"
instead of
/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4.  How about one line gap before and after for comments: "Leader
should operate in the following order:" and "Worker should operate in
the following order:"

5. Can we move RAW_BUF_BYTES macro definition to the beginning of the
copy.h where all the macro are defined?

6. I don't think we need the change in toast_internals.c with the
temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed));
in GetCurrentCommandId()

7. I think
/* Can't perform copy in parallel */
if (parallel_workers <= 0)
return NULL;
can be
/* Can't perform copy in parallel */
if (parallel_workers == 0)
return NULL;
as parallel_workers can never be < 0 since we enter BeginParallelCopy
only if cstate->nworkers > 0 and also we are not allowed to have
negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any
parallel workers, we would have allocated a good
else
{
/*
 * Reset nworkers to -1 here. This is useful in cases where user
 * specifies parallel workers, but, no worker is picked up, so go
 * back to non parallel mode value of nworkers.
 */
cstate->nworkers = -1;
*processed = CopyFrom(cstate);/* copy from file to database */
}
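
i.e., roughly (sketch only; assuming pcdata is not needed once we fall back):
else
{
    /* assumption: nothing else references pcdata after this point */
    pfree(cstate->pcdata);
    cstate->pcdata = NULL;

    /*
     * Reset nworkers to -1 here. This is useful in cases where user
     * specifies parallel workers, but, no worker is picked up, so go
     * back to non parallel mode value of nworkers.
     */
    cstate->nworkers = -1;
    *processed = CopyFrom(cstate);/* copy from file to database */
}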

9. Instead of calling CopyStringToSharedMemory() for each string
variable, can't we just create a linked list of all the strings that
need to be copied into shm and call CopyStringToSharedMemory() only
once? That way we could avoid 5 function calls.

10. Similar to the above comment: can we fill all the required
cstate->variables inside the function CopyNodeFromSharedMemory() and
call it only once? In each worker we could save the overhead of 5
function calls.

11. Looks like CopyStringFromSharedMemory() and
CopyNodeFromSharedMemory() do almost the same things, except for the
stringToNode() and pfree(destptr) calls. Can we have a generic function
CopyFromSharedMemory() (or something similar) with a flag "bool isnode"
to differentiate the two use cases?
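
To be clear, I mean something roughly like this (a sketch only -- I haven't
checked how the patch's helpers actually allocate and copy, so the signature
and body are just a guess):

/* Sketch only: the real copying in the patch may differ. */
static void *
CopyFromSharedMemory(char *srcptr, bool isnode)
{
	Size		len = strlen(srcptr) + 1;
	char	   *destptr = (char *) palloc(len);

	memcpy(destptr, srcptr, len);

	if (isnode)
	{
		void	   *node = stringToNode(destptr);

		pfree(destptr);
		return node;
	}

	return destptr;
}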

12. Can we move below check to the end in IsParallelCopyAllowed()?
/* Check parallel safety of the trigger functions. */
if (cstate->rel->trigdesc != NULL &&
!CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update;, how about
placing that code directly inside the if block, as it's used only once?

14. GetWorkerLine(): How about avoiding the goto statements and replacing
the common code with an always-inline static function or a macro?

15. UpdateSharedLineInfo(): Below line is misaligned.
lineInfo->first_block = blk_pos;
lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the
start of the for (;;) loop?

17. Remove extra lines after #define IsHeaderLine()
(cstate->header_line && cstate->cur_lineno == 1) in copy.h

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
From: Masahiko Sawada 
> So what's your opinion?

My opinion is simple and has not changed.  Let's clarify and refine the design 
first in the following areas (others may have pointed out something else too, 
but I don't remember), before going deeper into the code review.

* FDW interface
New functions so that other FDWs can really implement them.  Currently, XA seems to 
be the only model we can rely on to validate the FDW interface.
What FDW function would call what XA function(s)?  What should be the arguments 
for the FDW functions?
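
For example, purely as an illustration (the callback names here are
placeholders, not necessarily what the patch uses; the xa_*() functions are
from the XA specification):

  prepare foreign transaction   -> xa_end() + xa_prepare()
  commit prepared transaction   -> xa_commit()
  rollback prepared transaction -> xa_rollback()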

* Performance
Parallel prepare and commit on the client backend.  The current implementation 
is intolerable and is not of first-release quality.  I proposed the 
idea.
(If you insist you don't want to do anything about this, I have to think you're 
just rushing for the patch commit.  I want to keep Postgres's reputation.)
As part of this, I'd like to see the 2PC's message flow and disk writes (via 
email and/or on the following wiki).  That helps evaluate the 2PC performance, 
because it's hard to figure it out in the code of a large patch set.  I'm 
simply imagining what is typically written in database textbooks and research 
papers.  I'm asking this because I saw some discussion in this thread that some 
new WAL records are added.  I was worried that transactions have to write WAL 
records other than prepare and commit unlike textbook implementations.

Atomic Commit of Distributed Transactions
https://wiki.postgresql.org/wiki/Atomic_Commit_of_Distributed_Transactions

* Query cancellation
As you showed, there's no problem with postgres_fdw?
The cancelability of FDW in general remains a problem, but that can be a 
separate undertaking.

* Global visibility
This is what Amit-san suggested some times -- "design it before reviewing the 
current patch."  I'm a bit optimistic about this and think this FDW 2PC can be 
implemented separately as a pure enhancement of FDW.  But I also understand his 
concern.  If your (our?) aim is to use this FDW 2PC for sharding, we may have 
to design the combination of 2PC and visibility first.



> I don’t think we need to stipulate the query cancellation. Anyway I
> guess the facts neither that we don’t stipulate anything about query
> cancellation now nor that postgres_fdw might not be cancellable in
> some situations now are not a reason for not supporting query
> cancellation. If it's a desirable behavior and users want it, we need
> to put an effort to support it as much as possible like we’ve done in
> postgres_fdw.  Some FDWs unfortunately might not be able to support it
> only by their functionality but it would be good if we can achieve
> that by combination of PostgreSQL and FDW plugins.

Let me comment on this a bit; this is a somewhat dangerous idea, I'm afraid.  We 
need to pay attention to the FDW interface and its documentation so that FDW 
developers can implement what we consider important -- query cancellation in 
your discussion.  "postgres_fdw is OK, so the interface is good" can create 
interfaces that other FDW developers can't use.  That's what Tomas Vondra 
pointed out several years ago.


Regards
Takayuki Tsunakawa



Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Amit Kapila
On Tue, Oct 20, 2020 at 9:46 AM Peter Smith  wrote:
>
>
> ==
> Patch v10-0002, File: src/backend/replication/logical/reorderbuffer.c
> ==
>
> COMMENT
> There are some parts of the code where in my v6 review I had a doubt
> about the mutually exclusive treatment of the "streaming" flag and the
> "rbtxn_prepared(txn)" state.
>

I am not sure about the exact specifics here but we can always prepare
a transaction that is streamed. I have to raise one more point in this
regard. Why do we need stream_commit_prepared_cb,
stream_rollback_prepared_cb callbacks? Do we need to do something
separate in pgoutput or otherwise for these APIs? If not, can't we use
a non-stream version of these APIs instead? There appears to be a
use-case for stream_prepare_cb which is to apply the existing changes
on subscriber and call prepare but I can't see usecase for the other
two APIs.

One minor comment:
v10-0001-Support-2PC-txn-base

1.
@@ -574,6 +655,11 @@ void ReorderBufferQueueMessage(ReorderBuffer *, TransactionId, Snapshot snapsho
 void ReorderBufferCommit(ReorderBuffer *, TransactionId,
  XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
  TimestampTz commit_time, RepOriginId origin_id, XLogRecPtr origin_lsn);
+void ReorderBufferFinishPrepared(ReorderBuffer *rb, TransactionId xid,
+   XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+   TimestampTz commit_time,
+   RepOriginId origin_id, XLogRecPtr origin_lsn,
+   char *gid, bool is_commit);
 void ReorderBufferAssignChild(ReorderBuffer *, TransactionId,
TransactionId, XLogRecPtr commit_lsn);
 void ReorderBufferCommitChild(ReorderBuffer *, TransactionId, TransactionId,
  XLogRecPtr commit_lsn, XLogRecPtr end_lsn);
@@ -597,6 +683,15 @@ void ReorderBufferXidSetCatalogChanges(ReorderBuffer *, TransactionId xid, XLog
 bool ReorderBufferXidHasCatalogChanges(ReorderBuffer *, TransactionId xid);
 bool ReorderBufferXidHasBaseSnapshot(ReorderBuffer *, TransactionId xid);

+bool ReorderBufferPrepareNeedSkip(ReorderBuffer *rb, TransactionId xid,
+ const char *gid);
+bool ReorderBufferTxnIsPrepared(ReorderBuffer *rb, TransactionId xid,
+const char *gid);
+void ReorderBufferPrepare(ReorderBuffer *rb, TransactionId xid,
+ XLogRecPtr commit_lsn, XLogRecPtr end_lsn,
+ TimestampTz commit_time,
+ RepOriginId origin_id, XLogRecPtr origin_lsn,
+ char *gid);
 ReorderBufferTXN *ReorderBufferGetOldestTXN(ReorderBuf

I don't think these changes belong to this patch as the definition of
these functions is not part of this patch.

-- 
With Regards,
Amit Kapila.




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread Kyotaro Horiguchi
At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda  
wrote in 
> On 2020-10-20 12:46, Amit Kapila wrote:
> > I see that we also need to add extra code to capture these stats (some
> > of which is in performance-critical path especially in
> > XLogInsertRecord) which again makes me a bit uncomfortable. It might
> > be that it is all fine as it is very important to collect these stats
> > at cluster-level in spite that the same information can be gathered at
> > statement-level to help customers but I don't see a very strong case
> > for that in your proposal.

We should avoid that duplication as much as possible, even if both numbers
are important.

> Also about performance, I thought there are few impacts because it
> increments stats in memory. If I can implement to reuse pgWalUsage's
> value which already collects these stats, there is no impact in
> XLogInsertRecord.
> For example, how about pg_stat_wal() calculates the accumulated
> value of wal_records, wal_fpi, and wal_bytes to use pgWalUsage's
> value?

I don't think that works as-is, but it would work if pgstat_send_wal()
took the difference of those values between two successive calls,
something like this:

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_fpi     = pgWalUsage.wal_fpi     - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;
}


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Amit Kapila
On Wed, Oct 21, 2020 at 1:38 PM Peter Smith  wrote:
>
> The PG docs for PREPARE TRANSACTION [1] don't say anything about an
> empty (zero length) transaction-id.
> e.g. PREPARE TRANSACTION '';
> [1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html
>
> ~
>
> Meanwhile, during testing I found the 2PC prepare hangs when an empty
> id is used.
>

Can you please give an example to explain what you are trying to say?
I have tried the below and don't face any problem:

postgres=# Begin;
BEGIN
postgres=*# select txid_current();
 txid_current
--------------
          534
(1 row)
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION
postgres=# Commit Prepared 'foo';
COMMIT PREPARED
postgres=# Begin;
BEGIN
postgres=*# Prepare Transaction 'foo';
PREPARE TRANSACTION
postgres=# Commit Prepared 'foo';
COMMIT PREPARED

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Amit Kapila
On Tue, Oct 20, 2020 at 4:32 PM Ajin Cherian  wrote:
>
> On Fri, Oct 16, 2020 at 5:21 PM Peter Smith  wrote:
> >
>
> Comments:
>
> src/backend/replication/logical/worker.c
> @@ -888,6 +888,319 @@ apply_handle_prepare(StringInfo s)
> + /*
> + * FIXME - Following condition was in apply_handle_prepare_txn except
> I found  it was ALWAYS IsTransactionState() == false
> + * The synchronization worker runs in single transaction. *
> + if (IsTransactionState() && !am_tablesync_worker())
> + */
> + if (!am_tablesync_worker())
>
> Comment: I dont think a tablesync worker will use streaming, none of
> the other stream APIs check this, this might not be relevant for
> stream_prepare either.
>

Yes, I think this is right. See pgoutput_startup where we are
disabling the streaming for the init phase. But it is always good to
test this once and ensure it behaves that way.

>
> + /*
> + * 
> ==
> + * The following chunk of code is largely cut/paste from the existing
> apply_handle_prepare_commit_txn
>
> Comment: Here, I think you meant apply_handle_stream_commit. Also
> rather than duplicating this chunk of code, you could put it in a new
> function.
>
> + /* open the spool file for the committed transaction */
> + changes_filename(path, MyLogicalRepWorker->subid, xid);
>
> Comment: Here the comment should read "committed/prepared" rather than
> "committed"
>
>
> + else
> + {
> + /* Process any invalidation messages that might have accumulated. */
> + AcceptInvalidationMessages();
> + maybe_reread_subscription();
> + }
>
> Comment: This else block might not be necessary as a tablesync worker
> will not initiate the streaming APIs.
>

I think it is better to have an Assert here for streaming-mode?

> + BeginTransactionBlock();
> + CommitTransactionCommand();
> + StartTransactionCommand();
>
> Comment: Rereading the code and the transaction state description in
> src/backend/access/transam/README. I am not entirely sure if the
> BeginTransactionBlock followed by CommitTransactionBlock is really
> needed here.
>

Yeah, I also find this strange. I guess the patch is doing so because
it needs to call PrepareTransactionBlock later, but I am not sure. How
can we call CommitTransactionCommand()? Won't it commit the ongoing
transaction and make it visible even before it is visible on the
publisher? I think you can verify by having a breakpoint after
CommitTransactionCommand() and seeing whether the changes for which we
are doing the prepare have already become visible.

> I understand this code was copied over from apply_handle_prepare_txn,
> but now looking back I'm not so sure if it is correct. The transaction
> would have already begin as part of applying the changes, why begin it
> again?
> Maybe Amit could confirm this.
>

I hope the above suggestions will help to proceed here.

-- 
With Regards,
Amit Kapila.




Re: Parallel copy

2020-10-21 Thread vignesh C
On Thu, Oct 8, 2020 at 11:15 AM vignesh C  wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.

It is tested and the results are shared by Bharath at [1]

> 2) Testing points suggested.

Tests are added as suggested and the details are shared by Bharath at [1]

> 3) Support of parallel copy for COPY_OLD_FE.

It is handled as part of v8 patch shared at [2]

> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.

Open

> 5) Handling of Tomas's comments.

I have fixed and updated the fix details as part of [3]

> 6) Handling of Greg's comments.

I have fixed and updated the fix details as part of [4]

Except for "4) Worker has to hop through all the processed chunks before
getting the chunk which it can process", all open tasks are handled. I will
work on this and provide an update shortly.

[1]
https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/caldanm2ucmcmozcbkl8b7az9oyd9hz+fndczhssiiqj4v-x...@mail.gmail.com
[3]
https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com
[4]
https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


RE: Transactions involving multiple postgres foreign servers, take 2

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
From: Kyotaro Horiguchi 
> > if (PQstatus(entry->conn) != CONNECTION_OK ||
> > PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> > entry->changing_xact_state)
> > {
> > elog(DEBUG3, "discarding connection %p", entry->conn);
> > disconnect_pg_server(entry);
> > }
> 
> Right.  Although it's not directly relevant to this discussion,
> precisely, that part is not visited just after the remote "COMMIT
> TRANSACTION" failed. If that commit fails or is canceled, an exception
> is raised while entry->changing_xact_state = true. Then the function
> is called again within AbortCurrentTransaction() and reaches the above
> code.

Ah, then the connection to the foreign server is closed after failing to cancel 
the query.  Thanks.


Regards
Takayuki Tsunakawa





Re: [HACKERS] logical decoding of two-phase transactions

2020-10-21 Thread Peter Smith
The PG docs for PREPARE TRANSACTION [1] don't say anything about an
empty (zero length) transaction-id.
e.g. PREPARE TRANSACTION '';
[1] https://www.postgresql.org/docs/current/sql-prepare-transaction.html

~

Meanwhile, during testing I found the 2PC prepare hangs when an empty
id is used.

Now I am not sure whether this represents some bug within the 2PC code,
or whether in fact PREPARE should never have allowed an empty
transaction-id to be specified in the first place?

Thoughts?

Kind Regards
Peter Smith.
Fujitsu Australia.




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-10-21 Thread tsunakawa.ta...@fujitsu.com
RelationTruncate() invalidates the cached fork sizes as follows.  This causes 
smgrnblocks() to return accurate=false, so the optimization is not run.  
Try commenting it out for the non-recovery case (a sketch follows the quoted code).

/*
 * Make sure smgr_targblock etc aren't pointing somewhere past new end
 */
rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
for (int i = 0; i <= MAX_FORKNUM; ++i)
rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
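
That is, for a quick experiment, something like this (untested sketch):

	rel->rd_smgr->smgr_targblock = InvalidBlockNumber;
#if 0	/* experiment: keep the cached sizes so smgrnblocks() stays accurate */
	for (int i = 0; i <= MAX_FORKNUM; ++i)
		rel->rd_smgr->smgr_cached_nblocks[i] = InvalidBlockNumber;
#endif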


Regards
Takayuki Tsunakawa





Re: Transactions involving multiple postgres foreign servers, take 2

2020-10-21 Thread Kyotaro Horiguchi
At Tue, 20 Oct 2020 21:22:31 +0900, Masahiko Sawada 
 wrote in 
> On Tue, 20 Oct 2020 at 17:56, tsunakawa.ta...@fujitsu.com
>  wrote:
> >
> > From: Kyotaro Horiguchi 
> > > It seems to respond to a statement-cancel signal immediately while
> > > waiting for a coming byte.  However, seems to wait forever while
> > > waiting a space in send-buffer. (Is that mean the session will be
> > > stuck if it sends a large chunk of bytes while the network is down?)
> >
> > What part makes you worried about that?  libpq's send processing?
> >
> > I've just examined pgfdw_cancel_query(), too.  As below, it uses a hidden 
> > 30 second timeout.  After all, postgres_fdw also relies on timeout already.
> 
> It uses the timeout but it's also cancellable before the timeout. See
> we call CHECK_FOR_INTERRUPTS() in pgfdw_get_cleanup_result().

Yes. And as Sawada-san mentioned, it doesn't matter whether a specific FDW
module accepts cancellation or not. It's sufficient that we have one
example; other FDWs will follow postgres_fdw if needed.

> > > After receiving a signal, it closes the problem connection. So the
> > > local session is usable after that but the fiailed remote sessions are
> > > closed and created another one at the next use.
> >
> > I couldn't see that the problematic connection is closed when the 
> > cancellation fails... Am I looking at a wrong place?
...
> 
> I guess Horiguchi-san refereed the following code in pgfdw_xact_callback():
> 
> /*
>  * If the connection isn't in a good idle state, discard it to
>  * recover. Next GetConnection will open a new connection.
>  */
> if (PQstatus(entry->conn) != CONNECTION_OK ||
> PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
> entry->changing_xact_state)
> {
> elog(DEBUG3, "discarding connection %p", entry->conn);
> disconnect_pg_server(entry);
> }

Right.  Although it's not directly relevant to this discussion,
precisely, that part is not visited just after the remote "COMMIT
TRANSACTION" failed. If that commit fails or is canceled, an exception
is raised while entry->changing_xact_state = true. Then the function
is called again within AbortCurrentTransaction() and reaches the above
code.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-10-21 Thread lchch1...@sina.cn

I think it's really a more convenient way to collect WAL usage information;
with it we can query whenever we want. Several points on my side:

1. It would be nice if you provided a way to reset the information in WalStats,
so that we can reset it without restarting the database (see the example after
point 5).

2. I think 'wal_write_backend' means the number of WAL writes that happen in
backends. The description in the documentation is not so clear; I suggest
rewriting it.

3. I do not think the documentation describes 'wal_buffers_full' correctly.

4. It is quite strange to collect xl_tot_len, m_wal_records, and m_wal_fpi
twice in XLogInsertRecord().

5. I notice some code in the issue_xlog_fsync() function that collects sync
info. A standby may call issue_xlog_fsync() too, where we do not want to
collect this info. I think this needs some change to avoid running on the
standby side.
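
For point 1, I am thinking of something along the lines of the existing
pg_stat_reset_shared() machinery, e.g. (hypothetical; such a target does not
exist today):

SELECT pg_stat_reset_shared('wal');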



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: Error in pg_restore (could not close data file: Success)

2020-10-21 Thread Kyotaro Horiguchi
At Wed, 21 Oct 2020 13:45:15 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> https://www.postgresql.org/message-id/flat/20200416.181945.759179589924840062.horikyota.ntt%40gmail.com#ed85c5fda64873c45811be4e3027a2ea
> 
> Me> Hmm. Sounds reasonable.  I'm going to do that.  Thanks!
> 
> But somehow that haven't happened, I'll come up with a new version.

pg_restore shows the following error instead of "Success" for a broken
compressed file.

pg_restore: error: could not close data file "d/3149.dat": zlib error: error 
reading or writing compressed file


0001:

cfclose() calls fatal() instead of returning the result to the callers
on error, which is obviously okay for all existing callers that are
handling errors from the function. Other callers ignored the returned
value, but we should fatal() on error of the function there as well.

At least for me, gzerror doesn't return a message (specifically, it
returns NULL) after a gzclose failure, so currently cfclose shows its own
messages for errors of gzclose().  Am I missing something?

0002:

cfread has the same code as get_cfp_error(), and cfgetc uses
strerror() after gzgetc(). It would be suitable for a separate patch,
but 0002 fixes those bugs.  I changed _EndBlob() to show the cause of
an error.

Did not do in this patch:

We could do further small refactoring to remove temporary variables in
pg_backup_directory.c for _StartData(), InitArchiveFmt_Directory,
_LoadBlobs(), _StartBlobs() and _CloseArchive(), but I left them as is
for the ease of back-patching.

Now that we have the file name in the context variable, we could
show the file name in all error messages, but that change would be large
and one part of it is a bit more complex, so I didn't
do that.


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 15190e795dbc4a6245d96d37ee561624769ab7ae Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 21 Oct 2020 11:50:02 +0900
Subject: [PATCH 1/2] Fix error handling for file close of pg_dump

Errors from cfclose are handled as if it were fclose, but the function
may return errors from gzclose, so the callers show incorrect error
messages like "could not close: Success". Fix that behavior by moving
the error handling code into cfclose.
---
 src/bin/pg_dump/compress_io.c | 38 +++
 src/bin/pg_dump/compress_io.h |  2 +-
 src/bin/pg_dump/pg_backup_directory.c | 37 +-
 3 files changed, 59 insertions(+), 18 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 1417401086..28d45fd299 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -645,31 +645,59 @@ cfgets(cfp *fp, char *buf, int len)
 		return fgets(buf, len, fp->uncompressedfp);
 }
 
-int
-cfclose(cfp *fp)
+/*
+ * cfclose closes the stream
+ *
+ * Returns only if the cfp was successfully closed.  Ends the process with fatal() on
+ * error. filedesc is a label that identifies the file type and filename is the
+ * name of the file, both of which are used for error reporting.
+ */
+void
+cfclose(cfp *fp, const char *filedesc, const char *filename)
 {
 	int			result;
 
 	if (fp == NULL)
 	{
 		errno = EBADF;
-		return EOF;
+
+		fatal("could not close %s \"%s\": %m", filedesc, filename);
 	}
 #ifdef HAVE_LIBZ
 	if (fp->compressedfp)
 	{
 		result = gzclose(fp->compressedfp);
 		fp->compressedfp = NULL;
+
+		switch (result)
+		{
+			case Z_OK:
+				break;
+
+			case Z_STREAM_ERROR:
+				fatal("could not close %s \"%s\": zlib error: invalid parameter",
+					  filedesc, filename);
+			case Z_ERRNO:
+				fatal("could not close %s \"%s\": %m", filedesc, filename);
+			case Z_BUF_ERROR:
+				fatal("could not close %s \"%s\": zlib error: error reading or writing compressed file",
+					  filedesc, filename);
+			default:
+				fatal("could not close %s \"%s\": zlib error: unknown gzclose error: %d",
+					  filedesc, filename, result);
+		}
 	}
 	else
 #endif
 	{
 		result = fclose(fp->uncompressedfp);
 		fp->uncompressedfp = NULL;
+
+		if (result != 0)
+			fatal("could not close %s \"%s\": %m", filedesc, filename);
 	}
+
 	free_keep_errno(fp);
-
-	return result;
 }
 
 int
diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index d2e6e1b854..97986636c7 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -63,7 +63,7 @@ extern int	cfread(void *ptr, int size, cfp *fp);
 extern int	cfwrite(const void *ptr, int size, cfp *fp);
 extern int	cfgetc(cfp *fp);
 extern char *cfgets(cfp *fp, char *buf, int len);
-extern int	cfclose(cfp *fp);
+extern void	cfclose(cfp *fp, const char *filedesc, const char *filename);
 extern int	cfeof(cfp *fp);
 extern const char *get_cfp_error(cfp *fp);
 
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 48fa7cb1a3..4b1cf4a5a3 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -51,8 +51,13 @@ typedef struct
 	char	   

RE: Use of "long" in incremental sort code

2020-10-21 Thread Tang, Haiying
Hi

>Found one more place needed to be changed(long -> int64).
>
>Also changed the output for int64 data(Debug mode on & define EXEC_SORTDEBUG )
>
>And, maybe there's a typo in " src\backend\executor\nodeIncrementalSort.c" as 
>below.
>Obviously, the ">=" is meaningless, right?
>
>-  SO1_printf("Sorting presorted prefix tuplesort with >= %ld 
>tuples\n", nTuples);
>+  SO1_printf("Sorting presorted prefix tuplesort with %ld 
>tuples\n", nTuples);
>
>Please take a check at the attached patch file.

I have added it to commit fest.
https://commitfest.postgresql.org/30/2772/
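
(For what it's worth, if printing int64 portably is also a concern for those
SO1_printf calls, core code usually goes through the INT64_FORMAT macro, e.g.
something like:

SO1_printf("Sorting presorted prefix tuplesort with " INT64_FORMAT " tuples\n", nTuples);

but that may be out of scope for this patch.)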

Best regards
Tang

-Original Message-
From: Tang, Haiying  
Sent: Monday, October 19, 2020 12:57 PM
To: David Rowley ; James Coleman 
Cc: pgsql-hack...@postgresql.org
Subject: RE: Use of "long" in incremental sort code

Hi

Found one more place needed to be changed(long -> int64).

Also changed the output for int64 data(Debug mode on & define EXEC_SORTDEBUG )

And, maybe there's a typo in " src\backend\executor\nodeIncrementalSort.c" as 
below.
Obviously, the ">=" is meaningless, right?

-   SO1_printf("Sorting presorted prefix tuplesort with >= %ld 
tuples\n", nTuples);
+   SO1_printf("Sorting presorted prefix tuplesort with %ld 
tuples\n", nTuples);

Please take a check at the attached patch file.

Previous disscution:
https://www.postgresql.org/message-id/CAApHDvpky%2BUhof8mryPf5i%3D6e6fib2dxHqBrhp0Qhu0NeBhLJw%40mail.gmail.com

Best regards
Tang