Re: allow online change primary_conninfo

2020-03-26 Thread Michael Paquier
On Thu, Mar 26, 2020 at 09:39:17PM -0300, Alvaro Herrera wrote:
> Now, would anyone be too upset if I push these in this other order?  I
> realized that the reason the tests broke after Sergei's patch is that
> recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
> walreceiver slots, since it's using the non-temp name it tries to give
> to the slot, rather than the temp name under which it is actually
> created.  The workaround proposed by 0002 is to edit standby_1's config
> to set walreceiver's slot to be non-temp.

FWIW, I find it a bit strange that the change introduced in
001_stream_rep.pl as of patch 0002 is basically undone in 0003 by
changing the default value of wal_receiver_create_temp_slot.

> The reason is that I think I would like to get Sergei's patch pushed
> right away (0001+0002, as a single commit); but that I think there's
> more to attack in the walreceiver temp slot code than 0003 here, and I
> don't want to delay the new feature any longer while I figure out the
> fix for that.

Not sure I agree with this approach.  I'd rather fix all the existing
known problems first, and then introduce the new features on top of
what we consider to be a clean state.  If we lack time between the
first and second patches, we risk keeping the code with the new
feature but without the fixes discussed for
wal_receiver_create_temp_slot.

> (The thing is: if I specify primary_slot_name in the config, why is the
> temp walreceiver slot code not obeying that name?  I think walreceiver
> should create a temp slot, sure, but using the given name rather than
> coming up with a random name.)

Good point.  I am not sure either why the current logic has been
chosen.  The discussion related to the original patch is in this area:
https://www.postgresql.org/message-id/4792e456-d75f-8e6a-2d47-34b8f78c2...@2ndquadrant.com

> (The other reason is that I would like to push one patch to fix
> walreceiver tmp slot rather than two, setting the thing first this way
> and then the opposite way.)

So your problem here is that by applying first 0003 and then 0001-0002
you would finish by switching wal_receiver_create_temp_slot to
PGC_POSTMASTER, and then back to PGC_SIGHUP again?
--
Michael




Re: backup manifests

2020-03-26 Thread Andres Freund
Hi,

On 2020-03-26 15:37:11 -0400, Stephen Frost wrote:
> The argument is that adding checksums takes more time.  I can understand
> that argument, though I don't really agree with it.  Certainly a few
> percent really shouldn't be that big of an issue, and in many cases even
> a sha256 hash isn't going to have that dramatic of an impact on the
> actual overall time.

I don't understand how you can come to that conclusion?  It doesn't take
very long to measure openssl's sha256 performance (which is pretty well
optimized). Note that we do use openssl's sha256, when compiled with
openssl support.

On my workstation, with a pretty new (but not fastest single core perf
model) intel Xeon Gold 5215, I get:

$ openssl speed sha256
...
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           76711.75k   172036.78k  321566.89k  399008.09k   431423.49k  433689.94k

IOW, ~430MB/s.


On my laptop, with pretty fast cores:
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256           97054.91k   217188.63k  394864.13k  493441.02k   532100.44k  533441.19k

IOW, 530MB/s


530 MB/s is well within the realm of medium sized VMs.
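
For anyone who wants to reproduce this kind of number without "openssl
speed", here is a minimal standalone sketch using OpenSSL's EVP API (the
1 GiB total and 8 kB chunk size are arbitrary choices, not anything from
pg_basebackup):

    /* cc sha256bench.c -lcrypto */
    #include <openssl/evp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const size_t chunk = 8192;
        const size_t total = (size_t) 1 << 30;      /* 1 GiB of zeroes */
        unsigned char *buf = calloc(1, chunk);
        unsigned char md[EVP_MAX_MD_SIZE];
        unsigned int mdlen;
        EVP_MD_CTX *ctx = EVP_MD_CTX_new();
        clock_t start = clock();

        EVP_DigestInit_ex(ctx, EVP_sha256(), NULL);
        for (size_t done = 0; done < total; done += chunk)
            EVP_DigestUpdate(ctx, buf, chunk);
        EVP_DigestFinal_ex(ctx, md, &mdlen);

        printf("%.0f MB/s\n",
               (total / 1e6) / ((double) (clock() - start) / CLOCKS_PER_SEC));
        EVP_MD_CTX_free(ctx);
        free(buf);
        return 0;
    }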

And, as mentioned before, even if you do only half of that, you're still
going to be spending roughly half of the CPU time of sending a base
backup on hashing alone.

What makes you think that a few hundred MB/s is out of reach for a large
fraction of PG installations that actually keep backups?

Greetings,

Andres Freund




Re: backup manifests

2020-03-26 Thread Andres Freund
Hi,

On 2020-03-26 14:02:29 -0400, Robert Haas wrote:
> On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost  wrote:
> > Why was crc32c
> > picked for that purpose?
> 
> Because it was discovered that 64-bit CRC was too slow, per commit
> 21fda22ec46deb7734f793ef4d7fa6c226b4c78e.

Well, a 32bit crc, not crc32c. IIRC it was the ethernet polynomial (+
bug). We switched to crc32c at some point because there are hardware
implementations:

commit 5028f22f6eb0579890689655285a4778b4ffc460
Author: Heikki Linnakangas 
Date:   2014-11-04 11:35:15 +0200

Switch to CRC-32C in WAL and other places.
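
(For reference, the CRC-32C API that benefits from those hardware
implementations lives in src/include/port/pg_crc32c.h; a minimal usage
sketch, with a made-up helper name:

    #include "c.h"
    #include "port/pg_crc32c.h"

    static uint32
    checksum_buffer(const char *data, size_t len)
    {
        pg_crc32c   crc;

        INIT_CRC32C(crc);
        COMP_CRC32C(crc, data, len);
        FIN_CRC32C(crc);
        return crc;
    }

The INIT/COMP/FIN macros dispatch to an SSE4.2 or ARMv8 implementation
when available, falling back to the slicing-by-8 software version.)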


> Like, suppose we change the default from CRC-32C to SHA-something. On
> the upside, the error detection rate will increase from 99.999+%
> to something much closer to 100%.

FWIW, I don't buy the relevancy of 99.999+% at all. That figure assumes
a single-bit error (or, at relevant lengths, a single burst error of
bounded length), which isn't that relevant for our purposes.

That's not to say that I don't think a CRC check can provide value. It
does provide a high likelihood of detecting enough errors, including
coding errors in how data is restored (not unimportant), that you would
otherwise be unlikely to find out about soon.


> On the downside,
> backups will get as much as 40-50% slower for some users. I hope we
> can agree that both detecting errors and taking backups quickly are
> important. However, it is hard for me to imagine that the typical user
> would want to pay even a 5-10% performance penalty when taking a
> backup in order to improve an error detection feature which they may
> not even use and which already has less than a one-in-a-billion chance
> of going wrong.

FWIW, that seems far too large a slowdown to default to, for me. Most
people aren't going to be able to figure out that it's the checksum
parameter that causes this slowdown; they're just going to feel the pain
of the backup being much slower than their hardware allows.

A few hundred megabytes of streaming reads/writes really doesn't take a
beefy server these days. Medium sized VMs + a bit larger network block
devices at all the common cloud providers have considerably higher
bandwidth. Even a raid5 of 4 spinning disks can deliver > 500MB/s.

And plenty of even the smaller instances at many providers have >
5gbit/s network. At the upper end it's way more than that.

Greetings,

Andres Freund




Re: Race condition in SyncRepGetSyncStandbysPriority

2020-03-26 Thread Fujii Masao




On 2020/03/27 10:26, Tom Lane wrote:

Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.


When I search the past discussions, I found that Noah Misch reported
the same issue.
https://www.postgresql.org/message-id/20200206074552.gb3326...@rfd.leadboat.com


What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing.  But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?


+1 to rewrite the function with better algorithm.


(Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.


As far as I read the code, that assumption seems still valid. But the problem
is that each walsender updates MyWalSnd->sync_standby_priority at its own
convenient time after SIGHUP is signaled. That is, at a certain moment,
some walsenders (and their WalSnd entries in shmem) work based on
the latest configuration while the others (and their WalSnd entries) work
based on the old one.

lowest_priority = SyncRepConfig->nmembers;
next_highest_priority = lowest_priority + 1;

SyncRepGetSyncStandbysPriority() calculates the lowest priority among
all running walsenders as above, using the configuration info that
this walsender is based on. But this calculated lowest priority would be
invalid if another walsender is based on a different (e.g., old)
configuration. That can cause the (other) walsender to have a lower
priority than the calculated lowest priority, and the second loop in
SyncRepGetSyncStandbysPriority() to end unexpectedly.

Therefore, the band-aid fix seems to be to set the lowest priority to a
very large number at the beginning of SyncRepGetSyncStandbysPriority().
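
Something like this, an untested sketch of that band-aid (the exact
ceiling value is an arbitrary choice):

    /*
     * Don't derive the floor from this walsender's own SyncRepConfig,
     * which may be stale relative to what other walsenders have already
     * reloaded; start from a ceiling no real priority can exceed.
     */
    lowest_priority = INT_MAX - 1;      /* was: SyncRepConfig->nmembers */
    next_highest_priority = lowest_priority + 1;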

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: error context for vacuum to include block number

2020-03-26 Thread Justin Pryzby
On Fri, Mar 27, 2020 at 09:49:29AM +0530, Amit Kapila wrote:
> On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby  wrote:
> >
> > > Hm, I was just wondering what happens if an error happens *during*
> > > update_vacuum_error_cbarg().  It seems like if we set
> > > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname,
> > > then an error would cause a crash.
> > >
> 
> Can't that be avoided if you check if cbarg->indname is non-null in
> vacuum_error_callback as we are already doing for
> VACUUM_ERRCB_PHASE_TRUNCATE?
> 
> > >  And if we pfree and set indname before phase, it'd
> > > be a problem when going from an index phase to non-index phase.
> 
> How is it possible that we move to the non-index phase without
> clearing indname as we always revert back the old phase information?

The crash scenario I'm trying to avoid would be a statement_timeout or other
asynchronous event occurring between two non-atomic operations.

I said that there's an issue no matter what order we set indname/phase.
If we wrote:
|cbarg->indname = indname;
|cbarg->phase = phase;
..and hit a timeout (or similar) after setting indname=NULL but before
updating phase away from an index phase, then we can crash due to a null
pointer dereference.

But if we write:
|cbarg->phase = phase;
|if (cbarg->indname) {pfree(cbarg->indname);}
|cbarg->indname = indname ? pstrdup(indname) : NULL;
..then we can still crash if we time out between freeing cbarg->indname and
setting it to null, due to accessing a pfreed allocation.

> > >  So maybe we
> > > have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the
> > > function, and errcbarg->phase=phase last.
> 
> I find that a bit ad-hoc, if possible, let's try to avoid it.

I think we can do what you're suggesting, if the callback checks
(cbarg->indname != NULL).

We'd have to write:
// Must set indname *before* updating phase, in case an error occurs before
// phase is set, to avoid crashing if we're going from an index phase to a
// non-index phase (which should not read indname).  Must not free indname
// until it's set to null.
char *tmp = cbarg->indname;
cbarg->indname = indname ? pstrdup(indname) : NULL;
cbarg->phase = phase;
if (tmp){pfree(tmp);}

Do you think that's better ?
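
For completeness, the callback-side guard would look roughly like this (a
sketch, not the exact patch code; the field names follow the LVRelStats
struct in the patch):

    static void
    vacuum_error_callback(void *arg)
    {
        LVRelStats *cbarg = arg;

        switch (cbarg->phase)
        {
            case VACUUM_ERRCB_PHASE_VACUUM_INDEX:
                /* indname may be momentarily NULL while being updated */
                if (cbarg->indname)
                    errcontext("while vacuuming index \"%s\" of relation \"%s.%s\"",
                               cbarg->indname, cbarg->relnamespace, cbarg->relname);
                break;
            default:
                break;
        }
    }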

-- 
Justin




Re: backup manifests

2020-03-26 Thread Andres Freund
Hi,

On 2020-03-26 11:37:48 -0400, Robert Haas wrote:
> I mean, you're just repeating the same argument here, and it's just
> not valid. Regardless of the file size, the chances of a false
> checksum match are literally less than one in a billion. There is
> every reason to believe that users will be happy with a low-overhead
> method that has a 99.999+% chance of detecting corrupt files. I do
> agree that a 64-bit CRC would probably be not much more expensive and
> improve the probability of detecting errors even further

I *seriously* doubt that it's true that 64bit CRCs wouldn't be
slower. The only reason CRC32C is semi-fast is that we're accelerating
it using hardware instructions (on x86-64 and ARM at least). Before that
it was very regularly the bottleneck for processing WAL - and it still
sometimes is. Most CRCs aren't actually very fast to compute, because
they don't lend themselves to benefit from ILP or SIMD.  We spent a fair
bit of time optimizing our crc implementation before the hardware
support was widespread.


> but I wanted to restrict this patch to using infrastructure we already
> have. The choices there are the various SHA functions (so I supported
> those), MD5 (which I deliberately omitted, for reasons I hope you'll
> be the first to agree with), CRC-32C (which is fast), a couple of
> other CRC-32 variants (which I omitted because they seemed redundant
> and one of them only ever existed in PostgreSQL because of a coding
> mistake), and the hacked-up version of FNV that we use for page-level
> checksums (which is only 16 bits and seems to have no advantages for
> this purpose).

FWIW, FNV is only 16bit because we reduce its size to 16 bit. See the
tail of pg_checksum_page.
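
(From memory, the tail in question folds the 32-bit FNV result into the
16-bit page header field, avoiding zero so that 0 can mean "no checksum";
roughly:

    return (uint16) ((checksum % 65535) + 1);

so the narrowing is deliberate, not a property of the FNV variant itself.)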


I'm not sure the error detection guarantees of various CRC algorithms
are that relevant here, btw. IMO, for something like checksums in a
backup, a single one-bit error isn't as common as larger errors
(e.g. entire blocks being zeroed). And for detecting those, 32bit
checksums aren't that good.


> > As for folks who are that close to the edge on their backup timing that
> > they can't have it slow down- chances are pretty darn good that they're
> > not far from ending up needing to find a better solution than
> > pg_basebackup anyway.  Or they don't need to generate a manifest (or, I
> > suppose, they could have one but not have checksums..).
> 
> 40-50% is a lot more than "if you were on the edge."

sha256 does approximately 400MB/s per core on modern intel CPUs. That's
way below commonly accessible storage / network capabilities (and even
if you're only doing 200MB/s, you're still going to spend roughly half
of the CPU time just doing hashing).  It's unlikely that you're going to
see much speedup for sha256 just by upgrading a CPU. While there are
hardware instructions available, they don't result in all that large
improvements. Of course, we could also start using the GPU (err, really
no).

Defaulting to that makes very little sense to me. You're not just going
to spend that time while backing up, but also when validating backups
(i.e. network limits suddenly aren't a relevant bottleneck anymore).


> > I fail to see the usefulness of a tool that doesn't actually verify that
> > the backup is able to be restored from.
> >
> > Even pg_basebackup (in both fetch and stream modes...) checks that we at
> > least got all the WAL that's needed for the backup from the server
> > before considering the backup to be valid and telling the user that
> > there was a successful backup.  With what you're proposing here, we
> > could have someone do a pg_basebackup, get back an ERROR saying the
> > backup wasn't valid, and then run pg_validatebackup and be told that the
> > backup is valid.  I don't get how that's sensible.
> 
> I'm sorry that you can't see how that's sensible, but it doesn't mean
> that it isn't sensible. It is totally unrealistic to expect that any
> backup verification tool can verify that you won't get an error when
> trying to use the backup. That would require that everything that the
> validation tool try to do everything that PostgreSQL will try to do
> when the backup is used, including running recovery and updating the
> data files. Anything less than that creates a real possibility that
> the backup will verify good but fail when used. This tool has a much
> narrower purpose, which is to try to verify that we (still) have the
> files the server sent as part of the backup and that, to the best of
> our ability to detect such things, they have not been modified. As you
> know, or should know, the WAL files are not sent as part of the
> backup, and so are not verified. Other things that would also be
> useful to check are also not verified. It would be fantastic to have
> more verification tools in the future, but it is difficult to see why
> anyone would bother trying if an attempt to get the first one
> committed gets blocked because it does not yet do everything. Very few
> 

Re: error context for vacuum to include block number

2020-03-26 Thread Amit Kapila
On Fri, Mar 27, 2020 at 3:47 AM Justin Pryzby  wrote:
>
>
> > Hm, I was just wondering what happens if an error happens *during*
> > update_vacuum_error_cbarg().  It seems like if we set
> > errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname,
> > then an error would cause a crash.
> >

Can't that be avoided if you check if cbarg->indname is non-null in
vacuum_error_callback as we are already doing for
VACUUM_ERRCB_PHASE_TRUNCATE?

> >  And if we pfree and set indname before phase, it'd
> > be a problem when going from an index phase to non-index phase.

How is it possible that we move to the non-index phase without
clearing indname as we always revert back the old phase information?
I think it is possible only if we don't clear indname after the phase
is over.

> >  So maybe we
> > have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the
> > function, and errcbarg->phase=phase last.

I find that a bit ad-hoc, if possible, let's try to avoid it.

>
> And addressed that.
>
> Also, I realized that lazy_cleanup_index has an early "return", so the "Revert
> back" was ineffective.
>

We can call it immediately after index_vacuum_cleanup to avoid that.

>  We talked about how that wasn't needed, since we never
> go back to a previous phase.  Amit wanted to keep it there for consistency,
> but I'd prefer to put any extra effort into calling out the special treatment
> needed/given to lazy_vacuum_heap/index, rather than making everything
> "consistent".
>

Apart from being consistent, the point was that it doesn't seem good for
the API being called to assume that there is nothing more the caller can
do.  It might be problematic if we later want to enhance or add
something to the caller.

> Amit: I also moved the TRUNCATE_HEAP bit back to truncate_heap(),

There is no problem with it.  We can do it either way; I had also
considered doing it the way you have done, but decided to keep it in the
caller because of the previous point I mentioned (I'm not sure it is a
good idea for the API being called to assume that there is nothing more
the caller can do after this).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Justin Pryzby
> Another issue is this:
> > +VACUUM ( FULL [, ...] ) [ TABLESPACE new_tablespace ] [ table_and_columns [, ...] ]
> As you mentioned in your v1 patch, in the other cases, "tablespace
> [tablespace]" is added at the end of the command rather than in the middle.  I
> wasn't able to make that work, maybe because "tablespace" isn't a fully
> reserved word (?).  I didn't try with "SET TABLESPACE", although I understand
> it'd be better without "SET".

I think we should use the parenthesized syntax for vacuum - it seems clear in
hindsight.

Possibly REINDEX should use that, too, instead of adding OptTablespace at the
end.  I'm not sure.

CLUSTER doesn't support parenthesized syntax, but .. maybe it should?

Also, perhaps VAC FULL (and CLUSTER, if it grows parenthesized syntax), should
support something like this:

USING INDEX TABLESPACE name

I guess I would prefer just "index tablespace", without "using":

|VACUUM(FULL, TABLESPACE ts, INDEX TABLESPACE its) t;
|CLUSTER(VERBOSE, TABLESPACE ts, INDEX TABLESPACE its) t;

-- 
Justin




Re: Some problems of recovery conflict wait events

2020-03-26 Thread Fujii Masao



On 2020/03/26 14:33, Masahiko Sawada wrote:

On Tue, 24 Mar 2020 at 17:04, Fujii Masao  wrote:




On 2020/03/05 20:16, Fujii Masao wrote:



On 2020/03/05 16:58, Masahiko Sawada wrote:

On Wed, 4 Mar 2020 at 15:21, Fujii Masao  wrote:




On 2020/03/04 14:31, Masahiko Sawada wrote:

On Wed, 4 Mar 2020 at 13:48, Fujii Masao  wrote:




On 2020/03/04 13:27, Michael Paquier wrote:

On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:

Yeah, so 0001 patch sets existing wait events to recovery conflict
resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
to the recovery conflict on a snapshot. 0003 patch improves these wait
events by adding the new type of wait event such as
WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
is the fix for existing versions and 0003 patch is an improvement for
only PG13. Did you mean even 0001 patch doesn't fit for back-patching?


Yes, it looks like an improvement rather than a bug fix.



Okay, understood.


I got my eyes on this patch set.  The full patch set is in my opinion
just a set of improvements, and not bug fixes, so I would refrain from
back-backpatching.


I think that the issue (i.e., "waiting" is reported twice needlessly
in PS display) that 0002 patch tries to fix is a bug. So it should be
fixed even in the back branches.


So we need only two patches: one fixes the process title issue and another
improves the wait events. I've attached updated patches.


Thanks for updating the patches! I started reading 0001 patch.


Thank you for reviewing this patch.



-   /*
-    * Report via ps if we have been waiting for more than 500 msec
-    * (should that be configurable?)
-    */
-   if (update_process_title && new_status == NULL &&
-       TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
-                                  500))

The patch changes ResolveRecoveryConflictWithSnapshot() and
ResolveRecoveryConflictWithTablespace() so that they always add
"waiting" into the PS display, whether a wait is really necessary or not.
But isn't it better to display "waiting" in PS only when a wait is
actually necessary, as ResolveRecoveryConflictWithVirtualXIDs()
originally does in the code above?


You're right. Will fix it.



ResolveRecoveryConflictWithDatabase(Oid dbid)
{
+   char       *new_status = NULL;
+
+   /* Report via ps we are waiting */
+   new_status = set_process_title_waiting();

In  ResolveRecoveryConflictWithDatabase(), there seems no need to
display "waiting" in PS because no wait occurs when recovery conflict
with database happens.


Isn't the startup process waiting for the other backends to terminate?


Yeah, you're right. I agree that "waiting" should be reported in this case.


On second thought, in recovery conflict case, "waiting" should be reported
while waiting for the specified delay (e.g., by max_standby_streaming_delay)
until the conflict is resolved. So IMO reporting "waiting" in the case of
recovery conflict with buffer pin, snapshot, lock and tablespace seems valid,
because they are user-visible "expected" wait time.

However, in the case of recovery conflict with database, the recovery
basically doesn't wait at all and just terminates the conflicting sessions
immediately. Then the recovery waits for all those sessions to be terminated,
but that wait time is basically small and should not be user-visible.
If that wait time becomes very long because of an unresponsive backend, ISTM
that a LOG or WARNING should be logged instead of reporting something in the
PS display. I'm not sure if that logging is really necessary now, though.
Therefore, I'm thinking that "waiting" doesn't need to be reported in the case
of recovery conflict with database. Thought?


Fair enough. The longer wait time of conflicts with a database is not
user-expected behaviour, so logging would be better. I'd like to just
drop the change around ResolveRecoveryConflictWithDatabase() from the
patch.


Makes sense.

+   if (update_process_title)
+   waitStart = GetCurrentTimestamp();

Since LockBufferForCleanup() can be called very frequently,
I don't think it's a good idea to call GetCurrentTimestamp()
every time LockBufferForCleanup() is called.
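
(One way to avoid that, sketched here rather than taken from the patch:
capture the start time lazily, on the first iteration of the wait loop,
rather than on every call:

    if (update_process_title && waitStart == 0)
        waitStart = GetCurrentTimestamp();

so the common, uncontended path never reads the clock.)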

+   /* Report via ps if we have been waiting for more than 500 msec */
+   if (update_process_title && new_status == NULL &&
+       TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
+                                  500))

Do we really want to see "waiting" in PS even in non-hot-standby mode?

If max_standby_streaming_delay and deadlock_timeout are set to large values,
ResolveRecoveryConflictWithBufferPin() can wait for a long time, e.g.,
more than 500ms. In that case, I'm afraid that "report 

Re: Memory-Bounded Hash Aggregation

2020-03-26 Thread Tomas Vondra

On Thu, Mar 26, 2020 at 05:56:56PM +0800, Richard Guo wrote:

Hello,

When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:


   depth = ceil( log(nbatches - 1) / log(num_partitions) );


If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?



I think using log() is correct, but why should we allow fractional
nbatches values between 1.0 and 2.0? You either have 1 batch or 2
batches, you can't have 1.5 batches. So I think the issue is here

  nbatches = Max((numGroups * hashentrysize) / mem_limit,
 numGroups / ngroups_limit );

and we should probably do

  nbatches = ceil(nbatches);

right after it.
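
To make the failure concrete, a small standalone example (the input
numbers are made up):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double numGroups = 11000.0;
        double hashentrysize = 100.0;
        double mem_limit = 10000000.0;
        double ngroups_limit = 10000.0;
        double num_partitions = 4.0;

        double nbatches = fmax(numGroups * hashentrysize / mem_limit,
                               numGroups / ngroups_limit);      /* 1.1 */

        /* without rounding: log(0.1) is negative, giving depth = -1 */
        printf("depth before fix = %g\n",
               ceil(log(nbatches - 1) / log(num_partitions)));

        nbatches = ceil(nbatches);                              /* 2 */
        printf("depth after fix  = %g\n",
               ceil(log(nbatches - 1) / log(num_partitions)));  /* 0 */
        return 0;
    }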



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Race condition in SyncRepGetSyncStandbysPriority

2020-03-26 Thread Tom Lane
Twice in the past month [1][2], buildfarm member hoverfly has managed
to reach the "unreachable" Assert(false) at the end of
SyncRepGetSyncStandbysPriority.

What seems likely to me, after quickly eyeballing the code, is that
hoverfly is hitting the blatantly-obvious race condition in that function.
Namely, that the second loop supposes that the state of the walsender
array hasn't changed since the first loop.

The minimum fix for this, I suppose, would have the first loop capture
the sync_standby_priority value for each walsender along with what it's
already capturing.  But I wonder if the whole function shouldn't be
rewritten from scratch, because it seems like the algorithm is both
expensively brute-force and unintelligible, which is a sad combination.
It's likely that the number of walsenders would never be high enough
that efficiency could matter, but then couldn't we use an algorithm
that is less complicated and more obviously correct?  (Because the
alternative conclusion, if you reject the theory that a race is happening,
is that the algorithm is just flat out buggy; something that's not too
easy to disprove either.)

Another fairly dubious thing here is that whether or not *am_sync
gets set depends not only on whether MyWalSnd is claiming to be
synchronous but on how many lower-numbered walsenders are too.
Is that really the right thing?

But worse than any of that is that the return value seems to be
a list of walsender array indexes, meaning that the callers cannot
use it without making even stronger assumptions about the array
contents not having changed since the start of this function.

It sort of looks like the design is based on the assumption that
the array contents can't change while SyncRepLock is held ... but
if that's the plan then why bother with the per-walsender spinlocks?
In any case this assumption seems to be failing, suggesting either
that there's a caller that's not holding SyncRepLock when it calls
this function, or that somebody is failing to take that lock while
modifying the array.

regards, tom lane

[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-02-29%2001%3A34%3A55
[2]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-03-26%2013%3A51%3A15




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-03-26 Thread Kartyshov Ivan

Anna, thank you for your review.

On 2020-03-25 21:10, Anna Akenteva wrote:

On 2020-03-21 14:16, Kartyshov Ivan wrote:

  and event is:
  LSN value [options]
  TIMESTAMP value

I would maybe remove WAIT FOR TIMESTAMP. As Robert Haas has pointed
out, it seems a lot like pg_sleep_until(). Besides, it doesn't
necessarily need to be connected to transaction start, which makes it
different from WAIT FOR LSN - so I wouldn't mix them together.

I don't mind.
But I think we should get at least one more opinion on this point.


===
This is how WaitUtility() is called - note that time_val will always be > 0:

+    if (time_val <= 0)
+        time_val = 1;
+...
+    res = WaitUtility(lsn, (int)(time_val * 1000), dest);

(time_val * 1000) is passed to WaitUtility() as the delay argument.
And inside WaitUtility() we have this:

+if (delay > 0)
+    latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
+else
+    latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Since we always pass a delay value greater than 0, we'll never get to
the "else" clause here and we'll never be ready to wait for LSN
forever. Perhaps due to that, the current test outputs this after a
simple WAIT FOR LSN command:
psql::1: NOTICE:  LSN is not reached.

I fixed it, and interrupt handling, in the last patch.
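
A sketch of the shape that fix can take, using the backend's WaitLatch()
(the wait-event class here is just a placeholder):

    int     latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;
    long    timeout = -1;           /* ignored without WL_TIMEOUT: wait forever */

    if (delay > 0)
    {
        latch_events |= WL_TIMEOUT;
        timeout = delay;
    }
    rc = WaitLatch(MyLatch, latch_events, timeout, PG_WAIT_EXTENSION);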

Anna, feel free to work on this patch.

--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 8d91f3529e..8697f9807f 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -187,6 +187,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/begin.sgml b/doc/src/sgml/ref/begin.sgml
index c23bbfb4e7..45289c0173 100644
--- a/doc/src/sgml/ref/begin.sgml
+++ b/doc/src/sgml/ref/begin.sgml
@@ -21,13 +21,25 @@ PostgreSQL documentation
 
  
 
-BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ]
+BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+wait_for_event
+WAIT FOR [ANY | ALL] event [, event ...]
+
+where event is:
+LSN value [options]
+TIMESTAMP value
+
+and where options is one of:
+TIMEOUT delay
+UNTIL TIMESTAMP timestamp
+
 
  
 
diff --git a/doc/src/sgml/ref/start_transaction.sgml b/doc/src/sgml/ref/start_transaction.sgml
index d6cd1d4177..01b402e9cd 100644
--- a/doc/src/sgml/ref/start_transaction.sgml
+++ b/doc/src/sgml/ref/start_transaction.sgml
@@ -21,13 +21,24 @@ PostgreSQL documentation
 
  
 
-START TRANSACTION [ transaction_mode [, ...] ]
+START TRANSACTION [ transaction_mode [, ...] ] wait_for_event
 
 where transaction_mode is one of:
 
 ISOLATION LEVEL { SERIALIZABLE | REPEATABLE READ | READ COMMITTED | READ UNCOMMITTED }
 READ WRITE | READ ONLY
 [ NOT ] DEFERRABLE
+
+wait_for_event
+WAIT FOR [ANY | SOME | ALL] event [, event ...]
+
+where event is:
+LSN value [options]
+TIMESTAMP value
+
+and where options is one of:
+TIMEOUT delay
+UNTIL TIMESTAMP timestamp
 
  
 
diff --git a/doc/src/sgml/ref/wait.sgml b/doc/src/sgml/ref/wait.sgml
new file mode 100644
index 00..b824088f6c
--- /dev/null
+++ b/doc/src/sgml/ref/wait.sgml
@@ -0,0 +1,148 @@
+
+
+
+ 
+  WAIT FOR
+ 
+
+ 
+  WAIT FOR
+  7
+  SQL - Language Statements
+ 
+
+ 
+  WAIT FOR
+  wait for the target LSN to be replayed
+ 
+
+ 
+
+WAIT FOR [ANY | ALL] event [, event ...]
+
+where event is:
+LSN value [options]
+TIMESTAMP value
+
+and where options is one of:
+TIMEOUT delay
+UNTIL TIMESTAMP timestamp
+
+WAIT FOR LSN 'lsn_number'
+WAIT FOR LSN 'lsn_number' TIMEOUT wait_timeout
+WAIT FOR LSN 'lsn_number' UNTIL TIMESTAMP wait_time
+WAIT FOR TIMESTAMP wait_time
+WAIT FOR ALL LSN 'lsn_number' TIMEOUT wait_timeout, TIMESTAMP wait_time
+WAIT FOR ANY LSN 'lsn_number', TIMESTAMP wait_time
+
+ 
+
+ 
+  Description
+
+  
+   WAIT FOR provides a simple
+   interprocess communication mechanism to wait for timestamp or the target log sequence
+   number (LSN) on standby in PostgreSQL
+   databases with master-standby asynchronous replication. When run with the
+   LSN option, the WAIT FOR command
+   waits for the specified LSN to be replayed. By default, wait
+   time is unlimited. Waiting can be interrupted using Ctrl+C, or
+   by shutting down the postgres server. You can also limit the wait
+   time using the TIMEOUT option.
+  
+
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Specify the target log sequence number to wait for.
+ 
+
+   
+
+   
+TIMEOUT wait_timeout
+
+ 
+  Limit the time interval to wait for the LSN to be replayed.
+  The specified wait_timeout must be an integer
+  and is measured in milliseconds.
+ 
+
+   
+
+   

Re: SLRU statistics

2020-03-26 Thread Tomas Vondra

Hi,

Attached is v3 of the patch with one big change and various small ones.

The main change is that it gets rid of the SlruType enum and the new
field in SlruCtlData. Instead, the patch now uses the name passed to
SimpleLruInit (which is then stored as the LWLock tranche name).

The counters are still stored in a fixed-sized array, and there's a
simple name/index mapping. We don't have any registry of stable SLRU
IDs, so I can't think of anything better, and I think this is good
enough for now.
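
A sketch of what such a name/index mapping can look like (the SLRU names
here are illustrative, not the final list):

    static const char *const slru_names[] = {
        "async", "clog", "commit_timestamp", "multixact_offset",
        "multixact_member", "oldserxid", "subtrans",
        "other"                     /* fallback for unknown names */
    };

    static int
    pgstat_slru_index(const char *name)
    {
        for (int i = 0; i < lengthof(slru_names); i++)
        {
            if (strcmp(slru_names[i], name) == 0)
                return i;
        }
        return lengthof(slru_names) - 1;    /* everything else -> "other" */
    }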

The other change is that I got rid of the io_error counter. We don't
have that for shared buffers etc. either, anyway.

I've also renamed the columns from "pages" to "blks" to make it
consistent with other similar stats (blks_hit, blks_read). I've
renamed the fields to "blks_written" and "blks_zeroed".

And finally, I've added the view to monitoring.sgml.

Barring objections, I'll get this committed in the next few days, after
reviewing the comments a bit.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From e1a187b9b331798c87900f94aa77999f9198f556 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Thu, 26 Mar 2020 20:52:26 +0100
Subject: [PATCH] SLRU stats

---
 doc/src/sgml/monitoring.sgml |  77 +
 src/backend/access/transam/slru.c|  23 +++
 src/backend/catalog/system_views.sql |  13 ++
 src/backend/postmaster/pgstat.c  | 238 +++
 src/backend/utils/adt/pgstatfuncs.c  |  77 +
 src/include/catalog/pg_proc.dat  |  10 ++
 src/include/pgstat.h |  53 +-
 src/test/regress/expected/rules.out  |  10 ++
 8 files changed, 500 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 270178d57e..b58ac5acb8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -575,6 +575,13 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   yet included in pg_stat_user_functions).
  
 
+ 
+  
pg_stat_slrupg_stat_slru
+  One row per SLRU, showing statistics of operations. See
+for details.
+  
+ 
+
 

   
@@ -3254,6 +3261,76 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i

   
 
+  
+   The pg_stat_slru view will contain
+   one row for each tracked SLRU cache, showing statistics about access
+   to cached pages.
+  
+
+  
+   pg_stat_slru View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+
+ name
+ name
+ name of the SLRU
+
+
+ blks_zeroed
+ bigint
+ Number of blocks zeroed during initializations
+
+
+ blks_hit
 bigint
+ Number of times disk blocks were found already in the SLRU,
+  so that a read was not necessary (this only includes hits in the
+  SLRU, not the operating system's file system cache)
+ 
+
+
+ blks_read
+ bigint
+ Number of disk blocks read for this SLRU
+
+
+ blks_written
+ bigint
+ Number of disk blocks written for this SLRU
+
+
+ blks_exists
+ bigint
+ Number of blocks checked for existence for this SLRU
+
+
+ flushes
+ bigint
+ Number of flushes of dirty data for this SLRU
+
+
+ truncates
+ bigint
+ Number of truncates for this SLRU
+
+
+ stats_reset
+ timestamp with time zone
+ Time at which these statistics were last reset
+
+   
+   
+  
+
   
The pg_stat_user_functions view will contain
one row for each tracked function, showing statistics about executions of
diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index d5b7a08f73..f7160dd574 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -286,6 +286,9 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
/* Assume this page is now the latest active page */
shared->latest_page_number = pageno;
 
+   /* update the stats counter of zeroed pages */
+   pgstat_slru_count_page_zeroed(ctl);
+
return slotno;
 }
 
@@ -403,6 +406,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
}
/* Otherwise, it's ready to use */
SlruRecentlyUsed(shared, slotno);
+
+   /* update the stats counter of pages found in the SLRU */
+   pgstat_slru_count_page_hit(ctl);
+
return slotno;
}
 
@@ -444,6 +451,10 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
SlruReportIOError(ctl, pageno, xid);
 
SlruRecentlyUsed(shared, slotno);
+
+   /* update the stats counter of pages not found in SLRU */
+   pgstat_slru_count_page_read(ctl);
+
return slotno;
}

Re: error context for vacuum to include block number

2020-03-26 Thread Alvaro Herrera
On 2020-Mar-26, Justin Pryzby wrote:

> On Thu, Mar 26, 2020 at 07:49:51PM -0300, Alvaro Herrera wrote:

> > BTW I'm pretty sure this "revert back" phrasing is not good English --
> > you should just use "revert".  Maybe get some native speaker's opinion
> > on it.
> 
> I'm a native speaker; "revert back" might be called redundant but I think it's
> common usage.

Oh, okay.

> > And speaking of language, I find the particle "cbarg" rather very ugly,
> 
> I renamed it since it was kind of opaque looking.  It's in all the same
> places, so equally infectious; but I hope you like it better.

I like it much better, thanks :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: allow online change primary_conninfo

2020-03-26 Thread Alvaro Herrera
Now, would anyone be too upset if I push these in this other order?  I
realized that the reason the tests broke after Sergei's patch is that
recovery/t/001_stream_rep.pl's get_slot_xmins() is broken for temp
walreceiver slots, since it's using the non-temp name it tries to give
to the slot, rather than the temp name under which it is actually
created.  The workaround proposed by 0002 is to edit standby_1's config
to set walreceiver's slot to be non-temp.

Thanks to Justin Pryzby for offlist typo corrections.

The reason is that I think I would like to get Sergei's patch pushed
right away (0001+0002, as a single commit); but that I think there's
more to attack in the walreceiver temp slot code than 0003 here, and I
don't want to delay the new feature any longer while I figure out the
fix for that.


(The thing is: if I specify primary_slot_name in the config, why is the
temp walreceiver slot code not obeying that name?  I think walreceiver
should create a temp slot, sure, but using the given name rather than
coming up with a random name.)

(The other reason is that I would like to push one patch to fix
walreceiver tmp slot rather than two, setting the thing first this way
and then the opposite way.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From f7a9717caa9dd73c00fb5e6a1dddb354ad78d09a Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 25 Mar 2020 20:48:56 -0300
Subject: [PATCH v11 1/3] Allow changing primary_conninfo and primary_slot_name
 in SIGHUP

---
 doc/src/sgml/config.sgml  |  16 ++-
 src/backend/access/transam/xlog.c | 130 +++---
 src/backend/access/transam/xlogreader.c   |   6 +-
 src/backend/postmaster/startup.c  |  30 +++-
 src/backend/replication/walreceiver.c |  12 +-
 src/backend/utils/misc/guc.c  |   4 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 -
 src/include/access/xlog.h |   2 +
 src/test/recovery/t/001_stream_rep.pl |  24 +++-
 9 files changed, 165 insertions(+), 61 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 355b408b0a..1cf88e953d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4028,9 +4028,15 @@ ANY num_sync ( 
@@ -4045,9 +4051,13 @@ ANY num_sync ( ).
-  This parameter can only be set at server start.
+  This parameter can only be set in the postgresql.conf
+  file or on the server command line.
   This setting has no effect if primary_conninfo is not
-  set.
+  set or the server is not in standby mode.
+ 
+ 
+  The WAL receiver is restarted after an update of primary_slot_name.
  
 

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..bbf8d4eee5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -815,9 +815,13 @@ static XLogSource readSource = XLOG_FROM_ANY;
  * currently have a WAL file open. If lastSourceFailed is set, our last
  * attempt to read from currentSource failed, and we should try another source
  * next.
+ *
+ * pendingWalRcvRestart is set when a config change occurs that requires a
+ * walreceiver restart.  This is only valid in XLOG_FROM_STREAM state.
  */
 static XLogSource currentSource = XLOG_FROM_ANY;
 static bool lastSourceFailed = false;
+static bool pendingWalRcvRestart = false;
 
 typedef struct XLogPageReadPrivate
 {
@@ -11904,6 +11908,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	for (;;)
 	{
 		XLogSource	oldSource = currentSource;
+		bool		startWalReceiver = false;
 
 		/*
 		 * First check if we failed to read from the current source, and
@@ -11938,53 +11943,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		return false;
 
 	/*
-	 * If primary_conninfo is set, launch walreceiver to try
-	 * to stream the missing WAL.
-	 *
-	 * If fetching_ckpt is true, RecPtr points to the initial
-	 * checkpoint location. In that case, we use RedoStartLSN
-	 * as the streaming start position instead of RecPtr, so
-	 * that when we later jump backwards to start redo at
-	 * RedoStartLSN, we will have the logs streamed already.
-	 */
-	if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
-	{
-		XLogRecPtr	ptr;
-		TimeLineID	tli;
-
-		if (fetching_ckpt)
-		{
-			ptr = RedoStartLSN;
-			tli = ControlFile->checkPointCopy.ThisTimeLineID;
-		}
-		else
-		{
-			ptr = RecPtr;
-
-			/*
-			 * Use the record begin position to determine the
-			 * TLI, rather than the position we're reading.
-			 */
-			tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
-
-			if (curFileTLI > 0 && tli < curFileTLI)
-elog(ERROR, "according to history file, WAL 

RE: Conflict handling for COPY FROM

2020-03-26 Thread asaba.takan...@fujitsu.com
Hello Surafel,

From: Surafel Temesgen 
>An error that can surely be handled without transaction rollback can
>be included in error handling, but I would like to proceed without
>binary file error handling for the time being

Thank you.

Also it seems that you applied Alexey's comment,
so I'll mark this patch as ready for committer.

Regards,

--
Takanori Asaba




Re: error context for vacuum to include block number

2020-03-26 Thread Justin Pryzby
On Thu, Mar 26, 2020 at 07:49:51PM -0300, Alvaro Herrera wrote:
> > > ... So once we've "reverted back", 1) the pointer is null; and, 2)
> > > the callback function doesn't access it for the previous/reverted
> > > phase anyway.
> 
> BTW I'm pretty sure this "revert back" phrasing is not good English --
> you should just use "revert".  Maybe get some native speaker's opinion
> on it.

I'm a native speaker; "revert back" might be called redundant but I think it's
common usage.

> And speaking of language, I find the particle "cbarg" rather very ugly,
> and it's *everywhere* -- function name, function argument, local
> variable, enum values, enum name.  It even spread to the typedefs.list
> file!  Is this a new virus???  Put some soap in it!  Can't we use "info"
> or "state" or something similar, less infectious, instead?

I renamed it since it was kind of opaque looking.  It's in all the same places,
so equally infectious; but I hope you like it better.

Cheers,
-- 
Justin
>From 4d33dc1c125690b48dc0028c63a663db53ac0311 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v37 1/3] Introduce vacuum errcontext to display additional
 information.

The additional information displayed will be block number for error
occurring while processing heap and index name for error occurring
while processing the index.

This will help us in diagnosing the problems that occur during a vacuum.
For ex. due to corruption (either caused by bad hardware or by some bug)
if we get some error while vacuuming, it can help us identify the block
in heap and or additional index information.

It sets up an error context callback to display additional information
with the error.  During different phases of vacuum (heap scan, heap
vacuum, index vacuum, index clean up, heap truncate), we update the error
context callback to display appropriate information.  We can extend it to
a bit more granular level like adding the phases for FSM operations or for
prefetching the blocks while truncating. However, I felt that it requires
adding many more error callback function calls and can make the code a bit
complex, so left those for now.

Author: Justin Pryzby, with few changes by Amit Kapila
Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier
and Sawada Masahiko
Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com
---
 src/backend/access/heap/vacuumlazy.c | 240 ---
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 216 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03c43efc32..973f411caf 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,17 @@
  */
 #define ParallelVacuumIsActive(lps) PointerIsValid(lps)
 
+/* Phases of vacuum during which we report error context. */
+typedef enum
+{
+	VACUUM_ERRCB_PHASE_UNKNOWN,
+	VACUUM_ERRCB_PHASE_SCAN_HEAP,
+	VACUUM_ERRCB_PHASE_VACUUM_INDEX,
+	VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+	VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
+	VACUUM_ERRCB_PHASE_TRUNCATE
+} VacErrPhase;
+
 /*
  * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
  * This is allocated in the DSM segment in parallel mode and in local memory
@@ -270,6 +281,8 @@ typedef struct LVParallelState
 
 typedef struct LVRelStats
 {
+	char	   *relnamespace;
+	char	   *relname;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
 	/* Overall statistics about rel */
@@ -290,8 +303,12 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
-} LVRelStats;
 
+	/* Used for error callback */
+	char	   *indname;
+	BlockNumber blkno;			/* used only for heap operations */
+	VacErrPhase phase;
+} LVRelStats;
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -314,10 +331,10 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	LVRelStats *vacrelstats, LVParallelState *lps,
 	int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-			  LVDeadTuples *dead_tuples, double reltuples);
+			  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
 static void lazy_cleanup_index(Relation indrel,
 			   IndexBulkDeleteResult **stats,
-			   double reltuples, bool estimated_count);
+			   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
 static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
 static bool should_attempt_truncation(VacuumParams *params,
@@ -337,13 +354,13 @@ static void lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult *
 		 int nindexes);
 static void parallel_vacuum_index(Relation *Irel, 

Re: error context for vacuum to include block number

2020-03-26 Thread Alvaro Herrera
On 2020-Mar-26, Justin Pryzby wrote:

> On Thu, Mar 26, 2020 at 10:04:57AM -0500, Justin Pryzby wrote:

> > And I think you're right: we only save state when the calling function
> > has a indname=NULL, so we never "put back" a non-NULL indname.  We go
> > from having a indname=NULL at lazy_scan_heap to not-NULL at
> > lazy_vacuum_index, and never the other way around.
> 
> I removed the free_oldindname argument.

Hah, I was wondering about that free_oldindname business this morning as
well.

> > ... So once we've "reverted back", 1) the pointer is null; and, 2)
> > the callback function doesn't access it for the previous/reverted
> > phase anyway.

BTW I'm pretty sure this "revert back" phrasing is not good English --
you should just use "revert".  Maybe get some native speaker's opinion
on it.

And speaking of language, I find the particle "cbarg" rather very ugly,
and it's *everywhere* -- function name, function argument, local
variable, enum values, enum name.  It even spread to the typedefs.list
file!  Is this a new virus???  Put some soap in it!  Can't we use "info"
or "state" or something similar, less infectious, instead?

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-26 Thread Michael Paquier
On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote:
> Uh, is XLOGDIR the only reason to include xlog_internal.h?  Maybe it
> would be easier to have a routine in xlog.c that returns the path?
> There's a few functions in xlog.c that could use it, as well.

Yep, that's all.  We could also just hardcode the path as we did when
we worked on the exclusion filter lists for pg_rewind and basebackup.c,
though I'd prefer to avoid that if possible.

> The patch downthread looks decent cleanup, but I'm not sure how it helps
> further the objective.

I actually think it does, because it gets out of the way the conflicts
caused by RestoreArchivedFile() in the backend, as we are targeting
fe_archive.c becoming a frontend-only file, though I agree that it is
not the whole of it.

> (A really good cleanup could be a situation where frontend files don't
> need xlog_internal.h -- for example, maybe a new file xlogpage.h could
> contain struct defs that relate to page and segment headers and the
> like, as well as useful macros.  I don't know if this can be made to
> work -- but xlog_internal.h contains stuff like xl_parameter_change etc
> as well as RmgrData which surely are of no interest to readers of wal
> files ... or, say, RequestXLogSwitch.)

A second header that could be created is xlogpaths.h (or xlogfiles.h?)
that includes all the routines and variables we use to build WAL
segment names and such, with XLogFileNameById, IsTLHistoryFileName,
etc.  I agree that splitting the record-related parts may make sense,
say xlogrecovery.h?

> I don't think any such cleanup should hamper the patch at hand anyway.

I don't think so either, so I would do the split for the archive routines
at least.  It still feels strange to me to have a different routine
name for the frontend side of RestoreArchivedFile().  That's not
really consistent with the current practice we have for palloc() & co, as
well as the sync routines.
--
Michael




Re: Include sequence relation support in logical replication

2020-03-26 Thread Cary Huang
Hi Andres,

Thanks for your reply and your patch review. Please see my comments below.

> On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> I have shared a patch that allows sequence relation to be supported in
>> logical replication via the decoding plugin ( test_decoding for
>> example ); it does not support sequence relation in logical
>> replication between a PG publisher and a PG subscriber via pgoutput
>> plugin as it will require much more work.
>
> Could you expand on "much more work"? Once decoding support is there,
> that shouldn't be that much?

By much more work, I meant that more source files will need to be changed
to support sequence replication between a PG subscriber and publisher
using the pgoutput plugin; about 10 more source file changes. The idea is
similar, though.

>> Sequence changes caused by other sequence-related SQL functions like
>> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
>> replicating changes caused by these should not be a problem.
>
> I think this really would need to handle at the very least setval to
> make sense.

Yes, sure.

>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c
>
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

Yes, updating the WAL only every 32 increments is good and has huge
performance benefits according to Michael, but when it is replicated
logically to subscribers, the sequence value they receive would not make
much sense to them. For example:

If I have a sequence called "seq" with current value = 100 and
increment = 5, the nextval('seq') call will return 105 to the client but
will write 260 to the WAL record ( 100 + (5*32) ), because that is the
value after 32 increments. Internally it also maintains a "log_cnt"
counter that tracks how many nextval() calls have been invoked since the
last WAL write, so it can derive backwards to find the proper next value
to return to the client.
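
Roughly, as a simplified sketch of that fetch-ahead arithmetic (not the
actual sequence.c code; XLogPut here is a placeholder for the WAL insert):

    #define SEQ_LOG_VALS 32

    /* state: last = 100, incby = 5, log_cnt = 0 */
    last += incby;                  /* 105 is returned to the caller */
    if (log_cnt == 0)
    {
        /* WAL-log a value 32 increments ahead: 100 + 5 * 32 = 260 */
        XLogPut(last + incby * (SEQ_LOG_VALS - 1));
        log_cnt = SEQ_LOG_VALS;
    }
    log_cnt--;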



But the subscriber for this sequence will receive a change of 260 instead
of 105, which does not represent the current sequence status. The
subscriber is not able to derive backwards because it does not know the
increment size in its schema.

Setting SEQ_LOG_VALS to 0 in sequence.c basically disables this
32-increment behavior and makes a WAL update at every nextval() call, so
the subscriber to this sequence will receive the same value (105) as the
client, at the cost of more WAL writes.

I would like to ask if you have some suggestions or ideas that could let
the subscriber receive the current value without disabling the
32-increment behavior?

>> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
>> index 93c948856e..7a7e572d6c 100644
>> --- a/contrib/test_decoding/test_decoding.c
>> +++ b/contrib/test_decoding/test_decoding.c
>> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>>  					&change->data.tp.oldtuple->tuple,
>>  					true);
>>  		break;
>> +	case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +		appendStringInfoString(ctx->out, " SEQUENCE:");
>> +		if (change->data.sequence.newtuple == NULL)
>> +			appendStringInfoString(ctx->out, " (no-tuple-data)");
>> +		else
>> +			tuple_to_stringinfo(ctx->out, tupdesc,
>> +								&change->data.sequence.newtuple->tuple,
>> +								false);
>> +		break;
>>  	default:
>>  		Assert(false);
>>  	}
>
> You should also add tests - the main purpose of contrib/test_decoding is
> to facilitate writing those...

Thanks, I will add.

>> +	ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);
>
> Huh, why are you doing this? That's going to increase overhead of logical
> decoding by many times?

This is needed to allow a snapshot to be built inside the DecodeCommit()
function. Regular changes caused by INSERT also have this function called,
so I assume it is needed to ensure proper decoding. Without it, a snapshot
will not be built and the change transaction will not be logged.




>> +		case REORDER_BUFFER_CHANGE_SEQUENCE:
>> +			Assert(snapshot_now);
>> +
>> +			reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
>> +
Re: error context for vacuum to include block number

2020-03-26 Thread Justin Pryzby
On Thu, Mar 26, 2020 at 10:04:57AM -0500, Justin Pryzby wrote:
> Does that address your comment ?

I hope so.

> > I'm not sure why "free_oldindname" is necessary. Since we initialize
> > vacrelstats->indname with NULL and revert the callback arguments at
> > the end of functions that needs update them, vacrelstats->indname is
> > NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index().
> > And we make a copy of index name in update_vacuum_error_cbarg(). So I
> > think we can pfree the old index name if errcbarg->indname is not NULL.
> 
> We want to avoid doing this:
>  olderrcbarg = *vacrelstats // saves a pointer
>  update_vacuum_error_cbarg(... NULL); // frees the pointer and sets indname 
> to NULL
>  update_vacuum_error_cbarg(... olderrcbarg.oldindnam) // puts back the 
> pointer, which has been freed
>  // hit an error, and the callback accesses the pfreed pointer
> 
> I think that's only an issue for lazy_vacuum_index().
> 
> And I think you're right: we only save state when the calling function has
> indname=NULL, so we never "put back" a non-NULL indname.  We go from having
> indname=NULL at lazy_scan_heap to not-NULL at lazy_vacuum_index, and never
> the other way around.  So once we've "reverted back", 1) the pointer is
> null; and, 2) the callback function doesn't access it for the
> previous/reverted phase anyway.

I removed the free_oldindname argument.

> Hm, I was just wondering what happens if an error happens *during*
> update_vacuum_error_cbarg().  It seems like if we set
> errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, then an
> error would cause a crash.  And if we pfree and set indname before phase, it'd
> be a problem when going from an index phase to non-index phase.  So maybe we
> have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the function,
> and errcbarg->phase=phase last.

And addressed that.
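To illustrate the ordering constraint being discussed, here is a sketch of
the safe update sequence (illustrative only, not the exact patch code; the
function and struct names follow the thread):

static void
update_vacuum_error_cbarg(LVRelStats *errcbarg, VacErrCbPhase phase,
						  BlockNumber blkno, char *indname)
{
	/* disarm the callback while the arguments are inconsistent */
	errcbarg->phase = VACUUM_ERRCB_PHASE_UNKNOWN;

	errcbarg->blkno = blkno;

	/* free the old index name only while the callback won't look at it */
	if (errcbarg->indname)
		pfree(errcbarg->indname);
	errcbarg->indname = indname ? pstrdup(indname) : NULL;

	/* arm the new phase last, so an error raised mid-update can never
	 * make the callback dereference a freed index name */
	errcbarg->phase = phase;
}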

Also, I realized that lazy_cleanup_index has an early "return", so the "Revert
back" was ineffective.  We talked about how that wasn't needed, since we never
go back to a previous phase.  Amit wanted to keep it there for consistency, but
I'd prefer to put any extra effort into calling out the special treatment
needed/given to lazy_vacuum_heap/index, rather than making everything
"consistent".

Amit: I also moved the TRUNCATE_HEAP bit back to truncate_heap(), since 1) it's
odd if we don't have anything in truncate_heap() about error reporting except
for "vacrelstats->blkno = blkno"; and, 2) it's nice to set the err callback arg
right after pgstat_progress, and outside of any loop.  In previous versions, it
was within the loop, because it closely wrapped RelationTruncate() and
count_nondeletable_pages() - a previous version used separate phases.

-- 
Justin
>From bfc574979f85c0f0722d182ae8ae03097fb5f9c4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v36 1/3] Introduce vacuum errcontext to display additional
 information.

The additional information displayed will be block number for error
occurring while processing heap and index name for error occurring
while processing the index.

This will help us in diagnosing problems that occur during a vacuum.
For example, if we get some error while vacuuming due to corruption
(caused either by bad hardware or by some bug), it can help us identify
the block in the heap and/or additional index information.

It sets up an error context callback to display additional information
with the error.  During different phases of vacuum (heap scan, heap
vacuum, index vacuum, index clean up, heap truncate), we update the error
context callback to display appropriate information.  We could extend it
to a more granular level, like adding phases for FSM operations or for
prefetching blocks while truncating.  However, that would require adding
many more error callback function calls and could make the code a bit
complex, so those are left for now.

Author: Justin Pryzby, with few changes by Amit Kapila
Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier
and Sawada Masahiko
Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com
---
 src/backend/access/heap/vacuumlazy.c | 240 ---
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 216 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03c43efc32..e98e6b45d3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,17 @@
  */
 #define ParallelVacuumIsActive(lps) PointerIsValid(lps)
 
+/* Phases of vacuum during which we report error context. */
+typedef enum
+{
+	VACUUM_ERRCB_PHASE_UNKNOWN,
+	VACUUM_ERRCB_PHASE_SCAN_HEAP,
+	VACUUM_ERRCB_PHASE_VACUUM_INDEX,
+	VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+	VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
+	VACUUM_ERRCB_PHASE_TRUNCATE
+} VacErrCbPhase;
+
 /*
  * 

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-26 Thread Alvaro Herrera
On 2020-Mar-26, Michael Paquier wrote:

> I was looking at this patch again today and I am rather fine with the
> existing semantics.  Still I don't like much to name the frontend-side
> routine FrontendRestoreArchivedFile() and use a different name than
> the backend counterpart because we have to include xlog_internal.h in
> fe_archive.c to be able to grab XLOGDIR.

Uh, is XLOGDIR the only reason to include xlog_internal.h?  Maybe it
would be easier to have a routine in xlog.c that returns the path?
There's a few functions in xlog.c that could use it, as well.

Although ... looking at xlog.h, that one is even worse, so I'm not sure
*where* you'd put the prototype for the new function I'm proposing.

> So here is an idea: let's move the declaration of the routines part of
> xlogarchive.c to a new header, called xlogarchive.h, and then let's
> use the same routine name for the frontend and the backend in this
> second patch.  We include xlog_internal.h already in many frontend
> tools, so that would clean up things a bit.

The patch downthread looks like decent cleanup, but I'm not sure how it
helps further the objective.

(A really good cleanup could be a situation where frontend files don't
need xlog_internal.h -- for example, maybe a new file xlogpage.h could
contain struct defs that relate to page and segment headers and the
like, as well as useful macros.  I don't know if this can be made to
work -- but xlog_internal.h contains stuff like xl_parameter_change etc
as well as RmgrData which surely are of no interest to readers of wal
files ... or, say, RequestXLogSwitch.)

I don't think any such cleanup should hamper the patch at hand anyway.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgbench - add \aset to store results of a combined query

2020-03-26 Thread Fabien COELHO


Bonjour Michaël,

> [...] Still sounds strange to me to invent a new variable to this
> structure if it is possible to track the exact same thing with an
> existing part of a Command, or it would make sense to split Command into
> two different structures with an extra structure used after the parsing
> for clarity?

Hmmm.

Your point is to store the gset/aset status into the meta field, even if
the command type is SQL. This is not done for gset, which relies on the
non-null prefix, and breaks the assumption that meta is set to something
only when the command is a meta command. Why not. I updated the comment,
so now meta is none/gset/aset when the command type is sql, and I removed
the aset field.

> Well, it still looks cleaner to me to just assign the meta field
> properly within ParseScript(), and you get the same result.  And it is
> also possible to use "meta" to do more sanity checks around META_GSET
> for some code paths.  So I'd actually find the addition of a new
> argument using a meta command within readCommandResponse() cleaner.

I tried to do that.

> - * varprefix   SQL commands terminated with \gset have this set
> + * varprefix   SQL commands terminated with \gset or \aset have this set
>
> Nit from v4: varprefix can be used for \gset and \aset, and the
> comment was not updated.

It is now updated.

> +   /* coldly skip empty result under \aset */
> +   if (ntuples <= 0)
> +       break;
>
> Shouldn't this check after \aset?  And it seems to me that this code
> path is not taken, so a test would be nice.

Added (I think, if I understood what you suggested).

> -   } while (res);
> +   } while (res != NULL);
>
> Useless diff.

Yep.

Attached an updated v5.

--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 41b3880c91..58a2aa3bf2 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1057,18 +1057,29 @@ pgbench  options  d

 
  \gset [prefix]
+ \aset [prefix]
 
 
 
  
-  This command may be used to end SQL queries, taking the place of the
+  These commands may be used to end SQL queries, taking the place of the
   terminating semicolon (;).
  
 
  
-  When this command is used, the preceding SQL query is expected to
-  return one row, the columns of which are stored into variables named after
-  column names, and prefixed with prefix if provided.
+  When the \gset command is used, the preceding SQL query is
+  expected to return one row, the columns of which are stored into variables
+  named after column names, and prefixed with prefix
+  if provided.
+ 
+
+ 
+  When the \aset command is used, all combined SQL queries
+  (separated by \;) have their columns stored into variables
+  named after column names, and prefixed with prefix
+  if provided. If a query returns no row, no assignment is made and the variable
+  can be tested for existence to detect this. If a query returns more than one
+  row, the last value is kept.
  
 
  
@@ -1077,6 +1088,8 @@ pgbench  options  d
   p_two and p_three
   with integers from the third query.
   The result of the second query is discarded.
+  The results of the last two combined queries are stored in variables
+  four and five.
 
 UPDATE pgbench_accounts
   SET abalance = abalance + :delta
@@ -1085,6 +1098,7 @@ UPDATE pgbench_accounts
 -- compound of two queries
 SELECT 1 \;
 SELECT 2 AS two, 3 AS three \gset p_
+SELECT 4 AS four \; SELECT 5 AS five \aset
 
  
 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..74aadeade0 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -480,6 +480,7 @@ typedef enum MetaCommand
 	META_SHELL,	/* \shell */
 	META_SLEEP,	/* \sleep */
 	META_GSET,	/* \gset */
+	META_ASET,	/* \aset */
 	META_IF,	/* \if */
 	META_ELIF,	/* \elif */
 	META_ELSE,	/* \else */
@@ -504,14 +505,17 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
  *				not applied.
  * first_line	A short, single-line extract of 'lines', for error reporting.
  * type			SQL_COMMAND or META_COMMAND
- * meta			The type of meta-command, or META_NONE if command is SQL
+ * meta			The type of meta-command; if the command is SQL, this is
+ *				META_NONE, META_GSET or META_ASET, which dictates what to
+ *				do with the SQL query result.
  * argc			Number of arguments of the command, 0 if not yet processed.
  * argv			Command arguments, the first of which is the command or SQL
  *				string itself.  For SQL commands, after post-processing
  *				argv[0] is the same as 'lines' with variables substituted.
- * varprefix 	SQL commands terminated with \gset have this set
+ * varprefix 	SQL commands terminated with \gset or \aset have this set
  *				to a non NULL value.  If nonempty, it's used to prefix the
  *				variable name that receives the value.
+ *

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 10:34, Michael Paquier wrote:

> On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote:
>> Thanks Alvaro and Alexander.  0001 has been applied as of e09ad07.
>> Now for 0002, let's see about it later.  Attached is a rebased
>> version, with no actual changes.
>
> I was looking at this patch again today and I am rather fine with the
> existing semantics.  Still I don't like much to name the frontend-side
> routine FrontendRestoreArchivedFile() and use a different name than
> the backend counterpart because we have to include xlog_internal.h in
> fe_archive.c to be able to grab XLOGDIR.
>
> So here is an idea: let's move the declaration of the routines part of
> xlogarchive.c to a new header, called xlogarchive.h, and then let's
> use the same routine name for the frontend and the backend in this
> second patch.  We include xlog_internal.h already in many frontend
> tools, so that would clean up things a bit.  Two extra things are the
> routines for the checkpointer as well as the variables like
> ArchiveRecoveryRequested.  It may be nice to move those while on it,
> but I am not sure where and that's not actually required for this
> patch set so that could be addressed later if need be.
>
> Any thoughts?



The block of function declarations for xlogarchive.c inside
xlog_internal.h looks a bit dangling already, since all the other
functions and variables there are defined/initialized in xlog.c. So it
looks good to me to move it out.


The only concern I have about using the same name is that someone may
later introduce a new variable or typedef inside xlogarchive.h, and then
someone else would be required to include both fe_archive.h and
xlogarchive.h at once. But probably there should be a warning in the
header comment section against doing so.


Anyway, I have tried to do what you proposed; attached is a small patch
that introduces xlogarchive.h.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 860a996414..d683af377f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -36,6 +36,7 @@
 
 #include "access/timeline.h"
 #include "access/xlog.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
 #include "pgstat.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..242427f815 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -32,6 +32,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 914ad340ea..04104b55ea 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -20,6 +20,7 @@
#include <unistd.h>
 
 #include "access/xlog.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "common/archive.h"
 #include "miscadmin.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..cc91954ac1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -48,6 +48,7 @@
 #include "access/htup_details.h"
 #include "access/timeline.h"
 #include "access/transam.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..8e3cfcf83e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -320,22 +320,4 @@ extern bool InArchiveRecovery;
 extern bool StandbyMode;
 extern char *recoveryRestoreCommand;
 
-/*
- * Prototypes for functions in xlogarchive.c
- */
-extern bool RestoreArchivedFile(char *path, const char *xlogfname,
-								const char *recovername, off_t expectedSize,
-								bool cleanupEnabled);
-extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
-								   bool failOnSignal);
-extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
-extern void XLogArchiveNotify(const char *xlog);
-extern void XLogArchiveNotifySeg(XLogSegNo segno);
-extern void XLogArchiveForceDone(const char *xlog);
-extern bool XLogArchiveCheckDone(const char *xlog);
-extern bool XLogArchiveIsBusy(const char *xlog);
-extern bool XLogArchiveIsReady(const char *xlog);
-extern bool XLogArchiveIsReadyOrDone(const char *xlog);
-extern void XLogArchiveCleanup(const char *xlog);
-
 #endif			/* XLOG_INTERNAL_H */
diff --git a/src/include/access/xlogarchive.h b/src/include/access/xlogarchive.h
new file mode 100644
index 00..3406fb2800
--- 

Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-26 Thread David Rowley
On Fri, 27 Mar 2020 at 07:51, Andres Freund  wrote:
>
> Hi,
>
> On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote:
> > On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> > I am reluctant to introduce new semantics like a reloption value of -2
> > to disable a feature in this patch right before feature freeze.
> >
> > I believe there are enough options to disable insert-only vacuuming for
> > an individual table:
>
> > - Set the threshold to 2147483647.  True, that will not work for very
> >   large tables, but I think that there are few tables that insert that
> >   many rows before they hit autovacuum_freeze_max_age anyway.
> >
> > - Set the scale factor to some astronomical value.
>
> Meh. You *are* adding new semantics with these. And they're terrible.

I've modified this to provide a proper way to disable the entire feature:
the setting may now be set to -1 to disable it. I feel people are fairly
used to using -1 to disable various features (e.g.
log_autovacuum_min_duration).  I've used the special value of -2 for
the reloption to have that cascade to using the GUC instead.  The
autovacuum_vacuum_insert_threshold reloption may be explicitly set to
-1 to disable insert autovacuums for the relation.

I've also knocked the default threshold down to 1000. I feel this is a
better value given that the scale factor is now 0.2. There seemed to
be no need to exclude smaller tables from seeing gains such as
index-only scans.

If nobody objects, I plan to push this one shortly.

David


insert_autovac_v12.patch
Description: Binary data


Re: Ltree syntax improvement

2020-03-26 Thread Nikita Glukhov

On 25.03.2020 2:08, Tom Lane wrote:


> Nikita Glukhov writes:
>> Attached new version of the patch.
>
> I spent a little bit of time looking through this, and have a few
> comments:
>
> * You have a lot of places where tests for particular ASCII characters
> are done like this:
>
>     if ((charlen == 1) && t_iseq(src, '\\'))
>
> This is a tedious and expensive way to spell
>
>     if (*src == '\\')
>
> because charlen is necessarily 1 if you are looking at an ASCII character;
> there is no allowed backend encoding in which an ASCII character can be
> the first byte of a multibyte character.  Aside from the direct
> simplifications of the tests that this makes possible, I see some places
> where you'd not have to pass charlen around, either.


All unnecessary checks of charlen were removed, but t_iseq() was left for
consistency.


> * I spent a fair amount of time thinking that a lot of the added code
> was wrong because it was only considering escaping and not
> double-quoting.  I eventually concluded that the idea is to convert
> double-quoting to a pure escape-based representation during input
> and store it that way.  However, I don't really see why that is either
> necessary or a good idea --- the internal storage already has a length
> counter for each label.  So I think what you really ought to be doing
> here is simplifying out both quotes and escapes during ltree_in
> and just storing the notionally-represented string internally.
> (If I've misunderstood what the plan is, well the utter lack of
> documentation in the patch isn't helping.)


ltree_in() removes quotes and escapes before storing strings (see
copy_unescaped()), just as you suggest.

ltree_out() adds escapes and quotes if necessary (see copy_escaped(),
extra_bytes_for_escaping()).

I have refactored code a bit, removed duplicated code, fixed several
bugs in reallocation of output strings, and added some comments.



> * The added test cases seem a bit excessive and repetitive.


I have removed some tests that have become redundant after changes in
parsing.


--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
>From 8ec6f93203684d63bf3c5d006c2499a71a1a9dad Mon Sep 17 00:00:00 2001
From: Nikita Glukhov 
Date: Thu, 5 Mar 2020 17:34:59 +0300
Subject: [PATCH 1/2] Replace 'if' with 'switch' in ltree code

---
 contrib/ltree/ltree_io.c | 402 ---
 1 file changed, 204 insertions(+), 198 deletions(-)

diff --git a/contrib/ltree/ltree_io.c b/contrib/ltree/ltree_io.c
index 900a46a..e97f035 100644
--- a/contrib/ltree/ltree_io.c
+++ b/contrib/ltree/ltree_io.c
@@ -69,40 +69,43 @@ ltree_in(PG_FUNCTION_ARGS)
 	{
 		charlen = pg_mblen(ptr);
 
-		if (state == LTPRS_WAITNAME)
+		switch (state)
 		{
-			if (ISALNUM(ptr))
-			{
-				lptr->start = ptr;
-				lptr->wlen = 0;
-				state = LTPRS_WAITDELIM;
-			}
-			else
-				UNCHAR;
-		}
-		else if (state == LTPRS_WAITDELIM)
-		{
-			if (charlen == 1 && t_iseq(ptr, '.'))
-			{
-				lptr->len = ptr - lptr->start;
-				if (lptr->wlen > 255)
-					ereport(ERROR,
-							(errcode(ERRCODE_NAME_TOO_LONG),
-							 errmsg("name of level is too long"),
-							 errdetail("Name length is %d, must "
-									   "be < 256, in position %d.",
-									   lptr->wlen, pos)));
-
-				totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE);
-				lptr++;
-				state = LTPRS_WAITNAME;
-			}
-			else if (!ISALNUM(ptr))
-				UNCHAR;
+			case LTPRS_WAITNAME:
+				if (ISALNUM(ptr))
+				{
+					lptr->start = ptr;
+					lptr->wlen = 0;
+					state = LTPRS_WAITDELIM;
+				}
+				else
+					UNCHAR;
+				break;
+
+			case LTPRS_WAITDELIM:
+				if (charlen == 1 && t_iseq(ptr, '.'))
+				{
+					lptr->len = ptr - lptr->start;
+					if (lptr->wlen > 255)
+						ereport(ERROR,
+								(errcode(ERRCODE_NAME_TOO_LONG),
+								 errmsg("name of level is too long"),
+								 errdetail("Name length is %d, must "
+										   "be < 256, in position %d.",
+										   lptr->wlen, pos)));
+
+					totallen += MAXALIGN(lptr->len + LEVEL_HDRSIZE);
+					lptr++;
+					state = LTPRS_WAITNAME;
+				}
+				else if (!ISALNUM(ptr))
+					UNCHAR;
+				break;
+
+			default:
+				/* internal error */
+				elog(ERROR, "internal error in parser");
 		}
-		else
-			/* internal error */
-			elog(ERROR, "internal error in parser");
 
 		ptr += charlen;
 		lptr->wlen++;
@@ -238,178 +241,181 @@ lquery_in(PG_FUNCTION_ARGS)
 	{
 		charlen = pg_mblen(ptr);
 
-		if (state == LQPRS_WAITLEVEL)
-		{
-			if (ISALNUM(ptr))
-			{
-				GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * (numOR + 1));
-				lptr->start = ptr;
-				state = LQPRS_WAITDELIM;
-				curqlevel->numvar = 1;
-			}
-			else if (charlen == 1 && t_iseq(ptr, '!'))
-			{
-				GETVAR(curqlevel) = lptr = (nodeitem *) palloc0(sizeof(nodeitem) * (numOR + 1));
-				lptr->start = ptr + 1;
-				state = LQPRS_WAITDELIM;
-				curqlevel->numvar = 1;
-				curqlevel->flag |= LQL_NOT;
-				hasnot = true;
-			}
-			else if (charlen == 1 && t_iseq(ptr, '*'))
-				state 

Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-26 14:37:59 -0400, Tom Lane wrote:
>> Testing that reminded me of the other regression test failure I'd seen
>> when I first tried to do it: select_parallel.sql shows a WARNING about
>> a plancache leak in a parallel worker process.  When I looked into the
>> reason for that, it turned out that some cowboy has split
>> XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and
>> XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel
>> workers) without bothering to fix the collateral damage to plpgsql.
>> So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and
>> hasn't been for a couple of releases.
>> The bad effects of that are probably limited given that the worker
>> process will exit after committing, but I still think that that part
>> of this patch is a bug fix that needs to be back-patched.

> Ugh. Lucky that we don't register many things inside those resowners.

Yeah.  I spent some time trying to produce a failure this way, and
concluded that it's pretty hard because most of the relevant callbacks
will be run during ExprContext shutdown, which is done during plpgsql
function exit.  In a non-transaction-abort situation, the simple EState
shouldn't have any live ExprContexts left at commit.  I did find a case
where a memory context callback attached to the EState's query context
doesn't get run when expected ... but it still gets run later, when the
whole memory context tree is destroyed.  So I can't demonstrate any
user-visible misbehavior in the core code.  But it still seems like a
prudent idea to back-patch a fix, in case I missed something or there is
some extension that's pushing the boundaries further.  It's definitely
not very cool that we're leaving behind a dangling static pointer to an
EState that won't exist once TopTransactionMemoryContext is gone.

I'll back-patch relevant parts of those comments about DO block
management, too.

regards, tom lane




Re: backup manifests

2020-03-26 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Mar 26, 2020, at 12:37 PM, Stephen Frost  wrote:
> > * Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> >>> On Mar 26, 2020, at 9:34 AM, Stephen Frost  wrote:
> >>> I'm not actually argueing about which hash functions we should support,
> >>> but rather what the default is and if crc32c, specifically, is actually
> >>> a reasonable choice.  Just because it's fast and we already had an
> >>> implementation of it doesn't justify its use as the default.  Given that
> >>> it doesn't actually provide the check that is generally expected of
> >>> CRC checksums (100% detection of single-bit errors) when the file size
> >>> gets over 512MB makes me wonder if we should have it at all, yes, but it
> >>> definitely makes me think it shouldn't be our default.
> >> 
> >> I don't understand your focus on the single-bit error issue.  
> > 
> > Maybe I'm wrong, but my understanding was that detecting single-bit
> > errors was one of the primary design goals of CRC and why people talk
> > about CRCs of certain sizes having 'limits'- that's the size at which
> > single-bit errors will no longer, necessarily, be picked up and
> > therefore that's where the CRC of that size starts falling down on that
> > goal.
> 
> I think I agree with all that.  I'm not sure it is relevant.  When people use 
> CRCs to detect things *other than* transmission errors, they are in some 
> sense using a hammer to drive a screw.  At that point, the analysis of how 
> good the hammer is, and how big a nail it can drive, is no longer relevant.  
> The relevant discussion here is how appropriate a CRC is for our purpose.  I 
> don't know the answer to that, but it doesn't seem the single-bit error 
> analysis is the right analysis.

I disagree that it's not relevant- it's, in fact, the one really clear
thing we can get a pretty straight-forward answer on, and that seems
really useful to me.

> >> If you are sending your backup across the wire, single bit errors during 
> >> transmission should already be detected as part of the networking 
> >> protocol.  The real issue has to be detection of the kinds of errors or 
> >> modifications that are most likely to happen in practice.  Which are 
> >> those?  People manually mucking with the files?  Bugs in backup scripts?  
> >> Corruption on the storage device?  Truncated files?  The more bits in the 
> >> checksum (assuming a well designed checksum algorithm), the more likely we 
> >> are to detect accidental modification, so it is no surprise if a 64-bit 
> >> crc does better than 32-bit crc.  But that logic can be taken arbitrarily 
> >> far.  I don't see the connection between, on the one hand, an analysis of 
> >> single-bit error detection against file size, and on the other hand, the 
> >> verification of backups.
> > 
> > We'd like something that does a good job at detecting any differences
> > between when the file was copied off of the server and when the command
> > is run- potentially weeks or months later.  I would expect most issues
> > to end up being storage-level corruption over time where the backup is
> > stored, which could be single bit flips or whole pages getting zeroed or
> > various other things.  Files changing size probably is one of the less
> > common things, but, sure, that too.
> > 
> > That we could take this "arbitrarily far" is actually entirely fine-
> > that's a good reason to have alternatives, which this patch does have,
> > but that doesn't mean we should have a default that's not suitable for
> > the files that we know we're going to be storing.
> > 
> > Consider that we could have used a 16-bit CRC instead, but does that
> > actually make sense?  Ok, sure, maybe someone really wants something
> > super fast- but should that be our default?  If not, then what criteria
> > should we use for the default?
> 
> I'll answer this below
> 
> >> From a support perspective, I think the much more important issue is 
> >> making certain that checksums are turned on.  A one in a billion chance of 
> >> missing an error seems pretty acceptable compared to the, let's say, one 
> >> in two chance that your customer didn't use checksums.  Why are we even 
> >> allowing this to be turned off?  Is there a usage case compelling that 
> >> option?
> > 
> > The argument is that adding checksums takes more time.  I can understand
> > that argument, though I don't really agree with it.  Certainly a few
> > percent really shouldn't be that big of an issue, and in many cases even
> > a sha256 hash isn't going to have that dramatic of an impact on the
> > actual overall time.
> 
> I see two dangers here:
> 
> (1) The user enables checksums of some type, and due to checksums not being 
> perfect, corruption happens but goes undetected, leaving her in a bad place.
> 
> (2) The user makes no checksum selection at all, gets checksums of the 
> *default* type, determines it is too slow for her purposes, and instead of 

Re: NOT IN subquery optimization

2020-03-26 Thread Li, Zheng
> BTW, so far as I can see, the only reason you're bothering with the whole
> thing is to compare the size of the subquery output with work_mem, because
> that's all that subplan_is_hashable does.  I wonder whether that
> consideration is even still necessary in the wake of 1f39bce02.  If it is,
> I wonder whether there isn't a cheaper way to figure it out.  (Note
> similar comment in make_subplan.)

The comment in make_subplan says there is no cheaper way to figure it out:
/* At present, however, we can only check hashability after
 * we've made the subplan :-(.  (Determining whether it'll fit in work_mem
 * is the really hard part.)
 */

I don't see why commit 1f39bce02 is related to this problem. Can you expand 
on this?

> But can't you detect that case directly?  It seems like you'd need to
> figure out the NULL situation anyway to know whether the transformation
> to antijoin is valid in the first place.

Yes, we do need to figure out the NULL situation, and there is always a
valid transformation to antijoin; it's just that in the NULL case we need
to stuff an additional clause into the antijoin condition. In these cases
the transformation actually outperforms the SubPlan (non-hashed) but
underperforms the hashed SubPlan. The unmodified anti hash join has
similar performance to the hashed SubPlan.



Re: backup manifests

2020-03-26 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost  wrote:
> > I do agree with excluding things like md5 and others that aren't good
> > options.  I wasn't saying we should necessarily exclude crc32c either..
> > but rather saying that it shouldn't be the default.
> >
> > Here's another way to look at it- where do we use crc32c today, and how
> > much data might we possibly be covering with that crc?
> 
> WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB
> minus 1 byte. In practice, most of them are not more than a few
> hundred bytes, but the amount we might possibly be covering is a lot more.

Is it actually possible, today, in PG, to have a 4GB WAL record?
Judging this based on the WAL record size doesn't seem quite right.

> > Why was crc32c
> > picked for that purpose?
> 
> Because it was discovered that 64-bit CRC was too slow, per commit
> 21fda22ec46deb7734f793ef4d7fa6c226b4c78e.

... 15 years ago.  I actually find it pretty interesting that we started
out with a 64-bit CRC there; I didn't know that was the case.  Also
interesting is that we already had 64-bit CRC code.
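(For context, the backend's CRC-32C interface today is the set of macros in
src/include/port/pg_crc32c.h; a typical caller, sketched here with a
hypothetical helper of my own, looks roughly like this:)

#include "port/pg_crc32c.h"

/* hypothetical helper: checksum an in-memory buffer with CRC-32C */
static pg_crc32c
checksum_buffer(const char *buf, size_t len)
{
	pg_crc32c	crc;

	INIT_CRC32C(crc);
	COMP_CRC32C(crc, buf, len);	/* may be called repeatedly to accumulate */
	FIN_CRC32C(crc);

	return crc;
}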

> > If the individual who decided to pick crc32c
> > for that case was contemplating a checksum for up-to-1GB files, would
> > they have picked crc32c?  Seems unlikely to me.
> 
> It's hard to be sure what someone who isn't us would have done in some
> situation that they didn't face, but we do have the discussion thread:
> 
> https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c
> 
> The question of how much data is protected by the CRC was discussed,
> mostly in the first few messages, in general terms, but it doesn't
> seem to have covered the question very thoroughly. I'm sure we could
> each draw things from that discussion that support our view of the
> situation, but I'm not sure it would be very productive.

Interesting.

> What confuses to me is that you seem to have a view of the upsides and
> downsides of these various algorithms that seems to me to be highly
> skewed. Like, suppose we change the default from CRC-32C to
> SHA-something. On the upside, the error detection rate will increase
> from 99.999+% to something much closer to 100%. On the downside,
> backups will get as much as 40-50% slower for some users. I hope we
> can agree that both detecting errors and taking backups quickly are
> important. However, it is hard for me to imagine that the typical user
> would want to pay even a 5-10% performance penalty when taking a
> backup in order to improve an error detection feature which they may
> not even use and which already has less than a one-in-a-billion chance
> of going wrong. We routinely reject features for causing, say, a 2%
> regression on general workloads. Base backup speed is probably less
> important than how many SELECT or INSERT queries you can pump through
> the system in a second, but it's still a pain point for lots of
> people. I think if you said to some users "hey, would you like to have
> error detection for your backups? it'll cost 10%" many people would
> say "yes, please." But I think if you went to the same users and said
> "hey, would you like to make the error detection for your backups
> better? it currently has a less than 1-in-a-billion chance of failing
> to detect random corruption, and you can reduce that by many orders of
> magnitude for an extra 10% on your backup time," I think the results
> would be much more mixed. Some people would like it, but it certainly
> not everybody.

I think you're right that base backup speed is much less of an issue to
slow down than SELECT or INSERT workloads, but I do also understand
that it isn't completely unimportant, which is why having options isn't
a bad idea here.  That said, the options presented for users should all
be reasonable options, and for the default we should pick something
sensible, erring on the "be safer" side, if anything.

There's lots of options for speeding up base backups, with this patch,
even if the default is to have a manifest with sha256 hashes- it could
be changed to some form of CRC, or changed to not have checksums, or
changed to not have a manifest.  Users will have options.

Again, I'm not against having a checksum algorithm as a option.  I'm not
saying that it must be SHA512 as the default.

> > I'm not actually argueing about which hash functions we should support,
> > but rather what the default is and if crc32c, specifically, is actually
> > a reasonable choice.  Just because it's fast and we already had an
> > implementation of it doesn't justify its use as the default.  Given that
> > it doesn't actually provide the check that is generally expected of
> > CRC checksums (100% detection of single-bit errors) when the file size
> > gets over 512MB makes me wonder if we should have it at all, yes, but it
> > definitely makes me think it shouldn't be our default.
> 
> I mean, the 

Re: backup manifests

2020-03-26 Thread Mark Dilger



> On Mar 26, 2020, at 12:37 PM, Stephen Frost  wrote:
> 
> Greetings,
> 
> * Mark Dilger (mark.dil...@enterprisedb.com) wrote:
>>> On Mar 26, 2020, at 9:34 AM, Stephen Frost  wrote:
>>> I'm not actually argueing about which hash functions we should support,
>>> but rather what the default is and if crc32c, specifically, is actually
>>> a reasonable choice.  Just because it's fast and we already had an
>>> implementation of it doesn't justify its use as the default.  Given that
>>> it doesn't actually provide the check that is generally expected of
>>> CRC checksums (100% detection of single-bit errors) when the file size
>>> gets over 512MB makes me wonder if we should have it at all, yes, but it
>>> definitely makes me think it shouldn't be our default.
>> 
>> I don't understand your focus on the single-bit error issue.  
> 
> Maybe I'm wrong, but my understanding was that detecting single-bit
> errors was one of the primary design goals of CRC and why people talk
> about CRCs of certain sizes having 'limits'- that's the size at which
> single-bit errors will no longer, necessarily, be picked up and
> therefore that's where the CRC of that size starts falling down on that
> goal.

I think I agree with all that.  I'm not sure it is relevant.  When people use 
CRCs to detect things *other than* transmission errors, they are in some sense 
using a hammer to drive a screw.  At that point, the analysis of how good the 
hammer is, and how big a nail it can drive, is no longer relevant.  The 
relevant discussion here is how appropriate a CRC is for our purpose.  I don't 
know the answer to that, but it doesn't seem the single-bit error analysis is 
the right analysis.

>> If you are sending your backup across the wire, single bit errors during 
>> transmission should already be detected as part of the networking protocol.  
>> The real issue has to be detection of the kinds of errors or modifications 
>> that are most likely to happen in practice.  Which are those?  People 
>> manually mucking with the files?  Bugs in backup scripts?  Corruption on the 
>> storage device?  Truncated files?  The more bits in the checksum (assuming a 
>> well designed checksum algorithm), the more likely we are to detect 
>> accidental modification, so it is no surprise if a 64-bit crc does better 
>> than 32-bit crc.  But that logic can be taken arbitrarily far.  I don't see 
>> the connection between, on the one hand, an analysis of single-bit error 
>> detection against file size, and on the other hand, the verification of 
>> backups.
> 
> We'd like something that does a good job at detecting any differences
> between when the file was copied off of the server and when the command
> is run- potentially weeks or months later.  I would expect most issues
> to end up being storage-level corruption over time where the backup is
> stored, which could be single bit flips or whole pages getting zeroed or
> various other things.  Files changing size probably is one of the less
> common things, but, sure, that too.
> 
> That we could take this "arbitrarily far" is actually entirely fine-
> that's a good reason to have alternatives, which this patch does have,
> but that doesn't mean we should have a default that's not suitable for
> the files that we know we're going to be storing.
> 
> Consider that we could have used a 16-bit CRC instead, but does that
> actually make sense?  Ok, sure, maybe someone really wants something
> super fast- but should that be our default?  If not, then what criteria
> should we use for the default?

I'll answer this below

>> From a support perspective, I think the much more important issue is making 
>> certain that checksums are turned on.  A one in a billion chance of missing 
>> an error seems pretty acceptable compared to the, let's say, one in two 
>> chance that your customer didn't use checksums.  Why are we even allowing 
>> this to be turned off?  Is there a usage case compelling that option?
> 
> The argument is that adding checksums takes more time.  I can understand
> that argument, though I don't really agree with it.  Certainly a few
> percent really shouldn't be that big of an issue, and in many cases even
> a sha256 hash isn't going to have that dramatic of an impact on the
> actual overall time.

I see two dangers here:

(1) The user enables checksums of some type, and due to checksums not being 
perfect, corruption happens but goes undetected, leaving her in a bad place.

(2) The user makes no checksum selection at all, gets checksums of the 
*default* type, determines it is too slow for her purposes, and instead of 
adjusting the checksum algorithm to something faster, simply turns checksums 
off; corruption happens and of course is undetected, leaving her in a bad place.

I think the risk of (2) is far worse, which makes me tend towards a default 
that is fast enough not to encourage anybody to disable checksums altogether.  
I have no opinion about which algorithm is best 

Re: backup manifests

2020-03-26 Thread David Steele

On 3/26/20 11:37 AM, Robert Haas wrote:

> On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost wrote:
>
> This is where I feel like I'm trying to make decisions in a vacuum. If
> we had a few more people weighing in on the thread on this point, I'd
> be happy to go with whatever the consensus was. If most people think
> having both --no-manifest (suppressing the manifest completely) and
> --manifest-checksums=none (suppressing only the checksums) is useless
> and confusing, then sure, let's rip the latter one out. If most people
> like the flexibility, let's keep it: it's already implemented and
> tested. But I hate to base the decision on what one or two people
> think.

I'm not sure I see a lot of value in being able to build a manifest with
no checksums, especially if the overhead for the default checksum
algorithm is negligible.

However, I'd still prefer that the default be something more robust and
allow users to tune it down rather than the other way around.  But I've
made that pretty clear up-thread and I consider that argument lost at
this point.

>> As for folks who are that close to the edge on their backup timing that
>> they can't have it slow down- chances are pretty darn good that they're
>> not far from ending up needing to find a better solution than
>> pg_basebackup anyway.  Or they don't need to generate a manifest (or, I
>> suppose, they could have one but not have checksums..).
>
> 40-50% is a lot more than "if you were on the edge."

For the record I think this is a very misleading number.  Sure, if you
are doing your backup to a local SSD on a powerful development laptop it
makes sense.

But backups are generally placed on slower storage, remotely, with
compression.  Even without compression, the first two are going to bring
this percentage down by a lot.

When you get to page-level incremental backups, which is where this all
started, I'd still recommend using a stronger checksum algorithm to
verify that the file was reconstructed correctly on restore.  That much
I believe we have agreed on.

>> Even pg_basebackup (in both fetch and stream modes...) checks that we at
>> least got all the WAL that's needed for the backup from the server
>> before considering the backup to be valid and telling the user that
>> there was a successful backup.  With what you're proposing here, we
>> could have someone do a pg_basebackup, get back an ERROR saying the
>> backup wasn't valid, and then run pg_validatebackup and be told that the
>> backup is valid.  I don't get how that's sensible.
>
> I'm sorry that you can't see how that's sensible, but it doesn't mean
> that it isn't sensible. It is totally unrealistic to expect that any
> backup verification tool can verify that you won't get an error when
> trying to use the backup. That would require that the validation tool
> try to do everything that PostgreSQL will try to do when the backup is
> used, including running recovery and updating the data files. Anything
> less than that creates a real possibility that the backup will verify
> good but fail when used. This tool has a much narrower purpose, which
> is to try to verify that we (still) have the files the server sent as
> part of the backup and that, to the best of our ability to detect such
> things, they have not been modified. As you know, or should know, the
> WAL files are not sent as part of the backup, and so are not verified.
> Other things that would also be useful to check are also not verified.
> It would be fantastic to have more verification tools in the future,
> but it is difficult to see why anyone would bother trying if an attempt
> to get the first one committed gets blocked because it does not yet do
> everything. Very few patches try to do everything, and those that do
> usually get blocked because, by trying to do too much, they get some of
> it badly wrong.

I agree with Stephen that this should be done, but I agree with you that
it can wait for a future commit. However, I do think:

1) It should be called out rather plainly in the documentation.
2) If there are files in pg_wal then pg_validatebackup should inform the
user that those files have not been validated.

I know you and Stephen have agreed on a number of doc changes, would it
be possible to get a new patch with those included? I finally have time
to do a review of this tomorrow.  I saw some mistakes in the docs in the
current patch but I know those patches are not current.


Regards,
--
-David
da...@pgmasters.net




Re: backup manifests

2020-03-26 Thread Stephen Frost
Greetings,

* Mark Dilger (mark.dil...@enterprisedb.com) wrote:
> > On Mar 26, 2020, at 9:34 AM, Stephen Frost  wrote:
> > I'm not actually argueing about which hash functions we should support,
> > but rather what the default is and if crc32c, specifically, is actually
> > a reasonable choice.  Just because it's fast and we already had an
> > implementation of it doesn't justify its use as the default.  Given that
> > it doesn't actually provide the check that is generally expected of
> > CRC checksums (100% detection of single-bit errors) when the file size
> > gets over 512MB makes me wonder if we should have it at all, yes, but it
> > definitely makes me think it shouldn't be our default.
> 
> I don't understand your focus on the single-bit error issue.  

Maybe I'm wrong, but my understanding was that detecting single-bit
errors was one of the primary design goals of CRC and why people talk
about CRCs of certain sizes having 'limits'- that's the size at which
single-bit errors will no longer, necessarily, be picked up and
therefore that's where the CRC of that size starts falling down on that
goal.
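As a back-of-the-envelope aid here: for an idealized, well-mixed n-bit
checksum, a random corruption goes undetected with probability about 2^-n,
which is where the "one in a billion"-style figures in this thread come
from. A throwaway C snippet (my own illustration, not from the patch) for
the sizes being discussed:

#include <stdio.h>
#include <math.h>

int
main(void)
{
	/* checksum widths discussed in this thread: crc16, crc32c, crc64, sha256 */
	int			bits[] = {16, 32, 64, 256};

	for (int i = 0; i < 4; i++)
		printf("%3d-bit checksum: ~1 undetected random corruption in %.3g\n",
			   bits[i], pow(2.0, bits[i]));
	return 0;
}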

> If you are sending your backup across the wire, single bit errors during 
> transmission should already be detected as part of the networking protocol.  
> The real issue has to be detection of the kinds of errors or modifications 
> that are most likely to happen in practice.  Which are those?  People 
> manually mucking with the files?  Bugs in backup scripts?  Corruption on the 
> storage device?  Truncated files?  The more bits in the checksum (assuming a 
> well designed checksum algorithm), the more likely we are to detect 
> accidental modification, so it is no surprise if a 64-bit crc does better 
> than 32-bit crc.  But that logic can be taken arbitrarily far.  I don't see 
> the connection between, on the one hand, an analysis of single-bit error 
> detection against file size, and on the other hand, the verification of 
> backups.

We'd like something that does a good job at detecting any differences
between when the file was copied off of the server and when the command
is run- potentially weeks or months later.  I would expect most issues
to end up being storage-level corruption over time where the backup is
stored, which could be single bit flips or whole pages getting zeroed or
various other things.  Files changing size probably is one of the less
common things, but, sure, that too.

That we could take this "arbitrarily far" is actually entirely fine-
that's a good reason to have alternatives, which this patch does have,
but that doesn't mean we should have a default that's not suitable for
the files that we know we're going to be storing.

Consider that we could have used a 16-bit CRC instead, but does that
actually make sense?  Ok, sure, maybe someone really wants something
super fast- but should that be our default?  If not, then what criteria
should we use for the default?

> From a support perspective, I think the much more important issue is making 
> certain that checksums are turned on.  A one in a billion chance of missing 
> an error seems pretty acceptable compared to the, let's say, one in two 
> chance that your customer didn't use checksums.  Why are we even allowing 
> this to be turned off?  Is there a usage case compelling that option?

The argument is that adding checksums takes more time.  I can understand
that argument, though I don't really agree with it.  Certainly a few
percent really shouldn't be that big of an issue, and in many cases even
a sha256 hash isn't going to have that dramatic of an impact on the
actual overall time.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 21:01, Justin Pryzby wrote:



> @@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
>   * and error messages should refer to the operation as VACUUM not CLUSTER.
>   */
>  void
> -cluster_rel(Oid tableOid, Oid indexOid, int options)
> +cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int options)
>
> Add a comment here about the tablespaceOid parameter, like the other
> functions where it's added.
>
> The permission checking is kind of duplicative, so I'd suggest to factor
> it out.  Ideally we'd only have one place that checks for
> pg_global/system/mapped.  It needs to check that it's not a system
> relation, or that system_table_mods are allowed, and in any case that if
> it's a mapped rel, that it's not being moved.  I would pass a boolean
> indicating if the tablespace is being changed.

Yes, but I wanted to make sure first that all the necessary validations
are there, so as not to miss something as I did last time. I do not like
repetitive code either, so I would like to introduce a more common check
after reviewing the code as a whole.

> Another issue is this:
> +VACUUM ( FULL [, ...] ) [ TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
>
> As you mentioned in your v1 patch, in the other cases, "tablespace
> [tablespace]" is added at the end of the command rather than in the
> middle.  I wasn't able to make that work, maybe because "tablespace"
> isn't a fully reserved word (?).  I didn't try with "SET TABLESPACE",
> although I understand it'd be better without "SET".

Initially I tried "SET TABLESPACE", but also failed to completely get
rid of the shift/reduce conflicts. I will try to rewrite VACUUM's part
again with OptTableSpace. Maybe I will manage it this time.

I will take into account all your text edits as well.


Thanks
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-26 14:37:59 -0400, Tom Lane wrote:
>> + * This function, together with CachedPlanIsSimplyValid, provides a fast 
>> path
>> + * for revalidating "simple" generic plans.  The core requirement to be 
>> simple
>> + * is that the plan must not require taking any locks, which translates to
>> + * not touching any tables; this happens to match up well with an important
>> + * use-case in PL/pgSQL.

> Hm - is there currently sufficient guarantee that we absorb sinval
> messages? Would still matter for types, functions, etc?

There are potentially issues of that sort throughout the backend, not
just here, since we don't have any locking on types or functions.
I don't think it's this patch's job to address that.  In practice
I think we've thought about it and concluded that the cost/benefit
of introducing such locks just isn't promising:

* Generally speaking you can't do anything very interesting to a type
anyway, at least not with supported DDL.  The worst-case situation that
could materialize AFAIK is possibly evaluating slightly-stale constraints
for a domain.  (The typcache does have sinval invalidation for those
constraints, but I don't recall offhand how much we guarantee about
how quickly we'll notice updates.)

* For functions, you might execute a somewhat stale version of the
function body.  The bad effects there are pretty limited since a function
is defined by just one catalog row, unlike tables, so you can't see a
self-inconsistent version of it.

The amount of lock overhead that it would take to remove those edge
cases seems slightly staggering, so I doubt we'd ever do it.

> While it'd do a small bit unnecessary work, I do wonder if it'd be
> better to use this code in ResourceOwnereReleaseInternal().

When and if we refactor to expose this sort of thing more generally,
it might be worth doing it like that.  I don't think it helps much
right now.

> Perhaps add a reference to the new (appreciated, btw) DO comment above?

Can do.

Again, thanks for reviewing!

regards, tom lane




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-26 Thread Andres Freund
Hi,

On 2020-03-26 10:12:39 +0100, Laurenz Albe wrote:
> On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> I am reluctant to introduce new semantics like a reloption value of -2
> to disable a feature in this patch right before feature freeze.
> 
> I believe there are enough options to disable insert-only vacuuming for
> an individual table:

> - Set the threshold to 2147483647.  True, that will not work for very
>   large tables, but I think that there are few tables that insert that
>   many rows before they hit autovacuum_freeze_max_age anyway.
> 
> - Set the scale factor to some astronomical value.

Meh. You *are* adding new semantics with these. And they're terrible.

Greetings,

Andres Freund




Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Andres Freund
Hi,

On 2020-03-26 14:37:59 -0400, Tom Lane wrote:
> I wrote:
> > I had a thought about a possibly-cleaner way to do this.  We could invent
> > a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that
> > explicitly releases all plancache pins it knows about.  So plpgsql
> > would not call the regular ResourceOwnerRelease entry point at all,
> > but just call that and then ResourceOwnerDelete, again relying on the
> > assertions therein to catch anything not released.
> 
> Here's a version that does it like that.  This does seem marginally
> nicer than the other way.  I have a feeling that at some point we'll
> want to expose resowners' contents more generally, but I'm not quite
> sure what the use-cases will be, so I don't want to design that now.

Yea, agreed with all of what you said in that paragraph.


> Testing that reminded me of the other regression test failure I'd seen
> when I first tried to do it: select_parallel.sql shows a WARNING about
> a plancache leak in a parallel worker process.  When I looked into the
> reason for that, it turned out that some cowboy has split
> XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and
> XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel
> workers) without bothering to fix the collateral damage to plpgsql.
> So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and
> hasn't been for a couple of releases.

Ugh.


> The bad effects of that are probably limited given that the worker
> process will exit after committing, but I still think that that part
> of this patch is a bug fix that needs to be back-patched.

Ugh. Lucky that we don't register many things inside those resowners.


> (Just
> looking at what FreeExecutorState does, I wonder whether
> jit_release_context has any side-effects that are visible outside the
> process?  But I bet I can make a test case that shows issues even
> without JIT, based on the failure to call ExprContext shutdown
> callbacks.)

JIT doesn't currently have side-effects outside of the process. I really
want to add caching support, which'd presumably have problems due to
this, but it's not there yet... This could lead to leaking a fair bit of
memory over time otherwise.



>  /*
> + * CachedPlanAllowsSimpleValidityCheck: can we use CachedPlanIsSimplyValid?
> + *
> + * This function, together with CachedPlanIsSimplyValid, provides a fast path
> + * for revalidating "simple" generic plans.  The core requirement to be 
> simple
> + * is that the plan must not require taking any locks, which translates to
> + * not touching any tables; this happens to match up well with an important
> + * use-case in PL/pgSQL.

Hm - is there currently sufficient guarantee that we absorb sinval
messages? Would still matter for types, functions, etc?


>  /*
> + * ResourceOwnerReleaseAllPlanCacheRefs
> + *   Release the plancache references (only) held by this owner.
> + *
> + * We might eventually add similar functions for other resource types,
> + * but for now, only this is needed.
> + */
> +void
> +ResourceOwnerReleaseAllPlanCacheRefs(ResourceOwner owner)
> +{
> + ResourceOwner save;
> + Datum   foundres;
> +
> + save = CurrentResourceOwner;
> + CurrentResourceOwner = owner;
> + while (ResourceArrayGetAny(&(owner->planrefarr), &foundres))
> + {
> + CachedPlan *res = (CachedPlan *) DatumGetPointer(foundres);
> +
> + ReleaseCachedPlan(res, true);
> + }
> + CurrentResourceOwner = save;
> +}

While it'd do a small bit of unnecessary work, I do wonder if it'd be
better to use this code in ResourceOwnerReleaseInternal().


> --- a/src/pl/plpgsql/src/pl_exec.c
> +++ b/src/pl/plpgsql/src/pl_exec.c
> @@ -84,6 +84,13 @@ typedef struct
>   * has its own simple-expression EState, which is cleaned up at exit from
>   * plpgsql_inline_handler().  DO blocks still use the simple_econtext_stack,
>   * though, so that subxact abort cleanup does the right thing.
> + *
> + * (However, if a DO block executes COMMIT or ROLLBACK, then exec_stmt_commit
> + * or exec_stmt_rollback will unlink it from the DO's simple-expression 
> EState
> + * and create a new shared EState that will be used thenceforth.  The 
> original
> + * EState will be cleaned up when we get back to plpgsql_inline_handler.  
> This
> + * is a bit ugly, but it isn't worth doing better, since scenarios like this
> + * can't result in indefinite accumulation of state trees.)
>   */
>  typedef struct SimpleEcontextStackEntry
>  {
> @@ -96,6 +103,15 @@ static EState *shared_simple_eval_estate = NULL;
>  static SimpleEcontextStackEntry *simple_econtext_stack = NULL;
>  
>  /*
> + * In addition to the shared simple-eval EState, we have a shared resource
> + * owner that holds refcounts on the CachedPlans for any "simple" expressions
> + * we have evaluated in the current transaction.  This allows us to avoid
> + * continually grabbing and releasing a plan refcount when a simple
> + * expression is used over and over.

Re: proposal \gcsv

2020-03-26 Thread Erik Rijkers

On 2020-03-26 18:49, Pavel Stehule wrote:

Hi

[psql-gfmt.patch]


This seems useful and works well; I haven't found any errors. Well done.

However, I have a suggestion that is perhaps slightly outside of this 
patch but functionally so close that maybe we can discuss it here.


When you try to get tab-separated output via this new \gfmt in a
one-liner, you're still forced to use
   \pset csv_fieldsep '\t'

Would it be possible to do one of the following to enable a more compact 
one-liner syntax:


1. add an option:
\gfmt tsv   --> use a TAB instead of a comma in the csv

or

2. let the psql command-line option '--csv' honour the value given by  
psql -F/--field-separator (it does not do so now)


or

3. add a psql command-line option:
--csv-field-separator

Any of these three (I'd prefer the first) would make producing a tsv in 
shell one-liners with psql easier/more compact.
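
For illustration, the difference in a shell one-liner (the \gfmt tsv
syntax of suggestion 1 is hypothetical, and 't' is a stand-in table):

    # today's workaround: two commands needed
    printf '%s\n' "\\pset csv_fieldsep '\\t'" 'SELECT * FROM t \gfmt csv' \
        | psql -X -d mydb

    # with suggestion 1:
    printf '%s\n' 'SELECT * FROM t \gfmt tsv' | psql -X -d mydb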



Thanks,


Erik Rijkers




Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Tom Lane
I wrote:
> I had a thought about a possibly-cleaner way to do this.  We could invent
> a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that
> explicitly releases all plancache pins it knows about.  So plpgsql
> would not call the regular ResourceOwnerRelease entry point at all,
> but just call that and then ResourceOwnerDelete, again relying on the
> assertions therein to catch anything not released.

Here's a version that does it like that.  This does seem marginally
nicer than the other way.  I have a feeling that at some point we'll
want to expose resowners' contents more generally, but I'm not quite
sure what the use-cases will be, so I don't want to design that now.

Also, I studied the question of DO blocks' eval_estate + resowner
more carefully, and eventually concluded that the way it's being
done is okay --- it doesn't leak memory, as I'd first suspected.
But it's surely underdocumented, so I added some comments about it.
I also concluded as part of that study that it's probably best if
we *do* make the resowner parentage different in the two cases
after all.  So this has the "shared" resowner as a child of
TopTransactionResourceOwner after all (which means we don't need
to delete it explicitly), while a DO block's private resowner is
standalone (so it needs an explicit deletion).

Testing that reminded me of the other regression test failure I'd seen
when I first tried to do it: select_parallel.sql shows a WARNING about
a plancache leak in a parallel worker process.  When I looked into the
reason for that, it turned out that some cowboy has split
XACT_EVENT_COMMIT into XACT_EVENT_COMMIT and
XACT_EVENT_PARALLEL_COMMIT (where the latter is used in parallel
workers) without bothering to fix the collateral damage to plpgsql.
So plpgsql_xact_cb isn't doing any cleanup in parallel workers, and
hasn't been for a couple of releases.  The bad effects of that are
probably limited given that the worker process will exit after
committing, but I still think that that part of this patch is a bug
fix that needs to be back-patched.  (Just looking at what
FreeExecutorState does, I wonder whether jit_release_context has any
side-effects that are visible outside the process?  But I bet I can
make a test case that shows issues even without JIT, based on the
failure to call ExprContext shutdown callbacks.)

Anyway, I think this is committable now.

regards, tom lane

diff --git a/src/pl/plpgsql/src/Makefile b/src/pl/plpgsql/src/Makefile
index 70a9c34..193df8a 100644
--- a/src/pl/plpgsql/src/Makefile
+++ b/src/pl/plpgsql/src/Makefile
@@ -33,8 +33,8 @@ DATA = plpgsql.control plpgsql--1.0.sql
 REGRESS_OPTS = --dbname=$(PL_TESTDB)
 
 REGRESS = plpgsql_call plpgsql_control plpgsql_copy plpgsql_domain \
-	plpgsql_record plpgsql_cache plpgsql_transaction plpgsql_trap \
-	plpgsql_trigger plpgsql_varprops
+	plpgsql_record plpgsql_cache plpgsql_simple plpgsql_transaction \
+	plpgsql_trap plpgsql_trigger plpgsql_varprops
 
 # where to find gen_keywordlist.pl and subsidiary files
 TOOLSDIR = $(top_srcdir)/src/tools
diff --git a/src/pl/plpgsql/src/expected/plpgsql_simple.out b/src/pl/plpgsql/src/expected/plpgsql_simple.out
new file mode 100644
index 0000000..5a2fefa
--- /dev/null
+++ b/src/pl/plpgsql/src/expected/plpgsql_simple.out
@@ -0,0 +1,68 @@
+--
+-- Tests for plpgsql's handling of "simple" expressions
+--
+-- Check that changes to an inline-able function are handled correctly
+create function simplesql(int) returns int language sql
+as 'select $1';
+create function simplecaller() returns int language plpgsql
+as $$
+declare
+  sum int := 0;
+begin
+  for n in 1..10 loop
+sum := sum + simplesql(n);
+if n = 5 then
+  create or replace function simplesql(int) returns int language sql
+  as 'select $1 + 100';
+end if;
+  end loop;
+  return sum;
+end$$;
+select simplecaller();
+ simplecaller 
+--
+  555
+(1 row)
+
+-- Check that changes in search path are dealt with correctly
+create schema simple1;
+create function simple1.simpletarget(int) returns int language plpgsql
+as $$begin return $1; end$$;
+create function simpletarget(int) returns int language plpgsql
+as $$begin return $1 + 100; end$$;
+create or replace function simplecaller() returns int language plpgsql
+as $$
+declare
+  sum int := 0;
+begin
+  for n in 1..10 loop
+sum := sum + simpletarget(n);
+if n = 5 then
+  set local search_path = 'simple1';
+end if;
+  end loop;
+  return sum;
+end$$;
+select simplecaller();
+ simplecaller 
+--
+  555
+(1 row)
+
+-- try it with non-volatile functions, too
+alter function simple1.simpletarget(int) immutable;
+alter function simpletarget(int) immutable;
+select simplecaller();
+ simplecaller 
+--
+  555
+(1 row)
+
+-- make sure flushing local caches changes nothing
+\c -
+select simplecaller();
+ simplecaller 
+--
+  555
+(1 row)
+
diff --git 

Re: backup manifests

2020-03-26 Thread Robert Haas
On Thu, Mar 26, 2020 at 12:34 PM Stephen Frost  wrote:
> I do agree with excluding things like md5 and others that aren't good
> options.  I wasn't saying we should necessarily exclude crc32c either..
> but rather saying that it shouldn't be the default.
>
> Here's another way to look at it- where do we use crc32c today, and how
> much data might we possibly be covering with that crc?

WAL record size is a 32-bit unsigned integer, so in theory, up to 4GB
minus 1 byte. In practice, most of them are not more than a few
hundred bytes, but the amount we might possibly be covering is a lot more.

> Why was crc32c
> picked for that purpose?

Because it was discovered that 64-bit CRC was too slow, per commit
21fda22ec46deb7734f793ef4d7fa6c226b4c78e.

> If the individual who decided to pick crc32c
> for that case was contemplating a checksum for up-to-1GB files, would
> they have picked crc32c?  Seems unlikely to me.

It's hard to be sure what someone who isn't us would have done in some
situation that they didn't face, but we do have the discussion thread:

https://www.postgresql.org/message-id/flat/9291.1117593389%40sss.pgh.pa.us#c4e413bbf3d7fbeced7786da1c3aca9c

The question of how much data is protected by the CRC was discussed,
mostly in the first few messages, in general terms, but it doesn't
seem to have covered the question very thoroughly. I'm sure we could
each draw things from that discussion that support our view of the
situation, but I'm not sure it would be very productive.

What confuses me is that you seem to have a view of the upsides and
downsides of these various algorithms that seems to me to be highly
skewed. Like, suppose we change the default from CRC-32C to
SHA-something. On the upside, the error detection rate will increase
from 99.999+% to something much closer to 100%. On the downside,
backups will get as much as 40-50% slower for some users. I hope we
can agree that both detecting errors and taking backups quickly are
important. However, it is hard for me to imagine that the typical user
would want to pay even a 5-10% performance penalty when taking a
backup in order to improve an error detection feature which they may
not even use and which already has less than a one-in-a-billion chance
of going wrong. We routinely reject features for causing, say, a 2%
regression on general workloads. Base backup speed is probably less
important than how many SELECT or INSERT queries you can pump through
the system in a second, but it's still a pain point for lots of
people. I think if you said to some users "hey, would you like to have
error detection for your backups? it'll cost 10%" many people would
say "yes, please." But I think if you went to the same users and said
"hey, would you like to make the error detection for your backups
better? it currently has a less than 1-in-a-billion chance of failing
to detect random corruption, and you can reduce that by many orders of
magnitude for an extra 10% on your backup time," I think the results
would be much more mixed. Some people would like it, but certainly
not everybody.

> I'm not actually arguing about which hash functions we should support,
> but rather what the default is and if crc32c, specifically, is actually
> a reasonable choice.  Just because it's fast and we already had an
> implementation of it doesn't justify its use as the default.  Given that
> it doesn't actually provide the check that is generally expected of
> CRC checksums (100% detection of single-bit errors) when the file size
> gets over 512MB makes me wonder if we should have it at all, yes, but it
> definitely makes me think it shouldn't be our default.

I mean, the property that I care about is the one where it detects
better than 999,999,999 errors out of every 1,000,000,000, regardless
of input length.
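
As a back-of-the-envelope check, assuming corruption outside CRC-32C's
guaranteed-detection classes behaves like a uniformly random 32-bit
checksum match:

    P(undetected) = 2^-32 = 1 / 4,294,967,296 ~ 2.33e-10

i.e. roughly one miss in 4.3 billion random corruption events, which is
where the "less than one in a billion" framing above comes from.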

> I don't agree with limiting our view to only those algorithms that we've
> already got implemented in PG.

I mean, opening that giant can of worms ~2 weeks before feature freeze
is not very nice. This patch has been around for months, and the
algorithms were openly discussed a long time ago. I checked and found
out that the CRC-64 code was nuked in commit
404bc51cde9dce1c674abe4695635612f08fe27e, so in theory we could revert
that, but how much confidence do we have that the code in question
actually did the right thing, or that it's actually fast? An awful lot
of work has been done on the CRC-32C code over the years, including
several rounds of speeding it up
(f044d71e331d77a0039cec0a11859b5a3c72bc95,
3dc2d62d0486325bf263655c2d9a96aee0b02abe) and one round of fixing it
because it was producing completely wrong answers
(5028f22f6eb0579890689655285a4778b4ffc460), so I don't have a lot of
confidence about that CRC-64 code being totally without problems.

The commit message for that last commit,
5028f22f6eb0579890689655285a4778b4ffc460, seems pretty relevant in
this context, too. It observes that, because it "does not correspond
to any bit-wise CRC calculation" it is 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Justin Pryzby
> I included your new solution regarding this part from 0004 into 0001. It
> seems that at least a tip of the problem was in that we tried to change
> tablespace to pg_default being already there.

Right, causing it to try to drop that filenode twice.

> +++ b/doc/src/sgml/ref/cluster.sgml
> +  The name of a specific tablespace to store clustered relations.

Could you phrase these like you did in the comments:
" the name of the tablespace where the clustered relation is to be rebuilt."

> +++ b/doc/src/sgml/ref/reindex.sgml
> +  The name of a specific tablespace to store rebuilt indexes.

" The name of a tablespace where indexes will be rebuilt"

> +++ b/doc/src/sgml/ref/vacuum.sgml
> +  The name of a specific tablespace to write a new copy of the table.

> +  This specifies a tablespace, where all rebuilt indexes will be created.

say "specifies the tablespace where", with no comma.

> + else if (!OidIsValid(classtuple->relfilenode))
> + {
> + /*
> +  * Skip all mapped relations.
> +  * relfilenode == 0 checks after that, similarly to
> +  * RelationIsMapped().

I would say "OidIsValid(relfilenode) checks for that, ..."

> @@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
>   * and error messages should refer to the operation as VACUUM not CLUSTER.
>   */
>  void
> -cluster_rel(Oid tableOid, Oid indexOid, int options)
> +cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int options)

Add a comment here about the tablespaceOid parameter, like the other functions
where it's added.

The permission checking is kind of duplicative, so I'd suggest factoring it
out.  Ideally we'd only have one place that checks for pg_global/system/mapped.
It needs to check that it's not a system relation, or that system_table_mods
are allowed, and in any case that if it's a mapped rel, that it's not being
moved.  I would pass a boolean indicating if the tablespace is being changed.
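
Roughly this shape is what I have in mind; a sketch only, with made-up
names and error texts rather than anything from the patch:

    /* Hypothetical factored-out permission/validity check. */
    static void
    check_relation_rewrite_tablespace(Relation rel, Oid tablespaceOid,
                                      bool set_tablespace)
    {
        if (IsSystemRelation(rel) && !allowSystemTableMods)
            ereport(ERROR,
                    (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                     errmsg("permission denied: \"%s\" is a system catalog",
                            RelationGetRelationName(rel))));

        if (set_tablespace && RelationIsMapped(rel))
            ereport(ERROR,
                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                     errmsg("cannot change tablespace of mapped relation \"%s\"",
                            RelationGetRelationName(rel))));

        if (set_tablespace && tablespaceOid == GLOBALTABLESPACE_OID)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("only shared relations can be placed in pg_global")));
    }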

Another issue is this:
> +VACUUM ( FULL [, ...] ) [ TABLESPACE <replaceable class="parameter">new_tablespace</replaceable> ] [ <replaceable class="parameter">table_and_columns</replaceable> [, ...] ]
As you mentioned in your v1 patch, in the other cases, "tablespace
[tablespace]" is added at the end of the command rather than in the middle.  I
wasn't able to make that work, maybe because "tablespace" isn't a fully
reserved word (?).  I didn't try with "SET TABLESPACE", although I understand
it'd be better without "SET".

-- 
Justin




Re: proposal \gcsv

2020-03-26 Thread Pavel Stehule
On Thu, Mar 26, 2020 at 18:55, Vik Fearing wrote:

> On 3/26/20 10:49 AM, Pavel Stehule wrote:
> > Hi
> >
> > On Thu, Mar 26, 2020 at 17:45, Vik Fearing wrote:
> >
> >> After some opinions on the first issue and fixing the second, I think
> >> this is good to be committed.
> >>
> >
> > Thank you for review
>
> This patch now looks good to me.  Marking as Ready for Committer.
>

Thank you very much

Pavel

-- 
> Vik Fearing
>


Re: proposal \gcsv

2020-03-26 Thread Vik Fearing
On 3/26/20 10:49 AM, Pavel Stehule wrote:
> Hi
> 
> On Thu, Mar 26, 2020 at 17:45, Vik Fearing wrote:
> 
>> After some opinions on the first issue and fixing the second, I think
>> this is good to be committed.
>>
> 
> Thank you for review

This patch now looks good to me.  Marking as Ready for Committer.
-- 
Vik Fearing




Re: proposal \gcsv

2020-03-26 Thread Pavel Stehule
Hi

On Thu, Mar 26, 2020 at 17:45, Vik Fearing wrote:

> On 3/24/20 3:02 AM, Pavel Stehule wrote:
> > Hi
> >
> > rebase
>
> Thank you, Pavel.
>
> I have now had time to review it, and it looks good to me except for two
> issues.
>
> The first is, even though I suggested gf, I think it should actually be
> gfmt.  There may be something else in the future that starts with f and
> we shouldn't close ourselves off to it.
>

renamed to \gfmt


> The second is tab completion doesn't work for the second argument.
> Adding the following fixes that:
>
> diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
> index ed6945a7f12..9d8cf442972 100644
> --- a/src/bin/psql/tab-complete.c
> +++ b/src/bin/psql/tab-complete.c
> @@ -3786,6 +3786,12 @@ psql_completion(const char *text, int start, int
> end)
> COMPLETE_WITH_CS("aligned", "asciidoc", "csv", "html",
> "latex",
>  "latex-longtable",
> "troff-ms", "unaligned",
>  "wrapped");
> +   else if (TailMatchesCS("\\gf", MatchAny))
> +   {
> +   completion_charp = "\\";
> +   completion_force_quote = false;
> +   matches = rl_completion_matches(text, complete_from_files);
> +   }
>
> else if (TailMatchesCS("\\h|\\help"))
> COMPLETE_WITH_LIST(sql_commands);
>
>
merged


> After some opinions on the first issue and fixing the second, I think
> this is good to be committed.
>

Thank you for review

Pavel

-- 
> Vik Fearing
>
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 402c4b1b26..ba480c7bb6 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -2199,6 +2199,16 @@ CREATE INDEX
 
   
 
+  
+\gfmt format [ filename ]
+\gfmt format [ |command ]
+
+
+\gfmt is equivalent to \g, but
+forces the specified format.  See \pset format.
+
+
+  
 
   
 \gset [ prefix ]
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index abb18a19c2..d724a5eb16 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -163,6 +163,8 @@ static void printGSSInfo(void);
 static bool printPsetInfo(const char *param, struct printQueryOpt *popt);
 static char *pset_value_string(const char *param, struct printQueryOpt *popt);
 
+static bool format_number(const char *value, int vallen, enum printFormat *numformat);
+
 #ifdef WIN32
 static void checkWin32Codepage(void);
 #endif
@@ -333,7 +335,8 @@ exec_command(const char *cmd,
 		status = exec_command_errverbose(scan_state, active_branch);
 	else if (strcmp(cmd, "f") == 0)
 		status = exec_command_f(scan_state, active_branch);
-	else if (strcmp(cmd, "g") == 0 || strcmp(cmd, "gx") == 0)
+	else if (strcmp(cmd, "g") == 0 || strcmp(cmd, "gx") == 0 ||
+			 strcmp(cmd, "gfmt") == 0)
 		status = exec_command_g(scan_state, active_branch, cmd);
 	else if (strcmp(cmd, "gdesc") == 0)
 		status = exec_command_gdesc(scan_state, active_branch);
@@ -1282,6 +1285,8 @@ exec_command_f(PsqlScanState scan_state, bool active_branch)
 /*
  * \g [filename] -- send query, optionally with output to file/pipe
  * \gx [filename] -- same as \g, with expanded mode forced
+ * \gfmt format [filename] -- send result in specified format to
+ * file/pipe.
  */
 static backslashResult
 exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd)
@@ -1290,7 +1295,42 @@ exec_command_g(PsqlScanState scan_state, bool active_branch, const char *cmd)
 
 	if (active_branch)
 	{
-		char	   *fname = psql_scan_slash_option(scan_state,
+		char	   *fname;
+
+		if (strcmp(cmd, "gfmt") == 0)
+		{
+			char	   *fmtname = psql_scan_slash_option(scan_state,
+		 OT_NORMAL, NULL, false);
+
+			if (!fmtname)
+			{
+pg_log_error("no format name");
+
+return PSQL_CMD_ERROR;
+			}
+			else
+			{
+enum printFormat	format;
+bool		result;
+
+result = format_number(fmtname, strlen(fmtname), &format);
+
+free(fmtname);
+
+if (result)
+{
+	pset.gfmt_format = format;
+}
+else
+{
+	pg_log_error("\\pset: allowed formats are aligned, asciidoc, csv, html, latex, latex-longtable, troff-ms, unaligned, wrapped");
+
+	return PSQL_CMD_ERROR;
+}
+			}
+		}
+
+		fname  = psql_scan_slash_option(scan_state,
    OT_FILEPIPE, NULL, false);
 
 		if (!fname)
@@ -3777,6 +3817,68 @@ _unicode_linestyle2string(int linestyle)
 	return "unknown";
 }
 
+/*
+ * Returns true if format name was recognized.
+ */
+static bool
+format_number(const char *value, int vallen, enum printFormat *numformat)
+{
+	static const struct fmt
+	{
+		const char *name;
+		enum printFormat number;
+	}			formats[] =
+	{
+		/* remember to update error message below when adding more */
+		{"aligned", PRINT_ALIGNED},
+		{"asciidoc", PRINT_ASCIIDOC},
+		{"csv", PRINT_CSV},
+		{"html", PRINT_HTML},
+		

Re: backup manifests

2020-03-26 Thread Mark Dilger



> On Mar 26, 2020, at 9:34 AM, Stephen Frost  wrote:
> 
> I'm not actually arguing about which hash functions we should support,
> but rather what the default is and if crc32c, specifically, is actually
> a reasonable choice.  Just because it's fast and we already had an
> implementation of it doesn't justify its use as the default.  Given that
> it doesn't actually provide the check that is generally expected of
> CRC checksums (100% detection of single-bit errors) when the file size
> gets over 512MB makes me wonder if we should have it at all, yes, but it
> definitely makes me think it shouldn't be our default.

I don't understand your focus on the single-bit error issue.  If you are 
sending your backup across the wire, single bit errors during transmission 
should already be detected as part of the networking protocol.  The real issue 
has to be detection of the kinds of errors or modifications that are most 
likely to happen in practice.  Which are those?  People manually mucking with 
the files?  Bugs in backup scripts?  Corruption on the storage device?  
Truncated files?  The more bits in the checksum (assuming a well designed 
checksum algorithm), the more likely we are to detect accidental modification, 
so it is no surprise if a 64-bit crc does better than 32-bit crc.  But that 
logic can be taken arbitrarily far.  I don't see the connection between, on the 
one hand, an analysis of single-bit error detection against file size, and on 
the other hand, the verification of backups.

From a support perspective, I think the much more important issue is making 
certain that checksums are turned on.  A one in a billion chance of missing an 
error seems pretty acceptable compared to the, let's say, one in two chance 
that your customer didn't use checksums.  Why are we even allowing this to be 
turned off?  Is there a usage case compelling that option?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Tom Lane
Amit Langote  writes:
> One thing -- I don't get the division between
> CachedPlanAllowsSimpleValidityCheck() and CachedPlanIsSimplyValid().
> Maybe I am missing something, but could there not be just one
> function, possibly using whether expr_simple_expr is set or not to
> skip or do, resp., the checks that the former does?

Well, we don't want to do the initial checks over again every time;
we want the is-valid test to be as simple and fast as we can make it.
I suppose we could have one function with a boolean flag saying "this is a
recheck", but I don't find that idea to be any better than the way it is.

Also, although the existing structure in plpgsql always calls
CachedPlanIsSimplyValid immediately after a successful call to
CachedPlanAllowsSimpleValidityCheck, I don't think that's necessarily
going to be true for other potential users of the functions.
So merging the functions would reduce flexibility.
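
For illustration, the intended call pattern looks roughly like this
sketch (the field and variable names only approximate the patch's):

    /* once, when the simple expression's generic plan is first obtained */
    if (CachedPlanAllowsSimpleValidityCheck(plansource, cplan,
                                            estate->simple_eval_resowner))
        expr->expr_simple_plan = cplan;     /* remember it for the fast path */

    /* ... then, on each subsequent execution ... */
    if (expr->expr_simple_plan != NULL &&
        CachedPlanIsSimplyValid(plansource, expr->expr_simple_plan,
                                estate->simple_eval_resowner))
    {
        /* fast path: plan still valid, no locking or refcount churn */
    }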

regards, tom lane




Re: A bug when use get_bit() function for a long bytea string

2020-03-26 Thread Ashutosh Bapat
On Thu, 26 Mar 2020 at 19:39, Tom Lane  wrote:

> Ashutosh Bapat  writes:
> > On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca 
> > wrote:
> >> if we change return type of all those functions to int64, we won't need
> >> this cast.
> >> I change the 'encode' function, it needs an int64 return type, but keep
> >> other
> >> functions in 'pg_encoding', because I think it of no necessary reason.
>
> > Ok, let's leave it for a committer to decide.
>
> If I'm grasping the purpose of these correctly, wouldn't Size or size_t
> be a more appropriate type?


Andy had used Size in his earlier patch. But I didn't understand the reason
behind it and Andy didn't give any reason. From the patch and the code
around the changes some kind of int (so int64) looked better. But if
there's a valid reason for using Size, I am fine with it too. Do we have a
SQL datatype corresponding to Size?

-- 
Best Wishes,
Ashutosh


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 02:40, Justin Pryzby wrote:

On Thu, Mar 12, 2020 at 08:08:46PM +0300, Alexey Kondratov wrote:

On 09.03.2020 23:04, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote:
tests for that.  (I'm including your v8 untouched in hopes of not
messing up the cfbot).  My fixes avoid an issue if you try to REINDEX
onto pg_default, I think due to moving system toast indexes.

I was able to avoid this issue by adding a call to GetNewRelFileNode,
even though that's already called by RelationSetNewRelfilenode().  Not
sure if there's a better way, or if it's worth Alexey's v3 patch which
added a tablespace param to RelationSetNewRelfilenode.


Do you have any understanding of what exactly causes this error? I have
tried to debug it a little bit, but still cannot figure out why we need
this extra GetNewRelFileNode() call, or the mechanism by which it helps.

The PANIC is from the smgr hashtable, which couldn't find an entry it
expected.  My very tentative understanding is that smgr is prepared to
handle a *relation* which is dropped/recreated multiple times in a
transaction, but it's *not* prepared to deal with a given
RelFileNode(Backend) being dropped/recreated, since that's used as a
hash key.

I revisited it and solved it in a somewhat nicer way.



I included your new solution regarding this part from 0004 into 0001.
It seems that at least part of the problem was that we tried to change
the tablespace to pg_default while already being there.




It's still not clear to me if there's an issue with your original way
of adding a tablespace parameter to RelationSetNewRelfilenode().



Yes, it is not clear for me too.



Many thanks for your review and fixups! There are some inconsistencies,
like mentions of SET TABLESPACE in error messages and so on. I am going
to refactor and include your fixes 0003-0004 into 0001 and 0002, but
keep 0005 separate for now, since this part requires more understanding
IMO (and comparison with the v4 implementation).


I'd suggest keeping the CLUSTER/VACUUM FULL parts separate from REINDEX,
in case Michael or someone else wants to progress one but cannot commit
to both.




Yes, sure, I did not have plans to melt everything into a single patch.

So, it has taken much longer to understand and rework all these fixes 
and permission validations. Attached is the updated patch set.


0001:
 — It is mostly the same, but refactored
 — I also included your most recent fix for REINDEX DATABASE with 
allow_system_table_mods=1
 — With this patch REINDEX + TABLESPACE simply errors out when an index
on a TOAST table is met and allow_system_table_mods=0


0002:
 — I reworked it a bit, since REINDEX CONCURRENTLY is not allowed on a
system catalog anyway; that is checked at the higher levels of statement
processing. So we have to care about TOAST relations

 — Also added the same check into the plain REINDEX
 — It works fine, but I am not entirely happy that with this patch 
errors/warnings are a bit inconsistent:


template1=# REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_12773_index 
TABLESPACE pg_default;

WARNING:  skipping tablespace change of "pg_toast_12773_index"
DETAIL:  Cannot move system relation, only REINDEX CONCURRENTLY is 
performed.


template1=# REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_12773 
TABLESPACE pg_default;

ERROR:  permission denied: "pg_toast_12773" is a system catalog

And REINDEX DATABASE CONCURRENTLY will generate a warning again.

Maybe we should always throw a warning and only do the reindex if it is
not possible to change the tablespace?


0003:
 — I have gotten rid of some of the previous refactoring pieces, like
check_relation_is_movable, for now. Let all these validations settle,
and then we can think about whether we could do it better

 — Added CLUSTER to copy/equalfuncs
 — Cleaned up messages and comments

I hope that I did not forget anything from your proposals.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 692bcadfacfa0c47fec7b6969525d33d0cac1f83 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 24 Mar 2020 18:16:06 +0300
Subject: [PATCH v12 3/3] Allow CLUSTER and VACUUM FULL to change tablespace

---
 doc/src/sgml/ref/cluster.sgml | 11 -
 doc/src/sgml/ref/vacuum.sgml  | 11 +
 src/backend/commands/cluster.c| 58 ---
 src/backend/commands/vacuum.c | 51 ++--
 src/backend/nodes/copyfuncs.c |  2 +
 src/backend/nodes/equalfuncs.c|  2 +
 src/backend/parser/gram.y | 45 --
 src/backend/postmaster/autovacuum.c   |  1 +
 src/include/commands/cluster.h|  2 +-
 src/include/commands/vacuum.h |  2 +
 src/include/nodes/parsenodes.h|  2 +
 src/test/regress/input/tablespace.source  | 23 -
 src/test/regress/output/tablespace.source | 37 

Tab completion for \gx

2020-03-26 Thread Vik Fearing
While reviewing the patch for \gf, I noticed that \gx does not have tab
completion for its optional filename.  Trivial patch attached.  I would
also suggest this be backpatched.
-- 
Vik Fearing
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ae35fa4aa9..7252b6c4e6 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3882,7 +3882,7 @@ psql_completion(const char *text, int start, int end)
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_routines, NULL);
 	else if (TailMatchesCS("\\sv*"))
 		COMPLETE_WITH_SCHEMA_QUERY(Query_for_list_of_views, NULL);
-	else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\i|\\include|"
+	else if (TailMatchesCS("\\cd|\\e|\\edit|\\g|\\gx|\\i|\\include|"
 		   "\\ir|\\include_relative|\\o|\\out|"
 		   "\\s|\\w|\\write|\\lo_import"))
 	{


Re: proposal \gcsv

2020-03-26 Thread Vik Fearing
On 3/24/20 3:02 AM, Pavel Stehule wrote:
> Hi
> 
> rebase

Thank you, Pavel.

I have now had time to review it, and it looks good to me except for two
issues.

The first is, even though I suggested gf, I think it should actually be
gfmt.  There may be something else in the future that starts with f and
we shouldn't close ourselves off to it.

The second is tab completion doesn't work for the second argument.
Adding the following fixes that:

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index ed6945a7f12..9d8cf442972 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -3786,6 +3786,12 @@ psql_completion(const char *text, int start, int end)
COMPLETE_WITH_CS("aligned", "asciidoc", "csv", "html",
"latex",
 "latex-longtable",
"troff-ms", "unaligned",
 "wrapped");
+   else if (TailMatchesCS("\\gf", MatchAny))
+   {
+   completion_charp = "\\";
+   completion_force_quote = false;
+   matches = rl_completion_matches(text, complete_from_files);
+   }

else if (TailMatchesCS("\\h|\\help"))
COMPLETE_WITH_LIST(sql_commands);


After some opinions on the first issue and fixing the second, I think
this is good to be committed.
-- 
Vik Fearing




Re: Patch: to pass query string to pg_plan_query()

2020-03-26 Thread Julien Rouhaud
On Thu, Mar 26, 2020 at 11:44:44AM -0400, Tom Lane wrote:
> Fujii Masao  writes:
> > Does anyone object to this patch? I'm thinking to commit it separetely
> > at first before committing the planning_counter_in_pg_stat_statements
> > patch.
> 
> I took a quick look through v9-0001-Pass-query-string-to-the-planner.patch
> and it's fine by me.  It also matches up with something I've wanted to
> do for awhile, which is to make the query string available during
> planning and execution so that we can produce error cursors for
> run-time errors, when relevant.
> 
> (It's a little weird that the patch doesn't make standard_planner
> actually *do* anything with the string, like say save it into
> the PlannerInfo struct.  But that can come later I guess.)
> 
> Note that I wouldn't want to bet that all of these call sites always have
> non-null query strings to pass; but probably most of the time they will.

Surprisingly, the whole regression test suite passes flawlessly with a
non-null query string assert, but we did have some discussion about it.
The pending IVM patch would break that assumption, as would some
non-trivial extensions like
citus (see
https://www.postgresql.org/message-id/flat/CAFMSG9HJQr%3DH8doWJOp%3DwqyKbVqxMLkk_Qu2KfpmkKvS-Xn7qQ%40mail.gmail.com#ab8ea541b8c8464f7b52ba6d8d480b7d
and later), so we didn't make it a hard requirement.




Re: Make mesage at end-of-recovery less scary.

2020-03-26 Thread Robert Haas
On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
 wrote:
> HINT:  This is to be expected if this is the end of the WAL.  Otherwise,
> it could indicate corruption.

First, I agree that this general issue is a problem, because it's come
up for me in quite a number of customer situations. Either people get
scared when they shouldn't, because the message is innocuous, or they
don't get scared about other things that actually are scary, because
if some scary-looking messages are actually innocuous, it can lead
people to believe that the same is true in other cases.

Second, I don't really like the particular formulation you have above,
because the user still doesn't know whether or not to be scared. Can
we figure that out? I think if we're in crash recovery, I think that
we should not be scared, because we have no alternative to assuming
that we've reached the end of WAL, so all crash recoveries will end
like this. If we're in archive recovery, we should definitely be
scared if we haven't yet reached the minimum recovery point, because
more WAL than that should certainly exist. After that, it depends on
how we got the WAL. If it's being streamed, the question is whether
we've reached the end of what got streamed. If it's being copied from
the archive, we ought to have the whole segment, but maybe not more.
Can we get the right context to the point where the error is being
reported to know whether we hit the error at the end of the WAL that
was streamed? If not, can we somehow rejigger things so that we only
make it sound scary if we keep getting stuck at the same point when we
would've expected to make progress meanwhile?
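
To spell out the decision table I have in mind, a hypothetical sketch
(none of these predicates necessarily exist in this form today):

    /* How loudly should we report an invalid record at end of WAL? */
    static int
    end_of_wal_message_level(bool in_crash_recovery,
                             bool reached_min_recovery_point,
                             bool streaming, bool at_streamed_end)
    {
        if (in_crash_recovery)
            return LOG;         /* every crash recovery ends this way */
        if (!reached_min_recovery_point)
            return WARNING;     /* more WAL than this must exist: be scared */
        if (streaming && at_streamed_end)
            return LOG;         /* simply ran off the end of streamed WAL */
        return WARNING;         /* retries may fix it, but could be corruption */
    }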

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-03-26 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost  wrote:
> > > That's a fair argument, but I think the other relevant principle is
> > > that we try to give people useful defaults for things. I think that
> > > checksums are a sufficiently useful thing that having the default be
> > > not to do it doesn't make sense. I had the impression that you and
> > > David were in agreement on that point, actually.
> >
> > I agree with wanting to have useful defaults and that checksums should
> > be included by default, and I'm alright even with letting people pick
> > what algorithms they'd like to have too.  The construct here is made odd
> > because we've got this idea that "no checksum" is an option, which is
> > actually something that I don't particularly like, but that's what's
> > making this particular syntax weird.  I don't suppose you'd be open to
> > the idea of just dropping that though..?  There wouldn't be any issue
> > with this syntax if we just always had checksums included when a
> > manifest is requested. :)
> >
> > Somehow, I don't think I'm going to win that argument.
> 
> Well, it's not a crazy idea. So, at some point, I had the idea that
> you were always going to get a manifest, and therefore you should at
> least ought to have the option of not checksumming to avoid the
> overhead. But, as things stand now, you can suppress the manifest
> altogether, so that you can still take a backup even if you've got no
> disk space to spool the manifest on the master. So, if you really want
> no overhead from manifests, just don't have a manifest. And if you are
> OK with some overhead, why not at least have a CRC-32C checksum, which
> is, after all, pretty cheap?
> 
> Now, on the other hand, I don't have any strong evidence that the
> manifest-without-checksums mode is useless. You can still use it to
> verify that you have the correct files and that those files have the
> expected sizes. And, verifying those things is very cheap, because you
> only need to stat() each file, not open and read them all. True, you
> can do those things by using pg_validatebackup -s. But, you'd still
> incur the (admittedly fairly low) overhead of computing checksums that
> you don't intend to use.
> 
> This is where I feel like I'm trying to make decisions in a vacuum. If
> we had a few more people weighing in on the thread on this point, I'd
> be happy to go with whatever the consensus was. If most people think
> having both --no-manifest (suppressing the manifest completely) and
> --manifest-checksums=none (suppressing only the checksums) is useless
> and confusing, then sure, let's rip the latter one out. If most people
> like the flexibility, let's keep it: it's already implemented and
> tested. But I hate to base the decision on what one or two people
> think.

I'm frustrated at the lack of involvement from others also.

Just to be clear- I'm not completely against having a 'manifest but no
checksum' option, but if that's what we're going to have then it seems
like the syntax should be such that if you don't specify checksums then
you don't get checksums and "MANIFEST_CHECKSUM none" shouldn't be a
thing.
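
For reference, the three behaviors under discussion look like this from
pg_basebackup (option names as in the current patch, so subject to
change):

    # default: manifest with CRC-32C per-file checksums
    pg_basebackup -D backupdir
    # manifest, but no per-file checksums
    pg_basebackup -D backupdir --manifest-checksums=none
    # no manifest at all
    pg_basebackup -D backupdir --no-manifest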

All that said, as I said up-thread, I appreciate that we aren't
designing SQL here and that this is pretty special syntax to begin with,
so if you ended up committing it the way you have it now, so be it, I
wouldn't be asking for it to be reverted over this.  It's a bit awkward
and kind of a thorn, but it's not entirely unreasonable, and we'd
probably end up there anyway if we started out without a 'none' option
and someone did come up with a good argument and a patch to add such an
option in the future.

> > > Well, the 512MB "limit" for CRC-32C means only that for certain very
> > > specific types of errors, detection is not guaranteed above that file
> > > size. So if you have a single flipped bit, for example, and the file
> > > size is greater than 512MB, then CRC-32C has only a 99.999767169%
> > > chance of detecting the error, whereas if the file size is less than
> > > 512MB, it is 100% certain, because of the design of the algorithm. But
> > > nine nines is plenty, and neither SHA nor our page-level checksums
> > > provide guaranteed error detection properties anyway.
> >
> > Right, so we know that CRC-32C has an upper-bound of 512MB to be useful
> > for exactly what it's designed to be useful for, but we also know that
> > we're going to have larger files- at least 1GB ones, and quite possibly
> > larger, so why are we choosing this?
> >
> > At the least, wouldn't it make sense to consider a larger CRC, one whose
> > limit is above the size of commonly expected files, if we're going to
> > use a CRC?
> 
> I mean, you're just repeating the same argument here, and it's just
> not valid. Regardless of the file size, the chances of a false
> checksum match are literally less than one in a billion. There is
> every reason to believe that users will 

Re: [PATCH] Fix CommitTransactionCommand() to CallXactCallbacks() in TBLOCK_ABORT_END

2020-03-26 Thread Gilles Darold
Le 25/03/2020 à 03:49, Dave Sharpe a écrit :
>
> Hi pghackers,
>
> This is my first time posting here ...  Gilles Darold and I are
> developing a new FDW which is based on the contrib/postgres_fdw. The
> postgres_fdw logic uses a RegisterXactCallback function to send local
> transactions remote. But I found that a registered XactCallback is not
> always called for a successful client ROLLBACK statement. So, a
> successful local ROLLBACK does not get sent remote by FDW, and now the
> local and remote transaction states are messed up, out of sync. The
> local database is "eating" the successful rollback.
>
>  
>
> I attach a git format-patch file
> 0001-Fix-CommitTransactionCommand-to-CallXactCallbacks-in.patch
>
> The patch fixes the problem and is ready to commit as far as I can
> tell. It adds some comment lines and three lines of code to
> CommitTransactionCommand() in the TBLOCK_ABORT_END case. Line (1) to
> reset the transaction's blockState back to TBLOCK_ABORT, ahead of (2)
> a new call to callXactCallbacks(). If the callback returns successful
> (which is usually the case), (3) the new code switches back to
> TBLOCK_ABORT_END, then continues with old code to CleanupTransaction()
> as before. If any callback does error out, the TBLOCK_ABORT was the
> correct block state for an error.
>
>  
>
> I have not added a regression test since my scenario involves a C
> module ... I didn't see such a regression test, but somebody can teach
> me where to put the C module and Makefile if a regression test should
> be added. Heads up that the scenario to hit this is a bit complex (to
> me) ... I attach the following unit test files:
>
> 1) eat_rollback.c, _PG_init() installs my_utility_hook() for INFO
> logging, for debugging.
>
> RegisterSubXactCallback(mySubtransactionCallback) which injects some
> logging and an ERROR for savepoints, i.e., negative testing, e.g.,
> like a remote FDW savepoint failed.
>
> And RegisterXactTransaction(myTransactionCallback) with info logging.
>
> 2) Makefile, to make the eat_rollback.so
>
> 3) eat_rollback2.sql, drives the fail sequence, especially the
> SAVEPOINT, which errors out, then later a successful ROLLBACK gets
> incorrectly eaten (no CALLBACK info logging, demonstrates the bug),
> then another successful ROLLBACK (now there is CALLBACK info logging).
>
> 4) eat_rollback2.out, output without the fix, the rollback at line 68
> is successful but there is not myTransactionCallback() INFO output
>
> 5) eat_rollback2.fixed, output after the fix, the rollback at line 68
> is successful, and now there is a myTransactionCallback() INFO log.
> Success!
>
>  
>
> It first failed for me in v11.1; I suspect it failed before then
> too. And it is still failing in current master.
>
>  
>
> By the way, we worked around the bug in our FDW by handling sub/xact
> in the utility hook. A transaction callback is still needed for
> implicit, internal rollbacks. Another advantage of the workaround is
> (I think) it reduces the code complexity and improves performance
> because the entire subxact stack is not unwound to drive each and
> every subtransaction level to remote. Also sub/transaction statements
> are sent remote as they arrive (local and remote are more
> "transactionally" synced, not stacked by FDW for later). And of
> course, the workaround doesn't hit the above bug, since our utility
> hook correctly handles the client ROLLBACK. If it makes sense to the
> community, I could try and patch contrib/postgres_fdw to do what we
> are doing. But note that I don't need it myself: we have our own new
> FDW for remote DB2 z/OS (!) ... at LzLabs we are a building a software
> defined mainframe with PostgreSQL (of all things).
>
>  
>
> Hope it helps!
>
>  
>
> Dave Sharpe
>
> I don't necessarily agree with everything I say. (MM)
>
> www.lzlabs.com
>

Hi,


As I'm involved in this thread, I have reviewed this bug report, and I
don't think there is a problem here; as a disclaimer, though, my
knowledge of internal transaction management is probably not enough to
have a definitive opinion.


Actually the callback function is called when the error is thrown:

psql:eat_rollback2.sql:20: INFO:  0: myTransactionCallback()
XactEvent 2 (is abort) level 1 <---
LOCATION:  myTransactionCallback, eat_rollback.c:52
psql:eat_rollback2.sql:20: ERROR:  XX000: no no no
LOCATION:  mySubtransactionCallback, eat_rollback.c:65


This is probably why the callback is not called on the subsequent
ROLLBACK execution: abort processing is already done
(src/backend/access/transam/xact.c:3890).


With this simple test and the debug extension:

BEGIN;
DROP INDEX "index2", "index3"; -- will fail indexes doesn't exist
ROLLBACK;


the callback function is also called at error level, not on the ROLLBACK:


-BEGIN
psql:eat_rollback-2/sql/eat_rollback3.sql:39: INFO:  

Re: Columns correlation and adaptive query optimization

2020-03-26 Thread Konstantin Knizhnik



On 25.03.2020 20:04, Rafia Sabih wrote:


Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.



Attached please find a new version of the patch with more comments and
descriptions added.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..8ab95a1 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include 
+#include 
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -84,6 +106,7 @@ static void explain_ExecutorRun(QueryDesc *queryDesc,
 static void explain_ExecutorFinish(QueryDesc *queryDesc);
 static void explain_ExecutorEnd(QueryDesc *queryDesc);
 
+static void AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
 
 /*
  * Module load callback
@@ -218,6 +241,30 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+			 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+			 "Zero disables implicit creation of multicolumn statistic.",
+			 &auto_explain_add_statistics_threshold,
+			 0.0,
+			 0.0,
+			 INT_MAX,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+			 "Do not create statistic but just record in WAL suggested create statistics statement.",
+			 NULL,
+			 &auto_explain_suggest_only,
+			 false,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +396,276 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+/**
+ * Try to add multicolumn statistics for specified subplans.
+ */
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+/**
+ * Try to add multicolumn statistics for plan subnodes.
+ */
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+	   ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+/**
+ * Comparator used to sort Vars by name
+ */
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+/**
+ * Try to add multicolumn statistics for qual
+ */
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	/* Extract vars from all quals */
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+
+	/* Loop until we considered all vars */
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		/* Construct list of unique vars */
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+Var *var = (Var *) node;
+if (cols == NULL || var->varno == varno)
+{
+	varno = var->varno;
+	if (var->varattno > 0 &&
+		!bms_is_member(var->varattno, colmap) &&
+		varno >= 1 && /* not synthetic var */
+		

Re: Patch: to pass query string to pg_plan_query()

2020-03-26 Thread Tom Lane
Fujii Masao  writes:
> Does anyone object to this patch? I'm thinking to commit it separetely
> at first before committing the planning_counter_in_pg_stat_statements
> patch.

I took a quick look through v9-0001-Pass-query-string-to-the-planner.patch
and it's fine by me.  It also matches up with something I've wanted to
do for awhile, which is to make the query string available during
planning and execution so that we can produce error cursors for
run-time errors, when relevant.

(It's a little weird that the patch doesn't make standard_planner
actually *do* anything with the string, like say save it into
the PlannerInfo struct.  But that can come later I guess.)

Note that I wouldn't want to bet that all of these call sites always have
non-null query strings to pass; but probably most of the time they will.
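
For reference, as I read v9, the planner entry points gain a
query-string parameter roughly like this (a sketch, not the exact hunk):

    PlannedStmt *
    pg_plan_query(Query *querytree, const char *query_string,
                  int cursorOptions, ParamListInfo boundParams);

    PlannedStmt *
    planner(Query *parse, const char *query_string,
            int cursorOptions, ParamListInfo boundParams);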

regards, tom lane




Re: backup manifests

2020-03-26 Thread Robert Haas
On Wed, Mar 25, 2020 at 4:54 PM Stephen Frost  wrote:
> > That's a fair argument, but I think the other relevant principle is
> > that we try to give people useful defaults for things. I think that
> > checksums are a sufficiently useful thing that having the default be
> > not to do it doesn't make sense. I had the impression that you and
> > David were in agreement on that point, actually.
>
> I agree with wanting to have useful defaults and that checksums should
> be included by default, and I'm alright even with letting people pick
> what algorithms they'd like to have too.  The construct here is made odd
> because we've got this idea that "no checksum" is an option, which is
> actually something that I don't particularly like, but that's what's
> making this particular syntax weird.  I don't suppose you'd be open to
> the idea of just dropping that though..?  There wouldn't be any issue
> with this syntax if we just always had checksums included when a
> manifest is requested. :)
>
> Somehow, I don't think I'm going to win that argument.

Well, it's not a crazy idea. So, at some point, I had the idea that
you were always going to get a manifest, and therefore you should at
least ought to have the option of not checksumming to avoid the
overhead. But, as things stand now, you can suppress the manifest
altogether, so that you can still take a backup even if you've got no
disk space to spool the manifest on the master. So, if you really want
no overhead from manifests, just don't have a manifest. And if you are
OK with some overhead, why not at least have a CRC-32C checksum, which
is, after all, pretty cheap?

Now, on the other hand, I don't have any strong evidence that the
manifest-without-checksums mode is useless. You can still use it to
verify that you have the correct files and that those files have the
expected sizes. And, verifying those things is very cheap, because you
only need to stat() each file, not open and read them all. True, you
can do those things by using pg_validatebackup -s. But, you'd still
incur the (admittedly fairly low) overhead of computing checksums that
you don't intend to use.

This is where I feel like I'm trying to make decisions in a vacuum. If
we had a few more people weighing in on the thread on this point, I'd
be happy to go with whatever the consensus was. If most people think
having both --no-manifest (suppressing the manifest completely) and
--manifest-checksums=none (suppressing only the checksums) is useless
and confusing, then sure, let's rip the latter one out. If most people
like the flexibility, let's keep it: it's already implemented and
tested. But I hate to base the decision on what one or two people
think.

> > Well, the 512MB "limit" for CRC-32C means only that for certain very
> > specific types of errors, detection is not guaranteed above that file
> > size. So if you have a single flipped bit, for example, and the file
> > size is greater than 512MB, then CRC-32C has only a 99.999767169%
> > chance of detecting the error, whereas if the file size is less than
> > 512MB, it is 100% certain, because of the design of the algorithm. But
> > nine nines is plenty, and neither SHA nor our page-level checksums
> > provide guaranteed error detection properties anyway.
>
> Right, so we know that CRC-32C has an upper-bound of 512MB to be useful
> for exactly what it's designed to be useful for, but we also know that
> we're going to have larger files- at least 1GB ones, and quite possibly
> larger, so why are we choosing this?
>
> At the least, wouldn't it make sense to consider a larger CRC, one whose
> limit is above the size of commonly expected files, if we're going to
> use a CRC?

I mean, you're just repeating the same argument here, and it's just
not valid. Regardless of the file size, the chances of a false
checksum match are literally less than one in a billion. There is
every reason to believe that users will be happy with a low-overhead
method that has a 99.999+% chance of detecting corrupt files. I do
agree that a 64-bit CRC would probably be not much more expensive and
improve the probability of detecting errors even further, but I wanted
to restrict this patch to using infrastructure we already have. The
choices there are the various SHA functions (so I supported those),
MD5 (which I deliberately omitted, for reasons I hope you'll be the
first to agree with), CRC-32C (which is fast), a couple of other
CRC-32 variants (which I omitted because they seemed redundant and one
of them only ever existed in PostgreSQL because of a coding mistake),
and the hacked-up version of FNV that we use for page-level checksums
(which is only 16 bits and seems to have no advantages for this
purpose).
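
For reference, computing a file-level CRC-32C with the primitives we already
ship in src/include/port/pg_crc32c.h is just a read loop; a minimal sketch
(the buffer size and helper name here are illustrative, not from the patch):

    #include <stdio.h>
    #include "c.h"
    #include "port/pg_crc32c.h"

    /* Compute a CRC-32C over an already-opened file. */
    pg_crc32c
    checksum_file(FILE *fp)
    {
        pg_crc32c   crc;
        char        buf[65536];
        size_t      n;

        INIT_CRC32C(crc);
        while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
            COMP_CRC32C(crc, buf, n);
        FIN_CRC32C(crc);
        return crc;
    }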

> > I'm not sure why the fact that it's a 40-year-old algorithm is
> > relevant. There are many 40-year-old algorithms that are very good.
>
> Sure there are, but there probably wasn't a lot of thought about
> GB-sized files, and this doesn't really seem to be 

Re: error context for vacuum to include block number

2020-03-26 Thread Justin Pryzby
On Thu, Mar 26, 2020 at 08:56:54PM +0900, Masahiko Sawada wrote:
> 1.
> @@ -1844,9 +1914,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber
> blkno, Buffer buffer,
> int uncnt = 0;
> TransactionId visibility_cutoff_xid;
> boolall_frozen;
> +   LVRelStats  olderrcbarg;
> 
> pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
> 
> +   /* Update error traceback information */
> +   olderrcbarg = *vacrelstats;
> +   update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
> + blkno, NULL, false);
> 
> Since we update vacrelstats->blkno during the loop in
> lazy_vacuum_heap() we unnecessarily update blkno twice to the same
> value. Also I think we don't need to revert back the callback
> arguments in lazy_vacuum_page(). Perhaps we can either remove the
> change of lazy_vacuum_page() or move the code updating
> vacrelstats->blkno to the beginning of lazy_vacuum_page(). I prefer
> the latter.

We want the error callback to be in place during lazy_scan_heap, since it
calls ReadBufferExtended().

We can't remove the change in lazy_vacuum_page, since it's also called from
lazy_scan_heap, if there are no indexes.

We want lazy_vacuum_page to "revert back" since we go from "scanning heap" to
"vacuuming heap".  lazy_vacuum_page was the motivation for saving and restoring
the called arguments, otherwise lazy_scan_heap() would have to clean up after
the function it called, which was unclean.  Now, every function cleans up after
itself.

Does that address your comment ?

> +static void
> +update_vacuum_error_cbarg(LVRelStats *errcbarg, int phase, BlockNumber blkno,
> + char *indname, bool free_oldindname)
> 
> I'm not sure why "free_oldindname" is necessary. Since we initialize
> vacrelstats->indname with NULL and revert the callback arguments at
> the end of the functions that need to update them, vacrelstats->indname is
> NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index().
> And we make a copy of index name in update_vacuum_error_cbarg(). So I
> think we can pfree the old index name if errcbarg->indname is not NULL.

We want to avoid doing this:
 olderrcbarg = *vacrelstats; // saves a pointer
 update_vacuum_error_cbarg(... NULL); // frees the pointer and sets indname to NULL
 update_vacuum_error_cbarg(... olderrcbarg.indname); // puts back the pointer, which has been freed
 // hit an error, and the callback accesses the pfreed pointer

I think that's only an issue for lazy_vacuum_index().

And I think you're right: we only save state when the calling function has an
indname=NULL, so we never "put back" a non-NULL indname.  We go from having an
indname=NULL at lazy_scan_heap to a not-NULL one at lazy_vacuum_index, and never
the other way around.  So once we've "reverted back", 1) the pointer is null;
and, 2) the callback function doesn't access it for the previous/reverted phase
anyway.

Hm, I was just wondering what happens if an error happens *during*
update_vacuum_error_cbarg().  It seems like if we set
errcbarg->phase=VACUUM_INDEX before setting errcbarg->indname=indname, then an
error would cause a crash.  And if we pfree and set indname before phase, it'd
be a problem when going from an index phase to non-index phase.  So maybe we
have to set errcbarg->phase=VACUUM_ERRCB_PHASE_UNKNOWN while in the function,
and errcbarg->phase=phase last.
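
To sketch that ordering concretely (based on the patch under discussion; the
simplified signature without free_oldindname is an assumption on my part):

    /*
     * Hypothetical error-safe ordering: mark the phase unknown first so the
     * callback never reads a half-updated indname, and publish the real
     * phase only once the other fields are consistent again.
     */
    static void
    update_vacuum_error_cbarg(LVRelStats *errcbarg, int phase,
                              BlockNumber blkno, char *indname)
    {
        errcbarg->phase = VACUUM_ERRCB_PHASE_UNKNOWN;

        if (errcbarg->indname)
            pfree(errcbarg->indname);
        errcbarg->indname = indname ? pstrdup(indname) : NULL;
        errcbarg->blkno = blkno;

        errcbarg->phase = phase;
    }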

-- 
Justin




Re: Columns correlation and adaptive query optimization

2020-03-26 Thread Konstantin Knizhnik

Thank you very much for review.

On 25.03.2020 20:04, Rafia Sabih wrote:


+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it; you might want to
declare it with the other functions at the beginning of the file.

Also, there is no description about any of the functions here,
wouldn’t hurt having some more comments there.


Sorry, I will fix it.
Actually this patch consists of two independent parts:
the first allows the auto_explain extension to generate multicolumn 
statistics for variables used in clauses 
for which selectivity estimation gives a wrong result. It affects only 
the auto_explain extension.


The second part allows multicolumn statistics to be used for join 
selectivity estimation.

As far as I know, extended statistics are now being actively improved:
https://www.postgresql.org/message-id/flat/20200309000157.ig5tcrynvaqu4ixd%40development#bfbdf9c41c31ef92819dfc5ecde4a67c

I think that using extended statistics for join selectivity is very 
important and should also be addressed.

If my approach is not so good, I will be pleased to hear other suggestions.



A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for these
extra statistics, and also comparing it with the benefit it gives in
terms of accuracy?
Adding statistics does not always lead to a performance improvement, but I 
never observed any performance degradation caused by the presence of 
extended statistics.
Definitely we can manually create too many extended statistics entries 
for different subsets of columns.
And that certainly increases planning time because the optimizer has to 
consider more alternatives.

But in practice I never noticed such a slowdown.


- I would also be interested in understanding if there are cases when
adding this extra step doesn’t help and have you excluded them already
or if some of them are easily identifiable at this stage...?


Unfortunately there are many cases when extended statistics can not help,
either because the optimizer is not able to use them (for example my patch 
considers only cases with strict equality comparison,
so if you use a predicate like "a.x=b.x and  a.y in (1,2,3)" then the 
extended statistics can not be used),
or because the collected statistics themselves are not precise enough, 
especially in case of data skew.




- is there any limit  on the number of columns for which this will
work, or should there be any such limit...?

Right now there is a limit on the maximal number of columns used in extended 
statistics: 8 columns.

But in practice I rarely see join predicates involving more than 3 columns.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Patch: to pass query string to pg_plan_query()

2020-03-26 Thread Julien Rouhaud
On Thu, Mar 26, 2020 at 10:54:35PM +0900, Fujii Masao wrote:
> 
> On 2020/03/10 6:31, legrand legrand wrote:
> > Hello,
> > 
> > This is a call for committers, reviewers and users,
> > regarding "planning counters in pg_stat_statements"
> > patch [1] but not only.
> 
> Does anyone object to this patch? I'm thinking to commit it separately
> at first before committing the planning_counter_in_pg_stat_statements
> patch.
> 
> > Historically, this version of pg_stat_statements
> > with planning counters was performing 3 calls to
> > pgss_store() for non utility statements in:
> > 1 - pgss_post_parse_analyze (init entry with queryid
> >  and store query text)
> > 2 - pgss_planner_hook (to store planning counters)
> > 3 - pgss_ExecutorEnd (to store execution counters)
> > 
> > Then a new version was proposed to remove one call
> > to pgss_store() by adding the query string to the
> > planner pg_plan_query():
> 
> But pgss_store() still needs to be called three times even in
> non-utility command if the query has constants. Right?

Yes indeed, this version is actually adding the 3rd pgss_store call.  Passing
the query string is a collateral requirement in case the entry disappeared
between post parse analysis and planning (which is quite possible with prepared
statements at least), as pgss will in this case fall back to storing the as-is
query string, which is still better than no query text at all.

> > 1 - pgss_planner_hook (to store planning counters)
> > 2 - pgss_ExecutorEnd (to store execution counters)
> > 
> > Many performance tests were performed concluding
> > that there is no impact on this subject.
> 
> Sounds good!
> 
> > Patch "to pass query string to the planner", could be
> > committed by itself, and (maybe) used by other extensions.
> > 
> > If this was done, this new version of pgss with planning
> > counters could be committed as well, or even later
> > (being used as a non core extension starting with pg13).
> > 
> > So please give us your feedback regarding this patch
> > "to pass query string to the planner", if you have other
> > use cases, or any comment regarding core architecture.
> 
> *As far as I heard*, the pg_hint_plan extension uses a very tricky way to
> extract the query string in the planner hook. So this patch would be
> very helpful to make pg_hint_plan avoid using such a tricky way.

+1




Re: replay pause vs. standby promotion

2020-03-26 Thread Sergei Kornilov
Hello

> If we don't ignore the walreceiver and do try to connect to the master,
> a promotion and recovery might never end, since new WAL data can
> keep being streamed. You think this behavior is more consistent?

We have no simple point at which to stop replay.
Well, except for "immediately" - just one easy stop. But I agree that this is 
not the best option. Simple and clear, but not the best one for the data when 
we want to replay as much as possible from the archive.

> IMO it's valid to replay all the WAL data available to avoid data loss
> before a promotion completes.

But in the case of a still-working primary (with archive_command) we choose a 
quite random time to promote: a random time at which the primary has not yet 
archived the newest WAL segment,
or even a time when a temporary error of restore_command occurs? We mention 
just the cp command in the docs. I know users use cp (e.g. from NFS) without 
further error handling.

regards, Sergei




Re: Resolving the python 2 -> python 3 mess

2020-03-26 Thread Tom Lane
Marco Atzeri  writes:
> Am 17.02.2020 um 17:49 schrieb Tom Lane:
>> 1. On platforms where Python 2.x is still supported, recommend that
>> packagers continue to build both plpython2 and plpython3, same as now.

> Is there some documentation on how to build both?
> The INSTALL gives no hint.

It's explained in the plpython documentation: basically you have to
configure and build the source tree twice (although I think the
second time you can just cd into src/pl/plpython and build/install
only that much).

regards, tom lane




Re: adding partitioned tables to publications

2020-03-26 Thread Amit Langote
On Wed, Mar 25, 2020 at 9:29 PM Peter Eisentraut
 wrote:
> On 2020-03-23 06:02, Amit Langote wrote:
> > Okay, added some tests.
> >
> > Attached updated patches.
>
> I have committed the worker.c refactoring patch.
>
> "Add subscription support to replicate into partitioned tables" still
> has lacking test coverage.  Your changes in relation.c are not exercised
> at all because the partitioned table branch in apply_handle_update() is
> never taken.  This is critical and tricky code, so I would look for
> significant testing.

While trying some tests around the code you mentioned, I found what
looks like a bug, which I'm looking into now.

> The code looks okay to me.  I would remove this code
>
> +   memset(entry->attrmap->attnums, -1,
> +  entry->attrmap->maplen * sizeof(AttrNumber));
>
> because the entries are explicitly filled right after anyway, and
> filling the bytes with -1 has an unclear effect.  There is also
> seemingly some fishiness in this code around whether attribute numbers
> are zero- or one-based.  Perhaps this could be documented briefly.
> Maybe I'm misunderstanding something.

Will check and fix as necessary.

--
Thank you,

Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: pg_checksums in backend/storage/page/README

2020-03-26 Thread Julien Rouhaud
On Thu, Mar 26, 2020 at 03:02:42PM +0100, Daniel Gustafsson wrote:
> I noticed in passing that backend/storage/page/README hadn't gotten the memo
> about pg_checksums.  The attached tiny diff adds a small mention of it.

Good catch!  LGTM




Re: A bug when use get_bit() function for a long bytea string

2020-03-26 Thread Tom Lane
Ashutosh Bapat  writes:
> On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca 
> wrote:
>> if we change return type of all those functions to int64, we won't need
>> this cast.
>> I change the 'encode' function, it needs an int64 return type, but keep
>> other
>> functions in 'pg_encoding', because I think it of no necessary reason.

> Ok, let's leave it for a committer to decide.

If I'm grasping the purpose of these correctly, wouldn't Size or size_t
be a more appropriate type?  And I definitely agree with changing all
of these APIs at once, if they're all dealing with the same kind of
value.

regards, tom lane




Re: pg_checksums in backend/storage/page/README

2020-03-26 Thread Magnus Hagander
On Thu, Mar 26, 2020 at 3:02 PM Daniel Gustafsson  wrote:
>
> I noticed in passing that backend/storage/page/README hadn't gotten the memo
> about pg_checksums.  The attached tiny diff adds a small mention of it.

That seems obvious enough. Pushed, thanks!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-25 17:51:50 -0400, Tom Lane wrote:
>> Andres Freund  writes:
>>> Hm, any chance that the multiple resowner calls here could show up in a
>>> profile? Probably not?

>> Doubt it.  On the other hand, as the code stands it's certain that the
>> resowner contains nothing but plancache pins (while I was writing the
>> patch it wasn't entirely clear that that would hold).  So we could
>> drop the two unnecessary calls.  There are assertions in
>> ResourceOwnerDelete that would fire if we somehow missed releasing
>> anything, so it doesn't seem like much of a maintenance hazard.

> One could even argue that that's a nice crosscheck: Due to the later
> release it'd not actually be correct to just add "arbitrary" things to
> that resowner.

I had a thought about a possibly-cleaner way to do this.  We could invent
a resowner function, say ResourceOwnerReleaseAllPlanCacheRefs, that
explicitly releases all plancache pins it knows about.  So plpgsql
would not call the regular ResourceOwnerRelease entry point at all,
but just call that and then ResourceOwnerDelete, again relying on the
assertions therein to catch anything not released.

This would be slightly more code but it'd perhaps make it clearer
what's going on, without the cost of a duplicative data structure.
Perhaps in future there'd be use for similar calls for other resource
types.
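
As a rough sketch of how plpgsql's cleanup would then look (the new function
is only proposed above, so its name and signature are assumptions):

    /*
     * Release only the plancache pins this resowner is known to hold, then
     * delete it; ResourceOwnerDelete()'s assertions catch anything that was
     * somehow left unreleased.
     */
    static void
    release_simple_expr_owner(ResourceOwner owner)
    {
        ResourceOwnerReleaseAllPlanCacheRefs(owner);    /* proposed */
        ResourceOwnerDelete(owner);
    }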

regards, tom lane




pg_checksums in backend/storage/page/README

2020-03-26 Thread Daniel Gustafsson
I noticed in passing that backend/storage/page/README hadn't gotten the memo
about pg_checksums.  The attached tiny diff adds a small mention of it.

cheers ./daniel



cksum_page_readme.diff
Description: Binary data


Re: Patch: to pass query string to pg_plan_query()

2020-03-26 Thread Fujii Masao




On 2020/03/10 6:31, legrand legrand wrote:

Hello,

This is a call for committers, reviewers and users,
regarding "planning counters in pg_stat_statements"
patch [1] but not only.


Does anyone object to this patch? I'm thinking to commit it separately
at first before committing the planning_counter_in_pg_stat_statements
patch.


Historically, this version of pg_stat_statements
with planning counters was performing 3 calls to
pgss_store() for non utility statements in:
1 - pgss_post_parse_analyze (init entry with queryid
 and store query text)
2 - pgss_planner_hook (to store planning counters)
3 - pgss_ExecutorEnd (to store execution counters)

Then a new version was proposed to remove one call
to pgss_store() by adding the query string to the
planner pg_plan_query():


But pgss_store() still needs to be called three times even in
non-utility command if the query has constants. Right?


1 - pgss_planner_hook (to store planning counters)
2 - pgss_ExecutorEnd (to store execution counters)

Many performance tests were performed concluding
that there is no impact on this subject.


Sounds good!


Patch "to pass query string to the planner", could be
committed by itself, and (maybe) used by other extensions.

If this was done, this new version of pgss with planning
counters could be committed as well, or even later
(being used as a non core extension starting with pg13).

So please give us your feedback regarding this patch
"to pass query string to the planner", if you have other
use cases, or any comment regarding core architecture.


*As far as I heard*, the pg_hint_plan extension uses a very tricky way to
extract the query string in the planner hook. So this patch would be
very helpful to make pg_hint_plan avoid using such a tricky way.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: A bug when use get_bit() function for a long bytea string

2020-03-26 Thread Ashutosh Bapat
On Wed, 18 Mar 2020 at 08:18, movead...@highgo.ca 
wrote:

>
> Hello thanks for the detailed review,
>
> > I think the second argument indicates the bit position, which would be
> > max bytea length * 8. If max bytea length covers the whole of int32, the
> > second argument needs to be wider, i.e. int64.
> Yes, it makes sence and followed.
>
>
I think we need a similar change in byteaGetBit() and byteaSetBit() as well.


>
> > Some more comments on the patch
> > struct pg_encoding
> > {
> >- unsigned (*encode_len) (const char *data, unsigned dlen);
> >+ int64 (*encode_len) (const char *data, unsigned dlen);
> >  unsigned (*decode_len) (const char *data, unsigned dlen);
> >  unsigned (*encode) (const char *data, unsigned dlen, char *res);
> >  unsigned (*decode) (const char *data, unsigned dlen, char *res);
> > Why not use return type of int64 for rest of the functions here as well?
> >  res = enc->encode(VARDATA_ANY(data), datalen, VARDATA(result));
> >  /* Make this FATAL 'cause we've trodden on memory ... */
> >- if (res > resultlen)
> >+ if ((int64)res > resultlen)
> >
> >if we change return type of all those functions to int64, we won't need
> > this cast.
> I change the 'encode' function, it needs an int64 return type, but keep
> other
>
> functions in 'pg_encoding', because I think it of no necessary reason.
>
>
Ok, let's leave it for a committer to decide.


> >Right now we are using int64 because bytea can be 1GB, but what if we
> increase
> >that limit tomorrow, will int64 be sufficient? That may be unlikely in
> the near
> >future, but nevertheless a possibility. Should we then define a new
> datatype
> >which resolves to int64 right now but really depends upon the bytea
> length. I
> >am not suggesting that we have to do it right now, but may be something to
> >think about.
> I decided to use int64 because if we really want to increase the limit, it
> should be
> the same change for 'text' and 'varchar', which have the same limit. So it may
> need
> a more general struct. Hence I gave up the idea.
>
>
Hmm, Let's see what a committer says.

Some more review comments.
+   int64   res,resultlen;

We need those on separate lines, possibly.

+   byteNo = (int32)(n / BITS_PER_BYTE);
Does it hurt to have byteNo as int64 so as to avoid a cast? Otherwise, please
add a comment explaining the reason for the cast. The comment applies at other
places where this change appears.

-   int len;
+   int64   len;
Why do we need this change?
int i;


>
>
> >It might help to add a test where we could pass the second argument
> something
> >greater than 1G. But it may be difficult to write such a test case.
> Add two test cases.
>
>
+
+select get_bit(
+set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 1024 *
1024 * 1024 + 1, 0)
+   ,1024 * 1024 * 1024 + 1);

This bit position is still within int4.
postgres=# select pg_column_size(1024 * 1024 * 1024 + 1);
 pg_column_size
----------------
  4
(1 row)

You want something like
postgres=# select pg_column_size(512::bigint * 1024 * 1024 * 8);
 pg_column_size
----------------
  8
(1 row)

-- 
Best Wishes,
Ashutosh


Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-26 Thread Julien Rouhaud
On Thu, Mar 26, 2020 at 08:08:35PM +0900, Fujii Masao wrote:
> 
> On 2020/03/25 22:45, Julien Rouhaud wrote:
> > On Wed, Mar 25, 2020 at 10:09:37PM +0900, Fujii Masao wrote:
> > > + /*
> > > +  * We can't process the query if no query_text is provided, as 
> > > pgss_store
> > > +  * needs it.  We also ignore query without queryid, as it would be 
> > > treated
> > > +  * as a utility statement, which may not be the case.
> > > +  */
> > > 
> > > Could you tell me why the planning stats are not tracked when executing
> > > utility statements? In some utility statements like REFRESH MATERIALIZED 
> > > VIEW,
> > > the planner would work.
> > 
> > I explained that in [1].  The problem is that the underlying statement 
> > doesn't
> > get the proper stmt_location and stmt_len, so you eventually end up with two
> > different entries.
> 
> It's not problematic to have two different entries in that case. Right?

It will unnecessarily bloat the entries, and make users' lives harder too.  This
example is quite easy to deal with, but if the application is sending
multi-query statements, you'll just end up with a mess impossible to properly
handle.

> The actual problem is that the statements reported in those entries are
> very similar? For example, when "create table test as select 1;" is executed,
> it's strange to get the following two entries, as you explained.
> 
> create table test as select 1;
> create table test as select 1
> 
> But it seems valid to get the following two entries in that case?
> 
> select 1
> create table test as select 1
> 
> The former is the nested statement and the latter is the top statement.

I think that there should only be 1 entry, the utility command.  It seems easy
to correlate the planning time to the underlying query, but I'm not entirely
sure that the execution counters won't be impacted by the fact that it's being
run in a utility statement.  Also, for now current pgss behavior is to always merge
underlying optimisable statements in the utility command, and it seems a bit
late in this release cycle to revisit that.

I'd be happy to work on improving that for the next release, among other
things.  For instance the total lack of normalization for utility commands [2]
is also something that has been bothering me for a long time.  In some
workloads, you can end up with the entries almost entirely filled with
1-time-execution commands, just because it's using random identifiers, so you
have no other choice than to disable track_utility, although it would have been
useful for other commands.

> Here are other comments.
> 
> - if (jstate)
> + if (kind == PGSS_JUMBLE)
> 
> Why is PGSS_JUMBLE necessary? ISTM that we can still use jstate here, instead.
> 
> If it's ok to remove PGSS_JUMBLE, we can define PGSS_NUM_KIND(=2) instead
> and replace 2 in, e.g., total_time[2] with PGSS_NUM_KIND. Thought?

Yes, we could be using jstate here.  I originally used that to avoid passing
PGSS_EXEC (or the other one) as a way to say "ignore this information as
there's the jstate which says it's yet another meaning".  If that's not an
issue, I can change that as PGSS_NUM_KIND will clearly improve the explicit "2"
all over the place.

> +  total_time
> +  double precision
> +  
> +  
> +Total time spent planning and executing the statement, in 
> milliseconds
> +  
> + 
> 
> pg_stat_statements view has this column but the function not.
> We should make both have the column or not at all, for consistency?
> I'm not sure if it's good thing to expose the sum of total_plan_time
> and total_exec_time as total_time. If some users want that, they can
> easily calculate it from total_plan_time and total_exec_time by using
> their own logic.

I think we originally added it as a way to avoid too much compatibility break,
and also because it seems like a field most users will be interested in anyway.
Now that I'm thinking about it again, I indeed think it was a mistake to have
that in the view part only.  Not mainly for consistency, but for users who would be
interested in the total_time field while not wanting to pay the overhead of
retrieving the query text if they don't need it.  So I'll change that!

> + nested_level++;
> + PG_TRY();
> 
> In old thread [1], Tom Lane commented the usage of nested_level
> in the planner hook. There seems no reply to that so far. What's
> your opinion about that comment?
> 
> [1] https://www.postgresql.org/message-id/28980.1515803...@sss.pgh.pa.us

Oh thanks, I hadn't noticed this part of the discussion.  I agree with Tom's
concern, and I think that having a specific nesting level variable for the
planner is the best workaround, so I'll implement that.

[2] https://twitter.com/fujii_masao/status/1242978261572837377
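
For the record, a minimal sketch of that planner-specific nesting level (the
hook signature follows the pass-the-query-string-to-the-planner patch discussed
above; the exact names are my own, and chaining to a previous hook is omitted):

    static int  plan_nested_level = 0;

    static PlannedStmt *
    pgss_planner(Query *parse, const char *query_string,
                 int cursorOptions, ParamListInfo boundParams)
    {
        PlannedStmt *result;

        plan_nested_level++;
        PG_TRY();
        {
            result = standard_planner(parse, query_string,
                                      cursorOptions, boundParams);
        }
        PG_FINALLY();
        {
            plan_nested_level--;
        }
        PG_END_TRY();

        return result;
    }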




Re: replay pause vs. standby promotion

2020-03-26 Thread Fujii Masao




On 2020/03/25 19:42, Sergei Kornilov wrote:

Hi


  Could we add a few words in func.sgml to clarify the behavior? Especially for 
users from my example above. Something like:


  If a promotion is triggered while recovery is paused, the paused state ends, 
replay of any WAL immediately available in the archive or in pg_wal will be 
continued and then a promotion will be completed.


This description is true if pause is requested by pg_wal_replay_pause(),
but not if recovery target is reached and pause is requested by
recovery_target_action=pause. In the latter case, even if there are WAL data
available in pg_wal or the archive, they are not replayed, i.e., the promotion
completes immediately. Probably we should document those two cases
explicitly to avoid the confusion about a promotion and recovery pause?


This is the description for pg_wal_replay_pause, but actually we suggest 
calling pg_wal_replay_resume in the recovery_target_action=pause case... So, I 
agree, we need to document both cases.

PS: I think we have inconsistent behavior here... Reading WAL during promotion 
from local pg_wal AND calling restore_command, but ignoring the walreceiver, 
also seems strange to my DBA hat...


If we don't ignore the walreceiver and do try to connect to the master,
a promotion and recovery might never end, since new WAL data can
keep being streamed. You think this behavior is more consistent?

IMO it's valid to replay all the WAL data available to avoid data loss
before a promotion completes.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: potential stuck lock in SaveSlotToPath()

2020-03-26 Thread Peter Eisentraut

On 2020-03-25 17:56, Robert Haas wrote:

On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
 wrote:

Any concerns about applying and backpatching the patch I posted?


Not from me.


The talk about reorganizing this code doesn't seem very concrete at the
moment and would probably not be backpatch material anyway.


+1.


committed and backpatched

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-03-26 Thread Masahiko Sawada
On Thu, 26 Mar 2020 at 15:34, Amit Kapila  wrote:
>
> On Thu, Mar 26, 2020 at 12:03 PM Amit Kapila  wrote:
> >
> > On Thu, Mar 26, 2020 at 10:11 AM Justin Pryzby  wrote:
> > >
> > > Seems fine.  Rather than saying "different phases", I would say:
> > > "The index vacuum and heap vacuum phases may be called multiple times in 
> > > the
> > > middle of the heap scan phase."
> > >
> >
> > Okay, I have slightly adjusted the wording as per your suggestion.
> >
> > > But actually I think the concern is not that we unnecessarily "Revert 
> > > back to
> > > the old phase" but that we do it in a *loop*.  Which I agree doesn't make
> > > sense, to go back and forth between "scanning heap" and "truncating".
> > >
> >
> > Fair point.  I have moved the change to the truncate phase at the
> > caller of lazy_truncate_heap(), which should address this concern.
> > Sawada-San, does this address your concern?
> >
>
> Forgot to attach the patch, doing now.

Thank you for updating the patch! The changes around
lazy_truncate_heap() look good to me.

I have two comments;

1.
@@ -1844,9 +1914,15 @@ lazy_vacuum_page(Relation onerel, BlockNumber
blkno, Buffer buffer,
int uncnt = 0;
TransactionId visibility_cutoff_xid;
boolall_frozen;
+   LVRelStats  olderrcbarg;

pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);

+   /* Update error traceback information */
+   olderrcbarg = *vacrelstats;
+   update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+ blkno, NULL, false);

Since we update vacrelstats->blkno during the loop in
lazy_vacuum_heap() we unnecessarily update blkno twice to the same
value. Also I think we don't need to revert back the callback
arguments in lazy_vacuum_page(). Perhaps we can either remove the
change of lazy_vacuum_page() or move the code updating
vacrelstats->blkno to the beginning of lazy_vacuum_page(). I prefer
the latter.

2.
+/*
+ * Update vacuum error callback for the current phase, block, and index.
+ *
+ * free_oldindname is true if the previous "indname" should be freed.
It must be
+ * false if the caller has copied the old LVRelStats, to avoid keeping a
+ * pointer to a freed allocation.  In which case, the caller should call again
+ * with free_oldindname as true to avoid a leak.
+ */
+static void
+update_vacuum_error_cbarg(LVRelStats *errcbarg, int phase, BlockNumber blkno,
+ char *indname, bool free_oldindname)

I'm not sure why "free_oldindname" is necessary. Since we initialize
vacrelstats->indname with NULL and revert the callback arguments at
the end of the functions that need to update them, vacrelstats->indname is
NULL at the beginning of lazy_vacuum_index() and lazy_cleanup_index().
And we make a copy of index name in update_vacuum_error_cbarg(). So I
think we can pfree the old index name if errcbarg->indname is not NULL.


Regards,

--
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-26 Thread Fujii Masao




On 2020/03/25 22:45, Julien Rouhaud wrote:

On Wed, Mar 25, 2020 at 10:09:37PM +0900, Fujii Masao wrote:


On 2020/03/21 4:30, Julien Rouhaud wrote:

On Fri, Mar 20, 2020 at 05:09:05PM +0300, Sergei Kornilov wrote:

Hello

Yet another is missed in docs: total_time


Oh good catch!  I rechecked the field many times, and totally missed that the
documentation is referring to the view, which has an additional column, and not
the function.  Attached v9 fixes that.


Thanks for the patch! Here are the review comments from me.

-   PGSS_V1_3
+   PGSS_V1_3,
+   PGSS_V1_8

WAL usage patch [1] increments this version to 1_4 instead of 1_8.
I *guess* that's because previously this version was maintained
independently from pg_stat_statements' version. For example,
pg_stat_statements 1.4 seems to have used PGSS_V1_3.


Oh right.  It seems that I changed that many versions ago, I'm not sure why.
I'm personally fine with any, but I think this was previously raised and
consensus was to keep distinct counters.  Unless you prefer to keep it this
way, I'll send an updated version (with other possible modifications depending
on the rest of the mail) using PGSS_V1_4.


+   /*
+* We can't process the query if no query_text is provided, as 
pgss_store
+* needs it.  We also ignore query without queryid, as it would be 
treated
+* as a utility statement, which may not be the case.
+*/

Could you tell me why the planning stats are not tracked when executing
utility statements? In some utility statements like REFRESH MATERIALIZED VIEW,
the planner would work.


I explained that in [1].  The problem is that the underlying statement doesn't
get the proper stmt_location and stmt_len, so you eventually end up with two
different entries.


It's not problematic to have two different entries in that case. Right?
The actual problem is that the statements reported in those entries are
very similar? For example, when "create table test as select 1;" is executed,
it's strange to get the following two entries, as you explained.

create table test as select 1;
create table test as select 1

But it seems valid to get the following two entries in that case?

select 1
create table test as select 1

The former is the nested statement and the latter is the top statement.


I suggested fixing transformTopLevelStmt() to handle the
various DDL that can contain optimisable statements, but everyone preferred to
postpone that for a future enhancement.


Understood. Thanks for explanation!


+static BufferUsage
+compute_buffer_counters(BufferUsage start, BufferUsage stop)
+{
+   BufferUsage result;

BufferUsageAccumDiff() has very similar logic. Isn't it better to expose
and use that function rather than creating new similar function?


Oh, I thought this wouldn't be acceptable.  That's indeed better so I'll do
that instead.


Thanks! But of course this is a trivial thing, so it's ok to do that later.
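
For reference, a minimal sketch of that replacement, assuming
BufferUsageAccumDiff() gets exported from instrument.c as discussed
(bufusage_start stands for a snapshot of pgBufferUsage taken at hook entry):

    BufferUsage bufusage;

    /* Counters for this call = current totals minus the entry snapshot. */
    memset(&bufusage, 0, sizeof(bufusage));
    BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);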


values[i++] = Int64GetDatumFast(tmp.rows);
values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
values[i++] = Int64GetDatumFast(tmp.shared_blks_read);

Previously (without the patch) pg_stat_statements_1_3() reported
the buffer usage counters updated only in execution phase. But,
in the patched version, pg_stat_statements_1_3() reports the total
of buffer usage counters updated in both planning and execution
phases. Is this OK? I'm not sure how seriously we should ensure
the backward compatibility for pg_stat_statements


That's indeed a behavior change, although the new behavior is probably better
as users want to know how much resource a query is consuming overall.  We could
distinguish all buffers with a plan/exec version, but it seems quite overkill.


Ok.




+/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */

ISTM it's good timing to have also pg_stat_statements--1.8.sql since
the definition of pg_stat_statements() is changed. Thought?


I thought that since CreateExtension() was modified to be able to find its way
automatically, we shouldn't provide a base version anymore, to minimize
maintenance burden and also avoid possible bug/discrepancy.  The only drawback
is that it'll do multiple CREATE or DROP/CREATE of the function usually once
per database, which doesn't seem like a big problem.


Ok.

Here are other comments.

-   if (jstate)
+   if (kind == PGSS_JUMBLE)

Why is PGSS_JUMBLE necessary? ISTM that we can still use jstate here, instead.

If it's ok to remove PGSS_JUMBLE, we can define PGSS_NUM_KIND(=2) instead
and replace 2 in, e.g., total_time[2] with PGSS_NUM_KIND. Thought?

+  total_time
+  double precision
+  
+  
+Total time spent planning and executing the statement, in milliseconds
+  
+ 

pg_stat_statements view has this column but the function not.
We should make both have the column or not at all, for consistency?
I'm not sure if it's good 

Re: plan cache overhead on plpgsql expression

2020-03-26 Thread Amit Langote
On Thu, Mar 26, 2020 at 4:44 AM Tom Lane  wrote:
>
> Pavel Stehule  writes:
> > I'll mark this patch as ready for commiters.
>
> Thanks for reviewing!  Amit, do you have any thoughts on this?

Thanks for picking this up.  The test cases added by your patch really
show why the plancache and the planner must not be skipped, something
I totally failed to grasp.

I can't really see any problem with your patch, but mainly due to my
unfamiliarity with some of the more complicated things it touches,
like resowner stuff.

One thing -- I don't get the division between
CachedPlanAllowsSimpleValidityCheck() and CachedPlanIsSimplyValid().
Maybe I am missing something, but could there not be just one
function, possibly using whether expr_simple_expr is set or not to
skip or do, resp., the checks that the former does?

--
Thank you,

Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Negative cost is seen for plan node

2020-03-26 Thread Richard Guo
On Tue, Mar 24, 2020 at 12:01 PM Kyotaro Horiguchi 
wrote:

> At Mon, 23 Mar 2020 21:13:48 +0800, Richard Guo 
> wrote in
> > Hi all,
> >
> > With the following statements on latest master (c81bd3b9), I find
> > negative cost for plan nodes.
> >
> > create table a (i int, j int);
> > insert into a select i%10, i from generate_series(1,1000000)i;
> > analyze a;
> >
> > # explain select i from a group by i;
> >QUERY PLAN
> > -------------------------------------------------------------
> >  HashAggregate  (cost=1300.00..-1585.82 rows=102043 width=4)
>
> Good catch!
>
> >Group Key: i
> >Planned Partitions: 4
> >->  Seq Scan on a  (cost=0.00..14425.00 rows=1000000 width=4)
> > (4 rows)
> >
> > In function cost_agg, when we add the disk costs of hash aggregation
> > that spills to disk, nbatches is calculated as 1.18 in this case. It is
> > greater than 1, so there will be spill. And the depth is calculated as
> > -1 in this case, with num_partitions being 4. I think this is where
> > things go wrong.
>
> The depth is the expected number of iterations of reading the relation.
>
> >   depth = ceil( log(nbatches - 1) / log(num_partitions) );
>

Yes correct.


>
> I'm not sure what the expression is based on, but apparently it is wrong
> for nbatches <= 2.0. It looks like a thinko of something like this.
>
> depth = ceil( log(nbatches) / log(num_partitions + 1) );
>

It seems to me we should use '(nbatches - 1)', without the log function.
Maybe I'm wrong.

I have sent this issue to the 'Memory-Bounded Hash Aggregation' thread.

Thanks
Richard


Re: Memory-Bounded Hash Aggregation

2020-03-26 Thread Richard Guo
Hello,

When calculating the disk costs of hash aggregation that spills to disk,
there is something wrong with how we determine depth:

>depth = ceil( log(nbatches - 1) / log(num_partitions) );

If nbatches is some number between 1.0 and 2.0, we would have a negative
depth. As a result, we may have a negative cost for hash aggregation
plan node, as described in [1].

I don't think 'log(nbatches - 1)' is what we want here. Should it be
just '(nbatches - 1)'?
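
To make the failure concrete with the numbers from [1] (nbatches ≈ 1.18,
num_partitions = 4):

\[ \mathrm{depth} = \left\lceil \frac{\log(1.18 - 1)}{\log 4} \right\rceil
                  = \left\lceil \frac{-1.715}{1.386} \right\rceil
                  = \lceil -1.237 \rceil = -1 \]

and in general \( \log(\mathrm{nbatches} - 1) < 0 \) whenever nbatches < 2, so
any nbatches between 1.0 and 2.0 subtracts disk cost instead of adding it.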

[1]
https://www.postgresql.org/message-id/flat/CAMbWs4_maqdBnRR4x01pDpoV-CiQ%2BRvMQaPm4JoTPbA%3DmZmhMw%40mail.gmail.com

Thanks
Richard

On Thu, Mar 19, 2020 at 7:36 AM Jeff Davis  wrote:

>
> Committed.
>
> There's some future work that would be nice (some of these are just
> ideas and may not be worth it):
>
> * Refactor MemoryContextMemAllocated() to be a part of
> MemoryContextStats(), but allow it to avoid walking through the blocks
> and freelists.
>
> * Improve the choice of the initial number of buckets in the hash
> table. For this patch, I tried to preserve the existing behavior of
> estimating the number of groups and trying to initialize with that many
> buckets. But my performance tests seem to indicate this is not the best
> approach. More work is needed to find what we should really do here.
>
> * For workloads that are not in work_mem *or* system memory, and need
> to actually go to storage, I see poor CPU utilization because it's not
> effectively overlapping CPU and IO work. Perhaps buffering or readahead
> changes can improve this, or async IO (even better).
>
> * Project unnecessary attributes away before spilling tuples to disk.
>
> * Improve logtape.c API so that the caller doesn't need to manage a
> bunch of tape numbers.
>
> * Improve estimate of the hash entry size. This patch doesn't change
> the way the planner estimates it, but I observe that actual size as
> seen at runtime is significantly different. This is connected to the
> initial number of buckets for the hash table.
>
> * In recursive steps, I don't have a good estimate for the number of
> groups, so I just estimate it as the number of tuples in that spill
> tape (which is pessimistic). That could be improved by doing a real
> cardinality estimate as the tuples are spilling (perhaps with HLL?).
>
> * Many aggregates with pass-by-ref transition states don't provide a
> great aggtransspace. We should consider doing something smarter, like
> having negative numbers represent a number that should be multiplied by
> the size of the group (e.g. ARRAY_AGG would have a size dependent on
> the group size, not a constant).
>
> * If we want to handle ARRAY_AGG (and the like) well, we can consider
> spilling the partial states in the hash table whem the memory is full.
> That would add a fair amount of complexity because there would be two
> types of spilled data (tuples and partial states), but it could be
> useful in some cases.
>
> Regards,
> Jeff Davis
>
>
>
>
>


Re: allow online change primary_conninfo

2020-03-26 Thread Sergei Kornilov
Hello

Thank you! You were ahead of me. I wanted to double-check my changes this 
morning before posting.

> Sergei included LOG messages to indicate which setting has been changed.
> I put these in "#if 0" for now, but I'm thinking to remove these
> altogether;

No objections.

> Not sure if we can or should adjust the FATAL line; probably not worth the 
> trouble.

I think it's ok. The walreceiver is terminated precisely due to an administrator command.

regards, Sergei




Re: Resolving the python 2 -> python 3 mess

2020-03-26 Thread Peter Eisentraut

On 2020-03-26 06:46, Marco Atzeri wrote:

Am 17.02.2020 um 17:49 schrieb Tom Lane:

We've had multiple previous discussions of $SUBJECT (eg [1][2]),
without any resolution of what to do exactly.  Thinking about this
some more, I had an idea that I don't think has been discussed.
To wit:

1. On platforms where Python 2.x is still supported, recommend that
packagers continue to build both plpython2 and plpython3, same as now.



Is there some documentation on how to build both?


You have to configure and build the sources twice with different PYTHON 
settings.  It depends on your packaging system how to best arrange that.



And how to build for multiple 3.x versions?


That is not supported.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-26 Thread Laurenz Albe
On Wed, 2020-03-25 at 23:19 +0300, Alexander Korotkov wrote:
> On Wed, Mar 25, 2020 at 10:26 PM Andres Freund  wrote:
> > On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> > > Since we talked about how scale_factor can be used to effectively disable 
> > > this
> > > new feature, I thought that scale=100 was too small and suggested 1e10 
> > > (same as
> > > max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should 
> > > allow
> > > handling the case that analyze is disabled, or its threshold is high, or 
> > > it
> > > hasn't run yet, or it's running but hasn't finished, or analyze is 
> > > triggered as
> > > same time as vacuum.
> > 
> > For disabling we instead should allow -1, and disable the feature if set
> > to < 0.
> 
> This patch introduces both GUC and reloption.  In reloptions we
> typically use -1 for "disable reloption, use GUC value instead"
> semantics.  So it's unclear how should we allow reloption to both
> disable feature and disable reloption.  I think we don't have a
> precedent in the codebase yet.  We could allow -2 (disable reloption)
> and -1 (disable feature) for reloption.  Opinions?

Here is patch v11, where the reloption has the same upper limit 1e10
as the GUC.  There is no good reason to have them different.

I am reluctant to introduce new semantics like a reloption value of -2
to disable a feature in this patch right before feature freeze.

I believe there are enough options to disable insert-only vacuuming for
an individual table:

- Set the threshold to 2147483647.  True, that will not work for very
  large tables, but I think that there are few tables that insert that
  many rows before they hit autovacuum_freeze_max_age anyway.

- Set the scale factor to some astronomical value.

Yours,
Laurenz Albe





Re: Collation versions on Windows (help wanted, apply within)

2020-03-26 Thread Juan José Santamaría Flecha
On Wed, Mar 25, 2020 at 4:18 AM Thomas Munro  wrote:

>
> Thanks!  Pushed.
>

Great!


> From the things we learned in this thread, I think there is an open
> item for someone to write a patch to call EnumSystemLocalesEx() and
> populate the initial set of collations, where we use "locale -a" on
> Unix.  I'm not sure where the encoding is supposed to come from
> though, which is why I didn't try to write a patch myself.
>

I will take a look at this when the current commitfest is over.
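
For whoever picks it up, a bare-bones sketch of the Win32 entry point (the
callback here only prints names; a real patch would feed each locale into the
collation bootstrap code, and the encoding question above remains open):

    #include <windows.h>
    #include <stdio.h>

    /* Called once per locale; return TRUE to continue enumerating. */
    static BOOL CALLBACK
    print_locale(LPWSTR name, DWORD flags, LPARAM lparam)
    {
        wprintf(L"%ls\n", name);
        return TRUE;
    }

    int
    main(void)
    {
        if (!EnumSystemLocalesEx(print_locale, LOCALE_ALL, 0, NULL))
            fprintf(stderr, "EnumSystemLocalesEx failed: %lu\n",
                    GetLastError());
        return 0;
    }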

Regards,

Juan José Santamaría Flecha


Re: Improve handling of parameter differences in physical replication

2020-03-26 Thread Masahiko Sawada
On Thu, 12 Mar 2020 at 04:34, Peter Eisentraut
 wrote:
>
> Here is an updated patch that incorporates some of the suggestions.  In
> particular, some of the warning messages have been rephrased to be more
> accurate (but also less specific), the warning message at recovery pause
> repeats every 1 minute, and the documentation has been updated.
>

Thank you for updating the patch. I have one comment on the latest
version patch:

+   do
+   {
+   TimestampTz now = GetCurrentTimestamp();
+
+   if (TimestampDifferenceExceeds(last_warning, now, 60000))
+   {
+   ereport(WARNING,
+   (errmsg("recovery paused because of insufficient
parameter settings"),
+errdetail("See earlier in the log about which
settings are insufficient."),
+errhint("Recovery cannot continue unless the
configuration is changed and the server restarted.")));
+   last_warning = now;
+   }
+
+   pg_usleep(1000000L);    /* 1000 ms */
+   HandleStartupProcInterrupts();
+   }

I think we can set wait event WAIT_EVENT_RECOVERY_PAUSE here.
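A sketch of that suggestion, bracketing the sleep with wait-event reporting
(assuming a WAIT_EVENT_RECOVERY_PAUSE constant is added to pgstat.h if it is
not there already):

    pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
    pg_usleep(1000000L);    /* 1000 ms */
    pgstat_report_wait_end();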

The others look good to me.

Regards,

--
Masahiko Sawada    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Conflict handling for COPY FROM

2020-03-26 Thread Surafel Temesgen
Hi Takanori Asaba,

>
>
> Although it is a small point, it may be better like this:
> +70005  27  36  46  56  ->  70005  27  37  47  57
>
>
done

> I want to discuss about copy from binary file.
> It seems that this patch tries to avoid the error that number of field is
> different .
>
> +   {
> +   if (cstate->error_limit > 0 ||
> cstate->ignore_all_error)
> +   {
> +   ereport(WARNING,
> +
>  (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +errmsg("skipping \"%s\"
> --- row field count is %d, expected %d",
> +
>  cstate->line_buf.data, (int) fld_count, attr_count)));
> +   cstate->error_limit--;
> +   goto next_line;
> +   }
> +   else
> +   ereport(ERROR,
> +
>  (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +errmsg("row field count
> is %d, expected %d",
> +   (int)
> fld_count, attr_count)));
> +
> +   }
>
> I checked like this:
>
> postgres=# CREATE TABLE x (
> postgres(# a serial UNIQUE,
> postgres(# b int,
> postgres(# c text not null default 'stuff',
> postgres(# d text,
> postgres(# e text
> postgres(# );
> CREATE TABLE
> postgres=# COPY x from stdin;
> Enter data to be copied followed by a newline.
> End with a backslash and a period on a line by itself, or an EOF signal.
> >> 70004  25  35  45  55
> >> 70005  26  36  46  56
> >> \.
> COPY 2
> postgres=# SELECT * FROM x;
>a   | b  | c  | d  | e
> ---++++
>  70004 | 25 | 35 | 45 | 55
>  70005 | 26 | 36 | 46 | 56
> (2 rows)
>
> postgres=# COPY x TO '/tmp/copyout' (FORMAT binary);
> COPY 2
> postgres=# CREATE TABLE y (
> postgres(# a serial UNIQUE,
> postgres(# b int,
> postgres(# c text not null default 'stuff',
> postgres(# d text
> postgres(# );
> CREATE TABLE
> postgres=# COPY y FROM '/tmp/copyout' WITH (FORMAT binary,ERROR_LIMIT -1);
> 2020-03-12 16:55:55.457 JST [2319] WARNING:  skipping "" --- row field
> count is 5, expected 4
> 2020-03-12 16:55:55.457 JST [2319] CONTEXT:  COPY y, line 1
> 2020-03-12 16:55:55.457 JST [2319] WARNING:  skipping "" --- row field
> count is 0, expected 4
> 2020-03-12 16:55:55.457 JST [2319] CONTEXT:  COPY y, line 2
> 2020-03-12 16:55:55.457 JST [2319] ERROR:  unexpected EOF in COPY data
> 2020-03-12 16:55:55.457 JST [2319] CONTEXT:  COPY y, line 3, column a
> 2020-03-12 16:55:55.457 JST [2319] STATEMENT:  COPY y FROM '/tmp/copyout'
> WITH (FORMAT binary,ERROR_LIMIT -1);
> WARNING:  skipping "" --- row field count is 5, expected 4
> WARNING:  skipping "" --- row field count is 0, expected 4
> ERROR:  unexpected EOF in COPY data
> CONTEXT:  COPY y, line 3, column a
>
> It seems that the error isn't handled.
> 'WARNING:  skipping "" --- row field count is 5, expected 4' is correct,
> but ' WARNING:  skipping "" --- row field count is 0, expected 4' is not
> correct.
>
>
Thank you for the detailed example


> Also, is it necessary to skip the error that happens when the input is a
> binary file?
> Does the case occur where each row has a different number of fields and
> only specific rows are copied?
>
>
An error that can surely be handled without transaction rollback can
be included in error handling, but I would like to proceed without binary-file
error handling for the time being.

regards
Surafel


conflict-handling-copy-from-v16.patch
Description: Binary data


Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-26 Thread Michael Paquier
On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote:
> Thanks Alvaro and Alexander.  0001 has been applied as of e09ad07.
> Now for 0002, let's see about it later.  Attached is a rebased
> version, with no actual changes.

I was looking at this patch again today and I am rather fine with the
existing semantics.  Still, I don't much like naming the frontend-side
routine FrontendRestoreArchivedFile() and using a different name than
the backend counterpart because we have to include xlog_internal.h in
fe_archive.c to be able to grab XLOGDIR.

So here is an idea: let's move the declaration of the routines part of
xlogarchive.c to a new header, called xlogarchive.h, and then let's
use the same routine name for the frontend and the backend in this
second patch.  We include xlog_internal.h already in many frontend
tools, so that would clean up things a bit.  Two extra things are the
routines for the checkpointer as well as the variables like
ArchiveRecoveryRequested.  It may be nice to move those while on it,
but I am not sure where and that's not actually required for this
patch set so that could be addressed later if need be.

Any thoughts?
--
Michael


signature.asc
Description: PGP signature


Re: A rather hackish POC for alternative implementation of WITH TIES

2020-03-26 Thread Surafel Temesgen
On Wed, Jan 22, 2020 at 3:35 PM Andrew Gierth 
wrote:

> > "Alvaro" == Alvaro Herrera  writes:
>
>
> I was largely holding off on doing further work hoping for some
> discussion of which way we should go. If you think my approach is worth
> pursuing (I haven't seriously tested the performance, but I'd expect it
> to be slower than Surafel's - the price you pay for flexibility) then I
> can look at it further, but figuring out the planner stuff will take
> some time.
>
>
Another alternative would be to push the existing implementation,
which would be open to change in case a better-finished
implementation arrives.

regards
Surafel


Re: Include sequence relation support in logical replication

2020-03-26 Thread Michael Paquier
On Wed, Mar 25, 2020 at 12:27:28PM -0700, Andres Freund wrote:
> On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
>> For the replication to make sense, the patch actually disables the WAL
>> update at every 32 nextval() calls, so every call to nextval() will
>> emit a WAL update for proper replication. This is done by setting
>> SEQ_LOG_VALS to 0 in sequence.c
> 
> Why is that needed? ISTM updating in increments of 32 is fine for
> replication purposes? It's good imo, because sending out more granular
> increments would increase the size of the WAL stream?

Once upon a time, I was looking at the effects of playing with the
limit of a WAL record generated every 32 increments for a sequence,
and the performance difference is huge and noticeable.
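
For illustration only, a toy model (not the actual sequence.c code) of the
batching being discussed: one WAL record covers the next SEQ_LOG_VALS values,
so most nextval() calls write nothing:

    #include <stdint.h>
    #include <stdio.h>

    #define SEQ_LOG_VALS 32

    static int64_t last_value = 0;  /* last value handed out */
    static int64_t log_cnt = 0;     /* values still covered by last record */

    static int64_t
    toy_nextval(void)
    {
        if (log_cnt <= 0)
        {
            /* one record pre-logs the next SEQ_LOG_VALS values */
            printf("WAL: logged up to %lld\n",
                   (long long) (last_value + SEQ_LOG_VALS));
            log_cnt = SEQ_LOG_VALS;
        }
        log_cnt--;
        return ++last_value;
    }

    int
    main(void)
    {
        for (int i = 0; i < 70; i++)
            toy_nextval();      /* prints only 3 "WAL" lines for 70 calls */
        return 0;
    }

With SEQ_LOG_VALS set to 0, the branch fires on every call, which is where the
extra WAL volume comes from.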
--
Michael


signature.asc
Description: PGP signature


Re: [DOC] Document concurrent index builds waiting on each other

2020-03-26 Thread Michael Paquier
On Wed, Mar 25, 2020 at 05:12:48PM -0300, Alvaro Herrera wrote:
> Hmm, that sounds more promising.

Haven't looked at that myself in detail.  But as I doubt that this
would be backpatched, wouldn't it be better to document the issue for
now?
--
Michael


signature.asc
Description: PGP signature


Re: color by default

2020-03-26 Thread Michael Paquier
On Mon, Mar 23, 2020 at 09:32:08AM +0100, Peter Eisentraut wrote:
> Attached is the documentation patch reworked.

Thanks!

> Should we delete all the repetitive mentions on the man pages?

I am not sure that deleting all the mentions would be a good idea, as
we'd lose track of which tool supports coloring or not, and that could
confuse users.  What about switching the existing paragraph to a
simple sentence with a link to the new appendix you are adding?  Say:
"pg_foo supports colorized output".

> +  
> +   The actual colors to be used are configured using the environment variable
> +   
> PG_COLORSPG_COLORS
> +   (note plural).  The value is a colon-separated list of
> +   
> key=value
> +   pairs.  The keys specify what the color is to be used for.  The values are
> +   SGR (Select Graphic Rendition) specifications, which are interpreted by 
> the
> +   terminal.
> +  

A reference for SGR, to better understand the list of supported values,
would be nice?

> +  
> +   The default value is 
> error=01;31:warning=01;35:locus=01.
> +  

Could it be possible to have more details about what those three
fields map to?
--
Michael


signature.asc
Description: PGP signature


Re: error context for vacuum to include block number

2020-03-26 Thread Amit Kapila
On Thu, Mar 26, 2020 at 12:03 PM Amit Kapila  wrote:
>
> On Thu, Mar 26, 2020 at 10:11 AM Justin Pryzby  wrote:
> >
> > Seems fine.  Rather than saying "different phases", I would say:
> > "The index vacuum and heap vacuum phases may be called multiple times in the
> > middle of the heap scan phase."
> >
>
> Okay, I have slightly adjusted the wording as per your suggestion.
>
> > But actually I think the concern is not that we unnecessarily "Revert back 
> > to
> > the old phase" but that we do it in a *loop*.  Which I agree doesn't make
> > sense, to go back and forth between "scanning heap" and "truncating".
> >
>
> Fair point.  I have moved the change to the truncate phase at the
> caller of lazy_truncate_heap(), which should address this concern.
> Sawada-San, does this address your concern?
>

Forgot to attach the patch, doing now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v35-0001-Introduce-vacuum-errcontext-to-display-additiona.patch
Description: Binary data


Re: error context for vacuum to include block number

2020-03-26 Thread Amit Kapila
On Thu, Mar 26, 2020 at 10:11 AM Justin Pryzby  wrote:
>
> Seems fine.  Rather than saying "different phases", I would say:
> "The index vacuum and heap vacuum phases may be called multiple times in the
> middle of the heap scan phase."
>

Okay, I have slightly adjusted the wording as per your suggestion.

> But actually I think the concern is not that we unnecessarily "Revert back to
> the old phase" but that we do it in a *loop*.  Which I agree doesn't make
> sense, to go back and forth between "scanning heap" and "truncating".
>

Fair point.  I have moved the change to the truncate phase at the
caller of lazy_truncate_heap(), which should address this concern.
Sawada-San, does this address your concern?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: allow online change primary_conninfo

2020-03-26 Thread Michael Paquier
On Wed, Mar 25, 2020 at 09:08:12PM -0300, Alvaro Herrera wrote:
> ... as in the attached version.

Patch 0001 is one that has already been discussed previously (thanks
for keeping the extra comments about GUCs and WAL receivers).  That
looks fine to me.

> Sergei included LOG messages to indicate which setting has been changed.
> I put these in "#if 0" for now, but I'm thinking to remove these
> altogether; we already have LOG messages when a setting changes value,
> per ProcessConfigFileInternal(); the log sequence looks like this, taken
> from tmp_check/log/001_stream_rep_standby_2.log after running the tests:

Yes, it makes sense to remove any knowledge of GUC updates from
StartupRequestWalReceiverRestart().  I would add a description at the
top of this routine by the way.

+extern void ProcessStartupSigHup(void);
This is declared, but used nowhere in patch 0002.

Wouldn't it be better to be more careful about the NULL-ness of the
string parameters in StartupRereadConfig()?

+   if (!slotnameChanged && strcmp(PrimarySlotName, "") == 0)
+   tempSlotChanged = tempSlot != wal_receiver_create_temp_slot;
I would add parens here for clarity.

> which looks sufficient.  Maybe we can reword that new message, say "wal
> receiver process shutdown forced by parameter change".  Not sure if we
> can or should adjust the FATAL line; probably not worth the trouble.

There is merit to keep the LOG in StartupRequestWalReceiverRestart()
unchanged, as the routine may get called in the future in some other
context.
--
Michael


signature.asc
Description: PGP signature