Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-08-29 Thread Noah Misch
On Mon, May 25, 2020 at 12:00:33AM -0700, Noah Misch wrote:
> On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote:
> > On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote:
> > > Noah Misch  writes:
> > > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote:
> > > >> So I think what we're actually trying to accomplish here is to
> > > >> ensure that instead of deleting up to half of the SLRU space
> > > >> before the cutoff, we delete up to half-less-one-segment.
> > > >> Maybe it should be half-less-two-segments, just to provide some
> > > >> cushion against edge cases.  Reading the first comment in
> > > >> SetTransactionIdLimit makes one not want to trust too much in
> > > >> arguments based on the exact value of xidWrapLimit, while for
> > > >> the other SLRUs it was already unclear whether the edge cases
> > > >> were exactly right.
> > > 
> > > > That could be interesting insurance.  While it would be sad for us
> > > > to miss an edge case and print "must be vacuumed within 2
> > > > transactions" when wrap has already happened, reaching that message
> > > > implies the DBA burned ~1M XIDs, all in single-user mode.  More
> > > > plausible is FreezeMultiXactId() overrunning the limit by tens of
> > > > segments.  Hence, if we do buy this insurance, let's skip far more
> > > > segments.  For example, instead of unlinking segments representing
> > > > up to 2^31 past XIDs, we could divide that into an upper half that
> > > > we unlink and a lower half.  The lower half will stay in place;
> > > > eventually, XID consumption will overwrite it.  Truncation behavior
> > > > won't change until the region of CLOG for pre-oldestXact XIDs
> > > > exceeds 256 MiB.  Beyond that threshold, vac_truncate_clog() will
> > > > unlink the upper 256 MiB and leave the rest.  CLOG maximum would
> > > > rise from 512 MiB to 768 MiB.  Would that be worthwhile?
> 
> > > Temporarily wasting some disk
> > > space is a lot more palatable than corrupting data, and these code
> > > paths are necessarily not terribly well tested.  So +1 for more
> > > insurance.
> > 
> > Okay, I'll give that a try.  I expect this will replace the PagePrecedes
> > callback with a PageDiff callback such that PageDiff(a, b) < 0 iff
> > PagePrecedes(a, b).  PageDiff callbacks shall distribute return values
> > uniformly in [INT_MIN,INT_MAX].  SimpleLruTruncate() will unlink segments
> > where INT_MIN/2 < PageDiff(candidate, cutoff) < 0.
> 
> While doing so, I found that slru-truncate-modulo-v2.patch did get edge cases
> wrong, as you feared.  In particular, if the newest XID reached xidStopLimit
> and was in the first page of a segment, TruncateCLOG() would delete its
> segment.  Attached slru-truncate-modulo-v3.patch fixes that; as restitution, I
> added unit tests covering that and other scenarios.  Reaching the bug via XIDs
> was hard, requiring one to burn 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in
> single-user mode.  I expect the bug was easier to reach via pg_multixact.
> 
> The insurance patch stacks on top of the bug fix patch.  It does have a
> negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest
> to skip truncation in corrupted clusters.  SlruScanDirCbFindEarliest() gives
> nonsense answers if "future" segments exist.  That can happen today, but the
> patch creates new ways to make it happen.  The symptom is wasting yet more
> space in pg_multixact.  I am okay with this, since it arises only after one
> fills pg_multixact 50% full.  There are alternatives.  We could weaken the
> corruption defense in TruncateMultiXact() or look for another implementation
> of equivalent defense.  We could unlink, say, 75% or 95% of the "past" instead
> of 50% (this patch) or >99.99% (today's behavior).
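
For illustration, a minimal sketch of the PageDiff contract described
above, using CLOG as the example.  CLOGPageDiff is a hypothetical name and
this is not the committed code; CLOG_XACTS_PER_PAGE is the real constant
from clog.c.  Mapping the modular page distance back into XID space and
letting it wrap in a signed int makes PageDiff(a, b) < 0 exactly when
PagePrecedes(a, b), with values spread across [INT_MIN, INT_MAX]:

    static int
    CLOGPageDiff(int page1, int page2)
    {
        /* modular distance between the two pages, expressed in XID units */
        uint32      diff_xid = ((uint32) page1 - (uint32) page2) * CLOG_XACTS_PER_PAGE;

        /* two's-complement wraparound puts the result in [INT_MIN, INT_MAX] */
        return (int) diff_xid;
    }

SimpleLruTruncate() would then unlink a segment only when
INT_MIN/2 < CLOGPageDiff(segpage, cutoffPage) < 0, i.e. only the half of
the "past" nearest the cutoff, leaving the farther half in place for
future XID consumption to overwrite, as described in the quoted text.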

Rebased the second patch.  The first patch did not need a rebase.
Author: Noah Misch 
Commit: Noah Misch 

Prevent excess SimpleLruTruncate() deletion.

Every core SLRU wraps around.  With the exception of pg_notify, the wrap
point can fall in the middle of a page.  Account for this in the
PagePrecedes callback specification and in SimpleLruTruncate()'s use of
said callback.  Update each callback implementation to fit the new
specification.  This changes SerialPagePrecedesLogically() from the
style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes().
(Whereas pg_clog and pg_serial share a key space, pg_serial is nothing
like pg_notify.)

This closes a rare opportunity for data loss, which manifested as
"apparent wraparound" or "could not access status of transaction"
errors.  This is more likely to affect pg_multixact, due to the thin
space between multiStopLimit and multiWrapLimit.  If a user's physical
replication primary logged ":  apparent wraparound" messages, the user
should rebuild the standbys of that primary regardless of symptoms.  At
less risk is 

Rare link canary failure in dblink test

2020-08-29 Thread Thomas Munro
Hi,

A couple of recent cases where an error "libpq is incorrectly linked
to backend functions" broke the dblink test:

 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-08-22%2002:55:22
  | REL9_6_STABLE | Windows

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-04-27%2022:46:16
  | HEAD  | Windows

Before that, lorikeet had last reported it before commit a33245a8, which
apparently fixed something relevant; but curiously the message also
appeared a couple of times on two Unixen.  Perhaps we can write those off
as lost in the mists of time.

 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-09-29%2010:56:25
  | HEAD  | Windows

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2018-09-10%2011:09:29
  | HEAD  | Illumos

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2018-09-09%2018:11:15
  | HEAD  | FreeBSD




Re: Windows buildfarm members vs. new async-notify isolation test

2020-08-29 Thread Tom Lane
Thomas Munro  writes:
> I made a thing to watch out for low probability BF failures and it
> told me that a similar failure in async-notify might have reappeared
> on brolga:

>  
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-07-15%2008:30:11
>   | REL_10_STABLE
> [ etc ]

Hm, interesting.  None of these examples show an actual *failure* to
receive a notification, unlike the example that began this thread.
So it seems unlikely that back-patching 16114f2ea would help.  What
we are seeing here, instead, is delayed timing of notify receipt(s).
I suspect that this is a variant of the issue described over here:

https://www.postgresql.org/message-id/flat/2527507.1598237598%40sss.pgh.pa.us

I didn't have a great idea about how to fix it reliably in
insert-conflict-specconflict, and I lack one here too :-(.

It's interesting though that your examples are all in v10 or older.
Could we have done something that indirectly fixes the problem
since then?  Or is that just chance?

regards, tom lane




Re: Windows buildfarm members vs. new async-notify isolation test

2020-08-29 Thread Thomas Munro
On Thu, Dec 12, 2019 at 9:11 AM Tom Lane  wrote:
> Amit Kapila  writes:
> > I am convinced by your points.  So +1 for your proposed patch.  I have
> > already reviewed it yesterday and it appears fine to me.
>
> OK, pushed.  Thanks for reviewing!

I made a thing to watch out for low probability BF failures and it
told me that a similar failure in async-notify might have reappeared
on brolga:

 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-07-15%2008:30:11
  | REL_10_STABLE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-05-21%2009:17:13
  | REL9_6_STABLE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-22%2009:13:38
  | REL9_6_STABLE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-05%2009:38:13
  | REL9_6_STABLE

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-03%2021:17:39
  | REL9_6_STABLE




Re: Background writer and checkpointer in crash recovery

2020-08-29 Thread Tom Lane
Thomas Munro  writes:
> Once we had the checkpointer running, we could also consider making
> the end-of-recovery checkpoint optional, or at least the wait for it
> to complete.  I haven't shown that in this patch, but it's just
> different checkpoint request flags and possibly an end-of-recovery
> record.  What problems do you foresee with that?  Why should we have
> "fast" promotion but not "fast" crash recovery?

I think that the rationale for that had something to do with trying
to reduce the costs of repeated crashes.  If you've had one crash,
you might well have another one in your near future, due to the same
problem recurring.  Therefore, avoiding multiple replays of the same WAL
is attractive; and to do that we have to complete a checkpoint before
giving users a chance to crash us again.

I'm not sure what I think about your primary proposal here.  On the
one hand, optimizing crash recovery ought to be pretty darn far down
our priority list; if it happens often enough for performance to be
a consideration then we have worse problems to fix.  Also, at least
in theory, not running the bgwriter/checkpointer makes for fewer moving
parts and thus better odds of completing crash recovery successfully.
On the other hand, it could be argued that running without the
bgwriter/checkpointer actually makes recovery less reliable, since
that's a far less thoroughly-tested operating regime than when they're
active.

tl;dr: I think this should be much less about performance than about
whether we successfully recover or not.  Maybe there's still a case for
making the change when argued in that framework, though.

regards, tom lane




Re: Disk-based hash aggregate's cost model

2020-08-29 Thread Tomas Vondra

On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:

On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:

We have a Postgres 13 open item for Disk-based hash aggregate, which
is the only non-trivial open item. There is a general concern about
how often we get disk-based hash aggregation when work_mem is
particularly low, and recursion seems unavoidable. This is generally
thought to be a problem in the costing.


We discussed two approaches to tweaking the cost model:

1. Penalize HashAgg disk costs by a constant amount. It seems to be
chosen a little too often, and we can reduce the number of plan
changes.

2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.

The problem with approach #2 is that we have a default hash mem of 4MB,
and real systems have a lot more than that. In this scenario, recursive
spilling can beat Sort by a lot.



I think the main issue is that we're mostly speculating what's wrong.
I've shared some measurements and symptoms, and we've discussed what
might be causing it, but I'm not really sure we know for sure.

I really dislike (1) because it seems like a "we don't know what's wrong,
so we'll penalize hashagg" kind of solution. A much more principled
solution would be to tweak the costing accordingly, not just add some
constant. For (2) it really depends on whether recursive spilling is
actually the problem here. In the examples I shared, the number of
partitions/batches was very different, but the query duration was mostly
independent of that (almost constant).


FWIW I still haven't seen any explanation why the current code spills
more data than the CP_SMALL_TLIST patch (which was reverted).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Background writer and checkpointer in crash recovery

2020-08-29 Thread Thomas Munro
(Forking this thread from the SLRU fsync one[1] to allow for a
separate CF entry; it's unrelated, except for being another case of
moving work off the recovery process's plate.)

Hello hackers,

Currently we don't run the bgwriter process during crash recovery.
I've CCed Simon and Heikki who established this in commit cdd46c76.
Based on that commit message, I think the bar to clear to change the
policy is to show that it's useful, and that it doesn't make crash
recovery less robust.   See the other thread for some initial evidence
of usefulness from Jakub Wartak.  I think it also just makes intuitive
sense that it's got to help bigger-than-buffer-pool crash recovery if
you can shift buffer eviction out of the recovery loop.   As for
robustness, I suppose we could provide the option to turn it off just
in case that turns out to be useful in some scenarios, but I'm
wondering why we would expect something that we routinely run in
archive/streaming recovery to reduce robustness in only slightly
different circumstances.

Here's an experiment-grade patch, comments welcome, though at this
stage it's primarily thoughts about the concept that I'm hoping to
solicit.

One question raised by Jakub that I don't have a great answer to right
now is whether you'd want different bgwriter settings in this scenario
for best results.  I don't know, but I suspect that the answer is to
make bgwriter more adaptive rather than more configurable, and that's
an orthogonal project.

Once we had the checkpointer running, we could also consider making
the end-of-recovery checkpoint optional, or at least the wait for it
to complete.  I haven't shown that in this patch, but it's just
different checkpoint request flags and possibly an end-of-recovery
record.  What problems do you foresee with that?  Why should we have
"fast" promotion but not "fast" crash recovery?

[1] 
https://www.postgresql.org/message-id/flat/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
From 21d7d459a6076f85c743b739f1c4ba16451b7046 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 26 Aug 2020 16:34:33 +1200
Subject: [PATCH 1/2] Run checkpointer and bgwriter in crash recovery.

Relieve the startup process of some writeback and checkpointing work
during crash recovery, making it more like replication recovery.  This
wasn't done back in commit cdd46c76 because it wasn't thought to be
useful at the time.  The theory of this patch is that there may be
workloads where we can profitably move a bunch of buffer eviction work
out of the recovery process.

In order to have a bgwriter running, you also need a checkpointer.
While you could give startup and bgwriter their own backend private
pending ops table, it's nicer to merge their requests in one place.

XXX This is just an experimental prototype patch and may contain all
kinds of bugs!
---
 src/backend/access/transam/xlog.c | 33 ++-
 src/backend/postmaster/bgwriter.c |  3 ---
 src/backend/postmaster/checkpointer.c |  3 ---
 src/backend/postmaster/postmaster.c   | 17 ++
 src/backend/storage/sync/sync.c   | 30 +++-
 src/include/storage/sync.h|  1 -
 6 files changed, 22 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..d8080ed4f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -870,9 +870,6 @@ bool		reachedConsistency = false;
 
 static bool InRedo = false;
 
-/* Have we launched bgwriter during recovery? */
-static bool bgwriterLaunched = false;
-
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -7123,25 +7120,14 @@ StartupXLOG(void)
 		/* Also ensure XLogReceiptTime has a sane value */
 		XLogReceiptTime = GetCurrentTimestamp();
 
+		PublishStartupProcessInformation();
+
 		/*
 		 * Let postmaster know we've started redo now, so that it can launch
-		 * checkpointer to perform restartpoints.  We don't bother during
-		 * crash recovery as restartpoints can only be performed during
-		 * archive recovery.  And we'd like to keep crash recovery simple, to
-		 * avoid introducing bugs that could affect you when recovering after
-		 * crash.
-		 *
-		 * After this point, we can no longer assume that we're the only
-		 * process in addition to postmaster!  Also, fsync requests are
-		 * subsequently to be handled by the checkpointer, not locally.
+		 * the archiver if necessary.
 		 */
-		if (ArchiveRecoveryRequested && IsUnderPostmaster)
-		{
-			PublishStartupProcessInformation();
-			EnableSyncRequestForwarding();
+		if (IsUnderPostmaster)
 			SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
-			bgwriterLaunched = true;
-		}
 
 		/*
 		 * Allow read-only connections immediately if we're consistent
@@ -7729,7 +7715,7 @@ StartupXLOG(void)
 		 * after we're fully out of recovery mode and already accepting
 		 * queries.
 		 */
-		if 

Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-29 Thread Tom Lane
Alvaro Herrera  writes:
> I pushed despite the objection because it seemed that downstream
> discussion was largely favorable to the change, and there's a different
> proposal to solve the bloat problem for analyze; and also:

Note that this quasi-related patch has pretty thoroughly hijacked
the CF entry for James' original docs patch proposal.  The cfbot
thinks that that's the latest patch in the original thread, and
unsurprisingly is failing to apply it.

Since the discussion was all over the place, I'm not sure whether
there's still a live docs patch proposal or not; but if so, somebody
should repost that patch (and go back to the original thread title).

regards, tom lane




Re: Get rid of runtime handling of AlternativeSubPlan?

2020-08-29 Thread Tom Lane
I wrote:
> Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced
> AlternativeSubPlans, I wrote:
>   There is a lot more that could be done based on this infrastructure: in
>   particular it's interesting to consider switching to the hash plan if we start
>   out using the non-hashed plan but find a lot more upper rows going by than we
>   expected.  I have therefore left some minor inefficiencies in place, such as
>   initializing both subplans even though we will currently only use one.
>
> That commit will be twelve years old come August, and nobody has either
> built anything else atop it or shown any interest in making the plan choice
> switchable mid-run.  So it seems like kind of a failed experiment.
>
> Therefore, I'm considering the idea of ripping out all executor support
> for AlternativeSubPlan and instead having the planner replace an
> AlternativeSubPlan with the desired specific SubPlan somewhere late in
> planning, possibly setrefs.c.

Here's a proposed patchset for that.  This runs with the idea I'd had
that setrefs.c could be smarter than the executor about which plan node
subexpressions will be executed how many times.  I did not take it very
far, for fear of adding an undue number of planning cycles, but it's still
better than what we have now.

For ease of review, 0001 adds the new planner logic, while 0002 removes
the now-dead executor support.

There's one bit of dead code that I left in place for the moment, which is
ruleutils.c's support for printing AlternativeSubPlans.  I'm not sure if
that's worth keeping or not --- it's dead code for normal use, but if
someone tried to use ruleutils.c to print partially-planned expression
trees, maybe there'd be a use for it?

(It's also arguable that readfuncs.c's support is now dead code, but
I have little interest in stripping that out.)

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..f0386480ab 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2254,6 +2254,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_BOOL_FIELD(hasLateralRTEs);
 	WRITE_BOOL_FIELD(hasHavingQual);
 	WRITE_BOOL_FIELD(hasPseudoConstantQuals);
+	WRITE_BOOL_FIELD(hasAlternativeSubPlans);
 	WRITE_BOOL_FIELD(hasRecursion);
 	WRITE_INT_FIELD(wt_param_id);
 	WRITE_BITMAPSET_FIELD(curOuterRels);
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b40a112c25..27006c59e0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -629,6 +629,8 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	root->minmax_aggs = NIL;
 	root->qual_security_level = 0;
 	root->inhTargetKind = INHKIND_NONE;
+	root->hasPseudoConstantQuals = false;
+	root->hasAlternativeSubPlans = false;
 	root->hasRecursion = hasRecursion;
 	if (hasRecursion)
 		root->wt_param_id = assign_special_exec_param(root);
@@ -759,9 +761,6 @@ subquery_planner(PlannerGlobal *glob, Query *parse,
 	 */
 	root->hasHavingQual = (parse->havingQual != NULL);
 
-	/* Clear this flag; might get set in distribute_qual_to_rels */
-	root->hasPseudoConstantQuals = false;
-
 	/*
 	 * Do expression preprocessing on targetlist and quals, as well as other
 	 * random expressions in the querytree.  Note that we do not need to
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index baefe0e946..b5a29ecfc4 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -49,6 +49,7 @@ typedef struct
 {
 	PlannerInfo *root;
 	int			rtoffset;
+	double		num_exec;
 } fix_scan_expr_context;
 
 typedef struct
@@ -58,6 +59,7 @@ typedef struct
 	indexed_tlist *inner_itlist;
 	Index		acceptable_rel;
 	int			rtoffset;
+	double		num_exec;
 } fix_join_expr_context;
 
 typedef struct
@@ -66,8 +68,28 @@ typedef struct
 	indexed_tlist *subplan_itlist;
 	Index		newvarno;
 	int			rtoffset;
+	double		num_exec;
 } fix_upper_expr_context;
 
+/*
+ * Selecting the best alternative in an AlternativeSubPlan expression requires
+ * estimating how many times that expression will be evaluated.  For an
+ * expression in a plan node's targetlist, the plan's estimated number of
+ * output rows is clearly what to use, but for an expression in a qual it's
+ * far less clear.  Since AlternativeSubPlans aren't heavily used, we don't
+ * want to expend a lot of cycles making such estimates.  What we use is twice
+ * the number of output rows.  That's not entirely unfounded: we know that
+ * clause_selectivity() would fall back to a default selectivity estimate
+ * of 0.5 for any SubPlan, so if the qual containing the SubPlan is the last
+ * to be applied (which it likely would be, thanks to order_qual_clauses()),
+ * this matches what we could have estimated in a far more laborious fashion.
+ * Obviously there are many other scenarios, but it's probably not worth the
+ * 

Re: list of extended statistics on psql

2020-08-29 Thread Tomas Vondra

On Sat, Aug 29, 2020 at 06:43:47PM -0400, Alvaro Herrera wrote:

On 2020-Aug-29, Tomas Vondra wrote:


But if we want the
output to show both what was requested and which types were actually
built, that'll effectively double the number of columns needed :-(


I was thinking it would be one column per type showing either disabled or
enabled or built.  But another idea is to show one type per line that's at
least enabled.


Also, it might be useful to show the size of the statistics built, just
like we show for \d+ etc.


\dX+  I  suppose?



Right. I've only used \d+ as an example of an existing command showing
sizes of the objects.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: list of extended statistics on psql

2020-08-29 Thread Alvaro Herrera
On 2020-Aug-29, Tomas Vondra wrote:

> But if we want the
> output to show both what was requested and which types were actually
> built, that'll effectively double the number of columns needed :-(

I was thinking it would be one column per type showing either disabled or
enabled or built.  But another idea is to show one type per line that's at
least enabled.

> Also, it might be useful to show the size of the statistics built, just
> like we show for \d+ etc.

\dX+  I  suppose?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-29 Thread Tomas Vondra

On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:

Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:

On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> > Hm? At least earlier versions didn't do prefetching for records with an 
fpw, and only for subsequent records affecting the same or if not in s_b anymore.
>
> We don't actually read the page when we're replaying an FPW though..?
> If we don't read it, and we entirely write the page from the FPW, how is
> pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.


Sure- but if we're talking about 25GB of WAL, on a server that's got
32GB, then why would those pages end up getting evicted from memory
entirely?  Particularly, enough of them to end up with such a huge
difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints
than the system's got memory then that certainly changes things, but
that wasn't what I understood the case to be here.



I don't think it's very clear how much WAL there actually was in each
case - the message only said there was more than 25GB, but who knows how
many checkpoints that covers? In the cases with FPW=on this may easily
be much less than one checkpoint (because with scale 45GB an update to
every page will log 45GB of full-page images). It'd be interesting to
see some stats from pg_waldump etc.


Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SDD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.


I agree that this could be a reason, but it doesn't seem to quite fit in
this particular case given the amount of memory and WAL.  I'm suspecting
that it's something else and I'd very much like to know if it's a
general "this applies to all (most?  a lot of?) SSDs because the
hardware has a larger than 8KB page size and therefore the kernel has to
read it", or if it's something odd about this particular system and
doesn't apply generally.



Not sure. I doubt it has anything to do with the hardware page size,
that's mostly transparent to the kernel anyway. But it might be that the
prefetching on a particular SSD has more overhead than what it saves.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: list of extended statistics on psql

2020-08-29 Thread Tomas Vondra

On Thu, Aug 27, 2020 at 11:26:17PM -0400, Alvaro Herrera wrote:

On 2020-Aug-28, Tatsuro Yamada wrote:


> IMO the per-type columns should show both the type being enabled as
> well as it being built.

Hmm. I'm not sure how to get the status (enabled or disabled) of
extended stats. :(
Could you explain it more?


pg_statistic_ext_data.stxdndistinct is not null if the stats have been
built.  (I'm not sure whether there's an easier way to determine this.)



It's the only way, I think. Which types were requested is stored in

   pg_statistic_ext.stxkind

and what was built is in pg_statistic_ext_data. But if we want the
output to show both what was requested and which types were actually
built, that'll effectively double the number of columns needed :-(
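
For a rough idea of how those two catalogs could be combined, a \dX query
built in psql's describe.c might look something like the sketch below. The
query text and the disabled/enabled/built labels are only an illustration
(and it ignores that pg_statistic_ext_data is not readable by unprivileged
users); the catalogs and columns are real, and only the ndistinct case is
shown:

    printfPQExpBuffer(&buf,
                      "SELECT es.stxnamespace::pg_catalog.regnamespace AS \"Schema\",\n"
                      "       es.stxname AS \"Name\",\n"
                      "       CASE WHEN 'd' = any(es.stxkind)\n"
                      "            THEN CASE WHEN esd.stxdndistinct IS NOT NULL\n"
                      "                      THEN 'built' ELSE 'enabled' END\n"
                      "            ELSE 'disabled' END AS \"Ndistinct\"\n"
                      "  FROM pg_catalog.pg_statistic_ext es\n"
                      "  LEFT JOIN pg_catalog.pg_statistic_ext_data esd\n"
                      "         ON es.oid = esd.stxoid");

The dependencies and MCV columns would follow the same pattern, testing
'f' against stxddependencies and 'm' against stxdmcv.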

Also, it might be useful to show the size of the statistics built, just
like we show for \d+ etc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: list of extended statistics on psql

2020-08-29 Thread Tomas Vondra

On Thu, Aug 27, 2020 at 07:53:23PM -0400, Alvaro Herrera wrote:

+1 for the general idea, and +1 for \dX being the syntax to use

IMO the per-type columns should show both the type being enabled as
well as it being built.

(How many more stat types do we expect -- Tomas?  I wonder if having one
column per type is going to scale in the long run.)



I wouldn't expect a huge number of types. I can imagine maybe twice the
current number of types, but not much more. But I'm not sure the output
is easy to read even now ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: new heapcheck contrib module

2020-08-29 Thread Mark Dilger



> On Aug 29, 2020, at 3:27 AM, Andrey M. Borodin  wrote:
> 
> 
> 
>> 29 авг. 2020 г., в 00:56, Robert Haas  написал(а):
>> 
>> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin  
>> wrote:
>>> I don't think so. ISTM It's the same problem of xmax < relfrozenxid, just hidden behind detoasing.
>>> Our regular heap_check was checking xmin\xmax invariants for tables, but 
>>> failed to recognise the problem in toast (while toast was accessible until 
>>> CLOG truncation).
>> 
>> The code can (and should, and I think does) refrain from looking up
>> XIDs that are out of the range thought to be valid -- but how do you
>> propose that it avoid looking up XIDs that ought to have clog data
>> associated with them despite being >= relfrozenxid and < nextxid?
>> TransactionIdDidCommit() does not have a suppress-errors flag, adding
>> one would be quite invasive, yet we cannot safely perform a
>> significant number of checks without knowing whether the inserting
>> transaction committed.
> 
> What you write seems completely correct to me. I agree that CLOG thresholds 
> lookup seems unnecessary.
> 
> But I have a real corruption at hand (on testing site). If I have proposed 
> here heapcheck. And I have pg_surgery from the thread nearby. Yet I cannot 
> fix the problem, because cannot list affected tuples. These tools do not 
> solve the problem neglected for long enough. It would be supercool if they 
> could.
> 
> This corruption like a caries had 3 stages:
> 1. incorrect VM flag that page do not need vacuum
> 2. xmin and xmax < relfrozenxid
> 3. CLOG truncated
> 
> Stage 2 is curable with proposed toolset, stage 3 is not. But they are not 
> that different.

I had an earlier version of the verify_heapam patch that included a 
non-throwing interface to clog.  Ultimately, I ripped that out.  My reasoning 
was that a simpler patch submission was more likely to be acceptable to the 
community.

If you want to submit a separate patch that creates a non-throwing version of 
the clog interface, and get the community to accept and commit it, I would 
seriously consider using that from verify_heapam.  If it gets committed in 
time, I might even do so for this release cycle.  But I don't want to make this 
patch dependent on that hypothetical patch getting written and accepted.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Ideas about a better API for postgres_fdw remote estimates

2020-08-29 Thread Andrey Lepikhov




On 8/29/20 9:22 PM, Stephen Frost wrote:



I implemented some FDW + pg core machinery to reduce weight of the problem.
The ANALYZE command on foreign table executes query on foreign server that
extracts statistics tuple, serializes it into json-formatted string and
returns to the caller. The caller deserializes this string, generates
statistics for this foreign table and update it. The second patch is a
proof-of-concept.


Isn't this going to create a version dependency that we'll need to deal
with..?  What if a newer major version has some kind of improved ANALYZE
command, in terms of what it looks at or stores, and it's talking to an
older server?

When I was considering the issue with ANALYZE and FDWs, I had been
thinking it'd make sense to just change the query that's built in
deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in
more-or-less the same manner as today.  If we don't like the available
TABLESAMPLE methods then we could add a new one that's explicitly the
'right' sample for an ANALYZE call and use that when it's available on
the remote side.  Not sure if it'd make any sense for ANALYZE itself to
start using that same TABLESAMPLE code, but maybe?  Not that I think
it'd be much of an issue if it's independent either, with appropriate
comments to note that we should probably try to make them match up for
the sake of FDWs.
This approach does not contradict your idea.  It is a lightweight way to 
reduce the cost of ANALYZE when the servers involved run current versions 
of the system catalog and the FDW.



This patch speedup analyze command and provides statistics relevance on a
foreign table after autovacuum operation. Its effectiveness depends on
relevance of statistics on the remote server, but still.


If we do decide to go down this route, wouldn't it mean we'd have to
solve the problem of what to do when it's a 9.6 foreign server being
queried from a v12 server and dealing with any difference in the
statistics structures of the two?

Seems like we would... in which case I would say that we should pull
that bit out and make it general, and use it for pg_upgrade too, which
would benefit a great deal from having the ability to upgrade stats
between major versions also.  That's a much bigger piece to take on, of
course, but seems to be what's implied with this approach for the FDW.



Thank you for this use case.

We can add a "version" field to the statistics string (btree uses 
versioning too). As you can see, in this patch we only try to fetch the 
statistics; if for some reason that does not work out, we fall back to the 
harder route.


Moreover, I believe this strategy should only apply when we analyze a 
relation implicitly. If the user runs an explicit ANALYZE command, we need 
to perform a proper analysis of the table.


--
regards,
Andrey Lepikhov
Postgres Professional




Re: Ideas about a better API for postgres_fdw remote estimates

2020-08-29 Thread Tom Lane
Stephen Frost  writes:
> Isn't this going to create a version dependency that we'll need to deal
> with..?  What if a newer major version has some kind of improved ANALYZE
> command, in terms of what it looks at or stores, and it's talking to an
> older server?

Yeah, this proposal is a nonstarter unless it can deal with the remote
server being a different PG version with different stats.

Years ago (when I was still at Salesforce, IIRC, so ~5 years) we had
some discussions about making it possible for pg_dump and/or pg_upgrade
to propagate stats data forward to the new database.  There is at least
one POC patch in the archives for doing that by dumping the stats data
wrapped in a function call, where the target database's version of the
function would be responsible for adapting the data if necessary, or
maybe just discarding it if it couldn't adapt.  We seem to have lost
interest but it still seems like something worth pursuing.  I'd guess
that if such infrastructure existed it could be helpful for this.

> When I was considering the issue with ANALYZE and FDWs, I had been
> thinking it'd make sense to just change the query that's built in
> deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in
> more-or-less the same manner as today.

+1, that seems like something worth doing in any case, since even if
we do get somewhere with the present idea it would only work for new
remote servers.  TABLESAMPLE would work pretty far back (9.5,
looks like).

regards, tom lane




Re: Clang UndefinedBehaviorSanitize (Postgres14) Detected undefined-behavior

2020-08-29 Thread Tom Lane
Peter Geoghegan  writes:
> I wonder if we should start using -fno-delete-null-pointer-checks:
> https://lkml.org/lkml/2018/4/4/601
> This may not be strictly relevant to the discussion, but I was
> reminded of it just now and thought I'd mention it.

Hmm.  gcc 8.3 defines this as:

 Assume that programs cannot safely dereference null pointers, and
 that no code or data element resides at address zero.  This option
 enables simple constant folding optimizations at all optimization
 levels.  In addition, other optimization passes in GCC use this
 flag to control global dataflow analyses that eliminate useless
 checks for null pointers; these assume that a memory access to
 address zero always results in a trap, so that if a pointer is
 checked after it has already been dereferenced, it cannot be null.

AFAICS, that's a perfectly valid assumption for our usage.  I can see why
the kernel might not want it, but we set things up whenever possible to
ensure that dereferencing NULL would crash.
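
As a concrete illustration of the pattern that manual text describes (a
standalone toy example, not PostgreSQL code):

    int
    first_element(int *p)
    {
        int     v = p[0];       /* the dereference happens first */

        if (p == NULL)          /* ...so this check looks dead to the compiler */
            return -1;
        return v;
    }

With -fdelete-null-pointer-checks active, gcc may fold the "if" test away
entirely, on the assumption that the earlier dereference would already
have trapped.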

However, while grepping the manual for that I also found

'-Wnull-dereference'
 Warn if the compiler detects paths that trigger erroneous or
 undefined behavior due to dereferencing a null pointer.  This
 option is only active when '-fdelete-null-pointer-checks' is
 active, which is enabled by optimizations in most targets.  The
 precision of the warnings depends on the optimization options used.

I wonder whether turning that on would find anything interesting.

regards, tom lane




Re: Ideas about a better API for postgres_fdw remote estimates

2020-08-29 Thread Stephen Frost
Greetings,

* Andrey Lepikhov (a.lepik...@postgrespro.ru) wrote:
> During the implementation of sharding related improvements i noticed that if
> we use a lot of foreign partitions, we have bad plans because of vacuum
> don't update statistics of foreign tables.This is done by the ANALYZE
> command, but it is very expensive operation for foreign table.
> Problem with statistics demonstrates with TAP-test from the first patch in
> attachment.

Yes, the way we handle ANALYZE today for FDWs is pretty terrible, since
we stream the entire table across to do it.

> I implemented some FDW + pg core machinery to reduce weight of the problem.
> The ANALYZE command on foreign table executes query on foreign server that
> extracts statistics tuple, serializes it into json-formatted string and
> returns to the caller. The caller deserializes this string, generates
> statistics for this foreign table and update it. The second patch is a
> proof-of-concept.

Isn't this going to create a version dependency that we'll need to deal
with..?  What if a newer major version has some kind of improved ANALYZE
command, in terms of what it looks at or stores, and it's talking to an
older server?

When I was considering the issue with ANALYZE and FDWs, I had been
thinking it'd make sense to just change the query that's built in
deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in
more-or-less the same manner as today.  If we don't like the available
TABLESAMPLE methods then we could add a new one that's explicitly the
'right' sample for an ANALYZE call and use that when it's available on
the remote side.  Not sure if it'd make any sense for ANALYZE itself to
start using that same TABLESAMPLE code, but maybe?  Not that I think
it'd be much of an issue if it's independent either, with appropriate
comments to note that we should probably try to make them match up for
the sake of FDWs.
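
To make the TABLESAMPLE idea concrete, here is a sketch of how the
deparsed query might be built.  deparseRelation() and the StringInfo calls
are real postgres_fdw/core functions; this helper, its sample_percent
argument, and the use of SELECT * are assumptions for illustration only:

    static void
    deparseAnalyzeSampleSql(StringInfo buf, Relation rel, double sample_percent)
    {
        appendStringInfoString(buf, "SELECT * FROM ");
        deparseRelation(buf, rel);
        appendStringInfo(buf, " TABLESAMPLE SYSTEM (%f)", sample_percent);
    }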

> This patch speedup analyze command and provides statistics relevance on a
> foreign table after autovacuum operation. Its effectiveness depends on
> relevance of statistics on the remote server, but still.

If we do decide to go down this route, wouldn't it mean we'd have to
solve the problem of what to do when it's a 9.6 foreign server being
queried from a v12 server and dealing with any difference in the
statistics structures of the two?

Seems like we would... in which case I would say that we should pull
that bit out and make it general, and use it for pg_upgrade too, which
would benefit a great deal from having the ability to upgrade stats
between major versions also.  That's a much bigger piece to take on, of
course, but seems to be what's implied with this approach for the FDW.

Thanks,

Stephen




Re: More aggressive vacuuming of temporary tables

2020-08-29 Thread Tom Lane
Michael Paquier  writes:
> I got to wonder lately if we should not have a static state like what
> we do for MyXactFlags to track down if a utility query is a top-level
> one or not as most of the time we just want to check the flag as
> passed down by standard_ProcessUtility().  I have faced this problem
> as well lately when pushing down a PreventInTransactionBlock() for
> some stuff with REINDEX for example.  Not sure how reliable this would
> be though..  Passing isTopLevel down the road is a no-brainer, and
> perhaps having a static value would create more problems than it
> solves in practice.

Hm.  I suppose you'd need a PG_TRY block to ensure that the setting
got restored correctly after an error, so maintaining it that way
would be rather expensive.  Also it just doesn't seem like
transaction-wide state, so having a static for it feels like the
wrong thing.

[thinks for awhile...]  Would it make any sense to mark Portals as being
top-level or not, and then code that needs this info could look to
"ActivePortal->isTopLevel"?  We are already paying the PG_TRY overhead
to maintain the ActivePortal variable safely.  I'm not sure though
whether the Portal is granular enough.  Do we set up a separate
Portal for subcommands?
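
A minimal sketch of that idea (ActivePortal is real; the isTopLevel field
and this helper are hypothetical):

    static inline bool
    UtilityIsTopLevel(void)
    {
        return ActivePortal != NULL && ActivePortal->isTopLevel;
    }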

In the big scheme of things, though, we don't need this info in so
many places that passing it down as a parameter is an undue burden.

regards, tom lane




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-29 Thread Amit Kapila
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar  wrote:
>
> As discussed, I have added a another test case for covering the out of
> order subtransaction rollback scenario.
>

+# large (streamed) transaction with out of order subtransaction ROLLBACKs
+$node_publisher->safe_psql('postgres', q{

How about writing a comment as: "large (streamed) transaction with
subscriber receiving out of order subtransaction ROLLBACKs"?

I have reviewed and modified the number of things in the attached patch:
1. In apply_handle_origin, improved the check for streamed xacts.
2. In apply_handle_stream_commit() while applying changes in the loop,
added CHECK_FOR_INTERRUPTS.
3. In DEBUG messages, print the path with double-quotes as we are
doing in all other places.
4.
+ /*
+ * Exit if streaming option is changed. The launcher will start new
+ * worker.
+ */
+ if (newsub->stream != MySubscription->stream)
+ {
+ ereport(LOG,
+ (errmsg("logical replication apply worker for subscription \"%s\" will "
+ "restart because subscription's streaming option were changed",
+ MySubscription->name)));
+
+ proc_exit(0);
+ }
+
We don't need a separate check like this. I have merged this into one
of the existing checks.
5.
subxact_info_write()
{
..
+ if (subxact_data.nsubxacts == 0)
+ {
+ if (ent->subxact_fileset)
+ {
+ cleanup_subxact_info();
+ BufFileDeleteShared(ent->subxact_fileset, path);
+ pfree(ent->subxact_fileset);
+ ent->subxact_fileset = NULL;
+ }

I don't think it is right to use the BufFileDeleteShared interface here
because it won't perform SharedFileSetUnregister, which means that if the
server exits after the above code executes, it will crash in
SharedFileSetDeleteOnProcExit, which will try to access the already-deleted
fileset entry. I fixed this by calling SharedFileSetDeleteAll() instead.

Another related problem is that SharedFileSetDeleteOnProcExit tries to
delete list elements while traversing the list with the 'foreach'
construct, which makes the behavior of the list traversal unpredictable
(see the sketch after this list). I have fixed this in a separate patch,
v54-0001-Fix-the-SharedFileSetUnregister-API; if you are fine with it, I
would like to commit it, as it fixes a problem in the existing commit
808e13b282.
6. Function stream_cleanup_files() contains a missing_ok argument
which is not used so removed it.
7. In pgoutput.c, change the ordering of functions to make them
consistent with their declaration.
8.
typedef struct RelationSyncEntry
 {
  Oid relid; /* relation oid */
+ TransactionId xid; /* transaction that created the record */

Removed above parameter as this doesn't seem to be required as per the
new design in the patch.
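
Regarding the list-traversal issue in point 5, the idiomatic way to delete
elements while iterating in current PostgreSQL looks roughly like the
sketch below; this is only an illustration, not the actual fix in
v54-0001, and should_remove() is a stand-in predicate:

    ListCell   *lc;

    foreach(lc, filesetlist)
    {
        SharedFileSet *fileset = (SharedFileSet *) lfirst(lc);

        if (should_remove(fileset))
            filesetlist = foreach_delete_current(filesetlist, lc);
    }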

Apart from above, I have added/changed quite a few comments and a few
other cosmetic changes. Kindly review and let me know what do you
think about the changes?

One more comment for which I haven't done anything yet.
+static void
+set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid)
+{
+ MemoryContext oldctx;
+
+ oldctx = MemoryContextSwitchTo(CacheMemoryContext);
+
+ entry->streamed_txns = lappend_int(entry->streamed_txns, xid);

Is it a good idea to append xid with lappend_int? Won't we need
something equivalent for uint32? If so, I think we have a couple of
options (a) use lcons method and accordingly append the pointer to
xid, I think we need to allocate memory for xid if we want to use this
idea or (b) use an array instead. What do you think?
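
A rough sketch of option (a), palloc'ing the xid so the list stores
pointers rather than signed ints (illustrative only; whether lappend or
lcons is used is beside the point here):

    oldctx = MemoryContextSwitchTo(CacheMemoryContext);
    {
        TransactionId *xidcopy = (TransactionId *) palloc(sizeof(TransactionId));

        *xidcopy = xid;
        entry->streamed_txns = lappend(entry->streamed_txns, xidcopy);
    }
    MemoryContextSwitchTo(oldctx);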

-- 
With Regards,
Amit Kapila.


v54-0001-Fix-the-SharedFileSetUnregister-API.patch
Description: Binary data


v54-0002-Add-support-for-streaming-to-built-in-logical-re.patch
Description: Binary data


Re: new heapcheck contrib module

2020-08-29 Thread Andrey M. Borodin



> 29 авг. 2020 г., в 00:56, Robert Haas  написал(а):
> 
> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin  
> wrote:
>> I don't think so. ISTM It's the same problem of xmax < relfrozenxid, just hidden behind detoasing.
>> Our regular heap_check was checking xmin\xmax invariants for tables, but 
>> failed to recognise the problem in toast (while toast was accessible until 
>> CLOG truncation).
> 
> The code can (and should, and I think does) refrain from looking up
> XIDs that are out of the range thought to be valid -- but how do you
> propose that it avoid looking up XIDs that ought to have clog data
> associated with them despite being >= relfrozenxid and < nextxid?
> TransactionIdDidCommit() does not have a suppress-errors flag, adding
> one would be quite invasive, yet we cannot safely perform a
> significant number of checks without knowing whether the inserting
> transaction committed.

What you write seems completely correct to me. I agree that CLOG thresholds 
lookup seems unnecessary.

But I have a real corruption at hand (on a testing site). I have the heapcheck 
proposed here, and I have pg_surgery from the thread nearby, yet I cannot fix 
the problem, because I cannot list the affected tuples. These tools do not 
solve a problem that has been neglected for long enough. It would be supercool 
if they could.

This corruption, like caries, had 3 stages:
1. an incorrect VM flag claiming the page does not need vacuum
2. xmin and xmax < relfrozenxid
3. CLOG truncated

Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not 
that different.

Thanks!

Best regards, Andrey Borodin.



Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-08-29 Thread Dilip Kumar
On Sat, Aug 29, 2020 at 1:46 AM Robert Haas  wrote:
>
> On Fri, Aug 28, 2020 at 1:29 PM Andres Freund  wrote:
> > It can break HOT chains, plain ctid chains etc, for example. Which, if
> > earlier / follower tuples are removed can't be detected anymore at a
> > later time.
>
> I think I need a more specific example here to understand the problem.
> If the xmax of one tuple matches the xmin of the next, then either
> both values precede relfrozenxid or both follow it. In the former
> case, neither tuple should be frozen and the chain should not get
> broken; in the latter case, everything's normal anyway. If the xmax
> and xmin don't match, then the chain was already broken. Perhaps we
> are removing important evidence, though it seems like that might've
> happened anyway prior to reaching the damaged page, but we're not
> making whatever corruption may exist any worse. At least, not as far
> as I can see.

One example: suppose that during vacuum there are 2 tuples in the HOT
chain, and the xmin of the first tuple is corrupted (i.e. smaller
than relfrozenxid), while the xmax of this tuple (which is the same as the
xmin of the second tuple) is smaller than the cutoff_xid when we try
to freeze the tuples.  As a result, vacuum will freeze the second tuple but
leave the first tuple untouched.

Now, if we then come to heap_hot_search_buffer, the xmax of the
first tuple will not match the xmin of the second tuple, as we have
frozen the second tuple.  But I feel this is easily fixable, right?
Instead of not doing anything to the corrupted tuple, we could
partially freeze it: leave the corrupted xid alone but mark the other
xid as frozen if it is smaller than cutoff_xid.
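
In code terms, the partial-freeze idea would look roughly like this
(an illustrative fragment with assumed variable names, not a patch
against heap_prepare_freeze_tuple):

    /* xmin: report corruption but leave it alone; otherwise freeze normally */
    if (TransactionIdIsNormal(xmin) &&
        TransactionIdPrecedes(xmin, relfrozenxid))
        found_corruption = true;
    else if (TransactionIdPrecedes(xmin, cutoff_xid))
        freeze_xmin = true;

    /* xmax: frozen on its own merits, even when xmin is corrupt */
    if (TransactionIdIsNormal(xmax) &&
        !TransactionIdPrecedes(xmax, relfrozenxid) &&
        TransactionIdPrecedes(xmax, cutoff_xid))
        freeze_xmax = true;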

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-08-29 Thread Dilip Kumar
On Fri, Aug 28, 2020 at 9:49 PM Robert Haas  wrote:
>
> On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar  wrote:
> > In the previous version, the feature was enabled for cluster/vacuum
> > full command as well.   in the attached patch I have enabled it only
> > if we are running vacuum command.  It will not be enabled during a
> > table rewrite.  If we think that it should be enabled for the 'vacuum
> > full' then we might need to pass a flag from the cluster_rel, all the
> > way down to the heap_freeze_tuple.
>
> I think this is a somewhat clunky way of accomplishing this. The
> caller passes down a flag to heap_prepare_freeze_tuple() which decides
> whether or not an error is forced, and then that function and
> FreezeMultiXactId use vacuum_damage_elevel() to combine the results of
> that flag with the value of the vacuum_tolerate_damage GUC. But that
> means that a decision that could be made in one place is instead made
> in many places. If we just had heap_freeze_tuple() and
> FreezeMultiXactId() take an argument int vacuum_damage_elevel, then
> heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could
> arrange to pass WARNING or ERROR based on the value of
> vacuum_tolerate_damage. I think that would likely end up cleaner. What
> do you think?

I agree, this way it is much cleaner.  I have changed it in the attached patch.

> I also suggest renaming is_corrupted_xid to found_corruption. With the
> current name, it's not very clear which XID we're saying is corrupted;
> in fact, the problem might be a MultiXactId rather than an XID, and
> the real issue might be with the table's pg_class entry or something.

Okay, changed to found_corruption.

> The new arguments to heap_prepare_freeze_tuple() need to be documented
> in its header comment.

Done.

I have also done a few more cosmetic changes to the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


v4-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch
Description: Binary data