Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-24 Thread Amit Kapila
On Fri, Jul 24, 2020 at 7:36 PM Tom Lane  wrote:
>
> Robert Haas  writes:
> > Well, I think the comments could be more clear - for the insert case
> > specifically - about which cases you think are and are not safe.
>

Okay, I'll update the patch accordingly.

> Yeah, the proposed comment changes don't actually add much.  Also
> please try to avoid inserting non-ASCII characters into the source code;
> at least in my mail reader, that attachment has some.
>

I don't see any non-ASCII characters in the patch.  I have applied and
checked (via vi editor) the patch as well but don't see any non-ASCII
characters.  How can I verify that?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Row estimates for empty tables

2020-07-24 Thread Pavel Stehule
On Sat, Jul 25, 2020 at 0:34, Tom Lane  wrote:

> [ redirecting to -hackers ]
>
> I wrote:
> > The core issue here is "how do we know whether the table is likely to
> stay
> > empty?".  I can think of a couple of more or less klugy solutions:
>

For these special cases it is probably possible to ensure an ANALYZE before
any SELECT. When the table is created, it is analyzed, and after that it is
published and used for SELECT. Usually such a table is never modified again.

Because it is a special case, the solution does not necessarily need to be
too sophisticated. But a built-in solution could be designed more generally.



> > 1. Arrange to send out a relcache inval when adding the first page to
> > a table, and then remove the planner hack for disbelieving relpages = 0.
> > I fear this'd be a mess from a system structural standpoint, but it might
> > work fairly transparently.
>
> I experimented with doing this.  It's not hard to code, if you don't mind
> having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
> sure whether that code path might cause any long-term problems, but it
> seems to work OK right now.  However, this solution causes massive
> "failures" in the regression tests as a result of plans changing.  I'm
> sure that's partly because we use so many small tables in the tests.
> Nonetheless, it's not promising from the standpoint of not causing
> unexpected problems in the real world.
>
> > 2. Establish the convention that vacuuming or analyzing an empty table
> > is what you do to tell the system that this state is going to persist.
> > That's more or less what the existing comments in plancat.c envision,
> > but we never made a definition for how the occurrence of that event
> > would be recorded in the catalogs, other than setting relpages > 0.
> > Rather than adding another pg_class column, I'm tempted to say that
> > vacuum/analyze should set relpages to a minimum of 1, even if the
> > relation has zero pages.
>
> I also tried this, and it seems a lot more promising: no existing
> regression test cases change.  So perhaps we should do the attached
> or something like it.
>

I am sending a patch that has been used for years in GoodData.

I am not sure if the company uses 0 or 1, but I can ask.
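For illustration, with the attached patch the assumption could be tuned
per session (the GUC name and default come from the patch below):

SET fake_pages = 0;   -- trust relpages = 0, i.e. believe a never-vacuumed
                      -- table is really empty
SET fake_pages = 10;  -- equivalent to the currently hard-coded value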

Regards

Pavel



> regards, tom lane
>
>
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index c814733b22..d9ca9904e1 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -33,6 +33,7 @@
 /* GUC variables */
 char	   *default_table_access_method = DEFAULT_TABLE_ACCESS_METHOD;
 bool		synchronize_seqscans = true;
+int			fake_pages = 10;
 
 
 /* 
@@ -591,10 +592,10 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 	 * Totally empty parent tables are quite common, so we should be willing
 	 * to believe that they are empty.
 	 */
-	if (curpages < 10 &&
+	if (curpages < fake_pages &&
 		relpages == 0 &&
 		!rel->rd_rel->relhassubclass)
-		curpages = 10;
+		curpages = fake_pages;
 
 	/* report estimated # pages */
 	*pages = curpages;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 11b3f050bc..6954274c28 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2195,6 +2195,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"fake_pages", PGC_USERSET, QUERY_TUNING_OTHER,
+			gettext_noop("Sets the number of pages assumed by the planner when a relation has no pages."),
+			NULL,
+			GUC_UNIT_BLOCKS
+		},
+		&fake_pages,
+		10, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"max_standby_archive_delay", PGC_SIGHUP, REPLICATION_STANDBY,
 			gettext_noop("Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data."),
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index eb18739c36..f997f9dbc4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -30,6 +30,8 @@
 extern char *default_table_access_method;
 extern bool synchronize_seqscans;
 
+extern int fake_pages;
+
 
 struct BulkInsertStateData;
 struct IndexInfo;


Re: handle a ECPG_bytea typo

2020-07-24 Thread vignesh C
On Fri, Jul 24, 2020 at 1:35 PM Wang, Shenhao
 wrote:
>
> Hi, hackers
>
> The source looks like:
>
> case ECPGt_bytea:
> {
> struct ECPGgeneric_varchar *variable =
> (struct ECPGgeneric_varchar *) (var->value);
>
> ..
> }
>
> I think the developer intended to use struct ECPGgeneric_bytea instead of
> struct ECPGgeneric_varchar.
>
> Is this thought right?
>
> I have written a patch to fix this typo.

I felt the changes look correct. The reason it might have been working
earlier is that the structure members are the same for both data
structures, ECPGgeneric_bytea & ECPGgeneric_varchar.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel worker hangs while handling errors.

2020-07-24 Thread vignesh C
On Fri, Jul 24, 2020 at 12:41 PM Bharath Rupireddy
 wrote:
>
> On Thu, Jul 23, 2020 at 12:54 PM vignesh C  wrote:
> >
> > Thanks for reviewing and adding your thoughts, My comments are inline.
> >
> > On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
> >  wrote:
> > >
> > > The same hang issue can occur (though I'm not able to back it up with a
> > > use case) in the cases where EmitErrorReport() is called from the
> > > "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as in
> > > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > > postgres.c.
> > >
> >
> > I'm not sure if this can occur in other cases.
> >
>
> I checked further on this point: Yes, it can't occur for the other
> cases, as mq_putmessage() only gets used by parallel
> workers (ParallelWorkerMain() --> pq_redirect_to_shm_mq()).
>
> >
> > > Note that, in all sigsetjmp blocks, we intentionally
> > > HOLD_INTERRUPTS() so as not to cause any issues while performing error
> > > handling. I'm concerned that if we now directly call
> > > BackgroundWorkerUnblockSignals(), which will open up all the signals,
> > > our main intention of holding interrupts or blocking signals may go
> > > away.
> > >
> > > Since the main cause of this hang issue is the blocking of SIGUSR1,
> > > can't we just unblock only SIGUSR1 in sigsetjmp, instead of unblocking
> > > all signals? I tried this with the parallel copy hang, and the issue
> > > is resolved.
> > >
> >
> > On further thought, I feel just unblocking SIGUSR1 would be the right
> > approach in this case. I'm attaching a new patch which unblocks the
> > SIGUSR1 signal. I have verified that the original issue with the WIP
> > parallel copy patch gets fixed. I have made changes only in bgworker.c,
> > as we require the parallel worker to receive this signal and continue
> > processing. I have not included changes for other processes, as I'm not
> > sure this scenario is applicable to them.
> >
>
> Since we are sure that this hang issue can occur only for parallel
> workers, the change in StartBackgroundWorker's sigsetjmp block should
> only be made for the parallel worker case. Also, there can be a lot of
> other callback execution and processing in proc_exit() for which we
> might not need SIGUSR1 unblocked. So, let's undo the unblocking right
> after EmitErrorReport() so as not to cause any new issues.
>
> Attaching a modified v2 patch: it has the unblocking only for parallel
> workers, undoes it after EmitErrorReport(), and has some adjustments
> in the comments.
>

I have made slight changes on top of the patch to remove duplicate
code; attached is a v3 patch for the same.
The parallel worker hang issue gets resolved, and make check & make
check-world pass.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From b48849d091302dfdac4fc004195b3c532fdb9dcb Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Sat, 25 Jul 2020 06:38:40 +0530
Subject: [PATCH v3] Fix for Parallel worker hangs while handling errors.

The worker is not able to receive signals while processing the error flow. The
worker hangs in this case because, when the worker is started, the signals are
masked using sigprocmask. Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. Due to error handling,
the worker has jumped to sigsetjmp in the StartBackgroundWorker function.
There the signals are still in a blocked state, hence the signal is not
received by the worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 src/backend/postmaster/bgworker.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..0b9f214 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -671,6 +671,23 @@ bgworker_sigusr1_handler(SIGNAL_ARGS)
 }
 
 /*
+ * update_parallel_worker_sigmask - add or remove a signal from sigmask.
+ */
+static void
+update_parallel_worker_sigmask(BackgroundWorker *worker, int signum,
+			   bool isadd)
+{
+	if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+	{
+		if (isadd)
+			sigaddset(&BlockSig, signum);
+		else
+			sigdelset(&BlockSig, signum);
+		PG_SETMASK(&BlockSig);
+	}
+}
+
+/*
  * Start a new background worker
  *
  * This is the main entry point for background worker, to be called from
@@ -747,6 +764,16 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock the SIGUSR1 signal; it was
+		 * blocked when the postmaster forked us. The leader process will send
+		 * a SIGUSR1 signal to the worker process (the worker process will be
+		 * in a waiting state as there is no space available) to indicate that
+		 * shared memory space has been freed up. Once the signal is received,
+		 * the worker process will continue populating the error message.
+		 */
+		update_parallel_worker_sigmask(worker, SIGUSR1, false);
+
		/* Since not using PG_TRY, must reset error stack by hand */

Re: Include access method in listTables output

2020-07-24 Thread vignesh C
On Thu, Jul 16, 2020 at 7:35 PM Georgios  wrote:
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Saturday, July 11, 2020 3:16 PM, vignesh C  wrote:
>
> > On Mon, Jul 6, 2020 at 1:24 PM Georgios gkokola...@protonmail.com wrote:
> >
> > > ‐‐‐ Original Message ‐‐‐
> > > On Monday, July 6, 2020 3:12 AM, Michael Paquier mich...@paquier.xyz 
> > > wrote:
> > >
> > > > On Sun, Jul 05, 2020 at 07:13:10AM +0530, vignesh C wrote:
> > > >
> > > > > I'm not sure if we should include showViews, I had seen that the
> > > > > access method was not getting selected for view.
> > > >
> > > > +1. These have no physical storage, so you are looking here for
> > > > relkinds that satisfy RELKIND_HAS_STORAGE().
> > >
> > > Thank you for the review.
> > > Find attached v4 of the patch.
> >
> > Thanks for fixing the comments.
> > Patch applies cleanly, make check & make check-world passes.
> > I was not sure if any documentation change is required or not for this
> > in doc/src/sgml/ref/psql-ref.sgml. Thoughts?
> >
>
> Thank you for the comment. I added a line in docs. Admittedly, might need 
> tweaking.
>
> Please find version 5 of the patch attached.

Most of the changes look fine; minor comments:

 \pset tuples_only false
 -- check conditional tableam display
--- Create a heap2 table am handler with heapam handler
+\pset expanded off
+CREATE SCHEMA conditional_tableam_display;
+CREATE ROLE conditional_tableam_display_role;
+ALTER SCHEMA conditional_tableam_display OWNER TO
conditional_tableam_display_role;
+SET search_path TO conditional_tableam_display;
 CREATE ACCESS METHOD heap_psql TYPE TABLE HANDLER heap_tableam_handler;

This comment might have been removed unintentionally; do you want to
add it back?

+-- access method column should not be displayed for sequences
+\ds+
+List of relations
+ Schema | Name | Type | Owner | Persistence | Size | Description
++--+--+---+-+--+-
+(0 rows)

We could include one test for views.
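Perhaps something along these lines (names hypothetical), with the
expectation that no Access Method column shows up in the output:

CREATE VIEW conditional_tableam_display_v AS SELECT 1 AS a;
\dv+ conditional_tableam_display_v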

+ if (pset.sversion >= 12 && !pset.hide_tableam &&
+ (showTables || showMatViews || showIndexes))
+ appendPQExpBuffer(&buf,
+   ",\n  am.amname as \"%s\"",
+   gettext_noop("Access Method"));
+
  /*
  * As of PostgreSQL 9.0, use pg_table_size() to show a more accurate
  * size of a table, including FSM, VM and TOAST tables.
@@ -3772,6 +3778,12 @@ listTables(const char *tabtypes, const char
*pattern, bool verbose, bool showSys
  appendPQExpBufferStr(&buf,
  "\nFROM pg_catalog.pg_class c"
  "\n LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace");
+
+ if (pset.sversion >= 12 && !pset.hide_tableam &&
+ (showTables || showMatViews || showIndexes))

If the conditions in both places are the same, do you want to make it a macro?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Row estimates for empty tables

2020-07-24 Thread David Rowley
On Sat, 25 Jul 2020 at 10:34, Tom Lane  wrote:
> I wrote:
> > 1. Arrange to send out a relcache inval when adding the first page to
> > a table, and then remove the planner hack for disbelieving relpages = 0.
> > I fear this'd be a mess from a system structural standpoint, but it might
> > work fairly transparently.
>
> I experimented with doing this.  It's not hard to code, if you don't mind
> having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
> sure whether that code path might cause any long-term problems, but it
> seems to work OK right now.  However, this solution causes massive
> "failures" in the regression tests as a result of plans changing.  I'm
> sure that's partly because we use so many small tables in the tests.
> Nonetheless, it's not promising from the standpoint of not causing
> unexpected problems in the real world.

I guess all these changes would be the planner moving towards plans
better suited to tables with fewer rows.  If so, that
does seem quite scary, as we already have enough problems from the
planner choosing poor plans when it thinks there are fewer rows than
there actually are.  Don't we need to keep something like the 10-page
estimate there so safer plans are produced before auto-vacuum gets in
and gathers some proper stats?

I think if anything we'd want to move in the direction of producing
more cautious plans when the estimated number of rows is low. Perhaps
especially so when the planner opts to perform a non-parameterized
nested loop join because it thinks a RelOptInfo with, say, 3 correlated
(unbeknown to the planner) base restriction quals will produce just
1 row, when it actually produces many more.
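For concreteness, a contrived sketch of that kind of misestimate
(assuming default statistics):

CREATE TABLE t AS
  SELECT i % 10 AS a, i % 10 AS b, i % 10 AS c
  FROM generate_series(1, 100000) i;
ANALYZE t;
-- a, b and c are perfectly correlated; assuming independence, the planner
-- estimates 100000 * 0.1^3 = 100 rows, but 10000 rows actually match:
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 1 AND c = 1;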

> > 2. Establish the convention that vacuuming or analyzing an empty table
> > is what you do to tell the system that this state is going to persist.
> > That's more or less what the existing comments in plancat.c envision,
> > but we never made a definition for how the occurrence of that event
> > would be recorded in the catalogs, other than setting relpages > 0.
> > Rather than adding another pg_class column, I'm tempted to say that
> > vacuum/analyze should set relpages to a minimum of 1, even if the
> > relation has zero pages.
>
> I also tried this, and it seems a lot more promising: no existing
> regression test cases change.  So perhaps we should do the attached
> or something like it.

This sounds like a more plausible solution. At least this way there's
an escape hatch for people who suffer due to this.

David




Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Andres Freund
On 2020-07-24 18:15:15 -0300, Ranier Vilela wrote:
> On Fri, Jul 24, 2020 at 14:16, Andres Freund  wrote:
> 
> > On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > > Latest Postgres
> > > Windows 64 bits
> > > msvc 2019 64 bits
> > >
> > > Patches applied v12-0001 to v12-0007:
> > >
> > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning
> > C4013:
> > > 'GetOldestXmin' undefined; assuming extern returning int
> > > [C:\dll\postgres
> > > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > > C4013: 'GetOldestXmin' undefined; assuming extern returning int
> > > [C:\dll\postgres\pg_visibility.
> > > vcxproj]
> > >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > > 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > [C:\dll\postgres\pgstattuple.vcxproj]
> > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> > >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > > [C:\dll\postgres\pg_visibility.vcxproj]
> >
> > I don't know what that's about - there's no call to GetOldestXmin() in
> > pgstatapprox and pg_visibility after patch 0002? And similarly, the
> > PROCARRAY_* references are also removed in the same patch?
> >
> Maybe they need to be removed from these places, no?
> C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
> File pgstattuple\pgstatapprox.c:
> OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
> File pg_visibility\pg_visibility.c:
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
>  * deadlocks, because surely GetOldestXmin() should never take
> RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

The 0002 patch changed those files:

diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 68d580ed1e0..37206c50a21 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -563,17 +563,14 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 	TransactionId OldestXmin = InvalidTransactionId;
 
-	if (all_visible)
-	{
-		/* Don't pass rel; that will fail in recovery. */
-		OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
-	}
-
 	rel = relation_open(relid, AccessShareLock);
 
 	/* Only some relkinds have a visibility map */
 	check_relation_relkind(rel);
 
+	if (all_visible)
+		OldestXmin = GetOldestNonRemovableTransactionId(rel);
+
 	nblocks = RelationGetNumberOfBlocks(rel);
 
 	/*
@@ -679,11 +676,12 @@ collect_corrupt_items(Oid relid, bool all_visible, bool all_frozen)
 				 * From a concurrency point of view, it sort of sucks to
 				 * retake ProcArrayLock here while we're holding the buffer
 				 * exclusively locked, but it should be safe against
-				 * deadlocks, because surely GetOldestXmin() should never take
-				 * a buffer lock. And this shouldn't happen often, so it's
-				 * worth being careful so as to avoid false positives.
+				 * deadlocks, because surely GetOldestNonRemovableTransactionId()
+				 * should never take a buffer lock. And this shouldn't happen
+				 * often, so it's worth being careful so as to avoid false
+				 * positives.
 				 */
-				RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
+				RecomputedOldestXmin = GetOldestNonRemovableTransactionId(rel);
 
 				if (!TransactionIdPrecedes(OldestXmin, RecomputedOldestXmin))
 					record_corrupt_item(items, &tuple.t_self);

diff --git a/contrib/pgstattuple/pgstatapprox.c b/contrib/pgstattuple/pgstatapprox.c
index dbc0fa11f61..3a99333d443 100644
--- a/contrib/pgstattuple/pgstatapprox.c
+++ b/contrib/pgstattuple/pgstatapprox.c
@@ -71,7 +71,7 @@ statapprox_heap(Relation rel, output_type *stat)
 	BufferAccessStrategy bstrategy;
 	TransactionId OldestXmin;
 
-	OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
+	OldestXmin = GetOldestNonRemovableTransactionId(rel);
 	bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	nblocks = RelationGetNumberOfBlocks(rel);


Greetings,

Andres Freund




Re: hashagg slowdown due to spill changes

2020-07-24 Thread Andres Freund
Hi,

On 2020-06-12 14:37:15 -0700, Andres Freund wrote:
> On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> > On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > > Did you run any performance tests?
> >
> > Yes, I reproduced your ~12% regression from V12, and this patch nearly
> > eliminated it for me.
> 
> I spent a fair bit of time looking at the difference. Jeff had let me
> know on chat that he was still seeing some difference, but couldn't
> quite figure out where that was.
> 
> Trying it out myself, I observed that the patch helped, but not that
> much. After a bit I found one major reason why:
> LookupTupleHashEntryHash() assigned the hash to the pointer provided by
> the caller before doing the insertion. That ended up causing a pipeline
> stall (I assume it's store forwarding, but not sure). Moving the
> assignment to the caller's variable to after the insertion got rid of
> that.

This is still not resolved. We're right now slower than 12. It's
effectively not possible to do performance comparisons right now. This
was nearly two months ago.

Greetings,

Andres Freund




Re: Row estimates for empty tables

2020-07-24 Thread Tom Lane
[ redirecting to -hackers ]

I wrote:
> The core issue here is "how do we know whether the table is likely to stay
> empty?".  I can think of a couple of more or less klugy solutions:

> 1. Arrange to send out a relcache inval when adding the first page to
> a table, and then remove the planner hack for disbelieving relpages = 0.
> I fear this'd be a mess from a system structural standpoint, but it might
> work fairly transparently.

I experimented with doing this.  It's not hard to code, if you don't mind
having RelationGetBufferForTuple calling CacheInvalidateRelcache.  I'm not
sure whether that code path might cause any long-term problems, but it
seems to work OK right now.  However, this solution causes massive
"failures" in the regression tests as a result of plans changing.  I'm
sure that's partly because we use so many small tables in the tests.
Nonetheless, it's not promising from the standpoint of not causing
unexpected problems in the real world.

> 2. Establish the convention that vacuuming or analyzing an empty table
> is what you do to tell the system that this state is going to persist.
> That's more or less what the existing comments in plancat.c envision,
> but we never made a definition for how the occurrence of that event
> would be recorded in the catalogs, other than setting relpages > 0.
> Rather than adding another pg_class column, I'm tempted to say that
> vacuum/analyze should set relpages to a minimum of 1, even if the
> relation has zero pages.

I also tried this, and it seems a lot more promising: no existing
regression test cases change.  So perhaps we should do the attached
or something like it.
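As a quick illustration of the intended effect (a sketch; the estimated
row counts are illustrative and depend on column widths):

CREATE TABLE really_empty (id int);
EXPLAIN SELECT * FROM really_empty;  -- planner assumes ~10 pages, so many rows
VACUUM really_empty;                 -- with the patch, sets relpages = 1
EXPLAIN SELECT * FROM really_empty;  -- planner now believes the table is empty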

regards, tom lane

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1bbc4598f7..243cdaa409 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -594,6 +594,19 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	new_frozen_xid = scanned_all_unfrozen ? FreezeLimit : InvalidTransactionId;
 	new_min_multi = scanned_all_unfrozen ? MultiXactCutoff : InvalidMultiXactId;
 
+	/*
+	 * Another corner case is that if the relation is physically zero-length,
+	 * we don't want to record relpages=reltuples=0 in pg_class; what we want
+	 * to record is relpages=1, reltuples=0.  This tells the planner that the
+	 * relation has been seen to be empty by VACUUM or ANALYZE, so it should
+	 * not override a small measured table size.
+	 */
+	if (new_rel_pages == 0)
+	{
+		Assert(new_live_tuples == 0);
+		new_rel_pages = 1;
+	}
+
 	vac_update_relstats(onerel,
 		new_rel_pages,
 		new_live_tuples,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 4b2bb29559..0a92450e8a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -585,11 +585,9 @@ table_block_relation_estimate_size(Relation rel, int32 *attr_widths,
 	 * doesn't happen instantaneously, and it won't happen at all for cases
 	 * such as temporary tables.)
 	 *
-	 * We approximate "never vacuumed" by "has relpages = 0", which means this
-	 * will also fire on genuinely empty relations.  Not great, but
-	 * fortunately that's a seldom-seen case in the real world, and it
-	 * shouldn't degrade the quality of the plan too much anyway to err in
-	 * this direction.
+	 * We test "never vacuumed" by seeing whether relpages = 0.  VACUUM and
+	 * ANALYZE will never set relpages to less than 1, even if the relation is
+	 * in fact zero length.
 	 *
 	 * If the table has inheritance children, we don't apply this heuristic.
 	 * Totally empty parent tables are quite common, so we should be willing
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 924ef37c81..f0247e350f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -620,6 +620,16 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 
 		visibilitymap_count(onerel, , NULL);
 
+		/*
+		 * As in VACUUM, replace relpages=reltuples=0 with relpages=1,
+		 * reltuples=0.
+		 */
+		if (relpages == 0)
+		{
+			Assert(totalrows == 0);
+			relpages = 1;
+		}
+
 		vac_update_relstats(onerel,
 			relpages,
 			totalrows,


Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Ranier Vilela
On Fri, Jul 24, 2020 at 14:16, Andres Freund  wrote:

> On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> > Latest Postgres
> > Windows 64 bits
> > msvc 2019 64 bits
> >
> > Patches applied v12-0001 to v12-0007:
> >
> >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning
> C4013:
> > 'GetOldestXmin' undefined; assuming extern returning int
> > [C:\dll\postgres
> > C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> > C4013: 'GetOldestXmin' undefined; assuming extern returning int
> > [C:\dll\postgres\pg_visibility.
> > vcxproj]
> >  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> > 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pgstattuple.vcxproj]
> >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pg_visibility.vcxproj]
> >   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> > C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> > [C:\dll\postgres\pg_visibility.vcxproj]
>
> I don't know what that's about - there's no call to GetOldestXmin() in
> pgstatapprox and pg_visibility after patch 0002? And similarly, the
> PROCARRAY_* references are also removed in the same patch?
>
Maybe they need to be removed from these places, no?
C:\dll\postgres\contrib>grep -d GetOldestXmin *.c
File pgstattuple\pgstatapprox.c:
OldestXmin = GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM);
File pg_visibility\pg_visibility.c:
OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);
 * deadlocks, because surely GetOldestXmin() should never take
RecomputedOldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_VACUUM);

regards,
Ranier Vilela


Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 12:16 PM Tomas Vondra
 wrote:
> Maybe, but we're nowhere close to these limits. See this table which I
> posted earlier:
>
>    2MB   Planned Partitions:  64    HashAgg Batches:  4160
>    4MB   Planned Partitions: 128    HashAgg Batches: 16512
>    8MB   Planned Partitions: 256    HashAgg Batches: 21488
>   64MB   Planned Partitions:  32    HashAgg Batches:  2720
>  256MB   Planned Partitions:   8    HashAgg Batches:     8
>
> This is from the non-parallel runs on the i5 machine with 32GB data set,
> the first column is work_mem. We're nowhere near the 1024 limit, and the
> cardinality estimates are pretty good.
>
> OTOH the number of batches is much higher, so clearly there was some
> recursive spilling happening. What I find strange is that this grows
> with work_mem and only starts dropping after 64MB.

Could that be caused by clustering in the data?

If the input data is in totally random order then we have a good
chance of never having to spill skewed "common" values. That is, we're
bound to encounter common values before entering spill mode, and so
those common values will continue to be usefully aggregated until
we're done with the initial groups (i.e. until the in-memory hash
table is cleared in order to process spilled input tuples). This is
great because the common values get aggregated without ever spilling,
and most of the work is done before we even begin with spilled tuples.

If, on the other hand, the common values are concentrated together in
the input...

Assuming that I have this right, I would also expect simply
having more memory to ameliorate the problem. If you only have/need 4
or 8 partitions then you can fit a higher proportion of the total
number of groups for the whole dataset in the hash table (at the point
when you first enter spill mode). I think it follows that the "nailed"
hash table entries/groupings will "better characterize" the dataset as
a whole.
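One way to probe that hypothesis, as a rough sketch (table names, sizes,
and work_mem are made up; compare "HashAgg Batches" and "Disk Usage"
between the two runs):

CREATE TABLE rand_input AS
  SELECT CASE WHEN random() < 0.5 THEN 0 ELSE i END AS g
  FROM generate_series(1, 10000000) i;       -- half the rows share g = 0
CREATE TABLE clust_input AS
  SELECT g FROM rand_input ORDER BY g DESC;  -- the common value arrives last
SET work_mem = '4MB';
EXPLAIN (ANALYZE) SELECT g, count(*) FROM rand_input GROUP BY g;
EXPLAIN (ANALYZE) SELECT g, count(*) FROM clust_input GROUP BY g;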

> Also, how could the amount of I/O be almost constant in all these cases?
> Surely more recursive spilling should do more I/O, but the Disk Usage
> reported by explain analyze does not show anything like ...

Not sure, but might that just be because of the fact that logtape.c
can recycle disk space?

As I said in my last e-mail, it's pretty reasonable to assume that the
vast majority of external sorts are one-pass. It follows that disk
usage can be thought of as almost the same thing as total I/O for
tuplesort. But the same heuristic isn't reasonable when thinking about
hash agg. Hash agg might write out much less data than the total
memory used for the equivalent "peak optimal nospill" hash agg case --
or much more. (Again, reiterating what I said in my last e-mail.)

-- 
Peter Geoghegan




Re: [PATCH] fix GIN index search sometimes losing results

2020-07-24 Thread Tom Lane
Pavel Borisov  writes:
> On Wed, Jul 22, 2020 at 19:10, Tom Lane  wrote:
>> The other issue we have to agree on is whether we want to sneak this
>> fix into v13, or wait another year for it.  I feel like it's pretty
>> late to be making potentially API-breaking changes, but on the other
>> hand this is undoubtedly a bug fix.

> I am convinced patch 0001 is necessary and sufficient to fix the bug, so I
> think it's very much worth adding it to v13.

Agreed, and done.

> As for 0002, I see the beauty of this change but I also see the value of
> leaving the defaults as they were before.
> The change of CALC_NOT behavior doesn't seem to be a source of big changes,
> though. I'm just not convinced it is very much needed.
> The best way, I think, is to leave 0002 until the next version and add
> commentary in the code that this default behavior of NOT
> doing TS_EXEC_CALC_NOT is inherited from the past, so basically any caller
> should set this flag (see patch 0003-add-comments-on-calc-not).

I don't think it's a great plan to make these two changes in two
successive versions.  They're going to be affecting basically the
same set of outside callers, at least if you assume that every
TS_execute caller will be supplying its own callback function.
So we might as well force people to make both updates at once.
Also, if there is anyone who thinks they need "skip NOT" behavior,
this'd be a great time to reconsider.

I revised 0002 to still define a flag for skipping NOTs, but
it's not the default and is indeed unused in the core code now.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 11:03:54AM -0700, Peter Geoghegan wrote:

> On Fri, Jul 24, 2020 at 8:19 AM Robert Haas  wrote:
> > This is all really good analysis, I think, but this seems like the key
> > finding. It seems like we don't really understand what's actually
> > getting written. Whether we use hash or sort doesn't seem like it
> > should have this kind of impact on how much data gets written, and
> > whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> > it should matter like this either.
>
> Isn't this more or less the expected behavior in the event of
> partitions that are spilled recursively? The cases that Tomas tested
> were mostly cases where work_mem was tiny relative to the data being
> aggregated.
>
> The following is an extract from commit 1f39bce0215 showing some stuff
> added to the beginning of nodeAgg.c:
>
> + * We also specify a min and max number of partitions per spill. Too few might
> + * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
> + * many will result in lots of memory wasted buffering the spill files (which
> + * could instead be spent on a larger hash table).
> + */
> +#define HASHAGG_PARTITION_FACTOR 1.50
> +#define HASHAGG_MIN_PARTITIONS 4
> +#define HASHAGG_MAX_PARTITIONS 1024


Maybe, but we're nowhere close to these limits. See this table which I
posted earlier:

   2MB   Planned Partitions:  64    HashAgg Batches:  4160
   4MB   Planned Partitions: 128    HashAgg Batches: 16512
   8MB   Planned Partitions: 256    HashAgg Batches: 21488
  64MB   Planned Partitions:  32    HashAgg Batches:  2720
 256MB   Planned Partitions:   8    HashAgg Batches:     8

This is from the non-parallel runs on the i5 machine with 32GB data set,
the first column is work_mem. We're nowhere near the 1024 limit, and the
cardinality estimates are pretty good.

OTOH the number of batches is much higher, so clearly there was some
recursive spilling happening. What I find strange is that this grows
with work_mem and only starts dropping after 64MB.

Also, how could the amount of I/O be almost constant in all these cases?
Surely more recursive spilling should do more I/O, but the Disk Usage
reported by explain analyze does not show anything like ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Fri, Jul 24, 2020 at 7:34 AM Robert Haas  wrote:

>
> On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
>  wrote:
> > In the read-only level I was suggesting, I wasn't suggesting that we
> > stop WAL flushes, in fact we should flush the WAL before we mark the
> > system as read-only. Once the system declares itself as read-only, it
> > will not perform any more on-disk changes; It may perform all the
> > flushes it needs as a part of the read-only request handling.
>
> I think that's already how the patch works, or at least how it should
> work. You stop new writes, flush any existing WAL, and then declare
> the system read-only. That can all be done quickly.
>

True, except for the fact that it allows dirty buffers to be flushed
after the ALTER command returns.

> > What I am saying is it doesn't have to be just the queries. I think we
> > can cater to all the other use cases simply by forcing a checkpoint
> > before marking the system as read-only.
>
> But that part can't, which means that if we did that, it would break
> the feature for the originally intended use case. I'm not on board
> with that.
>

Referring to the options you presented in [1]:
I am saying that we should allow for both: with a checkpoint (#2, or
#3) and without a checkpoint (#1) before the ALTER command returns, by
having different levels of read-onlyness.

We should have syntax variants for these. The syntax should not be an
ALTER SYSTEM SET as you have pointed out before. Perhaps:

ALTER SYSTEM READ ONLY; -- #2 or #3
ALTER SYSTEM READ ONLY WAL; -- #1
ALTER SYSTEM READ WRITE;

or even:

ALTER SYSTEM FREEZE; -- #2 or #3
ALTER SYSTEM FREEZE WAL; -- #1
ALTER SYSTEM UNFREEZE;

Regards,
Soumyadeep (VMware)

[1] 
http://postgr.es/m/ca+tgmoz-c3dz9qwhwmm4bc36n4u0xz2oyenewmf+bwokbyd...@mail.gmail.com




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Fri, Jul 24, 2020 at 7:32 AM Robert Haas  wrote:
>
> On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
>  wrote:
> > So if we are not going to address those cases, we should change the
> > syntax and remove the notion of read-only. It could be:
> >
> > ALTER SYSTEM SET wal_writes TO off|on;
> > or
> > ALTER SYSTEM SET prohibit_wal TO off|on;
>
> This doesn't really work because of the considerations mentioned in
> http://postgr.es/m/ca+tgmoakctzozr0xeqalfimbcje2rgcbazf4eybpxjtnetp...@mail.gmail.com

Ah yes. We should then have ALTER SYSTEM WAL {PERMIT|PROHIBIT}. I don't
think we should say "READ ONLY" if we still allow on-disk file changes
after the ALTER SYSTEM command returns (courtesy of dirty buffer flushes),
because it does introduce confusion, especially to an audience not privy
to this thread. When people hear "read-only" they may immediately think
of static on-disk files.

> Contrary to what you write, I don't think either #2 or #3 is
> sufficient to enable checksums, at least not without some more
> engineering, because the server would cache the state from the control
> file, and a bunch of blocks from the database. I guess it would work
> if you did a server restart afterward, but I think there are better
> ways of supporting online checksum enabling that don't require
> shutting down the server, or even making it read-only; and there's
> been significant work done on those already.

Agreed. As you mentioned, if we did do #2 or #3, we would be able to do
pg_checksums on a server that was shut down or that had crashed while it
was in a read-only state, which is what Michael was asking for in [1]. I
think it's just cleaner if we allow for this.

I don't have enough context to enumerate use cases for the advantages or
opportunities that would come with an assurance that the cluster's files
are frozen (ones not covered by any existing utilities), but surely there
are some? Like the possibility of pg_upgrade on a running server while
it can still entertain read-only queries? Surely, that's a nice one!

Of course, some or all of these utilities would need to be taught about
read-only mode.

Regards,
Soumyadeep

[1] http://postgr.es/m/20200626095921.gf1...@paquier.xyz




Re: Making CASE error handling less surprising

2020-07-24 Thread Pavel Stehule
On Fri, Jul 24, 2020 at 19:46, Tom Lane  wrote:

> Andres Freund  writes:
> > Wouldn't the rule that I proposed earlier, namely that sub-expressions
> > that involve only "proper" constants continue to get evaluated even
> > within CASE, largely address that?
>
> The more I think about that the less I like it.  It'd make the behavior
> even harder to reason about than it is now, and it doesn't fix the issue
> for subquery pullup cases.
>
> Basically this seems like a whole lot of thrashing to try to preserve
> all the details of a behavior that is kind of accidental to begin with.
> The argument that it's a performance issue seems hypothetical too,
> rather than founded on any observed results.
>
> BTW, to the extent that there is a performance issue, we could perhaps
> fix it if we resurrected the "cache stable subexpressions" patch that
> was kicking around a year or two ago.  That'd give us both
> at-most-one-evaluation and no-evaluation-until-necessary behaviors,
> if we made sure to apply it to stable CASE arms.
>

+1

regards

Pavel


> regards, tom lane
>
>
>


Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 1:40 AM Tomas Vondra
 wrote:
> Maybe, not sure what exactly you think is pathological? The trouble is
> hashagg has to spill input tuples but the memory used in no-spill case
> represents aggregated groups, so I'm not sure how you could extrapolate
> from that ...

Yeah, but when hash agg enters spill mode it will continue to advance
the transition states for groups already in the hash table, which
could be quite a significant effect. The peak memory usage for an
equivalent no-spill hash agg is therefore kind of related to the
amount of I/O needed for spilling.

I suppose that you mostly tested cases where memory was in very short
supply, where that breaks down completely. Perhaps it wasn't helpful
for me to bring that factor into this discussion -- it's not as if
there is any doubt that hash agg is spilling a lot more here in any
case.

> Not sure, but I think we need to spill roughly as much as sort, so it
> seems a bit strange that (a) we're spilling 2x as much data and yet the
> cost is so much lower.

ISTM that the amount of I/O that hash agg performs can vary *very*
widely for the same data. This is mostly determined by work_mem, but
there are second order effects. OTOH, the amount of I/O that a sort
must do is practically fixed. You can quibble with that
characterisation a bit because of multi-pass sorts, but not really --
multi-pass sorts are generally quite rare.

I think that we need a more sophisticated cost model for this in
cost_agg(). Maybe the "pages_written" calculation could be pessimized.
However, it doesn't seem like this is precisely an issue with I/O
costs.

--
Peter Geoghegan




Re: Missing CFI in hlCover()?

2020-07-24 Thread Tom Lane
I tried to duplicate a multiple-second ts_headline call here, and
failed to, so there must be something I'm missing.  Can you provide
a concrete example?  I'd like to do some analysis with perf.

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Peter Geoghegan
On Fri, Jul 24, 2020 at 8:19 AM Robert Haas  wrote:
> This is all really good analysis, I think, but this seems like the key
> finding. It seems like we don't really understand what's actually
> getting written. Whether we use hash or sort doesn't seem like it
> should have this kind of impact on how much data gets written, and
> whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> it should matter like this either.

Isn't this more or less the expected behavior in the event of
partitions that are spilled recursively? The case that Tomas tested
were mostly cases where work_mem was tiny relative to the data being
aggregated.

The following is an extract from commit 1f39bce0215 showing some stuff
added to the beginning of nodeAgg.c:

+ * We also specify a min and max number of partitions per spill. Too few might
+ * mean a lot of wasted I/O from repeated spilling of the same tuples. Too
+ * many will result in lots of memory wasted buffering the spill files (which
+ * could instead be spent on a larger hash table).
+ */
+#define HASHAGG_PARTITION_FACTOR 1.50
+#define HASHAGG_MIN_PARTITIONS 4
+#define HASHAGG_MAX_PARTITIONS 1024

-- 
Peter Geoghegan




Re: Missing CFI in hlCover()?

2020-07-24 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Hm.  I'd vote for a CFI within the recursion in TS_execute(), if there's
>> not one there yet.  Maybe hlFirstIndex needs one too --- if there are
>> a lot of words in the query, maybe that could be slow?  Did you pin the
>> blame down any more precisely than hlCover?

> I've definitely seen hlFirstIndex take a few seconds to run (while
> running this under gdb and stepping through), so that could be a good
> choice to place one (perhaps even additionally to this...).  I have to
> admit to wondering if we shouldn't consider having one in
> check_stack_depth() to try and reduce the risk of us forgetting to have
> one in sensible places, though I've not really looked at all the callers
> and that might not be reasonable in some cases (though I wonder if maybe
> we should consider having a 'default' version that has a CFI, and an alternate
> that doesn't...).

Adding it to check_stack_depth doesn't really seem like a reasonable
proposal to me; aside from failing to separate concerns, running a
long time is quite distinct from taking a lot of stack.

The reason I'm eyeing TS_execute is that it involves callbacks to
functions that might be pretty complex in themselves, eg during index
scans.  So that would guard a lot of territory besides hlCover.  But
hlFirstIndex could use a CFI too, if you've seen it take that long.
(I wonder if we need to try to make it faster.  I'd supposed that the
loop was cheap enough to be a non-problem, but with large enough
documents maybe not?  It seems like converting to a hash table could
be worthwhile for a large doc.)

regards, tom lane




display offset along with block number in vacuum errors

2020-07-24 Thread Mahendra Singh Thalor
Hi hackers,
We discussed in another email thread [1] that it would be helpful if we could
display the offset along with the block number in vacuum errors. Here, I am
proposing a patch to add the offset along with the block number in vacuum
errors.

In commit b61d161 (Introduce vacuum errcontext to display additional
information), we added a vacuum errcontext to display additional
information (the block number) so that in case of a vacuum error, we can
identify on which block we are getting the error.  In addition to the block
number, displaying the offset will be more helpful for users. To display the
offset, I am proposing two different methods (thanks to Robert for suggesting
these 2 methods):

*Method 1:* We can report the TID as well as the block number in
errcontext.
- errcontext("while scanning block %u of relation \"%s.%s\"",
-   errinfo->blkno, errinfo->relnamespace, errinfo->relname);
+ errcontext("while scanning block %u and offset %u of relation \"%s.%s\"",
+   errinfo->blkno, errinfo->offnum, errinfo->relnamespace,
errinfo->relname);

The above fix requires more calls to update_vacuum_error_info(). Attaching
the v01_0001 patch for this method.
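With method 1, the added context would look something like this (block,
offset, and relation values hypothetical):

CONTEXT:  while scanning block 42 and offset 7 of relation "public.tbl"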

*Method 2:* We can improve the error messages by passing the relevant TID
to heap_prepare_freeze_tuple and having it report the TID as part of the
error message or in the error detail.
  ereport(ERROR,
  (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg_internal("found xmin %u from before relfrozenxid %u",
+ errmsg_internal("for block %u and offnum %u, found xmin %u from before
relfrozenxid %u",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid),
  xid, relfrozenxid)));

Attaching v01_0002 patch for this method.

Please let me know your thoughts.

[1] : http://postgr.es/m/20200713223822.az6fo3m2x4t42...@alap3.anarazel.de
-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


v01_0001-Added-offset-with-block-number-in-vacuum-errcontext.patch
Description: Binary data


v01_0002-Added-block-and-offset-to-errors-of-heap_prepare_fre.patch
Description: Binary data


Re: Making CASE error handling less surprising

2020-07-24 Thread Tom Lane
Andres Freund  writes:
> Wouldn't the rule that I proposed earlier, namely that sub-expressions
> that involve only "proper" constants continue to get evaluated even
> within CASE, largely address that?

The more I think about that the less I like it.  It'd make the behavior
even harder to reason about than it is now, and it doesn't fix the issue
for subquery pullup cases.

Basically this seems like a whole lot of thrashing to try to preserve
all the details of a behavior that is kind of accidental to begin with.
The argument that it's a performance issue seems hypothetical too,
rather than founded on any observed results.

BTW, to the extent that there is a performance issue, we could perhaps
fix it if we resurrected the "cache stable subexpressions" patch that
was kicking around a year or two ago.  That'd give us both
at-most-one-evaluation and no-evaluation-until-necessary behaviors,
if we made sure to apply it to stable CASE arms.

regards, tom lane




Re: Making CASE error handling less surprising

2020-07-24 Thread Andres Freund



On July 24, 2020 10:30:37 AM PDT, Pavel Stehule  wrote:
>pá 24. 7. 2020 v 19:13 odesílatel Andres Freund 
>napsal:
>> Your earlier example of a WHEN ... THEN upper('constant') ... would
>> still have the upper('constant') be evaluated, because it doesn't
>> involve a parameter. And e.g. THEN upper('constant') * $1 would also
>> still have the upper('constant') be evaluated, just the
>multiplication
>> with $1 wouldn't get evaluated.
>>
>>
>> I'm not sure what you're concerned about with the one-shot bit?
>>
>
>Now query parameters are evaluated like constants.

How's that related to one-shot plans?

>I can imagine a WHERE clause like WHERE col = CASE $1 WHEN true THEN
>upper($2) ELSE $2 END
>
>I remember applications that use these strange queries to support
>parameterized behaviour - like case-sensitive or case-insensitive
>searching.

I don't buy this as a significant issue. This could much more sensibly be 
written as an OR. If the policy is that we cannot regress anything then there's 
no way we can improve at all.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Making CASE error handling less surprising

2020-07-24 Thread Pavel Stehule
On Fri, Jul 24, 2020 at 19:13, Andres Freund  wrote:

> Hi,
>
> On 2020-07-24 19:03:30 +0200, Pavel Stehule wrote:
> > pá 24. 7. 2020 v 18:49 odesílatel Andres Freund 
> napsal:
> > > Wouldn't the rule that I proposed earlier, namely that sub-expressions
> > > that involve only "proper" constants continue to get evaluated even
> > > within CASE, largely address that?
> > >
> >
> > It doesn't solve a possible performance problem with one-shot queries
> > (EXECUTE stmt in plpgsql), or with parameterized queries
>
> What precisely are you thinking of here? Most expressions involving
> parameters would still get constant evaluated - it'd just be inside CASE
> etc that they wouldn't anymore? Do you think it's that common to have a
> parameter reference inside an expression inside a CASE where it's
> crucial that that parameter reference gets constant evaluated? I'd think
> that's a bit of a stretch.
>
> Your earlier example of a WHEN ... THEN upper('constant') ... would
> still have the upper('constant') be evaluated, because it doesn't
> involve a parameter. And e.g. THEN upper('constant') * $1 would also
> still have the upper('constant') be evaluated, just the multiplication
> with $1 wouldn't get evaluated.
>
>
> I'm not sure what you're concerned about with the one-shot bit?
>

Now query parameters are evaluated like constants.

I can imagine a WHERE clause like WHERE col = CASE $1 WHEN true THEN
upper($2) ELSE $2 END

I remember applications that use these strange queries to support
parameterized behaviour - like case-sensitive or case-insensitive searching.
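For example, something like this (sketched with made-up names):

PREPARE search(bool, text) AS
  SELECT * FROM people
  WHERE name = CASE $1 WHEN true THEN upper($2) ELSE $2 END;
EXECUTE search(true, 'smith');   -- compare against the upper-cased value
EXECUTE search(false, 'Smith');  -- compare against the raw value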



> Greetings,
>
> Andres Freund
>


Re: Making CASE error handling less surprising

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-23 22:34:53 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I'm a bit worried about a case like:
> 
> > CREATE FUNCTION yell(int, int)
> > RETURNS int
> > IMMUTABLE
> > LANGUAGE SQL AS $$
> >SELECT CASE WHEN $1 != 0 THEN 17 / $2 ELSE NULL END
> > $$;
> 
> > EXPLAIN SELECT yell(g.i, 0) FROM generate_series(1, 10) g(i);
> 
> > I don't think the parameters here would have been handled before
> > inlining, right?
> 
> Ah, I see what you mean.  Yeah, that throws an error today, and it
> still would with the patch I was envisioning (attached), because
> inlining does Param substitution in a different way.  I'm not
> sure that we could realistically fix the inlining case with this
> sort of approach.

Thinking about it a bit, it seems we could solve that fairly easily if we
don't replace function parameters with actual Const nodes, but instead
with a PseudoConst node. If we map that to the same expression
evaluation step, that should be fairly cheap to implement, basically
adding a bunch of 'case PseudoConst:' to the Const cases.  That way we
could take the type of constness into account before constant folding.

Alternatively we could add a new field to Const, indicating the 'source'
or 'context' of the Const, which we could then take into account during
constant evaluation.


> I think this bears out the comment I made before that this approach
> still leaves us with a very complicated behavior.  Maybe we should
> stick with the previous approach, possibly supplemented with a
> leakproofness exception.

ISTM that most of the complication has to be dealt with in either
approach?

Greetings,

Andres Freund




Re: Making CASE error handling less surprising

2020-07-24 Thread Tom Lane
Robert Haas  writes:
> Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I
> believe this. Pavel's example is a good one. The leakproof exception
> helps, but it doesn't cover everything. Users I've encountered throw
> things like date_trunc() and lpad() into SQL code and expect them to
> behave (from a performance point of view) like constants, but they
> also expect 1/0 not to get evaluated too early when e.g. CASE is used.
> It's difficult to meet both sets of expectations at the same time and
> we're probably never going to have a perfect solution, but I think
> you're minimizing the concern too much here.

I've quoted this point before, but ... we can make queries arbitrarily
fast, if we don't have to give the right answer.  I think we've seen
enough complaints on this topic now to make it clear that what we're
doing today with CASE is the wrong answer.
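For concreteness, a minimal sketch of the class of complaint (the same
shape as the inlining example elsewhere in this thread; the table and
statement names are made up):

CREATE TABLE tab (x int);   -- contains no rows with x > 0
PREPARE p(int) AS
  SELECT CASE WHEN x > 0 THEN 17 / $1 ELSE NULL END FROM tab;
EXECUTE p(0);  -- a custom plan substitutes $1 = 0 and constant-folds
               -- 17 / 0, raising "division by zero" even though no row
               -- would ever reach that CASE arm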

The performance argument can be made to cut both ways, too.  If somebody's
got a very expensive function in a CASE arm that they don't expect to
reach, having it be evaluated anyway because it's got constant inputs
isn't going to make them happy.

The real bottom line is: if you don't want to do this, how else do
you want to fix the problem?  I'm no longer willing to deny that
there is a problem.

> I don't think I believe this either. I don't think an average user is
> going to expect <expr> to behave differently from (SELECT <expr>).

Agreed, that's poorly (or not at all?) documented.  But it's been
true all along, and this patch isn't changing that behavior at all.
I'm not sure if we should do anything more than improve the docs,
but in any case it seems independent of the CASE issue.

> The current behavior isn't great, but at least it handles these
> cases consistently.

Really?

regards, tom lane




Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Andres Freund
On 2020-07-24 14:05:04 -0300, Ranier Vilela wrote:
> Latest Postgres
> Windows 64 bits
> msvc 2019 64 bits
> 
> Patches applied v12-0001 to v12-0007:
> 
>  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
> 'GetOldestXmin' undefined; assuming extern returning int
> [C:\dll\postgres
> C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
> C4013: 'GetOldestXmin' undefined; assuming extern returning int
> [C:\dll\postgres\pg_visibility.
> vcxproj]
>  C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
> 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pgstattuple.vcxproj]
>   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pg_visibility.vcxproj]
>   C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
> C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
> [C:\dll\postgres\pg_visibility.vcxproj]

I don't know what that's about - there's no call to GetOldestXmin() in
pgstatapprox and pg_visibility after patch 0002? And similarly, the
PROCARRAY_* references are also removed in the same patch?

Greetings,

Andres Freund




Re: Making CASE error handling less surprising

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-24 19:03:30 +0200, Pavel Stehule wrote:
> pá 24. 7. 2020 v 18:49 odesílatel Andres Freund  napsal:
> > Wouldn't the rule that I proposed earlier, namely that sub-expressions
> > that involve only "proper" constants continue to get evaluated even
> > within CASE, largely address that?
> >
> 
> It doesn't solve a possible performance problem with one-shot queries
> (EXECUTE stmt in plpgsql), or with parameterized queries

What precisely are you thinking of here? Most expressions involving
parameters would still get constant evaluated - it'd just be inside CASE
etc that they wouldn't anymore? Do you think it's that common to have a
parameter reference inside an expression inside a CASE where it's
crucial that that parameter reference gets constant evaluated? I'd think
that's a bit of a stretch.

Your earlier example of a WHEN ... THEN upper('constant') ... would
still have the upper('constant') be evaluated, because it doesn't
involve a parameter. And e.g. THEN upper('constant') * $1 would also
still have the upper('constant') be evaluated, just the multiplication
with $1 wouldn't get evaluated.
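
To spell that out (a sketch of the assumed behavior under the proposed rule;
tbl and x are made-up names):

    PREPARE p(int) AS
    SELECT CASE WHEN x > 0
                THEN length(upper('const'))        -- constants only: still folded
                ELSE length(upper('const')) * $1   -- involves a parameter: left for run time
           END
    FROM tbl;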


I'm not sure what you're concerned about with the one-shot bit?

Greetings,

Andres Freund




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Soumyadeep Chakraborty
On Thu, Jul 23, 2020 at 10:14 PM Amul Sul  wrote:
>
> On Fri, Jul 24, 2020 at 6:28 AM Soumyadeep Chakraborty 
>  wrote:
> > In case it is necessary, the patch set does not wait for the checkpoint to
> > complete before marking the system as read-write. Refer:
> >
> > /* Set final state by clearing in-progress flag bit */
> > if (SetWALProhibitState(wal_state &
> ~(WALPROHIBIT_TRANSITION_IN_PROGRESS)))
> > {
> >   if ((wal_state & WALPROHIBIT_STATE_READ_ONLY) != 0)
> > ereport(LOG, (errmsg("system is now read only")));
> >   else
> >   {
> > /* Request checkpoint */
> > RequestCheckpoint(CHECKPOINT_IMMEDIATE);
> > ereport(LOG, (errmsg("system is now read write")));
> >   }
> > }
> >
> > We should RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT) before
> > we SetWALProhibitState() and do the ereport(), if we have a read-write
> > state change request.
> >
> +1, I too have the same question.
>
>
>
> FWIW, I don't think we can request CHECKPOINT_WAIT here; otherwise, I think
> it will be a deadlock case -- the checkpointer process waiting for itself.

We should really just call CreateCheckPoint() here instead of
RequestCheckpoint().

> > 3. Some of the functions that were added such as GetWALProhibitState(),
> > IsWALProhibited() etc could be declared static inline.
> >
> IsWALProhibited() can be static but not GetWALProhibitState(), since it
> needs to be accessible from other files.

If you place a static inline function in a header file, it will be
accessible from other files. E.g. pg_atomic_* functions.

Regards,
Soumyadeep




Re: Improving connection scalability: GetSnapshotData()

2020-07-24 Thread Ranier Vilela
Latest Postgres
Windows 64 bits
msvc 2019 64 bits

Patches applied v12-0001 to v12-0007:

 C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,28): warning C4013:
'GetOldestXmin' undefined; assuming extern returning int
[C:\dll\postgres
C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,29): warning
C4013: 'GetOldestXmin' undefined; assuming extern returning int
[C:\dll\postgres\pg_visibility.vcxproj]
 C:\dll\postgres\contrib\pgstattuple\pgstatapprox.c(74,56): error C2065:
'PROCARRAY_FLAGS_VACUUM': undeclared identifier
[C:\dll\postgres\pgstattuple.vcxproj]
  C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(569,58): error
C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
[C:\dll\postgres\pg_visibility.vcxproj]
  C:\dll\postgres\contrib\pg_visibility\pg_visibility.c(686,70): error
C2065: 'PROCARRAY_FLAGS_VACUUM': undeclared identifier
[C:\dll\postgres\pg_visibility.vcxproj]

regards,
Ranier Vilela


Re: Making CASE error handling less surprising

2020-07-24 Thread Pavel Stehule
On Fri, Jul 24, 2020 at 18:49, Andres Freund  wrote:

> Hi,
>
> On 2020-07-24 12:31:05 -0400, Robert Haas wrote:
> > On Thu, Jul 23, 2020 at 12:57 PM Tom Lane  wrote:
> > > Every so often we get a complaint like [1] about how a CASE should have
> > > prevented a run-time error and didn't, because constant-folding tried
> > > to evaluate a subexpression that would not have been entered at
> run-time.
> >
> > Yes, I've heard such complaints from other sources as well.
> >
> > > It struck me that it would not be hard to improve this situation a
> great
> > > deal.  If, within a CASE subexpression that isn't certain to be
> executed
> > > at runtime, we refuse to pre-evaluate *any* function (essentially,
> treat
> > > them all as volatile), then we should largely get the semantics that
> > > users expect.  There's some potential for query slowdown if a CASE
> > > contains a constant subexpression that we formerly reduced at plan time
> > > and now do not, but that doesn't seem to me to be a very big deal.
> >
> > Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I
> > believe this. Pavel's example is a good one. The leakproof exception
> > helps, but it doesn't cover everything. Users I've encountered throw
> > things like date_trunc() and lpad() into SQL code and expect them to
> > behave (from a performance point of view) like constants, but they
> > also expect 1/0 not to get evaluated too early when e.g. CASE is used.
> > It's difficult to meet both sets of expectations at the same time and
> > we're probably never going to have a perfect solution, but I think
> > you're minimizing the concern too much here.
>
> Wouldn't the rule that I proposed earlier, namely that sub-expressions
> that involve only "proper" constants continue to get evaluated even
> within CASE, largely address that?
>

It doesn't solve a possible performance problem with one shot (EXECUTE stmt
plpgsql) queries, or with parameterized queries
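
A sketch of the one-shot pattern meant here (t and expensive_fn are made-up
names; assume expensive_fn() is immutable). With a one-shot plan the value 5
is known at planning time, so today expensive_fn(5) can be folded once;
treating everything inside CASE as volatile would force it to run per row:

    DO $$
    DECLARE n bigint;
    BEGIN
      EXECUTE 'SELECT count(*) FROM t
               WHERE CASE WHEN $1 > 0 THEN expensive_fn($1) ELSE 0 END > 0'
        INTO n USING 5;
    END $$;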





>
> > I don't think I believe this either. I don't think an average user is
> > going to expect  to behave differently from (SELECT
> > ). This one actually bothers me more than the previous
> > one. How would we even document it? Sometimes things get inlined,
> > sometimes they don't. Sometimes subqueries get pulled up, sometimes
> > not. The current behavior isn't great, but at least it handles these
> > cases consistently. Getting the easy cases "right" while making the
> > behavior in more complex cases harder to understand is not necessarily
> > a win.
>
> Well, if we formalize the desired behaviour it's probably a lot easier
> to work towards implementing it in additional cases (like
> subselects). It doesn't seem too hard to keep track of whether a specific
> subquery can evaluate constants in a certain way, if that's what we
> need.
>
> Greetings,
>
> Andres Freund
>
>
>


Re: HOT vs freezing issue causing "cannot freeze committed xmax"

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-24 11:06:58 -0400, Robert Haas wrote:
> On Thu, Jul 23, 2020 at 2:10 PM Andres Freund  wrote:
> > In the case the HOT logic triggers, we'll call
> > heap_prepare_freeze_tuple() even when the tuple is dead.
>
> I think this is very bad. I've always been confused about these
> comments, but I couldn't quite put my finger on the problem. Now I
> think I can: the comments here imagine that we have the option either
> to set tupgone, causing the line pointer to be marked unused by an
> eventual call to lazy_vacuum_page(), or that we can decline to set
> tupgone, which will leave the tuple around to be handled by the next
> vacuum.

Yea. I think the only saving grace is that it's not obvious when the
situation can arise without prior corruption. But even if that's actually
impossible, it seems extremely fragile. I stared at heap_prune_chain()
for quite a while and couldn't convince myself either way.


> However, we don't really have a choice at all. A choice implies that
> either option is correct, and therefore we can elect the one we
> prefer. But here, it's not just that one option is incorrect, but that
> both options are incorrect. Setting tupgone controls whether or not
> the tuple is considered for freezing. If we DON'T consider freezing
> it, then we might manage to advance relfrozenxid while an older XID
> still exists in the table. If we DO consider freezing it, we will
> correctly conclude that it needs to be frozen, but then the freezing
> code is in an impossible situation, because it has no provision for
> getting rid of tuples, only for keeping them. Its logic assumes that
> whenever we are freezing xmin or xmax we do that in a way that causes
> the tuple to be visible to everyone, but this tuple should be visible
> to no one. So with your changes it now errors out instead of
> corrupting data, but that's just replacing one bad thing (data
> corruption) with another (VACUUM failures).

I suspect that the legitimate cases hitting this branch won't error out,
because then xmin/xmax aren't old enough to need to be frozen.


> I think the actual correct behavior here is to mark the line pointer
> as dead.

That's not trivial, because just doing that naively will break HOT.


> The easiest way to accomplish that is probably to have
> lazy_scan_heap() just emit an extra XLOG_HEAP2_CLEAN record beyond
> whatever HOT-pruning already did, if it's necessary. A better solution
> would probably be to merge HOT-pruning with setting things all-visible
> and have a single function that does both, but that seems a lot more
> invasive, and definitely unsuitable for back-patching.

I suspect that merging pruning and this logic in lazy_scan_heap() really
is the only proper way to solve this kind of issue.

I wonder if, given we don't know if this issue can be hit in a real
database, and given that it already triggers an error, the right way to
deal with this in the back-branches is to emit a more precise error
message. I.e., if we hit this branch and either xmin or xmax is older than
the cutoff, we issue a more specific ERROR.

Greetings,

Andres Freund




Re: Making CASE error handling less surprising

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-24 12:31:05 -0400, Robert Haas wrote:
> On Thu, Jul 23, 2020 at 12:57 PM Tom Lane  wrote:
> > Every so often we get a complaint like [1] about how a CASE should have
> > prevented a run-time error and didn't, because constant-folding tried
> > to evaluate a subexpression that would not have been entered at run-time.
> 
> Yes, I've heard such complaints from other sources as well.
> 
> > It struck me that it would not be hard to improve this situation a great
> > deal.  If, within a CASE subexpression that isn't certain to be executed
> > at runtime, we refuse to pre-evaluate *any* function (essentially, treat
> > them all as volatile), then we should largely get the semantics that
> > users expect.  There's some potential for query slowdown if a CASE
> > contains a constant subexpression that we formerly reduced at plan time
> > and now do not, but that doesn't seem to me to be a very big deal.
> 
> Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I
> believe this. Pavel's example is a good one. The leakproof exception
> helps, but it doesn't cover everything. Users I've encountered throw
> things like date_trunc() and lpad() into SQL code and expect them to
> behave (from a performance point of view) like constants, but they
> also expect 1/0 not to get evaluated too early when e.g. CASE is used.
> It's difficult to meet both sets of expectations at the same time and
> we're probably never going to have a perfect solution, but I think
> you're minimizing the concern too much here.

Wouldn't the rule that I proposed earlier, namely that sub-expressions
that involve only "proper" constants continue to get evaluated even
within CASE, largely address that?


> I don't think I believe this either. I don't think an average user is
> going to expect  to behave differently from (SELECT
> ). This one actually bothers me more than the previous
> one. How would we even document it? Sometimes things get inlined,
> sometimes they don't. Sometimes subqueries get pulled up, sometimes
> not. The current behavior isn't great, but at least it handles these
> cases consistently. Getting the easy cases "right" while making the
> behavior in more complex cases harder to understand is not necessarily
> a win.

Well, if we formalize the desired behaviour it's probably a lot easier
to work towards implementing it in additional cases (like
subselects). It doesn't seem too hard to keep track of whether a specific
subquery can evaluate constants in a certain way, if that's what we
need.

Greetings,

Andres Freund




Re: Missing CFI in hlCover()?

2020-07-24 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm looking into an issue that we're seeing on the PG archives server
> > with runaway queries that don't seem to ever want to end- and ignore
> > signals.
> 
> > This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what
> > we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is
> > lasting an awful long time without any CFI call.  It's possible the CFI
> > call should actually go elsewhere, but the complete lack of any CFI in
> > wparser_def.c or tsvector_op.c seems a bit concerning.
> 
> > I'm suspicious there's something else going on here that's causing this
> > to take a long time but I don't have any smoking gun as yet.
> 
> Hm.  I'd vote for a CFI within the recursion in TS_execute(), if there's
> not one there yet.  Maybe hlFirstIndex needs one too --- if there are
> a lot of words in the query, maybe that could be slow?  Did you pin the
> blame down any more precisely than hlCover?

I've definitely seen hlFirstIndex take a few seconds to run (while
running this under gdb and stepping through), so that could be a good
choice to place one (perhaps even additionally to this...).  I have to
admit to wondering if we shouldn't consider having one in
check_stack_depth() to try and reduce the risk of us forgetting to have
one in sensible places, though I've not really looked at all the callers
and that might not be reasonable in some cases (though I wonder if maybe
we should consider having a 'default' version that has a CFI, and an
alternate that doesn't...).

The depth of recursion for TS_execute_recurse() is actually not all that
bad either though (only 6 or so, as the query string here is:
"ERROR: The required file is not available"), so maybe that isn't really
the right thing to be thinking here.

Down in checkcondition_HL(), checkval->len is 213601, and it seems to
pretty much always end up with a result of TS_NO, but doesn't seem to
take all that long to arrive at that.

Over in hlFirstIndex():

hlFirstIndex (prs=0x7ffc2d4b16c0, prs=0x7ffc2d4b16c0, pos=219518,
    query=0x559619f81528) at ./build/../src/backend/tsearch/wparser_def.c:2013
2013    hlFirstIndex(HeadlineParsedText *prs, TSQuery query, int pos)
(gdb) n
2026        if (item->type == QI_VAL &&
(gdb)
2029            item++;
(gdb) p pos
$72 = 219518
(gdb) p prs->curwords
$73 = 583766
(gdb)

Here's a full backtrace down to the checkcondition_HL():

(gdb) i s
#0  checkcondition_HL (opaque=0x7ffc2d4b11f0, val=0x559619f815c0, data=0x0)
    at ./build/../src/backend/tsearch/wparser_def.c:1981
#1  0x559617eced2b in TS_execute_recurse (curitem=0x559619f815c0,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1872
#2  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f815a8,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#3  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81590,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#4  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81578,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#5  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81560,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#6  0x559617ecedd1 in TS_execute_recurse (curitem=0x559619f81548,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#7  0x559617ecedd1 in TS_execute_recurse (curitem=curitem@entry=0x559619f81530,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1892
#8  0x559617ed26d9 in TS_execute (curitem=curitem@entry=0x559619f81530,
    arg=arg@entry=0x7ffc2d4b11f0, flags=flags@entry=0,
    chkcond=chkcond@entry=0x559617df0120 <checkcondition_HL>)
    at ./build/../src/backend/utils/adt/tsvector_op.c:1854
#9  0x559617df041e in hlCover (prs=prs@entry=0x7ffc2d4b16c0,
    query=query@entry=0x559619f81528, p=p@entry=0x7ffc2d4b12a0,
    q=q@entry=0x7ffc2d4b12a4) at ./build/../src/backend/tsearch/wparser_def.c:2075
#10 0x559617df1a2d in mark_hl_words (max_words=35, min_words=15, shortword=3,
    highlightall=<optimized out>, query=<optimized out>, prs=0x7ffc2d4b16c0) at

Re: Mark unconditionally-safe implicit coercions as leakproof

2020-07-24 Thread Robert Haas
On Fri, Jul 24, 2020 at 12:17 PM Tom Lane  wrote:
> I went through the system's built-in implicit coercions to see
> which ones are unconditionally successful.  These could all be
> marked leakproof, as per attached patch.  This came up in the
> context of the nearby discussion about CASE, but it seems like
> an independent improvement.  If you have a function f(int8)
> that is leakproof, you don't want it to effectively become
> non-leakproof when you apply it to an int4 or int2 column.
>
> One that I didn't mark leakproof is rtrim1(), which is the
> infrastructure for char(n) to text coercion.  It looks like it
> actually does qualify right now, but the code is long enough and
> complex enough that I think such a marking would be a bit unsafe.
>
> Any objections?

IMHO, this is a nice improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Making CASE error handling less surprising

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 12:57 PM Tom Lane  wrote:
> Every so often we get a complaint like [1] about how a CASE should have
> prevented a run-time error and didn't, because constant-folding tried
> to evaluate a subexpression that would not have been entered at run-time.

Yes, I've heard such complaints from other sources as well.

> It struck me that it would not be hard to improve this situation a great
> deal.  If, within a CASE subexpression that isn't certain to be executed
> at runtime, we refuse to pre-evaluate *any* function (essentially, treat
> them all as volatile), then we should largely get the semantics that
> users expect.  There's some potential for query slowdown if a CASE
> contains a constant subexpression that we formerly reduced at plan time
> and now do not, but that doesn't seem to me to be a very big deal.

Like Pavel, and I think implicitly Dagfinn and Andres, I'm not sure I
believe this. Pavel's example is a good one. The leakproof exception
helps, but it doesn't cover everything. Users I've encountered throw
things like date_trunc() and lpad() into SQL code and expect them to
behave (from a performance point of view) like constants, but they
also expect 1/0 not to get evaluated too early when e.g. CASE is used.
It's difficult to meet both sets of expectations at the same time and
we're probably never going to have a perfect solution, but I think
you're minimizing the concern too much here.
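
Both expectations can show up in a single (made-up) query; t and kind are
hypothetical names:

    SELECT CASE WHEN kind = lpad('X', 10, ' ') THEN 1  -- expected to fold to a constant comparison
                WHEN kind = 'zero' THEN 1/0            -- expected to error only if this arm is reached
                ELSE 0
           END
    FROM t;
    -- under today's folding, planning this draws "division by zero"
    -- even if no row ever has kind = 'zero'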

> This is not a complete fix, because if you write a sub-SELECT the
> contents of the sub-SELECT are not processed by the outer query's
> eval_const_expressions pass; instead, we look at it within the
> sub-SELECT itself, and in that context there's no apparent reason
> to avoid const-folding.  So
>CASE WHEN x < 0 THEN (SELECT 1/0) END
> fails even if x is never less than zero.  I don't see any great way
> to avoid that, and I'm not particularly concerned about it anyhow;
> usually the point of a sub-SELECT like this is to be decoupled from
> outer query evaluation, so that the behavior should not be that
> surprising.

I don't think I believe this either. I don't think an average user is
going to expect  to behave differently from (SELECT
). This one actually bothers me more than the previous
one. How would we even document it? Sometimes things get inlined,
sometimes they don't. Sometimes subqueries get pulled up, sometimes
not. The current behavior isn't great, but at least it handles these
cases consistently. Getting the easy cases "right" while making the
behavior in more complex cases harder to understand is not necessarily
a win.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Missing CFI in hlCover()?

2020-07-24 Thread Tom Lane
Stephen Frost  writes:
> I'm looking into an issue that we're seeing on the PG archives server
> with runaway queries that don't seem to ever want to end- and ignore
> signals.

> This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what
> we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is
> lasting an awful long time without any CFI call.  It's possible the CFI
> call should actually go elsewhere, but the complete lack of any CFI in
> wparser_def.c or tsvector_op.c seems a bit concerning.

> I'm suspicious there's something else going on here that's causing this
> to take a long time but I don't have any smoking gun as yet.

Hm.  I'd vote for a CFI within the recursion in TS_execute(), if there's
not one there yet.  Maybe hlFirstIndex needs one too --- if there are
a lot of words in the query, maybe that could be slow?  Did you pin the
blame down any more precisely than hlCover?

regards, tom lane




Mark unconditionally-safe implicit coercions as leakproof

2020-07-24 Thread Tom Lane
I went through the system's built-in implicit coercions to see
which ones are unconditionally successful.  These could all be
marked leakproof, as per attached patch.  This came up in the
context of the nearby discussion about CASE, but it seems like
an independent improvement.  If you have a function f(int8)
that is leakproof, you don't want it to effectively become
non-leakproof when you apply it to an int4 or int2 column.
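
As a rough way to survey the territory, the implicit coercions that go
through a function, and their current leakproof markings, can be listed
from the catalogs (a read-only sketch; output varies by version):

    SELECT c.castsource::regtype AS source,
           c.casttarget::regtype AS target,
           p.proname,
           p.proleakproof
    FROM pg_cast c
    JOIN pg_proc p ON p.oid = c.castfunc   -- skips binary-coercible casts
    WHERE c.castcontext = 'i'              -- implicit casts only
    ORDER BY 1, 2;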

One that I didn't mark leakproof is rtrim1(), which is the
infrastructure for char(n) to text coercion.  It looks like it
actually does qualify right now, but the code is long enough and
complex enough that I think such a marking would be a bit unsafe.

Any objections?

regards, tom lane

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4b5af32440..3fed5725cb 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -698,10 +698,10 @@
   proname => 'dlog1', prorettype => 'float8', proargtypes => 'float8',
   prosrc => 'dlog1' },
 { oid => '235', descr => 'convert int2 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int2',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int2',
   prosrc => 'i2tod' },
 { oid => '236', descr => 'convert int2 to float4',
-  proname => 'float4', prorettype => 'float4', proargtypes => 'int2',
+  proname => 'float4', proleakproof => 't', prorettype => 'float4', proargtypes => 'int2',
   prosrc => 'i2tof' },
 { oid => '237', descr => 'convert float8 to int2',
   proname => 'int2', prorettype => 'int2', proargtypes => 'float8',
@@ -879,25 +879,25 @@
   proargtypes => 'float8 float8 float8 int4', prosrc => 'width_bucket_float8' },
 
 { oid => '311', descr => 'convert float4 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'float4',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'float4',
   prosrc => 'ftod' },
 { oid => '312', descr => 'convert float8 to float4',
   proname => 'float4', prorettype => 'float4', proargtypes => 'float8',
   prosrc => 'dtof' },
 { oid => '313', descr => 'convert int2 to int4',
-  proname => 'int4', prorettype => 'int4', proargtypes => 'int2',
+  proname => 'int4', proleakproof => 't', prorettype => 'int4', proargtypes => 'int2',
   prosrc => 'i2toi4' },
 { oid => '314', descr => 'convert int4 to int2',
   proname => 'int2', prorettype => 'int2', proargtypes => 'int4',
   prosrc => 'i4toi2' },
 { oid => '316', descr => 'convert int4 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int4',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int4',
   prosrc => 'i4tod' },
 { oid => '317', descr => 'convert float8 to int4',
   proname => 'int4', prorettype => 'int4', proargtypes => 'float8',
   prosrc => 'dtoi4' },
 { oid => '318', descr => 'convert int4 to float4',
-  proname => 'float4', prorettype => 'float4', proargtypes => 'int4',
+  proname => 'float4', proleakproof => 't', prorettype => 'float4', proargtypes => 'int4',
   prosrc => 'i4tof' },
 { oid => '319', descr => 'convert float4 to int4',
   proname => 'int4', prorettype => 'int4', proargtypes => 'float4',
@@ -1150,16 +1150,16 @@
   proname => 'text', prorettype => 'text', proargtypes => 'bpchar',
   prosrc => 'rtrim1' },
 { oid => '406', descr => 'convert name to text',
-  proname => 'text', prorettype => 'text', proargtypes => 'name',
+  proname => 'text', proleakproof => 't', prorettype => 'text', proargtypes => 'name',
   prosrc => 'name_text' },
 { oid => '407', descr => 'convert text to name',
-  proname => 'name', prorettype => 'name', proargtypes => 'text',
+  proname => 'name', proleakproof => 't', prorettype => 'name', proargtypes => 'text',
   prosrc => 'text_name' },
 { oid => '408', descr => 'convert name to char(n)',
   proname => 'bpchar', prorettype => 'bpchar', proargtypes => 'name',
   prosrc => 'name_bpchar' },
 { oid => '409', descr => 'convert char(n) to name',
-  proname => 'name', prorettype => 'name', proargtypes => 'bpchar',
+  proname => 'name', proleakproof => 't', prorettype => 'name', proargtypes => 'bpchar',
   prosrc => 'bpchar_name' },
 
 { oid => '449', descr => 'hash',
@@ -1338,10 +1338,10 @@
   proname => 'int4', prorettype => 'int4', proargtypes => 'int8',
   prosrc => 'int84' },
 { oid => '481', descr => 'convert int4 to int8',
-  proname => 'int8', prorettype => 'int8', proargtypes => 'int4',
+  proname => 'int8', proleakproof => 't', prorettype => 'int8', proargtypes => 'int4',
   prosrc => 'int48' },
 { oid => '482', descr => 'convert int8 to float8',
-  proname => 'float8', prorettype => 'float8', proargtypes => 'int8',
+  proname => 'float8', proleakproof => 't', prorettype => 'float8', proargtypes => 'int8',
   prosrc => 'i8tod' },
 { oid => '483', descr => 'convert float8 to int8',
   proname => 'int8', prorettype => 'int8', proargtypes => 'float8',
@@ -1359,7 

Re: Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-07-24 Thread Andres Freund
Hi,

On 2020-07-24 11:33:52 -0400, Dave Cramer wrote:
> For logical replication there is no need to implement this, but others are
> using the pgoutput plugin for Change Data Capture. The reason they are
> using pgoutput is because it is guaranteed to be available as it is in core
> postgres.
> 
> Implementing LogicalDecodeMessageCB provides some synchronization facility
> that is not easily replicated.

It's definitely useful. Probably needs to be a parameter that signals
whether they should be sent out?

Greetings,

Andres Freund




Missing CFI in hlCover()?

2020-07-24 Thread Stephen Frost
Greetings,

I'm looking into an issue that we're seeing on the PG archives server
with runaway queries that don't seem to ever want to end- and ignore
signals.

This is PG11, 11.8-1.pgdg100+1 specifically on Debian/buster and what
we're seeing is the loop in hlCover() (wparser_def.c:2071 to 2093) is
lasting an awful long time without any CFI call.  It's possible the CFI
call should actually go elsewhere, but the complete lack of any CFI in
wparser_def.c or tsvector_op.c seems a bit concerning.

I'm suspicious there's something else going on here that's causing this
to take a long time but I don't have any smoking gun as yet.
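
For reference, the kind of query involved looks roughly like this (table and
column names are assumptions, not the actual archives schema):

    SELECT ts_headline('english', body_text,
                       to_tsquery('english', 'required & file & available'))
    FROM messages
    WHERE to_tsvector('english', body_text)
          @@ to_tsquery('english', 'required & file & available');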

Thanks,

Stephen




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 11:18:48AM -0400, Robert Haas wrote:

> On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra
>  wrote:
>>
>>                       2MB    4MB    8MB    64MB   256MB
>> -------------------------------------------------------
>>  hash                 6.71   6.70   6.73   6.44   5.81
>>  hash CP_SMALL_TLIST  5.28   5.26   5.24   5.04   4.54
>>  sort                 3.41   3.41   3.41   3.57   3.45
>>
>> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>> much more than the 3.4GB of data written by sort (which has to spill
>> everything, while hashagg only spills rows not covered by the groups
>> that fit into work_mem).
>>
>> I initially assumed this is due to writing the hash value to the tapes,
>> and the rows are fairly narrow (only about 40B per row), so a 4B hash
>> could make a difference - but certainly not this much. Moreover, that
>> does not explain the difference between master and the now-reverted
>> CP_SMALL_TLIST, I think.
>
> This is all really good analysis, I think, but this seems like the key
> finding. It seems like we don't really understand what's actually
> getting written. Whether we use hash or sort doesn't seem like it
> should have this kind of impact on how much data gets written, and
> whether we use CP_SMALL_TLIST or project when needed doesn't seem like
> it should matter like this either.



I think for CP_SMALL_TLIST at least some of the extra data can be
attributed to writing the hash value along with the tuple, which sort
obviously does not do. With the 32GB data set (the i5 machine), there
are ~20M rows in the lineitem table, and with 4B hash values that's
about 732MB of extra data. That's about 50% of the difference
between sort and CP_SMALL_TLIST, and I'd dare to speculate the other 50%
is due to LogicalTape internals (pointers to the next block, etc.)

The question is why master has 2x the overhead of CP_SMALL_TLIST, if
it's meant to write the same set of columns etc.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Custom compression methods

2020-07-24 Thread Robert Haas
On Mon, Jun 29, 2020 at 12:31 PM Andres Freund  wrote:
> > This "custom compression methods" thread - vintage 2017 - Original
> > code by Nikita Glukhov, later work by Ildus Kurbangaliev
> > The "pluggable compression support" thread - vintage 2013 - Andres Freund
> > The "plgz performance" thread - vintage 2019 - Petr Jelinek
> >
> > Anyone want to point to a FOURTH implementation of this feature?
>
> To be clear, I don't think the 2003 patch should be considered as being
> "in the running".

I guess you mean 2013, not 2003?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Any objections to implementing LogicalDecodeMessageCB for pgoutput?

2020-07-24 Thread Dave Cramer
For logical replication there is no need to implement this, but others are
using the pgoutput plugin for Change Data Capture. The reason they are
using pgoutput is because it is guaranteed to be available as it is in core
postgres.

Implementing LogicalDecodeMessageCB provides some synchronization facility
that is not easily replicated.
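
For context, such messages are produced from SQL with
pg_logical_emit_message(); with the callback implemented, a CDC consumer of
pgoutput could observe them (the prefix and payload below are made up):

    -- emit a transactional logical message; returns the LSN of the record
    SELECT pg_logical_emit_message(true, 'my-cdc-prefix', 'resync-marker-001');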

Thoughts ?

Dave Cramer


Re: Making CASE error handling less surprising

2020-07-24 Thread Tom Lane
I wrote:
> Ah, I see what you mean.  Yeah, that throws an error today, and it
> still would with the patch I was envisioning (attached), because
> inlining does Param substitution in a different way.  I'm not
> sure that we could realistically fix the inlining case with this
> sort of approach.

Here's another example that we can't possibly fix with Param substitution
hacking, because there are no Params involved in the first place:

select f1, case when f1 = 42 then 1/i else null end
from (select f1, 0 as i from int4_tbl) ss;

Pulling up the subquery results in "1/0", so this fails today,
even though "f1 = 42" is never true.

Attached is a v3 patch that incorporates the leakproofness idea.
As shown in the new case.sql tests, this does fix both the SQL
function and subquery-pullup cases.

To keep the join regression test results the same, I marked int48()
as leakproof, which is surely safe enough.  Probably we should make
a push to mark all unconditionally-safe implicit coercions as
leakproof, but that's a separate matter.

regards, tom lane

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index e04b144072..48e31b16d6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -61,6 +61,16 @@ typedef struct
 	AggClauseCosts *costs;
 } get_agg_clause_costs_context;
 
+typedef enum
+{
+	/* Ordering is important here! */
+	ece_eval_nothing,			/* be unconditionally safe */
+	ece_eval_leakproof,			/* eval leakproof immutable functions */
+	ece_eval_immutable,			/* eval immutable functions too */
+	ece_eval_stable,			/* eval stable functions too */
+	ece_eval_volatile			/* eval volatile functions too */
+} ece_eval_mode;
+
 typedef struct
 {
 	ParamListInfo boundParams;
@@ -68,6 +78,7 @@ typedef struct
 	List	   *active_fns;
 	Node	   *case_val;
 	bool		estimate;
+	ece_eval_mode eval_mode;
 } eval_const_expressions_context;
 
 typedef struct
@@ -119,6 +130,8 @@ static Node *eval_const_expressions_mutator(Node *node,
 static bool contain_non_const_walker(Node *node, void *context);
 static bool ece_function_is_safe(Oid funcid,
  eval_const_expressions_context *context);
+static bool ece_function_form_is_safe(Form_pg_proc func_form,
+	  eval_const_expressions_context *context);
 static Node *apply_const_relabel(Node *arg, Oid rtype,
  int32 rtypmod, Oid rcollid,
  CoercionForm rformat, int rlocation);
@@ -2264,6 +2277,7 @@ eval_const_expressions(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = false;	/* safe transformations only */
+	context.eval_mode = ece_eval_immutable; /* eval immutable functions */
 	return eval_const_expressions_mutator(node, );
 }
 
@@ -2280,8 +2294,11 @@ eval_const_expressions(PlannerInfo *root, Node *node)
  *	  available by the caller of planner(), even if the Param isn't marked
  *	  constant.  This effectively means that we plan using the first supplied
  *	  value of the Param.
- * 2. Fold stable, as well as immutable, functions to constants.
+ * 2. Fold stable, as well as immutable, functions to constants.  The risk
+ *	  that the result might change from planning time to execution time is
+ *	  worth taking, as we otherwise couldn't get an estimate at all.
  * 3. Reduce PlaceHolderVar nodes to their contained expressions.
+ * 4. Ignore domain constraints, assuming that CoerceToDomain will succeed.
  *
  */
 Node *
@@ -2295,6 +2312,7 @@ estimate_expression_value(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = true;	/* unsafe transformations OK */
+	context.eval_mode = ece_eval_stable;	/* eval stable functions */
 	return eval_const_expressions_mutator(node, );
 }
 
@@ -2961,7 +2979,22 @@ eval_const_expressions_mutator(Node *node,
  * opportunity to reduce constant test conditions.  For
  * example this allows
  *		CASE 0 WHEN 0 THEN 1 ELSE 1/0 END
- * to reduce to 1 rather than drawing a divide-by-0 error.
+ * to reduce to 1 rather than drawing a divide-by-0 error,
+ * since we'll throw away the ELSE without processing it.
+ *
+ * Even in not-all-constant cases, the user might be expecting
+ * the CASE to prevent run-time errors, for example in
+ *		CASE WHEN j > 0 THEN 1 ELSE 1/0 END
+ * Since division is immutable, we'd ordinarily simplify the
+ * division and hence draw the divide-by-zero error at plan
+ * time, even if j is never zero at run time.  To avoid that,
+ * reduce eval_mode to ece_eval_leakproof while processing any
+ * test condition or result value that will not certainly be
+ * evaluated at run-time.  We expect that leakproof immutable
+ * functions will not throw any errors (except perhaps in

Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 9:22 PM Tomas Vondra
 wrote:
>                       2MB    4MB    8MB    64MB   256MB
> -------------------------------------------------------
>  hash                 6.71   6.70   6.73   6.44   5.81
>  hash CP_SMALL_TLIST  5.28   5.26   5.24   5.04   4.54
>  sort                 3.41   3.41   3.41   3.57   3.45
>
> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
> much more than the 3.4GB of data written by sort (which has to spill
> everything, while hashagg only spills rows not covered by the groups
> that fit into work_mem).
>
> I initially assumed this is due to writing the hash value to the tapes,
> and the rows are fairly narrow (only about 40B per row), so a 4B hash
> could make a difference - but certainly not this much. Moreover, that
> does not explain the difference between master and the now-reverted
> CP_SMALL_TLIST, I think.

This is all really good analysis, I think, but this seems like the key
finding. It seems like we don't really understand what's actually
getting written. Whether we use hash or sort doesn't seem like it
should have this kind of impact on how much data gets written, and
whether we use CP_SMALL_TLIST or project when needed doesn't seem like
it should matter like this either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: HOT vs freezing issue causing "cannot freeze committed xmax"

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 2:10 PM Andres Freund  wrote:
> In the case the HOT logic triggers, we'll call
> heap_prepare_freeze_tuple() even when the tuple is dead.

I think this is very bad. I've always been confused about these
comments, but I couldn't quite put my finger on the problem. Now I
think I can: the comments here imagine that we have the option either
to set tupgone, causing the line pointer to be marked unused by an
eventual call to lazy_vacuum_page(), or that we can decline to set
tupgone, which will leave the tuple around to be handled by the next
vacuum.

However, we don't really have a choice at all. A choice implies that
either option is correct, and therefore we can elect the one we
prefer. But here, it's not just that one option is incorrect, but that
both options are incorrect. Setting tupgone controls whether or not
the tuple is considered for freezing. If we DON'T consider freezing
it, then we might manage to advance relfrozenxid while an older XID
still exists in the table. If we DO consider freezing it, we will
correctly conclude that it needs to be frozen, but then the freezing
code is in an impossible situation, because it has no provision for
getting rid of tuples, only for keeping them. Its logic assumes that
whenever we are freezing xmin or xmax we do that in a way that causes
the tuple to be visible to everyone, but this tuple should be visible
to no one. So with your changes it now errors out instead of
corrupting data, but that's just replacing one bad thing (data
corruption) with another (VACUUM failures).

I think the actual correct behavior here is to mark the line pointer
as dead. The easiest way to accomplish that is probably to have
lazy_scan_heap() just emit an extra XLOG_HEAP2_CLEAN record beyond
whatever HOT-pruning already did, if it's necessary. A better solution
would probably be to merge HOT-pruning with setting things all-visible
and have a single function that does both, but that seems a lot more
invasive, and definitely unsuitable for back-patching.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improve planner cost estimations for alternative subplans

2020-07-24 Thread Alexey Bashtanov

Hi Tom,

sorry for the delay,

> I experimented with adding a number-of-evaluations parameter to
> cost_qual_eval, and found that the majority of call sites do have
> something realistic they can pass.  The attached very-much-WIP
> patch shows my results so far.  There's a lot of loose ends:

I like the idea, so if alternative subplans remain there,
I think we should implement it.

Best, Alex




Re: Improve planner cost estimations for alternative subplans

2020-07-24 Thread Alexey Bashtanov

Hi Melanie,

Sorry for the delay.


> I've just started looking at this patch today, but I was wondering if
> you might include a test case which minimally reproduces the original
> problem you had.

I could reproduce it with a simpler generated data set; please see attached.

However, to be honest with you, while searching I encountered a few
examples of the opposite behavior, when the patched version was slower
than the master branch. So I'm not so sure whether we should use the
patch; maybe we should rather consider Tom's approach.


Best, Alex


altplan-example.sql
Description: application/sql


Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-07-24 Thread Robert Haas
On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy
 wrote:
> I looked at what llvm_shutdown() actually does: it frees up JIT stacks and,
> if they exist, perf-related resources, using LLVMOrcDisposeInstance() and
> LLVMOrcUnregisterPerf(); these were dynamically allocated in
> llvm_session_initialize through a JIT library function,
> LLVMOrcCreateInstance() [1].
>
> It looks like there is no problem in moving llvm_shutdown to either 
> on_shmem_exit() or on_proc_exit().

If it doesn't involve shared memory, I guess it can be on_proc_exit()
rather than on_shmem_exit().

I guess the other question is why we're doing it at all. What
resources are being allocated that wouldn't be freed up by process
exit anyway?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 10:04 PM Andres Freund  wrote:
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.

Seems like that would need a separate discussion, apart from this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Robert Haas
On Thu, Jul 23, 2020 at 12:11 PM Soumyadeep Chakraborty
 wrote:
> In the read-only level I was suggesting, I wasn't suggesting that we
> stop WAL flushes, in fact we should flush the WAL before we mark the
> system as read-only. Once the system declares itself as read-only, it
> will not perform any more on-disk changes; It may perform all the
> flushes it needs as a part of the read-only request handling.

I think that's already how the patch works, or at least how it should
work. You stop new writes, flush any existing WAL, and then declare
the system read-only. That can all be done quickly.
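
For concreteness, the intended flow with the syntax from this patch set would
be roughly as follows (the surrounding failover steps are assumptions):

    ALTER SYSTEM READ ONLY;    -- stop new WAL writes, flush existing WAL
    -- promote the standby, repoint clients, etc.
    ALTER SYSTEM READ WRITE;   -- only if this node should accept writes again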

> What I am saying is it doesn't have to be just the queries. I think we
> can cater to all the other use cases simply by forcing a checkpoint
> before marking the system as read-only.

But that part can't, which means that if we did that, it would break
the feature for the originally intended use case. I'm not on board
with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Robert Haas
On Wed, Jul 22, 2020 at 6:03 PM Soumyadeep Chakraborty
 wrote:
> So if we are not going to address those cases, we should change the
> syntax and remove the notion of read-only. It could be:
>
> ALTER SYSTEM SET wal_writes TO off|on;
> or
> ALTER SYSTEM SET prohibit_wal TO off|on;

This doesn't really work because of the considerations mentioned in
http://postgr.es/m/ca+tgmoakctzozr0xeqalfimbcje2rgcbazf4eybpxjtnetp...@mail.gmail.com

> If we are going to try to make it truly read-only, and cater to the
> other use cases, we have to:
>
> Perform a checkpoint before declaring the system read-only (i.e. before
> the command returns). This may be expensive of course, as Andres has
> pointed out in this thread, but it is a price that has to be paid. If we
> do this checkpoint, then we can avoid an additional shutdown checkpoint
> and an end-of-recovery checkpoint (if we restart the primary after a
> crash while in read-only mode). Also, we would have to prevent any
> operation that touches control files, which I am not sure we do today in
> the current patch.

It's basically impossible to create a system for fast failover that
involves a checkpoint.  See my comments at
http://postgr.es/m/ca+tgmoye8ucgtyfgfnv3vwpztygsdksu2f4mniqhkar_ukb...@mail.gmail.com
- you can't achieve five nines or even four nines of availability if
you have to wait for a checkpoint that might take twenty minutes. I
have nothing against a feature that does what you're describing, but
this feature is designed to make fast failover easier to accomplish,
and it's not going to succeed if it involves a checkpoint.

> Why not have the best of both worlds? Consider:
>
> ALTER SYSTEM SET read_only to {off, on, wal};
>
> -- on: wal writes off + no writes to disk
> -- off: default
> -- wal: only wal writes off
>
> Of course, there can probably be better syntax for the above.

There are a few things you can imagine doing here:

1. Freeze WAL writes but allow dirty buffers to be flushed afterward.
This is the most useful thing for fast failover, I would argue,
because it's quick and the fact that some dirty buffers may not be
written doesn't matter.

2. Freeze WAL writes except a final checkpoint which will flush dirty
buffers along the way. This is like shutting the system down cleanly
and bringing it back up as a standby, except without performing a
shutdown.

3. Freeze WAL writes and write out all dirty buffers without actually
checkpointing. This is sort of a hybrid of #1 and #2. It's probably
not much faster than #2 but it avoids generating any more WAL.

4. Freeze WAL writes and just keep all the dirty buffers cached,
without writing them out. This seems like a bad idea for the reasons
mentioned in Amul's reply. The system might not be able to respond
even to read-only queries any more if shared_buffers is full of
unevictable dirty buffers.

Either #2 or #3 is sufficient to take a filesystem level snapshot of
the cluster while it's running, but I'm not sure why that's
interesting. You can already do that sort of thing by using
pg_basebackup or by running pg_start_backup() and pg_stop_backup() and
copying the directory in the middle, and you can do all of that while
the cluster is accepting writes, which seems like it will usually be
more convenient. If you do want this, you have several options, like
running a checkpoint immediately followed by ALTER SYSTEM READ ONLY
(so that the amount of WAL generated during the backup is small but
maybe not none); or shutting down the system cleanly and restarting it
as a standby; or maybe using the proposed pg_ctl demote feature
mentioned on a separate thread.
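
For reference, that existing route, sketched with the exclusive-backup API of
this era (the label is made up):

    SELECT pg_start_backup('fs_snapshot');
    -- take the filesystem-level copy of the data directory here
    SELECT pg_stop_backup();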

Contrary to what you write, I don't think either #2 or #3 is
sufficient to enable checksums, at least not without some more
engineering, because the server would cache the state from the control
file, and a bunch of blocks from the database. I guess it would work
if you did a server restart afterward, but I think there are better
ways of supporting online checksum enabling that don't require
shutting down the server, or even making it read-only; and there's
been significant work done on those already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Making CASE error handling less surprising

2020-07-24 Thread Chris Travers
On Fri, Jul 24, 2020 at 4:35 AM Tom Lane  wrote:

> Andres Freund  writes:
> > I'm a bit worried about a case like:
>
> > CREATE FUNCTION yell(int, int)
> > RETURNS int
> > IMMUTABLE
> > LANGUAGE SQL AS $$
> >SELECT CASE WHEN $1 != 0 THEN 17 / $2 ELSE NULL END
> > $$;
>
> > EXPLAIN SELECT yell(g.i, 0) FROM generate_series(1, 10) g(i);
>
> > I don't think the parameters here would have been handled before
> > inlining, right?
>
> Ah, I see what you mean.  Yeah, that throws an error today, and it
> still would with the patch I was envisioning (attached), because
> inlining does Param substitution in a different way.  I'm not
> sure that we could realistically fix the inlining case with this
> sort of approach.
>
> I think this bears out the comment I made before that this approach
> still leaves us with a very complicated behavior.  Maybe we should
> stick with the previous approach, possibly supplemented with a
> leakproofness exception.
>


I am actually not so sure this is a good idea. Here are two doubts I have.

1.  The problem of when a given SQL expression is evaluated crops up in a
wide variety of different contexts and, worst case, causes far more damage
than queries which always error.  Removing the lower hanging fruit while
leaving cases like:

select lock_foo(id), * from foo where somefield > 100; -- which rows does
lock_foo(id) run on?  Does it matter?

is going to legitimize these complaints in a way that will be very hard to
satisfy unless we also want to eventually be able to specify when volatile
functions may be run. The two cases don't look the same, but they are
manifestations of the same problem: when you execute a SQL query, you have
no control over when expressions are actually run.

2.  The refusal to fold immutable functions within CASE statements here means
we have to do more tricks to get around the planner if we hit pathological
performance cases.  I am not convinced this is a net win.

If we go this route, would it be too much to ask to allow a GUC variable to
preserve the old behavior?


> regards, tom lane
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-24 Thread Tom Lane
Robert Haas  writes:
> Well, I think the comments could be more clear - for the insert case
> specifically - about which cases you think are and are not safe.

Yeah, the proposed comment changes don't actually add much.  Also
please try to avoid inserting non-ASCII  into the source code;
at least in my mail reader, that attachment has some.

regards, tom lane




Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-24 Thread Robert Haas
On Fri, Jul 24, 2020 at 7:59 AM Amit Kapila  wrote:
> On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila  wrote:
> > Do you have something else in mind?
>
> I am planning to commit the comments change patch attached in the
> above email [1] next week sometime (probably Monday or Tuesday) unless
> you have something more to add?

Well, I think the comments could be more clear - for the insert case
specifically - about which cases you think are and are not safe.
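
For readers following the thread, the behavior in question can be seen with
something like this (table names are made up):

    EXPLAIN (COSTS OFF)
    INSERT INTO target_tbl
    SELECT * FROM source_tbl WHERE a > 100;
    -- no Gather node appears: a query that writes table data is currently
    -- planned without parallelism, even for its SELECT part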

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: renaming configure.in to configure.ac

2020-07-24 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-07-17 10:46, Peter Eisentraut wrote:
>> v1-0001-Rename-configure.in-to-configure.ac.patch

> I have committed that, and I have sent feedback to the Autoconf 
> developers about our concerns about the slowness of some of the new tests.

Sounds good.  Do we want to try Noah's idea of temporarily committing
the remaining changes, to see if the buildfarm is happy?

regards, tom lane




Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Fri, Jul 24, 2020 at 10:40:47AM +0200, Tomas Vondra wrote:

> On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:
>> On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
>>  wrote:
>>> So let me share some fresh I/O statistics collected on the current code
>>> using iosnoop. I've done the tests on two different machines using the
>>> "aggregate part" of TPC-H Q17, i.e. essentially this:
>>>
>>>     SELECT * FROM (
>>>         SELECT
>>>             l_partkey AS agg_partkey,
>>>             0.2 * avg(l_quantity) AS avg_quantity
>>>         FROM lineitem GROUP BY l_partkey OFFSET 10
>>>     ) part_agg;
>>>
>>> The OFFSET is there just to ensure we don't need to send anything to
>>> the client, etc.
>>
>> Thanks for testing this.
>>
>>> So sort writes ~3.4GB of data, give or take. But hashagg/master writes
>>> almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
>>> original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
>>> much more than the 3.4GB of data written by sort (which has to spill
>>> everything, while hashagg only spills rows not covered by the groups
>>> that fit into work_mem).
>>
>> What I find when I run your query (with my own TPC-H DB that is
>> smaller than what you used here -- 59,986,052 lineitem tuples) is that
>> the sort required about 7x more memory than the hash agg to do
>> everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
>> hash agg memory usage. I'd be surprised if the ratio was very
>> different for you -- but can you check?
>
> I can check, but it's not quite clear to me what we are looking for?
> Increase work_mem until there's no need to spill in either case?



FWIW the hashagg needs about 4775953kB and the sort 33677586kB. So yeah,
that's about 7x more. I think that's probably built into the TPC-H data
set. It'd be easy to construct cases with much higher/lower factors.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=15493271.13..15493271.13 rows=1 width=36) (actual time=331351.099..331351.099 rows=0 loops=1)
   ->  HashAggregate  (cost=15186647.64..15493271.13 rows=20441566 width=36) (actual time=318190.465..330956.383 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         Peak Memory Usage: 4775953kB
         ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.156..56051.850 rows=450019701 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 331931.239 ms
(7 rows)


                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=81298097.02..81298097.02 rows=1 width=36) (actual time=415344.639..415344.639 rows=0 loops=1)
   ->  GroupAggregate  (cost=77616337.21..81298097.02 rows=20441566 width=36) (actual time=209292.469..414951.954 rows=1500 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=77616337.21..78741382.65 rows=450018176 width=9) (actual time=209292.435..329583.999 rows=450019701 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: quicksort  Memory: 33677586kB
               ->  Seq Scan on lineitem  (cost=0.00..12936556.76 rows=450018176 width=9) (actual time=0.096..72474.733 rows=450019701 loops=1)
 Planning Time: 0.145 ms
 Execution Time: 417157.598 ms
(9 rows)


Re: [Patch] ALTER SYSTEM READ ONLY

2020-07-24 Thread Amul Sul
On Fri, Jul 24, 2020 at 7:34 AM Andres Freund  wrote:
>
> Hi,

Thanks for looking at the patch.

>
> > From f0188a48723b1ae7372bcc6a344ed7868fdc40fb Mon Sep 17 00:00:00 2001
> > From: Amul Sul 
> > Date: Fri, 27 Mar 2020 05:05:38 -0400
> > Subject: [PATCH v3 2/6] Add alter system read only/write syntax
> >
> > Note that syntax doesn't have any implementation.
> > ---
> >  src/backend/nodes/copyfuncs.c| 12 
> >  src/backend/nodes/equalfuncs.c   |  9 +
> >  src/backend/parser/gram.y| 13 +
> >  src/backend/tcop/utility.c   | 20 
> >  src/bin/psql/tab-complete.c  |  6 --
> >  src/include/nodes/nodes.h|  1 +
> >  src/include/nodes/parsenodes.h   | 10 ++
> >  src/tools/pgindent/typedefs.list |  1 +
> >  8 files changed, 70 insertions(+), 2 deletions(-)
>
> Shouldn't there be at outfuncs support as well? Perhaps we even need
> readfuncs, not immediately sure.

Ok, can add that as well.

>
>
>
> > From 2c5db7db70d4cebebf574fbc47db7fbf7c440be1 Mon Sep 17 00:00:00 2001
> > From: Amul Sul 
> > Date: Fri, 19 Jun 2020 06:29:36 -0400
> > Subject: [PATCH v3 3/6] Implement ALTER SYSTEM READ ONLY using global 
> > barrier.
> >
> > Implementation:
> >
> >  1. When a user tried to change server state to WAL-Prohibited using
> > ALTER SYSTEM READ ONLY command; AlterSystemSetWALProhibitState() will 
> > emit
> > PROCSIGNAL_BARRIER_WAL_PROHIBIT_STATE_CHANGE barrier and will wait 
> > until the
> > barrier has been absorbed by all the backends.
> >
> >  2. When a backend receives the WAL-Prohibited barrier, at that moment if
> > it is already in a transaction and the transaction already assigned XID,
> > then the backend will be killed by throwing FATAL(XXX: need more 
> > discussion
> > on this)
>
> I think we should consider introducing XACTFATAL or such, guaranteeing
> the transaction gets aborted, without requiring a FATAL. This has been
> needed for enough cases that it's worthwhile.
>

As far as I am aware, the existing code in PostgresMain() uses FATAL to
terminate the connection when protocol synchronization is lost.  Currently,
another proposal, "Terminate the idle sessions"[1], also uses FATAL, afaik.

>
> There are several cases where we WAL log without having an xid
> assigned. E.g. when HOT pruning during syscache lookups or such. Are
> there any cases where the check for being in recovery is followed by a
> CHECK_FOR_INTERRUPTS, before the WAL logging is done?
>

In the case of an operation without an xid, an error will be raised just
before the point where the WAL record is expected.  The places you are asking
about, I haven't found at a glance; I will try to search for them, but I am
sure the current implementation is not missing the places where it is supposed
to check the prohibited state and complain.

Quick question: is it possible that pruning will happen with a SELECT query?
It would be helpful if you or someone else could point me to the places where
WAL can be generated even in the case of read-only queries.
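
For reference, a plain SELECT can indeed generate WAL through opportunistic
HOT pruning in the heap access method.  A minimal sketch to observe this,
assuming a default-fillfactor table and no concurrent sessions (the table and
values are illustrative):

create table prune_demo (a int, b int);
insert into prune_demo select g, g from generate_series(1, 10000) g;
update prune_demo set b = b + 1;  -- leaves dead heap-only tuple versions
select pg_current_wal_lsn();      -- note the WAL position
select count(*) from prune_demo;  -- read-only, but may prune the now-full pages
select pg_current_wal_lsn();      -- the position can advance: pruning wrote WAL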

>
>
> >  3. Otherwise, if that backend is running a transaction which has not yet
> > been assigned an XID, we don't need to do anything special; simply call
> > ResetLocalXLogInsertAllowed() so that any future WAL insert will check
> > XLogInsertAllowed() first, which sets the read-only state appropriately.
> >
> >  4. A new transaction (from existing or new backend) starts as a read-only
> > transaction.
>
> Why do we need 4)? And doesn't that have the potential to be
> unnecessarily problematic if the server is subsequently brought out of
> the readonly state again?

The transaction that was started in the read-only system state will be read-only
until the end.  I think that shouldn't be too problematic.
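
To illustrate point 4 with a hedged sketch (the syntax comes from the proposed
patch, not from core, and the exact error text is an assumption):

ALTER SYSTEM READ ONLY;
BEGIN;
INSERT INTO t VALUES (1);
-- ERROR:  cannot execute INSERT in a read-only transaction
ROLLBACK;
ALTER SYSTEM READ WRITE;  -- transactions started after this can write again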

>
>
> >  5. Auxiliary processes like the autovacuum launcher, background writer,
> > checkpointer and walwriter won't do anything in the WAL-Prohibited
> > server state until someone wakes us up. E.g. a backend might later on
> > request us to put the system back to read-write.
>
> Hm. It's not at all clear to me why bgwriter and walwriter shouldn't do
> anything in this state. bgwriter for example is even running entirely
> normally in a hot standby node?

I think I missed updating the description when I reverted the
walwriter changes.  The current version doesn't have any changes to
the walwriter.  And bgwriter too behaves the same as it does on a
recovery system.  Will update this, sorry for the confusion.

>
>
> >  6. At shutdown in WAL-Prohibited mode, we'll skip the shutdown checkpoint
> > and xlog rotation. Starting up again will perform crash recovery (XXX:
> > need some discussion on this as well)
> >
> >  7. ALTER SYSTEM READ ONLY/WRITE is restricted on standby server.
> >
> >  8. Only super user can toggle WAL-Prohibit state.
> >
> >  9. Add a system_is_read_only GUC to show the system state -- will be true
> > when the system is wal prohibited or in 

Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-24 Thread Amit Kapila
On Fri, Jul 17, 2020 at 11:24 AM Amit Kapila  wrote:
>
> Do you have something else in mind?
>

I am planning to commit the comments change patch attached in the
above email [1] next week sometime (probably Monday or Tuesday) unless
you have something more to add?


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BRL7c_s%3D%2BTwAE6DJ1MmupbEiGCFLt97US%2BDMm6UxAjTA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-07-24 Thread Bharath Rupireddy
On Tue, Jul 21, 2020 at 1:17 AM Robert Haas  wrote:
>
> On Tue, Jul 7, 2020 at 12:55 PM Andres Freund  wrote:
> > What are you proposing? For now we could easily enough work around this
> > by just making it a on_proc_exit() callback, but that doesn't really
> > change the fundamental issue imo.
>
> I think it would be more correct for it to be an on_proc_exit()
> callback, because before_shmem_exit() callbacks can and do perform
> actions which rely on an awful lot of the system being still in a
> working state. RemoveTempRelationsCallback() is a good example: it
> thinks it can start and end transactions and make a bunch of catalog
> changes. I don't know that any of that could use JIT, but moving the
> JIT cleanup to the on_shmem_exit() stage seems better. At that point,
> there shouldn't be anybody doing anything that relies on being able to
> perform logical changes to the database; we're just shutting down
> low-level subsystems at that point, and thus presumably not doing
> anything that could possibly need JIT.
>

I looked at what llvm_shutdown() actually does: it frees up JIT stacks,
and also perf-related resources if they exist, using LLVMOrcDisposeInstance()
and LLVMOrcUnregisterPerf(); these were dynamically allocated in
llvm_session_initialize() through the JIT library function
LLVMOrcCreateInstance() [1].

It looks like there is no problem in moving llvm_shutdown to either
on_shmem_exit() or on_proc_exit().

[1] - https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

>
> But I also agree that what pg_start_backup() was doing before v13 was
> wrong; that's why I committed
> 303640199d0436c5e7acdf50b837a027b5726594. The only reason I didn't
> back-patch it is because the consequences are so minor I didn't think
> it was worth worrying about. We could, though. I'd be somewhat
> inclined to both do that and also change LLVM to use on_proc_exit() in
> master, but I don't feel super-strongly about it.
>

Patch: v1-0001-Move-llvm_shutdown-to-on_proc_exit-list-from-befo.patch
This moves llvm_shutdown to the on_proc_exit() callback list.  Request to
consider this change for master and, if possible, for versions <= 13.  Basic
JIT use cases and regression tests work fine with the patch.

Patches: PG11-0001-Fix-minor-problems-with-non-exclusive-backup-clea.patch
and PG12-0001-Fix-minor-problems-with-non-exclusive-backup-cleanup.patch
Request to consider back-patching commit
303640199d0436c5e7acdf50b837a027b5726594 (the above two patches are for this
commit) to versions < 13, to fix the abort issue.  Please note that the two
patches have no difference in the code; I just made one applicable to PG11.

Patch: v1-0001-Modify-cancel_before_shmem_exit-comments.patch
This patch modifies the cancel_before_shmem_exit() function comment to reflect
the safe usage of the before_shmem_exit_list callback mechanism, and also
removes the point "For simplicity, only the latest entry can be
removed*", as it implies that there is still scope for improvement in the
cancel_before_shmem_exit() search mechanism.
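
As a hedged illustration of the ordering hazard these patches deal with (the
callback names here are made up; before_shmem_exit_list behaves like a stack):

before_shmem_exit(nonexclusive_backup_cleanup, (Datum) 0);
before_shmem_exit(some_other_callback, (Datum) 0);   /* e.g. JIT cleanup */

/*
 * This intends to drop nonexclusive_backup_cleanup, but that entry is no
 * longer the latest one registered, so nothing is removed and the callback
 * still runs at backend exit.
 */
cancel_before_shmem_exit(nonexclusive_backup_cleanup, (Datum) 0);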

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 5fae4b3a046789fb7b54e689c979b01cb322aaf9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 23 Jul 2020 17:56:21 +0530
Subject: [PATCH v1] Move llvm_shutdown to on_proc_exit list from
 before_shmem_exit.

llvm_shutdown frees up dynamically allocated memory for jit
compilers in a session/backend. Having this as on_proc_exit doesn't
cause any harm.
---
 src/backend/jit/llvm/llvmjit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index af8b34aaaf..43bed78a52 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -683,7 +683,7 @@ llvm_session_initialize(void)
 	}
 #endif
 
-	before_shmem_exit(llvm_shutdown, 0);
+	on_proc_exit(llvm_shutdown, 0);
 
 	llvm_session_initialized = true;
 
-- 
2.25.1

From 3b0a048afd7ad6d8564a54d13397c72f6eadc5da Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 24 Jul 2020 08:55:52 +0530
Subject: [PATCH v5] Fix minor problems with non-exclusive backup cleanup.

The previous coding imagined that it could call before_shmem_exit()
when a non-exclusive backup began and then remove the previously-added
handler by calling cancel_before_shmem_exit() when that backup
ended. However, this only works provided that nothing else in the
system has registered a before_shmem_exit() hook in the interim,
because cancel_before_shmem_exit() is documented to remove a callback
only if it is the latest callback registered. It also only works
if nothing can ERROR out between the time that sessionBackupState
is reset and the time that cancel_before_shmem_exit() is called, which
doesn't seem to be strictly true.

To fix, leave the handler installed for the lifetime of the session,
arrange to install it just once, and teach it to quietly do nothing if
there isn't a non-exclusive backup in process.

This is a bug, but for now I'm not going to 

Re: renaming configure.in to configure.ac

2020-07-24 Thread Peter Eisentraut

On 2020-07-17 10:46, Peter Eisentraut wrote:

v1-0001-Rename-configure.in-to-configure.ac.patch


I have committed that, and I have sent feedback to the Autoconf
developers regarding our concerns about the slowness of some of the new tests.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-24 Thread Ashutosh Sharma
Hi All,

Attached is the patch that adds heap_force_kill(regclass, tid[]) and
heap_force_freeze(regclass, tid[]) functions which Robert mentioned in the
first email in this thread. The patch basically adds an extension named
pg_surgery that contains these functions.  Please have a look and let me
know your feedback. Thank you.
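
For reviewers who want to try it quickly, a hedged usage sketch based on the
signatures above (the table name and tids are illustrative):

CREATE EXTENSION pg_surgery;
-- forcibly mark the given tuples as dead
SELECT heap_force_kill('damaged_tbl'::regclass, ARRAY['(0,1)']::tid[]);
-- forcibly freeze the given tuples in place
SELECT heap_force_freeze('damaged_tbl'::regclass, ARRAY['(0,2)']::tid[]);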

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com


On Thu, Jul 16, 2020 at 9:44 PM Robert Haas  wrote:

> On Thu, Jul 16, 2020 at 10:00 AM Robert Haas 
> wrote:
> > I see your point, though: the tuple has to be able to survive
> > HOT-pruning in order to cause a problem when we check whether it needs
> > freezing.
>
> Here's an example where the new sanity checks fail on an invisible
> tuple without any concurrent transactions:
>
> $ initdb
> $ pg_ctl start -l ~/logfile
> $ createdb
> $ psql
>
> create table simpsons (a int, b text);
> vacuum freeze;
>
> $ cat > txid.sql
> select txid_current();
> $ pgbench -t 131072 -c 8 -j 8 -n -f txid.sql
> $ psql
>
> insert into simpsons values (1, 'homer');
>
> $ pg_ctl stop
> $ pg_resetwal -x 1000 $PGDATA
> $ pg_ctl start -l ~/logfile
> $ psql
>
> update pg_class set relfrozenxid = (relfrozenxid::text::integer +
> 200)::text::xid where relname = 'simpsons';
>
> rhaas=# select * from simpsons;
>  a | b
> ---+---
> (0 rows)
>
> rhaas=# vacuum simpsons;
> ERROR:  found xmin 1049082 from before relfrozenxid 2000506
> CONTEXT:  while scanning block 0 of relation "public.simpsons"
>
> This is a fairly insane situation, because we should have relfrozenxid
> < tuple xid < xid counter, but instead we have xid counter < tuple xid
> < relfrozenxid, but it demonstrates that it's possible to have a
> database which is sufficiently corrupt that you can't escape from the
> new sanity checks using only INSERT, UPDATE, and DELETE.
>
> Now, an even easier way to create a table with a tuple that prevents
> vacuuming and also can't just be deleted is to simply remove a
> required pg_clog file (and maybe restart the server to clear out any
> cached data in the SLRUs). What we typically do with customers who
> need to recover from that situation today is give them a script to
> fabricate a bogus CLOG file that shows all transactions as committed
> (or, perhaps, aborted). But I think that the tools proposed on this
> thread might be a better approach in certain cases. If the problem is
> that a pg_clog file vanished, then recreating it with whatever content
> you think is closest to what was probably there before is likely the
> best you can do. But if you've got some individual tuples with crazy
> xmin values, you don't really want to drop matching files in pg_clog;
> it's better to fix the tuples.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>
>
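
Regarding the CLOG-fabrication workaround described above, a hedged sketch of
what such a script can boil down to: each clog byte packs four 2-bit
transaction statuses, so 0x55 (octal \125) marks all four as committed.  The
256kB segment size and the pg_xact path (called pg_clog before v10) are
assumptions based on standard build settings.

dd if=/dev/zero bs=256k count=1 | tr '\0' '\125' > $PGDATA/pg_xact/0000
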
From 75a6ba0e86b6d199201477c5ebd60258ef16f181 Mon Sep 17 00:00:00 2001
From: ashu 
Date: Fri, 24 Jul 2020 11:26:45 +0530
Subject: [PATCH] Add contrib/pg_surgery to perform surgery on the damaged heap
 tables.

This contrib module basically adds a couple of functions named
heap_force_kill and heap_force_freeze that can be used in the scripts
to perform surgery on the damaged heap tables.

Ashutosh Sharma.
---
 contrib/Makefile   |   1 +
 contrib/pg_surgery/Makefile|  23 ++
 contrib/pg_surgery/expected/pg_surgery.out | 109 +
 contrib/pg_surgery/heap_surgery_funcs.c| 356 +
 contrib/pg_surgery/pg_surgery--1.0.sql |  14 ++
 contrib/pg_surgery/pg_surgery.control  |   5 +
 contrib/pg_surgery/sql/pg_surgery.sql  |  72 ++
 7 files changed, 580 insertions(+)
 create mode 100644 contrib/pg_surgery/Makefile
 create mode 100644 contrib/pg_surgery/expected/pg_surgery.out
 create mode 100644 contrib/pg_surgery/heap_surgery_funcs.c
 create mode 100644 contrib/pg_surgery/pg_surgery--1.0.sql
 create mode 100644 contrib/pg_surgery/pg_surgery.control
 create mode 100644 contrib/pg_surgery/sql/pg_surgery.sql

diff --git a/contrib/Makefile b/contrib/Makefile
index 1846d41..07d5734 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -35,6 +35,7 @@ SUBDIRS = \
 		pg_standby	\
 		pg_stat_statements \
 		pg_trgm		\
+		pg_surgery	\
 		pgcrypto	\
 		pgrowlocks	\
 		pgstattuple	\
diff --git a/contrib/pg_surgery/Makefile b/contrib/pg_surgery/Makefile
new file mode 100644
index 000..baf2a88
--- /dev/null
+++ b/contrib/pg_surgery/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_surgery/Makefile
+
+MODULE_big = pg_surgery
+OBJS = \
+	$(WIN32RES) \
+	heap_surgery_funcs.o
+
+EXTENSION = pg_surgery
+DATA = pg_surgery--1.0.sql
+PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
+
+REGRESS = pg_surgery
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_surgery
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git 

Re: Default setting for enable_hashagg_disk

2020-07-24 Thread Tomas Vondra

On Thu, Jul 23, 2020 at 07:33:45PM -0700, Peter Geoghegan wrote:

On Thu, Jul 23, 2020 at 6:22 PM Tomas Vondra
 wrote:

So let me share some fresh I/O statistics collected on the current code
using iosnoop. I've done the tests on two different machines using the
"aggregate part" of TPC-H Q17, i.e. essentially this:

  SELECT * FROM (
      SELECT
          l_partkey AS agg_partkey,
          0.2 * avg(l_quantity) AS avg_quantity
      FROM lineitem GROUP BY l_partkey OFFSET 10
  ) part_agg;

The OFFSET is there just to ensure we don't need to send anything to
the client, etc.


Thanks for testing this.


So sort writes ~3.4GB of data, give or take. But hashagg/master writes
almost 6-7GB of data, i.e. almost twice as much. Meanwhile, with the
original CP_SMALL_TLIST we'd write "only" ~5GB of data. That's still
much more than the 3.4GB of data written by sort (which has to spill
everything, while hashagg only spills rows not covered by the groups
that fit into work_mem).


What I find when I run your query (with my own TPC-H DB that is
smaller than what you used here -- 59,986,052 lineitem tuples) is that
the sort required about 7x more memory than the hash agg to do
everything in memory: 4,384,711KB for the quicksort vs 630,801KB peak
hash agg memory usage. I'd be surprised if the ratio was very
different for you -- but can you check?



I can check, but it's not quite clear to me what we are looking for.
Increase work_mem until there's no need to spill in either case?


I think that there is something pathological about this spill
behavior, because it sounds like the precise opposite of what you
might expect when you make a rough extrapolation of what disk I/O will
be based on the memory used in no-spill cases (as reported by EXPLAIN
ANALYZE).



Maybe; I'm not sure what exactly you think is pathological.  The trouble is
hashagg has to spill input tuples while the memory used in the no-spill case
represents aggregated groups, so I'm not sure how you could extrapolate
from that ...

FWIW one more suspicious thing that I forgot to mention is the behavior
of the "planned partitions" depending on work_mem, which looks like
this:

    2MB    Planned Partitions:  64    HashAgg Batches:  4160
    4MB    Planned Partitions: 128    HashAgg Batches: 16512
    8MB    Planned Partitions: 256    HashAgg Batches: 21488
   64MB    Planned Partitions:  32    HashAgg Batches:  2720
  256MB    Planned Partitions:   8    HashAgg Batches:      8

I'd expect the number of planned partitions to decrease (slowly) as
work_mem increases, but it seems to increase initially. Seems a bit
strange, but maybe it's expected.


What I find really surprising is the costing: despite writing about
twice as much data, the hashagg cost is estimated to be much lower than
the sort.  For example on the i5 machine, the hashagg cost is ~10M while
the sort cost is almost 42M, even though hashagg uses almost twice as
much disk.  And the costing is exactly the same for master and the
CP_SMALL_TLIST.


That does make it sound like the costs of the hash agg aren't being
represented. I suppose it isn't clear if this is a costing issue
because it isn't clear if the execution time performance itself is
pathological or is instead something that must be accepted as the cost
of spilling the hash agg in a general kind of way.



Not sure, but I think we need to spill roughly as much data as the sort, so
it seems a bit strange that we're spilling 2x as much data and yet the cost
is so much lower.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Implement UNLOGGED clause for COPY FROM

2020-07-24 Thread Masahiko Sawada
On Fri, 17 Jul 2020 at 13:23, osumi.takami...@fujitsu.com
 wrote:
>
> Hi,
>
> > AFAICS, we can already accomplish basically the same thing as what you want 
> > to
> > do like this:
> >
> > alter table foo set unlogged;
> > copy foo from ...;
> > alter table foo set logged;
> This didn't satisfy what I wanted.
> In case 'foo' has a huge number of rows at the beginning,
> this example would spend much time copying
> the contents of 'foo' twice to swap relfilenodes atomically.
> When the data loaded by COPY is big too, the execution time becomes much
> longer.
>
> > You keep on ignoring the indexes... not to mention replication.
> Sorry for having made you think like this.
>
> When a server crash occurs during data loading of COPY UNLOGGED,
> it's a must to keep the indexes consistent, of course.
> I'm thinking that rebuilding the indexes on the target table would work.
>
> In my opinion, the UNLOGGED clause must be designed to guarantee that
> the data loaded by this clause is written starting from the end of all
> other data blocks.
> Plus, those blocks need to be protected from writes by other transactions
> during the copy.
> Apart from that, the server must be aware of which block is the first block,
> or of the range where the load started and ended, in preparation for a crash.
>
> During crash recovery, those points are helpful to recognize and detach
> such blocks in order to resolve a situation where the loaded data is
> partially synced to the disk and the rest isn't.

How do online backup and archive recovery work?

Suppose that the user executes pg_basebackup while COPY UNLOGGED is
running; the physical backup might have the portion of tuples loaded
by COPY UNLOGGED, but these data are not recovered. It might not be a
problem because the operation is performed without WAL records. But
what if an insertion happens after COPY UNLOGGED but before
pg_stop_backup()? I think that a new tuple could be inserted at the
end of the table, following the data loaded by COPY UNLOGGED. With
your approach described above, the newly inserted tuple will be
recovered during archive recovery, but it either will be removed if we
replay the insertion WAL and then truncate the table, or won’t be inserted
due to a missing block if we truncate the table and then replay the
insertion WAL; either way we lose the tuple even though the user was told
the insertion succeeded.
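
A hedged sketch of that sequence (COPY ... UNLOGGED is the proposed syntax,
not something that exists today, and the names are illustrative):

SELECT pg_start_backup('base');        -- or pg_basebackup begins
COPY foo FROM '/tmp/foo.csv' UNLOGGED; -- loads blocks without writing WAL
INSERT INTO foo VALUES (1);            -- WAL-logged, lands after those blocks
SELECT pg_stop_backup();
-- During archive recovery, either the table is truncated after the INSERT's
-- WAL is replayed (removing the row), or the INSERT's WAL targets a block
-- that was never recovered; in both cases the committed row can be lost.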

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: deferred primary key and logical replication

2020-07-24 Thread Ajin Cherian
The patch no longer applies, because of additions in the test source. 
Otherwise, I have tested the patch and confirmed that updates and deletes on 
tables with deferred primary keys work with logical replication.

The new status of this patch is: Waiting on Author


Re: logical replication empty transactions

2020-07-24 Thread Ajin Cherian
Sorry, I replied in the wrong thread. Please ignore the above mail.

>
>


handle a ECPG_bytea typo

2020-07-24 Thread Wang, Shenhao
Hi, hackers

The source looks like:

case ECPGt_bytea:
{
struct ECPGgeneric_varchar *variable =
(struct ECPGgeneric_varchar *) (var->value);

..
}

I think the developer intended to use struct ECPGgeneric_bytea instead of
struct ECPGgeneric_varchar.

Is this understanding right?

I have written a patch to fix this typo.
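
Presumably the fix simply swaps the struct, along these lines (a sketch of the
intended change; the two structs appear to share the same layout, which would
explain why the typo was harmless in practice):

case ECPGt_bytea:
    {
        struct ECPGgeneric_bytea *variable =
            (struct ECPGgeneric_bytea *) (var->value);

        ..
    }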





0001-Fix-ECPGt_bytea-typo.patch
Description: 0001-Fix-ECPGt_bytea-typo.patch


Re: logical replication empty transactions

2020-07-24 Thread Ajin Cherian
The patch no longer applies, because of additions in the test source. 
Otherwise, I have tested the patch and confirmed that updates and deletes on 
tables with deferred primary keys work with logical replication.

The new status of this patch is: Waiting on Author


Re: Parallel Seq Scan vs kernel read ahead

2020-07-24 Thread David Rowley
Hi Soumyadeep,

Thanks for re-running the tests.

On Thu, 23 Jul 2020 at 06:01, Soumyadeep Chakraborty
 wrote:
> On Tue, Jul 14, 2020 at 8:52 PM David Rowley  wrote:
> > It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
> > track_io_timing = on; for each value of max_parallel_workers.
>
> As for running EXPLAIN ANALYZE, running that on this system incurs a
> non-trivial amount of overhead. The overhead is simply staggering.

I didn't really intend for that to be used to get an accurate overall
timing for the query.  It was more to get an indication of whether the
reads are actually hitting the disk or not.

I mentioned to Kirk in [1] that his read speed might be a bit higher
than what his disk can actually cope with.  I'm not too sure about the
HDD he mentions, but if it's a single HDD then reading at an average
speed of 3457 MB/sec seems quite a bit too fast.  It seems more likely,
in his case, that those reads are mostly coming from the kernel's
cache.
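
For example, a hedged sketch of that check (the table name is illustrative):

SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM big_table;
-- A large "Buffers: shared read=..." count combined with an
-- "I/O Timings: read=..." value close to zero suggests the reads were served
-- from the kernel page cache rather than from the physical disk.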

David

[1] 
https://www.postgresql.org/message-id/CAApHDvoDzAzXEp+Ay2CfT3U=zcd5nld7k9_y936bnhjzs5j...@mail.gmail.com




Re: Parallel worker hangs while handling errors.

2020-07-24 Thread Bharath Rupireddy
On Thu, Jul 23, 2020 at 12:54 PM vignesh C  wrote:
>
> Thanks for reviewing and adding your thoughts, My comments are inline.
>
> On Fri, Jul 17, 2020 at 1:21 PM Bharath Rupireddy
>  wrote:
> >
> > The same hang issue can occur (though I'm not able to back it up with a
> > use case) in the cases wherever EmitErrorReport() is called
> > from an "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
> > autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
> > postgres.c.
> >
>
> I'm not sure if this can occur in other cases.
>

I checked further on this point: yes, it can't occur for the other
cases, as mq_putmessage() is only used for parallel
workers (ParallelWorkerMain() --> pq_redirect_to_shm_mq()).

>
> > Note that in all sigsetjmp blocks we intentionally
> > HOLD_INTERRUPTS(), so as not to cause any issues while performing error
> > handling.  I'm concerned here that if we directly call
> > BackgroundWorkerUnblockSignals(), which will open up all the signals,
> > our main intention of holding interrupts or blocking signals may go
> > away.
> >
> > Since the main problem behind this hang issue is the blocking of
> > SIGUSR1, in sigsetjmp, can't we just unblock only SIGUSR1,
> > instead of unblocking all signals? I tried this with the parallel copy
> > hang, and the issue is resolved.
> >
>
> On further thought, I feel just unblocking SIGUSR1
> would be the right approach in this case. I'm attaching a new patch
> which unblocks SIGUSR1 signal. I have verified that the original issue
> with WIP parallel copy patch gets fixed. I have made changes only in
> bgworker.c as we require the parallel worker to receive this signal
> and continue processing. I have not included the changes for other
> processes as I'm not sure if this scenario is applicable for other
> processes.
>

Since we are sure that this hang issue can occur only for parallel
workers, the change in StartBackgroundWorker's sigsetjmp block
should only be made for the parallel worker case.  Also, there can be a
lot of callback execution and other processing in proc_exit() for
which we might not need SIGUSR1 unblocked.  So, let's undo the
unblocking right after EmitErrorReport() to not cause any new issues.

Attaching a modified v2 patch: it has the unblocking only for
parallel workers, undoes it after EmitErrorReport(), and makes some
adjustments in the comments.

I verified this fix for the parallel copy hang issue.  Also, make
check and make check-world pass.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 6bd924f2e7ff90d6c293f131b8ddb20898b9950d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 24 Jul 2020 12:01:14 +0530
Subject: [PATCH v2] Fix for Parallel worker hangs while handling errors.

The worker is not able to receive signals while processing the error flow.
The worker hangs in this case because when the worker is started the signals
are masked using sigprocmask.  Unblocking of signals is done by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain.  But due to error
handling the worker has jumped to sigsetjmp in the StartBackgroundWorker
function, where the signals are still in a blocked state; hence the signal
is not received by the worker process.
---
 src/backend/postmaster/bgworker.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..ee187689a4 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,20 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In case of parallel workers, unblock the SIGUSR1 signal; it was
+		 * blocked when the postmaster forked us. The leader process will send
+		 * SIGUSR1 to the worker process (which will be waiting when there is
+		 * no space available in the queue) to indicate that shared memory
+		 * space has been freed up. Once the signal is received, the worker
+		 * process can continue populating the error message.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigdelset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -756,6 +770,18 @@ StartBackgroundWorker(void)
 		/* Report the error to the server log */
 		EmitErrorReport();
 
+		/*
+		 * Undo the unblocking of SIGUSR1 which was done above, so as
+		 * not to cause any further issues from an unblocked SIGUSR1
+		 * during the execution of callbacks and other processing that
+		 * will be done during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+		{
+			sigaddset(&BlockSig, SIGUSR1);
+			PG_SETMASK(&BlockSig);
+		}
+
 		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
-- 
2.25.1