Re: Allow continuations in "pg_hba.conf" files

2020-04-02 Thread Fabien COELHO


Hello David,

+Agree. However, it would be nice to update the sentence below if I understand 
it correctly.


"+   Comments, whitespace and continuations are handled in the same way as 
in" pg_hba.conf


In the attached v3, I've tried to clarify comments and doc about 
tokenization rules relating to comments, strings and continuations.
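
For illustration, with this change a single record could be written as
(hypothetical entry, spacing approximate):

host    mydb    myuser    192.168.12.0/24    \
        scram-sha-256

The trailing backslash makes the two physical lines one logical record,
exactly as if the authentication method had been written on the first line.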


--
Fabien.

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 5f1eec78fb..5d6412de9b 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -77,13 +77,16 @@
The general format of the pg_hba.conf file is
a set of records, one per line. Blank lines are ignored, as is any
text after the # comment character.
-   Records cannot be continued across lines.
+   Lines ending with a backslash are logically continued on the next
+   line, so a record can span several lines.
A record is made
up of a number of fields which are separated by spaces and/or tabs.
Fields can contain white space if the field value is double-quoted.
Quoting one of the keywords in a database, user, or address field (e.g.,
all or replication) makes the word lose its special
meaning, and just match a database, user, or host with that name.
+   If a continuation occurs within quoted text or a comment,
+   the quoted text or comment is continued on the next line.
   
 
   
@@ -821,7 +824,7 @@ local   db1,db2,@demodbs  all   md5
 
 map-name system-username database-username
 
-   Comments and whitespace are handled in the same way as in
+   Comments, whitespace and continuations are handled in the same way as in
pg_hba.conf.  The
map-name is an arbitrary name that will be used to
refer to this mapping in pg_hba.conf. The other
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index da5189a4fa..e88758c5aa 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -166,11 +166,17 @@ pg_isblank(const char c)
 /*
  * Grab one token out of the string pointed to by *lineptr.
  *
- * Tokens are strings of non-blank
- * characters bounded by blank characters, commas, beginning of line, and
- * end of line. Blank means space or tab. Tokens can be delimited by
- * double quotes (this allows the inclusion of blanks, but not newlines).
- * Comments (started by an unquoted '#') are skipped.
+ * Tokens are strings of non-blank characters bounded by blank characters,
+ * commas, beginning of line, and end of line. Blank means space or tab.
+ *
+ * Tokens can be delimited by double quotes (this allows the inclusion of
+ * blanks or '#', but not newlines). If a continuation occurs within the
+ * quoted text, the quoted text is continued on the next line. There is no
+ * escape mechanism, thus character '"' cannot be part of a token.
+ *
+ * Comments (started by an unquoted '#') are skipped, i.e. the remainder
+ * of the line is ignored. If a line with a comment was backslash-continued,
+ * the continued text is part of the comment, and thus ignored as well.
  *
  * The token, if any, is returned at *buf (a buffer of size bufsz), and
  * *lineptr is advanced past the token.
@@ -486,8 +492,43 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 		char	   *lineptr;
 		List	   *current_line = NIL;
 		char	   *err_msg = NULL;
+		char	   *cur = rawline;
+		int			len = sizeof(rawline);
+		int			continuations = 0;
 
-		if (!fgets(rawline, sizeof(rawline), file))
+		/* read input and handle simplistic backslash continuations */
+		while ((cur = fgets(cur, len, file)) != NULL)
+		{
+			int		curlen = strlen(cur);
+			char   *curend = cur + curlen - 1;
+
+			if (curlen == len - 1)
+			{
+				/* Line too long! */
+				ereport(elevel,
+						(errcode(ERRCODE_CONFIG_FILE_ERROR),
+						 errmsg("authentication file line too long"),
+						 errcontext("line %d of configuration file \"%s\"",
+									line_number + continuations, filename)));
+				err_msg = "authentication file line too long";
+			}
+
+			/* Strip trailing linebreak from rawline */
+			while (curend >= cur && (*curend == '\n' || *curend == '\r'))
+				*curend-- = '\0';
+
+			/* empty or not a continuation, we are done */
+			if (curend < cur || *curend != '\\')
+				break;
+
+			/* else we have a continuation, just remove it and loop */
+			continuations++;
+			*curend = '\0';
+			len -= (curend - cur);
+			cur = curend;
+		}
+
+		if (cur == NULL)
 		{
 			int			save_errno = errno;
 
@@ -501,21 +542,6 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 			   filename, strerror(save_errno));
 			rawline[0] = '\0';
 		}
-		if (strlen(rawline) == MAX_LINE - 1)
-		{
-			/* Line too long! */
-			ereport(elevel,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("authentication file line too long"),
-	 errcontext("line %d of configuration file \"%s\"",
-line_number, filename)));
-			err_msg = "authentication file line too long";
-		}
-
-		/* Strip trailing linebreak 

Re: User Interface for WAL usage data

2020-04-02 Thread Justin Pryzby
On Fri, Apr 03, 2020 at 10:52:02AM +0530, Amit Kapila wrote:
> On Fri, Apr 3, 2020 at 10:41 AM Justin Pryzby  wrote:
> >
> > > > > > > >WAL: records=2359 full page records=42 bytes=447788
> >
> > 1) records; 2) pages ("full page images"); 3) bytes
> >
> > That is exactly like sort (method/type/size) and hash 
> > (buckets/batches/size),
> > and *not* like buffers, which shows various values all in units of "pages".
> >
> 
> The way you have written (2) appears a bit awkward. I would prefer
> "full page writes" or "full page images".

I didn't mean it to be the description used in the patch or anywhere else, just
the list of units.

I wonder if it should use colons instead of equals ?  As in:
|WAL: Records: 2359  Full Page Images: 42  Size: 437kB

Note, that has: 1) two spaces; 2) capitalized "fields"; 3) size rather than
"bytes".  That's similar to Buckets:
|Buckets: 1024  Batches: 1  Memory Usage: 44kB

I'm not sure if it should say "WAL: " or "WAL ", or perhaps "WAL:  "  If
there's no colon, then it looks like the first field is "WAL Records", but then
"size" isn't as tightly associated with WAL.  It could say:
|WAL Records: n  Full Page Images: n  WAL Size: nkB

For comparison, buffers uses "equals" for the case showing multiple "fields",
which are all in units of pages:
|Buffers: shared hit=15 read=2006

Also, for now, the output can be in kB, but I think in the future we should
take a recent suggestion from Andres to make an ExplainPropertyBytes() which
handles conversion to and display of a reasonable unit.  

-- 
Justin




Re: pgbench - add \aset to store results of a combined query

2020-04-02 Thread Fabien COELHO


Bonjour Michaël,


Sure. I meant that the feature would make sense to write benchmark scripts
which would use aset and be able to act on the success or not of this aset,
not to resurrect it for a hidden coverage test.


This could always be discussed for v14.  We'll see.


Or v15, or never, who knows? :-)

The use case I have in mind for such a feature is to be able to have a
flow of DELETE transactions in a multi-script benchmark without breaking
concurrent SELECT/UPDATE transactions. For that, the ability to extract
data easily and test whether the result was non-empty would help.
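
As a rough sketch, a script fragment using hypothetical table and column
names could look like:

SELECT count(*) AS n FROM pending_orders \; SELECT id AS victim FROM pending_orders LIMIT 1 \aset
DELETE FROM pending_orders WHERE id = :victim;

With \aset, the second query returning no row simply leaves :victim unset,
so being able to test for that from the script would allow skipping the
DELETE instead of erroring out.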



Applied, then.  Thanks!


Thanks to you!

--
Fabien.

Re: User Interface for WAL usage data

2020-04-02 Thread Amit Kapila
On Fri, Apr 3, 2020 at 10:41 AM Justin Pryzby  wrote:
>
> > > > > > >WAL: records=2359 full page records=42 bytes=447788
>
> 1) records; 2) pages ("full page images"); 3) bytes
>
> That is exactly like sort (method/type/size) and hash (buckets/batches/size),
> and *not* like buffers, which shows various values all in units of "pages".
>

The way you have written (2) appears a bit awkward. I would prefer
"full page writes" or "full page images".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: User Interface for WAL usage data

2020-04-02 Thread Justin Pryzby
> > > > > Regarding 
> > > > > v10-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-au.patch:
> > > > > I think there should be additional spaces before "full" and before 
> > > > > "bytes":
> > > > >
> > > > > >WAL: records=2359 full page records=42 bytes=447788
> > > > >
> > > > > Compare with these:
> > > > >
> > > > >"Sort Method: %s  %s: %ldkB\n",
> > > > >"Buckets: %d (originally %d)  Batches: %d (originally %d)  
> > > > > Memory Usage: %ldkB\n",
> > > > >"Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
> > > > >
> > > > > Otherwise "records=2359 full page records=42" is hard to parse.
> > > >
> > > > I got the same feeling seeing the line.
> > >
> > > But isn't this same as we have BufferUsage data?  We can probably
> > > display it as full_page_writes or something like that.
> >
> > I guess you mean this:
> >  Buffers: shared hit=994 read=11426 dirtied=466
> >
> > Which can show shared/local/temp.  Actually I would probably make the same
> > suggestion for "Buffers" (if it were a new patch).  I would find this to be
> > pretty unfriendly output:
> >
> >  Buffers: shared hit=12345 read=12345 dirtied=12345 local hit=12345 
> > read=12345 dirtied=12345 temp hit=12345 read=12345 dirtied=12345
> >
> > Adding two extra spaces "  local" and "  temp" would have helped there, so
> > would commas, or parenthesis, dashes or almost anything - other than a
> > backslash.
> >
> > So I think you're right that WAL is very similar to the Buffers case[...]

Actually, I take that back.  If I understand correctly, there should be two
spaces not only to make the 2nd field more clear, but because the three values
have different units:

> > > > > >WAL: records=2359 full page records=42 bytes=447788

1) records; 2) pages ("full page images"); 3) bytes

That is exactly like sort (method/type/size) and hash (buckets/batches/size),
and *not* like buffers, which shows various values all in units of "pages".

-- 
Justin




Re: Some problems of recovery conflict wait events

2020-04-02 Thread Masahiko Sawada
On Fri, 3 Apr 2020 at 12:28, Fujii Masao  wrote:
>
>
>
> On 2020/04/02 16:12, Fujii Masao wrote:
> >
> >
> > On 2020/04/02 15:54, Masahiko Sawada wrote:
> >> On Thu, 2 Apr 2020 at 15:34, Fujii Masao  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 2020/04/02 14:25, Masahiko Sawada wrote:
>  On Wed, 1 Apr 2020 at 22:32, Fujii Masao  
>  wrote:
> >
> >
> >
> > On 2020/03/30 20:10, Masahiko Sawada wrote:
> >> On Fri, 27 Mar 2020 at 17:54, Fujii Masao 
> >>  wrote:
> >>>
> >>>
> >>>
> >>> On 2020/03/04 14:31, Masahiko Sawada wrote:
>  On Wed, 4 Mar 2020 at 13:48, Fujii Masao 
>   wrote:
> >
> >
> >
> > On 2020/03/04 13:27, Michael Paquier wrote:
> >> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
> >>> Yeah, so 0001 patch sets existing wait events to recovery conflict
> >>> resolution. For instance, it sets (PG_WAIT_LOCK | 
> >>> LOCKTAG_TRANSACTION)
> >>> to the recovery conflict on a snapshot. 0003 patch improves these 
> >>> wait
> >>> events by adding the new type of wait event such as
> >>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) 
> >>> patch
> >>> is the fix for existing versions and 0003 patch is an improvement 
> >>> for
> >>> only PG13. Did you mean even 0001 patch doesn't fit for 
> >>> back-patching?
> >
> > Yes, it looks like an improvement rather than a bug fix.
> >
> 
>  Okay, understand.
> 
> >> I got my eyes on this patch set.  The full patch set is in my 
> >> opinion
> >> just a set of improvements, and not bug fixes, so I would refrain 
> >> from
> >> back-backpatching.
> >
> > I think that the issue (i.e., "waiting" is reported twice needlessly
> > in PS display) that 0002 patch tries to fix is a bug. So it should 
> > be
> > fixed even in the back branches.
> 
>  So we need only two patches: one fixes process title issue and 
>  another
>  improve wait event. I've attached updated patches.
> >>>
> >>> I started reading 
> >>> v2-0002-Improve-wait-events-for-recovery-conflict-resolut.patch.
> >>>
> >>> -   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
> >>> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_BUFFER_PIN);
> >>>
> >>> Currently the wait event indicating the wait for buffer pin has 
> >>> already
> >>> been reported. But the above change in the patch changes the name of
> >>> wait event for buffer pin only in the startup process. Is this really 
> >>> useful?
> >>> Isn't the existing wait event for buffer pin enough?
> >>>
> >>> -   /* Wait to be signaled by the release of the Relation Lock */
> >>> -   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> >>> +   /* Wait to be signaled by the release of the Relation 
> >>> Lock */
> >>> +   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_LOCK);
> >>>
> >>> Same as above. Isn't the existing wait event enough?
> >>
> >> Yeah, we can use the existing wait events for buffer pin and lock.
> >>
> >>>
> >>> -   /*
> >>> -* Progressively increase the sleep times, but not to more 
> >>> than 1s, since
> >>> -* pg_usleep isn't interruptible on some platforms.
> >>> -*/
> >>> -   standbyWait_us *= 2;
> >>> -   if (standbyWait_us > 100)
> >>> -   standbyWait_us = 100;
> >>> +   WaitLatch(MyLatch,
> >>> + WL_LATCH_SET | WL_POSTMASTER_DEATH | 
> >>> WL_TIMEOUT,
> >>> + STANDBY_WAIT_MS,
> >>> + wait_event_info);
> >>> +   ResetLatch(MyLatch);
> >>>
> >>> ResetLatch() should be called before WaitLatch()?
> >>
> >> Fixed.
> >>
> >>>
> >>> Could you tell me why you dropped the "increase-sleep-times" logic?
> >>
> >> I thought we can remove it because WaitLatch is interruptible but my
> >> observation was not correct. The waiting startup process is not
> >> necessarily woken up by signal. I think it's still better to not wait
> >> more than 1 sec even if it's an interruptible wait.
> >
> > So we don't need to use WaitLatch() there, i.e., it's ok to keep using
> > pg_usleep()?
> >
> >> Attached patch fixes the above and introduces only two wait events of
> >> conflict resolution: snapshot and tablespace.
> >
> > Many thanks for updating the patch!
> >
> > -   /* Wait to be signaled by the release of the Relation Lock */
> > -   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
> > +   /* Wait to be signaled by the 

Re: Removing unneeded self joins

2020-04-02 Thread Andrey Lepikhov

On 4/1/20 8:34 PM, David Steele wrote:
This patch no longer applies cleanly on 
src/test/regress/sql/equivclass.sql: 
http://cfbot.cputube.org/patch_27_1712.log


The CF entry has been updated to Waiting on Author.


v23 is attached:

1. The patch applies cleanly.
2. Added checks: two potentially self-joined relations may belong to
different join-order restriction rules in join_info_list.
3. Added a test for item 2.

The CF entry has been updated to Needs review.
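
For reference, the kind of query targeted by this optimization is a relation
joined to itself on a unique key, e.g. with a hypothetical table t having a
primary key on id:

SELECT a.id, b.payload
  FROM t a
  JOIN t b ON a.id = b.id;

Since each row of a matches exactly the same row of b, the join can be
removed and references to b redirected to a.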

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
>From e74d3c2549737305419b3c29301d29c1e191d6ae Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Fri, 3 Apr 2020 09:31:58 +0500
Subject: [PATCH] Remove Self Joins v. 23

---
 src/backend/nodes/outfuncs.c  |   19 +-
 src/backend/optimizer/path/indxpath.c |   28 +-
 src/backend/optimizer/path/joinpath.c |6 +-
 src/backend/optimizer/plan/analyzejoins.c | 1178 -
 src/backend/optimizer/plan/planmain.c |7 +-
 src/backend/optimizer/util/pathnode.c |3 +-
 src/backend/optimizer/util/relnode.c  |   26 +-
 src/backend/utils/misc/guc.c  |   11 +
 src/backend/utils/mmgr/aset.c |1 -
 src/include/nodes/nodes.h |1 +
 src/include/nodes/pathnodes.h |   22 +-
 src/include/optimizer/pathnode.h  |4 +
 src/include/optimizer/paths.h |3 +-
 src/include/optimizer/planmain.h  |   10 +-
 src/test/regress/expected/equivclass.out  |   32 +
 src/test/regress/expected/join.out|  330 +
 src/test/regress/expected/updatable_views.out |   15 +-
 src/test/regress/sql/equivclass.sql   |   16 +
 src/test/regress/sql/join.sql |  169 +++
 19 files changed, 1773 insertions(+), 108 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index eb168ffd6d..a1a9ae1ac1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2281,7 +2281,8 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_OID_FIELD(userid);
 	WRITE_BOOL_FIELD(useridiscurrent);
 	/* we don't try to print fdwroutine or fdw_private */
-	/* can't print unique_for_rels/non_unique_for_rels; BMSes aren't Nodes */
+	WRITE_NODE_FIELD(unique_for_rels);
+	/* can't print non_unique_for_rels; BMSes aren't Nodes */
 	WRITE_NODE_FIELD(baserestrictinfo);
 	WRITE_UINT_FIELD(baserestrict_min_security);
 	WRITE_NODE_FIELD(joininfo);
@@ -2353,6 +2354,19 @@ _outStatisticExtInfo(StringInfo str, const StatisticExtInfo *node)
 	WRITE_BITMAPSET_FIELD(keys);
 }
 
+static void
+_outUniqueRelInfo(StringInfo str, const UniqueRelInfo *node)
+{
+	WRITE_NODE_TYPE("UNIQUERELINFO");
+
+	WRITE_BITMAPSET_FIELD(outerrelids);
+	if (node->index)
+	{
+		WRITE_OID_FIELD(index->indexoid);
+		WRITE_NODE_FIELD(column_values);
+	}
+}
+
 static void
 _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 {
@@ -4135,6 +4149,9 @@ outNode(StringInfo str, const void *obj)
 			case T_StatisticExtInfo:
 _outStatisticExtInfo(str, obj);
 break;
+			case T_UniqueRelInfo:
+_outUniqueRelInfo(str, obj);
+break;
 			case T_ExtensibleNode:
 _outExtensibleNode(str, obj);
 break;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2a50272da6..0c6e7eee74 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -3564,7 +3564,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
  * relation_has_unique_index_for
  *	  Determine whether the relation provably has at most one row satisfying
  *	  a set of equality conditions, because the conditions constrain all
- *	  columns of some unique index.
+ *	  columns of some unique index. If unique_info is not null, it is set to
+ *	  point to a new UniqueRelInfo containing the index and conditions.
  *
  * The conditions can be represented in either or both of two ways:
  * 1. A list of RestrictInfo nodes, where the caller has already determined
@@ -3585,12 +3586,16 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 			  List *restrictlist,
-			  List *exprlist, List *oprlist)
+			  List *exprlist, List *oprlist,
+			  UniqueRelInfo **unique_info)
 {
 	ListCell   *ic;
 
 	Assert(list_length(exprlist) == list_length(oprlist));
 
+	if (unique_info)
+		*unique_info = NULL;
+
 	/* Short-circuit if no indexes... */
 	if (rel->indexlist == NIL)
 		return false;
@@ -3641,6 +3646,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List *column_values = NIL;
 
 		/*
 		 * If the index is not unique, or not immediately enforced, or if it's
@@ -3689,6 +3695,9 @@ relation_has_unique_index_for(PlannerInfo *root, 

Re: WAL usage calculation patch

2020-04-02 Thread Amit Kapila
On Fri, Apr 3, 2020 at 9:35 AM Dilip Kumar  wrote:
>
> On Fri, Apr 3, 2020 at 9:17 AM Dilip Kumar  wrote:
> >
> > On Fri, Apr 3, 2020 at 9:02 AM Amit Kapila  wrote:
> > >
> > > On Fri, Apr 3, 2020 at 8:55 AM Dilip Kumar  wrote:
> > > >
> > > > I think now I got the reason.  Basically, both of these records are
> > > > storing the FPW, and FPW size can vary based on the hole size on the
> > > > page.  If the hole size is smaller, the image length will be larger:
> > > > image_len = BLCKSZ - hole_size.  So in subsequent records, the image size
> > > > is bigger.
> > > >
> > >
> > > This means if we always re-create the database or may be keep
> > > full_page_writes to off, then we should get consistent WAL usage data
> > > for all tests.
> >
> > With a new database, it is always the same.  But with full-page writes,
> > I could see that one of the create index statements writes extra WAL, and if we
> > change the order then the new create index at that place will write
> > extra WAL.  I guess that could be due to a non-in-place update in some
> > of the system tables.
>
> I have analyzed the WAL and there could be multiple reasons for the
> same.  With small data, I have noticed that while inserting in the
> system index there was a Page Split and that created extra WAL.
>

Thanks for the investigation.  I think it is clear that we can't
expect the same WAL size even if we repeat the same operation unless
it is a fresh database.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-02 Thread David Rowley
On Fri, 3 Apr 2020 at 16:40, Tom Lane  wrote:
> (It occurs to me BTW that we've been overly conservative about using
> NOT NULL constraints in planning, because of failing to consider that.
> Addition or drop of NOT NULL has to cause a change in
> pg_attribute.attnotnull, which will definitely cause a relcache inval
> on its table, cf rules in CacheInvalidateHeapTuple().  So we *don't*
> need to have a pg_constraint entry corresponding to the NOT NULL, as
> we've mistakenly supposed in some past discussions.)

Agreed for remove_useless_groupby_columns(), but we'd need it if we
wanted to detect functional dependencies in
check_functional_grouping() using unique indexes.




Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Fri, Apr 3, 2020 at 9:17 AM Dilip Kumar  wrote:
>
> On Fri, Apr 3, 2020 at 9:02 AM Amit Kapila  wrote:
> >
> > On Fri, Apr 3, 2020 at 8:55 AM Dilip Kumar  wrote:
> > >
> > > I think now I got the reason.  Basically, both of these records are
> > > storing the FPW, and FPW size can vary based on the hole size on the
> > > page.  If the hole size is smaller, the image length will be larger:
> > > image_len = BLCKSZ - hole_size.  So in subsequent records, the image size
> > > is bigger.
> > >
> >
> > This means if we always re-create the database or may be keep
> > full_page_writes to off, then we should get consistent WAL usage data
> > for all tests.
>
> With a new database, it is always the same.  But with full-page writes,
> I could see that one of the create index statements writes extra WAL, and if we
> change the order then the new create index at that place will write
> extra WAL.  I guess that could be due to a non-in-place update in some
> of the system tables.

I have analyzed the WAL and there could be multiple reasons for the
same.  With small data, I have noticed that while inserting in the
system index there was a Page Split and that created extra WAL.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Unicode normalization SQL functions

2020-04-02 Thread John Naylor
On Thu, Apr 2, 2020 at 3:51 PM Peter Eisentraut
 wrote:
>
> On 2020-03-26 18:41, John Naylor wrote:
> > We don't have a trie implementation in Postgres, but we do have a
> > perfect hash implementation. Doing that would bring the tables back to
> > 64 bits per entry, but would likely be noticeably faster than binary
> > search. Since v4 has left out the biggest tables entirely, I think
> > this might be worth a look for the smaller tables remaining.
> >
> > In the attached v5, when building the hash tables, we sort the code
> > points by NO/MAYBE, and store the index of the beginning of the NO
> > block:
>
> This is a valuable idea, but I fear it's a bit late now in this cycle.
> I have questions about some details.  For example, you mention that you
> had to fiddle with the hash seed.  How does that affect other users of
> PerfectHash?

They would still try the same multipliers they use now, so no effect on them.

> What happens when we update Unicode data and the hash
> doesn't work anymore?

The script would choose different multipliers and/or seeds
automatically. Only if you're unlucky would you have to fiddle with
the hash parameters again. The last-resort multipliers in the v2 patch
in the other thread [1] seem very effective and easily build both the
quick check D tables, which I tried for amusement's sake.

That said, we could reduce the chances of that happening this way:
After trying all the shift-and-add multipliers, we could add another
loop to try a bunch of numbers in a range. We'd need a quick check to
weed out multiples of small primes so the number has a decent chance
of being prime. To save time, just try a few seeds and move on to the
next number. Maybe emit a warning that it exhausted the shift-and-add
multipliers in case the developer wanted to intervene.

If I resubmit this, I would split the build up into two steps: have
the current manual script build the quick check array for later commit
into the tree, and build the hash function separately from that as a
Makefile distprep target. There's no reason to have the hash functions
checked in as I did in v5, like we don't check in the keyword hash
functions.

I would also consider splitting it into two patches:

1. Keep binary search but with a more abstract array representation
(think PG_RMGR). This patch would be large but mostly mechanical.
2. Switch to hash lookup. A smaller one for ease of review.

[1] 
https://www.postgresql.org/message-id/CACPNZCvMMj88Bsnk1k%3DRffW6gBw%2BFH7wcwCBfcKLDM%3DUEG2UWg%40mail.gmail.com

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Fri, Apr 3, 2020 at 9:02 AM Amit Kapila  wrote:
>
> On Fri, Apr 3, 2020 at 8:55 AM Dilip Kumar  wrote:
> >
> > I think now I got the reason.  Basically, both of these records are
> > storing the FPW, and FPW size can vary based on the hole size on the
> > page.  If the hole size is smaller, the image length will be larger:
> > image_len = BLCKSZ - hole_size.  So in subsequent records, the image size
> > is bigger.
> >
>
> This means if we always re-create the database or may be keep
> full_page_writes to off, then we should get consistent WAL usage data
> for all tests.

With a new database, it is always the same.  But with full-page writes,
I could see that one of the create index statements writes extra WAL, and if we
change the order then the new create index at that place will write
extra WAL.  I guess that could be due to a non-in-place update in some
of the system tables.

postgres[58554]=# create extension pg_stat_statements;
CREATE EXTENSION
postgres[58554]=#
postgres[58554]=# create table t1(id integer);
CREATE TABLE
postgres[58554]=# insert into t1 select * from generate_series(1, 100);
INSERT 0 100
postgres[58554]=# select * from pg_stat_statements_reset() ;
 pg_stat_statements_reset
--

(1 row)

postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 0);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_0 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 1);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_1 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 2);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_2 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 3);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_3 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 4);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_4 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 5);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_5 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 6);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_6 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 7);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_7 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# alter table t1 set (parallel_workers = 8);
ALTER TABLE
postgres[58554]=# vacuum;checkpoint;
VACUUM
CHECKPOINT
postgres[58554]=# create index t1_idx_parallel_8 ON t1(id);
CREATE INDEX
postgres[58554]=#
postgres[58554]=# select query, calls, wal_bytes, wal_records,
wal_num_fpw from pg_stat_statements where query ilike '%create
index%';
                  query                   | calls | wal_bytes | wal_records | wal_num_fpw
------------------------------------------+-------+-----------+-------------+-------------
 create index t1_idx_parallel_0 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_1 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_3 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_2 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_4 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_8 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_6 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_7 ON t1(id) |     1 |  20355953 |        2766 |        2745
 create index t1_idx_parallel_5 ON t1(id) |     1 |  20359585 |        2767 |        2745

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Syntax rules for a text value inside the literal for a user-defined type—doc section “8.16.2. Constructing Composite Values”

2020-04-02 Thread Bryn Llewellyn
On 02-Apr-2020, at 19:25, Tom Lane  wrote:

Bryn Llewellyn  writes:
> The documentation in section “8.16.2. Constructing Composite Values” here:
> https://www.postgresql.org/docs/11/rowtypes.html#ROWTYPES-IO-SYNTAX 
> 

The authoritative documentation for that is at 8.16.6 "Composite Type
Input and Output Syntax", and it says quite clearly that whitespace is
not ignored (except for before and after the outer parentheses).

regards, tom lane


Thanks for the super-fast reply, Tom. Yes, I had read that part. I should have 
said this:

“The documentation in section “8.16.2. Constructing Composite Values” et seq 
shows examples…”

This is the text to which you refer me:

“Whitespace outside the parentheses is ignored, but within the parentheses it 
is considered part of the field value, and might or might not be significant 
depending on the input conversion rules for the field data type. For example, 
in:

'(  42)'

the whitespace will be ignored if the field type is integer, but not if it is 
text. As shown previously, when writing a composite value you can write double 
quotes around any individual field value.”

Notice the wording “double quotes around any individual field value.” The word
“around” was the source of my confusion. For the docs to communicate what it
seems they ought to, the word should be “within”. This demonstrates my
point:

create type rt as (a text, b text);
with v as (select '(a "b c" d, e "f,g" h)'::rt as r)
select
  '>'||(r).a||'<' as a,
  '>'||(r).b||'<' as b
from v;

It shows this:

     a     |     b
-----------+------------
 >a b c d< | > e f,g h<

So these are the resulting parsed-out text values:

a b c d

and

 e f,g h

This demonstrates that, in my input, the double quotes are *within* each of the 
two text values—and definitely *do not surround* them.
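
For contrast, here is a minimal sketch reusing the rt type above, where the
double quotes do surround whole field values (output shown approximately):

with v as (select '("a b, c","d e")'::rt as r)
select
  '>'||(r).a||'<' as a,
  '>'||(r).b||'<' as b
from v;

     a     |   b
-----------+-------
 >a b, c<  | >d e<

Here each quoted string becomes the entire field value: the embedded comma and
spaces are protected, and the quotes themselves are stripped.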

I really would appreciate a reply to the second part of my earlier question:

“please explain the rationale for what seems to me to be a counter-intuitive 
syntax design choice.”

I say “counter-intuitive” because JSON had to solve the same high-level goal—to 
distinguish between a string value on the one hand and, for example, a number 
or boolean value on the other hand. They chose the rule that a string value 
*must* be surrounded by double quotes and that other values must *not* be so 
surrounded. The JSON model resonates with my intuition. It also has mechanisms 
to escape interior double quotes and other special characters. I am very 
interested to know why the PostgreSQL designers preferred their model.









Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-02 Thread Tom Lane
David Rowley  writes:
> For just planner smarts such as LEFT JOIN removal, Unique Joins, and
> all this Unique Key stuff, we really don't need to record the
> dependency as if the index or constraint is dropped, then that'll
> cause a relcache invalidation and we'll see the invalidation when we
> attempt to execute the cached plan. That will cause the statement to
> be re-planned and we'll not see the unique index when we do that.

You need to make sure that the thing you're concerned about will actually
cause a relcache invalidation of a table in the query.  But yeah, if it
will then there's not a need to have any other invalidation mechanism.

(It occurs to me BTW that we've been overly conservative about using
NOT NULL constraints in planning, because of failing to consider that.
Addition or drop of NOT NULL has to cause a change in
pg_attribute.attnotnull, which will definitely cause a relcache inval
on its table, cf rules in CacheInvalidateHeapTuple().  So we *don't*
need to have a pg_constraint entry corresponding to the NOT NULL, as
we've mistakenly supposed in some past discussions.)
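
(A minimal concrete illustration, using a hypothetical table z with a column b:

ALTER TABLE z ALTER COLUMN b SET NOT NULL;

updates pg_attribute.attnotnull for z, which sends a relcache invalidation for
z, so any cached plan involving z gets replanned.)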

regards, tom lane




Re: WAL usage calculation patch

2020-04-02 Thread Amit Kapila
On Fri, Apr 3, 2020 at 8:55 AM Dilip Kumar  wrote:
>
> I think now I got the reason.  Basically, both of these records are
> storing the FPW, and FPW size can vary based on the hole size on the
> page.  If the hole size is smaller, the image length will be larger:
> image_len = BLCKSZ - hole_size.  So in subsequent records, the image size
> is bigger.
>

This means if we always re-create the database or may be keep
full_page_writes to off, then we should get consistent WAL usage data
for all tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Some problems of recovery conflict wait events

2020-04-02 Thread Fujii Masao




On 2020/04/02 16:12, Fujii Masao wrote:



On 2020/04/02 15:54, Masahiko Sawada wrote:

On Thu, 2 Apr 2020 at 15:34, Fujii Masao  wrote:




On 2020/04/02 14:25, Masahiko Sawada wrote:

On Wed, 1 Apr 2020 at 22:32, Fujii Masao  wrote:




On 2020/03/30 20:10, Masahiko Sawada wrote:

On Fri, 27 Mar 2020 at 17:54, Fujii Masao  wrote:




On 2020/03/04 14:31, Masahiko Sawada wrote:

On Wed, 4 Mar 2020 at 13:48, Fujii Masao  wrote:




On 2020/03/04 13:27, Michael Paquier wrote:

On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:

Yeah, so 0001 patch sets existing wait events to recovery conflict
resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
to the recovery conflict on a snapshot. 0003 patch improves these wait
events by adding the new type of wait event such as
WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
is the fix for existing versions and 0003 patch is an improvement for
only PG13. Did you mean even 0001 patch doesn't fit for back-patching?


Yes, it looks like an improvement rather than a bug fix.



Okay, understand.


I got my eyes on this patch set.  The full patch set is in my opinion
just a set of improvements, and not bug fixes, so I would refrain from
back-backpatching.


I think that the issue (i.e., "waiting" is reported twice needlessly
in PS display) that 0002 patch tries to fix is a bug. So it should be
fixed even in the back branches.


So we need only two patches: one fixes process title issue and another
improve wait event. I've attached updated patches.


I started reading 
v2-0002-Improve-wait-events-for-recovery-conflict-resolut.patch.

-   ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
+   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_BUFFER_PIN);

Currently the wait event indicating the wait for buffer pin has already
been reported. But the above change in the patch changes the name of
wait event for buffer pin only in the startup process. Is this really useful?
Isn't the existing wait event for buffer pin enough?

-   /* Wait to be signaled by the release of the Relation Lock */
-   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   /* Wait to be signaled by the release of the Relation Lock */
+   ProcWaitForSignal(WAIT_EVENT_RECOVERY_CONFLICT_LOCK);

Same as above. Isn't the existing wait event enough?


Yeah, we can use the existing wait events for buffer pin and lock.



-   /*
-    * Progressively increase the sleep times, but not to more than 1s, 
since
-    * pg_usleep isn't interruptible on some platforms.
-    */
-   standbyWait_us *= 2;
-   if (standbyWait_us > 100)
-   standbyWait_us = 100;
+   WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT,
+ STANDBY_WAIT_MS,
+ wait_event_info);
+   ResetLatch(MyLatch);

ResetLatch() should be called before WaitLatch()?


Fixed.



Could you tell me why you dropped the "increase-sleep-times" logic?


I thought we can remove it because WaitLatch is interruptible but my
observation was not correct. The waiting startup process is not
necessarily woken up by signal. I think it's still better to not wait
more than 1 sec even if it's an interruptible wait.


So we don't need to use WaitLatch() there, i.e., it's ok to keep using
pg_usleep()?


Attached patch fixes the above and introduces only two wait events of
conflict resolution: snapshot and tablespace.


Many thanks for updating the patch!

-   /* Wait to be signaled by the release of the Relation Lock */
-   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   /* Wait to be signaled by the release of the Relation Lock */
+   ProcWaitForSignal(PG_WAIT_LOCK | locktag.locktag_type);
+   }

Is this change really valid? What happens if the latch is set during
ResolveRecoveryConflictWithVirtualXIDs()?
ResolveRecoveryConflictWithVirtualXIDs() can return after the latch
is set but before WaitLatch() in WaitExceedsMaxStandbyDelay() is reached.


Thank you for reviewing the patch!

You're right. It's better to keep using pg_usleep() and set the wait
event by pgstat_report_wait_start().



+   default:
+   event_name = "unknown wait event";
+   break;

Seems this default case should be removed. Please see other
pgstat_get_wait_xxx() function, so there is no such code.


I also removed the wait
event of conflict resolution of database since it's unlikely to become
a user-visible and a long sleep as we discussed before.


Is it worth defining new wait event type RecoveryConflict only for
those two events? ISTM that it's ok to use IPC event type here.



I dropped a new wait even type and added them to IPC wait event type.

I've attached the new version patch.


Thanks for updating the patch! The patch looks good to me except
the following minor things.

+    

Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Thu, Apr 2, 2020 at 9:28 PM Dilip Kumar  wrote:
>
> On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
> >
> > On Thu, Apr 2, 2020 at 6:18 PM Julien Rouhaud  wrote:
> > >
> > > =# select query, calls, wal_bytes, wal_records, wal_num_fpw from 
> > > pg_stat_statements where query ilike '%create index%';
> > >                     query                     | calls | wal_bytes | wal_records | wal_num_fpw
> > > ----------------------------------------------+-------+-----------+-------------+-------------
> > >  create index t1_idx_parallel_0 ON t1(id)     |     1 |  20389743 |        2762 |        2758
> > >  create index t1_idx_parallel_0_bis ON t1(id) |     1 |  20394391 |        2762 |        2758
> > >  create index t1_idx_parallel_0_ter ON t1(id) |     1 |  20395155 |        2762 |        2758
> > >  create index t1_idx_parallel_1 ON t1(id)     |     1 |  20388335 |        2762 |        2758
> > >  create index t1_idx_parallel_2 ON t1(id)     |     1 |  20389091 |        2762 |        2758
> > >  create index t1_idx_parallel_3 ON t1(id)     |     1 |  20389847 |        2762 |        2758
> > >  create index t1_idx_parallel_4 ON t1(id)     |     1 |  20390603 |        2762 |        2758
> > >  create index t1_idx_parallel_5 ON t1(id)     |     1 |  20391359 |        2762 |        2758
> > >  create index t1_idx_parallel_6 ON t1(id)     |     1 |  20392115 |        2762 |        2758
> > >  create index t1_idx_parallel_7 ON t1(id)     |     1 |  20392871 |        2762 |        2758
> > >  create index t1_idx_parallel_8 ON t1(id)     |     1 |  20393627 |        2762 |        2758
> > > (11 rows)
> > >
> > > =# select relname, pg_relation_size(oid) from pg_class where relname like 
> > > '%t1_id%';
> > >   relname  | pg_relation_size
> > > ---+--
> > >  t1_idx_parallel_0 | 22487040
> > >  t1_idx_parallel_0_bis | 22487040
> > >  t1_idx_parallel_0_ter | 22487040
> > >  t1_idx_parallel_2 | 22487040
> > >  t1_idx_parallel_1 | 22487040
> > >  t1_idx_parallel_4 | 22487040
> > >  t1_idx_parallel_3 | 22487040
> > >  t1_idx_parallel_5 | 22487040
> > >  t1_idx_parallel_6 | 22487040
> > >  t1_idx_parallel_7 | 22487040
> > >  t1_idx_parallel_8 | 22487040
> > > (9 rows)
> > >
> > >
> > > So while the number of WAL records and full page images stay constant, we 
> > > can
> > > see some small fluctuations in the total amount of generated WAL data, 
> > > even for
> > > multiple execution of the sequential create index.  I'm wondering if the
> > > fluctuations are due to some other internal details or if the WalUsage 
> > > support
> > > is just completely broken (although I don't see any obvious issue ATM).
> > >
> >
> > I think we need to know the reason for this.  Can you try with small
> > size indexes and see if the problem is reproducible? If it is, then it
> > will be easier to debug the same.
>
> I have done some testing to see where these extra WAL size is coming
> from.  First I tried to create new db before every run then the size
> is consistent.  But, then on the same server, I tired as Julien showed
> in his experiment then I am getting few extra wal bytes from next
> create index onwards.  And, the waldump(attached in the mail) shows
> that is pg_class insert wal.  I still have to check that why we need
> to write an extra wal size.
>
> create extension pg_stat_statements;
> drop table t1;
> create table t1(id integer);
> insert into t1 select * from generate_series(1, 10);
> alter table t1 set (parallel_workers = 0);
> vacuum;checkpoint;
> select * from pg_stat_statements_reset() ;
> create index t1_idx_parallel_0 ON t1(id);
> select query, calls, wal_bytes, wal_records, wal_num_fpw from
> pg_stat_statements where query ilike '%create index%';;
>   query
>| calls | wal_bytes | wal_records | wal_num_fpw
> --+---+---+-+-
>  create index t1_idx_parallel_0 ON t1(id)
>| 1 | 49320 |  23 |  15
>
>
> drop table t1;
> create table t1(id integer);
> insert into t1 select * from generate_series(1, 10);
> --select * from pg_stat_statements_reset() ;
> alter table t1 set (parallel_workers = 0);
> vacuum;checkpoint;
> create index t1_idx_parallel_1 ON t1(id);
>
> select query, calls, wal_bytes, wal_records, wal_num_fpw from
> pg_stat_statements where query ilike '%create index%';;
> postgres[110383]=# select query, calls, wal_bytes, wal_records,
> wal_num_fpw from pg_stat_statements;
>   query
>| calls | wal_bytes | wal_records | wal_num_fpw
> 

Re: Online checksums verification in the backend

2020-04-02 Thread Masahiko Sawada
On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud  wrote:
>
> On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote:
> > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud  wrote:
> > >
> > > v5 attached
> >
> > Thank you for updating the patch. I have some comments:
>
> Thanks a lot for the review!

Thank you for updating the patch!

> > 4.
> > +   /* Check if the relation (still) exists */
> > +   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
> > +   {
> > +   /*
> > +* We use a standard relation_open() to acquire the initial lock.  
> > It
> > +* means that this will block until the lock is acquired, or will
> > +* raise an ERROR if lock_timeout has been set.  If caller wants to
> > +* check multiple tables while relying on a maximum wait time, it
> > +* should process tables one by one instead of relying on a global
> > +* processing with the main SRF.
> > +*/
> > +   relation = relation_open(relid, AccessShareLock);
> > +   }
> >
> > IIUC the above was necessary because we used to have
> > check_all_relations() which iterates all relations on the database to
> > do checksum checks. But now that we don't have that function and
> > pg_check_relation processes one relation. Can we just do
> > relation_open() here?
>
> Ah yes I missed that comment.  I think only the comment needed to be updated 
> to
> remove any part related to the NULL-relation call.  I ended up removing the whole
> comment since locking and lock_timeout behavior is inherent to relation_open
> and there's no need to document that any further now that we always only check
> one relation at a time.

The current patch still checks SearchSysCacheExists1() before
relation_open. Why do we need to call SearchSysCacheExists1() here? I
think if the given relation doesn't exist, relation_open() will raise
an error "could not open relation with OID %u".

+   /* Open the relation if it exists.  */
+   if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
+   {
+   relation = relation_open(relid, AccessShareLock);
+   }


> > 8.
> > +   if (PageIsVerified(buffer, blkno))
> > +   {
> > +   /*
> > +* If the page is really new, there won't by any checksum to be
> > +* computed or expected.
> > +*/
> > +   *chk_expected = *chk_found = NoComputedChecksum;
> > +   return true;
> > +   }
> > +   else
> > +   {
> > +   /*
> > +* There's a corruption, but since this affect PageIsNew, we
> > +* can't compute a checksum, so set NoComputedChecksum for the
> > +* expected checksum.
> > +*/
> > +   *chk_expected = NoComputedChecksum;
> > +   *chk_found = hdr->pd_checksum;
> > +   }
> > +   return false;
> >
> > * I think the 'else' is not necessary here.
>
> AFAICT it's, see below.
>
> > * Setting *chk_expected and *chk_found seems useless when we return
> > true. The caller doesn't use them.
>
> Indeed, fixed.

The patch still sets values to both?

+   if (PageIsVerified(buffer, blkno))
+   {
+   /* No corruption. */
+   *chk_expected = *chk_found = NoComputedChecksum;
+   return true;
+   }

>
> > * Should we forcibly overwrite ignore_checksum_failure to off in
> > pg_check_relation()? Otherwise, this logic seems not work fine.
> >
> > * I don't understand why we cannot compute a checksum in case where a
> > page looks like a new page but is actually corrupted. Could you please
> > elaborate on that?
>
> PageIsVerified has a different behavior depending on whether the page looks 
> new
> or not.  If the page looks like new, it only checks that it's indeed a new
> page, and otherwise try to verify the checksum.
>
> Also, pg_check_page() has an assert to make sure that the page isn't (or don't
> look like) new.
>
> So it seems to me that the 'else' is required to properly detect a real or 
> fake
> PageIsNew, and try to compute checksums only when required.

Thank you for your explanation! I understand.

I thought we can arrange the code to something like:

if (PageIsNew(hdr))
{
if (PageIsVerified(hdr))
{
*chk_expected = *chk_found = NoComputedChecksum;
return true;
}

*chk_expected = NoComputedChecksum;
*chk_found = hdr->pd_checksum;
return false;
}

But since it's not a critical problem you can ignore it.

>
> > 8.
> > +   {
> > +   {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY,
> > +   gettext_noop("Checksum cost for a page found in the buffer 
> > cache."),
> > +   NULL
> > +   },
> > +   ,
> > +   1, 0, 1,
> > +   NULL, NULL, NULL
> > +   },
> >
> > * There is no description about the newly added four GUC parameters in the 
> > doc.
> >
> > * We need to put new GUC parameters into postgresql.conf.sample as well.
>
> Fixed both.
>
> > * The patch doesn't use checksum_cost_page_hit at 

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-02 Thread David Rowley
On Fri, 3 Apr 2020 at 15:18, Andy Fan  wrote:
>
> The updated patch should fixed all the issues.  See the comments below for 
> more
> information.
>
> On Tue, Mar 31, 2020 at 9:44 AM David Rowley  wrote:
>>
>> + * can only guarantee the uniqueness without considering the null values. 
>> This
>> + * field is necessary for remove_useless_join & reduce_unique_semijions 
>> since
>> + * these cases don't care about the null values.
>>
>> Why is the field which stores the nullability of the key required for
>> code that does not care about the nullability of the key?
>>
>> Also please check your spelling of the word "join"
>>
>
> Actually I didn't find the spell error for "join"..

It was in reduce_unique_semijions. That should be
reduce_unique_semijoins. I see you fixed it in the patch though.

> 3.  remove_useless_groupby_columns maintains the parse->constraintDeps
> when it depends on primary key,  but UniqueKey doesn't maintain such data.
> since we have translation layer which should protect us from the concurrency 
> issue
> and isolation issue.  Do we need to do that as well in UniqueKey?

I'm pretty sure that code is pretty bogus in
remove_useless_groupby_columns(). It perhaps was just copied from
check_functional_grouping(), where it is required. Looks like the
(ahem) author of d4c3a156c got that wrong... :-(

The reason check_functional_grouping() needs it is for things like
creating a view with a GROUP BY clause that has a column in the SELECT
list that is functionally dependent on the GROUP BY columns, e.g.:

create table z (a int primary key, b int);
create view view_z as select a,b from z group by a;
alter table z drop constraint z_pkey;
ERROR:  cannot drop constraint z_pkey on table z because other objects
depend on it
DETAIL:  view view_z depends on constraint z_pkey on table z
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

Here that view would become invalid if the PK was dropped, so we must
record the dependency in that case.  Doing so is what causes that
error message.

For just planner smarts such as LEFT JOIN removal, Unique Joins, and
all this Unique Key stuff, we really don't need to record the
dependency as if the index or constraint is dropped, then that'll
cause a relcache invalidation and we'll see the invalidation when we
attempt to execute the cached plan. That will cause the statement to
be re-planned and we'll not see the unique index when we do that.

We should probably just get rid of that code in
remove_useless_groupby_columns() to stop people getting confused about
that.




Re: WAL usage calculation patch

2020-04-02 Thread Amit Kapila
On Fri, Apr 3, 2020 at 7:15 AM Kyotaro Horiguchi
 wrote:
>
> Hello.
>
> The v13 patch seems failing to apply on the master.
>

It is probably due to the recent commit ed7a509571.  I have briefly
studied that, and I think we should make this patch account for plan-time
WAL usage, if any, similar to what was committed for buffer usage.
The reason is that there is a possibility that during planning we
might write WAL due to hint bits.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Fri, Apr 3, 2020 at 8:14 AM Amit Kapila  wrote:
>
> On Fri, Apr 3, 2020 at 6:37 AM Amit Kapila  wrote:
> >
> > On Thu, Apr 2, 2020 at 8:06 PM Dilip Kumar  wrote:
> > >
> > > On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  
> > > wrote:
> > > >
> > > > 4.
> > > > /* # of WAL full page image generated */
> > > > Can we change it to "/* # of WAL full page image records generated */"?
> > >
> > > IMHO, "# of WAL full-page image records" seems like the number of wal
> > > record which contains the full-page image.
> > >
> >
> > I think this resembles what you have written here.
> >
> > >  But, actually, this is the
> > > total number of the full-page images, not the number of records that
> > > have a full-page image.
> > >
> >
> > We count this when forming WAL records.  As per my understanding, all
> > three counters are about WAL records.  This counter tells how many
> > records have full page images and one of the purposes of having this
> > counter is to check what percentage of records contain full page
> > image.
> >
>
> How about if say "# of full-page writes generated" or "# of WAL
> full-page writes generated"?  I think now I understand your concern
> because we want to display it as full page writes and the comment
> doesn't seem to indicate the same.

Either of these seem good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: pgbench - add \aset to store results of a combined query

2020-04-02 Thread Michael Paquier
On Thu, Apr 02, 2020 at 03:58:50PM +0200, Fabien COELHO wrote:
> Sure. I meant that the feature would make sense to write benchmark scripts
> which would use aset and be able to act on the success or not of this aset,
> not to resurrect it for a hidden coverage test.

This could always be discussed for v14.  We'll see.

>> Thanks.  So, it looks like everything has been addressed.  Do you have
>> anything else in mind?
> 
> Nope.

Applied, then.  Thanks!
--
Michael




Re: WAL usage calculation patch

2020-04-02 Thread Amit Kapila
On Fri, Apr 3, 2020 at 6:37 AM Amit Kapila  wrote:
>
> On Thu, Apr 2, 2020 at 8:06 PM Dilip Kumar  wrote:
> >
> > On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
> > >
> > > 4.
> > > /* # of WAL full page image generated */
> > > Can we change it to "/* # of WAL full page image records generated */"?
> >
> > IMHO, "# of WAL full-page image records" seems like the number of wal
> > record which contains the full-page image.
> >
>
> I think this resembles what you have written here.
>
> >  But, actually, this is the
> > total number of the full-page images, not the number of records that
> > have a full-page image.
> >
>
> We count this when forming WAL records.  As per my understanding, all
> three counters are about WAL records.  This counter tells how many
> records have full page images and one of the purposes of having this
> counter is to check what percentage of records contain full page
> image.
>

How about if say "# of full-page writes generated" or "# of WAL
full-page writes generated"?  I think now I understand your concern
because we want to display it as full page writes and the comment
doesn't seem to indicate the same.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: User Interface for WAL usage data

2020-04-02 Thread Amit Kapila
On Thu, Apr 2, 2020 at 12:05 PM Justin Pryzby  wrote:
>
> On Thu, Apr 02, 2020 at 11:32:16AM +0530, Amit Kapila wrote:
> > On Thu, Apr 2, 2020 at 11:28 AM Kyotaro Horiguchi  
> > wrote:
> > >
> > > At Thu, 2 Apr 2020 00:41:20 -0500, Justin Pryzby  
> > > wrote in
> > > > Regarding 
> > > > v10-0004-Add-option-to-report-WAL-usage-in-EXPLAIN-and-au.patch:
> > > > I think there should be additional spaces before "full" and before 
> > > > "bytes":
> > > >
> > > > >WAL: records=2359 full page records=42 bytes=447788
> > > >
> > > > Compare with these:
> > > >
> > > >"Sort Method: %s  %s: %ldkB\n",
> > > >"Buckets: %d (originally %d)  Batches: %d (originally %d)  
> > > > Memory Usage: %ldkB\n",
> > > >"Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
> > > >
> > > > Otherwise "records=2359 full page records=42" is hard to parse.
> > >
> > > I got the same feeling seeing the line.
> >
> > But isn't this same as we have BufferUsage data?  We can probably
> > display it as full_page_writes or something like that.
>
> I guess you mean this:
>  Buffers: shared hit=994 read=11426 dirtied=466
>
> Which can show shared/local/temp.  Actually I would probably make the same
> suggestion for "Buffers" (if it were a new patch).  I would find this to be
> pretty unfriendly output:
>
>  Buffers: shared hit=12345 read=12345 dirtied=12345 local hit=12345 
> read=12345 dirtied=12345 temp hit=12345 read=12345 dirtied=12345
>
> Adding two extra spaces "  local" and "  temp" would have helped there, so
> would commas, or parenthesis, dashes or almost anything - other than a
> backslash.
>
> So I think you're right that WAL is very similar to the Buffers case, but I
> suggest that's not a good example to follow, especially since you're adding a
> "field" with spaces in it.
>

Agreed, this is a good reason to keep two spaces as we have in cases
of Buckets.  We can display the variable as "full page writes".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-04-02 Thread Fujii Masao




On 2020/04/02 15:52, Fujii Masao wrote:



On 2020/04/02 15:01, Julien Rouhaud wrote:

On Thu, Apr 02, 2020 at 01:05:56PM +0900, Fujii Masao wrote:



On 2020/04/02 3:47, Julien Rouhaud wrote:

On Wed, Apr 1, 2020 at 7:51 PM Fujii Masao  wrote:



On 2020/03/31 10:31, Justin Pryzby wrote:

On Wed, Jan 29, 2020 at 12:15:59PM +0100, Julien Rouhaud wrote:

Rebase due to conflict with 3ec20c7091e97.


This is failing to apply probably since 
4a539a25ebfc48329fd656a95f3c1eb2cda38af3.
Could you rebase?   (Also, not sure if this can be set as RFC?)


I updated the patch. Attached.


Thanks a lot!  I'm sorry I missed Justin's ping, and I just
realized that my cron job that used to warn me about cfbot failures was
broken :(


+/* Compute the difference between two BufferUsage */
+BufferUsage
+ComputeBufferCounters(BufferUsage *start, BufferUsage *stop)

Since BufferUsageAccumDiff() was exported, ComputeBufferCounters() is
no longer necessary. In the patched version, BufferUsageAccumDiff() is
used to calculate the difference of buffer usage.


Indeed, exposing BufferUsageAccumDiff was definitely a good thing!


+   if (es->summary && (planduration || es->buffers))
+   ExplainOpenGroup("Planning", "Planning", true, es);

Isn't it more appropriate to check "bufusage" instead of "es->buffers" here?
The patch changes the code so that "bufusage" is checked.


AFAICS not unless ExplainOneQuery is also changed to pass a NULL
pointer if the BUFFER option wasn't provided (and maybe also
optionally skip the planning buffer computation).  With this version
you now get:

=# explain (analyze, buffers off) update t1 set id = id;
    QUERY PLAN
---
   Update on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.170..0.170 rows=0 loops=1)
 ->  Seq Scan on t1  (cost=0.00..22.70 rows=1270 width=42) (actual
time=0.050..0.054 rows=1 loops=1)
   Planning Time: 1.461 ms
 Buffers: shared hit=25
   Execution Time: 1.071 ms
(5 rows)

which seems wrong to me.

I reused es->buffers to avoid needing something like:


Yes, you're right! So I updated the patch as you suggested.
Attached is the updated version of the patch.
Thanks for the review!



Thanks a lot, it all looks good to me!


Thanks! Barring any objection, I will commit the latest version of the patch.


Pushed! Thanks a lot!!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: A bug when use get_bit() function for a long bytea string

2020-04-02 Thread movead...@highgo.ca

>Some comments about the set_bit/get_bit parts.
>I'm reading long_bytea_string_bug_fix_ver6_pg10.patch, but that
>applies probably to the other files meant for the existing releases
>(I think you could get away with only one patch for backpatching
>and one patch for v13, and committers will sort out how
>to deal with release branches).
Thanks for teaching me.

>byteaSetBit(PG_FUNCTION_ARGS)
>{
>bytea*res = PG_GETARG_BYTEA_P_COPY(0);
>- int32 n = PG_GETARG_INT32(1);
>+ int64 n = PG_GETARG_INT64(1);
>int32 newBit = PG_GETARG_INT32(2);
>The 2nd argument is 32-bit, not 64. PG_GETARG_INT32(1) must be used. 
>+ errmsg("index "INT64_FORMAT" out of valid range, 0.."INT64_FORMAT,
>+ n, (int64)len * 8 - 1)));
>The cast to int64 avoids the overflow, but it may also produce a
>result that does not reflect the actual range, which is limited to
>2^31-1, again because the bit number is a signed 32-bit value. 
>I believe the formula for the upper limit in the 32-bit case is:
>  (len <= PG_INT32_MAX / 8) ? (len*8 - 1) : PG_INT32_MAX;

>These functions could benefit from a comment mentioning that
>they cannot reach the full extent of a bytea, because of the size limit
>on the bit number.

Because the 2nd argument describes the 'bit' location across the whole bytea
string, not within a single byte, an int32 is not enough for that. I think
there is nothing wrong with the change, or have I missed your point?
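
To illustrate the limit, a tiny standalone arithmetic check (this is not code
from the patch; it only assumes the usual ~1 GB varlena limit):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t max_bytea_len = 1024LL * 1024 * 1024;	/* ~1 GB varlena limit */
	int64_t max_bit = max_bytea_len * 8 - 1;		/* highest addressable bit */

	/* prints 8589934591 vs 2147483647: bits past the 256 MB mark overflow int32 */
	printf("highest bit = %lld, INT32_MAX = %d\n",
		   (long long) max_bit, INT32_MAX);
	return 0;
}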


>These 2 tests need to allocate big chunks of contiguous memory, so they
>might fail for lack of memory on tiny machines, and even when not failing,
>they're pretty slow to run. Are they worth the trouble?


>These 2 tests are supposed to fail in existing releases because set_bit()
>and get_bit() don't take a bigint as the 2nd argument.
>Also, the same comment as above on how much they allocate.
I have deleted the four test cases because they are not worth the memory and
time, and no new test cases were added because generating that much data takes time.

New patch attached.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: Syntax rules for a text value inside the literal for a user-defined type—doc section “8.16.2. Constructing Composite Values”

2020-04-02 Thread Tom Lane
Bryn Llewellyn  writes:
> The documentation in section “8.16.2. Constructing Composite Values” here:
> https://www.postgresql.org/docs/11/rowtypes.html#ROWTYPES-IO-SYNTAX 
> 

The authoritative documentation for that is at 8.16.6 "Composite Type
Input and Output Syntax", and it says quite clearly that whitespace is
not ignored (except for before and after the outer parentheses).

regards, tom lane




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-02 Thread Andy Fan
The updated patch should fixed all the issues.  See the comments below for
more
information.

On Tue, Mar 31, 2020 at 9:44 AM David Rowley  wrote:

> On Sun, 29 Mar 2020 at 20:50, Andy Fan  wrote:
> > Some other changes made in the new patch:
> > 1.   Fixed bug for UniqueKey calculation for OUTER join.
> > 2.   Fixed some typo error in comments.
> > 3.   Renamed the field "grantee" as "guarantee".
>
> I've had a look over this patch. Thank you for doing further work on it.
>
> I've noted down the following during my read of the code:
>
> 1. There seem to be some cases where joins are no longer being
> detected as unique. This is evident in postgres_fdw.out. We shouldn't
> be regressing any of these cases.
>

The issue here is that I didn't distinguish the one-row case in the UniqueKey
struct.  Actually, when an outer relation is joined with a relation that has
only one row, at most one row can match in the outer join.  The only-one-row
cases in postgres_fdw.out come from aggregate calls.

I added a new field "onerow" to the UniqueKey struct and optimized the onerow
UniqueKey to not record every expr.  See add_uniquekey_for_onerow
and relation_is_onerow.


> 2. The following change does not seem like it should be part of this
> patch.  I understand you perhaps have done as you think it will
> improve the performance of checking if an expression is in a list of
> expressions.
>
> - COMPARE_SCALAR_FIELD(varno);
> + /* Compare varattno first since it has higher selectivity than varno */
>   COMPARE_SCALAR_FIELD(varattno);
> + COMPARE_SCALAR_FIELD(varno);
>
> If you think that is true, then please do it as a separate effort and
> provide benchmarks with your findings.
>
> Rolled back.


> 3. list_all_members_in. I think this would be better named as
> list_is_subset. Please follow the lead of bms_is_subset().
> Additionally, you should Assert that IsPointerList is true as there's
> nothing else to indicate that it can't be used for an int or Oid list.
>
> Done


> 4. guarantee is not a very good name for the field in UniqueKey.
> Maybe something like is_not_null?
>
> I tried is_not_null, but when is_not_null equals false it reads as a double
negation, which doesn't feel right to me.  In the end, I used multi_nullvals to
show that the UniqueKey may yield multiple null values, so the uniqueness is not
guaranteed.
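
For clarity, a rough sketch of how I picture the struct after these changes
(the field names exprs, onerow and multi_nullvals come from this thread; the
rest is illustrative and may not match the patch exactly):

typedef struct UniqueKey
{
	NodeTag		type;
	List	   *exprs;			/* expressions this rel is unique on */
	bool		multi_nullvals; /* may yield multiple NULL values, so the
								 * uniqueness is not guaranteed */
	bool		onerow;			/* rel returns at most one row; exprs are
								 * not recorded in that case */
} UniqueKey;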


> 5. I think you should be performing a bms_del_member during join
> removal rather than removing this Assert()
>
> - Assert(bms_equal(rel->relids, root->all_baserels));


Done

>
> FWIW, it's far from perfect that you've needed to delay the left join
> removal, but I do understand why you've done it. It's also far from
> perfect that you're including removed relations in the
> total_table_pages calculation. c6e4133fae1 took some measures to
> improve this calculation and this is making it worse again.
>
> 6. Can you explain why you moved the populate_baserel_uniquekeys()
> call out of set_plain_rel_size()?
>
> 7. I don't think the removal of rel_supports_distinctness() is
> warranted.  Is it not ok to check if the relation has any uniquekeys?
> It's possible, particularly in join_is_removable that this can save
> quite a large amount of effort.
>
> Done


> 8. Your spelling of unique is incorrect in many places:
>
> src/backend/nodes/makefuncs.c: * makeUnqiueKey
> src/backend/optimizer/path/uniquekeys.c:static List
> *initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
> src/backend/optimizer/path/uniquekeys.c: * check if combination of
> unqiuekeys from both side is still useful for us,
> src/backend/optimizer/path/uniquekeys.c:outerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
> src/backend/optimizer/path/uniquekeys.c:innerrel_uniquekey_ctx
> = initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
> src/backend/optimizer/path/uniquekeys.c: * we need to convert the
> UnqiueKey from sub_final_rel to currel via the positions info in
> src/backend/optimizer/path/uniquekeys.c:ctx->pos =
> pos; /* the position in current targetlist,  will be used to set
> UnqiueKey */
> src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
> innerrel is valid after join. innerrel's UniqueKey
> src/backend/optimizer/path/uniquekeys.c: *
> initililze_unqiuecontext_for_joinrel
> src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
> are not possible to use later
>
> src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
> *joinrel,  RelOptInfo *inputrel)
> src/backend/optimizer/plan/analyzejoins.c:  /*
> This UnqiueKey is what we want */
> src/backend/optimizer/plan/planner.c:   /* If we the result if unqiue
> already, we just return the input_rel directly */
> src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
> unqiue on current RelOptInfo.
> src/test/regress/expected/join.out:-- : since b.id is unqiue now
> so the group by cluase is erased, so
> 

Syntax rules for a text value inside the literal for a user-defined type—doc section “8.16.2. Constructing Composite Values”

2020-04-02 Thread Bryn Llewellyn
I’m using PostgreSQL Version 11.2. Try this:

create type rt as (a text, b text);
create table t(k serial primary key, r rt);
insert into t(r) values
  ('("a","b")'),
  ('( "e" , "f" )'),
  ('( "g (h)" , "i, j" )');
select
  k,
  '>'||(r).a||'<' as a,
  '>'||(r).b||'<' as b
from t order by k;

This is the result.

 k |     a     |    b
---+-----------+----------
 1 | >a<       | >b<
 2 | > e <     | > f <
 3 | > g (h) < | > i, j <

The documentation in section “8.16.2. Constructing Composite Values” here:

https://www.postgresql.org/docs/11/rowtypes.html#ROWTYPES-IO-SYNTAX 


shows examples of using double quotes to surround a text value inside the 
literal for a user-defined record type—and it states that this is optional 
unless the text value contains a comma or a parenthesis. But in every example 
there is no case where the syntax elements (the surrounding parentheses and the 
comma separators) outside of the values themselves have space(s) on either 
side. So it gives no basis to explain the result that I show for “k=2” and 
“k=3”.

Intuition tells me that if a text value is surrounded by double quotes, then
these delimit the string, and that outside of this (barring the special case
of a text value that is *not* surrounded by double quotes) whitespace can be
used, as it is in every other case that I can think of in PostgreSQL's SQL and
PL/pgSQL, at the programmer's discretion to improve readability.

This, by the way, is the rule for a JSON string value.

In fact, the rule seems to be this:

“When a text value is written inside the literal for a user-defined type (whose
data type is given by its declaration), the entire run of characters between
the syntax elements (the opening left parenthesis, an interior comma, or the
closing right parenthesis) is taken to be the text value, including spaces
outside of the quotes.”

Have I stumbled on a bug? If not, please explain the rationale for what seems 
to me to be a counter-intuitive syntax design choice.





Re: WAL usage calculation patch

2020-04-02 Thread Kyotaro Horiguchi
Hello.

The v13 patch seems failing to apply on the master.

At Fri, 3 Apr 2020 06:37:21 +0530, Amit Kapila  wrote 
in 
> On Thu, Apr 2, 2020 at 8:06 PM Dilip Kumar  wrote:
> >
> > On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
> > >
> > > 4.
> > > /* # of WAL full page image generated */
> > > Can we change it to "/* # of WAL full page image records generated */"?
> >
> > IMHO, "# of WAL full-page image records" seems like the number of wal
> > record which contains the full-page image.
> >
> 
> I think this resembles what you have written here.
> 
> >  But, actually, this is the
> > total number of the full-page images, not the number of records that
> > have a full-page image.
> >
> 
> We count this when forming WAL records.  As per my understanding, all
> three counters are about WAL records.  This counter tells how many
> records have full page images and one of the purposes of having this
> counter is to check what percentage of records contain full page
> image.

Aside from which is desirable or useful, actually XLogRecordAssemble
in v13-0001 counts the number of attached images, then XLogInsertRecord
sums up the number of images in pgWalUsage.wal_num_fpw.

FWIW, it seems to me that the main concern here is the source of the WAL
size. If that is the case, I think that the number of full-page images is
more useful.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Peter Geoghegan
On Thu, Apr 2, 2020 at 5:17 PM Andres Freund  wrote:
> Since this is a feature that can result in wrong query results (and
> quite possibly crashes / data corruption), I don't think we can just
> leave this unfixed.  But given the amount of code / infrastructure
> changes required to get this into a working feature, I don't see how we
> can unleash those changes onto the stable branches.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

> I don't really know what to do here. Causing problems by neutering a
> feature in the back branch *sucks*. While not quite as bad, removing a
> feature without a replacement in a major release is pretty harsh
> too. But I don't really see any other realistic path forward.

I have an idea that might allow us to insulate some users from the
problem caused by a full revert (or disabling the feature) in the
backbranches. I wouldn't usually make such a radical suggestion, but
the current situation is exceptional. Anything that avoids serious
pain for users deserves to be considered.

Kevin said this about the feature very recently:

"""
Keep in mind that the real goal of this feature is not to eagerly
_see_ "snapshot too old" errors, but to prevent accidental
debilitating bloat due to one misbehaving user connection.  This is
particularly easy to see (and therefore unnervingly common) for those
using ODBC, which in my experience tends to correspond to the largest
companies which are using PostgreSQL.  In some cases, the snapshot
which is preventing removal of the rows will never be used again;
removal of the rows will not actually affect the result of any query,
but only the size and performance of the database.  This is a "soft
limit" -- kinda like max_wal_size.  Where there was a trade-off
between accuracy of the limit and performance, the less accurate way
was intentionally chosen.  I apologize for not making that more clear
in comments.
"""

ODBC uses cursors in rather strange ways, often to implement a kind of
ODBC-level cache. See the description of "Use Declare/Fetch" from
https://odbc.postgresql.org/docs/config.html to get some idea of what
this can look like.

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

I feel slightly silly for even suggesting this. I have to ask. Maybe
nobody noticed a problem with the feature before now (at least in
part) because they didn't truly care about old snapshots anyway. They
just wanted to avoid a significant impact from buggy code that leaks
cursors and things like that. Or, they were happy as long as they
could still access ODBC's "100 rows in a cache" through the cursor.
The docs say that an old_snapshot_threshold setting in the hours is
about the lowest reasonable setting for production use, which seems
rather high to me. It almost seems as if the feature specifically
targets misbehaving applications already.

-- 
Peter Geoghegan




Re: WAL usage calculation patch

2020-04-02 Thread Amit Kapila
On Thu, Apr 2, 2020 at 8:06 PM Dilip Kumar  wrote:
>
> On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
> >
> > 4.
> > /* # of WAL full page image generated */
> > Can we change it to "/* # of WAL full page image records generated */"?
>
> IMHO, "# of WAL full-page image records" seems like the number of wal
> record which contains the full-page image.
>

I think this resembles what you have written here.

>  But, actually, this is the
> total number of the full-page images, not the number of records that
> have a full-page image.
>

We count this when forming WAL records.  As per my understanding, all
three counters are about WAL records.  This counter tells how many
records have full page images and one of the purposes of having this
counter is to check what percentage of records contain full page
image.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-02 Thread James Coleman
On Thu, Apr 2, 2020 at 8:46 PM James Coleman  wrote:
>
> On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
>  wrote:
> >
> > Hi,
> >
> > Thanks, the v52 looks almost ready. I've been looking at the two or
> > three things I mentioned, and I have a couple of comments.
> >
> >
> > 1) /* XXX comparison_cost shouldn't be 0? */
> >
> > I'm not worried about this, because this is not really introduced by
> > this patch - create_sort_path has the same comment/issue, so I think
> > it's acceptable to do the same thing for incremental sort.
>
> Sounds good.
>
> > 2) INITIAL_MEMTUPSIZE
> >
> > tuplesort.c does this:
> >
> >#define INITIAL_MEMTUPSIZE Max(1024, \
> >ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
> >
> > supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
> > But I think it fails to do that, for a couple of reasons.
> >
> > Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
> > minimum (without padding), so with 1024 elements it's guaranteed to be
> > at least 21kB - so exceeding the threshold. The maximum value is
> > something like 256.
> >
> > Secondly, the second part of the formula is guaranteed to get us over
> > the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then
> >
> >ALLOCSET_SEPARATE_THRESHOLD / 30 = 273
> >
> > but we end up using 274, resulting in 8220B array. :-(
> >
> > So I guess the formula should be more like
> >
> >Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))
> >
> > or something like that.
> >
> > FWIW I think the whole hypothesis that selecting the array size below
> > ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
> > allocate this only once (or very few times), and if we need to grow the
> > array we'll hit the threshold anyway.
> >
> > I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
> > may be OK too.
>
> That was a part of the patch I haven't touched since I inherited it,
> and I didn't feel like I knew enough about the Postgres memory
> management to make a determination on whether the reasoning made
> sense.
>
> So I'm happy to use a constant as suggested.

I take that back. That code hasn't changed, it's just moved. Here's
the current code in tuplesort_begin_common on master:

/*
* Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
* see comments in grow_memtuples().
*/
state->memtupsize = Max(1024,
ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);

I'm not sure we ought to change that in this patch...

James




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-02 Thread James Coleman
On Thu, Apr 2, 2020 at 8:20 PM Tomas Vondra
 wrote:
>
> Hi,
>
> Thanks, the v52 looks almost ready. I've been looking at the two or
> three things I mentioned, and I have a couple of comments.
>
>
> 1) /* XXX comparison_cost shouldn't be 0? */
>
> I'm not worried about this, because this is not really introduced by
> this patch - create_sort_path has the same comment/issue, so I think
> it's acceptable to do the same thing for incremental sort.

Sounds good.

> 2) INITIAL_MEMTUPSIZE
>
> tuplesort.c does this:
>
>#define INITIAL_MEMTUPSIZE Max(1024, \
>ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)
>
> supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
> But I think it fails to do that, for a couple of reasons.
>
> Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
> minimum (without padding), so with 1024 elements it's guaranteed to be
> at least 21kB - so exceeding the threshold. The maximum value is
> something like 256.
>
> Secondly, the second part of the formula is guaranteed to get us over
> the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then
>
>ALLOCSET_SEPARATE_THRESHOLD / 30 = 273
>
> but we end up using 274, resulting in 8220B array. :-(
>
> So I guess the formula should be more like
>
>Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))
>
> or something like that.
>
> FWIW I think the whole hypothesis that selecting the array size below
> ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
> allocate this only once (or very few times), and if we need to grow the
> array we'll hit the threshold anyway.
>
> I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
> may be OK too.

That was a part of the patch I haven't touched since I inherited it,
and I didn't feel like I knew enough about the Postgres memory
management to make a determination on whether the reasoning made
sense.

So I'm happy to use a constant as suggested.

> 3) add_partial_path
>
> I'm a bit torn about using enable_incrementalsort in add_partial_path. I
> know we've done that to eliminate the overhead, but it just seems weird.
> I wonder if that really saves us anything or if it was just noise. I'll
> do more testing before making a decision.
>
> While looking at add_path, I see the comparisons there are made in the
> opposite order - we first compare costs, and only then pathkeys. That
> seems like a more efficient way, I guess (cheaper float comparison
> first, pathkey comparison with a loop second). I wonder why it's not
> done like that in add_partial_path too? Perhaps this will make it cheap
> enough to remove the enable_incrementalsort check.

I would love to avoid checking enable_incrementalsort there. I think
it's pretty gross to do so. But if it's the only way to keep the patch
acceptable, then obviously I'd leave it in. So I'm hopeful we can
avoid needing it.

> 4) add_partial_path_precheck
>
> While looking at add_partial_path, I realized add_partial_path_precheck
> probably needs the same change - to consider startup cost too. I think
> the expectation is that add_partial_path_precheck only rejects paths
> that we know would then get rejected by add_partial_path.
>
> But now the behavior is inconsistent, which might lead to surprising
> behavior (I haven't seen such cases, but I haven't looked for them).
> At the moment the add_partial_path_precheck is called only from
> joinpath.c, maybe it's not an issue there.
>
> If it's OK to keep it like this, it'd probably deserve a comment why
> the difference is OK. And the comments also contain the same claim that
> parallel plans only need to look at total cost.

I remember some discussion about that much earlier in this thread (or
in the spin-off thread [1] re: that specific change that didn't get a
lot of action), and I'm pretty sure the conclusion was that we should
change both. But I guess we never got around to doing it.

> 5) Overall, I think the costing is OK. I'm sure we'll find cases that
> will need improvements, but that's fine. However, we now have
>
> - cost_tuplesort (used to be cost_sort)
> - cost_full_sort
> - cost_incremental_sort
> - cost_sort
>
> I find it a bit confusing that we have cost_sort and cost_full_sort. Why
> don't we just keep using the dummy path in label_sort_with_costsize?
> That seems to be the only external caller outside costsize.c. Then we
> could either make cost_full_sort static or get rid of it entirely.

This is another area of the patch I haven't really modified.

James

[1]: 
https://www.postgresql.org/message-id/flat/CAAaqYe-5HmM4ih6FWp2RNV9rruunfrFrLhqFXF_nrrNCPy1Zhg%40mail.gmail.com




Re: src/test/regress/standby_schedule

2020-04-02 Thread Tom Lane
Thomas Munro  writes:
> I guess no human or machine ever runs $SUBJECT, because when I tried
> it while hunting down users of txid_XXX functions, it failed (see
> end).  To run it, you need a primary/standby pair, here 5441/5442, and
> then:

> PGPORT=5441 psql postgres -f sql/hs_primary_setup.sql
> PGPORT=5442 ./pg_regress --use-existing --dbname=postgres --schedule
> standby_schedule

> Perhaps the output changed in January with commit 2eb34ac3.  Easy to
> fix, but I wonder if anyone has a good idea for how to get check-world
> to run it (probably via the "recovery" stuff).

That stuff is very very ancient.  I'd suggest nuking it and writing
an equivalent TAP test, assuming that there's anything it does that's
not already covered by our existing TAP tests.

regards, tom lane




src/test/regress/standby_schedule

2020-04-02 Thread Thomas Munro
Hello,

I guess no human or machine ever runs $SUBJECT, because when I tried
it while hunting down users of txid_XXX functions, it failed (see
end).  To run it, you need a primary/standby pair, here 5441/5442, and
then:

PGPORT=5441 psql postgres -f sql/hs_primary_setup.sql
PGPORT=5442 ./pg_regress --use-existing --dbname=postgres --schedule
standby_schedule

Perhaps the output changed in January with commit 2eb34ac3.  Easy to
fix, but I wonder if anyone has a good idea for how to get check-world
to run it (probably via the "recovery" stuff).

diff -U3 
/home/tmunro/projects/postgresql/src/test/regress/expected/hs_standby_disallowed.out
/home/tmunro/projects/postgresql/src/test/regress/results/hs_standby_disallowed.out
--- 
/home/tmunro/projects/postgresql/src/test/regress/expected/hs_standby_disallowed.out
   2020-03-24 09:02:24.835023971 +1300
+++ 
/home/tmunro/projects/postgresql/src/test/regress/results/hs_standby_disallowed.out
2020-04-03 13:09:24.339672898 +1300
@@ -64,7 +64,7 @@
 (1 row)

 COMMIT PREPARED 'foobar';
-ERROR:  COMMIT PREPARED cannot run inside a transaction block
+ERROR:  cannot execute COMMIT PREPARED during recovery
 ROLLBACK;
 BEGIN;
 SELECT count(*) FROM hs1;
@@ -86,7 +86,7 @@
 (1 row)

 ROLLBACK PREPARED 'foobar';
-ERROR:  ROLLBACK PREPARED cannot run inside a transaction block
+ERROR:  cannot execute ROLLBACK PREPARED during recovery
 ROLLBACK;
 -- Locks
 BEGIN;




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-04-02 Thread Tomas Vondra

Hi,

Thanks, the v52 looks almost ready. I've been looking at the two or
three things I mentioned, and I have a couple of comments.


1) /* XXX comparison_cost shouldn't be 0? */

I'm not worried about this, because this is not really introduced by
this patch - create_sort_path has the same comment/issue, so I think
it's acceptable to do the same thing for incremental sort.


2) INITIAL_MEMTUPSIZE

tuplesort.c does this:

  #define INITIAL_MEMTUPSIZE Max(1024, \
  ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1)

supposedly to keep the array size within ALLOCSET_SEPARATE_THRESHOLD.
But I think it fails to do that, for a couple of reasons.

Firstly, ALLOCSET_SEPARATE_THRESHOLD is 8192, and SortTuple is 21B at
minimum (without padding), so with 1024 elements it's guaranteed to be
at least 21kB - so exceeding the threshold. The maximum value is
something like 256.

Secondly, the second part of the formula is guaranteed to get us over
the threshold anyway, thanks to the +1. Let's say SortTuple is 30B. Then

  ALLOCSET_SEPARATE_THRESHOLD / 30 = 273

but we end up using 274, resulting in 8220B array. :-(

So I guess the formula should be more like

  Min(128, ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple))

or something like that.

FWIW I think the whole hypothesis that selecting the array size below
ALLOCSET_SEPARATE_THRESHOLD reduces overhead is dubious. AFAIC we
allocate this only once (or very few times), and if we need to grow the
array we'll hit the threshold anyway.

I'd just pick a reasonable constant - 128 or 256 seems reasonable, 1024
may be OK too.


3) add_partial_path

I'm a bit torn about using enable_incrementalsort in add_partial_path. I
know we've done that to eliminate the overhead, but it just seems weird.
I wonder if that really saves us anything or if it was just noise. I'll
do more testing before making a decision.

While looking at add_path, I see the comparisons there are made in the
opposite order - we first compare costs, and only then pathkeys. That
seems like a more efficient way, I guess (cheaper float comparison
first, pathkey comparison with a loop second). I wonder why it's not
done like that in add_partial_path too? Perhaps this will make it cheap
enough to remove the enable_incrementalsort check.


4) add_partial_path_precheck

While looking at add_partial_path, I realized add_partial_path_precheck
probably needs the same change - to consider startup cost too. I think
the expectation is that add_partial_path_precheck only rejects paths
that we know would then get rejected by add_partial_path.

But now the behavior is inconsistent, which might lead to surprising
behavior (I haven't seen such cases, but I haven't looked for them).
At the moment the add_partial_path_precheck is called only from
joinpath.c, maybe it's not an issue there.

If it's OK to keep it like this, it'd probably deserve a comment why
the difference is OK. And the comments also contain the same claim that
parallel plans only need to look at total cost.


5) Overall, I think the costing is OK. I'm sure we'll find cases that
will need improvements, but that's fine. However, we now have 


- cost_tuplesort (used to be cost_sort)
- cost_full_sort
- cost_incremental_sort
- cost_sort

I find it a bit confusing that we have cost_sort and cost_full_sort. Why
don't we just keep using the dummy path in label_sort_with_costsize?
That seems to be the only external caller outside costsize.c. Then we
could either make cost_full_sort static or get rid of it entirely.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I don't think we have a real choice here at this point, at least for the
back branches.

Just about nothing around old_snapshot_threshold works correctly:

* There are basically no tests (see [1] I just sent, and also
  old_snapshot_threshold bypassing a lot of the relevant code).

* We don't detect errors after hot pruning (to allow that is a major
  point of the feature) when access is via any sort of index
  scans. Wrong query results.

* The time->xid mapping is entirely broken. We don't prevent bloat
  for many multiples of old_snapshot_threshold (if above 1min).

  It's possible, but harder, to have this cause wrong query results.

* In read-mostly workloads it can trigger errors in sessions that are
  much younger than old_snapshot_threshold, if the transactionid is not
  advancing.

  I've not tried to reproduce, but I suspect this can also cause wrong
  query results. Because a later snapshot can have the same xmin as
  older transactions, it sure looks like we can end up with data from an
  older xmin getting removed, but the newer snapshot's whenTaken will
  prevent TestForOldSnapshot_impl() from raising an error.

* I am fairly sure that it can cause crashes (or even data corruption),
  because it assumes that DML never needs to check for old snapshots
  (with no meaningful justification), leading heap_update/delete to
  assume the page header is a tuple.

* There's obviously also the wraparound issue that made me start this
  thread initially.

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed.  But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

There's quite a few issues in here that require not just local bugfixes,
but some design changes too. And it's pretty clear that the feature
didn't go through enough review before getting committed. I see quite
some merit in removing the code in master, and having a potential
reimplementation go through a normal feature integration process.


I don't really know what to do here. Causing problems by neutering a
feature in the back branch *sucks*. While not quite as bad, removing a
feature without a replacement in a major release is pretty harsh
too. But I don't really see any other realistic path forward.


FWIW, I've now worked around the interdependency of s_t_o and my snapshot
scalability patch (only took like 10 days). I have manually confirmed it
works with 0/1 minute thresholds.  I can make the tests pass unmodified
if I just add SetOldSnapshotThresholdTimestamp() calls when not
necessary (which obviously makes no sense).  This led to some decent
improvements around pruning that are independent of s_t_o (with more
possibilities "opened").  But I still think we need to do something
here.

Greetings,

Andres Freund

[1] https://postgr.es/m/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de




Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Andres Freund
Hi,

I just spent a good bit more time improving my snapshot patch, so it
could work well with a fixed version of the old_snapshot_threshold
feature. Mostly so there's no unnecessary dependency on the resolution
of the issues in that patch.


When testing my changes, for quite a while, I could not get
src/test/modules/snapshot_too_old/ to trigger a single too-old error.

It turns out, that's because there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only
reason the current code triggers any such errors is that

a) TransactionIdLimitedForOldSnapshots() is always called in
   heap_page_prune_opt(), even if the "not limited" horizon
   (i.e. RecentGlobalDataXmin) is more than old enough to allow for
   pruning. That includes pages without a pd_prune_xid.

b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold
   == 0 branch, always calls
SetOldSnapshotThresholdTimestamp(ts, xlimit)
   even if the horizon has not changed due to snapshot_too_old (xlimit
   is initially set to the "input" horizon, and only increased if
   between (recentXmin, MyProc->xmin)).


To benefit from the snapshot scalability improvements in my patchset, it
is important to avoid unnecessarily determining the "accurate" xmin
horizon, if it's clear from the "lower boundary" horizon that pruning
can happen. Therefore I changed heap_page_prune_opt() and
heap_page_prune() to only limit when we couldn't prune.

In the course of that I separated getting the horizon from
TransactionIdLimitedForOldSnapshots() and triggering errors when an
already removed tuple would be needed via
TransactionIdLimitedForOldSnapshots().

Because there are no occasions to actually remove tuples in the entire
test, there now were no TransactionIdLimitedForOldSnapshots() calls. And
thus no errors.  My code turns out to actually work.


Thus, if I change the code in master from:
TransactionId xlimit = recentXmin;
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
xlimit = latest_xmin;

ts -= 5 * USECS_PER_SEC;
SetOldSnapshotThresholdTimestamp(ts, xlimit);

return xlimit;
}

to
...
if (old_snapshot_threshold == 0)
{
if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
&& TransactionIdFollows(latest_xmin, xlimit))
{
xlimit = latest_xmin;
SetOldSnapshotThresholdTimestamp(ts, xlimit);
}

ts -= 5 * USECS_PER_SEC;

return xlimit;
}

there's not a single error raised in the existing tests. Not a *single*
tuple removal is caused by old_snapshot_threshold. We just test the
order of SetOldSnapshotThresholdTimestamp() calls. We have code in the
backend to support testing old_snapshot_threshold, but we don't test
anything meaningful in the feature. We basically test an oddly behaving
version of "transaction_timeout = 5s".  I can't emphasize enough
how baffling I find this.

Greetings,

Andres Freund




Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-02 Thread Tom Lane
David Rowley  writes:
> On Fri, 3 Apr 2020 at 04:46, Tom Lane  wrote:
>> Concretely, I suggest the attached, which replaces the autovac disables
>> with adjusting partition boundaries so that the partitions contain
>> different numbers of rows.

> I've looked over this and I agree that it's a better solution to the problem.
> I'm happy for you to go ahead on this.

Pushed, thanks for looking at it!

regards, tom lane




Re: Datum values consistency within one query

2020-04-02 Thread Tom Lane
Paul Ramsey  writes:
> Getting the datum value is really fast, so I can have a cache that
> keeps the latest detoasted object around, and update it when the datum
> changes, and store the cache information in the parent context. Like
> so:

Jeez, no, not like that.  You're just testing a pointer.  Most likely,
if this is happening in a table scan, the pointer is pointing into
some shared buffer.  If that buffer gets re-used to hold some other
page, you could receive the identical pointer value but it's pointing
to completely different data.  The risk of false pointer match would
be even higher at plan levels above a scan, I think, because it'd
possibly just be pointing into a plan node's output tuple slot.

The case where this would actually be worth doing, probably, is where
you are receiving a toasted-out-of-line datum.  In that case you could
legitimately use the toast pointer ID values (va_valueid + va_toastrelid)
as a lookup key for a cache, as long as it had a lifespan of a statement
or less.  You'd have to get a bit in bed with the details of toast
pointers, but it's not like those are going anywhere.
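
For illustration, a minimal sketch of pulling those IDs out of a datum
(assuming it really is an on-disk external toast pointer, and that postgres.h
is providing the varatt macros; untested):

static bool
toast_cache_key(Datum d, Oid *toastrelid, Oid *valueid)
{
	struct varlena *attr = (struct varlena *) DatumGetPointer(d);
	struct varatt_external toast_pointer;

	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
		return false;			/* inline or in-memory datum: no stable ID */

	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
	*toastrelid = toast_pointer.va_toastrelid;
	*valueid = toast_pointer.va_valueid;
	return true;
}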

It would be interesting to tie that into the "expanded object"
infrastructure, perhaps, especially if the contents of the objects
you're interested in aren't just flat blobs of data.

regards, tom lane




Re: [PATCH] Fix for slow GIN index queries when "gin_fuzzy_search_limit" setting is relatively small for large tables

2020-04-02 Thread Tom Lane
Adé  writes:
>> It seems like what you're actually trying
>> to accomplish is to ensure that entryLoadMoreItems's "stepright" path
>> is taken, instead of re-descending the index from the root.

> What I was primarily trying to do is make sure that when entryLoadMoreItems
> is called, it loads new items that it didn't load previously, which would
> occur in special cases. But the solution to that goal does result in the
> "stepright" path being used.

Oh, hm, now I see what you mean: as things stand, it's likely to
repeatedly (and very expensively) reload the same page we were
already on, until the random dropItem() test finally accepts some
item from that page.  Ick.

I think though that the fix can be a bit less messy than you have here,
because advancePast is just a local variable in entryGetItem, so we
can overwrite it without any problem.  So what I want to do is just
update it to equal entry->curItem before looping around.  But shoving
that assignment into the while-condition was too ugly for my taste
(and no, I didn't like your assignment there either).  So I ended up
refactoring the do-loop into a for-loop with internal break conditions,
as attached.

I also made the posting-list case handle reduction in the same way,
and for good measure changed the bitmap-result case to look similar,
which caused me to notice that it had a bug too: the "continue" case
within that loop failed to reset gotitem = false as it should,
if we'd looped around after rejecting an item due to reduceResult.
As far as I can see, that would lead to returning the previously-
rejected curItem value, which isn't fatal but it's still wrong.
So I just got rid of the gotitem variable altogether; it really
wasn't helping with either clarity or correctness.

This patch also adds a couple of test cases so that we have more
code coverage in this area.  The overall coverage of ginget.c
is still mighty lame, but at least we're going through some of
these lines that we weren't before.

I'm inclined to back-patch this.  Given how fuzzy the definition
of gin_fuzzy_search_limit is, it seems unlikely that anyone would
be depending on the current buggy behavior.

regards, tom lane

diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index 50fe38b..c18ba87 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -813,9 +813,8 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		/* A bitmap result */
 		BlockNumber advancePastBlk = GinItemPointerGetBlockNumber(&advancePast);
 		OffsetNumber advancePastOff = GinItemPointerGetOffsetNumber(&advancePast);
-		bool		gotitem = false;
 
-		do
+		for (;;)
 		{
 			/*
 			 * If we've exhausted all items on this block, move to next block
@@ -864,7 +863,6 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
  * estimate number of results on this page to support correct
  * reducing of result even if it's enabled.
  */
-gotitem = true;
 break;
 			}
 
@@ -877,7 +875,7 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 /*
  * First, do a quick check against the last offset on the
  * page. If that's > advancePast, so are all the other
- * offsets.
+ * offsets, so just go back to the top to get the next page.
  */
 if (entry->matchResult->offsets[entry->matchResult->ntuples - 1] <= advancePastOff)
 {
@@ -894,8 +892,11 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		   entry->matchResult->blockno,
 		   entry->matchResult->offsets[entry->offset]);
 			entry->offset++;
-			gotitem = true;
-		} while (!gotitem || (entry->reduceResult == true && dropItem(entry)));
+
+			/* Done unless we need to reduce the result */
+			if (!entry->reduceResult || !dropItem(entry))
+break;
+		}
 	}
 	else if (!BufferIsValid(entry->buffer))
 	{
@@ -903,7 +904,7 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 		 * A posting list from an entry tuple, or the last page of a posting
 		 * tree.
 		 */
-		do
+		for (;;)
 		{
 			if (entry->offset >= entry->nlist)
 			{
@@ -913,13 +914,20 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 			}
 
 			entry->curItem = entry->list[entry->offset++];
-		} while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0);
-		/* XXX: shouldn't we apply the fuzzy search limit here? */
+
+			/* If we're not past advancePast, keep scanning */
+			if (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0)
+continue;
+
+			/* Done unless we need to reduce the result */
+			if (!entry->reduceResult || !dropItem(entry))
+break;
+		}
 	}
 	else
 	{
 		/* A posting tree */
-		do
+		for (;;)
 		{
 			/* If we've processed the current batch, load more items */
 			while (entry->offset >= entry->nlist)
@@ -935,8 +943,20 @@ entryGetItem(GinState *ginstate, GinScanEntry entry,
 
 			entry->curItem = entry->list[entry->offset++];
 
-		} while (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0 ||
- (entry->reduceResult == true && dropItem(entry)));
+			/* If we're 

Datum values consistency within one query

2020-04-02 Thread Paul Ramsey
Imagine a function that was going to take a table of, say, images, and
do something to them over and over, like:

  SELECT pixel_count(img), clr_name, img_name
  FROM images img
  CROSS JOIN colors clr

When you run this function, you find that a great amount of time is
being spend in the decompression/detoasting routines, so you think: I
have a nested loop here, driven on the 'img' side, if I can avoid
re-loading the big image object over and over I can make things
faster.

Getting the datum value is really fast, so I can have a cache that
keeps the latest detoasted object around, and update it when the datum
changes, and store the cache information in the parent context. Like
so:

typedef struct
{
    Datum  d;
    bytea *ba;
} DatumCache;

PG_FUNCTION_INFO_V1(pixel_count);
Datum pixel_count(PG_FUNCTION_ARGS)
{
    Datum d = PG_GETARG_DATUM(0);
    DatumCache *dcache = fcinfo->flinfo->fn_extra;
    bytea *ba;

    if (!dcache)
    {
        dcache = MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt,
                                        sizeof(DatumCache));
        fcinfo->flinfo->fn_extra = dcache;
    }

    if (dcache->d != d)
    {
        MemoryContext old_context;

        if (dcache->ba)
            pfree(dcache->ba);
        old_context = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        dcache->ba = PG_GETARG_BYTEA_P_COPY(0);
        MemoryContextSwitchTo(old_context);
        dcache->d = d;          /* remember which datum the cache holds */
    }

    ba = dcache->ba;

    /* now do things with ba here */
}

Now, notwithstanding any concerns about the particularities of my
example (I've found order-of-magnitude improvements on PostGIS
workloads avoiding the detoasting overhead this way) is my core
assumption correct: within the context of a single SQL statement, will
the Datum values for a particular object remain constant?

They *seem* to, in the examples I'm running. But do they always?




Re: Add A Glossary

2020-04-02 Thread Justin Pryzby
On Thu, Apr 02, 2020 at 07:09:32PM -0300, Alvaro Herrera wrote:
> "partition" instead).  If you (or anybody) have suggestions for the
> definition of "client" and "session", I'm all ears.

We already have Session:
A Connection to the Database. 

I propose: Client:
A host (or a process on a host) which connects to a server to make
queries or other requests.

But note, "host" is still defined as "server", which I didn't like.

Maybe it should be:
A computer which may act as a >client< or a >server<.

-- 
Justin




Re: A bug when use get_bit() function for a long bytea string

2020-04-02 Thread Tom Lane
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
> Tom Lane  writes:
>> Yeah, I'd noticed those on previous readings of the patch.  They'd almost
>> certainly fail on some of our older/smaller buildfarm members, so they're
>> not getting committed, even if they didn't require multiple seconds apiece
>> to run (even on a machine with plenty of memory).  It's useful to have
>> them for initial testing though.

> Perl's test suite has a similar issue with tests for handling of huge
> strings, hashes, arrays, regexes etc.  We've taken the approach of
> checking the environment variable PERL_TEST_MEMORY and skipping tests
> that need more than that many gigabytes.  We currently have tests that
> check for values from 1 all the way up to 96 GiB.
> This would be trivial to do in the Postgres TAP tests, but something
> similar might be feasible in pg_regress too?

Meh.  The memory is only part of it; the other problem is that multiple
seconds expended in every future run of the regression tests is a price
that's many orders of magnitude higher than the potential value of this
test case.

regards, tom lane




Re: Add A Glossary

2020-04-02 Thread Alvaro Herrera
On 2020-Apr-02, Corey Huinker wrote:

> On Thu, Apr 2, 2020 at 8:44 AM Jürgen Purtz  wrote:
> 
> > +1 and many thanks to Alvaros edits.
> >
> >
> I did some of the grunt work Alvaro alluded to in v6, and the results are
> attached and they build, which means there are no invalid links.

Thank you!  I had been working on some other changes myself, and merged
most of your changes.  I give you v8.

> * renamed id glossary-temporary-tables to glossary-temporary-table

Good.

> * temporarily re-added an id for glossary-row as we have many references to
> that. unsure if we should use the term Tuple in all those places or say Row
> while linking to glossary-tuple, or something else

I changed these to link to glossary-tuple; that entry already explains
these two other terms, so this seems acceptable.

> * temporarily re-added an id for glossary-segment, glossary-wal-segment,
> glossary-analytic-function, as those were also referenced and will need
> similar decisions made

Ditto.

> * added a stub entry for glossary-unique-index, unsure if it should have a
> definition on it's own, or we split it into unique and index.

I changed Unique Index into Unique Constraint, which is supposed to be
the overarching concept.  Used that in the definition of primary key.

> * I noticed several cases where a glossterm is used twice in a definition,
> but didn't de-term them

Did that for most I found, but I expect that some remain.

> * I'm curious about how we should tag a term when using it in its own
> definition. same as anywhere else?

I think we should not tag those.

I fixed the definition of global object as mentioned previously.  Also
added "client", made "connection" have less importance compared to
"session", and removed "window frame" (made "window function" refer to
"partition" instead).  If you (or anybody) have suggestions for the
definition of "client" and "session", I'm all ears.

I'm quite liking the result of this now.  Thanks for all your efforts.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 1043d0f7ab..cf21ef857e 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..7d981a9223
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,1692 @@
+
+ Glossary
+ 
+  This is a list of terms and their meaning in the context of
+  PostgreSQL and relational database
+  systems in general.
+ 
+
+ 
+  
+   Aggregate Function
+   
+
+ A function that
+ combines (aggregates) multiple input values,
+ for example by counting, averaging or adding,
+ yielding a single output value.
+
+
+ For more information, see
+ .
+
+
+   
+  
+
+  
+   Analyze (operation)
+   
+
+ The process of collecting statistics from data in
+ tables
+ and other relations
+ to help the query planner
+ to make decisions about how to execute
+ queries.
+
+   
+  
+
+  
+   Analytic Function
+   
+  
+
+  
+   Atomic
+   
+
+ In reference to a datum:
+ the fact that its value that cannot be broken down into smaller
+ components.
+
+   
+   
+
+ In reference to a
+ database transaction:
+ see atomicity.
+
+   
+  
+
+  
+   Atomicity
+   
+
+ The property of a transaction
+ that either all its operations complete as a single unit or none do.
+ In addition, if a system failure occurs during the execution of a
+ transaction, no partial results are visible after recovery.
+ This is one of the ACID properties.
+
+   
+  
+
+  
+   Attribute
+   
+
+ An element with a certain name and data type found within a
+ tuple or
+ table.
+
+   
+  
+
+  
+   Autovacuum (process)
+   
+
+ A set of background processes that routinely perform
+ vacuum
+ and analyze
+ operations.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Backend (process)
+   
+
+ Processes of an instance
+ which act on behalf of client sessions
+ and handle their requests.
+
+
+ (Don't confuse this term with the similar terms
+ Background Worker or
+ Background Writer).
+
+   
+  
+
+  
+   Background Worker (process)
+   
+
+ Individual processes within an instance,
+ which run system- or user-supplied code.
+ They provide infrastructure for several features in
+ PostgreSQL, such as 
+ logical replication
+ and parallel queries.
+ Extensions can add
+ custom background worker processes, as well.
+   
+   
+For more information, see
+.
+   
+   
+  
+
+  
+   Background Writer (process)
+   
+
+ A process that continuously writes dirty pages from
+ Shared Memory to
+ 

Re: A bug when use get_bit() function for a long bytea string

2020-04-02 Thread Dagfinn Ilmari Mannsåker
Tom Lane  writes:

> "Daniel Verite"  writes:
>> These 2 tests need to allocate big chunks of contiguous memory, so they
>> might fail for lack of memory on tiny machines, and even when not failing,
>> they're pretty slow to run. Are they worth the trouble?
>
> Yeah, I'd noticed those on previous readings of the patch.  They'd almost
> certainly fail on some of our older/smaller buildfarm members, so they're
> not getting committed, even if they didn't require multiple seconds apiece
> to run (even on a machine with plenty of memory).  It's useful to have
> them for initial testing though.

Perl's test suite has a similar issue with tests for handling of huge
strings, hashes, arrays, regexes etc.  We've taken the approach of
checking the environment variable PERL_TEST_MEMORY and skipping tests
that need more than that many gigabytes.  We currently have tests that
check for values from 1 all the way up to 96 GiB.

This would be trivial to do in the Postgres TAP tests, but something
similar might be feasible in pg_regress too?

> It'd be great if there was a way to test get_bit/set_bit on large
> indexes without materializing a couple of multi-hundred-MB objects.
> Can't think of one offhand though.

For this usecase it might make sense to express the limit in megabytes,
and have a policy for how much memory tests can assume without explicit
opt-in from the developer or buildfarm animal.

- ilmari
-- 
"The surreality of the universe tends towards a maximum" -- Skud's Law
"Never formulate a law or axiom that you're not prepared to live with
 the consequences of."  -- Skud's Meta-Law




Re: Nicer error when connecting to standby with hot_standby=off

2020-04-02 Thread David Zhang
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I applied the patch to the latest master branch and ran the test below. The error
messages have been separated. Below are the test steps.

### setup primary server
initdb -D /tmp/primary/data
mkdir /tmp/archive_dir
echo "archive_mode='on'" >> /tmp/primary/data/postgresql.conf
echo "archive_command='cp %p /tmp/archive_dir/%f'" >> 
/tmp/primary/data/postgresql.conf
pg_ctl -D /tmp/primary/data -l /tmp/primary-logs start

### setup hot standby server
pg_basebackup -p 5432 -w -R -D /tmp/hotstandby
echo "primary_conninfo='host=127.0.0.1 port=5432 user=pgdev'" >> 
/tmp/hotstandby/postgresql.conf
echo "restore_command='cp /tmp/archive_dir/%f %p'" >> 
/tmp/hotstandby/postgresql.conf
echo "hot_standby = off" >> /tmp/hotstandby/postgresql.conf
pg_ctl -D /tmp/hotstandby -l /tmp/hotstandby-logs -o "-p 5433" start

### keep trying to connect to hot standby server in order to get the error 
messages in different stages.
while true; do echo "`date`"; psql postgres -p 5433 -c "SELECT 
txid_current_snapshot();"; sleep 0.2; done

### before the patch
psql: error: could not connect to server: FATAL:  the database system is 
starting up
...

### after the patch, got different messages, one message indicates hot_standby 
is off
psql: error: could not connect to server: FATAL:  the database system is 
starting up
...
psql: error: could not connect to server: FATAL:  the database system is up, 
but hot_standby is off
...

Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-02 14:26:41 -0700, Mark Dilger wrote:
> Since Thomas's patch is really just focused on transitioning from txid
> -> xid8, I think this conversation is a bit beyond scope for this
> patch, except that "xid8" sounds an awful lot like the new type that
> all user facing xid output will transition to.  Maybe I'm wrong about
> that.

Several at least. For me it'd e.g. make no sense to change pageinspect
etc.


> Are we going to change the definition of the "xid" type to 8 bytes?
> That sounds dangerous, from a compatibility standpoint.

No, I can't see that happening.


> On balance, I'd rather have xid8in and xid8out work just as Thomas has
> it.  I'm not asking for any change there.  But I'm curious if the
> whole community is on the same page regarding where this is all
> heading.
>
> I'm contemplating the statement that "the goal should be to reduce
> awareness of the 32bitness of normal xids from as many places as
> possible", which I support, and what that means for the eventual
> signatures of functions like pg_stat_get_activity, including:

Maybe. Aiming to do things like this all-at-once just makes it less
likely for anything to ever happen.


> but otherwise there would be a transition period where some have been
> reworked to return xid8 but others not, and users during that
> transition period might be happier with Alvaro's suggestion of
> treating epoch/xid as two fields in xid8in and xid8out.

-countless

I can only restate my point that we've had 8 byte xids exposed for many
years. We've had very few epoch/xid values exposed. I think it'd be
insane to now start to expose that more widely.

It's just about impossible for normal users to compare xids. Once one
wrapped around, it's just too hard/mindbending. Yes, an accompanying
epoch makes it easier, but it still can be quite confusing.
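(For illustration, a minimal SQL sketch of that point; the literal xid values below are arbitrary. The plain xid type has only equality, no ordering operators, so users cannot even ask which of two xids came first.)

SELECT '100'::xid = '100'::xid;           -- works: equality exists for xid
SELECT '100'::xid < '4000000000'::xid;    -- fails: no ordering operator for xid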

Greetings,

Andres Freund




Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Mark Dilger



> On Apr 2, 2020, at 2:13 PM, Andres Freund  wrote:
> 
> Hi,
> 
> On 2020-04-02 11:47:32 -0700, Mark Dilger wrote:
>> I agree with transitioning to 64-bit xids with 32 bit xid/epoch pairs
>> as an internal implementation and storage detail only, but we still
>> have user facing views that don't treat it that way.
> 
> Note that epochs are not really a thing internally anymore. The xid
> counter is a FullTransactionId.
> 
> 
>> pg_stat_get_activity still returns backend_xid and backend_xmin as
>> 32-bit, not 64-bit.  Should this function change to be consistent?  I'm
>> curious what the user experience will be during the transitional period
>> where some user facing xids are 64 bit and others (perhaps the same xids
>> but viewed elsewhere) will be 32 bit.  That might make it difficult for
>> users to match them up.
> 
> I think we probably should switch them over at some point, but I would
> strongly advise against coupling that with Thomas' patch. That patch
> doesn't make the current situation around 32bit / 64bit any worse, as
> far as I can tell.

I agree with that.

> Given that txid_current() "always" has been a plain 64 bit integer, and
> the various txid_* functions always have returned 64 bit integers, I
> really don't think arguing for some 32bit/32bit situation now makes
> sense.

Yeah, I'm not arguing for that, though I can see how my email might have been 
ambiguous on that point.

Since Thomas's patch is really just focused on transitioning from txid -> xid8, 
I think this conversation is a bit beyond scope for this patch, except that 
"xid8" sounds an awful lot like the new type that all user facing xid output 
will transition to.  Maybe I'm wrong about that.   Are we going to change the 
definition of the "xid" type to 8 bytes?  That sounds dangerous, from a 
compatibility standpoint.

On balance, I'd rather have xid8in and xid8out work just as Thomas has it.  I'm 
not asking for any change there.  But I'm curious if the whole community is on 
the same page regarding where this is all heading.

I'm contemplating the statement that "the goal should be to reduce awareness of 
the 32bitness of normal xids from as many places as possible", which I support, 
and what that means for the eventual signatures of functions like 
pg_stat_get_activity, including:

(..., backend_xid XID, backend_xmin XID, ...) 
pg_stat_get_activity(pid INTEGER)

(..., transactionid XID, ...) pg_lock_status()

(transaction XID, ...) pg_prepared_xact()

timestamptz pg_xact_commit_timestamp(XID)

(xid XID, ...) pg_last_committed_xact()

(..., xmin XID, catalog_xmin XID, ...) pg_get_replication_slots()

... more that I'm too lazy to copy-and-paste just now ...

I would expect that, eventually, these would be upgraded to xid8.  If that 
happened seamlessly in one release, then there would be no problem with some 
functions returning 4-byte xids and some returning 8-byte xids, but otherwise 
there would be a transition period where some have been reworked to return xid8 
but others not, and users during that transition period might be happier with 
Alvaro's suggestion of treating epoch/xid as two fields in xid8in and xid8out.  
I'm also doubtful that these functions would be "upgraded".  It seems far more 
likely that alternate versions, perhaps named something with /xid8/ in them, 
would exist side-by-side with the originals.

So I'm really just wondering where others on this list think all this is 
heading, and if there are any arguments brewing about that which could be 
avoided by making assumptions clear right up front.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-02 11:47:32 -0700, Mark Dilger wrote:
> I agree with transitioning to 64-bit xids with 32 bit xid/epoch pairs
> as an internal implementation and storage detail only, but we still
> have user facing views that don't treat it that way.

Note that epochs are not really a thing internally anymore. The xid
counter is a FullTransactionId.


> pg_stat_get_activity still returns backend_xid and backend_xmin as
> 32-bit, not 64-bit.  Should this function change to be consistent?  I'm
> curious what the user experience will be during the transitional period
> where some user facing xids are 64 bit and others (perhaps the same xids
> but viewed elsewhere) will be 32 bit.  That might make it difficult for
> users to match them up.

I think we probably should switch them over at some point, but I would
strongly advise against coupling that with Thomas' patch. That patch
doesn't make the current situation around 32bit / 64bit any worse, as
far as I can tell.

Given that txid_current() "always" has been a plain 64 bit integer, and
the various txid_* functions always have returned 64 bit integers, I
really don't think arguing for some 32bit/32bit situation now makes
sense.

Greetings,

Andres Freund




Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread James Coleman
On Thu, Apr 2, 2020 at 2:47 PM Mark Dilger  wrote:
>
>
>
> > On Apr 2, 2020, at 11:01 AM, Andres Freund  wrote:
> >
> >>
> >> Hmm, for some reason I had it in my head that we would make these use an
> >> "epoch/val" output format rather than raw uint64 values.
> >
> > Why would we do that? IMO the goal should be to reduce awareness of the
> > 32bitness of normal xids from as many places as possible, and treat them
> > as an internal space optimization.
>
> I agree with transitioning to 64-bit xids with 32 bit xid/epoch pairs as an 
> internal implementation and storage detail only, but we still have user 
> facing views that don't treat it that way.   pg_stat_get_activity still 
> returns backend_xid and backend_xmin as 32-bit, not 64-bit.  Should this 
> function change to be consistent?  I'm curious what the user experience will 
> be during the transitional period where some user facing xids are 64 bit and 
> others (perhaps the same xids but viewed elsewhere) will be 32 bit.  That 
> might make it difficult for users to match them up.


Agreed. The "benefit" (at least in the short term) of using the
epoch/value style is that it makes (visual, at least) comparison with
other (32-bit) xid values easier.

I'm not sure if that's worth it, or if it's worth making a change
depend on changing all of those views too.

James




Re: control max length of parameter values logged

2020-04-02 Thread Alexey Bashtanov




Pushed with a bit of editorialization.

Great, and thanks for the fixes!

Best, Alex




Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-02 Thread David Rowley
On Fri, 3 Apr 2020 at 04:46, Tom Lane  wrote:
>
> I wrote:
> > I'd be inclined to undo what you did in favor of initializing the
> > test tables to contain significantly different numbers of rows,
> > because that would (a) achieve plan stability more directly,
> > and (b) demonstrate that the planner is actually ordering the
> > tables by cost correctly.  Maybe somewhere else we have a test
> > that is verifying (b), but these test cases abysmally fail to
> > check that point.
>
> Concretely, I suggest the attached, which replaces the autovac disables
> with adjusting partition boundaries so that the partitions contain
> different numbers of rows.

I've looked over this and I agree that it's a better solution to the problem.

I'm happy for you to go ahead on this.

David




Re: backup manifests

2020-04-02 Thread David Steele

On 4/2/20 3:47 PM, Andres Freund wrote:

On 2020-04-02 15:42:48 -0400, Robert Haas wrote:

I suspect I'm not doing quite what you had in mind here... thoughts?


I have some ideas, but I think it's complicated enough that I'd not put
it in the "pre commit path" for now.


+1. These would be great tests to have and a win for pg_basebackup 
overall but I don't think they should be a prerequisite for this commit.


Regards,
--
-David
da...@pgmasters.net




Re: Planning counters in pg_stat_statements (using pgss_store)

2020-04-02 Thread legrand legrand
Fujii Masao-4 wrote
> On 2020/04/01 18:19, Fujii Masao wrote:
> 
> Finally I pushed the patch!
> Many thanks for all involved in this patch!
> 
> As a remaining TODO item, I'm thinking that the document would need to
> be improved. For example, previously the query was not stored in pgss
> when it failed. But, in v13, if pgss_planning is enabled, such a query is
> stored because the planning succeeds. Without the explanation about
> that behavior in the document, I'm afraid that users will get confused.
> Thought?
> 
> Regards,
> 
> -- 
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION

Thank you all for this work, and especially to Julien for his major
contribution!

Regarding the TODO point: Yes I agree that it can be improved.
My proposal:

"Note that planning and execution statistics are updated only at their 
respective end phase, and only for successfull operations.
For exemple executions counters of a long running SELECT query, 
will be updated at the execution end, without showing any progress 
report in the interval.
Other exemple, if the statement is successfully planned but fails in 
the execution phase, only its planning statistics are stored.
This may give uncorrelated plans vs calls informations."
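(For illustration, a rough sketch of how that behavior can be observed, assuming the planning counters are enabled through the GUC added by this patch, spelled pg_stat_statements.track_planning in the committed version; the query text stored in the view is normalized, so the filter below just looks for the mismatch itself.)

SET pg_stat_statements.track_planning = on;

SELECT 1/0;   -- plans successfully, then fails at execution time

SELECT query, plans, calls
  FROM pg_stat_statements
 WHERE plans > calls;   -- planned more often than executed to completion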

Regards
PAscal




--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Peter Geoghegan
On Thu, Apr 2, 2020 at 11:28 AM Peter Geoghegan  wrote:
> In conclusion, I share Andres' concerns here. There are glaring
> problems with how we manipulate the data structure that controls the
> effective horizon for pruning. Maybe they can be fixed while leaving
> the code that manages the OldSnapshotControl circular buffer in
> something resembling its current form, but I doubt it. In my opinion,
> there is no approach to fixing "snapshot too old" that won't have some
> serious downside.

I'll add something that might be constructive: It would probably be a
good idea to introduce a function like syncrep.c's
SyncRepQueueIsOrderedByLSN() function, which is designed to be called
by assertions only. That would both clearly document and actually
verify the circular buffer/OldSnapshotControl data structure's
invariants.


--
Peter Geoghegan




Re: A bug when use get_bit() function for a long bytea string

2020-04-02 Thread Tom Lane
"Daniel Verite"  writes:
> These 2 tests need to allocate big chunks of contiguous memory, so they
> might fail for lack of memory on tiny machines, and even when not failing,
> they're pretty slow to run. Are they worth the trouble?

Yeah, I'd noticed those on previous readings of the patch.  They'd almost
certainly fail on some of our older/smaller buildfarm members, so they're
not getting committed, even if they didn't require multiple seconds apiece
to run (even on a machine with plenty of memory).  It's useful to have
them for initial testing though.

It'd be great if there was a way to test get_bit/set_bit on large
indexes without materializing a couple of multi-hundred-MB objects.
Can't think of one offhand though.

regards, tom lane




Re: backup manifests

2020-04-02 Thread Andres Freund
On 2020-04-02 15:42:48 -0400, Robert Haas wrote:
> I suspect I'm not doing quite what you had in mind here... thoughts?

I have some ideas, but I think it's complicated enough that I'd not put
it in the "pre commit path" for now.




Re: backup manifests

2020-04-02 Thread Robert Haas
On Thu, Apr 2, 2020 at 2:55 PM Robert Haas  wrote:
> On Thu, Apr 2, 2020 at 2:23 PM Andres Freund  wrote:
> > > That might make the window fairly wide on normal systems, but I'm not
> > > sure about Raspberry Pi BF members or things running
> > > CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.
> >
> > You can set checkpoint_timeout to be a day. If that's not enough, well,
> > then I think we have other problems.
>
> I'm not sure that's the only issue here, but I'll try it.

I ran into a few problems here. In trying to set this up manually, I
always began with the following steps:


# (1) create cluster
initdb

# (2) add to configuration file
log_checkpoints=on
checkpoint_timeout=1d
checkpoint_completion_target=0.99

# (3) fire it up
postgres
createdb


If at this point I do "pg_basebackup -D pgslave -R -c spread", it
completes within a few seconds anyway, because there's basically
nothing dirty, and no matter how slowly you write out no data, it's
still pretty quick. If I run "pgbench -i" first, and then
"pg_basebackup -D pgslave -R -c spread", it hangs, apparently
essentially forever, because now the checkpoint has something to do,
and it does it super-slowly, and "psql -c checkpoint" makes it finish
immediately. However, this experiment isn't testing quite the right
thing, because what I actually need is a slow backup off of a
cascading standby, so that I have time to promote the parent standby
before the backup completes. I tried continuing like this:


# (4) set up standby
pg_basebackup -D pgslave -R
postgres -D pgslave -c port=5433

# (5) set up cascading standby
pg_basebackup -D pgslave2 -d port=5433 -R
postgres -c port=5434 -D pgslave2

# (6) dirty some pages on the master
pgbench -i

# (7) start a backup of the cascading standby
pg_basebackup -D pgslave3 -d port=5434 -R -c spread


However, the pg_basebackup in the last step completes after only a few
seconds. If it were hanging, then I could continue with "pg_ctl
promote -D pgslave" and that might give me what I need, but that's not
what happens.

I suspect I'm not doing quite what you had in mind here... thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-04-02 Thread Robert Haas
On Thu, Apr 2, 2020 at 3:26 PM David Steele  wrote:
> So, with the addition of the 0004 patch down-thread this looks
> committable to me.

Glad to hear it. Thank you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add A Glossary

2020-04-02 Thread Corey Huinker
On Thu, Apr 2, 2020 at 8:44 AM Jürgen Purtz  wrote:

> +1 and many thanks to Alvaro's edits.
>
>
I did some of the grunt work Alvaro alluded to in v6, and the results are
attached and they build, which means there are no invalid links.

Notes:
* no definition wordings were changed
* added a linkend to all remaining glossterms that do not immediately
follow a glossentry
* renamed id glossary-temporary-tables to glossary-temporary-table
* temporarily re-added an id for glossary-row as we have many references to
that. unsure if we should use the term Tuple in all those places or say Row
while linking to glossary-tuple, or something else
* temporarily re-added an id for glossary-segment, glossary-wal-segment,
glossary-analytic-function, as those were also referenced and will need
similar decisions made
* added a stub entry for glossary-unique-index; unsure if it should have a
definition of its own, or whether we should split it into unique and index.
* I noticed several cases where a glossterm is used twice in a definition,
but didn't de-term them
* I'm curious about how we should tag a term when using it in its own
definition. same as anywhere else?
From 4603ce04306e77f5508bb207b42e5dec1425e7c5 Mon Sep 17 00:00:00 2001
From: coreyhuinker 
Date: Thu, 2 Apr 2020 15:32:43 -0400
Subject: [PATCH] glossary v7

---
 doc/src/sgml/filelist.sgml |1 +
 doc/src/sgml/glossary.sgml | 1589 
 doc/src/sgml/postgres.sgml |1 +
 3 files changed, 1591 insertions(+)
 create mode 100644 doc/src/sgml/glossary.sgml

diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 1043d0f7ab..cf21ef857e 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -170,6 +170,7 @@
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 00..edfcf9d725
--- /dev/null
+++ b/doc/src/sgml/glossary.sgml
@@ -0,0 +1,1589 @@
+
+ Glossary
+ 
+  This is a list of terms and their meaning in the context of
+  PostgreSQL and relational database
+  systems in general.
+ 
+
+ 
+  
+   Aggregate Function
+   
+
+ A function that
+ combines (aggregates) multiple input values,
+ for example by counting, averaging or adding,
+ yielding a single output value.
+
+
+ For more information, see
+ .
+
+
+   
+  
+
+  
+   Analyze (operation)
+   
+
+ The process of collecting statistics from data in
+ tables
+ and other relations
+ to help the query planner
+ to make decisions about how to execute
+ queries.
+
+   
+  
+
+  
+   Analytic Function
+   
+  
+
+  
+   Atomic
+   
+
+ In reference to a datum:
+ the fact that its value cannot be broken down into smaller
+ components.
+
+   
+   
+
+ In reference to a
+ database transaction:
+ see atomicity.
+
+   
+  
+
+  
+   Atomicity
+   
+
+ The property of a transaction
+ that either all its operations complete as a single unit or none do.
+ This is one of the ACID properties.
+
+   
+  
+
+  
+   Attribute
+   
+
+ An element with a certain name and data type found within a
+ tuple or
+ table.
+
+   
+  
+
+  
+   Autovacuum
+   
+
+ Background processes that routinely perform
+ Vacuum and Analyze
+ operations.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Backend (process)
+   
+
+ Processes of an Instance which act on behalf of
+ client Connections and handle their requests.
+
+
+ (Don't confuse this term with the similar terms
+ Background Worker or
+ Background Writer).
+
+   
+  
+
+  
+   Background Worker (process)
+   
+
+ Individual processes within an Instance, which
+ run system- or user-supplied code.  A typical use case is a process
+ which handles parts of an SQL query to take
+ advantage of parallel execution on servers with multiple
+ CPUs.
+   
+   
+For more information, see
+.
+   
+   
+  
+
+  
+   Background Writer (process)
+   
+
+ A process that continuously writes dirty pages from
+ Shared Memory to the file system.
+ It wakes up periodically, but
+ works only for a short period in order to distribute its expensive
+ I/O activity over time, instead of generating fewer but
+ larger I/O peaks which could block other processes.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Cast
+   
+
+ A conversion of a Datum from its current data
+ type to another data type.
+
+   
+  
+
+  
+   Catalog
+   
+
+ The SQL standard uses this term to
+ indicate what is called a Database in
+ PostgreSQL's terminology.
+
+
+ This should not be confused with the
+ System Catalog.
+
+
+ For more information, see
+ .
+
+   
+  
+
+  
+   Check Constraint
+   
+
+ A type of Constraint defined on a

Re: backup manifests

2020-04-02 Thread David Steele

On 4/2/20 1:04 PM, Robert Haas wrote:
>

There
are still some things that not everybody is happy about. In
particular, Stephen and David are unhappy about using CRC-32C as the
default algorithm, but Andres and Noah both think it's a reasonable
choice, even if not as robust as everybody will want. As I agree, I'm
going to stick with that choice.


Yeah, I seem to be on the losing side of this argument, at least for 
now, so I don't think it should block the commit of this patch. It's an 
easy enough tweak if we change our minds.



For my part, I think this is a general issue that is not really this
patch's problem to solve. We have had multiple discussions over the
years about reducing the number of binaries that we ship. We could
have a general binary called "pg" or similar and use subcommands: pg
createdb, pg basebackup, pg validatebackup, etc. I think such an
approach is worth considering, though it would certainly be an
adjustment for everyone. Or we might do something else. But I don't
want to deal with that in this patch.


I'm fine with the current name, especially now that WAL is validated.


A couple of other minor suggestions have been made: (1) rejigger
things to avoid message duplication related to launching external
binaries, 


That'd be nice to have, but I think we can live without it for now.


(2) maybe use appendShellString


Seems like this would be good to have but I'm not going to make a fuss 
about it.



and (3) change some details
of error-reporting related to manifest parsing. I don't believe anyone
views these as blockers


I'd view this as later refinement once we see how the tool is being used 
and/or get gripes from the field.


So, with the addition of the 0004 patch down-thread this looks 
committable to me.


Regards,
--
-David
da...@pgmasters.net




Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-02 14:33:18 -0300, Alvaro Herrera wrote:
>> Hmm, for some reason I had it in my head that we would make these use an
>> "epoch/val" output format rather than raw uint64 values.

> Why would we do that? IMO the goal should be to reduce awareness of the
> 32bitness of normal xids from as many places as possible, and treat them
> as an internal space optimization.

If they're just int64s then you don't need special functions to do
things like finding the min or max in a column of them.
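(A minimal illustration of that point: the txid_* family already returns plain bigints, so ordinary aggregates and comparisons resolve with nothing extra, whereas e.g. max() over a 32-bit xid column has no aggregate to resolve to.)

SELECT min(t), max(t)
  FROM (VALUES (txid_current()), (txid_current())) AS s(t);
-- by contrast, max(backend_xid) over pg_stat_activity fails: no max(xid)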

regards, tom lane




Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-02 Thread Tom Lane
Andres Freund  writes:
> On 2020-04-01 19:57:32 -0400, Tom Lane wrote:
>> Agreed, but just knowing what the oldest xmin is doesn't help you
>> find *where* it is.  Maybe what we need is a view showing all of
>> these potential sources of an old xmin.

> +1.  This would be extermely useful. It's a very common occurance to
> have to ask for a number of nontrivial queries when debugging xmin
> related bloat issues.

> Having a view that lists something like:

> - shared xmin horizon
> - pid of backend with oldest xmin across all backends

I was envisioning a view that would show you *all* the active processes
and their related xmins, then more entries for all active replication
slots, prepared xacts, etc etc.  Picking out the ones causing trouble is
then the user's concern.  If the XID column is actually fullXid then
sorting, aggregating, etc. is easy.

The point about database-local vs not is troublesome.  Maybe two
such views would be needed?

regards, tom lane




Re: control max length of parameter values logged

2020-04-02 Thread Tom Lane
Alexey Bashtanov  writes:
> Please see the new version attached.

Pushed with a bit of editorialization.

regards, tom lane




Re: backup manifests

2020-04-02 Thread Andres Freund
On 2020-04-02 14:55:19 -0400, Robert Haas wrote:
> > Yes, I am asking for something to be changed: I'd like the code that
> > read()s the file when computing the checksum to add up how many bytes
> > were read, and compare that to the size in the manifest. And if there's
> > a difference report an error about that, instead of a checksum failure.
> >
> > I've repeatedly seen filesystem issues lead to earlier EOFs when
> > read()ing than what stat() returns. It'll be pretty annoying to have to
> > debug a general "checksum failure", rather than just knowing that
> > reading stopped after 100MB of 1GB.
> 
> Is 0004 attached like what you have in mind?

Yes. Thanks!

- Andres




Re: control max length of parameter values logged

2020-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> More or less.  If you don't add these chars, mbcliplen doesn't think
> there's character there, so it ends up not adding the ellipsis.  (I
> don't remember why it has to be two chars rather than just one.)

I think the idea is to be sure that there's a full multibyte character
after the truncation point; if the truncation point is within a multibyte
character, then you might have only a partial multibyte character after
that, which could cause problems.  Doing it this way, mbcliplen will
never look at the last possibly-truncated character.

regards, tom lane




Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Mark Dilger



> On Apr 2, 2020, at 11:01 AM, Andres Freund  wrote:
> 
>> 
>> Hmm, for some reason I had it in my head that we would make these use an
>> "epoch/val" output format rather than raw uint64 values.
> 
> Why would we do that? IMO the goal should be to reduce awareness of the
> 32bitness of normal xids from as many places as possible, and treat them
> as an internal space optimization.

I agree with transitioning to 64-bit xids with 32 bit xid/epoch pairs as an 
internal implementation and storage detail only, but we still have user facing 
views that don't treat it that way.   pg_stat_get_activity still returns 
backend_xid and backend_xmin as 32-bit, not 64-bit.  Should this function 
change to be consistent?  I'm curious what the user experience will be during 
the transitional period where some user facing xids are 64 bit and others 
(perhaps the same xids but viewed elsewhere) will be 32 bit.  That might make 
it difficult for users to match them up.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Peter Geoghegan
On Tue, Mar 31, 2020 at 11:40 PM Andres Freund  wrote:
> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().
>
> While there's not really a clear-cut comment explaining whether
> head_timestamp() is intended to be the oldest or the newest timestamp,
> it seems to me that the rest of the code treats it as the "oldest"
> timestamp.

At first, I was almost certain that it's supposed to be the oldest
based only on the OldSnapshotControlData struct fields themselves. It
seemed pretty unambiguous:

int head_offset;/* subscript of oldest tracked time */
TimestampTz head_timestamp; /* time corresponding to head xid */

(Another thing that supports this interpretation is the fact that
there is a separate current_timestamp latest timestamp field in
OldSnapshotControlData.)

But then I took another look at the "We need a new bucket, but it
might not be the very next one" branch. It does indeed seem to
directly contradict the OldSnapshotControlData comments/documentation.
Not just the code itself, either. Even comments from this "new
bucket" branch disagree with the OldSnapshotControlData comments:

if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)
{
/* Map full and new value replaces old head. */
int old_head = oldSnapshotControl->head_offset;

if (old_head == (OLD_SNAPSHOT_TIME_MAP_ENTRIES - 1))
oldSnapshotControl->head_offset = 0;
else
oldSnapshotControl->head_offset = old_head + 1;
oldSnapshotControl->xid_by_minute[old_head] = xmin;
}

Here, the comment says the map (circular buffer) is full, and that we
must replace the current head with a *new* value/timestamp (the one we
just got in GetSnapshotData()). It looks as if the design of the data
structure changed during the development of the original patch, but
this entire branch was totally overlooked.

In conclusion, I share Andres' concerns here. There are glaring
problems with how we manipulate the data structure that controls the
effective horizon for pruning. Maybe they can be fixed while leaving
the code that manages the OldSnapshotControl circular buffer in
something resembling its current form, but I doubt it. In my opinion,
there is no approach to fixing "snapshot too old" that won't have some
serious downside.

-- 
Peter Geoghegan




Re: backup manifests

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-02 14:16:27 -0400, Robert Haas wrote:
> On Thu, Apr 2, 2020 at 1:23 PM Andres Freund  wrote:
> > I suspect it's possible to control the timing by preventing the
> > checkpoint at the end of recovery from completing within a relevant
> > timeframe. I think configuring a large checkpoint_timeout and using a
> > non-fast base backup ought to do the trick. The state can be advanced by
> > separately triggering an immediate checkpoint? Or by changing the
> > checkpoint_timeout?
> 
> That might make the window fairly wide on normal systems, but I'm not
> sure about Raspberry Pi BF members or things running
> CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.

You can set checkpoint_timeout to be a day. If that's not enough, well,
then I think we have other problems.


> > FWIW, the only check I'd really like to see in this release is the
> > crosscheck with the file's length and the actually read data (to be able
> > to diagnose FS issues).
> 
> Not sure I understand this comment. Isn't that a subset of what the
> patch already does? Are you asking for something to be changed?

Yes, I am asking for something to be changed: I'd like the code that
read()s the file when computing the checksum to add up how many bytes
were read, and compare that to the size in the manifest. And if there's
a difference report an error about that, instead of a checksum failure.

I've repeatedly seen filesystem issues lead to earlier EOFs when
read()ing than what stat() returns. It'll be pretty annoying to have to
debug a general "checksum failure", rather than just knowing that
reading stopped after 100MB of 1GB.

Greetings,

Andres Freund




Re: backup manifests

2020-04-02 Thread Robert Haas
On Thu, Apr 2, 2020 at 1:23 PM Andres Freund  wrote:
> I suspect it's possible to control the timing by preventing the
> checkpoint at the end of recovery from completing within a relevant
> timeframe. I think configuring a large checkpoint_timeout and using a
> non-fast base backup ought to do the trick. The state can be advanced by
> separately triggering an immediate checkpoint? Or by changing the
> checkpoint_timeout?

That might make the window fairly wide on normal systems, but I'm not
sure about Raspberry Pi BF members or things running
CLOBBER_CACHE_ALWAYS/RECURSIVELY. I guess I could try it.

> I think it might be worth looking, in a later release, at something like
> blake3 for a fast cryptographic checksum. By allowing for instruction
> parallelism (by independently checksuming different blocks in data, and
> only advancing the "shared" checksum separately) it achieves
> considerably higher throughput rates.
>
> I suspect we should also look at a better non-crypto hash. xxhash or
> whatever. Not just for these checksums, but also for in-memory.

I have no problem with that. I don't feel that I am well-placed to
recommend for or against specific algorithms. Speed is easy to
measure, but there's also code stability, the license under which
something is released, the quality of the hashes it produces, and the
extent to which it is cryptographically secure. I'm not an expert in
any of that stuff, but if we get consensus on something it should be
easy enough to plug it into this framework. Even changing the default
would be no big deal.

> FWIW, the only check I'd really like to see in this release is the
> crosscheck with the file's length and the actually read data (to be able
> to diagnose FS issues).

Not sure I understand this comment. Isn't that a subset of what the
patch already does? Are you asking for something to be changed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-04-02 Thread Andres Freund
Hi,

On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
> We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."

Could you show the *exact* log output please? Because this could
temporarily occur without signalling anything bad, if e.g. the
replication connection goes down.


> Right before the issue started we did some upgrades and altered some
> postgres configs and ZFS settings.
> We have been slowly rolling back changes but so far the the issue continues.
> 
> Some interesting data points while debugging:
> We had lowered the ZFS recordsize from 128K to 32K and for that week the
> issue started happening every other day.
> Using xxd and diff we compared "good" and "bad" wal files and the
> differences were not random bad bytes.
> 
> The bad file either had a block of zeros that were not in the good file at
> that position or other data.  Occasionally the bad data has contained
> legible strings not in the good file at that position. At least one of
> those exact strings has existed elsewhere in the files.
> However I am not sure if that is the case for all of them.
> 
> This made me think that maybe there was an issue w/ wal file recycling and
> ZFS under heavy load, so we tried lowering
> min_wal_size in order to "discourage" wal file recycling but my
> understanding is a low value discourages recycling but it will still
> happen (unless setting wal_recycle in psql 12).

This sounds a lot more like a broken filesystem than anything on the PG
level.


> When using replication slots, what circumstances would cause the master to
> not save the WAL file?

What do you mean by "save the WAL file"?

Greetings,

Andres Freund




Re: Allow continuations in "pg_hba.conf" files

2020-04-02 Thread David Zhang

On 2020-04-01 10:25 p.m., Fabien COELHO wrote:



Hello,


FWIW, I don't think so. Generally a trailing backslash is an escape
character for the following newline.  And '\ ' is an escaped space,
which usually means a space itself.

In this case the escape character doesn't work generally, and I think it 
is natural that a backslash in the middle of a line is a backslash 
character itself.


I concur: The backslash char is only a continuation as the very last 
character of the line, before cr/nl line ending markers.


+Agree. However, it would nice to update the sentence below if I 
understand it correctly.


"+   Comments, whitespace and continuations are handled in the same way 
as in" pg_hba.conf


For example, if a user provides a configuration like the one below (even 
though such comments are not recommended)


"host    all all 127.0.0.1/32    trust  #COMMENTS, it works"

i.e. the original pg_hba.conf allows a comment on each line, but 
with the new continuation introduced, the comment has to be put on the last 
line.




There are no assumptions about backslash escaping, quotes and such, 
which seems reasonable given the lexing structure of the files, i.e. 
records of space-separated words, and # line comments.



--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca





Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-02 14:33:18 -0300, Alvaro Herrera wrote:
> On 2020-Apr-02, Thomas Munro wrote:
> 
> > On Sat, Mar 21, 2020 at 11:14 AM Thomas Munro  
> > wrote:
> > > * updated OIDs to avoid collisions
> > > * added btequalimage to btree/xid8_ops
> > 
> > Here's the version I'm planning to commit tomorrow, if no one objects.  
> > Changes:
> > 
> > * txid.c renamed to xid8funcs.c
> > * remaining traces of "txid" replaced various internal identifiers
> > * s/backwards compatible/backward compatible/ in funcs.sgml (en_GB -> en_US)
> 
> Hmm, for some reason I had it in my head that we would make these use an
> "epoch/val" output format rather than raw uint64 values.

Why would we do that? IMO the goal should be to reduce awareness of the
32bitness of normal xids from as many places as possible, and treat them
as an internal space optimization.

Greetings,

Andres Freund




Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-01 19:57:32 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2020-Apr-01, Tom Lane wrote:
> >> The fact that I had to use max(age(...)) in that sample query
> >> hints at one reason: it's really hard to do arithmetic correctly
> >> on raw XIDs.  Dealing with wraparound is a problem, and knowing
> >> what's past or future is even harder.  What use-case do you
> >> foresee exactly?
>
> > Maybe it would make sense to start exposing fullXids in these views and
> > functions, for this reason.  There's no good reason to continue to
> > expose bare Xids to userspace, we should use them only for storage.
>
> +1, that would help a lot.

I agree.


> > But I think James' point is precisely that it's not easy to know where
> > to look for things that keep Xmin from advancing.  Currently it's
> > backends, replication slots, prepared transactions, and replicas with
> > hot_standby_feedback.  If you forget to monitor just one of these, your
> > vacuums might be useless and you won't notice until disaster strikes.
>
> Agreed, but just knowing what the oldest xmin is doesn't help you
> find *where* it is.  Maybe what we need is a view showing all of
> these potential sources of an old xmin.

+1.  This would be extremely useful. It's a very common occurrence to
have to ask for a number of nontrivial queries when debugging xmin-related
bloat issues.

There's the slight complexity that one of the various xmin horizons is
database specific...

Which different xmin horizons, and which sources do we have? I can think
of:

- global xmin horizon from backends (for shared tables)
- per-database xmin horizon from backends (for local tables)
- catalog xmin horizon (from logical replication slots)
- data xmin horizon (from physical replication slots)
- streaming replication xmin horizon


Having a view that lists something like:

- shared xmin horizon
- pid of backend with oldest xmin across all backends

- database xmin horizon of current database
- pid of backend with oldest xmin in current database

- catalog xmin of oldest slot by catalog xmin
- name of oldest slot by catalog xmin

- data xmin of oldest slot by data xmin
- name of oldest slot by data xmin

- xid of oldest prepared transaction
- gid of oldest prepared transaction
- database of oldest transaction?

- xmin of oldest walsender with hot_standby_feedback active
- pid of oldest ...

would be awesome. I think it'd make sense to also add the database with
the oldest datfrozenxid, the current database's relation with the oldest
relfrozenxid.
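(For illustration, a very rough sketch of such a report pulled together from the existing catalogs listed above; the source labels and UNION layout are invented for this example, and a real view would also need the shared vs. per-database split mentioned earlier.)

SELECT 'backend' AS source, pid::text AS id, backend_xmin AS xmin
  FROM pg_stat_activity WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT 'replication slot', slot_name::text, coalesce(catalog_xmin, xmin)
  FROM pg_replication_slots
 WHERE catalog_xmin IS NOT NULL OR xmin IS NOT NULL
UNION ALL
SELECT 'prepared xact', gid, transaction
  FROM pg_prepared_xacts
UNION ALL
SELECT 'walsender', pid::text, backend_xmin
  FROM pg_stat_replication WHERE backend_xmin IS NOT NULL;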

Greetings,

Andres Freund




Re: bad wal on replica / incorrect resource manager data checksum in record / zfs

2020-04-02 Thread Alex Malek
On Wed, Feb 19, 2020 at 4:35 PM Alex Malek  wrote:

>
> Hello Postgres Hackers -
>
> We are having a reoccurring issue on 2 of our replicas where replication
> stops due to this message:
> "incorrect resource manager data checksum in record at ..."
> This has been occurring on average once every 1 to 2 weeks during large
> data imports (100s of GBs being written)
> on one of two replicas.
> Fixing the issue has been relatively straight forward: shutdown replica,
> remove the bad wal file, restart replica and
> the good wal file is retrieved from the master.
> We are doing streaming replication using replication slots.
> However twice now, the master had already removed the WAL file so the file
> had to retrieved from the wal archive.
>
> The WAL log directories on the master and the replicas are on ZFS file
> systems.
> All servers are running RHEL 7.7 (Maipo)
> PostgreSQL 10.11
> ZFS v0.7.13-1
>
> The issue seems similar to
> https://www.postgresql.org/message-id/CANQ55Tsoa6%3Dvk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw%40mail.gmail.com
>  and to https://github.com/timescale/timescaledb/issues/1443
>
> One quirk in our ZFS setup is ZFS is not handling our RAID array, so ZFS
> sees our array as a single device.
> 
> 
>


An update in case someone else encounters the same issue.

About 5 weeks ago, on the master database server, we turned off ZFS
compression for the volume where the WAL log resides.
The error has not occurred on any replica since.

Best,
Alex


Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Alvaro Herrera
On 2020-Apr-02, Thomas Munro wrote:

> On Sat, Mar 21, 2020 at 11:14 AM Thomas Munro  wrote:
> > * updated OIDs to avoid collisions
> > * added btequalimage to btree/xid8_ops
> 
> Here's the version I'm planning to commit tomorrow, if no one objects.  
> Changes:
> 
> * txid.c renamed to xid8funcs.c
> * remaining traces of "txid" replaced various internal identifiers
> * s/backwards compatible/backward compatible/ in funcs.sgml (en_GB -> en_US)

Hmm, for some reason I had it in my head that we would make these use an
"epoch/val" output format rather than raw uint64 values.  Are we really
going to do it this way?  Myself I can convert values easily enough, but
I'm not sure this is user-friendly.  (If somebody were to tell me that
LSNs are going to be straight uint64 values, I would not be happy.)

Or maybe it's the other way around -- this is fine for everyone except
me -- and we should never expose the epoch as a separate quantity.  

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests

2020-04-02 Thread Andres Freund
Hi,

On 2020-04-02 13:04:45 -0400, Robert Haas wrote:
> And here's another new patch set. After some experimentation, I was
> able to manually test the timeline-switch-during-a-base-backup case
> and found that it had bugs in both pg_validatebackup and the code I
> added to the backend's basebackup.c. So I fixed those.

Cool.


> It would be
> nice to have automated tests, but you need a large database (so that
> backing it up takes non-trivial time) and a load on the primary (so
> that WAL is being replayed during the backup) and there's a race
> condition (because the backup has to not finish before the cascading
> standby learns that the upstream has been promoted), so I don't at
> present see a practical way to automate that. I did verify, in manual
> testing, that a problem with WAL files on either timeline caused a
> validation failure. I also verified that the LSNs at which the standby
> began replay and reached consistency matched what was stored in the
> manifest.

I suspect it's possible to control the timing by preventing the
checkpoint at the end of recovery from completing within a relevant
timeframe. I think configuring a large checkpoint_timeout and using a
non-fast base backup ought to do the trick. The state can be advanced by
separately triggering an immediate checkpoint? Or by changing the
checkpoint_timeout?



> I also implemented Noah's suggestion that we should write the backup
> manifest under a temporary name and then rename it afterward.
> Stephen's original complaint that you could end up with a backup that
> validates successfully even though we died before we got the WAL is,
> at this point, moot, because pg_validatebackup is now capable of
> noticing that the WAL is missing. Nevertheless, this seems like a nice
> belt-and-suspenders check.

Yea, it's imo generally a good idea.


> I think this responds to pretty much all of the complaints that I know
> about and upon which we have a reasonable degree of consensus. There
> are still some things that not everybody is happy about. In
> particular, Stephen and David are unhappy about using CRC-32C as the
> default algorithm, but Andres and Noah both think it's a reasonable
> choice, even if not as robust as everybody will want. As I agree, I'm
> going to stick with that choice.

I think it might be worth looking, in a later release, at something like
blake3 for a fast cryptographic checksum. By allowing for instruction
parallelism (by independently checksumming different blocks in data, and
only advancing the "shared" checksum separately) it achieves
considerably higher throughput rates.

I suspect we should also look at a better non-crypto hash. xxhash or
whatever. Not just for these checksums, but also for in-memory.


> Also, there is still some debate about what the tool ought to be
> called. My previous suggestion to rename this from pg_validatebackup
> to pg_validatemanifest seems wrong now that WAL validation has been
> added; in fact, given that we now have two independent sanity checks
> on a backup, I'm going to argue that it would be reasonable to extend
> that by adding more kinds of backup validation, perhaps even including
> the permissions check that Andres suggested before.

FWIW, the only check I'd really like to see in this release is the
crosscheck with the file's length and the actually read data (to be able
to diagnose FS issues).


Greetings,

Andres Freund




Re: control max length of parameter values logged

2020-04-02 Thread Alvaro Herrera
On 2020-Apr-02, Alexey Bashtanov wrote:


> > > +            if (log_parameter_max_length_on_error > 0)
> > > +            {
> > > +                /*
> > > +                 * We can trim the saved string, knowing that we
> > > +                 * won't print all of it.  But we must copy at
> > > +                 * least two more full characters than
> > > +                 * BuildParamLogString wants to use; otherwise it
> > > +                 * might fail to include the trailing ellipsis.
> > > +                 */
> > > +                knownTextValues[paramno] =
> > > +                    pnstrdup(pstring,
> > > +                             log_parameter_max_length_on_error
> > > +                             + 2 * MAX_MULTIBYTE_CHAR_LEN);
> > The comment says we need at least 2 chars, but
> > log_parameter_max_length_on_error might be 1, so I think you can either
> > add a 64 byte fudge factor, like before, or do
> > Max(log_parameter_max_length_on_error, 2).
> That's the code I reused without deep analysis TBH.
> I think it's mostly there to allocate the space for the ellipsis in case it
> needs to be added, not to copy any actual characters; that's why we add it.

More or less.  If you don't add these chars, mbcliplen doesn't think
there's character there, so it ends up not adding the ellipsis.  (I
don't remember why it has to be two chars rather than just one.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A bug when use get_bit() function for a long bytea string

2020-04-02 Thread Daniel Verite
movead...@highgo.ca wrote:

> A little update for the patch, and patches for all stable branches available. 

Some comments about the set_bit/get_bit parts.
I'm reading long_bytea_string_bug_fix_ver6_pg10.patch, but that
probably applies to the other files meant for the existing releases
(I think you could get away with only one patch for backpatching
and one patch for v13, and committers will sort out how
to deal with release branches).

 byteaSetBit(PG_FUNCTION_ARGS)
 {
bytea  *res = PG_GETARG_BYTEA_P_COPY(0);
-   int32   n = PG_GETARG_INT32(1);
+   int64   n = PG_GETARG_INT64(1);
int32   newBit = PG_GETARG_INT32(2);

The 2nd argument is 32-bit, not 64. PG_GETARG_INT32(1) must be used.

+   errmsg("index "INT64_FORMAT" out of valid range, 0.."INT64_FORMAT,
+   n, (int64)len * 8 - 1)));

The cast to int64 avoids the overflow, but it may also produce a
result that does not reflect the actual range, which is limited to
2^31-1, again because the bit number is a signed 32-bit value.

I believe the formula for the upper limit in the 32-bit case is:
  (len <= PG_INT32_MAX / 8) ? (len*8 - 1) : PG_INT32_MAX;

These functions could benefit from a comment mentioning that
they cannot reach the full extent of a bytea, because of the size limit
on the bit number.
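(A tiny illustration of that 0 .. len*8-1 numbering, with no large allocations, runnable on existing releases.)

SELECT get_bit('\x01'::bytea, 0), get_bit('\x01'::bytea, 7);  -- valid indexes for a 1-byte value
SELECT get_bit('\x01'::bytea, 8);  -- ERROR: index 8 out of valid range, 0..7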

--- a/src/test/regress/expected/bit.out
+++ b/src/test/regress/expected/bit.out
@@ -656,6 +656,40 @@ SELECT set_bit(B'0101011000100100', 15, 1);
 
 SELECT set_bit(B'0101011000100100', 16, 1);-- fail
ERROR:  bit index 16 out of valid range (0..15)
+SELECT get_bit(
+   set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 0, 0)
+   ,0);
+ get_bit 
+-
+   0
+(1 row)
+
+SELECT get_bit(
+   set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 0, 1)
+   ,0);
+ get_bit 
+-
+   1
+(1 row)
+

These 2 tests need to allocate big chunks of contiguous memory, so they
might fail for lack of memory on tiny machines, and even when not failing,
they're pretty slow to run. Are they worth the trouble?

+select get_bit(
+set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea, 
+512::bigint * 1024 * 1024 * 8 - 1, 0)
+,512::bigint * 1024 * 1024 * 8 - 1);
+ get_bit 
+-
+   0
+(1 row)
+
+select get_bit(
+set_bit((repeat('Postgres', 512 * 1024 * 1024 / 8))::bytea,
+512::bigint * 1024 * 1024 * 8 - 1, 1)
+   ,512::bigint * 1024 * 1024 * 8 - 1);
+ get_bit 
+-
+   1
+(1 row)

These 2 tests are supposed to fail in existing releases because set_bit()
and get_bit() don't take a bigint as the 2nd argument.
Also, the same comment as above on how much they allocate.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Andres Freund
Hi, 

On April 2, 2020 9:36:32 AM PDT, Kevin Grittner  wrote:
>On Wed, Apr 1, 2020 at 7:17 PM Andres Freund  wrote:
>
>> FWIW, with autovacuum=off the query does not get killed until a manual
>> vacuum, nor if fewer rows are deleted and the table has previously been
>> vacuumed.
>>
>> The vacuum in the second session isn't required. There just needs to be
>> something consuming an xid, so that oldSnapshotControl->latest_xmin is
>> increased. A single SELECT txid_current(); or such in a separate session
>> is sufficient.
>>
>
>Agreed.  I don't see that part as a problem; if no xids are being consumed,
>it's hard to see how we could be heading into debilitating levels of bloat,
>so there is no need to perform the early pruning.  It would not be worth
>consuming any cycles to ensure that pruning happens sooner than it does in
>this case.  It's OK for it to happen any time past the moment that the
>snapshot hits the threshold, but it's also OK for it to wait until a vacuum
>of the table or until some activity consumes an xid.

The point about txid being sufficient was just about simplifying the reproducer 
for wrong query results.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 7:17 PM Andres Freund  wrote:

> FWIW, with autovacuum=off the query does not get killed until a manual
> vacuum, nor if fewer rows are deleted and the table has previously been
> vacuumed.
>
> The vacuum in the second session isn't required. There just needs to be
> something consuming an xid, so that oldSnapshotControl->latest_xmin is
> increased. A single SELECT txid_current(); or such in a separate session
> is sufficient.
>

Agreed.  I don't see that part as a problem; if no xids are being consumed,
it's hard to see how we could be heading into debilitating levels of bloat,
so there is no need to perform the early pruning.  It would not be worth
consuming any cycles to ensure that pruning happens sooner than it does in
this case.  It's OK for it to happen any time past the moment that the
snapshot hits the threshold, but it's also OK for it to wait until a vacuum
of the table or until some activity consumes an xid.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: Ltree syntax improvement

2020-04-02 Thread Tom Lane
Nikita Glukhov  writes:
> Rebased patch attached.

Thanks for rebasing!  The cfbot's not very happy though:

ltxtquery_io.c: In function ‘makepol’:
ltxtquery_io.c:188:13: error: ‘escaped’ may be used uninitialized in this 
function [-Werror=maybe-uninitialized]
  if (lenval - escaped <= 0)
     ^
ltxtquery_io.c:230:6: note: ‘escaped’ was declared here
  int   escaped;
  ^
cc1: all warnings being treated as errors

regards, tom lane




Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Mark Dilger



> On Apr 2, 2020, at 9:06 AM, Mark Dilger  wrote:
> 
> ("xid8_current" is not exercised by name anywhere in the regression suite, 
> that I can see.)

I spoke too soon.  That's exercised in the new xid.sql test file. It didn't 
show up in my 'git diff', because it's new.  Sorry about that.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Should we add xid_current() or a int8->xid cast?

2020-04-02 Thread Mark Dilger



> On Apr 1, 2020, at 8:21 PM, Thomas Munro  wrote:
> 
> On Sat, Mar 21, 2020 at 11:14 AM Thomas Munro  wrote:
>> * updated OIDs to avoid collisions
>> * added btequalimage to btree/xid8_ops
> 
> Here's the version I'm planning to commit tomorrow, if no one objects.  
> Changes:
> 
> * txid.c renamed to xid8funcs.c
> * remaining traces of "txid" replaced various internal identifiers
> * s/backwards compatible/backward compatible/ in funcs.sgml (en_GB -> en_US)
> 

Hi Thomas, Thanks for working on this.

I'm taking a quick look at your patches.  It's not a big deal, and certainly 
not a show stopper if you want to go ahead with the commit, but you've left 
some mentions of "txid_current" that might better be modified to use the new 
name "xid8_current".  At least one mention of "txid_current" is needed to check 
that the old name still works, but leaving this many in the regression test 
suite may lead other developers to follow that lead and use txid_current() in 
newly developed code.  ("xid8_current" is not exercised by name anywhere in the 
regression suite, that I can see.)

> contrib/test_decoding/expected/ddl.out:SELECT txid_current() != 0; -- so no 
> fixed xid apears in the outfile
> contrib/test_decoding/expected/decoding_in_xact.out:SELECT txid_current() = 0;
> contrib/test_decoding/expected/decoding_in_xact.out:SELECT txid_current() = 0;
> contrib/test_decoding/expected/decoding_in_xact.out:SELECT txid_current() = 0;
> contrib/test_decoding/expected/oldest_xmin.out:step s0_getxid: SELECT 
> txid_current() IS NULL;
> contrib/test_decoding/expected/ondisk_startup.out:step s2txid: SELECT 
> txid_current() IS NULL;
> contrib/test_decoding/expected/ondisk_startup.out:step s3txid: SELECT 
> txid_current() IS NULL;
> contrib/test_decoding/expected/ondisk_startup.out:step s2txid: SELECT 
> txid_current() IS NULL;
> contrib/test_decoding/expected/snapshot_transfer.out:step s0_log_assignment: 
> SELECT txid_current() IS NULL;
> contrib/test_decoding/expected/snapshot_transfer.out:step s0_log_assignment: 
> SELECT txid_current() IS NULL;
> contrib/test_decoding/specs/oldest_xmin.spec:step "s0_getxid" { SELECT 
> txid_current() IS NULL; }
> contrib/test_decoding/specs/ondisk_startup.spec:step "s2txid" { SELECT 
> txid_current() IS NULL; }
> contrib/test_decoding/specs/ondisk_startup.spec:step "s3txid" { SELECT 
> txid_current() IS NULL; }
> contrib/test_decoding/specs/snapshot_transfer.spec:step "s0_log_assignment" { 
> SELECT txid_current() IS NULL; }
> contrib/test_decoding/sql/ddl.sql:SELECT txid_current() != 0; -- so no fixed 
> xid apears in the outfile
> contrib/test_decoding/sql/decoding_in_xact.sql:SELECT txid_current() = 0;
> contrib/test_decoding/sql/decoding_in_xact.sql:SELECT txid_current() = 0;
> contrib/test_decoding/sql/decoding_in_xact.sql:SELECT txid_current() = 0;
> 
> src/test/modules/commit_ts/t/004_restart.pl:SELECT txid_current();
> src/test/modules/commit_ts/t/004_restart.pl:EXECUTE 'SELECT 
> txid_current()';
> src/test/modules/commit_ts/t/004_restart.pl:SELECT txid_current();
> src/test/recovery/t/003_recovery_targets.pl:"SELECT pg_current_wal_lsn(), 
> txid_current();");
> src/test/recovery/t/011_crash_recovery.pl:SELECT txid_current();
> src/test/recovery/t/011_crash_recovery.pl:cmp_ok($node->safe_psql('postgres', 
> 'SELECT txid_current()'),
> src/test/regress/expected/alter_table.out:where transactionid = 
> txid_current()::integer)
> src/test/regress/expected/alter_table.out:where transactionid = 
> txid_current()::integer)
> src/test/regress/expected/hs_standby_functions.out:select txid_current();
> src/test/regress/expected/hs_standby_functions.out:ERROR:  cannot execute 
> txid_current() during recovery
> src/test/regress/expected/hs_standby_functions.out:select 
> length(txid_current_snapshot()::text) >= 4;
> src/test/regress/expected/txid.out:select txid_current() >= 
> txid_snapshot_xmin(txid_current_snapshot());
> src/test/regress/expected/txid.out:select 
> txid_visible_in_snapshot(txid_current(), txid_current_snapshot());
> src/test/regress/expected/txid.out:-- test txid_current_if_assigned
> src/test/regress/expected/txid.out:SELECT txid_current_if_assigned() IS NULL;
> src/test/regress/expected/txid.out:SELECT txid_current() \gset
> src/test/regress/expected/txid.out:SELECT txid_current_if_assigned() IS NOT 
> DISTINCT FROM BIGINT :'txid_current';
> src/test/regress/expected/txid.out:SELECT txid_current() AS committed \gset
> src/test/regress/expected/txid.out:SELECT txid_current() AS rolledback \gset
> src/test/regress/expected/txid.out:SELECT txid_current() AS inprogress \gset
> src/test/regress/expected/update.out:CREATE FUNCTION xid_current() RETURNS 
> xid LANGUAGE SQL AS $$SELECT (txid_current() % ((1::int8<<32)))::text::xid;$$;
> src/test/regress/sql/alter_table.sql:where transactionid = 
> txid_current()::integer)
> src/test/regress/sql/alter_table.sql:where transactionid = 
> txid_current()::integer)

Re: snapshot too old issues, first around wraparound and then more.

2020-04-02 Thread Kevin Grittner
On Wed, Apr 1, 2020 at 6:59 PM Andres Freund  wrote:

> index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.
>

I looked at this flavor of problem today and from what I saw:

(1) This has been a problem all the way back to 9.6.0.
(2) The behavior is correct if the index creation is skipped or if
enable_indexscan is turned off in the transaction, confirming Andres'
analysis.
(3) Pruning seems to happen as intended; the bug found by Peter seems to be
entirely about failing to TestForOldSnapshot() where needed.
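
To illustrate point (3), the guard in question normally looks something like
the sketch below (assumed usage of TestForOldSnapshot() from bufmgr.h; the
open question is exactly which index-fetch call sites are missing it):

#include "postgres.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Sketch only: apply the snapshot-too-old check after reading an index page. */
static void
index_page_check_sketch(IndexScanDesc scan, Relation rel, Buffer buf)
{
	Page		page = BufferGetPage(buf);

	/*
	 * Errors out with "snapshot too old" if the page's LSN shows it was
	 * modified past the scan snapshot's old-snapshot threshold.
	 */
	TestForOldSnapshot(scan->xs_snapshot, rel, page);
}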

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/


Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
>
> On Thu, Apr 2, 2020 at 6:18 PM Julien Rouhaud  wrote:
> >
> > =# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements where query ilike '%create index%';
> >                     query                      | calls | wal_bytes | wal_records | wal_num_fpw
> > -----------------------------------------------+-------+-----------+-------------+-------------
> >  create index t1_idx_parallel_0 ON t1(id)      |     1 |  20389743 |        2762 |        2758
> >  create index t1_idx_parallel_0_bis ON t1(id)  |     1 |  20394391 |        2762 |        2758
> >  create index t1_idx_parallel_0_ter ON t1(id)  |     1 |  20395155 |        2762 |        2758
> >  create index t1_idx_parallel_1 ON t1(id)      |     1 |  20388335 |        2762 |        2758
> >  create index t1_idx_parallel_2 ON t1(id)      |     1 |  20389091 |        2762 |        2758
> >  create index t1_idx_parallel_3 ON t1(id)      |     1 |  20389847 |        2762 |        2758
> >  create index t1_idx_parallel_4 ON t1(id)      |     1 |  20390603 |        2762 |        2758
> >  create index t1_idx_parallel_5 ON t1(id)      |     1 |  20391359 |        2762 |        2758
> >  create index t1_idx_parallel_6 ON t1(id)      |     1 |  20392115 |        2762 |        2758
> >  create index t1_idx_parallel_7 ON t1(id)      |     1 |  20392871 |        2762 |        2758
> >  create index t1_idx_parallel_8 ON t1(id)      |     1 |  20393627 |        2762 |        2758
> > (11 rows)
> >
> > =# select relname, pg_relation_size(oid) from pg_class where relname like '%t1_id%';
> >        relname         | pg_relation_size
> > -----------------------+------------------
> >  t1_idx_parallel_0     |         22487040
> >  t1_idx_parallel_0_bis |         22487040
> >  t1_idx_parallel_0_ter |         22487040
> >  t1_idx_parallel_2     |         22487040
> >  t1_idx_parallel_1     |         22487040
> >  t1_idx_parallel_4     |         22487040
> >  t1_idx_parallel_3     |         22487040
> >  t1_idx_parallel_5     |         22487040
> >  t1_idx_parallel_6     |         22487040
> >  t1_idx_parallel_7     |         22487040
> >  t1_idx_parallel_8     |         22487040
> > (9 rows)
> >
> >
> > So while the number of WAL records and full page images stay constant, we 
> > can
> > see some small fluctuations in the total amount of generated WAL data, even 
> > for
> > multiple execution of the sequential create index.  I'm wondering if the
> > fluctuations are due to some other internal details or if the WalUsage 
> > support
> > is just completely broken (although I don't see any obvious issue ATM).
> >
>
> I think we need to know the reason for this.  Can you try with small
> size indexes and see if the problem is reproducible? If it is, then it
> will be easier to debug the same.

I have done some testing to see where this extra WAL size is coming
from.  First, if I create a new db before every run, the size is
consistent.  But when I then retried on the same server, as Julien
showed in his experiment, I got a few extra WAL bytes from the next
create index onwards.  The waldump output (attached in the mail) shows
that the extra record is a pg_class insert.  I still have to check why
we need to write that extra WAL.

create extension pg_stat_statements;
drop table t1;
create table t1(id integer);
insert into t1 select * from generate_series(1, 10);
alter table t1 set (parallel_workers = 0);
vacuum;checkpoint;
select * from pg_stat_statements_reset() ;
create index t1_idx_parallel_0 ON t1(id);
select query, calls, wal_bytes, wal_records, wal_num_fpw from
pg_stat_statements where query ilike '%create index%';;
                  query                   | calls | wal_bytes | wal_records | wal_num_fpw
------------------------------------------+-------+-----------+-------------+-------------
 create index t1_idx_parallel_0 ON t1(id) |     1 |     49320 |          23 |          15


drop table t1;
create table t1(id integer);
insert into t1 select * from generate_series(1, 10);
--select * from pg_stat_statements_reset() ;
alter table t1 set (parallel_workers = 0);
vacuum;checkpoint;
create index t1_idx_parallel_1 ON t1(id);

select query, calls, wal_bytes, wal_records, wal_num_fpw from
pg_stat_statements where query ilike '%create index%';;
postgres[110383]=# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements;
                  query                   | calls | wal_bytes | wal_records | wal_num_fpw
------------------------------------------+-------+-----------+-------------+-------------
 create index t1_idx_parallel_1 ON t1(id) |     1 |     50040 |          23 |          15

wal_bytes diff = 50040-49320 = 720

The WAL record below is causing the 720-byte difference, all other WALs

Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-02 Thread Tom Lane
I wrote:
> I'd be inclined to undo what you did in favor of initializing the
> test tables to contain significantly different numbers of rows,
> because that would (a) achieve plan stability more directly,
> and (b) demonstrate that the planner is actually ordering the
> tables by cost correctly.  Maybe somewhere else we have a test
> that is verifying (b), but these test cases abysmally fail to
> check that point.

Concretely, I suggest the attached, which replaces the autovac disables
with adjusting partition boundaries so that the partitions contain
different numbers of rows.

I did not touch the partition boundaries for pagg_tab1 and pagg_tab2,
because that would have required also changing the associated test
queries (which are designed to access only particular partitions).
It seemed like too much work to verify that the answers were still
right, and it's not really necessary because those tables are so
small as to fit in single pages.  That means that autovac will either
see the whole table or none of it, and in either case it won't change
reltuples.

This does cause the order of partitions to change in a couple of the
pagg_tab_ml EXPLAIN results, but I think that's fine; the ordering
does now match the actual sizes of the partitions.

regards, tom lane

diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index d8a6836..a4dc12b 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -2,9 +2,9 @@
 -- PARTITION_AGGREGATE
 -- Test partitionwise aggregation on partitioned tables
 --
--- Note: Various tests located within are sensitive to tables being
--- auto-vacuumed while the tests are running.  For this reason we
--- run autovacuum_enabled = off for all tables.
+-- Note: to ensure plan stability, it's a good idea to make the partitions of
+-- any one partitioned table in this test all have different numbers of rows.
+--
 -- Enable partitionwise aggregate, which by default is disabled.
 SET enable_partitionwise_aggregate TO true;
 -- Enable partitionwise join, which by default is disabled.
@@ -15,9 +15,9 @@ SET max_parallel_workers_per_gather TO 0;
 -- Tests for list partitioned tables.
 --
 CREATE TABLE pagg_tab (a int, b int, c text, d int) PARTITION BY LIST(c);
-CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('', '0001', '0002', '0003') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0004', '0005', '0006', '0007') WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0008', '0009', '0010', '0011') WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES IN ('', '0001', '0002', '0003', '0004');
+CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES IN ('0005', '0006', '0007', '0008');
+CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES IN ('0009', '0010', '0011');
 INSERT INTO pagg_tab SELECT i % 20, i % 30, to_char(i % 12, 'FM'), i % 30 FROM generate_series(0, 2999) i;
 ANALYZE pagg_tab;
 -- When GROUP BY clause matches; full aggregation is performed for each partition.
@@ -400,13 +400,13 @@ SELECT a, sum(b order by a) FROM pagg_tab GROUP BY a ORDER BY 1, 2;
 
 -- JOIN query
 CREATE TABLE pagg_tab1(x int, y int) PARTITION BY RANGE(x);
-CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab1_p1 PARTITION OF pagg_tab1 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab1_p2 PARTITION OF pagg_tab1 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab1_p3 PARTITION OF pagg_tab1 FOR VALUES FROM (20) TO (30);
 CREATE TABLE pagg_tab2(x int, y int) PARTITION BY RANGE(y);
-CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20) WITH (autovacuum_enabled = off);
-CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30) WITH (autovacuum_enabled = off);
+CREATE TABLE pagg_tab2_p1 PARTITION OF pagg_tab2 FOR VALUES FROM (0) TO (10);
+CREATE TABLE pagg_tab2_p2 PARTITION OF pagg_tab2 FOR VALUES FROM (10) TO (20);
+CREATE TABLE pagg_tab2_p3 PARTITION OF pagg_tab2 FOR VALUES FROM (20) TO (30);
 INSERT INTO pagg_tab1 SELECT i % 30, i % 20 FROM generate_series(0, 299, 2) i;
 INSERT INTO pagg_tab2 SELECT i % 20, i % 30 FROM generate_series(0, 299, 3) i;
 ANALYZE pagg_tab1;
@@ -820,9 +820,9 @@ SELECT a.x, a.y, count(*) FROM (SELECT * FROM pagg_tab1 WHERE x = 1 AND x = 2) a
 
 -- Partition by multiple columns
 CREATE TABLE pagg_tab_m (a int, b int, c int) PARTITION 

Re: WIP/PoC for parallel backup

2020-04-02 Thread Robert Haas
On Thu, Apr 2, 2020 at 11:17 AM Asif Rehman  wrote:
>> Why would you need to do that? As long as the process where
>> STOP_BACKUP can do the check, that seems good enough.
>
> Yes, but the user will get the error only after the STOP_BACKUP, not while 
> the backup is
> in progress. So if the backup is a large one, early error detection would be 
> much beneficial.
> This is the current behavior of non-parallel backup as well.

Because non-parallel backup does not feature early detection of this
error, it is not necessary to make parallel backup do so. Indeed, it
is undesirable. If you want to fix that problem, do it on a separate
thread in a separate patch. A patch proposing to make parallel backup
inconsistent in behavior with non-parallel backup will be rejected, at
least if I have anything to say about it.

TBH, fixing this doesn't seem like an urgent problem to me. The
current situation is not great, but promotions ought to be relatively
infrequent, so I'm not sure it's a huge problem in practice. It is
also worth considering whether the right fix is to figure out how to
make that case actually work, rather than just making it fail quicker.
I don't currently understand the reason for the prohibition so I can't
express an intelligent opinion on what the right answer is here, but
it seems like it ought to be investigated before somebody goes and
builds a bunch of infrastructure to make the error more timely.

> Okay, then I will add the shared state. And since we are adding the shared 
> state, we can use
> that for throttling, progress-reporting and standby early error checking.

Please propose a grammar here for all the new replication commands you
plan to add before going and implement everything. That will make it
easier to hash out the design without forcing you to keep changing the
code. Your design should include a sketch of how several sets of
coordinating backends taking several concurrent parallel backups will
end up with one shared state per parallel backup.

> There are two possible options:
>
> (1) Server may generate a unique ID i.e. BackupID= OR
> (2) (Preferred Option) Use the WAL start location as the BackupID.
>
> This BackupID should be given back as a response to start backup command. All 
> client workers
> must append this ID to all parallel backup replication commands. So that we 
> can use this identifier
> to search for that particular backup. Does that sound good?

Using the WAL start location as the backup ID seems like it might be
problematic -- could a single checkpoint not end up as the start
location for multiple backups started at the same time? Whether that's
possible now or not, it seems unwise to hard-wire that assumption into
the wire protocol.

I was thinking that perhaps the client should generate a unique backup
ID, e.g. leader does:

START_BACKUP unique_backup_id [options]...

And then others do:

JOIN_BACKUP unique_backup_id

My thought is that you will have a number of shared memory structure
equal to max_wal_senders, each one large enough to hold the shared
state for one backup. The shared state will include
char[NAMEDATALEN-or-something] which will be used to hold the backup
ID. START_BACKUP would allocate one and copy the name into it;
JOIN_BACKUP would search for one by name.
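
A minimal sketch of what such a shared state slot could look like (the struct
name and fields below are illustrative assumptions, not part of the proposal):

#include "postgres.h"
#include "access/xlogdefs.h"	/* XLogRecPtr, TimeLineID */
#include "storage/spin.h"		/* slock_t */

/* Hypothetical per-backup slot; shared memory would hold max_wal_senders of these. */
typedef struct ParallelBackupState
{
	char		backup_id[NAMEDATALEN]; /* empty string means the slot is free */
	XLogRecPtr	startptr;				/* backup start LSN */
	TimeLineID	starttli;
	bool		standby_promoted;		/* early-error flag for standby backups */
	int64		backup_total;			/* progress reporting: total bytes */
	int64		backup_streamed;		/* progress reporting: bytes sent so far */
	slock_t		mutex;					/* protects the counters above */
} ParallelBackupState;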

If you want to generate the name on the server side, then I suppose
START_BACKUP would return a result set that includes the backup ID,
and clients would have to specify that same backup ID when invoking
JOIN_BACKUP. The rest would stay the same. I am not sure which way is
better. Either way, the backup ID should be something long and hard to
guess, not e.g. the leader processes' PID. I think we should generate
it using pg_strong_random, say 8 or 16 bytes, and then hex-encode the
result to get a string. That way there's almost no risk of two backup
IDs colliding accidentally, and even if we somehow had a malicious
user trying to screw up somebody else's parallel backup by choosing a
colliding backup ID, it would be pretty hard to have any success. A
user with enough access to do that sort of thing can probably cause a
lot worse problems anyway, but it seems pretty easy to guard against
intentional collisions robustly here, so I think we should.
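
A sketch of that ID generation, assuming the backend's pg_strong_random() and
a hand-rolled hex encoding (the helper name and sizes are illustrative):

#include "postgres.h"			/* also pulls in pg_strong_random() via port.h */

#define RANDOM_BACKUP_ID_BYTES 16	/* 32 hex digits on the wire */

/* Hypothetical helper: returns a palloc'd, hex-encoded random backup ID. */
static char *
generate_backup_id(void)
{
	uint8		raw[RANDOM_BACKUP_ID_BYTES];
	char	   *id = palloc(RANDOM_BACKUP_ID_BYTES * 2 + 1);
	int			i;

	if (!pg_strong_random(raw, sizeof(raw)))
		ereport(ERROR,
				(errmsg("could not generate random backup identifier")));

	for (i = 0; i < RANDOM_BACKUP_ID_BYTES; i++)
		sprintf(id + i * 2, "%02x", raw[i]);	/* hex-encode each byte */

	return id;
}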

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WAL logging problem in 9.4.3?

2020-04-02 Thread Robert Haas
On Wed, Apr 1, 2020 at 11:51 PM Noah Misch  wrote:
> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> -0.4, +0.5, and -0.3.  Hence, I revoke the plan to back-patch.

FWIW, I also think that it would be better not to back-patch. The risk
of back-patching is that this will break things, whereas the risk of
not back-patching is that we will harm people who are affected by this
bug for a longer period of time than would otherwise be the case.
Because this patch is complex, the risk of breaking things seems
higher than normal. On the other hand, the number of users adversely
affected by the bug appears to be relatively low. Taken together,
these factors persuade me that we should not back-patch at this time.

It is possible that in the future things may look different. In the
happy event that this patch causes no more problems following commit,
while at the same time we have more complaints about the underlying
problem, we can make a decision to back-patch at a later time. This
brings me to another point: because this patch changes the WAL format,
a straight revert will be impossible once a release has occurred.
Therefore, if we hold off on back-patching for now and later decide
that we erred, we can proceed at that time and it will probably not be
much harder than it would be to do it now. On the other hand, if we
decide to back-patch now and later decide that we have erred, we will
have additional engineering work to do to cater to people who have
already installed the version containing the back-patched fix and now
need to upgrade again. Perhaps the WAL format changes are simple
enough that this isn't likely to be a huge issue even if it happens,
but it does seem better to avoid the chance that it might. A further
factor is that releases which break WAL compatibility are undesirable,
and should only be undertaken when necessary.

Last but not least, I would like to join with others in expressing my
thanks to you for your hard work on this problem. While the process of
developing a fix has not been without bumps, few people would have had
the time, patience, diligence, and skill to take this effort as far as
you have. Kyotaro Horiguchi and others likewise deserve credit for all
of the many hours that they have put into this work. The entire
PostgreSQL community owes all of you a debt of gratitude, and you have
my thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: WIP/PoC for parallel backup

2020-04-02 Thread Asif Rehman
On Thu, Apr 2, 2020 at 4:47 PM Robert Haas  wrote:

> On Fri, Mar 27, 2020 at 1:34 PM Asif Rehman 
> wrote:
> > Yes, we are fetching a single file. However, SEND_FILES is still capable
> of fetching multiple files in one
> > go, that's why the name.
>
> I don't see why it should work that way. If we're fetching individual
> files, why have an unused capability to fetch multiple files?
>

Okay will rename and will modify the function to send a single file as well.


> > 1- parallel backup does not work with a standby server. In parallel
> backup, the server
> > spawns multiple processes and there is no shared state being maintained.
> So currently,
> > no way to tell multiple processes if the standby was promoted during the
> backup since
> > the START_BACKUP was called.
>
> Why would you need to do that? As long as the process where
> STOP_BACKUP can do the check, that seems good enough.
>


Yes, but the user will get the error only after the STOP_BACKUP, not while
the backup is
in progress. So if the backup is a large one, early error detection would
be much beneficial.
This is the current behavior of non-parallel backup as well.


>
> > 2- throttling. Robert previously suggested that we implement throttling
> on the client-side.
> > However, I found a previous discussion where it was advocated to be
> added to the
> > backend instead[1].
> >
> > So, it was better to have a consensus before moving the throttle
> function to the client.
> > That’s why for the time being I have disabled it and have asked for
> suggestions on it
> > to move forward.
> >
> > It seems to me that we have to maintain a shared state in order to
> support taking backup
> > from standby. Also, there is a new feature recently committed for backup
> progress
> > reporting in the backend (pg_stat_progress_basebackup). This
> functionality was recently
> > added via this commit ID: e65497df. For parallel backup to update these
> stats, a shared
> > state will be required.
>
> I've come around to the view that a shared state is a good idea and
> that throttling on the server-side makes more sense. I'm not clear on
> whether we need shared state only for throttling or whether we need it
> for more than that. Another possible reason might be for the
> progress-reporting stuff that just got added.
>

Okay, then I will add the shared state. And since we are adding the shared
state, we can use
that for throttling, progress-reporting and standby early error checking.


> > Since multiple pg_basebackup can be running at the same time,
> maintaining a shared state
> > can become a little complex, unless we disallow taking multiple parallel
> backups.
>
> I do not see why it would be necessary to disallow taking multiple
> parallel backups. You just need to have multiple copies of the shared
> state and a way to decide which one to use for any particular backup.
> I guess that is a little complex, but only a little.
>

There are two possible options:

(1) Server may generate a unique ID i.e. BackupID= OR
(2) (Preferred Option) Use the WAL start location as the BackupID.


This BackupID should be given back as a response to the start backup command.
All client workers must append this ID to all parallel backup replication
commands, so that we can use this identifier to search for that particular
backup. Does that sound good?


--
Asif Rehman
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca


Re: Proposal: Expose oldest xmin as SQL function for monitoring

2020-04-02 Thread James Coleman
On Thu, Apr 2, 2020 at 12:13 AM Craig Ringer  wrote:
>
>
>
>
> On Thu, 2 Apr 2020 at 07:57, Tom Lane  wrote:
>>
>> Alvaro Herrera  writes:
>> > On 2020-Apr-01, Tom Lane wrote:
>> >> The fact that I had to use max(age(...)) in that sample query
>> >> hints at one reason: it's really hard to do arithmetic correctly
>> >> on raw XIDs.  Dealing with wraparound is a problem, and knowing
>> >> what's past or future is even harder.  What use-case do you
>> >> foresee exactly?
>>
>> > Maybe it would make sense to start exposing fullXids in these views and
>> > functions, for this reason.  There's no good reason to continue to
>> > expose bare Xids to userspace, we should use them only for storage.
>>
>> +1, that would help a lot.
>>
>> > But I think James' point is precisely that it's not easy to know where
>> > to look for things that keep Xmin from advancing.  Currently it's
>> > backends, replication slots, prepared transactions, and replicas with
>> > hot_standby_feedback.  If you forget to monitor just one of these, your
>> > vacuums might be useless and you won't notice until disaster strikes.
>>
>> Agreed, but just knowing what the oldest xmin is doesn't help you
>> find *where* it is.  Maybe what we need is a view showing all of
>> these potential sources of an old xmin.
>
>
>  Strongly agree.
>
> https://www.postgresql.org/message-id/camsr+ygss6jbhmehbxqmdc1xj7sobdsq62ywaekohn-kbqy...@mail.gmail.com
>
> I was aiming to write such a view, but folks seemed opposed. I still think 
> it'd be a very good thing to have built-in as Pg grows more complex.

Did you by any chance prototype anything/are you still interested?

This sounds extremely valuable to me, and while I don't want to
resurrect the old thread (it seemed like a bit of a tangent there
anyway), in my view this kind of basic diagnostic capability is
exactly the kind of thing that *has* to be in core, and then other
monitoring packages can take advantage of it.

Finding things holding back xmin from advancing is easily one of the
single biggest operational things we care about. We need to
investigate quickly when an issue occurs, so being able to do so
directly on the server (and having it be up-to-date with any new
features as they're released) is essential. And it's also one of the
areas where in my experience tracking things down is the hardest [with
capabilities in core]; you basically need to have this list in your
head of all of the things you need to check.

James




Re: Berserk Autovacuum (let's save next Mandrill)

2020-04-02 Thread Tom Lane
David Rowley  writes:
> On Thu, 2 Apr 2020 at 16:13, Tom Lane  wrote:
>> Quite :-(.  While it's too early to declare victory, we've seen no
>> more failures of this ilk since 0936d1b6f, so it's sure looking like
>> autovacuum did have something to do with it.

> How about [1]? It seems related to me and also post 0936d1b6f.

That looks much like the first lousyjack failure, which as I said
I wasn't trying to account for at that point.

After looking at those failures, though, I believe that the root cause
may be the same, ie small changes in pg_class.reltuples due to
autovacuum not seeing all pages of the tables.  The test structure
is a bit different, but it is accessing the tables in between EXPLAIN
attempts, so it could be preventing a concurrent autovac from seeing
all pages.

I see your fix at cefb82d49, but it feels a bit brute-force.  Unlike
stats_ext.sql, we're not (supposed to be) dependent on exact planner
estimates in this test.  So I think the real problem here is crappy test
case design.  Namely, that these various sub-tables are exactly the
same size, despite which the test is expecting that the planner will
order them consistently --- with a planning algorithm that prefers
to put larger tables first in parallel appends (cf. create_append_path).
It's not surprising that the result is unstable in the face of small
variations in the rowcount estimates.
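
To make the ordering issue concrete, the rule at play is essentially a
descending sort of the child paths by total cost, along the lines of the
simplified comparator below (an illustration only, not the actual
create_append_path code):

typedef struct ChildCost
{
	double		total_cost;
} ChildCost;

/*
 * Simplified illustration: order children by descending total cost so the
 * most expensive (typically the largest) partition comes first.  When several
 * children have exactly equal cost the relative order is effectively
 * arbitrary, which is why equal-sized partitions flap between runs.
 */
static int
append_child_cost_cmp(const void *a, const void *b)
{
	double		cost_a = ((const ChildCost *) a)->total_cost;
	double		cost_b = ((const ChildCost *) b)->total_cost;

	if (cost_a > cost_b)
		return -1;
	if (cost_a < cost_b)
		return 1;
	return 0;
}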

I'd be inclined to undo what you did in favor of initializing the
test tables to contain significantly different numbers of rows,
because that would (a) achieve plan stability more directly,
and (b) demonstrate that the planner is actually ordering the
tables by cost correctly.  Maybe somewhere else we have a test
that is verifying (b), but these test cases abysmally fail to
check that point.

I'm not really on board with disabling autovacuum in the regression
tests anywhere we aren't absolutely forced to do so.  It's not
representative of real world practice (or at least not real world
best practice ;-)) and it could help hide actual bugs.  We don't seem
to have much choice with the stats_ext tests as they are constituted,
but those tests look really fragile to me.  Let's not adopt that
technique where we have other possible ways to stabilize test results.

regards, tom lane




Re: WAL usage calculation patch

2020-04-02 Thread Julien Rouhaud
On Thu, Apr 02, 2020 at 06:40:51PM +0530, Amit Kapila wrote:
> On Thu, Apr 2, 2020 at 6:18 PM Julien Rouhaud  wrote:
> >
> > =# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements where query ilike '%create index%';
> >                     query                      | calls | wal_bytes | wal_records | wal_num_fpw
> > -----------------------------------------------+-------+-----------+-------------+-------------
> >  create index t1_idx_parallel_0 ON t1(id)      |     1 |  20389743 |        2762 |        2758
> >  create index t1_idx_parallel_0_bis ON t1(id)  |     1 |  20394391 |        2762 |        2758
> >  create index t1_idx_parallel_0_ter ON t1(id)  |     1 |  20395155 |        2762 |        2758
> >  create index t1_idx_parallel_1 ON t1(id)      |     1 |  20388335 |        2762 |        2758
> >  create index t1_idx_parallel_2 ON t1(id)      |     1 |  20389091 |        2762 |        2758
> >  create index t1_idx_parallel_3 ON t1(id)      |     1 |  20389847 |        2762 |        2758
> >  create index t1_idx_parallel_4 ON t1(id)      |     1 |  20390603 |        2762 |        2758
> >  create index t1_idx_parallel_5 ON t1(id)      |     1 |  20391359 |        2762 |        2758
> >  create index t1_idx_parallel_6 ON t1(id)      |     1 |  20392115 |        2762 |        2758
> >  create index t1_idx_parallel_7 ON t1(id)      |     1 |  20392871 |        2762 |        2758
> >  create index t1_idx_parallel_8 ON t1(id)      |     1 |  20393627 |        2762 |        2758
> > (11 rows)
> >
> > =# select relname, pg_relation_size(oid) from pg_class where relname like '%t1_id%';
> >        relname         | pg_relation_size
> > -----------------------+------------------
> >  t1_idx_parallel_0     |         22487040
> >  t1_idx_parallel_0_bis |         22487040
> >  t1_idx_parallel_0_ter |         22487040
> >  t1_idx_parallel_2     |         22487040
> >  t1_idx_parallel_1     |         22487040
> >  t1_idx_parallel_4     |         22487040
> >  t1_idx_parallel_3     |         22487040
> >  t1_idx_parallel_5     |         22487040
> >  t1_idx_parallel_6     |         22487040
> >  t1_idx_parallel_7     |         22487040
> >  t1_idx_parallel_8     |         22487040
> > (9 rows)
> >
> >
> > So while the number of WAL records and full page images stay constant, we 
> > can
> > see some small fluctuations in the total amount of generated WAL data, even 
> > for
> > multiple execution of the sequential create index.  I'm wondering if the
> > fluctuations are due to some other internal details or if the WalUsage 
> > support
> > is just completely broken (although I don't see any obvious issue ATM).
> >
> 
> I think we need to know the reason for this.  Can you try with small
> size indexes and see if the problem is reproducible? If it is, then it
> will be easier to debug the same.


I did some quick testing using the attached shell script:

- on a 1k-line base number of lines, with scales 1, 10, 100, 1000 (suffix _s)
- parallel workers from 0 to 8 (suffix _w)
- each index created twice (suffix _pa and _pb)
- with a vacuum;checkpoint;pg_switch_wal executed each time

I get the following results:

                   query                    | wal_bytes | wal_records | wal_num_fpw
--------------------------------------------+-----------+-------------+-------------
 CREATE INDEX t1_idx_s001_pa_w0 ON t1 (id)  |     61871 |          22 |          18
 CREATE INDEX t1_idx_s001_pa_w1 ON t1 (id)  |     62394 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w2 ON t1 (id)  |     63150 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w3 ON t1 (id)  |     63906 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w4 ON t1 (id)  |     64662 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w5 ON t1 (id)  |     65418 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w6 ON t1 (id)  |     65450 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w7 ON t1 (id)  |     66206 |          21 |          18
 CREATE INDEX t1_idx_s001_pa_w8 ON t1 (id)  |     66962 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w0 ON t1 (id)  |     67718 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w1 ON t1 (id)  |     68474 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w2 ON t1 (id)  |     68418 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w3 ON t1 (id)  |     69174 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w4 ON t1 (id)  |     69930 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w5 ON t1 (id)  |     70686 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w6 ON t1 (id)  |     71442 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w7 ON t1 (id)  |     64922 |          21 |          18
 CREATE INDEX t1_idx_s001_pb_w8 ON t1 (id)  |     65682 |          21 |          18
 CREATE INDEX t1_idx_s010_pa_w0 ON t1 (id)  |    250460 |

Re: WAL usage calculation patch

2020-04-02 Thread Dilip Kumar
On Thu, Apr 2, 2020 at 6:41 PM Amit Kapila  wrote:
>
> On Thu, Apr 2, 2020 at 6:18 PM Julien Rouhaud  wrote:
> >
> > =# select query, calls, wal_bytes, wal_records, wal_num_fpw from pg_stat_statements where query ilike '%create index%';
> >                     query                      | calls | wal_bytes | wal_records | wal_num_fpw
> > -----------------------------------------------+-------+-----------+-------------+-------------
> >  create index t1_idx_parallel_0 ON t1(id)      |     1 |  20389743 |        2762 |        2758
> >  create index t1_idx_parallel_0_bis ON t1(id)  |     1 |  20394391 |        2762 |        2758
> >  create index t1_idx_parallel_0_ter ON t1(id)  |     1 |  20395155 |        2762 |        2758
> >  create index t1_idx_parallel_1 ON t1(id)      |     1 |  20388335 |        2762 |        2758
> >  create index t1_idx_parallel_2 ON t1(id)      |     1 |  20389091 |        2762 |        2758
> >  create index t1_idx_parallel_3 ON t1(id)      |     1 |  20389847 |        2762 |        2758
> >  create index t1_idx_parallel_4 ON t1(id)      |     1 |  20390603 |        2762 |        2758
> >  create index t1_idx_parallel_5 ON t1(id)      |     1 |  20391359 |        2762 |        2758
> >  create index t1_idx_parallel_6 ON t1(id)      |     1 |  20392115 |        2762 |        2758
> >  create index t1_idx_parallel_7 ON t1(id)      |     1 |  20392871 |        2762 |        2758
> >  create index t1_idx_parallel_8 ON t1(id)      |     1 |  20393627 |        2762 |        2758
> > (11 rows)
> >
> > =# select relname, pg_relation_size(oid) from pg_class where relname like '%t1_id%';
> >        relname         | pg_relation_size
> > -----------------------+------------------
> >  t1_idx_parallel_0     |         22487040
> >  t1_idx_parallel_0_bis |         22487040
> >  t1_idx_parallel_0_ter |         22487040
> >  t1_idx_parallel_2     |         22487040
> >  t1_idx_parallel_1     |         22487040
> >  t1_idx_parallel_4     |         22487040
> >  t1_idx_parallel_3     |         22487040
> >  t1_idx_parallel_5     |         22487040
> >  t1_idx_parallel_6     |     22487040
> >  t1_idx_parallel_7     |         22487040
> >  t1_idx_parallel_8     |         22487040
> > (9 rows)
> >
> >
> > So while the number of WAL records and full page images stay constant, we 
> > can
> > see some small fluctuations in the total amount of generated WAL data, even 
> > for
> > multiple execution of the sequential create index.  I'm wondering if the
> > fluctuations are due to some other internal details or if the WalUsage 
> > support
> > is just completely broken (although I don't see any obvious issue ATM).
> >
>
> I think we need to know the reason for this.  Can you try with small
> size indexes and see if the problem is reproducible? If it is, then it
> will be easier to debug the same.
>
> Few other minor comments
> 
> pg_stat_statements patch
> 1.
> +--
> +-- CRUD: INSERT SELECT UPDATE DELETE on test non-temp table to
> validate WAL generation metrics
> +--
>
> The word 'non-temp' in the above comment appears out of place.  We
> don't need to specify it.
>
> 2.
> +-- SELECT usage data, check WAL usage is reported, wal_records equal
> rows count for INSERT/UPDATE/DELETE
> +SELECT query, calls, rows,
> +wal_bytes > 0 as wal_bytes_generated,
> +wal_records > 0 as wal_records_generated,
> +wal_records = rows as wal_records_as_rows
> +FROM pg_stat_statements ORDER BY query COLLATE "C";
>
> The comment doesn't seem to match what we are doing in the statement.
> I think we can simplify it to something like "check WAL is generated
> for above statements:
>
> 3.
> @@ -185,6 +185,9 @@ typedef struct Counters
>   int64 local_blks_written; /* # of local disk blocks written */
>   int64 temp_blks_read; /* # of temp blocks read */
>   int64 temp_blks_written; /* # of temp blocks written */
> + uint64 wal_bytes; /* total amount of WAL bytes generated */
> + int64 wal_records; /* # of WAL records generated */
> + int64 wal_num_fpw; /* # of WAL full page image generated */
>   double blk_read_time; /* time spent reading, in msec */
>   double blk_write_time; /* time spent writing, in msec */
>   double usage; /* usage factor */
>
> It is better to keep wal_bytes should be after wal_num_fpw as it is in
> the main patch.  Also, consider changing at other places in this
> patch.  I think we should add these new fields after blk_write_time or
> at the end after usage.
>
> 4.
> /* # of WAL full page image generated */
> Can we change it to "/* # of WAL full page image records generated */"?

IMHO, "# of WAL full-page image records" sounds like the number of WAL
records which contain a full-page image.  But actually this is the
total number of full-page images, not the number of records that
have a full-page image.
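
A tiny sketch of the two countings being contrasted (purely illustrative, not
the patch's code):

/* A record carrying two block images bumps num_fpw twice, while a
 * "records containing an FPI" counter would bump only once. */
static void
count_fpw_sketch(int nblock_images_in_record, long *num_fpw, long *records_with_fpw)
{
	*num_fpw += nblock_images_in_record;
	if (nblock_images_in_record > 0)
		(*records_with_fpw)++;
}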

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2020-04-02 Thread Robert Haas
On Thu, Apr 2, 2020 at 9:46 AM Kashif Zeeshan <
kashif.zees...@enterprisedb.com> wrote:

> Does it fail to clean up the backup folder in all cases where the
>> backup failed, or just in this case?
>>
> The cleanup is done in the cases I have seen so far with base
> pg_basebackup functionality (not including the parallel backup feature)
> with the message "pg_basebackup: removing contents of data directory"
> A similar case was also fixed for parallel backup reported by Rajkumar
> where the contents of the backup folder were not cleaned up after the error.
>

What I'm saying is that it's unclear whether there's a bug here or whether
it just failed because of the very extreme test scenario you created.
Spawning >1000 processes on a small machine can easily make a lot of things
fail.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: pgbench - add \aset to store results of a combined query

2020-04-02 Thread Fabien COELHO


Michaël,


ISTM that I submitted a patch to test whether a variable exists in pgbench,
like available in psql (:{?var} I think),


Not sure if improving the readability of the tests is a reason for
this patch.  So I would suggest to just live with relying on debug()
for now to check that a variable with a given prefix exists.


Sure. I meant that the feature would make sense to write benchmark scripts 
which would use aset and be able to act on the success or not of this 
aset, not to resurrect it for a hidden coverage test.



Thanks.  So, it looks like everything has been addressed.  Do you have
anything else in mind?


Nope.


NB: I think that it is really strange to not use an array for the
options in subroutine pgbench() of 001_pgbench_with_server.pl.
Shouldn't this be an array of options instead?  The current logic of
using a split string is weak when it comes to option quoting in
perl and command handling in general.


The idea is that a scalar is simpler and readable to write in the simple 
case than a perl array. Now maybe qw() could have done the trick.


--
Fabien.

Re: WIP/PoC for parallel backup

2020-04-02 Thread Kashif Zeeshan
On Thu, Apr 2, 2020 at 6:23 PM Robert Haas  wrote:

> On Thu, Apr 2, 2020 at 7:55 AM Kashif Zeeshan
>  wrote:
> > Thanks alot Robert,
> > In this case the backup folder was not being emptied as the backup was
> failed, the cleanup should be done in this case too.
>
> Does it fail to clean up the backup folder in all cases where the
> backup failed, or just in this case?
>
The cleanup is done in the cases I have seen so far with base pg_basebackup
functionality (not including the parallel backup feature) with the message
"pg_basebackup: removing contents of data directory"
A similar case was also fixed for parallel backup reported by Rajkumar
where the contents of the backup folder were not cleaned up after the error.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


-- 
Regards

Kashif Zeeshan
Lead Quality Assurance Engineer / Manager

EnterpriseDB Corporation
The Enterprise Postgres Company


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-04-02 Thread Julien Rouhaud
New conflict, rebased v9 attached.
From 26b98194d8add282158c65f6ac46c721ba80f498 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 18 Mar 2019 18:55:50 +0100
Subject: [PATCH v9 1/2] Expose queryid in pg_stat_activity and log_line_prefix

Similarly to other fields in pg_stat_activity, only the queryid from the top
level statements are exposed, and if the backends status isn't active then the
queryid from the last executed statements is displayed.

Also add a %Q placeholder to include the queryid in the log_line_prefix, which
will also only expose top level statements.

Author: Julien Rouhaud
Reviewed-by: Evgeny Efimkin, Michael Paquier
Discussion: https://postgr.es/m/CA+8PKvQnMfOE-c3YLRwxOsCYXQDyP8VXs6CDtMZp1V4=d4l...@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 179 --
 doc/src/sgml/config.sgml  |   9 +-
 doc/src/sgml/monitoring.sgml  |  12 ++
 src/backend/catalog/system_views.sql  |   1 +
 src/backend/executor/execMain.c   |   8 +
 src/backend/executor/execParallel.c   |  14 +-
 src/backend/executor/nodeGather.c |   3 +-
 src/backend/executor/nodeGatherMerge.c|   4 +-
 src/backend/parser/analyze.c  |   5 +
 src/backend/postmaster/pgstat.c   |  65 +++
 src/backend/tcop/postgres.c   |   5 +
 src/backend/utils/adt/pgstatfuncs.c   |   7 +-
 src/backend/utils/error/elog.c|  10 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/executor/execParallel.h   |   3 +-
 src/include/pgstat.h  |   5 +
 src/test/regress/expected/rules.out   |   9 +-
 18 files changed, 267 insertions(+), 79 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 942922b01f..4073361f4c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -115,6 +115,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
 
+/*
+ * Utility statements that pgss_ProcessUtility and pgss_post_parse_analyze
+ * ignores.
+ */
+#define PGSS_HANDLED_UTILITY(n)		(!IsA(n, ExecuteStmt) && \
+	!IsA(n, PrepareStmt) && \
+	!IsA(n, DeallocateStmt))
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -342,7 +350,8 @@ static void pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 ProcessUtilityContext context, ParamListInfo params,
 QueryEnvironment *queryEnv,
 DestReceiver *dest, QueryCompletion *qc);
-static uint64 pgss_hash_string(const char *str, int len);
+static const char *pgss_clean_querytext(const char *query, int *location, int *len);
+static uint64 pgss_compute_utility_queryid(const char *query, int query_len);
 static void pgss_store(const char *query, uint64 queryId,
 	   int query_location, int query_len,
 	   pgssStoreKind kind,
@@ -841,16 +850,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
 		return;
 
 	/*
-	 * Utility statements get queryId zero.  We do this even in cases where
-	 * the statement contains an optimizable statement for which a queryId
-	 * could be derived (such as EXPLAIN or DECLARE CURSOR).  For such cases,
-	 * runtime control will first go through ProcessUtility and then the
-	 * executor, and we don't want the executor hooks to do anything, since we
-	 * are already measuring the statement's costs at the utility level.
+	 * We compute a queryId now so that it can get exported in out
+	 * PgBackendStatus.  pgss_ProcessUtility will later discard it to prevents
+	 * double counting of optimizable statements that are directly contained in
+	 * utility statements.  Note that we don't compute a queryId for prepared
+	 * statemets related utility, as those will inherit from the underlying
+	 * statements's one (except DEALLOCATE which is entirely untracked).
 	 */
 	if (query->utilityStmt)
 	{
-		query->queryId = UINT64CONST(0);
+		if (pgss_track_utility && PGSS_HANDLED_UTILITY(query->utilityStmt)
+			&& pstate->p_sourcetext)
+		{
+			const char *querytext = pstate->p_sourcetext;
+			int query_location = query->stmt_location;
+			int query_len = query->stmt_len;
+
+			/*
+			 * Confine our attention to the relevant part of the string, if the
+			 * query is a portion of a multi-statement source string.
+			 */
+			querytext = pgss_clean_querytext(pstate->p_sourcetext,
+											 &query_location,
+											 &query_len);
+
+			query->queryId = pgss_compute_utility_queryid(querytext, query_len);
+		}
+		else
+			query->queryId = UINT64CONST(0);
 		return;
 	}
 
@@ -1098,6 +1125,23 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	DestReceiver *dest, QueryCompletion *qc)
 {
 	
