Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 15:41, Heikki Linnakangas wrote:

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any
attention to it, until now. I had not understood that text in the README
explaining the difference between search and insertion scan keys, before
looking at this patch. Not sure I ever read it with any thought. Now
that I understand it, I don't like the "insertion scan key" name.


BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own.
I'll spend more time reading the code around that, but meanwhile, if you
can think of a simpler way to explain it in the comments, that'd be good.


The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this
method to be any slower, but the point of the refactoring is to make the
code easier to understand.


/*
* Do the insertion. First move right to find the correct page to
* insert to, if necessary. If we're inserting to a non-unique index,
* _bt_search() already did this when it checked if a move to the
* right was required for the leaf page.  Insertion scankey's scantid
* would have been filled out at the time. On a unique index, the
* current buffer is the first buffer containing duplicates, however,
* so we may need to move right to the correct location for this
* tuple.
*/
if (checkingunique || itup_key->heapkeyspace)
  _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the
unique checks were performed or not. I'd like to encapsulate all the
information about the "insert position we're considering" in the
BTInsertStateData struct. Passing 'checkingunique' as a separate
argument violates that, because when it's set, the key means something
slightly different.

Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether
'scantid' is set, instead of 'checkingunique'. That would seem better.
If it looks at 'checkingunique', it's making the assumption that if
unique checks were not performed, then we are already positioned on the
correct page, according to the heap TID. But looking at 'scantid' seems
like a more direct way of getting the same information. And then we
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.

So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.


Ah, scratch that. By the time we call _bt_findinsertloc(), scantid has 
already been restored, even if it was not set originally when we did 
_bt_search.


My dislike here is that passing 'checkingunique' as a separate argument 
acts like a "modifier", slightly changing the meaning of the insertion 
scan key. If it's not set, we know we're positioned on the correct page. 
Otherwise, we might not be. And it's a pretty indirect way of saying 
that, as it also depends on 'heapkeyspace'. Perhaps add a flag to 
BTInsertStateData, to indicate the same thing more explicitly. Something 
like "bool is_final_insertion_page; /* when set, no need to move right */".


- Heikki



Re: pg_basebackup against older server versions

2019-03-06 Thread Sergei Kornilov
Hi

> No problem. Thanks for the patch, the logic looks good and I made
> some adjustments as attached. Does that look fine to you?

Looks fine, thanks. I tested against HEAD and v11.2 with and without -R in both 
plain and tar formats.

regards, Sergei



Re: Re: [HACKERS] Custom compression methods

2019-03-06 Thread Alexander Korotkov
On Thu, Mar 7, 2019 at 10:43 AM David Steele  wrote:

> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
>
> > there is another set of patches.
> > Only rebased to current master.
> >
> > Also I will change status on commitfest to 'Needs review'.
>
> This patch has seen periodic rebases but no code review that I can see
> since January 2018.
>
> As Andres noted in [1], I think that we need to decide if this is a
> feature that we want rather than just continuing to push it from CF to CF.
>

Yes.  I took a look at the code of this patch.  I think it's in pretty good
shape.  But a high-level review/discussion is required.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: insensitive collations

2019-03-06 Thread Peter Eisentraut
On 2019-03-05 18:48, Daniel Verite wrote:
>> Older ICU versions (<54) don't support all the locale customization
>> options, so many of my new tests in collate.icu.utf8.sql will fail on
>> older systems.  What should we do about that?  Have another extra test file?
> Maybe stick to the old-style syntax for the regression tests?
> The declarations that won't work as expected with older ICU versions
> would be:
> 
> CREATE COLLATION case_insensitive (provider = icu, locale =
> 'und-u-ks-level2', deterministic = false);
> 
> 'und-u-ks-level2' is equivalent to 'und@colStrength=secondary'

The problem is not the syntax but that the older ICU versions don't
support the *functionality* of ks-level2 or colStrength=secondary.  If
you try it, you will simply get normal case-sensitive behavior.

It would probably be possible to write all the tests for
nondeterministic collations without making use of this functionality,
using only canonically equivalent sequences as test data.  But that
would make the tests extremely weird and unintuitive, so I'd like to
avoid that.
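For reference, the functionality in question can be demonstrated with a
minimal standalone program against the ICU C API (illustrative only, not
PostgreSQL code). At secondary strength -- what 'ks-level2' requests via
the locale string on ICU >= 54 -- case differences are ignored; an older
ICU that does not understand the keyword silently falls back to the
case-sensitive behavior described above:

#include <stdio.h>
#include <unicode/ucol.h>

int
main(void)
{
	UErrorCode	status = U_ZERO_ERROR;

	/* Open a root-locale collator and force secondary strength
	 * programmatically, the equivalent of 'und-u-ks-level2'. */
	UCollator  *coll = ucol_open("und", &status);

	if (U_FAILURE(status))
		return 1;
	ucol_setStrength(coll, UCOL_SECONDARY);

	/* At secondary strength, 'abc' and 'ABC' compare equal. */
	UCollationResult r = ucol_strcollUTF8(coll, "abc", -1, "ABC", -1,
										  &status);

	printf("abc vs ABC: %s\n", r == UCOL_EQUAL ? "equal" : "not equal");
	ucol_close(coll);
	return 0;
}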

After thinking about it a bit more, I think making a new test file is
reasonable.  The file is already fairly long anyway, compared to the
typical test file size.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: [HACKERS] Custom compression methods

2019-03-06 Thread David Steele

On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:


there is another set of patches.
Only rebased to current master.

Also I will change status on commitfest to 'Needs review'.


This patch has seen periodic rebases but no code review that I can see 
since January 2018.


As Andres noted in [1], I think that we need to decide if this is a 
feature that we want rather than just continuing to push it from CF to CF.


--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/20190216054526.zss2cufdxfeudr4i%40alap3.anarazel.de




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 07/03/2019 14:54, Peter Geoghegan wrote:

On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:

After staring at the first patch for a bit longer, a few things started to
bother me:

* The new struct is called BTScanInsert, but it's used for searches,
too. It makes sense when you read the README, which explains the
difference between "search scan keys" and "insertion scan keys", but now
that we have a separate struct for this, perhaps we should call insertion scan
keys by a less confusing name. I don't know what to suggest, though.
"Positioning key"?


I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.


Yeah. It's been like that forever, but I must confess I hadn't paid any 
attention to it, until now. I had not understood that text in the README 
explaining the difference between search and insertion scan keys, before 
looking at this patch. Not sure I ever read it with any thought. Now 
that I understand it, I don't like the "insertion scan key" name.



BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.


I don't understand it :-(. I guess that's valuable feedback on its own. 
I'll spend more time reading the code around that, but meanwhile, if you 
can think of a simpler way to explain it in the comments, that'd be good.



The new BTInsertStateData struct also
holds the current buffer we're considering to insert to, and a
'bounds_valid' flag to indicate if the saved bounds are valid for the
current buffer. That way, it's more straightforward to clear the
'bounds_valid' flag whenever we move right.


I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.


I haven't given performance much thought, really. I don't expect this 
method to be any slower, but the point of the refactoring is to make the 
code easier to understand.



/*
   * Do the insertion. First move right to find the correct page to
   * insert to, if necessary. If we're inserting to a non-unique index,
   * _bt_search() already did this when it checked if a move to the
   * right was required for the leaf page.  Insertion scankey's scantid
   * would have been filled out at the time. On a unique index, the
   * current buffer is the first buffer containing duplicates, however,
   * so we may need to move right to the correct location for this
   * tuple.
   */
if (checkingunique || itup_key->heapkeyspace)
 _bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
newitemoff, false);

Does this make sense?


I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.


The idea is that _bt_findinsertpage() would not need to know whether the 
unique checks were performed or not. I'd like to encapsulate all the 
information about the "insert position we're considering" in the 
BTInsertStateData struct. Passing 'checkingunique' as a separate 
argument violates that, because when it's set, the key means something 
slightly different.


Hmm. Actually, with patch #2, _bt_findinsertloc() could look at whether 
'scantid' is set, instead of 'checkingunique'. That would seem better. 
If it looks at 'checkingunique', it's making the assumption that if 
unique checks were not performed, then we are already positioned on the 
correct page, according to the heap TID. But looking at 'scantid' seems 
like a more direct way of getting the same information. And then we 
won't need to pass the 'checkingunique' flag as an "out-of-band" argument.


So I'm specifically suggesting that we replace this, in _bt_findinsertloc:

if (!checkingunique && itup_key->heapkeyspace)
break;

With this:

if (itup_key->scantid)
break;

And remove the 'checkingunique' argument from _bt_findinsertloc.

- Heikki



Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-03-06 Thread amul sul
Thanks Rajkumar,

I am looking into this.

Regards,
Amul

On Thu, Mar 7, 2019 at 11:54 AM Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

>
>
> On Tue, Mar 5, 2019 at 3:45 PM amul sul  wrote:
>
>> Attached is the rebased atop of the latest master head(35bc0ec7c8).
>>
> thanks Amul, patches applied cleanly on PG head.
>
> While testing this I got a server crash with below test case.
>
> CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
> CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN
> ('0001','0002','0003');
> CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN
> ('0004','0005','0006');
> CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
> CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('','0010');
> INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
> generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
> INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else
> to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
> IN (7,8,9);
> ANALYSE plt1;
>
> CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
> CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
> CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN
> ('0004','0005','0006');
> CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN
> ('0007','0008','0009');
> CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('',NULL,'0012');
> INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
> generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
> INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else
> to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
> IN (0,11,12);
> ANALYZE plt2;
>
> CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c,
> 'A'));
> CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
> CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005',
> '0006');
> CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
> CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('');
> INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM') FROM
> generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14,
> 15, 16);
> ANALYZE plt1_e;
>
> EXPLAIN (COSTS OFF)
> SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
> plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c
> GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
>
> below is the stack trace; it looks like some indexes got corrupted, please
> take a look.
>
> Core was generated by `postgres: edb postgres [local]
> EXPLAIN   '.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
> partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c)
> at partbounds.c:4162
> 4162if (partmap1->from < 0 && partmap2->from < 0)
> Missing separate debuginfos, use: debuginfo-install
> keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
> libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
> openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
> (gdb) bt
> #0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
> partmaps2=0x2c1c8e0, *index1=45540240*, index2=0,
> next_index=0x7ffeebd43d3c) at partbounds.c:4162
> #1  0x008226c3 in merge_null_partitions (outer_bi=0x2b6e338,
> inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0,
> jointype=JOIN_INNER,
> next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38,
> default_index=0x7ffeebd43d34) at partbounds.c:4610
> #2  0x00821726 in partition_list_bounds_merge
> (partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40,
> inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
> inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
> #3  0x0081ff5d in partition_bounds_merge (partnatts=1,
> partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40,
> inner_rel=0x2bf8d28, jointype=JOIN_INNER,
> outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at
> partbounds.c:3053
> #4  0x007c610f in try_partitionwise_join (root=0x2be2a28,
> rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0,
> parent_sjinfo=0x7ffeebd44010,
> parent_restrictlist=0x2c1c070) at joinrels.c:1370
> #5  0x007c5521 in populate_joinrel_with_paths (root=0x2be2a28,
> rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010,
> restrictlist=0x2c1c070)
> at joinrels.c:914
> #6  0x007c4f48 in make_join_rel (root=0x2be2a28, 

Re: pg_basebackup against older server versions

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 01:42:16PM +0300, Sergei Kornilov wrote:
> My fault. I thought pg_basebackup works only with the same major version, sorry.
> How about attached patch?

No problem.  Thanks for the patch, the logic looks good and I made
some adjustments as attached.  Does that look fine to you?
--
Michael
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 3d2d4cd0b9..d56c8740d4 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -66,6 +66,11 @@ typedef struct TablespaceList
  */
 #define MINIMUM_VERSION_FOR_TEMP_SLOTS 10
 
+/*
+ * recovery.conf is integrated into postgresql.conf from version 12.
+ */
+#define MINIMUM_VERSION_FOR_RECOVERY_GUC 12
+
 /*
  * Different ways to include WAL
  */
@@ -974,6 +979,7 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	bool		basetablespace = PQgetisnull(res, rownum, 0);
 	bool		in_tarhdr = true;
 	bool		skip_file = false;
+	bool		is_recovery_guc_supported = true;
 	bool		is_postgresql_auto_conf = false;
 	bool		found_postgresql_auto_conf = false;
 	int			file_padding_len = 0;
@@ -984,6 +990,10 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	gzFile		ztarfile = NULL;
 #endif
 
+	/* recovery.conf is integrated into postgresql.conf in 12 and newer */
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_RECOVERY_GUC)
+		is_recovery_guc_supported = false;
+
 	if (basetablespace)
 	{
 		/*
@@ -1130,11 +1140,22 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 			{
 char		header[512];
 
-if (!found_postgresql_auto_conf)
+/*
+ * If postgresql.auto.conf has not been found in the streamed
+ * data, add recovery configuration to postgresql.auto.conf if
+ * recovery parameters are GUCs.  If the instance connected to
+ * is older than 12, create recovery.conf with this data
+ * otherwise.
+ */
+if (!found_postgresql_auto_conf || !is_recovery_guc_supported)
 {
 	int			padding;
+	char	   *conffile = is_recovery_guc_supported ?
+	"postgresql.auto.conf" : "recovery.conf";
 
-	tarCreateHeader(header, "postgresql.auto.conf", NULL,
+	tarCreateHeader(header,
+	conffile,
+	NULL,
 	recoveryconfcontents->len,
 	pg_file_create_mode, 04000, 02000,
 	time(NULL));
@@ -1142,18 +1163,26 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 	padding = ((recoveryconfcontents->len + 511) & ~511) - recoveryconfcontents->len;
 
 	WRITE_TAR_DATA(header, sizeof(header));
-	WRITE_TAR_DATA(recoveryconfcontents->data, recoveryconfcontents->len);
+	WRITE_TAR_DATA(recoveryconfcontents->data,
+   recoveryconfcontents->len);
 	if (padding)
 		WRITE_TAR_DATA(zerobuf, padding);
 }
 
-tarCreateHeader(header, "standby.signal", NULL,
-0,	/* zero-length file */
-pg_file_create_mode, 04000, 02000,
-time(NULL));
+/*
+ * standby.signal is supported only if recovery parameters are
+ * GUCs.
+ */
+if (is_recovery_guc_supported)
+{
+	tarCreateHeader(header, "standby.signal", NULL,
+	0,	/* zero-length file */
+	pg_file_create_mode, 04000, 02000,
+	time(NULL));
 
-WRITE_TAR_DATA(header, sizeof(header));
-WRITE_TAR_DATA(zerobuf, 511);
+	WRITE_TAR_DATA(header, sizeof(header));
+	WRITE_TAR_DATA(zerobuf, 511);
+}
 			}
 
 			/* 2 * 512 bytes empty data at end of file */
@@ -1252,16 +1281,24 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		 * We have the complete header structure in tarhdr,
 		 * look at the file metadata: we may want append
 		 * recovery info into postgresql.auto.conf and skip
-		 * standby.signal file.  In both cases we must
-		 * calculate tar padding
+		 * standby.signal file if recovery parameters are
+		 * integrated as GUCs, and recovery.conf otherwise. In
+		 * both cases we must calculate tar padding.
 		 */
-		skip_file = (strcmp(&tarhdr[0], "standby.signal") == 0);
-		is_postgresql_auto_conf = (strcmp(&tarhdr[0], "postgresql.auto.conf") == 0);
+		if (is_recovery_guc_supported)
+		{
+			skip_file = (strcmp(&tarhdr[0], "standby.signal") == 0);
+			is_postgresql_auto_conf = (strcmp(&tarhdr[0], "postgresql.auto.conf") == 0);
+		}
+		else
+			skip_file = (strcmp(&tarhdr[0], "recovery.conf") == 0);
 
 		filesz = read_tar_number(&tarhdr[124], 12);
 		file_padding_len = ((filesz + 511) & ~511) - filesz;
 
-		if (is_postgresql_auto_conf && writerecoveryconf)
+		if (is_recovery_guc_supported &&
+			is_postgresql_auto_conf &&
+			writerecoveryconf)
 		{
 			/* replace tar header */
 			char		header[512];
@@ -1313,7 +1350,9 @@ ReceiveTarFile(PGconn *conn, PGresult *res, int rownum)
 		pos += bytes2write;
 		filesz -= bytes2write;
 	}
-	else if (is_postgresql_auto_conf && writerecoveryconf)
+	else if (is_recovery_guc_supported &&
+			 

Re: Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-03-06 Thread David Steele

On 3/6/19 5:38 PM, Andrey Borodin wrote:

20 февр. 2019 г., в 17:06, Alexey Kondratov  
написал(а):


The new patch is much smaller (less than 400 lines) and works as advertised.
There's a typo "retreive" there.

These lines look a little suspicious:
char  postgres_exec_path[MAXPGPATH],
  postgres_cmd[MAXPGPATH],
  cmd_output[MAX_RESTORE_COMMAND];
Is there supposed to be any difference between MAXPGPATH and MAX_RESTORE_COMMAND?

Besides this, patch looks fine to me.


This patch appears to need attention from the author so I have marked it 
Waiting on Author.


Regards,
--
-David
da...@pgmasters.net



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:54 PM Peter Geoghegan  wrote:
> It will also have to store heapkeyspace, of course. And minusinfkey.
> BTW, I would like to hear what you think of the idea of minusinfkey
> (and the !minusinfkey optimization) specifically.

> I'm not sure that that's an improvement. Moving right should be very
> rare with my patch. gcov shows that we never move right here anymore
> with the regression tests, or within _bt_check_unique() -- not once.
> For a second, I thought that you forgot to invalidate the bounds_valid
> flag, because you didn't pass it directly, by value to
> _bt_useduplicatepage().

BTW, the !minusinfkey optimization is why we literally never move
right within _bt_findinsertloc() while the regression tests run. We
always land on the correct leaf page to begin with. (It works with
unique index insertions, where scantid is NULL when we descend the
tree.)

In general, there are two good reasons for us to move right:

* There was a concurrent page split (or page deletion), and we just
missed the downlink in the parent, and need to recover.

* We omit some columns from our scan key (at least scantid), and there
are perhaps dozens of matches -- this is not relevant to
_bt_doinsert() code.

The single value strategy used by nbtsplitloc.c does a good job of
making it unlikely that _bt_check_unique()-wise duplicates will cross
leaf pages, so there will almost always be one leaf page to visit.
And, the !minusinfkey optimization ensures that the only reason we'll
move right is because of a concurrent page split, within
_bt_moveright().

The buffer lock coupling move to the right that _bt_findinsertloc()
does should be considered an edge case with all of these measures, at
least with v4 indexes.

-- 
Peter Geoghegan



Re: Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk

2019-03-06 Thread David Steele

On 2/16/19 10:38 PM, Dave Cramer wrote:


Thanks for looking at this. FYI, I did not originally write this; rather, 
the original author has not replied to requests.

JDBC could use this, I assume others could as well.

That said I'm certainly open to suggestions on how to do this.

Craig, do you have any other ideas?

This patch appears to be stalled.  Are there any ideas on how to proceed?

Regards,
--
-David
da...@pgmasters.net



Re: ECPG regression with DECLARE STATEMENT support

2019-03-06 Thread Rushabh Lathia
On Thu, Mar 7, 2019 at 9:56 AM Michael Meskes  wrote:

> Hi,
>
> > Commit bd7c95f0c1a38becffceb3ea7234d57167f6d4bf added DECLARE
> > STATEMENT support to ECPG.  This introduced the new rule
> > for EXEC SQL CLOSE cur and with that it gets transformed into
> > ECPGclose().
> >
> > Prior to the above commit, someone could declare the
> > cursor in a SQL statement, and "CLOSE cur_name" could
> > also be executed as a normal statement.
>
> That still works, the difference in your test case is that the DECLARE
> statement is prepared.
>
> > Example:
> >
> > EXEC SQL PREPARE cur_query FROM "DECLARE cur1 CURSOR WITH HOLD FOR
> > SELECT count(*) FROM pg_class";
> > EXEC SQL PREPARE fetch_stmt FROM "FETCH next FROM cur1";
> > EXEC SQL EXECUTE cur_query;
> > EXEC SQL EXECUTE fetch_stmt INTO :c;
> > EXEC SQL CLOSE cur1;
> >
> > With commit bd7c95f0c1, "EXEC SQL CLOSE cur1" will fail
> > and throw an error "sqlcode -245 The cursor is invalid".
> >
> > I think the problem here is that ECPGclose() tries to find the
> > cursor in "connection->cursor_stmts" and, if it doesn't
> > find it there, just throws an error.  Maybe a fix is required
> > in ECPGclose() - rather than throwing an error, continue
> > executing the statement "CLOSE cur_name" with ecpg_do().
>
> The problem as I see it is that the cursor is known to the backend but
> not the library.


Exactly.  So maybe we should add logic to ECPGclose(): if
it doesn't find the cursor in the library, just send the query to the
backend rather than throwing an error.
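
To sketch that in standalone C (the structures and helpers below are
stand-ins invented for illustration; the real ecpglib types, signatures,
and ecpg_do() call differ):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the ecpglib bookkeeping discussed above. */
struct cursor_stmt
{
	const char *name;
	struct cursor_stmt *next;
};
struct connection
{
	struct cursor_stmt *cursor_stmts;
};

/* Stand-in for handing a statement to the backend (ecpg_do() in ecpglib). */
static bool
exec_on_backend(struct connection *conn, const char *stmt)
{
	printf("executing: %s\n", stmt);
	return true;
}

static bool
cursor_known_to_library(struct connection *conn, const char *name)
{
	for (struct cursor_stmt *cs = conn->cursor_stmts; cs; cs = cs->next)
		if (strcmp(cs->name, name) == 0)
			return true;
	return false;
}

/* Proposed behavior for EXEC SQL CLOSE: when the cursor was declared
 * behind the library's back (e.g. via a prepared DECLARE), fall back
 * to a plain CLOSE instead of raising sqlcode -245. */
static bool
close_cursor(struct connection *conn, const char *cur_name)
{
	char		stmt[128];

	snprintf(stmt, sizeof(stmt), "CLOSE %s", cur_name);

	if (!cursor_known_to_library(conn, cur_name))
		return exec_on_backend(conn, stmt);	/* let the backend decide */

	/* Normal path: execute the CLOSE and also drop the cursor from
	 * conn->cursor_stmts (bookkeeping elided here). */
	return exec_on_backend(conn, stmt);
}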


> Takaheshi-san, Hayato-san, any idea how to improve the
> situation to not error out on statements that used to work?
>
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Meskes at (Debian|Postgresql) dot Org
> Jabber: michael at xmpp dot meskes dot org
> VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
>
>
>

-- 
Rushabh Lathia


Re: Refactoring the checkpointer's fsync request queue

2019-03-06 Thread Thomas Munro
On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath  wrote:
> On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote:
> > Thomas Munro  writes:
> > > Why do we need to include fmgr.h in md.h?
> >
> > More generally, any massive increase in an include file's inclusions
> > is probably a sign that you need to refactor.  Cross-header inclusions
> > are best avoided altogether if you can --- obviously that's not always
> > possible, but we should minimize them.  We've had some very unfortunate
> > problems in the past from indiscriminate #includes in headers.
>
> Agree - I do pay attention to these, but this one slipped through the
> cracks (copied smgr.h then edited to remove smgr bits). Thanks for
> catching this, will fix in the next patch iteration.

Huh... so why was it in smgr.h then?  Seems bogus.  Fix pushed to master.

-- 
Thomas Munro
https://enterprisedb.com



Re: Re: proposal: plpgsql pragma statement

2019-03-06 Thread David Steele

On 2/4/19 8:12 PM, Pavel Stehule wrote:


  attached rebased patch


This patch has gone through a few iterations but I don't think there's 
any agreement on what it should look like.  There's been no code review 
that I can see.


I think this should be pushed to PG13 at the least, perhaps returned 
with comment or rejected.


Regards,
--
-David
da...@pgmasters.net



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 10:15 PM Heikki Linnakangas  wrote:
> After staring at the first patch for a bit longer, a few things started to
> bother me:
>
> * The new struct is called BTScanInsert, but it's used for searches,
> too. It makes sense when you read the README, which explains the
> difference between "search scan keys" and "insertion scan keys", but now
> that we have a separate struct for this, perhaps we should call insertion scan
> keys by a less confusing name. I don't know what to suggest, though.
> "Positioning key"?

I think that insertion scan key is fine. It's been called that for
almost twenty years. It's not like it's an intuitive concept that
could be conveyed easily if only we came up with a new, pithy name.

> * We store the binary search bounds in BTScanInsertData, but they're
> only used during insertions.
>
> * The binary search bounds are specific for a particular buffer. But
> that buffer is passed around separately from the bounds. It seems easy
> to have them go out of sync, so that you try to use the cached bounds
> for a different page. The savebinsrch and restorebinsrch are used to deal
> with that, but it is pretty complicated.

That might be an improvement, but I do think that using mutable state
in the insertion scankey to restrict a search is an idea that could
work well in at least one other way. That really isn't a once-off
thing, even though it looks that way.

> I came up with the attached (against master), which addresses the 2nd
> and 3rd points. I added a whole new BTInsertStateData struct, to hold
> the binary search bounds. BTScanInsert now only holds the 'scankeys'
> array, and the 'nextkey' flag.

It will also have to store heapkeyspace, of course. And minusinfkey.
BTW, I would like to hear what you think of the idea of minusinfkey
(and the !minusinfkey optimization) specifically.

> The new BTInsertStateData struct also
> holds the current buffer we're considering to insert to, and a
> 'bounds_valid' flag to indicate if the saved bounds are valid for the
> current buffer. That way, it's more straightforward to clear the
> 'bounds_valid' flag whenever we move right.

I'm not sure that that's an improvement. Moving right should be very
rare with my patch. gcov shows that we never move right here anymore
with the regression tests, or within _bt_check_unique() -- not once.
For a second, I thought that you forgot to invalidate the bounds_valid
flag, because you didn't pass it directly, by value to
_bt_useduplicatepage().

> I made a copy of _bt_binsrch, named _bt_binsrch_insert. It does the binary
> search like _bt_binsrch does, but the bounds caching is only done in
> _bt_binsrch_insert. Seems more clear to have separate functions for them
> now, even though there's some duplication.

I'll have to think about that some more. Having a separate
_bt_binsrch_insert() may be worth it, but I'll need to do some
profiling.

> Hmm. Perhaps it would be better to move the call to _bt_binsrch (or
> _bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So
> that _bt_findinsertloc would only be responsible for finding the correct
> page to insert to. So the overall code, after patch #2, would be like:

Maybe, but as I said it's not like _bt_findinsertloc() doesn't know
all about unique indexes already. This is pointed out in a comment in
_bt_doinsert(), even. I guess that it might have to be changed to say
_bt_findinsertpage() instead, with your new approach.

> /*
>   * Do the insertion. First move right to find the correct page to
>   * insert to, if necessary. If we're inserting to a non-unique index,
>   * _bt_search() already did this when it checked if a move to the
>   * right was required for the leaf page.  Insertion scankey's scantid
>   * would have been filled out at the time. On a unique index, the
>   * current buffer is the first buffer containing duplicates, however,
>   * so we may need to move right to the correct location for this
>   * tuple.
>   */
> if (checkingunique || itup_key->heapkeyspace)
> _bt_findinsertpage(rel, &insertstate, stack, heapRel);
>
> newitemoff = _bt_binsrch_insert(rel, &insertstate);
>
> _bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup,
> newitemoff, false);
>
> Does this make sense?

I guess you're saying this because you noticed that the for (;;) loop
in _bt_findinsertloc() doesn't do that much in many cases, because of
the fastpath.

I suppose that this could be an improvement, provided we keep all the
assertions that verify that the work "_bt_findinsertpage()" would have
done, had it been called, was in fact unnecessary (e.g., checking the
high key/rightmost-ness).

> The attached new version simplifies this, IMHO. The bounds and the
> current buffer go together in the same struct, so it's easier to keep
> track whether the bounds are valid or not.

I'll look into integrating this with my current draft v15 tomorrow.
Need to sleep on it.

-- 
Peter Geoghegan



Re: Re: [HACKERS] proposal: schema variables

2019-03-06 Thread David Steele

On 3/3/19 10:27 PM, Pavel Stehule wrote:


rebased and fixed compilation due to changes related to pg_dump


This patch hasn't received any review in a while, and I'm not sure if 
that's because nobody is interested or the reviewers think it does not 
need any more review.


It seems to me that this patch as implemented does not quite satisfy 
anyone.


I think we need to hear something from the reviewers soon or I'll push 
this patch to PG13 as Andres recommends [1].


--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/20190216054526.zss2cufdxfeudr4i%40alap3.anarazel.de




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-03-06 Thread Rajkumar Raghuwanshi
On Tue, Mar 5, 2019 at 3:45 PM amul sul  wrote:

> Attached is the rebased atop of the latest master head(35bc0ec7c8).
>
thanks Amul, patches applied cleanly on PG head.

While testing this I got a server crash with below test case.

CREATE TABLE plt1 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('0001','0002','0003');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN (NULL,'0008','0009');
CREATE TABLE plt1_p4 PARTITION OF plt1 FOR VALUES IN ('','0010');
INSERT INTO plt1 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
generate_series(0, 500) i WHERE i % 17 NOT IN (7, 11, 12, 13, 14, 15,16);
INSERT INTO plt1 SELECT i, i % 47, case when i % 17 = 7 then NULL else
to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
IN (7,8,9);
ANALYSE plt1;

CREATE TABLE plt2 (a int, b int, c varchar) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('0002','0003');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0004','0005','0006');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0007','0008','0009');
CREATE TABLE plt2_p4 PARTITION OF plt2 FOR VALUES IN ('',NULL,'0012');
INSERT INTO plt2 SELECT i, i % 47, to_char(i % 17, 'FM') FROM
generate_series(0, 500) i WHERE i % 17 NOT IN (1, 10, 11, 13, 14, 15, 16);
INSERT INTO plt2 SELECT i, i % 47, case when i % 17 = 11 then NULL else
to_char(i % 17, 'FM') end FROM generate_series(0, 500) i WHERE i % 17
IN (0,11,12);
ANALYZE plt2;

CREATE TABLE plt1_e (a int, b int, c text) PARTITION BY LIST(ltrim(c, 'A'));
CREATE TABLE plt1_e_p1 PARTITION OF plt1_e FOR VALUES IN ('0002', '0003');
CREATE TABLE plt1_e_p2 PARTITION OF plt1_e FOR VALUES IN ('0004', '0005',
'0006');
CREATE TABLE plt1_e_p3 PARTITION OF plt1_e FOR VALUES IN ('0008', '0009');
CREATE TABLE plt1_e_p4 PARTITION OF plt1_e FOR VALUES IN ('');
INSERT INTO plt1_e SELECT i, i % 47, to_char(i % 17, 'FM') FROM
generate_series(0, 500) i WHERE i % 17 NOT IN (1, 7, 10, 11, 12, 13, 14,
15, 16);
ANALYZE plt1_e;

EXPLAIN (COSTS OFF)
SELECT avg(t1.a), avg(t2.b), avg(t3.a + t3.b), t1.c, t2.c, t3.c FROM
plt1 t1, plt2 t2, plt1_e t3 WHERE t1.c = t2.c AND ltrim(t3.c, 'A') = t1.c
GROUP BY t1.c, t2.c, t3.c ORDER BY t1.c, t2.c, t3.c;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

below is the stack trace; it looks like some indexes got corrupted, please
take a look.

Core was generated by `postgres: edb postgres [local]
EXPLAIN   '.
Program terminated with signal 11, Segmentation fault.
#0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
partmaps2=0x2c1c8e0, index1=45540240, index2=0, next_index=0x7ffeebd43d3c)
at partbounds.c:4162
4162if (partmap1->from < 0 && partmap2->from < 0)
Missing separate debuginfos, use: debuginfo-install
keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64
libcom_err-1.41.12-24.el6.x86_64 libselinux-2.0.94-7.el6.x86_64
openssl-1.0.1e-57.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x00821aee in map_and_merge_partitions (partmaps1=0x2c1c8a8,
partmaps2=0x2c1c8e0, *index1=45540240*, index2=0,
next_index=0x7ffeebd43d3c) at partbounds.c:4162
#1  0x008226c3 in merge_null_partitions (outer_bi=0x2b6e338,
inner_bi=0x2bf90b0, outer_maps=0x2c1c8a8, inner_maps=0x2c1c8e0,
jointype=JOIN_INNER,
next_index=0x7ffeebd43d3c, null_index=0x7ffeebd43d38,
default_index=0x7ffeebd43d34) at partbounds.c:4610
#2  0x00821726 in partition_list_bounds_merge
(partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40,
inner_rel=0x2bf8d28, outer_parts=0x7ffeebd43ed8,
inner_parts=0x7ffeebd43ed0, jointype=JOIN_INNER) at partbounds.c:4031
#3  0x0081ff5d in partition_bounds_merge (partnatts=1,
partsupfunc=0x2ba3548, partcollation=0x2ba34e8, outer_rel=0x2b6ce40,
inner_rel=0x2bf8d28, jointype=JOIN_INNER,
outer_parts=0x7ffeebd43ed8, inner_parts=0x7ffeebd43ed0) at
partbounds.c:3053
#4  0x007c610f in try_partitionwise_join (root=0x2be2a28,
rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0,
parent_sjinfo=0x7ffeebd44010,
parent_restrictlist=0x2c1c070) at joinrels.c:1370
#5  0x007c5521 in populate_joinrel_with_paths (root=0x2be2a28,
rel1=0x2b6ce40, rel2=0x2bf8d28, joinrel=0x2c1b0f0, sjinfo=0x7ffeebd44010,
restrictlist=0x2c1c070)
at joinrels.c:914
#6  0x007c4f48 in make_join_rel (root=0x2be2a28, rel1=0x2b6ce40,
rel2=0x2bf8d28) at joinrels.c:748
#7  0x007c4514 in make_rels_by_clause_joins (root=0x2be2a28,
old_rel=0x2b6ce40, other_rels=0x2bae4d8) at joinrels.c:294
#8  0x007c41c8 in join_search_one_level (root=0x2be2a28, level=3)
at joinrels.c:116
#9  0x007abe59 in standard_join_search (root=0x2be2a28,
levels_needed=3, 

Re: Prevent extension creation in temporary schemas

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 09:33:55AM +0100, Chris Travers wrote:
> Ok so at present I see three distinct issues here, where maybe three
> different patches over time might be needed.
> 
> The issues are:
> 
> 1.  create extension pgcrypto with schema pg_temp; fails because there is
> no schema actually named pg_temp.

Yes, I agree that being able to accept pg_temp as an alias for the
temporary schema for extensions would be kind of nice.  Perhaps one
reason why this has not actually happened is that our user base does
not really have use cases for it though.

> 2.  If you work around this, the \dx shows temporary extensions in other
> sessions.  This is probably the most minor issue of the three.
> 3.  You cannot create the same extension in two different schemas.

I would like to think that it should be possible to create the same
extension linked to a temporary schema in multiple sessions in
parallel, as much as it is possible to create the same extension
across multiple schemas.  Both cases are actually linked, as temp schemas
are based on connection slots.  This would require some changes in the way
constraints are defined in catalogs for extensions.  Perhaps there is
simply no demand for it; I don't know.

> But I don't think there is likely to be a lot of user confusion here.  It
> is hard enough to install extensions in temporary schemas that those who do
> can be expected to know more than \dx commands.

The patch as it stands does not actually solve the root problem and
makes things a bit confusing, so I am marking it as returned with
feedback.  Working on this set of problems may be interesting, though
the efforts necessary to make that happen may not be worth the actual user
benefits.
--
Michael




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Heikki Linnakangas

On 06/03/2019 04:03, Peter Geoghegan wrote:

On Tue, Mar 5, 2019 at 3:37 AM Heikki Linnakangas  wrote:

I'm looking at the first patch in the series now. I'd suggest that you
commit that very soon. It's useful on its own, and seems pretty much
ready to be committed already. I don't think it will be much affected by
whatever changes we make to the later patches, anymore.


After staring at the first patch for a bit longer, a few things started to 
bother me:


* The new struct is called BTScanInsert, but it's used for searches, 
too. It makes sense when you read the README, which explains the 
difference between "search scan keys" and "insertion scan keys", but now 
that we have a separate struct for this, perhaps we should call insertion scan 
keys by a less confusing name. I don't know what to suggest, though. 
"Positioning key"?


* We store the binary search bounds in BTScanInsertData, but they're 
only used during insertions.


* The binary search bounds are specific for a particular buffer. But 
that buffer is passed around separately from the bounds. It seems easy 
to have them go out of sync, so that you try to use the cached bounds 
for a different page. The savebinsrch and restorebinsrch are used to deal 
with that, but it is pretty complicated.



I came up with the attached (against master), which addresses the 2nd 
and 3rd points. I added a whole new BTInsertStateData struct, to hold 
the binary search bounds. BTScanInsert now only holds the 'scankeys' 
array, and the 'nextkey' flag. The new BTInsertStateData struct also 
holds the current buffer we're considering to insert to, and a 
'bounds_valid' flag to indicate if the saved bounds are valid for the 
current buffer. That way, it's more straightforward to clear the 
'bounds_valid' flag whenever we move right.


I made a copy of _bt_binsrch, named _bt_binsrch_insert. It does the binary 
search like _bt_binsrch does, but the bounds caching is only done in 
_bt_binsrch_insert. Seems more clear to have separate functions for them 
now, even though there's some duplication.
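
To illustrate the caching scheme in isolation, here is a self-contained
C sketch with simplified stand-ins (not the actual nbtree structures or
field names): the first search over a sorted "page" records the bounds
it established, and a later insertion-offset search restarts within that
window as long as bounds_valid is set:

#include <stdbool.h>

/* Stand-in for the insert state: the "page" plus cached search bounds
 * that may only be trusted while bounds_valid is set. */
typedef struct InsertState
{
	const int  *items;			/* sorted "page" contents */
	int			nitems;
	bool		bounds_valid;
	int			low;			/* cached lower bound */
	int			high;			/* cached upper bound */
} InsertState;

/* Plain binary search for the first offset whose item is >= key. */
static int
binsrch_range(const int *items, int low, int high, int key)
{
	while (low < high)
	{
		int			mid = low + (high - low) / 2;

		if (items[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return low;
}

/* First search (think _bt_check_unique): remember the bounds. */
static int
search_and_save(InsertState *s, int key)
{
	int			off = binsrch_range(s->items, 0, s->nitems, key);

	s->low = off;				/* everything below off is < key */
	s->high = s->nitems;		/* (the real patch caches a tighter
								 * strict upper bound as well) */
	s->bounds_valid = true;
	return off;
}

/* Later search (think _bt_binsrch_insert): reuse the cached window
 * instead of searching the whole page again. */
static int
search_insert(InsertState *s, int key)
{
	int			lo = s->bounds_valid ? s->low : 0;
	int			hi = s->bounds_valid ? s->high : s->nitems;

	return binsrch_range(s->items, lo, hi, key);
}

Moving to another page would simply set bounds_valid = false, which is
the simplification being argued for here.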



+/* HEIKKI:
+Do we need 'checkunique' as an argument? If unique checks were not
+performed, the insertion key will simply not have saved state.
+*/


We need it in the next patch in the series, because it's also useful
for optimizing away the high key check with non-unique indexes. We
know that _bt_moveright() was called at the leaf level, with scantid
filled in, so there is no question of needing to move right within
_bt_findinsertloc() (provided it's a heapkeyspace index).


Hmm. Perhaps it would be better to move the call to _bt_binsrch (or 
_bt_binsrch_insert with this patch) to outside _bt_findinsertloc. So 
that _bt_findinsertloc would only be responsible for finding the correct 
page to insert to. So the overall code, after patch #2, would be like:


/*
 * Do the insertion. First move right to find the correct page to
 * insert to, if necessary. If we're inserting to a non-unique index,
 * _bt_search() already did this when it checked if a move to the
 * right was required for the leaf page.  Insertion scankey's scantid
 * would have been filled out at the time. On a unique index, the
 * current buffer is the first buffer containing duplicates, however,
 * so we may need to move right to the correct location for this
 * tuple.
 */
if (checkingunique || itup_key->heapkeyspace)
_bt_findinsertpage(rel, &insertstate, stack, heapRel);

newitemoff = _bt_binsrch_insert(rel, &insertstate);

_bt_insertonpg(rel, insertstate.buf, InvalidBuffer, stack, itup, 
newitemoff, false);


Does this make sense?


Actually, we even need it in the first patch: we only restore a binary
search because we know that there is something to restore, and must
ask for it to be restored explicitly (anything else seems unsafe).
Maybe we can't restore it because it's not a unique index, or maybe we
can't restore it because we microvacuumed, or moved right to get free
space. I don't think that it'll be helpful to make _bt_findinsertloc()
pretend that it doesn't know exactly where the binary search bounds
come from -- it already knows plenty about unique indexes
specifically, and about how it may have to invalidate the bounds. The
whole way that it couples buffer locks is only useful for unique
indexes, so it already knows *plenty* about unique indexes
specifically.


The attached new version simplifies this, IMHO. The bounds and the 
current buffer go together in the same struct, so it's easier to keep 
track whether the bounds are valid or not.



- * starting a regular index scan some can be omitted.  The array is used as a
+ * starting a regular index scan, some can be omitted.  The array is used as a
   * flexible array member, though it's sized in a way that makes it possible to
   * use stack allocations.  See nbtree/README for full details.
+
+HEIKKI: I don't see anything in the README about stack allocations. What
+exactly does the README reference refer to? No code seems to actually allocate
+this in the stack, so we don't 

Re: New vacuum option to do only freezing

2019-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2019 at 3:55 AM Robert Haas  wrote:
>
> On Tue, Mar 5, 2019 at 11:29 PM Masahiko Sawada  wrote:
> > The attached updated patch incorporates all of the comments. Also I've added
> > new reloption vacuum_index_cleanup as per discussion on the "reloption
> > to prevent VACUUM from truncating empty pages at the end of relation"
> > thread. Autovacuums also can skip index cleanup when the reloption is
> > set to false. Since setting this to false might lead to some problems,
> > I've made autovacuums report the number of dead tuples and dead
> > itemids we left.
>
> It seems to me that the disable_index_cleanup should be renamed
> index_cleanup and the default should be changed to true, for
> consistency with the reloption (and, perhaps, other patches).

Hmm, the patch already has the new reloption vacuum_index_cleanup, and
its default value is true, but did you mean I should change its name to
index_cleanup?

>
> - num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
> + num_tuples = live_tuples = tups_vacuumed  = nkeep = nunused =
> + nleft_dead_itemids = nleft_dead_tuples = 0;
>
> I would suggest leaving the existing line alone (and not adding an
> extra space to it as the patch does now) and just adding a second
> initialization on the next line as a separate statement. a = b = c = d
> = e = 0 isn't such great coding style that we should stick to it
> rigorously even when it ends up having to be broken across lines.

Fixed.

>
> + /* Index vacuum must be enabled in two-pass vacuum */
> + Assert(!skip_index_vacuum);
>
> I am a big believer in naming consistency.  Please, let's use the same
> name everywhere!  If it's going to be index_cleanup, then call the
> reloption vacuum_index_cleanup, and call the option index_cleanup, and
> call the variable index_cleanup.  Picking a different subset of
> cleanup, index, vacuum, skip, and disable for each new name makes it
> harder to understand.

Fixed.

>
> - * If there are no indexes then we can vacuum the page right now
> - * instead of doing a second scan.
> + * If there are no indexes or index vacuum is disabled we can
> + * vacuum the page right now instead of doing a second scan.
>
> This comment is wrong.  That wouldn't be safe.  And that's probably
> why it's not what the code does.

Fixed.

>
> - /* If no indexes, make log report that lazy_vacuum_heap would've made */
> + /*
> + * If no index or index vacuum is disabled, make log report that
> + * lazy_vacuum_heap would've make.
> + */
>   if (vacuumed_pages)
>
> Hmm, does this really do what the comment claims?  It looks to me like
> we only increment vacuumed_pages when we call lazy_vacuum_page(), and
> we (correctly) don't do that when index cleanup is disabled, but then
> here this claims that if (vacuumed_pages) will be true in that case.

You're right, vacuumed_pages can never be > 0 in the disable_index_cleanup case. Fixed.

>
> I wonder if it would be cleaner to rename vacrelstate->hasindex to
> 'useindex' and set it to false if there are no indexes or index
> cleanup is disabled.  But that might actually be worse, not sure.
>

I tried the changes, and it seems like a good idea to me. Fixed.

Attached are the updated patches.

Regards,

--


Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 8ab648be35fe4fd864fabd17adf72fb476a3367c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 7 Mar 2019 09:45:11 +0900
Subject: [PATCH v9 1/2] Add DISABLE_INDEX_CLEANUP option to VACUUM command

With this option, VACUUM does HOT-pruning for live tuples but doesn't
remove dead tuples completely and disables index vacuum.

vacrelstats->dead_tuples could have tuples that became dead after being
checked at HOT-pruning time, which are not marked as dead. Per
discussion on pgsql-hackers, we normally record and remove them, but
with this option we don't process them and leave them for the next
vacuum, to simplify the code. That's okay because it's a very rare
condition and those tuples will be processed by the next vacuum.
---
 doc/src/sgml/ref/create_table.sgml | 16 
 doc/src/sgml/ref/vacuum.sgml   | 27 
 src/backend/access/common/reloptions.c | 13 +-
 src/backend/access/heap/vacuumlazy.c   | 75 ++
 src/backend/commands/vacuum.c  | 15 ++-
 src/backend/parser/gram.y  |  2 +
 src/bin/psql/tab-complete.c|  4 +-
 src/include/nodes/parsenodes.h |  3 +-
 src/include/utils/rel.h|  1 +
 src/test/regress/expected/vacuum.out   |  3 ++
 src/test/regress/sql/vacuum.sql|  3 ++
 11 files changed, 142 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 22dbc07..cd9cea9 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1345,6 +1345,22 @@ WITH ( MODULUS numeric_literal, REM

 

+vacuum_index_cleanup (boolean)
+
+ 
+  Per table setting 

Re: ECPG regression with DECLARE STATEMENT support

2019-03-06 Thread Michael Meskes
Hi,

> Commit bd7c95f0c1a38becffceb3ea7234d57167f6d4bf added DECLARE
> STATEMENT support to ECPG.  This introduced the new rule
> for EXEC SQL CLOSE cur and with that it gets transformed into
> ECPGclose().
> 
> Prior to the above commit, someone could declare the
> cursor in a SQL statement, and "CLOSE cur_name" could
> also be executed as a normal statement.

That still works, the difference in your test case is that the DECLARE
statement is prepared.

> Example:
> 
> EXEC SQL PREPARE cur_query FROM "DECLARE cur1 CURSOR WITH HOLD FOR
> SELECT count(*) FROM pg_class";
> EXEC SQL PREPARE fetch_stmt FROM "FETCH next FROM cur1";
> EXEC SQL EXECUTE cur_query;
> EXEC SQL EXECUTE fetch_stmt INTO :c;
> EXEC SQL CLOSE cur1;
> 
> With commit bd7c95f0c1, "EXEC SQL CLOSE cur1" will fail
> and throw an error "sqlcode -245 The cursor is invalid".
> 
> I think the problem here is that ECPGclose() tries to find the
> cursor in "connection->cursor_stmts" and, if it doesn't
> find it there, just throws an error.  Maybe a fix is required
> in ECPGclose() - rather than throwing an error, continue
> executing the statement "CLOSE cur_name" with ecpg_do().

The problem as I see it is that the cursor is known to the backend but
not the library. Takaheshi-san, Hayato-san, any idea how to improve the
situation to not error out on statements that used to work?

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: SQL statement PREPARE does not work in ECPG

2019-03-06 Thread Michael Meskes
Hi all,

> ...
> But it doesn't allow using a host variable in the parameter clause of
> the EXECUTE statement like the following.
> I'm afraid that it's not useful. I will research the standard and
> other RDBMS.
> If you have some information, please advise me.

This also seems to be conflicting with
bd7c95f0c1a38becffceb3ea7234d57167f6d4bf. If we want to keep this
commit in for the release, I think we need to get these things fixed. 

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [HACKERS] Block level parallel vacuum

2019-03-06 Thread Masahiko Sawada
On Thu, Mar 7, 2019 at 2:54 AM Robert Haas  wrote:
>
> On Wed, Mar 6, 2019 at 1:26 AM Masahiko Sawada  wrote:
> > Okay, attached is the latest version of the patch set. I've incorporated all
> > comments I got and separated the patch for making vacuum options a
> > Node (0001 patch). And the patch doesn't use parallel_workers. It
> > might be proposed in the another form again in the future if
> > requested.
>
> Why make it a Node?  I mean I think a struct makes sense, but what's
> the point of giving it a NodeTag?
>

Well, the main point is consistency with other nodes and keeping the code clean.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-03-06 Thread Noah Misch
On Mon, Mar 04, 2019 at 06:04:20PM +0900, Kyotaro HORIGUCHI wrote:
> I found that I have 65(h) segments left alone on my environment:p

Did patched PostgreSQL create those, or did unpatched PostgreSQL create them?

> At Sat, 11 Aug 2018 23:48:15 -0700, Noah Misch  wrote in 
> <20180812064815.gb2301...@rfd.leadboat.com>
> > still doesn't detect.  I could pursue a fix via the aforementioned
> > sysv_shmem_key file, modulo the possibility of a DBA removing it.  I could
> > also, when postmaster.pid is missing, make sysv_shmem.c check the first N
> > (N=100?) keys applicable to the selected port.  My gut feeling is that 
> > neither
> > thing is worthwhile, but I'm interested to hear other opinions.
> 
> # Though I don't get the meaning of the "modulo" there..

It wasn't clear.  Adding a separate sysv_shmem_key file would not protect a
cluster if the DBA does "rm -f postmaster.pid sysv_shmem_key".  It would,
however, fix this test case:

+# Fail to reject startup if shm key N has become available and we crash while
+# using key N+1.  This is unwanted, but expected.  Windows is immune, because
+# its GetSharedMemName() use DataDir strings, not numeric keys.
+$flea->stop;# release first key
+is( $gnat->start(fail_ok => 1),
+   $TestLib::windows_os ? 0 : 1,
+   'key turnover fools only sysv_shmem.c');

> I think the only thing we must avoid here is sharing the same
> shmem segment with a living-dead server. If we can do that
> without the pid file, it would be better than relying on it. We
> could remove orphaned segments automatically, but I don't think
> we should go so far as to rely on a dedicated
> file for that.

There are two things to avoid.  First, during postmaster startup, avoid
sharing a segment with any other process.  Second, avoid sharing a data
directory with a running PostgreSQL process that was a child of some other
postmaster.  I think that's what you mean by "living-dead server".  The first
case is already prevented thoroughly, because we only proceed with startup
after creating a new segment with IPC_CREAT | IPC_EXCL.  The second case is
what this patch seeks to improve.
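
For reference, a minimal standalone sketch of that first protection
(plain System V shared memory calls, not PostgreSQL's sysv_shmem.c):
creating with IPC_CREAT | IPC_EXCL guarantees the call fails with EEXIST
rather than silently reusing a segment another process created:

#include <errno.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Try to create a brand-new segment for `key`; never attach to an
 * existing one.  Returns the segment id, or -1 if the key is taken. */
static int
create_exclusive_segment(key_t key, size_t size)
{
	int			shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);

	if (shmid < 0)
	{
		if (errno == EEXIST)
			fprintf(stderr, "key %ld exists; caller should try the next key\n",
					(long) key);
		return -1;
	}
	return shmid;
}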

> Also, I don't think it's worth stopping shmem id scanning
> at a certain number, since I can't come up with an appropriate
> number for it. But looking at "port * 1000", it might be expected
> that a usable segment id will be found while scanning that number of
> ids (1000).

If we find thousands of unusable segments, we do keep going until we find a
usable segment.  I was referring to a different situation.  Today, if there's
no postmaster.pid file and we're using port 5432, we first try segment ID
5432000.  If segment ID 5432000 is usable, we use it without examining any
other segment ID.  If segment ID 5432010 were in use by a "living-dead
server", we won't discover that fact.  We could increase the chance of
discovering that fact by checking every segment ID from 5432000 to 5432999.
(Once finished with that scan, we'd still create and use 5432000.)  I didn't
implement that idea in the patch, but I shared it to see if anyone thought it
was worth adding.
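
For concreteness, such a scan could look roughly like this (my
illustration, not the patch's code; the real thing would classify each
existing segment, e.g. with the patch's IpcMemoryAnalyze(), to spot one
still attached by children of a dead postmaster):

#include <errno.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Probe every SysV shm key derived from the port: port*1000 .. port*1000+999. */
static void
scan_port_keys(int port)
{
	for (int off = 0; off < 1000; off++)
	{
		key_t		key = (key_t) (port * 1000 + off);
		int			shmid = shmget(key, 0, 0);	/* probe only, never create */

		if (shmid >= 0)
		{
			/* segment exists: inspect its attach count, creator PID, etc. */
		}
		else if (errno != ENOENT)
		{
			/* e.g. EACCES: a segment exists but belongs to another user */
		}
	}
}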

> PGSharedMemoryCreate changed to choose a usable shmem id using
> the IpcMemoryAnalyze().  But some of the statuses from
> IpcMemoryAnalyze() are concealed by failure of
> PGSharedMemoryAttach() and silently ignored, as opposed to what the
> code is intending to do.

SHMSTATE_FOREIGN is ignored silently.  The patch modifies the
PGSharedMemoryAttach() header comment to direct callers to treat
PGSharedMemoryAttach() failure like SHMSTATE_FOREIGN.  I don't think the code
had an unintentional outcome due to the situation you describe.  Perhaps I
could have made the situation clearer.

> (By the way, SHMSTATE_EEXISTS seems to
> suggest the opposite of EEXIST, which would be confusing.)

Good catch.  Is SHMSTATE_ENOENT better?

> PGSharedMemoryCreate() repeats shmat/shmdt twice in every
> iteration. It won't harm so much but it would be better if we
> could get rid of that.

I'll try to achieve that and see if the code turns out cleaner.

Thanks,
nm



Re: Drop type "smgr"?

2019-03-06 Thread Thomas Munro
On Thu, Feb 28, 2019 at 7:02 PM Thomas Munro  wrote:
> The type smgr has only one value 'magnetic disk'.  ~15 years ago it
> also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
> a third value 'sony jukebox'.  Back then, all tables had an associated
> block storage manager, and it was recorded as an attribute relsmgr of
> pg_class (or pg_relation as it was known further back).  This was the
> type of that attribute, removed by Bruce in 3fa2bb31 (1997).
>
> Nothing seems to break if you remove it (except for some tests using
> it in an incidental way).  See attached.

Pushed.

Thanks all for the interesting discussion.  I'm trying out Anton
Shyrabokau's suggestion of stealing bits from the fork number.  More
on that soon.

-- 
Thomas Munro
https://enterprisedb.com



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Andres Freund
On 2019-03-07 15:25:34 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 11:37, Andres Freund  wrote:
> >
> > On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > > Do you think it's fine to reword the docs to make this point more
> > > clear, or do you see this as a fundamental problem with the patch?
> >
> > Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> > problem, but ...
> 
> Okay, so if I understand you correctly, you're complaining about the
> fact that if the user does:
> 
> CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;
> 
> that the user intended that all future partitions go to pg_default and
> not whatever default_tablespace is set to at the time?

Correct. And also, conversely: if default_tablespace was set to frakbar
at the time of CREATE TABLE, but no explicit TABLESPACE was provided,
then frakbar should *not* be the default for new partitions; rather,
default_tablespace should be consulted again.


> If I understand what you're saying correctly, then the listp1_a_idx
> should have been created in pg_default since that's what the default
> partitioned index tablespace was set to.

That's how I understand the intent of the user yes.


> > I don't think the argument that the user intended to explicitly set a
> > tablespace holds much water if it was just set via default_tablespace,
> > rather than an explicit TABLESPACE.  I think iff you really want
> > something like this feature, you'd have to mark a partition's
> > reltablespace as 0 unless an *explicit* assignment of the tablespace
> > happened. In which case you also would need to explicitly emit a
> > TABLESPACE for the partitioned table in pg_dump, to restore that.
> 
> I think emitting an explicit tablespace in pg_dump for partitioned
> tables (when non-zero) might have issues for pg_restore's
> --no-tablespaces option.

Yea, there'd probably need to be handling in pg_backup_archiver.c or
such, not sure how to do that best.

Greetings,

Andres Freund



Re: Batch insert in CTAS/MatView code

2019-03-06 Thread Michael Paquier
Hi,

On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:
> The copy code has used batch insert with function heap_multi_insert() to
> speed up. It seems that Create Table As or Materialized View could leverage
> that code to boost performance as well. Attached is a patch to
> implement that. That was done by Taylor (cc-ed) and me.

Please note that we are currently in the last commit fest of Postgres
12, and it is too late to propose new features.  Please feel free to
add an entry to the commit fest happening afterwards.
--
Michael




Re: pgsql: tableam: introduce table AM infrastructure.

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 03:03:44PM -0300, Alvaro Herrera wrote:
> On 2019-Mar-06, Andres Freund wrote:
>> tableam: introduce table AM infrastructure.
> 
> Thanks for doing this!!

+1.
--
Michael




Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 11:37, Andres Freund  wrote:
>
> On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> > Do you think it's fine to reword the docs to make this point more
> > clear, or do you see this as a fundamental problem with the patch?
>
> Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
> problem, but ...

Okay, so if I understand you correctly, you're complaining about the
fact that if the user does:

CREATE TABLE p (a int) PARTITION BY LIST(a) TABLESPACE pg_default;

that the user intended that all future partitions go to pg_default and
not whatever default_tablespace is set to at the time?

If so, that seems like a genuine concern.

I see in heap_create() we do;

/*
* Never allow a pg_class entry to explicitly specify the database's
* default tablespace in reltablespace; force it to zero instead. This
* ensures that if the database is cloned with a different default
* tablespace, the pg_class entry will still match where CREATE DATABASE
* will put the physically copied relation.
*
* Yes, this is a bit of a hack.
*/
if (reltablespace == MyDatabaseTableSpace)
reltablespace = InvalidOid;

which will zero pg_class.reltablespace if the specified tablespace
happens to match pg_database.dattablespace. This causes future
partitions to think that no tablespace was specified and therefore
DefineRelation() consults the default_tablespace.

I see that this same problem exists for partitioned indexes too:

create table listp (a int) partition by list(a);
create index on listp (a) tablespace pg_default;
set default_tablespace = n;
create table listp1 partition of listp for values in(1);
\d listp1
   Table "public.listp1"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+-
 a  | integer |   |  |
Partition of: listp FOR VALUES IN (1)
Indexes:
"listp1_a_idx" btree (a), tablespace "n"
Tablespace: "n"

If I understand what you're saying correctly, then the listp1_a_idx
should have been created in pg_default since that's what the default
partitioned index tablespace was set to.

> I don't think the argument that the user intended to explicitly set a
> tablespace holds much water if it was just set via default_tablespace,
> rather than an explicit TABLESPACE.  I think iff you really want
> something like this feature, you'd have to mark a partition's
> reltablespace as 0 unless an *explicit* assignment of the tablespace
> happened. In which case you also would need to explicitly emit a
> TABLESPACE for the partitioned table in pg_dump, to restore that.

I think emitting an explicit tablespace in pg_dump for partitioned
tables (when non-zero) might have issues for pg_restore's
--no-tablespaces option.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: few more wait events to add to docs

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 11:08:12AM -0800, Jeremy Schneider wrote:
> LWLock order in documentation:
> 1) CamelCase LWLocks: individually named - see lwlocknames.txt
> 2) lowercase LWLocks: tranches
> 2a) SLRUs - see SimpleLruInit() callers on doxygen
> 2b) Shared Buffer (buffer_content, buffer_io)
> 2c) Individually Named - see RegisterLWLockTranches() in lwlock.c

Hm, OK.  Perhaps I lack some user insight on the matter.  Thanks for
the feedback!  Still, there are some areas where we could make the
micro-ordering better.  For example, replication_slot_io and buffer_io
are I/O-specific, yet they land in the middle of the page, while we
want buffer_content close by.

One thing that I think we could do is reorganize at least
alphabetically the section for "Lock".  Each item does not really rely
on others.  What do you think?

> If anything, I think we might just want to add comments to
> RegisterLWLockTranches() and lwlocknames.txt with links to the doc file
> that needs to be updated whenever a new tranche is added.

Yes, that would surely help.

> Not sure the best place for a comment on SLRUs (is SimpleLruInit a good
> place?)... but I'm kindof hopeful that we're not adding many more new
> SLRUs anyway and that people would bias toward leveraging the buffer
> cache when possible.

A reference at the top of SimpleLruInit() sounds good to me.
--
Michael




Re: [bug fix] Produce a crash dump before main() on Windows

2019-03-06 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 5 Mar 2019 16:45:53 +1100, Haribabu Kommi wrote:
> > I don't have an idea about which is better behavior, but does
> > this work for you?
> >
> >
> > https://docs.microsoft.com/en-us/windows/desktop/wer/collecting-user-mode-dumps
> >
> 
> No, this option is not generating local dumps for postgresql, but this
> option is useful to
> generate dumps for the applications that don't have a specific WER
> reporting.

Mmm. Right. It just turns on WER for specific programs. It
doesn't override SetErrorMode(). Sorry for the noise.


> Adding some doc changes for users, explaining what the issue can be on
> Windows if the PostgreSQL server doesn't start and there is no core
> file available.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Online verification of checksums

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 08:53:57PM +0100, Tomas Vondra wrote:
> Not sure. AFAICS that would to require a single transaction, and if we
> happen to add some sort of throttling (which is a feature request I'd
> expect pretty soon to make it usable on live clusters) that might be
> quite long-running. So, not great.
> 
> If we want to run it from the server itself, then I guess a background
> worker would be a better solution. Incidentally, that's something I've
> been toying with some time ago, see [1].

It does not prevent having a SQL function which acts as a wrapper on
top of the whole routine logic, does it?  I think that it would be
nice to have the possibility to target a specific relation and a
specific page, as well as being able to check fully a relation at
once.  It gets easier to check for page ranges this way, and the
throttling can be part of the function doing a full-relation check.
--
Michael




Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 02:54:52PM +, Georgios Kokolatos wrote:
> Overall the patch looks good and according to the previous
> discussion fulfils its purpose. 
> 
> It might be worthwhile to also check for errors on close in
> SaveSlotToPath().

Thanks for the feedback, added.  I have spent some time
double-checking this stuff, and noticed that the new errors in
StartupReplicationOrigin() and CheckPointReplicationOrigin() should be
switched from ERROR to PANIC to be consistent.  One message in
dsm_impl_mmap() was not consistent either.

Are there any objections if I commit this patch?
--
Michael
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 9905593661..7b39283c89 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1991,7 +1991,10 @@ qtext_load_file(Size *buffer_size)
 		return NULL;
 	}
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+		ereport(LOG,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", PGSS_TEXT_FILE)));
 
 	*buffer_size = stat.st_size;
 	return buf;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f5cf9ffc9c..bce4274362 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -1202,7 +1202,10 @@ heap_xlog_logical_rewrite(XLogReaderState *r)
  errmsg("could not fsync file \"%s\": %m", path)));
 	pgstat_report_wait_end();
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
 }
 
 /* ---
@@ -1300,7 +1303,11 @@ CheckPointLogicalRewriteHeap(void)
 		(errcode_for_file_access(),
 		 errmsg("could not fsync file \"%s\": %m", path)));
 			pgstat_report_wait_end();
-			CloseTransientFile(fd);
+
+			if (CloseTransientFile(fd))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", path)));
 		}
 	}
 	FreeDir(mappings_dir);
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 3623352b9c..974d42fc86 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -599,7 +599,7 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno)
 
 	SlruFileName(ctl, path, segno);
 
-	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 	{
 		/* expected: file doesn't exist */
@@ -621,7 +621,13 @@ SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno)
 
 	result = endpos >= (off_t) (offset + BLCKSZ);
 
-	CloseTransientFile(fd);
+	if (CloseTransientFile(fd))
+	{
+		slru_errcause = SLRU_CLOSE_FAILED;
+		slru_errno = errno;
+		return false;
+	}
+
 	return result;
 }
 
@@ -654,7 +660,7 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
 	 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 	 * where the file doesn't exist, and return zeroes instead.
 	 */
-	fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
 	if (fd < 0)
 	{
 		if (errno != ENOENT || !InRecovery)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..cbd9b2cee1 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -370,7 +370,11 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 			}
 			pgstat_report_wait_end();
 		}
-		CloseTransientFile(srcfd);
+
+		if (CloseTransientFile(srcfd))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close file \"%s\": %m", path)));
 	}
 
 	/*
@@ -416,7 +420,6 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 (errcode_for_file_access(),
  errmsg("could not close file \"%s\": %m", tmppath)));
 
-
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
@@ -495,7 +498,6 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 (errcode_for_file_access(),
  errmsg("could not close file \"%s\": %m", tmppath)));
 
-
 	/*
 	 * Now move the completed history file into place with its final name.
 	 */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 64679dd2de..21986e48fe 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1297,7 +1297,11 @@ ReadTwoPhaseFile(TransactionId xid, bool missing_ok)
 	}
 
 	pgstat_report_wait_end();
-	CloseTransientFile(fd);
+
+	if (CloseTransientFile(fd))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
 
 	hdr = (TwoPhaseFileHeader *) buf;
 	if (hdr->magic != TWOPHASE_MAGIC)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0fdd82a287..c7047738b6 100644
--- 

RE: speeding up planning with partitions

2019-03-06 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There are redundant process in add_appendrel_other_rels (or
> expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index
> > rti) {
> > ListCell   *l;
> > RelOptInfo *rel;
> >
> > /*
> >  * Add inheritance children to the query if not already done. For
> child
> >  * tables that are themselves partitioned, their children will be
> added
> >  * recursively.
> >  */
> > if (rte->rtekind == RTE_RELATION
> && !root->contains_inherit_children)
> > {
> > expand_inherited_rtentry(root, rte, rti);
> > return;
> > }
> >
> > rel = find_base_rel(root, rti);
> >
> > foreach(l, root->append_rel_list)
> > {
> > AppendRelInfo  *appinfo = lfirst(l);
> > Index   childRTindex = appinfo->child_relid;
> > RangeTblEntry  *childrte;
> > RelOptInfo *childrel;
> >
> > if (appinfo->parent_relid != rti)
> > continue;
> >
> > Assert(childRTindex < root->simple_rel_array_size);
> > childrte = root->simple_rte_array[childRTindex];
> > Assert(childrte != NULL);
> > build_simple_rel(root, childRTindex, rel);
> >
> > /* Child may itself be an inherited relation. */
> > if (childrte->inh)
> > add_appendrel_other_rels(root, childrte, childRTindex);
> > }
> > }
> 
> If you don't initialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with non-zero number of
> partitions to have part_rels set to non-NULL.  At the moment, that
> includes the code that implements partitioning-specific optimizations
> such as partitionwise aggregate and join, run-time pruning, etc.  No existing
> regression tests cover the cases where a source inherited rel
> participates in those optimizations, so you didn't find anything that
> broke.  For an example, consider the following update query:
> 
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
> 
> Planner will create a run-time pruning aware Append node for p (aliased
> p1), for which it will need to use the part_rels array.  Note that p in
> this case is a source relation which the above code initializes.
> 
> Maybe we should add such regression tests.

Ah, now I understand: the code below that point in expand_inherited_rtentry()
initializes part_rels for the child queries after the first child target, and
those part_rels are used in the partitioning-specific optimizations. Thanks
for the explanation.

--
Yoshikazu Imai


Re: pgsql: Removed unused variable, openLogOff.

2019-03-06 Thread Michael Paquier
On Wed, Mar 06, 2019 at 02:47:16PM +, Robert Haas wrote:
> Removed unused variable, openLogOff.

Is that right for the report if data is written in chunks?  The same
patch was proposed a couple of weeks ago, and I commented about
it as follows:
https://www.postgresql.org/message-id/20190129043439.gb3...@paquier.xyz
--
Michael




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Peter Geoghegan
On Wed, Mar 6, 2019 at 1:37 PM Robert Haas  wrote:
> I know I'm stating the obvious here, but we don't have many weeks left
> at this point.  I have not reviewed any code, but I have been
> following this thread and I'd really like to see this work go into
> PostgreSQL 12, assuming it's in good enough shape.  It sounds like
> really good stuff.

Thanks!

Barring any objections, I plan to commit the first 3 patches (plus the
amcheck "relocate" patch) within 7 - 10 days (that's almost
everything). Heikki hasn't reviewed 'Add high key "continuescan"
optimization' yet, and it seems like he should take a look at that
before I proceed with it. But that seems like the least controversial
enhancement within the entire patch series, so I'm not very worried
about it.

I'm currently working on v15, which has comment-only revisions
requested by Heikki. I expect to continue to work with him to make
sure that he is happy with the presentation. I'll also need to
revalidate the performance of the patch series following recent minor
changes to the logic for choosing a split point. That can take days.
This is why I don't want to commit the first patch without committing
at least the first three all at once -- it increases the amount of
performance validation work I'll have to do considerably. (I have to
consider both v4 and v3 indexes already, which seems like enough
work.)

Two of the later patches (one of which I plan to push as part of the
first batch of commits) use heuristics to decide where to split the
page. As a Postgres contributor, I have learned to avoid inventing
heuristics, so this automatically makes me a bit uneasy. However, I
don't feel so bad about it here, on reflection. The on-disk size of
the TPC-C indexes are reduced by 35% with the 'Add "split after new
tuple" optimization' patch (I think that the entire database is
usually about 12% smaller). There simply isn't a fundamentally better
way to get the same benefit, and I'm sure that nobody will argue that
we should just accept the fact that the most influential database
benchmark of all time has a big index bloat problem with Postgres.
That would be crazy.

That said, it's not impossible that somebody will shout at me because
my heuristics made their index bloated. I can't see how that could
happen, but I am prepared. I can always adjust the heuristics when new
information comes to light. I have fairly thorough test cases that
should allow me to do this without regressing anything else. This is a
risk that can be managed sensibly.

There is no gnawing ambiguity about the on-disk changes laid down in
the second patch (nor the first patch), though. Making on-disk changes
is always a bit scary, but making the keys unique is clearly a big
improvement architecturally, as it brings nbtree closer to the Lehman
& Yao design without breaking anything for v3 indexes (v3 indexes
simply aren't allowed to use a heap TID in their scankey). Unique keys
also allow amcheck to relocate every tuple in the index from the root
page, using the same code path as regular index scans. We'll be
relying on the uniqueness of keys within amcheck from the beginning,
before anybody teaches nbtree to perform retail index tuple deletion.

-- 
Peter Geoghegan



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-06 Thread Bruce Momjian
On Wed, Mar  6, 2019 at 10:49:17AM -0800, Jeremy Schneider wrote:
> Might it make sense to generalize a little bit to secret management? It
> would be *great* if PostgreSQL could have a standard "secrets" API which
> could then use plugins or extensions to provide an internal
> implementation (software or hardware based) and/or plug in to an
> external secret management service, whether an OSS package installed on
> the box or some 3rd party service off the box.
> 
> The two obvious use cases are encryption keys (mentioned here) and
> passwords for things like logical replication, FDWs, dblinks, other
> extensions, etc. Aside from adding new encryption key secrets, the way
> PostgreSQL handles the existing secrets it already has today leaves room
> for improvement.

See this email for a possible implementation:


https://www.postgresql.org/message-id/20190222035816.uozqvc4wjyag3...@momjian.us

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Tomas Vondra
On 3/6/19 9:17 PM, Tom Lane wrote:
> Robert Haas  writes:
>> OK, so this is getting simpler, but I'm wondering why we need
>> dlist_move_tail() at all.  It is a well-known fact that maintaining
>> LRU ordering is expensive and it seems to be unnecessary for our
>> purposes here.
> 
> Yeah ... LRU maintenance was another thing that used to be in the
> catcache logic and was thrown out as far too expensive.  Your idea
> of just using a clock sweep instead seems plausible.
> 

I agree clock sweep might be sufficient, although the benchmarks done in
this thread so far do not suggest the LRU approach is very expensive.

A simple true/false flag, as proposed by Robert, would mean we can only
do the cleanup once per the catalog_cache_prune_min_age interval, so
with the default value (5 minutes) the entries might be between 5 and 10
minutes old. That's probably acceptable, although for higher values the
range gets wider and wider ...

Which part of the LRU approach is supposedly expensive? Updating the
lastaccess field or moving the entries to tail? I'd guess it's the
latter, so perhaps we can keep some sort of age field, update it less
frequently (once per minute?), and do the clock sweep?
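
For reference, the clock-sweep idea can be sketched in a few lines
(illustrative structures only, not the actual catcache ones): each access
merely sets a flag, and the pruning pass evicts entries whose flag is
already clear, clearing it on the ones it spares.

#include <stdbool.h>

typedef struct CacheEntry
{
	bool		referenced;		/* set to true on every cache hit */
	/* ... key, value, list links, etc. ... */
} CacheEntry;

static void evict(CacheEntry *entry);	/* hypothetical eviction helper */

static void
prune_with_clock_sweep(CacheEntry *entries, int nentries)
{
	for (int i = 0; i < nentries; i++)
	{
		if (entries[i].referenced)
			entries[i].referenced = false;	/* second chance */
		else
			evict(&entries[i]);		/* untouched since the last sweep */
	}
}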

BTW wasn't one of the cases this thread aimed to improve a session that
accesses a lot of objects in a short period of time? That balloons the
syscache, and while this patch evicts the entries from memory, we never
actually release the memory back (because AllocSet just moves it into
the freelists) and it's unlikely to get swapped out (because other
chunks on those memory pages are likely to be still used). I've proposed
to address that by recreating the context if it gets too bloated, and I
think Alvaro agreed with that. But I haven't seen any further discussion
about that.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Jerry Jelinek
On Wed, Mar 6, 2019 at 11:02 AM Alvaro Herrera wrote:

> On 2019-Mar-06, Robert Haas wrote:
>
> > On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera 
> wrote:
> > > I want your dictating software.
> >
> > I'm afraid this is just me and a keyboard, but sadly for me you're not
> > the first person to accuse me of producing giant walls of text.
>
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.
>
> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

It sounds like everyone is in agreement that I should get rid of the
single COW GUC tunable and provide two different tunables instead. I will
update the patch to go back to the original name (wal_recycle) for the
original WAL recycling behavior. The default value of that will be true to
provide the existing behavior. This matches my original proposal from last
year. I will add a new tunable (wal_init_zero) which will control the
zero-fill behavior for the WAL file. Again, the default value will be true
and provide the existing behavior. Both of these could (should) be set to
false for a COW filesystem like ZFS.
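
For illustration, on a copy-on-write filesystem that would translate into
something like this in postgresql.conf (GUC names as proposed above; both
default to on):

# postgresql.conf on a COW filesystem such as ZFS -- proposed settings
wal_recycle = off        # don't rename old WAL segments for reuse
wal_init_zero = off      # don't zero-fill newly created WAL segments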

If anyone objects to this new approach, let me know, otherwise I'll start
preparing an updated patch.

Thanks for all of the feedback,
Jerry


Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-07 11:56:57 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> > Here's a cleaned up version of that patch.  David, Alvaro, you also
> > played in that area, any objections? I think this makes that part of the
> > code easier to read actually. Robert, thanks for looking at that patch
> > already.
> 
> I only had a quick look and don't have a grasp of what the patch
> series is doing to tuple slots, but I didn't see anything I found
> alarming during the read.

Thanks for looking.

Re slots - the deal basically is that going forward low level
operations, like fetching a row from a table etc, have to be done by a
slot that's compatible with the "target" table. You can get compatible
slot callbakcs by calling table_slot_callbacks(), or directly create one
by calling table_gimmegimmeslot() (likely to be renamed :)).

The problem here was that the partition root's slot was used to fetch /
store rows from a child partition. By moving mt_existing into
ResultRelInfo that's not the case anymore.
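
As a rough illustration of the pattern (function names as used in this
patch series, so the details may still change before commit):

static TupleTableSlot *
make_partition_slot(Relation partRel)
{
	/* get slot callbacks matching the partition's table AM */
	const TupleTableSlotOps *ops = table_slot_callbacks(partRel);

	/* build a slot with the partition's row type and those callbacks */
	return MakeSingleTupleTableSlot(RelationGetDescr(partRel), ops);
}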

- Andres



Re: Pluggable Storage - Andres's take

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 08:33, Andres Freund  wrote:
> Here's a cleaned up version of that patch.  David, Alvaro, you also
> played in that area, any objections? I think this makes that part of the
> code easier to read actually. Robert, thanks for looking at that patch
> already.

I only had a quick look and don't have a grasp of what the patch
series is doing to tuple slots, but I didn't see anything I found
alarming during the read.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: shared-memory based stats collector

2019-03-06 Thread Andres Freund
Hi,

> From 88740269660d00d548910c2f3aa631878c7cf0d4 Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Thu, 21 Feb 2019 12:42:07 +0900
> Subject: [PATCH 4/6] Allow dsm to use on postmaster.
> 
> DSM is currently inhibited from being used in the postmaster. Shared
> memory based stats collector needs it to work on postmaster and no
> problem found to do that. Just allow it.

Maybe I'm missing something, but why? postmaster doesn't actually need
to process stats messages in any way?


> From 774b1495136db1ad6d174ab261487fdf6cb6a5ed Mon Sep 17 00:00:00 2001
> From: Kyotaro Horiguchi 
> Date: Thu, 21 Feb 2019 12:44:56 +0900
> Subject: [PATCH 5/6] Shared-memory based stats collector
> 
> Previously, activity statistics were shared via files on disk. Every
> backend sends the numbers to the stats collector process via a socket.
> It makes snapshots as a set of files on disk at a certain interval,
> then every backend reads them as necessary. It worked fine for a
> comparatively small set of statistics, but the set is under pressure
> to grow and the file size has reached the order of megabytes. To deal
> with a larger statistics set, this patch lets backends directly share
> the statistics via shared memory.

Btw, you can make the life of a committer easier by collecting the
reviewers and co-authors of a patch yourself...


This desperately needs an introductory comment in pgstat.c or such
explaining how the new scheme works.



> +LWLock   StatsMainLock;
> +#define  StatsLock (&StatsMainLock)

Wait, what? You can't just define a lock this way. That's process local
memory, locking that doesn't do anything useful.


> +/* Shared stats bootstrap information */
> +typedef struct StatsShmemStruct {

Please note that in PG's coding style the { comes in the next line.


> +/*
> + *  Backends store various database-wide info that's waiting to be
> + *  flushed out to shared memory in these variables.
> + */
> +static int   n_deadlocks = 0;
> +static size_tn_tmpfiles = 0;
> +static size_tn_tmpfilesize = 0;
> +
> +/*
> + * have_recovery_conflicts represents the existence of any kind of conflict
> + */
> +static bool  have_recovery_conflicts = false;
> +static int   n_conflict_tablespace = 0;
> +static int   n_conflict_lock = 0;
> +static int   n_conflict_snapshot = 0;
> +static int   n_conflict_bufferpin = 0;
> +static int   n_conflict_startup_deadlock = 0;

Probably worthwhile to group those into a struct, even just to make
debugging easier.
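
Something along these lines, presumably (sketch only; the struct name is
made up, the fields are the variables quoted above):

typedef struct PgStat_PendingStats
{
	int			n_deadlocks;
	size_t		n_tmpfiles;
	size_t		n_tmpfilesize;

	bool		have_recovery_conflicts;
	int			n_conflict_tablespace;
	int			n_conflict_lock;
	int			n_conflict_snapshot;
	int			n_conflict_bufferpin;
	int			n_conflict_startup_deadlock;
} PgStat_PendingStats;

static PgStat_PendingStats pending_stats;	/* replaces the file-level variables */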



>  
> -/* --
> - * pgstat_init() -
> - *
> - *   Called from postmaster at startup. Create the resources required
> - *   by the statistics collector process.  If unable to do so, do not
> - *   fail --- better to let the postmaster start with stats collection
> - *   disabled.
> - * --
> - */
> -void
> -pgstat_init(void)
> +static void
> +pgstat_postmaster_shutdown(int code, Datum arg)

You can't have a function like that without explaining why it's there.

> + /* trash the stats on crash */
> + if (code == 0)
> + pgstat_write_statsfiles();
>  }

And especially not without documenting what that code is supposed to
mean.



>  pgstat_report_stat(bool force)
>  {
> - /* we assume this inits to all zeroes: */
> - static const PgStat_TableCounts all_zeroes;
> - static TimestampTz last_report = 0;
> -
> + static TimestampTz last_flush = 0;
> + static TimestampTz pending_since = 0;
>   TimestampTz now;
> - PgStat_MsgTabstat regular_msg;
> - PgStat_MsgTabstat shared_msg;
> - TabStatusArray *tsa;
> - int i;
> + pgstat_flush_stat_context cxt = {0};
> + boolhave_other_stats = false;
> + boolpending_stats = false;
> + longelapsed;
> + longsecs;
> + int usecs;
> +
> + /* Do we have anything to flush? */
> + if (have_recovery_conflicts || n_deadlocks != 0 || n_tmpfiles != 0)
> + have_other_stats = true;
>  
>   /* Don't expend a clock check if nothing to do */
>   if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0) &&
>   pgStatXactCommit == 0 && pgStatXactRollback == 0 &&
> - !have_function_stats)
> - return;
> + !have_other_stats && !have_function_stats)
> + return 0;

"other" seems like a pretty mysterious category. Seems better to either
name precisely, or just use the underlying variables for checks.




> +/* ---
> + * Subroutines for pgstat_flush_stat.
> + * ---
> + */
> +
> +/*
> + * snapshot_statentry() - Find an entry from source dshash with cache.
> + *

Is snapshot_statentry() really a subroutine for pgstat_flush_stat()?

> +static void *
> +snapshot_statentry(pgstat_snapshot_cxt *cxt, Oid key)
> +{
> + char *lentry = NULL;
> + size_t keysize = cxt->dsh_params->key_size;
> + size_t dsh_entrysize = 

Re: Patch to document base64 encoding

2019-03-06 Thread Karl O. Pinc
On Wed, 6 Mar 2019 19:30:16 +0100 (CET)
Fabien COELHO  wrote:

> "... section 6.8" -> "... Section 6.8" (capital S).

Fixed.

> "The string and the binary encode and decode functions..." sentence
> looks strange to me, especially with the English article that I do
> not really master, so maybe it is ok. I'd have written something more 
> straightforward, eg: "Functions encode and decode support the
> following encodings:",

It is an atypical construction because I want to draw attention that
this is documentation not only for the encode() and decode() in
section 9.4. String Functions and Operators but also for
the encode() and decode() in section 9.5. Binary String Functions
and Operators.  Although I can't think of a better approach
it makes me uncomfortable that documentation written in
one section applies equally to functions in a different section.

Do you think it would be useful to hyperlink the word "binary"
to section 9.5?

The idiomatic phrasing would be "Both the string and the binary
encode and decode functions..." but the word "both" adds
no information.  Shorter is better.

> and also I'd use a direct "Function
> <...>decode ..." rather than "The decode
> function ..." (twice).

The straightforward English would be "Decode accepts...".  The problem
is that this begins the sentence with the name of a function.
This does not work very well when the function name is all lower case,
and can have other problems where clarity is lost depending 
on documentation output formatting.

I don't see a better approach.

> Maybe I'd use the exact same grammatical structure for all 3 cases, 
> starting with "The <>whatever encoding converts bla bla bla"
> instead of varying the sentences.

Agreed.  Good idea.  The first paragraph of each term has to 
do with encoding and the second with decoding.  
Uniformity in starting the second paragraphs helps make 
this clear, even though the first paragraphs are not uniform.
With this I am not concerned that the first paragraphs
do not have a common phrasing that's very explicit about
being about encoding.

Adjusted.

> Otherwise, all explanations look both precise and useful to me.

When writing I was slightly concerned about being overly precise;
permanently committing to behavior that might (possibly) be an artifact
of implementation.  E.g., that hex decoding accepts both
upper and lower case A-F characters, what input is ignored
and what raises an error, etc.  But it seems best
to document existing behavior, all of which has existed so long
anyway that changing it would be disruptive.  If anybody cares
they can object.

I wrote the docs by reading the code and did only a little
actual testing to be sure that what I wrote is correct.
I also did not check for regression tests which confirm
the behavior I'm documenting.  (It wouldn't hurt to have
such regression tests, if they don't already exist.
But writing regression tests is more than I want to take on 
with this patch.  Feel free to come up with tests.  :-)

I'm confident that the behavior I documented is how PG behaves
but you should know what I did in case you want further
validation.

Attached: doc_base64_v8.patch

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..e756bf53ba 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+  base64 encoding
+
 decode(string text,
 format text)

@@ -1769,16 +1772,25 @@
 
  encode
 
+
+ base64 encoding
+
+
+ hex encoding
+
+
+ escape encoding
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
-escape converts zero bytes and high-bit-set bytes to
-octal sequences (\nnn) and
-doubles backslashes.
+formats are:
+base64,
+hex,
+escape.

encode('123\000\001', 'base64')
MTIzAAE=
@@ -2365,6 +2377,90 @@
 format treats a NULL as a zero-element array.

 
+   
+ encode
+   
+   
+ decode
+   
+   
+ base64 encoding
+   
+   
+hex encoding
+   
+   
+escape encoding
+   
+
+   
+ The string and the binary encode
+ and decode functions support the following
+ encodings:
+
+ 
+   
+ base64
+ 
+   
+ The base64 encoding is that
+ of <ulink url="https://tools.ietf.org/html/rfc2045#section-6.8">RFC
+ 2045 Section 6.8</ulink>.  As per the RFC, encoded lines are
+ broken at 76 characters.  However instead of the MIME CRLF
+ end-of-line marker, only a newline is used for end-of-line.
+   
+   
+

Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-07 11:31:15 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 05:17, Andres Freund  wrote:
> > I'm also concerned that the the current catalog representation isn't
> > right. As I said:
> >
> > > I also find it far from clear that:
> > > 
> > >  
> > >   The tablespace_name is 
> > > the name
> > >   of the tablespace in which the new table is to be created.
> > >   If not specified,
> > >is consulted, or
> > >if the table is temporary.  
> > > For
> > >   partitioned tables, since no storage is required for the table 
> > > itself,
> > >   the tablespace specified here only serves to mark the default 
> > > tablespace
> > >   for any newly created partitions when no other tablespace is 
> > > explicitly
> > >   specified.
> > >  
> > > 
> > > is handled correctly. The above says that the *specified* tablespaces -
> > > which seems to exclude the default tablespace - is what's going to
> > > determine what partitions use as their default tablespace. But in fact
> > > that's not true, the partitioned table's pg_class.retablespace is set to
> > > what default_tablespaces was at the time of the creation.
> 
> Do you think it's fine to reword the docs to make this point more
> clear, or do you see this as a fundamental problem with the patch?

Hm, both? I mean I wouldn't necessarily characterize it as "fundamental"
problem, but ...

I don't think the argument that the user intended to explicitly set a
tablespace holds much water if it was just set via default_tablespace,
rather than an explicit TABLESPACE.  I think iff you really want
something like this feature, you'd have to mark a partition's
reltablespace as 0 unless an *explicit* assignment of the tablespace
happened. In which case you also would need to explicitly emit a
TABLESPACE for the partitioned table in pg_dump, to restore that.

Greetings,

Andres Freund



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 05:17, Andres Freund  wrote:
> I'm also concerned that the the current catalog representation isn't
> right. As I said:
>
> > I also find it far from clear that:
> > 
> >  
> >   The tablespace_name is 
> > the name
> >   of the tablespace in which the new table is to be created.
> >   If not specified,
> >is consulted, or
> >if the table is temporary.  For
> >   partitioned tables, since no storage is required for the table itself,
> >   the tablespace specified here only serves to mark the default 
> > tablespace
> >   for any newly created partitions when no other tablespace is 
> > explicitly
> >   specified.
> >  
> > 
> > is handled correctly. The above says that the *specified* tablespaces -
> > which seems to exclude the default tablespace - is what's going to
> > determine what partitions use as their default tablespace. But in fact
> > that's not true, the partitioned table's pg_class.retablespace is set to
> > what default_tablespaces was at the time of the creation.

Do you think it's fine to reword the docs to make this point more
clear, or do you see this as a fundamental problem with the patch?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 03:36, Tom Lane  wrote:
>
> David Rowley  writes:
> > As far as I can see, the biggest fundamental difference with doing
> > things this way will be that the column order of partitions will be
> > preserved, where before it would inherit the order of the partitioned
> > table.  I'm a little unsure if doing this column reordering was an
> > intended side-effect or not.
>
> Well, if the normal behavior results in changing the column order,
> it'd be necessary to do things differently in --binary-upgrade mode
> anyway, because there we *must* preserve column order.  I don't know
> if what you're describing represents a separate bug for pg_upgrade runs,
> but it might.  Is there any test case for the situation left behind by
> the core regression tests?

After having written the patch, I noticed that binary upgrade mode
does the CREATE TABLE then ATTACH PARTITION in order to preserve the
order.

After changing it nothing failed in make check-world with tap tests enabled.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread David Rowley
On Thu, 7 Mar 2019 at 08:54, Andrew Dunstan wrote:
>
>
> On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> > On 3/5/19 14:14, Andrew Dunstan wrote:
> >> This patch is tiny, seems perfectly reasonable, and has plenty of
> >> support. I'm going to commit it shortly unless there are last minute
> >> objections.
> > +1
> >
>
> done.

Thanks!

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: shared-memory based stats collector

2019-03-06 Thread Robert Haas
On Sun, Feb 24, 2019 at 11:53 PM Kyotaro HORIGUCHI wrote:
> The two aboves are fixed in the attached v17.

Andres just drew my attention to patch 0004 in this series, which is
definitely not OK.  That patch allows the postmaster to use dynamic
shared memory, claiming: "Shared memory based stats collector needs it
to work on postmaster and no problem found to do that. Just allow it."

But if you just look a little further down in the code from where that
assertion is located, you'll find this:

/* Lock the control segment so we can register the new segment. */
LWLockAcquire(DynamicSharedMemoryControlLock, LW_EXCLUSIVE);

It is a well-established principle that the postmaster must not
acquire any locks, because if it did, a corrupted shared memory
segment could take down not only individual backends but the
postmaster as well.  So this is entirely not OK in the postmaster.  I
think there might be other reasons as well why this is not OK that
aren't occurring to me at the moment, but that one is enough by
itself.

But even if for some reason that were OK, I'm pretty sure that any
design that involves the postmaster interacting with the data stored
in shared memory by the stats collector is an extremely bad idea.
Again, the postmaster is supposed to have as little interaction with
shared memory as possible, precisely so that it doesn't crash and
burn when some other process corrupts shared memory.  Dynamic shared
memory is included in that.  So, really, the LWLock here is just the
tip of the iceberg: the postmaster not only CAN'T safely run this
code, but it shouldn't WANT to do so.

And I'm kind of baffled that it does.  I haven't looked at the other
patches, but it seems to me that, while a shared-memory stats
collector is a good idea in general to avoid the I/O and CPU costs of
with reading and writing temporary files, I don't see why the
postmaster would need to be involved in any of that.  Whatever the
reason, though, I'm pretty sure that's GOT to be changed for this
patch set to have any chance of being accepted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 16:46:07 -0500, Andrew Dunstan wrote:
> 
> On 3/6/19 3:41 PM, Andres Freund wrote:
> > Hi,
> >
> > After my tableam patch Andrew's buildfarm animal started failing in the
> > cross-version upgrades:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake=2019-03-06%2019%3A32%3A24
> 
> 
> Incidentally, I just fixed a bug that was preventing the module from
> picking up its log files. The latest report has them all.

Ah, cool. I was wondering about that...

One thing I noticed is that it might need a more aggressive separator,
that report has:

\0\0\0rls_tbl_force\0\0\0\0ROW SECURITY\0\0\0\0\0K\0\0\0ALTER TABLE 
"regress_rls_schema"."rls_tbl_force" ENABLE ROW LEVEL 
SECURITY;\0\0\0\0\0\0\0\0\0\0\0\0regress_rls_schema\0\0\0\0\0\0\0 
\0\0\0buildfarm\0\0\0\0false\0\0\0\0350\0\0\0\0\0\0\0\0\0\0\0=
 REL_10_STABLE-pg_upgrade_dump_16384.log ==


Greetings,

Andres Freund



Re: Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andrew Dunstan


On 3/6/19 3:41 PM, Andres Freund wrote:
> Hi,
>
> After my tableam patch Andrew's buildfarm animal started failing in the
> cross-version upgrades:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake=2019-03-06%2019%3A32%3A24


Incidentally, I just fixed a bug that was preventing the module from
picking up its log files. The latest report has them all.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 3:03 PM Peter Geoghegan  wrote:
> I agree that the parts covered by the first patch in the series are
> very unlikely to need changes, but I hesitate to commit it weeks ahead
> of the other patches.

I know I'm stating the obvious here, but we don't have many weeks left
at this point.  I have not reviewed any code, but I have been
following this thread and I'd really like to see this work go into
PostgreSQL 12, assuming it's in good enough shape.  It sounds like
really good stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Robert Haas
On Fri, Mar 1, 2019 at 5:06 PM Joe Conway  wrote:
> Seems like it would be better to modify the arguments to
> CloseTransientFile() to include the filename being closed, errorlevel,
> and fail_on_error or something similar. Then all the repeated ereport
> stanzas could be eliminated.

Hmm.  I'm not sure that really saves much in terms of notation, and
it's less flexible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Binary upgrade from <12 to 12 creates toast table for partitioned tables

2019-03-06 Thread Andres Freund
Hi,

After my tableam patch Andrew's buildfarm animal started failing in the
cross-version upgrades:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake=2019-03-06%2019%3A32%3A24

But I actually don't think that's really the fault of the tableam patch. The
reason for the assertion failure is that we assert
Assert(relation->rd_rel->relam != InvalidOid);
for tables that have storage.  The table in question is a toast table.

The reason that table doesn't have an AM is that it's the toast table
for a partitioned relation, and for toast tables we just copy the AM
from the main table.

The backtrace shows:

#7  0x558b4bdbecfc in create_toast_table (rel=0x7fdab64289e0, toastOid=0, 
toastIndexOid=0, reloptions=0, lockmode=8, check=false)
at /home/andres/src/postgresql/src/backend/catalog/toasting.c:263
263 toast_relid = heap_create_with_catalog(toast_relname,
(gdb) p *rel->rd_rel
$2 = {oid = 80244, relname = {
data = 
"partitioned_table\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"},
  relnamespace = 2200, reltype = 80246, reloftype = 0, relowner = 10,
  relam = 0, relfilenode = 0, reltablespace = 0, relpages = 0,
  reltuples = 0, relallvisible = 0, reltoastrelid = 0, relhasindex = false,
  relisshared = false, relpersistence = 112 'p', relkind = 112 'p',
  relnatts = 2, relchecks = 0, relhasrules = false, relhastriggers = false,
  relhassubclass = false, relrowsecurity = false, relforcerowsecurity = false,
  relispopulated = true, relreplident = 100 'd', relispartition = false,
  relrewrite = 0, relfrozenxid = 0, relminmxid = 0}

that we're trying to create a toast table for a partitioned table. Which
seems wrong to me, given that recent commits made partitioned tables
have no storage.

The reason we're creating storage is that we're upgrading from a
version of PG where partitioned tables *did* have storage. And thus the
dump looks like:

-- For binary upgrade, must preserve pg_type oid
SELECT pg_catalog.binary_upgrade_set_next_pg_type_oid('80246'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type array oid
SELECT pg_catalog.binary_upgrade_set_next_array_pg_type_oid('80245'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_type toast oid
SELECT pg_catalog.binary_upgrade_set_next_toast_pg_type_oid('80248'::pg_catalog.oid);


-- For binary upgrade, must preserve pg_class oids
SELECT pg_catalog.binary_upgrade_set_next_heap_pg_class_oid('80244'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_toast_pg_class_oid('80247'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_index_pg_class_oid('80249'::pg_catalog.oid);

CREATE TABLE "public"."partitioned_table" (
"a" integer,
"b" "text"
)
PARTITION BY LIST ("a");


and create_toast_table() has logic like:

	{
		/*
		 * In binary-upgrade mode, create a TOAST table if and only if
		 * pg_upgrade told us to (ie, a TOAST table OID has been provided).
		 *
		 * This indicates that the old cluster had a TOAST table for the
		 * current table.  We must create a TOAST table to receive the old
		 * TOAST file, even if the table seems not to need one.
		 *
		 * Contrariwise, if the old cluster did not have a TOAST table, we
		 * should be able to get along without one even if the new version's
		 * needs_toast_table rules suggest we should have one.  There is a lot
		 * of daylight between where we will create a TOAST table and where
		 * one is really necessary to avoid failures, so small cross-version
		 * differences in the when-to-create heuristic shouldn't be a problem.
		 * If we tried to create a TOAST table anyway, we would have the
		 * problem that it might take up an OID that will conflict with some
		 * old-cluster table we haven't seen yet.
		 */
		if (!OidIsValid(binary_upgrade_next_toast_pg_class_oid) ||
			!OidIsValid(binary_upgrade_next_toast_pg_type_oid))
			return false;


I think we probably should have pg_dump suppress emitting information
about the toast table of partitioned tables?

While I'm not hugely bothered by binary upgrade mode creating
inconsistent states - there's plenty of ways to crash the server that
way - it probably also would be a good idea to have heap_create()
elog(ERROR) when accessmtd is invalid.
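
Something like this, presumably (sketch only; the exact set of relkinds
and the wording are to taste):

	/* relations with storage must have a table access method */
	if (!OidIsValid(accessmtd) &&
		(relkind == RELKIND_RELATION ||
		 relkind == RELKIND_MATVIEW ||
		 relkind == RELKIND_TOASTVALUE))
		elog(ERROR, "relation \"%s\" has storage but no table access method",
			 relname);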

Greetings,

Andres Freund



Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread Tomas Vondra



On 3/6/19 12:10 AM, David Rowley wrote:
> Thanks for chipping in on this.
> 
> On Wed, 6 Mar 2019 at 01:53, Tomas Vondra  
> wrote:
>> But on the other hand it feels a bit weird that we increase this one
>> value and leave all the other (also very conservative) defaults alone.
> 
> Which others did you have in mind? Like work_mem, shared_buffers?  If
> so, I mentioned in the initial post that I don't see vacuum_cost_limit
> as in the same category as those.  It's not like PostgreSQL won't
> start on a tiny server if vacuum_cost_limit is too high, but you will
> have issues with too big a shared_buffers, for example.   I think if
> we insist that this patch is a review of all the "how big is your
> server" GUCs then that's raising the bar significantly and
> unnecessarily for what I'm proposing here.
> 

On second thought, I think you're right. It's still true that you need
to bump up various other GUCs on reasonably current hardware, but it's
true vacuum_cost_limit is special enough to increase it separately.

so +1 (I see Andrew already pushed it, but anyway ...)

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Tom Lane
Robert Haas  writes:
> OK, so this is getting simpler, but I'm wondering why we need
> dlist_move_tail() at all.  It is a well-known fact that maintaining
> LRU ordering is expensive and it seems to be unnecessary for our
> purposes here.

Yeah ... LRU maintenance was another thing that used to be in the
catcache logic and was thrown out as far too expensive.  Your idea
of just using a clock sweep instead seems plausible.

regards, tom lane



Re: [HACKERS] Incomplete startup packet errors

2019-03-06 Thread Andrew Dunstan


On 3/6/19 12:12 PM, Robert Haas wrote:
> On Tue, Mar 5, 2019 at 5:35 PM Andrew Dunstan
>  wrote:
>> OK, I think we have agreement on Tom's patch. Do we want to backpatch
>> it? It's a change in behaviour, but I find it hard to believe anyone
>> relies on the existence of these annoying messages, so my vote would be
>> to backpatch it.
> I don't think it's a bug fix, so I don't think it should be
> back-patched.  I think trying to guess which behavior changes are
> likely to bother users is an unwise strategy -- it's very hard to know
> what will actually bother people, and it's very easy to let one's own
> desire to get a fix out the door lead to an unduly rosy view of the
> situation.  Plus, all patches carry some risk, because all developers
> make mistakes; the fewer things we back-patch, the fewer regressions
> we'll introduce.
>

OK, no back-patching it is.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 8:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
>> Not sure how to integrate it into the CLI tool, though. Perhaps it
>> could require connection info so that it can execute a function, when
>> executed in online mode?
> 
> To me the right fix would be to simply have this run as part of the
> cluster / in a function. I don't see much point in running this outside
> of the cluster.
> 

Not sure. AFAICS that would require a single transaction, and if we
happen to add some sort of throttling (a feature request I'd expect
pretty soon, to make it usable on live clusters) that might be
quite long-running. So, not great.

If we want to run it from the server itself, then I guess a background
worker would be a better solution. Incidentally, that's something I've
been toying with some time ago, see [1].


[1] https://github.com/tvondra/scrub
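For reference, registering such a worker from an extension's _PG_init()
would look roughly like this (the library and entry-point names are
placeholders):

    #include "postmaster/bgworker.h"

    void
    _PG_init(void)
    {
        BackgroundWorker worker;

        memset(&worker, 0, sizeof(worker));
        worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
            BGWORKER_BACKEND_DATABASE_CONNECTION;
        worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
        worker.bgw_restart_time = BGW_NEVER_RESTART;
        snprintf(worker.bgw_library_name, BGW_MAXLEN, "scrub");
        snprintf(worker.bgw_function_name, BGW_MAXLEN, "scrub_main");
        snprintf(worker.bgw_name, BGW_MAXLEN, "checksum scrub worker");
        RegisterBackgroundWorker(&worker);
    }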

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Should we increase the default vacuum_cost_limit?

2019-03-06 Thread Andrew Dunstan


On 3/6/19 1:38 PM, Jeremy Schneider wrote:
> On 3/5/19 14:14, Andrew Dunstan wrote:
>> This patch is tiny, seems perfectly reasonable, and has plenty of
>> support. I'm going to commit it shortly unless there are last minute
>> objections.
> +1
>

done.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimization of some jsonb functions

2019-03-06 Thread Andrew Dunstan


On 3/5/19 5:24 AM, David Steele wrote:
> On 2/22/19 2:05 AM, Nikita Glukhov wrote:
>> Attached set of patches with some jsonb optimizations that were made
>> during
>> comparison of performance of ordinary jsonb operators and jsonpath
>> operators.
>
> This patch was submitted just before the last commitfest for PG12 and
> seems to have potential for breakage.
>
> I have updated the target to PG13.
>
>

I think that's overly cautious. The first one I looked at, to optimize
JsonbExtractScalar, is very small, self-contained, and I think low risk.
I haven't looked at the others in detail, but I think at least some part
of this is reasonably committable.


I'll try to look at the others fairly shortly.


cheers


andrew



-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Make drop database safer

2019-03-06 Thread Ashwin Agrawal
On Wed, Mar 6, 2019 at 7:56 AM Tomas Vondra 
wrote:

>
>
> On 2/12/19 12:55 AM, Ashwin Agrawal wrote:
> >
> > Thanks for the response and inputs.
> >
> > On Sat, Feb 9, 2019 at 4:51 AM Andres Freund wrote:
> >
> > Hi,
> >
> > On 2019-02-08 16:36:13 -0800, Alexandra Wang wrote:
> >  > Current sequence of operations for drop database (dropdb())
> >  > 1. Start Transaction
> >  > 2. Make catalog changes
> >  > 3. Drop database buffers
> >  > 4. Forget database fsync requests
> >  > 5. Checkpoint
> >  > 6. Delete database directory
> >  > 7. Commit Transaction
> >  >
> >  > Problem
> >  > This sequence is unsafe from couple of fronts. Like if drop
> database,
> >  > aborts (means system can crash/shutdown can happen) right after
> > buffers are
> >  > dropped step 3 or step 4. The database will still exist and fully
> >  > accessible but will lose the data from the dirty buffers. This
> > seems very
> >  > bad.
> >  >
> >  > Operation can abort after step 5 as well in which can the entries
> > remain in
> >  > catalog but the database is not accessible. Which is bad as well
> > but not as
> >  > severe as above case mentioned, where it exists but some stuff
> goes
> >  > magically missing.
> >  >
> >  > Repo:
> >  > ```
> >  > CREATE DATABASE test;
> >  > \c test
> >  > CREATE TABLE t1(a int); CREATE TABLE t2(a int); CREATE TABLE t3(a
> > int);
> >  > \c postgres
> >  > DROP DATABASE test; <<== kill the session after
> > DropDatabaseBuffers()
> >  > (make sure to issue checkpoint before killing the session)
> >  > ```
> >  >
> >  > Proposed ways to fix
> >  > 1. CommitTransactionCommand() right after step 2. This makes it
> > fully safe
> >  > as the catalog will have the database dropped. Files may still
> > exist on
> >  > disk in some cases which is okay. This also makes it consistent
> > with the
> >  > approach used in movedb().
> >
> > To me this seems bad. The current failure mode obviously isn't good,
> but
> > the data obviously isn't valuable, and just losing track of an
> entire
> > database worth of data seems worse.
> >
> >
> > So, based on that response, it seems not losing track of the files
> > associated with the database is the design choice we wish to achieve.
> > Hence the catalog having an entry while the data directory has been
> > deleted is fine behavior to have and doesn't need to be solved.
> >
>
> What about adding 'is dropped' flag to pg_database, set it to true at
> the beginning of DROP DATABASE and commit? And ensure no one can connect
> to such database, making DROP DATABASE the only allowed operation?
>
> ISTM we could then continue doing the same thing we do today, without
> any extra checkpoints etc.
>

Sure, adding an 'is dropped' column flag to pg_database and committing that
update before dropping database buffers can give us the safety and also
allows us to keep the link. But it seems a very heavy-duty way to gain the
desired behavior, with an extra column in the pg_database catalog table
specifically to protect against this failure window. If this solution gets
at least one more vote as an okay route to take, I am fine implementing it.

I am surprised, though, that keeping the link to a database's worth of data
and not losing track of it is preferred for dropdb() but not cared about in
movedb(). In movedb(), the transaction commits first and then the old
database directory is deleted, which was the first solution proposed for
dropdb().

FWIW I don't recall why exactly we need the checkpoints, except perhaps
> to ensure the file copies see the most recent data (in CREATE DATABASE)
> and evict stuff for the to-be-dropped database from shared buffers. I
> wonder if we could do that without a checkpoint somehow ...
>

Checkpoint during CREATE DATABASE is understandable. But yes, during
dropdb() it seems unnecessary. The only rationale seems to be for Windows,
based on this comment in the code: "On Windows, this also ensures that
background procs don't hold any open files, which would cause rmdir() to
fail." I think we can avoid the checkpoint in dropdb() on all platforms
except Windows. Even on Windows, if we have a way to easily ensure that no
background procs have open files, the checkpoint can be avoided there too.
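A sketch of what that could look like in dropdb() (assuming the
Windows-only rationale holds; this is not the current code):

    /* only force the flush-everything checkpoint where rmdir() needs it */
    #ifdef WIN32
        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                          CHECKPOINT_WAIT);
    #endif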

> Considering the design choice we must meet, seems approach 2, moving
> > Checkpoint from step 5 before step 3 would give us the safety desired
> > and retain the desired link to the database till we actually delete the
> > files for it.
> >
>
> Ummm? That essentially means this order:
>
> 1. Start Transaction
> 2. Make catalog changes
> 5. Checkpoint
> 3. Drop database buffers
> 4. Forget database fsync requests
> 6. Delete database directory
> 7. Commit Transaction
>
> I don't see how that actually fixes any of the issues? Can you explain?
>

Since the checkpoint is performed first, all the dirty buffers for the
database will have been flushed to disk before they are dropped, so a crash
after that point no longer loses the data from those buffers.

Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Tomas Vondra
On 3/6/19 8:04 PM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 1:53 PM Alvaro Herrera  
> wrote:
>> On 2019-Feb-08, Tomas Vondra wrote:
>>> I'm wondering if we should just get rid of all such optimizations, and
>>> make the unowned list doubly-linked (WIP patch attached, needs fixing
>>> the comments etc.).
>>
>> +1 for that approach.
> 
> +1 for me, too.
> 
>> Did you consider using a dlist?
> 
> Maybe that is worthwhile, but this is a smaller change, which I think
> should count for quite a bit.  Nothing keeps somebody from doing the
> dlist change as a separate patch, if desired.
> 

Yeah, although now that I think about it I wouldn't expect the dlist
version to be much more complicated. We access next_unowned_reln in two
or three places, IIRC, so switching to a dlist would be trivial, I think.
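For reference, a sketch of the dlist variant using the existing
lib/ilist.h API (the 'node' member on the relation struct is an
assumption):

    #include "lib/ilist.h"

    static dlist_head unowned_relns = DLIST_STATIC_INIT(unowned_relns);

    /* adding an unowned relation: */
    dlist_push_head(&unowned_relns, &reln->node);

    /* removing it later is O(1); no scan of the list is needed: */
    dlist_delete(&reln->node);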

What worries me more is that I observe the opposite behavior to what's
described in the comments for b4166911 (which is from 2018, so not that old)
and 279628a0a7 (from 2013). So what changed since then? Seems fishy ...


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 20:37:39 +0100, Tomas Vondra wrote:
> Not sure how to integrate it into the CLI tool, though. Perhaps it
> could require connection info so that it can execute a function, when
> executed in online mode?

To me the right fix would be to simply have this run as part of the
cluster / in a function. I don't see much point in running this outside
of the cluster.

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra



On 3/6/19 6:42 PM, Andres Freund wrote:
> On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
>> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
>> wrote:
>>> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
 On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
  wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

 Maybe I'm confused here, but catching 80% of torn pages doesn't sound
 robust at all.
>>>
>>> The chance that pg_verify_checksums hits a torn page (at least in my
>>> tests, see below) is already pretty low, a couple of times per 1000
>>> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
>>> on. Otherwise, we now just issue a warning and skip the file (or so was
>>> the idea, see below), do you think that is not acceptable?
>>
>> Yeah.  Consider a paranoid customer with 100 clusters who runs this
>> every day on every cluster.  They're going to see failures every day
>> or three and go ballistic.
> 
> +1
> 
> 
>> I suspect that better retry logic might help here.  I mean, I would
>> guess that 10 retries at 1 second intervals or something of that sort
>> would be enough to virtually eliminate false positives while still
>> allowing us to report persistent -- and thus real -- problems.  But if
>> even that is going to produce false positives with any measurable
>> probability different from zero, then I think we have a problem,
>> because I neither like a verification tool that ignores possible signs
>> of trouble nor one that "cries wolf" when things are fine.
> 
> To me the right way seems to be to IO lock the page via PG after such a
> failure, and then retry. Which should be relatively easily doable for
> the basebackup case, but obviously harder for the pg_verify_checksums
> case.
> 

Yes, if we could ensure the retry happens after completing the current
I/O on the page (without actually initiating a read into shared buffers)
that would work I think - both for partial reads and torn pages.

Not sure how to integrate it into the CLI tool, though. Perhaps it
could require connection info so that it can execute a function, when
executed in online mode?

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches

Done and pushed. Some collation dependent fallout, I'm hoping I've just
fixed that.

> - cleanup & polish the ON CONFLICT refactoring

Here's a cleaned up version of that patch.  David, Alvaro, you also
played in that area, any objections? I think this makes that part of the
code easier to read actually. Robert, thanks for looking at that patch
already.

Greetings,

Andres Freund
>From e4caa7de3006370f52b4dafe204d45f9d99fa5a4 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Wed, 6 Mar 2019 11:30:33 -0800
Subject: [PATCH v16] Don't reuse slots between root and partition in ON
 CONFLICT ... UPDATE.

Until now the slot to store the conflicting tuple, and the result
of the ON CONFLICT SET, were reused between partitions. That
necessitated changing the slot's descriptor when switching partitions.

Besides the overhead of switching descriptors on a slot (which
requires memory allocations and prevents JITing), that's importantly
also problematic for tableam. There individual partitions might belong
to different tableams, needing different kinds of slots.

In passing, also fix ExecOnConflictUpdate to clear the existing slot at
exit. Otherwise that slot could continue to hold a pin till the query
ends, which could be far too long if the input data set is large and
there are no further conflicts. While previously also problematic, it's
now more important as there will be more such slots when partitioned.

Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n...@alap3.anarazel.de
---
 src/backend/executor/execPartition.c   | 52 +--
 src/backend/executor/nodeModifyTable.c | 70 +-
 src/include/nodes/execnodes.h  |  5 +-
 3 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e41801662b3..4491ee69912 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -723,28 +723,55 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
 			TupleConversionMap *map;
+			TupleDesc	leaf_desc;
 
 			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+			leaf_desc = RelationGetDescr(leaf_part_rri->ri_RelationDesc);
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
+			leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
+
+			/*
+			 * Need a separate existing slot for each partition, as the
+			 * partition could be of a different AM, even if the tuple
+			 * descriptors match.
+			 */
+			leaf_part_rri->ri_onConflict->oc_Existing =
+ExecInitExtraTupleSlot(mtstate->ps.state,
+	   leaf_desc,
+	   );
+
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
-			 * parent (the common case), we can simply re-use the parent's ON
+			 * parent (the common case), we can re-use most of the parent's ON
 			 * CONFLICT SET state, skipping a bunch of work.  Otherwise, we
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
+			{
+/*
+ * It's safe to reuse these from the partition root, as we
+ * only process one tuple at a time (therefore we won't
+ * overwrite needed data in slots), and the results of
+ * projections are independent of the underlying
+ * storage. Projections and where clauses themselves don't
+ * store state / are independent of the underlying storage.
+ */
+leaf_part_rri->ri_onConflict->oc_ProjSlot =
+	rootResultRelInfo->ri_onConflict->oc_ProjSlot;
+leaf_part_rri->ri_onConflict->oc_ProjInfo =
+	rootResultRelInfo->ri_onConflict->oc_ProjInfo;
+leaf_part_rri->ri_onConflict->oc_WhereClause =
+	rootResultRelInfo->ri_onConflict->oc_WhereClause;
+			}
 			else
 			{
 List	   *onconflset;
 TupleDesc	tupDesc;
 bool		found_whole_row;
 
-leaf_part_rri->ri_onConflict = makeNode(OnConflictSetState);
-
 /*
  * Translate expressions in onConflictSet to account for
  * different attribute numbers.  For that, map partition
@@ -778,20 +805,17 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 /* Finally, adjust this tlist to match the partition. */
 onconflset = adjust_partition_tlist(onconflset, map);
 
-/*
- * Build UPDATE SET's projection info.  The user of this
- * projection is responsible for setting the slot's tupdesc!
- * We set aside a tupdesc that's good for the common case of a
- * partition that's tupdesc-equal to the partitioned table;
- * partitions of different tupdescs must generate their own.
- */
+/* create the tuple slot for the UPDATE SET projection 

Re: Online verification of checksums

2019-03-06 Thread Tomas Vondra
On 3/6/19 6:26 PM, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
>  wrote:
>> FWIW I don't think this qualifies as torn page - i.e. it's not a full
>> read with a mix of old and new data. This is partial write, most likely
>> because we read the blocks one by one, and when we hit the last page
>> while the table is being extended, we may only see the first 4kB. And if
>> we retry very fast, we may still see only the first 4kB.
> 
> I see the distinction you're making, and you're right.  The problem
> is, whether in this case or whether for a real torn page, we don't
> seem to have a way to distinguish between a state that occurs
> transiently due to lack of synchronization and a situation that is
> permanent and means that we have corruption.  And that worries me,
> because it means we'll either report bogus complaints that will scare
> easily-panicked users (and anybody who is running this tool has a good
> chance of being in the "easily-panicked" category ...), or else we'll
> skip reporting real problems.  Neither is good.
> 

Sure, I'd also prefer having a tool that reliably detects all cases of
data corruption, and I certainly do share your concerns about false
positives and false negatives.

But maybe we shouldn't expect a tool meant to verify checksums to detect
various other issues.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Tomas Vondra



On 3/6/19 7:52 PM, Alvaro Herrera wrote:
> On 2019-Feb-08, Tomas Vondra wrote:
> 
>> I'm wondering if we should just get rid of all such optimizations, and
>> make the unowned list doubly-linked (WIP patch attached, needs fixing
>> the comments etc.).
> 
> +1 for that approach.
> 
> Did you consider using a dlist?
> 

I'm not sure. I might have considered it, but decided to go with a
simpler / less invasive fix demonstrating the effect. And maybe make it
more acceptable for backpatch, if we want that. Which we probably don't,
so I agree dlist might be a better choice.

cheers

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:53 PM Alvaro Herrera  wrote:
> On 2019-Feb-08, Tomas Vondra wrote:
> > I'm wondering if we should just get rid of all such optimizations, and
> > make the unowned list doubly-linked (WIP patch attached, needs fixing
> > the comments etc.).
>
> +1 for that approach.

+1 for me, too.

> Did you consider using a dlist?

Maybe that is worthwhile, but this is a smaller change, which I think
should count for quite a bit.  Nothing keeps somebody from doing the
dlist change as a separate patch, if desired.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-06 Thread Georgios Kokolatos
Overall the patch looks good and according to the previous discussion fulfils 
its purpose.

It might be worthwhile to also check for errors on close in SaveSlotToPath().

pgstat_report_wait_end();

CloseTransientFile(fd);

/* rename to permanent file, fsync file and directory */
if (rename(tmppath, path) != 0)
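For instance, the close there could be checked along these lines (a
sketch; the error level and message wording are assumptions):

    if (CloseTransientFile(fd) != 0)
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", tmppath)));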

Re: New vacuum option to do only freezing

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 11:29 PM Masahiko Sawada  wrote:
> Attached updated patch incorporated all of comments. Also I've added
> new reloption vacuum_index_cleanup as per discussion on the "reloption
> to prevent VACUUM from truncating empty pages at the end of relation"
> thread. Autovacuums also can skip index cleanup when the reloption is
> set to false. Since the setting this to false might lead some problems
> I've made autovacuums report the number of dead tuples and dead
> itemids we left.

It seems to me that the disable_index_cleanup should be renamed
index_cleanup and the default should be changed to true, for
consistency with the reloption (and, perhaps, other patches).

- num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
+ num_tuples = live_tuples = tups_vacuumed  = nkeep = nunused =
+ nleft_dead_itemids = nleft_dead_tuples = 0;

I would suggest leaving the existing line alone (and not adding an
extra space to it as the patch does now) and just adding a second
initialization on the next line as a separate statement. a = b = c = d
= e = 0 isn't such great coding style that we should stick to it
rigorously even when it ends up having to be broken across lines.
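That is, something like:

    num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
    nleft_dead_itemids = nleft_dead_tuples = 0;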

+ /* Index vacuum must be enabled in two-pass vacuum */
+ Assert(!skip_index_vacuum);

I am a big believer in naming consistency.  Please, let's use the same
name everywhere!  If it's going to be index_cleanup, then call the
reloption vacuum_index_cleanup, and call the option index_cleanup, and
call the variable index_cleanup.  Picking a different subset of
cleanup, index, vacuum, skip, and disable for each new name makes it
harder to understand.

- * If there are no indexes then we can vacuum the page right now
- * instead of doing a second scan.
+ * If there are no indexes or index vacuum is disabled we can
+ * vacuum the page right now instead of doing a second scan.

This comment is wrong.  That wouldn't be safe.  And that's probably
why it's not what the code does.

- /* If no indexes, make log report that lazy_vacuum_heap would've made */
+ /*
+ * If no index or index vacuum is disabled, make log report that
+ * lazy_vacuum_heap would've make.
+ */
  if (vacuumed_pages)

Hmm, does this really do what the comment claims?  It looks to me like
we only increment vacuumed_pages when we call lazy_vacuum_page(), and
we (correctly) don't do that when index cleanup is disabled, but then
here this claims that if (vacuumed_pages) will be true in that case.

I wonder if it would be cleaner to rename vacrelstate->hasindex to
'useindex' and set it to false if there are no indexes or index
cleanup is disabled.  But that might actually be worse, not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: performance issue in remove_from_unowned_list()

2019-03-06 Thread Alvaro Herrera
On 2019-Feb-08, Tomas Vondra wrote:

> I'm wondering if we should just get rid of all such optimizations, and
> make the unowned list doubly-linked (WIP patch attached, needs fixing
> the comments etc.).

+1 for that approach.

Did you consider using a dlist?

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Patch to document base64 encoding

2019-03-06 Thread Fabien COELHO




Attached: doc_base64_v7.patch


Patch applies cleanly, doc compiles, navigation tested and ok.

"... section 6.8" -> "... Section 6.8" (capital S).

"The string and the binary encode and decode functions..." sentence looks 
strange to me, especially with the English article that I do not really 
master, so maybe it is ok. I'd have written something more 
straightforward, eg: "Functions encode and decode support the following 
encodings:", and also I'd use a direct "Function <...>decode ..." 
rather than "The decode function ..." (twice).


Maybe I'd use the exact same grammatical structure for all 3 cases, 
starting with "The <>whatever encoding converts bla bla bla" instead of 
varying the sentences.


Otherwise, all explanations look both precise and useful to me.

--
Fabien.



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:02 PM Alvaro Herrera  wrote:
> Well, I don't have a problem reading long texts; my problem is that I'm
> unable to argue as quickly.

That's my secret weapon... except that it's not much of a secret.

> I do buy your argument, though (if reluctantly); in particular I was
> worried to offer a parameter (to turn off zero-filling of segments) that
> would enable dangerous behavior, but then I realized we also have
> fsync=off of which the same thing can be said.  So I agree we should
> have two GUCs, properly explained, with a warning where appropriate.

OK, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

2019-03-06 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:33 AM Kyotaro HORIGUCHI
 wrote:
> > > It is artificial (or actually won't be repeatedly executed in a
> > > session) but anyway what can get benefit from
> > > catalog_cache_memory_target would be a kind of extreme.
> >
> > I agree.  So then let's not have it.
>
> Ah... Yeah! I see. Andres' concern was that crucial syscache
> entries might be blown away during a long idle time. If that
> happens, it's enough to just turn it off in almost all of such
> cases.

+1.

> In the attached v18,
>catalog_cache_memory_target is removed,
>removed some leftovers of removing the hard limit feature,
>separated catcache clock update during a query into 0003.
>attached 0004 (monitor part) in order just to see how it is working.
>
> v18-0001-Add-dlist_move_tail:
>   Just adds dlist_move_tail
>
> v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
>   Revised pruning feature.

OK, so this is getting simpler, but I'm wondering why we need
dlist_move_tail() at all.  It is a well-known fact that maintaining
LRU ordering is expensive and it seems to be unnecessary for our
purposes here.  Can't CatCacheCleanupOldEntries just use a single-bit
flag on the entry?  If the flag is set, clear it.  If the flag is
clear, drop the entry.  When an entry is used, set the flag.  Then,
entries will go away if they are not used between consecutive calls to
CatCacheCleanupOldEntries.  Sure, that might be slightly less accurate
in terms of which entries get thrown away, but I bet it makes no real
difference.
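For illustration, a minimal sketch of that flag scheme (struct and helper
names are invented, not the actual catcache code):

    typedef struct CatCEntry
    {
        struct CatCEntry *next;
        bool        recently_used;  /* set on access, cleared by cleanup */
    } CatCEntry;

    /* drop entries not touched since the previous call */
    static void
    cleanup_old_entries(CatCEntry *head)
    {
        CatCEntry  *e = head;

        while (e != NULL)
        {
            CatCEntry  *nxt = e->next;  /* save before possibly freeing */

            if (e->recently_used)
                e->recently_used = false;   /* survives one more cycle */
            else
                remove_entry(e);            /* hypothetical eviction */
            e = nxt;
        }
    }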

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: tableam: introduce table AM infrastructure.

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Andres Freund wrote:

> tableam: introduce table AM infrastructure.

Thanks for doing this!!

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  
> wrote:
> > I want your dictating software.
> 
> I'm afraid this is just me and a keyboard, but sadly for me you're not
> the first person to accuse me of producing giant walls of text.

Well, I don't have a problem reading long texts; my problem is that I'm
unable to argue as quickly.

I do buy your argument, though (if reluctantly); in particular I was
worried to offer a parameter (to turn off zero-filling of segments) that
would enable dangerous behavior, but then I realized we also have
fsync=off of which the same thing can be said.  So I agree we should
have two GUCs, properly explained, with a warning where appropriate.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Block level parallel vacuum

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 1:26 AM Masahiko Sawada  wrote:
> Okay, attached the latest version of patch set. I've incorporated all
> comments I got and separated the patch for making vacuum options a
> Node (0001 patch). And the patch doesn't use parallel_workers. It
> might be proposed in the another form again in the future if
> requested.

Why make it a Node?  I mean I think a struct makes sense, but what's
the point of giving it a NodeTag?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Server Crash in logical decoding if used inside --single mode

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, tushar wrote:

> backend> select * from pg_logical_slot_get_changes('m7',null,null);
 [...]
> TRAP: FailedAssertion("!(slot != ((void *)0) && slot->active_pid != 0)",
> File: "slot.c", Line: 428)
> Aborted (core dumped)

See argumentation in
https://www.postgresql.org/message-id/flat/3b2f809f-326c-38dd-7a9e-897f957a4eb1%40enterprisedb.com
-- essentially, we don't really care to support this case.

If you want to submit a patch to report an error before crashing, that's
fine, but don't ask for this functionality to actually work.
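Such a patch could be little more than a guard like this (placement and
wording are assumptions):

    /* e.g. early in the logical decoding entry points */
    if (!IsUnderPostmaster)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("logical decoding is not supported in single-user mode")));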

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-06 Thread Andres Freund
On 2019-03-06 12:33:49 -0500, Robert Haas wrote:
> On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  
> wrote:
> > Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> > >  wrote:
> > > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > > This catches around 80% of the half-reads, but a few slip through. At
> > > > that point we bail out with exit(1), and the user can try again, which I
> > > > think is fine?
> > >
> > > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > > robust at all.
> >
> > The chance that pg_verify_checksums hits a torn page (at least in my
> > tests, see below) is already pretty low, a couple of times per 1000
> > runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> > on. Otherwise, we now just issue a warning and skip the file (or so was
> > the idea, see below), do you think that is not acceptable?
> 
> Yeah.  Consider a paranoid customer with 100 clusters who runs this
> every day on every cluster.  They're going to see failures every day
> or three and go ballistic.

+1


> I suspect that better retry logic might help here.  I mean, I would
> guess that 10 retries at 1 second intervals or something of that sort
> would be enough to virtually eliminate false positives while still
> allowing us to report persistent -- and thus real -- problems.  But if
> even that is going to produce false positives with any measurable
> probability different from zero, then I think we have a problem,
> because I neither like a verification tool that ignores possible signs
> of trouble nor one that "cries wolf" when things are fine.

To me the right way seems to be to IO lock the page via PG after such a
failure, and then retry. Which should be relatively easily doable for
the basebackup case, but obviously harder for the pg_verify_checksums
case.

Greetings,

Andres Freund



Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Pavel Stehule
st 6. 3. 2019 v 16:24 odesílatel Chapman Flack 
napsal:

> On 3/6/19 10:12 AM, Andrew Dunstan wrote:
>
> > Having reviewed the thread, I'm with Andres and Tom. Maybe though we
> > should have a note somewhere to the effect that you can't use VARIADIC
> > with these.
>
> Perhaps such a note belongs hoisted into the functions-conditional
> section of the manual, making a general observation that these things
> are conditional *expressions* that may resemble functions, but in
> particular, COALESCE, GREATEST, and LEAST cannot be called with
> keyword VARIADIC and an array argument, as they could if they were
> ordinary functions.
>

+1

Pavel


> Regards,
> -Chap
>


Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 12:13 PM Alvaro Herrera  wrote:
> I want your dictating software.

I'm afraid this is just me and a keyboard, but sadly for me you're not
the first person to accuse me of producing giant walls of text.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 5:45 AM Michael Banck  wrote:
> Am Freitag, den 01.03.2019, 18:03 -0500 schrieb Robert Haas:
> > On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
> >  wrote:
> > > I have added a retry for this as well now, without a pg_sleep() as well.
> > > This catches around 80% of the half-reads, but a few slip through. At
> > > that point we bail out with exit(1), and the user can try again, which I
> > > think is fine?
> >
> > Maybe I'm confused here, but catching 80% of torn pages doesn't sound
> > robust at all.
>
> The chance that pg_verify_checksums hits a torn page (at least in my
> tests, see below) is already pretty low, a couple of times per 1000
> runs. Maybe 4 out of 5 times, the page is read fine on retry and we march
> on. Otherwise, we now just issue a warning and skip the file (or so was
> the idea, see below), do you think that is not acceptable?

Yeah.  Consider a paranoid customer with 100 clusters who runs this
every day on every cluster.  They're going to see failures every day
or three and go ballistic.

I suspect that better retry logic might help here.  I mean, I would
guess that 10 retries at 1 second intervals or something of that sort
would be enough to virtually eliminate false positives while still
allowing us to report persistent -- and thus real -- problems.  But if
even that is going to produce false positives with any measurable
probability different from zero, then I think we have a problem,
because I neither like a verification tool that ignores possible signs
of trouble nor one that "cries wolf" when things are fine.
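As a sketch, that retry policy could look like this in the checksum
verifier (the constants and the checksum_matches() helper are
assumptions):

    #include <unistd.h>

    #define MAX_RETRIES     10
    #define RETRY_DELAY_SEC 1

    /* true once a clean read succeeds, false on a persistent failure */
    static bool
    read_block_with_retry(int fd, off_t offset, char *buf, size_t blcksz)
    {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
        {
            ssize_t nread = pread(fd, buf, blcksz, offset);

            /* checksum_matches() is a hypothetical helper */
            if (nread == (ssize_t) blcksz && checksum_matches(buf))
                return true;        /* transient condition resolved */
            sleep(RETRY_DELAY_SEC);
        }
        return false;               /* persistent -- and thus real -- problem */
    }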

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-06 Thread Robert Haas
On Sat, Mar 2, 2019 at 4:38 PM Tomas Vondra
 wrote:
> FWIW I don't think this qualifies as torn page - i.e. it's not a full
> read with a mix of old and new data. This is partial write, most likely
> because we read the blocks one by one, and when we hit the last page
> while the table is being extended, we may only see the first 4kB. And if
> we retry very fast, we may still see only the first 4kB.

I see the distinction you're making, and you're right.  The problem
is, whether in this case or whether for a real torn page, we don't
seem to have a way to distinguish between a state that occurs
transiently due to lack of synchronization and a situation that is
permanent and means that we have corruption.  And that worries me,
because it means we'll either report bogus complaints that will scare
easily-panicked users (and anybody who is running this tool has a good
chance of being in the "easily-panicked" category ...), or else we'll
skip reporting real problems.  Neither is good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
I want your dictating software.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Incomplete startup packet errors

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 5:35 PM Andrew Dunstan
 wrote:
> OK, I think we have agreement on Tom's patch. Do we want to backpatch
> it? It's a change in behaviour, but I find it hard to believe anyone
> relies on the existence of these annoying messages, so my vote would be
> to backpatch it.

I don't think it's a bug fix, so I don't think it should be
back-patched.  I think trying to guess which behavior changes are
likely to bother users is an unwise strategy -- it's very hard to know
what will actually bother people, and it's very easy to let one's own
desire to get a fix out the door lead to an unduly rosy view of the
situation.  Plus, all patches carry some risk, because all developers
make mistakes; the fewer things we back-patch, the fewer regressions
we'll introduce.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] kNN for btree

2019-03-06 Thread Nikita Glukhov

Attached 9th version of the patches.

On 03.03.2019 12:46, Anastasia Lubennikova wrote:


The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Hi,
thank you for your work on this patch.


Thank you for you review.


Patch #1 is ready for commit.
It only fixes lack of refactoring after INCLUDE index patch.

Patches 2-7 are ready for committer's review.
I found no significant problems in algorithm or implementation.
Here are few suggestions to improve readability:

1) patch 0002.
* Returned matched index clause exression.
* Number of matched index column is returned in *indexcol_p.

Typos in comment:
exPression
index columnS


"exPression" is fixed.
But there should be "column" because only single index column is matched.


2) patch 0002.
+   /*
+* We allow any column of this index to match each 
pathkey; they
+* don't have to match left-to-right as you might 
expect.
+*/

Before refactoring this comment had a line about gist and sp-gist specific:

-* We allow any column of the index to match each 
pathkey; they
-* don't have to match left-to-right as you might 
expect.  This is
-* correct for GiST, and it doesn't matter for SP-GiST 
because
-* that doesn't handle multiple columns anyway, and no 
other
-* existing AMs support amcanorderbyop.  We might need 
different
-* logic in future for other implementations.

Now, when the code was moved to a separate function, it is not clear why the
same logic is ok for b-tree and other index methods.  If I got it right, it
doesn't affect the correctness of the algorithm, because b-tree specific
checks are performed in another function.  Still it would be better to
explain it in the comment for future readers.


It seems that match_orderbyop_pathkey() is simply the wrong place for this
comment.  I moved it into match_orderbyop_pathkeys(), which is the
implementation of ammatchorderby() for GiST and SP-GiST.  Also I added the old
sentence about its correctness for GiST and SP-GiST.


3) patch 0004
  if (!so->distanceTypeByVal)
  {
so->state.currDistance = PointerGetDatum(NULL);
so->state.markDistance = PointerGetDatum(NULL);
   }

Why do we reset these fields only for !distanceTypeByVal?


These fields should be initialized (it is initialization, not reset) only for
by-ref types, because before writing new distance values to these fields
the previous by-ref values are pfreed.  The corresponding comment was added.


4) patch 0004
+ * _bt_next_item() -- Advance to next tuple on current page;
+ * or if there's no more, try to step to the next page with data.
+ *
+ * If there are no more matching records in the given direction
*/

Looks like the last sentence of the comment is unfinished.


Yes, "false is returned" is missing. Fixed.


5) patch 0004
_bt_start_knn_scan()

so->currRightIsNearest = _bt_compare_current_dist(so, rstate, lstate);
/* Reset right flag if the left item is nearer. */
right = so->currRightIsNearest;

If this comment relates to the string above it?


No, it relates only to the line below. The 'right' flag later determines the
selected scan direction, so 'currRightIsNearest' should be assigned to it.
This comment was rewritten.


6) patch 0004
_bt_parallel_seize()

+ scanPage = state == &so->state
+ ? &btscan->btps_scanPage
+ : &btscan->btps_knnScanPage;

This code looks a bit tricky to me. Why do we need to pass the BTScanState
to _bt_parallel_seize() at all? Won't it be enough to choose the page before
the function call and pass it?


If we pass the page, then we would have to pass it through the whole call
tree:
_bt_parallel_seize()
  _bt_steppage()
_bt_next_item()
  _bt_next_nearest()
_bt_load_first_page()
  _bt_init_knn_scan()
  _bt_readnextpage()
_bt_parallel_readpage()
  _bt_first()

I decided simply to add a flag 'isKnn' to BTScanState, so the code now looks
like this:
  scanPage = state->isKnn
    ? &btscan->btps_scanPage
    : &btscan->btps_knnScanPage;

I can also offer to add an 'id' (0/1) to BTScanState instead; then the code
would look like this:
  scanPage = &btscan->btps_scanPages[state->id];


7) patch 0006
+  Upgrade notes for version 1.6
+
+  
+   In version 1.6 btree_gist switched to using in-core
+   distance operators, and its own implementations were removed.  References to
+   these operators in btree_gist opclasses will be updated
+   automatically during the extension upgrade, but if the user has created
+   objects referencing these operators or functions, then these objects must be
+   dropped manually before updating the extension.

Is this comment still relevant after the latest changes?
Functions are not removed, they're still in 

Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:41 AM Alvaro Herrera  wrote:
> I can understand this argument.  Is there really a reason to change
> those two behaviors separately?

See my previous rely to Andrew, but also, I think you're putting the
burden of proof in the wrong place.  You could equally well ask "Is
there really a reason for work_mem to be different for sorts and index
builds?  For sorts and hashes?  For foreground and background
vacuums?"  Well, now we've got work_mem, maintenance_work_mem,
autovacuum_work_mem, and at least in my experience, that's not
necessarily fine-grained enough -- people can't predict whether their
maintenance_work_mem setting is OK because they don't know if
somebody's going to be running a foreground VACUUM or a CREATE INDEX
while it's in flight.  See also the bit about hash_mem in
https://www.postgresql.org/message-id/CAH2-WzmNwV=LfDRXPsmCqgmm91mp=2b4FvXNF=ccvmrb8yf...@mail.gmail.com
-- see also commit a1b395b6a26ae80cde17fdfd2def8d351872f399's
introduction of pending_list_cleanup_size, yet another place where we
started to decouple something that was inadvisably tied to work_mem.

There have been other cases, too, where we've bound unrelated things
together into a single parameter, and my feeling is that most of those
have turned out a mess.  Separate behaviors ought to be controlled by
separate settings, even though it means we'll end up with more
settings.  Two settings each of which does one clear and well-defined
thing can even be easier to understand than one setting that does
multiple loosely-related things.

> The reason I wrote the documentation
> weasely is that it seems pointless to have to update it whenever we
> implement more things controlled by the same GUC option (which we might,
> if we learn new things about how to use COW filesystems later on).
> AFAIR Jerry's wording was more precise about what the parameter did.  If
> the only reason to change those behaviors is to make WAL work better on
> COW filesystems, then I don't see the point in splitting the GUC in two,
> or documenting in minute detail what it does.

Really?  What about somebody who has a different experience from
Jerry?  They turn the parameter on in release N and it's good and then
the behavior changes in release N+1 and now it sucks and they read the
documentation and it tells them nothing about what has actually
changed.  They can neither get the behavior back that they liked nor
can they understand what behavior they're actually getting that is
causing a problem, because it's not documented.

I do not think our track record is very good when it comes to deciding
which things users "need to know about."  Users need to know what's
really happening.  The idea that we're just going to have a magic flag
here that is going to change all of the things that you want changed
when you're running on a copy-on-write filesystem and it's all going
to work great so that nobody cares about the details does not sound
very likely to be correct.  We don't even know that the same
combination of behavior is performant or safe on every filesystem out
there, let alone that future things that come along are going to have
similar properties.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 11:37 AM Andrew Dunstan
 wrote:
> Well, let's put the question another way. Is there any reason to allow
> skipping zero filling if we are recycling? That seems possibly
> dangerous. I can imagine turning off recycling but leaving on
> zero-filling, although I don't have a concrete use case for it ATM.

I think the short answer is that we don't know.  Any filesystem where
just writing the last byte of the file is good enough to guarantee
that all the intervening space is allocated can skip zero-filling.
Any system where creating new WAL files is faster than recycling old
ones can choose to do it that way.  I don't know how you can make a
categorical argument that there can't be a system where one of those
things -- either one -- is true and the other is false.  At least to
me, they seem like basically unrelated issues.
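To make the two behaviors concrete, here is a sketch of both allocation
strategies (not the actual xlog.c code; only for illustration):

    #include <unistd.h>

    /* sketch: create a new WAL segment, fully allocated or sparse */
    static void
    allocate_segment(int fd, bool zero_fill)
    {
        char        buf[XLOG_BLCKSZ] = {0};

        if (zero_fill)
        {
            /* write every block so the space is truly allocated */
            for (int blk = 0; blk < wal_segment_size / XLOG_BLCKSZ; blk++)
                write(fd, buf, XLOG_BLCKSZ);
        }
        else
        {
            /* write only the final byte; safe only where the filesystem
             * guarantees the intervening space */
            lseek(fd, wal_segment_size - 1, SEEK_SET);
            write(fd, buf, 1);
        }
    }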

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  
> wrote:
> > I think the idea of it being a generic tunable for assorted behavior
> > changes, rather than specific to WAL recycling, is a good one.  I'm
> > unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> 
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.

I can understand this argument.  Is there really a reason to change
those two behaviors separately?  The reason I wrote the documentation
weasely is that it seems pointless to have to update it whenever we
implement more things controlled by the same GUC option (which we might,
if we learn new things about how to use COW filesystems later on).
AFAIR Jerry's wording was more precise about what the parameter did.  If
the only reason to change those behaviors is to make WAL work better on
COW filesystems, then I don't see the point in splitting the GUC in two,
or documenting in minute detail what it does.

That said -- if there *is* such a reason, we can certainly split them up
and indicate to COW-filesystem users to change them both together.  I
don't think it's a big deal, but OTOH I see no reason to complicate
matters needlessly.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 11:30 AM, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
>  wrote:
>>> I *really* dislike this.  For one thing, it means that users don't
>>> have control over the behaviors individually.  For another, the
>>> documentation is now quite imprecise about what the option actually
>>> does, while expecting users to figure out whether those behaviors are
>>> acceptable or preferable in their environment.  It lists recycling of
>>> WAL files and zero-filling of those files as examples of behavior
>>> changes, but it does not say that those are the only changes, or even
>>> that they are made in all cases.
>> So you want two options, like wal_recycle_files and wal_zero_fill, both
>> defaulting to true? Is there a reasonably use case for turning one off
>> without the other?
> I don't know whether there's a use case for that, and that's one of
> the things that worries me.  I know, though, that if we have two
> parameters, then if there is a use case for it, people will be able to
> meet that use case without submitting a patch.  On the other hand, if
> we had convincing evidence that those two things should always go
> together, that would be fine, too.  But I don't see that anyone has
> made an argument that such a thing is necessarily true outside of ZFS.
>
> I actually wouldn't find it very surprising if disabling WAL recycling
> is sometimes beneficial even on ext4.  The fact that we haven't found
> such cases on this thread doesn't mean they don't exist.  On the other
> hand I think the wal_zero_fill behavior is not about performance but
> about reliability, so you can't afford to turn that on just because
> non-recycling happens to be faster on your machine.
>
>


Well, let's put the question another way. Is there any reason to allow
skipping zero filling if we are recycling? That seems possibly
dangerous. I can imagine turning off recycling but leaving on
zero-filling, although I don't have a concrete use case for it ATM.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Mar 6, 2019 at 10:55 AM Andrew Dunstan
 wrote:
> > I *really* dislike this.  For one thing, it means that users don't
> > have control over the behaviors individually.  For another, the
> > documentation is now quite imprecise about what the option actually
> > does, while expecting users to figure out whether those behaviors are
> > acceptable or preferable in their environment.  It lists recycling of
> > WAL files and zero-filling of those files as examples of behavior
> > changes, but it does not say that those are the only changes, or even
> > that they are made in all cases.
>
> So you want two options, like wal_recycle_files and wal_zero_fill, both
> defaulting to true? Is there a reasonable use case for turning one off
> without the other?

I don't know whether there's a use case for that, and that's one of
the things that worries me.  I know, though, that if we have two
parameters, then if there is a use case for it, people will be able to
meet that use case without submitting a patch.  On the other hand, if
we had convincing evidence that those two things should always go
together, that would be fine, too.  But I don't see that anyone has
made an argument that such a thing is necessarily true outside of ZFS.

I actually wouldn't find it very surprising if disabling WAL recycling
is sometimes beneficial even on ext4.  The fact that we haven't found
such cases on this thread doesn't mean they don't exist.  On the other
hand I think the wal_zero_fill behavior is not about performance but
about reliability, so you can't afford to turn that on just because
non-recycling happens to be faster on your machine.

> Alternatively, we could remove the 'for example' wording, which I agree
> is unfortunate.

Yeah.  We seem to sometimes like to avoid documenting specifics for
fear that, should they change, we'd have to update the documentation.
But I think that just makes the documentation less useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_dump is broken for partition tablespaces

2019-03-06 Thread Andres Freund
Hi,

On 2019-03-06 19:45:06 +1300, David Rowley wrote:
> Over on [1] Andres pointed out that the pg_dump support for the new to
> PG12 tablespace inheritance feature is broken.  This is the feature
> added in ca4103025dfe26 to allow a partitioned table to have a
> tablespace that acts as the default tablespace for newly attached
> partitions. The idea being that you can periodically change the
> default tablespace for new partitions to a tablespace that sits on a
> disk partition with more free space without affecting the default
> tablespace for normal non-partitioned tables. Anyway...

I'm also concerned that the current catalog representation isn't
right. As I said:

> I also find it far from clear that:
> 
> 
>   The tablespace_name is the name
>   of the tablespace in which the new table is to be created.
>   If not specified, default_tablespace is consulted, or
>   temp_tablespaces if the table is temporary.  For
>   partitioned tables, since no storage is required for the table itself,
>   the tablespace specified here only serves to mark the default tablespace
>   for any newly created partitions when no other tablespace is explicitly
>   specified.
> 
> 
> is handled correctly. The above says that the *specified* tablespaces -
> which seems to exclude the default tablespace - is what's going to
> determine what partitions use as their default tablespace. But in fact
> that's not true, the partitioned table's pg_class.reltablespace is set to
> what default_tablespace was at the time of the creation.

I still think the feature as is doesn't seem to have very well defined
behaviour.



> If we instead did:
> 
> CREATE TABLE public.listp1 (a integer
> );
> 
> ALTER TABLE public.listp ATTACH PARTITION public.listp1 FOR VALUES IN (1);

Isn't that a bit more expensive, because now the table needs to be
scanned for matching the value? That's probably negligible though, given
it'd probably always be empty.


Greetings,

Andres Freund



Re: PostgreSQL vs SQL/XML Standards

2019-03-06 Thread Chapman Flack
This CF entry shows Pavel and me as reviewers, but the included
patches were also produced by one or the other of us, so additional
review by someone who isn't us seems appropriate. :)

Would it make sense to remove one or both of us from the 'reviewers'
field in the app, to make it more obviously 'available' for reviewing?

One of the patches deals only with docs, so even someone who isn't
sure about reviewing XML functionality, but takes an interest in
documentation, would have a valuable role looking at that.

Regards,
-Chap



Re: Make drop database safer

2019-03-06 Thread Tomas Vondra




On 2/12/19 12:55 AM, Ashwin Agrawal wrote:


Thanks for the response and inputs.

On Sat, Feb 9, 2019 at 4:51 AM Andres Freund wrote:


Hi,

On 2019-02-08 16:36:13 -0800, Alexandra Wang wrote:
 > Current sequence of operations for drop database (dropdb())
 > 1. Start Transaction
 > 2. Make catalog changes
 > 3. Drop database buffers
 > 4. Forget database fsync requests
 > 5. Checkpoint
 > 6. Delete database directory
 > 7. Commit Transaction
 >
 > Problem
 > This sequence is unsafe from couple of fronts. Like if drop database,
 > aborts (means system can crash/shutdown can happen) right after
buffers are
 > dropped step 3 or step 4. The database will still exist and fully
 > accessible but will loose the data from the dirty buffers. This
seems very
 > bad.
 >
 > Operation can abort after step 5 as well in which can the entries
remain in
 > catalog but the database is not accessible. Which is bad as well
but not as
 > severe as above case mentioned, where it exists but some stuff goes
 > magically missing.
 >
 > Repo:
 > ```
 > CREATE DATABASE test;
 > \c test
 > CREATE TABLE t1(a int); CREATE TABLE t2(a int); CREATE TABLE t3(a
int);
 > \c postgres
 > DROP DATABASE test; <<== kill the session after
DropDatabaseBuffers()
 > (make sure to issue checkpoint before killing the session)
 > ```
 >
 > Proposed ways to fix
 > 1. CommitTransactionCommand() right after step 2. This makes it
fully safe
 > as the catalog will have the database dropped. Files may still
exist on
 > disk in some cases which is okay. This also makes it consistent
with the
 > approach used in movedb().

To me this seems bad. The current failure mode obviously isn't good, but
the data obviously isn't valuable, and just losing track of an entire
database worth of data seems worse.


So, based on that response, it seems not losing track of the files
associated with the database is the design choice we wish to achieve.
Hence the catalog having an entry while the data directory has been
deleted is fine behavior to have and doesn't need to be solved.




What about adding 'is dropped' flag to pg_database, set it to true at 
the beginning of DROP DATABASE and commit? And ensure no one can connect 
to such database, making DROP DATABASE the only allowed operation?


ISTM we could then continue doing the same thing we do today, without 
any extra checkpoints etc.
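For illustration, the connect-time guard could be as simple as this (the 
column name is hypothetical):

    /* e.g. in CheckMyDatabase(), after fetching the pg_database tuple */
    if (dbform->datisdropped)       /* hypothetical new column */
        ereport(FATAL,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("database \"%s\" is being dropped", name)));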



 > 2. An alternative way to make it safer is to perform the Checkpoint
 > (step 5) just before dropping the database buffers, to avoid the
 > unsafe window. Caveats of this solution are:
 > - It performs IO for data which, in the success case, will get deleted
 > anyway
 > - It still doesn't cover the case where the catalog has the database
 > entry but the files are removed from disk

That seems like an unacceptable slowdown.


Given that dropping a database should be an infrequent operation, and the 
only additional IO cost is for buffers of that database itself (a 
checkpoint is performed in a later step anyway), is it really an 
unacceptable slowdown compared to the safety it brings?




That's probably true, although I do know quite a few systems that create 
and drop databases fairly often. And the implied explicit checkpoints 
are quite painful, so I'd vote not to make this worse.


FWIW I don't recall why exactly we need the checkpoints, except perhaps 
to ensure the file copies see the most recent data (in CREATE DATABASE) 
and to evict stuff for the to-be-dropped database from shared buffers. I 
wonder if we could do that without a checkpoint somehow ...




 > 3. A fancier approach is to use the pending-delete mechanism employed
 > by relation drops to perform these non-catalog activities at commit.
 > The pending-delete structure can easily gain a boolean to convey
 > dropping a database directory instead of a file. Given that drop
 > database can't be performed inside a transaction, it doesn't have to
 > be done this way, but it makes for one consistent approach to on-disk
 > removal.

ISTM we'd need to do something like this.


Given the above design choice to retain the link to the database files 
until they are actually deleted, I'm not seeing why the pending-delete 
approach is any better than approach 1. It would allow us to track the 
database OID in the commit transaction's xlog record, but any checkpoint 
after that still loses the reference to the database. That is the same 
as in approach 1, where a separate XLOG_DBASE_DROP xlog record is 
written just after committing the transaction.
When we proposed approach 3, we thought it was functionally the same as 
approach 1, just different in implementation. But your preference for 
this approach, while calling approach 1 bad, reads as if the 
pending-deletes approach is functionally different; we would like to 
hear more about how.





Re: patch to allow disable of WAL recycling

2019-03-06 Thread Andrew Dunstan


On 3/6/19 10:38 AM, Robert Haas wrote:
> On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera wrote:
>> I think the idea of it being a generic tunable for assorted behavior
>> changes, rather than specific to WAL recycling, is a good one.  I'm
>> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
> I *really* dislike this.  For one thing, it means that users don't
> have control over the behaviors individually.  For another, the
> documentation is now quite imprecise about what the option actually
> does, while expecting users to figure out whether those behaviors are
> acceptable or preferable in their environment.  It lists recycling of
> WAL files and zero-filling of those files as examples of behavior
> changes, but it does not say that those are the only changes, or even
> that they are made in all cases.
>

So you want two options, like wal_recycle_files and wal_zero_fill, both
defaulting to true? Is there a reasonable use case for turning one off
without the other?
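
For concreteness, a sketch of how the two-knob version might look from
the user side; the GUC names here are just the ones floated above, not
settings that exist today:

```
-- Hypothetical GUCs, per the naming suggested in this message:
ALTER SYSTEM SET wal_recycle_files = off;  -- create fresh segments rather
                                           -- than renaming old ones
ALTER SYSTEM SET wal_zero_fill = off;      -- skip zero-filling new segments
SELECT pg_reload_conf();
```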


Alternatively, we could remove the "for example" wording, which I agree
is unfortunate.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: A separate table level option to control compression

2019-03-06 Thread Robert Haas
On Tue, Mar 5, 2019 at 5:30 PM Andrew Dunstan wrote:
> This is a nice idea, and I'm a bit surprised it hasn't got more
> attention. The patch itself seems very simple and straightforward,
> although it could probably do with having several sets of eyeballs on it.

I haven't needed this for anything, but it seems reasonable to me.
The documentation seems to need some wordsmithing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: patch to allow disable of WAL recycling

2019-03-06 Thread Robert Haas
On Wed, Feb 27, 2019 at 6:12 PM Alvaro Herrera  wrote:
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I *really* dislike this.  For one thing, it means that users don't
have control over the behaviors individually.  For another, the
documentation is now quite imprecise about what the option actually
does, while expecting users to figure out whether those behaviors are
acceptable or preferable in their environment.  It lists recycling of
WAL files and zero-filling of those files as examples of behavior
changes, but it does not say that those are the only changes, or even
that they are made in all cases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-03-06 Thread Andrey Borodin
Hi!


> On 20 Feb 2019, at 17:06, Alexey Kondratov wrote:
> 
>> I will work out this one with postgres -C and come back by the next 
>> commitfest. I found that something similar is already used in pg_ctl, 
>> and there is a mechanism for finding valid executables in exec.c. So it 
>> does not seem to be a big deal at first sight.
> 
> I have reworked the patch, please find the new version attached. It is 
> 3 times smaller than the previous one and now touches pg_rewind's code 
> only. Tests are also slightly refactored in order to remove duplicated 
> code. Execution of postgres -C is used for restore_command retrieval 
> (if -r is passed), as suggested. Otherwise everything works as before.

The new patch is much smaller (less than 400 lines) and works as advertised.
There's a typo "retreive" there.

These lines look a little suspicious:
char    postgres_exec_path[MAXPGPATH],
        postgres_cmd[MAXPGPATH],
        cmd_output[MAX_RESTORE_COMMAND];
Is there supposed to be any difference between MAXPGPATH and MAX_RESTORE_COMMAND?

Besides this, patch looks fine to me.

Best regards, Andrey Borodin.


Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Chapman Flack
On 3/6/19 10:12 AM, Andrew Dunstan wrote:

> Having reviewed the thread, I'm with Andres and Tom. Maybe though we
> should have a note somewhere to the effect that you can't use VARIADIC
> with these.

Perhaps such a note belongs hoisted into the functions-conditional
section of the manual, making a general observation that these things
are conditional *expressions* that may resemble functions, but in
particular, COALESCE, GREATEST, and LEAST cannot be called with
keyword VARIADIC and an array argument, as they could if they were
ordinary functions.
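
A quick illustration (behavior as of current PostgreSQL; the comments
paraphrase the results):

```
-- An ordinary variadic function accepts the VARIADIC keyword with an
-- array argument:
SELECT format('%s-%s', VARIADIC ARRAY['a', 'b']);   -- returns 'a-b'

-- GREATEST is a conditional expression, not a true function, so the same
-- spelling is rejected by the grammar:
SELECT GREATEST(VARIADIC ARRAY[1, 2, 3]);           -- syntax error
```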

Regards,
-Chap



Re: [HACKERS] CLUSTER command progress monitor

2019-03-06 Thread Alvaro Herrera
On 2019-Mar-06, Robert Haas wrote:

> On Tue, Mar 5, 2019 at 8:03 PM Alvaro Herrera wrote:
> > One, err, small issue with that idea is that we need the param numbers
> > not to conflict for any "progress update providers" that are to be used
> > simultaneously by any command.
> 
> Is that really an issue?  I think progress reporting -- at least with
> the current infrastructure -- is only ever going to be possible for
> utility commands, not queries.  And those really shouldn't have very
> many sorts going on at once.

Well, I don't think it is, but I thought it was worth pointing out
explicitly.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: proposal: variadic argument support for least, greatest function

2019-03-06 Thread Andrew Dunstan


On 3/4/19 9:39 AM, David Steele wrote:
> On 3/1/19 3:59 AM, Chapman Flack wrote:
>> The following review has been posted through the commitfest application:
>> make installcheck-world:  tested, passed
>> Implements feature:   tested, passed
>> Spec compliant:   not tested
>> Documentation:    tested, passed
>>
>> For completeness, I'll mark this reviewed again. It passes
>> installcheck-world, the latest patch addresses the suggestions from
>> me, and is improved on the code-structure matters that Tom pointed
>> out. I don't know if it will meet Tom's threshold for desirability
>> overall, but that sounds like a committer call at this point, so I'll
>> change it to RfC.
>
> Both committers who have looked at this patch (Tom, and Andres in his
> patch roundup [1]) recommend that it be rejected.
>
> If no committer steps up in the next week I think we should mark it as
> rejected.
>
>

Having reviewed the thread, I'm with Andres and Tom. Maybe though we
should have a note somewhere to the effect that you can't use VARIADIC
with these.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



