Re: BTP_DELETED leaf still in tree

2019-10-11 Thread Peter Geoghegan
On Fri, Oct 11, 2019 at 12:44 AM Daniel Wood  wrote:
> > Actually, I take it back -- the looping part is not normal. The
> > btpo_next->btpo_next page has no business linking back to the
> > original/first deleted page you mentioned. That's just odd.
>
> btpo_next->btpo_next does NOT link directly back to the 1st deleted page.  It 
> simply links to some in-use page which is 50 or so leaf pages back in the 
> tree.

That sounds more normal.

> > Can you provide me with a dump of the page images? The easiest way of
> > getting a page dump is described here:
>
> Customer data.  Looks like meaningless customer data (5 digit key values).  
> But too much paperwork. :-)

Well, it was worth a try.  ;-)

> The hard part for me to understand isn't just why the DELETED leaf node is 
> still referenced in the mid tree node.
> It is that the step which sets BTP_DELETED should have also linked its leaf 
> and right siblings together.  But this hasn't been done.

Before the page becomes BTP_DELETED, it must first be BTP_HALF_DEAD.
And that is also the point where it should be impossible for scans to
reach the page, more or less (there is still that narrow window where
the downlink can be followed just before it's deleted, making the scan
land on the BTP_HALF_DEAD page -- I mentioned this in my first mail).

> Could the page already have been dirty, but because of "target != 
> leafblkno", we didn't stamp a new LSN on it.  Could this allow us to write 
> the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and 
> XLOG_BTREE_UNLINK_PAGE being flushed?  Of course, I don't understand the 
> "target != leafblkno".

The "target != leafblkno" thing concerns whether or not this is a
multi-level deletion (actually, that's not quite right, since even a
multi-level deletion has "target == leafblkno" at the point where it
finally gets to mark a half dead page fully deleted).

Yes, it's odd that this deleted page exists, even though its siblings
still link to it -- the distinction between a fully deleted page and a
half dead page is really just the fact that a fully deleted page is
supposed to not be linked to from anywhere, including still-live
siblings. But you don't have to get that far -- a downlink pointing to
a half-dead page is evidence enough of corruption.

(Actually, it's more complicated than that -- see the comments in
amcheck's bt_downlink_check() function from Postgres 11 or 12.
Multi-level deletion is a case where a half-dead page has a downlink,
but the subtree undergoing deletion is still isolated in about the
same way as it is in the simple single level case, since the
"topparent" downlink is zapped at the same point that the leaf page is
marked half-dead. The important thing is that even half-dead pages are
not reachable by descending the tree, except for the tiny window where
the topparent downlink is observed the instant before it is zapped.)
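
As a minimal illustration (not something from this thread -- the index name
is hypothetical, and this assumes amcheck as shipped with Postgres 11 or 12),
the parent/child checks will report a downlink that leads to a half-dead or
deleted page:

CREATE EXTENSION IF NOT EXISTS amcheck;
-- heapallindexed = true additionally verifies heap tuples against the
-- index; the structural downlink checks don't require it
SELECT bt_index_parent_check('the_suspect_index'::regclass, true);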

If page deletion didn't exist, it would be so much easier to
understand the B-Tree code.

My guess is that there wasn't sufficient WAL to replay the page
deletion, but some of the buffers were written out. You might have
"gotten away with it" if the internal page also happened to be written
out along with everything else, but it just didn't work out that way.
Remember, there are two weird things about this, that overlap with two
distinct types of atomic operations: the fact that the downlink still
exists at all, and the fact that the sidelinks still exist at all.
This smells like a problem with slightly inconsistent page images, as
opposed to a problem with how one particular atomic operation did
something. It's not actually surprising that this would be the first
place that you'd notice a generic issue, since many other things are
"more forgiving" in various ways.

-- 
Peter Geoghegan




Re: maintenance_work_mem used by Vacuum

2019-10-11 Thread Masahiko Sawada
On Fri, Oct 11, 2019 at 5:13 PM Amit Kapila  wrote:
>
> On Fri, Oct 11, 2019 at 7:36 AM Masahiko Sawada  wrote:
> >
> > On Thu, Oct 10, 2019 at 6:38 PM Amit Kapila  wrote:
> > >
> > >
> > > It seems you want to say about commit id
> > > a1b395b6a26ae80cde17fdfd2def8d351872f399.
> >
> > Yeah thanks.
> >
> > >  I wonder why they have not
> > > changed it to gin_pending_list_limit (at that time
> > > pending_list_cleanup_size) in that commit itself?  I think if we want
> > > to use gin_pending_list_limit in this context then we can replace both
> > > work_mem and maintenance_work_mem with gin_pending_list_limit.
> >
> > Hmm, as far as I can see from the discussion, no one mentioned
> > maintenance_work_mem. Perhaps we just overlooked it?
> >
>
> It is possible, but we can't be certain until those people confirm the same.
>
> > I also didn't know
> > that.
> >
> > I also think we can replace at least the work_mem for cleanup of
> > pending list with gin_pending_list_limit. In the following comment in
> > ginfast.c,
> >
>
> Agreed, but that won't solve the original purpose for which we started
> this thread.
>
> > /*
> >  * Force pending list cleanup when it becomes too long. And,
> >  * ginInsertCleanup could take significant amount of time, so we prefer to
> >  * call it when it can do all the work in a single collection cycle. In
> >  * non-vacuum mode, it shouldn't require maintenance_work_mem, so fire it
> >  * while pending list is still small enough to fit into
> >  * gin_pending_list_limit.
> >  *
> >  * ginInsertCleanup() should not be called inside our CRIT_SECTION.
> >  */
> > cleanupSize = GinGetPendingListCleanupSize(index);
> > if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
> > needCleanup = true;
> >
> > ISTM the gin_pending_list_limit in the above comment corresponds to
> > the following code in ginfast.c,
> >
> > /*
> >  * We are called from regular insert and if we see concurrent cleanup
> >  * just exit in hope that concurrent process will clean up pending
> >  * list.
> >  */
> > if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
> > return;
> > workMemory = work_mem;
> >
> > If work_mem is smaller than gin_pending_list_limit the pending list
> > cleanup would behave against the intention of the above comment that
> > prefers to do all the work in a single collection cycle while pending
> > list is still small enough to fit into gin_pending_list_limit.
> >
>
> That's right, but OTOH, if the user specifies gin_pending_list_limit
> as an option during Create Index with a value greater than GUC
> gin_pending_list_limit, then also we will face the same problem.  It
> seems to me that we haven't thought enough on memory usage during Gin
> pending list cleanup and I don't want to independently make a decision
> to change it.  So unless the original author/committer or some other
> people who have worked in this area share their opinion, we can leave
> it as it is and make a parallel vacuum patch adapt to it.

Yeah, I totally agree.

Apart from the GIN problem, can we discuss whether we need to change the
current memory usage policy for parallel utility commands described in
the docs? We cannot control the memory usage in index AMs after all, but
we do need to consider generically how much memory is used during
parallel vacuum.
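
For reference, a minimal sketch of the two thresholds discussed above; the
index and column names are only examples, the storage parameter takes an
integer number of kilobytes, and the GUCs accept units:

-- per-index limit on the pending list before cleanup is forced
CREATE INDEX gin_idx ON some_table USING gin (some_tsvector)
  WITH (gin_pending_list_limit = 2048);

-- the GUC is the default when no storage parameter is set; per the code
-- quoted above, cleanup triggered by a regular insert budgets work_mem
SET gin_pending_list_limit = '4MB';
SET work_mem = '64MB';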

Regards,

--
Masahiko Sawada




Re: [Proposal] Global temporary tables

2019-10-11 Thread Pavel Stehule
On Fri, Oct 11, 2019 at 3:50 PM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 11.10.2019 15:15, 曾文旌(义从) wrote:
>
> Dear Hackers,
>
> This proposes a way to develop global temporary tables in PostgreSQL.
>
> I noticed that there is an "Allow temporary tables to exist as empty by
> default in all sessions" item in the PostgreSQL todo list.
> https://wiki.postgresql.org/wiki/Todo
>
> In recent years, the PG community has had many discussions about global
> temp table (GTT) support. Previous discussions covered the following topics:
> (1) The main benefit or function: GTT offers features like “persistent
> schema, ephemeral data”, which avoids catalog bloat and reduces catalog
> vacuum.
> (2) Whether it follows the ANSI concept of temporary tables
> (3) How to deal with statistics, single copy of schema definition,
> relcache
> (4) More can be seen in
> https://www.postgresql.org/message-id/73954ab7-44d3-b37b-81a3-69bdcbb446f7%40postgrespro.ru
> (5) A recent implementation and design from Konstantin Knizhnik covered
> many functions of GTT:
> https://www.postgresql.org/message-id/attachment/103265/global_private_temp-1.patch
>
> However, as pointed out by Konstantin himself, the implementation still
> needs functions related to CLOG, vacuum, and MVCC visibility.
>
>
> Just to clarify.
> I have now proposed several different solutions for GTT:
>
> Shared vs. private buffers for GTT:
> 1. Private buffers. This is the least invasive patch, requiring no changes
> to relfilenodes.
> 2. Shared buffers. Requires changing relfilenode but supports parallel
> query execution for GTT.
>

This is an important argument for using shared buffers. Maybe the best is a
mix of both -- store files in a temporary tablespace, but use shared buffers.
Moreover, that way GTT can be accessible to autovacuum.

>
> Access to GTT at replica:
> 1. Access is prohibited (as for original temp tables). No changes at all.
> 2. Tuples of temp tables are marked with a frozen XID.  Minimal changes,
> rollbacks are not possible.
> 3. Providing special XIDs for GTT at replica. No changes in CLOG are
> required, but special MVCC visibility rules are used for GTT. Current
> limitation: the number of transactions accessing GTT at a replica is
> limited to 2^32, and a bitmap of corresponding size has to be maintained
> (tuples of GTT are not processed by vacuum and not frozen, so the XID
> horizon never moves).
>
> So except for the limitation mentioned above (which I do not consider
> critical), there is only one problem which was not addressed: maintaining
> statistics for GTT.
> If all of the following conditions are true:
>
> 1) GTT are used in joins
> 2) There are indexes defined for GTT
> 3) Size and histogram of GTT in different backends can significantly vary.
> 4) ANALYZE was explicitly called for GTT
>
> then the query execution plan built in one backend will also be used by
> other backends, where it can be inefficient.
> I also do not consider this problem a "show stopper" for adding GTT to
> Postgres.
>

The last issue is a show stopper in my mind. It really depends on usage.
There are situations where shared statistics are OK (and maybe a good
solution), and other situations where shared statistics are just unusable.

Regards

Pavel



> I still do not understand the community's opinion on which functionality
> of GTT is considered most important.
> But the patch with local buffers and no replica support is small enough to
> be a good starting point.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: Change atoi to strtol in same place

2019-10-11 Thread Joe Nelson
Here's v6 of the patch.

[x] Rebase on 20961ceaf0
[x] Don't call exit(1) after pg_fatal()
[x] Use Tom Lane's suggestion for %lld in _() string
[x] Allow full unsigned 16-bit range for ports (don't disallow ports 0-1023)
[x] Explicitly cast parsed values to smaller integers

-- 
Joe Nelson  https://begriffs.com
diff --git a/contrib/pg_standby/Makefile b/contrib/pg_standby/Makefile
index 0bca2f8e9e..cb9292d0f4 100644
--- a/contrib/pg_standby/Makefile
+++ b/contrib/pg_standby/Makefile
@@ -6,6 +6,8 @@ PGAPPICON = win32
 PROGRAM = pg_standby
 OBJS	= pg_standby.o $(WIN32RES)
 
+PG_LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index 031b1b5cd5..c6405d8b94 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -33,6 +33,7 @@
 #include "pg_getopt.h"
 
 #include "access/xlog_internal.h"
+#include "fe_utils/option.h"
 
 const char *progname;
 
@@ -678,6 +679,10 @@ main(int argc, char **argv)
 
 	while ((c = getopt(argc, argv, "cdk:lr:s:t:w:")) != -1)
 	{
+		pg_strtoint_status s;
+		int64		parsed;
+		char	   *parse_error;
+
 		switch (c)
 		{
 			case 'c':			/* Use copy */
@@ -687,12 +692,15 @@ main(int argc, char **argv)
 debug = true;
 break;
 			case 'k':			/* keepfiles */
-keepfiles = atoi(optarg);
-if (keepfiles < 0)
+s = pg_strtoint64_range(optarg, &parsed,
+		0, INT_MAX, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	fprintf(stderr, "%s: -k keepfiles must be >= 0\n", progname);
+	fprintf(stderr, "%s: invalid argument for -k keepfiles: %s\n",
+			progname, parse_error);
 	exit(2);
 }
+keepfiles = (int) parsed;
 break;
 			case 'l':			/* Use link */
 
@@ -706,31 +714,39 @@ main(int argc, char **argv)
 #endif
 break;
 			case 'r':			/* Retries */
-maxretries = atoi(optarg);
-if (maxretries < 0)
+s = pg_strtoint64_range(optarg, &parsed,
+		0, INT_MAX, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	fprintf(stderr, "%s: -r maxretries must be >= 0\n", progname);
+	fprintf(stderr, "%s: invalid argument for -r maxretries: %s\n",
+			progname, parse_error);
 	exit(2);
 }
+maxretries = (int) parsed;
 break;
 			case 's':			/* Sleep time */
-sleeptime = atoi(optarg);
-if (sleeptime <= 0 || sleeptime > 60)
+s = pg_strtoint64_range(optarg, &parsed, 1, 60, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	fprintf(stderr, "%s: -s sleeptime incorrectly set\n", progname);
+	fprintf(stderr, "%s: invalid argument for -s sleeptime: %s\n",
+			progname, parse_error);
 	exit(2);
 }
+sleeptime = (int) parsed;
 break;
 			case 't':			/* Trigger file */
 triggerPath = pg_strdup(optarg);
 break;
 			case 'w':			/* Max wait time */
-maxwaittime = atoi(optarg);
-if (maxwaittime < 0)
+s = pg_strtoint64_range(optarg, &parsed,
+		0, INT_MAX, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	fprintf(stderr, "%s: -w maxwaittime incorrectly set\n", progname);
+	fprintf(stderr, "%s: invalid argument for -w maxwaittime: %s\n",
+			progname, parse_error);
 	exit(2);
 }
+maxwaittime = (int) parsed;
 break;
 			default:
 fprintf(stderr, "Try \"%s --help\" for more information.\n", progname);
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 55ef13926d..2c0fbbc721 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -32,6 +32,7 @@
 #include "common/logging.h"
 #include "common/string.h"
 #include "fe_utils/recovery_gen.h"
+#include "fe_utils/option.h"
 #include "fe_utils/string_utils.h"
 #include "getopt_long.h"
 #include "libpq-fe.h"
@@ -2073,6 +2074,10 @@ main(int argc, char **argv)
 	while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP",
			long_options, &option_index)) != -1)
 	{
+		pg_strtoint_status s;
+		int64		parsed;
+		char	   *parse_error;
+
 		switch (c)
 		{
 			case 'C':
@@ -2157,12 +2162,13 @@ main(int argc, char **argv)
 #endif
 break;
 			case 'Z':
-compresslevel = atoi(optarg);
-if (compresslevel < 0 || compresslevel > 9)
+s = pg_strtoint64_range(optarg, &parsed, 0, 9, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	pg_log_error("invalid compression level \"%s\"", optarg);
+	pg_log_error("invalid compression level: %s", parse_error);
 	exit(1);
 }
+compresslevel = (int) parsed;
 break;
 			case 'c':
 if (pg_strcasecmp(optarg, "fast") == 0)
@@ -2195,12 +2201,14 @@ main(int argc, char **argv)
 dbgetpassword = 1;
 break;
 			case 's':
-standby_message_timeout = atoi(optarg) * 1000;
-if (standby_message_timeout < 0)
+s = pg_strtoint64_range(optarg, &parsed,
+		0, INT_MAX / 1000, &parse_error);
+if (s != PG_STRTOINT_OK)
 {
-	pg_log_error("invalid status interval \"%s\"", optarg);
+	

Re: [HACKERS] Block level parallel vacuum

2019-10-11 Thread Amit Kapila
On Fri, Oct 11, 2019 at 4:47 PM Mahendra Singh  wrote:
>
>
> I did some analysis and found that we are trying to free some already freed 
> memory. Or we are freeing palloced memory in vac_update_relstats.
> for (i = 0; i < nindexes; i++)
> {
> if (stats[i] == NULL || stats[i]->estimated_count)
> continue;
>
> /* Update index statistics */
> vac_update_relstats(Irel[i],
> stats[i]->num_pages,
> stats[i]->num_index_tuples,
> 0,
> false,
> InvalidTransactionId,
> InvalidMultiXactId,
> false);
> pfree(stats[i]);
> }
>
> As my table has 2 indexes, we have to free both stats. When i = 0, it is
> freed properly, but when i = 1, vac_update_relstats is freeing the memory.
>>
>> (gdb) p *stats[i]
>> $1 = {num_pages = 218, pages_removed = 0, estimated_count = false, 
>> num_index_tuples = 3, tuples_removed = 3, pages_deleted = 102, 
>> pages_free = 0}
>> (gdb) p *stats[i]
>> $2 = {num_pages = 0, pages_removed = 65536, estimated_count = false, 
>> num_index_tuples = 0, tuples_removed = 0, pages_deleted = 0, pages_free = 0}
>> (gdb)
>
>
> From the above data, it looks like somewhere inside vac_update_relstats we
> are freeing all the palloc'd memory. I don't know why that is.
>

I don't think the problem is in vac_update_relstats, as we are not even
passing stats to it, so it won't be able to free them.  I think the real
problem is in the way we copy the stats from shared memory to local
memory in the function end_parallel_vacuum().  Basically, it allocates
the memory for all the index stats together, and then the function
update_index_statistics tries to free the individual array elements,
which won't work.  I have tried to fix the allocation in
end_parallel_vacuum; see if this fixes the problem for you.  You need
to apply the attached patch atop
v28-0001-Add-parallel-option-to-VACUUM-command posted above by
Sawada-San.
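
To illustrate the distinction in plain C (just a sketch, not the attached
patch -- IdxStats and the function name are made up): freeing individual
elements only works if each element was allocated separately, which is why
the copy-out step should give every index its own chunk rather than handing
back pointers into one big array of structs.

#include <stdlib.h>
#include <string.h>

typedef struct IdxStats { long num_pages; long num_index_tuples; } IdxStats;

/* Copy per-index stats out of a single contiguous (e.g. shared) array into
 * individually allocated chunks, so the caller may free them one by one. */
static IdxStats **
copy_index_stats(const IdxStats *shared, int nindexes)
{
    IdxStats **stats = calloc(nindexes, sizeof(IdxStats *));

    for (int i = 0; i < nindexes; i++)
    {
        stats[i] = malloc(sizeof(IdxStats));
        memcpy(stats[i], &shared[i], sizeof(IdxStats));
    }
    return stats;   /* free(stats[i]) is now legal per element */
}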

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Fix-memory-allocation-for-copying-the-stats.patch
Description: Binary data


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-10-11 16:30:17 -0400, Robert Haas wrote:
>> But, if it does need to be changed, it seems like a terrible idea to
>> allow it to be done via SQL. Otherwise, the user can break the driver
>> by using SQL to set the list to something that the driver's not
>> expecting, and there's nothing the driver can do to prevent it.

> Uhm. The driver can just ignore GUCs it's not interested in, like our
> docs have told them for a long time?

Certainly it should do that; but the problematic case is where it
*doesn't* get told about something it's depending on knowing about.

regards, tom lane




v12.0: segfault in reindex CONCURRENTLY

2019-10-11 Thread Justin Pryzby
One of our servers crashed last night like this:

< 2019-10-10 22:31:02.186 EDT postgres >STATEMENT:  REINDEX INDEX CONCURRENTLY 
child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
< 2019-10-10 22:31:02.399 EDT  >LOG:  server process (PID 29857) was terminated 
by signal 11: Segmentation fault
< 2019-10-10 22:31:02.399 EDT  >DETAIL:  Failed process was running: REINDEX 
INDEX CONCURRENTLY child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
< 2019-10-10 22:31:02.399 EDT  >LOG:  terminating any other active server 
processes

ts=# \d+ child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx
Index "child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx"
 Column  |  Type   | Key? | Definition | Storage | Stats target
-+-+--++-+--
 site_id | integer | yes  | site_id| plain   |
btree, for table "child.eric_umts_rnc_utrancell_hsdsch_eul_201910"

That's an index on a table partition, but not itself a child of a relkind=I
index.

Unfortunately, there was no core file, and I'm still trying to reproduce it.

I can't see that the table was INSERTed into during the reindex...
But looks like it was SELECTed from, and the report finished within 1sec of the
crash:

(2019-10-10 22:30:50,485 - p1604 t140325365622592 - INFO): PID 1604 finished 
running report; est=None rows=552; cols=83; [...] duration:12

postgres=# SELECT log_time, pid, session_id, left(message,99), detail FROM 
postgres_log_2019_10_10_2200 WHERE pid=29857 OR (log_time BETWEEN '2019-10-10 
22:31:02.18' AND '2019-10-10 22:31:02.4' AND NOT message~'crash of another') 
ORDER BY log_time LIMIT 9;
 2019-10-10 22:30:24.441-04 | 29857 | 5d9fe93f.74a1 | temporary file: path 
"base/pgsql_tmp/pgsql_tmp29857.0.sharedfileset/0.0", size 3096576  | 
 2019-10-10 22:30:24.442-04 | 29857 | 5d9fe93f.74a1 | temporary file: path 
"base/pgsql_tmp/pgsql_tmp29857.0.sharedfileset/1.0", size 2809856  | 
 2019-10-10 22:30:24.907-04 | 29857 | 5d9fe93f.74a1 | process 29857 still 
waiting for ShareLock on virtual transaction 30/103010 after 333.078 ms | 
Process holding the lock: 29671. Wait queue: 29857.
 2019-10-10 22:31:02.186-04 | 29857 | 5d9fe93f.74a1 | process 29857 acquired 
ShareLock on virtual transaction 30/103010 after 37611.995 ms| 
 2019-10-10 22:31:02.186-04 | 29671 | 5d9fe92a.73e7 | duration: 50044.778 ms  
statement: SELECT fn, sz FROM  +| 
|   |   | 
(SELECT file_name fn, file_size_bytes sz,  +| 
|   |   |   
  | 
 2019-10-10 22:31:02.399-04 |  1161 | 5d9cad9e.489  | terminating any other 
active server processes   | 
 2019-10-10 22:31:02.399-04 |  1161 | 5d9cad9e.489  | server process (PID 
29857) was terminated by signal 11: Segmentation fault  | 
Failed process was running: REINDEX INDEX CONCURRENTLY 
child.eric_umts_rnc_utrancell_hsdsch_eul_201910_site_idx

Justin




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Andres Freund
Hi,

On 2019-10-11 16:30:17 -0400, Robert Haas wrote:
> On Fri, Oct 11, 2019 at 8:21 AM Dave Cramer  wrote:
> > So off the top of my head providing a system function seems like the way to 
> > go here unless you were contemplating adding something to the protocol ?
> 
> Since the list of reportable GUCs is for the benefit of the driver,
> I'm not sure why this would need to be changed after the connection
> has been established.

Because of connection pooling. Consider the pooler in between the client
and the server. The pooler can't report back settings changes it doesn't
get itself. And just asking for all settings upfront seems problematic.


> But, if it does need to be changed, it seems like a terrible idea to
> allow it to be done via SQL. Otherwise, the user can break the driver
> by using SQL to set the list to something that the driver's not
> expecting, and there's nothing the driver can do to prevent it.

Uhm. The driver can just ignore GUCs it's not interested in, like our
docs have told them for a long time? And I mean if somebody is able to
issue sql, then they likely have some control over the driver as well. I
don't understand the problem here.

Greetings,

Andres Freund




Re: dropping column prevented due to inherited index

2019-10-11 Thread Alvaro Herrera
On 2019-Oct-11, Michael Paquier wrote:

> + if (!recursing)
> + {
> + /*
> +  * The resursing lookup for inherited child relations is done.  
> All
> +  * the child relations have been scanned and the object 
> addresses of
> +  * all the columns to-be-dropped are registered in addrs, so 
> drop.
> +  */
> + performMultipleDeletions(addrs, behavior, 0);
> + free_object_addresses(addrs);
> + }

Typo "resursing".  This comment seems a bit too long to me.  Maybe
"Recursion having ended, drop everything that was collected." suffices.
(Fits in one line.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
I wrote:
> It's not very clear how those things would lead to an intermittent
> failure though.  In the case of the postmaster crashes, we now see
> that timing of signal receipts is relevant.  For infinite_recurse,
> maybe it only fails if an sinval interrupt happens at the wrong time?
> (This theory would predict that commit 798070ec0 made the problem
> way more prevalent than it had been ... need to go see if the
> buildfarm history supports that.)

That seems to fit, roughly: commit 798070ec0 moved errors.sql to execute
as part of a parallel group on 2019-04-11, and the first failure of the
infinite_recurse test happened on 2019-04-27.  Since then we've averaged
about one such failure every four days, which makes a sixteen-day gap a
little more than you'd expect, but not a huge amount more.  Anyway,
I do not see any other commits in between that would plausibly have
affected this.
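
(For anyone who doesn't have the tree in front of them, the test in question
is roughly this, from errors.sql -- a SQL function that calls itself until
the backend's stack depth check fires:

create function infinite_recurse() returns int as
  'select infinite_recurse()' language sql;
select infinite_recurse();  -- normally: ERROR: stack depth limit exceeded

The buildfarm failures are the runs where the kernel delivers SIGSEGV before
that check is ever reached.)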

In other news, I reproduced the problem with gcc on wobbegong's host,
and confirmed that the gcc build uses less stack space: one recursive
cycle of reaper() and sigusr1_handler() consumes 14768 bytes with clang,
but just 9296 bytes with gcc.  So the evident difference in failure rate
between wobbegong and vulpes is nicely explained by that.  Still no
theory about pg_upgrade versus vanilla "make check" though.  I did manage
to make it happen during "make check" by dint of reducing the "ulimit -s"
setting, so it's *possible* for it to happen there, it just doesn't.
Weird.

regards, tom lane




Re: stress test for parallel workers

2019-10-11 Thread Thomas Munro
On Sat, Oct 12, 2019 at 9:40 AM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> >> ... So it's really hard to explain
> >> that as anything except a kernel bug: sometimes, the kernel
> >> doesn't give us as much stack as it promised it would.  And the
> >> machine is not loaded enough for there to be any rational
> >> resource-exhaustion excuse for that.
>
> > Linux expands stack space only on demand, thus it's possible to run out
> > of stack space while there ought to be stack space. Unfortunately that
> > during a stack expansion, which means there's no easy place to report
> > that.  I've seen this be hit in production on busy machines.
>
> As I said, this machine doesn't seem busy enough for that to be a
> tenable excuse; there's nobody but me logged in, and the buildfarm
> critter isn't running.

Yeah.  As I speculated in the other thread[1], the straightforward
can't-allocate-any-more-space-but-no-other-way-to-tell-you-that case,
ie, the explanation that doesn't involve a bug in Linux or PostgreSQL,
seems unlikely unless we also see other more obvious signs of
occasional overcommit problems (ie not during stack expansion) on
those hosts, doesn't it?  How likely is it that this 1-2MB of stack
space is the straw that breaks the camel's back, every time?

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJ_MkqdEH-LmmebhNLSFeyWwvYVXfPaz3A2_p27EQfZwA%40mail.gmail.com




v12.0 ERROR: trying to store a heap tuple into wrong type of slot

2019-10-11 Thread Justin Pryzby
I'm not sure why we have that index, and my script probably should have known
to choose a better one to cluster on, but still..

ts=# CLUSTER huawei_m2000_config_enodebcell_enodeb USING 
huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;
DEBUG:  0: building index "pg_toast_1840151315_index" on table 
"pg_toast_1840151315" serially
LOCATION:  index_build, index.c:2791
DEBUG:  0: clustering "public.huawei_m2000_config_enodebcell_enodeb" using 
sequential scan and sort
LOCATION:  copy_table_data, cluster.c:907
ERROR:  XX000: trying to store a heap tuple into wrong type of slot
LOCATION:  ExecStoreHeapTuple, execTuples.c:1328

ts=# \dt+ huawei_m2000_config_enodebcell_enodeb
 public | huawei_m2000_config_enodebcell_enodeb | table | telsasoft | 3480 kB | 

ts=# \d+ huawei_m2000_config_enodebcell_enodeb_coalesce_idx
Index "public.huawei_m2000_config_enodebcell_enodeb_coalesce_idx"
  Column  | Type | Key? |  Definition   | Storage  | 
Stats target 
--+--+--+---+--+--
 coalesce | text | yes  | COALESCE(enodebfunctionname, ne_name) | extended | 
btree, for table "public.huawei_m2000_config_enodebcell_enodeb"

ts=# \d+ huawei_m2000_config_enodebcell_enodeb
...
Indexes:
"huawei_m2000_config_enodebcell_enodeb_unique_idx" UNIQUE, btree (ne_name, 
tsoft_fake_key, device_name)
"huawei_m2000_config_enodebcell_enodeb_cellid_idx" btree (cellid) CLUSTER
"huawei_m2000_config_enodebcell_enodeb_coalesce_cellid_idx" btree 
(COALESCE(enodebfunctionname, ne_name), cellid)
"huawei_m2000_config_enodebcell_enodeb_coalesce_idx" btree 
(COALESCE(enodebfunctionname, ne_name))
Statistics objects:
"public"."huawei_m2000_config_enodebcell_enodeb" (ndistinct) ON ne_name, 
tsoft_fake_key, device_name FROM huawei_m2000_config_enodebcell_
enodeb
Access method: heap

(gdb) bt
#0  errfinish (dummy=dummy@entry=0) at elog.c:411
#1  0x0087a959 in elog_finish (elevel=elevel@entry=20,
fmt=fmt@entry=0x9c4d70 "trying to store a heap tuple into wrong type of 
slot") at elog.c:1365
#2  0x0061eea8 in ExecStoreHeapTuple (tuple=tuple@entry=0x1e06950, 
slot=slot@entry=0x1e05080, shouldFree=shouldFree@entry=false)
at execTuples.c:1328
#3  0x008a7a06 in comparetup_cluster (a=, b=, state=0x1e04940) at tuplesort.c:3795
#4  0x008a5895 in qsort_tuple (a=0x2254b08, n=7699, cmp_tuple=0x8a7960 
, state=state@entry=0x1e04940)
at qsort_tuple.c:112
#5  0x008a98bb in tuplesort_sort_memtuples 
(state=state@entry=0x1e04940) at tuplesort.c:3320
#6  0x008ab434 in tuplesort_performsort (state=state@entry=0x1e04940) 
at tuplesort.c:1811
#7  0x004c9404 in heapam_relation_copy_for_cluster 
(OldHeap=0x7f21606695d8, NewHeap=0x7f2160585048, OldIndex=,
use_sort=, OldestXmin=288843233, xid_cutoff=, 
multi_cutoff=0x7ffc05e6ba04, num_tuples=0x7ffc05e6ba08,
tups_vacuumed=0x7ffc05e6ba10, tups_recently_dead=0x7ffc05e6ba18) at 
heapam_handler.c:944
#8  0x0059cf07 in table_relation_copy_for_cluster 
(tups_recently_dead=0x7ffc05e6ba18, tups_vacuumed=0x7ffc05e6ba10,
num_tuples=0x7ffc05e6ba08, multi_cutoff=0x7ffc05e6ba04, 
xid_cutoff=0x7ffc05e6ba00, OldestXmin=, use_sort=true,
OldIndex=0x7f2160585f38, NewTable=0x7f2160585048, OldTable=0x7f21606695d8) 
at ../../../src/include/access/tableam.h:1410
#9  copy_table_data (pCutoffMulti=, pFreezeXid=, pSwapToastByContent=,
verbose=, OIDOldIndex=13, OIDOldHeap=1499600032, 
OIDNewHeap=1840150111) at cluster.c:920
#10 rebuild_relation (verbose=, indexOid=13, OldHeap=) at cluster.c:616
#11 cluster_rel (tableOid=tableOid@entry=1499600032, 
indexOid=indexOid@entry=3081287757, options=) at cluster.c:429
#12 0x0059d35e in cluster (stmt=stmt@entry=0x1d051f8, 
isTopLevel=isTopLevel@entry=true) at cluster.c:186
#13 0x0076547f in standard_ProcessUtility (pstmt=pstmt@entry=0x1d05518,
queryString=queryString@entry=0x1d046e0 "CLUSTER 
huawei_m2000_config_enodebcell_enodeb USING 
huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;", 
context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, 
queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x1d055f8,
completionTag=completionTag@entry=0x7ffc05e6c0a0 "") at utility.c:659
#14 0x7f21517204ef in pgss_ProcessUtility (pstmt=0x1d05518,
queryString=0x1d046e0 "CLUSTER huawei_m2000_config_enodebcell_enodeb USING 
huawei_m2000_config_enodebcell_enodeb_coalesce_idx ;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1d055f8, 
completionTag=0x7ffc05e6c0a0 "")
at pg_stat_statements.c:1006
#15 0x00762816 in PortalRunUtility (portal=0x1d7a4e0, pstmt=0x1d05518, 
isTopLevel=, setHoldSnapshot=,
dest=0x1d055f8, completionTag=0x7ffc05e6c0a0 "") at pquery.c:1175
#16 0x00763267 in PortalRunMulti (portal=portal@entry=0x1d7a4e0, 
isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, 

Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Chapman Flack
On 10/11/19 4:44 PM, Dave Cramer wrote:
> On Fri, 11 Oct 2019 at 16:41, Chapman Flack  
>> On 10/11/19 4:30 PM, Robert Haas wrote:
>>> allow it to be done via SQL. Otherwise, the user can break the driver
>>> by using SQL to set the list to something that the driver's not
>>> expecting, and there's nothing the driver can do to prevent it.
>>
>> ... unless there is something like a (perhaps more generally useful)
>> BECAUSE I HAVE :cookie AND I SAY SO clause.[1]
> 
> Correct me if I'm wrong but I'm pretty sure nothing in the protocol
> currently would allow this

Nothing in the protocol to support: a post-connection-establishment
changeable list of reportable GUCs? Or some sort of cookie-based
arrangement to lock something down?

If you meant the cookie scheme, the way I had envisioned it in that
earlier message, I didn't see anything about it that would require
protocol support. I was tempted to do a PoC implementation as a plain
old extension. I can't say that I have, yet, but I'm still tempted.

-Chap




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Dave Cramer
On Fri, 11 Oct 2019 at 16:41, Chapman Flack  wrote:

> On 10/11/19 4:30 PM, Robert Haas wrote:
>
> > But, if it does need to be changed, it seems like a terrible idea to
> > allow it to be done via SQL. Otherwise, the user can break the driver
> > by using SQL to set the list to something that the driver's not
> > expecting, and there's nothing the driver can do to prevent it.
>
> ... unless there is something like a (perhaps more generally useful)
> BECAUSE I HAVE :cookie AND I SAY SO clause.[1]
>


Correct me if I'm wrong but I'm pretty sure nothing in the protocol
currently would allow this

While I see the utility at least for connection pools this does seem to be
scope creep at the moment

Dave

>
>
>


Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Chapman Flack
On 10/11/19 4:30 PM, Robert Haas wrote:

> But, if it does need to be changed, it seems like a terrible idea to
> allow it to be done via SQL. Otherwise, the user can break the driver
> by using SQL to set the list to something that the driver's not
> expecting, and there's nothing the driver can do to prevent it.

... unless there is something like a (perhaps more generally useful)
BECAUSE I HAVE :cookie AND I SAY SO clause.[1]

Regards,
-Chap

[1]
https://www.postgresql.org/message-id/59127E4E.8090705%40anastigmatix.net




Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andres Freund  writes:
> On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
>> ... So it's really hard to explain
>> that as anything except a kernel bug: sometimes, the kernel
>> doesn't give us as much stack as it promised it would.  And the
>> machine is not loaded enough for there to be any rational
>> resource-exhaustion excuse for that.

> Linux expands stack space only on demand, thus it's possible to run out
> of stack space while there ought to be stack space. Unfortunately that
> happens during a stack expansion, which means there's no easy place to
> report it.  I've seen this be hit in production on busy machines.

As I said, this machine doesn't seem busy enough for that to be a
tenable excuse; there's nobody but me logged in, and the buildfarm
critter isn't running.

> I wonder if the machine is configured with overcommit_memory=2,
> i.e. don't overcommit.  cat /proc/sys/vm/overcommit_memory would tell.

$ cat /proc/sys/vm/overcommit_memory
0

> What does grep -E '^(Mem|Commit)' /proc/meminfo show while it's
> happening?

idle:

$ grep -E '^(Mem|Commit)' /proc/meminfo 
MemTotal:2074816 kB
MemFree:   36864 kB
MemAvailable:1779584 kB
CommitLimit: 1037376 kB
Committed_AS: 412480 kB

a few captures while regression tests are running:

$ grep -E '^(Mem|Commit)' /proc/meminfo 
MemTotal:2074816 kB
MemFree:8512 kB
MemAvailable:1819264 kB
CommitLimit: 1037376 kB
Committed_AS: 371904 kB
$ grep -E '^(Mem|Commit)' /proc/meminfo 
MemTotal:2074816 kB
MemFree:   32640 kB
MemAvailable:1753792 kB
CommitLimit: 1037376 kB
Committed_AS: 585984 kB
$ grep -E '^(Mem|Commit)' /proc/meminfo 
MemTotal:2074816 kB
MemFree:   56640 kB
MemAvailable:1695744 kB
CommitLimit: 1037376 kB
Committed_AS: 568768 kB


> What does the signal information say? You can see it with
> p $_siginfo
> after receiving the signal. A SIGSEGV here, I assume.

(gdb) p $_siginfo
$1 = {si_signo = 11, si_errno = 0, si_code = 128, _sifields = {_pad = {0 
}, _kill = {si_pid = 0, si_uid = 0}, 
_timer = {si_tid = 0, si_overrun = 0, si_sigval = {sival_int = 0, sival_ptr 
= 0x0}}, _rt = {si_pid = 0, si_uid = 0, si_sigval = {
sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 0, si_uid = 0, 
si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {
  si_addr = 0x0}, _sigpoll = {si_band = 0, si_fd = 0}}}

> Yea, that seems like it might be good. But we have to be careful too, as
> there are some things we do want to be interruptible from within a signal
> handler. We start some processes from within one, after all...

The proposed patch has zero effect on what the signal mask will be inside
a signal handler, only on the transient state during handler entry/exit.

regards, tom lane




Re: stress test for parallel workers

2019-10-11 Thread Andres Freund
Hi,

On 2019-10-11 14:56:41 -0400, Tom Lane wrote:
> I still don't have a good explanation for why this only seems to
> happen in the pg_upgrade test sequence.  However, I did notice
> something very interesting: the postmaster crashes after consuming
> only about 1MB of stack space.  This is despite the prevailing
> setting of "ulimit -s" being 8192 (8MB).  I also confirmed that
> the value of max_stack_depth within the crashed process is 2048,
> which implies that get_stack_depth_rlimit got some value larger
> than 2MB from getrlimit(RLIMIT_STACK).  And yet, here we have
> a crash, and the process memory map confirms that only 1MB was
> allocated in the stack region.  So it's really hard to explain
> that as anything except a kernel bug: sometimes, the kernel
> doesn't give us as much stack as it promised it would.  And the
> machine is not loaded enough for there to be any rational
> resource-exhaustion excuse for that.

Linux expands stack space only on demand, thus it's possible to run out
of stack space while there ought to be stack space. Unfortunately that
happens during a stack expansion, which means there's no easy place to
report it.  I've seen this be hit in production on busy machines.

I wonder if the machine is configured with overcommit_memory=2,
i.e. don't overcommit.  cat /proc/sys/vm/overcommit_memory would tell.
What does grep -E '^(Mem|Commit)' /proc/meminfo show while it's
happening?

What does the signal information say? You can see it with
p $_siginfo
after receiving the signal. A SIGSEGV here, I assume.

IIRC si_code and si_errno should indicate whether ENOMEM is the reason.


> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm.  Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding arch-specific kernel bug
> involved.  For reference, I scraped the attached list of such
> failures in the last three months.  I wonder whether we can get
> the attention of any kernel hackers about that.

Most of them are operated by Mark, right? So it could also just be high
memory pressure on those.

> Anyway, as to what to do about it --- it occurred to me to wonder
> why we are relying on having the signal handlers block and unblock
> signals manually, when we could tell sigaction() that we'd like
> signals blocked.  It is reasonable to expect that the signal support
> is designed to not recursively consume stack space in the face of
> a series of signals, while the way we are doing it clearly opens
> us up to recursive space consumption.  The stack trace I showed
> before proves that the recursion happens at the points where the
> signal handlers unblock signals.

Yea, that seems like it might be good. But we have to be careful too, as
there are some things we do want to be interruptible from within a signal
handler. We start some processes from within one, after all...

Greetings,

Andres Freund




Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Robert Haas
On Fri, Oct 11, 2019 at 8:21 AM Dave Cramer  wrote:
> So off the top of my head providing a system function seems like the way to 
> go here unless you were contemplating adding something to the protocol ?

Since the list of reportable GUCs is for the benefit of the driver,
I'm not sure why this would need to be changed after the connection
has been established.

But, if it does need to be changed, it seems like a terrible idea to
allow it to be done via SQL. Otherwise, the user can break the driver
by using SQL to set the list to something that the driver's not
expecting, and there's nothing the driver can do to prevent it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: stress test for parallel workers

2019-10-11 Thread Mark Wong
On Sat, Oct 12, 2019 at 08:41:12AM +1300, Thomas Munro wrote:
> On Sat, Oct 12, 2019 at 7:56 AM Tom Lane  wrote:
> > This matches up with the intermittent infinite_recurse failures
> > we've been seeing in the buildfarm.  Those are happening across
> > a range of systems, but they're (almost) all Linux-based ppc64,
> > suggesting that there's a longstanding arch-specific kernel bug
> > involved.  For reference, I scraped the attached list of such
> > failures in the last three months.  I wonder whether we can get
> > the attention of any kernel hackers about that.
> 
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perhaps
> Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
> boxes that failed or if it could be relevant here).  Perhaps the
> arbitrary tolerances mentioned in that comment are relevant.

This specific one (wobbegong) is OpenStack/KVM[2], for what it's worth...

"... cluster is an OpenStack based cluster offering POWER8 & POWER9 LE
instances running on KVM ..."

But to keep you on your toes, some of my ppc animals are Docker within
another OpenStack/KVM instance...

Regards,
Mark

[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244
[2] https://osuosl.org/services/powerdev/

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Thomas Munro  writes:
> Yeah, I don't know anything about this stuff, but I was also beginning
> to wonder if something is busted in the arch-specific fault.c code
> that checks if stack expansion is valid[1], in a way that fails with a
> rapidly growing stack, well timed incoming signals, and perhaps
> Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
> boxes that failed or if it could be relevant here).  Perhaps the
> arbitrary tolerances mentioned in that comment are relevant.
> [1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244

Hm, the bit about "we'll allow up to 1MB unconditionally" sure seems
to match up with the observations here.  I also wonder about the
arbitrary definition of "a long way" as 2KB.  Could it be that that
misbehaves in the face of a userland function with more than 2KB of
local variables?

It's not very clear how those things would lead to an intermittent
failure though.  In the case of the postmaster crashes, we now see
that timing of signal receipts is relevant.  For infinite_recurse,
maybe it only fails if an sinval interrupt happens at the wrong time?
(This theory would predict that commit 798070ec0 made the problem
way more prevalent than it had been ... need to go see if the
buildfarm history supports that.)

regards, tom lane




Re: Connect as multiple users using single client certificate

2019-10-11 Thread Tom Lane
Kyle Bateman  writes:
> On 10/11/19 1:05 PM, Tom Lane wrote:
>> I agree with Andrew that that's just silly.  If you give all your users
>> the same cert then any of them can masquerade as any other.  You might
>> as well just tell them to share the same login id.

> In my implementation, I'm not giving the cert to all my users.  I'm only 
> giving it to the middleware server that manages connections.

> What I hope to accomplish is: Establish a secure, encrypted connection 
> to Postgresql from a trusted process, possibly running on another 
> machine, whom I trust to tell me which user (within a limited set, 
> defined by a role) it would like to connect as.  That process does it's 
> own robust authentication of users before letting them through to the 
> database by the username they claim.  However, it is still useful to 
> connect as different users because my views and functions operate 
> differently depending on which user is on the other end of the connection.

Well, you can do that, it's just not cert authentication.

What you might consider is (1) set up an ssl_ca_file, so that the
server only believes client certs traceable to that CA, and (2) require
SSL connections (use "hostssl" entries in pg_hba.conf).  Then you
expect that possession of a cert issued by your CA is enough to
authorize connections to the database.  But don't use the cert
auth method --- based on what you said here, you might even just
use "trust".

regards, tom lane




Re: stress test for parallel workers

2019-10-11 Thread Thomas Munro
On Sat, Oct 12, 2019 at 7:56 AM Tom Lane  wrote:
> This matches up with the intermittent infinite_recurse failures
> we've been seeing in the buildfarm.  Those are happening across
> a range of systems, but they're (almost) all Linux-based ppc64,
> suggesting that there's a longstanding arch-specific kernel bug
> involved.  For reference, I scraped the attached list of such
> failures in the last three months.  I wonder whether we can get
> the attention of any kernel hackers about that.

Yeah, I don't know anything about this stuff, but I was also beginning
to wonder if something is busted in the arch-specific fault.c code
that checks if stack expansion is valid[1], in a way that fails with a
rapidly growing stack, well timed incoming signals, and perhaps
Docker/LXC (that's on Mark's systems IIUC, not sure about the ARM
boxes that failed or if it could be relevant here).  Perhaps the
arbitrary tolerances mentioned in that comment are relevant.

[1] https://github.com/torvalds/linux/blob/master/arch/powerpc/mm/fault.c#L244




Re: Connect as multiple users using single client certificate

2019-10-11 Thread Kyle Bateman

On 10/11/19 1:05 PM, Tom Lane wrote:

Kyle Bateman  writes:

On 10/11/19 12:12 PM, Andrew Dunstan wrote:

I think the short answer is: No. The client certificate should match the
username and nothing else. If you don't want to generate certificates
for all your users I suggest using some other form of auth (e.g.
scram-sha-256).
The long answer is that you can use maps, but it's probably not a good
idea. e.g. you have a map allowing foo to connect as both bar and baz,
and give both bar and baz a certificate with a CN of foo. But then bar
can connect as baz and vice versa, which isn't a good thing.

Hmmm, too bad.  It would be nice to be able to generate a certificate,
say with a commonName of "+users" (or some other setting) which matches
what is specified in pg_hba.conf, allowing connections from anyone
within the specified group.  Seems like that is the intent of the "+"
syntax in the first place.

No, it's not.  The point of the +syntax is to let a collection of users
log in without having to adjust pg_hba.conf anytime you add a new user.
It's not meant to bypass the requirement that the users authenticate
properly.  Would you expect that if you used +users with a password-
based auth method, then all the users would have the same password?


In my case, the middleware is validating end-users using distributed
keys, so no username/passwords are needed.  I was hoping to avoid all
that and just rely on SSL.
Any idea if this is a viable feature enhancement?

I agree with Andrew that that's just silly.  If you give all your users
the same cert then any of them can masquerade as any other.  You might
as well just tell them to share the same login id.
In my implementation, I'm not giving the cert to all my users.  I'm only 
giving it to the middleware server that manages connections.


What I hope to accomplish is: Establish a secure, encrypted connection 
to Postgresql from a trusted process, possibly running on another 
machine, whom I trust to tell me which user (within a limited set, 
defined by a role) it would like to connect as.  That process does its 
own robust authentication of users before letting them through to the 
database by the username they claim.  However, it is still useful to 
connect as different users because my views and functions operate 
differently depending on which user is on the other end of the connection.


Is there a way I can accomplish this using the existing authentication 
methods (other than trust)?






Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-11 Thread Tom Lane
Justin Pryzby  writes:
> On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
>> Could you provide a self-contained test case please?

> I'm suspecting this; is it useful to test with this commit reverted ?

I wouldn't bother; we'd still need a test case to find out what the
problem is.

regards, tom lane




Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andrew Dunstan  writes:
> On 10/11/19 11:45 AM, Tom Lane wrote:
>> FWIW, I'm not excited about that as a permanent solution.  It requires
>> root privilege, and it affects the whole machine not only the buildfarm,
>> and making it persist across reboots is even more invasive.

> OK, but I'm not keen to have to tussle with coredumpctl. Right now our
> logic says: for every core file in the data directory try to get a
> backtrace. Use of systemd-coredump means that gets blown out of the
> water, and we no longer even have a simple test to see if our program
> caused a core dump.

I haven't played that much with this software, but it seems you can
do "coredumpctl list " to find out what it has
for a particular executable.  You would likely need a time-based
filter too (to avoid regurgitating previous runs' failures),
but that seems do-able.

regards, tom lane




Re: Connect as multiple users using single client certificate

2019-10-11 Thread Tom Lane
Kyle Bateman  writes:
> On 10/11/19 12:12 PM, Andrew Dunstan wrote:
>> I think the short answer is: No. The client certificate should match the
>> username and nothing else. If you don't want to generate certificates
>> for all your users I suggest using some other form of auth (e.g.
>> scram-sha-256).
>> The long answer is that you can use maps, but it's probably not a good
>> idea. e.g. you have a map allowing foo to connect as both bar and baz,
>> and give both bar and baz a certificate with a CN of foo. But then bar
>> can connect as baz and vice versa, which isn't a good thing.

> Hmmm, too bad.  It would be nice to be able to generate a certificate, 
> say with a commonName of "+users" (or some other setting) which matches 
> what is specified in pg_hba.conf, allowing connections from anyone 
> within the specified group.  Seems like that is the intent of the "+" 
> syntax in the first place.

No, it's not.  The point of the +syntax is to let a collection of users
log in without having to adjust pg_hba.conf anytime you add a new user.
It's not meant to bypass the requirement that the users authenticate
properly.  Would you expect that if you used +users with a password-
based auth method, then all the users would have the same password?

> In my case, the middleware is validating end-users using distributed 
> keys, so no username/passwords are needed.  I was hoping to avoid all 
> that and just rely on SSL.
> Any idea if this is a viable feature enhancement?

I agree with Andrew that that's just silly.  If you give all your users
the same cert then any of them can masquerade as any other.  You might
as well just tell them to share the same login id.

regards, tom lane




Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
I wrote:
> What we've apparently got here is that signals were received
> so fast that the postmaster ran out of stack space.  I remember
> Andres complaining about this as a theoretical threat, but I
> hadn't seen it in the wild before.

> I haven't finished investigating though, as there are some things
> that remain to be explained.

I still don't have a good explanation for why this only seems to
happen in the pg_upgrade test sequence.  However, I did notice
something very interesting: the postmaster crashes after consuming
only about 1MB of stack space.  This is despite the prevailing
setting of "ulimit -s" being 8192 (8MB).  I also confirmed that
the value of max_stack_depth within the crashed process is 2048,
which implies that get_stack_depth_rlimit got some value larger
than 2MB from getrlimit(RLIMIT_STACK).  And yet, here we have
a crash, and the process memory map confirms that only 1MB was
allocated in the stack region.  So it's really hard to explain
that as anything except a kernel bug: sometimes, the kernel
doesn't give us as much stack as it promised it would.  And the
machine is not loaded enough for there to be any rational
resource-exhaustion excuse for that.

This matches up with the intermittent infinite_recurse failures
we've been seeing in the buildfarm.  Those are happening across
a range of systems, but they're (almost) all Linux-based ppc64,
suggesting that there's a longstanding arch-specific kernel bug
involved.  For reference, I scraped the attached list of such
failures in the last three months.  I wonder whether we can get
the attention of any kernel hackers about that.

Anyway, as to what to do about it --- it occurred to me to wonder
why we are relying on having the signal handlers block and unblock
signals manually, when we could tell sigaction() that we'd like
signals blocked.  It is reasonable to expect that the signal support
is designed to not recursively consume stack space in the face of
a series of signals, while the way we are doing it clearly opens
us up to recursive space consumption.  The stack trace I showed
before proves that the recursion happens at the points where the
signal handlers unblock signals.
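
For concreteness, the shape of the idea -- just a sketch of the POSIX API,
not the attached patch; BlockSig is assumed to be the usual filled sigset_t:

#include <signal.h>
#include <stdio.h>

extern sigset_t BlockSig;       /* assumed: set of signals to hold */

static void
install_handler(int signo, void (*handler) (int))
{
    struct sigaction act;

    act.sa_handler = handler;
    act.sa_mask = BlockSig;     /* kernel holds these while handler runs */
    act.sa_flags = SA_RESTART;  /* or 0 for a no-restart variant */
    if (sigaction(signo, &act, NULL) < 0)
        perror("sigaction");
}

With sa_mask doing the blocking, a newly arrived signal stays pending until
the handler returns, so handler entry/exit can't nest and eat stack.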

As a quick hack I made the attached patch, and it seems to fix the
problem on wobbegong's host.  I don't see crashes any more, and
watching the postmaster's stack space consumption, it stays
comfortably at a tad under 200KB (probably the default initial
allocation), while without the patch it tends to blow up to 700K
or more even in runs that don't crash.

This patch isn't committable as-is because it will (I suppose)
break things on Windows; we still need the old way there for lack
of sigaction().  But that could be fixed with a few #ifdefs.
I'm also kind of tempted to move pqsignal_no_restart into
backend/libpq/pqsignal.c (where BlockSig is defined) and maybe
rename it, but I'm not sure to what.

This issue might go away if we switched to a postmaster implementation
that doesn't do work in the signal handlers, but I'm not entirely
convinced of that.  The existing handlers don't seem to consume a lot
of stack space in themselves (there's not many local variables in them).
The bulk of the stack consumption is seemingly in the platform's signal
infrastructure, so that we might still have a stack consumption issue
even with fairly trivial handlers, if we don't tell sigaction to block
signals.  In any case, this fix seems potentially back-patchable,
while we surely wouldn't risk back-patching a postmaster rewrite.

Comments?

regards, tom lane

 sysname  | architecture     | operating_system | sys_owner | branch        | snapshot            | stage           | err
----------+------------------+------------------+-----------+---------------+---------------------+-----------------+--------------------------------------------------------------------------------------------------------
 cavefish | ppc64le (POWER9) | Ubuntu           | Mark Wong | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was running: select infinite_recurse();
 pintail  | ppc64le (POWER9) | Debian GNU/Linux | Mark Wong | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL:  Failed process was running: select infinite_recurse();
 bonito   | ppc64le (POWER9) | Fedora           | Mark Wong | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL:  Failed process was running: select infinite_recurse();
 takin    | ppc64le          | opensuse         | Mark Wong | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL:  Failed process was running: select 

Re: pgsql: Remove pqsignal() from libpq's official exports list.

2019-10-11 Thread Christoph Berg
Re: Tom Lane 2019-10-10 <10247.1570731...@sss.pgh.pa.us>
> OK, done.

Thanks, that made quite a few QA pipeline jobs happy here.

Christoph




Re: Connect as multiple users using single client certificate

2019-10-11 Thread Kyle Bateman

On 10/11/19 12:12 PM, Andrew Dunstan wrote:

On 10/11/19 1:58 PM, Kyle Bateman wrote:

I have some JS middleware that needs to securely connect to the
postgresql back end.  Any number of different users may connect via
websocket to this middleware to manage their connection to the
database.  I want the JS process to have a client certificate
authorizing it to connect to the database.

I have this line in my pg_hba.conf:

hostssl        all    +users        all        cert

So the idea is, I should be able to connect as any user that is a
member of the role "users."

Under this configuration, I can currently connect as the user "users"
but not as "joe" who is a member of the role "users."  I get:

FATAL:  certificate authentication failed for user "joe"

This makes sense as the commonName on the certificate is "users" and
not "joe."  But the documentation for pg_hba.conf states that
prefixing the username with a "+" should allow me to connect as any
role who is a member of the stated role.

Is there a way to do this via client certificate authorization?  I
have no way of knowing the specific usernames ahead of time, as new
users may be created in the database (thousands) and I can't really be
creating separate certificates for every different user.




I think the short answer is: No. The client certificate should match the
username and nothing else. If you don't want to generate certificates
for all your users I suggest using some other form of auth (e.g.
scram-sha-256).


The long answer is that you can use maps, but it's probably not a good
idea. e.g. you have a map allowing foo to connect as both bar and baz,
and give both bar and baz a certificate with a CN of foo. But then bar
can connect as baz and vice versa, which isn't a good thing.


cheers


andrew


Hmmm, too bad.  It would be nice to be able to generate a certificate, 
say with a commonName of "+users" (or some other setting) which matches 
what is specified in pg_hba.conf, allowing connections from anyone 
within the specified group.  Seems like that is the intent of the "+" 
syntax in the first place.


In my case, the middleware is validating end-users using distributed 
keys, so no username/passwords are needed.  I was hoping to avoid all 
that and just rely on SSL.


Any idea if this is a viable feature enhancement?

Kyle







Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Tomas Vondra

On Fri, Oct 11, 2019 at 09:08:05AM -0500, Jeremy Finzel wrote:

On Thu, Oct 10, 2019 at 7:22 PM David Rowley 
wrote:


The planner might be able to get a better estimate on the number of
matching rows if the now() - interval '10 days' expression was
replaced with 'now'::timestamptz - interval '10 days'. However, care
would need to be taken to ensure the plan is never prepared since
'now' is evaluated during parse. The same care must be taken when
creating views, functions, stored procedures and the like.



You are on to something here I think with the now() function, even if above
suggestion is not exactly right as you said further down.  I am finding a
hard-coded timestamp gives the right query plan.  I also tested same with
even bigger window (last 16 days) and it yet still chooses the brin index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-#  category, source, MIN(rec_insert_time) OVER (partition by
source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
unique_cases;

  QUERY PLAN
-
WindowAgg  (cost=24436329.10..24436343.56 rows=643 width=120)
  ->  Sort  (cost=24436329.10..24436330.70 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
->  Subquery Scan on unique_cases  (cost=24436286.24..24436299.10
rows=643 width=104)
  ->  Unique  (cost=24436286.24..24436292.67 rows=643
width=124)
->  Sort  (cost=24436286.24..24436287.85 rows=643
width=124)
  Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
  ->  Nested Loop  (cost=0.00..24436256.25
rows=643 width=124)
Join Filter: ((l.category)::text =
filter.category)
->  Seq Scan on small_join_table filter
(cost=0.00..26.99 rows=1399 width=8)
->  Materialize  (cost=0.00..24420487.02
rows=643 width=99)
  ->  Seq Scan on log_table l
(cost=0.00..24420483.80 rows=643 width=99)
Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
  ?column?
---
2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

 QUERY PLAN
---
WindowAgg  (cost=19664576.17..19664590.63 rows=643 width=120)
  ->  Sort  (cost=19664576.17..19664577.77 rows=643 width=104)
Sort Key: unique_cases.source, unique_cases.rec_insert_time
->  Subquery Scan on unique_cases  (cost=19664533.31..19664546.17
rows=643 width=104)
  ->  Unique  (cost=19664533.31..19664539.74 rows=643
width=124)
->  Sort  (cost=19664533.31..19664534.92 rows=643
width=124)
  Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
  ->  Nested Loop  (cost=3181.19..19664503.32
rows=643 width=124)
->  Gather  (cost=3180.91..19662574.92
rows=643 width=99)
  Workers Planned: 3
  ->  Parallel Bitmap Heap Scan on
log_table l  (cost=2180.91..19661510.62 rows=207 width=99)
                                Recheck Cond: (rec_insert_time >= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
                                Filter: ((field1 IS NOT NULL) AND (category = 'music'::name))
 

Re: Connect as multiple users using single client certificate

2019-10-11 Thread Andrew Dunstan


On 10/11/19 1:58 PM, Kyle Bateman wrote:
> I have some JS middleware that needs to securely connect to the
> postgresql back end.  Any number of different users may connect via
> websocket to this middleware to manage their connection to the
> database.  I want the JS process to have a client certificate
> authorizing it to connect to the database.
>
> I have this line in my pg_hba.conf:
>
> hostssl        all    +users        all        cert
>
> So the idea is, I should be able to connect as any user that is a
> member of the role "users."
>
> Under this configuration, I can currently connect as the user "users"
> but not as "joe" who is a member of the role "users."  I get:
>
> FATAL:  certificate authentication failed for user "joe"
>
> This makes sense as the commonName on the certificate is "users" and
> not "joe."  But the documentation for pg_hba.conf states that
> prefixing the username with a "+" should allow me to connect as any
> role who is a member of the stated role.
>
> Is there a way to do this via client certificate authorization?  I
> have no way of knowing the specific usernames ahead of time, as new
> users may be created in the database (thousands) and I can't really be
> creating separate certificates for every different user.
>
>


I think the short answer is: No. The client certificate should match the
username and nothing else. If you don't want to generate certificates
for all your users I suggest using some other form of auth (e.g.
scram-sha-256).


The long answer is that you can use maps, but it's probably not a good
idea. e.g. you have a map allowing foo to connect as both bar and baz,
and give both bar and baz a certificate with a CN of foo. But then bar
can connect as baz and vice versa, which isn't a good thing.
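
For what it's worth, the kind of map described above would look roughly
like this, using the foo/bar/baz names from the example (illustrative only,
not a recommendation):

# pg_ident.conf
# MAPNAME   SYSTEM-USERNAME (cert CN)   PG-USERNAME
certmap     foo                         bar
certmap     foo                         baz

# pg_hba.conf
hostssl   all   all   all   cert   map=certmap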


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: stress test for parallel workers

2019-10-11 Thread Andrew Dunstan


On 10/11/19 11:45 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.
> FWIW, I'm not excited about that as a permanent solution.  It requires
> root privilege, and it affects the whole machine not only the buildfarm,
> and making it persist across reboots is even more invasive.



OK, but I'm not keen to have to tussle with coredumpctl. Right now our
logic says: for every core file in the data directory try to get a
backtrace. Use of systemd-coredump means that gets blown out of the
water, and we no longer even have a simple test to see if our program
caused a core dump.


cheers


andrew


-- 

Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-11 Thread Justin Pryzby
On Fri, Oct 11, 2019 at 10:48:37AM -0400, Tom Lane wrote:
> Justin Pryzby  writes:
> > The view is actually a join of two relkind=p partitioned tables (which I
> > will acknowledge probably performs poorly).
> 
> Could you provide a self-contained test case please?

Working on it.  FWIW explain for a v11 customer looks like this:

 Nested Loop  (cost=1011818.10..1076508.23 rows=734500 width=159)
   Join Filter: ((s.site_location = ''::text) OR (s.site_office = 
(COALESCE(huawei_ggsn_201610.ne_name, huawei_ggsn_gw_201610.ne_name
   ->  Group  (cost=11818.10..11946.31 rows=2937 width=40)
 Group Key: (COALESCE(huawei_ggsn_201610.start_time, 
huawei_ggsn_gw_201610.start_time)), (COALESCE(huawei_ggsn_201610.ne_name, 
huawei_ggsn_gw_201610.ne_name))
 ->  Merge Append  (cost=11818.10..11931.59 rows=2944 width=40)
   Sort Key: (COALESCE(huawei_ggsn_201610.start_time, 
huawei_ggsn_gw_201610.start_time)), (COALESCE(huawei_ggsn_201610.ne_name, 
huawei_ggsn_gw_201610.ne_name))
   ->  Group  (cost=332.48..333.10 rows=83 width=40)
 Group Key: (COALESCE(huawei_ggsn_201610.start_time, 
huawei_ggsn_gw_201610.start_time)), (COALESCE(huawei_ggsn_201610.ne_name, 
huawei_ggsn_gw_201610.ne_name))
 ->  Sort  (cost=332.48..332.69 rows=83 width=40)
   Sort Key: (COALESCE(huawei_ggsn_201610.start_time, 
huawei_ggsn_gw_201610.start_time)), (COALESCE(huawei_ggsn_201610.ne_name, 
huawei_ggsn_gw_201610.ne_name))
   ->  Hash Full Join  (cost=46.48..329.84 rows=83 
width=40)
 Hash Cond: ((huawei_ggsn_201610.ne_name = 
huawei_ggsn_gw_201610.ne_name) AND (huawei_ggsn_201610.ggsn_function = 
huawei_ggsn_gw_201610.ggsn_function) AND (huawei_ggsn_201610.start_time = 
huawei_
ggsn_gw_201610.start_time) AND (huawei_ggsn_201610.interval_seconds = 
huawei_ggsn_gw_201610.interval_seconds) AND (huawei_ggsn_201610.device_id = 
huawei_ggsn_gw_201610.device_id) AND (huawei_ggsn_201610.c_134710251 = 
huawei_ggs
n_gw_201610.c_134710251) AND (huawei_ggsn_201610.c_134710252 = 
huawei_ggsn_gw_201610.c_134710252) AND (huawei_ggsn_201610.c_134710253 = 
huawei_ggsn_gw_201610.c_134710253) AND (huawei_ggsn_201610.ne_id = 
huawei_ggsn_gw_201610.ne
_id) AND (huawei_ggsn_201610.ugw_function = huawei_ggsn_gw_201610.ugw_function))
 Filter: 
((COALESCE(huawei_ggsn_201610.start_time, huawei_ggsn_gw_201610.start_time) >= 
'2019-10-01 00:00:00-11'::timestamp with time zone) AND 
(COALESCE(huawei_ggsn_201610.start_time, huawei_ggs
n_gw_201610.start_time) < '2019-10-02 00:00:00-11'::timestamp with time zone))
 ->  Seq Scan on huawei_ggsn_201610  
(cost=0.00..255.44 rows=744 width=94)
 ->  Hash  (cost=20.44..20.44 rows=744 width=94)
   ->  Seq Scan on huawei_ggsn_gw_201610  
(cost=0.00..20.44 rows=744 width=94)
[...]

I'm suspecting this; is it useful to test with this commit reverted ?

commit 8edd0e79460b414b1d971895312e549e95e12e4f
Author: Tom Lane 
Date:   Mon Mar 25 15:42:35 2019 -0400

Suppress Append and MergeAppend plan nodes that have a single child.





Connect as multiple users using single client certificate

2019-10-11 Thread Kyle Bateman
I have some JS middleware that needs to securely connect to the 
postgresql back end.  Any number of different users may connect via 
websocket to this middleware to manage their connection to the 
database.  I want the JS process to have a client certificate 
authorizing it to connect to the database.


I have this line in my pg_hba.conf:

hostssl        all    +users        all        cert

So the idea is, I should be able to connect as any user that is a member 
of the role "users."


Under this configuration, I can currently connect as the user "users" 
but not as "joe" who is a member of the role "users."  I get:


FATAL:  certificate authentication failed for user "joe"

This makes sense as the commonName on the certificate is "users" and not 
"joe."  But the documentation for pg_hba.conf states that prefixing the 
username with a "+" should allow me to connect as any role who is a 
member of the stated role.


Is there a way to do this via client certificate authorization?  I have 
no way of knowing the specific usernames ahead of time, as new users may 
be created in the database (thousands) and I can't really be creating 
separate certificates for every different user.






Re: stress test for parallel workers

2019-10-11 Thread Tom Lane
Andrew Dunstan  writes:
>> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.

FWIW, I'm not excited about that as a permanent solution.  It requires
root privilege, and it affects the whole machine not only the buildfarm,
and making it persist across reboots is even more invasive.

> I have done the same on this machine. wobbegong runs every hour, so
> let's see what happens next. With any luck the buildfarm will give us a
> stack trace without needing further action.

I already collected one manually.  It looks like this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
5114{
Missing separate debuginfos, use: dnf debuginfo-install 
glibc-2.26-30.fc27.ppc64le
(gdb) bt
#0  sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#1  
#2  0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#3  0x103fad08 in reaper (postgres_signal_arg=)
at postmaster.c:3215
#4  
#5  0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#6  0x103f9f98 in sigusr1_handler (postgres_signal_arg=)
at postmaster.c:5275
#7  
#8  0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#9  0x103fad08 in reaper (postgres_signal_arg=)
at postmaster.c:3215
#10 
#11 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#12 
#13 0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#14 0x103f9f98 in sigusr1_handler (postgres_signal_arg=)
at postmaster.c:5275
#15 
#16 0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#17 0x103fad08 in reaper (postgres_signal_arg=)
at postmaster.c:3215
...
#572 
#573 0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#574 0x103f9f98 in sigusr1_handler (
postgres_signal_arg=) at postmaster.c:5275
#575 
#576 0x7fff93923ca4 in sigprocmask () from /lib64/libc.so.6
#577 0x103fad08 in reaper (postgres_signal_arg=)
at postmaster.c:3215
#578 
#579 sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5114
#580 
#581 0x7fff93a01514 in select () from /lib64/libc.so.6
#582 0x103f7ad8 in ServerLoop () at postmaster.c:1682
#583 PostmasterMain (argc=, argv=)
at postmaster.c:1391
#584 0x in ?? ()

What we've apparently got here is that signals were received
so fast that the postmaster ran out of stack space.  I remember
Andres complaining about this as a theoretical threat, but I
hadn't seen it in the wild before.

I haven't finished investigating though, as there are some things
that remain to be explained.  The dependency on having
force_parallel_mode = regress makes sense now, because the extra
traffic to launch and reap all those parallel workers would
increase the stress on the postmaster (and it seems likely that
this stack trace corresponds exactly to alternating launch and
reap signals).  But why does it only happen during the pg_upgrade
test --- plain "make check" ought to be about the same?  I also
want to investigate why clang builds seem more prone to this
than gcc builds on the same machine; that might just be down to
more or less stack consumption, but it bears looking into.

regards, tom lane




Re: PostgreSQL, C-Extension, calling other Functions

2019-10-11 Thread Andrew Gierth
> "Stefan" == Stefan Wolf  writes:

 Stefan> I´ve written some PostgreSQL C-Extensions (for the first
 Stefan> time...) and they work as expected.

 Stefan> But now I want to call other functions from inside the
 Stefan> C-Extensions (but not via SPI_execute), for example
 Stefan> "regexp_match()" or from other extensions like PostGIS
 Stefan> "ST_POINT" etc...

 Stefan> I think "fmgr" is the key - but I didn't find any examples.

There are a number of levels of difficulty here depending on which
specific functions you need to call and whether you need to handle
NULLs.

The simplest case is using DirectFunctionCall[N][Coll] to call a builtin
(internal-language) function. This _only_ works for functions that don't
require access to an flinfo; many functions either need the flinfo to
get parameter type info or to use fn_extra as a per-callsite cache.
(Also there's no guarantee that a function that works without flinfo now
will continue to do so in future versions.) One restriction of this
method is that neither parameters nor results may be NULL.
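
For example (a minimal sketch, not taken from Stefan's code), calling the
built-in int4 addition function this way looks like:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"		/* declares int4pl() among others */

/* Sketch: add two int4 values by calling the built-in int4pl directly. */
static int32
add_int4(int32 a, int32 b)
{
	Datum		sum;

	sum = DirectFunctionCall2(int4pl,
							  Int32GetDatum(a),
							  Int32GetDatum(b));
	return DatumGetInt32(sum);
}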

The next step up from that is getting a function's Oid and using
OidFunctionCall[N][Coll]. This can call functions in any language,
including dynamic-loaded ones, but it can't handle polymorphic
functions. (Overloaded functions are fine, since each overload has its
own Oid.) This is still fairly simple but is inefficient: it constructs
and populates the flinfo, calls it once, then abandons it (it's not even
freed, it's up to the calling memory context to do that). If you're
going to be invoking a function repeatedly, it's worth avoiding this
one. This still has the restriction of no NULLs either in or out.

The next step from that is calling fmgr_info and FunctionCall[N][Coll]
separately (which is just breaking down OidFunctionCall into its parts);
this allows you to re-use the flinfo for multiple calls. Still no NULLs
allowed, but it's possible to use polymorphic functions if you try hard
enough (it's not documented, but it requires consing up a faked
expression tree and using fmgr_info_set_expr).
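
A sketch of that pattern (the function's Oid is assumed to have been
resolved elsewhere, e.g. from the catalogs; NULLs are still not allowed):

#include "postgres.h"
#include "fmgr.h"

/*
 * Sketch: build the FmgrInfo once with fmgr_info() and reuse it for many
 * calls, instead of paying the OidFunctionCall setup cost each time.
 */
static void
apply_binary_func(Oid fn_oid, Datum *args1, Datum *args2,
				  Datum *results, int nitems)
{
	FmgrInfo	flinfo;
	int			i;

	fmgr_info(fn_oid, &flinfo);		/* fill the call info just once */

	for (i = 0; i < nitems; i++)
		results[i] = FunctionCall2(&flinfo, args1[i], args2[i]);
}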

Finally, if none of the above apply, you're at the level where you
should seriously consider using SPI regardless; but if you don't want to
do that, you can use fmgr_info, InitFunctionCallInfoData and
FunctionCallInvoke.

-- 
Andrew (irc:RhodiumToad)




Re: stress test for parallel workers

2019-10-11 Thread Andrew Dunstan


On 10/10/19 6:01 PM, Andrew Dunstan wrote:
> On 10/10/19 5:34 PM, Tom Lane wrote:
>> I wrote:
> Yeah, I've been wondering whether pg_ctl could fork off a subprocess
> that would fork the postmaster, wait for the postmaster to exit, and then
> report the exit status.
>>> [ pushed at 6a5084eed ]
>>> Given wobbegong's recent failure rate, I don't think we'll have to wait
>>> long.
>> Indeed, we didn't:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wobbegong=2019-10-10%2020%3A54%3A46
>>
>> The tail end of the system log looks like
>>
>> 2019-10-10 21:00:33.717 UTC [15127:306] pg_regress/date FATAL:  postmaster 
>> exited during a parallel transaction
>> 2019-10-10 21:00:33.717 UTC [15127:307] pg_regress/date LOG:  disconnection: 
>> session time: 0:00:02.896 user=fedora database=regression host=[local]
>> /bin/sh: line 1: 14168 Segmentation fault  (core dumped) 
>> "/home/fedora/build-farm-10-clang/buildroot/HEAD/pgsql.build/tmp_install/home/fedora/build-farm-clang/buildroot/HEAD/inst/bin/postgres"
>>  -F -c listen_addresses="" -k "/tmp/pg_upgrade_check-ZrhQ4h"
>> postmaster exit status is 139
>>
>> So that's definitive proof that the postmaster is suffering a SIGSEGV.
>> Unfortunately, we weren't blessed with a stack trace, even though
>> wobbegong is running a buildfarm client version that is new enough
>> to try to collect one.  However, seeing that wobbegong is running
>> a pretty-recent Fedora release, the odds are that systemd-coredump
>> has commandeered the core dump and squirreled it someplace where
>> we can't find it.
>
>
> At least on F29 I have set /proc/sys/kernel/core_pattern and it works.




I have done the same on this machine. wobbegong runs every hour, so
let's see what happens next. With any luck the buildfarm will give us a
stack trace without needing further action.


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: v12.0: ERROR: could not find pathkey item to sort

2019-10-11 Thread Tom Lane
Justin Pryzby  writes:
> I've reduced the failing query as much as possible to this:
> -- This is necessary to fail:
> SET enable_nestloop=off;

> SELECT * FROM
> (SELECT start_time, t1.site_id
> FROM pgw_kpi_view t1
> -- Apparently the where clause is necessary to fail...
> WHERE (start_time>='2019-10-10' AND start_time<'2019-10-11')
> -- The group by MAY be necessary to fail...
> GROUP BY 1,2
> ) AS data
> JOIN sites ON ( sites.site_location='' OR sites.site_office=data.site_id)

> The view is actually a join of two relkind=p partitioned tables (which I
> will acknowledge probably performs poorly).

Could you provide a self-contained test case please?

regards, tom lane




v12.0: ERROR: could not find pathkey item to sort

2019-10-11 Thread Justin Pryzby
I've reduced the failing query as much as possible to this:

-- This is necessary to fail:
SET enable_nestloop=off;

SELECT * FROM
(SELECT start_time, t1.site_id
FROM pgw_kpi_view t1
-- Apparently the where clause is necessary to fail...
WHERE (start_time>='2019-10-10' AND start_time<'2019-10-11')
-- The group by MAY be necessary to fail...
GROUP BY 1,2
) AS data
JOIN sites ON ( sites.site_location='' OR sites.site_office=data.site_id)

The view is actually a join of two relkind=p partitioned tables (which I
will acknowledge probably performs poorly).

(gdb) bt
#0  errfinish (dummy=dummy@entry=0) at elog.c:411
#1  0x0087a959 in elog_finish (elevel=elevel@entry=20, 
fmt=fmt@entry=0x9d93d8 "could not find pathkey item to sort") at elog.c:1365
#2  0x006a587f in prepare_sort_from_pathkeys (lefttree=0x7f7cb84492e8, 
pathkeys=, relids=0x7f7cb8410700, reqColIdx=reqColIdx@entry=0x0, 
adjust_tlist_in_place=, 
adjust_tlist_in_place@entry=false, 
p_numsortkeys=p_numsortkeys@entry=0x7ffc4b2e10c4, 
p_sortColIdx=p_sortColIdx@entry=0x7ffc4b2e10c8, 
p_sortOperators=p_sortOperators@entry=0x7ffc4b2e10d0, 
p_collations=p_collations@entry=0x7ffc4b2e10d8, 
p_nullsFirst=p_nullsFirst@entry=0x7ffc4b2e10e0) at createplan.c:5893
#3  0x006a5a6a in make_sort_from_pathkeys (lefttree=, 
pathkeys=, relids=) at createplan.c:6020
#4  0x006a6e30 in create_sort_plan (flags=4, best_path=0x7f7cb8410cc8, 
root=0x7f7fdc3ac6b0) at createplan.c:1985
#5  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, 
best_path=0x7f7cb8410cc8, flags=flags@entry=4) at createplan.c:459
#6  0x006a6e4e in create_group_plan (best_path=0x7f7cb8410d58, 
root=0x7f7fdc3ac6b0) at createplan.c:2012
#7  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, 
best_path=best_path@entry=0x7f7cb8410d58, flags=flags@entry=1) at 
createplan.c:464
#8  0x006a8278 in create_merge_append_plan (flags=4, 
best_path=0x7f7cb8446cd8, root=0x7f7fdc3ac6b0) at createplan.c:1333
#9  create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, 
best_path=0x7f7cb8446cd8, flags=flags@entry=4) at createplan.c:402
#10 0x006a6e4e in create_group_plan (best_path=0x7f7cb84486c8, 
root=0x7f7fdc3ac6b0) at createplan.c:2012
#11 create_plan_recurse (root=root@entry=0x7f7fdc3ac6b0, 
best_path=0x7f7cb84486c8, flags=flags@entry=1) at createplan.c:464
#12 0x006a9739 in create_plan (root=0x7f7fdc3ac6b0, 
best_path=) at createplan.c:325
#13 0x006aa988 in create_subqueryscan_plan (scan_clauses=0x0, 
tlist=0x7f7cb8450820, best_path=0x7f7cb8448db8, root=0x7f7fdc34b948) at 
createplan.c:3385
#14 create_scan_plan (root=root@entry=0x7f7fdc34b948, 
best_path=best_path@entry=0x7f7cb8448db8, flags=, flags@entry=0) 
at createplan.c:670
#15 0x006a6d31 in create_plan_recurse (root=root@entry=0x7f7fdc34b948, 
best_path=0x7f7cb8448db8, flags=flags@entry=0) at createplan.c:427
#16 0x006a983a in create_nestloop_plan (best_path=0x7f7cb844fb80, 
root=0x7f7fdc34b948) at createplan.c:4008
#17 create_join_plan (root=root@entry=0x7f7fdc34b948, 
best_path=best_path@entry=0x7f7cb844fb80) at createplan.c:1020
#18 0x006a6d75 in create_plan_recurse (root=root@entry=0x7f7fdc34b948, 
best_path=0x7f7cb844fb80, flags=flags@entry=1) at createplan.c:393
#19 0x006a9739 in create_plan (root=root@entry=0x7f7fdc34b948, 
best_path=) at createplan.c:325
#20 0x006b5a04 in standard_planner (parse=0x1bd2308, cursorOptions=256, 
boundParams=0x0) at planner.c:413
#21 0x0075fb2e in pg_plan_query (querytree=querytree@entry=0x1bd2308, 
cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at 
postgres.c:878
#22 0x0075fbee in pg_plan_queries (querytrees=, 
cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at 
postgres.c:968
#23 0x0076007a in exec_simple_query (
query_string=0x1ba36e0 "SELECT * FROM\n\t(SELECT start_time, 
t1.site_id\n\tFROM pgw_kpi_view t1\n\t\n\tWHERE (start_time>='2019-10-10' AND 
start_time<'2019-10-11')\n\t\n\tGROUP BY 1,2\n\t) AS data\nJOIN sites ON ( 
sites.site_location='' OR"...) at postgres.c:1143
#24 0x00761212 in PostgresMain (argc=, 
argv=argv@entry=0x1bd8e70, dbname=0x1bd8d10 "ts", username=) at 
postgres.c:4236
#25 0x00483d02 in BackendRun (port=, port=) at postmaster.c:4431
#26 BackendStartup (port=0x1bd5190) at postmaster.c:4122
#27 ServerLoop () at postmaster.c:1704
#28 0x006f0b1f in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1b9e280) at postmaster.c:1377
#29 0x00484c93 in main (argc=3, argv=0x1b9e280) at main.c:228


bt f:

#2  0x006a587f in prepare_sort_from_pathkeys (lefttree=0x7f7cb84492e8, 
pathkeys=, relids=0x7f7cb8410700, reqColIdx=reqColIdx@entry=0x0, 
adjust_tlist_in_place=, 
adjust_tlist_in_place@entry=false, 
p_numsortkeys=p_numsortkeys@entry=0x7ffc4b2e10c4, 
p_sortColIdx=p_sortColIdx@entry=0x7ffc4b2e10c8, 

Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
On Thu, Oct 10, 2019 at 6:13 PM Tomas Vondra 
wrote:

>
> So this seems like a combination of multiple issues. Firstly, the bitmap
> index scan on rec_insert_time_brin_1000 estimate seems somewhat poor. It
> might be worth increasing stats target on that column, or something like
> that. Not sure, but it's about the only "fixable" thing here, I think.
>

In the OP I had mentioned that I already increased it to 5000, and it made
no difference.  Ah fine let's go ahead and try 1... still no change:

foo_prod=# ALTER TABLE log_table ALTER COLUMN rec_insert_time SET
STATISTICS 1;
ALTER TABLE
foo_prod=# ANALYZE log_table;
ANALYZE
foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= now() - interval '10 days'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

   QUERY PLAN
-
 WindowAgg  (cost=24451299.20..24451313.21 rows=623 width=120)
   ->  Sort  (cost=24451299.20..24451300.75 rows=623 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=24451257.82..24451270.28
rows=623 width=104)
   ->  Unique  (cost=24451257.82..24451264.05 rows=623
width=124)
 ->  Sort  (cost=24451257.82..24451259.38 rows=623
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=0.00..24451228.90
rows=623 width=124)
 Join Filter: ((l.category)::text =
filter.category)
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8)
 ->  Materialize  (cost=0.00..24435949.31
rows=623 width=99)
   ->  Seq Scan on log_table l
 (cost=0.00..24435946.20 rows=623 width=99)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

Thanks,
Jeremy


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
Dear Michael,

On Thu, Oct 10, 2019 at 5:20 PM Michael Lewis  wrote:

> Since the optimizer is choosing a seq scan over index scan when it seems
> like it has good row estimates in both cases, to me that may mean costs of
> scanning index are expected to be high. Is this workload on SSD? Has the
> random_page_cost config been decreased from default 4 (compared with cost
> of 1 unit for sequential scan)?
>

It's 1.5


> Your buffer hits aren't great. What is shared_buffers set to? How much ram
> on this cluster?
>

shared_buffers is 4GB.  The machine has 500GB of RAM, but it hosts several
clusters.


>
> With this table being insert only, one assumes correlation is very high on
> the data in this column as shown in pg_stats, but have your confirmed?
>

Yes, but the issue isn't with the BRIN index performing badly or being
fragmented.  It's that it performs great (7x faster than the seq scan) but
Postgres doesn't pick it.  I have also seen this same issue in other
attempts I have made to use BRIN.


> To me, distinct ON is often a bad code smell and probably can be
> re-written to be much more efficient with GROUP BY, lateral & order by, or
> some other tool. Same with the window function. It is a powerful tool, but
> sometimes not the right one.
>

I don't really agree, but it's beside the point because the issue is not in
the aggregation; it's pre-aggregation.  Indeed, if I run my query as a simple
select (as I tried), it's the exact same planning issue.  (In my experience,
DISTINCT ON is the fastest for the given example.  Same with window functions,
which prevent inefficient self-joins.)


> Is "source" a function that is called on field1? What is it doing/how is
> it defined?
>

I can't see how that matters either, but the "source" function is a mask
for a built-in pg function that is trivial.  This whole query is masked so
as not to expose our actual prod query, but I hope it's still
understandable enough :).

My question is not how to make this query faster in general.  It's that I
want to use BRIN indexes very much, but I'm not sure I can trust they will
scale with the right query plan like I know BTREE will.

Thanks!
Jeremy


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread Jeremy Finzel
On Thu, Oct 10, 2019 at 7:22 PM David Rowley 
wrote:

> The planner might be able to get a better estimate on the number of
> matching rows if the now() - interval '10 days' expression was
> replaced with 'now'::timestamptz - interval '10 days'. However, care
> would need to be taken to ensure the plan is never prepared since
> 'now' is evaluated during parse. The same care must be taken when
> creating views, functions, stored procedures and the like.
>

You are on to something here I think with the now() function, even if above
suggestion is not exactly right as you said further down.  I am finding a
hard-coded timestamp gives the right query plan.  I also tested same with
even bigger window (last 16 days) and it yet still chooses the brin index.

foo_prod=# EXPLAIN
foo_prod-# SELECT
foo_prod-#  category, source, MIN(rec_insert_time) OVER (partition by
source order by rec_insert_time) AS first_source_time, MAX(rec_insert_time)
OVER (partition by source order by rec_insert_time) AS last_source_time
foo_prod-# FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
foo_prod(# category, source(field1) AS source, rec_insert_time
foo_prod(# FROM log_table l
foo_prod(# INNER JOIN public.small_join_table filter ON filter.category =
l.category
foo_prod(# WHERE field1 IS NOT NULL AND l.category = 'music'
foo_prod(# AND l.rec_insert_time >= now() - interval '10 days'
foo_prod(# ORDER BY brand_id, last_change, log_id, rec_insert_time DESC)
unique_cases;

   QUERY PLAN
-
 WindowAgg  (cost=24436329.10..24436343.56 rows=643 width=120)
   ->  Sort  (cost=24436329.10..24436330.70 rows=643 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=24436286.24..24436299.10
rows=643 width=104)
   ->  Unique  (cost=24436286.24..24436292.67 rows=643
width=124)
 ->  Sort  (cost=24436286.24..24436287.85 rows=643
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=0.00..24436256.25
rows=643 width=124)
 Join Filter: ((l.category)::text =
filter.category)
 ->  Seq Scan on small_join_table filter
 (cost=0.00..26.99 rows=1399 width=8)
 ->  Materialize  (cost=0.00..24420487.02
rows=643 width=99)
   ->  Seq Scan on log_table l
 (cost=0.00..24420483.80 rows=643 width=99)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name) AND (rec_insert_time >= (now() - '10
days'::interval)))
(13 rows)

foo_prod=# SELECT now() - interval '10 days';
   ?column?
---
 2019-10-01 08:20:38.115471-05
(1 row)

foo_prod=# EXPLAIN
SELECT
 category, source, MIN(rec_insert_time) OVER (partition by source order by
rec_insert_time) AS first_source_time, MAX(rec_insert_time) OVER (partition
by source order by rec_insert_time) AS last_source_time
FROM (SELECT DISTINCT ON (brand_id, last_change, log_id)
category, source(field1) AS source, rec_insert_time
FROM log_table l
INNER JOIN public.small_join_table filter ON filter.category = l.category
WHERE field1 IS NOT NULL AND l.category = 'music'
AND l.rec_insert_time >= '2019-10-01 08:20:38.115471-05'
ORDER BY brand_id, last_change, log_id, rec_insert_time DESC) unique_cases;

  QUERY PLAN
---
 WindowAgg  (cost=19664576.17..19664590.63 rows=643 width=120)
   ->  Sort  (cost=19664576.17..19664577.77 rows=643 width=104)
 Sort Key: unique_cases.source, unique_cases.rec_insert_time
 ->  Subquery Scan on unique_cases  (cost=19664533.31..19664546.17
rows=643 width=104)
   ->  Unique  (cost=19664533.31..19664539.74 rows=643
width=124)
 ->  Sort  (cost=19664533.31..19664534.92 rows=643
width=124)
   Sort Key: l.brand_id, l.last_change, l.log_id,
l.rec_insert_time DESC
   ->  Nested Loop  (cost=3181.19..19664503.32
rows=643 width=124)
 ->  Gather  (cost=3180.91..19662574.92
rows=643 width=99)
   Workers Planned: 3
   ->  Parallel Bitmap Heap Scan on
log_table l  (cost=2180.91..19661510.62 rows=207 width=99)
 Recheck Cond: (rec_insert_time
>= '2019-10-01 08:20:38.115471-05'::timestamp with time zone)
 Filter: ((field1 IS NOT NULL)
AND (category = 'music'::name))
   

Re: [Proposal] Global temporary tables

2019-10-11 Thread Konstantin Knizhnik



On 11.10.2019 15:15, 曾文旌(义从) wrote:

Dear Hackers,

This proposes a way to develop global temporary tables in PostgreSQL.

I noticed that there is an "Allow temporary tables to exist as empty 
by default in all sessions" in the postgresql todolist.

https://wiki.postgresql.org/wiki/Todo

In recent years, PG community had many discussions about global temp 
table (GTT) support. Previous discussion covered the following topics:
(1)The main benefit or function: GTT offers features like “persistent 
schema, ephemeral data”, which avoids catalog bloat and reduces 
catalog vacuum.

(2)Whether follows ANSI concept of temporary tables
(3)How to deal with statistics, single copy of schema definition, relcache
(4)More can be seen in 
https://www.postgresql.org/message-id/73954ab7-44d3-b37b-81a3-69bdcbb446f7%40postgrespro.ru
(5)A recent implementation and design from Konstantin Knizhnik covered 
many functions of GTT: 
https://www.postgresql.org/message-id/attachment/103265/global_private_temp-1.patch


However, as pointed by Konstantin himself, the implementation still 
needs functions related to CLOG, vacuum, and MVCC visibility.




Just to clarify.
I have now proposed several different solutions for GTT:

Shared vs. private buffers for GTT:
1. Private buffers. This is least invasive patch, requiring no changes 
in relfilenodes.
2. Shared buffers. Requires changing relfilenode but supports parallel 
query execution for GTT.


Access to GTT at replica:
1. Access is prohibited (as for original temp tables). No changes at all.
2. Tuples of temp tables are marked with a frozen XID.  Minimal changes;
rollbacks are not possible.
3. Providing special XIDs for GTT at the replica. No changes in CLOG are
required, but special MVCC visibility rules are used for GTT. Current
limitation: the number of transactions accessing GTT at a replica is limited
to 2^32, and a bitmap of corresponding size has to be maintained (tuples of
GTT are not processed by vacuum and not frozen, so the XID horizon never
moves).


So, except for the limitation mentioned above (which I do not consider
critical), there is only one problem which was not addressed: maintaining
statistics for GTT.

If all of the following conditions are true:

1) GTT are used in joins
2) There are indexes defined for GTT
3) Size and histogram of GTT in different backends can significantly vary.
4) ANALYZE was explicitly called for GTT

then a query execution plan built in one backend will also be used by
other backends, where it can be inefficient.
I also do not consider this problem a "show stopper" for adding GTT to
Postgres.


I still do not understand the community's opinion on which GTT
functionality is considered most important.
But the patch with local buffers and no replica support is small enough
to be a good starting point.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Remove size limitations of vacuums dead_tuples array

2019-10-11 Thread Ants Aasma
On Thu, 10 Oct 2019 at 17:05, Tomas Vondra 
wrote:

> There already was a attempt to make this improvement, see [1]. There was
> a fairly long discussion about how to best do that (using other data
> structure, not just a simple array). It kinda died about a year ago, but
> I suppose there's a lot of relevant info in that thread.
>
> [1]
> https://www.postgresql.org/message-id/CAGTBQpbDCaR6vv9%3DscXzuT8fSbckf%3Da3NgZdWFWZbdVugVht6Q%40mail.gmail.com


Thanks for the pointer, wow that's a long thread. For some reason it did
not consider lifting the INT_MAX tuples/12GB limitation. I'll see if I can
pick up where that thread left off and push it along.

Regards,
Ants Aasma
Web: https://www.cybertec-postgresql.com


[Proposal] Global temporary tables

2019-10-11 Thread 曾文旌(义从)
Dear Hackers,

This proposes a way to develop global temporary tables in PostgreSQL.

I noticed that there is an "Allow temporary tables to exist as empty by
default in all sessions" item in the PostgreSQL todo list:
https://wiki.postgresql.org/wiki/Todo

In recent years, the PG community has had many discussions about global temp
table (GTT) support. Previous discussions covered the following topics:
(1) The main benefit or function: GTT offers features like “persistent 
schema, ephemeral data”, which avoids catalog bloat and reduces catalog vacuum. 
(2) Whether follows ANSI concept of temporary tables
(3) How to deal with statistics, single copy of schema definition, relcache
(4) More can be seen in 
https://www.postgresql.org/message-id/73954ab7-44d3-b37b-81a3-69bdcbb446f7%40postgrespro.ru
(5) A recent implementation and design from Konstantin Knizhnik covered 
many functions of GTT: 
https://www.postgresql.org/message-id/attachment/103265/global_private_temp-1.patch
 


However, as pointed by Konstantin himself, the implementation still needs 
functions related to CLOG, vacuum, and MVCC visibility.

We developed GTT based on PG 11 and included most needed features, such as how 
to deal with concurrent DDL and DML operations, how to handle vacuum and too 
old relfrozenxids, and how to store and access GTT statistics. 

This design followed many suggestions from previous discussions in the
community. Here are some examples:
- "have a separate 'relpersistence' setting for global temp tables… by
  having the backend id in all filenames…"  -- from Andres Freund
- Use a session memory context to store information related to GTT.
  -- from Pavel Stehule
- "extend the relfilenode mapper to support a backend-local non-persistent
  relfilenode map that's used to track temp table and index relfilenodes…"
  -- from Craig Ringer

Our implementation creates one record in pg_class for the GTT's schema
definition. When rows are first inserted into the GTT in a session, a
session-specific file is created to store the GTT's data. Those files are
removed when the session ends. We maintain the GTT's statistics in
session-local memory. DDL operations, such as DROP TABLE or CREATE INDEX,
can be executed on a GTT by only one session, and only while no other
session has inserted any data into the GTT or it has already been truncated.
This also avoids concurrent DML and DDL operations on a GTT. We maintain a
session-level oldest relfrozenxid for GTTs. This way, autovacuum or vacuum
can truncate the CLOG and advance the global relfrozenxid based on all
tables' relfrozenxids, including the GTTs'.
The following summarizes the main design and implementation:
- Syntax: ON COMMIT PRESERVE ROWS and ON COMMIT DELETE ROWS.
- Data storage and buffering work the same way as for local temp tables,
  with a relfilenode that includes the session id.
- A hash table (A) in shared memory is used to track sessions and their
  usage of GTTs and to serialize DDL and DML operations.
- Another hash table (B) in session memory is introduced to record storage
  files for GTTs and their indexes. When a session ends, those files are
  removed.
- The same hash table (B) in session memory is used to record the
  relfrozenxids of each GTT. The oldest one is stored in MyProc so that
  autovacuum and vacuum may use it to determine the global oldest
  relfrozenxid and truncate the clog.
- The same hash table (B) in session memory stores the GTT's session-level
  statistics. They are generated during vacuum and analyze operations, and
  used by the SQL optimizer to create execution plans.
- Some utility functions are added for DBAs to manage GTTs.
- The TRUNCATE command on a GTT behaves differently from that on a normal
  table. The command deletes the data immediately but keeps the relfilenode,
  using a lower-level table lock, RowExclusiveLock, instead of
  AccessExclusiveLock.

Main limits of this version, or future improvements (suggestions from the
community are needed):
1. VACUUM FULL and CLUSTER are not supported; any operation that may change
   the relfilenode is disabled for GTT.
2. Sequence columns are not supported in GTT for now.
3. User-defined statistics are not supported.

Details:

Requirement
The feature list for global temp tables:
1. global temp table (when the ON COMMIT clause is omitted, SQL specifies
that the default behavior is ON COMMIT DELETE ROWS)
2. support for ON COMMIT DELETE ROWS
3. support for ON COMMIT PRESERVE ROWS
4. ON COMMIT DROP is not supported

Feature description
Global temp tables are defined just once and automatically exist (starting
with empty contents) in every session that needs them.
For a global temp table, each session uses local buffers and reads or writes
independent data files.
Use on commit DELETE ROWS for a transaction-specific global temp 

Re: let's make the list of reportable GUCs configurable (was Re: Add %r substitution for psql prompts to show recovery status)

2019-10-11 Thread Dave Cramer
On Thu, 10 Oct 2019 at 12:05, Andres Freund  wrote:

> Hi,
>
> On 2019-10-09 16:29:07 -0400, Dave Cramer wrote:
> > I've added functionality into libpq to be able to set this STARTUP
> > parameter as well as changed it to _pq_.report.
> > Still need to document this and figure out how to test it.
>
>
> > From 85de9f48f80a3bfd9e8bdd4f1ba6b177b1ff9749 Mon Sep 17 00:00:00 2001
> > From: Dave Cramer 
> > Date: Thu, 11 Jul 2019 08:20:14 -0400
> > Subject: [PATCH] Add a STARTUP packet option to set GUC_REPORT on GUC's
> that
> >  currently do not have that option set. There is a facility to add
> protocol
> >  options using _pq_. The new option name is report and takes a
> >  comma delmited string of GUC names which will have GUC_REPORT set. Add
> >  functionality into libpq to accept this new option key
>
> I don't think it's good to only be able to change this at connection
> time. Especially for poolers this ought to be configurable at any
> time. I do think startup message support makes sense (especially to
> avoid race conditions around to-be-reported gucs changing soon after
> connecting), don't get me wrong, I just don't think it's sufficient.
>

So, off the top of my head, providing a system function seems like the way
to go here, unless you were contemplating adding something to the protocol?

>
> > @@ -2094,6 +2094,7 @@ retry1:
> >* zeroing extra byte above.
> >*/
> >   port->guc_options = NIL;
> > + port->guc_report = NIL;
> >
> >   while (offset < len)
> >   {
> > @@ -2138,13 +2139,34 @@ retry1:
> >   }
> >   else if (strncmp(nameptr, "_pq_.", 5) == 0)
> >   {
> > - /*
> > -  * Any option beginning with _pq_. is
> reserved for use as a
> > -  * protocol-level option, but at present
> no such options are
> > -  * defined.
> > -  */
> > - unrecognized_protocol_options =
> > -
>  lappend(unrecognized_protocol_options, pstrdup(nameptr));
> > + if (strncasecmp(nameptr + 5, "report", 6)
> == 0)
> > + {
> > + char sep[3] = " ,";
> > +
> > + /* avoid scribbling on valptr */
> > + char *temp_val = pstrdup(valptr);
> > +
> > + /* strtok is going to scribble on
> temp_val */
> > + char *freeptr = temp_val;
> > + char *guc_report =
> strtok(temp_val, sep);
> > + while (guc_report)
> > + {
> > + port->guc_report =
> lappend(port->guc_report,
> > +
> pstrdup(guc_report));
> > + guc_report = strtok(NULL,
> sep);
> > + }
> > + pfree(freeptr);
> > + }
>
> I don't think it's good to open-code this inside
> ProcessStartupPacket(). Should be moved into its own function. I'd
> probably even move all of the option handling out of
> ProcessStartupPacket() before expanding it further.
>
> I don't like the use of strtok, nor the type of parsing done
> here. Perhaps we could just use SplitGUCList()?
>

Fair enough
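
Roughly, I guess the parsing then becomes something like this (sketch only,
untested; "valptr" as in the patch above, and the returned list cells point
into the pstrdup'd copy, so it is intentionally not freed):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/varlena.h"

static List *
parse_report_gucs(const char *valptr)
{
	char	   *rawstring = pstrdup(valptr);	/* SplitGUCList modifies it */
	List	   *gucs = NIL;

	if (!SplitGUCList(rawstring, ',', &gucs))
		ereport(FATAL,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("invalid list syntax in \"_pq_.report\" option")));
	return gucs;
}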

>
>
> > + else
> > + {
> > + /*
> > +  * Any option beginning with _pq_.
> is reserved for use as a
> > +  * protocol-level option, but at
> present no such options are
> > +  * defined.
> > +  */
> > + unrecognized_protocol_options =
> > +
>  lappend(unrecognized_protocol_options, pstrdup(nameptr));
> > + }
> >   }
>
> You can't just move a comment explaining what _pq_ is into the else,
> especially not without adjusting the contents.
>
> Upon review I'm left with "what was I thinking?"

>
>
>
>
> > +/*
> > + * Set the option to be GUC_REPORT
> > + */
> > +
> > +bool
> > +SetConfigReport(const char *name, bool missing_ok)
> > +{
> > + struct config_generic *record;
> >
> > + record = find_option(name, false, WARNING);
> > + if (record == NULL)
> > + {
> > + if (missing_ok)
> > + return 0;
> > + ereport(ERROR,
> > + (errcode(ERRCODE_UNDEFINED_OBJECT),
> > +  errmsg("unrecognized configuration
> parameter \"%s\"",
> > +

Re: [HACKERS] Block level parallel vacuum

2019-10-11 Thread Mahendra Singh
Hi

On Thu, 10 Oct 2019 at 13:18, Masahiko Sawada  wrote:

> On Thu, Oct 10, 2019 at 2:19 PM Amit Kapila 
> wrote:
> >
> > On Fri, Oct 4, 2019 at 4:18 PM Amit Kapila 
> wrote:
> > >
> > > On Wed, Oct 2, 2019 at 7:29 PM Masahiko Sawada 
> wrote:
> > >>
> >
> > Few more comments:
>
> Thank you for reviewing the patch!
>
> > -
> > 1.  Caurrently parallel vacuum is allowed for temporary relations
> > which is wrong.  It leads to below error:
> >
> > postgres=# create temporary table tmp_t1(c1 int, c2 char(10));
> > CREATE TABLE
> > postgres=# create index idx_tmp_t1 on tmp_t1(c1);
> > CREATE INDEX
> > postgres=# create index idx1_tmp_t1 on tmp_t1(c2);
> > CREATE INDEX
> > postgres=# insert into tmp_t1 values(generate_series(1,1),'');
> > INSERT 0 1
> > postgres=# delete from tmp_t1 where c1 > 5000;
> > DELETE 5000
> > postgres=# vacuum (parallel 2) tmp_t1;
> > ERROR:  cannot access temporary tables during a parallel operation
> > CONTEXT:  parallel worker
> >
> > The parallel vacuum shouldn't be allowed for temporary relations.
>
> Fixed.
>
> >
> > 2.
> > --- a/doc/src/sgml/ref/vacuum.sgml
> > +++ b/doc/src/sgml/ref/vacuum.sgml
> > @@ -34,6 +34,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [
> > boolean ]
> >  INDEX_CLEANUP [  > class="parameter">boolean ]
> >  TRUNCATE [ boolean ]
> > +PARALLEL [  > class="parameter">integer ]
> >
> > Now, if the user gives a command like Vacuum (analyze, parallel)
> > ; it is not very obvious that a parallel option will be
> > only used for vacuum purposes but not for analyze.  I think we can add
> > a note in the docs to mention this explicitly.  This can avoid any
> > confusion.
>
> Agreed.
>
> Attached the latest version patch although the memory usage problem is
> under discussion. I'll update the patches according to the result of
> that discussion.
>
>
I applied both patches on HEAD and did some testing. I am getting a crash
when freeing memory (pfree(stats[i])).

Steps to reproduce:
Step 1) Apply both patches and configure with the command below.
./configure --with-zlib  --enable-debug --prefix=$PWD/inst/
--with-openssl CFLAGS="-ggdb3" > war && make -j 8 install > war

Step 2) Now start the server.

Step 3) Run the commands below:

> create table tmp_t1(c1 int, c2 char(10));
> create index idx_tmp_t1 on tmp_t1(c1);
> create index idx1_tmp_t1 on tmp_t1(c2);
> insert into tmp_t1 values(generate_series(1,1),'');
> insert into tmp_t1 values(generate_series(1,1),'');
> insert into tmp_t1 values(generate_series(1,1),'');
> insert into tmp_t1 values(generate_series(1,1),'');
> insert into tmp_t1 values(generate_series(1,1),'');
> insert into tmp_t1 values(generate_series(1,1),'');
> delete from tmp_t1 where c1 > 5000;
> vacuum (parallel 2) tmp_t1;
>

*Call stack:*

> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: mahendra postgres [local] VACUUM
>'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00a4f97a in pfree (pointer=0x10baa68) at mcxt.c:1060
> 1060 context->methods->free_p(context, pointer);
> Missing separate debuginfos, use: debuginfo-install
> keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64
> libcom_err-1.42.9-12.el7_5.x86_64 libselinux-2.5-12.el7.x86_64
> openssl-libs-1.0.2k-12.el7.x86_64 pcre-8.32-17.el7.x86_64
> zlib-1.2.7-17.el7.x86_64
> (gdb) bt
> #0  0x00a4f97a in pfree (pointer=0x10baa68) at mcxt.c:1060
> #1  0x004e7d13 in update_index_statistics (Irel=0x10b9808,
> stats=0x10b9828, nindexes=2) at vacuumlazy.c:2277
> #2  0x004e693f in lazy_scan_heap (onerel=0x7f8d99610d08,
> params=0x7ffeeaddb7f0, vacrelstats=0x10b9728, Irel=0x10b9808, nindexes=2,
> aggressive=false) at vacuumlazy.c:1659
> #3  0x004e4d25 in heap_vacuum_rel (onerel=0x7f8d99610d08,
> params=0x7ffeeaddb7f0, bstrategy=0x1117528) at vacuumlazy.c:431
> #4  0x006a71a7 in table_relation_vacuum (rel=0x7f8d99610d08,
> params=0x7ffeeaddb7f0, bstrategy=0x1117528) at
> ../../../src/include/access/tableam.h:1432
> #5  0x006a9899 in vacuum_rel (relid=16384, relation=0x103b308,
> params=0x7ffeeaddb7f0) at vacuum.c:1870
> #6  0x006a7c22 in vacuum (relations=0x11176b8,
> params=0x7ffeeaddb7f0, bstrategy=0x1117528, isTopLevel=true) at vacuum.c:425
> #7  0x006a77e6 in ExecVacuum (pstate=0x105f578, vacstmt=0x103b3d8,
> isTopLevel=true) at vacuum.c:228
> #8  0x008af401 in standard_ProcessUtility (pstmt=0x103b6f8,
> queryString=0x103a808 "vacuum (parallel 2) tmp_t1;",
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
> dest=0x103b7d8, completionTag=0x7ffeeaddbc50 "") at utility.c:670
> #9  0x008aec40 in ProcessUtility (pstmt=0x103b6f8,
> queryString=0x103a808 "vacuum (parallel 2) tmp_t1;",
> context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
> 

Re: [PATCH] use separate PartitionedRelOptions structure to store partitioned table options

2019-10-11 Thread Nikolay Shaplov
> > I think it is a bad idea to suggest the option adder add it to
> > StdRdOptions; we already have a big mess there. Better if he adds it
> > to a new empty structure.
>
> I tend to agree that this improves readability of the reloptions code
> a bit.
>
> Some comments on the patch:
>
> How about naming the function partitioned_table_reloptions() instead
> of partitioned_reloptions()?  

I have my doubts about using the word "table" here... In the relational
model there is no such concept as a "table"; it uses the concept of a
"relation". And in Postgres there were no tables, only relations: heap
relations, toast relations, all kinds of index relations... I do not
understand when and why the word "table" appeared when we speak about a
virtual relation that is made of several real heap relations. That is
why I am not using the word "table" here...

But since you are the author of the partitioned table code, and this code
is already accepted in core, you should know better. So I will change it
the way you say.
 
> +switch ((int)relkind)
>
> Needs a space between (int) and relkind, but I don't think the (int)
> cast is really necessary.  I don't see it in other places.  
Oh, yeah, this is my mistake... I had some strange compilation
problems, and this is a remnant of my efforts to find the cause;
I forgot to clean it up...
Thanks!
   
> Speaking of partitioned relations vs tables, I see that you didn't
> touch partitioned indexes (RELKIND_PARTITIONED_INDEX relations).  Is
> that because we leave option handling to the index AMs?  

Because for partitioned indexes the code effectively says "use the same
options the original index does":

bytea *
extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
				  amoptions_function amoptions)
{
	switch (classForm->relkind)
	{
		case RELKIND_RELATION:
		case RELKIND_TOASTVALUE:
		case RELKIND_MATVIEW:
			options = heap_reloptions(classForm->relkind, datum, false);
			break;
		case RELKIND_PARTITIONED_TABLE:
			options = partitioned_table_reloptions(datum, false);
			break;
		case RELKIND_VIEW:
			options = view_reloptions(datum, false);
			break;
		case RELKIND_INDEX:
		case RELKIND_PARTITIONED_INDEX:
			options = index_reloptions(amoptions, datum, false);
			break;


Here you can see that the function accepts an amoptions method that knows
how to parse options for a particular index, and passes it on to the
index_reloptions function. For both indexes and partitioned indexes it is
taken from the AM's amoptions "method". So option parsing uses exactly the
same code and the same options both for indexes and for partitioned
indexes.

I do not know whether that is correct from a global point of view, but from
the point of view of the reloptions engine it does not require any
attention: change the index options and you get partitioned-index options
for free...

Actually I would expect some problems there, because sooner or later
somebody will want to set a custom fillfactor for a partitioned table, or
set some custom autovacuum options for it. But I would prefer to think
about that when I am done with the reloptions engine rewrite... Working in
both directions would cause more trouble than benefit...

diff --git a/.gitignore b/.gitignore
index 794e35b..37331c2 100644
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index b5072c0..067441f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1099,9 +1099,11 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_RELATION:
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
-		case RELKIND_PARTITIONED_TABLE:
 			options = heap_reloptions(classForm->relkind, datum, false);
 			break;
+		case RELKIND_PARTITIONED_TABLE:
+			options = partitioned_table_reloptions(datum, false);
+			break;
 		case RELKIND_VIEW:
 			options = view_reloptions(datum, false);
 			break;
@@ -1538,6 +1540,25 @@ default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 }
 
 /*
+ * Option parser for partitioned tables
+ */
+bytea *
+partitioned_table_reloptions(Datum reloptions, bool validate)
+{
+	int			numoptions;
+
+	/*
+	 * Since there are no options for partitioned tables for now, we just do
+	 * validation to report an incorrect option error and leave.
+	 */
+
+	if (validate)
+		parseRelOptions(reloptions, validate, RELOPT_KIND_PARTITIONED,
+						&numoptions);
+	return NULL;
+}
+
+/*
  * Option parser for views
  */
 bytea *
@@ -1593,9 +1614,6 @@ heap_reloptions(char relkind, Datum reloptions, bool validate)
 		case RELKIND_RELATION:
 		case RELKIND_MATVIEW:
 			return default_reloptions(reloptions, validate, RELOPT_KIND_HEAP);
-		case RELKIND_PARTITIONED_TABLE:
-			return default_reloptions(reloptions, validate,
-	  RELOPT_KIND_PARTITIONED);
 		default:
 			/* other relkinds are not supported */
 			return NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 05593f3..f44e865 100644
--- 
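
Should partitioned tables later grow a real option (the message above names
fillfactor and autovacuum settings as likely candidates),
partitioned_table_reloptions() would presumably switch to the usual
parse/allocate/fill stanza. A sketch under that assumption, with an invented
struct and option name:

/*
 * Hypothetical future shape of partitioned_table_reloptions(), shown only to
 * illustrate the stanza it would grow once RELOPT_KIND_PARTITIONED has real
 * members.  "some_option" and PartitionedOptions are invented names, and the
 * option is assumed to have been registered for RELOPT_KIND_PARTITIONED
 * (e.g. in the boolRelOpts table).
 */
typedef struct PartitionedOptions
{
	int32		vl_len_;		/* varlena header (do not touch directly!) */
	bool		some_option;	/* hypothetical option, not in the patch */
} PartitionedOptions;

bytea *
partitioned_table_reloptions(Datum reloptions, bool validate)
{
	static const relopt_parse_elt tab[] = {
		{"some_option", RELOPT_TYPE_BOOL,
		 offsetof(PartitionedOptions, some_option)}
	};
	relopt_value *options;
	PartitionedOptions *rdopts;
	int			numoptions;

	options = parseRelOptions(reloptions, validate, RELOPT_KIND_PARTITIONED,
							  &numoptions);

	/* if none set, we're done */
	if (numoptions == 0)
		return NULL;

	rdopts = allocateReloptStruct(sizeof(PartitionedOptions), options,
								  numoptions);
	fillRelOptions((void *) rdopts, sizeof(PartitionedOptions), options,
				   numoptions, validate, tab, lengthof(tab));
	pfree(options);

	return (bytea *) rdopts;
}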

Re: [PATCH] Do not use StdRdOptions in Access Methods

2019-10-11 Thread Nikolay Shaplov
In a message of Thursday, 10 October 2019 17:17:30 MSK, Amit Langote wrote:

> I read comments that Tomas left at:
> https://www.postgresql.org/message-id/20190727173841.7ypzo4xuzizvijge%40development
> 
> I'd like to join Michael in reiterating one point from Tomas' review.
> I think the patch can go further in trying to make the code in this
> area more maintainable.
> 
> For example, even without this patch, the following stanza is repeated
> in many places:
> 
> options = parseRelOptions(reloptions, validate, foo_relopt_kind,
>   &numoptions);
> rdopts = allocateReloptStruct(sizeof(FooOptions), options, numoptions);
> fillRelOptions((void *) rdopts, sizeof(FooOptions), options, numoptions,
>   validate, foo_relopt_tab, lengthof(foo_relopt_tab));
> return (bytea *) rdopts;
> 
> and this patch adds a few more instances as it's adding more Options structs.
> 
> I think it wouldn't be hard to encapsulate the above stanza in a new
> public function in reloptions.c and call it from the various places
> that now have the above code.
The reloptions code is very historical and illogical. I also noticed that
these lines are repeated several times, and maybe it would be better to put
them into reloptions.c. But could anybody clearly explain what they are doing,
just so we can give the function a proper name? I understand what they are
doing, but I am unable to give a short and clear explanation.
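
For reference, the stanza quoted above could be wrapped in a single helper in
reloptions.c along these lines (the name and exact signature are hypothetical,
sketched here only to illustrate Amit's suggestion, not something already in
the tree):

void *
build_custom_reloptions(Datum reloptions, bool validate, relopt_kind kind,
						Size relopt_struct_size,
						const relopt_parse_elt *relopt_elems,
						int num_relopt_elems)
{
	relopt_value *options;
	void	   *rdopts;
	int			numoptions;

	/* parse the options belonging to the given relopt kind */
	options = parseRelOptions(reloptions, validate, kind, &numoptions);

	/* if none set, we're done */
	if (numoptions == 0)
		return NULL;

	/* allocate the caller's struct and copy the parsed values into it */
	rdopts = allocateReloptStruct(relopt_struct_size, options, numoptions);
	fillRelOptions(rdopts, relopt_struct_size, options, numoptions,
				   validate, relopt_elems, num_relopt_elems);
	pfree(options);

	return rdopts;
}

A caller such as the FooOptions parser above would then shrink to a single
return statement passing its relopt kind, struct size, and parse table.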

I am planning to rewrite this part completely, so that none of these lines are
repeated. I had a proposal (you can see it at
https://commitfest.postgresql.org/15/992/), but people on the list told me that
the patch was too complex and that I should commit it part by part.

So that is what I am doing now, and I am almost done. But to provide a clear
and logical patch that introduces my concept, I need StdRdOptions to be divided
into separate structures, at least for the access methods. And I need it to be
done as simply as possible, because the rest of the code is going to be
rewritten anyway.

That is why I want to follow these steps: make the code uniform, and then
improve it. I have the improvement in my pocket, but I need uniform code before
revealing it.

If you think it is absolutely necessary to put these lines into one function,
I will do it. It will not make the code clearer, I guess, and I see no benefit,
but I can do it. I would prefer to avoid it if possible, at least at this step.

-- 
Software Developer: https://www.upwork.com/freelancers/~014a87e140ff02c0da
Body-oriented Therapist: https://vk.com/nataraj_rebalancing  (Russian)




Re: Collation versioning

2019-10-11 Thread Christoph Berg
Re: Thomas Munro 2019-10-11 

> While testing pg_upgrade scenarios I noticed that initdb-created
> collations' versions are not preserved, potentially losing track of
> information about corrupted indexes.  That's a preexisting condition,
> and probably well understood, but it made me realise that if we switch
> to per-database object (for example: per index) version tracking as
> mentioned up-thread, then we should probably preserve that information
> across pg_upgrade.

That would make a lot of sense, yes. The whole problem is already complex
enough; if we add another "but if you use pg_upgrade, you still need
to do the tracking manually" footnote, users will be very confused.

Christoph




Re: Collation versioning

2019-10-11 Thread Thomas Munro
On Thu, Oct 10, 2019 at 8:38 AM Peter Eisentraut
 wrote:
> On 2019-10-09 21:19, Peter Eisentraut wrote:
> > On 2019-10-03 14:25, Thomas Munro wrote:
> >>> The only open question on this patch was whether it's a good version to
> >>> use.  I think based on subsequent discussions, there was the realization
> >>> that this is the best we can do and better than nothing.
> >>>
> >>> In the patch, I would skip the configure test and just do
> >>>
> >>> #ifdef __GLIBC__
> >>>
> >>> directly.
> >>
> >> Ok.  Here's one like that.
> >
> > Pushed that.
>
> Actually, I had to revert that because pg_dump and pg_upgrade tests need
> to be updated, but that seems doable.

[Returning from a couple of weeks mostly away from computers]

Right, sorry about that.  Here is a new version that fixes that test,
and also gives credit to Christoph for the idea in the commit message.
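
For anyone following along, the idea on glibc systems is simply to record the
C library's runtime version as the collation version, so that a glibc upgrade
(and therefore a possible collation-order change) becomes detectable. A small
standalone illustration of where that value comes from (not the patch itself):

/*
 * Print the value a glibc-based collation version would be derived from.
 * gnu_get_libc_version() returns the runtime glibc version, e.g. "2.28".
 */
#include <stdio.h>

#ifdef __GLIBC__
#include <gnu/libc-version.h>
#endif

int
main(void)
{
#ifdef __GLIBC__
	printf("collation version candidate: %s\n", gnu_get_libc_version());
#else
	printf("not glibc: nothing to report\n");
#endif
	return 0;
}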

While testing pg_upgrade scenarios I noticed that initdb-created
collations' versions are not preserved, potentially losing track of
information about corrupted indexes.  That's a preexisting condition,
and probably well understood, but it made me realise that if we switch
to per-database object (for example: per index) version tracking as
mentioned up-thread, then we should probably preserve that information
across pg_upgrade.


0001-Use-libc-version-as-a-collation-version-on-glibc--v2.patch
Description: Binary data


Re: dropping column prevented due to inherited index

2019-10-11 Thread Michael Paquier
On Fri, Oct 11, 2019 at 04:23:51PM +0900, Amit Langote wrote:
> Thanks.  The index on b is not really necessary for testing because it
> remains unaffected, but maybe it's fine.

That's on purpose.  Any more comments?
--
Michael


signature.asc
Description: PGP signature


Re: BRIN index which is much faster never chosen by planner

2019-10-11 Thread David Rowley
On Fri, 11 Oct 2019 at 17:48, Michael Lewis  wrote:
>
> On Thu, Oct 10, 2019 at 6:22 PM David Rowley  
> wrote:
>> The planner will just estimate the selectivity of now() - interval '10
>> days' by using DEFAULT_INEQ_SEL, which is 0.3333333333333333, so it
>> thinks it'll get 1/3rd of the table.  Using 'now' will allow the
>> planner to look up actual statistics on that column, which will likely
>> give a much better estimate, which by the looks of it will likely
>> result in one of those BRIN indexes being used.
>
>
> This surprised me a bit, and would have significant implications. I tested a 
> few different tables in our system and get the same row count estimate with 
> either WHERE condition. Perhaps I am missing a critical piece of what you 
> said.
>
> explain
> select * from charges where posted_on > now() - interval '10 days';
>
> explain
> select * from charges where posted_on > 'now'::timestamptz  - interval '10 
> days';

You're right. On looking more closely at the code, I see it uses
estimate_expression_value(), which performs additional constant
folding of expressions for selectivity-estimation purposes only. It does
end up calling the now() function and folding the now() - interval '10
days' expression into a Const.

The header comment for that function reads:

* estimate_expression_value
 *
 * This function attempts to estimate the value of an expression for
 * planning purposes.  It is in essence a more aggressive version of
 * eval_const_expressions(): we will perform constant reductions that are
 * not necessarily 100% safe, but are reasonable for estimation purposes.

So I take back what I said about using 'now'::timestamptz instead of now().
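
To make that concrete, the pattern inside the selectivity code looks roughly
like the following; this is a simplified sketch of in-server logic, not an
exact excerpt, and it assumes the usual planner context (PlannerInfo *root and
the non-Var operand of the clause) is at hand:

	Node	   *folded;

	/*
	 * Planner-only constant folding: this is what reduces
	 * "now() - interval '10 days'" to a plain Const for estimation.
	 */
	folded = estimate_expression_value(root, other_operand);

	if (folded != NULL && IsA(folded, Const) &&
		!((Const *) folded)->constisnull)
	{
		/* use ((Const *) folded)->constvalue against the column's stats */
	}
	else
	{
		/* no usable constant: fall back to a default such as DEFAULT_INEQ_SEL */
	}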

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: PostgreSQL, C-Extension, calling other Functions

2019-10-11 Thread Pavel Stehule
Hi

On Fri, 11 Oct 2019 at 10:15, Stefan Wolf  wrote:

> I've written some PostgreSQL C-Extensions (for the first time...) and they
> work as expected.
>
> But now I want to call other functions from inside the C-Extensions (but
> not
> via SPI_execute),
> for example "regexp_match()" or from other extensions like PostGIS
> "ST_POINT" etc...
>
> I think "fmgr" is the key - but I didn't find any examples.
>

Search for DirectFunctionCall; for example:

	PG_RETURN_NUMERIC(
		DirectFunctionCall1(float8_numeric, Float8GetDatumFast(result)));
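
For a fuller picture, a minimal extension function built around that pattern
might look like the sketch below (the function name is invented). Note that
DirectFunctionCall only works when the target's C symbol is visible to your
module, as it is for built-ins; for functions living in another extension,
such as PostGIS's ST_Point, you would normally look up the function's OID and
go through the fmgr interface (e.g. OidFunctionCall2()) instead.

/*
 * Sketch: call a built-in (the float8 -> numeric conversion) from a
 * C extension via DirectFunctionCall1.
 */
#include "postgres.h"

#include "fmgr.h"
#include "utils/fmgrprotos.h"	/* declares float8_numeric() */

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_to_numeric);

Datum
my_to_numeric(PG_FUNCTION_ARGS)
{
	float8		arg = PG_GETARG_FLOAT8(0);

	PG_RETURN_DATUM(DirectFunctionCall1(float8_numeric,
										Float8GetDatum(arg)));
}

On the SQL level this would be exposed with the usual CREATE FUNCTION ...
LANGUAGE C declaration pointing at the shared library.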

Regards

Pavel


> Greetings from Berlin
> -Stefan Wolf-
>
>
>
>
>
>


PostgreSQL, C-Extension, calling other Functions

2019-10-11 Thread Stefan Wolf
I've written some PostgreSQL C-Extensions (for the first time...) and they
work as expected.

But now I want to call other functions from inside the C-Extensions (but not
via SPI_execute),
for example "regexp_match()" or from other extensions like PostGIS
"ST_POINT" etc...

I think "fmgr" is the key - but I didn't find any examples.

Greetings from Berlin
-Stefan Wolf-







Re: maintenance_work_mem used by Vacuum

2019-10-11 Thread Amit Kapila
On Fri, Oct 11, 2019 at 7:36 AM Masahiko Sawada  wrote:
>
> On Thu, Oct 10, 2019 at 6:38 PM Amit Kapila  wrote:
> >
> >
> > It seems you want to say about commit id
> > a1b395b6a26ae80cde17fdfd2def8d351872f399.
>
> Yeah thanks.
>
> >  I wonder why they have not
> > changed it to gin_pending_list_limit (at that time
> > pending_list_cleanup_size) in that commit itself?  I think if we want
> > to use gin_pending_list_limit in this context then we can replace both
> > work_mem and maintenance_work_mem with gin_pending_list_limit.
>
> Hmm, as far as I can see from the discussion, no one mentioned
> maintenance_work_mem. Perhaps we just overlooked it?
>

It is possible, but we can't be certain until those people confirm it.

> I also didn't know
> that.
>
> I also think we can replace at least the work_mem for cleanup of
> pending list with gin_pending_list_limit. In the following comment in
> ginfast.c,
>

Agreed, but that won't address the original problem for which we started
this thread.

> /*
>  * Force pending list cleanup when it becomes too long. And,
>  * ginInsertCleanup could take significant amount of time, so we prefer to
>  * call it when it can do all the work in a single collection cycle. In
>  * non-vacuum mode, it shouldn't require maintenance_work_mem, so fire it
>  * while pending list is still small enough to fit into
>  * gin_pending_list_limit.
>  *
>  * ginInsertCleanup() should not be called inside our CRIT_SECTION.
>  */
> cleanupSize = GinGetPendingListCleanupSize(index);
> if (metadata->nPendingPages * GIN_PAGE_FREESIZE > cleanupSize * 1024L)
> needCleanup = true;
>
> ISTM the gin_pending_list_limit in the above comment corresponds to
> the following code in ginfast.c,
>
> /*
>  * We are called from regular insert and if we see concurrent cleanup
>  * just exit in hope that concurrent process will clean up pending
>  * list.
>  */
> if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
> return;
> workMemory = work_mem;
>
> If work_mem is smaller than gin_pending_list_limit, the pending list
> cleanup would go against the intention of the above comment, which
> prefers to do all the work in a single collection cycle while the pending
> list is still small enough to fit into gin_pending_list_limit.
>

That's right, but OTOH, if the user specifies gin_pending_list_limit
as an option during CREATE INDEX with a value greater than the GUC
gin_pending_list_limit, we will face the same problem.  It seems to me
that we haven't thought enough about memory usage during GIN pending
list cleanup, and I don't want to make a decision to change it on my
own.  So unless the original author/committer or some other people who
have worked in this area share their opinion, we can leave it as it is
and make the parallel vacuum patch adapt to it.
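
To put rough numbers on the mismatch described above, here is a
back-of-the-envelope illustration; the settings, and the assumption that the
pending data has to be flushed in work_mem-sized batches, are made up for
illustration only:

#include <stdio.h>

int
main(void)
{
	long		gin_pending_list_limit_kb = 4096;	/* cleanup trigger size */
	long		work_mem_kb = 1024;		/* per-cycle memory budget */
	long		cycles;

	/* cleanup starts near the limit but can only hold about work_mem at a time */
	cycles = (gin_pending_list_limit_kb + work_mem_kb - 1) / work_mem_kb;
	printf("pending list ~%ld kB, budget %ld kB -> roughly %ld collection cycles\n",
		   gin_pending_list_limit_kb, work_mem_kb, cycles);
	return 0;
}

With these made-up settings, a cleanup that was meant to finish in a single
collection cycle ends up taking about four.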

Suggestions from others are welcome.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Standby accepts recovery_target_timeline setting?

2019-10-11 Thread Fujii Masao
On Thu, Oct 10, 2019 at 5:52 AM Peter Eisentraut
 wrote:
>
> On 2019-09-30 03:48, Fujii Masao wrote:
> > Also we need to do the same thing for other recovery options like
> > restore_command. Attached is the patch which makes crash recovery
> > ignore restore_command and recovery_end_command.
>
> This patch looks correct to me.

Thanks for the review! I committed the patch.

> Do we need to handle archive_cleanup_command?  Perhaps not.

No, because archive_cleanup_command is basically executed by the checkpointer,
and it cannot be invoked in the crash recovery case.

> A check in recoveryApplyDelay() might be necessary.

Yes! We are discussing this issue at
https://www.postgresql.org/message-id/CAHGQGwEyD6HdZLfdWc+95g=vqfpr4zql4n+yhxqggegjasv...@mail.gmail.com

Regards,

-- 
Fujii Masao




Re: dropping column prevented due to inherited index

2019-10-11 Thread Amit Langote
On Fri, Oct 11, 2019 at 4:16 PM Michael Paquier  wrote:
> On Thu, Oct 10, 2019 at 05:28:02PM +0900, Amit Langote wrote:
> > Actually, the code initializes it on the first call (recursing is
> > false) and asserts that it must have been already initialized in a
> > recursive (recursing is true) call.
>
> I have actually kept your simplified version.
>
> > Okay, sure.  Maybe it's better to write the comment inside the if
> > block, because if recursing is true, we don't drop yet.
>
> Sure.
>
> > Thoughts on suggestion to expand the test case?
>
> No objections to that, so done as per the attached.  Does that match
> what you were thinking about?

Thanks.  The index on b is not really necessary for testing because it
remains unaffected, but maybe it's fine.

Regards,
Amit




Re: dropping column prevented due to inherited index

2019-10-11 Thread Michael Paquier
On Thu, Oct 10, 2019 at 05:28:02PM +0900, Amit Langote wrote:
> Actually, the code initializes it on the first call (recursing is
> false) and asserts that it must have been already initialized in a
> recursive (recursing is true) call.

I have actually kept your simplified version.

> Okay, sure.  Maybe it's better to write the comment inside the if
> block, because if recursing is true, we don't drop yet.

Sure.

> Thoughts on suggestion to expand the test case?

No objections to that, so done as per the attached.  Does that match
what you were thinking about?
--
Michael
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ba8f4459f3..74c0a00a2f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -401,7 +401,8 @@ static void ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool rec
 static ObjectAddress ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 	  DropBehavior behavior,
 	  bool recurse, bool recursing,
-	  bool missing_ok, LOCKMODE lockmode);
+	  bool missing_ok, LOCKMODE lockmode,
+	  ObjectAddresses *addrs);
 static ObjectAddress ATExecAddIndex(AlteredTableInfo *tab, Relation rel,
 	IndexStmt *stmt, bool is_rebuild, LOCKMODE lockmode);
 static ObjectAddress ATExecAddConstraint(List **wqueue,
@@ -4273,12 +4274,14 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
 		case AT_DropColumn:		/* DROP COLUMN */
 			address = ATExecDropColumn(wqueue, rel, cmd->name,
 	   cmd->behavior, false, false,
-	   cmd->missing_ok, lockmode);
+	   cmd->missing_ok, lockmode,
+	   NULL);
 			break;
 		case AT_DropColumnRecurse:	/* DROP COLUMN with recursion */
 			address = ATExecDropColumn(wqueue, rel, cmd->name,
 	   cmd->behavior, true, false,
-	   cmd->missing_ok, lockmode);
+	   cmd->missing_ok, lockmode,
+	   NULL);
 			break;
 		case AT_AddIndex:		/* ADD INDEX */
 			address = ATExecAddIndex(tab, rel, (IndexStmt *) cmd->def, false,
@@ -4893,8 +4896,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 			/*
 			 * Set all columns in the new slot to NULL initially, to ensure
-			 * columns added as part of the rewrite are initialized to
-			 * NULL. That is necessary as tab->newvals will not contain an
+			 * columns added as part of the rewrite are initialized to NULL.
+			 * That is necessary as tab->newvals will not contain an
 			 * expression for columns with a NULL default, e.g. when adding a
 			 * column without a default together with a column with a default
 			 * requiring an actual rewrite.
@@ -7013,13 +7016,22 @@ ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 }
 
 /*
- * Return value is the address of the dropped column.
+ * Drops column 'colName' from relation 'rel' and returns the address of the
+ * dropped column.  The column is also dropped (or marked as no longer
+ * inherited from relation) from the relation's inheritance children, if any.
+ *
+ * In the recursive invocations for inheritance child relations, instead of
+ * dropping the column directly (if to be dropped at all), its object address
+ * is added to 'addrs', which must be non-NULL in such invocations.  All
+ * columns are dropped at the same time after all the children have been
+ * checked recursively.
  */
 static ObjectAddress
 ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
  DropBehavior behavior,
  bool recurse, bool recursing,
- bool missing_ok, LOCKMODE lockmode)
+ bool missing_ok, LOCKMODE lockmode,
+ ObjectAddresses *addrs)
 {
 	HeapTuple	tuple;
 	Form_pg_attribute targetatt;
@@ -7032,6 +7044,11 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 	if (recursing)
 		ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE);
 
+	/* Initialize addrs on the first invocation. */
+	Assert(!recursing || addrs != NULL);
+	if (!recursing)
+		addrs = new_object_addresses();
+
 	/*
 	 * get the number of the attribute
 	 */
@@ -7144,7 +7161,7 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 	/* Time to delete this child column, too */
 	ATExecDropColumn(wqueue, childrel, colName,
 	 behavior, true, true,
-	 false, lockmode);
+	 false, lockmode, addrs);
 }
 else
 {
@@ -7180,14 +7197,22 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName,
 		table_close(attr_rel, RowExclusiveLock);
 	}
 
-	/*
-	 * Perform the actual column deletion
-	 */
+	/* Add object to delete */
 	object.classId = RelationRelationId;
 	object.objectId = RelationGetRelid(rel);
 	object.objectSubId = attnum;
+	add_exact_object_address(&object, addrs);
 
-	performDeletion(&object, behavior, 0);
+	if (!recursing)
+	{
+		/*
+		 * The recursing lookup for inherited child relations is done.  All
+		 * the child relations have been scanned and the object addresses of

Re: adding partitioned tables to publications

2019-10-11 Thread Amit Langote
Hello Rafia,

Great to hear that you are interested in this feature and thanks for
testing the patch.

On Thu, Oct 10, 2019 at 10:13 PM Rafia Sabih  wrote:
> Lately I have been exploring the logical replication feature of PostgreSQL,
> and I find this addition to its scope for partitioned tables a useful one.
>
> In order to understand the workings of your patch a bit more, I performed an
> experiment wherein I created a partitioned table with several children and a
> default partition at the publisher side, and normal tables with the same
> names as the parent, children, and default partition at the subscriber side.
> Next I established the logical replication connection, and to my surprise the
> data was successfully replicated from the partitioned tables to the normal
> tables, and then this error filled the logs:
> LOG:  logical replication table synchronization worker for subscription 
> "my_subscription", table "parent" has started
> ERROR:  table "public.parent" not found on publisher
>
> Here "parent" is the name of the partitioned table at the publisher side, and
> it is present as a normal table at the subscriber side as well.  Which is
> understandable: it is trying to find a normal table of the same name but
> couldn't find one; maybe it should not worry about that now, also if not at
> replication time.
>
> Please let me know if this is something expected, because in my opinion it
> is not desirable; there should be some check of the table type for
> replication.  This wasn't important till now, maybe because only normal
> tables were to be replicated, but with the extension of the scope of logical
> replication to more objects such checks would be helpful.

Thanks for sharing this case.  I hadn't considered it, but you're
right that it should be handled sensibly.  I have fixed the table sync
code to handle this case properly.  Could you please check your case
with the attached updated patch?

> On a separate note, I was thinking that for partitioned tables, wouldn't it
> be cleaner to have something like: you create only the partitioned table at
> the subscriber, and then when logical replication starts it creates the
> child tables accordingly?  Or would that be too much for the future...?

Hmm, we'd first need to build the "automatic partition creation"
feature to consider doing something like that.  I'm sure you'd agree
that we should undertake that project separately from this tiny
logical replication usability improvement project. :)

Thanks again.

Regards,
Amit


v2-0001-Support-adding-partitioned-tables-to-publication.patch
Description: Binary data