Re: Resolving the python 2 -> python 3 mess

2020-03-25 Thread Marco Atzeri

On 17.02.2020 at 17:49, Tom Lane wrote:

We've had multiple previous discussions of $SUBJECT (eg [1][2]),
without any resolution of what to do exactly.  Thinking about this
some more, I had an idea that I don't think has been discussed.
To wit:

1. On platforms where Python 2.x is still supported, recommend that
packagers continue to build both plpython2 and plpython3, same as now.



Is there any documentation on how to build both?
The INSTALL file gives no hint.

And how to build for multiple 3.x versions?

Currently, for the Cygwin package, I am building only 2.x, which is
clearly not a good situation.
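For what it's worth, packagers typically handle this by configuring and building the plpython module once per interpreter. A rough sketch, with illustrative paths and elided configure flags, not taken from any official documentation:

```
./configure PYTHON=/usr/bin/python2 ...
make -C src/pl/plpython all install
make -C src/pl/plpython clean
./configure PYTHON=/usr/bin/python3 ...
make -C src/pl/plpython all install
```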

Regards
Marco








Re: pgbench - add \aset to store results of a combined query

2020-03-25 Thread Michael Paquier
On Fri, Nov 29, 2019 at 10:34:05AM +0100, Fabien COELHO wrote:
>> It seems to me that there is no point to have the variable aset in
>> Command because this structure includes already MetaCommand, so the
>> information is duplicated. [...] Perhaps I am missing something?
> 
> Yep. ISTM that you are missing that aset is not an independent meta command
> like most others but really changes the state of the previous SQL command,
> so that it needs to be stored into that with some additional fields. This is
> the same with "gset" which is tagged by a non-null "varprefix".
> 
> So I cannot remove the "aset" field.

It still sounds strange to me to add a new variable to this structure
when it is possible to track the exact same thing with an existing part
of a Command.  Alternatively, would it make sense to split Command into
two different structures, with an extra structure used after parsing,
for clarity?

>> And I would suggest to change readCommandResponse() to use a MetaCommand
>> in argument.
> 
> MetaCommand is not enough: we need varprefix, and then distinguishing
> between aset and gset. Although this last point can be done with a
> MetaCommand, ISTM that a bool is_aset is clear and good enough. It is
> possible to switch if you insist on it, but I do not think it is desirable.
> 
> Attached v4 removes an unwanted rebased comment duplication and does minor
> changes while re-reading the code.

Well, it still looks cleaner to me to just assign the meta field
properly within ParseScript(), and you get the same result.  And it is 
also possible to use "meta" to do more sanity checks around META_GSET
for some code paths.  So I'd actually find the addition of a new
argument using a meta command within readCommandResponse() cleaner.

- * varprefix   SQL commands terminated with \gset have this set
+ * varprefix   SQL commands terminated with \gset or \aset have this set
Nit from v4: varprefix can be used for \gset and \aset, and the
comment was not updated.
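For context on that comment, a sketch of how the two meta-commands differ in a pgbench script (variable names illustrative): \gset keeps only the last sub-query's result of a compound query, while \aset keeps the results of all sub-queries.

```
-- only :b is set; the first sub-query's result is discarded
SELECT 1 AS a \; SELECT 2 AS b \gset
-- both :a and :b are set
SELECT 1 AS a \; SELECT 2 AS b \aset
```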

+   /* coldly skip empty result under \aset */
+   if (ntuples <= 0)
+   break;
Shouldn't this check happen only under \aset?  And it seems to me that
this code path is not taken, so a test would be nice.

-   } while (res);
+   } while (res != NULL);
Useless diff.

The conflicts came from the switch to the common logging of
src/common/.  That was no big deal to solve.
--
Michael
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index b8864c6ae5..ebb4eb2c5c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -480,6 +480,7 @@ typedef enum MetaCommand
 	META_SHELL,	/* \shell */
 	META_SLEEP,	/* \sleep */
 	META_GSET,	/* \gset */
+	META_ASET,	/* \aset */
 	META_IF,	/* \if */
 	META_ELIF,	/* \elif */
 	META_ELSE,	/* \else */
@@ -509,7 +510,7 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"};
  * argv			Command arguments, the first of which is the command or SQL
  *string itself.  For SQL commands, after post-processing
  *argv[0] is the same as 'lines' with variables substituted.
- * varprefix 	SQL commands terminated with \gset have this set
+ * varprefix 	SQL commands terminated with \gset or \aset have this set
  *to a non NULL value.  If nonempty, it's used to prefix the
  *variable name that receives the value.
  * expr			Parsed expression, if needed.
@@ -2489,6 +2490,8 @@ getMetaCommand(const char *cmd)
 		mc = META_ENDIF;
 	else if (pg_strcasecmp(cmd, "gset") == 0)
 		mc = META_GSET;
+	else if (pg_strcasecmp(cmd, "aset") == 0)
+		mc = META_ASET;
 	else
 		mc = META_NONE;
 	return mc;
@@ -2711,12 +2714,12 @@ sendCommand(CState *st, Command *command)
  * Process query response from the backend.
  *
  * If varprefix is not NULL, it's the variable name prefix where to store
- * the results of the *last* command.
+ * the results of the *last* command (gset) or *all* commands (aset).
  *
  * Returns true if everything is A-OK, false if any error occurs.
  */
 static bool
-readCommandResponse(CState *st, char *varprefix)
+readCommandResponse(CState *st, char *varprefix, MetaCommand meta)
 {
 	PGresult   *res;
 	PGresult   *next_res;
@@ -2736,7 +2739,7 @@ readCommandResponse(CState *st, char *varprefix)
 		{
 			case PGRES_COMMAND_OK:	/* non-SELECT commands */
 			case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */
-if (is_last && varprefix != NULL)
+if (is_last && varprefix != NULL && meta != META_ASET)
 {
 	pg_log_error("client %d script %d command %d query %d: expected one row, got %d",
  st->id, st->use_file, st->command, qrynum, 0);
@@ -2745,16 +2748,26 @@ readCommandResponse(CState *st, char *varprefix)
 break;
 
 			case PGRES_TUPLES_OK:
-if (is_last && varprefix != NULL)
+if (varprefix != NULL && (is_last || meta == META_ASET))
 {
-	if (PQntuples(res) != 1)
+	int		ntuples	= PQntuples(res);
+
+	if (meta != META_ASET && ntuples != 1)
 	{
+		/* under 

Re: Some problems of recovery conflict wait events

2020-03-25 Thread Masahiko Sawada
On Tue, 24 Mar 2020 at 17:04, Fujii Masao  wrote:
>
>
>
> On 2020/03/05 20:16, Fujii Masao wrote:
> >
> >
> > On 2020/03/05 16:58, Masahiko Sawada wrote:
> >> On Wed, 4 Mar 2020 at 15:21, Fujii Masao  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 2020/03/04 14:31, Masahiko Sawada wrote:
>  On Wed, 4 Mar 2020 at 13:48, Fujii Masao  
>  wrote:
> >
> >
> >
> > On 2020/03/04 13:27, Michael Paquier wrote:
> >> On Wed, Mar 04, 2020 at 01:13:19PM +0900, Masahiko Sawada wrote:
> >>> Yeah, so 0001 patch sets existing wait events to recovery conflict
> >>> resolution. For instance, it sets (PG_WAIT_LOCK | LOCKTAG_TRANSACTION)
> >>> to the recovery conflict on a snapshot. 0003 patch improves these wait
> >>> events by adding the new type of wait event such as
> >>> WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT. Therefore 0001 (and 0002) patch
> >>> is the fix for existing versions and 0003 patch is an improvement for
> >>> only PG13. Did you mean even 0001 patch doesn't fit for back-patching?
> >
> > Yes, it looks like an improvement rather than a bug fix.
> >
> 
>  Okay, understood.
> 
> >> I got my eyes on this patch set.  The full patch set is in my opinion
> >> just a set of improvements, and not bug fixes, so I would refrain from
> >> back-patching.
> >
> > I think that the issue (i.e., "waiting" is reported twice needlessly
> > in PS display) that 0002 patch tries to fix is a bug. So it should be
> > fixed even in the back branches.
> 
>  So we need only two patches: one fixes the process title issue and the
>  other improves the wait events. I've attached the updated patches.
> >>>
> >>> Thanks for updating the patches! I started reading 0001 patch.
> >>
> >> Thank you for reviewing this patch.
> >>
> >>>
> >>> -   /*
> >>> -    * Report via ps if we have been waiting for more than 500 msec
> >>> -    * (should that be configurable?)
> >>> -    */
> >>> -   if (update_process_title && new_status == NULL &&
> >>> -       TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
> >>> -                                  500))
> >>>
> >>> The patch changes ResolveRecoveryConflictWithSnapshot() and
> >>> ResolveRecoveryConflictWithTablespace() so that they always add
> >>> "waiting" into the PS display, whether wait is really necessary or not.
> >>> But isn't it better to display "waiting" in PS basically when wait is
> >>> necessary, like originally ResolveRecoveryConflictWithVirtualXIDs()
> >>> does as the above?
> >>
> >> You're right. Will fix it.
> >>
> >>>
> >>>ResolveRecoveryConflictWithDatabase(Oid dbid)
> >>>{
> >>> +   char*new_status = NULL;
> >>> +
> >>> +   /* Report via ps we are waiting */
> >>> +   new_status = set_process_title_waiting();
> >>>
> >>> In ResolveRecoveryConflictWithDatabase(), there seems to be no need to
> >>> display "waiting" in PS because no wait occurs when recovery conflict
> >>> with database happens.
> >>
> >> Isn't the startup process waiting for other backends to terminate?
> >
> > Yeah, you're right. I agree that "waiting" should be reported in this case.
>
> On second thought, in the recovery conflict case, "waiting" should be reported
> while waiting for the specified delay (e.g., by max_standby_streaming_delay)
> until the conflict is resolved. So IMO reporting "waiting" in the case of
> recovery conflict with buffer pin, snapshot, lock and tablespace seems valid,
> because they are user-visible "expected" wait time.
>
> However, in the case of recovery conflict with database, the recovery
> basically doesn't wait at all and just terminates the conflicting sessions
> immediately. Then the recovery waits for all those sessions to be terminated,
> but that wait time is basically small and should not be user-visible.
> If that wait time becomes very long because of an unresponsive backend, ISTM
> that LOG or WARNING should be logged instead of reporting something in
> PS display. I'm not sure if that logging is really necessary now, though.
> Therefore, I'm thinking that "waiting" doesn't need to be reported in the case
> of recovery conflict with database. Thoughts?

Fair enough. The longer wait time for conflicts with a database is not
user-expected behaviour, so logging would be better. I'd like to just
drop the change around ResolveRecoveryConflictWithDatabase() from the
patch. Logging a LOG or WARNING message for recovery conflicts on a
database could be a separate patch and needs more discussion.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-03-25 Thread Justin Pryzby
On Thu, Mar 26, 2020 at 09:50:53AM +0530, Amit Kapila wrote:
> > > after count_nondeletable_pages, and then revert it back to
> > > VACUUM_ERRCB_PHASE_SCAN_HEAP phase and the number of blocks of
> > > relation before truncation, after RelationTruncate(). It can be
> > > repeated until no more truncating can be done. Why do we need to
> > > revert back to the scan heap phase? If we can use
> > > vacrelstats->nonempty_pages in the error context message as the
> > > remaining blocks after truncation I think we can update callback
> > > arguments once at the beginning of lazy_truncate_heap() and don't
> > > revert to the previous phase, and pop the error context after exiting.
> >
> > Perhaps.  We need to "revert back" for the vacuum phases, which can be
> > called multiple times, but we don't need to do that here.
> 
> Yeah, but I think it would be better if we are consistent because we have
> no control over what the caller of the function intends to do after
> finishing the current phase.  I think we can add some comments where
> we set up the context (in heap_vacuum_rel) like below so that the idea
> is more clear.
> 
> "The idea is to set up an error context callback to display additional
> information with any error during vacuum.  During different phases of
> vacuum (heap scan, heap vacuum, index vacuum, index clean up, heap
> truncate), we update the error context callback to display appropriate
> information.
> 
> Note that different phases of vacuum overlap with each other, so once
> a particular phase is over, we need to revert back to the old phase to
> keep the phase information up-to-date."

Seems fine.  Rather than saying "different phases", I would say:
"The index vacuum and heap vacuum phases may be called multiple times in the
middle of the heap scan phase."

But actually I think the concern is not that we unnecessarily "revert back
to the old phase" but that we do it in a *loop*.  I agree it doesn't make
sense to go back and forth between "scanning heap" and "truncating".  So I
think we should either remove the "revert back", or otherwise put it
after/outside the "while" loop and change the "return" paths to use "break".

-- 
Justin




Re: [Proposal] Global temporary tables

2020-03-25 Thread Prabhat Sahu
> Sorry, I introduced this bug in my refactoring.
> It's been fixed.
>
> Wenjing
>
Hi Wenjing,

This patch (gtt_v21_pg13.patch) does not apply on PG HEAD; I suspect you
have prepared it on top of some previous commit.
Could you please rebase the patch so that it applies on HEAD?

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: error context for vacuum to include block number

2020-03-25 Thread Amit Kapila
On Wed, Mar 25, 2020 at 6:11 PM Justin Pryzby  wrote:
>
> On Wed, Mar 25, 2020 at 09:27:44PM +0900, Masahiko Sawada wrote:
> > On Wed, 25 Mar 2020 at 20:24, Amit Kapila  wrote:
> > >
> > > On Wed, Mar 25, 2020 at 3:42 PM Justin Pryzby  
> > > wrote:
> > > >
> > > > Attached patch addressing these.
> > > >
> > >
> > > Thanks, you forgot to remove the below declaration which I have
> > > removed in attached.
> > >
> > > @@ -724,20 +758,20 @@ lazy_scan_heap(Relation onerel, VacuumParams
> > > *params, LVRelStats *vacrelstats,
> > >   PROGRESS_VACUUM_MAX_DEAD_TUPLES
> > >   };
> > >   int64 initprog_val[3];
> > > + ErrorContextCallback errcallback;
> > >
> > > Apart from this, I have ran pgindent and now I think it is in good
> > > shape.  Do you have any other comments?  Sawada-San, can you also
> > > check the attached patch and let me know if you have any additional
> > > comments.
> > >
> >
> > Thank you for updating the patch! I have a question about the following code:
> >
> > +/* Update error traceback information */
> > +olderrcbarg = *vacrelstats;
> > +update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_TRUNCATE,
> > +  vacrelstats->nonempty_pages, NULL, false);
> > +
> >  /*
> >   * Scan backwards from the end to verify that the end pages actually
> >   * contain no tuples.  This is *necessary*, not optional, because
> >   * other backends could have added tuples to these pages whilst we
> >   * were vacuuming.
> >   */
> >  new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
> > +vacrelstats->blkno = new_rel_pages;
> >
> >  if (new_rel_pages >= old_rel_pages)
> >  {
> >  /* can't do anything after all */
> >  UnlockRelation(onerel, AccessExclusiveLock);
> >  return;
> >  }
> >
> >  /*
> >   * Okay to truncate.
> >   */
> >  RelationTruncate(onerel, new_rel_pages);
> >
> > +/* Revert back to the old phase information for error traceback */
> > +update_vacuum_error_cbarg(vacrelstats,
> > +  olderrcbarg.phase,
> > +  olderrcbarg.blkno,
> > +  olderrcbarg.indname,
> > +  true);
> >
> > vacrelstats->nonempty_pages is the last non-empty block while
> > new_rel_pages, the result of count_nondeletable_pages(), is the number
> > of blocks that we can truncate to in this attempt. Therefore
> > vacrelstats->nonempty_pages <= new_rel_pages. This means that we set a
> > lower block number to arguments and then set a higher block number
> > after count_nondeletable_pages, and then revert it back to
> > VACUUM_ERRCB_PHASE_SCAN_HEAP phase and the number of blocks of
> > relation before truncation, after RelationTruncate(). It can be
> > repeated until no more truncating can be done. Why do we need to
> > revert back to the scan heap phase? If we can use
> > vacrelstats->nonempty_pages in the error context message as the
> > remaining blocks after truncation I think we can update callback
> > arguments once at the beginning of lazy_truncate_heap() and don't
> > revert to the previous phase, and pop the error context after exiting.
>
> Perhaps.  We need to "revert back" for the vacuum phases, which can be called
> multiple times, but we don't need to do that here.
>

Yeah, but I think it would be better if we are consistent because we have
no control over what the caller of the function intends to do after
finishing the current phase.  I think we can add some comments where
we set up the context (in heap_vacuum_rel) like below so that the idea
is more clear.

"The idea is to set up an error context callback to display additional
information with any error during vacuum.  During different phases of
vacuum (heap scan, heap vacuum, index vacuum, index clean up, heap
truncate), we update the error context callback to display appropriate
information.

Note that different phases of vacuum overlap with each other, so once
a particular phase is over, we need to revert back to the old phase to
keep the phase information up-to-date."

What do you think?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pgbench - add \aset to store results of a combined query

2020-03-25 Thread Michael Paquier
On Tue, Mar 24, 2020 at 11:04:45AM -0400, David Steele wrote:
> On 11/29/19 4:34 AM, Fabien COELHO wrote:
>> MetaCommand is not enough: we need varprefix, and then distinguishing
>> between aset and gset. Although this last point can be done with a
>> MetaCommand, ISTM that a bool is_aset is clear and good enough. It is
>> possible to switch if you insist on it, but I do not think it is
>> desirable.
> 
> Michael, do you agree with Fabien's comments?

Thanks for the reminder.  I am following up with Fabien's comments.
--
Michael


signature.asc
Description: PGP signature


Re: Refactor compile-time assertion checks for C/C++

2020-03-25 Thread Michael Paquier
On Mon, Mar 23, 2020 at 12:22:48AM -0400, Tom Lane wrote:
> Yeah, the comment needs an update; but if we have four implementations
> then it ought to describe each of them, IMO.

I got an idea as per the attached.  Perhaps you have a better idea?
--
Michael
diff --git a/src/include/c.h b/src/include/c.h
index 831c89f473..6731861da5 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -836,9 +836,13 @@ extern void ExceptionalCondition(const char *conditionName,
  * The macro StaticAssertDecl() is suitable for use at file scope (outside of
  * any function).
  *
- * Otherwise we fall back on a kluge that assumes the compiler will complain
- * about a negative width for a struct bit-field.  This will not include a
- * helpful error message, but it beats not getting an error at all.
+ * On recent C++ compilers, we can use standard static_assert() for all
+ * types of static assertions.
+ *
+ * Otherwise, for the C and C++ fallback implementations, we fall back on
+ * a kluge that assumes the compiler will complain about a negative width
+ * for a struct bit-field.  This will not include a helpful error message,
+ * but it beats not getting an error at all.
  */
 #ifndef __cplusplus
 #ifdef HAVE__STATIC_ASSERT


signature.asc
Description: PGP signature


Re: [Proposal] Global temporary tables

2020-03-25 Thread wjzeng


> On 25 March 2020, at 22:16, tushar wrote:
> 
> On 3/17/20 9:15 AM, 曾文旌(义从) wrote:
>> Please check global_temporary_table_v20-pg13.patch
> 
> There is a typo in the error message
> 
> postgres=# create global temp table test(a int )
> with(on_commit_delete_rows=true) on commit delete rows;
> ERROR:  can not defeine global temp table with on commit and with clause at same time
> postgres=#
Thank you for pointing that out.
I fixed it in global_temporary_table_v21-pg13.patch.


Wenjing


> 
> -- 
> regards,tushar
> EnterpriseDB  https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company



smime.p7s
Description: S/MIME cryptographic signature


Re: [Proposal] Global temporary tables

2020-03-25 Thread wjzeng


> On 25 March 2020, at 18:44, tushar wrote:
> 
> On 3/17/20 9:15 AM, 曾文旌(义从) wrote:
>> reindex GTT is already supported
>> 
>> Please check global_temporary_table_v20-pg13.patch
>> 
> Please refer this scenario -
> 
> 
> postgres=# create global temp table co(n int) ;
> CREATE TABLE
> 
> postgres=# create index fff on co(n);
> CREATE INDEX
> 
> Case 1-
> postgres=# reindex table  co;
> REINDEX
> 
> Case -2
> postgres=# reindex database postgres ;
> WARNING:  global temp table "public.co" skip reindexed
I fixed it in global_temporary_table_v21-pg13.patch.


Wenjing


> REINDEX
> postgres=#
> 
> Case 2 should work similarly to Case 1.
> 
> -- 
> regards,tushar
> EnterpriseDB  https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company



smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] Implement INSERT SET syntax

2020-03-25 Thread Gareth Palmer


> On 26/03/2020, at 3:17 AM, Tom Lane  wrote:
> 
> Peter Eisentraut  writes:
>> On 2020-03-24 18:57, Tom Lane wrote:
>>> No doubt that's all fixable, but the realization that some cases of
>>> this syntax are*not*  just syntactic sugar for standards-compliant
>>> syntax is giving me pause.  Do we really want to get out front of
>>> the SQL committee on extending INSERT in an incompatible way?
> 
>> What is the additional functionality that we are considering adding here?
>> The thread started out proposing a more convenient syntax, but it seems 
>> to go deeper now and perhaps not everyone is following.
> 
> AIUI, the proposal is to allow INSERT commands to be written
> using an UPDATE-like syntax, for example
> 
> INSERT INTO table SET col1 = value1, col2 = value2, ... [ FROM ... ]
> 
> where everything after FROM is the same as it is in SELECT.  My initial
> belief was that this was strictly equivalent to what you could do with
> a target-column-names list in standard INSERT, viz
> 
> INSERT INTO table (col1, col2, ...) VALUES (value1, value2, ...);
> or
> INSERT INTO table (col1, col2, ...) SELECT value1, value2, ... FROM ...
> 
> but it's arguably more legible/convenient because the column names
> are written next to their values.
> 
> However, that rewriting falls down for certain multiassignment cases
> where you have a row source that can't be decomposed, such as my
> example
> 
> INSERT INTO table SET (col1, col2) = (SELECT value1, value2 FROM ...),
> ... [ FROM ... ]
> 
> So, just as we found for UPDATE, multiassignment syntax is strictly
> stronger than plain column-by-column assignment.
> 
> There are some secondary issues about which variants of this syntax
> will allow a column value to be written as DEFAULT, and perhaps
> about whether set-returning functions work.  But the major point
> right now is about whether it's possible to rewrite to standard
> syntax.
> 
>   regards, tom lane

Attached is v6 of the patch.

As per the suggestion, the SET clause list is checked for any
MultiAssignRef nodes, and an error is reported if any are found.

For example, the rule definition that previously caused a parser crash
would now produce the following error:

vagrant=> create rule r1 as on insert to foo do instead
vagrant-> insert into bar set (f1,f2,f3) = (select f1,f2,f3 from foo);
ERROR:  INSERT SET syntax does not support multi-assignment of columns.
LINE 2: insert into bar set (f1,f2,f3) = (select f1,f2,f3 from foo);
^
HINT:  Specify the column assignments separately.


Requiring a FROM clause was a way to differentiate between an INSERT
with VALUES() which does allow DEFAULT and an INSERT with SELECT which
does not.

The idea was that it would help the user understand that they were writing
a different type of query and that DEFAULT would not be allowed in that
context.

To show what it would look like without that requirement, I have removed
it from the v6 patch. The first example below works, but the second one
will generate an error.

INSERT INTO t SET c1 = 1 WHERE true;
INSERT INTO t SET c1 = DEFAULT WHERE true;



insert-set-v6.patch
Description: Binary data




Re: [Proposal] Global temporary tables

2020-03-25 Thread wjzeng


> On 25 March 2020, at 20:52, Prabhat Sahu wrote:
> 
> Hi All,
> 
> Please check the behavior of a GTT having a column with the "SERIAL"
> datatype and a column whose default value comes from a "SEQUENCE", as below:
> 
> Session1:
> postgres=# create sequence gtt_c3_seq;
> CREATE SEQUENCE
> postgres=# create global temporary table gtt(c1 int, c2 serial, c3 int 
> default nextval('gtt_c3_seq') not null) on commit preserve rows;
> CREATE TABLE
> 
> -- The structure of columns c2 and c3 is similar:
> postgres=# \d+ gtt
>                                                  Table "public.gtt"
>  Column |  Type   | Collation | Nullable |             Default             | Storage | Stats target | Description
> --------+---------+-----------+----------+---------------------------------+---------+--------------+-------------
>  c1     | integer |           |          |                                 | plain   |              |
>  c2     | integer |           | not null | nextval('gtt_c2_seq'::regclass) | plain   |              |
>  c3     | integer |           | not null | nextval('gtt_c3_seq'::regclass) | plain   |              |
> Access method: heap
> Options: on_commit_delete_rows=false
> 
> postgres=# insert into gtt select generate_series(1,3);
> INSERT 0 3
> postgres=# select * from gtt;
>  c1 | c2 | c3 
> ++
>   1 |  1 |  1
>   2 |  2 |  2
>   3 |  3 |  3
> (3 rows)
> 
> Session2:
> postgres=# insert into gtt select generate_series(1,3);
> INSERT 0 3
> postgres=# select * from gtt;
>  c1 | c2 | c3 
> ++
>   1 |  1 |  4
>   2 |  2 |  5
>   3 |  3 |  6
> (3 rows)
> 
> Kindly let me know, Is this behavior expected?
> 
> -- 

postgres=# \d+
                                 List of relations
 Schema |    Name    |   Type   |    Owner    | Persistence |    Size    | Description
--------+------------+----------+-------------+-------------+------------+-------------
 public | gtt        | table    | wenjing.zwj | session     | 8192 bytes |
 public | gtt_c2_seq | sequence | wenjing.zwj | session     | 8192 bytes |
 public | gtt_c3_seq | sequence | wenjing.zwj | permanent   | 8192 bytes |
(3 rows)

This is expected.
A GTT's sequence has the same session-level persistence as the GTT itself,
so gtt_c2_seq is independent in each session.
gtt_c3_seq is a classic permanent sequence.



Wenjing


> With Regards,
> Prabhat Kumar Sahu
> EnterpriseDB: http://www.enterprisedb.com 



smime.p7s
Description: S/MIME cryptographic signature


Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-25 Thread Fujii Masao




On 2020/03/26 2:17, Sergei Kornilov wrote:

Hello


WAL usage patch [1] increments this version to 1_4 instead of 1_8.
I *guess* that's because previously this version was maintained
independently from pg_stat_statements' version. For example,
pg_stat_statements 1.4 seems to have used PGSS_V1_3.


As far as I remember, this was my proposed change in a review a year ago.
I think that a clear analogy between the extension version and the function
name is clearer than numbering PGSS_V sequentially, independently of the
extension versions.
For pgss 1.4 it was fine to use PGSS_V1_3, because there were no changes in
pg_stat_statements_internal.
pg_stat_statements 1.3 will call pg_stat_statements_1_3;
pg_stat_statements 1.4 - 1.7 will still call pg_stat_statements_1_3. In my
opinion, this is the correct naming, since we did not need a new function.
But if pg_stat_statements 1.8 were to call pg_stat_statements_1_4, wouldn't
that be confusing?


Yeah, I withdraw my comment and agree that 1_8 looks less confusing.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-25 Thread Andy Fan
Because I replied to the old thread, cfbot ran a test based on the old
patch on that thread.  I have detached the old thread from the commitfest
and am replying to this email again to wake up Mr. cfbot with the right
information.


v2-0001-Maintain-the-uniqueness-of-a-Query-from-bottom-to.patch
Description: Binary data


Re: AllocSetEstimateChunkSpace()

2020-03-25 Thread Jeff Davis
On Wed, 2020-03-25 at 15:09 -0700, Andres Freund wrote:
> > The overhead comes from two places: (a) the chunk header; and (b) the
> > round-up-to-nearest-power-of-two behavior.
> > 
> > Combining the group tuple and per-group states only saves the overhead
> > from (a); it does nothing to help (b), which is often bigger.
> 
> Hm? It very well can help with (b), since the round-up only happens once
> now? I guess you could argue that it's possible that afterwards we'd be
> more likely to end up in a bigger size class, and thus have roughly the
> same amount of waste due to rounding? But I don't think that's all that
> convincing.

Why is that not convincing? Each size class is double the previous one,
so piling double the memory into a single allocation doesn't help at
all. Two palloc(20)s turn into two 32-byte chunks; one palloc(40) turns
into a 64-byte chunk.

You might get lucky and the second chunk will fit in the wasted space
from the first chunk; but when it does cross a boundary, it will be a
bigger boundary and wipe out any efficiencies that you gained
previously.

Of course it depends on the exact distribution. But I don't see any
reason why we'd expect a distribution that would be favorable to
combining chunks together (except to avoid the chunk header, problem
(a)).

> I still, as I mentioned on the call, suspect that the right thing here
> is to use an allocation strategy that suffers from neither (a) nor (b)
> (for tuple and pergroup) and that has faster allocations too. That then
> also would have the consequence that we don't need to care about
> per-alloc overhead anymore (be it a or b).

It might make sense for the next release but I'm wary of more churn in
nodeAgg.c at this stage. It's not a trivial change because the
different allocations happen in different places and combining them
would be tricky.

> > So do you generally favor the EstimateMemoryChunkSpace() patch (that
> > works for all context types)? Or do you have another suggestion? Or do
> > you think we should just do nothing?
> 
> I think I'm increasingly leaning towards either using a constant
> overhead factor, or just getting rid of all memory context
> overhead. There's clearly no obviously correct design for the "chunk
> size" functions, and not having overhead is better than ~correctly
> estimating it.

Trying to actually eliminate the overhead sounds like v14 to me.

I believe the formula for AllocSet overhead can be approximated with:
   16 + size/4

That would probably be better than a constant but a little hacky. I can
do that as a spot fix if this patch proves unpopular.

Regards,
Jeff Davis






Re: allow online change primary_conninfo

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-24, Alvaro Herrera wrote:

> I think the startup sighup handler should be in startup.c, not xlog.c,
> which has enough random code already.  We can add an accessor in xlog.c
> to allow changing the walrcv status flag, to be called from the signal
> handler.

... as in the attached version.

Sergei included LOG messages to indicate which setting has been changed.
I put these in "#if 0" for now, but I'm thinking of removing these
altogether; we already have LOG messages when a setting changes value,
per ProcessConfigFileInternal(); the log sequence looks like this, taken
from tmp_check/log/001_stream_rep_standby_2.log after running the tests:

2020-03-25 20:54:19.413 -03 [17493] LOG:  received SIGHUP, reloading configuration files
2020-03-25 20:54:19.415 -03 [17493] LOG:  parameter "primary_slot_name" changed to "standby_2"
2020-03-25 20:54:19.415 -03 [17493] LOG:  parameter "wal_receiver_status_interval" changed to "1"
2020-03-25 20:54:19.422 -03 [17569] LOG:  started streaming WAL from primary at 0/300 on timeline 1
2020-03-25 20:54:19.426 -03 [17494] LOG:  wal receiver process shutdown requested
2020-03-25 20:54:19.426 -03 [17569] FATAL:  terminating walreceiver process due to administrator command
2020-03-25 20:54:19.433 -03 [17572] LOG:  started streaming WAL from primary at 0/300 on timeline 1

which looks sufficient.  Maybe we can reword that new message, say "wal
receiver process shutdown forced by parameter change".  Not sure if we
can or should adjust the FATAL line; probably not worth the trouble.

Thoughts welcome.  I'm thinking of getting this applied shortly.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 0bd43454e2981d9551b12424902d537674d48e31 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 25 Mar 2020 20:48:44 -0300
Subject: [PATCH v10 1/3] Set wal_receiver_create_temp_slot PGC_POSTMASTER

and off by default
---
 doc/src/sgml/config.sgml  |  6 +--
 src/backend/access/transam/xlog.c |  3 +-
 src/backend/replication/walreceiver.c | 46 +++
 src/backend/replication/walreceiverfuncs.c| 28 +--
 src/backend/utils/misc/guc.c  |  4 +-
 src/backend/utils/misc/postgresql.conf.sample |  4 +-
 src/include/access/xlog.h |  1 +
 src/include/replication/walreceiver.h |  4 +-
 8 files changed, 50 insertions(+), 46 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 355b408b0a..b5de6c3237 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4163,11 +4163,7 @@ ANY num_sync ( ).
-The default is on.  The only reason to turn this off would be if the
-remote instance is currently out of available replication slots.  This
-parameter can only be set in the postgresql.conf
-file or on the server command line.  Changes only take effect when the
-WAL receiver process starts a new connection.
+The default is off.  This parameter can only be set at server start.

   
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..19df1d5a80 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -290,6 +290,7 @@ bool		StandbyModeRequested = false;
 char	   *PrimaryConnInfo = NULL;
 char	   *PrimarySlotName = NULL;
 char	   *PromoteTriggerFile = NULL;
+bool		wal_receiver_create_temp_slot = false;
 
 /* are we currently in standby mode? */
 bool		StandbyMode = false;
@@ -11975,7 +11976,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		}
 		curFileTLI = tli;
 		RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-			 PrimarySlotName);
+			 PrimarySlotName, wal_receiver_create_temp_slot);
 		receivedUpto = 0;
 	}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..779d19f1c1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -15,6 +15,12 @@
  * WalRcv->receivedUpto variable in shared memory, to inform the startup
  * process of how far it can proceed with XLOG replay.
  *
+ * WAL receivers cannot directly load GUC parameters used when establishing
+ * their connection to the primary, and rely on parameter values passed down
+ * by the startup process when WAL streaming is requested.  This applies,
+ * for example, to the replication slot creation and the connection string
+ * to use for the connection with the primary.
+ *
  * If the primary server ends streaming, but doesn't disconnect, walreceiver
  * goes into "waiting" mode, and waits for the startup process to give new
  * instructions. The startup process will treat that the same as
@@ -74,7 +80,6 @@
 
 
 /* GUC variables */
-bool		wal_receiver_create_temp_slot;
 int			

Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Tom Lane
Andres Freund  writes:
> I'm still confused by the comment I was reacting to - the code
> explicitly is about creating the *shared* resowner:

Right, this is because of the choice I mentioned earlier about creating
this resowner the same way as the one for the inline case.  I guess the
comments could go into more detail.  Or we could make the parentage
different for the two cases, but I don't like that much.

regards, tom lane




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 19:15:28 -0400, Tom Lane wrote:
> >> The comment is there because the regression tests fall over if you try
> >> to do it the other way :-(.  The failure I saw was specific to a
> >> transaction being done in a DO block, and maybe we could handle that
> >> differently from the case for a normal procedure; but it seemed better
> >> to me to make them the same.
> 
> > I'm still confused as to why it actually fixes the issue. I feel we should
> > at least understand what's going on before committing.
> 
> I do understand the issue.  If you make the simple-resowner a child
> of TopTransactionResourceOwner, it vanishes at commit --- but
> plpgsql_inline_handler has still got a pointer to it, which it'll try
> to free afterwards, if the commit was inside the DO block.

I was confused why it fixes that, because:

>  void
>  plpgsql_xact_cb(XactEvent event, void *arg)
>  {
>   /*
>* If we are doing a clean transaction shutdown, free the EState (so 
> that
> -  * any remaining resources will be released correctly). In an abort, we
> +  * any remaining resources will be released correctly).  In an abort, we
>* expect the regular abort recovery procedures to release everything of
> -  * interest.
> +  * interest.  The resowner has to be explicitly released in both cases,
> +  * though, since it's not a child of TopTransactionResourceOwner.
>*/
>   if (event == XACT_EVENT_COMMIT || event == XACT_EVENT_PREPARE)
>   {
> @@ -8288,11 +8413,17 @@ plpgsql_xact_cb(XactEvent event, void *arg)
>   if (shared_simple_eval_estate)
>   FreeExecutorState(shared_simple_eval_estate);
>   shared_simple_eval_estate = NULL;
> +		if (shared_simple_eval_resowner)
> +			plpgsql_free_simple_resowner(shared_simple_eval_resowner);
> +		shared_simple_eval_resowner = NULL;
>   }
>   else if (event == XACT_EVENT_ABORT)
>   {
>   simple_econtext_stack = NULL;
>   shared_simple_eval_estate = NULL;
> +		if (shared_simple_eval_resowner)
> +			plpgsql_free_simple_resowner(shared_simple_eval_resowner);
> +		shared_simple_eval_resowner = NULL;
>   }
>  }

will lead to shared_simple_eval_resowner being deleted before
TopTransactionResourceOwner is deleted:

static void
CommitTransaction(void)
...
CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_COMMIT
  : XACT_EVENT_COMMIT);

ResourceOwnerRelease(TopTransactionResourceOwner,
 RESOURCE_RELEASE_BEFORE_LOCKS,
 true, true);

What I missed is that the inline handler will not use
shared_simple_eval_resowner, but instead use the function local
simple_eval_resowner. Which I had not realized before.


I'm still confused by the comment I was reacting to - the code
explicitly is about creating the *shared* resowner:

> +  * Likewise for the simple-expression resource owner.  (Note: it'd be
> +  * safer to create this as a child of TopTransactionResourceOwner; but
> +  * right now that causes issues in transaction-spanning procedures, so
> +  * make it standalone.)
> +  */
> + if (estate->simple_eval_resowner == NULL)
> + {
> + if (shared_simple_eval_resowner == NULL)
> +			shared_simple_eval_resowner =
> +				ResourceOwnerCreate(NULL, "PL/pgSQL simple expressions");
> + estate->simple_eval_resowner = shared_simple_eval_resowner;
> + }

which, afaict, will always be deleted before TopTransactionResourceOwner
goes away?


Greetings,

Andres Freund




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-25 Thread Justin Pryzby
On Thu, Mar 12, 2020 at 08:08:46PM +0300, Alexey Kondratov wrote:
> On 09.03.2020 23:04, Justin Pryzby wrote:
>> On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote:
>>> On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote:
>>> tests for that.  (I'm including your v8 untouched in hopes of not messing up
>>> the cfbot).  My fixes avoid an issue if you try to REINDEX onto pg_default, 
>>> I
>>> think due to moving system toast indexes.
>> I was able to avoid this issue by adding a call to GetNewRelFileNode, even
>> though that's already called by RelationSetNewRelfilenode().  Not sure if
>> there's a better way, or if it's worth Alexey's v3 patch which added a
>> tablespace param to RelationSetNewRelfilenode.
> 
> Do you have any understanding of what exactly causes this error? I have
> tried to debug it a little bit, but still cannot figure out why we need this
> extra GetNewRelFileNode() call and a mechanism how it helps.

The PANIC is from the smgr hash table, which couldn't find an entry it expected.  My
very tentative understanding is that smgr is prepared to handle a *relation*
which is dropped/recreated multiple times in a transaction, but it's *not*
prepared to deal with a given RelFileNode(Backend) being dropped/recreated,
since that's used as a hash key.
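The distinction can be sketched with a toy model (this is a hypothetical miniature, not the actual smgr code): entries are keyed on the physical file node, a drop is merely recorded as pending until commit, so recreating the same relfilenode inside the transaction lands on the stale entry again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy smgr-style cache: keyed on the relfilenode, not the relation OID.
 * The unlink of a dropped file is deferred to commit, so the entry for a
 * dropped file node lingers with drop_pending set. */
#define NSLOTS 16

typedef struct FileEntry
{
	uint32_t	relnode;		/* hash key: the relfilenode */
	bool		in_use;
	bool		drop_pending;	/* unlink deferred until commit */
} FileEntry;

static FileEntry cache[NSLOTS];

static FileEntry *
smgr_lookup(uint32_t relnode, bool create)
{
	FileEntry  *e = &cache[relnode % NSLOTS];

	if (!e->in_use && create)
	{
		memset(e, 0, sizeof(*e));
		e->relnode = relnode;
		e->in_use = true;
	}
	if (e->in_use && e->relnode == relnode)
		return e;
	return NULL;
}
```

A relation that is dropped and recreated with a *new* relfilenode gets a fresh entry; reusing the *same* relfilenode within the transaction instead finds the old entry with its stale state.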

I revisited it and solved it in a somewhat nicer way.  It's still not clear to
me if there's an issue with your original way of adding a tablespace parameter
to RelationSetNewRelfilenode().

> Probably you mean v4 patch. Yes, interestingly, if we do everything at once
> inside RelationSetNewRelfilenode(), then there is no issue at all with:

Yes, I meant to say "worth revisiting the v4 patch".

> Many thanks for you review and fixups! There are some inconsistencies like
> mentions of SET TABLESPACE in error messages and so on. I am going to
> refactor and include your fixes 0003-0004 into 0001 and 0002, but keep 0005
> separated for now, since this part requires more understanding IMO (and
> comparison with v4 implementation).

I'd suggest to keep the CLUSTER/VACUUM FULL separate from REINDEX, in case
Michael or someone else wants to progress one but cannot commit to both.  But
probably we should plan to finish this in July.

-- 
Justin
>From 8b5dde0fe77ae46aba35d3bb9ce026ca217a6e3e Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Sat, 29 Feb 2020 15:35:27 +0300
Subject: [PATCH v11 1/5] Allow REINDEX to change tablespace

>From d2b7a5fa2e11601759b47af0c142a7824ef907a2 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 30 Dec 2019 20:00:37 +0300
Subject: [PATCH v8] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml | 24 +-
 src/backend/catalog/index.c   | 75 --
 src/backend/commands/cluster.c|  2 +-
 src/backend/commands/indexcmds.c  | 96 ---
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/nodes/copyfuncs.c |  1 +
 src/backend/nodes/equalfuncs.c|  1 +
 src/backend/parser/gram.y | 14 ++--
 src/backend/tcop/utility.c|  6 +-
 src/bin/psql/tab-complete.c   |  6 ++
 src/include/catalog/index.h   |  7 +-
 src/include/commands/defrem.h |  6 +-
 src/include/nodes/parsenodes.h|  1 +
 src/test/regress/input/tablespace.source  | 49 
 src/test/regress/output/tablespace.source | 66 
 15 files changed, 323 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c54a7c420d..0628c94bb1 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
+REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ]
 
 where option can be one of:
 
@@ -174,6 +174,28 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies the tablespace where all rebuilt indexes will be created.
+  It cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   
+
+   
+new_tablespace
+
+ 
+  The name of the specific tablespace to store rebuilt indexes.
+ 
+
+   
+

 VERBOSE
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d81bc3cbc..e6c06e6a06 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1235,9 +1235,13 @@ index_create(Relation heapRelation,
  * Create concurrently an index based on the 

Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Tom Lane
Andres Freund  writes:
> On 2020-03-25 17:51:50 -0400, Tom Lane wrote:
>> Perhaps, but I'm not sure that either of those functions represent
>> material overhead in cases besides this one.

> That's not huge, but also not nothing.

I see.  So maybe worth the trouble --- but still, seems like material for
a separate patch.

>>> Would it make sense to instead compute this as we go when building a
>>> valid CachedPlanSource?

>> I'm inclined to think not, because it'd just be overhead for other
>> users of cached plans.

> Even if we make RevalidateCachedQuery take advantage of the simpler
> tests when possible?

I'm not convinced that any real optimization is practical once you
allow tables in the query.  You then have to check the RLS-active
flags in some form, and the existing tests are not *that* expensive
as long as the answer is "no".  At best I think you might be reducing
two or three simple tests to one.

Also, the reason why this is interesting at all for plpgsql simple
expressions is that the cost of these checks, simple as they are,
is a noticeable fraction of the total time to do a simple expression.
That's not going to be the case for queries involving table access.

>> The comment is there because the regression tests fall over if you try
>> to do it the other way :-(.  The failure I saw was specific to a
>> transaction being done in a DO block, and maybe we could handle that
>> differently from the case for a normal procedure; but it seemed better
>> to me to make them the same.

> I'm still confused as to why it actually fixes the issue. I feel we should
> at least understand what's going on before committing.

I do understand the issue.  If you make the simple-resowner a child
of TopTransactionResourceOwner, it vanishes at commit --- but
plpgsql_inline_handler has still got a pointer to it, which it'll try
to free afterwards, if the commit was inside the DO block.
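The lifetime hazard being described can be modeled with a toy owner tree (a sketch only, not the real ResourceOwner API): releasing a parent releases its children too, so a caller that saved a pointer to a child created under the transaction's owner would, after an intra-procedure COMMIT, hold a pointer to an already-released owner.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy owner tree: release cascades from parent to children, mimicking
 * how committing releases everything under TopTransactionResourceOwner. */
typedef struct Owner
{
	bool		released;
	struct Owner *firstchild;
	struct Owner *nextsibling;
} Owner;

static void
release_owner(Owner *owner)
{
	for (Owner *c = owner->firstchild; c != NULL; c = c->nextsibling)
		release_owner(c);
	owner->released = true;
}
```

Making the simple-expression owner standalone (parentless) sidesteps this: commit no longer touches it, so the saved pointer stays valid until plpgsql frees it explicitly.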

What's not entirely clear to me is why this in exec_stmt_commit

@@ -4825,6 +4845,7 @@ exec_stmt_commit(PLpgSQL_execstate *estate, 
PLpgSQL_stmt_commit *stmt)
}
 
estate->simple_eval_estate = NULL;
+   estate->simple_eval_resowner = NULL;
plpgsql_create_econtext(estate);
 
return PLPGSQL_RC_OK;

is okay --- it avoids having a dangling pointer, sure, but if we're inside
a DO block won't plpgsql_create_econtext create a simple_eval_estate (and,
now, simple_eval_resowner) with the wrong properties?  But that's a
pre-existing question, and maybe Peter got it right and there's no
problem.

>> Doubt it.  On the other hand, as the code stands it's certain that the
>> resowner contains nothing but plancache pins (while I was writing the
>> patch it wasn't entirely clear that that would hold).  So we could
>> drop the two unnecessary calls.  There are assertions in
>> ResourceOwnerDelete that would fire if we somehow missed releasing
>> anything, so it doesn't seem like much of a maintenance hazard.

> One could even argue that that's a nice crosscheck: Due to the later
> release it'd not actually be correct to just add "arbitrary" things to
> that resowner.

OK, I'll change that.

>> (1) Not given the existing set of uses of the push/pop capability, which
>> so far as I can see is *only* CREATE SCHEMA.

> I do recall that there were issues with SET search_path in functions
> causing noticable slowdowns...

Yeah, possibly that could be improved, but that seems outside the scope of
this patch.

>> (2) as this is written, it's totally unsafe for the generation counter
>> ever to back up; that risks false match detections later.

> I was just thinking of backing up the 'active generation' state. New
> generations would have to come from a separate 'next generation'
> counter.

Oh, I see.  Yeah, that could work, but there's no point until we have
push/pop calls that are actually interesting for performance.

regards, tom lane




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 17:51:50 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I wonder if it'd make sense to store the locks needed for
> > AcquirePlannerLocks/AcquireExecutorLocks in a better form.
> 
> Perhaps, but I'm not sure that either of those functions represent
> material overhead in cases besides this one.

For pgbench -M prepared -S, GetCachedPlan() and its children are 2.36% of
the time. 1.75% of the total is RevalidateCachedQuery(). 1.13% of that
in turn is LockAcquireExtended.

That's not huge, but also not nothing. And this includes client
roundtrips. So I assume it'd show up larger when executing actual
queries in a function, or when pipelining (which e.g. pgjdbc has on by
default).

If I do simple lookups from pgbench_accounts in a loop in plpgsql,
GetCachedPlan() is 4.43% and the LockAcquireExtended()'s called from
within are 1.46%.

So it's plausible that making this a more generic optimization would be
worthwhile.


> > Would it make sense to instead compute this as we go when building a
> > valid CachedPlanSource? If we make it a property of a is_valid
> > CachedPlanSource, we can assert that the plan is safe for use in
> > CachedPlanIsSimplyValid().
> 
> I'm inclined to think not, because it'd just be overhead for other
> users of cached plans.

Even if we make RevalidateCachedQuery take advantage of the simpler
tests when possible?  While there's plenty of cases where it'd not be
applicable, it seems likely that those wouldn't notice the small
slowdown either.



> >> /*
> >> +   * Likewise for the simple-expression resource owner.  (Note: it'd be
> >> +   * safer to create this as a child of TopTransactionResourceOwner; but
> >> +   * right now that causes issues in transaction-spanning procedures, so
> >> +   * make it standalone.)
> >> +   */
> 
> > Hm. I'm quite unfamiliar with this area of the code - so I'm likely just
> > missing something: Given that you're using a post xact cleanup hook to
> > release the resowner, I'm not quite sure I understand this comment. The
> > XACT_EVENT_ABORT/COMMIT callbacks are called before
> > TopTransactionResourceOwner is released, no?
> 
> The comment is there because the regression tests fall over if you try
> to do it the other way :-(.  The failure I saw was specific to a
> transaction being done in a DO block, and maybe we could handle that
> differently from the case for a normal procedure; but it seemed better
> to me to make them the same.

I'm still confused as to why it actually fixes the issue. I feel we should
at least understand what's going on before committing.


> >> +void
> >> +plpgsql_free_simple_resowner(ResourceOwner simple_eval_resowner)
> >> +{
> >> +  /*
> >> +   * At this writing, the only thing that could actually get released is
> >> +   * plancache refcounts; but we may as well do the full release protocol.
> 
> > Hm, any chance that the multiple resowner calls here could show up in a
> > profile? Probably not?
> 
> Doubt it.  On the other hand, as the code stands it's certain that the
> resowner contains nothing but plancache pins (while I was writing the
> patch it wasn't entirely clear that that would hold).  So we could
> drop the two unnecessary calls.  There are assertions in
> ResourceOwnerDelete that would fire if we somehow missed releasing
> anything, so it doesn't seem like much of a maintenance hazard.

One could even argue that that's a nice crosscheck: Due to the later
release it'd not actually be correct to just add "arbitrary" things to
that resowner.


> > Could it be worth optimizing the path generation logic so that a
> > push/pop of an override path restores the old generation?
> 
> (1) Not given the existing set of uses of the push/pop capability, which
> so far as I can see is *only* CREATE SCHEMA.

Oh. Well, then that'd be something for later.

I do recall that there were issues with SET search_path in functions
causing noticable slowdowns...


> It's not involved in any other manipulations of the search path.  And
> (2) as this is written, it's totally unsafe for the generation counter
> ever to back up; that risks false match detections later.

I was just thinking of backing up the 'active generation' state. New
generations would have to come from a separate 'next generation'
counter.
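The two-counter idea can be sketched as follows (a hypothetical illustration, not code from search_path handling): popping an override path restores the previously active generation, while fresh generations are always drawn from a monotonically increasing counter, so a generation number is never handed out twice and cached matches stay safe.

```c
#include <assert.h>
#include <stdint.h>

/* Separate "active" and "next" generation counters: active can go
 * backwards on pop, but a new generation always comes from next, which
 * only ever increases, so no generation number is reused. */
static uint64_t next_generation = 1;
static uint64_t active_generation = 1;

static uint64_t
push_override_path(void)
{
	uint64_t	saved = active_generation;

	active_generation = ++next_generation;	/* fresh, never-reused id */
	return saved;
}

static void
pop_override_path(uint64_t saved)
{
	active_generation = saved;	/* restore without decrementing next */
}
```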

Greetings,

Andres Freund




Re: AllocSetEstimateChunkSpace()

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 14:43:43 -0700, Jeff Davis wrote:
> On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> > As mention on im/call: I think this is mainly an argument for
> > combining
> > the group tuple & per-group state allocations. I'm kind of fine with
> > afterwards taking the allocator overhead into account.
> 
> The overhead comes from two places: (a) the chunk header; and (b) the
> round-up-to-nearest-power-of-two behavior.
> 
> Combining the group tuple and per-group states only saves the overhead
> from (a); it does nothing to help (b), which is often bigger.

Hm? It very well can help with b), since the round-up only happens once
now? I guess you could argue that it's possible that afterwards we'd be
more likely to end up in a bigger size class, and thus have roughly the
same amount of waste due to rounding? But I don't think that's all that
convincing.
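The rounding behavior under discussion can be sketched like this (assuming an 8-byte minimum chunk for illustration; this is not the exact aset.c logic): each request is rounded up to the next power of two, so the worst-case waste per chunk approaches 50% of the allocated size.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to the next power-of-two chunk size, starting from
 * an assumed 8-byte minimum.  The waste is the difference between the
 * returned chunk size and the original request. */
static size_t
round_up_pow2(size_t request)
{
	size_t		chunk = 8;

	while (chunk < request)
		chunk <<= 1;
	return chunk;
}
```

Combining two allocations rounds only once, but if the combined request lands just past a power of two, the joint waste can be comparable to the separate allocations' waste, which is the point being debated.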

I still, as I mentioned on the call, suspect that the right thing here
is to use an allocation strategy that suffers from neither a nor b (for
tuple and pergroup) and that has faster allocations too. That then also
would have the consequence that we don't need to care about per-alloc
overhead anymore (be it a or b).


> And it only saves that overhead when there *is* a per-group state
> (i.e. not for a DISTINCT query).

So?


> > I still don't buy that its useful to estimate the by-ref transition
> > value overhead. We just don't have anything even have close enough to
> > a
> > meaningful value to base this on. 
> 
> By-ref transition values aren't a primary motivation for me. I'm fine
> leaving that discussion separate if that's a sticking point. But if we
> do have a way to measure chunk overhead, I don't really see a reason
> not to use it for by-ref as well.

Well, my point is that it's pretty much pointless for by-ref types. The
size estimates, if they exist, are so inaccurate that we don't gain
anything by including it. As I said before, I think we'd be better off
initially assuming a higher transition space estimate.


> > Yea, the "context needs to exist" part sucks. I really don't want to
> > add
> > calls directly into AllocSet from more places though. And just
> > ignoring
> > the parameters to the context seems wrong too.
> 
> So do you generally favor the EstimateMemoryChunkSpace() patch (that
> works for all context types)? Or do you have another suggestion? Or do
> you think we should just do nothing?

I think I'm increasingly leaning towards either using a constant
overhead factor, or just getting rid of all memory context
overhead. There's clearly no obviously correct design for the "chunk
size" functions, and not having overhead is better than ~correctly
estimating it.

Greetings,

Andres Freund




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Tom Lane
Andres Freund  writes:
> I wonder if it'd make sense to store the locks needed for
> AcquirePlannerLocks/AcquireExecutorLocks in a better form.

Perhaps, but I'm not sure that either of those functions represent
material overhead in cases besides this one.

> Would it make sense to instead compute this as we go when building a
> valid CachedPlanSource? If we make it a property of a is_valid
> CachedPlanSource, we can assert that the plan is safe for use in
> CachedPlanIsSimplyValid().

I'm inclined to think not, because it'd just be overhead for other
users of cached plans.

> That's mighty subtle :/

Yeah :-(.  I don't like it that much, but I don't see an easy way to
do better, given the way that plpgsql manages its simple expressions.

>> /*
>> + * Likewise for the simple-expression resource owner.  (Note: it'd be
>> + * safer to create this as a child of TopTransactionResourceOwner; but
>> + * right now that causes issues in transaction-spanning procedures, so
>> + * make it standalone.)
>> + */

> Hm. I'm quite unfamiliar with this area of the code - so I'm likely just
> missing something: Given that you're using a post xact cleanup hook to
> release the resowner, I'm not quite sure I understand this comment. The
> XACT_EVENT_ABORT/COMMIT callbacks are called before
> TopTransactionResourceOwner is released, no?

The comment is there because the regression tests fall over if you try
to do it the other way :-(.  The failure I saw was specific to a
transaction being done in a DO block, and maybe we could handle that
differently from the case for a normal procedure; but it seemed better
to me to make them the same.

There's a separate question lurking under there, which is whether the
existing management of the simple-expression EState is right at all
for transaction-spanning DO blocks; frankly it smells a bit fishy to
me.  But looking into that did not seem in-scope for this patch.

>> +void
>> +plpgsql_free_simple_resowner(ResourceOwner simple_eval_resowner)
>> +{
>> +/*
>> + * At this writing, the only thing that could actually get released is
>> + * plancache refcounts; but we may as well do the full release protocol.

> Hm, any chance that the multiple resowner calls here could show up in a
> profile? Probably not?

Doubt it.  On the other hand, as the code stands it's certain that the
resowner contains nothing but plancache pins (while I was writing the
patch it wasn't entirely clear that that would hold).  So we could
drop the two unnecessary calls.  There are assertions in
ResourceOwnerDelete that would fire if we somehow missed releasing
anything, so it doesn't seem like much of a maintenance hazard.

>> + * We pass isCommit = false even when committing, to suppress
>> + * resource-leakage gripes, since we aren't bothering to release the
>> + * refcounts one-by-one.
>> + */

> That's a bit icky...

Agreed, and it's not like our practice elsewhere.  I thought about adding
a data structure that would track the set of held plancache pins outside
the resowner, but concluded that that'd just be pointless duplicative
overhead.

> Could it be worth optimizing the path generation logic so that a
> push/pop of an override path restores the old generation?

(1) Not given the existing set of uses of the push/pop capability, which
so far as I can see is *only* CREATE SCHEMA.  It's not involved in any
other manipulations of the search path.  And (2) as this is written, it's
totally unsafe for the generation counter ever to back up; that risks
false match detections later.

I appreciate the review!

regards, tom lane




Re: AllocSetEstimateChunkSpace()

2020-03-25 Thread Jeff Davis
On Wed, 2020-03-25 at 12:42 -0700, Andres Freund wrote:
> As mention on im/call: I think this is mainly an argument for
> combining
> the group tuple & per-group state allocations. I'm kind of fine with
> afterwards taking the allocator overhead into account.

The overhead comes from two places: (a) the chunk header; and (b) the
round-up-to-nearest-power-of-two behavior.

Combining the group tuple and per-group states only saves the overhead
from (a); it does nothing to help (b), which is often bigger. And it
only saves that overhead when there *is* a per-group state (i.e. not
for a DISTINCT query).

> I still don't buy that it's useful to estimate the by-ref transition
> value overhead. We just don't have anything even close enough to a
> meaningful value to base this on.

By-ref transition values aren't a primary motivation for me. I'm fine
leaving that discussion separate if that's a sticking point. But if we
do have a way to measure chunk overhead, I don't really see a reason
not to use it for by-ref as well.

> -1 to [AllocSet-specific] approach. I think it's architecturally the
> wrong direction to
> add more direct calls to functions within specific contexts.

OK.

> Yea, the "context needs to exist" part sucks. I really don't want to
> add
> calls directly into AllocSet from more places though. And just
> ignoring
> the parameters to the context seems wrong too.

So do you generally favor the EstimateMemoryChunkSpace() patch (that
works for all context types)? Or do you have another suggestion? Or do
you think we should just do nothing?

Regards,
Jeff Davis






Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-21 14:24:05 -0400, Tom Lane wrote:
> So while there's clearly something worth pursuing here, I do not like
> anything about the way it was done.  I think that the right way to
> think about this problem is "how can the plan cache provide a fast
> path for checking validity of simple-expression plans?".  And when you
> think about it that way, there's a pretty obvious answer: if the plan
> involves no table references, there's not going to be any locks that
> have to be taken before we can check the is_valid flag.  So we can
> have a fast path that skips AcquirePlannerLocks and
> AcquireExecutorLocks, which are a big part of the problem, and we can
> also bypass some of the other random checks that GetCachedPlan has to
> make, like whether RLS affects the plan.

That makes sense to me.

I wonder if it'd make sense to store the locks needed for
AcquirePlannerLocks/AcquireExecutorLocks in a better form. Not really
instead of your optimization, but to also address simple statements that
do reference a relation. If we stored all the locks for a plansource in
an array, it should get cheaper - and automatically implement the fast
path of skipping AcquirePlannerLocks/AcquireExecutorLocks when there's
no relations.
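The flat-array idea can be sketched as follows (a hypothetical shape, not an actual CachedPlanSource change): if the plan source remembered its required locks as an array of (relation, lockmode) pairs, revalidation would be a tight loop, and a zero-length array would give the "no relations, no locks" fast path for free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Imagined per-plan lock list: a flat array instead of walking the
 * query tree's range tables on every revalidation. */
typedef struct PlanLock
{
	uint32_t	relid;
	int			lockmode;
} PlanLock;

typedef struct PlanLocks
{
	size_t		nlocks;
	PlanLock	locks[8];
} PlanLocks;

/* Returns the number of locks taken; 0 means the fast path applied. */
static size_t
acquire_plan_locks(const PlanLocks *pl, void (*acquire) (uint32_t, int))
{
	for (size_t i = 0; i < pl->nlocks; i++)
		acquire(pl->locks[i].relid, pl->locks[i].lockmode);
	return pl->nlocks;
}
```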


> Another chunk of the issue is the constant acquisition and release of
> reference counts on the plan.  We can't really skip that (I suspect
> there are additional bugs in Amit's patch arising from trying to do so).
> However, plpgsql already has mechanisms for paying simple-expression
> setup costs once per transaction rather than once per expression use.
> So we can set up a simple-expression ResourceOwner managed much like
> the simple-expression EState, and have it hold a refcount on the
> CachedPlan for each simple expression, and pay that overhead just once
> per transaction.


> I haven't done any serious performance testing on this, but it gives
> circa 2X speedup on Pavel's original example, which is at least
> fairly close to the results that Amit's patch got there.  And it
> makes this last batch of test cases faster not slower, too.
> 
> With this patch, perf shows the hotspots on Pavel's original example
> as being
> 
> +   19.24%19.17% 46470  postmaster   plpgsql.so   
> [.] exec_eval_expr
> +   15.19%15.15% 36720  postmaster   plpgsql.so   
> [.] plpgsql_param_eval_var
> +   14.98%14.94% 36213  postmaster   postgres 
> [.] ExecInterpExpr
> +6.32% 6.30% 15262  postmaster   plpgsql.so   
> [.] exec_stmt
> +6.08% 6.06% 14681  postmaster   plpgsql.so   
> [.] exec_assign_value
> 
> Maybe there's more that could be done to knock fat out of
> exec_eval_expr and/or plpgsql_param_eval_var, but at least
> the plan cache isn't the bottleneck anymore.

Nice!


> diff --git a/src/backend/utils/cache/plancache.c 
> b/src/backend/utils/cache/plancache.c
> index dbae18d..8e27b03 100644
> --- a/src/backend/utils/cache/plancache.c
> +++ b/src/backend/utils/cache/plancache.c
> @@ -1278,6 +1278,160 @@ ReleaseCachedPlan(CachedPlan *plan, bool useResOwner)
>  }
>  
>  /*
> + * CachedPlanAllowsSimpleValidityCheck: can we use CachedPlanIsSimplyValid?
> + *
> + * This function, together with CachedPlanIsSimplyValid, provides a fast path
> + * for revalidating "simple" generic plans.  The core requirement to be 
> simple
> + * is that the plan must not require taking any locks, which translates to
> + * not touching any tables; this happens to match up well with an important
> + * use-case in PL/pgSQL.  This function tests whether that's true, along
> + * with checking some other corner cases that we'd rather not bother with
> + * handling in the fast path.  (Note that it's still possible for such a plan
> + * to be invalidated, for example due to a change in a function that was
> + * inlined into the plan.)
> + *
> + * This must only be called on known-valid generic plans (eg, ones just
> + * returned by GetCachedPlan).  If it returns true, the caller may re-use
> + * the cached plan as long as CachedPlanIsSimplyValid returns true; that
> + * check is much cheaper than the full revalidation done by GetCachedPlan.
> + * Nonetheless, no required checks are omitted.
> + */
> +bool
> +CachedPlanAllowsSimpleValidityCheck(CachedPlanSource *plansource,
> +                                    CachedPlan *plan)
> +{
> + ListCell   *lc;

Would it make sense to instead compute this as we go when building a
valid CachedPlanSource? If we make it a property of a is_valid
CachedPlanSource, we can assert that the plan is safe for use in
CachedPlanIsSimplyValid().

And perhaps also optimize the normal checks in RevalidateCachedQuery()
for cases not going through the "simple" path. We could not use the
optimizations around refcounts for those, but it still seems like it
could be useful? And less separate 

Re: backup manifests

2020-03-25 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 25, 2020 at 9:31 AM Stephen Frost  wrote:
> > I get that the default for manifest is 'no', but I don't really see how
> > that means that the lack of saying anything about checksums should mean
> > "give me crc32c checksums".  It's really rather common that if we don't
> > specify something, it means don't do that thing- like an 'ORDER BY'
> > clause.
> 
> That's a fair argument, but I think the other relevant principle is
> that we try to give people useful defaults for things. I think that
> checksums are a sufficiently useful thing that having the default be
> not to do it doesn't make sense. I had the impression that you and
> David were in agreement on that point, actually.

I agree with wanting to have useful defaults and that checksums should
be included by default, and I'm alright even with letting people pick
what algorithms they'd like to have too.  The construct here is made odd
because we've got this idea that "no checksum" is an option, which is
actually something that I don't particularly like, but that's what's
making this particular syntax weird.  I don't suppose you'd be open to
the idea of just dropping that though..?  There wouldn't be any issue
with this syntax if we just always had checksums included when a
manifest is requested. :)

Somehow, I don't think I'm going to win that argument.

> > There were also comments made up-thread about how it might not be great
> > for larger (eg: 1GB files, like we tend to have quite a few of...), and
> > something about it being a 40 year old algorithm..
> 
> Well, the 512MB "limit" for CRC-32C means only that for certain very
> specific types of errors, detection is not guaranteed above that file
> size. So if you have a single flipped bit, for example, and the file
> size is greater than 512MB, then CRC-32C has only a 99.999767169%
> chance of detecting the error, whereas if the file size is less than
> 512MB, it is 100% certain, because of the design of the algorithm. But
> nine nines is plenty, and neither SHA nor our page-level checksums
> provide guaranteed error detection properties anyway.

Right, so we know that CRC-32C has an upper-bound of 512MB to be useful
for exactly what it's designed to be useful for, but we also know that
we're going to have larger files- at least 1GB ones, and quite possibly
larger, so why are we choosing this?

At the least, wouldn't it make sense to consider a larger CRC, one whose
limit is above the size of commonly expected files, if we're going to
use a CRC?
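As an aside, the single-bit-error property being discussed is easy to demonstrate. The sketch below is only an illustration: it uses Python's stdlib `zlib.crc32` (plain CRC-32, not CRC-32C, but the same family of algorithm) to show that flipping any one bit of a buffer always changes the checksum:

```python
import zlib

def detects_all_single_bit_flips(data: bytes) -> bool:
    """True if flipping any one bit of `data` changes its CRC-32."""
    baseline = zlib.crc32(data)
    for byte_idx in range(len(data)):
        for bit in range(8):
            corrupted = bytearray(data)
            corrupted[byte_idx] ^= 1 << bit
            if zlib.crc32(bytes(corrupted)) == baseline:
                return False
    return True

# Any CRC whose polynomial has more than one term catches every
# single-bit error regardless of input length; the 512MB figure
# concerns guaranteed detection of certain multi-bit error patterns.
print(detects_all_single_bit_flips(b"backup manifest contents" * 8))  # True
```

So the question above is really about the multi-bit guarantees, which is where the block-length limit of the polynomial matters.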

> I'm not sure why the fact that it's a 40-year-old algorithm is
> relevant. There are many 40-year-old algorithms that are very good.

Sure there are, but there probably wasn't a lot of thought about
GB-sized files, and this doesn't really seem to be the direction people
are going in for larger objects.  s3, as an example, uses sha256.
Google, it seems, suggests folks use "HighwayHash" (from their crc32c
github repo- https://github.com/google/crc32c).  Most CRC uses seem to
be for much smaller data sets.

> My guess is that if this patch is adopted as currently proposed, we
> will eventually need to replace the cryptographic hash functions due
> to the march of time. As I'm sure you realize, the problem with hash
> functions that are designed to foil an adversary is that adversaries
> keep getting smarter. So, eventually someone will probably figure out
> how to do something nefarious with SHA-512. Some other technique that
> nobody's cracked yet will need to be adopted, and then people will
> begin trying to crack that, and the whole thing will repeat. But I
> suspect that we can keep using the same non-cryptographic hash
> function essentially forever. It does not matter that people know how
> the algorithm works because it makes no pretensions of trying to foil
> an opponent. It is just trying to mix up the bits in such a way that a
> change to the file is likely to cause a change in the checksum. The
> bit-mixing properties of the algorithm do not degrade with the passage
> of time.

Sure, there's a good chance we'll need newer algorithms in the future, I
don't doubt that.  On the other hand, if crc32c, or CRC whatever, was
the perfect answer and no one will ever need something better, then
what's with folks like Google suggesting something else..?

> > I'm sure that sha256 takes a lot more time than crc32c, I'm certainly
> > not trying to dispute that, but what's relevant here is how much it
> > impacts the time required to run the overall backup (including sync'ing
> > it to disk, and possibly network transmission time..  if we're just
> > comparing the time to run it through memory then, sure, the sha256
> > computation time might end up being quite a bit of the time, but that's
> > not really that interesting of a test..).
> 
> I think that http://postgr.es/m/38e29a1c-0d20-fc73-badd-ca05f7f07...@pgmasters.net
> is one of the more interesting emails on this topic.  My 

Re: Allow continuations in "pg_hba.conf" files

2020-03-25 Thread Fabien COELHO


Hello Justin,

thanks for the feedback.


-   Records cannot be continued across lines.
+   Records can be backslash-continued across lines.


Maybe say: "lines ending with a backslash are logically continued on the next
line", or similar.


I tried to change it along that.


Since it puts a blank there, it creates a "word" boundary, which I gather
worked for your use case.  But I wonder whether it's needed to add a space (or
otherwise, document that lines cannot be split between words?).


Hmmm. Ok, you are right. I hesitated while doing it. I removed the char 
instead, so that it does not add a word break.


Note, that also appears to affect the "username maps" file.  So mention 
in that chapter, too. 
https://www.postgresql.org/docs/current/auth-username-maps.html


Indeed, the same tokenizer is used. I updated a sentence to point on 
continuations.


--
Fabien.

diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 5f1eec78fb..4f947b0235 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -77,7 +77,8 @@
The general format of the pg_hba.conf file is
a set of records, one per line. Blank lines are ignored, as is any
text after the # comment character.
-   Records cannot be continued across lines.
+   Lines ending with a backslash are logically continued on the next
+   line, so a record can span several lines.
A record is made
up of a number of fields which are separated by spaces and/or tabs.
Fields can contain white space if the field value is double-quoted.
@@ -821,7 +822,7 @@ local   db1,db2,@demodbs  all   md5
 
 map-name system-username database-username
 
-   Comments and whitespace are handled in the same way as in
+   Comments, whitespace and continuations are handled in the same way as in
pg_hba.conf.  The
map-name is an arbitrary name that will be used to
refer to this mapping in pg_hba.conf. The other
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index da5189a4fa..bae20dbc06 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -486,8 +486,43 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 		char	   *lineptr;
 		List	   *current_line = NIL;
 		char	   *err_msg = NULL;
+		char	   *cur = rawline;
+		int			len = sizeof(rawline);
+		int			continuations = 0;
 
-		if (!fgets(rawline, sizeof(rawline), file))
+		/* read input and handle simplistic backslash continuations */
+		while ((cur = fgets(cur, len, file)) != NULL)
+		{
+			int		curlen = strlen(cur);
+			char   *curend = cur + curlen - 1;
+
+			if (curlen == len - 1)
+			{
+/* Line too long! */
+ereport(elevel,
+		(errcode(ERRCODE_CONFIG_FILE_ERROR),
+		 errmsg("authentication file line too long"),
+		 errcontext("line %d of configuration file \"%s\"",
+	line_number + continuations, filename)));
+err_msg = "authentication file line too long";
+			}
+
+			/* Strip trailing linebreak from rawline */
+			while (curend >= cur && (*curend == '\n' || *curend == '\r'))
+*curend-- = '\0';
+
+			/* empty or not a continuation, we are done */
+			if (curend < cur || *curend != '\\')
+break;
+
+			/* else we have a continuation, just remove it and loop */
+			continuations++;
+			*curend = '\0';
+			len -= (curend - cur);
+			cur = curend;
+		}
+
+		if (cur == NULL)
 		{
 			int			save_errno = errno;
 
@@ -501,21 +536,6 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 			   filename, strerror(save_errno));
 			rawline[0] = '\0';
 		}
-		if (strlen(rawline) == MAX_LINE - 1)
-		{
-			/* Line too long! */
-			ereport(elevel,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("authentication file line too long"),
-	 errcontext("line %d of configuration file \"%s\"",
-line_number, filename)));
-			err_msg = "authentication file line too long";
-		}
-
-		/* Strip trailing linebreak from rawline */
-		lineptr = rawline + strlen(rawline) - 1;
-		while (lineptr >= rawline && (*lineptr == '\n' || *lineptr == '\r'))
-			*lineptr-- = '\0';
 
 		/* Parse fields */
 		lineptr = rawline;
@@ -543,7 +563,7 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 			*tok_lines = lappend(*tok_lines, tok_line);
 		}
 
-		line_number++;
+		line_number += continuations + 1;
 	}
 
 	MemoryContextSwitchTo(oldcxt);


Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding

2020-03-25 Thread Tom Lane
Noah Misch  writes:
> On Thu, Mar 19, 2020 at 06:04:52PM -0400, Tom Lane wrote:
>> 1. It seems like this discussion is conflating two different issues.

> When the newest XID and the oldest XID fall in "opposite" segments in the XID
> space, we must not unlink the segment containing the newest XID.  That is the
> chief goal at present.

Got it.  Thanks for clarifying the scope of the patch.

>> 3. It feels like the proposed test of cutoff position against both
>> ends of a segment is a roundabout way of fixing the problem.  I
>> wonder whether we ought not pass *both* the cutoff and the current
>> endpoint (latest_page_number) down to the truncation logic, and
>> have it compare against both of those values.

> Since latest_page_number can keep changing throughout SlruScanDirectory()
> execution, that would give a false impression of control.  Better to
> demonstrate that the xidWrapLimit machinery keeps latest_page_number within
> acceptable constraints than to ascribe significance to a comparison with a
> stale latest_page_number.

Perhaps.  I'm prepared to accept that line of argument so far as the clog
SLRU goes, but I'm not convinced that the other SLRUs have equally robust
defenses against advancing too far.  So on the whole I'd rather that the
SLRU logic handled this issue strictly on the basis of what it knows,
without assumptions about what calling code may be doing.  Still, maybe
we only really care about the risk for the clog SLRU?

>> To try to clarify this in my head, I thought about an image of
>> the modular XID space as an octagon, where each side would correspond to
>> a segment file if we chose numbers such that there were only 8 possible
>> segment files.

> Exactly.
> https://docs.google.com/drawings/d/1xRTbQ4DVyP5wI1Ujm_gmmY-cC8KKCjahEtsU_o0fC7I
> uses your octagon to show the behaviors before and after this patch.

Cool, thanks for drafting that up.  (My original sketch was not of
publishable quality ;-).)  To clarify, the upper annotations probably
ought to read "nextXid <= xidWrapLimit"?  And "cutoffPage" ought
to be affixed to the orange dot at lower right of the center image?

I agree that this diagram depicts why we have a problem right now,
and the right-hand image shows what we want to have happen.
What's a little less clear is whether the proposed patch achieves
that effect.

In particular, after studying this awhile, it seems like removal
of the initial "cutoffPage -= cutoffPage % SLRU_PAGES_PER_SEGMENT"
adjustment isn't really affecting anything.  It's already the case
that just by allowing oldestXact to get rounded back to an SLRU page
boundary, we've created some daylight between oldestXact and the
cutoff point.  Rounding back further within the same SLRU segment
changes nothing.  (For example, suppose that oldestXact is already
within the oldest page of its SLRU segment.  Then either rounding
rule has the same effect.  But there's still a little bit of room for
xidWrapLimit to be in the opposite SLRU segment, allowing trouble.)

So I think what we're actually trying to accomplish here is to
ensure that instead of deleting up to half of the SLRU space
before the cutoff, we delete up to half-less-one-segment.
Maybe it should be half-less-two-segments, just to provide some
cushion against edge cases.  Reading the first comment in
SetTransactionIdLimit makes one not want to trust too much in
arguments based on the exact value of xidWrapLimit, while for
the other SLRUs it was already unclear whether the edge cases
were exactly right.

In any case, it feels like the specific solution you have here,
of testing both ends of the segment, is a roundabout way of
providing that one-segment slop; and it doesn't help if we decide
we need two-segment slop.  Can we write the test in a way that
explicitly provides N segments of slop?
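To make that concrete, here is a toy model of what an explicit-slop test could look like — assuming an 8-segment, 256-page modular space like the octagon sketch; this is illustrative only, not the actual SLRU code:

```python
SLRU_PAGES_PER_SEGMENT = 32
NUM_PAGES = 256  # toy page space: 8 segments, as in the octagon diagram

def signed_page_diff(a, b, num_pages=NUM_PAGES):
    """Signed modular distance from b to a, in [-num_pages/2, num_pages/2)."""
    half = num_pages // 2
    return (a - b + half) % num_pages - half

def safely_precedes(page, cutoff, slop_segments=2, num_pages=NUM_PAGES):
    """True only if `page` is behind `cutoff` AND outside a guard band of
    `slop_segments` segments nearest the antipode -- i.e. we allow deleting
    up to "half minus N segments" of the space, never a full half."""
    d = signed_page_diff(page, cutoff, num_pages)
    guard = slop_segments * SLRU_PAGES_PER_SEGMENT
    return -(num_pages // 2) + guard <= d < 0
```

With `slop_segments=2`, a page diametrically opposite the cutoff (where the newest XID's segment can live near wraparound) is never treated as truncatable, while pages comfortably behind the cutoff still are.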

regards, tom lane




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Robert Haas
On Sat, Mar 21, 2020 at 2:24 PM Tom Lane  wrote:
> With this patch, perf shows the hotspots on Pavel's original example
> as being
>
> +   19.24%  19.17%  46470  postmaster  plpgsql.so  [.] exec_eval_expr
> +   15.19%  15.15%  36720  postmaster  plpgsql.so  [.] plpgsql_param_eval_var
> +   14.98%  14.94%  36213  postmaster  postgres    [.] ExecInterpExpr
> +    6.32%   6.30%  15262  postmaster  plpgsql.so  [.] exec_stmt
> +    6.08%   6.06%  14681  postmaster  plpgsql.so  [.] exec_assign_value

That's pretty sweet. As you say, there's probably some way to
eliminate some of the non-plancache overhead, but it's still a big
improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Alexander Korotkov
On Wed, Mar 25, 2020 at 10:26 PM Andres Freund  wrote:
> On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> > Since we talked about how scale_factor can be used to effectively disable this
> > new feature, I thought that scale=100 was too small and suggested 1e10 (same as
> > max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
> > handling the case that analyze is disabled, or its threshold is high, or it
> > hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
> > the same time as vacuum.
>
> For disabling we instead should allow -1, and disable the feature if set
> to < 0.

This patch introduces both GUC and reloption.  In reloptions we
typically use -1 for "disable reloption, use GUC value instead"
semantics.  So it's unclear how should we allow reloption to both
disable feature and disable reloption.  I think we don't have a
precedent in the codebase yet.  We could allow -2 (disable reloption)
and -1 (disable feature) for reloption.  Opinions?
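One way the floated -2/-1 scheme could resolve to an effective per-table value — a sketch of the semantics only, not actual reloption code, with the function name invented for illustration:

```python
def effective_scale_factor(reloption, guc):
    """Resolve the effective setting under the proposed scheme:
    reloption == -2  -> reloption unset, fall back to the GUC
    value     == -1  -> feature disabled (either per-table or via GUC)
    value     >=  0  -> the scale factor to use
    Returns None when the feature is disabled."""
    value = guc if reloption == -2 else reloption
    return None if value < 0 else value
```

Note the asymmetry this introduces: the reloption gains a sentinel (-2) that the GUC does not have, which is exactly the novelty being asked about.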

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-25, Andres Freund wrote:

> > I don't know if my approach is exactly what Andres has in mind
> 
> Not quite. I don't think it's generally correct for CIC to set
> PROC_IN_VACUUM. I'm doubtful it's the case even just for plain indexes -
> we don't want rows to be pruned away from under us. I also think we'd
> want to set such a flag during all of the CIC phases?
> 
> What I was thinking of was a new flag, with a distinct value from
> PROC_IN_VACUUM. It'd currently just be specified in the
> GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
> needing to wait for other CICs on different relations. Since CIC is not
> permitted on system tables, and CIC doesn't do DML on normal tables, it
> seems fairly obviously correct to exclude other CICs.

Hmm, that sounds more promising.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 16:30:10 -0300, Alvaro Herrera wrote:
> I posted this in November
> https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql but I didn't
> put time to go through the issues there.

Oh, missed that.


> I don't know if my approach is exactly what Andres has in mind

Not quite. I don't think it's generally correct for CIC to set
PROC_IN_VACUUM. I'm doubtful it's the case even just for plain indexes -
we don't want rows to be pruned away from under us. I also think we'd
want to set such a flag during all of the CIC phases?

What I was thinking of was a new flag, with a distinct value from
PROC_IN_VACUUM. It'd currently just be specified in the
GetCurrentVirtualXIDs() calls in WaitForOlderSnapshots(). That'd avoid
needing to wait for other CICs on different relations. Since CIC is not
permitted on system tables, and CIC doesn't do DML on normal tables, it
seems fairly obviously correct to exclude other CICs.

Greetings,

Andres Freund




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 15:24:44 -0400, James Coleman wrote:
> Andres: If we got this fixed in current PG would you be opposed to
> documenting the caveat in previous versions?

Not really. I'm just not confident it's going to be useful, given the
amount of details needed to be provided to really make sense of the
issue (the earlier CIC phases don't wait for snapshots, but just
relation locks etc).

Greetings,

Andres Freund




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Tom Lane
Pavel Stehule  writes:
> I'll mark this patch as ready for commiters.

Thanks for reviewing!  Amit, do you have any thoughts on this?

regards, tom lane




Re: AllocSetEstimateChunkSpace()

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-24 18:12:03 -0700, Jeff Davis wrote:
> Attached is a small patch that introduces a simple function,
> AllocSetEstimateChunkSpace(), and uses it to improve the estimate for
> the size of a hash entry for hash aggregation.
> 
> Getting reasonable estimates for the hash entry size (including the
> TupleHashEntryData, the group key tuple, the per-group state, and by-
> ref transition values) is important for multiple reasons:
> 
> * It helps the planner know when hash aggregation is likely to spill,
> and how to cost it.
> 
> * It helps hash aggregation regulate itself by setting a group limit
> (separate from the memory limit) to account for growing by-ref
> transition values.
> 
> * It helps choose a good initial size for the hash table. Too large of
> a hash table will crowd out space that could be used for the group keys
> or the per-group state.
> 
> Each group requires up to three palloc chunks: one for the group key
> tuple, one for the per-group states, and one for a by-ref transition
> value. Each chunk can have a lot of overhead: in addition to the chunk
> header (16 bytes overhead), it also rounds the value up to a power of
> two (~25% overhead). So, it's important to account for this chunk
> overhead.
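The arithmetic in that last point can be sketched as follows — a simplified model only: the real AllocSet also has a minimum chunk size and a large-chunk threshold, both ignored here:

```python
ALLOC_CHUNKHDRSZ = 16  # assumed per-chunk header overhead, per the text above

def estimate_chunk_space(request: int) -> int:
    """Space consumed by one palloc-style chunk: the requested size
    rounded up to the next power of two, plus the chunk header."""
    size = 1
    while size < request:
        size <<= 1
    return size + ALLOC_CHUNKHDRSZ

# A 24-byte request actually consumes 48 bytes: 32 after power-of-two
# rounding, plus the 16-byte header -- 100% overhead in this case.
print(estimate_chunk_space(24))  # 48
```

Multiplied by up to three chunks per group, this overhead is large enough to matter for hash table sizing.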

As mentioned on IM/call: I think this is mainly an argument for combining
the group tuple & per-group state allocations. I'm kind of fine with
afterwards taking the allocator overhead into account.


I still don't buy that it's useful to estimate the by-ref transition
value overhead. We just don't have anything even close to a
meaningful value to base this on. Even if we want to consider the
initial transition value or something, we'd be better off initially
over-estimating the size of the transition state by a lot more than 25%
(I am thinking more like 4x or so, with a minimum of 128 bytes or
so). Since this is about the initial size of the hashtable, we're better
off with a too small table, imo. A "too large" table is more likely to
end up needing to spill when filled to only a small degree.


I am kind of wondering if there's actually much point in trying to be
accurate here at all. Especially when doing this from the
planner. Since, for a large fraction of aggregates, we're going to very
roughly guess at transition space anyway, it seems like a bunch of
"overhead factors" could turn out to be better than trying to be
accurate on some parts, while still wildly guessing at transition space.
But I'm not sure.


> I considered making it a method of a memory context, but the planner
> will call this function before the hash agg memory context is created.
> It seems to make more sense for this to just be an AllocSet-specific
> function for now.

-1 to this approach. I think it's architecturally the wrong direction to
add more direct calls to functions within specific contexts.



On 2020-03-25 11:46:31 -0700, Jeff Davis wrote:
> On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> > I considered making it a method of a memory context, but the planner
> > will call this function before the hash agg memory context is
> > created.
> 
> I implemented this approach also; attached.
> 
> It works a little better by having access to the memory context, so it
> knows set->allocChunkLimit. It also allows it to work with the slab
> allocator, which needs the memory context to return a useful number.
> However, this introduces more code and awkwardly needs to use
> CurrentMemoryContext when called from the planner (because the actual
> memory context we're trying to estimate for doesn't exist yet).

Yea, the "context needs to exist" part sucks. I really don't want to add
calls directly into AllocSet from more places though. And just ignoring
the parameters to the context seems wrong too.

Greetings,

Andres Freund




Re: Allow continuations in "pg_hba.conf" files

2020-03-25 Thread Justin Pryzby
Hi,

On Wed, Mar 25, 2020 at 07:09:38PM +0100, Fabien COELHO wrote:
> 
> Hello,
> 
> After writing an unreadable and stupidly long line for ldap authentication
> in a "pg_hba.conf" file, I figured out that allowing continuations looked
> simple enough and should just be done.

I tried this briefly.

> -   Records cannot be continued across lines.
> +   Records can be backslash-continued across lines.

Maybe say: "lines ending with a backslash are logically continued on the next
line", or similar.

> + /* else we have a continuation, just blank it and loop */
> + continuations++;
> + *curend++ = ' ';

Since it puts a blank there, it creates a "word" boundary, which I gather
worked for your use case.  But I wonder whether it's needed to add a space (or
otherwise, document that lines cannot be split between words?).

You might compare this behavior with that of makefiles (or find a better
example) which I happen to recall *don't* add a space; if you want that, you
have to end the line with: " \" not just "\".

Anyway, I checked that the current patch handles users split across lines, like:
alice,\
bob,\
carol

As written, that depends on the parser's behavior of ignoring commas and
blanks, since it sees:
"alice,[SPACE]bob,[SPACE]carol"

Maybe it'd be nice to avoid depending on that.

I tried with a username called "alice,bob", split across lines:

"alice,\
bob",\

But then your patch makes it look for a user called "alice, bob" (with a
space).  I realize that's not a compelling argument :)
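The two join policies being contrasted here can be sketched with a hypothetical helper (not the patch's C code): backslash-becomes-space versus makefile-style direct joining.

```python
def join_continuations(lines, insert_space=True):
    """Merge lines ending in a backslash with the following line.
    With insert_space=True, the backslash becomes a space (creating a
    word boundary); with False, lines are joined directly, so a token
    can be split across lines -- makefile-style."""
    out, buf = [], ""
    for line in lines:
        if line.endswith("\\"):
            buf += line[:-1] + (" " if insert_space else "")
        else:
            out.append(buf + line)
            buf = ""
    if buf:  # trailing continuation with no following line
        out.append(buf.rstrip())
    return out
```

Under the first policy `["alice,\\", "bob,\\", "carol"]` yields `"alice, bob, carol"`, which works only because the field parser tolerates the extra spaces; the second policy yields `"alice,bob,carol"` and would also handle a quoted name split mid-token.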

Note, that also appears to affect the "username maps" file.  So mention in that
chapter, too.
https://www.postgresql.org/docs/current/auth-username-maps.html

Cheers,
-- 
Justin




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-25, James Coleman wrote:

> Alvaro: I think you had some ideas on this too; any chance you've know
> of a patch that anyone's got cooking?

I posted this in November
https://postgr.es/m/20191101203310.GA12239@alvherre.pgsql but I didn't
put time to go through the issues there.  I don't know if my approach is
exactly what Andres has in mind, but I was discouraged by the number of
gotchas for which the optimization I propose has to be turned off.

Maybe that preliminary patch can serve as a discussion starter, if
nothing else.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Include sequence relation support in logical replication

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-24 16:19:21 -0700, Cary Huang wrote:
> I have shared a patch that allows sequence relation to be supported in
> logical replication via the decoding plugin ( test_decoding for
> example ); it does not support sequence relation in logical
> replication between a PG publisher and a PG subscriber via pgoutput
> plugin as it will require much more work.

Could you expand on "much more work"? Once decoding support is there,
that shouldn't be that much?


> Sequence changes caused by other sequence-related SQL functions like
> setval() or ALTER SEQUENCE xxx, will always emit a WAL update, so
> replicating changes caused by these should not be a problem.

I think this really would need to handle at the very least setval to
make sense.


> For the replication to make sense, the patch actually disables the WAL
> update at every 32 nextval() calls, so every call to nextval() will
> emit a WAL update for proper replication. This is done by setting
> SEQ_LOG_VALS to 0 in sequence.c

Why is that needed? ISTM updating in increments of 32 is fine for
replication purposes? It's good imo, because sending out more granular
increments would increase the size of the WAL stream?
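The batching at issue can be modeled with a toy simulation — this is illustrative only, borrowing the SEQ_LOG_VALS name and a log_cnt-style counter, not the actual sequence.c bookkeeping:

```python
def make_sequence():
    return {"value": 0, "log_cnt": 0, "wal_records": 0}

def nextval(seq, seq_log_vals=32):
    """Toy model of sequence WAL batching: one WAL record pre-logs the
    next `seq_log_vals` fetches, so most nextval() calls write no WAL.
    Setting seq_log_vals=0 forces a WAL record on every call."""
    if seq["log_cnt"] == 0:
        seq["wal_records"] += 1
        seq["log_cnt"] = seq_log_vals
    else:
        seq["log_cnt"] -= 1
    seq["value"] += 1
    return seq["value"]
```

With the default batching, 66 calls emit only 2 WAL records; with the patch's seq_log_vals=0 every call emits one, which is the WAL-volume increase being questioned.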



> diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
> index 93c948856e..7a7e572d6c 100644
> --- a/contrib/test_decoding/test_decoding.c
> +++ b/contrib/test_decoding/test_decoding.c
> @@ -466,6 +466,15 @@ pg_decode_change(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
>  					&change->data.tp.oldtuple->tuple,
>  					true);
>  			break;
> +		case REORDER_BUFFER_CHANGE_SEQUENCE:
> +			appendStringInfoString(ctx->out, " SEQUENCE:");
> +			if (change->data.sequence.newtuple == NULL)
> +				appendStringInfoString(ctx->out, " (no-tuple-data)");
> +			else
> +				tuple_to_stringinfo(ctx->out, tupdesc,
> +									&change->data.sequence.newtuple->tuple,
> +									false);
> +			break;
>   default:
>   Assert(false);
>   }

You should also add tests - the main purpose of contrib/test_decoding is
to facilitate writing those...


> + ReorderBufferXidSetCatalogChanges(ctx->reorder, XLogRecGetXid(buf->record), buf->origptr);

Huh, why are you doing this? That's going to increase overhead of logical
decoding by many times?


> +		case REORDER_BUFFER_CHANGE_SEQUENCE:
> +			Assert(snapshot_now);
> +
> +			reloid = RelidByRelfilenode(change->data.sequence.relnode.spcNode,
> +										change->data.sequence.relnode.relNode);
> +
> +			if (reloid == InvalidOid &&
> +				change->data.sequence.newtuple == NULL)
> +				goto change_done;

I don't think this path should be needed? There's afaict no valid ways
we should be able to end up here without a tuple?


> +			if (!RelationIsLogicallyLogged(relation))
> +				goto change_done;

Similarly, this seems superfluous and should perhaps be an assertion?

> +			/* user-triggered change */
> +			if (!IsToastRelation(relation))
> +			{
> +				ReorderBufferToastReplace(rb, txn, relation, change);
> +				rb->apply_change(rb, txn, relation, change);
> +			}
> +			break;
>   }
>   }
>

This doesn't make sense either.



> diff --git a/src/include/replication/reorderbuffer.h b/src/include/replication/reorderbuffer.h
> index 626ecf4dc9..cf3fd45c5f 100644
> --- a/src/include/replication/reorderbuffer.h
> +++ b/src/include/replication/reorderbuffer.h
> @@ -62,7 +62,8 @@ enum ReorderBufferChangeType
>   REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID,
>   REORDER_BUFFER_CHANGE_INTERNAL_SPEC_INSERT,
>   REORDER_BUFFER_CHANGE_INTERNAL_SPEC_CONFIRM,
> - REORDER_BUFFER_CHANGE_TRUNCATE
> + REORDER_BUFFER_CHANGE_TRUNCATE,
> + REORDER_BUFFER_CHANGE_SEQUENCE,
>  };
>
>  /* forward declaration */
> @@ -149,6 +150,15 @@ typedef struct 

Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Andres Freund
Hi,

On 2020-03-25 11:05:21 -0500, Justin Pryzby wrote:
> Since we talked about how scale_factor can be used to effectively disable this
> new feature, I thought that scale=100 was too small and suggested 1e10 (same as
> max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
> handling the case that analyze is disabled, or its threshold is high, or it
> hasn't run yet, or it's running but hasn't finished, or analyze is triggered at
> the same time as vacuum.

For disabling we instead should allow -1, and disable the feature if set
to < 0.

Greetings,

Andres Freund




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread James Coleman
On Wed, Mar 25, 2020 at 3:19 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-09-18 13:51:00 -0400, James Coleman wrote:
> > In my experience it's not immediately obvious (even after reading the
> > documentation) the implications of how concurrent index builds manage
> > transactions with respect to multiple concurrent index builds in
> > flight at the same time.
> >
> > Specifically, as I understand multiple concurrent index builds running
> > at the same time will all return at the same time as the longest
> > running one.
> >
> > I've attached a small patch to call this caveat out specifically in
> > the documentation. I think the description in the patch is accurate,
> > but please let me know if there's some intricacies around how the
> > various stages might change the results.
> >
> > James Coleman
>
> I'd much rather see effort spent fixing this issue as far as it relates
> to concurrent CICs. For the snapshot waits we can add a procarray flag
> (alongside PROCARRAY_VACUUM_FLAG) indicating what the backend is
> doing. Which WaitForOlderSnapshots() can then use to ignore those CICs,
> which is safe, because those transactions definitely don't insert into
> relations targeted by CIC. The change to WaitForOlderSnapshots() would
> just be to pass the new flag to GetCurrentVirtualXIDs, I think.

Alvaro: I think you had some ideas on this too; any chance you've know
of a patch that anyone's got cooking?

Andres: If we got this fixed in current PG would you be opposed to
documenting the caveat in previous versions?

Thanks,
James




Re: [DOC] Document concurrent index builds waiting on each other

2020-03-25 Thread Andres Freund
Hi,

On 2019-09-18 13:51:00 -0400, James Coleman wrote:
> In my experience it's not immediately obvious (even after reading the
> documentation) the implications of how concurrent index builds manage
> transactions with respect to multiple concurrent index builds in
> flight at the same time.
> 
> Specifically, as I understand multiple concurrent index builds running
> at the same time will all return at the same time as the longest
> running one.
> 
> I've attached a small patch to call this caveat out specifically in
> the documentation. I think the description in the patch is accurate,
> but please let me know if there's some intricacies around how the
> various stages might change the results.
> 
> James Coleman

I'd much rather see effort spent fixing this issue as far as it relates
to concurrent CICs. For the snapshot waits we can add a procarray flag
(alongside PROCARRAY_VACUUM_FLAG) indicating that the backend is
doing CIC. Which WaitForOlderSnapshots() can then use to ignore those CICs,
which is safe, because those transactions definitely don't insert into
relations targeted by CIC. The change to WaitForOlderSnapshots() would
just be to pass the new flag to GetCurrentVirtualXIDs, I think.

Greetings,

Andres Freund




Re: plan cache overhead on plpgsql expression

2020-03-25 Thread Pavel Stehule
so 21. 3. 2020 v 21:29 odesílatel Pavel Stehule 
napsal:

>
>
> so 21. 3. 2020 v 19:24 odesílatel Tom Lane  napsal:
>
>> Pavel Stehule  writes:
>> > So the patch has a problem with constant casting - unfortunately the
>> mix of
>> > double precision variables and numeric constants is pretty often in
>> > Postgres.
>>
>> Yeah.  I believe the cause of that is that the patch thinks it can skip
>> passing an inline-function-free simple expression through the planner.
>> That's flat out wrong.  Quite aside from failing to perform
>> constant-folding (which is presumably the cause of the slowdown you
>> spotted), that means that we miss performing such non-optional
>> transformations as rearranging named-function-argument notation into
>> positional order.  I didn't bother to test that but I'm sure it can be
>> shown to lead to crashes.
>>
>> Now that I've looked at the patch I don't like it one bit from a
>> structural standpoint either.  It's basically trying to make an end
>> run around the plancache, which is not going to be maintainable even
>> if it correctly accounted for everything the plancache does today.
>> Which it doesn't.  Two big problems are:
>>
>> * It doesn't account for the possibility of search_path changes
>> affecting the interpretation of an expression.
>>
>> * It assumes that the *only* things that a simple plan could get
>> invalidated for are functions that were inlined.  This isn't the
>> case --- a counterexample is that removal of no-op CoerceToDomain
>> nodes requires the plan to be invalidated if the domain's constraints
>> change.  And there are likely to be more such issues in future.
>>
>>
>> So while there's clearly something worth pursuing here, I do not like
>> anything about the way it was done.  I think that the right way to
>> think about this problem is "how can the plan cache provide a fast
>> path for checking validity of simple-expression plans?".  And when you
>> think about it that way, there's a pretty obvious answer: if the plan
>> involves no table references, there's not going to be any locks that
>> have to be taken before we can check the is_valid flag.  So we can
>> have a fast path that skips AcquirePlannerLocks and
>> AcquireExecutorLocks, which are a big part of the problem, and we can
>> also bypass some of the other random checks that GetCachedPlan has to
>> make, like whether RLS affects the plan.
>>
>> Another chunk of the issue is the constant acquisition and release of
>> reference counts on the plan.  We can't really skip that (I suspect
>> there are additional bugs in Amit's patch arising from trying to do so).
>> However, plpgsql already has mechanisms for paying simple-expression
>> setup costs once per transaction rather than once per expression use.
>> So we can set up a simple-expression ResourceOwner managed much like
>> the simple-expression EState, and have it hold a refcount on the
>> CachedPlan for each simple expression, and pay that overhead just once
>> per transaction.
>>
>> So I worked on those ideas for awhile, and came up with the attached
>> patchset:
>>
>> 0001 adds some regression tests in this area (Amit's patch fails the
>> tests concerning search_path changes).
>>
>> 0002 does what's suggested above.  I also did a little bit of marginal
>> micro-tuning in exec_eval_simple_expr() itself.
>>
>> 0003 improves the biggest remaining cost of validity rechecking,
>> which is verifying that the search_path is the same as it was when
>> the plan was cached.
>>
>> I haven't done any serious performance testing on this, but it gives
>> circa 2X speedup on Pavel's original example, which is at least
>> fairly close to the results that Amit's patch got there.  And it
>> makes this last batch of test cases faster not slower, too.
>>
>> With this patch, perf shows the hotspots on Pavel's original example
>> as being
>>
>> +   19.24%19.17% 46470  postmaster   plpgsql.so
>>  [.] exec_eval_expr
>> +   15.19%15.15% 36720  postmaster   plpgsql.so
>>  [.] plpgsql_param_eval_var
>> +   14.98%14.94% 36213  postmaster   postgres
>>  [.] ExecInterpExpr
>> +6.32% 6.30% 15262  postmaster   plpgsql.so
>>  [.] exec_stmt
>> +6.08% 6.06% 14681  postmaster   plpgsql.so
>>  [.] exec_assign_value
>>
>> Maybe there's more that could be done to knock fat out of
>> exec_eval_expr and/or plpgsql_param_eval_var, but at least
>> the plan cache isn't the bottleneck anymore.
>>
>
> I tested Tom's patches, and I can confirm these results.
>
> It doesn't break tests (and all plpgsql_check tests passed without
> problems).
>
> The high overhead of ExecInterpExpr is related to preparing fcinfo and
> checking null arguments, because all functions are strict. In
> plpgsql_param_eval_var, the expensive parts look like var = (PLpgSQL_var *)
> estate->datums[dno] and *op->resvalue = var->value;
>

I rechecked Tom's patch, and all tests passed, and there 

Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-03-25 Thread Tomas Vondra

Hi,

I started reviewing the patch a couple of days ago. I haven't done any
extensive testing, but I do have a bunch of initial comments that I can
share now.

1) I wonder if this needs to update src/backend/optimizer/README, which
does have a section about partitionwise joins. It seems formulated in a
way that probably covers even this more advanced algorithm, but
maybe it should mention handling of default partitions etc.?

There certainly needs to be some description of the algorithm somewhere,
either in a README or before a suitable function. It doesn't have to be
particularly detailed, a rough outline of the matching would be enough,
so that readers don't have to rebuild the knowledge from pieces
scattered around various comments.

2) Do we need another GUC enabling this more complex algorithm? In PG11
the partitionwise join is disabled by default, on the grounds that it's
expensive and not worth it by default. How much more expensive is this?
Maybe it makes sense to allow enabling only the "simple" approach?

3) This comment in try_partitionwise_join is now incorrect, because the
condition may be true even for partitioned tables with (nparts == 0).

/* Nothing to do, if the join relation is not partitioned. */
if (joinrel->part_scheme == NULL || joinrel->nparts == 0)
return;

Moreover, the condition used to be 


if (!IS_PARTITIONED_REL(joinrel))
return;

which is way more readable. I think it's net negative to replace these
"nice" macros with clear meaning with complex conditions. If needed, we
can invent new macros. There are many other places where the patch
replaces macros with less readable conditions.


4) I'm a bit puzzled how we could get here with non-partitioned rels?

/*
 * We can not perform partitionwise join if either of the joining relations
 * is not partitioned.
 */
if (!IS_PARTITIONED_REL(rel1) || !IS_PARTITIONED_REL(rel2))
return;

5) I find the "merged" flag in RelOptInfo rather unclear, because it
does not clearly indicate what was merged. Maybe something like
partbounds_merged would be better?

6) The try_partitionwise_join function is getting a bit too long and
harder to understand. The whole block in

if (joinrel->nparts == -1)
{
...
}

seems rather well isolated, so I propose to move it to a separate
function responsible only for the merging. We can simply call it on the
joinrel, and make it return right away if (joinrel->nparts == -1).

7) I'd suggest not to reference exact functions in comments unless
absolutely necessary, because it's harder to maintain and it does not
really explain purpose of the struct/code. E.g. consider this:

/* Per-partitioned-relation data for 
merge_list_bounds()/merge_range_bounds() */
typedef struct PartitionMap
{ ... }

Why does it matter where is the struct used? That's pretty trivial to
find using 'git grep' or something. Instead the comment should explain
the purpose of the struct.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: AllocSetEstimateChunkSpace()

2020-03-25 Thread Jeff Davis
On Tue, 2020-03-24 at 18:12 -0700, Jeff Davis wrote:
> I considered making it a method of a memory context, but the planner
> will call this function before the hash agg memory context is
> created.

I implemented this approach also; attached.

It works a little better by having access to the memory context, so it
knows set->allocChunkLimit. It also allows it to work with the slab
allocator, which needs the memory context to return a useful number.
However, this introduces more code and awkwardly needs to use
CurrentMemoryContext when called from the planner (because the actual
memory context we're trying to estimate for doesn't exist yet).

I slightly favor the previous approach (mentioned in the parent email) 
because it's simple and effective. But I'm fine with this one if
someone thinks it will be better for other use cases.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2a6f44a6274..1ba62879cf6 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1637,16 +1637,49 @@ find_hash_columns(AggState *aggstate)
 
 /*
  * Estimate per-hash-table-entry overhead.
+ *
+ * It's important to account for the overhead of individual memory
+ * allocations, which can be significant. If the caller specifies a memory
+ * context, use that to estimate the total allocation sizes; otherwise, use
+ * CurrentMemoryContext.
  */
 Size
-hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
+hash_agg_entry_size(int numTrans, Size tupleWidth, Size transitionSpace,
+	MemoryContext context)
 {
+	SizetupleChunkSize;
+	SizepergroupChunkSize;
+	SizetransitionChunkSize;
+	SizetupleSize	 = (MAXALIGN(SizeofMinimalTupleHeader) +
+			tupleWidth);
+	SizepergroupSize = numTrans * sizeof(AggStatePerGroupData);
+
+	if (context == NULL)
+		context = CurrentMemoryContext;
+
+	tupleChunkSize = EstimateMemoryChunkSpace(context, tupleSize);
+
+	if (pergroupSize > 0)
+	{
+		pergroupChunkSize = EstimateMemoryChunkSpace(
+			context, pergroupSize);
+	}
+	else
+		pergroupChunkSize = 0;
+
+	if (transitionSpace > 0)
+	{
+		transitionChunkSize = EstimateMemoryChunkSpace(
+			context, transitionSpace);
+	}
+	else
+		transitionChunkSize = 0;
+
 	return
-		MAXALIGN(SizeofMinimalTupleHeader) +
-		MAXALIGN(tupleWidth) +
-		MAXALIGN(sizeof(TupleHashEntryData) +
- numAggs * sizeof(AggStatePerGroupData)) +
-		transitionSpace;
+		sizeof(TupleHashEntryData) +
+		tupleChunkSize +
+		pergroupChunkSize +
+		transitionChunkSize;
 }
 
 /*
@@ -3549,7 +3582,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 		aggstate->hash_pergroup = pergroups;
 
 		aggstate->hashentrysize = hash_agg_entry_size(
-			aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+			aggstate->numtrans, outerplan->plan_width, node->transitionSpace,
+			aggstate->hashcontext->ecxt_per_tuple_memory);
 
 		/*
 		 * Consider all of the grouping sets together when setting the limits
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8cf694b61dc..adc191704c2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2272,7 +2272,7 @@ cost_agg(Path *path, PlannerInfo *root,
 		 * otherwise we expect to spill.
 		 */
 		hashentrysize = hash_agg_entry_size(
-			aggcosts->numAggs, input_width, aggcosts->transitionSpace);
+			aggcosts->numAggs, input_width, aggcosts->transitionSpace, NULL);
		hash_agg_set_limits(hashentrysize, numGroups, 0, &mem_limit,
			&ngroups_limit, &num_partitions);
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b65abf6046d..5c423a3d14c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4866,7 +4866,7 @@ create_distinct_paths(PlannerInfo *root,
 	else
 	{
 		Size		hashentrysize = hash_agg_entry_size(
-			0, cheapest_input_path->pathtarget->width, 0);
+			0, cheapest_input_path->pathtarget->width, 0, NULL);
 
 		allow_hash = enable_hashagg_disk ||
 			(hashentrysize * numDistinctRows <= work_mem * 1024L);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8339f4cd7a2..cc5788fc954 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -3527,8 +3527,10 @@ double
 estimate_hashagg_tablesize(Path *path, const AggClauseCosts *agg_costs,
 		   double dNumGroups)
 {
-	Size		hashentrysize = hash_agg_entry_size(
-		agg_costs->numAggs, path->pathtarget->width, agg_costs->transitionSpace);
+	Size	hashentrysize = hash_agg_entry_size(agg_costs->numAggs,
+path->pathtarget->width,
+agg_costs->transitionSpace,
+NULL);
 
 	/*
 	 * Note that this disregards the effect of fill-factor and growth policy
diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index c0623f106d2..107afce953b 100644
--- a/src/backend/utils/mmgr/aset.c
+++ 

Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-03-25 Thread Anna Akenteva

On 2020-03-21 14:16, Kartyshov Ivan wrote:

As it was discussed earlier, I added wait for statement into
begin/start statement.

Thanks! To address the discussion: I like the idea of having WAIT as a 
part of BEGIN statement rather than a separate command, as Thomas Munro 
suggested. That way, the syntax itself enforces that WAIT FOR LSN will 
actually take effect, even for single-snapshot transactions. It seems 
more convenient for the user, who won't have to remember the details 
about how WAIT interacts with isolation levels.




BEGIN [ WORK | TRANSACTION ] [ transaction_mode[, ...] ] wait_for_event

Not sure about this, but could we add "WAIT FOR .." as another 
transaction_mode rather than a separate thing? That way, user won't have 
to worry about the order. As of now, one should remember to always put 
WAIT FOR as the last parameter in the BEGIN statement.




  and event is:
  LSN value [options]
  TIMESTAMP value

I would maybe remove WAIT FOR TIMESTAMP. As Robert Haas has pointed out, 
it seems a lot like pg_sleep_until(). Besides, it doesn't necessarily 
need to be connected to transaction start, which makes it different from 
WAIT FOR LSN - so I wouldn't mix them together.



I had another look at the code:


===
In WaitShmemSize() you might want to use functions that check for 
overflow - add_size()/mul_size(). They're used in similar situations, 
for example in BTreeShmemSize().



===
This is how WaitUtility() is called - note that time_val will always
be > 0:

+    if (time_val <= 0)
+        time_val = 1;
+...
+    res = WaitUtility(lsn, (int)(time_val * 1000), dest);

(time_val * 1000) is passed to WaitUtility() as the delay argument. And 
inside WaitUtility() we have this:


+if (delay > 0)
+    latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
+else
+    latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Since we always pass a delay value greater than 0, we'll never get to 
the "else" clause here and we'll never be ready to wait for LSN forever. 
Perhaps due to that, the current test outputs this after a simple WAIT 
FOR LSN command:

psql::1: NOTICE:  LSN is not reached.


===
Speaking of tests,

When I tried to test BEGIN TRANSACTION WAIT FOR LSN, I got a segfault:
LOG:  statement: BEGIN TRANSACTION WAIT FOR LSN '0/3002808'
LOG:  server process (PID 10385) was terminated by signal 11: 
Segmentation fault

DETAIL:  Failed process was running: COMMIT

Could you add some more tests to the patch when this is fixed? With WAIT 
as part of BEGIN statement + with things such as WAIT FOR ALL ... / WAIT 
FOR ANY ... / WAIT FOR LSN ... UNTIL TIMESTAMP ...



===
In WaitSetLatch() we should probably check backend for NULL before 
calling SetLatch(&backend->procLatch)


We might also need to check wait statement for NULL in these two cases:
+   case T_TransactionStmt:
+   {...
+       result = transformWaitForStmt(pstate, (WaitStmt *) stmt->wait);

case TRANS_STMT_START:
{...
+   WaitStmt   *waitstmt = (WaitStmt *) stmt->wait;
+   res = WaitMain(waitstmt, dest);


===
After we added the "wait" attribute to TransactionStmt struct, do we 
perhaps need to add something to _copyTransactionStmt() / 
_equalTransactionStmt()?


--
Anna Akenteva
Postgres Professional:
The Russian Postgres Company
http://www.postgrespro.com




Allow continuations in "pg_hba.conf" files

2020-03-25 Thread Fabien COELHO


Hello,

After writing an unreadable and stupidly long line for ldap 
authentication in a "pg_hba.conf" file, I figured out that allowing 
continuations looked simple enough and should just be done.


Patch attached.

--
Fabien.diff --git a/doc/src/sgml/client-auth.sgml b/doc/src/sgml/client-auth.sgml
index 5f1eec78fb..cf3b432c34 100644
--- a/doc/src/sgml/client-auth.sgml
+++ b/doc/src/sgml/client-auth.sgml
@@ -77,7 +77,7 @@
The general format of the pg_hba.conf file is
a set of records, one per line. Blank lines are ignored, as is any
text after the # comment character.
-   Records cannot be continued across lines.
+   Records can be backslash-continued across lines.
A record is made
up of a number of fields which are separated by spaces and/or tabs.
Fields can contain white space if the field value is double-quoted.
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index da5189a4fa..95c2cfc8e4 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -486,8 +486,43 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 		char	   *lineptr;
 		List	   *current_line = NIL;
 		char	   *err_msg = NULL;
+		char	   *cur = rawline;
+		int			len = sizeof(rawline);
+		int			continuations = 0;
 
-		if (!fgets(rawline, sizeof(rawline), file))
+		/* read input and handle simplistic backslash continuations */
+		while ((cur = fgets(cur, len, file)) != NULL)
+		{
+			int		curlen = strlen(cur);
+			char   *curend = cur + curlen - 1;
+
+			if (curlen == len - 1)
+			{
+/* Line too long! */
+ereport(elevel,
+		(errcode(ERRCODE_CONFIG_FILE_ERROR),
+		 errmsg("authentication file line too long"),
+		 errcontext("line %d of configuration file \"%s\"",
+	line_number + continuations, filename)));
+err_msg = "authentication file line too long";
+			}
+
+			/* Strip trailing linebreak from rawline */
+			while (curend >= cur && (*curend == '\n' || *curend == '\r'))
+*curend-- = '\0';
+
+			/* empty or not a continuation, we are done */
+			if (curend < cur || *curend != '\\')
+break;
+
+			/* else we have a continuation, just blank it and loop */
+			continuations++;
+			*curend++ = ' ';
+			len -= (curend - cur);
+			cur = curend;
+		}
+
+		if (cur == NULL)
 		{
 			int			save_errno = errno;
 
@@ -501,21 +536,6 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 			   filename, strerror(save_errno));
 			rawline[0] = '\0';
 		}
-		if (strlen(rawline) == MAX_LINE - 1)
-		{
-			/* Line too long! */
-			ereport(elevel,
-	(errcode(ERRCODE_CONFIG_FILE_ERROR),
-	 errmsg("authentication file line too long"),
-	 errcontext("line %d of configuration file \"%s\"",
-line_number, filename)));
-			err_msg = "authentication file line too long";
-		}
-
-		/* Strip trailing linebreak from rawline */
-		lineptr = rawline + strlen(rawline) - 1;
-		while (lineptr >= rawline && (*lineptr == '\n' || *lineptr == '\r'))
-			*lineptr-- = '\0';
 
 		/* Parse fields */
 		lineptr = rawline;
@@ -543,7 +563,7 @@ tokenize_file(const char *filename, FILE *file, List **tok_lines, int elevel)
 			*tok_lines = lappend(*tok_lines, tok_line);
 		}
 
-		line_number++;
+		line_number += continuations + 1;
 	}
 
 	MemoryContextSwitchTo(oldcxt);


Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-25 Thread Sergei Kornilov
Hello

> WAL usage patch [1] increments this version to 1_4 instead of 1_8.
> I *guess* that's because previously this version was maintained
> independently from pg_stat_statements' version. For example,
> pg_stat_statements 1.4 seems to have used PGSS_V1_3.

As far as I remember, this was my proposed change in review a year ago.
I think that having a clear analogy between the extension version and the 
function name would be more clear than sequential numbering of PGSS_V with 
different extension versions.
For pgss 1.4 it was fine to use PGSS_V1_3, because there were no changes in 
pg_stat_statements_internal.
pg_stat_statements 1.3 will call pg_stat_statements_1_3
pg_stat_statements 1.4 - 1.7 will still call pg_stat_statements_1_3. In my 
opinion, this is the correct naming, since we did not need a new function.
but pg_stat_statements 1.8 will call pg_stat_statements_1_4. Isn't that confusing?

Well, no strong opinion.

regards, Sergei




Re: Columns correlation and adaptive query optimization

2020-03-25 Thread Rafia Sabih
Hello,

This sounded like an interesting addition to PostgreSQL. I gave some
time to reviewing it today; here are a few comments.

On Wed, 25 Mar 2020 at 14:28, Konstantin Knizhnik
 wrote:
>
>
>
> On 25.03.2020 16:00, David Steele wrote:
> > On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:
> >>
> >>
> >> On 24.03.2020 20:12, David Steele wrote:
> >>> On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
>  New version of patch implicitly adding multicolumn statistic in
>  auto_explain extension and using it in optimizer for more precise
>  estimation of join selectivity.
>  This patch fixes race condition while adding statistics and
>  restricts generated statistic name to fit in 64 bytes (NameData).
> >>>
> >>> This patch no longer applies:
> >>> https://commitfest.postgresql.org/27/2386/
> >>>
> >>> The CF entry has been updated to Waiting on Author.
> >>
> >> Rebased patch is attached.
> >
> > The patch applies now but there are error on Windows and Linux:
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
> >
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979
> >
> > Regards,
>
> Sorry, yet another patch is attached.
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>

+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+

This doesn't look like the right place for it; you might want to
declare it with the other functions at the start of the file.

Also, there is no description of any of the functions here; it
wouldn't hurt to have some more comments there.

A few of more questions that cross my mind at this point,

- have you tried measuring the extra cost we have to pay for these
additional statistics, and comparing it with the benefit it gives in
terms of accuracy?
- I would also be interested in understanding if there are cases when
adding this extra step doesn't help, and whether you have excluded them
already or if some of them are easily identifiable at this stage...?
- is there any limit on the number of columns for which this will
work, or should there be any such limit...?

-- 
Regards,
Rafia Sabih




Re: [PATCH] Check operator when creating unique index on partition table

2020-03-25 Thread Tom Lane
Guancheng Luo  writes:
> I found that things could go wrong in some cases, when the unique index and 
> the partition key use different opclass.

I agree that this is an oversight, but it seems like your solution is
overcomplicated and probably still too forgiving.  Should we not just
insist that the index opfamily match the partition key opfamily?
It looks to me like that would reduce the code change to about like
this:

-   if (key->partattrs[i] == indexInfo->ii_IndexAttrNumbers[j])
+   if (key->partattrs[i] == indexInfo->ii_IndexAttrNumbers[j] &&
+   key->partopfamily[i] == 
get_opclass_family(classObjectId[j]))

which is a lot more straightforward and provides a lot more certainty
that the index will act as the partition constraint demands.

This would reject, for example, a hash index associated with a btree-based
partition constraint, but I'm not sure we're losing anything much thereby.
(I do not think your patch is correct for the case where the opfamilies
belong to different AMs, anyway.)

I'm not really on board with adding a whole new test script for this,
either.

regards, tom lane




Re: potential stuck lock in SaveSlotToPath()

2020-03-25 Thread Robert Haas
On Wed, Mar 25, 2020 at 6:13 AM Peter Eisentraut
 wrote:
> Any concerns about applying and backpatching the patch I posted?

Not from me.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backup manifests

2020-03-25 Thread Robert Haas
On Wed, Mar 25, 2020 at 9:31 AM Stephen Frost  wrote:
> I get that the default for manifest is 'no', but I don't really see how
> that means that the lack of saying anything about checksums should mean
> "give me crc32c checksums".  It's really rather common that if we don't
> specify something, it means don't do that thing- like an 'ORDER BY'
> clause.

That's a fair argument, but I think the other relevant principle is
that we try to give people useful defaults for things. I think that
checksums are a sufficiently useful thing that having the default be
not to do it doesn't make sense. I had the impression that you and
David were in agreement on that point, actually.

> There were also comments made up-thread about how it might not be great
> for larger (eg: 1GB files, like we tend to have quite a few of...), and
> something about it being a 40 year old algorithm..

Well, the 512MB "limit" for CRC-32C means only that for certain very
specific types of errors, detection is not guaranteed above that file
size. So if you have a single flipped bit, for example, and the file
size is greater than 512MB, then CRC-32C has only a 99.9999999767169%
chance of detecting the error, whereas if the file size is less than
512MB, it is 100% certain, because of the design of the algorithm. But
nine nines is plenty, and neither SHA nor our page-level checksums
provide guaranteed error detection properties anyway.

I'm not sure why the fact that it's a 40-year-old algorithm is
relevant. There are many 40-year-old algorithms that are very good.
Generally, if we discover that we're using bad 40-year-old algorithms,
like Knuth's tape sorting stuff, we eventually figure out how to
replace them with something else that's better. But there's no reason
to retire an algorithm simply because it's old. I have not heard
anyone say, for example, that we should stop using CRC-32C for XLOG
checksums. We continue to use it for that purpose because it (1) is
highly likely to detect any errors and (2) is very fast. Those are the
same reasons why I think it's a good fit for this case.

My guess is that if this patch is adopted as currently proposed, we
will eventually need to replace the cryptographic hash functions due
to the march of time. As I'm sure you realize, the problem with hash
functions that are designed to foil an adversary is that adversaries
keep getting smarter. So, eventually someone will probably figure out
how to do something nefarious with SHA-512. Some other technique that
nobody's cracked yet will need to be adopted, and then people will
begin trying to crack that, and the whole thing will repeat. But I
suspect that we can keep using the same non-cryptographic hash
function essentially forever. It does not matter that people know how
the algorithm works because it makes no pretensions of trying to foil
an opponent. It is just trying to mix up the bits in such a way that a
change to the file is likely to cause a change in the checksum. The
bit-mixing properties of the algorithm do not degrade with the passage
of time.

> I'm sure that sha256 takes a lot more time than crc32c, I'm certainly
> not trying to dispute that, but what's relevent here is how much it
> impacts the time required to run the overall backup (including sync'ing
> it to disk, and possibly network transmission time..  if we're just
> comparing the time to run it through memory then, sure, the sha256
> computation time might end up being quite a bit of the time, but that's
> not really that interesting of a test..).

I think that 
http://postgr.es/m/38e29a1c-0d20-fc73-badd-ca05f7f07...@pgmasters.net
is one of the more interesting emails on this topic.  My conclusion
from that email, and the ones that led up to it, was that there is a
40-50% overhead from doing a SHA checksum, but in pgbackrest, users
don't see it because backups are compressed. Because the compression
uses so much CPU time, the additional overhead from the SHA checksum
is only a few percent more. But I don't think that it would be smart
to slow down uncompressed backups by 40-50%. That's going to cause a
problem for somebody, almost for sure.

> Maybe:
>
> "Using a SHA hash function provides a cryptographically secure digest
> of each file for users who wish to verify that the backup has not been
> tampered with, while the CRC32C algorithm provides a checksum which is
> much faster to calculate and good at catching errors due to accidental
> changes but is not resistant to targeted modifications.  Note that, to
> be useful against an adversary who has access to the backup, the backup
> manifest would need to be stored securely elsewhere or otherwise
> verified to have not been modified since the backup was taken."
>
> This at least talks about things in a positive direction (SHA hash
> functions do this, CRC32C does that) rather than in a negative tone.

Cool. I like it.

> Anyway, my point here was really just that *pg_validatebackup* is about
> validating backups taken with 

Re: Some improvements to numeric sqrt() and ln()

2020-03-25 Thread Tom Lane
Dean Rasheed  writes:
> Here is an updated patch with the following updates based on your comments:

This resolves all my concerns.  I've marked it RFC in the CF app.

regards, tom lane




Re: Option to dump foreign data in pg_dump

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-24, Alvaro Herrera wrote:

> Hmm, but travis is failing on the cfbot, and I can't see why ...

My only guess, without further output, is that getopt_long is not liking
the [ "--include-foreign-data", "xxx" ] style of arguments in the Perl
array of the command to run (which we don't use anywhere else in
the files I looked), so I changed it to [ "--include-foreign-data=xxx" ].
If this was not the problem, we'll need more info, which the buildfarm
will give us.

And pushed.  Thanks, Luis, and thanks to all reviewers.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Missing errcode() in ereport

2020-03-25 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> I think this caused anole to say:
>> "reloptions.c", line 1362: error #2042: operand types are incompatible
>> ("void" and "int")
>> errdetail_internal("%s", _(optenum->detailmsg)) : 0));

> Yeah, I was just looking at that :-(

> We could revert the change to have these functions return void,
> or we could run around and change the places with this usage
> pattern to use "(void) 0" instead of just "0".  The latter would
> be somewhat painful if only minority compilers warn, though.
> Also, I don't think that having to change ereport usage was part
> of the agreed-to plan here ... so I'm leaning to the former.

Done that way.

regards, tom lane




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Justin Pryzby
On Wed, Mar 25, 2020 at 12:46:52PM -0300, Alvaro Herrera wrote:
> On 2020-Mar-25, Justin Pryzby wrote:
> 
> > Maybe in the docs you can write this with thousands separators: 10,000,000
> > 
> > It looks like the GUC uses scale factor max=1e10, but the relopt is still
> > max=100, which means it's less possible to disable for a single rel.
> 
> I have paid no attention to this thread, but how does it make sense to
> have a scale factor to be higher than 100?  Surely you mean the
> threshold value that should be set to ten million, not the scale factor?

We went over this here:
https://www.postgresql.org/message-id/20200317195616.GZ26184%40telsasoft.com
...
https://www.postgresql.org/message-id/20200317213426.GB26184%40telsasoft.com

The scale factor is relative to the reltuples estimate, which comes from vacuum
(which presently doesn't run against insert-only tables, and what we're trying
to schedule), or analyze, which probably runs adequately, but might be disabled
or run too infrequently.

Since we talked about how scale_factor can be used to effectively disable this
new feature, I thought that scale=100 was too small and suggested 1e10 (same as
max for vacuum_cleanup_index_scale_factor since 4d54543ef).  That should allow
handling the case that analyze is disabled, or its threshold is high, or it
hasn't run yet, or it's running but hasn't finished, or analyze is triggered as
same time as vacuum.

A table with 1e7 tuples (threshold) into which one inserts 1e9 tuples would hit
scale_factor=100 threshold, which means scale_factor failed to "disable" the
feature, as claimed.  If anything, I think it may need to be larger...
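
For concreteness, the insert trigger being discussed follows the usual autovacuum shape, threshold + scale_factor * reltuples. A quick sketch (illustrative Python, not the actual autovacuum.c code; the threshold default of 10,000,000 is the one proposed in this thread) shows the arithmetic:

```python
def insert_vacuum_trigger(reltuples, threshold=10_000_000, scale_factor=100.0):
    # Inserts required before an insert-triggered autovacuum fires,
    # mirroring the threshold + scale_factor * reltuples pattern.
    return threshold + scale_factor * reltuples

# reltuples estimate stuck at 1e7 (vacuum/analyze never ran on the
# insert-only table): scale_factor=100 still fires after ~1.01e9 inserts.
print(insert_vacuum_trigger(10_000_000))      # 1010000000.0
# scale_factor=1e10 pushes the trigger far out of reach:
print(insert_vacuum_trigger(10_000_000, scale_factor=1e10))
```

So with scale_factor capped at 100, a billion-row insert batch still crosses the trigger, which is why a much larger maximum is needed to effectively disable the feature.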

-- 
Justin




Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-25, Justin Pryzby wrote:

> Maybe in the docs you can write this with thousands separators: 10,000,000
> 
> It looks like the GUC uses scale factor max=1e10, but the relopt is still
> max=100, which means it's less possible to disable for a single rel.

I have paid no attention to this thread, but how does it make sense to
have a scale factor to be higher than 100?  Surely you mean the
threshold value that should be set to ten million, not the scale factor?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Erase the distinctClause if the result is unique by definition

2020-03-25 Thread Andy Fan
I have started the new thread [1] to continue talking about this.
Mr. cfbot is happy now.

[1]
https://www.postgresql.org/message-id/flat/CAKU4AWrwZMAL%3DuaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw%40mail.gmail.com

Thanks

>


Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Justin Pryzby
On Mon, Mar 23, 2020 at 02:27:29PM +0100, Laurenz Albe wrote:
> Here is version 10 of the patch, which uses a scale factor of 0.2.

Thanks

> Any table that has received more inserts since it was
> last vacuumed (and that is not vacuumed for another
> reason) will be autovacuumed.

Since this vacuum doesn't trigger any special behavior (freeze), you can remove
the parenthesized part: "(and that is not vacuumed for another reason)".

Maybe in the docs you can write this with thousands separators: 10,000,000

It looks like the GUC uses scale factor max=1e10, but the relopt is still
max=100, which means it's less possible to disable for a single rel.

> +++ b/src/backend/access/common/reloptions.c
> @@ -398,6 +407,15 @@ static relopt_real realRelOpts[] =
>   },
>   -1, 0.0, 100.0
>   },
> + {
> + {
> + "autovacuum_vacuum_insert_scale_factor",
> + "Number of tuple inserts prior to vacuum as a fraction 
> of reltuples",
> + RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
> + ShareUpdateExclusiveLock
> + },
> + -1, 0.0, 100.0
> + },
>   {
>   {
>   "autovacuum_analyze_scale_factor",

> +++ b/src/backend/utils/misc/guc.c
> @@ -3549,6 +3558,17 @@ static struct config_real ConfigureNamesReal[] =
>   0.2, 0.0, 100.0,
>   NULL, NULL, NULL
>   },
> +
> + {
> + {"autovacuum_vacuum_insert_scale_factor", PGC_SIGHUP, 
> AUTOVACUUM,
> + gettext_noop("Number of tuple inserts prior to vacuum 
> as a fraction of reltuples."),
> + NULL
> + },
> + &autovacuum_vac_ins_scale,
> + 0.2, 0.0, 1e10,
> + NULL, NULL, NULL
> + },
> +
>   {
>   {"autovacuum_analyze_scale_factor", PGC_SIGHUP, AUTOVACUUM,
>   gettext_noop("Number of tuple inserts, updates, or 
> deletes prior to analyze as a fraction of reltuples."),




Re: [Patch] Invalid permission check in pg_stats for functional indexes

2020-03-25 Thread David Steele

Hi Pierre,

On 12/26/19 6:46 PM, Tom Lane wrote:

Awhile back I wrote:

Actually ... maybe we don't need to change the view definition at all,
but instead just make has_column_privilege() do something different
for indexes than it does for other relation types.  It's dubious that
applying that function to an index yields anything meaningful today,
so we could redefine what it returns without (probably) breaking
anything.  That would at least give us an option to back-patch, too,
though the end result might be complex enough that we don't care to
risk it.


In hopes of resurrecting this thread, here's a draft patch that does
it like that (and also fixes row_security_active(), as otherwise this
probably creates a security hole in pg_stats).


Do you know when you will have a chance to look at this patch?

Tom made a suggestion up-thread about where the regression tests could go.

Regards,
--
-David
da...@pgmasters.net




Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-03-25 Thread Ibrar Ahmed
On Tue, Mar 24, 2020 at 10:06 PM Ibrar Ahmed  wrote:

>
>
> On Fri, Mar 13, 2020 at 6:58 AM Justin Pryzby 
> wrote:
>
>>
>> Thanks for picking up this patch.  There's a minor typo:
>>
>> +* readable outside of this sessoin. Therefore
>> doing IO here isn't
>>
>> => session
>>
>> --
>> Justin
>>
>
> Thanks, please see the updated and rebased patch. (master
> 17a28b03645e27d73bf69a95d7569b61e58f06eb)
>
> --
> Ibrar Ahmed
>

Andres, while fixing the one FIXME in the patch

"visibilitymap_pin(relation, BufferGetBlockNumber(buffer), &vmbuffer);

/*
 * FIXME: setting recptr here is a dirty dirty hack, to prevent
 * visibilitymap_set() from WAL logging.
 *
"
I am not able to see any scenario where recptr is not set before reaching
that statement. Can you clarify why you think recptr will not be set at
that statement?


-- 
Ibrar Ahmed


Re: Berserk Autovacuum (let's save next Mandrill)

2020-03-25 Thread Laurenz Albe
On Mon, 2020-03-23 at 14:27 +0100, Laurenz Albe wrote:
> Here is version 10 of the patch, which uses a scale factor of 0.2.

This patch should be what everybody can live with.

It would be good if we can get at least that committed before feature freeze.

Yours,
Laurenz Albe





Re: weird hash plan cost, starting with pg10

2020-03-25 Thread Konstantin Knizhnik



On 25.03.2020 13:36, Richard Guo wrote:


On Tue, Mar 24, 2020 at 3:36 PM Richard Guo wrote:


On Tue, Mar 24, 2020 at 11:05 AM Thomas Munro wrote:


I think there might be a case like this:

* ExecRescanHashJoin() decides it can't reuse the hash table for a
rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
* the HashState node still has a reference to the pfree'd
HashJoinTable!
* HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation
case so
it doesn't bother to build a new hash table
* EXPLAIN examines the HashState's pointer to a freed
HashJoinTable struct


Yes, debugging with gdb shows this is exactly what happens.


According to the scenario above, here is a recipe that reproduces this
issue.

-- recipe start
create table a(i int, j int);
create table b(i int, j int);
create table c(i int, j int);

insert into a select 3,3;
insert into a select 2,2;
insert into a select 1,1;

insert into b select 3,3;

insert into c select 0,0;

analyze a;
analyze b;
analyze c;

set enable_nestloop to off;
set enable_mergejoin to off;

explain analyze
select exists(select * from b join c on a.i > c.i and a.i = b.i and 
b.j = c.j) from a;

-- recipe end

I tried this recipe on different PostgreSQL versions, starting from
current master and going backwards. I was able to reproduce this issue
on all versions above 8.4. In 8.4 version, we do not output information
on hash buckets/batches. But manual inspection with gdb shows in 8.4 we
also have the dangling pointer for HashState->hashtable. I didn't check
versions below 8.4 though.

Thanks
Richard


I can propose the following patch for the problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ff2f45c..470073b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2685,7 +2685,8 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 	 */
 	if (hashstate->hashtable)
		ExecHashGetInstrumentation(&hinstrument, hashstate->hashtable);
-
+	else
+		hinstrument = hashstate->saved_instrument;
 	/*
 	 * Merge results from workers.  In the parallel-oblivious case, the
 	 * results from all participants should be identical, except where
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c901a80..db082be 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -1336,6 +1336,13 @@ ExecReScanHashJoin(HashJoinState *node)
 		else
 		{
 			/* must destroy and rebuild hash table */
+			HashState* state = (HashState*)innerPlanState(node);
+			Assert(state->hashtable == node->hj_HashTable);
+			if (state->ps.instrument)
+			{
+				ExecHashGetInstrumentation(&state->saved_instrument, state->hashtable);
+			}
+			state->hashtable = NULL;
 			ExecHashTableDestroy(node->hj_HashTable);
 			node->hj_HashTable = NULL;
 			node->hj_JoinState = HJ_BUILD_HASHTABLE;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50..94afc18 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2302,6 +2302,8 @@ typedef struct HashState
 	SharedHashInfo *shared_info;	/* one entry per worker */
 	HashInstrumentation *hinstrument;	/* this worker's entry */
 
+	HashInstrumentation saved_instrument; /* saved instrumentation for a dropped hash table */
+
 	/* Parallel hash state. */
 	struct ParallelHashJoinState *parallel_state;
 } HashState;


Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-03-25 Thread Andy Fan
On Mon, Mar 23, 2020 at 6:21 PM Andy Fan  wrote:

> Greetings.
>
> This thread is a follow-up thread for [1], where I submit a patch for
> erasing the
> distinct node if we have known the data is unique for sure.  But since the
> implementation has changed a lot from the beginning and they are not very
> related, so I start this new thread to discuss the new strategy to save
> the time
> of reviewers.
>
> As I said above, my original intention is just used to erase the distinct
> clause,
> then Tom Lane suggested function query_is_distinct_for,  I found the
> uniqueness
> can be used for costing,  remove_useless_join, reduce_unique_semijoins.
> David suggested to maintain the uniqueness from bottom to top, like join
> & subqueries, group-by, distinct, union and so on(we call it as
> UniqueKey).
> Ideally the uniqueness will not be lost in any case. This current
> implementation
> follows the David's suggestion and also thanks Ashutosh who reminded me
> the cost should be ok while I had concerns of this at the beginning.
>
> A new field named uniquekeys was added in RelOptInfo struct, which is a
> list of UniqueKey struct.
>
> typedef struct UniqueKey
> {
> NodeTag type;
> List   *exprs;
> List   *positions;
> bool   grantee;
> } UniqueKey;
>
> exprs is a list of exprs which is unique if we don't care about the null
> values on
> current RelOptInfo.
>
> positions is a list of the sequence no. of the exprs in the current
> RelOptInfo,
> which is used for SubQuery. like
>
> create table t1 (a int primary key, b int);
> create table t2 (a int primary key, b int);
> select .. from t1,  (select b from t2 group by t2) t2 ..;
>
> The UniqueKey for the subquery will be Var(varno=1, varattno=2),  but for
> the
> top query, the UniqueKey of t2 should be Var(varno=2, varattno=1),  the 1
> here
> need to be calculated by UniqueKey->positions.
>
> grantee field is introduced mainly for remove_useless_join &
> reduce_unique_semijoins.
> Take the above case for example:
>
> -- b is nullable.  so select b from t2 still can result in duplicated
> rows.
> create unique index t2_uk_b on t2(b);
>
> -- the left join still can be removed since t2.b is a unique index and the
> nullable
> doesn't matter here.
> select t1.* from t1 left join t2 on (t1.b = t2.b);
>
> so t2.b will still be an UniqueKey for t2, just that the grantee = false.
>
> A branch of functions like populate_xxx_unqiuekeys for manipulating
> uniquekeys
> for a lot of cases, xxx may be baserel, joinrel, partitioned table,
> unionrel, groupbyrel,
> distinctrel and so on.  Partitioned tables have some non-obvious troubles
> because
> users can create indexes on the childrel directly and differently.  You can
> check
> the comments of the code for details.
>
>
> When maintaining the uniquekeys of joinrel,  we have a rule that if both
> rels have
> UniqueKeys,  then any combination from the 2 sides is a uniquekey of the
> joinrel.
> I used two algorithms to keep the length of the UniqueKeys short.  One is
> we only
> add useful UniqueKey to the RelOptInfo.uniquekeys.  If the expr isn't
> shown in
> rel->reltargets->exprs, it will not be used for others, so we can ignore
> it safely.
> The other one is: if column set A is a uniquekey already,  any superset of
> A
> need not be added as a UniqueKey.
>
>
> The overall cost of maintaining uniquekeys should be ok.  If you check
> the code,
> you may find there are many 2 or 3 levels foreach, but most of them are
> started with
> unique index, and I used UnqiueKeyContext and SubqueryUnqiueKeyContext in
> joinrel
> and subquery case to avoid too many loops.
>
> Now I have used the UniqueKey to erase the unnecessary distinct/group by,
> and also changed
> the rel_is_distinct_for to use UniqueKeys.  So we can handle more cases.
>
> create table m1 (a int primary key,  b int, c int);
> create table m2 (a int primary key,  b int, c int);
> create table m3 (a int primary key,  b int, c int);
>
> With the current patch, we can get:
> task3=# explain select  t1.a from m3 t1 left join  (select m1.a from m1,
> m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
>QUERY PLAN
> -
>  Seq Scan on m3 t1  (cost=0.00..32.60 rows=2260 width=4)
>
>
> Before the patch, we will get:
> postgres=# explain select  t1.a from m3 t1 left join  (select m1.a from
> m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
> postgres-# ;
>   QUERY PLAN
>
> ---
>  Hash Left Join  (cost=0.39..41.47 rows=2260 width=4)
>Hash Cond: (t1.a = m1.a)
>->  Seq Scan on m3 t1  (cost=0.00..32.60 rows=2260 width=4)
>->  Hash  (cost=0.37..0.37 rows=1 width=4)
>  ->  Limit  (cost=0.15..0.36 rows=1 width=4)
>->  Nested Loop  (cost=0.15..470.41 rows=2260 width=4)
>  ->  Seq 

Re: [PATCH] Implement INSERT SET syntax

2020-03-25 Thread Tom Lane
Peter Eisentraut  writes:
> On 2020-03-24 18:57, Tom Lane wrote:
>> No doubt that's all fixable, but the realization that some cases of
>> this syntax are*not*  just syntactic sugar for standards-compliant
>> syntax is giving me pause.  Do we really want to get out front of
>> the SQL committee on extending INSERT in an incompatible way?

> What is the additional functionality that we are considering adding here?
> The thread started out proposing a more convenient syntax, but it seems 
> to go deeper now and perhaps not everyone is following.

AIUI, the proposal is to allow INSERT commands to be written
using an UPDATE-like syntax, for example

INSERT INTO table SET col1 = value1, col2 = value2, ... [ FROM ... ]

where everything after FROM is the same as it is in SELECT.  My initial
belief was that this was strictly equivalent to what you could do with
a target-column-names list in standard INSERT, viz

INSERT INTO table (col1, col2, ...) VALUES (value1, value2, ...);
or
INSERT INTO table (col1, col2, ...) SELECT value1, value2, ... FROM ...

but it's arguably more legible/convenient because the column names
are written next to their values.

However, that rewriting falls down for certain multiassignment cases
where you have a row source that can't be decomposed, such as my
example

INSERT INTO table SET (col1, col2) = (SELECT value1, value2 FROM ...),
... [ FROM ... ]

So, just as we found for UPDATE, multiassignment syntax is strictly
stronger than plain column-by-column assignment.

There are some secondary issues about which variants of this syntax
will allow a column value to be written as DEFAULT, and perhaps
about whether set-returning functions work.  But the major point
right now is about whether it's possible to rewrite to standard
syntax.

regards, tom lane




Re: [Proposal] Global temporary tables

2020-03-25 Thread tushar

On 3/17/20 9:15 AM, 曾文旌(义从) wrote:

Please check global_temporary_table_v20-pg13.patch


There is a typo in the error message

postgres=# create global temp table test(a int ) 
with(on_commit_delete_rows=true) on commit delete rows;
ERROR:  can not defeine global temp table with on commit and with clause 
at same time

postgres=#

--
regards, tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: Applying for GSOC 2020 | Need review of proposal

2020-03-25 Thread Stephen Frost
Greetings,

* Gyati Mittal (gmitt...@horizon.csueastbay.edu) wrote:
> I am trying to submit a draft proposal for this task:
> 
> Develop Performance Farm Benchmarks and Website (2020)

Great!  I'd suggest you reach out to the mentor listed on the wiki page
for that project to chat about what a good proposal would look like.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-25 Thread Julien Rouhaud
On Wed, Mar 25, 2020 at 10:09:37PM +0900, Fujii Masao wrote:
> 
> On 2020/03/21 4:30, Julien Rouhaud wrote:
> > On Fri, Mar 20, 2020 at 05:09:05PM +0300, Sergei Kornilov wrote:
> > > Hello
> > > 
> > > Yet another is missed in docs: total_time
> > 
> > Oh good catch!  I rechecked many time the field, and totally missed that the
> > documentation is referring to the view, which has an additional column, and 
> > not
> > the function.  Attached v9 fixes that.
> 
> Thanks for the patch! Here are the review comments from me.
> 
> - PGSS_V1_3
> + PGSS_V1_3,
> + PGSS_V1_8
> 
> WAL usage patch [1] increments this version to 1_4 instead of 1_8.
> I *guess* that's because previously this version was maintained
> independently from pg_stat_statements' version. For example,
> pg_stat_statements 1.4 seems to have used PGSS_V1_3.

Oh right.  It seems that I changed that many versions ago, I'm not sure why.
I'm personally fine with any, but I think this was previously raised and
consensus was to keep distinct counters.  Unless you prefer to keep it this
way, I'll send an updated version (with other possible modifications depending
on the rest of the mail) using PGSS_V1_4.

> + /*
> +  * We can't process the query if no query_text is provided, as 
> pgss_store
> +  * needs it.  We also ignore query without queryid, as it would be 
> treated
> +  * as a utility statement, which may not be the case.
> +  */
> 
> Could you tell me why the planning stats are not tracked when executing
> utility statements? In some utility statements like REFRESH MATERIALIZED VIEW,
> the planner would work.

I explained that in [1].  The problem is that the underlying statement doesn't
get the proper stmt_location and stmt_len, so you eventually end up with two
different entries.  I suggested fixing transformTopLevelStmt() to handle the
various DDL that can contain optimisable statements, but everyone preferred to
postpone that for a future enhancement.

> +static BufferUsage
> +compute_buffer_counters(BufferUsage start, BufferUsage stop)
> +{
> + BufferUsage result;
> 
> BufferUsageAccumDiff() has very similar logic. Isn't it better to expose
> and use that function rather than creating new similar function?

Oh, I thought this wouldn't be acceptable.  That's indeed better so I'll do
that instead.
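
The start/stop pattern that BufferUsageAccumDiff generalizes can be sketched like this (illustrative Python with a trimmed-down BufferUsage; the real struct carries many more counters):

```python
from dataclasses import dataclass

@dataclass
class BufferUsage:
    shared_blks_hit: int = 0
    shared_blks_read: int = 0
    temp_blks_written: int = 0

def buffer_usage_diff(stop, start):
    # Per-field delta between two snapshots of the global counters,
    # analogous to BufferUsageAccumDiff accumulating stop - start.
    return BufferUsage(
        shared_blks_hit=stop.shared_blks_hit - start.shared_blks_hit,
        shared_blks_read=stop.shared_blks_read - start.shared_blks_read,
        temp_blks_written=stop.temp_blks_written - start.temp_blks_written,
    )

start = BufferUsage(shared_blks_hit=10, shared_blks_read=4)
# ... the planner runs and the global counters advance ...
stop = BufferUsage(shared_blks_hit=25, shared_blks_read=9, temp_blks_written=2)
print(buffer_usage_diff(stop, start))
# BufferUsage(shared_blks_hit=15, shared_blks_read=5, temp_blks_written=2)
```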

>   values[i++] = Int64GetDatumFast(tmp.rows);
>   values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
>   values[i++] = Int64GetDatumFast(tmp.shared_blks_read);
> 
> Previously (without the patch) pg_stat_statements_1_3() reported
> the buffer usage counters updated only in execution phase. But,
> in the patched version, pg_stat_statements_1_3() reports the total
> of buffer usage counters updated in both planning and execution
> phases. Is this OK? I'm not sure how seriously we should ensure
> the backward compatibility for pg_stat_statements

That's indeed a behavior change, although the new behavior is probably better
as user want to know how much resource a query is consuming overall.  We could
distinguish all buffers with a plan/exec version, but it seems quite overkill.

> +/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */
> 
> ISTM it's good timing to have also pg_stat_statements--1.8.sql since
> the definition of pg_stat_statements() is changed. Thought?

I thought that since CreateExtension() was modified to be able to find its way
automatically, we shouldn't provide base version anymore, to minimize
maintenance burden and also avoid possible bug/discrepancy.  The only drawback
is that it'll do multiple CREATE or DROP/CREATE of the function usually once
per database, which doesn't seem like a big problem.

[1] 
https://www.postgresql.org/message-id/caobau_y-y+vohtzgdoudk6-9v72-zxdwccxo_kx0p4ddbee...@mail.gmail.com




Re: potential stuck lock in SaveSlotToPath()

2020-03-25 Thread Alvaro Herrera
On 2020-Mar-25, Peter Eisentraut wrote:

> On 2020-03-20 16:38, Robert Haas wrote:
> > On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
> >  wrote:
> > > On 2020-03-19 16:38, Robert Haas wrote:
> > > > Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
> > > > right for the early-exit cases either.
> > > 
> > > There appear to be appropriate pgstat_report_wait_end() calls.  What are
> > > you seeing?
> > 
> > Oh, you're right. I think I got confused because the rename() and
> > close() don't have that, but those don't have a wait event set either.
> > Sorry for the noise.
> 
> Any concerns about applying and backpatching the patch I posted?

It looks a straight bug fix to me, I agree it should be back-patched.

> The talk about reorganizing this code doesn't seem very concrete at the
> moment and would probably not be backpatch material anyway.

Agreed on both counts.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: where EXEC_BACKEND is defined

2020-03-25 Thread Peter Eisentraut

On 2020-03-20 17:36, Tom Lane wrote:

One small point is that I believe the existing code has the effect of
"#define EXEC_BACKEND 1" not just "#define EXEC_BACKEND".  I don't
think this matters to anyplace in the core code, but it's conceivable
that somebody has extension code written to assume the former.
Nonetheless, I'm +1 for re-standardizing on the latter, because it's
a couple less keystrokes when inserting a manual definition ;-).
Might be worth mentioning in the commit log entry though.


Ok, done that way.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: backup manifests

2020-03-25 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Mar 23, 2020 at 6:42 PM Stephen Frost  wrote:
> > While I get the desire to have a default here that includes checksums,
> > the way the command is structured, it strikes me as odd that the lack of
> > MANIFEST_CHECKSUMS in the command actually results in checksums being
> > included.
> 
> I don't think that's quite accurate, because the default for the
> MANIFEST option is 'no', so the actual default if you say nothing
> about manifests at all, you don't get one. However, it is true that if
> you ask for a manifest and you don't specify the type of checksums,
> you get CRC-32C. We could change it so that if you ask for a manifest
> you must also specify the type of checksum, but I don't see any
> advantage in that approach. Nothing prevents the client from
> specifying the value if it cares, but making the default "I don't
> care, you pick" seems pretty sensible. It could be really helpful if,
> for example, we decide to remove the initial default in a future
> release for some reason. Then the client just keeps working without
> needing to change anything, but anyone who explicitly specified the
> old default gets an error.

I get that the default for manifest is 'no', but I don't really see how
that means that the lack of saying anything about checksums should mean
"give me crc32c checksums".  It's really rather common that if we don't
specify something, it means don't do that thing- like an 'ORDER BY'
clause.  We aren't designing SQL here, so I'm not going to get terribly
upset if you push forward with "if you don't want checksums, you have to
explicitly say MANIFEST_CHECKSUMS no", but I don't agree with the
reasoning here.

> > > + 
> > > +  --manifest-checksums= > > class="parameter">algorithm
> > > +  
> > > +   
> > > +Specifies the algorithm that should be used to checksum each file
> > > +for purposes of the backup manifest. Currently, the available
> > > +algorithms are NONE, 
> > > CRC32C,
> > > +SHA224, SHA256,
> > > +SHA384, and SHA512.
> > > +The default is CRC32C.
> > > +   
> >
> > As I recall, there was an invitation to argue about the defaults at one
> > point, and so I'm going to say here that I would advocate for a
> > different default than 'crc32c'.  Specifically, I would think sha256 or
> > 512 would be better.  I don't recall seeing a debate about this that
> > conclusively found crc32c to be better, but I'm happy to go back and
> > reread anything someone wants to point me at.
> 
> It was discussed upthread. Andrew Dunstan argued that there was no
> reason to use a cryptographic checksum here and that we shouldn't do
> so gratuitously. Suraj Kharage found that CRC-32C has very little
> performance impact but that any of the SHA functions slow down backups
> considerably. David Steele pointed out that you'd need a better
> checksum if you wanted to use it for purposes such as delta restore,
> with which I agree, but that's not the design center for this feature.
> I concluded that different people wanted different things, so that we
> ought to make this configurable, but that CRC-32C is a good default.
> It has approximately a 99.999767169% chance of detecting a random
> error, which is pretty good, and it doesn't drastically slow down
> backups, which is also good.

There were also comments made up-thread about how it might not be great
for larger (eg: 1GB files, like we tend to have quite a few of...), and
something about it being a 40-year-old algorithm..  Having re-read some
of the discussion, I'm actually more inclined to say we should be using
sha256 instead of crc32c.

> > It also seems a bit silly to me that using the defaults means having to
> > deal with two different algorithms- crc32c and sha256.  Considering how
> > fast these algorithms are, compared to everything else involved in a
> > backup (particularly one that's likely going across a network...), I
> > wonder if we should say "may slightly increase" above.
> 
> Actually, Suraj's results upthread show that it's a pretty big hit.

So, I went back and re-read part of the thread and looked at the
(seemingly, only one..?) post regarding timing and didn't understand
what, exactly, was being timed there, because I didn't see the actual
commands/script/whatever that was used to get those results included.

I'm sure that sha256 takes a lot more time than crc32c, I'm certainly
not trying to dispute that, but what's relevant here is how much it
impacts the time required to run the overall backup (including sync'ing
it to disk, and possibly network transmission time..  if we're just
comparing the time to run it through memory then, sure, the sha256
computation time might end up being quite a bit of the time, but that's
not really that interesting of a test..).
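
As a rough illustration of the raw hashing cost only (deliberately ignoring the sync and network time that matters for the end-to-end comparison), here is a Python micro-benchmark; note it uses zlib's plain CRC-32 as a stand-in, since CRC-32C is not in the standard library:

```python
import hashlib
import time
import zlib

# 64 MiB in memory; real backups checksum up to 1 GB segment files.
data = bytes(range(256)) * (64 * 1024 * 1024 // 256)

t0 = time.perf_counter()
crc = zlib.crc32(data)                 # plain CRC-32, stand-in for CRC-32C
crc_s = time.perf_counter() - t0

t0 = time.perf_counter()
digest = hashlib.sha256(data).hexdigest()
sha_s = time.perf_counter() - t0

print(f"crc32:  {crc_s:.3f}s")
print(f"sha256: {sha_s:.3f}s")
```

On typical hardware the SHA-256 pass is several times slower per byte, which matches the upthread numbers; whether that dominates a backup depends on how much of the wall time is I/O and network rather than hashing.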

> > > +   
> > > +On the other hand, CRC32C is not a 
> > > cryptographic
> > > +hash function, so it is 

Re: Columns correlation and adaptive query optimization

2020-03-25 Thread Konstantin Knizhnik



On 25.03.2020 16:00, David Steele wrote:

On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:



On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
New version of patch implicitly adding multicolumn statistic in 
auto_explain extension and using it in optimizer for more precise 
estimation of join selectivity.
This patch fixes race condition while adding statistics and 
restricts generated statistic name to fit in 64 bytes (NameData).


This patch no longer applies: 
https://commitfest.postgresql.org/27/2386/


The CF entry has been updated to Waiting on Author.


Rebased patch is attached.


The patch applies now but there are errors on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481 


https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,


Sorry, yet another patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..8714d94 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
#include <limits.h>
+#include 
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+			 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+			 "Zero disables implicit creation of multicolumn statistic.",
+			 &auto_explain_add_statistics_threshold,
+			 0.0,
+			 0.0,
+			 INT_MAX,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+			 "Do not create statistic but just record in WAL suggested create statistics statement.",
+			 NULL,
+			 &auto_explain_suggest_only,
+			 false,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +395,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+	ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+Var *var = (Var *) node;
+if (cols == NULL || var->varno == varno)
+{
+	varno = var->varno;
+	if (var->varattno > 0 &&
+		!bms_is_member(var->varattno, colmap) 

Re: Planning counters in pg_stat_statements (using pgss_store)

2020-03-25 Thread Fujii Masao




On 2020/03/21 4:30, Julien Rouhaud wrote:

On Fri, Mar 20, 2020 at 05:09:05PM +0300, Sergei Kornilov wrote:

Hello

Yet another is missed in docs: total_time


Oh good catch!  I rechecked the field many times, and totally missed that the
documentation is referring to the view, which has an additional column, and not
the function.  Attached v9 fixes that.


Thanks for the patch! Here are the review comments from me.

-   PGSS_V1_3
+   PGSS_V1_3,
+   PGSS_V1_8

WAL usage patch [1] increments this version to 1_4 instead of 1_8.
I *guess* that's because previously this version was maintained
independently from pg_stat_statements' version. For example,
pg_stat_statements 1.4 seems to have used PGSS_V1_3.

+   /*
+    * We can't process the query if no query_text is provided, as pgss_store
+    * needs it.  We also ignore query without queryid, as it would be treated
+    * as a utility statement, which may not be the case.
+    */

Could you tell me why the planning stats are not tracked when executing
utility statements? In some utility statements like REFRESH MATERIALIZED VIEW,
the planner would work.

+static BufferUsage
+compute_buffer_counters(BufferUsage start, BufferUsage stop)
+{
+   BufferUsage result;

BufferUsageAccumDiff() has very similar logic. Isn't it better to expose
and use that function rather than creating a new, similar function?

values[i++] = Int64GetDatumFast(tmp.rows);
values[i++] = Int64GetDatumFast(tmp.shared_blks_hit);
values[i++] = Int64GetDatumFast(tmp.shared_blks_read);

Previously (without the patch) pg_stat_statements_1_3() reported
the buffer usage counters updated only in execution phase. But,
in the patched version, pg_stat_statements_1_3() reports the total
of buffer usage counters updated in both planning and execution
phases. Is this OK? I'm not sure how seriously we should preserve
backward compatibility for pg_stat_statements.

+/* contrib/pg_stat_statements/pg_stat_statements--1.7--1.8.sql */

ISTM it's good timing to also have pg_stat_statements--1.8.sql since
the definition of pg_stat_statements() is changed. Thoughts?

[1]
https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4zlxws_cza3...@mail.gmail.com

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: Columns correlation and adaptive query optimization

2020-03-25 Thread David Steele

On 3/25/20 6:57 AM, Konstantin Knizhnik wrote:



On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
New version of patch implicitly adding multicolumn statistic in 
auto_explain extension and using it in optimizer for more precise 
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts 
generated statistic name to fit in 64 bytes (NameData).


This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.


Rebased patch is attached.


The patch applies now but there are errors on Windows and Linux:
https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85481
https://travis-ci.org/postgresql-cfbot/postgresql/builds/666729979

Regards,
--
-David
da...@pgmasters.net




Re: [Proposal] Global temporary tables

2020-03-25 Thread Pavel Stehule
st 25. 3. 2020 v 13:53 odesílatel Prabhat Sahu <
prabhat.s...@enterprisedb.com> napsal:

> Hi All,
>
> Please check the behavior of GTT  having column with "SERIAL" datatype and
> column with default value as "SEQUENCE" as below:
>
>
> *Session1:*postgres=# create sequence gtt_c3_seq;
> CREATE SEQUENCE
> postgres=# create global temporary table gtt(c1 int, c2 serial, c3 int
> default nextval('gtt_c3_seq') not null) on commit preserve rows;
> CREATE TABLE
>
> -- Structure of column c2 and c3 are similar:
> postgres=# \d+ gtt
>                               Table "public.gtt"
>  Column |  Type   | Collation | Nullable |             Default             | Storage | Stats target | Description
> --------+---------+-----------+----------+---------------------------------+---------+--------------+-------------
>  c1     | integer |           |          |                                 | plain   |              |
>  c2     | integer |           | not null | nextval('gtt_c2_seq'::regclass) | plain   |              |
>  c3     | integer |           | not null | nextval('gtt_c3_seq'::regclass) | plain   |              |
> Access method: heap
> Options: on_commit_delete_rows=false
>
> postgres=# insert into gtt select generate_series(1,3);
> INSERT 0 3
> postgres=# select * from gtt;
>  c1 | c2 | c3
> ----+----+----
>   1 |  1 |  1
>   2 |  2 |  2
>   3 |  3 |  3
> (3 rows)
>
>
> *Session2:*postgres=# insert into gtt select generate_series(1,3);
> INSERT 0 3
> postgres=# select * from gtt;
>  c1 | c2 | c3
> ----+----+----
>   1 |  1 |  4
>   2 |  2 |  5
>   3 |  3 |  6
> (3 rows)
>
> Kindly let me know, is this behavior expected?
>

It is an interesting side effect - theoretically it is not important, because
a sequence only guarantees unique values - so the values themselves are not important.

You created a classic shared sequence, so this behavior is correct and expected.

Pavel


> --
>
> With Regards,
> Prabhat Kumar Sahu
> EnterpriseDB: http://www.enterprisedb.com
>


Re: [Proposal] Global temporary tables

2020-03-25 Thread Prabhat Sahu
Hi All,

Please check the behavior of GTT  having column with "SERIAL" datatype and
column with default value as "SEQUENCE" as below:


*Session1:*postgres=# create sequence gtt_c3_seq;
CREATE SEQUENCE
postgres=# create global temporary table gtt(c1 int, c2 serial, c3 int
default nextval('gtt_c3_seq') not null) on commit preserve rows;
CREATE TABLE

-- Structure of column c2 and c3 are similar:
postgres=# \d+ gtt
                              Table "public.gtt"
 Column |  Type   | Collation | Nullable |             Default             | Storage | Stats target | Description
--------+---------+-----------+----------+---------------------------------+---------+--------------+-------------
 c1     | integer |           |          |                                 | plain   |              |
 c2     | integer |           | not null | nextval('gtt_c2_seq'::regclass) | plain   |              |
 c3     | integer |           | not null | nextval('gtt_c3_seq'::regclass) | plain   |              |
Access method: heap
Options: on_commit_delete_rows=false

postgres=# insert into gtt select generate_series(1,3);
INSERT 0 3
postgres=# select * from gtt;
 c1 | c2 | c3
----+----+----
  1 |  1 |  1
  2 |  2 |  2
  3 |  3 |  3
(3 rows)


*Session2:*postgres=# insert into gtt select generate_series(1,3);
INSERT 0 3
postgres=# select * from gtt;
 c1 | c2 | c3
----+----+----
  1 |  1 |  4
  2 |  2 |  5
  3 |  3 |  6
(3 rows)

Kindly let me know, is this behavior expected?

-- 

With Regards,
Prabhat Kumar Sahu
EnterpriseDB: http://www.enterprisedb.com


Re: Make mesage at end-of-recovery less scary.

2020-03-25 Thread Peter Eisentraut

On 2020-03-24 02:52, Kyotaro Horiguchi wrote:

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

The 0-length record is not an "invalid" state during recovery, so we
can add a message for that state, like "record length is 0 at %X/%X". I
think if any other state is found there, it implies something is wrong.

LSN is redundantly shown but I'm not sure if it is better to remove it
from either of the two lines.

| LOG:  reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash 
recovery
| DETAIL:  record length is 0 at 0/3000850


I'm not up to date on all these details, but my high-level idea would be 
some kind of hint associated with the existing error messages, like:


HINT:  This is to be expected if this is the end of the WAL.  Otherwise, 
it could indicate corruption.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Implement INSERT SET syntax

2020-03-25 Thread Peter Eisentraut

On 2020-03-24 18:57, Tom Lane wrote:

No doubt that's all fixable, but the realization that some cases of
this syntax are*not*  just syntactic sugar for standards-compliant
syntax is giving me pause.  Do we really want to get out front of
the SQL committee on extending INSERT in an incompatible way?


What is the additional functionality that we are considering adding here?

The thread started out proposing a more convenient syntax, but it seems 
to go deeper now and perhaps not everyone is following.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_upgrade fails with non-standard ACL

2020-03-25 Thread Justin Pryzby
On Wed, Mar 25, 2020 at 11:12:05AM +0900, Artur Zakirov wrote:
> Hello David,
> 
> On 3/25/2020 2:08 AM, David Steele wrote:
> > On 12/17/19 3:10 AM, Arthur Zakirov wrote:
> > > 
> > > I attached new version of the patch. It still uses
> > > pg_identify_object(), I'm not sure about other ways to build
> > > identities yet.
> > 
> > This patch applies and builds but fails regression tests on Linux and
> > Windows:
> > https://travis-ci.org/postgresql-cfbot/postgresql/builds/666134656
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.85292
> 
> Regression tests fail because cfbot applies
> "test_rename_catalog_objects.patch". Regression tests pass without it.
> 
> This patch shouldn't be applied by cfbot. I'm not sure how to do this. But
> maybe it is possible to use different extension name for the test patch, not
> ".patch".

Yes, see Tom's message here:
https://www.postgresql.org/message-id/14255.1536781029%40sss.pgh.pa.us

I don't know what extensions cfbot looks for; in that case I didn't have a file
extension at all.

-- 
Justin




Re: error context for vacuum to include block number

2020-03-25 Thread Justin Pryzby
On Wed, Mar 25, 2020 at 09:27:44PM +0900, Masahiko Sawada wrote:
> On Wed, 25 Mar 2020 at 20:24, Amit Kapila  wrote:
> >
> > On Wed, Mar 25, 2020 at 3:42 PM Justin Pryzby  wrote:
> > >
> > > Attached patch addressing these.
> > >
> >
> > Thanks, you forgot to remove the below declaration which I have
> > removed in attached.
> >
> > @@ -724,20 +758,20 @@ lazy_scan_heap(Relation onerel, VacuumParams
> > *params, LVRelStats *vacrelstats,
> >   PROGRESS_VACUUM_MAX_DEAD_TUPLES
> >   };
> >   int64 initprog_val[3];
> > + ErrorContextCallback errcallback;
> >
> > Apart from this, I have run pgindent and now I think it is in good
> > shape.  Do you have any other comments?  Sawada-San, can you also
> > check the attached patch and let me know if you have any additional
> > comments.
> >
> 
> Thank you for updating the patch! I have a question about the following code:
> 
> +/* Update error traceback information */
> +olderrcbarg = *vacrelstats;
> +update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_TRUNCATE,
> +  vacrelstats->nonempty_pages, NULL, false);
> +
>  /*
>   * Scan backwards from the end to verify that the end pages actually
>   * contain no tuples.  This is *necessary*, not optional, because
>   * other backends could have added tuples to these pages whilst we
>   * were vacuuming.
>   */
>  new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
> +vacrelstats->blkno = new_rel_pages;
> 
>  if (new_rel_pages >= old_rel_pages)
>  {
>  /* can't do anything after all */
>  UnlockRelation(onerel, AccessExclusiveLock);
>  return;
>  }
> 
>  /*
>   * Okay to truncate.
>   */
>  RelationTruncate(onerel, new_rel_pages);
> 
> +/* Revert back to the old phase information for error traceback */
> +update_vacuum_error_cbarg(vacrelstats,
> +  olderrcbarg.phase,
> +  olderrcbarg.blkno,
> +  olderrcbarg.indname,
> +  true);
> 
> vacrelstats->nonempty_pages is the last non-empty block while
> new_rel_pages, the result of count_nondeletable_pages(), is the number
> of blocks that we can truncate to in this attempt. Therefore
> vacrelstats->nonempty_pages <= new_rel_pages. This means that we set a
> lower block number to arguments and then set a higher block number
> after count_nondeletable_pages, and then revert it back to
> VACUUM_ERRCB_PHASE_SCAN_HEAP phase and the number of blocks of
> relation before truncation, after RelationTruncate(). It can be
> repeated until no more truncating can be done. Why do we need to
> revert back to the scan heap phase? If we can use
> vacrelstats->nonempty_pages in the error context message as the
> remaining blocks after truncation I think we can update callback
> arguments once at the beginning of lazy_truncate_heap() and don't
> revert to the previous phase, and pop the error context after exiting.

Perhaps.  We need to "revert back" for the vacuum phases, which can be called
multiple times, but we don't need to do that here.

In the future, if we decided to add something for final cleanup phase (say),
it's fine (and maybe better) to exit truncate_heap() without resetting the
argument, and we'd immediately set it to CLEANUP.

I think the same thing applies to lazy_cleanup_index, too.  It can be called
from a parallel worker, but we never "go back" to a heap scan.

-- 
Justin




Re: adding partitioned tables to publications

2020-03-25 Thread Peter Eisentraut

On 2020-03-23 06:02, Amit Langote wrote:

Okay, added some tests.

Attached updated patches.


I have committed the worker.c refactoring patch.

"Add subscription support to replicate into partitioned tables" still 
has insufficient test coverage.  Your changes in relation.c are not exercised 
at all because the partitioned table branch in apply_handle_update() is 
never taken.  This is critical and tricky code, so I would look for 
significant testing.


The code looks okay to me.  I would remove this code

+   memset(entry->attrmap->attnums, -1,
+  entry->attrmap->maplen * sizeof(AttrNumber));

because the entries are explicitly filled right after anyway, and 
filling the bytes with -1 has an unclear effect.  There is also 
seemingly some fishiness in this code around whether attribute numbers 
are zero- or one-based.  Perhaps this could be documented briefly. 
Maybe I'm misunderstanding something.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-03-25 Thread Masahiko Sawada
On Wed, 25 Mar 2020 at 20:24, Amit Kapila  wrote:
>
> On Wed, Mar 25, 2020 at 3:42 PM Justin Pryzby  wrote:
> >
> > Attached patch addressing these.
> >
>
> Thanks, you forgot to remove the below declaration which I have
> removed in attached.
>
> @@ -724,20 +758,20 @@ lazy_scan_heap(Relation onerel, VacuumParams
> *params, LVRelStats *vacrelstats,
>   PROGRESS_VACUUM_MAX_DEAD_TUPLES
>   };
>   int64 initprog_val[3];
> + ErrorContextCallback errcallback;
>
> Apart from this, I have run pgindent and now I think it is in good
> shape.  Do you have any other comments?  Sawada-San, can you also
> check the attached patch and let me know if you have any additional
> comments.
>

Thank you for updating the patch! I have a question about the following code:

+/* Update error traceback information */
+olderrcbarg = *vacrelstats;
+update_vacuum_error_cbarg(vacrelstats, VACUUM_ERRCB_PHASE_TRUNCATE,
+  vacrelstats->nonempty_pages, NULL, false);
+
 /*
  * Scan backwards from the end to verify that the end pages actually
  * contain no tuples.  This is *necessary*, not optional, because
  * other backends could have added tuples to these pages whilst we
  * were vacuuming.
  */
 new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
+vacrelstats->blkno = new_rel_pages;

 if (new_rel_pages >= old_rel_pages)
 {
 /* can't do anything after all */
 UnlockRelation(onerel, AccessExclusiveLock);
 return;
 }

 /*
  * Okay to truncate.
  */
 RelationTruncate(onerel, new_rel_pages);

+/* Revert back to the old phase information for error traceback */
+update_vacuum_error_cbarg(vacrelstats,
+  olderrcbarg.phase,
+  olderrcbarg.blkno,
+  olderrcbarg.indname,
+  true);

vacrelstats->nonempty_pages is the last non-empty block while
new_rel_pages, the result of count_nondeletable_pages(), is the number
of blocks that we can truncate to in this attempt. Therefore
vacrelstats->nonempty_pages <= new_rel_pages. This means that we set a
lower block number to arguments and then set a higher block number
after count_nondeletable_pages, and then revert it back to
VACUUM_ERRCB_PHASE_SCAN_HEAP phase and the number of blocks of
relation before truncation, after RelationTruncate(). It can be
repeated until no more truncating can be done. Why do we need to
revert back to the scan heap phase? If we can use
vacrelstats->nonempty_pages in the error context message as the
remaining blocks after truncation I think we can update callback
arguments once at the beginning of lazy_truncate_heap() and don't
revert to the previous phase, and pop the error context after exiting.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: improve transparency of bitmap-only heap scans

2020-03-25 Thread James Coleman
On Tue, Mar 24, 2020 at 11:02 PM Amit Kapila  wrote:
>
> On Wed, Mar 25, 2020 at 12:44 AM Tom Lane  wrote:
> >
> > I took a quick look through this patch.  While I see nothing to complain
> > about implementation-wise, I'm a bit befuddled as to why we need this
> > reporting when there is no comparable data provided for regular index-only
> > scans.  Over there, you just get "Heap Fetches: n", and the existing
> > counts for bitmap scans seem to cover the same territory.
> >
>
> Isn't deducing similar information ("Skipped Heap Fetches: n") a bit
> easier there than it is here?

While I'm not the patch author so can't speak to the original thought
process, I do think it makes sense to show it. I could imagine a world
in which index only scans were printed in explain as purely an
optimization to index scans that shows exactly this (how many pages we
were able to skip fetching). That approach actually can make things
more helpful than the approach current in explain for index only
scans, since the optimization isn't all or nothing (i.e., it can still
fetch heap pages), so it's interesting to see exactly how much it
gained you.

James




Re: error context for vacuum to include block number

2020-03-25 Thread Justin Pryzby
On Wed, Mar 25, 2020 at 04:54:41PM +0530, Amit Kapila wrote:
> On Wed, Mar 25, 2020 at 3:42 PM Justin Pryzby  wrote:
> >
> > Attached patch addressing these.
> >
> 
> Thanks, you forgot to remove the below declaration which I have
> removed in attached.

Yes, I saw.

> Apart from this, I have run pgindent and now I think it is in good
> shape.  Do you have any other comments?  Sawada-San, can you also

I did just notice/remember while testing truncate that autovacuum does this:

src/backend/postmaster/autovacuum.c:
errcontext("automatic vacuum of table \"%s.%s.%s\"",

And that appears to be interacting correctly.  For example, if you add an
elog(ERROR) and run UPDATE/DELETE, and wait for autovacuum_naptime, then it shows
both contexts.

-- 
Justin




Re: error context for vacuum to include block number

2020-03-25 Thread Amit Kapila
On Wed, Mar 25, 2020 at 3:42 PM Justin Pryzby  wrote:
>
> Attached patch addressing these.
>

Thanks, you forgot to remove the below declaration which I have
removed in attached.

@@ -724,20 +758,20 @@ lazy_scan_heap(Relation onerel, VacuumParams
*params, LVRelStats *vacrelstats,
  PROGRESS_VACUUM_MAX_DEAD_TUPLES
  };
  int64 initprog_val[3];
+ ErrorContextCallback errcallback;

Apart from this, I have run pgindent and now I think it is in good
shape.  Do you have any other comments?  Sawada-San, can you also
check the attached patch and let me know if you have any additional
comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v34-0001-Introduce-vacuum-errcontext-to-display-additiona.patch
Description: Binary data


Generated column and inheritance: strange default error

2020-03-25 Thread Rémi Cura
Hello dear Hackers,
I encountered a strange issue with generated columns and inheritance.
The core of the problem is that when inheriting from a table, you can
re-declare the same column (provided it has the same type).

But when doing this and introducing a generated column, an error is raised
`[0A000] ERROR: cannot use column reference in DEFAULT expression`

This is too bad, as generated columns are an awesome new feature.
I don't think this is the expected behavior, and at the very least the
error is misleading.

Here is a super short synthetic demo, along with 2 workarounds
```SQL
--- testing an issue with inheritance and generated column
DROP SCHEMA IF EXISTS test_inheritance_and_generated_column CASCADE ;
CREATE SCHEMA IF NOT EXISTS test_inheritance_and_generated_column ;


--- ERROR (set a generated column in an inherited table)
DROP TABLE IF EXISTS test_inheritance_and_generated_column.parent_table
CASCADE;
CREATE TABLE test_inheritance_and_generated_column.parent_table(
   science_grade float DEFAULT 0.7
, avg_grade float
);
-- THIS RAISES THE ERROR: [0A000] ERROR: cannot use column reference in DEFAULT expression
DROP TABLE IF EXISTS test_inheritance_and_generated_column.child_basic;
CREATE TABLE test_inheritance_and_generated_column.child_basic(
   literature_grade float DEFAULT 0.3,
   -- avg_grade float is inherited
   avg_grade float GENERATED ALWAYS AS (
(science_grade+literature_grade)/2.0 ) STORED
)INHERITS (test_inheritance_and_generated_column.parent_table);
--



--- WORKS (removing the column from parent)
DROP TABLE IF EXISTS test_inheritance_and_generated_column.parent_table
CASCADE;
CREATE TABLE test_inheritance_and_generated_column.parent_table(
   science_grade float DEFAULT 0.7
-- , avg_grade float
);
--
DROP TABLE IF EXISTS test_inheritance_and_generated_column.child_basic;
CREATE TABLE test_inheritance_and_generated_column.child_basic(
   literature_grade float DEFAULT 0.3,
   avg_grade float GENERATED ALWAYS AS (
(science_grade+literature_grade)/2.0 ) STORED
)INHERITS (test_inheritance_and_generated_column.parent_table);
--




-- THIS WORKS (dropping inheritance, dropping the column, re-creating the column as generated, re-adding inheritance)
DROP TABLE IF EXISTS test_inheritance_and_generated_column.parent_table
CASCADE;
CREATE TABLE test_inheritance_and_generated_column.parent_table(
   science_grade float DEFAULT 0.7
, avg_grade float
);
--
DROP TABLE IF EXISTS test_inheritance_and_generated_column.child_basic;
CREATE TABLE test_inheritance_and_generated_column.child_basic(
   literature_grade float DEFAULT 0.3,
   avg_grade float
)INHERITS (test_inheritance_and_generated_column.parent_table);
ALTER TABLE test_inheritance_and_generated_column.child_basic NO INHERIT
test_inheritance_and_generated_column.parent_table;
ALTER TABLE test_inheritance_and_generated_column.child_basic DROP COLUMN
avg_grade;
ALTER TABLE test_inheritance_and_generated_column.child_basic
ADD COLUMN avg_grade float GENERATED ALWAYS AS (
(science_grade+literature_grade)/2.0 ) STORED;
ALTER TABLE test_inheritance_and_generated_column.child_basic INHERIT
test_inheritance_and_generated_column.parent_table;

```
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg18.04+1) on x86_64-pc-linux-gnu,
compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

Cheers,
Remi-C


Re: Columns correlation and adaptive query optimization

2020-03-25 Thread Konstantin Knizhnik



On 24.03.2020 20:12, David Steele wrote:

On 12/24/19 3:15 AM, Konstantin Knizhnik wrote:
New version of patch implicitly adding multicolumn statistic in 
auto_explain extension and using it in optimizer for more precise 
estimation of join selectivity.
This patch fixes race condition while adding statistics and restricts 
generated statistic name to fit in 64 bytes (NameData).


This patch no longer applies: https://commitfest.postgresql.org/27/2386/

The CF entry has been updated to Waiting on Author.

Regards,


Rebased patch is attached.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..9e15913 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -13,12 +13,32 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include 
 
+#include "access/hash.h"
 #include "access/parallel.h"
+#include "access/relscan.h"
+#include "access/skey.h"
+#include "access/table.h"
+#include "access/tableam.h"
+#include "catalog/pg_statistic_ext.h"
 #include "commands/explain.h"
+#include "commands/defrem.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/planmain.h"
+#include "parser/parsetree.h"
+#include "storage/ipc.h"
+#include "statistics/statistics.h"
+#include "utils/fmgroids.h"
 #include "utils/guc.h"
+#include "utils/syscache.h"
+#include "utils/lsyscache.h"
+#include "utils/ruleutils.h"
 
 PG_MODULE_MAGIC;
 
@@ -33,7 +53,9 @@ static bool auto_explain_log_settings = false;
 static int	auto_explain_log_format = EXPLAIN_FORMAT_TEXT;
 static int	auto_explain_log_level = LOG;
 static bool auto_explain_log_nested_statements = false;
+static bool auto_explain_suggest_only = false;
 static double auto_explain_sample_rate = 1;
+static double auto_explain_add_statistics_threshold = 0.0;
 
 static const struct config_enum_entry format_options[] = {
 	{"text", EXPLAIN_FORMAT_TEXT, false},
@@ -218,6 +240,30 @@ _PG_init(void)
 			 NULL,
 			 NULL);
 
+	DefineCustomRealVariable("auto_explain.add_statistics_threshold",
+			 "Sets the threshold for actual/estimated #rows ratio triggering creation of multicolumn statistic for the related columns.",
+			 "Zero disables implicit creation of multicolumn statistic.",
+			 &auto_explain_add_statistics_threshold,
+			 0.0,
+			 0.0,
+			 INT_MAX,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
+	DefineCustomBoolVariable("auto_explain.suggest_only",
+			 "Do not create statistic but just record in WAL suggested create statistics statement.",
+			 NULL,
+			 &auto_explain_suggest_only,
+			 false,
+			 PGC_SUSET,
+			 0,
+			 NULL,
+			 NULL,
+			 NULL);
+
 	EmitWarningsOnPlaceholders("auto_explain");
 
 	/* Install hooks. */
@@ -349,6 +395,256 @@ explain_ExecutorFinish(QueryDesc *queryDesc)
 	PG_END_TRY();
 }
 
+static void
+AddMultiColumnStatisticsForNode(PlanState *planstate, ExplainState *es);
+
+static void
+AddMultiColumnStatisticsForSubPlans(List *plans, ExplainState *es)
+{
+	ListCell   *lst;
+
+	foreach(lst, plans)
+	{
+		SubPlanState *sps = (SubPlanState *) lfirst(lst);
+
+		AddMultiColumnStatisticsForNode(sps->planstate, es);
+	}
+}
+
+static void
+AddMultiColumnStatisticsForMemberNodes(PlanState **planstates, int nsubnodes,
+	ExplainState *es)
+{
+	int			j;
+
+	for (j = 0; j < nsubnodes; j++)
+		AddMultiColumnStatisticsForNode(planstates[j], es);
+}
+
+static int
+vars_list_comparator(const ListCell *a, const ListCell *b)
+{
+	char* va = strVal((Value *) linitial(((ColumnRef *)lfirst(a))->fields));
+	char* vb = strVal((Value *) linitial(((ColumnRef *)lfirst(b))->fields));
+	return strcmp(va, vb);
+}
+
+static void
+AddMultiColumnStatisticsForQual(void* qual, ExplainState *es)
+{
+	List *vars = NULL;
+	ListCell* lc;
+
+	foreach (lc, qual)
+	{
+		Node* node = (Node*)lfirst(lc);
+		if (IsA(node, RestrictInfo))
+			node = (Node*)((RestrictInfo*)node)->clause;
+		vars = list_concat(vars, pull_vars_of_level(node, 0));
+	}
+	while (vars != NULL)
+	{
+		ListCell *cell;
+		List *cols = NULL;
+		Index varno = 0;
+		Bitmapset* colmap = NULL;
+
+		foreach (cell, vars)
+		{
+			Node* node = (Node *) lfirst(cell);
+			if (IsA(node, Var))
+			{
+Var *var = (Var *) node;
+if (cols == NULL || var->varnoold == varno)
+{
+	varno = var->varnoold;
+	if (var->varoattno > 0 &&
+		!bms_is_member(var->varoattno, colmap) &&
+		varno >= 1 &&
+		varno <= list_length(es->rtable) &&
+		list_length(cols) < STATS_MAX_DIMENSIONS)
+	{
+		RangeTblEntry *rte = rt_fetch(varno, es->rtable);
+		if (rte->rtekind == RTE_RELATION)
+		{
+			ColumnRef  *col = makeNode(ColumnRef);
+			char *colname = get_rte_attribute_name(rte, 

Re: [Proposal] Global temporary tables

2020-03-25 Thread tushar

On 3/17/20 9:15 AM, 曾文旌(义从) wrote:

reindex GTT is already supported

Please check global_temporary_table_v20-pg13.patch


Please refer to this scenario -


postgres=# create global temp table co(n int) ;
CREATE TABLE

postgres=# create index fff on co(n);
CREATE INDEX

Case 1-
postgres=# reindex table  co;
REINDEX

Case -2
postgres=# reindex database postgres ;
WARNING:  global temp table "public.co" skip reindexed
REINDEX
postgres=#

Case 2 should work similarly to Case 1.

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





[PATCH] Check operator when creating unique index on partition table

2020-03-25 Thread Guancheng Luo
Hi,

I found that things could go wrong in some cases, when the unique index and the 
partition key use different opclasses.

For example:
```
CREATE TABLE ptop_test (a int, b int, c int) PARTITION BY LIST (a);
CREATE TABLE ptop_test_p1 PARTITION OF ptop_test FOR VALUES IN ('1');
CREATE TABLE ptop_test_m1 PARTITION OF ptop_test FOR VALUES IN ('-1');
CREATE UNIQUE INDEX ptop_test_unq_abs_a ON ptop_test (a abs_int_btree_ops);  -- 
this should fail
```
In this example, `abs_int_btree_ops` is an opclass whose equality operator is 
customized to consider ‘-1’ and ‘1’ as equal.
So the unique index should not be allowed to be created, since ‘-1’ and ‘1’ will be 
put in different partitions.

The attached patch should fix this problem.


0001-Check-operator-when-creating-UNIQUE-index-on-PARTITI.patch
Description: Binary data


Best Regards,
Guancheng Luo



Re: replay pause vs. standby promotion

2020-03-25 Thread Sergei Kornilov
Hi

>>  Could we add a few words in func.sgml to clarify the behavior? Especially 
>> for users from my example above. Something like:
>>
>>>  If a promotion is triggered while recovery is paused, the paused state 
>>> ends, replay of any WAL immediately available in the archive or in pg_wal 
>>> will be continued and then a promotion will be completed.
>
> This description is true if pause is requested by pg_wal_replay_pause(),
> but not if recovery target is reached and pause is requested by
> recovery_target_action=pause. In the latter case, even if there are WAL data
> available in pg_wal or archive, they are not replayed, i.e., the promotion
> completes immediately. Probably we should document those two cases
> explicitly to avoid the confusion about a promotion and recovery pause?

This is the description for pg_wal_replay_pause, but actually we suggest calling
pg_wal_replay_resume in recovery_target_action=pause... So, I agree, we need to
document both cases.

PS: I think we have inconsistent behavior here... Reading WAL during promotion
from local pg_wal AND calling restore_command, but ignoring the walreceiver, also
seems strange to my DBA hat...

regards, Sergei




Re: weird hash plan cost, starting with pg10

2020-03-25 Thread Richard Guo
On Tue, Mar 24, 2020 at 3:36 PM Richard Guo  wrote:

> On Tue, Mar 24, 2020 at 11:05 AM Thomas Munro 
> wrote:
>
>>
>> I think there might be a case like this:
>>
>> * ExecRescanHashJoin() decides it can't reuse the hash table for a
>> rescan, so it calls ExecHashTableDestroy(), clears HashJoinState's
>> hj_HashTable and sets hj_JoinState to HJ_BUILD_HASHTABLE
>> * the HashState node still has a reference to the pfree'd HashJoinTable!
>> * HJ_BUILD_HASHTABLE case reaches the empty-outer optimisation case so
>> it doesn't bother to build a new hash table
>> * EXPLAIN examines the HashState's pointer to a freed HashJoinTable struct
>>
>
> Yes, debugging with gdb shows this is exactly what happens.
>

According to the scenario above, here is a recipe that reproduces this
issue.

-- recipe start
create table a(i int, j int);
create table b(i int, j int);
create table c(i int, j int);

insert into a select 3,3;
insert into a select 2,2;
insert into a select 1,1;

insert into b select 3,3;

insert into c select 0,0;

analyze a;
analyze b;
analyze c;

set enable_nestloop to off;
set enable_mergejoin to off;

explain analyze
select exists(select * from b join c on a.i > c.i and a.i = b.i and b.j =
c.j) from a;
-- recipe end

I tried this recipe on different PostgreSQL versions, starting from
current master and going backwards. I was able to reproduce this issue
on all versions above 8.4. In version 8.4, we do not output information
on hash buckets/batches, but manual inspection with gdb shows that in
8.4 we also have the dangling pointer for HashState->hashtable. I didn't
check versions below 8.4 though.

Thanks
Richard


Re: truncating timestamps on arbitrary intervals

2020-03-25 Thread John Naylor
On Tue, Mar 24, 2020 at 9:34 PM Tels  wrote:
>
> Hello John,
>
> this looks like a nice feature. I'm wondering how it relates to the
> following use-case:
>
> When drawing charts, the user can select pre-defined widths on times
> (like "15 min", "1 hour").
>
> The data for these slots is always fitted to intervals that start at 0
> in the slot, e.g. if the user selects "15 min", the interval always
> starts at xx:00, xx:15, xx:30 or xx:45. This is to aid caching of the
> resulting data, and so that requesting the same chart at xx:00 and xx:01
> actually draws the same chart, and not some bar with only one minute of
> data at the end.

Hi Tels, thanks for your interest! Sounds like the exact use case I had in mind.

> It is of course easy for things like "1 hour", but for yearly I had to
> come up with things like:
>
>EXTRACT(YEAR FROM thetime) * 12 + EXTRACT(MONTH FROM thetime)
>
> and it gets even more involved for weeks, weekdays or odd things like
> fortnights.

To be clear, this particular case was already handled by the existing
date_trunc, but only for single units and a few other special cases. I
understand that if you have to write code to handle 15 minutes, you need
to use that structure for all cases.

Fortnight would be trivial:

date_trunc_interval('14 days'::interval, thetime [, optional start of
the fortnight])

...but weekdays would still need something extra.

> So would that mean one could replace this by:
>
>   date_trunc_interval('3600 seconds'::interval, thetime)
>
> and it would be easier, and (hopefully) faster?

Certainly easier and more flexible. It's hard to make guesses about
performance, but with your example above where you have two SQL
function calls plus some expression evaluation, I think a single
function would be faster in many cases.
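For illustration, here is a rough Python sketch of the truncation semantics under discussion (the function and parameter names here are mine, not the patch's): snap a timestamp down to the start of its bucket, where the bucket width is an arbitrary interval counted from an optional origin.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)  # illustrative default origin

def trunc_to_interval(ts, width, origin=EPOCH):
    """Truncate ts down to the nearest multiple of `width` counted from `origin`."""
    buckets = (ts - origin) // width        # floor division on timedeltas
    return origin + buckets * width

ts = datetime(2020, 3, 25, 12, 7, 30, tzinfo=timezone.utc)
print(trunc_to_interval(ts, timedelta(minutes=15)))  # 2020-03-25 12:00:00+00:00
print(trunc_to_interval(ts, timedelta(days=14)))     # start of the fortnight
```

Because the origin anchors the grid, requests at xx:00 and xx:01 land in the same bucket, which is exactly what makes the cached-chart use case above work.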

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: potential stuck lock in SaveSlotToPath()

2020-03-25 Thread Peter Eisentraut

On 2020-03-20 16:38, Robert Haas wrote:

On Fri, Mar 20, 2020 at 11:32 AM Peter Eisentraut
 wrote:

On 2020-03-19 16:38, Robert Haas wrote:

Incidentally, the wait-event handling in SaveSlotToPath() doesn't look
right for the early-exit cases either.


There appear to be appropriate pgstat_report_wait_end() calls.  What are
you seeing?


Oh, you're right. I think I got confused because the rename() and
close() don't have that, but those don't have a wait event set either.
Sorry for the noise.


Any concerns about applying and backpatching the patch I posted?

The talk about reorganizing this code doesn't seem very concrete at the 
moment and would probably not be backpatch material anyway.



--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: error context for vacuum to include block number

2020-03-25 Thread Justin Pryzby
On Wed, Mar 25, 2020 at 09:16:46AM +0530, Amit Kapila wrote:
> I think by mistake you have re-introduced this chunk of code.  We
> don't need this as we have done it in the caller.

Yes, sorry.

I used git-am and git-rebase too liberally while trying to make sure I didn't
lose your changes, and instead reintroduced them.

On Wed, Mar 25, 2020 at 02:16:35PM +0900, Masahiko Sawada wrote:
> > > Won't the latest patch by Justin will fix this as he has updated the
> > > block count after count_nondeletable_pages?  Apart from that, I feel
> >
> > The issue is if the error happens *during* count_nondeletable_pages().
> > We don't want it to say "truncating relation to 100 blocks".
> 
> Right.
> 
> > > the first call to update_vacuum_error_cbarg in lazy_truncate_heap
> > > should have input parameter as vacrelstats->nonempty_pages instead of
> > > new_rel_pages to indicate the remaining pages after truncation?
> >
> > Yea, I think that addresses the issue.

Attached patch addressing these.

-- 
Justin
>From b3e112c3d02982b4c050a43c53e47a868879c561 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 12 Dec 2019 20:54:37 -0600
Subject: [PATCH v33 1/3] Introduce vacuum errcontext to display additional
 information.

The additional information displayed will be block number for error
occurring while processing heap and index name for error occurring
while processing the index.

This will help us in diagnosing the problems that occur during a vacuum.
For ex. due to corruption (either caused by bad hardware or by some bug)
if we get some error while vacuuming, it can help us identify the block
in heap and or additional index information.

It sets up an error context callback to display additional information
with the error.  During different phases of vacuum (heap scan, heap
vacuum, index vacuum, index clean up, heap truncate), we update the error
context callback to display appropriate information.  We can extend it to
a bit more granular level like adding the phases for FSM operations or for
prefetching the blocks while truncating. However, I felt that it would require
adding many more error callback function calls and could make the code a bit
complex, so I left those for now.

Author: Justin Pryzby, with few changes by Amit Kapila
Reviewed-by: Alvaro Herrera, Amit Kapila, Andres Freund, Michael Paquier
and Sawada Masahiko
Discussion: https://www.postgresql.org/message-id/20191120210600.gc30...@telsasoft.com
---
 src/backend/access/heap/vacuumlazy.c | 250 ---
 src/tools/pgindent/typedefs.list |   1 +
 2 files changed, 226 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 03c43efc32..c68d5952c8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -144,6 +144,17 @@
  */
 #define ParallelVacuumIsActive(lps) PointerIsValid(lps)
 
+/* Phases of vacuum during which we report error context. */
+typedef enum
+{
+	VACUUM_ERRCB_PHASE_UNKNOWN,
+	VACUUM_ERRCB_PHASE_SCAN_HEAP,
+	VACUUM_ERRCB_PHASE_VACUUM_INDEX,
+	VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+	VACUUM_ERRCB_PHASE_INDEX_CLEANUP,
+	VACUUM_ERRCB_PHASE_TRUNCATE
+} VacErrCbPhase;
+
 /*
  * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
  * This is allocated in the DSM segment in parallel mode and in local memory
@@ -270,6 +281,8 @@ typedef struct LVParallelState
 
 typedef struct LVRelStats
 {
+	char	   *relnamespace;
+	char	   *relname;
 	/* useindex = true means two-pass strategy; false means one-pass */
 	bool		useindex;
 	/* Overall statistics about rel */
@@ -290,8 +303,12 @@ typedef struct LVRelStats
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
-} LVRelStats;
 
+	/* Used for error callback */
+	char	   *indname;
+	BlockNumber blkno;			/* used only for heap operations */
+	VacErrCbPhase	phase;
+} LVRelStats;
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -314,10 +331,10 @@ static void lazy_vacuum_all_indexes(Relation onerel, Relation *Irel,
 	LVRelStats *vacrelstats, LVParallelState *lps,
 	int nindexes);
 static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
-			  LVDeadTuples *dead_tuples, double reltuples);
+			  LVDeadTuples *dead_tuples, double reltuples, LVRelStats *vacrelstats);
 static void lazy_cleanup_index(Relation indrel,
 			   IndexBulkDeleteResult **stats,
-			   double reltuples, bool estimated_count);
+			   double reltuples, bool estimated_count, LVRelStats *vacrelstats);
 static int	lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 			 int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
 static bool should_attempt_truncation(VacuumParams *params,
@@ -337,13 +354,13 @@ static void lazy_parallel_vacuum_indexes(Relation *Irel, IndexBulkDeleteResult *
 		 int nindexes);
 static void parallel_vacuum_index(Relation 

Re: [proposal] de-TOAST'ing using a iterator

2020-03-25 Thread John Naylor
On Fri, Mar 13, 2020 at 10:19 PM Alvaro Herrera
 wrote:
>
> So let's move forward with v10 (submitted on Sept 10th).

In the attached v12, based on v10, I've made some progress to address
some of the remaining issues. There's still some work to be done, in
particular to think about how to hide the struct details better, as
mentioned by you and Tomas back in September, but I wanted to put this
much out there to keep things moving.

> Rather than cross-including header files, it seems better
> to expose some struct definitions after all and let the main iterator
> interface (detoast_iterate) be a "static inline" function in detoast.h.

The cross-include is gone, and detoast_iterate is now static inline.

> Looking at that version, I don't think the function protos that were put
> in toast_internals.h should be there at all; I think they should be in
> detoast.h so that they can be used.

Done.

> But I don't like the fact that
> detoast.h now has to include genam.h; that seems pretty bogus.  I think
> this can be fixed by moving the FetchDatumIteratorData struct definition
> (but not its typedef) to toast_internals.h.

I took a stab at this, but I ended up playing whack-a-mole with
compiler warnings. I'll have to step back and try again later.

> OTOH we've recently seen the TOAST interface (and header files) heavily
> reworked because of table-AM considerations, so probably this needs even
> more changes to avoid parts of it becoming heapam-dependant again.

Haven't thought about this.

> create_toast_buffer() doing two pallocs seems a waste.  It could be a
> single one,
> +   buf = (ToastBuffer *) palloc0(MAXALIGN(sizeof(ToastBuffer)) + size);
> +   buf->buf = buf + MAXALIGN(sizeof(ToastBuffer));
> (I'm not sure that the MAXALIGNs are strictly necessary there; I think
> we access the buf as four-byte aligned stuff somewhere in the toast
> innards, but maybe I'm wrong about that.)

I tried this briefly and got backend crashes, and didn't try to analyze further.

In addition, I brought in the memcpy() and comment changes in
c60e520f6e from common/pg_lzcompress.c to pglz_decompress_iterate(). I
also made a typo correction in the former, which could be extracted
into a separate patch if this one is not ready in time.

For this comment back in [1]:

> This arrangement where there are two ToastBuffers and they sometimes are
> the same is cute, but I think we need a better way to know when each
> needs to be freed afterwards; the proposed coding is confusing. And it
> certainly needs more than zero comments about what's going on there.

One idea is to test if the pointers are equal via a macro, rather than
setting and testing a member bool var. And I agree commentary could be
improved in this area.

[1] 
https://www.postgresql.org/message-id/20190903201226.GA16197%40alvherre.pgsql

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v12-detoasting-using-an-iterator.patch
Description: Binary data


Re: Some improvements to numeric sqrt() and ln()

2020-03-25 Thread Dean Rasheed
On Sun, 22 Mar 2020 at 22:16, Tom Lane  wrote:
>
> With resolutions of the XXX items, I think this'd be committable.
>

Thanks for looking at this!

Here is an updated patch with the following updates based on your comments:

* Now uses integer arithmetic to compute res_weight and res_ndigits,
instead of floor() and ceil().

* New comment giving a more detailed explanation of how blen is
chosen, and why it must sometimes examine the first digit of the input
and reduce blen by 1 (which can occur at any step, as shown in the
example given).

* New comment giving a proof that the number of steps required is
guaranteed to be less than 32.

* New comment explaining why the initial integer square root using
Newton's method is guaranteed to converge. I couldn't find a formal
reference for this, but there's a Wikipedia article on it -
https://en.wikipedia.org/wiki/Integer_square_root and I think it's a
well-known result in the field.
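As a side note, that well-known integer variant of Newton's method can be sketched in a few lines of Python; it converges because once the iterate is at or above the true root, it decreases monotonically until it crosses it:

```python
def isqrt(n):
    """Largest x with x*x <= n, via Newton's method on integers."""
    if n < 2:
        return n
    x = n                       # any start >= isqrt(n) works; n itself is safe
    y = (x + n // x) // 2       # one Newton step, in integer arithmetic
    while y < x:                # iterates strictly decrease until convergence
        x, y = y, (y + n // y) // 2
    return x

print(isqrt(99))       # 9
print(isqrt(10**20))   # 10000000000
```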

Regards,
Dean
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
new file mode 100644
index 10229eb..9986132
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -393,16 +393,6 @@ static const NumericVar const_ten =
 #endif
 
 #if DEC_DIGITS == 4
-static const NumericDigit const_zero_point_five_data[1] = {5000};
-#elif DEC_DIGITS == 2
-static const NumericDigit const_zero_point_five_data[1] = {50};
-#elif DEC_DIGITS == 1
-static const NumericDigit const_zero_point_five_data[1] = {5};
-#endif
-static const NumericVar const_zero_point_five =
-{1, -1, NUMERIC_POS, 1, NULL, (NumericDigit *) const_zero_point_five_data};
-
-#if DEC_DIGITS == 4
 static const NumericDigit const_zero_point_nine_data[1] = {9000};
 #elif DEC_DIGITS == 2
 static const NumericDigit const_zero_point_nine_data[1] = {90};
@@ -518,6 +508,8 @@ static void div_var_fast(const NumericVa
 static int	select_div_scale(const NumericVar *var1, const NumericVar *var2);
 static void mod_var(const NumericVar *var1, const NumericVar *var2,
 	NumericVar *result);
+static void div_mod_var(const NumericVar *var1, const NumericVar *var2,
+		NumericVar *quot, NumericVar *rem);
 static void ceil_var(const NumericVar *var, NumericVar *result);
 static void floor_var(const NumericVar *var, NumericVar *result);
 
@@ -7712,6 +7704,7 @@ div_var_fast(const NumericVar *var1, con
 			 NumericVar *result, int rscale, bool round)
 {
 	int			div_ndigits;
+	int			load_ndigits;
 	int			res_sign;
 	int			res_weight;
 	int		   *div;
@@ -7766,9 +7759,6 @@ div_var_fast(const NumericVar *var1, con
 	div_ndigits += DIV_GUARD_DIGITS;
 	if (div_ndigits < DIV_GUARD_DIGITS)
 		div_ndigits = DIV_GUARD_DIGITS;
-	/* Must be at least var1ndigits, too, to simplify data-loading loop */
-	if (div_ndigits < var1ndigits)
-		div_ndigits = var1ndigits;
 
 	/*
 	 * We do the arithmetic in an array "div[]" of signed int's.  Since
@@ -7781,9 +7771,16 @@ div_var_fast(const NumericVar *var1, con
 	 * (approximate) quotient digit and stores it into div[], removing one
 	 * position of dividend space.  A final pass of carry propagation takes
 	 * care of any mistaken quotient digits.
+	 *
+	 * Note that div[] doesn't necessarily contain all of the digits from the
+	 * dividend --- the desired precision plus guard digits might be less than
+	 * the dividend's precision.  This happens, for example, in the square
+	 * root algorithm, where we typically divide a 2N-digit number by an
+	 * N-digit number, and only require a result with N digits of precision.
 	 */
 	div = (int *) palloc0((div_ndigits + 1) * sizeof(int));
-	for (i = 0; i < var1ndigits; i++)
+	load_ndigits = Min(div_ndigits, var1ndigits);
+	for (i = 0; i < load_ndigits; i++)
 		div[i + 1] = var1digits[i];
 
 	/*
@@ -7844,9 +7841,15 @@ div_var_fast(const NumericVar *var1, con
 			maxdiv += Abs(qdigit);
 			if (maxdiv > (INT_MAX - INT_MAX / NBASE - 1) / (NBASE - 1))
 			{
-/* Yes, do it */
+/*
+ * Yes, do it.  Note that if var2ndigits is much smaller than
+ * div_ndigits, we can save a significant amount of effort
+ * here by noting that we only need to normalise those div[]
+ * entries touched where prior iterations subtracted multiples
+ * of the divisor.
+ */
 carry = 0;
-for (i = div_ndigits; i > qi; i--)
+for (i = Min(qi + var2ndigits - 2, div_ndigits); i > qi; i--)
 {
 	newdig = div[i] + carry;
 	if (newdig < 0)
@@ -8095,6 +8098,76 @@ mod_var(const NumericVar *var1, const Nu
 
 
 /*
+ * div_mod_var() -
+ *
+ *	Calculate the truncated integer quotient and numeric remainder of two
+ *	numeric variables.  The remainder is precise to var2's dscale.
+ */
+static void
+div_mod_var(const NumericVar *var1, const NumericVar *var2,
+			NumericVar *quot, NumericVar *rem)
+{
+	NumericVar	q;
+	NumericVar	r;
+
+	init_var();
+	init_var();
+
+	/*
+	 * Use div_var_fast() to get an initial estimate for the integer quotient.
+	 * This might be inaccurate (per the warning in div_var_fast's comments),
+	 * but we 

Re: Internal key management system

2020-03-25 Thread Masahiko Sawada
On Tue, 24 Mar 2020 at 23:15, Bruce Momjian  wrote:
>
> On Tue, Mar 24, 2020 at 02:29:57PM +0900, Masahiko Sawada wrote:
> > That seems to work fine.
> >
> > So we will have pg_cryptokeys within PGDATA and each key is stored
> > into separate file named the key id such as "sql", "tde-wal" and
> > "tde-block". I'll update the patch and post.
>
> Yes, that makes sense to me.
>

I've attached the updated patch. With the patch, we have three
internal keys: the SQL key, the TDE-block key and the TDE-wal key. Only
the SQL key can be used so far to wrap and unwrap user secrets via the
pg_wrap and pg_unwrap SQL functions. Each key is saved to its own file
located in pg_cryptokeys. After initdb with the key manager enabled, the
pg_cryptokeys directory has the following files:

$ ll data/pg_cryptokeys
total 12K
-rw--- 1 masahiko staff 132 Mar 25 15:45 
-rw--- 1 masahiko staff 132 Mar 25 15:45 0001
-rw--- 1 masahiko staff 132 Mar 25 15:45 0002

I used the integer id rather than string id to make the code simple.

During cluster passphrase rotation, we update all keys atomically using a
temporary directory as follows:

1. Derive the new passphrase
2. Wrap all internal keys with the new passphrase
3. Save all internal keys to the temp directory
4. Remove the original directory, pg_cryptokeys
5. Rename the temp directory to pg_cryptokeys

In case of failure during rotation, pg_cryptokeys and
pg_cryptokeys_tmp can be left in an incomplete state. We recover by
checking whether the temporary directory exists and whether the wrapped
keys in the temporary directory are valid.
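The swap in steps 3-5 can be sketched as follows (illustrative Python with hypothetical names, not the patch's C implementation; a production version would also need to fsync the parent directory after the rename for durability):

```python
import os, shutil, tempfile

def rotate_keys(keydir, rewrapped):
    """Replace keydir's contents with rewrapped key files via an atomic rename."""
    tmpdir = keydir + "_tmp"
    if os.path.exists(tmpdir):
        shutil.rmtree(tmpdir)          # leftover from a previously failed rotation
    os.mkdir(tmpdir)
    for keyid, wrapped in rewrapped.items():
        with open(os.path.join(tmpdir, keyid), "wb") as f:
            f.write(wrapped)
            f.flush()
            os.fsync(f.fileno())       # make each key durable before the swap
    shutil.rmtree(keydir)              # steps 4 and 5 from the description above
    os.rename(tmpdir, keydir)

base = tempfile.mkdtemp()
keydir = os.path.join(base, "pg_cryptokeys")
os.mkdir(keydir)
rotate_keys(keydir, {"0000": b"k0", "0001": b"k1", "0002": b"k2"})
print(sorted(os.listdir(keydir)))      # ['0000', '0001', '0002']
```

A crash between the rmtree and the rename leaves only the temp directory, which is exactly the state the recovery check described above detects.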

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


kms_v9.patch
Description: Binary data


Re: Index Skip Scan

2020-03-25 Thread Dmitry Dolgov
> On Wed, Mar 25, 2020 at 11:31:56AM +0530, Dilip Kumar wrote:
>
> Seems like you forgot to add the uniquekey.c file in the
> v33-0001-Unique-key.patch.

Oh, you're right, thanks. Here is the corrected patch.
>From 15989c5250214fea8606a56afd1eeaf760b8723e Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Tue, 24 Mar 2020 17:04:32 +0100
Subject: [PATCH v33 1/2] Unique key

Design by David Rowley.

Author: Jesper Pedersen
---
 src/backend/nodes/outfuncs.c   |  14 +++
 src/backend/nodes/print.c  |  39 +++
 src/backend/optimizer/path/Makefile|   3 +-
 src/backend/optimizer/path/allpaths.c  |   8 ++
 src/backend/optimizer/path/indxpath.c  |  41 
 src/backend/optimizer/path/pathkeys.c  |  71 +++--
 src/backend/optimizer/path/uniquekey.c | 136 +
 src/backend/optimizer/plan/planagg.c   |   1 +
 src/backend/optimizer/plan/planmain.c  |   1 +
 src/backend/optimizer/plan/planner.c   |  37 ++-
 src/backend/optimizer/util/pathnode.c  |  46 +++--
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  19 
 src/include/nodes/print.h  |   1 +
 src/include/optimizer/pathnode.h   |   2 +
 src/include/optimizer/paths.h  |  11 ++
 16 files changed, 408 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/uniquekey.c

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index d76fae44b8..16083e7a7e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -1723,6 +1723,7 @@ _outPathInfo(StringInfo str, const Path *node)
 	WRITE_FLOAT_FIELD(startup_cost, "%.2f");
 	WRITE_FLOAT_FIELD(total_cost, "%.2f");
 	WRITE_NODE_FIELD(pathkeys);
+	WRITE_NODE_FIELD(uniquekeys);
 }
 
 /*
@@ -2205,6 +2206,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(eq_classes);
 	WRITE_BOOL_FIELD(ec_merging_done);
 	WRITE_NODE_FIELD(canon_pathkeys);
+	WRITE_NODE_FIELD(canon_uniquekeys);
 	WRITE_NODE_FIELD(left_join_clauses);
 	WRITE_NODE_FIELD(right_join_clauses);
 	WRITE_NODE_FIELD(full_join_clauses);
@@ -2214,6 +2216,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node)
 	WRITE_NODE_FIELD(placeholder_list);
 	WRITE_NODE_FIELD(fkey_list);
 	WRITE_NODE_FIELD(query_pathkeys);
+	WRITE_NODE_FIELD(query_uniquekeys);
 	WRITE_NODE_FIELD(group_pathkeys);
 	WRITE_NODE_FIELD(window_pathkeys);
 	WRITE_NODE_FIELD(distinct_pathkeys);
@@ -2401,6 +2404,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+
+	WRITE_NODE_FIELD(eq_clause);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4092,6 +4103,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 _outPathKey(str, obj);
 break;
+			case T_UniqueKey:
+_outUniqueKey(str, obj);
+break;
 			case T_PathTarget:
 _outPathTarget(str, obj);
 break;
diff --git a/src/backend/nodes/print.c b/src/backend/nodes/print.c
index 42476724d8..d286b34544 100644
--- a/src/backend/nodes/print.c
+++ b/src/backend/nodes/print.c
@@ -459,6 +459,45 @@ print_pathkeys(const List *pathkeys, const List *rtable)
 	printf(")\n");
 }
 
+/*
+ * print_uniquekeys -
+ *	  uniquekeys list of UniqueKeys
+ */
+void
+print_uniquekeys(const List *uniquekeys, const List *rtable)
+{
+	ListCell   *l;
+
+	printf("(");
+	foreach(l, uniquekeys)
+	{
+		UniqueKey *unique_key = (UniqueKey *) lfirst(l);
+		EquivalenceClass *eclass = (EquivalenceClass *) unique_key->eq_clause;
+		ListCell   *k;
+		bool		first = true;
+
+		/* chase up */
+		while (eclass->ec_merged)
+			eclass = eclass->ec_merged;
+
+		printf("(");
+		foreach(k, eclass->ec_members)
+		{
+			EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
+
+			if (first)
+first = false;
+			else
+printf(", ");
+			print_expr((Node *) mem->em_expr, rtable);
+		}
+		printf(")");
+		if (lnext(uniquekeys, l))
+			printf(", ");
+	}
+	printf(")\n");
+}
+
 /*
  * print_tl
  *	  print targetlist in a more legible way.
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..63cc1505d9 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekey.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8286d9cf34..bbc13e6141 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3954,6 +3954,14 @@ print_path(PlannerInfo *root, Path *path, int indent)
 		print_pathkeys(path->pathkeys, root->parse->rtable);
 	}
 
+	if (path->uniquekeys)
+	{
+		for (i = 0; i < indent; i++)
+			printf("\t");
+		printf("  uniquekeys: ");
+		

Small computeRegionDelta optimization.

2020-03-25 Thread Konstantin Knizhnik

Hi hackers,

Playing with my undo-log storage, I found out that its performance is
mostly limited by the generic WAL record mechanism, and particularly by
the computeRegionDelta function, which computes the page delta for each
logged operation.


I noticed that computeRegionDelta also becomes a bottleneck in other cases
where generic WAL records are used, for example in the RUM extension.

This is profile of inserting records in table with RUM index:

32.99%  postgres  postgres    [.] computeRegionDelta
   6.13%  postgres  rum.so  [.] updateItemIndexes
   4.61%  postgres  postgres    [.] hash_search_with_hash_value
   4.53%  postgres  postgres    [.] GenericXLogRegisterBuffer
   3.74%  postgres  rum.so  [.] rumTraverseLock
   3.33%  postgres  rum.so  [.] rumtuple_get_attrnum
   3.24%  postgres  rum.so  [.] dataPlaceToPage
   3.14%  postgres  postgres    [.] writeFragment
   2.99%  postgres  libc-2.23.so    [.] __memcpy_avx_unaligned
   2.81%  postgres  postgres    [.] nocache_index_getattr
   2.72%  postgres  rum.so  [.] rumPlaceToDataPageLeaf
   1.93%  postgres  postgres    [.] pg_comp_crc32c_sse42
   1.87%  postgres  rum.so  [.] findInLeafPage
   1.77%  postgres  postgres    [.] PinBuffer
   1.52%  postgres  rum.so  [.] compareRumItem
   1.49%  postgres  postgres    [.] FunctionCall2Coll
   1.34%  postgres  rum.so  [.] entryLocateEntry
   1.22%  postgres  libc-2.23.so    [.] __memcmp_sse4_1
   0.97%  postgres  postgres    [.] LWLockAttemptLock


I noticed that computeRegionDelta performs a byte-by-byte comparison of the page.
The obvious optimization is to compare words instead of bytes.
A small patch with this optimization is attached.
Admittedly, it may lead to a small increase in the size of the produced deltas.
It would be possible to calculate deltas more precisely: use word comparison
to find the raw location of each region, and then locate the precise
boundaries using byte comparisons.

But that complicates the algorithm and so makes it slower.
In practice, taking into account that a record header in Postgres is 24
bytes long and fields are usually aligned on 4/8 byte boundaries,

I think that calculating deltas in words is preferable.
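To illustrate the trade-off, here is a sketch (in Python, purely illustrative; the real code scans raw C buffers) of comparing two page images one word at a time and emitting word-aligned fragments, which are at most a word wider on each side than what a byte-by-byte scan would find:

```python
WORD = 8  # compare 8 bytes at a time, like the int64 loads in the patch

def diff_fragments_words(old, new):
    """Return (start, end) byte ranges, rounded to word boundaries, where new differs."""
    assert len(old) == len(new)
    nwords = (len(old) + WORD - 1) // WORD
    frags, begin = [], None
    for w in range(nwords):
        lo, hi = w * WORD, min((w + 1) * WORD, len(old))
        same = old[lo:hi] == new[lo:hi]
        if not same and begin is None:
            begin = lo                  # fragment starts on a word boundary
        elif same and begin is not None:
            frags.append((begin, lo))   # fragment ends on a word boundary
            begin = None
    if begin is not None:
        frags.append((begin, len(old)))
    return frags

old = bytes(64)
new = bytearray(old); new[13] = 1; new[40] = 2
print(diff_fragments_words(old, bytes(new)))  # [(8, 16), (40, 48)]
```

Each single-byte change costs a full word in the delta, but the scan does roughly one eighth of the comparisons.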

Results of such optimization:
Performance of my UNDAM storage is increased from 6500 TPS to 7000 TPS 
(vs. 8500 for unlogged table),

and computeRegionDelta completely disappears from  RUM profile:

   9.37%  postgres  rum.so  [.] updateItemIndexes
   6.57%  postgres  postgres    [.] GenericXLogRegisterBuffer
   5.85%  postgres  postgres    [.] hash_search_with_hash_value
   5.54%  postgres  rum.so  [.] rumTraverseLock
   5.09%  postgres  rum.so  [.] dataPlaceToPage
   4.85%  postgres  postgres    [.] computeRegionDelta
   4.78%  postgres  rum.so  [.] rumtuple_get_attrnum
   4.28%  postgres  postgres    [.] nocache_index_getattr
   4.23%  postgres  rum.so  [.] rumPlaceToDataPageLeaf
   3.39%  postgres  postgres    [.] pg_comp_crc32c_sse42
   3.16%  postgres  libc-2.23.so    [.] __memcpy_avx_unaligned
   2.72%  postgres  rum.so  [.] findInLeafPage
   2.64%  postgres  postgres    [.] PinBuffer
   2.22%  postgres  postgres    [.] FunctionCall2Coll
   2.22%  postgres  rum.so  [.] compareRumItem
   1.91%  postgres  rum.so  [.] entryLocateEntry

But... the time of RUM insertion is almost unchanged: 1770 seconds vs. 1881
seconds.

It looks like it was mostly limited by the time of writing data to disk.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c
index 5164a1c..59d5466 100644
--- a/src/backend/access/transam/generic_xlog.c
+++ b/src/backend/access/transam/generic_xlog.c
@@ -117,7 +117,7 @@ writeFragment(PageData *pageData, OffsetNumber offset, OffsetNumber length,
  */
 static void
 computeRegionDelta(PageData *pageData,
-   const char *curpage, const char *targetpage,
+   const char *currPage, const char *targetPage,
    int targetStart, int targetEnd,
    int validStart, int validEnd)
 {
@@ -125,6 +125,13 @@ computeRegionDelta(PageData *pageData,
 loopEnd,
 fragmentBegin = -1,
 fragmentEnd = -1;
+	int64* curpage = (int64*)currPage;
+	int64* targetpage = (int64*)targetPage;
+
+	targetStart >>= 3;
+	validStart >>= 3;
+	targetEnd = (targetEnd + 7) >> 3;
+	validEnd = (validEnd + 7) >> 3;
 
 	/* Deal with any invalid start region by including it in first fragment */
 	if (validStart > targetStart)
@@ -189,11 +196,11 @@ computeRegionDelta(PageData *pageData,
 		 * fragmentEnd value, which is why it's OK that we unconditionally
 		 * assign "fragmentEnd = i" above.
 		 */
-		if (fragmentBegin >= 0 && i - 

Re: Autovacuum vs vac_update_datfrozenxid() vs ?

2020-03-25 Thread Michael Paquier
On Mon, Mar 23, 2020 at 04:50:36PM -0700, Andres Freund wrote:
> Which all in theory makes sense - but there's two fairly sizable races
> here:
> 1) Because lastSaneFrozenXid is determined *before* the snapshot for
>the pg_class, it's entirely possible that concurrently another vacuum on
>another table in the same database starts & finishes. And that very well
>can end up using a relfrozenxid that's newer than the current
>database.
> 2) Much worse, because vac_update_relstats() uses heap_inplace_update(),
>there's actually no snapshot semantics here anyway! Which means that
>the window for a race isn't just between the ReadNewTransactionId()
>and the systable_beginscan(), but instead lasts for the duration of
>the entire scan.
> 
> Either of these can then triggers vac_update_datfrozenxid() to
> *silently* bail out without touching anything.

Hmm.  It looks like you are right on both points, with 2) being much
more plausible to trigger.
calculations done within vacuum_set_xid_limits()?  We decide if an
aggressive job happens depending on the data in the relation cache,
but that may not play well with concurrent jobs that just finished
with invalidations?

> I think there's also another (even larger?) race in
> vac_update_datfrozenxid(): Unless I miss something, two backends can
> concurrently run through the scan in vac_update_datfrozenxid() for two
> different tables in the same database, both can check that they need to
> update the pg_database row, and then one of them can overwrite the
> results of the other. And the one that updates last might actually be
> the one with an older horizon.  This is possible since there is no 'per
> database' locking in heap_vacuum_rel().

The chances of hitting this get higher with a higher number of max
workers, and it is easy to end up with multiple concurrent workers on
the same database on a busy system.  Perhaps a reason why it was harder
for me to reproduce the problem was that I was just using the default
for autovacuum_max_workers.  It looks worth trying with a crazily high
value for the max number of workers and more relations doing a bunch of
heavy updates (the application I saw facing a lockdown uses literally
hundreds of relations with a poor man's partitioning schema, and some
tables of the schema are heavily updated).

> The way I suspect that interacts with the issue in [1] ist that once
> that has happened, do_start_worker() figures out it's doing a xid
> wraparound vacuum based on the "wrong" datfrozenxid:
>   /* Check to see if this one is at risk of wraparound */
>   if (TransactionIdPrecedes(tmp->adw_frozenxid, xidForceLimit))
>   {
>       if (avdb == NULL ||
>           TransactionIdPrecedes(tmp->adw_frozenxid,
>                                 avdb->adw_frozenxid))
>           avdb = tmp;
>       for_xid_wrap = true;
>       continue;
>   }
>   else if (for_xid_wrap)
>       continue;       /* ignore not-at-risk DBs */
> 
> which means autovac launcher will only start workers for that database -
> but there might actually not be any work for it.

Actually, I don't think that pg_database.datfrozenxid explains
everything.  From what I saw, this field is getting freshly updated.
So workers are spawned for a wraparound, and the relation-level stats
cause workers to try to autovacuum a table, but nothing happens at the
end.

> I have some theories as to why this is more common in 12, but I've not
> fully nailed them down.

It seems to me that 2aa6e33 also helps in triggering those issues more
easily, as it creates a kind of cascading effect with the workers still
getting spawned by the launcher.  However, the workers finish by
handling relations which have nothing to do: I suspect that skipping
the anti-wraparound, non-aggressive jobs reduces the number of
relation stat updates done via vac_update_relstats(), leaving
individual relations in a state where they consider that they cannot
do any more work, so that they all treat non-aggressive but
anti-wraparound jobs as the norm, causing nothing to happen as we saw
in the thread on -general mentioned by Andres upthread.  And then
things just loop over again and again.  Note that it is also mentioned
on the other thread that issuing a manual VACUUM FREEZE temporarily
fixes the lockdown issue, as it forces an update of the relation
stats.  So it seems to me that reverting 2aa6e33 is a necessary first
step to prevent the lockdown from happening, though that does not
address the actual issues causing anti-wraparound, non-aggressive jobs
to exist.
--
Michael




Re: pg_upgrade fails with non-standard ACL

2020-03-25 Thread Daniel Gustafsson
> On 25 Mar 2020, at 03:12, Artur Zakirov  wrote:

> Regression tests fail because cfbot applies 
> "test_rename_catalog_objects.patch".  Regression tests pass without it.
> 
> This patch shouldn't be applied by cfbot.  I'm not sure how to do this.  But 
> maybe it is possible to use a different extension name for the test patch, 
> not ".patch".

To get around that, post a new version again without the test_ patch;
that should make the cfbot pick up only the "new" patches.
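As a concrete sketch of Artur's alternative suggestion (keeping the test
patch attached but giving it a non-patch suffix so the cfbot does not
try to apply it), something like this should do; the file name is the
one from the thread, and the `.txt` suffix is just an assumption about
what the cfbot ignores:

```shell
# Stand-in for the attachment from the thread; only the suffix matters.
touch test_rename_catalog_objects.patch
# A non-.patch suffix keeps the cfbot from trying to apply the file.
mv test_rename_catalog_objects.patch test_rename_catalog_objects.patch.txt
ls test_rename_catalog_objects.patch.txt
```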

cheers ./daniel


