Re: Why does creating logical replication subscriptions require superuser?

2021-01-21 Thread Andrey Borodin
Hi!

> On 21 Jan 2021, at 21:20, Paul Martinez  wrote:
> 
> Hey, all,
> 
> I'm working with native logical replication, and I don't fully understand
> why logical replication subscribers need to be superusers, nor do I fully
> understand the implication of some of the comments made on this page:
> 
> https://www.postgresql.org/docs/current/logical-replication-security.html
> 
>> A user able to modify the schema of subscriber-side tables can execute
>> arbitrary code as a superuser.
> 
> Does "execute arbitrary code" here really mean executing arbitrary code on the
> machine that is running Postgres, or simply running arbitrary SQL in the
> replicating database? Would it only be able to modify data in the database
> containing the subscription, or could it modify other databases in the same
> cluster? Is there any "blast-radius" to the capabilities gained by such a 
> user?
I suspect it means exactly what it states. Replication apply runs as a 
superuser, so e.g. one can add a system catalog to the subscription, or 
exploit this fact in other ways. With superuser rights you can simply 
COPY FROM PROGRAM anything.

> According to the commit message that added this text, the callout of this
> point was added as result of CVE-2020-14349, but the details there don't
> really help me understand what the concern is here, nor do I have a deep
> understanding of various features that might combine to create a 
> vulnerability.
> 
> I'm not sure what arbitrary code could be executed, but my rough guess, based
> on some of the other text on that page, is that custom triggers, written in
> an arbitrary language (e.g., Python), would run arbitrary code and that is
> the concern. Is that correct? And, if so, assuming that Python triggers were
> the only way to execute arbitrary code, then simply building Postgres without
> Python support would prevent users that can modify the schema from executing
> code as superuser. What is the full set of features that could lead to running
> arbitrary code in this scenario? Is it just all the different languages that
> can be used to write triggers?
We cannot build PostgreSQL without SQL.

> Essentially, I'm wondering what a loose proof-of-concept privilege escalation
> vulnerability would look like if a non-superuser can modify the schema of
> subscriber-side tables.
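
Roughly something like this, I think (just a sketch; the table and role names
are made up): a table owner on the subscriber who is not a superuser attaches
a trigger to a replicated table, and the apply worker, which runs as the
superuser owning the subscription, fires it for every replicated row:

CREATE FUNCTION escalate() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    -- runs with the privileges of the apply worker, i.e. the
    -- superuser who owns the subscription
    ALTER ROLE attacker SUPERUSER;
    RETURN NEW;
END $$;

CREATE TRIGGER escalate_trg BEFORE INSERT ON replicated_table
    FOR EACH ROW EXECUTE FUNCTION escalate();

-- apply workers run with session_replication_role = replica,
-- so the trigger must be marked to fire during apply
ALTER TABLE replicated_table ENABLE ALWAYS TRIGGER escalate_trg;

Plain plpgsql (or even an SQL-language function) is enough here, which is why
building without PL/Python would not help.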

> On a related note, what happens if a superuser creates a subscription, and 
> then
> is demoted to a regular user? My understanding is that the worker that applies
> the logical changes to the database connects as the user that created the
> subscription, so would that prevent potential vulnerabilities in any way?

Subscription operations must run with the privileges of the user that created 
the subscription. All other approaches are error-prone and would leave 
subscriptions solely in the superuser's arsenal.

Thanks!

Best regards, Andrey Borodin.



RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Hou, Zhijie

> > And there seems another solution for this:
> >
> > In the patch, We only use the { ii_Expressions , ii_NumIndexAttrs ,
> > ii_IndexAttrNumbers } from the IndexInfo, which seems can get from
> "Relation-> rd_index".
> >
> > Based on above, May be we do not need to call BuildIndexInfo to build
> the IndexInfo.
> > It can avoid some unnecessary cost.
> > And in this solution we do not need to remove expression_planner.
> >
> 
> Hmmm, when I debugged my simple test case, I found rel->rd_index was NULL,
> so it seems that the call to BuildIndexInfo is needed.
> (have I understood your suggestion correctly?)
> 

Hi Greg,

Thanks for debugging this.

Maybe I missed something; I am not sure about the case where rel->rd_index 
was NULL, because the function BuildIndexInfo does not seem to have a 
NULL check for index->rd_index. It looks like the following:

BuildIndexInfo(Relation index)
{
IndexInfo  *ii;
Form_pg_index indexStruct = index->rd_index;
int i;
int numAtts;

/* check the number of keys, and copy attr numbers into the IndexInfo */
numAtts = indexStruct->indnatts;


The patch does not have a NULL check for index->rd_index either, so I 
assumed that index->rd_index can be taken as non-NULL, but it seems I may 
have missed something?

Can you please share the test case with me?

I used the following code to replace the call to BuildIndexInfo,
and installcheck passed.

Example:
+Form_pg_index indexStruct = index_rel->rd_index;
+List *ii_Expressions = RelationGetIndexExpressions(index_rel);
+int ii_NumIndexAttrs = indexStruct->indnatts;
+AttrNumber  ii_IndexAttrNumbers[INDEX_MAX_KEYS];

+for (int i = 0; i < ii_NumIndexAttrs; i++)
+ii_IndexAttrNumbers[i] = indexStruct->indkey.values[i];

Best regards,
houzj






Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Greg Nancarrow
On Fri, Jan 22, 2021 at 4:49 PM Amit Kapila  wrote:
>
> >
> > Unfortunately, this change results in a single test failure in the
> > "with" tests when "force_parallel_mode=regress" is in effect.
> >
> > I have reproduced the problem, by extracting relevant SQL from those
> > tests, as follows:
> >
> > CREATE TEMP TABLE bug6051 AS
> >   select i from generate_series(1,3) as i;
> > SELECT * FROM bug6051;
> > CREATE TEMP TABLE bug6051_2 (i int);
> > CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD
> >  INSERT INTO bug6051_2
> >  SELECT NEW.i;
> > WITH t1 AS ( DELETE FROM bug6051 RETURNING * )
> > INSERT INTO bug6051 SELECT * FROM t1;
> > ERROR:  cannot delete tuples during a parallel operation
> >
> > Note that prior to the patch, all INSERTs were regarded as
> > PARALLEL_UNSAFE, so this problem obviously didn't occur.
> > I believe this INSERT should be regarded as PARALLEL_UNSAFE, because
> > it contains a modifying CTE.
> > However, for some reason, the INSERT is not regarded as having a
> > modifying CTE, so instead of finding it PARALLEL_UNSAFE, it falls into
> > the parallel-safety-checks and is found to be PARALLEL_RESTRICTED:
> >
> > The relevant code in standard_planner() is:
> >
> > if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> > IsUnderPostmaster &&
> > (parse->commandType == CMD_SELECT ||
> >  IsModifySupportedInParallelMode(parse->commandType)) &&
> > !parse->hasModifyingCTE &&
> > max_parallel_workers_per_gather > 0 &&
> > !IsParallelWorker())
> > {
> > /* all the cheap tests pass, so scan the query tree */
> > glob->maxParallelHazard = max_parallel_hazard(parse);
> > glob->parallelModeOK = (glob->maxParallelHazard != 
> > PROPARALLEL_UNSAFE);
> > }
> > else
> > {
> > /* skip the query tree scan, just assume it's unsafe */
> > glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> > glob->parallelModeOK = false;
> > }
> >
> > When I debugged this (transformWithClause()), the WITH clause was
> > found to contain a modifying CTE and for the INSERT
> > query->hasModifyingCTE was set true.
> > But somehow in the re-writer code, this got lost.
> > Bug?
> > Ideas?
> >
>
> > How does it behave when the table in the above test is a non-temp table
> > with your patch? If it leads to the same error, then we can at least
> > conclude that this is a generic problem and nothing specific to temp
> > tables.
>

Oh, I don't believe that this has anything to do with TEMP tables -
it's just that when I relaxed the parallel-safety level on TEMP
tables, it exposed the CTE issue in this test case because it just
happens to use a TEMP table.
Having said that, when I changed that test code to not use a TEMP
table, an Assert fired in the planner code and caused the backend to
abort.
It looks like I need to update the following Assert in the planner
code (unchanged by the current patch) in order to test further. But
this Assert only fired because the commandType was CMD_DELETE, which
SHOULD have been excluded by the "hasModifyingCTE" test on the parent
INSERT; that flag is exactly what I'm saying is strangely NOT getting set.

/*
 * Generate partial paths for final_rel, too, if outer query levels might
 * be able to make use of them.
 */
if (final_rel->consider_parallel && root->query_level > 1 &&
!limit_needed(parse))
{
Assert(!parse->rowMarks && parse->commandType == CMD_SELECT);
foreach(lc, current_rel->partial_pathlist)
{
Path   *partial_path = (Path *) lfirst(lc);

add_partial_path(final_rel, partial_path);
}
}

Once the Assert above is changed to not test the commandType, the same
error occurs as before.

This appears to possibly be some kind of bug in which hasModifyingCTE
is not getting set, at least in this particular INSERT case, but in
the current Postgres code it doesn't matter because all INSERTs are
considered parallel-unsafe, so won't be executed in parallel-mode
anyway.
I notice that if I execute "SELECT * from t1" instead of "INSERT INTO
bug6051 SELECT * from t1", then "hasModifyingCTE" is getting set to
true for the query, and thus is always considered parallel-unsafe.


Regards,
Greg Nancarrow
Fujitsu Australia




Re: Identify missing publications from publisher while create/alter subscription.

2021-01-21 Thread Bharath Rupireddy
On Fri, Jan 22, 2021 at 10:14 AM japin  wrote:
> > 2) Can't we know whether the publications exist on the publisher with
> > the existing (or modifying it a bit if required) query in
> > fetch_table_list(), so that we can avoid making another connection to
> > the publisher system from the subscriber?
>
> IIUC, the patch does not make another connection; it just executes a new
> query on the already-established connection.  If we want to check publication
> existence for ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false)
> we should make another connection.

Actually, I meant that we can avoid submitting another SQL query to
the publisher if we could manage to submit a single query that first
checks whether a given publication exists in pg_publication and, if so,
returns the tables associated with it from pg_publication_tables. Can
we modify the existing query in fetch_table_list, which currently gets
only the table list from pg_publication_tables, to also check whether
the given publication exists in pg_publication?
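
As a rough illustration of just the existence-check part (the publication
names here are placeholders), something along these lines could be folded
into the query that fetch_table_list() already sends:

SELECT r.pubname
FROM unnest(ARRAY['pub1', 'pub2']) AS r(pubname)
LEFT JOIN pg_publication p USING (pubname)
WHERE p.pubname IS NULL;

i.e. the publisher reports back which of the requested publications it does
not have, in the same round trip that returns the table list.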

Yes, you are right: if we were to check the existence of publications
provided in ALTER SUBSCRIPTION statements, we would need to do
walrcv_connect and walrcv_exec. We could just call a common function
from there.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] More docs on what to do and not do in extension code

2021-01-21 Thread Craig Ringer
Hi

Thanks so much for reading over this!

Would you mind attaching a revised version of the patch with your edits?
Otherwise I'll go and merge them in once you've had your say on my comments
inline below.

Bruce, Robert, can I have an opinion from you on how best to locate and
structure these docs, or whether you think they're suitable for the main
docs at all? See patch upthread.

On Tue, 19 Jan 2021 at 22:33, Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

>
> Here are some comments:
>
> [1]
>background worker's main function, and must be unblocked by it; this is
> to
> allow the process to customize its signal handlers, if necessary.
> -   Signals can be unblocked in the new process by calling
> -   BackgroundWorkerUnblockSignals and blocked by
> calling
> -   BackgroundWorkerBlockSignals.
> +   It is important that all background workers set up and unblock signal
> +   handling before they enter their main loops. Signal handling in
> background
> +   workers is discussed separately in .
>
>

I wasn't sure either way on that, see your [3] below.

[2]
> +   interupt-aware APIs for the purpose. Do not
> usleep(),
> +   system(), make blocking system calls, etc.
> +  
>
> Is it "Do not use usleep(),
> system() or make blocking system calls etc." ?
>

Right. Good catch.

[3] IMO, we can remove following from "bgworker-signals" if we retain
> it where currently it is, as discussed in [1].
> +Signals can be unblocked in the new process by calling
> +BackgroundWorkerUnblockSignals and blocked by
> calling
> +BackgroundWorkerBlockSignals.
>

If so, need to mention that they start blocked and link to the main text
where that's mentioned.

That's part of why I moved this chunk into the signal section.

[4] Can we say
> +The default signal handlers set up for background workers do
> +default background worker signal handlers, it should call
>
> instead of
> +The default signal handlers installed for background workers
> do
> +default background worker signal handling it should call
>

Huh?

I don't understand this proposal.

s/install/set up/g?

[5] IMO, we can have something like below
> +request, etc. Set up these handlers before unblocking signals as
> +shown below:
>
> instead of
> +request, etc. To install these handlers, before unblocking interrupts
> +run:
>

Ditto

[6] I think logs and errors either elog() or ereport can be used, so how
> about
> +Use elog() or ereport()
> for
> +logging output or raising errors instead of any direct stdio
> calls.
>
> instead of
> +Use elog() and
> ereport() for
> +logging output and raising errors instead of any direct stdio
> calls.
>

OK.

[7] Can we use child processes instead of subprocess ? If okay in
> other places in the patch as well.
>

Fine with me. The point is really that they're non-postgres processes being
spawned by a backend, and that doing so must be done carefully.

[8] Why should file descriptor manager API be used to execute
> subprocesses/child processes?
> +To execute subprocesses and open files, use the routines provided
> by
> +the file descriptor manager like
> BasicOpenFile
> +and OpenPipeStream instead of a direct
>

Yeah, that wording is confusing, agreed. The point was that you shouldn't
use system() or popen(), you should OpenPipeStream(). And similarly, you
should avoid fopen() etc and instead use the Pg wrapper BasicOpenFile().
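
In code terms the advice is basically (sketch only):

/* instead of:  fd = open(path, O_RDONLY); */
int   fd = BasicOpenFile(path, O_RDONLY);

/* instead of:  fp = popen(command, "r"); */
FILE *fp = OpenPipeStream(command, "r");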

"
PostgreSQL backends are required to limit the number of file descriptors
they
open. To open files, use postgres file descriptor manager routines like
BasicOpenFile()
instead of directly using open() or fopen(). To open pipes to or from
external processes,
use OpenPipeStream() instead of popen().
"

?


> [9] "should always be"? even if it's a blocking extesion, does it
> work? If our intention is to recommend the developers, maybe we should
> avoid using the term "should" in the patch in other places as well.
>

The trouble is that it's a bit ... fuzzy.

You can get away with blocking for short periods without responding to
signals, but it's a "how long is a piece of string" issue.

"should be" is fine.

A hard "must" or "must not" would be misleading. But this isn't the place
to go into all the details of how time sensitive (or not) interrupt
handling of different kinds is in different places for different worker
types.


> [11] I think it is
> +block if this happens. So cleanup of resources is not
> entirely managed by PostgreSQL, it
> +   must be handled using appropriate callbacks provided by PostgreSQL
>
> instead of
> +block if this happens. So all cleanup of resources not already
> +managed by the PostgreSQL runtime must be handled using
> appropriate
>

I don't agree with the proposed new wording here.

Delete the "So all" from my original, or

... Cleanup of any resources that are not managed
> by the PostgreSQL runtime must be handled using 

Re: Very misleading documentation for PQreset()

2021-01-21 Thread Noah Misch
On Thu, Jan 21, 2021 at 05:32:56PM -0500, Tom Lane wrote:
> I happened to notice that PQreset is documented thus:
> 
> This function will close the connection to the server and attempt to
> reestablish a new connection to the same server, using all the same
> parameters previously used.
> 
> Since we invented multi-host connection parameters, a reasonable person
> would assume that "to the same server" means we promise to reconnect to
> the same host we selected the first time.  There is no such guarantee
> though; the new connection attempt is done just like the first one,
> so it will select the first suitable server in the list.
> 
> I think we should just drop that phrase.

I agree that dropping the phrase strictly improves that documentation.




Re: simplifying foreign key/RI checks

2021-01-21 Thread Corey Huinker
>
>
>
> I decided not to deviate from pk_ terminology so that the new code
> doesn't look too different from the other code in the file.  Although,
> I guess we can at least call the main function
> ri_ReferencedKeyExists() instead of ri_PrimaryKeyExists(), so I've
> changed that.
>

I think that's a nice compromise; it makes the reader aware of the concept.


>
> I've attached the updated patch.
>

Missing "break" added. Check.
Comment updated. Check.
Function renamed. Check.
Attribute mapping matching test (and assertion) added. Check.
Patch applies to an as-of-today master, passes make check and check world.
No additional regression tests required, as no new functionality is
introduced.
No docs required, as there is nothing user-facing.

Questions:
1. There's a palloc for mapped_partkey_attnums, which is never freed. Is
the prevailing memory context short-lived enough that we don't care?
2. Same question for the AttrMap map: should there be a free_attrmap()?


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Greg Nancarrow
On Fri, Jan 22, 2021 at 1:16 PM Hou, Zhijie  wrote:
>
>
> 4.
> +   ListCell   *index_expr_item = 
> list_head(index_info->ii_Expressions);
> ...
> +   index_expr = (Node *) 
> lfirst(index_expr_item);
> +   index_expr = (Node *) 
> expression_planner((Expr *) index_expr);
>
> It seems BuildIndexInfo has already called eval_const_expressions for 
> ii_Expressions,
> Like the flow: BuildIndexInfo--> RelationGetIndexExpressions--> 
> eval_const_expressions
>
> So, IMO, we do not need to call expression_planner for the expr again.
>

Thanks. You are right. I debugged it, and found that BuildIndexInfo-->
RelationGetIndexExpressions executes the same expression evaluation
code as expression_planner().
So I'll remove the redundant call to expression_planner() here.

>
> And there seems another solution for this:
>
> In the patch, We only use the { ii_Expressions , ii_NumIndexAttrs , 
> ii_IndexAttrNumbers } from the IndexInfo,
> which seems can get from "Relation-> rd_index".
>
> Based on above, May be we do not need to call BuildIndexInfo to build the 
> IndexInfo.
> It can avoid some unnecessary cost.
> And in this solution we do not need to remove expression_planner.
>

Hmmm, when I debugged my simple test case, I found rel->rd_index was
NULL, so it seems that the call to BuildIndexInfo is needed.
(have I understood your suggestion correctly?)

Regards,
Greg Nancarrow
Fujitsu Australia




Re: POC: postgres_fdw insert batching

2021-01-21 Thread Ian Lawrence Barwick
Hi

On Thu, 21 Jan 2021 at 08:00, Tomas Vondra  wrote:

> OK, pushed after a little bit of additional polishing (mostly comments).
>
> Thanks everyone!
>

There's a minor typo in the doc's version of the ExecForeignBatchInsert()
declaration. It currently reads:

TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
  ResultRelInfo *rinfo,
  TupleTableSlot **slots,
  TupleTableSlot *planSlots,
  int *numSlots);

should be:

TupleTableSlot **
ExecForeignBatchInsert(EState *estate,
  ResultRelInfo *rinfo,
  TupleTableSlot **slots,
  TupleTableSlot **planSlots,
  int *numSlots);

(Trivial patch attached).


Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 854913ae5f..2e73d296d2 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -619,7 +619,7 @@ TupleTableSlot **
 ExecForeignBatchInsert(EState *estate,
   ResultRelInfo *rinfo,
   TupleTableSlot **slots,
-  TupleTableSlot *planSlots,
+  TupleTableSlot **planSlots,
   int *numSlots);
 
 


Re: About to add WAL write/fsync statistics to pg_stat_wal view

2021-01-21 Thread Masahiko Sawada
On Fri, Dec 25, 2020 at 6:46 PM Masahiro Ikeda  wrote:
>
> Hi,
>
> I rebased the patch to the master branch.

Thank you for working on this. I've read the latest patch. Here are comments:

---
+   if (track_wal_io_timing)
+   {
+   INSTR_TIME_SET_CURRENT(duration);
+   INSTR_TIME_SUBTRACT(duration, start);
+   WalStats.m_wal_write_time +=
INSTR_TIME_GET_MILLISEC(duration);
+   }

* I think it should accumulate the time in microseconds.

After running pgbench with track_wal_io_timing = on for 30 sec,
pg_stat_wal showed the following on my environment:

postgres(1:61569)=# select * from pg_stat_wal;
-[ RECORD 1 ]+-
wal_records  | 285947
wal_fpi  | 53285
wal_bytes| 442008213
wal_buffers_full | 0
wal_write| 25516
wal_write_time   | 0
wal_sync | 25437
wal_sync_time| 14490
stats_reset  | 2021-01-22 10:56:13.29464+09

Since writes can complete in less than a millisecond, wal_write_time
didn't increase. I think wal_sync_time could have the same problem.
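
For example (just a sketch of the idea, assuming the counter is widened
accordingly), the accumulation could become:

   if (track_wal_io_timing)
   {
       INSTR_TIME_SET_CURRENT(duration);
       INSTR_TIME_SUBTRACT(duration, start);
       /* microsecond resolution, so sub-millisecond writes still count */
       WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
   }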

---
+   /*
+* Measure i/o timing to fsync WAL data.
+*
+* The wal receiver skip to collect it to avoid performance
degradation of standy servers.
+* If sync_method doesn't have its fsync method, to skip too.
+*/
+   if (!AmWalReceiverProcess() && track_wal_io_timing && fsyncMethodCalled())
+   INSTR_TIME_SET_CURRENT(start);

* Why does only the wal receiver skip it even if track_wal_io_timing
is true? I think the performance degradation also applies to backend
processes. If there is another reason for that, I think it's better to
mention it in both the doc and the comment.

* How about checking track_wal_io_timing first?

* s/standy/standby/

---
+   /* increment the i/o timing and the number of times to fsync WAL data */
+   if (fsyncMethodCalled())
+   {
+   if (!AmWalReceiverProcess() && track_wal_io_timing)
+   {
+   INSTR_TIME_SET_CURRENT(duration);
+   INSTR_TIME_SUBTRACT(duration, start);
+   WalStats.m_wal_sync_time += INSTR_TIME_GET_MILLISEC(duration);
+   }
+
+   WalStats.m_wal_sync++;
+   }

* I'd avoid always calling fsyncMethodCalled() in this path. How about
incrementing m_wal_sync after each sync operation?

---
+/*
+ * Check if fsync mothod is called.
+ */
+static bool
+fsyncMethodCalled()
+{
+   if (!enableFsync)
+   return false;
+
+   switch (sync_method)
+   {
+   case SYNC_METHOD_FSYNC:
+   case SYNC_METHOD_FSYNC_WRITETHROUGH:
+   case SYNC_METHOD_FDATASYNC:
+   return true;
+   default:
+   /* others don't have a specific fsync method */
+   return false;
+   }
+}

* I'm concerned that the function name could confuse the reader,
because it's called even before the fsync method is called. As I
commented above, the call to fsyncMethodCalled() here can be eliminated.
That way, this function would be called only once. So do we really need
this function?
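
To sketch what I mean (simplified, error reporting abbreviated): the counter
could be bumped right where the sync is issued in issue_xlog_fsync(), e.g.:

    switch (sync_method)
    {
        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync file: %m")));
            WalStats.m_wal_sync++;    /* count each sync where it happens */
            break;
            /* ... and likewise in the other fsync-style cases ... */
    }

That removes the need for a separate "would this sync_method fsync?" helper
on this path.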

* As far as I read the code, issue_xlog_fsync() seems to do fsync even
if enableFsync is false. Why does the function return false in that
case? I might be missing something.

* Is void missing from the argument list?

* s/mothod/method/

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Amit Kapila
On Fri, Jan 22, 2021 at 8:29 AM Greg Nancarrow  wrote:
>
> On Thu, Jan 21, 2021 at 7:30 PM Amit Kapila  wrote:
> >
> > > i.e. code-wise:
> > >
> > > /*
> > > -* We can't support table modification in parallel-mode if
> > > it's a foreign
> > > -* table/partition (no FDW API for supporting parallel access) or 
> > > a
> > > +* We can't support table modification in a parallel worker if 
> > > it's a
> > > +* foreign table/partition (no FDW API for supporting parallel
> > > access) or a
> > >  * temporary table.
> > >  */
> > > if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> > > RelationUsesLocalBuffers(rel))
> > > {
> > > -   table_close(rel, lockmode);
> > > -   context->max_hazard = PROPARALLEL_UNSAFE;
> > > -   return true;
> > > +   if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, 
> > > context))
> > > +   {
> > > +   table_close(rel, lockmode);
> > > +   return true;
> > > +   }
> > > }
> > >
> >
> > Yeah, these changes look correct to me.
> >
>
> Unfortunately, this change results in a single test failure in the
> "with" tests when "force_parallel_mode=regress" is in effect.
>
> I have reproduced the problem, by extracting relevant SQL from those
> tests, as follows:
>
> CREATE TEMP TABLE bug6051 AS
>   select i from generate_series(1,3) as i;
> SELECT * FROM bug6051;
> CREATE TEMP TABLE bug6051_2 (i int);
> CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD
>  INSERT INTO bug6051_2
>  SELECT NEW.i;
> WITH t1 AS ( DELETE FROM bug6051 RETURNING * )
> INSERT INTO bug6051 SELECT * FROM t1;
> ERROR:  cannot delete tuples during a parallel operation
>
> Note that prior to the patch, all INSERTs were regarded as
> PARALLEL_UNSAFE, so this problem obviously didn't occur.
> I believe this INSERT should be regarded as PARALLEL_UNSAFE, because
> it contains a modifying CTE.
> However, for some reason, the INSERT is not regarded as having a
> modifying CTE, so instead of finding it PARALLEL_UNSAFE, it falls into
> the parallel-safety-checks and is found to be PARALLEL_RESTRICTED:
>
> The relevant code in standard_planner() is:
>
> if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
> IsUnderPostmaster &&
> (parse->commandType == CMD_SELECT ||
>  IsModifySupportedInParallelMode(parse->commandType)) &&
> !parse->hasModifyingCTE &&
> max_parallel_workers_per_gather > 0 &&
> !IsParallelWorker())
> {
> /* all the cheap tests pass, so scan the query tree */
> glob->maxParallelHazard = max_parallel_hazard(parse);
> glob->parallelModeOK = (glob->maxParallelHazard != 
> PROPARALLEL_UNSAFE);
> }
> else
> {
> /* skip the query tree scan, just assume it's unsafe */
> glob->maxParallelHazard = PROPARALLEL_UNSAFE;
> glob->parallelModeOK = false;
> }
>
> When I debugged this (transformWithClause()), the WITH clause was
> found to contain a modifying CTE and for the INSERT
> query->hasModifyingCTE was set true.
> But somehow in the re-writer code, this got lost.
> Bug?
> Ideas?
>

How does it behave when the table in the above test is a non-temp table
with your patch? If it leads to the same error, then we can at least
conclude that this is a generic problem and nothing specific to temp
tables.

-- 
With Regards,
Amit Kapila.




Re: adding wait_start column to pg_locks

2021-01-21 Thread torikoshia

On 2021-01-21 12:48, Fujii Masao wrote:

Thanks for updating the patch! I think that this is a really useful 
feature!!


Thanks for reviewing!


I have two minor comments.

+  role="column_definition">

+   wait_start timestamptz

The column name "wait_start" should be "waitstart" for the sake of 
consistency

with other column names in pg_locks? pg_locks seems to avoid including
an underscore in column names, so "locktype" is used instead of 
"lock_type",

"virtualtransaction" is used instead of "virtual_transaction", etc.

+   Lock acquisition wait start time. NULL if
+   lock acquired.



Agreed.

I also changed the variable name "wait_start" in struct PGPROC and
LockInstanceData to "waitStart" for the same reason.
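
(For reference, with the patch applied the wait duration can then be taken
directly from pg_locks, e.g.:

SELECT pid, locktype, mode, now() - waitstart AS wait_duration
  FROM pg_locks
 WHERE NOT granted;

which is the use case described in the commit message of the attached patch.)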


There seems to be a case where the wait start time is NULL even when "grant"
is false. Would it be better to add a note about that case to the docs? For
example, I found that the wait start time is NULL while the startup process
is waiting for the lock. Is this the only such case?


Thanks, this is because I set 'waitstart' in the following
condition.

  ---src/backend/storage/lmgr/proc.c
  > 1250 if (!InHotStandby)

As far as I can tell from this, the startup process would
be the only such case.

I also think that, in the case of the startup process, it seems possible
to set 'waitstart' in ResolveRecoveryConflictWithLock(), so I
did that in the attached patch.


Any thoughts?


Regards,

--
Atsushi TorikoshiFrom 6beb1c61e72c797c915427ae4e36d6bab9e0594c Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Fri, 22 Jan 2021 13:51:00 +0900
Subject: [PATCH v5] To examine the duration of locks, we did join on pg_locks
 and pg_stat_activity and used columns such as query_start or state_change.
 However, since they are the moment when queries have started or their state
 has changed, we could not get the exact lock duration in this way.

---
 contrib/amcheck/expected/check_btree.out |  4 ++--
 doc/src/sgml/catalogs.sgml   | 10 ++
 src/backend/storage/ipc/standby.c|  8 ++--
 src/backend/storage/lmgr/lock.c  |  8 
 src/backend/storage/lmgr/proc.c  |  4 
 src/backend/utils/adt/lockfuncs.c|  9 -
 src/include/catalog/pg_proc.dat  |  6 +++---
 src/include/storage/lock.h   |  2 ++
 src/include/storage/proc.h   |  1 +
 src/test/regress/expected/rules.out  |  5 +++--
 10 files changed, 47 insertions(+), 10 deletions(-)

diff --git a/contrib/amcheck/expected/check_btree.out b/contrib/amcheck/expected/check_btree.out
index 13848b7449..5a3f1ef737 100644
--- a/contrib/amcheck/expected/check_btree.out
+++ b/contrib/amcheck/expected/check_btree.out
@@ -97,8 +97,8 @@ SELECT bt_index_parent_check('bttest_b_idx');
 SELECT * FROM pg_locks
 WHERE relation = ANY(ARRAY['bttest_a', 'bttest_a_idx', 'bttest_b', 'bttest_b_idx']::regclass[])
 AND pid = pg_backend_pid();
- locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath 
---+--+--+--+---++---+-+---+--++-+--+-+--
+ locktype | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid | mode | granted | fastpath | waitstart 
+--+--+--+--+---++---+-+---+--++-+--+-+--+---
 (0 rows)
 
 COMMIT;
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 43d7a1ad90..ba003ce393 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10589,6 +10589,16 @@ SCRAM-SHA-256$iteration count:
lock table
   
  
+
+ 
+  
+   waitstart timestamptz
+  
+  
+   Lock acquisition wait start time. NULL if
+   lock acquired
+  
+ 
 

   
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 39a30c00f7..819e00e4ab 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -539,13 +539,17 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 void
 ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict)
 {
-	TimestampTz ltime;
+	TimestampTz ltime, now;
 
 	Assert(InHotStandby);
 
 	ltime = GetStandbyLimitTime();
+	now = GetCurrentTimestamp();
 
-	if (GetCurrentTimestamp() >= ltime && ltime != 0)
+	if (MyProc->waitStart == 0)
+		MyProc->waitStart = now;
+
+	if (now >= ltime && ltime != 0)
 	{
 		/*
 		 * We're already behind, so clear a path as quickly as possible.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 20e50247ea..ffad4e94bc 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -3627,6 +3627,12 @@ GetLockStatusData(void)
 			

Re: Deleting older versions in unique indexes to avoid page splits

2021-01-21 Thread Amit Kapila
On Thu, Jan 21, 2021 at 12:23 AM Peter Geoghegan  wrote:
>
> On Wed, Jan 20, 2021 at 5:33 AM Amit Kapila  wrote:
> > > Victor independently came up with a benchmark that ran over several
> > > hours, with cleanup consistently held back by ~5 minutes by a long
> > > running transaction:
> > >
> >
> > AFAICS, the long-running transaction used in the test is below:
> > SELECT abalance, pg_sleep(300) FROM pgbench_accounts WHERE mtime >
> > now() - INTERVAL '15min' ORDER BY aid LIMIT 1;
> >
> > This shouldn't generate a transaction id so would it be sufficient to
> > hold back the clean-up?
>
> It will hold back clean-up because it holds open a snapshot.
>

Okay, that makes sense. It slipped my mind.

>
> Slowing down non-HOT updaters in these extreme cases may actually be a
> good thing, even when bottom-up deletion finally becomes ineffective.
> It can be thought of as backpressure. I am not worried about slowing
> down something that is already hopelessly inefficient and
> unsustainable. I'd go even further than that, in fact -- I now wonder
> if we should *deliberately* slow them down some more!
>

Do you have something specific in mind for this?

> > Test with 2 un-modified indexes
> > ===
> > create table t1(c1 int, c2 int, c3 int, c4 char(10));
> > create index idx_t1 on t1(c1);
> > create index idx_t2 on t1(c2);
> > create index idx_t3 on t1(c3);
> >
> > insert into t1 values(generate_series(1,500), 1, 10, 'aa');
> > update t1 set c2 = 2;
>
> > postgres=# update t1 set c2 = 2;
> > UPDATE 500
> > Time: 46533.530 ms (00:46.534)
> >
> > With HEAD
> > ==
> > postgres=# update t1 set c2 = 2;
> > UPDATE 500
> > Time: 52529.839 ms (00:52.530)
> >
> > I have dropped and recreated the table after each update in the test.
>
> Good thing that you remembered to drop and recreate the table, since
> otherwise bottom-up index deletion would look really good!
>

I briefly tried that, but the numbers were not consistent, probably
because autovacuum was also on at the time. So I tried switching
off autovacuum and dropping/recreating the tables.

> Besides, this test case is just ludicrous.
>

I think it might be okay to say that we can expect a regression in such
cases: we see benefits in many other cases, so paying some cost here is
acceptable; such scenarios are less common; and such cases are probably
already inefficient, so it doesn't matter much. But I am not sure we can
say they are completely unreasonable. I think this test case depicts the
behavior with bulk updates.

I am not saying that we definitely need to do anything, but acknowledging
that we can regress in some cases without actually removing bloat is not
necessarily a bad thing, because so far none of the tests has shown such
behavior (where we are unable to help with bloat but performance is still
reduced).

> > I have read your patch and have some decent understanding but
> > obviously, you and Victor will have a better idea. I am not sure what
> > I wrote in my previous email which made you say so. Anyway, I hope I
> > have made my point clear this time.
>
> I don't think that you fully understand the insights behind the patch.
> Understanding how the patch works mechanistically is not enough.
>
> This patch is unusual in that you really need to think about emergent
> behaviors to understand it. That is certainly a difficult thing to do,
> and it's understandable that even an expert might not grok it without
> considering it carefully. What annoys me here is that you didn't seem
> to seriously consider the *possibility* that something like that
> *might* be true, even after I pointed it out several times.
>

I am not denying that I could be missing your point, but OTOH you are
also completely refuting the points raised, even though I have shown
them by test and by sharing an example. For example, I understand that
you want to be conservative in triggering the bottom-up cleanup, so you
choose to do it in fewer cases, but we might still want to add a 'Note'
in the code (or README) saying something like: we have considered the
alternative of a page-level flag (to be aggressive in triggering this
optimization) but did not pursue it for such-and-such reasons. I think
this can help future developers think carefully about it if they want to
attempt something like that. You considered it during the early
development phase, and then the same thing occurred to Victor and me as
an interesting optimization to explore, so it can occur to someone else
as well.

-- 
With Regards,
Amit Kapila.




Re: New IndexAM API controlling index vacuum strategies

2021-01-21 Thread Peter Geoghegan
On Tue, Jan 19, 2021 at 2:57 PM Peter Geoghegan  wrote:
> Looks good. I'll give this version a review now. I will do a lot more
> soon. I need to come up with a good benchmark for this, that I can
> return to again and again during review as needed.

I performed another benchmark, similar to the last one but with the
latest version (v2), and over a much longer period. Attached is a
summary of the whole benchmark, and log_autovacuum output from the
logs of both the master branch and the patch.

This was pgbench scale 2000, 4 indexes on pgbench_accounts, and a
transaction with one update and two selects. Each run was 4 hours, and
we alternate between patch and master for each run, and alternate
between 16 and 32 clients. There were eight 4-hour runs in total, meaning
the entire set of runs took 8 * 4 hours = 32 hours (not including
initial load time and a few other small things like that). I used a
10k TPS rate limit, so TPS isn't interesting here. Latency is
interesting -- we see a nice improvement in latency (i.e. a reduction)
for the patch (see all.summary.out).

The benefits of the patch are clearly visible when I drill down and
look at the details. Each pgbench_accounts autovacuum VACUUM operation
can finish faster with the patch because they can often skip at least
some indexes (usually the PK, sometimes 3 out of 4 indexes total). But
it's more subtle than some might assume. We're skipping indexes that
VACUUM actually would have deleted *some* index tuples from, which is
very good. Bottom-up index deletion is usually lazy, and only
occasionally very eager, so you still have plenty of "floating
garbage" index tuples in most pages. And now we see VACUUM behave a
little more like bottom-up index deletion -- it is lazy when that is
appropriate (with indexes that really only have floating garbage that
is spread diffusely throughout the index structure), and eager when
that is appropriate (with indexes that get much more garbage).

The benefit is not really that we're avoiding doing I/O for index
vacuuming (though that is one of the smaller benefits here). The real
benefit is that VACUUM is not dirtying pages, since it skips indexes
when it would be "premature" to vacuum them from an efficiency point
of view. This is important because we know that Postgres throughput is
very often limited by page cleaning. Also, the "economics" of this new
behavior make perfect sense -- obviously it's more efficient to delay
garbage cleanup until the point when the same page will be modified by
a backend anyway -- in the case of this benchmark via bottom-up index
deletion (which deletes all garbage tuples in the leaf page at the
point that it runs for a subset of pointed-to heap pages -- it's not
using an oldestXmin cutoff from 30 minutes ago). So whenever we dirty
a page, we now get more value per additional-page-dirtied.

I believe that controlling the number of pages dirtied by VACUUM is
usually much more important than reducing the amount of read I/O from
VACUUM, for reasons I go into on the recent "vacuum_cost_page_miss
default value and modern hardware" thread. As a further consequence of
all this, VACUUM can go faster safely and sustainably (since the cost
limit is not affected so much by vacuum_cost_page_miss), which has its
own benefits (e.g. oldestXmin cutoff doesn't get so old towards the
end).

Another closely related huge improvement that we see here is that the
number of FPIs generated by VACUUM can be significantly reduced. This
cost is closely related to the cost of dirtying pages, but it's worth
mentioning separately. You'll see some of that in the log_autovacuum
log output I attached.

There is an archive with much more detailed information, including
dumps from most pg_stat_* views at key intervals. This has way more
information than anybody is likely to want:

https://drive.google.com/file/d/1OTiErELKRZmYnuJuczO2Tfcm1-cBYITd/view?usp=sharing

I did notice a problem, though. I now think that the criteria for
skipping an index vacuum in the third patch from the series is too
conservative, and that this led to an excessive number of index
vacuums with the patch. This is probably because there was a tiny
number of page splits in some of the indexes that were not really
supposed to grow. I believe that this is caused by ANALYZE running --
I think that it prevented bottom-up deletion from keeping a few of the
hottest pages from splitting (that can take 5 or 6 seconds) at a few
points over the 32 hour run. For example, the index named "tenner"
grew by 9 blocks, starting out at 230,701 and ending up at 230,710 (to
see this, extract the files from the archive and "diff
patch.r1c16.initial_pg_relation_size.out
patch.r2c32.after_pg_relation_size.out").

I now think that 0 blocks added is unnecessarily restrictive -- a
small tolerance still seems like a good idea, though (let's still be
somewhat conservative about it).

Maybe a better criteria would be for nbtree to always proceed with
index vacuuming when the index 

Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
On Thu, Jan 21, 2021 at 10:01:01PM -0600, Justin Pryzby wrote:
> On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > > | Statistics objects:
> > > > | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> > 
> > Umm, for me that prints:
> 
> > "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> > 
> > which I think is OK. But maybe there's something else to trigger the
> > problem?
> 
> Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
> I think it's considered ok if old client's \d commands don't work on new
> server, but it's not clear to me if it's ok if they misbehave.  It would
> almost be better if it made an ERROR.

I think you'll maybe have to do something better - this seems a bit too weird:

| postgres=# CREATE STATISTICS s2 ON (i+1) ,i FROM t;
| postgres=# \d t
| ...
| "public"."s2" (ndistinct, dependencies, mcv) ON i FROM t

It suggests including additional columns in stxkeys for each expression.
Maybe that also helps give direction to a response to Dean's concern?

That doesn't make old psql do anything more desirable, though.
Unless you also added attributes, all you can do is make it say things like
"columns: ctid".

> In any case, why are there so many parentheses ?

-- 
Justin




Re: Identify missing publications from publisher while create/alter subscription.

2021-01-21 Thread japin


On Fri, 22 Jan 2021 at 00:51, Bharath Rupireddy 
 wrote:
> On Thu, Jan 21, 2021 at 6:56 PM vignesh C  wrote:
>>
>> Hi,
>>
>> Creating/altering subscription is successful when we specify a
>> publication which does not exist in the publisher. I felt we should
>> throw an error in this case, that will help the user to check if there
>> is any typo in the create subscription command or to create the
>> publication before the subscription is created.
>> If the above analysis looks correct, then please find a patch that
>> checks if the specified publications are present in the publisher and
>> throws an error if any of the publications is missing in the
>> publisher.
>> Thoughts?
>
> I was having similar thoughts (while working on  the logical
> replication bug on alter publication...drop table behaviour) on why
> create subscription succeeds without checking the publication
> existence. I checked in documentation, to find if there's a strong
> reason for that, but I couldn't. Maybe it's because of the principle
> "first let users create subscriptions, later the publications can be
> created on the publisher system", similar to this behaviour
> "publications can be created without any tables attached to it
> initially, later they can be added".
>
> Others may have better thoughts.
>
> If we do check publication existence for CREATE SUBSCRIPTION, I think
> we should also do it for ALTER SUBSCRIPTION ... SET PUBLICATION.
>

Agreed. The current patch does not check publication existence for
ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).

> I wonder why dropping a publication from the list of publications
> associated with a subscription is not allowed?
>
> Some comments so far on the patch:
>
> 1) I see most of the code in the new function check_publications() and
> existing fetch_table_list() is the same. Can we have a generic
> function, with maybe a flag to separate out the logic specific for
> checking publication and fetching table list from the publisher.

+1

> 2) Can't we know whether the publications exist on the publisher with
> the existing (or modifying it a bit if required) query in
> fetch_table_list(), so that we can avoid making another connection to
> the publisher system from the subscriber?

IIUC, the patch does not make another connection; it just executes a new
query on the already-established connection.  If we want to check
publication existence for ALTER SUBSCRIPTION ... SET PUBLICATION ...
WITH (refresh = false), we should make another connection.

> 3) If multiple publications are specified in the CREATE SUBSCRIPTION
> query, IIUC, with your patch, the query fails even if at least one of
> the publications doesn't exist. Should we throw a warning in this case
> and allow the subscription be created for other existing
> publications?
>

+1. If none of the specified publications exist, we should throw an error.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Added schema level support for publication.

2021-01-21 Thread vignesh C
Thanks, Rahila, for your comments. Please find my thoughts below:

On Wed, Jan 20, 2021 at 6:27 PM Rahila Syed  wrote:
>
> Hi Vignesh,
>
>>
>> I have handled the above scenario(drop schema should automatically
>> remove the schema entry from publication schema relation) & addition
>> of tests in the new v2 patch attached.
>> Thoughts?
>
>
> Please see some initial comments:
>
> 1. I think there should be more tests to show that the schema data is 
> actually replicated
> to the subscriber.  Currently, I am not seeing the data being replicated when 
> I use FOR SCHEMA.
>
I will fix this issue and include more tests in my next version of the patch.

> 2. How does replication behave when a table is added or removed from a 
> subscribed schema
> using ALTER TABLE SET SCHEMA?
>
I would like to keep the behavior similar to the table behavior. I
will post more details for this along with my next version of the
patch.

> 3. Can we have a default schema like a public or current schema that gets 
> replicated in case the user didn't
> specify one, this can be handy to replicate current schema tables.
>
It looks like a good use case; I will check the feasibility of this
and try to implement it.

> 4. +   The fourth, fifth and sixth variants change which schemas are part of 
> the
> +   publication.  The SET TABLE clause will replace the 
> list
> +   of schemas in the publication with the specified one.  The ADD
>
> There is a typo above s/SET TABLE/SET SCHEMA
I will fix this in the next version of the patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Greg Nancarrow
On Fri, Jan 22, 2021 at 1:16 PM Hou, Zhijie  wrote:
>
> Hi
>
> I took a look at v12-0001 patch, here are some comments:
>
> 1.
> +   /*
> +* Setup the context used in finding the max parallel-mode hazard.
> +*/
> +   Assert(initial_max_parallel_hazard == 0 ||
> +  initial_max_parallel_hazard == PROPARALLEL_SAFE ||
> +  initial_max_parallel_hazard == PROPARALLEL_RESTRICTED);
> +   context.max_hazard = initial_max_parallel_hazard == 0 ?
> +   PROPARALLEL_SAFE : initial_max_parallel_hazard;
>
> I am not quiet sure when will " max_parallel_hazard == 0"
> Does it means the case max_parallel_hazard_context not initialized ?
>

That function doesn't accept a "max_parallel_hazard_context". It
accepts an initial "max_parallel_hazard" value (char).
The "0" value is just a way of specifying "use the default"
(PROPARALLEL_SAFE). It is not currently used, since we just
always pass the "context.max_parallel_hazard" value resulting from the
previous parallel-safety checks for SELECT (and otherwise don't call
that function anywhere else).

>
> 2.
> Some tiny code style suggestions
>
> +   if (con->contype == CONSTRAINT_CHECK)
> +   {
> +   char   *conbin;
> +   Datum   val;
> +   boolisnull;
> +   Expr   *check_expr;
>
> How about :
>
> if (con->contype != CONSTRAINT_CHECK)
> continue;
>
> 3.
> +   if (keycol == 0)
> +   {
> +   /* Found an index expression */
> +
> +   Node   *index_expr;
>
> Like 2, how about:
>
> If (keycol != 0)
> Continue;
>

This is really a programmer style preference (plenty of discussions on
the internet about it), but it can be argued that use of "continue"
here is not quite as clear as the explicit "if" condition, especially
in this very simple one-condition case.
I'm inclined to leave it as is.

>
> 4.
> +   ListCell   *index_expr_item = 
> list_head(index_info->ii_Expressions);
> ...
> +   index_expr = (Node *) 
> lfirst(index_expr_item);
> +   index_expr = (Node *) 
> expression_planner((Expr *) index_expr);
>
> It seems BuildIndexInfo has already called eval_const_expressions for 
> ii_Expressions,
> Like the flow: BuildIndexInfo--> RelationGetIndexExpressions--> 
> eval_const_expressions
>
> So, IMO, we do not need to call expression_planner for the expr again.
>
>
> And there seems another solution for this:
>
> In the patch, We only use the { ii_Expressions , ii_NumIndexAttrs , 
> ii_IndexAttrNumbers } from the IndexInfo,
> which seems can get from "Relation-> rd_index".
>
> Based on above, May be we do not need to call BuildIndexInfo to build the 
> IndexInfo.
> It can avoid some unnecessary cost.
> And in this solution we do not need to remove expression_planner.
>

OK, maybe this is a good idea, but I recall trying to minimize this
kind of processing before, and there were cases that broke it.
Have you actually tried your idea and run all regression tests and
verified that they passed?
In any case, I'll look into it.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
On Fri, Jan 22, 2021 at 04:49:51AM +0100, Tomas Vondra wrote:
> > > | Statistics objects:
> > > | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> Umm, for me that prints:

> "public"."s2" ON ((i + 1)), (((i + 1) + 0)) FROM t
> 
> which I think is OK. But maybe there's something else to trigger the
> problem?

Oh.  It's because I was using /usr/bin/psql and not ./src/bin/psql.
I think it's considered ok if old client's \d commands don't work on new
server, but it's not clear to me if it's ok if they misbehave.  It would
almost be better if it made an ERROR.

In any case, why are there so many parentheses ?

-- 
Justin




RE: libpq debug log

2021-01-21 Thread tsunakawa.ta...@fujitsu.com
Hello Alvaro-san, Iwata-san,


First of all, thank you very much, Alvaro-san, for your great help.  I'm glad 
you haven't lost interest in or time for this patch yet.  (Iwata-san is my 
colleague.)


From: Alvaro Herrera 
> That's true, but it'd require that we move PQtrace() to fe-misc.c, because
> pqTraceInit() uses definitions that are private to that file.

Ah, so that was the reason for the separation.  Then I'm fine with either 
keeping the separation or integrating the two functions in fe-misc.c, with 
PQuntrace() also there.  I somewhat prefer the latter, because of code 
readability and because the tracing facility is not a connection-related 
feature, so categorizing it as "misc" feels natural.


> > What's this code for?  I think it's sufficient that PQuntrace() frees b_msg
> and PQtrace() always allocates b_msg.
> 
> The point is to be able to cope with a connection that calls PQtrace() 
> multiple
> times, with or without an intervening PQuntrace().
> We'd make no friends if we made PQtrace() crash, or leak memory if called a
> second time ...

HEAD's PQtrace() always calls PQuntrace() and then re-initializes from scratch.  
So, if we keep that flow, the user can call PQtrace() multiple times in a row.


> I think trying to apply tabs inside the message contents is going to be more
> confusing than helpful.

Agreed.


> > Plus, you may as well print the invalid message type.  Why don't you do
> something like this:
> >
> > +static const char *
> > +pqGetProtocolMsgType(unsigned char c, PGCommSource commsource) {
> > +   static char unknown_message[64];
> > +   char *msg = NULL;
> >
> > +   if (commsource == FROM_BACKEND && c <
> lengthof(protocol_message_type_b))
> > +   msg = protocol_message_type_b[c];
> > +   else if (commsource == FROM_FRONTEND && c <
> lengthof(protocol_message_type_f))
> > +   msg = protocol_message_type_f[c];
> > +
> > +   if (msg == NULL)
> > +   {
> > +   sprintf(unknown_message, "Unknown message %02x", c);
> > +   msg = unknown_message;
> > +   }
> > +
> > +   return msg;
> > +}
> 
> Right.  I had written something like this, but in the end decided not to 
> bother
> because it doesn't seem terribly important.  You can't do exactly what you
> suggest, because it has the problem that you're returning a stack-allocated
> variable even though your stack is about to be blown away.  My copy had a
> static buffer that was malloc()ed on first use (and if allocation fails, 
> return a
> const string).  Anyway, I fixed the crasher problem.

My suggestion included the static qualifier so as not to use the stack, but 
that doesn't work anyway in multi-threaded applications.  So I agree that we 
should not print the invalid message type value.


> > (18)
> > if (conn->Pfdebug)
> > -   fprintf(conn->Pfdebug, "To backend> %c\n", c);
> > +   pqStoreFrontendMsg(conn, LOG_BYTE1, 1);
> >
> > To match the style for backend messages with pqLogMsgByte1 etc., why
> don't you wrap the conn->Pfdebug check in the macro?
> 
> I think these macros are useless and should be removed.  There's more
> verbosity and confusion caused by them, than the clarity they provide.

Agreed.


> This patch needs more work still, of course.

Yes, she is updating the patch based on the feedback from you, Kirk-san, and me.

By the way, I just looked at the beginning of v12.

-void PQtrace(PGconn *conn, FILE *stream);
+void PQtrace(PGconn *conn, FILE *stream, bits32 flags);

Can we change the signature of an API function?



Iwata-san,
Please integrate Alvaro-san's patch with yours.  Next week is the last week of 
this CF, so do your best to post the patch by next Monday or so (before 
Alvaro-san loses interest or time).  Then I'll review it ASAP.


Regards
Takayuki Tsunakawa



Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Greg Nancarrow
On Thu, Jan 21, 2021 at 7:30 PM Amit Kapila  wrote:
>
> > i.e. code-wise:
> >
> > /*
> > -* We can't support table modification in parallel-mode if
> > it's a foreign
> > -* table/partition (no FDW API for supporting parallel access) or a
> > +* We can't support table modification in a parallel worker if it's 
> > a
> > +* foreign table/partition (no FDW API for supporting parallel
> > access) or a
> >  * temporary table.
> >  */
> > if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> > RelationUsesLocalBuffers(rel))
> > {
> > -   table_close(rel, lockmode);
> > -   context->max_hazard = PROPARALLEL_UNSAFE;
> > -   return true;
> > +   if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, 
> > context))
> > +   {
> > +   table_close(rel, lockmode);
> > +   return true;
> > +   }
> > }
> >
>
> Yeah, these changes look correct to me.
>

Unfortunately, this change results in a single test failure in the
"with" tests when "force_parallel_mode=regress" is in effect.

I have reproduced the problem, by extracting relevant SQL from those
tests, as follows:

CREATE TEMP TABLE bug6051 AS
  select i from generate_series(1,3) as i;
SELECT * FROM bug6051;
CREATE TEMP TABLE bug6051_2 (i int);
CREATE RULE bug6051_ins AS ON INSERT TO bug6051 DO INSTEAD
 INSERT INTO bug6051_2
 SELECT NEW.i;
WITH t1 AS ( DELETE FROM bug6051 RETURNING * )
INSERT INTO bug6051 SELECT * FROM t1;
ERROR:  cannot delete tuples during a parallel operation

Note that prior to the patch, all INSERTs were regarded as
PARALLEL_UNSAFE, so this problem obviously didn't occur.
I believe this INSERT should be regarded as PARALLEL_UNSAFE, because
it contains a modifying CTE.
However, for some reason, the INSERT is not regarded as having a
modifying CTE, so instead of finding it PARALLEL_UNSAFE, it falls into
the parallel-safety-checks and is found to be PARALLEL_RESTRICTED:

The relevant code in standard_planner() is:

if ((cursorOptions & CURSOR_OPT_PARALLEL_OK) != 0 &&
IsUnderPostmaster &&
(parse->commandType == CMD_SELECT ||
 IsModifySupportedInParallelMode(parse->commandType)) &&
!parse->hasModifyingCTE &&
max_parallel_workers_per_gather > 0 &&
!IsParallelWorker())
{
/* all the cheap tests pass, so scan the query tree */
glob->maxParallelHazard = max_parallel_hazard(parse);
glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
}
else
{
/* skip the query tree scan, just assume it's unsafe */
glob->maxParallelHazard = PROPARALLEL_UNSAFE;
glob->parallelModeOK = false;
}

When I debugged this (transformWithClause()), the WITH clause was
found to contain a modifying CTE and for the INSERT
query->hasModifyingCTE was set true.
But somehow in the re-writer code, this got lost.
Bug?
Ideas?

Regards,
Greg Nancarrow
Fujitsu Australia




RE: About to add WAL write/fsync statistics to pg_stat_wal view

2021-01-21 Thread kuroda.hay...@fujitsu.com
Dear Ikeda-san,

This patch cannot be applied to the HEAD, but anyway I put a comment.

```
+   /*
+* Measure i/o timing to fsync WAL data.
+*
+* The wal receiver skip to collect it to avoid performance degradation 
of standy servers.
+* If sync_method doesn't have its fsync method, to skip too.
+*/
+   if (!AmWalReceiverProcess() && track_wal_io_timing && 
fsyncMethodCalled())
+   INSTR_TIME_SET_CURRENT(start);
```

I think m_wal_sync_time should be collected even if the process is WalReceiver,
because all WAL fsyncs should be recorded, and some performance impact has
already been accepted once track_wal_io_timing is turned on.
I think it's strange to special-case only the walreceiver.
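
For reference, the accounting being discussed looks roughly like this (a
sketch only; track_wal_io_timing and the m_wal_sync/m_wal_sync_time fields
are the ones this patch adds, not pre-existing code):

    instr_time  start, duration;

    if (track_wal_io_timing)
        INSTR_TIME_SET_CURRENT(start);

    /* ... the actual fsync/fdatasync of the WAL file ... */

    if (track_wal_io_timing)
    {
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);
        WalStats.m_wal_sync_time += INSTR_TIME_GET_MICROSEC(duration);
    }
    WalStats.m_wal_sync++;

Skipping only the timing part in the walreceiver would still leave the sync
count itself cheap to collect everywhere.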

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: pglz compression performance, take two

2021-01-21 Thread Justin Pryzby
@cfbot: rebased
>From 03fec5d2587cf34a1d1a75d7afdcfbad9cb7ec68 Mon Sep 17 00:00:00 2001
From: Andrey 
Date: Thu, 27 Jun 2019 23:18:21 +0500
Subject: [PATCH] Reorganize pglz compression code

This patch accumulates several changes:
1. Convert macro-functions to regular functions for readability
2. Use more compact hash table with uint16 indexes instead of pointers
3. Avoid prev pointer in hash table
4. Use 4-byte comparisons in a search instead of 1-byte
Total speedup about x1.4
---
 src/common/pg_lzcompress.c | 475 ++---
 1 file changed, 288 insertions(+), 187 deletions(-)

diff --git a/src/common/pg_lzcompress.c b/src/common/pg_lzcompress.c
index fdd527f757..ffa4fe549e 100644
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@@ -118,13 +118,13 @@
  *			our 4K maximum look-back distance is too small.
  *
  *			The compressor creates a table for lists of positions.
- *			For each input position (except the last 3), a hash key is
+ *			For each input position (except the last 4), a hash key is
  *			built from the 4 next input bytes and the position remembered
  *			in the appropriate list. Thus, the table points to linked
  *			lists of likely to be at least in the first 4 characters
  *			matching strings. This is done on the fly while the input
  *			is compressed into the output area.  Table entries are only
- *			kept for the last 4096 input positions, since we cannot use
+ *			kept for the last 4094 input positions, since we cannot use
  *			back-pointers larger than that anyway.  The size of the hash
  *			table is chosen based on the size of the input - a larger table
  *			has a larger startup cost, as it needs to be initialized to
@@ -146,17 +146,15 @@
  *			used for the next tag in the output.
  *
  *			For each subsequent entry in the history list, the "good_match"
- *			is lowered by 10%. So the compressor will be more happy with
- *			short matches the farer it has to go back in the history.
+ *			is lowered roughly by 10%. So the compressor will be more happy
+ *			with short matches the further it has to go back in the history.
  *			Another "speed against ratio" preference characteristic of
  *			the algorithm.
  *
- *			Thus there are 3 stop conditions for the lookup of matches:
+ *			Thus there are 2 stop conditions for the lookup of matches:
  *
  *- a match >= good_match is found
  *- there are no more history entries to look at
- *- the next history entry is already too far back
- *  to be coded into a tag.
  *
  *			Finally the match algorithm checks that at least a match
  *			of 3 or more bytes has been found, because that is the smallest
@@ -173,6 +171,10 @@
  *			Jan Wieck
  *
  * Copyright (c) 1999-2021, PostgreSQL Global Development Group
+ *			Many thanks to Yann Collet for his article about unaligned
+ *			memory access.
+ *
+ *			Leskov Vladimir
  *
  * src/common/pg_lzcompress.c
  * --
@@ -188,12 +190,57 @@
 #include "common/pg_lzcompress.h"
 
 
+/**
+ *  CPU Feature Detection			  *
+ **/
+/* PGLZ_FORCE_MEMORY_ACCESS
+ * By default, access to unaligned memory is controlled by `memcpy()`, which is safe and portable.
+ * Unfortunately, on some target/compiler combinations, the generated assembly is sub-optimal.
+ * The below switch allow to select different access method for improved performance.
+ * Method 0 (default) : use `memcpy()`. Safe and portable.
+ * Method 1 : direct access. This method is portable but violate C standard.
+ *			It can generate buggy code on targets which assembly generation depends on alignment.
+ *			But in some circumstances, it's the only known way to get the most performance (ie GCC + ARMv6)
+ * See https://fastcompression.blogspot.fr/2015/08/accessing-unaligned-memory.html for details.
+ * Prefer these methods in priority order (0 > 1)
+ */
+#ifndef PGLZ_FORCE_MEMORY_ACCESS	/* can be defined externally */
+#if defined(__GNUC__) && \
+  ( defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || defined(__ARM_ARCH_6K__) \
+  || defined(__ARM_ARCH_6Z__) || defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__) )
+#define PGLZ_FORCE_MEMORY_ACCESS 1
+#endif
+#endif
+
+#if defined(PGLZ_FORCE_MEMORY_ACCESS) && (PGLZ_FORCE_MEMORY_ACCESS==1)
+/* lie to the compiler about data alignment; use with caution */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+	return *(const uint32 *) ptr;
+}
+
+#else			/* safe and portable access using memcpy() */
+
+static uint32
+pglz_read32(const void *ptr)
+{
+	uint32		val;
+
	memcpy(&val, ptr, sizeof(val));
+	return val;
+}
+
+#endif			/* PGLZ_FORCE_MEMORY_ACCESS */
+
+
 /* --
  * Local definitions
  * --
  */
 #define PGLZ_MAX_HISTORY_LISTS	8192	/* must be power of 2 */
-#define PGLZ_HISTORY_SIZE		4096
+#define PGLZ_HISTORY_SIZE		0x0fff - 1	/* to avoid compare in iteration */
 #define PGLZ_MAX_MATCH			273
 
 
@@ -202,17 +249,16 @@
  *
  *		Linked list for 

Re: Single transaction in the tablesync worker?

2021-01-21 Thread Amit Kapila
On Thu, Jan 21, 2021 at 3:47 PM Amit Kapila  wrote:
>
> On Tue, Jan 19, 2021 at 2:32 PM Peter Smith  wrote:
> >
> > Hi Amit.
> >
> > PSA the v17 patch for the Tablesync Solution1.
> >
>
> Thanks for the updated patch. Below are few comments:
>

One more comment:

In LogicalRepSyncTableStart(), you are trying to remove the slot on
the failure of copy which won't work if the publisher is down. If that
happens on restart of tablesync worker, we will retry to create the
slot with the same name and it will fail because the previous slot is
still not removed from the publisher. I think the same problem can
happen if, after an error in the tablesync worker, we drop the
subscription before the tablesync worker gets a chance to restart. So, to
avoid these problems, can we use a TEMPORARY slot for tablesync
workers as previously? If I remember correctly, the main problem was that
we don't know where to start decoding if we fail in the catchup phase. But
for that origins should be sufficient because if we fail before copy
then anyway we have to create a new slot and origin but if we fail
after copy then we can use the start_decoding_position from the
origin. So before copy, we still need to use CRS_USE_SNAPSHOT while
creating a temporary slot but if we are already in FINISHED COPY state
at the start of tablesync worker then create a slot with
CRS_NOEXPORT_SNAPSHOT option and then use origin's start_pos and
proceed decoding changes from that point onwards similar to how
currently the apply worker works.
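
To sketch what that flow could look like (a rough sketch only;
SUBREL_STATE_FINISHEDCOPY is the state this patch introduces, the variable
names are illustrative, and error handling is omitted):

    if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
    {
        /* copy already done: temporary slot, no snapshot needed */
        walrcv_create_slot(wrconn, slotname, true /* temporary */,
                           CRS_NOEXPORT_SNAPSHOT, &slot_lsn);

        /* resume decoding from wherever the origin already recorded */
        origin_startpos = replorigin_session_get_progress(false);
    }
    else
    {
        /* initial copy: need the slot's exported snapshot for the COPY */
        walrcv_create_slot(wrconn, slotname, true /* temporary */,
                           CRS_USE_SNAPSHOT, &slot_lsn);

        /* ... do the copy, then advance the origin to slot_lsn ... */
    }

This is only meant to illustrate the idea, not the exact code.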

-- 
With Regards,
Amit Kapila.




Re: [PoC] Non-volatile WAL buffer

2021-01-21 Thread Tomas Vondra




On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
 wrote:


Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch                  1      16      32      64      96

master               7291   87704  165310  150437  224186
ntt                  7912  106095  213206  212410  237819
simple-no-buffers    7654   96544  115416   95828  103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what the root cause is, but I have a couple of hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate this possibility.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run looks like this:

xlog stats cnt 4300
   map cnt 100 time 5448333 unmap cnt 100 time 3730963
   memcpy cnt 985964 time 1550442272 len 15150499
   memset cnt 0 time 0 len 0
   persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100  mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(


It might also be interesting if we can see how much time spent on each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().



Yeah, we could extend it to that; it's a fairly mechanical thing. But
maybe that would be visible in a regular perf profile anyway. Also, I suppose
most of the time will be spent in the pmem calls, shown in the stats.




My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch                  1      16      32      64      96

master               7291   87704  165310  150437  224186
ntt                  7912  106095  213206  212410  237819
simple-no-buffers    7654   96544  115416   95828  103065
with-wal-buffers     7477   95454  181702  140167  214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.
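
(For readers without libpmem at hand, the "simple" approach boils down to
something like this, heavily simplified and with error handling omitted:

    /* map an existing WAL segment that lives on a DAX/PMEM filesystem */
    size_t      mapped_len;
    int         is_pmem;
    char       *seg = pmem_map_file(path, 0, 0, 0, &mapped_len, &is_pmem);

    /* copy the record data straight into the segment, then make it durable */
    pmem_memcpy_nodrain(seg + offset, data, len);
    pmem_drain();

i.e. every backend does its own mapping, copying and persisting, which is
exactly what the debug stats above are measuring.)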



While looking at the two methods: 

Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Justin Pryzby
This already needs to be rebased on 55dc86eca.

And needs to update rules.out.

And doesn't address this one:

On Sun, Jan 17, 2021 at 10:53:31PM -0600, Justin Pryzby wrote:
> | postgres=# CREATE TABLE t(i int);
> | postgres=# CREATE STATISTICS s2 ON (i+1) ,(i+1+0) FROM t;
> | postgres=# \d t
> |  Table "public.t"
> |  Column |  Type   | Collation | Nullable | Default 
> | +-+---+--+-
> |  i  | integer |   |  | 
> | Indexes:
> | "t_i_idx" btree (i)
> | Statistics objects:
> | "public"."s2" (ndistinct, dependencies, mcv) ON  FROM t
> 
> on ... what ?

-- 
Justin




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Hou, Zhijie
Hi

I took a look at v12-0001 patch, here are some comments:

1.
+   /*
+* Setup the context used in finding the max parallel-mode hazard.
+*/
+   Assert(initial_max_parallel_hazard == 0 ||
+  initial_max_parallel_hazard == PROPARALLEL_SAFE ||
+  initial_max_parallel_hazard == PROPARALLEL_RESTRICTED);
+   context.max_hazard = initial_max_parallel_hazard == 0 ?
+   PROPARALLEL_SAFE : initial_max_parallel_hazard;

I am not quite sure when "max_parallel_hazard == 0" can happen.
Does it mean the case where max_parallel_hazard_context is not initialized?


2.
Some tiny code style suggestions

+   if (con->contype == CONSTRAINT_CHECK)
+   {
+   char   *conbin;
+   Datum   val;
+   boolisnull;
+   Expr   *check_expr;

How about:

if (con->contype != CONSTRAINT_CHECK)
    continue;

3.
+   if (keycol == 0)
+   {
+   /* Found an index expression */
+
+   Node   *index_expr;

Like 2, how about:

if (keycol != 0)
    continue;


4.
+   ListCell   *index_expr_item = 
list_head(index_info->ii_Expressions);
...
+   index_expr = (Node *) 
lfirst(index_expr_item);
+   index_expr = (Node *) 
expression_planner((Expr *) index_expr);

It seems BuildIndexInfo has already called eval_const_expressions for
ii_Expressions, like the flow:
BuildIndexInfo --> RelationGetIndexExpressions --> eval_const_expressions

So, IMO, we do not need to call expression_planner for the expr again.


And there seems another solution for this:

In the patch, We only use the { ii_Expressions , ii_NumIndexAttrs , 
ii_IndexAttrNumbers } from the IndexInfo,
which seems can get from "Relation-> rd_index".

Based on above, May be we do not need to call BuildIndexInfo to build the 
IndexInfo.
It can avoid some unnecessary cost.
And in this solution we do not need to remove expression_planner.

What do you think ?

Best regards,
houzj



Re: OpenSSL connection setup debug callback issue

2021-01-21 Thread Michael Paquier
On Thu, Jan 21, 2021 at 05:01:15PM +0900, Michael Paquier wrote:
> This is interesting for debugging, +1 for applying what you have
> here, and this works for 1.0.1~3.0.0.  Worth noting that this returns
> a static string, as per ssl_stat.c.

Done as of af0e79c, after an indentation.
--
Michael


signature.asc
Description: PGP signature


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Greg Nancarrow
On Fri, Jan 22, 2021 at 12:08 PM Hou, Zhijie  wrote:
>
> >
> > I think that's a good idea, so I'll make that update in the next version
> > of the patch.
> > I do notice, however, that there seems to be quite a few places in the 
> > Postgres
> > code where RelationGetIndexList() is being called without a corresponding
> > list_free() being called - obviously instead relying on it being deallocated
> > when the caller's memory-context is destroyed.
> >
>
> Yes, it will be deallocated when the caller's memory-context is destroyed.
>
> Currently, parallel safety-check check each partition.
> I am just a little worried about if there are lots of partition here, it may 
> cause high memory use.
>
> And there is another place like this:
>
> 1.
> +   conbin = TextDatumGetCString(val);
> +   check_expr = stringToNode(conbin);
>
> It seems we can free the cobin when not used(for the same reason above).
> What do you think ?
>
>

Yes, I think you're right, we should pfree conbin after converting to
Node, to minimize memory usage.
Again, it's interesting that existing Postgres code, when looping
through all of the constraints, doesn't do this.
Hmmm. I'm wondering if there is a performance reason behind this -
avoiding multiple calls to pfree() and just relying on it all being
deallocated in one hit, when the memory context is destroyed.
Anyway, perhaps the concern about many partitions and the recursive
nature of these checks overrides that, given the possible high memory
usage you mention.
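
Concretely, I think the cleanup we're talking about is just (sketch only,
exact placement to be decided):

    index_oid_list = RelationGetIndexList(rel);
    foreach(lc, index_oid_list)
    {
        /* ... per-index parallel-safety checks ... */
    }
    list_free(index_oid_list);

    conbin = TextDatumGetCString(val);
    check_expr = stringToNode(conbin);
    pfree(conbin);      /* the parsed node tree no longer needs the string */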

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Support for NSS as a libpq TLS backend

2021-01-21 Thread Jacob Champion
On Thu, 2021-01-21 at 14:21 +0900, Michael Paquier wrote:
> Also, what's the minimum version of NSS that would be supported?  It
> would be good to define an acceptable older version, to keep that
> documented and to track that perhaps with some configure checks (?),
> similarly to what is done for OpenSSL.

Some version landmarks:

- 3.21 adds support for extended master secret, which according to [1]
is required for SCRAM channel binding to actually be secure.
- 3.26 is Debian Stretch.
- 3.28 is Ubuntu 16.04, and RHEL6 (I think).
- 3.35 is Ubuntu 18.04.
- 3.36 is RHEL7 (I think).
- 3.39 gets us final TLS 1.3 support.
- 3.42 is Debian Buster.
- 3.49 is Ubuntu 20.04.

(I'm having trouble finding online package information for RHEL variants, so 
I've pulled those versions from online support docs. If someone notices that 
those are wrong please speak up.)
So 3.39 would guarantee TLS1.3 but exclude a decent chunk of still-
supported Debian-alikes. Anything less than 3.21 seems actively unsafe
unless we disable SCRAM with those versions.
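
(If we do settle on a floor like 3.21, a configure-time probe could be as
simple as the following sketch, run with AC_RUN_IFELSE or similar:

    #include <nss.h>

    int
    main(void)
    {
        /* NSS_VersionCheck() returns true if the runtime NSS is at least
         * the given version */
        return NSS_VersionCheck("3.21") ? 0 : 1;
    }

where "3.21" is just the candidate minimum discussed above.)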

Any other important landmarks (whether feature- or distro-related) we
need to consider?

--Jacob

[1] https://tools.ietf.org/html/rfc7677#section-4


Re: vacuum_cost_page_miss default value and modern hardware

2021-01-21 Thread Peter Geoghegan
On Thu, Jan 14, 2021 at 8:34 PM Masahiko Sawada  wrote:
> +1 for this change. Lowering to 2 also looks good to me.

I'm going to go ahead with committing my patch to lower the default
next week. If anybody has any objections to that plan, please speak
up.

It doesn't really need to be said again, but just to be clear: I share
the concerns held by Magnus and Robert. It's certainly true that this
change alone is no fix for the general problem I described. At the
same time I believe that the patch makes incremental progress in the
right direction, without much risk.

-- 
Peter Geoghegan




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Hou, Zhijie
> >
> > +
> > +   index_oid_list = RelationGetIndexList(rel);
> > ...
> >
> > As memtioned in the comments of RelationGetIndexList:
> > * we return a copy of the list palloc'd in the caller's context.  The
> > caller
> > * may list_free() the returned list after scanning it.
> >
> > Shall we list_free(index_oid_list) at the end of function ?
> > Just to avoid potential memory leak.
> >
> 
> I think that's a good idea, so I'll make that update in the next version
> of the patch.
> I do notice, however, that there seems to be quite a few places in the 
> Postgres
> code where RelationGetIndexList() is being called without a corresponding
> list_free() being called - obviously instead relying on it being deallocated
> when the caller's memory-context is destroyed.
> 

Yes, it will be deallocated when the caller's memory-context is destroyed.

Currently, the parallel safety check checks each partition.
I am just a little worried that if there are lots of partitions here, it may
cause high memory use.

And there is another place like this:

1.
+   conbin = TextDatumGetCString(val);
+   check_expr = stringToNode(conbin);

It seems we can free conbin when it is no longer used (for the same reason as
above).
What do you think?


Best regards,
houzj




Re: [PATCH] Full support for index LP_DEAD hint bits on standby

2021-01-21 Thread Michail Nikolaev
Hello, everyone.

Oh, I just realized that it seems like I was too naive to allow
standby to set LP_DEAD bits this way.
There is a possible consistency problem in the case of low
minRecoveryPoint value (because hint bits do not move PageLSN
forward).

Something like this:

LSN=10  STANDBY INSERTS NEW ROW IN INDEX (index_lsn=10)
<---minRecoveryPoint will go here
LSN=20  STANDBY DELETES ROW FROM HEAP, INDEX UNTOUCHED (index_lsn=10)
   REPLICA SCANS INDEX AND SETS hint bits (index_lsn=10)
   INDEX IS FLUSHED (minRecoveryPoint=index_lsn=10)
   CRASH

After crash recovery, the standby will be able to handle queries once it
has replayed up to LSN=10. But the index page contains hint bits from the
future (LSN=20).
So, need to think here.

Thanks,
Michail.




Re: PATCH: Batch/pipelining support for libpq

2021-01-21 Thread Zhihong Yu
Hi,

+   commandFailed(st, "SQL", "\\gset and \\aset are not
allowed in a batch section");

It seems '\\gset or \\aset is not ' would correspond to the check more
closely.

+   if (my_command->argc != 1)
+   syntax_error(source, lineno, my_command->first_line,
my_command->argv[0],

Is it possible that my_command->argc == 0 (in which case my_command->argv[0]
shouldn't be accessed)?

+   appendPQExpBufferStr(&conn->errorMessage,
+ libpq_gettext("cannot queue commands
during COPY\n"));
+   return false;
+   break;

Is the break necessary ? Similar comment for pqBatchProcessQueue().

+int
+PQexitBatchMode(PGconn *conn)

Since either 0 or 1 is returned, maybe change the return type to bool.
Also, the function operates on PGconn - should the function be static
(pqBatchProcessQueue is) ?

Cheers

On Thu, Jan 21, 2021 at 3:39 PM Alvaro Herrera 
wrote:

> Thanks David Johnston and Daniel Vérité, I have incorporated your
> changes into this patch, which is now v26.  Also, it's been rebased on
> current sources.
>
> I've been using the new PQtrace() stuff to verify the behavior of the
> new feature.  It's not perfect, but at least it doesn't crash
> immediately as it did when I tried a few weeks ago.  There are
> imperfections that I think are due to bugs in the PQtrace
> implementation, not in this patch.
>
> --
> Álvaro Herrera39°49'30"S 73°17'W
> "El conflicto es el camino real hacia la unión"
>


Re: PATCH: Batch/pipelining support for libpq

2021-01-21 Thread Alvaro Herrera
As you can see in an XXX comment in the libpq test program, the current
implementation has the behavior that PQgetResult() returns NULL after a
batch is finished and has reported PGRES_BATCH_END.  I don't know if
there's a hard reason to do that, but I'd like to supress it because it
seems weird and out of place.
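
To illustrate with a sketch (based on the behavior described above, not the
actual test code):

    res = PQgetResult(conn);
    if (PQresultStatus(res) != PGRES_BATCH_END)
        fprintf(stderr, "expected PGRES_BATCH_END, got %s\n",
                PQresStatus(PQresultStatus(res)));
    PQclear(res);

    /* XXX with the current code one more call still returns NULL here */
    res = PQgetResult(conn);
    if (res != NULL)
        fprintf(stderr, "expected NULL after PGRES_BATCH_END\n");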

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: PATCH: Batch/pipelining support for libpq

2021-01-21 Thread Alvaro Herrera
Thanks David Johnston and Daniel Vérité, I have incorporated your
changes into this patch, which is now v26.  Also, it's been rebased on
current sources.

I've been using the new PQtrace() stuff to verify the behavior of the
new feature.  It's not perfect, but at least it doesn't crash
immediately as it did when I tried a few weeks ago.  There are
imperfections that I think are due to bugs in the PQtrace
implementation, not in this patch.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"El conflicto es el camino real hacia la unión"
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 2bb3bf77e4..d22b866677 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -3099,6 +3099,30 @@ ExecStatusType PQresultStatus(const PGresult *res);

   
  
+
+ 
+  PGRES_BATCH_END
+  
+   
+The PGresult represents the end of a batch.
+This status occurs only when batch mode has been selected.
+   
+  
+ 
+
+ 
+  PGRES_BATCH_ABORTED
+  
+   
+The PGresult represents a batch that's
+received an error from the server.  PQgetResult
+must be called repeatedly, and it will return this status code,
+until the end of the current batch, at which point it will return
+PGRES_BATCH_END and normal processing can resume.
+   
+  
+ 
+
 
 
 If the result status is PGRES_TUPLES_OK or
@@ -4857,6 +4881,488 @@ int PQflush(PGconn *conn);
 
  
 
+ 
+  Batch Mode
+
+  
+   libpq
+   batch mode
+  
+
+  
+   pipelining
+   in libpq
+  
+
+  
+   libpq batch mode allows applications to
+   send a query without having to read the result of the previously
+   sent query.  Taking advantage of the batch mode, a client will wait
+   less for the server, since multiple queries/results can be sent/
+   received in a single network transaction.
+  
+
+  
+   While batch mode provides a significant performance boost, writing
+   clients using the batch mode is more complex because it involves
+   managing a queue of pending queries and finding which result
+   corresponds to which query in the queue.
+  
+
+  
+   Batch mode also generally consumes more memory on both the client and server,
+   though careful and aggressive management of the send/receive queue can mitigate
+   this.  This applies whether or not the connection is in blocking or non-blocking
+   mode.
+  
+
+  
+   Using Batch Mode
+
+   
+To issue batches the application must switch a connection into batch mode.
+Enter batch mode with 
+or test whether batch mode is active with
+.
+In batch mode, only asynchronous operations
+are permitted, and COPY is not recommended as it
+may trigger failure in batch processing.  Using any synchronous
+command execution functions, such as PQfn, or
+PQexec and its sibling functions, is an error
+condition.
+   
+
+   
+
+ It is best to use batch mode with libpq in
+ non-blocking mode. If used
+ in blocking mode it is possible for a client/server deadlock to occur.
+  
+   
+The client will block trying to send queries to the server, but the
+server will block trying to send results to the client from queries
+it has already processed. This only occurs when the client sends
+enough queries to fill its output buffer and the server's receive
+buffer before switching to processing input from the server,
+but it's hard to predict exactly when that will happen.
+   
+  
+
+   
+
+   
+Issuing Queries
+
+
+ After entering batch mode the application dispatches requests using
+ normal asynchronous libpq functions, such as:
+ PQsendQueryParams, PQsendPrepare,
+ PQsendQueryPrepared, PQsendDescribePortal,
+ and PQsendDescribePrepared.
+ The asynchronous requests are followed by a
+ 
+ call to mark the end of the batch. The client need not
+ call PQgetResult immediately after
+ dispatching each operation.
+ Result processing
+ is handled separately.
+
+
+
+ The server executes statements, and returns results, in the order the
+ client sends them.  The server will begin executing the batch commands
+ immediately, not waiting for the end of the batch.
+ If any statement encounters an error the server aborts the current
+ transaction and skips processing the rest of the batch.
+ Query processing resumes after the end of the failed batch.
+
+
+
+ It's fine for one operation to depend on the results of a
+ prior one. One query may define a table that the next query in the same
+ batch uses; similarly, an application may create a named prepared statement
+ then execute it with later statements in the same batch.
+
+   
+
+   
+Processing Results
+
+  

Re: [PATCH 1/1] Fix detection of pwritev support for OSX.

2021-01-21 Thread Sergey Shinderuk

On 22.01.2021 01:17, James Hilliard wrote:

On Thu, Jan 21, 2021 at 11:38 AM Tom Lane  wrote:


James Hilliard  writes:

On Wed, Jan 20, 2021 at 4:07 PM Tom Lane  wrote:

I'm not sure that the case of not having the "command line tools"
installed is interesting for our purposes.  AFAIK you have to have
that in order to have access to required tools like bison and gmake.
(That reminds me, I was intending to add something to our docs
about how-to-build-from-source to say that you need to install those.)



Yeah, not 100% sure but I was able to build just fine after deleting my
command line tools.


Hm.  I've never been totally clear on what's included in the "command line
tools", although it's now apparent that one thing that gets installed is
an SDK matching the host OS version.  However, Apple's description at [1]
says

 Command Line Tools

 Download the macOS SDK, headers, and build tools such as the Apple
 LLVM compiler and Make. These tools make it easy to install open
 source software or develop on UNIX within Terminal. macOS can
 automatically download these tools the first time you try to build
 software, and they are available on the downloads page.

which certainly strongly implies that gmake is not there otherwise.
At this point I lack any "bare" macOS system to check it on.  I wonder
whether you have a copy of make available from MacPorts or Homebrew.
Or maybe uninstalling the command line tools doesn't really remove
everything?

Yeah, not entirely sure there but I do use homebrew.



FWIW, I tested with a clean install of Catalina. Before I install 
anything at all, I already have xcode-select, xcrun and all the shims in 
/usr/bin for developer tools, including cc, make, git, xcodebuild... 
Just about everything listed in the FILES section of "man xcode-select".


When I run any tool (except xcode-select), a GUI dialog pops up offering 
to install the Command Line Tools. So apparently those shims are not 
functional yet. I rejected the installation.


Instead I downloaded Xcode12.1.xip via [1], the latest version with 
macosx10.15 SDK. I unpacked it and installed by dragging Xcode.app to 
/Applications. It seems to me there is no magic behind the scenes, just 
moving the directory. I selectively checked that the shims in /usr/bin 
didn't change after that.


Now, running "cc" tells me that I have to accept the Xcode license 
agreement. After accepting it, all the shims in /usr/bin start to work, 
forwarding to the real tools inside Xcode.app.


If I run the Homebrew installer, it says that it's going to install the 
Command Line Tools. I don't know why it needs them, all the tools are 
there already. I thought that CLT is a lighter-weight option when you 
don't want the full Xcode installation, but Homebrew requires them anyway.


I declined to install the CLT and abandoned Homebrew. Then I just cloned and
built Postgres successfully. So it looks like Xcode alone is really enough, at
least on a recent macOS version.



[1] https://xcodereleases.com

--
Sergey Shinderuk
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Avoiding smgrimmedsync() during nbtree index builds

2021-01-21 Thread Andres Freund
Hi,

On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> On 21/01/2021 22:36, Andres Freund wrote:
> > A quick hack (probably not quite correct!) to evaluate the benefit shows
> > that the attached script takes 2m17.223s with the smgrimmedsync and
> > 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> > the former case, CPU bound in the latter.
> >
> > Creating lots of tables with indexes (directly or indirectly through
> > relations having a toast table) is pretty common, particularly after the
> > introduction of partitioning.
> >
> >
> > Thinking through the correctness of replacing smgrimmedsync() with sync
> > requests, the potential problems that I can see are:
> >
> > 1) redo point falls between the log_newpage() and the
> > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > 2) redo point falls between write() and register_dirty_segment()
> >
> > But both of these are fine in the context of initially filling a newly
> > created relfilenode, as far as I can tell? Otherwise the current
> > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
>
> Hmm. If the redo point falls between write() and the
> register_dirty_segment(), and the checkpointer finishes the whole checkpoint
> before register_dirty_segment(), you are not safe. That can't happen with
> write from the buffer manager, because the checkpointer would block waiting
> for the flush of the buffer to finish.

Hm, right.

The easiest way to address that race would be to just record the redo
pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
checkpoint started since the start of the index build.

Another approach would be to utilize PGPROC.delayChkpt, but I would
rather not unnecessarily expand the use of that.

It's kind of interesting - in my aio branch I moved the
register_dirty_segment() to before the actual asynchronous write (due to
availability of the necessary data), which ought to be safe because of
the buffer interlocking. But that doesn't work here, or for other places
doing writes without going through s_b.  It'd be great if we could come
up with a general solution, but I don't immediately see anything great.

The best I can come up with is adding helper functions to wrap some of
the complexity for "unbuffered" writes of doing an immedsync iff the
redo pointer changed. Something very roughly like

typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} 
UnbufferedWriteState;
void unbuffered_prep(UnbufferedWriteState* state);
void unbuffered_write(UnbufferedWriteState* state, ...);
void unbuffered_extend(UnbufferedWriteState* state, ...);
void unbuffered_finish(UnbufferedWriteState* state);

which wouldn't just do the dance to avoid the immedsync() if possible,
but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
if we get that [1]).
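
For instance, unbuffered_prep()/unbuffered_finish() could be little more than
the following (a sketch; the staleness of GetRedoRecPtr()'s cached value would
need more care in a real patch):

void
unbuffered_prep(UnbufferedWriteState *state)
{
	/* note the redo pointer before any writes that bypass shared buffers */
	state->redo = GetRedoRecPtr();
	state->numwrites = 0;
}

void
unbuffered_finish(UnbufferedWriteState *state, SMgrRelation smgr,
				  ForkNumber forknum)
{
	/*
	 * If no checkpoint began since unbuffered_prep(), the sync requests
	 * registered by smgrwrite()/smgrextend() suffice; otherwise fall back
	 * to the immediate sync, as today.
	 */
	if (GetRedoRecPtr() != state->redo)
		smgrimmedsync(smgr, forknum);
}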

Greetings,

Andres Freund

[1] 
https://www.postgresql.org/message-id/20210112193431.2edcz776qjen7kao%40alap3.anarazel.de




Very misleading documentation for PQreset()

2021-01-21 Thread Tom Lane
I happened to notice that PQreset is documented thus:

This function will close the connection to the server and attempt to
reestablish a new connection to the same server, using all the same
parameters previously used.

Since we invented multi-host connection parameters, a reasonable person
would assume that "to the same server" means we promise to reconnect to
the same host we selected the first time.  There is no such guarantee
though; the new connection attempt is done just like the first one,
so it will select the first suitable server in the list.

I think we should just drop that phrase.  Alternatively we could decide
that the code's behavior is buggy, but I don't think it is.  If, say,
the reason you need to reset is that your existing host died, you don't
really want libpq to refuse to select an alternative server.

regards, tom lane




Re: [PATCH 1/1] Fix detection of pwritev support for OSX.

2021-01-21 Thread James Hilliard
On Thu, Jan 21, 2021 at 11:38 AM Tom Lane  wrote:
>
> James Hilliard  writes:
> > On Wed, Jan 20, 2021 at 4:07 PM Tom Lane  wrote:
> >> I'm not sure that the case of not having the "command line tools"
> >> installed is interesting for our purposes.  AFAIK you have to have
> >> that in order to have access to required tools like bison and gmake.
> >> (That reminds me, I was intending to add something to our docs
> >> about how-to-build-from-source to say that you need to install those.)
>
> > Yeah, not 100% sure but I was able to build just fine after deleting my
> > command line tools.
>
> Hm.  I've never been totally clear on what's included in the "command line
> tools", although it's now apparent that one thing that gets installed is
> an SDK matching the host OS version.  However, Apple's description at [1]
> says
>
> Command Line Tools
>
> Download the macOS SDK, headers, and build tools such as the Apple
> LLVM compiler and Make. These tools make it easy to install open
> source software or develop on UNIX within Terminal. macOS can
> automatically download these tools the first time you try to build
> software, and they are available on the downloads page.
>
> which certainly strongly implies that gmake is not there otherwise.
> At this point I lack any "bare" macOS system to check it on.  I wonder
> whether you have a copy of make available from MacPorts or Homebrew.
> Or maybe uninstalling the command line tools doesn't really remove
> everything?
Yeah, not entirely sure there but I do use homebrew.
>
> > It would be pretty annoying to have to install an outdated SDK just to
> > build postgres for no other reason than the autoconf feature detection
> > being broken.
>
> It's only as "outdated" as your host system ;-).  Besides, it doesn't
> look like Apple's really giving you a choice not to.
The newer SDK should work fine as long as the autoconf feature
detection is fixed somehow.
>
> The long and short of this is that I'm unwilling to buy into maintaining
> our own substitutes for standard autoconf probes in order to make it
> possible to use the wrong SDK version.  The preadv/pwritev case is already
> messy enough, and I fear that trying to support such scenarios is going to
> lead to more and more pain in the future.
Well, it's actually a larger issue: if it isn't fixed, then the ability to
change MACOSX_DEPLOYMENT_TARGET doesn't work properly. That's a problem not
only for the case of having a newer SDK on an older host; it would also
prevent MACOSX_DEPLOYMENT_TARGET from working in general, such as for
building with support for older targets from newer hosts. I'll see if there's
maybe a better way to fix the feature detection that's less of a maintenance
issue.
>
> regards, tom lane
>
> [1] https://developer.apple.com/xcode/features/




Re: strange error reporting

2021-01-21 Thread Tom Lane
I wrote:
> If I don't hear any other opinions, I'll change these messages to
> "connection to server at socket \"%s\" failed: "
> "connection to server at \"%s\" (%s), port %s failed: "

Done.  Also, here is a patch to remove the redundant-seeming prefixes
from our reports of connection failures.  My feeling that this is the
right thing was greatly increased when I noticed that psql, as well as
a few other programs, already did it like this.  (I still favor
Alvaro's patch for the back branches, though.)

regards, tom lane

diff --git a/contrib/oid2name/oid2name.c b/contrib/oid2name/oid2name.c
index 5a884e2904..65cce49993 100644
--- a/contrib/oid2name/oid2name.c
+++ b/contrib/oid2name/oid2name.c
@@ -347,8 +347,7 @@ sql_conn(struct options *my_opts)
 	/* check to see that the backend connection was successfully made */
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
-		pg_log_error("could not connect to database %s: %s",
-	 my_opts->dbname, PQerrorMessage(conn));
+		pg_log_error("%s", PQerrorMessage(conn));
 		PQfinish(conn);
 		exit(1);
 	}
diff --git a/contrib/vacuumlo/vacuumlo.c b/contrib/vacuumlo/vacuumlo.c
index cdc2c02b8e..dcb95c4320 100644
--- a/contrib/vacuumlo/vacuumlo.c
+++ b/contrib/vacuumlo/vacuumlo.c
@@ -124,8 +124,7 @@ vacuumlo(const char *database, const struct _param *param)
 	/* check to see that the backend connection was successfully made */
 	if (PQstatus(conn) == CONNECTION_BAD)
 	{
-		pg_log_error("connection to database \"%s\" failed: %s",
-	 database, PQerrorMessage(conn));
+		pg_log_error("%s", PQerrorMessage(conn));
 		PQfinish(conn);
 		return -1;
 	}
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 2bb3bf77e4..5f5faaa0b7 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -6837,8 +6837,8 @@ main(void)
 
 if (PQstatus(conn) != CONNECTION_OK)
 {
-fprintf(stderr, "Connection to database failed: %s",
-PQerrorMessage(conn));
+/* PQerrorMessage's result includes a trailing newline */
+fprintf(stderr, "%s", PQerrorMessage(conn));
 PQfinish(conn);
 return 1;
 }
@@ -8296,8 +8296,7 @@ main(int argc, char **argv)
 /* Check to see that the backend connection was successfully made */
 if (PQstatus(conn) != CONNECTION_OK)
 {
-fprintf(stderr, "Connection to database failed: %s",
-PQerrorMessage(conn));
+fprintf(stderr, "%s", PQerrorMessage(conn));
 exit_nicely(conn);
 }
 
@@ -8466,8 +8465,7 @@ main(int argc, char **argv)
 /* Check to see that the backend connection was successfully made */
 if (PQstatus(conn) != CONNECTION_OK)
 {
-fprintf(stderr, "Connection to database failed: %s",
-PQerrorMessage(conn));
+fprintf(stderr, "%s", PQerrorMessage(conn));
 exit_nicely(conn);
 }
 
@@ -8694,8 +8692,7 @@ main(int argc, char **argv)
 /* Check to see that the backend connection was successfully made */
 if (PQstatus(conn) != CONNECTION_OK)
 {
-fprintf(stderr, "Connection to database failed: %s",
-PQerrorMessage(conn));
+fprintf(stderr, "%s", PQerrorMessage(conn));
 exit_nicely(conn);
 }
 
diff --git a/doc/src/sgml/lobj.sgml b/doc/src/sgml/lobj.sgml
index 413eda50af..6d46da42e2 100644
--- a/doc/src/sgml/lobj.sgml
+++ b/doc/src/sgml/lobj.sgml
@@ -939,8 +939,7 @@ main(int argc, char **argv)
 /* check to see that the backend connection was successfully made */
 if (PQstatus(conn) != CONNECTION_OK)
 {
-fprintf(stderr, "Connection to database failed: %s",
-PQerrorMessage(conn));
+fprintf(stderr, "%s", PQerrorMessage(conn));
 exit_nicely(conn);
 }
 
diff --git a/src/bin/pg_dump/pg_backup_db.c b/src/bin/pg_dump/pg_backup_db.c
index 5ba43441f5..2856c16e85 100644
--- a/src/bin/pg_dump/pg_backup_db.c
+++ b/src/bin/pg_dump/pg_backup_db.c
@@ -188,12 +188,10 @@ ConnectDatabase(Archive *AHX,
 	if (PQstatus(AH->connection) == CONNECTION_BAD)
 	{
 		if (isReconnect)
-			fatal("reconnection to database \"%s\" failed: %s",
-  PQdb(AH->connection) ? PQdb(AH->connection) : "",
+			fatal("reconnection failed: %s",
   PQerrorMessage(AH->connection));
 		else
-			fatal("connection to database \"%s\" failed: %s",
-  PQdb(AH->connection) ? PQdb(AH->connection) : "",
+			fatal("%s",
   PQerrorMessage(AH->connection));
 	}
 
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 85d08ad660..007a3d0f9a 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -1768,8 +1768,7 @@ connectDatabase(const char *dbname, const char *connection_string,
 	{
 		if (fail_on_error)
 		{
-			pg_log_error("could not connect to database \"%s\": %s",
-		 dbname, PQerrorMessage(conn));
+			pg_log_error("%s", PQerrorMessage(conn));
 			exit_nicely(1);
 		}
 		else
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl 

Re: Avoiding smgrimmedsync() during nbtree index builds

2021-01-21 Thread Heikki Linnakangas

On 21/01/2021 22:36, Andres Freund wrote:

Hi,

Every nbtree index build currently does an smgrimmedsync at the end:

/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
/*
 * When we WAL-logged index pages, we must nonetheless fsync index 
files.
 * Since we're building outside shared buffers, a CHECKPOINT occurring
 * during the build has no way to flush the previously written data to
 * disk (indeed it won't know the index even exists).  A crash later on
 * would replay WAL from the checkpoint, therefore it wouldn't replay 
our
 * earlier WAL entries. If we do not fsync those pages here, they might
 * still not be on disk when the crash occurs.
 */
if (wstate->btws_use_wal)
{
RelationOpenSmgr(wstate->index);
smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
}

In cases we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, that turns out
to be a major limiting factor (unless one turns fsync off).


One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc directly. But that's a discussion with a bit more
complex tradeoffs.


What I wonder is why the issue addressed in the comment I copied above
can't more efficiently be addressed using sync requests, like we do for
other writes?  It's possibly bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?


Makes sense.


A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.


Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?


Hmm. If the redo point falls between write() and the 
register_dirty_segment(), and the checkpointer finishes the whole 
checkpoint before register_dirty_segment(), you are not safe. That can't 
happen with write from the buffer manager, because the checkpointer 
would block waiting for the flush of the buffer to finish.


- Heikki




Re: libpq debug log

2021-01-21 Thread Alvaro Herrera
On 2021-Jan-17, tsunakawa.ta...@fujitsu.com wrote:

> * I don't see the need for separate pqTraceInit() function, because it is 
> only called here.

That's true, but it'd require that we move PQtrace() to fe-misc.c,
because pqTraceInit() uses definitions that are private to that file.

> (2)
> +bool
> +pqTraceInit(PGconn *conn)
> +{
> + /* already done? */
> + if (conn->be_msg != NULL)
> + {
> 
> What's this code for?  I think it's sufficient that PQuntrace() frees b_msg 
> and PQtrace() always allocates b_msg.

The point is to be able to cope with a connection that calls PQtrace()
multiple times, with or without an intervening PQuntrace().
We'd make no friends if we made PQtrace() crash, or leak memory if
called a second time ...

> (5)
> @@ -966,10 +966,6 @@ pqSaveParameterStatus(PGconn *conn, const char *name, 
> const char *value)
>   pgParameterStatus *pstatus;
>   pgParameterStatus *prev;
>  
> - if (conn->Pfdebug)
> - fprintf(conn->Pfdebug, "pqSaveParameterStatus: '%s' = '%s'\n",
> - name, value);
> -
> 
> Where is this information output instead?

Hmm, seems to have been lost.  I restored it, but didn't verify
the resulting behavior carefully.

> (9)
> + currtime = time(NULL);
> + tmp = localtime(&currtime);
> + strftime(timestr, sizeof(timestr), "%Y-%m-%d 
> %H:%M:%S %Z", tmp);
> +
> + fprintf(conn->Pfdebug, "%s %s ", timestr, 
> message_direction);
> 
> It's better to have microsecond accuracy, because knowing the accumulation of 
> such fine-grained timings may help to troubleshoot heavily-loaded client 
> cases.  You can mimic setup_formatted_log_time() in 
> src/backend/utils/error/elog.c.  This is used for the %m marker in 
> log_line_prefix.

Yeah, it was me that removed printing of milliseconds -- Iwata-san's
original patch used ftime(), which is not portable.  I had looked for
portable ways to get this but couldn't find anything that didn't involve
linking backend files, so I punted.  But looking again now, it is quite
simple: just use standard strftime and localtime and concatenate both
together.  Similar to what setup_formatted_log_time does, except we don't
worry about the TZ at all.
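
Roughly (a sketch; this relies only on gettimeofday(), which I believe
frontend code already has available, via src/port on platforms that lack it):

	char			timestr[128];
	struct timeval	tv;
	time_t			now;

	gettimeofday(&tv, NULL);
	now = (time_t) tv.tv_sec;
	strftime(timestr, sizeof(timestr), "%Y-%m-%d %H:%M:%S", localtime(&now));
	fprintf(conn->Pfdebug, "%s.%06d %s ",
			timestr, (int) tv.tv_usec, message_direction);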

> (10)
> I think it's better to use tabs (or any other character that is less likely 
> to appear in the log field) as a field separator, because it makes it easier 
> to process the log with a spreadsheet or some other tools.

I can buy that.  I changed it to have some tabs; the lines are now:

timestamp "> or <"  "message type"  length  message contents

I think trying to apply tabs inside the message contents is going to be
more confusing than helpful.



> (12)
> [...]
> Plus, you may as well print the invalid message type.  Why don't you do 
> something like this:
> 
> +static const char *
> +pqGetProtocolMsgType(unsigned char c, PGCommSource commsource)
> +{
> + static char unknown_message[64];
> + char *msg = NULL;
> 
> + if (commsource == FROM_BACKEND && c < lengthof(protocol_message_type_b))
> + msg = protocol_message_type_b[c];
> + else if (commsource == FROM_FRONTEND && c < 
> lengthof(protocol_message_type_f))
> + msg = protocol_message_type_f[c];
> +
> + if (msg == NULL)
> + {
> + sprintf(unknown_message, "Unknown message %02x", c);
> + msg = unknown_message;
> + }
> +
> + return msg;
> +}

Right.  I had written something like this, but in the end decided not to
bother because it doesn't seem terribly important.  You can't do exactly
what you suggest, because it has the problem that you're returning a
stack-allocated variable even though your stack is about to be blown
away.  My copy had a static buffer that was malloc()ed on first use (and
if allocation fails, return a const string).  Anyway, I fixed the
crasher problem.

The protocol_message_type_b array had the serious problem that it confused
the entry for byte 1 (0x01) with that of char '1' (and 2 and 3), so it
printed a lot of 'Unknown message' lines.  Clearly, the maintenance of
this array is going to be a pain point of this patch (counting the number of
zeroes is no fun), and I think we're going to need some way to have an
explicit initializer, where we can do things like
protocol_message_type_b['A'] = "NotificationResponse";
etc. instead of the current way, which is messy and hard to maintain.
I'm not sure how to do that and not make things worse, however.
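
One option, as a sketch: C99 designated initializers let the table be written
sparsely, with no zero-counting at all, e.g.

static const char *const protocol_message_type_b[256] = {
	['R'] = "Authentication",
	['K'] = "BackendKeyData",
	['C'] = "CommandComplete",
	['D'] = "DataRow",
	['E'] = "ErrorResponse",
	['A'] = "NotificationResponse",
	['Z'] = "ReadyForQuery",
	['T'] = "RowDescription",
	/* bytes not listed stay NULL and get reported as unknown */
};

(the entries shown are just a subset, to illustrate the shape).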

> (16)
> +static void
> +pqLogMessageByte1(PGconn *conn, char v, PGCommSource commsource)
> +{
> + char   *message_direction = commsource == FROM_BACKEND ? "<" : ">";
> ...
> + switch (conn->be_msg->state)
> 
> This function handles messages in both directions.  But the switch condition 
> is based on the backend message state (b_msg).  How is this correct?

It's not.

I split things so that this function only prints backend 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-21 Thread Justin Pryzby
On Thu, Jan 21, 2021 at 11:48:08PM +0300, Alexey Kondratov wrote:
> Attached is a new patch set of first two patches, that should resolve all
> the issues raised before (ACL, docs, tests) excepting TOAST. Double thanks
> for suggestion to add more tests with nested partitioning. I have found and
> squashed a huge bug related to the returning back to the default tablespace
> using newly added tests.
> 
> Regarding TOAST. Now we skip moving toast indexes or throw error if someone
> wants to move TOAST index directly. I had a look on ALTER TABLE SET
> TABLESPACE and it has a bit complicated logic:
> 
> 1) You cannot move TOAST table directly.
> 2) But if you move basic relation that TOAST table belongs to, then they are
> moved altogether.
> 3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE ...
> 
> That way, ALTER TABLE allows moving TOAST tables (with indexes) implicitly,
> but does not allow doing that explicitly. In the same time I found docs to
> be vague about such behavior it only says:
> 
> All tables in the current database in a tablespace can be moved
> by using the ALL IN TABLESPACE ... Note that system catalogs are
> not moved by this command
> 
> Changing any part of a system catalog table is not permitted.
> 
> So actually ALTER TABLE treats TOAST relations as system sometimes, but
> sometimes not.
> 
> From the end user perspective it makes sense to move TOAST with main table
> when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on TOAST
> table with REINDEX? We cannot move TOAST relation itself, since we are doing
> only a reindex, so we end up in the state when TOAST table and its index are
> placed in the different tablespaces. This state is not reachable with ALTER
> TABLE/INDEX, so it seem we should not allow it with REINDEX as well, should
> we?

> +  * Even if a table's indexes were moved to a new tablespace, 
> the index
> +  * on its toast table is not normally moved.
>*/
>   ReindexParams newparams = *params;
>  
>   newparams.options &= ~(REINDEXOPT_MISSING_OK);
> + if (!allowSystemTableMods)
> + newparams.tablespaceOid = InvalidOid;

I think you're right.  So actually TOAST should never move, even if
allowSystemTableMods, right ?

> @@ -292,7 +315,11 @@ REINDEX [ (  class="parameter">option [, ...] ) ] { IN
> with REINDEX INDEX or REINDEX TABLE,
> respectively. Each partition of the specified partitioned relation is
> reindexed in a separate transaction. Those commands cannot be used inside
> -   a transaction block when working on a partitioned table or index.
> +   a transaction block when working on a partitioned table or index. If
> +   REINDEX with TABLESPACE executed
> +   on partitioned relation fails it may have moved some partitions to the new
> +   tablespace. Repeated command will still reindex all partitions even if 
> they
> +   are already in the new tablespace.

Minor corrections here:

If a REINDEX command fails when run on a partitioned
relation, and TABLESPACE was specified, then it may have
moved indexes on some partitions to the new tablespace.  Re-running the command
will reindex all partitions and move previously-unprocessed indexes to the new
tablespace.


-- 
Justin




Re: Is Recovery actually paused?

2021-01-21 Thread Robert Haas
On Mon, Jan 18, 2021 at 9:42 PM Yugo NAGATA  wrote:
> If it is acceptable that pg_is_wal_replay_paused() makes users wait,
> I'm ok with the current interface. I don't see the need for
> pg_is_wal_replay_pause_requested().

Another idea could be that pg_is_wal_replay_paused() could be changed
to text, and the string could be either 'paused' or 'pause requested'
or 'not paused'. That way we'd be returning a direct representation of
the state we're keeping in memory. Some of the complexity in this
discussion seems to come from trying to squeeze 3 possibilities into a
Boolean.

Let's also consider that we don't really know whether the client wants
us to wait or not, and different clients may want different things, or
maybe not, but we don't really know at this point. If we provide an
interface that waits, and the client doesn't want to wait but just
know the current state, they don't necessarily have any great options.
If we provide an interface that doesn't wait, and the client wants to
wait, it can poll until it gets the answer it wants. Polling can be
inefficient, but anybody who is writing a tool that uses this should
be able to manage an algorithm with some reasonable back-off behavior
(e.g. try after 10ms, 20ms, keep doubling, max of 5s, or something of
that sort), so I'm not sure there's actually any real problem in
practice. So to me it seems more likely that an interface that is
based on waiting will cause difficulty for tool-writers than one that
does not.
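
For what it's worth, the tool-side loop is indeed simple; roughly (a sketch,
no error handling):

    int         sleep_ms = 10;

    for (;;)
    {
        PGresult   *res = PQexec(conn, "SELECT pg_is_wal_replay_paused()");
        bool        paused = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                              strcmp(PQgetvalue(res, 0, 0), "t") == 0);

        PQclear(res);
        if (paused)
            break;
        pg_usleep(sleep_ms * 1000L);
        sleep_ms = Min(sleep_ms * 2, 5000);
    }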

Other people may feel differently, of course...

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-21 Thread Alexey Kondratov

On 2021-01-21 17:06, Alexey Kondratov wrote:

On 2021-01-21 04:41, Michael Paquier wrote:


There are no tests for partitioned tables, aka we'd want to make sure
that the new partitioned index is on the correct tablespace, as well
as all its leaves.  It may be better to have at least two levels of
partitioned tables, as well as a partitioned table with no leaves in
the cases dealt with.



Yes, sure, it makes sense.


+*
+* Even if a table's indexes were moved to a new tablespace, 
the index

+* on its toast table is not normally moved.
 */
Still, REINDEX (TABLESPACE) TABLE should move all of them to be
consistent with ALTER TABLE SET TABLESPACE, but that's not the case
with this code, no?  This requires proper test coverage, but there is
nothing of the kind in this patch.


You are right, we do not move TOAST indexes now, since
IsSystemRelation() is true for TOAST indexes, so I thought that we
should not allow moving them without allow_system_table_mods=true. Now
I wonder why ALTER TABLE does that.

I am going to attach the new version of patch set today or tomorrow.



Attached is a new patch set of the first two patches, which should resolve
all the issues raised before (ACL, docs, tests) except TOAST. Double
thanks for the suggestion to add more tests with nested partitioning. Using
the newly added tests, I have found and squashed a huge bug related to
moving indexes back to the default tablespace.


Regarding TOAST: now we skip moving TOAST indexes, or throw an error if
someone wants to move a TOAST index directly. I had a look at ALTER TABLE
SET TABLESPACE and it has somewhat complicated logic:


1) You cannot move TOAST table directly.
2) But if you move basic relation that TOAST table belongs to, then they 
are moved altogether.
3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE 
...


That way, ALTER TABLE allows moving TOAST tables (with indexes)
implicitly, but does not allow doing that explicitly. At the same time I
found the docs to be vague about such behavior; they only say:


All tables in the current database in a tablespace can be moved
by using the ALL IN TABLESPACE ... Note that system catalogs are
not moved by this command

Changing any part of a system catalog table is not permitted.

So ALTER TABLE actually treats TOAST relations as system relations
sometimes, but not always.


From the end-user perspective it makes sense to move TOAST along with the
main table when doing ALTER TABLE SET TABLESPACE. But should we touch the
indexes on the TOAST table with REINDEX? We cannot move the TOAST relation
itself, since we are only doing a reindex, so we would end up in a state
where the TOAST table and its index are placed in different tablespaces.
That state is not reachable with ALTER TABLE/INDEX, so it seems we should
not allow it with REINDEX either, should we?
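
For illustration, a rough sketch of the asymmetry being discussed (object
names are made up, and the REINDEX syntax is the one proposed by this patch
set):

CREATE TABLESPACE ts2 LOCATION '/srv/pgsql/ts2';
CREATE TABLE t (payload text);   -- implicitly gets a TOAST table and TOAST index

-- Proposed syntax: rebuilds t's indexes in ts2, while t, its TOAST table and
-- (with the current patch version) its TOAST index stay where they are.  If
-- the TOAST index were rebuilt in ts2 too, it would be separated from its
-- TOAST table, a state that ALTER TABLE/INDEX can never produce:
REINDEX (TABLESPACE ts2) TABLE t;

-- By contrast, this moves the table together with its TOAST table and index:
ALTER TABLE t SET TABLESPACE ts2;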



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom bcd690da6bc3db16a96305b45546d3c9e400f769 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH v3 2/2] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  29 +++-
 src/backend/catalog/index.c   |  82 +++-
 src/backend/commands/indexcmds.c  |  81 +++-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   2 +
 src/test/regress/input/tablespace.source  |  79 +++
 src/test/regress/output/tablespace.source | 154 ++
 7 files changed, 425 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..90fdad0b4c 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,20 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  Specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" and system (unless allow_system_table_mods
+  is set to TRUE) relations. If SCHEMA,
+  DATABASE or SYSTEM are specified, then
+  all "mapped" and system relations will be skipped and a single
+  WARNING will be generated.
+ 
+
+   
+

 VERBOSE
 
@@ -210,6 +225,14 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+new_tablespace
+
+ 
+  The tablespace where indexes will be rebuilt.
+ 
+
+   
   
  
 
@@ -292,7 +315,11 @@ REINDEX [ ( option [, ...] ) ] { IN
with REINDEX INDEX or REINDEX TABLE,
respectively. Each partition of the specified partitioned relation is
reindexed in a separate transaction. Those 

Avoiding smgrimmedsync() during nbtree index builds

2021-01-21 Thread Andres Freund
Hi,

Every nbtree index build currently does an smgrimmedsync at the end:

/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
/*
 * When we WAL-logged index pages, we must nonetheless fsync index files.
 * Since we're building outside shared buffers, a CHECKPOINT occurring
 * during the build has no way to flush the previously written data to
 * disk (indeed it won't know the index even exists).  A crash later on
 * would replay WAL from the checkpoint, therefore it wouldn't replay our
 * earlier WAL entries. If we do not fsync those pages here, they might
 * still not be on disk when the crash occurs.
 */
if (wstate->btws_use_wal)
{
    RelationOpenSmgr(wstate->index);
    smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
}

In cases where we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, this turns out
to be a major limiting factor (unless one turns fsync off).


One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc directly. But that's a discussion with a bit more
complex tradeoffs.


What I wonder is why the issue addressed in the comment I copied above
can't more efficiently be addressed using sync requests, like we do for
other writes?  It's possibly a bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?


A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.
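
As a rough illustration of that kind of workload (this is not the attached
createlots.sql; names and sizes are arbitrary), every partition created below
gets its own primary-key index, and each of those index builds currently ends
in an smgrimmedsync():

CREATE TABLE parent (id int PRIMARY KEY, v text) PARTITION BY RANGE (id);

DO $$
BEGIN
    FOR i IN 0..999 LOOP
        EXECUTE format(
            'CREATE TABLE parent_%s PARTITION OF parent
                 FOR VALUES FROM (%s) TO (%s)',
            i, i * 1000, (i + 1) * 1000);
    END LOOP;
END
$$;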


Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
   write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Greetings,

Andres Freund


createlots.sql
Description: application/sql


Re: Race condition in recovery?

2021-01-21 Thread Robert Haas
On Thu, Jan 21, 2021 at 4:00 AM Dilip Kumar  wrote:
> 8. Node3, get it because walsender of Node2 read it from TL13 and send
> it and Node2 write in the new WAL file but with TL12.
>
> WalSndSegmentOpen()
> {
> /*---
> * When reading from a historic timeline, and there is a timeline switch
> * within this segment, read from the WAL segment belonging to the new
> * timeline.
> }
>
> 9. Node3, now set the expectedTLEs to 12 because that is what
> walreceiver has streamed the WAL using.

This seems to be incorrect, because the comment for expectedTLEs says:

 * expectedTLEs: a list of TimeLineHistoryEntries for
recoveryTargetTLI and the timelines of
 * its known parents, newest first (so recoveryTargetTLI is always the
 * first list member).  Only these TLIs are expected to be seen in the WAL
 * segments we read, and indeed only these TLIs will be considered as
 * candidate WAL files to open at all.

But in your scenario apparently we end up with a situation that
contradicts that, because you go on to say:

> 10. Node3, now recoveryTargetTLI is 13 and expectedTLEs is 12.  So

As I understand, expectedTLEs should end up being a list where the
first element is the timeline we want to end up on, and the last
element is the timeline where we are now, and every timeline in the
list branches off of the next timeline in the list. So here if 13
branches of 12 then the list should be 13,12 not just 12; and if 13
does not branch off of 12 OR if 13 branches off of 12 at an earlier
point in the WAL stream than where we are now, then that should be an
error that shuts down the standby, because then there is no way for
replay to get from where it is now to the correct timeline.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support for NSS as a libpq TLS backend

2021-01-21 Thread Jacob Champion
On Mon, 2020-07-20 at 15:35 +0200, Daniel Gustafsson wrote:
> With this, I have one failing test ("intermediate client certificate is
> provided by client") which I've left failing since I believe the case should
> be supported by NSS.  The issue is most likely that I haven't figured out the
> right certinfo incantation to make it so (Mozilla hasn't strained themselves
> when writing documentation for this toolchain, or any part of NSS for that
> matter).

I think we're missing a counterpart to this piece of the OpenSSL
implementation, in be_tls_init():

if (ssl_ca_file[0])
{
...
SSL_CTX_set_client_CA_list(context, root_cert_list);
}

I think the NSS equivalent to SSL_CTX_set_client_CA_list() is probably
SSL_SetTrustAnchors() (which isn't called out in the online NSS docs,
as far as I can see).

What I'm less sure of is how we want the NSS counterpart to ssl_ca_file
to behave. The OpenSSL implementation allows a list of CA names to be
sent. Should the NSS side take a list of CA cert nicknames? a list of
Subjects? something else?

mod_nss for httpd had a proposed feature [1] to do this that
unfortunately withered on the vine, and Google returns ~500 results for
"SSL_SetTrustAnchors", so I'm unaware of any prior art in the wild...

--Jacob

[1] https://bugzilla.redhat.com/show_bug.cgi?id=719401


Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-21 Thread Pavel Stehule
Hi


> Looks good, I've applied it, thanks.
>

I tested last set of patches

1. There is no problem with patching and compilation
2. make check-world passed
3. The docs build without problems
4. I have no objections to the implemented functionality, the
implementation, or the tests

I'll mark this patch as ready for committers

Thank you for your work. It will be a nice feature.

Regards

Pavel


Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-21 Thread Matthias van de Meent
On Tue, 19 Jan 2021 at 21:59, Matthias van de Meent
 wrote:
>
> On Mon, 18 Jan 2021, 21:25 Álvaro Herrera,  wrote:
> >
> > On 2021-Jan-18, Matthias van de Meent wrote:
> >
> > > Example:
> > >
> > > 1.) RI starts
> > > 2.) PHASE 2: filling the index:
> > > 2.1.) scanning the heap (live tuple is cached)
> > > < tuple is deleted
> > > < last transaction other than RI commits, only snapshot of RI exists
> > > < vacuum drops the tuple, and cannot remove it from the new index
> > > because this new index is not yet populated.
> > > 2.2.) sorting tuples
> > > 2.3.) index filled with tuples, incl. deleted tuple
> > > 3.) PHASE 3: wait for transactions
> > > 4.) PHASE 4: validate does not remove the tuple from the index,
> > > because it is not built to do so: it will only insert new tuples.
> > > Tuples that are marked for deletion are removed from the index only
> > > through VACUUM (and optimistic ALL_DEAD detection).
> > >
> > > According to my limited knowledge of RI, it requires VACUUM to not run
> > > on the table during the initial index build process (which is
> > > currently guaranteed through the use of a snapshot).
> >
> > VACUUM cannot run concurrently with CIC or RI in a table -- both acquire
> > ShareUpdateExclusiveLock, which conflicts with itself, so this cannot
> > occur.
>
> Yes, you are correct. Vacuum indeed has a ShareUpdateExclusiveLock.
> Are there no other ways that pages are optimistically pruned?
>
> But the base case still stands, ignoring CIC snapshots in  would give
> the semantic of all_dead to tuples that are actually still considered
> alive in some context, and should not yet be deleted (you're deleting
> data from an in-use snapshot). Any local pruning optimizations using
> all_dead mechanics now cannot be run on the table unless they hold an
> ShareUpdateExclusiveLock; though I'm unaware of any such mechanisms
> (other than below).

Re-thinking this, and after some research:

Is the behaviour of "any process that invalidates TIDs in this table
(that could be in an index on this table) always holds a lock that
conflicts with CIC/RiC on that table" a requirement of tableams, or is
it an implementation-detail?

If it is a requirement, then this patch is a +1 for me (and that
requirement should be documented in such case), otherwise a -1 while
there is no mechanism in place to remove concurrently-invalidated TIDs
from CIC-ed/RiC-ed indexes.

This concurrently-invalidated check could be done through e.g.
updating validate_index to have one more phase that removes unknown /
incorrect TIDs from the index. As a note: index insertion logic would
then also have to be able to handle duplicate TIDs in the index.

Regards,

Matthias van de Meent






Re: [PATCH 1/1] Fix detection of pwritev support for OSX.

2021-01-21 Thread Tom Lane
James Hilliard  writes:
> On Wed, Jan 20, 2021 at 4:07 PM Tom Lane  wrote:
>> I'm not sure that the case of not having the "command line tools"
>> installed is interesting for our purposes.  AFAIK you have to have
>> that in order to have access to required tools like bison and gmake.
>> (That reminds me, I was intending to add something to our docs
>> about how-to-build-from-source to say that you need to install those.)

> Yeah, not 100% sure but I was able to build just fine after deleting my
> command line tools.

Hm.  I've never been totally clear on what's included in the "command line
tools", although it's now apparent that one thing that gets installed is
an SDK matching the host OS version.  However, Apple's description at [1]
says

Command Line Tools

Download the macOS SDK, headers, and build tools such as the Apple
LLVM compiler and Make. These tools make it easy to install open
source software or develop on UNIX within Terminal. macOS can
automatically download these tools the first time you try to build
software, and they are available on the downloads page.

which certainly strongly implies that gmake is not there otherwise.
At this point I lack any "bare" macOS system to check it on.  I wonder
whether you have a copy of make available from MacPorts or Homebrew.
Or maybe uninstalling the command line tools doesn't really remove
everything?

> It would be pretty annoying to have to install an outdated SDK just to
> build postgres for no other reason than the autoconf feature detection
> being broken.

It's only as "outdated" as your host system ;-).  Besides, it doesn't
look like Apple's really giving you a choice not to.

The long and short of this is that I'm unwilling to buy into maintaining
our own substitutes for standard autoconf probes in order to make it
possible to use the wrong SDK version.  The preadv/pwritev case is already
messy enough, and I fear that trying to support such scenarios is going to
lead to more and more pain in the future.

regards, tom lane

[1] https://developer.apple.com/xcode/features/




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-21 Thread Fujii Masao




On 2021/01/22 1:17, Bharath Rupireddy wrote:

On Thu, Jan 21, 2021 at 8:58 PM Fujii Masao  wrote:

My opinion is to check "!all", but if others prefer using such boolean flag,
I'd withdraw my opinion.


I'm really sorry, actually if (!all) is enough there, my earlier
understanding was wrong.


+   if ((all || entry->server_hashvalue == hashvalue) &&

What about making disconnect_cached_connections() accept serverid instead
of hashvalue, and perform the above comparison based on serverid? That is,
I'm thinking "if (all || entry->serverid == serverid)". If we do that, we can
simplify postgres_fdw_disconnect() a bit more by getting rid of the calculation
of hashvalue.


That's a good idea. I missed this point. Thanks.


+   if ((all || entry->server_hashvalue == hashvalue) &&
+entry->conn)

I think that it's better to make the check of "entry->conn" independent
like other functions in postgres_fdw/connection.c. What about adding
the following check before the above?

 /* Ignore cache entry if no open connection right now */
 if (entry->conn == NULL)
 continue;


Done.


+			/*
+			 * If the server has been dropped in the current explicit
+			 * transaction, then this entry would have been invalidated
+			 * in pgfdw_inval_callback at the end of drop sever
+			 * command. Note that this connection would not have been
+			 * closed in pgfdw_inval_callback because it is still being
+			 * used in the current explicit transaction. So, assert
+			 * that here.
+			 */
+			Assert(entry->invalidated);

As this comment explains, even when the connection is used in the transaction,
its server can be dropped in the same transaction. The connection can remain
until the end of the transaction even though its server has already been dropped.
I'm now wondering if this behavior itself is problematic and should be forbidden.
Of course, this is a separate topic from this patch, though...

BTW, my just idea for that is;
1. change postgres_fdw_get_connections() return also serverid and xact_depth.
2. make postgres_fdw define the event trigger on DROP SERVER command so that
  an error is thrown if the connection to the server is still in use.
  The event trigger function uses postgres_fdw_get_connections() to check
  if the server connection is still in use or not.

I'm not sure if this just idea is really feasible or not, though...


I'm not quite sure if we can create such a dependency i.e. blocking
"drop foreign server" when at least one session has an in use cached
connection on it?


Maybe my explanation was not clear... I was thinking of preventing the server
whose connection is used *within the current transaction* from being dropped.
IOW, I was thinking of forbidding the drop of a server if the xact_depth of its
connection is greater than zero. So one session can drop the server even when
its connection is open in another session, as long as it's not used within a
transaction there (i.e., xact_depth == 0).

BTW, for now, if the connection is used within a transaction, other sessions
cannot drop the corresponding server because the transaction holds locks on
the relations that depend on the server. Only the session running that
transaction can drop the server. This is what causes the issue under discussion.

So, my idea is simply to disallow even the session running the transaction to
drop the server. This means that no session can drop the server while its
connection is used within a transaction (xact_depth > 0).
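
One conceivable way to enforce that at the extension level is the
event-trigger idea quoted above. A rough sketch, assuming
postgres_fdw_get_connections() were extended to also expose the server name
and xact_depth (the column names below are hypothetical):

CREATE FUNCTION guard_drop_server() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
    obj record;
BEGIN
    FOR obj IN
        SELECT * FROM pg_event_trigger_dropped_objects()
        WHERE object_type = 'foreign server'
    LOOP
        IF EXISTS (SELECT 1
                   FROM postgres_fdw_get_connections() AS c
                   WHERE c.server_name = obj.object_name  -- hypothetical columns
                     AND c.xact_depth > 0) THEN
            RAISE EXCEPTION 'foreign server "%" still has a connection in use',
                obj.object_name;
        END IF;
    END LOOP;
END
$$;

CREATE EVENT TRIGGER guard_drop_fdw_server
    ON sql_drop
    WHEN TAG IN ('DROP SERVER')
    EXECUTE FUNCTION guard_drop_server();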



What if a user wants to drop a server from one session, while other sessions,
one after the other, keep having in-use connections related to that server
(though this use case sounds impractical)? Will the drop of the server ever be
successful? Since we can have hundreds of sessions in a real-world Postgres
environment, I don't know if it's a good idea to create such a dependency.

As you suggested, this point can be discussed in a separate thread and
if any of the approaches proposed by you above is finalized we can
extend postgres_fdw_get_connections anytime.

Thoughts?


I will consider more before starting separate discussion!




Attaching the v16 patch set, addressing the above review comments and also
adding a test case suggested upthread: postgres_fdw_disconnect() with an
existing server name returns false when the cache doesn't have an active
connection for it.

Please review the v16 patch set further.


Thanks! Will review that later.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Add primary keys to system catalogs

2021-01-21 Thread Tom Lane
Peter Eisentraut  writes:
> On 2021-01-17 23:07, Tom Lane wrote:
>> I've reviewed this patch.  It looks pretty solid to me, with a couple
>> trivial nits as mentioned below, and one bigger thing that's perhaps
>> in the category of bikeshedding.  Namely, do we really want to prefer
>> using the OID indexes as the primary keys?

> I chose this because the notional foreign keys point to the OID.
> If you design some basic business database with customer IDs, product 
> IDs, etc., you'd also usually make the ID the primary key, even if you 
> have, say, a unique constraint on the product name.  But this is of 
> course a matter of taste to some degree.

Fair enough.  As I said upthread, I just wanted to be sure we'd considered
the alternative.  I'm content to use the OIDs as pkeys, although I think
that decision should be explicitly recorded somewhere (cf attachment).

>> The contents of system_constraints.sql seem pretty randomly ordered,
>> and I bet the order isn't stable across machines.

> They follow the order in which the catalogs are processed by genbki.pl.

Looking closer, I see the data structure is an array not a hash, so
I withdraw the concern about instability.

After reading the patch again, I have a couple more nits about comments,
which I'll just present as a proposed delta patch.  Otherwise it's good.
I'll mark it RFC.

regards, tom lane

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 6747996ce3..b68c1752c0 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -663,6 +663,8 @@ die
   "genbki OID counter reached $GenbkiNextOid, overrunning FirstBootstrapObjectId\n"
   if $GenbkiNextOid > $FirstBootstrapObjectId;
 
+# Now generate system_constraints.sql
+
 foreach my $c (@system_constraints)
 {
 	print $constraints $c, "\n";
diff --git a/src/include/catalog/genbki.h b/src/include/catalog/genbki.h
index 296765d987..5d05fafb5d 100644
--- a/src/include/catalog/genbki.h
+++ b/src/include/catalog/genbki.h
@@ -55,12 +55,15 @@
 #define DECLARE_TOAST(name,toastoid,indexoid) extern int no_such_variable
 
 /*
- * These lines processed by genbki.pl to create the statements
+ * These lines are processed by genbki.pl to create the statements
  * the bootstrap parser will turn into DefineIndex calls.
  *
- * The keyword is DECLARE_INDEX or DECLARE_UNIQUE_INDEX or DECLARE_UNIQUE_INDEX_PKEY.  The first two
- * arguments are the index name and OID, the rest is much like a standard
- * 'create index' SQL command.
+ * The keyword is DECLARE_INDEX or DECLARE_UNIQUE_INDEX or
+ * DECLARE_UNIQUE_INDEX_PKEY.  ("PKEY" marks the index as being the catalog's
+ * primary key; currently this is only cosmetically different from a regular
+ * unique index.  By convention, we usually make a catalog's OID column its
+ * pkey, if it has one.)  The first two arguments are the index's name and
+ * OID, the rest is much like a standard 'create index' SQL command.
  *
  * For each index, we also provide a #define for its OID.  References to
  * the index in the C code should always use these #defines, not the actual
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index ae85817c61..9ebe28a78d 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -43,6 +43,9 @@ WHERE refclassid = 0 OR refobjid = 0 OR
 -- whatever OID the test is complaining about must have been added later
 -- in initdb, where it intentionally isn't pinned.  Legitimate exceptions
 -- to that rule are listed in the comments in setup_depend().
+-- Currently, pg_rewrite is also listed by this check, even though it is
+-- covered by setup_depend().  That happens because there are no rules in
+-- the pinned data, but initdb creates some intentionally-not-pinned views.
 do $$
 declare relnm text;
   reloid oid;
diff --git a/src/test/regress/sql/misc_sanity.sql b/src/test/regress/sql/misc_sanity.sql
index ea80cf1f10..9699f5cc3b 100644
--- a/src/test/regress/sql/misc_sanity.sql
+++ b/src/test/regress/sql/misc_sanity.sql
@@ -44,6 +44,9 @@ WHERE refclassid = 0 OR refobjid = 0 OR
 -- whatever OID the test is complaining about must have been added later
 -- in initdb, where it intentionally isn't pinned.  Legitimate exceptions
 -- to that rule are listed in the comments in setup_depend().
+-- Currently, pg_rewrite is also listed by this check, even though it is
+-- covered by setup_depend().  That happens because there are no rules in
+-- the pinned data, but initdb creates some intentionally-not-pinned views.
 
 do $$
 declare relnm text;


Re: Identify missing publications from publisher while create/alter subscription.

2021-01-21 Thread Bharath Rupireddy
On Thu, Jan 21, 2021 at 6:56 PM vignesh C  wrote:
>
> Hi,
>
> Creating/altering subscription is successful when we specify a
> publication which does not exist in the publisher. I felt we should
> throw an error in this case, that will help the user to check if there
> is any typo in the create subscription command or to create the
> publication before the subscription is created.
> If the above analysis looks correct, then please find a patch that
> checks if the specified publications are present in the publisher and
> throws an error if any of the publications is missing in the
> publisher.
> Thoughts?

I was having similar thoughts (while working on the logical replication bug
around alter publication ... drop table behaviour) on why create subscription
succeeds without checking publication existence. I checked the documentation
to find out whether there's a strong reason for that, but I couldn't find one.
Maybe it's because of the principle "first let users create subscriptions,
later the publications can be created on the publisher system", similar to
the behaviour "publications can be created without any tables attached to
them initially, and tables can be added later".

Others may have better thoughts.

If we do check publication existence for CREATE SUBSCRIPTION, I think
we should also do it for ALTER SUBSCRIPTION ... SET PUBLICATION.

I also wonder why dropping a single publication from the list of publications
attached to a subscription is not allowed?
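
Today the only way to shrink that list seems to be to replace it wholesale,
along the lines of the following (subscription and publication names are
illustrative):

ALTER SUBSCRIPTION mysub SET PUBLICATION pub_keep1, pub_keep2;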

Some comments so far on the patch:

1) I see most of the code in the new function check_publications() and the
existing fetch_table_list() is the same. Can we have a generic function,
maybe with a flag to separate out the logic specific to checking publications
and fetching the table list from the publisher?
2) Can't we know whether the publications exist on the publisher with the
existing query in fetch_table_list() (or a slightly modified version of it),
so that we can avoid making another connection to the publisher system from
the subscriber? (A sketch of such a check follows this list.)
3) If multiple publications are specified in the CREATE SUBSCRIPTION query,
IIUC, with your patch the query fails if even one of the publications doesn't
exist. Should we instead throw a warning in this case and allow the
subscription to be created for the other, existing publications?
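
The check itself is only a catalog lookup on the publisher side (publication
names are illustrative):

SELECT pubname
FROM pg_catalog.pg_publication
WHERE pubname IN ('mypub1', 'mypub2');
-- any name missing from the result set does not exist on the publisher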

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Why does creating logical replication subscriptions require superuser?

2021-01-21 Thread Paul Martinez
Hey, all,

I'm working with native logical replication, and I don't fully understand
why logical replication subscribers need to be superusers, nor do I fully
understand the implication of some of the comments made on this page:

https://www.postgresql.org/docs/current/logical-replication-security.html

> A user able to modify the schema of subscriber-side tables can execute
> arbitrary code as a superuser.

Does "execute arbitrary code" here really mean executing arbitrary code on the
machine that is running Postgres, or simply running arbitrary SQL in the
replicating database? Would it only be able to modify data in the database
containing the subscription, or could it modify other databases in the same
cluster? Is there any "blast-radius" to the capabilities gained by such a user?

According to the commit message that added this text, the callout of this
point was added as result of CVE-2020-14349, but the details there don't
really help me understand what the concern is here, nor do I have a deep
understanding of various features that might combine to create a vulnerability.

I'm not sure what arbitrary code could be executed, but my rough guess, based
on some of the other text on that page, is that custom triggers, written in
an arbitrary language (e.g., Python), would run arbitrary code and that is
the concern. Is that correct? And, if so, assuming that Python triggers were
the only way to execute arbitrary code, then simply building Postgres without
Python support would prevent users that can modify the schema from executing
code as superuser. What is the full set of features that could lead to running
arbitrary code in this scenario? Is it just all the different languages that
can be used to write triggers?

Essentially, I'm wondering what a loose proof-of-concept privilege escalation
vulnerability would look like if a non-superuser can modify the schema of
subscriber-side tables.

On a related note, what happens if a superuser creates a subscription, and then
is demoted to a regular user? My understanding is that the worker that applies
the logical changes to the database connects as the user that created the
subscription, so would that prevent potential vulnerabilities in any way?


Thanks,
Paul




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-21 Thread Bharath Rupireddy
On Thu, Jan 21, 2021 at 8:58 PM Fujii Masao  wrote:
> My opinion is to check "!all", but if others prefer using such boolean flag,
> I'd withdraw my opinion.

I'm really sorry, actually if (!all) is enough there, my earlier
understanding was wrong.

> +   if ((all || entry->server_hashvalue == hashvalue) &&
>
> What about making disconnect_cached_connections() accept serverid instead
> of hashvalue, and perform the above comparison based on serverid? That is,
> I'm thinking "if (all || entry->serverid == serverid)". If we do that, we can
> simplify postgres_fdw_disconnect() a bit more by getting rid of the 
> calculation
> of hashvalue.

That's a good idea. I missed this point. Thanks.

> +   if ((all || entry->server_hashvalue == hashvalue) &&
> +entry->conn)
>
> I think that it's better to make the check of "entry->conn" independent
> like other functions in postgres_fdw/connection.c. What about adding
> the following check before the above?
>
> /* Ignore cache entry if no open connection right now */
> if (entry->conn == NULL)
> continue;

Done.

> +			/*
> +			 * If the server has been dropped in the current explicit
> +			 * transaction, then this entry would have been invalidated
> +			 * in pgfdw_inval_callback at the end of drop sever
> +			 * command. Note that this connection would not have been
> +			 * closed in pgfdw_inval_callback because it is still being
> +			 * used in the current explicit transaction. So, assert
> +			 * that here.
> +			 */
> +			Assert(entry->invalidated);
>
> As this comment explains, even when the connection is used in the transaction,
> its server can be dropped in the same transaction. The connection can remain
> until the end of the transaction even though its server has already been dropped.
> I'm now wondering if this behavior itself is problematic and should be forbidden.
> Of course, this is a separate topic from this patch, though...
>
> BTW, my just idea for that is;
> 1. change postgres_fdw_get_connections() return also serverid and xact_depth.
> 2. make postgres_fdw define the event trigger on DROP SERVER command so that
>  an error is thrown if the connection to the server is still in use.
>  The event trigger function uses postgres_fdw_get_connections() to check
>  if the server connection is still in use or not.
>
> I'm not sure if this just idea is really feasible or not, though...

I'm not quite sure if we can create such a dependency i.e. blocking
"drop foreign server" when at least one session has an in use cached
connection on it? What if a user wants to drop a server from one session,
while other sessions, one after the other, keep having in-use connections
related to that server (though this use case sounds impractical)? Will the
drop of the server ever be successful? Since we can have hundreds of sessions
in a real-world Postgres environment, I don't know if it's a good idea to
create such a dependency.

As you suggested, this point can be discussed in a separate thread and
if any of the approaches proposed by you above is finalized we can
extend postgres_fdw_get_connections anytime.

Thoughts?

Attaching the v16 patch set, addressing the above review comments and also
adding a test case suggested upthread: postgres_fdw_disconnect() with an
existing server name returns false when the cache doesn't have an active
connection for it.

Please review the v16 patch set further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v16-0001-postgres_fdw-function-to-discard-cached-connecti.patch
Description: Binary data


v16-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch
Description: Binary data


v16-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-21 Thread Fujii Masao




On 2021/01/21 16:16, Bharath Rupireddy wrote:

On Thu, Jan 21, 2021 at 12:17 PM Fujii Masao
 wrote:

On 2021/01/21 14:46, Bharath Rupireddy wrote:

On Thu, Jan 21, 2021 at 10:06 AM Fujii Masao
 wrote:
   > >> +   if (entry->server_hashvalue == hashvalue &&

+   (entry->xact_depth > 0 || result))
+   {
+   hash_seq_term();
+   break;

entry->server_hashvalue can be 0? If yes, since postgres_fdw_disconnect_all()
specifies 0 as hashvalue, ISTM that the above condition can be true
unexpectedly. Can we replace this condition with just "if (!all)"?


I don't think entry->server_hashvalue can be zero, because
GetSysCacheHashValue1/CatalogCacheComputeHashValue will not return 0
as a hash value. I have not seen code comparing a hashvalue with the
expectation that it can be 0; for instance, see "if (hashvalue == 0
|| riinfo->oidHashValue == hashvalue)".

Having just "if (!all)" there, something like below, doesn't suffice because we
might call hash_seq_term() when some connection other than the given
foreign server's connection is in use.


No because we check the following condition before reaching that code. No?

+   if ((all || entry->server_hashvalue == hashvalue) &&


I was thinking that "(entry->xact_depth > 0 || result))" condition is not
necessary because "result" is set to true when xact_depth <= 0 and that
condition always indicates true.


I think that condition is too confusing. How about having a boolean
can_terminate_scan like below?


Thanks for thinking about this. But at least for me, "if (!all)" does not look
so confusing, and the comment seems to explain why we can end the scan.


May I know if it's okay to have the boolean can_terminate_scan as shown in [1]?


My opinion is to check "!all", but if others prefer using such boolean flag,
I'd withdraw my opinion.

+   if ((all || entry->server_hashvalue == hashvalue) &&

What about making disconnect_cached_connections() accept serverid instead
of hashvalue, and perform the above comparison based on serverid? That is,
I'm thinking "if (all || entry->serverid == serverid)". If we do that, we can
simplify postgres_fdw_disconnect() a bit more by getting rid of the calculation
of hashvalue.

+   if ((all || entry->server_hashvalue == hashvalue) &&
+entry->conn)

I think that it's better to make the check of "entry->conn" independent
like other functions in postgres_fdw/connection.c. What about adding
the following check before the above?

/* Ignore cache entry if no open connection right now */
if (entry->conn == NULL)
continue;

+			/*
+			 * If the server has been dropped in the current explicit
+			 * transaction, then this entry would have been invalidated
+			 * in pgfdw_inval_callback at the end of drop sever
+			 * command. Note that this connection would not have been
+			 * closed in pgfdw_inval_callback because it is still being
+			 * used in the current explicit transaction. So, assert
+			 * that here.
+			 */
+			Assert(entry->invalidated);

As this comment explains, even when the connection is used in the transaction,
its server can be dropped in the same transaction. The connection can remain
until the end of the transaction even though its server has already been dropped.
I'm now wondering if this behavior itself is problematic and should be forbidden.
Of course, this is a separate topic from this patch, though...

BTW, my just idea for that is;
1. change postgres_fdw_get_connections() return also serverid and xact_depth.
2. make postgres_fdw define the event trigger on DROP SERVER command so that
an error is thrown if the connection to the server is still in use.
The event trigger function uses postgres_fdw_get_connections() to check
if the server connection is still in use or not.

I'm not sure if this just idea is really feasible or not, though...

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




[PATCH] Disable bgworkers during servers start in pg_upgrade

2021-01-21 Thread Denis Laxalde
Hello,

We found an issue in pg_upgrade on a cluster with a third-party
background worker. The upgrade goes fine, but the new cluster is then in
an inconsistent state. The background worker comes from the PoWA
extension, but the issue does not appear to be related to this particular
code.

Here is a shell script to reproduce the issue (error at the end):

  OLDBINDIR=/usr/lib/postgresql/11/bin
  NEWBINDIR=/usr/lib/postgresql/13/bin
  
  OLDDATADIR=$(mktemp -d)
  NEWDATADIR=$(mktemp -d)
  
  $OLDBINDIR/initdb -D $OLDDATADIR
  echo "unix_socket_directories = '/tmp'" >> $OLDDATADIR/postgresql.auto.conf
  echo "shared_preload_libraries = 'pg_stat_statements, powa'" >> 
$OLDDATADIR/postgresql.auto.conf
  $OLDBINDIR/pg_ctl -D $OLDDATADIR -l $OLDDATADIR/pgsql.log start
  $OLDBINDIR/createdb -h /tmp powa
  $OLDBINDIR/psql -h /tmp -d powa -c "CREATE EXTENSION powa CASCADE"
  $OLDBINDIR/pg_ctl -D $OLDDATADIR -m fast stop
  
  $NEWBINDIR/initdb -D $NEWDATADIR
  cp $OLDDATADIR/postgresql.auto.conf $NEWDATADIR/postgresql.auto.conf
  
  $NEWBINDIR/pg_upgrade --old-datadir $OLDDATADIR --new-datadir $NEWDATADIR --old-bindir $OLDBINDIR
  
  $NEWBINDIR/pg_ctl -D $NEWDATADIR -l $NEWDATADIR/pgsql.log start
  $NEWBINDIR/psql -h /tmp -d powa -c "SELECT 1 FROM powa_snapshot_metas"
  # ERROR:  MultiXactId 1 has not been created yet -- apparent wraparound

(This needs PoWA to be installed; packages are available in the pgdg
repositories as postgresql-<version>-powa on Debian or powa_<version> on
RedHat, or directly from source code at
https://github.com/powa-team/powa-archivist).

As far as I currently understand, this is due to the data to be migrated
being somewhat inconsistent (from the perspective of pg_upgrade) when
the old cluster and its background workers get started in pg_upgrade
during the "checks" step. (The old cluster remains sane, still.)

As a solution, it seems that, for similar reasons that we restrict
socket access to prevent accidental connections (from commit
f763b77193), we should also prevent background workers from starting at this
step.

Please find attached a patch implementing this.

Thanks for considering,
Denis
>From 31b1f31cd3a822d23ccd5883120a013891ade0f3 Mon Sep 17 00:00:00 2001
From: Denis Laxalde 
Date: Wed, 20 Jan 2021 17:25:58 +0100
Subject: [PATCH] Disable background workers during servers start in pg_upgrade

We disable shared_preload_libraries to prevent background workers to
initialize and start during server start in pg_upgrade.

In essence, this is for a similar reason that we use a restricted socket
access from f763b77193b04eba03a1f4ce46df34dc0348419e because background
workers may produce undesired activities during the upgrade.

Author: Denis Laxalde 
Co-authored-by: Jehan-Guillaume de Rorthais 
---
 src/bin/pg_upgrade/server.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_upgrade/server.c b/src/bin/pg_upgrade/server.c
index 31b1425202..fab95a2d24 100644
--- a/src/bin/pg_upgrade/server.c
+++ b/src/bin/pg_upgrade/server.c
@@ -240,11 +240,13 @@ start_postmaster(ClusterInfo *cluster, bool report_and_exit_on_error)
 	 * crash, the new cluster has to be recreated anyway.  fsync=off is a big
 	 * win on ext4.
 	 *
+	 * Turn off background workers by emptying shared_preload_libraries.
+	 *
 	 * Force vacuum_defer_cleanup_age to 0 on the new cluster, so that
 	 * vacuumdb --freeze actually freezes the tuples.
 	 */
 	snprintf(cmd, sizeof(cmd),
-			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s\" start",
+			 "\"%s/pg_ctl\" -w -l \"%s\" -D \"%s\" -o \"-p %d%s%s %s%s -c shared_preload_libraries=''\" start",
 			 cluster->bindir, SERVER_LOG_FILE, cluster->pgconfig, cluster->port,
 			 (cluster->controldata.cat_ver >=
 			  BINARY_UPGRADE_SERVER_FLAG_CAT_VER) ? " -b" :
-- 
2.20.1



postgres_fdw: IMPORT FOREIGN SCHEMA ... LIMIT TO (partition)

2021-01-21 Thread Matthias van de Meent
Hi,

Recently I was trying to copy some of the data of one database to
another through postgres_fdw, and found that it wouldn't import the
partition containing that data through IMPORT FOREIGN SCHEMA, even when
I explicitly specified the name of that table in the LIMIT TO clause.

I realised the reason is that postgres_fdw currently and explicitly
disallows importing foreign partitions. This is a reasonable default
when importing a whole schema, but if I want to explicitly import
one of a partitioned table's partitions, that now requires me to
manually copy the foreign table's definition through the use of
CREATE FOREIGN TABLE, which is a hassle and prone to mistakes.

As such, I propose the attached patch, in which the 'no
partitions' restriction of postgres_fdw is lifted for the LIMIT TO
clause. This has several benefits, including not holding locks on the
foreign partition root during queries, and less surprising behaviour
for LIMIT TO ("a table that happens to be a partition").
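
A sketch of the intended usage with the patch applied (server, schema and
partition names are made up):

IMPORT FOREIGN SCHEMA public
    LIMIT TO (measurement_2021_01)   -- a partition on the remote side
    FROM SERVER remote_srv INTO local_schema;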

Regards,

Matthias van de Meent
From f69b961d743b3fdd647b761d6069d8fa9163e2c7 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Thu, 21 Jan 2021 13:03:32 +0100
Subject: [PATCH v1] Update postgres_fdw to import partitions when named in
 IMPORT SCHEMA LIMIT TO.

---
 .../postgres_fdw/expected/postgres_fdw.out| 34 ++-
 contrib/postgres_fdw/postgres_fdw.c   |  3 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  4 +--
 doc/src/sgml/postgres-fdw.sgml| 12 +++
 4 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b4a04d2c14..cb04e7a2ab 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8435,27 +8435,29 @@ FDW options: (schema_name 'import_source', table_name 'x 5')
 
 -- Check LIMIT TO and EXCEPT
 CREATE SCHEMA import_dest4;
-IMPORT FOREIGN SCHEMA import_source LIMIT TO (t1, nonesuch)
+IMPORT FOREIGN SCHEMA import_source LIMIT TO (t1, nonesuch, t4_part)
   FROM SERVER loopback INTO import_dest4;
 \det+ import_dest4.*
- List of foreign tables
-Schema| Table |  Server  |  FDW options   | Description 
---+---+--++-
- import_dest4 | t1| loopback | (schema_name 'import_source', table_name 't1') | 
-(1 row)
+List of foreign tables
+Schema|  Table  |  Server  | FDW options | Description 
+--+-+--+-+-
+ import_dest4 | t1  | loopback | (schema_name 'import_source', table_name 't1')  | 
+ import_dest4 | t4_part | loopback | (schema_name 'import_source', table_name 't4_part') | 
+(2 rows)
 
-IMPORT FOREIGN SCHEMA import_source EXCEPT (t1, "x 4", nonesuch)
+IMPORT FOREIGN SCHEMA import_source EXCEPT (t1, "x 4", nonesuch, t4_part)
   FROM SERVER loopback INTO import_dest4;
 \det+ import_dest4.*
- List of foreign tables
-Schema| Table |  Server  |   FDW options   | Description 
---+---+--+-+-
- import_dest4 | t1| loopback | (schema_name 'import_source', table_name 't1')  | 
- import_dest4 | t2| loopback | (schema_name 'import_source', table_name 't2')  | 
- import_dest4 | t3| loopback | (schema_name 'import_source', table_name 't3')  | 
- import_dest4 | t4| loopback | (schema_name 'import_source', table_name 't4')  | 
- import_dest4 | x 5   | loopback | (schema_name 'import_source', table_name 'x 5') | 
-(5 rows)
+List of foreign tables
+Schema|  Table  |  Server  | FDW options | Description 
+--+-+--+-+-
+ import_dest4 | t1  | loopback | (schema_name 'import_source', table_name 't1')  | 
+ import_dest4 | t2  | loopback | (schema_name 'import_source', table_name 't2')  | 
+ import_dest4 | t3  | loopback | (schema_name 'import_source', table_name 't3')  | 
+ import_dest4 | t4  | loopback | (schema_name 'import_source', table_name 't4')  | 
+ import_dest4 | t4_part | loopback | (schema_name 'import_source', table_name 't4_part') | 
+ import_dest4 | x 5 | loopback | (schema_name 'import_source', table_name 'x 5') | 
+(6 rows)
 
 -- Assorted error cases
 IMPORT FOREIGN SCHEMA import_source FROM SERVER loopback INTO import_dest4;
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9a31bbb86b..5f33d5cf1e 100644
--- 

Re: Stronger safeguard for archive recovery not to miss data

2021-01-21 Thread Laurenz Albe
On Thu, 2021-01-21 at 13:09 +, osumi.takami...@fujitsu.com wrote:
> > My vote is that we should not have a GUC for such an unlikely event, and 
> > that
> > stopping recovery is good enough.
> 
> OK. IIUC, my current patch for this fix doesn't need to be changed or 
> withdrawn.
> Thank you for your explanation.

Well, that's just my opinion.
Fujii Masao seemed to disagree with the patch, and his voice carries weight.

Yours,
Laurenz Albe





Re: New IndexAM API controlling index vacuum strategies

2021-01-21 Thread Masahiko Sawada
On Wed, Jan 20, 2021 at 7:58 AM Peter Geoghegan  wrote:
>
> On Sun, Jan 17, 2021 at 9:18 PM Masahiko Sawada  wrote:
> > After more thought, I think that ambulkdelete needs to be able to
> > refer the answer to amvacuumstrategy. That way, the index can skip
> > bulk-deletion when lazy vacuum doesn't vacuum heap and it also doesn’t
> > want to do that.
>
> Makes sense.
>
> BTW, your patch has bitrot already. Peter E's recent pageinspect
> commit happens to conflict with this patch. It might make sense to
> produce a new version that just fixes the bitrot, so that other people
> don't have to deal with it each time.
>
> > I’ve attached the updated version patch that includes the following changes:
>
> Looks good. I'll give this version a review now. I will do a lot more
> soon. I need to come up with a good benchmark for this, that I can
> return to again and again during review as needed.

Thank you for reviewing the patches.

>
> Some feedback on the first patch:
>
> * Just so you know: I agree with you about handling
> VACOPT_TERNARY_DISABLED in the index AM's amvacuumstrategy routine. I
> think that it's better to do that there, even though this choice may
> have some downsides.
>
> * Can you add some "stub" sgml doc changes for this? Doesn't have to
> be complete in any way. Just a placeholder for later, that has the
> correct general "shape" to orientate the reader of the patch. It can
> just be a FIXME comment, plus basic mechanical stuff -- details of the
> new amvacuumstrategy_function routine and its signature.
>

The 0002 patch had the doc update (I mistakenly included it in the 0002
patch). Is that update what you meant?

> Some feedback on the second patch:
>
> * Why do you move around IndexVacuumStrategy in the second patch?
> Looks like a rebasing oversight.

Check.

>
> * Actually, do we really need the first and second patches to be
> separate patches? I agree that the nbtree patch should be a separate
> patch, but dividing the first two sets of changes doesn't seem like it
> adds much. Did I miss some something?

I separated the patches since I used to have separate patches when
adding other index AM options required by parallel vacuum. But I
agree with merging the first two patches.

>
> * Is the "MAXALIGN(sizeof(ItemIdData)))" change in the definition of
> MaxHeapTuplesPerPage appropriate? Here is the relevant section from
> the patch:
>
> diff --git a/src/include/access/htup_details.h
> b/src/include/access/htup_details.h
> index 7c62852e7f..038e7cd580 100644
> --- a/src/include/access/htup_details.h
> +++ b/src/include/access/htup_details.h
> @@ -563,17 +563,18 @@ do { \
>  /*
>   * MaxHeapTuplesPerPage is an upper bound on the number of tuples that can
>   * fit on one heap page.  (Note that indexes could have more, because they
> - * use a smaller tuple header.)  We arrive at the divisor because each tuple
> - * must be maxaligned, and it must have an associated line pointer.
> + * use a smaller tuple header.)  We arrive at the divisor because each line
> + * pointer must be maxaligned.
> *** SNIP ***
>  #define MaxHeapTuplesPerPage\
> -((int) ((BLCKSZ - SizeOfPageHeaderData) / \
> -(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData
> +((int) ((BLCKSZ - SizeOfPageHeaderData) / 
> (MAXALIGN(sizeof(ItemIdData)
>
> It's true that ItemIdData structs (line pointers) are aligned, but
> they're not MAXALIGN()'d. If they were then the on-disk size of line
> pointers would generally be 8 bytes, not 4 bytes.

You're right. Will fix.

>
> * Maybe it would be better if you just changed the definition such
> that "MAXALIGN(SizeofHeapTupleHeader)" became "MAXIMUM_ALIGNOF", with
> no other changes? (Some variant of this suggestion might be better,
> not sure.)
>
> For some reason that feels a bit safer: we still have an "imaginary
> tuple header", but it's just 1 MAXALIGN() quantum now. This is still
> much less than the current 3 MAXALIGN() quantums (i.e. what
> MaxHeapTuplesPerPage treats as the tuple header size). Do you think
> that this alternative approach will be noticeably less effective
> within vacuumlazy.c?
>
> Note that you probably understand the issue with MaxHeapTuplesPerPage
> for vacuumlazy.c better than I do currently. I'm still trying to
> understand your choices, and to understand what is really important
> here.

Yeah, using MAXIMUM_ALIGNOF seems better for safety. I shared my
thoughts on the issue with MaxHeapTuplesPerPage yesterday. I think we
need to discuss how to deal with that.

>
> * Maybe add a #define for the value 0.7? (I refer to the value used in
> choose_vacuum_strategy() to calculate a "this is the number of LP_DEAD
> line pointers that we consider too many" cut off point, which is to be
> applied throughout lazy_scan_heap() processing.)
>

Agreed.

> * I notice that your new lazy_vacuum_table_and_indexes() function is
> the only place that calls lazy_vacuum_table_and_indexes(). I think
> that you should merge them together -- replace the 

RE: Enhance traceability of wal_level changes for backup management

2021-01-21 Thread osumi.takami...@fujitsu.com
Hi


Apologies for my delay.
On Wednesday, January 6, 2021 7:03 PM I wrote:
> I'll continue the discussion of [2].
> We talked about how to recognize the time or LSN when/where wal_level is
> changed to 'none' there.
> 
> You said
> > The use case I imagined is that the user temporarily changes wal_level
> > to 'none' from 'replica' or 'logical' to speed up loading and changes
> > back to the normal. In this case, the backups taken before the
> > wal_level change cannot be used to restore the database to the point
> > after the wal_level change. So I think backup management tools would
> > want to recognize the time or LSN when/where wal_level is changed to
> > ‘none’ in order to do some actions such as invalidating older backups,
> > re-calculating backup redundancy etc.
> > Actually the same is true when the user changes to ‘minimal’. The
> > tools would need to recognize the time or LSN in this case too. Since
> > temporarily changing wal_level has been an uncommon use case some
> > tools perhaps are not aware of that yet. But since introducing
> > wal_level = ’none’ could make the change common, I think we need to
> provide a way for the tools.
Before my implementation, I'd like to confirm something.

As of now, I think there are two major ideas already.
I think implementing the 1st idea suffices.
If no one disagrees with it, I'll proceed with (1) below.

(1) writing the time or LSN in the control file
to indicate when/where wal_level was changed to 'minimal'
from a higher level, in order to invalidate old backups or alert users.
I think we would reset this when the user executes pg_basebackup successfully.

(2) implementing incremental counters that indicate
drops of wal_level from replica to minimal (or between other levels).
Their purpose is to compare the wal_level changes between snapshots:
when a monitoring tool detects a difference in the counter,
it can tell immediately that something happened without checking the WAL in between.

The former could give accurate information for backup management,
while the latter gives an easier way to compare snapshots, I think.

By the way, I thankfully got advice from Tsunakawa-san to look at Oracle
features such as Oracle Server Alert or its backup catalog management
capability. However, because that development would be huge, I'd like to
choose either the first option or the second one, and, for the purpose of
giving better information, I prefer the first one.

Any comments ?

Best Regards,
Takamichi Osumi





Re: patch: reduce overhead of execution of CALL statement in no atomic mode from PL/pgSQL

2021-01-21 Thread Pavel Stehule
On Thu, 21 Jan 2021 at 14:37, Pavel Stehule wrote:

> Hi
>
> This is a little bit of an enhanced version of the previous patch. The
> worst case overhead is reduced almost to zero. The local resource owner is
> created only when routine is executed in non-atomic mode, and when routine
> contains a CALL statement.
>

Sorry. Last patch wasn't tested well.

Regards

Pavel




> Regards
>
> Pavel
>
>
diff --git a/doc/src/sgml/spi.sgml b/doc/src/sgml/spi.sgml
index f5e0a35da0..8cf66f8c7b 100644
--- a/doc/src/sgml/spi.sgml
+++ b/doc/src/sgml/spi.sgml
@@ -1722,6 +1722,138 @@ int SPI_execute_plan(SPIPlanPtr plan, Datum * 
 
 
 
+
+ SPI_execute_plan_extended
+
+ 
+  SPI_execute_plan_extended
+  3
+ 
+
+ 
+  SPI_execute_plan_extended
+  execute a statement prepared by SPI_prepare
+ 
+
+ 
+
+int SPI_execute_plan_extended(SPIPlanPtr plan,
+  ParamListInfo params,
+  bool read_only,
+  long count,
+  const SPIExecuteOptions * options)
+
+ 
+
+ 
+  Description
+
+  
+   SPI_execute_plan_extended executes a statement
+   prepared by SPI_prepare.
+   This function is equivalent to SPI_execute_plan_with_paramlist,
+   but allows to pass additional options.
+  
+ 
+
+ 
+  Arguments
+
+  
+   
+SPIPlanPtr plan
+
+ 
+  prepared statement (returned by SPI_prepare)
+ 
+
+   
+
+   
+ParamListInfo params
+
+ 
+  data structure containing parameter types and values; NULL if none
+ 
+
+   
+
+   
+bool read_only
+
+ true for read-only execution
+
+   
+
+   
+long count
+
+ 
+  maximum number of rows to return,
+  or 0 for no limit
+ 
+
+   
+
+   
+const SPIPExecuteOptions * options
+
+ 
+  struct containing optional arguments
+ 
+
+   
+  
+
+  
+   Callers should always zero out the entire options
+   struct, then fill whichever fields they want to set.  This ensures forward
+   compatibility of code, since any fields that are added to the struct in
+   future will be defined to behave backwards-compatibly if they are zero.
+   The currently available options fields are:
+  
+
+  
+   
+DestReceiver * dest
+
+ 
+  Query target
+ 
+
+   
+
+   
+ResourceOwner owner
+
+ 
+  The resource owner that will be used for plan execution. Now it is used
+  as cached query plan owner only. This resource owner should to have longer
+  life cycle than executed command. Current resource owner is used when
+  this option is NULL.
+ 
+
+   
+  
+ 
+
+ 
+  Return Value
+
+  
+   The return value is the same as for SPI_execute_plan.
+  
+
+  
+   SPI_processed and
+   SPI_tuptable are set as in
+   SPI_execute_plan if successful.
+  
+ 
+
+
+
+
 
  SPI_execute_plan_with_paramlist
 
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 653ef8e41a..3de2932a03 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -230,7 +230,7 @@ ExecuteQuery(ParseState *pstate,
 	   entry->plansource->query_string);
 
 	/* Replan if needed, and increment plan refcount for portal */
-	cplan = GetCachedPlan(entry->plansource, paramLI, false, NULL);
+	cplan = GetCachedPlan(entry->plansource, paramLI, NULL, NULL);
 	plan_list = cplan->stmt_list;
 
 	/*
@@ -651,7 +651,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	}
 
 	/* Replan if needed, and acquire a transient refcount */
-	cplan = GetCachedPlan(entry->plansource, paramLI, true, queryEnv);
+	cplan = GetCachedPlan(entry->plansource, paramLI, CurrentResourceOwner, queryEnv);
 
 	INSTR_TIME_SET_CURRENT(planduration);
 	INSTR_TIME_SUBTRACT(planduration, planstart);
@@ -687,7 +687,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	if (estate)
 		FreeExecutorState(estate);
 
-	ReleaseCachedPlan(cplan, true);
+	ReleaseCachedPlan(cplan, CurrentResourceOwner);
 }
 
 /*
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index e28d242922..17df852959 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -67,7 +67,7 @@ static void _SPI_prepare_oneshot_plan(const char *src, SPIPlanPtr plan);
 static int	_SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			  Snapshot snapshot, Snapshot crosscheck_snapshot,
 			  bool read_only, bool fire_triggers, uint64 tcount,
-			  DestReceiver *caller_dest);
+			  DestReceiver *caller_dest, ResourceOwner owner);
 
 static ParamListInfo _SPI_convert_params(int nargs, Oid *argtypes,
 		 Datum *Values, const char *Nulls);
@@ -521,7 +521,7 @@ SPI_execute(const char *src, bool read_only, long tcount)
 
	res = _SPI_execute_plan(&plan, NULL,
 			InvalidSnapshot, InvalidSnapshot,
-			read_only, true, tcount, NULL);
+			read_only, true, tcount, NULL, NULL);
 
 	_SPI_end_call(true);
 	return res;

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-21 Thread Alexey Kondratov

On 2021-01-21 04:41, Michael Paquier wrote:

On Wed, Jan 20, 2021 at 03:34:39PM -0300, Alvaro Herrera wrote:

On 2021-Jan-20, Alexey Kondratov wrote:

Ugh, forgot to attach the patches. Here they are.


Yeah, looks reasonable.



+
+   if (changed)
+   /* Record dependency on tablespace */
+   changeDependencyOnTablespace(RelationRelationId,
+reloid, 
rd_rel->reltablespace);


Why have a separate "if (changed)" block here instead of merging with
the above?


Yep.



Sure, this is a refactoring artifact.


+   if (SetRelTablespace(reloid, newTableSpace))
+   /* Make sure the reltablespace change is visible */
+   CommandCounterIncrement();
At quick glance, I am wondering why you just don't do a CCI within
SetRelTablespace().



I did it that way for better readability at first, since it looks more 
natural to make a change (SetRelTablespace) and then make it 
visible with CCI. A second argument was that in the case of 
reindex_index() we also have to call RelationAssumeNewRelfilenode() and 
RelationDropStorage() before doing CCI and making the new tablespace 
visible. And this part is critical, I guess.




+  This specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
What is an unsuitable relation?  How can the end user know that?



This was referring to mapped relations mentioned in the previous 
sentence. I have tried to rewrite this part and make it more specific in 
my current version. Also added Justin's changes to the docs and comment.



This is missing ACL checks when moving the index into a new location,
so this requires some pg_tablespace_aclcheck() calls, and the other
patches share the same issue.



I added proper pg_tablespace_aclcheck() calls to reindex_index() and 
ReindexPartitions().
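
For illustration, the kind of check meant here looks roughly like the sketch below. It assumes the target tablespace OID is already resolved; the wrapper function name is hypothetical, while pg_tablespace_aclcheck(), aclcheck_error() and get_tablespace_name() are the usual tablespace permission helpers.

#include "postgres.h"
#include "commands/tablespace.h"
#include "miscadmin.h"
#include "utils/acl.h"

/* Sketch: refuse to move an index unless the user has CREATE on the target
 * tablespace; where exactly this runs in reindex_index()/ReindexPartitions()
 * is up to the patch. */
static void
check_reindex_tablespace(Oid tablespaceOid)
{
	AclResult	aclresult;

	if (!OidIsValid(tablespaceOid))
		return;					/* no tablespace change requested */

	aclresult = pg_tablespace_aclcheck(tablespaceOid, GetUserId(), ACL_CREATE);
	if (aclresult != ACLCHECK_OK)
		aclcheck_error(aclresult, OBJECT_TABLESPACE,
					   get_tablespace_name(tablespaceOid));
}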



+   else if (partkind == RELKIND_PARTITIONED_TABLE)
+   {
+   Relation rel = table_open(partoid, ShareLock);
+   List*indexIds = RelationGetIndexList(rel);
+   ListCell *lc;
+
+   table_close(rel, NoLock);
+   foreach (lc, indexIds)
+   {
+   Oid indexid = lfirst_oid(lc);
+   (void) set_rel_tablespace(indexid, params->tablespaceOid);

+   }
+   }
This is really a good question.  ReindexPartitions() would trigger one
transaction per leaf to work on.  Changing the tablespace of the
partitioned table(s) before doing any work has the advantage to tell
any new partition to use the new tablespace.  Now, I see a struggling
point here: what should we do if the processing fails in the middle of
the move, leaving a portion of the leaves in the previous tablespace?
On a follow-up reindex with the same command, should the command force
a reindex even on the partitions that have been moved?  Or could there
be a point in skipping the partitions that are already on the new
tablespace and only process the ones on the previous tablespace?  It
seems to me that the first scenario makes the most sense as currently
a REINDEX works on all the relations defined, though there could be
use cases for the second case.  This should be documented, I think.



I agree that a follow-up REINDEX should also reindex moved partitions, 
since REINDEX (TABLESPACE ...) is still a REINDEX first of all. I will try to 
put something about this into the docs. Also, I think that we cannot 
be sure that nothing happened to already reindexed partitions between 
two consecutive REINDEX calls.



There are no tests for partitioned tables, aka we'd want to make sure
that the new partitioned index is on the correct tablespace, as well
as all its leaves.  It may be better to have at least two levels of
partitioned tables, as well as a partitioned table with no leaves in
the cases dealt with.



Yes, sure, it makes sense.


+*
+* Even if a table's indexes were moved to a new tablespace, the index
+* on its toast table is not normally moved.
 */
Still, REINDEX (TABLESPACE) TABLE should move all of them to be
consistent with ALTER TABLE SET TABLESPACE, but that's not the case
with this code, no?  This requires proper test coverage, but there is
nothing of the kind in this patch.


You are right, we do not move TOAST indexes now, since 
IsSystemRelation() is true for TOAST indexes, so I thought that we 
should not allow moving them without allow_system_table_mods=true. Now I 
wonder why ALTER TABLE does that.


I am going to attach the new version of the patch set today or tomorrow.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: patch: reduce overhead of execution of CALL statement in no atomic mode from PL/pgSQL

2021-01-21 Thread Pavel Stehule
Hi

This is a slightly enhanced version of the previous patch. The
worst-case overhead is reduced almost to zero. The local resource owner is
created only when the routine is executed in non-atomic mode and the routine
contains a CALL statement.

Regards

Pavel
diff --git a/doc/src/sgml/spi.sgml b/doc/src/sgml/spi.sgml
index f5e0a35da0..8cf66f8c7b 100644
--- a/doc/src/sgml/spi.sgml
+++ b/doc/src/sgml/spi.sgml
@@ -1722,6 +1722,138 @@ int SPI_execute_plan(SPIPlanPtr plan, Datum * 
 
 
 
+
+ SPI_execute_plan_extended
+
+ 
+  SPI_execute_plan_extended
+  3
+ 
+
+ 
+  SPI_execute_plan_extended
+  execute a statement prepared by SPI_prepare
+ 
+
+ 
+
+int SPI_execute_plan_extended(SPIPlanPtr plan,
+  ParamListInfo params,
+  bool read_only,
+  long count,
+  const SPIExecuteOptions * options)
+
+ 
+
+ 
+  Description
+
+  
+   SPI_execute_plan_extended executes a statement
+   prepared by SPI_prepare.
+   This function is equivalent to SPI_execute_plan_with_paramlist,
+   but allows passing additional options.
+  
+ 
+
+ 
+  Arguments
+
+  
+   
+SPIPlanPtr plan
+
+ 
+  prepared statement (returned by SPI_prepare)
+ 
+
+   
+
+   
+ParamListInfo params
+
+ 
+  data structure containing parameter types and values; NULL if none
+ 
+
+   
+
+   
+bool read_only
+
+ true for read-only execution
+
+   
+
+   
+long count
+
+ 
+  maximum number of rows to return,
+  or 0 for no limit
+ 
+
+   
+
+   
+const SPIExecuteOptions * options
+
+ 
+  struct containing optional arguments
+ 
+
+   
+  
+
+  
+   Callers should always zero out the entire options
+   struct, then fill whichever fields they want to set.  This ensures forward
+   compatibility of code, since any fields that are added to the struct in
+   future will be defined to behave backwards-compatibly if they are zero.
+   The currently available options fields are:
+  
+
+  
+   
+DestReceiver * dest
+
+ 
+  Query target
+ 
+
+   
+
+   
+ResourceOwner owner
+
+ 
+  The resource owner that will be used for plan execution. Currently it is
+  used only as the owner of the cached query plan. This resource owner should
+  have a longer life cycle than the executed command. The current resource
+  owner is used when this option is NULL.
+ 
+
+   
+  
+ 
+
+ 
+  Return Value
+
+  
+   The return value is the same as for SPI_execute_plan.
+  
+
+  
+   SPI_processed and
+   SPI_tuptable are set as in
+   SPI_execute_plan if successful.
+  
+ 
+
+
+
+
 
  SPI_execute_plan_with_paramlist
 
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 653ef8e41a..3de2932a03 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -230,7 +230,7 @@ ExecuteQuery(ParseState *pstate,
 	   entry->plansource->query_string);
 
 	/* Replan if needed, and increment plan refcount for portal */
-	cplan = GetCachedPlan(entry->plansource, paramLI, false, NULL);
+	cplan = GetCachedPlan(entry->plansource, paramLI, NULL, NULL);
 	plan_list = cplan->stmt_list;
 
 	/*
@@ -651,7 +651,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	}
 
 	/* Replan if needed, and acquire a transient refcount */
-	cplan = GetCachedPlan(entry->plansource, paramLI, true, queryEnv);
+	cplan = GetCachedPlan(entry->plansource, paramLI, CurrentResourceOwner, queryEnv);
 
 	INSTR_TIME_SET_CURRENT(planduration);
 	INSTR_TIME_SUBTRACT(planduration, planstart);
@@ -687,7 +687,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	if (estate)
 		FreeExecutorState(estate);
 
-	ReleaseCachedPlan(cplan, true);
+	ReleaseCachedPlan(cplan, CurrentResourceOwner);
 }
 
 /*
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index e28d242922..17df852959 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -67,7 +67,7 @@ static void _SPI_prepare_oneshot_plan(const char *src, SPIPlanPtr plan);
 static int	_SPI_execute_plan(SPIPlanPtr plan, ParamListInfo paramLI,
 			  Snapshot snapshot, Snapshot crosscheck_snapshot,
 			  bool read_only, bool fire_triggers, uint64 tcount,
-			  DestReceiver *caller_dest);
+			  DestReceiver *caller_dest, ResourceOwner owner);
 
 static ParamListInfo _SPI_convert_params(int nargs, Oid *argtypes,
 		 Datum *Values, const char *Nulls);
@@ -521,7 +521,7 @@ SPI_execute(const char *src, bool read_only, long tcount)
 
	res = _SPI_execute_plan(&plan, NULL,
 			InvalidSnapshot, InvalidSnapshot,
-			read_only, true, tcount, NULL);
+			read_only, true, tcount, NULL, NULL);
 
 	_SPI_end_call(true);
 	return res;
@@ -555,7 +555,7 @@ SPI_execute_plan(SPIPlanPtr plan, Datum *Values, const char *Nulls,
 			_SPI_convert_params(plan->nargs, 

Re: Jsonpath ** vs lax mode

2021-01-21 Thread Alvaro Herrera
On 2021-Jan-21, Alexander Korotkov wrote:

> Requiring strict mode for ** is a solution, but probably too restrictive...
> 
> What do you think about making just subsequent accessor after ** not
> to unwrap arrays.  That would be a bit tricky to implement, but
> probably that would better satisfy the user needs.

Hmm, why is it too restrictive?  If the user needs to further drill into
the JSON, can't they chain json_path_query calls, specifying (or
defaulting to) lax mode for the part that doesn't include the ** expression?

-- 
Álvaro Herrera   Valdivia, Chile




Identify missing publications from publisher while create/alter subscription.

2021-01-21 Thread vignesh C
Hi,

Creating or altering a subscription succeeds even when we specify a
publication which does not exist on the publisher. I felt we should
throw an error in this case; that will help the user catch a typo in the
CREATE SUBSCRIPTION command or create the publication before the
subscription is created.
If the above analysis looks correct, then please find attached a patch that
checks whether the specified publications are present on the publisher and
throws an error if any of them is missing.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From 2e7a6e41f789f7f1717058e9c78441ae8d5faf9e Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Thu, 21 Jan 2021 18:38:54 +0530
Subject: [PATCH v1] Identify missing publications from publisher while
 create/alter subscription.

Creating or altering a subscription succeeds even when we specify a publication which
does not exist on the publisher. This patch checks whether the specified publications
are present on the publisher and throws an error if any of them is
missing.
---
 src/backend/commands/subscriptioncmds.c | 87 +
 src/test/subscription/t/100_bugs.pl | 46 -
 2 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c
index 082f785..16c89cb 100644
--- a/src/backend/commands/subscriptioncmds.c
+++ b/src/backend/commands/subscriptioncmds.c
@@ -46,6 +46,7 @@
 #include "utils/syscache.h"
 
 static List *fetch_table_list(WalReceiverConn *wrconn, List *publications);
+static void check_publications(WalReceiverConn *wrconn, List *publications);
 
 /*
  * Common option parsing function for CREATE and ALTER SUBSCRIPTION commands.
@@ -490,6 +491,7 @@ CreateSubscription(CreateSubscriptionStmt *stmt, bool isTopLevel)
 
 		PG_TRY();
 		{
+			check_publications(wrconn, publications);
 			/*
 			 * Set sync state based on if we were asked to do data copy or
 			 * not.
@@ -576,6 +578,8 @@ AlterSubscription_refresh(Subscription *sub, bool copy_data)
 		ereport(ERROR,
 (errmsg("could not connect to the publisher: %s", err)));
 
+	check_publications(wrconn, sub->publications);
+
 	/* Get the table list from publisher. */
 	pubrel_names = fetch_table_list(wrconn, sub->publications);
 
@@ -1211,6 +1215,89 @@ AlterSubscriptionOwner_oid(Oid subid, Oid newOwnerId)
 }
 
 /*
+ * Verify that the specified publication(s) exist on the publisher.
+ */
+static void
+check_publications(WalReceiverConn *wrconn, List *publications)
+{
+	WalRcvExecResult *res;
+	StringInfoData cmd;
+	StringInfoData nonExistentPublications;
+	bool		first;
+	TupleTableSlot *slot;
+	ListCell   *lc;
+	List	   *publicationsCopy = NIL;
+	Oid			tableRow[1] = {TEXTOID};
+
+	Assert(list_length(publications) > 0);
+	initStringInfo(&cmd);
+	appendStringInfoString(&cmd, "SELECT t.pubname\n"
+		   "  FROM pg_catalog.pg_publication t\n"
+		   " WHERE t.pubname IN (");
+	first = true;
+	foreach(lc, publications)
+	{
+		char	   *pubname = strVal(lfirst(lc));
+
+		if (first)
+			first = false;
+		else
+			appendStringInfoString(&cmd, ", ");
+
+		appendStringInfoString(&cmd, quote_literal_cstr(pubname));
+	}
+	appendStringInfoChar(&cmd, ')');
+
+	res = walrcv_exec(wrconn, cmd.data, 1, tableRow);
+	pfree(cmd.data);
+
+	if (res->status != WALRCV_OK_TUPLES)
+		ereport(ERROR,
+(errmsg("could not receive list of publications from the publisher: %s",
+		res->err)));
+
+	publicationsCopy = list_copy(publications);
+
+	/* Process publications. */
+	slot = MakeSingleTupleTableSlot(res->tupledesc, &TTSOpsMinimalTuple);
+	while (tuplestore_gettupleslot(res->tuplestore, true, false, slot))
+	{
+		char	   *pubname;
+		bool		isnull;
+
+		pubname = TextDatumGetCString(slot_getattr(slot, 1, &isnull));
+		Assert(!isnull);
+
+		/* Delete the publication present in publisher from the list. */
+		publicationsCopy = list_delete(publicationsCopy, makeString(pubname));
+		ExecClearTuple(slot);
+	}
+
+	walrcv_clear_result(res);
+	if (list_length(publicationsCopy))
+	{
+		first = true;
+
+		/* Convert the publications which do not exist into a string. */
+		initStringInfo(&nonExistentPublications);
+		foreach(lc, publicationsCopy)
+		{
+			char	   *pubname = strVal(lfirst(lc));
+
+			if (first)
+first = false;
+			else
+				appendStringInfoString(&nonExistentPublications, ", ");
+			appendStringInfoString(&nonExistentPublications, pubname);
+		}
+
+		ereport(ERROR,
+(errmsg("publication(s) %s does not exist in the publisher",
+		nonExistentPublications.data)));
+	}
+}
+
+/*
  * Get the list of tables which belong to specified publications on the
  * publisher connection.
  */
diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl
index d1e407a..54e082d 100644
--- a/src/test/subscription/t/100_bugs.pl
+++ b/src/test/subscription/t/100_bugs.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 5;
+use Test::More tests => 9;
 
 # Bug #15114
 
@@ -153,3 

Re: Boundary value check in lazy_tid_reaped()

2021-01-21 Thread Masahiko Sawada
On Wed, Jan 20, 2021 at 3:50 PM Peter Eisentraut
 wrote:
>
> On 2020-10-30 02:43, Masahiko Sawada wrote:
> > Using the integer set is very memory efficient (5MB vs. 114MB in the
> > base case) and there is no 1GB limitation. Looking at the execution
> > time, I had expected that using the integer set is slower on recording
> > TIDs than using the simple array but in this experiment, there is no
> > big difference among methods. Perhaps the result will vary if vacuum
> > needs to record much more dead tuple TIDs. From the results, I can see
> > a good improvement in the integer set case and probably a good start
> > but if we want to use it also for lazy vacuum, we will need to improve
> > it so that it can be used on DSA for the parallel vacuum.
> >
> > I've attached the patch I used for the experiment that adds xx_vacuum
> > GUC parameter that controls the method of recording TIDs. Please note
> > that it doesn't support parallel vacuum.
>
> How do you want to proceed here?  The approach of writing a wrapper for
> bsearch with bound check sounded like a good start.  All the other ideas
> discussed here seem larger projects and would probably be out of scope
> of this commit fest.

Agreed. bsearch with bound check showed a reasonable improvement in my
evaluation in terms of performance. Regarding memory efficiency, we
can experiment with other methods later.

I've attached the patch that adds a bound check for encoded
item pointers before bsearch() in lazy_tid_reaped() and inlines the
function.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


bound_check_lazy_vacuum.patch
Description: Binary data
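
The idea, roughly, as a self-contained sketch (the names, the TID encoding helper and the comparator below are illustrative; the real change lives in the attached patch):

#include "postgres.h"
#include "storage/itemptr.h"

/* Encode an ItemPointer as a single comparable integer (sketch only). */
static inline int64
encode_tid(ItemPointer tid)
{
	return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
		   (int64) ItemPointerGetOffsetNumber(tid);
}

static int
cmp_int64(const void *a, const void *b)
{
	int64		lhs = *(const int64 *) a;
	int64		rhs = *(const int64 *) b;

	return (lhs > rhs) - (lhs < rhs);
}

/*
 * Sketch of the bound check: TIDs that fall outside the [first, last] range
 * of the sorted dead-tuple array are rejected before paying for a bsearch.
 */
static inline bool
tid_is_dead(ItemPointer itemptr, const int64 *deadtids, int ndeadtids)
{
	int64		key = encode_tid(itemptr);

	if (ndeadtids == 0 || key < deadtids[0] || key > deadtids[ndeadtids - 1])
		return false;			/* cheap rejection, no search needed */

	return bsearch(&key, deadtids, ndeadtids, sizeof(int64),
				   cmp_int64) != NULL;
}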


RE: Stronger safeguard for archive recovery not to miss data

2021-01-21 Thread osumi.takami...@fujitsu.com
Hi, Laurenz


On Thursday, January 21, 2021 9:51 PM Laurenz Albe  
wrote:
> On Thu, 2021-01-21 at 11:49 +, osumi.takami...@fujitsu.com wrote:
> > Adding a condition to check if "recovery_allow_data_corruption" is
> > 'on' around the end of
> > CheckRequiredParameterValues() sounds safer for me too, although
> > implementing a new GUC parameter sounds bigger than what I expected at
> first.
> > The default of the value should be 'off' to protect users from getting the
> corrupted server.
> > Does everyone agree with this direction ?
> 
> I'd say that adding such a GUC is material for another patch, if we want it 
> at all.
OK. You meant a separate patch.

> I think it is very unlikely that people will switch from "wal_level=replica" 
> to
> "minimal" and back very soon afterwards and also try to recover past such a
> switch, which probably explains why nobody has complained about data
> corruption generated that way.  To get the server to start with
> "wal_level=minimal", you must set "archive_mode=off" and
> "max_wal_senders=0", and few people will do that and still expect recovery to
> work.
Yeah, the possibility is low of course.

> My vote is that we should not have a GUC for such an unlikely event, and that
> stopping recovery is good enough.
OK. IIUC, my current patch for this fix doesn't need to be changed or withdrawn.
Thank you for your explanation.


Best Regards,
Takamichi Osumi


Re: patch: reduce overhead of execution of CALL statement in no atomic mode from PL/pgSQL

2021-01-21 Thread Pavel Stehule
Hi

pá 15. 1. 2021 v 22:46 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > [ plpgsql-using-local-resowner-for-call-plans-20200108.patch ]
>
> I took a quick look through this patch, just reading it without
> any testing.  A few thoughts:
>
> * Instead of adding an argument to GetCachedPlan and ReleaseCachedPlan,
> I think it'd be better to *replace* the useResOwner bool with
> a ResourceOwner pointer, with the obvious semantics "do nothing if
> it's NULL".  Otherwise you have to explain what it means to pass NULL
> with useResOwner = true.  In any case, changing the APIs of these
> functions without updating their header comments is not okay.
>

done


> * I'm not really happy with adding yet another nonorthogonal variant
> of SPI_execute_plan.  Maybe better to do something like I did with
> SPI_prepare_extended() in commit 844fe9f15, and create a struct with
> all the inessential parameters so that we can make future API extensions
> without inventing whole new functions.  Remember also that new SPI
> functions have to be documented in spi.sgml.
>

done


> * Do we really need a PG_TRY in exec_toplevel_block?  Not to mention
> creating and destroying a ResOwner?  That seems pretty expensive, and it
> should be unnecessary for ordinary plpgsql functions.  (I'm also unhappy
> that you've utterly falsified that function's comment without doing
> anything to update it.)  This is really the part that needs more
> work.  I'm not sure that you can sell a speedup of CALL operations
> if the penalty is to slow down every other plpgsql function.
>

I rewrote this part - now there is no new PG_TRY. local_resowner is created
only when the routine is executed in non-atomic mode.


> * The part of the patch around exec_stmt_call is just about unreadable,
> mainly because git diff seems to think that exec_stmt_call is being
> changed into make_callstmt_target.  Maybe it'd be less messy if you
> put make_callstmt_target after exec_stmt_call.
>

done


> * Looks like an extra exec_prepare_plan() call snuck into
> exec_assign_expr()?
>

fixed

I did performance tests, and now the slowdown in the worst case is lower
(3-5%) and applies only to execution in non-atomic mode. The performance of atomic
mode is the same.

Regards

Pavel




> regards, tom lane
>
diff --git a/doc/src/sgml/spi.sgml b/doc/src/sgml/spi.sgml
index f5e0a35da0..8cf66f8c7b 100644
--- a/doc/src/sgml/spi.sgml
+++ b/doc/src/sgml/spi.sgml
@@ -1722,6 +1722,138 @@ int SPI_execute_plan(SPIPlanPtr plan, Datum * 
 
 
 
+
+ SPI_execute_plan_extended
+
+ 
+  SPI_execute_plan_extended
+  3
+ 
+
+ 
+  SPI_execute_plan_extended
+  execute a statement prepared by SPI_prepare
+ 
+
+ 
+
+int SPI_execute_plan_extended(SPIPlanPtr plan,
+  ParamListInfo params,
+  bool read_only,
+  long count,
+  const SPIExecuteOptions * options)
+
+ 
+
+ 
+  Description
+
+  
+   SPI_execute_plan_extended executes a statement
+   prepared by SPI_prepare.
+   This function is equivalent to SPI_execute_plan_with_paramlist,
+   but allows passing additional options.
+  
+ 
+
+ 
+  Arguments
+
+  
+   
+SPIPlanPtr plan
+
+ 
+  prepared statement (returned by SPI_prepare)
+ 
+
+   
+
+   
+ParamListInfo params
+
+ 
+  data structure containing parameter types and values; NULL if none
+ 
+
+   
+
+   
+bool read_only
+
+ true for read-only execution
+
+   
+
+   
+long count
+
+ 
+  maximum number of rows to return,
+  or 0 for no limit
+ 
+
+   
+
+   
+const SPIExecuteOptions * options
+
+ 
+  struct containing optional arguments
+ 
+
+   
+  
+
+  
+   Callers should always zero out the entire options
+   struct, then fill whichever fields they want to set.  This ensures forward
+   compatibility of code, since any fields that are added to the struct in
+   future will be defined to behave backwards-compatibly if they are zero.
+   The currently available options fields are:
+  
+
+  
+   
+DestReceiver * dest
+
+ 
+  Query target
+ 
+
+   
+
+   
+ResourceOwner owner
+
+ 
+  The resource owner that will be used for plan execution. Currently it is
+  used only as the owner of the cached query plan. This resource owner should
+  have a longer life cycle than the executed command. The current resource
+  owner is used when this option is NULL.
+ 
+
+   
+  
+ 
+
+ 
+  Return Value
+
+  
+   The return value is the same as for SPI_execute_plan.
+  
+
+  
+   SPI_processed and
+   SPI_tuptable are set as in
+   SPI_execute_plan if successful.
+  
+ 
+
+
+
+
 
  SPI_execute_plan_with_paramlist
 
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 653ef8e41a..3de2932a03 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -230,7 +230,7 @@ 

Re: Stronger safeguard for archive recovery not to miss data

2021-01-21 Thread Laurenz Albe
On Thu, 2021-01-21 at 11:49 +, osumi.takami...@fujitsu.com wrote:
> Adding a condition to check if "recovery_allow_data_corruption" is 'on' 
> around the end of
> CheckRequiredParameterValues() sounds safer for me too, although
> implementing a new GUC parameter sounds bigger than what I expected at first.
> The default of the value should be 'off' to protect users from getting the 
> corrupted server.
> Does everyone agree with this direction ?

I'd say that adding such a GUC is material for another patch, if we want it at 
all.

I think it is very unlikely that people will switch from "wal_level=replica" to
"minimal" and back very soon afterwards and also try to recover past such
a switch, which probably explains why nobody has complained about data 
corruption
generated that way.  To get the server to start with "wal_level=minimal", you 
must
set "archive_mode=off" and "max_wal_senders=0", and few people will do that and
still expect recovery to work.

My vote is that we should not have a GUC for such an unlikely event, and that
stopping recovery is good enough.

Yours,
Laurenz Albe





Re: Is Recovery actually paused?

2021-01-21 Thread Yugo NAGATA
On Tue, 19 Jan 2021 21:32:31 +0530
Dilip Kumar  wrote:

> On Tue, Jan 19, 2021 at 8:34 AM Dilip Kumar  wrote:
> >
> > On Tue, 19 Jan 2021 at 8:12 AM, Yugo NAGATA  wrote:
> >>
> >> On Sun, 17 Jan 2021 11:33:52 +0530
> >> Dilip Kumar  wrote:
> >>
> >> > On Thu, Jan 14, 2021 at 6:49 PM Yugo NAGATA  wrote:
> >> > >
> >> > > On Wed, 13 Jan 2021 17:49:43 +0530
> >> > > Dilip Kumar  wrote:
> >> > >
> >> > > > On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  
> >> > > > wrote:
> >> > > > >
> >> > > > > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > On Thu, 10 Dec 2020 11:25:23 +0530
> >> > > > > > Dilip Kumar  wrote:
> >> > > > > >
> >> > > > > > > > > However, I wonder users don't expect 
> >> > > > > > > > > pg_is_wal_replay_paused to wait.
> >> > > > > > > > > Especially, if max_standby_streaming_delay is -1, this 
> >> > > > > > > > > will be blocked forever,
> >> > > > > > > > > although this setting may not be usual. In addition, some 
> >> > > > > > > > > users may set
> >> > > > > > > > > recovery_min_apply_delay for a large.  If such users call 
> >> > > > > > > > > pg_is_wal_replay_paused,
> >> > > > > > > > > it could wait for a long time.
> >> > > > > > > > >
> >> > > > > > > > > At least, I think we need some descriptions on document to 
> >> > > > > > > > > explain
> >> > > > > > > > > pg_is_wal_replay_paused could wait while a time.
> >> > > > > > > >
> >> > > > > > > > Ok
> >> > > > > > >
> >> > > > > > > Fixed this, added some comments in .sgml as well as in 
> >> > > > > > > function header
> >> > > > > >
> >> > > > > > Thank you for fixing this.
> >> > > > > >
> >> > > > > > Also, is it better to fix the description of pg_wal_replay_pause 
> >> > > > > > from
> >> > > > > > "Pauses recovery." to "Request to pause recovery." in according 
> >> > > > > > with
> >> > > > > > pg_is_wal_replay_paused?
> >> > > > >
> >> > > > > Okay
> >> > > > >
> >> > > > > >
> >> > > > > > > > > Also, how about adding a new boolean argument to 
> >> > > > > > > > > pg_is_wal_replay_paused to
> >> > > > > > > > > control whether this waits for recovery to get paused or 
> >> > > > > > > > > not? By setting its
> >> > > > > > > > > default value to true or false, users can use the old 
> >> > > > > > > > > format for calling this
> >> > > > > > > > > and the backward compatibility can be maintained.
> >> > > > > > > >
> >> > > > > > > > So basically, if the wait_recovery_pause flag is false then 
> >> > > > > > > > we will
> >> > > > > > > > immediately return true if the pause is requested?  I agree 
> >> > > > > > > > that it is
> >> > > > > > > > good to have an API to know whether the recovery pause is 
> >> > > > > > > > requested or
> >> > > > > > > > not but I am not sure is it good idea to make this API serve 
> >> > > > > > > > both the
> >> > > > > > > > purpose?  Anyone else have any thoughts on this?
> >> > > > > > > >
> >> > > > > >
> >> > > > > > I think the current pg_is_wal_replay_paused() already has 
> >> > > > > > another purpose;
> >> > > > > > this waits recovery to actually get paused. If we want to limit 
> >> > > > > > this API's
> >> > > > > > purpose only to return the pause state, it seems better to fix 
> >> > > > > > this to return
> >> > > > > > the actual state at the cost of lacking the backward 
> >> > > > > > compatibility. If we want
> >> > > > > > to know whether pause is requested, we may add a new API like
> >> > > > > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait 
> >> > > > > > recovery to actually
> >> > > > > > get paused, we may add an option to pg_wal_replay_pause() for 
> >> > > > > > this purpose.
> >> > > > > >
> >> > > > > > However, this might be a bikeshedding. If anyone don't care that
> >> > > > > > pg_is_wal_replay_paused() can make user wait for a long time, I 
> >> > > > > > don't care either.
> >> > > > >
> >> > > > > I don't think that it will be blocked ever, because
> >> > > > > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> >> > > > > recovery process will not be stuck on waiting for the WAL.
> >> > >
> >> > > Yes, there is no stuck on waiting for the WAL. However, it can be 
> >> > > stuck during resolving
> >> > > a recovery conflict. The process could wait for 
> >> > > max_standby_streaming_delay or
> >> > > max_standby_archive_delay at most before recovery get completely 
> >> > > paused.
> >> >
> >> > Okay, I agree that it is possible so for handling this we have a
> >> > couple of options
> >> > 1. pg_is_wal_replay_paused(), interface will wait for recovery to
> >> > actually get paused, but user have an option to cancel that.  So I
> >> > agree that there is currently no option to just know that recovery
> >> > pause is requested without waiting for its actually get paused if it
> >> > is requested.  So one option is we can provide an another interface as
> >> > you mentioned pg_is_wal_replay_paluse_requeseted(), which can just
> >> > return the request status.  I 

Re: Additional Chapter for Tutorial - arch-dev.sgml

2021-01-21 Thread Jürgen Purtz

On 18.01.21 15:13, Heikki Linnakangas wrote:

On 20/11/2020 23:52, Erik Rijkers wrote:

(smallish) Changes to arch-dev.sgml


This looks good to me. One little complaint:


@@ -125,7 +122,7 @@
 use a supervisor process (also
 master process) that spawns a new
 server process every time a connection is requested. This 
supervisor

-    process is called postgres and listens at a
+    process is called postgres (formerly 'postmaster') and listens at a

 specified TCP/IP port for incoming connections. Whenever a request
 for a connection is detected the postgres
 process spawns a new server process. The server tasks


I believe we still call it the postmaster process. We renamed the 
binary a long time ago (commit 5266f221a2), and the above text was 
changed as part of that commit. I think that was a mistake, and this 
should say simply:


... This supervisor process is called postmaster 
and ...


like it did before we renamed the binary.

Barring objections, I'll commit this with that change (as attached).

- Heikki


Some additional changes in 51.2:

 - smaller number of different terms

 - aligning with Glossary

 - active voice instead of passive voice

 - commas

---

J. Purtz


diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
index ade0ad97d8..7ced927c9d 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -7,7 +7,7 @@
Author

 This chapter originated as part of
-, Stefan Simkovics'
+ Stefan Simkovics'
 Master's Thesis prepared at Vienna University of Technology under the direction
 of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

@@ -17,10 +17,7 @@
This chapter gives an overview of the internal structure of the
backend of PostgreSQL.  After having
read the following sections you should have an idea of how a query
-   is processed. This chapter does not aim to provide a detailed
-   description of the internal operation of
-   PostgreSQL, as such a document would be
-   very extensive. Rather, this chapter is intended to help the reader
+   is processed.  This chapter is intended to help the reader
understand the general sequence of operations that occur within the
backend from the point at which a query is received, to the point
at which the results are returned to the client.
@@ -30,8 +27,8 @@
The Path of a Query
 

-Here we give a short overview of the stages a query has to pass in
-order to obtain a result.
+Here we give a short overview of the stages a query has to pass 
+to obtain a result.

 

@@ -117,21 +114,28 @@
How Connections Are Established
 

-PostgreSQL is implemented using a
-simple process per user client/server model.  In this model
-there is one client process connected to
-exactly one server process.  As we do not
+PostgreSQL implements a
+process per user client/server model.
+In this model, every
+client process
+connects to exactly one
+backend process.
+As we do not
 know ahead of time how many connections will be made, we have to
-use a supervisor process (also
-master process) that spawns a new
-server process every time a connection is requested. This supervisor
-process is called postgres and listens at a
-specified TCP/IP port for incoming connections. Whenever a request
-for a connection is detected the postgres
-process spawns a new server process. The server tasks
-communicate with each other using semaphores and
-shared memory to ensure data integrity
-throughout concurrent data access.
+use a supervisor process
+that spawns a new
+backend process every time a connection is requested. This supervisor
+process is called
+postmaster 
+and listens at a
+specified TCP/IP port for incoming connections. Whenever it detects
+a request for a connection, it spawns a new backend process.
+Those backend processes communicate with each other and with other
+processes of the
+instance
+using semaphores and
+shared memory
+to ensure data integrity throughout concurrent data access.

 

@@ -144,11 +148,11 @@

 

-Once a connection is established the client process can send a query
-to the backend (server). The query is transmitted using plain text,
-i.e., there is no parsing done in the frontend (client). The
-server parses the query, creates an execution plan,
-executes the plan and returns the retrieved rows to the client
+Once a connection is established, the client process can send a query
+to its backend process. The query is transmitted using plain text,
+i.e., there is no parsing done in the client. The backend 
+process parses the query, creates an execution plan,
+executes the plan, and returns the retrieved rows to the client
 by transmitting them over the established connection.

   
@@ 

Re: Columns correlation and adaptive query optimization

2021-01-21 Thread Yugo NAGATA
Hello,

On Thu, 26 Mar 2020 18:49:51 +0300
Konstantin Knizhnik  wrote:

> Attached please find new version of the patch with more comments and 
> descriptions added.

Adaptive query optimization is a very interesting feature for me, so I looked
into this patch.  Here are some comments and questions.

(1)
This patch needs rebase because clauselist_selectivity was modified to improve
estimation of OR clauses.

(2)
If I understand correctly, your proposal consists of the following two features.

1. Add a feature to auto_explain that creates an extended statistic 
automatically
if the error in the estimated row count is large.

2. Improve row count estimation of join results by considering functional
dependencies between vars in the join qual if the qual has more than one 
clause,
and also functional dependencies between a var in the join qual and vars in 
quals
of the inner/outer relation.

As you said, these two parts are independent of each other, so one feature will 
work
even if we don't assume the other.  I wonder if it would be better to split the 
patch
again and register the parts in the commitfest separately.

(3)
+   DefineCustomBoolVariable("auto_explain.suggest_only",
+"Do not create statistic but just record in WAL suggested create statistics statement.",
+NULL,
+&auto_explain_suggest_on

To convey that this parameter is related to add_statistics_threshold, it seems
better to me to use a more closely related name like add_statistics_suggest_only.

Also, additional documentation for the new parameters is required.

(4)
+   /*
+* Prevent concurrent access to extended statistic table
+*/
+   stat_rel = table_open(StatisticExtRelationId, 
AccessExclusiveLock);
+   slot = table_slot_create(stat_rel, NULL);
+   scan = table_beginscan_catalog(stat_rel, 2, entry);
(snip)
+   table_close(stat_rel, AccessExclusiveLock);
+   }

When I tested the auto_explain part, I got the following WARNING.

 WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, 
flags=0x8300, refcount=1 2)
 WARNING:  buffer refcount leak: [097] (rel=base/12879/3381, blockNum=0, 
flags=0x8300, refcount=1 1)
 WARNING:  relcache reference leak: relation "pg_statistic_ext" not closed
 WARNING:  TupleDesc reference leak: TupleDesc 0x7fa439266338 (12029,-1) still 
referenced
 WARNING:  Snapshot reference leak: Snapshot 0x55c332c10418 still referenced

To suppress this, I think we need table_endscan(scan) and
ExecDropSingleTupleTableSlot(slot) before finishing this function
(a cleanup sketch follows at the end of these comments).

(6)
+   elog(NOTICE, "Auto_explain suggestion: 
CREATE STATISTICS %s %s FROM %s", stat_name, create_stat_stmt, rel_name);

We should use ereport instead of elog for log messages.

(7)
+   double dep = 
find_var_dependency(root, innerRelid, var, clauses_attnums);
+   if (dep != 0.0)
+   {
+   s1 *= dep + (1 - dep) * 
s2;
+   continue;
+   }

I found the following comment of clauselist_apply_dependencies():

  * we actually combine selectivities using the formula
  *
  *  P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)

so, shouldn't the same formula be used in this patch? That is, 

   s1 *= dep + (1-dep) * s2   (if s1 <= s2)
  s1 *= dep * (s2/s1) + (1-dep) * s2   (otherwise) .

(8)
+/*
+ * Try to find dependency between variables.
+ * var: variables whose dependencies are considered
+ * join_vars: list of variables used in other clauses
+ * This function returns the strongest dependency and some subset of variables 
from the same relation
+ * or 0.0 if no dependency was found.
+ */
+static double
+var_depends_on(PlannerInfo *root, Var* var, List* clause_vars)
+{

The comment mentions join_vars but the actual argument name is clause_vars,
so it needs unification.

(9)
Currently, it only considers functional dependency statistics. Can we also
consider multivariate MCV lists, and would that be useful?

(10)
To achieve adaptive query optimization (AQO) in PostgreSQL, this patch proposes
to use auto_explain for getting feedback from actual results. So, could 
auto_explain
be an infrastructure for AQO in the future? Or do you have any plan or idea to build
a built-in infrastructure for AQO?
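
As mentioned under point (4), here is a sketch of the cleanup order. Variable names mirror the quoted snippet from the patch; the scan-key setup is elided and the wrapper function is hypothetical.

#include "postgres.h"
#include "access/skey.h"
#include "access/table.h"
#include "access/tableam.h"
#include "catalog/pg_statistic_ext.h"
#include "executor/tuptable.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/* Sketch: every resource acquired for the catalog scan is released again,
 * in reverse order of acquisition, before the function returns. */
static void
scan_statistic_ext(ScanKeyData *entry)
{
	Relation	stat_rel;
	TupleTableSlot *slot;
	TableScanDesc scan;

	stat_rel = table_open(StatisticExtRelationId, AccessExclusiveLock);
	slot = table_slot_create(stat_rel, NULL);
	scan = table_beginscan_catalog(stat_rel, 2, entry);

	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
	{
		/* ... inspect the pg_statistic_ext tuple ... */
	}

	table_endscan(scan);				/* releases the scan and its snapshot */
	ExecDropSingleTupleTableSlot(slot);	/* releases the slot and its TupleDesc */
	table_close(stat_rel, AccessExclusiveLock);
}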

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-21 Thread Dmitry Dolgov
> On Wed, Jan 20, 2021 at 11:37:32PM -0500, Dian M Fay wrote:
> On Wed Jan 20, 2021 at 2:08 PM EST, Dmitry Dolgov wrote:
> > > On Wed, Jan 20, 2021 at 11:34:16AM -0500, Dian M Fay wrote:
> > > > Thanks, I need to remember to not skipp doc building for testing process
> > > > even for such small changes. Hope now I didn't forget anything.
> > > >
> > > > > On Wed, Jan 20, 2021 at 09:58:43AM -0500, Dian M Fay wrote:
> > > >
> > > > > Here's a full editing pass on the documentation, with v45 and Pavel's
> > > > > doc-whitespaces-fix.patch applied. I also corrected a typo in one of 
> > > > > the
> > > > > added hints.
> > > >
> > > > Great! I've applied almost all of it, except:
> > > >
> > > > + A jsonb value will accept assignments to nonexistent
> > > > subscript
> > > > + paths as long as the nonexistent elements being traversed are all
> > > > arrays.
> > > >
> > > > Maybe I've misunderstood the intention, but there is no requirement
> > > > about arrays for creating such an empty path. I've formulated it as:
> > > >
> > > > + A jsonb value will accept assignments to nonexistent
> > > > subscript
> > > > + paths as long as the last existing path key is an object or an array.
> > >
> > > My intention there was to highlight the difference between:
> > >
> > > * SET obj['a']['b']['c'] = '"newvalue"'
> > > * SET arr[0][0][3] = '"newvalue"'
> > >
> > > obj has to conform to {"a": {"b": {...}}} in order to receive the
> > > assignment of the nested c. If it doesn't, that's the error case we
> > > discussed earlier. But arr can be null, [], and so on, and any missing
> > > structure [[[null, null, null, "newvalue"]]] will be created.
> >
> > If arr is 'null', or any other scalar value, such subscripting will work
> > only one level deep because they represented internally as an array of
> > one element. If arr is '[]' the path will comply by definition. So it's
> > essentially the same as for objects with no particular difference. If
> > such a quirk about scalars being treated like arrays is bothering, we
> > could also bend it in this case as well (see the attached version).
>
> I missed that distinction in the original UPDATE paragraph too. Here's
> another revision based on v48.

Looks good, I've applied it, thanks.
From a4037c651a0cfd2448f38b6c8c932b5a09136b0a Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Fri, 18 Dec 2020 17:19:51 +0100
Subject: [PATCH v49 1/3] Subscripting for jsonb

Subscripting implementation for jsonb. It does not support slices, does
not have a limit on the number of subscripts, and for assignment expects the
replacement value to be of jsonb type. There is also one functional
difference between assignment via subscripting and jsonb_set: when the
original jsonb container is NULL, subscripting replaces it with an empty
jsonb and proceeds with the assignment.

For the sake of code reuse, some parts of the jsonb functionality were
rearranged to allow using the same functions for jsonb_set and the
subscripting assignment operation.

The original idea belongs to Oleg Bartunov.

Reviewed-by: Tom Lane, Arthur Zakirov, Pavel Stehule, Dian M Fay
---
 doc/src/sgml/json.sgml  |  51 
 src/backend/utils/adt/Makefile  |   1 +
 src/backend/utils/adt/jsonb_util.c  |  76 -
 src/backend/utils/adt/jsonbsubs.c   | 413 
 src/backend/utils/adt/jsonfuncs.c   | 180 ++--
 src/include/catalog/pg_proc.dat |   4 +
 src/include/catalog/pg_type.dat |   3 +-
 src/include/utils/jsonb.h   |   6 +-
 src/test/regress/expected/jsonb.out | 272 +-
 src/test/regress/sql/jsonb.sql  |  84 +-
 10 files changed, 985 insertions(+), 105 deletions(-)
 create mode 100644 src/backend/utils/adt/jsonbsubs.c

diff --git a/doc/src/sgml/json.sgml b/doc/src/sgml/json.sgml
index 5b9a5557a4..3ace5e444b 100644
--- a/doc/src/sgml/json.sgml
+++ b/doc/src/sgml/json.sgml
@@ -602,6 +602,57 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu
   
  
 
+ 
+  jsonb Subscripting
+  
+   The jsonb data type supports array-style subscripting expressions
+   to extract and modify elements. Nested values can be indicated by chaining
+   subscripting expressions, following the same rules as the path
+   argument in the jsonb_set function. If a jsonb
+   value is an array, numeric subscripts start at zero, and negative integers count
+   backwards from the last element of the array. Slice expressions are not supported.
+   The result of a subscripting expression is always of the jsonb data type.
+  
+
+  
+   An example of subscripting syntax:
+
+
+-- Extract object value by key
+SELECT ('{"a": 1}'::jsonb)['a'];
+
+-- Extract nested object value by key path
+SELECT ('{"a": {"b": {"c": 1}}}'::jsonb)['a']['b']['c'];
+
+-- Extract array element by index
+SELECT ('[1, "2", null]'::jsonb)[1];
+
+-- Update object value by key. Note the quotes around '1': the assigned
+-- value must be of the jsonb type as well
+UPDATE table_name SET 

Re: Is Recovery actually paused?

2021-01-21 Thread Dilip Kumar
On Thu, Jan 21, 2021 at 3:29 PM Bharath Rupireddy
 wrote:

Thanks for reviewing, Bharath.

> On Tue, Jan 19, 2021 at 9:32 PM Dilip Kumar  wrote:
> > In the last patch there were some local changes which I did not add to
> > the patch and it was giving compilation warning so fixed that along
> > with that I have addressed your this comment as well.
>
> Thanks for the patch. I took a look at the v5 patch, below are some
> comments. Please ignore if I'm repeating any of the comments discussed
> upthread.
>
> [1] Can we also have a wait version for pg_wal_replay_pause that waits
> until recovery is actually paused right after setting the recovery
> state to RECOVERY_PAUSE_REQUESTED? Something like this -
> pg_wal_replay_pause_and_wait(wait boolean, wait_seconds integer
> DEFAULT 60) returns boolean. It waits until either default or provided
> wait_seconds and returns true if the recovery is paused within that
> wait_seconds otherwise false. If wait_seconds is 0 or -1, then it
> waits until recovery is paused and returns true. One advantage of this
> function is that users don't need to call pg_is_wal_replay_paused().
> IMHO, the job of ensuring whether or not the recovery is actually
> paused, is better done by the one who requests
> it(pg_wal_replay_pause/pg_wal_replay_pause_and_wait).

I don't think we need wait/nowait versions for all the APIs.  IMHO it
would be enough for the user to know whether the recovery is actually
paused or not,
and for that we are changing pg_is_wal_replay_paused to wait for the
pause.  However, in the next version of pg_is_wal_replay_paused I will
provide a flag so that the user can decide whether to wait for the
pause or just get the request status.

> [2] Is it intentional that RecoveryPauseRequested() returns true even
> if XLogCtl->recoveryPauseState is either RECOVERY_PAUSE_REQUESTED or
> RECOVERY_PAUSED?

Yes this is intended

> [3] Can we change IsRecoveryPaused() instead of RecoveryIsPaused() and
> IsRecoveryPauseRequested() instead of RecoveryPauseRequested()? How
> about having inline(because they have one line of code) functions like
> IsRecoveryPauseRequested(), IsRecoveryPaused() and
> IsRecoveryInProgress(), returning true when RECOVERY_PAUSE_REQUESTED,
> RECOVERY_PAUSED and RECOVERY_IN_PROGRESS respectively?

Yeah, we can do that, I am not sure whether we need
IsRecoveryInProgress function though.

> [4] Can we have at least one line of comments for each of the new
> functions, I know the function names mean everything? And also for
> existing SetRecoveryPause() and GetRecoveryPauseState()?

Will do that

> [5] Typo, it's "every time" ---> + * this everytime.

Ok

> [6] Do we reach PG_RETURN_BOOL(true); at the end of
> pg_is_wal_replay_paused()? If not, is it there for satisfying the
> compiler?

Yes

> [7] In pg_is_wal_replay_paused(), do we need if
> (!RecoveryPauseRequested()) inside the for (;;)? If yes, can we add
> comments about why we need it there?

Yes, we need it if replay resumed during the loop,  I will add the comments.

> [8] Isn't it good to have
> pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE) and
> pgstat_report_wait_end(), just before for (;;) and at the end
> respectively? This will help users in knowing what they are waiting
> for? Alternatively we can issue notice/warnings/write to server log
> while we are waiting in for (;;) for the recovery to get paused?

I think we can do that, let me think about this and get back to you.

> [9] In pg_is_wal_replay_paused(), is 10 msec sleep time chosen
> randomly or based on some analysis that the time it takes to get to
> recovery paused state from request state or some other?

I don't think we can predict when recovery will actually get
paused.  In pg_wal_replay_pause() we are sending a wakeup signal to
all the places where we wait for WAL and checking right after that,
but it depends upon other configuration parameters like
max_standby_streaming_delay.

> [10]  errhint("The standby was promoted while waiting for recovery to
> be paused."))); Can we know whether standby is actually promoted and
> throw this error? Because the error "recovery is not in progress" here
> is being thrown by just relying on if (!RecoveryInProgress()). IIUC,
> using pg_promote

Because before checking this it has already checked
RecoveryPauseRequested(), and if that is true then
recovery was in progress at some time and now it is not
anymore, which can happen due to promotion.  But I am fine with
reverting to the old error saying it cannot execute if recovery is not in
progress.

> [11] Can we try to add tests for these functions in TAP? Currently, we
> don't have any tests for pg_is_wal_replay_paused, pg_wal_replay_resume
> or pg_wal_replay_pause, but we have tests for pg_promote in timeline
> switch.

I will work on this.

> [12] Isn't it better to change RecoveryPauseState enum
> RECOVERY_IN_PROGRESS value to start from 1 instead of 0? Because when
> XLogCtl shared memory is initialized, I 

RE: Stronger safeguard for archive recovery not to miss data

2021-01-21 Thread osumi.takami...@fujitsu.com
Hi


On Wed, Jan 20, 2021 9:03 PM Laurenz Albe  wrote:
> 
> On Wed, 2021-01-20 at 13:10 +0900, Fujii Masao wrote:
> > +errhint("Run recovery again from a
> > + new base backup taken after setting wal_level higher than
> > + minimal")));
> >
> > Isn't it impossible to do this in normal archive recovery case? In
> > that case, since the server crashed and the database got corrupted,
> > probably we cannot take a new base backup.
> >
> > Originally even when users accidentally set wal_level to minimal, they
> > could start the server from the backup by disabling hot_standby and salvage
> the data.
> > But with the patch, we lost the way to do that. Right? I was wondering
> > that WARNING was used intentionally there for that case.
OK. I understand that this WARNING is necessary to give the user a chance
to start up the server again and salvage their data,
when hot_standby=off and the server goes through a period of wal_level=minimal
during an archive recovery.


> I would argue that if you set your "wal_level" to minimal, do some work, 
> reset it
> to "replica" and recover past that, two things could happen:
> 
> 1. Since "archive_mode" was off, you are missing some WAL segments and
>cannot recover past that point anyway.
> 
> 2. You are missing some relevant WAL records, and your recovered
>database is corrupted.  This is very likely, because you probably
>switched to "minimal" with the intention to generate less WAL.
> 
> Now I see your point that a corrupted database may be better than no database
> at all, but wouldn't you agree that a warning in the log (that nobody reads) 
> is
> too little information?
> 
> Normally we don't take such a relaxed attitude towards data corruption.
Yeah, we thought we needed stronger protection for that reason.


> Perhaps there could be a GUC "recovery_allow_data_corruption" to override this
> check, but I'd say that a warning is too little.
> 
> > if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> > ereport(ERROR,
> > (errmsg("hot standby is not
> possible because wal_level was not set to \"replica\" or higher on the primary
> server"),
> >  errhint("Either set wal_level
> > to \"replica\" on the primary, or turn off hot_standby here.")));
> >
> > With the patch, we never reach the above code?
> 
> Right, that would have to go.  I didn't notice that.
Adding a condition to check if "recovery_allow_data_corruption" is 'on' around 
the end of
CheckRequiredParameterValues() sounds safer for me too, although
implementing a new GUC parameter sounds bigger than what I expected at first.
The default of the value should be 'off' to protect users from getting the 
corrupted server.
Does everyone agree with this direction ?


Best Regards,
Takamichi Osumi


Re: PoC/WIP: Extended statistics on expressions

2021-01-21 Thread Dean Rasheed
On Tue, 19 Jan 2021 at 01:57, Tomas Vondra
 wrote:
>
> > A slightly bigger issue that I don't like is the way it assigns
> > attribute numbers for expressions starting from
> > MaxHeapAttributeNumber+1, so the first expression has an attnum of
> > 1601. That leads to pretty inefficient use of Bitmapsets, since most
> > tables only contain a handful of columns, and there's a large block of
> > unused space in the middle the Bitmapset.
> >
> > An alternative approach might be to use regular attnums for columns
> > and use negative indexes -1, -2, -3, ... for expressions in the stored
> > stats. Then when storing and retrieving attnums from Bitmapsets, it
> > could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
> > in the Bitmapsets, since there can't be more than that many
> > expressions (just like other code stores system attributes using
> > FirstLowInvalidHeapAttributeNumber).
>
> Well, I tried this but unfortunately it's not that simple. We still need
> to build the bitmaps, so I don't think add_expression_to_attributes can
> be just removed. I mean, we need to do the offsetting somewhere, even if
> we change how we do it.

Hmm, I was imagining that the offsetting would happen in each place
that adds or retrieves an attnum from a Bitmapset, much like a lot of
other code does for system attributes, and then you'd know you had an
expression if the resulting attnum was negative.

I was also thinking that it would be these negative attnums that would
be stored in the stats data, so instead of something like "1, 2 =>
1601", it would be "1, 2 => -1", so in some sense "-1" would be the
"real" attnum associated with the expression.

> But the main issue is that in some cases the number of expressions is
> not really limited by STATS_MAX_DIMENSIONS - for example when applying
> functional dependencies, we "merge" multiple statistics, so we may end
> up with more expressions. So we can't just use STATS_MAX_DIMENSIONS.

Ah, I see. I hadn't really fully understood what that code was doing.

ISTM though that this is really an internal problem to the
dependencies code, in that these "merged" Bitmapsets containing attrs
from multiple different stats objects do not (and must not) ever go
outside that local code that uses them. So that code would be free to
use a different offset for its own purposes -- e..g., collect all the
distinct expressions across all the stats objects and then offset by
the number of distinct expressions.

> Also, if we offset regular attnums by STATS_MAX_DIMENSIONS, that inverts
> the order of processing (so far we've assumed expressions are after
> regular attnums). So the changes are more extensive - I tried doing that
> anyway, and I'm still struggling with crashes and regression failures.
> Of course, that doesn't mean we shouldn't do it, but it's far from
> mechanical. (Some of that is probably a sign this code needs a bit more
> work to polish.)

Interesting. What code assumes expressions come after attributes?
Ideally, I think it would be cleaner if no code assumed any particular
order, but I can believe that it might be convenient in some
circumstances.

> But I wonder if it'd be easier to just calculate the actual max attnum
> and then use it instead of MaxHeapAttributeNumber ...

Hmm, I'm not sure how that would work. There still needs to be an
attnum that gets stored in the database, and it has to continue to
work if the user adds columns to the table. That's why I was
advocating storing negative values, though I haven't actually tried it
to see what might go wrong.

Regards,
Dean




Re: Printing LSN made easy

2021-01-21 Thread Ashutosh Bapat
On Thu, Jan 21, 2021 at 3:53 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2021-01-20 08:50, Ashutosh Bapat wrote:
> > Thanks for looking into this. I would like to keep both the LSN_FORMAT
> > and LSN_FORMAT_ARGS but with a note that the first can not be used in
> > elog() or in messages which require localization. We have many other
> > places doing snprintf() and such stuff, which can use LSN_FORMAT. If we
> > do so, the functions to output string representation will not be needed
> > so they can be removed.
>
> Then you'd end up with half the code doing this and half the code doing
> that.  That doesn't sound very attractive.  It's not like "%X/%X" is
> hard to type.
>

:). That's true. I thought LSN_FORMAT would document the string
representation of an LSN. But anyway I am fine with this as well.


>
> --
> Peter Eisentraut
> 2ndQuadrant, an EDB company
> https://www.2ndquadrant.com/
>


-- 
--
Best Wishes,
Ashutosh


Re: Printing LSN made easy

2021-01-21 Thread Peter Eisentraut

On 2021-01-20 08:50, Ashutosh Bapat wrote:
Thanks for looking into this. I would like to keep both the LSN_FORMAT 
and LSN_FORMAT_ARGS but with a note that the first can not be used in 
elog() or in messages which require localization. We have many other 
places doing snprintf() and such stuff, which can use LSN_FORMAT. If we 
do so, the functions to output string representation will not be needed 
so they can be removed.


Then you'd end up with half the code doing this and half the code doing 
that.  That doesn't sound very attractive.  It's not like "%X/%X" is 
hard to type.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




Re: Single transaction in the tablesync worker?

2021-01-21 Thread Amit Kapila
On Tue, Jan 19, 2021 at 2:32 PM Peter Smith  wrote:
>
> Hi Amit.
>
> PSA the v17 patch for the Tablesync Solution1.
>

Thanks for the updated patch. Below are few comments:
1. Why are we changing the scope of PG_TRY in DropSubscription()?
Also, it might be better to keep the replication slot drop part as it
is.

2.
- *- Tablesync worker finishes the copy and sets table state to SYNCWAIT;
- * waits for state change.
+ *- Tablesync worker does initial table copy; there is a
FINISHEDCOPY state to
+ * indicate when the copy phase has completed, so if the worker crashes
+ * before reaching SYNCDONE the copy will not be re-attempted.

In the last line, shouldn't the state be FINISHEDCOPY instead of SYNCDONE?

3.
+void
+tablesync_cleanup_at_interrupt(void)
+{
+ bool drop_slot_needed;
+ char originname[NAMEDATALEN] = {0};
+ RepOriginId originid;
+ TimeLineID tli;
+ Oid subid = MySubscription->oid;
+ Oid relid = MyLogicalRepWorker->relid;
+
+ elog(DEBUG1,
+ "tablesync_cleanup_at_interrupt for relid = %d",
+ MyLogicalRepWorker->relid);

The function name and message make it sound like we drop the slot
and origin at any interrupt. Isn't it better to name it
tablesync_cleanup_at_shutdown()?

4.
+ drop_slot_needed =
+ wrconn != NULL &&
+ MyLogicalRepWorker->relstate != SUBREL_STATE_SYNCDONE &&
+ MyLogicalRepWorker->relstate != SUBREL_STATE_READY;
+
+ if (drop_slot_needed)
+ {
+ char syncslotname[NAMEDATALEN] = {0};
+ bool missing_ok = true; /* no ERROR if slot is missing. */

I think we can avoid using missing_ok and drop_slot_needed variables.

5. Can we drop the origin along with the slot in
process_syncing_tables_for_sync() instead of
process_syncing_tables_for_apply()? I think this is possible because
of the other changes you made in origin.c. Also, if possible, we can
try to use the same code to drop the slot and origin in
tablesync_cleanup_at_interrupt and process_syncing_tables_for_sync.

6.
+ if (MyLogicalRepWorker->relstate == SUBREL_STATE_FINISHEDCOPY)
+ {
+ /*
+ * The COPY phase was previously done, but tablesync then crashed/etc
+ * before it was able to finish normally.
+ */

There seems to be a typo (crashed/etc) in the above comment.

7.
+# check for occurrence of the expected error
+poll_output_until("replication slot \"$slotname\" already exists")
+or die "no error stop for the pre-existing origin";

In this test, isn't it better to check for datasync state like below?
004_sync.pl has some other similar test.
my $started_query = "SELECT srsubstate = 'd' FROM pg_subscription_rel;";
$node_subscriber->poll_query_until('postgres', $started_query)
  or die "Timed out while waiting for subscriber to start sync";

Is there a reason why we can't use the existing way to check for
failure in this case?
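
To make the suggestion in comment 4 concrete, the check could be folded
directly into the if, passing true for missing_ok inline. Something like
this (a sketch only; the drop routine shown is a stand-in for whatever the
patch actually calls):

if (wrconn != NULL &&
    MyLogicalRepWorker->relstate != SUBREL_STATE_SYNCDONE &&
    MyLogicalRepWorker->relstate != SUBREL_STATE_READY)
{
    char syncslotname[NAMEDATALEN] = {0};

    /* build the sync slot name as before ... */

    /* ... and drop it on the publisher, passing missing_ok = true inline */
    ReplicationSlotDropAtPubNode(wrconn, syncslotname, true);
}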

-- 
With Regards,
Amit Kapila.




Re: ResourceOwner refactoring

2021-01-21 Thread Heikki Linnakangas

On 21/01/2021 06:14, Michael Paquier wrote:

On Thu, Jan 21, 2021 at 12:11:37AM +0200, Heikki Linnakangas wrote:

Summary: In the worst scenario, the patched version is still 24% slower
than unpatched. But many other scenarios are now faster with the patch.


Is there a reason explaining the sudden drop for numsnaps within the
[10,60] range?  The gap looks deeper with a low numkeep.


It's the switch from array to hash table. With the patch, the array 
holds 8 entries. Without the patch, it's 64 entries. So you see a drop 
around those points. I added more data points in that range to get a 
better picture:


LIFO tests:
 numkeep | numsnaps | unpatched | patched | ratio
---------+----------+-----------+---------+-------
       0 |        1 |      32.8 |    32.9 |  1.00
       0 |        2 |      31.6 |    32.8 |  1.04
       0 |        4 |      32.7 |    32.0 |  0.98
       0 |        6 |      34.1 |    33.9 |  0.99
       0 |        8 |      35.1 |    32.4 |  0.92
       0 |       10 |      34.0 |    37.1 |  1.09
       0 |       15 |      33.1 |    35.9 |  1.08
       0 |       20 |      33.0 |    38.8 |  1.18
       0 |       25 |      32.9 |    42.3 |  1.29
       0 |       30 |      32.9 |    40.5 |  1.23
       0 |       35 |      33.1 |    39.9 |  1.21
       0 |       40 |      33.0 |    39.0 |  1.18
       0 |       45 |      35.3 |    41.1 |  1.16
       0 |       50 |      33.0 |    40.8 |  1.24
       0 |       55 |      32.8 |    40.6 |  1.24
       0 |       58 |      33.0 |    41.5 |  1.26
       0 |       60 |      33.1 |    41.6 |  1.26
       0 |       62 |      32.8 |    41.7 |  1.27
       0 |       64 |      46.8 |    40.9 |  0.87
       0 |       66 |      47.0 |    42.5 |  0.90
       0 |       68 |      47.1 |    41.8 |  0.89
       0 |       70 |      47.8 |    41.7 |  0.87
(22 rows)

FIFO tests:
 numkeep | numsnaps | unpatched | patched | ratio
---------+----------+-----------+---------+-------
       0 |        1 |      32.3 |    32.1 |  0.99
       0 |        2 |      33.4 |    31.6 |  0.95
       0 |        4 |      34.0 |    31.4 |  0.92
       0 |        6 |      35.4 |    33.2 |  0.94
       0 |        8 |      34.8 |    31.9 |  0.92
       0 |       10 |      35.4 |    40.2 |  1.14
       0 |       15 |      35.4 |    40.3 |  1.14
       0 |       20 |      35.6 |    43.8 |  1.23
       0 |       25 |      35.4 |    42.4 |  1.20
       0 |       30 |      36.0 |    43.3 |  1.20
       0 |       35 |      36.4 |    45.1 |  1.24
       0 |       40 |      36.9 |    46.6 |  1.26
       0 |       45 |      37.7 |    45.3 |  1.20
       0 |       50 |      37.2 |    43.9 |  1.18
       0 |       55 |      38.4 |    46.8 |  1.22
       0 |       58 |      37.6 |    45.0 |  1.20
       0 |       60 |      37.7 |    46.6 |  1.24
       0 |       62 |      38.4 |    46.5 |  1.21
       0 |       64 |      48.7 |    47.6 |  0.98
       0 |       66 |      48.2 |    45.8 |  0.95
       0 |       68 |      48.5 |    47.5 |  0.98
       0 |       70 |      48.4 |    47.3 |  0.98
(22 rows)

Let's recap the behavior:

Without patch
-

For each different kind of resource, there's an array that holds up to 
64 entries. In ResourceOwnerForget(), the array is scanned linearly. If 
the array fills up, we replace the array with a hash table. After 
switching, all operations use the hash table.


With patch
--

There is one array that holds up to 8 entries. It is shared by all 
resources. In ResourceOwnerForget(), the array is always scanned.


If the array fills up, all the entries are moved to a hash table, 
freeing up space in the array, and the new entry is added to the array. 
So the array is used together with the hash table, like an LRU cache of 
the most recently remembered entries.
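
In toy form, the scheme looks roughly like this (invented names, just to
illustrate the shape of the data structure, not the actual patch code):

#include <stdint.h>
#include <string.h>

#define ARRAY_SIZE      8
#define OVERFLOW_SIZE   1024    /* stands in for the hash table in this toy */

typedef struct ToyOwner
{
    uintptr_t   arr[ARRAY_SIZE];          /* most recently remembered entries */
    int         narr;
    uintptr_t   overflow[OVERFLOW_SIZE];  /* older entries; a hash table in the real thing */
    int         noverflow;
} ToyOwner;

static void
toy_remember(ToyOwner *o, uintptr_t value)
{
    if (o->narr == ARRAY_SIZE)
    {
        /* array full: spill everything to the "hash table" (no bounds check in this toy) */
        memcpy(o->overflow + o->noverflow, o->arr, sizeof(o->arr));
        o->noverflow += ARRAY_SIZE;
        o->narr = 0;
    }
    /* the new entry always goes into the small array, like an LRU front */
    o->arr[o->narr++] = value;
}

static int
toy_forget(ToyOwner *o, uintptr_t value)
{
    /* always scan the small array first: recently remembered entries live here */
    for (int i = 0; i < o->narr; i++)
    {
        if (o->arr[i] == value)
        {
            o->arr[i] = o->arr[--o->narr];
            return 1;
        }
    }
    /* fall back to the older entries; the real code does an O(1) hash lookup here */
    for (int i = 0; i < o->noverflow; i++)
    {
        if (o->overflow[i] == value)
        {
            o->overflow[i] = o->overflow[--o->noverflow];
            return 1;
        }
    }
    return 0;
}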



Why this change? I was afraid that now that all different resources 
share the same data structure, remembering e.g. a lot of locks at the 
beginning of a transaction would cause the switch to the hash table, 
making all subsequent remember/forget operations, like for buffer pins, 
slower. That kind of interference seems bad. By continuing to use the 
array for the recently-remembered entries, we avoid that problem. The 
[numkeep, numsnaps] = [65, 70] test is in that regime, and the patched 
version was significantly faster.


Because the array is now always scanned, I felt that it needs to be 
small, to avoid wasting much time scanning for entries that have already 
been moved to the hash table. That's why I made it just 8 entries.


Perhaps 8 entries is too small, after all? Linearly scanning an array is 
very fast. To test that, I bumped up RESOWNER_ARRAY_SIZE to 64, and ran 
the test again:


 numkeep | numsnaps | lifo_time_ns | fifo_time_ns
-+--+--+--
   0 |1 | 35.4 | 31.5
   0 |2 | 32.3 | 32.3
   0 |4 | 32.8 | 31.0
   0 |6 | 34.5 |   

Re: Is Recovery actually paused?

2021-01-21 Thread Bharath Rupireddy
On Tue, Jan 19, 2021 at 9:32 PM Dilip Kumar  wrote:
> In the last patch there were some local changes which I did not add to
> the patch and it was giving compilation warning so fixed that along
> with that I have addressed your this comment as well.

Thanks for the patch. I took a look at the v5 patch, below are some
comments. Please ignore if I'm repeating any of the comments discussed
upthread.

[1] Can we also have a wait version for pg_wal_replay_pause that waits
until recovery is actually paused right after setting the recovery
state to RECOVERY_PAUSE_REQUESTED? Something like this -
pg_wal_replay_pause_and_wait(wait boolean, wait_seconds integer
DEFAULT 60) returns boolean. It waits until either default or provided
wait_seconds and returns true if the recovery is paused within that
wait_seconds otherwise false. If wait_seconds is 0 or -1, then it
waits until recovery is paused and returns true. One advantage of this
function is that users don't need to call pg_is_wal_replay_paused().
IMHO, the job of ensuring whether or not the recovery is actually
paused is better done by the one who requests
it (pg_wal_replay_pause/pg_wal_replay_pause_and_wait).

[2] Is it intentional that RecoveryPauseRequested() returns true even
if XLogCtl->recoveryPauseState is either RECOVERY_PAUSE_REQUESTED or
RECOVERY_PAUSED?

[3] Can we change IsRecoveryPaused() instead of RecoveryIsPaused() and
IsRecoveryPauseRequested() instead of RecoveryPauseRequested()? How
about having inline(because they have one line of code) functions like
IsRecoveryPauseRequested(), IsRecoveryPaused() and
IsRecoveryInProgress(), returning true when RECOVERY_PAUSE_REQUESTED,
RECOVERY_PAUSED and RECOVERY_IN_PROGRESS respectively?

[4] Can we have at least one line of comments for each of the new
functions, I know the function names mean everything? And also for
existing SetRecoveryPause() and GetRecoveryPauseState()?

[5] Typo, it's "every time" ---> + * this everytime.

[6] Do we reach PG_RETURN_BOOL(true); at the end of
pg_is_wal_replay_paused()? If not, is it there for satisfying the
compiler?

[7] In pg_is_wal_replay_paused(), do we need if
(!RecoveryPauseRequested()) inside the for (;;)? If yes, can we add
comments about why we need it there?

[8] Isn't it good to have
pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE) and
pgstat_report_wait_end(), just before for (;;) and at the end
respectively? This will help users know what they are waiting
for. Alternatively, we could issue a notice/warning or write to the server
log while we are waiting in the for (;;) loop for recovery to get paused.

[9] In pg_is_wal_replay_paused(), is 10 msec sleep time chosen
randomly or based on some analysis that the time it takes to get to
recovery paused state from request state or some other?

[10]  errhint("The standby was promoted while waiting for recovery to
be paused."))); Can we know whether standby is actually promoted and
throw this error? Because the error "recovery is not in progress" here
is being thrown by just relying on if (!RecoveryInProgress()). IIUC,
using pg_promote

[11] Can we try to add tests for these functions in TAP? Currently, we
don't have any tests for pg_is_wal_replay_paused, pg_wal_replay_resume
or pg_wal_replay_pause, but we have tests for pg_promote in timeline
switch.

[12] Isn't it better to change RecoveryPauseState enum
RECOVERY_IN_PROGRESS value to start from 1 instead of 0? Because when
XLogCtl shared memory is initialized, I think recoveryPauseState can
be 0, so should it mean recovery in progress?
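
Putting comments [7], [8] and [9] together, I imagine the waiting part of
pg_is_wal_replay_paused() looking roughly like this (just a sketch using
the names mentioned above, not the patch's actual code):

static void
wait_for_recovery_pause(void)
{
    pgstat_report_wait_start(WAIT_EVENT_RECOVERY_PAUSE);
    for (;;)
    {
        if (GetRecoveryPauseState() == RECOVERY_PAUSED)
            break;

        if (!RecoveryInProgress())
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("recovery is not in progress"),
                     errhint("The standby was promoted while waiting for recovery to be paused.")));

        CHECK_FOR_INTERRUPTS();
        pg_usleep(10000L);      /* the 10 ms from comment [9] */
    }
    pgstat_report_wait_end();
}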

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Refactor SSL test framework to support multiple TLS libraries

2021-01-21 Thread Daniel Gustafsson
In an attempt to slice off as much non-NSS specific changes as possible from
the larger libnss patch proposed in [0], the attached patch contains the ssl
test harness refactoring to support multiple TLS libraries.

The changes are mostly a refactoring to hide library-specific setup in separate
modules, but they also extend set_server_cert() to support a password command,
which removes hands-on setup and teardown from the TAP tests.

cheers ./daniel

[0] https://postgr.es/m/fab21fc8-0f62-434f-aa78-6bd9336d6...@yesql.se



0001-Refactor-library-specific-SSL-test-setup-into-separa.patch
Description: Binary data



Re: Add primary keys to system catalogs

2021-01-21 Thread Peter Eisentraut

On 2021-01-17 23:07, Tom Lane wrote:

I've reviewed this patch.  It looks pretty solid to me, with a couple
trivial nits as mentioned below, and one bigger thing that's perhaps
in the category of bikeshedding.  Namely, do we really want to prefer
using the OID indexes as the primary keys?  In most cases there's some
other index that seems to me to be what a user would think of as the
pkey, for example pg_class_relname_nsp_index for pg_class or
pg_proc_proname_args_nsp_index for pg_proc.  Preferring OID where it
exists is a nice simple rule, which has some attractiveness, but the
OID indexes seem to me like a lookup aid rather than the "real" object
identity.


I chose this because the notional foreign keys point to the OID.

If you design some basic business database with customer IDs, product 
IDs, etc., you'd also usually make the ID the primary key, even if you 
have, say, a unique constraint on the product name.  But this is of 
course a matter of taste to some degree.
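
For illustration, the generated statements then look like this (examples
only; the full list is derived from the DECLARE_UNIQUE_INDEX /
DECLARE_UNIQUE_INDEX_PKEY directives):

ALTER TABLE pg_class ADD PRIMARY KEY USING INDEX pg_class_oid_index;
ALTER TABLE pg_class ADD UNIQUE USING INDEX pg_class_relname_nsp_index;
ALTER TABLE pg_proc ADD UNIQUE USING INDEX pg_proc_proname_args_nsp_index;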



system_constraints.sql should be removed by the maintainer-clean target
in src/backend/catalog/Makefile; perhaps also mention it in the comment
for the clean target.  Also I suppose src/tools/msvc/clean.bat needs to
remove it, like it does postgres.bki.


done


The contents of system_constraints.sql seem pretty randomly ordered,
and I bet the order isn't stable across machines.  It would be wise
to sort the entries by name to ensure we don't get inconsistencies
between different builds.  (If nothing else, that could complicate
validating tarballs.)


They follow the order in which the catalogs are processed by genbki.pl. 
 This is the same order in which the catalog data and indexes are 
created in postgres.bki.  The Perl code to handle each of these is 
conceptually the same, so that seems solid.  We could order them 
differently, but there is also value in keeping the catalog processing 
order globally consistent.



I don't follow why the pg_seclabel, pg_shseclabel indexes aren't made
into pkeys?  There's a comment claiming they "use a nondefault operator
class", but that's untrue AFAICS.


When I started the patch, it was text_pattern_ops, but that was a long 
time ago. ;-)  Fixed now.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From d6e15c6ac1903af5dbe424a34e7ea9ad776517ff Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 21 Jan 2021 09:47:42 +0100
Subject: [PATCH v3] Add primary keys and unique constraints to system catalogs

For those system catalogs that have unique indexes, make a primary
key and unique constraint, using ALTER TABLE ... PRIMARY KEY/UNIQUE
USING INDEX.

This can be helpful for GUI tools that look for a primary key, and it
might in the future allow declaring foreign keys, for making schema
diagrams.

The constraint creation statements are automatically created by
genbki.pl from DECLARE_UNIQUE_INDEX directives.  To specify which one
of the available unique indexes is the primary key, use the new
directive DECLARE_UNIQUE_INDEX_PKEY instead.

Discussion: 
https://www.postgresql.org/message-id/flat/dc5f44d9-5ec1-a596-0251-dadadcded...@2ndquadrant.com
---
 src/backend/catalog/.gitignore|  1 +
 src/backend/catalog/Catalog.pm| 11 ++--
 src/backend/catalog/Makefile  |  9 +--
 src/backend/catalog/genbki.pl | 21 +++
 src/bin/initdb/initdb.c   | 60 +--
 src/include/catalog/genbki.h  |  3 +-
 src/include/catalog/pg_aggregate.h|  2 +-
 src/include/catalog/pg_am.h   |  2 +-
 src/include/catalog/pg_amop.h |  2 +-
 src/include/catalog/pg_amproc.h   |  2 +-
 src/include/catalog/pg_attrdef.h  |  2 +-
 src/include/catalog/pg_attribute.h|  2 +-
 src/include/catalog/pg_auth_members.h |  2 +-
 src/include/catalog/pg_authid.h   |  2 +-
 src/include/catalog/pg_cast.h |  2 +-
 src/include/catalog/pg_class.h|  2 +-
 src/include/catalog/pg_collation.h|  2 +-
 src/include/catalog/pg_constraint.h   |  2 +-
 src/include/catalog/pg_conversion.h   |  2 +-
 src/include/catalog/pg_database.h |  2 +-
 src/include/catalog/pg_db_role_setting.h  |  2 +-
 src/include/catalog/pg_default_acl.h  |  2 +-
 src/include/catalog/pg_description.h  |  2 +-
 src/include/catalog/pg_enum.h |  2 +-
 src/include/catalog/pg_event_trigger.h|  2 +-
 src/include/catalog/pg_extension.h|  2 +-
 src/include/catalog/pg_foreign_data_wrapper.h |  2 +-
 src/include/catalog/pg_foreign_server.h   |  2 +-
 src/include/catalog/pg_foreign_table.h|  2 +-
 src/include/catalog/pg_index.h|  2 +-
 src/include/catalog/pg_inherits.h |  2 +-
 src/include/catalog/pg_init_privs.h   |  2 +-
 src/include/catalog/pg_language.h |  2 +-
 

Re: Jsonpath ** vs lax mode

2021-01-21 Thread Thomas Kellerer
Alexander Korotkov schrieb am 20.01.2021 um 18:13:
> We have a bug report which says that jsonpath ** operator behaves strangely 
> in the lax mode [1].

That report was from me ;)

Thanks for looking into it.

> At first sight, we may just say that lax mode just sucks and
> counter-intuitive results are expected.  But at the second sight, the
> lax mode is used by default and current behavior may look too
> surprising.

I personally would be fine with the manual stating that the Postgres extension
to the JSONPath processing that allows a recursive lookup using ** requires 
strict
mode to work properly.

It should probably be documented in chapter 9.16.2 "The SQL/JSON Path Language",
maybe with a little warning in the description of jsonb_path_query** and in
chapter 8.14.16 as well (or at least that's where I would expect such a warning).
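
For anyone who wants to see the difference for themselves, comparing the two
modes directly is easy enough (I'm deliberately not showing the output here,
since that behavior is exactly what is under discussion and depends on the
version):

select jsonb_path_query('{"a": {"x": [42]}}', 'lax $.**.x');
select jsonb_path_query('{"a": {"x": [42]}}', 'strict $.**.x');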

Regards
Thomas




Re: Jsonpath ** vs lax mode

2021-01-21 Thread Alexander Korotkov
Hi, Alvaro!

Thank you for your feedback.

On Wed, Jan 20, 2021 at 9:16 PM Alvaro Herrera  wrote:
> On 2021-Jan-20, Alexander Korotkov wrote:
>
> > My proposal is to make everything after the ** operator use strict mode
> > (patch attached).  I think this shouldn't be backpatched, just applied to
> > the v14.  Other suggestions?
>
> I think changing the mode midway through the operation is strange.  What
> do you think of requiring for ** that mode is strict?  That is, if ** is
> used and the mode is lax, an error is thrown.

Yes, changing the mode midway is a bit strange.

Requiring strict mode for ** is a solution, but probably too restrictive...

What do you think about making just the subsequent accessor after ** not
unwrap arrays?  That would be a bit tricky to implement, but it would
probably better satisfy users' needs.

--
Regards,
Alexander Korotkov




Re: Wrong usage of RelationNeedsWAL

2021-01-21 Thread Noah Misch
On Thu, Jan 21, 2021 at 06:02:11PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 21 Jan 2021 00:19:58 -0800, Noah Misch  wrote in 
> > On Thu, Jan 21, 2021 at 12:28:44AM +0900, Kyotaro Horiguchi wrote:
> > > However, with the previous patch, two existing tests sto_using_cursor
> > > and sto_using_select behaves differently from the master.  That change
> > > is coming from the omission of actual LSN check in TestForOldSnapshot
> > > while wal_level=minimal. So we have no choice other than actually
> > > updating page LSN.
> > > 
> > > In the scenario under discussion there are two places we need to do
> > > that. one is heap_page_prune, and the other is RelationCopyStorge. As
> > > a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the
> > > attached third file.
> > 
> > Fake LSNs make the system harder to understand, so I prefer not to spread 
> > fake
> > LSNs to more access methods.  What I had in mind is to simply suppress early
> > pruning when the relation is skipping WAL.  Attached.  Is this reasonable?  
> > It
> > passes the older tests.  While it changes the sto_wal_optimized.spec 
> > output, I
> > think it preserves the old_snapshot_threshold behavior contract.
> 
> Perhaps I'm missing something, but the patch doesn't pass the v5-0001
> test with wal_level=minimal?

Correct.  The case we must avoid is letting an old snapshot read an
early-pruned page without error.  v5-0001 expects "ERROR:  snapshot too old".
The patch suspends early pruning, so that error is not applicable.




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Hou, Zhijie
> > So I think we're saying that if the target table is a foreign table or
> > temporary table, it can be regarded as PARALLEL_RESTRICTED, right?
> >
> 
> Yes
> 
> IMO, PARALLEL_RESTRICTED currently enable parallel select but disable
> parallel insert.
> So, the INSERT only happen in leader worker which seems safe to insert into
> tempory/foreigh table.
> 
> In addition, there are some other restriction about parallel select which
> seems can be removed:
> 
> 1.- Target table has a parallel-unsafe trigger, index expression, column
> default
> expression or check constraint
> 2.- Target table is a partitioned table with a parallel-unsafe partition
> key
> expression or support function
> 
> If the Insert's target table is the type listed above, Is there some reasons
> why we can not support parallel select ?
> It seems only leader worker will execute the trigger and key-experssion
> which seems safe.
> (If I miss something about it, please let me know)


Sorry, please ignore the above. I got something wrong.

Best regards,
houzj




Re: Wrong usage of RelationNeedsWAL

2021-01-21 Thread Kyotaro Horiguchi
At Thu, 21 Jan 2021 00:19:58 -0800, Noah Misch  wrote in 
> On Thu, Jan 21, 2021 at 12:28:44AM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 20 Jan 2021 17:34:44 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > Anyway, it seems actually dangerous that cause pruning on wal-skipped
> > > relation.
> > > 
> > > > with your patch versions.  Could you try implementing both test 
> > > > procedures in
> > > > src/test/modules/snapshot_too_old?  There's no need to make the test use
> > > > wal_level=minimal explicitly; it's sufficient to catch these bugs when
> > > > wal_level=minimal is in the TEMP_CONFIG file.
> > > 
> > > In the attached, TestForOldSnapshot() considers XLogIsNeeded(),
> > > instead of moving the condition on LSN to TestForOldSnapshot_impl for
> > > performance.
> > > 
> > > I'll add the test part in the next version.
> 
> That test helped me.  I now see "there's not a single tuple removed due to
> old_snapshot_threshold in src/test/modules/snapshot_too_old"[1], which limits
> our ability to test using this infrastructure.

Yes.

> > However, with the previous patch, two existing tests sto_using_cursor
> > and sto_using_select behaves differently from the master.  That change
> > is coming from the omission of actual LSN check in TestForOldSnapshot
> > while wal_level=minimal. So we have no choice other than actually
> > updating page LSN.
> > 
> > In the scenario under discussion there are two places we need to do
> > that. one is heap_page_prune, and the other is RelationCopyStorge. As
> > a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the
> > attached third file.
> 
> Fake LSNs make the system harder to understand, so I prefer not to spread fake
> LSNs to more access methods.  What I had in mind is to simply suppress early
> pruning when the relation is skipping WAL.  Attached.  Is this reasonable?  It
> passes the older tests.  While it changes the sto_wal_optimized.spec output, I
> think it preserves the old_snapshot_threshold behavior contract.

Perhaps I'm missing something, but the patch doesn't pass the v5-0001
test with wal_level=minimal?

> [1] 
> https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Race condition in recovery?

2021-01-21 Thread Dilip Kumar
While analyzing one of the customer issues, based on the log it
appeared that there is a race condition in the recovery process.

The summary of the issue is that one of the standbys is promoted as the
new primary (Node2), and another standby (Node3) is restarted with
primary_conninfo and restore_command set so that it can stream/restore
from Node2 (the new primary).  The problem is that during the promotion
the timeline switch happened in the middle of a segment on Node2, and
Node3 is able to restore the newTLI.history file from the archive, but
the WAL file with the new TLI is not yet archived.  Now, Node3 tries to
stream the WAL from the primary, but while fetching the checkpoint
record it does not use the latest timeline; it asks for the checkpoint's
timeline instead, and the walsender sends the WAL from the new
timeline's file because the requested WAL recptr is before the TLI
switch.  Once that has happened, expectedTLEs is set based on the
oldTLI.history file.  Now, whenever we try to restore the required WAL
file and recoveryTargetTimeLineGoal is set to the latest, we again try
to rescan the latest timeline (rescanLatestTimeLine), but the problem is
that recoveryTargetTLI was already set to the latest timeline.  So now
expectedTLEs is set to the oldTLI, recoveryTargetTLI is set to the
newTLI, and rescanLatestTimeLine will never update expectedTLEs.  The
walsender will then fail to stream new WAL using the old TLI, and
restoring from the archive will also fail because that file doesn't
exist anymore (converted to .partial).  Basically, we will never try
with the newTLI.

I have given the sequence of the events based on my analysis.

Refer to the sequence of events
-
Node1 primary, Node2 standby1, Node3 standby2

1. Node2 got promoted to the new primary, and Node2 picked the new TL13
in the middle of a segment.
2. Node3 was restarted with primary_conninfo pointing to Node2 and a
restore_command.
3. Node3 found the newest TL13 in validateRecoveryParameters(), because
the latest TL was requested in recovery.conf (the TL13 history file was
restored from the archive), and set recoveryTargetTLI to 13.
Point to note: recoveryTargetTLI is set to 13, but expectedTLEs is not
yet set.
4. Node3 entered standby mode.
5. Node3 tries to read the checkpoint record; on Node3 the checkpoint TL
(ControlFile->checkPointCopy.ThisTimeLineID) is still 12.
6. Node3 tries to get the checkpoint record's file using the new TL13
from the archive, which it should ideally get, but it may not if Node2
hasn't archived it yet.
7. Node3 tries to stream from the primary, but using TL12, because
ControlFile->checkPointCopy.ThisTimeLineID is 12.
8. Node3 gets it, because the walsender on Node2 reads it from the TL13
file and sends it, and on Node3 it is written into a WAL file named with
TL12.

WalSndSegmentOpen()
{
/*---
* When reading from a historic timeline, and there is a timeline switch
* within this segment, read from the WAL segment belonging to the new
* timeline.
}

9. Node3 now sets expectedTLEs based on TL12, because that is the
timeline the walreceiver streamed the WAL with.

10. On Node3, recoveryTargetTLI is now 13 and expectedTLEs is based on
12.  So whenever it tries to find the latest TLE (rescanLatestTimeLine)
it finds 13, which is the same as recoveryTargetTLI, so there is nothing
to change; but expectedTLEs still reflects 12, and that is what it uses
to try to restore/stream further WAL, failing every time.

rescanLatestTimeLine(void)
{

newtarget = findNewestTimeLine(recoveryTargetTLI);
if (newtarget == recoveryTargetTLI)
{
  /* No new timelines found */
return false;
}

11.  So the situation is that rescanLatestTimeLine() should ideally move
us to the latest TLI, but it assumes we are already on the latest TLI
because recoveryTargetTLI already points to it.

Other points to be noted:
- The actual issue happened on 9.6.11, but based on a code comparison it
appears the same can occur on the latest code as well.
- After Node3 was shut down, the WAL in its pg_wal/ directory was removed
so that it could follow the new primary.

Based on the sequence of events it is quite clear that something is
wrong in rescanLatestTimeLine().  Maybe after selecting the latest TLI
we should compare it with the head of expectedTLEs as well, instead of
just comparing it to recoveryTargetTLI?
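
Something like this, purely as a sketch of the idea (not a tested fix):

rescanLatestTimeLine(void)
{
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI &&
        (expectedTLEs == NIL ||
         ((TimeLineHistoryEntry *) linitial(expectedTLEs))->tli == newtarget))
    {
        /* No new timelines found, and expectedTLEs already starts at it */
        return false;
    }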


Log from Node2:
2020-12-22 04:49:02 UTC [1019]: [9-1] user=; db=; app=; client=;
SQLcode=0 LOG:  received promote request
2020-12-22 04:49:02 UTC [1019]: [10-1] user=; db=; app=; client=;
SQLcode=0 LOG:  redo done at 0/F828
2020-12-22 04:49:02 UTC [1019]: [11-1] user=; db=; app=; client=;
SQLcode=0 LOG:  last completed transaction was at log time
2020-12-22 04:48:01.833476+00
rsync: link_stat "/wal_archive/ins30wfm02dbs/000C00F8"
failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous
errors) (code 23) at main.c(1179) [sender=3.1.2]
rsync: link_stat 

RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Hou, Zhijie
> > > Hi
> > >
> > > > > I may be wrong, and if I miss sth in previous mails, please give
> > > > > me some
> > > > hints.
> > > > > IMO, serial insertion with underlying parallel SELECT can be
> > > > > considered for foreign table or temporary table, as the
> > > > > insertions only
> > > > happened in the leader process.
> > > > >
> > > >
> > > > I don't think we support parallel scan for temporary tables. Can
> > > > you please try once both of these operations without Insert being
> > > > involved? If you are able to produce a parallel plan without
> > > > Insert then we can see why it is not supported with Insert.
> > >
> > > Sorry, may be I did not express it clearly, I actually means the case
> when insert's target(not in select part) table is temporary.
> > > And you are right that parallel select is not enabled when temporary
> table is in select part.
> > >
> >
> > I think Select can be parallel for this case and we should support this
> case.
> >
> 
> So I think we're saying that if the target table is a foreign table or
> temporary table, it can be regarded as PARALLEL_RESTRICTED, right?
>

Yes

IMO, PARALLEL_RESTRICTED currently enables parallel SELECT but disables parallel 
INSERT.
So the INSERT only happens in the leader, which seems safe when inserting into a 
temporary/foreign table.

In addition, there are some other restrictions on parallel SELECT which it seems 
could be removed:

1.- Target table has a parallel-unsafe trigger, index expression, column default
expression or check constraint
2.- Target table is a partitioned table with a parallel-unsafe partition key
expression or support function

If the INSERT's target table is one of the types listed above, is there some 
reason why we cannot support parallel SELECT?
It seems only the leader will execute the trigger and key expression, which 
seems safe.
(If I missed something here, please let me know.)

Best regards,
houzj







Re: [PATCH 1/1] Fix detection of pwritev support for OSX.

2021-01-21 Thread Sergey Shinderuk

On 21.01.2021 02:07, Tom Lane wrote:

I now believe what is actually happening with the short command is
that it's iterating through the available SDKs (according to some not
very clear search path) and picking the first one it finds that
matches the host system version.  That matches the ktrace evidence
that shows it reading the SDKSettings.plist file in each SDK
directory.


Yes, you are right. After some more digging...

It searches the DEVELOPER_DIR first and then 
/Library/Developer/CommandLineTools, which is hardcoded.


My DEVELOPER_DIR is
% xcode-select -p
/Applications/Xcode.app/Contents/Developer

(For more detail try "otool -tV /usr/lib/libxcselect.dylib -p 
_xcselect_get_developer_dir_path".)


It reads ProductVersion from 
/System/Library/CoreServices/SystemVersion.plist


% plutil -p /System/Library/CoreServices/SystemVersion.plist | grep 
ProductVersion

  "ProductVersion" => "10.15.7"

Strips anything after the second dot, and prepends "macosx" to it, which 
gives "macosx10.15".


Then it scans through SDK dirs looking up CanonicalName from 
SDKSettings.plist until it finds a match with "macosx10.15".



The overall callstack:

% sudo dtrace -n 'syscall::getdirentries64:entry { ustack() }' -c 'xcrun 
--show-sdk-path'

dtrace: description 'syscall::getdirentries64:entry ' matched 1 probe
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
dtrace: pid 20183 has exited
CPU IDFUNCTION:NAME
  0846getdirentries64:entry
  libsystem_kernel.dylib`__getdirentries64+0xa
  libsystem_c.dylib`readdir$INODE64+0x23
  libsystem_c.dylib`scandir$INODE64+0x6c
  libxcrun.dylib`cltools_lookup_sdk_by_key+0x5f
  libxcrun.dylib`cltools_lookup_boot_system_sdk+0xda
  libxcrun.dylib`xcinfocache_resolve_sdkroot+0xc0
  libxcrun.dylib`xcrun_main2+0x57a
  libxcrun.dylib`xcrun_main+0x9
  libxcselect.dylib`xcselect_invoke_xcrun_via_library+0xc8
  libxcselect.dylib`xcselect_invoke_xcrun+0x25a
  xcrun`DYLD-STUB$$getprogname
  libdyld.dylib`start+0x1
  xcrun`0x2

  0846getdirentries64:entry
  libsystem_kernel.dylib`__getdirentries64+0xa
  libsystem_c.dylib`readdir$INODE64+0x23
  libsystem_c.dylib`scandir$INODE64+0x6c
  libxcrun.dylib`cltools_lookup_sdk_by_key+0x5f
  libxcrun.dylib`cltools_lookup_boot_system_sdk+0xf3
  libxcrun.dylib`xcinfocache_resolve_sdkroot+0xc0
  libxcrun.dylib`xcrun_main2+0x57a
  libxcrun.dylib`xcrun_main+0x9
  libxcselect.dylib`xcselect_invoke_xcrun_via_library+0xc8
  libxcselect.dylib`xcselect_invoke_xcrun+0x25a
  xcrun`DYLD-STUB$$getprogname
  libdyld.dylib`start+0x1
  xcrun`0x2


The SDK search path:

% sudo dtrace -n 'pid$target:::entry 
/probefunc=="cltools_lookup_sdk_by_key"/ { trace(copyinstr(arg0)); 
trace(copyinstr(arg1)) }' -c 'xcrun --show-sdk-path'

dtrace: description 'pid$target:::entry ' matched 17293 probes
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
dtrace: pid 20191 has exited
CPU IDFUNCTION:NAME
  8 398290  cltools_lookup_sdk_by_key:entry 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer 
 macosx10.15
  9 398290  cltools_lookup_sdk_by_key:entry 
/Library/Developer/CommandLineTools  macosx10.15



The properties read from SDKSettings.plist:

% sudo dtrace -n 'pid$target:::entry 
/probefunc=="_cltools_lookup_property_in_path"/ { 
trace(copyinstr(arg0)); trace(copyinstr(arg1)); trace(copyinstr(arg2)) 
}' -c 'xcrun --show-sdk-path'

dtrace: description 'pid$target:::entry ' matched 17293 probes
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
dtrace: pid 20195 has exited
CPU IDFUNCTION:NAME
  8 398288 _cltools_lookup_property_in_path:entry   / 
System/Library/CoreServices/SystemVersion.plist 
ProductVersion
  8 398288 _cltools_lookup_property_in_path:entry 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/DriverKit20.2.sdk 
 SDKSettings.plist  IsBaseSDK
  8 398288 _cltools_lookup_property_in_path:entry 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/DriverKit20.2.sdk 
 SDKSettings.plist  CanonicalName
  4 398288 _cltools_lookup_property_in_path:entry 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/DriverKit20.2.sdk 
 SDKSettings.plist  CanonicalNameForBuildSettings
  4 398288 _cltools_lookup_property_in_path:entry 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk 
 SDKSettings.plist  IsBaseSDK
  4 398288 _cltools_lookup_property_in_path:entry 

Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-21 Thread Amit Kapila
On Thu, Jan 21, 2021 at 12:44 PM Greg Nancarrow  wrote:
>
> On Wed, Dec 23, 2020 at 1:45 PM Amit Kapila  wrote:
> >
> > On Wed, Dec 23, 2020 at 7:52 AM Hou, Zhijie  
> > wrote:
> > >
> > > Hi
> > >
> > > > > I may be wrong, and if I miss sth in previous mails, please give me 
> > > > > some
> > > > hints.
> > > > > IMO, serial insertion with underlying parallel SELECT can be
> > > > > considered for foreign table or temporary table, as the insertions 
> > > > > only
> > > > happened in the leader process.
> > > > >
> > > >
> > > > I don't think we support parallel scan for temporary tables. Can you 
> > > > please
> > > > try once both of these operations without Insert being involved? If you
> > > > are able to produce a parallel plan without Insert then we can see why 
> > > > it
> > > > is not supported with Insert.
> > >
> > > Sorry, may be I did not express it clearly, I actually means the case 
> > > when insert's target(not in select part) table is temporary.
> > > And you are right that parallel select is not enabled when temporary 
> > > table is in select part.
> > >
> >
> > I think Select can be parallel for this case and we should support this 
> > case.
> >
>
> So I think we're saying that if the target table is a foreign table or
> temporary table, it can be regarded as PARALLEL_RESTRICTED, right?
>

Yes.

> i.e. code-wise:
>
> /*
> -* We can't support table modification in parallel-mode if
> it's a foreign
> -* table/partition (no FDW API for supporting parallel access) or a
> +* We can't support table modification in a parallel worker if it's a
> +* foreign table/partition (no FDW API for supporting parallel
> access) or a
>  * temporary table.
>  */
> if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> RelationUsesLocalBuffers(rel))
> {
> -   table_close(rel, lockmode);
> -   context->max_hazard = PROPARALLEL_UNSAFE;
> -   return true;
> +   if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> +   {
> +   table_close(rel, lockmode);
> +   return true;
> +   }
> }
>

Yeah, these changes look correct to me.

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] make async slave to wait for lsn to be replayed

2021-01-21 Thread Kyotaro Horiguchi
Hello.

At Wed, 18 Nov 2020 15:05:00 +0300, a.pervush...@postgrespro.ru wrote in 
> I've changed the BEGIN WAIT FOR LSN statement to core functions
> pg_waitlsn, pg_waitlsn_infinite and pg_waitlsn_no_wait.
> Currently the functions work inside repeatable read transactions, but
> waitlsn creates a snapshot if called first in a transaction block,
> which can possibly lead the transaction to working incorrectly, so the
> function gives a warning.

According to the discussion here, implementing this as functions is not
optimal. As a PoC, I made it a procedure. However, I'm not sure this is
the correct way to implement it as a native procedure, but it seems to
work as expected.

> Usage examples
> ==
> select pg_waitlsn(‘LSN’, timeout);
> select pg_waitlsn_infinite(‘LSN’);
> select pg_waitlsn_no_wait(‘LSN’);

The first and second usages are covered by a single procedure. The last
function is equivalent to pg_last_wal_replay_lsn(). As a result, the
following procedure is provided in the attached patch.

pg_waitlsn(wait_lsn pg_lsn, timeout integer DEFAULT -1)
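
With that, usage would look something like this (illustrative; the timeout
semantics are whatever the patch defines):

CALL pg_waitlsn('0/306EE20', 1000);   -- wait for this LSN, up to the given timeout
CALL pg_waitlsn('0/306EE20');         -- default timeout of -1: wait indefinitely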

Any opinions, mainly compared to implementing this as a command?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..4283b98eb4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -42,6 +42,7 @@
 #include "catalog/pg_database.h"
 #include "commands/progress.h"
 #include "commands/tablespace.h"
+#include "commands/wait.h"
 #include "common/controldata_utils.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
@@ -7463,6 +7464,15 @@ StartupXLOG(void)
 	break;
 }
 
+/*
+ * If we replayed an LSN that someone was waiting for,
+ * set latches in shared memory array to notify the waiter.
+ */
+if (XLogCtl->lastReplayedEndRecPtr >= GetMinWaitedLSN())
+{
+	WaitSetLatch(XLogCtl->lastReplayedEndRecPtr);
+}
+
 /* Else, try to fetch the next WAL record */
 record = ReadRecord(xlogreader, LOG, false);
 			} while (record != NULL);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..c19d49e7a4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1460,6 +1460,10 @@ LANGUAGE internal
 STRICT IMMUTABLE PARALLEL SAFE
 AS 'unicode_is_normalized';
 
+CREATE OR REPLACE PROCEDURE
+  pg_waitlsn(wait_lsn pg_lsn, timeout integer DEFAULT -1)
+  LANGUAGE internal AS 'pg_waitlsn';
+
 --
 -- The default permissions for functions mean that anyone can execute them.
 -- A number of functions shouldn't be executable by just anyone, but rather
diff --git a/src/backend/commands/Makefile b/src/backend/commands/Makefile
index e8504f0ae4..2c0bd41336 100644
--- a/src/backend/commands/Makefile
+++ b/src/backend/commands/Makefile
@@ -60,6 +60,7 @@ OBJS = \
 	user.o \
 	vacuum.o \
 	variable.o \
-	view.o
+	view.o \
+	wait.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..959e96b7e0 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -23,6 +23,7 @@
 #include "access/syncscan.h"
 #include "access/twophase.h"
 #include "commands/async.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -149,6 +150,7 @@ CreateSharedMemoryAndSemaphores(void)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, WaitShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -268,6 +270,11 @@ CreateSharedMemoryAndSemaphores(void)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 
+	/*
+	 * Init array of events for the wait clause in shared memory
+	 */
+	WaitShmemInit();
+
 #ifdef EXEC_BACKEND
 
 	/*
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c87ffc6549..2b4d73ba2f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -38,6 +38,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "commands/wait.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
@@ -713,6 +714,9 @@ LockErrorCleanup(void)
 
 	AbortStrongLockAcquire();
 
+	/* If waitlsn was interrupted, then stop waiting for that LSN */
+	DeleteWaitedLSN();
+
 	/* Nothing to do if we weren't waiting for a lock */
 	if (lockAwaited == NULL)
 	{
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 4096faff9a..90876da120 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -373,8 +373,6 @@ pg_sleep(PG_FUNCTION_ARGS)
 	 * less than the specified time when WaitLatch is terminated early by a
 	 * non-query-canceling signal such as SIGHUP.
 	 */
-#define GetNowFloat()	

Re: Wrong usage of RelationNeedsWAL

2021-01-21 Thread Noah Misch
On Thu, Jan 21, 2021 at 12:28:44AM +0900, Kyotaro Horiguchi wrote:
> At Wed, 20 Jan 2021 17:34:44 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > Anyway, it seems actually dangerous that cause pruning on wal-skipped
> > relation.
> > 
> > > with your patch versions.  Could you try implementing both test 
> > > procedures in
> > > src/test/modules/snapshot_too_old?  There's no need to make the test use
> > > wal_level=minimal explicitly; it's sufficient to catch these bugs when
> > > wal_level=minimal is in the TEMP_CONFIG file.
> > 
> > In the attached, TestForOldSnapshot() considers XLogIsNeeded(),
> > instead of moving the condition on LSN to TestForOldSnapshot_impl for
> > performance.
> > 
> > I'll add the test part in the next version.

That test helped me.  I now see "there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old"[1], which limits
our ability to test using this infrastructure.

> However, with the previous patch, two existing tests sto_using_cursor
> and sto_using_select behaves differently from the master.  That change
> is coming from the omission of actual LSN check in TestForOldSnapshot
> while wal_level=minimal. So we have no choice other than actually
> updating page LSN.
> 
> In the scenario under discussion there are two places we need to do
> that. one is heap_page_prune, and the other is RelationCopyStorge. As
> a PoC, I used gistXLogAssignLSN() as is for thie purpose. See the
> attached third file.

Fake LSNs make the system harder to understand, so I prefer not to spread fake
LSNs to more access methods.  What I had in mind is to simply suppress early
pruning when the relation is skipping WAL.  Attached.  Is this reasonable?  It
passes the older tests.  While it changes the sto_wal_optimized.spec output, I
think it preserves the old_snapshot_threshold behavior contract.

[1] 
https://www.postgresql.org/message-id/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de
diff --git a/src/backend/catalog/pg_publication.c 
b/src/backend/catalog/pg_publication.c
index 5f8e1c6..84d2efc 100644
--- a/src/backend/catalog/pg_publication.c
+++ b/src/backend/catalog/pg_publication.c
@@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
 errdetail("System tables cannot be added to 
publications.")));
 
/* UNLOGGED and TEMP relations cannot be part of publication. */
-   if (!RelationNeedsWAL(targetrel))
+   if (targetrel->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 errmsg("table \"%s\" cannot be replicated",
diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c
index da322b4..177e6e3 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -126,7 +126,8 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, 
bool inhparent,
relation = table_open(relationObjectId, NoLock);
 
/* Temporary and unlogged relations are inaccessible during recovery. */
-   if (!RelationNeedsWAL(relation) && RecoveryInProgress())
+   if (relation->rd_rel->relpersistence != RELPERSISTENCE_PERMANENT &&
+   RecoveryInProgress())
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot access temporary or unlogged 
relations during recovery")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ae16c3e..9570426 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -1764,7 +1764,11 @@ TransactionIdLimitedForOldSnapshots(TransactionId 
recentXmin,
Assert(OldSnapshotThresholdActive());
Assert(limit_ts != NULL && limit_xid != NULL);
 
-   if (!RelationAllowsEarlyPruning(relation))
+   /*
+* TestForOldSnapshot() assumes early pruning advances the page LSN, so 
we
+* can't prune early when skipping WAL.
+*/
+   if (!RelationAllowsEarlyPruning(relation) || 
!RelationNeedsWAL(relation))
return false;
 
ts = GetSnapshotCurrentTimestamp();
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 579be35..c21ee3c 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -37,7 +37,7 @@
  */
 #define RelationAllowsEarlyPruning(rel) \
 ( \
-RelationNeedsWAL(rel) \
+(rel)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT  \
   && !IsCatalogRelation(rel) \
   && !RelationIsAccessibleInLogicalDecoding(rel) \
 )
diff --git a/src/test/subscription/t/001_rep_changes.pl 
b/src/test/subscription/t/001_rep_changes.pl
index 0680f44..ed9b48e 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -3,7 +3,7 @@ use strict;
 use 

Re: OpenSSL connection setup debug callback issue

2021-01-21 Thread Michael Paquier
On Thu, Dec 10, 2020 at 02:43:33PM +0100, Daniel Gustafsson wrote:
> I went looking at the SSL connection state change information callback we
> install when setting up connections with OpenSSL, and I wasn't getting the
> state changes I expected.  Turns out we install it at the tail end of setting
> up the connection so we miss most of the calls.  Moving it to the beginning of
> be_tls_open_server allows us to catch the handshake etc.  I also extended it 
> by
> printing the human readable state change message available from OpenSSL to 
> make
> the logs more detailed (SSL_state_string_long has existed since 0.9.8).

Looking at the docs, SSL_state_string_long() is better than just
SSL_state_string(), so that sounds right:
https://www.openssl.org/docs/manmaster/man3/SSL_CTX_set_info_callback.html
https://www.openssl.org/docs/manmaster/man3/SSL_state_string.html
https://www.openssl.org/docs/manmaster/man3/SSL_state_string_long.html

This is interesting for debugging, +1 for applying what you have
here, and this works for 1.0.1~3.0.0.  Worth noting that this returns
a static string, as per ssl_stat.c.
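
For the archives, the shape of such a callback is roughly the following (a
sketch only, not the exact code from the patch or be-secure-openssl.c):

#include "postgres.h"

#include <openssl/ssl.h>

static void
my_info_cb(const SSL *ssl, int type, int args)
{
	/* SSL_state_string_long() returns a static string, nothing to free */
	ereport(DEBUG4,
			(errmsg_internal("SSL: %s", SSL_state_string_long(ssl))));
}

static void
install_ssl_info_cb(SSL *ssl)
{
	/* install before the handshake so all the early state changes are reported */
	SSL_set_info_callback(ssl, my_info_cb);
}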
--
Michael


signature.asc
Description: PGP signature