Re: adding wait_start column to pg_locks

2021-02-15 Thread Fujii Masao



On 2021/02/15 15:17, Fujii Masao wrote:



On 2021/02/10 10:43, Fujii Masao wrote:



On 2021/02/09 23:31, torikoshia wrote:

On 2021-02-09 22:54, Fujii Masao wrote:

On 2021/02/09 19:11, Fujii Masao wrote:



On 2021/02/09 18:13, Fujii Masao wrote:



On 2021/02/09 17:48, torikoshia wrote:

On 2021-02-05 18:49, Fujii Masao wrote:

On 2021/02/05 0:03, torikoshia wrote:

On 2021-02-03 11:23, Fujii Masao wrote:

64-bit fetches are not atomic on some platforms. So spinlock is necessary when updating 
"waitStart" without holding the partition lock? Also GetLockStatusData() needs spinlock 
when reading "waitStart"?


Also it might be worth thinking to use 64-bit atomic operations like
pg_atomic_read_u64(), for that.


Thanks for your suggestion and advice!

In the attached patch I used pg_atomic_read_u64() and pg_atomic_write_u64().

waitStart is TimestampTz, i.e., int64, but it seems pg_atomic_read_xxx and
pg_atomic_write_xxx only support unsigned integers, so I cast the type.

I may not be using these functions correctly, so if something is wrong, I would
appreciate any comments.
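For context, a minimal sketch of the cast idiom being described, assuming
waitStart is a pg_atomic_uint64 member of PGPROC as in the patch (illustrative
only, not the exact patch code; "proc" stands for whatever PGPROC pointer a
reader such as GetLockStatusData() is examining):

    /* writer side: record when this backend started waiting for the lock */
    TimestampTz now = GetCurrentTimestamp();

    pg_atomic_write_u64(&MyProc->waitStart, (uint64) now);

    /* reader side, e.g. while building the pg_locks view from another PGPROC */
    TimestampTz waitStart = (TimestampTz) pg_atomic_read_u64(&proc->waitStart);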


About the documentation, since your suggestion seems better than v6, I used it 
as is.


Thanks for updating the patch!

+    if (pg_atomic_read_u64(&MyProc->waitStart) == 0)
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &now));

pg_atomic_read_u64() is really necessary? I think that
"pg_atomic_write_u64(&MyProc->waitStart, now)" is enough.

+    deadlockStart = get_timeout_start_time(DEADLOCK_TIMEOUT);
+    pg_atomic_write_u64(&MyProc->waitStart,
+    pg_atomic_read_u64((pg_atomic_uint64 *) &deadlockStart));

Same as above.

+    /*
+ * Record waitStart reusing the deadlock timeout timer.
+ *
+ * It would be ideal this can be synchronously done with updating
+ * lock information. Howerver, since it gives performance impacts
+ * to hold partitionLock longer time, we do it here asynchronously.
+ */

IMO it's better to comment why we reuse the deadlock timeout timer.

 proc->waitStatus = waitStatus;
+    pg_atomic_init_u64(&MyProc->waitStart, 0);

pg_atomic_write_u64() should be used instead? Because waitStart can be
accessed concurrently there.

I updated the patch and addressed the above review comments. Patch attached.
Barring any objection, I will commit this version.


Thanks for modifying the patch!
I agree with your comments.

BTW, I ran pgbench several times before and after applying
this patch.

The environment is a virtual machine (CentOS 8), so this is
just for reference, but there was no significant difference
in latency or tps (both differences are below 1%).


Thanks for the test! I pushed the patch.


But I reverted the patch because buildfarm members rorqual and
prion don't like the patch. I'm trying to investigate the cause
of these failures.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2021-02-09%2009%3A20%3A10


-     relation     | locktype |         mode
------------------+----------+----------------------
- test_prepared_1 | relation | RowExclusiveLock
- test_prepared_1 | relation | AccessExclusiveLock
-(2 rows)
-
+ERROR:  invalid spinlock number: 0

"rorqual" reported that the above error happened in the server built with
--disable-atomics --disable-spinlocks when reading pg_locks after
the transaction was prepared. The cause of this issue is that the "waitStart"
atomic variable in the dummy proc created at the end of a prepared transaction
was not initialized. I updated the patch so that pg_atomic_init_u64() is
called for "waitStart" in the dummy proc for prepared transactions.
Patch attached. I confirmed that the patched server built with
--disable-atomics --disable-spinlocks passed all the regression tests.


Thanks for fixing the bug. I also tested v9.patch configured with
--disable-atomics --disable-spinlocks on my environment and confirmed
that all tests have passed.


Thanks for the test!

I found another bug in the patch. InitProcess() initializes "waitStart",
but previously InitAuxiliaryProcess() did not. This could cause "invalid
spinlock number" error when reading pg_locks in the standby server.
I fixed that. Attached is the updated version of the patch.


I pushed this version. Thanks!


While reading the patch again, I found two minor things.

1. As discussed in another thread [1], the atomic variable "waitStart" should
  be initialized at postmaster startup rather than at the startup of each
  child process. I changed "waitStart" so that it's initialized in
  InitProcGlobal() and also reset to 0 by using pg_atomic_write_u64() in
  InitProcess() and InitAuxiliaryProcess().

2. Thanks to the above change, InitProcGlobal() initializes "waitStart"
  even in PGPROC entries for prepared transactions. But those entries are
  zeroed in MarkAsPreparingGuts(), so "waitStart" needs to be initialized
  again. Currently TwoPhaseGetDummyProc() initializes "waitS

Re: A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Amit Langote
Hi,

On Tue, Feb 16, 2021 at 1:06 AM Laurenz Albe  wrote:
> On Mon, 2021-02-15 at 17:53 +0900, Amit Langote wrote:
> > On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere  wrote:
> > > It turns out parallel_workers may be a useful reloption for certain uses 
> > > of partitioned tables,
> > >  at least if they're made up of fancy column store partitions (see
> > >  
> > > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> > > Would somebody tell me what I'm doing wrong? I would love to submit a 
> > > patch but I'm stuck:
> >
> > You may see by inspecting the callers of compute_parallel_worker()
> > that it never gets called on a partitioned table, only its leaf
> > partitions.  Maybe you could try calling compute_parallel_worker()
> > somewhere in add_paths_to_append_rel(), which has this code to figure
> > out parallel_workers to use for a parallel Append path for a given
> > partitioned table:
> >
> > /* Find the highest number of workers requested for any subpath. */
> > foreach(lc, partial_subpaths)
> > {
> > Path   *path = lfirst(lc);
> >
> > parallel_workers = Max(parallel_workers, 
> > path->parallel_workers);
> > }
> > Assert(parallel_workers > 0);
> >
> > /*
> >  * If the use of parallel append is permitted, always request at 
> > least
> >  * log2(# of children) workers.  We assume it can be useful to have
> >  * extra workers in this case because they will be spread out across
> >  * the children.  The precise formula is just a guess, but we don't
> >  * want to end up with a radically different answer for a table 
> > with N
> >  * partitions vs. an unpartitioned table with the same data, so the
> >  * use of some kind of log-scaling here seems to make some sense.
> >  */
> > if (enable_parallel_append)
> > {
> > parallel_workers = Max(parallel_workers,
> >fls(list_length(live_childrels)));
> > parallel_workers = Min(parallel_workers,
> >max_parallel_workers_per_gather);
> > }
> > Assert(parallel_workers > 0);
> >
> > Note that the 'rel' in this code refers to the partitioned table for
> > which an Append path is being considered, so compute_parallel_worker()
> > using that 'rel' would use the partitioned table's
> > rel_parallel_workers as you are trying to do.
>
> Note that there is a second chunk of code quite like that one a few
> lines down from there that would also have to be modified.
>
> I am +1 on allowing to override the degree of parallelism on a parallel
> append.  If "parallel_workers" on the partitioned table is an option for
> that, it might be a simple solution.  On the other hand, perhaps it would
> be less confusing to have a different storage parameter name rather than
> having "parallel_workers" do double duty.
>
> Also, since there is a design rule that storage parameters can only be used
> on partitions, we would have to change that - is that a problem for anybody?

I am not aware of a rule that suggests that parallel_workers is always
interpreted using storage-level considerations.  If that is indeed a
popular interpretation at this point, then yes, we should be open to
considering a new name for the parameter that this patch wants to add.

Maybe parallel_append_workers?  Perhaps not a bad idea in this patch's
case, but considering that we may want to expand the support of
cross-partition parallelism to operations other than querying, maybe
something else?

This reminds me of something I forgot to mention in my review of the
patch -- it should also update the documentation of parallel_workers
on the CREATE TABLE page to mention that it will be interpreted a bit
differently for partitioned tables than for regular storage-bearing
relations.  Workers specified for partitioned tables would be
distributed by the executor over its partitions, unlike with
storage-bearing relations, where the distribution of specified workers
is controlled by the AM using storage-level considerations.

> There is another related consideration that doesn't need to be addressed
> by this patch, but that is somewhat related: if the executor prunes some
> partitions, the degree of parallelism is unaffected, right?

That's correct.  Launched workers could be less than planned, but that
would not have anything to do with executor pruning.

> So if the planner decides to use 24 workers for 25 partitions, and the
> executor discards all but one of these partition scans, we would end up
> with 24 workers scanning a single partition.

I do remember pondering this when testing my patches to improve the
performance of executing a generic plan to scan a partitioned table
where runtime pruning is possible.  Here is an example:

create table hp (a int) partition by hash (a);
select 'create table hp' || i 

Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-02-15 Thread Joel Jacobson
On Mon, Feb 15, 2021, at 20:34, Mark Rofail wrote:
>Dear All, 
>
>I know that avoiding trivial coercion problems is convenient but at the end of 
>the day,
>it's the DB Architect's job to use the proper tables to be able to use FK 
>Arrays.
>For instance, if we have two tables, TABLE A (atest1 int2) and TABLE B (btest1 
>int, btest2 int4[])
>and an FK constraint between A(atest1) and B(btest2), it simply shouldn't 
>work. btest2 should be of type int2[].
>Thus, I have reverted back the signature @>>(anyarray,anyelement) and 
><<@(anyelement,anyarray).
>I am open to discuss this if anyone has any input, would be appreciated.

I agree, I think this is a wise decision.
This reduces the complexity to what is actually needed for fk_arrays_elem_v1.patch,
and eliminates an entire class of possible bugs.

>Please find the "anyarray_anyelement_operators-v3.patch" attached below.
>Changelog:
>- v3 (compatible with current master 2021-02-15, commit 
>0e5290312851557ee24e3d6103baf14d6066695c)
>* refactor ginqueryarrayextract in ginarrayproc.c
>* revert back the signature @>>(anyarray,anyelement) and 
> <<@(anyelement,anyarray)

Hmm, I think it looks like you forgot to update the documentation?

It still says anycompatiblenonarray:

@ doc/src/sgml/func.sgml
+anyarray @>> anycompatiblenonarray
+anycompatiblenonarray <<@ anyarray

@ src/sgml/gin.sgml
+  @>> (anyarray,anycompatiblenonarray)
+  <<@ (anycompatiblenonarray,anyarray)

Should it be anyelement in combination with anyarray?

Anyway, I've tested the patch, not only using your tests, but also the attached 
test.

The test has been auto-generated by type-test.sql (attached) mining values
of different types from the regression database, and then ensuring there
are no non-null differences between these three queries:

SELECT value1::type1 = ANY(ARRAY[value2::type2]);
SELECT value1::type1 <<@ ARRAY[value2::type2];
SELECT ARRAY[value2::type1] @>> value1::type2;

It tests a huge number of different type combinations, and reports any problems.

For the values whose types could actually be compared (currently only when the
types are the same),
it outputs the queries and results, from which the attached tests have been 
created.

No problems were found. Good.

/Joel

type-test.sql
Description: Binary data


anyarray_anyelement_operators.expected
Description: Binary data


anyarray_anyelement_operators.sql
Description: Binary data


Re: [PoC] Non-volatile WAL buffer

2021-02-15 Thread Takashi Menjo
Hi,

I made a new page at PostgreSQL Wiki to gather and summarize information
and discussion about PMEM-backed WAL designs and implementations. Some
parts of the page are TBD. I will continue to maintain the page. Requests
are welcome.

Persistent Memory for WAL
https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL

Regards,

-- 
Takashi Menjo 


Re: 64-bit XIDs in deleted nbtree pages

2021-02-15 Thread Peter Geoghegan
On Mon, Feb 15, 2021 at 7:26 PM Peter Geoghegan  wrote:
> Actually, there is one more reason why I bring up idea 1 now: I want
> to hear your thoughts on the index AM API questions now, which idea 1
> touches on. Ideally all of the details around the index AM VACUUM APIs
> (i.e. when and where the extra work happens -- btvacuumcleanup() vs
> btbulkdelete()) won't need to change much in the future. I worry about
> getting this index AM API stuff right, at least a little.

Speaking of problems like this, I think I spotted an old one: we call
_bt_update_meta_cleanup_info() in either btbulkdelete() or
btvacuumcleanup(). I think that we should always call it in
btvacuumcleanup(), though -- even in cases where there is no call to
btvacuumscan() inside btvacuumcleanup() (because btvacuumscan()
happened earlier instead, during the btbulkdelete() call).

This makes the value of IndexVacuumInfo.num_heap_tuples (which is what
we store in the metapage) much more accurate -- right now it's always
pg_class.reltuples from *before* the VACUUM started. And so the
btm_last_cleanup_num_heap_tuples value in an nbtree metapage is often
kind of inaccurate.

This "estimate during ambulkdelete" issue is documented here (kind of):

/*
 * Struct for input arguments passed to ambulkdelete and amvacuumcleanup
 *
 * num_heap_tuples is accurate only when estimated_count is false;
 * otherwise it's just an estimate (currently, the estimate is the
 * prior value of the relation's pg_class.reltuples field, so it could
 * even be -1).  It will always just be an estimate during ambulkdelete.
 */
typedef struct IndexVacuumInfo
{
...
}

The name of the metapage field is already
btm_last_cleanup_num_heap_tuples, which already suggests the approach
that I propose now. So why don't we do it like that already?

(Thinks some more...)

I wonder: did this detail change at the last minute during the
development of the feature (just before commit 857f9c36) back in early
2018? That change would have made it easier to deal with
oldestBtpoXact/btm_oldest_btpo_xact, which IIRC was a late addition to
the patch -- so maybe it's truly an accident that the code doesn't
work the way that I suggest it should already. (It's annoying to make
state from btbulkdelete() appear in btvacuumcleanup(), unless it's
from IndexVacuumInfo or something -- I can imagine this changing at
the last minute, just for that reason.)

Do you think that this needs to be treated as a bug in the
backbranches, Masahiko? I'm not sure...

In any case we should probably make this change as part of Postgres
14. Don't you think? It's certainly easy to do it this way now, since
there will be no need to keep around a oldestBtpoXact value until
btvacuumcleanup() (in the common case where btbulkdelete() is where we
call btvacuumscan()). The new btm_last_cleanup_num_delpages field
(which replaces btm_oldest_btpo_xact) has a value that just comes from
the bulk stats, which is easy anyway.

-- 
Peter Geoghegan




Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Michael Paquier
On Tue, Feb 16, 2021 at 12:43:42PM +0900, Fujii Masao wrote:
> On 2021/02/16 6:28, Andres Freund wrote:
>> So what? It's just about free to initialize a spinlock, whether it's
>> using the fallback implementation or not. Initializing upon walsender
>> startup adds a lot of complications, because e.g. somebody could already
>> hold the spinlock because the previous walsender just disconnected, and
>> they were looking at the stats.

Okay.

> Even if we initialize "writtenUpto" in WalRcvShmemInit(), WalReceiverMain()
> still needs to initialize it (reset it to 0) by using pg_atomic_write_u64().

Yes, you have to do that.

> Basically we should not acquire new spinlock while holding another spinlock,
> to shorten the spinlock duration. Right? If yes, we need to move
> pg_atomic_read_u64() of "writtenUpto" after the release of spinlock in
> pg_stat_get_wal_receiver.

It would not matter much as a NULL tuple is returned as long as the
WAL receiver information is not ready to be displayed.  The only
reason why all the fields are read before checking for
ready_to_display is that we can be sure that everything is consistent
with the PID.  So reading writtenUpto before or after does not really
matter logically.  I would just move it after the check, as you did
previously.

+   /*
+* Read "writtenUpto" without holding a spinlock. So it may not be
+* consistent with other WAL receiver's shared variables protected by a
+* spinlock. This is OK because that variable is used only for
+* informational purpose and should not be used for data integrity checks.
+*/
What about the following?
"Read "writtenUpto" without holding a spinlock.  Note that it may not
be consistent with the other shared variables of the WAL receiver
protected by a spinlock, but this should not be used for data
integrity checks."

I agree that what has been done with MyProc->waitStart in 46d6e5f is
not safe, and that initialization should happen once at postmaster
startup, with a write(0) when starting the backend.  There are two of
them in proc.c, one in twophase.c.  Do you mind if I add an open item
for this one?
--
Michael


signature.asc
Description: PGP signature


progress reporting for partitioned REINDEX

2021-02-15 Thread Justin Pryzby
It looks like we missed this in a6642b3ae.

I think it's an odd behavior of pg_stat_progress_create_index to simultaneously
show the global progress as well as the progress for the current partition ...

It seems like for partitioned reindex, reindex_index() should set the AM, which
is used in the view:

src/backend/catalog/system_views.sql-   WHEN 2 THEN 'building index' ||
src/backend/catalog/system_views.sql:       COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),

Maybe it needs a new flag, like:
params->options & REINDEXOPT_REPORT_PROGRESS_AM

I don't understand why e66bcfb4c added multiple calls to
pgstat_progress_start_command().

-- 
Justin
>From b04d46958b16bc07ecd6c2f07220599ecb304b4d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 15 Feb 2021 14:12:32 -0600
Subject: [PATCH] WIP: progress reporting for CREATE INDEX on partitioned
 tables

See also:
a6642b3ae060976b42830b7dc8f29ec190ab05e4
03f9e5cba0ee1633af4abe734504df50af46fbd8
c880096dc1e14b62610aa34bc98db226fa134260
e66bcfb4c66ebc97020b1f7484b1529bd7993f23
---
 src/backend/catalog/index.c  |   6 +-
 src/backend/commands/indexcmds.c | 104 ++-
 2 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f5441ce24a..fef9c831ac 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3733,9 +3733,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	 */
 	iRel = index_open(indexId, AccessExclusiveLock);
 
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID,
-	 iRel->rd_rel->relam);
+	// Do this unconditionally?
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_ACCESS_METHOD_OID,
+ iRel->rd_rel->relam);
 
 	/*
 	 * Partitioned indexes should never get processed here, as they have no
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..8fdfeffce5 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -98,9 +98,10 @@ static void reindex_error_callback(void *args);
 static void ReindexPartitions(Oid relid, ReindexParams *params,
 			  bool isTopLevel);
 static void ReindexMultipleInternal(List *relids,
-	ReindexParams *params);
+	ReindexParams *params, bool ispartitioned);
 static bool ReindexRelationConcurrently(Oid relationOid,
-		ReindexParams *params);
+		ReindexParams *params,
+		bool ispartitioned);
 static void update_relispartition(Oid relationId, bool newval);
 static inline void set_indexsafe_procflags(void);
 
@@ -2600,7 +2601,7 @@ ReindexIndex(RangeVar *indexRelation, ReindexParams *params, bool isTopLevel)
 		ReindexPartitions(indOid, params, isTopLevel);
 	else if ((params->options & REINDEXOPT_CONCURRENTLY) != 0 &&
 			 persistence != RELPERSISTENCE_TEMP)
-		ReindexRelationConcurrently(indOid, params);
+		ReindexRelationConcurrently(indOid, params, false);
 	else
 	{
 		ReindexParams newparams = *params;
@@ -2710,7 +2711,7 @@ ReindexTable(RangeVar *relation, ReindexParams *params, bool isTopLevel)
 	else if ((params->options & REINDEXOPT_CONCURRENTLY) != 0 &&
 			 get_rel_persistence(heapOid) != RELPERSISTENCE_TEMP)
 	{
-		result = ReindexRelationConcurrently(heapOid, params);
+		result = ReindexRelationConcurrently(heapOid, params, false);
 
 		if (!result)
 			ereport(NOTICE,
@@ -2942,7 +2943,7 @@ ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
 	 * Process each relation listed in a separate transaction.  Note that this
 	 * commits and then starts a new transaction immediately.
 	 */
-	ReindexMultipleInternal(relids, params);
+	ReindexMultipleInternal(relids, params, false);
 
 	MemoryContextDelete(private_context);
 }
@@ -3046,11 +3047,38 @@ ReindexPartitions(Oid relid, ReindexParams *params, bool isTopLevel)
 		MemoryContextSwitchTo(old_context);
 	}
 
+	/* progress reporting for partitioned indexes */
+	if (relkind == RELKIND_PARTITIONED_INDEX)
+	{
+		const int   progress_index[3] = {
+			PROGRESS_CREATEIDX_COMMAND,
+			PROGRESS_CREATEIDX_INDEX_OID,
+			PROGRESS_CREATEIDX_PARTITIONS_TOTAL
+		};
+
+		int64   progress_val[3] = {
+			(params->options & REINDEXOPT_CONCURRENTLY) != 0 &&
+get_rel_persistence(relid) != RELPERSISTENCE_TEMP ?
+PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY :
+PROGRESS_CREATEIDX_COMMAND_REINDEX,
+			relid,
+			list_length(partitions)
+		};
+
+		Oid tableid = IndexGetRelation(relid, false);
+		pgstat_progress_start_command(PROGRESS_COMMAND_CREATE_INDEX, tableid);
+
+		pgstat_progress_update_multi_param(3, progress_index, progress_val);
+	}
+
 	/*
 	 * Process each partition listed in a separate transaction.  Note that
 	 * this commits and then starts a new transaction immediately.
 	 */
-	ReindexMultipleInternal(partitions, params);
+	ReindexMultipleInternal(parti

libpq PQresultErrorMessage vs PQerrorMessage API issue

2021-02-15 Thread Craig Ringer
Hi all

This morning, for the umpteenth time, I saw:

 some error message: [blank here]

output from a libpq program.

That's because passing a NULL PGresult to PQresultErrorMessage() returns
"". But a NULL PGresult is a normal result from PQexec when it fails to
submit a query due to an invalid connection, when PQgetResult can't get a
result from an invalid connection, etc.

E.g. this pattern:

res = PQexec(conn, "something");

if (PQresultStatus(res) != PGRES_TUPLES_OK)
{
   report_an_error("some error message: %s",
 PQresultErrorMessage(res));
}

... where "res" is NULL because the connection was invalid, so
PQresultErrorMessage(res) emits the empty string.

As a result, using PQerrorMessage(conn) is actually better in most cases,
despite the static error buffer issues. It'll do the right thing when the
connection itself is bad. Alternatively, you end up with the pattern

 res == NULL ? PQerrorMessage(conn) : PQresultErrorMessage(res)
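For instance, a tiny wrapper along those lines might look like this (just a
sketch, not an existing libpq API):

#include <libpq-fe.h>

/* Prefer the per-result message when we actually have a PGresult. */
static const char *
error_text(const PGconn *conn, const PGresult *res)
{
    return res != NULL ? PQresultErrorMessage(res) : PQerrorMessage(conn);
}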

I'm not quite sure what to do about this. Ideally PQresultErrorMessage()
would take the PGconn* too, but it doesn't, and it's not too friendly to
add an extra argument. Plus arguably they mean different things.

Maybe it's as simple as changing the docs to say that you should prefer
PQerrorMessage() if you aren't using multiple PGresults at a time, and
mentioning the need to copy the error string.

But I'd kind of like to instead return a new non-null PGresult
PGRES_RESULT_ERROR whenever we'd currently return a NULL PGresult due to a
failure. Embed the error message into the PGresult, so
PQresultErrorMessage() can fetch it. Because a PGresult needs to be owned
by a PGconn and a NULL PGconn can't own anything, we'd instead return a
pointer to a static const global PGresult with value PGRES_NO_CONNECTION if
any function is passed a NULL PGconn*. That way it doesn't matter if it
gets PQclear()ed or not. And PQclear() could test for (res ==
PGresult_no_connection) and not try to free it if found.
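Roughly, the PQclear() guard could look like this sketch (PGresult_no_connection
is the hypothetical shared sentinel, not an existing libpq symbol):

void
PQclear(PGresult *res)
{
    /* The shared "no connection" sentinel is static and must never be freed. */
    if (res == NULL || res == &PGresult_no_connection)
        return;

    /* ... existing cleanup of the result's storage ... */
}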

The main issue I see there is that existing code may expect a NULL PGresult
and may test for (res == NULL) as an indicator of a query-sending failure
from PQexec etc. So I suspect we'd need a libpq-global option to enable
this behaviour for apps that are aware of it - we wouldn't want to add new
function signature variants after all.

Similar changes would make sense for returning NULL when there are no
result sets remaining after a PQsendQuery, and for returning NULL after
row-by-row fetch mode runs out of rows.


Re: A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Amit Langote
Hi Seamus,

Please see my reply below.

On Tue, Feb 16, 2021 at 1:35 AM Seamus Abshere  wrote:
> On Mon, Feb 15, 2021, at 3:53 AM, Amit Langote wrote:
> > On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere  wrote:
> > > It turns out parallel_workers may be a useful reloption for certain uses 
> > > of partitioned tables, at least if they're made up of fancy column store 
> > > partitions (see 
> > > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> > >
> > > Would somebody tell me what I'm doing wrong? I would love to submit a 
> > > patch but I'm stuck:
>
> > You may see by inspecting the callers of compute_parallel_worker()
> > that it never gets called on a partitioned table, only its leaf
> > partitions.  Maybe you could try calling compute_parallel_worker()
> > somewhere in add_paths_to_append_rel(), which has this code to figure
> > out parallel_workers to use for a parallel Append path for a given
> > partitioned table:
> >
> > /* Find the highest number of workers requested for any subpath. */
> > foreach(lc, partial_subpaths)
> > {
> > Path   *path = lfirst(lc);
> >
> > parallel_workers = Max(parallel_workers, 
> > path->parallel_workers);
> > }
> > Assert(parallel_workers > 0);
> >
> > /*
> >  * If the use of parallel append is permitted, always request at 
> > least
> >  * log2(# of children) workers.  We assume it can be useful to have
> >  * extra workers in this case because they will be spread out across
> >  * the children.  The precise formula is just a guess, but we don't
> >  * want to end up with a radically different answer for a table 
> > with N
> >  * partitions vs. an unpartitioned table with the same data, so the
> >  * use of some kind of log-scaling here seems to make some sense.
> >  */
> > if (enable_parallel_append)
> > {
> > parallel_workers = Max(parallel_workers,
> >fls(list_length(live_childrels)));
> > parallel_workers = Min(parallel_workers,
> >max_parallel_workers_per_gather);
> > }
> > Assert(parallel_workers > 0);
> >
> > /* Generate a partial append path. */
> > appendpath = create_append_path(root, rel, NIL, partial_subpaths,
> > NIL, NULL, parallel_workers,
> > enable_parallel_append,
> > -1);
> >
> > Note that the 'rel' in this code refers to the partitioned table for
> > which an Append path is being considered, so compute_parallel_worker()
> > using that 'rel' would use the partitioned table's
> > rel_parallel_workers as you are trying to do.
>
> Here we go, my first patch... solves 
> https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520e...@www.fastmail.com

Thanks for sending the patch here.

It seems you haven't made enough changes for reloptions code to
recognize parallel_workers as valid for partitioned tables, because
even with the patch applied, I get this:

create table rp (a int) partition by range (a);
create table rp1 partition of rp for values from (minvalue) to (0);
create table rp2 partition of rp for values from (0) to (maxvalue);
alter table rp set (parallel_workers = 1);
ERROR:  unrecognized parameter "parallel_workers"

You need this:

diff --git a/src/backend/access/common/reloptions.c
b/src/backend/access/common/reloptions.c
index 029a73325e..9eb8a0c10d 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -377,7 +377,7 @@ static relopt_int intRelOpts[] =
{
"parallel_workers",
"Number of parallel processes that can be used per
executor node for this relation.",
-   RELOPT_KIND_HEAP,
+   RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
ShareUpdateExclusiveLock
},
-1, 0, 1024

which tells reloptions parsing code that parallel_workers is
acceptable for both heap and partitioned relations.

Other comments on the patch:

1. This function calling style, where the first argument is not placed
on the same line as the function itself, is not very common in
Postgres:

+   /* First see if there is a root-level setting for parallel_workers */
+   parallel_workers = compute_parallel_worker(
+   rel,
+   -1,
+   -1,
+   max_parallel_workers_per_gather
+

This makes the new code look very different from the rest of the
codebase.  Better to stick to existing styles.
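For example, the existing callers lay the call out roughly like this
(illustrative only):

    parallel_workers = compute_parallel_worker(rel, -1, -1,
                                               max_parallel_workers_per_gather);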

2. It might be a good idea to use this opportunity to add a function,
say compute_append_parallel_workers(), for the code that does what the
function name says.  Then the patch will simply add the new
compute_parallel_worker() call at the top of that function.

3. I think we should co

Re: Libpq support to connect to standby server as priority

2021-02-15 Thread Greg Nancarrow
On Fri, Feb 12, 2021 at 2:42 PM vignesh C  wrote:
> > Thanks, just one minor thing I missed in doc/src/sgml/libpq.sgml.
> >
> > +The support of read-write transactions is determined by the
> > value of the
> > +default_transaction_read_only and
> > +in_hot_standby configuration parameters, that is
> > +reported by the server (if supported) upon successful connection.  
> > If
> >
> >
> > should be:
> >
> > +The support of read-write transactions is determined by the
> > values of the
> > +default_transaction_read_only and
> > +in_hot_standby configuration parameters, that 
> > are
> > +reported by the server (if supported) upon successful connection.  
> > If
> >
> >
> > (i.e. "value" -> "values" and "is" -> "are")
>
> Thanks for the comments, this is handled in the v23 patch attached.
> Thoughts?
>

I've marked this as "Ready for Committer".

(and also added you to the author list)


Regards,
Greg Nancarrow
Fujitsu Australia




RE: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Think I have mentioned upthread that this looks better as:
> 
> if (rootResultRelInfo->ri_usesMultiInsert)
> leaf_part_rri->ri_usesMultiInsert = ExecMultiInsertAllowed(leaf_part_rri);
> 
> This keeps the logic confined to ExecInitPartitionInfo() where it
> belongs.  No point in burdening other callers of
> ExecMultiInsertAllowed() in deciding whether or not it should pass a
> valid value for the 2nd parameter.

Oh, that's a good idea.  (Why didn't I think of such a simple idea?)



> Maybe a bit late realizing this, but why does BeginForeignCopy()
> accept a ModifyTableState pointer whereas maybe just an EState pointer
> will do?  I can't imagine why an FDW would want to look at the
> ModifyTableState.  Case in point, I see that
> postgresBeginForeignCopy() only uses the EState from the
> ModifyTableState passed to it.  I think the ResultRelInfo that's being
> passed to the Copy APIs contains most of the necessary information.

You're right.  COPY is not under the control of a ModifyTable plan, so it's 
strange to pass ModifyTableState.


> Also, EndForeignCopy() seems fine with just receiving the EState.

I think this can take the ResultRelInfo, like EndForeignInsert() and 
EndForeignModify(), to correspond to the Begin function: "begin/end COPYing into 
this relation."


> /*
>  * Check whether we support copying data out of the specified relation,
>  * unless the caller also passed a non-NULL data_dest_cb, in which case,
>  * the callback will take care of it
>  */
> if (rel != NULL && rel->rd_rel->relkind != RELKIND_RELATION &&
> data_dest_cb == NULL)
> 
> I just checked that this works or at least doesn't break any newly added 
> tests.

Good idea, too.  The code has become more readable.

Thank you a lot.  Your other comments that are not mentioned above are also 
reflected.  The attached patch passes the postgres_fdw regression test.


Regards
Takayuki Tsunakawa




v16-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch
Description:  v16-0001-Fast-COPY-FROM-into-the-foreign-or-sharded-table.patch


Re: repeated decoding of prepared transactions

2021-02-15 Thread Amit Kapila
On Thu, Feb 11, 2021 at 4:06 PM Amit Kapila  wrote:
>
> On Mon, Feb 8, 2021 at 2:01 PM Markus Wanner
>  wrote:
> >
> Now, coming back to the restart case where the prepared transaction
> can be sent again by the publisher. I understand yours and others
> point that we should not send prepared transaction if there is a
> restart between prepare and commit but there are reasons why we have
> done that way and I am open to your suggestions. I'll once again try
> to explain the exact case to you which is not very apparent. The basic
> idea is that we ship/replay all transactions where commit happens
> after the snapshot has a consistent state (SNAPBUILD_CONSISTENT), see
> atop snapbuild.c for details. Now, for transactions where prepare is
> before snapshot state SNAPBUILD_CONSISTENT and commit prepared is
> after SNAPBUILD_CONSISTENT, we need to send the entire transaction
> including prepare at the commit time. One might think it is quite easy
> to detect that, basically if we skip prepare when the snapshot state
> was not SNAPBUILD_CONSISTENT, then mark a flag in ReorderBufferTxn and
> use the same to detect during commit and accordingly take the decision
> to send prepare but unfortunately it is not that easy. There is always
> a chance that on restart we reuse the snapshot serialized by some
> other Walsender at a location prior to Prepare and if that happens
> then this time the prepare won't be skipped due to snapshot state
> (SNAPBUILD_CONSISTENT) but due to the start_decoding_at point (considering
> we have already shipped some of the later commits but not prepare).
> Now, this will actually become the same situation where the restart
> has happened after we have sent the prepare but not commit. This is
> the reason we have to resend the prepare when the subscriber restarts
> between prepare and commit.
>

After further thinking on this problem and some off-list discussions
with Ajin, there appears to be another way to solve the above problem
by which we can avoid resending the prepare after restart if it has
already been processed by the subscriber. The main reason why we were
not able to distinguish between the two cases ((a) prepare happened
before SNAPBUILD_CONSISTENT state but commit prepared happened after
we reach SNAPBUILD_CONSISTENT state and (b) prepare is already
decoded, successfully processed by the subscriber and we have
restarted the decoding) is that we can re-use the serialized snapshot
at LSN location prior to Prepare of some concurrent WALSender after
the restart. Now, if we ensure that we don't use serialized snapshots
for decoding via slots where two_phase decoding option is enabled then
we won't have that problem. The drawback is that in some cases it can
take a bit more time for initial snapshot building but maybe that is
better than the current solution.

Any suggestions?

-- 
With Regards,
Amit Kapila.




Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

2021-02-15 Thread David Rowley
On Fri, 12 Feb 2021 at 15:18, Andy Fan  wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley  wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after 
> some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  
> However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at 
> different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably 
> will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

*thinks of an example...*

OK, let's say I decided that COUNT(*) is faster than COUNT(id), so I
might like to write a patch which rewrites the query to use COUNT(*)
when it is certain that "id" cannot contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL is incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.

I know there's another discussion here between Ashutosh and Tom about
PathTargets and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.

It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.

David

[1] 
https://www.postgresql.org/message-id/flat/14678.1401639369%40sss.pgh.pa.us#d726d397f86755b64bb09d0c487f975f




Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Fujii Masao



On 2021/02/16 6:28, Andres Freund wrote:

Hi,

On 2021-02-15 19:45:21 +0900, Michael Paquier wrote:

On Mon, Feb 15, 2021 at 10:47:05PM +1300, Thomas Munro wrote:

Why not initialise it in WalRcvShmemInit()?


I was thinking about doing that as well, but we have no real need to
initialize this stuff in most cases, say standalone deployments.  In
particular for the fallback implementation of atomics, we would
prepare a spinlock for nothing.


So what? It's just about free to initialize a spinlock, whether it's
using the fallback implementation or not. Initializing upon walsender
startup adds a lot of complications, because e.g. somebody could already
hold the spinlock because the previous walsender just disconnected, and
they were looking at the stats.


Even if we initialize "writtenUpto" in WalRcvShmemInit(), WalReceiverMain()
still needs to initialize it (reset it to 0) by using pg_atomic_write_u64().

Basically we should not acquire new spinlock while holding another spinlock,
to shorten the spinlock duration. Right? If yes, we need to move
pg_atomic_read_u64() of "writtenUpto" after the release of spinlock in
pg_stat_get_wal_receiver.

Attached is the updated version of the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..316640c275 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -249,7 +249,7 @@ WalReceiverMain(void)
 
SpinLockRelease(&walrcv->mutex);
 
-   pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
+   pg_atomic_write_u64(&WalRcv->writtenUpto, 0);
 
/* Arrange to clean up at walreceiver exit */
on_shmem_exit(WalRcvDie, 0);
@@ -1325,7 +1325,6 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
state = WalRcv->walRcvState;
receive_start_lsn = WalRcv->receiveStart;
receive_start_tli = WalRcv->receiveStartTLI;
-   written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
flushed_lsn = WalRcv->flushedUpto;
received_tli = WalRcv->receivedTLI;
last_send_time = WalRcv->lastMsgSendTime;
@@ -1338,6 +1337,14 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
strlcpy(conninfo, (char *) WalRcv->conninfo, sizeof(conninfo));
SpinLockRelease(&WalRcv->mutex);
 
+   /*
+* Read "writtenUpto" without holding a spinlock. So it may not be
+* consistent with other WAL receiver's shared variables protected by a
+* spinlock. This is OK because that variable is used only for
+* informational purpose and should not be used for data integrity checks.
+*/
+   written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
+
/*
 * No WAL receiver (or not ready yet), just return a tuple with NULL
 * values
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 69b91a7dab..63e60478ea 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -63,6 +63,7 @@ WalRcvShmemInit(void)
MemSet(WalRcv, 0, WalRcvShmemSize());
WalRcv->walRcvState = WALRCV_STOPPED;
SpinLockInit(&WalRcv->mutex);
+   pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
WalRcv->latch = NULL;
}
 }
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 81bdacf59d..6d048e309c 100644
index 81bdacf59d..6d048e309c 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -83,6 +83,13 @@ select count(*) = 1 as ok from pg_stat_wal;
  t
 (1 row)
 
+-- We expect no walreceiver running in this test
+select count(*) = 0 as ok from pg_stat_wal_receiver;
+ ok 
+
+ t
+(1 row)
+
 -- This is to record the prevailing planner enable_foo settings during
 -- a regression test run.
 select name, setting from pg_settings where name like 'enable%';
diff --git a/src/test/regress/sql/sysviews.sql b/src/test/regress/sql/sysviews.sql
index b9b875bc6a..dc8c9a3ac2 100644
--- a/src/test/regress/sql/sysviews.sql
+++ b/src/test/regress/sql/sysviews.sql
@@ -40,6 +40,9 @@ select count(*) >= 0 as ok from pg_prepared_xacts;
 -- There must be only one record
 select count(*) = 1 as ok from pg_stat_wal;
 
+-- We expect no walreceiver running in this test
+select count(*) = 0 as ok from pg_stat_wal_receiver;
+
 -- This is to record the prevailing planner enable_foo settings during
 -- a regression test run.
 select name, setting from pg_settings where name like 'enable%';


Re: logical replication seems broken

2021-02-15 Thread Amit Kapila
On Mon, Feb 15, 2021 at 7:50 PM vignesh C  wrote:
>
> On Mon, Feb 15, 2021 at 6:14 PM  wrote:
> >
> >
> > > On 2021.02.15. 12:31 Amit Kapila  wrote:
> > > On Mon, Feb 15, 2021 at 11:53 AM vignesh C  wrote:
> > > > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
> > > > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > > > Maybe a test (check or check-world) should be added to run a second 
> > > > > replica?  (Assuming that would have caught this bug)
> > > > >
> > > > +1 for the idea of having a test for this. I have written a test for 
> > > > this.
> > > > Thanks for the fix Amit, I could reproduce the issue without your fix
> > > > and verified that the issue gets fixed with the patch you shared.
> > > > Attached a patch for the same. Thoughts?
> > > >
> > >
> > > I have slightly modified the comments in the test case to make things
> > > clear. I am planning not to backpatch this because there is no way in
> > > the core code to hit this prior to commit ce0fdbfe97 and we haven't
> > > received any complaints so far. What do you think?
> >
> > My tests indeed run OK with this.
> >
> > (I haven't tested whether the newly added test actually tests for the 
> > problem that was there - I suppose one of you did that)
> >
>
> I could re-create the scenario that you had faced with this test. This
> test case is a simplified test of your script, I have removed the use
> of pgbench, reduced the number of tables used and simulated the same
> problem with the similar replication setup that you had used.
>

Yeah, I was also able to see an issue with this test without a patch
and it got fixed after the patch. Pushed!

-- 
With Regards,
Amit Kapila.




Re: 64-bit XIDs in deleted nbtree pages

2021-02-15 Thread Peter Geoghegan
On Mon, Feb 15, 2021 at 3:15 AM Masahiko Sawada  wrote:
> Yes. I think this would simplify the problem by resolving almost all
> problems related to indefinite deferring page recycle.
>
> We will be able to recycle almost all just-deleted pages in practice
> especially when btvacuumscan() took a long time. And there would not
> be a noticeable downside, I think.

Great!

> BTW if btree index starts to use maintenance_work_mem for this purpose,
> we also need to set amusemaintenanceworkmem to true which is
> considered when parallel vacuum.

I was just going to use work_mem. This should be okay. Note that
CREATE INDEX uses an additional work_mem allocation when building a
unique index, for the second spool/tuplesort. That seems like a
precedent that I can follow here.

Right now the BTPendingRecycle structs the patch uses to store
information about a page that the current VACUUM deleted (and may yet
be able to place in the FSM) are each 16 bytes (including alignment
overhead). I could probably make them smaller with a little work, but
even now that's quite small. Even with the default 4MiB work_mem
setting we can fit information about 262144 pages all at once. That's
2GiB worth of deleted index pages, which is generally much more than
we'll need.
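
For reference, a hypothetical layout consistent with the 16 bytes mentioned
above (not necessarily the exact struct in the posted patch; BlockNumber is
4 bytes and FullTransactionId is 8):

typedef struct BTPendingRecycle
{
    BlockNumber         target;     /* block number of the deleted page */
    FullTransactionId   safexid;    /* page can be recycled once this XID
                                     * is older than the oldest snapshot */
} BTPendingRecycle;                 /* 4 + 8 bytes, padded to 16 by alignment */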

> Yeah, increasing the threshold would solve the problem in most cases.
> Given that nbtree index page deletion is unlikely to happen in
> practice, having the threshold 5% or 10% seems to avoid the problem in
> nearly 100% of cases, I think.

Of course it all depends on workload/index characteristics, in the
end. It is very rare to delete a percentage of the index that exceeds
autovacuum_vacuum_scale_factor -- that's the important thing here IMV.

> Another idea I come up with (maybe on top of above your idea) is to
> change btm_oldest_btpo_xact to 64-bit XID and store the *newest*
> btpo.xact XID among all deleted pages when the total amount of deleted
> pages exceeds 2% of index. That way, we surely can recycle more than
> 2% of index when the XID becomes older than the global xmin.

You could make my basic approach to recycling deleted pages earlier
(ideally at the end of the same btvacuumscan() that deleted the pages
in the first place) more sophisticated in a variety of ways. These are
all subject to diminishing returns, though.

I've already managed to recycle close to 100% of all B-Tree pages
during the same VACUUM with a very simple approach -- at least if we
assume BenchmarkSQL is representative. It is hard to know how much
more effort can be justified. To be clear, I'm not saying that an
improved design cannot be justified now or in the future (BenchmarkSQL
is not like many workloads that people use Postgres for). I'm just
saying that I *don't know* where to draw the line. Any particular
place that we draw the line feels a little arbitrary to me. This
includes my own choice of the work_mem-limited BTPendingRecycle array.
My patch currently works that way because it's simple -- no other
reason.

Any scheme to further improve the "work_mem-limited BTPendingRecycle
array" design from my patch boils down to this: A new approach that
makes recycling of any remaining deleted pages take place "before too
long": After the end of the btvacuumscan() BTPendingRecycle array
stuff (presumably that didn't work out in cases where an improved
approach matters), but before the next VACUUM takes place (since that
will do the required recycling anyway, unless it's unable to do any
work at all, in which case it hardly matters). Here are two ideas of
my own in this same class as your idea:

1. Remember to do some of the BTPendingRecycle array FSM processing
stuff in btvacuumcleanup() -- defer some of the recycling of pages
recorded in BTPendingRecycle entries (pages deleted during
btbulkdelete() for the same VACUUM) until btvacuumcleanup() is called.

Right now btvacuumcleanup() will always do nothing when btbulkdelete()
was called earlier. But that's just a current nbtree convention, and
is no reason to not do this (we don't have to scan the index again at
all). The advantage of teaching btvacuumcleanup() to do this is that
it delays the "BTPendingRecycle array FSM processing" stuff until the
last moment that it is still easy to use the in-memory array (because
we haven't freed it yet). In general, doing it later makes it more
likely that we'll successfully recycle the pages. Though in general it
might not make any difference -- so we're still hoping that the
workload allows us to recycle everything we deleted, without making
the design much more complicated than what I posted already.

(BTW I see that you reviewed commit 4e514c61, so you must have thought
about the trade-off between doing deferred recycling in
amvacuumcleanup() vs ambulkdelete(), when to call
IndexFreeSpaceMapVacuum(), etc. But there is no reason why we cannot
implement this idea while calling IndexFreeSpaceMapVacuum() during
both btvacuumcleanup() and btbulkdelete(), so that we get the best of
both worlds -

Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses

2021-02-15 Thread Ian Lawrence Barwick
2021年2月16日(火) 10:20 Michael Paquier :

> On Mon, Feb 15, 2021 at 03:57:04PM +0900, Ian Lawrence Barwick wrote:
> > Indeed it does. Not the most exciting of use cases, though I imagine it
> > might come in handy for anyone developing an extension, and the
> > existing implementation is inconsistent (in place for ALTER INDEX,
> > and partially for ALTER MATERIALIZED VIEW, but not the others).
> > Patch suggestion attached.
>
> Thanks.
>
> -   else if (Matches("ALTER", "INDEX", MatchAny, "NO", "DEPENDS"))
> -   COMPLETE_WITH("ON EXTENSION");
> -   else if (Matches("ALTER", "INDEX", MatchAny, "DEPENDS"))
> -   COMPLETE_WITH("ON EXTENSION");
> The part, if removed, means that typing "alter index my_index no " is
> not able to complete with "DEPENDS ON EXTENSION" anymore.  So it seems
> to me that ALTER INDEX got that right, and that the other commands had
> better do the same.
>

Hmm, with the current implementation "alter index my_index no " doesn't work
anyway; you'd need to add this before the above lines:

+   else if (Matches("ALTER", "INDEX", MatchAny, "NO"))
+   COMPLETE_WITH("DEPENDS");

so AFAICT the patch doesn't change that behaviour. It does mean "alter index
my_index no depends " no longer completes to "ON EXTENSION", but if you've
typed one of "NO" or "DEPENDS" in that context, "ON EXTENSION" is the only
completion, so I'm not sure what's gained by forcing the user to hit TAB
twice.

There are quite a few tab completions consisting of more than one word
(e.g. "MATERIALIZED VIEW", "FORCE ROW LEVEL SECURITY") where tab completion is
ineffective after the first word followed by a space, e.g. "alter materialized
" doesn't result in any expansion either. I suppose we could go through all
those and handle each word individually, but presumably there's a reason why
that hasn't been done already (maybe no-one has complained?).

Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com


Re: Support ALTER SUBSCRIPTION ... ADD/DROP PUBLICATION ... syntax

2021-02-15 Thread Bharath Rupireddy
On Mon, Feb 15, 2021 at 8:13 AM Bharath Rupireddy
 wrote:
>
> On Sat, Feb 13, 2021 at 11:41 AM japin  wrote:
> > > IIUC, with the current patch, the new ALTER SUBSCRIPTION ... ADD/DROP
> > > errors out on the first publication that already exists/that doesn't
> > > exist right? What if there are multiple publications given in the
> > > ADD/DROP list, and few of them exist/don't exist. Isn't it good if we
> > > loop over the subscription's publication list and show all the already
> > > existing/not existing publications in the error message, instead of
> > > just erroring out for the first existing/not existing publication?
> > >
> >
> > Yes, you are right. Agree with you, I modified it. Please consider v5
> > for further review.
>
> Thanks for the updated patches. I have a comment about reporting the
> existing/not existing publications code. How about something like the
> attached delta patch on v5-0002?

Attaching the v6 patch set so that cfbot can proceed to test the
patches. The above delta patch was merged into 0002. Please have a
look.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v6-0001-Export-textarray_to_stringlist.patch
Description: Binary data


v6-0002-Introduce-a-new-syntax-to-add-drop-publications.patch
Description: Binary data


v6-0003-Test-ALTER-SUBSCRIPTION-.-ADD-DROP-PUBLICATION.patch
Description: Binary data


v6-0004-Add-documentation-for-ALTER-SUBSCRIPTION.ADD-DROP.patch
Description: Binary data


v6-0005-Add-tab-complete-for-ALTER-SUBSCRIPTION.ADD-DROP.patch
Description: Binary data


Re: proposal - psql - use pager for \watch command

2021-02-15 Thread Thomas Munro
On Fri, Jan 8, 2021 at 10:36 PM Pavel Stehule  wrote:
> ne 19. 4. 2020 v 19:27 odesílatel Pavel Stehule  
> napsal:
>> last week I finished pspg 3.0 https://github.com/okbob/pspg . pspg now 
>> supports pipes, named pipes very well. Today the pspg can be used as pager 
>> for output of \watch command. Sure, psql needs attached patch.
>>
>> I propose a new psql environment variable, PSQL_WATCH_PAGER. When this
>> variable is not empty, the \watch command starts the specified pager and
>> redirects output to the related pipe. When the pipe is closed by the pager,
>> the \watch cycle is exited.
>>
>> If you want to test proposed feature, you need a pspg with 
>> cb4114f98318344d162a84b895a3b7f8badec241 commit.
>>
>> Then you can set your env
>>
>> export PSQL_WATCH_PAGER="pspg --stream"
>> psql
>>
>> SELECT * FROM pg_stat_database;
>> \watch 1
>>
>> Comments, notes?

I tried this out with pspg 4.1 from my package manager.  It seems
really useful, especially for demos.  I like it!

 * Set up rendering options, in particular, disable the pager, because
 * nobody wants to be prompted while watching the output of 'watch'.
 */
-   myopt.topt.pager = 0;
+   if (!pagerpipe)
+   myopt.topt.pager = 0;

Obsolete comment.

+static bool sigpipe_received = false;

This should be "static volatile sig_atomic_t", and I suppose our
conventional name for that variable would be got_SIGPIPE.  Would it be
possible to ignore SIGPIPE instead, and then rely on another way of
knowing that the pager has quit?  But... hmm:

-   longs = Min(i, 1000L);
+   longs = Min(i, pagerpipe ? 100L : 1000L);

I haven't studied this (preexisting) polling loop, but I don't like
it.  I understand that it's there because on some systems, pg_usleep()
won't wake up for SIGINT (^C), but now it's being used for a secondary
purpose, that I haven't fully understood.  After I quit pspg (by
pressing q) while running \watch 10, I have to wait until the end of a
10 second cycle before it tries to write to the pipe again, unless I
also press ^C.  I feel like it has to be possible to achieve "precise"
behaviour somehow when you quit; maybe something like waiting for
readiness on the pager's stderr, or something like that -- I haven't
thought hard about this and I admit that I have no idea how this works
on Windows.
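
For illustration, the conventional shape of such a flag and handler is roughly
the following; this is only a self-contained sketch, and the handler name and
the way psql would actually install it (pqsignal() rather than signal()) are
assumptions on my part:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_SIGPIPE = 0;

static void
handle_sigpipe(int signo)
{
    (void) signo;
    got_SIGPIPE = 1;        /* nothing but async-signal-safe work in here */
}

int
main(void)
{
    signal(SIGPIPE, handle_sigpipe);    /* psql itself would use pqsignal() */

    /* ... write \watch output to the pager pipe here ... */

    if (got_SIGPIPE)
        fprintf(stderr, "pager went away, leaving the \\watch loop\n");
    return 0;
}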

Sometimes I see a message like this after I quit pspg:

postgres=# \watch 10
input stream was closed




Re: PATCH: Batch/pipelining support for libpq

2021-02-15 Thread Craig Ringer
On Tue, 16 Feb 2021 at 09:19, Craig Ringer 
wrote:
>
> So long as there is a way to "send A", "send B", "send C", "read results
> from A", "send D", and there's a way for the application to associate some
> kind of state (an application specific id or index, a pointer to an
> application query-queue struct, whatever)  so it can match queries to
> results ... then I'm happy.
>

FWIW I'm also thinking of revising the docs to mostly use the term
"pipeline" instead of "batch". Use "pipelining and batching" in the chapter
topic, and mention "batch" in the index, and add a para that explains how
to run batches on top of pipelining, but otherwise use the term "pipeline"
not "batch".

That's because pipelining isn't actually batching, and using it as a naïve
batch interface will get you in trouble. If you PQsendQuery(...) a long
list of queries without consuming results, you'll block unless you're in
libpq's nonblocking-send mode. You'll then be deadlocked because the server
can't send results until you consume some (tx buffer full) and can't
consume queries until it can send some results, but you can't consume
results because you're blocked on a send that'll never end.
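
Whatever the batch/pipeline entry points end up being called, the I/O
discipline needed to avoid that deadlock can be shown with libpq's
long-standing nonblocking primitives. A rough, non-authoritative sketch of the
interleaving (not the proposed API itself):

#include <sys/select.h>
#include <libpq-fe.h>

/*
 * Keep flushing queued outbound data, but drain incoming results at the
 * same time, so neither side's buffers can fill up behind us.
 */
static void
pump_connection(PGconn *conn)
{
    PQsetnonblocking(conn, 1);

    for (;;)
    {
        int         sock = PQsocket(conn);
        int         more_to_send = PQflush(conn);   /* 1 = partial, 0 = done, -1 = error */
        fd_set      rfds,
                    wfds;

        if (more_to_send < 0 || sock < 0)
            break;

        FD_ZERO(&rfds);
        FD_ZERO(&wfds);
        FD_SET(sock, &rfds);
        if (more_to_send)
            FD_SET(sock, &wfds);

        if (select(sock + 1, &rfds, &wfds, NULL, NULL) < 0)
            break;

        if (FD_ISSET(sock, &rfds))
        {
            if (!PQconsumeInput(conn))
                break;
            while (!PQisBusy(conn))
            {
                PGresult   *res = PQgetResult(conn);

                if (res == NULL)
                    break;      /* results of the current query are done */
                /* here the application matches res to its own query queue */
                PQclear(res);
            }
        }

        if (more_to_send == 0)
            break;              /* nothing left queued; caller can send more */
    }
}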

An actual batch interface where you can bind and send a long list of
queries might be worthwhile, but should be addressed separately, and it'd
be confusing if this pipelining interface claimed the term "batch". I'm not
convinced enough application developers actually code directly against
libpq for it to be worth creating a pretty batch interface where you can
submit (say) an array of struct PQbatchEntry { const char * query, int
nparams, ... } to a PQsendBatch(...) and let libpq handle the socket I/O
for you. But if someone wants to add one later it'll be easier if we don't
use the term "batch" for the pipelining feature.

I'll submit a reworded patch in a bit.


Re: Implementing Incremental View Maintenance

2021-02-15 Thread Yugo NAGATA
Hi,

Attached is a rebased patch (v22a).

Regards,
Yugo Nagata


-- 
Yugo NAGATA 


IVM_patches_v22a.tar.gz
Description: application/gzip


Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread John Naylor
On Mon, Feb 15, 2021 at 9:18 AM Heikki Linnakangas  wrote:
>

Attached is the first attempt at using SSE4 to do the validation, but first
I'll answer your questions about the fallback.

I should mention that v2 had a correctness bug for 4-byte characters that I
found when I was writing regression tests. It shouldn't materially affect
performance, however.

> I thought the "chinese" numbers above are pure multibyte input, and it
> seems to do well on that. Where does it regress? In multibyte encodings
> other than UTF-8?

Yes, the second set of measurements was intended to represent multibyte
encodings other than UTF-8. But instead of using one of those encodings, I
simulated non-UTF-8 by copying the pattern used for those: in the loop,
check for ascii then either advance or verify one character. It was a quick
way to use the same test.

> How bad is the regression?

I'll copy the measurements here together with master so it's easier to
compare:

~= PG13 (revert 6c5576075b0f9 and b80e10638e3):

 chinese | mixed | ascii
-+---+---
1565 |   848 |   365

master:

 chinese | mixed | ascii
-+---+---
1081 |   761 |   366

ascii fast-path plus pg_*_verifychar():

 chinese | mixed | ascii
-+---+---
1279 |   656 |94

As I mentioned upthread, pure multibyte is still faster than PG13. Reducing
the ascii check to 8 bytes at a time might alleviate the regression.

> I tested this on my first generation Raspberry Pi (chipmunk). I had to
> tweak it a bit to make it compile, since the SSE autodetection code was
> not finished yet. And I used generate_series(1, 1000) instead of
> generate_series(1, 1) in the test script (mbverifystr-speed.sql)
> because this system is so slow.
>
> master:
>
>   mixed | ascii
> ---+---
>1310 |  1041
> (1 row)
>
> v2-add-portability-stub-and-new-fallback.patch:
>
>   mixed | ascii
> ---+---
>2979 |   910
> (1 row)
>
> I'm guessing that's because the unaligned access in check_ascii() is
> expensive on this platform.

Hmm, I used memcpy() as suggested. Is that still slow on that platform?
That's 32-bit, right? Some possible remedies:

1) For the COPY FROM case, we should align the allocation on a cacheline --
we already have examples of that idiom elsewhere. I was actually going to
suggest doing this anyway, since unaligned SIMD loads are often slower, too.

2) As the simdjson fallback was based on Fuchsia (the Lemire paper implies
it was tested carefully on Arm and I have no reason to doubt that), I could
try to follow that example more faithfully by computing the actual
codepoints. It's more computation and just as many branches as far as I can
tell, but it's not a lot of work. I can add that alternative fallback to
the patch set. I have no Arm machines, but I can test on a POWER8 machine.

3) #ifdef out the ascii check for 32-bit platforms.

4) Same as the non-UTF8 case -- only check for ascii 8 bytes at a time.
I'll probably try this first.
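
For what it's worth, a minimal sketch of what the 8-bytes-at-a-time check and
the surrounding loop could look like; the names and the per-character stub are
mine, not from the patch, and the real code would also need to reject embedded
NUL bytes in the ascii fast path:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Stand-in for the real per-character verifier (pg_*_verifychar); here it
 * just accepts one byte so the example is self-contained.
 */
static int
verify_one_char(const unsigned char *s, int len)
{
    (void) s;
    return (len > 0) ? 1 : -1;
}

/* True if the next 8 bytes are all ASCII (no high bits set). */
static bool
chunk_is_ascii8(const unsigned char *s)
{
    uint64_t    chunk;

    memcpy(&chunk, s, sizeof(chunk));       /* avoids unaligned loads */
    return (chunk & UINT64_C(0x8080808080808080)) == 0;
}

/* Returns the number of valid bytes at the start of s. */
static int
verify_string(const unsigned char *s, int len)
{
    const unsigned char *start = s;

    while (len > 0)
    {
        int         l;

        if (len >= 8 && chunk_is_ascii8(s))
            l = 8;                          /* fast path: skip a whole chunk */
        else
        {
            l = verify_one_char(s, len);    /* slow path: one character */
            if (l <= 0)
                break;
        }
        s += l;
        len -= l;
    }
    return (int) (s - start);
}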

Now, I'm pleased to report that I got the SSE4 path working. It still needs
some stress testing to find any corner-case bugs, but it shouldn't be too
early to share some numbers on Clang 10 / macOS:

master:

 chinese | mixed | ascii
-+---+---
1082 |   751 |   364

v3 with SSE4.1:

 chinese | mixed | ascii
-+---+---
 127 |   128 |   126

Some caveats and notes:

- It takes almost no recognizable code from simdjson, but it does take the
magic constants lookup tables almost verbatim. The main body of the code
has no intrinsics at all (I think). They're all hidden inside static inline
helper functions. I reused some cryptic variable names from simdjson. It's
a bit messy but not terrible.

- It diffs against the noError conversion patch and adds additional tests.

- It's not smart enough to stop at the last valid character boundary --
it's either all-valid or it must start over with the fallback. That will
have to change in order to work with the proposed noError conversions. It
shouldn't be very hard, but needs thought as to the clearest and safest way
to code it.

- There is no ascii fast-path yet. With this algorithm we have to be a bit
more careful since a valid ascii chunk could be preceded by an incomplete
sequence at the end of the previous chunk. Not too hard, just a bit more
work.

- This is my first time hacking autoconf, and it still seems slightly
broken, yet functional on my machine at least.

- It only needs SSE4.1, but I didn't want to create a whole new CFLAGS, so
it just reuses SSE4.2 for the runtime check and the macro names. Also, it
doesn't test for SSE2, it just insists on 64-bit for the runtime check. I
imagine it would refuse to build on 32-bit machines if you passed it -msse42.
- There is a placeholder for Windows support, but it's not developed.

- I had to add a large number of casts to get rid of warnings in the magic
constants macros. That needs some polish.

I also attached a C file that v

Re: Snapshot scalability patch issue

2021-02-15 Thread Peter Geoghegan
On Mon, Feb 15, 2021 at 5:30 PM Andres Freund  wrote:
> Done. Thanks for noticing/reporting!

Great, thanks!

-- 
Peter Geoghegan




Re: Snapshot scalability patch issue

2021-02-15 Thread Andres Freund
Hi,

On 2021-02-15 15:08:40 -0800, Andres Freund wrote:
> On 2021-02-14 18:42:18 -0800, Peter Geoghegan wrote:
> > The call to heap_page_prune() within lazy_scan_heap() passes a bool
> > literal ('false') as its fourth argument. But the fourth argument is
> > of type TransactionId, not bool. This has been the case since the
> > snapshot scalability work performed by commit dc7420c2c92. Surely
> > something is amiss here.
> 
> Looks like I accidentally swapped the InvalidTransactionId and false
> around - luckily they have the same actual bit pattern...
> 
> I do wish C could pass arguments by name.
> 
> I'll push something once I'm back at my computer...

Done. Thanks for noticing/reporting!

Greetings,

Andres Freund




Fwd: Row description Metadata information

2021-02-15 Thread Aleksei Ivanov
Not sure that previous email was sent correctly. If it was duplicated,
sorry for the inconvenience.

Hi, hackers,

I have one question related to the information returned in the row description
for a prepared statement.

For example, Select $1 * 2, and then Bind 1.6 to it.
The returned result is correct and equal to 3.2, but the type modifier in the
row description is -1, which does not look correct.

Does someone know where this modifier is calculated? Is this a bug or
intentional behavior?

Best regards,
Aleksei Ivanov


Re: [DOC] add missing "[ NO ]" to various "DEPENDS ON" synopses

2021-02-15 Thread Michael Paquier
On Mon, Feb 15, 2021 at 03:57:04PM +0900, Ian Lawrence Barwick wrote:
> Indeed it does. Not the most exciting of use cases, though I imagine it
> might come in handy for anyone developing an extension, and the
> existing implementation is inconsistent (in place for ALTER INDEX,
> and partially for ALTER MATERIALIZED VIEW, but not the others).
> Patch suggestion attached.

Thanks.

-   else if (Matches("ALTER", "INDEX", MatchAny, "NO", "DEPENDS"))
-   COMPLETE_WITH("ON EXTENSION");
-   else if (Matches("ALTER", "INDEX", MatchAny, "DEPENDS"))
-   COMPLETE_WITH("ON EXTENSION");
This part, if removed, means that typing "alter index my_index no " would no
longer complete with "DEPENDS ON EXTENSION".  So it seems to me that ALTER
INDEX got that right, and that the other commands had better do the same.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: Batch/pipelining support for libpq

2021-02-15 Thread Craig Ringer
On Thu, 11 Feb 2021 at 07:51, Alvaro Herrera 
wrote:

> On 2021-Jan-21, Alvaro Herrera wrote:
>
> > As you can see in an XXX comment in the libpq test program, the current
> > implementation has the behavior that PQgetResult() returns NULL after a
> > batch is finished and has reported PGRES_BATCH_END.  I don't know if
> > there's a hard reason to do that, but I'd like to supress it because it
> > seems weird and out of place.
>
> Hello Craig, a question for you since this API is your devising.  I've
> been looking at changing the way this works to prevent those NULL
> returns from PQgetResult.  That is, instead of having what seems like a
> "NULL separator" of query results, you'd just get the PGRES_BATCH_END to
> signify a batch end (not a NULL result after the BATCH_END); and the
> normal PGRES_COMMAND_OK or PGRES_TUPLES_OK etc when the result of a
> command has been sent.  It's not working yet so I'm not sending an
> updated patch, but I wanted to know if you had a rationale for including
> this NULL return "separator" or was it just a convenience because of how
> the code grew together.
>

The existing API for libpq actually specifies[1] that you should call
PQgetResult() until it returns NULL:

> After successfully calling PQsendQuery, call PQgetResult one or more
times to obtain the results. PQsendQuery cannot be called again (on the
same connection) until PQgetResult has returned a null pointer, indicating
that the command is done.

Similarly, in single-row mode, the existing API specifies that you should
call PQgetResult() until it returns NULL.
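
For reference, the pattern the existing documentation implies looks roughly
like this (a sketch with error handling trimmed):

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Drain every result of the previously sent command; the NULL return is
 * what tells us the command is fully done and the next one may be sent.
 */
static void
drain_results(PGconn *conn)
{
    PGresult   *res;

    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_FATAL_ERROR)
            fprintf(stderr, "%s", PQerrorMessage(conn));
        PQclear(res);
    }
}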

Also, IIRC the protocol already permits multiple result sets to be
returned, and the caller cannot safely assume that a single PQsendQuery()
will produce only a single result set. (I really should write a test
extension that exercises that and how libpq reacts to it.)

That's why I went for the NULL return. I think. It's been a while now so
it's a bit fuzzy.

I would definitely like an API that does not rely on testing for a NULL
return. Especially since NULL return can be ambiguous in the context of
row-at-a-time mode. New explicit enumerations for PGresult would make a lot
more sense.

So +1 from me for the general idea. I need to look at the patch as it has
evolved soon too.

Remember that the original patch I submitted for this was a 1-day weekend
hack and proof of concept to show that libpq could be modified to support
query pipelining (and thus batching too), so I could illustrate the
performance benefits that could be attained by doing so. I'd been aware of
the benefits and the protocol's ability to support it for some time thanks
to working on PgJDBC, but couldn't get anyone interested without some C
code to demonstrate it, so I wrote some. I am not going to argue that the
API I added for it is ideal in any way, and happy to see improvements.

The only change I would very strongly object to would be anything that
turned this into a *batch* mode without query-pipelining support. If you
have to queue all the queries up in advance then send them as a batch and
wait for all the results, you miss out on a lot of the potential
round-trip-time optimisations and you add initial latency. So long as there
is a way to "send A", "send B", "send C", "read results from A", "send D",
and there's a way for the application to associate some kind of state (an
application specific id or index, a pointer to an application query-queue
struct, whatever)  so it can match queries to results ... then I'm happy.

[1]
https://www.postgresql.org/docs/current/libpq-async.html#LIBPQ-PQSENDQUERY


Re: Snapbuild woes followup

2021-02-15 Thread Andres Freund
Hi,

On 2021-01-29 14:04:47 +0530, Amit Kapila wrote:
> On Tue, Jan 26, 2021 at 2:18 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2021-01-25 12:00:08 -0800, Andres Freund wrote:
> > > > >   /*
> > > > >* For backward compatibility reasons this has to be stored in the 
> > > > > wrongly
> > > > >* named field.  Will be fixed in next major version.
> > > > >*/
> > > > >   return builder->was_running.was_xmax;
> > > >
> > > > We could fix that now... Andres, what did you have in mind for a proper
> > > > name?
> > >
> > > next_phase_at seems like it'd do the trick?
> >
> 
> The new proposed name sounds good to me.

And pushed.

> > See attached patch...
> 
> LGTM.

Thanks for looking over - should have added your name to reviewed-by,
sorry...

Greetings,

Andres Freund




Re: pg_replication_origin_session_setup and superuser

2021-02-15 Thread Michael Paquier
On Mon, Feb 15, 2021 at 09:37:53AM +, Zohar Gofer wrote:
> In my mind the requirement for superuser is too strong. I think that
> requiring the privileges of a replication user is more suitable. This
> way we can require that only a user with replication privileges will
> actually do replication, even if this is not really replication.

PostgreSQL 14 will remove those hardcoded superuser checks.  Please
see this thread:
https://www.postgresql.org/message-id/capdie1xjmzokql3dghmurpqyszkgwzsmxetfkkhynbab7-0...@mail.gmail.com
And its related commit:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cc072641d41c55c6aa24a331fc1f8029e0a8d799

While the default is still superuser-only, it becomes possible to
grant access to this stuff to other roles that have no need to be
superusers.
--
Michael


signature.asc
Description: PGP signature


RE: [HACKERS] logical decoding of two-phase transactions

2021-02-15 Thread osumi.takami...@fujitsu.com
Hi


On Tuesday, February 16, 2021 8:33 AM Peter Smith 
> On Fri, Feb 12, 2021 at 5:59 PM osumi.takami...@fujitsu.com
>  wrote:
> > (2)
> >
> > File : v39-0006-Support-2PC-txn-Subscription-option.patch
> >
> > @@ -213,6 +219,15 @@ parse_subscription_options(List *options,
> > *streaming_given = true;
> > *streaming = defGetBoolean(defel);
> > }
> > +   else if (strcmp(defel->defname, "two_phase") == 0 &&
> twophase)
> > +   {
> > +   if (*twophase_given)
> > +   ereport(ERROR,
> > +
> (errcode(ERRCODE_SYNTAX_ERROR),
> > +errmsg("conflicting or
> redundant options")));
> > +   *twophase_given = true;
> > +   *twophase = defGetBoolean(defel);
> > +   }
> >
> > You can add this test in subscription.sql easily with double twophase
> options.
> 
> Thanks for the feedback. You are right.
> 
> But in the pgoutput.c there are several other potential syntax errors
> "conflicting or redundant options" which are just like this "two_phase" one.
> e.g. there is the same error for options "proto_version", "publication_names",
> "binary", "streaming".
> 
> AFAIK none of those other syntax errors had any regression tests. That is the
> reason why I did not include any new test for the "two_phase"
> option.
> 
> So:
> a) should I add a new test per your feedback comment, or
> b) should I be consistent with the other similar errors, and not add the test?
> 
> Of course it is easy to add a new test if you think option (a) is best.
> 
> Thoughts?
OK. Then we can assume that, previously, such tests for the other options
were considered unnecessary because the results are too obvious.
Let's choose (b) to keep the patch set aligned with similar existing code.
Thanks.

Best Regards,
Takamichi Osumi


Re: [HACKERS] logical decoding of two-phase transactions

2021-02-15 Thread Peter Smith
On Fri, Feb 12, 2021 at 5:59 PM osumi.takami...@fujitsu.com
 wrote:
> (2)
>
> File : v39-0006-Support-2PC-txn-Subscription-option.patch
>
> @@ -213,6 +219,15 @@ parse_subscription_options(List *options,
> *streaming_given = true;
> *streaming = defGetBoolean(defel);
> }
> +   else if (strcmp(defel->defname, "two_phase") == 0 && twophase)
> +   {
> +   if (*twophase_given)
> +   ereport(ERROR,
> +   
> (errcode(ERRCODE_SYNTAX_ERROR),
> +errmsg("conflicting or 
> redundant options")));
> +   *twophase_given = true;
> +   *twophase = defGetBoolean(defel);
> +   }
>
> You can add this test in subscription.sql easily with double twophase options.

Thanks for the feedback. You are right.

But in the pgoutput.c there are several other potential syntax errors
"conflicting or redundant options" which are just like this
"two_phase" one.
e.g. there is the same error for options "proto_version",
"publication_names", "binary", "streaming".

AFAIK none of those other syntax errors had any regression tests. That
is the reason why I did not include any new test for the "two_phase"
option.

So:
a) should I add a new test per your feedback comment, or
b) should I be consistent with the other similar errors, and not add the test?

Of course it is easy to add a new test if you think option (a) is best.

Thoughts?

-
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Snapshot scalability patch issue

2021-02-15 Thread Andres Freund
Hi,

On 2021-02-14 18:42:18 -0800, Peter Geoghegan wrote:
> The call to heap_page_prune() within lazy_scan_heap() passes a bool
> literal ('false') as its fourth argument. But the fourth argument is
> of type TransactionId, not bool. This has been the case since the
> snapshot scalability work performed by commit dc7420c2c92. Surely
> something is amiss here.

Looks like I accidentally swapped the InvalidTransactionId and false
around - luckily they have the same actual bit pattern...

I do wish C could pass arguments by name.

I'll push something once I'm back at my computer...

Greetings,

Andres Freund




Re: CREATE INDEX CONCURRENTLY on partitioned index

2021-02-15 Thread Zhihong Yu
Hi,
For v13-0006-More-refactoring.patch :

+   /* It's not a shared catalog, so refuse to move it to shared tablespace
*/
+   if (params->tablespaceOid == GLOBALTABLESPACE_OID && false)
+   ereport(ERROR,

Do you intend to remove the ineffective check ?

+   else
+   heapRelation = table_open(heapId,
+ ShareUpdateExclusiveLock);
+   table_close(heapRelation, NoLock);

The table_open() seems to be unnecessary since there is no check after the
open.

+   // heapRelationIds = list_make1_oid(heapId);
If the code is not needed, you can remove the above.

For v13-0005-Refactor-to-allow-reindexing-all-index-partition.patch :

+   /* Skip invalid indexes, if requested */
+   if ((options & REINDEXOPT_SKIPVALID) != 0 &&
+   get_index_isvalid(partoid))

The comment seems to diverge from the name of the flag (which says skip
valid index).

Cheers

On Mon, Feb 15, 2021 at 11:34 AM Justin Pryzby  wrote:

> On Mon, Feb 15, 2021 at 10:06:47PM +0300, Anastasia Lubennikova wrote:
> > On 28.01.2021 17:30, Justin Pryzby wrote:
> > > On Thu, Jan 28, 2021 at 09:51:51PM +0900, Masahiko Sawada wrote:
> > > > On Mon, Nov 30, 2020 at 5:22 AM Justin Pryzby 
> wrote:
> > > > > On Sat, Oct 31, 2020 at 01:31:17AM -0500, Justin Pryzby wrote:
> > > > > > Forking this thread, since the existing CFs have been closed.
> > > > > >
> https://www.postgresql.org/message-id/flat/20200914143102.GX18552%40telsasoft.com#58b1056488451f8594b0f0ba40996afd
> > > > > >
> > > > > > The strategy is to create catalog entries for all tables with
> indisvalid=false,
> > > > > > and then process them like REINDEX CONCURRENTLY.  If it's
> interrupted, it
> > > > > > leaves INVALID indexes, which can be cleaned up with DROP or
> REINDEX, same as
> > > > > > CIC on a plain table.
> > > > > >
> > > > > > On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote:
> > > > > > > On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier
> wrote:
> > > > > > > Note that the mentioned problem wasn't serious: there was
> missing index on
> > > > > > > child table, therefor the parent index was invalid, as
> intended.  However I
> > > > > > > agree that it's not nice that the command can fail so easily
> and leave behind
> > > > > > > some indexes created successfully and some failed some not
> created at all.
> > > > > > >
> > > > > > > But I took your advice initially creating invalid inds.
> > > > > > ...
> > > > > > > That gave me the idea to layer CIC on top of Reindex, since I
> think it does
> > > > > > > exactly what's needed.
> > > > > > On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> > > > > > > On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier
> wrote:
> > > > > > > > It would be good also to check if
> > > > > > > > we have a partition index tree that maps partially with a
> partition
> > > > > > > > table tree (aka no all table partitions have a partition
> index), where
> > > > > > > > these don't get clustered because there is no index to work
> on.
> > > > > > > This should not happen, since a incomplete partitioned index
> is "invalid".
> >
> > > > > > > I had been waiting to rebase since there hasn't been any
> review comments and I
> > > > > > > expected additional, future conflicts.
> > > > > > >
> >
> > I attempted to review this feature, but the last patch conflicts with the
> > recent refactoring, so I wasn't able to test it properly.
> > Could you please send a new version?
>
> I rebased this yesterday, so here's my latest.
>
> > 2) Here we access relation field after closing the relation. Is it safe?
> > /* save lockrelid and locktag for below */
> > heaprelid = rel->rd_lockInfo.lockRelId;
>
> Thanks, fixed this just now.
>
> > 3) leaf_partitions() function only handles indexes, so I suggest to name
> it
> > more specifically and add a comment about meaning of 'options' parameter.
> >
> > 4) I don't quite understand the idea of the regression test. Why do we
> > expect to see invalid indexes there?
> > +"idxpart_a_idx1" UNIQUE, btree (a) INVALID
>
> Because of the unique failure:
> +create unique index concurrently on idxpart (a); -- partitioned, unique
> failure
> +ERROR:  could not create unique index "idxpart2_a_idx2_ccnew"
> +DETAIL:  Key (a)=(10) is duplicated.
> +\d idxpart
>
> This shows that CIC first creates catalog-only INVALID indexes, and then
> reindexes them to "validate".
>
> --
> Justin
>


Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Andres Freund
Hi,

On 2021-02-15 19:45:21 +0900, Michael Paquier wrote:
> On Mon, Feb 15, 2021 at 10:47:05PM +1300, Thomas Munro wrote:
> > Why not initialise it in WalRcvShmemInit()?
> 
> I was thinking about doing that as well, but we have no real need to
> initialize this stuff in most cases, say standalone deployments.  In
> particular for the fallback implementation of atomics, we would
> prepare a spinlock for nothing.

So what? It's just about free to initialize a spinlock, whether it's
using the fallback implementation or not. Initializing upon walsender
startup adds a lot of complications, because e.g. somebody could already
hold the spinlock because the previous walsender just disconnected, and
they were looking at the stats.

Greetings,

Andres Freund




Re: partial heap only tuples

2021-02-15 Thread Bossart, Nathan
On 2/13/21, 8:26 AM, "Andres Freund"  wrote:
> On 2021-02-09 18:48:21 +, Bossart, Nathan wrote:
>> In order to be eligible for cleanup, the final tuple in the
>> corresponding PHOT/HOT chain must also be eligible for cleanup, or all
>> indexes must have been updated later in the chain before any visible
>> tuples.
>
> This sounds like it might be prohibitively painful. Adding effectively
> unremovable bloat to remove other bloat is not an uncomplicated
> premise. I think you'd really need a way to fully remove this as part of
> vacuum for this to be viable.

Yeah, this is something I'm concerned about.  I think adding a bitmap
of modified columns to the header of PHOT-updated tuples improves
matters quite a bit, even for single-page vacuuming.  Following is a
strategy I've been developing (there may still be some gaps).  Here's
a basic PHOT chain where all tuples are visible and the last one has
not been deleted or updated:

idx1    0       1       2       3
idx2    0       1       2
idx3    0       2       3
lp      1       2       3       4       5
tuple   (0,0,0) (0,1,1) (2,2,1) (2,2,2) (3,2,3)
bitmap          -xx     xx-     --x     x-x

For single-page vacuum, we take the following actions:
1. Starting at the root of the PHOT chain, create an OR'd bitmap
   of the chain.
2. Walk backwards, OR-ing the bitmaps.  Stop when the bitmap
   matches the one from step 1.  As we walk backwards, identify
   "key" tuples, which are tuples where the OR'd bitmap changes as
   you walk backwards.  If the OR'd bitmap does not include all
   columns for the table, also include the root of the PHOT chain
   as a key tuple.
3. Redirect each key tuple to the next key tuple.
4. For all but the first key tuple, OR the bitmaps of all pruned
   tuples from each key tuple (exclusive) to the next key tuple
   (inclusive) and store the result in the bitmap of the next key
   tuple.
5. Mark all line pointers for all non-key tuples as dead.  Storage
   can be removed for all tuples except the last one, but we must
   leave around the bitmap for all key tuples except for the first
   one.

After this, our basic PHOT chain looks like this:

idx1    0       1       2       3
idx2    0       1       2
idx3    0       2       3
lp      X       X       3->5    X       5
tuple                                   (3,2,3)
bitmap                                  x-x

Without PHOT, this intermediate state would have 15 index tuples, 5
line pointers, and 1 heap tuple.  With PHOT, we have 10 index tuples,
5 line pointers, 1 heap tuple, and 1 bitmap.  When we vacuum the
indexes, we can reclaim the dead line pointers and remove the
associated index tuples:

idx1    3
idx2    2
idx3    2       3
lp      3->5    5
tuple           (3,2,3)
bitmap          x-x

Without PHOT, this final state would have 3 index tuples, 1 line
pointer, and 1 heap tuple.  With PHOT, we have 4 index tuples, 2 line
pointers, 1 heap tuple, and 1 bitmap.  Overall, we still end up
keeping around more line pointers and tuple headers (for the bitmaps),
but maybe that is good enough.  I think the next step here would be to
find a way to remove some of the unnecessary index tuples and adjust
the remaining ones to point to the last line pointer in the PHOT
chain.
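
To make steps 1 and 2 above concrete, here is a tiny self-contained toy of the
backward OR-ing walk. All names are hypothetical; the real thing would of
course operate on tuple headers rather than an array, and the bitmap values
below only loosely encode the -xx / xx- / --x / x-x bitmaps from the example:

#include <stdint.h>
#include <stdio.h>

typedef struct ToyTuple
{
    uint32_t    modified_cols;  /* bit n set => column n+1 was modified */
} ToyTuple;

/* Returns the chain position of the earliest tuple we must stop at. */
static int
find_stop_point(const ToyTuple *chain, int nchain)
{
    uint32_t    chain_or = 0;
    uint32_t    running_or = 0;
    int         i;

    /* step 1: OR'd bitmap of the whole chain */
    for (i = 0; i < nchain; i++)
        chain_or |= chain[i].modified_cols;

    /* step 2: walk backwards until the running OR matches it */
    for (i = nchain - 1; i >= 0; i--)
    {
        running_or |= chain[i].modified_cols;
        if (running_or == chain_or)
            return i;
    }
    return 0;                   /* fall back to the root of the chain */
}

int
main(void)
{
    /* root has no bitmap; then -xx, xx-, --x, x-x with bit 0 = first column */
    ToyTuple    chain[] = {{0}, {6}, {3}, {4}, {5}};

    printf("stop at chain position %d\n", find_stop_point(chain, 5));
    return 0;
}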

Nathan



Re: PG vs LLVM 12 on seawasp, next round

2021-02-15 Thread Andrew Dunstan


On 1/24/21 11:06 PM, Tom Lane wrote:
> Noah Misch  writes:
>> On Mon, Jan 18, 2021 at 09:29:53PM +0100, Fabien COELHO wrote:
>>> ... Last two months were a
>>> little overworked for me so I let slip quite a few things. If you want to
>>> disable the animal as Tom suggests, do as you want.
>> Perhaps he was suggesting that you (buildfarm owner) disable the cron job 
>> that
>> initiates new runs.  That's what I do when one of my animals needs my
>> intervention.
> Indeed.  I'm not sure there even is a provision to block an animal on the
> buildfarm-server side.  If there is, you'd have to request that it be
> manually undone after you get around to fixing the animal.  Frankly,
> if I were the BF admin, I would be in about as much hurry to do that
> as you've been to fix it.
>
>   


It's actually very easy, but it's something I usually reserve for people
who are very unresponsive to emails. As noted, disabling the crontab
entry on the client side is a preferable solution.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: PG vs LLVM 12 on seawasp, next round

2021-02-15 Thread Fabien COELHO




I've added an explicit LD_LIBRARY_PATH, which will be triggered at some
point later.


This seems to have fixed the issue.

I'm sorry for the noise and quite baffled anyway, because according to my 
change logs it does not seem that I modified anything from my side about 
the dynamic library path when compiling with clang. At least I do not see 
a trace of that.



You can also do something like LDFLAGS="$LDFLAGS -Wl,-rpath,$(llvm-config 
--libdir)"
or such.


I've resorted to just hardcode LD_LIBRARY_PATH alongside PATH when 
compiling with clang in my buildfarm script. Thanks for the tip anyway.


And thanks Thomas for pointing out the fix!

--
Fabien.




Re: CREATE INDEX CONCURRENTLY on partitioned index

2021-02-15 Thread Justin Pryzby
On Mon, Feb 15, 2021 at 10:06:47PM +0300, Anastasia Lubennikova wrote:
> On 28.01.2021 17:30, Justin Pryzby wrote:
> > On Thu, Jan 28, 2021 at 09:51:51PM +0900, Masahiko Sawada wrote:
> > > On Mon, Nov 30, 2020 at 5:22 AM Justin Pryzby  
> > > wrote:
> > > > On Sat, Oct 31, 2020 at 01:31:17AM -0500, Justin Pryzby wrote:
> > > > > Forking this thread, since the existing CFs have been closed.
> > > > > https://www.postgresql.org/message-id/flat/20200914143102.GX18552%40telsasoft.com#58b1056488451f8594b0f0ba40996afd
> > > > > 
> > > > > The strategy is to create catalog entries for all tables with 
> > > > > indisvalid=false,
> > > > > and then process them like REINDEX CONCURRENTLY.  If it's 
> > > > > interrupted, it
> > > > > leaves INVALID indexes, which can be cleaned up with DROP or REINDEX, 
> > > > > same as
> > > > > CIC on a plain table.
> > > > > 
> > > > > On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote:
> > > > > > On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier wrote:
> > > > > > Note that the mentioned problem wasn't serious: there was missing 
> > > > > > index on
> > > > > > child table, therefor the parent index was invalid, as intended.  
> > > > > > However I
> > > > > > agree that it's not nice that the command can fail so easily and 
> > > > > > leave behind
> > > > > > some indexes created successfully and some failed some not created 
> > > > > > at all.
> > > > > > 
> > > > > > But I took your advice initially creating invalid inds.
> > > > > ...
> > > > > > That gave me the idea to layer CIC on top of Reindex, since I think 
> > > > > > it does
> > > > > > exactly what's needed.
> > > > > On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:
> > > > > > On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:
> > > > > > > It would be good also to check if
> > > > > > > we have a partition index tree that maps partially with a 
> > > > > > > partition
> > > > > > > table tree (aka no all table partitions have a partition index), 
> > > > > > > where
> > > > > > > these don't get clustered because there is no index to work on.
> > > > > > This should not happen, since a incomplete partitioned index is 
> > > > > > "invalid".
> 
> > > > > > I had been waiting to rebase since there hasn't been any review 
> > > > > > comments and I
> > > > > > expected additional, future conflicts.
> > > > > > 
> 
> I attempted to review this feature, but the last patch conflicts with the
> recent refactoring, so I wasn't able to test it properly.
> Could you please send a new version?

I rebased this yesterday, so here's my latest.

> 2) Here we access relation field after closing the relation. Is it safe?
>     /* save lockrelid and locktag for below */
>     heaprelid = rel->rd_lockInfo.lockRelId;

Thanks, fixed this just now.

> 3) leaf_partitions() function only handles indexes, so I suggest to name it
> more specifically and add a comment about meaning of 'options' parameter.
> 
> 4) I don't quite understand the idea of the regression test. Why do we
> expect to see invalid indexes there?
> +    "idxpart_a_idx1" UNIQUE, btree (a) INVALID

Because of the unique failure:
+create unique index concurrently on idxpart (a); -- partitioned, unique failure
+ERROR:  could not create unique index "idxpart2_a_idx2_ccnew"
+DETAIL:  Key (a)=(10) is duplicated.
+\d idxpart

This shows that CIC first creates catalog-only INVALID indexes, and then
reindexes them to "validate".

-- 
Justin
>From c846ddfc287bfeddb9b389de1869aadf7173c068 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v13 01/18] Allow CREATE INDEX CONCURRENTLY on partitioned
 table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.
---
 doc/src/sgml/ref/create_index.sgml |   9 --
 src/backend/commands/indexcmds.c   | 143 ++---
 src/test/regress/expected/indexing.out |  60 ++-
 src/test/regress/sql/indexing.sql  |  18 +++-
 4 files changed, 176 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index a5271a9f8f..6869a18968 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -686,15 +686,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 127ba7835d..4ac1dacd7d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -68,6 +68,7 @@
 
 
 /* non-export f

Re: SSL SNI

2021-02-15 Thread Peter Eisentraut

On 2021-02-15 18:40, Jesse Zhang wrote:

I imagine this also (finally) opens up the possibility for the server
to present a different certificate for each hostname based on SNI.
This eliminates the requirement for wildcard certs where the cluster
is running on a host with multiple (typically two to three) hostnames
and the clients check the hostname against SAN in the cert
(sslmode=verify-full). Am I right? Is that feature on anybody's
roadmap?


This would be the client side of that.  But I don't know of anyone 
planning to work on the server side.
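
For reference, on the client side this mostly comes down to a single OpenSSL
call made on the SSL handle before the handshake. A minimal sketch, not the
actual patch, and the helper name is mine:

#include <openssl/ssl.h>

/*
 * Ask OpenSSL to include an SNI extension carrying the host name we are
 * connecting to; must be called before the handshake is performed.
 * Returns 1 on success, 0 on failure.
 */
static int
set_sni_hostname(SSL *ssl, const char *host)
{
    return SSL_set_tlsext_host_name(ssl, host);
}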





Re: CREATE INDEX CONCURRENTLY on partitioned index

2021-02-15 Thread Anastasia Lubennikova

On 28.01.2021 17:30, Justin Pryzby wrote:

On Thu, Jan 28, 2021 at 09:51:51PM +0900, Masahiko Sawada wrote:

On Mon, Nov 30, 2020 at 5:22 AM Justin Pryzby  wrote:

On Sat, Oct 31, 2020 at 01:31:17AM -0500, Justin Pryzby wrote:

Forking this thread, since the existing CFs have been closed.
https://www.postgresql.org/message-id/flat/20200914143102.GX18552%40telsasoft.com#58b1056488451f8594b0f0ba40996afd

The strategy is to create catalog entries for all tables with indisvalid=false,
and then process them like REINDEX CONCURRENTLY.  If it's interrupted, it
leaves INVALID indexes, which can be cleaned up with DROP or REINDEX, same as
CIC on a plain table.

On Sat, Aug 08, 2020 at 01:37:44AM -0500, Justin Pryzby wrote:

On Mon, Jun 15, 2020 at 09:37:42PM +0900, Michael Paquier wrote:
Note that the mentioned problem wasn't serious: there was missing index on
child table, therefor the parent index was invalid, as intended.  However I
agree that it's not nice that the command can fail so easily and leave behind
some indexes created successfully and some failed some not created at all.

But I took your advice initially creating invalid inds.

...

That gave me the idea to layer CIC on top of Reindex, since I think it does
exactly what's needed.

On Sat, Sep 26, 2020 at 02:56:55PM -0500, Justin Pryzby wrote:

On Thu, Sep 24, 2020 at 05:11:03PM +0900, Michael Paquier wrote:

It would be good also to check if
we have a partition index tree that maps partially with a partition
table tree (aka no all table partitions have a partition index), where
these don't get clustered because there is no index to work on.

This should not happen, since a incomplete partitioned index is "invalid".



I had been waiting to rebase since there hasn't been any review comments and I
expected additional, future conflicts.



I attempted to review this feature, but the last patch conflicts with 
the recent refactoring, so I wasn't able to test it properly.

Could you please send a new version?

Meanwhile, here are my questions about the patch:

1) I don't see a reason to change the logic here. We don't skip counting 
existing indexes when creating the parent index. Why should we skip them in 
CONCURRENTLY mode?


            // If concurrent, maybe this should be done after excluding 
indexes which already exist ?

pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL,
                                         nparts);

2) Here we access relation field after closing the relation. Is it safe?

    /* save lockrelid and locktag for below */
    heaprelid = rel->rd_lockInfo.lockRelId;

3) leaf_partitions() function only handles indexes, so I suggest to name 
it more specifically and add a comment about meaning of 'options' parameter.


4) I don't quite understand the idea of the regression test. Why do we 
expect to see invalid indexes there?

+    "idxpart_a_idx1" UNIQUE, btree (a) INVALID

5) Speaking of documentation, I think we need to add a paragraph about 
CIC on partitioned indexes which will explain that invalid indexes may 
appear and what the user should do to fix them.


6) ReindexIndexesConcurrently() needs some code cleanup.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






Re: PG vs LLVM 12 on seawasp, next round

2021-02-15 Thread Andres Freund
Hi,

On 2021-02-15 10:05:32 +0100, Fabien COELHO wrote:
> > show?  What do you have in seawap's LD_LIBRARY_PATH, and is the
> > problem that you need to add /home/fabien/clgtk/lib to it
> 
> Argh. Would it be so stupid? :-( I thought the configuration stuff would
> manage the link path automatically, which may be quite naïve, indeed.

The only way to do that is to add an rpath annotation, and e.g. distros
don't like that - there's some security implications.

> I've added an explicit LD_LIBRARY_PATH, which will be triggered at some
> point later.

You can also do something like LDFLAGS="$LDFLAGS -Wl,-rpath,$(llvm-config 
--libdir)"
or such.

Greetings,

Andres Freund




Re: SSL SNI

2021-02-15 Thread Jesse Zhang
Hi Peter,
I imagine this also (finally) opens up the possibility for the server
to present a different certificate for each hostname based on SNI.
This eliminates the requirement for wildcard certs where the cluster
is running on a host with multiple (typically two to three) hostnames
and the clients check the hostname against SAN in the cert
(sslmode=verify-full). Am I right? Is that feature on anybody's
roadmap?

Cheers,
Jesse



On Mon, Feb 15, 2021 at 6:09 AM Peter Eisentraut
 wrote:
>
> A customer asked about including Server Name Indication (SNI) into the
> SSL connection from the client, so they can use an SSL-aware proxy to
> route connections.  There was a thread a few years ago where this was
> briefly discussed but no patch appeared.[0]  I whipped up a quick patch
> and it did seem to do the job, so I figured I'd share it here.
>
> The question I had was whether this should be an optional behavior, or
> conversely a behavior that can be turned off, or whether it should just
> be turned on all the time.
>
> Technically, it seems pretty harmless.  It adds another field to the TLS
> handshake, and if the server is not interested in it, it just gets ignored.
>
> The Wikipedia page[1] discusses some privacy concerns in the context of
> web browsing, but it seems there is no principled solution to those.
> The relevant RFC[2] "recommends" that SNI is used for all applicable TLS
> connections.
>
>
> [0]:
> https://www.postgresql.org/message-id/flat/CAPPwrB_tsOw8MtVaA_DFyOFRY2ohNdvMnLoA_JRr3yB67Rggmg%40mail.gmail.com
> [1]: https://en.wikipedia.org/wiki/Server_Name_Indication
> [2]: https://tools.ietf.org/html/rfc6066#section-3




Re: MultiXact\SLRU buffers configuration

2021-02-15 Thread Andrey Borodin


> On 23 Dec 2020, at 21:31, Gilles Darold  wrote:
> 
> Sorry for the response delay, we have run several other tests trying to 
> figure out the performance gain per patch, but unfortunately we get very 
> erratic results. With the same parameters and patches the tests do not 
> return the same results depending on the day or the hour of the day. This is 
> very frustrating and I suppose that this is related to the Azure 
> architecture. The only thing that I am sure of is that we had the best 
> performance results with all patches and
> 
> multixact_offsets_slru_buffers = 256
> multixact_members_slru_buffers = 512
> multixact_local_cache_entries = 4096
> 
> 
> but I cannot say whether all or only some of the patches improve 
> performance. My feeling is that the performance gain from patches 1 (shared 
> lock) and 3 (condition variable) does not add much compared to just tuning 
> the multixact buffers, at least when the multixact contention is observed, 
> but perhaps they are delaying the contention. It's all the more frustrating 
> that we had a test case to reproduce the contention but apparently not the 
> architecture.

Hi! Thanks for the input.
I think we have a consensus here that configuring the SLRU size is beneficial 
for MultiXacts.
There is a proposal in a nearby thread [0] to change the default value of the 
commit_ts SLRU buffers.
In my experience, from time to time there can be problems with subtransactions 
that are cured by extending the subtrans SLRU.

Let's make all SLRUs configurable?
PFA patch with draft of these changes.

Best regards, Andrey Borodin.


[0] 
https://www.postgresql.org/message-id/flat/20210115220744.GA24457%40alvherre.pgsql



v9-0001-Make-all-SLRU-buffer-sizes-configurable.patch
Description: Binary data


Re: Improve new hash partition bound check error messages

2021-02-15 Thread Peter Eisentraut

On 2021-02-03 15:52, Peter Eisentraut wrote:

On 2021-02-02 13:26, Heikki Linnakangas wrote:

How about this?

CREATE TABLE fail_part PARTITION OF hash_parted FOR VALUES WITH (MODULUS
25, REMAINDER 3);
ERROR:  every hash partition modulus must be a factor of the next larger
modulus
DETAIL:  25 is not divisible by 10, the modulus of existing partition
"hpart_1"


I don't know if we can easily get the name of the existing partition.
I'll have to check that.

I'm worried that this phrasing requires the user to understand that
"divisible by" is related to "factor of", which is of course correct but
introduces yet more complexity into this.

I'll play around with this a bit more.


Here is a new patch that implements the suggestions.
From 537e8f2f27e43b94777b962e408245fb1d4784dd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Feb 2021 17:10:11 +0100
Subject: [PATCH v2] Improve new hash partition bound check error messages

For the error message "every hash partition modulus must be a factor
of the next larger modulus", add a detail message that shows the
particular numbers involved.  Also comment the code more.

Discussion: 
https://www.postgresql.org/message-id/flat/bb9d60b4-aadb-607a-1a9d-fdc3434dddcd%40enterprisedb.com
---
 src/backend/partitioning/partbounds.c  | 62 +++---
 src/test/regress/expected/alter_table.out  |  1 +
 src/test/regress/expected/create_table.out |  2 +
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/src/backend/partitioning/partbounds.c 
b/src/backend/partitioning/partbounds.c
index 0c3f212ff2..02cd58ce5f 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -2832,14 +2832,9 @@ check_new_partition_bound(char *relname, Relation parent,
 
if (partdesc->nparts > 0)
{
-   Datum **datums = boundinfo->datums;
-   int ndatums = 
boundinfo->ndatums;
int 
greatest_modulus;
int remainder;
int offset;
-   boolvalid_modulus = true;
-   int prev_modulus,   
/* Previous largest modulus */
-   next_modulus;   
/* Next largest modulus */
 
/*
 * Check rule that every modulus must 
be a factor of the
@@ -2849,7 +2844,9 @@ check_new_partition_bound(char *relname, Relation parent,
 * modulus 15, but you cannot add both 
a partition with
 * modulus 10 and a partition with 
modulus 15, because 10
 * is not a factor of 15.
-*
+*/
+
+   /*
 * Get the greatest (modulus, 
remainder) pair contained in
 * boundinfo->datums that is less than 
or equal to the
 * (spec->modulus, spec->remainder) 
pair.
@@ -2859,26 +2856,55 @@ check_new_partition_bound(char *relname, Relation 
parent,

spec->remainder);
if (offset < 0)
{
-   next_modulus = 
DatumGetInt32(datums[0][0]);
-   valid_modulus = (next_modulus % 
spec->modulus) == 0;
+   int 
next_modulus;
+
+   /*
+* All existing moduli are 
greater or equal, so the
+* new one must be a factor of 
the smallest one, which
+* is first in the boundinfo.
+*/
+   next_modulus = 
DatumGetInt32(boundinfo->datums[0][0]);
+   if (next_modulus % 
spec->modulus != 0)
+   ereport(ERROR,
+   
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+
errmsg("every hash partition modulus must be a factor of the next larger 
modulus"),
+  

Re: POC: postgres_fdw insert batching

2021-02-15 Thread Tomas Vondra
On 2/5/21 3:52 AM, Amit Langote wrote:
> Tsunakwa-san,
> 
> On Mon, Jan 25, 2021 at 1:21 PM tsunakawa.ta...@fujitsu.com
>  wrote:
>> From: Amit Langote 
>>> Yes, it can be simplified by using a local join to prevent the update of 
>>> the foreign
>>> partition from being pushed to the remote server, for which my example in 
>>> the
>>> previous email used a local trigger.  Note that the update of the foreign
>>> partition to be done locally is a prerequisite for this bug to occur.
>>
>> Thank you, I was aware that UPDATE calls ExecInsert() but forgot about it 
>> partway.  Good catch (and my bad miss.)
> 
> It appears I had missed your reply, sorry.
> 
>> +   PgFdwModifyState *fmstate = resultRelInfo->ri_FdwState ?
>> +   (PgFdwModifyState *) 
>> resultRelInfo->ri_FdwState :
>> +   NULL;
>>
>> This can be written as:
>>
>> +   PgFdwModifyState *fmstate = (PgFdwModifyState *) 
>> resultRelInfo->ri_FdwState;
> 
> Facepalm, yes.
> 
> Patch updated.  Thanks for the review.
> 

Thanks for the patch, it seems fine to me. I wonder if the commit
message needs some tweaks, though. At the moment it says:

Prevent FDW insert batching during cross-partition updates

but what the patch seems to be doing is simply initializing the info
only for CMD_INSERT operations. Which does the trick, but it affects
everything, i.e. all updates, no? Not just cross-partition updates.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Seamus Abshere
hi,

Here we go, my first patch... solves 
https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520e...@www.fastmail.com

Best,
Seamus

On Mon, Feb 15, 2021, at 3:53 AM, Amit Langote wrote:
> Hi Seamus,
> 
> On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere  wrote:
> > It turns out parallel_workers may be a useful reloption for certain uses of 
> > partitioned tables, at least if they're made up of fancy column store 
> > partitions (see 
> > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> >
> > Would somebody tell me what I'm doing wrong? I would love to submit a patch 
> > but I'm stuck:
> >
> > diff --git a/src/backend/optimizer/path/allpaths.c 
> > b/src/backend/optimizer/path/allpaths.c
> > index cd3fdd259c..f1ade035ac 100644
> > --- a/src/backend/optimizer/path/allpaths.c
> > +++ b/src/backend/optimizer/path/allpaths.c
> > @@ -3751,6 +3751,7 @@ compute_parallel_worker(RelOptInfo *rel, double 
> > heap_pages, double index_pages,
> >  * If the user has set the parallel_workers reloption, use that; 
> > otherwise
> >  * select a default number of workers.
> >  */
> > +   // I want to affect this
> > if (rel->rel_parallel_workers != -1)
> > parallel_workers = rel->rel_parallel_workers;
> > else
> >
> > so I do this
> >
> > diff --git a/src/backend/access/common/reloptions.c 
> > b/src/backend/access/common/reloptions.c
> > index c687d3ee9e..597b209bfb 100644
> > --- a/src/backend/access/common/reloptions.c
> > +++ b/src/backend/access/common/reloptions.c
> > @@ -1961,13 +1961,15 @@ build_local_reloptions(local_relopts *relopts, 
> > Datum options, bool validate)
> >  bytea *
> >  partitioned_table_reloptions(Datum reloptions, bool validate)
> >  {
> > -   /*
> > -* There are no options for partitioned tables yet, but this is 
> > able to do
> > -* some validation.
> > -*/
> > +   static const relopt_parse_elt tab[] = {
> > +   {"parallel_workers", RELOPT_TYPE_INT,
> > +   offsetof(StdRdOptions, parallel_workers)},
> > +   };
> > +
> > return (bytea *) build_reloptions(reloptions, validate,
> >   
> > RELOPT_KIND_PARTITIONED,
> > - 
> > 0, NULL, 0);
> > + 
> > sizeof(StdRdOptions),
> > + 
> > tab, lengthof(tab));
> >  }
> >
> > That "works":
> >
> > postgres=# alter table test_3pd_cstore_partitioned set (parallel_workers = 
> > 33);
> > ALTER TABLE
> > postgres=# select relname, relkind, reloptions from pg_class where relname 
> > = 'test_3pd_cstore_partitioned';
> >relname   | relkind |  reloptions
> > -+-+---
> >  test_3pd_cstore_partitioned | p   | {parallel_workers=33}
> > (1 row)
> >
> > But it seems to be ignored:
> >
> > diff --git a/src/backend/optimizer/path/allpaths.c 
> > b/src/backend/optimizer/path/allpaths.c
> > index cd3fdd259c..c68835ce38 100644
> > --- a/src/backend/optimizer/path/allpaths.c
> > +++ b/src/backend/optimizer/path/allpaths.c
> > @@ -3751,6 +3751,8 @@ compute_parallel_worker(RelOptInfo *rel, double 
> > heap_pages, double index_pages,
> >  * If the user has set the parallel_workers reloption, use that; 
> > otherwise
> >  * select a default number of workers.
> >  */
> > +   // I want to affect this, but this assertion always passes
> > +   Assert(rel->rel_parallel_workers == -1)
> > if (rel->rel_parallel_workers != -1)
> > parallel_workers = rel->rel_parallel_workers;
> > else
> 
> You may see by inspecting the callers of compute_parallel_worker()
> that it never gets called on a partitioned table, only its leaf
> partitions.  Maybe you could try calling compute_parallel_worker()
> somewhere in add_paths_to_append_rel(), which has this code to figure
> out parallel_workers to use for a parallel Append path for a given
> partitioned table:
> 
> /* Find the highest number of workers requested for any subpath. */
> foreach(lc, partial_subpaths)
> {
> Path   *path = lfirst(lc);
> 
> parallel_workers = Max(parallel_workers, path->parallel_workers);
> }
> Assert(parallel_workers > 0);
> 
> /*
>  * If the use of parallel append is permitted, always request at least
>  * log2(# of children) workers.  We assume it can be useful to have
>  * extra workers in this case because they will be spread out across
>  * the children.  The precise formula is just a guess, but we don't
>  * want to end up with a radically different answer for a table with N
>  

Re: POC: postgres_fdw insert batching

2021-02-15 Thread Tomas Vondra
On 2/5/21 2:55 AM, Ian Lawrence Barwick wrote:
> ...
> 
> There's a minor typo in the doc's version of the
> ExecForeignBatchInsert() declaration;
> is:
> 
>     TupleTableSlot **
>     ExecForeignBatchInsert(EState *estate,
>                       ResultRelInfo *rinfo,
>                       TupleTableSlot **slots,
>                       TupleTableSlot *planSlots,
>                       int *numSlots);
> 
> should be:
> 
>     TupleTableSlot **
>     ExecForeignBatchInsert(EState *estate,
>                       ResultRelInfo *rinfo,
>                       TupleTableSlot **slots,
>                       TupleTableSlot **planSlots,
>                       int *numSlots);
> 
> (Trivial patch attached).
> 
> 
> Forgot to mention the relevant doc link:
> 
>    
> https://www.postgresql.org/docs/devel/fdw-callbacks.html#FDW-CALLBACKS-UPDATE
> 
> 

Thanks, I'll get this fixed.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Improvements and additions to COPY progress reporting

2021-02-15 Thread Tomas Vondra
Hi,

I agree with these changes in general - I have a couple of minor comments:

1) 0001

- the SGML docs are missing a couple tags

- The blocks in copyfrom.c/copyto.c should be reworked - I don't think
we do this in our codebase. Move the variable declarations to the
beginning, get rid of the out block. Or something like that.

- I find the "io_target" name misleading, because in some cases it's
actually the *source*.


2) 0002

- I believe "each backend ... reports its" (not theirs), right?

- This seems more like a generic docs improvement, not quite specific to
the COPY progress patch. It's a bit buried, maybe it should be posted
separately. OTOH it's pretty small.


3) 0003

- Some whitespace noise, triggering "git am" warnings.

- Might be good to briefly explain what the regression test does with
the triggers, etc.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 32e8bf3e84d58b43c8ab5888eb909e86bdf0ceea Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Fri, 12 Feb 2021 14:06:44 +0100
Subject: [PATCH 1/6] Add progress-reported components for COPY progress
 reporting

The command, io target and excluded tuple count (by COPY ... FROM ...'s
WHERE-clause) are now reported in the pg_stat_progress_copy view.
Additionally, the column lines_processed is renamed to tuples_processed to
disambiguate the meaning of this column in cases of CSV and BINARY copies and
to stay consistent with regards to names in the pg_stat_progress_*-views.

Of special interest is the reporting of io_target, with which we can
distinguish logical replications' initial table synchronization workers'
progress without having to join the pg_stat_activity view.
---
 doc/src/sgml/monitoring.sgml | 37 ++--
 src/backend/catalog/system_views.sql | 11 -
 src/backend/commands/copyfrom.c  | 30 +-
 src/backend/commands/copyto.c| 29 --
 src/include/commands/progress.h  | 15 ++-
 src/test/regress/expected/rules.out  | 15 ++-
 6 files changed, 129 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c602ee4427..3c39c82f1a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6544,6 +6544,29 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
   
  
 
+ 
+  
+   command text
+  
+  
+   The command that is running: COPY FROM, or
+   COPY TO.
+  
+ 
+
+ 
+  
+   io_target text
+  
+  
+   The io target that the data is read from or written to: 
+   FILE, PROGRAM, 
+   STDIO (for COPY FROM STDIN and COPY TO STDOUT),
+   or CALLBACK (used in the table synchronization
+   background worker).
+  
+ 
+
  
   
bytes_processed bigint
@@ -6565,10 +6588,20 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
 
  
   
-   lines_processed bigint
+   tuples_processed bigint
+  
+  
+   Number of tuples already processed by COPY command.
+  
+ 
+
+ 
+  
+   tuples_excluded bigint
   
   
-   Number of lines already processed by COPY command.
+   Number of tuples not processed because they were excluded by the
+   WHERE clause of the COPY command.
   
  
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index fa58afd9d7..6a3ac47b85 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1129,9 +1129,18 @@ CREATE VIEW pg_stat_progress_copy AS
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
 S.relid AS relid,
+CASE S.param5 WHEN 1 THEN 'COPY FROM'
+  WHEN 2 THEN 'COPY TO'
+  END AS command,
+CASE S.param6 WHEN 1 THEN 'FILE'
+  WHEN 2 THEN 'PROGRAM'
+  WHEN 3 THEN 'STDIO'
+  WHEN 4 THEN 'CALLBACK'
+  END AS io_target,
 S.param1 AS bytes_processed,
 S.param2 AS bytes_total,
-S.param3 AS lines_processed
+S.param3 AS tuples_processed,
+S.param4 AS tuples_excluded
 FROM pg_stat_get_progress_info('COPY') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 796ca7b3f7..c3610eb67e 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -540,6 +540,7 @@ CopyFrom(CopyFromState cstate)
 	CopyInsertMethod insertMethod;
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	uint64		processed = 0;
+	uint64		excluded = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -869,7 +870,11 @@ CopyFrom(CopyFromState cstate)
 			econtext->ecxt_s

Re: A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Laurenz Albe
On Mon, 2021-02-15 at 17:53 +0900, Amit Langote wrote:
> On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere  wrote:
> > It turns out parallel_workers may be a useful reloption for certain uses of 
> > partitioned tables,
> >  at least if they're made up of fancy column store partitions (see
> >  
> > https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
> > Would somebody tell me what I'm doing wrong? I would love to submit a patch 
> > but I'm stuck:
> 
> You may see by inspecting the callers of compute_parallel_worker()
> that it never gets called on a partitioned table, only its leaf
> partitions.  Maybe you could try calling compute_parallel_worker()
> somewhere in add_paths_to_append_rel(), which has this code to figure
> out parallel_workers to use for a parallel Append path for a given
> partitioned table:
> 
> /* Find the highest number of workers requested for any subpath. */
> foreach(lc, partial_subpaths)
> {
> Path   *path = lfirst(lc);
> 
> parallel_workers = Max(parallel_workers, path->parallel_workers);
> }
> Assert(parallel_workers > 0);
> 
> /*
>  * If the use of parallel append is permitted, always request at least
>  * log2(# of children) workers.  We assume it can be useful to have
>  * extra workers in this case because they will be spread out across
>  * the children.  The precise formula is just a guess, but we don't
>  * want to end up with a radically different answer for a table with N
>  * partitions vs. an unpartitioned table with the same data, so the
>  * use of some kind of log-scaling here seems to make some sense.
>  */
> if (enable_parallel_append)
> {
> parallel_workers = Max(parallel_workers,
>fls(list_length(live_childrels)));
> parallel_workers = Min(parallel_workers,
>max_parallel_workers_per_gather);
> }
> Assert(parallel_workers > 0);
> 
> Note that the 'rel' in this code refers to the partitioned table for
> which an Append path is being considered, so compute_parallel_worker()
> using that 'rel' would use the partitioned table's
> rel_parallel_workers as you are trying to do.

Note that there is a second chunk of code quite like that one a few
lines down from there that would also have to be modified.
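
Just to illustrate the idea (an untested sketch, not a proposed patch; "rel" is
assumed here to be the partitioned parent's RelOptInfo in
add_paths_to_append_rel()), the override could look roughly like this:

    /*
     * Sketch: if the partitioned parent has an explicit parallel_workers
     * reloption, let it win; otherwise keep the existing log-scaling logic.
     */
    if (rel->rel_parallel_workers != -1)
        parallel_workers = rel->rel_parallel_workers;
    else if (enable_parallel_append)
    {
        parallel_workers = Max(parallel_workers,
                               fls(list_length(live_childrels)));
        parallel_workers = Min(parallel_workers,
                               max_parallel_workers_per_gather);
    }
    Assert(parallel_workers > 0);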

I am +1 on allowing to override the degree of parallelism on a parallel
append.  If "parallel_workers" on the partitioned table is an option for
that, it might be a simple solution.  On the other hand, perhaps it would
be less confusing to have a different storage parameter name rather than
having "parallel_workers" do double duty.

Also, since there is a design rule that storage parameters can only be used
on partitions, we would have to change that - is that a problem for anybody?


There is another related consideration that doesn't need to be addressed
by this patch, but that is somewhat related: if the executor prunes some
partitions, the degree of parallelism is unaffected, right?
So if the planner decides to use 24 workers for 25 partitions, and the
executor discards all but one of these partition scans, we would end up
with 24 workers scanning a single partition.

I am not sure how that could be improved.

Yours,
Laurenz Albe





Re: PostgreSQL <-> Babelfish integration

2021-02-15 Thread Finnerty, Jim
We are applying the Babelfish commits to the REL_12_STABLE branch now, and the 
plan is to merge them into the REL_13_STABLE and master branch ASAP after that. 
 There should be a publicly downloadable git repository before very long.

On 2/12/21, 2:35 PM, "Peter Geoghegan"  wrote:

On Fri, Feb 12, 2021 at 11:13 AM Matthias van de Meent
 wrote:
> I agree. I believe that Babelfish's efforts can be compared with the
> zedstore and zheap efforts: they require work in core before they can
> be integrated or added as an extension that could replace the normal
> heap tableam, and while core is being prepared we can discover what
> can and cannot be prepared in core for this new feature.

I see what you mean, but even that seems generous to me, since, as I
said, we don't have any Babelfish code to evaluate today. Whereas
Zedstore and zheap can actually be downloaded and tested.

--
Peter Geoghegan





Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2021-02-15 Thread Joel Jacobson
Hi all,

I've reviewed Mark's anyarray_anyelement_operators-v2.patch
and the only remaining issue I've identified is the opr_sanity problem.

Mark seems to be in need of some input here from more experienced hackers, see 
below.

Hopefully someone can guide him in the right direction.

/Joel

On Sat, Feb 13, 2021, at 11:49, Mark Rofail wrote:
>Hey Joel,
>
>test opr_sanity   ... FAILED
>
>AND binary_coercible(p2.opcintype, p1.amoplefttype));
>  amopfamily | amopstrategy | amopopr
>+--+-
>-(0 rows)
>+   2745 |5 |6105
>+(1 row)
>-- Operators that are primary members of opclasses must be immutable (else
>-- it suggests that the index ordering isn't fixed).  Operators that are
>This is due to using anycompatiblearray for the left operand in @>>. 
>To solve this problem we need to use @>>(anyarray,anyelement) or introduce a 
>new opclass for gin indices. 
>These are the two approaches that come to mind to solve this. Which one is the 
>right way or is there another solution I am not aware of?
>That’s why I’m asking this on the mailing list, to get the community’s input.



Re: SSL SNI

2021-02-15 Thread Matthias van de Meent
On Mon, 15 Feb 2021 at 15:09, Peter Eisentraut
 wrote:
>
> A customer asked about including Server Name Indication (SNI) into the
> SSL connection from the client, so they can use an SSL-aware proxy to
> route connections.  There was a thread a few years ago where this was
> briefly discussed but no patch appeared.[0]  I whipped up a quick patch
> and it did seem to do the job, so I figured I'd share it here.

The same topic of SSL-aware proxying based on SNI was mentioned in a
more recent thread here [0]. The state of that patch is unclear,
though. Other than that, this feature seems useful.


+/*
+ * Set Server Name Indication (SNI), but not if it's a literal IP address.
+ * (RFC 6066)
+ */
+if (!((conn->pghost[0] >= '0' && conn->pghost[0] <= '9') ||
strchr(conn->pghost, ':')))

'1one.example.com' is a valid hostname, but would fail this trivial
test, and thus would not have SNI enabled on its connection.
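
A sketch of a more robust test (illustration only, using POSIX inet_pton();
I'm not saying this is exactly what libpq should do):

    #include <arpa/inet.h>
    #include <stdbool.h>

    /* Return true if "host" parses as a literal IPv4 or IPv6 address. */
    static bool
    host_is_ip_literal(const char *host)
    {
        struct in_addr  addr4;
        struct in6_addr addr6;

        return inet_pton(AF_INET, host, &addr4) == 1 ||
               inet_pton(AF_INET6, host, &addr6) == 1;
    }

With that, '1one.example.com' would correctly get SNI set, while '192.0.2.1'
and '::1' would not.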


With regards,

Matthias van de Meent


[0] 
https://www.postgresql.org/message-id/flat/37846a5e-bb5e-0c4f-3ee8-54fb4bd02fab%40gmx.de




Re: logical replication seems broken

2021-02-15 Thread vignesh C
On Mon, Feb 15, 2021 at 6:14 PM  wrote:
>
>
> > On 2021.02.15. 12:31 Amit Kapila  wrote:
> > On Mon, Feb 15, 2021 at 11:53 AM vignesh C  wrote:
> > > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
> > > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > > Maybe a test (check or check-world) should be added to run a second 
> > > > replica?  (Assuming that would have caught this bug)
> > > >
> > > +1 for the idea of having a test for this. I have written a test for this.
> > > Thanks for the fix Amit, I could reproduce the issue without your fix
> > > and verified that the issue gets fixed with the patch you shared.
> > > Attached a patch for the same. Thoughts?
> > >
> >
> > I have slightly modified the comments in the test case to make things
> > clear. I am planning not to backpatch this because there is no way in
> > the core code to hit this prior to commit ce0fdbfe97 and we haven't
> > received any complaints so far. What do you think?
>
> My tests indeed run OK with this.
>
> (I haven't tested whether the newly added test actually tests for the problem 
> that was there - I suppose one of you did that)
>

I could re-create the scenario that you had faced with this test. This
test case is a simplified version of your script: I have removed the use
of pgbench, reduced the number of tables used, and simulated the same
problem with a replication setup similar to the one you had used.

Regards,
Vignesh




SSL SNI

2021-02-15 Thread Peter Eisentraut
A customer asked about including Server Name Indication (SNI) into the 
SSL connection from the client, so they can use an SSL-aware proxy to 
route connections.  There was a thread a few years ago where this was 
briefly discussed but no patch appeared.[0]  I whipped up a quick patch 
and it did seem to do the job, so I figured I'd share it here.


The question I had was whether this should be an optional behavior, or 
conversely a behavior that can be turned off, or whether it should just 
be turned on all the time.


Technically, it seems pretty harmless.  It adds another field to the TLS 
handshake, and if the server is not interested in it, it just gets ignored.


The Wikipedia page[1] discusses some privacy concerns in the context of 
web browsing, but it seems there is no principled solution to those. 
The relevant RFC[2] "recommends" that SNI is used for all applicable TLS 
connections.



[0]: 
https://www.postgresql.org/message-id/flat/CAPPwrB_tsOw8MtVaA_DFyOFRY2ohNdvMnLoA_JRr3yB67Rggmg%40mail.gmail.com

[1]: https://en.wikipedia.org/wiki/Server_Name_Indication
[2]: https://tools.ietf.org/html/rfc6066#section-3
From 3e4aae8b01a05fe78017659bb3be1942a7f1ccf5 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 15 Feb 2021 14:11:27 +0100
Subject: [PATCH] Set SNI for SSL connections from the client

This allows an SNI-aware proxy to route connections.
---
 src/interfaces/libpq/fe-secure-openssl.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/interfaces/libpq/fe-secure-openssl.c 
b/src/interfaces/libpq/fe-secure-openssl.c
index 5b4a4157d5..6e439679e5 100644
--- a/src/interfaces/libpq/fe-secure-openssl.c
+++ b/src/interfaces/libpq/fe-secure-openssl.c
@@ -1066,6 +1066,25 @@ initialize_SSL(PGconn *conn)
SSL_CTX_free(SSL_context);
SSL_context = NULL;
 
+   /*
+* Set Server Name Indication (SNI), but not if it's a literal IP 
address.
+* (RFC 6066)
+*/
+   if (!((conn->pghost[0] >= '0' && conn->pghost[0] <= '9') || 
strchr(conn->pghost, ':')))
+   {
+   if (SSL_set_tlsext_host_name(conn->ssl, conn->pghost) != 1)
+   {
+   char   *err = SSLerrmessage(ERR_get_error());
+
+   appendPQExpBuffer(&conn->errorMessage,
+ libpq_gettext("could 
not set SSL Server Name Indication (SNI): %s\n"),
+ err);
+   SSLerrfree(err);
+   SSL_CTX_free(SSL_context);
+   return -1;
+   }
+   }
+
/*
 * Read the SSL key. If a key is specified, treat it as an engine:key
 * combination if there is colon present - we don't support files with
-- 
2.30.0



Re: Possible dereference after null check (src/backend/executor/ExecUtils.c)

2021-02-15 Thread Ranier Vilela
On Fri, Feb 12, 2021 at 1:11 PM, Ranier Vilela wrote:

> On Fri, Feb 12, 2021 at 3:28 AM, Kyotaro Horiguchi <
> horikyota@gmail.com> wrote:
>
>> At Wed, 10 Feb 2021 19:54:46 -0300, Ranier Vilela 
>> wrote in
>> > Hi,
>> >
>> > Per Coverity.
>> >
>> > The functions ExecGetInsertedCols and ExecGetUpdatedCols at ExecUtils.c
>> > only are safe to call if the variable "ri_RangeTableIndex" is  != 0.
>> >
>> > Otherwise a possible Dereference after null check (FORWARD_NULL) can be
>> > raised.
>>
>> As it turns out, it's a false positive. And perhaps we don't want to
>> take action just to satisfy the static code analyzer.
>>
>>
>> The coment in ExecGetInsertedCols says:
>>
>> > /*
>> >  * The columns are stored in the range table entry. If this
>> ResultRelInfo
>> >  * doesn't have an entry in the range table (i.e. if it represents a
>> >  * partition routing target), fetch the parent's RTE and map the columns
>> >  * to the order they are in the partition.
>> >  */
>> > if (relinfo->ri_RangeTableIndex != 0)
>> > {
>>
>> This means that any one of the two is always usable here.  AFAICS,
>> actually, ri_RangeTableIndex is non-zero for partitioned (=leaf) and
>> non-partitoned relations and ri_RootResultRelInfo is non-null for
>> partitioned (parent or intermediate) relations (since they don't have
>> a coressponding range table entry).
>>
>> The only cases where both are 0 and NULL are trigger-use, which is
>> unrelated to the code path.
>>
> This is a case where it would be worth an assertion.
> What do you think?
>
Apparently my efforts with Coverity have been worth it.
And together we are helping to keep Postgres more secure.
Although sometimes that work is not recognized [1].

regards,
Ranier Vilela

[1]
https://github.com/postgres/postgres/commit/54e51dcde03e5c746e8de6243c69fafdc8d0ec7a


Re: pg_cryptohash_final possible out-of-bounds access (per Coverity)

2021-02-15 Thread Ranier Vilela
On Sun, Feb 14, 2021 at 10:28 PM, Michael Paquier wrote:

> On Sun, Feb 14, 2021 at 11:39:47AM -0300, Ranier Vilela wrote:
> > What do you think?
>
> That's not a good idea for two reasons:
> 1) There is CRC32 to worry about, which relies on a different logic.
> 2) It would become easier to miss the new option as compilation would
> not warn anymore if a new checksum type is added.
>
> I have reviewed my patch this morning, tweaked a comment, and applied
> it.
>
Thanks for the commit.

regards,
Ranier Vilela


Re: a misbehavior of partition row movement (?)

2021-02-15 Thread Amit Langote
Thank you for looking at this.

On Mon, Feb 15, 2021 at 4:12 PM Masahiko Sawada  wrote:
> On Wed, Jan 20, 2021 at 7:04 PM Amit Langote  wrote:
> > This shows that the way we've made these triggers behave in general
> > can cause some unintended behaviors for foreign keys during
> > cross-partition updates.  I started this thread to do something about
> > that and sent a patch to prevent cross-partition updates at all when
> > there are foreign keys pointing to it.  As others pointed out, that's
> > not a great long-term solution to the problem, but that's what we may
> > have to do in the back-branches if anything at all.
>
> I've started by reviewing the patch for back-patching that the first
> patch you posted[1].
>
> +   for (i = 0; i < trigdesc->numtriggers; i++)
> +   {
> +   Trigger *trigger = &trigdesc->triggers[i];
> +
> +   if (trigger->tgisinternal &&
> +   OidIsValid(trigger->tgconstrrelid) &&
> +   trigger->tgfoid == F_RI_FKEY_CASCADE_DEL)
> +   {
> +   found = true;
> +   break;
> +   }
> +   }
>
> IIUC the above checks if the partition table is referenced by a
> foreign key constraint on another table with ON DELETE CASCADE option.
> I think we should prevent cross-partition update also when ON DELETE
> SET NULL and ON DELETE SET DEFAULT. For example, with the patch, a
> tuple in a partition table is still updated to NULL when
> cross-partition update as follows:
>
> postgres=# create table p (a numeric primary key) partition by list (a);
> CREATE TABLE
> postgres=# create table p1 partition of p for values in (1);
> CREATE TABLE
> postgres=# create table p2 partition of p for values in (2);
> CREATE TABLE
> postgres=# insert into p values (1);
> INSERT 0 1
> postgres=# create table q (a int references p(a) on delete set null);
> CREATE TABLE
> postgres=# insert into q values (1);
> INSERT 0 1
> postgres=# update p set a = 2;
> UPDATE 1
> postgres=# table p;
>  a
> ---
>  2
> (1 row)
>
> postgres=# table q;
>  a
> ---
>
> (1 row)

Yeah, I agree that's not good.
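
For illustration only (untested, and assuming the usual F_RI_FKEY_* macros
from fmgroids.h), the quoted check could be extended to cover those actions
too, something like:

    for (i = 0; i < trigdesc->numtriggers; i++)
    {
        Trigger *trigger = &trigdesc->triggers[i];

        /* Also catch ON DELETE SET NULL / SET DEFAULT enforcement triggers */
        if (trigger->tgisinternal &&
            OidIsValid(trigger->tgconstrrelid) &&
            (trigger->tgfoid == F_RI_FKEY_CASCADE_DEL ||
             trigger->tgfoid == F_RI_FKEY_SETNULL_DEL ||
             trigger->tgfoid == F_RI_FKEY_SETDEFAULT_DEL))
        {
            found = true;
            break;
        }
    }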

Actually, I had meant to send an updated version of the patch to apply
in back-branches that would prevent such a cross-partition update, but
never did since starting to work on the patch for HEAD.  I have
attached it here.

Regarding the patch, I would have liked if it only prevented the
update when the foreign key that would be violated by the component
delete references the update query's main target relation.  If the
foreign key references the source partition directly, then the delete
would be harmless.  However there's not enough infrastructure in v12,
v13 branches to determine that, which makes back-patching this a bit
hard.  For example, there's no way to tell in the back-branches if the
foreign-key-enforcing triggers of a leaf partition have descended from
the parent table.  IOW, I am not so sure anymore if we should try to
do anything in the back-branches.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


prevent-row-movement-on-pk-table.patch
Description: Binary data


Re: snowball update

2021-02-15 Thread Peter Eisentraut

On 2021-02-12 22:00, Oleg Bartunov wrote:

We don't have (and really it's impossible to have) regression tests for
stemmers, so maybe we should warn users about possible inconsistencies
between old tsvectors and new stemmers?


Yeah, it's analogous to collation and Unicode updates.  We could invent 
a versioning mechanism; we have some of the infrastructure for that now. 
 But until we do that, perhaps some more elaborate guidance in the 
major version release notes would be appropriate.





Re: [POC] verifying UTF-8 using SIMD instructions

2021-02-15 Thread Heikki Linnakangas

On 13/02/2021 03:31, John Naylor wrote:
On Mon, Feb 8, 2021 at 6:17 AM Heikki Linnakangas wrote:

 >
 > I also tested the fallback implementation from the simdjson library
 > (included in the patch, if you uncomment it in simdjson-glue.c):
 >
 >   mixed | ascii
 > ---+---
 >     447 |    46
 > (1 row)
 >
 > I think we should at least try to adopt that. At a high level, it looks
 > pretty similar your patch: you load the data 8 bytes at a time, check if
 > there are all ASCII. If there are any non-ASCII chars, you check the
 > bytes one by one, otherwise you load the next 8 bytes. Your patch should
 > be able to achieve the same performance, if done right. I don't think
 > the simdjson code forbids \0 bytes, so that will add a few cycles, but
 > still.

Attached is a patch that does roughly what simdjson fallback did, except 
I use straight tests on the bytes and only calculate code points in 
assertion builds. In the course of doing this, I found that my earlier 
concerns about putting the ascii check in a static inline function were 
due to my suboptimal loop implementation. I had assumed that if the 
chunked ascii check failed, it had to check all those bytes one at a 
time. As it turns out, that's a waste of the branch predictor. In the v2 
patch, we do the chunked ascii check every time we loop. With that, I 
can also confirm the claim in the Lemire paper that it's better to do 
the check on 16-byte chunks:


(MacOS, Clang 10)

master:

  chinese | mixed | ascii
-+---+---
     1081 |   761 |   366

v2 patch, with 16-byte stride:

  chinese | mixed | ascii
-+---+---
      806 |   474 |    83

patch but with 8-byte stride:

  chinese | mixed | ascii
-+---+---
      792 |   490 |   105

I also included the fast path in all other multibyte encodings, and that 
is also pretty good performance-wise.


Cool.

It regresses from master on pure 
multibyte input, but that case is still faster than PG13, which I 
simulated by reverting 6c5576075b0f9 and b80e10638e3:


I thought the "chinese" numbers above are pure multibyte input, and it 
seems to do well on that. Where does it regress? In multibyte encodings 
other than UTF-8? How bad is the regression?


I tested this on my first generation Raspberry Pi (chipmunk). I had to 
tweak it a bit to make it compile, since the SSE autodetection code was 
not finished yet. And I used generate_series(1, 1000) instead of 
generate_series(1, 1) in the test script (mbverifystr-speed.sql) 
because this system is so slow.


master:

 mixed | ascii
---+---
  1310 |  1041
(1 row)

v2-add-portability-stub-and-new-fallback.patch:

 mixed | ascii
---+---
  2979 |   910
(1 row)

I'm guessing that's because the unaligned access in check_ascii() is 
expensive on this platform.
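
Just to sketch what I mean (illustration only, not the patch's actual
check_ascii(); the NUL handling here is my assumption): an alignment-safe
16-byte check could go through memcpy, e.g.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* True iff all 16 bytes are ASCII (high bit clear) and none is NUL. */
    static bool
    chunk_is_ascii16(const unsigned char *s)
    {
        uint64_t    half1,
                    half2;

        /* memcpy avoids unaligned loads on strict-alignment platforms */
        memcpy(&half1, s, sizeof(half1));
        memcpy(&half2, s + 8, sizeof(half2));

        if ((half1 | half2) & UINT64_C(0x8080808080808080))
            return false;       /* some byte has its high bit set */

        /* standard "has a zero byte" bit trick, applied to both halves */
        if (((half1 - UINT64_C(0x0101010101010101)) & ~half1 &
             UINT64_C(0x8080808080808080)) ||
            ((half2 - UINT64_C(0x0101010101010101)) & ~half2 &
             UINT64_C(0x8080808080808080)))
            return false;       /* embedded NUL byte */

        return true;
    }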


- Heikki




Re: Online checksums patch - once again

2021-02-15 Thread Daniel Gustafsson
> On 11 Feb 2021, at 14:10, Bruce Momjian  wrote:
> 
> On Wed, Feb 10, 2021 at 01:26:18PM -0500, Bruce Momjian wrote:
>> On Wed, Feb 10, 2021 at 03:25:58PM +0100, Magnus Hagander wrote:
>>> A fairly large amount of this complexity comes out of the fact that it
>>> now supports restarting and tracks checksums on a per-table basis. We
>>> skipped this in the original patch for exactly this reason (that's not
>>> to say there isn't a fair amount of complexity even without it, but it
>>> did substantially increase both the size and the complexity of the
>>> patch), but in the review of that I was specifically asked for having
>>> that added. I personally don't think it's worth that complexity but at
>>> the time that seemed to be a pretty strong argument. So I'm not
>>> entirely sure how to move forward with that...
>>> 
>>> is your impression that it would still be too complicated, even without 
>>> that?
>> 
>> I was wondering why this feature has stalled for so long --- now I know.
>> This does highlight the risk of implementing too many additions to a
>> feature. I am working against this dynamic in the cluster file
>> encryption feature I am working on.
> 
> Oh, I think another reason this patchset has had problems is related to
> something I mentioned in 2018:
> 
>   https://www.postgresql.org/message-id/20180801163613.ga2...@momjian.us
> 
>   This patchset is weird because it is perhaps our first case of trying to
>   change the state of the server while it is running.  We just don't have
>   an established protocol for how to orchestrate that, so we are limping
>   along toward a solution.  Forcing a restart is probably part of that
>   primitive orchestration.  We will probably have similar challenges if we
>   ever allowed Postgres to change its data format on the fly.  These
>   challenges are one reason pg_upgrade only modifies the new cluster,
>   never the old one.
> 
> I don't think anyone has done anything wrong --- rather, it is what we
> are _trying_ to do that is complex.

Global state changes in a cluster are complicated, and are likely to stay
that way.  By writing patches that attempt such state changes we can see which
pieces of infrastructure we're lacking to reduce complexity.  A good example is
the ProcSignalBarrier work that Andres and Robert did, inspired in part by this
patch IIUC.  The more we do, the more we learn.

--
Daniel Gustafsson   https://vmware.com/





Re: logical replication seems broken

2021-02-15 Thread er


> On 2021.02.15. 12:31 Amit Kapila  wrote:
> On Mon, Feb 15, 2021 at 11:53 AM vignesh C  wrote:
> > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
> > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > > Maybe a test (check or check-world) should be added to run a second 
> > > replica?  (Assuming that would have caught this bug)
> > >
> > +1 for the idea of having a test for this. I have written a test for this.
> > Thanks for the fix Amit, I could reproduce the issue without your fix
> > and verified that the issue gets fixed with the patch you shared.
> > Attached a patch for the same. Thoughts?
> >
> 
> I have slightly modified the comments in the test case to make things
> clear. I am planning not to backpatch this because there is no way in
> the core code to hit this prior to commit ce0fdbfe97 and we haven't
> received any complaints so far. What do you think?

My tests indeed run OK with this.

(I haven't tested whether the newly added test actually tests for the problem 
that was there - I suppose one of you did that)

Thanks!

Erik Rijkers




Re: logical replication seems broken

2021-02-15 Thread vignesh C
On Mon, Feb 15, 2021 at 5:02 PM Amit Kapila  wrote:
>
> On Mon, Feb 15, 2021 at 11:53 AM vignesh C  wrote:
> >
> > On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
> > >
> > >
> > > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> > >
> > > HEAD is still broken; your patch rescues it, so yes, fixed.
> > >
> > > Maybe a test (check or check-world) should be added to run a second 
> > > replica?  (Assuming that would have caught this bug)
> > >
> >
> > +1 for the idea of having a test for this. I have written a test for this.
> > Thanks for the fix Amit, I could reproduce the issue without your fix
> > and verified that the issue gets fixed with the patch you shared.
> > Attached a patch for the same. Thoughts?
> >
>
> I have slightly modified the comments in the test case to make things
> clear. I am planning not to backpatch this because there is no way in
> the core code to hit this prior to commit ce0fdbfe97 and we haven't
> received any complaints so far. What do you think?

The changes look fine to me.

Regards,
Vignesh




pg_replication_origin_session_setup and superuser

2021-02-15 Thread Zohar Gofer
Hi,

Problem description:
While working on a homegrown, limited solution to replace (a very limited set
of) GoldenGate capabilities, we have created a CDC solution using the WAL
capabilities.

The data flows like this:
PG1 --> Debezium(wal2json) --> Kafka1 --> MM2 --> Kafka2 --> Kafka Connect Sink 
Plugin --> PG2
And we wanted also changes to flow the other direction as well:
PG1 <-- Kafka Connect Sink Plugin <-- Kafka1 <-- MM2 <-- Kafka2 <--  
Debezium(wal2json) <-- PG2

Where our homegrown "Kafka Connect Sink Plugin" will do manipulations on 
replicated data.

How do we prevent cyclic replication in this case?

Looking around I came across this nice explanation:

https://www.highgo.ca/2020/04/18/the-origin-in-postgresql-logical-decoding/

Using the origin to filter records in the wal2json works perfect once we set up 
an origin.

But, calling pg_replication_origin_session_setup requires superuser privileges. 
Our intent is to make this call when starting a write session in the "Kafka 
Connect Sink Plugin" that writes data to PG.

Logical replication is usually done over the replication channel rather than a
normal user-space session, so I see the reason for requiring superuser. This
is aligned with the documentation, so this is not a bug per se.

In my mind the requirement for superuser is too strong. I think that requiring
the privileges of a replication user is more suitable. This way we can require
that only a user with replication privileges will actually do replication, even
if this is not really replication.

Taking it one step further, I see no reason why stamping a session with an
origin requires elevated privileges at all, but I don't know enough about this.

Zohar Gofer




Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Fujii Masao




On 2021/02/15 19:45, Michael Paquier wrote:

On Mon, Feb 15, 2021 at 10:47:05PM +1300, Thomas Munro wrote:

Why not initialise it in WalRcvShmemInit()?


I was thinking about doing that as well, but we have no real need to
initialize this stuff in most cases, say standalone deployments.  In
particular for the fallback implementation of atomics, we would
prepare a spinlock for nothing.


But on second thought, if we make WalReceiverMain() call pg_atomic_init_u64(),
the variable is initialized (i.e., SpinLockInit() is called with
--disable-atomics) every time walreceiver is started. That may be problematic.
If so, the variable needs to be initialized in WalRcvShmemInit() instead.
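
To be concrete, a sketch of the WalRcvShmemInit() approach (assuming the
atomic in question is the writtenUpto counter; illustration only, not a
tested patch):

    void
    WalRcvShmemInit(void)
    {
        bool        found;

        WalRcv = (WalRcvData *)
            ShmemInitStruct("Wal Receiver Ctl", WalRcvShmemSize(), &found);

        if (!found)
        {
            /* First time through, so initialize */
            MemSet(WalRcv, 0, WalRcvShmemSize());
            WalRcv->walRcvState = WALRCV_STOPPED;
            SpinLockInit(&WalRcv->mutex);
            /* initialize once here, not at every walreceiver start */
            pg_atomic_init_u64(&WalRcv->writtenUpto, 0);
        }
    }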

BTW, the recent commit 46d6e5f567 has the similar issue. The variable
that commit added is initialized in InitProcess(), but maybe should be done
in InitProcGlobal() or elsewhere.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: logical replication seems broken

2021-02-15 Thread Amit Kapila
On Mon, Feb 15, 2021 at 11:53 AM vignesh C  wrote:
>
> On Sat, Feb 13, 2021 at 5:58 PM Erik Rijkers  wrote:
> >
> >
> > I compiled just now a binary from HEAD, and a binary from HEAD+patch
> >
> > HEAD is still broken; your patch rescues it, so yes, fixed.
> >
> > Maybe a test (check or check-world) should be added to run a second 
> > replica?  (Assuming that would have caught this bug)
> >
>
> +1 for the idea of having a test for this. I have written a test for this.
> Thanks for the fix Amit, I could reproduce the issue without your fix
> and verified that the issue gets fixed with the patch you shared.
> Attached a patch for the same. Thoughts?
>

I have slightly modified the comments in the test case to make things
clear. I am planning not to backpatch this because there is no way in
the core code to hit this prior to commit ce0fdbfe97 and we haven't
received any complaints so far. What do you think?

-- 
With Regards,
Amit Kapila.


fix_origin_message_2.patch
Description: Binary data


Re: Refactoring HMAC in the core code

2021-02-15 Thread Michael Paquier
On Sat, Jan 23, 2021 at 01:43:20PM +0900, Michael Paquier wrote:
> Rebased patch is attached with SHA1 added as of a8ed6bb.  Now that
> SHA1 is part of the set of options for cryptohashes, a lot of code of
> pgcrypto can be cleaned up thanks to the refactoring done here, but
> I am leaving that as a separate item to address later.

Again a new rebase, giving v5:
- Fixed the APIs to return -1 if the caller gives NULL in input, to be
consistent with cryptohash.
- Added a length argument to pg_hmac_final(), with sanity checks.
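
For reference, a minimal usage sketch based on the signatures declared in the
v5 patch below (PG_SHA256 and PG_SHA256_DIGEST_LENGTH come from the existing
cryptohash/sha2 headers; error handling is simplified):

    /* assumes c.h / postgres.h has been included for uint8 etc. */
    #include "common/hmac.h"
    #include "common/sha2.h"

    static bool
    compute_hmac_sha256(const uint8 *key, size_t keylen,
                        const uint8 *data, size_t datalen,
                        uint8 *result)  /* PG_SHA256_DIGEST_LENGTH bytes */
    {
        pg_hmac_ctx *ctx = pg_hmac_create(PG_SHA256);
        bool        ok;

        if (ctx == NULL)
            return false;
        ok = pg_hmac_init(ctx, key, keylen) >= 0 &&
             pg_hmac_update(ctx, data, datalen) >= 0 &&
             pg_hmac_final(ctx, result, PG_SHA256_DIGEST_LENGTH) >= 0;
        pg_hmac_free(ctx);
        return ok;
    }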
--
Michael
From 3980f5c191d03f76d339f66a4df8a1377368d71b Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 15 Feb 2021 20:23:05 +0900
Subject: [PATCH v5] Refactor HMAC implementations

---
 src/include/common/hmac.h |  29 +++
 src/include/common/md5.h  |   2 +
 src/include/common/scram-common.h |  13 --
 src/include/common/sha1.h |   2 +
 src/include/pg_config.h.in|   6 +
 src/include/utils/resowner_private.h  |   7 +
 src/backend/libpq/auth-scram.c|  61 +++---
 src/backend/utils/resowner/resowner.c |  61 ++
 src/common/Makefile   |   4 +-
 src/common/hmac.c | 263 ++
 src/common/hmac_openssl.c | 256 +
 src/common/scram-common.c | 158 
 src/interfaces/libpq/fe-auth-scram.c  |  71 ---
 configure |   2 +-
 configure.ac  |   2 +-
 src/tools/msvc/Mkvcbuild.pm   |   2 +
 src/tools/msvc/Solution.pm|   4 +
 src/tools/pgindent/typedefs.list  |   2 +-
 18 files changed, 747 insertions(+), 198 deletions(-)
 create mode 100644 src/include/common/hmac.h
 create mode 100644 src/common/hmac.c
 create mode 100644 src/common/hmac_openssl.c

diff --git a/src/include/common/hmac.h b/src/include/common/hmac.h
new file mode 100644
index 00..ea0343a9da
--- /dev/null
+++ b/src/include/common/hmac.h
@@ -0,0 +1,29 @@
+/*-
+ *
+ * hmac.h
+ *	  Generic headers for HMAC
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		  src/include/common/hmac.h
+ *
+ *-
+ */
+
+#ifndef PG_HMAC_H
+#define PG_HMAC_H
+
+#include "common/cryptohash.h"
+
+/* opaque context, private to each HMAC implementation */
+typedef struct pg_hmac_ctx pg_hmac_ctx;
+
+extern pg_hmac_ctx *pg_hmac_create(pg_cryptohash_type type);
+extern int	pg_hmac_init(pg_hmac_ctx *ctx, const uint8 *key, size_t len);
+extern int	pg_hmac_update(pg_hmac_ctx *ctx, const uint8 *data, size_t len);
+extern int	pg_hmac_final(pg_hmac_ctx *ctx, uint8 *dest, size_t len);
+extern void pg_hmac_free(pg_hmac_ctx *ctx);
+
+#endif			/* PG_HMAC_H */
diff --git a/src/include/common/md5.h b/src/include/common/md5.h
index 6d100f5cfc..62a31e6ed4 100644
--- a/src/include/common/md5.h
+++ b/src/include/common/md5.h
@@ -18,6 +18,8 @@
 
 /* Size of result generated by MD5 computation */
 #define MD5_DIGEST_LENGTH 16
+/* Block size for MD5 */
+#define MD5_BLOCK_SIZE	64
 
 /* password-related data */
 #define MD5_PASSWD_CHARSET	"0123456789abcdef"
diff --git a/src/include/common/scram-common.h b/src/include/common/scram-common.h
index 9d684b41e8..5777ce9fe3 100644
--- a/src/include/common/scram-common.h
+++ b/src/include/common/scram-common.h
@@ -46,19 +46,6 @@
  */
 #define SCRAM_DEFAULT_ITERATIONS	4096
 
-/*
- * Context data for HMAC used in SCRAM authentication.
- */
-typedef struct
-{
-	pg_cryptohash_ctx *sha256ctx;
-	uint8		k_opad[SHA256_HMAC_B];
-} scram_HMAC_ctx;
-
-extern int	scram_HMAC_init(scram_HMAC_ctx *ctx, const uint8 *key, int keylen);
-extern int	scram_HMAC_update(scram_HMAC_ctx *ctx, const char *str, int slen);
-extern int	scram_HMAC_final(uint8 *result, scram_HMAC_ctx *ctx);
-
 extern int	scram_SaltedPassword(const char *password, const char *salt,
  int saltlen, int iterations, uint8 *result);
 extern int	scram_H(const uint8 *str, int len, uint8 *result);
diff --git a/src/include/common/sha1.h b/src/include/common/sha1.h
index a61bc47ded..b1ee36f8ea 100644
--- a/src/include/common/sha1.h
+++ b/src/include/common/sha1.h
@@ -15,5 +15,7 @@
 
 /* Size of result generated by SHA1 computation */
 #define SHA1_DIGEST_LENGTH 20
+/* Block size for SHA1 */
+#define SHA1_BLOCK_SIZE 64
 
 #endif			/* PG_SHA1_H */
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 55cab4d2bf..781f747434 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -268,6 +268,12 @@
 /* Define to 1 if you have the `history_truncate_file' function. */
 #undef HAVE_HISTORY_TRUNCATE_FILE
 
+/* Define to 1 if you have the `HMAC_CTX_free' function. */
+#undef HAVE_HMAC_CTX_FREE
+
+/* Define to 1 if you have

Re: 64-bit XIDs in deleted nbtree pages

2021-02-15 Thread Masahiko Sawada
On Sun, Feb 14, 2021 at 3:47 PM Peter Geoghegan  wrote:
>
> On Fri, Feb 12, 2021 at 9:04 PM Peter Geoghegan  wrote:
> > On Fri, Feb 12, 2021 at 8:38 PM Masahiko Sawada  
> > wrote:
> > > I agree that there already are huge problems in that case. But I think
> > > we need to consider an append-only case as well; after bulk deletion
> > > on an append-only table, vacuum deletes heap tuples and index tuples,
> > > marking some index pages as dead and setting an XID into btpo.xact.
> > > Since we trigger autovacuums even by insertions based on
> > > autovacuum_vacuum_insert_scale_factor/threshold autovacuum will run on
> > > the table again. But if there is a long-running query a "wasted"
> > > cleanup scan could happen many times depending on the values of
> > > autovacuum_vacuum_insert_scale_factor/threshold and
> > > vacuum_cleanup_index_scale_factor. This should not happen in the old
> > > code. I agree this is DBA problem but it also means this could bring
> > > another new problem in a long-running query case.
> >
> > I see your point.
>
> My guess is that this concern of yours is somehow related to how we do
> deletion and recycling *in general*. Currently (and even in v3 of the
> patch), we assume that recycling the pages that a VACUUM operation
> deletes will happen "eventually". This kind of makes sense when you
> have "typical vacuuming" -- deletes/updates, and no big bursts, rare
> bulk deletes, etc.
>
> But when you do have a mixture of different triggering positions,
> which is quite possible, it is difficult to understand what
> "eventually" actually means...
>
> > BTW, I am thinking about making recycling take place for pages that
> > were deleted during the same VACUUM. We can just use a
> > work_mem-limited array to remember a list of blocks that are deleted
> > but not yet recyclable (plus the XID found in the block).
>
> ...which brings me back to this idea.
>
> I've prototyped this. It works really well. In most cases the
> prototype makes VACUUM operations with nbtree index page deletions
> also recycle the pages that were deleted, at the end of the
> btvacuumscan(). We do very little or no "indefinite deferring" work
> here. This has obvious advantages, of course, but it also has a
> non-obvious advantage: the awkward question of concerning "what
> eventually actually means" with mixed triggering conditions over time
> mostly goes away. So perhaps this actually addresses your concern,
> Masahiko.

Yes. I think this would simplify the problem by resolving almost all
problems related to indefinite deferring page recycle.

We will be able to recycle almost all just-deleted pages in practice
especially when btvacuumscan() took a long time. And there would not
be a noticeable downside, I think.

BTW, if btree indexes start to use maintenance_work_mem for this purpose,
we also need to set amusemaintenanceworkmem to true, which is
considered during parallel vacuum.
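
Just to make sure I understand the bookkeeping you have in mind, something
like this per-VACUUM array (names invented here for illustration; capped by
maintenance_work_mem as discussed)?

    /* One entry per page deleted by this btvacuumscan() */
    typedef struct BTPendingRecycle
    {
        BlockNumber         blkno;      /* deleted page */
        FullTransactionId   safexid;    /* must be old enough to recycle */
    } BTPendingRecycle;

    typedef struct BTPendingRecycleArray
    {
        BTPendingRecycle   *items;      /* palloc'd array */
        int                 nitems;
        int                 maxitems;   /* bounded by maintenance_work_mem */
    } BTPendingRecycleArray;

At the end of btvacuumscan() we would walk this array and put every entry
whose safexid has become old enough into the FSM.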

>
> I've been testing this with BenchmarkSQL [1], which has several
> indexes that regularly need page deletions. There is also a realistic
> "life cycle" to the data in these indexes. I added custom
> instrumentation to display information about what's going on with page
> deletion when the benchmark is run. I wrote a quick-and-dirty patch
> that makes log_autovacuum show the same information that you see about
> index page deletion when VACUUM VERBOSE is run (including the new
> pages_newly_deleted field from my patch). With this particular
> TPC-C/BenchmarkSQL workload, VACUUM seems to consistently manage to go
> on to place every page that it deletes in the FSM without leaving
> anything to the next VACUUM. There are a very small number of
> exceptions where we "only" manage to recycle maybe 95% of the pages
> that were deleted.

Great!

>
> The race condition that nbtree avoids by deferring recycling was
> always a narrow one, outside of the extremes -- the way we defer has
> always been overkill. It's almost always unnecessary to delay placing
> deleted pages in the FSM until the *next* VACUUM. We only have to
> delay it until the end of the *same* VACUUM -- why wait until the next
> VACUUM if we don't have to? In general this deferring recycling
> business has nothing to do with MVCC/GC/whatever, and yet the code
> seems to suggest that it does. While it is convenient to use an XID
> for page deletion and recycling as a way of implementing what Lanin &
> Shasha call "the drain technique" [2], all we have to do is prevent
> certain race conditions. This is all about the index itself, the data
> structure, how it is maintained -- nothing more. It almost seems
> obvious to me.

Agreed.

>
> It's still possible to imagine extremes. Extremes that even the "try
> to recycle pages we ourselves deleted when we reach the end of
> btvacuumscan()" version of my patch cannot deal with. Maybe it really
> is true that it's inherently impossible to recycle a deleted page even
> at the end of a VACUUM -- maybe a long-running transaction (that could

Re: increase size of pg_commit_ts buffers

2021-02-15 Thread Andrey Borodin



> On 16 Jan 2021, at 03:07, Alvaro Herrera wrote:
> 
> Andrey Borodin already has a patch to change the behavior for
> multixact, which is something we should perhaps also do.  I now notice
> that they're also reporting a bug in that thread ... sigh

In that thread [0] I've tried to do a more intelligent optimisation than just
increasing the number of buffers.
Though, in fact, more memory had a dramatically better effect than the lock tricks.

Maybe let's make all SLRUs buffer sizes configurable?

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/flat/b4911e88-9969-aaba-f6be-ed57bd5fec36%40darold.net#ecfdfc8a40af563a0f8b1211266b6fcc



Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Michael Paquier
On Mon, Feb 15, 2021 at 10:47:05PM +1300, Thomas Munro wrote:
> Why not initialise it in WalRcvShmemInit()?

I was thinking about doing that as well, but we have no real need to
initialize this stuff in most cases, say standalone deployments.  In
particular for the fallback implementation of atomics, we would
prepare a spinlock for nothing.
--
Michael


signature.asc
Description: PGP signature


Re: increase size of pg_commit_ts buffers

2021-02-15 Thread Noah Misch
On Fri, Jan 15, 2021 at 07:07:44PM -0300, Alvaro Herrera wrote:
> I wrote this patch last year in response to a customer issue and I
> thought I had submitted it here, but evidently I didn't.  So here it is.
> 
> The short story is: in commit 5364b357fb11 we increased the size of
> pg_commit (née pg_clog) but we didn't increase the size of pg_commit_ts
> to match.  When commit_ts is in use, this can lead to significant buffer
> thrashing and thus poor performance.
> 
> Since commit_ts entries are larger than pg_commit, my proposed patch uses
> twice as many buffers.

This is a step in the right direction.  With commit_ts entries being forty
times as large as pg_xact, it's not self-evident that just twice as many
buffers is appropriate.  Did you try other numbers?  I'm fine with proceeding
even if not, but the comment should then admit that the new number was a guess
that solved problems for one site.

> --- a/src/backend/access/transam/commit_ts.c
> +++ b/src/backend/access/transam/commit_ts.c
> @@ -530,7 +530,7 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)

The comment right above here is outdated.

>  Size
>  CommitTsShmemBuffers(void)
>  {
> - return Min(16, Max(4, NBuffers / 1024));
> + return Min(256, Max(4, NBuffers / 512));
>  }
>  
>  /*
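
To make the sizing concrete, here is a small standalone comparison of the old
and proposed formulas at a few shared_buffers settings (illustration only;
NBuffers is in 8kB pages):

    #include <stdio.h>

    static int imin(int a, int b) { return a < b ? a : b; }
    static int imax(int a, int b) { return a > b ? a : b; }

    /* old: Min(16, Max(4, NBuffers / 1024)); new: Min(256, Max(4, NBuffers / 512)) */
    static int old_buffers(int nbuffers) { return imin(16, imax(4, nbuffers / 1024)); }
    static int new_buffers(int nbuffers) { return imin(256, imax(4, nbuffers / 512)); }

    int
    main(void)
    {
        int sizes[] = {16384 /* 128MB */, 131072 /* 1GB */, 1048576 /* 8GB */};

        for (int i = 0; i < 3; i++)
            printf("NBuffers=%d: old=%d buffers, new=%d buffers\n",
                   sizes[i], old_buffers(sizes[i]), new_buffers(sizes[i]));
        return 0;
    }

With 128MB of shared_buffers both formulas are close (16 vs 32 buffers), but at
1GB and above the old formula stays pinned at 16 buffers while the new one
allows up to 256.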




Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Thomas Munro
On Mon, Feb 15, 2021 at 9:27 PM Michael Paquier  wrote:
> On Thu, Feb 11, 2021 at 11:30:13PM +0900, Fujii Masao wrote:
> > Yes, so what about the attached patch?
>
> I see.  So the first error triggering the spinlock error would cause
> a transaction failure because the fallback implementation of atomics
> uses a spinlock for this variable, and it may not initialized in this
> code path.

Why not initialise it in WalRcvShmemInit()?




Re: PG vs LLVM 12 on seawasp, next round

2021-02-15 Thread Fabien COELHO


Hello Thomas,

Thanks for looking at this, I'm currently far behind on many things and 
not very responsive:-(



Here is the creation of llvmjit.so:

g++ ... -o llvmjit.so ... -L/home/fabien/clgtk/lib ... -lLLVMOrcJIT ...

That'd be from llvm-config --ldflags or similar, from this binary:

checking for llvm-config... (cached) /home/fabien/clgtk/bin/llvm-config

So what does ls -slap /home/fabien/clgtk/lib show?


Plenty files and links, eg:

 0 lrwxrwxrwx 1 fabien fabien   21 janv. 23 09:40 
/home/fabien/clgtk/lib/libLLVMMCJIT.so -> libLLVMMCJIT.so.12git
  2140 -rw-r--r-- 1 fabien fabien  2190824 janv. 23 09:28 
/home/fabien/clgtk/lib/libLLVMMCJIT.so.12git
 0 lrwxrwxrwx 1 fabien fabien   22 janv. 23 09:40 
/home/fabien/clgtk/lib/libLLVMOrcJIT.so -> libLLVMOrcJIT.so.12git
 40224 -rw-r--r-- 1 fabien fabien 41187208 janv. 23 09:34 
/home/fabien/clgtk/lib/libLLVMOrcJIT.so.12git

Hmmm, clang recompilation has been failing for some weeks. Argh, this is 
due to the recent "master" to "main" branch renaming.



 What does ldd
/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so


No such file or directory :-) because the directory is cleaned up.


show?  What do you have in seawasp's LD_LIBRARY_PATH, and is the
problem that you need to add /home/fabien/clgtk/lib to it


Argh. Would it be so stupid? :-( I thought the configuration stuff would
manage the link path automatically, which may be quite naïve, indeed.


PS Could you try blowing away the accache directory so we can rule out
bad cached configure stuff?


Hmmm. I've tried that before. I can do it again.

I've added an explicit LD_LIBRARY_PATH, which will be triggered at some 
point later.


--
Fabien Coelho - CRI, MINES ParisTech

Re: A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Amit Langote
Hi Seamus,

On Mon, Feb 15, 2021 at 5:28 PM Seamus Abshere  wrote:
> It turns out parallel_workers may be a useful reloption for certain uses of 
> partitioned tables, at least if they're made up of fancy column store 
> partitions (see 
> https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).
>
> Would somebody tell me what I'm doing wrong? I would love to submit a patch 
> but I'm stuck:
>
> diff --git a/src/backend/optimizer/path/allpaths.c 
> b/src/backend/optimizer/path/allpaths.c
> index cd3fdd259c..f1ade035ac 100644
> --- a/src/backend/optimizer/path/allpaths.c
> +++ b/src/backend/optimizer/path/allpaths.c
> @@ -3751,6 +3751,7 @@ compute_parallel_worker(RelOptInfo *rel, double 
> heap_pages, double index_pages,
>  * If the user has set the parallel_workers reloption, use that; 
> otherwise
>  * select a default number of workers.
>  */
> +   // I want to affect this
> if (rel->rel_parallel_workers != -1)
> parallel_workers = rel->rel_parallel_workers;
> else
>
> so I do this
>
> diff --git a/src/backend/access/common/reloptions.c 
> b/src/backend/access/common/reloptions.c
> index c687d3ee9e..597b209bfb 100644
> --- a/src/backend/access/common/reloptions.c
> +++ b/src/backend/access/common/reloptions.c
> @@ -1961,13 +1961,15 @@ build_local_reloptions(local_relopts *relopts, Datum 
> options, bool validate)
>  bytea *
>  partitioned_table_reloptions(Datum reloptions, bool validate)
>  {
> -   /*
> -* There are no options for partitioned tables yet, but this is able 
> to do
> -* some validation.
> -*/
> +   static const relopt_parse_elt tab[] = {
> +   {"parallel_workers", RELOPT_TYPE_INT,
> +   offsetof(StdRdOptions, parallel_workers)},
> +   };
> +
> return (bytea *) build_reloptions(reloptions, validate,
>   
> RELOPT_KIND_PARTITIONED,
> - 0, 
> NULL, 0);
> + 
> sizeof(StdRdOptions),
> + 
> tab, lengthof(tab));
>  }
>
> That "works":
>
> postgres=# alter table test_3pd_cstore_partitioned set (parallel_workers = 
> 33);
> ALTER TABLE
> postgres=# select relname, relkind, reloptions from pg_class where relname = 
> 'test_3pd_cstore_partitioned';
>relname   | relkind |  reloptions
> -+-+---
>  test_3pd_cstore_partitioned | p   | {parallel_workers=33}
> (1 row)
>
> But it seems to be ignored:
>
> diff --git a/src/backend/optimizer/path/allpaths.c 
> b/src/backend/optimizer/path/allpaths.c
> index cd3fdd259c..c68835ce38 100644
> --- a/src/backend/optimizer/path/allpaths.c
> +++ b/src/backend/optimizer/path/allpaths.c
> @@ -3751,6 +3751,8 @@ compute_parallel_worker(RelOptInfo *rel, double 
> heap_pages, double index_pages,
>  * If the user has set the parallel_workers reloption, use that; 
> otherwise
>  * select a default number of workers.
>  */
> +   // I want to affect this, but this assertion always passes
> +   Assert(rel->rel_parallel_workers == -1);
> if (rel->rel_parallel_workers != -1)
> parallel_workers = rel->rel_parallel_workers;
> else

You may see by inspecting the callers of compute_parallel_worker()
that it never gets called on a partitioned table, only its leaf
partitions.  Maybe you could try calling compute_parallel_worker()
somewhere in add_paths_to_append_rel(), which has this code to figure
out parallel_workers to use for a parallel Append path for a given
partitioned table:

/* Find the highest number of workers requested for any subpath. */
foreach(lc, partial_subpaths)
{
Path   *path = lfirst(lc);

parallel_workers = Max(parallel_workers, path->parallel_workers);
}
Assert(parallel_workers > 0);

/*
 * If the use of parallel append is permitted, always request at least
 * log2(# of children) workers.  We assume it can be useful to have
 * extra workers in this case because they will be spread out across
 * the children.  The precise formula is just a guess, but we don't
 * want to end up with a radically different answer for a table with N
 * partitions vs. an unpartitioned table with the same data, so the
 * use of some kind of log-scaling here seems to make some sense.
 */
if (enable_parallel_append)
{
parallel_workers = Max(parallel_workers,
   fls(list_length(live_childrels)));
parallel_workers = Min(parallel_workers,
   max_parallel_workers_p

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2021-02-15 Thread Amit Langote
Tsunakawa-san, Andrey,

On Mon, Feb 15, 2021 at 1:54 PM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Justin Pryzby 
> > This is crashing during fdw check.
> > http://cfbot.cputube.org/andrey-lepikhov.html
> >
> > Maybe it's related to this patch:
> > |commit 6214e2b2280462cbc3aa1986e350e167651b3905
> > |Fix permission checks on constraint violation errors on partitions.
> > |Security: CVE-2021-3393
>
> Thank you for your kind detailed investigation.  The rebased version is 
> attached.

Thanks for updating the patch.

The commit message says this:

Move that decision logic into InitResultRelInfo which now sets a new
boolean field ri_usesMultiInsert of ResultRelInfo when a target
relation is first initialized.  That prevents repeated computation
of the same information in some cases, especially for partitions,
and the new arrangement results in slightly more readability.
Enum CopyInsertMethod removed. This logic implements by ri_usesMultiInsert
field of the ResultRelInfo structure.

However, it is no longer InitResultRelInfo() that sets
ri_usesMultiInsert.  Doing that is now left for concerned functions
who set it when they have enough information to do that correctly.
Maybe update the message to make that clear to interested readers.

+   /*
+* Use multi-insert mode if the condition checking passes for the
+* parent and its child.
+*/
+   leaf_part_rri->ri_usesMultiInsert =
+   ExecMultiInsertAllowed(leaf_part_rri, rootResultRelInfo);

Think I have mentioned upthread that this looks better as:

if (rootResultRelInfo->ri_usesMultiInsert)
leaf_part_rri->ri_usesMultiInsert = ExecMultiInsertAllowed(leaf_part_rri);

This keeps the logic confined to ExecInitPartitionInfo() where it
belongs.  No point in burdening other callers of
ExecMultiInsertAllowed() in deciding whether or not it should pass a
valid value for the 2nd parameter.

+static void
+postgresBeginForeignCopy(ModifyTableState *mtstate,
+  ResultRelInfo *resultRelInfo)
+{
...
+   if (resultRelInfo->ri_RangeTableIndex == 0)
+   {
+   ResultRelInfo *rootResultRelInfo = resultRelInfo->ri_RootResultRelInfo;
+
+   rte = exec_rt_fetch(rootResultRelInfo->ri_RangeTableIndex, estate);

It's better to add an Assert(rootResultRelInfo != NULL) here.
Apparently, there are cases where ri_RangeTableIndex == 0 without
ri_RootResultRelInfo being set.  The Assert will ensure that
BeginForeignCopy() is not mistakenly called on such ResultRelInfos.

+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+   appendStringInfoString(buf, "COPY ");
+   deparseRelation(buf, rel);
+   (void) deparseRelColumnList(buf, rel, true);
+
+   appendStringInfoString(buf, " FROM STDIN ");
+}

I can't parse what the function's comment says about "using list of
parameters".  Maybe it means to say "list of columns" specified in the
COPY FROM statement.  How about writing this as:

/*
 * Deparse remote COPY FROM statement
 *
 * Note that this explicitly specifies the list of COPY's target columns
 * to account for the fact that the remote table's columns may not match
 * exactly with the columns declared in the local definition.
 */

I'm hoping that I'm interpreting the original note correctly.  Andrey?

+
+ mtstate is the overall state of the
+ ModifyTable plan node being executed;
global data about
+ the plan and execution state is available via this structure.
...
+typedef void (*BeginForeignCopy_function) (ModifyTableState *mtstate,
+  ResultRelInfo *rinfo);

Maybe a bit late realizing this, but why does BeginForeignCopy()
accept a ModifyTableState pointer whereas maybe just an EState pointer
will do?  I can't imagine why an FDW would want to look at the
ModifyTableState.  Case in point, I see that
postgresBeginForeignCopy() only uses the EState from the
ModifyTableState passed to it.  I think the ResultRelInfo that's being
passed to the Copy APIs contains most of the necessary information.
Also, EndForeignCopy() seems fine with just receiving the EState.
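
To be concrete, the alternative I have in mind is just (sketch only):

    typedef void (*BeginForeignCopy_function) (EState *estate,
                                               ResultRelInfo *rinfo);

    typedef void (*EndForeignCopy_function) (EState *estate,
                                             ResultRelInfo *rinfo);

That seems to give the FDW everything it needs via estate and rinfo.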

+   TupleDesc   tupDesc;/* COPY TO will be used for manual tuple copying
+ * into the destination */
...
@@ -382,19 +393,24 @@ EndCopy(CopyToState cstate)
 CopyToState
 BeginCopyTo(ParseState *pstate,
Relation rel,
+   TupleDesc srcTupDesc,

I think that either the commentary around tupDesc/srcTupDesc needs to
be improved or we should really find a way to do this without
maintaining TupleDesc separately from the CopyState.rel.  IIUC, this
change is merely to allow postgres_fdw's ExecForeignCopy() to use
CopyOneRowTo(), which needs to be passed a valid CopyState.
postgresBeginForeignCopy() initializes one by calling BeginCopyTo(),
but it can't just pass the target foreign Relation to it, because
generic BeginCopyTo() has this:

if 

A reloption for partitioned tables - parallel_workers

2021-02-15 Thread Seamus Abshere
hi,

It turns out parallel_workers may be a useful reloption for certain uses of 
partitioned tables, at least if they're made up of fancy column store 
partitions (see 
https://www.postgresql.org/message-id/7d6fdc20-857c-4cbe-ae2e-c0ff9520ed55%40www.fastmail.com).

Would somebody tell me what I'm doing wrong? I would love to submit a patch but 
I'm stuck:

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cd3fdd259c..f1ade035ac 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3751,6 +3751,7 @@ compute_parallel_worker(RelOptInfo *rel, double heap_pages, double index_pages,
     * If the user has set the parallel_workers reloption, use that; otherwise
     * select a default number of workers.
     */
+   // I want to affect this
    if (rel->rel_parallel_workers != -1)
        parallel_workers = rel->rel_parallel_workers;
    else

so I do this

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index c687d3ee9e..597b209bfb 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -1961,13 +1961,15 @@ build_local_reloptions(local_relopts *relopts, Datum options, bool validate)
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
-   /*
-    * There are no options for partitioned tables yet, but this is able to do
-    * some validation.
-    */
+   static const relopt_parse_elt tab[] = {
+       {"parallel_workers", RELOPT_TYPE_INT,
+       offsetof(StdRdOptions, parallel_workers)},
+   };
+
    return (bytea *) build_reloptions(reloptions, validate,
                                      RELOPT_KIND_PARTITIONED,
-                                     0, NULL, 0);
+                                     sizeof(StdRdOptions),
+                                     tab, lengthof(tab));
 }

That "works":

postgres=# alter table test_3pd_cstore_partitioned set (parallel_workers = 33);
ALTER TABLE
postgres=# select relname, relkind, reloptions from pg_class where relname = 'test_3pd_cstore_partitioned';
           relname           | relkind |      reloptions
-----------------------------+---------+-----------------------
 test_3pd_cstore_partitioned | p       | {parallel_workers=33}
(1 row)

But it seems to be ignored:

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index cd3fdd259c..c68835ce38 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3751,6 +3751,8 @@ compute_parallel_worker(RelOptInfo *rel, double heap_pages, double index_pages,
     * If the user has set the parallel_workers reloption, use that; otherwise
     * select a default number of workers.
     */
+   // I want to affect this, but this assertion always passes
+   Assert(rel->rel_parallel_workers == -1);
    if (rel->rel_parallel_workers != -1)
        parallel_workers = rel->rel_parallel_workers;
    else

Thanks, and please forgive my code-pasting etiquette; this is my first post to
pgsql-hackers and I'm not quite sure what the right format is.

Thank you,
Seamus




Re: ERROR: invalid spinlock number: 0

2021-02-15 Thread Michael Paquier
On Thu, Feb 11, 2021 at 11:30:13PM +0900, Fujii Masao wrote:
> Yes, so what about the attached patch?

I see.  So the first access that triggers the spinlock error would cause
a transaction failure, because the fallback implementation of atomics
uses a spinlock for this variable and it may not be initialized in this
code path.
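
For the archives, here is a minimal sketch of that failure mode (assuming
a build configured with --disable-atomics --disable-spinlocks, where
pg_atomic_uint64 is emulated with a spinlock):

    pg_atomic_uint64 val;

    /*
     * If pg_atomic_init_u64() is skipped, the emulation spinlock embedded
     * in the variable stays uninitialized and the first
     * pg_atomic_read_u64() fails with "invalid spinlock number: 0".
     */
    pg_atomic_init_u64(&val, 0);
    (void) pg_atomic_read_u64(&val);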

> We didn't notice this issue for a long time because no regression test
> checks pg_stat_wal_receiver. So I included such a test in the patch.

Moving that behind ready_to_display is fine by me, seeing where the
initialization is done.  The test case is a good addition.

+* Read "writtenUpto" without holding a spinlock. So it may not be
+* consistent with other WAL receiver's shared variables protected by a
+* spinlock. This is OK because that variable is used only for
+* informational purpose and should not be used for data integrity checks.
It seems to me that the first two sentences of this comment should be
combined together.
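Something along these lines, perhaps (just a suggested rewording of the
quoted comment):

    /*
     * Read "writtenUpto" without holding a spinlock, so it may not be
     * consistent with the other WAL receiver shared variables protected
     * by the spinlock.  This is OK because the variable is used only for
     * informational purposes and should not be used for data integrity
     * checks.
     */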
--
Michael




Re: Some regular-expression performance hacking

2021-02-15 Thread Joel Jacobson
On Mon, Feb 15, 2021, at 04:11, Tom Lane wrote:
>I got these runtimes (non-cassert builds):
>
>HEAD 313661.149 ms (05:13.661)
>+0001 297397.293 ms (04:57.397) 5% better than HEAD
>+0002 151995.803 ms (02:31.996) 51% better than HEAD
>+0003 139843.934 ms (02:19.844) 55% better than HEAD
>+0004 95034.611 ms (01:35.035) 69% better than HEAD
>
>Since I don't have all the tables used in your query, I can't
>try to reproduce your results exactly.  I suspect the reason
>I'm getting a better percentage improvement than you did is
>that the joining/grouping/ordering involved in your query
>creates a higher baseline query cost.

Mind-blowing speed-up, wow!

I've tested all 4 patches successfully.

To eliminate the baseline cost of the join,
I first created this table:

CREATE TABLE performance_test AS
SELECT
  subjects.subject,
  patterns.pattern,
  tests.is_match,
  tests.captured
FROM tests
JOIN subjects ON subjects.subject_id = tests.subject_id
JOIN patterns ON patterns.pattern_id = subjects.pattern_id
JOIN server_versions ON server_versions.server_version_num = tests.server_version_num
WHERE server_versions.server_version = current_setting('server_version')
AND tests.error IS NULL
;

Then I ran this query:

\timing

SELECT
  is_match <> (subject ~ pattern),
  captured IS DISTINCT FROM regexp_match(subject, pattern),
  COUNT(*)
FROM performance_test
GROUP BY 1,2
ORDER BY 1,2
;

All patches gave the same result:

 ?column? | ?column? |  count
----------+----------+---------
 f        | f        | 1448212
(1 row)

I.e., no detected semantic differences.

Timing differences:

HEAD  570632.722 ms (09:30.633)
+0001 472938.857 ms (07:52.939) 17% better than HEAD
+0002 451638.049 ms (07:31.638) 20% better than HEAD
+0003 439377.813 ms (07:19.378) 23% better than HEAD
+0004 96447.038 ms (01:36.447) 83% better than HEAD

I tested on my MacBook Pro 2.4GHz 8-Core Intel Core i9, 32 GB 2400 MHz DDR4 
running macOS Big Sur 11.1:

SELECT version();
                                                      version
---------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14devel on x86_64-apple-darwin20.2.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit
(1 row)

My HEAD = 46d6e5f567906389c31c4fb3a2653da1885c18ee.

PostgreSQL was compiled with just ./configure, no parameters, and the only 
non-default postgresql.conf settings were these:
log_destination = 'csvlog'
logging_collector = on
log_filename = 'postgresql.log'

Amazing work!

I hope to have a new dataset ready soon with regex flags applied to the
subjects as well.

/Joel