Re: Pg14, pg_dumpall and "password_encryption=true"

2021-01-15 Thread Noah Misch
On Fri, Jan 15, 2021 at 01:35:50PM +0900, Ian Lawrence Barwick wrote:
> $ tail -3 pg_upgrade_utility.log
> ALTER ROLE "postgres" SET "password_encryption" TO 'true';
> psql:pg_upgrade_dump_globals.sql:75: ERROR:  invalid value for
> parameter "password_encryption": "true"
> HINT:  Available values: md5, scram-sha-256.
> 
> This is a consequence of commit c7eab0e97, which removed support for the
> legacy values "on"/"true"/"yes"/"1".

Right.  This can happen anytime we reduce the domain of a setting having
GucContext PGC_SU_BACKEND, PGC_BACKEND, PGC_SUSET, or PGC_USERSET.

> I see following options to resolve this issue.
> 
> 1. Re-add support for a "true" as an alias for "md5".
> 2. Transparently rewrite "true" to "md5"
> 3. Have pg_dumpall map "true" to "md5"
> 4. Add an option to pg_dumpall to specify the target version

I expect rather few databases override this particular setting in
pg_proc.proconfig or pg_db_role_setting.setconfig.  The restore failure has a
clear error message, which is enough.  Each of the above would be overkill.

> 5. Have pg_upgrade emit a warning/hint

If done in a way not specific to this one setting, +1 for this.  That is to
say, do something like the following.  Retrieve every pg_proc.proconfig and
pg_db_role_setting.setconfig value from the old cluster.  Invoke the new
postmaster binary to test acceptance of each value.  I'm not generally a fan
of adding pg_upgrade code to predict whether the dump will fail to restore,
because such code will never be as good as just trying the restore.  That
said, this checking of GUC acceptance could be self-contained and useful for
the long term.
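
For illustration only, the first step of such a check might amount to a query
like this against the old cluster (a rough sketch, not from any patch; the
catalog and column names are real):

  SELECT 'pg_db_role_setting' AS source, setconfig AS settings
  FROM pg_db_role_setting
  WHERE setconfig IS NOT NULL
  UNION ALL
  SELECT 'pg_proc', proconfig
  FROM pg_proc
  WHERE proconfig IS NOT NULL;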

> 6. Document this as a backwards-compatibility thing
> 
> 
> Add an item in the documentation (release notes, pg_upgrade, pg_dumpall)
> stating that any occurrences of "password_encryption" which are not valid
> in Pg14 should be updated before performing an upgrade.

The release notes will document the password_encryption change, though they
probably won't specifically mention the interaction with pg_dump.  I would not
document it elsewhere.
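
For anyone who does hit the restore failure, cleaning up the old cluster
before the upgrade could be as simple as the following sketch (assuming the
old server is new enough to accept scram-sha-256; otherwise use 'md5' or just
RESET):

  ALTER ROLE postgres SET password_encryption = 'scram-sha-256';
  -- or drop the per-role override entirely:
  ALTER ROLE postgres RESET password_encryption;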




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Bharath Rupireddy
On Sat, Jan 16, 2021 at 12:02 PM Amit Kapila  wrote:
>
> On Sat, Jan 16, 2021 at 11:52 AM Bharath Rupireddy
>  wrote:
> >
> > I made an entry in the commitfest[1], so that the patches will get a
> > chance to run on all the platforms.
> >
> > Attaching v4 patch set i.e. 0001 - fix, 0002 - test case.
> >
>
> In the test, can we have multiple publications for the subscription
> because that is how we discovered that the originally proposed patch
> was not correct? Also, is it possible to extend some of the existing
> tests in 001_rep_changes.pl or anywhere so that we can avoid some cost
> of the replication setup.

Sure, I will add the multiple publications use case provided by japin
and also see if I can move the tests from 100_bugs.pl to
001_rep_changes.pl.
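
In SQL terms, the multiple-publications scenario boils down to roughly the
following sketch (only an outline of the case, not the TAP test itself; the
connection string is elided):

  -- publisher
  CREATE PUBLICATION pub1 FOR TABLE t1;
  CREATE PUBLICATION pub2 FOR TABLE t2;
  -- subscriber
  CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub1, pub2;
  -- publisher: drop the table from only one of the publications
  ALTER PUBLICATION pub1 DROP TABLE t1;
  -- subscriber
  ALTER SUBSCRIPTION sub REFRESH PUBLICATION;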

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread japin


On Sat, 16 Jan 2021 at 14:21, Bharath Rupireddy 
 wrote:
> I made an entry in the commitfest[1], so that the patches will get a
> chance to run on all the platforms.
>
> Attaching v4 patch set i.e. 0001 - fix, 0002 - test case.
>
> [1] - https://commitfest.postgresql.org/32/2944/
>

Thanks for the updated patch.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Amit Kapila
On Sat, Jan 16, 2021 at 11:52 AM Bharath Rupireddy
 wrote:
>
> I made an entry in the commitfest[1], so that the patches will get a
> chance to run on all the platforms.
>
> Attaching v4 patch set i.e. 0001 - fix, 0002 - test case.
>

In the test, can we have multiple publications for the subscription
because that is how we discovered that the originally proposed patch
was not correct? Also, is it possible to extend some of the existing
tests in 001_rep_changes.pl or anywhere so that we can avoid some cost
of the replication setup.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Bharath Rupireddy
I made an entry in the commitfest[1], so that the patches will get a
chance to run on all the platforms.

Attaching v4 patch set i.e. 0001 - fix, 0002 - test case.

[1] - https://commitfest.postgresql.org/32/2944/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v4-0001-Fix-ALTER-PUBLICATION.DROP-TABLE-behaviour.patch
Description: Binary data


v4-0002-Test-behaviour-of-ALTER-PUBLICATION-.-DROP-TABLE.patch
Description: Binary data


Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 7:35 PM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> Three solutions(?) quickly come to my mind:
>
>
> (1) Have the user specify whether they want to parallelize DML
> Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel 
> DML execution by default.  The reason is described as "This mode is required 
> because parallel DML and serial DML have different locking, transaction, and 
> disk space requirements and parallel DML is disabled for a session by 
> default."  To enable parallel DML in a session or in a specific statement, 
> you need to run either of the following:
>
>   ALTER SESSION ENABLE PARALLEL DML;
>   INSERT /*+ ENABLE_PARALLEL_DML */ …
>
> Besides, the user has to specify a parallel hint in a DML statement, or 
> specify the parallel attribute in CREATE or ALTER TABLE.
>
> SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT 
> statement like this:
>
>   INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT 
> ...;
>

I think it would be good if the parallelism works by default when
required but I guess if we want to use something on these lines then
we can always check if the parallel_workers option is non-zero for a
relation (with RelationGetParallelWorkers). So users can always say
Alter Table tbl Set (parallel_workers = 0) if they don't want
to enable write parallelism for tbl, and if someone is bothered that
this might impact Selects as well because the same option is used to
compute the number of workers for it then we can invent a second
option parallel_dml_workers or something like that.
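
As a concrete sketch of the above (parallel_dml_workers is purely hypothetical
at this point):

  ALTER TABLE tbl SET (parallel_workers = 0);        -- existing reloption, also used for scans
  -- ALTER TABLE tbl SET (parallel_dml_workers = 0); -- hypothetical separate knob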

>
> (2) Postpone the parallel safety check after the planner finds a worthwhile 
> parallel query plan
> I'm not sure if the current planner code allows this easily...
>

I think it is possible but it has a bit of disadvantage as well as
mentioned in response to Ashutosh's email [1].

>
> (3) Record the parallel safety in system catalog
> Add a column like relparallel in pg_class that indicates the parallel safety 
> of the relation.  planner just checks the value instead of doing heavy work 
> for every SQL statement.  That column's value is modified whenever a relation 
> alteration is made that affects the parallel safety, such as adding a domain 
> column and CHECK constraint.  In case of a partitioned relation, the parallel 
> safety attributes of all its descendant relations are merged.  For example, 
> if a partition becomes parallel-unsafe, the ascendant partitioned tables also 
> become parallel-unsafe.
>
> But... developing such code would be burdensome and bug-prone?
>
>
> I'm inclined to propose (1).  Parallel DML would be something that a limited 
> people run in limited circumstances (data loading in data warehouse and batch 
> processing in OLTP systems by the DBA or data administrator), so I think it's 
> legitimate to require explicit specification of parallelism.
>
> As an aside, (1) and (2) has a potential problem with memory consumption.
>

I can see the memory consumption argument for (2), because we might end
up generating parallel paths (partial paths) for reading the table, but
I don't see how it applies to (1).

[1] - 
https://www.postgresql.org/message-id/CAA4eK1J80Rzn4M-A5sfkmJ8NjgTxbaC8UWVaNHK6%2B2BCYYv2Nw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




new release pspg

2021-01-15 Thread Pavel Stehule
Hi

I released pspg 4.0.0 https://github.com/okbob/pspg/releases/tag/4.0.0

Now with the possibility to export content to file or clipboard in CSV,
TSVC, text or INSERT formats.

pspg is a pager like "less" or "more" designed especially for use in TUI
database clients like "psql". It can also work as a CSV viewer.

https://github.com/okbob/pspg

Regards

Pavel


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-15 Thread Bharath Rupireddy
On Thu, Jan 14, 2021 at 3:52 PM Fujii Masao  wrote:
> On 2021/01/09 10:12, Bharath Rupireddy wrote:
> > On Fri, Jan 8, 2021 at 9:55 AM Bharath Rupireddy
> >  wrote:
> >> I will make the changes and post a new patch set soon.
> >
> > Attaching v9 patch set that has addressed the review comments on the
> > disconnect function returning setof records, documentation changes,
> > and postgres_fdw--1.0-1.1.sql changes.
> >
> > Please consider the v9 patch set for further review.
>
> Thanks for updating the patch! I reviewed only 0001 patch.
>
> +   /*
> +* Quick exit if the cache has been destroyed in
> +* disconnect_cached_connections.
> +*/
> +   if (!ConnectionHash)
> +   return;
>
> This code is not necessary at least in pgfdw_xact_callback() and
> pgfdw_subxact_callback()? Because those functions check
> "if (!xact_got_connection)" before checking the above condition.

Yes, if xact_got_connection is true, then ConnectionHash wouldn't have
been cleaned up in disconnect_cached_connections. +1 to remove that in
pgfdw_xact_callback and pgfdw_subxact_callback. But we need that check
in pgfdw_inval_callback, because we may reach there after
ConnectionHash is destroyed and set to NULL in
disconnect_cached_connections.

> +   server = GetForeignServerExtended(entry->serverid, true);
>
> Since the type of second argument in GetForeignServerExtended() is bits16,
> it's invalid to specify "true" there?

Yeah. I will change it to be something like below:
bits16    flags = FSV_MISSING_OK;
server = GetForeignServerExtended(entry->serverid, flags);

> +   if (no_server_conn_cnt > 0)
> +   {
> +       ereport(WARNING,
> +               (errmsg_plural("found an active connection for which the foreign server would have been dropped",
> +                              "found some active connections for which the foreign servers would have been dropped",
> +                              no_server_conn_cnt),
> +                no_server_conn_cnt > 1 ?
> +                errdetail("Such connections are discarded at the end of remote transaction.")
> +                : errdetail("Such connection is discarded at the end of remote transaction.")));
>
> At least for me, I like returning such connections with "NULL" in server_name
> column and "false" in valid column, rather than emitting a warning. Because
> which would enable us to count the number of actual foreign connections
> easily by using SQL, for example.

+1. I was also of a similar opinion about this initially. I will change this.
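
For reference, the kind of SQL counting mentioned above might then look like
this sketch, assuming the function ends up returning server_name and valid
columns as discussed:

  SELECT count(*) FROM postgres_fdw_get_connections() WHERE server_name IS NOT NULL;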

> +* During the first call, we initialize the function context, get the 
> list
> +* of active connections using get_connections and store this in the
> +* function's memory context so that it can live multiple calls.
> +*/
> +   if (SRF_IS_FIRSTCALL())
>
> I guess that you used value-per-call mode to make the function return
> a set result since you referred to dblink_get_pkey(). But isn't it better to
> use materialize mode like dblink_get_notify() does rather than
> value-per-call because this function returns not so many records? ISTM
> that we can simplify postgres_fdw_get_connections() by using materialize mode.

Yeah, +1. I will change it to use materialize mode.

> +   hash_destroy(ConnectionHash);
> +   ConnectionHash = NULL;
>
> If GetConnection() is called after ConnectionHash is destroyed,
> it initializes the hashtable and registers some callback functions again
> even though the same functions have already been registered. This causes
> the same function to be registered as a callback more than once. This is
> a bug.

Yeah, we will register the same callbacks many times. I'm thinking to
have something like below:

static bool conn_cache_destroyed = false;

if (!active_conn_exists)
{
hash_destroy(ConnectionHash);
ConnectionHash = NULL;
conn_cache_destroyed = true;
}

/*
 * Register callback functions that manage connection cleanup. This
 * should be done just once in each backend. We don't register the
 * callbacks again, if the connection cache is destroyed at least once
 * in the backend.
 */
if (!conn_cache_destroyed)
{
RegisterXactCallback(pgfdw_xact_callback, NULL);
RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
  pgfdw_inval_callback, (Datum) 0);
CacheRegisterSyscacheCallback(USERMAPPINGOID,
  pgfdw_inval_callback, (Datum) 0);
}

Thoughts?

> +CREATE FUNCTION postgres_fdw_disconnect ()
>
> Do we really want postgres_fdw_disconnect() with no argument?
> IMO 

Re: remove unneeded pstrdup in fetch_table_list

2021-01-15 Thread Amit Kapila
On Thu, Jan 14, 2021 at 3:05 PM Amit Kapila  wrote:
>
> On Thu, Jan 14, 2021 at 10:51 AM Tom Lane  wrote:
> >
> > Michael Paquier  writes:
> > > On Thu, Jan 14, 2021 at 01:17:57AM +, Hou, Zhijie wrote:
> >  Thanks. I am thinking to backpatch this even though there is no
> >  problem reported from any production system. What do you think?
> >
> > > text_to_cstring() indeed allocates a new string, so the extra
> > > allocation is useless.  FWIW, I don't see much point in poking at
> > > the stable branches here.
> >
> > Yeah, unless there's some reason to think that this creates a
> > meaningful memory leak, I wouldn't bother back-patching.
> >
>
> The only case where it might have an impact, as per the report of Zhijie Hou,
> is where the user is subscribed to a publication that has a lot of
> tables, like Create Publication ... For All Tables. Even though for
> such a case the memory consumed could be high, all the memory is
> allocated in the Portal context and will be released at the statement
> end. I was not sure if that could create a meaningful leak for any user,
> so to be on the safer side I proposed to backpatch it. However, if
> others don't think we need to backpatch this then I am fine doing it
> just for HEAD.
>

Hearing no further suggestions, pushed just to HEAD.

-- 
With Regards,
Amit Kapila.




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 6:45 PM Amit Langote  wrote:
>
> On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> > We want to do this for Inserts where only Select can be parallel and
> > Inserts will always be done by the leader backend. This is actually
> > the case we first want to implement.
>
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?
>

The reason is that once we enter parallel-mode we can't allow
parallel-unsafe things (like allocation of new CIDs, XIDs, etc.). We
enter parallel-mode at the beginning of statement execution, see
ExecutePlan(). So the Insert will be performed in parallel-mode even
though it happens in the leader backend. It is not possible to first
finish getting all the tuples from the gather node and only then start
inserting. Even if we somehow find a way to make this work, the checks
being discussed will still be required to make inserts parallel (where
inserts are performed by workers), which is actually the next patch in
the thread I mentioned in the previous email.

Does this answer your question?

-- 
With Regards,
Amit Kapila.




Re: Occasional tablespace.sql failures in check-world -jnn

2021-01-15 Thread Michael Paquier
On Fri, Jan 15, 2021 at 09:59:02AM +0100, Peter Eisentraut wrote:
> I vaguely recall that this had something to do with SELinux (or something
> similar?), where it matters in what context you create a file or directory
> and then certain properties attach to it that are relevant to subsequent
> programs that run on it.  Again, vague.

Hmm.  Does it?  sepgsql has some tests for tablespaces involving only
pg_default, so it does not seem that this applies in the context of
the regression tests.  The cleanup of testtablespace in GNUMakefile
comes from commit 2467394 (June 2004), which introduced tablespaces.
--
Michael


signature.asc
Description: PGP signature


Re: Key management with tests

2021-01-15 Thread Michael Paquier
On Fri, Jan 15, 2021 at 08:20:36PM -0800, Andres Freund wrote:
> On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote:
>> What Perl tap tests run initdb and manage the cluster?  I didn't find
>> any.
> 
> find . -name '*.pl'|xargs grep 'use PostgresNode;'
> 
> should give you a nearly complete list.

Just to add that all the perl modules we use for the tests are within
src/test/perl/.  The coolest tests are within src/bin/ and src/test/.
--
Michael


signature.asc
Description: PGP signature


Re: Wrong usage of RelationNeedsWAL

2021-01-15 Thread Noah Misch
On Wed, Jan 13, 2021 at 04:07:05PM +0900, Kyotaro Horiguchi wrote:
> The definition of the macro RelationNeedsWAL has been changed by
> c6b92041d3 to include conditions related to the WAL-skip optimization
> but some uses of the macro are not relevant to the optimization. Those
> misuses are harmless for now as they are only executed while wal_level
> >= replica or WAL-skipping is inactive. However, this should be
> corrected to prevent future hazards.

I see user-visible consequences:

> --- a/src/backend/catalog/pg_publication.c
> +++ b/src/backend/catalog/pg_publication.c
> @@ -67,7 +67,7 @@ check_publication_add_relation(Relation targetrel)
>errdetail("System tables cannot be added to 
> publications.")));
>  
>   /* UNLOGGED and TEMP relations cannot be part of publication. */
> - if (!RelationNeedsWAL(targetrel))
> + if (!RelationIsWalLogged(targetrel))

Today, the following fails needlessly under wal_level=minimal:

BEGIN;
SET client_min_messages = 'ERROR';
CREATE TABLE skip_wal ();
CREATE PUBLICATION p FOR TABLE skip_wal;
ROLLBACK;

Could you add that test to one of the regress scripts?

>   ereport(ERROR,
>   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
>errmsg("table \"%s\" cannot be replicated",
> diff --git a/src/backend/optimizer/util/plancat.c 
> b/src/backend/optimizer/util/plancat.c
> index da322b453e..0500efcdb9 100644
> --- a/src/backend/optimizer/util/plancat.c
> +++ b/src/backend/optimizer/util/plancat.c
> @@ -126,7 +126,7 @@ get_relation_info(PlannerInfo *root, Oid 
> relationObjectId, bool inhparent,
>   relation = table_open(relationObjectId, NoLock);
>  
>   /* Temporary and unlogged relations are inaccessible during recovery. */
> - if (!RelationNeedsWAL(relation) && RecoveryInProgress())
> + if (!RelationIsWalLogged(relation) && RecoveryInProgress())

This has no user-visible consequences, but I agree you've clarified it.

>   ereport(ERROR,
>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>errmsg("cannot access temporary or unlogged 
> relations during recovery")));
> diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
> index 10b63982c0..810806a542 100644
> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -552,16 +552,23 @@ typedef struct ViewOptions
>   (relation)->rd_smgr->smgr_targblock = (targblock); \
>   } while (0)
>  
> +/*
> + * RelationIsPermanent
> + *   True if relation is WAL-logged.
> + */
> +#define RelationIsWalLogged(relation)
> \
> + ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)

To me, the names RelationIsWalLogged() and RelationNeedsWAL() don't convey
their behavior difference.  How about one of the following spellings?

- Name the new macro RelationEverNeedsWAL().
- Just test "relpersistence == RELPERSISTENCE_PERMANENT", without a macro.

> +
>  /*
>   * RelationNeedsWAL
> - *   True if relation needs WAL.
> + *   True if relation needs WAL at the time.
>   *
>   * Returns false if wal_level = minimal and this relation is created or
>   * truncated in the current transaction.  See "Skipping WAL for New
>   * RelFileNode" in src/backend/access/transam/README.
>   */
>  #define RelationNeedsWAL(relation)   
> \
> - ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&  
> \
> + (RelationIsWalLogged(relation) &&   
> \
>(XLogIsNeeded() || 
> \
> (relation->rd_createSubid == InvalidSubTransactionId &&   
> \
>  relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
> @@ -619,7 +626,7 @@ typedef struct ViewOptions
>   */
>  #define RelationIsAccessibleInLogicalDecoding(relation) \
>   (XLogLogicalInfoActive() && \
> -  RelationNeedsWAL(relation) && \
> +  RelationIsWalLogged(relation) && \

Today's condition expands to:

  wal_level >= WAL_LEVEL_LOGICAL &&
  relation->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&
  (wal_level >= WAL_LEVEL_REPLICA || ...)

The proposed change doesn't affect semantics, and it likely doesn't change
optimized code.  I slightly prefer to leave this line unchanged.

>(IsCatalogRelation(relation) || 
> RelationIsUsedAsCatalogTable(relation)))
>  
>  /*
> @@ -635,7 +642,7 @@ typedef struct ViewOptions
>   */
>  #define RelationIsLogicallyLogged(relation) \
>   (XLogLogicalInfoActive() && \
> -  RelationNeedsWAL(relation) && \
> +  RelationIsWalLogged(relation) && \

Likewise.

>!IsCatalogRelation(relation))
>  
>  /* 

Re: Add table access method as an option to pgbench

2021-01-15 Thread Michael Paquier
On Fri, Jan 15, 2021 at 01:22:45PM -0800, Andres Freund wrote:
> I think that objection is right. All that's needed to change this from
> the client side is to do something like
> PGOPTIONS='-c default_table_access_method=foo' pgbench ...
> 
> I don't think adding pgbench options for individual GUCs really is a
> useful exercise?

Yeah.  Looking at the latest patch, it just uses SET
default_table_access_method to achieve its goal and to bypass the fact
that partitions don't directly support the AM clause.  So I agree to
mark this patch as rejected and move on.  One thing that looks like an
issue to me is that PGOPTIONS is not listed in the section for 
environment variables on the docs of pgbench.  5aaa584 has plugged in
a lot of holes, but things could be improved more, for more clients,
where it makes sense of course.
--
Michael


signature.asc
Description: PGP signature


Re: Key management with tests

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-15 20:49:10 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote:
> > On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> > > You have to understand cryptography and Postgres internals to understand
> > > the design, and I don't think it is realistic to explain that all to the
> > > community.  We did much of this in voice calls over months because it
> > > was too much of a burden to explain all the cryptographic details so
> > > everyone could follow along.
> > 
> > I think that's not at all acceptable. I don't mind hashing out details
> > on calls / off-list, but the design needs to be public, documented, and
> > reviewable.  And if it's something the community can't understand, then
> > it can't get in. We're going to have to maintain this going forward.
> 
> OK, so we don't want it.  That's fine with me.

That's not what I said...


> > This isn't specific to this topic? I don't really understand why this
> > specific feature gets to avoid normal community development processes?
> 
> What is being avoided?

You previously pushed a patch without tests; now you want to push a
patch that was barely reviewed and also doesn't contain an explanation
of the design. I mean:

> > > You have to understand cryptography and Postgres internals to understand
> > > the design, and I don't think it is realistic to explain that all to the
> > > community.  We did much of this in voice calls over months because it
> > > was too much of a burden to explain all the cryptographic details so
> > > everyone could follow along.

really is very far from the normal community process. Again, how is this
supposed to be maintained in the future, if it's based on a design
that's only understandable to the people on those phone calls?


> > We have had perl tap tests for quite a while now? And all new tests that
> > aren't regression / isolation tests are expected to be written in it.
> 
> What Perl tap tests run initdb and manage the cluster?  I didn't find
> any.

find . -name '*.pl'|xargs grep 'use PostgresNode;'

should give you a nearly complete list.

Greetings,

Andres Freund




Re: Is Recovery actually paused?

2021-01-15 Thread Masahiko Sawada
On Wed, Jan 13, 2021 at 9:20 PM Dilip Kumar  wrote:
>
> On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  wrote:
> >
> > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  wrote:
> > >
> > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > Dilip Kumar  wrote:
> > >
> > > > > > However, I suspect users don't expect pg_is_wal_replay_paused to wait.
> > > > > > Especially, if max_standby_streaming_delay is -1, this will be blocked forever,
> > > > > > although this setting may not be usual. In addition, some users may set
> > > > > > recovery_min_apply_delay to a large value.  If such users call pg_is_wal_replay_paused,
> > > > > > it could wait for a long time.
> > > > > >
> > > > > > At least, I think we need some description in the documentation to explain that
> > > > > > pg_is_wal_replay_paused could wait for a while.
> > > > >
> > > > > Ok
> > > >
> > > > Fixed this, added some comments in .sgml as well as in function header
> > >
> > > Thank you for fixing this.
> > >
> > > Also, is it better to fix the description of pg_wal_replay_pause from
> > > "Pauses recovery." to "Request to pause recovery." in according with
> > > pg_is_wal_replay_paused?
> >
> > Okay
> >
> > >
> > > > > > Also, how about adding a new boolean argument to 
> > > > > > pg_is_wal_replay_paused to
> > > > > > control whether this waits for recovery to get paused or not? By 
> > > > > > setting its
> > > > > > default value to true or false, users can use the old format for 
> > > > > > calling this
> > > > > > and the backward compatibility can be maintained.
> > > > >
> > > > > So basically, if the wait_recovery_pause flag is false then we will
> > > > > immediately return true if the pause is requested?  I agree that it is
> > > > > good to have an API to know whether the recovery pause is requested or
> > > > > not but I am not sure is it good idea to make this API serve both the
> > > > > purpose?  Anyone else have any thoughts on this?
> > > > >
> > >
> > > I think the current pg_is_wal_replay_paused() already has another purpose;
> > > it waits for recovery to actually get paused. If we want to limit this API's
> > > purpose only to returning the pause state, it seems better to fix this to
> > > return the actual state at the cost of losing backward compatibility. If we
> > > want to know whether a pause is requested, we may add a new API like
> > > pg_is_wal_replay_pause_requested(). Also, if we want to wait for recovery to
> > > actually get paused, we may add an option to pg_wal_replay_pause() for this
> > > purpose.
> > >
> > > However, this might be bikeshedding. If nobody cares that
> > > pg_is_wal_replay_paused() can make the user wait for a long time, I don't
> > > care either.
> >
> > I don't think that it will ever be blocked, because
> > pg_wal_replay_pause sends WakeupRecovery(), which means the
> > recovery process will not be stuck waiting for WAL.
> >
> > > > > > As another comment, while pg_is_wal_replay_paused is blocking, I 
> > > > > > can not cancel
> > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the 
> > > > > > waiting loop.
> > >
> > > How about this fix? I think users may want to cancel
> > > pg_is_wal_replay_paused() while it is blocking.
> >
> > Yeah, we can do this.  I will send the updated patch after putting
> > some more thought into these comments.  Thanks again for the feedback.
> >
>
> Please find the updated patch.

I've looked at the patch. Here are review comments:

+   /* Recovery pause state */
+   RecoveryPauseState  recoveryPause;

Now that the value can have three states, how about renaming it to
recoveryPauseState?

---
 bool
 RecoveryIsPaused(void)
+{
+   bool        recoveryPause;
+
+   SpinLockAcquire(&XLogCtl->info_lck);
+   recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED) ? true : false;
+   SpinLockRelease(&XLogCtl->info_lck);
+
+   return recoveryPause;
+}
+
+bool
+RecoveryPauseRequested(void)
 {
    bool        recoveryPause;

    SpinLockAcquire(&XLogCtl->info_lck);
-   recoveryPause = XLogCtl->recoveryPause;
+   recoveryPause = (XLogCtl->recoveryPause != RECOVERY_IN_PROGRESS) ? true : false;
    SpinLockRelease(&XLogCtl->info_lck);

    return recoveryPause;
 }

We can write this simply as recoveryPause = (XLogCtl->recoveryPause == RECOVERY_PAUSED);

Also, since these functions do almost the same thing, I think we can
have a common function to get XLogCtl->recoveryPause, say
GetRecoveryPauseState() or GetRecoveryPause(), and have both
RecoveryIsPaused() and RecoveryPauseRequested() use the returned
value. What do you think?

---
+static void
+CheckAndSetRecoveryPause(void)

Maybe we need to declare the prototype of this function like other
functions in xlog.c.

---
+   /*
+* If recovery is not in progress anymore then report an error this
+* could happen if the standby is promoted while we were waiting for
+* recovery to get paused.
+

Re: Key management with tests

2021-01-15 Thread Bruce Momjian
On Fri, Jan 15, 2021 at 04:56:24PM -0800, Andres Freund wrote:
> On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> > You have to understand cryptography and Postgres internals to understand
> > the design, and I don't think it is realistic to explain that all to the
> > community.  We did much of this in voice calls over months because it
> > was too much of a burden to explain all the cryptographic details so
> > everyone could follow along.
> 
> I think that's not at all acceptable. I don't mind hashing out details
> on calls / off-list, but the design needs to be public, documented, and
> reviewable.  And if it's something the community can't understand, then
> it can't get in. We're going to have to maintain this going forward.

OK, so we don't want it.  That's fine with me.

> I don't mean to say that we need to re-hash all design details from
> scratch - but that there needs to be an explanation somewhere that
> describes what's being done on a medium-high level, and what drove those
> design decisions.

I thought the TODO list was that, and the email threads.

> > > The wiki page doesn't really describe a design either. It has a very
> > > long todo, a bunch of implementation details, but no design.
> > 
> > I am not sure what design document you are requesting.  I thought the
> > TODO was that.
> 
> The TODO in 
> https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
> is a design document?

Yes.

> > > Nor did 978f869b99 include much in the way of design description.
> > > 
> > > You cannot expect anybody to review a patch if developing some basic
> > > understanding of the intended design requires reading hundreds of
> > > messages in which the design evolved. And I don't think it's acceptable
> > > to push it due to lack of further feedback, given this situation - the
> > > lack of design description is a blocker in itself.
> > 
> > OK, I will just move on to something else then.  It is not worth the
> > feature to go into that kind of discussion again.  I am willing to have
> > voice calls with individuals to explain the logic, but repeatedly
> > explaining it to the entire group I find unproductive.  I don't think
> > another 400-email thread would help anyone.
> 
> Explaining something over voice doesn't help with people in a year or
> five trying to understand the code and the design, so they can adapt it
> when making half-related changes. Nor do I see why another 400 email
> thread would be a necessary consequence of you explaining the design
> that you came up with.

I have underestimated the amount of discussion this has required
repeatedly, and I don't want to make that mistake again.

> This isn't specific to this topic? I don't really understand why this
> specific feature gets to avoid normal community development processes?

What is being avoided?

> > > - tests:
> > >   - wait, a .sh test script? No, we shouldn't add any more of those,
> > > they're a nightmare across platforms
> > 
> > The script originated from pg_upgrade.  I don't know how to do things
> > like initdb and stuff another way, at least in our code.
> 
> We have had perl tap tests for quite a while now? And all new tests that
> aren't regression / isolation tests are expected to be written in it.

What Perl tap tests run initdb and manage the cluster?  I didn't find
any.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-15 19:21:32 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 02:37:56PM -0800, Andres Freund wrote:
> > On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote:
> > > > I am not even sure there is a consensus on the design, without which
> > > > any commit is always premature.
> > >
> > > If people want changes, I need to hear about it here.  I have addressed
> > > everything people have mentioned in these threads so far.
> > 
> > I don't even know how anybody is supposed to realistically review the
> > design or the patch:
> > 
> > This thread started at
> > https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no
> > reference to any discussion of the design at all and the supposed links
> > to code are dead.
> 
> You have to understand cryptography and Postgres internals to understand
> the design, and I don't think it is realistic to explain that all to the
> community.  We did much of this in voice calls over months because it
> was too much of a burden to explain all the cryptographic details so
> everyone could follow along.

I think that's not at all acceptable. I don't mind hashing out details
on calls / off-list, but the design needs to be public, documented, and
reviewable.  And if it's something the community can't understand, then
it can't get in. We're going to have to maintain this going forward.

I don't mean to say that we need to re-hash all design details from
scratch - but that there needs to be an explanation somewhere that
describes what's being done on a medium-high level, and what drove those
design decisions.


> > The last version of the code that I see posted ([1]), has the useless
> > commit message of "key squash commit" - nothing else. There's no design
> > documentation included in the patch either, as far as I can tell.
> > 
> > Manually searching for the topic brings me to
> > https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us
> > , a thread of 52 messages, which provides a bit more context, but
> > largely just references another thread and a wiki article. The link to
> > the other thread is into the middle of a 112 message thread.
> > 
> > The wiki page doesn't really describe a design either. It has a very
> > long todo, a bunch of implementation details, but no design.
> 
> I am not sure what design document you are requesting.  I thought the
> TODO was that.

The TODO in 
https://wiki.postgresql.org/wiki/Transparent_Data_Encryption#Other_requirements
is a design document?



> > Nor did 978f869b99 include much in the way of design description.
> > 
> > You cannot expect anybody to review a patch if developing some basic
> > understanding of the intended design requires reading hundreds of
> > messages in which the design evolved. And I don't think it's acceptable
> > to push it due to lack of further feedback, given this situation - the
> > lack of design description is a blocker in itself.
> 
> OK, I will just move on to something else then.  It is not worth the
> feature to go into that kind of discussion again.  I am willing to have
> voice calls with individuals to explain the logic, but repeatedly
> explaining it to the entire group I find unproductive.  I don't think
> another 400-email thread would help anyone.

Explaining something over voice doesn't help with people in a year or
five trying to understand the code and the design, so they can adapt it
when making half-related changes. Nor do I see why another 400 email
thread would be a necessary consequence of you explaining the design
that you came up with.

This isn't specific to this topic? I don't really understand why this
specific feature gets to avoid normal community development processes?



> > - tests:
> >   - wait, a .sh test script? No, we shouldn't add any more of those,
> > they're a nightmare across platforms
> 
> The script originated from pg_upgrade.  I don't know how to do things
> like initdb and stuff another way, at least in our code.

We have had perl tap tests for quite a while now? And all new tests that
aren't regression / isolation tests are expected to be written in it.

Greetings,

Andres Freund




Re: poc - possibility to write window function in PL languages

2021-01-15 Thread Tom Lane
Pavel Stehule  writes:
> [ plpgsql-window-functions-20210104.patch.gz ]

I spent some time looking at this patch.  It would certainly be
appealing to have some ability to write custom window functions
without descending into C; but I'm not very happy about the details.

I'm okay with the idea of having a special variable of a new pseudotype.
That's not exactly pretty, but it descends directly from how we handle
the arguments of trigger functions, so at least there's precedent.
What's bugging me though is the "typedvalue" stuff.  That seems like a
conceptual mess, a performance loss, and a permanent maintenance time
sink.  To avoid performance complaints, eventually this hard-wired set
of conversions would have to bloom to cover every built-in cast, and
as for extension types, you're just out of luck.

One way to avoid that would be to declare the argument-fetching
functions as polymorphics with a dummy argument that just provides
the expected result type.  So users would write something like

create function pl_lag(x numeric)
  ...
  v := get_input_value_in_partition(windowobject, x, 1, -1,
'seek_current', false);

where the argument-fetching function is declared

   get_input_value_in_partition(windowobject, anyelement, int, ...)
   returns anyelement

and internally it could verify that the n'th window function argument
matches the type of its second argument.  While this could be made
to work, it's kind of unsatisfying because the argument number "1" is
so obviously redundant with the reference to "x".  Ideally one should
only have to write "x".  I don't quite see how to make that work,
but maybe there's a way?

On the whole though, I think your original idea of bespoke plpgsql
syntax is better, ie let's write something like

   GET WINDOW VALUE v := x AT PARTITION CURRENT(-1);

and hide all the mechanism behind that.  The reference to "x" is enough
to provide the argument number and type, and the window object doesn't
have to be explicitly visible at all.

Yeah, this will mean that anybody who wants to provide equivalent
functionality in some other PL will have to do more work.  But it's
not like it was going to be zero effort for them before.  Furthermore,
it's not clear to me that other PLs would want to adopt your current
design anyway.  For example, I bet PL/R would like to somehow make
window arguments map into vectors on the R side, but there's no chance
of that with this SQL layer in between.

regards, tom lane




Re: Change default of checkpoint_completion_target

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-15 23:05:02 +0100, Tomas Vondra wrote:
> Yeah. The flushing probably makes that mostly unnecessary, but we still
> allow disabling that. I'm not really convinced replacing it with a
> compile-time #define is a good idea, exactly because it can't be changed
> if needed.

It's also not available everywhere...


> As for the exact value, maybe the right solution is to make it dynamic.
> The usual approach is to leave "enough time" for the kernel to flush
> dirty data, so we could say 60 seconds and calculate the exact target
> depending on the checkpoint_timeout.

IME it is precisely the kernel flushing at some later time that is the problem,
because of the latency spikes that happen when it decides to do so. That
commonly starts to happen well before the fsyncs. The reason that
setting a very small checkpoint_completion_target can help is that it
condenses the period of unreliable performance into one short window,
rather than spreading it over the whole checkpoint...

Greetings,

Andres Freund




Re: Key management with tests

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-15 16:47:19 -0500, Bruce Momjian wrote:
> On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote:
> > On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian  wrote:
> > I don't think that's appropriate. Several prominent community members
> > have told you that the patch, as committed the first time, needed a
> > lot more work. There hasn't been enough time between then and now for
> > you, or anyone, to do that amount of work. This patch needs detailed
> > and substantial review from senior community members, and multiple
> > rounds of feedback and improvement, before it should be considered for
> > commit.
> >
> > I am not even sure there is a consensus on the design, without which
> > any commit is always premature.
>
> If people want changes, I need to hear about it here.  I have addressed
> everything people have mentioned in these threads so far.

I don't even know how anybody is supposed to realistically review the
design or the patch:

This thread started at
https://postgr.es/m/20210101045047.GB30966%40momjian.us - there's no
reference to any discussion of the design at all and the supposed links
to code are dead.

The last version of the code that I see posted ([1]), has the useless
commit message of "key squash commit" - nothing else. There's no design
documentation included in the patch either, as far as I can tell.

Manually searching for the topic brings me to
https://www.postgresql.org/message-id/20201202213814.GG20285%40momjian.us
, a thread of 52 messages, which provides a bit more context, but
largely just references another thread and a wiki article. The link to
the other thread is into the middle of a 112 message thread.

The wiki page doesn't really describe a design either. It has a very
long todo, a bunch of implementation details, but no design.

Nor did 978f869b99 include much in the way of design description.

You cannot expect anybody to review a patch if developing some basic
understanding of the intended design requires reading hundreds of
messages in which the design evolved. And I don't think it's acceptable
to push it due to lack of further feedback, given this situation - the
lack of design description is a blocker in itself.


There's a few things that stand out on a very very brief scan:
- the patch badly needs to be split up into independently reviewable
  pieces
- tests:
  - wait, a .sh test script? No, we shouldn't add any more of those,
they're a nightmare across platforms
  - Do the tests actually do anything useful? It's not clear to me what
they are trying to achieve. En/Decrypting test vectors doesn't seem to
buy that much?
  - the new pg_alterckey is completely untested
  - the pg_upgrade path is untested
  - ..
- Without further comment BootStrapKmgr() does "copy cluster file
  encryption keys from an old cluster?", but there's no explanation as
  to why / when that's the case. Presumably pg_upgrade, but, uh, explain
  that.

- pg_alterckey.c
  - appears to create its own cluster lock file, using its
own routine for doing so. How does that lock file interact with the
running server?
  - retrieve_cluster_keys() is missing (void).

I think this is at the very least a month away from being committable,
even if the design were completely correct (which I do not know, see
above).

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20210115204926.GD8740%40momjian.us




Re: Rename of triggers for partitioned tables

2021-01-15 Thread Alvaro Herrera
On 2020-Nov-27, Arne Roland wrote:

> I got too annoyed at building queries for gexec all the time, so I wrote
> a patch to fix the issue that renaming a partitioned table's trigger
> doesn't affect its children.

As you say, triggers on children don't necessarily have to have the same
name as on the parent; this already happens when the trigger is renamed in
the child table but not on the parent.  In that situation the search on the
child will fail, which will cause the whole thing to fail, I think.

We now have the column pg_trigger.tgparentid, and I think it would be
better (more reliable) to search for the trigger in the child by the
tgparentid column instead, rather than by name.
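
For illustration, looking the children up by tgparentid could be roughly the
following sketch, where 'parent' and 'trg' are placeholder names for the
partitioned table and its trigger:

  SELECT c.tgrelid::regclass AS child_table, c.tgname
  FROM pg_trigger p
  JOIN pg_trigger c ON c.tgparentid = p.oid
  WHERE p.tgrelid = 'parent'::regclass AND p.tgname = 'trg';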

Also, I think it would be good to have
 ALTER TRIGGER .. ON ONLY parent RENAME TO ..
to avoid recursing to children.  This seems mostly pointless, but since
we've allowed changing the name of the trigger in children thus far,
then we've implicitly made it supported to have triggers that are named
differently.  (And it's not entirely academic, since the trigger name
determines firing order.)

Alternatively to this last point, we could decide to disallow renaming
of triggers on children (i.e. if trigger has tgparentid set, then
renaming is disallowed).  I don't have a problem with that, but it would
have to be an explicit decision to take.

I think you did not register the patch in commitfest, so I did that for
you: https://commitfest.postgresql.org/32/2943/

Thanks!

-- 
Álvaro Herrera   Valdivia, Chile
"¿Cómo puedes confiar en algo que pagas y que no ves,
y no confiar en algo que te dan y te lo muestran?" (Germán Poo)




increase size of pg_commit_ts buffers

2021-01-15 Thread Alvaro Herrera
I wrote this patch last year in response to a customer issue and I
thought I had submitted it here, but evidently I didn't.  So here it is.

The short story is: in commit 5364b357fb11 we increased the size of
pg_commit (née pg_clog) but we didn't increase the size of pg_commit_ts
to match.  When commit_ts is in use, this can lead to significant buffer
thrashing and thus poor performance.

Since commit_ts entries are larger than pg_commit, my proposed patch uses
twice as many buffers.

Suffice it to say once we did this the customer problem went away.

(Andrey Borodin already has a patch to change the behavior for
multixact, which is something we should perhaps also do.  I now notice
that they're also reporting a bug in that thread ... sigh)

-- 
Álvaro Herrera39°49'30"S 73°17'W
"The problem with the future is that it keeps turning into the present"
(Hobbes)
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index b786eeff7a..712f1eff01 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -530,7 +530,7 @@ pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
 Size
 CommitTsShmemBuffers(void)
 {
-	return Min(16, Max(4, NBuffers / 1024));
+	return Min(256, Max(4, NBuffers / 512));
 }
 
 /*
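
To put the formula change in concrete terms, a quick back-of-the-envelope
comparison (a sketch only; 1GB of shared_buffers corresponds to NBuffers =
131072):

  SELECT least(16,  greatest(4, 131072 / 1024)) AS old_buffers,  -- 16
         least(256, greatest(4, 131072 / 512))  AS new_buffers;  -- 256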


Re: Key management with tests

2021-01-15 Thread Bruce Momjian
On Fri, Jan 15, 2021 at 04:59:17PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian  wrote:
> > If people want changes, I need to hear about it here.  I have addressed
> > everything people have mentioned in these threads so far.
> 
> That does not match my perception of the situation.

Well, that's not very specific, is it?  You might be confusing the POC
data encryption patch that was posted in this thread with the key
management patch that I am working on.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-15 Thread David G. Johnston
On Fri, Jan 15, 2021 at 2:59 PM Robert Haas  wrote:

> On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian  wrote:
> > If people want changes, I need to hear about it here.  I have addressed
> > everything people have mentioned in these threads so far.
>
> That does not match my perception of the situation.
>
>
Looking at the Commitfest, there are three authors and no reviewers.  Given
the previous incident, at minimum each of the people listed in the Commitfest
entry should add their approval to commit this patch to this thread.  And while
committers get some leeway, in this case having a non-author review and
sign-off on it being ready-to-commit seems like it should be required.

David J.


Re: Change default of checkpoint_completion_target

2021-01-15 Thread Tomas Vondra
On 1/15/21 10:51 PM, Andres Freund wrote:
> Hi,
> 
> On 2020-12-08 12:41:35 -0500, Tom Lane wrote:
>> FWIW, I kind of like the idea of getting rid of it completely.
>> Is there really ever a good reason to set it to something different
>> than that?  If not, well, we have too many GUCs already, and each
>> of them carries nonzero performance, documentation, and maintenance
>> overhead.
> 
> I like the idea of getting rid of it too, but I think we should consider
> evaluating the concrete hard-coded value a bit more careful than just
> going for 0.9 based on some old recommendations in the docs. It not
> being changeable afterwards...
> 
> I think it might be a good idea to immediately change the default to
> 0.9, and concurrently try to evaluate whether it's really the best value
> (vs 0.95, 1 or ...).
> 
> FWIW I have seen a few cases in the past where setting the target to
> something very small helped, but I think that was mostly because we
> didn't yet tell the kernel to flush dirty data more aggressively.
> 

Yeah. The flushing probably makes that mostly unnecessary, but we still
allow disabling that. I'm not really convinced replacing it with a
compile-time #define is a good idea, exactly because it can't be changed
if needed.

As for the exact value, maybe the right solution is to make it dynamic.
The usual approach is to leave "enough time" for the kernel to flush
dirty data, so we could say 60 seconds and calculate the exact target
depending on the checkpoint_timeout.
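
To make that concrete, a sketch of the arithmetic with the 60-second headroom
mentioned above (just an illustration, not a settled formula):

  -- target = (checkpoint_timeout - 60s) / checkpoint_timeout
  SELECT round((300 - 60) / 300.0, 2)   AS target_at_5min,    -- 0.80
         round((1800 - 60) / 1800.0, 2) AS target_at_30min;   -- 0.97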


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Key management with tests

2021-01-15 Thread Robert Haas
On Fri, Jan 15, 2021 at 4:47 PM Bruce Momjian  wrote:
> If people want changes, I need to hear about it here.  I have addressed
> everything people have mentioned in these threads so far.

That does not match my perception of the situation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Outdated replication protocol error?

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-14 16:40:26 +0900, Fujii Masao wrote:
> On 2021/01/12 9:06, Jeff Davis wrote:
> > Commit 5ee2197767 (about 4 years old) introduced the error:
> > 
> >"IDENTIFY_SYSTEM has not been run before START_REPLICATION"
> > 
> > But it seems like running START_REPLICATION first works (at least in
> > the simple case).
> > 
> > We should either:
> > 
> > 1. Document that IDENTIFY_SYSTEM must always be run before
> > START_REPLICATION, and always issue a WARNING if that's not done (an
> > ERROR might break existing applications); or
> > 
> > 2. If the commit is out of date and no longer needed, or if it's easy
> > enough to fix, just remove the error (and Assert a valid
> > ThisTimeLineID).
> 
> +1 to remove the error if START_REPLICATION can always work fine without
> IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
> and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
> if IDENTIFY_SYSTEM is actually necessary even in that case.

The current approach seems quite bad to me too. By that point the value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender case it's trivial to compute.

Has anybody dug out the thread leading to the commit?

Greetings,

Andres Freund




Re: Change default of checkpoint_completion_target

2021-01-15 Thread Andres Freund
Hi,

On 2020-12-08 12:41:35 -0500, Tom Lane wrote:
> FWIW, I kind of like the idea of getting rid of it completely.
> Is there really ever a good reason to set it to something different
> than that?  If not, well, we have too many GUCs already, and each
> of them carries nonzero performance, documentation, and maintenance
> overhead.

I like the idea of getting rid of it too, but I think we should consider
evaluating the concrete hard-coded value a bit more careful than just
going for 0.9 based on some old recommendations in the docs. It not
being changeable afterwards...

I think it might be a good idea to immediately change the default to
0.9, and concurrently try to evaluate whether it's really the best value
(vs 0.95, 1 or ...).

FWIW I have seen a few cases in the past where setting the target to
something very small helped, but I think that was mostly because we
didn't yet tell the kernel to flush dirty data more aggressively.

Greetings,

Andres Freund




Re: Key management with tests

2021-01-15 Thread Bruce Momjian
On Fri, Jan 15, 2021 at 04:23:22PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian  wrote:
> > I am planning to apply this next week.
> 
> I don't think that's appropriate. Several prominent community members
> have told you that the patch, as committed the first time, needed a
> lot more work. There hasn't been enough time between then and now for
> you, or anyone, to do that amount of work. This patch needs detailed
> and substantial review from senior community members, and multiple
> rounds of feedback and improvement, before it should be considered for
> commit.
> 
> I am not even sure there is a consensus on the design, without which
> any commit is always premature.

If people want changes, I need to hear about it here.  I have addressed
everything people have mentioned in these threads so far.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: patch: reduce overhead of execution of CALL statement in no atomic mode from PL/pgSQL

2021-01-15 Thread Tom Lane
Pavel Stehule  writes:
> [ plpgsql-using-local-resowner-for-call-plans-20200108.patch ]

I took a quick look through this patch, just reading it without
any testing.  A few thoughts:

* Instead of adding an argument to GetCachedPlan and ReleaseCachedPlan,
I think it'd be better to *replace* the useResOwner bool with
a ResourceOwner pointer, with the obvious semantics "do nothing if
it's NULL".  Otherwise you have to explain what it means to pass NULL
with useResOwner = true.  In any case, changing the APIs of these
functions without updating their header comments is not okay.

* I'm not really happy with adding yet another nonorthogonal variant
of SPI_execute_plan.  Maybe better to do something like I did with
SPI_prepare_extended() in commit 844fe9f15, and create a struct with
all the inessential parameters so that we can make future API extensions
without inventing whole new functions.  Remember also that new SPI
functions have to be documented in spi.sgml.

* Do we really need a PG_TRY in exec_toplevel_block?  Not to mention
creating and destroying a ResOwner?  That seems pretty expensive, and it
should be unnecessary for ordinary plpgsql functions.  (I'm also unhappy
that you've utterly falsified that function's comment without doing
anything to update it.)  This is really the part that needs more
work.  I'm not sure that you can sell a speedup of CALL operations
if the penalty is to slow down every other plpgsql function.

* The part of the patch around exec_stmt_call is just about unreadable,
mainly because git diff seems to think that exec_stmt_call is being
changed into make_callstmt_target.  Maybe it'd be less messy if you
put make_callstmt_target after exec_stmt_call.

* Looks like an extra exec_prepare_plan() call snuck into
exec_assign_expr()?

regards, tom lane




Re: WIP: document the hook system

2021-01-15 Thread David G. Johnston
On Fri, Jan 15, 2021 at 12:28 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-12-31 04:28, David Fetter wrote:
> > This could probably use a lot of filling in, but having it in the
> > actual documentation beats needing to know folklore even to know
> > that the capability is there.
>

Typo - the entity definition hooks should be listed before infoschema, not
after.


> This patch seems quite short of a state where one could begin to
> evaluate it.  Documenting the hooks better seems a worthwhile goal.   I
> think the question is whether we can develop documentation that is
> genuinely useful by itself without studying the relevant source code.
> This submission does not address that question.
>

Yeah, there seems to be a point further along this path we want to reach.
In particular, having a complete example would be nice.  Also needing
explaining is the whole hook swapping thing (I don't think "cache" is the
right term to use here) - like "why" it is important and how it's useful
given that the "Foo_hook" is never assigned away from "prev_Foo", so the
restoration in _PG_fini seems redundant (I see the full example does assign
the various old values away - but why is that needed, and why doesn't another
module doing the same thing end up clobbering this one in a last-one-wins
way?)

I would also be curious whether a static listing of hooks in the
documentation could instead be accomplished by writing a query and having
the build routine populate a "pg_hooks" catalog table which would
be referenced, and ideally could be queried at runtime to enumerate
installed hooks.

Pointing out the presence of src/test/modules/test_rls_hooks would be
advised (in addition to a more minimal "hello world" like example in the
documentation itself).  Having the test module point to the documentation
for more explanations would be good as well.

Coming in with fresh eyes the main thing I would care about is that these
exist, a brief idea of how they operate without having to dig into the
source code, and pointers on where to learn which ones exist (ideally
without digging into source code) and how to go about writing one (which
builds upon material already documented about extending the service using
the C programming language - so links there).  I'm good with running a
catalog query to learn about which ones exist instead of reading them in
the documentation - though the later has some appeal and if it can be
maintained as a build artefact alongside the catalog entries that would be
a bonus.

David J.


Re: Wrong usage of RelationNeedsWAL

2021-01-15 Thread Andres Freund
Hi,

On 2021-01-13 16:07:05 +0900, Kyotaro Horiguchi wrote:
> Commit c6b92041d3 changed the definition of RelationNeedsWAL().
> 
> -#define RelationNeedsWAL(relation) \
> -	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
> +#define RelationNeedsWAL(relation) \
> +	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT && \
> +	 (XLogIsNeeded() || \
> +	  (relation->rd_createSubid == InvalidSubTransactionId && \
> +	   relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
> 
> On the other hand I found this usage.
> 
> plancat.c:128 get_relation_info()
> > /* Temporary and unlogged relations are inaccessible during recovery. */
> > if (!RelationNeedsWAL(relation) && RecoveryInProgress())
> > 	ereport(ERROR,
> > 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > 			 errmsg("cannot access temporary or unlogged relations during recovery")));
> 
> It works as expected accidentally, but the meaning is off.
> WAL-skipping optimization is irrelevant to the condition for the error.
> 
> I found five misuses in the tree. Please find the attached.

Noah?

Greetings,

Andres Freund




Re: Key management with tests

2021-01-15 Thread Robert Haas
On Fri, Jan 15, 2021 at 3:49 PM Bruce Momjian  wrote:
> I am planning to apply this next week.

I don't think that's appropriate. Several prominent community members
have told you that the patch, as committed the first time, needed a
lot more work. There hasn't been enough time between then and now for
you, or anyone, to do that amount of work. This patch needs detailed
and substantial review from senior community members, and multiple
rounds of feedback and improvement, before it should be considered for
commit.

I am not even sure there is a consensus on the design, without which
any commit is always premature.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add table access method as an option to pgbench

2021-01-15 Thread Andres Freund
Hi,

On 2020-11-25 12:41:25 +0900, Michael Paquier wrote:
> On Tue, Nov 24, 2020 at 03:32:38PM -0800, David Zhang wrote:
> > But, providing another option for the end user may not be a bad idea, and it
> > might make the tests easier at some points.
> 
> My first thought is that we have no need to complicate pgbench with
> this option because there is a GUC able to do that, but we do that for
> tablespaces, so...  No objections from here.

I think that objection is right. All that's needed to change this from
the client side is to do something like
PGOPTIONS='-c default_table_access_method=foo' pgbench ...

I don't think adding pgbench options for individual GUCs really is a
useful exercise?

Greetings,

Andres Freund




Re: jit and explain nontext

2021-01-15 Thread Justin Pryzby
On Fri, Jan 15, 2021 at 02:53:49PM -0500, Tom Lane wrote:
> On balance I agree with Peter's opinion that this isn't worth
> changing.  I would be for the patch if the executor had a little
> more freedom of action, but as things stand there's not much
> freedom there.

Thanks for looking
CF: withdrawn.

-- 
Justin




Re: Improve pg_dump dumping publication tables

2021-01-15 Thread Tom Lane
"Hsu, John"  writes:
> I was wondering if there's a good reason in pg_dump getPublicationTables() 
> to iterate through all tables one by one, querying to see if each has a 
> corresponding publication, other than memory concerns?

I just came across this entry in the CommitFest, and I see that it's
practically the same as something I did in passing in 8e396a773.
The main difference is that I got rid of the server-side join, too,
in favor of having getPublicationTables locate the PublicationInfo
that should have been created already by getPublications.  (The
immediate rationale for that was to get the publication owner name
from the PublicationInfo; but we'd have had to do that eventually
anyway if we ever want to allow selective dumping of publications.)

Anyway, I apologize for treading on your toes.  If I'd noticed this
thread earlier I would certainly have given you credit for being the
first to have the idea.

As far as the memory argument goes, I'm not too concerned about it
because both the PublicationRelInfo structs and the tuples of the
transient PGresult are pretty small.  In principle if you had very
many entries in pg_publication_rel, but a selective dump was only
interested in a few of them, there might be an interesting amount of
space wasted here.  But that argument fails because even a selective
dump collects data about all tables, for reasons that are hard to get
around.  The incremental space usage for PublicationRelInfos seems
unlikely to be significant compared to the per-table data we'd have
anyway.

I'll mark this CF entry "withdrawn", since it wasn't rejected
exactly.  Too bad we don't have a classification of "superseded
by events", or something like that.

regards, tom lane




Re: Printing backtrace of postgres processes

2021-01-15 Thread Andres Freund
On 2021-01-15 09:53:05 +0100, Peter Eisentraut wrote:
> On 2020-12-08 10:38, vignesh C wrote:
> > I have implemented printing of backtrace based on handling it in
> > CHECK_FOR_INTERRUPTS. This patch also includes the change to allow
> > getting backtrace of any particular process based on the suggestions.
> > Attached patch has the implementation for the same.
> > Thoughts?
> 
> Are we willing to use up a signal for this?

Why is a full signal needed? Seems the procsignal infrastructure should
suffice?
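
A rough sketch of how that could look (PROCSIG_PRINT_BACKTRACE and
PrintBacktracePending are hypothetical names, shown only to illustrate
reusing the existing plumbing, not code that exists in the tree):

    /* requesting backend */
    SendProcSignal(pid, PROCSIG_PRINT_BACKTRACE, InvalidBackendId);

    /* in procsignal_sigusr1_handler() */
    if (CheckProcSignal(PROCSIG_PRINT_BACKTRACE))
    {
        PrintBacktracePending = true;   /* volatile sig_atomic_t flag */
        InterruptPending = true;
        SetLatch(MyLatch);
    }

    /* in ProcessInterrupts(), reached via CHECK_FOR_INTERRUPTS() */
    if (PrintBacktracePending)
    {
        PrintBacktracePending = false;
        /* emit the backtrace here, outside the signal handler */
    }

That way the target process acts on the request at the next safe point
rather than from inside a signal handler.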




Re: Add table access method as an option to pgbench

2021-01-15 Thread David Zhang

Hi,

`v6-0001-add-table-access-method-as-an-option-to-pgbench` fixes the 
wording problems in the pgbench documentation and help output that were 
pointed out by Justin and Youichi.


`0001-update-tablespace-to-keep-document-consistency` tries to make 
the use of *tablespace* more consistent in the pgbench documentation.


Thank you,

On 2021-01-14 3:51 p.m., David Zhang wrote:

On 2021-01-09 5:44 a.m., youichi aramaki wrote:

+   " create tables with using specified table access 
method,\n"
In my opinion,  this sentence should be "create tables with specified 
table access method" or "create tables using specified table access 
method".
"create tables with specified table access method" may be more 
consistent with other options.


Thank you Youichi. I will change it to "create tables with specified 
table access method" in next patch.



--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From ff94675fda38a0847b893a29dce8c3068a74e118 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Fri, 15 Jan 2021 11:40:30 -0800
Subject: [PATCH] update tablespace to keep document consistency

---
 doc/src/sgml/ref/pgbench.sgml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index faa7c26b0a..080c25731d 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -325,10 +325,10 @@ pgbench  options 
 d
  
 
  
-  
--index-tablespace=index_tablespace
+  
--index-tablespace=TABLESPACE
   

-Create indexes in the specified tablespace, rather than the default
+Create indexes in the specified TABLESPACE, 
rather than the default
 tablespace.

   
@@ -360,10 +360,10 @@ pgbench  options 
 d
  
 
  
-  
--tablespace=tablespace
+  
--tablespace=TABLESPACE
   

-Create tables in the specified tablespace, rather than the default
+Create tables in the specified TABLESPACE, 
rather than the default
 tablespace.

   
-- 
2.17.1

From 517fcb61c7065b7bccf6105a7bb89007fe2fcb08 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Fri, 15 Jan 2021 10:32:53 -0800
Subject: [PATCH] add table access method as an option to pgbench

---
 doc/src/sgml/ref/pgbench.sgml|  9 +++
 src/bin/pgbench/pgbench.c| 25 +++-
 src/bin/pgbench/t/001_pgbench_with_server.pl | 22 +
 3 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index faa7c26b0a..d8018388e6 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -359,6 +359,15 @@ pgbench  options 
 d
   
  
 
+ 
+  
--table-access-method=TABLEAM
+  
+   
+Create tables with specified table access method, rather than the 
default.
+   
+  
+ 
+ 
  
   
--tablespace=tablespace
   
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f7da3e1f62..3f8a119811 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -184,10 +184,11 @@ double		throttle_delay = 0;
 int64  latency_limit = 0;
 
 /*
- * tablespace selection
+ * tablespace and table access method selection
  */
 char  *tablespace = NULL;
 char  *index_tablespace = NULL;
+char  *table_access_method = NULL;
 
 /*
  * Number of "pgbench_accounts" partitions.  0 is the default and means no
@@ -641,6 +642,9 @@ usage(void)
   "  --partition-method=(range|hash)\n"
   "   partition pgbench_accounts with 
this method (default: range)\n"
   "  --partitions=NUM partition pgbench_accounts into 
NUM parts (default: 0)\n"
+  "  --table-access-method=TABLEAM\n"
+  "   create tables with specified 
table access method,\n"
+  "   rather than the default.\n"
   "  --tablespace=TABLESPACE  create tables in the specified 
tablespace\n"
   "  --unlogged-tablescreate tables as unlogged 
tables\n"
   "\nOptions to select what to run:\n"
@@ -3761,6 +3765,20 @@ initCreateTables(PGconn *con)
 
fprintf(stderr, "creating tables...\n");
 
+	if (table_access_method != NULL)
+	{
+		char	   *escaped_table_access_method;
+
+		initPQExpBuffer(&query);
+		escaped_table_access_method = PQescapeIdentifier(con,
+				table_access_method, strlen(table_access_method));
+		appendPQExpBuffer(&query, "set default_table_access_method to %s;\n",
+				escaped_table_access_method);
+		PQfreemem(escaped_table_access_method);
+		executeStatement(con, query.data);
+		termPQExpBuffer(&query);
+ 

Re: jit and explain nontext

2021-01-15 Thread Tom Lane
Justin Pryzby  writes:
> On Sat, Nov 21, 2020 at 10:26:00AM -0600, Justin Pryzby wrote:
>> On Sat, Nov 21, 2020 at 08:39:11AM +0100, Peter Eisentraut wrote:
>>> In this context, I don't see the point of this change.  If you set jit=off
>>> explicitly, then there is no need to clutter the EXPLAIN output with a bunch
>>> of zeroes about JIT.

> If there's no interest or agreement in it, we should just close it.
> I have no personal need for it, but noticed it in passing.

I dug around a bit and saw that essentially all of the JIT control
GUCs are consulted only at plan time (cf standard_planner, which
fills PlannedStmt.jitFlags based on the then-active settings).
So the only thing that really counts as a "run time decision"
here is that if you set jit = off between planning and execution,
or if we fail to load the JIT provider at all, then you'll get
no JITting even though the planner expected it to happen.
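
For reference, the relevant bit of standard_planner() goes roughly like
this (paraphrased from memory, so the details may differ slightly from the
actual source):

    result->jitFlags = PGJIT_NONE;
    if (jit_enabled && jit_above_cost >= 0 &&
        top_plan->total_cost > jit_above_cost)
    {
        result->jitFlags |= PGJIT_PERFORM;

        /* how much effort to spend on generating better code */
        if (jit_optimize_above_cost >= 0 &&
            top_plan->total_cost > jit_optimize_above_cost)
            result->jitFlags |= PGJIT_OPT3;
        if (jit_inline_above_cost >= 0 &&
            top_plan->total_cost > jit_inline_above_cost)
            result->jitFlags |= PGJIT_INLINE;

        /* which operations should be JITed */
        if (jit_expressions)
            result->jitFlags |= PGJIT_EXPR;
        if (jit_tuple_deforming)
            result->jitFlags |= PGJIT_DEFORM;
    }

All of those GUCs are read here, at plan time; at execution time only the
stored jitFlags (plus a last-minute jit_enabled check when the provider is
loaded) matter.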

On balance I agree with Peter's opinion that this isn't worth
changing.  I would be for the patch if the executor had a little
more freedom of action, but as things stand there's not much
freedom there.

regards, tom lane




Re: [patch] [doc] Further note required activity aspect of automatic checkpoint and archving

2021-01-15 Thread David G. Johnston
On Fri, Jan 15, 2021 at 12:16 AM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-10-12 23:54, David G. Johnston wrote:
> > --- a/doc/src/sgml/backup.sgml
> > +++ b/doc/src/sgml/backup.sgml
> > @@ -722,6 +722,8 @@ test ! -f
> > /mnt/server/archivedir/000100A90065  cp pg_wal/0
> >   short archive_timeout  it will bloat
> > your archive
> >   storage.  archive_timeout settings of a minute
> > or so are
> >   usually reasonable.
> > +This is mitigated by the fact that empty WAL segments will not be
> > archived
> > +even if the archive_timeout period has elapsed.
> >  
>
> This is hopefully not what happens.  What this would mean is that I'd
> then have a sequence of WAL files named, say,
>
> 1, 2, 3, 7, 8, ...
>
> because a few in the middle were not archived because they were empty.
>

This addition assumes it is known that the archive process first fills the
files to their maximum size and then archives them.  That filling of the
file is what causes the next file in the sequence to be created.  So if the
archiving doesn't happen the files do not get filled and the status-quo
prevails.

If the above wants to be made more explicit in this change maybe:

"This is mitigated by the fact that archiving, and thus filling, the active
WAL segment will not happen if that segment is empty; it will continue as
the active segment."


> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -3131,6 +3131,8 @@ include_dir 'conf.d'
> > 
> >  
> >   Maximum time between automatic WAL checkpoints.
> > +The automatic checkpoint will do nothing if no new WAL has been
> > +written since the last recorded checkpoint.
> >   If this value is specified without units, it is taken as
> seconds.
> >   The valid range is between 30 seconds and one day.
> >   The default is five minutes (5min).
>
> I think what happens is that the checkpoint is skipped, not that the
> checkpoint happens but does nothing.  That is the wording you cited in
> the other thread from
> .
>

Consistency is good; and considering it further the skipped wording is
generally better anyway.

"The automatic checkpoint will be skipped if no new WAL has been written
since the last recorded checkpoint."

David J.


Re: WIP: System Versioned Temporal Table

2021-01-15 Thread legrand legrand
Hello,

it seems that Oracle (11R2) doesn't add the Start and End timestamp columns 
and permits statements like

select * from tt
union
select * from tt
AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '6' SECOND)
minus 
select * from tt
VERSIONS BETWEEN TIMESTAMP (SYSTIMESTAMP - INTERVAL '6' second) and
SYSTIMESTAMP;

Regards
PAscal



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html




Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-15 Thread Andres Freund
Hi,

Thanks for pushing! Sorry to not get around to a review before...

On 2021-01-15 11:42:50 +0900, Michael Paquier wrote:
> On Fri, Jan 15, 2021 at 03:25:24PM +1300, Thomas Munro wrote:
> > Thanks Michael!  Another notch for the unnecessary system call
> > hitlist: https://wiki.postgresql.org/wiki/Syscall_Reduction
> 
> A quick question.  How much does it matter in terms of
> micro-performance for this code path depending on max/min_wal_size?

I can't see it being a problem here. Division by a non-constant is
expensive, but syscalls are more expensive. And journalled filesystem
operations like renames are *much* more expensive.


> Andres has mentioned its aio work, without telling any numbers.

I only found it while working on the AIO stuff; it's not really
dependent on it. I saw significant performance drops in the last part of
a checkpoint in both aio / master, but additional debugging output I had
in the aio branch made it obvious that it's not actually the buffer sync
where the time is spent.


>  "No backpatch is done per the lack of field complaints."

Because there's no real way to attribute the slowdown to WAL file
recycling in production workloads on master, I don't think we could
really expect field complaints. Everyone will just attribute the
slowdown to BufferSync() or file sync.

I think the way we currently emit WAL timings is quite
unhelpful. The fact that we attribute CheckpointWriteDelay() to the
write time makes it nearly useless until you're at the point the
checkpoint can't be completed in time. Similarly, the "sync" time
covers many things that aren't syncing...

Greetings,

Andres Freund




Re: fdatasync(2) on macOS

2021-01-15 Thread Bruce Momjian
On Fri, Jan 15, 2021 at 12:55:52PM -0500, Tom Lane wrote:
> > So... does this unreleased function flush drive caches?  We know that
> > fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> > call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.
> 
> pg_test_fsync results make it clear that fdatasync is the same or a shade
> faster than fsync, which is pretty much what you'd expect.  On my
> late-model Macbook Pro:
> 
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
> open_datasync 14251.416 ops/sec  70 usecs/op
> fdatasync 25345.103 ops/sec  39 usecs/op
> fsync 24677.445 ops/sec  41 usecs/op
> fsync_writethrough   41.519 ops/sec   24085 usecs/op
> open_sync 14188.903 ops/sec  70 usecs/op
> 
> and on an old Mac mini with spinning rust:
> 
> Compare file sync methods using two 8kB writes:
> (in wal_sync_method preference order, except fdatasync is Linux's default)
> open_datasync  2536.535 ops/sec 394 usecs/op
> fdatasync  4602.192 ops/sec 217 usecs/op
> fsync  4600.365 ops/sec 217 usecs/op
> fsync_writethrough   12.135 ops/sec   82404 usecs/op
> open_sync  2506.674 ops/sec 399 usecs/op
> 
> So it's not a no-op, but on the other hand it's not succeeding in getting
> bits down to the platter.  I'm not inclined to dike it out, but it does
> seem problematic that we're defaulting to open_datasync, which is also
> not getting bits down to the platter.
> 
> I have a vague recollection that we discussed changing the default
> wal_sync_method for Darwin years ago, but I don't recall why we
> didn't pull the trigger.  These results certainly suggest that
> we oughta.

Is this with an SSD?  We used to be able to know something wasn't
flushing to durable storage because magnetic disk was so slow you could
tell from the numbers, but with SSDs, it might be harder to guess. 
Maybe time to use:

https://brad.livejournal.com/2116715.html
diskchecker.pl

or find a way to automate that test.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Git, diffs, and patches

2021-01-15 Thread Bruce Momjian
I learned a few things when working on the key management patch that I
want to share here in case it helps anyone:

*  git diff effectively creates a squashed diff of all commits/changes
*  git format-patch wants to retain each commit (no squash)
*  git format-patch has information about file name changes
   (add/rename/remove) that git diff does not

*  git apply and git am cannot process context diffs, only unified diffs
*  git apply only applies changes to the files and therefore cannot
   record file name changes in git, e.g., git add
*  git am applies and merges changes, including file name changes

*  to create a squashed format-patch, you have to create a new branch
   and merge --squash your changed branch into that, then use git
   format-patch
*  to create a squashed git format-patch on top of a lower branch
   you have to make a copy of the lower branch, merge --squash on the
   upper branch on top of that, and then use git format-patch comparing
   the lower branch to the upper one

Maybe everyone else knew these things, but I didn't.  I can provide more
details if desired.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: fdatasync(2) on macOS

2021-01-15 Thread Tom Lane
Thomas Munro  writes:
> While following along with the nearby investigation into weird
> cross-version Apple toolchain issues that confuse configure, I noticed
> that the newer buildfarm Macs say:
> checking for fdatasync... (cached) yes
> That's a bit strange because there's no man page and no declaration:

Yeah, it's been there but undeclared for a long time.  Who knows why.

> So... does this unreleased function flush drive caches?  We know that
> fsync(2) doesn't, based on Apple's advice[1] for database hackers to
> call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.

pg_test_fsync results make it clear that fdatasync is the same or a shade
faster than fsync, which is pretty much what you'd expect.  On my
late-model Macbook Pro:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync 14251.416 ops/sec  70 usecs/op
fdatasync 25345.103 ops/sec  39 usecs/op
fsync 24677.445 ops/sec  41 usecs/op
fsync_writethrough   41.519 ops/sec   24085 usecs/op
open_sync 14188.903 ops/sec  70 usecs/op

and on an old Mac mini with spinning rust:

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync  2536.535 ops/sec 394 usecs/op
fdatasync  4602.192 ops/sec 217 usecs/op
fsync  4600.365 ops/sec 217 usecs/op
fsync_writethrough   12.135 ops/sec   82404 usecs/op
open_sync  2506.674 ops/sec 399 usecs/op

So it's not a no-op, but on the other hand it's not succeeding in getting
bits down to the platter.  I'm not inclined to dike it out, but it does
seem problematic that we're defaulting to open_datasync, which is also
not getting bits down to the platter.

I have a vague recollection that we discussed changing the default
wal_sync_method for Darwin years ago, but I don't recall why we
didn't pull the trigger.  These results certainly suggest that
we oughta.

regards, tom lane




Re: PG vs LLVM 12 on seawasp, next round

2021-01-15 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Dec-11, Fabien COELHO wrote:
>> I'll look into it when I have time, which make take some time. Hopefully
>> over the holidays.

> This is still happening ... Any chance you can have a look at it?

If you don't have time to debug it, perhaps you could just disable
the buildfarm animal till you do.  It's cluttering the buildfarm
failure report without providing useful info ...

regards, tom lane




Re: PG vs LLVM 12 on seawasp, next round

2021-01-15 Thread Alvaro Herrera
On 2020-Dec-11, Fabien COELHO wrote:

> > I hadn't checked that before, but for the last few days there's been a
> > different failure than the one I saw earlier:
> > 
> > +ERROR:  could not load library 
> > "/home/fabien/pg/build-farm-11/buildroot/HEAD/pgsql.build/tmp_install/home/fabien/pg/build-farm-11/buildroot/HEAD/inst/lib/postgresql/llvmjit.so":
> >  libLLVMOrcJIT.so.12git: cannot open shared object file: No such file or 
> > directory

> > The "no such file" error seems more like a machine local issue to me.
> 
> I'll look into it when I have time, which make take some time. Hopefully
> over the holidays.

This is still happening ... Any chance you can have a look at it?

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: Yet another fast GiST build

2021-01-15 Thread Andrey Borodin



> On 15 Jan 2021, at 10:24, Peter Eisentraut 
>  wrote:
> 
> I noticed this patch while working on another patch for pageinspect [0], and 
> this one appears to introduce a problem similar to the one the other patch 
> attempts to fix: The "itemlen" output parameters are declared to be of type 
> smallint, but the underlying C data is of type uint16 (OffsetNumber).  I 
> don't know the details of gist enough to determine whether overflow is 
> possible here.  If not, perhaps a check or at least a comment would be 
> useful.  Otherwise, these parameters should be of type int in SQL.

Item offsets cannot exceed the maximum block size of 32768, and in fact not even 
32768/sizeof(ItemId). Thus overflow is impossible.
An interesting question is whether pageinspect should protect itself from corrupted 
input? Generating a description from a bogus tuple can probably go wrong.
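
Purely as an illustration of that bound (not something the patch
necessarily needs), a compile-time check inside the function would be
enough:

    /* offsets are bounded by BLCKSZ / sizeof(ItemIdData), well below 32767 */
    StaticAssertStmt(MaxOffsetNumber <= PG_INT16_MAX,
                     "offset numbers must fit in smallint");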

Best regards, Andrey Borodin.



Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Simon Riggs
On Fri, Jan 15, 2021 at 4:56 PM Surafel Temesgen  wrote:
>
>
>
> On Fri, Jan 15, 2021 at 12:22 AM Simon Riggs  
> wrote:
>>
>> SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
>> should NOT include the Start and End timestamp columns
>> because this acts like a normal query just with a different snapshot 
>> timestamp
>>
>> SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
>> SHOULD include the Start and End timestamp columns
>> since this form of query can include multiple row versions for the
>> same row, so it makes sense to see the validity times
>>
>
> One disadvantage of returning the system time columns is that it
> breaks upward compatibility: if an existing application wants to
> switch to system versioning it will break.

There are no existing applications, so for PostgreSQL, it wouldn't be an issue.

If you mean compatibility with other databases, that might be an
argument to do what others have done. What have other databases done
for SELECT * ?

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Surafel Temesgen
On Fri, Jan 15, 2021 at 12:22 AM Simon Riggs 
wrote:

> SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
> should NOT include the Start and End timestamp columns
> because this acts like a normal query just with a different snapshot
> timestamp
>
> SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
> SHOULD include the Start and End timestamp columns
> since this form of query can include multiple row versions for the
> same row, so it makes sense to see the validity times
>
>
One disadvantage of returning the system time columns is that it
breaks upward compatibility: if an existing application wants to
switch to system versioning it will break.

regards
Surafel


Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Simon Riggs
On Fri, Jan 15, 2021 at 4:46 PM Surafel Temesgen  wrote:
>
>
>
> On Fri, Jan 15, 2021 at 12:27 AM Simon Riggs  
> wrote:
>>
>>
>> Yes, I think it can. The current situation is that the Start or End is
>> set to the Transaction Start Timestamp.
>> So if t2 starts before t1, then if t1 creates a row and t2 deletes it
>> then we will have start=t1 end=t2, but t2 < t1.
>> Your tests don't show that because it must happen concurrently.
>> We need to add an isolation test to show this, or to prove it doesn't happen.
>>
>
>
> Does MVCC allow that? I am not an expert on MVCC, but I don't
> think t2 can see the row created by a transaction started before
> itself

Yeh. Read Committed mode can see anything committed prior to the start
of the current statement, but UPDATEs always update the latest version
even if they can't see it.

Anyway, an isolationtester spec file is needed to test this. See
src/test/isolation and the examples in the specs/ directory.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-15 Thread Surafel Temesgen
On Fri, Jan 15, 2021 at 12:27 AM Simon Riggs 
wrote:

>
> Yes, I think it can. The current situation is that the Start or End is
> set to the Transaction Start Timestamp.
> So if t2 starts before t1, then if t1 creates a row and t2 deletes it
> then we will have start=t1 end=t2, but t2 < t1.
> Your tests don't show that because it must happen concurrently.
> We need to add an isolation test to show this, or to prove it doesn't
> happen.
>
>

Does MVCC allow that? I am not an expert on MVCC, but I don't
think t2 can see the row created by a transaction started before
itself

regards
Surafel


Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-15 Thread Zhihong Yu
Hi,
For v32-0004-Add-PrepareForeignTransaction-API.patch :

+ * Whenever a foreign transaction is processed, the corresponding FdwXact
+ * entry is update. To avoid holding the lock during transaction
processing
+ * which may take an unpredicatable time the in-memory data of foreign

entry is update -> entry is updated

unpredicatable -> unpredictable

+   int nlefts = 0;

nlefts -> nremaining

+   elog(DEBUG1, "left %u foreign transactions", nlefts);

The message can be phrased as "%u foreign transactions remaining"

+FdwXactResolveFdwXacts(int *fdwxact_idxs, int nfdwxacts)

Fdw and Xact are repeated. Seems one should suffice. How about naming the
method FdwXactResolveTransactions() ?
Similar comment for FdwXactResolveOneFdwXact(FdwXact fdwxact)

For get_fdwxact():

+   /* This entry matches the condition */
+   found = true;
+   break;

Instead of breaking and returning, you can return within the loop directly.
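
Something along these lines (a sketch only -- the loop bounds and field
names here are placeholders, not the actual code from the patch):

    static FdwXact
    get_fdwxact(Oid dbid, TransactionId xid, Oid serverid, Oid userid)
    {
        for (int i = 0; i < FdwXactCtl->num_fdwxacts; i++)
        {
            FdwXact     fdwxact = FdwXactCtl->fdwxacts[i];

            if (fdwxact->dbid == dbid && fdwxact->local_xid == xid &&
                fdwxact->serverid == serverid && fdwxact->userid == userid)
                return fdwxact;     /* this entry matches the condition */
        }

        return NULL;
    }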

Cheers

On Thu, Jan 14, 2021 at 9:17 PM Masahiko Sawada 
wrote:

> On Fri, Jan 15, 2021 at 4:03 AM Zhihong Yu  wrote:
> >
> > Hi,
> > For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :
>
> Thank you for reviewing the patch!
>
> >
> > +   boolhave_notwophase = false;
> >
> > Maybe name the variable have_no_twophase so that it is easier to read.
>
> Fixed.
>
> >
> > +* Two-phase commit is not required if the number of servers
> performed
> >
> > performed -> performing
>
> Fixed.
>
> >
> > +errmsg("cannot process a distributed transaction that
> has operated on a foreign server that does not support two-phase commit
> protocol"),
> > +errdetail("foreign_twophase_commit is \'required\' but
> the transaction has some foreign servers which are not capable of two-phase
> commit")));
> >
> > The lines are really long. Please wrap into more lines.
>
> Hmm, we can do that but if we do that, it makes grepping by the error
> message hard. Please refer to the documentation about the formatting
> guideline[1]:
>
> Limit line lengths so that the code is readable in an 80-column
> window. (This doesn't mean that you must never go past 80 columns. For
> instance, breaking a long error message string in arbitrary places
> just to keep the code within 80 columns is probably not a net gain in
> readability.)
>
> These changes have been made in the local branch. I'll post the
> updated patch set after incorporating all the comments.
>
> Regards,
>
> [1] https://www.postgresql.org/docs/devel/source-format.html
>
> --
> Masahiko Sawada
> EnterpriseDB:  https://www.enterprisedb.com/
>


Re: pg_preadv() and pg_pwritev()

2021-01-15 Thread Tom Lane
Sergey Shinderuk  writes:
> On 15.01.2021 04:45, Tom Lane wrote:
>> Hence, I propose the attached.  This works as far as I can tell
>> to fix the problem you're seeing.

> Yes, it works fine. Thank you very much.

Great.  Pushed with a little more polishing.

regards, tom lane




Re: list of extended statistics on psql

2021-01-15 Thread Tomas Vondra



On 1/15/21 9:47 AM, Julien Rouhaud wrote:
> On Wed, Jan 13, 2021 at 10:22:05AM +0900, Tatsuro Yamada wrote:
>> Hi Tomas,
>>
>> On 2021/01/13 7:48, Tatsuro Yamada wrote:
>>> On 2021/01/12 20:08, Tomas Vondra wrote:
 On 1/12/21 2:57 AM, Tatsuro Yamada wrote:
> On 2021/01/09 9:01, Tomas Vondra wrote:
 ...>
>> While working on that, I realized that 'defined' might be a bit
>> ambiguous, I initially thought it means 'NOT NULL' (which it does not).
>> I propose to change it to 'requested' instead. Tatsuro, do you agree, or
>> do you think 'defined' is better?
>
> Regarding the status of extended stats, I think the followings:
>
>   - "defined": it shows the extended stats defined only. We can't know
>    whether it needs to analyze or not. I agree this name was
>     ambiguous. Therefore we should replace it with a more 
> suitable
>    name.
>   - "requested": it shows the extended stats needs something. Of course,
>    we know it needs to ANALYZE because we can create the 
> patch.
>    However, I feel there is a little ambiguity for DBA.
>    To solve this, it would be better to write an explanation 
> of
>    the status in the document. For example,
>
> ==
> The column of the kind of extended stats (e. g. Ndistinct) shows some 
> statuses.
> "requested" means that it needs to gather data by ANALYZE. "built" means 
> ANALYZE
>   was finished, and the planner can use it. NULL means that it doesn't 
> exists.
> ==
>
> What do you think? :-D
>

 Yes, that seems reasonable to me. Will you provide an updated patch?
>>>
>>>
>>> Sounds good. I'll send the updated patch today.
>>
>>
>>
>> I updated the patch to add the explanation of the extended stats' statuses.
>> Please feel free to modify the patch to improve it more clearly.
>>
>> The attached files are:
>>0001: Add psql \dX and the fixed document
>>0002: Regression test for psql \dX
>>app-psql.html: Created by "make html" command (You can check the
>>   explanation of the statuses easily, probably)
> 
> Hello Yamada-san,
> 
> I reviewed the patch and don't have specific complaints, it all looks good!
> 
> I'm however thinking about the "requested" status.  I'm wondering if it could
> lead people to think that an ANALYZE is scheduled and will happen soon.
> Maybe "defined" or "declared" might be less misleading, or even "waiting for
> analyze"?
> 

Well, the "defined" option is not great either, because it can be
interpreted as "NOT NULL" - that's why I proposed "requested". Not sure
about "declared" - I wouldn't use it in this context, but I'm not a
native speaker so maybe it's OK.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: POC: postgres_fdw insert batching

2021-01-15 Thread Amit Langote
On Sat, Jan 16, 2021 at 12:00 AM tsunakawa.ta...@fujitsu.com
 wrote:
> From: Amit Langote 
> > Okay, so maybe not moving the whole logic into the FDW's
> > BeginForeignModify(), but at least if we move this...
> >
> > @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
> > +   /*
> > +* Determine if the FDW supports batch insert and determine the
> > batch
> > +* size (a FDW may support batching, but it may be disabled for the
> > +* server/table). Do this only once, at the beginning - we don't 
> > want
> > +* the batch size to change during execution.
> > +*/
> > +   if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
> > +   resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
> > +   resultRelInfo->ri_BatchSize == 0)
> > +   resultRelInfo->ri_BatchSize =
> > +
> > resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
> >
> > ...into ExecInitModifyTable(), ExecInsert() only needs the following block:
>
> Does ExecInitModifyTable() know all leaf partitions where the tuples produced 
> by VALUES or SELECT go?  ExecInsert() doesn't find the target leaf partition 
> for the first time through the call to ExecPrepareTupleRouting()?  Leaf 
> partitions can have different batch_size settings.

Good thing you reminded me that this is about inserts, and in that
case no, ExecInitModifyTable() doesn't know all leaf partitions, it
only sees the root table whose batch_size doesn't really matter.  So
it's really ExecInitRoutingInfo() that I would recommend to set
ri_BatchSize; right after this block:

/*
 * If the partition is a foreign table, let the FDW init itself for
 * routing tuples to the partition.
 */
if (partRelInfo->ri_FdwRoutine != NULL &&
partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);

Note that ExecInitRoutingInfo() is called only once for a partition
when it is initialized after being inserted into for the first time.
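
Concretely, something like this right after that block (a sketch only,
reusing the callback and field names from the patch under discussion):

    /*
     * If the partition is a foreign table, let the FDW init itself for
     * routing tuples to the partition.
     */
    if (partRelInfo->ri_FdwRoutine != NULL &&
        partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
        partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);

    /* Determine the batch size once, when the partition is first used. */
    if (partRelInfo->ri_FdwRoutine != NULL &&
        partRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
        partRelInfo->ri_FdwRoutine->ExecForeignBatchInsert)
        partRelInfo->ri_BatchSize =
            partRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(partRelInfo);

ExecInsert() then only needs the ri_BatchSize > 1 check.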

For non-partitioned targets, I'd still say set ri_BatchSize in
ExecInitModifyTable().

-- 
Amit Langote
EDB: http://www.enterprisedb.com




RE: POC: postgres_fdw insert batching

2021-01-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Okay, so maybe not moving the whole logic into the FDW's
> BeginForeignModify(), but at least if we move this...
> 
> @@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
> +   /*
> +* Determine if the FDW supports batch insert and determine the
> batch
> +* size (a FDW may support batching, but it may be disabled for the
> +* server/table). Do this only once, at the beginning - we don't want
> +* the batch size to change during execution.
> +*/
> +   if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
> +   resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
> +   resultRelInfo->ri_BatchSize == 0)
> +   resultRelInfo->ri_BatchSize =
> +
> resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);
> 
> ...into ExecInitModifyTable(), ExecInsert() only needs the following block:

Does ExecInitModifyTable() know all leaf partitions where the tuples produced 
by VALUES or SELECT go?  ExecInsert() doesn't find the target leaf partition 
for the first time through the call to ExecPrepareTupleRouting()?  Leaf 
partitions can have different batch_size settings.


Regards
Takayuki Tsunakawa



Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-15 Thread Álvaro Herrera
So one last remaining improvement was to have VACUUM ignore processes
doing CIC and RC to compute the Xid horizon of tuples to remove.  I
think we can do something simple like the attached patch.

-- 
Álvaro Herrera   Valdivia, Chile
"Aprender sin pensar es inútil; pensar sin aprender, peligroso" (Confucio)
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index cf12eda504..f584230b79 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1601,7 +1601,7 @@ TransactionIdIsActive(TransactionId xid)
  * well as "internally" by GlobalVisUpdate() (see comment above struct
  * GlobalVisState).
  *
- * See the definition of ComputedXidHorizonsResult for the various computed
+ * See the definition of ComputeXidHorizonsResult for the various computed
  * horizons.
  *
  * For VACUUM separate horizons (used to decide which deleted tuples must
@@ -1610,7 +1610,12 @@ TransactionIdIsActive(TransactionId xid)
  * relations that's not required, since only backends in my own database could
  * ever see the tuples in them. Also, we can ignore concurrently running lazy
  * VACUUMs because (a) they must be working on other tables, and (b) they
- * don't need to do snapshot-based lookups.
+ * don't need to do snapshot-based lookups.  Also, for the non-catalog
+ * horizon, we can ignore CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY
+ * when they are working on non-partial, non-expressional indexes, for the
+ * same reasons and because they can't run in transaction blocks.  (They are
+ * not possible to ignore for catalogs, because CIC and RC do some catalog
+ * operations.)
  *
  * This also computes a horizon used to truncate pg_subtrans. For that
  * backends in all databases have to be considered, and concurrently running
@@ -1660,9 +1665,6 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 	bool		in_recovery = RecoveryInProgress();
 	TransactionId *other_xids = ProcGlobal->xids;
 
-	/* inferred after ProcArrayLock is released */
-	h->catalog_oldest_nonremovable = InvalidTransactionId;
-
 	LWLockAcquire(ProcArrayLock, LW_SHARED);
 
 	h->latest_completed = ShmemVariableCache->latestCompletedXid;
@@ -1682,6 +1684,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 
 		h->oldest_considered_running = initial;
 		h->shared_oldest_nonremovable = initial;
+		h->catalog_oldest_nonremovable = initial;
 		h->data_oldest_nonremovable = initial;
 
 		/*
@@ -1752,7 +1755,7 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		if (statusFlags & (PROC_IN_VACUUM | PROC_IN_LOGICAL_DECODING))
 			continue;
 
-		/* shared tables need to take backends in all database into account */
+		/* shared tables need to take backends in all databases into account */
 		h->shared_oldest_nonremovable =
 			TransactionIdOlder(h->shared_oldest_nonremovable, xmin);
 
@@ -1768,16 +1771,29 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		 * to prune still needed data away). If the current backend never
 		 * connects to a database that is harmless, because
 		 * data_oldest_nonremovable will never be utilized.
+		 *
+		 * Additionally, processes doing CREATE INDEX CONCURRENTLY and REINDEX
+		 * CONCURRENTLY on "safe" indexes can be ignored for non-catalog
+		 * horizon. (But not for catalogs: some transactions in CIC/RC do
+		 * catalog updates.)
 		 */
 		if (in_recovery ||
 			MyDatabaseId == InvalidOid || proc->databaseId == MyDatabaseId ||
 			proc->databaseId == 0)	/* always include WalSender */
 		{
-			h->data_oldest_nonremovable =
-TransactionIdOlder(h->data_oldest_nonremovable, xmin);
+			if (statusFlags & PROC_IN_SAFE_IC)
+h->catalog_oldest_nonremovable =
+	TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
+			else
+h->data_oldest_nonremovable = h->catalog_oldest_nonremovable =
+	TransactionIdOlder(h->data_oldest_nonremovable, xmin);
 		}
 	}
 
+	/* catalog horizon should never be later than data */
+	Assert(TransactionIdPrecedesOrEquals(h->catalog_oldest_nonremovable,
+		 h->data_oldest_nonremovable));
+
 	/*
 	 * If in recovery fetch oldest xid in KnownAssignedXids, will be applied
 	 * after lock is released.
@@ -1799,6 +1815,8 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 			TransactionIdOlder(h->shared_oldest_nonremovable, kaxmin);
 		h->data_oldest_nonremovable =
 			TransactionIdOlder(h->data_oldest_nonremovable, kaxmin);
+		h->catalog_oldest_nonremovable =
+			TransactionIdOlder(h->catalog_oldest_nonremovable, kaxmin);
 		/* temp relations cannot be accessed in recovery */
 	}
 	else
@@ -1825,6 +1843,9 @@ ComputeXidHorizons(ComputeXidHorizonsResult *h)
 		h->data_oldest_nonremovable =
 			TransactionIdRetreatedBy(h->data_oldest_nonremovable,
 	 vacuum_defer_cleanup_age);
+		h->catalog_oldest_nonremovable =
+			TransactionIdRetreatedBy(h->catalog_oldest_nonremovable,
+	 vacuum_defer_cleanup_age);
 		/* defer doesn't apply to temp relations */
 	}
 
@@ -1847,7 +1868,6 @@ 

RE: Disable WAL logging to speed up data loading

2021-01-15 Thread osumi.takami...@fujitsu.com
Hi, Movead


Thanks for your comments.
> I read the patch and have two points:
> 
> 1. I do a basebackup of the database, then switch wal_level from logical to none to
> logical, and of course I archive the WAL segments. Next I do PITR based on the
> basebackup; as a result it starts up successfully with a warning saying data may
> be missing.
This applies to wal_level=minimal and is going to be addressed by
another thread [1].

> 
> Because the 'none' level is for bulk-loading data, do you think it's good that we
> still support recovery from a 'none' wal level?
> 
> 2. I just mark wal_level as 'none' but it fails to start up; it succeeds after I
> drop the publication and its subscription, set max_wal_senders to 0, and drop the
> replication slot.
> I think it is worth documenting how we can start up a database with the 'none'
> wal level.
The documentation [2] touches on this aspect.
It already says "Also, wal_level must be set to replica or
higher to allow replication slots to be used."
That is documented and applies to wal_level=minimal as well,
because 'src/backend/replication/slot.c' has code that throws the error you got, 
as shown below.

	else if (wal_level < WAL_LEVEL_REPLICA)
		ereport(FATAL,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("physical replication slot \"%s\" exists, but wal_level < replica",
						NameStr(cp.slotdata.name)),
				 errhint("Change wal_level to be replica or higher.")));

[1] - 
https://www.postgresql.org/message-id/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90%40OSBPR01MB4888.jpnprd01.prod.outlook.com
[2] - https://www.postgresql.org/docs/13/runtime-config-replication.html

Best Regards,
Takamichi Osumi


RE: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Sorry, I haven't looked at the linked threads and the latest patches
> there closely enough yet, so I may be misreading this, but if the
> inserts will always be done by the leader backend as you say, then why
> does the planner need to be checking the parallel safety of the
> *target* table's expressions?

Yeah, I also wanted to confirm this next - that is, whether the current patch 
allows the SELECT operation to execute in parallel while the INSERT operation 
runs serially.  Oracle allows it; it even allows the user to specify a hint after 
INSERT and after SELECT separately.  Even if the INSERT in INSERT SELECT can't be run in 
parallel, it is useful for producing summary data, such as aggregating large 
amounts of IoT sensor data in parallel and inserting the small amount of 
summary data serially.


Regards
Takayuki Tsunakawa



RE: Disable WAL logging to speed up data loading

2021-01-15 Thread osumi.takami...@fujitsu.com
Hi


Thank you everyone
On Thursday, January 14, 2021 9:27 AM Tsunakawa, Takayuki/綱川 貴之 
 wrote:
> From: Kyotaro Horiguchi 
> > > XLogSetRecordFlags(XLOG_MARK_UNIMPORTANT |
> > XLOG_MARK_ESSENTIAL);
> > > XLogRegisterData((char *) , sizeof(dummy));
> > >
> > > (Here's a word play - unimportant but essential, what's that?)
> >
> > Hmm. Food may not be important to someone but it is essential for
> > survival.  I think this is somethig like that :p
> 
> Ah, that's a good answer.  I know a person around me who enjoys drinking
> alcohol but doesn't enjoy eating - he says he doesn't care about taste of 
> food.
> So food is unimportant but nutrition is essential for him.
> 
> 
> > Unfortunately, I prefer the latter as it is simpler, because it is in a
> > hot path.  As I think I mentioned upthread, I think the xlog stuff
> > should refrain from being conscious of things deeper than the RMgr-ID
> > level.  Another reason is that generally the issuing site is
> > authoritative about the importance and essentiality of a record being
> > issued.  And I don't think it's very good to do the same thing in
> > different ways at the same time.  Fortunately each type of record
> > has only a few issuing places.
> 
> Agreed.
> 
> 
> > By the way, I noticed that pg_switch_wal() silently skips its task.
> > Doesn't it need to give any sort of ERROR?
> 
> Osumi-san, can you check this?  (I thought we were aware of this according to
> Fujii-san's comment.)
I updated my patch to incorporate that feedback!
Have a look at the latest patch.


Best Regards,
Takamichi Osumi



disable_WAL_logging_v08.patch
Description: disable_WAL_logging_v08.patch


RE: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

Three solutions(?) quickly come to my mind:


(1) Have the user specify whether they want to parallelize DML
Oracle [1] and SQL Server [2] take this approach.  Oracle disables parallel DML 
execution by default.  The reason is described as "This mode is required 
because parallel DML and serial DML have different locking, transaction, and 
disk space requirements and parallel DML is disabled for a session by default." 
 To enable parallel DML in a session or in a specific statement, you need to 
run either of the following:

  ALTER SESSION ENABLE PARALLEL DML;
  INSERT /*+ ENABLE_PARALLEL_DML */ …

Besides, the user has to specify a parallel hint in a DML statement, or specify 
the parallel attribute in CREATE or ALTER TABLE.

SQL Server requires a TABLOCK hint to be specified in the INSERT SELECT 
statement like this:

  INSERT INTO Sales.SalesHistory WITH (TABLOCK)  (target columns...) SELECT ...;


(2) Postpone the parallel safety check until after the planner finds a worthwhile 
parallel query plan
I'm not sure if the current planner code allows this easily...


(3) Record the parallel safety in the system catalog
Add a column like relparallel in pg_class that indicates the parallel safety of 
the relation.  The planner just checks the value instead of doing heavy work for 
every SQL statement.  That column's value is modified whenever a relation 
alteration is made that affects the parallel safety, such as adding a domain 
column or a CHECK constraint.  In case of a partitioned relation, the parallel 
safety attributes of all its descendant relations are merged.  For example, if 
a partition becomes parallel-unsafe, the ancestor partitioned tables also 
become parallel-unsafe.

But... developing such code would be burdensome and bug-prone?


I'm inclined to propose (1).  Parallel DML would be something that a limited 
people run in limited circumstances (data loading in data warehouse and batch 
processing in OLTP systems by the DBA or data administrator), so I think it's 
legitimate to require explicit specification of parallelism.

As an aside, (1) and (2) have a potential problem with memory consumption.  
Opening relations bloats CacheMemoryContext with relcache and catcache entries, and 
closing relations does not free (all of) the memory.  But I don't think it could 
really become a problem in practice, because parallel DML would be run in a 
limited number of concurrent sessions.


[1]
Types of Parallelism
https://docs.oracle.com/en/database/oracle/oracle-database/21/vldbg/types-parallelism.html#GUID-D8290A02-BE5F-436A-B814-D6FD71CEE81F

[2]
INSERT (Transact-SQL)
https://docs.microsoft.com/en-us/sql/t-sql/statements/insert-transact-sql?view=sql-server-ver15#best-practices


Regards
Takayuki Tsunakawa



Re: POC: postgres_fdw insert batching

2021-01-15 Thread Amit Langote
On Fri, Jan 15, 2021 at 12:05 AM Tomas Vondra
 wrote:
> Attached is v9 with all of those tweaks,

Thanks.

> except for moving the BatchSize
> call to BeginForeignModify - I tried that, but it did not seem like an
> improvement, because we'd still need the checks for API callbacks in
> ExecInsert for example. So I decided not to do that.

Okay, so maybe not moving the whole logic into the FDW's
BeginForeignModify(), but at least if we move this...

@@ -441,6 +449,72 @@ ExecInsert(ModifyTableState *mtstate,
+   /*
+* Determine if the FDW supports batch insert and determine the batch
+* size (a FDW may support batching, but it may be disabled for the
+* server/table). Do this only once, at the beginning - we don't want
+* the batch size to change during execution.
+*/
+   if (resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize &&
+   resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert &&
+   resultRelInfo->ri_BatchSize == 0)
+   resultRelInfo->ri_BatchSize =
+
resultRelInfo->ri_FdwRoutine->GetForeignModifyBatchSize(resultRelInfo);

...into ExecInitModifyTable(), ExecInsert() only needs the following block:

   /*
+* If the FDW supports batching, and batching is requested, accumulate
+* rows and insert them in batches. Otherwise use the per-row inserts.
+*/
+   if (resultRelInfo->ri_BatchSize > 1)
+   {
+ ...

AFAICS, I don't see anything that will cause ri_BatchSize to become 0
once set so don't see the point of checking whether it needs to be set
again on every ExecInsert() call.  Also, maybe not that important, but
shaving off 3 comparisons for every tuple would add up nicely IMHO
especially given that we're targeting bulk loads.

--
Amit Langote
EDB: http://www.enterprisedb.com




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-15 Thread Álvaro Herrera
On 2021-Jan-12, Álvaro Herrera wrote:

> On 2021-Jan-12, Álvaro Herrera wrote:
> 
> > > For the 0001 patch, since ReindexIndexInfo is used only within
> > > ReindexRelationConcurrently() I think it’s a function-local structure
> > > type. So we can declare it within the function. What do you think?
> > 
> > That's a good idea.  Pushed with that change, thanks.
> 
> Here's the other patch, with comments fixed per reviews, and rebased to
> account for the scope change.

Pushed.  At the last minute I decided to back off on reverting the flag
set that DefineIndex has, because there was no point in doing that.  I
also updated the comment in (two places of) ReindexRelationConcurrently
about why we don't do it there.  The previous submission had this:

> + /*
> +  * This transaction doesn't need to set the PROC_IN_SAFE_IC flag, 
> because
> +  * it only acquires an Xid to do some catalog manipulations, after the
> +  * wait is over.
> +  */

but this fails to point out that the main reason not to do it, is to
avoid having to decide whether to do it or not when some indexes are
safe and others aren't.  So I changed to:

+   /*
+* While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
+* real need for that, because we only acquire an Xid after the wait is
+* done, and that lasts for a very short period.
+*/

Thanks!

-- 
Álvaro Herrera   Valdivia, Chile
"The eagle never lost so much time, as
when he submitted to learn of the crow." (William Blake)




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Langote
On Fri, Jan 15, 2021 at 9:59 PM Amit Kapila  wrote:
> We want to do this for Inserts where only Select can be parallel and
> Inserts will always be done by the leader backend. This is actually
> the case we first want to implement.

Sorry, I haven't looked at the linked threads and the latest patches
there closely enough yet, so I may be misreading this, but if the
inserts will always be done by the leader backend as you say, then why
does the planner need to be checking the parallel safety of the
*target* table's expressions?

-- 
Amit Langote
EDB: http://www.enterprisedb.com




Re: Wrong HINT during database recovery when occur a minimal wal.

2021-01-15 Thread lchch1990
>I think it's also important to suggest to the users how they can turn
>on hot_standby on their standby. So, perhaps-a-bit-verbose hint would
>be like this.
>"Either start this standby from base backup taken after setting
>wal_level to \"replica\" on the primary, or turn off hot_standby
>here."
>Does this make sense?

Can you help me understand how [setting wal_level to \"replica\"] helps
for this startup from basebackup?
Do you mean setting wal_level on the basebackup or on the database we do
the basebackup from?

Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 5:53 PM Ashutosh Bapat
 wrote:
>
> On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila  wrote:
> >
> > While reviewing parallel insert [1] (Insert into  Select) and
> > parallel copy patches [2], it came to my notice that both the patches
> > traverse the entire partition hierarchy to determine parallel-safety
> > of partitioned relations. This is required because before considering
> > the Insert or Copy can be considered for parallelism, we need to
> > determine whether it is safe to do so. We need to check for each
> > partition because any of the partitions can have some parallel-unsafe
> > index expression, constraint, etc. We do a similar thing for Selects
> > in standard_planner.
> >
> > The plain Select case for partitioned tables was simpler because we
> > anyway loop through all the partitions in set_append_rel_size() and we
> > determine parallel-safety of each partition at that time but the same
> > is not true for Inserts.
> >
> > For Inserts, currently, we only open the partition table when we are
> > about to insert into that partition. During ExecInsert, we find out
> > the partition matching the partition-key value and then lock if it is
> > not already locked. In this patch, we need to open each partition at
> > the planning time to determine its parallel-safety.
>
> We don't want to open the partitions where no rows will be inserted.
>
> >
> > This will surely increase planning time but the execution is reduced
> > to an extent due to parallelism that it won't matter for either of the
> > cases if we see just total time. For example, see the latest results
> > for parallel inserts posted by Haiying Tang [3]. There might be an
> > impact when Selects can't be parallelized due to the small size of the
> > Select-table but we still have to traverse all the partitions to
> > determine parallel-safety but not sure how much it is compared to
> > overall time. I guess we need to find the same but apart from that can
> > anyone think of a better way to determine parallel-safety of
> > partitioned relation for Inserts?
>
> In case of SELECT we open only those partitions which surive pruning.
> So those are the ones which will definitely required to be scanned. We
> perform parallelism checks only on those partitions. The actual check
> isn't much costly.
>
> >
> > Thoughts?
> >
> > Note: I have kept a few people in Cc who are either directly involved
> > in this work or work regularly in the partitioning related work just
> > in the hope that might help in moving the discussion forward.
>
> Since you brought up comparison between SELECT and INSERT, "pruning"
> partitions based on the values being INSERTed might help. It should be
> doable in case of INSERT ... SELECT where we need to prune partitions
> based on the clauses of SELECT. Doable with some little effort in case
> of VALUEs and COPY.
>

I don't think we should try pruning in this patch as that is a
separate topic and even after pruning the same problem can happen when
we are not able to prune partitions in the table where we want to
Insert.

> Second possibility is to open partitions only when the estimated
> number of rows to be inserted goes beyond a certain value.
>

Yeah, this option has merits in the sense that we will pay the cost to
traverse partitions only when the parallel plan is possible. If you
see the 0001 patch in email [1], it tries to determine parallel-safety
(which in turn will traverse all partition tables) in standard_planner
where we determine the parallel-safety for the Query tree. Now, if we
have to postpone it for partitioned tables then we need to determine
it at the places where we create_modifytable_path if the
current_rel->pathlist has some parallel_safe paths. And which will
also mean that we need to postpone generating gather node as well till
that time because Insert can be parallel-unsafe.

This will have one disadvantage over the current approach implemented
by the patch which is that we might end up spending a lot of time in
the optimizer to create partial paths and later (while determining
parallel-safety of Insert) find that none of them is required.

If we decide to postpone the parallel-safety checking for Inserts then
why not we check parallel-safety for all sorts of Inserts at that
point. That can at least simplify the patch.

> Third idea is to use something similar to parallel append where
> individual partitions are scanned sequentially but multiple partitions
> are scanned in parallel. When a row is inserted into a non-yet-opened
> partition, allocate one/more backends to insert into partitions which
> do not allow parallelism, otherwise continue to use a common pool of
> parallel workers for insertion. This means the same thread performing
> select may not perform insert. So some complications will be involved.
>

We want to do this for Inserts where only Select can be parallel and
Inserts will always be done by the leader backend. This is actually
the case we first want to implement.

Re: Commitfest 2021-01 Now in Progress

2021-01-15 Thread Masahiko Sawada
Hi,

On Fri, Jan 1, 2021 at 9:22 PM Masahiko Sawada  wrote:
>
> Hi,
>
> Happy new year to all!
>
> The January Commitfest for PG14 is now in progress! I'm happy to
> volunteer to manage it.
>

Now that a half month has passed, the current state is:

Needs review: 163 (-25)
Waiting on Author: 35 (+7)
Ready for Committer: 20 (-2)
Committed: 38 (+19)
Withdrawn: 4 (+1)

We got 19 patches committed in half a month, which is a good figure, I think.

The number of patch entries on the Commitfest app that changed their
status since this Commitfest started is 50. There are still many
patches waiting to get reviewed, possibly for a long time (of course,
some patch entries might simply not have been updated in the app). So we
need non-committer contributors to review other people's patches.
Please do your part and review patches of at least equivalent
complexity to the patches you're submitting.

I'll report the statistics again next week, 1 week before closing.

Regards,


--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Ashutosh Bapat
On Fri, Jan 15, 2021 at 3:48 PM Amit Kapila  wrote:
>
> While reviewing parallel insert [1] (Insert into  Select) and
> parallel copy patches [2], it came to my notice that both the patches
> traverse the entire partition hierarchy to determine parallel-safety
> of partitioned relations. This is required because before considering
> the Insert or Copy can be considered for parallelism, we need to
> determine whether it is safe to do so. We need to check for each
> partition because any of the partitions can have some parallel-unsafe
> index expression, constraint, etc. We do a similar thing for Selects
> in standard_planner.
>
> The plain Select case for partitioned tables was simpler because we
> anyway loop through all the partitions in set_append_rel_size() and we
> determine parallel-safety of each partition at that time but the same
> is not true for Inserts.
>
> For Inserts, currently, we only open the partition table when we are
> about to insert into that partition. During ExecInsert, we find out
> the partition matching the partition-key value and then lock if it is
> not already locked. In this patch, we need to open each partition at
> the planning time to determine its parallel-safety.

We don't want to open the partitions where no rows will be inserted.

>
> This will surely increase planning time but the execution is reduced
> to an extent due to parallelism that it won't matter for either of the
> cases if we see just total time. For example, see the latest results
> for parallel inserts posted by Haiying Tang [3]. There might be an
> impact when Selects can't be parallelized due to the small size of the
> Select-table but we still have to traverse all the partitions to
> determine parallel-safety but not sure how much it is compared to
> overall time. I guess we need to find the same but apart from that can
> anyone think of a better way to determine parallel-safety of
> partitioned relation for Inserts?

In case of SELECT we open only those partitions which survive pruning.
So those are the ones which will definitely need to be scanned. We
perform parallelism checks only on those partitions. The actual check
isn't very costly.

>
> Thoughts?
>
> Note: I have kept a few people in Cc who are either directly involved
> in this work or work regularly in the partitioning related work just
> in the hope that might help in moving the discussion forward.

Since you brought up comparison between SELECT and INSERT, "pruning"
partitions based on the values being INSERTed might help. It should be
doable in case of INSERT ... SELECT where we need to prune partitions
based on the clauses of SELECT. Doable with a little effort in the case
of VALUEs and COPY.

Second possibility is to open partitions only when the estimated
number of rows to be inserted goes beyond a certain value.

Third idea is to use something similar to parallel append where
individual partitions are scanned sequentially but multiple partitions
are scanned in parallel. When a row is inserted into a non-yet-opened
partition, allocate one/more backends to insert into partitions which
do not allow parallelism, otherwise continue to use a common pool of
parallel workers for insertion. This means the same thread performing
select may not perform insert. So some complications will be involved.

-- 
Best Wishes,
Ashutosh Bapat




Re: Wrong HINT during database recovery when occur a minimal wal.

2021-01-15 Thread Kyotaro Horiguchi
At Fri, 15 Jan 2021 17:04:19 +0800, "lchch1...@sina.cn"  
wrote in 
> 
> >Mmm. Maybe something's missing. If you took the base-backup using
> >pg_basebackup, that means max_wal_senders > 0 on the primary. If you
> >lowered wal_level in the backup (or replica) then started it, You
> >would get something like this.
> >| FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or 
> >"logical".
> >If you changed max_wal_senders to zero, you would get the following instead.
> >| FATAL:  hot standby is not possible because max_wal_senders = 0 is a lower 
> >setting than on the primary server (its value was 2)
> Then mark hot_standby off and continue try lowered wal_level.
> And do recovery from the basebackup, then you will see the FATAL.
> 
> >So I couldn't reproduce the situation.
> >Anyways.
>  
> >> My question is that what's the mean of  [set wal_level to "replica" on the 
> >> primary] in
> >> HINT describe, I can't think over a case to solve this FATAL by set 
> >> wal_level, I can
> >> solve it by turn off hot_standby only.
> >>
> >> Do you think we can do this code change?
> >> --- a/src/backend/access/transam/xlog.c
> >> +++ b/src/backend/access/transam/xlog.c
> >> @@ -6300,7 +6300,7 @@ CheckRequiredParameterValues(void)
> >>   if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> >>   ereport(ERROR,
> >>   (errmsg("hot standby is not possible because wal_level was not set to 
> >> \"replica\" or higher on the primary server"),
> >> -  errhint("Either set wal_level to \"replica\" on the primary, or turn 
> >> off hot_standby here.")));
> >> +  errhint("You should turn off hot_standby here.")));
>  
> >Since it's obvious that the change in a primary cannot be propagted by
> >taking a backup or starting replication, the first sentence reads to
> >me as "you should retake a base-backup from a primary where wal_level
> >is replica or higher". So *I* don't think it needs a fix.
> I think this HINT is want to guide users to finish this recovery, and the 
> first guide is
> invalid in my opinion.

I think it's also important to suggest to the users how they can turn
on hot_standby on their standby.  So, perhaps-a-bit-verbose hint would
be like this.

"Either start this standby from base backup taken after setting
wal_level to \"replica\" on the primary, or turn off hot_standby
here."

Does this make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Bharath Rupireddy
On Fri, Jan 15, 2021 at 4:54 PM japin  wrote:
> On Fri, 15 Jan 2021 at 18:57, Bharath Rupireddy 
>  wrote:
> > On Fri, Jan 15, 2021 at 4:03 PM Amit Kapila  wrote:
> >> That sounds like a better way to fix and in fact, I was about to
> >> suggest the same after reading your previous email. I'll think more on
> >> this but in the meantime, can you add the test case in the patch as
> >> requested earlier as well.
> >
> > @Li Japin Please let me know if you have already started to work on
> > the test case, otherwise I can make a 0002 patch for the test case and
> > post.
> >
>
> Yeah, I'm working on the test case.  Since I'm not familair with the logical
> replication test, it may take some times.

Thanks a lot. I quickly added the initial use case where we saw the
bug, attached is 0002 patch. Please have a look and add further use
cases if necessary. If okay, it's better to make a cf entry.

I have one comment in v3-0001 patch,
+ * There might some relations dropped from the publication, we do

I think we should change it to - "There might be some relations".

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v3-0002-Test-behaviour-of-ALTER-PUBLICATION-.-DROP-TABLE.patch
Description: Binary data


Re: Improve handling of parameter differences in physical replication

2021-01-15 Thread Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Hello
Looks good to me. I think the patch is ready for committer.

The new status of this patch is: Ready for Committer


Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread japin


On Fri, 15 Jan 2021 at 18:57, Bharath Rupireddy 
 wrote:
> On Fri, Jan 15, 2021 at 4:03 PM Amit Kapila  wrote:
>> That sounds like a better way to fix and in fact, I was about to
>> suggest the same after reading your previous email. I'll think more on
>> this but in the meantime, can you add the test case in the patch as
>> requested earlier as well.
>
> @Li Japin Please let me know if you have already started to work on
> the test case, otherwise I can make a 0002 patch for the test case and
> post.
>

Yeah, I'm working on the test case.  Since I'm not familiar with the logical
replication tests, it may take some time.

-- 
Regrads,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: cost_sort vs cost_agg

2021-01-15 Thread Ashutosh Bapat
On Thu, Jan 14, 2021 at 7:12 PM Andy Fan  wrote:
>
> Currently the cost_sort doesn't consider the number of columns to sort, which
> means the cost of SELECT * FROM t ORDER BY a;  equals with the SELECT *
> FROM t ORDER BY a, b; which is obviously wrong.  The impact of this is when we
> choose the plan for SELECT DISTINCT * FROM t ORDER BY c between:
>
>  Sort
>Sort Key: c
>->  HashAggregate
>  Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n
>
> and
>
>  Unique
>->  Sort
>  Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n
>
>
> Since "Sort (c)" has the same cost as  "Sort (c, a, b, d, e, f, g, h, i, j, k,
> l, m, n)", and Unique node on a sorted input is usually cheaper than
> HashAggregate, so the later one will win usually which might bad at many
> places.

I can imagine that HashAggregate + Sort will perform better if there
are very few distinct rows but otherwise, Unique on top of Sort would
be a better strategy since it doesn't need two operations.

>
> Optimizer chose HashAggregate with my patch, but it takes 6s. after set
> enable_hashagg = off, it takes 2s.

This example actually shows that using Unique is better than
HashAggregate + Sort. May be you want to try with some data which has
very few distinct rows.
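
For what it's worth, a minimal sketch of such an experiment (table and
column names are made up, and the row counts are only illustrative):

CREATE TABLE t (c int, a int, b int, d int);
-- many distinct groups: Unique on top of Sort usually wins
INSERT INTO t SELECT i, i, i, i FROM generate_series(1, 1000000) AS i;
-- few distinct groups: try instead
--   INSERT INTO t SELECT i % 10, i, i, i FROM generate_series(1, 1000000) AS i;
ANALYZE t;

EXPLAIN (ANALYZE, COSTS OFF) SELECT DISTINCT * FROM t ORDER BY c;
SET enable_hashagg = off;   -- force the other strategy for comparison
EXPLAIN (ANALYZE, COSTS OFF) SELECT DISTINCT * FROM t ORDER BY c;
RESET enable_hashagg;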

-- 
Best Wishes,
Ashutosh Bapat




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Bharath Rupireddy
On Fri, Jan 15, 2021 at 4:03 PM Amit Kapila  wrote:
> That sounds like a better way to fix and in fact, I was about to
> suggest the same after reading your previous email. I'll think more on
> this but in the meantime, can you add the test case in the patch as
> requested earlier as well.

@Li Japin Please let me know if you have already started to work on
the test case, otherwise I can make a 0002 patch for the test case and
post.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Bharath Rupireddy
On Fri, Jan 15, 2021 at 4:20 PM Amit Kapila  wrote:
> I feel it is better to not do anything for this because neither we
> have a test to reproduce the problem nor is it clear from theory if
> there is any problem to solve here.

+1 to ignore 0002 patch. Thanks Amit.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 11:48 AM Bharath Rupireddy
 wrote:
>
> On Thu, Jan 14, 2021 at 10:53 AM Amit Kapila  wrote:
> > > 0002 - Invalidates the relation map cache in subscriber syscache
> > > invalidation callbacks. Currently, I'm setting entry->state to
> > > SUBREL_STATE_UNKNOWN in the new invalidation function that's
> > > introduced logicalrep_relmap_invalidate, so that in the next call to
> > > logicalrep_rel_open the entry's state will be read from the system
> > > catalogues using GetSubscriptionRelState. If this is sound's bit
> > > strange, I can add a boolean invalid to LogicalRepRelMapEntry
> > > structure and set that here and in logicalrep_rel_open, I can have
> > > something like:
> > > if (entry->state != SUBREL_STATE_READY || entry->invalid)
> > > entry->state = GetSubscriptionRelState(MySubscription->oid,
> > >entry->localreloid,
> > >>statelsn);
> > >
> > >if (entry->invalid)
> > > entry->invalid = false;
> > >
> >
> > So, the point of the second patch is that it good to have such a
> > thing, otherwise, we don't see any problem if we just use patch-0001,
> > right? I think if we fix the first-one, automatically, we will achieve
> > what we are trying to with the second-one because ultimately
> > logicalrep_relmap_update will be called due to Alter Pub... Drop table
> > .. step and it will mark the entry as SUBREL_STATE_UNKNOWN.
>
> On some further analysis, I found that logicalrep_relmap_update is
> called in subscribers for inserts, delets, updates and truncate
> statements on the dropped(from publication) relations in the
> publisher.
>
> If any alters to pg_subscription, then we invalidate the subscription
> in subscription_change_cb, maybe_reread_subscription subscription will
> take care of re-reading from the system catalogues, see
> apply_handle_->ensure_transaction -> maybe_reread_subscription.
> And we don't have any entry in LogicalRepRelMapEntry that gets
> affected because of changes to pg_subscription, so we don't need to
> invalidate the rel map cache entries in subscription_change_cb.
>
> If any alters to pg_subscription_rel, then there are two parameters in
> LogicalRepRelMapEntry structure, state and statelsn that may get
> affected. Changes to statelsn column is taken care of with the
> invalidation callback invalidate_syncing_table_states setting
> table_states_valid to true and
> process_syncing_tables->process_syncing_tables_for_apply.
>

I think you are saying the reverse here. The function
invalidate_syncing_table_states sets table_states_valid to false and
the other one sets it to true.

> But, if
> state column is changed somehow (although I'm not quite sure how it
> can change to different states 'i', 'd', 's', 'r' after the initial
> table sync phase in logical replication,
>

These states are for the initial copy phase; after that they won't change.
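
For reference, the per-relation states can be inspected on the subscriber
with something like this ('i' = initialize, 'd' = data copy,
's' = synchronized, 'r' = ready):

SELECT s.subname, sr.srrelid::regclass AS relation, sr.srsubstate
FROM pg_subscription_rel sr
JOIN pg_subscription s ON s.oid = sr.srsubid;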

> but we can pretty much do
> something like update pg_subscription_rel set srsubstate = 'd';), in
> this case invalidate_syncing_table_states gets called, but if we don't
> revalidation of relation map cache entries, they will have the old
> state value.
>

Hmm, modifying state values as you are suggesting has dire
consequences: in this case it can cause the tablesync worker to restart
and try to perform the initial copy again, or it can lead to missing
data in the subscriber. We don't recommend changing system catalogs
manually, precisely because of such problems.

> So what I feel is at least we need the 0002 patch but with only
> invalidating the entries in invalidate_syncing_table_states not in
> subscription_change_cb, although I'm not able to back it up with any
> use case or bug as such.
>

I feel it is better to not do anything for this because neither we
have a test to reproduce the problem nor is it clear from theory if
there is any problem to solve here.

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread Amit Kapila
On Fri, Jan 15, 2021 at 2:46 PM japin  wrote:
>
> >
> > I'm sorry, the 0001 patch is totally wrong.  If we have only one 
> > publication, it works,
> > however, this is a special case.  If we have multiple publication in one 
> > subscription,
> > it doesn't work.  Here is a usecase.
> >
> > Step (1)
> > -- On publisher
> > CREATE TABLE t1 (a int);
> > CREATE TABLE t2 (a int);
> > CREATE PUBLICATION mypub1 FOR TABLE t1 WITH (publish='insert');
> > CREATE PUBLICATION mypub2 FOR TABLE t2;
> >
> > Step (2)
> > -- On subscriber
> > CREATE TABLE t1 (a int);
> > CREATE TABLE t2 (a int);
> > CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 
> > dbname=postgres' PUBLICATION mypub1, mypub2;
> >
> > Step (3)
> > -- On publisher
> > INSERT INTO t1 VALUES (1);
> >
> > Step (4)
> > -- On subscriber
> > SELECT * FROM t1;
> >
> > In step (4), we cannot get the records, because there has two publications 
> > in
> > data->publications, so we will check one by one.
> > The first publication is "mypub1" and entry->pubactions.pubinsert will be 
> > set
> > true, however, in the second round, the publication is "mypub2", and we 
> > cannot
> > find pub->oid in pubids (only oid of "mypub1"), so it will set all 
> > entry->pubactions
> > to false, and nothing will be replicate to subscriber.
> >
> > My apologies!
>
> As I mentioned in previous thread, if there has multiple publications in
> single subscription, it might lead data loss.
>
> When the entry got invalidated in rel_sync_cache_publication_cb(), I think we
> should mark the pubactions to false, and let get_rel_sync_entry() recalculate
> the pubactions.
>
> 0001 - modify as above described.
> 0002 - do not change.
>
> Any thought?
>

That sounds like a better way to fix and in fact, I was about to
suggest the same after reading your previous email. I'll think more on
this but in the meantime, can you add the test case in the patch as
requested earlier as well.

-- 
With Regards,
Amit Kapila.




Determine parallel-safety of partition relations for Inserts

2021-01-15 Thread Amit Kapila
While reviewing the parallel insert [1] (Insert into ... Select) and
parallel copy patches [2], it came to my notice that both patches
traverse the entire partition hierarchy to determine the parallel-safety
of partitioned relations. This is required because, before the Insert
or Copy can be considered for parallelism, we need to
determine whether it is safe to do so. We need to check for each
partition because any of the partitions can have some parallel-unsafe
index expression, constraint, etc. We do a similar thing for Selects
in standard_planner.

The plain Select case for partitioned tables was simpler because we
anyway loop through all the partitions in set_append_rel_size() and we
determine parallel-safety of each partition at that time but the same
is not true for Inserts.

For Inserts, currently, we only open the partition table when we are
about to insert into that partition. During ExecInsert, we find the
partition matching the partition-key value and then lock it if it is
not already locked. In this patch, we need to open each partition at
planning time to determine its parallel-safety.
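
As a concrete (made-up) illustration of why every partition has to be
examined, consider a hierarchy where only one leaf carries a
parallel-unsafe expression; a check on the parent alone would miss it:

CREATE FUNCTION audit_check(int) RETURNS boolean
    LANGUAGE sql IMMUTABLE PARALLEL UNSAFE
    AS 'SELECT $1 >= 0';

CREATE TABLE orders (id int, region text) PARTITION BY LIST (region);
CREATE TABLE orders_asia PARTITION OF orders FOR VALUES IN ('asia');
CREATE TABLE orders_emea PARTITION OF orders
    (CHECK (audit_check(id)))           -- parallel-unsafe, only here
    FOR VALUES IN ('emea');

-- INSERT INTO orders SELECT ... must not use a parallel plan, even
-- though "orders" and "orders_asia" look perfectly safe.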

This will surely increase planning time, but execution time is reduced
enough by parallelism that it won't matter for either case if we look
at just the total time. For example, see the latest results
for parallel inserts posted by Haiying Tang [3]. There might be an
impact when Selects can't be parallelized due to the small size of the
Select-table but we still have to traverse all the partitions to
determine parallel-safety but not sure how much it is compared to
overall time. I guess we need to find the same but apart from that can
anyone think of a better way to determine parallel-safety of
partitioned relation for Inserts?

Thoughts?

Note: I have kept a few people in Cc who are either directly involved
in this work or work regularly in the partitioning related work just
in the hope that might help in moving the discussion forward.

[1] - 
https://www.postgresql.org/message-id/CAJcOf-dF9ohqub_D805k57Y_AuDLeAQfvtaax9SpwjTSEVdiXg%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CALDaNm32c7kWpZm9UkS5P%2BShsfRhyMTWKvFHyn9O53WZWvO2iA%40mail.gmail.com
[3] - 
https://www.postgresql.org/message-id/b54f2e306780449093c38cd8a04e%40G08CNEXMBPEKD05.g08.fujitsu.local

-- 
With Regards,
Amit Kapila.




Re: Added schema level support for publication.

2021-01-15 Thread vignesh C
Thanks Rahila for your comments, please find my thoughts below.

On Tue, Jan 12, 2021 at 5:16 PM Rahila Syed  wrote:
>
> Hi Vignesh,
>
> I had a look at the patch, please consider following comments.
>
> On Thu, Jan 7, 2021 at 10:03 PM vignesh C  wrote:
>>
>> Hi,
>>
>> This feature adds schema option while creating publication. Users will
>> be able to specify one or more schemas while creating publication,
>> when the user specifies schema option, then the data changes for the
>> tables present in the schema specified by the user will be replicated
>> to the subscriber. Few examples have been listed below:
>>
>> Create a publication that publishes all changes for all the tables
>> present in production schema:
>> CREATE PUBLICATION production_publication FOR ALL TABLES SCHEMA production;
>>
> Should it be FOR TABLES IN SCHEMA instead of FOR ALL TABLES SCHEMA?
>

For adding tables into publication we have syntax like:
CREATE PUBLICATION mypub FOR TABLE tbl1, tbl2;
For all tables we have syntax like:
CREATE PUBLICATION mypub FOR ALL TABLES;

Initial syntax that I proposed was:
CREATE PUBLICATION production_publication *FOR ALL TABLES SCHEMA*
production;

I feel the below syntax is better, as it is consistent with others:
CREATE PUBLICATION mypub *FOR SCHEMA* sch1, sch2;

>>
>> Create a publication that publishes all changes for all the tables
>> present in marketing and sales schemas:
>> CREATE PUBLICATION sales_publication FOR ALL TABLES SCHEMA marketing, sales;
>>
>> Add some schemas to the publication:
>> ALTER PUBLICATION sales_publication ADD SCHEMA marketing_june, sales_june;
>>
> As per current implementation this command fails even if one of the
> schemas does not exist. I think this is counterintuitive, it should
> throw a warning and continue adding the rest.
>

We have similar behavior in the case of adding a non-existent table while
creating a publication:
CREATE PUBLICATION mypub3 FOR TABLE non_existent_table;
ERROR:  relation "non_existent_table" does not exist
I feel we can keep the behavior similar to maintain consistency.

>>
>> Drop some schema from the publication:
>> ALTER PUBLICATION production_quarterly_publication DROP SCHEMA
production_july;
>>
> Same for drop schema, if one of these schemas does not exist in
> publication, the entire DROP operation is aborted.

We have similar behavior in the case of dropping a non-existent table while
altering a publication:
alter publication mypub5 drop table test1,testx;
ERROR:  relation "testx" does not exist
I feel we can keep the behavior similar to maintain consistency.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-15 Thread Amit Kapila
On Thu, Jan 14, 2021 at 2:37 PM Tang, Haiying
 wrote:
>
> Hi Greg, Amit
> Cc:hackers
>
> > > > 4. Have you checked the overhead of this on the planner for
> > > > different kinds of statements like inserts into tables having 100
> > > > or 500 partitions? Similarly, it is good to check the overhead of
> > > > domain related checks added in the patch.
> > > >
> > >
> > > Checking that now and will post results soon.
> > >
> >I am seeing a fair bit of overhead in the planning for the INSERT
> >parallel-safety checks (mind you, compared to the overall performance
> >gain, it's not too bad).
>
> Considering the 'real-world' use cases and extreme cases I can imagine, I 
> took 3 kinds of measurements on partition table for the latest patch(V11).
> The measurement is mainly focus on small rows because this could be easier to 
> evaluate check overhead among the parallelism optimization.
> From current results, the overhead looks acceptable compared to the benefits 
> as Greg said.
>

Can we test cases where we have few rows in the Select table (say 1000)
and there are 500 or 1000 partitions? In that case, we won't choose
parallelism, but we still have to pay the price of checking the
parallel-safety of all partitions. Can you check this with tables of
100, 200, 500, and 1000 partitions?
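
Something along these lines (table names and bounds are arbitrary; the
partition count is the variable of interest, and "Planning Time" in the
EXPLAIN output is where the overhead should show up):

CREATE TABLE src (a int);
INSERT INTO src SELECT generate_series(1, 1000);

CREATE TABLE tgt (a int) PARTITION BY RANGE (a);
DO $$
BEGIN
    FOR i IN 0..499 LOOP        -- repeat with 100, 200, 500, 1000
        EXECUTE format(
            'CREATE TABLE tgt_p%s PARTITION OF tgt FOR VALUES FROM (%s) TO (%s)',
            i, i * 1000, (i + 1) * 1000);
    END LOOP;
END;
$$;

EXPLAIN (ANALYZE) INSERT INTO tgt SELECT * FROM src;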

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-15 Thread japin

On Fri, 15 Jan 2021 at 15:49, japin  wrote:
> On Fri, 15 Jan 2021 at 14:50, Bharath Rupireddy 
>  wrote:
>> On Fri, Jan 15, 2021 at 11:33 AM Hou, Zhijie  
>> wrote:
>>>
>>> > On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
>>> > > Do we really need to access PUBLICATIONRELMAP in this patch? What if
>>> > > we just set it to false in the else condition of (if (publish &&
>>> > > (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
>>> > >
>>> > > Thank for you review. I agree with you, it doesn’t need to access
>>> > > PUBLICATIONRELMAP, since We already get the publication oid in
>>> > > GetRelationPublications(relid) [1], which also access
>>> > > PUBLICATIONRELMAP.  If the current pub->oid does not in pubids, the
>>> > publish is false, so we do not need publish the table.
>>> >
>>> > +1. This is enough.
>>> >
>>> > > I have another question, the data->publications is a list, when did it
>>> > has more then one items?
>>> >
>>> > IIUC, when the single table is associated with multiple publications, then
>>> > data->publications will have multiple entries. Though I have not tried,
>>> > we can try having two or three publications for the same table and verify
>>> > that.
>>> >
>>> > > 0001 - change as you suggest.
>>> > > 0002 - does not change.
>>>
>>>
>>> Hi
>>>
>>> In 0001 patch
>>>
>>> The comments says " Relation is not associated with the publication anymore 
>>> "
>>> That comment was accurate when check is_relation_part_of_publication() in 
>>> the last patch.
>>>
>>> But when put the code in the else branch, not only the case in the 
>>> comments(Relation is dropped),
>>> Some other case such as "partitioned tables" will hit else branch too.
>>>
>>> Do you think we need fix the comment a little to make it accurate?
>>
>> Thanks, that comment needs a rework, in fact the else condition
>> also(because we may fall into else branch even when publish is true
>> and (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot) is
>> false, but our intention(to fix the bug reported here) is to have the
>> else condition only when publish is false. And also instead of just
>> relying on the publish variable which can get set even when the
>> relation is not in the publication but ancestor_published is true, we
>> can just have something like below to fix the bug reported here:
>>
>> if (publish &&
>> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot))
>> {
>> entry->pubactions.pubinsert |= pub->pubactions.pubinsert;
>> entry->pubactions.pubupdate |= pub->pubactions.pubupdate;
>> entry->pubactions.pubdelete |= pub->pubactions.pubdelete;
>> entry->pubactions.pubtruncate |= pub->pubactions.pubtruncate;
>> }
>> else if (!list_member_oid(pubids, pub->oid))
>> {
>> /*
>>  * Relation is not associated with the publication anymore 
>> i.e
>>  * it would have been dropped from the publication. So we 
>> don't
>>  * need to publish the changes for it.
>>  */
>> entry->pubactions.pubinsert = false;
>> entry->pubactions.pubupdate = false;
>> entry->pubactions.pubdelete = false;
>> entry->pubactions.pubtruncate = false;
>> }
>>
>> So this way, the fix only focuses on what we have reported here and it
>> doesn't cause any regressions may be, and the existing comment becomes
>> appropriate.
>>
>> Thoughts?
>>
>
> I'm sorry, the 0001 patch is totally wrong.  If we have only one publication, 
> it works,
> however, this is a special case.  If we have multiple publication in one 
> subscription,
> it doesn't work.  Here is a usecase.
>
> Step (1)
> -- On publisher
> CREATE TABLE t1 (a int);
> CREATE TABLE t2 (a int);
> CREATE PUBLICATION mypub1 FOR TABLE t1 WITH (publish='insert');
> CREATE PUBLICATION mypub2 FOR TABLE t2;
>
> Step (2)
> -- On subscriber
> CREATE TABLE t1 (a int);
> CREATE TABLE t2 (a int);
> CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 
> dbname=postgres' PUBLICATION mypub1, mypub2;
>
> Step (3)
> -- On publisher
> INSERT INTO t1 VALUES (1);
>
> Step (4)
> -- On subscriber
> SELECT * FROM t1;
>
> In step (4), we cannot get the records, because there has two publications in
> data->publications, so we will check one by one.
> The first publication is "mypub1" and entry->pubactions.pubinsert will be set
> true, however, in the second round, the publication is "mypub2", and we cannot
> find pub->oid in pubids (only oid of "mypub1"), so it will set all 
> entry->pubactions
> to false, and nothing will be replicate to subscriber.
>
> My apologies!

As I mentioned in previous thread, if there has multiple publications in
single subscription, it might lead data loss.

When the entry got invalidated in rel_sync_cache_publication_cb(), I think we
should mark the pubactions to false, and let get_rel_sync_entry() recalculate
the pubactions.

RE: ResourceOwner refactoring

2021-01-15 Thread kuroda.hay...@fujitsu.com
Dear Heikki,

> Hmm. ResOwnerReleaseTupleDesc() does exist, those functions are needed 
> for the callbacks. I think you meant the wrappers around 
> ResourceOwnerRemember and ResourceOwnerForget, like 
> ResourceOwnerRememberCatCacheRef(). I admit those are not fully 
> consistent: I didn't bother with the wrapper functions when there is 
> only one caller.

Yeah, I meant it. And I prefer your policy.

> Hmm. ResOwnerReleaseTupleDesc() does exist, those functions are needed 
> for the callbacks. I think you meant the wrappers around 
> ResourceOwnerRemember and ResourceOwnerForget, like 
> ResourceOwnerRememberCatCacheRef(). I admit those are not fully 
> consistent: I didn't bother with the wrapper functions when there is 
> only one caller.

Good job. I checked your fixes, and they look fine.
I will check the other ResourceOwnerEnlarge() calls if I have time.

> I've been working on performance testing too.

I'm looking forward to seeing it.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: fdatasync(2) on macOS

2021-01-15 Thread Thomas Munro
On Fri, Jan 15, 2021 at 7:53 PM Thomas Munro  wrote:
> That was fun, but now I'm asking myself: do we really want to use an
> IO synchronisation facility that's not declared by the vendor?

I should add, the default wal_sync_method is open_datasync, not
fdatasync.  I'm pretty suspicious of that too: neither O_SYNC nor
O_DSYNC appears as a documented flag for open(2) and the numbers look
suspicious.  Perhaps they only define them to support aio_fsync(2).




Re: Wrong HINT during database recovery when occur a minimal wal.

2021-01-15 Thread lchch1...@sina.cn

>Mmm. Maybe something's missing. If you took the base-backup using
>pg_basebackup, that means max_wal_senders > 0 on the primary. If you
>lowered wal_level in the backup (or replica) then started it, You
>would get something like this.
>| FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or 
>"logical".
>If you changed max_wal_senders to zero, you would get the following instead.
>| FATAL:  hot standby is not possible because max_wal_senders = 0 is a lower 
>setting than on the primary server (its value was 2)
Then mark hot_standby off and continue with the lowered wal_level.
And do recovery from the basebackup; then you will see the FATAL.
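
Roughly the sequence I mean (paths and the backup/restart steps happen
outside SQL and are only sketched in comments; the exact reproduction may
depend on the setup):

-- on the primary, after taking the base backup (e.g. with pg_basebackup)
ALTER SYSTEM SET wal_level = minimal;
ALTER SYSTEM SET max_wal_senders = 0;   -- required once wal_level is minimal
-- restart the primary and generate some WAL, then start recovery from the
-- base backup with hot_standby = on; that is where the FATAL above appears.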

>So I couldn't reproduce the situation.
>Anyways.
 
>> My question is that what's the mean of  [set wal_level to "replica" on the 
>> primary] in
>> HINT describe, I can't think over a case to solve this FATAL by set 
>> wal_level, I can
>> solve it by turn off hot_standby only.
>>
>> Do you think we can do this code change?
>> --- a/src/backend/access/transam/xlog.c
>> +++ b/src/backend/access/transam/xlog.c
>> @@ -6300,7 +6300,7 @@ CheckRequiredParameterValues(void)
>>   if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
>>   ereport(ERROR,
>>   (errmsg("hot standby is not possible because wal_level was not set to 
>> \"replica\" or higher on the primary server"),
>> -  errhint("Either set wal_level to \"replica\" on the primary, or turn off 
>> hot_standby here.")));
>> +  errhint("You should turn off hot_standby here.")));
 
>Since it's obvious that the change in a primary cannot be propagted by
>taking a backup or starting replication, the first sentence reads to
>me as "you should retake a base-backup from a primary where wal_level
>is replica or higher". So *I* don't think it needs a fix.
I think this HINT is meant to guide users to finish this recovery, and the
first suggestion is invalid in my opinion.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



Re: Occasional tablespace.sql failures in check-world -jnn

2021-01-15 Thread Peter Eisentraut

On 2020-12-09 02:29, Andres Freund wrote:

> I suspect this is related to the pg_upgrade test and the main regression
> test running at the same time. We have the following in
> src/test/regress/GNUMakefile
>
> # Tablespace setup
>
> .PHONY: tablespace-setup
> tablespace-setup:
>         echo $(realpath ./testtablespace) >> /tmp/tablespace.log
>         rm -rf ./testtablespace
>         mkdir ./testtablespace
> ...
>
> which pg_upgrade triggers. Even though it, as far as I can tell, never
> actually ends up putting any data in it:
>
> # Send installcheck outputs to a private directory.  This avoids conflict when
> # check-world runs pg_upgrade check concurrently with src/test/regress check.
> # To retrieve interesting files after a run, use pattern tmp_check/*/*.diffs.
> outputdir="$temp_root/regress"
> EXTRA_REGRESS_OPTS="$EXTRA_REGRESS_OPTS --outputdir=$outputdir"
> export EXTRA_REGRESS_OPTS
> mkdir "$outputdir"
> mkdir "$outputdir"/testtablespace
>
> It's not clear to me why we have this logic in the makefile at all?
> Somebody taught pg_regress to do so, but only on windows... See
> convert_sourcefiles_in().


I vaguely recall that this had something to do with SELinux (or 
something similar?), where it matters in what context you create a file 
or directory and then certain properties attach to it that are relevant 
to subsequent programs that run on it.  Again, vague.





Re: Printing backtrace of postgres processes

2021-01-15 Thread Peter Eisentraut

On 2020-12-08 10:38, vignesh C wrote:

> I have implemented printing of backtrace based on handling it in
> CHECK_FOR_INTERRUPTS. This patch also includes the change to allow
> getting backtrace of any particular process based on the suggestions.
> Attached patch has the implementation for the same.
> Thoughts?


Are we willing to use up a signal for this?




Re: list of extended statistics on psql

2021-01-15 Thread Julien Rouhaud
On Wed, Jan 13, 2021 at 10:22:05AM +0900, Tatsuro Yamada wrote:
> Hi Tomas,
> 
> On 2021/01/13 7:48, Tatsuro Yamada wrote:
> > On 2021/01/12 20:08, Tomas Vondra wrote:
> > > On 1/12/21 2:57 AM, Tatsuro Yamada wrote:
> > > > On 2021/01/09 9:01, Tomas Vondra wrote:
> > > ...>
> > > > > While working on that, I realized that 'defined' might be a bit
> > > > > ambiguous, I initially thought it means 'NOT NULL' (which it does 
> > > > > not).
> > > > > I propose to change it to 'requested' instead. Tatsuro, do you agree, 
> > > > > or
> > > > > do you think 'defined' is better?
> > > > 
> > > > Regarding the status of extended stats, I think the followings:
> > > > 
> > > >   - "defined": it shows the extended stats defined only. We can't know
> > > >    whether it needs to analyze or not. I agree this name was
> > > >     ambiguous. Therefore we should replace it with a more 
> > > > suitable
> > > >    name.
> > > >   - "requested": it shows the extended stats needs something. Of course,
> > > >    we know it needs to ANALYZE because we can create the 
> > > > patch.
> > > >    However, I feel there is a little ambiguity for DBA.
> > > >    To solve this, it would be better to write an 
> > > > explanation of
> > > >    the status in the document. For example,
> > > > 
> > > > ==
> > > > The column of the kind of extended stats (e. g. Ndistinct) shows some 
> > > > statuses.
> > > > "requested" means that it needs to gather data by ANALYZE. "built" 
> > > > means ANALYZE
> > > >   was finished, and the planner can use it. NULL means that it doesn't 
> > > > exists.
> > > > ==
> > > > 
> > > > What do you think? :-D
> > > > 
> > > 
> > > Yes, that seems reasonable to me. Will you provide an updated patch?
> > 
> > 
> > Sounds good. I'll send the updated patch today.
> 
> 
> 
> I updated the patch to add the explanation of the extended stats' statuses.
> Please feel free to modify the patch to improve it more clearly.
> 
> The attached files are:
>0001: Add psql \dX and the fixed document
>0002: Regression test for psql \dX
>app-psql.html: Created by "make html" command (You can check the
>   explanation of the statuses easily, probably)

Hello Yamada-san,

I reviewed the patch and don't have specific complaints, it all looks good!

I'm however thinking about the "requested" status.  I'm wondering if it could
lead people to think that an ANALYZE is scheduled and will happen soon.
Maybe "defined" or "declared" might be less misleading, or even "waiting for
analyze"?




Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-15 Thread Amit Kapila
On Fri, Dec 11, 2020 at 4:30 PM Greg Nancarrow  wrote:
>
> Posting an updated set of patches to address recent feedback:
>

Here is an additional review of
v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are
quite a few comments raised on the V11-0001* patch. I suggest first
post a revised version of V11-0001* patch addressing those comments
and then you can separately post a revised version of
v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.

Few comments:
==
1.
+char
+max_parallel_hazard_for_modify(Query *parse, const char
*initial_max_parallel_hazard)
{
..
+ return (rel_max_parallel_hazard_for_modify(rte->relid,
parse->commandType, , NoLock));
..
}

rel_max_parallel_hazard_for_modify()
{
..
+ rel = table_open(relid, lockmode);
..
+ if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
..
+ /*
+ * Column default expressions for columns in the target-list are
+ * already being checked for parallel-safety in the
+ * max_parallel_hazard() scan of the query tree in standard_planner().
+ */
+
+ tupdesc = RelationGetDescr(rel);
}

Here, it seems we are accessing the relation descriptor without any
lock on the table which is dangerous considering the table can be
modified in a parallel session. Is there a reason why you think this
is safe? Did you check anywhere else such a coding pattern?

2.
+ /*
+ * If there are any index expressions, check that they are parallel-mode
+ * safe.
+ */
+ max_hazard = index_expr_max_parallel_hazard_for_modify(rel, context);
+ if (max_parallel_hazard_test(max_hazard, context))
+ {
+ table_close(rel, lockmode);
+ return context->max_hazard;
+ }

Here and at all other similar places, the call to
max_parallel_hazard_test seems redundant because
index_expr_max_parallel_hazard_for_modify would have already done
that. Why can't we just return true/false from
index_expr_max_parallel_hazard_for_modify? The context would have been
already updated for max_hazard.

3.
@@ -342,6 +343,18 @@ standard_planner(Query *parse, const char
*query_string, int cursorOptions,
  /* all the cheap tests pass, so scan the query tree */
  glob->maxParallelHazard = max_parallel_hazard(parse);
  glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
+
+ /*
+ * Additional parallel-mode safety checks are required in order to
+ * allow an underlying parallel query to be used for a
+ * table-modification command that is supported in parallel-mode.
+ */
+ if (glob->parallelModeOK &&
+ IsModifySupportedInParallelMode(parse->commandType))
+ {
+ glob->maxParallelHazard = max_parallel_hazard_for_modify(parse,
>maxParallelHazard);
+ glob->parallelModeOK = (glob->maxParallelHazard != PROPARALLEL_UNSAFE);
+ }
  }

I don't like this way of checking parallel_hazard for modify. This not
only duplicates some code in max_parallel_hazard_for_modify from
max_parallel_hazard but also appears quite awkward. Can we move
max_parallel_hazard_for_modify inside max_parallel_hazard? Basically,
after calling max_parallel_hazard_walker, we can check for modify
statement and call the new function.

4.
domain_max_parallel_hazard_for_modify()
{
..
+ if (isnull)
+ {
+ /*
+ * This shouldn't ever happen, but if it does, log a WARNING
+ * and return UNSAFE, rather than erroring out.
+ */
+ elog(WARNING, "null conbin for constraint %u", con->oid);
+ context->max_hazard = PROPARALLEL_UNSAFE;
+ break;
+ }
..
}
index_expr_max_parallel_hazard_for_modify()
{
..
+ if (index_expr_item == NULL) /* shouldn't happen */
+ {
+ index_close(index_rel, lockmode);
+ context->max_hazard = PROPARALLEL_UNSAFE;
+ return context->max_hazard;
+ }
..
}

It is not clear why the above two are "shouldn't happen" cases and, if so,
why we want to treat them as unsafe. Ideally, there should be an
Assert if these can't happen, but it is difficult to decide without
knowing why you have considered them unsafe.

-- 
With Regards,
Amit Kapila.




Re: Wrong HINT during database recovery when occur a minimal wal.

2021-01-15 Thread Kyotaro Horiguchi
At Fri, 15 Jan 2021 15:32:58 +0800, "lchch1...@sina.cn"  
wrote in 
> 
> Sorry, I don't known why it showed in wrong format, and try to correct it.
> -
> 
> When I do PITR in a strange step, I get this FATAL:
> 
> 2021-01-15 15:02:52.364 CST [14958] FATAL:  hot standby is not possible 
> because wal_level was not set to "replica" or higher on the primary server
> 2021-01-15 15:02:52.364 CST [14958] HINT:  Either set wal_level to "replica" 
> on the primary, or turn off hot_standby here.
> 
> The strange step is that I change wal_level to minimal after basebackup.

Mmm. Maybe something's missing. If you took the base-backup using
pg_basebackup, that means max_wal_senders > 0 on the primary. If you
lowered wal_level in the backup (or replica) then started it, You
would get something like this.

| FATAL: WAL streaming (max_wal_senders > 0) requires wal_level "replica" or 
"logical".

If you changed max_wal_senders to zero, you would get the following instead.

| FATAL:  hot standby is not possible because max_wal_senders = 0 is a lower 
setting than on the primary server (its value was 2)

So I couldn't reproduce the situation.

Anyways..

> My question is that what's the mean of  [set wal_level to "replica" on the 
> primary] in
> HINT describe, I can't think over a case to solve this FATAL by set 
> wal_level, I can
> solve it by turn off hot_standby only.
> 
> Do you think we can do this code change?
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6300,7 +6300,7 @@ CheckRequiredParameterValues(void)
>   if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
>   ereport(ERROR,
>   (errmsg("hot standby is not possible because wal_level was not set to 
> \"replica\" or higher on the primary server"),
> -  errhint("Either set wal_level to \"replica\" on the primary, or turn off 
> hot_standby here.")));
> +  errhint("You should turn off hot_standby here.")));

Since it's obvious that a change on the primary cannot be propagated by
taking a backup or starting replication, the first sentence reads to
me as "you should retake a base-backup from a primary where wal_level
is replica or higher". So *I* don't think it needs a fix.

Any thoughts?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: libpq debug log

2021-01-15 Thread iwata....@fujitsu.com
Hi, 

Thank you for your review. I modified the code in response to those reviews.

This patch includes these items:
- Fix the code according to reviews
  - Fix COPY output issue
  - Change to not support Protocol v2.0
 It is rarely used anymore, and dropping support for it makes the code
simpler. Please see the discussion in this thread for more details.

Regards,
Aya Iwata


v11-0001-libpq-trace.patch
Description: v11-0001-libpq-trace.patch


Re: pg_upgrade test for binary compatibility of core data types

2021-01-15 Thread Peter Eisentraut

On 2021-01-12 22:44, Andrew Dunstan wrote:

> Cross version pg_upgrade is tested regularly in the buildfarm, but not
> using test.sh. Instead it uses the saved data repository from a previous
> run of the buildfarm client for the source branch, and tries to upgrade
> that to the target branch.


Does it maintain a set of fixups similar to what is in test.sh?  Are 
those two sets the same?