Re: Movement of restart_lsn position movement of logical replication slots is very slow

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 7:06 PM Jammie  wrote:
>
> Thanks Amit for the response.
> Two things:
> 1) In our observation via psql, the advance command likewise does not move
> the restart_lsn immediately. It is similar to our approach that uses the
> confirmed_flush_lsn via the stream.
> 2) I understand the point that we are not reading from the stream, so we
> might be facing the issue. But the question is why we are able to move the
> restart_lsn most of the time by updating the confirmed_flush_lsn via
> pgJDBC, but only occasionally it lags too far behind.
>

I am not sure why you are seeing such behavior. Is it possible for you
to debug the code? Both confirmed_flush_lsn and restart_lsn are
advanced in LogicalConfirmReceivedLocation. You can add elog to print
the values to see the progress. The point to note here is that while we
update confirmed_flush_lsn every time with the new value, restart_lsn is
updated only when candidate_restart_valid has a valid value at the time
LogicalConfirmReceivedLocation is called. We update
candidate_restart_valid in LogicalIncreaseRestartDecodingForSlot, which
is called only while decoding an XLOG_RUNNING_XACTS record. So it is not
clear to me how restart_lsn is getting advanced in your case without
decoding. I think
if you add some elogs in the code to track the values of
candidate_restart_valid, confirmed_flush_lsn, and restart_lsn, you
might get some clue.
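
In the meantime, the externally visible part of this can also be watched
from SQL while you debug (candidate_restart_valid is internal-only, so it
still needs the elogs):

```
SELECT slot_name, restart_lsn, confirmed_flush_lsn
  FROM pg_replication_slots
 WHERE slot_type = 'logical';
```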

-- 
With Regards,
Amit Kapila.




Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-23 Thread Bharath Rupireddy
On Thu, Dec 24, 2020 at 7:21 AM Fujii Masao  wrote:
> On 2020/12/23 23:40, Bharath Rupireddy wrote:
> > On Wed, Dec 23, 2020 at 7:31 PM Fujii Masao  
> > wrote:
> >> I agree to make pgfdw_xact_callback() close the connection when
> >> entry->invalidated == true. But I think that it's better to get rid of
> >> have_invalid_connections flag and make pgfdw_inval_callback() close
> >> the connection immediately if entry->xact_depth == 0, to avoid unnecessary
> >> scan of the hashtable during COMMIT of transaction not accessing to
> >> foreign servers. Attached is the POC patch that I'm thinking. Thought?
> >
> > We could do that way as well. It seems okay to me. Now the disconnect
> > code is spread in pgfdw_inval_callback() and pgfdw_xact_callback().
> > There's also less burden on pgfdw_xact_callback() to close a lot of
> > connections on a single commit. The behaviour is like this - If
> > entry->xact_depth == 0, then the entries wouldn't have got any
> > connection in the current txn so they can be immediately closed in
> > pgfdw_inval_callback() and pgfdw_xact_callback() can exit immediately
> > as xact_got_connection is false. If entry->xact_depth > 0 which means
> > that probably pgfdw_inval_callback() came from a sub txn, we would
> > have got a connection in the txn i.e. xact_got_connection becomes true
> > due to which it can get invalidated in pgfdw_inval_callback() and get
> > closed in pgfdw_xact_callback() at the end of the txn.
> >
> > And there's no chance of entry->xact_depth > 0 and xact_got_connection 
> > false.
> >
> > I think we need to change the comment before pgfdw_inval_callback() a bit:
> >
> >   * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
> >   * mark connections depending on that entry as needing to be remade.
> >   * We can't immediately destroy them, since they might be in the midst of
> >   * a transaction, but we'll remake them at the next opportunity.
> >
> > to
> >
> >   * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
> >   * try to close the cached connections associated with them if they are
> >   * not in the midst of a transaction, otherwise mark them as invalidated.
> >   * We will destroy the invalidated connections in pgfdw_xact_callback()
> >   * at the end of the main xact. Closed connections will be remade at the
> >   * next opportunity.
>
> Yes, I agree that we need to update that comment.
>
> >
> > Any thoughts on the earlier point in [1] related to a test case(which
> > becomes unnecessary with this patch) coverage?
> >
>
> ISTM that we can leave that test as it is because it's still useful to test
> the case where the cached connection is discarded in pgfdw_inval_callback().
> Thought?

Yes, that test case covers the code this patch adds, i.e. closing the
connection in pgfdw_inval_callback() while committing an ALTER SERVER
statement.

> By applying the patch, probably we can get rid of the code to discard
> the invalidated cached connection in GetConnection(). But at least for
> the back branches, I'd like to leave the code as it is so that we can make
> sure that the invalidated cached connection doesn't exist when getting
> new connection. Maybe we can improve that in the master, but I'm not
> sure if it's really worth doing that against the gain. Thought?

+1 to keep that code as is even after this patch is applied (at least
it works as an assertion that we don't have any leftover invalid
connections). I'm not quite sure, but we may need it in some cases, say
if we don't hit the disconnect_pg_server() code in pgfdw_xact_callback()
and pgfdw_inval_callback() due to some error in between. I cannot
think of an exact use case though.
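
For anyone following along, the leak scenario being fixed looks roughly
like this (object names are made up for illustration, and the remote
table is assumed to exist):

```
-- session 1: establish and cache a postgres_fdw connection
CREATE EXTENSION postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (dbname 'postgres', port '5432');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
CREATE FOREIGN TABLE ft (a int) SERVER loopback OPTIONS (table_name 't');
SELECT count(*) FROM ft;   -- opens and caches a connection to "loopback"

-- session 2: invalidate the cached entry
DROP USER MAPPING FOR CURRENT_USER SERVER loopback;

-- session 1: without the fix, the cached connection is only marked
-- invalidated and lingers until the session exits; with the fix,
-- pgfdw_inval_callback() closes it right away when no transaction is
-- using it, or pgfdw_xact_callback() does so at the end of the xact.
```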

Attaching a v2 patch that adds the comments and another test case that
covers the disconnect code in pgfdw_xact_callback(). Please review it.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From 20f605a7586a6d6bc3cb671092c5c1e08c383c4f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 24 Dec 2020 11:53:30 +0530
Subject: [PATCH v2] postgres_fdw - cached connection leaks if the associated
 user mapping is dropped

In postgres_fdw the cached connections to remote servers can stay
around for the lifetime of the local session, if the underlying user
mapping is dropped in another session.

To solve this connection leak problem, the right place to close the
invalid connections is either pgfdw_inval_callback(), if they are not
in the midst of any xact, or pgfdw_xact_callback(), which, once
registered, gets called at the end of every xact in the current session
(by then all the sub-xacts will also have finished). Note that if there
are many invalidated entries, the following xact has to bear the cost of
running this extra disconnect code, but that's better than having leaked
connections.
---
 contrib/postgres_fdw/connection.c | 32 ++-
 .../postgres_fdw/expected/postgres_fdw.out| 18 +++
 

Re: A failure of standby to follow timeline switch

2020-12-23 Thread Fujii Masao




On 2020/12/09 17:43, Kyotaro Horiguchi wrote:

Hello.

We found a behavioral change (which seems to be a bug) in recovery at
PG13.

The following steps might seem somewhat strange, but the replication
code deliberately copes with this case.  This is a sequence seen while
operating an HA cluster using Pacemaker.

- Run initdb to create a primary.
- Set archive_mode=on on the primary.
- Start the primary.

- Create a standby using pg_basebackup from the primary.
- Stop the standby.
- Stop the primary.

- Put standby.signal on the primary, then start it.
- Promote the primary.

- Start the standby.


Until PG12, the primary signals end-of-timeline to the standby and
switches to the next timeline.  Since PG13, that doesn't happen and
the standby continues to request the segment of the older timeline,
which no longer exists.

FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 
00010003 has already been removed

This is because WalSndSegmentOpen() can fail to detect a timeline switch
on a historic timeline, due to the use of the wrong variable for that
check. It is using state->seg.ws_segno, which seems to be a thinko
introduced when the surrounding code was refactored in 709d003fbd.

The first patch detects the wrong behavior.  The second small patch
fixes it.


Thanks for reporting this! This looks like a bug.

When I applied two patches in the master branch and
ran "make check-world", I got the following error.

== creating database "contrib_regression" ==
# Looks like you planned 37 tests but ran 36.
# Looks like your test exited with 255 just after 36.
t/001_stream_rep.pl ..
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/37 subtests
...
Test Summary Report
---
t/001_stream_rep.pl(Wstat: 65280 Tests: 36 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan.  You planned 37 tests but ran 36.
Files=21, Tests=239, 302 wallclock secs ( 0.10 usr  0.05 sys + 41.69 cusr 39.84 
csys = 81.68 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-recovery-recurse] Error 2
make[1]: *** Waiting for unfinished jobs
t/070_dropuser.pl . ok


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Cannot ship records to subscriber for partition tables using logical replication (publish_via_partition_root=false)

2020-12-23 Thread Amit Kapila
On Thu, Dec 24, 2020 at 11:02 AM Li Japin  wrote:
>
> Hi, hackers
>
> When I use logical streaming replication on a partitioned table, I find that
> if we create new partitions after the subscription has been created on the
> subscriber, the records in the new partitions are not shipped to the
> subscriber.
>
> Here is an example:
>
> 1. Create a view to check the subscription tables.
>
> ```
> -- on subscriber
> CREATE VIEW pg_subscription_tables AS
> SELECT
> s.subname,
> n.nspname AS schemaname,
> c.relname AS tablename
> FROM
> pg_subscription s JOIN pg_subscription_rel p ON s.oid = p.srsubid,
> pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
> WHERE c.oid = p.srrelid;
> ```
>
> 2. Create a publication and subscription.
>
> ```
> -- on publisher
> CREATE TABLE test_parent (a int, b int) PARTITION BY RANGE (a);
> CREATE TABLE test_child_01 PARTITION OF test_parent FOR VALUES FROM (1) TO 
> (10);
> CREATE PUBLICATION my_test_pub FOR TABLE test_parent;
>
> -- on subscriber
> CREATE TABLE test_parent (a int, b int) PARTITION BY RANGE (a);
> CREATE TABLE test_child_01 PARTITION OF test_parent FOR VALUES FROM (1) TO 
> (10);
> CREATE SUBSCRIPTION my_test_sub CONNECTION 'host=localhost port=8765 
> dbname=postgres' PUBLICATION my_test_pub;
> ```
>
> 3. Insert data into test_parent on the publisher, and everything looks good.
>
> ```
> -- on publisher
> INSERT INTO test_parent VALUES (5, 50);
> SELECT * FROM pg_publication_tables;
>    pubname   | schemaname |   tablename
> -------------+------------+---------------
>  my_test_pub | public     | test_child_01
> (1 row)
>
> -- on subscriber
> SELECT * FROM test_parent;
>  a | b
> ---+----
>  5 | 50
> (1 row)
>
> SELECT * FROM pg_subscription_tables;
>    subname   | schemaname |   tablename
> -------------+------------+---------------
>  my_test_sub | public     | test_child_01
> (1 row)
> ```
>
> 4. However, if we create a new partition on both the publisher and the
> subscriber, the records in the new partition are not shipped to the
> subscriber. When I check `pg_publication_tables`, I find that the new
> partition is already in the publication, but on the subscriber,
> `pg_subscription_rel` does not have the new partition.
>
> ```
> -- on publisher
> CREATE TABLE test_child_02 PARTITION OF test_parent FOR VALUES FROM (10) TO (20);
> SELECT * FROM pg_publication_tables;
>    pubname   | schemaname |   tablename
> -------------+------------+---------------
>  my_test_pub | public     | test_child_01
>  my_test_pub | public     | test_child_02
> (2 rows)
> INSERT INTO test_parent VALUES (15, 150);
>
> -- on subscriber
> CREATE TABLE test_child_02 PARTITION OF test_parent FOR VALUES FROM (10) TO (20);
> SELECT * FROM test_parent;
>  a | b
> ---+----
>  5 | 50
> (1 row)
>
> SELECT * FROM pg_subscription_tables;
>    subname   | schemaname |   tablename
> -------------+------------+---------------
>  my_test_sub | public     | test_child_01
> (1 row)
> ```
>
> I think this looks strange.
>

The current behavior for partitioned tables is the same as for regular
tables: newly added tables are not automatically replicated to existing
subscriptions. So, if you run ALTER SUBSCRIPTION my_test_sub REFRESH
PUBLICATION;, it will start replicating the newly added partition.
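
Concretely, in the example upthread, that would be (on the subscriber):

```
ALTER SUBSCRIPTION my_test_sub REFRESH PUBLICATION;

-- test_child_02 should now show up here (and, with the default
-- copy_data = true, its existing rows are copied by the initial sync)
SELECT * FROM pg_subscription_tables;
```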

-- 
With Regards,
Amit Kapila.




Re: Movement of restart_lsn position movement of logical replication slots is very slow

2020-12-23 Thread Jammie
However, when that situation arises and one slot gets behind, it never
recovers, and there is no way to recover from this even after reading from
the slot using the advance or pg_logical_get_changes SQL commands.
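
That is, attempts along these lines (with 'private_slot' standing in for
the real slot name):

```
-- try to consume WAL on the lagging slot without a streaming client
SELECT pg_replication_slot_advance('private_slot', pg_current_wal_lsn());
SELECT * FROM pg_logical_slot_get_changes('private_slot', NULL, NULL);
```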

On Wed, Dec 23, 2020 at 7:05 PM Jammie  wrote:

> Thanks Amit for the response.
> Two things:
> 1) In our observation via psql, the advance command likewise does not move
> the restart_lsn immediately. It is similar to our approach that uses the
> confirmed_flush_lsn via the stream.
> 2) I understand the point that we are not reading from the stream, so we
> might be facing the issue. But the question is why we are able to move the
> restart_lsn most of the time by updating the confirmed_flush_lsn via
> pgJDBC, but only occasionally it lags too far behind.
>
> Regards
> Shailesh
>
>
>
> On Tue, Dec 15, 2020 at 6:30 PM Amit Kapila 
> wrote:
>
>> On Tue, Dec 15, 2020 at 11:00 AM Jammie 
>> wrote:
>> >
>> > Thanks Amit for the response
>> >
>> > We are using pgJDBC sample program here
>> > https://jdbc.postgresql.org/documentation/head/replication.html
>> >
>> > the setFlushLSN is coming from the pgJDBC only.
>> >
>> > git hub for APIs of pgJDBC methods available.
>> >
>> > https://github.com/pgjdbc/pgjdbc
>> >
>> > The second slot refers to "private" slot.
>> >
>> > So "we are not reading from the stream" ==> It means that we make a
>> > readPending call only from the shared slot; then we get the
>> > lastReceivedLSN() from the stream and
>> > send it back to the stream as confirmed_flush_lsn for both the private
>> > and shared slots. We don't do a readPending call on the private slot; we
>> > will use the private slot only when we don't have a choice. It is kind
>> > of a reserve slot for us.
>> >
>>
>> I think this (not performing read/decode on the private slot) could be
>> the reason why it is lagging behind. If you want to use it as a reserve slot
>> then you probably want to at least perform
>> pg_replication_slot_advance() to move it to the required position. The
>> restart_lsn won't move unless you read/decode from that slot.
>>
>> --
>> With Regards,
>> Amit Kapila.
>>
>


Re: Parallel Inserts in CREATE TABLE AS

2020-12-23 Thread Amit Kapila
On Thu, Dec 24, 2020 at 10:25 AM vignesh C  wrote:
>
> On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy
>  wrote:
> >
> > On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy
> > Attaching v14 patch set that has above changes. Please consider this
> > for further review.
> >
>
> Few comments:
> In the below case, should create be above Gather?
> postgres=# explain  create table t7 as select * from t6;
>                               QUERY PLAN
> --------------------------------------------------------------------
>  Gather  (cost=0.00..9.17 rows=0 width=4)
>    Workers Planned: 2
>    ->  Create t7
>          ->  Parallel Seq Scan on t6  (cost=0.00..9.17 rows=417 width=4)
> (4 rows)
>
> Can we change it to something like:
> ---
> Create t7
>  -> Gather  (cost=0.00..9.17 rows=0 width=4)
>   Workers Planned: 2
>   ->  Parallel Seq Scan on t6  (cost=0.00..9.17 rows=417 width=4)
> (4 rows)
>

I think it is better to have it the way the current patch does, because
that reflects that we are performing the insert/create below Gather,
which is the purpose of this patch. I think this is similar to what the
Parallel Insert patch [1] has for a similar plan.


[1] - https://commitfest.postgresql.org/31/2844/

-- 
With Regards,
Amit Kapila.




Re: [HACKERS] logical decoding of two-phase transactions

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 3:08 PM Ajin Cherian  wrote:
>
>
> > Can you please update the patch for the points we agreed upon?
>
> Changed and attached.
>

Thanks, I have looked at these patches again and it seems patches 0001
to 0004 are in good shape, and among those
v33-0001-Extend-the-output-plugin-API-to-allow-decoding-o is good to
go. So, I am planning to push the first patch (0001*) sometime next week
unless you or someone else has any comments on it.

-- 
With Regards,
Amit Kapila.




Cannot ship records to subscriber for partition tables using logical replication (publish_via_partition_root=false)

2020-12-23 Thread Li Japin
Hi, hackers

When I use logical streaming replication on a partitioned table, I find that
if we create new partitions after the subscription has been created on the
subscriber, the records in the new partitions are not shipped to the
subscriber.

Here is an example:

1. Create a view to check the subscription tables.

```
-- on subscriber
CREATE VIEW pg_subscription_tables AS
SELECT
s.subname,
n.nspname AS schemaname,
c.relname AS tablename
FROM
pg_subscription s JOIN pg_subscription_rel p ON s.oid = p.srsubid,
pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.oid = p.srrelid;
```

2. Create a publication and subscription.

```
-- on publisher
CREATE TABLE test_parent (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE test_child_01 PARTITION OF test_parent FOR VALUES FROM (1) TO (10);
CREATE PUBLICATION my_test_pub FOR TABLE test_parent;

-- on subscriber
CREATE TABLE test_parent (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE test_child_01 PARTITION OF test_parent FOR VALUES FROM (1) TO (10);
CREATE SUBSCRIPTION my_test_sub CONNECTION 'host=localhost port=8765 
dbname=postgres' PUBLICATION my_test_pub;
```

3. Insert data into test_parent on the publisher, and everything looks good.

```
-- on publisher
INSERT INTO test_parent VALUES (5, 50);
SELECT * FROM pg_publication_tables;
   pubname   | schemaname |   tablename
-------------+------------+---------------
 my_test_pub | public     | test_child_01
(1 row)

-- on subscriber
SELECT * FROM test_parent;
 a | b
---+----
 5 | 50
(1 row)

SELECT * FROM pg_subscription_tables;
   subname   | schemaname |   tablename
-------------+------------+---------------
 my_test_sub | public     | test_child_01
(1 row)
```

4. However, if we create a new partition on both the publisher and the
subscriber, the records in the new partition are not shipped to the
subscriber. When I check `pg_publication_tables`, I find that the new
partition is already in the publication, but on the subscriber,
`pg_subscription_rel` does not have the new partition.

```
-- on publisher
CREATE TABLE test_child_02 PARTITION OF test_parent FOR VALUES FROM (10) TO (20);
SELECT * FROM pg_publication_tables;
   pubname   | schemaname |   tablename
-------------+------------+---------------
 my_test_pub | public     | test_child_01
 my_test_pub | public     | test_child_02
(2 rows)
INSERT INTO test_parent VALUES (15, 150);

-- on subscriber
CREATE TABLE test_child_02 PARTITION OF test_parent FOR VALUES FROM (10) TO (20);
SELECT * FROM test_parent;
 a | b
---+----
 5 | 50
(1 row)

SELECT * FROM pg_subscription_tables;
   subname   | schemaname |   tablename
-------------+------------+---------------
 my_test_sub | public     | test_child_01
(1 row)
```

I think this looks strange. But if we create the publication with
`publish_via_partition_root`, it works fine, since all records are shipped
via the partitioned table [1].

When `publish_via_partition_root` is false, since the publisher adds the new
partitions to the publication, should we add them on the subscriber
automatically?


[1] https://www.postgresql.org/docs/devel/sql-createpublication.html

--
Best regards
Japin Li
ChengDu WenWu Information Technology Co.Ltd.




Re: Parallel Inserts in CREATE TABLE AS

2020-12-23 Thread vignesh C
On Tue, Dec 22, 2020 at 2:16 PM Bharath Rupireddy
 wrote:
>
> On Tue, Dec 22, 2020 at 12:32 PM Bharath Rupireddy
> Attaching v14 patch set that has above changes. Please consider this
> for further review.
>

Few comments:
In the below case, should create be above Gather?
postgres=# explain  create table t7 as select * from t6;
                               QUERY PLAN
--------------------------------------------------------------------
 Gather  (cost=0.00..9.17 rows=0 width=4)
   Workers Planned: 2
   ->  Create t7
         ->  Parallel Seq Scan on t6  (cost=0.00..9.17 rows=417 width=4)
(4 rows)

Can we change it to something like:
---
Create t7
 -> Gather  (cost=0.00..9.17 rows=0 width=4)
  Workers Planned: 2
  ->  Parallel Seq Scan on t6  (cost=0.00..9.17 rows=417 width=4)
(4 rows)

You could change intoclause_len = strlen(intoclausestr) to
strlen(intoclausestr) + 1 and use intoclause_len in the remaining
places. We can avoid the +1 in the other places.
+   /* Estimate space for into clause for CTAS. */
+   if (IS_CTAS(intoclause) && OidIsValid(objectid))
+   {
+   intoclausestr = nodeToString(intoclause);
+   intoclause_len = strlen(intoclausestr);
+   shm_toc_estimate_chunk(&pcxt->estimator, intoclause_len + 1);
+   shm_toc_estimate_keys(&pcxt->estimator, 1);
+   }

Can we use node->nworkers_launched == 0 in place of
node->need_to_scan_locally? That way the setting and resetting of
node->need_to_scan_locally can be removed, unless need_to_scan_locally
is needed in any of the functions that get called.
+   /* Enable leader to insert in case no parallel workers were launched. */
+   if (node->nworkers_launched == 0)
+   node->need_to_scan_locally = true;
+
+   /*
+    * By now, the parallel workers (if any were launched) would have started
+    * their work i.e. insertion into the target table. In case the leader is
+    * chosen to participate in parallel inserts for CTAS, then finish its
+    * share before going to wait for the parallel workers to finish.
+    */
+   if (node->need_to_scan_locally)
+   {

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-23 Thread Michael Paquier
On Thu, Dec 24, 2020 at 09:10:22AM +0530, Bharath Rupireddy wrote:
> Since I tested all the formats manually here and it works, I don't
> want to complicate the test cases by adding the explain_filter()
> function to matview.sql and select_into.sql and all that. I'm okay
> without those test cases.

Please note that I have added an entry in the CF app for the moment so
as we don't lose track of it:
https://commitfest.postgresql.org/31/2892/

>> What we have here looks rather committable.  Let's wait until the
>> period of vacations is over before wrapping this up to give others the
>> occasion to comment.
> 
> Thanks! Happy Vacations!

You too!
--
Michael




Re: New IndexAM API controlling index vacuum strategies

2020-12-23 Thread Peter Geoghegan
On Tue, Dec 22, 2020 at 2:54 PM Masahiko Sawada  wrote:
> I've started this separate thread from [1] for discussing the general
> API design of index vacuum.

This is a very difficult and very important problem. Clearly defining
the problem is probably the hardest part. This prototype patch seems
like a good start, though.

Private discussion between Masahiko and myself led to a shared
understanding of what the best *general* direction is for VACUUM now.
It is necessary to deal with several problems all at once here, and to
at least think about several more problems that will need to be solved
later. If anybody reading the thread initially finds it hard to see
the connection between the specific items that Masahiko has
introduced, they should note that that's *expected*.

> Summary:
>
> * Call ambulkdelete and amvacuumcleanup even when INDEX_CLEANUP is
> false, and leave it to the index AM whether or not to skip them.

Makes sense. I like the way you unify INDEX_CLEANUP and the
vacuum_cleanup_index_scale_factor stuff in a way that is now quite
explicit and obvious in the code.

> The second and third points are to introduce a general framework for
> future extensibility. User-visible behavior is not changed by this
> change.

In some ways the ideas in your patch might be considered radical, or
at least novel: they introduce the idea that bloat can be a
qualitative thing. But at the same time the design is quite
conservative: these are fairly isolated changes, at least code-wise. I
am not 100% sure that this approach will be successful in
vacuumlazy.c, in the end (I'm ~95% sure). But I am 100% sure that our
collective understanding of the problems in this area will be
significantly improved by this effort. A fundamental rethink does not
necessarily require a fundamental redesign, and yet it might be just
as effective.

This is certainly what I see when testing my bottom-up index deletion
patch, which adds an incremental index deletion mechanism that merely
intervenes in a precise, isolated way. Despite my patch's simplicity,
it manages to practically eliminate an entire important *class* of
index bloat (at least once you make certain mild assumptions about the
duration of snapshots). Sometimes it is possible to solve a hard
problem by thinking about it only *slightly* differently.

This is a tantalizing possibility for VACUUM, too. I'm willing to risk
sounding grandiose if that's what it takes to get more hackers
interested in these questions. With that in mind, here is a summary of
the high level hypothesis behind this VACUUM patch:

VACUUM can and should be reimagined as a top-down mechanism that
complements various bottom-up mechanisms (including the stuff from my
deletion patch, heap pruning, and possibly an enhanced version of heap
pruning based on similar principles). This will be possible without
changing any of the fundamental invariants of the current vacuumlazy.c
design. VACUUM's problems are largely pathological behaviors of one
kind or another, that can be fixed with specific well-targeted
interventions. Workload characteristics can naturally determine how
much of the cleanup is done by VACUUM itself -- large variations are
possible within a single database, and even across indexes on the same
table.

> The new index AM API, amvacuumstrategy(), which is called before
> bulkdelete() for each index and asks the index bulk-deletion strategy.
> On this API, lazy vacuum asks, "Hey index X, I collected garbage heap
> tuples during heap scanning, how urgent is vacuuming for you?", and
> the index answers either "it's urgent" when it wants to do
> bulk-deletion or "it's not urgent, I can skip it". The point of this
> proposal is to isolate heap vacuum and index vacuum for each index so
> that we can employ different strategies for each index. Lazy vacuum
> can decide whether or not to do heap clean based on the answers from
> the indexes.

Right -- workload characteristics (plus appropriate optimizations at
the local level) make it possible that amvacuumstrategy() will give
*very* different answers from different indexes *on the same table*.
The idea that all indexes on the table are more or less equally
bloated at any given point in time is mostly wrong. Actually,
*sometimes* it really is correct! But other times it is *dramatically
wrong* -- it all depends on workload characteristics. What is likely
to be true *on average* across all tables/indexes is *irrelevant* (the
mean/average is simply not a useful concept, in fact).

The basic lazy vacuum design needs to recognize this important
difference, and other similar issues. That's the point of
amvacuumstrategy().

> Currently, if the INDEX_CLEANUP option is not set (i.e.
> VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do
> heap clean. But with this patch we use the default as a neutral state
> ('smart' mode). This neutral state could be "on" and "off" depending
> on several factors including the answers of amvacuumstrategy(), the
> table 

Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-23 Thread Bharath Rupireddy
On Thu, Dec 24, 2020 at 7:40 AM Michael Paquier  wrote:
>
> On Wed, Dec 23, 2020 at 07:13:33PM +0530, Bharath Rupireddy wrote:
> > +1. Shall we add some test cases (with xml, yaml, json formats as is
> > currently being done in explain.sql) to cover that? We can have the
> > explain_filter() function to remove the unstable parts in the output,
> > it looks something like below. If yes, please let me know I can add
> > them to matview and select_into.
>
> I am not sure that we need tests for all the formats, but having at
> least one of them sounds good to me.  I leave the choice up to you.

Since I tested all the formats manually here and it works, I don't
want to complicate the test cases by adding the explain_filter()
function to matview.sql and select_into.sql and all that. I'm okay
without those test cases.
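
For reference, the behaviour in question is simply this (a minimal
sketch; the exact error wording may differ):

```
CREATE TABLE t (a int);
CREATE TABLE t AS SELECT 1 AS a;
-- ERROR:  relation "t" already exists
-- with the patch this error is raised before the SELECT is rewritten
-- and planned, instead of afterwards
```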

> What we have here looks rather committable.  Let's wait until the
> period of vacations is over before wrapping this up to give others the
> occasion to comment.

Thanks! Happy Vacations!

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Logical decoding of TRUNCATE

2020-12-23 Thread Noah Misch
On Mon, Dec 21, 2020 at 09:42:47AM -0800, Andres Freund wrote:
> On 2020-12-20 15:54:31 -0800, Peter Geoghegan wrote:
> > On Sun, Dec 20, 2020 at 3:13 PM Andres Freund  wrote:
> > > Hm. Do I understand correctly that this problem is hit solely because
> > > the parallel mode code relies on there already have been a transaction
> > > snapshot set, thus avoiding the error? And that the code normally only
> > > works because GetTransactionSnapshot() will already have been called
> > > somewhere, before EnterParallelMode()?
> > 
> > It seems unlikely that InitializeParallelDSM() was ever intended to be
> > run in a background worker.
> 
> IDK, the environment in a bgworker shouldn't be that different from the
> normal query environment in a normal connection. And it's far from
> insane to want to be able to run a paralell query in a bgworker (and I
> *think* I have seen that work before). This case here seems more like
> an accidental dependency than anything to me, once that could perhaps
> even hint at problems in normal backends too.

Yeah.  SPI_execute("TRUNCATE foo", false, 0) has no trouble doing a parallel
index build in a bgworker.  Let's ignore intention and be pleased about that.




Re: Confused about stream replication protocol documentation

2020-12-23 Thread Li Japin

On Dec 23, 2020, at 8:11 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:


On 2020/12/23 11:08, Li Japin wrote:
On Dec 22, 2020, at 11:13 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

‘B’ means a backend and ‘F’ means a frontend. Maybe as [1] does, we should
add the note like "Each is marked to indicate that it can be sent by
a frontend (F) and a backend (B)" into the description about each message
format for START_REPLICATION.

[1]
https://www.postgresql.org/docs/devel/protocol-message-formats.html 

Thanks for the clarification.  Maybe we should move the "protocol message
formats" section before the "stream replication protocol" one, or reference
it from the "stream replication protocol" section.

I like the latter. And maybe it's better to reference to also
"53.6. Message Data Types" there because the messages for
START_REPLICATION use the message data types.

Added references to "protocol message types" and "protocol message formats".

index 4899bacda7..5793936b42 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2069,8 +2069,9 @@ The commands accepted in replication mode are:
  

  
-  WAL data is sent as a series of CopyData messages.  (This allows
-  other information to be intermixed; in particular the server can send
+  WAL data is sent as a series of CopyData messages
+  (See  and ).
+  (This allows other information to be intermixed; in particular the 
server can send
   an ErrorResponse message if it encounters a failure after beginning
   to stream.)  The payload of each CopyData message from server to the
   client contains a message of one of the following formats:

--
Best regards
Japin Li



stream-replication-protocol-documentation.patch
Description: stream-replication-protocol-documentation.patch


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Michael Paquier
On Thu, Dec 24, 2020 at 10:50:34AM +0900, Michael Paquier wrote:
> FWIW, it still makes the most sense to me to keep the options that are
> extracted from the grammar or things that apply to all the
> sub-routines of REINDEX to be tracked in a single structure, so this
> should include only the REINDEXOPT_* set for now, with the tablespace
> OID as of this thread, and also the reindex filtering options.
> REINDEX_REL_* is in my opinion of a different family because they only
> apply to reindex_relation(), and partially to reindex_index(), so they
> are very localized.  In short, anything in need of only
> reindex_relation() has no need to know about the whole ReindexOption
> business.

I need more coffee here..  reindex_relation() knows about
ReindexOptions.  Still it would be weird to track REINDEX_REL_* at a
global level as ExecReindex(), ReindexTable(), ReindexMultipleTables()
and the like don't need to know about that.
--
Michael




Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-23 Thread Michael Paquier
On Wed, Dec 23, 2020 at 07:13:33PM +0530, Bharath Rupireddy wrote:
> +1. Shall we add some test cases (with xml, yaml, json formats as is
> currently being done in explain.sql) to cover that? We can have the
> explain_filter() function to remove the unstable parts in the output,
> it looks something like below. If yes, please let me know I can add
> them to matview and select_into.

I am not sure that we need tests for all the formats, but having at
least one of them sounds good to me.  I leave the choice up to you.

What we have here looks rather committable.  Let's wait until the
period of vacations is over before wrapping this up to give others the
occasion to comment.
--
Michael




Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-23 Thread Fujii Masao




On 2020/12/23 23:40, Bharath Rupireddy wrote:

On Wed, Dec 23, 2020 at 7:31 PM Fujii Masao  wrote:

I agree to make pgfdw_xact_callback() close the connection when
entry->invalidated == true. But I think that it's better to get rid of
have_invalid_connections flag and make pgfdw_inval_callback() close
the connection immediately if entry->xact_depth == 0, to avoid unnecessary
scan of the hashtable during COMMIT of transaction not accessing to
foreign servers. Attached is the POC patch that I'm thinking. Thought?


We could do that way as well. It seems okay to me. Now the disconnect
code is spread in pgfdw_inval_callback() and pgfdw_xact_callback().
There's also less burden on pgfdw_xact_callback() to close a lot of
connections on a single commit. The behaviour is like this - If
entry->xact_depth == 0, then the entries wouldn't have got any
connection in the current txn so they can be immediately closed in
pgfdw_inval_callback() and pgfdw_xact_callback() can exit immediately
as xact_got_connection is false. If entry->xact_depth > 0 which means
that probably pgfdw_inval_callback() came from a sub txn, we would
have got a connection in the txn i.e. xact_got_connection becomes true
due to which it can get invalidated in pgfdw_inval_callback() and get
closed in pgfdw_xact_callback() at the end of the txn.

And there's no chance of entry->xact_depth > 0 and xact_got_connection false.

I think we need to change the comment before pgfdw_inval_callback() a bit:

  * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
  * mark connections depending on that entry as needing to be remade.
  * We can't immediately destroy them, since they might be in the midst of
  * a transaction, but we'll remake them at the next opportunity.

to

  * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
  * try to close the cached connections associated with them if they are
  * not in the midst of a transaction, otherwise mark them as invalidated.
  * We will destroy the invalidated connections in pgfdw_xact_callback()
  * at the end of the main xact. Closed connections will be remade at the
  * next opportunity.


Yes, I agree that we need to update that comment.



Any thoughts on the earlier point in [1] related to a test case(which
becomes unnecessary with this patch) coverage?



ISTM that we can leave that test as it is because it's still useful to test
the case where the cached connection is discarded in pgfdw_inval_callback().
Thought?

By applying the patch, probably we can get rid of the code to discard
the invalidated cached connection in GetConnection(). But at least for
the back branches, I'd like to leave the code as it is so that we can make
sure that the invalidated cached connection doesn't exist when getting
new connection. Maybe we can improve that in the master, but I'm not
sure if it's really worth doing that against the gain. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Michael Paquier
On Wed, Dec 23, 2020 at 07:07:54PM -0600, Justin Pryzby wrote:
> On Wed, Dec 23, 2020 at 09:14:18PM -0300, Alvaro Herrera wrote:
>> On 2020-Dec-23, Justin Pryzby wrote: 
>>> This was getting ugly:
>>> 
>>> extern void reindex_index(Oid indexId, bool skip_constraint_checks,
>>>   char relpersistence, int options, Oid 
>>> tablespaceOid)
>> 
>> Is this what I suggested?

No idea if this is what you suggested, but it would be annoying to
have to change this routine's signature each time we need to pass down
a new option.

> No ; that was from an earlier revision of the patch adding the tablespace,
> before Michael suggested a ReindexOptions struct, which subsumes 'options' and
> 'tablespaceOid'.
> 
> I see now that 'skip_constraint_checks' is from REINDEX_REL_CHECK_CONSTRAINTS.
> It seems like that should be a REINDEXOPT_* flag, rather than REINDEX_REL_*,
> so it doesn't need to be a separate boolean.  See also: 2d3320d3d.

FWIW, it still makes the most sense to me to keep the options that are
extracted from the grammar or things that apply to all the
sub-routines of REINDEX to be tracked in a single structure, so this
should include only the REINDEXOPT_* set for now, with the tablespace
OID as of this thread, and also the reindex filtering options.
REINDEX_REL_* is in my opinion of a different family because they only
apply to reindex_relation(), and partially to reindex_index(), so they
are very localized.  In short, anything in need of only
reindex_relation() has no need to know about the whole ReindexOption
business.
--
Michael




RE: how to use valgrind for TAP tests

2020-12-23 Thread osumi.takami...@fujitsu.com
Hi, Alexander

On Sunday, December 20, 2020 5:00 PM Alexander Lakhin wrote:
> > "osumi.takami...@fujitsu.com"  writes:
> >> I have a question about how to execute valgrind with TAP tests in
> >> order to check some patches in the community.
> >> My main interest is testing src/test/subscription now but is there
> >> any general way to do it ?
> > The standard solution is
> >
> > (1) Build normally (well, with -DUSE_VALGRIND)
> > (2) Move the postgres executable aside, say
> > mv src/backend/postgres src/backend/postgres.orig
> > (3) Replace the executable with a wrapper script that invokes
> > valgrind on the original executable
> > (4) Now you can run "make check" with a valgrind'ed server,
> > as well as things that depend on "make check", such as TAP tests
> >
> > The script I use for (3) is attached; adjust paths and options to taste.
> I use the attached patch for this purpose, that slightly simplifies things and
> covers all the other binaries:
> git apply .../install-vrunner.patch
> CPPFLAGS="-DUSE_VALGRIND -Og" ./configure --enable-tap-tests
> --enable-debug --enable-cassert && make && make check `make
> check-world` is possible too, with src/bin/pg_ctl/t/001_start_stop.pl
> disabled (removed).
Thank you for giving me such fruitful advice!
When I encounter another need like this, I'll apply this method as well.


Best Regards,
Takamichi Osumi




RE: how to use valgrind for TAP tests

2020-12-23 Thread osumi.takami...@fujitsu.com
Hello

On Saturday, December 19, 2020 1:03 AM Tom Lane  wrote:
> "osumi.takami...@fujitsu.com"  writes:
> > I have a question about how to execute valgrind with TAP tests in
> > order to check some patches in the community.
> > My main interest is testing src/test/subscription now but is there any
> > general way to do it ?
> 
> The standard solution is
> 
> (1) Build normally (well, with -DUSE_VALGRIND)
> (2) Move the postgres executable aside, say
> mv src/backend/postgres src/backend/postgres.orig
> (3) Replace the executable with a wrapper script that invokes
> valgrind on the original executable
> (4) Now you can run "make check" with a valgrind'ed server,
> as well as things that depend on "make check", such as TAP tests
> 
> The script I use for (3) is attached; adjust paths and options to taste.
Thank you so much.
I hadn't thought of preparing a wrapper script.
This worked successfully.


Best,
Takamichi Osumi




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Justin Pryzby
On Wed, Dec 23, 2020 at 09:14:18PM -0300, Alvaro Herrera wrote:
> On 2020-Dec-23, Justin Pryzby wrote:
> 
> > This was getting ugly:
> > 
> > extern void reindex_index(Oid indexId, bool skip_constraint_checks,
> >   char relpersistence, int options, Oid 
> > tablespaceOid)Z
> 
> Is this what I suggested?

No ; that was from an earlier revision of the patch adding the tablespace,
before Michael suggested a ReindexOptions struct, which subsumes 'options' and
'tablespaceOid'.

I see now that 'skip_constraint_checks' is from REINDEX_REL_CHECK_CONSTRAINTS.
It seems like that should be a REINDEXOPT_* flag, rather than REINDEX_REL_*,
so it doesn't need to be a separate boolean.  See also: 2d3320d3d.

-- 
Justin




Re: New Table Access Methods for Multi and Single Inserts

2020-12-23 Thread Bharath Rupireddy
On Mon, Dec 21, 2020 at 1:17 PM Justin Pryzby  wrote:
> On Fri, Dec 18, 2020 at 11:54:39AM -0600, Justin Pryzby wrote:
> > On Fri, Dec 18, 2020 at 07:39:14AM +0530, Bharath Rupireddy wrote:
> > > On Fri, Dec 18, 2020 at 2:14 AM Justin Pryzby  
> > > wrote:
> > > > Are you thinking that TableInsertState would eventually have additional
> > > > attributes which would apply to other tableams, but not to heap ?  Is
> > > > heap_insert_begin() really specific to heap ?  It's allocating and 
> > > > populating a
> > > > structure based on its arguments, but those same arguments would be 
> > > > passed to
> > > > every other AM's insert_begin routine, too.  Do you need a more 
> > > > flexible data
> > > > structure, something that would also accomodate extensions?  I'm 
> > > > thinking of
> > > > reloptions as a loose analogy.
> > >
> > > I could not think of other tableam attributes now. But +1 to have that
> > > kind of flexible structure for TableInsertState. So, it can have
> > > tableam type and attributes within the union for each type.
> >
> > Right now you have heap_insert_begin(), and I asked if it was really
> > heap-specific.  Right now, it populates a struct based on a static list of
> > arguments, which are what heap uses.
> >
> > If you were to implement a burp_insert_begin(), how would it differ from
> > heap's?  With the current API, they'd (have to) be the same, which means 
> > either
> > that it should apply to all AMs (or have a "default" implementation that 
> > can be
> > overridden by an AM), or that this API assumes that other AMs will want to 
> > do
> > exactly what heap does, and fails to allow other AMs to implement 
> > optimizations
> > for bulk inserts as claimed.
> >
> > I don't think using a "union" solves the problem, since it can only 
> > accommodate
> > core AMs, and not extensions, so I suggested something like reloptions, 
> > which
> > have a "namespace" prefix (and core has toast.*, like ALTER TABLE t SET
> > toast.autovacuum_enabled).
>
> I think you'd want to handle things like:
>
>  - a compressed AM wants to specify a threshold for a tuple's *compressed* 
> size
>(maybe in addition to the uncompressed size);
>  - a "columnar" AM wants to specify a threshold size for a column, rather
>than for each tuple;
>
> I'm not proposing to handle those specific parameters, but rather pointing out
> that your implementation doesn't allow handling AM-specific considerations,
> which I think was the goal.
>
> The TableInsertState structure would need to store those, and then the AM's
> multi_insert_v2 routine would need to make use of them.
>
> It feels a bit like we'd introduce the idea of an "AM option", except that it
> wouldn't be user-facing (or maybe some of them would be?).  Maybe I've
> misunderstood though, so other opinions are welcome.

Attaching a v2 patch for the new table AMs.

This patch has the following changes:

1) Made the TableInsertState structure generic by having a void
pointer for multi insert state and defined the heap specific multi
insert state information in heapam.h. This way each AM can have its
own multi insert state structure and dereference the void pointer
using that structure inside the respective AM implementations.
2) Earlier in the v1 patch, the bulk insert state
allocation/deallocation was moved to AM level, but I see that there's
nothing specific in doing so and I think it should be independent of
AM. So I'm doing that in table_insert_begin() and table_insert_end().
Because of this, I had to move the BulkInsert function declarations
from heapam.h to tableam.h
3) Corrected the typos and tried to adjust indentation of the code.

Note that I have not yet made the multi_insert_v2 API optional as
suggested earlier. I will think more on this and update.

I'm not posting the updated 0002 to 0004 patches, I plan to do so
after a couple of reviews happen on the design of the APIs in 0001.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From cab7baa6f5c0229816e09a887c0468a1ca4edccb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Thu, 24 Dec 2020 05:18:13 +0530
Subject: [PATCH v2] New Table AMs for Multi and Single Inserts

This patch introduces new table access methods for multi and
single inserts. Also implements/rearranges the outside code for
heap am into these new APIs.

Main design goal of these new APIs is to give flexibility to
tableam developers in implementing multi insert logic dependent on
the underlying storage engine. Currently, for all the underlying
storage engines, we follow the same multi insert logic such as when
and how to flush the buffered tuples, tuple size calculation, and
this logic doesn't take into account the underlying storage engine
capabilities.

We can also avoid duplicating multi insert code (for existing COPY,
and upcoming CTAS, CREATE/REFRESH MAT VIEW and INSERT SELECTs). We
can also move bulk insert state allocation and deallocation inside
these APIs
---
 

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alvaro Herrera
On 2020-Dec-23, Justin Pryzby wrote:

> This was getting ugly:
> 
> extern void reindex_index(Oid indexId, bool skip_constraint_checks,
>   char relpersistence, int options, Oid 
> tablespaceOid)Z

Is this what I suggested?




timestamp bogus parser?

2020-12-23 Thread Fabien COELHO



Hello,

I tried:

 psql> SELECT TIMESTAMP '2020-12-23Z19:28:45';

The result of which is:

 2020-12-23 00:00:00

This is disappointing. Ok, my fault, I should have written TIMESTAMPTZ, or 
use T instead of Z, or whatever.
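
For the record, as far as I can tell either of these does what I intended:

```
SELECT TIMESTAMP '2020-12-23T19:28:45';     -- 2020-12-23 19:28:45
SELECT TIMESTAMPTZ '2020-12-23 19:28:45Z';  -- time kept, Z read as UTC
```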


Anyway, is there a rationale for this behavior? I would have expected
either the time to be taken into account or an error message, but 
certainly not half of my string being silently ignored. Skimming through 
the doc (Section 8.5.1.3) did not provide an answer.


Any clues? Is this a bug?

--
Fabien.




Re: Re: Cache relation sizes?

2020-12-23 Thread Thomas Munro
On Wed, Dec 23, 2020 at 1:31 AM 陈佳昕(步真)  wrote:
> I studied your patch these days and found there might be a problem.
> When we execute 'drop database', the smgr shared pool will not be removed
> because 'smgr_drop_sr' is never called. The function 'dropdb' in
> dbcommands.c removes the buffers from the buffer pool and unlinks the real
> files via 'rmtree'; it doesn't call smgrdounlinkall, so the smgr shared
> cache will not be dropped although the table has been removed. This will
> cause some errors in smgr_alloc_str -> smgropen and smgrimmedsync. The
> table file has been removed, so smgropen and smgrimmedsync will get an
> unexpected result.

Hi Buzhen,

Thanks, you're right -- it needs to scan the pool of SRs and forget
everything from the database you're dropping.




Re: Cache relation sizes?

2020-12-23 Thread Thomas Munro
On Thu, Dec 17, 2020 at 10:22 PM Andy Fan  wrote:
> Let me try to understand your point.  Suppose process 1 extends a file to
> 2 blocks from 1 block, and fsync is not called, then a). the lseek *may* still
> return 1 based on the comments in the ReadBuffer_common ("because
> of buggy Linux kernels that sometimes return an seek SEEK_END result
> that doesn't account for a recent write").  b). When this patch is introduced,
> we would get 2 blocks for sure if we can get the size from cache. c). After
> user reads the 2 blocks from cache and then the cache is evicted, and user
> reads the blocks again, it is possible to get 1 again.
>
> Do we need to consider case a) in this patch? and Is case c). the situation 
> you
> want to avoid, and do we have to introduce fsync to achieve this? Basically I
> think case a) should not happen by practice so we don't need to handle case
> c) and finally we don't need the fsync/SR_SYNCING/SR_JUST_DIRTIED flag.
> I'm not opposed to adding them, I am just interested in the background in case
> you are also willing to discuss it.

Sorry for the lack of background -- there was some discussion on the
thread "Optimize dropping of relation buffers using dlist", but it's
long and mixed up with other topics.  I'll try to summarise here.

It's easy to reproduce files shrinking asynchronously on network
filesystems that reserve space lazily[1], and we know that there have
historically been problems even with local filesystems[2], leading to
that comment about buggy kernels.  I am only interested in the
behaviour I can demonstrate, because I believe Linux is working as
intended when it does that (hypothetical/unknown bugs would probably
be helped by this approach too, but I'm not interested in speculating
about that beyond these parentheses).

So, why should we consider this?  It's true that commit ffae5cc5a60's
change to ReadBuffer_common() already prevents us from eating data by
overwriting an existing buffer due to this phenomenon, but there are
other problems:

1.  A system in this state can give an incorrect answer to a
query: heapam.c calls RelationGetNumberOfBlocks() at the beginning of
a scan, so it might not see recently added blocks containing committed
data.  Hopefully we'll panic when we try to create a checkpoint and
see errors, but we could be giving bad results for a while.

2.  Commitfest entry #2319 needs an accurate size, or it might leave
stray blocks in the buffer pool that will cause failures or corruption
later.

It's true that we could decide not to worry about this, and perhaps
even acknowledge in some official way that PostgreSQL doesn't work
reliably on some kinds of filesystems.  But in this prototype I
thought it was fun to try to fix our very high rate of lseek()
syscalls and this reliability problem, at the same time.  I also
speculate that we'll eventually have other reasons to want a
demand-paged per-relation shared memory object.

> I would suggest the below change for smgrnblock_fast,  FYI
>
> @@ -182,7 +182,11 @@ smgrnblocks_fast(SMgrRelation reln, ForkNumber forknum)
> /*
>  * The generation doesn't match, the shared relation must have been
>  * evicted since we got a pointer to it.  We'll need to do more work.
> +* and invalidate the fast path for next time.
>  */
> +   reln->smgr_shared = NULL;
> }

Thanks, makes sense -- I'll add this to the next version.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLZTfKuXir9U4JQkY%3Dk3Tb6M_3toGrGOK_fa2d4MPQQ_w%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/BAY116-F146A38326C5630EA33B36BD12B0%40phx.gbl




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Justin Pryzby
On Wed, Dec 23, 2020 at 07:22:05PM -0300, Alvaro Herrera wrote:
> Also: it seems a bit weird to me to put the flags inside the options
> struct.  I would keep them separate -- so initially the options struct
> would only have the tablespace OID, on API cleanliness grounds:

I don't see why they'd be separate or why it's cleaner ?

If the user says REINDEX (CONCURRENTLY, VERBOSE, TABLESPACE ts) , why would we
pass around the boolean flags separately from the other options ?

> struct ReindexOptions
> {
>   tablepaceOidoid;
> };
> extern bool
> reindex_relation(Oid relid, bits32 flags, ReindexOptions *options);


> But also, are we really envisioning that these routines would have all
> that many additional options?  Maybe it is sufficient to do just
> 
> extern bool
> reindex_relation(Oid relid, bits32 flags, tablespaceOid Oid);

That's what we did initially, and Michael suggested to put it into a struct.
Which makes the tablespace patches cleaner for each of REINDEX, CLUSTER,
VACUUM, since it doesn't require modifying the signature of 5-10 functions.
And future patches get to reap the benefit.

These are intended to be like VacuumParams.  Consider that ClusterOptions is
proposed to get not just a tablespaceOid but also an idxtablespaceOid.

This was getting ugly:

extern void reindex_index(Oid indexId, bool skip_constraint_checks,
                          char relpersistence, int options, Oid tablespaceOid);


-- 
Justin




Re: Discarding DISCARD ALL

2020-12-23 Thread Andreas Karlsson

On 12/23/20 6:49 PM, Simon Riggs wrote:

The whole premise of the patch is tighter integration, with the server
providing the facilities that poolers need.


I am all for that. Ideally I would want builtin connection pooling but 
short term I think the way forward is most likely tighter integration.



The patch can be enhanced to do whatever else we agree is desirable.

Do we need something like DISCARD ALL EXCEPT PREPARED STATEMENTS;  ??

If there are different requirements for each pooler, what are they?


If someone adds prepared statement support to e.g. PgBouncer that might 
be a nice feature to have. Plus maybe something like "SET 
transaction_cleanup = 'except_prepared'". But right now I think the 
poolers which support prepared statements do not use DISCARD ALL and
instead just trust the end user not to run SET or use temporary tables
which outlive transactions.
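
To spell out what such a pooler runs into today, a minimal illustration:

```
PREPARE q AS SELECT 1;
DISCARD ALL;
EXECUTE q;   -- ERROR:  prepared statement "q" does not exist
-- the narrower DISCARD TEMP / DISCARD PLANS / DISCARD SEQUENCES forms
-- leave prepared statements usable, but none of them resets everything
-- else the way DISCARD ALL does
```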


Andreas




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alvaro Herrera
On 2020-Dec-23, Michael Paquier wrote:

>  bool
> -reindex_relation(Oid relid, int flags, int options)
> +reindex_relation(Oid relid, int flags, ReindexOptions *options)
>  {
>   Relationrel;
>   Oid toast_relid;

Wait a minute.  reindex_relation has 'flags' and *also* 'options' with
an embedded 'flags' member?  Surely that's not right.  I see that they
carry orthogonal sets of options ... but why aren't they a single
bitmask instead of two separate ones?  This looks weird and confusing.


Also: it seems a bit weird to me to put the flags inside the options
struct.  I would keep them separate -- so initially the options struct
would only have the tablespace OID, on API cleanliness grounds:

struct ReindexOptions
{
    Oid         tablespaceOid;
};
extern bool
reindex_relation(Oid relid, bits32 flags, ReindexOptions *options);

I guess you could argue that it performs better to set up only two
arguments to the function call instead of three ... but I doubt that's
measurable for anything in DDL-land.

But also, are we really envisioning that these routines would have all
that many additional options?  Maybe it is sufficient to do just

extern bool
reindex_relation(Oid relid, bits32 flags, Oid tablespaceOid);




Re: Implementing Incremental View Maintenance

2020-12-23 Thread Tatsuo Ishii
Hi Yugo,

> 1. Creating an index on the matview automatically

Nice.

> 2. Use a weaker lock on the matview if possible
> 
> If the view has only one base table in this query, RowExclusiveLock is
> held on the view instead of AccessExclusiveLock, because we don't
> need to wait for other concurrent transactions' results in order to
> maintain the view in this case. When the same row in the view is
> affected due to concurrent maintenances, a row level lock will
> protect it.
> 
> On Tue, 24 Nov 2020 12:46:57 +0300
> Konstantin Knizhnik  wrote:
> 
>> The most obvious optimization is not to use exclusive table lock if view 
>> depends just on one table (contains no joins).
>> Looks like there are no any anomalies in this case, are there?
> 
> I confirmed the effect of this optimizations.
> 
> First, when I performed pgbench (SF=100) without any materialized views,
> the results is :
>  
>  pgbench test4 -T 300 -c 8 -j 4
>  latency average = 6.493 ms
>  tps = 1232.146229 (including connections establishing)
> 
> Next, created a view as below, I performed the same pgbench.
>  CREATE INCREMENTAL MATERIALIZED VIEW mv_ivm2 AS
> SELECT bid, count(abalance), sum(abalance), avg(abalance)
> FROM pgbench_accounts GROUP BY bid;
> 
> The result is here:
> 
> [the previous version (v19 with exclusive table lock)]
>  - latency average = 77.677 ms
>  - tps = 102.990159 (including connections establishing)
> 
> [In the latest version (v20 with weaker lock)]
>  - latency average = 17.576 ms
>  - tps = 455.159644 (including connections establishing)
> 
> There is still substantial overhead, but we can see the effect
> of the optimization.

Great improvement. With this patch it is now more than 4x faster than the
previous one!

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-12-23 Thread Stephen Frost
Greetings,

* Justin Pryzby (pry...@telsasoft.com) wrote:
> On Mon, Nov 23, 2020 at 04:14:18PM -0500, Tom Lane wrote:
> > * I don't think it's okay to change the existing signatures of
> > pg_ls_logdir() et al.  Even if you can make an argument that it's
> > not too harmful to add more output columns, replacing pg_stat_file's
> > isdir output with something of a different name and datatype is most
> > surely not OK --- there is no possible way that doesn't break existing
> > user queries.
> > 
> > I think possibly a more acceptable approach is to leave these functions
> > alone but add documentation explaining how to get the additional info.
> > You could say things along the lines of "pg_ls_waldir() is the same as
> > pg_ls_dir_metadata('pg_wal') except for showing fewer columns."
> 
> On Mon, Nov 23, 2020 at 06:06:19PM -0500, Tom Lane wrote:
> > I'm mostly concerned about removing the isdir output of pg_stat_file().
> > Maybe we could compromise to the extent of keeping that, allowing it
> > to be partially duplicative of a file-type-code output column.
> 
> On Tue, Nov 24, 2020 at 11:53:22AM -0500, Stephen Frost wrote:
> > I don't have any particular issue with keeping isdir as a convenience
> > column.  I agree it'll now be a bit duplicative but that seems alright.
> 
> Maybe we should do what Tom said, and leave pg_ls_* unchanged, but also mark
> them as deprecated in favour of:
> | pg_ls_dir_metadata(dir), dir={'pg_wal/archive_status', 'log', 'pg_wal', 
> 'base/pgsql_tmp'}

I haven't really got time to review the patches here in detail right now
(maybe next month), but in general, I dislike marking things as
deprecated.  If we don't want to change them and we're happy to continue
supporting them as-is (which is what 'deprecated' really means), then we
can just do so; nothing stops us from that.  If we don't think the
current API makes sense, for whatever reason, we can just change it;
there's no need for a 'deprecation period', as we already have major
versions and support each major version for 5 years.

I don't have particularly strong feelings one way or the other regarding
these particular functions.  If you asked which way I leaned, I'd say
that I'd rather redefine the functions to make more sense and to be easy
to use for people who would like to use them.  I wouldn't object to new
functions providing that either, though.  I don't think there's all that
much code, or that it changes often enough, for keeping both to be a big
burden, but that's more feeling than anything based on actual research at
this point.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-12-23 Thread Justin Pryzby
On Mon, Nov 23, 2020 at 04:14:18PM -0500, Tom Lane wrote:
> * I don't think it's okay to change the existing signatures of
> pg_ls_logdir() et al.  Even if you can make an argument that it's
> not too harmful to add more output columns, replacing pg_stat_file's
> isdir output with something of a different name and datatype is most
> surely not OK --- there is no possible way that doesn't break existing
> user queries.
> 
> I think possibly a more acceptable approach is to leave these functions
> alone but add documentation explaining how to get the additional info.
> You could say things along the lines of "pg_ls_waldir() is the same as
> pg_ls_dir_metadata('pg_wal') except for showing fewer columns."

On Mon, Nov 23, 2020 at 06:06:19PM -0500, Tom Lane wrote:
> I'm mostly concerned about removing the isdir output of pg_stat_file().
> Maybe we could compromise to the extent of keeping that, allowing it
> to be partially duplicative of a file-type-code output column.

On Tue, Nov 24, 2020 at 11:53:22AM -0500, Stephen Frost wrote:
> I don't have any particular issue with keeping isdir as a convenience
> column.  I agree it'll now be a bit duplicative but that seems alright.

Maybe we should do what Tom said, and leave pg_ls_* unchanged, but also mark
them as deprecated in favour of:
| pg_ls_dir_metadata(dir), dir={'pg_wal/archive_status', 'log', 'pg_wal', 
'base/pgsql_tmp'}
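
As a rough illustration of that equivalence (pg_ls_dir_metadata comes from this
patch series, not core, so its exact column list is an assumption here):

SELECT * FROM pg_ls_waldir();               -- today: name, size, modification
SELECT * FROM pg_ls_dir_metadata('pg_wal'); -- patched: same files, extra metadata columns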

However, pg_ls_tmpdir is special since it handles tablespace tmpdirs, which it
seems are not trivial to get from SQL:

+SELECT * FROM (SELECT DISTINCT COALESCE(NULLIF(pg_tablespace_location(b.oid),'')||suffix, 'base/pgsql_tmp') AS dir
+FROM pg_tablespace b, pg_control_system() pcs,
+LATERAL format('/PG_%s_%s', left(current_setting('server_version_num'), 2), pcs.catalog_version_no) AS suffix) AS dir,
+LATERAL pg_ls_dir_recurse(dir) AS a;

For context, the line of reasoning that led me to this patch series was
something like this:

0) Why can't I list shared tempfiles (dirs) using pg_ls_tmpdir() ?
1) Implement recursion for pg_ls_tmpdir();
2) Eventually realize that it's silly to implement a function to recurse into
   one particular directory when no general feature exists;
3) Implement generic facility;

> * I noticed that you did s/stat/lstat/.  That's fine on Unix systems,
> but it won't have any effect on Windows systems (cf bed90759f),
> which means that we'll have to document a platform-specific behavioral
> difference.  Do we want to go there?
>
> Maybe this patch needs to wait on somebody fixing our lack of real lstat() on 
> Windows.

I think only the "top" patches depend on lstat (for the "type" column and
recursion, to avoid loops).  The initial patches are independently useful, and
resolve the original issue of hiding tmpdirs.  I've rebased and re-arranged the
patches to reflect this.

-- 
Justin
>From 1fffd1c7af05298295e96fff189caf6eb6cc1539 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v26 01/11] Document historic behavior of links to
 directories..

Backpatch to 9.5: pg_stat_file
---
 doc/src/sgml/func.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2707e757ca..806a7b8a9a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26559,7 +26559,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).


 This function is restricted to superusers by default, but other users
-- 
2.17.0

>From 19c080b6163833323c53e48519f850d21d3a618d Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 18:59:16 -0500
Subject: [PATCH v26 02/11] pg_stat_file and pg_ls_dir_* to use lstat()..

pg_ls_dir_* will now skip (no longer show) symbolic links, same as other
non-regular file types, as we advertise we do since 8b6d94cf6.  That seems to
be the intended behavior, since irregular file types are 1) less portable; and,
2) we don't currently show a file's type except for "bool is_dir".

pg_stat_file will now 1) show metadata of links themselves, rather than their
target; and, 2) specifically, show links to directories with "is_dir=false";
and, 3) not error on broken symlinks.
---
 doc/src/sgml/func.sgml  | 2 +-
 src/backend/utils/adt/genfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 806a7b8a9a..2707e757ca 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26559,7 +26559,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the 

Re: Perform COPY FROM encoding conversions in larger chunks

2020-12-23 Thread John Naylor
On Wed, Dec 23, 2020 at 3:41 AM Heikki Linnakangas  wrote:

>
> I'm not sure it's worth the trouble, though. Custom conversions are very
> rare. And I don't think any other object can depend on a conversion, so
> you can always drop the conversion before upgrade, and re-create it with
> the new function signature afterwards. A note in the release notes and a
> check in pg_upgrade, with instructions to drop and recreate the
> conversion, are probably enough.
>

That was my thought as well.
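
For reference, a minimal sketch of that drop-and-recreate dance around
pg_upgrade; the conversion name (myconv) and function name are hypothetical,
and the conversion function itself would first have to be re-created with the
new signature:

-- before running pg_upgrade
DROP CONVERSION myconv;

-- after the upgrade, once the conversion function exists with the new signature
CREATE CONVERSION myconv FOR 'LATIN1' TO 'UTF8' FROM my_latin1_to_utf8;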

-- 
John Naylor
EDB: http://www.enterprisedb.com


Postgres Spark connector

2020-12-23 Thread Zhihong Yu
Hi,
I searched for Postgres support in Apache Spark.
I found Spark doc related to JDBC.

I wonder if the community is aware of Spark connector for Postgres
(hopefully open source) where predicate involving jsonb columns can be
pushed down.

Your input is appreciated.

Thanks


Re: proposal - support tsv output format for psql

2020-12-23 Thread Pavel Stehule
On Wed, 23 Dec 2020 at 17:38, Andreas Karlsson wrote:

> On 12/23/20 7:18 AM, Pavel Stehule wrote:> I am playing with clipboard
> on Linux, and I found a way, how to redirect
> > psql output to clipboard via wl-copy or xclip and then to Libre Office.
> > Now it looks so best format is tsv
> >
> > select * from pg_database \g (format=tsv) | wl-paste -t
> > application/x-libreoffice-tsvc
> >
> > Implementation of tsv format should not be hard.
> >
> > What do you think about this?
>
> I wonder if it would not make more sense to add support for text/csv to
> LibreOffice's clipboard code rather than implementing tsv in PostgreSQL.
> The tsv format is even less clearly defined than the csv format which at
> least has one RFC even if many homebrew their own versions of it.
>

Yes, but that is outside my competence. However, after some research, I don't
think this support is necessary, because the tsv implemented by LO is much
more like CSV with \t as the delimiter.

Regards

Pavel




>
> Andreas
>
>


Re: Discarding DISCARD ALL

2020-12-23 Thread Simon Riggs
On Wed, 23 Dec 2020 at 15:19, Andreas Karlsson  wrote:
>
> On 12/23/20 3:47 PM, Vladimir Sitnikov wrote:
> > Simon>It seems strange to me that we put this work onto the pooler, forcing
> > Simon>poolers to repeatedly issue the same command
> >
> > What if poolers learn to manage connections and prepared statements better?
> > Then poolers won't have to reset the session every time, and everyone wins.
>
> While that is be possible to implement since some client libraries
> implement this in their pools (e.g. Sequel for Ruby) this patch would
> help connection poolers which are not aware of prepared statements, for
> example PgBouncer, so it is worthwhile as long as there are connection
> poolers out there which are not aware of prepared statements. And even
> the connection poolers which are aware might want to automatically drop
> temporary tables and reset GUCs. So I do not think that this feature
> would become pointless even if people write a patch for PgBouncer.

The whole premise of the patch is tighter integration, with the server
providing the facilities that poolers need.

The patch can be enhanced to do whatever else we agree is desirable.

Do we need something like DISCARD ALL EXCEPT PREPARED STATEMENTS;  ??

If there are different requirements for each pooler, what are they?

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: proposal - support tsv output format for psql

2020-12-23 Thread Andreas Karlsson
On 12/23/20 7:18 AM, Pavel Stehule wrote:> I am playing with clipboard 
on Linux, and I found a way, how to redirect
psql output to clipboard via wl-copy or xclip and then to Libre Office. 
Now it looks so best format is tsv


select * from pg_database \g (format=tsv) | wl-paste -t 
application/x-libreoffice-tsvc


Implementation of tsv format should not be hard.

What do you think about this?


I wonder if it would not make more sense to add support for text/csv to 
LibreOffice's clipboard code rather than implementing tsv in PostgreSQL. 
The tsv format is even less clearly defined than the csv format which at 
least has one RFC even if many homebrew their own versions of it.


Andreas





Re: MultiXact\SLRU buffers configuration

2020-12-23 Thread Gilles Darold

On 13/12/2020 at 18:24, Andrey Borodin wrote:



On 13 Dec 2020, at 14:17, Gilles Darold wrote:

I've done more review on these patches.

Thanks, Gilles! I'll incorporate all your fixes to patchset.
Can you also benchmark conditional variable sleep? The patch "Add conditional 
variable to wait for next MultXact offset in corner case"?
The problem manifests on Standby when Primary is heavily loaded with 
MultiXactOffsetControlLock.

Thanks!

Best regards, Andrey Borodin.


Hi Andrey,


Sorry for the response delay; we have run several other tests trying to
figure out the performance gain per patch, but unfortunately we get
very erratic results. With the same parameters and patches, the test
doesn't return the same results depending on the day or the hour of the
day. This is very frustrating, and I suppose it is related to the
Azure architecture. The only thing I am sure of is that we had the
best performance results with all patches and


   multixact_offsets_slru_buffers = 256
   multixact_members_slru_buffers = 512
   multixact_local_cache_entries = 4096

but I cannot say whether all or only some of the patches improve
performance. My feeling is that the performance gain from patches 1
(shared lock) and 3 (conditional variable) does not amount to much
compared with the gain from just tuning the multixact buffers.
That is when the multixact contention is observed, though perhaps they are
only delaying the contention. It's all the more frustrating because we had a
test case to reproduce the contention, but apparently not the architecture.



Can't do much more at this point.


Best regards,

--
Gilles Darold
LzLabs GmbH
http://www.lzlabs.com/



Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alexey Kondratov

On 2020-12-23 10:38, Michael Paquier wrote:

On Tue, Dec 22, 2020 at 03:15:37PM -0600, Justin Pryzby wrote:

Now, I really think utility.c ought to pass in a pointer to a local
ReindexOptions variable to avoid all the memory context, which is
unnecessary and prone to error.


Yeah, it sounds right to me to just bite the bullet and do this
refactoring, limiting the manipulations of the options as much as
possible across contexts.  So +1 from me to merge 0001 and 0002
together.

I have adjusted a couple of comments and simplified the code in utility.c
a bit more.  I think that this is committable, but let's wait more than a
couple of days for Alvaro and Peter first.  This is a period of vacations
for a lot of people, and there is no point in applying something that would
need more work at the end.  Using hex values for the flags with bitmasks is
the right conclusion IMO, but we are not alone.



After eyeballing the patch I can add that we should alter this comment:

int options;    /* bitmask of VacuumOption */

as you are going to replace VacuumOption with VACOPT_* defs. So this 
should say:


/* bitmask of VACOPT_* */

Also I have found naming to be a bit inconsistent:

 * we have ReindexOptions, but VacuumParams
 * and ReindexOptions->flags, but VacuumParams->options

And the last one, you have used bits32 for Cluster/ReindexOptions, but 
left VacuumParams->options as int. Maybe we should also change it to 
bits32 for consistency?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alexey Kondratov

On 2020-12-23 08:22, Justin Pryzby wrote:

On Tue, Dec 22, 2020 at 03:22:19PM -0800, Zhihong Yu wrote:

Justin:
For reindex_index() :

+   if (options->tablespaceOid == MyDatabaseTableSpace)
+   options->tablespaceOid = InvalidOid;
...
+   oldTablespaceOid = iRel->rd_rel->reltablespace;
+   if (set_tablespace &&
+   (options->tablespaceOid != oldTablespaceOid ||
+   (options->tablespaceOid == MyDatabaseTableSpace &&
OidIsValid(oldTablespaceOid

I wonder why the options->tablespaceOid == MyDatabaseTableSpace clause
appears again in the second if statement.
Since the first if statement would assign InvalidOid
to options->tablespaceOid when the first if condition is satisfied.


Good question.  Alexey mentioned on Sept 23 that he added the first stanza
to avoid storing the DB's tablespace OID (rather than InvalidOid).

I think the 2nd half of the "or" is unnecessary since the first stanza was
added, setting options->tablespaceOid = InvalidOid.
If requesting to move to the DB's default tablespace, it'll now hit the first
part of the OR:

+   (options->tablespaceOid != oldTablespaceOid ||

Without the first stanza setting, it would've hit the 2nd condition:

+   (options->tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid

which means: "user requested to move a table to the DB's default tblspace, and
it was previously on a nondefault space".

So I think we can drop the 2nd half of the OR.  Thanks for noticing.


Yes, I had not noticed that we would have already assigned
tablespaceOid to InvalidOid in this case. Back in v7 we were doing
this assignment a bit later, so the check could make sense then, but now it
seems to be redundant. For some reason I have mixed up these refactorings
separated by a dozen versions...



Thanks
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: proposal - support tsv output format for psql

2020-12-23 Thread Pavel Stehule
On Wed, 23 Dec 2020 at 16:54, Bruce Momjian wrote:

> On Wed, Dec 23, 2020 at 04:39:48PM +0100, Pavel Stehule wrote:
> >
> >
> > On Wed, 23 Dec 2020 at 16:24, Tom Lane wrote:
> >
> > Bruce Momjian  writes:
> > > On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
> > >> Implementation of tsv format should not be hard.
> > >> What do you think about this?
> >
> > > How would you handle tabs in the data?
> >
> > The main problem with csv is the severe lack of standardization
> > around what to do with corner cases (delimiters in data, empty
> > fields, etc).  Seems like "tsv" would be even worse.  Is there
> > an actual standard anywhere?
> >
> >
> > Probably not - there are just common conventions.
> >
> > standard disallows tab chars inside
> >
> > https://www.iana.org/assignments/media-types/text/tab-separated-values
>
> Is it too awkward to throw an error if there is a tab?
>

That is a good question. Maybe more practical would be some form of escaping (in
conformance with the COPY tsv format) or using CSV rules for special chars (which
is what LO does). But raising an error can be correct too. It is true that typical
data should not contain tabs. For example, LO Calc can hold tabs in a cell and can
export such data, but the tabs in cells are not visible, and importing data with
tabs is broken (so data with tabs are probably not common).


> --
>   Bruce Momjian  https://momjian.us
>   EnterpriseDB https://enterprisedb.com
>
>   The usefulness of a cup is in its emptiness, Bruce Lee
>
>


Re: proposal - support tsv output format for psql

2020-12-23 Thread Andrew Dunstan


On 12/23/20 10:13 AM, Pavel Stehule wrote:
>
>
> After this second check, I think the new format is not necessary,
> because tsv in LO is not tsv, but it is cracked csv
>
>


I agree, I don't think this is a format we need to spend much time on.


If you set the quote to an unlikely character like ^A, the escape
character to '\', and the delimiter to TAB, COPY in CSV mode will give
you something pretty close in many cases. Embedded newlines will still
give you trouble.
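
A sketch of that COPY invocation; the table name foo is illustrative, and ^A
is written here as the byte E'\x01':

-- ^A as the quote character, backslash escape, tab delimiter
COPY foo TO STDOUT WITH
    (FORMAT csv, DELIMITER E'\t', QUOTE E'\x01', ESCAPE E'\\');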


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: proposal - support tsv output format for psql

2020-12-23 Thread Bruce Momjian
On Wed, Dec 23, 2020 at 04:39:48PM +0100, Pavel Stehule wrote:
> 
> 
> On Wed, 23 Dec 2020 at 16:24, Tom Lane wrote:
> 
> Bruce Momjian  writes:
> > On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
> >> Implementation of tsv format should not be hard.
> >> What do you think about this?
> 
> > How would you handle tabs in the data?
> 
> The main problem with csv is the severe lack of standardization
> around what to do with corner cases (delimiters in data, empty
> fields, etc).  Seems like "tsv" would be even worse.  Is there
> an actual standard anywhere?
> 
> 
> Probably not - there are just common conventions.
> 
> standard disallows tab chars inside
> 
> https://www.iana.org/assignments/media-types/text/tab-separated-values

Is it too awkward to throw an error if there is a tab?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: proposal - support tsv output format for psql

2020-12-23 Thread Pavel Stehule
On Wed, 23 Dec 2020 at 16:24, Tom Lane wrote:

> Bruce Momjian  writes:
> > On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
> >> Implementation of tsv format should not be hard.
> >> What do you think about this?
>
> > How would you handle tabs in the data?
>
> The main problem with csv is the severe lack of standardization
> around what to do with corner cases (delimiters in data, empty
> fields, etc).  Seems like "tsv" would be even worse.  Is there
> an actual standard anywhere?
>

Probably not - there are just common conventions.

The standard disallows tab chars inside fields:

https://www.iana.org/assignments/media-types/text/tab-separated-values

Regards

Pavel


> regards, tom lane
>


Re: proposal - support tsv output format for psql

2020-12-23 Thread Tom Lane
Bruce Momjian  writes:
> On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
>> Implementation of tsv format should not be hard.
>> What do you think about this?

> How would you handle tabs in the data?

The main problem with csv is the severe lack of standardization
around what to do with corner cases (delimiters in data, empty
fields, etc).  Seems like "tsv" would be even worse.  Is there
an actual standard anywhere?

regards, tom lane




Re: Discarding DISCARD ALL

2020-12-23 Thread Andreas Karlsson

On 12/23/20 3:47 PM, Vladimir Sitnikov wrote:

Simon>It seems strange to me that we put this work onto the pooler, forcing
Simon>poolers to repeatedly issue the same command

What if poolers learn to manage connections and prepared statements better?
Then poolers won't have to reset the session every time, and everyone wins.


While that is possible to implement, since some client libraries
implement this in their pools (e.g. Sequel for Ruby), this patch would
help connection poolers which are not aware of prepared statements, for
example PgBouncer, so it is worthwhile as long as there are connection
poolers out there which are not aware of prepared statements. And even
the connection poolers which are aware might want to automatically drop
temporary tables and reset GUCs. So I do not think that this feature
would become pointless even if people write a patch for PgBouncer.



Simon>This has an additional side benefit: if we know we will clean up at
Simon>the end of the transaction, then all temp tables become effectively ON
Simon>COMMIT DROP

The ability to use the temporary tables sounds cool; however,
server-prepared statements allow improving performance significantly (both at
frontend and backend), so I would not treat
"server_reset_query=discard all" as a best practice.
That is why transaction_cleanup=on (discard everything at transaction
finish) seems to be a workaround rather than a best practice for the future.


While I know how to add support for prepared statements to a connection
pooler, it is unclear to me how one would add support for temporary
tables which are not at least ON COMMIT DELETE ROWS, since rows inserted
on one connection will only be visible on that connection.


Andreas





Re: proposal - support tsv output format for psql

2020-12-23 Thread Pavel Stehule
On Wed, 23 Dec 2020 at 15:21, Bruce Momjian wrote:

> On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
> > Hi
> >
> > I am playing with clipboard on Linux, and I found a way, how to redirect
> psql
> > output to clipboard via wl-copy or xclip and then to Libre Office. Now
> it looks
> > so best format is tsv
> >
> > select * from pg_database \g (format=tsv) | wl-paste -t application/
> > x-libreoffice-tsvc
> >
> > Implementation of tsv format should not be hard.
> >
> > What do you think about this?
>
> How would you handle tabs in the data?
>

This is a hard question.  Tabs in tsv data are handled by backslash
escaping.

Unfortunately, after some checking, LibreOffice Calc has its own format.
It is not true tsv; it looks much more like CSV with a tab as the
separator. And the current implementation is buggy. It works well only
when the data doesn't hold special chars :-/. The special chars are
deleted when data is exported from LO, and when data with special chars
is imported, the imported line can be malformed. It looks like LO should
be fixed first.

This command is working well - comma, quotes, or double quotes are working

select * from foo  \g (format=csv tuples_only=on csv_fieldsep='\t') |
wl-copy -t application/x-libreoffice-tsvc

But \n or \t cause problems due to a bug on the LO side.

After this second check, I think the new format is not necessary, because
tsv in LO is not tsv, but rather a broken CSV.

Pavel






> --
>   Bruce Momjian  https://momjian.us
>   EnterpriseDB https://enterprisedb.com
>
>   The usefulness of a cup is in its emptiness, Bruce Lee
>
>


Re: Discarding DISCARD ALL

2020-12-23 Thread Vladimir Sitnikov
Simon>It seems strange to me that we put this work onto the pooler, forcing
Simon>poolers to repeatedly issue the same command

What if poolers learn to manage connections and prepared statements better?
Then poolers won't have to reset the session every time, and everyone wins.

Simon>This has an additional side benefit: if we know we will clean up at
Simon>the end of the transaction, then all temp tables become effectively ON
Simon>COMMIT DROP

The ability to use the temporary tables sounds cool; however,
server-prepared statements allow improving performance significantly
(both at frontend and backend), so I would not treat
"server_reset_query=discard all" as a best practice.
That is why transaction_cleanup=on (discard everything at transaction
finish) seems to be a workaround rather than a best practice for the future.

Just to clarify: the patch seems to be small, so it looks harmless;
however, I would not be that enthusiastic about tying new features to
transaction_cleanup=on.

---

What do you mean by "session cleanup"?
Does that always have to be the same as "discard all"?
What if the user wants "deallocate all" behavior?
Should transaction_cleanup be an enum rather than on/off?

Vladimir


Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-23 Thread Bharath Rupireddy
On Wed, Dec 23, 2020 at 7:31 PM Fujii Masao  wrote:
> I agree to make pgfdw_xact_callback() close the connection when
> entry->invalidated == true. But I think that it's better to get rid of
> have_invalid_connections flag and make pgfdw_inval_callback() close
> the connection immediately if entry->xact_depth == 0, to avoid unnecessary
> scan of the hashtable during COMMIT of transaction not accessing to
> foreign servers. Attached is the POC patch that I'm thinking. Thought?

We could do that way as well. It seems okay to me. Now the disconnect
code is spread in pgfdw_inval_callback() and pgfdw_xact_callback().
There's also less burden on pgfdw_xact_callback() to close a lot of
connections on a single commit. The behaviour is like this - If
entry->xact_depth == 0, then the entries wouldn't have got any
connection in the current txn so they can be immediately closed in
pgfdw_inval_callback() and pgfdw_xact_callback() can exit immediately
as xact_got_connection is false. If entry->xact_depth > 0 which means
that probably pgfdw_inval_callback() came from a sub txn, we would
have got a connection in the txn i.e. xact_got_connection becomes true
due to which it can get invalidated in pgfdw_inval_callback() and get
closed in pgfdw_xact_callback() at the end of the txn.

And there's no chance of entry->xact_depth > 0 and xact_got_connection false.

I think we need to change the comment before pgfdw_inval_callback() a bit:

 * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
 * mark connections depending on that entry as needing to be remade.
 * We can't immediately destroy them, since they might be in the midst of
 * a transaction, but we'll remake them at the next opportunity.

to

 * After a change to a pg_foreign_server or pg_user_mapping catalog entry,
 * try to close the cached connections associated with them if they are not
 * in the midst of a transaction, otherwise mark them as invalidated. We will
 * destroy the invalidated connections in pgfdw_xact_callback() at the end of
 * the main xact. Closed connections will be remade at the next opportunity.

Any thoughts on the earlier point in [1] related to coverage of a test case
(which becomes unnecessary with this patch)?

[1] - 
https://www.postgresql.org/message-id/CALj2ACXymb%3Dd4KeOq%2Bjnh_E6yThn%2Bcf1CDRhtK_crkj0_hVimQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Discarding DISCARD ALL

2020-12-23 Thread Simon Riggs
Currently, session poolers operating in transaction mode need to send
a "server_reset_query" which is mostly DISCARD ALL.

It seems strange to me that we put this work onto the pooler, forcing
poolers to repeatedly issue the same command, at some cost in
performance. Measuring the overhead with pgbench might miss the points
that poolers are frequently configured on different network hosts and
that monitoring tools used in production will record the DISCARD
statement. YMMV, but the overhead is measurably non-zero.

Proposal is to have a simple new parameter:
  transaction_cleanup = off (default) | on
A setting of "on" will issue the equivalent of a DISCARD ALL as soon
as the transaction has been ended by a COMMIT, ROLLBACK or PREPARE.

Poolers such as pgbouncer would then be able to connect transaction
mode pools by setting transaction_cleanup=on at time of connection,
avoiding any need to issue a server_reset_query, removing the DISCARD
ALL command from the normal execution path, while still achieving the
same thing.

This has an additional side benefit: if we know we will clean up at
the end of the transaction, then all temp tables become effectively ON
COMMIT DROP and we are able to allow temp tables in prepared
transactions. There are likely other side benefits from this
knowledge, allowing us to further tune the PostgreSQL server to the
common use case of transaction session poolers. I think it should be
possible to avoid looking for holdable portals if we are dropping them
all anyway.

Patch attached, passes make check with new tests added.

Comments welcome.

-- 
Simon Riggs    http://www.EnterpriseDB.com/


transaction_cleanup.v4.patch
Description: Binary data


Re: proposal - support tsv output format for psql

2020-12-23 Thread Bruce Momjian
On Wed, Dec 23, 2020 at 07:18:24AM +0100, Pavel Stehule wrote:
> Hi
> 
> I am playing with clipboard on Linux, and I found a way, how to redirect psql
> output to clipboard via wl-copy or xclip and then to Libre Office. Now it 
> looks
> so best format is tsv
> 
> select * from pg_database \g (format=tsv) | wl-paste -t application/
> x-libreoffice-tsvc
> 
> Implementation of tsv format should not be hard.
> 
> What do you think about this?

How would you handle tabs in the data?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Feature request: Connection string parsing for postgres_fdw

2020-12-23 Thread Eric Hanson
On Wed, Dec 23, 2020 at 5:39 AM Ashutosh Bapat 
wrote:

> https://www.postgresql.org/docs/13/libpq-connect.html#LIBPQ-PARAMKEYWORDS
> lists the parameters that postgres_fdw accepts. "dbname" can be more
> than just dbname. See
> https://www.postgresql.org/docs/13/libpq-connect.html#LIBPQ-CONNSTRING.
> And "dbname" is not in the list of exception paramters in section
> "F.33.1.1. Connection Options" at
> https://www.postgresql.org/docs/13/postgres-fdw.html#id-1.11.7.42.11.
>
> I haven't tried this myself. But this might help you.


Good idea, but according to this thread:
https://www.postgresql.org/message-id/878tjcbbgb.fsf%40ars-thinkpad
"postgres_fdw forbids usage of connection strings by passing expand_dbname
= false to PQconnectdbParams"

They discuss the reasoning here:  If it were to allow expand_dbname, people
could override username etc, variables that need to be fixed, by setting
them in the dbname connection string.  But this just seems like a bug.  It
should prioritize non-expanded variables over expanded ones.

Eric
--
http://aquameta.org/


Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-23 Thread Pavel Borisov
>
> My 0.02 €: What I think does not matter much, what committers think is the
> way to pass something. However, I do not think that such an idea would
> pass a committer:-)
>

The same idea was the reason for my proposal to make the automatic partitioning
clauses be in accordance with the existing declarative syntax (even if it
seems a little bit long to write the words "configuration (for values"):

CREATE TABLE foo(a int) PARTITION BY LIST(a) CONFIGURATION (FOR VALUES
IN (1,2),(3,4) DEFAULT PARTITION foo_def);
CREATE TABLE foo(a int) PARTITION BY HASH(a) CONFIGURATION (FOR VALUES
WITH MODULUS 3);
CREATE TABLE foo(a int) PARTITION BY RANGE(a) CONFIGURATION (FOR VALUES
FROM 1 TO 1000 INTERVAL 10 DEFAULT PARTITION foo_def)

If we want a generic (ident = value, ...) form, then we need to introduce
something different from what is already committed for manual partitioning,
which I consider worse than my proposal above. Still, other opinions are
highly valued.
-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: postgres_fdw - cached connection leaks if the associated user mapping/foreign server is dropped

2020-12-23 Thread Fujii Masao



On 2020/12/15 11:38, Bharath Rupireddy wrote:

Hi,

As discussed in [1], in postgres_fdw the cached connections to remote
servers can stay until the lifetime of the local session without
getting a chance to disconnect (connection leak), if the underlying
user mapping or foreign server is dropped in another session. Here are
few scenarios how this can happen:

Use case 1:
1) Run a foreign query in session 1 with server 1 and user mapping 1
2) Drop user mapping 1 in another session 2, an invalidation message
gets generated which will have to be processed by all the sessions
3) Run the foreign query again in session 1, at the start of txn the
cached entry gets invalidated via pgfdw_inval_callback() (as part of
invalidation message processing). Whatever may be the type of foreign
query (select, update, explain, delete, insert, analyze etc.), upon
next call to GetUserMapping() from postgres_fdw.c, the cache lookup
fails(with ERROR:  user mapping not found for "") since the user
mapping 1 has been dropped in session 2 and the query will also fail
before reaching GetConnection() where the connections associated with
the invalidated entries would have got disconnected.

So, the connection associated with invalidated entry would remain
until the local session exits.
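
A rough reproduction of use case 1 in SQL, with illustrative object names
(srv1, ft1) that are not taken from the patch:

-- session 1: any foreign query caches a connection for (server 1, user mapping 1)
SELECT count(*) FROM ft1;

-- session 2: drop the user mapping; an invalidation message is generated
DROP USER MAPPING FOR CURRENT_USER SERVER srv1;

-- session 1: the next foreign query fails with "user mapping not found ...",
-- but the already-established remote connection stays cached until this
-- session exits
SELECT count(*) FROM ft1;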

Use case 2:
1) Run a foreign query in session 1 with server 1 and user mapping 1
2) Try to drop foreign server 1, then we would not be allowed because
of dependency. Use CASCADE so that dependent objects i.e. user mapping
1 and foreign tables get dropped along with foreign server 1.
3) Run the foreign query again in session 1, at the start of txn, the
cached entry gets invalidated via pgfdw_inval_callback() and the query
fails because there is no foreign table.

Note that the remote connection remains open in session 1 until the
local session exits.

To solve the above connection leak problem, it looks like the right
place to close all the invalid connections is pgfdw_xact_callback(),
once registered, which gets called at the end of every txn in the
current session(by then all the sub txns also would have been
finished). Note that if there are too many invalidated entries, then
the following txn has to close all of them, but that's okay than
having leaked connections and it's a one time job for the following
one txn.

Attaching a patch for the same.

Thoughts?


Thanks for making the patch!

I agree to make pgfdw_xact_callback() close the connection when
entry->invalidated == true. But I think that it's better to get rid of
have_invalid_connections flag and make pgfdw_inval_callback() close
the connection immediately if entry->xact_depth == 0, to avoid unnecessary
scan of the hashtable during COMMIT of transaction not accessing to
foreign servers. Attached is the POC patch that I'm thinking. Thought?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 66581e5414..dcd14fa4b5 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -940,12 +940,14 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 			entry->xact_depth = 0;
 
 			/*
-			 * If the connection isn't in a good idle state, discard it to
-			 * recover. Next GetConnection will open a new connection.
+			 * If the connection isn't in a good idle state or it is marked as
+			 * invalid, then discard it to recover. Next GetConnection will open a
+			 * new connection.
 			 */
 			if (PQstatus(entry->conn) != CONNECTION_OK ||
 				PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-				entry->changing_xact_state)
+				entry->changing_xact_state ||
+				entry->invalidated)
 			{
 				elog(DEBUG3, "discarding connection %p", entry->conn);
 				disconnect_pg_server(entry);
@@ -1102,7 +1104,15 @@ pgfdw_inval_callback(Datum arg, int cacheid, uint32 hashvalue)
 			 entry->server_hashvalue == hashvalue) ||
 			(cacheid == USERMAPPINGOID &&
 			 entry->mapping_hashvalue == hashvalue))
-			entry->invalidated = true;
+		{
+			if (entry->xact_depth == 0)
+			{
+				elog(DEBUG3, "discarding connection %p", entry->conn);
+				disconnect_pg_server(entry);
+			}
+			else
+				entry->invalidated = true;
+		}
 	}
 }
 


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread tsunakawa.ta...@fujitsu.com
From: Jamison, Kirk/ジャミソン カーク 
> It compiles. Passes the regression tests too.
> Your feedbacks are definitely welcome.

The code looks correct and has become even more compact.  Remains ready for
committer.


Regards
Takayuki Tsunakawa



Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-23 Thread Fabien COELHO



Fabien, do you consider it possible to change the syntax of declarative
partitioning too?


My 0.02 €: What I think does not matter much, what committers think is the 
way to pass something. However, I do not think that such an idea would 
pass a committer:-)


It is problematic as it is already committed but also is very tempting 
to have the same type of syntax both in automatic partitioning and in 
manual (PARTITION OF...)


I think that if a "common" syntax, for a given meaning of common, can be 
thought of, and without breaking backward compatibility, then there may be 
an argument to provide such a syntax, but I would not put too much energy 
into that if I were you.


I see 3 cases:

 - partition declaration but no actual table generated, the current
   version.

 - partition declaration with actual sub-tables generated, eg for hash
   where it is pretty straightforward to know what would be needed, or for
   a bounded range.

 - partition declaration without generated table, but they are generated
   on demand, when needed, for a range one may want weekly or monthly
   without creating tables in advance, esp. if it is unbounded.

ISTM that the syntax should be clear and if possible homogeneous for all 
three use cases, even if they are not implemented yet. It should also 
allow easy extensibility, hence something without a strong syntax, 
key/value pairs to be interpreted later.
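
Purely as a hypothetical sketch of such a loosely-structured key/value form
(none of this syntax exists or is being proposed verbatim here):

-- hypothetical syntax, for illustration only
CREATE TABLE foo (a int) PARTITION BY RANGE (a)
    CONFIGURATION (lower = 1, upper = 1000, interval = 10,
                   default_partition = foo_def);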


--
Fabien.

Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-23 Thread Bharath Rupireddy
On Wed, Dec 23, 2020 at 6:01 PM Michael Paquier  wrote:
> On Tue, Dec 22, 2020 at 03:12:15PM +0530, Bharath Rupireddy wrote:
> > On Tue, Dec 22, 2020 at 2:07 PM Michael Paquier  wrote:
> >> Note: I'd like to think that we could choose a better name for
> >> CheckRelExistenceInCTAS().
> >
> > I changed it to IsCTASRelCreationAllowed() and attached a v5 patch.
> > Please let me know if this is okay.
>
> After thinking about that, using "CTAS" while other routines in the
> same area use "CreateTableAs" looks inconsistent to me.  So I have
> come up with CreateTableAsRelExists() as name.

I think CreateTableAsRelExists() can return true if the relation
already exists and false otherwise, to keep in sync with the function
name. I updated this and attached the v7 patch.

> As the same time, I have looked at the git history to note 9bd27b7
> where we had better not give an empty output for non-text formats.  So
> I'd like to think that it makes sense to use ExplainDummyGroup() if
> the relation exists with IF NOT EXISTS, keeping some consistency.
>
> What do you think?

+1. Shall we add some test cases (with the xml, yaml, and json formats, as is
currently done in explain.sql) to cover that? We can have the
explain_filter() function remove the unstable parts of the output;
it looks something like below. If yes, please let me know and I can add
them to matview and select_into.
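
For reference, a minimal sketch of such an explain_filter() helper, loosely
modeled on the one in the explain.sql regression test (simplified here, not
the exact test code):

CREATE FUNCTION explain_filter(text) RETURNS SETOF text
LANGUAGE plpgsql AS
$$
DECLARE
    ln text;
BEGIN
    FOR ln IN EXECUTE $1 LOOP
        -- mask run-to-run variable numbers (costs, timings, row counts)
        ln := regexp_replace(ln, '\d+\.\d+', 'N.N', 'g');
        ln := regexp_replace(ln, '\d+', 'N', 'g');
        RETURN NEXT ln;
    END LOOP;
END;
$$;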

postgres=#  select explain_filter('explain(analyze, format xml) create
table if not exists t1 as select 1;');
NOTICE:  relation "t1" already exists, skipping
explain_filter
---
 http://www.postgresql.org/N/explain;>+
   +
 
(1 row)

postgres=#  select explain_filter('explain(analyze, format yaml)
create table if not exists t1 as select 1;');
NOTICE:  relation "t1" already exists, skipping
   explain_filter
-
 - "CREATE TABLE AS"
(1 row)

postgres=#  select explain_filter('explain(analyze, format json)
create table if not exists t1 as select 1;');
NOTICE:  relation "t1" already exists, skipping
   explain_filter
-
 [  +
   "CREATE TABLE AS"+
 ]
(1 row)

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v7-0001-Fail-Fast-In-CTAS-CMV-If-Relation-Already-Exists.patch
Description: Binary data


Re: libpq compression

2020-12-23 Thread Daniil Zakhlystov
Hi!

I’ve contacted Yann Collet (developer of ZSTD) and told him about our 
discussion. Here is his comment:

> Hi Daniil
>   • Is this an expected behavior of ZSTD to consume more memory during 
> the decompression of data that was compressed with a high compression ratio?
>  
> I assume that the target application is employing the streaming mode.
> In which case, yes, the memory usage is directly dependent on the Window 
> size, and the Window size tend to increase with compression level.
>  
>   • how we can restrict the maximal memory usage during decompression?
>  
> There are several ways.
>  
>   • From a decompression perspective
>  
> the first method is to _not_ use the streaming mode,
> and employ the direct buffer-to-buffer compression instead,
> like ZSTD_decompress() for example.
> In which case, the decompressor will not need additional memory, it will only 
> employ the provided buffers.
>  
> This however entirely depends on the application and can therefore be 
> unpractical.
> It’s fine when decompressing small blocks, it’s not when decompressing 
> gigantic streams of data.
>  
> The second method is more straightforward : set a limit to the window size 
> that the decoder accepts to decode.
> This is the ZSTD_d_windowLogMax parameter, documented here : 
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L536
>  
> This can be set to any arbitrary power of 2 limit.
> A frame requiring more than this value will be rejected by the decoder, 
> precisely to avoid sustaining large memory requirements.
>  
> Lastly, note that, in presence of a large window size requirement, the 
> decoder will allocate a correspondingly large buffer,
> but will not necessarily use it.
> For example, if a frame generated with streaming mode at level 22 declares a 
> 128 MB window size, but effectively only contains ~200 KB of data,
> the buffer will only use 200 KB.
> The rest of the buffer is “allocated” from an address space perspective but 
> is not “used” and therefore does not really occupy physical RAM space.
> This is a capability of all modern OS and contributes to minimizing the 
> impact of outsized window sizes.
>  
>  
>   • From a compression perspective
>  
> Knowing the set limitation, the compressor should be compliant, and avoid 
> going above the threshold.
> One way to do it is to limit the compression level to those which remain 
> below the set limit.
> For example, if the limit is 8 MB, all levels <= 19 will be compatible, as 
> they require 8 MB max (and generally less).
>  
> Another method is to manually set a window size, so that it doesn’t exceed 
> the limit.
> This is the ZSTD_c_windowLog parameter, which is documented here : 
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L289
>  
> Another complementary way is to provide the source size when it’s known.
> By default, the streaming mode doesn’t know the input size, since it’s 
> supposed to receive it in multiple blocks.
> It will only discover it at the end, by which point it’s too late to use this 
> information in the frame header.
> This can be solved, by providing the source size upfront, before starting 
> compression.
> This is the function ZSTD_CCtx_setPledgedSrcSize(), documented here : 
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L483
> Of course, then the total amount of data in the frame must be exact, 
> otherwise it’s detected as an error.
>  
> Taking again the previous example of compressing 200 KB with level 22, on 
> knowing the source size,
> the compressor will resize the window to fit the input, and therefore employ 
> 200 KB, instead of 128 MB.
> This information will be present in the header, and the decompressor will 
> also be able to use 200 KB instead of 128 MB.
> Also, presuming the decompressor has a hard limit set to 8 MB (for example), 
> the header using a 200 KB window size will pass and be properly decoded, 
> while the header using 128 MB will be rejected.
> This method is cumulative with the one setting a manual window size (the 
> compressor will select the smallest of both).
>  
>  
> So yes, memory consumption is a serious topic, and there are tools in the 
> `zstd` library to deal with it.
>  
>  
> Hope it helps
>  
> Best Regards
>  
> Yann Collet

After reading Yann’s advice, I repeated yesterday's single-directional
decompression benchmarks with ZSTD_d_windowLogMax set to 23, i.e. an 8 MB max
window size.

Total committed memory (Committed_AS) size for ZSTD compression levels 1-19 was 
pretty much the same:

Committed_AS baseline (size without any benchmark running) - 42.4 GiB

Scenario          Committed_AS    Committed_AS - Baseline
no compression    44,36 GiB       1,05 GiB
ZSTD:1            45,03 GiB       1,06 GiB
ZSTD:5            46,06 GiB       1,09 GiB
ZSTD:9            46,00 GiB       1,08 GiB
ZSTD:13           47,46 GiB       1,12 GiB
ZSTD:17

Re: Feature request: Connection string parsing for postgres_fdw

2020-12-23 Thread Ashutosh Bapat
On Wed, Dec 23, 2020 at 6:35 PM Eric Hanson  wrote:
>
> I'm trying to store connection to postgres_fdw in the database  I want to be 
> able to store the full breadth of connection styles and all the different 
> types of connections that libpq supports.  But having some troubles.
>
> Postgres_fdw wants options passed into CREATE SERVER, all broken out into 
> separate variables, but this format is neither a connection URI, nor a 
> keyword/value string.  One could imagine writing some parser that extracts 
> all the variables and honors collisions in the same way libpq does (like when 
> both the path and the query string specify a port), but I really don't want 
> to recreate that.
>
> It would be really nice if I could tap into libpq's connection string parser 
> from SQL and use it to extract all the variables in a given connection 
> string, so that I can then pipe those into CREATE SERVER options.  Either 
> that, or teach postgres_fdw how to consume standard connection strings.
>
> Suggestions?  Is this worthy of a feature request?
>

https://www.postgresql.org/docs/13/libpq-connect.html#LIBPQ-PARAMKEYWORDS
lists the parameters that postgres_fdw accepts. "dbname" can be more
than just dbname. See
https://www.postgresql.org/docs/13/libpq-connect.html#LIBPQ-CONNSTRING.
And "dbname" is not in the list of exception paramters in section
"F.33.1.1. Connection Options" at
https://www.postgresql.org/docs/13/postgres-fdw.html#id-1.11.7.42.11.

I haven't tried this myself. But this might help you.

-- 
Best Wishes,
Ashutosh Bapat




Re: Movement of restart_lsn position movement of logical replication slots is very slow

2020-12-23 Thread Jammie
Thanks Amit for the response.
Two things :
1) In our observation via PSQL the advance command as well do not move the
restart_lsn immediately. It is similar to our approach that use the
confirmed_flush_lsn via stream
2) I am ok to understand the point that we are not reading from the stream
so we might be facing the issue. But the question is why we are able to
move the restart_lsn most of the time by updating the confirmed_flush_lsn
via pgJDBC. But only occasionally it lags behind too far behind.

Regards
Shailesh



On Tue, Dec 15, 2020 at 6:30 PM Amit Kapila  wrote:

> On Tue, Dec 15, 2020 at 11:00 AM Jammie 
> wrote:
> >
> > Thanks Amit for the response
> >
> > We are using pgJDBC sample program here
> > https://jdbc.postgresql.org/documentation/head/replication.html
> >
> > the setFlushLSN is coming from the pgJDBC only.
> >
> > git hub for APIs of pgJDBC methods available.
> >
> > https://github.com/pgjdbc/pgjdbc
> >
> > The second slot refers to "private" slot.
> >
> > So ""we are not doing reading from the stream' ==> It means that we are
> having readPending call only from the shared slot then we get the
> lastReceivedLSN() from stream and
> > send it back to stream as confirmed_flush_lsn for both private and
> shared slot. We dont do readPending call to private slot. we will use
> private slot only when we dont have choice. It is kind of reserver slot for
> us.
> >
>
> I think this (not performing read/decode on the private slot) could be
> the reason why it lagging behind. If you want to use as a reserve slot
> then you probably want to at least perform
> pg_replication_slot_advance() to move it to the required position. The
> restart_lsn won't move unless you read/decode from that slot.
>
> --
> With Regards,
> Amit Kapila.
>


Feature request: Connection string parsing for postgres_fdw

2020-12-23 Thread Eric Hanson
I'm trying to store connections to postgres_fdw in the database. I want to
be able to store the full breadth of connection styles and all the
different types of connections that libpq supports, but I'm having some
trouble.

Postgres_fdw wants options passed into CREATE SERVER, all broken out into
separate variables, but this format is neither a connection URI, nor a
keyword/value string.  One could imagine writing some parser that extracts
all the variables and honors collisions in the same way libpq does (like
when both the path and the query string specify a port), but I really don't
want to recreate that.

It would be really nice if I could tap into libpq's connection string
parser from SQL and use it to extract all the variables in a given
connection string, so that I can then pipe those into CREATE SERVER
options.  Either that, or teach postgres_fdw how to consume standard
connection strings.

Suggestions?  Is this worthy of a feature request?

Thanks,
Eric
--
http://aquameta.org/


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread k.jami...@fujitsu.com
On Wed, December 23, 2020 5:57 PM (GMT+9), Amit Kapila wrote:
> >
> > At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com"
> >  wrote in
> > > From: Amit Kapila 
> > > > + /* Get the number of blocks for a relation's fork */ block[i][j]
> > > > + = smgrnblocks(smgr_reln[i], j, );
> > > > +
> > > > + if (!cached)
> > > > + goto buffer_full_scan;
> > > >
> > > > Why do we need to use goto here? We can simply break from the loop
> > > > and then check if (cached && nBlocksToInvalidate <
> > > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid
> goto
> > > > if possible without much complexity.
> > >
> > > That's because two for loops are nested -- breaking there only exits
> > > the inner loop.  (I thought the same as you at first... And I
> > > understand some people have an allergy to goto, I think modest use of
> > > goto makes the code readable.)
> >
> > I don't strongly oppose to goto's but in this case the outer loop can
> > break on the same condition with the inner loop, since cached is true
> > whenever the inner loop runs to the end. It is needed to initialize
> > the variable cache with true, instead of false, though.
> >
> 
> +1. I think it is better to avoid goto here as it can be done without
> introducing any complexity or making code any less readable.

I also do not mind, so I have removed the goto and followed the advice
of all reviewers. It works fine in the latest attached patch 0003.

Attached herewith are the sets of patches. 0002 & 0003 have the following
changes:

1. I have removed the modifications in smgrnblocks(). So the modifications of 
other functions that uses smgrnblocks() in the previous patch versions were
also reverted.
2. Introduced a new API smgrnblocks_cached() instead, which returns either
a cached size for the specified fork or InvalidBlockNumber.
Since InvalidBlockNumber is used, I think it is logical not to use the
additional boolean parameter "cached" in the function, as it would be redundant.
(In 0003, I only kept "cached" as a local boolean variable as the trick for
avoiding the goto.)
This function is called both in DropRelFileNodeBuffers() and 
DropRelFileNodesAllBuffers().
3. Modified some minor comments from the patch and commit logs.

It compiles. Passes the regression tests too.
Your feedbacks are definitely welcome.

Regards,
Kirk Jamison


v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch
Description:  v37-0001-Prevent-invalidating-blocks-in-smgrextend-during.patch


v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch
Description:  v37-0002-Optimize-DropRelFileNodeBuffers-for-recovery.patch


v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch
Description:  v37-0003-Optimize-DropRelFileNodesAllBuffers-in-recovery.patch


Re: Deadlock between backend and recovery may not be detected

2020-12-23 Thread Fujii Masao



On 2020/12/23 19:28, Masahiko Sawada wrote:

On Tue, Dec 22, 2020 at 11:58 PM Fujii Masao
 wrote:




On 2020/12/22 20:42, Fujii Masao wrote:



On 2020/12/22 10:25, Masahiko Sawada wrote:

On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao  wrote:




On 2020/12/17 2:15, Fujii Masao wrote:



On 2020/12/16 23:28, Drouvot, Bertrand wrote:

Hi,

On 12/16/20 2:36 PM, Victor Yegorov wrote:




Wed, 16 Dec 2020 at 13:49, Fujii Masao <masao.fu...@oss.nttdata.com>:

  After doing this procedure, you can see the startup process and backend
  wait for the table lock each other, i.e., deadlock. But this deadlock 
remains
  even after deadlock_timeout passes.

  This seems a bug to me.


+1



  > * Deadlocks involving the Startup process and an ordinary backend 
process
  > * will be detected by the deadlock detector within the ordinary backend.

  The cause of this issue seems that ResolveRecoveryConflictWithLock() that
  the startup process calls when recovery conflict on lock happens doesn't
  take care of deadlock case at all. You can see this fact by reading the 
above
  source code comment for ResolveRecoveryConflictWithLock().

  To fix this issue, I think that we should enable STANDBY_DEADLOCK_TIMEOUT
  timer in ResolveRecoveryConflictWithLock() so that the startup process can
  send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the backend.
  Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
  the backend should check whether the deadlock actually happens or not.
  Attached is the POC patch implimenting this.


good catch!

I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT shouldn't be 
set in ResolveRecoveryConflictWithLock() too (it is already set in 
ResolveRecoveryConflictWithBufferPin()).

So + 1 to consider this as a bug and for the way the patch proposes to fix it.


Thanks Victor and Bertrand for agreeing!
Attached is the updated version of the patch.


Attached is v3 of the patch. Could you review this version?

While the startup process is waiting for recovery conflict on buffer pin,
it repeats sending the request for deadlock check to all the backends
every deadlock_timeout. This may increase the workload in the startup
process and backends, but since this is the original behavior, the patch
doesn't change that. Also in practice this may not be so harmful because
the period that the buffer is kept pinned is basically not so long.



@@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
   */
  ProcWaitForSignal(PG_WAIT_BUFFER_PIN);

+   if (got_standby_deadlock_timeout)
+   {
+   /*
+* Send out a request for hot-standby backends to check themselves for
+* deadlocks.
+*
+* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+* to be signaled by UnpinBuffer() again and send a request for
+* deadlocks check if deadlock_timeout happens. This causes the
+* request to continue to be sent every deadlock_timeout until the
+* buffer is unpinned or ltime is reached. This would increase the
+* workload in the startup process and backends. In practice it may
+* not be so harmful because the period that the buffer is kept pinned
+* is basically no so long. But we should fix this?
+*/
+   SendRecoveryConflictWithBufferPin(
+
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+   got_standby_deadlock_timeout = false;
+   }
+

Since SendRecoveryConflictWithBufferPin() sends the signal to all
backends every backend who is waiting on a lock at ProcSleep() and not
holding a buffer pin blocking the startup process will end up doing a
deadlock check, which seems expensive. What is worse is that the
deadlock will not be detected because the deadlock involving a buffer
pin isn't detected by CheckDeadLock(). I thought we can replace
PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
backend who has a buffer pin blocking the startup process and not
waiting on a lock is also canceled after deadlock_timeout. We can have
the backend return from RecoveryConflictInterrupt() when it received
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN and is not waiting on any lock,
but it’s also not good because we cannot cancel the backend after
max_standby_streaming_delay that has a buffer pin blocking the startup
process. So I wonder if we can have a new signal. When the backend
received this signal it returns from RecoveryConflictInterrupt()
without deadlock check either if it’s not waiting on any lock or if it
doesn’t have a buffer pin blocking the startup process. Otherwise it's
cancelled.


Thanks for pointing out that issue! Using a new signal is one idea. Another
idea is

Re: Fail Fast In CTAS/CMV If Relation Already Exists To Avoid Unnecessary Rewrite, Planning Costs

2020-12-23 Thread Michael Paquier
On Tue, Dec 22, 2020 at 03:12:15PM +0530, Bharath Rupireddy wrote:
> On Tue, Dec 22, 2020 at 2:07 PM Michael Paquier  wrote:
>> Note: I'd like to think that we could choose a better name for
>> CheckRelExistenceInCTAS().
> 
> I changed it to IsCTASRelCreationAllowed() and attached a v5 patch.
> Please let me know if this is okay.

After thinking about that, using "CTAS" while other routines in the
same area use "CreateTableAs" looks inconsistent to me.  So I have
come up with CreateTableAsRelExists() as the name.

At the same time, I have looked at the git history and noted 9bd27b7,
where we decided we had better not give an empty output for non-text
formats.  So I'd like to think that it makes sense to use ExplainDummyGroup()
if the relation exists with IF NOT EXISTS, keeping some consistency.
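Roughly, what I have in mind for ExplainOneUtility() looks like the sketch
below.  Treat it as an illustration only: the label string passed to
ExplainDummyGroup() is not settled, and CreateTableAsRelExists() is the new
routine from the attached patch.

    if (IsA(utilityStmt, CreateTableAsStmt))
    {
        CreateTableAsStmt *ctas = (CreateTableAsStmt *) utilityStmt;

        if (!CreateTableAsRelExists(ctas))
        {
            /*
             * Relation already exists and IF NOT EXISTS was given: nothing to
             * plan or execute, but emit a dummy group so that non-text formats
             * do not produce empty output (see 9bd27b7).
             */
            ExplainDummyGroup("CREATE TABLE AS", NULL, es);
            return;
        }

        /* ... otherwise rewrite the contained query and explain it as usual ... */
    }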

What do you think?
--
Michael
From ed6fb44e5433b86fbed454f1b52aaa8538a377ae Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 23 Dec 2020 21:28:07 +0900
Subject: [PATCH v6] Fail Fast In CTAS/CMV If Relation Already Exists

Currently, for CTAS or CREATE MATERIALIZED VIEW (CMV) without an
IF-NOT-EXISTS clause, the existence of the relation (either
table or materialized view) gets checked during execution and an
error is thrown there, after all the unnecessary rewrite and planning
for the SELECT part of the query have already happened just to fail
later. However, if the IF-NOT-EXISTS clause is present, then a notice
is issued and the command returns immediately, without any further
rewrite or planning. This is somewhat inconsistent.

This patch proposes to check the relation existence early in
ExecCreateTableAs() as well as in ExplainOneUtility() and
throw an error in case it already exists, to avoid unnecessary
rewrite, planning and execution of the SELECT part.
---
 src/include/commands/createas.h   |  2 +
 src/backend/commands/createas.c   | 53 ---
 src/backend/commands/explain.c| 13 ++
 src/test/regress/expected/matview.out | 38 
 src/test/regress/expected/select_into.out | 42 ++
 src/test/regress/sql/matview.sql  | 23 ++
 src/test/regress/sql/select_into.sql  | 21 +
 7 files changed, 177 insertions(+), 15 deletions(-)

diff --git a/src/include/commands/createas.h b/src/include/commands/createas.h
index 7629230254..d8cfb62522 100644
--- a/src/include/commands/createas.h
+++ b/src/include/commands/createas.h
@@ -29,4 +29,6 @@ extern int	GetIntoRelEFlags(IntoClause *intoClause);
 
 extern DestReceiver *CreateIntoRelDestReceiver(IntoClause *intoClause);
 
+extern bool CreateTableAsRelExists(CreateTableAsStmt *ctas);
+
 #endif			/* CREATEAS_H */
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 6bf6c5a310..01c94f1bce 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -239,21 +239,9 @@ ExecCreateTableAs(ParseState *pstate, CreateTableAsStmt *stmt,
 	PlannedStmt *plan;
 	QueryDesc  *queryDesc;
 
-	if (stmt->if_not_exists)
-	{
-		Oid			nspid;
-
-		nspid = RangeVarGetCreationNamespace(stmt->into->rel);
-
-		if (get_relname_relid(stmt->into->rel->relname, nspid))
-		{
-			ereport(NOTICE,
-	(errcode(ERRCODE_DUPLICATE_TABLE),
-	 errmsg("relation \"%s\" already exists, skipping",
-			stmt->into->rel->relname)));
-			return InvalidObjectAddress;
-		}
-	}
+	/* Check if the relation exists or not */
+	if (!CreateTableAsRelExists(stmt))
+		return InvalidObjectAddress;
 
 	/*
 	 * Create the tuple receiver object and insert info it will need
@@ -400,6 +388,41 @@ GetIntoRelEFlags(IntoClause *intoClause)
 	return flags;
 }
 
+/*
+ * CreateTableAsRelExists --- check existence of relation for CreateTableAsStmt
+ *
+ * Utility wrapper checking if the relation pending for creation in the
+ * CreateTableAsStmt query already exists or not.  Returns true if the
+ * relation can be created.
+ */
+bool
+CreateTableAsRelExists(CreateTableAsStmt *ctas)
+{
+	Oid nspid;
+	IntoClause *into = ctas->into;
+
+	nspid = RangeVarGetCreationNamespace(into->rel);
+
+	if (get_relname_relid(into->rel->relname, nspid))
+	{
+		if (!ctas->if_not_exists)
+			ereport(ERROR,
+	(errcode(ERRCODE_DUPLICATE_TABLE),
+	 errmsg("relation \"%s\" already exists",
+			into->rel->relname)));
+
+		/* The relation exists and IF NOT EXISTS has been specified */
+		ereport(NOTICE,
+(errcode(ERRCODE_DUPLICATE_TABLE),
+ errmsg("relation \"%s\" already exists, skipping",
+		into->rel->relname)));
+		return false;
+	}
+
+	/* Relation does not exist, it can be created */
+	return true;
+}
+
 /*
  * CreateIntoRelDestReceiver -- create a suitable DestReceiver object
  *
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 43f9b01e83..c2dd38f72b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -435,6 +435,19 @@ ExplainOneUtility(Node *utilityStmt, IntoClause *into, ExplainState *es,
 		CreateTableAsStmt *ctas = (CreateTableAsStmt *) 

Re: Confused about stream replication protocol documentation

2020-12-23 Thread Fujii Masao




On 2020/12/23 11:08, Li Japin wrote:



On Dec 22, 2020, at 11:13 PM, Fujii Masao <masao.fu...@oss.nttdata.com> wrote:

‘B’ means a backend and ‘F’ means a frontend. Maybe as [1] does, we should
add the note like "Each is marked to indicate that it can be sent by
a frontend (F) and a backend (B)" into the description about each message
format for START_REPLICATION.

[1]
https://www.postgresql.org/docs/devel/protocol-message-formats.html 



Thanks for your clarification.  Maybe we should move the "protocol message formats"
before "stream replication protocol", or reference it in "stream replication
protocol".


I like the latter. And maybe it's better to also reference
"53.6. Message Data Types" there, because the messages for
START_REPLICATION use those message data types.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [PATCH] Automatic HASH and LIST partition creation

2020-12-23 Thread Pavel Borisov
>
>
> > BTW could you tell me a couple of words about pros and cons of c-code
> > syntax parsing comparing to parsing using gram.y trees?
>
> I'd rather use an automatic tool (lexer/parser) if possible instead of
> doing it by hand if I can. If you want a really nice syntax with clever
> tricks, then you may need to switch to manual though, but pg/sql is not in
> that class.
>
> > I think both are possible but my predisposition was that we'd better use
> > the later if possible.
>
> I agree.
>
Thank you!

Fabien, do you consider it possible to change the syntax of declarative
partitioning too? It is problematic because that syntax is already committed,
but it is also very tempting to have the same kind of syntax for both automatic
partitioning and manual partitioning (PARTITION OF ...).

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


Re: Logical decoding without slots: decoding in lockstep with recovery

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 12:26 PM Craig Ringer
 wrote:
>
> Hi all
>
> I want to share an idea I've looked at a few times where I've run into 
> situations where logical slots were inadvertently dropped, or where it became 
> necessary to decode changes in the past on a slot.
>
> As most of you will know you can't just create a logical slot in the past. 
> Even if it was permitted, it'd be unsafe due to catalog_xmin retention 
> requirements and missing WAL.
>
> But if we can arrange a physical replica to replay the WAL of interest and 
> decode each commit as soon as it's replayed by the startup process, we know 
> the needed catalog rows must all exist, so it's safe to decode the change.
>
> So it should be feasible to run logical decoding in standby, even without a 
> replication slot, so long as we:
>
> * pause startup process after each xl_xact_commit
> * wake the walsender running logical decoding
> * decode and process until ReorderBufferCommit for the just-committed xact 
> returns
> * wake the startup process to decode the up to the next commit
>

How will you deal with a subscriber restart?  I think you need some way
to remember confirmed_flush_lsn and restart_lsn, and then you need to teach
the WAL machinery to restart from some previous point.
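To make that concrete, this is roughly the minimum that would have to be
persisted somewhere across restarts (the struct and its name are purely an
illustration; nothing like it exists today):

    /* Needs access/xlogdefs.h and access/transam.h for the types. */
    typedef struct SlotlessDecodingProgress
    {
        XLogRecPtr      confirmed_flush_lsn;    /* last commit fully processed downstream */
        XLogRecPtr      restart_lsn;            /* point WAL reading must restart from */
        TransactionId   catalog_xmin;           /* catalog rows that must still be reachable */
    } SlotlessDecodingProgress;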

-- 
With Regards,
Amit Kapila.




Re: Deadlock between backend and recovery may not be detected

2020-12-23 Thread Masahiko Sawada
On Tue, Dec 22, 2020 at 11:58 PM Fujii Masao
 wrote:
>
>
>
> On 2020/12/22 20:42, Fujii Masao wrote:
> >
> >
> > On 2020/12/22 10:25, Masahiko Sawada wrote:
> >> On Fri, Dec 18, 2020 at 6:36 PM Fujii Masao  
> >> wrote:
> >>>
> >>>
> >>>
> >>> On 2020/12/17 2:15, Fujii Masao wrote:
> 
> 
>  On 2020/12/16 23:28, Drouvot, Bertrand wrote:
> > Hi,
> >
> > On 12/16/20 2:36 PM, Victor Yegorov wrote:
> >>
> >>
> >> Wed, 16 Dec 2020 at 13:49, Fujii Masao :
> >>
> >>  After doing this procedure, you can see the startup process and 
> >> backend
> >>  wait for the table lock each other, i.e., deadlock. But this 
> >> deadlock remains
> >>  even after deadlock_timeout passes.
> >>
> >>  This seems a bug to me.
> >>
> > +1
> >
> >>
> >>  > * Deadlocks involving the Startup process and an ordinary 
> >> backend process
> >>  > * will be detected by the deadlock detector within the ordinary 
> >> backend.
> >>
> >>  The cause of this issue seems that 
> >> ResolveRecoveryConflictWithLock() that
> >>  the startup process calls when recovery conflict on lock happens 
> >> doesn't
> >>  take care of deadlock case at all. You can see this fact by 
> >> reading the above
> >>  source code comment for ResolveRecoveryConflictWithLock().
> >>
> >>  To fix this issue, I think that we should enable 
> >> STANDBY_DEADLOCK_TIMEOUT
> >>  timer in ResolveRecoveryConflictWithLock() so that the startup 
> >> process can
> >>  send PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal to the 
> >> backend.
> >>  Then if PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK signal arrives,
> >>  the backend should check whether the deadlock actually happens or 
> >> not.
> >>  Attached is the POC patch implimenting this.
> >>
> > good catch!
> >
> > I don't see any obvious reasons why the STANDBY_DEADLOCK_TIMEOUT 
> > shouldn't be set in ResolveRecoveryConflictWithLock() too (it is 
> > already set in ResolveRecoveryConflictWithBufferPin()).
> >
> > So + 1 to consider this as a bug and for the way the patch proposes to 
> > fix it.
> 
>  Thanks Victor and Bertrand for agreeing!
>  Attached is the updated version of the patch.
> >>>
> >>> Attached is v3 of the patch. Could you review this version?
> >>>
> >>> While the startup process is waiting for recovery conflict on buffer pin,
> >>> it repeats sending the request for deadlock check to all the backends
> >>> every deadlock_timeout. This may increase the workload in the startup
> >>> process and backends, but since this is the original behavior, the patch
> >>> doesn't change that. Also in practice this may not be so harmful because
> >>> the period that the buffer is kept pinned is basically not so long.
> >>>
> >>
> >> @@ -529,6 +603,26 @@ ResolveRecoveryConflictWithBufferPin(void)
> >>   */
> >>  ProcWaitForSignal(PG_WAIT_BUFFER_PIN);
> >>
> >> +   if (got_standby_deadlock_timeout)
> >> +   {
> >> +   /*
> >> +* Send out a request for hot-standby backends to check themselves 
> >> for
> >> +* deadlocks.
> >> +*
> >> +* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will 
> >> wait
> >> +* to be signaled by UnpinBuffer() again and send a request for
> >> +* deadlocks check if deadlock_timeout happens. This causes the
> >> +* request to continue to be sent every deadlock_timeout until the
> >> +* buffer is unpinned or ltime is reached. This would increase the
> >> +* workload in the startup process and backends. In practice it may
> >> +* not be so harmful because the period that the buffer is kept 
> >> pinned
> >> +* is basically no so long. But we should fix this?
> >> +*/
> >> +   SendRecoveryConflictWithBufferPin(
> >> +
> >> PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
> >> +   got_standby_deadlock_timeout = false;
> >> +   }
> >> +
> >>
> >> Since SendRecoveryConflictWithBufferPin() sends the signal to all
> >> backends every backend who is waiting on a lock at ProcSleep() and not
> >> holding a buffer pin blocking the startup process will end up doing a
> >> deadlock check, which seems expensive. What is worse is that the
> >> deadlock will not be detected because the deadlock involving a buffer
> >> pin isn't detected by CheckDeadLock(). I thought we can replace
> >> PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK with
> >> PROCSIG_RECOVERY_CONFLICT_BUFFERPIN but it’s not good because the
> >> backend who has a buffer pin blocking the 

Re: Single transaction in the tablesync worker?

2020-12-23 Thread Amit Kapila
On Tue, Dec 22, 2020 at 4:58 PM Peter Smith  wrote:
>
> On Mon, Dec 21, 2020 at 11:36 PM Amit Kapila  wrote:
> >
> > On Mon, Dec 21, 2020 at 3:17 PM Peter Smith  wrote:
> > >
> > > On Mon, Dec 21, 2020 at 4:23 PM Amit Kapila  
> > > wrote:
> > >
> > > > Few other comments:
> > > > ==
> > >
> > > Thanks for your feedback.
> > >
> > > > 1.
> > > > * FIXME 3 - Crashed tablesync workers may also have remaining slots
> > > > because I don't think
> > > > + * such workers are even iterated by this loop, and nobody else is
> > > > removing them.
> > > > + */
> > > > + if (slotname)
> > > > + {
> > > >
> > > > The above FIXME is not clear to me. Actually, the crashed workers
> > > > should restart, finish their work, and drop the slots. So not sure
> > > > what exactly this FIXME refers to?
> > >
> > > Yes, normally if the tablesync can complete it should behave like that.
> > > But I think there are other scenarios where it may be unable to
> > > clean-up after itself. For example:
> > >
> > > i) Maybe the crashed tablesync worker cannot finish. e.g. A row insert
> > > handled by tablesync can give a PK violation which also will crash
> > > again and again for each re-launched/replacement tablesync worker.
> > > This can be reproduced in the debugger. If the DropSubscription
> > > doesn't clean-up the tablesync's slot then nobody will.
> > >
> > > ii) Also DROP SUBSCRIPTION code has locking (see code commit) "to
> > > ensure that the launcher doesn't restart new worker during dropping
> > > the subscription".
> > >
> >
> > Yeah, I have also read that comment but do you know how it is
> > preventing relaunch? How does the subscription lock help?
>
> Hmmm. I did see there is a matching lock in get_subscription_list of
> launcher.c, which may be what that code comment was referring to. But
> I also am currently unsure how this lock prevents anybody (e.g.
> process_syncing_tables_for_apply) from executing another
> logicalrep_worker_launch.
>

process_syncing_tables_for_apply() is called by the apply worker, and we
are stopping the apply worker. After that, the launcher won't start a new
apply worker because of get_subscription_list(), and if the apply worker is
not running it cannot launch a tablesync worker either. So we do need to
handle crashed tablesync workers here and drop any sync slots they left
behind.

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 11:49 AM Peter Smith  wrote:
>
> Hi Amit.
>
> PSA my v7 WIP patch for the Solution1.
>

Few comments:

1.
+ * Rarely, the DropSubscription may be issued when a tablesync still
+ * is in SYNCDONE but not yet in READY state. If this happens then
+ * the drop slot could fail because it is already dropped.
+ * In this case suppress and drop slot error.
+ *
+ * FIXME - Is there a better way than this?
+ */
+ if (rstate->state != SUBREL_STATE_SYNCDONE)
+ PG_RE_THROW();

So, does this situation happen when we try to drop the subscription after
the state has changed to SYNCDONE but not yet to READY? If so, can't
we write a function GetSubscriptionNotDoneRelations, similar to
GetSubscriptionNotReadyRelations, that returns the list of relations that
are not yet in the SYNCDONE state (a sketch follows)? I think this should
be safe because once we are here we shouldn't be allowed to start a new
worker, and the old workers are already stopped by this function.
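A minimal sketch of what I have in mind, copied from the shape of
GetSubscriptionNotReadyRelations() with only the scan key changed (this
function does not exist; it is just an illustration):

    List *
    GetSubscriptionNotDoneRelations(Oid subid)
    {
        List       *res = NIL;
        Relation    rel;
        HeapTuple   tup;
        int         nkeys = 0;
        ScanKeyData skey[2];
        SysScanDesc scan;

        rel = table_open(SubscriptionRelRelationId, AccessShareLock);

        ScanKeyInit(&skey[nkeys++],
                    Anum_pg_subscription_rel_srsubid,
                    BTEqualStrategyNumber, F_OIDEQ,
                    ObjectIdGetDatum(subid));

        /* The only difference from the NotReady variant: filter on SYNCDONE. */
        ScanKeyInit(&skey[nkeys++],
                    Anum_pg_subscription_rel_srsubstate,
                    BTEqualStrategyNumber, F_CHARNE,
                    CharGetDatum(SUBREL_STATE_SYNCDONE));

        scan = systable_beginscan(rel, InvalidOid, false, NULL, nkeys, skey);

        while (HeapTupleIsValid(tup = systable_getnext(scan)))
        {
            Form_pg_subscription_rel subrel = (Form_pg_subscription_rel) GETSTRUCT(tup);
            SubscriptionRelState *relstate = palloc(sizeof(SubscriptionRelState));

            relstate->relid = subrel->srrelid;
            relstate->state = subrel->srsubstate;
            relstate->lsn = subrel->srsublsn;
            res = lappend(res, relstate);
        }

        systable_endscan(scan);
        table_close(rel, AccessShareLock);

        return res;
    }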

2. Your changes in LogicalRepSyncTableStart() don't seem to be
right. IIUC, you are copying the table in one transaction, then
updating the state to SUBREL_STATE_COPYDONE in another transaction,
and after that doing replorigin_advance. Consider what happens if we
error out after the first transaction, in which we have copied the
table, is committed: after the restart it will try to copy again and
lead to an error. Similarly, if we error out after the second
transaction, we won't know where to start decoding from. I think all
of these should be done in a single transaction, roughly as sketched below.
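A rough sketch of the single-transaction ordering I mean (error handling and
the surrounding tablesync code omitted; variable names follow the WIP patch):

    /* All three steps commit or roll back together. */
    StartTransactionCommand();

    /* 1. Initial data copy. */
    copy_table(rel);

    /* 2. Persist the COPYDONE state. */
    UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
                               MyLogicalRepWorker->relid,
                               SUBREL_STATE_COPYDONE,
                               *origin_startpos);

    /* 3. Remember where decoding must resume from. */
    replorigin_advance(originid, *origin_startpos, InvalidXLogRecPtr,
                       true /* go backward */ , true /* WAL log */ );

    CommitTransactionCommand();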

-- 
With Regards,
Amit Kapila.




Re: Single transaction in the tablesync worker?

2020-12-23 Thread Peter Smith
Hi Amit.

PSA my v8 WIP patch for the Solution1.

This has the same code changes as the v7 patch, but the v8 patch can
be applied to the current PG OSS master code base.



Coded / WIP:

* tablesync slot is now permanent instead of temporary. The tablesync
slot name is no longer tied to the Subscription slot name.

* the tablesync slot cleanup (drop) code is added for DropSubscription
and for finish_sync_worker functions

* the tablesync worker now uses multiple transactions instead of a single transaction

* a new state (SUBREL_STATE_COPYDONE) is persisted after a successful
copy_table in LogicalRepSyncTableStart.

* if a relaunched tablesync finds the state is SUBREL_STATE_COPYDONE
then it will bypass the initial copy_table phase.

* tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar to what is done for the apply worker). The
origin is advanced when first created.

* tablesync replication origin tracking is cleaned up during
DropSubscription and/or process_syncing_tables_for_apply

* The DropSubscription cleanup code was changed substantially in v7. The
subscription TAP tests have now been run six times without observing any of
the race problems that were sometimes seen with the v6 patch.

TODO / Known Issues:

* Help / comments

* There is temporary excessive logging (marked "!!>>") scattered around,
which I added to help my testing during development

* Address review comments

---

Kind Regards,
Peter Smith.
Fujitsu Australia


v8-0001-WIP-patch-for-the-Solution1.patch
Description: Binary data


RE: Parallel copy

2020-12-23 Thread Hou, Zhijie
Hi

> Yes this optimization can be done, I will handle this in the next patch
> set.
> 

I have a suggestion for the parallel safety check.

As designed, the leader does not participate in the insertion of data.
If the user specifies (PARALLEL 1), there is only one worker process doing the
insertion.

IMO, we can skip some of the safety checks in this case, because those checks
exist to limit parallel insert.
(except temporary tables or ...)

So, how about checking (PARALLEL 1) separately, roughly as sketched below?
Although it makes the code a bit more complicated, (PARALLEL 1) does give a good
performance improvement.
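A minimal sketch of what I mean (all names below are made up for illustration;
they are not from the patch):

    /*
     * With a single worker the leader never inserts concurrently, so only the
     * restrictions that apply even to a single inserting process (for example,
     * temporary tables) need to be checked.
     */
    if (cstate->nworkers == 1)
        parallel_ok = CheckParallelCopySafetySingleWorker(cstate);
    else
        parallel_ok = CheckParallelCopySafety(cstate);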

Best regards,
houzj




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-12-23 Thread Andrey V. Lepikhov

On 12/22/20 12:04 PM, Tang, Haiying wrote:

Hi Andrey,

There is an error report in your patch as follows. Please take a check.

https://travis-ci.org/github/postgresql-cfbot/postgresql/jobs/750682857#L1519


copyfrom.c:374:21: error: ‘save_cur_lineno’ is used uninitialized in this 
function [-Werror=uninitialized]


Regards,
Tang




Thank you,
see new version in attachment.

--
regards,
Andrey Lepikhov
Postgres Professional
>From e2bc0980f05061afe199de63b76b00020208510a Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Mon, 14 Dec 2020 13:37:40 +0500
Subject: [PATCH 2/2] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into a foreign table when multi-insert is
possible and the foreign table has a non-zero number of columns.

The FDW API is extended with the following routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free
the CopyState used for the bulk COPY. The ExecForeignCopy routine sends a
'COPY ... FROM STDIN' command to the foreign server, sends tuples iteratively
through the CopyTo() machinery, and finally sends EOF on that connection.

The code that constructed the list of columns for a given foreign relation
in the deparseAnalyzeSql() routine is split out into deparseRelColumnList(),
which is reused in deparseCopyFromSql().

Added TAP tests for specific corner cases of the COPY FROM STDIN operation.

By analogy with CopyFrom(), the CopyState structure is extended with a
data_dest_cb callback, which is used to send the text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with a cstate field,
needed to avoid repeated initialization of the CopyState. Also for this
reason the CopyTo() routine is split into the set of routines CopyToStart()/
CopyTo()/CopyToFinish().

The CopyInsertMethod enum is removed; this logic is now implemented by the
ri_usesMultiInsert field of the ResultRelInfo structure.

Discussion:
https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 ++--
 .../postgres_fdw/expected/postgres_fdw.out|  46 +-
 contrib/postgres_fdw/postgres_fdw.c   | 137 ++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  45 ++
 doc/src/sgml/fdwhandler.sgml  |  73 ++
 src/backend/commands/copy.c   |   4 +-
 src/backend/commands/copyfrom.c   | 126 +---
 src/backend/commands/copyto.c |  84 ---
 src/backend/executor/execMain.c   |   8 +-
 src/backend/executor/execPartition.c  |  27 +++-
 src/include/commands/copy.h   |   8 +-
 src/include/foreign/fdwapi.h  |  15 ++
 13 files changed, 540 insertions(+), 94 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ca2f9f3215..b2a71faabc 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1763,6 +1765,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2066,6 +2082,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list in parantheses.
+ */
+static List *
+deparseRelColumnList(StringInfo buf, Relation rel, bool enclose_in_parens)
 {
 	Oid			relid = RelationGetRelid(rel);
 	TupleDesc	tupdesc = 

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 10:42 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 23 Dec 2020 04:22:19 +, "tsunakawa.ta...@fujitsu.com" 
>  wrote in
> > From: Amit Kapila 
> > > + /* Get the number of blocks for a relation's fork */
> > > + block[i][j] = smgrnblocks(smgr_reln[i], j, &cached);
> > > +
> > > + if (!cached)
> > > + goto buffer_full_scan;
> > >
> > > Why do we need to use goto here? We can simply break from the loop and
> > > then check if (cached && nBlocksToInvalidate <
> > > BUF_DROP_FULL_SCAN_THRESHOLD). I think we should try to avoid goto if
> > > possible without much complexity.
> >
> > That's because two for loops are nested -- breaking there only exits the 
> > inner loop.  (I thought the same as you at first... And I understand some 
> > people have alergy to goto, I think modest use of goto makes the code 
> > readable.)
>
> I don't strongly oppose to goto's but in this case the outer loop can
> break on the same condition with the inner loop, since cached is true
> whenever the inner loop runs to the end. It is needed to initialize
> the variable cache with true, instead of false, though.
>

+1. I think it is better to avoid goto here as it can be done without
introducing any complexity or making code any less readable.

-- 
With Regards,
Amit Kapila.




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-12-23 Thread Amit Kapila
On Wed, Dec 23, 2020 at 1:07 PM k.jami...@fujitsu.com
 wrote:
>
> On Tuesday, December 22, 2020 9:11 PM, Amit Kapila wrote:
>
> > In this code, I am slightly worried about the additional cost of each time
> > checking smgrexists. Consider a case where there are many relations and only
> > one or few of them have not cached the information, in such a case we will
> > pay the cost of smgrexists for many relations without even going to the
> > optimized path. Can we avoid that in some way or at least reduce its usage 
> > to
> > only when it is required? One idea could be that we first check if the 
> > nblocks
> > information is cached and if so then we don't need to call smgrnblocks,
> > otherwise, check if it exists. For this, we need an API like
> > smgrnblocks_cahced, something we discussed earlier but preferred the
> > current API. Do you have any better ideas?
>
> Right. I understand the point that let's say there are 100 relations, and
> the first 99 non-local relations happen to enter the optimization path, but 
> the last
> one does not, calling smgrexist() would be too costly and waste of time in 
> that case.
> The only solution I could think of as you mentioned is to reintroduce the new 
> API
> which we discussed before: smgrnblocks_cached().
> It's possible that we call smgrexist() only if smgrnblocks_cached() returns
> InvalidBlockNumber.
> So if everyone agrees, we can add that API smgrnblocks_cached() which will
> Include the "cached" flag parameter, and remove the additional flag 
> modifications
> from smgrnblocks(). In this case, both DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() will use the new API.
>

Yeah, let's do it that way unless anyone has a better idea to suggest.
--
With Regards,
Amit Kapila.