Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY
Hi Alvaro,

Studying the patch to understand how it works.

On Tue, Aug 4, 2020 at 8:49 AM Alvaro Herrera wrote:
> Why two transactions?  The reason is that in order for this to work, we
> make a catalog change (mark it detached), and commit so that all
> concurrent transactions can see the change.  A second transaction waits
> for anybody who holds any lock on the partitioned table and grabs Access
> Exclusive on the partition (which now no one cares about, if they're
> looking at the partitioned table), where the DDL action on the partition
> can be completed.

+	/*
+	 * Concurrent mode has to work harder; first we add a new constraint to the
+	 * partition that matches the partition constraint.  The reason for this is
+	 * that the planner may have made optimizations that depend on the
+	 * constraint.  XXX Isn't it sufficient to invalidate the partition's
+	 * relcache entry?
...
+	/* Add constraint on partition, equivalent to the partition constraint */
+	n = makeNode(Constraint);
+	n->contype = CONSTR_CHECK;
+	n->conname = NULL;
+	n->location = -1;
+	n->is_no_inherit = false;
+	n->raw_expr = NULL;
+	n->cooked_expr = nodeToString(make_ands_explicit(RelationGetPartitionQual(partRel)));
+	n->initially_valid = true;
+	n->skip_validation = true;
+	/* It's a re-add, since it nominally already exists */
+	ATAddCheckConstraint(wqueue, tab, partRel, n,
+						 true, false, true, ShareUpdateExclusiveLock);

I suspect that we don't really need this defensive constraint.  I mean,
even after committing the 1st transaction, the partition being detached
still has relispartition set to true and RelationGetPartitionQual()
still returns the partition constraint, so it seems the constraint being
added above is duplicative.

Moreover, the constraint is not removed as part of any cleanup after the
DETACH process, so repeated attach/detach of the same partition results
in the constraints piling up:

create table foo (a int, b int) partition by range (a);
create table foo1 partition of foo for values from (1) to (2);
create table foo2 partition of foo for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);
alter table foo detach partition foo2 concurrently;
alter table foo attach partition foo2 for values from (2) to (3);

\d foo2
                Table "public.foo2"
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 a      | integer |           |          |
 b      | integer |           |          |
Partition of: foo FOR VALUES FROM (2) TO (3)
Check constraints:
    "foo2_a_check" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)
    "foo2_a_check1" CHECK (a IS NOT NULL AND a >= 2 AND a < 3)

Noticed a bug/typo in the patched RelationBuildPartitionDesc():

-	inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock);
+	inhoids = find_inheritance_children(RelationGetRelid(rel), NoLock,
+										include_detached);

You're passing NoLock for include_detached, which means you never
actually end up including detached partitions from here.  I think it is
due to this bug that the partition-concurrent-attach test fails in my
run.

Also, the error seen in the following hunk of the
detach-partition-concurrently-1 test is not intentional:

+starting permutation: s1brr s1prep s1s s2d s1s s1exec2 s1c
+step s1brr: BEGIN ISOLATION LEVEL REPEATABLE READ;
+step s1prep: PREPARE f(int) AS INSERT INTO d_listp VALUES ($1);
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+2
+step s2d: ALTER TABLE d_listp DETACH PARTITION d_listp2 CONCURRENTLY;
+step s1s: SELECT * FROM d_listp;
+a
+
+1
+step s1exec2: EXECUTE f(2); DEALLOCATE f;
+step s2d: <... completed>
+error in steps s1exec2 s2d: ERROR:  no partition of relation "d_listp" found for row
+step s1c: COMMIT;

As you're intending to make the executor always *include* detached
partitions, the insert should be able to route (2) to d_listp2, the
detached partition.  It's the bug mentioned above that causes the
executor's partition descriptor build to falsely fail to include the
detached partition.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
RE: Use appendStringInfoString and appendPQExpBufferStr where possible
Hi

I made a slight update to the patch.

> > > I found some more places that should use appendPQExpBufferStr instead
> > > of appendPQExpBuffer.
> > >
> > > Here is the patch.
> >
> > Seems like a good idea.  Please add it to the next commitfest.
>
> Thanks, added it to the next commitfest.
> https://commitfest.postgresql.org/30/2735/

Best regards,
houzj

0001-Latest-appendStringInfoString-instead-of-appendStringInfo.patch
Description: 0001-Latest-appendStringInfoString-instead-of-appendStringInfo.patch

0001-Latest-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch
Description: 0001-Latest-appendPQExpBufferStr-instead-of-appendPQExpBuffer.patch
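To illustrate the kind of substitution these patches make, here is a
minimal generic example (not a hunk from the patches themselves): when
the appended text is a fixed string with no % escapes, the ...Str
variant skips the vsnprintf formatting pass entirely, which is both
cheaper and safer if the literal ever contains a stray %.  The same
reasoning applies to appendStringInfo vs. appendStringInfoString in
backend code.

#include "pqexpbuffer.h"

/* Hypothetical helper, for illustration only. */
static void
build_query(PQExpBuffer buf)
{
	/* Before: format machinery invoked although nothing is formatted */
	appendPQExpBuffer(buf, "SELECT relname FROM pg_class ");

	/* After: same output, cheaper call, no format-string hazards */
	appendPQExpBufferStr(buf, "WHERE relkind = 'r'");
}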
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Jamison, Kirk/ジャミソン カーク
> I revised the patch based on my understanding of Horiguchi-san's comment,
> but I could be wrong.
> Quoting:
>
> "
> +		/* Get the number of blocks for the supplied relation's fork */
> +		nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
> +		Assert(BlockNumberIsValid(nblocks));
> +
> +		if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)
>
> As mentioned upthread, the criteria whether we do full-scan or
> lookup-drop is how large a portion of NBuffers this relation-drop is
> going to invalidate.  So the nblocks above should be the sum of the
> number of blocks to be truncated (not just the total number of blocks)
> of all designated forks.  Then once we decide to do the lookup-drop
> method, we do that for all forks."

One takeaway from Horiguchi-san's comment is to use the number of blocks
to invalidate for comparison, instead of all blocks in the fork.  That
is, use

    nblocks = smgrnblocks(fork) - firstDelBlock[fork];

Does this make sense?

What do you think is the reason for summing up all forks?  I didn't
understand why.  Typically, FSM and VM forks are very small.  If the
main fork is larger than NBuffers / 500, then v14 scans the entire
shared buffers for the FSM and VM forks as well as the main fork,
resulting in three scans in total.

Also, if you want to judge the criteria based on the total blocks of all
forks, the following if should be placed outside the for loop, right?
Because this if condition doesn't change inside the for loop.

+		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+			BlockNumberIsValid(nblocks))
+		{

> > (2)
> > +		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> > +			BlockNumberIsValid(nblocks))
> > +		{
> >
> > The division by NBuffers is not necessary, because both sides of < are
> > numbers of blocks.
>
> Again, I based it on my understanding of the comment above,
> so nblocks is the sum of all blocks to be truncated for all forks.

But the left expression of "<" is a percentage, while the right one is a
block count.  Two different units are compared.

> > Why is the BlockNumberIsValid(nblocks) call needed?
>
> I thought we need to ensure that nblocks is not invalid, so I also
> added it.

When is it invalid?  smgrnblocks() seems to always return a valid block
number.  Am I seeing a different source code (I saw HEAD)?

Regards
Takayuki Tsunakawa
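For what it's worth, the alternative being suggested here -- summing
only the blocks each fork is about to lose, rather than each fork's full
size -- could look roughly like this.  A sketch only: the identifiers
(smgr_reln, forkNum, firstDelBlock, BUF_DROP_FULLSCAN_THRESHOLD) follow
the patch under discussion, and the surrounding loop is illustrative.

	BlockNumber	nBlocksToInvalidate = 0;
	int			i;

	/*
	 * Sum, over all designated forks, only the tail of each fork that
	 * the truncation will actually drop.
	 */
	for (i = 0; i < nforks; i++)
	{
		BlockNumber nblocks = smgrnblocks(smgr_reln, forkNum[i]);

		nBlocksToInvalidate += nblocks - firstDelBlock[i];
	}

	if (nBlocksToInvalidate < BUF_DROP_FULLSCAN_THRESHOLD)
	{
		/* look up and drop the affected buffers individually */
	}
	else
	{
		/* fall back to scanning all of shared buffers */
	}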
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Sep 23, 2020 at 8:04 AM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Amit Kapila
> > > > > Don't
> > > > > we need to guarantee the cache to be valid while recovery?
> > > > >
> > > >
> > > > One possibility could be that we somehow detect that the value we
> > > > are using is a cached one and if so then only do this optimization.
> > >
> > > I basically like this direction. But I'm not sure the additional
> > > parameter for smgrnblocks is acceptable.
> > >
> > > But on the contrary, it might be a better design that
> > > DropRelFileNodeBuffers gives up the optimization when
> > > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> > >
> >
> > I haven't thought about what is the best way to achieve this. Let us see if
> > Tsunakawa-San or Kirk-San has other ideas on this?
>
> I see no need for smgrnblocks() to add an argument as it returns the correct
> cached or measured value.
>

The idea is that we can't use this optimization if the value is not
cached because we can't rely on lseek behavior. See all the discussion
between Horiguchi-San and me in the thread above. So, how would you
ensure that if we don't use Kirk-San's proposal?

--
With Regards,
Amit Kapila.
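To make the disagreement concrete, the proposal being referenced could
take roughly this shape.  This is a hypothetical sketch mirroring the
smgrnblocks(,,must_accurate = true) idea quoted above; the extra
parameter does not exist in current smgrnblocks().

	/*
	 * Hypothetical: ask smgrnblocks() to answer from its cache only,
	 * returning InvalidBlockNumber rather than probing the file with
	 * lseek(), whose result we cannot rely on here.
	 */
	BlockNumber nblocks = smgrnblocks(smgr_reln, forknum,
									  true /* must_accurate */ );

	if (nblocks == InvalidBlockNumber)
	{
		/* not cached: give up the optimization, scan all of shared buffers */
	}
	else
	{
		/* safe to drop only the buffers in [firstDelBlock, nblocks) */
	}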
Re: [Patch] Optimize dropping of relation buffers using dlist
On Wed, Sep 23, 2020 at 7:56 AM tsunakawa.ta...@fujitsu.com wrote:
>
> (3)
> 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
> 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> 	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +		 * won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
> 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>
> I think this change is not directly related to this patch and can be a
> separate patch, but I want to leave the decision up to a committer.
>

We have added this mainly for testing purposes; basically this assertion
should not fail during the regression tests. We can keep it in a
separate patch but need to ensure that. If this fails then we can't rely
on the caching behaviour during recovery which is actually required for
the correctness of the patch.

--
With Regards,
Amit Kapila.
Re: [HACKERS] logical decoding of two-phase transactions
On Tue, Sep 22, 2020 at 5:18 PM Ajin Cherian wrote:
>
> On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote:
>
> > 3.
> >
> > + /*
> > + * If it's ROLLBACK PREPARED then handle it via callbacks.
> > + */
> > + if (TransactionIdIsValid(xid) &&
> > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) &&
> > + parsed->dbId == ctx->slot->data.database &&
> > + !FilterByOrigin(ctx, origin_id) &&
> > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid))
> > + {
> > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr,
> > + commit_time, origin_id, origin_lsn,
> > + parsed->twophase_gid, false);
> > + return;
> > + }
> >
> >
> > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId
> > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare,
> > so if those are not true then we wouldn't have prepared this
> > transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do
> > we need to recheck these conditions?
>
> I didn't change this, as I am seeing cases where the Abort is getting
> called for transactions that need to be skipped. I also see that the
> same check is there both in DecodePrepare and DecodeCommit.
> So, while the same transactions were not getting prepared or
> committed, it tries to get ROLLBACK PREPARED (as part of finish
> prepared handling). The check in ReorderBufferTxnIsPrepared() is
> also not proper.
>

If the transaction is prepared, which you can ensure via
ReorderBufferTxnIsPrepared() (considering you have a proper check in
that function), it should not require skipping the transaction in
Abort. One way it could happen is if you clean up the ReorderBufferTxn
in Prepare, which you were doing in an earlier version of the patch and
which I pointed out was wrong. If you have changed that, then I don't
know why it could fail; maybe someplace else during prepare the patch
is freeing it. Just check that.

> I will need to relook
> at this logic again in a future patch.
>

No problem. I think you can handle the other comments and then we can
come back to this, and you might want to share the exact details of the
test (maybe a narrowed-down version of the original test) and I or
someone else might be able to help you with that.

--
With Regards,
Amit Kapila.
Re: Removing unneeded self joins
On Fri, 3 Apr 2020 at 17:43, Andrey Lepikhov wrote:
> v.23 in attachment:

Hi,

This is only a partial review, as I see the patch still does not
incorporate the self-join detection method I mentioned in March 2019,
and the method the patch does use only partially works. Here's the
partial review which contains the details:

1. Change to aset.c not relevant to this patch.

--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -1485,7 +1485,6 @@ AllocSetCheck(MemoryContext context)
 			chsize = chunk->size;	/* aligned chunk size */
 			dsize = chunk->requested_size;	/* real data */
-
 			/*
 			 * Check chunk size
 			 */

2. Why GUC_NOT_IN_SAMPLE?

+	{"enable_self_join_removal", PGC_USERSET, QUERY_TUNING_METHOD,
+		gettext_noop("Enable removal of unique self-joins."),
+		NULL,
+		GUC_NO_SHOW_ALL | GUC_NOT_IN_SAMPLE

3. I'd expect this to remove the join:

create table t1 (
  a int not null unique,
  b int not null unique,
  c int not null unique,
  d int not null,
  e int not null
);

explain select * from t1 inner join t1 t2 on t1.a=t2.a and t1.b=t2.c;
                             QUERY PLAN
--------------------------------------------------------------------
 Hash Join  (cost=52.50..88.42 rows=1 width=40)
   Hash Cond: ((t1.a = t2.a) AND (t1.b = t2.c))
   ->  Seq Scan on t1  (cost=0.00..27.00 rows=1700 width=20)
   ->  Hash  (cost=27.00..27.00 rows=1700 width=20)
         ->  Seq Scan on t1 t2  (cost=0.00..27.00 rows=1700 width=20)
(5 rows)

This one *does* seem to work:

explain select * from t1 inner join t1 t2 on t1.a=t2.a and t1.d=t2.e;

You should likely implement the method I wrote in the final paragraph
in [1] (see the sketch after this message).

The basic idea here is you first process the join quals:

t1.a=t2.a and t1.b=t2.c

and keep just the ones that use the same varattno on either side
(t1.a=t2.a). Perhaps it's not worth thinking about Exprs for now, or if
you think it is, you can normalise the varnos to 1 and equal().

Then just pass the remaining quals to relation_has_unique_index_for().
If it returns true then there's no need to perform the opposite check
for the other relation. It would just return the same thing.

This will allow you to get rid of using the following as proofs:

/* We must have the same unique index for both relations. */
if (outerinfo->index->indexoid != innerinfo->index->indexoid)
	return false;

... as I've shown above. This only works in some cases and that's not
really good enough. Doing things the way I describe will allow you to
get rid of all the UniqueRelInfo stuff.

4. Should be ok for partitioned tables though:

/*
 * This optimization won't work for tables that have inheritance
 * children.
 */
if (rte->inh)
	continue;

David

[1] https://www.postgresql.org/message-id/CAKJS1f8p-KiEujr12k-oa52JNWWaQUjEjNg%2Bo1MGZk4mHBn_Rg%40mail.gmail.com
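A rough sketch of the flow described in point 3, for readers following
along.  The helper is_same_attno_equality() is assumed (it does not
exist); which subset of quals feeds the uniqueness proof follows the
description above; this is illustrative, not working planner code.

	List	   *selfquals = NIL;	/* quals like t1.a = t2.a */
	List	   *otherquals = NIL;	/* everything else, e.g. t1.b = t2.c */
	ListCell   *lc;

	foreach(lc, joinquals)
	{
		RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

		/* assumed helper: same varattno on both sides of an equality */
		if (is_same_attno_equality(rinfo))
			selfquals = lappend(selfquals, rinfo);
		else
			otherquals = lappend(otherquals, rinfo);
	}

	/*
	 * One call is enough: for a self join, running the same test
	 * against the other side must produce the same answer.
	 */
	if (selfquals != NIL &&
		relation_has_unique_index_for(root, rel, selfquals, NIL, NIL))
	{
		/* the join is provably a unique self join and can be removed */
	}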
RE: [Patch] Optimize dropping of relation buffers using dlist
On Wednesday, September 23, 2020 11:26 AM, Tsunakawa, Takayuki wrote:
> I looked at v14.

Thank you for checking it!

> (1)
> +		/* Get the total number of blocks for the supplied relation's fork */
> +		for (j = 0; j < nforks; j++)
> +		{
> +			BlockNumber	block = smgrnblocks(smgr_reln, forkNum[j]);
> +			nblocks += block;
> +		}
>
> Why do you sum all forks?

I revised the patch based on my understanding of Horiguchi-san's
comment, but I could be wrong.
Quoting:

"
+		/* Get the number of blocks for the supplied relation's fork */
+		nblocks = smgrnblocks(smgr_reln, forkNum[fork_num]);
+		Assert(BlockNumberIsValid(nblocks));
+
+		if (nblocks < BUF_DROP_FULLSCAN_THRESHOLD)

As mentioned upthread, the criteria whether we do full-scan or
lookup-drop is how large a portion of NBuffers this relation-drop is
going to invalidate.  So the nblocks above should be the sum of the
number of blocks to be truncated (not just the total number of blocks)
of all designated forks.  Then once we decide to do the lookup-drop
method, we do that for all forks."

> (2)
> +		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
> +			BlockNumberIsValid(nblocks))
> +		{
>
> The division by NBuffers is not necessary, because both sides of < are
> numbers of blocks.

Again, I based it on my understanding of the comment above,
so nblocks is the sum of all blocks to be truncated for all forks.

> Why is the BlockNumberIsValid(nblocks) call needed?

I thought we need to ensure that nblocks is not invalid, so I also
added it.

> (3)
> 	if (reln->smgr_cached_nblocks[forknum] == blocknum)
> 		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
> 	else
> +	{
> +		/*
> +		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
> +		 * won't be invalidated by file extension while recovering.
> +		 */
> +		Assert(!InRecovery);
> 		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
> +	}
>
> I think this change is not directly related to this patch and can be a
> separate patch, but I want to leave the decision up to a committer.
>

This is noted. Once we have clarified the above comments, I'll put it in
a separate patch if it's necessary.

Thank you very much for the reviews.

Best regards,
Kirk Jamison
Re: Syncing pg_multixact directories
On Wed, Sep 23, 2020 at 4:09 PM Michael Paquier wrote:
> On Tue, Sep 22, 2020 at 07:05:36PM -0700, Andres Freund wrote:
> > On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
> >> I think we should be ensuring that directory entries for newly created
> >> multixact files are durable at checkpoint time.  Please see attached.
> >
> > Good catch!  That should probably be backpatched...
>
> +1.  Passing that down to the SLRU layer is a nice thing to do.  Were
> you planning to send a second patch here?  The commit log generated
> mentions patch 1/2.

Oh, that's just because I also have another patch, for master only, to
go on top, but that's in another thread about SLRU fsync offloading.
Sorry for the confusion.
Re: Syncing pg_multixact directories
On Tue, Sep 22, 2020 at 07:05:36PM -0700, Andres Freund wrote:
> On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
>> I think we should be ensuring that directory entries for newly created
>> multixact files are durable at checkpoint time.  Please see attached.
>
> Good catch!  That should probably be backpatched...

+1.  Passing that down to the SLRU layer is a nice thing to do.  Were
you planning to send a second patch here?  The commit log generated
mentions patch 1/2.
--
Michael

signature.asc
Description: PGP signature
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Wed, Sep 23, 2020 at 2:27 PM David Rowley wrote:
> I've gone as far as running the recovery tests on the v3-0001 patch
> using a Windows machine. They pass:

Thanks!  I pushed that one, because it was effectively a bug fix
(WaitLatch() without a latch was supposed to work).

I'll wait longer for feedback on the main patch; perhaps someone has a
better idea, or wants to take issue with the magic number 1024 (ie the
limit on how many records we'll replay before we notice the postmaster
has exited), or my plan to harmonise those wait loops?  It has a CF
entry for the next CF.
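For readers following along, the rate-limiting idea behind the magic
number can be pictured like this.  A sketch only, not the committed
code; the constant and helper name here are illustrative, and only
PostmasterIsAlive() is a real function.

#define POLL_RATE_LIMIT 1024	/* the magic number under discussion */

/*
 * On builds without a postmaster-death signal, check for postmaster
 * exit at most once per POLL_RATE_LIMIT replayed records, bounding the
 * cost of the check while still noticing postmaster death promptly.
 */
static bool
ShouldContinueRecovery(void)
{
	static uint32 records_since_check = 0;

	if (++records_since_check >= POLL_RATE_LIMIT)
	{
		records_since_check = 0;
		if (!PostmasterIsAlive())
			return false;		/* postmaster died; stop replaying */
	}
	return true;
}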
RE: [Patch] Optimize dropping of relation buffers using dlist
From: Amit Kapila
> > > > Don't
> > > > we need to guarantee the cache to be valid while recovery?
> > > >
> > >
> > > One possibility could be that we somehow detect that the value we
> > > are using is a cached one and if so then only do this optimization.
> >
> > I basically like this direction. But I'm not sure the additional
> > parameter for smgrnblocks is acceptable.
> >
> > But on the contrary, it might be a better design that
> > DropRelFileNodeBuffers gives up the optimization when
> > smgrnblocks(,,must_accurate = true) returns InvalidBlockNumber.
> >
>
> I haven't thought about what is the best way to achieve this. Let us see if
> Tsunakawa-San or Kirk-San has other ideas on this?

I see no need for smgrnblocks() to add an argument as it returns the
correct cached or measured value.

Regards
Takayuki Tsunakawa
Re: PostmasterIsAlive() in recovery (non-USE_POST_MASTER_DEATH_SIGNAL builds)
On Sun, 20 Sep 2020 at 09:29, Thomas Munro wrote:
>
> Although I know from CI that this builds and passes "make check" on
> Windows, I'm hoping to attract some review of the 0001 patch from a
> Windows person, and confirmation that it passes "check-world" (or at
> least src/test/recovery check) there, because I don't have CI scripts
> for that yet.

I've gone as far as running the recovery tests on the v3-0001 patch
using a Windows machine. They pass:

L:\Projects\Postgres\d\src\tools\msvc>vcregress taptest src/test/recovery
...
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. ok
t/005_replay_delay.pl ................ ok
t/006_logical_decoding.pl ............ ok
t/007_sync_rep.pl .................... ok
t/008_fsm_truncation.pl .............. ok
t/009_twophase.pl .................... ok
t/010_logical_decoding_timelines.pl .. ok
t/011_crash_recovery.pl .............. skipped: Test fails on Windows perl
t/012_subtransactions.pl ............. ok
t/013_crash_restart.pl ............... ok
t/014_unlogged_reinit.pl ............. ok
t/015_promotion_pages.pl ............. ok
t/016_min_consistency.pl ............. ok
t/017_shm.pl ......................... skipped: SysV shared memory not supported by this platform
t/018_wal_optimize.pl ................ ok
t/019_replslot_limit.pl .............. ok
t/020_archive_status.pl .............. ok
All tests successful.
Files=20, Tests=222, 397 wallclock secs ( 0.08 usr +  0.00 sys =  0.08 CPU)
Result: PASS

L:\Projects\Postgres\d\src\tools\msvc>git diff --stat
 src/backend/storage/ipc/latch.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

David
RE: [Patch] Optimize dropping of relation buffers using dlist
I looked at v14.

(1)
+		/* Get the total number of blocks for the supplied relation's fork */
+		for (j = 0; j < nforks; j++)
+		{
+			BlockNumber	block = smgrnblocks(smgr_reln, forkNum[j]);
+			nblocks += block;
+		}

Why do you sum all forks?

(2)
+		if ((nblocks / (uint32) NBuffers) < BUF_DROP_FULLSCAN_THRESHOLD &&
+			BlockNumberIsValid(nblocks))
+		{

The division by NBuffers is not necessary, because both sides of < are
numbers of blocks.
Why is the BlockNumberIsValid(nblocks) call needed?

(3)
	if (reln->smgr_cached_nblocks[forknum] == blocknum)
		reln->smgr_cached_nblocks[forknum] = blocknum + 1;
	else
+	{
+		/*
+		 * DropRelFileNodeBuffers relies on the behavior that cached nblocks
+		 * won't be invalidated by file extension while recovering.
+		 */
+		Assert(!InRecovery);
		reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+	}

I think this change is not directly related to this patch and can be a
separate patch, but I want to leave the decision up to a committer.

Regards
Takayuki Tsunakawa
Re: Command statistics system (cmdstats)
On Wed, Sep 23, 2020 at 10:41:10AM +0900, Michael Paquier wrote:
> I think that this remark is a bit unfair.  When it comes to any patch
> in the commit fest, and I have managed a couple of them over the
> years, I have tried to keep a neutral view for all of them, meaning
> that I only let the numbers speak by themselves, mostly:
> - When has a patch lastly been updated.
> - What's the current state in the CF app
>
> If the patch had no reviews and still in "Needs review", simply switch
> it to "waiting on author".

Sorry, forgot to mention here that this is in the case where a patch
does not apply anymore or that the CF bot complains, requiring a check
of the patch from the author.

> If the patch has been waiting on author for more than two weeks, it
> would get marked as RwF at the end of the CF.  There are also cases
> where the status of the patch is incorrect, mostly that a patch
> waiting on author because of a review was not updated as such in the
> CF app.  In this case, and in the middle of a CF, this would get get
> marked as Rwf, but the author would get a reminder of that.

Last sentence here is incorrect as well: s/get get/not get/, and the
author would be reminded to update his/her patch.
--
Michael

signature.asc
Description: PGP signature
Re: Syncing pg_multixact directories
Hi,

On 2020-09-23 13:45:51 +1200, Thomas Munro wrote:
> I think we should be ensuring that directory entries for newly created
> multixact files are durable at checkpoint time.  Please see attached.

Good catch!  That should probably be backpatched...

Greetings,

Andres Freund
Re: Missing TOAST table for pg_class
On Tue, Sep 22, 2020 at 05:35:48PM -0400, Tom Lane wrote:
> What exactly do you argue has changed since the previous decision
> that should cause us to change it?  In particular, where is the
> additional data to change our minds about the safety of such a thing?

Not sure that's safe, as we also want to avoid circular dependencies
similarly for pg_class, pg_index and pg_attribute.
--
Michael

signature.asc
Description: PGP signature
Re: Handing off SLRU fsyncs to the checkpointer
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro wrote:
> On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro wrote:
> > While scanning for comments and identifier names that needed updating,
> > I realised that this patch changed the behaviour of the ShutdownXXX()
> > functions, since they currently flush the SLRUs but are not followed
> > by a checkpoint.  I'm not entirely sure I understand the logic of
> > that, but it wasn't my intention to change it.  So here's a version
> > that converts the existing fsync_fname() to fsync_fname_recurse() to
>
> Bleugh, that was probably a bad idea, it's too expensive.  But it
> forces me to ask the question: *why* do we need to call
> Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
> shutdown checkpoint?  I wondered if this might date from before the
> WAL, but I see that the pattern was introduced when the CLOG was moved
> out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
> but even in that commit the preceding CreateCheckPoint() call included
> a call to CheckPointCLOG().

I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.  As for the ShutdownXXX() functions, I haven't yet come up with
any reason for this code to exist.  Emboldened by a colleague's
inability to explain to me what that code is doing for us, here is a
new version that just rips it all out.

From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH v6 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.
---
 src/backend/access/transam/clog.c      | 7 -------
 src/backend/access/transam/commit_ts.c | 6 ------
 src/backend/access/transam/slru.c      | 7 +++++++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }

 /*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }

 /*
--
2.20.1

From ae51604c43e10ba6d24b628c2b907eca251177e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v6 2/2] Defer flushing of SLRU files.

Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery.  Collapse requests
for the same file into a single system call as part of the next
checkpoint, as we already did for relation files, using the
infrastructure developed by commit 3eb77eba.  This causes a significant
improvement to recovery performance.

Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.

Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool.  Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.

Tested-by: Jakub Wartak
Discussion: https://postgr.es/m/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com
---
 src/backend/access/transam/clog.c      |  36 +++---
 src/backend/access/transam/commit_ts.c |  32 +++--
 src/backend/access/transam/multixact.c |  53 +-------
 src/backend/access/transam/slru.c      | 154 +++++++++++++++---------
 src/backend/access/transam/subtrans.c  |  25 +---
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On Tue, Sep 22, 2020 at 11:45:14PM +0200, Peter Eisentraut wrote:
> However, I still think the integer type use is a bit inconsistent.  In both
> cases, using strtoul() and dealing with unsigned integer types between
> parsing and final use would be more consistent.

No objections to that either, so changed this way, though keeping those
variables signed would have been fine too, as applying values of 2B~4B
is not really going to matter much here ;p
--
Michael

diff --git a/src/bin/pg_test_fsync/.gitignore b/src/bin/pg_test_fsync/.gitignore
index f3b5932498..5eb5085f45 100644
--- a/src/bin/pg_test_fsync/.gitignore
+++ b/src/bin/pg_test_fsync/.gitignore
@@ -1 +1,3 @@
 /pg_test_fsync
+
+/tmp_check/
diff --git a/src/bin/pg_test_fsync/Makefile b/src/bin/pg_test_fsync/Makefile
index 7632c94eb7..c4f9ae0664 100644
--- a/src/bin/pg_test_fsync/Makefile
+++ b/src/bin/pg_test_fsync/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'

+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_fsync$(X)'
diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index 6e47293123..7119ed0512 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -5,6 +5,7 @@

 #include "postgres_fe.h"

+#include <limits.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
@@ -62,7 +63,7 @@ do { \

 static const char *progname;

-static int	secs_per_test = 5;
+static uint32	secs_per_test = 5;
 static int	needs_unlink = 0;
 static char full_buf[DEFAULT_XLOG_SEG_SIZE],
 		   *buf,
@@ -148,6 +149,8 @@ handle_args(int argc, char *argv[])
 	int			option;			/* Command line option */
 	int			optindex = 0;	/* used by getopt_long */
+	unsigned long optval;		/* used for option parsing */
+	char	   *endptr;

 	if (argc > 1)
 	{
@@ -173,7 +176,24 @@ handle_args(int argc, char *argv[])
 				break;

 			case 's':
-				secs_per_test = atoi(optarg);
+				errno = 0;
+				optval = strtoul(optarg, &endptr, 10);
+
+				if (endptr == optarg || *endptr != '\0' ||
+					errno != 0 || optval != (uint32) optval)
+				{
+					pg_log_error("invalid argument for option %s", "--secs-per-test");
+					fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+					exit(1);
+				}
+
+				secs_per_test = (uint32) optval;
+				if (secs_per_test == 0)
+				{
+					pg_log_error("%s must be in range %u..%u",
+								 "--secs-per-test", 1, UINT_MAX);
+					exit(1);
+				}
 				break;

 			default:
@@ -193,8 +213,8 @@ handle_args(int argc, char *argv[])
 		exit(1);
 	}

-	printf(ngettext("%d second per test\n",
-					"%d seconds per test\n",
+	printf(ngettext("%u second per test\n",
+					"%u seconds per test\n",
 					secs_per_test),
 		   secs_per_test);
 #if PG_O_DIRECT != 0
diff --git a/src/bin/pg_test_fsync/t/001_basic.pl b/src/bin/pg_test_fsync/t/001_basic.pl
new file mode 100644
index 0000000000..fe9c295c49
--- /dev/null
+++ b/src/bin/pg_test_fsync/t/001_basic.pl
@@ -0,0 +1,25 @@
+use strict;
+use warnings;
+
+use Config;
+use TestLib;
+use Test::More tests => 12;
+
+#################################################
+# Basic checks
+
+program_help_ok('pg_test_fsync');
+program_version_ok('pg_test_fsync');
+program_options_handling_ok('pg_test_fsync');
+
+#################################################
+# Test invalid option combinations
+
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', 'a' ],
+	qr/\Qpg_test_fsync: error: invalid argument for option --secs-per-test\E/,
+	'pg_test_fsync: invalid argument for option --secs-per-test');
+command_fails_like(
+	[ 'pg_test_fsync', '--secs-per-test', '0' ],
+	qr/\Qpg_test_fsync: error: --secs-per-test must be in range 1..4294967295\E/,
+	'pg_test_fsync: --secs-per-test must be in range');
diff --git a/src/bin/pg_test_timing/.gitignore b/src/bin/pg_test_timing/.gitignore
index f6c664c765..e5aac2ab12 100644
--- a/src/bin/pg_test_timing/.gitignore
+++ b/src/bin/pg_test_timing/.gitignore
@@ -1 +1,3 @@
 /pg_test_timing
+
+/tmp_check/
diff --git a/src/bin/pg_test_timing/Makefile b/src/bin/pg_test_timing/Makefile
index 334d6ff5c0..52994b4103 100644
--- a/src/bin/pg_test_timing/Makefile
+++ b/src/bin/pg_test_timing/Makefile
@@ -22,6 +22,12 @@ install: all installdirs
 installdirs:
 	$(MKDIR_P) '$(DESTDIR)$(bindir)'

+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
+
 uninstall:
 	rm -f '$(DESTDIR)$(bindir)/pg_test_timing$(X)'
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e14802372b..c878b19035 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -6,15 +6,17 @@

 #include "postgres_fe.h"

+#include <limits.h>
+
 #include "getopt_long.h"
 #include "portability/instr_time.h"

 static const char *progname;

-static int32 test_duration = 3;
+static uint32 test_duration = 3;

 static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(int32);
+static uint64
Syncing pg_multixact directories
Hello hackers,

I think we should be ensuring that directory entries for newly created
multixact files are durable at checkpoint time.  Please see attached.

From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH 1/2] Fix missing fsync of multixact directories.

Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.

Back-patch to all supported releases.
---
 src/backend/access/transam/clog.c      | 7 -------
 src/backend/access/transam/commit_ts.c | 6 ------
 src/backend/access/transam/slru.c      | 7 +++++++
 3 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
 	/* Flush dirty CLOG pages to disk */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
 	SimpleLruFlush(XactCtl, true);
-
-	/*
-	 * fsync pg_xact to ensure that any files flushed previously are durably
-	 * on disk.
-	 */
-	fsync_fname("pg_xact", true);
-
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
 }

diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
 {
 	/* Flush dirty CommitTs pages to disk */
 	SimpleLruFlush(CommitTsCtl, true);
-
-	/*
-	 * fsync pg_commit_ts to ensure that any files flushed previously are
-	 * durably on disk.
-	 */
-	fsync_fname("pg_commit_ts", true);
 }

 /*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
 	}
 	if (!ok)
 		SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+	/*
+	 * Make sure that the directory entries for any newly created files are on
+	 * disk.
+	 */
+	if (ctl->do_fsync)
+		fsync_fname(ctl->Dir, true);
 }

 /*
--
2.20.1
Re: Command statistics system (cmdstats)
On Mon, Sep 21, 2020 at 10:41:46AM -0300, Alvaro Herrera wrote:
> "A couple of weeks" of inactivity is not sufficient, in my view, to boot
> a patch out of the commitfest process.  Whenever the patch is
> resurrected, it will be a new entry which won't have the history that it
> had accumulated in the long time since it was created -- which biases
> it against other new patches.

Any of the patches that have been marked as RwF last week have been
waiting on author since at least the end of the commit fest of July,
with zero actions taken on them.  In my experience, this points
strongly to the fact that these are unlikely to be worked on actively
in a short time frame.

> I'm not really arguing about this one patch only, but more generally
> about the process you're following.  The problem is that you as CFM are
> imposing your personal priorities on the whole process by punting
> patches ahead of time -- you are saying that nobody else should be
> looking at updating this patch because it is now closed.  It seems fair
> to do so at the end of the commitfest, but it does not seem fair to do
> it when it's still mid-cycle.

I think that this remark is a bit unfair.  When it comes to any patch
in the commit fest, and I have managed a couple of them over the
years, I have tried to keep a neutral view for all of them, meaning
that I only let the numbers speak by themselves, mostly:
- When has a patch lastly been updated.
- What's the current state in the CF app

If the patch had no reviews and still in "Needs review", simply switch
it to "waiting on author".  If the patch has been waiting on author for
more than two weeks, it would get marked as RwF at the end of the CF.
There are also cases where the status of the patch is incorrect, mostly
that a patch waiting on author because of a review was not updated as
such in the CF app.  In this case, and in the middle of a CF, this
would get get marked as Rwf, but the author would get a reminder of
that.

Any of the patches that have been discarded by me last week were
waiting in the stack for much, much, longer than that, and I think that
it does not help reviewers and committers to keep this stuff longer
than necessary, because this mainly bloats the CF app in a
non-necessary way, hiding patches that should deserve more attention.
Some of those patches have been left idle for half a year.

Another point that I'd like to make here is that my intention was to
lift the curve here (a term quite fashionable this year particularly,
isn't it?), due to the high number of patches to handle in a single CF,
and the shortage of hands actually doing this kind of work.

Then, let's be clear about one last thing.  If at *any* moment people
here feel that the state of at least *one* patch of this CF has been
changed based on some of my "personal" priority views, I will
immediately stop any action as CFM and resign from it.  So, if that's
the case, please feel free to speak up here or on any of the threads
related to those patches.

> Putting also in perspective the history that the patch had prior to your
> unfairly closing it, there was a lot of effort in getting it reviewed to
> a good state and updated by the author -- yet a single unsubstantiated
> comment, without itself a lot of effort on the reviewer's part, that
> "maybe it has a perf drawback" is sufficient to get it booted?  It
> doesn't seem appropriate.

I agree on the fact that not being able to recreate a new CF entry that
keeps the history of an old one, with all the past threads or
annotations, could be helpful in some cases.  Now, I have seen as well
cases where a patch that had been around for so long with a very long
thread made things confusing, and people also like starting a new
thread with a summary of the past state as a first message to attempt
to clarify things.  So both have advantages.
--
Michael

signature.asc
Description: PGP signature
Re: new heapcheck contrib module
Peter Geoghegan writes:
> On Mon, Sep 21, 2020 at 2:09 PM Robert Haas wrote:
>> +REVOKE ALL ON FUNCTION
>> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
>> +FROM PUBLIC;
>>
>> This too.

> Do we really want to use a cstring as an enum-like argument?

Ugh.  We should not be using cstring as a SQL-exposed datatype unless
there really is no alternative.  Why wasn't this argument declared
"text"?

			regards, tom lane
Re: Lift line-length limit for pg_service.conf
I wrote:
> Yeah.  In a quick scan, it appears that there is only one caller that
> tries to save the result directly.  So I considered making that caller
> do a pstrdup and eliminating the extra thrashing in t_readline itself.
> But it seemed too fragile; somebody would get it wrong and then have
> excess space consumption for their dictionary.

I had a better idea: if we get rid of t_readline() itself, which has
been deprecated for years anyway, we can have the calling layer
tsearch_readline_xxx maintain a StringInfo across the whole file read
process and thus get rid of alloc/free cycles for the StringInfo.

However ... while working on that, I realized that the usage of the
"curline" field in that code is completely unsafe.  We save a pointer
to the result of tsearch_readline() for possible use by the error
context callback, *but the caller is likely to free that string at some
point*.  So there is a window where an error would result in trying to
print already-freed data.

It looks like all of the core-code dictionaries free the result string
at the bottoms of their loops, so that the window for trouble is pretty
much empty.  But contrib/dict_xsyn doesn't do it like that, and so it's
surely at risk.  Any external dictionaries that have copied that code
would also be at risk.

So the attached adds a pstrdup/pfree to ensure that "curline" has its
own storage, putting us right back at two palloc/pfree cycles per line.
I don't think there's a lot of choice though; in fact, I'm leaning to
the idea that we need to back-patch that part of this.  The odds of
trouble in a production build probably aren't high, but still...

			regards, tom lane

diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index cb0835982d..64c979086d 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -286,11 +286,6 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
 					 errmsg("unexpected end of line")));

-		/*
-		 * Note: currently, tsearch_readline can't return lines exceeding 4KB,
-		 * so overflow of the word counts is impossible.  But that may not
-		 * always be true, so let's check.
-		 */
 		if (nwrd != (uint16) nwrd || posinsubst != (uint16) posinsubst)
 			ereport(ERROR,
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index a916dd6cb6..c85939631e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -14,6 +14,7 @@
 #include "postgres.h"

 #include "catalog/pg_collation.h"
+#include "common/string.h"
 #include "storage/fd.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -128,6 +129,7 @@ tsearch_readline_begin(tsearch_readline_state *stp,
 		return false;
 	stp->filename = filename;
 	stp->lineno = 0;
+	initStringInfo(&stp->buf);
 	stp->curline = NULL;
 	/* Setup error traceback support for ereport() */
 	stp->cb.callback = tsearch_readline_callback;
@@ -146,11 +148,44 @@ char *
 tsearch_readline(tsearch_readline_state *stp)
 {
 	char	   *result;
+	int			len;

+	/* Advance line number to use in error reports */
 	stp->lineno++;
-	stp->curline = NULL;
-	result = t_readline(stp->fp);
-	stp->curline = result;
+
+	/* Clear curline, it's no longer relevant */
+	if (stp->curline)
+	{
+		pfree(stp->curline);
+		stp->curline = NULL;
+	}
+
+	/* Collect next line, if there is one */
+	if (!pg_get_line_buf(stp->fp, &stp->buf))
+		return NULL;
+
+	/* Make sure the input is valid UTF-8 */
+	len = stp->buf.len;
+	(void) pg_verify_mbstr(PG_UTF8, stp->buf.data, len, false);
+
+	/* And convert */
+	result = pg_any_to_server(stp->buf.data, len, PG_UTF8);
+	if (result == stp->buf.data)
+	{
+		/*
+		 * Conversion didn't pstrdup, so we must.  We can use the length of
+		 * the original string, because no conversion was done.
+		 */
+		result = pnstrdup(result, len);
+	}
+
+	/*
+	 * Save a copy of the line for possible use in error reports.  (We cannot
+	 * just save "result", since it's likely to get pfree'd at some point by
+	 * the caller; an error after that would try to access freed data.)
+	 */
+	stp->curline = pstrdup(result);
+
 	return result;
 }

@@ -160,7 +195,16 @@ tsearch_readline(tsearch_readline_state *stp)
 void
 tsearch_readline_end(tsearch_readline_state *stp)
 {
+	/* Suppress use of curline in any error reported below */
+	if (stp->curline)
+	{
+		pfree(stp->curline);
+		stp->curline = NULL;
+	}
+
+	pfree(stp->buf.data);
 	FreeFile(stp->fp);
+
 	/* Pop the error context stack */
 	error_context_stack = stp->cb.previous;
 }
@@ -176,8 +220,7 @@ tsearch_readline_callback(void *arg)

 	/*
 	 * We can't include the text of the config line for errors that occur
-	 * during t_readline() itself.  This is only partly a consequence of our
-	 * arms-length use of that routine: the major cause of such errors is
+	 * during tsearch_readline() itself.  The major cause of
RE: Global snapshots
From: Andrey Lepikhov
> Thank you for this work!
> As I can see, main development difficulties placed in other areas: CSN,
> resolver, global deadlocks, 2PC commit... I'm not lawyer too. But if we
> get remarks from the patent holders, we can rewrite our Clock-SI
> implementation.

Yeah, I understand your feeling.  I personally don't like patents, and
don't want to be disturbed by them.  But the world is not friendly...
We are not lawyers, but we have to do our best to make sure PostgreSQL
will be patent-free, by checking the technologies as engineers.

Among the above items, CSN is the only concerning one.  Other items are
written in textbooks, well-known, and used in other DBMSs, so they
should be free from patents.  However, CSN is not (at least to me.)
Have you checked if CSN is not related to some patent?  Or is CSN or
similar technology already widely used in famous software, so that we
can regard it as patent-free?

And please wait.  As below, the patent holder just says that Clock-SI
is not based on the patent and was an independent development.  He
doesn't say Clock-SI does not overlap with the patent, or that
implementing Clock-SI does not infringe on the patent.  Rather, he
suggests that Clock-SI has many similarities and thus those may match
the claims of the patent (unintentionally?)  I felt this is a sign of
risking infringement.

"The answer to your question is: No, Clock-SI is not based on the
patent - it was an entirely independent development. The two approaches
are similar in the sense that there is no global clock, the commit time
of a distributed transaction is the same in every partition where it
modified data, and a transaction gets its snapshot timestamp from a
local clock. The difference is whether a distributed transaction gets
its commit timestamp before or after the prepare phase in 2PC."

The timeline of events also worries me.  It seems unnatural to consider
that Clock-SI and the patent are independent.

2010/6 - 2010/9    One Clock-SI author worked for Microsoft Research as
a research intern
2010/10            Microsoft filed the patent
2011/9 - 2011/12   The same Clock-SI author worked for Microsoft
Research as a research intern
2013               The same author moved to EPFL and published the
Clock-SI paper with another author who has worked for Microsoft
Research since then.

So, could you give your opinion on whether we can use Clock-SI without
overlapping with the patent claims?  I also will try to check and see,
so that I can understand your technical analysis.

And I've just noticed that I got in touch with another author of
Clock-SI via SNS, and sent an inquiry to him.  I'll report again when I
have a reply.

Regards
Takayuki Tsunakawa
Re: new heapcheck contrib module
On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger wrote:
> I had an earlier version of the verify_heapam patch that included a
> non-throwing interface to clog.  Ultimately, I ripped that out.  My
> reasoning was that a simpler patch submission was more likely to be
> acceptable to the community.

Isn't some kind of pragmatic compromise possible?

> But I don't want to make this patch dependent on that hypothetical
> patch getting written and accepted.

Fair enough, but if you're alluding to what I said then about
check_tuphdr_xids()/clog checking a while back, then FWIW I didn't
intend to block progress on clog/xact status verification at all. I
just don't think that it is sensible to impose an iron clad guarantee
about having no assertion failures with corrupt clog data -- that leads
to far too much code duplication. But why should you need to provide an
absolute guarantee of that?

I for one would be fine with making the clog checks an optional extra,
one that rescinds the no-crash guarantee that you're keen on -- just
like with the TOAST checks that you have already in v15.

It might make sense to review how often crashes occur with simulated
corruption, and then to minimize the number of occurrences in the real
world. Maybe we could tolerate a usually-no-crash interface to clog --
if it could still have assertion failures. Making a strong guarantee
about assertions seems unnecessary.

I don't see how verify_heapam will avoid raising an error during basic
validation from PageIsVerified(), which will violate the guarantee
about not throwing errors. I don't see that as a problem myself, but
presumably you will.

--
Peter Geoghegan
Re: new heapcheck contrib module
On Mon, Sep 21, 2020 at 2:09 PM Robert Haas wrote:
> +REVOKE ALL ON FUNCTION
> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
> +FROM PUBLIC;
>
> This too.

Do we really want to use a cstring as an enum-like argument?

I think that I see a bug at this point in check_tuple() (in
v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch):

> +	/* If xmax is a multixact, it should be within valid range */
> +	xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
> +	if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx))
> +	{

*** SNIP ***

> +	}
> +
> +	/* If xmax is normal, it should be within valid range */
> +	if (TransactionIdIsNormal(xmax))
> +	{

Why should it be okay to call TransactionIdIsNormal(xmax) at this
point? It isn't certain that xmax is an XID at all (it could be a
MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the
value in the first place). Don't you need to check "(infomask &
HEAP_XMAX_IS_MULTI) == 0" here?

This does look like it's shaping up. Thanks for working on it, Mark.

--
Peter Geoghegan
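Spelled out, the guard being suggested would look roughly like this (a
sketch following the identifiers in the patch under review, not a hunk
from it):

	/*
	 * Only treat xmax as a plain transaction ID when the multixact bit
	 * is clear; HeapTupleHeaderGetRawXmax() can return either kind of
	 * value depending on infomask.
	 */
	if (!(infomask & HEAP_XMAX_IS_MULTI) && TransactionIdIsNormal(xmax))
	{
		/* xmax is a normal XID; range-check it as such */
	}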
Re: Lift line-length limit for pg_service.conf
Daniel Gustafsson writes:
>> On 22 Sep 2020, at 23:24, Tom Lane wrote:
>> In the same vein, here's a patch to remove the hard-coded line length
>> limit for tsearch dictionary files.

> LGTM.  I like the comment on why not to return buf.data directly, that
> detail would be easy to miss.

Yeah.  In a quick scan, it appears that there is only one caller that
tries to save the result directly.  So I considered making that caller
do a pstrdup and eliminating the extra thrashing in t_readline itself.
But it seemed too fragile; somebody would get it wrong and then have
excess space consumption for their dictionary.

			regards, tom lane
Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration
On 2020-09-20 05:41, Michael Paquier wrote:
> On Fri, Sep 18, 2020 at 05:22:15PM +0900, Michael Paquier wrote:
>> Okay, after looking at that, here is v3.  This includes range checks
>> as well as errno checks based on strtol().  What do you think?
>
> This fails in the CF bot on Linux because of pg_logging_init()
> returning with errno=ENOTTY in the TAP tests, for which I began a new
> thread:
> https://www.postgresql.org/message-id/20200918095713.ga20...@paquier.xyz
> Not sure if this will lead anywhere, but we can also address the
> failure by enforcing errno=0 for the new calls of strtol() introduced
> in this patch.  So here is an updated patch doing so.

I think the error checking is now structurally correct in this patch.

However, I still think the integer type use is a bit inconsistent.  In
both cases, using strtoul() and dealing with unsigned integer types
between parsing and final use would be more consistent.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Missing TOAST table for pg_class
Fabrízio de Royes Mello writes:
> Attached patch adds the TOAST to pg_class, and let's open again the
> discussion around it.

What exactly do you argue has changed since the previous decision
that should cause us to change it?  In particular, where is the
additional data to change our minds about the safety of such a thing?

One thing I'd want to see is some amount of testing of pg_class toast
accesses under CLOBBER_CACHE_ALWAYS and even CLOBBER_CACHE_RECURSIVELY.
AFAICT from reviewing the prior thread, nobody did any such thing.

			regards, tom lane
Re: Lift line-length limit for pg_service.conf
> On 22 Sep 2020, at 23:24, Tom Lane wrote:
>
> In the same vein, here's a patch to remove the hard-coded line length
> limit for tsearch dictionary files.

LGTM.  I like the comment on why not to return buf.data directly, that
detail would be easy to miss.

cheers ./daniel
Re: Binaries on s390x arch
Hi,

On Tue, 2020-09-22 at 10:57 +0200, Christoph Berg wrote:
> Note that I'm working on the .deb packages for Debian and Ubuntu. For
> .rpm (RHEL, SLES) you need to get Devrim on board. (I'm putting him
> on Cc.)

Thanks!

> To be able to keep up with the other architectures when building
> binaries for PostgreSQL releases, the machine would have about the
> same specs as the existing arm64 or ppc64el build daemons:
>
> VMs: 1
> OS: Debian buster
> vCPUs: 8
> Memory: 16 GB
> Storage: 100GB (most of that is swap space for tmpfs)

@Namrata: Same here for RHEL RPMs:

VMs: 2
OS: RHEL 7 and RHEL 8
vCPUs: 8 (each)
Memory: 16 GB (each)
Storage: 100GB

Regards,
--
Devrim Gündüz
Open Source Solution Architect, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part
Re: Lift line-length limit for pg_service.conf
In the same vein, here's a patch to remove the hard-coded line length
limit for tsearch dictionary files.

			regards, tom lane

diff --git a/src/backend/tsearch/dict_thesaurus.c b/src/backend/tsearch/dict_thesaurus.c
index cb0835982d..64c979086d 100644
--- a/src/backend/tsearch/dict_thesaurus.c
+++ b/src/backend/tsearch/dict_thesaurus.c
@@ -286,11 +286,6 @@ thesaurusRead(const char *filename, DictThesaurus *d)
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
 					 errmsg("unexpected end of line")));

-		/*
-		 * Note: currently, tsearch_readline can't return lines exceeding 4KB,
-		 * so overflow of the word counts is impossible.  But that may not
-		 * always be true, so let's check.
-		 */
 		if (nwrd != (uint16) nwrd || posinsubst != (uint16) posinsubst)
 			ereport(ERROR,
 					(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/src/backend/tsearch/ts_locale.c b/src/backend/tsearch/ts_locale.c
index a916dd6cb6..247180d56e 100644
--- a/src/backend/tsearch/ts_locale.c
+++ b/src/backend/tsearch/ts_locale.c
@@ -14,6 +14,8 @@
 #include "postgres.h"

 #include "catalog/pg_collation.h"
+#include "common/string.h"
+#include "lib/stringinfo.h"
 #include "storage/fd.h"
 #include "tsearch/ts_locale.h"
 #include "tsearch/ts_public.h"
@@ -204,29 +206,41 @@ tsearch_readline_callback(void *arg)
 char *
 t_readline(FILE *fp)
 {
+	StringInfoData buf;
 	int			len;
 	char	   *recoded;
-	char		buf[4096];		/* lines must not be longer than this */

-	if (fgets(buf, sizeof(buf), fp) == NULL)
+	initStringInfo(&buf);
+
+	if (!pg_get_line_buf(fp, &buf))
+	{
+		pfree(buf.data);
 		return NULL;
+	}

-	len = strlen(buf);
+	len = buf.len;

 	/* Make sure the input is valid UTF-8 */
-	(void) pg_verify_mbstr(PG_UTF8, buf, len, false);
+	(void) pg_verify_mbstr(PG_UTF8, buf.data, len, false);

 	/* And convert */
-	recoded = pg_any_to_server(buf, len, PG_UTF8);
-	if (recoded == buf)
+	recoded = pg_any_to_server(buf.data, len, PG_UTF8);
+	if (recoded == buf.data)
 	{
 		/*
 		 * conversion didn't pstrdup, so we must.  We can use the length of the
 		 * original string, because no conversion was done.
+		 *
+		 * Note: it might seem attractive to just return buf.data, and in most
+		 * usages that'd be fine.  But a few callers save the returned string
+		 * as long-term data, so returning a palloc chunk that's bigger than
+		 * necessary is a bad idea.
 		 */
 		recoded = pnstrdup(recoded, len);
 	}

+	pfree(buf.data);
+
 	return recoded;
 }
Missing TOAST table for pg_class
Hi all,

I know it has been discussed before [1] but one of our customers
complained about something weird on one of their multi-tenancy
databases (thousands of schemas with a lot of objects inside and one
user per schema).

So when I checked, the problem is because of the missing TOAST for
pg_class, and it is easy to break it by just:

fabrizio=# create table foo (a int, b int);
CREATE TABLE

fabrizio=# do $$
begin
  for i in 1..2500 loop
    execute 'create user u' || i;
    execute 'grant all on foo to u' || i;
  end loop;
end;
$$;
ERROR:  row is too big: size 8168, maximum size 8160
CONTEXT:  SQL statement "grant all on foo to u2445"
PL/pgSQL function inline_code_block line 6 at EXECUTE

Attached patch adds the TOAST to pg_class, and let's open again the
discussion around it.

Regards,

[1] https://www.postgresql.org/message-id/flat/84ddff04-f122-784b-b6c5-3536804495f8%40joeconway.com

--
Fabrízio de Royes Mello
Timbira - http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

diff --git a/src/include/catalog/toasting.h b/src/include/catalog/toasting.h
index 51491c4513..729bc2cc66 100644
--- a/src/include/catalog/toasting.h
+++ b/src/include/catalog/toasting.h
@@ -51,6 +51,7 @@ extern void BootstrapToastTable(char *relName,
 /* normal catalogs */
 DECLARE_TOAST(pg_aggregate, 4159, 4160);
 DECLARE_TOAST(pg_attrdef, 2830, 2831);
+DECLARE_TOAST(pg_class, 4179, 4180);
 DECLARE_TOAST(pg_collation, 4161, 4162);
 DECLARE_TOAST(pg_constraint, 2832, 2833);
 DECLARE_TOAST(pg_default_acl, 4143, 4144);
diff --git a/src/test/regress/expected/misc_sanity.out b/src/test/regress/expected/misc_sanity.out
index 8538173ff8..0a2f5cc2a2 100644
--- a/src/test/regress/expected/misc_sanity.out
+++ b/src/test/regress/expected/misc_sanity.out
@@ -100,12 +100,9 @@ ORDER BY 1, 2;
 pg_attribute            | attfdwoptions | text[]
 pg_attribute            | attmissingval | anyarray
 pg_attribute            | attoptions    | text[]
- pg_class                | relacl        | aclitem[]
- pg_class                | reloptions    | text[]
- pg_class                | relpartbound  | pg_node_tree
 pg_index                | indexprs      | pg_node_tree
 pg_index                | indpred       | pg_node_tree
 pg_largeobject          | data          | bytea
 pg_largeobject_metadata | lomacl        | aclitem[]
-(11 rows)
+(8 rows)
RE: Transactions involving multiple postgres foreign servers, take 2
From: Ashutosh Bapat > parallelism here has both pros and cons. If one of the servers errors > out while preparing for a transaction, there is no point in preparing > the transaction on other servers. In parallel execution we will > prepare on multiple servers before realising that one of them has > failed to do so. On the other hand preparing on multiple servers in > parallel provides a speed up. And the pros are dominant in practice. If many transactions are erroring out (during prepare), the system is not functioning for the user. Such an application should be corrected before it is put into production. > But this can be an improvement on version 1. The current approach > doesn't render such an improvement impossible. So if that's something > hard to do, we should do that in the next version rather than > complicating this patch. Could you share your idea on how the current approach could enable parallelism? This is an important point, because (1) the FDW may not lead us to a seriously competitive scale-out DBMS, and (2) a better FDW API and/or implementation could be considered for non-parallel interaction if we have the realization of parallelism in mind. I think that kind of consideration is part of the design (for the future). Regards Takayuki Tsunakawa
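To make the trade-off concrete, here is a minimal sketch (my own illustration, not code from any patch in this thread) of what firing the prepares in parallel could look like with libpq's asynchronous API; connection setup, GID generation, and rollback of already-prepared servers are omitted, and all names are hypothetical:

#include <stdbool.h>
#include <libpq-fe.h>

/* Fire PREPARE TRANSACTION at every server, then collect the results. */
static bool
prepare_all_in_parallel(PGconn **conns, int nservers)
{
	bool		all_ok = true;

	for (int i = 0; i < nservers; i++)
	{
		/* queue the command without waiting for the reply */
		if (!PQsendQuery(conns[i], "PREPARE TRANSACTION 'fx_demo'"))
			all_ok = false;
	}

	for (int i = 0; i < nservers; i++)
	{
		PGresult   *res;

		/* drain all results; NULL marks the end of this command */
		while ((res = PQgetResult(conns[i])) != NULL)
		{
			if (PQresultStatus(res) != PGRES_COMMAND_OK)
				all_ok = false;	/* caller must roll back the survivors */
			PQclear(res);
		}
	}
	return all_ok;
}

The downside Ashutosh describes is visible here: every send completes before any failure is noticed, so a failed prepare still costs a round trip to all the other servers.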
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Tue, 22 Sep 2020 at 19:08, David Rowley wrote: > I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc > 9.3. I'm unable to see any gains with this, however, the results were > pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely > need to run that a bit longer. I'll do that tonight. I've attached the results of a TPCH scale=5 run of master (f859c2ffa) vs master + elog_ereport_attribute_cold_v4.patch. It does not look great. The patched version seems to have done about 1.17% less work than master did. David tpch_scale5_elog_ereport_cold_v4_vs_master_10min.ods Description: application/vnd.oasis.opendocument.spreadsheet
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 12:41 PM Robert Haas wrote: > But now I see that there's no secondary permission check in the > verify_nbtree.c code. Is that intentional? Peter, what's the > justification for that? As noted by comments in contrib/amcheck/sql/check_btree.sql (the verify_nbtree.c tests), this is intentional. Note that we explicitly test that a non-superuser role can perform verification following GRANT EXECUTE ON FUNCTION ... . As I mentioned earlier, this is supported (or at least it is supported in my interpretation of things). It just isn't documented anywhere outside the test itself. -- Peter Geoghegan
Re: Improper use about DatumGetInt32
Robert Haas writes: > On Mon, Sep 21, 2020 at 3:53 PM Andres Freund wrote: >> I think we mostly use it for the few places where we currently expose >> data as a signed integer on the SQL level, but internally actually treat >> it as a unsigned data. > So why is the right solution to that not DatumGetInt32() + a cast to uint32? You're ignoring the xid use-case, for which DatumGetUInt32 actually is the right thing. I tend to agree though that if the SQL argument is of a signed type, the least API-abusing answer is a signed DatumGetXXX macro followed by whatever cast you need. regards, tom lane
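As a concrete illustration of the recommended pattern, a sketch (the function itself is hypothetical; only the macros are real):

#include "postgres.h"
#include "fmgr.h"

PG_FUNCTION_INFO_V1(demo_uint_arg);

/* hypothetical SQL signature: demo_uint_arg(int4) returns int4 */
Datum
demo_uint_arg(PG_FUNCTION_ARGS)
{
	int32		signed_arg = PG_GETARG_INT32(0);	/* matches the SQL type */
	uint32		uval = (uint32) signed_arg;	/* explicit, visible cast */

	/* ... treat uval as unsigned internally ... */

	PG_RETURN_INT32((int32) uval);
}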
Re: Lift line-length limit for pg_service.conf
Daniel Gustafsson writes: > [ 0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch ] I reviewed this and noticed that you'd missed adding resetStringInfo calls in some code paths, which made me realize that while pg_get_line_append() is great for its original customer in hba.c, it kinda sucks for most other callers. Having to remember to do resetStringInfo in every path through a loop is too error-prone, and it's unnecessary. So I made another subroutine that just adds that step, and updated the existing callers that could use it. Pushed with that correction. regards, tom lane
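For readers following along, a sketch of the difference (the wrapper body is my assumption from the description above; the real subroutine lives in src/common/string.c):

#include "postgres.h"
#include "common/string.h"
#include "lib/stringinfo.h"

/* the added subroutine, roughly: reset the buffer, then read one line */
static bool
get_line_buf_sketch(FILE *stream, StringInfo buf)
{
	resetStringInfo(buf);
	return pg_get_line_append(stream, buf);
}

/* callers no longer need a resetStringInfo() on every path through the loop */
static void
read_all_lines(FILE *fp)
{
	StringInfoData buf;

	initStringInfo(&buf);
	while (get_line_buf_sketch(fp, &buf))
	{
		/* ... process buf.data / buf.len; no reset required here ... */
	}
	pfree(buf.data);
}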
Re: Improper use about DatumGetInt32
On Mon, Sep 21, 2020 at 3:53 PM Andres Freund wrote: > On 2020-09-21 14:08:22 -0400, Robert Haas wrote: > > There is no SQL type corresponding to the C data type uint32, so I'm > > not sure why we even have DatumGetUInt32. I'm sort of suspicious that > > there's some fuzzy thinking going on there. > > I think we mostly use it for the few places where we currently expose > data as a signed integer on the SQL level, but internally actually treat > it as a unsigned data. So why is the right solution to that not DatumGetInt32() + a cast to uint32? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 1:55 PM Mark Dilger wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. > What do you think? I think that's an old and largely failed approach. If you want to use pg_class_ownercheck here rather than pg_class_aclcheck or something like that, that seems fair enough. But I don't think there should be an is-superuser check in the code, because we've been trying really hard to get rid of those in most places. And I also don't think there should be no secondary permissions check, because if somebody does grant execute permission on these functions, it's unlikely that they want the person getting that permission to be able to check every relation in the system, even those on which they have no other privileges at all. But now I see that there's no secondary permission check in the verify_nbtree.c code. Is that intentional? Peter, what's the justification for that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
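To illustrate the kind of secondary check being discussed, a hypothetical sketch (verify_heapam() may end up doing something different, e.g. using pg_class_aclcheck() instead):

#include "postgres.h"
#include "miscadmin.h"
#include "utils/acl.h"
#include "utils/rel.h"

/* hypothetical guard: EXECUTE on the function is not enough by itself */
static void
check_relation_ownership(Relation rel)
{
	if (!pg_class_ownercheck(RelationGetRelid(rel), GetUserId()))
		aclcheck_error(ACLCHECK_NOT_OWNER, OBJECT_TABLE,
					   RelationGetRelationName(rel));
}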
Re: SEARCH and CYCLE clauses
Hi, I found another bug: create view xx as WITH recursive destinations (departure, arrival, connections, cost, itinerary) AS (SELECT f.departure, f.arrival, 1, price, CAST(f.departure || f.arrival AS VARCHAR(2000)) FROM flights f WHERE f.departure = 'New York' UNION ALL SELECT r.departure, b.arrival, r.connections + 1 , r.cost + b.price, CAST(r.itinerary || b.arrival AS VARCHAR(2000)) FROM destinations r, flights b WHERE r.arrival = b.departure) CYCLE arrival SET cyclic_data TO '1' DEFAULT '0' using path SELECT departure, arrival, itinerary, cyclic_data FROM destinations ; postgres=# select * from xx; ERROR: attribute number 6 exceeds number of columns 5 Regards Pavel
Re: SEARCH and CYCLE clauses
Hi On Tue 22. 9. 2020 at 20:01, Peter Eisentraut < peter.eisentr...@2ndquadrant.com> wrote: > I have implemented the SEARCH and CYCLE clauses. > > This is standard SQL syntax attached to a recursive CTE to compute a > depth- or breadth-first ordering and cycle detection, respectively. > This is just convenience syntax for what you can already do manually. > The original discussion about recursive CTEs briefly mentioned these as > something to do later but then it was never mentioned again. > > SQL specifies these in terms of syntactic transformations, and so that's > how I have implemented them also, mainly in the rewriter. > > I have successfully tested this against examples I found online that > were aimed at DB2. > > The contained documentation and the code comment in rewriteHandler.c > explain the details. > I am playing with this patch. It looks good, but I found some issues (the example is from the attached data.sql) WITH recursive destinations (departure, arrival, connections, cost) AS (SELECT f.departure, f.arrival, 0, price FROM flights f WHERE f.departure = 'New York' UNION ALL SELECT r.departure, b.arrival, r.connections + 1, r.cost + b.price FROM destinations r, flights b WHERE r.arrival = b.departure) cycle departure, arrival set is_cycle to true default false using path SELECT * FROM destinations ; ; The result is correct. But when I tried to use UNION instead of UNION ALL, PostgreSQL crashes: Program received signal SIGABRT, Aborted. 0x7f761338ebc5 in raise () from /lib64/libc.so.6 (gdb) bt #0 0x7f761338ebc5 in raise () from /lib64/libc.so.6 #1 0x7f76133778a4 in abort () from /lib64/libc.so.6 #2 0x0090e7eb in ExceptionalCondition (conditionName=, errorType=, fileName=, lineNumber=) at assert.c:67 #3 0x007205e7 in generate_setop_grouplist (targetlist=targetlist@entry=0x7f75fce5d018, op=, op=) at prepunion.c:1412 #4 0x007219d0 in generate_recursion_path (pTargetList=0x7fff073ee728, refnames_tlist=, root=0xf90bd8, setOp=0xf90840) at prepunion.c:502 #5 plan_set_operations (root=0xf90bd8) at prepunion.c:156 #6 0x0070f79b in grouping_planner (root=0xf90bd8, inheritance_update=false, tuple_fraction=) at planner.c:1886 #7 0x00712ea7 in subquery_planner (glob=, parse=, parent_root=, hasRecursion=, tuple_fraction=0) at planner.c:1015 #8 0x0071a614 in SS_process_ctes (root=0xf7abd8) at subselect.c:952 #9 0x007125d4 in subquery_planner (glob=glob@entry=0xf8a010, parse=parse@entry=0xf6cf20, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=false, tuple_fraction=tuple_fraction@entry=0) at planner.c:645 #10 0x0071425b in standard_planner (parse=0xf6cf20, query_string=, cursorOptions=256, boundParams=) at planner.c:405 #11 0x007e5f68 in pg_plan_query (querytree=0xf6cf20, query_string=query_string@entry=0xea6370 "WITH recursive destinations (departure, arrival, connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:875 #12 0x007e6061 in pg_plan_queries (querytrees=0xf8b690, query_string=query_string@entry=0xea6370 "WITH recursive destinations (departure, arrival, connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "..., cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:966 #13 0x007e63b8 in exec_simple_query ( query_string=0xea6370 "WITH recursive destinations (departure, arrival,
connections, cost) AS \n(SELECT f.departure, f.arrival, 0, price\n", ' ' , "FROM flights f \n", ' ' , "WHERE f.departure = 'New York' \n UNION "...) at postgres.c:1158 #14 0x007e81e4 in PostgresMain (argc=, argv=, dbname=, username=) at postgres.c:4309 #15 0x007592b9 in BackendRun (port=0xecaf20) at postmaster.c:4541 #16 BackendStartup (port=0xecaf20) at postmaster.c:4225 #17 ServerLoop () at postmaster.c:1742 #18 0x0075a0ed in PostmasterMain (argc=, argv=0xea0c90) at postmaster.c:1415 #19 0x004832ec in main (argc=3, argv=0xea0c90) at main.c:209 It looks like the USING clause in cycle detection is unsupported by DB2 and Oracle; the examples from these databases don't work on PG without modifications. Regards Pavel > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > data.sql Description: application/sql
Re: new heapcheck contrib module
On Tue, Sep 22, 2020 at 10:55 AM Mark Dilger wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. > What do you think? The existing amcheck functions were designed to have execute privilege granted to non-superusers, though we never actually advertised that fact. Maybe now would be a good time to start doing so. -- Peter Geoghegan
Re: new heapcheck contrib module
> On Sep 21, 2020, at 2:09 PM, Robert Haas wrote: > > I think there should be a call to pg_class_aclcheck() here, just like > the one in pg_prewarm, so that if the superuser does choose to grant > access, users given access can check tables they anyway have > permission to access, but not others. Maybe put that in > check_relation_relkind_and_relam() and rename it. Might want to look > at the pg_surgery precedent, too. In the presence of corruption, verify_heapam() reports to the user (in other words, leaks) metadata about the corrupted rows. Reasoning about the attack vectors this creates is hard, but a conservative approach is to assume that an attacker can cause corruption in order to benefit from the leakage, and make sure the leakage does not violate any reasonable security expectations. Basing the security decision on whether the user has access to read the table seems insufficient, as it ignores row level security. Perhaps that is ok if row level security is not enabled for the table or if the user has been granted BYPASSRLS. There is another problem, though. There is no grantable privilege to read dead rows. In the case of corruption, verify_heapam() may well report metadata about dead rows. pg_surgery also appears to leak information about dead rows. Owners of tables can probe whether supplied TIDs refer to dead rows. If a table containing sensitive information has rows deleted prior to ownership being transferred, the new owner of the table could probe each page of deleted data to determine something of the content that was there. Information about the number of deleted rows is already available through the pg_stat_* views, but those views don't give such a fine-grained approach to figuring out how large each deleted row was. For a table with fixed content options, the content can sometimes be completely inferred from the length of the row. (Consider a table with a single text column containing either "approved" or "denied".) But pg_surgery is understood to be a collection of sharp tools only to be used under fairly exceptional conditions. amcheck, on the other hand, is something that feels safer and more reasonable to use on a regular basis, perhaps from a cron job executed by a less trusted user. Forcing the user to be superuser makes it clearer that this feeling of safety is not justified. I am inclined to just restrict verify_heapam() to superusers and be done. What do you think? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
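For reference, the restriction being contemplated is the classic pattern (a sketch, assuming the function keeps its current name):

#include "postgres.h"
#include "miscadmin.h"

/* hypothetical guard at the top of the SQL-callable entry point */
static void
require_superuser(void)
{
	if (!superuser())
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("must be superuser to use verify_heapam()")));
}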
Re: Asynchronous Append on postgres_fdw nodes.
On Tue, Sep 22, 2020 at 9:52 PM Konstantin Knizhnik wrote: > On 20.08.2020 10:36, Kyotaro Horiguchi wrote: > > Come to think of "complex", ExecAsync stuff in this patch might be > > too-much for a short-term solution until executor overhaul, if it > > comes shortly. (the patch of mine here as a whole is like that, > > though..). The queueing stuff in postgres_fdw is, too. > Looks like current implementation of asynchronous append incorrectly > handles the LIMIT clause: > > psql:append.sql:10: ERROR: another command is already in progress > CONTEXT: remote SQL command: CLOSE c1 Thanks for the report (and patch)! The same issue has already been noticed in [1]. I too think the cause of the issue would be in the 0003 patch (ie, “the queueing stuff” in postgres_fdw), but I’m not sure it is really a good idea to have that in postgres_fdw in the first place, because it would impact performance negatively in some cases (see [1]). Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK16E1erFV9STg8yokoewY6E-zEJtLzHUJcQx%2B3dyivCT%3DA%40mail.gmail.com
Re: pgindent vs dtrace on macos
After further thought about this, I concluded that a much better idea is to just exclude fmgrprotos.h from the pgindent run. While renaming these two functions may or may not be worthwhile, it doesn't provide any sort of permanent fix for fmgrprotos.h, because new typedef conflicts could arise at any time. (The fact that the typedef list isn't fully under our control, thanks to contributions from system headers, makes this a bigger risk than it might appear.) So I did that, and also added a README comment along the lines you suggested. regards, tom lane
Re: Asynchronous Append on postgres_fdw nodes.
On 22.09.2020 16:40, Konstantin Knizhnik wrote: On 22.09.2020 15:52, Konstantin Knizhnik wrote: On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 Just FYI: the following patch fixes the problem: --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */ Sorry, but it is not the only problem. If you execute the query above and then in the same backend try to insert more records, then the backend crashes: Program terminated with signal SIGSEGV, Segmentation fault. #0 0x7f5dfc59a231 in fetch_received_data (node=0x230c130) at postgres_fdw.c:3736 3736 Assert(fsstate->s.commonstate->leader == node); (gdb) p sstate->s.commonstate No symbol "sstate" in current context. (gdb) p fsstate->s.commonstate Cannot access memory at address 0x7f7f7f7f7f7f7f87 Also my patch doesn't solve the problem for a small number of records (100) in the table. I attach yet another patch which fixes both problems. Please notice that I did not go deep into the code of async append, so I am not sure that my patch is complete and correct. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1482436..ff15642 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1623,7 +1623,19 @@ remove_async_node(ForeignScanState *node) PgFdwScanState *prev_state; ForeignScanState *cur; - /* no need to remove me */ + if (fsstate->s.commonstate->busy) + { + /* + * this node is waiting for result, absorb the result first so + * that the following commands can be sent on the connection.
+ */ + PGconn *conn = fsstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + } + +/* no need to remove me */ if (!leader || !fsstate->queued) return; @@ -1631,23 +1643,7 @@ remove_async_node(ForeignScanState *node) if (leader == node) { - if (leader_state->s.commonstate->busy) - { - /* - * this node is waiting for result, absorb the result first so - * that the following commands can be sent on the connection. - */ - PgFdwScanState *leader_state = GetPgFdwScanState(leader); - PGconn *conn = leader_state->s.conn; - - while(PQisBusy(conn)) -PQclear(PQgetResult(conn)); - - leader_state->s.commonstate->busy = false; - } - move_to_next_waiter(node); - return; } @@ -1858,7 +1854,7 @@ postgresEndForeignScan(ForeignScanState *node) /* Release remote connection */ ReleaseConnection(fsstate->s.conn); fsstate->s.conn = NULL; - + fsstate->s.commonstate->leader = NULL; /* MemoryContexts will be deleted automatically. */ }
Re: [patch] Fix checksum verification in base backups for zero page headers
Hi, On Tuesday, 22.09.2020, 16:26 +0200, Michael Banck wrote: > On Wednesday, 02.09.2020, 16:50 +0300, Anastasia Lubennikova wrote: > > I've looked through the previous discussion. As far as I got it, most of > > the controversy was about online checksums improvements. > > > > The warning about pd_upper inconsistency that you've added is a good > > addition. The patch is a bit messy, though, because a huge code block > > was shifted. > > > > Will it be different, if you just leave > > "if (!PageIsNew(page) && PageGetLSN(page) < startptr)" > > block as it was, and add > > "else if (PageIsNew(page) && !PageIsZero(page))" ? > > Thanks, that does indeed look better as a patch and I think it's fine > as-is for the code as well, I've attached a v2. Sorry, I forgot to add you as a reviewer in the proposed commit message; I've fixed that up now in V3. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz From a6706ec63709137881d415a8acf98c390a39ee56 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Tue, 22 Sep 2020 16:09:36 +0200 Subject: [PATCH] Fix checksum verification in base backups for zero page headers We currently do not checksum a page if it is considered new by PageIsNew(). However, this means that if the whole page header is zero, PageIsNew() will consider that page new due to the pd_upper field being zero and no subsequent checksum verification is done. To fix, factor out the all-zeroes check from PageIsVerified() into a new method PageIsZero() and call that for pages considered new by PageIsNew(). If a page is all zero, consider that a checksum failure. Add one more test to the pg_basebackup TAP tests to check this error. Reported-By: Andres Freund Reviewed-By: Asif Rehman, Anastasia Lubennikova Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de --- src/backend/replication/basebackup.c | 30 + src/backend/storage/page/bufpage.c | 35 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +- src/include/storage/bufpage.h| 11 +++--- 4 files changed, 73 insertions(+), 21 deletions(-) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index b89df01fa7..c82765a429 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename, "be reported", readfilename))); } } +else +{ + /* + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. + */ + if (PageIsNew(page) && !PageIsZero(page)) + { + /* + * pd_upper is zero, but the page is not all zero. We + * cannot run pg_checksum_page() on the page as it + * would throw an assertion failure. Consider this a + * checksum failure.
+ */ + checksum_failures++; + + if (checksum_failures <= 5) + ereport(WARNING, + (errmsg("checksum verification failed in " + "file \"%s\", block %d: pd_upper " + "is zero but page is not all-zero", + readfilename, blkno))); + if (checksum_failures == 5) + ereport(WARNING, + (errmsg("further checksum verification " + "failures in file \"%s\" will not " + "be reported", readfilename))); + } +} + block_retry = false; blkno++; } diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c index 4bc2bf955d..8be3cd6190 100644 --- a/src/backend/storage/page/bufpage.c +++ b/src/backend/storage/page/bufpage.c @@ -82,11 +82,8 @@ bool PageIsVerified(Page page, BlockNumber blkno) { PageHeader p = (PageHeader) page; - size_t *pagebytes; - int i; bool checksum_failure = false; bool header_sane = false; - bool all_zeroes = false; uint16 checksum = 0; /* @@ -120,18 +117,7 @@ PageIsVerified(Page page, BlockNumber blkno) } /* Check all-zeroes case */ - all_zeroes = true; - pagebytes = (size_t *) page; - for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) - { - if (pagebytes[i] != 0) - { - all_zeroes = false; - break; - } - } - - if (all_zeroes) + if (PageIsZero(page)) return true; /* @@ -154,6 +140,25 @@ PageIsVerified(Page page, BlockNumber blkno) return false; } +/* + * PageIsZero + * Check that the page consists only of zero bytes. + * + */ +bool +PageIsZero(Page page) +{ + int i; + size_t *pagebytes = (size_t *) page; + + for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++) + { +
Re: [patch] Fix checksum verification in base backups for zero page headers
Hi, On Wednesday, 02.09.2020, 16:50 +0300, Anastasia Lubennikova wrote: > On 01.09.2020 13:22, Michael Banck wrote: > > as a continuation of [1], I've split-off the zero page header case from > > the last patch, as this one seems less contentious. > > [1] https://commitfest.postgresql.org/28/2308/ > > I've looked through the previous discussion. As far as I got it, most of > the controversy was about online checksums improvements. > > The warning about pd_upper inconsistency that you've added is a good > addition. The patch is a bit messy, though, because a huge code block > was shifted. > > Will it be different, if you just leave > "if (!PageIsNew(page) && PageGetLSN(page) < startptr)" > block as it was, and add > "else if (PageIsNew(page) && !PageIsZero(page))" ? Thanks, that does indeed look better as a patch and I think it's fine as-is for the code as well, I've attached a v2. > While on it, I also have a few questions about the code around: All fair points, but I think those should be handled in another patch, if any. > 1) Maybe we can use other header sanity checks from PageIsVerified() as > well? Or even the function itself. Yeah, I'll keep that in mind. > 2) > /* Reset loop to validate the block again */ > How many times do we try to reread the block? Is one reread enough? Not sure whether more rereads would help, but I think the general direction was to replace this with something better anyway (see below). > Speaking of which, 'reread_cnt' name looks confusing to me. I would > expect that this variable contains a number of attempts, not the number > of bytes read. Well, there are other "cnt" variables that keep the number of bytes read in that file, so it does not seem to be out of place to me. A comment might not hurt though. On second glance, it should maybe also be of size_t and not int type, as is the other cnt variable? > If everyone agrees, that for basebackup purpose it's enough to rely on a > single reread, I'm ok with it. > Another approach is to read the page directly from shared buffers to > ensure that the page is fine. This way is more complicated, but should > be almost bullet-proof. Using it we can also check pages with lsn >= > startptr. Right, that's what Andres also advocated for initially I think, and what should be done going forward. > 3) Judging by warning messages, we count checksum failures per file, not > per page, and don't report after a fifth failure. Why so? Is it a > common case that so many pages of one file are corrupted? I think this was a request during the original review, on the basis that if we have more than a few checksum errors, it's likely there is something fundamentally broken and it does not make sense to spam the log with potentially millions of broken checksums. Cheers, Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz From 317df17289d4fd057ffdb4410f9effa8c7a2b975 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Tue, 22 Sep 2020 16:09:36 +0200 Subject: [PATCH] Fix checksum verification in base backups for zero page headers We currently do not checksum a page if it is considered new by PageIsNew().
However, this means that if the whole page header is zero, PageIsNew() will consider that page new due to the pd_upper field being zero and no subsequent checksum verification is done. To fix, factor out the all-zeroes check from PageIsVerified() into a new method PageIsZero() and call that for pages considered new by PageIsNew(). If a page is all zero, consider that a checksum failure. Add one more test to the pg_basebackup TAP tests to check this error. Reported-By: Andres Freund Reviewed-By: Asif Rehman Discussion: https://postgr.es/m/20190319200050.ncuxejradurja...@alap3.anarazel.de --- src/backend/replication/basebackup.c | 30 + src/backend/storage/page/bufpage.c | 35 +++- src/bin/pg_basebackup/t/010_pg_basebackup.pl | 18 +- src/include/storage/bufpage.h| 11 +++--- 4 files changed, 73 insertions(+), 21 deletions(-) diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c index b89df01fa7..c82765a429 100644 --- a/src/backend/replication/basebackup.c +++ b/src/backend/replication/basebackup.c @@ -1746,6 +1746,36 @@ sendFile(const char *readfilename, const char *tarfilename, "be reported", readfilename))); } } +else +{ + /* + * We skip completely new pages after checking they are + * all-zero, since they don't have a checksum yet. + */ + if (PageIsNew(page) &&
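The hunk above is cut off in the archive; reconstructed from the all-zeroes loop that the same patch removes from PageIsVerified(), the complete function presumably reads:

/*
 * PageIsZero
 *		Check that the page consists only of zero bytes.
 */
bool
PageIsZero(Page page)
{
	int			i;
	size_t	   *pagebytes = (size_t *) page;

	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
	{
		if (pagebytes[i] != 0)
			return false;
	}

	return true;
}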
Re: pgindent vs dtrace on macos
Daniel Gustafsson writes: >> On 21 Sep 2020, at 20:55, Tom Lane wrote: >> Here's a proposed patch to fix it that way. > +1 on this patch. Do you think it's worth adding a note about this in the > documentation to save the next one staring at this a few minutes? Something > along the lines of: > +If an exported function shares a name with a typedef, the header file with > the > +prototype can get incorrect spacing for the function. AFAIK, a function name that's the same as a typedef name will get misformatted whether it's exported or not; pgindent doesn't really know the difference. But yeah, we could add something about this. regards, tom lane
Re: Asynchronous Append on postgres_fdw nodes.
On 22.09.2020 15:52, Konstantin Knizhnik wrote: On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 Just FYI: the following patch fixes the problem: --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) + PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */ -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1482436..9fe16cf 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -1667,6 +1667,11 @@ remove_async_node(ForeignScanState *node) if (cur == node) { + PGconn *conn = curstate->s.conn; + + while(PQisBusy(conn)) +PQclear(PQgetResult(conn)); + prev_state->waiter = curstate->waiter; /* relink to the previous node if the last node was removed */
Re: Transactions involving multiple postgres foreign servers, take 2
On Tue, Sep 22, 2020 at 6:48 AM tsunakawa.ta...@fujitsu.com wrote: > > > > I think it's not necessarily that all FDW implementations need to be > > able to support xa_complete(). We can support both synchronous and > > asynchronous executions of prepare/commit/rollback. > > Yes, I think parallel prepare and commit can be an option for FDW. But I > don't think it's an option for a serious scale-out DBMS. If we want to use > FDW as part of PostgreSQL's scale-out infrastructure, we should design (if > not implemented in the first version) how the parallelism can be realized. > That design is also necessary because it could affect the FDW API. parallelism here has both pros and cons. If one of the servers errors out while preparing for a transaction, there is no point in preparing the transaction on other servers. In parallel execution we will prepare on multiple servers before realising that one of them has failed to do so. On the other hand preparing on multiple servers in parallel provides a speed up. But this can be an improvement on version 1. The current approach doesn't render such an improvement impossible. So if that's something hard to do, we should do that in the next version rather than complicating this patch. -- Best Wishes, Ashutosh Bapat
Re: Asynchronous Append on postgres_fdw nodes.
On 20.08.2020 10:36, Kyotaro Horiguchi wrote: At Wed, 19 Aug 2020 23:25:36 -0500, Justin Pryzby wrote in On Thu, Jul 02, 2020 at 11:14:48AM +0900, Kyotaro Horiguchi wrote: As the result of a discussion with Fujita-san off-list, I'm going to hold off development until he decides whether mine or Thomas' is better. The latest patch doesn't apply so I set as WoA. https://commitfest.postgresql.org/29/2491/ Thanks. This is rebased version. At Fri, 14 Aug 2020 13:29:16 +1200, Thomas Munro wrote in Either way, we definitely need patch 0001. One comment: -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) I wonder if it's better to have it receive ResourceOwner like that, or to have it capture CurrentResourceOwner. I think the latter is more common in existing code. There's no existing WaitEventSets belonging to a resowner. So unconditionally capturing CurrentResourceOwner doesn't work well. I could pass a bool instead but that make things more complex. Come to think of "complex", ExecAsync stuff in this patch might be too-much for a short-term solution until executor overhaul, if it comes shortly. (the patch of mine here as a whole is like that, though..). The queueing stuff in postgres_fdw is, too. regards. Hi, Looks like current implementation of asynchronous append incorrectly handles the LIMIT clause: psql:append.sql:10: ERROR: another command is already in progress CONTEXT: remote SQL command: CLOSE c1 -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company append.sql Description: application/sql
Re: OpenSSL 3.0.0 compatibility
> On 22 Sep 2020, at 11:37, Peter Eisentraut > wrote: > > On 2020-09-18 16:11, Daniel Gustafsson wrote: >> Since we support ciphers that are now deprecated, we have no other choice >> than >> to load the legacy provider. > > Well, we could just have deprecated ciphers fail, unless the user loads the > legacy provider in the OS configuration. There might be an argument that > that is more proper. That's a fair point. The issue I have with that is that we'd impose a system-wide loading of the legacy provider, impacting other consumers of libssl as well. > As a more extreme analogy, what if OpenSSL remove a cipher from the legacy > provider? Are we then obliged to reimplement it manually for the purpose of > pgcrypto? Probably not. I don't think we have made any such guarantees of support, especially since it's in contrib/. That doesn't mean that some users won't expect it though. Another option would be to follow OpenSSL's deprecations and mark these ciphers as deprecated such that we can remove them in case they indeed get removed from libcrypto. That would give users a heads-up that they should have a migration plan if that time comes. > The code you wrote to load the necessary providers is small enough that I > think it's fine, but it's worth pondering this question briefly. Absolutely. cheers ./daniel
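For readers without the patch at hand, provider loading in OpenSSL 3.0.0 looks roughly like this (a sketch against the documented API, not necessarily the patch's exact code; note that once any provider is loaded explicitly, the default provider is no longer loaded implicitly, so both must be requested):

#include <stddef.h>
#include <openssl/provider.h>

/* load "legacy" (Blowfish, CAST5, DES, ...) plus "default" explicitly */
static int
load_legacy_and_default(void)
{
	OSSL_PROVIDER *legacy = OSSL_PROVIDER_load(NULL, "legacy");
	OSSL_PROVIDER *dflt = OSSL_PROVIDER_load(NULL, "default");

	return (legacy != NULL && dflt != NULL);
}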
Re: Lift line-length limit for pg_service.conf
> On 21 Sep 2020, at 17:09, Tom Lane wrote: > > Daniel Gustafsson writes: >> The pg_service.conf parsing thread [0] made me realize that we have a >> hardwired >> line length of max 256 bytes. Lifting this would be in line with recent work >> for ecpg, pg_regress and pg_hba (784b1ba1a2 and 8f8154a50). The attached >> moves >> pg_service.conf to use the new pg_get_line_append API and a StringInfo to >> lift >> the restriction. Any reason not to do that while we're lifting other such >> limits? > > +1. I'd been thinking of looking around at our fgets calls to see > which ones need this sort of work, but didn't get to it yet. I took a quick look and found the TOC parsing in pg_restore which used a 100-byte buffer and then did some juggling to find EOL for >100b long lines. There we won't see a bug report for exceeded line length, but simplifying the code seemed like a win to me, so I included that in the updated patch as well. > Personally, I'd avoid depending on StringInfo.cursor here, as the > dependency isn't really buying you anything. Fair enough, I was mainly a bit excited at finally finding a use for .cursor =) Fixed. > Also, the need for inserting the pfree into multiple exit paths kind > of makes me itch. I wonder if we should change the ending code to > look like > > exit: > fclose(f); > pfree(linebuf.data); > > return result; > > and then the early exit spots would be replaced with "result = x; > goto exit". (Some of them could use "break", but not all, so > it's probably better to consistently use "goto".) Agreed, fixed. I was a bit tempted to use something less magic and more descriptive than result = 3; but in the end opted for keeping changes to one thing at a time. cheers ./daniel 0001-Refactor-pg_service.conf-and-pg_restore-TOC-file-par.patch Description: Binary data
Re: [HACKERS] logical decoding of two-phase transactions
On Sun, Sep 20, 2020 at 3:31 PM Dilip Kumar wrote: > 1. > + /* > + * Process invalidation messages, even if we're not interested in the > + * transaction's contents, since the various caches need to always be > + * consistent. > + */ > + if (parsed->nmsgs > 0) > + { > + if (!ctx->fast_forward) > + ReorderBufferAddInvalidations(ctx->reorder, xid, buf->origptr, > + parsed->nmsgs, parsed->msgs); > + ReorderBufferXidSetCatalogChanges(ctx->reorder, xid, buf->origptr); > + } > + > > I think we don't need to add prepare-time invalidation messages as now we are already logging the invalidations at the command level and adding them to the reorder buffer. > Removed. > 2. > > + /* > + * Tell the reorderbuffer about the surviving subtransactions. We need to > + * do this because the main transaction itself has not committed since we > + * are in the prepare phase right now. So we need to be sure the snapshot > + * is setup correctly for the main transaction in case all changes > + * happened in subtransactions > + */ > + for (i = 0; i < parsed->nsubxacts; i++) > + { > + ReorderBufferCommitChild(ctx->reorder, xid, parsed->subxacts[i], > + buf->origptr, buf->endptr); > + } > + > + if (SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) || > + (parsed->dbId != InvalidOid && parsed->dbId != ctx->slot->data.database) || > + ctx->fast_forward || FilterByOrigin(ctx, origin_id)) > + return; > > Do we need to call ReorderBufferCommitChild if we are skipping this > transaction? > I think the below check should be before calling ReorderBufferCommitChild. > Done. > 3. > > + /* > + * If it's ROLLBACK PREPARED then handle it via callbacks. > + */ > + if (TransactionIdIsValid(xid) && > + !SnapBuildXactNeedsSkip(ctx->snapshot_builder, buf->origptr) && > + parsed->dbId == ctx->slot->data.database && > + !FilterByOrigin(ctx, origin_id) && > + ReorderBufferTxnIsPrepared(ctx->reorder, xid, parsed->twophase_gid)) > + { > + ReorderBufferFinishPrepared(ctx->reorder, xid, buf->origptr, buf->endptr, > + commit_time, origin_id, origin_lsn, > + parsed->twophase_gid, false); > + return; > + } > > > I think we have already checked !SnapBuildXactNeedsSkip, parsed->dbId > == ctx->slot->data.database and !FilterByOrigin in DecodePrepare > so if those are not true then we wouldn't have prepared this > transaction, i.e. ReorderBufferTxnIsPrepared will be false, so why do we > need > to recheck these conditions. I didn't change this, as I am seeing cases where the Abort is getting called for transactions that need to be skipped. I also see that the same check is there both in DecodePrepare and DecodeCommit. So, while the same transactions were not getting prepared or committed, it tries to get ROLLBACK PREPARED (as part of finish prepared handling). The check in ReorderBufferTxnIsPrepared() is also not proper. I will need to revisit this logic in a future patch. > > > 4. > > + /* If streaming, reset the TXN so that it is allowed to stream > remaining data. */ > + if (streaming && stream_started) > + { > + ReorderBufferResetTXN(rb, txn, snapshot_now, > + command_id, prev_lsn, > + specinsert); > + } > + else > + { > + elog(LOG, "stopping decoding of %s (%u)", > + txn->gid[0] != '\0'? txn->gid:"", txn->xid); > + ReorderBufferTruncateTXN(rb, txn, true); > + } > > Why is only if (streaming) not enough? I agree that if we are coming here > in streaming mode then stream_started must be true, > but we already have an assert for that. > Changed.
Amit, I have also changed test_decoding startup to support two_phase commits only if specified, similar to how it was done for streaming. I have also changed the test cases accordingly. However, I have not added it to the pgoutput startup as that would require CREATE SUBSCRIPTION changes. I will do that in a future patch. Some other pending changes are: 1. Remove snapshots on prepare truncate. 2. Look at why ReorderBufferCleanupTXN is failing after a ReorderBufferTruncateTXN 3. Add prepare support to streaming regards, Ajin Cherian Fujitsu Australia v5-0001-Support-decoding-of-two-phase-transactions.patch Description: Binary data v5-0002-Tap-test-to-test-concurrent-aborts-during-2-phase.patch Description: Binary data v5-0003-pgoutput-output-plugin-support-for-logical-decodi.patch Description: Binary data
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > wrote: > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > comment: > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > day or two. > > > > > > > Just a thought: > > > > Should we change the sequence of these #define in > > src/include/replication/logicalproto.h? > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I would have changed above to something like this: > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > I am not sure if this suggestion makes it better than what is proposed > by Dilip but I think we can declare them in define number order like > below: > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > #define LOGICALREP_PROTO_VERSION_NUM 1 > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM Done this way. > Another thing: can we also test by having a publisher of v14 and > a subscriber of v13 or a prior version, just the reverse of what Ashutosh has > tested? I have also tested this and it works fine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v3-0001-Bugfix-in-logical-protocol-version.patch Description: Binary data
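For context, a sketch of how these constants gate the startup handshake on the publisher side (simplified; the surrounding function and variable names are assumptions, not the patch's exact code):

#include "postgres.h"
#include "replication/logicalproto.h"

/* reject protocol versions outside the range this walsender understands */
static void
check_client_proto_version(int client_version)
{
	if (client_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
		ereport(ERROR,
				(errmsg("client sent proto_version=%d but we only support protocol %d or lower",
						client_version, LOGICALREP_PROTO_MAX_VERSION_NUM)));

	if (client_version < LOGICALREP_PROTO_MIN_VERSION_NUM)
		ereport(ERROR,
				(errmsg("client sent proto_version=%d but we only support protocol %d or higher",
						client_version, LOGICALREP_PROTO_MIN_VERSION_NUM)));
}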
Re: Parallel copy
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote: > > On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow wrote: > > > > Fortunately I have been given permission to share the exact table > > definition and data I used, so you can check the behaviour and timings > > on your own test machine. > > > > Thanks Greg for the script. I ran your test case and I didn't observe > any increase in exec time with 1 worker, indeed, we have benefitted a > few seconds from 0 to 1 worker as expected. > > Execution time is in seconds. Each test case is executed 3 times on > release build. Each time the data directory is recreated. > > Case 1: 100 rows, 2GB > With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423 > With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678 > > With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822 > With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573 > > Case 2: 255 rows, 5GB > With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683 > With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307 > > With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049 > With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090 > Hi Greg, If you still observe the issue in your testing environment, I'm attaching a testing patch (that applies on top of the latest parallel copy patch set, i.e. v5 1 to 6) to capture various timings such as total copy time in leader and worker, index and table insertion time, and leader and worker waiting time. These logs are shown in the server log file. Few things to follow before testing: 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaves and inner B+Tree nodes. 2. If possible, can you also run with custom postgresql.conf settings [2] along with default? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end? 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue. 4. I ran the tests in a performance test system where no other user processes (except system processes) are running. Is it possible for you to do the same? Please capture and share the timing logs with us. Here's a snapshot of how the added timings show up in the logs (I captured this with your test case 1: 100 rows, 2GB, custom postgresql.conf settings [2]).
with 0 workers: 2020-09-22 10:49:27.508 BST [163910] LOG: totaltableinsertiontime = 24072.034 ms 2020-09-22 10:49:27.508 BST [163910] LOG: totalindexinsertiontime = 60.682 ms 2020-09-22 10:49:27.508 BST [163910] LOG: totalcopytime = 59664.594 ms with 1 worker: 2020-09-22 10:53:58.409 BST [163947] LOG: totalcopyworkerwaitingtime = 59.815 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totaltableinsertiontime = 23585.881 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totalindexinsertiontime = 30.946 ms 2020-09-22 10:53:58.409 BST [163947] LOG: totalcopytimeworker = 47047.956 ms 2020-09-22 10:53:58.429 BST [163946] LOG: totalcopyleaderwaitingtime = 26746.744 ms 2020-09-22 10:53:58.429 BST [163946] LOG: totalcopytime = 47150.002 ms [1] 0 worker: LOG: totaltableinsertiontime = 25491.881 ms LOG: totalindexinsertiontime = 14136.104 ms LOG: totalcopytime = 75606.858 ms table is not dropped and so are indexes 1 worker: LOG: totalcopyworkerwaitingtime = 64.582 ms LOG: totaltableinsertiontime = 21360.875 ms LOG: totalindexinsertiontime = 24843.570 ms LOG: totalcopytimeworker = 69837.162 ms LOG: totalcopyleaderwaitingtime = 49548.441 ms LOG: totalcopytime = 69997.778 ms [2] custom postgresql.conf configuration: shared_buffers = 40GB max_worker_processes = 32 max_parallel_maintenance_workers = 24 max_parallel_workers = 32 synchronous_commit = off checkpoint_timeout = 1d max_wal_size = 24GB min_wal_size = 15GB autovacuum = off [3] LOG: checkpoints are occurring too frequently (14 seconds apart) HINT: Consider increasing the configuration parameter "max_wal_size". With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v1-0001-Parallel-Copy-Exec-Time-Capture.patch Description: Binary data
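The attached patch is essentially instrumentation; the timing capture it adds looks roughly like this (a sketch with a hypothetical helper, using one of the counter names reported in the log lines above):

#include "postgres.h"
#include "portability/instr_time.h"

static double totaltableinsertiontime = 0;	/* milliseconds */

/* wrap the timed operation between two instr_time snapshots */
static void
timed_table_insert(void)
{
	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);
	/* ... perform the actual table insertion here ... */
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);
	totaltableinsertiontime += INSTR_TIME_GET_MILLISEC(duration);
}

/* called once when the COPY finishes */
static void
report_copy_timings(void)
{
	elog(LOG, "totaltableinsertiontime = %.3f ms", totaltableinsertiontime);
}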
Re: OpenSSL 3.0.0 compatibility
On 2020-09-18 16:11, Daniel Gustafsson wrote: Since we support ciphers that are now deprecated, we have no other choice than to load the legacy provider. Well, we could just have deprecated ciphers fail, unless the user loads the legacy provider in the OS configuration. There might be an argument that that is more proper. As a more extreme analogy, what if OpenSSL remove a cipher from the legacy provider? Are we then obliged to reimplement it manually for the purpose of pgcrypto? Probably not. The code you wrote to load the necessary providers is small enough that I think it's fine, but it's worth pondering this question briefly. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Binaries on s390x arch
Re: Namrata Bhave > We will be glad to obtain binaries for s390x on RHEL, SLES and Ubuntu > distros. Hi Namrata, thanks for getting back to me. Note that I'm working on the .deb packages for Debian and Ubuntu. For .rpm (RHEL, SLES) you need to get Devrim on board. (I'm putting him on Cc.) > We are ready to provide the necessary infra. To enable this, we would need > more information about VM configuration (No. of VMs, OS, vCPUs, memory, > Storage). Secondly, T&Cs would need to be signed electronically. > Please let me know if you are OK to proceed and we can communicate further. To be able to keep up with the other architectures when building binaries for PostgreSQL releases, the machine would have about the same specs as the existing arm64 or ppc64el build daemons: VMs: 1 OS: Debian buster vCPUs: 8 Memory: 16 GB Storage: 100GB (most of that is swap space for tmpfs) What does "T&Cs" mean? Not sure which body on the PostgreSQL side would be able to sign something. We are an open source project after all. Thanks, Christoph signature.asc Description: PGP signature
RE: Use appendStringInfoString and appendPQExpBufferStr where possible
> On Tue, 22 Sep 2020 at 17:00, Hou, Zhijie wrote: > > I found some more places that should use appendPQExpBufferStr instead of appendPQExpBuffer. > > > > Here is the patch. > > Seems like a good idea. Please add it to the next commitfest. Thanks, added it to the next commitfest. https://commitfest.postgresql.org/30/2735/ Best regards, houzj
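For anyone unfamiliar with the distinction being applied here, a small sketch (the query text and function are only illustrations; quoting/escaping is omitted for brevity):

#include "pqexpbuffer.h"

static void
build_query(PQExpBuffer buf, const char *pattern)
{
	/* no % placeholders: the Str variant skips printf-style parsing */
	appendPQExpBufferStr(buf, "SELECT relname FROM pg_class");

	/* placeholders present: the formatting variant is actually needed */
	appendPQExpBuffer(buf, " WHERE relname ~ '%s'", pattern);
}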
Re: pgindent vs dtrace on macos
> On 21 Sep 2020, at 20:55, Tom Lane wrote: > > Oh wait, I forgot about the fmgrprotos.h discrepancy. > > I wrote: >> It strikes me that a low-cost workaround would be to rename these >> C functions. There's no law that their C names must match the >> SQL names. > > Here's a proposed patch to fix it that way. +1 on this patch. Do you think it's worth adding a note about this in the documentation to save the next person who stares at this a few minutes? Something along the lines of: --- a/src/tools/pgindent/README +++ b/src/tools/pgindent/README @@ -110,6 +110,9 @@ Sometimes, if pgindent or perltidy produces odd-looking output, it's because of minor bugs like extra commas. Don't hesitate to clean that up while you're at it. +If an exported function shares a name with a typedef, the header file with the +prototype can get incorrect spacing for the function. + cheers ./daniel
Re: Global snapshots
22.09.2020 03:47, tsunakawa.ta...@fujitsu.com writes: Does this make sense from your viewpoint, and can we think that we can use Clock-SI without infringing on the patent? According to the patent holder, the difference between Clock-SI and the patent seems to be fewer than the similarities. Thank you for this work! As I can see, the main development difficulties lie in other areas: CSN, resolver, global deadlocks, 2PC commit... I'm not a lawyer either. But if we get remarks from the patent holders, we can rewrite our Clock-SI implementation. -- regards, Andrey Lepikhov Postgres Professional
Re: Use appendStringInfoString and appendPQExpBufferStr where possible
On Tue, 22 Sep 2020 at 17:00, Hou, Zhijie wrote: > I found some more places that should use appendPQExrBufferStr instead of > appendPQExpBuffer. > > Here is the patch. Seems like a good idea. Please add it to the next commitfest. David
Re: VACUUM PARALLEL option vs. max_parallel_maintenance_workers
On Mon, 21 Sep 2020 at 19:15, Peter Eisentraut wrote: > > On 2020-09-21 05:48, Amit Kapila wrote: > > What according to you should be the behavior here and how will it be > > better than current? > > I think if I write VACUUM (PARALLEL 5), it should use up to 5 workers > (up to the number of indexes), even if max_parallel_maintenance_workers > is 2. It would be good if we were consistent with these parallel options. Right now max_parallel_workers_per_gather will restrict the parallel_workers reloption. I'd say this max_parallel_workers_per_gather is similar to max_parallel_maintenance_workers here and the PARALLEL vacuum option is like the parallel_workers reloption. If we want VACUUM's parallel option to work the same way as that then max_parallel_maintenance_workers should restrict whatever is mentioned in VACUUM PARALLEL. Or perhaps this is slightly different as the user is explicitly asking for this in the command, but you could likely say the same about ALTER TABLE SET (parallel_workers = N); too. David
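The behavior under discussion boils down to a clamp along these lines (a simplified sketch; the real logic in the vacuum code also accounts for which indexes can actually participate in parallel processing):

#include "postgres.h"

extern int	max_parallel_maintenance_workers;	/* GUC, declared in the backend headers */

/* simplified sketch of how the launched worker count is derived today */
static int
compute_workers_sketch(int nindexes, int nrequested)
{
	int			parallel_workers;

	/* PARALLEL N caps at the number of indexes; otherwise use them all */
	parallel_workers = (nrequested > 0)
		? Min(nrequested, nindexes)
		: nindexes;

	/* the clamp being debated: PARALLEL N cannot exceed the GUC */
	return Min(parallel_workers, max_parallel_maintenance_workers);
}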
Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path
On Fri, 11 Sep 2020 at 02:01, Peter Eisentraut wrote: > > On 2020-09-06 02:24, David Rowley wrote: > >> I would add DEBUG1 back into the conditional, like > >> > >> if (__builtin_constant_p(elevel) && ((elevel) >= ERROR || (elevel) <= > >> DEBUG1) ? \ > > > > hmm, but surely we don't want to move all code that's in the same > > branch as an elog(DEBUG1) call into a cold area. > > Yeah, nevermind that. I've reattached the v4 patch since it just does the >= ERROR case. > > The v3 patch just put an unlikely() around the errstart() call if the > > level was <= DEBUG1. That just to move the code that's inside the if > > (errstart(...)) in the macro into a cold area. > > That could be useful. Depends on how much effect it has. I wonder if it is. I'm having trouble even seeing gains from the ERROR case and I'm considering dropping this patch due to that. I ran another scale=5 TPCH benchmark on v4 against f859c2ffa using gcc 9.3. I'm unable to see any gains with this, however, the results were pretty noisy. I only ran pgbench for 60 seconds per query. I'll likely need to run that a bit longer. I'll do that tonight. It would be good if someone else could run some tests on their own hardware to see if they can see any gains. David elog_ereport_attribute_cold_v4.patch Description: Binary data tpch_scale5_elog_ereport_cold_v4_vs_master.ods Description: application/vnd.oasis.opendocument.spreadsheet
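For reference, the mechanism being benchmarked, as a sketch (assuming GCC/Clang; ERRSTART and errstart_cold follow the thread's naming conventions, not necessarily the final patch):

/* mark the error-raising starter as cold, so the compiler lays out every
 * constant-elevel >= ERROR call site's branch outside the hot path */
#if defined(__GNUC__) || defined(__clang__)
#define pg_attribute_cold __attribute__((cold))
#else
#define pg_attribute_cold
#endif

extern bool errstart(int elevel, const char *domain);
extern pg_attribute_cold bool errstart_cold(int elevel, const char *domain);

/* inside the ereport() machinery: constant elevel >= ERROR goes cold */
#define ERRSTART(elevel, domain) \
	((__builtin_constant_p(elevel) && (elevel) >= ERROR) ? \
	 errstart_cold(elevel, domain) : \
	 errstart(elevel, domain))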
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:22 PM Ashutosh Sharma wrote: > > On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote: > > > > On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote: > > > > > > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote: > > > > > > > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma > > > > wrote: > > > > > > > > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small > > > > > comment: > > > > > > > > > > > > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a > > > > day or two. > > > > > > > > > > Just a thought: > > > > > > Should we change the sequence of these #define in > > > src/include/replication/logicalproto.h? > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > I would have changed above to something like this: > > > > > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > > > > > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM > > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > > > > > > I am not sure if this suggestion makes it better than what is proposed > > by Dilip but I think we can declare them in define number order like > > below: > > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1 > > #define LOGICALREP_PROTO_VERSION_NUM 1 > > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2 > > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM > > The only reason I proposed that was because for the *_MAX_VERSION_NUM > we are using the latest PROTOCOL version name in its definition so why > not do the same for defining *_MIN_VERSION_NUM as well. Other than > that, I also wanted to correct the sequence so that they are defined > in increasing order, which you have already done here. > > > Another thing: can we also test by having a publisher of v14 and > > a subscriber of v13 or a prior version, just the reverse of what Ashutosh has > > tested? > > I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine. I meant LR from PGv14 (Publisher) to PGv12 (Subscriber), not the other way around. > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 12:02 PM Amit Kapila wrote:
>
> On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote:
> >
> > On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote:
> > >
> > > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
> > > >
> > > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small
> > > > comment:
> > > >
> > >
> > > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > > day or two.
> > >
> >
> > Just a thought:
> >
> > Should we change the sequence of these #defines in
> > src/include/replication/logicalproto.h?
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
> > I would have changed the above to something like this:
> >
> > #define LOGICALREP_PROTO_VERSION_NUM 1
> > #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> >
> > #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM
> > #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
> >
>
> I am not sure this suggestion makes it better than what is proposed by
> Dilip, but I think we can declare them in increasing numeric order, like
> below:
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

The only reason I proposed that was that for *_MAX_VERSION_NUM we are
using the latest protocol version name in its definition, so why not do
the same for *_MIN_VERSION_NUM as well. Other than that, I also wanted to
correct the sequence so that they are defined in increasing order, which
you have already done here.

> Another thing: can we also test with a publisher of v14 and a
> subscriber of v13 or a prior version, just the reverse of what Ashutosh
> has tested?
>

I've tested LR from PGv12 (Publisher) to PGv14 (Subscriber) and it works fine.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: Logical replication from PG v13 and below to PG v14 (devel version) is not working.
On Tue, Sep 22, 2020 at 8:34 AM Ashutosh Sharma wrote:
>
> On Mon, Sep 21, 2020 at 6:58 PM Amit Kapila wrote:
> >
> > On Mon, Sep 21, 2020 at 6:27 PM Ashutosh Sharma wrote:
> > >
> > > Thanks Dilip for the patch. AFAIU, the fix looks good. One small comment:
> > >
> >
> > Thanks Ashutosh and Dilip for working on this. I'll look into it in a
> > day or two.
> >
>
> Just a thought:
>
> Should we change the sequence of these #defines in
> src/include/replication/logicalproto.h?
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>
> I would have changed the above to something like this:
>
> #define LOGICALREP_PROTO_VERSION_NUM 1
> #define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
>
> #define LOGICALREP_PROTO_MIN_VERSION_NUM LOGICALREP_PROTO_VERSION_NUM
> #define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM
>

I am not sure this suggestion makes it better than what is proposed by
Dilip, but I think we can declare them in increasing numeric order, like
below:

#define LOGICALREP_PROTO_MIN_VERSION_NUM 1
#define LOGICALREP_PROTO_VERSION_NUM 1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM LOGICALREP_PROTO_STREAM_VERSION_NUM

Another thing: can we also test with a publisher of v14 and a subscriber
of v13 or a prior version, just the reverse of what Ashutosh has tested?

--
With Regards,
Amit Kapila.
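For context on why both bounds exist at all: the subscriber requests a proto_version and the publisher must reject anything outside the range it can speak. Here is a minimal self-contained sketch of such a range check, assuming a simple reject-and-exit handler; the helper name and error handling are illustrative, not the actual pgoutput code:

#include <stdio.h>
#include <stdlib.h>

#define LOGICALREP_PROTO_MIN_VERSION_NUM    1
#define LOGICALREP_PROTO_VERSION_NUM        1
#define LOGICALREP_PROTO_STREAM_VERSION_NUM 2
#define LOGICALREP_PROTO_MAX_VERSION_NUM    LOGICALREP_PROTO_STREAM_VERSION_NUM

/*
 * Reject any requested protocol version outside [MIN, MAX]. Everything
 * inside the range must be understood, which is why both bounds are kept
 * as explicit, named limits.
 */
static void
check_proto_version(int client_version)
{
	if (client_version < LOGICALREP_PROTO_MIN_VERSION_NUM ||
		client_version > LOGICALREP_PROTO_MAX_VERSION_NUM)
	{
		fprintf(stderr,
				"client requested proto_version %d, server supports %d..%d\n",
				client_version,
				LOGICALREP_PROTO_MIN_VERSION_NUM,
				LOGICALREP_PROTO_MAX_VERSION_NUM);
		exit(1);
	}
	printf("proto_version %d accepted\n", client_version);
}

int
main(void)
{
	check_proto_version(2);		/* v14 subscriber: streaming protocol */
	check_proto_version(1);		/* v13-and-below subscriber */
	check_proto_version(3);		/* unknown future version: rejected */
	return 0;
}

Tying MIN to LOGICALREP_PROTO_VERSION_NUM and MAX to the newest named version, as suggested above, keeps a check like this correct automatically when a new protocol version is added.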
Re: Dynamic gathering the values for seq_page_cost/xxx_cost
>> It's probably worth testing on various other storage systems to see
>> how that applies to those.
>
> Yes, I can test more on new hardware once I get it; that is still in
> progress. However, I can only get a physical machine with an SSD or a
> virtual machine with an SSD; other types are hard for me to get right now.

Here is a result on different hardware. The test method is unchanged. [1]

Hardware info: virtual machine with 61GB memory.
Linux kernel: 5.4.0-31-generic Ubuntu

# lshw -short -C disk
H/W path      Device    Class  Description
==========================================
/0/100/4/0    /dev/vda  disk   42GB Virtual I/O device
/0/100/5/0    /dev/vdb  disk   42GB Virtual I/O device

The disk on the physical machine is claimed to be an SSD. This time FIO
and my tool produce exactly the same result:

fs_cache_lat = 0.957756us, seq_read_lat = 70.780327us, random_page_lat = 438.837257us

cache hit ratio: 1.00  random_page_cost 1.00
cache hit ratio: 0.90  random_page_cost 5.635470
cache hit ratio: 0.50  random_page_cost 6.130565
cache hit ratio: 0.10  random_page_cost 6.192183
cache hit ratio: 0.00  random_page_cost 6.199989

|         | seq_read_lat(us) | random_read_lat(us) |
|---------+------------------+---------------------|
| FIO     |               70 |                 437 |
| My tool |               70 |                 438 |

The following query plans changed after we raised random_page_cost from 4
to 6.2, and the execution times changed as well:

|     | random_page_cost=4 | random_page_cost=6.2 |
|-----+--------------------+----------------------|
| Q1  |               2561 |             2528.272 |
| Q10 |           4675.749 |             4684.225 |
| Q13 |          18858.048 |            18565.929 |
| Q2  |            329.279 |              308.723 |
| Q5  |          46248.132 |             7900.173 |
| Q6  |          52526.462 |            47639.503 |
| Q7  |          27348.900 |            25829.221 |

Q5 improved by 5.8 times, and Q6 & Q7 improved by ~10%.

[1] https://www.postgresql.org/message-id/CAKU4AWpRv50k8E3tC3tiLWGe2DbKaoZricRh_YJ8y_zK%2BHdSjQ%40mail.gmail.com

--
Best Regards
Andy Fan
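The cache-hit-ratio figures above are reproduced exactly by a simple latency-blend model in which both sequential and random reads hit the OS cache with probability h. This is an inference from the posted numbers; the tool itself may compute it differently. A self-contained sketch:

#include <stdio.h>

/*
 * Blend model: with cache hit ratio h, a random page costs
 * h*cache_lat + (1-h)*random_lat, and a sequential page costs
 * h*cache_lat + (1-h)*seq_lat. random_page_cost is their ratio,
 * since seq_page_cost is normalized to 1.
 */
static double
effective_random_page_cost(double h, double cache_lat,
						   double seq_lat, double random_lat)
{
	double random_eff = h * cache_lat + (1.0 - h) * random_lat;
	double seq_eff    = h * cache_lat + (1.0 - h) * seq_lat;

	return random_eff / seq_eff;
}

int
main(void)
{
	/* the latencies measured above, in microseconds */
	double cache_lat = 0.957756;
	double seq_lat = 70.780327;
	double random_lat = 438.837257;
	double ratios[] = {1.00, 0.90, 0.50, 0.10, 0.00};

	for (int i = 0; i < 5; i++)
		printf("cache hit ratio: %.2f random_page_cost %f\n",
			   ratios[i],
			   effective_random_page_cost(ratios[i], cache_lat,
										  seq_lat, random_lat));
	return 0;
}

At h = 0 this reduces to random_page_lat / seq_read_lat, which is about 6.2, and at h = 1 both blends collapse to fs_cache_lat, giving exactly 1.0, matching the endpoints of the table above.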