date:20200907

On Tue, Sep 8, 2020 at 4:11 PM Andres Freund  wrote:
> At first I was very confused as to why none of the existing tests have
> found this significant issue. But after thinking about it for a minute
> that's because they all use psql, and largely separate psql invocations
> for each query :(. Which means that there's no cached snapshot around...

I prototyped a TAP test patch that could maybe do the sort of thing
you need, in patch 0006 over at [1].  Later versions of that patch set
dropped it, because I figured out how to use the isolation tester
instead, but I guess you can't do that for a standby test (at least
not until someone teaches the isolation tester to support multi-node
schedules, something that would be extremely useful...).  Example:

+# start an interactive session that we can use to interleave statements
+my $session = PsqlSession->new($node, "postgres");
+$session->send("\\set PROMPT1 ''\n", 2);
+$session->send("\\set PROMPT2 ''\n", 1);
...
+# our snapshot is not too old yet, so we can still use it
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "1");
...
+# our snapshot is too old!  the thing it wants to see has been removed
+@lines = $session->send("select * from t order by i limit 1;\n", 2);
+shift @lines;
+$result = shift @lines;
+is($result, "ERROR:  snapshot too old");

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BFkUuDv-bcBns%3DZ_O-V9QGW0nWZNHOkEPxHZWjegRXvw%40mail.gmail.com

Re: 回复：how to create index concurrently on partitioned table

On Mon, Sep 07, 2020 at 09:39:16PM -0500, Justin Pryzby wrote:
> Also, my previous revision failed to implement your suggestion to first build
> catalog entries with INVALID indexes and to then reindex them.  Fixed.

-   childStmt->oldCreateSubid = InvalidSubTransactionId;
-   childStmt->oldFirstRelfilenodeSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldCreateSubid = childStmt->InvalidSubTransactionId;
+   // childStmt->oldFirstRelfilenodeSubid = InvalidSubTransactionId;
In the CIC patch, what is that about?  It is hard to follow what you
are trying to achieve here.

+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+   CommandCounterIncrement();
+   index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);
Anyway, this part of the logic is still not good here.  If we fail in
the middle here, we may run into problems with a single partition
index that has only a portion of its flags set.  As this gets called
for each partitioned index, it also means that we could still finish
in a state where we have only a partially-valid partition tree.  For
example, imagine a partition tree with 2 levels (as reported by
pg_partition_tree), then the following happens if an index is created
concurrently from the partitioned table of level 0:
1) indexes are created in level 2 first
2) partitioned table of level 1 is switched to be ready and valid
3) indexes of level 1 are created.
4) partitioned table of level 0 is switched to be ready and valid
If the command has a failure, say between 2 and 3, we would have as
result a command that has partially succeeded in creating a partition
tree, while the user was looking for an index for the full tree.  This
comes back to my previous points, where we should make
index_set_state_flags() first, and I have begun a separate thread
about that:
https://commitfest.postgresql.org/30/2725/

Second, it also means that the patch should really switch all the
indexes to be valid in one single transaction, and that we also need
more careful refactoring of DefineIndex().

I also had a quick look at the patch for CLUSTER, and it does a much
better job, still it has issues.

+   MemoryContext old_context = MemoryContextSwitchTo(cluster_context);
+
+   inhoids = find_all_inheritors(indexOid, NoLock, NULL);
+   foreach(lc, inhoids)
The area where a private memory context is used should be minimized.
In this case, you just need to hold the context while saving the
relation and clustered index information in the list of RelToCluster
items.  As a whole, this case is more simple than CIC, so I'd like to
think that it would be good to work on that as next target.

Coming to my last point..  This thread has dealt since the beginning
with three different problems:
- REINDEX on partitioned relations.
- CLUSTER on partitioned relations.
- CIC on partitioned relations.  (Should we also care about DROP INDEX
CONCURRENTLY as well?)

The first problem has been solved, not the two others yet.  Do you
think that it could be possible to begin two new separate threads for
the remaining issues, with dedicated CF entries?  We could also mark
the existing one as committed, retitled for REINDEX as a matter of
clarity.  Also, please note that I am not sure if I will be able to
look more at this thread in this CF.
--
Michael


signature.asc
Description: PGP signature

Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Ian Barwick


On 2020/09/08 13:11, Andres Freund wrote:

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:

(...)

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

 *** a/src/backend/access/transam/xact.c
 --- b/src/backend/access/transam/xact.c
 *** xact_redo_commit(xl_xact_parsed_commit *
 *** 5915,5920 
 --- 5915,5924 
  */
 if (XactCompletionApplyFeedback(parsed->xinfo))
 XLogRequestWalReceiverReply();
 +
 +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 +   ShmemVariableCache->xactCompletionCount++;
 +   LWLockRelease(ProcArrayLock);
   }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).


Yup.


At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?


Sure, I'll give it a go as I have some time right now.


Regards

Ian Barwick

--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Andres Freund

Hi,

On 2020-09-08 13:03:01 +0900, Ian Barwick wrote:
> On 2020/09/03 17:18, Michael Paquier wrote:
> > On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:
> > > So we get some builfarm results while thinking about this.
> > 
> > Andres, there is an entry in the CF for this thread:
> > https://commitfest.postgresql.org/29/2500/
> > 
> > A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.
> 
> I haven't seen it mentioned here, so apologies if I've overlooked
> something, but as of 623a9ba queries on standbys seem somewhat
> broken.
> 
> Specifically, I maintain some code which does something like this:
> 
> - connects to a standby
> - checks a particular row does not exist on the standby
> - connects to the primary
> - writes a row in the primary
> - polls the standby (using the same connection as above)
>   to verify the row arrives on the standby
> 
> As of recent HEAD it never sees the row arrive on the standby, even
> though it is verifiably there.

Ugh, that's not good.


> I've traced this back to 623a9ba, which relies on "xactCompletionCount"
> being incremented to determine whether the snapshot can be reused,
> but that never happens on a standby, meaning this test in
> GetSnapshotDataReuse():
> 
> if (curXactCompletionCount != snapshot->snapXactCompletionCount)
> return false;
> 
> will never return false, and the snapshot's xmin/xmax never get advanced.
> Which means the session on the standby is not able to see rows on the
> standby added after the session was started.
> 
> It's simple enough to prevent that being an issue by just never calling
> GetSnapshotDataReuse() if the snapshot was taken during recovery
> (though obviously that means any performance benefits won't be available
> on standbys).

Yea, that doesn't sound great. Nor is the additional branch welcome.


> I wonder if it's possible to increment "xactCompletionCount"
> during replay along these lines:
> 
> *** a/src/backend/access/transam/xact.c
> --- b/src/backend/access/transam/xact.c
> *** xact_redo_commit(xl_xact_parsed_commit *
> *** 5915,5920 
> --- 5915,5924 
>  */
> if (XactCompletionApplyFeedback(parsed->xinfo))
> XLogRequestWalReceiverReply();
> +
> +   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
> +   ShmemVariableCache->xactCompletionCount++;
> +   LWLockRelease(ProcArrayLock);
>   }
> 
> which seems to work (though quite possibly I've overlooked something I don't
> know that I don't know about and it will all break horribly somewhere,
> etc. etc.).

We'd also need the same in a few more places. Probably worth looking at
the list where we increment it on the primary (particularly we need to
also increment it for aborts, and 2pc commit/aborts).

At first I was very confused as to why none of the existing tests have
found this significant issue. But after thinking about it for a minute
that's because they all use psql, and largely separate psql invocations
for each query :(. Which means that there's no cached snapshot around...

Do you want to try to write a patch?

Greetings,

Andres Freund

Re: Remove page-read callback from XLogReaderState.

At Tue, 08 Sep 2020 11:56:29 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Thank you for the comment, Menjo-san, and noticing me of that, Michael.

I found why the message I was writing was gone from the draft folder..

Sorry for the garbage.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Ian Barwick


On 2020/09/03 17:18, Michael Paquier wrote:

On Sun, Aug 16, 2020 at 02:26:57PM -0700, Andres Freund wrote:

So we get some builfarm results while thinking about this.


Andres, there is an entry in the CF for this thread:
https://commitfest.postgresql.org/29/2500/

A lot of work has been committed with 623a9ba, 73487a6, 5788e25, etc.


I haven't seen it mentioned here, so apologies if I've overlooked
something, but as of 623a9ba queries on standbys seem somewhat
broken.

Specifically, I maintain some code which does something like this:

- connects to a standby
- checks a particular row does not exist on the standby
- connects to the primary
- writes a row in the primary
- polls the standby (using the same connection as above)
  to verify the row arrives on the standby

As of recent HEAD it never sees the row arrive on the standby, even
though it is verifiably there.

I've traced this back to 623a9ba, which relies on "xactCompletionCount"
being incremented to determine whether the snapshot can be reused,
but that never happens on a standby, meaning this test in
GetSnapshotDataReuse():

if (curXactCompletionCount != snapshot->snapXactCompletionCount)
return false;

will never return false, and the snapshot's xmin/xmax never get advanced.
Which means the session on the standby is not able to see rows on the
standby added after the session was started.

It's simple enough to prevent that being an issue by just never calling
GetSnapshotDataReuse() if the snapshot was taken during recovery
(though obviously that means any performance benefits won't be available
on standbys).

I wonder if it's possible to increment "xactCompletionCount"
during replay along these lines:

*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*** xact_redo_commit(xl_xact_parsed_commit *
*** 5915,5920 
--- 5915,5924 
 */
if (XactCompletionApplyFeedback(parsed->xinfo))
XLogRequestWalReceiverReply();
+
+   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+   ShmemVariableCache->xactCompletionCount++;
+   LWLockRelease(ProcArrayLock);
  }

which seems to work (though quite possibly I've overlooked something I don't
know that I don't know about and it will all break horribly somewhere,
etc. etc.).


Regards

Ian Barwick


--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [Patch] Optimize dropping of relation buffers using dlist

On Mon, Sep 7, 2020 at 1:33 PM k.jami...@fujitsu.com
 wrote:
>
> On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> > On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > Even if the relation is locked, background processes like
> > > > checkpointer can still touch the relation which might cause
> > > > problems. Consider a case where we extend the relation but didn't
> > > > flush the newly added pages. Now during truncate operation,
> > > > checkpointer can still flush those pages which can cause trouble for
> > > > truncate. But, I think in the recovery path such cases won't cause a
> > problem.
> > >
> > > I wouldn't count on that staying true ...
> > >
> > >
> > https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> > OBV-jC
> > > nacbyvtpptk2a9yy...@mail.gmail.com
> > >
> >
> > I don't think that proposal will matter after commit c5315f4f44 because we 
> > are
> > caching the size/blocks for recovery while doing extend (smgrextend). In the
> > above scenario, we would have cached the blocks which will be used at later
> > point of time.
> >
>
> I'm guessing we can still pursue this idea of improving the recovery path 
> first.
>

I think so.

> I'm working on an updated patch version, because the CFBot's telling
> that postgres fails to build (one of the recovery TAP tests fails).
> I'm still working on refactoring my patch, but have yet to find a proper 
> solution at the moment.
> So I'm going to continue my investigation.
>
> Attached is an updated WIP patch.
> I'd appreciate if you could take a look at the patch as well.
>

So, I see the below log as one of the problems:
2020-09-07 06:20:33.918 UTC [10914] LOG:  redo starts at 0/15FFEC0
2020-09-07 06:20:33.919 UTC [10914] FATAL:  unexpected data beyond EOF
in block 1 of relation base/13743/24581

This indicates that we missed invalidating some buffer which should
have been invalidated. If you are able to reproduce this locally then
I suggest to first write a simple patch without the check of the
threshold, basically in recovery always try to use the new way to
invalidate the buffer. That will reduce the scope of the code that can
create a problem. Let us know if the problem still exists and share
the logs. BTW, I think I see one problem in the code:

if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
+ bufHdr->tag.forkNum == forkNum[j] &&
+ bufHdr->tag.blockNum >= firstDelBlock[j])

Here, I think you need to use 'i' not 'j' for forkNum and
firstDelBlock as those are arrays w.r.t forks. That might fix the
problem but I am not sure as I haven't tried to reproduce it.

-- 
With Regards,
Amit Kapila.

RE: Transactions involving multiple postgres foreign servers, take 2

2020-09-07 Thread tsunakawa.ta...@fujitsu.com

From: Amit Kapila 
> I intend to say that the global-visibility work can impact this in a
> major way and we have analyzed that to some extent during a discussion
> on the other thread. So, I think without having a complete
> design/solution that addresses both the 2PC and global-visibility, it
> is not apparent what is the right way to proceed. It seems to me that
> rather than working on individual (or smaller) parts one needs to come
> up with a bigger picture (or overall design) and then once we have
> figured that out correctly, it would be easier to decide which parts
> can go first.

I'm really sorry I've been getting late and late and latex10 to publish the 
revised scale-out design wiki to discuss the big picture!  I don't know why I'm 
taking this long time; I feel I were captive in a time prison (yes, nobody is 
holding me captive; I'm just late.)  Please wait a few days.

But to proceed with the development, let me comment on the atomic commit and 
global visibility.

* We have to hear from Andrey about their check on the possibility that 
Clock-SI could be Microsoft's patent and if we can avoid it.

* I have a feeling that we can adopt the algorithm used by Spanner, 
CockroachDB, and YugabyteDB.  That is, 2PC for multi-node atomic commit, Paxos 
or Raft for replica synchronization (in the process of commit) to make 2PC more 
highly available, and the timestamp-based global visibility.  However, the 
timestamp-based approach makes the database instance shut down when the node's 
clock is distant from the other nodes.

* Or, maybe we can use the following Commitment ordering that doesn't require 
the timestamp or any other information to be transferred among the cluster 
nodes.  However, this seems to have to track the order of read and write 
operations among concurrent transactions to ensure the correct commit order, so 
I'm not sure about the performance.  The MVCO paper seems to present the 
information we need, but I haven't understood it well yet (it's difficult.)  
Could you anybody kindly interpret this?

Commitment ordering (CO) - yoavraz2
https://sites.google.com/site/yoavraz2/the_principle_of_co


As for the Sawada-san's 2PC patch, which I find interesting purely as FDW 
enhancement, I raised the following issues to be addressed:

1. Make FDW API implementable by other FDWs than postgres_fdw (this is what 
Amit-san kindly pointed out.)  I think oracle_fdw and jdbc_fdw would be good 
examples to consider, while MySQL may not be good because it exposes the XA 
feature as SQL statements, not C functions as defined in the XA specification.

2. 2PC processing is queued and serialized in one background worker.  That 
severely subdues transaction throughput.  Each backend should perform 2PC.

3. postgres_fdw cannot detect remote updates when the UDF executed on a remote 
node updates data.


Regards
Takayuki Tsunakawa

Re: Logical Replication - detail message with names of missing columns

2020-09-07 Thread Bharath Rupireddy

On Tue, Sep 8, 2020 at 6:50 AM Kyotaro Horiguchi
 wrote:
>
> +1 for objective. However, that can be done simpler way that doesn't
> need additional loops by using bitmapset to hold missing remote
> attribute numbers. This also make the variable "found" useless.
>

Thanks. I will look into it and post a v2 patch soon.

>
> > Here's a snapshot how the error looks with the patch:
> > 2020-09-04 01:00:36.721 PDT [843128] ERROR:  logical replication
> > target relation "public.test_1" is missing "b1, d1" replicated columns
> > 2020-09-04 01:00:46.784 PDT [843132] ERROR:  logical replication
> > target relation "public.test_1" is missing "b1" replicated columns
> > 2020-09-06 21:24:53.645 PDT [902945] ERROR:  logical replication
> > target relation "public.test_1" is missing "a1, c1, d1, b1" replicated
> > columns
> >
> > Thoughts?
>
> FWIW, I would prefer that the message be like
>
>  logical replication target relation "public.test_1" is missing
>  replicated columns: "a1", "c1"
>

This looks fine, I will change that.

>
> I'm not sure we need to have separate messages for singular and plural.
>

I don't think we need to have separate messages. To keep it simple,
let's have plural form.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: Online checksums verification in the backend

On Tue, Sep 08, 2020 at 11:36:45AM +0900, Masahiko Sawada wrote:
> On Mon, 7 Sep 2020 at 15:59, Michael Paquier  wrote:
>>  In this case it could be better to unregister from the
>> CF app for this entry.
> 
> Well, I sent review comments on this patch and Julien fixed all
> comments. So I’d marked this as Ready for Committer since I didn't
> have further comments at that time, and I was waiting for the
> committer review. I'll look at this patch again but should I remove my
> name from the reviewer after that if no comments?

Ah, sorry, I somewhat missed the previous status of the patch.
Perhaps that's an overdose of CF.  Keeping your name as reviewer is
fine I guess.  I have begun looking at the patch and spotted some
issues, so let's see where we do from here.
--
Michael


signature.asc
Description: PGP signature

Re: Transactions involving multiple postgres foreign servers, take 2

On Tue, Sep 8, 2020 at 8:05 AM Fujii Masao  wrote:
>
> On 2020/09/08 10:34, Amit Kapila wrote:
> > On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  
> > wrote:
> >>
> >> IMO it's not easy to commit this 2PC patch at once because it's still large
> >> and complicated. So I'm thinking it's better to separate the feature into
> >> several parts and commit them gradually.
> >>
> >
> > Hmm, I don't see that we have a consensus on the design and or
> > interfaces of this patch and without that proceeding for commit
> > doesn't seem advisable. Here are a few points which I remember offhand
> > that require more work.
>
> Thanks!
>
> > 1. There is a competing design proposed and being discussed in another
> > thread [1] for this purpose. I think both the approaches have pros and
> > cons but there doesn't seem to be any conclusion yet on which one is
> > better.
>
> I was thinking that [1] was discussing global snapshot feature for
> "atomic visibility" rather than the solution like 2PC for "atomic commit".
> But if another approach for "atomic commit" was also proposed at [1],
> that's good. I will check that.
>

Okay, that makes sense.

> > 2. In this thread, we have discussed to try integrating this patch
> > with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
> > APIs we are exposing are general enough that other FDWs can use them
> > to implement 2PC. I could see some speculations about the same but no
> > concrete work on the same has been done.
>
> Yes, you're right.
>
> > 3. In another thread [1], we have seen that the patch being discussed
> > in this thread might need to re-designed if we have to use some other
> > design for global-visibility than what is proposed in that thread. I
> > think it is quite likely that can happen considering no one is able to
> > come up with the solution to major design problems spotted in that
> > patch yet.
>
> You imply that global-visibility patch should be come first before "2PC" 
> patch?
>

I intend to say that the global-visibility work can impact this in a
major way and we have analyzed that to some extent during a discussion
on the other thread. So, I think without having a complete
design/solution that addresses both the 2PC and global-visibility, it
is not apparent what is the right way to proceed. It seems to me that
rather than working on individual (or smaller) parts one needs to come
up with a bigger picture (or overall design) and then once we have
figured that out correctly, it would be easier to decide which parts
can go first.

-- 
With Regards,
Amit Kapila.

Re: Remove page-read callback from XLogReaderState.

Thank you for the comment, Menjo-san, and noticing me of that, Michael.

Sorry for late reply.

At Fri, 17 Jul 2020 14:14:44 +0900, Takashi Menjo  
wrote in 
> Hi,
> 
> I applied your v15 patchset to master
> ed2c7f65bd9f15f8f7cd21ad61602f983b1e72e9.  Here are three feedback points
> for you:
> 
> 
> = 1. Build error when WAL_DEBUG is defined manually =
> How to reproduce:
> 
>  $ sed -i -E -e 's|^/\* #define WAL_DEBUG \*/$|#define WAL_DEBUG|'
> src/include/pg_config_manual.h
>  $ ./configure && make
> 
> Expected: PostgreSQL is successfully made.
> Actual: I got the following make error:
> 
> 
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type
> -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard
> -Wno-format-truncation -Wno-stringop-truncation -O2
> -I../../../../src/include  -D_GNU_SOURCE   -c -o xlog.o xlog.c
> In file included from /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23,
>  from /usr/include/signal.h:303,
>  from ../../../../src/include/storage/sinval.h:17,
>  from ../../../../src/include/access/xact.h:22,
>  from ../../../../src/include/access/twophase.h:17,
>  from xlog.c:33:
> xlog.c: In function ‘XLogInsertRecord’:
> xlog.c:1219:56: error: called object is not a function or function pointer
>  1219 |debug_reader = XLogReaderAllocate(wal_segment_size, NULL NULL);
>   |^~~~
> xlog.c:1219:19: error: too few arguments to function ‘XLogReaderAllocate’
>  1219 |debug_reader = XLogReaderAllocate(wal_segment_size, NULL NULL);
>   |   ^~
> In file included from ../../../../src/include/access/clog.h:14,
>  from xlog.c:25:
> ../../../../src/include/access/xlogreader.h:243:25: note: declared here
>   243 | extern XLogReaderState *XLogReaderAllocate(int wal_segment_size,
>   | ^~
> make[4]: *** [: xlog.o] Error 1
> 
> 
> The following chunk in 0002 seems to be the cause of the error.  There is
> no comma between two NULLs.
> 
> 
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index e570e56a24..f9b0108602 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> (..snipped..)
> @@ -1225,8 +1218,7 @@ XLogInsertRecord(XLogRecData *rdata,
> appendBinaryStringInfo(, rdata->data, rdata->len);
> 
> if (!debug_reader)
> -   debug_reader = XLogReaderAllocate(wal_segment_size, NULL,
> - XL_ROUTINE(), NULL);
> +   debug_reader = XLogReaderAllocate(wal_segment_size, NULL NULL);
> 
> if (!debug_reader)
> {
> 
> 
> 
> = 2. readBuf allocation in XLogReaderAllocate =
> AFAIU, not XLogReaderAllocate() itself but its caller is now responsible
> for allocating XLogReaderState->readBuf.  However, the following code still
> remains in src/backend/access/transam/xlogreader.c:
> 
> 
>  74 XLogReaderState *
>  75 XLogReaderAllocate(int wal_segment_size, const char *waldir,
>  76WALSegmentCleanupCB cleanup_cb)
>  77 {
>   :
>  98 state->readBuf = (char *) palloc_extended(XLOG_BLCKSZ,
>  99   MCXT_ALLOC_NO_OOM);
> 
> 
> Is this okay?
> 
> 
> = 3. XLOG_FROM_ANY assigned to global readSource =
> Regarding the following chunk in 0003:
> 
> 
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index 6b42d9015f..bcb4ef270f 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -804,18 +804,14 @@ static XLogSegNo openLogSegNo = 0;
>   * These variables are used similarly to the ones above, but for reading
>   * the XLOG.  Note, however, that readOff generally represents the offset
>   * of the page just read, not the seek position of the FD itself, which
> - * will be just past that page. readLen indicates how much of the current
> - * page has been read into readBuf, and readSource indicates where we got
> - * the currently open file from.
> + * will be just past that page. readSource indicates where we got the
> + * currently open file from.
>   * Note: we could use Reserve/ReleaseExternalFD to track consumption of
>   * this FD too; but it doesn't currently seem worthwhile, since the XLOG is
>   * not read by general-purpose sessions.
>   */
>  static int readFile = -1;
> -static XLogSegNo readSegNo = 0;
> -static uint32 readOff = 0;
> -static uint32 readLen = 0;
> -static XLogSource readSource = XLOG_FROM_ANY;
> +static XLogSource readSource = 0;  /* XLOG_FROM_* code */
> 
>  /*
>   * Keeps track of which source we're

Re: [PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )

Craig Ringer  writes:
> Example here:
> https://github.com/ringerc/scrapcode/tree/master/c/clang_return_stack_checks
> So I find that actually, the __attribute__((callback(fn)) approach is
> unnecessary for the purpose proposed.

I tested this by injecting some faults of the described sort.

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index db7d24a511..eaf7716816 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -3025,6 +3025,8 @@ CopyFrom(CopyState cstate)

myslot = 
CopyMultiInsertInfoNextFreeSlot(,

 resultRelInfo);
+   if (!myslot)
+ return 0;
}

/*

leads to

/home/tgl/pgsql/src/backend/commands/copy.c:3029:6: warning: Address of stack 
memory associated with local variable 'errcallback' is still referred to by the 
global variable 'error_context_stack' upon returning to the caller.  This will 
be a dangling reference
  return 0;
  ^~~~

So that's good.  However, a similar experiment with returning from inside
a PG_TRY does *not* produce a warning.  I suspect maybe the compiler
throws up its hands when it sees sigsetjmp?

(These results from clang 10.0.0 on Fedora 32.)

regards, tom lane

Re: Global snapshots





On 2020/09/05 3:31, Alexey Kondratov wrote:

Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I had a look on the patch set, although it is quite outdated, especially on 
0003.

Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in the postgres_fdw is 
achieved by the usage of 2PC. I think that this postgres_fdw 2PC support should 
be separated from global snapshots.


Agreed.



It could be useful to have such atomic distributed transactions even without a 
proper visibility, which is guaranteed by the global snapshot. Especially 
taking into account the doubts about Clock-SI and general questions about 
algorithm choosing criteria above in the thread.

Thus, I propose to split 0003 into two parts and add a separate GUC 
'postgres_fdw.use_twophase', which could be turned on independently from 
'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, then 
2PC should be forcedly turned on as well.

Second, there are some problems with errors handling in the 0003 (thanks to 
Arseny Sher for review).

+error:
+    if (!res)
+    {
+    sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid);
+    BroadcastCmd(sql);
+    elog(ERROR, "Failed to PREPARE transaction on remote node");
+    }

It seems that we should never reach this point, just because BroadcastStmt will 
throw an ERROR if it fails to prepare transaction on the foreign server:

+    if (PQresultStatus(result) != expectedStatus ||
+    (handler && !handler(result, arg)))
+    {
+    elog(WARNING, "Failed command %s: status=%d, expected 
status=%d", sql, PQresultStatus(result), expectedStatus);
+    pgfdw_report_error(ERROR, result, entry->conn, true, sql);
+    allOk = false;
+    }

Moreover, It doesn't make much sense to try to abort prepared xacts, since if 
we failed to prepare it somewhere, then some foreign servers may become 
unavailable already and this doesn't provide us a 100% guarantee of clean up.

+    /* COMMIT open transaction of we were doing 2PC */
+    if (fdwTransState->two_phase_commit &&
+    (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT))
+    {
+    BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid));
+    }

At this point, the host (local) transaction is already committed and there is 
no way to abort it gracefully. However, BroadcastCmd may rise an ERROR that 
will cause a PANIC, since it is non-recoverable state:

PANIC:  cannot abort transaction 487, it was already committed

Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds 
a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues 
above and tries to add proper comments everywhere. I think, that 0003 should be 
rebased on the top of it, or it could be a first patch in the set, since it may 
be used independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts
about pros and cons between your patch and Sawada-san's?

Regards,

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

2020-09-07 Thread Noah Misch

On Tue, Sep 08, 2020 at 10:43:32AM +0900, Kyotaro Horiguchi wrote:
> At Tue, 08 Sep 2020 09:13:53 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Mon, 7 Sep 2020 02:32:55 -0700, Noah Misch  wrote in 
> > > As a PoC, this looks promising.  Thanks.  Would you add a test case such 
> > > that
> > > the following demonstrates the bug in the absence of your PoC?
> > > 
> > >   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> > > 'max_wal_senders = 0' >/tmp/minimal.conf
> > >   make check TEMP_CONFIG=/tmp/minimal.conf
> > 
> > Mmm. I was close to add some tests to 018_wal_optimize.pl but your
> > suggestion seems better.  I added several ines to create_index.sql.
> > 
> > > Please have the test try both a nailed-and-mapped relation and a "nailed, 
> > > but
> > > not mapped" relation.  I am fairly confident that your PoC fixes the 
> > > former
> > > case, but the latter may need additional code.
> > 
> > Mmm. You're right. I choosed pg_amproc_fam_proc_index as
> > nailed-but-not-mapped index.
> 
> I fixed a typo (s/staring/starting/).

At a glance, this looks reasonable.  If a closer look doesn't reveal problems,
I'll push this.

Re: 回复：how to create index concurrently on partitioned table

2020-09-07 Thread Justin Pryzby

Thanks for completing and pushing the REINDEX patch and others.

Here's a rebasified + fixed version of the others.

On Tue, Sep 01, 2020 at 02:51:58PM +0900, Michael Paquier wrote:
> The REINDEX patch is progressing its way, so I have looked a bit at
> the part for CIC.
> 
> Visibly, the case of multiple partition layers is not handled
> correctly.  Here is a sequence that gets broken:
..
> This fails as follows:
> ERROR: XX000: unrecognized node type: 2139062143
> LOCATION:  copyObjectImpl, copyfuncs.c:5718

Because copyObject needed to be called within a longlived context.

Also, my previous revision failed to implement your suggestion to first build
catalog entries with INVALID indexes and to then reindex them.  Fixed.

-- 
Justin
>From 17a14e5ad128024b05ae288a5096148b3f114c98 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v6 1/2] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.
---
 doc/src/sgml/ref/create_index.sgml |   9 --
 src/backend/commands/indexcmds.c   | 118 -
 src/test/regress/expected/indexing.out |  60 -
 src/test/regress/sql/indexing.sql  |  18 +++-
 4 files changed, 165 insertions(+), 40 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 33aa64e81d..c780dc9547 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f1b5f87e6a..61d0c4914c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -665,17 +665,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 'concurrent', so that
-		 * the error is thrown also for temporary tables.  Seems better to be
-		 * consistent, even though we could do it on temporary table because
-		 * we're not actually doing it concurrently.
-		 */
-		if (stmt->concurrent)
-			ereport(ERROR,
-	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-	 errmsg("cannot create index on partitioned table \"%s\" concurrently",
-			RelationGetRelationName(rel;
 		if (stmt->excludeOpNames)
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1110,6 +1099,11 @@ DefineIndex(Oid relationId,
 		if (pd->nparts != 0)
 			flags |= INDEX_CREATE_INVALID;
 	}
+	else if (concurrent && OidIsValid(parentIndexId))
+	{
+		/* If concurrent, initial build of index partitions as "invalid" */
+		flags |= INDEX_CREATE_INVALID;
+	}
 
 	if (stmt->deferrable)
 		constr_flags |= INDEX_CONSTR_CREATE_DEFERRABLE;
@@ -1162,6 +1156,14 @@ DefineIndex(Oid relationId,
 		 */
 		if (!stmt->relation || stmt->relation->inh)
 		{
+			/*
+			 * Need to close the relation before recursing into children, so
+			 * copy needed data into a longlived context.
+			 */
+
+			MemoryContext	ind_context = AllocSetContextCreate(PortalContext, "CREATE INDEX",
+	ALLOCSET_DEFAULT_SIZES);
+			MemoryContext	oldcontext = MemoryContextSwitchTo(ind_context);
 			PartitionDesc partdesc = RelationGetPartitionDesc(rel);
 			int			nparts = partdesc->nparts;
 			Oid		   *part_oids = palloc(sizeof(Oid) * nparts);
@@ -1173,8 +1175,10 @@ DefineIndex(Oid relationId,
 		 nparts);
 
 			memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
+			parentDesc = CreateTupleDescCopy(RelationGetDescr(rel));
+			table_close(rel, NoLock);
+			MemoryContextSwitchTo(oldcontext);
 
-			parentDesc = RelationGetDescr(rel);
 			opfamOids = palloc(sizeof(Oid) * numberOfKeyAttributes);
 			for (i = 0; i < numberOfKeyAttributes; i++)
 opfamOids[i] = get_opclass_family(classObjectId[i]);
@@ -1217,10 +1221,12 @@ DefineIndex(Oid relationId,
 	continue;
 }
 
+oldcontext = MemoryContextSwitchTo(ind_context);
 childidxs = RelationGetIndexList(childrel);
 attmap =
 	build_attrmap_by_name(RelationGetDescr(childrel),
 		  parentDesc);
+MemoryContextSwitchTo(oldcontext);
 
 foreach(cell, childidxs)
 {
@@ -1291,10 +1297,14 @@ DefineIndex(Oid relationId,
  */
 if (!found)
 {
-	IndexStmt  *childStmt = copyObject(stmt);
+	IndexStmt  *childStmt;
 	bool		found_whole_row;
 	ListCell   *lc;
 
+	oldcontext = MemoryContextSwitchTo(ind_context);
+	childStmt =

Re: Online checksums verification in the backend

2020-09-07 Thread Masahiko Sawada

On Mon, 7 Sep 2020 at 15:59, Michael Paquier  wrote:
>
> On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote:
> > On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:
> >> Small language fixes in comments and user-facing docs.
> >
> > Thanks a lot!  I just fixed a small issue (see below), PFA updated v10.
>
> Sawada-san, you are registered as a reviewer of this patch.  Are you
> planning to look at it?  If you are busy lately, that's fine as well
> (congrats!).

Thanks!

>  In this case it could be better to unregister from the
> CF app for this entry.

Well, I sent review comments on this patch and Julien fixed all
comments. So I’d marked this as Ready for Committer since I didn't
have further comments at that time, and I was waiting for the
committer review. I'll look at this patch again but should I remove my
name from the reviewer after that if no comments?

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Transactions involving multiple postgres foreign servers, take 2





On 2020/09/08 10:34, Amit Kapila wrote:

On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  wrote:


IMO it's not easy to commit this 2PC patch at once because it's still large
and complicated. So I'm thinking it's better to separate the feature into
several parts and commit them gradually.



Hmm, I don't see that we have a consensus on the design and or
interfaces of this patch and without that proceeding for commit
doesn't seem advisable. Here are a few points which I remember offhand
that require more work.


Thanks!


1. There is a competing design proposed and being discussed in another
thread [1] for this purpose. I think both the approaches have pros and
cons but there doesn't seem to be any conclusion yet on which one is
better.


I was thinking that [1] was discussing global snapshot feature for
"atomic visibility" rather than the solution like 2PC for "atomic commit".
But if another approach for "atomic commit" was also proposed at [1],
that's good. I will check that.


2. In this thread, we have discussed to try integrating this patch
with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
APIs we are exposing are general enough that other FDWs can use them
to implement 2PC. I could see some speculations about the same but no
concrete work on the same has been done.


Yes, you're right.


3. In another thread [1], we have seen that the patch being discussed
in this thread might need to re-designed if we have to use some other
design for global-visibility than what is proposed in that thread. I
think it is quite likely that can happen considering no one is able to
come up with the solution to major design problems spotted in that
patch yet.


You imply that global-visibility patch should be come first before "2PC" patch?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Re: [PATCH] Detect escape of ErrorContextCallback stack pointers (and from PG_TRY() )

2020-09-07 Thread Craig Ringer

On Fri, 4 Sep 2020 at 14:55, Craig Ringer  wrote:

> Using a test program:
>
> return_stack_escape.c:14:3: warning: Address of stack memory associated
> with local variable 'g' is still referred to by the global variable
> 'guard_ptr' upon returning to the caller.  This will be a dangling reference
> return do_fail;
> ^~
> 1 warning generated.
>
>
Example here:

https://github.com/ringerc/scrapcode/tree/master/c/clang_return_stack_checks

So I find that actually, the __attribute__((callback(fn)) approach is
unnecessary for the purpose proposed.

I still think we should be looking at tidying up the error contexts API,
but it's easy enough for extensions to provide a wrapper over it, so that's
not a big deal really.

I will submit my docs patches separately to ensure they get some attention,
and I'll experiment with llvm threadsafety annotations / capabilities.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise

Re: Resetting spilled txn statistics in pg_stat_replication

2020-09-07 Thread Masahiko Sawada

On Mon, 7 Sep 2020 at 15:24, Amit Kapila  wrote:
>
> On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada
>  wrote:
> >
> > I've updated the patch so that the stats collector ignores the
> > 'update' message if the slot stats array is already full.
> >
>
> This patch needs a rebase. I don't see this patch in the CF app. I
> hope you are still interested in working on this.

Thank you for reviewing this patch!

I'm still going to work on this patch although I might be slow
response this month.

>
> Review comments:
> ===
> 1.
> +CREATE VIEW pg_stat_replication_slots AS
> +SELECT
> +s.name,
> +s.spill_txns,
> +s.spill_count,
> +s.spill_bytes
> +FROM pg_stat_get_replication_slots() AS s;
>
> The view pg_stat_replication_slots should have a column 'stats_reset'
> (datatype: timestamp with time zone) as we provide a facitily to reset
> the slots. A similar column exists in pg_stat_slru as well, so is
> there a reason of not providing it here?

I had missed adding the column. Fixed.

>
> 2.
> +   
> +  
> +
> +  
>pg_stat_wal_receiver
>
> It is better to keep one empty line between  and  to
> keep it consistent with the documentation of other views.

Fixed.

>
> 3.
>   pg_stat_reset_replication_slot
> +
> +pg_stat_reset_replication_slot (
> text )
> +void
> +   
> +   
> + Resets statistics to zero for a single replication slot, or for all
> + replication slots in the cluster.  If the argument is NULL,
> all counters
> + shown in the
> pg_stat_replication_slots view for
> + all replication slots are reset.
> +   
>
> I think the information about the parameter for this function is not
> completely clear. It seems to me that it should be the name of the
> slot for which we want to reset the stats, if so, let's try to be
> clear.

Fixed.

>
> 4.
> +pgstat_reset_replslot_counter(const char *name)
> +{
> + PgStat_MsgResetreplslotcounter msg;
> +
> + if (pgStatSock == PGINVALID_SOCKET)
> + return;
> +
> + pgstat_setheader(_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
> + if (name)
> + {
> + memcpy(_slotname, name, NAMEDATALEN);
> + msg.clearall = false;
> + }
>
> Don't we want to verify here or in the caller of this function whether
> the passed slot_name is a valid slot or not? For ex. see
> pgstat_reset_shared_counters where we return an error if the target is
> not valid.

Agreed. Fixed.

>
> 5.
> +static void
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> +{
> + int i;
> + int idx = -1;
> + TimestampTz ts;
> +
> + if (!msg->clearall)
> + {
> + /* Get the index of replication slot statistics to reset */
> + idx = pgstat_replslot_index(msg->m_slotname, false);
> +
> + if (idx < 0)
> + return; /* not found */
>
> Can we add a comment to describe when we don't expect to find the slot
> here unless there is no way that can happen?

Added.

>
> 6.
> +pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
> + int len)
> {
> ..
> + for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
> ..
> }
>
> I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS.

Fixed.

>
> 7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We
> can probably do at the end but better to change it now so that it
> doesn't slip from our mind.

Yes, changed.

>
> 8.
> @@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool
> permanent, bool deep)
>
>   break;
>
> + /*
> + * 'R' A PgStat_ReplSlotStats struct describing a replication slot
> + * follows.
> + */
> + case 'R':
> + if (fread([nReplSlotStats], 1,
> sizeof(PgStat_ReplSlotStats), fpin)
> + != sizeof(PgStat_ReplSlotStats))
> + {
> + ereport(pgStatRunningInCollector ? LOG : WARNING,
> + (errmsg("corrupted statistics file \"%s\"",
> + statfile)));
> + memset([nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
> + goto done;
> + }
> + nReplSlotStats++;
> + break;
>
> Both here and in pgstat_read_db_statsfile_timestamp(), the patch
> handles 'R' message after 'D' whereas while writing the 'R' is written
> before 'D'. So, I think it is better if we keep the order during read
> the same as during write.

Changed the code so that it writes 'R' after 'D'.

>
> 9. While reviewing this patch, I noticed that in
> pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats
> or SLRUStats, we return 'false' from this function but OTOH, if we
> fail to read 'D' or 'R' message, we will return 'true'. I feel the
> handling of 'D' and 'R' message is fine because once we read
> GlobalStats, we can return the stats_timestamp. So the other two
> stands corrected. I understand that this is not directly related to
> this patch but if you agree we can do this as a separate patch.

It seems to make sense to me. We can set *ts and then read both
ArchiverStats and SLRUStats so we can return a valid timestamp even if
we fail to read.

I've attached both patches: 0001 patch fixes the issue you

Re: Evaluate expression at planning time for two more cases

Surafel Temesgen  writes:
> [ null_check_on_pkey_optimization_v1.patch ]

I took a very brief look at this.

> I don’t add NOT NULL constraint optimization to the patch because cached
> plan is not invalidation in case of a change in NOT NULL constraint

That's actually not a problem, even though some people (including me)
have bandied about such suppositions in the past.  Relying on attnotnull
in the planner is perfectly safe [1].  Plus it'd likely be cheaper as
well as more general than looking up pkey information.  If we did need
to explicitly record the plan's dependency on a constraint, this patch
would be wrong anyhow because it fails to make any such notation about
the pkey constraint it relied on.

The "check_null_side" code you're proposing seems really horrid.
For one thing, it seems quite out of place for eval_const_expressions
to be doing that.  For another, it's wrong in principle because
eval_const_expressions doesn't know which part of the query tree
it's being invoked on, so it cannot know whether outer-join
nullability is an issue.  For another, doing that work over again
from scratch every time we see a potentially optimizable NullTest
looks expensive.  (I wonder whether you have tried to measure the
performance penalty imposed by this patch in cases where it fails
to make any proof.)

I've been doing some handwaving about changing the representation
of Vars, with an eye to making it clear by inspection whether a
given Var is nullable by some lower outer join [2].  If that work
ever comes to fruition then the need for "check_null_side" would
go away.  So maybe we should put this idea on the back burner
until that happens.

I'm not sure what I think about Ashutosh's ideas about doing this
somewhere else than eval_const_expressions.  I do not buy the argument
that it's interesting to do this separately for each child partition.
Child partitions that have attnotnull constraints different from their
parent's are at best a tiny minority use-case, if indeed we allow them
at all (I tend to think we shouldn't).  On the other hand it's possible
that postponing the check would allow bypassing the outer-join problem,
ie if we only do it for quals that have dropped down to the relation
scan level then we don't need to worry about outer join effects.

Another angle here is that eval_const_expressions runs before
reduce_outer_joins, meaning that if it's doing things that depend
on outer-join-ness then it will sometimes fail to optimize cases
that could be optimized.  As a not incidental example, consider

select ... from t1 left join t2 on (...) where t2.x is not null;

reduce_outer_joins will realize that the left join can be reduced
to a plain join, whereupon (if t2.x is attnotnull) the WHERE clause
really is constant-true --- and this seems like a poster-child case
for it being useful to optimize away the WHERE clause.  But
we won't be able to detect that if we apply the optimization during
eval_const_expressions.  So maybe that's a good reason to do it
somewhere later.

regards, tom lane

[1] https://www.postgresql.org/message-id/23564.1585885251%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/15848.1576515643%40sss.pgh.pa.us

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

At Tue, 08 Sep 2020 09:13:53 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Mon, 7 Sep 2020 02:32:55 -0700, Noah Misch  wrote in 
> > As a PoC, this looks promising.  Thanks.  Would you add a test case such 
> > that
> > the following demonstrates the bug in the absence of your PoC?
> > 
> >   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> > 'max_wal_senders = 0' >/tmp/minimal.conf
> >   make check TEMP_CONFIG=/tmp/minimal.conf
> 
> Mmm. I was close to add some tests to 018_wal_optimize.pl but your
> suggestion seems better.  I added several ines to create_index.sql.
> 
> > Please have the test try both a nailed-and-mapped relation and a "nailed, 
> > but
> > not mapped" relation.  I am fairly confident that your PoC fixes the former
> > case, but the latter may need additional code.
> 
> Mmm. You're right. I choosed pg_amproc_fam_proc_index as
> nailed-but-not-mapped index.

I fixed a typo (s/staring/starting/).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 44512f8672395dc40a001eb92b7ffca060eb411d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 7 Sep 2020 20:30:30 +0900
Subject: [PATCH v2] Fix assertion failure during reindex while
 wal_level=minimal.

When wal_level=minimal, nailed relations are forgotten to set
rd_firstRelfilenodeSubid correctly on parallel workers. Fix it.
---
 src/backend/utils/cache/relcache.c | 16 
 src/test/regress/expected/create_index.out |  5 +
 src/test/regress/sql/create_index.sql  |  6 ++
 3 files changed, 27 insertions(+)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 96ecad02dd..8a61fd674e 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1273,6 +1273,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 static void
 RelationInitPhysicalAddr(Relation relation)
 {
+	Oid oldnode = relation->rd_node.relNode;
+
 	/* these relations kinds never have storage */
 	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
 		return;
@@ -1330,6 +1332,20 @@ RelationInitPhysicalAddr(Relation relation)
 			elog(ERROR, "could not find relation mapping for relation \"%s\", OID %u",
  RelationGetRelationName(relation), relation->rd_id);
 	}
+
+	/*
+	 * Our leader process might have replaced storage before starting this
+	 * worker. We need to set rd_firstRelfilenodeSubid according to pending
+	 * sync hash in that case. Currently the information is not actually used
+	 * in worker processes, but make things tidy.
+	 */
+	if (IsParallelWorker() && oldnode != relation->rd_node.relNode)
+	{
+		if (RelFileNodeSkippingWAL(relation->rd_node))
+			relation->rd_firstRelfilenodeSubid = TopSubTransactionId;
+		else
+			relation->rd_firstRelfilenodeSubid = InvalidTransactionId;
+	}
 }
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 64c0c66859..66ec1465e1 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2732,6 +2732,11 @@ REINDEX TABLE pg_toast.pg_toast_1260;
 ERROR:  must be owner of table pg_toast_1260
 REINDEX INDEX pg_toast.pg_toast_1260_index;
 ERROR:  must be owner of index pg_toast_1260_index
+-- parallel reindex
+RESET ROLE;
+SET min_parallel_table_scan_size = 0;
+REINDEX INDEX pg_attribute_relid_attnum_index;	-- nailed and mapped
+REINDEX INDEX pg_amproc_fam_proc_index; 		-- nailed but not mapped
 -- Clean up
 RESET ROLE;
 REVOKE USAGE ON SCHEMA pg_toast FROM regress_reindexuser;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 37f7259da9..6b104a2389 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1191,6 +1191,12 @@ SET SESSION ROLE regress_reindexuser;
 REINDEX TABLE pg_toast.pg_toast_1260;
 REINDEX INDEX pg_toast.pg_toast_1260_index;
 
+-- parallel reindex
+RESET ROLE;
+SET min_parallel_table_scan_size = 0;
+REINDEX INDEX pg_attribute_relid_attnum_index;	-- nailed and mapped
+REINDEX INDEX pg_amproc_fam_proc_index; 		-- nailed but not mapped
+
 -- Clean up
 RESET ROLE;
 REVOKE USAGE ON SCHEMA pg_toast FROM regress_reindexuser;
-- 
2.18.4

Re: Transactions involving multiple postgres foreign servers, take 2

On Mon, Sep 7, 2020 at 2:29 PM Fujii Masao  wrote:
>
> IMO it's not easy to commit this 2PC patch at once because it's still large
> and complicated. So I'm thinking it's better to separate the feature into
> several parts and commit them gradually.
>

Hmm, I don't see that we have a consensus on the design and or
interfaces of this patch and without that proceeding for commit
doesn't seem advisable. Here are a few points which I remember offhand
that require more work.
1. There is a competing design proposed and being discussed in another
thread [1] for this purpose. I think both the approaches have pros and
cons but there doesn't seem to be any conclusion yet on which one is
better.
2. In this thread, we have discussed to try integrating this patch
with some other FDWs (say MySQL, mongodb, etc.) to ensure that the
APIs we are exposing are general enough that other FDWs can use them
to implement 2PC. I could see some speculations about the same but no
concrete work on the same has been done.
3. In another thread [1], we have seen that the patch being discussed
in this thread might need to re-designed if we have to use some other
design for global-visibility than what is proposed in that thread. I
think it is quite likely that can happen considering no one is able to
come up with the solution to major design problems spotted in that
patch yet.

It appears to me that even though these points were raised before in
some form we are just trying to bypass them to commit whatever we have
in the current patch which I find quite surprising.

[1] - 
https://www.postgresql.org/message-id/07b2c899-4ed0-4c87-1327-23c750311248%40postgrespro.ru

-- 
With Regards,
Amit Kapila.

Re: Logical Replication - detail message with names of missing columns

Thank you for working on this.

At Mon, 7 Sep 2020 16:30:59 +0530, Bharath Rupireddy 
 wrote in 
> Hi,
> 
> I observed that, in logical replication when a subscriber is missing
> some columns, it currently emits an error message that says "some"
> columns are missing(see logicalrep_rel_open()), but it doesn't specify
> what the missing column names are. The comment there also says that
> finding the missing column names is a todo item(/* TODO, detail
> message with names of missing columns */).
> 
> I propose a patch to find the missing columns on the subscriber
> relation using the publisher relation columns and show them in the
> error message which can make the error more informative to the user.

+1 for objective. However, that can be done simpler way that doesn't
need additional loops by using bitmapset to hold missing remote
attribute numbers. This also make the variable "found" useless.

> Here's a snapshot how the error looks with the patch:
> 2020-09-04 01:00:36.721 PDT [843128] ERROR:  logical replication
> target relation "public.test_1" is missing "b1, d1" replicated columns
> 2020-09-04 01:00:46.784 PDT [843132] ERROR:  logical replication
> target relation "public.test_1" is missing "b1" replicated columns
> 2020-09-06 21:24:53.645 PDT [902945] ERROR:  logical replication
> target relation "public.test_1" is missing "a1, c1, d1, b1" replicated
> columns
> 
> Thoughts?

FWIW, I would prefer that the message be like

 logical replication target relation "public.test_1" is missing
 replicated columns: "a1", "c1"

I'm not sure we need to have separate messages for singlar and plural.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Division in dynahash.c due to HASH_FFACTOR

On Sat, Sep 5, 2020 at 2:34 AM Alvaro Herrera  wrote:
> On 2020-Sep-04, Jakub Wartak wrote:
> > After removing HASH_FFACTOR PostgreSQL still compiles...  Would
> > removing it break some external API/extensions ?
>
> FWIW, HASH_FFACTOR has *never* been used in Postgres core code.
>
> https://postgr.es/m/20160418180711.55ac82c0@fujitsu already reported
> that this flag was unused a few years ago.  And a search through
> codesearch.debian.net shows that no software packaged by Debian uses
> that flag either.
>
> *If* any third-party extension is using HASH_FFACTOR, it can easily be
> unbroken by removing the flag or #defining it to zero; the removal would
> only be a problem if anyone showed that there is a performance loss by
> using the default fillfactor for some dynahash table instead of their
> custom value.
>
> I think if we know that there's a 4% performance increase by removing
> the division that comes with an ability to use a custom fillfactor that
> nobody uses or has ever used, it makes sense to remove that ability.

I think Tomas is right, and the correct fix for this would be to
compute it up front every time the hash table size changes, but since
apparently no one ever used it, +1 for just removing it and leaving it
hard-coded.

For what it's worth, for dshash.c I just hard coded it to double the
hash table size whenever the number of elements in a partition
exceeded 75%.  Popular hash tables in standard libraries seem to use
either .75 or 1.0 as a default or only setting.  There's probably room
for discussion about numbers in those ranges, but I think the concept
of customisable load factors with a range much wider than that may be
more relevant for disk-based hash tables (like our hash indexes),
which have very different economics.  I think the name HASH_FFACTOR
may be a clue that this was possibly interference from disk-based hash
tables from the same Berkeley people: rather than "load factor", the
more usual term (?) for nentries/nbuckets in memory-based hash tables
as a way to model number of expected collisions, they called it "fill
factor", which I guess is because in disk-based designs your actually
want a certain rate of collisions, because you're working not with
elements in an array and wanting to avoid collisions while not wasting
space, but rather with fixed sized buckets that you want to fill up,
but not overflow.  

> ... however, if we're really tense about improving some hash table's
> performance, it might be worth trying to use simplehash.h instead of
> dynahash.c for it.

Sure, but removing the dynahash IDIV also seems like a slam dunk
removal of a useless expensive feature.

Re: pgbench and timestamps (bounced)

Fabien COELHO writes:
> [Resent on hackers for CF registration, sorry for the noise]

For the record, the original thread is at

https://www.postgresql.org/message-id/flat/CAKVUGgQaZVAUi1Ex41H4wrru%3DFU%2BMfwgjG0aM1br6st7sz31Vw%40mail.gmail.com

(I tried but failed to attach that thread to the CF entry, so we'll
have to settle for leaving a breadcrumb in this thread.)

> It requires a mutex around the commands, I tried to do some windows
> implementation which may or may not work.

Ugh, I'd really rather not do that. Even disregarding the effects
of a mutex, though, my initial idea for fixing this has a big problem:
if we postpone PREPARE of the query until first execution, then it's
happening during timed execution of the benchmark scenario and thus
distorting the timing figures. (Maybe if we'd always done it like
that, it'd be okay, but I'm quite against changing the behavior now
that it's stood for a long time.)

However, perhaps there's more than one way to fix this. Once we've
scanned all of the script and seen all the \set commands, we know
(in principle) the set of all variable names that are in use.
So maybe we could fix this by

(1) During the initial scan of the script, make variable-table
entries for every \set argument, with the values shown as undefined
for the moment. Do not try to parse SQL commands in this scan,
just collect them.

(2) Make another scan in which we identify variable references
in the SQL commands and issue PREPAREs (if enabled).

(3) Perform the timed run.

This avoids any impact of this bug fix on the semantics or timing
of the benchmark proper. I'm not sure offhand whether this
approach makes any difference for the concerns you had about
identifying/suppressing variable references inside quotes.

regards, tom lane

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

At Mon, 7 Sep 2020 02:32:55 -0700, Noah Misch  wrote in 
> On Mon, Sep 07, 2020 at 05:40:36PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 07 Sep 2020 13:45:28 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > The cause is that the worker had received pending-sync entry correctly
> > > but not never created a relcache entry for the relation using
> > > RelationBuildDesc. So the rd_firstRelfilenodeSubid is not correctly
> > > set.
> > > 
> > > I'm investigating it.
> > 
> > Relcaches are loaded from a file with old content at parallel worker
> > startup. The relcache entry is corrected by invalidation at taking a
> > lock but pending syncs are not considered.
> > 
> > Since parallel workers don't access the files so we can just ignore
> > the assertion safely, but I want to rd_firstRelfilenodeSubid flag at
> > invalidation, as attached PoC patch.
> 
> > [patch: When RelationInitPhysicalAddr() handles a mapped relation, re-fill
> > rd_firstRelfilenodeSubid from RelFileNodeSkippingWAL(), like
> > RelationBuildDesc() would do.]
> 
> As a PoC, this looks promising.  Thanks.  Would you add a test case such that
> the following demonstrates the bug in the absence of your PoC?
> 
>   printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
> 'max_wal_senders = 0' >/tmp/minimal.conf
>   make check TEMP_CONFIG=/tmp/minimal.conf

Mmm. I was close to add some tests to 018_wal_optimize.pl but your
suggestion seems better.  I added several ines to create_index.sql.

> Please have the test try both a nailed-and-mapped relation and a "nailed, but
> not mapped" relation.  I am fairly confident that your PoC fixes the former
> case, but the latter may need additional code.

Mmm. You're right. I choosed pg_amproc_fam_proc_index as
nailed-but-not-mapped index.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c4d53acd64011f1ed5652177923d010f4c50fa89 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 7 Sep 2020 20:30:30 +0900
Subject: [PATCH] Fix assertion failure during reindex while wal_level=minimal.

When wal_level=minimal, nailed relations are forgotten to set
rd_firstRelfilenodeSubid correctly on parallel workers. Fix it.
---
 src/backend/utils/cache/relcache.c | 16 
 src/test/regress/expected/create_index.out |  5 +
 src/test/regress/sql/create_index.sql  |  6 ++
 3 files changed, 27 insertions(+)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 96ecad02dd..cd6c8bbc70 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1273,6 +1273,8 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 static void
 RelationInitPhysicalAddr(Relation relation)
 {
+	Oid oldnode = relation->rd_node.relNode;
+
 	/* these relations kinds never have storage */
 	if (!RELKIND_HAS_STORAGE(relation->rd_rel->relkind))
 		return;
@@ -1330,6 +1332,20 @@ RelationInitPhysicalAddr(Relation relation)
 			elog(ERROR, "could not find relation mapping for relation \"%s\", OID %u",
  RelationGetRelationName(relation), relation->rd_id);
 	}
+
+	/*
+	 * Our leader process might have replaced storage before staring this
+	 * worker. We need to set rd_firstRelfilenodeSubid according to pending
+	 * sync hash in that case. Currently the information is not actually used
+	 * in worker processes, but make things tidy.
+	 */
+	if (IsParallelWorker() && oldnode != relation->rd_node.relNode)
+	{
+		if (RelFileNodeSkippingWAL(relation->rd_node))
+			relation->rd_firstRelfilenodeSubid = TopSubTransactionId;
+		else
+			relation->rd_firstRelfilenodeSubid = InvalidTransactionId;
+	}
 }
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 814416d936..6468418639 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2598,6 +2598,11 @@ REINDEX TABLE pg_toast.pg_toast_1260;
 ERROR:  must be owner of table pg_toast_1260
 REINDEX INDEX pg_toast.pg_toast_1260_index;
 ERROR:  must be owner of index pg_toast_1260_index
+-- parallel reindex
+RESET ROLE;
+SET min_parallel_table_scan_size = 0;
+REINDEX INDEX pg_attribute_relid_attnum_index;	-- nailed and mapped
+REINDEX INDEX pg_amproc_fam_proc_index; 		-- nailed but not mapped
 -- Clean up
 RESET ROLE;
 REVOKE USAGE ON SCHEMA pg_toast FROM regress_reindexuser;
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index f3667bacdc..d9cb3f14f6 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1115,6 +1115,12 @@ SET SESSION ROLE regress_reindexuser;
 REINDEX TABLE pg_toast.pg_toast_1260;
 REINDEX INDEX pg_toast.pg_toast_1260_index;
 
+-- parallel reindex
+RESET ROLE;
+SET min_parallel_table_scan_size = 0;
+REINDEX INDEX pg_attribute_relid_attnum_index;	-- nailed and mapped
+REINDEX INDEX pg_amproc_fam_proc_index; 		-- nailed but not mapped
+
 -- Clean up
 RESET ROLE;

Re: Optimising compactify_tuples()

On Mon, Sep 7, 2020 at 7:48 PM David Rowley  wrote:
> I was wondering today if we could just get rid of the sort in
> compactify_tuples() completely. It seems to me that the existing sort
> is there just so that the memmove() is done in order of tuple at the
> end of the page first. We seem to be just shunting all the tuples to
> the end of the page so we need to sort the line items in reverse
> offset so as not to overwrite memory for other tuples during the copy.
>
> I wondered if we could get around that just by having another buffer
> somewhere and memcpy the tuples into that first then copy the tuples
> out that buffer back into the page. No need to worry about the order
> we do that in as there's no chance to overwrite memory belonging to
> other tuples.
>
> Doing that gives me 79.166 seconds in the same recovery test. Or about
> 46% faster, instead of 22% (I mistakenly wrote 28% yesterday)

Wow.

One thought is that if we're going to copy everything out and back in
again, we might want to consider doing it in a
memory-prefetcher-friendly order.  Would it be a good idea to
rearrange the tuples to match line pointer order, so that the copying
work and also later sequential scans are in a forward direction?  The
copying could also perhaps be done with single memcpy() for ranges of
adjacent tuples.  Another thought is that it might be possible to
identify some easy cases that it can handle with an alternative
in-place shifting algorithm without having to go to the
copy-out-and-back-in path.  For example, when the offset order already
matches line pointer order but some dead tuples just need to be
squeezed out by shifting ranges of adjacent tuples, and maybe some
slightly more complicated cases, but nothing requiring hard work like
sorting.

> (I don't want to derail the sort improvements here. I happen to think
> those are quite important improvements, so I'll continue to review
> that patch still.  Longer term, we might just end up with something
> slightly different for compactify_tuples)

Yeah.  Perhaps qsort specialisation needs to come back in a new thread
with a new use case.

Re: default partition and concurrent attach partition

On 2020-Sep-07, Alvaro Herrera wrote:

> Ah, it looks like we can get away with initializing the RRI to 0, and
> then explicitly handle that case in ExecPartitionCheckEmitError, as in
> the attached (which means reindenting, but I left it alone to make it
> easy to read).

Well, that was silly -- this seems fixable by mapping the tuple columns
prior to ExecPartitionCheck, which is achievable with something like

if (partidx == partdesc->boundinfo->default_index)
{
TupleTableSlot *tempslot;

if (dispatch->tupmap != NULL)
{
tempslot = 
MakeSingleTupleTableSlot(RelationGetDescr(rri->ri_RelationDesc),

   );
tempslot = 
execute_attr_map_slot(dispatch->tupmap, slot, tempslot);
}
else
tempslot = slot;
ExecPartitionCheck(rri, tempslot, estate, true);
if (dispatch->tupmap != NULL)
ExecDropSingleTupleTableSlot(tempslot);
}

though this exact incantation, apart from being pretty horrible, doesn't
actually work (other than to fix this specific test case -- it crashes
elsewhere.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: A micro-optimisation for walkdir()

Thomas Munro  writes:
> FWIW I just spotted a couple of very suspicious looking failures for
> build farm animal "walleye", a "MinGW64 8.1.0" system, that say:

walleye's been kind of unstable since the get-go, so I wouldn't put
too much faith in reports just from it.

regards, tom lane

Re: default partition and concurrent attach partition

Ah, it looks like we can get away with initializing the RRI to 0, and
then explicitly handle that case in ExecPartitionCheckEmitError, as in
the attached (which means reindenting, but I left it alone to make it
easy to read).  It kinda sucks because we don't report the tuple that
causes the error, but

a) it's a very unlikely case anyway
b) it's better than the bogus error message
c) it's better than some hypothetical crash
d) nobody uses partitioned default partitions anyway
e) nobody uses differing column order anyway

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 7289a0706d95aa621b9d6f626a836ac381fd4f61 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 7 Sep 2020 19:26:37 -0300
Subject: [PATCH] Avoid invalid RRI

---
 src/backend/executor/execMain.c  | 7 +++
 src/backend/executor/execPartition.c | 2 +-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4fdffad6f3..a0c3e56a03 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1840,6 +1840,12 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
 	 * back to the root table's rowtype so that val_desc in the error message
 	 * matches the input tuple.
 	 */
+	if (resultRelInfo->ri_RangeTableIndex == 0)
+	{
+		val_desc = NULL;
+	}
+	else
+	{
 	if (resultRelInfo->ri_PartitionRoot)
 	{
 		TupleDesc	old_tupdesc;
@@ -1874,6 +1880,7 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
 			 tupdesc,
 			 modifiedCols,
 			 64);
+	}
 	ereport(ERROR,
 			(errcode(ERRCODE_CHECK_VIOLATION),
 			 errmsg("new row for relation \"%s\" violates partition constraint",
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0ca1d34dfa..c1ef34d771 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -,7 +,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	{
 		ResultRelInfo *rri = makeNode(ResultRelInfo);
 
-		InitResultRelInfo(rri, rel, 1, proute->partition_root, 0);
+		InitResultRelInfo(rri, rel, 0, proute->partition_root, 0);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
-- 
2.20.1

Re: default partition and concurrent attach partition

On 2020-Sep-07, Alvaro Herrera wrote:

> Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
> which means that ExecBuildSlotValueDescription will misbehave if the
> partitioned default partition has a different column order than its
> parent.  That can be evidenced by changing the setup block in the
> isolation test as in the attached; and you'll get an undesirable error
> like this:
> 
> 2020-09-07 19:00:49.513 -03 [12981] ERROR:  attribute 2 of type record has 
> wrong type
> 2020-09-07 19:00:49.513 -03 [12981] DETAIL:  Table has type text, but query 
> expects integer.
> 2020-09-07 19:00:49.513 -03 [12981] STATEMENT:  insert into attach_tab values 
> (110, 'eleven and five twenties');

... and I sent before completing.  I'm not sure what a good fix for this
is.  We could try to initialize the resultRelInfo honestly, or we could
set ri_RangeTableIndex to some invalid value that will ... eh ... do
something else.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: default partition and concurrent attach partition

On 2020-Sep-04, Amit Langote wrote:

Hello

> FWIW, I still prefer "minimally valid ResultRelInfo" to "fake
> ResultRelInfo", because those aren't really fake, are they?  They are
> as valid as any other ResultRelInfo as far I can see.  I said
> "minimally valid" because a fully-valid partition ResultRelInfo is one
> made by ExecInitPartitionInfo(), which I avoided to use here thinking
> that would be overkill.

Well, they are fake in that the ri_RangeTableIndex they carry is bogus,
which means that ExecBuildSlotValueDescription will misbehave if the
partitioned default partition has a different column order than its
parent.  That can be evidenced by changing the setup block in the
isolation test as in the attached; and you'll get an undesirable error
like this:

2020-09-07 19:00:49.513 -03 [12981] ERROR:  attribute 2 of type record has 
wrong type
2020-09-07 19:00:49.513 -03 [12981] DETAIL:  Table has type text, but query 
expects integer.
2020-09-07 19:00:49.513 -03 [12981] STATEMENT:  insert into attach_tab values 
(110, 'eleven and five twenties');

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
# Verify that default partition constraint is enforced correctly
# in light of partitions being added concurrently to its parent
setup {
  drop table if exists attach_tab;
  create table attach_tab (i int, j text) partition by range (i);
  create table attach_tab_1 partition of attach_tab for values from (0) to 
(100);
  create table attach_tab_default (j text, i int) partition by range (i);
  alter table attach_tab attach partition attach_tab_default default;
  create table attach_tab_default_1 partition of attach_tab_default for values 
from (110) to (120);
  create table attach_tab_2 (like attach_tab);
  insert into attach_tab_2 values (111, 'ONE HUNDRED AND ELEVENTH BIRTHDAY');
}

session "s1"
step "s1b"  { begin; }
step "s1a"  { alter table attach_tab attach partition attach_tab_2 for 
values from (100) to (200); }
step "s1c"  { commit; }

session "s2"
step "s2b"  { begin; }
step "s2i"  { insert into attach_tab values (111, 'eleven and five 
twenties'); }
step "s2c"  { commit; }

teardown{ drop table if exists attach_tab, attach_tab_1, attach_tab_2; }

# insert into attach_tab by s2 which routes to attach_tab_default due to not 
seeing
# concurrently added attach_tab_2 should fail, because the partition constraint
# of attach_tab_default would have changed due to attach_tab_2 having been added
permutation "s1b" "s1a" "s2b" "s2i" "s1c" "s2c"

# reverse: now the insert into attach_tab_default by s2 occurs first followed by
# attach in s1, which should fail when it scans the default partitions to
# find the violating rows
permutation "s1b" "s2b" "s2i" "s1a" "s2c" "s1c"

Re: A micro-optimisation for walkdir()

On Tue, Sep 8, 2020 at 2:05 AM Juan José Santamaría Flecha
 wrote:
> On Mon, Sep 7, 2020 at 1:41 PM Thomas Munro  wrote:
>> Thanks for confirming.  I ran the Windows patch through pgindent,
>> fixed a small typo, and pushed.
>
> Great, thanks. Should we include something from this discussion as comments?

I dunno, it seems like this may be common knowledge for Windows people
and I was just being paranoid by asking for more info as a
non-Windowsian, but if you want to propose new comments or perhaps
just a pointer to one central place where we explain how that works, I
wouldn't be against that.

FWIW I just spotted a couple of very suspicious looking failures for
build farm animal "walleye", a "MinGW64 8.1.0" system, that say:

c:\\pgbuildfarm\\pgbuildroot\\HEAD\\pgsql.build\\src\\bin\\pg_upgrade>RMDIR
/s/q 
"c:\\pgbuildfarm\\pgbuildroot\\HEAD\\pgsql.build\\src\\bin\\pg_upgrade\\tmp_check\\data.old"
The directory is not empty.

Then I noticed it failed the same way 8 days ago, so I don't know
what's up with that but it looks like we didn't break it...

Re: [PATCH] audo-detect and use -moutline-atomics compilation flag for aarch64

"Zidenberg, Tsahi"  writes:
> Outline-atomics is a gcc compilation flag that adds runtime detection of 
> weather or not the cpu supports atomic instructions. CPUs that don't support 
> atomic instructions will use the old load-exclusive/store-exclusive 
> instructions. If a different compilation flag defined an architecture that 
> unconditionally supports atomic instructions (e.g. -march=armv8.2), the 
> outline-atomic flag will have no effect.

I wonder what version of gcc you intend this for.  AFAICS, older
gcc versions lack this flag at all, while newer ones have it on
by default.  Docs I can find on the net suggest that it would only
help to supply the flag when using gcc 10.0.x.  Is there a sufficient
population of production systems using such gcc releases to make it
worth expending configure cycles on?  (That's sort of a trick question,
because the GCC docs make it look like 10.0.x was never considered
to be production ready.)

regards, tom lane

Re: v13: show extended stats target in \d

2020-09-07 Thread Peter Geoghegan

On Sun, Sep 6, 2020 at 1:48 PM Justin Pryzby  wrote:
> On Sun, Sep 06, 2020 at 01:06:05PM -0700, Peter Geoghegan wrote:
> > How about a line break? That seems like a simple solution that takes
> > all the competing concerns into account.

> Like this ?

> Statistics objects:
> "public"."t" (ndistinct, dependencies, mcv) ON a, b FROM t

No, not like that.

ISTM that the problem with your original proposal is that it suggests
that it ought to be possible to add "STATISTICS 0" to the end of a
CREATE STATISTICS statement. That's not accurate, though. In reality
the only way to set the statistics target for a statistics object is
with ALTER STATISTICS. We should attempt to convey that difference
here.

More concretely, I was suggesting including a second line that's
similar to the following example output (iff the statistics target has
actually been set):

Statistics objects:
"public"."t" (ndistinct, dependencies, mcv) ON a, b FROM t
(STATISTICS 0)

Maybe some variation would be better -- the precise details are not
important. The new line conveys the fact that STATISTICS 0 is a part
of the object, but cannot appear in CREATE STATISTICS itself. That
seems like the right way of framing this.

-- 
Peter Geoghegan

Re: Disk-based hash aggregate's cost model

2020-09-07 Thread Jeff Davis

On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
> I've tested the costing changes on the simplified TPC-H query, on two
> different machines, and it seems like a clear improvement.

Thank you. Committed.

> So yeah, the patched costing is much closer to sort (from the point
> of
> this cost/duration metric), although for higher work_mem values
> there's
> still a clear gap where the hashing seems to be under-costed by a
> factor
> of ~2 or more.

There seems to be a cliff right after 4MB. Perhaps lookup costs on a
larger hash table?

Regards,
Jeff Davis

Re: Auto-vectorization speeds up multiplication of large-precision numerics

I wrote:
> I experimented with a few different ideas such as adding restrict
> decoration to the pointers, and eventually found that what works
> is to write the loop termination condition as "i2 < limit"
> rather than "i2 <= limit".  It took me a long time to think of
> trying that, because it seemed ridiculously stupid.  But it works.

I've done more testing and confirmed that both gcc and clang can
vectorize the improved loop on aarch64 as well as x86_64.  (clang's
results can be confusing because -ftree-vectorize doesn't seem to
have any effect: its vectorizer is on by default.  But if you use
-fno-vectorize it'll go back to the old, slower code.)

The only buildfarm effect I've noticed is that locust and
prairiedog, which are using nearly the same ancient gcc version,
complain

c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing is 
ignored when Auto Vectorization is used.

which is expected (they say the same for checksum.c), but then
there are a bunch of

warning: dereferencing type-punned pointer will break strict-aliasing rules

which seems worrisome.  (This sort of thing is the reason I'm
hesitant to apply higher optimization levels across the board.)
Both animals pass the regression tests anyway, but if any other
compilers treat -ftree-vectorize as an excuse to apply stricter
optimization assumptions, we could be in for trouble.

I looked closer and saw that all of those warnings are about
init_var(), and this change makes them go away:

-#define init_var(v)MemSetAligned(v, 0, sizeof(NumericVar))
+#define init_var(v)memset(v, 0, sizeof(NumericVar))

I'm a little inclined to commit that as future-proofing.  It's
essentially reversing out a micro-optimization I made in d72f6c750.
I doubt I had hard evidence that it made any noticeable difference;
and even if it did back then, modern compilers probably prefer the
memset approach.

regards, tom lane

Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Andres Freund

Hi,

On Mon, Sep 7, 2020, at 07:20, Konstantin Knizhnik wrote:
> >> And which pgbench database scale factor you have used?
> > 200
> >
> > Another thing you could try is to run 2-4 pgench instances in different
> > databases.
> I tried to reinitialize database with scale 200 but there was no 
> significant improvement in performance.

If you're replying to the last bit I am quoting, I was talking about having 
four databases with separate pbench tables etc. To see how much of it is 
procarray contention, and how much it is contention of common buffers etc.

> Attachments:
> * pgbench.svg

What numactl was used for this one?

Re: Inaccurate comment, for family-1 polymorphic UNKNOWN type arguments

Himanshu Upadhyaya  writes:
> I observed that we have inaccurate comment in
> enforce_generic_type_consistency.

Fair point, but that's not the only old comment that's not being careful
about it.  I applied the attached.

regards, tom lane

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 1b11cf731c..2ffe47026b 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -2155,8 +2155,8 @@ enforce_generic_type_consistency(const Oid *actual_arg_types,
 			else
 			{
 /*
- * Only way to get here is if all the polymorphic args have
- * UNKNOWN inputs
+ * Only way to get here is if all the family-1 polymorphic
+ * arguments have UNKNOWN inputs.
  */
 ereport(ERROR,
 		(errcode(ERRCODE_DATATYPE_MISMATCH),
@@ -2254,10 +2254,10 @@ enforce_generic_type_consistency(const Oid *actual_arg_types,
 			else
 			{
 /*
- * Only way to get here is if all the ANYCOMPATIBLE args have
- * UNKNOWN inputs.  Resolve to TEXT as select_common_type()
- * would do.  That doesn't license us to use TEXTRANGE,
- * though.
+ * Only way to get here is if all the family-2 polymorphic
+ * arguments have UNKNOWN inputs.  Resolve to TEXT as
+ * select_common_type() would do.  That doesn't license us to
+ * use TEXTRANGE, though.
  */
 anycompatible_typeid = TEXTOID;
 anycompatible_array_typeid = TEXTARRAYOID;
@@ -2269,7 +2269,7 @@ enforce_generic_type_consistency(const Oid *actual_arg_types,
 			}
 		}
 
-		/* replace polymorphic types by selected types */
+		/* replace family-2 polymorphic types by selected types */
 		for (int j = 0; j < nargs; j++)
 		{
 			Oid			decl_type = declared_arg_types[j];
@@ -2285,11 +2285,11 @@ enforce_generic_type_consistency(const Oid *actual_arg_types,
 	}
 
 	/*
-	 * If we had any UNKNOWN inputs for polymorphic arguments, re-scan to
-	 * assign correct types to them.
+	 * If we had any UNKNOWN inputs for family-1 polymorphic arguments,
+	 * re-scan to assign correct types to them.
 	 *
 	 * Note: we don't have to consider unknown inputs that were matched to
-	 * ANYCOMPATIBLE-family arguments, because we forcibly updated their
+	 * family-2 polymorphic arguments, because we forcibly updated their
 	 * declared_arg_types[] positions just above.
 	 */
 	if (have_poly_unknowns)

Meaning of pg_stats_subscription.latest_end_time

2020-09-07 Thread Simon.Schneider

Hi,

having just spoken with Harald Armin Massa of 2ndQuadrant, he encouraged me to 
ask my question here. This is about publisher/subscriber replication and was 
originally posted in german to pgsql-de-allgem...@lists.postgresql.org.

According to documentation, the field pg_stats_subscription.latest_end_time 
contains the "Time of last write-ahead log location reported to origin WAL 
sender" [1].
The view is built from the internal function pg_stat_get_subscription() [2, 
line 1097], defined in launcher.c, where we can see that the field is populated 
from a struct LogicalRepWorker. Write access to the relevant struct happens in 
worker.c from UpdateWorkerStats() but only if reply=true [3, line 1109], which 
in turn only happens whenever LogicalRepApplyLoop() deals with a message 
beginning with 'k' [3, line 1227]. This message is constructed by the publisher 
in walsender.c in WalSndKeepalive() [4, line 3435], where the corresponding 
value is always GetCurrentTimestamp().

As far as I understand, the WAL replication process has the publisher send 
messages with WAL slices (type 'w') and keepalive messages (type 'k'), to which 
the subscriber answers with a heartbeat. Other message types are not relevant 
here. If the documentation is correct, shouldn't latest_end_time use the 
creation time of the publisher's most recent WAL slice? On the other hand, 
wouldn't "time of last communication from WAL sender" be a more appropriate 
description of the actual behaviour? And in that case, why only consider 
'k'-messages instead of 'w'-messages that contain data?

Either way, I'm confused as to the meaning of 
pg_stats_subscription.latest_end_time. The documentation as is also lead to 
discussions in the office, because "reported to ... sender" is ambiguous as to 
who reports: It might mean some kind of feedback from the subscriber.


Please enlighten me, and please excuse the footer, which I can't turn off.

[1] 
https://www.postgresql.org/docs/12/monitoring-stats.html#PG-STAT-SUBSCRIPTION
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/logical/launcher.c;h=186057bd9327383b7dfdc2d2dda878290caa7967;hb=5060275aa8a1ead56e0a41308d7a43049a6cbe43
[3] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/logical/worker.c;h=790a28fc77e47c7024b8605448311aaa4325fb5b;hb=5060275aa8a1ead56e0a41308d7a43049a6cbe43
[4] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/replication/walsender.c;h=89fbb6fd92788fca66eeb93aaabed7f63dd940c2;hb=5060275aa8a1ead56e0a41308d7a43049a6cbe43


Im Auftrag
Simon Schneider

Stadt Köln - Die Oberbürgermeisterin
Berufsfeuerwehr, Amt für Feuerschutz, Rettungsdienst und Bevölkerungsschutz
Boltensternstr. 10
50735 Köln

Telefon: 0221/9748-9092
Telefax: 0221/9748-9004
E-Mail: simon.schnei...@stadt-koeln.de
Internet: https://www.stadt-koeln.de





Monatlich aktuelle Informationen Ihrer Stadtverwaltung in unserem Newsletter! 
Newsletter 
Anmeldung




[https://styleguide.bundesregierung.de/resource/blob/72496/1760346/6d7f611945ca42908c50804510c5335b/breg-vorschaubild-01-unterstuetzt-842x595px-jpg-srgb-v01-data.png]
[https://www.stadt-koeln.de/images/footer_wahlhelferaufruf2020.jpg]

Re: Improving connection scalability: GetSnapshotData()

2020-09-07 Thread Konstantin Knizhnik





On 06.09.2020 21:52, Andres Freund wrote:

Hi,

On 2020-09-05 16:58:31 +0300, Konstantin Knizhnik wrote:

On 04.09.2020 21:53, Andres Freund wrote:

I also used huge_pages=on / configured them on the OS level. Otherwise
TLB misses will be a significant factor.

As far as I understand there should not be no any TLB misses because size of
the shared buffers (8Mb) as several order of magnitude smaler that available
physical memory.

I assume you didn't mean 8MB but 8GB? If so, that's way large enough to
be bigger than the TLB, particularly across processes (IIRC there's no
optimization to keep shared mappings de-duplicated between processes
from the view of the TLB).



Sorry, certainly 8Gb.
I tried huge pages, but it has almost no effect/

Re: Auto-vectorization speeds up multiplication of large-precision numerics

Amit Khandekar  writes:
> On Mon, 7 Sep 2020 at 11:23, Tom Lane  wrote:
>> BTW, poking at this further, it seems that the patch only really
>> works for gcc.  clang accepts the -ftree-vectorize switch, but
>> looking at the generated asm shows that it does nothing useful.
>> Which is odd, because clang does do loop vectorization.

> Hmm, yeah that's unfortunate. My guess is that the compiler would do
> vectorization only if 'i' is a constant, which is not true for our
> case.

No, they claim to handle variable trip counts, per

https://llvm.org/docs/Vectorizers.html#loops-with-unknown-trip-count

I experimented with a few different ideas such as adding restrict
decoration to the pointers, and eventually found that what works
is to write the loop termination condition as "i2 < limit"
rather than "i2 <= limit".  It took me a long time to think of
trying that, because it seemed ridiculously stupid.  But it works.

regards, tom lane

Re: factorial function/phase out postfix operators?

2020-09-07 Thread John Naylor

On Thu, Sep 3, 2020 at 11:50 AM Mark Dilger
 wrote:
> [v7]

Ok, I've marked it ready for committer.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Yet another fast GiST build

2020-09-07 Thread Andrey M. Borodin




> 7 сент. 2020 г., в 19:10, Heikki Linnakangas  написал(а):
> 
> On 07/09/2020 13:59, Pavel Borisov wrote:
> I suppose there is a big jump in integer value (whether signed or
> unsigned) as you cross from positive to negative floats, and then the
> sort order is reversed.  I have no idea if either of those things is a
> problem worth fixing.  That made me wonder if there might also be an
>>> 
>>> I took a stab at fixing this, see attached patch (applies on top of your
>>> patch v14).
>>> 
>>> To evaluate this, I used the other attached patch to expose the zorder
>>> function to SQL, and plotted points around zero with gnuplot. See the
>>> attached two images, one with patch v14, and the other one with this patch.
>> 
>> I'd made testing of sorted SpGist build in cases of points distributed only
>> in 2d quadrant and points in all 4 quadrants and it appears that this
>> abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
>> is really nice way to avoid what can be avoided and I'd like it is included
>> together with Andrey's patch.
> 
> Thanks! Did you measure the quality of the built index somehow? The 
> ordering shouldn't make any difference to the build speed, but it 
> affects the shape of the resulting index and the speed of queries 
> against it.
I've tried to benchmark the difference between build time v14 and v15. v15 
seems to be slightly slower, but with negligible difference.

> I played with some simple queries like this:
> 
> explain (analyze, buffers) select count(*) from points_good where p <@ 
> box(point(50, 50), point(75, 75));
To observe IndexScan difference query should touch 4 quadrants. i.e. search 
within ((-25,-25),point(25,25))

> and looking at the "Buffers" line for how many pages were accessed. 
> There doesn't seem to be any consistent difference between v14 and my 
> fix. So I concur it doesn't seem to matter much.
> 
> 
> I played some more with plotting the curve. I wrote a little python 
> program to make an animation of it, and also simulated how the points 
> would be divided into pages, assuming that each GiST page can hold 200 
> tuples (I think the real number is around 150 with default page size). 
> In the animation, the leaf pages appear as rectangles as it walks 
> through the Z-order curve. This is just a simulation by splitting all 
> the points into batches of 200 and drawing a bounding box around each 
> batch. I haven't checked the actual pages as the GiST creates, but I 
> think this gives a good idea of how it works.
> The animation shows that there's quite a lot of overlap between the 
> pages. It's not necessarily this patch's business to try to improve 
> that, and the non-sorting index build isn't perfect either. But it 
> occurs to me that there's maybe one pretty simple trick we could do: 
> instead of blindly filling the leaf pages in Z-order, collect tuples 
> into a larger buffer, in Z-order. I'm thinking 32 pages worth of tuples, 
> or something in that ballpark, or maybe go all the way up to work_mem. 
> When the buffer fills up, call the picksplit code to divide the buffer 
> into the actual pages, and flush them to disk. If you look at the 
> animation and imagine that you would take a handful of pages in the 
> order they're created, and re-divide the points with the split 
> algorithm, there would be much less overlap.

Animation looks cool! It really pins the inefficiency of resulting MBRs.
But in R*-tree one of Beckman's points was that overlap optimisation worth 
doing on higher levels, not lower.
But we can do this for splits on each level, I think. We do not know tree depth 
in advance to divide maintenance workmem among level.. But, probably we don't 
need to, let's allocate half to first level, quarter to second, 1/8 to third 
etc until it's one page. Should we take allocations inside picksplit() into 
account?
The more I think about it the cooler idea seem to me.

BTW I've found one more bug in the patch: it writes WAL even for unlogged 
tables. I'm not sending a patch because changes are trivial and currently we 
already have lengthy patchset in different messages.
Also, to avoid critical section we can use log_new_page() instead of 
log_buffer().


Thanks!

Best regards, Andrey Borodin.

Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

Michael Paquier  writes:
> Is there still something that needs to absolutely be done here knowing
> that we have bab1500 that got rid of the root issue?  Can the CF entry
> be marked as committed?

I think there is agreement that we're not going to change
cancel_before_shmem_exit's restriction to only allow LIFO popping.
So we should improve its comment to explain why.  The other thing
that seems legitimately on-the-table for this CF entry is whether
we should change cancel_before_shmem_exit to complain, rather than
silently do nothing, if it fails to pop the stack.  Bharath's
last patchset proposed to add an elog(DEBUG3) complaint, which
seems to me to be just about entirely useless.  I'd make it an
ERROR, or maybe an Assert.

regards, tom lane

Re: [doc] minor wording improvement in a couple of places

2020-09-07 Thread John Naylor

On Mon, Sep 7, 2020 at 6:30 AM Magnus Hagander  wrote:
>
> On Mon, Sep 7, 2020 at 6:34 AM Ian Barwick  
> wrote:
>>"..., which see for additional details."
>>
>> which strikes me as a bit odd. Suggested phrasing:
>>
>>"...; see this file for additional details."
>>
>
> I agree this sounds very weird. However, this phrasing is also fairly common 
> in comments around the code.

Seems to be a literal translation of Latin "quod vide" (q.v.), so
intentional, just sounds a bit old fashioned. I'd vote to keep it.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Change a constraint's index - ALTER TABLE ... ALTER CONSTRAINT ... USING INDEX ...

On 2020-Sep-07, Laurenz Albe wrote:

> This patch would provide a more convenient way to do that.
> 
> Again, I am not sure if that justifies the effort.

I have to admit I've seen cases where it'd be useful to have included
columns in primary keys.

TBH I think if we really wanted the feature of primary keys with
included columns, we'd have to add it to the PRIMARY KEY syntax rather
than having an ad-hoc ALTER TABLE ALTER CONSTRAINT USING INDEX command
to replace the index underneath.  Then things like pg_dump would work
normally.

(I have an answer for the information_schema question Tom posed; I'd
like to know what's yours.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Transactions involving multiple postgres foreign servers, take 2





On 2020/09/07 17:59, Fujii Masao wrote:



On 2020/08/21 15:25, Masahiko Sawada wrote:

On Fri, 21 Aug 2020 at 00:36, Fujii Masao  wrote:




On 2020/07/27 15:59, Masahiko Sawada wrote:

On Thu, 23 Jul 2020 at 22:51, Muhammad Usama  wrote:




On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada 
 wrote:


On Sat, 18 Jul 2020 at 01:55, Fujii Masao  wrote:




On 2020/07/16 14:47, Masahiko Sawada wrote:

On Tue, 14 Jul 2020 at 11:19, Fujii Masao  wrote:




On 2020/07/14 9:08, Masahiro Ikeda wrote:

I've attached the latest version patches. I've incorporated the review
comments I got so far and improved locking strategy.


Thanks for updating the patch!


+1
I'm interested in these patches and now studying them. While checking
the behaviors of the patched PostgreSQL, I got three comments.


Thank you for testing this patch!



1. We can access to the foreign table even during recovery in the HEAD.
But in the patched version, when I did that, I got the following error.
Is this intentional?

ERROR:  cannot assign TransactionIds during recovery


No, it should be fixed. I'm going to fix this by not collecting
participants for atomic commit during recovery.


Thanks for trying to fix the issues!

I'd like to report one more issue. When I started new transaction
in the local server, executed INSERT in the remote server via
postgres_fdw and then quit psql, I got the following assertion failure.

TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
0   postgres    0x00010d52f3c0 ExceptionalCondition 
+ 160
1   postgres    0x00010cefbc49 
ForgetAllFdwXactParticipants + 313
2   postgres    0x00010cefff14 AtProcExit_FdwXact + 
20
3   postgres    0x00010d313fe3 shmem_exit + 179
4   postgres    0x00010d313e7a proc_exit_prepare + 
122
5   postgres    0x00010d313da3 proc_exit + 19
6   postgres    0x00010d35112f PostgresMain + 3711
7   postgres    0x00010d27bb3a BackendRun + 570
8   postgres    0x00010d27af6b BackendStartup + 475
9   postgres    0x00010d279ed1 ServerLoop + 593
10  postgres    0x00010d277940 PostmasterMain + 6016
11  postgres    0x00010d1597b9 main + 761
12  libdyld.dylib   0x7fff7161e3d5 start + 1
13  ??? 0x0003 0x0 + 3



Thank you for reporting the issue!

I've attached the latest version patch that incorporated all comments
I got so far. I've removed the patch adding the 'prefer' mode of
foreign_twophase_commit to keep the patch set simple.



I have started to review the patchset. Just a quick comment.

Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
contains changes (adding fdwxact includes) for
src/backend/executor/nodeForeignscan.c,  src/backend/executor/nodeModifyTable.c
and  src/backend/executor/execPartition.c files that doesn't seem to be
required with the latest version.


Thanks for your comment.

Right. I've removed these changes on the local branch.


The latest patches failed to be applied to the master branch. Could you rebase 
the patches?



Thank you for letting me know. I've attached the latest version patch set.


Thanks for updating the patch!

IMO it's not easy to commit this 2PC patch at once because it's still large
and complicated. So I'm thinking it's better to separate the feature into
several parts and commit them gradually. What about separating
the feature into the following parts?

#1
Originally the server just executed xact callback that each FDW registered
when the transaction was committed. The patch changes this so that
the server manages the participants of FDW in the transaction and triggers
them to execute COMMIT or ROLLBACK. IMO this change can be applied
without 2PC feature. Thought?

Even if we commit this patch and add new interface for FDW, we would
need to keep the old interface, for the FDW providing only old interface.


#2
Originally when there was the FDW access in the transaction,
PREPARE TRANSACTION on that transaction failed with an error. The patch
allows PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED
even when FDW access occurs in the transaction. IMO this change can be
applied without *automatic* 2PC feature (i.e., PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED are automatically executed for each FDW
inside "top" COMMIT command). Thought?

I'm not sure yet whether automatic resolution of "unresolved" prepared
transactions by the resolver process is necessary for this change or not.
If it's not necessary, it's better to exclude the resolver process from this
change, at this stage, to make the patch simpler.


#3
Finally IMO we can provide the patch supporting "automatic" 2PC for each FDW,
based on the #1 and #2

Re: Change a constraint's index - ALTER TABLE ... ALTER CONSTRAINT ... USING INDEX ...

2020-09-07 Thread Laurenz Albe

On Fri, 2020-09-04 at 13:31 -0400, Alvaro Herrera wrote:
> On 2020-Sep-04, Laurenz Albe wrote:
> > On Fri, 2020-09-04 at 10:41 -0400, Alvaro Herrera wrote:
> > > > The value I see in this is:
> > > > - replacing a primary key index
> > > > - replacing the index behind a constraint targeted by a foreign key
> > > But why is this better than using REINDEX CONCURRENTLY?
> > It is not better, but it can be used to replace a constraint index
> > with an index with a different INCLUDE clause, which is something
> > that cannot easily be done otherwise.
> 
> 
> I can see that there is value in having an index that serves both a
> uniqueness constraint and coverage purposes.  But this seems a pretty
> roundabout way to get that -- I think you should have to do "CREATE
> UNIQUE INDEX ... INCLUDING ..." instead.  That way, the fact that this
> is a Postgres extension remains clear.
> 
> 55432 14devel 24138=# create table foo (a int not null, b int not null, c 
> int);
> CREATE TABLE
> Duración: 1,775 ms
> 55432 14devel 24138=# create unique index on foo (a, b) include (c);
> CREATE INDEX
> Duración: 1,481 ms
> 55432 14devel 24138=# create table bar (a int not null, b int not null, 
> foreign key (a, b) references foo (a, b));
>
> CREATE TABLE
> Duración: 2,559 ms
> 
> Now you have a normal index that you can reindex in the normal way, if you 
> need
> it.

Yes, that is true.

But what if you have done

  CREATE TABLE a (id bigint CONSTRAINT a_pkey PRIMARY KEY, val integer);
  CREATE TABLE b (id bigint CONSTRAINT b_fkey REFERENCES a);

and later you figure out later that it would actually be better to have
an index ON mytab (id) INCLUDE (val), and you don't want to maintain
two indexes.

Yes, you could do

  CREATE UNIQUE INDEX CONCURRENTLY ind ON a (id) INCLUDE (val);
  ALTER TABLE a ADD UNIQUE USING INDEX ind;
  ALTER TABLE a DROP CONSTRAINT a_pkey CASCADE;
  ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a(id);

but then you don't have a primary key, and you have to live without
the foreign key for a while.

Adding a primary key to a large table is very painful, because it
locks the table exclusively for a long time.


This patch would provide a more convenient way to do that.

Again, I am not sure if that justifies the effort.

Yours,
Laurenz Albe

Re: 回复：how to create index concurrently on partitioned table

2020-09-07 Thread Anastasia Lubennikova


On 07.09.2020 04:58, Michael Paquier wrote:

On Fri, Sep 04, 2020 at 09:51:13AM +0900, Michael Paquier wrote:

Glad to hear that, please take the time you need.

Attached is a rebased version to address the various conflicts after
844c05a.
--
Michael


Thank you.

With the fix for the cycle in reindex_partitions() and new comments 
added, I think this patch is ready for commit.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: A micro-optimisation for walkdir()

2020-09-07 Thread Juan José Santamaría Flecha

On Mon, Sep 7, 2020 at 1:41 PM Thomas Munro  wrote:

> On Mon, Sep 7, 2020 at 9:42 PM Magnus Hagander 
> wrote:
> > On Mon, Sep 7, 2020 at 12:27 AM Thomas Munro 
> wrote:
> >> I think the following is a little mysterious, but it does seem to be
> >> what people do for this in other projects.  It is the documented way
> >> to detect mount points, and I guess IO_REPARSE_TAG_MOUNT_POINT is
> >> either overloaded also for junctions, or junctions are the same thing
> >> as mount points.  It would be nice to see a Win32 documentation page
> >> that explicitly said that.
> >
> > The wikipedia page on it is actually fairly decent:
> https://en.wikipedia.org/wiki/NTFS_reparse_point. It's not the
> documentation of course, but it's easier to read :) The core difference is
> whether you mount a whole filesystem (mount point) or just a directory off
> something mounted elsehwere (junction).
> >
> > And yes, the wikipedia page confirms that junctions also use
> IO_REPARSE_TAG_MOUNT_POINT.
>
> Thanks for confirming.  I ran the Windows patch through pgindent,
> fixed a small typo, and pushed.
>

Great, thanks. Should we include something from this discussion as comments?

Regards,

Juan José Santamaría Flecha

Re: shared-memory based stats collector

2020-09-07 Thread Stephen Frost

Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > The commit 8e19a82640 changed SIGQUIT handler of almost all processes
> > not to run atexit callbacks for safety. Archiver process should behave
> > the same way for the same reason. Exit status changes 1 to 2 but that
> > doesn't make any behavioral change.
> 
> Shouldn't this:
> 
> a) be back-patched, as the other change was
> b) also include a change to have the stats collector (which I realize is
>removed later on in this patch set, but we're talking about fixing
>existing things..) for the same reason, and because there isn't much
>point in trying to write out the stats after we get a SIGQUIT, since
>we're just going to blow them away again since we're going to go
>through crash recovery..?

* Michael Paquier (mich...@paquier.xyz) wrote:
> 0001 is just a piece of refactoring, so I see no strong argument in
> favor of a backpatch, IMHO.

No, 0001 changes the SIGQUIT handler for the archiver process, which is
what 8e19a82640 was about changing for a bunch of the other subprocesses
and which was back-patched all the way, so I'm a bit confused about why
you're saying it's just refactoring..?

Note that exit() and _exit() aren't the same, and the latter is what's
now being used in SignalHandlerForCrashExit.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: PG 13 release notes, first draft

2020-09-07 Thread Justin Pryzby

There's some obvious improvements for which I include a patch.

>Add alternate version of jsonb_setI() with special NULL handling (Andrew 
>Dunstan)
>The new function, jsonb_set_lax(), allows null new values to either set the 
>specified key to JSON null, delete the key, raise exception, or ignore the 
>operation. 
jsonb_set()
raise *an* exception
new null values?

>IS 'return_target' CLEAR?
I haven't used these before, but following the examples, it seems to return the
"target" argument, unchanged.

| Add function min_scale() that returns the number of digits to the right the 
decimal point that is required to represent the numeric value with full 
precision (Pavel Stehule)
right *of*
that *are* required ?

|The old function names were kept for backward compatibility. DO WE HAVE NEW 
NAMES?

=> I think the docs are clear:
> In releases of PostgreSQL before 13 there was no xid8 type, so variants of 
> these functions were provided that used bigint to represent a 64-bit XID, 
> with a correspondingly distinct snapshot data type txid_snapshot. These older 
> functions have txid in their names. They are still supported for backward 
> compatibility, but may be removed from a future release. See Table 9.76

> This improves performance for queries that access many object. The internal 
> List API has also been improved.
many objects*

|Allow skipping of WAL for full table writes if wal_level is minimal (Kyotaro 
Horiguchi)
Allow WAL writes to be skipped...
And: this is not related to full_page_writes.

| Enable Unix-domain sockets support on Windows (Peter Eisentraut)
Enable support for ..

| Improve the performance when replaying DROP DATABASE commands when many 
tablespaces are in use (Fujii Masao)
s/the//

|Allow a sample of statements to be logged (Adrien Nayrat)
Allow logging a sample of statements

|Allow WAL receivers use a temporary replication slot if a permanent one is not 
specified (Peter Eisentraut, Sergei Kornilov)
*to* use

| Add leader_pid to pg_stat_activity to report parallel worker ownership 
(Julien Rouhaud)
s/ownership/leader/ ?

|Allow WAL recovery to continue even if invalid pages are referenced (Fujii 
Masao)
remove "WAL" or say:
>Allow recovery to continue even if invalid pages are referenced by WAL (Fujii 
>Masao)








A few things I have't fixed in my patch:

|Previously, this value was adjusted before effecting the number of concurrent 
requests. This value is now used directly. Conversion of old values to new ones 
can be done using:
|SELECT round(sum(OLD / n::float)) FROM generate_series(1, OLD) s(n);

I think the round() should be aliased, "AS new".
I think "before effecting" is confusing, maybe say:
| Previously, the effective value was computed internally from the 
user-supplied parameter...

|Allow partitioned tables to be logically replicated via publications (Amit 
Langote)
|Previously, partitions had to be replicated individually. Now partitioned 
tables can be published explicitly causing all partitions to be automatically 
published. Addition/removal of partitions from partitioned tables are 
automatically added/removed from publications. The CREATE PUBLICATION option 
publish_via_partition_root controls whether changes to partitions are published 
as their own or their ancestor's.

=> "causing any future partitions to be automatically published".

"addition/removal .. are automatically" isn't right

|Implement incremental sorting (James Coleman, Alexander Korotkov, Tomas Vondra)
|If a result is already sorted by several leading keys, this allows for batch 
sorting of additional trailing keys because the previous keys are already 
equal. This is controlled by enable_incremental_sort.

s/several/one or more/
remove "additional" ?
remove "batch" ?
maybe "of ONLY trailing keys"

|Allow inserts to trigger autovacuum activity (Laurenz Albe, Darafei 
Praliaskouski)
|This new behavior reduces the work necessary when the table needs to be frozen 
and allows pages to be set as all-visible. All-visible pages allow index-only 
scans to access fewer heap rows.

There's a lot of "allow" here, but it sounds like relaxing a restriction when
actually this is a new functionality.  Maybe:
| Allow autovacuum to be triggered by INSERTs.
| ..and allows autovacuum to set pages as all-visible.





I already mentioned a couple things back in May that still stand out;:

|Add jsonpath .datetime() method (Nikita Glukhov, Teodor Sigaev, Oleg Bartunov, 
Alexander Korotkov)
|This allows json values to be converted to timestamps, which can then be 
processed in jsonpath expressions. This also adds jsonpath functions that 
support time zone-aware output.
timezone-aware or time-zone-aware, if you must.

> Allow vacuum commands run by vacuumdb to operate in parallel mode (Masahiko 
> Sawada)
=> I think this is still going to be lost/misunderstood/confuse some people.
vacuumdb already supports -j.  This allows it to run vacuum(parallel N).  So
maybe say "...to process indexes in parallel".

--

Re: [HACKERS] logical decoding of two-phase transactions

On Mon, Sep 7, 2020 at 10:54 AM Ajin Cherian  wrote:
>
> Hello,
>
> Trying to revive this patch which attempts to support logical decoding of two 
> phase transactions. I've rebased and polished Nikhil's patch on the current 
> HEAD. Some of the logic in the previous patchset has already been committed 
> as part of large-in-progress transaction commits, like the handling of 
> concurrent aborts, so all that logic has been left out.
>

I am not sure about your point related to concurrent aborts. I think
we need some changes related to this patch. Have you tried to test
this behavior? Basically, we have the below code in
ReorderBufferProcessTXN() which will be hit for concurrent aborts, and
currently, the Asserts shown below will fail.

if (errdata->sqlerrcode == ERRCODE_TRANSACTION_ROLLBACK)
{
/*
* This error can only occur when we are sending the data in
* streaming mode and the streaming is not finished yet.
*/
Assert(streaming);
Assert(stream_started);

Nikhil has a test for the same
(0004-Teach-test_decoding-plugin-to-work-with-2PC.Jan4) in his last
email [1]. You might want to use it to test this behavior. I think you
can also keep the tests as a separate patch as Nikhil had.

One other comment:
===
@@ -27,6 +27,7 @@ typedef struct OutputPluginOptions
 {
  OutputPluginOutputType output_type;
  bool receive_rewrites;
+ bool enable_twophase;
 } OutputPluginOptions;
..
..
@@ -684,6 +699,33 @@ startup_cb_wrapper(LogicalDecodingContext *ctx,
OutputPluginOptions *opt, bool i
  /* do the actual work: call callback */
  ctx->callbacks.startup_cb(ctx, opt, is_init);

+ /*
+ * If the plugin claims to support two-phase transactions, then
+ * check that the plugin implements all callbacks necessary to decode
+ * two-phase transactions - we either have to have all of them or none.
+ * The filter_prepare callback is optional, but can only be defined when
+ * two-phase decoding is enabled (i.e. the three other callbacks are
+ * defined).
+ */
+ if (opt->enable_twophase)
+ {
+ int twophase_callbacks = (ctx->callbacks.prepare_cb != NULL) +
+ (ctx->callbacks.commit_prepared_cb != NULL) +
+ (ctx->callbacks.abort_prepared_cb != NULL);
+
+ /* Plugins with incorrect number of two-phase callbacks are broken. */
+ if ((twophase_callbacks != 3) && (twophase_callbacks != 0))
+ ereport(ERROR,
+ (errmsg("Output plugin registered only %d twophase callbacks. ",
+ twophase_callbacks)));
+ }

I don't know why the patch has used this way to implement an option to
enable two-phase. Can't we use how we implement 'stream-changes'
option in commit 7259736a6e? Just refer how we set ctx->streaming and
you can use a similar way to set this parameter.

-- 
With Regards,
Amit Kapila.

Re: Strange behavior with polygon and NaN

2020-09-07 Thread gkokolatos






‐‐‐ Original Message ‐‐‐
On Thursday, 27 August 2020 14:24, Kyotaro Horiguchi  
wrote:

> At Wed, 26 Aug 2020 08:18:49 -0400, Tom Lane t...@sss.pgh.pa.us wrote in
>
> > Kyotaro Horiguchi horikyota@gmail.com writes:
> >
> > > At Tue, 25 Aug 2020 19:03:50 -0400, Bruce Momjian br...@momjian.us wrote 
> > > in
> > >
> > > > I can confirm that this two-month old email report still produces
> > > > different results with indexes on/off in git master, which I don't think
> > > > is ever correct behavior.
> >
> > > I agree to that the behavior is broken.
> >
> > Yeah, but ... what is "non broken" in this case? I'm not convinced
> > that having point_inside() return zero for any case involving NaN
> > is going to lead to noticeably saner behavior than today.
>
> Yes, just doing that leaves many unfixed behavior come from NaNs, but
> at least it seems to me one of sane definition candidates that a point
> cannot be inside a polygon when NaN is involved. It's similar to
> Fpxx() returns false if NaN is involved. As mentioned, I had't fully
> checked and haven't considered this seriously, but I changed my mind
> to check all the callers. I started checking all the callers of
> point_inside, then finally I had to check all functions in geo_ops.c:(
>

For what is worth, I agree with this definition.


> The attached is the result as of now.
>
> === Resulting behavior
>
> With the attached:
>
> -   All boolean functions return false if NaN is involved.
> -   All float8 functions return NaN if NaN is involved.
> -   All geometric arithmetics return NaN as output if NaN is involved.

Agreed! As in both this behaviour conforms to the definition above and the 
patch provides this behaviour with the exceptions below.

>
> With some exceptions:
>
> -   line_eq: needs to consider that NaNs are equal each other.
> -   point_eq/ne (point_eq_pint): ditto
> -   lseg_eq/ne: ditto
>
> The change makes some difference in the regression test.
> For example,
>
>  <->  changed from 0 to NaN. (distance)
>
>
>  <@  changed from true to false. (contained)
>  <->  changed from 0 to NaN. (distance)
>  ?#  changed from true to false (overlaps)
>
> === pg_hypot mistake?
>
> I noticed that pg_hypot returns inf for the parameters (NaN, Inf) but
> I think NaN is appropriate here since other operators behaves that
> way. This change causes a change of distance between point(1e+300,Inf)
> and line (1,-1,0) from infinity to NaN, which I think is correct
> because the arithmetic generates NaN as an intermediate value.
>
> === Infinity annoyances
>
> Infinity makes some not-great changes in regresssion results. For example:
>
> -   point '(1e+300,Infinity)' <-> path '((10,20))' returns
> NaN(previously Infinity), but point '(1e+300,Infinity)' <-> path
> '[(1,2),(3,4)]' returns Infinity. The difference of the two
> expressions is whether (0 * Inf = NaN) is performed or not. The
> former performs it but that was concealed by not propagating NaN to
> upper layer without the patch.

Although I understand the reasoning for this change. I am not certain I agree 
with the result. I feel that:
point '(1e+300,Infinity)' <-> path '((10,20))'
should return Infinity. Even if I am wrong to think that, the two results 
should be expected to behave the same. Am I wrong on that too?


>
> -   Without the patch, point '(1e+300,Infinity)' ## box '(2,2),(0,0)'
> generates '(0,2)', which is utterly wrong. It is because
> box_closept_point evaluates float8_lt(Inf, NaN) as true(!) and sets
> the wrong point for distance=NaN is set. With the patch, the NaN
> makes the result NULL.

Agreed.

>
> -   This is not a difference caused by this patch, but for both patched
> and unpatched, point '(1e+300,Inf)' <-> line '{3,0,0}' returns NaN,
> which should be 1e+300. However, the behavior comes from arithmetic
> reasons and wouldn't be a problem.
>
> create_index.out is changed since point(NaN,NaN) <@ polygon changed
> from true to false, which seems rather saner.
>
> I haven't checked unchanged results but at least changed results seems
> saner to me.

All in all a great patch!

It is clean, well reasoned and carefully crafted.

Do you think that the documentation needs to get updated to the 'new' behaviour?


//Georgios


>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>

From 432a671517e78061edc87d18aec291f5629fcbe6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 27 Aug 2020 14:49:21 +0900
Subject: [PATCH v1] Fix NaN handling of some geometric operators and functions

Some geometric operators shows somewhat odd behavior comes from NaN
handling and that leads to inconsistency between index scan and seq
scan on geometric conditions.

For example:
  point '(NaN,NaN)' <-> polygon '((0,0),(0,1),(1,1))' => 0, not NaN
  point '(0.3,0.5)' <-> polygon '((0,0),(0,NaN),(1,1))' => 1.14, not NaN
  point '(NaN,NaN)' <@ polygon

Re: proposal: possibility to read dumped table's name from file

2020-09-07 Thread Surafel Temesgen

Hi Pavel

On Fri, Sep 4, 2020 at 6:22 AM Pavel Stehule 
wrote:

>
> Here is updated patch for pg_dump
>
>
pg_dumpall also has –exclude-database=pattern and –no-comments option
doesn't that qualify it to benefits from this feature? And please add a
test case for this option

regards

Surafel

Inaccurate comment, for family-1 polymorphic UNKNOWN type arguments

2020-09-07 Thread Himanshu Upadhyaya

Hi Hackers,

I observed that we have inaccurate comment in
enforce_generic_type_consistency.

 if (!OidIsValid(elem_typeid))
{
if (allow_poly)
{
elem_typeid = ANYELEMENTOID;
array_typeid = ANYARRAYOID;
range_typeid = ANYRANGEOID;
}
else
{
/*
 * Only way to get here is if all the
polymorphic args have
 * UNKNOWN inputs
 */
*ereport(ERROR, *
  ...
 }

}
We reach the error condition even if there is any "anycompatible" parameter
is present (and that is of some known type say int32).
I think developer intend to report error if "we have all the family-1
polymorphic arguments as UNKNOWN".

Thoughts?

Please find attached the patch to fix this typo.

Thanks,
Himanshu
EnterpriseDB: http://www.enterprisedb.com
From 361b8aac87fb3d4e4c968555459f1cedec489440 Mon Sep 17 00:00:00 2001
From: Himanshu Upadhyaya 
Date: Mon, 7 Sep 2020 16:48:22 +0530
Subject: [PATCH v1] FIX- for incorrect comment for UNKNOWN input type for
 polymorphic arguments

---
 src/backend/parser/parse_coerce.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/parser/parse_coerce.c b/src/backend/parser/parse_coerce.c
index 1b11cf731c..36d6483c24 100644
--- a/src/backend/parser/parse_coerce.c
+++ b/src/backend/parser/parse_coerce.c
@@ -2155,7 +2155,7 @@ enforce_generic_type_consistency(const Oid *actual_arg_types,
 			else
 			{
 /*
- * Only way to get here is if all the polymorphic args have
+ * Only way to get here is if all the family-1 polymorphic args have
  * UNKNOWN inputs
  */
 ereport(ERROR,
-- 
2.25.1

Re: A micro-optimisation for walkdir()

On Mon, Sep 7, 2020 at 9:42 PM Magnus Hagander  wrote:
> On Mon, Sep 7, 2020 at 12:27 AM Thomas Munro  wrote:
>> I think the following is a little mysterious, but it does seem to be
>> what people do for this in other projects.  It is the documented way
>> to detect mount points, and I guess IO_REPARSE_TAG_MOUNT_POINT is
>> either overloaded also for junctions, or junctions are the same thing
>> as mount points.  It would be nice to see a Win32 documentation page
>> that explicitly said that.
>
> The wikipedia page on it is actually fairly decent: 
> https://en.wikipedia.org/wiki/NTFS_reparse_point. It's not the documentation 
> of course, but it's easier to read :) The core difference is whether you 
> mount a whole filesystem (mount point) or just a directory off something 
> mounted elsehwere (junction).
>
> And yes, the wikipedia page confirms that junctions also use 
> IO_REPARSE_TAG_MOUNT_POINT.

Thanks for confirming.  I ran the Windows patch through pgindent,
fixed a small typo, and pushed.

Re: Ideas about a better API for postgres_fdw remote estimates

2020-09-07 Thread Andrey V. Lepikhov

On 9/4/20 6:23 PM, Ashutosh Bapat wrote:

On Thu, 3 Sep 2020 at 10:44, Andrey V. Lepikhov 
mailto:a.lepik...@postgrespro.ru>> wrote:

On 8/31/20 6:19 PM, Ashutosh Bapat wrote:
 > On Mon, Aug 31, 2020 at 3:36 PM Andrey V. Lepikhov
 > mailto:a.lepik...@postgrespro.ru>> wrote:
 >>
 >> Thanks for this helpful feedback.
 > I think the patch has some other problems like it works only for
 > regular tables on foreign server but a foreign table can be pointing
 > to any relation like a materialized view, partitioned table or a
 > foreign table on the foreign server all of which have statistics
 > associated with them. I didn't look closely but it does not consider
 > that the foreign table may not have all the columns from the relation
 > on the foreign server or may have different names. But I think those
 > problems are kind of secondary. We have to agree on the design first.
 >
In accordance with discussion i made some changes in the patch:
1. The extract statistic routine moved into the core.

Bulk of the patch implements the statistics conversion to and fro json 
format. I am still not sure whether we need all of that code here.

Yes, i'm sure we'll replace it with something.

Right now, i want to discuss format of statistics dump. Remind, that a 
statistics dump is needed not only for fdw, but it need for the pg_dump. 
And in the dump will be placed something like this:

'SELECT store_relation_statistics(rel, serialized_stat)'

my reasons for using JSON:
* it have conversion infrastructure like json_build_object()
* this is flexible readable format, that can be useful in text dumps of 
relations.

Can we re-use pg_stats view? That is converting some of the OIDs to names. I 
agree with anyarray but if that's a problem here it's also a problem for 
pg_stats view, isn't it?
Right now, I don't know if it is possible to unambiguously convert the 
pg_stats information to a pg_statistic tuple.

If we can reduce the stats handling code to a 
minimum or use it for some other purpose as well e.g. pg_stats 
enhancement, the code changes required will be far less compared to the 
value that this patch provides.

+1

--
regards,
Andrey Lepikhov
Postgres Professional

Logical Replication - detail message with names of missing columns

2020-09-07 Thread Bharath Rupireddy

Hi,

I observed that, in logical replication when a subscriber is missing
some columns, it currently emits an error message that says "some"
columns are missing(see logicalrep_rel_open()), but it doesn't specify
what the missing column names are. The comment there also says that
finding the missing column names is a todo item(/* TODO, detail
message with names of missing columns */).

I propose a patch to find the missing columns on the subscriber
relation using the publisher relation columns and show them in the
error message which can make the error more informative to the user.

Here's a snapshot how the error looks with the patch:
2020-09-04 01:00:36.721 PDT [843128] ERROR:  logical replication
target relation "public.test_1" is missing "b1, d1" replicated columns
2020-09-04 01:00:46.784 PDT [843132] ERROR:  logical replication
target relation "public.test_1" is missing "b1" replicated columns
2020-09-06 21:24:53.645 PDT [902945] ERROR:  logical replication
target relation "public.test_1" is missing "a1, c1, d1, b1" replicated
columns

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
From e923dc6ddfa9b41c70f351ba6ad7dc8cfa8dbde3 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 4 Sep 2020 21:10:52 -0700
Subject: [PATCH v1] Detail message with names of missing columns in logical
 replication

In logical replication when a subscriber is missing some columns,
it currently emits an error message that says "some" columns are
missing(see logicalrep_rel_open()), but it doesn't specify what
the missing column names are. The comment there also says that
finding the missing column names is a todo item(/* TODO, detail
message with names of missing columns */).

This patch finds the missing columns on the subscriber relation
using the publisher relation columns and show them in the error
message which makes error to be more informative to the user.
---
 src/backend/replication/logical/relation.c | 50 --
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/relation.c b/src/backend/replication/logical/relation.c
index a60c73d74d..d34ea3ce39 100644
--- a/src/backend/replication/logical/relation.c
+++ b/src/backend/replication/logical/relation.c
@@ -227,6 +227,44 @@ logicalrep_rel_att_by_name(LogicalRepRelation *remoterel, const char *attname)
 	return -1;
 }
 
+/*
+ * Finds the attributes that are missing in the local relation
+ * compared to remote relation.
+ */
+static void
+logicalrep_find_missing_atts(Relation localrel,
+			 LogicalRepRelation *remoterel,
+			 StringInfo missingatts)
+{
+	int			i;
+	int j;
+	TupleDesc	desc;
+
+	desc = RelationGetDescr(localrel);
+
+	for (i = 0; i < remoterel->natts; i++)
+	{
+		bool found = false;
+		for (j = 0; j < desc->natts; j++)
+		{
+			Form_pg_attribute attr = TupleDescAttr(desc, j);
+			if (strcmp(remoterel->attnames[i], NameStr(attr->attname)) == 0)
+			{
+found = true;
+break;
+			}
+		}
+
+		if (found == false)
+		{
+			if (missingatts->len == 0)
+appendStringInfoString(missingatts, remoterel->attnames[i]);
+			else if (missingatts->len > 0)
+appendStringInfo(missingatts, ", %s", remoterel->attnames[i]);
+		}
+	}
+}
+
 /*
  * Open the local relation associated with the remote one.
  *
@@ -322,13 +360,19 @@ logicalrep_rel_open(LogicalRepRelId remoteid, LOCKMODE lockmode)
 found++;
 		}
 
-		/* TODO, detail message with names of missing columns */
 		if (found < remoterel->natts)
+		{
+			StringInfoData missingatts;
+
+			initStringInfo();
+			logicalrep_find_missing_atts(entry->localrel, remoterel, );
 			ereport(ERROR,
 	(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 	 errmsg("logical replication target relation \"%s.%s\" is missing "
-			"some replicated columns",
-			remoterel->nspname, remoterel->relname)));
+			"\"%s\" replicated columns",
+			remoterel->nspname, remoterel->relname,
+			missingatts.data)));
+		}
 
 		/*
 		 * Check that replica identity matches. We allow for stricter replica
-- 
2.25.1

Re: Parallel copy

2020-09-07 Thread vignesh C

On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow  wrote:
>
> Hi Vignesh,
>
> >Can you share with me the script you used to generate the data & the ddl of 
> >the table, so that it will help me check that >scenario you faced the 
> >>problem.
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without timezone". There's about 60 of the "character varying(255)"
> overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results.
I tried with a similar test case that you suggested, I was not able to
reproduce the degradation scenario.
If it is possible, can you run perf for the scenario with 1 worker &
non parallel mode & share the perf results, we will be able to find
out which of the functions is consuming more time by doing a
comparison of the perf reports.
Steps for running perf:
1) get the postgres pid
2) perf record -a -g -p 
3) Run copy command
4) Execute "perf report -g" once copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

Re: Yet another fast GiST build

2020-09-07 Thread Pavel Borisov

>
>
> >> I suppose there is a big jump in integer value (whether signed or
> >> unsigned) as you cross from positive to negative floats, and then the
> >> sort order is reversed.  I have no idea if either of those things is a
> >> problem worth fixing.  That made me wonder if there might also be an
>
> I took a stab at fixing this, see attached patch (applies on top of your
> patch v14).
>
> To evaluate this, I used the other attached patch to expose the zorder
> function to SQL, and plotted points around zero with gnuplot. See the
> attached two images, one with patch v14, and the other one with this patch.
>

I'd made testing of sorted SpGist build in cases of points distributed only
in 2d quadrant and points in all 4 quadrants and it appears that this
abnormality doesn't affect as much as Andrey supposed. But Heikki's patch
is really nice way to avoid what can be avoided and I'd like it is included
together with Andrey's patch.

Pavel.

Re: Bogus documentation for bogus geometric operators

2020-09-07 Thread Emre Hasegeli

> Emre, the CF bot complains that this does not apply anymore, so please
> provide a rebase.  By the way, I am a bit confused to see a patch
> that adds new operators on a thread whose subject is about
> documentation.

Rebased version is attached.

The subject is about the documentation, but the post reveals
inconsistencies of the operators.  Tom Lane fixed the documentation on
the back branches.  The new patch is to fix the operators on the
master only.


0001-Add-and-operators-for-points-v02.patch
Description: Binary data

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-07 Thread Andrey V. Lepikhov


On 9/7/20 12:26 PM, Michael Paquier wrote:

On Mon, Aug 24, 2020 at 06:19:28PM +0900, Amit Langote wrote:

On Mon, Aug 24, 2020 at 4:18 PM Amit Langote  wrote:

I would


Oops, thought I'd continue writing, but hit send before actually doing
that.  Please ignore.

I have some comments on v6, which I will share later this week.


While on it, the CF bot is telling that the documentation of the patch
fails to compile.  This needs to be fixed.
--
Michael


v.7 (in attachment) fixes this problem.
I also accepted Amit's suggestion to rename all fdwapi routines such as 
ForeignCopyIn to *ForeignCopy.


--
regards,
Andrey Lepikhov
Postgres Professional
>From db4ba1bac6a8d642dffd1b907dcc1dd082203fab Mon Sep 17 00:00:00 2001
From: Andrey Lepikhov 
Date: Thu, 9 Jul 2020 11:16:56 +0500
Subject: [PATCH] Fast COPY FROM into the foreign or sharded table.

This feature enables bulk COPY into foreign table in the case of
multi inserts is possible and foreign table has non-zero number of columns.

FDWAPI was extended by next routines:
* BeginForeignCopy
* EndForeignCopy
* ExecForeignCopy

BeginForeignCopy and EndForeignCopy initialize and free
the CopyState of bulk COPY. The ExecForeignCopy routine send
'COPY ... FROM STDIN' command to the foreign server, in iterative
manner send tuples by CopyTo() machinery, send EOF to this connection.

Code that constructed list of columns for a given foreign relation
in the deparseAnalyzeSql() routine is separated to the deparseRelColumnList().
It is reused in the deparseCopyFromSql().

Added TAP-tests on the specific corner cases of COPY FROM STDIN operation.

By the analogy of CopyFrom() the CopyState structure was extended
with data_dest_cb callback. It is used for send text representation
of a tuple to a custom destination.
The PgFdwModifyState structure is extended with the cstate field.
It is needed for avoid repeated initialization of CopyState. ALso for this
reason CopyTo() routine was split into the set of routines CopyToStart()/
CopyTo()/CopyToFinish().

Discussion: https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Authors: Andrey Lepikhov, Ashutosh Bapat, Amit Langote
---
 contrib/postgres_fdw/deparse.c|  60 ++-
 .../postgres_fdw/expected/postgres_fdw.out|  46 +-
 contrib/postgres_fdw/postgres_fdw.c   | 143 +++
 contrib/postgres_fdw/postgres_fdw.h   |   1 +
 contrib/postgres_fdw/sql/postgres_fdw.sql |  45 ++
 doc/src/sgml/fdwhandler.sgml  |  75 
 src/backend/commands/copy.c   | 398 +-
 src/backend/commands/tablecmds.c  |   1 +
 src/backend/executor/execMain.c   |  53 +++
 src/backend/executor/execPartition.c  |  28 +-
 src/backend/replication/logical/worker.c  |   2 +-
 src/include/commands/copy.h   |  11 +
 src/include/executor/executor.h   |   1 +
 src/include/foreign/fdwapi.h  |  15 +
 src/include/nodes/execnodes.h |   9 +-
 15 files changed, 670 insertions(+), 218 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index ad37a74221..a37981ff66 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -184,6 +184,8 @@ static void appendAggOrderBy(List *orderList, List *targetList,
 static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
+static List *deparseRelColumnList(StringInfo buf, Relation rel,
+  bool enclose_in_parens);
 
 /*
  * Helper functions
@@ -1758,6 +1760,20 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * Deparse COPY FROM into given buf.
+ * We need to use list of parameters at each query.
+ */
+void
+deparseCopyFromSql(StringInfo buf, Relation rel)
+{
+	appendStringInfoString(buf, "COPY ");
+	deparseRelation(buf, rel);
+	(void) deparseRelColumnList(buf, rel, true);
+
+	appendStringInfoString(buf, " FROM STDIN ");
+}
+
 /*
  * deparse remote UPDATE statement
  *
@@ -2061,6 +2077,30 @@ deparseAnalyzeSizeSql(StringInfo buf, Relation rel)
  */
 void
 deparseAnalyzeSql(StringInfo buf, Relation rel, List **retrieved_attrs)
+{
+	appendStringInfoString(buf, "SELECT ");
+	*retrieved_attrs = deparseRelColumnList(buf, rel, false);
+
+	/* Don't generate bad syntax for zero-column relation. */
+	if (list_length(*retrieved_attrs) == 0)
+		appendStringInfoString(buf, "NULL");
+
+	/*
+	 * Construct FROM clause
+	 */
+	appendStringInfoString(buf, " FROM ");
+	deparseRelation(buf, rel);
+}
+
+/*
+ * Construct the list of columns of given foreign relation in the order they
+ * appear in the tuple descriptor of the relation. Ignore any dropped columns.
+ * Use column names on the foreign server instead of local names.
+ *
+ * Optionally enclose the list

Re: [doc] minor wording improvement in a couple of places

2020-09-07 Thread Magnus Hagander

On Mon, Sep 7, 2020 at 6:34 AM Ian Barwick 
wrote:

> Hi
>
> On these pages:
>
>https://www.postgresql.org/docs/current/fdw-callbacks.html
>https://www.postgresql.org/docs/current/tablesample-method.html
>
> we have the phrase:
>
>"..., which see for additional details."
>
> which strikes me as a bit odd. Suggested phrasing:
>
>"...; see this file for additional details."
>
>
I agree this sounds very weird. However, this phrasing is also fairly
common in comments around the code.

I was first guessing that this came out of one of us non-english native
speakers, but AFAICT it's a mix of authors and definitely including those
natively speaking the language.

I would still suggest that we change it, because it does sound odd to me as
well, but I'll defer a bit and see if others might think we should keep it.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Allow CURRENT_ROLE in GRANTED BY

2020-09-07 Thread Asif Rehman

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

The patch applies cleanly and looks fine to me. However wouldn't it be better 
to just map the CURRENT_ROLE to CURRENT_USER in backend grammar?

The new status of this patch is: Waiting on Author

Re: Yet another fast GiST build

2020-09-07 Thread Heikki Linnakangas

On 24/02/2020 10:50, Andrey M. Borodin wrote:

On 24 февр. 2020 г., at 01:58, Thomas Munro  wrote:

On Thu, Feb 20, 2020 at 10:14 AM Thomas Munro  wrote:

1.  We expect floats to be in IEEE format, and the sort order of IEEE
floats is mostly correlated to the binary sort order of the bits
reinterpreted as an int.  It isn't in some special cases, but for this
use case we don't really care about that, we're just trying to
encourage locality.

I suppose there is a big jump in integer value (whether signed or
unsigned) as you cross from positive to negative floats, and then the
sort order is reversed.  I have no idea if either of those things is a
problem worth fixing.  That made me wonder if there might also be an
endianness problem.  It seems from some quick googling that all
current architectures have integers and floats of the same endianness.
Apparently this wasn't always the case, and some ARMs have a weird
half-flipped arrangement for 64 bit floats, but not 32 bit floats as
you are using here.

Yes, this leap is a problem for point as generic data type. And I do not know
how to fix it. It can cause inefficient Index Scans when searching near (0,0) 
and query
window touches simultaneously all quadrants (4x slower).

I took a stab at fixing this, see attached patch (applies on top of your 
patch v14).

To evaluate this, I used the other attached patch to expose the zorder 
function to SQL, and plotted points around zero with gnuplot. See the 
attached two images, one with patch v14, and the other one with this patch.

I'll continue looking at these patches in whole tomorrow. I think it's 
getting close to a committable state.

But everything will be just fine when all data is in 2nd quadrant.

Simon Riggs and friends would agree :-)

- Heikki
>From c7cadbb017aa3ec446136c36bb58b10d35ed095a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 7 Sep 2020 12:23:20 +0300
Subject: [PATCH v15 3/4] Expose point_zorder() to SQL.

---
 src/backend/access/gist/gistproc.c | 19 +++
 src/include/catalog/pg_proc.dat|  4 
 2 files changed, 23 insertions(+)

diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 387c66d3ca3..4ed9f46c9bb 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -1626,3 +1626,22 @@ gist_point_sortsupport(PG_FUNCTION_ARGS)
 	ssup->comparator = gist_bbox_fastcmp;
 	PG_RETURN_VOID();
 }
+
+/*
+ * Expose the Z-Order for debugging purposes
+ */
+Datum
+point_zorder(PG_FUNCTION_ARGS)
+{
+	Point	   *p = PG_GETARG_POINT_P(0);
+	uint64		zorder;
+
+	zorder = point_zorder_internal(p);
+
+	/*
+	 * XXX: Shift by one, so that when it's interpreted as a signed integer,
+	 * it's always positive. We lose the least-significant bit, but that's OK
+	 * for the quick plotting I'm using this for.
+	 */
+	return Int64GetDatum(zorder >> 1);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 96d7efd4270..9a98bed79e8 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3004,6 +3004,10 @@
   proname => 'point_div', prorettype => 'point', proargtypes => 'point point',
   prosrc => 'point_div' },

+{ oid => '9110',
+  proname => 'point_zorder', prorettype => 'int8', proargtypes => 'point',
+  prosrc => 'point_zorder' },
+
 { oid => '1445',
   proname => 'poly_npoints', prorettype => 'int4', proargtypes => 'polygon',
   prosrc => 'poly_npoints' },
-- 
2.20.1

>From a7e0237d0bfd4909c0c4e0237013efe0cd071dd4 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Mon, 7 Sep 2020 12:33:36 +0300
Subject: [PATCH v15 4/4] Map negative values better.

---
 src/backend/access/gist/gistproc.c | 169 +
 1 file changed, 121 insertions(+), 48 deletions(-)

diff --git a/src/backend/access/gist/gistproc.c b/src/backend/access/gist/gistproc.c
index 4ed9f46c9bb..7ca7eda84b0 100644
--- a/src/backend/access/gist/gistproc.c
+++ b/src/backend/access/gist/gistproc.c
@@ -31,10 +31,11 @@ static bool gist_box_leaf_consistent(BOX *key, BOX *query,
 	 StrategyNumber strategy);
 static bool rtree_internal_consistent(BOX *key, BOX *query,
 	  StrategyNumber strategy);
-static int64 part_bits32_by2(uint32 x);
-static int64 interleave_bits32(uint32 x, uint32 y);
-static inline uint64 point_zorder_internal(Point *p);
-static int gist_bbox_fastcmp(Datum x, Datum y, SortSupport ssup);
+static uint64 part_bits32_by2(uint32 x);
+static uint64 interleave_bits32(uint32 x, uint32 y);
+static uint32 ieee_float32_to_uint32(float f);
+static uint64 point_zorder_internal(Point *p);
+static int	gist_bbox_fastcmp(Datum x, Datum y, SortSupport ssup);

 /* Minimum accepted ratio of split */
@@ -1549,75 +1550,147 @@ gist_poly_distance(PG_FUNCTION_ARGS)

 /* Z-order routines */
 /* Interleave 32 bits with zeroes */
-static int64
+static uint64
 part_bits32_by2(uint32 x)
 {
 	uint64		n = x;

 	n = (n | (n << 16)) &

Re: A micro-optimisation for walkdir()

2020-09-07 Thread Magnus Hagander

On Mon, Sep 7, 2020 at 12:27 AM Thomas Munro  wrote:

> On Sun, Sep 6, 2020 at 5:23 AM Juan José Santamaría Flecha
>  wrote:
> > On Sat, Sep 5, 2020 at 2:13 AM Andres Freund  wrote:
> >> > However, it looks like we might be missing a further opportunity
> >> > here...  Doesn't Windows already give us the flags we need in the
> >> > dwFileAttributes member of the WIN32_FIND_DATA object that the
> >> > Find{First,Next}File() functions populate?
> >>
> >> That'd be better...
> >
> >
> > At first I did not see how to get DT_LNK directly, but it is possible
> without additional calls, so please find attached a version with that logic.
> >
> > This version also drops the enum, defining just the macros.
>
> Excellent.  I'd like to commit these soon, unless someone has a better
> idea for how to name file_utils_febe.c.
>
> I think the following is a little mysterious, but it does seem to be
> what people do for this in other projects.  It is the documented way
> to detect mount points, and I guess IO_REPARSE_TAG_MOUNT_POINT is
> either overloaded also for junctions, or junctions are the same thing
> as mount points.  It would be nice to see a Win32 documentation page
> that explicitly said that.
>

The wikipedia page on it is actually fairly decent:
https://en.wikipedia.org/wiki/NTFS_reparse_point. It's not the
documentation of course, but it's easier to read :) The core difference is
whether you mount a whole filesystem (mount point) or just a directory off
something mounted elsehwere (junction).

And yes, the wikipedia page confirms that junctions also use
IO_REPARSE_TAG_MOUNT_POINT.


+/* For reparse points dwReserved0 field will contain the ReparseTag */
> +else if ((fd.dwFileAttributes & FILE_ATTRIBUTE_REPARSE_POINT) != 0
> + && (fd.dwReserved0 == IO_REPARSE_TAG_MOUNT_POINT))
> +d->ret.d_type = DT_LNK;
>
> Hmm, it's interesting that our existing test for a junction in
> pgwin32_is_junction() only looks for FILE_ATTRIBUTE_REPARSE_POINT and
> doesn't care what kind of reparse point it is.
>

I think that's mostly historical. When that code was written, the only two
types of reparse points that existed were junctions and mount points --
which are as already noted, the same. Symbolic links, unix sockets and such
things came later.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

2020-09-07 Thread Noah Misch

On Mon, Sep 07, 2020 at 05:40:36PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 07 Sep 2020 13:45:28 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > The cause is that the worker had received pending-sync entry correctly
> > but not never created a relcache entry for the relation using
> > RelationBuildDesc. So the rd_firstRelfilenodeSubid is not correctly
> > set.
> > 
> > I'm investigating it.
> 
> Relcaches are loaded from a file with old content at parallel worker
> startup. The relcache entry is corrected by invalidation at taking a
> lock but pending syncs are not considered.
> 
> Since parallel workers don't access the files so we can just ignore
> the assertion safely, but I want to rd_firstRelfilenodeSubid flag at
> invalidation, as attached PoC patch.

> [patch: When RelationInitPhysicalAddr() handles a mapped relation, re-fill
> rd_firstRelfilenodeSubid from RelFileNodeSkippingWAL(), like
> RelationBuildDesc() would do.]

As a PoC, this looks promising.  Thanks.  Would you add a test case such that
the following demonstrates the bug in the absence of your PoC?

  printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 
'max_wal_senders = 0' >/tmp/minimal.conf
  make check TEMP_CONFIG=/tmp/minimal.conf

Please have the test try both a nailed-and-mapped relation and a "nailed, but
not mapped" relation.  I am fairly confident that your PoC fixes the former
case, but the latter may need additional code.

Re: Online checksums verification in the backend

On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote:
> Did you mean creating a new checksumfuncs.c file? I don't find any
> such file in the current tree.

Your patch adds checksumfuncs.c, so the subroutines grabbing a given
block could just be moved there.

> I'm not sure I understand.  Unless I missed something this approach
> *cannot* raise a false positive.  What it does is force a 2nd check
> with stronger lock *to make sure it's actually a corruption*, so we
> don't raise false positive.  The only report that can happen in this
> 1st loop is if smgread raises an error, which AFAICT can only happen
> (at least with mdread) if the whole block couldn't be read, which is a
> sign of a very bad problem.  This should clearly be reported, as this
> cannot be caused by the locking heuristics used here.

We don't know how much this optimization matters though?  Could it be
possible to get an idea of that?  For example, take the case of one
relation with a fixed size in a read-only workload and a read-write
workload (as long as autovacuum and updates make the number of
relation blocks rather constant for the read-write case), doing a
checksum verification in parallel of multiple clients working on the
relation concurrently.  Assuming that the relation is fully in the OS
cache, we could get an idea of the impact with multiple
(shared_buffers / relation size) rates to make the eviction more
aggressive?  The buffer partition locks, knowing that
NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it
seems to me that it would be good to see if we have a difference.
What do you think?
--
Michael


signature.asc
Description: PGP signature

Re: v13: CLUSTER segv with wal_level=minimal and parallel index creation

At Mon, 07 Sep 2020 13:45:28 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> The cause is that the worker had received pending-sync entry correctly
> but not never created a relcache entry for the relation using
> RelationBuildDesc. So the rd_firstRelfilenodeSubid is not correctly
> set.
> 
> I'm investigating it.

Relcaches are loaded from a file with old content at parallel worker
startup. The relcache entry is corrected by invalidation at taking a
lock but pending syncs are not considered.

Since parallel workers don't access the files so we can just ignore
the assertion safely, but I want to rd_firstRelfilenodeSubid flag at
invalidation, as attached PoC patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 96ecad02dd..d66e914345 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1322,10 +1322,27 @@ RelationInitPhysicalAddr(Relation relation)
 	}
 	else
 	{
+		Oid oldnode = relation->rd_node.relNode;
+
 		/* Consult the relation mapper */
 		relation->rd_node.relNode =
 			RelationMapOidToFilenode(relation->rd_id,
 	 relation->rd_rel->relisshared);
+
+		/*
+		 * The leader process might have replaced storage before staring this
+		 * worker. We need to set rd_firstRelfilenodeSubid according to pending
+		 * sync hash in that case. Currently the information is not actually
+		 * used in worker processes, but make things tidy.
+		 */
+		if (IsParallelWorker() && oldnode != relation->rd_node.relNode)
+		{
+			if (RelFileNodeSkippingWAL(relation->rd_node))
+relation->rd_firstRelfilenodeSubid = TopSubTransactionId;
+			else
+relation->rd_firstRelfilenodeSubid = InvalidTransactionId;
+		}
+
 		if (!OidIsValid(relation->rd_node.relNode))
 			elog(ERROR, "could not find relation mapping for relation \"%s\", OID %u",
  RelationGetRelationName(relation), relation->rd_id);

Re: Yet another fast GiST build (typo)

2020-09-07 Thread Andrey M. Borodin



> 7 сент. 2020 г., в 12:14, Andrey M. Borodin  написал(а):
> 
> Maybe I'm missing something? Like forgot to log 10% of pages, or something 
> like that...

Indeed, there was a bug. I've fixed it, and I still observe same performance 
gain.

Best regards, Andrey Borodin.



v2-0001-nbtree-faster-logging.patch
Description: Binary data

Re: Range checks of pg_test_fsync --secs-per-test and pg_test_timing --duration

2020-09-07 Thread Peter Eisentraut


On 2020-09-06 05:04, Michael Paquier wrote:

I would allow 0. It's not
very useful, but it's not harmful and could be applicable in testing.


Hmm, OK.  For pg_test_fsync, 0 means infinity, and for pg_test_timing
that means stopping immediately (we currently don't allow that).  How
does this apply to testing?  For pg_test_fsync, using 0 would mean to
just remain stuck in the first fsync() pattern, while for
pg_test_fsync this means doing no test loops at all, generating a
useless log once done.  Or do you mean to change the logic of
pg_test_fsync so as --secs-per-test=0 means doing one single write?
That's something I thought about for this thread, but I am not sure
that the extra regression test gain is worth more complexity in this
code.


I think in general doing something 0 times should be allowed if possible.

However, I see that in the case of pg_test_fsync you end up in alarm(0), 
which does something different, so it's okay in that case to disallow it.


I notice that the error checking you introduce is different from the 
checks for pgbench -t and -T (the latter having no errno checks).  I'm 
not sure which is correct, but it's perhaps worth making them the same.


(pgbench -t 0, which is also currently not allowed, is a good example of 
why this could be useful, because that would allow checking whether the 
script etc. can be loaded without running an actual test.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

RE: [Patch] Optimize dropping of relation buffers using dlist

2020-09-07 Thread k.jami...@fujitsu.com

On Wednesday, September 2, 2020 5:49 PM, Amit Kapila wrote:
> On Wed, Sep 2, 2020 at 9:17 AM Tom Lane  wrote:
> >
> > Amit Kapila  writes:
> > > Even if the relation is locked, background processes like
> > > checkpointer can still touch the relation which might cause
> > > problems. Consider a case where we extend the relation but didn't
> > > flush the newly added pages. Now during truncate operation,
> > > checkpointer can still flush those pages which can cause trouble for
> > > truncate. But, I think in the recovery path such cases won't cause a
> problem.
> >
> > I wouldn't count on that staying true ...
> >
> >
> https://www.postgresql.org/message-id/CA+hUKGJ8NRsqgkZEnsnRc2MFR
> OBV-jC
> > nacbyvtpptk2a9yy...@mail.gmail.com
> >
> 
> I don't think that proposal will matter after commit c5315f4f44 because we are
> caching the size/blocks for recovery while doing extend (smgrextend). In the
> above scenario, we would have cached the blocks which will be used at later
> point of time.
> 

Hi,

I'm guessing we can still pursue this idea of improving the recovery path first.

I'm working on an updated patch version, because the CFBot's telling
that postgres fails to build (one of the recovery TAP tests fails).
I'm still working on refactoring my patch, but have yet to find a proper 
solution at the moment.
So I'm going to continue my investigation.

Attached is an updated WIP patch.
I'd appreciate if you could take a look at the patch as well.

In addition, attached also are the regression logs for the failure and other 
logs
Accompanying it: wal_optimize_node_minimal and wal_optimize_node_replica.

The failures stated in my session was:
t/018_wal_optimize.pl  18/34 Bailout called.
Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Best regards,
Kirk Jamison

v11-Speedup-dropping-of-relation-buffers-during-recovery.patch
Description:  v11-Speedup-dropping-of-relation-buffers-during-recovery.patch

regress_log_018_wal_optimize
Description: regress_log_018_wal_optimize

018_wal_optimize_node_minimal.log
Description: 018_wal_optimize_node_minimal.log

018_wal_optimize_node_replica.log
Description: 018_wal_optimize_node_replica.log

Re: Improvements in Copy From

2020-09-07 Thread Surafel Temesgen

Hi Vignesh

On Wed, Jul 1, 2020 at 3:46 PM vignesh C  wrote:

> Hi,
>
> While reviewing copy from I identified few  improvements for copy from
> that can be done :
> a) copy from stdin copies lesser amount of data to buffer even though
> space is available in buffer because minread was passed as 1 to
> CopyGetData, Hence it only reads until the data read from libpq is
> less than minread. This can be fixed by passing the actual space
> available in buffer, this reduces the unnecessary frequent calls to
> CopyGetData.
>

why not applying the same optimization on file read ?

> c) Copy from reads header line and do nothing for the header line, we
> need not clear EOL & need not convert to server encoding for the
> header line.
>

We have a patch for column matching feature [1] that may need a header line
to be further processed. Even without that I think it is preferable to
process the header line for nothing than adding those checks to the loop,
performance-wise.

[1].
https://www.postgresql.org/message-id/flat/caf1-j-0ptcwmeltswwgv2m70u26n4g33gpe1rckqqe6wvqd...@mail.gmail.com

regards

Surafel

Re: Optimising compactify_tuples()

2020-09-07 Thread David Rowley

On Sun, 6 Sep 2020 at 23:37, David Rowley  wrote:
> The test replayed ~2.2 GB of WAL. master took 148.581 seconds and
> master+0001+0002 took 115.588 seconds. That's about 28% faster. Pretty
> nice!

I was wondering today if we could just get rid of the sort in
compactify_tuples() completely. It seems to me that the existing sort
is there just so that the memmove() is done in order of tuple at the
end of the page first. We seem to be just shunting all the tuples to
the end of the page so we need to sort the line items in reverse
offset so as not to overwrite memory for other tuples during the copy.

I wondered if we could get around that just by having another buffer
somewhere and memcpy the tuples into that first then copy the tuples
out that buffer back into the page. No need to worry about the order
we do that in as there's no chance to overwrite memory belonging to
other tuples.

Doing that gives me 79.166 seconds in the same recovery test. Or about
46% faster, instead of 22% (I mistakenly wrote 28% yesterday)

The top of perf report says:

  24.19%  postgres  postgres[.] PageRepairFragmentation
   8.37%  postgres  postgres[.] hash_search_with_hash_value
   7.40%  postgres  libc-2.31.so[.] 0x0018e74b
   5.59%  postgres  libc-2.31.so[.] 0x0018e741
   5.49%  postgres  postgres[.] XLogReadBufferExtended
   4.05%  postgres  postgres[.] compactify_tuples
   3.27%  postgres  postgres[.] PinBuffer
   2.88%  postgres  libc-2.31.so[.] 0x0018e470
   2.02%  postgres  postgres[.] hash_bytes

(I'll need to figure out why libc's debug symbols are not working)

I was thinking that there might be a crossover point to where this
method becomes faster than the sort method.  e.g sorting 1 tuple is
pretty cheap, but copying the memory for the entire tuple space might
be expensive as that includes the tuples we might be getting rid of.
So if we did go down that path we might need some heuristics to decide
which method is likely best. Maybe that's based on the number of
tuples, I'm not really sure. I've not made any attempt to try to give
it a worst-case workload to see if there is a crossover point that's
worth worrying about.

The attached patch is what I used to test this.  It kinda goes and
sticks a page-sized variable on the stack, which is not exactly ideal.
I think we'd likely want to figure some other way to do that, but I
just don't know what that would look like yet. I just put the attached
together quickly to test out the idea.

(I don't want to derail the sort improvements here. I happen to think
those are quite important improvements, so I'll continue to review
that patch still.  Longer term, we might just end up with something
slightly different for compactify_tuples)

David
diff --git a/src/backend/storage/page/bufpage.c 
b/src/backend/storage/page/bufpage.c
index d708117a40..5bf35aff37 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -440,22 +440,53 @@ compactify_tuples(itemIdSort itemidbase, int nitems, Page 
page)
Offset  upper;
int i;

-   /* sort itemIdSortData array into decreasing itemoff order */
-   qsort((char *) itemidbase, nitems, sizeof(itemIdSortData),
- itemoffcompare);
+   if (nitems > MaxHeapTuplesPerPage / 2)
+   {
+   char buffer[BLCKSZ + MAXIMUM_ALIGNOF];
+   char *bufp = (char *) MAXALIGN();
+
+   /*
+* Make a temp copy of the tuple data so that we can rearange
+* the tuples freely without having to worry about stomping
+* on the memory of other tuples.
+*/
+   memcpy(bufp + phdr->pd_upper,
+  page + phdr->pd_upper,
+  phdr->pd_special - phdr->pd_upper);

-   upper = phdr->pd_special;
-   for (i = 0; i < nitems; i++)
+   upper = phdr->pd_special;
+   for (i = 0; i < nitems; i++)
+   {
+   itemIdSort  itemidptr = [i];
+   ItemId  lp;
+
+   lp = PageGetItemId(page, itemidptr->offsetindex + 1);
+   upper -= itemidptr->alignedlen;
+   memcpy((char *) page + upper,
+  bufp + itemidptr->itemoff,
+  itemidptr->alignedlen);
+   lp->lp_off = upper;
+   }
+   }
+   else
{
-   itemIdSort  itemidptr = [i];
-   ItemId  lp;
-
-   lp = PageGetItemId(page, itemidptr->offsetindex + 1);
-   upper -= itemidptr->alignedlen;
-   memmove((char *) page + upper,
-   (char *) page + itemidptr->itemoff,
-   itemidptr->alignedlen);

Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

On Mon, Aug 10, 2020 at 10:10:08PM -0400, Robert Haas wrote:
> That split dates to the parallel query work, and there are some
> comments in shmem_exit() about it; see in particular the explanation
> in the middle where it says "Call dynamic shared memory callbacks." It
> seemed to me that I needed the re-entrancy behavior that is described
> there, but for a set of callbacks that needed to run before some of
> the existing callbacks and after others.

We still have a CF entry here:
https://commitfest.postgresql.org/29/2649/

Is there still something that needs to absolutely be done here knowing
that we have bab1500 that got rid of the root issue?  Can the CF entry
be marked as committed?

(FWIW, I would move any discussion about improving more stuff related
to shared memory cleanup code at proc exit into a new thread, as that
looks like a new topic.)
--
Michael


signature.asc
Description: PGP signature

Re: Online checksums verification in the backend

2020-09-07 Thread Julien Rouhaud

On Mon, Sep 7, 2020 at 8:59 AM Michael Paquier  wrote:
>
> +#include "postgres.h"
> +
> +#include "access/tupdesc.h"
> +#include "common/relpath.h"
>  #include "storage/block.h"
> +#include "utils/relcache.h"
> +#include "utils/tuplestore.h"
> [...]
> +extern bool check_one_block(Relation relation, ForkNumber forknum,
> +   BlockNumber blkno, uint16 *chk_expected,
> +   uint16 *chk_found);
> I don't think that it is a good idea to add this much to checksum.h
> as these are includes coming mainly from the backend.  Note that
> pg_checksum_page() is a function designed to be also available for
> frontend tools, with checksum.h something that can be included in
> frontends.  This would mean that we could move all the buffer lookup
> APIs directly to checksumfuncs.c, or move that into a separate file
> closer to the location.

Did you mean creating a new checksumfuncs.c file? I don't find any
such file in the current tree.

> + * A zero checksum can never be computed, see pg_checksum_page() */
> +#define NoComputedChecksum 0
> Wouldn't it be better to rename that something like
> InvalidPageChecksum, and make use of it in pg_checksum_page()?  It
> would be more consistent with the naming of transaction IDs, OIDs or
> even XLogRecPtr.  And that could be a separate patch.

It seems quite ambiguous, as checksum validity usually has a different
meaning.  And in the code added here, the meaning isn't that the
ckecksum is invalid but that there's no checsum as it cannot be
computed due to PageIsNew().

> +++ b/src/test/check_relation/t/01_checksums_check.pl
> @@ -0,0 +1,276 @@
> +use strict;
> +use warnings;
> It could be better to move that to src/test/modules/, so as it could
> be picked more easily by MSVC scripts in the future.  Note that if we
> apply the normal naming convention here this should be named
> 001_checksum_check.pl.
>
> +subdir = src/test/check_relation
> +top_builddir = ../../..
> +include $(top_builddir)/src/Makefile.global
> Let's use a Makefile shaped in a way similar to modules/test_misc that
> makes use of TAP_TESTS = 1.  There is the infra, let's rely on it for
> the regression tests.

Will fix.

> +   pg_usleep(msec * 1000L);
> Could it be possible to add a wait event here?  It would be nice to be
> able to monitor that in pg_stat_activity.

Sure, I missed that as this was first implemented as an extension.

> +if (exists $ENV{MY_PG_REGRESS})
> +{
> +   $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
> +}
> What is MY_PG_REGRESS for?  A remnant from an external makefile
> perhaps?

Indeed.

> +   /*
> +* If we get a failure and the buffer wasn't found in shared buffers,
> +* reread the buffer with suitable lock to avoid false positive. See
> +* check_get_buffer for more details.
> +*/
> +   if (!found_in_sb && !force_lock)
> +   {
> +   force_lock = true;
> +   goto Retry;
> +   }
> As designed, we have a risk of false positives with a torn page in the
> first loop when trying to look for a given buffer as we would try to
> use smgrread() without a partition lock.  This stresses me a bit, and
> false positives could scare users easily.  Could we consider first a
> safer approach where we don't do that, and just read the page while
> holding the partition lock?  OK, that would be more expensive, but at
> least that's safe in any case.  My memory of this patch is a bit
> fuzzy, but this is itching me and this is the heart of the problem
> dealt with here :)

I'm not sure I understand.  Unless I missed something this approach
*cannot* raise a false positive.  What it does is force a 2nd check
with stronger lock *to make sure it's actually a corruption*, so we
don't raise false positive.  The only report that can happen in this
1st loop is if smgread raises an error, which AFAICT can only happen
(at least with mdread) if the whole block couldn't be read, which is a
sign of a very bad problem.  This should clearly be reported, as this
cannot be caused by the locking heuristics used here.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-09-07 Thread Dilip Kumar

On Mon, Sep 7, 2020 at 12:00 PM Amit Kapila  wrote:

> On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar  wrote:
> >
> >
> > I have reviewed the changes and looks fine to me.
> >
>
> Thanks, I have pushed the last patch. Let's wait for a day or so to
> see the buildfarm reports and then we can probably close this CF
> entry.


Thanks.


> I am aware that we have one patch related to stats still
> pending but I think we can tackle it along with the spill stats patch
> which is being discussed in a different thread [1]. Do let me know if
> I have missed anything?
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com


Sound good to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [POC] Fast COPY FROM command for the table with foreign partitions

On Mon, Aug 24, 2020 at 06:19:28PM +0900, Amit Langote wrote:
> On Mon, Aug 24, 2020 at 4:18 PM Amit Langote  wrote:
> > I would
> 
> Oops, thought I'd continue writing, but hit send before actually doing
> that.  Please ignore.
> 
> I have some comments on v6, which I will share later this week.

While on it, the CF bot is telling that the documentation of the patch
fails to compile.  This needs to be fixed.
--
Michael


signature.asc
Description: PGP signature

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

On Tue, Aug 04, 2020 at 06:59:50AM +0800, Andy Fan wrote:
> You are correct, I would include this in the next version patch, Thank you
> for this checking!

Regression tests are failing with this patch set applied.  The CF bot
says so, and I can reproduce that locally as well.  Could you look at
that please?  I have switched the patch to "waiting on author".
--
Michael


signature.asc
Description: PGP signature

Re: New statistics for tuning WAL buffer size





On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:

On 2020/09/02 18:56, Masahiro Ikeda wrote:

+/* --
+ * Backend types
+ * --

You seem to forget to add "*/" into the above comment.
This issue could cause the following compiler warning.
../../src/include/pgstat.h:761:1: warning: '/*' within block comment [-Wcomment]


Thanks for the comment. I fixed.


Thanks for the fix! But why are those comments necessary?


Sorry about that. This comment is not necessary.
I removed it.


The pg_stat_walwriter is not security restricted now, so ordinary users can 
access it.
It has the same security level as pg_stat_archiver. If you have any comments, 
please let me know.


+   dirty_writes bigint

I guess that the column name "dirty_writes" derived from
the DTrace probe name. Isn't this name confusing? We should
rename it to "wal_buffers_full" or something?


I agree and rename it to "wal_buffers_full".


+/* --
+ * PgStat_MsgWalWriter    Sent by the walwriter to update statistics.

This comment seems not accurate because backends also send it.

+/*
+ * WAL writes statistics counter is updated in XLogWrite function
+ */
+extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in XLogWrite().


Right. I fixed it to "Sent by each backend and background workers to update WAL 
statistics."
In the future, other statistics will be included so I remove the function's 
name.



+-- There will surely and maximum one record
+select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one record"?


Thanks, I fixed.


+    WalWriterStats.m_xlog_dirty_writes++;
 LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be protected
with WALWriteLock, isn't it better to increment that after releasing the lock?


Thanks, I fixed.


+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+    pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+    pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be
placed just after that of pg_stat_bgwriter not pg_stat_progress_analyze.


OK, I fixed it.


 }
-
 /*
  * We found an existing collector stats file. Read it and put all the

You seem to accidentally have removed the empty line here.


Sorry about that. I fixed it.


- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\" or \"bgwriter\" or
\"walwriter\".")));

There are two "or" in the message, but the former should be replaced with ","?


Thanks, I fixed.


On 2020-09-05 18:40, Magnus Hagander wrote:

On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao
 wrote:


On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

I changed the view name from pg_stat_walwrites to

pg_stat_walwriter.

I think it is better to match naming scheme with other views

like

pg_stat_bgwriter,

which is for bgwriter statistics but it has the statistics

related to backend.


I prefer the view name pg_stat_walwriter for the consistency with
other view names. But we also have pg_stat_wal_receiver. Which
makes me think that maybe pg_stat_wal_writer is better for
the consistency. Thought? IMO either of them works for me.
I'd like to hear more opinons about this.


I think pg_stat_bgwriter is now a misnomer, because it contains

the backends' activity.  Likewise, pg_stat_walwriter leads to
misunderstanding because its information is not limited to WAL
writer.


How about simply pg_stat_wal?  In the future, we may want to

include WAL reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable.


+1.

pg_stat_bgwriter has had the "wrong name" for quite some time now --
it became even more apparent when the checkpointer was split out to
it's own process, and that's not exactly a recent change. And it had
allocs in it from day one...

I think naming it for what the data in it is ("wal") rather than which
process deals with it ("walwriter") is correct, unless the statistics
can be known to only *ever* affect one type of process. (And then
different processes can affect different columns in the view). As a
general rule -- and that's from what I can tell exactly what's being
proposed.


Thanks for your comments. I agree with your opinions.
I changed the view name to "pg_stat_wal".


I fixed the code to send the WAL statistics from not only backend and walwriter
but also checkpointer, walsender and autovacuum worker.


Good point! Thanks for updating the patch!


@@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 onerel->rd_rel->relisshared,
 Max(new_live_tuples, 0),

Re: Online checksums patch - once again

On Wed, Sep 02, 2020 at 02:22:25PM +0200, Daniel Gustafsson wrote:
> I unfortunately haven't had time to read the READ ONLY patch so I can't 
> comment
> on how these two patches do things in relation to each other.
> 
> The main synchronization mechanisms are the use of the inprogress mode where
> data checksums are written but not verified, and by waiting for all
> pre-existing non-compatible processes (transactions, temp tables) to disappear
> before enabling.

The CF bot is complaining on this one with a TAP test failure:
https://travis-ci.org/github/postgresql-cfbot/postgresql/builds/724717901

t/003_standby_checksum.pl .. 1/10
#   Failed test 'ensure checksums are on or in progress on standby_1'
#   at t/003_standby_checksum.pl line 59.
# 'off'
# ~~
# 'ARRAY(0x1d38c10)'
# Looks like you failed 1 test of 10.
t/003_standby_checksum.pl .. Dubious, test returned 1 (wstat 256,
0x100)
Failed 1/10 subtests

Daniel, could you look at that?
--
Michael


signature.asc
Description: PGP signature

Re: Yet another fast GiST build (typo)

2020-09-07 Thread Andrey M. Borodin

> 6 сент. 2020 г., в 18:26, Heikki Linnakangas  написал(а):
> 
> On 05/09/2020 14:53, Andrey M. Borodin wrote:
>> Thanks for ideas, Heikki. Please see v13 with proposed changes.
> 
> Thanks, that was quick!
> 
>> But I've found out that logging page-by-page slows down GiST build by
>> approximately 15% (when CPU constrained). Though In think that this
>> is IO-wise.
> Hmm, any ideas why that is? log_newpage_range() writes one WAL record for 32 
> pages, while now you're writing one record per page, so you'll have a little 
> bit more overhead from that. But 15% seems like a lot.

Hmm, this works for B-tree too.
this index creation
psql -c '\timing' -c "create table x as select random() from 
generate_series(1,1000,1);" -c "create index ON x (random );"
takes 7 seconds on may machine, but with one weird patch it takes only 6 :)

Maybe I'm missing something? Like forgot to log 10% of pages, or something like 
that...

Best regards, Andrey Borodin.

0001-nbtree-faster-logging.patch
Description: Binary data

Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-09-07 Thread Amit Khandekar

On Mon, 7 Sep 2020 at 11:23, Tom Lane  wrote:
>
> I wrote:
> > I made some cosmetic changes to this and committed it.

Thanks!

>
> BTW, poking at this further, it seems that the patch only really
> works for gcc.  clang accepts the -ftree-vectorize switch, but
> looking at the generated asm shows that it does nothing useful.
> Which is odd, because clang does do loop vectorization.
>
> I tried adding -Rpass-analysis=loop-vectorize and got
>
> numeric.c:8341:3: remark: loop not vectorized: could not determine number of 
> loop iterations [-Rpass-analysis=loop-vectorize]
> for (i2 = 0; i2 <= i; i2++)

Hmm, yeah that's unfortunate. My guess is that the compiler would do
vectorization only if 'i' is a constant, which is not true for our
case.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies

Re: Online checksums verification in the backend

On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote:
> On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:
>> Small language fixes in comments and user-facing docs.
> 
> Thanks a lot!  I just fixed a small issue (see below), PFA updated v10.

Sawada-san, you are registered as a reviewer of this patch.  Are you
planning to look at it?  If you are busy lately, that's fine as well
(congrats!).  In this case it could be better to unregister from the
CF app for this entry.

I am refreshing my mind here, but here are some high-level comments
for now...

+#include "postgres.h"
+
+#include "access/tupdesc.h"
+#include "common/relpath.h"
 #include "storage/block.h"
+#include "utils/relcache.h"
+#include "utils/tuplestore.h"
[...]
+extern bool check_one_block(Relation relation, ForkNumber forknum,
+   BlockNumber blkno, uint16 *chk_expected,
+   uint16 *chk_found);
I don't think that it is a good idea to add this much to checksum.h
as these are includes coming mainly from the backend.  Note that
pg_checksum_page() is a function designed to be also available for
frontend tools, with checksum.h something that can be included in
frontends.  This would mean that we could move all the buffer lookup
APIs directly to checksumfuncs.c, or move that into a separate file
closer to the location.

+ * A zero checksum can never be computed, see pg_checksum_page() */
+#define NoComputedChecksum 0
Wouldn't it be better to rename that something like
InvalidPageChecksum, and make use of it in pg_checksum_page()?  It
would be more consistent with the naming of transaction IDs, OIDs or
even XLogRecPtr.  And that could be a separate patch.

+++ b/src/test/check_relation/t/01_checksums_check.pl
@@ -0,0 +1,276 @@
+use strict;
+use warnings;
It could be better to move that to src/test/modules/, so as it could
be picked more easily by MSVC scripts in the future.  Note that if we
apply the normal naming convention here this should be named
001_checksum_check.pl.

+subdir = src/test/check_relation
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
Let's use a Makefile shaped in a way similar to modules/test_misc that
makes use of TAP_TESTS = 1.  There is the infra, let's rely on it for
the regression tests.

+   pg_usleep(msec * 1000L);
Could it be possible to add a wait event here?  It would be nice to be
able to monitor that in pg_stat_activity.

+if (exists $ENV{MY_PG_REGRESS})
+{
+   $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS};
+}
What is MY_PG_REGRESS for?  A remnant from an external makefile
perhaps?

+   /*
+* If we get a failure and the buffer wasn't found in shared buffers,
+* reread the buffer with suitable lock to avoid false positive. See
+* check_get_buffer for more details.
+*/
+   if (!found_in_sb && !force_lock)
+   {
+   force_lock = true;
+   goto Retry;
+   }
As designed, we have a risk of false positives with a torn page in the
first loop when trying to look for a given buffer as we would try to
use smgrread() without a partition lock.  This stresses me a bit, and
false positives could scare users easily.  Could we consider first a
safer approach where we don't do that, and just read the page while
holding the partition lock?  OK, that would be more expensive, but at
least that's safe in any case.  My memory of this patch is a bit
fuzzy, but this is itching me and this is the heart of the problem
dealt with here :)
--
Michael


signature.asc
Description: PGP signature

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

On Sat, Sep 5, 2020 at 8:55 PM Dilip Kumar  wrote:
>
>
> I have reviewed the changes and looks fine to me.
>

Thanks, I have pushed the last patch. Let's wait for a day or so to
see the buildfarm reports and then we can probably close this CF
entry. I am aware that we have one patch related to stats still
pending but I think we can tackle it along with the spill stats patch
which is being discussed in a different thread [1]. Do let me know if
I have missed anything?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JBqQh9cBKjO-nKOOE%3D7f6ONDCZp0TJZfn4VsQqRZ%2BuYA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.

Re: A micro-optimisation for walkdir()

On Mon, Sep 7, 2020 at 10:36 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Excellent.  I'd like to commit these soon, unless someone has a better
> > idea for how to name file_utils_febe.c.
>
> As long as it's in src/port or src/common, isn't it implicit that
> it's probably FE/BE common code?
>
> I think it'd make more sense to insert all this stuff into file_utils.c,
> and then just "#ifdef FRONTEND" the existing code there that doesn't work
> in backend.  For one thing, that gives us a saner upgrade path whenever
> somebody gets ambitious enough to make that code work for the backend.

True.  Ok, I committed the Unix patch that way.  I know it works on
Linux, FreeBSD, macOS and Windows (without Juan José's patch), but
I'll have to check the build farm later to make sure HPUX, AIX and
Solaris are OK with this.  It's remotely possible that one of them
defines DT_REG etc but doesn't have d_type; I'm hoping to get away
with not adding a configure check.

I'll wait a bit longer for comments on the Windows patch.

Re: Resetting spilled txn statistics in pg_stat_replication

On Thu, Jul 23, 2020 at 11:46 AM Masahiko Sawada
 wrote:
>
> I've updated the patch so that the stats collector ignores the
> 'update' message if the slot stats array is already full.
>

This patch needs a rebase. I don't see this patch in the CF app. I
hope you are still interested in working on this.

Review comments:
===
1.
+CREATE VIEW pg_stat_replication_slots AS
+SELECT
+s.name,
+s.spill_txns,
+s.spill_count,
+s.spill_bytes
+FROM pg_stat_get_replication_slots() AS s;

The view pg_stat_replication_slots should have a column 'stats_reset'
(datatype: timestamp with time zone) as we provide a facitily to reset
the slots. A similar column exists in pg_stat_slru as well, so is
there a reason of not providing it here?

2.
+   
+  
+
+  
   pg_stat_wal_receiver

It is better to keep one empty line between  and  to
keep it consistent with the documentation of other views.

3.
  pg_stat_reset_replication_slot
+
+pg_stat_reset_replication_slot (
text )
+void
+   
+   
+ Resets statistics to zero for a single replication slot, or for all
+ replication slots in the cluster.  If the argument is NULL,
all counters
+ shown in the
pg_stat_replication_slots view for
+ all replication slots are reset.
+   

I think the information about the parameter for this function is not
completely clear. It seems to me that it should be the name of the
slot for which we want to reset the stats, if so, let's try to be
clear.

4.
+pgstat_reset_replslot_counter(const char *name)
+{
+ PgStat_MsgResetreplslotcounter msg;
+
+ if (pgStatSock == PGINVALID_SOCKET)
+ return;
+
+ pgstat_setheader(_hdr, PGSTAT_MTYPE_RESETREPLSLOTCOUNTER);
+ if (name)
+ {
+ memcpy(_slotname, name, NAMEDATALEN);
+ msg.clearall = false;
+ }

Don't we want to verify here or in the caller of this function whether
the passed slot_name is a valid slot or not? For ex. see
pgstat_reset_shared_counters where we return an error if the target is
not valid.

5.
+static void
+pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
+ int len)
+{
+ int i;
+ int idx = -1;
+ TimestampTz ts;
+
+ if (!msg->clearall)
+ {
+ /* Get the index of replication slot statistics to reset */
+ idx = pgstat_replslot_index(msg->m_slotname, false);
+
+ if (idx < 0)
+ return; /* not found */

Can we add a comment to describe when we don't expect to find the slot
here unless there is no way that can happen?

6.
+pgstat_recv_resetreplslotcounter(PgStat_MsgResetreplslotcounter *msg,
+ int len)
{
..
+ for (i = 0; i < SLRU_NUM_ELEMENTS; i++)
..
}

I think here we need to traverse till nReplSlotStats, not SLRU_NUM_ELEMENTS.

7. Don't we need to change PGSTAT_FILE_FORMAT_ID for this patch? We
can probably do at the end but better to change it now so that it
doesn't slip from our mind.

8.
@@ -5350,6 +5474,23 @@ pgstat_read_statsfiles(Oid onlydb, bool
permanent, bool deep)

  break;

+ /*
+ * 'R' A PgStat_ReplSlotStats struct describing a replication slot
+ * follows.
+ */
+ case 'R':
+ if (fread([nReplSlotStats], 1,
sizeof(PgStat_ReplSlotStats), fpin)
+ != sizeof(PgStat_ReplSlotStats))
+ {
+ ereport(pgStatRunningInCollector ? LOG : WARNING,
+ (errmsg("corrupted statistics file \"%s\"",
+ statfile)));
+ memset([nReplSlotStats], 0, sizeof(PgStat_ReplSlotStats));
+ goto done;
+ }
+ nReplSlotStats++;
+ break;

Both here and in pgstat_read_db_statsfile_timestamp(), the patch
handles 'R' message after 'D' whereas while writing the 'R' is written
before 'D'. So, I think it is better if we keep the order during read
the same as during write.

9. While reviewing this patch, I noticed that in
pgstat_read_db_statsfile_timestamp(), if we fail to read ArchiverStats
or SLRUStats, we return 'false' from this function but OTOH, if we
fail to read 'D' or 'R' message, we will return 'true'. I feel the
handling of 'D' and 'R' message is fine because once we read
GlobalStats, we can return the stats_timestamp. So the other two
stands corrected. I understand that this is not directly related to
this patch but if you agree we can do this as a separate patch.

-- 
With Regards,
Amit Kapila.

Re: clarify "rewritten" in pg_checksums docs