Re: Parallel copy

2020-07-14 Thread Amit Kapila
On Sun, Jul 12, 2020 at 5:48 PM Bharath Rupireddy
 wrote:
>
> >
> > Hi Bharath,
> >
> > I was looking forward to review this patch-set but unfortunately it is
> > showing a reject in copy.c, and might need a rebase.
> > I was applying on master over the commit-
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
> >
>
> Thanks for showing interest. Please find the patch set rebased to
> latest commit b1e48bbe64a411666bb1928b9741e112e267836d.
>

Few comments:

0001-Copy-code-readjustment-to-support-parallel-copy

I am not sure converting the code to macros is a good idea; it makes
this code harder to read.  Also, there are a few changes which I am
not sure are necessary.
1.
+/*
+ * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
+ */
+#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos, copy_line_size) \
+{ \
+ /* \
+ * If we didn't hit EOF, then we must have transferred the EOL marker \
+ * to line_buf along with the data.  Get rid of it. \
+ */ \
+   switch (cstate->eol_type) \
+   { \
+case EOL_NL: \
+Assert(copy_line_size >= 1); \
+Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+copy_line_data[copy_line_pos - 1] = '\0'; \
+copy_line_size--; \
+break; \
+case EOL_CR: \
+Assert(copy_line_size >= 1); \
+Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
+copy_line_data[copy_line_pos - 1] = '\0'; \
+copy_line_size--; \
+break; \
+case EOL_CRNL: \
+Assert(copy_line_size >= 2); \
+Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
+Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+copy_line_data[copy_line_pos - 2] = '\0'; \
+copy_line_size -= 2; \
+break; \
+case EOL_UNKNOWN: \
+/* shouldn't get here */ \
+Assert(false); \
+break; \
+   } \
+}

In the original code, we are using only len and buffer; here we are
using a position, a length/size, and a buffer.  Is that really required,
or can we make do with just len and buffer?
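
For illustration, here is a minimal sketch of the same logic written as a
static function that takes only the buffer and its length (no separate
position), which is effectively what the existing non-parallel code does
with cstate->line_buf.  The name and signature are illustrative, not taken
from the patch:

static void
ClearEOLFromCopiedData(CopyState cstate, char *line_data, int *line_size)
{
    switch (cstate->eol_type)
    {
        case EOL_NL:
            Assert(*line_size >= 1 && line_data[*line_size - 1] == '\n');
            line_data[--(*line_size)] = '\0';
            break;
        case EOL_CR:
            Assert(*line_size >= 1 && line_data[*line_size - 1] == '\r');
            line_data[--(*line_size)] = '\0';
            break;
        case EOL_CRNL:
            Assert(*line_size >= 2 &&
                   line_data[*line_size - 2] == '\r' &&
                   line_data[*line_size - 1] == '\n');
            *line_size -= 2;
            line_data[*line_size] = '\0';
            break;
        case EOL_UNKNOWN:
            /* shouldn't get here */
            Assert(false);
            break;
    }
}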

2.
+/*
+ * INCREMENTPROCESSED - Increment the lines processed.
+ */
+#define INCREMENTPROCESSED(processed)  \
+processed++;
+
+/*
+ * GETPROCESSED - Get the lines processed.
+ */
+#define GETPROCESSED(processed) \
+return processed;
+

I don't like converting the above to macros.  I don't think converting
such things to macros will buy us much.

0002-Framework-for-leader-worker-in-parallel-copy
3.
 /*
+ * Copy data block information.
+ */
+typedef struct ParallelCopyDataBlock

It would be better to add a few comments atop this data structure to explain
how it is used.

4.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * this is protected by the following sequence in the leader & worker.
+ * Leader should operate in the following order:
+ * 1) update first_block, start_offset & cur_lineno in any order.
+ * 2) update line_size.
+ * 3) update line_state.
+ * Worker should operate in the following order:
+ * 1) read line_size.
+ * 2) only one worker should choose one line for processing, this is handled by
+ *using pg_atomic_compare_exchange_u32, worker will change the state to
+ *LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 3) read first_block, start_offset & cur_lineno in any order.
+ */
+typedef struct ParallelCopyLineBoundary

Here, you have mentioned how workers and leader should operate to make
sure access to the data is sane.  However, you have not explained what
the problem would be if they don't do so, and it is not apparent to me.
Also, it is not very clear from the comments what the purpose of this
data structure is.
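
For what it's worth, a sketch of the worker-side claim might make the
intent of the ordering clearer.  The field and constant names (line_state,
line_size, first_block, start_offset, cur_lineno, LINE_LEADER_POPULATED,
LINE_WORKER_PROCESSING) come from the patch; the surrounding code and the
helper are only illustrative:

    uint32      line_size = pg_atomic_read_u32(&lineInfo->line_size);
    uint32      expected = LINE_LEADER_POPULATED;

    if (pg_atomic_compare_exchange_u32(&lineInfo->line_state, &expected,
                                       LINE_WORKER_PROCESSING))
    {
        /*
         * This worker now owns the line.  Because the leader writes
         * line_state last, the other fields are guaranteed to be populated
         * already, and no other worker can claim the same line.
         */
        ProcessClaimedLine(lineInfo->first_block,   /* hypothetical helper */
                           lineInfo->start_offset,
                           lineInfo->cur_lineno,
                           line_size);
    }

Presumably the hazard the ordering guards against is a worker winning the
compare-and-exchange before the leader has finished writing the other
fields, and then reading stale values; it would help if the comment spelled
that out.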

5.
+/*
+ * Circular queue used to store the line information.
+ */
+typedef struct ParallelCopyLineBoundaries
+{
+ /* Position for the leader to populate a line. */
+ uint32 leader_pos;

I don't think the variable needs to be named as leader_pos; it is okay
to name it as 'pos', as the comment above it explains its usage.

7.
+#define DATA_BLOCK_SIZE RAW_BUF_SIZE
+#define RINGSIZE (10 * 1000)
+#define MAX_BLOCKS_COUNT 1000
+#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */

It would be good if you can write a few comments to explain why you
have chosen these default values.

8.
ParallelCopyCommonKeyData, shall we name this as
SerializedParallelCopyState or something like that?  For example, see
SerializedSnapshotData, which has been used to pass snapshot
information to workers.

9.
+CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData
*shared_cstate)

If you agree with point-8, then let's name this as
SerializeParallelCopyState.  See if there are more usages of similar
types in the patch; if so, let's change those as well.

10.
+ * in the DSM. The specified number of workers will then be launched.
+ *
+ */
+static ParallelContext*
+BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)

No need of an extra line with only '*' in the above multi-line comment.

11.
BeginParallelCopy(..)
{
..
+ EstimateLineKeysStr(pcxt, 

Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Peter Geoghegan
On Mon, Jul 13, 2020 at 9:47 AM Alvaro Herrera  wrote:
> I'm in favor of hash_mem_multiplier.  I think a >1 default is more
> sensible than =1 in the long run, but if strategic vote is what we're
> doing, then I support the =1 option.

Attached is a WIP patch implementing hash_mem_multiplier, with 1.0 as
the GUC's default value (i.e. the patch introduces no behavioral
changes by default). The first patch in the series renames some local
variables whose name is made ambiguous by the second, main patch.

Since the patch doesn't add a new work_mem-style GUC, but existing
consumers of work_mem expect something like that, the code is
structured in a way that allows the planner and executor to pretend
that there really is a work_mem-style GUC called hash_mem, which they
can determine the value of by calling the get_hash_mem() function.
This seemed like the simplest approach overall. I placed the
get_hash_mem() function in nodeHash.c, which is a pretty random place
for it. If anybody has any better ideas about where it should live,
please say so.
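
For readers skimming the thread, a rough sketch of the helper as described
above (the actual patch may clamp or round differently; work_mem is in
kilobytes, so the result is too):

int
get_hash_mem(void)
{
    double      hash_mem = (double) work_mem * hash_mem_multiplier;

    /* Keep the result within the range of a work_mem-style setting. */
    if (hash_mem >= (double) INT_MAX)
        return INT_MAX;

    return (int) hash_mem;
}

Hash-based consumers would call get_hash_mem() where they currently read
work_mem, while sort-based consumers keep using work_mem directly.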

ISTM that the planner changes are where there's most likely to be
problems. Reviewers should examine consider_groupingsets_paths() in
detail.

--
Peter Geoghegan


v2-0001-Rename-hash_mem-local-variable.patch
Description: Binary data


v2-0002-Add-hash_mem_multiplier-GUC.patch
Description: Binary data


Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-14 Thread vignesh C
On Wed, Jul 15, 2020 at 8:03 AM Amit Langote  wrote:
>
> Hi Vignesh,
>
> On Tue, Jul 14, 2020 at 10:23 PM vignesh C  wrote:
> > On Tue, Jul 14, 2020 at 11:19 AM Amit Langote  
> > wrote:
> > > Sounds fine to me.  Although CopyLoadRawBuf() does not seem to be a
> > > candidate for rigorous code optimization as it does not get called
> > > that often.
> >
> > I thought we could include that change as we are making changes around
> > that code.
>
> Sure, done.
>
> > Rest of the changes looked fine to me. Also I noticed that
> > commit message was missing in the patch.
>
> Please see the attached v7.
>

Thanks for fixing the comments.
Patch applies cleanly, make check & make check-world passes.
The changes look fine to me.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-14 Thread Dilip Kumar
On Mon, Jul 13, 2020 at 10:47 AM Dilip Kumar  wrote:
>
> On Sun, Jul 12, 2020 at 9:56 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 6, 2020 at 11:43 AM Dilip Kumar  wrote:
> > >
> > > On Mon, Jul 6, 2020 at 11:31 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Sun, Jul 5, 2020 at 4:47 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > > On Sat, Jul 4, 2020 at 11:35 AM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > >
> > > > > > 9.
> > > > > > +ReorderBufferHandleConcurrentAbort(ReorderBuffer *rb, 
> > > > > > ReorderBufferTXN *txn,
> > > > > > {
> > > > > > ..
> > > > > > + ReorderBufferToastReset(rb, txn);
> > > > > > + if (specinsert != NULL)
> > > > > > + ReorderBufferReturnChange(rb, specinsert);
> > > > > > ..
> > > > > > }
> > > > > >
> > > > > > Why do we need to do these here when we wouldn't have been done for
> > > > > > any exception other than ERRCODE_TRANSACTION_ROLLBACK?
> > > > >
> > > > > Because we are handling this exception "ERRCODE_TRANSACTION_ROLLBACK"
> > > > > gracefully and we are continuing with further decoding so we need to
> > > > > return this change back.
> > > > >
> > > >
> > > > Okay, then I suggest we should do these before calling stream_stop and
> > > > also move ReorderBufferResetTXN after calling stream_stop  to follow a
> > > > pattern similar to try block unless there is a reason for not doing
> > > > so.  Also, it would be good if we can initialize specinsert with NULL
> > > > after returning the change as we are doing at other places.
> > >
> > > Okay
> > >
> > > > > > 10.  I have got the below failure once.  I have not investigated 
> > > > > > this
> > > > > > in detail as the patch is still under progress.  See, if you have 
> > > > > > any
> > > > > > idea?
> > > > > > #   Failed test 'check extra columns contain local defaults'
> > > > > > #   at t/013_stream_subxact_ddl_abort.pl line 81.
> > > > > > #  got: '2|0'
> > > > > > # expected: '1000|500'
> > > > > > # Looks like you failed 1 test of 2.
> > > > > > make[2]: *** [check] Error 1
> > > > > > make[1]: *** [check-subscription-recurse] Error 2
> > > > > > make[1]: *** Waiting for unfinished jobs
> > > > > > make: *** [check-world-src/test-recurse] Error 2
> > > > >
> > > > > Even I got the failure once and after that, it did not reproduce.  I
> > > > > have executed it multiple time but it did not reproduce again.  Are
> > > > > you able to reproduce it consistently?
> > > > >
> > > >
> > > > No, I am also not able to reproduce it consistently but I think this
> > > > can fail if a subscriber sends the replay_location before actually
> > > > replaying the changes.  First, I thought that extra send_feedback we
> > > > have in apply_handle_stream_commit might have caused this but I guess
> > > > that can't happen because we need the commit time location for that
> > > > and we are storing the same at the end of apply_handle_stream_commit
> > > > after applying all messages.  I am not sure what is going on here.  I
> > > > think we somehow need to reproduce this or some variant of this test
> > > > consistently to find the root cause.
> > >
> > > And I think it appeared first time for me,  so maybe either induced
> > > from past few versions so some changes in the last few versions might
> > > have exposed it.  I have noticed that almost 50% of the time I am able
> > > to reproduce after the clean build so I can trace back from which
> > > version it started appearing that way it will be easy to narrow down.
> >
> > I think the reason for the failure is that we are not setting
> > remote_final_lsn, in the streaming mode.  I have put multiple logs and
> > executed in log and from logs it appeared that some of the logical wal
> > did not get replayed due to below check in
> > should_apply_changes_for_rel.
> > return (rel->state == SUBREL_STATE_READY || (rel->state ==
> > SUBREL_STATE_SYNCDONE && rel->statelsn <= remote_final_lsn));
> >
> > I still need to do the detailed analysis that why does this fail in
> > some cases,  basically, most of the time the rel->state is
> > SUBREL_STATE_READY so this check passes but whenever the state is
> > SUBREL_STATE_SYNCDONE it failed because we never update
> > remote_final_lsn.  I will try to set this value in
> > apply_handle_stream_commit and see whether it ever fails or not.
>
> I have verified that after setting the remote_final_lsn in the
> apply_handle_stream_commit, I don't see that regression failure in
> over 70 runs whereas without that change it failed 6 times in 50 runs.
> Apart from this, I have noticed one more thing related to the same
> point.  Basically, in the apply_handle_commit, we are calling
> process_syncing_tables whereas we are not calling the same in
> apply_handle_stream_commit.

I have set the remote_final_lsn as well as called
process_syncing_tables, in apply_handle_stream_commit.  Please see the
latest patch set v33.
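
For context, a fragment-level sketch of the fix described above, modeled on
what apply_handle_commit() already does; the helper name and the exact
placement are assumptions rather than necessarily what v33 contains:

    /* ... after parsing the stream-commit message into commit_data ... */

    /*
     * Set this before replaying the spooled changes so that
     * should_apply_changes_for_rel() sees the commit LSN, mirroring
     * apply_handle_commit().
     */
    remote_final_lsn = commit_data.commit_lsn;

    /* replay the changes spooled for this streamed transaction */
    apply_spooled_messages(xid, commit_data.commit_lsn);    /* hypothetical helper */

    /* As in apply_handle_commit(), let table synchronization workers progress. */
    process_syncing_tables(commit_data.end_lsn);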

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-14 Thread Dilip Kumar
On Mon, Jul 13, 2020 at 4:00 PM Amit Kapila  wrote:
>
> On Mon, Jul 13, 2020 at 3:04 PM Dilip Kumar  wrote:
> >
> > On Mon, Jul 13, 2020 at 2:56 PM Amit Kapila  wrote:
> > >
> > > On Mon, Jul 13, 2020 at 2:32 PM Dilip Kumar  wrote:
> > > >
> > > > On Mon, Jul 13, 2020 at 11:10 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > I think you can refer to commit message as well for that "We however
> > > > > must explicitly disable streaming replication during replication slot
> > > > > creation, even if the plugin supports it. We don't need to replicate
> > > > > the changes accumulated during this phase, and moreover, we don't have
> > > > > a replication connection open so we don't have where to send the data
> > > > > anyway.".  I don't think this is a good way to hack the streaming flag
> > > > > because for SQL API's, we don't have a good reason to disable the
> > > > > streaming in this way.  I guess if we had a condition related to
> > > > > reaching CONSISTENT snapshot during streaming then we won't need to
> > > > > hack the streaming flag in this way.  Once we reach the CONSISTENT
> > > > > snapshot state, we come out of the creation of a replication slot (see
> > > > > how we use DecodingContextReady to achieve that) phase.  So, I feel we
> > > > > should remove the ctx->streaming setting to false and add a CONSISTENT
> > > > > snapshot check during streaming unless you have a reason for not doing
> > > > > so.
> > > >
> > > > I was worried about the point that streaming on/off is sent by the
> > > > subscriber on START REPLICATION, not on CREATE REPLICATION SLOT, so if
> > > > we keep streaming on during create then it may not be right.
> > > >
> > >
> > > Then, how is that used on the publisher-side?  AFAICS, the streaming
> > > is enabled based on whether streaming callbacks are provided and we do
> > > that in 0003-Extend-the-logical-decoding-output-plugin-API-wi patch.
> >
> > Basically, first, we enable based on whether we have the callbacks or
> > not but later once we get the START REPLICATION command from the
> > subscriber then we set it to false if the streaming is not enabled
> > from the subscriber side.  You can refer below code in patch 0007.
> >
> > pgoutput_startup
> > {
> > parse_output_parameters(ctx->output_plugin_options,
> > >protocol_version,
> > - >publication_names);
> > + >publication_names,
> > + _streaming);
> > /* Check if we support requested protocol */
> > if (data->protocol_version > LOGICALREP_PROTO_VERSION_NUM)
> > @@ -222,6 +284,27 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> > OutputPluginOptions *opt,
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > errmsg("publication_names parameter missing")));
> > + /*
> > + * Decide whether to enable streaming. It is disabled by default, in
> > + * which case we just update the flag in decoding context. Otherwise
> > + * we only allow it with sufficient version of the protocol, and when
> > + * the output plugin supports it.
> > + */
> > + if (!enable_streaming)
> > + ctx->streaming = false;
> > }
> >
>
> Okay, in that case, we can do both enable and disable streaming in
> this function itself rather than allow the caller to later modify it.
> I suggest similarly we can enable/disable it for SQL API in
> pg_decode_startup via output_plugin_options.  This way it will look
> consistent for both SQL APIs and for command-based replication.  If we
> can do so, then probably adding an Assert for Consistent Snapshot
> while performing streaming should be okay.

Done this way in the latest patch set.
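
For readers following the SQL-API side of this, here is a minimal sketch of
what parsing such an option inside pg_decode_startup() could look like,
following the pattern test_decoding already uses for its other options.  The
option name ("stream-changes") and the exact placement are assumptions, not
necessarily what the patch set does:

    ListCell   *option;

    foreach(option, ctx->output_plugin_options)
    {
        DefElem    *elem = lfirst(option);

        if (strcmp(elem->defname, "stream-changes") == 0)
        {
            if (elem->arg == NULL)
                ctx->streaming = true;
            else if (!parse_bool(strVal(elem->arg), &ctx->streaming))
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                         errmsg("could not parse value \"%s\" for parameter \"%s\"",
                                strVal(elem->arg), elem->defname)));
        }
    }

With something like this, a SQL-level call such as
pg_logical_slot_get_changes() could opt in to streaming the same way the
existing test_decoding options are passed.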

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Seq Scan vs kernel read ahead

2020-07-14 Thread David Rowley
On Wed, 15 Jul 2020 at 14:51, Amit Kapila  wrote:
>
> On Wed, Jul 15, 2020 at 5:55 AM David Rowley  wrote:
> > If we've not seen any performance regressions within 1 week, then I
> > propose that we (pending final review) push this to allow wider
> > testing.
>
> I think Soumyadeep has reported a regression case [1] with the earlier
> version of the patch.  I am not sure if we have verified that the
> situation improves with the latest version of the patch.  I request
> Soumyadeep to please try once with the latest patch.

Yeah, it would be good to see some more data points on that test.
Jumping from 2 up to 6 workers just leaves us to guess where the
performance started to become bad. It would be good to know if the
regression is repeatable or if it was affected by some other process.

I see the disk type on that report was Google PersistentDisk. I don't
pretend to be any sort of expert on network filesystems, but I guess a
regression would be possible in that test case if say there was an
additional layer of caching of very limited size between the kernel
cache and the disks, maybe on a remote machine. If it were doing some
sort of prefetching to try to reduce latency and requests to the
actual disks then perhaps going up to 6 workers with 64 chunk size (as
Thomas' patch used at that time) caused more cache misses on that
cache due to the requests exceeding what had already been prefetched.
That's just a stab in the dark. Maybe someone with knowledge of these
network file systems can come up with a better theory.

It would be good to see EXPLAIN (ANALYZE, BUFFERS) with SET
track_io_timing = on; for each value of max_parallel_workers.

David

> [1] - 
> https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com




Re: pg_ls_tmpdir to show directories and shared filesets (and pg_ls_*)

2020-07-14 Thread Justin Pryzby
On Sun, Jun 21, 2020 at 08:53:25PM -0500, Justin Pryzby wrote:
> I'm still waiting to hear feedback from a committer if this is a good idea to
> put this into the system catalog.  Right now, ts_debug is the only nontrivial
> function.

I'm still missing feedback from committers about the foundation of this
approach.

But I finally looked into the pg_rewind test failure 

That led me to keep the "dir" as a separate column, since that's what's needed
there, and it's more convenient to have a separate column than to provide a
column needing to be parsed.

-- 
Justin
>From 29846d21841a76b7ebfbf5dfbeef36963bff545b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 16 Mar 2020 14:12:55 -0500
Subject: [PATCH v20 01/10] Document historic behavior of links to
 directories..

Backpatch to 9.5: pg_stat_file
---
 doc/src/sgml/func.sgml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 959f6a1c2f..7ef8c7a847 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25911,7 +25911,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory.
+indicating if it is a directory (or a symbolic link to a directory).


 This function is restricted to superusers by default, but other users
-- 
2.17.0

>From 971d6165400ee85fb0ec4c19ad36a8e1146cc382 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 30 Mar 2020 18:59:16 -0500
Subject: [PATCH v20 02/10] pg_stat_file and pg_ls_dir_* to use lstat()..

pg_ls_dir_* will now skip (no longer show) symbolic links, same as other
non-regular file types, as we advertise we do since 8b6d94cf6.  That seems to
be the intended behavior, since irregular file types are 1) less portable; and,
2) we don't currently show a file's type except for "bool is_dir".

pg_stat_file will now 1) show metadata of links themselves, rather than their
target; and, 2) specifically, show links to directories with "is_dir=false";
and, 3) not error on broken symlinks.
---
 doc/src/sgml/func.sgml  | 2 +-
 src/backend/utils/adt/genfile.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7ef8c7a847..959f6a1c2f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25911,7 +25911,7 @@ SELECT convert_from(pg_read_binary_file('file_in_utf8.txt'), 'UTF8');
 Returns a record containing the file's size, last access time stamp,
 last modification time stamp, last file status change time stamp (Unix
 platforms only), file creation time stamp (Windows only), and a flag
-indicating if it is a directory (or a symbolic link to a directory).
+indicating if it is a directory.


 This function is restricted to superusers by default, but other users
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index c1cc19d1f5..46d0977c2e 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -406,7 +406,7 @@ pg_stat_file(PG_FUNCTION_ARGS)
 
 	filename = convert_and_check_filename(filename_t);
 
-	if (stat(filename, ) < 0)
+	if (lstat(filename, ) < 0)
 	{
 		if (missing_ok && errno == ENOENT)
 			PG_RETURN_NULL();
@@ -632,7 +632,7 @@ pg_ls_dir_files(FunctionCallInfo fcinfo, const char *dir, bool missing_ok)
 
 		/* Get the file info */
 		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
-		if (stat(path, ) < 0)
+		if (lstat(path, ) < 0)
 		{
 			/* Ignore concurrently-deleted files, else complain */
 			if (errno == ENOENT)
-- 
2.17.0

>From 75c95441077969bcb6d0151de4a9224697442a23 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 17 Mar 2020 13:16:24 -0500
Subject: [PATCH v20 03/10] Add tests on pg_ls_dir before changing it

---
 src/test/regress/expected/misc_functions.out | 18 ++
 src/test/regress/sql/misc_functions.sql  |  5 +
 2 files changed, 23 insertions(+)

diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..2e87c548eb 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -201,6 +201,24 @@ select count(*) > 0 from
  t
 (1 row)
 
+select * from (select pg_ls_dir('.', false, true) as name) as ls where ls.name='.'; -- include_dot_dirs=true
+ name 
+--
+ .
+(1 row)
+
+select * from (select pg_ls_dir('.', false, false) as name) as ls where ls.name='.'; -- include_dot_dirs=false
+ name 
+--
+(0 rows)
+
+select pg_ls_dir('does not exist', true, false); -- ok with missingok=true
+ pg_ls_dir 
+---
+(0 rows)
+
+select pg_ls_dir('does not 

Re: Parallel Seq Scan vs kernel read ahead

2020-07-14 Thread Amit Kapila
On Wed, Jul 15, 2020 at 5:55 AM David Rowley  wrote:
>
> On Tue, 14 Jul 2020 at 19:13, Thomas Munro  wrote:
> >
> > On Fri, Jun 26, 2020 at 3:33 AM Robert Haas  wrote:
> > > On Tue, Jun 23, 2020 at 11:53 PM David Rowley  
> > > wrote:
> > > > In summary, based on these tests, I don't think we're making anything
> > > > worse in regards to synchronize_seqscans if we cap the maximum number
> > > > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > > > some argument for using something smaller than that for servers with
> > > > very little RAM, but I don't personally think so as it still depends
> > > > on the table size and It's hard to imagine tables in the hundreds of
> > > > GBs on servers that struggle with chunk allocations of 16MB.  The
> > > > table needs to be at least ~70GB to get a 8192 chunk size with the
> > > > current v2 patch settings.
> > >
> > > Nice research. That makes me happy. I had a feeling the maximum useful
> > > chunk size ought to be more in this range than the larger values we
> > > were discussing before, but I didn't even think about the effect on
> > > synchronized scans.
> >
> > +1.  This seems about right to me.  We can always reopen the
> > discussion if someone shows up with evidence in favour of a tweak to
> > the formula, but this seems to address the basic problem pretty well,
> > and also fits nicely with future plans for AIO and DIO.
>
> Thank you both of you for having a look at the results.
>
> I'm now pretty happy with this too. I do understand that we've not
> exactly exhaustively tested all our supported operating systems.
> However, we've seen some great speedups with Windows 10 and Linux with
> SSDs. Thomas saw great speedups with FreeBSD with the original patch
> using chunk sizes of 64 blocks. (I wonder if it's worth verifying that
> it increases further with the latest patch with the same test you did
> in the original email on this thread?)
>
> I'd like to propose that if anyone wants to do further testing on
> other operating systems with SSDs or HDDs then it would be good if
> that could be done within a 1 week from this email. There are various
> benchmarking ideas on this thread for inspiration.
>

Yeah, I agree it would be good if we could do what you said.

> If we've not seen any performance regressions within 1 week, then I
> propose that we (pending final review) push this to allow wider
> testing.

I think Soumyadeep has reported a regression case [1] with the earlier
version of the patch.  I am not sure if we have verified that the
situation improves with the latest version of the patch.  I request
Soumyadeep to please try once with the latest patch.

[1] - 
https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com
-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: GSSENC'ed connection stalls while reconnection attempts.

2020-07-14 Thread Kyotaro Horiguchi
At Tue, 14 Jul 2020 13:31:31 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Agreed to separate the change from this issue.  I also don't think
> that change in behavior dramatically improve the situation since we
> should have had a bunch of trouble when a write actually failed in the
> normal case.
> 
> I'm going to post a patch to change the comment of pqPacketSend.

So this is a proposal to add a description about the behavior on write
failure.  The last half of the addition is a copy from the comment of
pqFlush.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5e01ed1fccce083b1e1c13692d19474378c66db6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 15 Jul 2020 11:27:57 +0900
Subject: [PATCH] Explain pqPacketSend more precisely.

The comment explains that the function always returns STATUS_ERROR for
write failure, but this may not be the case. Add description to make
it clear that there is such a case.
---
 src/interfaces/libpq/fe-connect.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 7bee9dd201..af2818add4 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4505,7 +4505,9 @@ PQrequestCancel(PGconn *conn)
  * buf, buf_len: contents of message.  The given length includes only what
  * is in buf; the message type and message length fields are added here.
  *
- * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise.
+ * RETURNS: STATUS_ERROR if the write fails, STATUS_OK otherwise. Note that the
+ * function may return STATUS_OK even if it failed to write to the underlying
+ * socket. (See pqSendSome comments about how failure should be handled.)
  * SIDE_EFFECTS: may block.
  *
  * Note: all messages sent with this routine have a length word, whether
-- 
2.18.4



Re: Binary support for pgoutput plugin

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 22:28:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > What is the gain in having these checks? recv functions need to be safe
> > against arbitrary input, so a type crosscheck doesn't buy additional
> > safety in that regard. Not that a potential attacker couldn't just
> > change the content anyways?
> 
> You're confusing security issues with user-friendliness issues.
> Detecting that you sent the wrong type via an OID mismatch error
> is a lot less painful than trying to figure out why you've got
> errors along the line of "incorrect binary data format".

An oid mismatch error without knowing what that's about isn't very
helpful either.

How about adding an errcontext that shows the "source type oid", the
target type oid & type name and, for records, the column name of the
target table? That'd make this a lot easier to debug.
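
For illustration, such context could be attached with an error context
callback around the receive-function call; the struct and function names
below are made up, and the message wording is only loosely modeled on the
apply worker's existing "processing remote data ..." context:

typedef struct RecvTypeErrCallbackArg
{
    Oid         remote_typid;   /* type OID embedded in the incoming data */
    Oid         local_typid;    /* type OID of the local target column */
    const char *colname;        /* target column name, if known */
} RecvTypeErrCallbackArg;

static void
recv_type_error_callback(void *arg)
{
    RecvTypeErrCallbackArg *cbarg = (RecvTypeErrCallbackArg *) arg;

    errcontext("incoming data has type OID %u, local column \"%s\" is of type %s (OID %u)",
               cbarg->remote_typid,
               cbarg->colname ? cbarg->colname : "?",
               format_type_be(cbarg->local_typid),
               cbarg->local_typid);
}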

Greetings,

Andres Freund




Re: Is it useful to record whether plans are generic or custom?

2020-07-14 Thread Fujii Masao




On 2020/07/14 21:24, torikoshia wrote:

On 2020-07-10 10:49, torikoshia wrote:

On 2020-07-08 16:41, Fujii Masao wrote:

On 2020/07/08 10:14, torikoshia wrote:

On 2020-07-06 22:16, Fujii Masao wrote:

On 2020/06/11 14:59, torikoshia wrote:

On 2020-06-10 18:00, Kyotaro Horiguchi wrote:



+    TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_plan",

This could be a problem if we showed the last plan in this view. I
think "last_plan_type" would be better.

+    if (prep_stmt->plansource->last_plan_type == PLAN_CACHE_TYPE_CUSTOM)
+    values[7] = CStringGetTextDatum("custom");
+    else if (prep_stmt->plansource->last_plan_type == PLAN_CACHE_TYPE_GENERIC)
+    values[7] = CStringGetTextDatum("generic");
+    else
+    nulls[7] = true;

Using swith-case prevents future additional type (if any) from being
unhandled.  I think we are recommending that as a convension.


Thanks for your reviewing!

I've attached a patch that reflects your comments.


Thanks for the patch! Here are the comments.


Thanks for your review!


+    Number of times generic plan was choosen
+    Number of times custom plan was choosen

Typo: "choosen" should be "chosen"?


Thanks, fixed them.


+  
+   last_plan_type text
+  
+  
+    Tells the last plan type was generic or custom. If the prepared
+    statement has not executed yet, this field is null
+  

Could you tell me how this information is expected to be used?
I think that generic_plans and custom_plans are useful when investigating
the cause of performance drop by cached plan mode. But I failed to get
how much useful last_plan_type is.


This may be an exceptional case, but I once had a case where I needed
to check whether a generic or custom plan was chosen for specific
queries in a development environment.


In your case, probably you had to ensure that the last multiple (or every)
executions chose generic or custom plan? If yes, I'm afraid that displaying
only the last plan mode is not enough for your case. No?
So it seems better to check generic_plans or custom_plans columns in the
view rather than last_plan_type even in your case. Thought?


Yeah, I now feel last_plan is not so necessary and the numbers of
generic/custom plans are enough.

If there are no objections, I'm going to remove this column and related codes.


As mentioned, I removed last_plan column.


Thanks for updating the patch! It basically looks good to me.

I have one comment; you added the regression tests for generic and
custom plans into prepare.sql. But the similar tests already exist in
plancache.sql. So isn't it better to add the tests for generic_plans and
custom_plans columns, into plancache.sql?

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Amit Kapila
On Tue, Jul 14, 2020 at 3:37 PM Petr Jelinek  wrote:
>
> On 14/07/2020 11:36, Dilip Kumar wrote:
> > On Tue, Jul 14, 2020 at 2:47 PM Petr Jelinek  wrote:
> >>
> >> I am not sure if I can follow the discussion here very well, but if I
> >> understand correctly I'd like to clarify two things:
> >> - origin id does not change mid transaction as you can only have one per 
> >> xid
> >
> > Actually, I was talking about if someone changes the session origin
> > then which origin id we should send?  currently, we send data only
> > during the commit so we take the origin id from the commit wal and
> > send the same.  In the below example, I am inserting 2 records in a
> > transaction and each of them has different origin id.
> >
> > begin;
> > select pg_replication_origin_session_setup('o1');
> > insert into t values(1, 'test');
> > select pg_replication_origin_session_reset();
> > select pg_replication_origin_session_setup('o2');   --> Origin ID changed
> > insert into t values(2, 'test');
> > commit;
> >
>
> Commit record and commit_ts record will both include only 'o2', while
> individual DML WAL records will contain one or the other depending on
> when they were done.
>
> The origin API is really not really prepared for this situation
> (independently of streaming) because the origin lookup for all rows in
> that transaction will return 'o2', but decoding will decode whatever is
> in the DML WAL record.
>
> One can't even use this approach for sensible filtering as the ultimate
> faith of whole transaction is decided by what's in commit record since
> the filter callback only provides origin id, not record being processed
> so plugin can't differentiate. So it's hard to see how the above pattern
> could be used for anything but breaking things.
>

Fair enough, I think we can proceed with the assumption that it won't
change during the transaction and send origin_id along with the very
first *start* message during the streaming of in-progress
transactions.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WIP: BRIN multi-range indexes

2020-07-14 Thread Alexander Korotkov
On Mon, Jul 13, 2020 at 5:59 PM Tomas Vondra
 wrote:
> >> > If we really want to print something nicer, I'd say it needs to be a
> >> > special function in the BRIN opclass.
> >>
> >> If that can be done, then +1.  We just need to ensure that the function
> >> knows and can verify the type of index that the value comes from.  I
> >> guess we can pass the index OID so that it can extract the opclass from
> >> catalogs to verify.
> >
> >+1 from me, too. Perhaps we can have it as optional. If a BRIN opclass
> >doesn't have it, the 'values' can be null.
> >
>
> I'd say that if the opclass does not have it, then we should print the
> bytea value (or whatever the opclass uses to store the summary) using
> the type functions.

I've read the recent messages in this thread and I'd like to share my thoughts.

I think the way brin_page_items() displays values is not really
generic.  It uses a range-like textual representation of an array of
values, while that array doesn't necessarily have range semantics.

However, I think it's good that brin_page_items() uses a type output
function to display values.  So, it's not necessary to introduce a new
BRIN opclass function in order to get values displayed in a
human-readable way.  Instead, we could just make a standard of BRIN
value to be human readable.  I see at least two possibilities for
that.
1. Use standard container data types to represent BRIN values.  For
instance, we could use an array of ranges instead of bytea for
multirange.  I'm not sure how convenient/performant that would be, though.
2. Introduce a new data type to represent values in a BRIN index, and for
that type define an output function with user-readable output.  We did
similar things for GiST.  For instance, pg_trgm defines the gtrgm type,
which has no input and no output function.  But for a BRIN opclass we can
define a type with just an output function.

BTW, I've applied the patchset to the current master, but I got a lot
of duplicate OIDs.  Could you please resolve these conflicts?  I think
it would be good to use high OID numbers to avoid conflicts during
development/review, and rely on the committer to set final OIDs (as
discussed in [1]).

Links
1. 
https://www.postgresql.org/message-id/CAH2-WzmMTGMcPuph4OvsO7Ykut0AOCF_i-%3DeaochT0dd2BN9CQ%40mail.gmail.com

--
Regards,
Alexander Korotkov




Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-14 Thread Amit Langote
Hi Vignesh,

On Tue, Jul 14, 2020 at 10:23 PM vignesh C  wrote:
> On Tue, Jul 14, 2020 at 11:19 AM Amit Langote  wrote:
> > Sounds fine to me.  Although CopyLoadRawBuf() does not seem to be a
> > candidate for rigorous code optimization as it does not get called
> > that often.
>
> I thought we could include that change as we are making changes around
> that code.

Sure, done.

> Rest of the changes looked fine to me. Also I noticed that
> commit message was missing in the patch.

Please see the attached v7.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v7-0001-Improve-performance-of-binary-COPY-FROM-with-buff.patch
Description: Binary data


Re: Binary support for pgoutput plugin

2020-07-14 Thread Tom Lane
Andres Freund  writes:
> What is the gain in having these checks? recv functions need to be safe
> against arbitrary input, so a type crosscheck doesn't buy additional
> safety in that regard. Not that a potential attacker couldn't just
> change the content anyways?

You're confusing security issues with user-friendliness issues.
Detecting that you sent the wrong type via an OID mismatch error
is a lot less painful than trying to figure out why you've got
errors along the line of "incorrect binary data format".

regards, tom lane




Re: Binary support for pgoutput plugin

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 19:46:52 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > There's also send/receive functions that do not work across systems,
> > unfortunately :(. In particular record and array send functions embed
> > type oids and their receive functions verify that they match the local
> > system. Which basically means that if there's any difference in oid
> > assignment order between two systems that they will not allow to
> > send/recv such data between them :(.
> 
> It's not a problem particularly for built-in types, but I agree
> there's an issue for extension types.

I'm not so sure. That's true for builtin types within a single major
version, but not necessarily across major versions. Not that I can
immediately recall cases where we renumbered type oids.

It also assumes that the type specification exactly matches between the
source / target system. It's probably not a great idea to try to use
send/recv for meaningfully different types, but it doesn't seem too crazy
to, e.g., allow changing varchar to text while doing a major version
upgrade over logical rep.


What is the gain in having these checks? recv functions need to be safe
against arbitrary input, so a type crosscheck doesn't buy additional
safety in that regard. Not that a potential attacker couldn't just
change the content anyways?


> > I've several times suggested that we should remove those type checks in
> > recv, as they afaict don't provide any actual value. But unfortunately
> > there hasn't been much response to that. See e.g.
> > https://postgr.es/m/20160426001713.hbqdiwvf4mkzkg55%40alap3.anarazel.de
> 
> Maybe we could compromise by omitting the check if both OIDs are
> outside the built-in range?

Hm. That'd be a lot better than the current situation. So I'd definitely
go for that if that's what we can agree on.
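
To make the compromise concrete, a sketch of what the relaxed check might
look like, modeled loosely on array_recv()'s element-type check
(FirstNormalObjectId being the existing boundary below which OIDs are
reserved for objects created at initdb time):

    /*
     * Only insist on an exact match when at least one of the OIDs is in the
     * built-in range; a mismatch between two runtime-assigned (extension)
     * OIDs is tolerated, since their numbering can differ across
     * installations.
     */
    if (element_type != spec_element_type &&
        (element_type < FirstNormalObjectId ||
         spec_element_type < FirstNormalObjectId))
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("wrong element type")));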

Greetings,

Andres Freund




Re: Parallel Seq Scan vs kernel read ahead

2020-07-14 Thread David Rowley
On Tue, 14 Jul 2020 at 19:13, Thomas Munro  wrote:
>
> On Fri, Jun 26, 2020 at 3:33 AM Robert Haas  wrote:
> > On Tue, Jun 23, 2020 at 11:53 PM David Rowley  wrote:
> > > In summary, based on these tests, I don't think we're making anything
> > > worse in regards to synchronize_seqscans if we cap the maximum number
> > > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > > some argument for using something smaller than that for servers with
> > > very little RAM, but I don't personally think so as it still depends
> > > on the table size and It's hard to imagine tables in the hundreds of
> > > GBs on servers that struggle with chunk allocations of 16MB.  The
> > > table needs to be at least ~70GB to get a 8192 chunk size with the
> > > current v2 patch settings.
> >
> > Nice research. That makes me happy. I had a feeling the maximum useful
> > chunk size ought to be more in this range than the larger values we
> > were discussing before, but I didn't even think about the effect on
> > synchronized scans.
>
> +1.  This seems about right to me.  We can always reopen the
> discussion if someone shows up with evidence in favour of a tweak to
> the formula, but this seems to address the basic problem pretty well,
> and also fits nicely with future plans for AIO and DIO.

Thank you both of you for having a look at the results.

I'm now pretty happy with this too. I do understand that we've not
exactly exhaustively tested all our supported operating systems.
However, we've seen some great speedups with Windows 10 and Linux with
SSDs. Thomas saw great speedups with FreeBSD with the original patch
using chunk sizes of 64 blocks. (I wonder if it's worth verifying that
it increases further with the latest patch with the same test you did
in the original email on this thread?)

I'd like to propose that if anyone wants to do further testing on
other operating systems with SSDs or HDDs then it would be good if
that could be done within a 1 week from this email. There are various
benchmarking ideas on this thread for inspiration.

If we've not seen any performance regressions within 1 week, then I
propose that we (pending final review) push this to allow wider
testing. It seems we're early enough in the PG14 cycle that there's a
large window of time for us to do something about any reported
performance regressions that come in.

I also have in mind that Amit was keen to see a GUC or reloption to
allow users to control this. My thoughts on that are still that it
would be possible to craft a case where we scan an entire heap to get
a very small number of rows that are all located in the same area in
the table and then call some expensive function on those rows. The
chunk size ramp down code will help reduce the chances of one worker
running on much longer than its co-workers, but not eliminate the
chances.  Even the code as it stands today could suffer from this to a
lesser extent if all the matching rows are on a single page. My
current thoughts are that this just seems unlikely and that the
granularity of 1 block for cases like this was never that great
anyway. I suppose a more ideal plan shape would "Distribute" matching
rows to allow another set of workers to pick these rows up one-by-one
and process them. Our to-date lack of such an operator probably counts
a little towards the fact that one parallel worker being tied up with
a large amount of work is not that common.  Based on those thoughts,
I'd like to avoid any GUC/reloption until we see evidence that it's
really needed.

Any objections to any of the above?

David




Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2020-07-14 Thread Justin Pryzby
Added here:
https://commitfest.postgresql.org/29/2644/

And updated tests to pass following:
|commit 689696c7110f148ede8004aae50d7543d05b5587
|Author: Tom Lane 
|Date:   Tue Jul 14 18:56:49 2020 -0400
|
|Fix bitmap AND/OR scans on the inside of a nestloop partition-wise join.
>From 5c323ffaa8c76e91fff6c9c77019c6d6ad4c627c Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik 
Date: Fri, 12 Oct 2018 15:53:51 +0300
Subject: [PATCH v2 1/2] Secondary index access optimizations

---
 .../postgres_fdw/expected/postgres_fdw.out|   8 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |   2 +-
 src/backend/optimizer/path/allpaths.c |   2 +
 src/backend/optimizer/util/plancat.c  |  45 ++
 src/include/optimizer/plancat.h   |   3 +
 src/test/regress/expected/create_table.out|  14 +-
 src/test/regress/expected/inherit.out | 123 ++--
 .../regress/expected/partition_aggregate.out  |  10 +-
 src/test/regress/expected/partition_join.out  |  42 +-
 src/test/regress/expected/partition_prune.out | 571 ++
 src/test/regress/expected/rowsecurity.out |  12 +-
 src/test/regress/expected/update.out  |   4 +-
 12 files changed, 315 insertions(+), 521 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 90db550b92..dbbae1820e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -629,12 +629,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu
Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL))
 (3 rows)
 
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
- QUERY PLAN  
--
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
+QUERY PLAN
+--
  Foreign Scan on public.ft1 t1
Output: c1, c2, c3, c4, c5, c6, c7, c8
-   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL))
+   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL))
 (3 rows)
 
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..08aef9289e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -304,7 +304,7 @@ RESET enable_nestloop;
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1;  -- OpExpr(l)
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!;   -- OpExpr(r)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..a9171c075c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -387,6 +387,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		switch (rel->rtekind)
 		{
 			case RTE_RELATION:
+remove_restrictions_implied_by_constraints(root, rel, rte);
 if (rte->relkind == RELKIND_FOREIGN_TABLE)
 {
 	/* Foreign table */
@@ -1040,6 +1041,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			set_dummy_rel_pathlist(childrel);
 			continue;
 		}
+		remove_restrictions_implied_by_constraints(root, childrel, childRTE);
 
 		/*
 		 * Constraint exclusion failed, so copy the parent's join quals and
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 25545029d7..45cd72a0fe 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1557,6 +1557,51 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	return false;
 }
 
+/*
+ * Remove from restrictions list items implied by table constraints
+ */
+void remove_restrictions_implied_by_constraints(PlannerInfo *root,
+RelOptInfo *rel, RangeTblEntry *rte)
+{
+	List	   

Refactoring relkind as an enum

2020-07-14 Thread Mark Dilger



> On Jul 14, 2020, at 4:12 PM, Alvaro Herrera  wrote:
> 
> This patch is way too large.  Probably in part explained by the fact
> that you seem to have run pgindent and absorbed a lot of unrelated
> changes.  This makes the patch essentially unreviewable.

I did not run pgindent, but when changing

if (relkind == RELKIND_INDEX)
{
/* foo */
}

to

switch (relkind)
{
case RELKIND_INDEX:
/* foo */
}

the indentation of /* foo */ changes.  For large foo, that results in a lot of 
lines.  There are also cases in the code where comparisons of multiple 
variables are mixed together.  To split those out into switch/case statements I 
had to rearrange some of the code blocks.

> I think you should define a RelationGetRelkind() static function that
> returns the appropriate datatype without requiring a cast and assert in
> every single place that processes a relation's relkind.  Similarly
> you've chosen to leave get_rel_relkind untouched, but that seems unwise.

I was well aware of how large the patch had gotten, and didn't want to add 
more

> I think the chr_ macros are pointless.

Look more closely at the #define RelKindAsString(x) CppAsString2(chr_##x)

> Reading back the thread, it seems that the whole point of your patch was
> to change the tests that currently use 'if' tests to switch blocks.  I
> cannot understand what's the motivation for that,

There might not be sufficient motivation to make the patch worth doing.  The 
motivation was to leverage the project's recent addition of -Wswitch to make it 
easier to know which code needs updating when you add a new relkind.  That 
doesn't happen very often, but I was working towards that kind of thing, and 
thought this might be a good prerequisite patch for that work.  Stylistically, 
I also prefer

+   switch ((RelKind) rel->rd_rel->relkind)
+   {
+   case RELKIND_RELATION:
+   case RELKIND_MATVIEW:
+   case RELKIND_TOASTVALUE:

over 

-   if (rel->rd_rel->relkind == RELKIND_RELATION ||
- rel->rd_rel->relkind == RELKIND_MATVIEW ||
- rel->rd_rel->relkind == RELKIND_TOASTVALUE)

which is a somewhat common pattern in the code.  It takes less mental effort to 
see that only one variable is being compared against those three enum values.  
In some cases, though not necessarily this exact example, it also *might* save 
duplicated work computing the variable, depending on the situation and what the 
compiler can optimize away.

> but it appears to me
> that the approach is backwards: I'd counsel to *first* change the APIs
> (get_rel_relkind and defining an enum, plus adding RelationGetRelkind)
> so that everything is more sensible and safe, including appropriate
> answers for the places where an "invalid" relkind is returned;

Ok.

> and once
> that's in place, replace if tests with switch blocks where it makes
> sense to do so.

Ok.

> 
> Also, I suggest that this thread is not a good one for this patch.
> Subject is entirely not appropriate.  Open a new thread perhaps?

I've changed the subject line.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Cache lookup errors with functions manipulation object addresses

2020-07-14 Thread Michael Paquier
On Tue, Jul 07, 2020 at 11:08:30AM -0400, Alvaro Herrera wrote:
> That's a holdover from old times, when we thought functions were
> procedures.  That's no longer the case.

Thanks, so "routine" it is.  I have done an extra round of review of
this patch, and noticed two things:
- In getObjectDescription(), we were doing a call to get_attname() for
nothing with OCLASS_CLASS with an attribute.
- The regression test output has been changed to use \a\t to make
future diffs more readable if we add an object type that increases the
column size.

And applied the change.  Thanks to everybody who took the time to look
at this code and comment about it.  It actually took less than 3 years
for this thread to conclude, as it began on the 19th of July, 2017.
--
Michael


signature.asc
Description: PGP signature


Re: Binary support for pgoutput plugin

2020-07-14 Thread Tom Lane
Andres Freund  writes:
> There's also send/receive functions that do not work across systems,
> unfortunately :(. In particular record and array send functions embed
> type oids and their receive functions verify that they match the local
> system. Which basically means that if there's any difference in oid
> assignment order between two systems that they will not allow to
> send/recv such data between them :(.

It's not a problem particularly for built-in types, but I agree
there's an issue for extension types.

> I've several times suggested that we should remove those type checks in
> recv, as they afaict don't provide any actual value. But unfortunately
> there hasn't been much response to that. See e.g.
> https://postgr.es/m/20160426001713.hbqdiwvf4mkzkg55%40alap3.anarazel.de

Maybe we could compromise by omitting the check if both OIDs are
outside the built-in range?

regards, tom lane




Re: Postgres is not able to handle more than 4k tables!?

2020-07-14 Thread Alvaro Herrera
On 2020-Jul-10, Konstantin Knizhnik wrote:

> @@ -3076,6 +3080,10 @@ relation_needs_vacanalyze(Oid relid,
>   instuples = tabentry->inserts_since_vacuum;
>   anltuples = tabentry->changes_since_analyze;
>  
> + rel = RelationIdGetRelation(relid);
> + oldestXmin = TransactionIdLimitedForOldSnapshots(GetOldestXmin(rel, PROCARRAY_FLAGS_VACUUM), rel);
> + RelationClose(rel);

*cough*

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-07-14 Thread Alvaro Herrera
This patch is way too large.  Probably in part explained by the fact
that you seem to have run pgindent and absorbed a lot of unrelated
changes.  This makes the patch essentially unreviewable.

I think you should define a RelationGetRelkind() static function that
returns the appropriate datatype without requiring a cast and assert in
every single place that processes a relation's relkind.  Similarly
you've chosen to leave get_rel_relkind untouched, but that seems unwise.
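
Something along these lines, presumably; the RelKind enum is the one the
patch introduces, and the validity macro is hypothetical:

static inline RelKind
RelationGetRelkind(Relation rel)
{
    RelKind     relkind = (RelKind) rel->rd_rel->relkind;

    Assert(IsValidRelKind(relkind));    /* hypothetical validity check */
    return relkind;
}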

I think the chr_ macros are pointless.

Reading back the thread, it seems that the whole point of your patch was
to change the tests that currently use 'if' tests to switch blocks.  I
cannot understand what's the motivation for that, but it appears to me
that the approach is backwards: I'd counsel to *first* change the APIs
(get_rel_relkind and defining an enum, plus adding RelationGetRelkind)
so that everything is more sensible and safe, including appropriate
answers for the places where an "invalid" relkind is returned; and once
that's in place, replace if tests with switch blocks where it makes
sense to do so.

Also, I suggest that this thread is not a good one for this patch.
Subject is entirely not appropriate.  Open a new thread perhaps?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Postgres is not able to handle more than 4k tables!?

2020-07-14 Thread Bruce Momjian
On Fri, Jul 10, 2020 at 02:10:20AM +, tsunakawa.ta...@fujitsu.com
wrote:
> > There were many proposed patches which help to improve this
> > situation.  But as far as this patches increase performance only
> > at huge servers with large number of cores and show almost no
> > improvement (or even some degradation) at standard 4-cores desktops,
> > almost none of them were committed.  Consequently our customers have
> > a lot of troubles trying to replace Oracle with Postgres and provide
> > the same performance at same (quite good and expensive) hardware.
>
> Yeah, it's a pity that the shiny-looking patches from Postgres Pro
> (mostly from Konstantin san?) -- autoprepare, built-in connection
> pooling, fair lwlock, and revolutionary multi-threaded backend --
> haven't gained hot attention.

Yeah, it is probably time for us to get access to a current large-scale
machine again and really find the bottlenecks.  We seem to next this
every few years.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Peter Geoghegan
On Tue, Jul 14, 2020 at 12:46 PM Robert Haas  wrote:
> - I thought the problem we were trying to solve here was that, in v12,
> if the planner thinks that your hashagg will fit in memory when really
> it doesn't, you will get good performance because we'll cheat; in v13,
> you'll get VERY bad performance because we won't.

That is the problem we started out with. I propose to solve a broader
problem that I believe mostly encompasses the original problem (it's
an "inventor's paradox" situation). Although the exact degree to which
it truly addresses the original problem will vary across
installations, I believe that it will go a very long way towards
cutting down on problems for users upgrading to Postgres 13 generally.

> - So, if hash_mem_multiplier affects both planning and execution, it
> doesn't really solve the problem. Neither does adjusting the existing
> work_mem setting. Imagine that you have two queries. The planner
> thinks Q1 will use 1GB of memory for a HashAgg but it will actually
> need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will
> actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick
> a HashAgg and perform terribly when it spills. Q2 will pick a
> GroupAggregate which will be OK but not great. If you plan with a 2GB
> memory limit, Q1 will pick a HashAgg and will not spill so now it will
> be in great shape. But Q2 will pick a HashAgg and then spill so it
> will stink. Oops.

Maybe I missed your point here. The problem is not so much that we'll
get HashAggs that spill -- there is nothing intrinsically wrong with
that. While it's true that the I/O pattern is not as sequential as a
similar group agg + sort, that doesn't seem like the really important
factor here. The really important factor is that in-memory HashAggs
can be blazingly fast relative to *any* alternative strategy -- be it
a HashAgg that spills, or a group aggregate + sort that doesn't spill,
whatever. We're more concerned about keeping the one available fast
strategy than we are about getting a new, generally slow strategy.

There will be no problems at all unless and until we're short on
memory, because you can just increase work_mem and everything works
out, regardless of the details. Obviously the general problems we
anticipate only crop up when increasing work_mem stops being a viable
DBA strategy.

By teaching the system to have at least a crude appreciation of the
value of memory when hashing vs when sorting, the system is often able
to give much more memory to Hash aggs (and hash joins). Increasing
hash_mem_multiplier (maybe while also decreasing work_mem) will be
beneficial when we take memory from things that don't really need so
much, like sorts (or even CTE tuplestores) -- we reduce the memory
pressure without paying a commensurate price in system throughput
(maybe even only a very small hit). As a bonus, everything going
faster may actually *reduce* the memory usage for the system as a
whole, even as individual queries use more memory.
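
To make that concrete, one possible tuning sketch (hash_mem_multiplier is
only a proposal at this point, so the name and semantics may still change):

SET work_mem = '64MB';            -- keep sorts, CTE tuplestores, etc. on a modest budget
SET hash_mem_multiplier = 4.0;    -- but let hash aggs/hash joins use up to 4x that before spilling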

Under this scheme, it may well not matter that you cannot cheat
(Postgres 12 style) anymore, because you'll be able to use the memory
that is available sensibly -- regardless of whether or not the group
estimates are very good (we have to have more than zero faith in the
estimates -- they can be bad without being terrible). Maybe no amount
of tuning can ever restore the desirable Postgres 12 performance
characteristics you came to rely on, but remaining "regressions" are
probably cases where the user was flying pretty close to the sun
OOM-wise all along. They may have been happy with Postgres 12, but at
a certain point that really is something that you have to view as a
fool's paradise, even if like me you happen to be a memory Keynesian.

Really big outliers tend to be rare and therefore something that the
user can afford to have go slower. It's the constant steady stream of
medium-sized hash aggs that we mostly need to worry about. To the
extent that that's true, hash_mem_multiplier addresses the problem on
the table.

> - An escape hatch that prevents spilling at execution time *does*
> solve this problem, but now suppose we add a Q3 which the planner
> thinks will use 512MB of memory but at execution time it will actually
> consume 512GB due to the row count estimate being 1024x off. So if you
> enable the escape hatch to get back to a situation where Q1 and Q2
> both perform acceptably, then Q3 makes your system OOM.

Right. Nothing stops these two things from being true at the same time.

> - If you were to instead introduce a GUC like what I proposed before,
> which allows the execution-time memory usage to exceed what was
> planned, but only by a certain margin, then you can set
> hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
> Now, no matter how you set work_mem, you're fine. Depending on the
> value you choose for work_mem, you may get group aggregates for some
> of the queries. If you set it large enough that you get 

Re: [patch] demote

2020-07-14 Thread Jehan-Guillaume de Rorthais
On Tue, 14 Jul 2020 12:49:51 -0700
Andres Freund  wrote:

> Hi,
> 
> On 2020-07-14 17:26:37 +0530, Amul Sul wrote:
> > On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
> >  wrote:  
> > >
> > > Hi,
> > >
> > > Another summary + patch + tests.
> > >
> > > This patch supports 2PC. The goal is to keep them safe during
> > > demote/promote actions so they can be committed/rollbacked later on a
> > > primary. See tests. 
> > 
> > Wondering is it necessary to clear prepared transactions from shared memory?
> > Can't simply skip clearing and restoring prepared transactions while
> > demoting?  
> 
> Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
> transaction state. So I don't think it'd be correct to leave them around
> in their previous state. We'd likely end up with incorrect snapshots
> if a demoted node later gets promoted...

Indeed. I experienced it while debugging. PGXACT/PGPROC/locks need to
be cleared.




Re: SQL/JSON: functions

2020-07-14 Thread Andrew Dunstan


On 7/14/20 1:00 PM, Andrew Dunstan wrote:
> On 7/5/20 1:29 PM, Justin Pryzby wrote:
>> On Mon, Mar 23, 2020 at 08:28:52PM +0300, Nikita Glukhov wrote:
>>> Attached 47th version of the patches.
>> The patch checker/cfbot says this doesn't apply ; could you send a rebasified
>> version ?
>>
> To keep things moving, I've rebased these patches. However, 1) the docs
> patches use  in many cases where they
> should now just use  and 2) patch 4 fails when run under
> force_parallel_mode=regress.
>
>


Turns out these patches also need to get the message on the new way of
writing entries in func.sgml - I'll publish some updates on that in the
next day or so, so that "make doc" will succeed.


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Rethinking opclass member checks and dependency strength

2020-07-14 Thread Anastasia Lubennikova

On 31.03.2020 23:45, Tom Lane wrote:

I wrote:

Still haven't got a better naming idea, but in the meantime here's
a rebase to fix a conflict with 612a1ab76.


Maybe "amadjustmembers" will work?

I've looked through the patch and noticed this comment:

+            default:
+                /* Probably we should throw error here */
+                break;

I suggest adding an ERROR or maybe Assert, so that future developers
wouldn't forget about setting dependencies. Other than that, the patch
looks good to me.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Multi-byte character case-folding

2020-07-14 Thread Bruce Momjian
On Mon, Jul  6, 2020 at 08:32:22PM -0400, Tom Lane wrote:
> Yes, a GUC changing this would be a headache.  It would be just as much of
> a compatibility and security hazard as standard_conforming_strings (which
> indeed I've been thinking of proposing that we get rid of; it's hung
> around long enough).

+1

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Robert Haas
On Tue, Jul 14, 2020 at 3:42 PM Andres Freund  wrote:
> The "found xmin ... from before relfrozenxid ..." cases should all be
> fixable without needing such a function, and without it making fixing
> them significantly easier, no? As far as I understand your suggested
> solution, you need the tid(s) of these tuples, right? If you have those,
> I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
>  or something like that.
>
> ISTM that the hard part is finding all problematic tuples in an
> efficient manner (i.e. that doesn't require one manual VACUUM for each
> individual block + parsing VACUUMs error message), not "fixing" those
> tuples.

I haven't tried the INSERT ... DELETE approach, but I've definitely
seen a case where a straight UPDATE did not fix the problem; VACUUM
continued failing afterwards. In that case, it was a system catalog
that was affected, and not one where TRUNCATE + re-INSERT was remotely
practical. The only solution I could come up with was to drop the
database and recreate it. Fortunately in that case the affected
database didn't seem to have any actual data in it, but if it had been
a 1TB database I think we would have been in really bad trouble.

Do you have a reason for believing that INSERT ... DELETE is going to
be better than UPDATE? It seems to me that either way you can end up
with a deleted and thus invisible tuple that you still can't get rid
of.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Binary support for pgoutput plugin

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 14:08:53 -0400, Dave Cramer wrote:
> On Tue, 14 Jul 2020 at 12:59, Tom Lane  wrote:
> 
> > So I started looking through this seriously, and my first question
> > is why do the docs and code keep saying that "base types" are sent
> > in binary?  Why not just "data"?  Are there any cases where we
> > don't use binary format, if the subscription requests it?
> >
> > If there's not a concrete reason to use that terminology,
> > I'd rather flush it, because it seems confusing.
> >
> 
> Well for some reason I thought there were some types that did not have send
> and receive functions.

There's also send/receive functions that do not work across systems,
unfortunately :(. In particular record and array send functions embed
type oids and their receive functions verify that they match the local
system. Which basically means that if there's any difference in oid
assignment order between two systems that they will not allow to
send/recv such data between them :(.


I suspect that is what those comments might have been referring to?


I've several times suggested that we should remove those type checks in
recv, as they afaict don't provide any actual value. But unfortunately
there hasn't been much response to that. See e.g.

https://postgr.es/m/20160426001713.hbqdiwvf4mkzkg55%40alap3.anarazel.de

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Alvaro Herrera
On 2020-Jul-14, Andres Freund wrote:

> Hi,
> 
> On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:

> > Just having the block number is already a tremendous step forward; with
> > that you can ask the customer to set a pageinspect dump of tuple
> > headers, and then the problem is obvious.  Now if you want to add block
> > number to that, by all means do so.
> 
> offset number I assume?

Eh, yeah, that.

> > One useful thing to do is to mark a tuple frozen unconditionally if it's
> > marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> > The attached (for 9.6) does that; IIRC it would have helped in a couple
> > of cases.
> 
> I think it might also have hidden corruption in at least one case where
> we subsequently fixed a bug (and helped detect at least one unfixed
> bug). That should only be possible if either required clog has been
> removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Yes, that's precisely the reason I never submitted it :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [patch] demote

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 17:26:37 +0530, Amul Sul wrote:
> On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
>  wrote:
> >
> > Hi,
> >
> > Another summary + patch + tests.
> >
> > This patch supports 2PC. The goal is to keep them safe during demote/promote
> > actions so they can be committed/rollbacked later on a primary. See tests.
> >
> 
> Wondering is it necessary to clear prepared transactions from shared memory?
> Can't simply skip clearing and restoring prepared transactions while demoting?

Recovery doesn't use the normal PGXACT/PGPROC mechanisms to store
transaction state. So I don't think it'd be correct to leave them around
in their previous state. We'd likely end up with incorrect snapshots
if a demoted node later gets promoted...

Greetings,

Andres Freund




Re: Default setting for enable_hashagg_disk

2020-07-14 Thread Robert Haas
On Mon, Jul 13, 2020 at 2:50 PM Peter Geoghegan  wrote:
> Primarily in favor of escape hatch:
>
> Jeff,
> DavidR,
> Pavel,
> Andres,
> Robert ??,
> Amit ??
>
> Primarily in favor of hash_mem/hash_mem_multiplier:
>
> PeterG,
> Tom,
> Alvaro,
> Tomas,
> Justin,
> DavidG,
> Jonathan Katz
>
> There are clear problems with this summary, including for example the
> fact that Robert weighed in before the hash_mem/hash_mem_multiplier
> proposal was even on the table. What he actually said about it [1]
> seems closer to hash_mem, so I feel that putting him in that bucket is
> a conservative assumption on my part. Same goes for Amit, who warmed
> to the idea of hash_mem_multiplier recently. (Though I probably got
> some detail wrong, in which case please correct me.)

My view is:

- I thought the problem we were trying to solve here was that, in v12,
if the planner thinks that your hashagg will fit in memory when really
it doesn't, you will get good performance because we'll cheat; in v13,
you'll get VERY bad performance because we won't.

- So, if hash_mem_multiplier affects both planning and execution, it
doesn't really solve the problem. Neither does adjusting the existing
work_mem setting. Imagine that you have two queries. The planner
thinks Q1 will use 1GB of memory for a HashAgg but it will actually
need 2GB. It thinks Q2 will use 1.5GB for a HashAgg but it will
actually need 3GB. If you plan using a 1GB memory limit, Q1 will pick
a HashAgg and perform terribly when it spills. Q2 will pick a
GroupAggregate which will be OK but not great. If you plan with a 2GB
memory limit, Q1 will pick a HashAgg and will not spill so now it will
be in great shape. But Q2 will pick a HashAgg and then spill so it
will stink. Oops.

- An escape hatch that prevents spilling at execution time *does*
solve this problem, but now suppose we add a Q3 which the planner
thinks will use 512MB of memory but at execution time it will actually
consume 512GB due to the row count estimate being 1024x off. So if you
enable the escape hatch to get back to a situation where Q1 and Q2
both perform acceptably, then Q3 makes your system OOM.

- If you were to instead introduce a GUC like what I proposed before,
which allows the execution-time memory usage to exceed what was
planned, but only by a certain margin, then you can set
hash_mem_execution_overrun_multiplier_thingy=2.5 and call it a day.
Now, no matter how you set work_mem, you're fine. Depending on the
value you choose for work_mem, you may get group aggregates for some
of the queries. If you set it large enough that you get hash
aggregates, then Q1 and Q2 will avoid spilling (which works but is
slow) because the overrun is less than 2x. Q3 will spill, so you won't
OOM. Wahoo!

- I do agree in general that it makes more sense to allow
hash_work_mem > sort_work_mem, and even to make that the default.
Allowing the same budget for both is unreasonable, because I think we
have good evidence that inadequate memory has a severe impact on
hashing operations but usually only a fairly mild effect on sorting
operations, except in the case where the underrun is severe. That is,
if you need 1GB of memory for a sort and you only get 768MB, the
slowdown is much much less severe than if the same thing happens for a
hash. If you have 10MB of memory, both are going to suck, but that's
kinda unavoidable.

- If you hold my feet to the fire and ask me to choose between a
Boolean escape hatch (rather than a multiplier-based one) and
hash_mem_multiplier, gosh, I don't know. I guess the Boolean escape
hatch? I mean it's a pretty bad solution, but at least if I have that
I can get both Q1 and Q2 to perform well at the same time, and I guess
I'm no worse off than I was in v12. The hash_mem_multiplier thing,
assuming it affects both planning and execution, seems like a very
good idea in general, but I guess I don't see how it helps with this
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 07:51:27 -0400, Robert Haas wrote:
> On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
>  wrote:
> > In the meantime, if you're wizard enough to deal with this kind of
> > thing, you could also clone the module from the PG14 tree and build it
> > against older versions manually.
> 
> But what if you are NOT a wizard, and a wizard is giving you
> directions? Then having to build from source is a real pain. And
> that's normally the situation I'm in when a customer has this issue.

The "found xmin ... from before relfrozenxid ..." cases should all be
fixable without needing such a function, and without it making fixing
them significantly easier, no? As far as I understand your suggested
solution, you need the tid(s) of these tuples, right? If you have those,
I don't think it's meaningfully harder to INSERT ... DELETE WHERE ctid =
 or something like that.

ISTM that the hard part is finding all problematic tuples in an
efficient manner (i.e. that doesn't require one manual VACUUM for each
individual block + parsing VACUUMs error message), not "fixing" those
tuples.
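
For concreteness, the per-tuple fix being described is roughly the
following, with the table name and tid as placeholders for whatever the
error message points at:

INSERT INTO some_table            -- re-insert a copy of the row, which gets a fresh xmin
    SELECT * FROM some_table WHERE ctid = '(123,7)';
DELETE FROM some_table            -- then get rid of the damaged original by its tid
    WHERE ctid = '(123,7)';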

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-14 13:20:25 -0400, Alvaro Herrera wrote:
> On 2020-Jul-13, Andres Freund wrote:
> 
> > Hi,
> > 
> > On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > > 1. There's nothing to identify the tuple that has the problem, and no
> > > way to know how many more of them there might be. Back-patching
> > > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > > part of this.
> > 
> > Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> > pointer offset, just the block number, right? We probably should extend
> > it to also include the offset...
> 
> Just having the block number is already a tremendous step forward; with
> that you can ask the customer to set a pageinspect dump of tuple
> headers, and then the problem is obvious.  Now if you want to add block
> number to that, by all means do so.

offset number I assume?


> One useful thing to do is to mark a tuple frozen unconditionally if it's
> marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
> The attached (for 9.6) does that; IIRC it would have helped in a couple
> of cases.

I think it might also have hidden corruption in at least one case where
we subsequently fixed a bug (and helped detect at least one unfixed
bug). That should only be possible if either required clog has been
removed, or if relfrozenxid/datfrozenxid are corrupt, right?

Greetings,

Andres Freund




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-13 21:18:10 -0400, Robert Haas wrote:
> On Mon, Jul 13, 2020 at 9:10 PM Andres Freund  wrote:
> > > What if clog has been truncated so that the xmin can't be looked up?
> >
> > That's possible, but probably only in cases where xmin actually
> > committed.
> 
> Isn't that the normal case? I'm imagining something like:
> 
> - Tuple gets inserted. Transaction commits.
> - VACUUM processes table.
> - Mischievous fairies mark page all-visible in the visibility map.
> - VACUUM runs lots more times, relfrozenxid advances, but without ever
> looking at the page in question, because it's all-visible.
> - clog is truncated, rendering xmin no longer accessible.
> - User runs VACUUM disabling page skipping, gets ERROR.
> - User deletes offending tuple.
> - At this point, I think the tuple is both invisible and unprunable?
> - Fairies happy, user sad.

I'm not saying it's impossible that that happens, but the cases I did
investigate didn't look like this. If something just roguely wrote to
the VM I'd expect a lot more "is not marked all-visible but visibility
map bit is set in relation" type WARNINGs, and I've not seen much of
those (they're WARNINGs though, so maybe we wouldn't). Presumably this
wouldn't always just happen with tuples that'd trigger an error first
during hot pruning.

I've definitely seen indications of both datfrozenxid and relfrozenxid
getting corrupted (in particular vac_update_datfrozenxid being racy as
hell), xid wraparound, indications of multixact problems (although it's
possible we've now fixed those) and some signs of corrupted relcache
entries for shared relations leading to vacuums being skipped.

Greetings,

Andres Freund




Re: [HACKERS] PATCH: Batch/pipelining support for libpq

2020-07-14 Thread Andres Freund
Hi,

On 2020-07-10 19:01:49 -0400, Alvaro Herrera wrote:
> Totally unasked for, here's a rebase of this patch series.  I didn't do
> anything other than rebasing to current master, solving a couple of very
> trivial conflicts, fixing some whitespace complaints by git apply, and
> running tests to verify everthing works.
> 
> I don't foresee working on this at all, so if anyone is interested in
> seeing this feature in, I encourage them to read and address
> Horiguchi-san's feedback.

Nor am I planning to do so, but I do think its a pretty important
improvement.



> +/*
> + * PQrecyclePipelinedCommand
> + *   Push a command queue entry onto the freelist. It must be a dangling entry
> + *   with null next pointer and not referenced by any other entry's next pointer.
> + */

Dangling sounds a bit like it's already freed.



> +/*
> + * PQbatchSendQueue
> + *   End a batch submission by sending a protocol sync. The connection will
> + *   remain in batch mode and unavailable for new synchronous command execution
> + *   functions until all results from the batch are processed by the client.

I feel like the reference to the protocol sync is a bit too low level
for an external API. It should first document what the function does
from a user's POV.

I think it'd also be good to document whether / whether not queries can
already have been sent before PQbatchSendQueue is called or not.


> +/*
> + * PQbatchProcessQueue
> + *In batch mode, start processing the next query in the queue.
> + *
> + * Returns 1 if the next query was popped from the queue and can
> + * be processed by PQconsumeInput, PQgetResult, etc.
> + *
> + * Returns 0 if the current query isn't done yet, the connection
> + * is not in a batch, or there are no more queries to process.
> + */
> +int
> +PQbatchProcessQueue(PGconn *conn)
> +{
> + PGcommandQueueEntry *next_query;
> +
> + if (!conn)
> + return 0;
> +
> + if (conn->batch_status == PQBATCH_MODE_OFF)
> + return 0;
> +
> + switch (conn->asyncStatus)
> + {
> + case PGASYNC_COPY_IN:
> + case PGASYNC_COPY_OUT:
> + case PGASYNC_COPY_BOTH:
> + printfPQExpBuffer(&conn->errorMessage,
> +                   libpq_gettext_noop("internal error, COPY in batch mode"));
> + break;

Shouldn't there be a return 0 here?



> + if (conn->batch_status == PQBATCH_MODE_ABORTED && conn->queryclass != PGQUERY_SYNC)
> + {
> + /*
> +  * In an aborted batch we don't get anything from the server for each
> +  * result; we're just discarding input until we get to the next sync
> +  * from the server. The client needs to know its queries got aborted
> +  * so we create a fake PGresult to return immediately from
> +  * PQgetResult.
> +  */
> + conn->result = PQmakeEmptyPGresult(conn,
> +                                    PGRES_BATCH_ABORTED);
> + if (!conn->result)
> + {
> + printfPQExpBuffer(&conn->errorMessage,
> +                   libpq_gettext("out of memory"));
> + pqSaveErrorResult(conn);
> + return 0;

Is there any way an application can recover at this point? ISTM we'd be
stuck in the previous asyncStatus, no?


> +/* pqBatchFlush
> + * In batch mode, data will be flushed only when the out buffer reaches the threshold value.
> + * In non-batch mode, data will be flushed all the time.
> + */
> +static int
> +pqBatchFlush(PGconn *conn)
> +{
> + if ((conn->batch_status == PQBATCH_MODE_OFF)||(conn->outCount >= OUTBUFFER_THRESHOLD))
> + return(pqFlush(conn));
> + return 0; /* Just to keep compiler quiet */
> +}

unnecessarily long line.


> +/*
> + * Connection's outbuffer threshold is set to 64k as it is safe
> + * in Windows as per comments in pqSendSome() API.
> + */
> +#define OUTBUFFER_THRESHOLD  65536

I don't think the comment explains much. It's fine to send more than 64k
with pqSendSome(); they'll just be sent with separate pgsecure_write()
invocations, and only on Windows.

It clearly makes sense to start sending out data at a certain
granularity to avoid needing unnecessary amounts of memory, and to make
more efficient use of latency / server-side compute.

It's not implausible that 64k is the right amount for that, I just don't
think the explanation above is good.

> diff --git a/src/test/modules/test_libpq/testlibpqbatch.c b/src/test/modules/test_libpq/testlibpqbatch.c
> new file mode 100644
> index 00..4d6ba266e5
> --- /dev/null
> +++ b/src/test/modules/test_libpq/testlibpqbatch.c
> @@ -0,0 +1,1456 @@
> +/*
> + * src/test/modules/test_libpq/testlibpqbatch.c
> + *
> + *
> + * testlibpqbatch.c
> + *   Test of batch execution functionality
> + */

Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-14 Thread Robert Haas
On Sat, Jul 11, 2020 at 8:37 AM Dilip Kumar  wrote:
> I have just notice that the parallelism is off even for the select
> part of the query mentioned in the $subject.  I see the only reason it
> is not getting parallel because we block the parallelism if the query
> type is not SELECT.  I don't see any reason for not selecting the
> parallelism for this query.

There's a relevant comment near the top of heap_prepare_insert().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: warnings for invalid function casts

2020-07-14 Thread Peter Eisentraut

On 2020-07-07 18:08, Tom Lane wrote:

Peter Eisentraut  writes:

On 2020-07-04 16:16, Tom Lane wrote:

I'm for a typedef.  There is *nothing* readable about "(void (*) (void))",
and the fact that it's theoretically incorrect for the purpose doesn't
exactly aid intelligibility either.  With a typedef, not only are
the uses more readable but there's a place to put a comment explaining
that this is notionally wrong but it's what gcc specifies to use
to suppress thus-and-such warnings.



Makes sense.  New patch here.


I don't have a compiler handy that emits these warnings, but this
passes an eyeball check.


committed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Binary support for pgoutput plugin

2020-07-14 Thread Tom Lane
Dave Cramer  writes:
> On Tue, 14 Jul 2020 at 12:59, Tom Lane  wrote:
>> So I started looking through this seriously, and my first question
>> is why do the docs and code keep saying that "base types" are sent
>> in binary?  Why not just "data"?  Are there any cases where we
>> don't use binary format, if the subscription requests it?

> Well for some reason I thought there were some types that did not have send
> and receive functions.

There are, but they're all base types, so this terminology is still
unhelpful ;-).

It'd be possible for the sender to send binary for columns it has a
typsend function for, and otherwise send text.  However, this only helps
if the receiver has receive functions for all those types; in
cross-version cases they might disagree about which types can be sent
in binary.  (Hm ... maybe we could have the receiver verify that it has
typreceive for every column included in its version of the table, before
asking for binary mode?)
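
If it helps, the affected types can be listed with something like this
illustrative query (not part of the patch itself):

SELECT typname FROM pg_type WHERE typreceive = 0 OR typsend = 0;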

regards, tom lane




RE: Index Skip Scan (new UniqueKeys)

2020-07-14 Thread Floris Van Nee
> 
> > - the uniquekeys that is built, still contains some redundant keys, that are
> normally eliminated from the path keys lists.
> 
> I guess you're talking about:
> 
> + if (EC_MUST_BE_REDUNDANT(ec))
> +   continue;
> 
> Can you add some test cases to your changes to show the effect of it? It
> seem to me redundant keys are already eliminated at this point by either
> make_pathkeys_for_uniquekeys or even earlier for distinct on, but could be
> I've missed something.
> 

The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks 
for uniqueness using pathkey_is_unique, but not for constantness. Consider a 
query like:
SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c

All the regular path keys filter out 'a' for constantness here - so they would 
end up with a distinct pathkeys of (b) and sort path keys of (b,c).
The unique keys would end up with (a,b) though. In the previous patch, it'd 
compared in create_distinct_paths, the pathkeys to the unique keys, so it 
wouldn't consider the skip scan.
Due to the other changes I made in 
create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan 
now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to 
give an actual failing test case now. However, since all code filters out 
constant keys, I think uniqueness should do it too - otherwise you could get 
into problems later on. It's also more consistent. If you already know 
something is unique by just (b), it doesn't make sense to store that it's 
unique by (a,b). Now that I think of it, the best place to do this 
EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, 
rather than build_uniquekeys though. It's probably good to move it there.

> Along the lines I'm also curious about this part:
> 
> - ListCell   *k;
> - List *exprs = NIL;
> -
> - foreach(k, ec->ec_members)
> - {
> - EquivalenceMember *mem = (EquivalenceMember *)
> lfirst(k);
> - exprs = lappend(exprs, mem->em_expr);
> - }
> -
> - result = lappend(result, makeUniqueKey(exprs, false, false));
> + EquivalenceMember *mem = (EquivalenceMember*)
> +lfirst(list_head(ec->ec_members));
> 
> I'm curious about this myself, maybe someone can clarify. It looks like
> generaly speaking there could be more than one member (if not
> ec_has_volatile), which "representing knowledge that multiple items are
> effectively equal". Is this information is not interesting enough to preserve 
> it
> in unique keys?

Yeah, that's a good question. Hence my question about the choice for Expr 
rather than EquivalenceClass for the Unique Keys patch to Andy/David. When 
storing just Expr, it is rather difficult to check equivalence in joins for 
example. Suppose, later on we decide to add join support to the distinct skip 
scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I 
think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results 
in a UniqueKey with expr (t1.a) (because currently we only take the first Expr 
in the list to construct the UniqueKey). We could also construct *two* unique 
keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's 
the correct approach either, as it will explode when you have multiple 
pathkeys, each having multiple Expr inside their EqClasses.
That makes it difficult to check if we can perform the DISTINCT skip scan on 
table t2 as well (theoretically we could, but for that we need to check 
equivalence classes rather than expressions).

> 
> > - the distinct_pathkeys may be NULL, even though there's a possibility for
> skipping. But it wouldn't create the uniquekeys in this case. This makes the
> planner not choose skip scans even though it could. For example in queries
> that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
> is constant, it's eliminated from regular pathkeys.
> 
> What would be the point of skipping if it's a constant?

For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one 
though. The traditional non-skip approach would still scan all records with 
a=1. Skip would just fetch the first one with a=1 and then skip to the next 
prefix and stop the scan.
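
A concrete instance of that shape, purely for illustration:

CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX ON t1 (a, b);
INSERT INTO t1 SELECT 1, g % 10, g FROM generate_series(1, 100000) g;
-- only one row is wanted, but a non-skip plan still reads every a=1 row
SELECT DISTINCT ON (a) * FROM t1 WHERE a = 1 ORDER BY a, b;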

> 
> > - to combat the issues mentioned earlier, there's now a check in
> build_index_paths that checks if the query_pathkeys matches the
> useful_pathkeys. Note that we have to use the path keys here rather than
> any of the unique keys. The unique keys are only Expr nodes - they do not
> contain the necessary information about ordering. Due to elimination of
> some constant path keys, we have to search the attributes of the index to
> find the correct prefix to use in skipping.
> 
> IIUC here you mean this function, right?
> 
> + prefix = find_index_prefix_for_pathkey(root,

Re: Binary support for pgoutput plugin

2020-07-14 Thread Dave Cramer
On Tue, 14 Jul 2020 at 12:59, Tom Lane  wrote:

> So I started looking through this seriously, and my first question
> is why do the docs and code keep saying that "base types" are sent
> in binary?  Why not just "data"?  Are there any cases where we
> don't use binary format, if the subscription requests it?
>
> If there's not a concrete reason to use that terminology,
> I'd rather flush it, because it seems confusing.
>

Well for some reason I thought there were some types that did not have send
and receive functions.

I've changed the docs to say data and the flag from binary_basetypes to
just binary

See attached.

Thanks,

Dave

>
>


0001-Add-binary-protocol-for-publications-and-subscriptio.patch
Description: Binary data


Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-07-14 Thread Anastasia Lubennikova

On 01.07.2020 12:38, Daniel Gustafsson wrote:

This patch incurs a compiler warning, which probably is quite simple to fix:

heapam.c: In function ‘heap_multi_insert’:
heapam.c:2349:4: error: ‘recptr’ may be used uninitialized in this function 
[-Werror=maybe-uninitialized]
 visibilitymap_set(relation, BufferGetBlockNumber(buffer), buffer,
 ^
heapam.c:2136:14: note: ‘recptr’ was declared here
XLogRecPtr recptr;
   ^

Please fix and submit a new version, I'm marking the entry Waiting on Author in
the meantime.

cheers ./daniel


This patch looks very useful to me, so I want to pick it up.


The patch that fixes the compiler warning is in the attachment. Though, 
I'm not
entirely satisfied with this fix. Also, the patch contains some FIXME 
comments.

I'll test it more and send fixes this week.

Questions from the first review pass:

1) Do we need XLH_INSERT_ALL_VISIBLE_SET ? IIUC, in the patch it is always
implied by XLH_INSERT_ALL_FROZEN_SET.

2) What does this comment mean? Does XXX refer to the lsn comparison? Since
it is definitely necessary to update the VM.

+             * XXX: This seems entirely unnecessary?
+             *
+             * FIXME: Theoretically we should only do this after we've
+             * modified the heap - but it's safe to do it here I think,
+             * because this means that the page previously was empty.
+             */
+            if (lsn > PageGetLSN(vmpage))
+                visibilitymap_set(reln, blkno, InvalidBuffer, lsn, 
vmbuffer,

+                                  InvalidTransactionId, vmbits);

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index ca4b6e186b..0017e3415c 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -179,6 +179,69 @@ select pg_truncate_visibility_map('test_partition');
  
 (1 row)
 
+-- test copy freeze
+create table copyfreeze (a int, b char(1500));
+-- load all rows via COPY FREEZE and ensure that all pages are set all-visible
+-- and all-frozen.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | t   | t
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- load half the rows via regular COPY and rest via COPY FREEZE. The pages
+-- which are touched by regular COPY must not be set all-visible/all-frozen. On
+-- the other hand, pages allocated by COPY FREEZE should be marked
+-- all-frozen/all-visible.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | f   | f
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- Try a mix of regular COPY and COPY FREEZE.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
 -- cleanup
 drop table test_partitioned;
 drop view test_view;
@@ -188,3 +251,4 @@ drop server dummy_server;
 drop foreign data wrapper dummy;
 drop materialized view matview_visibility_test;
 drop table regular_table;
+drop table copyfreeze;
diff --git a/contrib/pg_visibility/sql/pg_visibility.sql b/contrib/pg_visibility/sql/pg_visibility.sql
index f79b54480b..ec1afd4906 100644
--- a/contrib/pg_visibility/sql/pg_visibility.sql
+++ b/contrib/pg_visibility/sql/pg_visibility.sql
@@ -94,6 +94,82 @@ select count(*) > 0 from pg_visibility_map_summary('test_partition');
 select * from pg_check_frozen('test_partition'); -- hopefully none
 select pg_truncate_visibility_map('test_partition');
 
+-- test copy freeze
+create table copyfreeze (a int, b char(1500));
+
+-- load all rows via COPY FREEZE and ensure that all pages are set all-visible
+-- and all-frozen.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+1	'1'
+2	'2'
+3	'3'
+4	'4'
+5	'5'
+6	'6'
+7	'7'
+8	'8'
+9	'9'
+10	'10'
+11	'11'
+12	'12'
+\.
+commit;
+select * from pg_visibility_map('copyfreeze');
+select * from pg_check_frozen('copyfreeze');
+
+-- load half the rows via regular COPY and rest via COPY FREEZE. The pages
+-- which are touched by regular COPY must not be set all-visible/all-frozen. On
+-- 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Alvaro Herrera
On 2020-Jul-13, Andres Freund wrote:

> Hi,
> 
> On 2020-07-13 17:12:18 -0400, Robert Haas wrote:
> > 1. There's nothing to identify the tuple that has the problem, and no
> > way to know how many more of them there might be. Back-patching
> > b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first
> > part of this.
> 
> Not fully, I'm afraid. Afaict it doesn't currently tell you the item
> pointer offset, just the block number, right? We probably should extend
> it to also include the offset...

Just having the block number is already a tremendous step forward; with
that you can ask the customer to set a pageinspect dump of tuple
headers, and then the problem is obvious.  Now if you want to add block
number to that, by all means do so.
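
For the archives, such a dump looks roughly like this, with the table name
and the block number from the error message as placeholders:

CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT lp, t_ctid, t_xmin, t_xmax, t_infomask, t_infomask2
  FROM heap_page_items(get_raw_page('some_table', 123));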

FWIW I do support the idea of backpatching the vacuum errcontext commit.

One useful thing to do is to mark a tuple frozen unconditionally if it's
marked hinted XMIN_COMMITTED; no need to consult pg_clog in that case.
The attached (for 9.6) does that; IIRC it would have helped in a couple
of cases.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 2d5574e4e3caa0299d27d77f0fc319463fc84162 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 18 Jul 2019 12:50:11 -0400
Subject: [PATCH] Don't test for xact commit if tuple is hinted

---
 src/backend/access/heap/heapam.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fe3e83b489..d42bc6502c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6724,7 +6724,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 		if (TransactionIdPrecedes(xid, cutoff_xid))
 		{
-			if (!TransactionIdDidCommit(xid))
+			if (!HeapTupleHeaderXminCommitted(xid) &&
+!TransactionIdDidCommit(xid))
 ereport(ERROR,
 		(errcode(ERRCODE_DATA_CORRUPTED),
 		 errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
-- 
2.20.1



Re: Binary support for pgoutput plugin

2020-07-14 Thread Tom Lane
So I started looking through this seriously, and my first question
is why do the docs and code keep saying that "base types" are sent
in binary?  Why not just "data"?  Are there any cases where we
don't use binary format, if the subscription requests it?

If there's not a concrete reason to use that terminology,
I'd rather flush it, because it seems confusing.

regards, tom lane




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-07-14 Thread Bruce Momjian
On Tue, Jul 14, 2020 at 03:38:49PM +0530, Bharath Rupireddy wrote:
> Approach #4:
> A postgres_fdw foreign server level option: connection idle time, the
> amount of idle time for that server cached entry, after which the
> cached entry goes away. Probably the backend, before itself going to
> idle, has to be checking the cached entries and see if any of the
> entries has timed out. One problem is that, if the backend just did it
> before going idle, then what about sessions that haven't reached the
> timeout at the point when we go idle, but do reach the timeout later?

Imagine implementing idle_in_session_timeout (which is useful on its
own), and then, when you connect to a foreign postgres_fdw server, you
set idle_in_session_timeout on the foreign side, and it just
disconnects/exits after an idle timeout.  There is nothing the sending
side has to do.
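
A sketch of how that could look once such a GUC exists
(idle_in_session_timeout is hypothetical at this point, and the role name is
just an example):

-- run on the foreign server, for the role used by the postgres_fdw user mapping
ALTER ROLE fdw_user SET idle_in_session_timeout = '10min';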

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Index Skip Scan (new UniqueKeys)

2020-07-14 Thread Dmitry Dolgov
> On Sun, Jul 12, 2020 at 12:48:47PM +, Floris Van Nee wrote:
> >
> > Good point, thanks for looking at this. With the latest planner version 
> > there
> > are indeed more possibilities to use skipping. It never occured to me that
> > some of those paths will still rely on index scan returning full data set. 
> > I'll look
> > in details and add verification to prevent putting something like this on 
> > top of
> > skip scan in the next version.
>
> I believe the required changes are something like in attached patch. There 
> were a few things I've changed:
> - build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, 
> it would create two unique keys, one with a and one with b. However, it 
> should be one unique key with (a,b).

Yes, I've also noticed that while preparing fix for index scan not
covered by index and included it.

> - the uniquekeys that is built, still contains some redundant keys, that are 
> normally eliminated from the path keys lists.

I guess you're talking about:

+ if (EC_MUST_BE_REDUNDANT(ec))
+   continue;

Can you add some test cases to your changes to show the effect of it? It
seem to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but could
be I've missed something.

Along the lines I'm also curious about this part:

-   ListCell   *k;
-   List *exprs = NIL;
-
-   foreach(k, ec->ec_members)
-   {
-   EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-   exprs = lappend(exprs, mem->em_expr);
-   }
-
-   result = lappend(result, makeUniqueKey(exprs, false, false));
+   EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));

I'm curious about this myself, maybe someone can clarify. It looks like
generaly speaking there could be more than one member (if not
ec_has_volatile), which "representing knowledge that multiple items are
effectively equal". Is this information is not interesting enough to
preserve it in unique keys?

> - the distinct_pathkeys may be NULL, even though there's a possibility for 
> skipping. But it wouldn't create the uniquekeys in this case. This makes the 
> planner not choose skip scans even though it could. For example in queries 
> that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is 
> constant, it's eliminated from regular pathkeys.

What would be the point of skipping if it's a constant?

> - to combat the issues mentioned earlier, there's now a check in 
> build_index_paths that checks if the query_pathkeys matches the 
> useful_pathkeys. Note that we have to use the path keys here rather than any 
> of the unique keys. The unique keys are only Expr nodes - they do not contain 
> the necessary information about ordering. Due to elimination of some constant 
> path keys, we have to search the attributes of the index to find the correct 
> prefix to use in skipping.

IIUC here you mean this function, right?

+ prefix = find_index_prefix_for_pathkey(root,
+                                        index,
+                                        BackwardScanDirection,
+                                        llast_node(PathKey,
+                                                   root->distinct_pathkeys));

Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but looks like
query_pathkeys should also match in this case.

Will also look at the follow up questions in the next email.




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-07-14 Thread Robert Haas
On Tue, Jul 14, 2020 at 6:09 AM Bharath Rupireddy
 wrote:
> Thanks all for the ideas. There have been various points/approaches
> discussed in the entire email chain so far.
> I would like to summarize all of them here, so that we can agree on
> one of the options and proceed further with this feature.

In my opinion, approach #2 seems easy to implement and it's hard to
imagine anyone finding much to complain about there, but it's not that
powerful either, because it isn't automatic. Now the other approaches
have to do with the way in which this should be controlled, and I
think there are two separate questions.

1. Should this be controlled by (a) a core GUC, (b) a postgres_fdw
GUC, (c) a postgres_fdw server-level option?
2. Should it be (a) a timeout or (b) a Boolean (keep vs. don't keep)?

With regard to #1, even if we decided on a core GUC, I cannot imagine
that we'd accept enable_connectioncache as a name, because most
enable_whatever GUCs are for the planner, and this is something else.
Also, underscores between some words but not all words is a lousy
convention; let's not do more of that. Apart from those points, I
don't have a strong opinion; other people might. With regard to #2, a
timeout seems a lot more powerful, but also harder to implement
because you'd need some kind of core changes to let the FDW get
control at the proper time. Maybe that's an argument for 2(b), but I
have a bit of a hard time believing that 2(b) will provide a good user
experience. I doubt that most people want to have to decide between
slamming the connection shut even if it's going to be used again
almost immediately and keeping it open until the end of time. Those
are two pretty extreme positions.
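
For illustration, options 1(c) plus 2(a) might end up looking something like
this, with both the server name and the option name invented as examples:

ALTER SERVER foreign_pg OPTIONS (ADD connection_idle_timeout '60s');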

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread David G. Johnston
On Tue, Jul 14, 2020 at 7:21 AM Pavel Stehule 
wrote:

> On Tue, Jul 14, 2020 at 4:09 PM David G. Johnston <
> david.g.johns...@gmail.com> wrote:
>
>> On Tue, Jul 14, 2020 at 6:56 AM Pavel Stehule 
>> wrote:
>>
>>> On Tue, Jul 14, 2020 at 3:55 PM David G. Johnston <
>>> david.g.johns...@gmail.com> wrote:
>>>
 Further comments welcome so I'm putting it back into needs review for
 the moment while I work on the refactor.

>>>
>>> attached fixed patch
>>>
>>> all tests passed
>>> doc build without problems
>>>
>>
>> Thanks.
>>
>> Actually, one question I didn't pose before, does the SQL standard define
>> DROP TYPE to target domains while also providing for a DROP DOMAIN
>> command?  Do drop commands for the other types we have not exist because
>> those aren't SQL standard types (or the standard they are standard types
>> but the commands aren't defined)?
>>
>
> It looks like Postgres user defined types are something else than ANSI SQL
> - so CREATE TYPE and DROP TYPE did different work.
>
> In the section DROP TYPE in ANSI SQL there is not mentioned any relation
> to domains.
>

Attaching a backpatch-able patch for the main docs and tests, v4
Added a head-only patch for the glossary changes, set to v4 as well.

I didn't try and address any SQL standard dynamics here.
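
For anyone following along, the current Postgres behavior these docs and
tests cover is, in short (a quick illustration, not taken from the patch):

CREATE DOMAIN posint AS int CHECK (VALUE > 0);
DROP TYPE posint;              -- succeeds: DROP TYPE also targets domains
DROP DOMAIN IF EXISTS posint;  -- the standard command; a no-op with a notice here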

David J.


001-drop-if-exists-docs-and-tests-v4.patch
Description: Binary data


002-types-glossary-v4.patch
Description: Binary data


Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-14 Thread Alexey Kondratov

On 2020-07-14 15:27, Ashutosh Bapat wrote:

On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
 wrote:

I built a simple two node multi-tenant schema for tests, which can be
easily set up with attached scripts. It creates three tables (companies,
users, documents) distributed over two nodes. Everything can be found in
this Gist [2] as well.

Some real-life test queries show, that all single-node queries aren't
pushed-down to the required node. For example:

SELECT
 *
FROM
 documents
 INNER JOIN users ON documents.user_id = users.id
WHERE
 documents.company_id = 5
 AND users.company_id = 5;


There are a couple of things happening here
1. the clauses on company_id in WHERE clause are causing partition
pruning. Partition-wise join is disabled with partition pruning before
PG13. In PG13 we have added advanced partition matching algorithm
which will allow partition-wise join with partition pruning.



I forgot to mention that I use a recent master (991c444e7a) for tests with

enable_partitionwise_join = 'on'
enable_partitionwise_aggregate = 'on'

of course. I've also tried postgres_fdw.use_remote_estimate = true 
followed by ANALYSE on both nodes (it is still used in setup.sh script).


BTW, can you, please, share a link to commit / thread about allowing 
partition-wise join and partition pruning to work together in PG13?




2. the query has no equality condition on the partition key of the
tables being joined. Partitionwise join is possible only when there's
an equality condition on the partition keys (company_id) of the
joining tables. PostgreSQL's optimizer is not smart enough to convert
the equality conditions in WHERE clause into equality conditions on
partition keys. So having those conditions just in WHERE clause does
not help. Instead please add equality conditions on partition keys in
JOIN .. ON clause or WHERE clause (only for INNER join).



With adding documents.company_id = users.company_id

SELECT *
FROM
documents
INNER JOIN users ON (documents.company_id = users.company_id
AND documents.user_id = users.id)
WHERE
documents.company_id = 5
AND users.company_id = 5;

query plan remains the same.



It is executed as follows:

                        QUERY PLAN
----------------------------------------------------------
 Nested Loop
   Join Filter: (documents.user_id = users.id)
   ->  Foreign Scan on users_node2 users
   ->  Materialize
         ->  Foreign Scan on documents_node2 documents

i.e. it uses two foreign scans and does the final join locally. However,
once I specify target partitions explicitly, then the entire query is
pushed down to the foreign node:

                       QUERY PLAN
---------------------------------------------------------
 Foreign Scan
   Relations: (documents_node2) INNER JOIN (users_node2)
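
For reference, the explicitly-targeted-partitions variant is along these
lines:

SELECT *
FROM documents_node2 d
INNER JOIN users_node2 u
    ON d.company_id = u.company_id AND d.user_id = u.id
WHERE d.company_id = 5 AND u.company_id = 5;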

Execution time drops significantly as well — by more than 3 times
even for this small test database. Situation for simple queries with
aggregates or joins and aggregates followed by the sharding key filter
is the same. Something similar was briefly discussed in this thread 
[3].


IIUC, it means that push-down of queries through the postgres_fdw works
perfectly well; the problem is with partition-wise operation detection
at planning time. Currently, the partition-wise aggregate routines, e.g.,
look for a GROUP BY and check whether the sharding key exists there or not.
After that the PARTITIONWISE_AGGREGATE_* flag is set. However, it doesn't
look at the content of the WHERE clause, so frankly speaking it isn't a
problem, this functionality is simply not yet implemented.

Actually, sometimes I was able to push down queries with aggregate
simply by adding an additional GROUP BY with sharding key, like this:

SELECT
 count(*)
FROM
 documents
WHERE
 company_id = 5
GROUP BY company_id;


This gets pushed down since GROUP BY clause is on the partition key.



Sure, but it only works *sometimes*; most of such simple queries with
aggregates never get pushed down for me, e.g.:


SELECT
sum(id)
FROM
documents_node2
WHERE
company_id = 5
GROUP BY
company_id;

whether 'GROUP BY company_id' is used or not.



Although it seems that it will be easier to start with aggregates,
probably we should initially plan a more general solution? For example,
check that all involved tables are filtered by partitioning key and push
down the entire query if all of them target the same foreign server.

Any thoughts?


I think adding just equality conditions on the partition key will be
enough. No need for any code change.


So, it hasn't helped. Maybe I could modify some costs to verify that 
push-down of such joins is ever possible?


Anyway, what about aggregates? Partition-wise aggregates work fine for 
queries like


SELECT
count(*)
FROM
documents
GROUP BY
company_id;

but once I narrow it to a single partition with 'WHERE company_id = 5', 
then it is being executed in a very inefficient way — takes all rows 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Tue, Jul 14, 2020 at 4:09 PM Robert Haas  wrote:
> > On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander 
> > wrote:
> > > I don't think that it necessarily has to be. As long as we're talking
> > about adding something and not actually changing their existing packages,
> > getting this into both yum and apt shouldn't be *that* hard, if it's
> > coordinated well with Christoph and Devrim (obviously that's based on my
> > experience and they will have to give a more complete answer themselves).
> > It would be a lot more complicated if it involved changing an existing
> > package.
> >
> > I mean, you presumably could not move pg_resetwal to this new package
> > in existing branches, right?
> 
> Probably and eventually. But that can be done for 14+ (or 13+ depending on
> how "done" the packaging is there -- we should just make sure that hits the
> biggest platform in the same release).

Considering we just got rid of the -contrib independent package on at
least Debian-based systems, it doesn't really seem likely that the
packagers are going to be anxious to create a new one- they are not
without costs.

Also, in such dire straits as this thread is contemplating, I would
think we'd *really* like to have access to these tools with as small an
amount of change as absolutely possible to the system: what if
pg_extension itself got munged and we aren't able to install this new
contrib module, for example?

I would suggest that, instead, we make this part of core, but have it be
in a relatively clearly marked special schema that isn't part of
search_path by default- eg: pg_hacks, or pg_dirty_hands (I kinda like
the latter though it seems a bit unprofessional for us).

I'd also certainly be in support of having a contrib module with the
same functions that's independent from core and available and able to be
installed on pre-v14 systems.  I'd further support having another repo
that's "core maintained" or however we want to phrase it which includes
this proposed module (and possibly all of contrib) and which has a
different release cadence and requirements for what gets into it, has
its own packages, etc.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: [Proposal] Global temporary tables

2020-07-14 Thread Pavel Stehule
On Fri, Jul 10, 2020 at 11:04 AM wenjing zeng
wrote:

> HI all
>
> I started using my personal email to respond to community issue.
>
>
>
> On Jul 7, 2020, at 6:05 PM, Pavel Stehule  wrote:
>
> Hi
>
>
>> GTT Merge the latest PGMaster and resolves conflicts.
>>
>>
>>
> I tested it and it looks fine. I think it is very usable in current form,
> but still there are some issues:
>
> postgres=# create global temp table foo(a int);
> CREATE TABLE
> postgres=# insert into foo values(10);
> INSERT 0 1
> postgres=# alter table foo add column x int;
> ALTER TABLE
> postgres=# analyze foo;
> WARNING:  reloid 16400 not support update attstat after add colunm
> WARNING:  reloid 16400 not support update attstat after add colunm
> ANALYZE
>
> This is a limitation that we can completely eliminate.
>
>
> Please, can you summarize what is done, what limits are there, what can be
> implemented hard, what can be implemented easily?
>
> Sure.
>
> The current version of the GTT implementation supports all regular table
> operations.
> 1 what is done
> 1.1 insert/update/delete on GTT.
> 1.2 The GTT supports all types of indexes, and the query statement
> supports the use of GTT indexes to speed up the reading of data in the GTT.
> 1.3 GTT statistics keep a copy of THE GTT local statistics, which are
> provided to the optimizer to choose the best query plan.
> 1.4 analyze vacuum GTT.
> 1.5 truncate cluster GTT.
> 1.6 all DDL on GTT.
> 1.7 GTT table can use  GTT sequence  or Regular sequence.
> 1.8 Support for creating views on GTT.
> 1.9 Support for creating views on foreign key.
> 1.10 support global temp partition.
>
> I feel like I cover all the necessary GTT requirements.
>
> For cluster GTT,I think it's complicated.
> I'm not sure the current implementation is quite reasonable. Maybe you can
> help review it.
>
>
>
>
>
> I found one open question - how can be implemented table locks - because
> data is physically separated, then we don't need table locks as protection
> against race conditions.
>
> Yes, but GTT's DML and DDL still require table locking.
> 1 The DML requires table locks (RowExclusiveLock) to ensure that
> definitions do not change during run time (the DDL may modify or delete
> them).
> This part of the implementation does not actually change the code,
> because the DML on GTT does not block each other between sessions.
>
> 2 For truncate/analyze/vacuum/reindex/cluster, GTT is now like DML,
> they only modify local data and do not modify the GTT definition.
> So I lowered the table lock level held by the GTT, only need
> RowExclusiveLock.
>
> 3 For DDLs that need to modify the GTT table definition (DROP
> GTT, ALTER GTT),
> an exclusive level of table locking is required (AccessExclusiveLock),
> as is the case for regular tables.
> This part of the implementation also does not actually change the code.
>
> Summary: What I have done is to adjust the GTT lock levels in different
> types of statements based on the above thinking.
> For example, for truncate GTT, I'm reducing the table lock level held on
> the GTT to RowExclusiveLock,
> so we can truncate data in the same GTT in different sessions at the
> same time.
>
> What do you think about table locks on GTT?
>

I am thinking about explicit LOCK statements. Some applications use
explicit locking for various reasons - typically as protection against race
conditions.

But on GTT race conditions are not possible. So my question is - does an
exclusive lock on a GTT prevent other sessions from inserting into their own
instances of the same GTT?

At what level are table locks active - on the shared part of the GTT, or on
the session instance part of the GTT?
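
To make the question concrete, here is a minimal sketch using the GTT from
the test above (the table and column names are just the ones from that
example):

 -- session 1
 BEGIN;
 LOCK TABLE foo IN ACCESS EXCLUSIVE MODE;

 -- session 2, which only ever sees its own instance of foo
 INSERT INTO foo(a) VALUES (20);  -- does this have to wait for session 1?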




>
> Wenjing
>
>
>
> Now, table locks are implemented on a global level. So exclusive lock on
> GTT in one session block insertion on the second session. Is it expected
> behaviour? It is safe, but maybe it is too strict.
>
> We should define what table lock is meaning on GTT.
>
> Regards
>
> Pavel
>
>
>> Pavel
>>
>>
>>> With Regards,
>>> Prabhat Kumar Sahu
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>>
>>>
>>
>


Re: Creating foreign key on partitioned table is too slow

2020-07-14 Thread Daniel Gustafsson
> On 24 Mar 2020, at 16:26, Alvaro Herrera  wrote:

> It
> would be great if Kato Sho can try the original test case with my latest
> patch (the one in https://postgr.es/m/20191113214544.GA16060@alvherre.pgsql )
> and let us know if it improves things.

Hi!,

Are you able to test Alvaro's latest patch to see if that solves the originally
reported problem, so that we can reach closure on this item during the
commitfest?

cheers ./daniel



Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Magnus Hagander
On Tue, Jul 14, 2020 at 4:09 PM Robert Haas  wrote:

> On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander 
> wrote:
> > I don't think that it necessarily has to be. As long as we're talking
> about adding something and not actually changing their existing packages,
> getting this into both yum and apt shouldn't be *that* hard, if it's
> coordinated well with Christoph and Devrim (obviously that's based on my
> experience and they will have to give a more complete answer themselves).
> It would be a lot more complicated if it involved changing an existing
> package.
>
> I mean, you presumably could not move pg_resetwal to this new package
> in existing branches, right?
>

Probably and eventually. But that can be done for 14+ (or 13+ depending on
how "done" the packaging is there -- we should just make sure that hits the
biggest platform in the same release).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread Pavel Stehule
On Tue, Jul 14, 2020 at 16:09, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Jul 14, 2020 at 6:56 AM Pavel Stehule 
> wrote:
>
>> út 14. 7. 2020 v 15:55 odesílatel David G. Johnston <
>> david.g.johns...@gmail.com> napsal:
>>
>>> Further comments welcome so I'm putting it back into needs review for
>>> the moment while I work on the refactor.
>>>
>>
>> attached fixed patch
>>
>> all tests passed
>> doc build without problems
>>
>
> Thanks.
>
> Actually, one question I didn't pose before, does the SQL standard define
> DROP TYPE to target domains while also providing for a DROP DOMAIN
> command?  Do drop commands for the other types we have not exist because
> those aren't SQL standard types (or the standard they are standard types
> but the commands aren't defined)?
>

It looks like Postgres user-defined types are something different from ANSI
SQL types - so CREATE TYPE and DROP TYPE do different work.

The DROP TYPE section of ANSI SQL does not mention any relation to
domains.
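
For reference, in PostgreSQL itself the two commands overlap, because a
domain is just a pg_type entry; a small sketch with a throwaway domain name:

 CREATE DOMAIN posint AS integer CHECK (VALUE > 0);
 DROP TYPE posint;   -- also works; DROP DOMAIN posint would be equivalent

so the question above is really about what the standard intends, not about
what the code can do.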


> David J.
>


Re: [Proposal] Global temporary tables

2020-07-14 Thread Pavel Stehule
On Mon, Jul 13, 2020 at 13:59, wenjing zeng 
wrote:

>
>
> 2020年7月10日 下午5:03,wenjing zeng  写道:
>
> HI all
>
> I started using my personal email to respond to community issue.
>
>
>
> 2020年7月7日 下午6:05,Pavel Stehule  写道:
>
> Hi
>
>
>> GTT Merge the latest PGMaster and resolves conflicts.
>>
>>
>>
> I tested it and it looks fine. I think it is very usable in current form,
> but still there are some issues:
>
> postgres=# create global temp table foo(a int);
> CREATE TABLE
> postgres=# insert into foo values(10);
> INSERT 0 1
> postgres=# alter table foo add column x int;
> ALTER TABLE
> postgres=# analyze foo;
> WARNING:  reloid 16400 not support update attstat after add colunm
> WARNING:  reloid 16400 not support update attstat after add colunm
> ANALYZE
>
> This is a limitation that we can completely eliminate.
>
>
> Please, can you summarize what is done, what limits are there, what can be
> implemented hard, what can be implemented easily?
>
> Sure.
>
> The current version of the GTT implementation supports all regular table
> operations.
> 1 what is done
> 1.1 insert/update/delete on GTT.
> 1.2 The GTT supports all types of indexes, and the query statement
> supports the use of GTT indexes to speed up the reading of data in the GTT.
> 1.3 GTT statistics keep a copy of THE GTT local statistics, which are
> provided to the optimizer to choose the best query plan.
> 1.4 analyze vacuum GTT.
> 1.5 truncate cluster GTT.
> 1.6 all DDL on GTT.
> 1.7 GTT table can use  GTT sequence  or Regular sequence.
> 1.8 Support for creating views on GTT.
> 1.9 Support for creating views on foreign key.
> 1.10 support global temp partition.
>
> I feel like I cover all the necessary GTT requirements.
>
> For cluster GTT,I think it's complicated.
> I'm not sure the current implementation is quite reasonable. Maybe you can
> help review it.
>
>
>
>
>
> I found one open question - how can be implemented table locks - because
> data is physically separated, then we don't need table locks as protection
> against race conditions.
>
> Yes, but GTT's DML and DDL still require table locking.
> 1 The DML requires table locks (RowExclusiveLock) to ensure that
> definitions do not change during run time (the DDL may modify or delete
> them).
> This part of the implementation does not actually change the code,
> because the DML on GTT does not block each other between sessions.
>
> As a side note, since the same row of GTT data cannot be modified by
> different sessions,
> I don't see the need to care about the GTT's pg_class.relminmxid.
> What do you think?
>

yes, probably it is not necessary

Regards

Pavel

>
>
> Wenjing
>
>
>
> 2 For truncate/analyze/vacuum/reindex/cluster, GTT is now like DML,
> they only modify local data and do not modify the GTT definition.
> So I lowered the table lock level held by the GTT, only need
> RowExclusiveLock.
>
> 3 For DDLs that need to modify the GTT table definition (DROP
> GTT, ALTER GTT),
> an exclusive level of table locking is required (AccessExclusiveLock),
> as is the case for regular tables.
> This part of the implementation also does not actually change the code.
>
> Summary: What I have done is to adjust the GTT lock levels in different
> types of statements based on the above thinking.
> For example, for truncate GTT, I'm reducing the table lock level held on
> the GTT to RowExclusiveLock,
> so we can truncate data in the same GTT in different sessions at the
> same time.
>
> What do you think about table locks on GTT?
>
>
> Wenjing
>
>
>
> Now, table locks are implemented on a global level. So exclusive lock on
> GTT in one session block insertion on the second session. Is it expected
> behaviour? It is safe, but maybe it is too strict.
>
> We should define what table lock is meaning on GTT.
>
> Regards
>
> Pavel
>
>
>> Pavel
>>
>>
>>> With Regards,
>>> Prabhat Kumar Sahu
>>> EnterpriseDB: http://www.enterprisedb.com
>>>
>>>
>>>
>>
>
>


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Robert Haas
On Tue, Jul 14, 2020 at 8:25 AM Magnus Hagander  wrote:
> I don't think that it necessarily has to be. As long as we're talking about 
> adding something and not actually changing their existing packages, getting 
> this into both yum and apt shouldn't be *that* hard, if it's coordinated well 
> with Christoph and Devrim (obviously that's based on my experience and they 
> will have to give a more complete answer themselves). It would be a lot more 
> complicated if it involved changing an existing package.

I mean, you presumably could not move pg_resetwal to this new package
in existing branches, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread David G. Johnston
On Tue, Jul 14, 2020 at 6:56 AM Pavel Stehule 
wrote:

> út 14. 7. 2020 v 15:55 odesílatel David G. Johnston <
> david.g.johns...@gmail.com> napsal:
>
>> Further comments welcome so I'm putting it back into needs review for the
>> moment while I work on the refactor.
>>
>
> attached fixed patch
>
> all tests passed
> doc build without problems
>

Thanks.

Actually, one question I didn't pose before, does the SQL standard define
DROP TYPE to target domains while also providing for a DROP DOMAIN
command?  Do drop commands for the other types we have not exist because
those aren't SQL standard types (or the standard they are standard types
but the commands aren't defined)?

David J.


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Robert Haas
On Mon, Jul 13, 2020 at 9:29 PM Fujii Masao  wrote:
> But updating this tool can fit to the release schedule and
> policy of PostgreSQL?
>
> While investigating the problem by using this tool, we may want to
> add new feature into the tool because it's necessary for the investigation.
> But users would need to wait for next minor version release, to use this
> new feature.

Yeah, that's a point that needs careful thought. I don't think it
means that we shouldn't have something in core; after all, this is a
problem that is created in part by the way that PostgreSQL itself
works, and I think it would be quite unfriendly if we refused to do
anything about that in the core distribution. On the other hand, it
might be a good reason not to back-patch, which is something most
people don't seem enthusiastic about anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread Pavel Stehule
út 14. 7. 2020 v 15:55 odesílatel David G. Johnston <
david.g.johns...@gmail.com> napsal:

> On Tue, Jul 14, 2020 at 5:40 AM Justin Pryzby 
> wrote:
>
>> On Tue, Jul 14, 2020 at 07:25:56AM +0200, Pavel Stehule wrote:
>> > út 14. 7. 2020 v 0:37 odesílatel David G. Johnston <
>> david.g.johns...@gmail.com> napsal:
>> > > On Mon, Jul 13, 2020 at 2:12 AM Pavel Stehule <
>> pavel.steh...@gmail.com> wrote:
>> > I think so now all changes are correct and valuable. I'll mark this
>> patch
>> > as ready for commit
>>
>> This is failing relevant tests in cfbot:
>>
>>  drop_if_exists   ... FAILED  450 ms
>>
>>
> Oops, did a minor whitespace cleanup in the test file and didn't re-copy
> expected output.  I'm actually going to try and clean up the commenting in
> the test file a bit to make it easier to read, and split out the glossary
> changes into their own diff so that the bulk of the changes can be
> back-patched.
>
> Further comments welcome so I'm putting it back into needs review for the
> moment while I work on the refactor.
>

attached fixed patch

all tests passed
doc build without problems


> David J.
>
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
index 76525c6302..1dac096f23 100644
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1689,6 +1689,30 @@

   
 
+  
+   Type
+   
+
+ All data stored within the database is typed.  Each type has both a name and associated
+ properties, as well as an overarching Type, of which there are two variations. Extensible
+ types can be created by users and comprise base, composite, domain, enum, and range. There
+ are also the system-only pseudo-types.  Type records are stored in pg_type and the
+ overarching type is found in pg_type.typtype.
+
+   
+
+
+  
+   Type Definition
+   
+
+ Collective term for the various Type
+ variants that are allowed for pg_type.typtype catalog table column values;
+ specifically, ones that are extensible.
+
+   
+  
+
   
Unique constraint

diff --git a/doc/src/sgml/ref/drop_domain.sgml b/doc/src/sgml/ref/drop_domain.sgml
index b18faf3917..8b04b3818b 100644
--- a/doc/src/sgml/ref/drop_domain.sgml
+++ b/doc/src/sgml/ref/drop_domain.sgml
@@ -30,7 +30,9 @@ DROP DOMAIN [ IF EXISTS ] name [, .
 
   
DROP DOMAIN removes a domain.  Only the owner of
-   a domain can remove it.
+   a domain can remove it.  This duplicates the functionality provided by 
+but restricts the object type to domain.
+
   
  
 
@@ -42,8 +44,13 @@ DROP DOMAIN [ IF EXISTS ] name [, .
 IF EXISTS
 
  
-  Do not throw an error if the domain does not exist. A notice is issued
-  in this case.
+  This parameter instructs PostgreSQL to search 
+  for the first instance of any 
+  type definition
+  with the provided name.
+  If no type definitions are found a notice is issued and the command ends.
+  If a type definition is found then one of two things will happen:
+  if the type definition is a domain it is dropped, otherwise the command fails.  
  
 

diff --git a/doc/src/sgml/ref/drop_foreign_table.sgml b/doc/src/sgml/ref/drop_foreign_table.sgml
index 07b3fd4251..0288fb2062 100644
--- a/doc/src/sgml/ref/drop_foreign_table.sgml
+++ b/doc/src/sgml/ref/drop_foreign_table.sgml
@@ -42,8 +42,11 @@ DROP FOREIGN TABLE [ IF EXISTS ] nameIF EXISTS
 
  
-  Do not throw an error if the foreign table does not exist.
-  A notice is issued in this case.
+  This parameter instructs PostgreSQL to search 
+  for the first instance of any relation with the provided name.
+  If no relations are found a notice is issued and the command ends.
+  If a relation is found then one of two things will happen:
+  if the relation is a foreign table it is dropped, otherwise the command fails.
  
 

diff --git a/doc/src/sgml/ref/drop_index.sgml b/doc/src/sgml/ref/drop_index.sgml
index 0aedd71bd6..dff437cf9b 100644
--- a/doc/src/sgml/ref/drop_index.sgml
+++ b/doc/src/sgml/ref/drop_index.sgml
@@ -70,8 +70,11 @@ DROP INDEX [ CONCURRENTLY ] [ IF EXISTS ] nameIF EXISTS
 
  
-  Do not throw an error if the index does not exist. A notice is issued
-  in this case.
+  This parameter instructs PostgreSQL to search 
+  for the first instance of any relation with the provided name.
+  If no relations are found a notice is issued and the command ends.
+  If a relation is found then one of two things will happen:
+  if the relation is an index it is dropped, otherwise the command fails.
  
 

diff --git a/doc/src/sgml/ref/drop_materialized_view.sgml b/doc/src/sgml/ref/drop_materialized_view.sgml
index c8f3bc5b0d..6647a0db0d 100644
--- a/doc/src/sgml/ref/drop_materialized_view.sgml
+++ b/doc/src/sgml/ref/drop_materialized_view.sgml
@@ -43,8 +43,11 @@ DROP MATERIALIZED VIEW [ IF EXISTS ] nameIF EXISTS
 
  
-  Do not 

Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread David G. Johnston
On Tue, Jul 14, 2020 at 5:40 AM Justin Pryzby  wrote:

> On Tue, Jul 14, 2020 at 07:25:56AM +0200, Pavel Stehule wrote:
> > út 14. 7. 2020 v 0:37 odesílatel David G. Johnston <
> david.g.johns...@gmail.com> napsal:
> > > On Mon, Jul 13, 2020 at 2:12 AM Pavel Stehule 
> wrote:
> > I think so now all changes are correct and valuable. I'll mark this patch
> > as ready for commit
>
> This is failing relevant tests in cfbot:
>
>  drop_if_exists   ... FAILED  450 ms
>
>
Oops, did a minor whitespace cleanup in the test file and didn't re-copy
expected output.  I'm actually going to try and clean up the commenting in
the test file a bit to make it easier to read, and split out the glossary
changes into their own diff so that the bulk of the changes can be
back-patched.

Further comments welcome so I'm putting it back into needs review for the
moment while I work on the refactor.

David J.


Re: Binary support for pgoutput plugin

2020-07-14 Thread Dave Cramer
On Tue, 14 Jul 2020 at 09:26, Daniel Gustafsson  wrote:

> > On 13 Jul 2020, at 15:11, Dave Cramer  wrote:
>
> I took another look at the updated version today.  Since there now were
> some
> unused variables and (I believe) unnecessary checks (int size and
> endianness
> etc) left, I took the liberty to fix those.  I also fixed some markup in
> the
> catalog docs, did some minor tidying up and ran pgindent on it.
>
> The attached is a squash of the 4 patches in your email with the above
> fixes.
> I'm again marking it RfC since I believe all concerns raised so far have
> been addressed.
>
> > Added the test case that Daniel had created.
>
> Nope, still missing AFAICT =) But I've included it in the attached.
>
>
Thanks!

Dave


>


Re: Improvements in Copy From

2020-07-14 Thread vignesh C
On Tue, Jul 14, 2020 at 12:17 PM vignesh C  wrote:
>
> On Tue, Jul 14, 2020 at 11:13 AM David Rowley  wrote:
> >
> > On Tue, 14 Jul 2020 at 17:22, David Rowley  wrote:
> > >
> > > On Thu, 2 Jul 2020 at 00:46, vignesh C  wrote:
> > > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > > that is not being used, it can be removed.
> > >
> > > This was raised in [1]. We decided not to remove it.
> >
> > I just added a comment to the function to mention why we want to keep
> > the parameter. I hope that will save any wasted time proposing its
> > removal in the future.
> >
> > FWIW, the function is inlined.  Removing it will gain us nothing
> > performance-wise anyway.
> >
> > David
> >
> > > [1] 
> > > https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef
>
> Thanks David for pointing it out, as this has been discussed and
> concluded no point in discussing the same thing again. This patch has
> a couple of other improvements which can still be taken forward. I
> will remove this change and post a new patch to retain the other
> issues that were fixed.
>

I have removed the changes that david had pointed out and retained the
remaining changes. Attaching the patch for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From fbafa5eaaa84028b3bbfb7cde0cbcc3963fd033a Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 14 Jul 2020 12:21:37 +0530
Subject: [PATCH] Improvements in copy from.

There are a couple of improvements for copy from in this patch, detailed
below:
a) Copy from stdin copied a smaller amount of data into the buffer than the
space available, because minread was passed as 1 to CopyGetData. Fixed it by
passing the actual space available in the buffer; this reduces the frequent
calls to CopyGetData.
b) Copy from reads the header line and does nothing with it, so we need not
clear the EOL and need not convert the header line to server encoding.
---
 src/backend/commands/copy.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 44da71c..bc27dfc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -796,6 +796,7 @@ CopyLoadRawBuf(CopyState cstate)
 {
 	int			nbytes;
 	int			inbytes;
+	int			minread = 1;
 
 	if (cstate->raw_buf_index < cstate->raw_buf_len)
 	{
@@ -807,8 +808,11 @@ CopyLoadRawBuf(CopyState cstate)
 	else
 		nbytes = 0;/* no data need be saved */
 
+	if (cstate->copy_dest == COPY_NEW_FE)
+		minread = RAW_BUF_SIZE - nbytes;
+
 	inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
-		  1, RAW_BUF_SIZE - nbytes);
+		  minread, RAW_BUF_SIZE - nbytes);
 	nbytes += inbytes;
 	cstate->raw_buf[nbytes] = '\0';
 	cstate->raw_buf_index = 0;
@@ -3869,7 +3873,7 @@ CopyReadLine(CopyState cstate)
 			} while (CopyLoadRawBuf(cstate));
 		}
 	}
-	else
+	else if (!(cstate->cur_lineno == 0 && cstate->header_line))
 	{
 		/*
 		 * If we didn't hit EOF, then we must have transferred the EOL marker
@@ -3903,8 +3907,9 @@ CopyReadLine(CopyState cstate)
 		}
 	}
 
-	/* Done reading the line.  Convert it to server encoding. */
-	if (cstate->need_transcoding)
+	/* Done reading the line.  Convert it to server encoding if not header. */
+	if (cstate->need_transcoding &&
+		!(cstate->cur_lineno == 0 && cstate->header_line))
 	{
 		char	   *cvt;
 
-- 
1.8.3.1



Re: Binary support for pgoutput plugin

2020-07-14 Thread Daniel Gustafsson
> On 13 Jul 2020, at 15:11, Dave Cramer  wrote:

I took another look at the updated version today.  Since there now were some
unused variables and (I believe) unnecessary checks (int size and endianness
etc) left, I took the liberty to fix those.  I also fixed some markup in the
catalog docs, did some minor tidying up and ran pgindent on it.

The attached is a squash of the 4 patches in your email with the above fixes.
I'm again marking it RfC since I believe all concerns raised so far have been
addressed.

> Added the test case that Daniel had created.

Nope, still missing AFAICT =) But I've included it in the attached.

cheers ./daniel



0001-Add-binary-protocol-for-publications-and-subscriptio.patch
Description: Binary data


Re: [PATCH] Performance Improvement For Copy From Binary Files

2020-07-14 Thread vignesh C
On Tue, Jul 14, 2020 at 11:19 AM Amit Langote  wrote:
>
>
> Sounds fine to me.  Although CopyLoadRawBuf() does not seem to be a
> candidate for rigorous code optimization as it does not get called
> that often.
>

I thought we could include that change as we are making changes around
that code. The rest of the changes looked fine to me. Also, I noticed that
the commit message was missing in the patch.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-07-14 Thread Bharath Rupireddy
On Tue, Jul 14, 2020 at 6:13 PM Ashutosh Bapat
 wrote:
>
> Has this been added to CF, possibly next CF?
>

I have not added it yet. I request that it be taken up in this CF, as the
final patch for review (the v3 patch) is ready and shared. We can target it
for the next CF if there are major issues with the patch.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Retry Cached Remote Connections for postgres_fdw in case remote backend gets killed/goes away

2020-07-14 Thread Ashutosh Bapat
Has this been added to CF, possibly next CF?

On Tue, Jul 14, 2020 at 7:27 AM Bharath Rupireddy
 wrote:
>
> >
> > You could get a backend's PID using PQbackendPID and then kill it by
> > calling pg_terminate_backend() to kill the remote backend to automate
> > scenario of remote backend being killed.
> >
>
> I already added the test case in v2 patch itself(added one more test
> case in v3 patch), using the similar approach.
>
> >
> > > For, one of the Ashutosh's review comments from [3] "the fact that the
> > > same connection may be used by multiple plan nodes", I tried to have
> > > few use cases where there exist joins on two foreign tables on the
> > > same remote server, in a single query, so essentially, the same
> > > connection was used for multiple plan nodes. In this case we avoid
> > > retrying for the second GetConnection() request for the second foreign
> > > table, with the check entry->xact_depth <= 0 , xact_depth after the
> > > first GetConnection() and the first remote query will become 1 and we
> > > don't hit the retry logic and seems like we are safe here. Please add
> > > If I'm missing something here.
> > >
> > > Request the community to consider the patch for further review if the
> > > overall idea seems beneficial.
> >
> > I think this idea will be generally useful if your work on dropping
> > stale connection uses idle_connection_timeout or something like that
> > on the remote server.
>
> Assuming we use idle_connection_timeout or some other means(as it is
> not yet finalized, I will try to respond in that mail chain) to drop
> stale/idle connections from the local backend, I think we have two
> options 1) deleting that cached entry from the connection cache
> entirely using disconnect_pg_server() and hash table remove. This
> frees up some space and we don't have to deal with the connection
> invalidations and don't have to bother on resetting cached entry's
> other parameters such as xact_depth, have_prep_stmt etc. 2) or we
> could just drop the connections using disconnect_pg_server(), retain
> the hash entry, reset other parameters, and deal with the
> invalidations. This is like, we maintain unnecessary info in the
> cached entry, where we actually don't have a connection at all and
> keep holding some space for cached entry.
>
> IMO, if we go with option 1, then it will be good.
>
> Anyways, this connection retry feature will not have any dependency on
> the idle_connection_timeout or dropping stale connection feature,
> because there can be many reasons where remote backends go away/get
> killed.
>
> If I'm not sidetracking - if we use something like
> idle_session_timeout [1] on the remote server, this retry feature will
> be very useful.
>
> >
> > About the patch. It seems we could just catch the error from
> > begin_remote_xact() in GetConnection() and retry connection if the
> > error is "bad connection". Retrying using PQreset() might be better
> > than calling PQConnect* always.
> >
>
> Thanks for the comment, it made life easier. Added the patch with the
> changes. Please take a look at the v3 patch and let me know if still
> something needs to be done.
>
> [1] - 
> https://www.postgresql.org/message-id/763A0689-F189-459E-946F-F0EC4458980B%40hotmail.com
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Best Wishes,
Ashutosh Bapat




Re: DROP relation IF EXISTS Docs and Tests - Bug Fix

2020-07-14 Thread Justin Pryzby
On Tue, Jul 14, 2020 at 07:25:56AM +0200, Pavel Stehule wrote:
> út 14. 7. 2020 v 0:37 odesílatel David G. Johnston 
>  napsal:
> > On Mon, Jul 13, 2020 at 2:12 AM Pavel Stehule  
> > wrote:
> I think so now all changes are correct and valuable. I'll mark this patch
> as ready for commit

This is failing relevant tests in cfbot:

 drop_if_exists   ... FAILED  450 ms

-- 
Justin




Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-14 Thread Ashutosh Bapat
On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
 wrote:
>
> Hi Hackers,
>
> The idea of achieving Postgres scaling via sharding using postgres_fdw +
> partitioning got a lot of attention last years. Many optimisations have
> been done in this direction: partition pruning, partition-wise
> aggregates / joins, postgres_fdw push-down of LIMIT, GROUP BY, etc. In
> many cases they work really nice.
>
> However, still there is a vast case, where postgres_fdw + native
> partitioning doesn't perform so good — Multi-tenant architecture. From
> the database perspective it is presented well in this Citus tutorial
> [1]. The main idea is that there is a number of tables and all of them
> are sharded / partitioned by the same key, e.g. company_id. That way, if
> every company mostly works within its own data, then every query may be
> effectively executed on a single node without a need for an internode
> communication.
>
> I built a simple two node multi-tenant schema for tests, which can be
> easily set up with attached scripts. It creates three tables (companies,
> users, documents) distributed over two nodes. Everything can be found in
> this Gist [2] as well.
>
> Some real-life test queries show, that all single-node queries aren't
> pushed-down to the required node. For example:
>
> SELECT
>  *
> FROM
>  documents
>  INNER JOIN users ON documents.user_id = users.id
> WHERE
>  documents.company_id = 5
>  AND users.company_id = 5;

There are a couple of things happening here:
1. the clauses on company_id in the WHERE clause are causing partition
pruning. Partition-wise join is disabled with partition pruning before
PG13. In PG13 we have added an advanced partition matching algorithm
which will allow partition-wise join together with partition pruning.
2. the query has no equality condition on the partition key of the
tables being joined. Partition-wise join is possible only when there's
an equality condition on the partition keys (company_id) of the
joining tables. PostgreSQL's optimizer is not smart enough to convert
the equality conditions in the WHERE clause into equality conditions on
the partition keys. So having those conditions just in the WHERE clause
does not help. Instead please add equality conditions on the partition keys
in the JOIN .. ON clause or the WHERE clause (only for INNER join), as
sketched below.
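
To illustrate point 2, here is a sketch of the rewritten query with an
explicit equality condition on the partition keys (table and column names
taken from the test schema above); whether it actually gets pushed down
still depends on the server version and settings:

SELECT
 *
FROM
 documents
 INNER JOIN users ON documents.user_id = users.id
                 AND documents.company_id = users.company_id
WHERE
 documents.company_id = 5;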

>
> executed as following
>
>QUERY PLAN
> ---
>   Nested Loop
> Join Filter: (documents.user_id = users.id)
> ->  Foreign Scan on users_node2 users
> ->  Materialize
>   ->  Foreign Scan on documents_node2 documents
>
> i.e. it uses two foreign scans and does the final join locally. However,
> once I specify target partitions explicitly, then the entire query is
> pushed down to the foreign node:
>
> QUERY PLAN
> -
>   Foreign Scan
> Relations: (documents_node2) INNER JOIN (users_node2)
>
> Execution time is dropped significantly as well — by more than 3 times
> even for this small test database. Situation for simple queries with
> aggregates or joins and aggregates followed by the sharding key filter
> is the same. Something similar was briefly discussed in this thread [3].
>
> IIUC, it means that push-down of queries through the postgres_fdw works
> perfectly well, the problem is with partition-wise operation detection
> at the planning time. Currently, partition-wise aggregate routines,
> e.g., looks for a GROUP BY and checks whether sharding key exists there
> or not. After that PARTITIONWISE_AGGREGATE_* flag is set. However, it
> doesn't look for a content of WHERE clause, so frankly speaking it isn't
> a problem, this functionality is not yet implemented.
>
> Actually, sometimes I was able to push down queries with aggregate
> simply by adding an additional GROUP BY with sharding key, like this:
>
> SELECT
>  count(*)
> FROM
>  documents
> WHERE
>  company_id = 5
> GROUP BY company_id;

This gets pushed down since GROUP BY clause is on the partition key.

>
> where this GROUP BY obviously doesn't change a results, it just allows
> planner to choose from more possible paths.
>
> Also, I have tried to hack it a bit and forcedly set
> PARTITIONWISE_AGGREGATE_FULL for this particular query. Everything
> executed fine and returned result was correct, which means that all
> underlying machinery is ready.
>
> That way, I propose a change to the planner, which will check whether
> partitioning key exist in the WHERE clause and will set
> PARTITIONWISE_AGGREGATE_* flags if appropriate. The whole logic may look
> like:
>
> 1. If the only one condition by partitioning key is used (like above),
> then it is PARTITIONWISE_AGGREGATE_FULL.
> 2. If several conditions are used, then it should be
> PARTITIONWISE_AGGREGATE_PARTIAL.
>
> I'm aware that WHERE clause may be extremely complex in general, but we
> could narrow this possible optimisation to the same 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Magnus Hagander
On Tue, Jul 14, 2020 at 1:52 PM Robert Haas  wrote:

> On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander 
> wrote:
> > The counterexample of this is pg_resetwal. The number of people who have
> broken their database with that tool is not small.
>
> Very true.
>
> > That said, we could have a separate "class" of extensions/tools in the
> distribution, and encourage packagers to pack them up as separate packages
> for example. Technically they don't have to be in the same source
> repository at all of course, but I have a feeling some of them might be a
> lot easier to maintain if they are. And then the user would just have to
> install something like "postgresql-14-wizardtools". They'd still be
> available to everybody, of course, but at least the knives would be in a
> closed drawer until intentionally picked up.
>
> I don't think that does much to help with the immediate problem here,
> because people are being bitten by this problem *now* and a packaging
> change like this will take a long time to happen and become standard
> out there, but I think it's a good idea.
>

I don't think that it necessarily has to be. As long as we're talking about
adding something and not actually changing their existing packages, getting
this into both yum and apt shouldn't be *that* hard, if it's coordinated
well with Christoph and Devrim (obviously that's based on my experience and
they will have to give a more complete answer themselves). It would be a
lot more complicated if it involved changing an existing package.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Is it useful to record whether plans are generic or custom?

2020-07-14 Thread torikoshia

On 2020-07-10 10:49, torikoshia wrote:

On 2020-07-08 16:41, Fujii Masao wrote:

On 2020/07/08 10:14, torikoshia wrote:

On 2020-07-06 22:16, Fujii Masao wrote:

On 2020/06/11 14:59, torikoshia wrote:

On 2020-06-10 18:00, Kyotaro Horiguchi wrote:



+    TupleDescInitEntry(tupdesc, (AttrNumber) 8, "last_plan",

This could be a problem if we showed the last plan in this view.  I
think "last_plan_type" would be better.

+    if (prep_stmt->plansource->last_plan_type == 
PLAN_CACHE_TYPE_CUSTOM)

+    values[7] = CStringGetTextDatum("custom");
+    else if (prep_stmt->plansource->last_plan_type == 
PLAN_CACHE_TYPE_GENERIC)

+    values[7] = CStringGetTextDatum("generic");
+    else
+    nulls[7] = true;

Using switch-case prevents future additional types (if any) from being
unhandled.  I think we are recommending that as a convention.


Thanks for your reviewing!

I've attached a patch that reflects your comments.


Thanks for the patch! Here are the comments.


Thanks for your review!


+    Number of times generic plan was choosen
+    Number of times custom plan was choosen

Typo: "choosen" should be "chosen"?


Thanks, fixed them.

+  role="column_definition">

+   last_plan_type text
+  
+  
+    Tells the last plan type was generic or custom. If the 
prepared

+    statement has not executed yet, this field is null
+  

Could you tell me how this information is expected to be used?
I think that generic_plans and custom_plans are useful when investigating
the cause of a performance drop by cached plan mode. But I failed to get
how useful last_plan_type is.


This may be an exceptional case, but I once had a case where I needed
to ensure whether a generic or custom plan was chosen for specific
queries in a development environment.


In your case, probably you had to ensure that the last several (or every)
executions chose a generic or custom plan? If yes, I'm afraid that displaying
only the last plan mode is not enough for your case. No?
So it seems better to check the generic_plans or custom_plans columns in the
view rather than last_plan_type even in your case. Thoughts?


Yeah, I now feel last_plan is not so necessary and only the numbers of
generic/custom plan is enough.

If there are no objections, I'm going to remove this column and related 
codes.


As mentioned, I removed last_plan column.


Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATIONFrom e417fc58d51ab0c06de34a563d580c7d4017a1db Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Tue, 14 Jul 2020 20:57:16 +0900
Subject: [PATCH] We didn't have an easy way to find how many times generic and
 custom plans of a prepared statement have been executed. This patch exposes
 the numbers of both plans in pg_prepared_statements.

---
 doc/src/sgml/catalogs.sgml| 18 ++
 src/backend/commands/prepare.c| 15 +---
 src/backend/utils/cache/plancache.c   | 17 +
 src/include/catalog/pg_proc.dat   |  6 ++--
 src/include/utils/plancache.h |  3 +-
 src/test/regress/expected/prepare.out | 51 +++
 src/test/regress/expected/rules.out   |  6 ++--
 src/test/regress/sql/prepare.sql  | 15 
 8 files changed, 115 insertions(+), 16 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index e9cdff4864..8bc6d685f4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -10830,6 +10830,24 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
frontend/backend protocol
   
  
+
+ 
+  
+   generic_plans int8
+  
+  
+Number of times generic plan was chosen
+  
+ 
+
+ 
+  
+   custom_plans int8
+  
+  
+Number of times custom plan was chosen
+  
+ 
 

   
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 80d6df8ac1..4b18be5b27 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -694,7 +694,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 /*
  * This set returning function reads all the prepared statements and
- * returns a set of (name, statement, prepare_time, param_types, from_sql).
+ * returns a set of (name, statement, prepare_time, param_types, from_sql,
+ * generic_plans, custom_plans).
  */
 Datum
 pg_prepared_statement(PG_FUNCTION_ARGS)
@@ -723,7 +724,7 @@ pg_prepared_statement(PG_FUNCTION_ARGS)
 	 * build tupdesc for result tuples. This must match the definition of the
 	 * pg_prepared_statements view in system_views.sql
 	 */
-	tupdesc = CreateTemplateTupleDesc(5);
+	tupdesc = CreateTemplateTupleDesc(7);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 1, "name",
 	   TEXTOID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 2, "statement",
@@ -734,6 +735,10 @@ pg_prepared_statement(PG_FUNCTION_ARGS)
 	   REGTYPEARRAYOID, -1, 0);
 	

Re: [patch] demote

2020-07-14 Thread Amul Sul
On Mon, Jul 13, 2020 at 8:35 PM Jehan-Guillaume de Rorthais
 wrote:
>
> Hi,
>
> Another summary + patch + tests.
>
> This patch supports 2PC. The goal is to keep them safe during demote/promote
> actions so they can be committed/rollbacked later on a primary. See tests.
>

I am wondering whether it is necessary to clear prepared transactions from shared memory.
Can't we simply skip clearing and restoring prepared transactions while demoting?

Regards,
Amul




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Robert Haas
On Tue, Jul 14, 2020 at 4:59 AM Magnus Hagander  wrote:
> The counterexample of this is pg_resetwal. The number of people who have broken 
> their database with that tool is not small.

Very true.

> That said, we could have a separate "class" of extensions/tools in the 
> distribution, and encourage packagers to pack them up as separate packages 
> for example. Technically they don't have to be in the same source repository 
> at all of course, but I have a feeling some of them might be a lot easier to 
> maintain if they are. And then the user would just have to install something 
> like "postgresql-14-wizardtools". They'd still be available to everybody, of 
> course, but at least the knives would be in a closed drawer until 
> intentionally picked up.

I don't think that does much to help with the immediate problem here,
because people are being bitten by this problem *now* and a packaging
change like this will take a long time to happen and become standard
out there, but I think it's a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Robert Haas
On Tue, Jul 14, 2020 at 3:08 AM Peter Eisentraut
 wrote:
> In the meantime, if you're wizard enough to deal with this kind of
> thing, you could also clone the module from the PG14 tree and build it
> against older versions manually.

But what if you are NOT a wizard, and a wizard is giving you
directions? Then having to build from source is a real pain. And
that's normally the situation I'm in when a customer has this issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Peter Eisentraut

On 2020-07-14 02:41, Robert Haas wrote:

I think that our normal back-patching rules are based primarily on the
risk of breaking things, and a new contrib module carries a pretty
negligible risk of breaking anything that works today.


I think that all feature code ought to go through a beta cycle.  So if 
this code makes it to 14.0 or 14.1, then I'd consider backpatching it.



Now, if this goes into v14, we can certainly stick it up on github, or
put it out there in some other way for users to download,
self-compile, and install, but that seems noticeably less convenient
for people who need it, and I'm not clear what the benefit to the
project is.


In the meantime, if you're wizard enough to deal with this kind of 
thing, you could also clone the module from the PG14 tree and build it 
against older versions manually.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: problem with RETURNING and update row movement

2020-07-14 Thread osumi.takami...@fujitsu.com
Amit san


Hello. I've tested your patch.

This patch applies cleanly and make check-world produced no failures.

I didn't do a performance test
because this patch doesn't affect performance, in my opinion.

I reproduced the bug you described before,
which is now prevented by your patch.

After applying your patch, I got correct output without error
using the test case in the first mail of this thread.

Just a small comment about your patch.
I felt the test you added in update.sql could be simpler or shorter in another
form.
Excuse me if I say something silly.
It's because I suppose you can check that the bug is prevented without defining
both a function and its trigger for this case. Neither of them is essentially
connected with the row movement between the source partition and the
destination partition, so could they be replaced by a simpler expression?

Best,
Takamichi Osumi


Re: TAP tests and symlinks on Windows

2020-07-14 Thread Peter Eisentraut

On 2020-07-10 13:58, Andrew Dunstan wrote:

After much frustration and gnashing of teeth here's a patch that allows
almost all the TAP tests involving symlinks to work as expected on all
Windows build environments, without requiring an additional Perl module.
I have tested this on a system that is very similar to that running
drongo and fairywren, with both msys2 and  MSVC builds.


Thanks.  This patch works for me in my environment.  The code changes 
look very clean, so it seems like a good improvement.


Attached is a small fixup patch for some typos and a stray debug message.

A perltidy run might be worthwhile, as Michael already mentioned.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 61fc47e7fd82f65ada6a25ddb9ea0458da8e6d5e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 14 Jul 2020 09:11:42 +0200
Subject: [PATCH] fixup! win-tap-symlink-2.patch

---
 doc/src/sgml/regress.sgml | 2 +-
 src/test/perl/TestLib.pm  | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/regress.sgml b/doc/src/sgml/regress.sgml
index e6e2b2762d..4e370aeae4 100644
--- a/doc/src/sgml/regress.sgml
+++ b/doc/src/sgml/regress.sgml
@@ -775,7 +775,7 @@ TAP Tests

 The TAP tests require the Perl module IPC::Run.
 This module is available from CPAN or an operating system package.
-On Windows, Win32API::File is also required .
+On Windows, Win32API::File is also required.

 

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 207e92fdc0..d595c7e3c2 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -630,7 +630,7 @@ sub check_pg_config
 
 =item dir_symlink(oldname, newname)
 
-Portably create a symlink for a director. On Windows this creates a junction.
+Portably create a symlink for a directory. On Windows this creates a junction.
 Elsewhere it just calls perl's builtin symlink.
 
 =cut
@@ -651,7 +651,6 @@ sub dir_symlink
# need some indirection on msys
$cmd = qq{echo '$cmd' | \$COMSPEC /Q};
}
-   note("dir_symlink cmd: $cmd");
system($cmd);
}
else
-- 
2.27.0



Re: max_slot_wal_keep_size and wal_keep_segments

2020-07-14 Thread David Steele

On 7/14/20 12:00 AM, Fujii Masao wrote:


The patch was no longer applied cleanly because of recent commit.
So I updated the patch. Attached.

Barring any objection, I will commit this patch.


This doesn't look right:

+   the  most recent megabytes
+   WAL files plus one WAL file are

How about:

+megabytes of
+   WAL files plus one WAL file are

Other than that, looks good to me.

Regards,
--
-David
da...@pgmasters.net




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-07-14 Thread Julien Rouhaud
On Tue, Jul 14, 2020 at 07:11:02PM +0900, Atsushi Torikoshi wrote:
> Hi,
> 
> v9 patch fails to apply to HEAD, could you check and rebase it?

Thanks for the notice, v10 attached!

> And here are minor typos.
> 
>  79 +* utility statements.  Note that we don't compute a queryId
> for prepared
>  80 +* statemets related utility, as those will inherit from the
> underlying
>  81 +* statements's one (except DEALLOCATE which is entirely
> untracked).
> 
> statemets -> statements
> statements's -> statements' or statement's?

Thanks!  I went with "statement's".
>From 8c651ee05c8a5e55966ad1646f48e83941d3776a Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 18 Mar 2019 18:55:50 +0100
Subject: [PATCH v10] Expose queryid in pg_stat_activity and log_line_prefix

Similarly to other fields in pg_stat_activity, only the queryid from the top
level statements are exposed, and if the backends status isn't active then the
queryid from the last executed statements is displayed.

Also add a %Q placeholder to include the queryid in the log_line_prefix, which
will also only expose top level statements.

Author: Julien Rouhaud
Reviewed-by: Evgeny Efimkin, Michael Paquier, Yamada Tatsuro, Atsushi Torikoshi
Discussion: https://postgr.es/m/CA+8PKvQnMfOE-c3YLRwxOsCYXQDyP8VXs6CDtMZp1V4=d4l...@mail.gmail.com
---
 .../pg_stat_statements/pg_stat_statements.c   | 179 --
 doc/src/sgml/config.sgml  |   9 +-
 doc/src/sgml/monitoring.sgml  |  15 ++
 src/backend/catalog/system_views.sql  |   1 +
 src/backend/executor/execMain.c   |   8 +
 src/backend/executor/execParallel.c   |  14 +-
 src/backend/executor/nodeGather.c |   3 +-
 src/backend/executor/nodeGatherMerge.c|   4 +-
 src/backend/parser/analyze.c  |   5 +
 src/backend/postmaster/pgstat.c   |  65 +++
 src/backend/tcop/postgres.c   |   5 +
 src/backend/utils/adt/pgstatfuncs.c   |   7 +-
 src/backend/utils/error/elog.c|  10 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   6 +-
 src/include/executor/execParallel.h   |   3 +-
 src/include/pgstat.h  |   5 +
 src/test/regress/expected/rules.out   |   9 +-
 18 files changed, 270 insertions(+), 79 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 14cad19afb..a51c207b49 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -115,6 +115,14 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE1024	/* query serialization buffer size */
 
+/*
+ * Utility statements that pgss_ProcessUtility and pgss_post_parse_analyze
+ * ignores.
+ */
+#define PGSS_HANDLED_UTILITY(n)		(!IsA(n, ExecuteStmt) && \
+	!IsA(n, PrepareStmt) && \
+	!IsA(n, DeallocateStmt))
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -345,7 +353,8 @@ static void pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 ProcessUtilityContext context, ParamListInfo params,
 QueryEnvironment *queryEnv,
 DestReceiver *dest, QueryCompletion *qc);
-static uint64 pgss_hash_string(const char *str, int len);
+static const char *pgss_clean_querytext(const char *query, int *location, int *len);
+static uint64 pgss_compute_utility_queryid(const char *query, int query_len);
 static void pgss_store(const char *query, uint64 queryId,
 	   int query_location, int query_len,
 	   pgssStoreKind kind,
@@ -845,16 +854,34 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
 		return;
 
 	/*
-	 * Utility statements get queryId zero.  We do this even in cases where
-	 * the statement contains an optimizable statement for which a queryId
-	 * could be derived (such as EXPLAIN or DECLARE CURSOR).  For such cases,
-	 * runtime control will first go through ProcessUtility and then the
-	 * executor, and we don't want the executor hooks to do anything, since we
-	 * are already measuring the statement's costs at the utility level.
+	 * We compute a queryId now so that it can get exported in our
+	 * PgBackendStatus.  pgss_ProcessUtility will later discard it to prevent
+	 * double counting of optimizable statements that are directly contained in
+	 * utility statements.  Note that we don't compute a queryId for prepared
+	 * statements related utility, as those will inherit from the underlying
+	 * statement's one (except DEALLOCATE which is entirely untracked).
 	 */
 	if (query->utilityStmt)
 	{
-		query->queryId = UINT64CONST(0);
+		if (pgss_track_utility && PGSS_HANDLED_UTILITY(query->utilityStmt)
+			&& pstate->p_sourcetext)
+		{
+			const char *querytext = pstate->p_sourcetext;
+			int query_location = 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Jochem van Dieten
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane  wrote:
> I think you're attacking a straw man.  I'm well aware of how open source
> works, thanks.  What I'm saying is that contrib is mostly seen to be
> reasonably harmless stuff.  Sure, you can overwrite data you didn't want
> to with adminpack's pg_file_write.  But that's the price of having such a
> capability at all, and in general it's not hard for users to understand
> both the uses and risks of that function.  That statement does not apply
> to the functions being proposed here.  It doesn't seem like they could
> possibly be safe to use without very specific expert advice --- and even
> then, we're talking rather small values of "safe".

Would it be possible to make them safe(r)? For instance, truncate
only, don't freeze; only tuples whose visibility information is
corrupted; and only in non-catalog tables. What exactly is the risk in
that case? Foreign keys might not be satisfied, which might make it
impossible to restore a dump, but is that worse than what a DBA can do
anyway? I would think that it is not and would leave the database in a
state DBAs are much better equipped to deal with.
Or would it be possible to create a table like the original table
(minus any constraints) and copy all tuples with corrupted visibility
there before truncating to a dead line pointer?

Jochem




Re: Don't choke on files that are removed while pg_rewind runs.

2020-07-14 Thread Daniel Gustafsson
> On 13 Jul 2020, at 14:18, Michael Paquier  wrote:

> That sounds like a good idea with an extra qual in the first part of
> the inner CTE, if coupled with a check to make sure that we never
> get a NULL result.  Now there is IMO an argument to not complicate
> more this query as it is not like a lot of tuples would get filtered
> out anyway because of a NULL set of values?  I don't have strong
> feelings for one approach or the other, but if I were to choose, I
> would just let the code as-is, without the change in the CTE.

I don't have strong opinions either, both approaches will work, so feel free to
go ahead with the proposed change.

cheers ./daniel



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2020-07-14 Thread Atsushi Torikoshi
Hi,

v9 patch fails to apply to HEAD, could you check and rebase it?

And here are minor typos.

 79 +* utility statements.  Note that we don't compute a queryId
for prepared
 80 +* statemets related utility, as those will inherit from the
underlying
 81 +* statements's one (except DEALLOCATE which is entirely
untracked).

statemets -> statements
statements's -> statements' or statement's?

Regards,

--
Atsushi Torikoshi

On Wed, Apr 8, 2020 at 11:38 PM Julien Rouhaud  wrote:

> On Tue, Apr 7, 2020 at 8:40 AM Tatsuro Yamada
>  wrote:
> >
> > Hi Julien,
> >
> > On 2020/04/02 22:25, Julien Rouhaud wrote:
> > > New conflict, rebased v9 attached.
> >
> > I tested the patch on the head (c7654f6a3) and
> > the result was fine. See below:
> >
> > $ make installcheck-world
> > =
> >   All 1 tests passed.
> > =
>
> Thanks Yamada-san!  Unfortunately this patch still didn't attract any
> committer, so I moved it to the next commitfest.
>
>
>


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-07-14 Thread Bharath Rupireddy
Thanks all for the ideas. There have been various points/approaches
discussed in the entire email chain so far.
I would like to summarize all of them here, so that we can agree on
one of the options and proceed further with this feature.

The issue this feature is trying to solve:
In postgres_fdw, rarely used remote connections lie idle in the
connection cache (per backend), and so do the corresponding remote sessions,
for long-lasting local sessions, which may unnecessarily eat up connections
on remote postgres servers.

Approach #1:
A new session-level GUC (proposed name "enable_connectioncache"). When
set to true (the default), remote connections are cached; otherwise
they are not. When set to false, every time a foreign query is issued
a new connection is made at remote xact begin and dropped from the
connection cache at remote xact end. This GUC applies to all the
foreign servers used in the session, so it may not be possible to have
control at the foreign server level. It may also not be a good idea to
have postgres core controlling a postgres_fdw property.
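For illustration, usage would be just the following
("enable_connectioncache" being the proposed name, not an existing GUC):

SET enable_connectioncache TO off;  -- connections made from now on are
                                    -- dropped at remote xact end instead
                                    -- of being cached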

Approach #2:
A new postgres_fdw function, similar to dblink's dblink_disconnect()
(possibly named postgres_fdw_disconnect_open_connections()). Seems
easy, but users would have to call this function frequently to clean
up the cached entries. That may not always be possible; it requires
some sort of monitoring and issuing the new disconnect function from
within the application code.
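A usage sketch, with the proposed (not yet existing) function name:

SELECT postgres_fdw_disconnect_open_connections();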

Approach #3:
A postgres_fdw foreign server level option: keep_connection (on/off).
When set to on (the default), the entries related to that particular
foreign server are cached; otherwise they are not. This gives control
at the foreign server level, which may not be possible with a single
GUC. It also addresses the concern about having postgres core solve a
postgres_fdw problem. But when the same foreign server is to be used
in other sessions with a different keep_connection setting, a possible
solution is to have two foreign server definitions for the same remote
server, one with keep_connection on and another with it off, and use
whichever is appropriate; whenever any other foreign server
property/option changes, both definitions have to be maintained (see
the sketch below).
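For example (keep_connection is the proposed option, not something that
exists today; the connection options are made up):

CREATE SERVER fs_cached FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'appdb', keep_connection 'on');
CREATE SERVER fs_uncached FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'appdb', keep_connection 'off');
-- any later change to host/dbname/etc. has to be applied to both definitions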

Approach #4:
A postgres_fdw foreign server level option: a connection idle time,
i.e. the amount of idle time after which that server's cached entry
goes away. The backend, before going idle itself, would probably have
to check the cached entries and see whether any of them has timed out.
One problem is that, if the backend only does this before going idle,
what about entries that haven't reached the timeout at the point when
we go idle, but do reach it later?
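Again purely for illustration, with an invented option name
(connection_idle_timeout) and an invented unit of seconds:

ALTER SERVER fs_cached OPTIONS (ADD connection_idle_timeout '600');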

I tried to summarize the points in a concise manner; forgive me if I
missed anything.

Thoughts?

Credits and thanks to: vignesh C, David G. Johnston, Masahiko Sawada,
Bruce Momjian, Rushabh Lathia, Ashutosh Bapat, Robert Haas.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Petr Jelinek

On 14/07/2020 11:36, Dilip Kumar wrote:

On Tue, Jul 14, 2020 at 2:47 PM Petr Jelinek  wrote:


Hi,

On 14/07/2020 10:29, Amit Kapila wrote:

On Tue, Jul 14, 2020 at 12:05 PM Dilip Kumar  wrote:


On Tue, Jul 14, 2020 at 11:14 AM Amit Kapila  wrote:


On Tue, Jul 14, 2020 at 11:00 AM Dilip Kumar  wrote:


On Thu, Jul 9, 2020 at 6:55 PM Amit Kapila  wrote:


On Thu, Jul 9, 2020 at 6:14 PM Petr Jelinek  wrote:



If we were to support the origin forwarding, then strictly speaking we
need everything only at commit time from correctness perspective,



Okay.  Anyway streaming mode is optional, so in such cases, we can keep it 'off'


but
ideally origin_id would be best sent with first message as it can be
used to filter out changes at decoding stage rather than while we
process the commit so having it set early improves performance of decoding.



Yeah, makes sense.  So, we will just send origin_id (with first
streaming start message) and leave others.


So IIUC, currently we are sending the latest origin_id which is set
during the commit time.  So in our case, while we start streaming we
will send the origin_id of the latest change in the current stream
right?



It has to be sent only once with the first start message not with
consecutive start messages.


Okay,  so do you mean to say that with the first start message we send
the origin_id of the latest change?



Yes.


   because during the transaction
lifetime, the origin id can be changed.



Yeah, it could be changed but if we have to send again apart from with
the first message then it should be sent with each message.  So, I
think it is better to just send it once during the transaction as we
do it now (send with begin message).




I am not sure if I can follow the discussion here very well, but if I
understand correctly I'd like to clarify two things:
- origin id does not change mid transaction as you can only have one per xid


Actually, I was talking about if someone changes the session origin
then which origin id we should send?  currently, we send data only
during the commit so we take the origin id from the commit wal and
send the same.  In the below example, I am inserting 2 records in a
transaction and each of them has different origin id.

begin;
select pg_replication_origin_session_setup('o1');
insert into t values(1, 'test');
select pg_replication_origin_session_reset();
select pg_replication_origin_session_setup('o2');   --> Origin ID changed
insert into t values(2, 'test');
commit;



Commit record and commit_ts record will both include only 'o2', while 
individual DML WAL records will contain one or the other depending on 
when they were done.


The origin API is really not prepared for this situation
(independently of streaming), because the origin lookup for all rows in
that transaction will return 'o2', but decoding will decode whatever is
in the DML WAL record.

One can't even use this approach for sensible filtering, as the
ultimate fate of the whole transaction is decided by what's in the
commit record, and the filter callback only provides the origin id, not
the record being processed, so the plugin can't differentiate. So it's
hard to see how the above pattern could be used for anything but
breaking things. Not sure what Andres' original intention was with
allowing this.


--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Amit Kapila
On Tue, Jul 14, 2020 at 2:47 PM Petr Jelinek  wrote:
>
> Hi,
>
> On 14/07/2020 10:29, Amit Kapila wrote:
> > On Tue, Jul 14, 2020 at 12:05 PM Dilip Kumar  wrote:
> >>
> >> On Tue, Jul 14, 2020 at 11:14 AM Amit Kapila  
> >> wrote:
> >>>
> >>> On Tue, Jul 14, 2020 at 11:00 AM Dilip Kumar  
> >>> wrote:
> 
>  On Thu, Jul 9, 2020 at 6:55 PM Amit Kapila  
>  wrote:
> >
> > On Thu, Jul 9, 2020 at 6:14 PM Petr Jelinek  
> > wrote:
> >>
> >>
> >> If we were to support the origin forwarding, then strictly speaking we
> >> need everything only at commit time from correctness perspective,
> >>
> >
> > Okay.  Anyway streaming mode is optional, so in such cases, we can keep 
> > it 'off'
> >
> >> but
> >> ideally origin_id would be best sent with first message as it can be
> >> used to filter out changes at decoding stage rather than while we
> >> process the commit so having it set early improves performance of 
> >> decoding.
> >>
> >
> > Yeah, makes sense.  So, we will just send origin_id (with first
> > streaming start message) and leave others.
> 
>  So IIUC, currently we are sending the latest origin_id which is set
>  during the commit time.  So in our case, while we start streaming we
>  will send the origin_id of the latest change in the current stream
>  right?
> 
> >>>
> >>> It has to be sent only once with the first start message not with
> >>> consecutive start messages.
> >>
> >> Okay,  so do you mean to say that with the first start message we send
> >> the origin_id of the latest change?
> >>
> >
> > Yes.
> >
> >>   because during the transaction
> >> lifetime, the origin id can be changed.
> >>
> >
> > Yeah, it could be changed but if we have to send again apart from with
> > the first message then it should be sent with each message.  So, I
> > think it is better to just send it once during the transaction as we
> > do it now (send with begin message).
> >
> >
>
> I am not sure if I can follow the discussion here very well, but if I
> understand correctly I'd like to clarify two things:
> - origin id does not change mid transaction as you can only have one per xid
>

As shown by Dilip, I don't think currently we have any way to prevent
this from changing during the transaction.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Dilip Kumar
On Tue, Jul 14, 2020 at 2:47 PM Petr Jelinek  wrote:
>
> Hi,
>
> On 14/07/2020 10:29, Amit Kapila wrote:
> > On Tue, Jul 14, 2020 at 12:05 PM Dilip Kumar  wrote:
> >>
> >> On Tue, Jul 14, 2020 at 11:14 AM Amit Kapila  
> >> wrote:
> >>>
> >>> On Tue, Jul 14, 2020 at 11:00 AM Dilip Kumar  
> >>> wrote:
> 
>  On Thu, Jul 9, 2020 at 6:55 PM Amit Kapila  
>  wrote:
> >
> > On Thu, Jul 9, 2020 at 6:14 PM Petr Jelinek  
> > wrote:
> >>
> >>
> >> If we were to support the origin forwarding, then strictly speaking we
> >> need everything only at commit time from correctness perspective,
> >>
> >
> > Okay.  Anyway streaming mode is optional, so in such cases, we can keep 
> > it 'off'
> >
> >> but
> >> ideally origin_id would be best sent with first message as it can be
> >> used to filter out changes at decoding stage rather than while we
> >> process the commit so having it set early improves performance of 
> >> decoding.
> >>
> >
> > Yeah, makes sense.  So, we will just send origin_id (with first
> > streaming start message) and leave others.
> 
>  So IIUC, currently we are sending the latest origin_id which is set
>  during the commit time.  So in our case, while we start streaming we
>  will send the origin_id of the latest change in the current stream
>  right?
> 
> >>>
> >>> It has to be sent only once with the first start message not with
> >>> consecutive start messages.
> >>
> >> Okay,  so do you mean to say that with the first start message we send
> >> the origin_id of the latest change?
> >>
> >
> > Yes.
> >
> >>   because during the transaction
> >> lifetime, the origin id can be changed.
> >>
> >
> > Yeah, it could be changed but if we have to send again apart from with
> > the first message then it should be sent with each message.  So, I
> > think it is better to just send it once during the transaction as we
> > do it now (send with begin message).
> >
> >
>
> I am not sure if I can follow the discussion here very well, but if I
> understand correctly I'd like to clarify two things:
> - origin id does not change mid transaction as you can only have one per xid

Actually, I was talking about the case where someone changes the
session origin: which origin id should we send then?  Currently, we
send data only at commit time, so we take the origin id from the commit
WAL record and send that.  In the example below, I am inserting 2
records in a transaction, and each of them has a different origin id.

begin;
select pg_replication_origin_session_setup('o1');
insert into t values(1, 'test');
select pg_replication_origin_session_reset();
select pg_replication_origin_session_setup('o2');   --> Origin ID changed
insert into t values(2, 'test');
commit;

> - until we have origin forwarding feature, the origin id is always same
> for a given subscription

ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Petr Jelinek

Hi,

On 14/07/2020 10:29, Amit Kapila wrote:

On Tue, Jul 14, 2020 at 12:05 PM Dilip Kumar  wrote:


On Tue, Jul 14, 2020 at 11:14 AM Amit Kapila  wrote:


On Tue, Jul 14, 2020 at 11:00 AM Dilip Kumar  wrote:


On Thu, Jul 9, 2020 at 6:55 PM Amit Kapila  wrote:


On Thu, Jul 9, 2020 at 6:14 PM Petr Jelinek  wrote:



If we were to support the origin forwarding, then strictly speaking we
need everything only at commit time from correctness perspective,



Okay.  Anyway streaming mode is optional, so in such cases, we can keep it 'off'


but
ideally origin_id would be best sent with first message as it can be
used to filter out changes at decoding stage rather than while we
process the commit so having it set early improves performance of decoding.



Yeah, makes sense.  So, we will just send origin_id (with first
streaming start message) and leave others.


So IIUC, currently we are sending the latest origin_id which is set
during the commit time.  So in our case, while we start streaming we
will send the origin_id of the latest change in the current stream
right?



It has to be sent only once with the first start message not with
consecutive start messages.


Okay,  so do you mean to say that with the first start message we send
the origin_id of the latest change?



Yes.


  because during the transaction
lifetime, the origin id can be changed.



Yeah, it could be changed but if we have to send again apart from with
the first message then it should be sent with each message.  So, I
think it is better to just send it once during the transaction as we
do it now (send with begin message).




I am not sure if I can follow the discussion here very well, but if I 
understand correctly I'd like to clarify two things:

- origin id does not change mid transaction as you can only have one per xid
- until we have the origin forwarding feature, the origin id is always the same
for a given subscription


--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: Avoid useless retrieval of defaults and check constraints in pg_dump -a

2020-07-14 Thread Julien Rouhaud
On Sun, Jul 12, 2020 at 4:29 PM Justin Pryzby  wrote:
>
> On Sun, Jul 12, 2020 at 07:48:50AM +0200, Julien Rouhaud wrote:
> > Currently, getTableAttrs() always retrieves info about columns defaults and
> > check constraints, while this will never be used if --data-only option if 
> > used.
> > This seems like a waste of resources, so here's a patch to skip those parts
> > when the DDL won't be generated.
>
> Note that the speed of default and constraint handling has come up before:
> https://www.postgresql.org/message-id/flat/CAMkU%3D1xPqHP%3D7YPeChq6n1v_qd4WGf%2BZvtnR-b%2BgyzFqtJqMMQ%40mail.gmail.com
> https://www.postgresql.org/message-id/CAMkU=1x-e+maqefhm1ymesij8j9z+sjhgw7c9bqo3e3jmg4...@mail.gmail.com

Oh, I wasn't aware of that.

> I'd pointed out that a significant fraction of our pg_upgrade time was in
> pg_dump, due to having wide tables with many child tables, and "default 0" on
> every column.  (I've since dropped our defaults so this is no longer an issue
> here).
>
> It appears your patch would avoid doing unnecessary work in the --data-only
> case, but it wouldn't help the pg_upgrade case.

Indeed.  Making the schema part faster would probably require a bigger
refactoring.  I'm wondering if we could introduce some facility to
temporarily deny any DDL change, so that the initial pg_dump -s done
by pg_upgrade can be performed before shutting down the instance.

Note that those extraneous queries were found while trying to dump
data out of a corrupted database.  The issue wasn't an excessive
runtime but corrupted catalog entries, so bypassing this code (since I
was only interested in the data anyway) was simpler than trying to
recover yet more corrupted rows.




Re: Online checksums verification in the backend

2020-07-14 Thread Julien Rouhaud
On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote:
> Small language fixes in comments and user-facing docs.

Thanks a lot!  I just fixed a small issue (see below), PFA updated v10.

> 
> diff --git a/src/backend/storage/page/checksum.c 
> b/src/backend/storage/page/checksum.c
> index eb2c919c34..17cd95ec95 100644
> --- a/src/backend/storage/page/checksum.c
> +++ b/src/backend/storage/page/checksum.c
> @@ -36,7 +36,7 @@
>   * actual storage, you have to discard the operating system cache before
>   * running those functions.
>   *
> - * To avoid torn page and possible false positive when reading data, and
> + * To avoid torn pages and possible false positives when reading data, and to
>   * keeping overhead as low as possible, the following heuristics are used:
>   *

Changed to "to keep".
From ff8b384d3acdee2986543742c22af9de0335ff4f Mon Sep 17 00:00:00 2001
From: Julien Rouhaud 
Date: Mon, 4 Nov 2019 08:40:23 +0100
Subject: [PATCH v10] Add a pg_check_relation() function.

This function checks the validity of the checksums for all non-dirty blocks of
a given relation, and optionally a given fork, and returns the list of all
blocks that don't match, along with the expected and found checksums.

Author: Julien Rouhaud
Reviewed-by: Michael Paquier, Masahiko Sawada, Justin Pryzby
Discussion: 
https://postgr.es/m/CAOBaU_aVvMjQn%3Dge5qPiJOPMmOj5%3Dii3st5Q0Y%2BWuLML5sR17w%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  85 +
 doc/src/sgml/func.sgml|  51 +++
 src/backend/storage/page/checksum.c   | 318 +-
 src/backend/utils/adt/Makefile|   1 +
 src/backend/utils/adt/checksumfuncs.c | 217 
 src/backend/utils/init/globals.c  |   7 +
 src/backend/utils/misc/guc.c  |  33 ++
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/catalog/pg_proc.dat   |  16 +
 src/include/miscadmin.h   |   7 +
 src/include/storage/checksum.h|  13 +
 src/include/utils/guc_tables.h|   1 +
 src/test/Makefile |   3 +-
 src/test/check_relation/.gitignore|   2 +
 src/test/check_relation/Makefile  |  23 ++
 src/test/check_relation/README|  23 ++
 .../check_relation/t/01_checksums_check.pl| 276 +++
 17 files changed, 1078 insertions(+), 4 deletions(-)
 create mode 100644 src/backend/utils/adt/checksumfuncs.c
 create mode 100644 src/test/check_relation/.gitignore
 create mode 100644 src/test/check_relation/Makefile
 create mode 100644 src/test/check_relation/README
 create mode 100644 src/test/check_relation/t/01_checksums_check.pl

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..f6a03ff931 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2038,6 +2038,91 @@ include_dir 'conf.d'
  
 
 
+
+ Cost-based Checksum Verification Delay
+
+ 
+  During the execution of 
+  function, the system maintains an internal counter that keeps track of
+  the estimated cost of the various I/O operations that are performed.
+  When the accumulated cost reaches a limit (specified by
+  checksum_cost_limit), the process performing the
+  operation will sleep for a short period of time, as specified by
+  checksum_cost_delay. Then it will reset the counter
+  and continue execution.
+ 
+
+ 
+  This feature is disabled by default. To enable it, set the
+  checksum_cost_delay variable to a nonzero
+  value.
+ 
+
+ 
+  
+   checksum_cost_delay (floating 
point)
+   
+checksum_cost_delay configuration 
parameter
+   
+   
+   
+
+ The amount of time that the process will sleep
+ when the cost limit has been exceeded.
+ If this value is specified without units, it is taken as milliseconds.
+ The default value is zero, which disables the cost-based checksum
+ verification delay feature.  Positive values enable cost-based
+ checksum verification.
+
+   
+  
+
+  
+   checksum_cost_page (integer)
+   
+checksum_cost_page configuration 
parameter
+   
+   
+   
+
+ The estimated cost for verifying a buffer, whether it's found in the
+ shared buffer cache or not. It represents the cost to lock the buffer
+ pool, lookup the shared hash table, read the content of the page from
+ disk and compute its checksum.  The default value is 10.
+
+   
+  
+
+  
+   checksum_cost_limit (integer)
+   
+checksum_cost_limit configuration 
parameter
+   
+   
+   
+
+ The accumulated cost that will cause the verification process to 
sleep.
+ The default value is 200.
+
+   
+  
+ 
+
+ 

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-07-14 Thread Magnus Hagander
On Tue, Jul 14, 2020 at 3:26 AM Tom Lane  wrote:

> Robert Haas  writes:
> > On Mon, Jul 13, 2020 at 8:58 PM Tom Lane  wrote:
> >> The more that I think about it, the more I think that the proposed
> >> functions are tools for wizards only, and so I'm getting hesitant
> >> about having them in contrib at all.  We lack a better place to
> >> put them, but that doesn't mean they should be there.
>
> > I understand that it's not too great when we give people access to
> > sharp tools and they hurt themselves with said tools. But this is open
> > source. That's how it goes.
>
> I think you're attacking a straw man.  I'm well aware of how open source
> works, thanks.  What I'm saying is that contrib is mostly seen to be
> reasonably harmless stuff.  Sure, you can overwrite data you didn't want
> to with adminpack's pg_file_write.  But that's the price of having such a
> capability at all, and in general it's not hard for users to understand
> both the uses and risks of that function.  That statement does not apply
> to the functions being proposed here.  It doesn't seem like they could
> possibly be safe to use without very specific expert advice --- and even
> then, we're talking rather small values of "safe".  So I wish we had some
> other way to distribute them than via contrib.
>

The counterexample to this is pg_resetwal. The number of people who have
broken their database with that tool is not small.

That said, we could have a separate "class" of extensions/tools in the
distribution, and encourage packagers to pack them up as separate packages
for example. Technically they don't have to be in the same source
repository at all of course, but I have a feeling some of them might be a
lot easier to maintain if they are. And then the user would just have to
install something like "postgresql-14-wizardtools". They'd still be
available to everybody, of course, but at least the knives would be in a
closed drawer until intentionally picked up.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-07-14 Thread Bharath Rupireddy
On Tue, Jul 7, 2020 at 10:24 PM Andres Freund  wrote:
>
> On 2020-07-07 09:44:41 -0400, Tom Lane wrote:
> > Bharath Rupireddy  writes:
> > > Firstly, pg_start_backup registers nonexclusive_base_backup_cleanup as
> > > on_shmem_exit call back which will
> > > add this function to the before_shmem_exit_list array which is
> > > supposed to be removed on pg_stop_backup
> > > so that we can do the pending cleanup and issue a warning for each
> > > pg_start_backup for which we did not call
> > > the pg_stop backup. Now, I executed a query for which JIT is picked
> > > up, then the the llvm compiler inserts it's
> > > own exit callback i.e. llvm_shutdown at the end of
> > > before_shmem_exit_list array. Now, I executed pg_stop_backup
> > > and call to cancel_before_shmem_exit() is made with the expectation
> > > that the nonexclusive_base_backup_cleanup
> > > callback is removed from before_shmem_exit_list array.
> >
> > I'm of the opinion that the JIT code is abusing this mechanism, and the
> > right thing to do is fix that.
>
> What are you proposing? For now we could easily enough work around this
> by just making it a on_proc_exit() callback, but that doesn't really
> change the fundamental issue imo.
>
> > The restriction you propose to remove is not just there out of
> > laziness, it's an expectation about what safe use of this mechanism
> > would involve.  Un-ordered removal of callbacks seems pretty risky: it
> > would mean that whatever cleanup is needed is not going to be done in
> > LIFO order.
>

I quickly searched (in HEAD) for all the callbacks that get registered
in before_shmem_exit_list, with the intention of seeing whether they
also call the corresponding cancel_before_shmem_exit() once their
intended usage is done.

For a few of the callbacks there is no cancel_before_shmem_exit(). This
seems expected; those callbacks ought to be executed before shmem
exit. These callbacks are (let's say SET 1): ShutdownPostgres,
logicalrep_worker_onexit, llvm_shutdown, Async_UnlistenOnExit,
RemoveTempRelationsCallback, ShutdownAuxiliaryProcess,
do_pg_abort_backup in xlog.c (this callback exists only in v13 or
later), and AtProcExit_Twophase.

This means that once they are in the before_shmem_exit_list array, in
some order, they are never removed from it (as they have no
corresponding cancel_before_shmem_exit()), and their relative order of
execution remains the same.

And there are other callbacks that get registered in the
before_shmem_exit_list array (let's say SET 2): apw_detach_shmem,
_bt_end_vacuum_callback, pg_start_backup_callback,
pg_stop_backup_callback, createdb_failure_callback,
movedb_failure_callback, and do_pg_abort_backup (in basebackup.c). They
all have a corresponding cancel_before_shmem_exit() call to
unregister/remove the callback from the before_shmem_exit_list array.

I think the callbacks that have no cancel_before_shmem_exit() (SET 1)
may have to be executed in LIFO order: it makes sense to execute
ShutdownPostgres at the end, after the other callbacks in SET 1.

The SET 2 callbacks, on the other hand, call cancel_before_shmem_exit()
simply because they are no longer needed at shmem exit; that removal
from the before_shmem_exit_list array may fail if any other callback
got registered in between.

If I'm not wrong about the above points, we must enhance
cancel_before_shmem_exit() or add a cancel_before_shmem_exit_v2() (as
mentioned in my response below).

>
> Maybe I am confused, but isn't it <13's pg_start_backup() that's
> violating the protocol much more clearly than the JIT code? Given that
> it relies on there not being any callbacks registered between two SQL
> function calls?  I mean, what it does is basically:
>
> 1) before_shmem_exit(nonexclusive_base_backup_cleanup...
> 2) arbitrary code executed for a long time
> 3) cancel_before_shmem_exit(nonexclusive_base_backup_cleanup...
>
> which pretty obviously can't at all deal with any other
> before_shmem_exit callbacks being registered in 2).  Won't this be a
> problem for every other before_shmem_exit callback that we register
> on-demand? Say Async_UnlistenOnExit, RemoveTempRelationsCallback?
>

Yes, for versions < 13, pg_start_backup clearly causes the problem, and
the issue can also be reproduced with Async_UnlistenOnExit or
RemoveTempRelationsCallback being registered in between pg_start_backup
and pg_stop_backup.
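For example, on a v12 server something like the following, run in a
single session (label and channel name are made up), shows the ordering
hazard:

SELECT pg_start_backup('repro', false, false); -- registers nonexclusive_base_backup_cleanup
LISTEN repro_channel;                          -- first LISTEN registers Async_UnlistenOnExit
SELECT * FROM pg_stop_backup(false);
-- cancel_before_shmem_exit() only inspects the last list entry, which
-- is now Async_UnlistenOnExit, so (per the discussion above) the backup
-- cleanup callback is left registered and still runs at session exit.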

We can fix it in a few ways: 1) enhance cancel_before_shmem_exit() as
attached in the original patch; 2) keep the existing
cancel_before_shmem_exit() but, whenever it is called for
nonexclusive_base_backup_cleanup(), search the entire array instead of
just the last entry; 3) add a separate function, say
cancel_before_shmem_exit_v2(), that searches the entire
before_shmem_exit_list array (the logic proposed in this patch) so that
the existing cancel_before_shmem_exit() is not disturbed; 4) or adopt
the pg_start_backup code that exists in version 13 and later.

If okay 

Re: replication_origin and replication_origin_lsn usage on subscriber

2020-07-14 Thread Amit Kapila
On Tue, Jul 14, 2020 at 12:05 PM Dilip Kumar  wrote:
>
> On Tue, Jul 14, 2020 at 11:14 AM Amit Kapila  wrote:
> >
> > On Tue, Jul 14, 2020 at 11:00 AM Dilip Kumar  wrote:
> > >
> > > On Thu, Jul 9, 2020 at 6:55 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Jul 9, 2020 at 6:14 PM Petr Jelinek  
> > > > wrote:
> > > > >
> > > > >
> > > > > If we were to support the origin forwarding, then strictly speaking we
> > > > > need everything only at commit time from correctness perspective,
> > > > >
> > > >
> > > > Okay.  Anyway streaming mode is optional, so in such cases, we can keep 
> > > > it 'off'
> > > >
> > > > > but
> > > > > ideally origin_id would be best sent with first message as it can be
> > > > > used to filter out changes at decoding stage rather than while we
> > > > > process the commit so having it set early improves performance of 
> > > > > decoding.
> > > > >
> > > >
> > > > Yeah, makes sense.  So, we will just send origin_id (with first
> > > > streaming start message) and leave others.
> > >
> > > So IIUC, currently we are sending the latest origin_id which is set
> > > during the commit time.  So in our case, while we start streaming we
> > > will send the origin_id of the latest change in the current stream
> > > right?
> > >
> >
> > It has to be sent only once with the first start message not with
> > consecutive start messages.
>
> Okay,  so do you mean to say that with the first start message we send
> the origin_id of the latest change?
>

Yes.

>  because during the transaction
> lifetime, the origin id can be changed.
>

Yeah, it could be changed but if we have to send again apart from with
the first message then it should be sent with each message.  So, I
think it is better to just send it once during the transaction as we
do it now (send with begin message).


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-07-14 Thread Masahiro Ikeda

I've attached the latest version patches. I've incorporated the review
comments I got so far and improved locking strategy.


I want to ask a question about streaming replication with 2PC.
Are you going to support 2PC with streaming replication?

I tried streaming replication using the v23 patches.
I confirmed that 2PC works with streaming replication
where there are primary and standby coordinators.

But, in my understanding, the WAL for "PREPARE" and
"COMMIT/ABORT PREPARED" can't be replicated to the standby server
synchronously.


If this is right, an unresolved transaction can occur.

For example,

1. PREPARE is done
2. crash primary before the WAL related to PREPARE is
   replicated to the standby server
3. promote standby server // but can't execute "ABORT PREPARED"

In the above case, the remote server is left with an unresolved
prepared transaction. Can we solve this problem so that synchronous
replication is supported?
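Until that is solved, the DBA would have to resolve it by hand on the
remote server, for example (the gid shown is made up):

SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
ROLLBACK PREPARED 'fx_1_100_200';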

But I think some users use async replication for performance.
Do we need to document the limitation, or come up with another solution?

Regards,

--
Masahiro Ikeda
NTT DATA CORPORATION




Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-14 Thread Rushabh Lathia
On Sat, Jul 11, 2020 at 6:07 PM Dilip Kumar  wrote:

> I have just notice that the parallelism is off even for the select
> part of the query mentioned in the $subject.  I see the only reason it
> is not getting parallel because we block the parallelism if the query
> type is not SELECT.  I don't see any reason for not selecting the
> parallelism for this query.  I have quickly hacked the code to enable
> the parallelism for this query.  I can see there is a significant
> improvement if we can enable the parallelism in this case.  For an
> experiment, I have just relaxed a couple of checks, maybe if we think
> that it's good to enable the parallelism for this case we can try to
> put better checks which are specific for this quey.
>
>
+1 for the idea.  The given example also shows a good performance
gain, and I don't see any reason to restrict the parallel case for
INSERT INTO SELECT.


> No parallel:
> postgres[36635]=# explain analyze insert into t2 select * from t where a <
> 100;
>  Insert on t2  (cost=0.00..29742.00 rows=100 width=105) (actual
> time=278.300..278.300 rows=0 loops=1)
>->  Seq Scan on t  (cost=0.00..29742.00 rows=100 width=105) (actual
> time=0.061..277.330 rows=99 loops=1)
>  Filter: (a < 100)
>  Rows Removed by Filter: 01
>  Planning Time: 0.093 ms
>  Execution Time: 278.330 ms
> (6 rows)
>
> With parallel
>  Insert on t2  (cost=1000.00..23460.33 rows=100 width=105) (actual
> time=108.410..108.410 rows=0 loops=1)
>->  Gather  (cost=1000.00..23460.33 rows=100 width=105) (actual
> time=0.306..108.973 rows=99 loops=1)
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Parallel Seq Scan on t  (cost=0.00..22450.33 rows=42
> width=105) (actual time=66.396..101.979 rows=33 loops=3)
>Filter: (a < 100)
>Rows Removed by Filter: 00
>  Planning Time: 0.154 ms
>  Execution Time: 110.158 ms
> (9 rows)
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


-- 
Rushabh Lathia


Re: INSERT INTO SELECT, Why Parallelism is not selected?

2020-07-14 Thread Dilip Kumar
On Mon, Jul 13, 2020 at 4:23 PM Amit Kapila  wrote:
>
> On Sat, Jul 11, 2020 at 6:07 PM Dilip Kumar  wrote:
> >
> > I have just notice that the parallelism is off even for the select
> > part of the query mentioned in the $subject.  I see the only reason it
> > is not getting parallel because we block the parallelism if the query
> > type is not SELECT.  I don't see any reason for not selecting the
> > parallelism for this query.  I have quickly hacked the code to enable
> > the parallelism for this query.  I can see there is a significant
> > improvement if we can enable the parallelism in this case.  For an
> > experiment, I have just relaxed a couple of checks, maybe if we think
> > that it's good to enable the parallelism for this case we can try to
> > put better checks which are specific for this quey.
> >
>
> +1.  I also don't see any problem with this idea considering we will
> find a better way to enable the parallelism for this case because we
> can already use parallelism for statements like "create table
>  as select ...".

Okay, thanks for the feedback.

> I think we can do more than this by
> parallelizing the Insert part of this query as well as we have lifted
> group locking restrictions related to RelationExtension and Page lock
> in PG13.  It would be really cool to do that unless we see any
> fundamental problems with it.

+1. I think it will be cool to support the insert part here, as well as
the insert part in 'Create Table As Select ...'.
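A quick sketch of the CTAS case Amit mentions, reusing the tables from
the earlier example; the underlying select can already get a parallel
plan there:

EXPLAIN (COSTS OFF)
CREATE TABLE t3 AS SELECT * FROM t WHERE a < 100;
-- the plan can contain Gather / Parallel Seq Scan on t, while the
-- equivalent INSERT INTO t2 SELECT ... currently cannot.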

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel Seq Scan vs kernel read ahead

2020-07-14 Thread Thomas Munro
On Fri, Jun 26, 2020 at 3:33 AM Robert Haas  wrote:
> On Tue, Jun 23, 2020 at 11:53 PM David Rowley  wrote:
> > In summary, based on these tests, I don't think we're making anything
> > worse in regards to synchronize_seqscans if we cap the maximum number
> > of blocks to allocate to each worker at once to 8192. Perhaps there's
> > some argument for using something smaller than that for servers with
> > very little RAM, but I don't personally think so as it still depends
> > on the table size and It's hard to imagine tables in the hundreds of
> > GBs on servers that struggle with chunk allocations of 16MB.  The
> > table needs to be at least ~70GB to get a 8192 chunk size with the
> > current v2 patch settings.
>
> Nice research. That makes me happy. I had a feeling the maximum useful
> chunk size ought to be more in this range than the larger values we
> were discussing before, but I didn't even think about the effect on
> synchronized scans.

+1.  This seems about right to me.  We can always reopen the
discussion if someone shows up with evidence in favour of a tweak to
the formula, but this seems to address the basic problem pretty well,
and also fits nicely with future plans for AIO and DIO.




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2020-07-14 Thread Peter Geoghegan
On Thu, Jul 2, 2020 at 8:47 AM James Coleman  wrote:
> It seems like the consensus over at another discussion on this topic
> [1] is that we ought to go ahead and print the zeros [for machine
> readable output formats], even though that creates some interesting
> scenarios like the fact that disk sorts will print 0 for memory even
> though that's not true.

What about having it print -1 for memory in this case instead? That's
still not ideal, but machine readable EXPLAIN output ought to
consistently show the same information per node, even when the answer
is in some sense indeterminate. That seems to be the standard that
we've settled on.

It might be worth teaching the JSON format to show a JSON null or
something instead. Not sure about that, though.

-- 
Peter Geoghegan




Re: Reducing WaitEventSet syscall churn

2020-07-14 Thread Thomas Munro
On Sun, Jul 12, 2020 at 7:22 AM Tom Lane  wrote:
> [...complaints about 0005 and 0006...] We already have
> PGEVT_CONNRESET and PGEVT_CONNDESTROY as application-visible connection
> state change events, and I don't see why those aren't sufficient.

I think Horiguchi-san's general idea of using event callbacks for this
sounds much more promising than my earlier idea of exposing a change
counter (that really was terrible).  If it can be done with existing
events then that's even better.  Perhaps he and/or I can look into
that for the next CF.

In the meantime, here's a rebase of the more straightforward patches
in the stack.  These are the ones that deal only with fixed sets of
file descriptors, and they survive check-world on Linux,
Linux+EXEC_BACKEND (with ASLR disabled) and FreeBSD, and at least
check on macOS and Windows (my CI recipes need more work to get
check-world working on those two).  There's one user-visible change
that I'd appreciate feedback on: I propose to drop the FATAL error
when the postmaster goes away, to make things more consistent.  See
below for more on that.

Responding to earlier review from Horiguchi-san:

On Tue, Mar 10, 2020 at 12:20 PM Kyotaro Horiguchi
 wrote:
> > 0002: "Use a long lived WaitEventSet for WaitLatch()."
> >
> > In the last version, I had a new function WaitMyLatch(), but that
> > didn't help with recoveryWakeupLatch.  Switching between latches
> > doesn't require a syscall, so I didn't want to have a separate WES and
> > function just for that.  So I went back to using plain old
> > WaitLatch(), and made it "modify" the latch every time before waiting
> > on CommonWaitSet.  An alternative would be to get rid of the concept
> > of latches other than MyLatch, and change the function to
> > WaitMyLatch().  A similar thing happens for exit_on_postmaster_death,
> > for which I didn't want to have a separate WES, so I just set that
> > flag every time.  Thoughts?
>
> It is surely an improvement from that we create a full-fledged WES
> every time. The name CommonWaitSet gives an impression as if it is
> used for a variety of waitevent sets, but it is used only by
> WaitLatch. So I would name it LatchWaitSet. With that name I won't be
> surprised by that the function is pointing WL_LATCH_SET by index 0
> without any explanation when calling ModifyWaitSet.

Ok, I changed it to LatchWaitSet.  I also replaced the index 0 with a
symbolic name LatchWaitSetLatchPos, to make that clearer.

> @@ -700,7 +739,11 @@ FreeWaitEventSet(WaitEventSet *set)
> ReleaseExternalFD();
>  #elif defined(WAIT_USE_KQUEUE)
> close(set->kqueue_fd);
> -   ReleaseExternalFD();
> +   if (set->kqueue_fd >= 0)
> +   {
> +   close(set->kqueue_fd);
> +   ReleaseExternalFD();
> +   }
>
> Did you forget to remove the close() outside the if block?
> Don't we need the same amendment for epoll_fd with kqueue_fd?

Hmm, maybe I screwed that up when resolving a conflict with the
ReleaseExternalFD() stuff.  Fixed.

> WaitLatch is defined as "If the latch is already set (and WL_LATCH_SET
> is given), the function returns immediately.". But now the function
> reacts to latch even if WL_LATCH_SET is not set. I think actually it
> is alwys set so I think we need to modify Assert and function comment
> following the change.

It seems a bit silly to call WaitLatch() if you don't want to wait for
a latch, but I think we can keep that comment and logic by assigning
set->latch = NULL when you wait without WL_LATCH_SET.  I tried that in
the attached.

> > 0004: "Introduce RemoveWaitEvent()."
> >
> > We'll need that to be able to handle connections that are closed and
> > reopened under the covers by libpq (replication, postgres_fdw).  We
> > also wanted this on a couple of other threads for multiplexing FDWs,
> > to be able to adjust the wait set dynamically for a proposed async
> > Append feature.
> >
> > The implementation is a little naive, and leaves holes in the
> > "pollfds" and "handles" arrays (for poll() and win32 implementations).
> > That could be improved with a bit more footwork in a later patch.
> >
> > XXX The Windows version is based on reading documentation.  I'd be
> > very interested to know if check-world passes (especially
> > contrib/postgres_fdw and src/test/recovery).  Unfortunately my
> > appveyor.yml fu is not yet strong enough.
>
> I didn't find the documentation about INVALID_HANDLE_VALUE in
> lpHandles. Could you give me the URL for that?

I was looking for how you do the equivalent of Unix file descriptor -1
in a call to poll(), and somewhere I read that INVALID_HANDLE_VALUE
has the right effect.  I can't find that reference now.  Apparently it
works because that's the pseudo-handle value -1 that is returned by
GetCurrentProcess(), and waiting for your own process waits forever so
it's a suitable value for holes in an array of event handles.  We
should probably call GetCurrentProcess() instead, but really that is
just stand-in code: we 

Re: Improvements in Copy From

2020-07-14 Thread vignesh C
On Tue, Jul 14, 2020 at 11:13 AM David Rowley  wrote:
>
> On Tue, 14 Jul 2020 at 17:22, David Rowley  wrote:
> >
> > On Thu, 2 Jul 2020 at 00:46, vignesh C  wrote:
> > > b) CopyMultiInsertInfoNextFreeSlot had an unused function parameter
> > > that is not being used, it can be removed.
> >
> > This was raised in [1]. We decided not to remove it.
>
> I just added a comment to the function to mention why we want to keep
> the parameter. I hope that will save any wasted time proposing its
> removal in the future.
>
> FWIW, the function is inlined.  Removing it will gain us nothing
> performance-wise anyway.
>
> David
>
> > [1] 
> > https://www.postgresql.org/message-id/flat/CAKJS1f-A5aYvPHe10Wy9LjC4RzLsBrya8b2gfuQHFabhwZT_NQ%40mail.gmail.com#3bae9a84be253c527b0e621add0fbaef

Thanks David for pointing it out; as this has already been discussed
and concluded, there is no point in discussing the same thing again.
This patch has a couple of other improvements which can still be taken
forward. I will remove this change and post a new patch that retains
the other fixes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com



