Re: Parallel worker hangs while handling errors.

2020-07-27 Thread vignesh C
Thanks for your comments, Bharath.

On Mon, Jul 27, 2020 at 10:13 AM Bharath Rupireddy wrote:
> 1. Do we need "worker" as a function argument in
> update_parallel_worker_sigmask(BackgroundWorker *worker, ? Since
> MyBgworkerEntry is a global variable, can't we have a local variable
> instead?

Fixed. I have moved the worker check to the caller function.

> 2. Instead of update_parallel_worker_sigmask() serving only for
> parallel workers, can we make it generic, so that for any bgworker,
> given a signal it unblocks it, although there's no current use case
> for a bg worker unblocking a single signal other than a parallel
> worker doing it for SIGUSR1 for this hang issue. Please note that we
> have BackgroundWorkerBlockSignals() and
> BackgroundWorkerUnblockSignals().

Fixed. I have slightly modified the changes, splitting them into
BackgroundWorkerRemoveBlockSignal & BackgroundWorkerAddBlockSignal.
This keeps the naming consistent with
BackgroundWorkerBlockSignals() and BackgroundWorkerUnblockSignals().
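
For reference, here is a minimal standalone POSIX sketch (not PostgreSQL code; BlockSig below is a local stand-in for the backend's global signal mask) of what such per-signal block/unblock helpers boil down to: keep a cached mask, adjust it with sigaddset()/sigdelset(), and re-apply it with sigprocmask().

#include <signal.h>

static sigset_t BlockSig;

static void
AddBlockSignal(int signum)
{
    sigaddset(&BlockSig, signum);               /* mark the signal as blocked in the cached mask */
    sigprocmask(SIG_SETMASK, &BlockSig, NULL);  /* re-apply the mask to the process */
}

static void
RemoveBlockSignal(int signum)
{
    sigdelset(&BlockSig, signum);               /* drop the signal from the cached mask */
    sigprocmask(SIG_SETMASK, &BlockSig, NULL);
}

int
main(void)
{
    sigfillset(&BlockSig);                      /* start with everything blocked */
    sigprocmask(SIG_SETMASK, &BlockSig, NULL);
    RemoveBlockSignal(SIGUSR1);                 /* now only SIGUSR1 can be delivered */
    AddBlockSignal(SIGUSR1);                    /* blocked again */
    return 0;
}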

> If okay, with the BackgroundWorkerUpdateSignalMask() function, please
> note that we might have to add it in bgworker.sgml as well.

Included the documentation.

Attached the updated patch for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
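
As background for the attached patch: the commit message below explains that the worker's signals are blocked when the postmaster forks it and are normally unblocked only in ParallelWorkerMain, and that StartBackgroundWorker uses sigsetjmp(..., 1). Because the second argument is nonzero, the blocked mask in effect at sigsetjmp time is saved and restored by siglongjmp, which is why SIGUSR1 stays blocked on the error path. A standalone C sketch (illustrative only, not PostgreSQL code) of that sigsetjmp/siglongjmp mask behavior:

#include <stdio.h>
#include <signal.h>
#include <setjmp.h>

static sigjmp_buf env;

int
main(void)
{
    sigset_t    all, cur;

    /* Block everything, as is the case when the postmaster forks a worker. */
    sigfillset(&all);
    sigprocmask(SIG_SETMASK, &all, NULL);

    if (sigsetjmp(env, 1) != 0)
    {
        /*
         * Reached via siglongjmp: the mask saved above (everything blocked)
         * has been restored, so SIGUSR1 is blocked again even though it was
         * unblocked below.
         */
        sigprocmask(SIG_SETMASK, NULL, &cur);
        printf("SIGUSR1 blocked after longjmp: %d\n", sigismember(&cur, SIGUSR1));
        return 0;
    }

    /* Normal path: unblock all signals, as BackgroundWorkerUnblockSignals would. */
    sigemptyset(&cur);
    sigprocmask(SIG_SETMASK, &cur, NULL);

    siglongjmp(env, 1);     /* simulate an error jumping back to the setjmp block */
}

This prints "1", i.e. the signal is blocked again after the jump.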
From f06a5966d754d6622a3425c6cac1ad72314832ef Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Tue, 28 Jul 2020 10:48:17 +0530
Subject: [PATCH v4] Fix for Parallel worker hangs while handling errors.

A parallel worker cannot receive signals while handling an error. The worker
hangs in this case because the signals are masked with sigprocmask when the
worker is started; they are normally unblocked by calling
BackgroundWorkerUnblockSignals in ParallelWorkerMain. During error handling,
however, the worker jumps back to the sigsetjmp block in StartBackgroundWorker,
where the signals are still blocked, so the signal is never received by the
worker process.

Authors: Vignesh C, Bharath Rupireddy
---
 doc/src/sgml/bgworker.sgml  |  6 +-
 src/backend/postmaster/bgworker.c   | 20 
 src/backend/postmaster/postmaster.c | 17 +
 src/include/postmaster/bgworker.h   |  4 
 4 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 6e1cf12..d900fc9 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -210,7 +210,11 @@ typedef struct BackgroundWorker
allow the process to customize its signal handlers, if necessary.
Signals can be unblocked in the new process by calling
BackgroundWorkerUnblockSignals and blocked by calling
-   BackgroundWorkerBlockSignals.
+   BackgroundWorkerBlockSignals. A particular signal can
+   be unblocked in the new process by calling
+   BackgroundWorkerRemoveBlockSignal(int signum)
+   and blocked by calling
+   BackgroundWorkerAddBlockSignal(int signum).
   
 
   
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85..458b513 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -747,6 +747,17 @@ StartBackgroundWorker(void)
 	 */
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
+		/*
+		 * In the case of a parallel worker, unblock SIGUSR1; it was blocked
+		 * when the postmaster forked us. The leader process sends SIGUSR1 to
+		 * the worker (which may be waiting because no space is available) to
+		 * indicate that shared memory space has been freed up. Once the
+		 * signal is received, the worker can continue sending the rest of
+		 * the error message.
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerRemoveBlockSignal(SIGUSR1);
+
 		/* Since not using PG_TRY, must reset error stack by hand */
 		error_context_stack = NULL;
 
@@ -757,6 +768,15 @@ StartBackgroundWorker(void)
 		EmitErrorReport();
 
 		/*
+		 * Undo the unblocking of SIGUSR1 done above, so as not to cause
+		 * any further issues from having SIGUSR1 unblocked during the
+		 * execution of callbacks and other processing that will be done
+		 * during proc_exit().
+		 */
+		if ((worker->bgw_flags & BGWORKER_CLASS_PARALLEL) != 0)
+			BackgroundWorkerAddBlockSignal(SIGUSR1);
+
+		/*
 		 * Do we need more cleanup here?  For shmem-connected bgworkers, we
 		 * will call InitProcess below, which will install ProcKill as exit
 		 * callback.  That will take care of releasing locks, etc.
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index dec0258..751cd16 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -5765,6 +5765,23 @@ BackgroundWorkerUnblockSignals(void)
 	PG_SETMASK(&UnBlockSig);
 }
 
+/*
+ * Block/unblock a particular signal in a background worker.
+ */
+void
+BackgroundWorkerAddBlockSignal(int signum)
+{
+	sigaddset(&BlockSig, signum);
+	PG_SETMASK(&BlockSig);
+}
+
+void
+BackgroundWorkerRemoveBlockSignal(int signum)
+{
+	

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-07-27 Thread Amit Kapila
On Sun, Jul 26, 2020 at 11:04 AM Dilip Kumar  wrote:
>
> > Today, I have again looked at the first patch
> > (v42-0001-Extend-the-logical-decoding-output-plugin-API-wi) and didn't
> > find any more problems with it so planning to commit the same unless
> > you or someone else want to add more to it.   Just for ease of others,
> > "the next patch extends the logical decoding output plugin API with
> > stream methods".   It adds seven methods to the output plugin API,
> > adding support for streaming changes for large in-progress
> > transactions. The methods are stream_start, stream_stop, stream_abort,
> > stream_commit, stream_change, stream_message, and stream_truncate.
> > Most of this is a simple extension of the existing methods, with the
> > semantic difference that the transaction (or subtransaction) is
> > incomplete and may be aborted later (which is something the regular
> > API does not really need to deal with).
> >
> > This also extends the 'test_decoding' plugin, implementing these new
> > stream methods.  The stream_start/stream_stop are used to demarcate a
> > chunk of changes streamed for a particular toplevel transaction.
> >
> > This commit simply adds these new APIs and the upcoming patch to
> > "allow the streaming mode in ReorderBuffer" will use these APIs.
>
> LGTM
>

Pushed.  Feel free to submit the remaining patches.

-- 
With Regards,
Amit Kapila.




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread David Rowley
On Tue, 28 Jul 2020 at 15:21, Peter Geoghegan  wrote:
> Separately, I wonder what your opinion is about what should happen for
> the partial sort related EXPLAIN ANALYZE format open item, described
> here:
>
> https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1
>
> ISTM that EXPLAIN ANALYZE for incremental sort manages to show the
> same information as the sort case, aggregated across each tuplesort in
> a fairly sensible way.
>
> (No activity over on the incremental sort thread, so I thought I'd ask
> again here, while I was reminded of that issue.)

TBH, I've not really looked at that.

Tom did mention his view on this in [1]. I think that's a pretty good
policy. However, I've not looked at the incremental sort EXPLAIN
output enough to know how it'll best apply there.

David

[1] https://www.postgresql.org/message-id/2276865.1593102811%40sss.pgh.pa.us




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread David Rowley
On Tue, 28 Jul 2020 at 15:01, Justin Pryzby  wrote:
>
> On Tue, Jul 28, 2020 at 12:54:35PM +1200, David Rowley wrote:
> > On Mon, 27 Jul 2020 at 14:54, Justin Pryzby  wrote:
> > > It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown
> > > only conditionally:
> > >
> > > if (es->format != EXPLAIN_FORMAT_TEXT)
> > > {
> > > if (es->costs && aggstate->hash_planned_partitions > 0)
> > > {
> > > ExplainPropertyInteger("Planned Partitions", NULL,
> > >
> > > aggstate->hash_planned_partitions, es);
> > >
> > > That was conditional since it was introduced at 1f39bce02:
> > >
> > > if (es->costs && aggstate->hash_planned_partitions > 0)
> > > {
> > > ExplainPropertyInteger("Planned Partitions", NULL,
> > >
> > > aggstate->hash_planned_partitions, es);
> > > }
> > >
> > > I think 40efbf870 should've handled this, too.
> >
> > hmm. I'm not sure. I think this should follow the same logic as what
> > "Disk Usage" follows, and right now we don't show Disk Usage unless we
> > spill.
>
> Huh ?  I'm referring to non-text format, which is what you changed, on the
> reasoning that the same plan *could* spill:

Oh, right. ... (Sudden bout of confusion due to lack of sleep)

Looks like it'll just need this line:

if (es->costs && aggstate->hash_planned_partitions > 0)

changed to:

if (es->costs)

I think we'll likely need to maintain not showing that property with
explain (costs off) as it'll be a bit more difficult to write
regression tests if we display it regardless of that option.

David




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread Justin Pryzby
On Mon, Jul 27, 2020 at 08:20:45PM -0700, Peter Geoghegan wrote:
> On Mon, Jul 27, 2020 at 5:54 PM David Rowley  wrote:
> > hmm. I'm not sure. I think this should follow the same logic as what
> > "Disk Usage" follows, and right now we don't show Disk Usage unless we
> > spill. Since we only use partitions when spilling, I don't think it
> > makes sense to show the estimated partitions when we don't plan on
> > spilling.
> 
> I'm confused about what the guiding principles for EXPLAIN ANALYZE
> output (text or otherwise) are.

I don't know of a guideline for text format, but the issues I've raised for
HashAgg and IncrSort are based on previous commits which change to "show field
even if its value is zero" for non-text format:

7d91b604d9b5d6ec8c19c57a9ffd2f27129cdd94
8ebb69f85445177575684a0ba5cfedda8d840a91
3ec20c7091e97a554e7447ac2b7f4ed795631395

> > I think if we change this then we should change Disk Usage too.
> > However, I don't think we should as Sort will only show "Disk" if the
> > sort spills. I think Hash Agg should follow that.
> 
> I don't follow your remarks here.
> 
> Separately, I wonder what your opinion is about what should happen for
> the partial sort related EXPLAIN ANALYZE format open item, described
> here:
> 
> https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1
> 
> ISTM that EXPLAIN ANALYZE for incremental sort manages to show the
> same information as the sort case, aggregated across each tuplesort in
> a fairly sensible way.
> 
> (No activity over on the incremental sort thread, so I thought I'd ask
> again here, while I was reminded of that issue.)

Note that I did mail recently, on this 2nd thread:

https://www.postgresql.org/message-id/20200723141454.gf4...@telsasoft.com

-- 
Justin




Re: stress test for parallel workers

2020-07-27 Thread Tom Lane
Thomas Munro  writes:
> Hehe, the dodgy looking magic numbers *were* wrong:
> - * The kernel signal delivery code writes up to about 1.5kB
> + * The kernel signal delivery code writes a bit over 4KB
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-...@ellerman.id.au/

... and, having seemingly not learned a thing, they just replaced
them with new magic numbers.  Mumble sizeof() mumble.

Anyway, I guess the interesting question for us is how long it
will take for this fix to propagate into real-world systems.
I don't have much of a clue about the Linux kernel workflow,
anybody want to venture a guess?

regards, tom lane




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 5:54 PM David Rowley  wrote:
> hmm. I'm not sure. I think this should follow the same logic as what
> "Disk Usage" follows, and right now we don't show Disk Usage unless we
> spill. Since we only use partitions when spilling, I don't think it
> makes sense to show the estimated partitions when we don't plan on
> spilling.

I'm confused about what the guiding principles for EXPLAIN ANALYZE
output (text or otherwise) are.

> I think if we change this then we should change Disk Usage too.
> However, I don't think we should as Sort will only show "Disk" if the
> sort spills. I think Hash Agg should follow that.

I don't follow your remarks here.

Separately, I wonder what your opinion is about what should happen for
the partial sort related EXPLAIN ANALYZE format open item, described
here:

https://www.postgresql.org/message-id/flat/20200619040358.GZ17995%40telsasoft.com#b20bd205851a0390220964f7c31b23d1

ISTM that EXPLAIN ANALYZE for incremental sort manages to show the
same information as the sort case, aggregated across each tuplesort in
a fairly sensible way.

(No activity over on the incremental sort thread, so I thought I'd ask
again here, while I was reminded of that issue.)

-- 
Peter Geoghegan




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread Justin Pryzby
On Tue, Jul 28, 2020 at 12:54:35PM +1200, David Rowley wrote:
> On Mon, 27 Jul 2020 at 14:54, Justin Pryzby  wrote:
> > It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown
> > only conditionally:
> >
> > if (es->format != EXPLAIN_FORMAT_TEXT)
> > {
> > if (es->costs && aggstate->hash_planned_partitions > 0)
> > {
> > ExplainPropertyInteger("Planned Partitions", NULL,
> >
> > aggstate->hash_planned_partitions, es);
> >
> > That was conditional since it was introduced at 1f39bce02:
> >
> > if (es->costs && aggstate->hash_planned_partitions > 0)
> > {
> > ExplainPropertyInteger("Planned Partitions", NULL,
> >
> > aggstate->hash_planned_partitions, es);
> > }
> >
> > I think 40efbf870 should've handled this, too.
> 
> hmm. I'm not sure. I think this should follow the same logic as what
> "Disk Usage" follows, and right now we don't show Disk Usage unless we
> spill.

Huh ?  I'm referring to non-text format, which is what you changed, on the
reasoning that the same plan *could* spill:

40efbf8706cdd96e06bc4d1754272e46d9857875
if (es->format != EXPLAIN_FORMAT_TEXT)
{

if (es->costs && aggstate->hash_planned_partitions > 0)
{
ExplainPropertyInteger("Planned Partitions", NULL,
   
aggstate->hash_planned_partitions, es);
}
...
/* EXPLAIN ANALYZE */
ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, 
es);
-   if (aggstate->hash_batches_used > 0)
-   {
ExplainPropertyInteger("Disk Usage", "kB", 
   
aggstate->hash_disk_used, es);
ExplainPropertyInteger("HashAgg Batches", NULL,
   
aggstate->hash_batches_used, es);

> Since we only use partitions when spilling, I don't think it
> makes sense to show the estimated partitions when we don't plan on
> spilling.

In any case, my thinking is that we *should* show PlannedPartitions=0,
specifically to indicate *that* we didn't plan to spill.

> I think if we change this then we should change Disk Usage too.
> However, I don't think we should as Sort will only show "Disk" if the
> sort spills. I think Hash Agg should follow that.
> 
> For the patch I posted yesterday, I'll go ahead and push it in about 24
> hours unless there are any objections.




Re: Online checksums patch - once again

2020-07-27 Thread Justin Pryzby
On Thu, Jun 25, 2020 at 11:43:00AM +0200, Daniel Gustafsson wrote:
> The attached v19 fixes a few doc issues I had missed.

+   They can also be enabled or disabled at a later timne, either as an offline
=> time

+* awaiting shutdown, but we can continue turning off checksums anyway
=> a waiting

+* We are starting a checksumming process scratch, and need to start by
=> FROM scratch

+* to inprogress new relations will set relhaschecksums in pg_class so 
it
=> inprogress COMMA

+* Relation no longer exist. We don't consider this an error 
since
=> exists

+* so when the cluster comes back up processing will habe to be resumed.
=> have

+"completed one pass over all databases for checksum 
enabling, %i databases processed",
=> I think this will be confusing to be hardcoded "one".  It'll say "one" over
and over.

+* still exist.
=> exists

In many places, you refer to "datachecksumsworker" (sums) but in nine places
you refer to datachecksumworker (sum).

+ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, 
BufferAccessStrategy strategy)
+{
+   BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);

=> I think looping over numblocks is safe since new blocks are intended to be
written with checksum, right?  Maybe it's good to say that here.

+   BlockNumber b;

blknum will be easier to grep for

+   (errmsg("background worker \"datachecksumsworker\" 
starting for database oid %d",
=> Should be %u or similar (several of these)

Some questions:

It looks like you rewrite every page, even if it already has correct checksum,
to handle replicas.  I wonder if it's possible/reasonable/good to skip pages
with correct checksum when wal_level=minimal ?

It looks like it's not possible to change the checksum delay while a checksum
worker is already running.  That may be important to allow: 1) decreased delay
during slow periods; 2) increased delay if the process is significantly done,
but needs to be throttled to avoid disrupting production environment.

Have you collaborated with Julien about this one?  His patch adds new GUCs:
https://www.postgresql.org/message-id/20200714090808.GA20780@nol
checksum_cost_delay
checksum_cost_page
checksum_cost_limit

Maybe you'd say that Julien's pg_check_relation() should accept parameters
instead of adding GUCs.  I think you should be in agreement on that.  It'd be
silly if the verification function added three GUCs and allowed adjusting
throttle midcourse, but the checksum writer process didn't use them.

If you used something like that, I guess you'd also want to distinguish
checksum_cost_page_read vs write.  Possibly, the GUCs part should be a
preliminary shared patch 0001 that you both used.
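
To make the throttling idea concrete, here is a rough standalone sketch (purely illustrative; the checksum_cost_* names are taken from Julien's patch as cited above, while checksum_delay_point() and its accounting are hypothetical, loosely modeled on vacuum's cost-based delay rather than on either patch's actual code):

#include <unistd.h>

static int      checksum_cost_limit = 200;   /* cost units accumulated before napping */
static double   checksum_cost_delay = 2.0;   /* milliseconds to sleep once the limit is hit */
static int      checksum_cost_page = 10;     /* cost charged per page processed */
static int      cost_balance = 0;

static void
checksum_delay_point(void)
{
    cost_balance += checksum_cost_page;
    if (checksum_cost_delay > 0 && cost_balance >= checksum_cost_limit)
    {
        usleep((useconds_t) (checksum_cost_delay * 1000.0));   /* ms -> us */
        cost_balance = 0;
    }
}

int
main(void)
{
    /* Call once per page, as a checksum worker's main loop might. */
    for (int blkno = 0; blkno < 1000; blkno++)
        checksum_delay_point();
    return 0;
}

Being plain GUC-style variables, the delay and limit could be re-read between pages, which is what would make mid-course adjustment possible.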

-- 
Justin




Re: max_slot_wal_keep_size and wal_keep_segments

2020-07-27 Thread Fujii Masao




On 2020/07/20 21:21, David Steele wrote:

On 7/20/20 6:02 AM, Fujii Masao wrote:



On 2020/07/20 13:48, Fujii Masao wrote:



On 2020/07/17 20:24, David Steele wrote:


On 7/17/20 5:11 AM, Fujii Masao wrote:



On 2020/07/14 20:30, David Steele wrote:

On 7/14/20 12:00 AM, Fujii Masao wrote:


The patch was no longer applied cleanly because of recent commit.
So I updated the patch. Attached.

Barring any objection, I will commit this patch.


This doesn't look right:

+   the wal_keep_size most recent megabytes
+   WAL files plus one WAL file are

How about:

+   wal_keep_size megabytes of
+   WAL files plus one WAL file are


Thanks for the comment! Isn't it better to keep the "most recent" part?
If so, what about either of the following?

1. wal_keep_size megabytes of WAL files plus
 one WAL file that were most recently generated are kept all the time.

2. wal_keep_size megabytes + wal_segment_size bytes
 of WAL files that were most recently generated are kept all the time.


"most recent" seemed implied to me, but I see your point.

How about:

+   the most recent wal_keep_size megabytes of
+   WAL files plus one additional WAL file are


I adopted this and pushed the patch. Thanks!

Also we need to update the release note for v13. What about adding the 
following?


Rename configuration parameter wal_keep_segments to wal_keep_size.

This allows how much WAL files to retain for the standby server, by bytes 
instead of the number of files.
If you previously used wal_keep_segments, the following formula will give you 
an approximately equivalent setting:

wal_keep_size = wal_keep_segments * wal_segment_size (typically 16MB)



I would rework that first sentence a bit. How about:

+ This determines how much WAL to retain for the standby server,
+ specified in megabytes rather than number of files.

The rest looks fine to me.


Thanks for the review!
I adopted your suggestion in the updated version of the patch and pushed it.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




printing oid with %d

2020-07-27 Thread Justin Pryzby
+JsonEncodeDateTime(char *buf, Datum value, Oid typid)
...
+   elog(ERROR, "unknown jsonb value datetime type oid %d", 
typid);

I think this should be %u.

commit cc4feded0a31d2b732d4ea68613115cb720e624e
Author: Andrew Dunstan 
Date:   Tue Jan 16 19:07:13 2018 -0500

Centralize json and jsonb handling of datetime types




Re: stress test for parallel workers

2020-07-27 Thread Thomas Munro
On Wed, Dec 11, 2019 at 3:22 PM Thomas Munro  wrote:
> On Tue, Oct 15, 2019 at 4:50 AM Tom Lane  wrote:
> > > Filed at
> > > https://bugzilla.kernel.org/show_bug.cgi?id=205183
>
> For the curious-and-not-subscribed, there's now a kernel patch
> proposed for this.  We guessed pretty close, but the problem wasn't
> those dodgy looking magic numbers, it was that the bad stack expansion
> check only allows for user space to expand the stack
> (FAULT_FLAG_USER), and here the kernel itself wants to build a stack
> frame.

Hehe, the dodgy looking magic numbers *were* wrong:

- * The kernel signal delivery code writes up to about 1.5kB
+ * The kernel signal delivery code writes a bit over 4KB

https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200724092528.1578671-2-...@ellerman.id.au/




Re: expose parallel leader in CSV and log_line_prefix

2020-07-27 Thread Michael Paquier
On Sun, Jul 26, 2020 at 01:54:27PM -0500, Justin Pryzby wrote:
> +
> + %P
> + For a parallel worker, this is the Process ID of its 
> leader
> + process.
> + 
> + no
> +

Let's keep this as simple as possible and consistent with the surrounding
descriptions and with what we have for pg_stat_activity, say:
"Process ID of the parallel group leader, if this process is a
parallel query worker."

> +case 'P':
> +if (MyProc)
> +{
> +PGPROC *leader = MyProc->lockGroupLeader;
> +if (leader == NULL || leader->pid == MyProcPid)
> +/* padding only */
> +appendStringInfoSpaces(buf,
> +padding > 0 ? padding : -padding);
> +else if (padding != 0)
> +appendStringInfo(buf, "%*d", padding, leader->pid);
> +else
> +appendStringInfo(buf, "%d", leader->pid);

It seems to me we should document here that the check on MyProcPid
ensures that this prints the leader PID only for parallel workers
and discards the leader itself.
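
As a side note on the quoted hunk above, the "%*d" idiom it uses is plain printf-style dynamic width formatting; a standalone C illustration (not part of the patch):

#include <stdio.h>

int
main(void)
{
    int leader_pid = 12345;     /* hypothetical example value */

    printf("[%*d]\n", 8, leader_pid);    /* "[   12345]" -- right-aligned in 8 columns */
    printf("[%*d]\n", -8, leader_pid);   /* "[12345   ]" -- negative width left-aligns */
    return 0;
}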

> +appendStringInfoChar(&buf, ',');
> +
> +/* leader PID */
> +if (MyProc)
> +{
> +PGPROC *leader = MyProc->lockGroupLeader;
> +if (leader && leader->pid != MyProcPid)
> +appendStringInfo(&buf, "%d", leader->pid);
> +}
> +

Same here.

Except for those nits, I have tested the patch and things behave as we
want (including padding and docs), so this looks good to me.
--
Michael




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 5:10 PM Jeff Davis  wrote:
> I noticed that one of the conditionals, "cheapest_total_path != NULL",
> was already redundant with the outer conditional before your patch. I
> guess that was just a mistake which your patch corrects along the way?

Makes sense.

> Anyway, the patch looks good to me. We can have a separate discussion
> about pessimizing the costing, if necessary.

Pushed the hashagg_avoid_disk_plan patch -- thanks!

-- 
Peter Geoghegan




Re: HashAgg's batching counter starts at 0, but Hash's starts at 1.

2020-07-27 Thread David Rowley
On Mon, 27 Jul 2020 at 14:54, Justin Pryzby  wrote:
> It's unrelated to hashAgg vs hashJoin, but I also noticed that this is shown
> only conditionally:
>
> if (es->format != EXPLAIN_FORMAT_TEXT)
> {
> if (es->costs && aggstate->hash_planned_partitions > 0)
> {
> ExplainPropertyInteger("Planned Partitions", NULL,
>
> aggstate->hash_planned_partitions, es);
>
> That was conditional since it was introduced at 1f39bce02:
>
> if (es->costs && aggstate->hash_planned_partitions > 0)
> {
> ExplainPropertyInteger("Planned Partitions", NULL,
>
> aggstate->hash_planned_partitions, es);
> }
>
> I think 40efbf870 should've handled this, too.

hmm. I'm not sure. I think this should follow the same logic as what
"Disk Usage" follows, and right now we don't show Disk Usage unless we
spill. Since we only use partitions when spilling, I don't think it
makes sense to show the estimated partitions when we don't plan on
spilling.

I think if we change this then we should change Disk Usage too.
However, I don't think we should as Sort will only show "Disk" if the
sort spills. I think Hash Agg should follow that.

For the patch I posted yesterday, I'll go ahead and push it in about 24
hours unless there are any objections.

David




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-27 Thread Andrey Lepikhov




27.07.2020 21:34, Alexey Kondratov wrote:

Hi Andrey,

On 2020-07-23 09:23, Andrey V. Lepikhov wrote:

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch. With changes caused by Amit's comments.


Just got a segfault with your v5 patch by deleting from a foreign table. 
It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a 
correct pointer to the required function.


I haven't had a chance to look closer on the code, but you can easily 
reproduce this error with the attached script (patched Postgres binaries 
should be available in the PATH). It works well with master and fails 
with your patch applied.


I used master a3ab7a707d and v5 version of the patch with your script.
No errors found. Can you check your test case?

--
regards,
Andrey Lepikhov
Postgres Professional




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Jeff Davis
On Mon, 2020-07-27 at 11:30 -0700, Peter Geoghegan wrote:
> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look
> at
> that aspect (or confirm that it's included in your review).

I noticed that one of the conditionals, "cheapest_total_path != NULL",
was already redundant with the outer conditional before your patch. I
guess that was just a mistake which your patch corrects along the way?

Anyway, the patch looks good to me. We can have a separate discussion
about pessimizing the costing, if necessary.

Regards,
Jeff Davis






Re: deferred primary key and logical replication

2020-07-27 Thread Euler Taveira
On Fri, 24 Jul 2020 at 05:16, Ajin Cherian  wrote:

> The patch no longer applies, because of additions in the test source.
> Otherwise, I have tested the patch and confirmed that updates and deletes
> on tables with deferred primary keys work with logical replication.
>
> The new status of this patch is: Waiting on Author
>

Thanks for testing. I attached a rebased patch.


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c1ad1581962a56c83183bec1501df6f54406db89 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Sun, 19 Apr 2020 20:04:39 -0300
Subject: [PATCH] Table with deferred PK is not updatable in logical
 replication

In logical replication, an UPDATE or DELETE cannot be executed if the
table's primary key is marked as deferrable: RelationGetIndexList does
not fill rd_replidindex for such an index, so the table ends up without
a usable replica identity. A deferrable primary key should not prevent
the primary key from being used as the replica identity.
---
 src/backend/utils/cache/relcache.c |  7 +++
 src/test/subscription/t/001_rep_changes.pl | 21 +++--
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index a2453cf1f4..8996279f3b 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4556,12 +4556,11 @@ RelationGetIndexList(Relation relation)
 		result = lappend_oid(result, index->indexrelid);
 
 		/*
-		 * Invalid, non-unique, non-immediate or predicate indexes aren't
-		 * interesting for either oid indexes or replication identity indexes,
-		 * so don't check them.
+		 * Invalid, non-unique or predicate indexes aren't interesting for
+		 * either oid indexes or replication identity indexes, so don't check
+		 * them.
 		 */
 		if (!index->indisvalid || !index->indisunique ||
-			!index->indimmediate ||
 			!heap_attisnull(htup, Anum_pg_index_indpred, NULL))
 			continue;
 
diff --git a/src/test/subscription/t/001_rep_changes.pl b/src/test/subscription/t/001_rep_changes.pl
index 3f8318fc7c..f164d23a94 100644
--- a/src/test/subscription/t/001_rep_changes.pl
+++ b/src/test/subscription/t/001_rep_changes.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 23;
+use Test::More tests => 24;
 
 # Initialize publisher node
 my $node_publisher = get_new_node('publisher');
@@ -38,6 +38,8 @@ $node_publisher->safe_psql('postgres',
 	"CREATE TABLE tab_full_pk (a int primary key, b text)");
 $node_publisher->safe_psql('postgres',
 	"ALTER TABLE tab_full_pk REPLICA IDENTITY FULL");
+$node_publisher->safe_psql('postgres',
+	"CREATE TABLE tab_deferred_pk (a int PRIMARY KEY DEFERRABLE)");
 # Let this table with REPLICA IDENTITY NOTHING, allowing only INSERT changes.
 $node_publisher->safe_psql('postgres', "CREATE TABLE tab_nothing (a int)");
 $node_publisher->safe_psql('postgres',
@@ -66,13 +68,17 @@ $node_subscriber->safe_psql('postgres',
 	"CREATE TABLE tab_include (a int, b text, CONSTRAINT covering PRIMARY KEY(a) INCLUDE(b))"
 );
 
+# replication of the table with deferrable primary key
+$node_subscriber->safe_psql('postgres',
+	"CREATE TABLE tab_deferred_pk (a int PRIMARY KEY DEFERRABLE)");
+
 # Setup logical replication
 my $publisher_connstr = $node_publisher->connstr . ' dbname=postgres';
 $node_publisher->safe_psql('postgres', "CREATE PUBLICATION tap_pub");
 $node_publisher->safe_psql('postgres',
 	"CREATE PUBLICATION tap_pub_ins_only WITH (publish = insert)");
 $node_publisher->safe_psql('postgres',
-	"ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing, tab_full_pk"
+	"ALTER PUBLICATION tap_pub ADD TABLE tab_rep, tab_full, tab_full2, tab_mixed, tab_include, tab_nothing, tab_full_pk, tab_deferred_pk"
 );
 $node_publisher->safe_psql('postgres',
 	"ALTER PUBLICATION tap_pub_ins_only ADD TABLE tab_ins");
@@ -122,6 +128,13 @@ $node_publisher->safe_psql('postgres',
 	"DELETE FROM tab_include WHERE a > 20");
 $node_publisher->safe_psql('postgres', "UPDATE tab_include SET a = -a");
 
+$node_publisher->safe_psql('postgres',
+	"INSERT INTO tab_deferred_pk VALUES (1),(2),(3)");
+$node_publisher->safe_psql('postgres',
+	"UPDATE tab_deferred_pk SET a = 11 WHERE a = 1");
+$node_publisher->safe_psql('postgres',
+	"DELETE FROM tab_deferred_pk WHERE a = 2");
+
 $node_publisher->wait_for_catchup('tap_sub');
 
 $result = $node_subscriber->safe_psql('postgres',
@@ -145,6 +158,10 @@ $result = $node_subscriber->safe_psql('postgres',
 is($result, qq(20|-20|-1),
 	'check replicated changes with primary key index with included columns');
 
+$result = $node_subscriber->safe_psql('postgres',
+	"SELECT count(*), min(a), max(a) FROM tab_deferred_pk");
+is($result, qq(2|3|11), 'check replicated changes with deferred primary key');
+
 # insert some duplicate rows
 

Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 12:52 PM Alvaro Herrera wrote:
> On 2020-Jul-27, Peter Geoghegan wrote:
> > The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> > surprisingly complicated. It would be nice if you could take a look at
> > that aspect (or confirm that it's included in your review).
>
> I think you mean "it replaces surprisingly complicated code with
> straightforward code".  Right?  Because in the previous code, there was
> a lot of effort going into deciding whether the path needed to be
> generated; the new code just generates the path always.

Yes, that's what I meant.

It's a bit tricky. For example, I have removed a redundant
"cheapest_total_path != NULL" test in create_partial_grouping_paths()
(two, actually). But these two tests were always redundant. I have to
wonder if I missed the point. Though it seems likely that that was
just an accident. Accretions of code over time made the code work like
that; nothing more.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-27, Peter Geoghegan wrote:

> The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
> surprisingly complicated. It would be nice if you could take a look at
> that aspect (or confirm that it's included in your review).

I think you mean "it replaces surprisingly complicated code with
straightforward code".  Right?  Because in the previous code, there was
a lot of effort going into deciding whether the path needed to be
generated; the new code just generates the path always.

Similarly the code to decide allow_hash in create_distinct_path, which
used to be nontrivial, could (if you wanted) be simplified down to a
single boolean condition.  Previously, it was nontrivial only because
it needed to consider memory usage -- not anymore.

But maybe you're talking about something more subtle that I'm just too
blind to see.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 11:24 AM Alvaro Herrera wrote:
> > Are you proposing that I just put the prototype in miscadmin.h, while
> > leaving the implementation where it is (in nodeHash.c)?
>
> Yes, that's in the part of my reply you didn't quote:
>
> : It remains strange to have the function in executor
> : implementation, but I don't offhand see a better place, so maybe it's
> : okay where it is.

Got it.

I tried putting the prototype in miscadmin.h, and I now agree that
that's the best way to do it -- that's how I do it in the attached
revision. No other changes.

The v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch changes are
surprisingly complicated. It would be nice if you could take a look at
that aspect (or confirm that it's included in your review).

-- 
Peter Geoghegan


v4-0001-Remove-hashagg_avoid_disk_plan-GUC.patch
Description: Binary data


v4-0002-Add-hash_mem_multiplier-GUC.patch
Description: Binary data


Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-27, Peter Geoghegan wrote:

> On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera wrote:
> > On 2020-Jul-23, Peter Geoghegan wrote:
> > I notice you put the prototype for get_hash_mem in nodeHash.h.  This
> > would be fine if not for the fact that the optimizer needs to call the
> > function too, which means the optimizer now has to include executor headers
> > -- not a great thing.  I'd move the prototype elsewhere to avoid this,
> > and I think miscadmin.h is a decent place for the prototype, next to
> > work_mem and m_w_m.
> 
> The location of get_hash_mem() is awkward,

Yes.

> but there is no obvious alternative.

Agreed.

> Are you proposing that I just put the prototype in miscadmin.h, while
> leaving the implementation where it is (in nodeHash.c)?

Yes, that's in the part of my reply you didn't quote:

: It remains strange to have the function in executor
: implementation, but I don't offhand see a better place, so maybe it's
: okay where it is.

> [...] moving the implementation of get_hash_mem() to either of those
> two files seems worse to me.

Sure.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Mon, Jul 27, 2020 at 10:30 AM Alvaro Herrera wrote:
> On 2020-Jul-23, Peter Geoghegan wrote:
> I notice you put the prototype for get_hash_mem in nodeHash.h.  This
> would be fine if not for the fact that the optimizer needs to call the
> function too, which means the optimizer now has to include executor headers
> -- not a great thing.  I'd move the prototype elsewhere to avoid this,
> and I think miscadmin.h is a decent place for the prototype, next to
> work_mem and m_w_m.

The location of get_hash_mem() is awkward, but there is no obvious alternative.

Are you proposing that I just put the prototype in miscadmin.h, while
leaving the implementation where it is (in nodeHash.c)? Maybe that
sounds like an odd question, but bear in mind that the natural place
to put the implementation of a function declared in miscadmin.h is
either utils/init/postinit.c or utils/init/miscinit.c -- moving the
implementation of get_hash_mem() to either of those two files seems
worse to me.

That said, there is an existing oddball case in miscadmin.h, right at
the end -- the two functions whose implementation is in
access/transam/xlog.c. So I can see an argument for adding another
oddball case (i.e. moving the prototype to the end of miscadmin.h
without changing anything else).

> Other than that admittedly trivial complaint, I found nothing to
> complain about in this patch.

Great. Thanks for the review.

My intention is to commit hash_mem_multiplier on Wednesday. We need to
move on from this, and get the release out the door.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-23, Peter Geoghegan wrote:

> Attached is v3 of the hash_mem_multiplier patch series, which now has
> a preparatory patch that removes hashagg_avoid_disk_plan.

I notice you put the prototype for get_hash_mem in nodeHash.h.  This
would be fine if not for the fact that the optimizer needs to call the
function too, which means the optimizer now has to include executor headers
-- not a great thing.  I'd move the prototype elsewhere to avoid this,
and I think miscadmin.h is a decent place for the prototype, next to
work_mem and m_w_m.  It remains strange to have the function in executor
implementation, but I don't offhand see a better place, so maybe it's
okay where it is.

Other than that admittedly trivial complaint, I found nothing to
complain about in this patch.
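
For readers following along, here is a toy standalone sketch of the general idea (my paraphrase under stated assumptions, not the actual get_hash_mem() from the patch): derive the memory budget for hash-based nodes from work_mem and the proposed hash_mem_multiplier, clamped to the int range.

#include <stdio.h>
#include <limits.h>

static int      work_mem = 4096;            /* kB, like the work_mem GUC; example value */
static double   hash_mem_multiplier = 2.0;  /* example value for the proposed GUC */

static int
get_hash_mem(void)
{
    double      hash_mem = (double) work_mem * hash_mem_multiplier;

    if (hash_mem > INT_MAX)
        hash_mem = INT_MAX;
    return (int) hash_mem;
}

int
main(void)
{
    printf("hash-based nodes may use up to %d kB\n", get_hash_mem());
    return 0;
}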

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [BUG] Error in BRIN summarization

2020-07-27 Thread Alvaro Herrera
On 2020-Jul-27, Anastasia Lubennikova wrote:

> Here is the updated version of the fix.
> The problem can be reproduced on all supported versions, so I suggest to
> backpatch it.
> Code slightly changed in v12, so here are two patches: one for versions 9.5
> to 11 and another for versions from 12 to master.

Hi Anastasia, thanks for this report and fix.  I was considering this
last week and noticed that the patch changes the ABI of
heap_get_root_tuples, which may be problematic in back branches.  I
suggest that for unreleased branches (12 and prior) we need to create a
new function with the new signature, and keep heap_get_root_tuples
unchanged.  In 13 and master we don't need that trick, so we can keep
the code as you have it in this version of the patch.

OffsetNumber
heap_get_root_tuples_new(Page page, OffsetNumber *root_offsets)
{ .. full implementation ... }

/* ABI compatibility only */
void
heap_get_root_tuples(Page page, OffsetNumber *root_offsets)
{
(void) heap_get_root_tuples_new(page, root_offsets);
}


(I was also considering whether it needs to be a loop to reobtain root
tuples, in case a concurrent transaction can create a new item while
we're checking that item; but I don't think that can really happen for
one individual tuple.)

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-27 Thread Alexey Kondratov

Hi Andrey,

On 2020-07-23 09:23, Andrey V. Lepikhov wrote:

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch. With changes caused by Amit's comments.


Just got a segfault with your v5 patch by deleting from a foreign table. 
Here is a part of backtrace:


  * frame #0: 0x0001029069ec 
postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at 
nodeForeignscan.c:385:3
frame #1: 0x0001028e7b06 
postgres`ExecShutdownNode(node=0x7ff28c8909b0) at 
execProcnode.c:779:4
frame #2: 0x00010299b3fa 
postgres`planstate_walk_members(planstates=0x7ff28c8906d8, nplans=1, 
walker=(postgres`ExecShutdownNode at execProcnode.c:752), 
context=0x) at nodeFuncs.c:3998:7
frame #3: 0x00010299b010 
postgres`planstate_tree_walker(planstate=0x7ff28c8904c0, 
walker=(postgres`ExecShutdownNode at execProcnode.c:752), 
context=0x) at nodeFuncs.c:3914:8
frame #4: 0x0001028e7ab7 
postgres`ExecShutdownNode(node=0x7ff28c8904c0) at 
execProcnode.c:771:2


(lldb) f 0
frame #0: 0x0001029069ec 
postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at 
nodeForeignscan.c:385:3

   382  FdwRoutine *fdwroutine = node->fdwroutine;
   383
   384  if (fdwroutine->ShutdownForeignScan)
-> 385   fdwroutine->ShutdownForeignScan(node);
   386  }
(lldb) p node->fdwroutine->ShutdownForeignScan
(ShutdownForeignScan_function) $1 = 0x7f7f7f7f7f7f7f7f

It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a 
correct pointer to the required function.


I haven't had a chance to look closer on the code, but you can easily 
reproduce this error with the attached script (patched Postgres binaries 
should be available in the PATH). It works well with master and fails 
with your patch applied.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#!/usr/bin/env sh

pg_ctl -D node1 stop > /dev/null
pg_ctl -D node2 stop > /dev/null

rm -rf node1 node2
rm node1.log node2.log

initdb -D node1
initdb -D node2

echo "port = 5433" >> node2/postgresql.conf

pg_ctl -D node1 -l node1.log start
pg_ctl -D node2 -l node2.log start

createdb
createdb -p5433

psql -p5433 -c "CREATE TABLE test (id INT) PARTITION BY HASH (id)"

psql -c "CREATE EXTENSION IF NOT EXISTS postgres_fdw"
psql -c "CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port 
'5433'); CREATE USER MAPPING FOR current_user SERVER node2"
psql -c "CREATE FOREIGN TABLE test(id INT) SERVER node2"

psql -c "DELETE FROM test"


Re: Should we remove a fallback promotion? take 2

2020-07-27 Thread Fujii Masao




On 2020/07/27 17:53, Hamid Akhtar wrote:

Applying the patch to the current master branch throws 9 hunks. AFAICT, the 
patch is good otherwise.


So you think that the patch can be marked as Ready for Committer?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Display individual query in pg_stat_activity

2020-07-27 Thread Dave Page
On Mon, Jul 27, 2020 at 4:28 PM Jeremy Schneider wrote:

> On 7/27/20 07:57, Dave Page wrote:
>
> I'm not sure I'd want that to happen, as it could make it much harder to
> track the activity back to a query in the application layer or server logs.
>
> Perhaps a separate field could be added for the current statement, or a
> value to indicate what the current statement number in the query is?
>
>
> Might be helpful to give some specifics about circumstances where strings
> can appear in pg_stat_activity.query with multiple statements.
>
> 1) First of all, IIUC multiple statements are only supported in the first
> place by the simple protocol and PLs.  Anyone using parameterized
> statements (bind variables) should be unaffected by this.
>
> 2) My read of the official pg JDBC driver is that even for batch
> operations it currently iterates and sends each statement individually. I
> don't think the JDBC driver has the capability to send multiple statements,
> so java apps using this driver should be unaffected.
>

That is just one of a number of different popular drivers of course.


>
> 3) psql -c will always send the string as a single "simple protocol"
> request.  Scripts will be impacted.
>
> 4) PLs also seem to have a code path that can put multiple statements in
> pg_stat_activity when parallel slaves are launched.  PL code will be
> impacted.
>
> 5) pgAdmin uses the simple protocol and when a user executes a block of
> statements, pgAdmin seems to send the whole block as a single "simple
> protocol" request.  Tools like pgAdmin will be impacted.
>

It does. It also prepends some queries with comments, specifically to allow
users to filter them out when they're analysing logs (a feature requested
by users, not just something we thought was a good idea). I'm assuming that
this patch would also strip those?


>
> At the application layer, it doesn't seem problematic to me if PostgreSQL
> reports each query one at a time.  IMO most people will find this to be a
> more useful behavior and they will still find their queries in their app
> code or app logs.
>

I think there are arguments to be made for both approaches.


>
> However at the PostgreSQL logging layer this is a good call-out.  I just
> did a quick test on 14devel to double-check my assumption and it does seem
> that PostgreSQL logs the entire combined query for psql -c.  I think it
> would be better for PostgreSQL to report queries individually in the log
> too - for example pgBadger summaries will be even more useful if they
> report information for each individual query rather than a single big block
> of multiple queries.
>
> Given how small this patch is, it seems worthwhile to at least investigate
> whether the logging component could be addressed just as easily.
>
> -Jeremy
>
> --
> Jeremy Schneider
> Database Engineer
> Amazon Web Services
>
>
>

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: Default setting for enable_hashagg_disk

2020-07-27 Thread Peter Geoghegan
On Sun, Jul 26, 2020 at 11:34 AM Tomas Vondra wrote:
> That's 1.6GB, if I read it right. Which is more than 200MB ;-)

Sigh. That solves that "mystery": the behavior that my sorted vs
random example exhibited is a known limitation in hash aggs that spill
(and an acceptable one). The memory usage is reported accurately by
EXPLAIN ANALYZE.

-- 
Peter Geoghegan




Re: hashagg slowdown due to spill changes

2020-07-27 Thread Peter Geoghegan
On Sun, Jul 26, 2020 at 4:17 PM Jeff Davis  wrote:
> I saw less of an improvement than you or Andres (perhaps just more
> noise). But given that both you and Andres are reporting a measurable
> improvement, then I went ahead and committed it. Thank you.

Thanks!

-- 
Peter Geoghegan




Re: [BUG] Error in BRIN summarization

2020-07-27 Thread Anastasia Lubennikova

On 23.07.2020 20:39, Anastasia Lubennikova wrote:
One of our clients caught an error "failed to find parent tuple for 
heap-only tuple at (50661,130) in table "tbl"" in PostgreSQL v12.


Steps to reproduce (REL_12_STABLE):

1) Create table with primary key, create brin index, fill table with 
some initial data:


create table tbl (id int primary key, a int) with (fillfactor=50);
create index idx on tbl using brin (a) with (autosummarize=on);
insert into tbl select i, i from generate_series(0,10) as i;

2) Run script test_brin.sql using pgbench:

 pgbench postgres -f ../review/brin_test.sql  -n -T 120

The script is a bit messy because I was trying to reproduce a 
problematic workload. Though I didn't manage to simplify it.
The idea is that it inserts new values into the table to produce 
unindexed pages and also updates some values to trigger HOT-updates on 
these pages.


3) Open psql session and run brin_summarize_new_values

select brin_summarize_new_values('idx'::regclass::oid); \watch 2

Wait a bit. And in psql you will see the ERROR.

This error is caused by a problem with the root_offsets array bounds. It
occurs if a new HOT tuple was inserted after we collected root_offsets,
so we don't have a root offset for the tuple's offnum. Concurrent
insertions are possible because brin_summarize_new_values() holds only
ShareUpdateExclusiveLock on the table and no lock (only a pin) on the
page.


The draft fix is in the attachments. It saves root_offsets_size and
checks that we only access valid fields.
The patch also adds some debug messages, just to confirm that the
problem was caught.


TODO:

- check if  heapam_index_validate_scan() has the same problem
- code cleanup
- test other PostgreSQL versions

[1] 
https://www.postgresql.org/message-id/flat/CA%2BTgmoYgwjmmjK24Qxb_vWAu8_Hh7gfVFcr3%2BR7ocdLvYOWJXg%40mail.gmail.com




Here is the updated version of the fix.
The problem can be reproduced on all supported versions, so I suggest
backpatching it.
The code changed slightly in v12, so here are two patches: one for
versions 9.5 to 11 and another for versions 12 to master.


As for heapam_index_validate_scan(), I've tried to reproduce the same 
error with CREATE INDEX CONCURRENTLY, but haven't found any problem with it.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fc19f40a2e3..cb29a833663 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1176,6 +1176,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
 	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
+	OffsetNumber root_offsets_size = 0;
 
 	/*
 	 * sanity checks
@@ -1341,6 +1342,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		 * buffer continuously while visiting the page, so no pruning
 		 * operation can occur either.
 		 *
+		 * It is essential, though, to check the root_offsets_size bound
+		 * before accessing the array, because concurrently inserted HOT tuples
+		 * don't have a valid cached root offset and we need to build the map
+		 * once again for them.
+		 *
 		 * Also, although our opinions about tuple liveness could change while
 		 * we scan the page (due to concurrent transaction commits/aborts),
 		 * the chain root locations won't, so this info doesn't need to be
@@ -1355,7 +1361,7 @@ heapam_index_build_range_scan(Relation heapRelation,
 			Page		page = BufferGetPage(hscan->rs_cbuf);
 
 			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
+			root_offsets_size = heap_get_root_tuples(page, root_offsets);
 			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
 
 			root_blkno = hscan->rs_cblock;
@@ -1643,6 +1649,22 @@ heapam_index_build_range_scan(Relation heapRelation,
 			rootTuple = *heapTuple;
 			offnum = ItemPointerGetOffsetNumber(&heapTuple->t_self);
 
+			/*
+			 * As we do not hold buffer lock, concurrent insertion can happen.
+			 * If so, collect the map once again to find the root offset for
+			 * the new tuple.
+			 */
+			if (root_offsets_size < offnum)
+			{
+Page	page = BufferGetPage(hscan->rs_cbuf);
+
+LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+root_offsets_size = heap_get_root_tuples(page, root_offsets);
+LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
+			}
+
+			Assert(root_offsets_size >= offnum);
+
 			if (!OffsetNumberIsValid(root_offsets[offnum - 1]))
 ereport(ERROR,
 		(errcode(ERRCODE_DATA_CORRUPTED),
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 0efe3ce9995..5ddb123f30c 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -740,12 +740,15 @@ heap_page_prune_execute(Buffer buffer,
  * Note: The information collected here is valid only as long as 

Re: Display individual query in pg_stat_activity

2020-07-27 Thread Dave Page
Hi

On Mon, Jul 27, 2020 at 3:40 PM Drouvot, Bertrand wrote:

> Hi hackers,
>
> I've attached a patch to display individual query in the pg_stat_activity
> query field when multiple SQL statements are currently displayed.
>
> *Motivation:*
>
> When multiple statements are displayed then we don’t know which one is
> currently running.
>

I'm not sure I'd want that to happen, as it could make it much harder to
track the activity back to a query in the application layer or server logs.

Perhaps a separate field could be added for the current statement, or a
value to indicate what the current statement number in the query is?
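
For context on how several statements end up in a single pg_stat_activity.query entry in the first place, here is a minimal libpq sketch (illustrative only; assumes a local server and compiles with -lpq). PQexec() uses the simple query protocol, so the whole string travels to the server as one message and is reported as one query:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /*
     * One simple-protocol request carrying two statements; while it runs,
     * pg_stat_activity.query on the server shows the entire string.
     */
    res = PQexec(conn, "SELECT 1; SELECT pg_sleep(30);");
    PQclear(res);
    PQfinish(conn);
    return 0;
}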

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com


Re: display offset along with block number in vacuum errors

2020-07-27 Thread Justin Pryzby
On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote:
> Hi hackers,
> We discussed in another email thread [1] that it would be helpful to
> display the offset along with the block number in vacuum errors. Here is
> a proposed patch to add the offset along with the block number.

Thanks.  I happened to see both threads, only by chance.

I'd already started writing the equivalent of your 0001; it's essentially the
same as yours.

Here:

@@ -1924,14 +1932,22 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, 
Buffer buffer,
BlockNumber tblk;
OffsetNumber toff;
ItemId  itemid;
+   LVSavedErrInfo loc_saved_err_info;
 
tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
if (tblk != blkno)
break;  /* past end of tuples for this block */
toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+
+   /* Update error traceback information */
+   update_vacuum_error_info(vacrelstats, &loc_saved_err_info, VACUUM_ERRCB_PHASE_VACUUM_HEAP,
+blkno, toff);
itemid = PageGetItemId(page, toff);
ItemIdSetUnused(itemid);
unused[uncnt++] = toff;
+
+   /* Revert to the previous phase information for error traceback */
+   restore_vacuum_error_info(vacrelstats, &loc_saved_err_info);
}

I'm not sure why you use restore_vacuum_error_info() at all.  It's already
called at the end of lazy_vacuum_page() (and others) to allow functions to
clean up after their own state changes, rather than requiring callers to do it.
I don't think you should use it in a loop, nor introduce another
LVSavedErrInfo.

Since phase and blkno are already set, I think you should just set
vacrelstats->offnum = toff, rather than calling update_vacuum_error_info().
Similar to whats done in lazy_vacuum_heap():

tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
vacrelstats->blkno = tblk;

I think you should also do:

@@ -2976,6 +2984,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
ItemId  itemid;
HeapTupleData tuple;
 
+   vacrelstats->offset = offnum;

I'm not sure, but maybe you'd also want to do the same in more places:

@@ -2024,6 +2030,7 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
@@ -2790,6 +2797,7 @@ count_nondeletable_pages(Relation onerel, LVRelStats 
*vacrelstats)

-- 
Justin




Re: shared tempfile was not removed on statement_timeout

2020-07-27 Thread Justin Pryzby
On Mon, Jul 27, 2020 at 08:00:46PM +1200, Thomas Munro wrote:
> On Tue, Jul 21, 2020 at 4:33 PM Justin Pryzby  wrote:
> >  /*
> >   * clean up a spool structure and its substructures.
> >   */
> >  static void
> >  _bt_spooldestroy(BTSpool *btspool)
> >  {
> > +   void *fileset = tuplesort_shared_fileset(btspool->sortstate);
> > +   if (fileset)
> > +   SharedFileSetDeleteAll(fileset);
> > tuplesort_end(btspool->sortstate);
> > pfree(btspool);
> >  }
> 
> Why can't tuplesort_end do it?

Because then I think the parallel workers remove their own files, with tests
failing like:

+ERROR:  could not open temporary file "0.0" from BufFile "0": No such file or 
directory

I looked around a bit more and came up with this, which works, but I don't know
enough to say if it's right.

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..f89d42f475 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3344,6 +3344,7 @@ walkdir(const char *path,
struct stat fst;
int sret;
 
+   usleep(9);
CHECK_FOR_INTERRUPTS();
 
if (strcmp(de->d_name, ".") == 0 ||
diff --git a/src/backend/utils/sort/tuplesort.c 
b/src/backend/utils/sort/tuplesort.c
index 3c49476483..c6e5e6d00b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1387,6 +1387,9 @@ tuplesort_free(Tuplesortstate *state)
 void
 tuplesort_end(Tuplesortstate *state)
 {
+   if (state->shared && state->shared->workersFinished == state->nParticipants)
+   SharedFileSetDeleteAll(&state->shared->fileset);
+
tuplesort_free(state);
 
/*




Re: Index Skip Scan (new UniqueKeys)

2020-07-27 Thread Dmitry Dolgov
> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
>
> > Well, it's obviously wrong, thanks for noticing. What is necessary is to
> > compare two index tuples, the start and the next one, to test if they're
> > the same (in which case if I'm not mistaken probably we can compare item
> > pointers). I've got this question when I was about to post a new version
> > with changes to address feedback from Andy, now I'll combine them and
> > send a cumulative patch.
>
> This sounds like approximately the same problem as the one that
> _bt_killitems() has to deal with as of Postgres 13. This is handled in
> a way that is admittedly pretty tricky, even though the code does not
> need to be 100% certain that it's "the same" tuple. Deduplication kind
> of makes that a fuzzy concept. In principle there could be one big
> index tuple instead of 5 tuples, even though the logical contents of
> the page have not been changed between the time we recorded heap TIDs
> locally and the time _bt_killitems() tried to match on those heap
> TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
> deduplication pass could do that. Your code needs to be prepared for
> stuff like that.
>
> Ultimately posting list tuples are just a matter of understanding the
> on-disk representation -- a "Small Matter of Programming". Even
> without deduplication there are potential hazards from the physical
> deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
> not code that runs in VACUUM, despite the name). Make sure that you
> hold a buffer pin on the leaf page throughout, because you need to do
> that to make sure that VACUUM cannot concurrently recycle heap TIDs.
> If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
> subtly broken. _bt_killitems() is safe because it either holds on to a
> pin or gives up when the LSN changes at all. (ISTM that your only
> choice is to hold on to a leaf page pin, since you cannot just decide
> to give up in the way that _bt_killitems() sometimes can.)

I see, thanks for the clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.
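
For the posting list part, a minimal sketch of the kind of comparison that
seems to be needed (this assumes the nbtree posting-list accessors added in
Postgres 13 and is purely illustrative, not code from the patch):

/*
 * Does this index tuple (possibly a posting list tuple created by
 * deduplication) contain the given heap TID?
 */
static bool
index_tuple_contains_htid(IndexTuple itup, ItemPointer htid)
{
	if (BTreeTupleIsPosting(itup))
	{
		int		nposting = BTreeTupleGetNPosting(itup);

		for (int i = 0; i < nposting; i++)
		{
			if (ItemPointerEquals(BTreeTupleGetPostingN(itup, i), htid))
				return true;
		}
		return false;
	}

	return ItemPointerEquals(&itup->t_tid, htid);
}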




Re: Should we remove a fallback promotion? take 2

2020-07-27 Thread Hamid Akhtar
Applying the patch to the current master branch results in 9 hunks being
reported. AFAICT, the patch is good otherwise.

On Wed, Jun 3, 2020 at 3:20 PM Fujii Masao 
wrote:

>
>
> On 2020/06/03 12:06, Kyotaro Horiguchi wrote:
> > At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao <
> masao.fu...@oss.nttdata.com> wrote in
> >> I will change the status back to Needs Review.
>
> Thanks for the review!
>
> >   record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1,
> false);
> >   if (record != NULL)
> >   {
> > -  fast_promoted = true;
> > +  promoted = true;
> >
> > Even if we missed the last checkpoint record, we don't give up
> > promotion but continue with fallback promotion, yet the variable
> > "promoted" stays false. That is confusing.
> >
> > How about changing it to fallback_promotion, or some name with a more
> > behavior-specific meaning, like immediate_checkpoint_needed?
>
>
> I like doEndOfRecoveryCkpt or something, but I have no strong opinion
> about that flag naming. So I'm ok with immediate_checkpoint_needed
> if others like that too.
>
> Regards,
>
> --
> Fujii Masao
> Advanced Computing Technology Center
> Research and Development Headquarters
> NTT DATA CORPORATION
>


-- 
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: shared tempfile was not removed on statement_timeout

2020-07-27 Thread Thomas Munro
On Tue, Jul 21, 2020 at 4:33 PM Justin Pryzby  wrote:
>  /*
>   * clean up a spool structure and its substructures.
>   */
>  static void
>  _bt_spooldestroy(BTSpool *btspool)
>  {
> +   void *fileset = tuplesort_shared_fileset(btspool->sortstate);
> +   if (fileset)
> +   SharedFileSetDeleteAll(fileset);
> tuplesort_end(btspool->sortstate);
> pfree(btspool);
>  }

Why can't tuplesort_end do it?




Re: display offset along with block number in vacuum errors

2020-07-27 Thread Mahendra Singh Thalor
Thanks Michael for looking into this.

On Sat, 25 Jul 2020 at 15:02, Michael Paquier  wrote:
>
> On Fri, Jul 24, 2020 at 11:18:43PM +0530, Mahendra Singh Thalor wrote:
> > In commit b61d161(Introduce vacuum errcontext to display additional
> > information), we added vacuum errcontext to display additional
> > information(block number) so that in case of vacuum error, we can identify
> > which block we are getting error.  Addition to block number, if we can
> > display offset, then it will be more helpful for users. So to display
> > offset, here proposing two different methods(Thanks Robert for suggesting
> > these 2 methods):
>
> new_rel_pages = count_nondeletable_pages(onerel, vacrelstats);
> vacrelstats->blkno = new_rel_pages;
> +   vacrelstats->offnum = InvalidOffsetNumber;
>
> Adding more context would be interesting for some cases, but not all
> contrary to what your patch does in some code paths like
> lazy_truncate_heap() as you would show up an offset of 0 for an
> invalid value.  This could confuse some users.  Note that we are
> careful enough to not print a context message if the block number is
> invalid.

Okay. I agree with you. In case of an invalid offset, we can skip offset
printing. I will make this change in the next patch.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: display offset along with block number in vacuum errors

2020-07-27 Thread Masahiko Sawada
On Sat, 25 Jul 2020 at 02:49, Mahendra Singh Thalor  wrote:
>
> Hi hackers,
> We discussed in another email thread[1], that it will be helpful if we can 
> display offset along with block number in vacuum error. Here, proposing a 
> patch to add offset along with block number in vacuum errors.
>
> In commit b61d161(Introduce vacuum errcontext to display additional 
> information), we added vacuum errcontext to display additional 
> information(block number) so that in case of vacuum error, we can identify 
> which block we are getting error.  Addition to block number, if we can 
> display offset, then it will be more helpful for users. So to display offset, 
> here proposing two different methods(Thanks Robert for suggesting these 2 
> methods):
>
> Method 1: We can report the TID as well as the block number in  errcontext.
> - errcontext("while scanning block %u of relation \"%s.%s\"",
> -   errinfo->blkno, errinfo->relnamespace, errinfo->relname);
> + errcontext("while scanning block %u and offset %u of relation \"%s.%s\"",
> +   errinfo->blkno, errinfo->offnum, errinfo->relnamespace, errinfo->relname);
>
> Above fix requires more calls to update_vacuum_error_info(). Attaching 
> v01_0001 patch for this method.
>
> Method 2: We can improve the error messages by passing the relevant TID to 
> heap_prepare_freeze_tuple and having it report the TID as part of the error 
> message or in the error detail.
>   ereport(ERROR,
>   (errcode(ERRCODE_DATA_CORRUPTED),
> - errmsg_internal("found xmin %u from before relfrozenxid %u",
> + errmsg_internal("for block %u and offnum %u, found xmin %u from before 
> relfrozenxid %u",
> + ItemPointerGetBlockNumber(tid),
> + ItemPointerGetOffsetNumber(tid),
>   xid, relfrozenxid)));
>
> Attaching v01_0002 patch for this method.
>
> Please let me know your thoughts.
>

+1 for adding offset in error messages.

I had a look at the 0001 patch. You've set the vacuum error info, but I
think an error won't happen while setting itemids unused:

@@ -1924,14 +1932,22 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 		BlockNumber tblk;
 		OffsetNumber toff;
 		ItemId		itemid;
+		LVSavedErrInfo loc_saved_err_info;
 
 		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
 		if (tblk != blkno)
 			break;				/* past end of tuples for this block */
 		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+
+		/* Update error traceback information */
+		update_vacuum_error_info(vacrelstats, &loc_saved_err_info,
+								 VACUUM_ERRCB_PHASE_VACUUM_HEAP, blkno, toff);
 		itemid = PageGetItemId(page, toff);
 		ItemIdSetUnused(itemid);
 		unused[uncnt++] = toff;
+
+		/* Revert to the previous phase information for error traceback */
+		restore_vacuum_error_info(vacrelstats, &loc_saved_err_info);
 	}

PageRepairFragmentation(page);

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Postgres-native method to identify if a tuple is frozen

2020-07-27 Thread Lawrence Jones
Thanks for the help. I'd seen the heap_page_items functions, but wanted to
avoid the superuser requirement and wondered if this was going to be a
performant method of finding the freeze column (we're scanning some
billions of rows).

Fwiw, we think we'll probably go with a tiny extension that exposes the
frozen state exactly. For reference, this is the basic sketch:

Datum
frozen(PG_FUNCTION_ARGS)
{
    Oid reloid = PG_GETARG_OID(0);
    ItemPointer tid = PG_GETARG_ITEMPOINTER(1);
    Relation rel;
    HeapTupleData tuple;
    Buffer buf;
    int result;

    // Open table and snapshot - ensuring we later close them
    rel = heap_open(reloid, AccessShareLock);

    // Initialise the tuple data with a tid that matches our input
    ItemPointerCopy(tid, &(tuple.t_self));

#if PG_MAJOR < 12
    if (!heap_fetch(rel, SnapshotAny, &tuple, &buf, true, NULL))
#else
    if (!heap_fetch(rel, SnapshotAny, &tuple, &buf))
#endif
    {
        result = 3;
    }
    else
    {
        result = HeapTupleHeaderXminFrozen(tuple.t_data);
    }

    // Close any opened resources here; the buffer may be invalid if the
    // fetch failed, so only release it if it was actually pinned
    heap_close(rel, AccessShareLock);
    if (BufferIsValid(buf))
        ReleaseBuffer(buf);

    PG_RETURN_INT32(result);
}

On Tue, 21 Jul 2020 at 13:22, Amit Kapila  wrote:

> On Mon, Jul 20, 2020 at 9:07 PM Lawrence Jones 
> wrote:
> >
> >
> > So we hit the question: how can we identify if a tuple is frozen? I know
> the tuple has both committed and aborted hint bits set, but accessing those
> bits seems to require superuser functions and are unlikely to be that fast.
> >
> > Are there system columns (similar to xmin, tid, cid) that we don't know
> about?
> >
>
> I think the way to get that information is to use pageinspect
> extension and use some query like below but you are right that you
> need superuser privilege for that:
>
> SELECT t_ctid, raw_flags, combined_flags
>  FROM heap_page_items(get_raw_page('pg_class', 0)),
>LATERAL heap_tuple_infomask_flags(t_infomask, t_infomask2)
>  WHERE t_infomask IS NOT NULL OR t_infomask2 IS NOT NULL;
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>


Re: Transactions involving multiple postgres foreign servers, take 2

2020-07-27 Thread Masahiko Sawada
On Thu, 23 Jul 2020 at 22:51, Muhammad Usama  wrote:
>
>
>
> On Wed, Jul 22, 2020 at 12:42 PM Masahiko Sawada 
>  wrote:
>>
>> On Sat, 18 Jul 2020 at 01:55, Fujii Masao  
>> wrote:
>> >
>> >
>> >
>> > On 2020/07/16 14:47, Masahiko Sawada wrote:
>> > > On Tue, 14 Jul 2020 at 11:19, Fujii Masao  
>> > > wrote:
>> > >>
>> > >>
>> > >>
>> > >> On 2020/07/14 9:08, Masahiro Ikeda wrote:
>> >  I've attached the latest version patches. I've incorporated the review
>> >  comments I got so far and improved locking strategy.
>> > >>>
>> > >>> Thanks for updating the patch!
>> > >>
>> > >> +1
>> > >> I'm interested in these patches and now studying them. While checking
>> > >> the behaviors of the patched PostgreSQL, I got three comments.
>> > >
>> > > Thank you for testing this patch!
>> > >
>> > >>
>> > >> 1. We can access to the foreign table even during recovery in the HEAD.
>> > >> But in the patched version, when I did that, I got the following error.
>> > >> Is this intentional?
>> > >>
>> > >> ERROR:  cannot assign TransactionIds during recovery
>> > >
>> > > No, it should be fixed. I'm going to fix this by not collecting
>> > > participants for atomic commit during recovery.
>> >
>> > Thanks for trying to fix the issues!
>> >
>> > I'd like to report one more issue. When I started new transaction
>> > in the local server, executed INSERT in the remote server via
>> > postgres_fdw and then quit psql, I got the following assertion failure.
>> >
>> > TRAP: FailedAssertion("fdwxact", File: "fdwxact.c", Line: 1570)
>> > 0   postgres0x00010d52f3c0 
>> > ExceptionalCondition + 160
>> > 1   postgres0x00010cefbc49 
>> > ForgetAllFdwXactParticipants + 313
>> > 2   postgres0x00010cefff14 
>> > AtProcExit_FdwXact + 20
>> > 3   postgres0x00010d313fe3 shmem_exit + 179
>> > 4   postgres0x00010d313e7a 
>> > proc_exit_prepare + 122
>> > 5   postgres0x00010d313da3 proc_exit + 19
>> > 6   postgres0x00010d35112f PostgresMain + 
>> > 3711
>> > 7   postgres0x00010d27bb3a BackendRun + 570
>> > 8   postgres0x00010d27af6b BackendStartup 
>> > + 475
>> > 9   postgres0x00010d279ed1 ServerLoop + 593
>> > 10  postgres0x00010d277940 PostmasterMain 
>> > + 6016
>> > 11  postgres0x00010d1597b9 main + 761
>> > 12  libdyld.dylib   0x7fff7161e3d5 start + 1
>> > 13  ??? 0x0003 0x0 + 3
>> >
>>
>> Thank you for reporting the issue!
>>
>> I've attached the latest version patch that incorporated all comments
>> I got so far. I've removed the patch adding the 'prefer' mode of
>> foreign_twophase_commit to keep the patch set simple.
>
>
> I have started to review the patchset. Just a quick comment.
>
> Patch v24-0002-Support-atomic-commit-among-multiple-foreign-ser.patch
> contains changes (adding fdwxact includes) for
> src/backend/executor/nodeForeignscan.c,  
> src/backend/executor/nodeModifyTable.c
> and src/backend/executor/execPartition.c that don't seem to be
> required with the latest version.

Thanks for your comment.

Right. I've removed these changes on the local branch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-07-27 Thread Bharath Rupireddy
On Fri, Jul 24, 2020 at 8:07 PM Robert Haas  wrote:
>
> On Fri, Jul 24, 2020 at 7:10 AM Bharath Rupireddy
>  wrote:
> > I looked at what llvm_shutdown() actually does. It frees up JIT stacks,
> > and also, if it exists, the perf-related resource, using
> > LLVMOrcDisposeInstance() and LLVMOrcUnregisterPerf(), which were
> > dynamically allocated in llvm_session_initialize() through the JIT
> > library function LLVMOrcCreateInstance() [1].
> >
> > It looks like there is no problem in moving llvm_shutdown() to either
> > on_shmem_exit() or on_proc_exit().
>
> If it doesn't involve shared memory, I guess it can be on_proc_exit()
> rather than on_shmem_exit().
>
> I guess the other question is why we're doing it at all. What
> resources are being allocated that wouldn't be freed up by process
> exit anyway?
>

LLVMOrcCreateInstance() and LLVMOrcDisposeInstance() do new and delete
respectively; I just found these functions at the link [1]. But I don't
know exactly whether any other resources are being allocated that can't be
freed up by proc_exit(). Tagging @Andres Freund for input on whether there
is any problem with making llvm_shutdown() an on_proc_exit() callback
instead of a before_shmem_exit() callback.
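
A rough sketch of what that registration change would look like (assuming
the existing call is before_shmem_exit(llvm_shutdown, 0) in
llvm_session_initialize(), and that llvm_shutdown() keeps its current
pg_on_exit_callback signature; illustrative only, not a tested patch):

--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
 	/* in llvm_session_initialize(): register cleanup at process exit,
 	 * since it does not touch shared memory */
-	before_shmem_exit(llvm_shutdown, 0);
+	on_proc_exit(llvm_shutdown, 0);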

And as suggested in the previous mails, we wanted to make it on_proc_exit()
to avoid the abort issue reported in this mail chain. If we take the fix for
the abort issue, commit 303640199d0436c5e7acdf50b837a027b5726594 mentioned in
the previous response [2], then this change may not be strictly necessary
right now; but to be safer and to avoid similar issues in the future, we can
consider this change as well.

[1] - https://llvm.org/doxygen/OrcCBindings_8cpp_source.html

LLVMOrcJITStackRef LLVMOrcCreateInstance(LLVMTargetMachineRef TM) {
  TargetMachine *TM2(unwrap(TM));

  Triple T(TM2->getTargetTriple());

  auto IndirectStubsMgrBuilder =
      orc::createLocalIndirectStubsManagerBuilder(T);

  OrcCBindingsStack *JITStack =
      new OrcCBindingsStack(*TM2, std::move(IndirectStubsMgrBuilder));

  return wrap(JITStack);
}

LLVMErrorRef LLVMOrcDisposeInstance(LLVMOrcJITStackRef JITStack) {
  auto *J = unwrap(JITStack);
  auto Err = J->shutdown();
  delete J;
  return wrap(std::move(Err));
}

[2] -
https://www.postgresql.org/message-id/CALj2ACVwOKZ8qYUsZrU2y2efnYZOLRxPC6k52FQcB3oriH9Kcg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


Re: autovac issue with large number of tables

2020-07-27 Thread Masahiko Sawada
On Mon, 27 Jul 2020 at 06:43, Nasby, Jim  wrote:
>
> A database with a very large number of  tables eligible for autovacuum can 
> result in autovacuum workers “stuck” in a tight loop of 
> table_recheck_autovac() constantly reporting nothing to do on the table. This 
> is because a database with a very large number of tables means it takes a 
> while to search the statistics hash to verify that the table still needs to 
> be processed[1]. If a worker spends some time processing a table, when it’s 
> done it can spend a significant amount of time rechecking each table that it 
> identified at launch (I’ve seen a worker in this state for over an hour). A 
> simple work-around in this scenario is to kill the worker; the launcher will 
> quickly fire up a new worker on the same database, and that worker will build 
> a new list of tables.
>
>
>
> That’s not a complete solution though… if the database contains a large 
> number of very small tables you can end up in a state where 1 or 2 workers is 
> busy chugging through those small tables so quickly than any additional 
> workers spend all their time in table_recheck_autovac(), because that takes 
> long enough that the additional workers are never able to “leapfrog” the 
> workers that are doing useful work.
>

As another solution, I've been considering adding a queue of table OIDs
that need to be vacuumed/analyzed in shared memory (i.e. on DSA). Since
all autovacuum workers running on the same database would see a consistent
queue, the issue explained above won't happen, and it would probably also
make it easier to implement prioritization of tables being vacuumed, which
is sometimes discussed on pgsql-hackers. I guess it might be worth
discussing this idea.
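
As a rough illustration only (all names here are made up; this is not a
design or a patch), such a queue might look something like:

/*
 * Hypothetical work queue of tables to vacuum/analyze, placed in DSA so
 * that all autovacuum workers on a database see the same list.
 */
typedef struct AVSharedWorkQueue
{
	LWLock		lock;			/* protects head */
	int			head;			/* next entry to hand out */
	int			ntables;		/* number of entries in oids[] */
	Oid			oids[FLEXIBLE_ARRAY_MEMBER];
} AVSharedWorkQueue;

/* A worker would pop the next table to process roughly like this: */
static Oid
av_queue_next_table(AVSharedWorkQueue *queue)
{
	Oid			relid = InvalidOid;

	LWLockAcquire(&queue->lock, LW_EXCLUSIVE);
	if (queue->head < queue->ntables)
		relid = queue->oids[queue->head++];
	LWLockRelease(&queue->lock);

	return relid;
}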

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Reigning in ExecParallelHashRepartitionFirst

2020-07-27 Thread Thomas Munro
On Thu, Jul 9, 2020 at 8:17 AM Melanie Plageman
 wrote:
> Last week as I was working on adaptive hash join [1] and trying to get
> parallel adaptive hash join batch 0 to spill correctly, I noticed what
> seemed like a problem with the code to repartition batch 0.
>
> If we run out of space while inserting tuples into the hashtable during
> the build phase of parallel hash join and proceed to increase the number
> of batches, we need to repartition all of the tuples from the old
> generation (when nbatch was x) and move them to their new homes in the
> new generation (when nbatch is 2x). Before we do this repartitioning we
> disable growth in the number of batches.
>
> Then we repartition the tuples from the hashtable, inserting them either
> back into the hashtable or into a batch file. While inserting them into
> the hashtable, we call ExecParallelHashTupleAlloc(), and, if there is no
> space for the current tuple in the current chunk and growth in the
> number of batches is disabled, we go ahead and allocate a new chunk of
> memory -- regardless of whether or not we will exceed the space limit.

Hmm.  It shouldn't really be possible for
ExecParallelHashRepartitionFirst() to run out of memory anyway,
considering that the input of that operation previously fit (just... I
mean we started repartitioning because one more chunk would have
pushed us over the edge, but the tuples so far fit, and we'll insert
them in the same order for each input chunk, possibly filtering some
out).  Perhaps you reached this condition because
batches[0].shared->size finishes up accounting for the memory used by
the bucket array in PHJ_GROW_BUCKETS_ELECTING, but didn't originally
account for it in generation 0, so what previously appeared to fit no
longer does :-(.  I'll look into that.




Re: Global snapshots

2020-07-27 Thread Andrey V. Lepikhov

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:

Hi Andrey san, Movead san,


From: tsunakawa.ta...@fujitsu.com 

While Clock-SI seems to be considered the best promising for global
serializability here,

* Why does Clock-SI gets so much attention?  How did Clock-SI become the
only choice?

* Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
organization use Clock-SI?


Could you take a look at this patent?  I'm afraid this is the Clock-SI for MVCC.  Microsoft 
holds this until 2031.  I couldn't find this with the keyword "Clock-SI".


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa




Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.

--
regards,
Andrey Lepikhov
Postgres Professional




RE: Global snapshots

2020-07-27 Thread tsunakawa.ta...@fujitsu.com
Hi Andrey san, Movead san,


From: tsunakawa.ta...@fujitsu.com 
> While Clock-SI seems to be considered the best promising for global
> serializability here,
> 
> * Why does Clock-SI gets so much attention?  How did Clock-SI become the
> only choice?
> 
> * Clock-SI was devised in Microsoft Research.  Does Microsoft or some other
> organization use Clock-SI?

Could you take a look at this patent?  I'm afraid this is the Clock-SI for 
MVCC.  Microsoft holds this until 2031.  I couldn't find this with the keyword 
"Clock-SI.""


US8356007B2 - Distributed transaction management for database systems with 
multiversioning - Google Patents
https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?


Regards
Takayuki Tsunakawa