Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-03 Thread Amit Langote
On Tue, Feb 4, 2020 at 10:34 AM Amit Langote  wrote:
> Reading this:
>
> + backup_total
> + bigint
> + 
> +  Total amount of data that will be streamed. If progress reporting
> +  is not enabled in pg_basebackup
> +  (i.e., --progress option is not specified),
> +  this is 0.
>
> I wonder how hard it would be to change basebackup.c to always set
> backup_total, irrespective of whether --progress is specified with
> pg_basebackup or not?  It seems quite misleading to leave it set to 0,
> because one may panic that they have lost their data, that is, if they
> haven't first read the documentation.

For example, the attached patch changes basebackup.c to always set
tablespaceinfo.size, irrespective of whether --progress was passed
with the BASE_BACKUP command.  It passes make check-world, so it is
probably safe.  Maybe it would be a good idea to add a couple more
comments around the tablespaceinfo struct definition, such as how
'size' is to be interpreted.
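
For reference, once 'size' is always set, a monitoring session should show a
non-zero total regardless of --progress.  A sketch of the intended usage,
using the view and column names proposed in this thread:

SELECT pid, phase, backup_total
FROM pg_stat_progress_basebackup;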

Thanks,
Amit
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3813eadfb4..b4eb7063b8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10227,8 +10227,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 XLogRecPtr
 do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
   StringInfo labelfile, List **tablespaces,
-  StringInfo tblspcmapfile, bool infotbssize,
-  bool needtblspcmapfile)
+  StringInfo tblspcmapfile, bool needtblspcmapfile)
 {
boolexclusive = (labelfile == NULL);
boolbackup_started_in_recovery = false;
@@ -10512,7 +10511,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
ti->oid = pstrdup(de->d_name);
ti->path = pstrdup(buflinkpath.data);
ti->rpath = relpath ? pstrdup(relpath) : NULL;
-   ti->size = infotbssize ? sendTablespace(fullpath, true) : -1;
+   ti->size = sendTablespace(fullpath, true);
 
if (tablespaces)
*tablespaces = lappend(*tablespaces, ti);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 20316539b6..c0b46dceae 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -76,7 +76,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
if (exclusive)
{
startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL,
-                                       NULL, NULL, false, true);
+                                       NULL, NULL, true);
}
else
{
@@ -94,7 +94,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
register_persistent_abort_backup_handler();
 
 startpoint = do_pg_start_backup(backupidstr, fast, NULL, label_file,
-                                       NULL, tblspc_map_file, false, true);
+                                       NULL, tblspc_map_file, true);
}
 
PG_RETURN_LSN(startpoint);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index dea8aab45e..f92695ce91 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -243,8 +243,7 @@ perform_base_backup(basebackup_options *opt)
 
 startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
                               labelfile, &tablespaces,
-                              tblspc_map_file,
-                              opt->progress, opt->sendtblspcmapfile);
+                              tblspc_map_file, opt->sendtblspcmapfile);
 
/*
 * Once do_pg_start_backup has been called, ensure that any failure causes
@@ -274,7 +273,7 @@ perform_base_backup(basebackup_options *opt)
 
/* Add a node for the base directory at the end */
ti = palloc0(sizeof(tablespaceinfo));
-   ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
+   ti->size = sendDir(".", 1, true, tablespaces, true);
tablespaces = lappend(tablespaces, ti);
 
/* Send tablespace header */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..d4ca68900a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -345,7 +345,7 @@ typedef enum SessionBackupState
 
 extern XLogRecPtr do_pg_start_backup(const char 

RE: Complete data erasure

2020-02-03 Thread imai.yoshik...@fujitsu.com
>From tsunakawa.ta...@fujitsu.com 
> What is concerned about is that the need to erase and delete the data file 
> would be forgotten if the server crashes during step
> 3.  If so, postmaster can do the job at startup, just like it deletes 
> temporary files (although it delays the startup.)

I suspect that erasing and deleting the data files at startup would not be
acceptable.
Under normal conditions we can query table foo while table bar is being
erased.  However, if a crash happens and postmaster erases table bar at
startup, we cannot execute any queries until the erasure is finished
(and it could take a long time).  I'm afraid someone will complain about
that.
Can we instead erase the table after startup when a crash happens, and also
implement a function that returns whether the erasure of a specified table
has completed or not?  Users who want to know whether the erasure is
complete can poll that function (see the sketch below) and delay their
tasks until the erasure is done, while other users can execute queries
while the erasure is in progress.
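
For example, the flow could look roughly like this (just a sketch; the
function name is purely hypothetical):

DROP TABLE secret_tbl;                       -- returns once the catalog change commits
SELECT pg_table_erasure_done('secret_tbl');  -- poll until this returns true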


There is another point I want to discuss.  In the current specification,
every table is completely erased if erase_command is set.
Security-conscious systems might want to delete all data completely, but
isn't there also a case where a system handles data with no such security
requirement?  In that case, someone might want to choose which tables are
erased completely and which are only dropped, which could be achieved by
adding an option to "DROP TABLE" or by implementing an "ERASE TABLE"
command, as sketched below.

--
Yoshikazu Imai




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-03 Thread Dilip Kumar
On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila  wrote:
>
> On Tue, Jan 28, 2020 at 11:34 AM Dilip Kumar  wrote:
> >
> > On Tue, Jan 28, 2020 at 11:28 AM Amit Kapila  
> > wrote:
> > >
> > > On Wed, Jan 22, 2020 at 10:30 AM Dilip Kumar  
> > > wrote:
> > > >
> > > > On Tue, Jan 14, 2020 at 10:44 AM Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Hmm, I think this can turn out to be inefficient because we can easily
> > > > > end up spilling the data even when we don't need to so.  Consider
> > > > > cases, where part of the streamed changes are for toast, and remaining
> > > > > are the changes which we would have streamed and hence can be removed.
> > > > > In such cases, we could have easily consumed remaining changes for
> > > > > toast without spilling.  Also, I am not sure if spilling changes from
> > > > > the hash table is a good idea as they are no more in the same order as
> > > > > they were in ReorderBuffer which means the order in which we serialize
> > > > > the changes normally would change and that might have some impact, so
> > > > > we would need some more study if we want to pursue this idea.
> > > > I have fixed this bug and attached it as a separate patch.  I will
> > > > merge it to the main patch after we agree with the idea and after some
> > > > more testing.
> > > >
> > > > The idea is that whenever we get the toasted chunk instead of directly
> > > > inserting it into the toast hash I am inserting it into some local
> > > > list so that if we don't get the change for the main table then we can
> > > > insert these changes back to the txn->changes.  So once we get the
> > > > change for the main table at that time I am preparing the hash table
> > > > to merge the chunks.
> > > >
> > >
> > >
> > > I think this idea will work but appears to be quite costly because (a)
> > > you might need to serialize/deserialize the changes multiple times and
> > > might attempt streaming multiple times even though you can't do (b)
> > > you need to remove/add the same set of changes from the main list
> > > multiple times.
> > I agree with this.
> > >
> > > It seems to me that we need to add all of this new handling because
> > > while taking the decision whether to stream or not we don't know
> > > whether the txn has changes that can't be streamed.  One idea to make
> > > it work is that we identify it while decoding the WAL.  I think we
> > > need to set a bit in the insert/delete WAL record to identify if the
> > > tuple belongs to a toast relation.  This won't add any additional
> > > overhead in WAL and reduce a lot of complexity in the logical decoding
> > > and also decoding will be efficient.  If this is feasible, then we can
> > > do the same for speculative insertions.
> > The Idea looks good to me.  I will work on this.
> >
>
> One more thing we can do is to identify whether the tuple belongs to
> toast relation while decoding it.  However, I think to do that we need
> to have access to relcache at that time and that might add some
> overhead as we need to do that for each tuple.  Can we investigate
> what it will take to do that and if it is better than setting a bit
> during WAL logging.
>
I have done some more analysis on this and it appears that there are a
few problems with doing this.  Basically, once we get the confirmed
flush location, we advance the replication_slot_catalog_xmin so that
vacuum can garbage-collect the old tuples.  So the problem is that
while we are collecting the changes in the ReorderBuffer, our catalog
version might have been removed, and we might not find any relation
entry with that relfilenode id (because it was dropped or altered
later).

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: base backup client as auxiliary backend process

2020-02-03 Thread Michael Paquier
On Mon, Feb 03, 2020 at 01:37:25AM -0800, Andres Freund wrote:
> On 2020-01-23 15:49:37 -0500, Robert Haas wrote:
>> I assume that's probably discussed on the thread that is linked here,
>> but you shouldn't have to dig through the discussion thread to figure
>> out what the benefits of a change like this are.
> 
> which I fully agree with.
> 
> It's not at all clear to me that the potential downsides of this have
> been fully thought through. And even if they have, they've not been
> documented.

There is this, and please let me add a reference to another complaint
I had about this commit:
https://www.postgresql.org/message-id/20200122055510.gh174...@paquier.xyz
--
Michael




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-03 Thread Dilip Kumar
On Mon, Feb 3, 2020 at 9:51 AM Amit Kapila  wrote:
>
> On Fri, Jan 10, 2020 at 10:53 AM Dilip Kumar  wrote:
> >
> > On Mon, Dec 30, 2019 at 3:43 PM Amit Kapila  wrote:
> > >
> > >
> > > > 2. During commit time in DecodeCommit we check whether we need to skip
> > > > the changes of the transaction or not by calling
> > > > SnapBuildXactNeedsSkip but since now we support streaming so it's
> > > > possible that before we decode the commit WAL, we might have already
> > > > sent the changes to the output plugin even though we could have
> > > > skipped those changes.  So my question is instead of checking at the
> > > > commit time can't we check before adding to ReorderBuffer itself
> > > >
> > >
> > > I think if we can do that then the same will be true for current code
> > > irrespective of this patch.  I think it is possible that we can't take
> > > that decision while decoding because we haven't assembled a consistent
> > > snapshot yet.  I think we might be able to do that while we try to
> > > stream the changes.  I think we need to take care of all the
> > > conditions during streaming (when the logical_decoding_workmem limit
> > > is reached) as we do in DecodeCommit.  This needs a bit more study.
> >
> > I have analyzed this further and I think we can not decide all the
> > conditions even while streaming.  Because IMHO once we get the
> > SNAPBUILD_FULL_SNAPSHOT we can add the changes to the reorder buffer
> > so that if we get the commit of the transaction after we reach to the
> > SNAPBUILD_CONSISTENT.  However, if we get the commit before we reach
> > to SNAPBUILD_CONSISTENT then we need to ignore this transaction.  Now,
> > even if we have SNAPBUILD_FULL_SNAPSHOT we can stream the changes
> > which might get dropped later but that we can not decide while
> > streaming.
> >
>
> This makes sense to me, but we should add a comment for the same when
> we are streaming to say we can't skip similar to how we do during
> commit time because of the above reason described by you.  Also, what
> about other conditions where we can skip the transaction, basically
> cases like (a) when the transaction happened in another database,  (b)
> when the output plugin is not interested in the origin and (c) when we
> are doing fast-forwarding
I will analyze those and fix in my next version of the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: error context for vacuum to include block number

2020-02-03 Thread Masahiko Sawada
On Sun, 2 Feb 2020 at 15:03, Justin Pryzby  wrote:
>
> Thanks for reviewing again
>
> On Sun, Feb 02, 2020 at 10:45:12AM +0900, Masahiko Sawada wrote:
> > Thank you for updating the patch. Here is some review comments:
> >
> > 1.
> > +typedef struct
> > +{
> > +   char*relnamespace;
> > +   char*relname;
> > +   char*indname; /* If vacuuming index */
> >
> > I think "Non-null if vacuuming index" is better.
>
> Actually it's undefined garbage (not NULL) if not vacuuming index.

So how about something like "set index name only during vacuuming
index"?  My point is that the current comment seems unclear to me about
what it is describing.

>
> > And tablename is better than relname for accuracy?
>
> The existing code uses relname, so I left that, since it's strange to
> start using tablename and then write things like:
>
> |   errcbarg.tblname = relname;
> ...
> |   errcontext(_("while scanning block %u of relation \"%s.%s\""),
> |   cbarg->blkno, cbarg->relnamespace, cbarg->tblname);
>
> Also, mat views can be vacuumed.

ok, agreed.

>
> > 2.
> > +   BlockNumber blkno;
> > +   int stage;  /* 0: scan heap; 1: vacuum heap; 2: vacuum index */
> > +} vacuum_error_callback_arg;
> >
> > Why do we not support index cleanup phase?
>
> The patch started out just handling scan_heap.  Then added vacuum_heap.  Then
> added vacuum_index.  Now, I've added index cleanup.
>
> > 4.
> > +/*
> > + * Setup error traceback support for ereport()
> > + * ->relnamespace and ->relname are already set
> > + */
> > +errcbarg.blkno = InvalidBlockNumber; /* Not known yet */
> > +errcbarg.stage = 1;
> >
> > relnamespace and relname of errcbarg is not set as it is initialized
> > in this function.
>
> Thanks. That's an oversight from switching back to local vars instead of
> LVRelStats while updating the patch while out of town..
>
> I don't know how to consistently test the vacuum_heap case, but rechecked it 
> just now.
>
> postgres=# SET client_min_messages=debug; SET statement_timeout=0; UPDATE t SET a=1+a; SET statement_timeout=150; VACUUM(VERBOSE, PARALLEL 1) t;
> ...
> 2020-02-01 23:11:06.482 CST [26609] ERROR:  canceling statement due to statement timeout
> 2020-02-01 23:11:06.482 CST [26609] CONTEXT:  while vacuuming block 33 of relation "public.t"
>
> > 5.
> > @@ -177,6 +177,7 @@ typedef struct LVShared
> >  * the lazy vacuum.
> >  */
> > Oid relid;
> > +   char relname[NAMEDATALEN];   /* tablename, used for error callback */
> >
> > How about getting relation
> > name from index relation? That is, in lazy_vacuum_index we can get
> > table oid from the index relation by IndexGetRelation() and therefore
> > we can get the table name which is in palloc'd memory. That way we
> > don't need to add relname to any existing struct such as LVRelStats
> > and LVShared.
>
> See attached
>
> Also, I think we shouldn't show a block number if it's "Invalid", to avoid
> saying "while vacuuming block 4294967295 of relation ..."
>
> For now, I made it not show any errcontext at all in that case.

Thank you for updating the patch!

Here is the comment for v16 patch:

1.
+   ErrorContextCallback errcallback = { error_context_stack,
vacuum_error_callback, , };

I think it's better to initialize the individual fields, because we
might otherwise need to fix this as well when the fields of
ErrorContextCallback are changed.

2.
+   /* Replace error context while continuing heap scan */
+   error_context_stack = 

/*
 * Forget the now-vacuumed tuples, and press on, but be careful
 * not to reset latestRemovedXid since we want that value to be
 * valid.
 */
dead_tuples->num_tuples = 0;

/*
 * Vacuum the Free Space Map to make newly-freed space visible on
 * upper-level FSM pages.  Note we have not yet processed blkno.
 */
FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);
next_fsm_block_to_vacuum = blkno;

/* Report that we are once again scanning the heap */
pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
 PROGRESS_VACUUM_PHASE_SCAN_HEAP);
}

I think we can set the error context for the heap scan again after the
free space map vacuum finishes, maybe after reporting the new phase.
Otherwise the user will get confused if an error occurs during the free
space map vacuum.  And I think the comment is unclear; how about
"Set the error context for heap scan again"?

3.
+   if (cbarg->blkno!=InvalidBlockNumber)
+   errcontext(_("while scanning block %u of relation \"%s.%s\""),
+   cbarg->blkno, cbarg->relnamespace, cbarg->relname);

We can use BlockNumberIsValid macro instead.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, 

Re: table partition and column default

2020-02-03 Thread Amit Langote
On Thu, Dec 26, 2019 at 6:21 PM Julien Rouhaud  wrote:
> On Thu, Dec 26, 2019 at 7:12 AM Fujii Masao  wrote:
> > Thanks for reviewing the patch. Committed!
>
> I saw that you only pushed it on master, shouldn't we backpatch it
> down to pg10 as this is the declarative partitioning behavior since
> the beginning?

I had meant to reply to this but somehow forgot.

I agree that it might be a good idea to back-patch this down to PG 10.

Thanks,
Amit




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-03 Thread Kuntal Ghosh
On Sun, Jan 12, 2020 at 9:51 AM Tom Lane  wrote:
>
> Tomas Vondra  writes:
> > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote:
> >> remind me where the win came from, exactly?
>
> > Well, the problem is that in 10 we allocate tuple data in the main
> > memory ReorderBuffer context, and when the transaction gets decoded we
> > pfree() it. But in AllocSet that only moves the data to the freelists,
> > it does not release it entirely. So with the right allocation pattern
> > (sufficiently diverse chunk sizes) this can easily result in allocation
> > of large amount of memory that is never released.
>
> > I don't know if this is what's happening in this particular test, but I
> > wouldn't be surprised by it.
>
> Nah, don't think I believe that: the test inserts a bunch of tuples,
> but they look like they will all be *exactly* the same size.
>
> CREATE TABLE decoding_test(x integer, y text);
> ...
>
> FOR i IN 1..10 LOOP
> BEGIN
> INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
> EXCEPTION
> when division_by_zero then perform 'dummy';
> END;
>
I performed the same test in pg11 and reproduced the issue on the
commit prior to a4ccc1cef5a04 (Generational memory allocator).

ulimit -s 1024
ulimit -v 30

wal_level = logical
max_replication_slots = 4

And executed the following code snippet (shared by Amit Khandekar
earlier in the thread).

SELECT pg_create_logical_replication_slot('test_slot',
'test_decoding');

CREATE TABLE decoding_test(x integer, y text);
do $$
BEGIN
FOR i IN 1..10 LOOP
BEGIN
INSERT INTO decoding_test(x) SELECT
generate_series(1,3000);
EXCEPTION
when division_by_zero then perform 'dummy';
END;
END LOOP;
END $$;

SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) LIMIT 10;

I got the following error:
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

After that, I applied the "Generational memory allocator" patch and
that solved the issue. From the error message, it is evident that the
underlying code is trying to allocate MaxHeapTupleSize bytes for each
tuple. So, I re-introduced the following lines (which were removed by
a4ccc1cef5a04) on top of the patch:

--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size tuple_len)

alloc_len = tuple_len + SizeofHeapTupleHeader;

+   if (alloc_len < MaxHeapTupleSize)
+   alloc_len = MaxHeapTupleSize;

And, the issue got reproduced with the same error:
WARNING:  problem in Generation Tuples: number of free chunks 0 in
block 0x7fe9e9e74010 exceeds 1018 allocated
.
ERROR:  out of memory
DETAIL:  Failed on request of size 8208.

I don't understand the code well enough to comment whether we can
back-patch only this part of the code. But, this seems to allocate a
huge amount of memory per chunk although the tuple is small.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: Experimenting with hash join prefetch

2020-02-03 Thread Thomas Munro
On Tue, Feb 4, 2020 at 2:31 PM Andres Freund  wrote:
> How much of the benefit here comes from the prefetching, and how much
> just from writing the code in a manner that allows for more out-of-order
> execution? Because there are no dependencies preventing execution of the
> next queued tuple anymore, I'd assume that this is a good part of what
> helps?

A small part of the speed-up does indeed seem to come from that sort of thing.

> Code like this really should look something roughly like:
>
> while (true)
> have_skew = False
>
> # get tuples
> for i in 0..batchsize:
> tuples[i] = ExecProcNode(outerNode);
> if tuples[i] == NULL:
># have slowpath handle this
>break;
>
> # compute their hash values
> for i in 0..batchsize:
> hashvalues[i] = ExecHashGetHashValue(tuples[i])
>
> # check whether go into skew buckets
> if have_skew_table:
> for i in 0..batchsize:
> skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
> if (skewbuckets[i] != INVALID_SKEW_BUCKET_NO)
>have_skew = True
> if have_skew:
> # handle everything here
> continue
>
> # assume there's no skew tuples going forward, all handled above
>
> # compute bucket/batch for all tuples
> have_into_batchfile = False
> for i in 0..batchsize:
> hashbuckets[i] = ExecHashGetBucketAndBatch()
> if hashbuckets[i] != hashtable->curbatch:
>have_into_batchfile = True
>
> if have_into_batchfile:
> # slowpath
> continue
>
> # Allocate all tuples
> for i in 0..batchsize:
> hjtuples[i] = alloc_mintuple(hashtuples[i])
>
> # And finally insert them
> for i in 0..batchsize:
> hjtuples[i].next = buckets[hashbuckets[i]]
> buckets[hashbuckets[i]] = hjtuples[i]

Hmm.  I see your point: don't use the batch number for a branch
immediately, and so on.  I thought a bit about a multi-pass system like
that too, just for prefetching purposes, though I haven't tested it due
to lack of the required infrastructure.  I guess you need a way
to get the next N tuples and make them all simultaneously available
without copying them yet.

For this experiment, I speculated that it might be better to be
continually inserting a short distance behind so there are no
batch-boundary stalls anyway.  Admittedly, it's pretty hard to choose
the right queue depth if your loop includes ExecProcNode() because you
have no idea what that actually does, but on the other hand, you do
need to put enough cycles between prefetch and fetch to see benefits,
so maybe that's not so crazy.  Perhaps to get more speedup I'd need to
consider dependencies along the lines you're describing, but also find
a way to keep prefetch and insertion far enough apart to win.  Hmm.

> I would bet this helps significantly even if there's no prefetch
> instruction - but using explicit prefetching might help further. Also
> allows us to increase branch costs, because we can amortize them over a
> few tuples.

Yes, I already observe that performance improves a little bit even
with my simple insert-queue patch if you comment out the
pg_prefetch_mem() call, and figured it was something about execution
order at work there (though I didn't study the effect up close with
perf etc due to lack of PMC access on this host), but the prefetch
apparently supplies most of the speed-up I saw.  It stands to reason
that hash joins should benefit from explicit prefetching (even though
lots of pixels have been expended explaining that explicit prefetching
is often a mistake, cf Linux experience), since hash joins are
basically cache miss machines par excellence, at least in the build
phase with unique keys.

> > create unlogged table t as select generate_series(1, 1)::int i;
> > select pg_prewarm('t');
> > set work_mem='8GB';
> >
> > select count(*) from t t1 join t t2 using (i);
> >
> >            master    patched/N=1      patched/N=4
> > workers=0  89.808s   80.556s (+11%)   76.864 (+16%)
> > workers=2  27.010s   22.679s (+19%)   23.503 (+15%)
> > workers=4  16.728s   14.896s (+12%)   14.167 (+18%)
> >
> > Just an early experiment, but I thought it looked promising enough to share.
>
> Nice!

It's starting to look like prefetching of build + prefetching of probe
+ reordering-friendly code + icache-friendly tight loops could add up
to some serious gains, but some architectural stuff is needed for much
of that, hence my lower aim :-)

Other things I noticed on that hacking escapade: the patch generates
PREFETCHT0 instructions on my compiler, but there are also "write" and
"locality" flags you can pass to __builtin_prefetch() to get
PREFETCHW, and variants for predictions about how valuable the data is
after the next access; write=1 slowed down my initial tests for
reasons I don't fully understand, but I didn't look further once I
realised you need -march= anyway.  I didn't look
into the 

Re: Internal key management system

2020-02-03 Thread Masahiko Sawada
On Sun, 2 Feb 2020 at 17:05, Fabien COELHO  wrote:
>
>
> Hello Masahiko-san,
>
> > I've started a new separate thread from the previous long thread[1]
> > for internal key management system to PostgreSQL. As I mentioned in
> > the previous mail[2], we've decided to step back and focus on only
> > internal key management system for PG13. The internal key management
> > system introduces the functionality to PostgreSQL that allows user to
> > encrypt and decrypt data without knowing the actual key. Besides, it
> > will be able to be integrated with transparent data encryption in the
> > future.
> >
> > The basic idea is that PostgreSQL generates the master encryption key
> > which is further protected by the user-provided passphrase. The key
> > management system provides two functions to wrap and unwrap the secret
> > by the master encryption key. A user generates a secret key locally
>
> I understand that the secret key is sent in the clear for being encrypted
> by a master key.

Yeah, we need to be careful that the secret key is not written to any
logs, such as the server log when log_statement = 'all'.  I guess that
wrapping a key doesn't happen often while the service is running, but
rather once during the development phase, so it may not be a big
problem, but we probably need something to deal with it.

>
> > and send it to PostgreSQL to wrap it using by pg_kmgr_wrap() and save
> > it somewhere. Then the user can use the encrypted secret key to
> > encrypt data and decrypt data by something like:
> >
> > INSERT INTO tbl VALUES (pg_encrypt('user data', pg_kmgr_unwrap('x'));
> > SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('x')) FROM tbl;
> >
> > Where 'x' is the result of pg_kmgr_wrap function.
>
> I'm lost. If pg_{en,de}crypt and pg_kmgr_unwrap are functions, what
> prevents users from doing:
>
>SELECT pg_kmgr_unwrap('');
>
> so as to recover the key, somehow in contradiction to "allows user to
> encrypt and decrypt data without knowing the actual key".

I might be missing your point, but the above '' is the wrapped key,
i.e. the secret key wrapped by the master key stored in the PostgreSQL
server.  So the user doesn't need to know the raw secret key to
encrypt/decrypt the data.  Even if a malicious user gets '', they
cannot learn the actual secret key without the master key.
pg_kmgr_wrap and pg_kmgr_unwrap are functions, so it is possible for
the user to recover the raw secret key by using pg_kmgr_unwrap(), but
the master key stored in the PostgreSQL server is never revealed.
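
To make the flow concrete, the intended round trip is roughly as follows (a
sketch using the function names proposed in this thread; the literal argument
is only a placeholder):

-- wrap the locally generated secret once; the client stores only the result
SELECT pg_kmgr_wrap('locally-generated-secret');
-- later, the wrapped value is handed back so the server can unwrap it internally
SELECT pg_decrypt(secret_column, pg_kmgr_unwrap('<wrapped value>')) FROM tbl;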

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Implementing Incremental View Maintenance

2020-02-03 Thread Takuma Hoshiai
Hi, 

Attached is the latest patch (v12) to add support for Incremental Materialized
View Maintenance (IVM).
It applies to the current master branch.

Differences from the previous patch (v11) include:
* support executing REFRESH MATERIALIZED VIEW command with IVM.
* support unscannable state by WITH NO DATA option.
* add a check for LIMIT/OFFSET at creating an IMMV

If REFRESH is executed on an IMMV (incrementally maintainable materialized
view), its contents are recalculated just as for an ordinary materialized view
(a full REFRESH).  Although an IMMV basically keeps its data up to date,
rounding errors can accumulate in aggregated values in some cases, for example
if the view contains sum/avg over float columns; running REFRESH on the IMMV
resolves this.  Also, the WITH NO DATA option makes the IMMV unscannable.  In
that case, the IVM triggers are dropped from the IMMV because they become
unneeded.
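
For illustration, the behaviour described above boils down to something like
this (a sketch; "immv_example" stands for an incrementally maintainable
materialized view created with this patch set):

REFRESH MATERIALIZED VIEW immv_example;               -- full recomputation, clears accumulated rounding error
REFRESH MATERIALIZED VIEW immv_example WITH NO DATA;  -- makes the IMMV unscannable and drops its IVM triggers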

Also, we added a new deptype 'm' to pg_depend for checking whether a trigger
is for IVM.  Please tell me if adding a new deptype is unacceptable; it would
also be possible to perform the check by consulting pg_depend, pg_trigger and
pg_proc instead of adding a new deptype.
We also updated the IVM restrictions: the LIMIT/OFFSET clause is not supported
with IVM because it is not suitable for incremental changes to the materialized
view.  This issue was reported by nuko-san.
https://www.postgresql.org/message-id/CAF3Gu1ZK-s9GQh=70n8+21rbl8+fkw4tv3ce-xufxmsnfpo...@mail.gmail.com

Best Regards,
Takuma Hoshiai

On Mon, 27 Jan 2020 09:19:05 +0900
Takuma Hoshiai  wrote:

> On Mon, 20 Jan 2020 16:57:58 +0900
> Yugo NAGATA  wrote:
> 
> > On Fri, 17 Jan 2020 14:10:32 -0700 (MST)
> > legrand legrand  wrote:
> > 
> > > Hello,
> > > 
> > > It seems that patch v11 doesn't apply any more.
> > > Problem with "scanRTEForColumn" maybe because of change:
> > 
> > Thank you for your reporting! We will fix this in the next update. 
> 
> Although I have been working on fixing the conflicts and merging with the
> latest master, it is taking a little longer because the impact is larger than we thought.
> 
> Please wait a little more.
> 
> Regards
> Takuma Hoshiai
> 
> 
> > Regards,
> > Yugo Nagata
> > 
> > > 
> > > https://git.postgresql.org/pg/commitdiff/b541e9accb28c90656388a3f827ca3a68dd2a308
> > > 
> > > Regards
> > > PAscal
> > > 
> > > 
> > > 
> > > --
> > > Sent from: 
> > > https://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
> > > 
> > > 
> > 
> > 
> > -- 
> > Yugo NAGATA 
> > 
> > 
> > 
> 
> 
> -- 
> Takuma Hoshiai 
> 
> 
> 
> 


-- 
Takuma Hoshiai 


IVM_patches_v12.tar.gz
Description: Binary data


Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-03 Thread Amit Langote
On Mon, Feb 3, 2020 at 11:04 PM Fujii Masao  wrote:
> So we now have the following ideas about the phase names for pg_basebackup.
>
> 1.
> initializing
>
> 2.
> 2-1. starting backup
> 2-2. starting file transfer
> 2-3. performing pg_start_backup
> 2-4. performing checkpoint
> 2-5. waiting for [ backup start ] checkpoint to finish
>
> 3.
> 3-1. streaming backup
> 3-2. transferring database files
> 3-3. streaming database files
> 3-4. transferring files
>
> 4.
> 4-1. stopping backup
> 4-2. finishing file transfer
> 4-3. performing pg_stop_backup
> 4-4. finishing backup
> 4-5. waiting for WAL archiving to finish
> 4-6. performing WAL archive
>
> 5.
> 1. transferring wal
> 2. transferring WAL files
>
> What combination of these do you prefer?

I like:

1. initializing
2-5 waiting for backup start checkpoint to finish
3-3 streaming database files
4-5 waiting for wal archiving to finish
5-1 transferring wal (or streaming wal)

> > -  
> > +   > xreflabel="BASE_BACKUP">
> >
> > I don't see any new text in the documentation patch that uses above
> > xref, so no need to define it?
>
> The following description that I added uses this.
>
>  certain commands during command execution.  Currently, the only commands
>  which support progress reporting are ANALYZE,
>  CLUSTER,
> -   CREATE INDEX, and VACUUM.
> +   CREATE INDEX, VACUUM,
> +   and  (i.e., replication
> +   command that  issues to take
> +   a base backup).

Sorry, I missed that.  I was mistakenly expecting a different value of linkend.

Some comments on v3:

+ Process ID of a WAL sender process.

"a" sounds redundant.  Maybe:

of this WAL sender process or
of WAL sender process

Reading this:

+ backup_total
+ bigint
+ 
+  Total amount of data that will be streamed. If progress reporting
+  is not enabled in pg_basebackup
+  (i.e., --progress option is not specified),
+  this is 0.

I wonder how hard it would be to change basebackup.c to always set
backup_total, irrespective of whether --progress is specified with
pg_basebackup or not?  It seems quite misleading to leave it set to 0,
because one may panic that they have lost their data, that is, if they
haven't first read the documentation.

Thanks,
Amit




Re: Experimenting with hash join prefetch

2020-02-03 Thread Andres Freund
Hi,

On 2020-02-04 01:48:49 +1300, Thomas Munro wrote:
> On Fri, Apr 12, 2019 at 4:23 PM Thomas Munro  wrote:
> > ... if we also prefetch during
> > the build phase ...
>
> Here's an experimental patch to investigate just that part.  I tried
> initiating a prefetch of the bucket just before we copy the tuple and
> then finally insert it, but it doesn't seem to be far enough apart (at
> least for small tuples), which is a shame because that'd be a one line
> patch.  So I made a little insert queue that prefetches and defers the
> insertion until N tuples later, and then I managed to get between 10%
> and 20% speed-up for contrived tests like this:

How much of the benefit here comes from the prefetching, and how much
just from writing the code in a manner that allows for more out-of-order
execution? Because there's no dependencies preventing execution of the
next queued tuple anymore, I'd assume that this is a good part what
helps?

Code like this really should look something roughly like:

while (true)
have_skew = False

# get tuples
for i in 0..batchsize:
tuples[i] = ExecProcNode(outerNode);
if tuples[i] == NULL:
   # have slowpath handle this
   break;

# compute their hash values
for i in 0..batchsize:
hashvalues[i] = ExecHashGetHashValue(tuples[i])

# check whether go into skew buckets
if have_skew_table:
for i in 0..batchsize:
skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
if (skewbuckets[i] != INVALID_SKEW_BUCKET_NO)
   have_skew = True
if have_skew:
# handle everything here
continue

# assume there's no skew tuples going forward, all handled above

# compute bucket/batch for all tuples
have_into_batchfile = False
for i in 0..batchsize:
hashbuckets[i] = ExecHashGetBucketAndBatch()
if hashbuckets[i] != hashtable->curbatch:
   have_into_batchfile = True

if have_into_batchfile:
# slowpath
continue

# Allocate all tuples
for i in 0..batchsize:
hjtuples[i] = alloc_mintuple(hashtuples[i])

# And finally insert them
for i in 0..batchsize:
hjtuples[i].next = buckets[hashbuckets[i]]
buckets[hashbuckets[i]] = hjtuples[i]


Obviously it's a bit more complicated in reality than this, but I do
think that's where we have to go to make crucial parts like this faster
(same with hashaggs, and a few other places).


I would bet this helps significantly even if there's no prefetch
instruction - but using explicit prefetching might help further. Also
allows us to increase branch costs, because we can amortize them over a
few tuples.


> create unlogged table t as select generate_series(1, 1)::int i;
> select pg_prewarm('t');
> set work_mem='8GB';
>
> select count(*) from t t1 join t t2 using (i);
>
>            master    patched/N=1      patched/N=4
> workers=0  89.808s   80.556s (+11%)   76.864 (+16%)
> workers=2  27.010s   22.679s (+19%)   23.503 (+15%)
> workers=4  16.728s   14.896s (+12%)   14.167 (+18%)
>
> Just an early experiment, but I thought it looked promising enough to share.

Nice!


Greetings,

Andres Freund




Re: Add %x to PROMPT1 and PROMPT2

2020-02-03 Thread Ian Barwick

On 2020/02/03 23:40, Tom Lane wrote:

Daniel Gustafsson  writes:

On 3 Feb 2020, at 08:08, Michael Paquier  wrote:

FWIW, I am not really in favor of changing a default old enough that
it could vote (a45195a).



That by itself doesn't seem a good reason to not change things.



My concern would be that users who have never ever considered that the prompt
can be changed, all of sudden wonder why the prompt is showing characters it
normally isn't, thus causing confusion.  That being said, I agree that this is
a better default long-term.


I've got the same misgivings as Michael.  In a green field this'd likely
be a good idea, but after so many years I'm afraid it will make fewer
people happy than unhappy.

Now on the other hand, we did change the server's default log_line_prefix
not so long ago (7d3235ba4), and the feared storm of complaints was pretty
much undetectable.  So maybe this'd go down the same way.  Worth noting
also is that this shouldn't be able to break any applications, since the
prompt is an interactive-only behavior.


The last change I recall affecting default psql behaviour was the addition
of COMP_KEYWORD_CASE in 9.2 (db84ba65), which personally I (and no doubt others)
found annoying, but the world still turns.

+1 for this change, it's something I also add to every .psqlrc I set up.

Moreover, some of my work involves logging in at short notice to other people's
systems where I don't have control over the .psqlrc setup - while I can of
course set the prompt manually it's an extra step, and it would be really
nice to see the transaction status by default when working on critical
systems in a time-critical situation. (Obviously it'll take a few years
for this change to filter into production...).
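
For anyone who wants this behaviour right away, without waiting for a change
of default, a .psqlrc entry is enough; this simply takes the current default
prompts and adds %x:

\set PROMPT1 '%/%R%x%# '
\set PROMPT2 '%/%R%x%# '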


Regards


Ian Barwick

--
Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




RE: Complete data erasure

2020-02-03 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> That's not really what I meant - let me explain. When I said DROP TABLE
> should do everything as usual, that includes catalog changes. I.e. after
> the commit there would not be any remaining entries in system catalogs
> or anything like that.
> 
> The only thing we'd do differently is that instead of unlinking the
> relfilenode segments, we'd move the relfilenode to a persistent queue
> (essentially a regular table used as a queue relfilenodes). The
> background worker would watch the queue, and when it gets a new
> relfilenode it'd "delete" the data and then remove the relfilenode from
> the queue.
> 
> So essentially others would not be able to even see the (now dropped)
> object, they could create new object with the same name etc.

That sounds good.  I think we can also follow the way the WAL archiver does its 
job, instead of using a regular table.  That is, when the transaction that 
performed DROP TABLE commits, it puts the data files in the "trash bin," which 
is actually a filesystem directory.  Or, it just renames the data files in the 
original directory by appending some suffix such as ".del".  Then, the 
background worker scans the trash bin or the data directory to erase the file 
content and delete the file.

The trash bin mechanism may also open the door to restoring mistakenly
dropped tables, a feature like Oracle's Flashback Drop.  The dropping
transaction puts the table metadata (system catalog data or DDL) in the
trash bin as well as the data files.


> I imagine we might provide a way to wait for the deletion to actually
> complete (can't do that as part of the DROP TABLE, though), so that
> people can be sure when the data is actually gone (for scripts etc.).
> A simple function waiting for the queue to get empty might be enough, I
> guess, but maybe not.

Agreed, because the user should expect the disk space to be available after 
DROP TABLE has been committed.  Can't we really make COMMIT wait for the 
erasure to complete?  Do we have to use an asynchronous erasure method with a 
background worker?  For example, COMMIT performs:

1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file.  This could take a long time.
4. COMMIT replies success to the client.

What concerns me is that the need to erase and delete the data file would
be forgotten if the server crashes during step 3.  If so, postmaster could
do the job at startup, just like it deletes temporary files (although that
delays the startup).


> I think this depends on what our requirements are.
> 
> My assumption is that when you perform this "secure data erasure" on the
> primary, you probably also want to erase the data on the replica. But if
> the instances use different file systems (COW vs. non-COW, ...) the
> exact thing that needs to happen may be different. Or maybe the replica
> does not need to do anything, making it noop?

We can recommend in the manual the use of non-COW file systems on both the
primary and the standby.


Regards
Takayuki Tsunakawa





Re: Is custom MemoryContext prohibited?

2020-02-03 Thread Andres Freund
Hi,

On 2020-01-28 11:10:47 -0500, Robert Haas wrote:
> I generally like this idea, but I'd like to propose that we instead
> replace the NodeTag with a 4-byte magic number. I was studying how
> feasible it would be to make memory contexts available in frontend
> code, and it doesn't look all that bad, but one of the downsides is
> that nodes/memnodes.h includes nodes/nodes.h, and I think it would not
> be a good idea to make frontend code depend on nodes/nodes.h, which
> seems like it's really a backend-only piece of infrastructure. Using a
> magic number would allow us to avoid that, and it doesn't really cost
> anything, because the memory context nodes really don't participate in
> any of that infrastructure anyway.

Hm. I kinda like the idea of still having one NodeTag identifying memory
contexts, and then some additional field identifying the actual
type. Being able to continue to rely on IsA() etc imo is nice.  I think
nodes.h itself would only be a problem for frontend code because we put
a lot of other stuff in it too. We should just separate the actually generic
stuff out. I think it's going to be like 2 seconds once we have memory
contexts until we're e.g. going to want to also have pg_list.h - which
is harder without knowing the tags.

It seems like a good idea to still have an additional identifier for
each node type, for some cross checking. How about just frobbing the
pointer to the MemoryContextMethod slightly, and storing that in an
additional field? That'd be something fairly unlikely to ever be a false
positive, and it doesn't require dereferencing any additional memory.

Greetings,

Andres Freund




Re: Missing break in RelationFindReplTupleSeq

2020-02-03 Thread Alvaro Herrera
On 2020-Jan-31, Alvaro Herrera wrote:

> On 2020-Jan-31, Konstantin Knizhnik wrote:
> 
> > Eventually we find out that logical replication in the current version of
> > Postgres works significantly slower on table with replica identity full than
> > old pglogical implementation.
> > 
> > The comment to RelationFindReplTupleSeq says:
> > 
> >     Note that this stops on the first matching tuple.
> > 
> > But actually this function continue traversal until end of the table even if
> > tuple was found.
> > I wonder if break; should be added to the end of for loop.
> 
> Wow, you're right, and the "break" is missing there.  I propose it like
> this.

Pushed, thanks for reporting.

I had one very strange thing happen while testing this -- I put the
tests to run in all branches in parallel, and they took about 12 minutes
to finish instead of the normal 5.  I tried to repeat this result but
was unable to do so.  My only hypothesis is that my laptop entered some
kind of low-performance mode.

-- 
Álvaro Herrera            https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Global temporary tables

2020-02-03 Thread Konstantin Knizhnik



On 01.02.2020 14:49, Tomas Vondra wrote:

Hi,

this patch was marked as waiting on author since the beginning of the
CF, most likely because it no longer applies (not sure). As there has
been very little activity since then, I've marked it as returned with
feedback. Feel free to re-submit an updated patch for 2020-03.

This definitely does not mean the feature is not desirable, but my
feeling is most of the discussion happens on the other thread dealing
with global temp tables [1] so maybe we should keep just that one and
combine the efforts.

[1] https://commitfest.postgresql.org/26/2349/



A new version of the patch, with a new method of GTT index construction, is
attached.
GTT indexes are now checked before query execution and are initialized
using the AM build method.

So GTT is now supported for all indexes, including custom indexes.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_prewarm/pg_prewarm.c b/contrib/pg_prewarm/pg_prewarm.c
index 33e2d28b27..93059ef581 100644
--- a/contrib/pg_prewarm/pg_prewarm.c
+++ b/contrib/pg_prewarm/pg_prewarm.c
@@ -178,7 +178,7 @@ pg_prewarm(PG_FUNCTION_ARGS)
 		for (block = first_block; block <= last_block; ++block)
 		{
 			CHECK_FOR_INTERRUPTS();
-			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data);
+			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer.data, false);
 			++blocks_done;
 		}
 	}
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 79430d2b7b..39baddc743 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -158,6 +158,19 @@ static relopt_bool boolRelOpts[] =
 		},
 		true
 	},
+	/*
+	 * For global temp table only
+	 * use AccessExclusiveLock for ensure safety
+	 */
+	{
+		{
+			"on_commit_delete_rows",
+			"global temp table on commit options",
+			RELOPT_KIND_HEAP | RELOPT_KIND_PARTITIONED,
+			ShareUpdateExclusiveLock
+		},
+		false
+	},
 	/* list terminator */
 	{{NULL}}
 };
@@ -1486,6 +1499,8 @@ bytea *
 default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
 {
 	static const relopt_parse_elt tab[] = {
+		{"on_commit_delete_rows", RELOPT_TYPE_BOOL,
+		offsetof(StdRdOptions, on_commit_delete_rows)},
 		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
 		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
 		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
@@ -1586,13 +1601,17 @@ build_reloptions(Datum reloptions, bool validate,
 bytea *
 partitioned_table_reloptions(Datum reloptions, bool validate)
 {
+	static const relopt_parse_elt tab[] = {
+		{"on_commit_delete_rows", RELOPT_TYPE_BOOL,
+		offsetof(StdRdOptions, on_commit_delete_rows)}
+	};
 	/*
 	 * There are no options for partitioned tables yet, but this is able to do
 	 * some validation.
 	 */
 	return (bytea *) build_reloptions(reloptions, validate,
 	  RELOPT_KIND_PARTITIONED,
-	  0, NULL, 0);
+	  sizeof(StdRdOptions), tab, lengthof(tab));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3fa4b766db..a86de5046f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -670,6 +670,7 @@ heapam_relation_copy_data(Relation rel, const RelFileNode *newrnode)
 			 * init fork of an unlogged relation.
 			 */
 			if (rel->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT ||
+rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ||
 (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
  forkNum == INIT_FORKNUM))
 log_smgrcreate(newrnode, forkNum);
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 7d6acaed92..7c48e5c2ae 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -396,6 +396,9 @@ GetNewRelFileNode(Oid reltablespace, Relation pg_class, char relpersistence)
 		case RELPERSISTENCE_TEMP:
 			backend = BackendIdForTempRelations();
 			break;
+		case RELPERSISTENCE_SESSION:
+			backend = BackendIdForSessionRelations();
+			break;
 		case RELPERSISTENCE_UNLOGGED:
 		case RELPERSISTENCE_PERMANENT:
 			backend = InvalidBackendId;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8880586c37..22ce8953fd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3707,7 +3707,7 @@ reindex_relation(Oid relid, int flags, int options)
 		if (flags & REINDEX_REL_FORCE_INDEXES_UNLOGGED)
 			persistence = RELPERSISTENCE_UNLOGGED;
 		else if (flags & REINDEX_REL_FORCE_INDEXES_PERMANENT)
-			persistence = RELPERSISTENCE_PERMANENT;
+			persistence = rel->rd_rel->relpersistence == RELPERSISTENCE_SESSION ? RELPERSISTENCE_SESSION : RELPERSISTENCE_PERMANENT;
 		else
 			persistence = rel->rd_rel->relpersistence;
 
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index fddfbf1d8c..97478352ea 100644
--- 

Re: Parallel grouping sets

2020-02-03 Thread Jesse Zhang
On Mon, Feb 3, 2020 at 12:07 AM Richard Guo  wrote:
>
> Hi Jesse,
>
> Thanks for reviewing these two patches.
I enjoyed it!

>
> On Sat, Jan 25, 2020 at 6:52 AM Jesse Zhang  wrote:
>>
>>
>> I glanced over both patches. Just the opposite, I have a hunch that v3
>> is always better than v5. Here's my 6-minute understanding of both.
>>
>> v3 ("the one with grouping set id") really turns the plan from a tree to
>> a multiplexed pipe: we can execute grouping aggregate on the workers,
>> but only partially. When we emit the trans values, also tag the tuple
>> with a group id. After gather, finalize the aggregates with a modified
>> grouping aggregate. Unlike a non-split grouping aggregate, the finalize
>> grouping aggregate does not "flow" the results from one rollup to the
>> next one. Instead, each group only advances on partial inputs tagged for
>> the group.
>>
>
> Yes, this is what v3 patch does.
>
> We (Pengzhou and I) had an offline discussion on this plan and we have
> some other idea.  Since we have tagged 'GroupingSetId' for each tuple
> produced by partial aggregate, why not then perform a normal grouping
> sets aggregation in the final phase, with the 'GroupingSetId' included
> in the group keys?  The plan looks like:
>
> # explain (costs off, verbose)
> select c1, c2, c3, avg(c3) from gstest group by grouping 
> sets((c1,c2),(c1),(c2,c3));
> QUERY PLAN
> --
>  Finalize GroupAggregate
>Output: c1, c2, c3, avg(c3)
>Group Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
>->  Sort
>  Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
>  Sort Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
>  ->  Gather
>Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
>Workers Planned: 4
>->  Partial HashAggregate
>  Output: c1, c2, c3, gset_id, PARTIAL avg(c3)
>  Hash Key: gstest.c1, gstest.c2
>  Hash Key: gstest.c1
>  Hash Key: gstest.c2, gstest.c3
>  ->  Parallel Seq Scan on public.gstest
>Output: c1, c2, c3
>
> This plan should be able to give the correct results. We are still
> thinking if it is a better plan than the 'multiplexed pipe' plan as in
> v3. Inputs of thoughts here would be appreciated.

Ha, I believe you meant to say a "normal aggregate", because what's
performed above gather is no longer "grouping sets", right?

The group key idea is clever in that it helps "discriminate" tuples by
their grouping set id. I haven't completely thought this through, but my
hunch is that this leaves some money on the table, for example, won't it
also lead to more expensive (and unnecessary) sorting and hashing? The
groupings with a few partials are now sharing the same tuplesort with
the groupings with a lot of groups even though we only want to tell
grouping 1 *apart from* grouping 10, not neccessarily that grouping 1
needs to come before grouping 10. That's why I like the multiplexed pipe
/ "dispatched by grouping set id" idea: we only pay for sorting (or
hashing) within each grouping. That said, I'm open to the criticism that
keeping multiple tuplesort and agg hash tables running is expensive in
itself, memory-wise ...

Cheers,
Jesse




Re: [Proposal] Global temporary tables

2020-02-03 Thread Pavel Stehule
po 3. 2. 2020 v 14:03 odesílatel 曾文旌(义从) 
napsal:

>
>
> 2020年2月2日 上午2:00,Pavel Stehule  写道:
>
>
>
> so 1. 2. 2020 v 14:39 odesílatel 曾文旌(义从) 
> napsal:
>
>>
>>
>> 2020年1月30日 下午10:21,Pavel Stehule  写道:
>>
>>
>>
>> čt 30. 1. 2020 v 15:17 odesílatel 曾文旌(义从) 
>> napsal:
>>
>>>
>>>
>>> > 2020年1月29日 下午9:48,Robert Haas  写道:
>>> >
>>> > On Tue, Jan 28, 2020 at 12:12 PM 曾文旌(义从) 
>>> wrote:
>>> >>> Opinion by Pavel
>>> >>> + rel->rd_islocaltemp = true;  <<< if this is valid, then the
>>> name of field "rd_islocaltemp" is not probably best
>>> >>> I renamed rd_islocaltemp
>>> >>
>>> >> I don't see any change?
>>> >>
>>> >> Rename rd_islocaltemp to rd_istemp  in
>>> global_temporary_table_v8-pg13.patch
>>> >
>>> > In view of commit 6919b7e3294702adc39effd16634b2715d04f012, I think
>>> > that this has approximately a 0% chance of being acceptable. If you're
>>> > setting a field in a way that is inconsistent with the current use of
>>> > the field, you're probably doing it wrong, because the field has an
>>> > existing purpose to which new code must conform. And if you're not
>>> > doing that, then you don't need to rename it.
>>> Thank you for pointing it out.
>>> I've rolled back the rename.
>>> But I still need rd_localtemp to be true, The reason is that
>>> 1 GTT The GTT needs to support DML in read-only transactions ,like local
>>> temp table.
>>> 2 GTT does not need to hold the lock before modifying the index buffer
>>> ,also like local temp table.
>>>
>>> Please give me feedback.
>>>
>>
>> maybe some like
>>
>> rel->rd_globaltemp = true;
>>
>> and somewhere else
>>
>> if (rel->rd_localtemp || rel->rd_globaltemp)
>> {
>>   ...
>> }
>>
>> I tried to optimize code in global_temporary_table_v10-pg13.patch
>>
>>
>> Please give me feedback.
>>
>
> I tested this patch and I have not any objections - from my user
> perspective it is work as I expect
>
> +#define RELATION_IS_TEMP(relation) \
> + ((relation)->rd_islocaltemp || \
> + (relation)->rd_rel->relpersistence == RELPERSISTENCE_GLOBAL_TEMP)
>
> It looks little bit unbalanced
>
> maybe is better to inject rd_isglobaltemp to relation structure
>
> and then
>
> it should to like
>
> +#define RELATION_IS_TEMP(relation) \
> + ((relation)->rd_islocaltemp || \
> + (relation)->rd_isglobaltemp))
>
> But I have not idea if it helps in complex
>
> In my opinion
> For local temp table we need (relation)->rd_rel->relpersistence ==
> RELPERSISTENCE_TEMP
> and because one local temp table belongs to only one session, need to mark
> one sessions rd_islocaltemp = true ,and other to rd_islocaltemp = false.
>

So it indicates whether the table is assigned to the current session or not.
At this moment I think the name "islocaltemp" is not the best, because there
can be a "local temp table" that has this value false.

The name should better describe whether this table is attached only to the
current session and not to others, or is accessed by all sessions.

In this case I can understand why it is possible to write
rd_islocaltemp = true for GTT. But it is a signal that rd_islocaltemp is not
a good name, and rd_istemp is not a good name either.



> But For GTT, just need (relation)->rd_rel->relpersistence ==
> RELPERSISTENCE_GLOBAL_GLOBAL_TEMP
> One GTT can be used for every session, so no need rd_isglobaltemp
> anymore. This seems duplicated and redundant.
>

I didn't understand the semantics of rd_islocaltemp well, so my ideas on this
topic were not good. Now I think rd_islocaltemp is not a good name and can be
renamed if somebody finds a better one. "istemptable" is not good either,
because what matters is whether the relation is attached to the session or not.



>
>
>
>
>
>
>
>> Wenjing
>>
>>
>>
>>
>>
>>>
>>> Wenjing
>>>
>>>
>>>
>>>
>>> >
>>> > --
>>> > Robert Haas
>>> > EnterpriseDB: http://www.enterprisedb.com
>>> > The Enterprise PostgreSQL Company
>>>
>>>
>>
>


Re: Index only scan and ctid

2020-02-03 Thread Tom Lane
Laurenz Albe  writes:
> I noticed that "ctid" in the select list prevents an index only scan:
> This strikes me as strange, since every index contains "ctid".

There's no provision for an IOS to return a system column, though.
Not sure what it'd take to make that possible.

regards, tom lane




Index only scan and ctid

2020-02-03 Thread Laurenz Albe
I noticed that "ctid" in the select list prevents an index only scan:

CREATE TABLE ios (id bigint NOT NULL, val text NOT NULL);

INSERT INTO ios SELECT i, i::text FROM generate_series(1, 10) AS i;

CREATE INDEX ON ios (id);

VACUUM (ANALYZE) ios;

EXPLAIN (VERBOSE, COSTS off) SELECT ctid, id FROM ios WHERE id < 100;
 QUERY PLAN 

 Index Scan using ios_id_idx on laurenz.ios
   Output: ctid, id
   Index Cond: (ios.id < 100)
(3 rows)

This strikes me as strange, since every index contains "ctid".

This is not an artificial example either, because "ctid" is automatically
added to all data modifying queries to be able to identify the tuple
for EvalPlanQual:

EXPLAIN (VERBOSE, COSTS off) UPDATE ios SET val = '' WHERE id < 100;
QUERY PLAN
--
 Update on laurenz.ios
   ->  Index Scan using ios_id_idx on laurenz.ios
 Output: id, ''::text, ctid
 Index Cond: (ios.id < 100)
(4 rows)

Is this low hanging fruit?  If yes, I might take a stab at it.

Yours,
Laurenz Albe





Re: Dumping/restoring fails on inherited generated column

2020-02-03 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-12-04 21:36, Tom Lane wrote:
>> I wonder if the right fix is to not generate a DO_ATTRDEF
>> object at all for generated columns in child tables.  Am
>> I right in guessing that we propagate generated-ness to
>> child tables automatically?

> Right.  New patch using that approach attached.  (Could use more 
> extensive comments.)

This looks more plausible than the previous attempt, but it's clearly
still not right, because this is what it changes in the regression
test dump:

--- r.dump.head 2020-02-03 14:16:15.774305437 -0500
+++ r.dump.patch	2020-02-03 14:18:08.599109703 -0500
@@ -15244,14 +15244,7 @@
 -- Name: gtest1_1 b; Type: DEFAULT; Schema: public; Owner: postgres
 --
 
-ALTER TABLE ONLY public.gtest1_1 ALTER COLUMN b SET DEFAULT (a * 2);
-
-
---
--- Name: gtest30_1 b; Type: DEFAULT; Schema: public; Owner: postgres
---
-
-ALTER TABLE ONLY public.gtest30_1 ALTER COLUMN b SET DEFAULT (a * 2);
+ALTER TABLE ONLY public.gtest1_1 ALTER COLUMN b SET DEFAULT NULL;
 
 
 --

This is showing us at least two distinct problems.  Now as for
"gtest30_1", what we have is that in the parent table "gtest30", column b
exists but it has no default; the generated property is only added
at the child table gtest30_1.  So we need to emit ALTER COLUMN SET
GENERATED ALWAYS for gtest30_1.b.  HEAD is already doing the wrong
thing there (it's emitting the expression, but as a plain default
not GENERATED).  And this patch makes it emit nothing, even worse.
I think the key point here is that "attislocal" refers to whether the
column itself is locally defined, not to whether its default is.

Things are evidently also going wrong for "gtest1_1".  In that case
the generated property is inherited from the parent gtest1, so we
shouldn't be emitting anything ... how come the patch fails to
make it do that?

regards, tom lane




Re: Dumping/restoring fails on inherited generated column

2020-02-03 Thread Peter Eisentraut

On 2019-12-04 21:36, Tom Lane wrote:

I wonder if the right fix is to not generate a DO_ATTRDEF
object at all for generated columns in child tables.  Am
I right in guessing that we propagate generated-ness to
child tables automatically?


Right.  New patch using that approach attached.  (Could use more 
extensive comments.)


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From eeeab95adcaa298851abd3fd44d8646510e7aed2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 3 Feb 2020 19:56:34 +0100
Subject: [PATCH v2] Fix dumping of inherited generated columns

---
 src/bin/pg_dump/pg_dump.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 33e58fa287..c9938323b1 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -8492,6 +8492,13 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
 			if (tbinfo->attisdropped[adnum - 1])
 				continue;
 
+			/*
+			 * ignore generated columns on child tables unless they have a
+			 * local definition
+			 */
+			if (tbinfo->attgenerated[adnum - 1] && !tbinfo->attislocal[adnum - 1])
+				continue;
+
 			attrdefs[j].dobj.objType = DO_ATTRDEF;
 			attrdefs[j].dobj.catId.tableoid = atooid(PQgetvalue(res, j, 0));
 			attrdefs[j].dobj.catId.oid = atooid(PQgetvalue(res, j, 1));
-- 
2.25.0



Re: Memory-Bounded Hash Aggregation

2020-02-03 Thread Jeff Davis
On Wed, 2020-01-29 at 14:48 -0800, Jeff Davis wrote:
> 2) Use a different structure more capable of handling a large
> fraction
> of free space. A compressed bitmap might make sense, but that seems
> like overkill to waste effort tracking a lot of space that is
> unlikely
> to ever be used.

I ended up converting the freelist to a min heap.

Attached is a patch which makes three changes to better support
HashAgg:

1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:

  * sorted.  This is an efficient way to handle it because we expect
cycles
  * of releasing many blocks followed by re-using many blocks, due to
  * the larger read buffer.  

But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.

Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).
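
As a rough illustration of the min-heap idea - not the patch's actual code,
and with purely hypothetical names - pushing a freed block and popping the
lowest free block could look like this standalone sketch:

#include <stdlib.h>

/*
 * Illustrative min-heap of free block numbers; blocks[0] is always the
 * lowest free block, matching the "write to the lowest free block" policy.
 */
typedef struct BlockHeap
{
    long       *blocks;
    int         nblocks;
    int         capacity;
} BlockHeap;

static void
heap_push(BlockHeap *h, long blocknum)
{
    int         i;

    if (h->nblocks == h->capacity)
    {
        h->capacity = (h->capacity > 0) ? h->capacity * 2 : 16;
        h->blocks = realloc(h->blocks, h->capacity * sizeof(long));
    }

    /* sift up: pull larger parents down, then drop the new block in */
    i = h->nblocks++;
    while (i > 0)
    {
        int         parent = (i - 1) / 2;

        if (h->blocks[parent] <= blocknum)
            break;
        h->blocks[i] = h->blocks[parent];
        i = parent;
    }
    h->blocks[i] = blocknum;
}

static long
heap_pop_min(BlockHeap *h)
{
    long        min = h->blocks[0];     /* caller must check nblocks > 0 */
    long        last = h->blocks[--h->nblocks];
    int         i = 0;

    /* sift down: move the former last element toward the leaves */
    for (;;)
    {
        int         left = 2 * i + 1;
        int         right = left + 1;
        int         smallest = i;
        long        smallval = last;

        if (left < h->nblocks && h->blocks[left] < smallval)
        {
            smallest = left;
            smallval = h->blocks[left];
        }
        if (right < h->nblocks && h->blocks[right] < smallval)
            smallest = right;
        if (smallest == i)
            break;
        h->blocks[i] = h->blocks[smallest];
        i = smallest;
    }
    h->blocks[i] = last;
    return min;
}

Both operations are O(log n), so there is no re-sort step to amortize; error
handling (a failed realloc, popping an empty heap) is omitted for brevity.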

2. Lazily-allocate the read buffer. The write buffer was always lazily-
allocated, so this patch creates better symmetry. More importantly, it
means freshly-rewound tapes don't have any buffer allocated, so it
greatly expands the number of tapes that can be managed efficiently as
long as only a limited number are active at once.

3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.

Regards,
Jeff Davis

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index 42cfb1f9f98..20b27b3558b 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -49,12 +49,8 @@
  * when reading, and read multiple blocks from the same tape in one go,
  * whenever the buffer becomes empty.
  *
- * To support the above policy of writing to the lowest free block,
- * ltsGetFreeBlock sorts the list of free block numbers into decreasing
- * order each time it is asked for a block and the list isn't currently
- * sorted.  This is an efficient way to handle it because we expect cycles
- * of releasing many blocks followed by re-using many blocks, due to
- * the larger read buffer.
+ * To support the above policy of writing to the lowest free block, the
+ * freelist is a min heap.
  *
  * Since all the bookkeeping and buffer memory is allocated with palloc(),
  * and the underlying file(s) are made with OpenTemporaryFile, all resources
@@ -170,7 +166,7 @@ struct LogicalTapeSet
 	/*
 	 * File size tracking.  nBlocksWritten is the size of the underlying file,
 	 * in BLCKSZ blocks.  nBlocksAllocated is the number of blocks allocated
-	 * by ltsGetFreeBlock(), and it is always greater than or equal to
+	 * by ltsReleaseBlock(), and it is always greater than or equal to
 	 * nBlocksWritten.  Blocks between nBlocksAllocated and nBlocksWritten are
 	 * blocks that have been allocated for a tape, but have not been written
 	 * to the underlying file yet.  nHoleBlocks tracks the total number of
@@ -188,15 +184,9 @@ struct LogicalTapeSet
 	 * If forgetFreeSpace is true then any freed blocks are simply forgotten
 	 * rather than being remembered in freeBlocks[].  See notes for
 	 * LogicalTapeSetForgetFreeSpace().
-	 *
-	 * If blocksSorted is true then the block numbers in freeBlocks are in
-	 * *decreasing* order, so that removing the last entry gives us the lowest
-	 * free block.  We re-sort the blocks whenever a block is demanded; this
-	 * should be reasonably efficient given the expected usage pattern.
 	 */
 	bool		forgetFreeSpace;	/* are we remembering free blocks? */
-	bool		blocksSorted;	/* is freeBlocks[] currently in order? */
-	long	   *freeBlocks;		/* resizable array */
+	long	   *freeBlocks;		/* resizable array holding minheap */
 	int			nFreeBlocks;	/* # of currently free blocks */
 	int			freeBlocksLen;	/* current allocated length of freeBlocks[] */
 
@@ -211,6 +201,7 @@ static long ltsGetFreeBlock(LogicalTapeSet *lts);
 static void ltsReleaseBlock(LogicalTapeSet *lts, long blocknum);
 static void ltsConcatWorkerTapes(LogicalTapeSet *lts, TapeShare *shared,
  SharedFileSet *fileset);
+static void ltsInitTape(LogicalTape *lt);
 
 
 /*
@@ -321,46 +312,88 @@ ltsReadFillBuffer(LogicalTapeSet *lts, LogicalTape *lt)
 	return (lt->nbytes > 0);
 }
 
-/*
- * qsort comparator for sorting freeBlocks[] into decreasing order.
- */
-static int
-freeBlocks_cmp(const void *a, const void *b)
+static inline void
+swap_nodes(long *heap, int a, int b)
 {
-	long		ablk = *((const long *) a);
-	long		bblk = *((const long *) b);
-
-	/* can't just subtract 

Re: Portal->commandTag as an enum

2020-02-03 Thread Mark Dilger



> On Feb 2, 2020, at 6:14 PM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> I put the CommandTag enum in src/common because there wasn’t any reason
>> not to do so.  It seems plausible that frontend test frameworks might want
>> access to this enum.
> 

Thanks for looking!

> Au contraire, that's an absolutely fundamental mistake.  There is
> zero chance of this enum holding still across PG versions, so if
> we expose it to frontend code, we're going to have big problems
> with cross-version compatibility.  See our historical problems with
> assuming the enum for character set encodings was the same between
> frontend and backend ... and that set is orders of magnitude more
> stable than this one.

I completely agree that this enum cannot be expected to remain stable across 
versions.

For the purposes of this patch, which has nothing to do with frontend tools, 
this issue doesn’t matter to me.  I’m happy to move this into src/backend.

Is there no place to put code which would be useful for frontend tools without 
implying stability?  Sure, psql and friends can’t use it, because they need to 
be able to connect to servers of other versions.  But why couldn’t a test 
framework tool use something like this?  Could we have someplace like 
src/common/volatile for this sort of thing?

> 
> As I recall, the hardest problem with de-string-ifying this is the fact
> that for certain tags we include a rowcount in the string.  I'd like to
> see that undone --- we have to keep it like that on-the-wire to avoid a
> protocol break, but it'd be best if noplace inside the backend did it that
> way, and we converted at the last moment before sending a CommandComplete
> to the client.  Your references to "completion tags" make it sound like
> you've only half done the conversion (though I've not read the patch
> in enough detail to verify).

In v1, I stayed closer to the existing code structure than you are requesting.  
I like the direction you’re suggesting that I go, and I’ve begun that 
transition in anticipation of posting a v2 patch set soon.

> BTW, the size of the patch is rather depressing, especially if it's
> only half done.  Unlike Andres, I'm not even a little bit convinced
> that this is worth the amount of code churn it'll cause.  Have you
> done any code-size or speed comparisons?

A fair amount of the code churn is replacing strings with their enum 
equivalent, creating the enum itself, and creating a data table mapping enums 
to strings.  The churn doesn’t look too bad to me when viewing the original vs 
new code diff side-by-side.

The second file (v1-0002…) is entirely an extension of the regression tests.  
Applying v1-0001… doesn’t entail needing to apply v1-0002… as the code being 
tested exists before and after the patch.  If you don’t want to apply that 
regression test change, that’s fine.  It just provides more extensive coverage 
of event_triggers over different command tags.

There will be a bit more churn in v2, since I’m changing the code flow a bit 
more to avoid generating the strings until they are about to get sent to the 
client, per your comments above.  That has the advantage that the multiple 
places in the old code where the completionTag was parsed to get the 
nprocessed count back out no longer need any parsing.

I’ll include stats about code-size and speed when I post v2.

Thanks again for reviewing my patch idea!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Complete data erasure

2020-02-03 Thread Tomas Vondra

On Mon, Feb 03, 2020 at 09:07:09AM -0500, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On Tue, Jan 28, 2020 at 02:34:07PM -0500, Stephen Frost wrote:
>We certainly can't run external commands during transaction COMMIT, so
>this can't be part of a regular DROP TABLE.

IMO the best solution would be that the DROP TABLE does everything as
usual, but instead of deleting the relfilenode it moves it to some sort
of queue. And then a background worker would "erase" these relfilenodes
outside the COMMIT.


That sounds interesting, though I'm a bit worried that it's going to
lead to the same kind of complications and difficulty that we have with
deleted columns- anything that's working with the system tables will
need to see this new "dropped but pending delete" flag.  Would we also
rename the table when this happens?  Or change the schema it's in?
Otherwise, your typical DROP IF EXISTS / CREATE could end up breaking.



That's not really what I meant - let me explain. When I said DROP TABLE
should do everything as usual, that includes catalog changes. I.e. after
the commit there would not be any remaining entries in system catalogs
or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue of relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode from
the queue.

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.

I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough, I
guess, but maybe not.
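
To make the division of labor concrete, a rough sketch of the per-segment
"erase" step such a background worker might run is below - plain POSIX I/O
only, with the queueing, WAL-logging and replica questions discussed here
intentionally left out, and the all-zeros pattern just a placeholder:

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Overwrite one relfilenode segment with zeros, flush it, then unlink it. */
static int
erase_segment(const char *path)
{
    char        zeros[8192];
    struct stat st;
    off_t       off;
    int         fd;

    memset(zeros, 0, sizeof(zeros));

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return -1;
    }

    for (off = 0; off < st.st_size; off += (off_t) sizeof(zeros))
    {
        size_t      chunk = sizeof(zeros);

        if (st.st_size - off < (off_t) chunk)
            chunk = (size_t) (st.st_size - off);
        if (pwrite(fd, zeros, chunk, off) < 0)
        {
            close(fd);
            return -1;
        }
    }

    /* make sure the zeros reach stable storage before we forget the work */
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    return unlink(path);
}

A real version would also have to handle short writes, retry on failure, and
only remove the relfilenode from the queue after the fsync succeeds.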


And yes, we need to do this in a way that works with replicas, i.e. we
need to WAL-log it somehow. And it should to be done in a way that works
when the replica is on a different type of filesystem.


I agree it should go through WAL somehow (ideally without needing an
actual zero'd or whatever page for every page in the relation), but why
do we care about the filesystem on the replica?  We don't have anything
that's really filesystem specific in WAL replay today and I don't see
this as needing to change that..



I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on the
primary, you probably also want to erase the data on the replica. But if
the instances use different file systems (COW vs. non-COW, ...) the
exact thing that needs to happen may be different. Or maybe the replica
does not need to do anything, making it a no-op?

In which case we don't need to WAL-log the exact change for each page,
it might even be fine to not even WAL-log anything except for the final
removal from the queue. I mean, the data is worthless and not used by
anyone at this point, there's no point in replicating it ...

I haven't thought about this very hard. It's not clear what should
happen if we complete the erasure on primary, remove the relfilenode
from the queue, and then restart the replica before it finishes the
local erasure. The queue (if represented by a simple table) will be
replicated, so the replica will forget it still has work to do.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PoC] Non-volatile WAL buffer

2020-02-03 Thread Andres Freund
Hi,

On 2020-01-27 13:54:38 -0500, Robert Haas wrote:
> On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
>  wrote:
> > It sounds reasonable, but I'm sorry that I haven't tested such a program
> > yet.  I'll try it to compare with my non-volatile WAL buffer.  For now, I'm
> > a little worried about the overhead of mmap()/munmap() for each WAL segment
> > file.
> 
> I guess the question here is how the cost of one mmap() and munmap()
> pair per WAL segment (normally 16MB) compares to the cost of one
> write() per block (normally 8kB). It could be that mmap() is a more
> expensive call than read(), but by a small enough margin that the
> vastly reduced number of system calls makes it a winner. But that's
> just speculation, because I don't know how heavy mmap() actually is.

mmap()/munmap() on a regular basis does have pretty bad scalability
impacts. I don't think they'd fully hit us, because we're not in a
threaded world however.


My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized.  I think our WAL scalability is a serious
issue. There's a fair bit that we can improve with simple fixes, without
really changing the way we do IO:

- Split WALWriteLock into one lock for writing and one for flushing the
  WAL. Right now we prevent other sessions from writing out WAL - even
  to other segments - when one session is doing a WAL flush. But there's
  absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
  WAL (cf "try to write/flush later additions to XLOG as well" in
  XLogFlush()) - that currently reduces throughput in OLTP workloads
  quite noticeably. It made some sense in the spinning disk times, but I
  don't think it does for a halfway decent SSD. By writing the maximum
  ready to write, we hold the lock for longer, increasing latency for
  the committing transaction *and* preventing more WAL from being written.
- We should immediately ask the OS to flush writes for full XLOG pages
  back to the OS. Right now the IO for that will never be started before
  the commit comes around in an OLTP workload, which means that we just
  waste the time between the XLogWrite() and the commit.

That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do like 18k individual durable 8kb writes
(using O_DSYNC) a second, at a queue depth of 32. On my laptop. If I
make it 4k writes, it's 22k.

That's not directly comparable with postgres WAL flushes, of course, as
it's all separate blocks, whereas WAL will often end up overwriting the
last block. But it doesn't at all account for group commits either,
which we *constantly* end up doing.

Postgres manages somewhere between ~450 (multiple users) and ~800 (single
user) individually durable WAL writes / sec on the same hardware. Yes,
that's more than an order of magnitude less. Of course some of that is
just that postgres does more than just IO - but that's not an effect on the
order of a magnitude.

So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we are doing now by using mmap(). Doing so requires using direct IO,
which is fundamentally incompatible with using mmap().



> I have a different concern. I think that, right now, when we reuse a
> WAL segment, we write entire blocks at a time, so the old contents of
> the WAL segment are overwritten without ever being read. But that
> behavior might not be maintained when using mmap(). It might be that
> as soon as we write the first byte to a mapped page, the old contents
> have to be faulted into memory. Indeed, it's unclear how it could be
> otherwise, since the VM page must be made read-write at that point and
> the system cannot know that we will overwrite the whole page. But
> reading in the old contents of a recycled WAL file just to overwrite
> them seems like it would be disastrously expensive.

Yea, that's a serious concern.
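
To spell that out in code: a full-block write() into a recycled segment never
reads the old contents, but the first store through a mapping generally has to
fault the old page in (unless it is already in the page cache) before it can
be dirtied.  A tiny POSIX-only illustration, with a 16MB segment assumed:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_SIZE    (16 * 1024 * 1024)

static int
overwrite_first_block_via_mmap(const char *segpath,
                               const char *block, size_t blocksz)
{
    char       *base;
    int         fd;

    fd = open(segpath, O_RDWR);
    if (fd < 0)
        return -1;

    base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
    {
        close(fd);
        return -1;
    }

    /*
     * This store is where the kernel typically has to fault the *old* page
     * contents in from disk, even though we are about to overwrite them.
     */
    memcpy(base, block, blocksz);

    munmap(base, SEG_SIZE);
    close(fd);
    return 0;
}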


> A related, but more minor, concern is whether there are any
> differences in in the write-back behavior when modifying a mapped
> region vs. using write(). Either way, the same pages of the same file
> will get dirtied, but the kernel might not have the same idea in
> either case about when the changed pages should be written back down
> to disk, and that could make a big difference to performance.

I don't think there's a significant difference in case of linux - no
idea about others. And either way we probably should force the kernels
hand to start flushing much sooner.

Greetings,

Andres Freund




Re: BUG #16171: Potential malformed JSON in explain output

2020-02-03 Thread Tom Lane
Daniel Gustafsson  writes:
> On 1 Feb 2020, at 20:37, Tom Lane  wrote:
>> 0002 attached isn't committable, because nobody would want the overhead
>> in production, but it seems like a good trick to keep up our sleeves.

> Thats a neat trick, I wonder if it would be worth maintaining a curated list 
> of
> these tricks in a README under src/test to help others avoid/reduce wheel
> reinventing?

It occurred to me that as long as this is an uncommittable hack anyway,
we could feed the EXPLAIN data to jsonb_in and then hot-wire the jsonb
code to whine about duplicate keys.  So attached, for the archives' sake,
is an improved version that does that.  I still don't find any problems
(other than the one we're fixing here); though no doubt if I reverted
100136849 it'd complain about that.

regards, tom lane

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index f69dde8..1945a77 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -18,6 +18,7 @@
 #include "commands/explain.h"
 #include "executor/instrument.h"
 #include "jit/jit.h"
+#include "utils/builtins.h"
 #include "utils/guc.h"
 
 PG_MODULE_MAGIC;
@@ -397,6 +398,10 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 			{
 es->str->data[0] = '{';
 es->str->data[es->str->len - 1] = '}';
+
+/* Verify that it's valid JSON by feeding to jsonb_in */
+(void) DirectFunctionCall1(jsonb_in,
+		   CStringGetDatum(es->str->data));
 			}
 
 			/*
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index edec657..b9038fa 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -1915,6 +1915,9 @@ uniqueifyJsonbObject(JsonbValue *object)
 if (ptr != res)
 	memcpy(res, ptr, sizeof(JsonbPair));
 			}
+			else
+elog(WARNING, "dropping duplicate jsonb key %s",
+	 ptr->key.val.string.val);
 			ptr++;
 		}
 
diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out
index a70cd0b..2db14e1 100644
--- a/src/test/regress/expected/jsonb.out
+++ b/src/test/regress/expected/jsonb.out
@@ -3488,6 +3488,9 @@ SELECT '{"ff":{"a":12,"b":16},"qq":123}'::jsonb;
 (1 row)
 
 SELECT '{"aa":["a","aaa"],"qq":{"a":12,"b":16,"c":["c1","c2"],"d":{"d1":"d1","d2":"d2","d1":"d3"}}}'::jsonb;
+WARNING:  dropping duplicate jsonb key d1
+LINE 1: SELECT '{"aa":["a","aaa"],"qq":{"a":12,"b":16,"c":["c1","c2"...
+   ^
   jsonb   
 --
  {"aa": ["a", "aaa"], "qq": {"a": 12, "b": 16, "c": ["c1", "c2"], "d": {"d1": "d3", "d2": "d2"}}}
@@ -4021,6 +4024,8 @@ select jsonb_concat('{"d": "test", "a": [1, 2]}', '{"g": "test2", "c": {"c1":1,
 (1 row)
 
 select '{"aa":1 , "b":2, "cq":3}'::jsonb || '{"cq":"l", "b":"g", "fg":false}';
+WARNING:  dropping duplicate jsonb key baacq
+WARNING:  dropping duplicate jsonb key cq
   ?column?   
 -
  {"b": "g", "aa": 1, "cq": "l", "fg": false}
@@ -4033,6 +4038,7 @@ select '{"aa":1 , "b":2, "cq":3}'::jsonb || '{"aq":"l"}';
 (1 row)
 
 select '{"aa":1 , "b":2, "cq":3}'::jsonb || '{"aa":"l"}';
+WARNING:  dropping duplicate jsonb key aacq
?column?   
 --
  {"b": 2, "aa": "l", "cq": 3}


Re: explain HashAggregate to report bucket and memory stats

2020-02-03 Thread Andres Freund
Hi,


On 2020-01-03 10:19:26 -0600, Justin Pryzby wrote:
> On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> > What would I find very useful is [...] if the HashAggregate node under
> > "explain analyze" would report memory and bucket stats; and if the Aggregate
> > node would report...anything.

Yea, that'd be amazing. It probably should be something every
execGrouping.c using node can opt into.


Justin: As far as I can tell, you're trying to share one instrumentation
state between hashagg and hashjoins. I'm doubtful that's a good
idea. The cases are different enough that that's probably just going to
be complicated, without actually simplifying anything.


> Jeff: can you suggest what details Aggregate should show ?

Memory usage most importantly. Probably makes sense to differentiate
between the memory for the hashtable itself, and the tuples in it (since
they're allocated separately, and just having a overly large hashtable
doesn't hurt that much if it's not filled).


> diff --git a/src/backend/executor/execGrouping.c 
> b/src/backend/executor/execGrouping.c
> index 3603c58..cf0fe3c 100644
> --- a/src/backend/executor/execGrouping.c
> +++ b/src/backend/executor/execGrouping.c
> @@ -203,6 +203,11 @@ BuildTupleHashTableExt(PlanState *parent,
>   hashtable->hash_iv = 0;
>  
>   hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
> + hashtable->hinstrument.nbuckets_original = nbuckets;
> + hashtable->hinstrument.nbuckets = nbuckets;
> + hashtable->hinstrument.space_peak = entrysize * 
> hashtable->hashtab->size;

That's not actually an accurate accounting of memory, because for filled
entries a lot of memory is used to store actual tuples:

static TupleHashEntryData *
lookup_hash_entry(AggState *aggstate)
...
/* find or create the hashtable entry using the filtered tuple */
entry = LookupTupleHashEntry(perhash->hashtable, hashslot, );

if (isnew)
{
AggStatePerGroup pergroup;
int transno;

pergroup = (AggStatePerGroup)
MemoryContextAlloc(perhash->hashtable->tablecxt,
   
sizeof(AggStatePerGroupData) * aggstate->numtrans);
entry->additional = pergroup;


since the memory doesn't actually shrink unless the hashtable is
destroyed or reset, it'd probably be sensible to compute the memory
usage either at reset, or at the end of the query.
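
A hypothetical sketch of that suggestion, with the field names following the
patch upthread and the context-size helper assumed rather than an existing
API - the point is only where the measurement happens, not the exact code:

/*
 * Capture hash table memory late, when the table is full (e.g. at reset or
 * executor shutdown), instead of at build time.  tablecxt holds the tuples
 * and per-group states, which is what the bucket-array-only estimate misses.
 */
static void
record_hash_instrumentation(TupleHashTable hashtable, HashInstrumentation *hinstr)
{
    Size        bucket_mem = hashtable->hashtab->size * sizeof(TupleHashEntryData);
    Size        tuple_mem = context_mem_allocated(hashtable->tablecxt); /* assumed helper */
    Size        total = bucket_mem + tuple_mem;

    hinstr->nbuckets = hashtable->hashtab->size;
    if (total > hinstr->space_peak)
        hinstr->space_peak = total;
}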


Greetings,

Andres Freund




Re: Add %x to PROMPT1 and PROMPT2

2020-02-03 Thread Tom Lane
Daniel Gustafsson  writes:
> On 3 Feb 2020, at 08:08, Michael Paquier  wrote:
>> FWIW, I am not really in favor of changing a default old enough that
>> it could vote (a45195a).

> That by itself doesn't seem a good reason to not change things.

> My concern would be that users who have never ever considered that the prompt
> can be changed, all of sudden wonder why the prompt is showing characters it
> normally isn't, thus causing confusion.  That being said, I agree that this is
> a better default long-term.

I've got the same misgivings as Michael.  In a green field this'd likely
be a good idea, but after so many years I'm afraid it will make fewer
people happy than unhappy.

Now on the other hand, we did change the server's default log_line_prefix
not so long ago (7d3235ba4), and the feared storm of complaints was pretty
much undetectable.  So maybe this'd go down the same way.  Worth noting
also is that this shouldn't be able to break any applications, since the
prompt is an interactive-only behavior.

regards, tom lane




Re: Add %x to PROMPT1 and PROMPT2

2020-02-03 Thread Alvaro Herrera
On 2020-Feb-03, Daniel Gustafsson wrote:

> > On 3 Feb 2020, at 08:08, Michael Paquier  wrote:
> 
> > FWIW, I am not really in favor of changing a default old enough that
> > it could vote (a45195a).
> 
> That by itself doesn't seem a good reason to not change things.

Yeah.

> My concern would be that users who have never ever considered that the prompt
> can be changed, all of sudden wonder why the prompt is showing characters it
> normally isn't, thus causing confusion.  That being said, I agree that this is
> a better default long-term.

I think this is the good kind of surprise, not the bad kind.

I think the only kind of user that would be negatively affected would be
those that have scripted psql under expect(1) and would fail to read the
new prompt correctly.  But I would say that that kind of user is the one
most likely to be able to fix things as needed.  Everybody else is just
looking at an extra char in the screen, and they don't really care.

I'm +1 for the default change.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: BUG #16171: Potential malformed JSON in explain output

2020-02-03 Thread Tom Lane
Daniel Gustafsson  writes:
> I guess I'm leaning towards backpatching, but it's not entirely clear-cut.

That's where I stand too.  I'll wait a day or so to see if anyone
else comments; but if not, I'll back-patch.

regards, tom lane




Re: Complete data erasure

2020-02-03 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Tue, Jan 28, 2020 at 02:34:07PM -0500, Stephen Frost wrote:
> >We certainly can't run external commands during transaction COMMIT, so
> >this can't be part of a regular DROP TABLE.
> 
> IMO the best solution would be that the DROP TABLE does everything as
> usual, but instead of deleting the relfilenode it moves it to some sort
> of queue. And then a background worker would "erase" these relfilenodes
> outside the COMMIT.

That sounds interesting, though I'm a bit worried that it's going to
lead to the same kind of complications and difficulty that we have with
deleted columns- anything that's working with the system tables will
need to see this new "dropped but pending delete" flag.  Would we also
rename the table when this happens?  Or change the schema it's in?
Otherwise, your typical DROP IF EXISTS / CREATE could end up breaking.

> And yes, we need to do this in a way that works with replicas, i.e. we
> need to WAL-log it somehow. And it should to be done in a way that works
> when the replica is on a different type of filesystem.

I agree it should go through WAL somehow (ideally without needing an
actual zero'd or whatever page for every page in the relation), but why
do we care about the filesystem on the replica?  We don't have anything
that's really filesystem specific in WAL replay today and I don't see
this as needing to change that..

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-03 Thread Fujii Masao



On 2020/02/03 16:28, Amit Langote wrote:

On Mon, Feb 3, 2020 at 1:17 PM Fujii Masao  wrote:

On 2020/02/02 14:59, Masahiko Sawada wrote:

On Fri, 31 Jan 2020 at 02:29, Fujii Masao  wrote:

On 2020/01/30 12:58, Kyotaro Horiguchi wrote:

+WHEN 3 THEN 'stopping backup'::text

I'm not sure, but the "stop" seems to suggest that the backup is terminated
before completion. If it is following the name of the function
pg_stop_backup, I think that name means stopping "the state for
performing a backup", not stopping a backup.

In the first place, the progress is about "backup", so it seems strange
that we have another phase after the "stopping backup" phase.  It
might be better to call it "finishing file transfer" or some such.

  "initializing"
-> "starting file transfer"
-> "transferring files"
-> "finishing file transfer"
-> "transaferring WAL"


A better name is always welcome! If "stopping backup" is confusing,
what about "performing pg_stop_backup"? So

initializing
performing pg_start_backup
streaming database files
performing pg_stop_backup
transferring WAL files


Another idea I came up with is to show the steps that take time instead of
pg_start_backup/pg_stop_backup, for a better understanding of the
situation. That is, "performing checkpoint" and "performing WAL
archive", etc., which take up most of the time of these functions.


Yeah, that's an idea. ISTM that "waiting for WAL archiving" sounds
better than "performing WAL archive". Thought?
I've not applied this change in the patch yet, but if there is no
other idea, I'd like to adopt this.


If we are trying to "pg_stop_backup" in phase name, maybe we should
avoid "pg_start_backup"?  Then maybe:

initializing
starting backup / waiting for [ backup start ] checkpoint to finish
transferring database files
finishing backup / waiting for WAL archiving to finish
transferring WAL files


So we now have the following ideas about the phase names for pg_basebackup.

1.
initializing

2.
2-1. starting backup
2-2. starting file transfer
2-3. performing pg_start_backup
2-4. performing checkpoint
2-5. waiting for [ backup start ] checkpoint to finish

3.
3-1. streaming backup
3-2. transferring database files
3-3. streaming database files
3-4. transferring files

4.
4-1. stopping backup
4-2. finishing file transfer
4-3. performing pg_stop_backup
4-4. finishing backup
4-5. waiting for WAL archiving to finish
4-6. performing WAL archive

5.
5-1. transferring wal
5-2. transferring WAL files

What combination of these do you prefer?

Some comments on documentation changes in v2 patch:

+  Amount of data already streamed.


Ok, fixed.


"already" may be redundant.  For example, in pg_start_progress_vacuum,
heap_blks_scanned is described as "...blocks scanned", not "...blocks
already scanned".

+ tablespace_total
+ tablespace_streamed

Better to use plural tablespaces_total and tablespaces_streamed for consistency?


Fixed.


+   The WAL sender process is currently performing
+   pg_start_backup and setting up for
+   making a base backup.

How about "taking" instead of "making" in the above sentence?


Fixed. Attached is the updated version of the patch.



-  
+  

I don't see any new text in the documentation patch that uses above
xref, so no need to define it?


The following description that I added uses this.

certain commands during command execution.  Currently, the only commands
which support progress reporting are ANALYZE,
CLUSTER,
-   CREATE INDEX, and VACUUM.
+   CREATE INDEX, VACUUM,
+   and  (i.e., replication
+   command that  issues to take
+   a base backup).

Regards,


--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699079..7068f82ff7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -376,6 +376,14 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
   
  
 
+ 
+  
pg_stat_progress_basebackuppg_stat_progress_basebackup
+  One row for each WAL sender process streaming a base backup,
+   showing current progress.
+   See .
+  
+ 
+
 

   
@@ -3515,7 +3523,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
certain commands during command execution.  Currently, the only commands
which support progress reporting are ANALYZE,
CLUSTER,
-   CREATE INDEX, and VACUUM.
+   CREATE INDEX, VACUUM,
+   and  (i.e., replication
+   command that  issues to take
+   a base backup).
This may be expanded in the future.
   
 
@@ -4316,6 +4327,143 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,


   
+ 
+
+ 
+  Base Backup Progress Reporting
+
+  
+   Whenever pg_basebackup is taking a base
+   backup, the pg_stat_progress_basebackup
+   view will contain a row for each WAL sender process that is currently
+   running BASE_BACKUP replication command
+   and streaming 

Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

2020-02-03 Thread Andres Freund
Hi,

On 2020-01-21 15:41:54 +0900, Fujii Masao wrote:
> On 2020/01/21 13:39, Michael Paquier wrote:
> > On Tue, Jan 21, 2020 at 08:45:14AM +0530, Amit Kapila wrote:
> > > The original email doesn't say so.  I might be missing something, but
> > > can you explain what makes you think so.
> >
> > Oops.  Incorrect thread, I was thinking about this one previously:
> > https://www.postgresql.org/message-id/822113470.250068.1573246011...@connect.xfinity.com
> >
> > Re-reading the root of the thread, I am still not sure what we could
> > do, as that's rather tricky.

Did anybody consider the proposal at
https://www.postgresql.org/message-id/20191223005430.yhf4n3zr4ojwbcn2%40alap3.anarazel.de
 ?
I think we're going to have to do something like that to actually fix
the problem, rather than polish around the edges.


> See here:
> https://www.postgresql.org/message-id/20190927061414.gf8...@paquier.xyz

On 2019-09-27 15:14:14 +0900, Michael Paquier wrote:
> Wrapping the call of smgrtruncate() within RelationTruncate() to use a
> critical section would make things worse from the user perspective on
> the primary, no?  If the physical truncation fails, we would still
> fail WAL replay on the standby, but instead of generating an ERROR in
> the session of the user attempting the TRUNCATE, the whole primary
> would be taken down.

FWIW, to me this argument just doesn't make any sense - even if a few
people have argued it.

A failure in the FS truncate currently leaves the cluster in a
corrupted state in multiple ways:
1) Dirty buffer contents were thrown away, and going forward their old
   contents will be read back.
2) We have WAL-logged something that we haven't done. That's *obviously*
   something *completely* violating WAL logging rules. And it breaks WAL
   replay (including locally, should we crash before the next
   checkpoint - there could be subsequent WAL records relying on the
   block's existence).

That's so obviously worse than a PANIC restart, that I really don't
understand the "worse from the user perspective" argument from your
email above.  Obviously it sucks that the error might re-occur during
recovery. But that's something that usually actually can be fixed -
whereas the data corruption can't.


> The original proposal, i.e., holding the interrupts during
> the truncation, is worth considering? It is not a perfect
> solution but might improve the situation a bit.

I don't think it's useful in isolation.

Greetings,

Andres Freund




Re: Cache relation sizes?

2020-02-03 Thread Andres Freund
Hi,

On 2019-12-31 17:05:31 +1300, Thomas Munro wrote:
> There is one potentially interesting case that doesn't require any
> kind of shared cache invalidation AFAICS.  XLogReadBufferExtended()
> calls smgrnblocks() for every buffer access, even if the buffer is
> already in our buffer pool.

Yea, that's really quite bad*. The bit about doing so even when already
in the buffer pool is particularly absurd.  Needing to have special
handling in mdcreate() for XLogReadBufferExtended() always calling it is
also fairly ugly.


> It doesn't seem great that we are effectively making system calls for
> most WAL records we replay, but, sadly, in this case the patch didn't
> really make any measurable difference when run without strace on this
> Linux VM.  I suspect there is some workload and stack where it would
> make a difference (CF the read(postmaster pipe) call for every WAL
> record that was removed), but this is just something I noticed in
> passing while working on something else, so I haven't investigated
> much.

I wonder if that's just because your workload is too significantly
bottlenecked elsewhere:

> postgres -D pgdata -c checkpoint_timeout=60min

> In another shell:
> pgbench -i -s100 postgres
> pgbench -M prepared -T60 postgres
> killall -9 postgres
> mv pgdata pgdata-save

With scale 100, but the default shared_buffers, you'll frequently hit
the OS for reads/writes. Which will require the same metadata in the
kernel, but then also memcpys between kernel and userspace.

A word of caution about strace's -c: In my experience the total time
measurements are very imprecise somehow. I think it might be that some
of the overhead of ptracing will be attributed to the syscalls or such,
which means frequent syscalls appear relatively more expensive than they
really are.

Greetings,

Andres Freund

* it insults my sense of aesthetics




Re: Brokenness in dump/restore for GENERATED expressions

2020-02-03 Thread Peter Eisentraut

On 2020-01-30 19:54, Tom Lane wrote:

Using today's HEAD, the regression database cannot be dumped and
restored normally.  Since the buildfarm isn't all red, I suppose
it works in --binary-upgrade mode ... but if I just do

$ make installcheck  # to set up the test database
$ pg_dump -Fc regression >r.dump
$ createdb r2
$ pg_restore -d r2 r.dump

I get

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 6016; 2604 24926 DEFAULT gtest1_1 b postgres
pg_restore: error: could not execute query: ERROR:  column "b" of relation "gtest1_1" is a generated column
Command was: ALTER TABLE ONLY public.gtest1_1 ALTER COLUMN b SET DEFAULT (a * 2);


pg_restore: from TOC entry 6041; 2604 25966 DEFAULT gtest30_1 b postgres
pg_restore: error: could not execute query: ERROR:  cannot use column reference in DEFAULT expression
Command was: ALTER TABLE ONLY public.gtest30_1 ALTER COLUMN b SET DEFAULT (a * 2);


This is the same issue as 
. 
I will work on it this week.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: base backup client as auxiliary backend process

2020-02-03 Thread Andres Freund
Hi,

Comment:

- It'd be good to split out the feature independent refactorings, like
  the introduction of InitControlFile(), into their own commit. Right
  now it's hard to separate out what should just be moved code,
  and what should be behavioural changes. Especially when there's stuff
  like just reindenting comments as part of the patch.


> @@ -886,12 +891,27 @@ PostmasterMain(int argc, char *argv[])
>   /* Verify that DataDir looks reasonable */
>   checkDataDir();
>
> - /* Check that pg_control exists */
> - checkControlFile();
> -
>   /* And switch working directory into it */
>   ChangeToDataDir();
>
> + if (stat(BASEBACKUP_SIGNAL_FILE, _buf) == 0)
> + {
> + int fd;
> +
> + fd = BasicOpenFilePerm(STANDBY_SIGNAL_FILE, O_RDWR | PG_BINARY,
> +S_IRUSR | S_IWUSR);
> + if (fd >= 0)
> + {
> + (void) pg_fsync(fd);
> + close(fd);
> + }
> + basebackup_signal_file_found = true;
> + }
> +
> + /* Check that pg_control exists */
> + if (!basebackup_signal_file_found)
> + checkControlFile();
> +

This should be moved into its own function, rather than open coded in
PostmasterMain(). Imagine how PostmasterMain() would look if all the
check/initialization functions weren't extracted into functions.


>   /*
>* Check for invalid combinations of GUC settings.
>*/
> @@ -970,7 +990,8 @@ PostmasterMain(int argc, char *argv[])
>* processes will inherit the correct function pointer and not need to
>* repeat the test.
>*/
> - LocalProcessControlFile(false);
> + if (!basebackup_signal_file_found)
> + LocalProcessControlFile(false);
>
>   /*
>* Initialize SSL library, if specified.
> @@ -1386,6 +1407,39 @@ PostmasterMain(int argc, char *argv[])
>*/
>   AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STARTING);
>
> + if (basebackup_signal_file_found)
> + {

This imo *really* should be a separate function.


> + BaseBackupPID = StartBaseBackup();
> +
> + /*
> +  * Wait until done.  Start WAL receiver in the meantime, once 
> base
> +  * backup has received the starting position.
> +  */
> + while (BaseBackupPID != 0)
> + {
> + PG_SETMASK();
> + pg_usleep(100L);
> + PG_SETMASK();
> + MaybeStartWalReceiver();
> + }


Is there seriously no better signalling that we can use than just
looping for a couple hours?

Is it actually guaranteed that a compiler wouldn't just load
BaseBackupPID into a register, and never see a change to it done in a
signal handler?
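
For the register-caching question specifically, the conventional answer is to
make the variable volatile (and strictly speaking only volatile sig_atomic_t
stores are guaranteed async-signal-safe by C).  A reduced sketch of the
hazard, with hypothetical names:

#include <sys/types.h>

static volatile pid_t BaseBackupPID = 0;    /* written by the SIGCHLD handler */

static void
reaper_sketch(int signo)
{
    /* ... waitpid() bookkeeping elided ... */
    BaseBackupPID = 0;      /* without volatile, the loop below may never see this */
}

static void
wait_for_base_backup(void)
{
    while (BaseBackupPID != 0)
    {
        /* a latch or sigsuspend() would avoid polling entirely */
        pg_usleep(100000L);     /* existing backend sleep helper */
    }
}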

There should be a note mentioning that we'll just FATAL out if the base
backup process fails. Otherwise it's the obvious question reading this
code.   Also - we have handling to restart WAL receiver, but there's no
handling for the base backup temporarily failing: Is that just because
it's easy to do in one case, but not the other?


> + /*
> +  * Reread the control file that came in with the base backup.
> +  */
> + ReadControlFile();
> + }

Is it actually rereading? I'm just reading the diff, so maybe I'm missing
something, but you've made LocalProcessControlFile not enter this code
path...


> @@ -2824,6 +2880,8 @@ pmdie(SIGNAL_ARGS)
>
>   if (StartupPID != 0)
>   signal_child(StartupPID, SIGTERM);
> + if (BaseBackupPID != 0)
> + signal_child(BaseBackupPID, SIGTERM);
>   if (BgWriterPID != 0)
>   signal_child(BgWriterPID, SIGTERM);
>   if (WalReceiverPID != 0)
> @@ -3062,6 +3120,23 @@ reaper(SIGNAL_ARGS)


>   continue;
>   }
>
> + /*
> +  * Was it the base backup process?
> +  */
> + if (pid == BaseBackupPID)
> + {
> + BaseBackupPID = 0;
> + if (EXIT_STATUS_0(exitstatus))
> + ;
> + else if (EXIT_STATUS_1(exitstatus))
> + ereport(FATAL,
> + (errmsg("base backup failed")));
> + else
> + HandleChildCrash(pid, exitstatus,
> +  _("base backup 
> process"));
> + continue;
> + }
> +

What's the error handling for the case we shut down either because of
SIGTERM above, or here? Does all the code just deal with that at the next
start? If not, what makes this safe?



> +/*
> 

Re: [Proposal] Global temporary tables

2020-02-03 Thread 曾文旌(义从)


> On Feb 2, 2020, at 2:00 AM, Pavel Stehule wrote:
> 
> 
> 
> On Sat, Feb 1, 2020 at 14:39, 曾文旌(义从) wrote:
> 
> 
>> On Jan 30, 2020, at 10:21 PM, Pavel Stehule wrote:
>> 
>> 
>> 
>> On Thu, Jan 30, 2020 at 15:17, 曾文旌(义从) wrote:
>> 
>> 
>> > On Jan 29, 2020, at 9:48 PM, Robert Haas wrote:
>> > 
>> > On Tue, Jan 28, 2020 at 12:12 PM 曾文旌(义从) > > > wrote:
>> >>> Opinion by Pavel
>> >>> + rel->rd_islocaltemp = true;  <<< if this is valid, then the name 
>> >>> of field "rd_islocaltemp" is not probably best
>> >>> I renamed rd_islocaltemp
>> >> 
>> >> I don't see any change?
>> >> 
>> >> Rename rd_islocaltemp to rd_istemp  in 
>> >> global_temporary_table_v8-pg13.patch
>> > 
>> > In view of commit 6919b7e3294702adc39effd16634b2715d04f012, I think
>> > that this has approximately a 0% chance of being acceptable. If you're
>> > setting a field in a way that is inconsistent with the current use of
>> > the field, you're probably doing it wrong, because the field has an
>> > existing purpose to which new code must conform. And if you're not
>> > doing that, then you don't need to rename it.
>> Thank you for pointing it out.
>> I've rolled back the rename.
>> But I still need rd_islocaltemp to be true. The reasons are that
>> 1. A GTT needs to support DML in read-only transactions, like a local
>> temp table.
>> 2. A GTT does not need to hold the lock before modifying the index buffer,
>> also like a local temp table.
>> 
>> Please give me feedback.
>> 
>> maybe something like
>> 
>> rel->rd_globaltemp = true;
>> 
>> and somewhere else
>> 
>> if (rel->rd_localtemp || rel->rd_globaltemp)
>> {
>>   ...
>> }
> I tried to optimize code in global_temporary_table_v10-pg13.patch
> 
> 
> Please give me feedback.
> 
> I tested this patch and I have no objections - from my user perspective
> it works as I expect
> 
> +#define RELATION_IS_TEMP(relation) \
> + ((relation)->rd_islocaltemp || \
> + (relation)->rd_rel->relpersistence == RELPERSISTENCE_GLOBAL_TEMP)
>  
> It looks a little bit unbalanced
> 
> maybe it is better to inject rd_isglobaltemp into the relation structure
> 
> and then
> 
> it should look like
> 
> +#define RELATION_IS_TEMP(relation) \
> + ((relation)->rd_islocaltemp || \
> + (relation)->rd_isglobaltemp))
> 
> But I have not idea if it helps in complex
In my opinion:
For a local temp table we need (relation)->rd_rel->relpersistence ==
RELPERSISTENCE_TEMP, and because one local temp table belongs to only one
session, we need to mark it with rd_islocaltemp = true in that session, and
rd_islocaltemp = false in the others.

But for GTT, we just need (relation)->rd_rel->relpersistence ==
RELPERSISTENCE_GLOBAL_TEMP.
One GTT can be used by every session, so there is no need for rd_isglobaltemp
anymore. That seems duplicated and redundant.

> 
> 
> 
> 
> 
> 
> 
> Wenjing
> 
> 
> 
>> 
>> 
>> 
>> Wenjing
>> 
>> 
>> 
>> 
>> > 
>> > -- 
>> > Robert Haas
>> > EnterpriseDB: http://www.enterprisedb.com 
>> > The Enterprise PostgreSQL Company
>> 
> 



Re: BUG #16171: Potential malformed JSON in explain output

2020-02-03 Thread Daniel Gustafsson
> On 2 Feb 2020, at 17:48, Tom Lane  wrote:

> Thoughts?

Keeping TEXT explain stable across minor versions is very appealing, but more
so from a policy standpoint than a technical one.  The real-world implication
is probably quite small, but that's a very unscientific guess (searching Github
didn't turn up anything but thats far from conclusive).  Having broken JSON is
however a clear negative, and so is having a different JSON format in back-
branches for something which has never worked in the first place.

I guess I'm leaning towards backpatching, but it's not entirely clear-cut.

cheers ./daniel



Re: Experimenting with hash join prefetch

2020-02-03 Thread Thomas Munro
On Fri, Apr 12, 2019 at 4:23 PM Thomas Munro  wrote:
> ... if we also prefetch during
> the build phase ...

Here's an experimental patch to investigate just that part.  I tried
initiating a prefetch of the bucket just before we copy the tuple and
then finally insert it, but it doesn't seem to be far enough apart (at
least for small tuples), which is a shame because that'd be a one line
patch.  So I made a little insert queue that prefetches and defers the
insertion until N tuples later, and then I managed to get between 10%
and 20% speed-up for contrived tests like this:

create unlogged table t as select generate_series(1, 1)::int i;
select pg_prewarm('t');
set work_mem='8GB';

select count(*) from t t1 join t t2 using (i);

           master    patched/N=1      patched/N=4
workers=0  89.808s   80.556s (+11%)   76.864 (+16%)
workers=2  27.010s   22.679s (+19%)   23.503 (+15%)
workers=4  16.728s   14.896s (+12%)   14.167 (+18%)

Just an early experiment, but I thought it looked promising enough to share.
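
For anyone curious about the shape of the insert queue, here is a heavily
simplified sketch of the idea (hypothetical structures, not the attached
patch): prefetch the destination bucket's cache line when a tuple arrives, but
only link it into the chain N insertions later, so the cache miss can overlap
with other work:

#define PREFETCH_QUEUE_LEN 4

typedef struct HashTuple
{
    struct HashTuple *next;     /* bucket chain */
    /* ... tuple payload would follow ... */
} HashTuple;

typedef struct PendingInsert
{
    HashTuple  *tuple;
    unsigned    bucketno;
} PendingInsert;

typedef struct InsertQueue
{
    PendingInsert slots[PREFETCH_QUEUE_LEN];
    int         head;
} InsertQueue;

static void
queue_insert(InsertQueue *q, HashTuple **buckets,
             HashTuple *tuple, unsigned bucketno)
{
    PendingInsert *slot = &q->slots[q->head];

    /* flush the entry queued PREFETCH_QUEUE_LEN insertions ago */
    if (slot->tuple != NULL)
    {
        slot->tuple->next = buckets[slot->bucketno];
        buckets[slot->bucketno] = slot->tuple;
    }

    /* queue the new tuple and start pulling its bucket into cache */
    slot->tuple = tuple;
    slot->bucketno = bucketno;
    __builtin_prefetch(&buckets[bucketno]);     /* GCC/Clang builtin */

    q->head = (q->head + 1) % PREFETCH_QUEUE_LEN;
}

At the end of the build phase the queue still holds the last
PREFETCH_QUEUE_LEN tuples, so it has to be drained the same way.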


0001-Prefetch-cache-lines-while-building-hash-join-table.patch
Description: Binary data


Re: Add %x to PROMPT1 and PROMPT2

2020-02-03 Thread Daniel Gustafsson
> On 3 Feb 2020, at 08:08, Michael Paquier  wrote:

> FWIW, I am not really in favor of changing a default old enough that
> it could vote (a45195a).

That by itself doesn't seem a good reason to not change things.

My concern would be that users who have never ever considered that the prompt
can be changed, all of sudden wonder why the prompt is showing characters it
normally isn't, thus causing confusion.  That being said, I agree that this is
a better default long-term.

cheers ./daniel



Re: BUG #16171: Potential malformed JSON in explain output

2020-02-03 Thread hubert depesz lubaczewski
On Sun, Feb 02, 2020 at 11:48:32AM -0500, Tom Lane wrote:
> > Does that prevent backpatching this, or are we Ok with EXPLAIN text output 
> > not
> > being stable across minors?  AFAICT Pg::Explain still works fine with this
> > change, but mileage may vary for other parsers.
> I'm not sure about that either.  It should be a clear win for parsers
> of the non-text formats, because now we're generating valid
> JSON-or-whatever where we were not before.  But it's not too hard to
> imagine that someone's ad-hoc parser of text output would break,
> depending on how much it relies on field order rather than indentation
> to make sense of things.

Change looks reasonable to me.

Interestingly Pg::Explain doesn't handle either current JSON output in
this case (as it's not a valid JSON), nor the new one - but this can be
fixed easily.

Best regards,

depesz





Re: base backup client as auxiliary backend process

2020-02-03 Thread Andres Freund
Hi,

On 2020-01-11 10:52:30 +0100, Peter Eisentraut wrote:
> On 2020-01-10 04:32, Masahiko Sawada wrote:
> > I agreed that these patches are useful on its own and 0001 patch and
> 
> committed 0001

over on -committers Robert complained:

On 2020-01-23 15:49:37 -0500, Robert Haas wrote:
> On Tue, Jan 14, 2020 at 8:57 AM Peter Eisentraut  wrote:
> > walreceiver uses a temporary replication slot by default
> >
> > If no permanent replication slot is configured using
> > primary_slot_name, the walreceiver now creates and uses a temporary
> > replication slot.  A new setting wal_receiver_create_temp_slot can be
> > used to disable this behavior, for example, if the remote instance is
> > out of replication slots.
> >
> > Reviewed-by: Masahiko Sawada 
> > Discussion: 
> > https://www.postgresql.org/message-id/CA%2Bfd4k4dM0iEPLxyVyme2RAFsn8SUgrNtBJOu81YqTY4V%2BnqZA%40mail.gmail.com
> 
> Neither the commit message for this patch nor any of the comments in
> the patch seem to explain why this is a desirable change.
> 
> I assume that's probably discussed on the thread that is linked here,
> but you shouldn't have to dig through the discussion thread to figure
> out what the benefits of a change like this are.

which I fully agree with.


It's not at all clear to me that the potential downsides of this have
been fully thought through. And even if they have, they've not been
documented.

Previously if a standby without a slot was slow receiving WAL,
e.g. because the network bandwidth was insufficient, it'd at some point
just fail because the required WAL is removed. But with this patch that
won't happen - instead the primary will just run out of space. At the
very least this would need to add documentation of this caveat to a few
places.

Perhaps that's worth doing anyway, because it's probably more common for
a standby to just temporarily run behind - but given that this feature
doesn't actually provide any robustness, due to e.g. the possibility of
temporary disconnections or restarts, I'm not sure it's providing all
that much compared to the dangers, for a feature on by default.

Greetings,

Andres Freund




Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-03 Thread Amit Kapila
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada  wrote:
>
> On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas  wrote:
> > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund  wrote:
> >>> I think the real question is whether the scenario is common enough to
> >>> worry about.  In practice, you'd have to be extremely unlucky to be
> >>> doing many bulk loads at the same time that all happened to hash to
> >>> the same bucket.
> >>
> >> With a bunch of parallel bulkloads into partitioned tables that really
> >> doesn't seem that unlikely?
> >
> > It increases the likelihood of collisions, but probably decreases the
> > number of cases where the contention gets really bad.
> >
> > For example, suppose each table has 100 partitions and you are
> > bulk-loading 10 of them at a time.  It's virtually certain that you
> > will have some collisions, but the amount of contention within each
> > bucket will remain fairly low because each backend spends only 1% of
> > its time in the bucket corresponding to any given partition.
> >
>
> I share another result of performance evaluation between current HEAD
> and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
>
> Type of table: normal table, unlogged table
> Number of child tables : 16, 64 (all tables are located on the same 
> tablespace)
> Number of clients : 32
> Number of trials : 100
> Duration: 180 seconds for each trials
>
> The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> RAM, NVMe SSD 1.5TB.
> Each clients load 10kB random data across all partitioned tables.
>
> Here is the result.
>
>  childs |   type   | target  |  avg_tps   | diff with HEAD
> +--+-++--
>  16 | normal   | HEAD|   1643.833 |
>  16 | normal   | Patched |  1619.5404 |  0.985222
>  16 | unlogged | HEAD|  9069.3543 |
>  16 | unlogged | Patched |  9368.0263 |  1.032932
>  64 | normal   | HEAD|   1598.698 |
>  64 | normal   | Patched |  1587.5906 |  0.993052
>  64 | unlogged | HEAD|  9629.7315 |
>  64 | unlogged | Patched | 10208.2196 |  1.060073
> (8 rows)
>
> For normal tables, loading tps decreased by 1% ~ 2% with this patch,
> whereas it increased by 3% ~ 6% for unlogged tables. In the 64 child
> tables case, between 0 and 5 relation extension lock slots had two
> relations colliding on them, but this didn't seem to affect the tps.
>

AFAIU, this resembles the workload that Andres was worried about.  I
think we should run this test once more in a different environment,
but assuming it is correct and repeatable, where do we go with this
patch, especially when we know it improves many workloads [1] as
well?  We know that it causes a regression on a pathological case
constructed by Mithun [2], but I am not sure that test really mimics
any real-world workload, since he set N_RELEXTLOCK_ENTS = 1 to hit
the worst case.
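
For readers who have not followed the patch closely, here is a rough sketch
of the fixed-size bucket mapping being discussed; the names and the hash
function are illustrative, not taken from the patch:

#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024      /* number of extension-lock buckets */

/*
 * Map a relation (database OID, relation OID) to one of the
 * N_RELEXTLOCK_ENTS lock buckets.  Two relations contend on extension only
 * when they hash to the same slot; the real patch presumably uses one of
 * PostgreSQL's hash functions rather than this ad-hoc mix.
 */
static int
rel_extension_lock_slot(uint32_t dbid, uint32_t relid)
{
    uint32_t    h = dbid * 0x9E3779B1u ^ relid;

    return (int) (h % N_RELEXTLOCK_ENTS);
}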

Sawada-San, if you have a script or data for the test you ran, please
share it so that others can also try to reproduce it.

[1] - 
https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9%40postgrespro.ru
[2] - 
https://www.postgresql.org/message-id/CAD__Oug52j%3DDQMoP2b%3DVY7wZb0S9wMNu4irXOH3-ZjFkzWZPGg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Parallel grouping sets

2020-02-03 Thread Richard Guo
Hi Amit,

Thanks for reviewing these two patches.

On Sat, Jan 25, 2020 at 6:31 PM Amit Kapila  wrote:

>
> This is what I also understood after reading this thread.  So, my
> question is why not just review v3 and commit something on those lines
> even though it would take a bit more time.  It is possible that if we
> decide to go with v5, we can make it happen earlier, but later when we
> try to get v3, the code committed as part of v5 might not be of any
> use or if it is useful, then in which cases?
>

Yes, approach #2 (v3) would generally perform better than approach #1
(v5). I started with approach #1 because it is much easier.

If we decide to go with approach #2, I think we can now concentrate on
the v3 patch.

For the v3 patch, we have another idea, which is to perform a normal
grouping sets aggregation in the final phase, with 'GroupingSetId'
included in the group keys (as described in the previous email). With
this idea, we can avoid a lot of hacky code in the current v3 patch.

Thanks
Richard


Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-03 Thread Kasahara Tatsuhito
On Mon, Feb 3, 2020 at 5:33 PM Fujii Masao  wrote:
> Thanks for explaining that! But heap_beginscan_internal() was really
> called in TidScan case?
Oh, you are right.
Tid Scan started calling table_beginscan() in v12 (commit 147e3722f7),
so previously (v11 and before) Tid Scan may never have called
heap_beginscan_internal().

Therefore, from v12 onward, a Tid Scan not only increments the value of
seq_scan, but also acquires a predicate lock.
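
Roughly why this happens (paraphrased from memory of v12's tableam.h, so
not guaranteed verbatim): nodeTidscan.c now goes through table_beginscan(),
which sets SO_TYPE_SEQSCAN unconditionally, and heapam both counts the scan
as sequential and takes the predicate lock based on that flag.

static inline TableScanDesc
table_beginscan(Relation rel, Snapshot snapshot,
                int nkeys, struct ScanKeyData *key)
{
    uint32 flags = SO_TYPE_SEQSCAN |
        SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

    return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key,
                                       NULL, flags);
}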

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-03 Thread Fujii Masao




On 2020/02/03 16:39, Kasahara Tatsuhito wrote:

On Mon, Feb 3, 2020 at 4:22 PM Fujii Masao  wrote:

On 2020/02/01 16:05, Kasahara Tatsuhito wrote:

  if (scan->rs_base.rs_flags & (SO_TYPE_SEQSCAN | SO_TYPE_SAMPLESCAN))
  {
  /*
   * Ensure a missing snapshot is noticed reliably, even if the
   * isolation mode means predicate locking isn't performed (and
   * therefore the snapshot isn't used here).
   */
  Assert(snapshot);
  PredicateLockRelation(relation, snapshot);
  }

Therefore, we cannot simply remove the SO_TYPE_SEQSCAN flag from a TID scan.
To keep the old behavior, I think it would be better to add a new
SO_TYPE_TIDSCAN flag and take a predicate lock on the entire relation.


But in the old behavior, PredicateLockRelation() was not called in TidScan case
because its flag was not SO_TYPE_SEQSCAN. No?

No. Tid scan called PredicateLockRelation() both previously and currently.

In the current code (v12 and HEAD), Tid scan has the SO_TYPE_SEQSCAN flag,
so PredicateLockRelation() is called in a Tid scan.
In the previous code (v11 and before), heap_beginscan_internal() checks the
is_bitmapscan flag; if is_bitmapscan is false, it calls
PredicateLockRelation().

(- v11)
    if (!is_bitmapscan)
        PredicateLockRelation(relation, snapshot);

And in a Tid scan, is_bitmapscan is set to false, so
PredicateLockRelation() is called in a Tid scan.


Thanks for explaining that! But heap_beginscan_internal() was really
called in TidScan case?

Regards,


--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [Proposal] Global temporary tables

2020-02-03 Thread Konstantin Knizhnik



On 01.02.2020 19:14, 曾文旌(义从) wrote:



On 27.01.2020 17:38, Konstantin Knizhnik wrote:




On 25.01.2020 18:15, 曾文旌(义从) wrote:

I wonder why we need some special check for GTT here.
From my point of view, cleanup at startup of the local storage of temp
tables should be performed in the same way for local and global
temp tables.
After an OOM kill, autovacuum cleans up the orphaned local temp tables
just like other orphan temporary tables: the definition of a local temp
table is deleted together with its storage file.

But that cannot be done for GTT, so we have this implementation in my patch.
If you have other solutions, please let me know.

I wonder whether it is possible that autovacuum or some other Postgres
process is killed by OOM without the postmaster noticing it and
restarting the Postgres instance?
As far as I know, a crash of any process connected to Postgres shared
memory (and autovacuum definitely has such a connection) causes a
Postgres restart.
The postmaster itself does not restart after an OOM kill, but the startup
process does run, and GTT data files are cleaned up in the startup process.


Yes, exactly.
But it is still not clear to me why we need some special handling for
GTT?

Shared memory is reinitialized and the storage of temporary tables is removed.
This is true for both local and global temp tables.





In my design:
1. Because different sessions have different transaction information,
I chose to store the transaction information of GTT in MyProc, not in
the catalog.
2. As for the XID wraparound problem, the cause is the design of the
temp table storage (both local and global temp tables), which makes it
impossible for autovacuum to vacuum them.

It should be completely solved at the storage level.



My point of view is that vacuuming of temp tables is a problem common
to local and global temp tables.
So it has to be addressed in a common way, and we should not try
to fix this problem only for GTT.

I think I agree with you on this point.
However, this does not mean that the GTT transaction information stored in
pg_class is correct.
If you keep it that way, as in global_private_temp-8.patch, it may
cause data loss in GTT after autovacuum.


In my patch autovacuum is prohibited for GTT.


IMHO forced termination of client sessions is not an acceptable solution.

And it is not an absolutely necessary requirement.
So from my point of view we should not add such limitations to the GTT
design.

This limitation makes it possible for GTT to support all the DDL.
IMHO even Oracle's GTT has similar limitations.


I have checked that Oracle does not prevent creation of an index on a GTT
when there are active sessions working with the table, and the index
becomes visible to all of those sessions.




With global_private_temp-8.patch, think about this scenario:
1. Session X spends several hours doing some statistical work with the
GTT A, which generated some data using transaction 100; the work is
not over yet.
2. Then session Y vacuums A, and the GTT's relfrozenxid (in pg_class)
is updated to 1000.

3. Then autovacuum happens, and the clog before 1000 is cleaned up.
4. The data in session X could be lost due to the missing clog, and the
analysis task fails.


However, this is likely to happen because you allowed the GTT to be vacuumed,
and it is not a problem that happens with local temp tables.
I feel uneasy about leaving such a question open. We can improve it.



Maybe the easiest solution is to prohibit explicit vacuum of GTT?
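
Something along these lines, as a purely hypothetical sketch
(RELATION_IS_GLOBAL_TEMP() is an illustrative name, not something that
exists in core or, as far as I know, in the patch):

/*
 * Hypothetical sketch: reject an explicit VACUUM of a global temporary
 * table.  RELATION_IS_GLOBAL_TEMP() is an illustrative macro name only.
 */
if (RELATION_IS_GLOBAL_TEMP(rel))
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("cannot vacuum global temporary table \"%s\"",
                    RelationGetRelationName(rel))));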

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [Proposal] Global temporary tables

2020-02-03 Thread Konstantin Knizhnik




On 31.01.2020 22:38, Robert Haas wrote:

Now, your idea is not quite as crazy as that, but it has the same
basic problem: you can't insert code into a low-level facility that
uses a high level facility which may in turn use and depend on that
very same low-level facility to not be in the middle of an operation.
If you do, it's going to break somehow.



Thank you for the explanation.
You have convinced me that building indexes from _bt_getbuf is not a good idea.
What do you think about the idea of checking and building indexes for a GTT
prior to query execution?


In this case we do not need to patch the code of all index access methods -
it can be done in just one place.
We can use the access method's build function to initialize the index and
populate it with data.


Right now, when building a query execution plan, the optimizer checks
whether an index is valid.
If the index belongs to a GTT, it can also check whether the first page of
the index is initialized and, if not, call the build method for this index.
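
As a purely hypothetical sketch of that check (index_is_gtt() and
gtt_index_needs_build() are illustrative names, not from any patch):

/*
 * Hypothetical sketch: detect a GTT index whose backend-local storage has
 * not been initialized yet and build it through its access method.
 * index_is_gtt() and gtt_index_needs_build() are illustrative names only.
 */
if (index_is_gtt(indexRel) && gtt_index_needs_build(indexRel))
{
    IndexInfo  *indexInfo = BuildIndexInfo(indexRel);

    /* ambuild() initializes the empty index and fills it from the heap */
    (void) indexRel->rd_indam->ambuild(heapRel, indexRel, indexInfo);
}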


If building an index while building the query plan is not desirable, we can
just construct a list of indexes which should be checked, and
perform the check itself and build those indexes somewhere after building
the plan but before execution of the query.


Do you see any problems with such an approach?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Parallel grouping sets

2020-02-03 Thread Richard Guo
Hi Jesse,

Thanks for reviewing these two patches.

On Sat, Jan 25, 2020 at 6:52 AM Jesse Zhang  wrote:

>
> I glanced over both patches. Just the opposite, I have a hunch that v3
> is always better than v5. Here's my 6-minute understanding of both.
>
> v5 (the one with a simple partial aggregate) works by pushing a little
> bit of partial aggregate onto workers, and perform grouping aggregate
> above gather. This has two interesting outcomes: we can execute
> unmodified partial aggregate on the workers, and execute almost
> unmodified rollup aggregate once the trans values are gathered. A
> parallel plan for a query like
>
> SELECT count(*) FROM foo GROUP BY GROUPING SETS (a), (b), (c), ();
>
> can be
>
> Finalize GroupAggregate
>   Output: count(*)
>   Group Key: a
>   Group Key: b
>   Group Key: c
>   Group Key: ()
> Gather Merge
> Partial GroupAggregate
>   Output: PARTIAL count(*)
>   Group Key: a, b, c
> Sort
>   Sort Key: a, b, c
> Parallel Seq Scan on foo
>

Yes, this is the idea of the v5 patch.



> v3 ("the one with grouping set id") really turns the plan from a tree to
> a multiplexed pipe: we can execute grouping aggregate on the workers,
> but only partially. When we emit the trans values, also tag the tuple
> with a group id. After gather, finalize the aggregates with a modified
> grouping aggregate. Unlike a non-split grouping aggregate, the finalize
> grouping aggregate does not "flow" the results from one rollup to the
> next one. Instead, each group only advances on partial inputs tagged for
> the group.
>
> Finalize HashAggregate
>   Output: count(*)
>   Dispatched by: (GroupingSetID())
>   Group Key: a
>   Group Key: b
>   Group Key: c
> Gather
> Partial GroupAggregate
>   Output: PARTIAL count(*), GroupingSetID()
>   Group Key: a
>   Sort Key: b
>   Group Key: b
>   Sort Key: c
>   Group Key: c
> Sort
>   Sort Key: a
> Parallel Seq Scan on foo
>

Yes, this is what the v3 patch does.

We (Pengzhou and I) had an offline discussion on this plan and we have
another idea.  Since we have tagged each tuple produced by the partial
aggregate with 'GroupingSetId', why not then perform a normal grouping
sets aggregation in the final phase, with the 'GroupingSetId' included
in the group keys?  The plan looks like:

# explain (costs off, verbose)
select c1, c2, c3, avg(c3) from gstest group by grouping
sets((c1,c2),(c1),(c2,c3));
QUERY PLAN
--
 Finalize GroupAggregate
   Output: c1, c2, c3, avg(c3)
   Group Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
   ->  Sort
 Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
 Sort Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
 ->  Gather
   Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
   Workers Planned: 4
   ->  Partial HashAggregate
 Output: c1, c2, c3, gset_id, PARTIAL avg(c3)
 Hash Key: gstest.c1, gstest.c2
 Hash Key: gstest.c1
 Hash Key: gstest.c2, gstest.c3
 ->  Parallel Seq Scan on public.gstest
   Output: c1, c2, c3

This plan should be able to give the correct results. We are still
considering whether it is a better plan than the 'multiplexed pipe' plan
in v3. Thoughts here would be appreciated.


> Note that for the first approach to be viable, the partial aggregate
> *has to* use a group key that's the union of all grouping sets. In cases
> where individual columns have a low cardinality but joint cardinality is
> high (say columns a, b, c each has 16 distinct values, but they are
> independent, so there are 4096 distinct values on (a,b,c)), this results
> in fairly high traffic through the shm tuple queue.
>

Yes, you are right. This is the case mentioned by David earlier in [1].
In this case, ideally the parallel plan would lose out when competing with
the non-parallel plan in add_path() and so would not be chosen.

[1] -
https://www.postgresql.org/message-id/CAKJS1f8Q9muALhkapbnO3bPUgAmZkWq9tM_crk8o9=jiiop...@mail.gmail.com

Thanks
Richard