Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-04 Thread Amit Langote
On Wed, Feb 5, 2020 at 3:36 PM Fujii Masao  wrote:
> On 2020/02/04 10:34, Amit Langote wrote:
> > I like:
>
> Thanks for reviewing the patch!
>
> > 1. initializing
> > 2-5 waiting for backup start checkpoint to finish
>
> Can we shorten this to "waiting for checkpoint"? IMO the simpler
> phase name is better, and "to finish" sounds a bit redundant. Also,
> the phase descriptions of pg_stat_progress_create_index basically
> use "waiting for xxx".

"waiting for checkpoint" works for me.

> > 3-3 streaming database files
> > 4-5 waiting for wal archiving to finish
>
> Can we shorten this to "waiting for wal archiving" for the
> same reason as above?

Yes.

> > 5-1 transferring wal (or streaming wal)
>
> IMO "transferring wal" is better because this phase happens only when
> "--wal-method=fetch" is specified in pg_basebackup. "streaming wal"
> seems to imply "--wal-method=stream", instead.

Ah, okay.

> > Reading this:
> >
> > + backup_total
> > + bigint
> > + 
> > +  Total amount of data that will be streamed. If progress reporting
> > +  is not enabled in pg_basebackup
> > +  (i.e., --progress option is not specified),
> > +  this is 0.
> >
> > I wonder how hard would it be to change basebackup.c to always set
> > backup_total, irrespective of whether --progress is specified with
> > pg_basebackup or not?  It seems quite misleading to leave it set to 0,
> > because one may panic that they have lost their data, that is, if they
> > haven't first read the documentation.
>
> Yeah, I understand your concern. The pg_basebackup document explains
> the risk when --progress is specified, as follows. Since I imagined that
> someone may explicitly disable --progress to avoid this risk, I made
> the server estimate the total size only when --progress is specified.
> But do you think that this overhead from --progress is negligibly small?
>
> 
> When this is enabled, the backup will start by enumerating the size of
> the entire database, and then go back and send the actual contents.
> This may make the backup take slightly longer, and in particular it will
> take longer before the first data is sent.
> 

Sorry, I hadn't read this before.  So, my proposal would make this a lie.

Still, if "streaming database files" is the longest phase, then not
having even an approximation of how much data is to be streamed
doesn't help much with estimating progress, at least as long as this
view is the only thing one has to look at.

That said, the overhead of checking the size before sending any data
may be worse for some people than others, so having the option to
avoid that might be good after all.
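
(For context, the server side reports these phases through the existing
command-progress machinery. Below is a minimal sketch of what that could look
like in basebackup.c; the PROGRESS_COMMAND_BASEBACKUP and PROGRESS_BASEBACKUP_*
constant names, as well as the progress_requested and total_size variables, are
assumptions standing in for whatever the patch defines, while
pgstat_progress_start_command() and pgstat_progress_update_param() are the
pre-existing backend functions.)

/* Sketch only -- constant and variable names are assumptions, not the patch. */
pgstat_progress_start_command(PROGRESS_COMMAND_BASEBACKUP, InvalidOid);

/* e.g. the "waiting for checkpoint" phase agreed on above */
pgstat_progress_update_param(PROGRESS_BASEBACKUP_PHASE,
							 PROGRESS_BASEBACKUP_PHASE_WAIT_CHECKPOINT);

/*
 * The total size is only estimated when the client requested --progress
 * (the PROGRESS option of the BASE_BACKUP replication command); otherwise
 * backup_total stays 0, which is the behavior being debated here.
 */
if (progress_requested)
	pgstat_progress_update_param(PROGRESS_BASEBACKUP_BACKUP_TOTAL,
								 total_size);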

Thanks,
Amit




Re: [Proposal] Global temporary tables

2020-02-04 Thread Konstantin Knizhnik




On 05.02.2020 00:38, Robert Haas wrote:


My guess it that the right time to do this work is just after we
acquire locks, at the end of parse analysis. I think trying to do it
during execution is too late, since the planner looks at indexes, and
trying to do it in the planner instead of before we start planning
seems more likely to cause bugs and has no real advantages. It's just
better to do complicated things (like creating indexes) separately
rather than in the middle of some other complicated thing (like
planning). I could tie my shoelaces the first time they get tangled up
with my brake pedal, but it's better to do it before I get in the car.

I have implemented this approach in my new patch

https://www.postgresql.org/message-id/3e88b59f-73e8-685e-4983-9026f94c57c5%40postgrespro.ru

I have added a check of whether the index is initialized or not to
plancat.c, where the optimizer checks if the index is valid.
Now it should work for all kinds of indexes (B-Tree, hash, user-defined
access methods...).


And I'm still inclined to do it by flat-copying files rather than
calling ambuild. It will be slightly faster, but also importantly, it
will guarantee that (1) every backend gets exactly the same initial
state and (2) it has fewer ways to fail because it doesn't involve
calling any user-defined code. Those seem like fairly compelling
advantages, and I don't see what the disadvantages are. I think
calling ambuild() at the point in time proposed in the preceding
paragraph would be fairly safe and would probably work OK most of the
time, but I can't think of any reason it would be better.


There is a very important reason (from my point of view): it allows other
sessions to use the created index and so provides behavior compatible with
regular tables (and with Oracle).
So we should be able to populate the index with existing GTT data,
and ambuild will do that.



Incidentally, what I'd be inclined to do is - if the session is
running a query that does only read-only operations, let it continue
to point to the "master" copy of the GTT and its indexes, which is
stored in the relfilenodes indicated for those relations in pg_class.
If it's going to acquire a lock heavier than AccessShareLock, then
give it its own copies of the table and indexes, stored in a temporary
relfilenode (tXXX_YYY) and redirect all future access to that GTT by
this backend to there. Maybe there's some reason this won't work, but
it seems nice to avoid saying that we've "attached" to the GTT if all
we did is read the empty table.

Sorry, I do not understand the benefits of such an optimization. It seems
to be a very rare situation for a session to access a temp table that was
not previously filled with data. But even if it happens, keeping the
"master" copy will not save much: in any case we have shared metadata and
no data. Yes, with the current approach, the first access to a GTT will
cause the creation of empty indexes, but that is just the initialization
of 1-3 pages. I do not think that delaying index initialization can be
really useful.


In any case, calling ambuild is the simplest and most universal approach,
providing the desired and compatible behavior.
I really do not understand why we should try to invent some alternative
solution.







Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-02-04 Thread Kasahara Tatsuhito
On Mon, Feb 3, 2020 at 6:20 PM Kasahara Tatsuhito
 wrote:
> Therefore, from v12, Tid scan not only increases the value of
> seq_scan, but also acquires a predicate lock.

Based on further investigation and Fujii's advice, I've summarized
this issue as follows.

Since commit 147e3722f7, Tid Scan
(A) increments the number of seq_scan on pg_stat_*_tables
and
(B) takes a predicate lock on the entire relation.

(A) may be confusing to users, so I think it is better to fix it.
For (B), an unexpected serialization error occurs as shown below, so
I think it should be fixed as well.

=
[Preparation]
CREATE TABLE tid_test (c1 int, c2 int);
INSERT INTO  tid_test SELECT generate_series(1,1000), 0;


[Session-1:]
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
  [Session-2:]
  BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE ;
[Session-1:]
SELECT * FROM tid_test WHERE ctid = '(0,1)';
  [Session-2:]
  SELECT * FROM tid_test WHERE ctid = '(1,1)';
[Session-1:]
INSERT INTO tid_test SELECT 1001, 10;
  [Session-2:]
  INSERT INTO tid_test SELECT 1001, 10;
[Session-1:]
COMMIT;
  [Session-2:]
  COMMIT;

Result:
(-v11): Both sessions could commit.
(v12-): Session-2 raised the following error because a predicate lock
was taken on the entire table...

ERROR:  could not serialize access due to read/write dependencies
among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during
commit attempt.
HINT:  The transaction might succeed if retried.

=

The attached patch fixes both (A) and (B), so that the behavior of Tid Scan
goes back to the same as in v11.
(As a result, this patch is the same as the one attached first.)

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_issues.patch
Description: Binary data


Re: Expose lock group leader pid in pg_stat_activity

2020-02-04 Thread Julien Rouhaud
On Wed, Feb 05, 2020 at 10:48:31AM +0900, Michael Paquier wrote:
> On Tue, Feb 04, 2020 at 03:27:25PM +0100, Julien Rouhaud wrote:
> > I added some code comments to remind that we don't guarantee any consistency
> > here.
> 
> That's mostly fine.  I have moved the comment related to
> AuxiliaryPidGetProc() within the inner part of its "if" (or the
> comment should be changed to be conditional).  An extra thing is that
> nulls[29] was not set to true for a user without the proper permission
> rights.

Oh, oops indeed.

> Does that look fine to you?

This looks good, thanks a lot!




Re: ALTER tbl rewrite loses CLUSTER ON index

2020-02-04 Thread Amit Langote
Hi Justin,

On Mon, Feb 3, 2020 at 1:17 AM Justin Pryzby  wrote:
> Other options are preserved by ALTER (and CLUSTER ON is and most obviously
> should be preserved by CLUSTER's rewrite), so I think (SET) CLUSTER should be
> preserved by ALTER, too.

Yes.

create table foo (a int primary key);
cluster foo;
ERROR:  there is no previously clustered index for table "foo"
cluster foo using foo_pkey;
alter table foo alter a type bigint;
cluster foo;
ERROR:  there is no previously clustered index for table "foo"

With your patch, this last error doesn't occur.

Like you, I too suspect that losing indisclustered like this is
unintentional, so should be fixed.

> As far as I can see, this should be the responsibility of something in the
> vicinity of ATPostAlterTypeParse/RememberIndexForRebuilding.
>
> Attach patch sketches a fix.

While your sketch hits pretty close, it could be done a bit
differently.  For one, I don't like the way it's misusing
changedIndexOids and changedIndexDefs.

Instead, we can do something similar to what
RebuildConstraintComment() does for constraint comments.  For
example, we can have a PreserveClusterOn() that adds an AT_ClusterOn
command to the table's AT_PASS_OLD_INDEX pass commands.  The attached patch
shows what I'm thinking.  I also added representative tests.

Thanks,
Amit
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f599393473..ea9941278e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -490,6 +490,7 @@ static void ATPostAlterTypeParse(Oid oldId, Oid oldRelId, 
Oid refRelId,
 static void RebuildConstraintComment(AlteredTableInfo *tab, int pass,
 Oid 
objid, Relation rel, List *domname,
 const 
char *conname);
+static void PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid);
 static void TryReuseIndex(Oid oldId, IndexStmt *stmt);
 static void TryReuseForeignKey(Oid oldId, Constraint *con);
 static ObjectAddress ATExecAlterColumnGenericOptions(Relation rel, const char 
*colName,
@@ -11832,6 +11833,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid 
refRelId, char *cmd,
newcmd->def = (Node *) stmt;
tab->subcmds[AT_PASS_OLD_INDEX] =
lappend(tab->subcmds[AT_PASS_OLD_INDEX], 
newcmd);
+
+   /* Preserve index's indisclustered property, if set. */
+   PreserveClusterOn(tab, AT_PASS_OLD_INDEX, oldId);
}
else if (IsA(stm, AlterTableStmt))
{
@@ -11868,6 +11872,9 @@ ATPostAlterTypeParse(Oid oldId, Oid oldRelId, Oid 
refRelId, char *cmd,

 rel,

 NIL,

 indstmt->idxname);
+
+   /* Preserve index's indisclustered 
property, if set. */
+   PreserveClusterOn(tab, 
AT_PASS_OLD_INDEX, indoid);
}
else if (cmd->subtype == AT_AddConstraint)
{
@@ -11990,6 +11997,37 @@ RebuildConstraintComment(AlteredTableInfo *tab, int 
pass, Oid objid,
tab->subcmds[pass] = lappend(tab->subcmds[pass], newcmd);
 }
 
+/*
+ * For an index on a table that's being recreated due to PostAlterType
+ * processing, preserve its indisclustered property by issuing ALTER TABLE
+ * CLUSTER ON on the table, to run after the command to re-create the index.
+ */
+static void
+PreserveClusterOn(AlteredTableInfo *tab, int pass, Oid indoid)
+{
+   HeapTuple   indexTuple;
+   Form_pg_index indexForm;
+
+   Assert(OidIsValid(indoid));
+   Assert(pass == AT_PASS_OLD_INDEX);
+
+   indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(indoid));
+   if (!HeapTupleIsValid(indexTuple))
+   elog(ERROR, "cache lookup failed for index %u", indoid);
+   indexForm = (Form_pg_index) GETSTRUCT(indexTuple);
+
+   if (indexForm->indisclustered)
+   {
+   AlterTableCmd *newcmd = makeNode(AlterTableCmd);
+
+   newcmd->subtype = AT_ClusterOn;
+   newcmd->name = get_rel_name(indoid);
+   tab->subcmds[pass] = lappend(tab->subcmds[pass], newcmd);
+   }
+
+   ReleaseSysCache(indexTuple);
+}
+
 /*
  * Subroutine for ATPostAlterTypeParse().  Calls out to CheckIndexCompatible()
  * for the real analysis, then mutates the IndexStmt based on that verdict.
diff --git a/src/test/regress/expected/alter_table.out 
b/src/test/regress/expected/alter_table.out
index e73ce4b6f3..010a7aa98b 100644
--- a/src/test/regress/expected/alter_table.out
+++ 

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2020-02-04 Thread Masahiko Sawada
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila  wrote:
>
> On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada  
> wrote:
> >
> > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas  wrote:
> > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund  wrote:
> > >>> I think the real question is whether the scenario is common enough to
> > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > >>> doing many bulk loads at the same time that all happened to hash to
> > >>> the same bucket.
> > >>
> > >> With a bunch of parallel bulkloads into partitioned tables that really
> > >> doesn't seem that unlikely?
> > >
> > > It increases the likelihood of collisions, but probably decreases the
> > > number of cases where the contention gets really bad.
> > >
> > > For example, suppose each table has 100 partitions and you are
> > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > will have some collisions, but the amount of contention within each
> > > bucket will remain fairly low because each backend spends only 1% of
> > > its time in the bucket corresponding to any given partition.
> > >
> >
> > I share another result of performance evaluation between current HEAD
> > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> >
> > Type of table: normal table, unlogged table
> > Number of child tables : 16, 64 (all tables are located on the same 
> > tablespace)
> > Number of clients : 32
> > Number of trials : 100
> > Duration: 180 seconds for each trials
> >
> > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > RAM, NVMe SSD 1.5TB.
> > Each clients load 10kB random data across all partitioned tables.
> >
> > Here is the result.
> >
> >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > +--+-++--
> >  16 | normal   | HEAD|   1643.833 |
> >  16 | normal   | Patched |  1619.5404 |  0.985222
> >  16 | unlogged | HEAD|  9069.3543 |
> >  16 | unlogged | Patched |  9368.0263 |  1.032932
> >  64 | normal   | HEAD|   1598.698 |
> >  64 | normal   | Patched |  1587.5906 |  0.993052
> >  64 | unlogged | HEAD|  9629.7315 |
> >  64 | unlogged | Patched | 10208.2196 |  1.060073
> > (8 rows)
> >
> > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > whereas it increased 3% ~ 6% for unlogged tables. There were
> > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > in the 64 child tables case but it didn't seem to affect the tps.
> >
>
> AFAIU, this resembles the workload that Andres was worried about.  I
> think we should run this test once more in a different environment, but
> assuming it is correct and repeatable, where do we go with this patch,
> especially when we know it improves many workloads [1] as well?  We know
> that on a pathological case constructed by Mithun [2], this causes a
> regression as well.  I am not sure whether the test done by Mithun
> really mimics any real-world workload, as he tested by making
> N_RELEXTLOCK_ENTS = 1 to hit the worst case.
>
> Sawada-San, if you have a script or data for the test done by you,
> then please share it so that others can also try to reproduce it.

Unfortunately the environment I used for performance verification is
no longer available.

I agree with running this test in a different environment. I've attached
the rebased version of the patch. I'm measuring the performance with and
without the patch and will share the results.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v14-0001-Move-relation-extension-locks-out-of-heavyweigth.patch
Description: Binary data


Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-02-04 Thread Fujii Masao




On 2020/02/04 10:34, Amit Langote wrote:

On Mon, Feb 3, 2020 at 11:04 PM Fujii Masao  wrote:

So we now have the following ideas about the phase names for pg_basebackup.

1.
initializing

2.
2-1. starting backup
2-2. starting file transfer
2-3. performing pg_start_backup
2-4. performing checkpoint
2-5. waiting for [ backup start ] checkpoint to finish

3.
3-1. streaming backup
3-2. transferring database files
3-3. streaming database files
3-4. transferring files

4.
4-1. stopping backup
4-2. finishing file transfer
4-3. performing pg_stop_backup
4-4. finishing backup
4-5. waiting for WAL archiving to finish
4-6. performing WAL archive

5.
1. transferring wal
2. transferring WAL files

What combination of these do you prefer?


I like:


Thanks for reviewing the patch!


1. initializing
2-5 waiting for backup start checkpoint to finish


Can we shorten this to "waiting for checkpoint"? IMO the simpler
phase name is better, and "to finish" sounds a bit redundant. Also,
the phase descriptions of pg_stat_progress_create_index basically
use "waiting for xxx".


3-3 streaming database files
4-5 waiting for wal archiving to finish


Can we shorten this to "waiting for wal archiving" for the
same reason as above?


5-1 transferring wal (or streaming wal)


IMO "transferring wal" is better because this phase happens only when
"--wal-method=fetch" is specified in pg_basebackup. "streaming wal"
seems to imply "--wal-method=stream", instead.


-  
+  

I don't see any new text in the documentation patch that uses above
xref, so no need to define it?


The following description that I added uses this.

  certain commands during command execution.  Currently, the only commands
  which support progress reporting are ANALYZE,
  CLUSTER,
-   CREATE INDEX, and VACUUM.
+   CREATE INDEX, VACUUM,
+   and  (i.e., replication
+   command that  issues to take
+   a base backup).


Sorry, I missed that.  I was mistakenly expecting a different value of linkend.

Some comments on v3:

+ Process ID of a WAL sender process.

"a" sounds redundant.  Maybe:

of this WAL sender process or
of WAL sender process


I borrowed "Process ID of a WAL sender process" from the description
of pg_stat_replication.pid. But if it's better to get rid of "a",
I'm happy to do that!


Reading this:

+ backup_total
+ bigint
+ 
+  Total amount of data that will be streamed. If progress reporting
+  is not enabled in pg_basebackup
+  (i.e., --progress option is not specified),
+  this is 0.

I wonder how hard would it be to change basebackup.c to always set
backup_total, irrespective of whether --progress is specified with
pg_basebackup or not?  It seems quite misleading to leave it set to 0,
because one may panic that they have lost their data, that is, if they
haven't first read the documentation.


Yeah, I understand your concern. The pg_basebackup document explains
the risk when --progress is specified, as follows. Since I imagined that
someone may explicitly disable --progress to avoid this risk, I made
the server estimate the total size only when --progress is specified.
But do you think that this overhead from --progress is negligibly small?


When this is enabled, the backup will start by enumerating the size of
the entire database, and then go back and send the actual contents.
This may make the backup take slightly longer, and in particular it will
take longer before the first data is sent.


If we really can always estimate the total size, whether --progress is
specified or not, we should get rid of the PROGRESS option from the
BASE_BACKUP replication command because it will no longer be necessary, I think.

Regards,  


--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [PATCH v1] Allow COPY "text" format to output a header

2020-02-04 Thread Surafel Temesgen
Hi,
On Tue, Feb 4, 2020 at 4:25 PM Rémi Lapeyre  wrote:

> This patch adds the possibility to use the "header" option when using COPY
> with
> the text format. A todo entry was opened for this and I updated the tests
> and
> the documentation.
>
> This was previously discussed at
> https://www.postgresql.org/message-id/flat/CACfv%2BpJ31tesLvncJyP24quo8AE%2BM0GP6p6MEpwPv6yV8%3DsVHQ%40mail.gmail.com
>
>
FWIW there was a more recent proposed patch at
https://www.postgresql.org/message-id/flat/caf1-j-0ptcwmeltswwgv2m70u26n4g33gpe1rckqqe6wvqd...@mail.gmail.com
and among the feedback given was to add a header matching feature to this.

regards
Surafel


A bug in LWLOCK_STATS

2020-02-04 Thread Fujii Masao

Hi,

When I compiled PostgreSQL with -DLWLOCK_STATS and tried to check
the statistics of light-weight locks, I observed that more than one
statistics entry was output *for the same backend process and
the same lwlock*. For example, I got the following four entries
when I checked how the process with PID 81141 processed ProcArrayLock.
This is strange; IMO only one statistics entry should be output for
the same backend process and lwlock.

$ grep "PID 81141 lwlock ProcArrayLock" 
data/log/postgresql-2020-02-05_141842.log
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 4000 exacq 0 blk 0 spindelay 
0 dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 2 exacq 0 blk 0 spindelay 0 
dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 6001 exacq 1 blk 0 spindelay 
0 dequeue self 0
PID 81141 lwlock ProcArrayLock 0x111e87780: shacq 5 exacq 1 blk 0 spindelay 0 
dequeue self 0

The cause of this issue is that the key variable used for the lwlock hash
table was not fully initialized. The key consists of two fields, and
they are initialized as follows. But the 4 bytes allocated for
alignment padding were not initialized. So even if the same key was
specified, hash_search(HASH_ENTER) could not find the existing
entry for that key and created a new one.

key.tranche = lock->tranche;
key.instance = lock;
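
To illustrate the padding problem concretely, here is a small standalone
sketch (demo types only, not the actual lwlock.c definitions): a struct with
a 4-byte member followed by a pointer has 4 bytes of padding on typical
64-bit platforms, and a byte-wise key comparison such as the one done on raw
hash keys sees those padding bytes.

#include <stdio.h>
#include <string.h>

/* Demo key with implicit padding between a 4-byte and an 8-byte member. */
typedef struct demo_key
{
	int		tranche;	/* 4 bytes */
	/* 4 padding bytes inserted here on typical 64-bit ABIs */
	void   *instance;	/* 8 bytes */
} demo_key;

int
main(void)
{
	static int	dummy;
	demo_key	a, b;

	/* Simulate different stack garbage in the padding bytes. */
	memset(&a, 0xAA, sizeof(a));
	memset(&b, 0xBB, sizeof(b));
	a.tranche = 1;	a.instance = &dummy;
	b.tranche = 1;	b.instance = &dummy;

	/* A byte-wise compare of the raw key bytes sees the padding. */
	printf("without zeroing: %s\n",
		   memcmp(&a, &b, sizeof(a)) == 0 ? "equal" : "NOT equal");

	/* The fix: zero the whole struct before filling in the fields. */
	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.tranche = 1;	a.instance = &dummy;
	b.tranche = 1;	b.instance = &dummy;
	printf("with zeroing:    %s\n",
		   memcmp(&a, &b, sizeof(a)) == 0 ? "equal" : "NOT equal");

	return 0;
}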

Attached is the patch fixing this issue by zero-initializing the key
variable. With the patch applied, I confirmed that only one statistics
entry is output for the same process and the same lwlock.
This patch also greatly reduces the volume of the lwlock statistics
output.

This issue was introduced by commit 3761fe3c20, so the patch needs
to be back-patched to v10.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/src/backend/storage/lmgr/lwlock.c 
b/src/backend/storage/lmgr/lwlock.c
index d07ce609d4..4c14e51c67 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -310,6 +310,7 @@ get_lwlock_stats_entry(LWLock *lock)
		return &lwlock_stats_dummy;
 
/* Fetch or create the entry. */
+	MemSet(&key, 0, sizeof(key));
key.tranche = lock->tranche;
key.instance = lock;
	lwstats = hash_search(lwlock_stats_htab, &key, HASH_ENTER, &found);


Re: Portal->commandTag as an enum

2020-02-04 Thread Mark Dilger



> On Feb 4, 2020, at 7:34 PM, Andres Freund  wrote:
> 
> Hi,

Thanks for reviewing!  I am pretty much in agreement with your comments, below.

> On 2020-02-04 18:18:52 -0800, Mark Dilger wrote:
>> In master, a number of functions pass a char *completionTag argument (really 
>> a char completionTag[COMPLETION_TAG_BUFSIZE]) which gets filled in with the 
>> string to return to the client from EndCommand.  I have removed that kind of 
>> logic:
>> 
>> -   /* save the rowcount if we're given a completionTag to fill 
>> */
>> -   if (completionTag)
>> -   snprintf(completionTag, COMPLETION_TAG_BUFSIZE,
>> -"SELECT " UINT64_FORMAT,
>> -queryDesc->estate->es_processed);
>> 
>> In the patch, this is replaced with a new struct, QueryCompletionData.  That 
>> bit of code above is replaced with:
>> 
>> +   /* save the rowcount if we're given a qc to fill */
>> +   if (qc)
>> +   SetQC(qc, COMMANDTAG_SELECT, 
>> queryDesc->estate->es_processed, DISPLAYFORMAT_NPROCESSED);
>> 
>> For wire protocol compatibility, we have to track the display format.
>> When this gets to EndCommand, the display format allows the string to
>> be written exactly as the client will expect.  If we ever get to the
>> point where we can break with that compatibility, the third member of
>> this struct, display_format, can be removed.
> 
> Hm. While I like not having this as strings a lot, I wish we could get
> rid of this displayformat stuff.

Agreed, but I don’t know how. 

>> These are replaced by switch() case statements over the possible commandTags:
>> 
>> +   switch (commandTag)
>> +   {
>> +   /*
>> +* Supported idiosyncratic special cases.
>> +*/
>> +   case COMMANDTAG_ALTER_DEFAULT_PRIVILEGES:
>> +   case COMMANDTAG_ALTER_LARGE_OBJECT:
>> +   case COMMANDTAG_COMMENT:
>> +   case COMMANDTAG_CREATE_TABLE_AS:
>> +   case COMMANDTAG_DROP_OWNED:
>> +   case COMMANDTAG_GRANT:
>> +   case COMMANDTAG_IMPORT_FOREIGN_SCHEMA:
>> +   case COMMANDTAG_REFRESH_MATERIALIZED_VIEW:
>> +   case COMMANDTAG_REVOKE:
>> +   case COMMANDTAG_SECURITY_LABEL:
>> +   case COMMANDTAG_SELECT_INTO:
> 
> The number of these makes me wonder if we should just have a metadata
> table in one place, instead of needing to edit multiple
> locations. Something like
> 
> const ... CommandTagBehaviour[] = {
>[COMMANDTAG_INSERT] = {
>.display_processed = true, .display_oid = true, ...},
>[COMMANDTAG_CREATE_TABLE_AS] = {
>.event_trigger = true, ...},
> 
> with the zero initialized defaults being the common cases.
> 
> Not sure if it's worth going there. But it's maybe worth thinking about
> for a minute?

Yes, I was thinking about something like this, only I had in mind a Bitmapset 
for these.  It just so happens that there are 192 enum values, 0..191, which 
happens to fit in 3 64bit words plus a varlena header.  Mind you, that nice 
property would be immediately blown if we added another entry to the enum.  Has 
anybody made a compile-time static version of Bitmapset?  We could store this 
information in either 24 bytes or 32 bytes, depending on whether we add another 
enum value.
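
For what it's worth, one way to get a compile-time constant bitmap without
Bitmapset's palloc'd varlena representation is a plain fixed-size array of
64-bit words. A minimal sketch follows, with demo tag values standing in for
the patch's COMMANDTAG_* enum; the helper names (ConstTagSet, tagset_member)
are hypothetical, not anything from the patch.

#include <stdbool.h>
#include <stdint.h>

/* Demo tag values only; the real enum in the patch has ~192 values. */
enum { DEMO_TAG_INSERT = 1, DEMO_TAG_SELECT = 67, DEMO_TAG_MAX = 191 };

#define TAGSET_WORDS	((DEMO_TAG_MAX / 64) + 1)	/* 3 words for 0..191 */
#define TAGSET_BIT(tag)	(UINT64_C(1) << ((tag) % 64))

typedef struct ConstTagSet
{
	uint64_t	words[TAGSET_WORDS];
} ConstTagSet;

static inline bool
tagset_member(const ConstTagSet *set, int tag)
{
	return (set->words[tag / 64] & TAGSET_BIT(tag)) != 0;
}

/*
 * Hypothetical example: the set of tags whose completion string carries a
 * row count.  Designated initializers keep the unset words zero, and the
 * whole thing lives in read-only storage -- no palloc, no varlena header.
 */
static const ConstTagSet rowcount_tags = {
	.words = {
		[DEMO_TAG_INSERT / 64] = TAGSET_BIT(DEMO_TAG_INSERT),
		[DEMO_TAG_SELECT / 64] = TAGSET_BIT(DEMO_TAG_SELECT),
	},
};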

Getting a little off topic, I was also thinking about having a counting 
Bitmapset that would store one bit per enum that is included, and then a sparse 
array of counts, perhaps with one byte counts for [0..127] and 8 byte counts 
for [128..huge] that we could use in shared memory for the pg_stat_tag work.  
Is there anything like that?

Anyway, I don’t think we should invent lots of different structures for 
CommandTag tracking, so something that serves double duty might keep the code 
tighter.  I’m already using ordinary Bitmapset over CommandTags in 
event_trigger, so naturally that comes to mind for this, too.


>> Averages for test set 1 by scale:
>> set  scale   tps avg_latency 90%> 11   37411.734   3.162   133.718
>> 19   61240.904   1.05230.547
>> 181  59210.931   1.015   67.023
>> 
>> Averages for test set 1 by clients:
>> set  clients tps avg_latency 90%> 11   21630.461   0.514   24.414
>> 14   59680.675   0.791   40.354
>> 116  76552.433   3.922   366.519
>> 
>> 
>> For command tag patch (branched from 1fd687a035):
>> 
>>  postgresql % find src -type f -name "*.c" -or -name "*.h" | xargs cat | 
>> wc
>>   1482969 5691908 45281399
>> 
>>  postgresql % find src -type f -name "*.o" | xargs cat | wc
>> 38209  476243 18999752
>> 
>> 
>> Averages for test set 1 by scale:
>> set  scale   tps avg_latency 90%> 11   3877   

Re: table partition and column default

2020-02-04 Thread Fujii Masao




On 2020/02/04 13:56, Amit Langote wrote:

On Thu, Dec 26, 2019 at 6:21 PM Julien Rouhaud  wrote:

On Thu, Dec 26, 2019 at 7:12 AM Fujii Masao  wrote:

Thanks for reviewing the patch. Committed!


I saw that you only pushed it on master, shouldn't we backpatch it
down to pg10 as this is the declarative partitioning behavior since
the beginning?


I had meant to reply to this but somehow forgot.

I agree that it might be a good idea to back-patch this down to PG 10.


Back-patched to v10. Thanks Julien and Amit for pointing this out!

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [HACKERS] kqueue

2020-02-04 Thread Thomas Munro
On Wed, Jan 29, 2020 at 11:54 AM Thomas Munro  wrote:
> If there are no further objections, I'm planning to commit this sooner
> rather than later, so that it gets plenty of air time on developer and
> build farm machines.  If problems are discovered on a particular
> platform, there's a pretty good escape hatch: you can define
> WAIT_USE_POLL, and if it turns out to be necessary, we could always do
> something in src/template similar to what we do for semaphores.

I updated the error messages to match the new "unified" style, adjusted
a couple of comments, and pushed.  Thanks to all the people who
tested.  I'll keep an eye on the build farm.




Re: base backup client as auxiliary backend process

2020-02-04 Thread Masahiko Sawada
On Mon, 3 Feb 2020 at 20:06, Andres Freund  wrote:
>
> Hi,
>
> On 2020-01-11 10:52:30 +0100, Peter Eisentraut wrote:
> > On 2020-01-10 04:32, Masahiko Sawada wrote:
> > > I agreed that these patches are useful on its own and 0001 patch and
> >
> > committed 0001
>
> over on -committers Robert complained:
>
> On 2020-01-23 15:49:37 -0500, Robert Haas wrote:
> > On Tue, Jan 14, 2020 at 8:57 AM Peter Eisentraut  
> > wrote:
> > > walreceiver uses a temporary replication slot by default
> > >
> > > If no permanent replication slot is configured using
> > > primary_slot_name, the walreceiver now creates and uses a temporary
> > > replication slot.  A new setting wal_receiver_create_temp_slot can be
> > > used to disable this behavior, for example, if the remote instance is
> > > out of replication slots.
> > >
> > > Reviewed-by: Masahiko Sawada 
> > > Discussion: 
> > > https://www.postgresql.org/message-id/CA%2Bfd4k4dM0iEPLxyVyme2RAFsn8SUgrNtBJOu81YqTY4V%2BnqZA%40mail.gmail.com
> >
> > Neither the commit message for this patch nor any of the comments in
> > the patch seem to explain why this is a desirable change.
> >
> > I assume that's probably discussed on the thread that is linked here,
> > but you shouldn't have to dig through the discussion thread to figure
> > out what the benefits of a change like this are.
>
> which I fully agree with.
>
>
> It's not at all clear to me that the potential downsides of this have
> been fully thought through. And even if they have, they've not been
> documented.
>
> Previously if a standby without a slot was slow receiving WAL,
> e.g. because the network bandwidth was insufficient, it'd at some point
> just fail because the required WAL is removed. But with this patch that
> won't happen - instead the primary will just run out of space. At the
> very least this would need to add documentation of this caveat to a few
> places.

+1 to add downsides to the documentation.

It might not normally happen, but with this parameter we will need to
set max_replication_slots high enough, because the standby will fail to
start after failover when the slots are full.

WAL required by the standby could be removed on the primary because the
standby is lagging too much, for example when the standby has been stopped
for a long time or when the standby is running but delayed for some reason.
This feature prevents WAL removal in the latter case; that is, we can
ensure that required WAL is not removed while replication is running.
For the former case we can use a permanent replication slot. Although
there is a risk of running out of space, I personally think this
behavior is better for most cases.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-04 Thread Dilip Kumar
On Wed, Feb 5, 2020 at 9:27 AM Amit Kapila  wrote:
>
> On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar  wrote:
> >
> > On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila  
> > wrote:
> > >
> > >
> > > One more thing we can do is to identify whether the tuple belongs to
> > > toast relation while decoding it.  However, I think to do that we need
> > > to have access to relcache at that time and that might add some
> > > overhead as we need to do that for each tuple.  Can we investigate
> > > what it will take to do that and if it is better than setting a bit
> > > during WAL logging.
> > >
> > I have done some more analysis on this and it appears that there are
> > few problems in doing this.  Basically, once we get the confirmed
> > flush location, we advance the replication_slot_catalog_xmin so that
> > vacuum can garbage collect the old tuple.  So the problem is that
> > while we are collecting the changes in the ReorderBuffer, our catalog
> > version might have been removed, and we might not find any relation
> > entry with that relfilenode id (because it was dropped or altered
> > later).
> >
>
> Hmm, this means this can also occur while streaming the changes.  The
> main reason as I understand is that it is because before decoding
> commit, we don't know whether these changes are already sent to the
> subscriber (based on confirmed_flush_location/start_decoding_at).
Right.

> I think it is better to skip streaming such transactions as we can't
> make the right decision about these and as this can happen generally
> after the crash for the first few transactions, it shouldn't matter
> much if we serialize such transactions instead of streaming them.

I think the idea makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-04 Thread Dilip Kumar
On Fri, Jan 31, 2020 at 8:08 AM Amit Kapila  wrote:
>
> On Thu, Jan 30, 2020 at 6:10 PM Dilip Kumar  wrote:
> >
> > On Thu, Jan 30, 2020 at 4:11 PM Amit Kapila  wrote:
> > >
> > > Also, if we need to copy the snapshot here, then do we need to again
> > > copy it in ReorderBufferProcessTXN(in below code and in catch block in
> > > the same function).
> > I think so because as part of the
> > "REORDER_BUFFER_CHANGE_INTERNAL_SNAPSHOT" change, we might directly
> > point to the snapshot and that will get truncated when we truncate all
> > the changes of the ReorderBufferTXN.   So I think we can check if
> > snapshot_now->copied is true then we can avoid copying otherwise we
> > can copy?
> >
>
> Yeah, that makes sense, but I think then we also need to ensure that
> ReorderBufferStreamTXN frees the snapshot only when it is copied.  It
> seems to me it should be always copied in the place where we are
> trying to free it, so probably we should have an Assert there.
>
> One more thing:
> ReorderBufferProcessTXN()
> {
> ..
> + if (streaming)
> + {
> + /*
> + * While streaming an in-progress transaction there is a
> + * possibility that the (sub)transaction might get aborted
> + * concurrently.  In such case if the (sub)transaction has
> + * catalog update then we might decode the tuple using wrong
> + * catalog version.  So for detecting the concurrent abort we
> + * set CheckXidAlive to the current (sub)transaction's xid for
> + * which this change belongs to.  And, during catalog scan we
> + * can check the status of the xid and if it is aborted we will
> + * report an specific error which we can ignore.  We might have
> + * already streamed some of the changes for the aborted
> + * (sub)transaction, but that is fine because when we decode the
> + * abort we will stream abort message to truncate the changes in
> + * the subscriber.
> + */
> + CheckXidAlive = change->txn->xid;
> + }
> ..
> }
>
> I think it is better to move the above code into an inline function
> (something like SetXidAlive).  It will make the code in function
> ReorderBufferProcessTXN look cleaner and easier to understand.
>
Fixed in the latest version sent upthread.
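
(For readers following along, here is a minimal sketch of what such a helper
could look like. The name SetXidAlive comes from the suggestion above, and
the body simply wraps the assignment quoted there, so this is illustrative
rather than the actual patch.)

/* Illustrative only; CheckXidAlive is the variable set in the quoted hunk. */
static inline void
SetXidAlive(TransactionId xid)
{
	/*
	 * Remember the xid of the (sub)transaction whose change is being
	 * decoded, so catalog scans can detect a concurrent abort.
	 */
	CheckXidAlive = xid;
}

/* ... and in ReorderBufferProcessTXN(): */
if (streaming)
	SetXidAlive(change->txn->xid);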

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-02-04 Thread Amit Kapila
On Tue, Feb 4, 2020 at 11:00 AM Dilip Kumar  wrote:
>
> On Tue, Jan 28, 2020 at 11:43 AM Amit Kapila  wrote:
> >
> >
> > One more thing we can do is to identify whether the tuple belongs to
> > toast relation while decoding it.  However, I think to do that we need
> > to have access to relcache at that time and that might add some
> > overhead as we need to do that for each tuple.  Can we investigate
> > what it will take to do that and if it is better than setting a bit
> > during WAL logging.
> >
> I have done some more analysis on this and it appears that there are
> few problems in doing this.  Basically, once we get the confirmed
> flush location, we advance the replication_slot_catalog_xmin so that
> vacuum can garbage collect the old tuple.  So the problem is that
> while we are collecting the changes in the ReorderBuffer our catalog
> version might have removed,  and we might not find any relation entry
> with that relfilenodeid (because it is dropped or altered in the
> future).
>

Hmm, this means this can also occur while streaming the changes.  The
main reason as I understand is that it is because before decoding
commit, we don't know whether these changes are already sent to the
subscriber (based on confirmed_flush_location/start_decoding_at).  I
think it is better to skip streaming such transactions as we can't
make the right decision about these and as this can happen generally
after the crash for the first few transactions, it shouldn't matter
much if we serialize such transactions instead of streaming them.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Portal->commandTag as an enum

2020-02-04 Thread Andres Freund
Hi,

On 2020-02-04 18:18:52 -0800, Mark Dilger wrote:
> In master, a number of functions pass a char *completionTag argument (really 
> a char completionTag[COMPLETION_TAG_BUFSIZE]) which gets filled in with the 
> string to return to the client from EndCommand.  I have removed that kind of 
> logic:
>
> -   /* save the rowcount if we're given a completionTag to fill */
> -   if (completionTag)
> -   snprintf(completionTag, COMPLETION_TAG_BUFSIZE,
> -"SELECT " UINT64_FORMAT,
> -queryDesc->estate->es_processed);
>
> In the patch, this is replaced with a new struct, QueryCompletionData.  That 
> bit of code above is replaced with:
>
> +   /* save the rowcount if we're given a qc to fill */
> +   if (qc)
> +   SetQC(qc, COMMANDTAG_SELECT, 
> queryDesc->estate->es_processed, DISPLAYFORMAT_NPROCESSED);
>
> For wire protocol compatibility, we have to track the display format.
> When this gets to EndCommand, the display format allows the string to
> be written exactly as the client will expect.  If we ever get to the
> point where we can break with that compatibility, the third member of
> this struct, display_format, can be removed.

Hm. While I like not having this as strings a lot, I wish we could get
rid of this displayformat stuff.



> These are replaced by switch() case statements over the possible commandTags:
>
> +   switch (commandTag)
> +   {
> +   /*
> +* Supported idiosyncratic special cases.
> +*/
> +   case COMMANDTAG_ALTER_DEFAULT_PRIVILEGES:
> +   case COMMANDTAG_ALTER_LARGE_OBJECT:
> +   case COMMANDTAG_COMMENT:
> +   case COMMANDTAG_CREATE_TABLE_AS:
> +   case COMMANDTAG_DROP_OWNED:
> +   case COMMANDTAG_GRANT:
> +   case COMMANDTAG_IMPORT_FOREIGN_SCHEMA:
> +   case COMMANDTAG_REFRESH_MATERIALIZED_VIEW:
> +   case COMMANDTAG_REVOKE:
> +   case COMMANDTAG_SECURITY_LABEL:
> +   case COMMANDTAG_SELECT_INTO:

The number of these makes me wonder if we should just have a metadata
table in one place, instead of needing to edit multiple
locations. Something like

const ... CommandTagBehaviour[] = {
[COMMANDTAG_INSERT] = {
.display_processed = true, .display_oid = true, ...},
[COMMANDTAG_CREATE_TABLE_AS] = {
.event_trigger = true, ...},

with the zero initialized defaults being the common cases.

Not sure if it's worth going there. But it's maybe worth thinking about
for a minute?
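
Fleshing that fragment out a bit -- a minimal sketch only, assuming the
patch's COMMANDTAG_* enum; the struct spelling and array name are placeholders
rather than anything from the patch, and the field names are just the ones
used in the fragment above (bool comes from c.h in backend code):

/* Sketch only -- which flags each tag gets is an assumption, not the patch. */
typedef struct CommandTagBehaviour
{
	bool		display_processed;	/* completion tag carries a row count? */
	bool		display_oid;		/* completion tag carries an OID (INSERT)? */
	bool		event_trigger;		/* allowed in event triggers? */
} CommandTagBehaviour;

static const CommandTagBehaviour command_tag_behaviour[] = {
	[COMMANDTAG_INSERT] = {
		.display_processed = true, .display_oid = true},
	[COMMANDTAG_SELECT] = {
		.display_processed = true},
	[COMMANDTAG_CREATE_TABLE_AS] = {
		.event_trigger = true},
	/* everything else falls back to the zero-initialized defaults */
};

EndCommand() and the event-trigger checks could then both consult this single
array instead of switching over the tags in several places.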


> Averages for test set 1 by scale:
> set   scale   tps avg_latency 90% 1 1   37411.734   3.162   133.718
> 1 9   61240.904   1.05230.547
> 1 81  59210.931   1.015   67.023
>
> Averages for test set 1 by clients:
> set   clients tps avg_latency 90% 1 1   21630.461   0.514   24.414
> 1 4   59680.675   0.791   40.354
> 1 16  76552.433   3.922   366.519
>
>
> For command tag patch (branched from 1fd687a035):
>
>   postgresql % find src -type f -name "*.c" -or -name "*.h" | xargs cat | 
> wc
>1482969 5691908 45281399
>
>   postgresql % find src -type f -name "*.o" | xargs cat | wc
>  38209  476243 18999752
>
>
> Averages for test set 1 by scale:
> set   scale   tps avg_latency 90% 1 1   38771.645   3.066   24.973
> 1 9   63830.859   1.032   64.566
> 1 81  59450.925   1.023   162.9
>
> Averages for test set 1 by clients:
> set   clients tps avg_latency 90% 1 1   21410.466   0.522   11.531
> 1 4   59670.673   0.783   136.882
> 1 16  80962.292   3.817   104.026

Not bad.


> diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
> index 9aa2b61600..5322c14ce4 100644
> --- a/src/backend/commands/async.c
> +++ b/src/backend/commands/async.c
> @@ -594,7 +594,7 @@ pg_notify(PG_FUNCTION_ARGS)
>   payload = text_to_cstring(PG_GETARG_TEXT_PP(1));
>
>   /* For NOTIFY as a statement, this is checked in ProcessUtility */
> - PreventCommandDuringRecovery("NOTIFY");
> + PreventCommandDuringRecovery(COMMANDTAG_NOTIFY);
>
>   Async_Notify(channel, payload);
>
> diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
> index 40a8ec1abd..4828e75bd5 100644
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -1063,7 +1063,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
>
>   /* check read-only transaction and parallel mode */
>   if (XactReadOnly && !rel->rd_islocaltemp)
> - PreventCommandIfReadOnly("COPY FROM");
> + 

Re: Documentation patch for ALTER SUBSCRIPTION

2020-02-04 Thread David Christensen


>> On Feb 4, 2020, at 8:45 PM, Amit Kapila  wrote:
>> 
>> On Fri, Jan 24, 2020 at 2:05 AM David Christensen  
>> wrote:
>> Greetings,
>> Enclosed find a documentation patch that clarifies the behavior of ALTER 
>> SUBSCRIPTION … REFRESH PUBLICATION with new tables; I ran into a situation 
>> today where the docs were not clear that existing tables would not be 
>> re-copied, so remedying this situation.
> 
> It seems this is already covered in REFRESH PUBLICATION, see "This
> will start replication of tables that were added to the subscribed-to
> publications since the last invocation of REFRESH PUBLICATION or since
> CREATE SUBSCRIPTION.".  As far as I understand, this text explains the
> situation you were facing.  Can you explain why the text quoted by me
> is not sufficient?

Hi Amit,

From several reads of the text, it was not explicitly clear to me that
issuing the refresh with copy_data would not effectively re-copy existing
tables in the existing publication, which is what I had been trying to
confirm before running a refresh operation. I had to resort to reviewing
the source code to get the answer I was looking for.

If you are already familiar with the operation under the hood, I am sure
the ambiguity is not there, but since I was recently confused by this, I
wanted to be more explicit in a way that would have helped me answer my
original question.

Best,

David



Re: Documentation patch for ALTER SUBSCRIPTION

2020-02-04 Thread Amit Kapila
On Fri, Jan 24, 2020 at 2:05 AM David Christensen  wrote:
>
> Greetings,
>
> Enclosed find a documentation patch that clarifies the behavior of ALTER 
> SUBSCRIPTION … REFRESH PUBLICATION with new tables; I ran into a situation 
> today where the docs were not clear that existing tables would not be 
> re-copied, so remedying this situation.
>

It seems this is already covered in REFRESH PUBLICATION, see "This
will start replication of tables that were added to the subscribed-to
publications since the last invocation of REFRESH PUBLICATION or since
CREATE SUBSCRIPTION.".  As far as I understand, this text explains the
situation you were facing.  Can you explain why the text quoted by me
is not sufficient?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: closesocket behavior in different platforms

2020-02-04 Thread Amit Kapila
On Thu, Jan 30, 2020 at 9:04 AM Amit Kapila  wrote:
>
> On Wed, Jan 29, 2020 at 8:29 PM Robert Haas  wrote:
> >
> > On Wed, Jan 29, 2020 at 6:04 AM vignesh C  wrote:
> > > Thanks for your review and suggestion. I have made a patch based on
> > > similar lines. Attached patch has the doc update with the explanation.
> > > Thoughts?
> >
> > Documenting this doesn't seem very useful to me.
> >
>
> I thought of documenting it because this has been reported/discussed
> multiple times (see some of the links of discussions at the end of the
> first email) and every time we need to spend time explaining the same
> thing.  However, if we decide not to do that I am fine with it.
>

Does anybody else have any opinion on whether it makes sense to
document this behavior?  To summarize for others, the behavior
difference as noted by Vignesh in his patch is:

+
+
+ You will get a "server closed the connection unexpectedly" message while
+ trying to execute an SQL command on a disconnected connection. The message
+ is slightly different on Windows and non-Windows platforms. On non-Windows
+ platforms, you will see a FATAL message before the error message:
+
+FATAL:  terminating connection due to idle-in-transaction timeout
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.
+
+ On Windows, you might not see the FATAL message:
+
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.

We have spent a decent amount of time on this, and it is due to the Windows
API behaving differently.  By documenting it, we might avoid the future
effort of explaining this to users.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Expose lock group leader pid in pg_stat_activity

2020-02-04 Thread Michael Paquier
On Tue, Feb 04, 2020 at 03:27:25PM +0100, Julien Rouhaud wrote:
> I added some code comments to remind that we don't guarantee any consistency
> here.

That's mostly fine.  I have moved the comment related to
AuxiliaryPidGetProc() within the inner part of its "if" (or the
comment should be changed to be conditional).  An extra thing is that
nulls[29] was not set to true for a user without the proper permission
rights.

Does that look fine to you?
--
Michael
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2228256907..226c904c04 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5175,9 +5175,9 @@
   proname => 'pg_stat_get_activity', prorows => '100', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => 'int4',
-  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool}',
-  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc}',
+  proallargtypes => '{int4,oid,int4,oid,text,text,text,text,text,timestamptz,timestamptz,timestamptz,timestamptz,inet,text,int4,xid,xid,text,bool,text,text,int4,bool,text,numeric,text,bool,text,bool,int4}',
+  proargmodes => '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,datid,pid,usesysid,application_name,state,query,wait_event_type,wait_event,xact_start,query_start,backend_start,state_change,client_addr,client_hostname,client_port,backend_xid,backend_xmin,backend_type,ssl,sslversion,sslcipher,sslbits,sslcompression,ssl_client_dn,ssl_client_serial,ssl_issuer_dn,gss_auth,gss_princ,gss_enc,leader_pid}',
   prosrc => 'pg_stat_get_activity' },
 { oid => '3318',
   descr => 'statistics: information about progress of backends running maintenance command',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c9e6060035..f681aafcf9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -741,6 +741,7 @@ CREATE VIEW pg_stat_activity AS
 S.datid AS datid,
 D.datname AS datname,
 S.pid,
+S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
 S.application_name,
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7b2da2b36f..7e6a3c1774 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -547,7 +547,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS	29
+#define PG_STAT_GET_ACTIVITY_COLS	30
 	int			num_backends = pgstat_fetch_stat_numbackends();
 	int			curr_backend;
 	int			pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -686,33 +686,40 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			values[5] = CStringGetTextDatum(clipped_activity);
 			pfree(clipped_activity);
 
+			/* leader_pid */
+			nulls[29] = true;
+
 			proc = BackendPidGetProc(beentry->st_procpid);
-			if (proc != NULL)
-			{
-uint32		raw_wait_event;
 
-raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
-wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
-wait_event = pgstat_get_wait_event(raw_wait_event);
-
-			}
-			else if (beentry->st_backendType != B_BACKEND)
+			if (proc == NULL && (beentry->st_backendType != B_BACKEND))
 			{
 /*
  * For an auxiliary process, retrieve process info from
  * AuxiliaryProcs stored in shared-memory.
  */
 proc = AuxiliaryPidGetProc(beentry->st_procpid);
+			}
 
-if (proc != NULL)
+			/*
+			 * If a PGPROC entry was retrieved, display wait events and lock
+			 * group leader information if any.  To avoid extra overhead, no
+			 * extra lock is being held, so there is no guarantee of
+			 * consistency across multiple rows.
+			 */
+			if (proc != NULL)
+			{
+uint32		raw_wait_event;
+PGPROC	   *leader;
+
+raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
+wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
+wait_event = pgstat_get_wait_event(raw_wait_event);
+
+leader = proc->lockGroupLeader;
+if (leader)
 {
-	uint32		raw_wait_event;
-
-	raw_wait_event =
-		UINT32_ACCESS_ONCE(proc->wait_event_info);
-	wait_event_type =
-		pgstat_get_wait_event_type(raw_wait_event);
-	wait_event = pgstat_get_wait_event(raw_wait_event);
+	values[29] = 

RE: Complete data erasure

2020-02-04 Thread tsunakawa.ta...@fujitsu.com
From: Tom Lane 
> Up to now, we've sort of looked the other way with respect to failures
> of file unlinks post-commit, reasoning that the worst that will happen
> is disk space leakage from no-longer-referenced files that we failed to
> unlink.  (Which is bad, certainly, but not catastrophic; it's immaterial
> to database semantics.)  This patch basically needs to raise the level of
> guarantee that exists in this area, or it won't do what it says on the
> tin.  But I've not seen any indication that we know how to do that in a
> workable way.

Hmm, the error case is a headache. Even if the bgworker does the erasure, it
could hit the same error repeatedly when the file system or disk is broken, 
causing repeated I/O that may hamper performance.

Do we have no good choice but to leave it up to the user to erase the file
content like the following?

https://docs.oracle.com/en/database/oracle/oracle-database/19/asoag/general-considerations-of-using-transparent-data-encryption.html#GUID-F02C9CBF-0374-408B-8655-F7531B681D41
--
Oracle Database
Advanced Security Guide
7 General Considerations of Using Transparent Data Encryption 

Managing Security for Plaintext Fragments


You should remove old plaintext fragments that can appear over time.


Old plaintext fragments may be present for some time until the database 
overwrites the blocks containing such values. If privileged operating system 
users bypass the access controls of the database, then they might be able to 
directly access these values in the data file holding the tablespace. 

To minimize this risk:
 

1.Create a new tablespace in a new data file. 

You can use the CREATE TABLESPACE statement to create this tablespace. 


2.Move the table containing encrypted columns to the new tablespace. You can 
use the ALTER TABLE...MOVE statement.

Repeat this step for all of the objects in the original tablespace.


3.Drop the original tablespace. 

You can use the DROP TABLESPACE tablespace INCLUDING CONTENTS KEEP DATAFILES 
statement. Oracle recommends that you securely delete data files using 
platform-specific utilities. 


4.Use platform-specific and file system-specific utilities to securely delete 
the old data file. Examples of such utilities include shred (on Linux) and 
sdelete (on Windows). 
--


Regards
Takayuki Tsunakawa






Re: Delaying/avoiding BTreeTupleGetNAtts() call within _bt_compare()

2020-02-04 Thread Peter Geoghegan
On Thu, Jan 30, 2020 at 1:19 AM Floris Van Nee  wrote:
> I'd say applying just v2-0001 is actually slightly hurtful for single-core 
> performance. Applying all of them gives a good improvement though. It looks 
> like the performance improvement is also more noticeable at higher core 
> counts now.

Many thanks for testing once again!

Your tests show that the overall winner is "", which is
strictly better than all other configurations you tested -- it is at
least slightly better than every other configuration at every client
count tested. I was particularly pleased to see that ""
is ~8.6% faster than the master branch with 30 clients! That result
greatly exceeded my expectations.

I have been able to independently confirm that you really need the
first two patches together to see the benefits -- that wasn't clear
until recently.

The interesting thing now is the role of the "negative infinity test"
patch (the 0003-* patch) in all of this. I suspect that it may not be
helping us much here. I wonder, could you test the following
configurations to settle this question?

*  with 30 clients (i.e. repeat the test that you reported on
most recently)

*  with 30 clients (i.e. repeat the test that you
reported got us that nice ~8.6% increase in TPS)

*  with 30 clients -- a new test, to see if performance is
at all helped by the "negative infinity test" patch (the 0003-*
patch).

It seems like a good idea to repeat the other two tests as part of
performing this third test, out of general paranoia. Intel seem to
roll out a microcode update for a spectre-like security issue about
every other day.

Thanks again
-- 
Peter Geoghegan




Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Thomas Munro
On Wed, Feb 5, 2020 at 12:08 PM Peter Geoghegan  wrote:
> Parallel CREATE INDEX is currently accidentally disabled on the master
> branch. That should be fixed in the next couple of days. You can
> temporarily revert 74618e77 if you want to get it back for testing
> purposes today.

(Fixed -- sorry for the disruption.)




Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Peter Geoghegan
On Mon, Feb 3, 2020 at 6:24 PM Jeff Davis  wrote:
> And now I'm attaching another version of the main Hash Aggregation
> patch to be applied on top of the logtape.c patch.

Have you tested this against tuplesort.c, particularly parallel CREATE
INDEX? It would be worth trying to measure any performance impact.
Note that most parallel CREATE INDEX tuplesorts will do a merge within
each worker, and one big merge in the leader. It's much more likely to
have multiple passes than a regular serial external sort.

Parallel CREATE INDEX is currently accidentally disabled on the master
branch. That should be fixed in the next couple of days. You can
temporarily revert 74618e77 if you want to get it back for testing
purposes today.

Have you thought about integer overflow in your heap related routines?
This isn't as unlikely as you might think. See commit 512f67c8, for
example.
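
(As a generic illustration of the kind of overflow meant here, not tied to
that commit or to any particular routine in the patch: the textbook midpoint
computation in heap or binary-search code overflows for large inputs. The
function names below are made up for the example.)

static int
midpoint_unsafe(int lo, int hi)
{
    return (lo + hi) / 2;       /* lo + hi can overflow int */
}

static int
midpoint_safe(int lo, int hi)
{
    return lo + (hi - lo) / 2;  /* stays in range for 0 <= lo <= hi */
}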

Have you thought about the MaxAllocSize restriction as it concerns
lts->freeBlocks? Will that be okay when you have many more tapes than
before?

> Not a lot of changes from the last version; mostly some cleanup and
> rebasing. But it's faster now with the logtape.c changes.

LogicalTapeSetExtend() seems to work in a way that assumes that the
tape is frozen. It would be good to document that assumption, and
possible enforce it by way of an assertion. The same remark applies to
any other assumptions you're making there.

-- 
Peter Geoghegan




Re: Minimal logical decoding on standbys

2020-02-04 Thread James Sewell
Hi all,

This is great stuff! My understanding is that this patch guarantees 0 data
loss for a logical replication stream if the primary crashes and a standby
which was marked as sync at failure time is promoted.

Is this correct?

James
-- 
James Sewell,
Chief Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
P (+61) 2 8099 9000  W www.jirotech.com  F (+61) 2 8099 9099



Re: Complete data erasure

2020-02-04 Thread Tom Lane
Tomas Vondra  writes:
> I think it depends how exactly it's implemented. As Tom pointed out in
> his message [1], the problem with doing the erasure itself in post-commit
> is not being able to handle errors. But if the files are renamed durably,
> and the erasure happens in a separate process, that could be OK. The
> COMMIT may wait for it or not, that's mostly irrelevant I think.

How is requiring a file rename to be completed post-commit any less
problematic than the other way?  You still have a non-negligible
chance of failure.

>> 1. Writes a commit WAL record, finalizing the system catalog change.
>> 2. Puts the data files in the trash bin or renames them.
>> 3. Erase the file content and delete the file. This could take a long time.
>> 4. COMMIT replies success to the client.

> I don't think the COMMIT has to wait for (3) - it might, of course, but
> for some use cases it may be better to just commit and leave the
> bgworker do the work. And then allow checking if it completed.

This doesn't seem like a path that will lead to success.  The fundamental
point here is that COMMIT has to be an atomic action --- or if it isn't,
failure partway through has to lead to a database crash & restart, which
isn't very pleasant, especially if WAL replay of the commit after the
restart re-encounters the same error.

Up to now, we've sort of looked the other way with respect to failures
of file unlinks post-commit, reasoning that the worst that will happen
is disk space leakage from no-longer-referenced files that we failed to
unlink.  (Which is bad, certainly, but not catastrophic; it's immaterial
to database semantics.)  This patch basically needs to raise the level of
guarantee that exists in this area, or it won't do what it says on the
tin.  But I've not seen any indication that we know how to do that in a
workable way.

regards, tom lane




Re: [Proposal] Global temporary tables

2020-02-04 Thread Robert Haas
On Mon, Feb 3, 2020 at 3:08 AM Konstantin Knizhnik
 wrote:
> Thank you for explanation.
> You convinced me that building indexes from _bt_getbuf is not good idea.
> What do you think about idea to check and build indexes for GTT prior to
> query execution?
>
> In this case we do not need to patch code of all indexes - it can be
> done just in one place.
> We can use build function of access method to initialize index and
> populate it with data.
>
> So right now when building query execution plan, optimizer checks if
> index is valid.
> If index belongs to GTT, it an check that first page of the index is
> initialized and if not - call build method for this index.
>
> If building index during building query plan is not desirable, we can
> just construct list of indexes which should be checked and
> perform check itself and building indexes somewhere after building plan
> but for execution of the query.
>
> Do you seem some problems with such approach?

My guess it that the right time to do this work is just after we
acquire locks, at the end of parse analysis. I think trying to do it
during execution is too late, since the planner looks at indexes, and
trying to do it in the planner instead of before we start planning
seems more likely to cause bugs and has no real advantages. It's just
better to do complicated things (like creating indexes) separately
rather than in the middle of some other complicated thing (like
planning). I could tie my shoelaces the first time they get tangled up
with my break pedal but it's better to do it before I get in the car.

And I'm still inclined to do it by flat-copying files rather than
calling ambuild. It will be slightly faster, but also importantly, it
will guarantee that (1) every backend gets exactly the same initial
state and (2) it has fewer ways to fail because it doesn't involve
calling any user-defined code. Those seem like fairly compelling
advantages, and I don't see what the disadvantages are. I think
calling ambuild() at the point in time proposed in the preceding
paragraph would be fairly safe and would probably work OK most of the
time, but I can't think of any reason it would be better.

Incidentally, what I'd be inclined to do is - if the session is
running a query that does only read-only operations, let it continue
to point to the "master" copy of the GTT and its indexes, which is
stored in the relfilenodes indicated for those relations in pg_class.
If it's going to acquire a lock heavier than AccessShareLock, then
give it is own copies of the table and indexes, stored in a temporary
relfilenode (tXXX_YYY) and redirect all future access to that GTT by
this backend to there. Maybe there's some reason this won't work, but
it seems nice to avoid saying that we've "attached" to the GTT if all
we did is read the empty table.
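
For illustration only, the flat-copy step could look roughly like the
standalone POSIX sketch below; the function name and paths are hypothetical,
error handling is simplified, and this is not code from any posted patch.

#include <fcntl.h>
#include <unistd.h>

static int
flat_copy_file(const char *srcpath, const char *dstpath)
{
    char        buf[8192];
    ssize_t     nread;
    int         src = open(srcpath, O_RDONLY);
    int         dst = open(dstpath, O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (src < 0 || dst < 0)
        return -1;

    /* copy the pre-built "master" file block by block */
    while ((nread = read(src, buf, sizeof(buf))) > 0)
    {
        if (write(dst, buf, (size_t) nread) != nread)
            return -1;
    }

    close(src);
    if (nread < 0 || fsync(dst) != 0 || close(dst) != 0)
        return -1;
    return 0;
}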

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Complete data erasure

2020-02-04 Thread Tomas Vondra

On Tue, Feb 04, 2020 at 12:53:44AM +, tsunakawa.ta...@fujitsu.com
wrote:

From: Tomas Vondra 

That's not really what I meant - let me explain. When I said DROP
TABLE should do everything as usual, that includes catalog changes.
I.e. after the commit there would not be any remaining entries in
system catalogs or anything like that.

The only thing we'd do differently is that instead of unlinking the
relfilenode segments, we'd move the relfilenode to a persistent queue
(essentially a regular table used as a queue relfilenodes). The
background worker would watch the queue, and when it gets a new
relfilenode it'd "delete" the data and then remove the relfilenode
from the queue.

So essentially others would not be able to even see the (now dropped)
object, they could create new object with the same name etc.


That sounds good.  I think we can also follow the way the WAL archiver
does its job, instead of using a regular table.  That is, when the
transaction that performed DROP TABLE commits, it puts the data files
in the "trash bin," which is actually a filesystem directory.  Or, it
just renames the data files in the original directory by appending some
suffix such as ".del".  Then, the background worker scans the trash bin
or the data directory to erase the file content and delete the file.



Yeah, that could work, I guess.
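
As a rough standalone illustration of the "rename durably" step (the backend's
fd.c already provides a durable_rename() helper for this kind of thing; the
sketch below is not that code, and the names are hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int
fsync_dir(const char *dir)
{
    int     fd = open(dir, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}

static int
durable_move(const char *oldpath, const char *olddir,
             const char *newpath, const char *newdir)
{
    if (rename(oldpath, newpath) != 0)
        return -1;
    /* make both the new entry and the removal of the old one durable */
    if (fsync_dir(newdir) != 0)
        return -1;
    return fsync_dir(olddir);
}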


The trash bin mechanism may open up the application for restoring
mistakenly dropped tables, a feature like Oracle's Flash Drop.  The
dropping transaction puts the table metadata (system catalog data or
DDL) in the trash bin as well as the data file.



That seems like a very different feature, and I doubt this is the right
way to implement that. That would require much more infrastructure than
just moving the file to a separate dir.




I imagine we might provide a way to wait for the deletion to actually
complete (can't do that as part of the DROP TABLE, though), so that
people can be sure when the data is actually gone (for scripts etc.).
A simple function waiting for the queue to get empty might be enough,
I guess, but maybe not.


Agreed, because the user should expect the disk space to be available
after DROP TABLE has been committed.  Can't we really make the COMMIT
wait for the erasure to complete?  Do we have to use an asynchronous
erasure method with a background worker?  For example, COMMIT performs:



I think it depends how exactly it's implemented. As Tom pointed out in
his message [1], the problem with doing the erasure itself in post-commit
is not being able to handle errors. But if the files are renamed durably,
and the erasure happens in a separate process, that could be OK. The
COMMIT may wait for it or not, that's mostly irrelevant I think.

[1] https://www.postgresql.org/message-id/9104.1579107235%40sss.pgh.pa.us


1. Writes a commit WAL record, finalizing the system catalog change.
2. Puts the data files in the trash bin or renames them.
3. Erase the file content and delete the file. This could take a long time.
4. COMMIT replies success to the client.



I don't think the COMMIT has to wait for (3) - it might, of course, but
for some use cases it may be better to just commit and leave the
bgworker do the work. And then allow checking if it completed.


What concerns me is that the need to erase and delete the data
file could be forgotten if the server crashes during step 3.  If so, the
postmaster can do the job at startup, just like it deletes temporary
files (although that delays the startup).



Startup seems like a pretty bad place to do this stuff. There may be a
lot of data to erase, making recovery very long.




I think this depends on what our requirements are.

My assumption is that when you perform this "secure data erasure" on
the primary, you probably also want to erase the data on the replica.
But if the instances use different file systems (COW vs. non-COW,
...) the exact thing that needs to happen may be different. Or maybe
the replica does not need to do anything, making it noop?


We can guide the use of non-COW file systems on both the primary and
standby in the manual.



I don't see how that solves the issue. I think it's quite useful to be
able to use different filesystems for primary/replica. And we may even
know how to securely erase data on both, in which case I don't see a
point not to allow such configurations.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Global temporary tables

2020-02-04 Thread Robert Haas
On Sat, Feb 1, 2020 at 11:14 AM 曾文旌(义从)  wrote:
> As global_private_temp-8.patch, think about:
> 1 session X takes several hours doing some statistical work with the GTT A, 
> which generated some data using transaction 100, The work is not over.
> 2 Then session Y vacuumed A, and the GTT's relfrozenxid (in pg_class) was 
> updated to 1000 .
> 3 Then the autovacuum happened, the clog before 1000 was cleaned up.
> 4 The data in session A could be lost due to missing clog, The analysis task 
> failed.
>
> However This is likely to happen because you allowed the GTT do vacuum.
> And this is not a common problem, that not happen with local temp tables.
> I feel uneasy about leaving such a question. We can improve it.

Each session is going to need to maintain its own notion of the
relfrozenxid and relminmxid of each GTT to which it is attached.
Storing the values in pg_class makes no sense and is completely
unacceptable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: Index Skip Scan

2020-02-04 Thread Floris Van Nee


> this point me and Jesper inclined to go with the second option. But maybe
> I'm missing something, are there any other suggestions?

Unfortunately I figured this would need a more invasive fix. I tend to agree 
that it'd be better to not skip in situations like this. I think it'd make most 
sense for any plan for these 'prepare/fetch' queries to not use skip, 
but rather a materialize node, right?

-Floris



Re: Clarifying/rationalizing Vars' varno/varattno/varnoold/varoattno

2020-02-04 Thread Tom Lane
Robert Haas  writes:
> On Fri, Dec 20, 2019 at 11:13 AM Tom Lane  wrote:
>> The alternatives that seem plausible at this point are
>> (1) Create some sort of wrapper node indicating "the contents of this
>> expression might be replaced by NULL".  This is basically what the
>> planner's PlaceHolderVars do, so maybe we'd just be talking about
>> introducing those at some earlier stage.
>> ...

> I'm not sure which is better, either, although I would like to note in
> passing that the name PlaceHolderVar seems to me to be confusing and
> terrible. It took me years to understand it, and I've never been
> totally sure that I actually do. Why is it not called
> MightBeNullWrapper or something?

Here's a data dump about my further thoughts in this area.  I've concluded
that the "wrapper" approach is the right way to proceed, and that rather
than having the planner introduce the wrappers as happens now, we should
indeed have the parser introduce the wrappers from the get-go.  There are
a few arguments for that:

* Arguably, this is a question of decorating the parse tree with
information about query semantics.  I've always held that parse analysis
is what should introduce knowledge of semantics; the planner ought not be
reverse-engineering that.

* AFAICS, we would need an additional pass over the query tree in order to
do this in the planner.  There is no existing recursive tree-modification
pass that happens at an appropriate time.

* We can use the same type of wrapper node to solve the problems with
grouping-set expressions that were discussed in
https://www.postgresql.org/message-id/flat/7dbdcf5c-b5a6-ef89-4958-da212fe10176%40iki.fi
although I'd leave that for a follow-on patch rather than try to fix
it immediately.  Here again, it'd be better to introduce the wrappers
at parse time --- check_ungrouped_columns() is already detecting the
presence of grouping-expression references, so we could make it inject
wrappers around them at relatively little extra cost.

Per Robert's complaint above, these wrappers need better documentation,
and they should be called something other than PlaceHolderVar, even though
they're basically that (and hopefully will replace those completely).
I'm tentatively thinking of calling them "NullableVar", but am open to
better ideas.  And here is a proposed addition to optimizer/README to
explain why they exist.  I'm not quite satisfied with the explanation yet
--- in particular, if we don't need them at runtime, why do we need them
at parse time?  Any thoughts about how to explain this more solidly are
welcome.
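
As a purely illustrative aside before the README text, such a wrapper might be
declared along the following lines within the backend source tree; the field
names are hypothetical and not taken from any posted patch.

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/primnodes.h"

/* Hypothetical sketch only */
typedef struct NullableVar
{
    Expr        xpr;            /* standard expression node header */
    Expr       *arg;            /* the wrapped ("contained") expression */
    Bitmapset  *nullingjoins;   /* outer joins that may force a NULL here */
} NullableVar;

Per the README text below, finished plans would not contain such a node at
all; it only carries semantic information through parsing and planning.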

--

To simplify handling of some issues with outer joins, we use NullableVars,
which are produced by the parser and used by the planner, but do not
appear in finished plans.  A NullableVar is a wrapper around another
expression, decorated with a set of outer-join relids, and notionally
having the semantics

CASE WHEN any-of-these-outer-joins-produced-a-null-extended-row
THEN NULL
ELSE contained-expression
END

It's only notional, because no such calculation is ever done explicitly.
In a finished plan, the NullableVar construct is replaced by a plain Var
referencing an output column of the topmost mentioned outer join, while
the "contained expression" is the corresponding input to the bottommost
join.  Any forcing to null happens in the course of calculating the
outer join results.  (Because we don't ever have to do the calculation
explicitly, it's not necessary to distinguish which side of an outer join
got null-extended, which'd otherwise be essential information for FULL
JOIN cases.)

A NullableVar wrapper is placed around a Var referencing a column of the
nullable side of an outer join when that reference appears syntactically
above (outside) the outer join, but not when the reference is below the
outer join, such as within its ON clause.  References to the non-nullable
side of an outer join are never wrapped.  NullableVars mentioning multiple
join nodes arise from cases with nested outer joins.

It might seem that the NullableVar construct is unnecessary (and indeed,
we got by without it for many years).  In a join row that's null-extended
for lack of a matching nullable-side row, the only reasonable value to
impute to a Var of that side is NULL, no matter where you look in the
parse tree.  However there are pressing reasons to use NullableVars
anyway:

* NullableVars simplify reasoning about where to evaluate qual clauses.
Consider
SELECT * FROM t1 LEFT JOIN t2 ON (t1.x = t2.y) WHERE foo(t2.z)
(Assume foo() is not strict, so that we can't reduce the left join to
a plain join.)  A naive implementation might try to push the foo(t2.z)
call down to the scan of t2, but that is not correct because (a) what
foo() should actually see for a null-extended join row is NULL, and
(b) if foo() returns false, we should suppress the t1 row from the join
altogether, not emit it with a null-extended t2 row.  On the other hand,
it *would* be 

Re: Index Skip Scan

2020-02-04 Thread Dmitry Dolgov
> On Mon, Jan 27, 2020 at 02:00:39PM +, Floris Van Nee wrote:
>
> Thanks for the new patch! I tested it and managed to find a case that causes
> some issues. Here's how to reproduce:

So, after a bit of investigation I found the issue (it was actually there
even in the previous version). In this particular case of moving forward and
reading backward, exactly the scenarios you've described above, the current
implementation was not ignoring deleted tuples.

My first idea to fix this was to use _bt_readpage when necessary and put a
couple of _bt_killitems calls when we leave a page while jumping, so that
deleted tuples will be ignored. To demonstrate it visually, let's say we
want to go backward on a cursor over an ORDER BY a DESC, b DESC query,
i.e. return:

(1,100), (2, 100), (3, 100) etc.

To achieve that we jump from (1,1) to (1,100), from (2,1) to (2,100) and so on.
If some values are deleted, we need to read backward. E.g. if (3,100) is
deleted, we need to return (3,99).
 
   +---+---+---+---+ 
   |   |   |   |   | 
   | 1,1 ... 1,100 | 2,1 ... 2,100 | 3,1 ... 3,100 | 4,1 ... 4,100 | 
   |   |   |   |   | 
   +---+---+---+---+ 
 
   | ^ | ^ | ^ | ^   
   | | | | | | | |   
   +-+ +-+ +-+ +-+   
 
If it happened that a whole value series is deleted, we return to the
previous value and need to detect such situation. E.g. if all the values
from (3,1) to (3,100) were deleted, we will return to (2,100).
 
   +---+---+   +---+ 
   |   |   |   |   | 
   | 1,1 ... 1,100 | 2,1 ... 2,100 |<--+ 4,1 ... 4,100 | 
   |   |   |   |   | 
   +---+---+   +---+ 
 ^   
   | ^ | ^ | ^   |   
   | | | | | |   |   
   +-+ +-+ +-+   |   
   +-+ 
 
This is all implemented inside _bt_skip. Unfortunately, as I see it now, the
idea of relying on the ability to skip dead index tuples without checking a
heap tuple is not reliable, since an index tuple will be added into killedItems
and can be marked as dead only when not a single transaction can see it anymore.

Potentially there are two possible solutions:

* Adjust the code in nodeIndexOnlyscan to perform a proper visibility check and
  understand if we returned back. Obviously it will make the patch more 
invasive.

* Reduce scope of the patch, and simply do not apply jumping in this case. This
  means less functionality but hopefully still brings some value.

At this point me and Jesper inclined to go with the second option. But maybe
I'm missing something, are there any other suggestions?




Re: Index only scan and ctid

2020-02-04 Thread Tom Lane
Laurenz Albe  writes:
> On Mon, 2020-02-03 at 14:43 -0500, Tom Lane wrote:
>> Laurenz Albe  writes:
>>> I noticed that "ctid" in the select list prevents an index only scan:
>>> This strikes me as strange, since every index contains "ctid".

>> There's no provision for an IOS to return a system column, though.
>> Not sure what it'd take to make that possible.

> I was reminded what the obvious problem is:
> the ctid of a heap only tuple is not stored in the index.  Duh.

Duh ... the members of a HOT chain share the same indexed value(s),
which is why we needn't care exactly which one is live during IOS.
But they don't have the same TID.  Oh well.

regards, tom lane




Re: Index only scan and ctid

2020-02-04 Thread Laurenz Albe
On Mon, 2020-02-03 at 14:43 -0500, Tom Lane wrote:
> Laurenz Albe  writes:
> > I noticed that "ctid" in the select list prevents an index only scan:
> > This strikes me as strange, since every index contains "ctid".
> 
> There's no provision for an IOS to return a system column, though.
> Not sure what it'd take to make that possible.

I was reminded what the obvious problem is:
the ctid of a heap only tuple is not stored in the index.  Duh.

Yours,
Laurenz Albe





Re: BUG #16171: Potential malformed JSON in explain output

2020-02-04 Thread Tom Lane
I wrote:
> Daniel Gustafsson  writes:
>> I guess I'm leaning towards backpatching, but it's not entirely clear-cut.

> That's where I stand too.  I'll wait a day or so to see if anyone
> else comments; but if not, I'll back-patch.

Hearing no objections, done that way.

regards, tom lane




Re: [Proposal] Global temporary tables

2020-02-04 Thread Konstantin Knizhnik



On 04.02.2020 18:01, 曾文旌(义从) wrote:





Yes, exactly.
But it is still not clear to me why do we need some special handling 
for GTT?
Shared memory is reinitialized and storage of temporary tables is 
removed.

It is true for both local and global temp tables.
Of course not. The local temp table cleans up the entire table 
(including catalog buffer and datafile). GTT is not.




What do you mean by "catalog buffer"?
Yes, cleanup of a local temp table requires deletion of the corresponding 
entry from the catalog, and GTT should not do it.
But I am speaking only about cleanup of the data files of temp relations. 
It is done in the same way for local and global temp tables.




In my patch autovacuum is prohibited for GTT.

But vacuum GTT is not prohibited.

Yes, but the simplest solution is to prohibit also explicit vacuum of 
GTT, isn't it?





IMHO forced terminated of client sessions is not acceptable solution.

And it is not an absolutely necessary requirement.
So from my point of view we should not add such limitations to GTT 
design.

This limitation makes it possible for the GTT to do all the DDL.
IMHO even oracle's GTT has similar limitations.


I have checked that Oracle is not preventing creation of index for 
GTT if there are some active sessions working with this table. And 
this index becomes visible for all this sessions.
1 Yes, the creation of an index on GTT has been improved 
in global_temporary_table_v10-pg13.patch
2 But alter GTT ; drop GTT ; drop index on GTT is blocked by other 
sessions



Yes, you are right.
Oracle documentation says:
> 1) DDL operation on global temporary tables
>
> It is not possible to perform a DDL operation (except TRUNCATE)
> on an existing global temporary table if one or more sessions are
> currently bound to that table.


But looks like create index is not considered as DDL operation on GTT 
and is also supported by Oracle.


Your approach with prohibiting such accessed using shared cache is 
certainly better then my attempt to prohibit such DDLs for GTT at all.
I just what to eliminate maintenance of such shared cache to simplify 
the patch.


But I still think that we should allow truncation of GTT and 
creating/dropping indexes on it without any limitations.


May be the easies solution is to prohibit explicit vacuum of GTT?

I think vacuum is an important part of GTT.

Looking back at previous emails, robert once said that vacuum GTT is 
pretty important.
https://www.postgresql.org/message-id/CA%2BTgmob%3DL1k0cpXRcipdsaE07ok%2BOn%3DtTjRiw7FtD_D2T%3DJwhg%40mail.gmail.com 





Well, maybe I am not right.
I never saw use cases where temp tables are used as anything other than 
append-only storage (i.e. where temp table tuples are updated multiple times).
But I think that if such a problem actually exists, then the solution is to 
support autovacuum for temp tables, rather than to allow manual vacuum.
Certainly it cannot be done by another worker because it has no access 
to the private backend's data. But it can be done incrementally by the backend itself.





Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Heikki Linnakangas

On 03/02/2020 20:29, Jeff Davis wrote:

1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:

   * sorted.  This is an efficient way to handle it because we expect
cycles
   * of releasing many blocks followed by re-using many blocks, due to
   * the larger read buffer.

But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.


A minheap certainly seems more natural for that. I guess re-sorting the 
array would be faster in the extreme case that you free almost all of 
the blocks, and then consume almost all of the blocks, but I don't think 
the usage pattern is ever that extreme. Because if all the data fit in 
memory, we wouldn't be spilling in the first place.


I wonder if a more advanced heap like the pairing heap or fibonacci heap 
would perform better? Probably doesn't matter in practice, so better 
keep it simple...
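
For reference, the kind of binary min-heap being discussed can be sketched in
a few dozen lines; the names are illustrative and this is not the actual
logtape.c code. Callers must ensure capacity before pushing and non-emptiness
before popping.

typedef struct BlockHeap
{
    long       *blocks;         /* heap array; blocks[0] is the smallest */
    int         nblocks;        /* number of entries currently in the heap */
} BlockHeap;

/* Release a block for reuse: sift it up into place. */
static void
heap_release_block(BlockHeap *heap, long blocknum)
{
    int     i = heap->nblocks++;

    while (i > 0 && heap->blocks[(i - 1) / 2] > blocknum)
    {
        heap->blocks[i] = heap->blocks[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap->blocks[i] = blocknum;
}

/* Grab the lowest-numbered free block: pop the root and sift down. */
static long
heap_get_block(BlockHeap *heap)
{
    long    result = heap->blocks[0];
    long    last = heap->blocks[--heap->nblocks];
    int     i = 0;

    for (;;)
    {
        int     child = 2 * i + 1;

        if (child >= heap->nblocks)
            break;
        if (child + 1 < heap->nblocks &&
            heap->blocks[child + 1] < heap->blocks[child])
            child++;
        if (last <= heap->blocks[child])
            break;
        heap->blocks[i] = heap->blocks[child];
        i = child;
    }
    heap->blocks[i] = last;
    return result;
}

Both operations are O(log n), whereas re-sorting the whole array only pays off
under the extreme free-then-reuse pattern described above.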



Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).


We could fairly easily spill parts of the freelist to disk, too, if 
necessary. But it's probably not worth the trouble.



2. Lazily-allocate the read buffer. The write buffer was always lazily-
allocated, so this patch creates better symmetry. More importantly, it
means freshly-rewound tapes don't have any buffer allocated, so it
greatly expands the number of tapes that can be managed efficiently as
long as only a limited number are active at once.


Makes sense.


3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.


I'd love to change the LogicalTape API so that you could allocate and 
free tapes more freely. I wrote a patch to do that, as part of replacing 
tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I 
never got around to committing it. Maybe the time is ripe to do that now?


[1] 
https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58...@iki.fi


- Heikki




Re: [Proposal] Global temporary tables

2020-02-04 Thread 曾文旌(义从)


> On Feb 3, 2020, at 4:16 PM, Konstantin Knizhnik wrote:
> 
> 
> 
> On 01.02.2020 19:14, 曾文旌(义从) wrote:
>> 
>> 
>>> On Jan 27, 2020, at 5:38 PM, Konstantin Knizhnik wrote:
>>> 
>>> 
>>> 
>>> On 25.01.2020 18:15, 曾文旌(义从) wrote:
 I wonder why do we need some special check for GTT here.
> From my point of view cleanup at startup of local storage of temp tables 
> should be performed in the same way for local and global temp tables.
 After oom kill, In autovacuum, the Isolated local temp table will be 
 cleaned like orphan temporary tables. The definition of local temp table 
 is deleted with the storage file. 
 But GTT can not do that. So we have the this implementation in my patch.
 If you have other solutions, please let me know.
 
>>> I wonder if it is possible that autovacuum or some other Postgres process 
>>> is killed by OOM and postmaster is not noticing it can doens't restart 
>>> Postgres instance?
>>> as far as I know, crash of any process connected to Postgres shared memory 
>>> (and autovacuum definitely has such connection) cause Postgres restart.
>> Postmaster will not restart after oom happen, but the startup process will. 
>> GTT data files are cleaned up in the startup process.
> 
> Yes, exactly.
> But it is still not clear to me why do we need some special handling for GTT?
> Shared memory is reinitialized and storage of temporary tables is removed.
> It is true for both local and global temp tables.
Of course not. The local temp table cleans up the entire table (including 
catalog buffer and datafile). GTT is not.

> 
>>> 
>>> 
 In my design
 1 Because different sessions have different transaction information, I 
 choose to store the transaction information of GTT in MyProc,not catalog.
 2 About the XID wraparound problem, the reason is the design of the temp 
 table storage(local temp table and global temp table) that makes it can 
 not to do vacuum by autovacuum. 
 It should be completely solve at the storage level.
 
>>> 
>>> My point of view is that vacuuming of temp tables is common problem for 
>>> local and global temp tables. 
>>> So it has to be addressed in the common way and so we should not try to fix 
>>> this problem only for GTT.
>> I think I agree with you this point.
>> However, this does not mean that GTT transaction information stored in 
>> pg_class is correct.
>> If you keep it that way, like in global_private_temp-8.patch, It may cause 
>> data loss in GTT after aotuvauum.
> 
> In my patch autovacuum is prohibited for GTT.
But vacuum GTT is not prohibited. 

> 
>> IMHO forced terminated of client sessions is not acceptable solution.
>>> And it is not an absolutely necessary requirement.
>>> So from my point of view we should not add such limitations to GTT design.
>> This limitation makes it possible for the GTT to do all the DDL.
>> IMHO even oracle's GTT has similar limitations.
> 
> I have checked that Oracle is not preventing creation of index for GTT if 
> there are some active sessions working with this table. And this index 
> becomes visible for all this sessions.
1 Yes, the creation of an index on GTT has been improved in 
global_temporary_table_v10-pg13.patch
2 But alter GTT ; drop GTT ; drop index on GTT is blocked by other sessions

SQL> drop table gtt;
drop table gtt
   *
ERROR at line 1:
ORA-14452: attempt to create, alter or drop an index on temporary table already
in use


SQL> ALTER TABLE gtt add b int ; 
ALTER TABLE gtt add b int
*
ERROR at line 1:
ORA-14450: attempt to access a transactional temp table already in use

SQL> drop index idx_gtt;
drop index idx_gtt
   *
ERROR at line 1:
ORA-14452: attempt to create, alter or drop an index on temporary table already
in use

I'm not saying we should do this, but from an implementation perspective we 
face similar issues.
If a dba changes a GTT, he can do it. Therefore, I think it is acceptable to do 
so.

> 
> 
>> As global_private_temp-8.patch, think about:
>> 1 session X takes several hours doing some statistical work with the GTT A, 
>> which generated some data using transaction 100, The work is not over.
>> 2 Then session Y vacuumed A, and the GTT's relfrozenxid (in pg_class) was 
>> updated to 1000 .
>> 3 Then the autovacuum happened, the clog before 1000 was cleaned up.
>> 4 The data in session A could be lost due to missing clog, The analysis task 
>> failed.
>> 
>> However This is likely to happen because you allowed the GTT do vacuum. 
>> And this is not a common problem, that not happen with local temp tables.
>> I feel uneasy about leaving such a question. We can improve it.
>> 
> 
> May be the easies solution is to prohibit explicit vacuum of GTT?
I think vacuum is an important part of GTT.

Looking back at previous emails, robert once said that vacuum GTT is pretty 
important.
https://www.postgresql.org/message-id/CA%2BTgmob%3DL1k0cpXRcipdsaE07ok%2BOn%3DtTjRiw7FtD_D2T%3DJwhg%40mail.gmail.com
 

Re: Expose lock group leader pid in pg_stat_activity

2020-02-04 Thread Julien Rouhaud
On Thu, Jan 30, 2020 at 10:03:01PM +0900, Michael Paquier wrote:
> On Tue, Jan 28, 2020 at 02:52:08PM +0100, Tomas Vondra wrote:
> > On Tue, Jan 28, 2020 at 02:26:34PM +0100, Julien Rouhaud wrote:
> >> There were already some dependencies between the rows since parallel
> >> queries were added, as you could see eg. a parallel worker while no
> >> query is currently active.  This patch will make those corner cases
> >> more obvious.
>
> I was reviewing the code and one thing that I was wondering is if it
> would be better to make the code more defensive and return NULL when
> the PID of the referenced leader is 0 or InvalidPid.  However that
> would mean that we have a dummy 2PC entry from PGPROC or something not
> yet started, which makes no sense.  So your simpler version is
> actually fine.  What you have here is that in the worst case you could
> finish with an incorrect reference to shared memory if a PGPROC is
> recycled between the moment you look for the leader field and the
> moment the PID value is fetched.  That's very unlikely to happen, and
> I agree that it does not really justify the cost of taking extra locks
> every time we scan pg_stat_activity.

Ok.

>
> > Yeah, sure. I mean explicit dependencies, e.g. a column referencing
> > values from another row, like leader_id does.
>
> +  The leader_pid is NULL for processes not involved in parallel query.
> This is missing two markups, one for "NULL" and a second for
> "leader_pid".

The extra "leader_pid" disappeared when I reworked the doc.  I'm not sure what
you meant here for NULL as I don't see any extra markup used in nearby
documentation, so I hope this version is ok.

> The documentation does not match the surroundings
> either, so I would suggest a reformulation for the beginning:
> PID of the leader process if this process is involved in parallel query.

> And actually this paragraph is not completely true, because leader_pid
> remains set even after one parallel query run has been finished for a
> session when leader_pid = pid as lockGroupLeader is set to NULL only
> once the process is stopped in ProcKill().

Oh good point, that's unfortunately not a super friendly behavior.  I tried to
adapt the documentation to address of all that.  It's maybe slightly too
verbose, but I guess that extra clarity is welcome here.

> >> Should I document the possible inconsistencies?
> >
> > I think it's worth mentioning that as a comment in the code, say before
> > the pg_stat_get_activity function. IMO we don't need to document all
> > possible inconsistencies, a generic explanation is enough.
>
> Agreed that adding some information in the area when we look at the
> PGPROC entries for wait events and such would be nice.

I added some code comments to remind that we don't guarantee any consistency
here.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699079..5e1f6c057b 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -622,6 +622,18 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
  integer
  Process ID of this backend
 
+
+ leader_pid
+ integer
+ 
+  Process ID of the leader process if this process is or has been involved
+  in parallel query, or null.  When a process wants to cooperate with
+  parallel workers, it becomes a parallel group leader, which means that
+  this field will be valued to its own process ID, and will remain a group
+  leader as long as the process exist.  When a parallel worker starts up,
+  this field will be valued with the parallel group leader process ID.
+ 
+
 
  usesysid
  oid
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
index c9e6060035..f681aafcf9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -741,6 +741,7 @@ CREATE VIEW pg_stat_activity AS
 S.datid AS datid,
 D.datname AS datname,
 S.pid,
+S.leader_pid,
 S.usesysid,
 U.rolname AS usename,
 S.application_name,
diff --git a/src/backend/utils/adt/pgstatfuncs.c 
b/src/backend/utils/adt/pgstatfuncs.c
index 7b2da2b36f..f4bea2edd2 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -547,7 +547,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_activity(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_ACTIVITY_COLS  29
+#define PG_STAT_GET_ACTIVITY_COLS  30
int num_backends = pgstat_fetch_stat_numbackends();
int curr_backend;
int pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
@@ -686,33 +686,36 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
values[5] = CStringGetTextDatum(clipped_activity);
pfree(clipped_activity);
 
+ 

[PATCH v1] Allow COPY "text" format to output a header

2020-02-04 Thread Rémi Lapeyre
This patch adds the possibility to use the "header" option when using COPY with
the text format. A todo entry was opened for this and I updated the tests and
the documentation.

This was previously discussed at 
https://www.postgresql.org/message-id/flat/CACfv%2BpJ31tesLvncJyP24quo8AE%2BM0GP6p6MEpwPv6yV8%3DsVHQ%40mail.gmail.com

Greetings,
Rémi

---
 doc/src/sgml/ref/copy.sgml  |  3 +-
 src/backend/commands/copy.c | 11 ---
 src/test/regress/input/copy.source  | 46 +++--
 src/test/regress/output/copy.source | 41 +++--
 4 files changed, 64 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index a99f8155e4..c335320786 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -270,7 +270,8 @@ COPY { table_name [ ( CSV format.
+  This option is allowed only when using CSV or
+  text format.
  
 

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 40a8ec1abd..44bf73423c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1365,10 +1365,10 @@ ProcessCopyOptions(ParseState *pstate,
  errmsg("COPY delimiter cannot be \"%s\"", cstate->delim)));

 	/* Check header */
-	if (!cstate->csv_mode && cstate->header_line)
+	if (cstate->binary && cstate->header_line)
 		ereport(ERROR,
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("COPY HEADER available only in CSV mode")));
+ errmsg("COPY HEADER available only in CSV and text mode")));

 	/* Check quote */
 	if (!cstate->csv_mode && cstate->quote != NULL)
@@ -2100,8 +2100,11 @@ CopyTo(CopyState cstate)

 colname = NameStr(TupleDescAttr(tupDesc, attnum - 1)->attname);

-CopyAttributeOutCSV(cstate, colname, false,
-	list_length(cstate->attnumlist) == 1);
+if (cstate->csv_mode)
+	CopyAttributeOutCSV(cstate, colname, false,
+		list_length(cstate->attnumlist) == 1);
+else
+	CopyAttributeOutText(cstate, colname);
 			}

 			CopySendEndOfRow(cstate);
diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index a1d529ad36..0e08ceb844 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -87,52 +87,66 @@ ANALYZE bt_f8_heap;
 ANALYZE array_op_test;
 ANALYZE array_index_op_test;

+-- test header line feature
+
+create temp table copytest (
+	c1 int,
+	"col with tabulation: 	" text);
+
+copy copytest from stdin (header);
+this is just a line full of junk that would error out if parsed
+1	a
+2	b
+\.
+
+copy copytest to stdout (header);
+
 --- test copying in CSV mode with various styles
 --- of embedded line ending characters

-create temp table copytest (
+create temp table copytest2 (
 	style	text,
 	test 	text,
 	filler	int);

-insert into copytest values('DOS',E'abc\r\ndef',1);
-insert into copytest values('Unix',E'abc\ndef',2);
-insert into copytest values('Mac',E'abc\rdef',3);
-insert into copytest values(E'esc\\ape',E'a\\r\\\r\\\n\\nb',4);
+insert into copytest2 values('DOS',E'abc\r\ndef',1);
+insert into copytest2 values('Unix',E'abc\ndef',2);
+insert into copytest2 values('Mac',E'abc\rdef',3);
+insert into copytest2 values(E'esc\\ape',E'a\\r\\\r\\\n\\nb',4);

-copy copytest to '@abs_builddir@/results/copytest.csv' csv;
+copy copytest2 to '@abs_builddir@/results/copytest.csv' csv;

-create temp table copytest2 (like copytest);
+create temp table copytest3 (like copytest2);

-copy copytest2 from '@abs_builddir@/results/copytest.csv' csv;
+copy copytest3 from '@abs_builddir@/results/copytest.csv' csv;

-select * from copytest except select * from copytest2;
+select * from copytest2 except select * from copytest3;

-truncate copytest2;
+truncate copytest3;

 --- same test but with an escape char different from quote char

-copy copytest to '@abs_builddir@/results/copytest.csv' csv quote  escape E'\\';
+copy copytest2 to '@abs_builddir@/results/copytest.csv' csv quote  escape E'\\';

-copy copytest2 from '@abs_builddir@/results/copytest.csv' csv quote  escape E'\\';
+copy copytest3 from '@abs_builddir@/results/copytest.csv' csv quote  escape E'\\';

-select * from copytest except select * from copytest2;
+select * from copytest2 except select * from copytest3;


 -- test header line feature

-create temp table copytest3 (
+create temp table copytest4 (
 	c1 int,
 	"col with , comma" text,
 	"col with "" quote"  int);

-copy copytest3 from stdin csv header;
+copy copytest4 from stdin csv header;
 this is just a line full of junk that would error out if parsed
 1,a,1
 2,b,2
 \.

-copy copytest3 to stdout csv header;
+copy copytest4 to stdout csv header;

 -- test copy from with a partitioned table
 create table parted_copytest (
diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index 938d3551da..7f864d77b2 100644
--- a/src/test/regress/output/copy.source
+++ b/src/test/regress/output/copy.source
@@ -58,40 

More DSM slot exhaustion bugs

2020-02-04 Thread Thomas Munro
Hello,

Here are two more bugs I found by running the regression tests
repeatedly with dsm_create() hacked to fail at random.

1.  If find_or_make_matching_shared_tupledesc() fails, we leave behind
a null pointer in RecordCacheHash, so that a later lookup segfaults.

2.  If we do a rescan, then ExecHashJoinReInitializeDSM() needs to
return early if there is no DSM segment, otherwise a TOC lookup raises
a bogus error.

Here are some draft patches.


0001-Fix-shared-typmod-bug-on-DSM-slot-exhaustion.patch
Description: Binary data


0002-Fix-parallel-hash-join-rescan-on-DSM-exhaustion.patch
Description: Binary data


0003-Inject-random-faults-into-dsm_create.patch
Description: Binary data


Re: Memory-Bounded Hash Aggregation

2020-02-04 Thread Adam Lee
On Mon, Feb 03, 2020 at 06:24:14PM -0800, Jeff Davis wrote:
> On Mon, 2020-02-03 at 10:29 -0800, Jeff Davis wrote:
> > I ended up converting the freelist to a min heap.
> > 
> > Attached is a patch which makes three changes to better support
> > HashAgg:
> 
> And now I'm attaching another version of the main Hash Aggregation
> patch to be applied on top of the logtape.c patch.
> 
> Not a lot of changes from the last version; mostly some cleanup and
> rebasing. But it's faster now with the logtape.c changes.

Nice!

Just back from the holiday. I had the perf test with Tomas's script,
didn't notice the freelist sorting regression at that time.

The minheap looks good, have you tested the performance and aggregate
validation?

About the "Cap the size of the LogicalTapeSet's freelist" and "Don't
bother tracking free space for HashAgg at all" you mentioned in last
mail, I suppose these two options will lost the disk space saving
benefit since some blocks are not reusable then?

-- 
Adam Lee




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-02-04 Thread k.jami...@fujitsu.com
Hi,

I have rebased the patch to keep the CFbot happy.
Apparently, in the previous patch there was a possibility of an infinite loop
when allocating buffers, so I fixed that part and also removed some whitespace.

Kindly check the attached V6 patch.
Any thoughts on this?

Regards,
Kirk Jamison


v6-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description:  v6-Optimize-dropping-of-relation-buffers-using-dlist.patch


Re: Implementing Incremental View Maintenance

2020-02-04 Thread nuko yokohama
"ROW LEVEL SECURITY" and INCREMENTAL MATERIALIZED VIEW.

Hi.

If ROW LEVEL SECURITY is set on the source table after creating the
INCREMENTAL MATERIALIZED VIEW, the policy is not reflected in the view's
contents. After setting ROW LEVEL SECURITY (just as with a normal
MATERIALIZED VIEW), you need to execute REFRESH MATERIALIZED VIEW to
reflect the result. (This is not limited to this case, but applies in
general when search results change by means other than DML.)

I propose to add this note to the document (rules.sgml).

execute log.

```
[ec2-user@ip-10-0-1-10 rls]$ psql testdb -e -f rls.sql
CREATE USER user_a;
CREATE ROLE
CREATE TABLE test (id int, data text);
CREATE TABLE
GRANT ALL ON TABLE test TO user_a;
GRANT
GRANT ALL ON SCHEMA public  TO user_a;
GRANT
SET ROLE user_a;
SET
INSERT INTO test VALUES (1,'A'),(2,'B'),(3,'C');
INSERT 0 3
SELECT * FROM test;
 id | data
+--
  1 | A
  2 | B
  3 | C
(3 rows)

CREATE VIEW test_v AS SELECT * FROM test;
CREATE VIEW
CREATE MATERIALIZED VIEW test_mv AS SELECT * FROM test;
SELECT 3
CREATE INCREMENTAL MATERIALIZED VIEW test_imv AS SELECT * FROM test;
SELECT 3
SELECT * FROM test_v;
 id | data
+--
  1 | A
  2 | B
  3 | C
(3 rows)

SELECT * FROM test_mv;
 id | data
+--
  1 | A
  2 | B
  3 | C
(3 rows)

SELECT * FROM test_imv;
 id | data
+--
  3 | C
  1 | A
  2 | B
(3 rows)

RESET ROLE;
RESET
CREATE POLICY test_AAA ON test FOR SELECT TO user_a USING (data = 'A');
CREATE POLICY
ALTER TABLE test ENABLE ROW LEVEL SECURITY ;
ALTER TABLE
SET ROLE user_a;
SET
SELECT * FROM test_v;
 id | data
+--
  1 | A
(1 row)

SELECT * FROM test_mv;
 id | data
+--
  1 | A
  2 | B
  3 | C
(3 rows)

SELECT * FROM test_imv;
 id | data
+--
  3 | C
  1 | A
  2 | B
(3 rows)

REFRESH MATERIALIZED VIEW test_mv;
REFRESH MATERIALIZED VIEW
REFRESH MATERIALIZED VIEW test_imv;
REFRESH MATERIALIZED VIEW
SELECT * FROM test_mv;
 id | data
+--
  1 | A
(1 row)

SELECT * FROM test_imv;
 id | data
+--
  1 | A
(1 row)

RESET ROLE;
RESET
REVOKE ALL ON TABLE test FROM user_a;
REVOKE
REVOKE ALL ON TABLE test_v FROM user_a;
REVOKE
REVOKE ALL ON TABLE test_mv FROM user_a;
REVOKE
REVOKE ALL ON TABLE test_imv FROM user_a;
REVOKE
REVOKE ALL ON SCHEMA public FROM user_a;
REVOKE
DROP TABLE test CASCADE;
psql:rls.sql:40: NOTICE:  drop cascades to 3 other objects
DETAIL:  drop cascades to view test_v
drop cascades to materialized view test_mv
drop cascades to materialized view test_imv
DROP TABLE
DROP USER user_a;
DROP ROLE
[ec2-user@ip-10-0-1-10 rls]$

```

Regards.

On Thu, Dec 27, 2018 at 21:57, Yugo Nagata wrote:

> Hi,
>
> I would like to implement Incremental View Maintenance (IVM) on
> PostgreSQL.
> IVM is a technique to maintain materialized views which computes and
> applies
> only the incremental changes to the materialized views rather than
> recomputate the contents as the current REFRESH command does.
>
> I had a presentation on our PoC implementation of IVM at PGConf.eu 2018
> [1].
> Our implementation uses row OIDs to compute deltas for materialized
> views.
> The basic idea is that if we have information about which rows in base
> tables
> are contributing to generate a certain row in a matview then we can
> identify
> the affected rows when a base table is updated. This is based on an idea of
> Dr. Masunaga [2] who is a member of our group and inspired from ID-based
> approach[3].
>
> In our implementation, the mapping of the row OIDs of the materialized view
> and the base tables are stored in "OID map". When a base relation is
> modified,
> AFTER trigger is executed and the delta is recorded in delta tables using
> the transition table feature. The accual udpate of the matview is triggerd
> by REFRESH command with INCREMENTALLY option.
>
> However, we realize problems of our implementation. First, WITH OIDS will
> be removed since PG12, so OIDs are no longer available. Besides this, it
> would
> be hard to implement this since it needs many changes of executor nodes to
> collect base tables's OIDs during execuing a query. Also, the cost of
> maintaining
> OID map would be high.
>
> For these reasons, we started to think to implement IVM without relying on
> OIDs
> and made a bit more surveys.
>
> We also looked at Kevin Grittner's discussion [4] on incremental matview
> maintenance.  In this discussion, Kevin proposed to use counting algorithm
> [5]
> to handle projection views (using DISTNICT) properly. This algorithm need
> an
> additional system column, count_t, in materialized views and delta tables
> of
> base tables.
>
> However, the discussion about IVM is now stoped, so we would like to
> restart and
> progress this.
>
>
> Through our PoC inplementation and surveys, I think we need to think at
> least
> the followings for implementing IVM.
>
> 1. How to extract changes on base tables
>
> I think there would be at least two approaches for it.
>
>  - Using transition table in AFTER triggers
>  - Extracting changes from WAL using logical decoding
>
> In our PoC 

Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-02-04 Thread Amit Kapila
On Tue, Feb 4, 2020 at 10:15 AM Kuntal Ghosh  wrote:
>
> On Sun, Jan 12, 2020 at 9:51 AM Tom Lane  wrote:
> >
> > Tomas Vondra  writes:
> > > On Sat, Jan 11, 2020 at 10:53:57PM -0500, Tom Lane wrote:
> > >> remind me where the win came from, exactly?
> >
> > > Well, the problem is that in 10 we allocate tuple data in the main
> > > memory ReorderBuffer context, and when the transaction gets decoded we
> > > pfree() it. But in AllocSet that only moves the data to the freelists,
> > > it does not release it entirely. So with the right allocation pattern
> > > (sufficiently diverse chunk sizes) this can easily result in allocation
> > > of large amount of memory that is never released.
> >
> > > I don't know if this is what's happening in this particular test, but I
> > > wouldn't be surprised by it.
> >
> > Nah, don't think I believe that: the test inserts a bunch of tuples,
> > but they look like they will all be *exactly* the same size.
> >
> > CREATE TABLE decoding_test(x integer, y text);
> > ...
> >
> > FOR i IN 1..10 LOOP
> > BEGIN
> > INSERT INTO decoding_test(x) SELECT generate_series(1,5000);
> > EXCEPTION
> > when division_by_zero then perform 'dummy';
> > END;
> >
> I performed the same test in pg11 and reproduced the issue on the
> commit prior to a4ccc1cef5a04 (Generational memory allocator).
>
> ulimit -s 1024
> ulimit -v 30
>
> wal_level = logical
> max_replication_slots = 4
>
> And executed the following code snippet (shared by Amit Khandekar
> earlier in the thread).
>
..
>
> SELECT data from pg_logical_slot_get_changes('test_slot', NULL, NULL) LIMIT 
> 10;
>
> I got the following error:
> ERROR:  out of memory
> DETAIL:  Failed on request of size 8208.
>
> After that, I applied the "Generational memory allocator" patch and
> that solved the issue. From the error message, it is evident that the
> underlying code is trying to allocate a MaxTupleSize memory for each
> tuple. So, I re-introduced the following lines (which are removed by
> a4ccc1cef5a04) on top of the patch:
>
> --- a/src/backend/replication/logical/reorderbuffer.c
> +++ b/src/backend/replication/logical/reorderbuffer.c
> @@ -417,6 +417,9 @@ ReorderBufferGetTupleBuf(ReorderBuffer *rb, Size 
> tuple_len)
>
> alloc_len = tuple_len + SizeofHeapTupleHeader;
>
> +   if (alloc_len < MaxHeapTupleSize)
> +   alloc_len = MaxHeapTupleSize;
>
> And, the issue got reproduced with the same error:
> WARNING:  problem in Generation Tuples: number of free chunks 0 in
> block 0x7fe9e9e74010 exceeds 1018 allocated
> .
> ERROR:  out of memory
> DETAIL:  Failed on request of size 8208.
>
> I don't understand the code well enough to comment whether we can
> back-patch only this part of the code.
>

I don't think we can just back-patch that part of code as it is linked
to the way we are maintaining a cache (~8MB) for frequently allocated
objects.  See the comments around the definition of
max_cached_tuplebufs.  But probably, we can do something once we reach
such a limit, basically, once we know that we have already allocated
max_cached_tuplebufs number of tuples of size MaxHeapTupleSize, we
don't need to allocate more of that size.  Does this make sense?
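
To spell the idea out as a toy sketch (the names and constants below are
stand-ins, not the actual reorderbuffer.c code): round allocations up to the
cacheable size only while the cache can still grow, and allocate the exact
size afterwards so that the memory is really released.

#include <stddef.h>

#define CACHED_TUPLE_SIZE       8192    /* stand-in for the cached chunk size */
#define MAX_CACHED_TUPLEBUFS    4096    /* stand-in for max_cached_tuplebufs */

static size_t
choose_alloc_size(size_t tuple_len, size_t ncached)
{
    if (tuple_len < CACHED_TUPLE_SIZE && ncached < MAX_CACHED_TUPLEBUFS)
        return CACHED_TUPLE_SIZE;       /* eligible for the reuse cache */
    return tuple_len;                   /* cache is full: allocate exactly */
}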

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Add %x to PROMPT1 and PROMPT2

2020-02-04 Thread Michael Paquier
On Tue, Feb 04, 2020 at 10:31:43AM +0900, Ian Barwick wrote:
> The last change I recall affecting default psql behaviour was the addition
> of COMP_KEYWORD_CASE in 9.2 (db84ba65), which personally I (and no doubt 
> others)
> found annoying, but the world still turns.
> 
> +1 one for this change, it's something I also add to every .psqlrc I setup.

So..  We have:
+1: Vik, Ian, Daniel, Alvaro, Christoph
+-0: Tom (?), Fabien (?)
-1: Michael P.

So there is a clear majority in favor of changing the default.  Any
extra opinions from others?
--
Michael




Refactor compile-time assertion checks for C/C++

2020-02-04 Thread Michael Paquier
Hi all,

As of [1], I have been playing with the compile time assertions that
we have for expressions, declarations and statements.  And it turns out
that it is possible to consolidate the fallback
implementations for C and C++.  Attached is the result of what I am
getting at.  I am adding this patch to next CF.  Thoughts are
welcome.

[1]: 
https://www.postgresql.org/message-id/201DD0641B056142AC8C6645EC1B5F62014B8E8030@SYD1217
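
For reference, here is a small standalone C11 illustration of the
declaration-level and statement-level contexts that these macros have to
cover (the expression-level form additionally relies on GNU statement
expressions); the struct and function names are made up for the example.

#include <stddef.h>

struct two_ints
{
    int     a;
    int     b;
};

/* File scope: the case StaticAssertDecl has to support. */
_Static_assert(sizeof(struct two_ints) >= 2 * sizeof(int),
               "unexpected struct size");

static size_t
two_ints_size(void)
{
    /* Block scope: the case StaticAssertStmt has to support. */
    _Static_assert(offsetof(struct two_ints, a) == 0,
                   "unexpected field offset");

    return sizeof(struct two_ints);
}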

Thanks,
--
Michael
From 7596171651ced381a30534568220661c79d90d0f Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 4 Feb 2020 17:09:59 +0900
Subject: [PATCH] Refactor assertion definitions in c.h

This unifies the C and C++ fallback implementations.
---
 src/include/c.h | 29 +++--
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/src/include/c.h b/src/include/c.h
index d10b9812fb..905ecd948f 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -841,39 +841,32 @@ extern void ExceptionalCondition(const char *conditionName,
  * about a negative width for a struct bit-field.  This will not include a
  * helpful error message, but it beats not getting an error at all.
  */
-#ifndef __cplusplus
-#ifdef HAVE__STATIC_ASSERT
+#if !defined(__cplusplus) && defined(HAVE__STATIC_ASSERT)
+/* Default C implementation */
 #define StaticAssertStmt(condition, errmessage) \
 	do { _Static_assert(condition, errmessage); } while(0)
 #define StaticAssertExpr(condition, errmessage) \
 	((void) ({ StaticAssertStmt(condition, errmessage); true; }))
 #define StaticAssertDecl(condition, errmessage) \
 	_Static_assert(condition, errmessage)
-#else			/* !HAVE__STATIC_ASSERT */
+#elif defined(__cplusplus) && \
+	defined(__cpp_static_assert) && __cpp_static_assert >= 200410
+/* Default C++ implementation */
 #define StaticAssertStmt(condition, errmessage) \
-	((void) sizeof(struct { int static_assert_failure : (condition) ? 1 : -1; }))
+	static_assert(condition, errmessage)
 #define StaticAssertExpr(condition, errmessage) \
+	({ StaticAssertStmt(condition, errmessage); })
+#define StaticAssertDecl(condition, errmessage) \
 	StaticAssertStmt(condition, errmessage)
-#define StaticAssertDecl(condition, errmessage) \
-	extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])
-#endif			/* HAVE__STATIC_ASSERT */
-#else			/* C++ */
-#if defined(__cpp_static_assert) && __cpp_static_assert >= 200410
-#define StaticAssertStmt(condition, errmessage) \
-	static_assert(condition, errmessage)
-#define StaticAssertExpr(condition, errmessage) \
-	({ static_assert(condition, errmessage); })
-#define StaticAssertDecl(condition, errmessage) \
-	static_assert(condition, errmessage)
-#else			/* !__cpp_static_assert */
+#else
+/* Fallback implementation for C and C++ */
 #define StaticAssertStmt(condition, errmessage) \
 	do { struct static_assert_struct { int static_assert_failure : (condition) ? 1 : -1; }; } while(0)
 #define StaticAssertExpr(condition, errmessage) \
 	((void) ({ StaticAssertStmt(condition, errmessage); }))
 #define StaticAssertDecl(condition, errmessage) \
 	extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])
-#endif			/* __cpp_static_assert */
-#endif			/* C++ */
+#endif
 
 
 /*
-- 
2.25.0


