RE: Stronger safeguard for archive recovery not to miss data

2020-11-25 Thread osumi.takami...@fujitsu.com
Hello, Sergei

> It is possible only with manually configured hot_standby = off, right?
> We have ERROR in hot standby mode just below in
> CheckRequiredParameterValues.
Yes, you are right. I turned off hot_standby in the test in the previous
e-mail.
I'm sorry I omitted that piece of information.

Best,
Takamichi Osumi




Re: walsender bug: stuck during shutdown

2020-11-25 Thread Fujii Masao




On 2020/11/26 11:45, Alvaro Herrera wrote:

On 2020-Nov-26, Fujii Masao wrote:


On second thought, the walsender doesn't wait forever unless
wal_sender_timeout is disabled, even in the case under discussion?
Or if there is a case where wal_sender_timeout doesn't work as expected,
we might need to fix that first.


Hmm, no, it doesn't wait forever in that sense; tracing with the
debugger shows that the process is looping continuously.


Yes, so the problem here is that the walsender goes into a busy loop
in that case. This seems to happen only in a logical replication walsender.
In a physical replication walsender, WaitLatchOrSocket() in WalSndLoop()
seems to work as expected and prevents it from entering a busy loop
even in that case.

/*
 * If postmaster asked us to stop, don't wait anymore.
 *
 * It's important to do this check after the recomputation of
 * RecentFlushPtr, so we can send all remaining data before shutting
 * down.
 */
if (got_STOPPING)
break;

The above code in WalSndWaitForWal() seems to cause this issue. But I've
not come up with an idea about how to fix it yet.
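
For reference, the latch-based wait that keeps the physical-walsender path
from spinning looks roughly like the sketch below (simplified; in the real
WalSndLoop() the sleep time is derived from wal_sender_timeout and
WL_SOCKET_WRITEABLE is added only when output is pending, so treat the
sleeptime variable and the exact event mask here as illustrative):

	/*
	 * Sleep until the latch is set, the socket becomes readable, or the
	 * timeout expires, instead of iterating without blocking.
	 */
	(void) WaitLatchOrSocket(MyLatch,
							 WL_LATCH_SET | WL_SOCKET_READABLE |
							 WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
							 MyProcPort->sock,
							 sleeptime,
							 WAIT_EVENT_WAL_SENDER_MAIN);
	ResetLatch(MyLatch);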

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Stronger safeguard for archive recovery not to miss data

2020-11-25 Thread osumi.takami...@fujitsu.com
Hello, Horiguchi-San


On Thursday, Nov 26, 2020 4:29 PM Kyotaro Horiguchi 
> At Thu, 26 Nov 2020 07:18:39 +, "osumi.takami...@fujitsu.com"
>  wrote in
> > In order to test this fix, what I did is
> > 1 - get a base backup during wal_level is replica
> > 2 - stop the server and change the wal_level from replica to minimal
> > 3 - restart the server(to generate XLOG_PARAMETER_CHANGE)
> > 4 - stop the server and make the wal_level back to replica
> > 5 - start the server again
> > 6 - execute archive recoveries in both cases
> > (1) by editing the postgresql.conf and
> > touching recovery.signal in the base backup from the 1st step
> > (2) by making a replica with standby.signal
> > * During wal_level is replica, I enabled archive_mode in this test.
> >
> Perhaps we need the TAP test that conducts the above steps.
It really makes sense that it's better to show the procedure for
reproducing the flow.
Thanks for your advice.

Best,
Takamichi Osumi




Re: [patch] [doc] Minor variable related cleanup and rewording of plpgsql docs

2020-11-25 Thread Pavel Stehule
On Thu, Nov 26, 2020 at 6:41 AM David G. Johnston <
david.g.johns...@gmail.com> wrote:

> Hackers,
>
> Bug # 16519 [1] is another report of confusion regarding trying to use
> parameters in improper locations - specifically the SET ROLE command within
> pl/pgsql.  I'm re-attaching the doc patch and am adding it to the
> commitfest.
>

I checked this patch, and I think it is correct - my comments are just
about enhancing it with some examples.

Maybe for the following sentence some examples would be practical:

+ If the SQL command being executed is incapable of returning a result
+ it does not accept query parameters.
 

+ it does not accept query parameters (usually DDL commands like CREATE
TABLE, DROP TABLE, ALTER ... ).

+Query parameters will only be substituted in places where
syntactically allowed
+(in particular, identifiers can never be replaced with a query
parameter.)
+As an extreme case, consider this example of poor programming style:

In this case, I miss a more precise specification of identifiers:

+(in particular, SQL identifiers (like schema, table, column names) can
never be replaced with a query parameter.)

Regards

Pavel



> David J.
>
> [1]
> https://www.postgresql.org/message-id/16519-9ef04828d058a319%40postgresql.org
>
>


Re: Stronger safeguard for archive recovery not to miss data

2020-11-25 Thread Sergei Kornilov
Hello

> First of all, I confirmed the server started up without this patch.

It is possible only with manually configured hot_standby = off, right?
We have ERROR in hot standby mode just below in CheckRequiredParameterValues.

regards, Sergei




Re: Online verification of checksums

2020-11-25 Thread Michael Paquier
On Tue, Nov 24, 2020 at 12:38:30PM -0500, David Steele wrote:
> We are not just looking at one LSN value. Here are the steps we are
> proposing (I'll skip checks for zero pages here):
> 
> 1) Test the page checksum. If it passes the page is OK.
> 2) If the checksum does not pass then record the page offset and LSN and
> continue.

But here the checksum is broken, so while the offset is something we
can rely on, how do you make sure that the LSN is fine?  A broken
checksum could perfectly well mean that the LSN is actually *not* fine if
the page header got corrupted.

> 3) After the file is copied, reopen and reread the file, seeking to offsets
> where possible invalid pages were recorded in the first pass.
> a) If the page is now valid then it is OK.
> b) If the page is not valid but the LSN has increased from the LSN

See the previous point about the LSN value that we cannot rely on.

> A malicious attacker could easily trick these checks, but as Stephen pointed
> out elsewhere they would likely make the checksums valid which would escape
> detection anyway.
> 
> We believe that the chances of random storage corruption passing all these
> checks is incredibly small, but eventually we'll also check against the WAL
> to be completely sure.

The lack of check for any concurrent I/O on the follow-up retries is
disturbing.  How do you guarantee that on the second retry what you 
have is a torn page and not something corrupted?  Init forks for
example are made of up to 2 blocks, so the window would get short for
at least those.  There are many instances with tables that have few
pages as well.
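
(For context, the retry scheme quoted above amounts to roughly the sketch
below.  pg_checksum_page() and PageGetLSN() are the existing page helpers;
rereading the block into "page" on the second pass and remembering
"first_lsn" from the first pass are assumed to be handled by the caller.)

	static bool
	page_looks_valid_on_retry(char *page, BlockNumber blkno,
							  XLogRecPtr first_lsn)
	{
		PageHeader	hdr = (PageHeader) page;

		/* 3a) the page now verifies: treat it as OK (e.g. a torn read) */
		if (pg_checksum_page(page, blkno) == hdr->pd_checksum)
			return true;

		/* 3b) still invalid, but LSN moved forward: assume concurrent write */
		if (PageGetLSN(page) > first_lsn)
			return true;

		/* otherwise report it as a possibly corrupted page */
		return false;
	}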
--
Michael


signature.asc
Description: PGP signature


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread Kyotaro Horiguchi
At Thu, 26 Nov 2020 16:18:55 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> + /* Zero the array of blocks because these will all be dropped anyway */
> + MemSet(firstDelBlocks, 0, sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
> 
> We don't need to prepare nforks, forks and firstDelBlocks for all
> relations before looping over relations.  In other words, we can fill
> in the arrays for a relation at every iteration of relations.

Or we could even call FindAndDropRelFileNodeBuffers() for each
fork. It doesn't matter from a performance perspective whether the
function loops over forks or is called once per fork.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ResourceOwner refactoring

2020-11-25 Thread Michael Paquier
On Thu, Nov 19, 2020 at 06:40:19PM +0800, Craig Ringer wrote:
> [off-list for now]

Looks like you have missed something here.
--
Michael


signature.asc
Description: PGP signature


Re: Add table access method as an option to pgbench

2020-11-25 Thread Fabien COELHO


Hello David,

The previous patch was based on branch "REL_13_STABLE". Now, the attached new 
patch v2 is based on the master branch. I followed the new code structure using 
appendPQExpBuffer to append the new clause "using TABLEAM" in the proper 
position and tested it. In the meantime, I also updated the pgbench.sgml file to 
reflect the changes.


My 0.02€: I'm fine with the added feature.

The patch lacks a minimal coverage test. Consider adding something to the 
pgbench TAP tests, including failures (ie a non-existing method).


The option in the help string is not at the right alphabetical place.

I would merge the tableam declaration into the previous one with an extended 
comment, eg "tablespace and access method selection".


escape_tableam -> escaped_tableam ?

--
Fabien.

Re: Stronger safeguard for archive recovery not to miss data

2020-11-25 Thread Kyotaro Horiguchi
At Thu, 26 Nov 2020 07:18:39 +, "osumi.takami...@fujitsu.com" 
 wrote in 
> Hello
> 
> 
> The attached patch is intended to prevent a scenario that
> archive recovery hits WALs which come from wal_level=minimal
> and the server continues to work, which was discussed in the thread of [1].
> The motivation is to protect that user ends up with both getting replica
> that could miss data and getting the server to miss data in targeted recovery 
> mode.
> 
> About how to modify this, we reached the consensus in the thread.
> It is by changing the ereport's level from WARNING to FATAL in 
> CheckRequiredParameterValues().
> 
> In order to test this fix, what I did is
> 1 - get a base backup during wal_level is replica
> 2 - stop the server and change the wal_level from replica to minimal
> 3 - restart the server(to generate XLOG_PARAMETER_CHANGE)
> 4 - stop the server and make the wal_level back to replica
> 5 - start the server again
> 6 - execute archive recoveries in both cases
>   (1) by editing the postgresql.conf and
> touching recovery.signal in the base backup from the 1st step
>   (2) by making a replica with standby.signal
> * During wal_level is replica, I enabled archive_mode in this test.
> 
> First of all, I confirmed the server started up without this patch.
> After applying this safeguard patch, I checked that
> the server cannot start up any more in the scenario case.
> I checked the log and got the result below with this patch.
> 
> 2020-11-26 06:49:46.003 UTC [19715] FATAL:  WAL was generated with 
> wal_level=minimal, data may be missing
> 2020-11-26 06:49:46.003 UTC [19715] HINT:  This happens if you temporarily 
> set wal_level=minimal without taking a new base backup.
> 
> Lastly, this should be backpatched.
> Any comments ?

Perhaps we need the TAP test that conducts the above steps.

> [1]
> https://www.postgresql.org/message-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320%40TYAPR01MB2990.jpnprd01.prod.outlook.com

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [doc] plan invalidation when statistics are update

2020-11-25 Thread Fujii Masao




On 2020/11/26 14:30, torikoshia wrote:

On 2020-11-25 14:13, Fujii Masao wrote:

On 2020/11/24 23:14, Fujii Masao wrote:



On 2020/11/19 14:33, torikoshia wrote:

On 2020-11-18 11:35, Fujii Masao wrote:

Thanks for your comment!


On 2020/11/18 11:04, torikoshia wrote:

Hi,

AFAIU, when the planner statistics are updated, generic plans are invalidated 
and PostgreSQL recreates them. However, the manual doesn't seem to explain it 
explicitly.

   https://www.postgresql.org/docs/devel/sql-prepare.html

I guess this case is included in 'whenever database objects used in the 
statement have definitional (DDL) changes undergone', but I feel it's hard to 
infer.

Since updates of the statistics can often happen, how about describing this 
case explicitly like an attached patch?


+1 to add that note.

-   statement.  Also, if the value of  changes
+   statement. For example, when the planner statistics of the statement
+   are updated, PostgreSQL re-analyzes and
+   re-plans the statement.

I don't think "For example," is necessary.

"planner statistics of the statement" sounds vague? Does the statement
is re-analyzed and re-planned only when the planner statistics of database
objects used in the statement are updated? If yes, we should describe
that to make the note a bit more explicitly?


Yes. As far as I confirmed, updating statistics which are not used in
prepared statements doesn't trigger re-analyze and re-plan.

Since plan invalidations for DDL changes and statistical changes are caused
by PlanCacheRelCallback(Oid 'relid'), only the prepared statements using
the 'relid' relation seem to be invalidated.

Attached updated patch.


Thanks for confirming that and updating the patch!


    force re-analysis and re-planning of the statement before using it
    whenever database objects used in the statement have undergone
    definitional (DDL) changes since the previous use of the prepared
-   statement.  Also, if the value of  changes
+   statement. Similarly, whenever the planner statistics of database
+   objects used in the statement have updated, re-analysis and re-planning
+   happen.

"been" should be added between "have" and "updated" in the above "objects
 used in the statement have updated"?


You're right.


I'm inclined to add "since the previous use of the prepared statement" to
the second description as well, to make it clear. But if we do that, it's better
to merge the above two descriptions into one, as follows?

    whenever database objects used in the statement have undergone
-   definitional (DDL) changes since the previous use of the prepared
+   definitional (DDL) changes or the planner statistics of them have
+   been updated since the previous use of the prepared
    statement.  Also, if the value of  changes


Thanks, it seems better.


Pushed. Thanks!



+1 for documenting this case, since I just spent time reading the code for it 
last week, and
+1 for the above sentence to describe this case.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread Kyotaro Horiguchi
Hello, Kirk. Thank you for the new version.

At Thu, 26 Nov 2020 03:04:10 +, "k.jami...@fujitsu.com" 
 wrote in 
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund 
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > > which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > TRUNCATE?
> > 
> > Yes.  DropRelationFiles() is used in the following two paths:
> > 
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall() ->
> > DropRelFileNodesAllBuffers()
> > 
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles()  -> smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
> 
> Yes. The concern is that it was not clear in the function descriptions and 
> commit logs
> what the optimizations for the DropRelFileNodeBuffers() and 
> DropRelFileNodesAllBuffers()
> are for. So I revised only the function description of 
> DropRelFileNodeBuffers() and the
> commit logs of the 0003-0004 patches. Please check if the brief descriptions 
> would suffice.

I read the commit message of 3/4.  (Though this is not included
literally in the final commit.)

> While recovery, when WAL files of XLOG_SMGR_TRUNCATE from vacuum
> or autovacuum are replayed, the buffers are dropped when the sizes
> of all involved forks of a relation are already "cached". We can get

This sentence seems to be missing "dropped by (or using) what".

> a reliable size of nblocks for supplied relation's fork at that time,
> and it's safe because DropRelFileNodeBuffers() relies on the behavior
> that cached nblocks will not be invalidated by file extension during
> recovery.  Otherwise, or if not in recovery, proceed to sequential
> search of the whole buffer pool.

This sentence seems confusing. It reads as if "we can rely
on it because we're relying on it".  And "the cached value won't be
invalidated" doesn't explain the reason precisely. The reason, I think,
is that the cached value is guaranteed to be the maximum page we have
in shared buffers, at least during recovery, and that guarantee holds
because we don't ask fseek again once we have cached the value.

> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > lookups a lot
> > > of times, once for each of nnodes relations?
> > 
> > So, the threshold value should be compared with the total number of blocks
> > of all target relations, not each relation.  You seem to be right, got it.
> 
> Fixed this in 0004 patch. Now we compare the total number of 
> buffers-to-be-invalidated
> For ALL relations to the BUF_DROP_FULL_SCAN_THRESHOLD.

I didn't see the previous version, but the series of additional
palloc/pfree calls in this version looks worrisome.


int i,
+   j,
+   *nforks,
n = 0;

I think we don't usually define variables of different types at once.
(I'm not sure about defining multiple variables at once.)


@@ -3110,7 +3125,10 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, 
int nnodes)
DropRelFileNodeAllLocalBuffers(rnodes[i].node);
}
else
+   {
+   rels[n] = smgr_reln[i];
nodes[n++] = rnodes[i].node;
+   }
}

We don't need to remember nodes and rnodes here since rnodes[n] is
rels[n]->smgr_rnode here.  Or we don't even need to store rels since
we can scan the smgr_reln later again.

nodes is needed in the full-scan path but it is enough to collect it
after finding that we do full-scan.


/*
@@ -3120,6 +3138,68 @@ DropRelFileNodesAllBuffers(RelFileNodeBackend *rnodes, 
int nnodes)
if (n == 0)
{
pfree(nodes);
+   pfree(rels);
+   pfree(rnodes);
+   return;
+   }
+
+   nforks = palloc(sizeof(int) * n);
+   forks = palloc(sizeof(ForkNumber *) * n);
+   blocks = palloc(sizeof(BlockNumber *) * n);
+   firstDelBlocks = palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1));
+   for (i = 0; i < n; i++)
+   {
+   forks[i] = palloc(sizeof(ForkNumber) * (MAX_FORKNUM + 1));
+   blocks[i] = palloc(sizeof(BlockNumber) * (MAX_FORKNUM + 1));
+   }

We can allocate the whole array at once like this.

 BlockNumber (*blocks)[MAX_FORKNUM+1] =
  (BlockNumber (*)[MAX_FORKNUM+1])
  palloc(sizeof(BlockNumber) * n * (MAX_FORKNUM + 1))

The elements of forks[][] and blocks[][] 

Stronger safeguard for archive recovery not to miss data

2020-11-25 Thread osumi.takami...@fujitsu.com
Hello


The attached patch is intended to prevent a scenario in which
archive recovery hits WAL that comes from wal_level=minimal
and the server continues to work, which was discussed in the thread [1].
The motivation is to protect users from ending up with a replica
that could be missing data, or with a server that misses data in targeted
recovery mode.

We reached consensus in the thread about how to modify this:
change the ereport level from WARNING to FATAL in
CheckRequiredParameterValues().

In order to test this fix, what I did is
1 - get a base backup while wal_level is replica
2 - stop the server and change the wal_level from replica to minimal
3 - restart the server (to generate XLOG_PARAMETER_CHANGE)
4 - stop the server and change the wal_level back to replica
5 - start the server again
6 - execute archive recovery in both cases
(1) by editing the postgresql.conf and
touching recovery.signal in the base backup from the 1st step
(2) by making a replica with standby.signal
* While wal_level was replica, I enabled archive_mode in this test.

First of all, I confirmed that the server started up without this patch.
After applying this safeguard patch, I checked that
the server can no longer start up in the scenario case.
I checked the log and got the result below with this patch.

2020-11-26 06:49:46.003 UTC [19715] FATAL:  WAL was generated with 
wal_level=minimal, data may be missing
2020-11-26 06:49:46.003 UTC [19715] HINT:  This happens if you temporarily set 
wal_level=minimal without taking a new base backup.
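
(The change itself is small; assuming the existing check in
CheckRequiredParameterValues(), it amounts to roughly this:)

	if (ArchiveRecoveryRequested &&
		ControlFile->wal_level == WAL_LEVEL_MINIMAL)
	{
		ereport(FATAL,			/* previously WARNING */
				(errmsg("WAL was generated with wal_level=minimal, data may be missing"),
				 errhint("This happens if you temporarily set wal_level=minimal without taking a new base backup.")));
	}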

Lastly, this should be backpatched.
Any comments?

[1]
https://www.postgresql.org/message-id/TYAPR01MB29901EBE5A3ACCE55BA99186FE320%40TYAPR01MB2990.jpnprd01.prod.outlook.com


Best,
Takamichi Osumi


stronger_safeguard_for_archive_recovery.patch
Description: stronger_safeguard_for_archive_recovery.patch


Re: Parallel plans and "union all" subquery

2020-11-25 Thread Luc Vlaming

On 25-11-2020 14:54, Greg Nancarrow wrote:



On Wed, Nov 25, 2020 at 6:43 PM Luc Vlaming wrote:

 >
 >
 > You're completely right, sorry for my error. I was too quick on assuming
 > my patch would work for this specific case too; I should have tested
 > that before replying. It looked very similar but turns out to not work
 > because of the upper rel not being considered parallel.
 >
 > I would like to extend my patch to support this, or create a second
 > patch. This would however be significantly more involved because it
 > would require that we (always?) consider two paths whenever we process a
 > subquery: the best parallel plan and the best serial plan. Before I
 > embark on such a journey I would like some input on whether this would
 > be a very bad idea. Thoughts?
 >

Hi,

I must admit, your intended approach isn't what immediately came to mind 
when I saw this issue. Have you analyzed and debugged this to know 
exactly what is going on?
I haven't had time to debug this and see exactly where the code paths 
diverge for the use of "values(1)" versus "values(1::numeric)" in this 
case, but that would be one of the first steps.


What I wondered (and I may well be wrong) was how come the documented 
type resolution algorithm 
(https://www.postgresql.org/docs/13/typeconv-union-case.html) doesn't 
seem to be working quite right here, at least to the point of creating 
the same/similar parse tree as when I change "values(1)" to 
"values(1::numeric)" or even just "values(1.)"? So shouldn't the 
use of "values(1)" in this case (a constant, convertible to numeric - 
the preferred type) result in the same (parallel) plan as when 
"values(1::numeric)" is used? Perhaps this isn't happening because the 
code is treating these as generalised expressions when their types 
aren't the same, and this then affects parsing/planning?
My natural thought was that there seems to be a minor issue in the code, 
which should be reasonably easy to fix, at least for this fairly simple 
case.


However, I claim no expertise in the area of parser/analyzer/planner, I 
only know certain areas of that code, but enough to appreciate it is 
complex and intricate, and easily broken.
Perhaps one of the major contributors to this area of the code, who 
probably know this code very well, like maybe Tom Lane or Robert Haas 
(to name two) might like to comment on whether what we're looking at is 
indeed a bug/deficiency and worth fixing, and whether Luc is correct in 
his expressed approach on what would be required to fix it?


Regards,
Greg Nancarrow
Fujitsu Australia


From what I recall from building the patch, the difference is that 
when all types are identical, flatten_simple_union_all() simply 
flattens all union-all operations into an append relation.
If you don't have identical types, then the situation has to be handled 
by the code in prepunion.c, which doesn't always keep a parallel path 
around. The patch I had posted fixes a relatively simple issue, 
not the case described here.
If it's interesting, I can make a draft of what this would look like, if 
that makes it easier to discuss.


Regards,
Luc
Swarm64




Have collation versioning to ignore hash and similar AM

2020-11-25 Thread Julien Rouhaud
Hello,

Collation versioning has been committed, but there are still at least 2 cases
that aren't perfectly handled:

- AMs which don't rely on stable collation ordering (hash, bloom...)
- expressions which don't rely on a stable collation ordering (e.g. md5())

Handling expressions will probably require a lot of work, so I'd like to start
with AM handling.

My initial idea was to add a new field in IndexAmRoutine for that (see patch
[1] that was present in v30 [2], and the rest of the patches that used it).
Does anyone have any suggestion on a better way to handle that?

Note also that in the patch I named the field "amnostablecollorder", which is a
bit weird but required so that a custom access method won't be assumed to not
rely on a stable collation ordering if for some reason the author forgot to
correctly set it up.  If we end up with a new field in IndexAmRoutine, maybe we
could also add an API version field that could be bumped in case of changes so
that authors get an immediate error?  This way we wouldn't have to be worried
of a default value anymore.
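
To make the idea concrete, the IndexAmRoutine-based approach sketched in [1]
would amount to something like the following (illustrative only; neither the
field names nor the version field are settled API):

	typedef struct IndexAmRoutine
	{
		NodeTag		type;

		/* ... existing capability fields such as amcanorder, amcanunique ... */

		/*
		 * True if the AM does NOT rely on a stable collation ordering
		 * (hash-like AMs), so collation version tracking can be skipped
		 * for its indexes.  The negative phrasing keeps the conservative
		 * behavior for third-party AMs that never set it.
		 */
		bool		amnostablecollorder;

		/* possible addition: bumped whenever this struct changes */
		int			amapiversion;

		/* ... interface function pointers follow ... */
	} IndexAmRoutine;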

[1] 
https://www.postgresql.org/message-id/attachment/114354/v30-0001-Add-a-new-amnostablecollorder-flag-in-IndexAmRou.patch
[2] https://www.postgresql.org/message-id/20200924094854.abjmpfqixq6xd4o5%40nol




Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-25 Thread Masahiro Ikeda

On 2020-11-25 20:19, Fujii Masao wrote:

On 2020/11/19 16:31, Masahiro Ikeda wrote:

On 2020-11-17 11:46, Fujii Masao wrote:

On 2020/11/16 16:35, Masahiro Ikeda wrote:

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible even if both numbers
are important.

Also, about performance, I thought there would be little impact because it
increments stats in memory. If I can implement it to reuse pgWalUsage's
values, which already collect these stats, there is no impact in
XLogInsertRecord.
For example, how about having pg_stat_wal() calculate the accumulated
values of wal_records, wal_fpi, and wal_bytes from pgWalUsage's values?


I don't think that works, but it would work for pgstat_send_wal()
to take the difference of those values between two successive calls.


WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed it to reuse pgWalUsage,
which is already defined, eliminating the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;


It's better to use WalUsageAccumDiff() here?


Yes, thanks. I fixed it.
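
(For reference, a sketch of what that ends up looking like, assuming the
existing WalUsageAccumDiff() helper and a local WalUsage variable for the
delta:)

	WalUsage	walusage;

	/* compute the delta accumulated since the previous report */
	memset(&walusage, 0, sizeof(WalUsage));
	WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

	WalStats.m_wal_records = walusage.wal_records;
	WalStats.m_wal_fpi = walusage.wal_fpi;
	WalStats.m_wal_bytes = walusage.wal_bytes;

	/* remember the current totals for the next call */
	prevWalUsage = pgWalUsage;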


prevWalUsage needs to be initialized with pgWalUsage?

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() send also the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.


I'll remove the above source code because these counters are not 
useful.



On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to the pg_stat_wal view.

Although there are some parameters related to WAL,
there are few statistics for tuning them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand DB's performance, first, we will check the performance
trends for the entire database instance.
For example, if the number of wal_fpi becomes higher, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements now,
if users want to see the performance trends for the entire database,
they must recalculate the statistics.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

To create a new WAL file may have an impact on the performance of
a write-heavy workload generating lots 

Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-25 Thread Luc Vlaming

On 26-11-2020 07:31, Bharath Rupireddy wrote:

On Thu, Nov 26, 2020 at 9:55 AM Michael Paquier  wrote:


+inline Size
+GetTupleSize(TupleTableSlot *slot, Size maxsize)
+{
+   Size sz = 0;
+   HeapTuple tuple = NULL;
+
+   if (TTS_IS_HEAPTUPLE(slot))
+   tuple = ((HeapTupleTableSlot *) slot)->tuple;
+   else if(TTS_IS_BUFFERTUPLE(slot))
+   tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
+   else if(TTS_IS_MINIMALTUPLE(slot))
+   tuple = ((MinimalTupleTableSlot *) slot)->tuple;

There have been various talks about the methods we could use to
evaluate the threshold in bytes when evaluating that a flush can
happen, including the use of memory contexts, or even estimate the
size of the number of tuples.  This one looks promising because it
seems exact, however for virtual slots I don't like much the fact that
you basically just extracted the parts of tts_virtual_materialize()
and stuck them in this routine.  That's a recipe for future bugs if
the materialization logic changes.  In short, I am surprised that this
calculation is not directly part of TupleTableSlotOps.  What we'd want
is to get this information depending on the slot type dealt with, and
with your patch you would miss to handle any new slot type
introduced.



Yes for virtual slots, I reused the code from
tts_virtual_materialize() in GetTupleSize(). I can think of below
options:

1) Make the size calculation code for virtual slots, a macro or a
static inline function and use that in tts_virtual_materialize() and
GetTupleSize().
2) Add comments in both the places, such as "if any code is changed
here, consider changing it in tts_virtual_materialize() /
GetTupleSize()"
3) Add a size variable to TupleTableSlotOps structure.
4) Add a new API to TupleTableSlotOps structure say get_slot_size().
5) For new slot types, maybe we can have comments in tuptable.h to
consider having equivalent change in GetTupleSize().

If we go with 3 and 4, will it be acceptable to add the extra code in
generic structure which gets used in most of the code base and use
that new code only in limited places (for multi inserts in CTAS and
Refresh Mat View)? I think we can go ahead with 2 and 5. Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com



What I'm wondering about is the reason for wanting a cap on data volume. 
When doing some local (highly concurrent) ingest speed tests a few weeks 
ago, it seemed to mostly matter how many pages were being written and the 
resulting pressure on locks, etc., and not so much the actual 
memory usage. I didn't collect proof of that though (yet). There was, 
however, a very clearly observable contention point where with bigger 
buffers the performance would not only stagnate but actually drop.


So what I'm kind of wondering is whether we should worry more about the 
number of pages that are going to be written and maybe not so much about 
the memory usage?


If this were to be the case then maybe we can consider improving the 
current design, potentially in a follow-up patch? The problem I see is 
that generically each tableam will have different choices to make on how 
to buffer and flush multiple rows, given that a storage engine might 
have more or less write amplification, a different way of extending a 
relation, fsm use, etc.

Assuming we indeed want a per-tableam implementation, we could either:
- make multi_insert buffer the tuples itself and add a flush_multi_insert.
- add a new function called create_multi_insert which returns something 
like a MultiInsertState, which, like a destreceiver, has a set of 
callbacks to start, shutdown and insert.


With both solutions one part that to me seems appealing is that we 
buffer the data in something that likely resembles the disk format very 
much. Thoughts?
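
For discussion, the second option above could look roughly like this (the 
names are purely illustrative, nothing here exists in the tree):

	/* Hypothetical per-table-AM buffered-insert API, sketched for discussion */
	typedef struct MultiInsertState MultiInsertState;	/* opaque, AM-specific */

	typedef struct MultiInsertRoutine
	{
		/* set up AM-private buffering state for rel */
		MultiInsertState *(*begin) (Relation rel, CommandId cid, int options);

		/* buffer one slot; the AM decides internally when to flush */
		void		(*insert) (MultiInsertState *state, TupleTableSlot *slot);

		/* flush any remaining buffered tuples and release resources */
		void		(*end) (MultiInsertState *state);
	} MultiInsertRoutine;

That would let each AM pick its own flush policy (pages written, memory 
used, or something else) without the caller having to know about it.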


Regards,
Luc
Swarm64




Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Amit Khandekar
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar  wrote:
> Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for 
> server and 8 for client) [2 numa nodes]
> Storage: 3.2 TB NVMe SSD
> OS: CentOS Linux release 7.6
> PGSQL: baseline = Release Tag 13.1
> Invocation suite: 
> https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses 
> pgbench)

Using the same hardware, attached are my improvement figures, which
are pretty much in line with your figures. Except that I did not run
for more than 400 clients. And I am getting some
improvement even for select-only workloads, in the case of 200-400
clients. For the read-write load, I had seen that the s_lock() contention
was caused when XLogFlush() uses the spinlock. But for the read-only
case, I have not analyzed where the improvement occurred.

The .png files in the attached tar have the graphs for head versus patch.

The GUCs that I changed :

work_mem=64MB
shared_buffers=128GB
maintenance_work_mem = 1GB
min_wal_size = 20GB
max_wal_size = 100GB
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
full_page_writes = on
synchronous_commit = on
effective_io_concurrency = 200
log_checkpoints = on

For backends, 64 CPUs were allotted (covering 2 NUMA nodes), and for
pgbench clients a separate set of 28 CPUs was allotted on a different
socket. The server was pre-warmed.


results.tar.gz
Description: application/gzip


RE: Parallel Inserts in CREATE TABLE AS

2020-11-25 Thread Hou, Zhijie
Hi ,

> On Thu, Nov 26, 2020 at 7:47 AM Hou, Zhijie 
> wrote:
> >
> > Hi,
> >
> > I have an issue about the following code:
> >
> > econtext = node->ps.ps_ExprContext;
> > ResetExprContext(econtext);
> >
> > +   if (ISCTAS(node->ps.intoclause))
> > +   {
> > +   ExecParallelInsertInCTAS(node);
> > +   return NULL;
> > +   }
> >
> > /* If no projection is required, we're done. */
> > if (node->ps.ps_ProjInfo == NULL)
> > return slot;
> >
> > /*
> >  * Form the result tuple using ExecProject(), and return it.
> >  */
> > econtext->ecxt_outertuple = slot;
> > return ExecProject(node->ps.ps_ProjInfo);
> >
> > It seems the projection will be skipped.
> > Is this because projection is not required in this case ?
> > (I'm not very familiar with where the projection will be.)
> >
> 
> For parallel inserts in CTAS, I don't think we need to project the tuples
> being returned from the underlying plan nodes, and also we have nothing
> to project from the Gather node further up. The required projection will
> happen while the tuples are being returned from the underlying nodes and
> the projected tuples are being directly fed to CTAS's dest receiver
> intorel_receive(), from there into the created table. We don't need
> ExecProject again in ExecParallelInsertInCTAS().
> 
> For instance, projection will always be done when the tuple is being returned
> from an underlying sequential scan node(see ExecScan() -->
> ExecProject() and this is true for both leader and workers. In both leader
> and workers, we are just calling CTAS's dest receiver intorel_receive().
> 
> Thoughts?

I took a deep look at the projection logic.
In most cases, you are right that the Gather node does not need projection.

In some rare cases, such as a SubPlan (or InitPlan, I guess),
the projection will happen in the Gather node.

The example:

Create table test(i int);
Create table test2(a int, b int);
insert into test values(generate_series(1,1000,1));
insert into test2 values(generate_series(1,1000,1), generate_series(1,1000,1));

postgres=# explain(verbose, costs off) select test.i,(select i from (select * 
from test2) as tt limit 1) from test where test.i < 2000;
   QUERY PLAN

 Gather
   Output: test.i, (SubPlan 1)
   Workers Planned: 2
   ->  Parallel Seq Scan on public.test
 Output: test.i
 Filter: (test.i < 2000)
   SubPlan 1
 ->  Limit
   Output: (test.i)
   ->  Seq Scan on public.test2
 Output: test.i

In this case, projection is necessary,
because the subplan will be executed during projection.

If it is skipped, the created table will lose some data.



Best regards,
houzj




Re: proposal: possibility to read dumped table's name from file

2020-11-25 Thread Pavel Stehule
On Wed, Nov 25, 2020 at 9:00 PM Tom Lane wrote:

> Pavel Stehule  writes:
> > On Wed, Nov 25, 2020 at 7:25 PM Dean Rasheed <
> dean.a.rash...@gmail.com>
> > wrote:
> >> I agree that being able to configure pg_dump via a config file would
> >> be very useful, but the syntax proposed here feels much more like a
> >> hacked-up syntax designed to meet this one use case, rather than a
> >> good general-purpose design that can be easily extended.
>
> > But I don't understand why? What is a use case? What is a benefit against
> > command line, or libpq variables? And why should config files be better
> as
> > a solution for limited length of command line, when I need to dump
> > thousands of tables exactly specified?
>
> Because next week somebody will want to dump thousands of functions
> selected by name, or schemas selected by name, etc etc.  I agree with
> the position that we don't want a single-purpose solution.  The idea
> that the syntax should match the command line switch syntax seems
> reasonable, though I'm not wedded to it.  (One thing to consider is
> how painful will it be for people to quote table names containing
> funny characters, for instance.  On the command line, we largely
> depend on the shell's quoting behavior to solve that, but we'd not
> have that infrastructure when reading from a file.)
>

This is not a problem with the current patch - the last version of this
patch supports obscure names well.

There was a requirement for supporting all current and future pg_dump
options - OK, that can make sense. I have no problem using instead a line
format

"option argument" or "long-option=argument"

Can this format be a solution? I'll try to rewrite the parser for this
format.
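
To make that concrete, a filter file in the "long-option=argument" form
might look like this (a hypothetical example only, using existing pg_dump
long options):

	table=public.orders
	table=public.customers
	exclude-table-data=public.audit_log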

It is implementable, but it is in collision with Stephen's requirement
for a human-readable format designed for handy writing. These are
requests that have no intersection. A well-readable format needs a more
complex parser, and machine-generating such a format needs more work -
generating a flat file is simpler and more robust than generating JSON or
YAML.


>
> > What are the benefits of supporting multiple formats?
>
> Yeah, that part of Dean's sketch seemed like overkill to me too.
>
> I wasn't very excited about multiple switch files either, though
> depending on how the implementation is done, that could be simple
> enough to be in the might-as-well category.
>
> One other point that I'm wondering about is that there's really no
> value in doing anything here until you get to some thousands of
> table names; as long as the list fits in the shell's command line
> length limit, you might as well just make a shell script file.
> Does pg_dump really have sane performance for that situation, or
> are we soon going to be fielding requests to make it not be O(N^2)
> in the number of listed tables?
>

Performance is another factor, but the command-line limit can easily be
hit when table names have the maximum width.


> regards, tom lane
>


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-25 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 9:55 AM Michael Paquier  wrote:
>
> +inline Size
> +GetTupleSize(TupleTableSlot *slot, Size maxsize)
> +{
> +   Size sz = 0;
> +   HeapTuple tuple = NULL;
> +
> +   if (TTS_IS_HEAPTUPLE(slot))
> +   tuple = ((HeapTupleTableSlot *) slot)->tuple;
> +   else if(TTS_IS_BUFFERTUPLE(slot))
> +   tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
> +   else if(TTS_IS_MINIMALTUPLE(slot))
> +   tuple = ((MinimalTupleTableSlot *) slot)->tuple;
>
> There have been various talks about the methods we could use to
> evaluate the threshold in bytes when evaluating that a flush can
> happen, including the use of memory contexts, or even estimate the
> size of the number of tuples.  This one looks promising because it
> seems exact, however for virtual slots I don't like much the fact that
> you basically just extracted the parts of tts_virtual_materialize()
> and stuck them in this routine.  That's a recipe for future bugs if
> the materialization logic changes.  In short, I am surprised that this
> calculation is not directly part of TupleTableSlotOps.  What we'd want
> is to get this information depending on the slot type dealt with, and
> with your patch you would miss to handle any new slot type
> introduced.
>

Yes for virtual slots, I reused the code from
tts_virtual_materialize() in GetTupleSize(). I can think of below
options:

1) Make the size calculation code for virtual slots, a macro or a
static inline function and use that in tts_virtual_materialize() and
GetTupleSize().
2) Add comments in both the places, such as "if any code is changed
here, consider changing it in tts_virtual_materialize() /
GetTupleSize()"
3) Add a size variable to TupleTableSlotOps structure.
4) Add a new API to TupleTableSlotOps structure say get_slot_size().
5) For new slot types, maybe we can have comments in tuptable.h to
consider having equivalent change in GetTupleSize().

If we go with 3 and 4, will it be acceptable to add the extra code in
generic structure which gets used in most of the code base and use
that new code only in limited places (for multi inserts in CTAS and
Refresh Mat View)? I think we can go ahead with 2 and 5. Thoughts?
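
For what it's worth, option 4 could look roughly like this (purely a sketch;
the callback name is made up):

	typedef struct TupleTableSlotOps
	{
		/* ... existing callbacks: getsomeattrs, materialize, copyslot, ... */

		/*
		 * Return the memory needed to materialize the slot's contents, so
		 * callers can track the size of buffered tuples without duplicating
		 * tts_virtual_materialize()'s logic for each slot type.
		 */
		Size		(*get_tuple_size) (TupleTableSlot *slot);
	} TupleTableSlotOps;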

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Add table access method as an option to pgbench

2020-11-25 Thread Michael Paquier
On Wed, Nov 25, 2020 at 12:13:55PM -0800, David Zhang wrote:
> The previous patch was based on branch "REL_13_STABLE". Now, the attached
> new patch v2 is based on master branch. I followed the new code structure
> using appendPQExpBuffer to append the new clause "using TABLEAM" in a proper
> position and tested. In the meantime, I also updated the pgbench.sgml file
> to reflect the changes.

+ 
+  
--table-am=TABLEAM
+  
+   
+Create all tables with specified table access method
+TABLEAM.
+If unspecified, default is heap.
+   
+  
+ 
This needs to be in alphabetical order.  And what you say here is
wrong.  The default would not be heap if default_table_access_method
is set to something else.  I would suggest to use table_access_method
instead of TABLEAM, and keep the paragraph minimal, say:
"Create tables using the specified table access method, rather than
the default."

--table-am is really the best option name?  --table-access-method
sounds a bit more consistent to me.

+   if (tableam != NULL)
+   {
+   char   *escape_tableam;
+
+   escape_tableam = PQescapeIdentifier(con, tableam, strlen(tableam));
+   appendPQExpBuffer(, " using %s", escape_tableam);
+   PQfreemem(escape_tableam);
+   }
The order is wrong here, generating an unsupported grammar, see by
yourself with this command:
pgbench --partition-method=hash --table-am=heap -i  --partitions 2

This makes the patch trickier than it looks as the USING clause is
located between PARTITION BY and WITH.  Also, partitioned tables
cannot use the USING clause so you need to be careful that
createPartitions() also uses the correct table AM.

This stuff needs tests.
--
Michael


signature.asc
Description: PGP signature


Re: [doc] plan invalidation when statistics are update

2020-11-25 Thread torikoshia

On 2020-11-25 14:13, Fujii Masao wrote:

On 2020/11/24 23:14, Fujii Masao wrote:



On 2020/11/19 14:33, torikoshia wrote:

On 2020-11-18 11:35, Fujii Masao wrote:

Thanks for your comment!


On 2020/11/18 11:04, torikoshia wrote:

Hi,

AFAIU, when the planner statistics are updated, generic plans are 
invalidated and PostgreSQL recreates them. However, the manual doesn't 
seem to explain it explicitly.


   https://www.postgresql.org/docs/devel/sql-prepare.html

I guess this case is included in 'whenever database objects used in 
the statement have definitional (DDL) changes undergone', but I 
feel it's hard to infer.


Since updates of the statistics can often happen, how about 
describing this case explicitly like an attached patch?


+1 to add that note.

-   statement.  Also, if the value of <xref linkend="guc-search-path"/> changes
+   statement. For example, when the planner statistics of the statement
+   are updated, PostgreSQL re-analyzes and
+   re-plans the statement.

I don't think "For example," is necessary.

"planner statistics of the statement" sounds vague? Does the 
statement
is re-analyzed and re-planned only when the planner statistics of 
database
objects used in the statement are updated? If yes, we should 
describe

that to make the note a bit more explicitly?


Yes. As far as I confirmed, updating statistics which are not used in
prepared statements doesn't trigger re-analyze and re-plan.

Since plan invalidations for DDL changes and statistical changes are caused
by PlanCacheRelCallback(Oid 'relid'), only the prepared statements using
the 'relid' relation seem to be invalidated.

Attached updated patch.


Thanks for confirming that and updating the patch!


    force re-analysis and re-planning of the statement before using it
    whenever database objects used in the statement have undergone
    definitional (DDL) changes since the previous use of the prepared
-   statement.  Also, if the value of <xref linkend="guc-search-path"/> changes
+   statement. Similarly, whenever the planner statistics of database
+   objects used in the statement have updated, re-analysis and re-planning
+   happen.

"been" should be added between "have" and "updated" in the above 
"objects

 used in the statement have updated"?


You're right.

I'm inclined to add "since the previous use of the prepared statement" to
the second description as well, to make it clear. But if we do that, it's
better to merge the above two descriptions into one, as follows?

    whenever database objects used in the statement have undergone
-   definitional (DDL) changes since the previous use of the prepared
+   definitional (DDL) changes or the planner statistics of them have
+   been updated since the previous use of the prepared
    statement.  Also, if the value of <xref linkend="guc-search-path"/> changes


Thanks, it seems better.


Regards,




Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
On Thu, 26 Nov 2020 at 10:50, Tom Lane  wrote:

> Michael Paquier  writes:
> > On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
> >> (Thanks to Amit Khandekar for rigorously performance testing this patch
> >> with different combinations).
>
> > For the simple-update and tpcb-like graphs, do you have any actual
> > numbers to share between 128 and 1024 connections?
>
> Also, exactly what hardware/software platform were these curves
> obtained on?
>

Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
server and 8 for client) [2 numa nodes]
Storage: 3.2 TB NVMe SSD
OS: CentOS Linux release 7.6
PGSQL: baseline = Release Tag 13.1
Invocation suite:
https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses
pgbench)



> regards, tom lane
>


-- 
Regards,
Krunal Bauskar


RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread k.jami...@fujitsu.com
> From: k.jami...@fujitsu.com 
> On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> > From: Andres Freund 
> > > DropRelFileNodeBuffers() in recovery? The most common path is
> > > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers()
> > > , which 3/4 doesn't address and 4/4 doesn't mention.
> > >
> > > 4/4 seems to address DropRelationFiles(), but only talks about
> > TRUNCATE?
> >
> > Yes.  DropRelationFiles() is used in the following two paths:
> >
> > [Replay of TRUNCATE during recovery]
> > xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall()
> > ->
> > DropRelFileNodesAllBuffers()
> >
> > [COMMIT/ROLLBACK PREPARED]
> > FinishPreparedTransaction() -> DropRelationFiles()  ->
> > smgrdounlinkall()
> > -> DropRelFileNodesAllBuffers()
> 
> Yes. The concern is that it was not clear in the function descriptions and
> commit logs what the optimizations for the DropRelFileNodeBuffers() and
> DropRelFileNodesAllBuffers() are for. So I revised only the function
> description of DropRelFileNodeBuffers() and the commit logs of the
> 0003-0004 patches. Please check if the brief descriptions would suffice.
> 
> 
> > > I also don't get why 4/4 would be a good idea on its own. It uses
> > > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> > lookups a lot
> > > of times, once for each of nnodes relations?
> >
> > So, the threshold value should be compared with the total number of
> > blocks of all target relations, not each relation.  You seem to be right, 
> > got it.
> 
> Fixed this in 0004 patch. Now we compare the total number of
> buffers-to-be-invalidated For ALL relations to the
> BUF_DROP_FULL_SCAN_THRESHOLD.
> 
> > > Also, how is 4/4 safe - this is outside of recovery too?
> >
> > It seems that DropRelFileNodesAllBuffers() should trigger the new
> > optimization path only when InRecovery == true, because it
> > intentionally doesn't check the "accurate" value returned from
> smgrnblocks().
> 
> Fixed it in 0004 patch. Now we ensure that we only enter the optimization path
> Iff during recovery.
> 
> 
> > From: Amit Kapila  On Wed, Nov 18, 2020 at
> > 11:43 PM Andres Freund 
> > > I'm also worried about the cases where this could cause buffers left
> > > in the buffer pool, without a crosscheck like Thomas' patch would
> > > allow to add. Obviously other processes can dirty buffers in
> > > hot_standby, so any leftover buffer could have bad consequences.
> > >
> >
> > The problem can only arise if other processes extend the relation. The
> > idea was that in recovery it extends relation by one process which
> > helps to maintain the cache. Kirk seems to have done testing to
> > cross-verify it by using his first patch
> > (Prevent-invalidating-blocks-in-smgrextend-during). Which other
> crosscheck you are referring here?
> >
> > I agree that we can do a better job by expanding comments to clearly
> > state why it is safe.
> 
> Yes, basically what Amit-san also mentioned above. The first patch prevents
> that.
> And in the description of DropRelFileNodeBuffers in the 0003 patch, please
> check If that would suffice.
> 
> 
> > > Smaller comment:
> > >
> > > +static void
> > > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber
> > *forkNum,
> > > int nforks,
> > > +   BlockNumber
> > > *nForkBlocks, BlockNumber *firstDelBlock) ...
> > > + /* Check that it is in the buffer pool. If not, do
> > nothing.
> > > */
> > > + LWLockAcquire(bufPartitionLock, LW_SHARED);
> > > + buf_id = BufTableLookup(, bufHash);
> > > ...
> > > + bufHdr = GetBufferDescriptor(buf_id);
> > > +
> > > + buf_state = LockBufHdr(bufHdr);
> > > +
> > > + if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)
> > &&
> > > + bufHdr->tag.forkNum == forkNum[i] &&
> > > + bufHdr->tag.blockNum >= firstDelBlock[i])
> > > + InvalidateBuffer(bufHdr);   /* releases
> > > spinlock */
> > > + else
> > > + UnlockBufHdr(bufHdr, buf_state);
> > >
> > > I'm a bit confused about the check here. We hold a buffer partition
> > > lock, and have done a lookup in the mapping table. Why are we then
> > > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > > holding the buffer header lock, which is essentially a spinlock, so
> > > should only ever be held for very short portions?
> > >
> > > This looks like it's copying logic from DropRelFileNodeBuffers()
> > > etc, but there the situation is different: We haven't done a buffer
> > > mapping lookup, and we don't hold a partition lock!
> >
> > That's because the buffer partition lock is released 

Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
>> (Thanks to Amit Khandekar for rigorously performance testing this patch
>> with different combinations).

> For the simple-update and tpcb-like graphs, do you have any actual
> numbers to share between 128 and 1024 connections?

Also, exactly what hardware/software platform were these curves
obtained on?

regards, tom lane




Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
 scalability        baseline               patched
                 update      tpcb       update      tpcb
 --------------------------------------------------------
     128         107932     78554       108081     78569
     256          82877     64682       101543     73774
     512          55174     46494        77886     61105
    1024          32267     27020        33170     30597

configuration:
https://github.com/mysqlonarm/benchmark-suites/blob/master/pgsql-pbench/conf/pgsql.cnf/postgresql.conf


On Thu, 26 Nov 2020 at 10:36, Michael Paquier  wrote:

> On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
> > (Thanks to Amit Khandekar for rigorously performance testing this patch
> >  with different combinations).
>
> For the simple-update and tpcb-like graphs, do you have any actual
> numbers to share between 128 and 1024 connections?  The blue lines
> look like they are missing some measurements in-between, so it is hard
> to tell if this is an actual improvement or just some lack of data.
> --
> Michael
>


-- 
Regards,
Krunal Bauskar


Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Michael Paquier
On Thu, Nov 26, 2020 at 10:00:50AM +0530, Krunal Bauskar wrote:
> (Thanks to Amit Khandekar for rigorously performance testing this patch
>  with different combinations).

For the simple-update and tpcb-like graphs, do you have any actual
numbers to share between 128 and 1024 connections?  The blue lines
look like they are missing some measurements in-between, so it is hard
to tell if this is an actual improvement or just some lack of data.
--
Michael


signature.asc
Description: PGP signature


Re: PoC: custom signal handler for extensions

2020-11-25 Thread Michael Paquier
On Wed, Nov 25, 2020 at 06:34:48PM +0300, Anastasia Lubennikova wrote:
> I took a look at the patch. It looks fine and I see, that it contains fixes
> for the latest questions in the thread.
> 
> I think we should provide a test module for this feature, that will serve as
> both test and example of the use.

Yep.  That's missing.  The trick here is usually to be able to come up
with something minimalist but still useful enough, for two reasons:
- This can be used as a template.
- This makes sure that the API works.
As the patch stands, nothing in this architecture is tested.

> This is a feature for extension developers, so I don't know where we should
> document it. At the very least we can improve comments. For example,
> describe the fact that custom signals are handled after receiving SIGUSR1.

As you say, this is for extension developers, so this should be
documented in the headers defining those APIs.  FWIW, I am -1 with the
patch as proposed.  First, it has zero documentation.  Second, it uses
a hardcoded custom number of signals of 64 instead of a flexible
design.  In most cases, 64 is a waste.  In some cases, 64 may not be
enough.  IMO, a design based on the registration of a custom
procsignal done at shmem init time would be much more flexible.  You
need one registration API and something to get an ID associated with the
custom entry, which is roughly what Teodor suggested upthread.
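Just to sketch the shape of such an API (all names here are hypothetical;
nothing like this exists in core today):

typedef void (*ProcSignalCustomHandler) (void);

/* called from an extension's shmem init hook; returns the slot assigned */
extern int	RegisterCustomProcSignal(const char *name,
									 ProcSignalCustomHandler handler);

/* later used to send that custom reason to a particular backend */
extern int	SendCustomProcSignal(int slot, pid_t pid, BackendId backendId);

That would replace a fixed array of 64 hardcoded slots with as many (or as
few) entries as the loaded extensions actually register.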

This needs more work, so I am marking it as returned with feedback.
--
Michael


signature.asc
Description: PGP signature


Re: Enumize logical replication message actions

2020-11-25 Thread Amit Kapila
On Wed, Nov 25, 2020 at 2:52 PM Amit Kapila  wrote:
>
> On Wed, Nov 25, 2020 at 2:26 PM Peter Smith  wrote:
> >
> > Hi Hackers.
> >
> > Last month there was a commit [1] for replacing logical replication
> > message type characters with enums of equivalent values.
> >
> > I was revisiting this code recently and I think due to oversight that
> > initial patch was incomplete. IIUC there are several more enum
> > substitutions which should have been made.
> >
> > PSA my patch which adds those missing substitutions.
> >
>
> Good catch. I'll review it in a day or so.
>

The patch looks good to me and I have pushed it.

-- 
With Regards,
Amit Kapila.




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-25 Thread Michael Paquier
On Wed, Nov 25, 2020 at 04:25:56PM +0100, Magnus Hagander wrote:
> On Wed, Nov 25, 2020 at 9:29 AM Peter Eisentraut 
>  wrote:
>> Perhaps it's worth asking whom the advice applies to then.  You suggest
>> it's mostly developers.  I for one am still grumpy that in 9.5 we
>> removed the variant of the hint that suggested "postgres -D ..." instead
>> of pg_ctl.  I used to copy and paste that a lot.  The argument back then
>> was that the hint should target end users, not developers.  I doubt that
>> under the current circumstances, running pg_ctl start from the console
>> is really appropriate advice for a majority of users.  (For one thing,
>> systemd will kill it when you log out.)  I don't know what better advice
> 
> I guess one option could be to just remove it, unconditionally. And
> assume that any users who is running it manually read that in docs
> somewhere that tells them what to do next, and that any user who's
> running it under a wrapper will have the wrapper set it up?

Hmm.  I don't think that users running manually initdb will read the
documentation if we don't show directly a reference to them.  FWIW, I
agree with Peter that it would have been nice to keep around the 9.5
hint, and I would wish that the last one remains around.  I agree,
however, that there is value in having a switch that suppresses them.

>> would be, though.  Maybe we need to add some kind of text adventure game
>> into initdb.
> 
> I do like this idea though...

+1.
--
Michael


signature.asc
Description: PGP signature


Improving spin-lock implementation on ARM.

2020-11-25 Thread Krunal Bauskar
Improving spin-lock implementation on ARM.


* Spin-Lock is known to have a significant effect on performance
  with increasing scalability.

* Existing Spin-Lock implementation for ARM is sub-optimal due to
  use of TAS (test-and-set).

* TAS is implemented on ARM as load-store, so even if the lock is not free,
  the store operation still executes, rewriting the same value.
  This redundant operation (mainly the store) is costly.

* CAS is implemented on ARM as load-check-store-check, which means that if
  the lock is not free, the check performed after the load causes the loop
  to return early, thereby saving the costlier store operation. [1]

* x86 uses an optimized xchg operation.
  ARM too started supporting it (using Large System Extension) with
  ARM-v8.1, but since it is not supported on ARM-v8, GCC by default tends
  to emit the more generic load-store assembly code.

* From gcc-9.4 onwards there is support for outline-atomics, which can emit
  both variants of the code (load-store and cas/swp) and select the proper
  variant based on the underlying architecture, but a lot of distros still
  don't ship GCC-9.4 as the default compiler.

* In light of this, we would like to propose a CAS-based approach;
  our local testing has shown improvement in the range of 10-40%
  (graph attached).

* The patch enables the CAS-based approach if CAS is supported, depending on
  the existing compile flag HAVE_GCC__ATOMIC_INT32_CAS.

(Thanks to Amit Khandekar for rigorously performance testing this patch
 with different combinations).

[1]: https://godbolt.org/z/jqbEsa
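For anyone comparing the two builtins, here is a minimal side-by-side sketch
(for illustration only; the real change is in the attached s_lock.h patch):

/*
 * tas_sketch: always performs the store, even when the lock is already held.
 * cas_sketch: gives up right after the load when the value is not 0, so the
 * costlier store is skipped under contention.
 */
static inline int
tas_sketch(volatile int *lock)
{
	/* returns the previous value; non-zero means we failed to get the lock */
	return __sync_lock_test_and_set(lock, 1);
}

static inline int
cas_sketch(volatile int *lock)
{
	int		expected = 0;

	/* returns non-zero (failure) without storing when *lock is already 1 */
	return !__atomic_compare_exchange_n(lock, &expected, 1, false,
										__ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE);
}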

P.S: Sorry if I missed any standard pgsql protocol since I am just starting
with pgsql.

---
Krunal Bauskar
#mysqlonarm
Huawei Technologies
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 31a5ca6..940fdcd 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -321,7 +321,24 @@ tas(volatile slock_t *lock)
  * than other widths.
  */
 #if defined(__arm__) || defined(__arm) || defined(__aarch64__) || defined(__aarch64)
-#ifdef HAVE_GCC__SYNC_INT32_TAS
+
+#ifdef HAVE_GCC__ATOMIC_INT32_CAS
+/* just reusing the same flag to avoid re-declaration of default tas functions below */
+#define HAS_TEST_AND_SET
+
+#define TAS(lock) cas(lock)
+typedef int slock_t;
+
+static __inline__ int
+cas(volatile slock_t *lock)
+{
+	slock_t expected = 0;
+	return !(__atomic_compare_exchange_n(lock, &expected, (slock_t) 1,
+false, __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE));
+}
+
+#define S_UNLOCK(lock) __atomic_store_n(lock, (slock_t) 0, __ATOMIC_RELEASE);
+#elif HAVE_GCC__SYNC_INT32_TAS
 #define HAS_TEST_AND_SET
 
 #define TAS(lock) tas(lock)


Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-25 Thread Michael Paquier
On Thu, Nov 26, 2020 at 07:24:01AM +0530, Bharath Rupireddy wrote:
> Yeah. The tuple size may change after ExecCopySlot(). For instance, create
> table t2 as select a1 from t1; where t1 has two integer columns a1, b1. I'm
> creating t2 with single column a1 from t1 which makes the source slot
> virtual.

+inline Size
+GetTupleSize(TupleTableSlot *slot, Size maxsize)
+{
+   Size sz = 0;
+   HeapTuple tuple = NULL;
+
+   if (TTS_IS_HEAPTUPLE(slot))
+   tuple = ((HeapTupleTableSlot *) slot)->tuple;
+   else if(TTS_IS_BUFFERTUPLE(slot))
+   tuple = ((BufferHeapTupleTableSlot *) slot)->base.tuple;
+   else if(TTS_IS_MINIMALTUPLE(slot))
+   tuple = ((MinimalTupleTableSlot *) slot)->tuple;

There have been various talks about the methods we could use to
evaluate the threshold in bytes when evaluating that a flush can
happen, including the use of memory contexts, or even estimate the
size of the number of tuples.  This one looks promising because it
seems exact, however for virtual slots I don't like much the fact that
you basically just extracted the parts of tts_virtual_materialize()
and stuck them in this routine.  That's a recipe for future bugs if
the materialization logic changes.  In short, I am surprised that this
calculation is not directly part of TupleTableSlotOps.  What we'd want
is to get this information depending on the slot type dealt with, and
with your patch you would miss to handle any new slot type
introduced.
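To illustrate the thought (purely hypothetical; no such callback exists in
TupleTableSlotOps today, so the name and signature are assumptions):

struct TupleTableSlotOps
{
	/* ... existing callbacks such as materialize, copyslot, ... */

	/* space needed to materialize the tuple currently stored in the slot */
	Size		(*get_tuple_size) (TupleTableSlot *slot);
};

/* callers would then be slot-type agnostic: */
static inline Size
GetTupleSize(TupleTableSlot *slot)
{
	return slot->tts_ops->get_tuple_size(slot);
}

A new slot type would then be forced to provide its own implementation
instead of silently falling through an if/else chain.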
--
Michael


signature.asc
Description: PGP signature


Re: Parallel Inserts in CREATE TABLE AS

2020-11-25 Thread Bharath Rupireddy
On Thu, Nov 26, 2020 at 7:47 AM Hou, Zhijie  wrote:
>
> Hi,
>
> I have an issue about the following code:
>
> econtext = node->ps.ps_ExprContext;
> ResetExprContext(econtext);
>
> +   if (ISCTAS(node->ps.intoclause))
> +   {
> +   ExecParallelInsertInCTAS(node);
> +   return NULL;
> +   }
>
> /* If no projection is required, we're done. */
> if (node->ps.ps_ProjInfo == NULL)
> return slot;
>
> /*
>  * Form the result tuple using ExecProject(), and return it.
>  */
> econtext->ecxt_outertuple = slot;
> return ExecProject(node->ps.ps_ProjInfo);
>
> It seems the projection will be skipped.
> Is this because projection is not required in this case ?
> (I'm not very familiar with where the projection will be.)
>

For parallel inserts in CTAS, I don't think we need to project the
tuples being returned from the underlying plan nodes, and also we have
nothing to project from the Gather node further up. The required
projection will happen while the tuples are being returned from the
underlying nodes and the projected tuples are being directly fed to
CTAS's dest receiver intorel_receive(), from there into the created
table. We don't need ExecProject again in ExecParallelInsertInCTAS().

For instance, projection will always be done when the tuple is being
returned from an underlying sequential scan node(see ExecScan() -->
ExecProject() and this is true for both leader and workers. In both
leader and workers, we are just calling CTAS's dest receiver
intorel_receive().

Thoughts?

>
> If projection is not required here, shall we add some comments here?
>

If the above point looks okay, I can add a comment.
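For example, something along these lines (just a sketch of possible wording,
not the actual follow-up patch):

	if (ISCTAS(node->ps.intoclause))
	{
		/*
		 * Parallel inserts in CTAS: the tuples returned by the underlying
		 * plan nodes (e.g. via ExecScan() -> ExecProject()) are already
		 * projected, and the Gather node has nothing further to project,
		 * so hand them straight to the CTAS dest receiver without calling
		 * ExecProject() again.
		 */
		ExecParallelInsertInCTAS(node);
		return NULL;
	}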

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Amit Kapila
On Wed, Nov 25, 2020 at 7:47 PM Antonin Houska  wrote:
>
> Antonin Houska  wrote:
>
> > Amit Kapila  wrote:
>
> > > I think we also need to maintain oldestXidHavingUndo for CLOG truncation 
> > > and
> > > transaction-wraparound. We can't allow CLOG truncation for the transaction
> > > whose undo is not discarded as that could be required by some other
> > > transaction.
> >
> > Good point. Even the discard worker might need to check the transaction 
> > status
> > when deciding whether undo log of that transaction should be discarded.
>
> In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
> OldestXmin:
>
>
> OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
>
> PROCARRAY_FLAGS_VACUUM);
>
> oldestXidHavingUndo = 
> GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));
>
> /*
>  * Call the discard routine if there oldestXidHavingUndo is lagging
>  * behind OldestXmin.
>  */
> if (OldestXmin != InvalidTransactionId &&
> TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
> {
> UndoDiscard(OldestXmin, );
>
> and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
> memory.
>
> I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
> AbortTransaction() has cleared both xid and xmin fields of the transaction's
> PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding
> undo log may still be waiting for processing. Am I wrong?
>

The UndoDiscard->UndoDiscardOneLog ensures that we don't discard the
undo if there is a pending abort.

> I think that oldestXidHavingUndo should be advanced at the time transaction
> commits or when the undo log of an aborted transaction has been applied.
>

We can't advance oldestXidHavingUndo just on commit because later we
need to rely on it for visibility, basically any transaction older
than oldestXidHavingUndo should be all-visible.

> Then
> the discard worker would simply discard the undo log up to
> oldestXidHavingUndo. However, as the transactions whose undo is still not
> applied may no longer be registered in the shared memory (proc array), I don't
> know how to determine the next value of oldestXidHavingUndo.
>
> Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the
> shared memory rather than plain TransactionId (see
> oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag
> behind nextFullXid by more than 2 billions transactions anyway because in that
> case it would cause XID wraparound.
>

You are right but still, it is better to keep it as FullTransactionId
because (a) zheap uses FullTransactionId and we need to compare it
with oldestXidWithEpochHavingUndo for visibility purpose, (b) In
future, we want to get rid this of this limitation for undo as well.

-- 
With Regards,
Amit Kapila.




Re: Huge memory consumption on partitioned table with FKs

2020-11-25 Thread Kyotaro Horiguchi
At Thu, 26 Nov 2020 09:59:28 +0900, Keisuke Kuroda 
 wrote in 
> Hi Hackers,
> 
> Analyzed the problem and created a patch to resolve it.
> 
> # Problem 1
> 
> When you create a foreign key to a partitioned table, referential
> integrity function is created for the number of partitions.
> Internally, SPI_prepare() creates a plan and SPI_keepplan()
> caches the plan.
> 
> The more partitions in the referencing table, the more plans will
> be cached.
> 
> # Problem 2
> 
> When referenced table is partitioned table, the larger the number
> of partitions, the larger the plan size to be cached.
> 
> The actual plan processed is simple and small if pruning is
> enabled. However, the cached plan will include all partition
> information.
> 
> The more partitions in the referenced table, the larger the plan
> size to be cached.
> 
> # Idea for solution
> 
> Constraints with the same pg_constraint.conparentid can be combined
> into one cached plan if we can guarantee that all their contents are
> the same.

The memory reduction this patch gives seems quite high with a small
footprint.

This shares the RI_ConstraintInfo cache among constraints that share the
same parent constraint. But you forgot that the cache contains some
members that can differ among partitions.

Consider the case of attaching a partition that has experienced a
column deletion.

create table t (a int primary key);
create table p (a int, r int references t(a)) partition by range(a);
create table c1 partition of p for values from (0) to (10);
create table c2 (a int, r int);
alter table c2 drop column r;
alter table c2 add column r int;
alter table p attach partition c2 for values from (10) to (20);

In that case riinfo->fk_attnums has different values from other
partitions.

=# select oid, conrelid::regclass, confrelid::regclass, conparentid, conname, 
conkey from pg_constraint where confrelid = 't'::regclass;

  oid  | conrelid | confrelid | conparentid | conname  | conkey 
-------+----------+-----------+-------------+----------+--------
 16620 | p        | t         |           0 | p_r_fkey | {2}
 16626 | c1       | t         |       16620 | p_r_fkey | {2}
 16632 | c2       | t         |       16620 | p_r_fkey | {3}
(3 rows)

conkey is copied onto riinfo->fk_attnums.

> Problem 1 can be solved
> and significant memory bloat can be avoided.
> CachedPlan: 710MB -> 1466kB
> 
> Solving Problem 2 could also reduce memory,
> but I don't have a good idea.
> 
> Currently, DISCARD ALL does not discard CachedPlan by SPI as in
> this case. It may be possible to modify DISCARD ALL to discard
> CachedPlan and run it periodically. However, we think the burden
> on the user is high.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-11-25 Thread k.jami...@fujitsu.com
On Thursday, November 19, 2020 4:08 PM, Tsunakawa, Takayuki wrote:
> From: Andres Freund 
> > DropRelFileNodeBuffers() in recovery? The most common path is
> > DropRelationFiles()->smgrdounlinkall()->DropRelFileNodesAllBuffers(),
> > which 3/4 doesn't address and 4/4 doesn't mention.
> >
> > 4/4 seems to address DropRelationFiles(), but only talks about
> TRUNCATE?
> 
> Yes.  DropRelationFiles() is used in the following two paths:
> 
> [Replay of TRUNCATE during recovery]
> xact_redo_commit/abort() -> DropRelationFiles()  -> smgrdounlinkall() ->
> DropRelFileNodesAllBuffers()
> 
> [COMMIT/ROLLBACK PREPARED]
> FinishPreparedTransaction() -> DropRelationFiles()  -> smgrdounlinkall()
> -> DropRelFileNodesAllBuffers()

Yes. The concern was that it was not clear from the function descriptions and
commit logs what the optimizations in DropRelFileNodeBuffers() and
DropRelFileNodesAllBuffers() are for. So I revised only the function
description of DropRelFileNodeBuffers() and the commit logs of the 0003-0004
patches. Please check whether the brief descriptions suffice.


> > I also don't get why 4/4 would be a good idea on its own. It uses
> > BUF_DROP_FULL_SCAN_THRESHOLD to guard
> > FindAndDropRelFileNodeBuffers() on a per relation basis. But since
> > DropRelFileNodesAllBuffers() can be used for many relations at once,
> > this could end up doing BUF_DROP_FULL_SCAN_THRESHOLD - 1
> lookups a lot
> > of times, once for each of nnodes relations?
> 
> So, the threshold value should be compared with the total number of blocks
> of all target relations, not each relation.  You seem to be right, got it.

Fixed this in the 0004 patch. Now we compare the total number of
buffers-to-be-invalidated for ALL relations against
BUF_DROP_FULL_SCAN_THRESHOLD.

> > Also, how is 4/4 safe - this is outside of recovery too?
> 
> It seems that DropRelFileNodesAllBuffers() should trigger the new
> optimization path only when InRecovery == true, because it intentionally
> doesn't check the "accurate" value returned from smgrnblocks().

Fixed it in the 0004 patch. Now we ensure that we enter the optimization path
only if we are in recovery.
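Roughly, the combined check looks like this (a sketch of the decision only,
not the exact hunk from the 0004 patch; variable names are assumptions):

	BlockNumber	nBlocksToInvalidate = 0;

	/* sum the to-be-invalidated blocks over every fork of every relation */
	for (i = 0; i < nrels; i++)
		for (j = 0; j < nforks[i]; j++)
			nBlocksToInvalidate += nForkBlocks[i][j];

	use_individual_lookups = InRecovery &&
		nBlocksToInvalidate < BUF_DROP_FULL_SCAN_THRESHOLD;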
 

> From: Amit Kapila 
> On Wed, Nov 18, 2020 at 11:43 PM Andres Freund 
> > I'm also worried about the cases where this could cause buffers left
> > in the buffer pool, without a crosscheck like Thomas' patch would
> > allow to add. Obviously other processes can dirty buffers in
> > hot_standby, so any leftover buffer could have bad consequences.
> >
> 
> The problem can only arise if other processes extend the relation. The idea
> was that in recovery it extends relation by one process which helps to
> maintain the cache. Kirk seems to have done testing to cross-verify it by 
> using
> his first patch (Prevent-invalidating-blocks-in-smgrextend-during). Which
> other crosscheck you are referring here?
> 
> I agree that we can do a better job by expanding comments to clearly state
> why it is safe.

Yes, basically what Amit-san also mentioned above. The first patch prevents
that. And in the description of DropRelFileNodeBuffers in the 0003 patch,
please check if that would suffice.


> > Smaller comment:
> >
> > +static void
> > +FindAndDropRelFileNodeBuffers(RelFileNode rnode, ForkNumber
> *forkNum,
> > int nforks,
> > + BlockNumber
> > *nForkBlocks, BlockNumber *firstDelBlock) ...
> > +   /* Check that it is in the buffer pool. If not, do
> nothing.
> > */
> > +   LWLockAcquire(bufPartitionLock, LW_SHARED);
> > +   buf_id = BufTableLookup(, bufHash);
> > ...
> > +   bufHdr = GetBufferDescriptor(buf_id);
> > +
> > +   buf_state = LockBufHdr(bufHdr);
> > +
> > +   if (RelFileNodeEquals(bufHdr->tag.rnode, rnode)
> &&
> > +   bufHdr->tag.forkNum == forkNum[i] &&
> > +   bufHdr->tag.blockNum >= firstDelBlock[i])
> > +   InvalidateBuffer(bufHdr);   /* releases
> > spinlock */
> > +   else
> > +   UnlockBufHdr(bufHdr, buf_state);
> >
> > I'm a bit confused about the check here. We hold a buffer partition
> > lock, and have done a lookup in the mapping table. Why are we then
> > rechecking the relfilenode/fork/blocknum? And why are we doing so
> > holding the buffer header lock, which is essentially a spinlock, so
> > should only ever be held for very short portions?
> >
> > This looks like it's copying logic from DropRelFileNodeBuffers() etc,
> > but there the situation is different: We haven't done a buffer mapping
> > lookup, and we don't hold a partition lock!
> 
> That's because the buffer partition lock is released immediately after the 
> hash
> table has been looked up.  As an aside, InvalidateBuffer() requires the caller
> to hold the buffer header spinlock and doesn't hold the buffer partition lock.

Yes. Holding the buffer header 

Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Amit Kapila
On Wed, Nov 25, 2020 at 8:00 PM Antonin Houska  wrote:
>
> Amit Kapila  wrote:
>
> > On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska  wrote:
> > >
> > > Amit Kapila  wrote:
> > >
> > > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > > > >
> > > > > Amit Kapila  wrote:
> > > > >
> > > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > No background undo
> > > > > > > --
> > > > > > >
> > > > > > > Reduced complexity of the patch seems to be the priority at the 
> > > > > > > moment. Amit
> > > > > > > suggested that cleanup of an orphaned relation file is simple 
> > > > > > > enough to be
> > > > > > > done on foreground and I agree.
> > > > > > >
> > > > > >
> > > > > > Yeah, I think we should try and see if we can make it work but I
> > > > > > noticed that there are few places like AbortOutOfAnyTransaction 
> > > > > > where
> > > > > > we have the assumption that undo will be executed in the background.
> > > > > > We need to deal with it.
> > > > >
> > > > > I think this is o.k. if we always check for unapplied undo during 
> > > > > startup.
> > > > >
> > > >
> > > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> > > > PANIC error. IIRC, this path is invoked in non-panic errors as well.
> > > > Basically, we won't be able to discard such an undo which doesn't seem
> > > > like a good idea.
> > >
> > > Since failure to apply leaves unconsistent data, I assume it should always
> > > cause PANIC, shouldn't it?
> > >
> >
> > But how can we ensure that AbortOutOfAnyTransaction will be called
> > only in that scenario?
>
> I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
> there is unapplied undo, so nothing changes for its callers. Do I still miss
> something?
>

Adding PANIC in some generic code-path sounds scary. Why can't we
simply try to execute undo?

> > > > Another thing is that it seems we need to connect to the database to 
> > > > perform
> > > > it which might appear a bit odd that we don't allow users to connect to 
> > > > the
> > > > database but internally we are connecting it.
> > >
> > > I think the implementation will need to follow the outcome of the part of 
> > > the
> > > discussion that starts at [2], but I see your concern. I'm thinking why
> > > database connection is not needed to apply WAL but is needed for UNDO. I 
> > > think
> > > locks make the difference.
> > >
> >
> > Yeah, it would be probably a good idea to see if we can make undo
> > apply work without db-connection especially if we want to do before
> > allowing connections. The other possibility could be to let discard
> > worker do this work lazily after allowing connections.
>
> Actually I hit the problem of missing connection when playing with the
> "undoxacttest" module. Those tests use table_open() / table_close() functions,
> but it might not be necessary for the real RMGRs.
>

How can we apply the action on a page without opening the relation?

-- 
With Regards,
Amit Kapila.




Re: [PATCH] LWLock self-deadlock detection

2020-11-25 Thread Tom Lane
Craig Ringer  writes:
> On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat 
> wrote:
>> I'd prefer to make the lock self deadlock check run for production
>> builds, not just cassert builds.

I'd like to register a strong objection to spending any cycles whatsoever
on this.  If you're using LWLocks in a way that could deadlock, you're
doing it wrong.  The entire point of that mechanism is to be Light Weight
and not have all the overhead that the heavyweight lock mechanism has.
Building in deadlock checks and such is exactly the wrong direction.
If you think you need that, you should be using a heavyweight lock.

Maybe there's some case for a cassert-only check of this sort, but running
it in production is right out.

I'd also opine that checking for self-deadlock, but not any more general
deadlock, seems pretty pointless.  Careless coding is far more likely
to create opportunities for the latter.  (Thus, I have little use for
the existing assertions of this sort, either.)

regards, tom lane




Re: walsender bug: stuck during shutdown

2020-11-25 Thread Alvaro Herrera
On 2020-Nov-26, Fujii Masao wrote:

> On the second thought, walsender doesn't wait forever unless
> wal_sender_timeout is disabled, even in the case in discussion?
> Or if there is the case where wal_sender_timeout doesn't work expectedly,
> we might need to fix that at first.

Hmm, no, it doesn't wait forever in that sense; tracing with the
debugger shows that the process is looping continuously.




RE: [POC] Fast COPY FROM command for the table with foreign partitions

2020-11-25 Thread tsunakawa.ta...@fujitsu.com
From: Amit Langote 
> Second is whether the interface for setting ri_usesMultiInsert
> encourages situations where different modules could possibly engage in
> conflicting behaviors.  I can't think of a real-life example of that
> with the current implementation, but maybe the interface provided in
> the patch makes it harder to ensure that that remains true in the
> future.  Tsunakawa-san, have you encountered an example of this, maybe
> when trying to integrate this patch with some other?

Thanks.  No, I pointed out purely from the standpoint of program modularity 
(based on structured programming?)


> Anyway, one thing we could do is rename
> ExecRelationAllowsMultiInsert() to ExecSetRelationUsesMultiInsert(),
> that is, to make it actually set ri_usesMultiInsert and have places
> like CopyFrom() call it if (and only if) its local logic allows
> multi-insert to be used.  So, ri_usesMultiInsert starts out set to
> false and if a module wants to use multi-insert for a given target
> relation, it calls ExecSetRelationUsesMultiInsert() to turn the flag
> on.  Also, given the confusion regarding how execPartition.c

I think separating the setting and inspection of the property into different 
functions will be good, at least.


> manipulates the flag, maybe change ExecFindPartition() to accept a
> Boolean parameter multi_insert, which it will pass down to
> ExecInitPartitionInfo(), which in turn will call
> ExecSetRelationUsesMultiInsert() for a given partition.  Of course, if
> the logic in ExecSetRelationUsesMultiInsert() determines that
> multi-insert can't be used, for the reasons listed in the function,
> then the caller will have to live with that decision.

I can't say for sure, but it looks strange to me, because I can't find a good 
description of multi_insert argument for ExecFindPartition().  If we add 
multi_insert, I'm afraid we may want to add further arguments for other 
properties in the future like "Hey, get me the partition that has triggers.", 
"Next, pass me a partition that uses a foreign table.", etc.  I think the 
current ExecFindPartition() is good -- "Get me a partition that accepts this 
row."

I wonder if ri_usesMultiInsert is really necessary.  Would it cut down enough 
costs in the intended use case(s), say the heavyweight COPY FROM?


Regards
Takayuki Tsunakawa




Re: [PATCH] LWLock self-deadlock detection

2020-11-25 Thread Craig Ringer
On Wed, Nov 25, 2020 at 9:23 PM Ashutosh Bapat 
wrote:

> On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer
>  wrote:
> >> I am also seeing a pattern
> >> Assert(!LWLockHeldByMe());
> >> LWLockAcquire()
> >>
> >> at some places. Should we change LWLockAcquire to do
> >> Assert(!LWLockHeldByMe()) always to detect such occurrences?
> >
> >
> > I'm inclined not to, at least not without benchmarking it, because
> that'd do the check before we attempt the fast-path. cassert builds are
> still supposed to perform decently and be suitable for day to day
> development and I'd rather not risk a slowdown.
> >
> > I'd prefer to make the lock self deadlock check run for production
> builds, not just cassert builds. It'd print a normal LOG (with backtrace if
> supported) then Assert(). So on an assert build we'd get a crash and core,
> and on a non-assert build we'd carry on and self-deadlock anyway.
> >
> > That's probably the safest thing to do. We can't expect to safely ERROR
> out of the middle of an LWLockAcquire(), not without introducing a new and
> really hard to test code path that'll also be surprising to callers. We
> probably don't want to PANIC the whole server for non-assert builds since
> it might enter a panic-loop. So it's probably better to self-deadlock. We
> could HINT that a -m immediate shutdown will be necessary to recover though.
>
> I agree that it will be helpful to print something in the logs
> indicating the reason for this hang in case the hang happens in a
> production build. In your patch you have used ereport(PANIC, ) which
> may simply be replaced by an Assert() in an assert enabled build.


I'd either select between PANIC (assert build) and LOG (non assert build),
or always LOG then put an Assert() afterwards. It's strongly desirable to
get the log output before any crash because it's a pain to get the LWLock
info from a core.


> We
> already have Assert(!LWLockHeldByMe()) so that should be safe. It will
> be good to have -m immediate hint in LOG message. But it might just be
> better to kill -9 that process to get rid of it. That will cause the
> server to restart and not just shutdown.
>

Sure. Won't work on Windows though. And I am hesitant about telling users
to "kill -9" anything.

pg_ctl -m immediate restart then.


RE: Parallel Inserts in CREATE TABLE AS

2020-11-25 Thread Hou, Zhijie
Hi,

I have an issue about the following code:

econtext = node->ps.ps_ExprContext;
ResetExprContext(econtext);
 
+   if (ISCTAS(node->ps.intoclause))
+   {
+   ExecParallelInsertInCTAS(node);
+   return NULL;
+   }

/* If no projection is required, we're done. */
if (node->ps.ps_ProjInfo == NULL)
return slot;

/*
 * Form the result tuple using ExecProject(), and return it.
 */
econtext->ecxt_outertuple = slot;
return ExecProject(node->ps.ps_ProjInfo);

It seems the projection will be skipped.
Is this because projection is not required in this case ?
(I'm not very familiar with where the projection will be.)

If projection is not required here, shall we add some comments here?

Best regards,
houzj






Re: Multi Inserts in CREATE TABLE AS - revived patch

2020-11-25 Thread Bharath Rupireddy
On Wed, Nov 25, 2020 at 2:11 PM Luc Vlaming  wrote:
>
> Thanks for reviving the patch! I did unfortunately have to shift my
> priorities somewhat and did not find much time to work on open source
> things the last week(s).
>

Thanks for the comments.

>
> I'm wondering about the use of the GetTupleSize function. As far as I
> understand the idea is to limit the amount of buffered data, presumably
> to not write too much data at once for intorel_flush_multi_insert.
> If I understood correctly how it all works, the table slot can however
> be of different type than the source slot, which makes that the call to
> CopySlot() potentially stores a different amount of data than computed
> by GetTupleSize(). Not sure if this is a big problem as an estimation
> might be good enough?
>

Yeah. The tuple size may change after ExecCopySlot(). For instance, create
table t2 as select a1 from t1; where t1 has two integer columns a1, b1. I'm
creating t2 with single column a1 from t1 which makes the source slot
virtual.

Source slot is virtual and the size calculated with GetTupleSize() is 8
bytes:
(gdb) p *slot
$18 = {type = T_TupleTableSlot, tts_flags = 16, tts_nvalid = 1,
  tts_ops = 0x562c592652c0 ,
  tts_tupleDescriptor = 0x562c5a0409f0, tts_values = 0x562c5a040b50,
  tts_isnull = 0x562c5a040b58, tts_mcxt = 0x562c5a040320, tts_tid = {
ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid
= 0}
(gdb) call GetTupleSize(slot, 65535)
$24 = 8

After ExecCopySlot(batchslot, slot), destination slot changes to
TTSOpsBufferHeapTuple and the GetTupleSize() gives 28 bytes now.
(gdb) p *batchslot
$19 = {type = T_TupleTableSlot, tts_flags = 20, tts_nvalid = 0,
  tts_ops = 0x562c592653e0 ,
  tts_tupleDescriptor = 0x7f063fbeecd0, tts_values = 0x562c5a05daa8,
  tts_isnull = 0x562c5a05dab0, tts_mcxt = 0x562c5a040320, tts_tid = {
ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, tts_tableOid
= 0}
(gdb) call GetTupleSize(batchslot, 65535)
$25 = 28

I think your suggestion to call GetTupleSize() on the destination slot
after ExecCopySlot() is right. I changed it in the v4 patch.

>
> Some other solutions/implementations would be:
> - compute the size after doing CopySlot. Maybe the relation never wants
> a virtual tuple and then you can also simplify GetTupleSize?
>

I think we need to have TTSOpsVirtual code in GetTupleSize() because
table_slot_create() which gets called before ExecCopySlot() may create
virtual slots for cases such as views and partitioned tables. Though we can
not insert into views or partitioned tables using CTAS, I want
GetTupleSize() to be a generic function. Right now, I can not find other
use cases where GetTupleSize() can be used.

>
> - after CopySlot ask for the memory consumed in the slot using
> MemoryContextMemAllocated.
>

MemoryContextMemAllocated of the slot's tts_mcxt will always have extra
bytes and those extra bytes are way more compared to the actual tuple
bytes. And most of the time, ExecCopySlot() will just point the src slot
tts_mcxt to dest slot  tts_mcxt. For instance, for a single row with a
single integer column of 8 bytes, the mem_allocated is 49232 bytes. This is
the reason we can not rely on mem_allocated.

(gdb) p slot->tts_mcxt -> source slot
$22 = (MemoryContext) 0x562c5a040320
(gdb) p *slot->tts_mcxt
$20 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false,
  *mem_allocated = 49232*, methods = 0x562c5926d560 ,
  parent = 0x562c59f97820, firstchild = 0x562c5a042330, prevchild = 0x0,
  nextchild = 0x0, name = 0x562c590d3554 "ExecutorState", ident = 0x0,
  reset_cbs = 0x0}

(gdb) p batchslot->tts_mcxt -> destination slot after
ExecCopySlot().
$23 = (MemoryContext) 0x562c5a040320
(gdb) p *batchslot->tts_mcxt
$21 = {type = T_AllocSetContext, isReset = false, allowInCritSection =
false,
  *mem_allocated = 49232*, methods = 0x562c5926d560 ,
  parent = 0x562c59f97820, firstchild = 0x562c5a042330, prevchild = 0x0,
  nextchild = 0x0, name = 0x562c590d3554 "ExecutorState", ident = 0x0,
  reset_cbs = 0x0}

>
> Some small things to maybe change are:
> ===
> +   if (myState->mi_slots[myState->mi_slots_num] == NULL)
> +   {
> +   batchslot = table_slot_create(myState->rel, NULL);
> +   myState->mi_slots[myState->mi_slots_num] =
batchslot;
> +   }
> +   else
> +   batchslot =
myState->mi_slots[myState->mi_slots_num];
>
> Alternative:
> +   if (myState->mi_slots[myState->mi_slots_num] == NULL)
> +   myState->mi_slots[myState->mi_slots_num] =
> table_slot_create(myState->rel, NULL);
> +   batchslot = myState->mi_slots[myState->mi_slots_num];
>

Changed.

> ==
>
> +   sz = att_align_nominal(sz, att->attalign);
> This could be moved out of the if statement?
>
> ==

I don't think we can change it. If we were to move it, then sz =

RE: POC: postgres_fdw insert batching

2020-11-25 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> Well, good that we all agree this is a useful feature to have (in
> general). The question is whether postgres_fdw should be doing batching
> on it's onw (per this thread) or rely on some other feature (libpq
> pipelining). I haven't followed the other thread, so I don't have an
> opinion on that.

Well, as someone said in this thread, I think bulk insert is much more common 
than updates/deletes.  Thus, major DBMSs have INSERT VALUES(record1), 
(record2)... and INSERT SELECT.  Oracle has direct path INSERT in addition.  As 
for the comparison of INSERT with multiple records and libpq batching (= 
multiple INSERTs), I think the former is more efficient because the amount of 
data transfer is less and the parsing-planning of INSERT for each record is 
eliminated.

I never deny the usefulness of libpq batch/pipelining, but I'm not sure if app 
developers would really use it.  If they want to reduce the client-server 
round-trips, won't they use traditional stored procedures?  Yes, the stored 
procedure language is very DBMS-specific.  Then, I'd like to know what kind of 
well-known applications are using standard batching API like JDBC's batch 
updates.  (Sorry, I think that should be discussed in libpq batch/pipelining 
thread and this thread should not be polluted.)


> Note however we're doing two things here, actually - we're implementing
> custom batching for postgres_fdw, but we're also extending the FDW API
> to allow other implementations do the same thing. And most of them won't
> be able to rely on the connection library providing that, I believe.

I'm afraid so, too.  Then, postgres_fdw would be an example that other FDW 
developers would look at when they use INSERT with multiple records.


Regards
Takayuki Tsunakawa






Re: walsender bug: stuck during shutdown

2020-11-25 Thread Fujii Masao




On 2020/11/26 3:45, Alvaro Herrera wrote:

On 2020-Nov-25, Fujii Masao wrote:


But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr,
walsender should keep waiting for the ack to all the sent message to be
replied, i.e., isn't this expected behavior of normal shutdown?  That is,
if we want to shutdown walsender even when the client side doesn't
reply message, immediate shutdown should be used or the client side
should be terminated, instead?


I don't think "waiting forever" can be considered the expected behavior;
this has caused what are nominally production outages several times
already, since we sent a shutdown signal to the server and it never
completed shutting down.


On the second thought, walsender doesn't wait forever unless
wal_sender_timeout is disabled, even in the case in discussion?
Or if there is the case where wal_sender_timeout doesn't work expectedly,
we might need to fix that at first.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: autovac issue with large number of tables

2020-11-25 Thread Kasahara Tatsuhito
On Wed, Nov 25, 2020 at 8:46 PM Masahiko Sawada  wrote:
>
> On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
>  wrote:
> >
> > Hi,
> >
> > On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> > >  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> > > >  wrote:
> > > > > > I wonder if we could have table_recheck_autovac do two probes of 
> > > > > > the stats
> > > > > > data.  First probe the existing stats data, and if it shows the 
> > > > > > table to
> > > > > > be already vacuumed, return immediately.  If not, *then* force a 
> > > > > > stats
> > > > > > re-read, and check a second time.
> > > > > Does the above mean that the second and subsequent 
> > > > > table_recheck_autovac()
> > > > > will be improved to first check using the previous refreshed 
> > > > > statistics?
> > > > > I think that certainly works.
> > > > >
> > > > > If that's correct, I'll try to create a patch for the PoC
> > > >
> > > > I still don't know how to reproduce Jim's troubles, but I was able to 
> > > > reproduce
> > > > what was probably a very similar problem.
> > > >
> > > > This problem seems to be more likely to occur in cases where you have
> > > > a large number of tables,
> > > > i.e., a large amount of stats, and many small tables need VACUUM at
> > > > the same time.
> > > >
> > > > So I followed Tom's advice and created a patch for the PoC.
> > > > This patch will enable a flag in the table_recheck_autovac function to 
> > > > use
> > > > the existing stats next time if VACUUM (or ANALYZE) has already been 
> > > > done
> > > > by another worker on the check after the stats have been updated.
> > > > If the tables continue to require VACUUM after the refresh, then a 
> > > > refresh
> > > > will be required instead of using the existing statistics.
> > > >
> > > > I did simple test with HEAD and HEAD + this PoC patch.
> > > > The tests were conducted in two cases.
> > > > (I changed few configurations. see attached scripts)
> > > >
> > > > 1. Normal VACUUM case
> > > >   - SET autovacuum = off
> > > >   - CREATE tables with 100 rows
> > > >   - DELETE 90 rows for each tables
> > > >   - SET autovacuum = on and restart PostgreSQL
> > > >   - Measure the time it takes for all tables to be VACUUMed
> > > >
> > > > 2. Anti wrap round VACUUM case
> > > >   - CREATE brank tables
> > > >   - SELECT all of these tables (for generate stats)
> > > >   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> > > >   - Consumes a lot of XIDs by using txid_curent()
> > > >   - Measure the time it takes for all tables to be VACUUMed
> > > >
> > > > For each test case, the following results were obtained by changing
> > > > autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> > > > Also changing num of tables to 1000, 5000, 1 and 2.
> > > >
> > > > Due to the poor VM environment (2 VCPU/4 GB), the results are a little 
> > > > unstable,
> > > > but I think it's enough to ask for a trend.
> > > >
> > > > ===
> > > > [1.Normal VACUUM case]
> > > >  tables:1000
> > > >   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
> > > >   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
> > > >   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
> > > >   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
> > > >   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
> > > >
> > > >  tables:5000
> > > >   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
> > > >   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
> > > >   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
> > > >   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
> > > >   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
> > > >
> > > >  tables:1
> > > >   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
> > > >   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
> > > >   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
> > > >   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
> > > >   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
> > > >
> > > >  tables:2
> > > >   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
> > > >   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
> > > >   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
> > > >   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
> > > >   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec
> > > >
> > > > [2.Anti wrap round VACUUM case]
> > > >  tables:1000
> > > >   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
> > > >   

Re: Asynchronous Append on postgres_fdw nodes.

2020-11-25 Thread movead...@highgo.ca

I tested the patch and encountered several issues, described below:

Issue one:
I get an Assert failure at 'Assert(bms_is_member(i, node->as_needrequest));' in
the ExecAppendAsyncRequest() function when I use more than two foreign tables
on different foreign servers.

I looked into the code and made the change below, after which the Assert
problem disappears.

@@ -1004,6 +1004,7 @@ ExecAppendAsyncResponse(AsyncRequest *areq)
 	bms_del_member(node->as_needrequest, areq->request_index);
 	node->as_asyncpending = bms_add_member(node->as_asyncpending,
 										   areq->request_index);
+	node->as_lastasyncplan = INVALID_SUBPLAN_INDEX;
 	return false;
 }

Issue two:
I also find that when I have both sync and async subplans, the executor runs
through all the sync subplans first and only then turns to the async ones;
I do not know whether that is intended.

Issue three:
After the code change mentioned in issue one, I cannot see any performance
improvement. When I query a partitioned table and all of its sub-partitions,
the time spent on the partitioned table is always the same as the sum of all
the sub-partitions.

Sorry if I got something wrong when testing the patch.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca


Re: Deleting older versions in unique indexes to avoid page splits

2020-11-25 Thread Peter Geoghegan
On Wed, Nov 25, 2020 at 1:20 PM Victor Yegorov  wrote:
> In the _bt_delete_or_dedup_one_page() we start with the simple loop over 
> items on the page and
> if there're any LP_DEAD tuples, we're kicking off _bt_delitems_delete().

Right.

> So if I understood you right, you plan to make this loop (or a similar one 
> somewhere around)
> to track TIDs of the LP_DEAD tuples and then (perhaps on a second loop over 
> the page) compare all other
> currently-not-LP_DEAD tuples and mark those pages, that have at least 2 TIDs 
> pointing at (one LP_DEAD and other not)
> as a promising one.

Yes. We notice extra TIDs that can be included in our heapam test "for
free". The cost is low, but the benefits are also often quite high: in
practice there are *natural* correlations that we can exploit.

For example: maybe there were non-HOT updates, and some but not all of
the versions got marked LP_DEAD. We can get them all in one go,
avoiding a true bottom-up index deletion pass for much longer
(compared to doing LP_DEAD deletion the old way, which is what happens
in v9 of the patch). We're better off doing the deletions all at once.
It's cheaper.

(We still really need to have bottom-up deletion passes, of course,
because that covers the important case where there are no LP_DEAD bits
set at all, which is an important goal of this project.)

Minor note: Technically there aren't any promising tuples involved,
because that only makes sense when we are not going to visit every
possible heap page (just the "most promising" heap pages). But we are
going to visit every possible heap page with the new LP_DEAD bit
deletion code (which could occasionally mean visiting 10 or more heap
pages, which is a lot more than bottom-up index deletion will ever
visit). All we need to do with the new LP_DEAD deletion logic is to
include all possible matching TIDs (not just those that are marked
LP_DEAD already).

> What I meant to ask — will LP_DEAD be set by IndexScan in the presence of the 
> long transaction?

That works in the same way as before, even with the new LP_DEAD
deletion code. The new code uses the same information as before (set
LP_DEAD bits), which is generated in the same way as before. The
difference is in how the information is actually used during LP_DEAD
deletion -- we can now delete some extra things in certain common
cases.

In practice this (and bottom-up deletion) make nbtree more robust
against disruption caused by long running transactions that hold a
snapshot open. It's hard to give a simple explanation of why that is,
because it's a second order effect. The patch is going to make it
possible to recover when LP_DEAD bits suddenly stop being set because
of an old snapshot -- now we'll have a "second chance", and maybe even
a third chance. But if the snapshot is held open *forever*, then a
second chance has no value.

Here is a thought experiment that might be helpful:

Imagine Postgres just as it is today (without the patch), except that
VACUUM runs very frequently, and is infinitely fast (this is a magical
version of VACUUM). This solves many problems, but does not solve all
problems. Magic Postgres will become just as slow as earthly Postgres
when there is a snapshot that is held open for a very long time. That
will take longer to happen compared to earthly/mortal Postgres, but
eventually there will be no difference between the two at all. But,
when you don't have such an extreme problem, magic Postgres really is
much faster.

I think that it will be possible to approximate the behavior of magic
Postgres using techniques like bottom-up deletion, the new LP_DEAD
deletion thing we've been talking today, and maybe other enhancements
in other areas (like in heap pruning). It doesn't matter that we don't
physically remove garbage immediately, as long as we "logically"
remove it immediately. The actually physical removal can occur in a
just in time, incremental fashion, creating the illusion that VACUUM
really does run infinitely fast. No magic required.

Actually, in a way this isn't new; we have always "logically" removed
garbage at the earliest opportunity (by which I mean we allow that it
can be physically removed according to an oldestXmin style cutoff,
which can be reacquired/updated the second the oldest MVCC snapshot
goes away). We don't think of useless old versions as being "logically
removed" the instant an old snapshot goes away. But maybe we should --
it's a useful mental model.

It will also be very helpful to add "remove useless intermediate
versions" logic at some point. This is quite a distinct area to what I
just described, but it's also important. We need both, I think.

-- 
Peter Geoghegan




Re: Huge memory consumption on partitioned table with FKs

2020-11-25 Thread Keisuke Kuroda
Hi Hackers,

Analyzed the problem and created a patch to resolve it.

# Problem 1

When you create a foreign key to a partitioned table, referential
integrity function is created for the number of partitions.
Internally, SPI_prepare() creates a plan and SPI_keepplan()
caches the plan.

The more partitions in the referencing table, the more plans will
be cached.

# Problem 2

When referenced table is partitioned table, the larger the number
of partitions, the larger the plan size to be cached.

The actual plan processed is simple and small if pruning is
enabled. However, the cached plan will include all partition
information.

The more partitions in the referenced table, the larger the plan
size to be cached.

# Idea for solution

Constraints with the same pg_constraint.conparentid can be combined
into one cached plan if we can guarantee that all their contents are
the same.
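A minimal sketch of that resolution step (for illustration only; the function
name is an assumption and the attached patch may differ in details):

/*
 * Resolve a constraint to its topmost parent, so that all leaf constraints
 * cloned from the same parent constraint share one cached SPI plan.
 */
static Oid
ri_RootConstraintOid(Oid constraintOid)
{
	for (;;)
	{
		HeapTuple	tup;
		Oid			parent;

		tup = SearchSysCache1(CONSTROID, ObjectIdGetDatum(constraintOid));
		if (!HeapTupleIsValid(tup))
			elog(ERROR, "cache lookup failed for constraint %u", constraintOid);
		parent = ((Form_pg_constraint) GETSTRUCT(tup))->conparentid;
		ReleaseSysCache(tup);

		if (!OidIsValid(parent))
			return constraintOid;	/* reached the root constraint */
		constraintOid = parent;
	}
}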

Problem 1 can be solved
and significant memory bloat can be avoided.
CachedPlan: 710MB -> 1466kB

Solving Problem 2 could also reduce memory,
but I don't have a good idea.

Currently, DISCARD ALL does not discard CachedPlan by SPI as in
this case. It may be possible to modify DISCARD ALL to discard
CachedPlan and run it periodically. However, we think the burden
on the user is high.

# result with patch(PG14 HEAD(e522024b) + patch)

       name       |  bytes  | pg_size_pretty
------------------+---------+----------------
 CachedPlanQuery  |   12912 | 13 kB
 CachedPlanSource |   17448 | 17 kB
 CachedPlan       | 1501192 | 1466 kB

CachedPlan * 1 Record

postgres=# SELECT count(*) FROM pg_backend_memory_contexts WHERE name
= 'CachedPlan' AND ident LIKE 'SELECT 1 FROM%';
 count
---
 1

postgres=# SELECT * FROM pg_backend_memory_contexts WHERE name =
'CachedPlan' AND ident LIKE 'SELECT 1 FROM%';
-[ RECORD 1 ]-+--------------------------------------------------------------------------------------
name          | CachedPlan
ident         | SELECT 1 FROM "public"."ps" x WHERE "c1" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
parent        | CacheMemoryContext
level         | 2
total_bytes   | 2101248
total_nblocks | 12
free_bytes    | 613256
free_chunks   | 1
used_bytes    | 1487992
(1 record)

# result without patch(PG14 HEAD(e522024b))

       name       |   bytes   | pg_size_pretty
------------------+-----------+----------------
 CachedPlanQuery  |   1326280 | 1295 kB
 CachedPlanSource |   1474528 | 1440 kB
 CachedPlan       | 744009200 | 710 MB

CachedPlan * 500 Records

postgres=# SELECT count(*) FROM pg_backend_memory_contexts WHERE name
= 'CachedPlan' AND ident LIKE 'SELECT 1 FROM%';
 count
---
   500

SELECT * FROM pg_backend_memory_contexts WHERE name = 'CachedPlan' AND
ident LIKE 'SELECT 1 FROM%';
name          | CachedPlan
ident         | SELECT 1 FROM "public"."ps" x WHERE "c1" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x
parent        | CacheMemoryContext
level         | 2
total_bytes   | 2101248
total_nblocks | 12
free_bytes    | 613256
free_chunks   | 1
used_bytes    | 1487992
...(500 same records)

Best Regards,
-- 
Keisuke Kuroda
NTT Software Innovation Center
keisuke.kuroda.3...@gmail.com


v1_reduce_ri_SPI-plan-hash.patch
Description: Binary data


Re: Libpq support to connect to standby server as priority

2020-11-25 Thread Greg Nancarrow
On Thu, Nov 26, 2020 at 3:43 AM Tom Lane  wrote:
>
> Thanks for looking!  Pushed.
>
> At this point the cfbot is going to start complaining that the last-posted
> patch in this thread no longer applies.  Unless you have a new patch set
> nearly ready to post, I think we should close the CF entry as RWF, and
> then you can open a new one when you're ready.
>

Actually, the cfbot shouldn't be complaining, as my last-posted patch
still seems to apply cleanly on the latest code (with your pushed
patch), and all tests pass.
Hopefully it's OK to let it roll over to the next CF with the WOA status.
I am actively working on processing the feedback and updating the
patch, and hope to post an update next week sometime.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: [patch] CLUSTER blocks scanned progress reporting

2020-11-25 Thread Fujii Masao




On 2020/11/25 20:52, Matthias van de Meent wrote:

On Wed, 25 Nov 2020 at 10:35, Fujii Masao  wrote:


On 2020/11/25 0:25, Matthias van de Meent wrote:


I noticed that with my proposed patch it is still possible to go to
the next phase while heap_blks_scanned != heap_blks_total. This can
happen when the final heap pages contain only dead tuples, so no tuple
is returned from the last heap page(s) of the scan. As the
heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
finished (see heapam.c#1060-1072), I think it would be correct to set
heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
instead.


Thanks for updating the patch!

Please let me check my understanding about this. I was thinking that even
when the last page contains only dead tuples, table_scan_getnextslot()
returns the last page (because SnapshotAny is used) and heap_blks_scanned
is incremented properly. And then, heapam_relation_copy_for_cluster()
handles all the dead tuples in that page. No?


Yes, my description was incorrect.

The potential for a discrepancy exists for seemingly empty pages. I
thought that 'filled with only dead tuples' would be correct for
'seemingly empty', but SnapshotAny indeed returns all tuples on a
page. But pages can still be empty with SnapshotAny, through being
emptied by vacuum, so the last pages scanned can still be empty pages.
Then, there would be no tuple returned at the last pages of the scan,
and subsequently the CLUSTER/VACUUM FULL would start the next phase
without reporting on the last pages that were scanned and had no
tuples in them (such that heap_blks_scanned != heap_blks_total).

Vacuum truncates empty blocks from the end of the relation, and this
prevents empty blocks at the end of CLUSTER for the default case of
table scans starting at 0; but because the table scan might not start
at block 0, we can have an empty page at the end of the table scan due
to the last page of the scan not being the last page of the table.


Thanks for explaining that! I understood that.
That situation can happen easily also by using VACUUM (TRUNCATE off) command.

+			 * A heap scan need not return tuples for the last page it has
+			 * scanned. To ensure that heap_blks_scanned is equivalent to
+			 * total_heap_blks after the table scan phase, this parameter
+			 * is manually updated to the correct value when the table scan
+			 * finishes.

So it's better to update this comment a bit? For example,

If the scanned last pages are empty, it's possible to go to the next phase
while heap_blks_scanned != heap_blks_total. To ensure that they are
equivalent after the table scan phase, this parameter is manually updated
to the correct value when the table scan finishes.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_dump, ATTACH, and independently restorable child partitions

2020-11-25 Thread Tom Lane
Justin Pryzby  writes:
> 1. Maybe pg_restore ExecuteSqlCommandBuf() should (always?) call
> ExecuteSimpleCommands() instead of ExecuteSqlCommand().  It doesn't seem to
> break anything (although that surprised me).

That certainly does break everything, which I imagine is the reason
why the cfbot shows that this patch is failing the pg_upgrade tests.
Note the comments for ExecuteSimpleCommands:

 * We have to lex the data to the extent of identifying literals and quoted
 * identifiers, so that we can recognize statement-terminating semicolons.
 * We assume that INSERT data will not contain SQL comments, E'' literals,
 * or dollar-quoted strings, so this is much simpler than a full SQL lexer.

IOW, where that says "Simple", it means *simple* --- in practice,
we only risk using it on commands that we know pg_dump itself built
earlier.  There is no reasonable prospect of getting pg_restore to
split arbitrary SQL at command boundaries.  We'd need something
comparable to psql's lexer, which is huge, and from a future-proofing
standpoint it would be just awful.  (The worst that happens if psql
misparses your string is that it won't send the command when you
expected.  If pg_restore misparses stuff, your best case is that the
restore fails cleanly; the worst case could easily result in
SQL-injection compromises.)  So I think we cannot follow this
approach.

What we'd need to do if we want this to work with direct-to-DB restore
is to split off the ATTACH PARTITION command as a separate TOC entry.
That doesn't seem amazingly difficult, and it would even offer the
possibility that you could extract the partition standalone without
having to ignore errors.  (You'd use -l/-L to select the CREATE TABLE,
the data, etc, but not the ATTACH object.)
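
For illustration, the restored SQL for one partition would then be driven
by three separate TOC entries, roughly like this (a sketch only; the table
names and bounds are made up, not pg_dump's actual generated text):

    -- TOC entry: TABLE measurement_y2020
    CREATE TABLE public.measurement_y2020 (
        logdate date NOT NULL,
        value   integer
    );

    -- TOC entry: TABLE DATA measurement_y2020 (data follows the COPY)
    COPY public.measurement_y2020 (logdate, value) FROM stdin;

    -- TOC entry: ATTACH measurement_y2020 (can be skipped via -l/-L)
    ALTER TABLE ONLY public.measurement
        ATTACH PARTITION public.measurement_y2020
        FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');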

That would possibly come out as a larger patch than you have here,
but maybe not by much.  I don't think there's too much more involved
than setting up the proper command strings and calling ArchiveEntry().
You'd need to do some testing to verify that cases like --clean
work sanely.

Also, I read the 0002 patch briefly and couldn't make heads or tails
of it, except that it seemed to be abusing the PQExpBuffer abstraction
well beyond anything I'd consider acceptable.  If you want separate
strings, make a PQExpBuffer for each one.

regards, tom lane




Re: Optimising latch signals

2020-11-25 Thread Thomas Munro
On Thu, Nov 19, 2020 at 4:49 PM Thomas Munro  wrote:
> I'll add this to the next commitfest.

Let's see if this version fixes the Windows compile error and warning
reported by cfbot.
From 3eb542891a11d39047b28f6f33ae4e3d25bdd510 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 8 Aug 2020 15:08:09 +1200
Subject: [PATCH v4 1/5] Optimize latches to send fewer signals.

Don't send signals to processes that aren't currently sleeping.

Author: Andres Freund 
Discussion: https://postgr.es/m/ca+hukgjjxpdpzbe0a3hyuywbvazuc89yx3jk9rfzgfv_khu...@mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 24 
 src/include/storage/latch.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 24afc47d51..017b5c0c0f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -274,6 +274,7 @@ void
 InitLatch(Latch *latch)
 {
 	latch->is_set = false;
+	latch->maybe_sleeping = false;
 	latch->owner_pid = MyProcPid;
 	latch->is_shared = false;
 
@@ -321,6 +322,7 @@ InitSharedLatch(Latch *latch)
 #endif
 
 	latch->is_set = false;
+	latch->maybe_sleeping = false;
 	latch->owner_pid = 0;
 	latch->is_shared = true;
 }
@@ -523,6 +525,10 @@ SetLatch(Latch *latch)
 
 	latch->is_set = true;
 
+	pg_memory_barrier();
+	if (!latch->maybe_sleeping)
+		return;
+
 #ifndef WIN32
 
 	/*
@@ -589,6 +595,7 @@ ResetLatch(Latch *latch)
 {
 	/* Only the owner should reset the latch */
 	Assert(latch->owner_pid == MyProcPid);
+	Assert(latch->maybe_sleeping == false);
 
 	latch->is_set = false;
 
@@ -1289,6 +1296,14 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * ordering, so that we cannot miss seeing is_set if a notification
 		 * has already been queued.
 		 */
+		if (set->latch && !set->latch->is_set)
+		{
+			/* about to sleep on a latch */
+			set->latch->maybe_sleeping = true;
+			pg_memory_barrier();
+			/* and recheck */
+		}
+
 		if (set->latch && set->latch->is_set)
 		{
 			occurred_events->fd = PGINVALID_SOCKET;
@@ -1299,6 +1314,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 			occurred_events++;
 			returned_events++;
 
+			/* could have been set above */
+			set->latch->maybe_sleeping = false;
+
 			break;
 		}
 
@@ -1310,6 +1328,12 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		rc = WaitEventSetWaitBlock(set, cur_timeout,
    occurred_events, nevents);
 
+		if (set->latch)
+		{
+			Assert(set->latch->maybe_sleeping);
+			set->latch->maybe_sleeping = false;
+		}
+
 		if (rc == -1)
 			break;/* timeout occurred */
 		else
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 7c742021fb..d0da7e5d31 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -110,6 +110,7 @@
 typedef struct Latch
 {
 	sig_atomic_t is_set;
+	sig_atomic_t maybe_sleeping;
 	bool		is_shared;
 	int			owner_pid;
 #ifdef WIN32
-- 
2.20.1

From e0709315a443c76058a5a61e59bb54e248981d5d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Thu, 26 Nov 2020 10:18:29 +1300
Subject: [PATCH v4 2/5] Use SIGURG rather than SIGUSR1 for latches.

Traditionally, SIGUSR1 has been overloaded for ad-hoc signals,
procsignal.c signals and latch.c wakeups.  Move that last use over to a
new dedicated signal.  SIGURG is normally used to report out-of-band
socket data, but PostgreSQL doesn't use that.

The signal handler is now installed in all postmaster children by
InitializeLatchSupport().  Those wishing to disconnect from it should
call ShutdownLatchSupport() which will set it back to SIG_IGN.

Future patches will use this separation to avoid the need for a signal
handler on some operating systems.

Discussion: https://postgr.es/m/ca+hukgjjxpdpzbe0a3hyuywbvazuc89yx3jk9rfzgfv_khu...@mail.gmail.com
---
 src/backend/postmaster/bgworker.c| 19 ++-
 src/backend/postmaster/postmaster.c  |  4 +++
 src/backend/storage/ipc/latch.c  | 50 ++--
 src/backend/storage/ipc/procsignal.c |  2 --
 src/include/storage/latch.h  | 11 +-
 5 files changed, 40 insertions(+), 46 deletions(-)

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 5a9a0e3435..9241b4edf4 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -654,22 +654,6 @@ bgworker_die(SIGNAL_ARGS)
 	MyBgworkerEntry->bgw_type)));
 }
 
-/*
- * Standard SIGUSR1 handler for unconnected workers
- *
- * Here, we want to make sure an unconnected worker will at least heed
- * latch activity.
- */
-static void
-bgworker_sigusr1_handler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /*
  * Start a new background worker
  *
@@ -700,6 +684,7 @@ StartBackgroundWorker(void)
 	 */
 	if ((worker->bgw_flags & BGWORKER_SHMEM_ACCESS) == 0)
 	{
+		ShutdownLatchSupport();
 		dsm_detach_all();
 		PGSharedMemoryDetach();
 	}
@@ -727,7 +712,7 @@ StartBackgroundWorker(void)
 	

Re: Deleting older versions in unique indexes to avoid page splits

2020-11-25 Thread Victor Yegorov
ср, 25 нояб. 2020 г. в 19:41, Peter Geoghegan :

> We have promising tuples for bottom-up deletion. Why not have
> "promising heap blocks" for traditional LP_DEAD index tuple deletion?
> Or if you prefer, we can consider index tuples that *don't* have their
> LP_DEAD bits set already but happen to point to the *same heap block*
> as other tuples that *do* have their LP_DEAD bits set promising. (The
> tuples with their LP_DEAD bits set are not just promising -- they're
> already a sure thing.)
>

In _bt_delete_or_dedup_one_page() we start with a simple loop over the
items on the page, and if there are any LP_DEAD tuples, we kick off
_bt_delitems_delete().

So if I understood you right, you plan to make this loop (or a similar one
nearby) track the TIDs of the LP_DEAD tuples and then (perhaps in a second
loop over the page) compare all other currently-not-LP_DEAD tuples, marking
those heap pages that have at least two TIDs pointing at them (one LP_DEAD
and one not) as promising.

Later, should we need to kick off deduplication, we'll visit the promising
pages first.

Is my understanding correct?


> I am missing a general perspective here.
> >
> > Is it true, that despite the long (vacuum preventing) transaction we can
> re-use space,
> > as after the DELETE statements commits, IndexScans are setting LP_DEAD
> hints after
> > they check the state of the corresponding heap tuple?
>
> The enhancement to traditional LP_DEAD deletion that I just described
> does not affect the current restrictions on setting LP_DEAD bits in
> the presence of a long-running transaction, or anything like that.
> That seems like an unrelated project. The value of this enhancement is
> purely its ability to delete *extra* index tuples that could have had
> their LP_DEAD bits set already (it was possible in principle), but
> didn't. And only when they are nearby to index tuples that really do
> have their LP_DEAD bits set.
>

I wasn't considering improvements here; that was a general question about
how this works (trying to clear up gaps in my understanding).

What I meant to ask is — will LP_DEAD be set by an IndexScan in the presence
of the long transaction?


-- 
Victor Yegorov


Re: enable_incremental_sort changes query behavior

2020-11-25 Thread James Coleman
On Wed, Nov 25, 2020 at 2:53 PM James Coleman  wrote:
>
> On Tue, Nov 24, 2020 at 2:31 PM Tom Lane  wrote:
> >
> > James Coleman  writes:
> > > On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
> > >> 1. James was wondering, far upthread, why we would do projections
> > >> pre-sort or post-sort.  I think the answers can be found by studying
> > >> planner.c's make_sort_input_target(), which separates out what we want
> > >> to do pre-sort and post-sort.
> >
> > > Does it imply we need to intentionally avoid SRFs also?
> >
> > It's sort of a wart that we allow SRFs in ORDER BY at all, but my
> > expectation is that make_sort_input_target would prevent lower levels
> > of the planner from needing to think about that.  We don't allow SRFs
> > in index expressions, nor in WHERE clauses (so they'll never come up
> > as mergejoin sort keys).  So really there's no way that scan/join
> > processing should need to consider such cases.  Such a sort would
> > only ever be implemented via a final Sort atop ProjectSet.
>
> In this case though we're sorting based on "interesting" pathkeys,
> which means we don't necessarily have index expressions or
> mergejoin sort keys. For example, even with the bugfix patch (from the
> parallel fix thread I've linked to previously) applied, I'm able to
> generate this (with of course some GUCs "tweaked"):
>
> select unique1 from tenk1 order by unnest('{1}'::int[]);
>
>  Gather Merge
>Workers Planned: 2
>->  Sort
>  Sort Key: (unnest('{1}'::integer[]))
>  ->  ProjectSet
>->  Parallel Index Only Scan using tenk1_unique1 on tenk1
>
> If I understand correctly, that's a violation of the spirit of the
> comments above make_sort_input_target(). If that's true, then it
> should be pretty easy to disallow them from being considered. Given
> the existing restrictions on where SRFs can be placed in a SELECT
> (e.g., no CASE/COALESCE) and my assumption (without having thoroughly
> verified this) that SRFs aren't allowed as arguments to functions or
> as arguments to any other expression (I assume only scalars are
> allowed), would it be sufficient to check the pathkey expression
> (without recursion) to see if it's a FuncExpr that returns a set?

Here's the plan with a change to restrict SRFs here:

 Sort
   Sort Key: (unnest('{1,2}'::integer[]))
   ->  Gather
 Workers Planned: 2
 ->  ProjectSet
   ->  Parallel Index Only Scan using tenk1_unique1 on tenk1

I'm a bit surprised the ProjectSet is above the Index Scan rather than
above the Gather, but maybe this is still more correct?

James




Re: Add table access method as an option to pgbench

2020-11-25 Thread David Zhang

Thanks a lot for your comments, Michael.

On 2020-11-24 7:41 p.m., Michael Paquier wrote:

On Tue, Nov 24, 2020 at 03:32:38PM -0800, David Zhang wrote:

But, providing another option for the end user may not be a bad idea, and it
might make the tests easier at some points.

My first thought is that we have no need to complicate pgbench with
this option because there is a GUC able to do that, but we do that for
tablespaces, so...  No objections from here.


The attached file is quick patch for this.

Thoughts?

This patch does not apply on HEAD, where you can just use
appendPQExpBuffer() to append the new clause to the CREATE TABLE
query.  This needs proper documentation.
The previous patch was based on the "REL_13_STABLE" branch. The attached
new patch v2 is based on the master branch. I followed the new code
structure, using appendPQExpBuffer to append the new "using TABLEAM"
clause in the proper position, and tested it. In the meantime, I also
updated the pgbench.sgml file to reflect the changes.
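
With the patch applied, initialization could then be run along these lines
(a usage sketch; the database name and scale are illustrative, and any
installed table access method name can be given):

    pgbench -i -s 10 --table-am=heap bench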

--
Michael

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca
From 1e908ecdd6add81c96dca489f45f5ae9d0c2a5f1 Mon Sep 17 00:00:00 2001
From: David Zhang 
Date: Wed, 25 Nov 2020 10:41:20 -0800
Subject: [PATCH] add table access method option to pgbench

---
 doc/src/sgml/ref/pgbench.sgml | 11 +++
 src/bin/pgbench/pgbench.c | 20 
 2 files changed, 31 insertions(+)

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 7180fedd65..a8f8d24fe1 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -378,6 +378,17 @@ pgbench  options 
 d
   
  
 
+ 
+  
--table-am=TABLEAM
+  
+   
+Create all tables with specified table access method 
+TABLEAM.
+If unspecified, default is heap.
+   
+  
+ 
+
 

 
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3057665bbe..62b1dd3e23 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -189,6 +189,11 @@ int64  latency_limit = 0;
 char  *tablespace = NULL;
 char  *index_tablespace = NULL;
 
+/*
+ * table access method selection
+ */
+char  *tableam = NULL;
+
 /*
  * Number of "pgbench_accounts" partitions.  0 is the default and means no
  * partitioning.
@@ -643,6 +648,7 @@ usage(void)
   "  --partitions=NUM partition pgbench_accounts into 
NUM parts (default: 0)\n"
   "  --tablespace=TABLESPACE  create tables in the specified 
tablespace\n"
   "  --unlogged-tablescreate tables as unlogged 
tables\n"
+  "  --table-am=TABLEAM   create tables with specified 
table access method (default: heap)\n"
   "\nOptions to select what to run:\n"
   "  -b, --builtin=NAME[@W]   add builtin script NAME weighted 
at W (default: 1)\n"
   "   (use \"-b list\" to list 
available scripts)\n"
@@ -3759,6 +3765,15 @@ initCreateTables(PGconn *con)
  ddl->table,
  (scale >= 
SCALE_32BIT_THRESHOLD) ? ddl->bigcols : ddl->smcols);
 
+   if (tableam != NULL)
+   {
+   char   *escape_tableam;
+
+   escape_tableam = PQescapeIdentifier(con, tableam, 
strlen(tableam));
+   appendPQExpBuffer(, " using %s", escape_tableam);
+   PQfreemem(escape_tableam);
+   }
+
/* Partition pgbench_accounts table */
if (partition_method != PART_NONE && strcmp(ddl->table, 
"pgbench_accounts") == 0)
appendPQExpBuffer(,
@@ -5419,6 +5434,7 @@ main(int argc, char **argv)
{"show-script", required_argument, NULL, 10},
{"partitions", required_argument, NULL, 11},
{"partition-method", required_argument, NULL, 12},
+   {"table-am", required_argument, NULL, 13},
{NULL, 0, NULL, 0}
};
 
@@ -5792,6 +5808,10 @@ main(int argc, char **argv)
exit(1);
}
break;
+   case 13:/* table access method*/
+   initialization_option_set = true;
+   tableam = pg_strdup(optarg);
+   break;
default:
fprintf(stderr, _("Try \"%s --help\" for more 
information.\n"), progname);
exit(1);
-- 
2.17.1



Re: POC: postgres_fdw insert batching

2020-11-25 Thread Tomas Vondra
On 11/25/20 7:31 AM, tsunakawa.ta...@fujitsu.com wrote:
> From: Craig Ringer 
>> I suggest that when developing this, you keep in mind the ongoing
>> work on the libpq pipelining/batching enhancements, and also the
>> way many interfaces to foreign data sources support asynchronous,
>> concurrent operations.
> 
> Yes, thank you, I bear it in mind.  I understand it's a feature for
> batching multiple kinds of SQL statements like DBC's batch updates.
> 

I haven't followed the libpq pipelining thread very closely. It does
seem related, but I'm not sure if it's a good match for this patch, or
how far it is from being committable ...

> 
>> I'd argue it's pretty much vital for decent performance when
>> talking to a cloud database from an on-prem server for example, or
>> any other time that round-trip-time reduction is important.
> 
> Yeah, I'm thinking of the data migration and integration as the
> prominent use case.
> 

Well, good that we all agree this is a useful feature to have (in
general). The question is whether postgres_fdw should be doing batching
on it's onw (per this thread) or rely on some other feature (libpq
pipelining). I haven't followed the other thread, so I don't have an
opinion on that.

Note however we're doing two things here, actually - we're implementing
custom batching for postgres_fdw, but we're also extending the FDW API
to allow other implementations to do the same thing. And most of them won't
be able to rely on the connection library providing that, I believe.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: remove spurious CREATE INDEX CONCURRENTLY wait

2020-11-25 Thread Alvaro Herrera
On 2020-Nov-23, Andres Freund wrote:

> On 2020-11-23 12:30:05 -0300, Alvaro Herrera wrote:

> > In other words, my conclusion is that there definitely is a bug here and
> > I am going to restore the use of exclusive lock for setting the
> > statusFlags.
> 
> Cool.

Here's a patch.

Note it also moves the computation of vacuum's Xmin (per
GetTransactionSnapshot) to *after* the bit has been set in statusFlags.
From b813c67a4abe2127b8bd13db7e920f958db15d59 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 24 Nov 2020 18:10:42 -0300
Subject: [PATCH] Restore lock level to update statusFlags

Reverts 27838981be9d (some comments are kept).  Per discussion, it does
not seem safe to relax the lock level used for this; in order for it to
be safe, there would have to be memory barriers between the point we set
the flag and the point we set the transaction Xid, which perhaps would
not be so bad; but there would also have to be barriers at the readers'
side, which from a performance perspective might be bad.

Now maybe this analysis is wrong and it *is* safe for some reason, but
proof of that is not trivial.

Discussion: https://postgr.es/m/20201118190928.vnztes7c2sldu...@alap3.anarazel.de
---
 src/backend/commands/vacuum.c | 20 +++-
 src/backend/replication/logical/logical.c |  2 +-
 src/backend/storage/ipc/procarray.c   |  4 +---
 src/include/storage/proc.h|  6 +++---
 4 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 395e75f768..f1112111de 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1712,13 +1712,6 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 	/* Begin a transaction for vacuuming this relation */
 	StartTransactionCommand();
 
-	/*
-	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
-	 * cutoff xids in local memory wrapping around, and to have updated xmin
-	 * horizons.
-	 */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	if (!(params->options & VACOPT_FULL))
 	{
 		/*
@@ -1739,9 +1732,11 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		 * Note: these flags remain set until CommitTransaction or
 		 * AbortTransaction.  We don't want to clear them until we reset
 		 * MyProc->xid/xmin, otherwise GetOldestNonRemovableTransactionId()
-		 * might appear to go backwards, which is probably Not Good.
+		 * might appear to go backwards, which is probably Not Good.  (We also
+		 * set PROC_IN_VACUUM *before* taking our own snapshot, so that our
+		 * xmin doesn't become visible ahead of setting the flag.)
 		 */
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyProc->statusFlags |= PROC_IN_VACUUM;
 		if (params->is_wraparound)
 			MyProc->statusFlags |= PROC_VACUUM_FOR_WRAPAROUND;
@@ -1749,6 +1744,13 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params)
 		LWLockRelease(ProcArrayLock);
 	}
 
+	/*
+	 * Need to acquire a snapshot to prevent pg_subtrans from being truncated,
+	 * cutoff xids in local memory wrapping around, and to have updated xmin
+	 * horizons.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Check for user-requested abort.  Note we want this to be inside a
 	 * transaction, so xact.c doesn't issue useless WARNING.
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 4324e32656..f1f4df7d70 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -181,7 +181,7 @@ StartupDecodingContext(List *output_plugin_options,
 	 */
 	if (!IsTransactionOrTransactionBlock())
 	{
-		LWLockAcquire(ProcArrayLock, LW_SHARED);
+		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 		MyProc->statusFlags |= PROC_IN_LOGICAL_DECODING;
 		ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
 		LWLockRelease(ProcArrayLock);
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 94edb24b22..c7848c0b69 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -662,10 +662,8 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		/* avoid unnecessarily dirtying shared cachelines */
 		if (proc->statusFlags & PROC_VACUUM_STATE_MASK)
 		{
-			/* Only safe to change my own flags with just share lock */
-			Assert(proc == MyProc);
 			Assert(!LWLockHeldByMe(ProcArrayLock));
-			LWLockAcquire(ProcArrayLock, LW_SHARED);
+			LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
 			Assert(proc->statusFlags == ProcGlobal->statusFlags[proc->pgxactoff]);
 			proc->statusFlags &= ~PROC_VACUUM_STATE_MASK;
 			ProcGlobal->statusFlags[proc->pgxactoff] = proc->statusFlags;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 00bb244340..22046e4e36 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ 

Re: proposal: possibility to read dumped table's name from file

2020-11-25 Thread Tom Lane
Pavel Stehule  writes:
> st 25. 11. 2020 v 19:25 odesílatel Dean Rasheed 
> napsal:
>> I agree that being able to configure pg_dump via a config file would
>> be very useful, but the syntax proposed here feels much more like a
>> hacked-up syntax designed to meet this one use case, rather than a
>> good general-purpose design that can be easily extended.

> But I don't understand why? What is a use case? What is a benefit against
> command line, or libpq variables? And why should config files be better as
> a solution for limited length of command line, when I need to dump
> thousands of tables exactly specified?

Because next week somebody will want to dump thousands of functions
selected by name, or schemas selected by name, etc etc.  I agree with
the position that we don't want a single-purpose solution.  The idea
that the syntax should match the command line switch syntax seems
reasonable, though I'm not wedded to it.  (One thing to consider is
how painful will it be for people to quote table names containing
funny characters, for instance.  On the command line, we largely
depend on the shell's quoting behavior to solve that, but we'd not
have that infrastructure when reading from a file.)

> What are the benefits of supporting multiple formats?

Yeah, that part of Dean's sketch seemed like overkill to me too.

I wasn't very excited about multiple switch files either, though
depending on how the implementation is done, that could be simple
enough to be in the might-as-well category.

One other point that I'm wondering about is that there's really no
value in doing anything here until you get to some thousands of
table names; as long as the list fits in the shell's command line
length limit, you might as well just make a shell script file.
Does pg_dump really have sane performance for that situation, or
are we soon going to be fielding requests to make it not be O(N^2)
in the number of listed tables?

regards, tom lane




Re: enable_incremental_sort changes query behavior

2020-11-25 Thread James Coleman
On Tue, Nov 24, 2020 at 2:31 PM Tom Lane  wrote:
>
> James Coleman  writes:
> > On Mon, Nov 23, 2020 at 2:24 PM Tom Lane  wrote:
> >> 1. James was wondering, far upthread, why we would do projections
> >> pre-sort or post-sort.  I think the answers can be found by studying
> >> planner.c's make_sort_input_target(), which separates out what we want
> >> to do pre-sort and post-sort.
>
> > Does it imply we need to intentionally avoid SRFs also?
>
> It's sort of a wart that we allow SRFs in ORDER BY at all, but my
> expectation is that make_sort_input_target would prevent lower levels
> of the planner from needing to think about that.  We don't allow SRFs
> in index expressions, nor in WHERE clauses (so they'll never come up
> as mergejoin sort keys).  So really there's no way that scan/join
> processing should need to consider such cases.  Such a sort would
> only ever be implemented via a final Sort atop ProjectSet.

In this case though we're sorting based on "interesting" pathkeys,
which means we don't necessarily have index expressions or
mergejoin sort keys. For example, even with the bugfix patch (from the
parallel fix thread I've linked to previously) applied, I'm able to
generate this (with of course some GUCs "tweaked"):

select unique1 from tenk1 order by unnest('{1}'::int[]);

 Gather Merge
   Workers Planned: 2
   ->  Sort
 Sort Key: (unnest('{1}'::integer[]))
 ->  ProjectSet
   ->  Parallel Index Only Scan using tenk1_unique1 on tenk1

If I understand correctly, that's a violation of the spirit of the
comments above make_sort_input_target(). If that's true, then it
should be pretty easy to disallow them from being considered. Given
the existing restrictions on where SRFs can be placed in a SELECT
(e.g., no CASE/COALESCE) and my assumption (without having thoroughly
verified this) that SRFs aren't allowed as arguments to functions or
as arguments to any other expression (I assume only scalars are
allowed), would it be sufficient to check the pathkey expression
(without recursion) to see if it's a FuncExpr that returns a set?
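
(Concretely, the non-recursive check I have in mind would be roughly the
following, wherever we pick the useful pathkeys -- just a sketch, with
"em" standing for the relevant EquivalenceMember:)

	if (IsA(em->em_expr, FuncExpr) &&
		((FuncExpr *) em->em_expr)->funcretset)
		continue;		/* skip pathkeys whose sort expression is an SRF */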

> >> 2. Robert is correct that under normal circumstances, the targetlists of
> >> both baserels and join rels contain only Vars.
>
> > Is that primarily a project policy? Or a limitation of our current
> > planner (just can't push things down/carry the results back up as much
> > as we'd like)? Or something else? In particular, it seems plausible
> > there are cases where pushing down the sort on a non-Var expression to
> > the base rel could improve plans, so I'm wondering if there's a reason
> > to intentionally avoid that in the long or short run (or both).
>
> I think you've just rediscovered Joe Hellerstein's thesis topic [1].
> We ripped out the remnants of that code ages ago (search very early
> git states for "JMH" if you're interested), because the complexity
> vs. benefit ratio seemed pretty bad.  Maybe there'll be a case for
> putting it back some day, but I'm dubious.  Note that we have the
> ability to push down sorts-on-expressions anyway; that's not constrained
> by what is in the relation targetlists.

I'm doing some grepping now to see what that work entailed, but I'm
not sure that what I'm describing is the same thing. By "pushing down
a non-Var expression to the base rel" I mean what (to my ears) sounds
like what you're saying by "we have the ability to push down
sorts-on-expressions anyway; that's not constrained by what is in the
relation targetlists". The target list entries I'm talking about for
these expressions are the ones inserted by
prepare_sort_from_pathkeys(), which in PG13 happens just a bit more
frequently since, at least in parallel query, we consider sort paths
lower than we did before based on interesting pathkeys (for now just
query_pathkeys) in generate_useful_gather_paths().

> > Interesting: so merge joins are an example of us pushing down sorts,
> > which I assume means (part of) the answer to my question on (2) is
> > that there's nothing inherently wrong/broken with evaluating
> > expressions lower down the tree as long as we're careful about what is
> > safe/unsafe with respect to volatility and parallelism?
>
> Right, I don't see any fundamental problem with that, we just have
> to be careful about these constraints.

Great. That's exactly what I was looking for. To summarize: the
constraints are on volatility, parallel safety, and SRFs.

> > I have wondered if we should strictly require the expression to be in
> > the target list even if nonvolatile, but prepare_sort_from_pathkeys
> > doesn't seem to think that's necessary, so I'm comfortable with that
> > unless there's something you know we haven't considered.
>
> No, prepare_sort_from_pathkeys is happy to build a sort expression if it
> can't find it already computed in the input.  The secret here is that
> we should never get to that code with a "dangerous" sort expression,
> because we should never have made a Path that would request such 

Re: SEARCH and CYCLE clauses

2020-11-25 Thread Pavel Stehule
st 25. 11. 2020 v 14:06 odesílatel Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> napsal:

> On 2020-10-10 07:25, Pavel Stehule wrote:
> > This patch is based on transformation CYCLE and SEARCH clauses to
> > specific expressions - it is in agreement with ANSI SQL
> >
> > There is not a problem with compilation
> > Nobody had objections in discussion
> > There are enough regress tests and documentation
> > check-world passed
> > doc build passed
> >
> > I'll mark this patch as ready for committer
> >
> Possible enhancements for this feature (can be done in next steps)
> >
> > 1. support UNION DISTINCT
> > 2. better compatibility with Oracle and DB2 (USING clause can be
> optional)
>
> Here is an updated patch.  New since last time:
>
> - UNION DISTINCT is now supported (since hash_record() was added)
>
> - Some code has been cleaned up.
>
> - Some code has been moved from the rewriter to the parser so that
> certain errors are properly detected at parse time.
>
> - Added more syntax checks and more tests.
>
> - Support for dependency tracking was added (the type and operator for
> the cycle mark need to be added as dependencies).
>
> I found a bug that nested UNIONs (foo UNION bar UNION baz) were not
> handled (would crash) in the rewriter code.  For now, I have just
> changed that to error out.  This could be fixed, it would be a localized
> change in the rewriter code in any case.  Doesn't seem important for the
> first pass, though.
>

I checked this patch, and I didn't find any issue.
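
For reference, a minimal example of the syntax the patch adds (a sketch;
the graph table and column names are illustrative):

    WITH RECURSIVE search_graph(f, t, label) AS (
        SELECT g.f, g.t, g.label FROM graph g
      UNION ALL
        SELECT g.f, g.t, g.label
        FROM graph g, search_graph sg
        WHERE g.f = sg.t
    ) SEARCH DEPTH FIRST BY f, t SET seq
      CYCLE f, t SET is_cycle TO true DEFAULT false USING path
    SELECT * FROM search_graph ORDER BY seq;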

make check-world passed
make doc passed

I'll mark it as ready for committer

Regards

Pavel



> --
> Peter Eisentraut
> 2ndQuadrant, an EDB company
> https://www.2ndquadrant.com/
>


Re: proposal: possibility to read dumped table's name from file

2020-11-25 Thread Pavel Stehule
Hi

st 25. 11. 2020 v 19:25 odesílatel Dean Rasheed 
napsal:

> On Thu, 19 Nov 2020 at 19:57, Pavel Stehule 
> wrote:
> >
> > minor update - fixed handling of processing names with double quotes
> inside
> >
>
> I see this is marked RFC, but reading the thread it doesn't feel like
> we have reached consensus on the design for this feature.
>
> I agree that being able to configure pg_dump via a config file would
> be very useful, but the syntax proposed here feels much more like a
> hacked-up syntax designed to meet this one use case, rather than a
> good general-purpose design that can be easily extended.
>

Nobody sent a real use case for introducing the config file. There was a
discussion about formats, and you introduce other dimensions and
variability.

But I don't understand why? What is a use case? What is a benefit against
command line, or libpq variables? And why should config files be better as
a solution for limited length of command line, when I need to dump
thousands of tables exactly specified?

Regards

Pavel


> IMO, a pg_dump config file should be able to specify all options
> currently supported through the command line, and vice versa (subject
> to command line length limits), with a single common code path for
> handling options. That way, any new options we add will work on the
> command line and in config files. Likewise, the user should only need
> to learn one set of options, and have the choice of specifying them on
> the command line or in a config file (or a mix of both).
>
> I can imagine eventually supporting multiple different file formats,
> each just being a different representation of the same data, so
> perhaps this could work with 2 new options:
>
>   --option-file-format=plain|yaml|json|...
>   --option-file=filename
>
> with "plain" being the default initial implementation, which might be
> something like our current postgresql.conf file format.
>
> Also, I think we should allow multiple "--option-file" arguments
> (e.g., to list different object types in different files), and for a
> config file to contain its own "--option-file" arguments, to allow
> config files to include other config files.
>
> The current design feels far too limited to me, and requires new code
> and new syntax to be added each time we extend it, so I'm -1 on this
> patch as it stands.


This new syntax tries to be consistent and simple. It really doesn't try to
implement an alternative configuration file for pg_dump. The code is simple
and can be easily extended.

What are the benefits of supporting multiple formats?


> Regards,
> Dean
>


Re: [PATCH] Add section headings to index types doc

2020-11-25 Thread Tom Lane
=?UTF-8?Q?J=c3=bcrgen_Purtz?=  writes:
> Attached are two patches: a) summary against master and b) standalone of 
> my current changes.

This was a bit confused ... as submitted, the patch wanted to commit
a couple of patchfiles.  I sorted it out I believe, and pushed with
a little additional fiddling of my own.

I did not commit the reindentation of existing text --- I don't think
it's worth adding "git blame" noise for.

regards, tom lane




Re: Improper use about DatumGetInt32

2020-11-25 Thread Alvaro Herrera
On 2020-Nov-25, Peter Eisentraut wrote:

>  bt_page_stats(PG_FUNCTION_ARGS)
>  {
>   text   *relname = PG_GETARG_TEXT_PP(0);
> - uint32  blkno = PG_GETARG_UINT32(1);
> + int64   blkno = PG_GETARG_INT64(1);

As a matter of style, I think it'd be better to have an int64 variable
that gets the value from PG_GETARG_INT64(), then you cast that to
another variable that's a BlockNumber and use that throughout the rest
of the code.  So you'd avoid changes like this:

>  static bytea *get_raw_page_internal(text *relname, ForkNumber forknum,
> - 
> BlockNumber blkno);
> + int64 
> blkno);

where the previous coding was correct, and the new one is dubious and it
forces you to add unnecessary range checks in that function:

> @@ -144,11 +144,16 @@ get_raw_page_internal(text *relname, ForkNumber 
> forknum, BlockNumber blkno)
>   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>errmsg("cannot access temporary tables of 
> other sessions")));
>  
> + if (blkno < 0 || blkno > MaxBlockNumber)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("invalid block number")));
> +
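
A minimal sketch of the shape I mean (variable names are illustrative,
not taken from the actual patch):

	Datum
	bt_page_stats(PG_FUNCTION_ARGS)
	{
		text	   *relname = PG_GETARG_TEXT_PP(0);
		int64		blkno_arg = PG_GETARG_INT64(1);
		BlockNumber blkno;

		/* validate the user-supplied value once, at the SQL boundary */
		if (blkno_arg < 0 || blkno_arg > MaxBlockNumber)
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("invalid block number")));
		blkno = (BlockNumber) blkno_arg;

		/* ... everything below keeps working with the BlockNumber ... */
	}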





Re: [PATCH] Add `truncate` option to subscription commands

2020-11-25 Thread David Christensen
Hi,

At this time I do not have time to make the necessary changes for this
commitfest so I am voluntarily withdrawing this patch, but will
revisit at a future time.

Best,

David

On Wed, Oct 28, 2020 at 1:06 PM Rahila Syed  wrote:
>
> Hi David,
>
> The feature seems useful to me.  The code will need to be refactored due to 
> changes in commit : b05fe7b442
>
> Please see the following comments.
> 1. Is there a specific reason behind having new relstate for truncate?
> The current state flow is INIT->DATATSYNC->SYNCWAIT->CATCHUP->SYNCDONE->READY.
> Can we accommodate the truncate in either INIT or DATASYNC?
>
> 2.   +   StartTransactionCommand();
>   +   rel = 
> table_open(MyLogicalRepWorker->relid, RowExclusiveLock);
>   +
>   +   rels = lappend(rels, rel);
>   +   relids = lappend_oid(relids, 
> MyLogicalRepWorker->relid);
>   +
>   +   ExecuteTruncateGuts(rels, relids, NIL, 
> DROP_RESTRICT, false);
>   +   CommitTransactionCommand();
>
> Truncate is being performed in a separate transaction as data copy, I think 
> that leaves a window
> open for concurrent transactions to modify the data after truncate and before 
> copy.
>
> 3. Regarding the truncate of the referenced table, I think one approach can 
> be to perform the following:
> i. lock the referencing and referenced tables against writes
> ii. drop the foreign key constraints,
> iii. truncate
> iv. sync
> v. recreate the constraints
> vi. release lock.
> However, I am not sure of the implications of locking these tables on the 
> main apply process.
>
>
> Thank you,
>
>
> On Mon, Oct 12, 2020 at 11:31 PM David Christensen  wrote:
>>
>> > On Oct 11, 2020, at 10:00 PM, Amit Kapila  wrote:
>> >
>> > On Mon, Oct 12, 2020 at 3:44 AM David Christensen  
>> > wrote:
>> >>
>> >>
>> >> On Oct 11, 2020, at 1:14 PM, Euler Taveira 
>> >>  wrote:
>> >>
>> >> 
>> >> On Fri, 9 Oct 2020 at 15:54, David Christensen  wrote:
>> >>>
>> >>>
>> >>> Enclosed find a patch to add a “truncate” option to subscription 
>> >>> commands.
>> >>>
>> >>> When adding new tables to a subscription (either via `CREATE 
>> >>> SUBSCRIPTION` or `REFRESH PUBLICATION`), tables on the target which are 
>> >>> being newly subscribed will be truncated before the data copy step.  
>> >>> This saves explicit coordination of a manual `TRUNCATE` on the target 
>> >>> tables and allows the results of the initial data sync to be the same as 
>> >>> on the publisher at the time of sync.
>> >>>
>> >>> To preserve compatibility with existing behavior, the default value for 
>> >>> this parameter is `false`.
>> >>>
>> >>
>> >> Truncate will fail for tables whose foreign keys refer to it. If such a 
>> >> feature cannot handle foreign keys, the usefulness will be restricted.
>> >>
>> >>
>> >> This is true for existing “truncate” with FKs, so doesn’t seem to be any 
>> >> different to me.
>> >>
>> >
>> > What would happen if there are multiple tables and truncate on only
>> > one of them failed due to FK check? Does it give an error in such a
>> > case, if so will the other tables be truncated?
>>
>> Currently each SyncRep relation is sync’d separately in its own worker 
>> process; we are doing the truncate at the initialization step of this, so 
>> it’s inherently in its own transaction. I think if we are going to do any 
>> sort of validation on this, it would have to be at the point of the CREATE 
>> SUBSCRIPTION/REFRESH PUBLICATION where we have the relation list and can do 
>> sanity-checking there.
>>
>> Obviously if someone changes the schema at some point between when it does 
>> this and when relation syncs start there is a race condition, but the same 
>> issue would affect other data sync things, so I don’t care to solve that as 
>> part of this patch.
>>
>> >> Hypothetically if you checked all new tables and could verify if there 
>> >> were FK cycles only already in the new tables being added then “truncate 
>> >> cascade” would be fine. Arguably if they had existing tables that were 
>> >> part of an FK that wasn’t fully replicated they were already operating 
>> >> brokenly.
>> >>
>> >
>> > I think if both PK_table and FK_table are part of the same
>> > subscription then there should be a problem as both them first get
>> > truncated? If they are part of a different subscription (or if they
>> > are not subscribed due to whatever reason) then probably we need to
>> > deal such cases carefully.
>>
>> You mean “should not be a problem” here?  If so, I agree with that.  
>> Obviously if we determine this features is only useful with this support 
>> we’d have to chase the entire dependency graph and make sure that is all 
>> contained in the set of newly-subscribed tables (or at least FK referents).
>>
>> I have not considered tables that are part of more than one subscription (is 
>> that 

Re: walsender bug: stuck during shutdown

2020-11-25 Thread Alvaro Herrera
On 2020-Nov-25, Fujii Masao wrote:

> But whether MyWalSnd->write is InvalidRecPtr or not, if it's behind sentPtr,
> walsender should keep waiting for the ack to all the sent message to be
> replied, i.e., isn't this expected behavior of normal shutdown?  That is,
> if we want to shutdown walsender even when the client side doesn't
> reply message, immediate shutdown should be used or the client side
> should be terminated, instead?

I don't think "waiting forever" can be considered the expected behavior;
this has caused what are nominally production outages several times
already, since we sent a shutdown signal to the server and it never
completed shutting down.

If you want to propose a better patch to address the issue, feel free,
but keeping things as they are seems a bad idea to me.

(Hmm, maybe another idea would be to have WalSndDone cause a keepalive
message to be sent to the client and complete shut down when we get a
reply to that.)
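
(Roughly, the shape I have in mind inside WalSndDone() would be the
following -- an untested sketch only, reusing existing walsender.c names:

	if (!waiting_for_ping_response)
	{
		WalSndKeepalive(true);	/* ask the client to reply */
		waiting_for_ping_response = true;
	}

	/*
	 * ... and only complete the shutdown once the reply has advanced the
	 * acknowledged write/flush position up to sentPtr ...
	 */
)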




Re: Deleting older versions in unique indexes to avoid page splits

2020-11-25 Thread Peter Geoghegan
On Wed, Nov 25, 2020 at 4:43 AM Victor Yegorov  wrote:
>> Then I had a much better idea: Make the existing LP_DEAD stuff a
>> little more like bottom-up index deletion. We usually have to access
>> heap blocks that the index tuples point to today, in order to have a
>> latestRemovedXid cutoff (to generate recovery conflicts). It's worth
>> scanning the leaf page for index tuples with TIDs whose heap block
>> matches the index tuples that actually have their LP_DEAD bits set.
>> This only consumes a few more CPU cycles. We don't have to access any
>> more heap blocks to try these extra TIDs, so it seems like a very good
>> idea to try them out.
>
>
> I don't seem to understand this.
>
> Is it: we're scanning the leaf page for all LP_DEAD tuples that point to the 
> same
> heap block? Which heap block we're talking about here, the one that holds
> entry we're about to add (the one that triggered bottom-up-deletion due to 
> lack
> of space I mean)?

No, the incoming tuple isn't significant.

As you know, bottom-up index deletion uses heuristics that are
concerned with duplicates on the page, and the "logically unchanged by
an UPDATE" hint that the executor passes to btinsert(). Bottom-up
deletion runs when all LP_DEAD bits have been cleared (either because
there never were any LP_DEAD bits set, or because they were set and
then deleted, which wasn't enough).

But before bottom-up deletion may run, traditional deletion of LP_DEAD
index tuples runs -- this is always our preferred strategy because
index tuples with their LP_DEAD bits set are already known to be
deletable. We can make this existing process (which has been around
since PostgreSQL 8.2) better by applying similar principles.

We have promising tuples for bottom-up deletion. Why not have
"promising heap blocks" for traditional LP_DEAD index tuple deletion?
Or if you prefer, we can consider index tuples that *don't* have their
LP_DEAD bits set already but happen to point to the *same heap block*
as other tuples that *do* have their LP_DEAD bits set promising. (The
tuples with their LP_DEAD bits set are not just promising -- they're
already a sure thing.)
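
In code terms, the extra pass over the leaf page might look roughly like
this (a sketch only, ignoring posting list tuples; "page" and "opaque" come
from the surrounding nbtree deletion code, and none of this is the actual
patch):

	Bitmapset  *deadblocks = NULL;
	ItemPointerData candidates[MaxIndexTuplesPerPage];
	int			ncandidates = 0;
	OffsetNumber offnum,
				maxoff = PageGetMaxOffsetNumber(page);

	/* first pass: remember the heap blocks of known-dead index tuples */
	for (offnum = P_FIRSTDATAKEY(opaque); offnum <= maxoff; offnum++)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);

		if (ItemIdIsDead(itemid))
			deadblocks = bms_add_member(deadblocks,
							(int) ItemPointerGetBlockNumber(&itup->t_tid));
	}

	/* second pass: LP_DEAD items are sure things, same-block items promising */
	for (offnum = P_FIRSTDATAKEY(opaque); offnum <= maxoff; offnum++)
	{
		ItemId		itemid = PageGetItemId(page, offnum);
		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
		BlockNumber blkno = ItemPointerGetBlockNumber(&itup->t_tid);

		if (ItemIdIsDead(itemid) || bms_is_member((int) blkno, deadblocks))
			candidates[ncandidates++] = itup->t_tid;
	}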

This means that traditional LP_DEAD deletion is now slightly more
speculative in one way (it speculates about what is likely to be true
using heuristics). But it's much less speculative than bottom-up index
deletion. We are required to visit these heap blocks anyway, since a
call to _bt_delitems_delete() for LP_DEAD deletion must already call
table_compute_xid_horizon_for_tuples(), which has to access the blocks
to get a latestRemovedXid for the WAL record.

The only thing that we have to lose here is a few CPU cycles to find
extra TIDs to consider. We'll visit exactly the same number of heap
blocks as before. (Actually, _bt_delitems_delete() does not have to do
that in all cases, actually, but it has to do it with a logged table
with wal_level >= replica, which is the vast majority of cases in
practice.)

This means that traditional LP_DEAD deletion reuses some of the
bottom-up index deletion infrastructure. So maybe nbtree never calls
table_compute_xid_horizon_for_tuples() now, since everything goes
through the new heapam stuff instead (which knows how to check extra
TIDs that might not be dead at all).

> I am missing a general perspective here.
>
> Is it true, that despite the long (vacuum preventing) transaction we can 
> re-use space,
> as after the DELETE statements commits, IndexScans are setting LP_DEAD hints 
> after
> they check the state of the corresponding heap tuple?

The enhancement to traditional LP_DEAD deletion that I just described
does not affect the current restrictions on setting LP_DEAD bits in
the presence of a long-running transaction, or anything like that.
That seems like an unrelated project. The value of this enhancement is
purely its ability to delete *extra* index tuples that could have had
their LP_DEAD bits set already (it was possible in principle), but
didn't. And only when they are nearby to index tuples that really do
have their LP_DEAD bits set.

> I haven't done any testing so far since sending my last e-mail.
> If you'll have a chance to send a new v10 version with 
> LP_DEAD-deletion-with-extra-TIDs thing,
> I will do some tests (planned).

Thanks! I think that it will be next week. It's a relatively big change.

-- 
Peter Geoghegan




Re: proposal: possibility to read dumped table's name from file

2020-11-25 Thread Dean Rasheed
On Thu, 19 Nov 2020 at 19:57, Pavel Stehule  wrote:
>
> minor update - fixed handling of processing names with double quotes inside
>

I see this is marked RFC, but reading the thread it doesn't feel like
we have reached consensus on the design for this feature.

I agree that being able to configure pg_dump via a config file would
be very useful, but the syntax proposed here feels much more like a
hacked-up syntax designed to meet this one use case, rather than a
good general-purpose design that can be easily extended.

IMO, a pg_dump config file should be able to specify all options
currently supported through the command line, and vice versa (subject
to command line length limits), with a single common code path for
handling options. That way, any new options we add will work on the
command line and in config files. Likewise, the user should only need
to learn one set of options, and have the choice of specifying them on
the command line or in a config file (or a mix of both).

I can imagine eventually supporting multiple different file formats,
each just being a different representation of the same data, so
perhaps this could work with 2 new options:

  --option-file-format=plain|yaml|json|...
  --option-file=filename

with "plain" being the default initial implementation, which might be
something like our current postgresql.conf file format.
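
For illustration only (no actual syntax is being proposed here), a plain
format file mirroring existing long options might look something like:

    # pg_dump option file -- keys correspond to command-line options
    format = custom
    schema-only = false
    table = public.orders
    table = public.order_lines
    exclude-table-data = public.audit_log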

Also, I think we should allow multiple "--option-file" arguments
(e.g., to list different object types in different files), and for a
config file to contain its own "--option-file" arguments, to allow
config files to include other config files.

The current design feels far too limited to me, and requires new code
and new syntax to be added each time we extend it, so I'm -1 on this
patch as it stands.

Regards,
Dean




Re: error_severity of brin work item

2020-11-25 Thread Justin Pryzby
On Mon, Nov 23, 2020 at 04:39:57PM -0300, Alvaro Herrera wrote:
> I think this formulation (attached v3) has fewer moving parts.
> 
> However, now that I did that, I wonder if this is really the best
> approach to solve this problem.  Maybe instead of doing this at the BRIN
> level, it should be handled at the autovac level, by having the worker
> copy the work-item to local memory and remove it from the shared list as
> soon as it is in progress.  That way, if *any* error occurs while trying
> to execute it, it will go away instead of being retried for all
> eternity.
> 
> Preliminary patch for that attached as autovacuum-workitem.patch.
> 
> I would propose to clean that up to apply instead of your proposed fix.

I don't know why you said it would be retried for eternity ?
I think the perform_work_item() TRY/CATCH avoids that.

Also .. I think your idea doesn't solve my issue, that REINDEX CONCURRENTLY is
causing vacuum to leave errors in my logs.

I checked that the first patch avoids the issue and the 2nd one does not.

-- 
Justin




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-25 Thread Bruce Momjian
On Wed, Nov 25, 2020 at 10:33:09AM -0500, Tom Lane wrote:
> Magnus Hagander  writes:
> > I guess one option could be to just remove it, unconditionally. And
> > assume that any user who is running it manually reads that in docs
> > somewhere that tells them what to do next, and that any user who's
> > running it under a wrapper will have the wrapper set it up?
> 
> I could get behind that, personally.  (1) I think most end-users don't
> run initdb by hand anymore.  (2) The message is barely useful; what
> it mostly does is to distract your attention from the slightly
> more useful info printed ahead of it.

I think the question is not how many people who run initdb in any form
need the instructions to start Postgres, but how many who are actually
seeing the initdb output need those instructions.  If someone isn't
seeing the initdb output, it doesn't matter what we output.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Libpq support to connect to standby server as priority

2020-11-25 Thread Tom Lane
Greg Nancarrow  writes:
> On Wed, Nov 25, 2020 at 12:07 PM Tom Lane  wrote:
>> Here's a v2 that does it like that.

> Looks OK to me.

Thanks for looking!  Pushed.

At this point the cfbot is going to start complaining that the last-posted
patch in this thread no longer applies.  Unless you have a new patch set
nearly ready to post, I think we should close the CF entry as RWF, and
then you can open a new one when you're ready.

regards, tom lane




Re: Improper use about DatumGetInt32

2020-11-25 Thread Peter Eisentraut

On 2020-11-02 16:59, Peter Eisentraut wrote:

I have committed 0003.

For 0001, normal_rand(), I think you should reject negative arguments
with an error.


I have committed a fix for that.


For 0002, I think you should change the block number arguments to int8,
same as other contrib modules do.


Looking further into this, almost all of pageinspect needs to be updated 
to handle block numbers larger than INT_MAX correctly.  Attached is a 
patch for this.  It is meant to work like other contrib modules, such as 
pg_freespace and pg_visibility.  I haven't tested this much yet.
From 30c0e36a1f19b2d09eb81a26949c7793167084e8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 25 Nov 2020 17:13:46 +0100
Subject: [PATCH] pageinspect: Change block number arguments to bigint

---
 contrib/pageinspect/Makefile  |  3 +-
 contrib/pageinspect/brinfuncs.c   |  2 +-
 contrib/pageinspect/btreefuncs.c  | 40 ++---
 contrib/pageinspect/hashfuncs.c   | 11 ++-
 contrib/pageinspect/pageinspect--1.8--1.9.sql | 81 +++
 contrib/pageinspect/pageinspect.control   |  2 +-
 contrib/pageinspect/rawpage.c | 24 --
 doc/src/sgml/pageinspect.sgml | 12 +--
 8 files changed, 143 insertions(+), 32 deletions(-)
 create mode 100644 contrib/pageinspect/pageinspect--1.8--1.9.sql

diff --git a/contrib/pageinspect/Makefile b/contrib/pageinspect/Makefile
index d9d8177116..3fcbfea293 100644
--- a/contrib/pageinspect/Makefile
+++ b/contrib/pageinspect/Makefile
@@ -12,7 +12,8 @@ OBJS = \
rawpage.o
 
 EXTENSION = pageinspect
-DATA =  pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
+DATA =  pageinspect--1.8--1.9.sql \
+   pageinspect--1.7--1.8.sql pageinspect--1.6--1.7.sql \
pageinspect--1.5.sql pageinspect--1.5--1.6.sql \
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c
index fb32d74a66..6f452c688d 100644
--- a/contrib/pageinspect/brinfuncs.c
+++ b/contrib/pageinspect/brinfuncs.c
@@ -252,7 +252,7 @@ brin_page_items(PG_FUNCTION_ARGS)
int att = attno - 1;
 
values[0] = UInt16GetDatum(offset);
-   values[1] = UInt32GetDatum(dtup->bt_blkno);
+   values[1] = Int64GetDatum((int64) dtup->bt_blkno);
values[2] = UInt16GetDatum(attno);
values[3] = 
BoolGetDatum(dtup->bt_columns[att].bv_allnulls);
values[4] = 
BoolGetDatum(dtup->bt_columns[att].bv_hasnulls);
diff --git a/contrib/pageinspect/btreefuncs.c b/contrib/pageinspect/btreefuncs.c
index 445605db58..5937de3a1c 100644
--- a/contrib/pageinspect/btreefuncs.c
+++ b/contrib/pageinspect/btreefuncs.c
@@ -164,7 +164,7 @@ Datum
 bt_page_stats(PG_FUNCTION_ARGS)
 {
text   *relname = PG_GETARG_TEXT_PP(0);
-   uint32  blkno = PG_GETARG_UINT32(1);
+   int64   blkno = PG_GETARG_INT64(1);
Buffer  buffer;
Relationrel;
RangeVar   *relrv;
@@ -197,8 +197,15 @@ bt_page_stats(PG_FUNCTION_ARGS)
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot access temporary tables of 
other sessions")));
 
+   if (blkno < 0 || blkno > MaxBlockNumber)
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("invalid block number")));
+
if (blkno == 0)
-   elog(ERROR, "block 0 is a meta page");
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("block 0 is a meta page")));
 
CHECK_RELATION_BLOCK_RANGE(rel, blkno);
 
@@ -219,16 +226,16 @@ bt_page_stats(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
 
j = 0;
-   values[j++] = psprintf("%d", stat.blkno);
+   values[j++] = psprintf("%u", stat.blkno);
values[j++] = psprintf("%c", stat.type);
-   values[j++] = psprintf("%d", stat.live_items);
-   values[j++] = psprintf("%d", stat.dead_items);
-   values[j++] = psprintf("%d", stat.avg_item_size);
-   values[j++] = psprintf("%d", stat.page_size);
-   values[j++] = psprintf("%d", stat.free_size);
-   values[j++] = psprintf("%d", stat.btpo_prev);
-   values[j++] = psprintf("%d", stat.btpo_next);
-   values[j++] = psprintf("%d", (stat.type == 'd') ? stat.btpo.xact : 
stat.btpo.level);
+   values[j++] = psprintf("%u", stat.live_items);
+   values[j++] = psprintf("%u", stat.dead_items);
+   values[j++] = psprintf("%u", stat.avg_item_size);
+   values[j++] = psprintf("%u", stat.page_size);
+   values[j++] = psprintf("%u", 

Re: Transaction isolation and table contraints

2020-11-25 Thread David G. Johnston
On Wed, Nov 25, 2020 at 8:14 AM Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> Hi hackers,
>
> I wonder if it is considered as correct behavior that transaction
> conflict detection depends on presence of primary key:
>
> create table t(pk integer, val integer);
> insert into t values (1,0),(2,0);
>
> ERROR:  could not serialize access due to read/write dependencies among
> transactions
> [...]
> Now let's repeat the same scenario but with "pk" declared as primary key:
>
> create table t(pk integer primary key, val integer);
> insert into t values (1,0),(2,0);
>
> Now both transactions are succeeded.
>

From the docs:

"A sequential scan will always necessitate a relation-level predicate lock.
This can result in an increased rate of serialization failures."

The two seem possibly related (I'm not experienced with using serializable)

https://www.postgresql.org/docs/current/transaction-iso.html#XACT-SERIALIZABLE


> Please notice, that even if it is expected behavior, hint in error
> message is not correct, because transaction is actually aborted and
> there is no chance to retry it.
>
>
It is technically correct, it just doesn't describe precisely how to
perform said retry, leaving that up to the reader to glean from the
documentation.

The hint assumes users of serializable isolation mode are familiar with
transaction mechanics.  In particular, the application needs to be prepared
to retry failed transactions, and part of that is knowing that PostgreSQL
will not automatically rollback a failed transaction for you, it is
something that must be done as part of the error detection and recovery
infrastructure that is needed when writing applications that utilize
serializable.
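
For example, the usual client-side shape is roughly (a sketch; the retry
loop itself lives in application code):

    BEGIN ISOLATION LEVEL SERIALIZABLE;
    -- ... the transaction's statements ...
    COMMIT;
    -- if any statement (or the COMMIT) fails with SQLSTATE 40001
    -- (serialization_failure): issue ROLLBACK, then re-run the whole
    -- transaction from the top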

David J.


Re: A few new options for CHECKPOINT

2020-11-25 Thread Stephen Frost
Greetings,

* Bossart, Nathan (bossa...@amazon.com) wrote:
> On 11/24/20, 4:03 PM, "tsunakawa.ta...@fujitsu.com" 
>  wrote:
> > From: Bossart, Nathan 
> >> The main purpose of this patch is to give users more control over their 
> >> manually
> >> requested checkpoints or restartpoints.  I suspect the most useful option 
> >> is
> >> IMMEDIATE, which can help avoid checkpoint- related IO spikes.  However, I
> >> didn't see any strong reason to prevent users from also adjusting FORCE and
> >> WAIT.
> >
> > I think just IMMEDIATE would suffice, too.  But could you tell us why you 
> > got to want to give users more control?  Could we know concrete example 
> > situations where users want to perform CHECKPOINT with options?
> 
> It may be useful for backups taken with the "consistent snapshot"
> approach.  As noted in the documentation [0], running CHECKPOINT
> before taking the snapshot can reduce recovery time.  However, users
> might wish to avoid the IO spike caused by an immediate checkpoint.

I'm a bit confused by the idea here..  The whole point of running a
CHECKPOINT is to get the immediate behavior with the IO spike to get
things flushed out to disk so that, on crash recovery, there's less
outstanding WAL to replay.

Avoiding the IO spike implies that you're waiting for a regular
checkpoint and that additional WAL is building up since that started and
therefore you're going to have to replay that WAL during crash recovery
and so you won't end up reducing your recovery time, so I'm failing to
see the point..?  I don't think you really get to have both..  pay the
price at backup time, or pay it during crash recovery.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: PoC: custom signal handler for extensions

2020-11-25 Thread Anastasia Lubennikova

On 03.09.2020 14:25, Pavel Borisov wrote:


Is there a chance to see you restarting working on this patch ?


I noticed that despite interest to the matter, the discussion in this 
CF thread stopped quite a while ago. So I'd like to push here a patch 
revised according to the previous discussion. Actually the patch is 
being used as a part of the pg_query_state extension during several 
major PostgreSQL releases. So I'd like to raise the matter of 
reviewing this updated patch and include it into version 14 as I see 
much use of this change.


I welcome all your considerations,

I took a look at the patch. It looks fine and I see, that it contains 
fixes for the latest questions in the thread.


I think we should provide a test module for this feature, that will 
serve as both test and example of the use.


This is a feature for extension developers, so I don't know where we 
should document it. At the very least we can improve comments. For 
example, describe the fact that custom signals are handled after 
receiving SIGUSR1.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-25 Thread Tom Lane
Magnus Hagander  writes:
> I guess one option could be to just remove it, unconditionally. And
> assume that any users who is running it manually read that in docs
> somewhere that tells them what to do next, and that any user who's
> running it under a wrapper will have the wrapper set it up?

I could get behind that, personally.  (1) I think most end-users don't
run initdb by hand anymore.  (2) The message is barely useful; what
it mostly does is to distract your attention from the slightly
more useful info printed ahead of it.

>> would be, though.  Maybe we need to add some kind of text adventure game
>> into initdb.

> I do like this idea though...

+1

regards, tom lane




Re: Prevent printing "next step instructions" in initdb and pg_upgrade

2020-11-25 Thread Magnus Hagander
On Wed, Nov 25, 2020 at 9:29 AM Peter Eisentraut
 wrote:
>
> On 2020-11-24 13:32, Magnus Hagander wrote:
> > I think it boils down to that today the output from initdb is entirely
> > geared towards people running initdb directly and starting their
> > server manually, and very few people outside the actual PostgreSQL
> > developers ever do that. But there are still a lot of people who run
> > initdb through their wrapper manually (for redhat you have to do that,
> > for debian you only have to do it if you're creating a secondary
> > cluster but that still a pretty common operation).
>
> Perhaps it's worth asking whom the advice applies to then.  You suggest
> it's mostly developers.  I for one am still grumpy that in 9.5 we
> removed the variant of the hint that suggested "postgres -D ..." instead
> of pg_ctl.  I used to copy and paste that a lot.  The argument back then
> was that the hint should target end users, not developers.  I doubt that
> under the current circumstances, running pg_ctl start from the console
> is really appropriate advice for a majority of users.  (For one thing,
> systemd will kill it when you log out.)  I don't know what better advice


I guess one option could be to just remove it, unconditionally. And
assume that any users who is running it manually read that in docs
somewhere that tells them what to do next, and that any user who's
running it under a wrapper will have the wrapper set it up?


> would be, though.  Maybe we need to add some kind of text adventure game
> into initdb.

I do like this idea though...

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Transaction isolation and table constraints

2020-11-25 Thread Konstantin Knizhnik

Hi hackers,

I wonder if it is considered as correct behavior that transaction 
conflict detection depends on presence of primary key:


create table t(pk integer, val integer);
insert into t values (1,0),(2,0);

Session1: begin isolation level serializable;
Session2: begin isolation level serializable;
Session1: update t set val=val+1  where pk=1;
Session2: update t set val=val+1  where pk=2;
Session1: commit;
Session2: commit;
ERROR:  could not serialize access due to read/write dependencies among 
transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during 
commit attempt.

HINT:  The transaction might succeed if retried.


Now let's repeat the same scenario but with "pk" declared as primary key:

create table t(pk integer primary key, val integer);
insert into t values (1,0),(2,0);

Session1: begin isolation level serializable;
Session2: begin isolation level serializable;
Session1: update t set val=val+1  where pk=1;
Session2: update t set val=val+1  where pk=2;
Session1: commit;
Session2: commit;

Now both transactions are succeeded.
Please notice that even if this is the expected behavior, the hint in the error 
message is not correct, because the transaction is actually aborted and 
there is no chance to retry it.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: A few new options for CHECKPOINT

2020-11-25 Thread Laurenz Albe
On Wed, 2020-11-25 at 11:41 +0100, Bernd Helmle wrote:
> Am Mittwoch, den 25.11.2020, 13:47 +0900 schrieb Michael Paquier:
> 
> > I can see the use case for IMMEDIATE, but I fail to see the use cases
> > for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the
> > end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous
> > thing if disabled, as clients could pile up requests to the
> > checkpointer for no real purpose.
> 
> Wouldn't it be more convenient to use "FAST" for immediate checkpoint,
> defaulting to "FAST ON"? That would be along the parameter used in the
> streaming protocol command BASE_BACKUP, where "FAST" disables lazy
> checkpointing.

+1

That would also match pg_basebackup's "-c fast|spread".
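
For illustration, that spelling might look like this (purely hypothetical
syntax for the patch being discussed, mirroring the pg_basebackup naming):

CHECKPOINT (FAST ON);   -- today's behavior: immediate checkpoint
CHECKPOINT (FAST OFF);  -- spread the I/O like a regular checkpoint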

Yours,
Laurenz Albe





Re: Implementing Incremental View Maintenance

2020-11-25 Thread Konstantin Knizhnik




On 25.11.2020 16:06, Yugo NAGATA wrote:

On Wed, 25 Nov 2020 15:16:05 +0300
Konstantin Knizhnik  wrote:



On 24.11.2020 13:11, Yugo NAGATA wrote:

I wonder if it is possible to somehow use predicate locking mechanism of
Postgres to avoid this anomalies without global lock?

You mean that, ,instead of using any table lock, if any possibility of the
anomaly is detected using predlock mechanism then abort the transaction?

Yes. If both transactions are using serializable isolation level, then
lock is not needed, isn't it?
So at least you can add yet another simple optimization: if transaction
has serializable isolation level,
then exclusive lock is not required.

As long as we use the trigger approach, we can't handle concurrent view 
maintenance
in either repeatable read or serializable isolation level.  It is because one
transaction (R= R+dR) cannot see changes that occurred in another transaction (S'= S+dS)
in such cases, and we cannot get the incremental change on the view (dV=dR*dS).
Therefore, in the current implementation, the transaction is aborted when the
concurrent view maintenance happens in repeatable read or serializable.


Sorry, maybe I do not correctly understand you or you do not understand me.
Let's consider two serializable transactions (I do not use views or 
triggers, but perform the corresponding updates manually):




create table t(pk integer, val int);
create table mat_view(gby_key integer primary key, total bigint);
insert into t values (1,0),(2,0);
insert into mat_view values (1,0),(2,0);

Session1: begin isolation level serializable;
Session2: begin isolation level serializable;
Session1: insert into t values (1,200);
Session2: insert into t values (1,300);
Session1: update mat_view set total=total+200 where gby_key=1;
Session2: update mat_view set total=total+300 where gby_key=1;
Session1: commit;
Session2: ERROR:  could not serialize access due to concurrent update

So both transactions are aborted.
It is expected behavior for serializable transactions.
But if the transactions update different records of mat_view, then they 
can be executed concurrently:


Session1: begin isolation level serializable;
Session2: begin isolation level serializable;
Session1: insert into t values (1,200);
Session2: insert into t values (2,300);
Session1: update mat_view set total=total+200 where gby_key=1;
Session2: update mat_view set total=total+300 where gby_key=2;
Session1: commit;
Session2: commit;

So, if the transactions are using the serializable isolation level, then we can
update the mat view without an exclusive lock, and if there is no conflict,
these transactions can be executed concurrently.

Please notice that an exclusive lock doesn't prevent the conflict in the first case:

Session1: begin isolation level serializable;
Session2: begin isolation level serializable;
Session1: insert into t values (1,200);
Session2: insert into t values (1,300);
Session1: lock table mat_view;
Session1: update mat_view set total=total+200 where gby_key=1;
Session2: lock table mat_view;
Session1: commit;
Session2: update mat_view set total=total+300 where gby_key=1;
Session2: commit;
Session2: ERROR:  could not serialize access due to concurrent update


So do you agree that there are no reasons for using explicit lock for 
serializable transactions?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Improper use about DatumGetInt32

2020-11-25 Thread Anastasia Lubennikova

On 02.11.2020 18:59, Peter Eisentraut wrote:

I have committed 0003.

For 0001, normal_rand(), I think you should reject negative arguments 
with an error.


I've updated 0001. The change is trivial, see attached.



For 0002, I think you should change the block number arguments to 
int8, same as other contrib modules do.


Agree. It will need a bit more work, though. Probably a new version of 
pageinspect contrib, as the public API will change.

Ashutosh, are you going to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit b42df63c757bf5ff715052a401aba37691a7e2b8
Author: anastasia 
Date:   Wed Nov 25 17:19:33 2020 +0300

Handle negative number of tuples passed to normal_rand()

The function converts the first argument i.e. the number of tuples to
return into an unsigned integer which turns out to be huge number when a
negative value is passed. This causes the function to take much longer
time to execute. Instead throw an error about invalid parameter value.

While at it, improve SQL test to test the number of tuples returned by
this function.

Authors: Ashutosh Bapat, Anastasia Lubennikova

diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index fffadc6e1b4..97bfd690841 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -3,10 +3,20 @@ CREATE EXTENSION tablefunc;
 -- normal_rand()
 -- no easy way to do this for regression testing
 --
-SELECT avg(normal_rand)::int FROM normal_rand(100, 250, 0.2);
- avg 
--
- 250
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
+ avg | count 
+-+---
+ 250 |   100
+(1 row)
+
+-- negative number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
+ERROR:  invalid number of tuples: -1
+-- zero number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(0, 250, 0.2);
+ avg | count 
+-+---
+ | 0
 (1 row)
 
 --
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index ec375b05c63..5341c005f91 100644
--- a/contrib/tablefunc/sql/tablefunc.sql
+++ b/contrib/tablefunc/sql/tablefunc.sql
@@ -4,7 +4,11 @@ CREATE EXTENSION tablefunc;
 -- normal_rand()
 -- no easy way to do this for regression testing
 --
-SELECT avg(normal_rand)::int FROM normal_rand(100, 250, 0.2);
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
+-- negative number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
+-- zero number of tuples
+SELECT avg(normal_rand)::int, count(*) FROM normal_rand(0, 250, 0.2);
 
 --
 -- crosstab()
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 02f02eab574..c6d32d46f01 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -174,6 +174,7 @@ normal_rand(PG_FUNCTION_ARGS)
 	FuncCallContext *funcctx;
 	uint64		call_cntr;
 	uint64		max_calls;
+	int32		num_tuples;
 	normal_rand_fctx *fctx;
 	float8		mean;
 	float8		stddev;
@@ -193,7 +194,14 @@ normal_rand(PG_FUNCTION_ARGS)
 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 
 		/* total number of tuples to be returned */
-		funcctx->max_calls = PG_GETARG_UINT32(0);
+		num_tuples = PG_GETARG_INT32(0);
+
+		if (num_tuples < 0)
+			ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid number of tuples: %d", num_tuples)));
+
+		funcctx->max_calls = num_tuples;
 
 		/* allocate memory for user context */
 		fctx = (normal_rand_fctx *) palloc(sizeof(normal_rand_fctx));


Re: Use standard SIGHUP and SIGTERM handlers in autoprewarm module

2020-11-25 Thread Bharath Rupireddy
On Wed, Nov 25, 2020 at 3:29 PM Fujii Masao 
wrote:
>
> On 2020/11/20 19:33, Bharath Rupireddy wrote:
> > On Wed, Nov 18, 2020 at 5:52 PM Fujii Masao mailto:masao.fu...@oss.nttdata.com>> wrote:
> >  >
> >  > > handle_sigterm() and die() are similar except that die() has extra
> >  > > handling(below) for exiting immediately when waiting for
input/command
> >  > > from the client.
> >  > >  /*
> >  > >   * If we're in single user mode, we want to quit immediately
- we can't
> >  > >   * rely on latches as they wouldn't work when stdin/stdout is
a file.
> >  > >   * Rather ugly, but it's unlikely to be worthwhile to invest
much more
> >  > >   * effort just for the benefit of single user mode.
> >  > >   */
> >  > >  if (DoingCommandRead && whereToSendOutput != DestRemote)
> >  > >  ProcessInterrupts();
> >  > >
> >  > > Having this extra handling is correct for normal backends as they
can
> >  > > connect directly to clients for reading input commands, the above
if
> >  > > condition may become true and ProcessInterrupts() may be called to
> >  > > handle the SIGTERM immediately.
> >  > >
> >  > > Note that DoingCommandRead can never be true in bg workers as it's
> >  > > being set to true only in normal backend PostgresMain(). If we use
> >  > > die()(instead of custom SIGTERM handlers such as handle_sigterm())
in
> >  > > a bg worker/non-backend process, there are no chances that the
above
> >  > > part of the code gets hit and the interrupts are processed
> >  > > immediately. And also here are the bg worker process that use die()
> >  > > instead of their own custom handlers: parallel workers, autoprewarm
> >  > > worker, autovacuum worker, logical replication launcher and apply
> >  > > worker, wal sender process
> >  > >
> >  > > I think we can also use die() instead of handle_sigterm() in
> >  > > test_shm_mq/worker.c.
> >  > >
> >  > > I attached a patch for this change.
> >  >
> >  > Thanks for the patch! It looks good to me.
> >  >
> >
> > Thanks.
>
> When I read the patch again, I found that, with the patch, the shutdown
> of worker_spi causes to report the following FATAL message.
>
>  FATAL:  terminating connection due to administrator command
>
> Isn't this message confusing because it's not a connection? If so,
> we need to update ProcessInterrupts() so that the proper message is
> reported like other bgworkers do.
>

This is also true for all the bgworkers that use the die() handler. How
about doing it the way bgworker_die() does in ProcessInterrupts()? This
would give meaningful information. Thoughts? If okay, I can make a separate
patch.

else if (RecoveryConflictPending)
{
    /* Currently there is only one non-retryable recovery conflict */
    Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
    pgstat_report_recovery_conflict(RecoveryConflictReason);
    ereport(FATAL,
            (errcode(ERRCODE_DATABASE_DROPPED),
             errmsg("terminating connection due to conflict with recovery"),
             errdetail_recovery_conflict()));
}
/* the proposed new branch: */
else if (IsBackgroundWorker)
{
    ereport(FATAL,
            (errcode(ERRCODE_ADMIN_SHUTDOWN),
             errmsg("terminating background worker \"%s\" due to administrator command",
                    MyBgworkerEntry->bgw_type)));
}
else
    ereport(FATAL,
            (errcode(ERRCODE_ADMIN_SHUTDOWN),
             errmsg("terminating connection due to administrator command")));

>
> BTW, though this is not directly related to this topic, it's strange to me
> that the normal shutdown of bgworker causes to emit FATAL message
> and the message "background worker ... exited with exit code 1"
> at the normal shutdown. I've heard sometimes that users coplained that
> it's confusing to report that message for logical replication launcher
> at the normal shutdown. Maybe the mechanism to shutdown bgworker
> normally seems to have not been implemented well yet.
>

What do you mean by normal shutdown of a bgworker? Is it that the bgworker has
exited successfully with exit code 0, or for some reason with an exit code
other than 0? Or is it when the postmaster is shut down normally?

IIUC, when a bgworker exits either normally with an exit code 0 or with a code
other than 0, then CleanupBackgroundWorker() is called in the postmaster, a
message (like below) is prepared, and LogChildExit() is called with
either DEBUG1 or LOG level and for instance the message you specified gets
printed "background worker ... exited with exit code 1". I have not seen a
FATAL message similar to "background worker ... exited with exit code 1" at
the normal shutdown.

snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
rw->rw_worker.bgw_type);

LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG, namebuf, pid,
exitstatus);

Am I missing something?

If my analysis is right, 

Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Antonin Houska
Amit Kapila  wrote:

> On Wed, Nov 18, 2020 at 4:03 PM Antonin Houska  wrote:
> >
> > Amit Kapila  wrote:
> >
> > > On Fri, Nov 13, 2020 at 6:02 PM Antonin Houska  wrote:
> > > >
> > > > Amit Kapila  wrote:
> > > >
> > > > > On Thu, Nov 12, 2020 at 2:45 PM Antonin Houska  
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > No background undo
> > > > > > --
> > > > > >
> > > > > > Reduced complexity of the patch seems to be the priority at the 
> > > > > > moment. Amit
> > > > > > suggested that cleanup of an orphaned relation file is simple 
> > > > > > enough to be
> > > > > > done on foreground and I agree.
> > > > > >
> > > > >
> > > > > Yeah, I think we should try and see if we can make it work but I
> > > > > noticed that there are few places like AbortOutOfAnyTransaction where
> > > > > we have the assumption that undo will be executed in the background.
> > > > > We need to deal with it.
> > > >
> > > > I think this is o.k. if we always check for unapplied undo during 
> > > > startup.
> > > >
> > >
> > > Hmm, how it is ok to leave undo (and rely on startup) unless it is a
> > > PANIC error. IIRC, this path is invoked in non-panic errors as well.
> > > Basically, we won't be able to discard such an undo which doesn't seem
> > > like a good idea.
> >
> > Since failure to apply leaves inconsistent data, I assume it should always
> > cause PANIC, shouldn't it?
> >
> 
> But how can we ensure that AbortOutOfAnyTransaction will be called
> only in that scenario?

I meant that AbortOutOfAnyTransaction should PANIC itself if it sees that
there is unapplied undo, so nothing changes for its callers. Do I still miss
something?

> > > Another thing is that it seems we need to connect to the database to 
> > > perform
> > > it which might appear a bit odd that we don't allow users to connect to 
> > > the
> > > database but internally we are connecting it.
> >
> > I think the implementation will need to follow the outcome of the part of 
> > the
> > discussion that starts at [2], but I see your concern. I'm thinking why
> > database connection is not needed to apply WAL but is needed for UNDO. I 
> > think
> > locks make the difference.
> >
> 
> Yeah, it would be probably a good idea to see if we can make undo
> apply work without db-connection especially if we want to do before
> allowing connections. The other possibility could be to let discard
> worker do this work lazily after allowing connections.

Actually I hit the problem of missing connection when playing with the
"undoxacttest" module. Those tests use table_open() / table_close() functions,
but it might not be necessary for the real RMGRs.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Is postgres ready for 2038?

2020-11-25 Thread Pavel Borisov
> There are no 32bit Windows version builds since Postgres 11, see:
>
> https://www.postgresql.org/download/windows/
>
> but Postgres 13 still has the same  2038 problems.
>
>
>
> As @Pavel Borisov hints , I can find `_USE_32BIT_TIME_T` code here:
>
> https://github.com/postgres/postgres/search?q=_USE_32BIT_TIME_T
>
>
> Is it a good idea to remove `_USE_32BIT_TIME_T` code and build with 64bit
> Perl
>
> might solve 2038 problem?
>

As was mentioned above, `_USE_32BIT_TIME_T` is not a default flag for MSVC
builds, but Perl can set it on purpose (or because it is very ancient).
Postgres will not set `_USE_32BIT_TIME_T` itself and by default compiles with
64-bit time, which is correct. But it respects Perl's choice in case it was
made on purpose. If you face a problem with PG compiled with
non-2038-compatible time, please first check your Perl; that is the likely
reason.
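
A quick way to check is to inspect the flags your Perl was built with, for
example:

    perl -V:ccflags

If _USE_32BIT_TIME_T shows up there, that Perl (and anything built against it)
is the source of the 32-bit time_t.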

Regards, Pavel Borisov.

>


Re: POC: Cleaning up orphaned files using undo logs

2020-11-25 Thread Antonin Houska
Antonin Houska  wrote:

> Amit Kapila  wrote:

> > I think we also need to maintain oldestXidHavingUndo for CLOG truncation and
> > transaction-wraparound. We can't allow CLOG truncation for the transaction
> > whose undo is not discarded as that could be required by some other
> > transaction.
> 
> Good point. Even the discard worker might need to check the transaction status
> when deciding whether undo log of that transaction should be discarded.

In the zheap code [1] I see that DiscardWorkerMain() discards undo log up to
OldestXmin:


OldestXmin = GetOldestXmin(NULL, PROCARRAY_FLAGS_AUTOVACUUM |
   PROCARRAY_FLAGS_VACUUM);

oldestXidHavingUndo =
    GetXidFromEpochXid(pg_atomic_read_u64(&ProcGlobal->oldestXidWithEpochHavingUndo));

/*
 * Call the discard routine if there oldestXidHavingUndo is lagging
 * behind OldestXmin.
 */
if (OldestXmin != InvalidTransactionId &&
TransactionIdPrecedes(oldestXidHavingUndo, OldestXmin))
{
UndoDiscard(OldestXmin, );

and that UndoDiscard() eventually advances oldestXidHavingUndo in the shared
memory.

I'm not sure this is correct because, IMO, OldestXmin can advance as soon as
AbortTransaction() has cleared both xid and xmin fields of the transaction's
PGXACT (by calling ProcArrayEndTransactionInternal). However the corresponding
undo log may still be waiting for processing. Am I wrong?

I think that oldestXidHavingUndo should be advanced at the time transaction
commits or when the undo log of an aborted transaction has been applied. Then
the discard worker would simply discard the undo log up to
oldestXidHavingUndo. However, as the transactions whose undo is still not
applied may no longer be registered in the shared memory (proc array), I don't
know how to determine the next value of oldestXidHavingUndo.

Also I wonder if FullTransactionId is needed for oldestXidHavingUndo in the
shared memory rather than plain TransactionId (see
oldestXidWithEpochHavingUndo in PROC_HDR). I think that the value cannot lag
behind nextFullXid by more than 2 billions transactions anyway because in that
case it would cause XID wraparound. (That in turn makes me think that VACUUM
FREEZE should be able to discard undo log too.)

[1] https://github.com/EnterpriseDB/zheap/tree/master

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




POC: Better infrastructure for automated testing of concurrency issues

2020-11-25 Thread Alexander Korotkov
Hackers,

PostgreSQL is a complex multi-process system, and we are periodically faced
with complicated concurrency issues. While the postgres community does a
great job on investigating and fixing the problems, our ability to
reproduce concurrency issues in the source code test suites is limited.

I think we currently have two general ways to reproduce the concurrency
issues.
1. A text scenario for manual reproduction of the issue, which could
involve psql sessions, gdb sessions etc. A couple of examples are [1] and
[2]. This method provides reliable reproduction of concurrency issues. But
it's hard to automate, because it requires external instrumentation
(a debugger) and it's not stable in terms of postgres code changes (that is,
particular line numbers for breakpoints could change). I think this is
why we currently don't have such scenarios among postgres test suites.
2. Another way is to reproduce the concurrency issue without directly
touching the database internals using pgbench or other way to simulate the
workload (see [3] for example). This way is easier to automate, because it
doesn't need external instrumentation and it's not so sensitive to source
code changes. But at the same time this way is not reliable and is
resource-consuming.

In the view of above, I'd like to propose a POC patch, which implements new
builtin infrastructure for reproduction of concurrency issues in automated
test suites.  The general idea is so-called "stop events", which are
special places in the code, where the execution could be stopped on some
condition.  Stop event also exposes a set of parameters, encapsulated into
jsonb value.  The condition over stop event parameters is defined using
jsonpath language.

The following functions control the behavior:
 * pg_stopevent_set(stopevent_name, jsonpath_condition) – sets the condition for
the stop event.  Once the function is executed, all the backends which run
a given stop event with parameters satisfying the given jsonpath condition
will be stopped.
 * pg_stopevent_reset(stopevent_name) – resets stop events.  All the
backends previously stopped on a given stop event will continue the
execution.
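
For illustration, a test session could look roughly like this (the stop event
name and the jsonpath condition here are invented; only the function and GUC
names come from the POC):

SET enable_stopevents = on;
-- stop any backend that hits this stop event on block 42
SELECT pg_stopevent_set('gin_traverse_deleted_page', '$.blkno == 42');
-- ... drive the concurrent scenario from other sessions; they will wait here ...
SELECT pg_stopevent_reset('gin_traverse_deleted_page');  -- release them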

For sure, evaluation of stop events causes a CPU overhead.  This is why
it's controlled by enable_stopevents GUC, which is off by default. I expect
the overhead with enable_stopevents = off shouldn't be observable.  Even if
it would be observable, we could enable stop events only by specific
configure parameter.  There is also trace_stopevents GUC, which traces all
the stop events to the log with debug2 level.

In the code stop events are defined using macro STOPEVENT(event_id,
params).  The 'params' should be a function call, and it's evaluated only
if stop events are enabled.  pg_isolation_test_session_is_blocked() takes
stop events into account.  So, stop events are suitable for isolation tests.

POC patch comes with a sample isolation test in
src/test/isolation/specs/gin-traverse-deleted-pages.spec, which reproduces
the issue described in [2] (gin scan steps to the page concurrently deleted
by vacuum).

From my point of view, stop events would open great possibilities to
improve coverage of concurrency issues.  They allow us to reliably test
concurrency issues in both isolation and tap test suites.  And such test
suites don't take extraordinary resources for execution.  The main cost
here is maintaining a set of stop events among the codebase.  But I think
this cost is justified by much better coverage of concurrency issues.

The feedback is welcome.

Links.
1. https://www.postgresql.org/message-id/4E1DE580.1090905%40enterprisedb.com
2.
https://www.postgresql.org/message-id/CAPpHfdvMvsw-NcE5bRS7R1BbvA4BxoDnVVjkXC5W0Czvy9LVrg%40mail.gmail.com
3.
https://www.postgresql.org/message-id/BF9B38A4-2BFF-46E8-BA87-A2D00A8047A6%40hintbits.com

--
Regards,
Alexander Korotkov


0001-Stopevents-v1.patch
Description: Binary data


Re: Parallel plans and "union all" subquery

2020-11-25 Thread Greg Nancarrow
On Wed, Nov 25, 2020 at 6:43 PM Luc Vlaming  wrote:
>
>
> You're completely right, sorry for my error. I was too quick on assuming
> my patch would work for this specific case too; I should have tested
> that before replying. It looked very similar but turns out to not work
> because of the upper rel not being considered parallel.
>
> I would like to extend my patch to support this, or create a second
> patch. This would however be significantly more involved because it
> would require that we (always?) consider two paths whenever we process a
> subquery: the best parallel plan and the best serial plan. Before I
> emback on such a journey I would like some input on whether this would
> be a very bad idea. Thoughts?
>

Hi,

I must admit, your intended approach isn't what immediately came to mind
when I saw this issue. Have you analyzed and debugged this to know exactly
what is going on?
I haven't had time to debug this and see exactly where the code paths
diverge for the use of "values(1)" versus "values(1::numeric)" in this
case, but that would be one of the first steps.

What I wondered (and I may well be wrong) was how come the documented type
resolution algorithm (
https://www.postgresql.org/docs/13/typeconv-union-case.html) doesn't seem
to be working quite right here, at least to the point of creating the
same/similar parse tree as when I change "values(1)" to
"values(1::numeric)" or even just "values(1.)"? So shouldn't then  the use
of "values(1)" in this case (a constant, convertible to numeric - the
preferred type ) result in the same (parallel) plan as when
"values(1::numeric)" is used? Perhaps this isn't happening because the code
is treating these as generalised expressions when their types aren't the
same, and this then affects parsing/planning?
My natural thought was that there seems to be a minor issue in the code,
which should be reasonably easy to fix, at least for this fairly simple
case.

However, I claim no expertise in the area of parser/analyzer/planner, I
only know certain areas of that code, but enough to appreciate it is
complex and intricate, and easily broken.
Perhaps one of the major contributors to this area of the code, who
probably know this code very well, like maybe Tom Lane or Robert Haas (to
name two) might like to comment on whether what we're looking at is indeed
a bug/deficiency and worth fixing, and whether Luc is correct in his
expressed approach on what would be required to fix it?

Regards,
Greg Nancarrow
Fujitsu Australia


Re: Online checksums patch - once again

2020-11-25 Thread Heikki Linnakangas

On 25/11/2020 15:20, Daniel Gustafsson wrote:

On 23 Nov 2020, at 18:36, Heikki Linnakangas  wrote:
What happens is if you crash between UpdateControlFile() and XlogChecksum()?


Good point, that would not get the cluster to a consistent state.  The
XlogChecksum should be performed before controlfile is udpated.

+void
+SetDataChecksumsOff(void)
+{
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+    if (ControlFile->data_checksum_version == 0)
+    {
+        LWLockRelease(ControlFileLock);
+        return;
+    }
+
+    if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
+    {
+        ControlFile->data_checksum_version = PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION;
+        UpdateControlFile();
+        LWLockRelease(ControlFileLock);
+
+        /*
+         * Update local state in all backends to ensure that any backend in
+         * "on" state is changed to "inprogress-off".
+         */
+        XlogChecksums(PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
+        WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_INPROGRESS_OFF));
+
+        /*
+         * At this point we know that no backends are verifying data checksums
+         * during reading. Next, we can safely move to state "off" to also
+         * stop writing checksums.
+         */
+    }
+
+    XlogChecksums(0);
+
+    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+    ControlFile->data_checksum_version = 0;
+    UpdateControlFile();
+    LWLockRelease(ControlFileLock);
+
+    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_CHECKSUM_OFF));
+}


The lwlocking doesn't look right here. If 
ControlFile->data_checksum_version != PG_DATA_CHECKSUM_VERSION, 
LWLockAcquire is called twice without a LWLockRelease in between.


What if a checkpoint, and a crash, happens just after the WAL record has 
been written, but before the control file is updated? That's a 
ridiculously tight window for a whole checkpoint cycle to happen, but in 
principle I think that would spell trouble. I think you could set 
delayChkpt to prevent the checkpoint from happening in that window, 
similar to how we avoid this problem with the clog updates at commit. 
Also, I think this should be in a critical section; we don't want the 
process to error out in between for any reason, and if it does happen, 
it's panic time.
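
Something along these lines, perhaps (a rough, untested sketch only;
XlogChecksums() and the control file fields are from the patch, while the
delayChkpt/critical-section part mirrors what transaction commit does):

MyProc->delayChkpt = true;
START_CRIT_SECTION();

XlogChecksums(0);                       /* WAL record first ... */

LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
ControlFile->data_checksum_version = 0; /* ... then the control file */
UpdateControlFile();
LWLockRelease(ControlFileLock);

END_CRIT_SECTION();
MyProc->delayChkpt = false;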


- Heikki




Re: [PATCH] LWLock self-deadlock detection

2020-11-25 Thread Ashutosh Bapat
On Wed, Nov 25, 2020 at 11:47 AM Craig Ringer
 wrote:
>> I am also seeing a pattern
>> Assert(!LWLockHeldByMe());
>> LWLockAcquire()
>>
>> at some places. Should we change LWLockAcquire to do
>> Assert(!LWLockHeldByMe()) always to detect such occurrences?
>
>
> I'm inclined not to, at least not without benchmarking it, because that'd do 
> the check before we attempt the fast-path. cassert builds are still supposed 
> to perform decently and be suitable for day to day development and I'd rather 
> not risk a slowdown.
>
> I'd prefer to make the lock self deadlock check run for production builds, 
> not just cassert builds. It'd print a normal LOG (with backtrace if 
> supported) then Assert(). So on an assert build we'd get a crash and core, 
> and on a non-assert build we'd carry on and self-deadlock anyway.
>
> That's probably the safest thing to do. We can't expect to safely ERROR out 
> of the middle of an LWLockAcquire(), not without introducing a new and really 
> hard to test code path that'll also be surprising to callers. We probably 
> don't want to PANIC the whole server for non-assert builds since it might 
> enter a panic-loop. So it's probably better to self-deadlock. We could HINT 
> that a -m immediate shutdown will be necessary to recover though.

I agree that it will be helpful to print something in the logs
indicating the reason for this hang in case the hang happens in a
production build. In your patch you have used ereport(PANIC, ) which
may simply be replaced by an Assert() in an assert enabled build. We
already have Assert(!LWLockHeldByMe()) so that should be safe. It will
be good to have -m immediate hint in LOG message. But it might just be
better to kill -9 that process to get rid of it. That will cause the
server to restart and not just shut down.
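
In other words, roughly this at the top of LWLockAcquire() (a sketch of the
behavior being discussed, not the actual patch):

if (LWLockHeldByMe(lock))
{
    elog(LOG, "backend is trying to acquire an lwlock it already holds; "
              "an immediate-mode shutdown may be needed to recover");
    Assert(false);      /* crash and dump core in assert-enabled builds */
    /* non-assert builds fall through and self-deadlock as before */
}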

-- 
Best Wishes,
Ashutosh Bapat




Re: Add Information during standby recovery conflicts

2020-11-25 Thread Masahiko Sawada
On Fri, Nov 20, 2020 at 6:18 PM Drouvot, Bertrand  wrote:
>
> Hi,
>
> On 11/17/20 4:44 PM, Fujii Masao wrote:
> >
> > Thanks for updating the patch! Here are review comments.
> >
> > +Controls whether a log message is produced when the startup
> > process
> > +is waiting longer than deadlock_timeout
> > +for recovery conflicts.
> >
> > But a log message can be produced also when the backend is waiting
> > for recovery conflict. Right? If yes, this description needs to be
> > corrected.
>
> Thanks for looking at it!
>
> I don't think so, only the startup process should write those new log
> messages.
>
> What makes you think that would not be the case?
>
> >
> >
> > +for recovery conflicts.  This is useful in determining if
> > recovery
> > +conflicts prevents the recovery from applying WAL.
> >
> > "prevents" should be "prevent"?
>
> Indeed: fixed in the new attached patch.
>
> >
> >
> > +   TimestampDifference(waitStart, GetCurrentTimestamp(), &secs, &usecs);
> > +   msecs = secs * 1000 + usecs / 1000;
> >
> > GetCurrentTimestamp() is basically called before LogRecoveryConflict()
> > is called. So isn't it better to avoid calling GetCurrentTimestamp()
> > newly in
> > LogRecoveryConflict() and to reuse the timestamp that we got?
> > It's helpful to avoid the waste of cycles.
> >
> good catch! fixed in the new attached patch.
>
> >
> > +   while (VirtualTransactionIdIsValid(*vxids))
> > +   {
> > +   PGPROC *proc =
> > BackendIdGetProc(vxids->backendId);
> >
> > BackendIdGetProc() can return NULL if the backend is not active
> > at that moment. This case should be handled.
> >
> handled in the new attached patch.
> >
> > +   case PROCSIG_RECOVERY_CONFLICT_BUFFERPIN:
> > +   reasonDesc = gettext_noop("recovery is still
> > waiting recovery conflict on buffer pin");
> >
> > It's natural to use "waiting for recovery" rather than "waiting
> > recovery"?
> >
> I would be tempted to say so, the new patch makes use of "waiting for".
> >
> > +   /* Also, set deadlock timeout for logging purpose if
> > necessary */
> > +   if (log_recovery_conflict_waits)
> > +   {
> > +   timeouts[cnt].id = STANDBY_TIMEOUT;
> > +   timeouts[cnt].type = TMPARAM_AFTER;
> > +   timeouts[cnt].delay_ms = DeadlockTimeout;
> > +   cnt++;
> > +   }
> >
> > This needs to be executed only when the message has not been logged yet.
> > Right?
> >
> good catch: fixed in the new attached patch.
>

Thank you for updating the patch! Here are review comments on the
latest version patch.

+   while (VirtualTransactionIdIsValid(*vxids))
+   {
+   PGPROC *proc = BackendIdGetProc(vxids->backendId);
+
+   if (proc)
+   {
+   if (nprocs == 0)
+   appendStringInfo(, "%d", proc->pid);
+   else
+   appendStringInfo(, ", %d", proc->pid);
+
+   nprocs++;
+   vxids++;
+   }
+   }

We need to increment vxids even if *proc is null. Otherwise, the loop won't end.
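
Something like this, I suppose (just a sketch; "buf" stands in for whatever
StringInfo the patch actually uses):

while (VirtualTransactionIdIsValid(*vxids))
{
    PGPROC     *proc = BackendIdGetProc(vxids->backendId);

    if (proc)
    {
        if (nprocs == 0)
            appendStringInfo(&buf, "%d", proc->pid);
        else
            appendStringInfo(&buf, ", %d", proc->pid);
        nprocs++;
    }
    vxids++;        /* advance even when the backend is no longer active */
}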

---
+   TimestampTz cur_ts = GetCurrentTimestamp();;

There is an extra semi-colon.

---
 intmax_standby_streaming_delay = 30 * 1000;
+bool   log_recovery_conflict_waits = false;
+bool   logged_lock_conflict = false;


+   if (log_recovery_conflict_waits && !logged_lock_conflict)
+   {
+   timeouts[cnt].id = STANDBY_TIMEOUT;
+   timeouts[cnt].type = TMPARAM_AFTER;
+   timeouts[cnt].delay_ms = DeadlockTimeout;
+   cnt++;
+   }

Can we pass a bool indicating if a timeout may be needed for recovery
conflict logging from ProcSleep() to ResolveRecoveryConflictWithLock()
instead of using a static variable?

Regards,

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




Re: Implementing Incremental View Maintenance

2020-11-25 Thread Yugo NAGATA
On Wed, 25 Nov 2020 15:16:05 +0300
Konstantin Knizhnik  wrote:

> 
> 
> On 24.11.2020 13:11, Yugo NAGATA wrote:
> >
> >> I wonder if it is possible to somehow use predicate locking mechanism of
> >> Postgres to avoid this anomalies without global lock?
> > You mean that, ,instead of using any table lock, if any possibility of the
> > anomaly is detected using predlock mechanism then abort the transaction?
> 
> Yes. If both transactions are using serializable isolation level, then 
> lock is not needed, isn't it?
> So at least you can add yet another simple optimization: if transaction 
> has serializable isolation level,
> then exclusive lock is not required.

As long as we use the trigger approach, we can't handle concurrent view 
maintenance
in either repeatable read or serializable isolation level.  It is because one
transaction (R= R+dR) cannot see changes that occurred in another transaction (S'= S+dS)
in such cases, and we cannot get the incremental change on the view (dV=dR*dS). 
Therefore, in the current implementation, the transaction is aborted when the
concurrent view maintenance happens in repeatable read or serializable.
 
> But I wonder if we can go further so that even if transaction is using 
> read-committed or repeatable-read isolation level,
> we still can replace exclusive table lock with predicate locks.
> 
> The main problem with this approach (from my point of view) is the 
> predicate locks are able to detect conflict but not able to prevent it.
> I.e. if such conflict is detected then transaction has to be aborted.
> And it is not always desirable, especially because user doesn't expect 
> it: how can insertion of single record with unique keys in a table cause 
> transaction conflict?
> And this is what will happen in your example with transactions T1 and T2 
> inserting records in R and S tables.

Yes. I suspect that either aborting the transaction or waiting on locks is
unavoidable when a view is incrementally updated concurrently (at least in
immediate maintenance, where the view is updated in the same transaction that
updates the base table).
 
> And what do you think about background update of materialized view?
> On update/insert trigger will just add record to some "delta" table and 
> then some background worker will update view.
> Certainly in this case we lose synchronization between main table and 
> materialized view (last one may contain slightly deteriorated data).
> But in this case no exclusive lock is needed, isn't it?

Of course, we are considering this type of view maintenance. This is
deferred maintenance, where a view is updated after the transaction
that updates the base tables is committed. Views can be updated in the
background at an appropriate time or in response to a user command.

To implement this, we need a mechanism to maintain change logs which
record the changes of the base tables. We think that implementing this
infrastructure is not trivial work, so, in the first patch proposal, we
decided to start from the immediate approach, which needs less code.
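
Just to illustrate the direction (all object names below are invented and only
INSERT is handled; this is not the proposed design):

CREATE TABLE t_changelog (pk integer, val integer);

CREATE FUNCTION log_change() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO t_changelog VALUES (NEW.pk, NEW.val);
    RETURN NEW;
END;
$$;

CREATE TRIGGER t_log AFTER INSERT ON t
    FOR EACH ROW EXECUTE FUNCTION log_change();

-- later, in background or on a user command, fold the accumulated changes
-- into the view and consume them in one statement:
WITH applied AS (
    DELETE FROM t_changelog RETURNING pk, val
)
UPDATE mat_view m
   SET total = m.total + d.delta
  FROM (SELECT pk AS gby_key, sum(val) AS delta
          FROM applied
         GROUP BY pk) d
 WHERE m.gby_key = d.gby_key;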

-- 
Yugo NAGATA 




Re: SEARCH and CYCLE clauses

2020-11-25 Thread Peter Eisentraut

On 2020-10-10 07:25, Pavel Stehule wrote:
This patch is based on transformation CYCLE and SEARCH clauses to 
specific expressions - it is in agreement with ANSI SQL


There is not a problem with compilation
Nobody had objections in discussion
There are enough regress tests and documentation
check-world passed
doc build passed

I'll mark this patch as ready for committer

Possible enhancing for this feature (can be done in next steps)

1. support UNION DISTINCT
2. better compatibility with Oracle and DB2 (USING clause can be optional)


Here is an updated patch.  New since last time:

- UNION DISTINCT is now supported (since hash_record() was added)

- Some code has been cleaned up.

- Some code has been moved from the rewriter to the parser so that 
certain errors are properly detected at parse time.


- Added more syntax checks and more tests.

- Support for dependency tracking was added (the type and operator for 
the cycle mark need to be added as dependencies).


I found a bug that nested UNIONs (foo UNION bar UNION baz) were not 
handled (would crash) in the rewriter code.  For now, I have just 
changed that to error out.  This could be fixed, it would be a localized 
change in the rewriter code in any case.  Doesn't seem important for the 
first pass, though.


--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/
From e69e5cc77a226d646ad0fcd7bc2dd73be9b1abda Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 25 Nov 2020 14:03:45 +0100
Subject: [PATCH v4] SEARCH and CYCLE clauses

Discussion: 
https://www.postgresql.org/message-id/flat/db80ceee-6f97-9b4a-8ee8-3ba0c58e5...@2ndquadrant.com
---
 doc/src/sgml/queries.sgml|  64 ++-
 doc/src/sgml/ref/select.sgml |  44 ++
 src/backend/catalog/dependency.c |  15 +
 src/backend/nodes/copyfuncs.c|  40 ++
 src/backend/nodes/equalfuncs.c   |  36 ++
 src/backend/nodes/nodeFuncs.c|  42 +-
 src/backend/nodes/outfuncs.c |  36 ++
 src/backend/nodes/readfuncs.c|  44 ++
 src/backend/parser/analyze.c |  47 +-
 src/backend/parser/gram.y|  58 +-
 src/backend/parser/parse_agg.c   |   7 +
 src/backend/parser/parse_cte.c   | 193 +++
 src/backend/parser/parse_expr.c  |   4 +
 src/backend/parser/parse_func.c  |   3 +
 src/backend/parser/parse_relation.c  |  54 +-
 src/backend/parser/parse_target.c|  17 +-
 src/backend/rewrite/Makefile |   1 +
 src/backend/rewrite/rewriteHandler.c |  18 +
 src/backend/rewrite/rewriteSearchCycle.c | 668 +++
 src/backend/utils/adt/ruleutils.c|  47 ++
 src/include/nodes/nodes.h|   2 +
 src/include/nodes/parsenodes.h   |  30 +-
 src/include/parser/analyze.h |   2 +
 src/include/parser/kwlist.h  |   2 +
 src/include/parser/parse_node.h  |   2 +
 src/include/rewrite/rewriteSearchCycle.h |  21 +
 src/test/regress/expected/with.out   | 558 +++
 src/test/regress/sql/with.sql| 279 ++
 28 files changed, 2301 insertions(+), 33 deletions(-)
 create mode 100644 src/backend/rewrite/rewriteSearchCycle.c
 create mode 100644 src/include/rewrite/rewriteSearchCycle.h

diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml
index ca51204875..4741506eb5 100644
--- a/doc/src/sgml/queries.sgml
+++ b/doc/src/sgml/queries.sgml
@@ -2218,6 +2218,39 @@ Search Order
  in any case.
 

+
+   
+There is built-in syntax to compute a depth- or breadth-first sort column.
+For example:
+
+
+WITH RECURSIVE search_tree(id, link, data) AS (
+SELECT t.id, t.link, t.data
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data
+FROM tree t, search_tree st
+WHERE t.id = st.link
+) SEARCH DEPTH FIRST BY id SET ordercol
+SELECT * FROM search_tree ORDER BY ordercol;
+
+WITH RECURSIVE search_tree(id, link, data) AS (
+SELECT t.id, t.link, t.data
+FROM tree t
+  UNION ALL
+SELECT t.id, t.link, t.data
+FROM tree t, search_tree st
+WHERE t.id = st.link
+) SEARCH BREADTH FIRST BY id SET ordercol
+SELECT * FROM search_tree ORDER BY ordercol;
+
+This syntax is internally expanded to something similar to the above
+hand-written forms.  The SEARCH clause specifies whether
+depth- or breadth first search is wanted, the list of columns to track for
+sorting, and a column name that will contain the result data that can be
+used for sorting.  That column will implicitly be added to the output rows
+of the CTE.
+   
   
 
   
@@ -2305,10 +2338,39 @@ Cycle Detection

   
 
+  
+   There is built-in syntax to simplify cycle detection.  The above query can
+   also be written like this:
+
+WITH RECURSIVE search_graph(id, link, data, depth) AS (
+SELECT g.id, g.link, g.data, 1
+FROM graph g
+  UNION ALL
+SELECT g.id, g.link, g.data, sg.depth + 1
+FROM graph g, 

Re: Deleting older versions in unique indexes to avoid page splits

2020-11-25 Thread Victor Yegorov
ср, 25 нояб. 2020 г. в 05:35, Peter Geoghegan :

> Then I had a much better idea: Make the existing LP_DEAD stuff a
> little more like bottom-up index deletion. We usually have to access
> heap blocks that the index tuples point to today, in order to have a
> latestRemovedXid cutoff (to generate recovery conflicts). It's worth
> scanning the leaf page for index tuples with TIDs whose heap block
> matches the index tuples that actually have their LP_DEAD bits set.
> This only consumes a few more CPU cycles. We don't have to access any
> more heap blocks to try these extra TIDs, so it seems like a very good
> idea to try them out.
>

I don't seem to understand this.

Is it: we're scanning the leaf page for all LP_DEAD tuples that point to the same
heap block? Which heap block are we talking about here: the one that holds the
entry we're about to add (the one that triggered bottom-up deletion due to lack
of space, I mean)?

I ran the regression tests with an enhanced version of the patch, with
> this LP_DEAD-deletion-with-extra-TIDs thing. It also had custom
> instrumentation that showed exactly what happens in each case. We
> manage to delete at least a small number of extra index tuples in
> almost all cases -- so we get some benefit in practically all cases.
> And in the majority of cases we can delete significantly more. It's
> not uncommon to increase the number of index tuples deleted. It could
> go from 1 - 10 or so without the enhancement to LP_DEAD deletion, to
> 50 - 250 with the LP_DEAD enhancement. Some individual LP_DEAD
> deletion calls can free more than 50% of the space on the leaf page.
>

I am missing a general perspective here.

Is it true that, despite the long (vacuum-preventing) transaction, we can
re-use space, because after the DELETE statement commits, IndexScans set
LP_DEAD hints once they check the state of the corresponding heap tuple?

If my thinking is correct for both cases (the nature of LP_DEAD hint bits and
the mechanics of the suggested optimization), then I consider this a very
promising improvement!

I haven't done any testing so far since sending my last e-mail.
If you have a chance to send a new v10 version with the
LP_DEAD-deletion-with-extra-TIDs thing,
I will do some tests (as planned).

-- 
Victor Yegorov


Re: Implementing Incremental View Maintenance

2020-11-25 Thread Konstantin Knizhnik




On 24.11.2020 13:11, Yugo NAGATA wrote:



I wonder if it is possible to somehow use predicate locking mechanism of
Postgres to avoid this anomalies without global lock?

You mean that, ,instead of using any table lock, if any possibility of the
anomaly is detected using predlock mechanism then abort the transaction?


Yes. If both transactions are using serializable isolation level, then 
lock is not needed, isn't it?
So at least you can add yet another simple optimization: if transaction 
has serializable isolation level,

then exclusive lock is not required.

But I wonder if we can go further so that even if transaction is using 
read-committed or repeatable-read isolation level,

we still can replace exclusive table lock with predicate locks.

The main problem with this approach (from my point of view) is the 
predicate locks are able to detect conflict but not able to prevent it.

I.e. if such conflict is detected then transaction has to be aborted.
And it is not always desirable, especially because user doesn't expect 
it: how can insertion of single record with unique keys in a table cause 
transaction conflict?
And this is what will happen in your example with transactions T1 and T2 
inserting records in R and S tables.


And what do you think about background update of the materialized view?
An update/insert trigger would just add a record to some "delta" table and 
then some background worker would update the view.
Certainly in this case we lose synchronization between the main table and the 
materialized view (the latter may contain slightly deteriorated data).

But in this case no exclusive lock is needed, isn't it?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [patch] CLUSTER blocks scanned progress reporting

2020-11-25 Thread Matthias van de Meent
On Wed, 25 Nov 2020 at 10:35, Fujii Masao  wrote:
>
> On 2020/11/25 0:25, Matthias van de Meent wrote:
> >
> > I noticed that with my proposed patch it is still possible to go to
> > the next phase while heap_blks_scanned != heap_blks_total. This can
> > happen when the final heap pages contain only dead tuples, so no tuple
> > is returned from the last heap page(s) of the scan. As the
> > heapScan->rs_cblock is set to InvalidBlockNumber when the scan is
> > finished (see heapam.c#1060-1072), I think it would be correct to set
> > heap_blks_scanned to heapScan->rs_nblocks at the end of the scan
> > instead.
>
> Thanks for updating the patch!
>
> Please let me check my understanding about this. I was thinking that even
> when the last page contains only dead tuples, table_scan_getnextslot()
> returns the last page (because SnapshotAny is used) and heap_blks_scanned
> is incremented properly. And then, heapam_relation_copy_for_cluster()
> handles all the dead tuples in that page. No?

Yes, my description was incorrect.

The potential for a discrepancy exists for seemingly empty pages. I
thought that 'filled with only dead tuples' would be correct for
'seemingly empty', but SnapshotAny indeed returns all tuples on a
page. But pages can still be empty with SnapshotAny, through being
emptied by vacuum, so the last pages scanned can still be empty pages.
Then, there would be no tuple returned at the last pages of the scan,
and subsequently the CLUSTER/VACUUM FULL would start the next phase
without reporting on the last pages that were scanned and had no
tuples in them (such that heap_blks_scanned != heap_blks_total).

Vacuum truncates empty blocks from the end of the relation, and this
prevents empty blocks at the end of CLUSTER for the default case of
table scans starting at 0; but because the table scan might not start
at block 0, we can have an empty page at the end of the table scan due
to the last page of the scan not being the last page of the table.
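
For what it's worth, the fix I proposed upthread would then boil down to
reporting the full block count once the scan returns no more tuples, roughly
(field names as used earlier in this thread, not checked against the actual
code):

pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_BLKS_SCANNED,
                             heapScan->rs_nblocks);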

Matthias van de Meent




Re: autovac issue with large number of tables

2020-11-25 Thread Masahiko Sawada
On Wed, Nov 25, 2020 at 4:18 PM Kasahara Tatsuhito
 wrote:
>
> Hi,
>
> On Wed, Nov 25, 2020 at 2:17 PM Masahiko Sawada  wrote:
> >
> > On Fri, Sep 4, 2020 at 7:50 PM Kasahara Tatsuhito
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Sep 2, 2020 at 2:10 AM Kasahara Tatsuhito
> > >  wrote:
> > > > > I wonder if we could have table_recheck_autovac do two probes of the 
> > > > > stats
> > > > > data.  First probe the existing stats data, and if it shows the table 
> > > > > to
> > > > > be already vacuumed, return immediately.  If not, *then* force a stats
> > > > > re-read, and check a second time.
> > > > Does the above mean that the second and subsequent 
> > > > table_recheck_autovac()
> > > > will be improved to first check using the previous refreshed statistics?
> > > > I think that certainly works.
> > > >
> > > > If that's correct, I'll try to create a patch for the PoC
> > >
> > > I still don't know how to reproduce Jim's troubles, but I was able to 
> > > reproduce
> > > what was probably a very similar problem.
> > >
> > > This problem seems to be more likely to occur in cases where you have
> > > a large number of tables,
> > > i.e., a large amount of stats, and many small tables need VACUUM at
> > > the same time.
> > >
> > > So I followed Tom's advice and created a patch for the PoC.
> > > This patch will enable a flag in the table_recheck_autovac function to use
> > > the existing stats next time if VACUUM (or ANALYZE) has already been done
> > > by another worker on the check after the stats have been updated.
> > > If the tables continue to require VACUUM after the refresh, then a refresh
> > > will be required instead of using the existing statistics.
> > >
> > > I did simple test with HEAD and HEAD + this PoC patch.
> > > The tests were conducted in two cases.
> > > (I changed few configurations. see attached scripts)
> > >
> > > 1. Normal VACUUM case
> > >   - SET autovacuum = off
> > >   - CREATE tables with 100 rows
> > >   - DELETE 90 rows for each tables
> > >   - SET autovacuum = on and restart PostgreSQL
> > >   - Measure the time it takes for all tables to be VACUUMed
> > >
> > > 2. Anti wrap round VACUUM case
> > >   - CREATE brank tables
> > >   - SELECT all of these tables (for generate stats)
> > >   - SET autovacuum_freeze_max_age to low values and restart PostgreSQL
> > >   - Consumes a lot of XIDs by using txid_curent()
> > >   - Measure the time it takes for all tables to be VACUUMed
> > >
> > > For each test case, the following results were obtained by changing
> > > autovacuum_max_workers parameters to 1, 2, 3(def) 5 and 10.
> > > Also changing num of tables to 1000, 5000, 1 and 2.
> > >
> > > Due to the poor VM environment (2 VCPU/4 GB), the results are a little 
> > > unstable,
> > > but I think it's enough to ask for a trend.
> > >
> > > ===
> > > [1.Normal VACUUM case]
> > >  tables:1000
> > >   autovacuum_max_workers 1:   (HEAD) 20 sec VS (with patch)  20 sec
> > >   autovacuum_max_workers 2:   (HEAD) 18 sec VS (with patch)  16 sec
> > >   autovacuum_max_workers 3:   (HEAD) 18 sec VS (with patch)  16 sec
> > >   autovacuum_max_workers 5:   (HEAD) 19 sec VS (with patch)  17 sec
> > >   autovacuum_max_workers 10:  (HEAD) 19 sec VS (with patch)  17 sec
> > >
> > >  tables:5000
> > >   autovacuum_max_workers 1:   (HEAD) 77 sec VS (with patch)  78 sec
> > >   autovacuum_max_workers 2:   (HEAD) 61 sec VS (with patch)  43 sec
> > >   autovacuum_max_workers 3:   (HEAD) 38 sec VS (with patch)  38 sec
> > >   autovacuum_max_workers 5:   (HEAD) 45 sec VS (with patch)  37 sec
> > >   autovacuum_max_workers 10:  (HEAD) 43 sec VS (with patch)  35 sec
> > >
> > >  tables:10000
> > >   autovacuum_max_workers 1:   (HEAD) 152 sec VS (with patch)  153 sec
> > >   autovacuum_max_workers 2:   (HEAD) 119 sec VS (with patch)   98 sec
> > >   autovacuum_max_workers 3:   (HEAD)  87 sec VS (with patch)   78 sec
> > >   autovacuum_max_workers 5:   (HEAD) 100 sec VS (with patch)   66 sec
> > >   autovacuum_max_workers 10:  (HEAD)  97 sec VS (with patch)   56 sec
> > >
> > >  tables:20000
> > >   autovacuum_max_workers 1:   (HEAD) 338 sec VS (with patch)  339 sec
> > >   autovacuum_max_workers 2:   (HEAD) 231 sec VS (with patch)  229 sec
> > >   autovacuum_max_workers 3:   (HEAD) 220 sec VS (with patch)  191 sec
> > >   autovacuum_max_workers 5:   (HEAD) 234 sec VS (with patch)  147 sec
> > >   autovacuum_max_workers 10:  (HEAD) 320 sec VS (with patch)  113 sec
> > >
> > > [2.Anti wrap round VACUUM case]
> > >  tables:1000
> > >   autovacuum_max_workers 1:   (HEAD) 19 sec VS (with patch) 18 sec
> > >   autovacuum_max_workers 2:   (HEAD) 14 sec VS (with patch) 15 sec
> > >   autovacuum_max_workers 3:   (HEAD) 14 sec VS (with patch) 14 sec
> > >   autovacuum_max_workers 5:   (HEAD) 14 sec VS (with patch) 16 sec
> > >   autovacuum_max_workers 10:  (HEAD) 16 sec VS (with patch) 14 sec
> > >
> > >  tables:5000
> > >   

Re: Add statistics to pg_stat_wal view for wal related parameter tuning

2020-11-25 Thread Fujii Masao




On 2020/11/19 16:31, Masahiro Ikeda wrote:

On 2020-11-17 11:46, Fujii Masao wrote:

On 2020/11/16 16:35, Masahiro Ikeda wrote:

On 2020-11-12 14:58, Fujii Masao wrote:

On 2020/11/06 10:25, Masahiro Ikeda wrote:

On 2020-10-30 11:50, Fujii Masao wrote:

On 2020/10/29 17:03, Masahiro Ikeda wrote:

Hi,

Thanks for your comments and advice. I updated the patch.

On 2020-10-21 18:03, Kyotaro Horiguchi wrote:

At Tue, 20 Oct 2020 16:11:29 +0900, Masahiro Ikeda
 wrote in

On 2020-10-20 12:46, Amit Kapila wrote:
> I see that we also need to add extra code to capture these stats (some
> of which is in performance-critical path especially in
> XLogInsertRecord) which again makes me a bit uncomfortable. It might
> be that it is all fine as it is very important to collect these stats
> at cluster-level in spite that the same information can be gathered at
> statement-level to help customers but I don't see a very strong case
> for that in your proposal.


We should avoid that duplication as much as possible, even if both numbers
are important.


Also, regarding performance, I think the impact is small because the stats
are only incremented in memory. If I can reuse pgWalUsage, which already
collects these stats, there is no impact in XLogInsertRecord at all.
For example, how about having pg_stat_wal() compute the accumulated values
of wal_records, wal_fpi, and wal_bytes from pgWalUsage?


I don't think that works as is, but it would work if pgstat_send_wal()
took the difference of those values between two successive calls.

WalUsage prevWalUsage;
...
pgstat_send_wal()
{
..
   /* fill in some values using pgWalUsage */
   WalStats.m_wal_bytes   = pgWalUsage.wal_bytes   - prevWalUsage.wal_bytes;
   WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
   WalStats.m_wal_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;
...
   pgstat_send(&WalStats, sizeof(WalStats));

   /* remember the current numbers */
   prevWalUsage = pgWalUsage;


Thanks to Horiguchi-san's advice, I changed the patch to reuse pgWalUsage,
which is already defined, and eliminated the extra overhead.


+    /* fill in some values using pgWalUsage */
+    WalStats.m_wal_bytes = pgWalUsage.wal_bytes - prevWalUsage.wal_bytes;
+    WalStats.m_wal_records = pgWalUsage.wal_records - prevWalUsage.wal_records;
+    WalStats.m_wal_fpi = pgWalUsage.wal_fpi - prevWalUsage.wal_fpi;

It's better to use WalUsageAccumDiff() here?
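
For example, something like this (an untested sketch, assuming prevWalUsage
holds the values from the previous call as in the earlier snippet):

    WalUsage    walusage;

    /* walusage = pgWalUsage - prevWalUsage */
    memset(&walusage, 0, sizeof(walusage));
    WalUsageAccumDiff(&walusage, &pgWalUsage, &prevWalUsage);

    WalStats.m_wal_bytes   = walusage.wal_bytes;
    WalStats.m_wal_records = walusage.wal_records;
    WalStats.m_wal_fpi     = walusage.wal_fpi;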


Yes, thanks. I fixed it.


prevWalUsage needs to be initialized with pgWalUsage?
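
For example, just as a sketch (the exact place is up to the patch, but
pgstat_initialize() would be a natural spot):

    /* at process startup, before any WAL stats are reported */
    prevWalUsage = pgWalUsage;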

+    if (AmWalWriterProcess()){
+    WalStats.m_wal_write_walwriter++;
+    }
+    else
+    {
+    WalStats.m_wal_write_backend++;
+    }

I think that it's better not to separate m_wal_write_xxx into two for
walwriter and other processes. Instead, we can use one m_wal_write_xxx
counter and make pgstat_send_wal() also send the process type to
the stats collector. Then the stats collector can accumulate the counters
per process type if necessary. If we adopt this approach, we can easily
extend pg_stat_wal so that any fields can be reported per process type.
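
One possible shape of that, purely as a hypothetical sketch (the field and
array names here are made up for illustration):

    /* message sent by pgstat_send_wal(): include the sender's process type */
    typedef struct PgStat_MsgWal
    {
        PgStat_MsgHdr   m_hdr;
        BackendType     m_backend_type; /* e.g. B_WAL_WRITER, B_BACKEND, ... */
        PgStat_Counter  m_wal_write;
        PgStat_Counter  m_wal_sync;
    } PgStat_MsgWal;

    /* in the stats collector, accumulate per process type if necessary */
    walStats[msg->m_backend_type].wal_write += msg->m_wal_write;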


I'll remove the above source code because these counters are not useful.


On 2020-10-30 12:00, Fujii Masao wrote:

On 2020/10/20 11:31, Masahiro Ikeda wrote:

Hi,

I think we need to add some statistics to pg_stat_wal view.

Although there are several WAL-related parameters,
there are few statistics to help tune them.

I think it's better to provide the following statistics.
Please let me know your comments.

```
postgres=# SELECT * from pg_stat_wal;
-[ RECORD 1 ]---+--
wal_records | 2000224
wal_fpi | 47
wal_bytes   | 248216337
wal_buffers_full    | 20954
wal_init_file   | 8
wal_write_backend   | 20960
wal_write_walwriter | 46
wal_write_time  | 51
wal_sync_backend    | 7
wal_sync_walwriter  | 8
wal_sync_time   | 0
stats_reset | 2020-10-20 11:04:51.307771+09
```

1. Basic statistics of WAL activity

- wal_records: Total number of WAL records generated
- wal_fpi: Total number of WAL full page images generated
- wal_bytes: Total amount of WAL bytes generated

To understand a database's performance, we first check the performance
trends for the entire instance.
For example, if the number of wal_fpi becomes high, users may tune
"wal_compression", "checkpoint_timeout" and so on.

Although users can already check the above statistics via EXPLAIN, auto_explain,
autovacuum and pg_stat_statements,
if they want to see the performance trends for the entire database,
they must aggregate the statistics themselves.

I think it is useful to add the sum of the basic statistics.


2.  WAL segment file creation

- wal_init_file: Total number of WAL segment files created.

Creating a new WAL file may have an impact on the performance of
a write-heavy workload generating lots of WAL. If this number is reported high,
to reduce the number of this 

Re: A few new options for CHECKPOINT

2020-11-25 Thread Bernd Helmle
Am Mittwoch, den 25.11.2020, 13:47 +0900 schrieb Michael Paquier:
> I can see the use case for IMMEDIATE, but I fail to see the use cases
> for WAIT and FORCE.  CHECKPOINT_FORCE is internally implied for the
> end-of-recovery and shutdown checkpoints.  WAIT could be a dangerous
> thing if disabled, as clients could pile up requests to the
> checkpointer for no real purpose.

Wouldn't it be more convenient to use "FAST" for an immediate checkpoint,
defaulting to "FAST ON"? That would be in line with the parameter used in the
streaming replication protocol command BASE_BACKUP, where "FAST" disables lazy
checkpointing.

I agree that the other options don't seem reasonable for exposing to
SQL.


-- 
Thanks,
Bernd






Re: Is postgres ready for 2038?

2020-11-25 Thread 方徳輝
Hi dear pgsql hackers


Thanks for the replies.

There have been no 32-bit Windows builds since Postgres 11; see:

https://www.postgresql.org/download/windows/

but Postgres 13 still has the same 2038 problem.



As @Pavel Borisov hinted, I can find the `_USE_32BIT_TIME_T` code here:

https://github.com/postgres/postgres/search?q=_USE_32BIT_TIME_T


Is it a good idea to remove the `_USE_32BIT_TIME_T` code and build with a
64-bit Perl? Might that solve the 2038 problem?
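
For reference, here is a minimal standalone check of the time_t width a given
toolchain produces (plain C, nothing PostgreSQL-specific):

    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        time_t  t = (time_t) 2147483647;    /* 2038-01-19 03:14:07 UTC */

        /* A 32-bit time_t cannot represent anything after this moment. */
        printf("sizeof(time_t) = %zu bytes\n", sizeof(time_t));
        printf("largest 32-bit timestamp: %s", ctime(&t));
        return 0;
    }

If this prints 4 bytes, that build environment is one of those that will hit
the 2038 limit.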


Best regards,

Fang

On Thu, Nov 19, 2020 at 4:34 PM Tom Lane  wrote:

> Pavel Borisov  writes:
> > Thu, 19 Nov 2020 at 09:29, Greg Stark :
> >> Wait, is configuring with a Perl that has 32-bit time_t driving the
> >> rest of Postgres to use 32-bit timestamps? That seems like the tail
> >> wagging the dog.
> >> It seems like a sensible compromise would be to have Postgres's
> >> configure default to 64-bit time_t and have a flag to choose 32-bit
> >> time_t and then have a configure check that errors out if the time_t
> >> in Perl doesn't match with a hint to either find a newer Perl
> >> distribution or configure with the flag to choose 32-bit. Ie, don't
> >> silently assume users want 32-bit time_t but leave them the option to
> >> choose it explicitly.
>
> >  _USE_32BIT_TIME_T is available only on 32-bit platforms so the proposed
> > flag will not be able to force 32-bit time_t, only allow it. On 64-bit
> > platforms, we simply do not have a choice.
>
> > I suppose that some 10+ years from now the number of users willing to
> > compile on 32-bit with a dinosaur-aged Perl distribution will be nearly
> > zero. So I suppose we could just mention this as a fun fact in the
> > documentation.
>
> Yeah.  I can't get excited about putting additional effort, and
> user-visible complexity, into this issue.  The only way it could matter
> to people building Postgres today is if you suppose that the executables
> they are building today will still be in use in 2038.  That seems a bit
> hard to credit.  Ten years from now, that'd be a legitimate worry ...
> but it's really hard to believe that these toolchains will still be
> in use then.
>
> (I would not be too surprised if we've dropped support for 32-bit
> builds altogether by 2030.  Certainly, any platform that still
> has 32-bit time_t by then is going to be in a world of hurt.)
>
> regards, tom lane
>

