Re: Replace remaining StrNCpy() by strlcpy()

2020-08-07 Thread Peter Eisentraut

On 2020-08-05 17:49, Tom Lane wrote:

However I do see one remaining nit to pick, in CreateInitDecodingContext:
  
  	/* register output plugin name with slot */
  	SpinLockAcquire(&slot->mutex);
- 	StrNCpy(NameStr(slot->data.plugin), plugin, NAMEDATALEN);
+ 	namestrcpy(&slot->data.plugin, plugin);
  	SpinLockRelease(&slot->mutex);

This is already a pro-forma violation of our rule about "only
straight-line code inside a spinlock".  Now I'm not terribly concerned
about that right now, and the patch as it stands is only changing things
cosmetically.  But if you modify namestrcpy to do pg_mbcliplen then all
of a sudden there is a whole lot of code that could get reached within
the spinlock, and I'm not a bit happy about that prospect.


fixed


BTW, while we're here I think we ought to change namecpy and namestrcpy
to return void (no caller checks their results) and drop their checks
for null-pointer inputs.  AFAICS a null pointer would be a caller bug in
every case, and if it isn't, why is failing to initialize the
destination an OK outcome?  I find the provisions for nulls in namestrcmp
pretty useless too, although it looks like at least some thought has
been spent there.


fixed

I removed namecpy() altogether because you can just use struct assignment.
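For illustration, here is a minimal standalone sketch of both points; toy_namestrcpy() is only a stand-in for the simplified API (the real function lives in src/backend/utils/adt/name.c), and NAMEDATALEN is shown at its default of 64:

#include <stdio.h>
#include <string.h>

#define NAMEDATALEN 64

typedef struct NameData
{
    char    data[NAMEDATALEN];
} NameData;

/* stand-in for namestrcpy(): truncating copy, void return, no NULL checks */
static void
toy_namestrcpy(NameData *name, const char *str)
{
    strncpy(name->data, str, NAMEDATALEN - 1);
    name->data[NAMEDATALEN - 1] = '\0';
}

int
main(void)
{
    NameData    a,
                b;

    toy_namestrcpy(&a, "my_output_plugin");

    /* what namecpy() used to do is now plain struct assignment */
    b = a;

    printf("%s\n", b.data);
    return 0;
}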

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From b1265388a79269d2cf67cdfbe5a26a74f0f4a178 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 4 Aug 2020 15:16:37 +0200
Subject: [PATCH v3] Replace remaining StrNCpy() by strlcpy()

They are equivalent, and strlcpy() has become the preferred spelling.
Convert over the remaining cases.

Remove StrNCpy() as it is now unused.

In some cases, namestrcpy() is the more appropriate function to use.
While we're here, simplify the API of namestrcpy(): Remove the return
value, don't check for NULL input.  Nothing was using that anyway.
Also, remove a few unused name-related functions.

Discussion: 
https://www.postgresql.org/message-id/flat/44f5e198-36f6-6cdb-7fa9-60e34784daae%402ndquadrant.com
---
 contrib/pgcrypto/crypt-des.c  |  2 +-
 src/backend/access/transam/slru.c |  2 +-
 src/backend/access/transam/xlogarchive.c  |  2 +-
 src/backend/catalog/pg_constraint.c   |  2 +-
 src/backend/commands/indexcmds.c  |  2 +-
 src/backend/commands/statscmds.c  |  2 +-
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/postmaster/pgstat.c   |  2 +-
 src/backend/replication/logical/logical.c | 11 -
 src/backend/replication/slot.c|  2 +-
 src/backend/utils/adt/formatting.c|  8 ++--
 src/backend/utils/adt/name.c  | 48 ++-
 src/backend/utils/adt/pg_locale.c |  9 
 src/backend/utils/adt/ruleutils.c |  2 +-
 src/common/exec.c |  4 +-
 src/include/c.h   | 29 ---
 src/include/utils/builtins.h  |  3 +-
 src/interfaces/ecpg/pgtypeslib/dt_common.c|  4 +-
 src/interfaces/ecpg/test/pg_regress_ecpg.c|  2 +-
 .../ssl_passphrase_func.c |  2 +-
 20 files changed, 34 insertions(+), 106 deletions(-)

diff --git a/contrib/pgcrypto/crypt-des.c b/contrib/pgcrypto/crypt-des.c
index 6efaa609c9..98c30ea122 100644
--- a/contrib/pgcrypto/crypt-des.c
+++ b/contrib/pgcrypto/crypt-des.c
@@ -720,7 +720,7 @@ px_crypt_des(const char *key, const char *setting)
if (des_setkey((char *) keybuf))
return NULL;
}
-   StrNCpy(output, setting, 10);
+   strlcpy(output, setting, 10);
 
/*
 	 * Double check that we weren't given a short setting. If we were, the
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9e145f1c36..d1dbb43e09 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -252,7 +252,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
 */
ctl->shared = shared;
ctl->do_fsync = true;   /* default behavior */
-   StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
+   strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
 }
 
 /*
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index cdd586fcfb..8f8734dc1d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -323,7 +323,7 @@ ExecuteRecoveryCommand(const char *command, const char *commandName, bool failOn
case 'r':
/* %r: filename of last restartpoint */
sp++;
-   StrNCpy(dp, lastRestartPointFname, endp - dp);
+   strlcpy(dp, lastRestartPointFname, endp - dp);

Re: Creating a function for exposing memory usage of backend process

2020-08-07 Thread Michael Paquier
On Fri, Jul 31, 2020 at 03:23:52PM -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 4:25 AM torikoshia  wrote:
>> And as Fujii-san told me in person, exposing memory address seems
>> not preferable considering there are security techniques like
>> address space layout randomization.
> 
> Yeah, exactly. ASLR wouldn't do anything to improve security if there
> were no other security bugs, but there are, and some of those bugs are
> harder to exploit if you don't know the precise memory addresses of
> certain data structures. Similarly, exposing the addresses of our
> internal data structures is harmless if we have no other security
> bugs, but if we do, it might make those bugs easier to exploit. I
> don't think this information is useful enough to justify taking that
> risk.

FWIW, this is the class of issues where it is possible to print some
areas of memory, or even manipulate the stack so that it was possible to
pass down a custom pointer, so exposing the pointer locations is a
real risk, and this has happened in the past.  Anyway, it seems to me
that if this part is done, we could just make it superuser-only with
restrictive REVOKE privileges, but I am not sure that we have enough
use cases to justify this addition.
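For what it's worth, a superuser-only check at the C level would look roughly like this (the function name is made up for illustration); the REVOKE route would instead revoke EXECUTE from PUBLIC in the extension script so that administrators can re-grant it selectively:

Datum
pg_backend_memory_details(PG_FUNCTION_ARGS)
{
    if (!superuser())
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser to inspect backend memory internals")));

    /* ... build and return the result set here ... */
    PG_RETURN_NULL();
}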
--
Michael




Re: walsender waiting_for_ping spuriously set

2020-08-07 Thread Alvaro Herrera
On 2020-Aug-07, Alvaro Herrera wrote:

> I'm thinking of keeping the sentences that were added in that commit,
> maybe like so:
> 
> >  * We only send regular messages to the client for full decoded
> >  * transactions, but a synchronous replication and walsender 
> > shutdown
> >  * possibly are waiting for a later location. So, before 
> > sleeping, we
> > +* send a ping containing the flush location. A reply from 
> > standby is
> > +* not needed and would be wasteful most of the time,
> > +* but if the receiver is otherwise idle and walreceiver status 
> > messages
> > +* are enabled, this keepalive will trigger a reply.  
> > Processing the
> > +* reply will update these MyWalSnd locations.

After rereading this a few more times, I think it's OK as Noah had it,
so I'll just use his wording.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: walsender waiting_for_ping spuriously set

2020-08-07 Thread Alvaro Herrera
I just noticed that part of this comment I'm modifying:

> @@ -1444,17 +1444,13 @@ WalSndWaitForWal(XLogRecPtr loc)
>* We only send regular messages to the client for full decoded
>* transactions, but a synchronous replication and walsender 
> shutdown
>* possibly are waiting for a later location. So, before 
> sleeping, we
> -  * send a ping containing the flush location. If the receiver is
> -  * otherwise idle, this keepalive will trigger a reply. 
> Processing the
> -  * reply will update these MyWalSnd locations.
> +  * send a ping containing the flush location. A reply from 
> standby is
> +  * not needed and would be wasteful.

was added very recently, in f246ea3b2a5e ("In caught-up logical
walsender, sleep only in WalSndWaitForWal().").  Added Noah to CC.

I think the walreceiver will only send a reply if
wal_receiver_status_interval is set to a nonzero value.  I don't
understand what reason could there possibly be for setting this
parameter to zero, but it seems better to be explicit about it, as this
code is confusing enough.

I'm thinking of keeping the sentences that were added in that commit,
maybe like so:

>* We only send regular messages to the client for full decoded
>* transactions, but a synchronous replication and walsender 
> shutdown
>* possibly are waiting for a later location. So, before 
> sleeping, we
> +  * send a ping containing the flush location. A reply from 
> standby is
> +  * not needed and would be wasteful most of the time,
> +  * but if the receiver is otherwise idle and walreceiver status 
> messages
> +  * are enabled, this keepalive will trigger a reply.  
> Processing the
> +  * reply will update these MyWalSnd locations.

(Also, the comment would be updated all the way back to 9.5, even if
f246ea3b2a5e itself was not.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Improve handling of pg_stat_statements handling of bind "IN" variables

2020-08-07 Thread Pavel Trukhanov
Hey, let me know if there's any way I can help.

I would argue that making even a small improvement here would be beneficial
to many.

On Tue, Jul 21, 2020 at 11:59 AM Dmitry Dolgov <9erthali...@gmail.com>
wrote:

> > On Thu, Oct 3, 2019 at 3:33 AM Pavel Trukhanov <
> pavel.trukha...@gmail.com> wrote:
> >
> >> On Wed, Jun 26, 2019 at 11:10 PM Tom Lane  wrote:
> >>
> >> Greg Stark  writes:
> >> > Actually thinking about this for two more seconds the question is
> what it
> >> > would do with a query like
> >> > WHERE col = ANY '1,2,3'::integer[]
> >> > Or
> >> > WHERE col = ANY ARRAY[1,2,3]
> >> > Whichever the language binding that is failing to do parameterized
> queries
> >> > is doing to emulate them.
> >>
> >> Yeah, one interesting question is whether this is actually modeling
> >> what happens with real-world applications --- are they sending Consts,
> >> or Params?
> >>
> >> I resolutely dislike the idea of marking arrays that came from IN
> >> differently from other ones; that's just a hack and it's going to give
> >> rise to unexplainable behavioral differences for logically-equivalent
> >> queries.
> >
> > Thanks for your input.
> >
> > As for real-world applications – being a founder of a server monitoring
> saas
> > (okmeter) I have access to stats on hundreds of postgres installations.
> >
> > It shows that IN with a variable number of params is ~7 times more used
> than
> > ANY(array).
>
> Hi,
>
> I would like to do some archaeology and inquire about this thread, since
> unfortunately there was no patch presented as far as I see.
>
> IIUC the ideas suggested in this thread are evolving mostly about modifying
> parser:
>
> > On Fri, Jun 14, 2019 at 2:46 AM Tom Lane  wrote:
> >
> > I do not think you need new expression infrastructure. IMO what's going
> on
> > here is that we're indulging in premature optimization in the parser. It
> > would be better from a structural standpoint if the output of parse
> analysis
> > were closer to what the user wrote, and then the business of separating
> Vars
> > from Consts and reducing the Consts to an array were handled in the
> planner's
> > expression preprocessing phase.
> >
> > So maybe what you should be thinking about is a preliminary patch that's
> > mostly in the nature of refactoring, to move that processing to where it
> > should be.
> >
> > Of course, life is never quite that simple; there are at least two
> > issues you'd have to think about.
> >
> > * The parsing phase is responsible for determining the semantics of
> > the query, in particular resolving the data types of the IN-list items
> > and choosing the comparison operators that will be used.  The planner
> > is not allowed to rethink that.  What I'm not clear about offhand is
> > whether the existing coding in parse analysis might lead to different
> > choices of data types/operators than a more straightforward approach
> > does.  If so, we'd have to decide whether that's okay.
> >
> > * Postponing this work might make things slower overall, which wouldn't
> > matter too much for short IN-lists, but you can bet that people who
> > throw ten-thousand-entry IN-lists at us will notice.  So you'd need to
> > keep an eye on efficiency and make sure you don't end up repeating
> > similar processing over and over.
>
> This puzzles me, since the original issue sounds like a "representation"
> problem, when we want to calculate jumble hash in a way that obviously
> repeating parameters or constants are hashed into one value. I see the
> point in
> ideas like this:
>
> >> One idea that comes to me after looking at the code involved is that
> >> it might be an improvement across-the-board if transformAExprIn were to
> >> reduce the generated ArrayExpr to an array Const immediately, in cases
> >> where all the inputs are Consts.  That is going to happen anyway come
> >> plan time, so it'd have zero impact on semantics or query performance.
> >> Doing it earlier would cost nothing, and could even be a net win, by
> >> reducing per-parse-node overhead in places like the rewriter.
> >>
> >> The advantage for the problem at hand is that a Const that's an array
> >> of 2 elements is going to look the same as a Const that's any other
> >> number of elements so far as pg_stat_statements is concerned.
> >>
> >> This doesn't help of course in cases where the values aren't all
> >> Consts.  Since we eliminated Vars already, the main practical case
> >> would be that they're Params, leading us back to the previous
> >> question of whether apps are binding queries with different numbers
> >> of parameter markers in an IN, and how hard pg_stat_statements should
> >> try to fuzz that if they are.
> >>
> >> Then, to Greg's point, there's a question of whether transformArrayExpr
> >> should do likewise, ie try to produce an array Const immediately.
> >> I'm a bit less excited about that, but consistency suggests that
> >> we should have it act the same as the IN case.
>
> Interestingly enough, something similar 

Re: Should the nbtree page split REDO routine's locking work more like the locking on the primary?

2020-08-07 Thread Peter Geoghegan
On Thu, Aug 6, 2020 at 7:00 PM Peter Geoghegan  wrote:
> On Thu, Aug 6, 2020 at 6:08 PM Tom Lane  wrote:
> > +1 for making this more like what happens in original execution ("on the
> > primary", to use your wording).  Perhaps what you suggest here is still
> > not enough like the original execution, but it sounds closer.
>
> It won't be the same as the original execution, exactly -- I am only
> thinking of holding on to same-level page locks (the original page,
> its new right sibling, and the original right sibling).

I pushed a commit that reorders the lock acquisitions within
btree_xlog_unlink_page() -- they're now consistent with _bt_split()
(at least among sibling pages involved in the page split).

Thanks
-- 
Peter Geoghegan




Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-07 Thread Andres Freund
Hi,

On 2020-08-06 18:02:26 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > In fact using conceptually like a new snapshot for each sample tuple
> > actually seems like it'd be somewhat of an improvement over using a
> > single snapshot.
> 
> Dunno, that feels like a fairly bad idea to me.  It seems like it would
> overemphasize the behavior of whatever queries happened to be running
> concurrently with the ANALYZE.  I do follow the argument that using a
> single snapshot for the whole ANALYZE overemphasizes a single instant
> in time, but I don't think that leads to the conclusion that we shouldn't
> use a snapshot at all.

I didn't actually want to suggest that we should take a separate
snapshot for every sampled row - that'd be excessively costly. What I
wanted to say was that I don't see a clear accuracy
benefit. E.g. not seeing any of the values inserted more recently will
under-emphasize those in the histogram.

What precisely do you mean by "overemphasize" above? I mean, those will
be the rows most likely to be live after the analyze is done, so including
them doesn't seem like a bad thing to me?


> Another angle that would be worth considering, aside from the issue
> of whether the sample used for pg_statistic becomes more or less
> representative, is what impact all this would have on the tuple count
> estimates that go to the stats collector and pg_class.reltuples.
> Right now, we don't have a great story at all on how the stats collector's
> count is affected by combining VACUUM/ANALYZE table-wide counts with
> the incremental deltas reported by transactions happening concurrently
> with VACUUM/ANALYZE.  Would changing this behavior make that better,
> or worse, or about the same?

Hm. Vacuum already counts rows that are inserted concurrently with the
vacuum scan, if it encounters them. Analyze doesn't. Seems like we'd at
least be wrong in a more consistent manner than before...

IIUC both analyze and vacuum will overwrite concurrent changes to
n_live_tuples. So taking concurrently committed changes into account
seems like it'd be the right thing?

We probably could make this more accurate by accounting separately for
"recently inserted and committed" rows, and taking the difference of
n_live_tuples before/after into account.  But I'm a bit doubtful that
it's worth it?

Greetings,

Andres Freund




Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-07 Thread Alvaro Herrera
On 2020-Aug-05, Andres Freund wrote:

> I'm mildly against that, because I'd really like to start making use of
> the flag. Not so much for cancellations, but to avoid the drastic impact
> analyze has on bloat.  In OLTP workloads with big tables, and without
> disabled cost limiting for analyze (or slow IO), the snapshot that
> analyze holds is often by far the transaction with the oldest xmin.

I pushed despite the objection because it seemed that downstream
discussion was largely favorable to the change, and there's a different
proposal to solve the bloat problem for analyze; and also:

> Only mildly against because it'd not be hard to reintroduce once we need
> it.

Thanks for the discussion!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-08-07 Thread Andres Freund
Hi,

On 2020-08-07 12:29:03 -0400, Robert Haas wrote:
> On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy
>  wrote:
> > I sent the patch previously[1], but attaching here again, modifies
> > cancel_before_shmem_exit() function comment to reflect the safe usage
> > of before_shmem_exit_list callback mechanism and also removes the
> > point "For simplicity, only the latest entry can be removed*"
> > as this gives a meaning that there is still scope for improvement in
> > cancel_before_shmem_exit() search mechanism.
> >
> > Thoughts?
> 
> I think that the first part of the comment change you suggest is a
> good idea and would avoid developer confusion, but I think that the
> statement about unordered removal of comments being risky doesn't add
> much. It's too vague to help anybody and I don't think I believe it,
> either. So I suggest something more like:
> 
> - * callback.  For simplicity, only the latest entry can be
> - * removed.  (We could work harder but there is no need for
> - * current uses.)
> + * callback.  We only look at the latest entry for removal, as we
> + * expect the caller to use before_shmem_exit callback mechanism
> + * in the LIFO order.

In which situations is the removal actually useful *and* safe, with
these constraints? You'd have to have a very narrow set of functions
that are called while the exit hook is present, i.e. basically this
would only be usable for PG_ENSURE_ERROR_CLEANUP and nothing else.  And
even there it seems like it's pretty easy to get into a situation where
it's not safe.
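For concreteness, the only shape that seems clearly safe is the strictly nested one that PG_ENSURE_ERROR_CLEANUP / PG_END_ENSURE_ERROR_CLEANUP already give you, roughly like this (my_cleanup and the function around it are placeholders, not real backend code):

static void
my_cleanup(int code, Datum arg)
{
    /* undo whatever the protected section set up */
}

static void
do_something_protected(void)
{
    PG_ENSURE_ERROR_CLEANUP(my_cleanup, (Datum) 0);
    {
        /*
         * Code in here must not register and leave behind its own
         * before_shmem_exit callbacks, or the cancel done by
         * PG_END_ENSURE_ERROR_CLEANUP would no longer be removing the
         * latest entry.
         */
    }
    PG_END_ENSURE_ERROR_CLEANUP(my_cleanup, (Datum) 0);
}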

Greetings,

Andres Freund




Re: Add LWLock blocker(s) information

2020-08-07 Thread David Zhang
Hi, 
This is a very interesting topic. I applied the 2nd patch to the master branch
and performed a quick test, and I can observe the information below:
postgres=# select * from pg_lwlock_blocking_pid(26925);
 requested_mode | last_holder_pid | last_holder_mode | nb_holders 
----------------+-----------------+------------------+------------
 LW_EXCLUSIVE   |           26844 | LW_EXCLUSIVE     |          1
(1 row)

postgres=# select query,pid,state,wait_event,wait_event_type,pg_lwlock_blocking_pid(pid),pg_blocking_pids(pid) from pg_stat_activity where state='active' and pid != pg_backend_pid();
                           query                           |  pid  | state  | wait_event | wait_event_type |        pg_lwlock_blocking_pid       | pg_blocking_pids 
-----------------------------------------------------------+-------+--------+------------+-----------------+-------------------------------------+------------------
 INSERT INTO orders SELECT FROM generate_series(1, 1000);  | 26925 | active | WALWrite   | LWLock          | (LW_EXCLUSIVE,26844,LW_EXCLUSIVE,1) | {}
(1 row)

At some points, I had to keep repeating the query in order to capture the
"lock info". I think this is probably part of the design, but I was wondering:
if a query is in a deadlock and a developer is expected to take a look using the
methods above, will the process be killed before the developer gets the chance
to execute one of these queries?
If some statistics information could be added, it may help developers get an
overall idea of the lock status, and if developers could specify some filters,
such as the number of times a query entered a deadlock, or the queries that held
a lock for more than some number of ms, it might help troubleshoot the "lock"
issue even better. Moreover, if this feature could be an independent extension,
similar to "pg_buffercache", that would be great.
Best regards,

David

Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> Thinking about it more, there are really two ways to think about an
> estimated row count.

> On the one hand, if you think of the row count estimate as the number
> of rows that are going to pop out of a node, then it's always right to
> think of a unique index as limiting the number of occurrences of a
> given value to 1. But, if you think of the row count estimate as a way
> of estimating the amount of work that the node has to do to produce
> that output, then it isn't.

The planner intends its row counts to be interpreted in the first way.
We do have a rather indirect way of accounting for the cost of scanning
dead tuples and such, which is that we scale scanning costs according
to the measured physical size of the relation.  That works better for
I/O costs than it does for CPU costs, but it's not completely useless
for the latter.  In any case, we'd certainly not want to increase the
scan's row count estimate for that, because that would falsely inflate
our estimate of how much work upper plan levels have to do.  Whatever
happens at the scan level, the upper levels aren't going to see those
dead tuples.

regards, tom lane




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 2:00 PM Tom Lane  wrote:
> The point of the sigdelset is that if somewhere later on, we install
> the BlockSig mask, then SIGQUIT will remain unblocked.

I mean, we're just repeating the same points here, but that's not what
the comment says.

> You asserted
> upthread that noplace in these processes ever does so; maybe that's
> true today, or maybe not,

It's easily checked using 'git grep'.

> but the intent of this code is that *once
> we get through initialization* SIGQUIT will remain unblocked.

I can't speak to the intent, but I can speak to what the comment says.

> I'll concede that it's not 100% clear whether or not these processes
> need to re-block SIGQUIT during error recovery.

I think it's entirely clear that they do not, and I have explained my
reasoning already.

> I repeat, though,
> that I'm disinclined to change that without some evidence that there's
> actually a problem with the way it works now.

I've also already explained why I don't agree with this perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: PROC_IN_ANALYZE stillborn 13 years ago

2020-08-07 Thread Robert Haas
On Thu, Aug 6, 2020 at 5:35 PM Tom Lane  wrote:
> > It's not clear to me that it would even be correct to categorize those
> > somewhat-different results as "less accurate."
>
> Estimating two rows where the correct answer is one row is clearly
> "less accurate" [ than estimating one row ].

That's a tautology, so I can't argue with it as far as it goes.
Thinking about it more, there are really two ways to think about an
estimated row count.

On the one hand, if you think of the row count estimate as the number
of rows that are going to pop out of a node, then it's always right to
think of a unique index as limiting the number of occurrences of a
given value to 1. But, if you think of the row count estimate as a way
of estimating the amount of work that the node has to do to produce
that output, then it isn't.

If a table has a lot of inserts and deletes, or a lot of updates,
index scans might have to do a lot of extra work chasing down index
pointers to tuples that end up being invisible to our scan. The scan
may not have any filter quals at all, and even if it does, they are
likely cheap to evaluate compared to the cost of finding and locking
buffers and checking visibility, so the dominant cost of the scan is
really based on the total number of rows that are present, not the
number that are visible. Ideally, the presence of those rows would
affect the cost estimate for the node in a way very similar to
expecting to find more rows. At the same time, it doesn't work to just
bump up the row count estimate for the node, because then you'll think
more rows will be output, which might cause poor planning decisions at
higher levels.

It doesn't seem easy to get this 100% right. Tuple visibility can
change very quickly, much faster than the inter-ANALYZE interval. And
sometimes tuples can be pruned away very quickly, too, and the index
pointers may be opportunistically removed very quickly, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 12:56 PM Tom Lane  wrote:
>> That SETMASK call will certainly unblock SIGQUIT, so I don't see what
>> your point is.

> I can't figure out if you're trolling me here or what. It's true that
> the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
> also be true if the sigdelset() call were absent.

The point of the sigdelset is that if somewhere later on, we install
the BlockSig mask, then SIGQUIT will remain unblocked.  You asserted
upthread that noplace in these processes ever does so; maybe that's
true today, or maybe not, but the intent of this code is that *once
we get through initialization* SIGQUIT will remain unblocked.

I'll concede that it's not 100% clear whether or not these processes
need to re-block SIGQUIT during error recovery.  I repeat, though,
that I'm disinclined to change that without some evidence that there's
actually a problem with the way it works now.

regards, tom lane




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 1:12 PM Tom Lane  wrote:
> That's a meaningless statement for any one caller.  So it needs to be more
> like "we expect callers to add and remove temporary before_shmem_exit
> callbacks in strict LIFO order".

Sure, that seems fine.

> I wonder whether we ought to change the function to complain if the
> last list entry doesn't match.  We'd have caught this bug sooner
> if it did, and it's not very clear why silently doing nothing is
> a good idea when there's no match.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:56 PM Tom Lane  wrote:
> > I don't think that your analysis here is correct. The sigdelset call
> > is manipulating BlockSig, and the subsequent PG_SETMASK call is
> > working with UnblockSig, so it doesn't make sense to view one as a
> > preparatory step for the other.
>
> That SETMASK call will certainly unblock SIGQUIT, so I don't see what
> your point is.

I can't figure out if you're trolling me here or what. It's true that
the PG_SETMASK() call will certainly unblock SIGQUIT, but that would
also be true if the sigdelset() call were absent.

> Anyway, the bottom line is that that code's been like
> that for a decade or two without complaints, so I'm disinclined to
> mess with it on the strength of nothing much.

Really? Have you reversed your policy of wanting the comments to
accurately describe what the code does?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:52 PM Tom Lane  wrote:
> At least in the case of segment zero, the file will still exist.  It'll
> have been truncated to zero length, and if the filesystem is stupid about
> holes in files then maybe a write to a high block number would consume
> excessive disk space, but does anyone still care about such filesystems?
> I don't remember at the moment how we handle higher segments, but likely
> we could make them still exist too, postponing all the unlinks till after
> checkpoint.  Or we could just have the backends give up on recycling a
> particular buffer if they can't write it (which is the response to an I/O
> failure already, I hope).

None of this sounds very appealing. Postponing the unlinks means
postponing recovery of the space at the OS level, which I think will
be noticeable and undesirable for users. The other notions all seem to
involve treating as valid on-disk states we currently treat as
invalid, and our sanity checks in this area are already far too weak.
And all you're buying for it is putting a hash table that would
otherwise be shared memory into backend-private memory, which seems
like quite a minor gain. Having that information visible to everybody
seems a lot cleaner.

> Yeah, trying to amortize the cost across multiple drops seems like
> what we really want here.  I'm starting to think about a "relation
> dropper" background process, which would be somewhat like the checkpointer
> but it wouldn't have any interest in actually doing buffer I/O.
> We'd send relation drop commands to it, and it would scan all of shared
> buffers and flush related buffers, and then finally do the file truncates
> or unlinks.  Amortization would happen by considering multiple target
> relations during any one scan over shared buffers.  I'm not very clear
> on how this would relate to the checkpointer's handling of relation
> drops, but it could be worked out; if we were lucky maybe the checkpointer
> could stop worrying about that.

I considered that, too, but it might be overkill. I think that one
scan of shared_buffers every now and then might be cheap enough that
we could just not worry too much about which process gets stuck doing
it. So for example if the number of buffers allocated since the hash
table ended up non-empty reaches NBuffers, the process wanting to do
the next eviction gets handed the job of cleaning it out. Or maybe the
background writer could help; it's not like it does much anyway, zing.
It's possible that a dedicated process is the right solution, but we
might want to at least poke a bit at other alternatives.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> ... So I suggest something more like:

> - * callback.  For simplicity, only the latest entry can be
> - * removed.  (We could work harder but there is no need for
> - * current uses.)
> + * callback.  We only look at the latest entry for removal, as we
> + * expect the caller to use before_shmem_exit callback mechanism
> + * in the LIFO order.

That's a meaningless statement for any one caller.  So it needs to be more
like "we expect callers to add and remove temporary before_shmem_exit
callbacks in strict LIFO order".

I wonder whether we ought to change the function to complain if the
last list entry doesn't match.  We'd have caught this bug sooner
if it did, and it's not very clear why silently doing nothing is
a good idea when there's no match.
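Roughly like this, say -- just a sketch against the existing before_shmem_exit_list / before_shmem_exit_index structures in ipc.c, and whether it should be elog(ERROR), a WARNING, or an Assert is debatable:

void
cancel_before_shmem_exit(pg_on_exit_callback function, Datum arg)
{
    if (before_shmem_exit_index > 0 &&
        before_shmem_exit_list[before_shmem_exit_index - 1].function == function &&
        before_shmem_exit_list[before_shmem_exit_index - 1].arg == arg)
        --before_shmem_exit_index;
    else
        elog(ERROR, "before_shmem_exit callback %p is not the latest entry",
             (void *) function);
}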

regards, tom lane




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 11:36 AM Tom Lane  wrote:
>> The sigdelset call, just like the adjacent pqsignal calls, is
>> preparatory setup; it does not intend to allow anything to happen
>> immediately.

> I don't think that your analysis here is correct. The sigdelset call
> is manipulating BlockSig, and the subsequent PG_SETMASK call is
> working with UnblockSig, so it doesn't make sense to view one as a
> preparatory step for the other.

That SETMASK call will certainly unblock SIGQUIT, so I don't see what
your point is.  Anyway, the bottom line is that that code's been like
that for a decade or two without complaints, so I'm disinclined to
mess with it on the strength of nothing much.

regards, tom lane




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 12:09 PM Tom Lane  wrote:
>> ... it's not very clear what to do in
>> backends that haven't got access to that table, but maybe we could just
>> accept that backends that are forced to flush dirty buffers might do some
>> useless writes in such cases.

> I don't see how that can work. It's not that the writes are useless;
> it's that they will fail outright because the file doesn't exist.

At least in the case of segment zero, the file will still exist.  It'll
have been truncated to zero length, and if the filesystem is stupid about
holes in files then maybe a write to a high block number would consume
excessive disk space, but does anyone still care about such filesystems?
I don't remember at the moment how we handle higher segments, but likely
we could make them still exist too, postponing all the unlinks till after
checkpoint.  Or we could just have the backends give up on recycling a
particular buffer if they can't write it (which is the response to an I/O
failure already, I hope).

> My viewpoint on this is - I have yet to see anybody really get hosed
> because they drop one relation and that causes a full scan of
> shared_buffers. I mean, it's slightly expensive, but computers are
> fast. Whatever. What hoses people is dropping a lot of relations in
> quick succession, either by spamming DROP TABLE commands or by running
> something like DROP SCHEMA, and then suddenly they're scanning
> shared_buffers over and over again, and their standby is doing the
> same thing, and now it hurts.

Yeah, trying to amortize the cost across multiple drops seems like
what we really want here.  I'm starting to think about a "relation
dropper" background process, which would be somewhat like the checkpointer
but it wouldn't have any interest in actually doing buffer I/O.
We'd send relation drop commands to it, and it would scan all of shared
buffers and flush related buffers, and then finally do the file truncates
or unlinks.  Amortization would happen by considering multiple target
relations during any one scan over shared buffers.  I'm not very clear
on how this would relate to the checkpointer's handling of relation
drops, but it could be worked out; if we were lucky maybe the checkpointer
could stop worrying about that.

regards, tom lane




Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks

2020-08-07 Thread Robert Haas
On Thu, Aug 6, 2020 at 11:46 PM Bharath Rupireddy
 wrote:
> I sent the patch previously[1], but attaching here again, modifies
> cancel_before_shmem_exit() function comment to reflect the safe usage
> of before_shmem_exit_list callback mechanism and also removes the
> point "For simplicity, only the latest entry can be removed*"
> as this gives a meaning that there is still scope for improvement in
> cancel_before_shmem_exit() search mechanism.
>
> Thoughts?

I think that the first part of the comment change you suggest is a
good idea and would avoid developer confusion, but I think that the
statement about unordered removal of callbacks being risky doesn't add
much. It's too vague to help anybody and I don't think I believe it,
either. So I suggest something more like:

- * callback.  For simplicity, only the latest entry can be
- * removed.  (We could work harder but there is no need for
- * current uses.)
+ * callback.  We only look at the latest entry for removal, as we
+ * expect the caller to use before_shmem_exit callback mechanism
+ * in the LIFO order.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:09 PM Tom Lane  wrote:
> As for (1), maybe we don't need to keep the info in shmem.  I'll just
> point out that the checkpointer has *already got* a complete list of all
> recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
> imagine looking aside at that to discover that a dirty buffer belongs to a
> recently-dropped relation.  pendingUnlinks would need to be converted to a
> hashtable to make searches cheap, and it's not very clear what to do in
> backends that haven't got access to that table, but maybe we could just
> accept that backends that are forced to flush dirty buffers might do some
> useless writes in such cases.

I don't see how that can work. It's not that the writes are useless;
it's that they will fail outright because the file doesn't exist.

> As for (2), the reason why we have that list is that the physical unlink
> doesn't happen till after the next checkpoint.  So with some messing
> around here, we could probably guarantee that every buffer belonging
> to the relation has been scanned and deleted before the file unlink
> happens --- and then, even if the OID counter has wrapped around, the
> OID won't be reassigned to a new relation before that happens.

This seems right to me, though.

> In short, it seems like maybe we could shove the responsibility for
> cleaning up dropped relations' buffers onto the checkpointer without
> too much added cost.  A possible problem with this is that recycling
> of those buffers will happen much more slowly than it does today,
> but maybe that's okay?

I suspect it's going to be advantageous to try to make cleaning up
dropped buffers quick in normal cases and allow it to fall behind only
when someone is dropping a lot of relations in quick succession, so
that buffer eviction remains cheap in normal cases. I hadn't thought
about the possible negative performance consequences of failing to put
buffers on the free list, but that's another reason to try to make it
fast.

My viewpoint on this is - I have yet to see anybody really get hosed
because they drop one relation and that causes a full scan of
shared_buffers. I mean, it's slightly expensive, but computers are
fast. Whatever. What hoses people is dropping a lot of relations in
quick succession, either by spamming DROP TABLE commands or by running
something like DROP SCHEMA, and then suddenly they're scanning
shared_buffers over and over again, and their standby is doing the
same thing, and now it hurts. The problem on the standby is actually
worse than the problem on the primary, because the primary can do
other things while one process sits there and thinks about
shared_buffers for a long time, but the standby can't, because the
startup process is all there is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 11:36 AM Tom Lane  wrote:
> I think that code is the way it is intentionally: the idea is to not
> accept any signals until we reach the explicit "PG_SETMASK(&UnblockSig);"
> call further down, between the sigsetjmp stanza and the main loop.
> The sigdelset call, just like the adjacent pqsignal calls, is
> preparatory setup; it does not intend to allow anything to happen
> immediately.

I don't think that your analysis here is correct. The sigdelset call
is manipulating BlockSig, and the subsequent PG_SETMASK call is
working with UnblockSig, so it doesn't make sense to view one as a
preparatory step for the other. It could be correct to interpret the
sigdelset call as preparatory work for a future call to
PG_SETMASK(&BlockSig), but AFAICS there are no such calls in the
processes where this incantation exists, so really it just seems to be
a no-op. Furthermore, the comment says that the point of the
sigdelset() is to allow SIGQUIT "at all times," which doesn't square
well with your suggestion that we intended it to take effect only
later.

> In general, you don't want to accept signals in that area because the
> process state may not be fully set up yet.  You could argue that the
> SIGQUIT handler has no state dependencies, making it safe to accept
> SIGQUIT earlier during startup of one of these processes, and likewise
> for them to accept SIGQUIT during error recovery.  But barring actual
> evidence of a problem with slow SIGQUIT response in these areas I'm more
> inclined to leave well enough alone.  Changing this would add hazards,
> e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
> I'd want some concrete evidence of a benefit.  It seems fairly
> irrelevant to the problem at hand with bgworkers, anyway.

The SIGQUIT handler in question contains nothing but a call to
_exit(2) and a long comment explaining why we don't do anything else,
so I think the argument that it has no state dependencies is pretty
well water-tight. Whether or not we've got a problem with timely
SIGQUIT acceptance is much less clear. So it seems to me that the
safer thing to do here would be to unblock the signal. It might gain
something, and it can't really lose anything. Now it's true that the
calculus might change if someone were to modify the behavior of the
SIGQUIT handler in the future, but if they do then it's their job to
think about this stuff. It doesn't seem especially likely for that to
change, anyway. The only reason that the handler for regular backends
does anything other than _exit(2) is that we want to try to let the
client know what happened before we croak, and that concern is
irrelevant for background workers. Doing any other cleanup here is
unsafe and unnecessary.

> As for said problem, I concur with Robert that the v4 patch seems pretty
> dubious; it's adding a lot of barely-thought-out mechanism for no
> convincing gain in safety.  I think the v1 patch was more nearly the right
> thing, except that the unblock needs to happen a tad further down, as
> attached (this is untested but certainly it should pass any test that v1
> passed).  I didn't do anything about rewriting the bogus comment just
> above the sigsetjmp call, but I agree that that should happen too.

I am not sure whether the difference between this and v1 matters,
because in postgres.c it's effectively happening inside sigsetjmp, so
the earlier unblock must not be that bad. But I don't mind putting it
in the place you suggest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 12:03 AM Tom Lane  wrote:
>> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
>> in the system, the checkpointer will eventually try to flush it, and fail
>> (because there's no file to write it to), and then checkpointing will be
>> stuck.  So we cannot afford to risk missing any buffers.

> This comment suggests another possible approach to the problem, which
> is to just make a note someplace in shared memory when we drop a
> relation. If we later find any of its buffers, we drop them without
> writing them out. This is not altogether simple, because (1) we don't
> have infinite room in shared memory to accumulate such notes and (2)
> it's not impossible for the OID counter to wrap around and permit the
> creation of a new relation with the same OID, which would be a problem
> if the previous note is still around.

Interesting idea indeed.

As for (1), maybe we don't need to keep the info in shmem.  I'll just
point out that the checkpointer has *already got* a complete list of all
recently-dropped relations, cf pendingUnlinks in sync.c.  So you could
imagine looking aside at that to discover that a dirty buffer belongs to a
recently-dropped relation.  pendingUnlinks would need to be converted to a
hashtable to make searches cheap, and it's not very clear what to do in
backends that haven't got access to that table, but maybe we could just
accept that backends that are forced to flush dirty buffers might do some
useless writes in such cases.

As for (2), the reason why we have that list is that the physical unlink
doesn't happen till after the next checkpoint.  So with some messing
around here, we could probably guarantee that every buffer belonging
to the relation has been scanned and deleted before the file unlink
happens --- and then, even if the OID counter has wrapped around, the
OID won't be reassigned to a new relation before that happens.

In short, it seems like maybe we could shove the responsibility for
cleaning up dropped relations' buffers onto the checkpointer without
too much added cost.  A possible problem with this is that recycling
of those buffers will happen much more slowly than it does today,
but maybe that's okay?

regards, tom lane




Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Robert Haas
On Thu, Aug 6, 2020 at 9:23 AM Ashutosh Sharma  wrote:
> Attached v4 patch fixes the latest comments from Robert and Masahiko-san.

Compiler warning:

heap_surgery.c:136:13: error: comparison of unsigned expression < 0 is
always false [-Werror,-Wtautological-compare]
if (blkno < 0 || blkno >= nblocks)
~ ^ ~

There's a certain inconsistency to these messages:

rhaas=# create table foo (a int);
CREATE TABLE
rhaas=# insert into foo values (1);
INSERT 0 1
rhaas=# select heap_force_kill('foo'::regclass, array['(0,2)'::tid]);
NOTICE:  skipping tid (0, 2) because it contains an invalid offset
 heap_force_kill
-

(1 row)

rhaas=# select heap_force_kill('foo'::regclass, array['(1,0)'::tid]);
ERROR:  invalid item pointer
LOCATION:  tids_same_page_fetch_offnums, heap_surgery.c:347
rhaas=# select heap_force_kill('foo'::regclass, array['(1,1)'::tid]);
ERROR:  block number 1 is out of range for relation "foo"

From a user perspective it seems like I've made three very similar
mistakes: in the first case, the offset is too high, in the second
case it's too low, and in the third case the block number is out of
range. But in one case I get a NOTICE and in the other two cases I get
an ERROR. In one case I get the relation name and in the other two
cases I don't. The two complaints about an invalid offset are phrased
completely differently from each other. For example, suppose you do
this:

ERROR: tid (%u, %u) is invalid for relation "%s" because the block
number is out of range (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item
number is out of range for this block (%u..%u)
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is unused
ERROR: tid (%u, %u) is invalid for relation "%s" because the item is dead
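(As a sketch, not the actual patch: the block-number case might then be raised like this, using the blkno, offnum, nblocks and rel variables the existing heap_surgery.c code already has at hand.)

    if (blkno >= nblocks)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("tid (%u, %u) is invalid for relation \"%s\" because the block number is out of range (%u..%u)",
                        blkno, offnum, RelationGetRelationName(rel),
                        0, nblocks - 1)));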

I think I misled you when I said to use pg_class_aclcheck. I think it
should actually be pg_class_ownercheck.

I think the relkind sanity check should permit RELKIND_MATVIEW also.

It's unclear to me why the freeze logic here shouldn't do this part
what heap_prepare_freeze_tuple() does when freezing xmax:

frz->t_infomask2 &= ~HEAP_HOT_UPDATED;
frz->t_infomask2 &= ~HEAP_KEYS_UPDATED;

Likewise, why should we not freeze or invalidate xvac in the case
where tuple->t_infomask & HEAP_MOVED, as heap_prepare_freeze_tuple()
would do?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Tom Lane
Robert Haas  writes:
> On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
>  wrote:
>> We intend to unblock SIGQUIT before sigsetjmp() in places like
>> bgwriter, checkpointer, walwriter and walreceiver, but we only call
>> sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
>> like we are not actually unblocking SIGQUIT and quickdie() will never
>> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
>> 0){}

> Yeah, maybe so. This code has been around for a long time and I'm not
> sure what the thought process behind it was, but I don't see a flaw in
> your analysis here.

I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnblockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.

In general, you don't want to accept signals in that area because the
process state may not be fully set up yet.  You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery.  But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone.  Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit.  It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.

As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety.  I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed).  I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.

regards, tom lane

diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index beb5e85434..dd22241600 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -753,7 +753,14 @@ StartBackgroundWorker(void)
 		/* Prevent interrupts while cleaning up */
 		HOLD_INTERRUPTS();
 
-		/* Report the error to the server log */
+		/*
+		 * sigsetjmp will have blocked all signals, but we may need to accept
+		 * signals while communicating with our parallel leader.  Once we've
+		 * done HOLD_INTERRUPTS() it should be safe to unblock signals.
+		 */
+		BackgroundWorkerUnblockSignals();
+
+		/* Report the error to the parallel leader and the server log */
 		EmitErrorReport();
 
 		/*


Re: Handing off SLRU fsyncs to the checkpointer

2020-08-07 Thread Robert Haas
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro  wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).

Dang, that's pretty nice, especially for the relatively small amount
of code that it seems to require.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 12:03 AM Tom Lane  wrote:
> Yeah, there is no room for "good enough" here.  If a dirty buffer remains
> in the system, the checkpointer will eventually try to flush it, and fail
> (because there's no file to write it to), and then checkpointing will be
> stuck.  So we cannot afford to risk missing any buffers.

This comment suggests another possible approach to the problem, which
is to just make a note someplace in shared memory when we drop a
relation. If we later find any of its buffers, we drop them without
writing them out. This is not altogether simple, because (1) we don't
have infinite room in shared memory to accumulate such notes and (2)
it's not impossible for the OID counter to wrap around and permit the
creation of a new relation with the same OID, which would be a problem
if the previous note is still around.

But this might be solvable. Suppose we create a shared hash table
keyed by <dboid, relfilenode> with room for 1 entry per 1000 shared
buffers. When you drop a relation, you insert into the hash table.
Periodically you "clean" the hash table by marking all the entries,
scanning shared buffers to remove any matches, and then deleting all
the marked entries. This should be done periodically in the
background, but if you try to drop a relation and find the hash table
full, or you try to create a relation and find the OID of your new
relation in the hash table, then you have to clean synchronously.

Right now, the cost of dropping K relation when N shared buffers is
O(KN). But with this approach, you only have to incur the O(N)
overhead of scanning shared_buffers when the hash table fills up, and
the hash table size is proportional to N, so the amortized complexity
is O(K); that is, dropping relations takes time proportional to the
number of relations being dropped, but NOT proportional to the size of
shared_buffers, because as shared_buffers grows the hash table gets
proportionally bigger, so that scans don't need to be done as
frequently.
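A toy, self-contained simulation of that amortization (nothing here is PostgreSQL code; the structures, the 1-per-1000 sizing, and the clean-synchronously-when-full policy just follow the scheme sketched above, with the background cleaning left out):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NBUFFERS        8192
#define DROPPED_SLOTS   (NBUFFERS / 1000)   /* 1 entry per 1000 buffers */

typedef struct
{
    uint32_t    relid;
    bool        used;
} DroppedRel;

typedef struct
{
    uint32_t    relid;
    bool        valid;
} ToyBuffer;

static DroppedRel dropped[DROPPED_SLOTS];
static ToyBuffer buffers[NBUFFERS];
static long buffer_scans = 0;

/*
 * "Clean" the table: one pass over all buffers, invalidating any buffer
 * that belongs to a recorded dropped relation, then empty the table.
 */
static void
clean_dropped_table(void)
{
    buffer_scans++;
    for (int b = 0; b < NBUFFERS; b++)
        for (int i = 0; i < DROPPED_SLOTS; i++)
            if (dropped[i].used && buffers[b].valid &&
                buffers[b].relid == dropped[i].relid)
                buffers[b].valid = false;
    for (int i = 0; i < DROPPED_SLOTS; i++)
        dropped[i].used = false;
}

/* Record a dropped relation; if the table is full, clean synchronously. */
static void
record_dropped_rel(uint32_t relid)
{
    for (;;)
    {
        for (int i = 0; i < DROPPED_SLOTS; i++)
        {
            if (!dropped[i].used)
            {
                dropped[i].relid = relid;
                dropped[i].used = true;
                return;
            }
        }
        clean_dropped_table();
    }
}

int
main(void)
{
    int         ndrops = 10000;

    for (uint32_t relid = 1; relid <= (uint32_t) ndrops; relid++)
        record_dropped_rel(relid);

    /* roughly ndrops / DROPPED_SLOTS scans instead of one scan per drop */
    printf("%d drops caused %ld scans of %d buffers\n",
           ndrops, buffer_scans, NBUFFERS);
    return 0;
}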

Andres's approach (retail hash table lookups just for blocks less than
the relation size, rather than a full scan) is going to help most with
small relations, whereas this approach helps with relations of any
size, but if you're trying to drop a lot of relations, they're
probably small, and if they are large, scanning shared buffers may not
be the dominant cost, since unlinking the files also takes time. Also,
this approach might turn out to slow down buffer eviction too much.
That could maybe be mitigated by having some kind of cheap fast-path
that gets used when the hash table is empty (like an atomic flag that
indicates whether a hash table probe is needed), and then trying hard
to keep it empty most of the time (e.g. by aggressive background
cleaning, or by ruling that after some number of hash table lookups
the next process to evict a buffer is forced to perform a cleanup).
But you'd probably have to try it to see how well you can do.

It's also possible to combine the two approaches. Small relations
could use Andres's approach while larger ones could use this approach;
or you could insert both large and small relations into this hash
table but use different strategies for cleaning out shared_buffers
depending on the relation size (which could also be preserved in the
hash table).

Just brainstorming here...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Robert Haas
On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
 wrote:
> We intend to unblock SIGQUIT before sigsetjmp() in places like
> bgwriter, checkpointer, walwriter and walreceiver, but we only call
> sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
> like we are not actually unblocking SIGQUIT and quickdie() will never
> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
> 0){}

Yeah, maybe so. This code has been around for a long time and I'm not
sure what the thought process behind it was, but I don't see a flaw in
your analysis here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

2020-08-07 Thread Pierre Giraud



On 07/08/2020 at 14:52, Julien Rouhaud wrote:
> Hi,
> 
> On Fri, Aug 7, 2020 at 2:30 PM Pierre Giraud  wrote:
>>
>> Hi all,
>>
>> As far as I understand, in the upcoming version 13, information about
>> buffers used during planning is now available in the explain plan.
> 
> Indeed.
> 
>> […]
>>  Planning Time: 0.203 ms
>>Buffers: shared hit=14
>> […]
>>
>> In the JSON format, the data structure is a bit different:
>>
>> […]
>>  "Planning": {
>>"Planning Time": 0.533,
>>"Shared Hit Blocks": 14,
>>"Shared Read Blocks": 0,
>>"Shared Dirtied Blocks": 0,
>>"Shared Written Blocks": 0,
>>"Local Hit Blocks": 0,
>>"Local Read Blocks": 0,
>>"Local Dirtied Blocks": 0,
>>"Local Written Blocks": 0,
>>"Temp Read Blocks": 0,
>>"Temp Written Blocks": 0
>>  },
>> […]
>>
>> For a matter of consistency, I wonder if it would be possible to format
>> it like the following:
>>
>> […]
>>  Planning:
>>Planning Time: 0.203 ms
>>Buffers: shared hit=14
>> […]
> 
> I agree that this output looks more consistent with other output,
> including JIT as you mentioned.  I'll send a patch for that if there's
> no objection.

Thanks a lot!

> 
> Out of curiosity, is the current text output actually harder to parse
> than the one you're suggesting?
> 

I don't want to speak on behalf of the developers of other parsing tools,
but parsing the output I'm proposing should not require a lot of work.
It would be nice to have their opinion though, especially Hubert depesz
Lubaczewski's, since he has already integrated the change:
https://gitlab.com/depesz/Pg--Explain/-/commit/4a760136ee69ee4929625d4e4022f79ac60b763f

However, as far as I know, he's not doing anything with the buffers
information in the "Planning" section yet.

To answer your question, I think that the new output would make the
parser a little bit easier to write because it would make things a bit
clearer (i.e. more separated) and thus less prone to errors.




Re: problem with RETURNING and update row movement

2020-08-07 Thread Etsuro Fujita
On Mon, Aug 3, 2020 at 4:39 PM Amit Langote  wrote:
> On Mon, Aug 3, 2020 at 2:54 PM Amit Langote  wrote:
> > By the way, you'll need two adjustments to even get this example
> > working, one of which is a reported problem that system columns in
> > RETURNING list during an operation on partitioned table stopped
> > working in PG 12 [1] for which I've proposed a workaround (attached).
> > Another is that we forgot in our patch to "materialize" the virtual
> > tuple after conversion, which means slot_getsysattr() can't find it to
> > access system columns like xmin, etc.
>
> The only solution I could think of for this so far is this:
>
> +   if (map)
> +   {
> +   orig_slot = execute_attr_map_slot(map,
> + res_slot,
> + orig_slot);
> +
> +   /*
> +* A HACK to install system information into the just
> +* converted tuple so that RETURNING computes any
> +* system columns correctly. This would be the same
> +* information that would be present in the HeapTuple
> +* version of the tuple in res_slot.
> +*/
> +   tuple = ExecFetchSlotHeapTuple(orig_slot, true,
> +  _free);
> +   tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK);
> +   tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
> +   tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
> +   HeapTupleHeaderSetXmin(tuple->t_data,
> +  GetCurrentTransactionId());
> +   HeapTupleHeaderSetCmin(tuple->t_data,
> +  estate->es_output_cid);
> +   HeapTupleHeaderSetXmax(tuple->t_data, 0); /*
> for cleanliness */
> +   }
> +   /*
> +* Override tts_tableOid with the OID of the destination
> +* partition.
> +*/
> +   orig_slot->tts_tableOid =
> RelationGetRelid(destRel->ri_RelationDesc);
> +   /* Also the TID. */
> +   orig_slot->tts_tid = res_slot->tts_tid;
>
> ..but it might be too ugly :(.

Yeah, I think that would be a bit ugly, and actually, it is not correct
in the case of a postgres_fdw foreign table, where Xmin and Cmin are
also set to 0 [2].  I think we should probably first address the
tableam issue that you pointed out, but I don't think I'm the right
person to do so.

Best regards,
Etsuro Fujita

[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da7d44b627ba839de32c9409aca659f60324de76




Re: Yet another issue with step generation in partition pruning

2020-08-07 Thread Etsuro Fujita
On Fri, Aug 7, 2020 at 2:55 PM Etsuro Fujita  wrote:
> On Thu, Aug 6, 2020 at 12:20 AM Etsuro Fujita  wrote:
> > Will push the patch tomorrow.
>
> Done.  (I didn't have time for this, because I was terribly busy with
> other stuff.)

I mean I didn't have time for this *yesterday*.

Best regards,
Etsuro Fujita




Re: [PG13] Planning (time + buffers) data structure in explain plan (format text)

2020-08-07 Thread Julien Rouhaud
Hi,

On Fri, Aug 7, 2020 at 2:30 PM Pierre Giraud  wrote:
>
> Hi all,
>
> As far as I understand, in the upcoming version 13, information about
> buffers used during planning is now available in the explain plan.

Indeed.

> […]
>  Planning Time: 0.203 ms
>Buffers: shared hit=14
> […]
>
> In the JSON format, the data structure is a bit different:
>
> […]
>  "Planning": {
>"Planning Time": 0.533,
>"Shared Hit Blocks": 14,
>"Shared Read Blocks": 0,
>"Shared Dirtied Blocks": 0,
>"Shared Written Blocks": 0,
>"Local Hit Blocks": 0,
>"Local Read Blocks": 0,
>"Local Dirtied Blocks": 0,
>"Local Written Blocks": 0,
>"Temp Read Blocks": 0,
>"Temp Written Blocks": 0
>  },
> […]
>
> For a matter of consistency, I wonder if it would be possible to format
> it like the following:
>
> […]
>  Planning:
>Planning Time: 0.203 ms
>Buffers: shared hit=14
> […]

I agree that this output looks more consistent with other output,
including JIT as you mentioned.  I'll send a patch for that if there's
no objection.

Out of curiosity, is the current text output actually harder to parse
than the one you're suggesting?




Re: LSM tree for Postgres

2020-08-07 Thread Alexander Korotkov
ср, 5 авг. 2020 г., 09:13 Konstantin Knizhnik :
> Concerning degrade of basic index - B-Tree itself is balanced tree. Yes,
> insertion of random keys can cause split of B-Tree page.
> In the worst case half of B-Tree page will be empty. So B-Tree size will
> be two times larger than ideal tree.
> It may cause degrade up to two times. But that is all. There should not
> be infinite degrade of speed tending to zero.

My concerns are not just about space utilization.  My main concern is
about the order of the pages.  After the first merge the base index
will be filled in key order.  So physical page ordering perfectly
matches their logical ordering.  After the second merge some pages of
base index splits, and new pages are added to the end of the index.
Splits also happen in key order.  So, now physical and logical
orderings match within two extents corresponding to first and second
merges, but not within the whole tree.  While there are only few such
extents, disk page reads may in fact be mostly sequential, thanks to
OS cache and readahead.  But finally, after many merges, we can end up
with mostly random page reads.  For instance, leveldb doesn't have a
problem of ordering degradation, because it stores levels in sorted
files.

--
Regards,
Alexander Korotkov




[PG13] Planning (time + buffers) data structure in explain plan (format text)

2020-08-07 Thread Pierre Giraud
Hi all,

As far as I understand, in the upcoming version 13, information about
buffers used during planning is now available in the explain plan.

[…]
 Planning Time: 0.203 ms
   Buffers: shared hit=14
[…]

In the JSON format, the data structure is a bit different:

[…]
 "Planning": {
   "Planning Time": 0.533,
   "Shared Hit Blocks": 14,
   "Shared Read Blocks": 0,
   "Shared Dirtied Blocks": 0,
   "Shared Written Blocks": 0,
   "Local Hit Blocks": 0,
   "Local Read Blocks": 0,
   "Local Dirtied Blocks": 0,
   "Local Written Blocks": 0,
   "Temp Read Blocks": 0,
   "Temp Written Blocks": 0
 },
[…]

For a matter of consistency, I wonder if it would be possible to format
it like the following:

[…]
 Planning:
   Planning Time: 0.203 ms
   Buffers: shared hit=14
[…]


Note: a similar way to format information is already used for JIT.

[…]
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true,
Deforming true
   Timing: Generation 0.340 ms, Inlining 0.000 ms, Optimization 0.168
ms, Emission 1.907 ms, Total 2.414 ms
[…]

Regards,
Pierre




Re: ModifyTable overheads in generic plans

2020-08-07 Thread Amit Langote
On Tue, Aug 4, 2020 at 3:15 PM Amit Langote  wrote:
> On Sat, Aug 1, 2020 at 4:46 AM Robert Haas  wrote:
> > On Fri, Jun 26, 2020 at 8:36 AM Amit Langote  
> > wrote:
> > > 0001 and 0002 are preparatory patches.
> >
> > I read through these patches a bit but it's really unclear what the
> > point of them is. I think they need better commit messages, or better
> > comments, or both.
>
> Thanks for taking a look.  Sorry about the lack of good commentary,
> which I have tried to address in the attached updated version. I
> extracted one more part as preparatory from the earlier 0003 patch, so
> there are 4 patches now.
>
> Also as discussed with Daniel, I have changed the patches so that they
> can be applied on plain HEAD instead of having to first apply the
> patches at [1].  Without runtime pruning for UPDATE/DELETE proposed in
> [1], optimizing ResultRelInfo creation by itself does not improve the
> performance/scalability by that much, but the benefit of lazily
> creating ResultRelInfos seems clear so I think maybe it's okay to
> pursue this independently.

Per cfbot's automatic patch tester, there were some issues in the 0004 patch:

nodeModifyTable.c: In function ‘ExecModifyTable’:
nodeModifyTable.c:2484:24: error: ‘junkfilter’ may be used
uninitialized in this function [-Werror=maybe-uninitialized]
   junkfilter->jf_junkAttNo,
   ^
nodeModifyTable.c:2309:14: note: ‘junkfilter’ was declared here
   JunkFilter *junkfilter;
   ^
cc1: all warnings being treated as errors
: recipe for target 'nodeModifyTable.o' failed
make[3]: *** [nodeModifyTable.o] Error 1

Fixed in the attached updated version.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


v4-0003-Revise-child-to-root-tuple-conversion-map-managem.patch
Description: Binary data


v4-0002-Don-t-make-root-ResultRelInfo-for-insert-queries.patch
Description: Binary data


v4-0004-Initialize-result-relation-information-lazily.patch
Description: Binary data


v4-0001-Revise-how-FDWs-obtain-result-relation-informatio.patch
Description: Binary data


Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Tomas Vondra

On Fri, Aug 07, 2020 at 10:08:23AM +0300, Konstantin Knizhnik wrote:



On 07.08.2020 00:33, Tomas Vondra wrote:


Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding non-trivial amount of code
to a place that can be quite hot.


Sorry that I have not explained my test scenarios.
Since Postgres is a pgbench-oriented database :) I have also used pgbench:
the read-only case and skip-some updates.
For this patch the most critical factor is the number of buffer allocations,
so I used a small enough database (scale=100), but shared buffers were set
to 1Gb.
As a result, all data is cached in memory (in the file system cache), but
there is intensive swapping at the Postgres buffer manager level.
I have tested it both with a relatively small (100) and a large (1000)
number of clients.


I repeated these tests on my notebook (quadcore, 16Gb RAM, SSD) and on an IBM
Power2 server with about 380 virtual cores and about 1Tb of memory.
In the last case the results vary very much (I think because of the NUMA
architecture), but I failed to find any noticeable regression of the
patched version.




IMO using such high numbers of clients is pointless - it's perfectly
fine to test just a single client, and the 'basic overhead' should be
visible. It might have some impact on concurrency, but I think that's
just a secondary effect. In fact, I wouldn't be surprised if
high client counts made it harder to observe the overhead, due to
concurrency problems (I doubt you have a laptop with this many cores).

Another thing you might try doing is using taskset to attach processes
to particular CPU cores, and also make sure there's no undesirable
influence from CPU power management etc. Laptops are very problematic in
this regard, but even servers can have that enabled in BIOS.



But I have to agree that adding a parallel hash (in addition to the existing
buffer manager hash) is not such a good idea.

This cache really quite frequently becomes a bottleneck.
My explanation of why I have not observed any noticeable regression
was that this patch uses almost the same lock partitioning schema
as the one already used, so it adds not so many new conflicts. Maybe in the
case of the Power2 server, the overhead of NUMA is much higher than other
factors (although a shared hash is one of the main things suffering from a
NUMA architecture).
But in principle I agree that having two independent caches may
decrease speed up to two times (or even more).


I hope that everybody will agree that this problem is really critical.
It is certainly not the most common case to have hundreds of
relations which are frequently truncated. But having quadratic
complexity in the drop function is not acceptable from my point of view.
And it is not only a recovery-specific problem, which is why a solution
with a local cache is not enough.




Well, ultimately it's a balancing act - we need to consider the risk of
regressions vs. how common the improved scenario is. I've seen multiple
applications that e.g. drop many relations (after all, that's why I
optimized that in 9.3) so it's not an entirely bogus case.


I do not know a good solution to the problem. Just some thoughts.
- We can somehow combine the locking used for the main buffer manager cache
(by relid/blockno) and the cache by relid. It would eliminate the double
locking overhead.
- We can use something like a sorted tree (like std::map) instead of a
hash - it would allow locating blocks both by relid/blockno and by
relid only.




I don't know. I think the ultimate problem here is that we're adding
code to a fairly hot codepath - it does not matter if it's a hash, a list,
std::map or something else. All of that has overhead.

That's the beauty of Andres' proposal to just loop over the blocks of
the relation and evict them one by one - that adds absolutely nothing to
BufferAlloc.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[PATCH] Covering SPGiST index

2020-08-07 Thread Pavel Borisov
Hi, hackers!

I'd like to propose a patch which introduces the ability to include
additional columns in an SPGiST index, to speed up queries referencing
them by making the scans index-only in this case. To date this
functionality has been available in GiST and btree, and I suppose the same is
useful in SPGiST also.

A few words on the realisation:

1. The patch is intended to be fully compatible with previous SPGiST
indexes, so the SPGiST leaf tuple structure remains unchanged up to the end
of the key attribute. All changes are introduced only after it. Internal tuples
are not changed at all.

2. Included data is added in a form very similar to a heap tuple, but unlike
the latter it does not need to start at a MAXALIGN boundary. I.e. the nulls
mask (if it exists) starts just after the key value (it doesn't need
alignment). Each of the included attributes starts at its own typealign
boundary. The goal is to make leaf tuples, and therefore the index, more compact.

3. The leaf tuple header is modified to store additional per-tuple flags:
a) whether a nullmask is present - i.e. there is at least one null value among
the included attributes of a tuple.
(Note that this nullmask applies only to included attributes, as null
handling for key attributes is already realised in SPGiST by placing
leaves with null keys in a separate list, not in the main index tree.)
b) whether there are variable-length values among the included attributes. If
there are none and the key attribute is also fixed-length (e.g. kd-tree,
quad-tree, etc.), then leaf tuple processing can be sped up using attcacheoff.

These bits are incorporated into the unused higher bits of nextOffset in the
SPGiST leaf tuple header. Even if we have 64Kb pages and tuples of a minimum of
12 bytes (the length of the header on 32-bit architectures) + 4 bytes of
ItemIdData, 14 bits for nextOffset are more than enough.
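
For illustration only (the macro and function names below are invented for
this sketch, not taken from the patch), the packing could look roughly like
this:

#include <stdbool.h>
#include <stdint.h>

#define LEAF_OFFSET_MASK      0x3FFF   /* low 14 bits: offset of next leaf tuple */
#define LEAF_HAS_NULL_MASK    0x4000   /* some included attribute is NULL */
#define LEAF_HAS_VARLENA_MASK 0x8000   /* some included attribute is variable-length */

static inline uint16_t
leaf_get_next_offset(uint16_t word)
{
    return word & LEAF_OFFSET_MASK;
}

static inline bool
leaf_has_included_nulls(uint16_t word)
{
    return (word & LEAF_HAS_NULL_MASK) != 0;
}

static inline uint16_t
leaf_set_next_offset(uint16_t word, uint16_t offset)
{
    return (uint16_t) ((word & ~LEAF_OFFSET_MASK) | (offset & LEAF_OFFSET_MASK));
}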

All these changes only affect private index structures, so all outside
behavior like WAL, vacuum etc. will remain unchanged.

As usual I very much appreciate your feedback

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com 


spgist-covering-0001.diff
Description: Binary data


Re: [Proposal] Global temporary tables

2020-08-07 Thread movead...@highgo.ca

>I find this is the most latest mail with an attachment, so I test and reply on
>this thread, several points as below:

>1. I notice it produces new relfilenode when new session login and some
>data insert. But the relfilenode column in pg_class still the one when create
>the global temp table. I think you can try to show 0 in this area as what nail
>relation does. 
>I think getting the GTT to have a default relfilenode looks closer to the 
>existing implementation, and setting it to 0 requires extra work and has no 
>clear benefit.
>What do you think?
>I'd like to know the reasons for your suggestion.
The 'relfilenode' means the file number on disk, which is different from the
oid of a relation; the default one is wrong for a GTT, so I think it's not so
good to show it in pg_class.

>2. The nail relations handle their relfilenodes by RelMapFile struct, and this
>patch use hash entry and relfilenode_list, maybe RelMapFile approach more
>understandable in my opinion. Sorry if I miss the real design for that.
>We can see the STORAGE and statistics info for the GTT, including relfilenode, 
>through view pg_gtt_relstats

postgres=# \d gtt
                Table "public.gtt"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 a      | integer |           |          | 
 b      | integer |           |          | 

postgres=# insert into gtt values(1,1);
INSERT 0 1
postgres=# select * from pg_gtt_relstats ;
 schemaname | tablename | relfilenode | relpages | reltuples | relallvisible | relfrozenxid | relminmxid 
------------+-----------+-------------+----------+-----------+---------------+--------------+------------
 public     | gtt       |       16384 |        0 |         0 |             0 |          532 |          1
(1 row)

postgres=# truncate gtt;
TRUNCATE TABLE
postgres=# select * from pg_gtt_relstats ;
 schemaname | tablename | relfilenode | relpages | reltuples | relallvisible | relfrozenxid | relminmxid 
------------+-----------+-------------+----------+-----------+---------------+--------------+------------
 public     | gtt       |       16387 |        0 |         0 |             0 |          533 |          1
(1 row)


I just suggested an approach which is maybe the most natural fit for the
existing code structure, and it's up to you.


>3. I get a wrong result of pg_relation_filepath() function for global temp 
>table,
>I think it's necessaryto keep this an correct output.

postgres=# select pg_relation_filepath(oid) from pg_class where relname = 'gtt';
 pg_relation_filepath 
--
 base/13835/t3_16384
(1 row)

I didn't find anything wrong. Could you please give me a demo.

In my opinion it should show 'base/13835/t3_16387' rather than
'base/13835/t3_16384', because the relfilenode changed to 16387 when you
truncated it in step 2.

>4. In gtt_search_by_relid() function, it has not handle the missing_ok argument
>if gtt_storage_local_hash is null. There should be some comments if it's the 
>right
>code.
>This is a problem that has been fixed in global_temporary_table_v34-pg13.patch.
Sorry about that, I cannot find it in the mail thread; maybe I missed something.
The mail thread is so long that I think it would be better to create a new
mail thread.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: walsender waiting_for_ping spuriously set

2020-08-07 Thread Ashutosh Bapat
The patch looks good to me. Thanks for improving comments around that code.
I like the change to set waiting_for_ping_response in WalSndKeepalive.
Thanks.

On Fri, 7 Aug 2020 at 04:26, Alvaro Herrera 
wrote:

> Ashutosh Bapat noticed that WalSndWaitForWal() is setting
> waiting_for_ping_response after sending a keepalive that does *not*
> request a reply.  The bad consequence is that other callers that do
> require a reply end up in not sending a keepalive, because they think it
> was already sent previously.  So the whole thing gets stuck.
>
> He found that commit 41d5f8ad734 failed to remove the setting of
> waiting_for_ping_response after changing the "request" parameter
> WalSndKeepalive from true to false; that seems to have been an omission
> and it breaks the algorithm.  Thread at [1].
>
> The simplest fix is just to remove the line that sets
> waiting_for_ping_response, but I think it is less error-prone to have
> WalSndKeepalive set the flag itself, instead of expecting its callers to
> do it (and know when to).  Patch attached.  Also rewords some related
> commentary.
>
> [1]
> https://postgr.es/m/flat/blu436-smtp25712b7ef9fc2adeb87c522dc...@phx.gbl
>
> --
> Álvaro Herrera   Valdivia, Chile
>


-- 
Best Wishes,
Ashutosh


Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-08-07 Thread Amit Langote
On Mon, Aug 3, 2020 at 8:38 PM Amit Langote  wrote:
> On Wed, Jul 29, 2020 at 5:36 PM Andrey V. Lepikhov
>  wrote:
> > > Will send more comments after reading the v5 patch.
> > >
> > Ok. I'll be waiting for the end of your review.
>
> Sorry about the late reply.
>
> If you'd like to send a new version for other reviewers, please feel
> free.  I haven't managed to take more than a brief look at the v5
> patch, but will try to look at it (or maybe the new version if you
> post) more closely this week.

I was playing around with v5 and I noticed an assertion failure which
I concluded is due to improper setting of ri_usesBulkModify.  You can
reproduce it with these steps.

create extension postgres_fdw;
create server lb foreign data wrapper postgres_fdw ;
create user mapping for current_user server lb;
create table foo (a int, b int) partition by list (a);
create table foo1 (like foo);
create foreign table ffoo1 partition of foo for values in (1) server
lb options (table_name 'foo1');
create table foo2 (like foo);
create foreign table ffoo2 partition of foo for values in (2) server
lb options (table_name 'foo2');
create function print_new_row() returns trigger language plpgsql as $$
begin raise notice '%', new; return new; end; $$;
create trigger ffoo1_br_trig before insert on ffoo1 for each row
execute function print_new_row();
copy foo from stdin csv;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,2
>> 2,3
>> \.
NOTICE:  (1,2)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

#0  0x7f2d5e266337 in raise () from /lib64/libc.so.6
#1  0x7f2d5e267a28 in abort () from /lib64/libc.so.6
#2  0x00aafd5d in ExceptionalCondition
(conditionName=0x7f2d37b468d0 "!resultRelInfo->ri_usesBulkModify ||
resultRelInfo->ri_FdwRoutine->BeginForeignCopyIn == NULL",
errorType=0x7f2d37b46680 "FailedAssertion",
fileName=0x7f2d37b4677f "postgres_fdw.c", lineNumber=1863) at
assert.c:67
#3  0x7f2d37b3b0fe in postgresExecForeignInsert (estate=0x2456320,
resultRelInfo=0x23a8f58, slot=0x23a9480, planSlot=0x0) at
postgres_fdw.c:1862
#4  0x0066362a in CopyFrom (cstate=0x23a8d40) at copy.c:3331

The problem is that partition ffoo1's BR trigger prevents it from
using multi-insert, but its ResultRelInfo.ri_usesBulkModify is true,
which is copied from its parent.  We should really check the same
things for a partition that CopyFrom() checks for the main target
relation (root parent) when deciding whether to use multi-insert.

However instead of duplicating the same logic to do so in two places
(CopyFrom and ExecInitPartitionInfo), I think it might be a good idea
to refactor the code to decide if multi-insert mode can be used for a
given relation by checking its properties and put it in some place
that both the main target relation and partitions need to invoke.
InitResultRelInfo() seems to be one such place.
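
For example, a hypothetical helper along these lines (the function name is
mine, and only a couple of the conditions CopyFrom() checks are shown;
BeginForeignCopyIn is the FDW callback added by this patch set):

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "nodes/execnodes.h"

/* Sketch only: not part of v5; real logic would mirror all of CopyFrom()'s checks. */
static bool
ExecRelationAllowsMultiInsert(ResultRelInfo *rri)
{
	/* BEFORE/INSTEAD OF row triggers must see one row at a time */
	if (rri->ri_TrigDesc &&
		(rri->ri_TrigDesc->trig_insert_before_row ||
		 rri->ri_TrigDesc->trig_insert_instead_row))
		return false;

	/* foreign tables qualify only if the FDW provides the bulk-copy API */
	if (rri->ri_FdwRoutine &&
		rri->ri_FdwRoutine->BeginForeignCopyIn == NULL)
		return false;

	return true;
}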

Also, it might be a good idea to use ri_usesBulkModify more generally
than only for foreign relations as the patch currently does, because I
can see that it can replace the variable insertMethod in CopyFrom().
Having both insertMethod and ri_usesBulkModify in each ResultRelInfo
seems confusing and bug-prone.

Finally, I suggest renaming ri_usesBulkModify to ri_usesMultiInsert to
reflect its scope.

Please check the attached delta patch that applies on top of v5 to see
what that would look like.

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com


ri_usesMultiInsert.patch
Description: Binary data


Re: Parallel worker hangs while handling errors.

2020-08-07 Thread Bharath Rupireddy
On Fri, Aug 7, 2020 at 1:34 AM Robert Haas  wrote:
>
> On Tue, Jul 28, 2020 at 5:35 AM Bharath Rupireddy
>  wrote:
> > The v4 patch looks good to me. Hang is not seen, make check and make
> > check-world passes. I moved this to the committer for further review
> > in https://commitfest.postgresql.org/29/2636/.
>
> I don't think I agree with this approach. In particular, I don't
> understand the rationale for unblocking only SIGUSR1. Above, Vignesh
> says that he feels that unblocking only that signal would be the right
> approach, but no reason is given. I have two reasons why I suspect
> it's not the right approach. One, it doesn't seem to be what we do
> elsewhere; the only existing cases where we have special handling for
> particular signals are SIGQUIT and SIGPIPE, and those places have
> comments explaining the reason why they are handled in a special way.
>

We intend to unblock SIGQUIT before sigsetjmp() in places like
bgwriter, checkpointer, walwriter and walreceiver, but we only call
sigdelset(&BlockSig, SIGQUIT);. Without PG_SETMASK(&BlockSig);, it seems
like we are not actually unblocking SIGQUIT, and quickdie() will never
get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
0){}. If the postmaster sends a SIGQUIT while these processes are
doing cleanup tasks in the sigsetjmp() block, it will not be received, and the
postmaster later sends SIGKILL to kill them from the place below.

/*
 * If we already sent SIGQUIT to children and they are slow to shut
 * down, it's time to send them SIGKILL.  This doesn't happen
 * normally, but under certain conditions backends can get stuck while
 * shutting down.  This is a last measure to get them unwedged.
 *
 * Note we also do this during recovery from a process crash.
 */
if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
AbortStartTime != 0 &&
(now - AbortStartTime) >= SIGKILL_CHILDREN_AFTER_SECS)
{
/* We were gentle with them before. Not anymore */
TerminateChildren(SIGKILL);
/* reset flag so we don't SIGKILL again */
AbortStartTime = 0;
}

Shouldn't we call PG_SETMASK(&BlockSig); to make it effective?

Am I missing anything here?
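
To illustrate the point with plain C (a standalone toy using sigprocmask()
directly, not the actual bgwriter.c code): sigdelset() only edits the local
signal set, and nothing changes until the modified set is actually installed.

#include <signal.h>
#include <stdio.h>

int
main(void)
{
    sigset_t block, cur;

    /* Block SIGQUIT, as the startup code of these processes does */
    sigemptyset(&block);
    sigaddset(&block, SIGQUIT);
    sigprocmask(SIG_BLOCK, &block, NULL);

    /* Remove SIGQUIT from the local copy only -- like sigdelset(&BlockSig, SIGQUIT) */
    sigdelset(&block, SIGQUIT);

    sigprocmask(SIG_SETMASK, NULL, &cur);
    printf("after sigdelset only: SIGQUIT blocked = %d\n", sigismember(&cur, SIGQUIT));

    /* Now actually install the modified mask -- like PG_SETMASK(&BlockSig) */
    sigprocmask(SIG_SETMASK, &block, NULL);

    sigprocmask(SIG_SETMASK, NULL, &cur);
    printf("after sigprocmask:    SIGQUIT blocked = %d\n", sigismember(&cur, SIGQUIT));

    return 0;
}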

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread k.jami...@fujitsu.com
On Friday, August 7, 2020 12:38 PM, Amit Kapila wrote:
Hi,
> On Thu, Aug 6, 2020 at 6:53 AM k.jami...@fujitsu.com 
> wrote:
> >
> > On Saturday, August 1, 2020 5:24 AM, Andres Freund wrote:
> >
> > Hi,
> > Thank you for your constructive review and comments.
> > Sorry for the late reply.
> >
> > > Hi,
> > >
> > > On 2020-07-31 15:50:04 -0400, Tom Lane wrote:
> > > > Andres Freund  writes:
> > > > > Indeed. The buffer mapping hashtable already is visible as a
> > > > > major bottleneck in a number of workloads. Even in readonly
> > > > > pgbench if s_b is large enough (so the hashtable is larger than
> > > > > the cache). Not to speak of things like a cached sequential scan
> > > > > with a cheap qual and wide
> > > rows.
> > > >
> > > > To be fair, the added overhead is in buffer allocation not buffer 
> > > > lookup.
> > > > So it shouldn't add cost to fully-cached cases.  As Tomas noted
> > > > upthread, the potential trouble spot is where the working set is
> > > > bigger than shared buffers but still fits in RAM (so there's no
> > > > actual I/O needed, but we do still have to shuffle buffers a lot).
> > >
> > > Oh, right, not sure what I was thinking.
> > >
> > >
> > > > > Wonder if the temporary fix is just to do explicit hashtable
> > > > > probes for all pages iff the size of the relation is < s_b / 500 or 
> > > > > so.
> > > > > That'll address the case where small tables are frequently
> > > > > dropped - and dropping large relations is more expensive from
> > > > > the OS and data loading perspective, so it's not gonna happen as 
> > > > > often.
> > > >
> > > > Oooh, interesting idea.  We'd need a reliable idea of how long the
> > > > relation had been (preferably without adding an lseek call), but
> > > > maybe that's do-able.
> > >
> > > IIRC we already do smgrnblocks nearby, when doing the truncation (to
> > > figure out which segments we need to remove). Perhaps we can arrange
> > > to combine the two? The layering probably makes that somewhat ugly
> > > :(
> > >
> > > We could also just use pg_class.relpages. It'll probably mostly be
> > > accurate enough?
> > >
> > > Or we could just cache the result of the last smgrnblocks call...
> > >
> > >
> > > One of the cases where this type of strategy is most intersting to
> > > me is the partial truncations that autovacuum does... There we even
> > > know the range of tables ahead of time.
> >
> > Konstantin tested it on various workloads and saw no regression.
> > But I understand the sentiment on the added overhead on BufferAlloc.
> > Regarding the case where the patch would potentially affect workloads
> > that fit into RAM but not into shared buffers, could one of Andres'
> > suggested idea/s above address that, in addition to this patch's
> > possible shared invalidation fix? Could that settle the added overhead in
> BufferAlloc() as temporary fix?
> >
> 
> Yes, I think so. Because as far as I can understand he is suggesting to do 
> changes
> only in the Truncate/Vacuum code path. Basically, I think you need to change
> DropRelFileNodeBuffers or similar functions.
> There shouldn't be any change in the BufferAlloc or the common code path, so
> there is no question of regression in such cases. I am not sure what you have 
> in
> mind for this but feel free to clarify your understanding before 
> implementation.
>
> > Thomas Munro is also working on caching relation sizes [1], maybe that
> > way we could get the latest known relation size. Currently, it's
> > possible only during recovery in smgrnblocks.
> >
> > Tom Lane wrote:
> > > But aside from that, I noted a number of things I didn't like a bit:
> > >
> > > * The amount of new shared memory this needs seems several orders of
> > > magnitude higher than what I'd call acceptable: according to my
> > > measurements it's over 10KB per shared buffer!  Most of that is
> > > going into the CachedBufTableLock data structure, which seems
> > > fundamentally misdesigned --- how could we be needing a lock per map
> > > partition *per buffer*?  For comparison, the space used by
> > > buf_table.c is about 56 bytes per shared buffer; I think this needs to 
> > > stay at
> least within hailing distance of there.
> > >
> > > * It is fairly suspicious that the new data structure is manipulated
> > > while holding per-partition locks for the existing buffer hashtable.
> > > At best that seems bad for concurrency, and at worst it could result
> > > in deadlocks, because I doubt we can assume that the new hash table
> > > has partition boundaries identical to the old one.
> > >
> > > * More generally, it seems like really poor design that this has
> > > been written completely independently of the existing buffer hash table.
> > > Can't we get any benefit by merging them somehow?
> >
> > The original aim is to just shorten the recovery process, and
> > eventually the speedup on both vacuum and truncate process are just added
> bonus.
> > Given that we don't have a shared invalidation mechanism in place 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-08-07 Thread Dilip Kumar
On Thu, Aug 6, 2020 at 2:43 PM Amit Kapila  wrote:
>
> On Wed, Aug 5, 2020 at 7:37 PM Dilip Kumar  wrote:
> >
> > On Wed, Aug 5, 2020 at 6:25 PM Amit Kapila  wrote:
> > >
> >
> > > Can we add a test for incomplete changes (probably with toast
> > > insertion but we can do it for spec_insert case as well) in
> > > ReorderBuffer such that it needs to first serialize the changes and
> > > then stream it?  I have manually verified such scenarios but it is
> > > good to have the test for the same.
> >
> > I have added a new test for the same in the stream.sql file.
> >
>
> Thanks, I have slightly changed the test so that we can consume DDL
> changes separately.  I have made a number of other adjustments like
> changing few more comments (to make them consistent with nearby
> comments), removed unnecessary inclusion of header file, ran pgindent.
> The next patch (v47-0001-Implement-streaming-mode-in-ReorderBuffer) in
> this series looks good to me.  I am planning to push it after one more
> read-through unless you or anyone else has any comments on the same.
> The patch I am talking about has the following functionality:
>
> Implement streaming mode in ReorderBuffer. Instead of serializing the
> transaction to disk after reaching the logical_decoding_work_mem limit
> in memory, we consume the changes we have in memory and invoke stream
> API methods added by commit 45fdc9738b. However, sometimes if we have
> incomplete toast or speculative insert we spill to the disk because we
> can't stream till we have the complete tuple.  And, as soon as we get
> the complete tuple we stream the transaction including the serialized
> changes. Now that we can stream in-progress transactions, the
> concurrent aborts may cause failures when the output plugin consults
> catalogs (both system and user-defined). We handle such failures by
> returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table
> scan APIs to the backend or WALSender decoding a specific uncommitted
> transaction. The decoding logic on the receipt of such a sqlerrcode
> aborts the decoding of the current transaction and continues with the
> decoding of other transactions. We also provide a new option via SQL
> APIs to fetch the changes being streamed.
>
> This patch's functionality can be independently verified by SQL APIs

Your changes look fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Global temporary tables

2020-08-07 Thread 曾文旌


> 2020年7月31日 上午4:57,Robert Haas  写道:
> 
> On Thu, Jul 30, 2020 at 8:09 AM wenjing zeng  wrote:
>> Please continue to review the code.
> 
> This patch is pretty light on comments. Many of the new functions have
> no header comments, for example. There are comments here and there in
> the body of the new functions that are added, and in places where
> existing code is changed there are comments here and there, but
> overall it's not a whole lot. There's no documentation and no README,
> either. Since this adds a new feature and a bunch of new SQL-callable
> functions that interact with that feature, the feature itself should
> be documented, along with its limitations and the new SQL-callable
> functions that interact with it. I think there should be either a
> lengthy comment in some suitable file, or maybe various comments in
> various files, or else a README file, that clearly sets out the major
> design principles behind the patch, and explaining also what that
> means in terms of features and limitations. Without that, it's really
> hard for anyone to jump into reviewing this code, and it will be hard
> for people who have to maintain it in the future to understand it,
> either. Or for users, for that matter.
Your suggestion is to the point. The patch does lack a lot of comments, which
are necessary. I'll add them.


Wenjing



> 
> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



smime.p7s
Description: S/MIME cryptographic signature


Re: [Proposal] Global temporary tables

2020-08-07 Thread 曾文旌
Thank you very much for reviewing this patch.
This is very important to improve the GTT.

> 2020年8月3日 下午3:09,movead...@highgo.ca 写道:
> 
> 
> >Fixed in global_temporary_table_v29-pg13.patch
> >Please check.
> 
> I find this is the most latest mail with an attachment, so I test and reply on
> this thread, several points as below:
> 
> 1. I notice it produces new relfilenode when new session login and some
> data insert. But the relfilenode column in pg_class still the one when create
> the global temp table. I think you can try to show 0 in this area as what nail
> relation does. 
I think getting the GTT to have a default relfilenode looks closer to the 
existing implementation, and setting it to 0 requires extra work and has no 
clear benefit.
What do you think?
I'd like to know the reasons for your suggestion.

> 
> 2. The nail relations handle their relfilenodes by RelMapFile struct, and this
> patch use hash entry and relfilenode_list, maybe RelMapFile approach more
> understandable in my opinion. Sorry if I miss the real design for that.
We can see the STORAGE and statistics info for the GTT, including relfilenode, 
through view pg_gtt_relstats

postgres=# \d gtt
                Table "public.gtt"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 a      | integer |           |          | 
 b      | integer |           |          | 

postgres=# insert into gtt values(1,1);
INSERT 0 1
postgres=# select * from pg_gtt_relstats ;
 schemaname | tablename | relfilenode | relpages | reltuples | relallvisible | relfrozenxid | relminmxid 
------------+-----------+-------------+----------+-----------+---------------+--------------+------------
 public     | gtt       |       16384 |        0 |         0 |             0 |          532 |          1
(1 row)

postgres=# truncate gtt;
TRUNCATE TABLE
postgres=# select * from pg_gtt_relstats ;
 schemaname | tablename | relfilenode | relpages | reltuples | relallvisible | relfrozenxid | relminmxid 
------------+-----------+-------------+----------+-----------+---------------+--------------+------------
 public     | gtt       |       16387 |        0 |         0 |             0 |          533 |          1
(1 row)

> 
> 3. I get a wrong result of pg_relation_filepath() function for global temp 
> table,
> I think it's necessaryto keep this an correct output.

postgres=# select pg_relation_filepath(oid) from pg_class where relname = 'gtt';
 pg_relation_filepath 
--
 base/13835/t3_16384
(1 row)

I didn't find anything wrong. Could you please give me a demo.

> 
> 4. In gtt_search_by_relid() function, it has not handle the missing_ok 
> argument
> if gtt_storage_local_hash is null. There should be some comments if it's the 
> right
> code.
This is a problem that has been fixed in global_temporary_table_v34-pg13.patch.

> 
> 5. It's a long patch and hard to review, I think it will pretty good if it 
> can be
> divided into several subpatches with relatively independent subfunctions.
Thank you for your suggestion, and I am considering doing so, including adding 
comments.


Wenjing

> 
> Regards,
> Highgo Software (Canada/China/Pakistan) 
> URL : www.highgo.ca  
> EMAIL: mailto:movead(dot)li(at)highgo(dot)ca



smime.p7s
Description: S/MIME cryptographic signature


Re: Creating a function for exposing memory usage of backend process

2020-08-07 Thread Kasahara Tatsuhito
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

I tested the latest
patch (0007-Adding-a-function-exposing-memory-usage-of-local-backend.patch)
with the latest PG version (199cec9779504c08aaa8159c6308283156547409) and the
test passed.
It looks good to me.

The new status of this patch is: Ready for Committer


Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Ashutosh Sharma
Thanks Rajkumar for testing the patch.

Here are some additional test-cases that I would suggest you execute, if
possible:

1) You may try running the test-cases that you have executed so far
with SR setup and see if the changes are getting reflected on the
standby.

2) You may also try running some concurrent test-cases, e.g. try
running these functions with VACUUM or some other SQL commands
(preferably DML commands) in parallel.

3) See what happens when you pass some invalid tids (containing
invalid block or offset number) to these functions. You may also try
running these functions on the same tuple repeatedly and see the
behaviour.

...

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Thu, Aug 6, 2020 at 2:25 PM Rajkumar Raghuwanshi
 wrote:
>
> I have been doing some user-level testing of this feature, apart from sanity 
> test for extension and it's functions
>
> I have tried to corrupt tuples and then able to fix it using 
> heap_force_freeze/kill functions.
>
>
> --corrupt relfrozenxid, cause vacuum failed.
>
> update pg_class set relfrozenxid = (relfrozenxid::text::integer + 
> 10)::text::xid where relname = 'test_tbl';
>
> UPDATE 1
>
> insert into test_tbl values (2, 'BBB');
>
>
> postgres=# vacuum test_tbl;
>
> ERROR:  found xmin 507 from before relfrozenxid 516
>
> CONTEXT:  while scanning block 0 of relation "public.test_tbl"
>
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-----+-------+------+------
>
>  1 | AAA | (0,1) |  505 |0
>
>  2 | BBB | (0,2) |  507 |0
>
> (2 rows)
>
>
> --fixed using heap_force_freeze
>
> postgres=# select heap_force_freeze('test_tbl'::regclass, 
> ARRAY['(0,2)']::tid[]);
>
>  heap_force_freeze
>
> -------------------
>
>
> postgres=# vacuum test_tbl;
>
> VACUUM
>
> postgres=# select *, ctid, xmin, xmax from test_tbl;
>
>  a |  b  | ctid  | xmin | xmax
>
> ---+-----+-------+------+------
>
>  1 | AAA | (0,1) |  505 |0
>
>  2 | BBB | (0,2) |2 |0
>
> (2 rows)
>
>
> --corrupt table headers in base/oid. file, cause table access failed.
>
> postgres=# select ctid, * from test_tbl;
>
> ERROR:  could not access status of transaction 4294967295
>
> DETAIL:  Could not open file "pg_xact/0FFF": No such file or directory.
>
>
> --removed corrupted tuple using heap_force_kill
>
> postgres=# select heap_force_kill('test_tbl'::regclass, 
> ARRAY['(0,2)']::tid[]);
>
>  heap_force_kill
>
> -
>
>
>
> (1 row)
>
>
> postgres=# select ctid, * from test_tbl;
>
>  ctid  | a |  b
>
> -------+---+-----
>
>  (0,1) | 1 | AAA
>
> (1 row)
>
>
> I will be continuing with my testing with the latest patch and update here if 
> found anything.
>
>
> Thanks & Regards,
> Rajkumar Raghuwanshi
>
>
> On Thu, Aug 6, 2020 at 1:42 PM Masahiko Sawada 
>  wrote:
>>
>> On Wed, 5 Aug 2020 at 22:42, Ashutosh Sharma  wrote:
>> >
>> > Hi Robert,
>> >
>> > Thanks for the review. Please find my comments inline:
>> >
>> > On Sat, Aug 1, 2020 at 12:18 AM Robert Haas  wrote:
>> > >
>> > > On Fri, Jul 31, 2020 at 8:52 AM Ashutosh Sharma  
>> > > wrote:
>> > > > Attached is the new version of patch that addresses the comments from 
>> > > > Andrey and Beena.
>> > >
>> > > +PGFILEDESC = "pg_surgery - perform surgery on the damaged heap table"
>> > >
>> > > the -> a
>> > >
>> > > I also suggest: heap table -> relation, because we might want to
>> > > extend this to other cases later.
>> > >
>> >
>> > Corrected.
>> >
>> > > +-- toast table.
>> > > +begin;
>> > > +create table ttab(a text);
>> > > +insert into ttab select string_agg(chr(floor(random() * 26)::int +
>> > > 65), '') from generate_series(1,1);
>> > > +select * from ttab where xmin = 2;
>> > > + a
>> > > +---
>> > > +(0 rows)
>> > > +
>> > > +select heap_force_freeze('ttab'::regclass, ARRAY['(0, 1)']::tid[]);
>> > > + heap_force_freeze
>> > > +---
>> > > +
>> > > +(1 row)
>> > > +
>> > >
>> > > I don't understand the point of this. You're not testing the function
>> > > on the TOAST table; you're testing it on the main table when there
>> > > happens to be a TOAST table that is probably getting used for
>> > > something. But that's not really relevant to what is being tested
>> > > here, so as written this seems redundant with the previous cases.
>> > >
>> >
>> > Yeah, it's being tested on the main table, not on a toast table. I've
>> > removed this test-case and also restricted direct access to the toast
>> > table using heap_force_kill/freeze functions. I think we shouldn't be
>> > using these functions to do any changes in the toast table. We will
>> > only use these functions with the main table and let VACUUM remove the
>> > corresponding data chunks (pointed by the tuple that got removed from
>> > the main table).
>> >
>> > Another option would be to identify all the data chunks corresponding
>> > to the tuple (ctid) being killed from the main table and remove them
>> > one by one. We will only do this if 

Re: [Patch] Optimize dropping of relation buffers using dlist

2020-08-07 Thread Konstantin Knizhnik




On 07.08.2020 00:33, Tomas Vondra wrote:


Unfortunately Konstantin did not share any details about what workloads
he tested, what config etc. But I find the "no regression" hypothesis
rather hard to believe, because we're adding non-trivial amount of code
to a place that can be quite hot.


Sorry that I have not explained my test scenarios.
Since Postgres is a pgbench-oriented database :) I have also used pgbench:
the read-only case and skip-some updates.
For this patch the most critical factor is the number of buffer allocations,
so I used a small enough database (scale=100), but shared buffers were set
to 1Gb.
As a result, all data is cached in memory (in the file system cache), but
there is intensive swapping at the Postgres buffer manager level.
I have tested it both with a relatively small (100) and a large (1000)
number of clients.
I repeated these tests on my notebook (quadcore, 16Gb RAM, SSD) and on an IBM
Power2 server with about 380 virtual cores and about 1Tb of memory.
In the last case the results vary very much (I think because of the NUMA
architecture), but I failed to find any noticeable regression of the patched
version.



But I have to agree that adding a parallel hash (in addition to the existing
buffer manager hash) is not such a good idea.

This cache really quite frequently becomes a bottleneck.
My explanation of why I have not observed any noticeable regression was
that this patch uses almost the same lock partitioning schema
as the one already used, so it adds not so many new conflicts. Maybe in the
case of the Power2 server, the overhead of NUMA is much higher than other
factors (although a shared hash is one of the main things suffering from a
NUMA architecture).
But in principle I agree that having two independent caches may decrease
speed up to two times (or even more).


I hope that everybody will agree that this problem is really critical.
It is certainly not the most common case to have hundreds of
relations which are frequently truncated. But having quadratic complexity
in the drop function is not acceptable from my point of view.
And it is not only a recovery-specific problem, which is why a solution with
a local cache is not enough.


I do not know a good solution to the problem. Just some thoughts.
- We can somehow combine the locking used for the main buffer manager cache (by
relid/blockno) and the cache by relid. It would eliminate the double locking
overhead.
- We can use something like a sorted tree (like std::map) instead of a hash
- it would allow locating blocks both by relid/blockno and by relid only.





Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-07 Thread Ashutosh Sharma
On Thu, Aug 6, 2020 at 9:19 PM Robert Haas  wrote:
>
> On Thu, Aug 6, 2020 at 2:11 AM Ashutosh Sharma  wrote:
> > Okay, If you want I can remove the restriction on a toast table, but,
> > then that means a user can directly remove the data chunks from the
> > toast table without changing anything in the main table. This means we
> > won't be able to query the main table as it will fail with an error
> > like "ERROR:  unexpected chunk number ...". So, we will have to find
> > some way to identify the pointer to the chunks that got deleted from
> > the toast table and remove that pointer from the main table. We also
> > need to make sure that before we remove a tuple (pointer) from the
> > main table, we identify all the remaining data chunks pointed by this
> > tuple and remove them completely only then that table would be
> > considered to be in a good state. Now, I am not sure if we can
> > currently do all these things.
>
> That's the user's problem. If they don't have a plan for that, they
> shouldn't use this tool. I don't think we can or should try to handle
> it in this code.
>

Okay, thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: Switch to multi-inserts for pg_depend

2020-08-07 Thread Michael Paquier
On Fri, Aug 07, 2020 at 03:16:19PM +0900, Michael Paquier wrote:
> I am adding this thread to the next commit fest.  Thoughts are
> welcome.

Forgot to mention that this is based on some initial work from Daniel,
and that we have discussed the issue before I worked on it.
--
Michael


signature.asc
Description: PGP signature


Switch to multi-inserts for pg_depend

2020-08-07 Thread Michael Paquier
Hi all,

This is a continuation of the work that has been previously discussed
here, resulting mainly in e3931d0 for pg_attribute and pg_shdepend:
https://www.postgresql.org/message-id/20190213182737.mxn6hkdxwrzgx...@alap3.anarazel.de

I have been looking at the amount of work that could be done
independently for pg_depend, and attached are two patches:
- 0001 switches recordMultipleDependencies() to use multi-inserts.
Contrary to pg_attribute and pg_shdepend, the number of items to
insert is known in advance, but some of them can be skipped if known
as a pinned dependency.  The data insertion is capped at 64kB, and the
number of slots is basically calculation from the maximum cap and the
number of items to insert.
- 0002 switches a bunch of code paths to make use of multi-inserts
instead of individual calls to recordDependencyOn(), grouping the
insertions of dependencies of the same kind.  This relies on the
existing set of APIs to manipulate a set of object addresses, without
any new addition there (no reset-like routine either as I noticed that
it would have been useful in only one place).  The set of changes is
honestly a bit bulky here.

I am adding this thread to the next commit fest.  Thoughts are
welcome.

Thanks,
--
Michael
From cd117fa88938c89ac953a5e3c036828337150b07 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Fri, 7 Aug 2020 10:57:40 +0900
Subject: [PATCH 1/2] Use multi-inserts for pg_depend

This is a follow-up of the work done in e3931d01.  This case is a bit
different than pg_attribute and pg_shdepend: the maximum number of items
to insert is known in advance, but there is no need to handle pinned
dependencies.  Hence, the base allocation for slots is done based on the
number of items and the maximum allowed with a cap at 64kB, and items
are initialized once used to minimize the overhead of the operation.

Author: Daniel Gustafsson, Michael Paquier
Discussion: https://postgr.es/m/XXX
---
 src/backend/catalog/pg_depend.c | 95 -
 1 file changed, 69 insertions(+), 26 deletions(-)

diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 70baf03178..596f0c5e29 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -47,6 +47,12 @@ recordDependencyOn(const ObjectAddress *depender,
 	recordMultipleDependencies(depender, referenced, 1, behavior);
 }
 
+/*
+ * Cap the maximum amount of bytes allocated for recordMultipleDependencies()
+ * slots.
+ */
+#define MAX_PGDEPEND_INSERT_BYTES	65535
+
 /*
  * Record multiple dependencies (of the same kind) for a single dependent
  * object.  This has a little less overhead than recording each separately.
@@ -59,10 +65,10 @@ recordMultipleDependencies(const ObjectAddress *depender,
 {
 	Relation	dependDesc;
 	CatalogIndexState indstate;
-	HeapTuple	tup;
-	int			i;
-	bool		nulls[Natts_pg_depend];
-	Datum		values[Natts_pg_depend];
+	int			slotCount, i;
+	TupleTableSlot **slot;
+	int			nslots, max_slots;
+	bool		slot_init = true;
 
 	if (nreferenced <= 0)
 		return;	/* nothing to do */
@@ -76,11 +82,18 @@ recordMultipleDependencies(const ObjectAddress *depender,
 
 	dependDesc = table_open(DependRelationId, RowExclusiveLock);
 
+	/*
+	 * Allocate the slots to use, but delay initialization until we know that
+	 * they will be used.
+	 */
+	max_slots = Min(nreferenced,
+	MAX_PGDEPEND_INSERT_BYTES / sizeof(FormData_pg_depend));
+	slot = palloc(sizeof(TupleTableSlot *) * max_slots);
+
 	/* Don't open indexes unless we need to make an update */
 	indstate = NULL;
 
-	memset(nulls, false, sizeof(nulls));
-
+	slotCount = 0;
 	for (i = 0; i < nreferenced; i++, referenced++)
 	{
 		/*
@@ -88,38 +101,68 @@ recordMultipleDependencies(const ObjectAddress *depender,
 		 * need to record dependencies on it.  This saves lots of space in
 		 * pg_depend, so it's worth the time taken to check.
 		 */
-		if (!isObjectPinned(referenced, dependDesc))
+		if (isObjectPinned(referenced, dependDesc))
+			continue;
+
+		if (slot_init)
+			slot[slotCount] = MakeSingleTupleTableSlot(RelationGetDescr(dependDesc),
+	   );
+
+		ExecClearTuple(slot[slotCount]);
+
+		/*
+		 * Record the Dependency.  Note we don't bother to check for duplicate
+		 * dependencies; there's no harm in them.
+		 */
+		slot[slotCount]->tts_values[Anum_pg_depend_refclassid - 1] = ObjectIdGetDatum(referenced->classId);
+		slot[slotCount]->tts_values[Anum_pg_depend_refobjid - 1] = ObjectIdGetDatum(referenced->objectId);
+		slot[slotCount]->tts_values[Anum_pg_depend_refobjsubid - 1] = Int32GetDatum(referenced->objectSubId);
+		slot[slotCount]->tts_values[Anum_pg_depend_deptype - 1] = CharGetDatum((char) behavior);
+		slot[slotCount]->tts_values[Anum_pg_depend_classid - 1] = ObjectIdGetDatum(depender->classId);
+		slot[slotCount]->tts_values[Anum_pg_depend_objid - 1] = ObjectIdGetDatum(depender->objectId);
+		slot[slotCount]->tts_values[Anum_pg_depend_objsubid - 1] = 

Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-07 Thread Asim Praveen


> On 05-Aug-2020, at 7:01 PM, Alvaro Herrera  wrote:
> 
> On 2020-Aug-05, Asim Praveen wrote:
> 
>> Please find attached a StringInfo based solution to this problem.  It
>> uses fgetln instead of fgets such that a line is read in full, without
>> ever splitting it.
> 
> never heard of fgetln, my system doesn't have a manpage for it, and we
> don't use it anywhere AFAICS.  Are you planning to add something to
> src/common for it?
> 

Indeed!  I noticed fgetln on the man page of fgets and used it without 
checking.  And this happened on a MacOS system.

Please find a revised version that uses fgetc instead.
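
Roughly, the idea is the following (a simplified sketch under the assumption
that the frontend StringInfo API is available, not the exact code in the
patch): read one full line with fgetc() into a StringInfo, so the line can be
arbitrarily long and is never split the way a fixed fgets() buffer splits it.

#include "postgres_fe.h"
#include "lib/stringinfo.h"

/* Returns false at EOF with no data read; the newline is kept, as fgets does. */
static bool
read_line(FILE *infile, StringInfo linebuf)
{
	int			c;

	resetStringInfo(linebuf);
	while ((c = fgetc(infile)) != EOF)
	{
		appendStringInfoChar(linebuf, (char) c);
		if (c == '\n')
			break;
	}
	return (linebuf->len > 0 || c != EOF);
}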

Asim



v2-0001-Use-a-stringInfo-instead-of-a-char-for-replace_st.patch
Description:  v2-0001-Use-a-stringInfo-instead-of-a-char-for-replace_st.patch


Re: Amcheck: do rightlink verification with lock coupling

2020-08-07 Thread Andrey M. Borodin



> 6 авг. 2020 г., в 21:38, Peter Geoghegan  написал(а):
> 
> On Wed, Aug 5, 2020 at 9:50 PM Andrey M. Borodin  wrote:
>> Sounds great! Thanks!
> 
> I'm afraid that there is another problem, this time with
> btree_xlog_split(). It's possible to get false positives when running
> the new test continually on a standby. You can see this by running
> verification on a standby continually, while the primary runs with a
> workload that gets many page splits.
Yes, I see the problem...
> 
> The only thing that we can do is adjust the locking in
> btree_xlog_split() to match the primary (kind of like commit 9a9db08a,
> except with page splits instead of page deletion). Attached is a
> revised version of the patch, along with the changes that we'd need to
> REDO to make the amcheck patch really work.
> 
> I'm not sure if this change to the REDO routine is worth the overhead
> or trouble, though. I have to think about it some more.
If we want to check relations between pages we must either apply them together
(under locks) or tolerate some fraction of false positives. I understand that
mitigating and tolerating false positives is nonsense in the mathematical
sense, but from a practical point of view it's just OK.

But having a complete solution with no false positives seems much better.

> 
> BTW, the first patch in the series now has a new check for page
> deletion -- that was missing from v4.
Yes, it seems like that was a bug.

Thanks!

Best regards, Andrey Borodin.