Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-05-16 Thread Ranier Vilela
Em sex., 15 de mai. de 2020 às 18:53, Alvaro Herrera <
alvhe...@2ndquadrant.com> escreveu:

> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think only adding the check would also help with reducing the
> > likelihood. How about the changes for the current HEAD I've attached?
>
> Pushed this to all branches.  (Branches 12 and older obviously needed an
> adjustment.)  Thanks!
>
> > Related to this behavior on btree indexes, this can happen even on
> > heaps during searching heap tuples. To reduce the likelihood of that
> > more generally I wonder if we can acquire a lock on buffer descriptor
> > right before XLogSaveBufferForHint() and set a flag to the buffer
> > descriptor that indicates that we're about to log FPI for hint bit so
> > that concurrent process can be aware of that.
>
> I'm not sure how that helps; the other process would have to go back and
> redo their whole operation from scratch in order to find out whether
> there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when scanning
> the page, you find a killable item, *then* upgrade the lock to exclusive
> and restart the scan.  This means that we'll have to wait for any other
> process that's doing the scan, and they will all give up their share
> lock to wait for the exclusive lock they need.  So the one that gets it
> first will do all the killing, log the page, then release the lock.  At
> that point the other processes will wake up and see that items have been
> killed, so they will return having done nothing.
>
Regarding the lock, I disagree in part, because in the worst case
the exclusive lock may only be requested for the last item analyzed, leading
to redoing all the work from the beginning.
If we are in _bt_killitems, it is because there is a high probability that
there will be items to be killed,
so why not request the exclusive lock right away, if these conditions are met?

1. XLogHintBitIsNeeded()
2. !AutoVacuumingActive()
3. A new configuration variable to opt in to taking the lock exclusively?

Masahiko reported that it occurs only when (autovacuum_enabled = off).

regards,
Ranier Vilela


avoid_killing_btree_items_already_dead_v2.patch
Description: Binary data


Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-05-16 Thread Ranier Vilela
Em sex., 15 de mai. de 2020 às 18:53, Alvaro Herrera <
alvhe...@2ndquadrant.com> escreveu:

> On 2020-Apr-10, Masahiko Sawada wrote:
>
> > Okay. I think only adding the check would also help with reducing the
> > likelihood. How about the changes for the current HEAD I've attached?
>
> Pushed this to all branches.  (Branches 12 and older obviously needed an
> adjustment.)  Thanks!
>
> > Related to this behavior on btree indexes, this can happen even on
> > heaps during searching heap tuples. To reduce the likelihood of that
> > more generally I wonder if we can acquire a lock on buffer descriptor
> > right before XLogSaveBufferForHint() and set a flag to the buffer
> > descriptor that indicates that we're about to log FPI for hint bit so
> > that concurrent process can be aware of that.
>
> I'm not sure how that helps; the other process would have to go back and
> redo their whole operation from scratch in order to find out whether
> there's still something alive that needs killing.
>
> I think you need to acquire the exclusive lock sooner: if, when scanning
> the page, you find a killable item, *then* upgrade the lock to exclusive
> and restart the scan.  This means that we'll have to wait for any other
> process that's doing the scan, and they will all give up their share
> lock to wait for the exclusive lock they need.  So the one that gets it
> first will do all the killing, log the page, then release the lock.  At
> that point the other processes will wake up and see that items have been
> killed, so they will return having done nothing.
>
> Like the attached.  I didn't verify that it works well or that it
> actually improves performance ...
>
This is not related to your latest patch,
but I believe the performance can be improved further.

Consider:
1. killedsomething is false;
2. for every item, killtuple is true but (!ItemIdIsDead(iid)) is false
   (the item is already dead);
3. then there is nothing to be done.

So why do all the work and then discard it?
We can eliminate the current item much earlier, by testing whether it is
already dead.

regards,
Ranier Vilela


avoid_killing_btree_items_aready_dead.patch
Description: Binary data


Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-05-15 Thread Alvaro Herrera
On 2020-Apr-10, Masahiko Sawada wrote:

> Okay. I think only adding the check would also help with reducing the
> likelihood. How about the changes for the current HEAD I've attached?

Pushed this to all branches.  (Branches 12 and older obviously needed an
adjustment.)  Thanks!

> Related to this behavior on btree indexes, this can happen even on
> heaps during searching heap tuples. To reduce the likelihood of that
> more generally I wonder if we can acquire a lock on buffer descriptor
> right before XLogSaveBufferForHint() and set a flag to the buffer
> descriptor that indicates that we're about to log FPI for hint bit so
> that concurrent process can be aware of that.

I'm not sure how that helps; the other process would have to go back and
redo their whole operation from scratch in order to find out whether
there's still something alive that needs killing.

I think you need to acquire the exclusive lock sooner: if, when scanning
the page, you find a killable item, *then* upgrade the lock to exclusive
and restart the scan.  This means that we'll have to wait for any other
process that's doing the scan, and they will all give up their share
lock to wait for the exclusive lock they need.  So the one that gets it
first will do all the killing, log the page, then release the lock.  At
that point the other processes will wake up and see that items have been
killed, so they will return having done nothing.

Like the attached.  I didn't verify that it works well or that it
actually improves performance ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1429ac8b63..95ba3cc401 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1726,6 +1726,7 @@ _bt_killitems(IndexScanDesc scan)
 	int			numKilled = so->numKilled;
 	bool		killedsomething = false;
 	bool		droppedpin PG_USED_FOR_ASSERTS_ONLY;
+	bool		have_exclusive = false;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1767,6 +1768,7 @@ _bt_killitems(IndexScanDesc scan)
 		}
 	}
 
+restart:
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -1864,6 +1866,19 @@ _bt_killitems(IndexScanDesc scan)
 			 */
 			if (killtuple && !ItemIdIsDead(iid))
 			{
+				if (!have_exclusive && XLogHintBitIsNeeded())
+				{
+					/*
+					 * If wal_log_hints is enabled, we only want to do this
+					 * with an exclusive lock, so exchange our lock and
+					 * restart from the top.
+					 */
+					LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+					LockBuffer(so->currPos.buf, BT_WRITE);
+					have_exclusive = true;
+					goto restart;
+				}
+
 				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;


Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-10 Thread James Coleman
On Thu, Apr 9, 2020 at 10:08 PM Peter Geoghegan  wrote:
>
> On Thu, Apr 9, 2020 at 6:47 PM James Coleman  wrote:
> > I believe the write pattern to this table likely looks like:
> > - INSERT
> > - UPDATE
> > - DELETE
> > for every row. But tomorrow I can do some more digging if needed.
>
> The pg_stats.null_frac for the column/index might be interesting here. I
> believe that Active Record will sometimes generate created_at columns
> that sometimes end up containing NULL values. Not sure why.

null_frac is 0 for created_at (what I expected). Also (under current
data) all created_at values are unique except a single row duplicate.

That being said, remember the write pattern above: every row gets
deleted eventually, so there'd be a lot of dead tuples overall.

James




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Kyotaro Horiguchi
At Fri, 10 Apr 2020 12:32:31 +0900, Masahiko Sawada 
 wrote in 
> On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan  wrote:
> >
> > On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
> >  wrote:
> > > Here is the reproducer:
> >
> > What version of Postgres did you notice the actual customer issue on?
> > I ask because I wonder if the work on B-Tree indexes in Postgres 12
> > affects the precise behavior you get here with real world workloads.
> > It probably makes _bt_killitems() more effective with some workloads,
> > which naturally increases the likelihood of having multiple FPI issued
> > in the manner that you describe. OTOH, it might make it less likely
> > with low cardinality indexes, since large groups of garbage duplicate
> > tuples tend to get concentrated on just a few leaf pages.
> >
> > > The inner test in the comment "found the item" never tests the item
> > > for being dead. So maybe we can add !ItemIdIsDead(iid) to that
> > > condition. But there still is a race condition of recording multiple
> > > FPIs can happen. Maybe a better solution is to change the lock to
> > > exclusive, at least when wal_log_hints = on, so that only one process
> > > can run this code -- the reduction in concurrency might be won back by
> > > the fact that we don't wal-log the page multiple times.
> >
> > I like the idea of checking !ItemIdIsDead(iid) as a further condition
> > of killing the item -- there is clearly no point in doing work to kill
> > an item that is already dead. I don't like the idea of using an
> > exclusive buffer lock (even if it's just with wal_log_hints = on),
> > though.
> >
> 
> Okay. I think only adding the check would also help with reducing the
> likelihood. How about the changes for the current HEAD I've attached?

FWIW, looks good to me.

> Related to this behavior on btree indexes, this can happen even on
> heaps during searching heap tuples. To reduce the likelihood of that
> more generally I wonder if we can acquire a lock on buffer descriptor
> right before XLogSaveBufferForHint() and set a flag to the buffer
> descriptor that indicates that we're about to log FPI for hint bit so
> that concurrent process can be aware of that.

Makes sense if the lock were acquired just before the (BM_DIRTY |
BM_JUST_DIRTIED) check.  Could we use double-checking, similar to
the patch for ItemIdIsDead()?

> if ((pg_atomic_read_u32(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
> (BM_DIRTY | BM_JUST_DIRTIED))
> {
...
>   * essential that CreateCheckpoint waits for virtual transactions
>   * rather than full transactionids.
>   */
>  /* blah, blah */
>  buf_state = LockBufHdr(bufHdr);
>
>  if ((buf_state & (BM_ | BM_JUST)) != (..))
>  {
>    MyProc->delayChkpt = delayChkpt = true;
>    lsn = XLogSaveBufferForHint(buffer, buffer_std);
>  }
> }
> else
>  buf_state = LockBufHdr(bufHdr);

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Masahiko Sawada
On Fri, 10 Apr 2020 at 04:05, Peter Geoghegan  wrote:
>
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
>  wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPI issued
> in the manner that you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.
>
> > The inner test in the comment "found the item" never tests the item
> > for being dead. So maybe we can add !ItemIdIsDead(iid) to that
> > condition. But there still is a race condition of recording multiple
> > FPIs can happen. Maybe a better solution is to change the lock to
> > exclusive, at least when wal_log_hints = on, so that only one process
> > can run this code -- the reduction in concurrency might be won back by
> > the fact that we don't wal-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.
>

Okay. I think only adding the check would also help with reducing the
likelihood. How about the changes for the current HEAD I've attached?

Related to this behavior on btree indexes, this can happen even on
heaps during searching heap tuples. To reduce the likelihood of that
more generally I wonder if we can acquire a lock on buffer descriptor
right before XLogSaveBufferForHint() and set a flag to the buffer
descriptor that indicates that we're about to log an FPI for hint bits so
that concurrent processes can be aware of that.

Regards,

-- 
Masahiko Sawada               http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


btree_fpi.patch
Description: Binary data


Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Peter Geoghegan
On Thu, Apr 9, 2020 at 6:47 PM James Coleman  wrote:
> I believe the write pattern to this table likely looks like:
> - INSERT
> - UPDATE
> - DELETE
> for every row. But tomorrow I can do some more digging if needed.

The pg_stats.null_frac for the column/index might be interesting here. I
believe that Active Record will sometimes generate created_at columns
that sometimes end up containing NULL values. Not sure why.


--
Peter Geoghegan




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread James Coleman
On Thu, Apr 9, 2020 at 8:32 PM Peter Geoghegan  wrote:
>
> On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan  wrote:
> > Was this a low cardinality index in the way I describe? If it was,
> > then we can hope (and maybe even verify) that the Postgres 12 work
> > noticeably ameliorates the problem.
>
> What I really meant was an index where hundreds or even thousands of
> rows for each distinct timestamp value are expected. Not an index
> where almost every row has a distinct timestamp value. Both timestamp
> index patterns are common, obviously.

I'll try to run some numbers tomorrow to confirm, but I believe that
the created_at value is almost (if not completely) unique. So, no,
it's not a low cardinality case like that.

I believe the write pattern to this table likely looks like:
- INSERT
- UPDATE
- DELETE
for every row. But tomorrow I can do some more digging if needed.

James




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Peter Geoghegan
On Thu, Apr 9, 2020 at 5:25 PM Peter Geoghegan  wrote:
> Was this a low cardinality index in the way I describe? If it was,
> then we can hope (and maybe even verify) that the Postgres 12 work
> noticeably ameliorates the problem.

What I really meant was an index where hundreds or even thousands of
rows for each distinct timestamp value are expected. Not an index
where almost every row has a distinct timestamp value. Both timestamp
index patterns are common, obviously.

-- 
Peter Geoghegan




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Peter Geoghegan
On Thu, Apr 9, 2020 at 1:37 PM James Coleman  wrote:
> We saw the issue on our PG11 clusters. The specific index we noticed
> in the WAL dump (I don't think we confirmed if there were others) was
> one on a `created_at` column, to give you an idea of cardinality.

You tend to get a lot of problems with indexes like that when there
are consistent updates (actually, that's more of a thing with an
updated_at index). But non-HOT updates alone might result in what you
could describe as "updates" to the index.

With Postgres 11, a low cardinality index could place new/successor
duplicate index tuples (those needed for non-HOT updates) on a more or
less random leaf page (you'll recall that this is determined by the
old "getting tired" logic). This is the kind of thing I had in mind
when I asked Sawada-san about it.

Was this a low cardinality index in the way I describe? If it was,
then we can hope (and maybe even verify) that the Postgres 12 work
noticeably ameliorates the problem.

-- 
Peter Geoghegan




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread James Coleman
On Thu, Apr 9, 2020 at 3:05 PM Peter Geoghegan  wrote:
>
> On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
>  wrote:
> > Here is the reproducer:
>
> What version of Postgres did you notice the actual customer issue on?
> I ask because I wonder if the work on B-Tree indexes in Postgres 12
> affects the precise behavior you get here with real world workloads.
> It probably makes _bt_killitems() more effective with some workloads,
> which naturally increases the likelihood of having multiple FPI issued
> in the manner that you describe. OTOH, it might make it less likely
> with low cardinality indexes, since large groups of garbage duplicate
> tuples tend to get concentrated on just a few leaf pages.

We saw the issue on our PG11 clusters. The specific index we noticed
in the WAL dump (I don't think we confirmed if there were others) was
one on a `created_at` column, to give you an idea of cardinality.

> > The inner test in the comment "found the item" never tests the item
> > for being dead. So maybe we can add !ItemIdIsDead(iid) to that
> > condition. But there still is a race condition of recording multiple
> > FPIs can happen. Maybe a better solution is to change the lock to
> > exclusive, at least when wal_log_hints = on, so that only one process
> > can run this code -- the reduction in concurrency might be won back by
> > the fact that we don't wal-log the page multiple times.
>
> I like the idea of checking !ItemIdIsDead(iid) as a further condition
> of killing the item -- there is clearly no point in doing work to kill
> an item that is already dead. I don't like the idea of using an
> exclusive buffer lock (even if it's just with wal_log_hints = on),
> though.

I don't have a strong opinion on the lock.

James




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Peter Geoghegan
On Wed, Apr 8, 2020 at 10:56 PM Masahiko Sawada
 wrote:
> Here is the reproducer:

What version of Postgres did you notice the actual customer issue on?
I ask because I wonder if the work on B-Tree indexes in Postgres 12
affects the precise behavior you get here with real world workloads.
It probably makes _bt_killitems() more effective with some workloads,
which naturally increases the likelihood of having multiple FPI issued
in the manner that you describe. OTOH, it might make it less likely
with low cardinality indexes, since large groups of garbage duplicate
tuples tend to get concentrated on just a few leaf pages.

> The inner test in the comment "found the item" never tests the item
> for being dead. So maybe we can add !ItemIdIsDead(iid) to that
> condition. But there still is a race condition of recording multiple
> FPIs can happen. Maybe a better solution is to change the lock to
> exclusive, at least when wal_log_hints = on, so that only one process
> can run this code -- the reduction in concurrency might be won back by
> the fact that we don't wal-log the page multiple times.

I like the idea of checking !ItemIdIsDead(iid) as a further condition
of killing the item -- there is clearly no point in doing work to kill
an item that is already dead. I don't like the idea of using an
exclusive buffer lock (even if it's just with wal_log_hints = on),
though.

-- 
Peter Geoghegan




Re: Multiple FPI_FOR_HINT for the same block during killing btree index items

2020-04-09 Thread Alvaro Herrera
On 2020-Apr-09, Masahiko Sawada wrote:

> The inner test in the comment "found the item" never tests the item
> for being dead. So maybe we can add !ItemIdIsDead(iid) to that
> condition. But there still is a race condition of recording multiple
> FPIs can happen. Maybe a better solution is to change the lock to
> exclusive, at least when wal_log_hints = on, so that only one process
> can run this code -- the reduction in concurrency might be won back by
> the fact that we don't wal-log the page multiple times.

I agree.

It seems worth pointing out that when this code was written, these hint
bit changes were not logged, so this consideration did not apply then.
But we added data checksums and wal_log_hints, which changed the
equation.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services