Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-11-19 Thread Alexander Korotkov
On Sun, Nov 17, 2019 at 9:18 PM Alexander Korotkov
 wrote:
> So, I've put this explanation into the README patch.  I just changed
> the designations to better match the Lehman & Yao paper and made some
> minor enhancements.
>
> I'm going to push this patchset if there are no objections.

So, pushed with minor changes during backpatching.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-11-17 Thread Alexander Korotkov
On Mon, Nov 11, 2019 at 2:42 AM Alexander Korotkov
 wrote:
> I'm sorry for the late reply.  I was busy with various things.  Also,
> digging into these details took some time.  Please find my explanation
> below.
>
> On Wed, Oct 30, 2019 at 2:34 AM Peter Geoghegan  wrote:
> > In general, it seems very important to be clear about exactly how the
> > key space works. The nbtree work for v12 greatly benefitted from
> > defining comparisons in a way that didn't really change how nbtree
> > worked, while at the same time minimizing I/O and making nbtree
> > faithful to Lehman & Yao's original design. It isn't obvious how
> > valuable it is to really carefully define how invariants and key
> > comparisons work, but it seems possible to solve a lot of problems
> > that way.
>
> GIN packs downlinks and pivot keys into tuples in a different way
> than nbtree does.  Let's look at the logical structure of an internal
> B-tree page.  We can represent it as follows.
>
> downlink_1, key_1_2, downlink_2, key_2_3, ... , downlink_n, highkey
>
> key_{i-1}_i is a pivot key, which splits the key space between
> downlink_{i-1} and downlink_i.  So, every key under downlink_{i-1} is
> <= key_{i-1}_i.  Every key under downlink_i is > key_{i-1}_i.  And all
> underlying keys are <= highkey.
>
> nbtree packs that into tuples as follows (tuples are shown in parentheses).
>
> (highkey), (-inf, downlink_1), (key_1_2, downlink_2), ...
> (key_{n-1}_n, downlink_n)
>
> So, the highkey is stored separately.  key_{i-1}_i and downlink_i form
> a tuple together.  downlink_1 is coupled with the -inf key.
>
> GIN packs tuples in a different way.
>
> (downlink_1, key_1_2), (downlink_2, key_2_3), ... , (downlink_n, highkey)
>
> So, in GIN key_{i-1}_i and downlink_{i-1} form a tuple.  The highkey is
> coupled with downlink_n.  And the -inf key is not needed here.
>
> But there are a couple of notes about the highkey:
> 1) In the rightmost page of the entry tree, the key coupled with
> downlink_n doesn't really matter.  The highkey is assumed to be infinity.
> 2) In the posting tree, the key coupled with downlink_n never matters.
> The highkey for non-rightmost pages is stored separately and
> accessed via GinDataPageGetRightBound().
>
> I think this explains the following:
> 1) The invariants in the GIN btree are the same as in nbtree; just the
> page layout is different.
> 2) The way entryLocateEntry() works.  In particular, why it's OK to
> follow the mid downlink if the corresponding key compares equal.
> 3) There is no "negative infinity" item in internal pages.
>
> Does the above explanation look clear to you?  If so, I can put
> it into the README.

So, I've put this explanation into the README patch.  I just changed
the designations to better match the Lehman & Yao paper and made some
minor enhancements.

I'm going to push this patchset if there are no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0002-Fix-traversing-to-the-deleted-GIN-page-via-downlin-4.patch
Description: Binary data


0001-Fix-deadlock-between-ginDeletePage-and-ginStepRigh-4.patch
Description: Binary data


0003-Revise-GIN-README-4.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-11-10 Thread Alexander Korotkov
Hi!

I'm sorry for the late reply.  I was busy with various things.  Also,
digging into these details took some time.  Please find my explanation
below.

On Wed, Oct 30, 2019 at 2:34 AM Peter Geoghegan  wrote:
> In general, it seems very important to be clear about exactly how the
> key space works. The nbtree work for v12 greatly benefitted from
> defining comparisons in a way that didn't really change how nbtree
> worked, while at the same time minimizing I/O and making nbtree
> faithful to Lehman & Yao's original design. It isn't obvious how
> valuable it is to really carefully define how invariants and key
> comparisons work, but it seems possible to solve a lot of problems
> that way.

GIN packs downlinks and pivot keys into tuples in a different way
than nbtree does.  Let's look at the logical structure of an internal
B-tree page.  We can represent it as follows.

downlink_1, key_1_2, downlink_2, key_2_3, ... , downlink_n, highkey

key_{i-1}_i is a pivot key, which splits the key space between
downlink_{i-1} and downlink_i.  So, every key under downlink_{i-1} is
<= key_{i-1}_i.  Every key under downlink_i is > key_{i-1}_i.  And all
underlying keys are <= highkey.

nbtree packs that into tuples as follows (tuples are shown in parentheses).

(highkey), (-inf, downlink_1), (key_1_2, downlink_2), ...
(key_{n-1}_n, downlink_n)

So, the highkey is stored separately.  key_{i-1}_i and downlink_i form
a tuple together.  downlink_1 is coupled with the -inf key.

GIN packs tuples in a different way.

(downlink_1, key_1_2), (downlink_2, key_2_3), ... , (downlink_n, highkey)

So, in GIN key_{i-1}_i and downlink_{i-1} form a tuple.  The highkey is
coupled with downlink_n.  And the -inf key is not needed here.

But there are a couple of notes about the highkey:
1) In the rightmost page of the entry tree, the key coupled with downlink_n
doesn't really matter.  The highkey is assumed to be infinity.
2) In the posting tree, the key coupled with downlink_n never matters.  The
highkey for non-rightmost pages is stored separately and accessed via
GinDataPageGetRightBound().

I think this explains the following:
1) The invariants in the GIN btree are the same as in nbtree; just the
page layout is different.
2) The way entryLocateEntry() works.  In particular, why it's OK to
follow the mid downlink if the corresponding key compares equal (see the
sketch below).
3) There is no "negative infinity" item in internal pages.
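
As a standalone illustration of the layout and search behavior described
above, here is a small C sketch (not PostgreSQL code: plain ints stand in
for keys and block numbers, and the struct is only a model of a GIN
entry-tree internal tuple).  It packs pivot keys with downlinks GIN-style
and descends through the "mid" downlink on an exact key match, mirroring
the shape of entryLocateEntry():

#include <stdio.h>

typedef struct
{
    int downlink;   /* block number of child page (downlink_i)             */
    int key;        /* pivot key key_i_{i+1}; keys under downlink_i are    */
                    /* > key_{i-1}_i and <= key_i_{i+1}                    */
} ToyInternalTuple;

static int
locate_downlink(const ToyInternalTuple *tuples, int ntuples, int searchKey)
{
    int low = 0;
    int high = ntuples - 1;   /* last tuple's key plays the role of highkey */

    while (low < high)
    {
        int mid = low + (high - low) / 2;

        if (searchKey == tuples[mid].key)
            return tuples[mid].downlink;    /* equal: descend through mid */
        else if (searchKey > tuples[mid].key)
            low = mid + 1;
        else
            high = mid;
    }
    return tuples[low].downlink;
}

int
main(void)
{
    /* (downlink_1, key_1_2), (downlink_2, key_2_3), (downlink_3, highkey) */
    ToyInternalTuple page[] = {{100, 10}, {200, 20}, {300, 30}};

    /* keys <= 10 live under 100; keys in (10, 20] under 200; (20, 30] under 300 */
    printf("%d %d %d %d\n",
           locate_downlink(page, 3, 10),
           locate_downlink(page, 3, 15),
           locate_downlink(page, 3, 20),
           locate_downlink(page, 3, 25));
    return 0;
}

Because a key equal to the pivot key_i_{i+1} lives under downlink_i, which
is stored in the very same tuple, descending through the matching tuple's
downlink cannot skip anything the scan has to return.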

Does the above explanation look clear to you?  If so, I can put
it into the README.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-10-29 Thread Peter Geoghegan
On Sun, Oct 27, 2019 at 12:04 PM Alexander Korotkov
 wrote:
> (0001-Fix-deadlock-between-ginDeletePage-and-ginStepRigh-3.patch)

Some thoughts on the first patch:

* How do comparisons of items in each type of B-Tree work?

I would like to see a statement about invariants similar to nbtree's
"subtree" invariant. Looks like the "high key" is <= (not <) in each
case (i.e. for both entry trees and posting trees), just like nbtree.
nbtree has an explicit high key rather than using the current highest
data item on the page (maybe this can be called a "pseudo high key" or
something). That is one important difference, but in general GIN seems
to copy nbtree. I think.

However, GIN never explains stuff that must be affected by invariants
in binary search routines like entryLocateLeafEntry() and
dataLocateItem(). GIN never makes the similarities and differences
clear. Maybe you could do more on that -- it's too hard to tell why
entryLocateLeafEntry() and entryLocateEntry() (i.e. leaf and internal
page binary search variants for entry B-Trees) do things differently
to each other (and do things differently to nbtree's binary search
routine).

This code from entryLocateEntry() is a good example of how this can be
confusing:

    if (result == 0)
    {
        stack->off = mid;
        Assert(GinGetDownlink(itup) != GIN_ROOT_BLKNO);
        return GinGetDownlink(itup);
    }
    else if (result > 0)
        low = mid + 1;
    else
        high = mid;

Some questions about this code:

* Why is the "if (result == 0)" branch using the block number/downlink
from the "mid" tuple as its return value?

Why not follow nbtree's _bt_binsrch() here too, by returning "the last
key < scan key" on internal pages? If your high key from one level
down is <=, and also a pseudo high key (i.e. both a "real" entry and a
high key), how can it be correct to go to the block/downlink from an
equal "mid" tuple like this? Won't that make index scans miss the
pseudo high key, which they have to return to the scan?

* Why is it okay that there is no "negative infinity" item in internal
pages for the entry tree?

GIN does not have to copy nbtree in every detail to be correct, but it
confuses me that it *mostly* copies nbtree, but doesn't do so *fully*.
As I wrote just now (or at least implied), entryIsMoveRight() works in
a similar way to _bt_moveright(), and yet we still have these apparent
differences with how binary searches work that seems to not be
compatible with that. Maybe this is okay, but I cannot understand why
that is right now. It looks wrong to me -- so wrong that I suppose I
must be mistaken.

I also spotted some fairly minor issues:

* s/rightest/rightmost/

*  It's weird that GinBtree/GinBtreeData is a set of callbacks for
both posting trees and main entry trees, since the rules are now quite
different in each case. Not sure if it's worth saying something about
that.

*  "Exclusive lock" makes me think of "ExclusiveLock", which is a kind
of heavyweight lock (not a buffer lock). I suggest changing that
wording to avoid confusion.

In general, it seems very important to be clear about exactly how the
key space works. The nbtree work for v12 greatly benefitted from
defining comparisons in a way that didn't really change how nbtree
worked, while at the same time minimizing I/O and making nbtree
faithful to Lehman & Yao's original design. It isn't obvious how
valuable it is to really carefully define how invariants and key
comparisons work, but it seems possible to solve a lot of problems
that way.

--
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-10-28 Thread Alexander Korotkov
On Mon, Oct 28, 2019 at 2:00 PM Peter Geoghegan  wrote:
> On Sun, Oct 27, 2019 at 7:04 PM Alexander Korotkov
>  wrote:
> It took me a lot of effort to revise the concurrency section in the
> README.  I can't judge the result, but I probably did my best.  I'd like
> to commit this patchset including both bug fixes and the README update.
> But I'd like you to take a look at the README patch first.
>
> Thank you for working on this.
>
> I am flying back to the USA today, and will try to take a look at what
> you came up with on the way. I will definitely have some feedback in
> the next few days.

Thank you so much!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-10-28 Thread Peter Geoghegan
Hi Alexander,

On Sun, Oct 27, 2019 at 7:04 PM Alexander Korotkov
 wrote:
> It took me a lot of effort to revise the concurrency section in the
> README.  I can't judge the result, but I probably did my best.  I'd like
> to commit this patchset including both bug fixes and the README update.
> But I'd like you to take a look at the README patch first.

Thank you for working on this.

I am flying back to the USA today, and will try to take a look at what
you came up with on the way. I will definitely have some feedback in
the next few days.

-- 
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-10-27 Thread Alexander Korotkov
Hi, Peter!

On Fri, Oct 4, 2019 at 12:05 AM Alexander Korotkov
 wrote:
> I'm planning to continue work on README, comments and commit messages.

It took me a lot of effort to revise the concurrency section in the README.
I can't judge the result, but I probably did my best.  I'd like to
commit this patchset including both bug fixes and the README update.  But
I'd like you to take a look at the README patch first.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Fix-deadlock-between-ginDeletePage-and-ginStepRigh-3.patch
Description: Binary data


0002-Fix-traversing-to-the-deleted-GIN-page-via-downlin-3.patch
Description: Binary data


0003-Revise-concurrency-section-in-GIN-README-3.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-10-03 Thread Alexander Korotkov
On Tue, Oct 1, 2019 at 5:55 AM Alexander Korotkov
 wrote:
> On Mon, Sep 30, 2019 at 10:54 PM Peter Geoghegan  wrote:
> >
> > On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
> >  wrote:
> > > I just managed to reproduce this using two sessions on master branch.
> > >
> > > session 1
> > > session 2
> >
> > Was the involvement of the pending list stuff in Chen's example just a
> > coincidence? Can you recreate the problem while eliminating that
> > factor (i.e. while setting fastupdate to off)?
> >
> > Chen's example involved an INSERT that deadlocked against VACUUM --
> > not a SELECT. Is this just a coincidence?
>
> Chen wrote:
>
> > Unfortunately the insert process (the one we ran gcore on) held no lwlock;
> > it should be another process (we did not fetch its core file) that holds
> > the lwlock needed by the autovacuum process.
>
> So, he caught a backtrace for the INSERT and posted it for information.
> But since the INSERT held no lwlocks, it couldn't participate in the
> deadlock.  It was just a side waiter.
>
> I've rerun my reproduction case and it still deadlocks: just the same
> steps, but with the GIN index created with (fastupdate = off).

BTW, while trying to revise the README I found another bug.  It appears to
be possible to reach a deleted page via a downlink.  The reproduction
case is as follows.

session 1
session 2

# create table tmp (ar int[]) with (autovacuum_enabled = false);
# insert into tmp (select '{1}' from generate_series(1,1000) i);
# insert into tmp values ('{1,2}');
# insert into tmp (select '{1}' from generate_series(1,1000) i);
# create index tmp_idx on tmp using gin(ar);

# delete from tmp;

# set max_parallel_workers_per_gather = 0;
/* Breakpoint where entryLoadMoreItems() calls ginFindLeafPage() to
search the GIN posting tree */
gdb> b ginget.c:682
gdb> select * from tmp where ar @> '{1,2}';
gdb> /* step till ReleaseAndReadBuffer() releases a buffer */

# vacuum tmp;

# continue

It also appears that the previous version of the deadlock fix didn't supply
the left sibling for the leftmost child of any page.  As a result, internal
pages were never deleted.  The first attached patch is the revised fix.

The second patch fixes traversing to a deleted page via a downlink.
Similarly to nbtree, we just always move right if we land on a deleted
page.  Also, it appears that we used to clear all other flags while
marking a page as deleted, which caused an assert to fire.  With the
patch, we just add the deleted flag without erasing the others.  Also,
I had to remove the assert that ginStepRight() never steps to a deleted
page: if we land on a deleted page via a downlink, then we can reach
another deleted page via its rightlink.
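
For illustration, a minimal standalone sketch of that "always move right
off a deleted page" rule (a toy struct with a deleted flag and a rightlink;
no locking, and not the real GIN page format):

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyPage
{
    bool            deleted;    /* page has been unlinked by VACUUM        */
    struct ToyPage *rightlink;  /* still valid even after deletion         */
    int             blkno;
} ToyPage;

/* Follow a possibly-stale downlink: if the target page turns out to be
 * deleted, keep stepping right.  A deleted page's rightlink may itself
 * point to another deleted page, so this must be a loop, not one step. */
static ToyPage *
follow_downlink(ToyPage *target)
{
    while (target->deleted)
        target = target->rightlink;
    return target;
}

int
main(void)
{
    ToyPage live = {false, NULL, 3};
    ToyPage del2 = {true, &live, 2};
    ToyPage del1 = {true, &del2, 1};    /* the stale downlink points here */

    printf("landed on block %d\n", follow_downlink(&del1)->blkno);
    return 0;
}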

I'm planning to continue work on README, comments and commit messages.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-gin_ginDeletePage_ginStepRight_deadlock_fix-2.patch
Description: Binary data


0002-gin-fix-traversing-to-deleted-page-by-downlink-2.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Alexander Korotkov
On Mon, Sep 30, 2019 at 10:54 PM Peter Geoghegan  wrote:
>
> On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
>  wrote:
> > I just managed to reproduce this using two sessions on master branch.
> >
> > session 1
> > session 2
>
> Was the involvement of the pending list stuff in Chen's example just a
> coincidence? Can you recreate the problem while eliminating that
> factor (i.e. while setting fastupdate to off)?
>
> Chen's example involved an INSERT that deadlocked against VACUUM --
> not a SELECT. Is this just a coincidence?

Chen wrote:

> Unfortunately the insert process (the one we ran gcore on) held no lwlock;
> it should be another process (we did not fetch its core file) that holds
> the lwlock needed by the autovacuum process.

So, he caught a backtrace for the INSERT and posted it for information.
But since the INSERT held no lwlocks, it couldn't participate in the
deadlock.  It was just a side waiter.

I've rerun my reproduction case and it still deadlocks: just the same
steps, but with the GIN index created with (fastupdate = off).

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Peter Geoghegan
On Sun, Sep 29, 2019 at 8:12 AM Alexander Korotkov
 wrote:
> I just managed to reproduce this using two sessions on master branch.
>
> session 1
> session 2

Was the involvement of the pending list stuff in Chen's example just a
coincidence? Can you recreate the problem while eliminating that
factor (i.e. while setting fastupdate to off)?

Chen's example involved an INSERT that deadlocked against VACUUM --
not a SELECT. Is this just a coincidence?

-- 
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Peter Geoghegan
On Mon, Sep 30, 2019 at 11:00 AM Alexander Korotkov
 wrote:
> Thank you for the pointer.  I remember this thread, but don't remember
> the details.  I'll reread it.

I think that doing this now would be a good idea:

"""
Defensively checking the page type (pending, posting, etc) within
scanPendingInsert() would be a good start. That's already something
that we do for posting trees. If we're following the same rules as
posting trees (which sounds like a good idea), then we should have the
same defenses. Same applies to using elogs within ginInsertCleanup()
-- we should promote those Assert()s to elog()s.
"""

In other words, ginInsertCleanup() should have defensive "Page types
match?" checks that are similar to the existing checks in
ginStepRight(). In both cases we're stepping right while coupling
locks, to avoid concurrent page deletion.
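
A minimal standalone sketch of that kind of defensive check (the flag
values and the report_corruption() helper are made up for the
illustration; the real code would test GIN's page flags and raise
elog(ERROR) so that the check is not compiled out the way an Assert is):

#include <stdio.h>
#include <stdlib.h>

#define PAGE_LEAF 0x01
#define PAGE_DATA 0x02          /* posting-tree page (made-up flag value)  */
#define PAGE_LIST 0x04          /* pending-list page (made-up flag value)  */

static void
report_corruption(const char *msg)
{
    fprintf(stderr, "ERROR: %s\n", msg);
    exit(EXIT_FAILURE);
}

/* Called after stepping right to what should be the next pending-list page:
 * verify the page type at runtime instead of only asserting it. */
static void
check_pending_list_page(unsigned flags)
{
    if ((flags & PAGE_LIST) == 0 || (flags & PAGE_DATA) != 0)
        report_corruption("unexpected page type in GIN pending list");
}

int
main(void)
{
    check_pending_list_page(PAGE_LIST | PAGE_LEAF);     /* passes            */
    check_pending_list_page(PAGE_DATA);                 /* reports and exits */
    return 0;
}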

> > > Locking from right to left is clearly wrong.  It could deadlock with
> > > concurrent ginStepRight(), which locks from left to right.  I expect
> > > this happened in your case.  I'm going to reproduce this and fix.

> Frankly speaking, I'm not very happy with the special version of B-tree
> that is built into GIN.  This version of B-tree lacks high keys.  AFAIR,
> because of the lack of high keys, we can't implement the same
> concurrency model as nbtree.  Instead we have to do super-locking for
> page deletion and so on.  That looks ridiculous to me.  I think in the
> future we should somehow reimplement GIN on top of nbtree.

I think that that could work on top of the new nbtree posting list
stuff, provided that it was acceptable to not use posting list
compression in the main tree -- the way that posting list splits work
there needs to be able to think about free space in a very simple way
that is broken even by GIN's varbyte compression. Compression could
still be used in posting trees, though.

> But we have to provide somewhat less radical fixes for our stable
> releases.  In particular, I believe the patch I've posted in this thread
> makes the situation better, not worse.  That is, it fixes one bug without
> introducing more bugs.  But I'm going to analyze this further and
> document GIN concurrency better in the README.  Probably, I'll spot
> more details.

Thanks.
-- 
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Peter Geoghegan
On Sun, Sep 29, 2019 at 10:38 PM Andrey Borodin  wrote:
> As far as I understand deleted page is stamped with
> GinPageSetDeleteXid(page, ReadNewTransactionId());
> It will not be recycled until that Xid is far behind.

That only gets used within posting tree pages, though.
ginInsertCleanup() is concerned with pending list pages.

> BTW we found a small bug (wraparound) in similar GiST and B-tree 
> implementations.
> Probably, it's there in GIN too.

Probably, but that's much less of a problem to me.

-- 
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Alexander Korotkov
On Sun, Sep 29, 2019 at 10:53 PM Peter Geoghegan  wrote:
> On Sun, Sep 29, 2019 at 7:38 AM Alexander Korotkov
>  wrote:
> > Starting from root seems OK for me, because vacuum blocks all
> > concurrent inserts before doing this.  But this needs to be properly
> > documented in readme.
>
> I never got an adequate answer to this closely related question almost
> two years ago:
>
> https://www.postgresql.org/message-id/CAH2-Wz=gtnapzeezqyelov3h1fxpo5xhmrbp6amgeklkv95...@mail.gmail.com
>
> In general, ginInsertCleanup() seems badly designed. Why is it okay
> that there is no nbtree-style distinction between page deletion and
> page recycling?

Thank you for the pointer.  I remember this thread, but don't remember
the details.  I'll reread it.

> > Locking from right to left is clearly wrong.  It could deadlock with
> > concurrent ginStepRight(), which locks from left to right.  I expect
> > this happened in your case.  I'm going to reproduce this and fix.
>
> I am sick and tired of seeing extremely basic errors like this within
> GIN's locking protocols. Bugs happen, but these are not ordinary bugs.
> They're more or less all a result of giving no thought to the high
> level design. I'm not blaming you for this, or any one person. But
> this is not okay.
>
> Anything around index concurrency needs to be explained in
> excruciating detail, while taking a top-down approach that applies
> general rules (e.g. you can only do lock coupling left to right, or
> bottom to top in nbtree). Anything less than that should be assumed to
> be wrong on general principle.

Frankly speaking, I'm not very happy with the special version of B-tree
that is built into GIN.  This version of B-tree lacks high keys.  AFAIR,
because of the lack of high keys, we can't implement the same
concurrency model as nbtree.  Instead we have to do super-locking for
page deletion and so on.  That looks ridiculous to me.  I think in the
future we should somehow reimplement GIN on top of nbtree.

But we have to provide somewhat less radical fixes for our stable
releases.  In particular, I believe the patch I've posted in this thread
makes the situation better, not worse.  That is, it fixes one bug without
introducing more bugs.  But I'm going to analyze this further and
document GIN concurrency better in the README.  Probably, I'll spot
more details.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-30 Thread Alexander Korotkov
On Mon, Sep 30, 2019 at 8:38 AM Andrey Borodin  wrote:
>
> > On 29 Sept 2019, at 21:27, Alexander Korotkov wrote:
> >
> > Patch with fix is attached.  Idea is simple: ginScanToDelete() now
> > keeps exclusive lock on left page eliminating the need to relock it.
> > So, we preserve left-to-right locking order and can't deadlock with
> > ginStepRight().
>
> In this function ginDeletePage(gvs, blkno, 
> BufferGetBlockNumber(me->leftBuffer),...)
> we are going to reread buffer
> lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
>  RBM_NORMAL, gvs->strategy);
> Is it OK?


That's related not only to the left buffer.  We could also pass a buffer
instead of deleteBlkno.  And with some code changes it's also possible
to pass a buffer instead of parentBlkno.  But I decided to keep the code
changes minimal, at least for the backpatch version.  We could apply such
an optimization to master as a separate patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread Andrey Borodin



> On 29 Sept 2019, at 21:27, Alexander Korotkov wrote:
> 
> Patch with fix is attached.  Idea is simple: ginScanToDelete() now
> keeps exclusive lock on left page eliminating the need to relock it.
> So, we preserve left-to-right locking order and can't deadlock with
> ginStepRight().

In this function, ginDeletePage(gvs, blkno,
BufferGetBlockNumber(me->leftBuffer), ...),
we are going to reread the buffer:
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
 RBM_NORMAL, gvs->strategy);
Is that OK?


> On 30 Sept 2019, at 0:52, Peter Geoghegan wrote:
> 
> Why is it okay
> that there is no nbtree-style distinction between page deletion and
> page recycling?
As far as I understand, a deleted page is stamped with
GinPageSetDeleteXid(page, ReadNewTransactionId());
It will not be recycled until that xid is sufficiently far in the past.
BTW, we found a small bug (wraparound) in the similar GiST and B-tree
implementations.
It's probably there in GIN too.

--
Andrey Borodin
Open source RDBMS development team leader
Yandex.Cloud





Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread Peter Geoghegan
On Sun, Sep 29, 2019 at 7:38 AM Alexander Korotkov
 wrote:
> Starting from root seems OK for me, because vacuum blocks all
> concurrent inserts before doing this.  But this needs to be properly
> documented in readme.

I never got an adequate answer to this closely related question almost
two years ago:

https://www.postgresql.org/message-id/CAH2-Wz=gtnapzeezqyelov3h1fxpo5xhmrbp6amgeklkv95...@mail.gmail.com

In general, ginInsertCleanup() seems badly designed. Why is it okay
that there is no nbtree-style distinction between page deletion and
page recycling?

> Locking from right to left is clearly wrong.  It could deadlock with
> concurrent ginStepRight(), which locks from left to right.  I expect
> this happened in your case.  I'm going to reproduce this and fix.

I am sick and tired of seeing extremely basic errors like this within
GIN's locking protocols. Bugs happen, but these are not ordinary bugs.
They're more or less all a result of giving no thought to the high
level design. I'm not blaming you for this, or any one person. But
this is not okay.

Anything around index concurrency needs to be explained in
excruciating detail, while taking a top-down approach that applies
general rules (e.g. you can only do lock coupling left to right, or
bottom to top in nbtree). Anything less than that should be assumed to
be wrong on general principle.

-- 
Peter Geoghegan




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread Alexander Korotkov
On Sun, Sep 29, 2019 at 6:12 PM Alexander Korotkov
 wrote:
> On Sun, Sep 29, 2019 at 5:38 PM Alexander Korotkov
>  wrote:
> > On Sun, Sep 29, 2019 at 11:17 AM chenhj  wrote:
> > > Is the locking order of the autovacuum process (root->right->left) correct?
> > > The insert process locks GIN buffers in bottom->top and
> > > left->right order.
> > >
> > > 1. vacuum(root->right->left):
> >
> > Starting from root seems OK for me, because vacuum blocks all
> > concurrent inserts before doing this.  But this needs to be properly
> > documented in readme.
> >
> > Locking from right to left is clearly wrong.  It could deadlock with
> > concurrent ginStepRight(), which locks from left to right.  I expect
> > this happened in your case.  I'm going to reproduce this and fix.
>
> I just managed to reproduce this using two sessions on master branch.
>
> session 1
> session 2
>
> # create table test with (autovacuum_enabled = false) as (select
> array[1] ar from generate_series(1,2) i);
> # create index test_ar_idx on test using gin (ar);
> # vacuum analyze test;
> # delete from test;
>
> # set enable_seqscan = off;
> gdb> b ginbtree.c:150
> # select * from test where ar @> '{1}'::integer[];
> Step in gdb just before ReadBuffer() in ReleaseAndReadBuffer().
>
> gdb> b ginvacuum.c:155
> # vacuum test;
>
> gdb > continue
> gdb> continue

A patch with the fix is attached.  The idea is simple: ginScanToDelete() now
keeps an exclusive lock on the left page, eliminating the need to relock it.
So, we preserve the left-to-right locking order and can't deadlock with
ginStepRight().

Also, we need to adjust the Concurrency section in the GIN README.  To me the
description looks vague and inconsistent even with the current behavior.
I'm going to post this later.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin_ginDeletePage_ginStepRight_deadlock_fix-1.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread Alexander Korotkov
On Sun, Sep 29, 2019 at 5:38 PM Alexander Korotkov
 wrote:
> On Sun, Sep 29, 2019 at 11:17 AM chenhj  wrote:
> > Is the locking order of the autovacuum process (root->right->left) correct?
> > The insert process locks GIN buffers in bottom->top and
> > left->right order.
> >
> > 1. vacuum(root->right->left):
>
> Starting from root seems OK for me, because vacuum blocks all
> concurrent inserts before doing this.  But this needs to be properly
> documented in readme.
>
> Locking from right to left is clearly wrong.  It could deadlock with
> concurrent ginStepRight(), which locks from left to right.  I expect
> this happened in your case.  I'm going to reproduce this and fix.

I just managed to reproduce this using two sessions on master branch.

session 1
session 2

# create table test with (autovacuum_enabled = false) as (select
array[1] ar from generate_series(1,2) i);
# create index test_ar_idx on test using gin (ar);
# vacuum analyze test;
# delete from test;

# set enable_seqscan = off;
gdb> b ginbtree.c:150
# select * from test where ar @> '{1}'::integer[];
Step in gdb just before ReadBuffer() in ReleaseAndReadBuffer().

gdb> b ginvacuum.c:155
# vacuum test;

gdb > continue
gdb> continue

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread Alexander Korotkov
Hi!

Thank you for reporting.

On Sun, Sep 29, 2019 at 11:17 AM chenhj  wrote:
> Is the locking order of the autovacuum process (root->right->left) correct?
> The insert process locks GIN buffers in bottom->top and left->right order.
>
> 1. vacuum(root->right->left):

Starting from the root seems OK to me, because vacuum blocks all
concurrent inserts before doing this.  But this needs to be properly
documented in the README.

Locking from right to left is clearly wrong.  It could deadlock with
a concurrent ginStepRight(), which locks from left to right.  I expect
this is what happened in your case.  I'm going to reproduce this and fix it.
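
As a standalone illustration (POSIX threads with toy mutexes standing in
for buffer locks, not PostgreSQL code): when every participant couples
locks strictly left to right, acquiring the right sibling's lock before
releasing the current one, two concurrent walkers can never deadlock.  A
participant locking right to left could hold "right" while waiting for
"left" just as a left-to-right walker holds "left" and waits for "right".

/* Build with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

#define NPAGES 4

static pthread_mutex_t page_lock[NPAGES];

/* Walk pages 0..NPAGES-1 with left-to-right lock coupling, the way a
 * rightlink traversal does: lock the right sibling, then release the left. */
static void *
walker(void *arg)
{
    const char *who = arg;

    pthread_mutex_lock(&page_lock[0]);
    for (int i = 0; i + 1 < NPAGES; i++)
    {
        pthread_mutex_lock(&page_lock[i + 1]);   /* lock right sibling ...  */
        pthread_mutex_unlock(&page_lock[i]);     /* ... then release left   */
        printf("%s stepped to page %d\n", who, i + 1);
    }
    pthread_mutex_unlock(&page_lock[NPAGES - 1]);
    return NULL;
}

int
main(void)
{
    pthread_t a, b;

    for (int i = 0; i < NPAGES; i++)
        pthread_mutex_init(&page_lock[i], NULL);

    /* Both walkers agree on the locking order, so they cannot deadlock. */
    pthread_create(&a, NULL, walker, "scan");
    pthread_create(&b, NULL, walker, "vacuum");
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}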

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Connections hang indefinitely while taking a gin index's LWLock buffer_content lock(PG10.7)

2019-09-29 Thread chenhj
Hi,all


In our PostgreSQL 10.7 (RHEL 6.3) database, an autovacuum process and many
insert processes have been blocked on a gin index's buffer_content LWLock for
a long time.


In other words, the gin index lwlock deadlock phenomenon from the thread below
has occurred again.  Since the bug reported there has already been fixed in
10.7, this should be a new bug.


https://www.postgresql.org/message-id/flat/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com


We have already obtained core dump files of the autovacuum process and one of
the insert processes.
Unfortunately the insert process (the one we ran gcore on) held no lwlock; it
should be another process (we did not fetch its core file) that holds the
lwlock needed by the autovacuum process.


The stacks are as follows:


## Stack of one insert process: acquiring lock 0x7f6c517dbfa4, which was held
by the vacuum process

(gdb) bt
#0  0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f76a7b8) at pg_sema.c:316
#2  0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) 
at lwlock.c:1233
#3  0x0048b622 in ginTraverseLock (buffer=224225, searchMode=0 '\000') 
at ginbtree.c:40
#4  0x0048ca13 in ginFindLeafPage (btree=0x7fffc71c4ea0, searchMode=0 
'\000', snapshot=0x0) at ginbtree.c:97
#5  0x004894db in ginInsertItemPointers (index=, 
rootBlkno=, items=, nitem=, buildStats=0x0)
at gindatapage.c:1909
#6  0x004863a7 in ginEntryInsert (ginstate=0x1c72158, attnum=1, 
key=20190913, category=0 '\000', items=0x1c81508, nitem=72, buildStats=0x0) at 
gininsert.c:214
#7  0x0049219a in ginInsertCleanup (ginstate=0x1c72158, full_clean=0 
'\000', fill_fsm=1 '\001', forceCleanup=, stats=) at ginfast.c:878
#8  0x0049308e in ginHeapTupleFastInsert (ginstate=0x1c72158, 
collector=) at ginfast.c:443
#9  0x00486749 in gininsert (index=, 
values=0x7fffc71c54f0, isnull=0x7fffc71c5600 "", ht_ctid=0x1c6d3a4, 
heapRel=, 
checkUnique=, indexInfo=0x1c61da8) at gininsert.c:522
#10 0x005f75f0 in ExecInsertIndexTuples (slot=0x1c62168, 
tupleid=0x1c6d3a4, estate=0x1c61768, noDupErr=0 '\000', specConflict=0x0, 
arbiterIndexes=0x0) at execIndexing.c:387
#11 0x00616497 in ExecInsert (pstate=0x1c61ab8) at nodeModifyTable.c:519
#12 ExecModifyTable (pstate=0x1c61ab8) at nodeModifyTable.c:1779
#13 0x005fb6bf in ExecProcNode (queryDesc=0x1c67760, direction=, count=0, execute_once=-72 '\270') at 
../../../src/include/executor/executor.h:250
#14 ExecutePlan (queryDesc=0x1c67760, direction=, count=0, 
execute_once=-72 '\270') at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1c67760, direction=, 
count=0, execute_once=-72 '\270') at execMain.c:364
#16 0x7f6e226aa6f8 in pgss_ExecutorRun (queryDesc=0x1c67760, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
pg_stat_statements.c:889
#17 0x7f6e224a474d in explain_ExecutorRun (queryDesc=0x1c67760, 
direction=ForwardScanDirection, count=0, execute_once=1 '\001') at 
auto_explain.c:267
#18 0x0072a15b in ProcessQuery (plan=, 
sourceText=0x1c21458 "INSERT INTO bi_dm.tdm_wh_shopgds_fnsh_rt 
(STATIS_DATE,SITE_CD,LGORT,ZSIZE,ZVTWEG,VSBED,TOTAL_CNT,FNSH_CNT,UNFNSH_CNT,ETL_TIME,DEPT_CD,TMALL_FLG,BUSS_TP,ZCKYWLX)
 VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$"..., params=0x1c21580, queryEnv=0x0, 
dest=, completionTag=0x7fffc71c5de0 "") at pquery.c:161
#19 0x0072a395 in PortalRunMulti (portal=0x1c57f18, isTopLevel=1 
'\001', setHoldSnapshot=0 '\000', dest=0xc9b480, altdest=0xc9b480, 
completionTag=0x7fffc71c5de0 "") at pquery.c:1286
#20 0x0072aa98 in PortalRun (portal=0x1c57f18, count=1, isTopLevel=1 
'\001', run_once=1 '\001', dest=0x1c25768, altdest=0x1c25768, 
completionTag=0x7fffc71c5de0 "") at pquery.c:799
#21 0x00728c9a in exec_execute_message (argc=, 
argv=, dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:2007
#22 PostgresMain (argc=, argv=, 
dbname=0x1bbb800 "lbiwhdb", username=) at postgres.c:4180
#23 0x006bb43a in BackendRun (argc=, argv=) at postmaster.c:4405
#24 BackendStartup (argc=, argv=) at 
postmaster.c:4077
#25 ServerLoop (argc=, argv=) at 
postmaster.c:1755
#26 PostmasterMain (argc=, argv=) at 
postmaster.c:1363
#27 0x0063b4d0 in main (argc=3, argv=0x1b839e0) at main.c:228
(gdb) f 2
#2  0x00718225 in LWLockAcquire (lock=0x7f6c517dbfa4, mode=LW_SHARED) 
at lwlock.c:1233
1233lwlock.c: No such file or directory.
in lwlock.c
(gdb) p num_held_lwlocks
$1 = 0
(gdb) 




## Stack of the autovacuum process: acquiring lock 0x7f6c519ba5a4 while holding
0x7f6c517dbfa4 and 0x7f6c51684f64
--
(gdb) bt
#0  0x00369ea0da00 in sem_wait () from /lib64/libpthread.so.0
#1  0x006a7910 in PGSemaphoreLock (sema=0x7f6c4f77fdb8) at pg_sema.c:316
#2  0x00718225 in LWLockAcquire (lock=0x7f6c519ba5a4, 
mode=LW_EXCLUSIVE) at 

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-24 Thread Alexander Korotkov
On Sat, Mar 23, 2019 at 11:43 AM Alexander Korotkov
 wrote:
> On Fri, Mar 22, 2019 at 11:05 AM Alexander Korotkov
>  wrote:
> > On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
> >  wrote:
> > > On 2019-Mar-21, Alexander Korotkov wrote:
> > >
> > > > However, I think this still can be backpatched correctly.  We can
> > > > determine whether xlog record data contains deleteXid by its size.
> > > > See the attached patch.  I haven't test this yet.  I'm going to test
> > > > it.  If OK, then push.
> > >
> > > Wow, this is so magical that I think it merits a long comment explaining
> > > what is going on.
> >
> > Yeah, we have to do magic to fix such weird things :)
> > Please find the next revision of the patch attached.  It uses a static
> > assertion to check that the data structure size has actually changed.
> > Also, an assertion checks that the record data length matches one of the
> > structures' lengths.
>
> I'm going to backpatch this on no objections.

So, pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-23 Thread Alexander Korotkov
On Fri, Mar 22, 2019 at 11:05 AM Alexander Korotkov
 wrote:
> On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
>  wrote:
> > On 2019-Mar-21, Alexander Korotkov wrote:
> >
> > > However, I think this still can be backpatched correctly.  We can
> > > determine whether xlog record data contains deleteXid by its size.
> > > See the attached patch.  I haven't test this yet.  I'm going to test
> > > it.  If OK, then push.
> >
> > Wow, this is so magical that I think it merits a long comment explaining
> > what is going on.
>
> Yeah, we have to do magic to fix such weird things :)
> Please find the next revision of the patch attached.  It uses a static
> assertion to check that the data structure size has actually changed.
> Also, an assertion checks that the record data length matches one of the
> structures' lengths.

I'm going to backpatch this on no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-22 Thread Alexander Korotkov
On Fri, Mar 22, 2019 at 12:06 AM Alvaro Herrera
 wrote:
> On 2019-Mar-21, Alexander Korotkov wrote:
>
> > However, I think this still can be backpatched correctly.  We can
> > determine whether xlog record data contains deleteXid by its size.
> > See the attached patch.  I haven't test this yet.  I'm going to test
> > it.  If OK, then push.
>
> Wow, this is so magical that I think it merits a long comment explaining
> what is going on.

Yeah, we have to do magic to fix such weird things :)
Please find the next revision of the patch attached.  It uses a static
assertion to check that the data structure size has actually changed.
Also, an assertion checks that the record data length matches one of the
structures' lengths.
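
For illustration, here is a standalone sketch of that idea with simplified
stand-in structs (the field names and types are not the real
ginxlogDeletePage definitions): the old and new record formats are
distinguished by the length of the record data, and a length that matches
neither is reported:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct
{
    uint16_t parentOffset;
    uint32_t rightLink;
} DeletePageOld;                /* record written by older minor releases */

typedef struct
{
    uint16_t parentOffset;
    uint32_t rightLink;
    uint32_t deleteXid;         /* added by the backpatched fix */
} DeletePageNew;

static uint32_t
decode_delete_xid(const void *data, size_t len)
{
    if (len == sizeof(DeletePageNew))
    {
        DeletePageNew rec;

        memcpy(&rec, data, sizeof(rec));
        return rec.deleteXid;
    }
    if (len == sizeof(DeletePageOld))
        return 0;               /* old record: no xid was logged */

    fprintf(stderr, "unexpected record data length: %zu\n", len);
    return 0;
}

int
main(void)
{
    DeletePageOld oldrec = {1, 42};
    DeletePageNew newrec = {1, 42, 12345};

    printf("%u %u\n",
           decode_delete_xid(&oldrec, sizeof(oldrec)),
           decode_delete_xid(&newrec, sizeof(newrec)));
    return 0;
}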

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-redo-delete-page-backpatch-fix-2.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Alvaro Herrera
On 2019-Mar-21, Alexander Korotkov wrote:

> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't test this yet.  I'm going to test
> it.  If OK, then push.

Wow, this is so magical that I think it merits a long comment explaining
what is going on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Simon Riggs
On Thu, 21 Mar 2019 at 15:18, Alexander Korotkov 
wrote:

> On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs  wrote:
>
>

> > It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> > introduced a WAL incompatibility that has not been flagged.
> >
> > In ginRedoDeletePage() we use the struct directly to read the WAL
> record, so if a WAL record was written prior to
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have
> problems, since deleteXid will not be set correctly.
> >
> > It seems this should not have been backpatched.
> >
> > Please give your assessment.
>
> Oh, right.  This is my fault.
>
> However, I think this still can be backpatched correctly.  We can
> determine whether xlog record data contains deleteXid by its size.
> See the attached patch.  I haven't test this yet.  I'm going to test
> it.  If OK, then push.
>

Patch looks like it will work.

I'd prefer to see a tighter test, since the length should either be the old
or new length, no other.

Thanks

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Alexander Korotkov
On Thu, Mar 21, 2019 at 9:26 PM Simon Riggs  wrote:
> On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov  wrote:
>> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund  wrote:
>> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
>> > > It doesn't mater, because we release all locks on every buffer at one
>> > > time.  The unlock order can have effect on what waiter will acquire
>> > > the lock next after ginRedoDeletePage().  However, I don't see why one
>> > > unlock order is better than another.  Thus, I just used the rule of
>> > > thumb to not change code when it's not necessary for bug fix.
>> >
>> > I think it's right to not change unlock order at the same time as a
>> > bugfix here.  More generally I think it can often be useful to default
>> > to release locks in the inverse order they've been acquired - if there's
>> > any likelihood that somebody will acquire them in the same order, that
>> > ensures that such a party would only need to wait for a lock once,
>> > instead of being woken up for one lock, and then immediately having to
>> > wait for the next one.
>>
>> Good point, thank you!
>
>
> It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
> introduced a WAL incompatibility that has not been flagged.
>
> In ginRedoDeletePage() we use the struct directly to read the WAL record, so 
> if a WAL record was written prior to 
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at 
> 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have problems, 
> since deleteXid will not be set correctly.
>
> It seems this should not have been backpatched.
>
> Please give your assessment.

Oh, right.  This is my fault.

However, I think this can still be backpatched correctly.  We can
determine whether the xlog record data contains deleteXid by its size.
See the attached patch.  I haven't tested this yet.  I'm going to test
it.  If it's OK, then I'll push.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-redo-delete-page-backpatch-fix.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2019-03-21 Thread Simon Riggs
On Thu, 13 Dec 2018 at 14:48, Alexander Korotkov 
wrote:

> On Thu, Dec 13, 2018 at 10:46 PM Andres Freund  wrote:
> > On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> > > It doesn't mater, because we release all locks on every buffer at one
> > > time.  The unlock order can have effect on what waiter will acquire
> > > the lock next after ginRedoDeletePage().  However, I don't see why one
> > > unlock order is better than another.  Thus, I just used the rule of
> > > thumb to not change code when it's not necessary for bug fix.
> >
> > I think it's right to not change unlock order at the same time as a
> > bugfix here.  More generally I think it can often be useful to default
> > to release locks in the inverse order they've been acquired - if there's
> > any likelihood that somebody will acquire them in the same order, that
> > ensures that such a party would only need to wait for a lock once,
> > instead of being woken up for one lock, and then immediately having to
> > wait for the next one.
>
> Good point, thank you!
>

It's been pointed out to me that 52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478
introduced a WAL incompatibility that has not been flagged.

In ginRedoDeletePage() we use the struct directly to read the WAL record,
so if a WAL record was written prior to
52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478, yet read by code at
52ac6cd2d0cd70e01291e0ac4ee6d068b69bc478 or later then we will have
problems, since deleteXid will not be set correctly.

It seems this should not have been backpatched.

Please give your assessment.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-14 Thread Bruce Momjian
On Thu, Dec 13, 2018 at 10:40:59PM +0300, Alexander Korotkov wrote:
> On Thu, Dec 13, 2018 at 8:06 PM Andrey Borodin  wrote:
> > That's the same variable, one place is definition while other is potential 
> > misuse.
> > Seems like these 2 lines [0]
> >
> > +   if (BufferIsValid(lbuffer))
> > +   UnlockReleaseBuffer(lbuffer);
> >
> > are superfluous: lbuffer is UnlockReleased earlier.
> 
> Thanks to everybody for noticing!  Speaking more generally backpatch
> to 9.4 was completely wrong: there were multiple errors.  It's my
> oversight, I forget how much better our xlog system became since 9.4
> :)
> 
> I've pushed bf0e5a73be fixing that.

I can confirm the compiler warning is now gone.  Thanks.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Alexander Korotkov
On Thu, Dec 13, 2018 at 10:46 PM Andres Freund  wrote:
> On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> > It doesn't mater, because we release all locks on every buffer at one
> > time.  The unlock order can have effect on what waiter will acquire
> > the lock next after ginRedoDeletePage().  However, I don't see why one
> > unlock order is better than another.  Thus, I just used the rule of
> > thumb to not change code when it's not necessary for bug fix.
>
> I think it's right to not change unlock order at the same time as a
> bugfix here.  More generally I think it can often be useful to default
> to release locks in the inverse order they've been acquired - if there's
> any likelihood that somebody will acquire them in the same order, that
> ensures that such a party would only need to wait for a lock once,
> instead of being woken up for one lock, and then immediately having to
> wait for the next one.

Good point, thank you!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Andres Freund
Hi,

On 2018-12-13 22:40:59 +0300, Alexander Korotkov wrote:
> It doesn't mater, because we release all locks on every buffer at one
> time.  The unlock order can have effect on what waiter will acquire
> the lock next after ginRedoDeletePage().  However, I don't see why one
> unlock order is better than another.  Thus, I just used the rule of
> thumb to not change code when it's not necessary for bug fix.

I think it's right to not change unlock order at the same time as a
bugfix here.  More generally I think it can often be useful to default
to release locks in the inverse order they've been acquired - if there's
any likelihood that somebody will acquire them in the same order, that
ensures that such a party would only need to wait for a lock once,
instead of being woken up for one lock, and then immediately having to
wait for the next one.
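
As a minimal standalone illustration of that pattern (POSIX threads and toy
mutexes, not PostgreSQL buffer locks): locks acquired in the order A then B
are released as B then A, so a concurrent party that acquires them in the
same order waits at most once instead of being woken for A and then
immediately blocking on B.

/* Build with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
    const char *who = arg;

    pthread_mutex_lock(&lock_a);        /* acquire A ...                       */
    pthread_mutex_lock(&lock_b);        /* ... then B                          */
    printf("%s holds both locks\n", who);
    pthread_mutex_unlock(&lock_b);      /* release B ...                       */
    pthread_mutex_unlock(&lock_a);      /* ... then A (inverse of acquisition) */
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, "redo");
    pthread_create(&t2, NULL, worker, "reader");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}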

Greetings,

Andres Freund



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Alexander Korotkov
On Thu, Dec 13, 2018 at 8:06 PM Andrey Borodin  wrote:
> > if (BufferIsValid(lbuffer))
> > UnlockReleaseBuffer(lbuffer);
> > if (BufferIsValid(pbuffer))
> > UnlockReleaseBuffer(pbuffer);
> > if (BufferIsValid(dbuffer))
> > UnlockReleaseBuffer(dbuffer);
> > ==>
> > if (BufferIsValid(pbuffer))
> > UnlockReleaseBuffer(pbuffer);
> > if (BufferIsValid(dbuffer))
> > UnlockReleaseBuffer(dbuffer);
> > if (BufferIsValid(lbuffer))
> > UnlockReleaseBuffer(lbuffer);
>
> I think that unlock order does not matter. But I may be wrong. May be just 
> for uniformity?

It doesn't matter, because we release the locks on all of the buffers at one
time.  The unlock order can have an effect on which waiter will acquire
the lock next after ginRedoDeletePage().  However, I don't see why one
unlock order is better than another.  Thus, I just used the rule of
thumb of not changing code when it's not necessary for the bug fix.

> > On 13 Dec 2018, at 21:48, Tom Lane wrote:
> >
> > Bruce Momjian  writes:
> >> I am seeing this warning in the 9.4 branch:
> >>  ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
> >>  in this function [-Wmaybe-uninitialized]
> >
> > I'm also getting that, just in 9.4, but at a different line number:
> >
> > ginxlog.c: In function 'ginRedoDeletePage':
> > ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function
> >
> > That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)
>
>
> That's the same variable, one place is definition while other is potential 
> misuse.
> Seems like these 2 lines [0]
>
> +   if (BufferIsValid(lbuffer))
> +   UnlockReleaseBuffer(lbuffer);
>
> are superfluous: lbuffer is UnlockReleased earlier.

Thanks to everybody for noticing!  Speaking more generally, the backpatch
to 9.4 was completely wrong: there were multiple errors.  It's my
oversight; I forgot how much better our xlog system has become since 9.4
:)

I've pushed bf0e5a73be fixing that.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Andrey Borodin
Hi!


> On 13 Dec 2018, at 17:03, Alexander Korotkov wrote:
> 
> Thank you.  I've revised your patch and pushed it, along with two
> other patches in this thread.

That's great! Thanks!

> On 13 Dec 2018, at 20:12, chjis...@163.com wrote:
> 
> 
> hi
> I have a question: why is the order of unlocking not adjusted in this patch?
> Like this:
> 
> if (BufferIsValid(lbuffer))
> UnlockReleaseBuffer(lbuffer);
> if (BufferIsValid(pbuffer))
> UnlockReleaseBuffer(pbuffer);
> if (BufferIsValid(dbuffer))
> UnlockReleaseBuffer(dbuffer);
> ==>
> if (BufferIsValid(pbuffer))
> UnlockReleaseBuffer(pbuffer);
> if (BufferIsValid(dbuffer))
> UnlockReleaseBuffer(dbuffer);
> if (BufferIsValid(lbuffer))
> UnlockReleaseBuffer(lbuffer);

I think that the unlock order does not matter.  But I may be wrong.  Maybe
just for uniformity?


> On 13 Dec 2018, at 21:48, Tom Lane wrote:
> 
> Bruce Momjian  writes:
>> I am seeing this warning in the 9.4 branch:
>>  ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
>>  in this function [-Wmaybe-uninitialized]
> 
> I'm also getting that, just in 9.4, but at a different line number:
> 
> ginxlog.c: In function 'ginRedoDeletePage':
> ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function
> 
> That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)


That's the same variable; one place is the definition while the other is the
potential misuse.
It seems like these 2 lines [0]

+   if (BufferIsValid(lbuffer))
+   UnlockReleaseBuffer(lbuffer);

are superfluous: lbuffer is UnlockReleased earlier.



Best regards, Andrey Borodin.

[0] 
https://github.com/postgres/postgres/commit/19cf52e6cc33b9e126775f28269b6ddf7e30#diff-ed6446a8993c76d2884ec6413e49a6b6R757


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Tom Lane
Bruce Momjian  writes:
> I am seeing this warning in the 9.4 branch:
>   ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
>   in this function [-Wmaybe-uninitialized]

I'm also getting that, just in 9.4, but at a different line number:

ginxlog.c: In function 'ginRedoDeletePage':
ginxlog.c:693: warning: 'lbuffer' may be used uninitialized in this function

That's with gcc version 4.4.7 20120313 (Red Hat 4.4.7-23)

regards, tom lane



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Bruce Momjian
On Thu, Dec 13, 2018 at 03:03:54PM +0300, Alexander Korotkov wrote:
> On Wed, Dec 12, 2018 at 4:08 PM Andrey Borodin  wrote:
> > > On 12 Dec 2018, at 3:26, Alexander Korotkov wrote:
> > >
> > > BTW, I still can't see you're setting deleteXid field of
> > > ginxlogDeletePage struct anywhere.
> > Oops. Fixed.
> > >
> > > Also, I note that protection against re-usage of recently deleted
> > > pages is implemented differently than it is in B-tree.
> > > 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> > > RecentGlobalDataXmin)), while B-tree checks
> > > TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> > > clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> > > here?
> > As far as I understand RecentGlobalDataXmin may be slightly farther than 
> > RecentGlobalXmin in case when replication_slot_catalog_xmin is holding 
> > RecentGlobalXmin. And GIN is never used by catalog tables. But let's use 
> > RecentGlobalXmin like in B-tree.
> >
> > > 2) B-tree checks this condition both before putting page into FSM and
> > > after getting page from FSM.  You're checking only after getting page
> > > from FSM.  Approach of B-tree looks better for me.  It's seems more
> > > consistent when FSM pages are really free for usage.
> > Fixed.
> 
> Thank you.  I've revised your patch and pushed it, along with two
> other patches in this thread.

I am seeing this warning in the 9.4 branch:

ginxlog.c:756:5: warning: ‘lbuffer’ may be used uninitialized
in this function [-Wmaybe-uninitialized]

from this commit:

commit 19cf52e6cc33b9e126775f28269b6ddf7e30
Author: Alexander Korotkov 
Date:   Thu Dec 13 06:12:25 2018 +0300

Prevent deadlock in ginRedoDeletePage()

On standby ginRedoDeletePage() can work concurrently with read-only queries.
Those queries can traverse posting tree in two ways.
1) Using rightlinks by ginStepRight(), which locks the next page before
   unlocking its left sibling.
2) Using downlinks by ginFindLeafPage(), which locks at most one page at 
time.

Original lock order was: page, parent, left sibling.  That lock order can
deadlock with ginStepRight().  In order to prevent deadlock this commit 
changes
lock order to: left sibling, page, parent.  Note, that position of parent in
locking order seems insignificant, because we only lock one page at time 
while
traversing downlinks.

Reported-by: Chen Huajun
Diagnosed-by: Chen Huajun, Peter Geoghegan, Andrey Borodin
Discussion: 
https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Alexander Korotkov
Backpatch-through: 9.4

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-13 Thread Alexander Korotkov
On Wed, Dec 12, 2018 at 4:08 PM Andrey Borodin  wrote:
> > On 12 Dec 2018, at 3:26, Alexander Korotkov wrote:
> >
> > BTW, I still can't see you're setting deleteXid field of
> > ginxlogDeletePage struct anywhere.
> Oops. Fixed.
> >
> > Also, I note that protection against re-usage of recently deleted
> > pages is implemented differently than it is in B-tree.
> > 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> > RecentGlobalDataXmin)), while B-tree checks
> > TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> > clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> > here?
> As far as I understand, RecentGlobalDataXmin may be slightly ahead of
> RecentGlobalXmin when replication_slot_catalog_xmin is holding back
> RecentGlobalXmin. And GIN is never used for catalog tables. But let's use
> RecentGlobalXmin, like in B-tree.
>
> > 2) B-tree checks this condition both before putting page into FSM and
> > after getting page from FSM.  You're checking only after getting page
> > from FSM.  The B-tree approach looks better to me.  It seems more
> > consistent when FSM pages are really free for usage.
> Fixed.

Thank you.  I've revised your patch and pushed it, along with the two
other patches in this thread.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-12 Thread Andrey Borodin
Hi!

> On 12 Dec 2018, at 3:26, Alexander Korotkov wrote:
> 
> BTW, I still can't see you're setting deleteXid field of
> ginxlogDeletePage struct anywhere.
Oops. Fixed.
> 
> Also, I note that protection against re-usage of recently deleted
> pages is implemented differently than it is in B-tree.
> 1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
> RecentGlobalDataXmin)), while B-tree checks
> TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
> clarify why do we use RecentGlobalDataXmin instead of RecentGlobalXmin
> here?
As far as I understand, RecentGlobalDataXmin may be slightly ahead of
RecentGlobalXmin when replication_slot_catalog_xmin is holding back
RecentGlobalXmin. And GIN is never used for catalog tables. But let's use
RecentGlobalXmin, like in B-tree.

> 2) B-tree checks this condition both before putting page into FSM and
> after getting page from FSM.  You're checking only after getting page
> from FSM.  The B-tree approach looks better to me.  It seems more
> consistent when FSM pages are really free for usage.
Fixed.


Best regards, Andrey Borodin.


0001-Stamp-deleted-GIN-page-with-xid-v3.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-11 Thread Alexander Korotkov
On Tue, Dec 11, 2018 at 1:50 AM Alexander Korotkov  wrote:
> On Sun, Dec 9, 2018 at 10:25 PM Alexander Korotkov  
> wrote:
> > On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin  wrote:
> > > On 8 Dec 2018, at 6:54, Alexander Korotkov wrote:
> > > >
> > > > Yep, please find attached draft patch.
> > >
> > > Patch seems good to me, I'll check it in more detail.
> > The patch gets the posting item at FirstOffsetNumber instead of using
> > btree->getLeftMostChild(). This seems OK, since dataGetLeftMostPage() does
> > just the same, but with a few Assert()s.
> >
> > I'd like to avoid creating a GinBtree just to call
> > getLeftMostChild().  Also, a few more places in ginvacuum.c do the same.
> > We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
> > So, let's keep it uniform.
> >
> > I would also like Peter Geoghegan to take a look at this patch before
> > committing it.
>
> I've slightly adjusted commit message.  I'm going to commit this fix
> if no objections.

Please also find attached a patch changing the lock order in ginRedoDeletePage(),
preventing a deadlock on standby.  I'm going to commit it too.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-redo-delete-page-locking-order-1.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-11 Thread Alexander Korotkov
On Tue, Dec 11, 2018 at 10:14 AM Andrey Borodin  wrote:
> > On 11 Dec 2018, at 3:43, Alexander Korotkov wrote:
> >
> > Attached patch appears to be incomplete.  GinPageSetDeleteXid() is
> > called only in ginRedoDeletePage(), so only in recovery, while it
> > should be set during normal work too.  deleteXid field of
> > ginxlogDeletePage is never set.
>
> Sorry, I've messed it up again. I forgot to include the ginvacuum.c changes.
> Here it is.

BTW, I still can't see you're setting deleteXid field of
ginxlogDeletePage struct anywhere.

Also, I note that protection against re-usage of recently deleted
pages is implemented differently than it is in B-tree.
1) You check TransactionIdPrecedes(GinPageGetDeleteXid(page),
RecentGlobalDataXmin), while B-tree checks
TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin).  Could you
clarify why we use RecentGlobalDataXmin instead of RecentGlobalXmin
here?
2) B-tree checks this condition both before putting page into FSM and
after getting page from FSM.  You're checking only after getting page
from FSM.  The B-tree approach looks better to me.  It seems more
consistent when FSM pages are really free for usage.
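
For illustration, a rough sketch of the double check I have in mind is below.
The call sites and variable names here are assumptions made for the example,
not the actual patch.

    /* 1) In vacuum, before reporting a deleted page to the FSM: only pages
     *    whose delete xid is already behind the horizon are recyclable. */
    if (GinPageIsDeleted(page) &&
        TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin))
        RecordFreeIndexPage(index, blkno);

    /* 2) In GinNewBuffer(), after getting a block back from the FSM: re-check,
     *    because the xmin horizon may not have advanced far enough yet. */
    if (GinPageIsDeleted(page) &&
        !TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalXmin))
        continue;       /* deleted too recently, look for another block */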

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-10 Thread Andrey Borodin
Hi!

> On 11 Dec 2018, at 3:43, Alexander Korotkov wrote:
> 
> Attached patch appears to be incomplete.  GinPageSetDeleteXid() is
> called only in ginRedoDeletePage(), so only in recovery, while it
> should be set during normal work too.  deleteXid field of
> ginxlogDeletePage is never set.

Sorry, I've messed it up again. I forgot to include the ginvacuum.c changes. Here it is.

Best regards, Andrey Borodin.


0001-Stamp-deleted-GIN-page-with-xid-v2.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-10 Thread Alexander Korotkov
On Sun, Dec 9, 2018 at 10:25 PM Alexander Korotkov  wrote:
> On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin  wrote:
> > > On 8 Dec 2018, at 6:54, Alexander Korotkov wrote:
> > >
> > > Yep, please find attached draft patch.
> >
> > Patch seems good to me, I'll check it in more detail.
> > The patch gets the posting item at FirstOffsetNumber instead of using
> > btree->getLeftMostChild(). This seems OK, since dataGetLeftMostPage() does
> > just the same, but with a few Assert()s.
>
> I'd like to avoid creating a GinBtree just to call
> getLeftMostChild().  Also, a few more places in ginvacuum.c do the same.
> We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
> So, let's keep it uniform.
>
> I would also like Peter Geoghegan to take a look at this patch before
> committing it.

I've slightly adjusted commit message.  I'm going to commit this fix
if no objections.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-vacuum-fix-2.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-10 Thread Alexander Korotkov
On Mon, Dec 10, 2018 at 4:58 PM Andrey Borodin  wrote:
> > On 10 Dec 2018, at 18:56, Andrey Borodin wrote:
> >
> > PFA this part. In thread about GiST vacuum Heikki was not very happy with 
> > reusing PageHeader->pd_prune_xid for this. But we somewhat concluded that 
> > this might be OK.
> > Anyway, there's whole page, we can store those bits anywhere.

We might allocate a special field in the opaque area to store that xid, like
B-tree or SP-GiST do.  But we have to stay binary compatible.  And this
change shouldn't require an upgrade procedure, since we're going to
backpatch it.  We could allocate space elsewhere on the page, but
that would be a bit cumbersome.  So, let's use pd_prune_xid since it's
available.
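
To make that concrete, a minimal sketch of how such macros could sit on top of
pd_prune_xid is below (the names come from your patch; the exact definitions
there may differ):

    /* Sketch only: keep the deletion xid in the otherwise unused pd_prune_xid. */
    #define GinPageGetDeleteXid(page) \
        ( ((PageHeader) (page))->pd_prune_xid )
    #define GinPageSetDeleteXid(page, xid) \
        ( ((PageHeader) (page))->pd_prune_xid = (xid) )

    /* Stamped when the page is deleted during vacuum: */
    GinPageSetDeleteXid(page, ReadNewTransactionId());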

> Errrmm... I'm very sorry for the noise. I forgot to attach the actual patch.

Attached patch appears to be incomplete.  GinPageSetDeleteXid() is
called only in ginRedoDeletePage(), so only in recovery, while it
should be set during normal work too.  deleteXid field of
ginxlogDeletePage is never set.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-10 Thread Andrey Borodin

> On 10 Dec 2018, at 18:56, Andrey Borodin wrote:
> 
> PFA this part. In thread about GiST vacuum Heikki was not very happy with 
> reusing PageHeader->pd_prune_xid for this. But we somewhat concluded that 
> this might be OK.
> Anyway, there's whole page, we can store those bits anywhere.

Errrmm... I'm very sorry for the noise. I forgot to attach the actual patch.

Best regards, Andrey Borodin.


0001-Stamp-deleted-GIN-page-with-xid-to-prevent-early-reu.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-10 Thread Andrey Borodin
Hi!

> On 10 Dec 2018, at 0:25, Alexander Korotkov wrote:
>> 
>> There's a patch above in this thread 
>> 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch where I 
>> propose stamping every deleted page with GinPageSetDeleteXid(page, 
>> ReadNewTransactionId()); and avoid reusing the page before 
>> TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
>> Should we leave alone this bug for future fixes to keep current fix 
>> noninvasive?
> 
> I think, since this is a separate bug introduced in PostgreSQL 9.4, it
> should be backpatched with a separate commit.  Could you please extract
> the part of the patch dealing with GinPageSetDeleteXid() and
> GinPageGetDeleteXid()?  The rest of the work in your patch should be
> considered for master.

PFA this part. In thread about GiST vacuum Heikki was not very happy with 
reusing PageHeader->pd_prune_xid for this. But we somewhat concluded that this 
might be OK.
Anyway, there's whole page, we can store those bits anywhere.

> BTW, what do you think about locking order in ginRedoDeletePage()?  Do
> you have any explanation of current behavior?
It seems to me that locking order must be left->deleted->parent.

Best regards, Andrey Borodin.


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-09 Thread Alexander Korotkov
On Sat, Dec 8, 2018 at 12:48 PM Andrey Borodin  wrote:
> > On 8 Dec 2018, at 6:54, Alexander Korotkov wrote:
> >
> > Yep, please find attached draft patch.
>
> Patch seems good to me, I'll check it in more detail.
> The patch gets the posting item at FirstOffsetNumber instead of using
> btree->getLeftMostChild(). This seems OK, since dataGetLeftMostPage() does
> just the same, but with a few Assert()s.

I'd like to avoid creating a GinBtree just to call
getLeftMostChild().  Also, a few more places in ginvacuum.c do the same.
We have the same amount of Assert()s in ginVacuumPostingTreeLeaves().
So, let's keep it uniform.

I would also like Peter Geoghegan to take a look at this patch before
committing it.

> > BTW, it seems that I've another bug in GIN.  README says that
> >
> > "However, posting trees are only
> > fully searched from left to right, starting from the leftmost leaf. (The
> > tree-structure is only needed by insertions, to quickly find the correct
> > insert location). So as long as we don't delete the leftmost page on each
> > level, a search can never follow a downlink to page that's about to be
> > deleted."
> >
> > But that's not really true once we taught GIN to skip parts of posting
> > trees in PostgreSQL 9.4.  So, there is a risk of visiting a page about to
> > be deleted via its downlink.  But to get a real problem, vacuum would have
> > to get past the cleanup stage and someone else would have to reuse this
> > page before we traverse the downlink.  Thus, the risk of a real problem is
> > very low.  But it still should be addressed.
>
> There's a patch above in this thread 
> 0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch where I 
> propose stamping every deleted page with GinPageSetDeleteXid(page, 
> ReadNewTransactionId()); and avoid reusing the page before 
> TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
> Should we leave alone this bug for future fixes to keep current fix 
> noninvasive?

I think, since this is a separate bug introduced in PostgreSQL 9.4, it
should be backpatched with a separate commit.  Could you please extract
the part of the patch dealing with GinPageSetDeleteXid() and
GinPageGetDeleteXid()?  The rest of the work in your patch should be
considered for master.

BTW, what do you think about locking order in ginRedoDeletePage()?  Do
you have any explanation of current behavior?  I'll try to reach
Teodor tomorrow with this question.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-08 Thread Andrey Borodin
Hi!

Thanks, Alexander!
> On 8 Dec 2018, at 6:54, Alexander Korotkov wrote:
> 
> Yep, please find attached draft patch.

Patch seems good to me, I'll check it in more detail.
The patch gets the posting item at FirstOffsetNumber instead of using
btree->getLeftMostChild(). This seems OK, since dataGetLeftMostPage() does
just the same, but with a few Assert()s.

> BTW, it seems that I've another bug in GIN.  README says that
> 
> "However, posting trees are only
> fully searched from left to right, starting from the leftmost leaf. (The
> tree-structure is only needed by insertions, to quickly find the correct
> insert location). So as long as we don't delete the leftmost page on each
> level, a search can never follow a downlink to page that's about to be
> deleted."
> 
> But that's not really true once we taught GIN to skip parts of posting
> trees in PostgreSQL 9.4.  So, there is a risk of visiting a page about to
> be deleted via its downlink.  But to get a real problem, vacuum would have
> to get past the cleanup stage and someone else would have to reuse this
> page before we traverse the downlink.  Thus, the risk of a real problem is
> very low.  But it still should be addressed.

There's a patch above in this thread 
0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch where I propose 
stamping every deleted page with GinPageSetDeleteXid(page, 
ReadNewTransactionId()); and avoid reusing the page before 
TransactionIdPrecedes(GinPageGetDeleteXid(page), RecentGlobalDataXmin).
Should we leave alone this bug for future fixes to keep current fix noninvasive?

Best regards, Andrey Borodin.


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-07 Thread Alexander Korotkov
On Fri, Dec 7, 2018 at 12:50 AM Peter Geoghegan  wrote:
> On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov  
> wrote:
> > So, the algorithm introduced by 218f51584d5 is broken.  It tries to
> > guarantee that there are no inserters in the subtree by placing a
> > cleanup lock on the subtree root, assuming that an inserter holds pins
> > on the path from root to leaf.  But due to concurrent splits of internal
> > pages, the pins held may no longer correspond to the actual path.  I
> > don't see a way to fix this easily.  So, I think we should revert it
> > from back branches and try to reimplement it in master.
> >
> > However, I'd like to note that 218f51584d5 introduces two changes:
> > 1) Cleanup locking only if there are pages to delete
> > 2) Cleanup locking only the subtree root
> > The 2nd one is broken.  But the 1st one still seems good and useful to
> > me, because in the vast majority of cases vacuum doesn't delete any
> > index pages.  So, I propose to revert 218f51584d5, but keep the logic
> > that locks the root for cleanup only once there are pages to delete.
> > Any thoughts?
>
> Can you post a patch that just removes the 2nd part, leaving the
> still-correct first part?
>
> Your proposal sounds reasonable, but I'd like to see what you have in
> mind in detail before commenting.

Yep, please find attached draft patch.

BTW, it seems that I've another bug in GIN.  README says that

"However, posting trees are only
fully searched from left to right, starting from the leftmost leaf. (The
tree-structure is only needed by insertions, to quickly find the correct
insert location). So as long as we don't delete the leftmost page on each
level, a search can never follow a downlink to page that's about to be
deleted."

But that's not really true once we taught GIN to skip parts of posting
trees in PostgreSQL 9.4.  So, there is a risk of visiting a page about to
be deleted via its downlink.  But to get a real problem, vacuum would have
to get past the cleanup stage and someone else would have to reuse this
page before we traverse the downlink.  Thus, the risk of a real problem is
very low.  But it still should be addressed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


gin-vacuum-fix-1.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-07 Thread Alexander Korotkov
On Fri, Dec 7, 2018 at 12:14 PM Andrey Borodin  wrote:
> > On 7 Dec 2018, at 2:50, Peter Geoghegan wrote:
> > On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov  
> > wrote:
> >> However, I'd like to note that 218f51584d5 introduces two changes:
> >> 1) Cleanup locking only if there are pages to delete
> >> 2) Cleanup locking only the subtree root
> >> The 2nd one is broken.  But the 1st one still seems good and useful to
> >> me, because in the vast majority of cases vacuum doesn't delete any
> >> index pages.  So, I propose to revert 218f51584d5, but keep the logic
> >> that locks the root for cleanup only once there are pages to delete.
> >> Any thoughts?
> >
> > Can you post a patch that just removes the 2nd part, leaving the
> > still-correct first part?
>
> I like the idea of keeping cleanup lock only if there are pages to delete. It 
> will still solve the original problem that caused proposals for GIN VACUUM 
> enhancements.
>
> But I must note that there is one more problem: ginVacuumPostingTreeLeaves()
> does not ensure that all split pages are visited. It copies downlink block
> numbers to a temp array and then unlocks the parent. That is not a correct
> way to scan a posting tree for cleanup.

Hmm... In ginVacuumPostingTreeLeaves() we visit pages from parent to
children.  This is why split pages might be left unvisited.  Should we
visit leaf pages by rightlinks instead?  Then, if we have candidates
for deletion, ginScanToDelete() can work the same as before 218f51584d5.
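
To illustrate what I mean, a hypothetical sketch of such a rightlink-based
leaf scan is below.  The function name and the bookkeeping are made up for the
example; a real patch would also do the actual item vacuuming and then drive
the deletion pass only when some leaf turned out empty.

    /* Hypothetical sketch: walk posting-tree leaves left to right and report
     * whether any leaf became empty (and thus is a candidate for deletion). */
    static bool
    ginScanLeavesByRightlinks(GinVacuumState *gvs, BlockNumber leftmostLeaf)
    {
        BlockNumber blkno = leftmostLeaf;
        bool        hasEmpty = false;

        while (blkno != InvalidBlockNumber)
        {
            Buffer      buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM,
                                                    blkno, RBM_NORMAL,
                                                    gvs->strategy);
            Page        page;

            LockBuffer(buffer, GIN_EXCLUSIVE);
            page = BufferGetPage(buffer);

            /* ... vacuum the leaf's items here ... */

            if (GinDataLeafPageIsEmpty(page))
                hasEmpty = true;

            blkno = GinPageGetOpaque(page)->rightlink;
            UnlockReleaseBuffer(buffer);
        }

        return hasEmpty;
    }

Only if this returns true would we take the cleanup lock on the root and run
the page-deletion pass.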

> PFA diff with following changes:
> 1. Always take root cleanup lock before deleting pages
> 2. Check for concurrent splits after scanning page
>
> Please note that neither applying this diff nor reverting 218f51584d5 will
> solve the bug with the page-delete redo lock order on standby.

ginRedoDeletePage() visits pages in the following order: right (the page to be
deleted), parent, left.  I didn't find any description of why it was done
this way.  Any guesses why we don't visit the parent page first here?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-07 Thread Andrey Borodin
Hi!

> On 7 Dec 2018, at 2:50, Peter Geoghegan wrote:
> 
> On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov  
> wrote:
>> 
>> However, I'd like to note that 218f51584d5 introduces two changes:
>> 1) Cleanup locking only if there are pages to delete
>> 2) Cleanup locking only the subtree root
>> The 2nd one is broken.  But the 1st one still seems good and useful to
>> me, because in the vast majority of cases vacuum doesn't delete any
>> index pages.  So, I propose to revert 218f51584d5, but keep the logic
>> that locks the root for cleanup only once there are pages to delete.
>> Any thoughts?
> 
> Can you post a patch that just removes the 2nd part, leaving the
> still-correct first part?

I like the idea of keeping cleanup lock only if there are pages to delete. It 
will still solve the original problem that caused proposals for GIN VACUUM 
enhancements.

But I must note that there is one more problem: ginVacuumPostingTreeLeaves()
does not ensure that all split pages are visited. It copies downlink block
numbers to a temp array and then unlocks the parent. That is not a correct way
to scan a posting tree for cleanup.

PFA diff with following changes:
1. Always take root cleanup lock before deleting pages
2. Check for concurrent splits after scanning page

Please note that neither applying this diff nor reverting 218f51584d5 will
solve the bug with the page-delete redo lock order on standby.

Best regards, Andrey Borodin.


gin_root_lock.diff
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-06 Thread Peter Geoghegan
On Thu, Dec 6, 2018 at 12:51 PM Alexander Korotkov  wrote:
> So, the algorithm introduced by 218f51584d5 is broken.  It tries to
> guarantee that there are no inserters in the subtree by placing a
> cleanup lock on the subtree root, assuming that an inserter holds pins
> on the path from root to leaf.  But due to concurrent splits of internal
> pages, the pins held may no longer correspond to the actual path.  I
> don't see a way to fix this easily.  So, I think we should revert it
> from back branches and try to reimplement it in master.
>
> However, I'd like to note that 218f51584d5 introduces two changes:
> 1) Cleanup locking only if there are pages to delete
> 2) Cleanup locking only the subtree root
> The 2nd one is broken.  But the 1st one still seems good and useful to
> me, because in the vast majority of cases vacuum doesn't delete any
> index pages.  So, I propose to revert 218f51584d5, but keep the logic
> that locks the root for cleanup only once there are pages to delete.
> Any thoughts?

Can you post a patch that just removes the 2nd part, leaving the
still-correct first part?

Your proposal sounds reasonable, but I'd like to see what you have in
mind in detail before commenting.

Thanks
-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-06 Thread Alexander Korotkov
On Thu, Dec 6, 2018 at 10:56 PM Alexander Korotkov  wrote:
> On Tue, Dec 4, 2018 at 8:01 PM Andres Freund  wrote:
> > On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> > > On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> > > > Teodor: Do you think that the issue is fixable? It looks like there
> > > > are serious issues with the design of 218f51584d5 to me. I don't think
> > > > the general "there can't be any inserters at this subtree" thing works
> > > > given that we have to couple buffer locks when moving right for other
> > > > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > > > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > > > probably important, because it seems to be what breaks the subtree
> > > > design (we don't delete in two phases or anything in GIN).
> > >
> > > Ping?
> > >
> > > This is a thorny problem, and I'd like to get your input soon. I
> > > suspect that reverting 218f51584d5 may be the ultimate outcome, even
> > > though it's a v10 feature.
> >
> > Teodor, any chance for a response here?  It's not OK to commit something
> > and then just not respond to bug-reports, especially when they're well
> > diagnosed and clearly point towards issues in a commit of yours.
>
> Unfortunately, Teodor is unreachable at least till next week.
>
> > Alexander, CCing you, perhaps you could point the thread out to Teodor?
>
> I'll take a look at this issue.

I've run through the thread.

So, the algorithm introduced by 218f51584d5 is broken.  It tries to
guarantee that there are no inserters in the subtree by placing a
cleanup lock on the subtree root, assuming that an inserter holds pins
on the path from root to leaf.  But due to concurrent splits of internal
pages, the pins held may no longer correspond to the actual path.  I
don't see a way to fix this easily.  So, I think we should revert it
from back branches and try to reimplement it in master.

However, I'd like to note that 218f51584d5 introduces two changes:
1) Cleanup locking only if there are pages to delete
2) Cleanup locking only the subtree root
The 2nd one is broken.  But the 1st one still seems good and useful to
me, because in the vast majority of cases vacuum doesn't delete any
index pages.  So, I propose to revert 218f51584d5, but keep the logic
that locks the root for cleanup only once there are pages to delete.
Any thoughts?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-06 Thread Alexander Korotkov
On Tue, Dec 4, 2018 at 8:01 PM Andres Freund  wrote:
> On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> > On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> > > Teodor: Do you think that the issue is fixable? It looks like there
> > > are serious issues with the design of 218f51584d5 to me. I don't think
> > > the general "there can't be any inserters at this subtree" thing works
> > > given that we have to couple buffer locks when moving right for other
> > > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > > probably important, because it seems to be what breaks the subtree
> > > design (we don't delete in two phases or anything in GIN).
> >
> > Ping?
> >
> > This is a thorny problem, and I'd like to get your input soon. I
> > suspect that reverting 218f51584d5 may be the ultimate outcome, even
> > though it's a v10 feature.
>
> Teodor, any chance for a response here?  It's not OK to commit something
> and then just not respond to bug-reports, especially when they're well
> diagnosed and clearly point towards issues in a commit of yours.

Unfortunately, Teodor is unreachable at least till next week.

> Alexander, CCing you, perhaps you could point the thread out to Teodor?

I'll take a look at this issue.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-12-04 Thread Andres Freund
Hi,

On 2018-11-10 17:42:16 -0800, Peter Geoghegan wrote:
> On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> > Teodor: Do you think that the issue is fixable? It looks like there
> > are serious issues with the design of 218f51584d5 to me. I don't think
> > the general "there can't be any inserters at this subtree" thing works
> > given that we have to couple buffer locks when moving right for other
> > reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> > that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> > probably important, because it seems to be what breaks the subtree
> > design (we don't delete in two phases or anything in GIN).
> 
> Ping?
> 
> This is a thorny problem, and I'd like to get your input soon. I
> suspect that reverting 218f51584d5 may be the ultimate outcome, even
> though it's a v10 feature.

Teodor, any chance for a response here?  It's not OK to commit something
and then just not respond to bug-reports, especially when they're well
diagnosed and clearly point towards issues in a commit of yours.

Alexander, CCing you, perhaps you could point the thread out to Teodor?

Greetings,

Andres Freund



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-23 Thread Peter Geoghegan
On Fri, Nov 23, 2018 at 2:18 AM Andrey Borodin  wrote:
> I found no practical way to fix the concept of a "subtree lock". Pre-locking
> all left siblings for cleanup would suffice, but does not look very practical.
> The GIN posting tree lacks the useful B-tree inventory, like left links and
> high keys, needed for concurrent deletion without a "subtree lock".

Significantly changing the design of GIN vacuuming in back branches
for a release that's more than a year old is absolutely out of the
question. I think that your patch should be evaluated as a new v12
feature, following the revert of 218f51584d5.

> Please note that the attached patch does not fix the 2nd bug found by Chen
> in page delete redo.
>
> I understand that reverting commit 218f51584d5 and returning to long but
> finite root locks is the safest solution

Whatever happens with the master branch, I think that reverting commit
218f51584d5 is the only option on the table for the back branches. Its
design is based on ideas on locking protocols that are fundamentally
incorrect and unworkable. I don't have a lot of faith in our ability
to retrofit a design that fixes the issue without causing problems
elsewhere.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-23 Thread Andrey Borodin
Hi!

> On 15 Nov 2018, at 17:30, Andrey Borodin wrote:
> 
> I think I can compose patch for consideration.

I found no practical way to fix the concept of a "subtree lock". Pre-locking
all left siblings for cleanup would suffice, but does not look very practical.
The GIN posting tree lacks the useful B-tree inventory, like left links and
high keys, needed for concurrent deletion without a "subtree lock".

Let's consider concurrent deletion.
As far as I understand, the absence of a high key makes concurrent deletion of
the highest keys from internal pages unsafe. Thus, internal pages cannot be
deleted at all.
PFA a patch with the simplest possible deletion algorithm:
1. Descend to the leftmost leaf
2. Scan by rightlinks until the rightmost page
3. Delete empty leaf pages, if they are not the highest keys in the
corresponding parent page

The parent page is found by a rightlink scan for the leftmost parent, with the
previous result cached.
A deleted page is stamped with ReadNewTransactionId()'s xid; the page cannot be
reused until the deletion xid is beyond RecentGlobalDataXmin.
DeleteXid is WAL-logged.
I've refactored btree->getLeftMostChild() and btree->findChildPtr() to reuse
their code (removed an unnecessary parameter).

Please note that the attached patch does not fix the 2nd bug found by Chen in
page delete redo.

I understand that reverting commit 218f51584d5 and returning to long but finite
root locks is the safest solution. In this description of the proposed patch
I've tried to put the concepts in a minimum of words. Please ping me if some
concepts sound cryptic and require more clarification.

Best regards, Andrey Borodin.


0001-Use-correct-locking-protocol-during-GIN-posting-tree.patch
Description: Binary data


Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-15 Thread Andrey Borodin
Hi!

Mail.App somehow broke my response (with something far beyond my understanding
of what plain text is), so I'll quote all of my previous message here. Hope
this mail client is OK.

14.11.2018, 23:13, "Andrey Borodin" :
> Hi everyone!
>
> I somehow didn't notice this thread for too long, sorry.
>
>> On 8 Nov 2018, at 6:46, Peter Geoghegan wrote:
>>
>> I don't think
>> the general "there can't be any inserters at this subtree" thing works
>> given that we have to couple buffer locks when moving right for other
>> reasons. We call ginStepRight() within ginFinishSplit(), for reasons
>> that date back to bug fix commit ac4ab97e from 2013 -- that detail is
>> probably important, because it seems to be what breaks the subtree
>> design (we don't delete in two phases or anything in GIN).
>
> ginVacuumPostingTreeLeaves() holds LockBufferForCleanup() on subtree root b.
> Thus there may be no GinBtreeStack's holding pin on b at the moment.
> When you ginStepRight(b) to the parent in ginFinishSplit(), you always get to 
> the buffer from your stack.
> Hence you can never have ginFinishSplit() deadlock with cleanup of subtree 
> whose root is LockBufferForCleanup()'d.
>
> Is this correct or did I miss something?
>
> But we have a deadlock at hand, I'll think more about it. Something with 
> locking protocol is clearly wrong.
>
>> On 11 Nov 2018, at 22:33, chenhj wrote:
>>
>> The order of acquiring lwlocks in ginRedoDeletePage() should perhaps be
>> changed from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer".
>> Is this right?
>
> This looks correct to me.
>
> Best regards, Andrey Borodin.

I've reread Chen's reports and understood his findings.

> When you ginStepRight(b) to the parent in ginFinishSplit(), you always get to 
> the buffer from your stack.

This statement does not hold.
When you have a GinBtreeStack S, one of its internal pages may split into a new
page P. If at that moment you start a multi-level delete from P and a cascade
split from S, P is not in S and we may deadlock.

The correct solution seems to be to replace the lock-coupling descent with
ordinary B-tree-style searches for the parent pages.
I think I can compose a patch for consideration.

Best regards, Andrey Borodin.



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-14 Thread Andrey Borodin
Hi everyone!

I somehow didn't notice this thread for too long, sorry.

> On 8 Nov 2018, at 6:46, Peter Geoghegan wrote:
>
> I don't think
> the general "there can't be any inserters at this subtree" thing works
> given that we have to couple buffer locks when moving right for other
> reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> probably important, because it seems to be what breaks the subtree
> design (we don't delete in two phases or anything in GIN).

ginVacuumPostingTreeLeaves() holds LockBufferForCleanup() on subtree root b.
Thus there may be no GinBtreeStack's holding pin on b at the moment.
When you ginStepRight(b) to the parent in ginFinishSplit(), you always get to
the buffer from your stack.
Hence you can never have ginFinishSplit() deadlock with cleanup of subtree
whose root is LockBufferForCleanup()'d.

Is this correct or did I miss something?

But we have a deadlock at hand, I'll think more about it. Something with
locking protocol is clearly wrong.

> On 11 Nov 2018, at 22:33, chenhj wrote:
>
> The order of acquiring lwlocks in ginRedoDeletePage() should perhaps be
> changed from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer".
> Is this right?

This looks correct to me.

Best regards, Andrey Borodin.


Re:Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-11 Thread chjis...@163.com
There's a mistake in the last mail. About how to reproduce the issue, the SQL
in step 4 should be:

    set enable_seqscan = false;
    select count(*) from tb1 where id = 1;

rather than:

4. execute select SQL
    set enable_seqscan = false;
    select count(*) from tb1 where id = 2;

Regards,
Chen Huajun

Re:Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-11 Thread chenhj
Hi


Previously we only discussed the connection hang on the primary; the connection
hang on the standby database appears to be a separate problem.
When the recovery process replays GIN's delete-page WAL record, the order of
acquiring lwlocks is not the same as in the select process,
resulting in a deadlock. According to the information from gcore, the details
are as follows:


## select process


   2246(rootBlkno=2246)
  |
3254(1. Held buffer=366260,LWLock=0x2d50e2e4)  --rightlink--> 483(2. 
Acquire buffer=2201739,LWLock=0x2aaab45158a4)


The ginStepRight() function in the select process acquires the lwlocks in
left-to-right order.


## recovey process


   2246(2. Held buffer=7034160,LWLock=0x2aaac6c081e4,rootBlkno)
  |
3254(3. Acquire buffer=366260,LWLock=0x2d50e2e4)  --rightlink--> 483(1. 
Held buffer=2201739,LWLock=0x2aaab45158a4)


But the ginRedoDeletePage() function in the recovery process acquires the
lwlocks in the order: current, then parent, then left.
So I think the inconsistent lock order in ginRedoDeletePage() is the reason for
the hang on the standby.


static void
ginRedoDeletePage(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
    Buffer      dbuffer;
    Buffer      pbuffer;
    Buffer      lbuffer;
    Page        page;

    if (XLogReadBufferForRedo(record, 0, &dbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(dbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->flags = GIN_DELETED;
        PageSetLSN(page, lsn);
        MarkBufferDirty(dbuffer);
    }

    if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(pbuffer);
        Assert(GinPageIsData(page));
        Assert(!GinPageIsLeaf(page));
        GinPageDeletePostingItem(page, data->parentOffset);
        PageSetLSN(page, lsn);
        MarkBufferDirty(pbuffer);
    }

    if (XLogReadBufferForRedo(record, 2, &lbuffer) == BLK_NEEDS_REDO)
    {
        page = BufferGetPage(lbuffer);
        Assert(GinPageIsData(page));
        GinPageGetOpaque(page)->rightlink = data->rightLink;
        PageSetLSN(page, lsn);
        MarkBufferDirty(lbuffer);
    }

    if (BufferIsValid(lbuffer))
        UnlockReleaseBuffer(lbuffer);
    if (BufferIsValid(pbuffer))
        UnlockReleaseBuffer(pbuffer);
    if (BufferIsValid(dbuffer))
        UnlockReleaseBuffer(dbuffer);
}




The order of acquiring lwlocks in ginRedoDeletePage() should perhaps be changed
from "dbuffer->pbuffer->lbuffer" to "lbuffer->dbuffer->pbuffer". Is this right?




## How to reproduce this issue


1. modify ginxlog.c and build (add "sleep(60)")


if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
{
    page = BufferGetPage(pbuffer);
    Assert(GinPageIsData(page));
    Assert(!GinPageIsLeaf(page));
    GinPageDeletePostingItem(page, data->parentOffset);
    PageSetLSN(page, lsn);
    MarkBufferDirty(pbuffer);
}
==>
if (XLogReadBufferForRedo(record, 1, &pbuffer) == BLK_NEEDS_REDO)
{
    page = BufferGetPage(pbuffer);
    Assert(GinPageIsData(page));
    Assert(!GinPageIsLeaf(page));
    GinPageDeletePostingItem(page, data->parentOffset);
    PageSetLSN(page, lsn);
    MarkBufferDirty(pbuffer);
    sleep(60);    /* added for debugging */
}




2. run test SQL on the primary


create table tb1(id int);
create index ON tb1 using gin(id);
insert into tb1 select 1 from generate_series(1,100)id;
delete from tb1;


3. check that the recovery process on the standby has entered the "sleep()" branch


$ ps -ef|grep reco
postgres 13418 13417  0 22:23 ?00:00:00 postgres: startup process   
recovering 0001005E
postgres 13425 31505  0 22:23 pts/800:00:00 grep --color=auto reco
$ pstack 13418
#0  0x7f2166d39650 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x7f2166d39504 in sleep () from /lib64/libc.so.6
#2  0x0048614f in ginRedoDeletePage (record=0x127cbe8) at 
ginxlog.c:480
#3  gin_redo (record=0x127cbe8) at ginxlog.c:732
#4  0x004efec3 in StartupXLOG ()
#5  0x00697c51 in StartupProcessMain ()
#6  0x004fd22a in AuxiliaryProcessMain ()
#7  0x00694e49 in StartChildProcess ()
#8  0x00697665 in PostmasterMain ()
#9  0x004766e1 in main ()


4. execute select SQL


set enable_seqscan = false;
select count(*) from tb1 where id =2;


5. check result


The recovery process blocks on an LWLock:


$ pstack 13418
#0  0x7f216775779b in do_futex_wait.constprop.1 () from 
/lib64/libpthread.so.0
#1  0x7f216775782f in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0
#2  0x7f21677578cb in sem_wait@@GLIBC_2.2.5 () from 
/lib64/libpthread.so.0
#3  0x00685df2 in PGSemaphoreLock ()
#4  0x006edd64 in LWLockAcquire ()
#5  0x004fa66a in XLogReadBufferForRedoExtended ()
#6  0x00486161 in ginRedoDeletePage (record=0x127cbe8) at 
ginxlog.c:483
#7  gin_redo (record=0x127cbe8) at ginxlog.c:732
#8  0x004efec3 in StartupXLOG ()
#9  0x00697c51 in StartupProcessMain ()
#10 0x004fd22a in AuxiliaryProcessMain ()
#11 0x00694e49 in StartChildProcess ()

Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-10 Thread Peter Geoghegan
On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> Teodor: Do you think that the issue is fixable? It looks like there
> are serious issues with the design of 218f51584d5 to me. I don't think
> the general "there can't be any inserters at this subtree" thing works
> given that we have to couple buffer locks when moving right for other
> reasons. We call ginStepRight() within ginFinishSplit(), for reasons
> that date back to bug fix commit ac4ab97e from 2013 -- that detail is
> probably important, because it seems to be what breaks the subtree
> design (we don't delete in two phases or anything in GIN).

Ping?

This is a thorny problem, and I'd like to get your input soon. I
suspect that reverting 218f51584d5 may be the ultimate outcome, even
though it's a v10 feature.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Wed, Nov 7, 2018 at 5:46 PM Peter Geoghegan  wrote:
> I think that you have to be doing a multi-level delete for a deadlock
> to take place, which isn't particularly likely to coincide with a
> concurrent insertion in general. That may explain why it's taken a
> year to get a high-quality bug report.

BTW, it's notable that Chen's query uses ON CONFLICT DO UPDATE.
Speculative insertion might cause just the wrong kind of churn,
involving continual recycling of heap TIDs.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Mon, Oct 29, 2018 at 12:04 PM chenhj  wrote:
> ## stack of autovacuum:Acquire lock 0x2aaab670dfe4 and hold 0x2aaab4009564
> --
> (gdb) bt
> #0  0x7fe11552379b in do_futex_wait.constprop.1 () from 
> /lib64/libpthread.so.0
> #1  0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from 
> /lib64/libpthread.so.0
> #2  0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from 
> /lib64/libpthread.so.0
> #3  0x0069d362 in PGSemaphoreLock (sema=0x2ac07eb8) at 
> pg_sema.c:310
> #4  0x007095ac in LWLockAcquire (lock=0x2aaab670dfe4, 
> mode=LW_EXCLUSIVE) at lwlock.c:1233
> #5  0x004947ef in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, 
> blkno=701, isRoot=isRoot@entry=0 '\000', parent=parent@entry=0x7ffd81e4c790, 
> myoff=myoff@entry=37) at ginvacuum.c:262
> #6  0x00494874 in ginScanToDelete (gvs=gvs@entry=0x7ffd81e4d0f0, 
> blkno=blkno@entry=9954, isRoot=isRoot@entry=1 '\001', 
> parent=parent@entry=0x7ffd81e4c790, myoff=myoff@entry=0)
> at ginvacuum.c:277
> #7  0x00494ed1 in ginVacuumPostingTreeLeaves 
> (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=9954, isRoot=isRoot@entry=0 '\000') at 
> ginvacuum.c:404
> #8  0x00494e21 in ginVacuumPostingTreeLeaves 
> (gvs=gvs@entry=0x7ffd81e4d0f0, blkno=644, isRoot=isRoot@entry=1 '\001') at 
> ginvacuum.c:372
> #9  0x00495540 in ginVacuumPostingTree (rootBlkno=<optimized out>, gvs=0x7ffd81e4d0f0) at ginvacuum.c:426
> #10 ginbulkdelete (info=0x7ffd81e4f720, stats=<optimized out>,
> callback=<optimized out>, callback_state=<optimized out>) at ginvacuum.c:649
> #11 0x005e1194 in lazy_vacuum_index (indrel=0x3146e28, 
> stats=stats@entry=0x28ec200, vacrelstats=vacrelstats@entry=0x28ebc28) at 
> vacuumlazy.c:1621
> #12 0x005e214d in lazy_scan_heap (aggressive=, 
> nindexes=, Irel=, vacrelstats=, 
> options=16, onerel=0x28ec1f8)
> at vacuumlazy.c:1311
> #13 lazy_vacuum_rel (onerel=onerel@entry=0x3144fa8, 
> options=options@entry=99, params=params@entry=0x289f270, bstrategy=<optimized out>) at vacuumlazy.c:258

Actually, the bigger problem is on this side of the deadlock, within
VACUUM. ginInsertCleanup() (the first/other side of the deadlock) may
have problems, but this seems worse. Commit 218f51584d5 appears to be
at fault here.

First things first: ginScanToDelete() *maintains* buffer locks on
multiple levels of a posting tree, meaning that there may be cases
where quite a few exclusive buffer locks may be held all at once (one
per level). MAX_SIMUL_LWLOCKS is 200 these days, and in practice a
B-Tree can never get that tall, but having the number of buffer locks
acquired be determined by the height of the tree is not okay, on
general principle. The nbtree code's page deletion goes to a lot of
effort to keep the number of buffer locks fixed, but nothing like that
is attempted for GIN posting trees.

Chen's original analysis of the problem seems to be more or less
accurate: the order that ginScanToDelete() acquires buffer locks as it
descends the tree (following commit 218f51584d5) is not compatible
with the order within ginFinishSplit(). The faulty code within
ginScanToDelete() crabs/couples buffer locks while descending the
tree, whereas the code within ginFinishSplit() crabs them as it
ascends the same tree.

Teodor: Do you think that the issue is fixable? It looks like there
are serious issues with the design of 218f51584d5 to me. I don't think
the general "there can't be any inserters at this subtree" thing works
given that we have to couple buffer locks when moving right for other
reasons. We call ginStepRight() within ginFinishSplit(), for reasons
that date back to bug fix commit ac4ab97e from 2013 -- that detail is
probably important, because it seems to be what breaks the subtree
design (we don't delete in two phases or anything in GIN).

I think that you have to be doing a multi-level delete for a deadlock
to take place, which isn't particularly likely to coincide with a
concurrent insertion in general. That may explain why it's taken a
year to get a high-quality bug report.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Mon, Oct 29, 2018 at 12:04 PM chenhj  wrote:
> #4  0x007095ac in LWLockAcquire (lock=0x2aaab4009564, 
> mode=LW_EXCLUSIVE) at lwlock.c:1233
> #5  0x00490a32 in ginStepRight (buffer=6713826, index=<optimized out>, lockmode=lockmode@entry=2) at ginbtree.c:174
> #6  0x00490c1c in ginFinishSplit 
> (btree=btree@entry=0x7ffd81e4f950, stack=0x28eebf8, 
> freestack=freestack@entry=1 '\001', buildStats=buildStats@entry=0x0) at 
> ginbtree.c:701
> #7  0x00491396 in ginInsertValue 
> (btree=btree@entry=0x7ffd81e4f950, stack=, 
> insertdata=insertdata@entry=0x7ffd81e4f940, buildStats=buildStats@entry=0x0)
> at ginbtree.c:773
> #8  0x0048fb42 in ginInsertItemPointers (index=, 
> rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, 
> nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
> at gindatapage.c:1907
> #9  0x0048a9ea in ginEntryInsert 
> (ginstate=ginstate@entry=0x28e7ef8, attnum=, key=42920440, 
> category=, items=0x2916598, nitem=353,
> buildStats=buildStats@entry=0x0) at gininsert.c:214
> #10 0x00496d94 in ginInsertCleanup 
> (ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', 
> fill_fsm=fill_fsm@entry=1 '\001',
> forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at 
> ginfast.c:883
> #11 0x00497727 in ginHeapTupleFastInsert 
> (ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) 
> at ginfast.c:448
> #12 0x0048b209 in gininsert (index=, 
> values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, 
> heapRel=, checkUnique=,
> indexInfo=0x28b5aa8) at gininsert.c:522

I want to point something out about the above inserter represented by
this stack trace. It deadlocks against VACUUM from within the
following ginEntryInsert() call:

/*
 * While we left the page unlocked, more stuff might have gotten
 * added to it.  If so, process those entries immediately.  There
 * shouldn't be very many, so we don't worry about the fact that
 * we're doing this with exclusive lock. Insertion algorithm
 * guarantees that inserted row(s) will not continue on next page.
 * NOTE: intentionally no vacuum_delay_point in this loop.
 */
if (PageGetMaxOffsetNumber(page) != maxoff)
{
    ginInitBA(&accum);
    processPendingPage(&accum, &datums, page, maxoff + 1);

    ginBeginBAScan(&accum);
    while ((list = ginGetBAEntry(&accum,
                                 &attnum, &key, &category, &nlist)) != NULL)
        ginEntryInsert(ginstate, attnum, key, category,   /* <--- deadlocks */
                       list, nlist, NULL);
}

Clearly this particular call to ginEntryInsert() from within
ginInsertCleanup() is generally supposed to be a rare occurrence. It's
not surprising that it's hard for you to hit this issue.

BTW, I strongly doubt that disabling hugepages has fixed anything,
though it may have accidentally masked the problem. This is an issue
around the basic correctness of the locking protocols used by GIN.

-- 
Peter Geoghegan



Re: Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-11-07 Thread Peter Geoghegan
On Tue, Nov 6, 2018 at 10:05 AM chenhj  wrote:
> I  analyzed the btree block where lwlock deadlock occurred, as follows:

Thank you for doing this important work.

You're using Postgres 10.2. While that version isn't current with all
GIN bug fixes, it does have this important one:

"Ensure that vacuum will always clean up the pending-insertions list
of a GIN index (Masahiko Sawada)"

Later GIN fixes seem unlikely to be relevant to your issue. I think
that this is probably a genuine, new bug.

> The ginInsertValue() function above gets the lwlock in the order described in 
> the README.

> However, ginScanToDelete() depth-first scans the btree and gets the EXCLUSIVE 
> lock, which creates a deadlock.
> Is the above statement correct? If so, deadlocks should easily happen.

I have been suspicious of deadlock hazards in the code for some time
-- particularly around pending list cleanup. I go into a lot of detail
on my suspicions here:

https://www.postgresql.org/message-id/flat/CAH2-WzmfUpRjWcUq3%2B9ijyum4AJ%2Bk-meGT8_HnxBMpKz1aNS-g%40mail.gmail.com#ea5af1088adfacb3d0ba88313be1a5e3

I note that your first deadlock involve the following kinds of backends:

* ginInsertCleanup() calls from a regular backend, which will have a
backend do things that VACUUM generally only gets to do, like call
RecordFreeIndexPage().

* (auto)VACUUM processes.

Your second/recovery deadlock involves:

* Regular read-only queries.

* Recovery code.

Quite a lot of stuff is involved here!

The code in this area is way too complicated, and I haven't thought
about it in about a year, so it's hard for me to be sure of anything
at the moment. My guess is that there is confusion about the type of
page expected within one or more blocks (e.g. entry tree vs. pending
list), due to a race condition in block deletion and/or recycling --
again, I've suspected something like this could happen for some time.
The fact that you get a distinct deadlock during recovery is
consistent with that theory.

It's safe to say that promoting the asserts on gin page type into
"can't happen" elog errors in places like scanPendingInsert() and
ginInsertCleanup() would be a good start. Note that we already did
similar Assert-to-elog(ERROR) promotion for the posting tree code, when
similar bugs appeared way back in 2013.
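
For illustration, the kind of promotion I mean looks like this (the message
wording here is just an example):

    /* before: a silent assumption, compiled out in production builds */
    Assert(GinPageIsList(page));

    /* after: a "can't happen" check that fails loudly even in production */
    if (!GinPageIsList(page))
        elog(ERROR, "unexpected GIN page flags 0x%04X in pending list",
             (unsigned int) GinPageGetOpaque(page)->flags);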

-- 
Peter Geoghegan



Connections hang indefinitely while taking a gin index's LWLock buffer_content lock

2018-10-29 Thread chenhj
Hi,all


In our PostgreSQL 10.2 database, two sessions (an insert and an autovacuum of a
gin index) blocked on LWLock:buffer_content.
This blocked checkpointing and dead tuple recycling, and we had to restart the
database.
According to the information collected from gcore, a deadlock occurred when 
acquiring the buffer lock.


This system is a citus 7.2.1 cluster. It has been running for more than a year,
and this problem has not happened before.
On 10/19 the cluster was expanded from 8 workers to 16 workers; I don't know
if it has anything to do with the expansion.
The problem happened on the coordinator node of citus, but the table is a
normal table (not a partitioned table).
And the load on this coordinator node has always been heavy.


After 10/25 this issue occurred three times in 4 days, but we cannot
reproduce the problem in a test environment.
The 3 failures are stuck on 3 different tables, but the stack is the same.


## stack of insert:Acquire lock 0x2aaab4009564 and hold 0x2aaab670dfe4, 
0x2aaac587ae64

(gdb) bt
#0  0x7fe11552379b in do_futex_wait.constprop.1 () from 
/lib64/libpthread.so.0
#1  0x7fe11552382f in __new_sem_wait_slow.constprop.0 () from 
/lib64/libpthread.so.0
#2  0x7fe1155238cb in sem_wait@@GLIBC_2.2.5 () from 
/lib64/libpthread.so.0
#3  0x0069d362 in PGSemaphoreLock (sema=0x2ac02958) at 
pg_sema.c:310
#4  0x007095ac in LWLockAcquire (lock=0x2aaab4009564, 
mode=LW_EXCLUSIVE) at lwlock.c:1233
#5  0x00490a32 in ginStepRight (buffer=6713826, index=, lockmode=lockmode@entry=2) at ginbtree.c:174
#6  0x00490c1c in ginFinishSplit (btree=btree@entry=0x7ffd81e4f950, 
stack=0x28eebf8, freestack=freestack@entry=1 '\001', 
buildStats=buildStats@entry=0x0) at ginbtree.c:701
#7  0x00491396 in ginInsertValue (btree=btree@entry=0x7ffd81e4f950, 
stack=, insertdata=insertdata@entry=0x7ffd81e4f940, 
buildStats=buildStats@entry=0x0)
at ginbtree.c:773
#8  0x0048fb42 in ginInsertItemPointers (index=, 
rootBlkno=rootBlkno@entry=644, items=items@entry=0x2916598, 
nitem=nitem@entry=353, buildStats=buildStats@entry=0x0)
at gindatapage.c:1907
#9  0x0048a9ea in ginEntryInsert 
(ginstate=ginstate@entry=0x28e7ef8, attnum=, key=42920440, 
category=, items=0x2916598, nitem=353, 
buildStats=buildStats@entry=0x0) at gininsert.c:214
#10 0x00496d94 in ginInsertCleanup 
(ginstate=ginstate@entry=0x28e7ef8, full_clean=full_clean@entry=0 '\000', 
fill_fsm=fill_fsm@entry=1 '\001', 
forceCleanup=forceCleanup@entry=0 '\000', stats=stats@entry=0x0) at 
ginfast.c:883
#11 0x00497727 in ginHeapTupleFastInsert 
(ginstate=ginstate@entry=0x28e7ef8, collector=collector@entry=0x7ffd81e4fe60) 
at ginfast.c:448
#12 0x0048b209 in gininsert (index=, 
values=0x7ffd81e4ff40, isnull=0x7ffd81e50040 "", ht_ctid=0x280d98c, 
heapRel=, checkUnique=, 
indexInfo=0x28b5aa8) at gininsert.c:522
#13 0x005ee8dd in ExecInsertIndexTuples (slot=slot@entry=0x28b5d58, 
tupleid=tupleid@entry=0x280d98c, estate=estate@entry=0x28b5288, 
noDupErr=noDupErr@entry=1 '\001', 
specConflict=specConflict@entry=0x7ffd81e5013b "", 
arbiterIndexes=arbiterIndexes@entry=0x28c6dd8) at execIndexing.c:386
#14 0x0060ccf5 in ExecInsert (canSetTag=1 '\001', estate=0x28b5288, 
onconflict=ONCONFLICT_UPDATE, arbiterIndexes=0x28c6dd8, planSlot=0x28b5d58, 
slot=0x28b5d58, mtstate=0x28b5628)
at nodeModifyTable.c:564
#15 ExecModifyTable (pstate=0x28b5628) at nodeModifyTable.c:1766
#16 0x005ef612 in standard_ExecutorRun (queryDesc=, 
direction=, count=, execute_once=)
at ../../../src/include/executor/executor.h:250
#17 0x7fe10607d15d in pgss_ExecutorRun (queryDesc=0x28cc1d8, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:889
#18 0x0071a7ba in ProcessQuery (plan=, 
sourceText=0x2849058 "INSERT INTO 
bi_dm.tdm_sncity_workorder_analytic_statistics 
(CITYCODE,CITYNAME,DISTRICT,DISTRICT_NAME,ZL_WERKS,ZL_WERKS_NAME,AREA_CD,AREA_NM,ZSIZE,DEPT_CD,TMALL_FLG,ORDER_DATE,DISTRIBUTE,WORKORDER,FNSH"...,
 params=0x28b2260, queryEnv=0x0, dest=0xc9e2a0 , 
completionTag=0x7ffd81e507b0 "") at pquery.c:161
#19 0x0071a9f7 in PortalRunMulti (portal=portal@entry=0x289fdd8, 
isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 
'\000', dest=0xc9e2a0 , 
dest@entry=0x284d778, altdest=0xc9e2a0 , 
altdest@entry=0x284d778, completionTag=completionTag@entry=0x7ffd81e507b0 "") 
at pquery.c:1286
#20 0x0071b535 in PortalRun (portal=, count=1, 
isTopLevel=, run_once=, dest=0x284d778, 
altdest=0x284d778, 
completionTag=0x7ffd81e507b0 "") at pquery.c:799
#21 0x00718f84 in PostgresMain (argc=, 
argv=, dbname=, username=) at 
postgres.c:1984
#22 0x0047ad3c in